Anastasi, Anne - Psychological Testing II

Principles of Psychological Testing
Validity: Measurement and Interpretation

found in texts on psychological statistics (e.g., Guilford and Fruchter, 1973). Essentially, such an equation is based on the correlation of each test with the criterion, as well as on the intercorrelations among the tests. Obviously, those tests that correlate higher with the criterion should receive more weight. It is equally important, however, to take into account the correlation of each test with the other tests in the battery. Tests correlating highly with each other represent needless duplication, since they cover to a large extent the same aspects of the criterion. The inclusion of two such tests will not appreciably increase the validity of the entire battery, even though both tests may correlate highly with the criterion. In such a case, one of the tests would serve about as effectively as the pair; only one would therefore be retained in the battery.

Even after the most serious instances of duplication have been eliminated, however, the tests remaining in the battery will correlate with each other to varying degrees. For maximum predictive value, tests that make a more nearly unique contribution to the total battery should receive greater weight than those that partly duplicate the functions of other tests. In the computation of a multiple regression equation, each test is weighted in direct proportion to its correlation with the criterion and in inverse proportion to its correlations with the other tests. Thus, the highest weight will be assigned to the test with the highest validity and the least amount of overlap with the rest of the battery.

The validity of the entire battery can be found by computing the multiple correlation (R) between the criterion and the battery. This correlation indicates the highest predictive value that can be obtained from the given battery, when each test is given optimum weight for predicting the criterion in question. The optimum weights are those determined by the regression equation.

It should be noted that these weights are optimum only for the particular sample in which they were found. Because of chance errors in the correlation coefficients used in deriving them, the regression weights may vary from sample to sample. Hence, the battery should be cross-validated by correlating the predicted criterion scores with the actual criterion scores in a new sample. Formulas are available for estimating the amount of shrinkage in a multiple correlation to be expected when the regression equation is applied to a second sample, but empirical verification is preferable whenever possible. The larger the sample on which regression weights were derived, the smaller the shrinkage will be.

Under certain conditions the predictive validity of a battery can be improved by including in the regression equation a test having a zero correlation with the criterion but a high correlation with another test in the battery. This curious situation arises when the test that is uncorrelated with the criterion acts as a suppressor variable to eliminate or suppress the irrelevant variance in the other test. For example, reading comprehension might correlate highly with scores on a mathematical or a mechanical aptitude test, because the test problems require the understanding of complicated written instructions. If reading comprehension is irrelevant to the job behavior to be predicted, the reading comprehension required by the tests introduces error variance and lowers the predictive validity of the tests. Administering a reading comprehension test and including scores on this test in the regression equation will eliminate this error variance and raise the validity of the battery. The suppressor variable appears in the regression equation with a negative weight. Thus, the higher an individual's score on reading comprehension, the more is deducted from his score on the mathematical or mechanical test.

The use of suppressor variables is illustrated by a study of 63 industrial mechanics (Sorenson, 1966). The most effective battery for predicting job performance in this group included: (1) a questionnaire covering education, previous mechanical experience, and other background data (criterion correlation = .30); (2) a mechanical insight test stressing practical mechanics of the "nuts-and-bolts" type (criterion correlation = .22); and (3) a test of mechanical comprehension oriented toward the academic understanding of mechanical principles (criterion correlation = -.04; correlation with test 2 = .71). The third test functioned as a suppressor variable, as can be seen from the following regression equation:

C' = 17T1 + 10T2 - 6T3 + 866

Without the suppressor variable the test battery would "overpredict" the job performance of individuals who obtain high scores on the practical mechanics test through an application of mechanical principles but who lack the practical mechanical know-how required on the job. The irrelevant contribution of academic knowledge of mechanical principles to scores on the practical mechanics tests was thus ruled out by the suppressor variable.

Attempts to employ suppressor variables to improve the validity of personality tests have proved disappointing (J. S. Wiggins, 1973). In any situation, moreover, the more direct procedure of revising a test to eliminate the irrelevant variance is preferable to the indirect statistical elimination of such variance through a suppressor variable. When changes in the test are not feasible, the investigation of suppressor variables should be considered.4

4 There has also been some exploration of the inclusion of continuous moderator variables in regression equations through nonadditive and higher-order functions; but the results have not been promising (Kirkpatrick et al., 1968; Saunders, 1956; J. S. Wiggins, 1973).
MULTIPLE CUTOFF SCORES. An alternative strategy for combining test scores utilizes multiple cutoff points. Briefly, this procedure involves the establishment of a minimum cutoff score on each test. Every individual who falls below such a minimum score on any one of the tests is rejected. Only those persons who reach or exceed the cutoff scores in all tests are accepted. An example of this technique is provided by the General Aptitude Test Battery (GATB) developed by the United States Employment Service for use in the occupational counseling program of its State Employment Service offices (U.S. Department of Labor, 1970a). Of the nine aptitude scores yielded by this battery, those to be considered for each occupation were chosen on the basis of criterion correlations as well as means and standard deviations of workers in that occupation.

TABLE 16
Illustrative Data Used to Establish Cutoff Scores on GATB

Aptitude                    Mean     SD     Criterion Correlation
General Learning Ability    75.1    14.2         -.094
Verbal                      80.1    11.3         -.085
Numerical                   73.2    18.4         -.064
Spatial                     78.9    15.9         -.041
Form Perception             80.1    23.5         -.012
Clerical Perception         86.3    16.6          .088
Motor Coordination          89.3    20.7          .316*
Finger Dexterity            92.4    18.1          .155
Manual Dexterity            88.2    18.6          .437**

* Significant at .05 level.
** Significant at .01 level.

The development of GATB occupational standards for machine cutters in the food-canning and preserving industry is illustrated in Table 16. In terms of standard scores with a mean of 100 and an SD of 20, the cutoff scores for this occupation were set at 75 in Motor Coordination (K), Finger Dexterity (F), and Manual Dexterity (M) (U.S. Department of Labor, 1970a, Section IV, p. 51). Table 16 gives mean, standard deviation, and correlation with the criterion (supervisory ratings) for each of the nine scores in a group of 57 women workers. On the basis of criterion correlations, Manual Dexterity and Motor Coordination appeared promising. Finger Dexterity was added because it yielded the highest mean score in the battery, even though individual differences within the group were not significantly correlated with criterion ratings. It would seem that women who enter or remain in this type of job are already selected with regard to Finger Dexterity.5

5 The data have been somewhat simplified for illustrative purposes. Actually, the final choice of aptitudes and cutting scores was based on separate analyses of three groups of workers on related jobs, on the results obtained in a combined sample of 194 cases, and on qualitative job analyses of the operations involved.

The validity of the composite KFM pattern of cutting scores in a group of 194 workers is shown in Table 17. It will be seen that, of 150 good workers, 120 fell above the cutting scores in the three aptitudes and 30 were false rejects, falling below one or more cutoffs. Of the 44 poor workers, 30 were correctly identified and 14 were false acceptances. The overall efficacy of this cutoff pattern is indicated by a tetrachoric correlation of .70 between predicted status and criterion ratings.

TABLE 17
Effectiveness of GATB Cutoff Scores on Aptitudes K, F, and M in Identifying Good and Poor Workers
(From U.S. Department of Labor, 1958, p. 14)

Criterion Rating    Nonqualifying Test Scores    Qualifying Test Scores    Total
Good                           30                         120               150
Poor                           30                          14                44
Total                          60                         134               194

If only scores yielding significant validity coefficients are taken into account, one or more essential abilities in which all workers in the occupation excel might be overlooked. Hence the need for considering also those aptitudes in which workers excel as a group, even when individual differences beyond a certain minimum are unrelated to degree of job success. The multiple cutoff method is preferable to the regression equation in situations such as these, in which test scores are not linearly related to the criterion. In some jobs, moreover, workers may be so homogeneous in a key trait that the range of individual differences is too narrow to yield a significant correlation between test scores and criterion.

The strongest argument for the use of multiple cutoffs rather than a regression equation centers around the question of compensatory qualifications. With the regression equation, an individual who rates low in one test may receive an acceptable total score because he rates very high in
some other test in the battery. A marked deficiency in one skill may thus be compensated for by outstanding ability along other lines. It is possible, however, that certain types of activity may require essential skills for which there is no substitute. In such cases, individuals falling below the required minimum in the essential skill will fail, regardless of their other abilities. An opera singer, for example, cannot afford to have poor pitch discrimination, regardless of how well he meets the other requirements of such a career. Similarly, operators of sound-detection devices in submarines need good auditory discrimination. Those men incapable of making the necessary discriminations cannot succeed in such an assignment, regardless of superior mechanical aptitude, general intelligence, or other traits in which they may excel. With a multiple cutoff strategy, individuals lacking any essential skill would always be rejected, while with a regression equation they might be accepted.

When the relation between tests and criterion is linear and additive, on the other hand, a higher proportion of correct decisions will be reached with a regression equation than with multiple cutoffs. Another important advantage of the regression equation is that it provides an estimate of each person's criterion score, thereby permitting the relative evaluation of all individuals. With multiple cutoffs, no further differentiation is possible among those accepted or among those rejected. In many situations, the best strategy may involve a combination of both procedures. Thus, the multiple cutoff may be applied first, in order to reject those falling below minimum standards on any test, and predicted criterion scores may then be computed for the remaining acceptable cases by the use of a regression equation. If enough is known about the particular job requirements, the preliminary screening may be done in terms of only one or two essential skills, prior to the application of the regression equation.

THE NATURE OF CLASSIFICATION. Psychological tests may be used for purposes of selection, placement, or classification. In selection, each individual is either accepted or rejected. Deciding whether or not to admit a student to college, to hire a job applicant, or to accept an army recruit for officer training are examples of selection decisions. When selection is done sequentially, the earlier stages are often called "screening," the term "selection" being reserved for the more intensive final stages. "Screening" may also be used to designate any rapid, rough selection process even when not followed by further selection procedures.

Both placement and classification differ from selection in that no one is rejected, or eliminated from the program. All individuals are assigned to appropriate "treatments" so as to maximize the effectiveness of outcomes. In placement, the assignments are based on a single score. This score may be derived from a single test, such as a mathematics achievement test. If a battery of tests has been administered, a composite score computed from a single regression equation would be employed. Examples of placement decisions include the sectioning of college freshmen into different mathematics classes on the basis of their achievement test scores, assigning applicants to clerical jobs requiring different levels of skill and responsibility, and placing psychiatric patients into "more disturbed" and "less disturbed" wards. It is evident that in each of these decisions only one criterion is employed and that placement is determined by the individual's position along a single predictor scale.

Classification, on the other hand, always involves two or more criteria. In a military situation, for example, classification is a major problem, since each man in an available manpower pool must be assigned to the military specialty where he can serve most effectively. Classification decisions are likewise required in industry, when new employees are assigned to training programs for different kinds of jobs. Other examples include the counseling of students regarding choice of college curriculum (science, liberal arts, etc.), as well as field of concentration. Counseling is based essentially on classification, since the client is told his chances of succeeding in different academic programs or occupations. Clinical diagnosis is likewise a classification problem, the major purpose of each diagnosis being a decision regarding the most appropriate type of therapy.

Although placement can be done with either one or more predictors, classification requires multiple predictors whose validity is individually determined against each criterion. A classification battery requires a different regression equation for each criterion. Some of the tests may have weights in all the equations, although of different values; others may be included in only one or two equations, having zero or negligible weights for some of the criteria. Thus, the combination of tests employed out of the total battery, as well as the specific weights, differs with the particular criterion. An example of such a classification battery is that developed by the Air Force for assignment of personnel to different training programs (DuBois, 1947). This battery, consisting of both paper-and-pencil and apparatus tests, provided stanine scores for pilots, navigators, bombardiers, and a few other air-crew specialties. By finding an individual's estimated criterion scores from the different regression equations, it was possible to predict whether, for example, he was better qualified as a pilot than as a navigator.

MAXIMIZING THE UTILIZATION OF TALENT. Differential prediction of criteria with a battery of tests permits a fuller utilization of available human resources than is possible with a single general test or with a composite
the selected employees decreases; but it remains better than chance even when the correlation is .80. With lower selection ratios we can of course obtain better qualified personnel. As can be seen in Table 18, however, for each selection ratio, mean job performance is better when applicants are chosen through classification than through selection strategies.
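
The single-predictor (selection) column of Table 18 can be approximated from normal-curve theory. The following Python sketch is illustrative only and rests on two assumptions that are not stated in the passage above: a predictor validity of .50, and a single predictor used to fill both jobs, so that the proportion of applicants hired is twice the selection ratio for each job. These values were chosen because they reproduce the tabled figures to within about .01. The function and variable names are hypothetical.

```python
from statistics import NormalDist

ASSUMED_VALIDITY = 0.50  # assumption: not given in the passage

def mean_criterion_of_selected(validity, proportion_selected):
    """Expected standard criterion score of those selected from the top
    `proportion_selected` of a normally distributed predictor."""
    if proportion_selected >= 1.0:
        return 0.0  # everyone is accepted, so the mean equals the group mean
    standard_normal = NormalDist()
    cutoff_z = standard_normal.inv_cdf(1.0 - proportion_selected)
    # Mean predictor z score of those above the cutoff.
    mean_predictor_z = standard_normal.pdf(cutoff_z) / proportion_selected
    # The regression of criterion on predictor scales this by the validity.
    return validity * mean_predictor_z

# Two jobs filled from one predictor: the proportion hired is 2 x the ratio per job.
selection_column = {
    ratio: mean_criterion_of_selected(ASSUMED_VALIDITY, 2 * ratio)
    for ratio in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50)
}

for ratio, mean_score in selection_column.items():
    print(f"selection ratio {ratio:.0%} per job: {mean_score:.2f}")
```

The classification columns depend on the joint distribution of two predictors and are not reproduced by this sketch.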
A practical illustration of the advantages of classification strategies is provided by the use of Aptitude Area scores in the assignment of personnel to military occupational specialties in the U.S. Army (Maier & Fuchs, 1972). Each Aptitude Area corresponds to a group of Army jobs requiring a similar pattern of aptitudes, knowledge, and interests. From a 13-test classification battery, combinations of three to five tests are used to find the individual's score in each Aptitude Area. Figure 20 shows the results of an investigation of 7,500 applicants for enlistment in which the use of Aptitude Area scores was compared with the use of a global screening test, the Armed Forces Qualification Test (AFQT). It will be noted that only 56 percent of this group reached or exceeded the 50th percentile on the AFQT, while 80 percent reached or exceeded the average standard score of 100 on their best Aptitude Area. Thus, when individuals are allocated to specific jobs on the basis of the aptitudes required for each job, a very large majority are able to perform as well as the average of the entire sample or better. This apparent impossibility, in which nearly everyone could be above average, can be attained by capitalizing on the fact that nearly everyone excels in some aptitude.
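
The arithmetic behind this apparent impossibility can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that the aptitudes are statistically independent and symmetrically distributed, so that half the group falls above the mean on each one; the function name is hypothetical. Actual aptitude scores are positively correlated, which makes the real share smaller than these idealized figures.

```python
def share_above_average_on_best(n_aptitudes):
    """Share of a group scoring above the group mean on at least one of
    `n_aptitudes` independent, symmetrically distributed aptitudes."""
    # A person misses on all aptitudes with probability 0.5 ** n_aptitudes.
    return 1.0 - 0.5 ** n_aptitudes

for n_aptitudes in (1, 2, 3, 5):
    share = share_above_average_on_best(n_aptitudes)
    print(f"{n_aptitudes} aptitude(s): {share:.1%} above average on the best one")
```

With a single measure only half the group can exceed the mean, but the share rises quickly as further aptitudes are considered, which is the point made by the Aptitude Area comparison.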
The same point is illustrated with a quite different population in a study of gifted children by Feldman and Bratton (1972). For demonstration purposes, 49 children in two fifth-grade classes were evaluated on each of 19 measures, all of which had previously been used to select students for special programs for the gifted. Among these measures were scores on a group intelligence test and on an educational achievement battery, tests of separate aptitudes and separate academic areas such as reading and arithmetic, a test of creative thinking, grades in music and art, and teachers' nominations of the most "gifted" and the most "creative" children in each class. When the five highest ranking children selected by each criterion were identified, they included 92 percent of the group. Thus, it was again shown that nearly all members of a group will excel when multivariate criteria are employed.

TABLE 18
Mean Standard Criterion Score of Persons Placed on Two Jobs by Selection or Classification Strategies
(Adapted from Brogden, 1951, p. 182)

Selection Ratio    Selection:           Classification: Two predictors whose intercorrelation is
for Each Job       Single Predictor        0      .20     .40     .60     .80
  5%                    .88              1.03    1.02    1.01    1.00     .96
 10                     .70               .87     .86     .84     .82     .79
 20                     .48               .68     .67     .65     .62     .59
 30                     .32               .55     .53     .50     .46     .43
 40                     .18               .42     .41     .37     .34     .29
 50                     .00               .31     .28     .25     .22     .17

FIG. 20. Percentages Scoring Above Average on AFQT and on Best Aptitude Area of Army Classification Battery in a Sample of 7,500 Applicants for Enlistment. The figure contrasts the 56 percent at the 50th percentile or higher on the AFQT with the 80 percent at a standard score of 100 or higher in their best Aptitude Area (20 percent below).
(Data from U.S. Army Research Institute for the Behavioral and Social Sciences, courtesy J. E. Uhlaner, 1974.)

DIFFERENTIAL VALIDITY. In the evaluation of a classification battery, the major consideration is its differential validity against the separate criteria. The object of such a battery is to predict the difference in each person's performance in two or more jobs, training programs, or other criterion situations. Tests chosen for such a battery should yield very different validity coefficients for the separate criteria. In a two-criterion classification problem, for example, the ideal test would have a high correlation with one criterion and a zero correlation (or preferably a negative correlation) with the other criterion. General intelligence tests are relatively poor for classification purposes, since they predict success about equally well in most areas. Hence, their correlations with the criteria to be differentiated would be too similar. An individual scoring high on such a test would be classified as successful for either assignment, and it would be impossible to predict in which he would do better. In a classification battery, we need some tests that are good predictors of criterion A and poor predictors of criterion B, and other tests that are poor predictors of A and good predictors of B.

Statistical procedures have been developed for selecting tests so as to maximize the differential validity of a classification battery (Brogden, 1951; Horst, 1954; Mollenkopf, 1950b; Thorndike, 1949). When the number of criteria is greater than two, however, the problem becomes very complex, and no completely analytical solutions are yet available for these situations. In practice, various empirical approaches are followed to approximate the desired goals.

MULTIPLE DISCRIMINANT FUNCTIONS. An alternative way of handling classification decisions is by means of the multiple discriminant function (French, 1966). Essentially, this is a mathematical procedure for determining how closely the individual's scores on a whole set of tests approximate the scores typical of persons in a given occupation, curriculum, psychiatric syndrome, or other category. A person would then be assigned to the particular group he resembles most closely. Although the regression equation permits the prediction of degree of success in each field, the multiple discriminant function treats all persons in one category as of equal status. Group membership is the only criterion data utilized by this method. The discriminant function is useful when criterion scores are unavailable and only group membership can be ascertained. Some tests, for instance, are validated by administering them to persons in different occupations, although no measure of degree of vocational success is available for individuals within each field.

The discriminant function is also appropriate when there is a nonlinear relation between the criterion and one or more predictors. For example, in certain personality traits there may be an optimum range for a given occupation. Individuals having either more or less of the trait in question would thus be at a disadvantage. It seems reasonable to expect, for instance, that salesmen showing a moderately high amount of social dominance would be most likely to succeed, and that the chances of success would decline as scores move in either direction from this region. With the discriminant function, we would tend to select individuals falling within this optimum range. With the regression equation, on the other hand, the more dominant the score, the more favorable would be the predicted outcome. If the correlation between predictor and criterion were negative, of course, the regression equation would yield more favorable predictions for the low scorers. But there is no direct way whereby an intermediate score would receive maximum credit. Although in many instances the two techniques would lead to the same choices, there are situations in which persons would be differently classified by regression equations and discriminant functions. For most psychological testing purposes, regression equations provide a more effective technique. Under certain circumstances, however, the discriminant function is better suited to yield the required information.

THE PROBLEM. If we want to use tests to predict outcomes in some future situation, such as an applicant's performance in college or on a job, we need tests with high predictive validity against the specific criterion. This requirement is commonly overlooked in the development of so-called culture-fair tests (to be discussed further in a later chapter). In the effort to include in such tests only functions common to different cultures or subcultures, we may choose content that has little relevance to any criterion we wish to predict. A better solution is to choose criterion-relevant content and then investigate the possible effect of moderator variables on test scores. Validity coefficients, regression weights, and cutoff scores may
vary as a function of differences in the examinees' experiential backgrounds. These values should therefore be checked within subgroups for whom there is reason to expect such effects.

It should be noted, however, that the predictive characteristics of test scores are less likely to vary among cultural groups when the test is intrinsically relevant to criterion performance. If a verbal test is employed to predict nonverbal job performance, a fortuitous validity may be found in one cultural group because of traditional associations of past experiences within that culture. In a group with a different experiential background, however, the validity of the test may disappear. On the other hand, a test that directly samples criterion behavior, or one that measures essential prerequisite skills, is likely to retain its validity in different groups.

Since the mid-1960s, there has been a rapid accumulation of research on possible ethnic differences in the predictive meaning of test scores. In this connection, the EEOC Guidelines (Appendix B) explicitly state: "Data must be generated and results separately reported for minority and nonminority groups wherever technically feasible." The functions and implications of separate validation studies are also discussed in the reports of the APA task forces on the testing of minority groups for both educational and employment purposes (American Psychological Association, 1968; Cleary, Humphreys, Kendrick, & Wesman, 1975). The large majority of studies conducted thus far have dealt with black Americans, although a few have included other ethnic minorities. The problems investigated are generally subsumed under the heading of test bias. In this context, the term "bias" is employed in its well-established statistical sense, to designate constant or systematic error as opposed to chance error. This is the same sense in which we speak of a biased sample, in contrast to a random sample. The principal questions that have been raised regarding test bias pertain to validity coefficients (slope bias) and to the relationship between group means on the test and on the criterion (intercept bias). These questions will be examined in the next two sections.

SLOPE BIAS. To facilitate an understanding of the technical aspects of test bias, let us begin with a scatter diagram, or bivariate distribution, such as those illustrated in Chapter 5 (Figs. 8, 9, 10; see especially Fig. 10, p. 110). For the present purpose, the horizontal axis (X) represents scores on a test and the vertical axis (Y) represents criterion scores, such as college grade-point average or an index of job performance. It will be recalled that the tally marks, showing the position of each individual on both test and criterion, indicate the direction and general magnitude of the correlation between the two variables. The line of best fit drawn through these tally marks is known as the regression line, and its equation is the regression equation. In this example, the regression equation would have only one predictor. The multiple regression equations discussed earlier in this chapter have several predictors, but the principle is the same.

FIG. 21. Slope and Intercept Bias in Predicting Criterion Scores. The ellipses show the regions within which members of each group fall when their test scores are plotted against their criterion performance.
(Cases 1, 2, and 3 adapted from M. A. Gordon, 1953, p. 3.)

When both test and criterion scores are expressed as standard scores (SD = 1.00), the slope of the regression line equals the correlation coefficient. For this reason, if a test yields a significantly different validity coefficient in the two groups, this difference is described as slope bias. Figure 21 provides schematic illustrations of regression lines for several
bivariate distributions. The ellipses represent the region within which the tally marks of each sample would fall. Case 1 shows the bivariate distribution of two groups with different means in the predictor, but with identical regression lines between predictor and criterion. In this case, there is no test bias, since any given test score (X) corresponds to the identical criterion score (Y) in both groups. Case 2 illustrates slope bias, with a lower validity coefficient in the minority group.

In comparative validation studies, certain methodological precautions must be observed. For example, the use of ratings as a criterion in this type of study may yield results that are out of line with those obtained with more objective criteria (Bass & Turner, 1973; Campbell, Crooks, Mahoney, & Rock, 1973; Kirkpatrick, Ewen, Barrett, & Katzell, 1968). A second example involves the comparison of ethnic samples from different institutions. In such cases, ethnic and institutional factors are likely to be confounded in the results (Kirkpatrick et al., 1968).

A common difficulty arises from the fact that in several studies the number of cases in the minority sample was much smaller than in the majority sample. Under these conditions, the same validity coefficient could be statistically significant in the majority sample and not significant in the minority sample. With 100 cases, for example, a correlation of .27 is clearly significant at the .01 level; with 30 cases, the same correlation falls far short of the minimum value required for significance even at the .05 level. For this reason, the proper procedure to follow in such differential validation studies is to evaluate the difference between the two validity coefficients, rather than testing each separately for significance (Humphreys, 1973; Standards, 1974: E9). By the latter procedure, one could easily "demonstrate" that a test is valid for, let us say, whites and not valid for blacks. All that would be needed for this purpose is a large enough group of whites and a small enough group of blacks! Cross-validation of results in a second pair of independent samples is also desirable, to check whether the group yielding higher validity in the first study still does so in the second.

A sophisticated statistical analysis of the results of 19 published studies reporting validity coefficients for black and white employment samples casts serious doubt on the conclusions reached in some of the earlier studies (Schmidt, Berner, & Hunter, 1973). Taking into account the obtained validities and the size of samples in each study, the investigators demonstrated that the discrepancies in validity coefficients found between ...

... groups in the chapter on Personnel Psychology for the Annual Review of Psychology, Bray and Moses (1972, p. 554) concluded: "It does appear, however, that the closer the study comes to the ideal, the less likelihood there is of finding differential validity."

Similar results have been obtained in investigations of black and white college students. Validity coefficients of the College Board Scholastic Aptitude Test for black students were generally as high as those obtained for white students, or higher. These relationships were found when the black and white samples were attending the same colleges, as well as when they were attending separate colleges (Cleary, 1968; Kendrick & Thomas, 1970; Stanley & Porter, 1967). Working at a very different level, Mitchell (1967) studied the validities of two educational readiness tests against end-of-year achievement test scores of first-grade schoolchildren. In the large samples of black and white children tested, validities of total scores and of subtests were very similar for the two ethnic groups, although tending to run somewhat higher for the blacks. These findings were consistently corroborated in later studies of black and white children, as well as in comparisons of children classified by father's occupational level and by indices of educational and socioeconomic level of the community.6

INTERCEPT BIAS. Even when a test yields the same validity coefficients for two groups, it may show intercept bias. The intercept of a regression line refers to the point at which the line intersects the axis. A test exhibits intercept bias if it systematically underpredicts or overpredicts criterion performance for a particular group. Let us look again at Case 1 in Figure 21, in which majority and minority samples show identical regressions. Under these conditions, there is neither slope nor intercept bias. Although the groups differ significantly in mean test scores, they show a corresponding difference in criterion performance. In Case 3, on the other hand, the two groups have regression lines with the same slope but different intercepts. In this case, the majority group (B) has a higher intercept than the minority group (A), that is, the majority regression line intersects the Y axis at a higher point than does the minority regression line. Although the validity coefficients computed within each group are equal, any test score (X) will correspond to different criterion scores in the two groups, as shown by points YA and YB. The same test score thus has a different predictive meaning for the two groups. In this situation, the
blacks and whites did not differ from chance expectancy. The same con-
maJonty mean again exceeds the minority mean in bo.th test and criterion
clusion was reached when only studies including approximately equal
as it did in Case 1. Because of the intercept differe~ce, however, the us~
numbers of blacks and whites were surveyed (O'Connor, \Vexley, &
Alexander, 1975). \VeIl-designed recent studies corroborate these findings
6 Summaries of these studies can be obtained from the test publisher Harcourt
for industrial samples (Campbell et aI., 1973) and army personnel
Brace JovanOVich, Inc. '
(Maier & Fuchs, 1973). In their review of validity studies with minority
of the majority regression line for both groups would overpredict the
criterion performance of minority group members. If a single cutoff
score (X) were applied to both groups, it would discriminate in favor of
the minority group. Intercept bias discriminates against the group with
the higher intercept.

Psychologists who are concerned about the possible unfairness of tests
for minority group members visualize a situation illustrated in Case 4.
Note that, in this case, the majority excels on the test, but majority and
minority perform equally well on the criterion. The minority group now
has the higher intercept. Selecting all applicants in terms of a test cutoff
established for the majority group would thus discriminate unfairly
against the minority. Under these conditions, use of the majority
regression line for both groups underpredicts the criterion performance of
minority group members. This situation is likely to occur when a large
proportion of test variance is irrelevant to criterion performance and
measures functions in which the majority excels the minority. A thorough
job analysis and satisfactory test validity provide safeguards against the
choice of such a test.

Reilly (1973) has demonstrated mathematically that Case 3 will occur
if the two groups differ in a third variable (e.g., sociocultural
background) which correlates positively with both test and criterion. Under
these conditions, the test overpredicts the performance of minority group
members and the use of the same cutoff scores for both groups favors the
minority. The findings of empirical studies do in fact support this
expectation. Using principally a statistical procedure developed by
Gulliksen and Wilks (1950), investigators have checked intercept bias in the
prediction of college grades (Cleary, 1968; Temp, 1971), law school
grades (Linn, 1975), performance in army and air force training
programs (M. A. Gordon, 1953; Maier & Fuchs, 1973; Shore & Marion,
1972), and a wide variety of industrial criteria (Campbell et al., 1973;
Grant & Bray, 1970; Ruch, 1972).

It is interesting to note that the same results have been obtained when
comparisons were made between groups classified according to
educational or socioeconomic level. The Army Classification Battery tended to
overpredict the performance of high school dropouts and underpredict
the performance of college graduates in training programs for military
occupational specialties (Maier, 1972). Similarly, the college grades of
students whose fathers were in the professions were underpredicted from
various academic aptitude tests, while the grades of students from lower
occupational levels tended to be overpredicted (Hewer, 1965). In all
these studies, comparisons of higher-scoring and lower-scoring groups
revealed either no significant difference in intercepts or a slight bias in
favor of the groups scoring lower on the tests.

The problem of test bias is complex, more so than appears in this
simplified exposition. This is an area in which the statistically
unsophisticated must step warily. Some investigators have proposed alternative
definitions of test bias, not in terms of predicted criterion scores, but in
terms of percentage of minority and majority group members who exceed
cutoff points on test and criterion (Cole, 1972; Linn, 1973, 1975; Schmidt
& Hunter, 1974; Thorndike, 1971). In a comparative evaluation of several
alternative models of test bias, however, other psychometricians (Gross &
Su, 1975; Petersen, 1974; Petersen & Novick, 1976) questioned the
conceptual and methodological soundness of these procedures and
formulated a comprehensive mathematical model for culture-fair personnel
selection. Based on decision theory, this model combines data on
probability of different outcomes with judgments of the relative utility of each
outcome, such as accepting an unsuccessful applicant or rejecting a
potential success. It is argued that there is no best single model for fair
selection. The proposed model provides a means of formulating, within
each context, a decision strategy that maximizes the expected overall
utility in accordance with specified testing goals and judged utilities of
different outcomes.

Research on test bias is still being actively pursued, with regard to both
statistical theory and empirical investigations. In our present state of
knowledge, there is insufficient basis for the use of different cutoff scores
for different subgroups of the population. Statistical adjustments in test
scores, cutoffs, and prediction formulas hold little promise as a means of
correcting social inequities. More constructive solutions are suggested by
other approaches discussed earlier in this chapter. One is illustrated by
multiple aptitude testing and classification strategies, which permit the
fullest utilization of the diverse aptitude patterns fostered by different
cultural backgrounds. Another approach is through adaptive treatments,
such as individualized training programs. In order to maximize the fit of
such programs to individual characteristics, it is essential that tests reveal
as fully and accurately as possible the person's present level of
development in the requisite abilities.
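The sample-size point made earlier in this discussion (a correlation of .27
being significant with 100 cases but not with 30) and the recommended
alternative of testing the difference between two validity coefficients can
be checked numerically. The following is a minimal sketch, not the
procedure used in the cited studies: it assumes Fisher's z transformation
with a normal approximation in place of tabled critical values of r, and the
function names are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def r_p_value(r: float, n: int) -> float:
    """Two-tailed p for H0: population correlation = 0 (Fisher z approximation)."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def r_difference_p_value(r1: float, n1: int, r2: float, n2: int) -> float:
    """Two-tailed p for H0: the two population correlations are equal.

    Assumes the two samples are independent.
    """
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (atanh(r1) - atanh(r2)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same validity coefficient, different sample sizes (values from the text).
p_large = r_p_value(0.27, 100)   # majority-sized sample
p_small = r_p_value(0.27, 30)    # minority-sized sample

# Testing each coefficient separately suggests "valid" vs. "not valid" ...
print(f"n=100: p = {p_large:.4f}   n=30: p = {p_small:.4f}")

# ... but the difference between the two coefficients is nil.
p_diff = r_difference_p_value(0.27, 100, 0.27, 30)
print(f"difference between coefficients: p = {p_diff:.4f}")
```

Testing the difference directly avoids the artifact described above, in
which unequal sample sizes alone make one coefficient "significant" and
the other not.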
Item Analysis

FAMILIARITY with the basic concepts and techniques of item analysis,
like knowledge about other phases of test construction, can help
the test user in his evaluation of published tests. In addition, item
analysis is particularly relevant to the construction of informal, local tests,
such as the quizzes and examinations prepared by teachers for classroom
use. Some of the general guidelines for effective item writing, as well as
the simpler statistical techniques of item analysis, can materially improve
classroom tests and are worth using even with small groups.

Items can be analyzed qualitatively, in terms of their content and form,
and quantitatively, in terms of their statistical properties. Qualitative
analysis includes the consideration of content validity, discussed in
Chapter 6, and the evaluation of items in terms of effective item-writing
procedures, to be discussed in Chapter 14. Quantitative analysis includes
principally the measurement of item difficulty and item validity. Both
the validity and the reliability of any test depend ultimately on the
characteristics of its items. High reliability and validity can be built into a
test in advance through item analysis. Tests can be improved through the
selection, substitution, or revision of items.

Item analysis makes it possible to shorten a test, and at the same time
increase its validity and reliability. Other things being equal, a longer test
is more valid and reliable than a shorter one. The effect of lengthening or
shortening a test on the reliability coefficient was discussed in Chapter 5,
where the Spearman-Brown formula for estimating this effect was also
presented. These estimated changes in reliability occur when the
discarded items are equivalent to those that remain, or when equivalent
new items are added to the test. Similar changes in validity will result
from the deletion or addition of items of equivalent validity. All such
estimates of change in reliability or validity refer to the lengthening or
shortening of tests through a random selection of items, without item
analysis. When a test is shortened by eliminating the least satisfactory
items, however, the short test may be more valid and reliable than the
original, longer instrument.

PERCENTAGE PASSING. For most testing purposes, the difficulty of an item
is defined in terms of the percentage of persons who answer it correctly.
The easier the item, the larger will this percentage be. A word that is
correctly defined by 70 percent of the standardization sample (p = .70) is
regarded as easier than one that is correctly defined by only 15 percent
(p = .15). It is customary to arrange items in order of difficulty, so that
examinees begin with relatively easy items and proceed to items of
increasing difficulty. This arrangement gives the individual confidence in
approaching the test and also reduces the likelihood of his wasting
time on items beyond his ability to the neglect of easier items he can
correctly complete.

In the process of test construction, a major reason for measuring item
difficulty is to choose items of suitable difficulty level. Most standardized
ability tests are designed to assess as accurately as possible each
individual's level of attainment in the particular ability. For this purpose, if no
one passes an item, it is excess baggage in the test. The same is true of
items that everyone passes. Neither of these types of items provides any
information about individual differences. Since such items do not affect
the variability of test scores, they contribute nothing to the reliability or
validity of the test. The closer the difficulty of an item approaches 1.00
or 0, the less differential information about examinees it contributes.
Conversely, the closer the difficulty level approaches .50, the more
differentiations the item can make. Suppose out of 100 persons, 50 pass an
item and 50 fail it (p = .50). This item enables us to differentiate
between each of those who passed it and each of those who failed it. We
thus have 50 X 50 or 2500 paired comparisons, or bits of differential
information. An item passed by 70 percent of the persons provides 70 X
30 or 2100 bits of information; one passed by 90 percent provides 90 X 10
or 900; one passed by 100 percent provides 100 X 0 or 0. The same
relationships would hold for harder items, passed by fewer than 50 percent.

For maximum differentiation, then, it would seem that one should
choose all items at the .50 difficulty level. The decision is complicated,
however, by the fact that items within a test tend to be intercorrelated.
The more homogeneous the test, the higher will these intercorrelations
be. In an extreme case, if all items were perfectly intercorrelated and all
were of .50 difficulty level, the same 50 persons out of 100 would pass
each item. Consequently, half of the examinees would obtain perfect
scores and the other half zero scores. Because of item intercorrelations, it
is best to select items with a moderate spread of difficulty level, but
whose average difficulty is .50.
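The paired-comparison arithmetic for a group of 100 persons (50 X 50,
70 X 30, 90 X 10, 100 X 0) can be reproduced with a few lines of code. This
is only an illustrative sketch; the function name is hypothetical, and counts
of persons are used instead of proportions so that the arithmetic stays in
whole numbers.

```python
def paired_comparisons(n_pass: int, n_total: int) -> int:
    """Number of passer/failer pairs that a single item can tell apart."""
    n_fail = n_total - n_pass
    return n_pass * n_fail


# Group of 100 persons, as in the text; 30 and 10 illustrate harder items.
for n_pass in (50, 70, 90, 100, 30, 10):
    pairs = paired_comparisons(n_pass, 100)
    print(f"passed by {n_pass:3d} of 100: {pairs} paired comparisons")
```

The count is symmetric around the .50 difficulty level: an item passed by
30 persons distinguishes exactly as many pairs as one passed by 70.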
INTERVAL SCALES. The percentage of persons passing an item expresses
item difficulty in terms of an ordinal scale; that is, it correctly indicates
the rank order or relative difficulty of items. For example, if Items 1, 2,
and 3 are passed by 30, 20, and 10 percent of the cases, respectively, we
can conclude that Item 1 is the easiest and Item 3 is the hardest of the
three. But we cannot infer that the difference in difficulty between Items
1 and 2 is equal to that between Items 2 and 3. Equal percentage
differences would correspond to equal differences in difficulty only in a
rectangular distribution, in which the cases were uniformly distributed
throughout the range. This problem is similar to that encountered in
connection with percentile scores, which are also based on percentages of
cases. It will be recalled from Chapter 4 that percentile scores do not
represent equal units, but differ in magnitude from the center to the
extremes of the distribution (Fig. 4, Ch. 4).

If we assume a normal distribution of the trait measured by any given
item, the difficulty level of the item can be expressed in terms of an
equal-unit interval scale by reference to a table of normal curve frequencies. In
Chapter 4 we saw, for example, that approximately 34 percent of the
cases in a normal distribution fall between the mean and a distance of 1σ
in either direction (Fig. 3, Ch. 4). With this information, we can examine
Figure 22, which shows the difficulty level of an item passed by 84
percent of the cases. Since it is the persons in the upper part of the
distribution who pass and those in the lower part who fail, this 84 percent
includes the upper half (50%) plus 34 percent of the cases from the lower
half (50 + 34 = 84). Hence, the item falls 1σ below the mean, as shown
in Figure 22. An item passed by 16 percent of the cases would fall 1σ
above the mean, since above this point there are 16 percent of the cases
(50 - 34 = 16). An item passed by exactly 50 percent of the cases falls
at the mean and would thus have a 0 value on this scale. The more
difficult items have plus values, the easier items minus values. The difficulty
value corresponding to any percentage passing can be found by
reference to a normal curve frequency table, given in any standard statistics
text.

Because item difficulties expressed in terms of normal curve σ-units
involve negative values and decimals, they are usually converted into a
more manageable scale. One such scale, employed by Educational
Testing Service in its test development, uses a unit designated by the Greek
letter delta (Δ). The relation between Δ and normal curve σ-values (x)
is shown below:

    Δ = 13 + 4x

The constants 13 and 4 were chosen arbitrarily in order to provide a scale
that eliminates negative values and yields a range of integers wide
enough to permit the dropping of decimals. An item passed by nearly
100 percent of the cases (99.87%), falling at -3σ, would have a Δ of:
13 + (4)(-3) = 1. This is the lowest value likely to be found in most
groups. At the other extreme, an item passed by less than 1 percent
(0.13%) of the cases will have a value of +3σ and a Δ of: 13 +
(4)(3) = 25. An item falling at the mean will have a 0 σ-value and a Δ
of: 13 + (4)(0) = 13. The Δ scale is thus a scale in which practically all
items will fall between 1 and 25, and the mean difficulty value within
any given group corresponds to 13.

An important practical advantage of the Δ scale over other possible
conversions is that a table is available (Fan, 1952) from which Δ can be
found by simply entering the value of p (proportion of persons passing
the item). The table eliminates the necessity of looking up normal curve
σ-values and transforming these values to Δ's. For most practical
purposes, an ordinal measure of item difficulty, such as percentage passing, is
adequate. For more precise statistical analyses, requiring the
measurement of difficulty on an interval scale, Δ values can be obtained with little
additional effort.
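The conversion from proportion passing to σ-units and then to Δ can also
be computed directly instead of being read from a table. The sketch below
is an illustration only, not the Fan (1952) table: it assumes a normal
distribution of the trait, uses the inverse normal function from Python's
standard library, and the function names are hypothetical.

```python
from statistics import NormalDist


def sigma_difficulty(p: float) -> float:
    """Item difficulty in normal-curve sigma units for proportion passing p.

    Harder items (small p) get positive values, easier items negative ones.
    p must lie strictly between 0 and 1.
    """
    return NormalDist().inv_cdf(1.0 - p)


def delta(p: float) -> float:
    """Delta difficulty value, using the relation Delta = 13 + 4x."""
    return 13 + 4 * sigma_difficulty(p)


for p in (0.9987, 0.84, 0.50, 0.16, 0.0013):
    print(f"p = {p:.4f}  x = {sigma_difficulty(p):+.2f}  delta = {delta(p):.1f}")
```

Rounded to whole numbers, the results agree with the worked values in the
text: about 1 for an item passed by 99.87 percent, 13 for an item at the
mean, and about 25 for an item passed by 0.13 percent.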
FIG. 22. Relation between Percentage of Persons Passing an Item and the Item
Difficulty in Normal Curve Units.

DISTRIBUTION OF TEST SCORES. The difficulty of the test as a whole is, of
course, directly dependent on the difficulty of the items that make up the
test. A comprehensive check of the difficulty of the total test for the
population for which it is designed is provided by the distribution of total
scores. If the standardization sample is a representative cross section of
such a population, then it is generally expected that the scores will fall
roughly into a normal distribution curve.
Let us suppose, however, that the obtained distribution curve is not
normal but clearly skewed, as illustrated in Figure 23, parts A and B. The
first of these distributions, with a piling of scores at the low end, suggests
that the test has too high a floor for the group under consideration,
lacking a sufficient number of easy items to discriminate properly at the lower
end of the range. The result is that persons who would normally scatter
over a considerable range obtain zero or near-zero scores on this test. A
peak at the low end of the scale is therefore obtained. This artificial
piling of scores is illustrated schematically in Figure 24, in which a
normally distributed group yields a skewed distribution on a particular test.
The opposite skewness is illustrated in Part B of Figure 23, with the
scores piled up at the upper end, a finding that suggests insufficient test
ceiling. Administering a test designed for the general population to
selected samples of college or graduate students will usually yield such a
skewed distribution, a number of students obtaining nearly perfect scores.
With such a test, it is impossible to measure individual differences among
the more able subjects in the group. If more difficult items had been
included in the test, some individuals would undoubtedly have scored
higher than the present test permits.

FIG. 23. Skewed Distribution Curves. (A: piling at lower end of the scale;
B: piling at upper end of the scale.)

When the standardization sample yields a markedly nonnormal
distribution on a test, the difficulty level of the test is ordinarily modified until
a normal curve is approximated. Depending on the type of deviation
from normality that appears, easier or more difficult items may be added,
other items eliminated or modified, the position of items in the scale
altered, or the scoring weights assigned to certain responses revised. Such
adjustments are continued until the distribution becomes at least roughly
normal. Under these conditions, the most likely score, obtained by the
largest number of subjects, usually corresponds to about 50 percent
correct items. To the layman who is unfamiliar with the methods of
psychological test construction, a 50 percent score may seem shockingly low. It
is sometimes objected, on this basis, that the examiner has set too low a
standard of passing on the test. Or the inference is drawn that the group
tested is a particularly poor one. Both conclusions, of course, are totally
meaningless when viewed in the light of the procedures followed in
developing psychological tests. Such tests are deliberately constructed and
specifically modified so as to yield a mean score of approximately 50
percent correct. Only in this way can the maximum differentiation between
individuals at all ability levels be obtained with the test. With a mean of
approximately 50 percent correct items, there is the maximum
opportunity for a normal distribution, with individual scores spreading widely at
both extremes.¹

¹ Actually, the normal curve provides finer discrimination at the ends than at the
middle of the scale. Equal discrimination at all points of the scale would require a
rectangular distribution. The normal curve, however, has an advantage if subsequent
statistical analyses of scores are to be conducted, because many current statistical
techniques assume approximate normality of distribution. For this and other reasons,
it is likely that most tests designed for general use will continue to follow a
normal-curve pattern for some time to come.

FIG. 24. Skewness Resulting from Insufficient Test Floor. (Curves shown:
Distribution of Ability; Distribution of Test Scores. Horizontal axis: Test Range.)

The difficulty level of the items that constitute a test determines not
only the mean difficulty level of the test and the test floor and ceiling,
but also the spread of test scores. As mentioned earlier, the maximum
spread of total test scores is obtained with items whose difficulty levels
cluster closely around .50. The fact that such a selection of items yields
wider differentiation than would be obtained with items that are
themselves widely distributed in difficulty level is illustrated in Figure 25. The
three distributions of total scores given in Figure 25 were obtained by
Ebel (1965) with three 16-item tests assembled for this purpose. The
items in Test 1 were chosen so as to cluster close to the .50 difficulty level;
those in Test 2 are widely distributed over the entire difficulty range;
and those in Test 3 fall at the two extremes of difficulty. It will be noted
that the widest spread of total test scores was obtained with the items
concentrated around .50. The reliability coefficient was also highest in
this case, and it was particularly low in the test composed of items with
extreme difficulty values (Test 3). This simple demonstration serves only
to clarify the point; similar conclusions have been reached in more
technical analyses of the problem, with both statistical and experimental
procedures (Cronbach & Warrington, 1952; Lord, 1952; Lord & Novick,
1968).
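Part of the reason for this ordering can be seen from the variance that each
item contributes to the total score, p(1 - p), which is largest at p = .50. The
following sketch is not Ebel's data: the item difficulty values are
hypothetical stand-ins for the three 16-item tests, and the items are
assumed to be statistically independent, so that the variance of the total
score is simply the sum of the item variances.

```python
def total_score_variance(difficulties: list[float]) -> float:
    """Variance of the total score for independent pass/fail items.

    Each item with proportion passing p contributes p * (1 - p).
    """
    return sum(p * (1 - p) for p in difficulties)


# Hypothetical difficulty values for three 16-item tests.
test_1 = [0.50] * 16                        # clustered at .50
test_2 = [k / 17 for k in range(1, 17)]     # spread over the whole range
test_3 = [0.10] * 8 + [0.90] * 8            # at the two extremes

for name, items in (("Test 1", test_1), ("Test 2", test_2), ("Test 3", test_3)):
    print(f"{name}: total-score variance = {total_score_variance(items):.2f}")
```

Real items are positively intercorrelated, which adds covariance terms to
the total-score variance. Because the covariance between two pass/fail
items cannot exceed the product of their standard deviations, those terms
are likewise bounded most generously when difficulties are near .50.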
FIG. 25. Relation between Distribution of Test Scores and Distribution of Item
Difficulty Values.
(From Ebel, 1965, p. 563.)

RELATING ITEM DIFFICULTY TO TESTING PURPOSE. Standardized
psychological tests have generally been designed to elicit maximum
differentiation among individuals at all levels. Our discussion of item difficulty thus
far has been directed to this type of test. In the construction of tests to
serve special purposes, however, the choice of appropriate item
difficulties, as well as the optimal form of the distribution of test scores,
depends upon the type of discrimination sought. Accordingly, tests designed
for screening purposes should utilize items whose difficulty values come
closest to the desired selection ratio (Lord, 1953). For example, to select
the upper 20 percent of the cases, the best items are those clustering
around a p of .20. Since in a screening test no differentiation is required
within the accepted or rejected groups, the most effective use of testing
time is obtained when items cluster near the critical cutoff. It follows, for
instance, that if a test is to be used to screen scholarship applicants from
a college population, the items should be considerably more difficult
than the average for that population. Similarly, if slow learners are being
selected for a remedial training program, items that are much easier than
average would be desirable.

Another illustration is provided by the National Assessment of
Educational Progress (Womer, 1970). Designed as an attempt to obtain direct
information on the outcomes of education throughout the United States,
this project examined carefully chosen representative samples of the
population at four age levels: 9-, 13-, and 17-year-olds and young adults
between the ages of 26 and 35. The project was not concerned with the
achievement of individuals. Its aim was to describe the knowledge,
understanding, and skills manifested by the American population at the
specified age levels. Within each of the content areas and each of the age
groups tested, achievement was assessed in terms of three questions: (1)
What do almost all persons know? (2) What does the typical or average
person know? (3) What do the most able persons know? To answer these
questions, exercises² were prepared at three difficulty levels: one third of
the exercises were "easy" (.90); one third were "average" (.50); and one
third were "difficult" (.10). The actual percentages of persons passing
each exercise varied somewhat around these values. But the goal of the
test constructors was to approximate the three values as closely as
possible.

² Because of the nature of many of the tests, the term "exercise" was considered
more appropriate than "item." For purposes of the present discussion, they can be
regarded as items.

A third example of the choice of item difficulty levels in terms of
special testing goals is to be found in mastery testing. It will be recalled
(Ch. 4) that mastery testing is often associated with criterion-referenced
testing. If the purpose of the test is to ascertain whether an individual has
adequately mastered the basic essentials of a skill or whether he has
acquired the prerequisite knowledge to advance to the next step in a
learning program, then the items should probably be at the .80 or .90
difficulty level. Under these conditions, we would expect the majority of
those taking the examination to complete nearly all items correctly. Thus,
the very easy items (even those passed by 100% of the cases), which are
discarded as nondiscriminative in the usual standardized test, are the very
items that would be included in a mastery test. Similarly, a pretest,
administered prior to a learning unit to determine whether any of the
students have already acquired the skills to be taught, will yield very low
percentages of passing for each item. In this case, items with very low or
even zero p values should not be discarded, since they reveal what
remains to be learned.

It is apparent from these examples that the appropriate difficulty level
of items depends upon the purpose of the test. Although in most testing
situations items clustering around a medium difficulty (.50) yield the
maximum information about the individual's performance level, decisions
about item difficulty cannot be made routinely, without knowing how
the test scores will be used.

ITEM-CRITERION RELATIONSHIPS. All indices of item validity are based on
the relationship between item response and criterion performance. Any
criterion employed for test validation is also suitable for item validation.
Item analysis can be employed to improve not only the convergent but
also the discriminant validity of a test (see Ch. 6). Thus, items can be
chosen on the basis of high correlation with a criterion and low
correlation with any irrelevant variable that may affect test performance. In the
development of an arithmetic reasoning test, for example, items that
correlate significantly with a reading comprehension test would be
discarded.

Since item responses are generally recorded as pass or fail, the
measurement of item validity usually involves a dichotomous variable (the item)
and a continuous variable (the criterion). In certain situations, the
criterion too may be dichotomous, as in graduation versus nongraduation
from college or success versus failure on a job. Moreover, a continuous
criterion may be dichotomized for purposes of analysis. The basic
relationship between item and criterion is illustrated by the three item
characteristic curves reproduced in Figure 26. Using fictitious data, each of
these curves shows the percentages of persons in each class-interval of
criterion score who pass the item. It can be seen that Item 1 has a low
validity, since it is passed by nearly the same proportion of persons
throughout the criterion range. Items 2 and 3 are better, showing a
closer correspondence between percentage passing and criterion score.
Of these two, 3 is the more valid item, since its curve rises more steeply.

FIG. 26. Item Characteristic Curves for Three Hypothetical Items.
(Adapted from Lord, 1953, p. 520.)

Although item characteristic curves can provide a vivid graphic
representation of differences in item validities, decisions about individual
items can be made more readily if validity is expressed by a single
numerical index for each item. Over fifty different indices of item validity
have been developed and used in test construction. One difference
among them pertains to their applicability to dichotomous or continuous
measures. Among those applicable to dichotomous variables, moreover,
some assume a continuous and normal distribution of the underlying
trait, on which the dichotomy has been artificially imposed; others
assume a true dichotomy. Another difference concerns the relation of item
difficulty to item validity. Certain indices measure item validity
independently of item difficulty. Others yield higher validities for items close
to the .50 difficulty level than for those at the extremes of difficulty.

Despite differences in procedure and assumptions, most item validity
indices provide closely similar results. Although the numerical values
of the indices may differ, the items that are retained and those that are
rejected on the basis of different validity indices are largely the same. In
fact, the variation in item validity data from sample to sample is
generally greater than that among different methods. For this reason, the
choice of method is often based on the amount of computational labor
required and the availability of special computational aids. Among the
published computational aids are a number of abacs or nomographs.
These are computing diagrams with which, for example, the value of an
item-criterion correlation can be read directly if the percentages of
persons passing the item in high and low criterion groups are known
(Guilford & Fruchter, 1973, pp. 445-458; Henrysson, 1971).

USE OF EXTREME GROUPS. A common practice in item analysis is to
compare the proportion of cases who pass an item in contrasting criterion
groups. When the criterion is measured along a continuous scale, as in
the case of course grades, job ratings, or output records, upper (U) and
lower (L) criterion groups are selected from the extremes of the
distribution. Obviously, the more extreme the groups the sharper will be the
differentiation. But the use of very extreme groups, such as upper and
lower 10 percent, would reduce the reliability of the results because of
the small number of cases utilized. In a normal distribution, the optimum
point at which these two conditions balance is reached with the upper
and lower 27 percent (Kelley, 1939). When the distribution is flatter
than the normal curve, the optimum percentage is slightly greater than
27 and approaches 33 (Cureton, 1957b). With small groups, as in an
ordinary classroom, the sampling error of item statistics is so large that
only rough results can be obtained. Under these conditions, therefore,
we need not be too concerned about the exact percentage of cases in

SIMPLE ANALYSIS WITH SMALL GROUPS. Because item analysis is
frequently conducted with small groups, such as the students who have
taken a classroom quiz, we shall consider first a simple procedure
especially suitable for this situation. Let us suppose that in a class of 60
students we have chosen the 20 students (33%) with the highest and the
20 with the lowest scores. We now have three groups of papers which we
may call the Upper (U), Middle (M), and Lower (L) groups. First we
need to tally the correct responses to each item given by students in the
three groups. This can be done most readily if we list the item numbers
in one column and prepare three other columns headed U, M, and L. As
we come to each student's paper, we simply place a tally next to each
item he answered correctly. This is done for each of the 20 papers in the
U group, then for each of the 20 in the M group, and finally for each of
the 20 in the L group. We are now ready to count up the tallies and
record totals for each group as shown in Table 19. For illustrative
purposes, the first seven items have been entered. A rough index of the
validity or discriminative value of each item can be found by subtracting the
number of persons answering it correctly in the L group from the number
answering it correctly in the U group. These U-L differences are given
in the last column of Table 19. A measure of item difficulty can be
obtained with the same data by adding the number passing each item in all
three criterion groups (U + M + L).

TABLE 19
Simple Item Analysis Procedure: Number of Persons Giving Correct
Response in Each Criterion Group
the two contrasted groups. Any convenient number between 25 percent
~~..:a:a::UOii:!mm
U M L
- Difficulty Discrimina-
and 33 percent will serve satisfactorily. Item (20) (20) (20) (U + M + L) tion (U - L)
With the large and normally distributed· samples employed in the de-
velopment of standardized tests, it has been customary to work ,,~ith
1 15 9 7 31 8
2 20 20 16 56° 4
the upper and lower 27 percent of the criterion distribution. Many· of 1_
3 19 18 9 46 10
the tables and abacs prepared to fl,lcilitatethe computation of item valid- 4 10 11 16 37 - 6-
ity indices are bascd on the assumption that the "27 percent rule" has 5 11 13 11 35 0-
been followed. As high"speed computers become more generally avail- 6 16 14 9 39 7
able, it is likely that the various labor-saving procedures developed to 7 5 0 0 5- 5
facilitate item analysis will be gradually replaced by more exact and
sophisticated methods. With computer facilities, it is better to analyze
the results for the entire sample, rather than working with upper and
lower extremes. Mathematical procedures have also been developed for
measuring item validity from the item characteristic curves, but their
application is not feasible witho~t access to a computer (Baker, 1971;
Henrysson, 1971; Lord & Novick,'1968 ).
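The tallying procedure behind Table 19 can be sketched in a few lines of Python. This is only an illustration: the function name `simple_item_analysis` and the dictionary layout are assumptions made for the example, while the counts themselves are those entered in Table 19.

```python
# Number of correct responses per item in the Upper, Middle, and Lower
# groups (20 papers each), copied from Table 19.
TABLE_19 = {
    1: (15, 9, 7),
    2: (20, 20, 16),
    3: (19, 18, 9),
    4: (10, 11, 16),
    5: (11, 13, 11),
    6: (16, 14, 9),
    7: (5, 0, 0),
}


def simple_item_analysis(tallies):
    """Return {item: (difficulty, discrimination)}.

    difficulty     = U + M + L  (number passing in all three groups)
    discrimination = U - L      (rough index of discriminative value)
    """
    results = {}
    for item, (upper, middle, lower) in tallies.items():
        results[item] = (upper + middle + lower, upper - lower)
    return results


if __name__ == "__main__":
    print("Item  Difficulty  U-L")
    for item, (difficulty, u_minus_l) in simple_item_analysis(TABLE_19).items():
        print(f"{item:>4}  {difficulty:>10}  {u_minus_l:>3}")
```

Keeping the raw U, M, and L tallies rather than only the derived columns means the same data can later be converted to percentages if the groups are not all of size 20.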
Examination of Table 19 reveals four questionable items that have been identified for further consideration or for class discussion. Two items, 2 and 7, have been singled out because one seems to be too easy, having been passed by 56 out of 60 students, and the other too difficult, having been passed by only 5. Items 4 and 5, while satisfactory with regard to difficulty level, show a negative and zero discriminative value, respectively. We would also consider in this category any items with a very small positive U-L difference, of roughly three or less when groups of approximately this size are being compared. With larger groups, we would expect larger differences to occur by chance in a nondiscriminating item.

The purpose of item analysis in a teacher-made test is to identify deficiencies either in the test or in the teaching. Discussing questionable items with the class is often sufficient to diagnose the problem. If the wording of the item was at fault, it can be revised or discarded in subsequent testing. Discussion may show, however, that the item was satisfactory, but the point being tested had not been properly understood. In that case, the topic may be reviewed and clarified. In narrowing down the source of the difficulty, it is often helpful to carry out a supplementary analysis, as shown in Table 20, with at least some of the items chosen for discussion. This tabulation gives the number of students in the U and L groups who chose each option in answering the particular items.

TABLE 20
Response Analysis of Individual Items

                      Response Options
Item    Group     1     2     3     4     5
2       Upper     0     0     0     20    0
        Lower     2     0     1     16    1
4       Upper     0     10    9     0     1
        Lower     2     16    2     0     0
5       Upper     2     3     2     11    2
        Lower     1     3     3     11    2
7       Upper     5     3     5     4     3
        Lower     0     5     8     3     4

Although Item 2 has been included in Table 20, there is little more we can learn about it by tabulating the frequency of each wrong option, since only 4 persons in the L group and none in the U group chose wrong answers. Discussion of the item with the students, however, may help to determine whether the item as a whole was too easy and therefore of little intrinsic value, whether some defect in its construction served to give away the right answer, or whether it is a good item dealing with a point that happened to have been effectively taught and well remembered. In the first case, the item would probably be discarded, in the second it would be revised, and in the third it would be retained unchanged.

The data on Item 4 suggest that the third option had some unsuspected implications that led 9 of the better students to prefer it to the correct alternative. The point could easily be settled by asking those students to explain why they chose it. In Item 5, the fault seems to lie in the wording either of the stem or of the correct alternative, because the students who missed the item were uniformly distributed over the four wrong options. Item 7 is an unusually difficult one, which was answered incorrectly by 15 of the U and all of the L group. The slight clustering of responses on incorrect option 3 suggests a superficial attractiveness of this option, especially for the more easily misled L group. Similarly, the lack of choices of the correct response (option 1) by any of the L group suggests that this alternative was so worded that superficially, or to the uninformed, it seemed wrong. Both of these features, of course, are desiderata of good test items. Class discussion might show that Item 7 is a good item dealing with a point that few class members had actually learned.

THE INDEX OF DISCRIMINATION. If the numbers of persons passing each item in U and L criterion groups are expressed as percentages, the difference between these two percentages provides an index of item validity that can be interpreted independently of the size of the particular sample in which it was obtained. This index has been repeatedly described in the psychometric literature (see, e.g., Ebel, 1965; Johnson, 1951; Mosier & McQuitty, 1940) and has been variously designated as ULI, ULD, or simply D. Despite its simplicity, it has been shown to agree quite closely with other, more elaborate measures of item validity (Engelhart, 1965). The computation of D can be illustrated by reference to the data previously reported in Table 19. First, the numbers of persons passing each item in the U and L groups are changed to percentages. Because the
number of cases in each group is 20, we could divide each number by 20 and multiply the result by 100. It is easier, however, to divide 100 by 20, which gives 5, and then multiply each number by that constant. Thus, for Item 1, 15 × 5 = 75 (U group) and 7 × 5 = 35 (L group). For this item, then, D is: 75 - 35 = 40. The remaining values for the seven items are given in Table 21.³

TABLE 21
Computation of Index of Discrimination (Data from Table 19)

          Percentage Passing
Item    Upper Group    Lower Group    Difference (Index of Discrimination)
1       75             35             40
2       100            80             20
3       95             45             50
4       50             80             -30
5       55             55             0
6       80             45             35
7       25             0              25

³ The alert reader may have noticed that the same result can be obtained by simply multiplying the differences in the last column of Table 19 by the constant, 5.

D can have any value between +100 and -100. If all members of the U group and none of the L group pass an item, D equals 100. Conversely, if all members of the L group and none of the U group pass it, D equals -100. If the percentages of both groups passing an item are equal, D will be zero. D has several interesting properties. It has been demonstrated (Ebel, 1965; Findley, 1956) that D is directly proportional to the difference between the numbers of correct and incorrect discriminations made by an item. Correct discriminations are based on the number of passes in the U group versus the number of failures in the L group; incorrect discriminations are based on the number of failures in the U group versus the number of passes in the L group. Ebel (1967) has also shown that there is a close relation between the mean D index of the items and the reliability coefficient of the test. The higher the mean D, the higher the reliability.

Another noteworthy characteristic of D is one it shares with several other indices of item validity. The values of D are not independent of item difficulty but are biased in favor of intermediate difficulty levels. Table 22 shows the maximum possible value of D for items with different percentages of correct responses. If either 100 percent or 0 percent of the cases in the total sample pass an item, there can be no difference in percentage passing in U and L groups; hence D is zero. At the other extreme, if 50 percent pass an item, it would be possible for all the U cases and none of the L cases to pass it, thus yielding a D of 100 (100 - 0 = 100). If 70 percent pass, the maximum value that D could take can be illustrated as follows: (U) 50/50 = 100%; (L) 20/50 = 40%; D = 100 - 40 = 60. It will be recalled that, for most testing purposes, items closer to the 50 percent difficulty level are preferable. Hence, item validity indices that favor this difficulty level are often appropriate for item selection.

TABLE 22
Relation of Maximum Value of D to Item Difficulty

Percentage Passing Item    Maximum Value of D
100                        0
90                         20
70                         60
50                         100
30                         60
10                         20
0                          0

PHI COEFFICIENT. Many indices of item validity report the relationship between item and criterion in the form of a correlation coefficient. One of these is the phi coefficient (φ). Computed from a fourfold table, φ is based on the proportions of cases passing and failing an item in U and L criterion groups. Like all correlation coefficients, it yields values between +1.00 and -1.00. The φ coefficient assumes a genuine dichotomy in both item response and criterion variable. Consequently, it is strictly applicable only to the dichotomous conditions under which it was obtained and cannot be generalized to any underlying relationship between the traits measured by item and criterion. Like the D index, φ is biased toward the middle difficulty levels; that is, it yields the highest possible correlations for dichotomies closest to a 50-50 split.

Several computational aids are available for finding φ coefficients. When the number of cases in U and L criterion groups is equal, φ can be found with the Jurgensen tables (1947) by simply entering the percentages passing the item in U and L groups. Since in conducting an item analysis it is usually feasible to select U and L groups of equal
size, the Jurgensen tables are widely used for this purpose. When the two criterion groups are unequal, φ can be found with another set of tables, prepared by Edgerton (1960), although their application is slightly more time consuming.

The significance level of a φ coefficient can be readily computed through the relation of φ to both Chi Square and the Normal Curve Ratio. Applying the latter, we can identify the minimum value of φ that would reach statistical significance at the .05 or .01 levels with the following formulas:

$$\phi_{.05} = \frac{1.96}{\sqrt{N}}$$

$$\phi_{.01} = \frac{2.58}{\sqrt{N}}$$

In these formulas, N represents the total number of cases in both criterion groups combined. Thus, if there were 50 cases in U and 50 in L groups, N would be 100 and the minimum φ significant at the .05 level would be 1.96 ÷ √100 = .196. Any item whose φ reached or exceeded .196 would thus be valid at the .05 level of significance.

BISERIAL CORRELATION. As a final example of a commonly used measure of item validity, we may consider the biserial correlation coefficient (r_bis), which contrasts with φ in two major respects. First, r_bis assumes a continuous and normal distribution of the trait underlying the dichotomous item response. Second, it yields a measure of item-criterion relationship that is independent of item difficulty. To compute r_bis directly from the data, we would need the mean criterion score of those who pass and those who fail the item, as well as the proportion of cases passing and failing the item in the entire sample and the standard deviation of the criterion scores.

Computing all the needed terms and applying the r_bis formula for each item can be quite time consuming. Tables have been prepared from which r_bis can be estimated by merely entering the percentages passing the item in the upper and lower 27 percent of the criterion group (Fan, 1952; 1954). These are the previously mentioned tables obtainable from Educational Testing Service. With these tables it is possible by entering the percentage passing in U and L groups to find three values: an estimate of p, the percentage who pass the item in the entire sample; the previously described Δ, a measure of item difficulty on an interval scale; and r_bis between item and criterion. These tables are only applicable when exactly 27 percent of the cases are placed in U and L groups.

There is no way of computing exact significance levels for these estimated biserial correlations, but it has been shown that their standard errors are somewhat larger than those of biserial correlations computed from all the data in the usual way. That is, the r_bis estimated from the Fan tables fluctuates more from sample to sample than does the r_bis computed by formula. With this information, one could use the standard error of r_bis to estimate approximately how large the correlation should be for statistical significance.⁴ It should be reiterated that, with computer facilities, biserial correlations can be readily found from the responses of the total sample; and this is the preferred procedure.

⁴ The formula for the standard error of r_bis can be found in any standard statistics text, such as Guilford and Fruchter (1973, pp. 293-296).

Item analysis is frequently conducted against total score on the test itself. This is a common practice in the case of achievement tests and especially teacher-made classroom tests, for which an external criterion is rarely available. As was noted in Chapter 6, this procedure yields a measure of internal consistency, not external validity. Such a procedure is appropriate as a refinement of content validation and of certain aspects of construct validation.

For tests requiring criterion-related validity, however, the use of total score in item analysis needs careful scrutiny. Under certain conditions, the two approaches may lead to opposite results, the items chosen on the basis of external validity being the very ones rejected on the basis of internal consistency. Let us suppose that the preliminary form of a scholastic aptitude test consists of 100 arithmetic items and 50 vocabulary items. In order to select items from this initial pool by the method of internal consistency, the biserial correlation between performance on each item and total score on the 150 items may be used.⁵ It is apparent that such biserial correlations would tend to be higher for the arithmetic than for the vocabulary items, since the total score is based on twice as many arithmetic items. If it is desired to retain the 75 "best" items in the final form of the test, it is likely that most of these items will prove to be arithmetic problems. In terms of the criterion of scholastic achievement, however, the vocabulary items might have been more valid predictors than the arithmetic items. If such is the case, the item analysis will have served to lower rather than raise the validity of the test.

⁵ Such part-whole correlations will be somewhat inflated by the common specific and error variance in the item and the test of which it is a part. Formulas have been developed to correct for this effect (Guilford & Fruchter, 1973, pp. 454-456).

The practice of rejecting items that have low correlations with total score provides a means of purifying or homogenizing the test. By such a
procedure, the items with the highest average intercorrelations will be retained. This method of selecting items will increase test validity only when the original pool of items measures a single trait and when this trait is present in the criterion. Some types of tests, however, measure a combination of traits required by a complex criterion. Purifying the test in such a case may reduce its criterion coverage and thus lower validity.

The selection of items to maximize test validity may be likened to the selection of tests that will yield the highest validity for a battery. It will be recalled (Ch. 7) that the test contributing most toward battery validity is one having the highest correlation with the criterion and the lowest correlation with the other tests in the battery. If this principle is applied to the selection of items, it means that the most satisfactory items are those with the highest item validities and the lowest coefficients of internal consistency. On this basis, it is possible to determine the net effectiveness of an item, that is, the net increase in test validity that accrues from the addition of that particular item. Thus, an item that has a high correlation with the external criterion but a relatively low correlation with total score would be preferred to one correlating highly with both criterion and test score, since the first item presumably measures an aspect of the criterion not adequately covered by the rest of the test.

It might seem that items could be selected by the same methods used in choosing tests for inclusion in a battery. Thus, each item could be correlated with the criterion and with every other item. The best items chosen by this method could then be weighted by means of a regression equation. Such a procedure, however, is neither feasible nor theoretically defensible. Not only would the computation labor be prohibitive, but interitem correlations are also subject to excessive sampling fluctuation and the resulting regression weights would be too unstable to provide a basis for item selection, unless extremely large samples are used. For these reasons, several approximation procedures have been developed for selecting items in terms of their net contribution to test validity. Some of these methods involve an empirical build-up process, whereby items are added to an ever-growing pool, and the validity of each successive composite is recomputed. Others begin with the complete set of items and reduce the pool by successive elimination of the poorest items until the desired test validity is attained. Because even these techniques require considerable computational labor, their use is practicable only when computer facilities are available (Fossum, 1973; Henrysson, 1971).

It should be noted that all techniques for selecting items in terms of their net effectiveness represent the opposite approach from that followed when items are chosen on the basis of internal consistency. In the former procedure, a high item-test correlation increases the probability that the item will be rejected; in the latter, a high item-test correlation increases the probability of its acceptance. The objectives of the two procedures are, of course, unlike. One aims to increase the breadth of criterion coverage and reduce duplication; the other attempts to raise the homogeneity of the test. Both are desirable objectives of test construction. The appropriate procedure depends largely on the nature and purpose of the test. One extreme can be illustrated by a biographical inventory, in which items can only be evaluated and selected in terms of an external criterion and the coverage of content is highly heterogeneous. The opposite extreme can be illustrated by a spelling test, whose content is highly homogeneous and in which internal consistency would be a desirable goal for item selection.

For many testing purposes, a satisfactory compromise is to sort the relatively homogeneous items into separate tests, or subtests, each of which will cover a different aspect of the criterion. Thus, breadth of coverage is achieved through a variety of tests, each yielding a relatively unambiguous score, rather than through heterogeneity of items within a single test. By such a procedure, items with low indices of internal consistency would not be discarded, but would be segregated. Within each subtest or item group, fairly high internal consistency could thus be attained. At the same time, internal consistency would not be accepted as a substitute for criterion-related validity, and some attention would be given to adequacy of coverage and to the avoidance of excessive concentration of items in certain areas.

Whether or not speed is relevant to the function being measured, item indices computed from a speeded test may be misleading. Except for items that all or nearly all subjects have had time to attempt, the item indices found from a speed test will reflect the position of the item in the test rather than its intrinsic difficulty or validity. Items that appear late in the test will be passed by a relatively small percentage of the total sample, because only a few persons have time to reach these items. Regardless of how easy the item may be, if it occurs late in a speeded test, it will appear difficult. Even if the item merely asked for the subject's name, the percentage of persons who pass it might be very low when the item is placed toward the end of a speeded test.

Similarly, item validities tend to be overestimated for those items that have not been reached by all subjects. Because the more proficient individuals tend to work faster, they are more likely to reach one of the later items in a speed test (Mollenkopf, 1950a). Thus, regardless of the nature of the item itself, some correlation between the item and the criterion would be obtained if the item occurred late in a speed test.

To avoid some of these difficulties, we could limit the analysis of each
item to those persons who have reached the item. This is not a completely satisfactory solution, however, unless the number of persons failing to reach the item is small. Such a procedure would involve the use of a rapidly shrinking number of cases, and would thus render the results on the later items quite unreliable. Moreover, the persons on whom the later items are analyzed would probably constitute a selected sample, and hence would not be comparable to the larger samples used for the earlier items. As has already been pointed out, the faster subjects tend also to be the more proficient. The later items would thus be analyzed on a superior sample of individuals. One effect of such a selective factor would be to lower the apparent difficulty level of the later items, since the percentage passing would be greater in the selected superior group than in the entire sample. It will be noted that this is the opposite error from that introduced when the percentage passing is computed in terms of the entire sample. In that case, the apparent difficulty of items is spuriously raised.

The effect of the above procedure on indices of item validity is less obvious, but nonetheless real. It has been observed, for example, that some low-scoring examinees tend to hurry through the test, marking items almost at random in their effort to try all items within the time allowed. This tendency is much less common among high-scoring examinees. As a result, the sample on which a late-appearing item is analyzed is likely to consist of some very poor respondents, who will perform no better than chance on the item, and a larger number of very proficient and fast respondents, who are likely to answer the item correctly. In such a group, the item-criterion correlation will probably be higher than it would be in a more representative sample. In the absence of such random respondents, on the other hand, the sample on which the later items are analyzed will cover a relatively narrow range of ability. Under these conditions, the validities of the later items will tend to be lower than they would be if computed on the entire unselected sample.

The anticipated effects of speed on indices of item difficulty and item validity have been empirically verified, both when item statistics are computed with the entire sample (Wesman, 1949) and when they are computed with only those persons who attempt the item (Mollenkopf, 1950a). In the latter study, comparable groups of high school students were given two forms of a verbal test and two forms of a mathematics test. Each of the two forms contained the same items as the other, but items occurring early in one form were placed late in the other. Each form was administered with a short time limit (speed conditions) and with a very liberal time limit (power conditions). Various intercomparisons were thus possible between forms and timing conditions. The results clearly showed that the position of an item in the speed tests affected its indices of difficulty and validity. When the same item occurred later in a speeded test, it was passed by a greater percentage of those attempting it, and it yielded a higher item-criterion correlation.

The difficulties encountered in the item analysis of speeded tests are fundamentally similar to those discussed in Chapter 5 in connection with the reliability of speeded tests. Various solutions, both empirical and statistical, have been developed for meeting these difficulties. One empirical solution is to administer the test with a longer time limit to the group on which item analysis is to be carried out. This solution is satisfactory provided that speed itself is not an important aspect of the ability to be measured by the test. Apart from the technical problems presented by specific tests, however, it is well to keep in mind that item-analysis data obtained with speeded tests are suspect and call for careful scrutiny.

MEANING OF CROSS VALIDATION. It is essential that test validity be computed on a different sample of persons from that on which the items were selected. This independent determination of the validity of the entire test is known as cross validation (Mosier, 1951). Any validity coefficient computed on the same sample that was used for item-selection purposes will capitalize on chance errors within that particular sample and will consequently be spuriously high. In fact, a high validity coefficient could result under such circumstances even when the test has no validity at all in predicting the particular criterion.

Let us suppose that out of a sample of 100 medical students, the 30 with the highest and the 30 with the lowest medical school grades have been chosen to represent contrasted criterion groups. If, now, these two groups are compared in a number of traits actually irrelevant to success in medical school, certain chance differences will undoubtedly be found. Thus, there might be an excess of private-school graduates and of red-haired persons within the upper criterion group. If we were to assign each individual a score by crediting him with one point for private-school graduation and one point for red hair, the mean of such scores would undoubtedly be higher in the upper than in the lower criterion group. This is not evidence for the validity of the predictors, however, since such a validation process is based on a circular argument. The two predictors were chosen in the first place on the basis of the chance variations that characterized this particular sample. And the same chance differences are operating to produce the mean differences in total score.
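The circularity in the medical-student example can be made concrete with a small simulation. The following is a minimal sketch, not a procedure taken from the text: the function name `simulate_chance_selection`, the pool of 50 irrelevant yes/no traits, and the random seed are all assumptions chosen for illustration. Only the sample of 100 students, the upper and lower groups of 30, and the scoring by two chance-selected traits follow the example.

```python
import random


def simulate_chance_selection(n_students=100, n_traits=50, group_size=30,
                              n_keep=2, seed=0):
    """Pick the traits with the largest U-L difference in one sample, then
    score the same traits in a second, independent sample.

    Grades and traits are generated independently, so no trait has any
    true validity. Returns (chosen, diffs, same_gap, cross_gap).
    """
    rng = random.Random(seed)

    def draw_groups():
        # Each student: a random grade plus irrelevant yes/no traits.
        students = [(rng.random(), [rng.randint(0, 1) for _ in range(n_traits)])
                    for _ in range(n_students)]
        students.sort(key=lambda s: s[0], reverse=True)
        upper = [traits for _, traits in students[:group_size]]
        lower = [traits for _, traits in students[-group_size:]]
        return upper, lower

    def mean_gap(upper, lower, keys):
        # Mean score (one point per chosen trait) in U minus mean score in L.
        def total(group):
            return sum(traits[k] for traits in group for k in keys)
        return (total(upper) - total(lower)) / group_size

    upper, lower = draw_groups()
    diffs = [sum(t[k] for t in upper) - sum(t[k] for t in lower)
             for k in range(n_traits)]
    chosen = sorted(range(n_traits), key=lambda k: diffs[k], reverse=True)[:n_keep]

    same_gap = mean_gap(upper, lower, chosen)           # selection sample
    new_upper, new_lower = draw_groups()
    cross_gap = mean_gap(new_upper, new_lower, chosen)  # fresh sample
    return chosen, diffs, same_gap, cross_gap


if __name__ == "__main__":
    chosen, _, same_gap, cross_gap = simulate_chance_selection()
    print("traits chosen:", chosen)
    print("U-L mean score gap, selection sample:", round(same_gap, 2))
    print("U-L mean score gap, new sample:", round(cross_gap, 2))
```

Because the traits are picked for their U-L difference in the first sample, the gap in that sample is as large as chance allows. In an independent sample the same traits have no such advantage, so the gap there would be expected to hover around zero.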
items occurring early in one form were placed late in the other. Each such a validation process is based on a circular argument. The two pre-
form was administered with a short time limit (speed conditions) and dictors were chosen in the first place on the basis of the chance varia-
with a very liberal time limit (power conditions). Various intercompari- tions that characterized this particular sample. And the same chance
sons were thus possible between forms and timing conditions. The re- differences are operating to produce the mean differences in total score.
When tested in another sample, however, the chance differences in frequency of private-school graduation and red hair are likely to disappear or be reversed. Consequently, the validity of the scores will collapse.

AN EMPIRICAL EXAMPLE. A specific illustration of the need for cross validation is provided by an early investigation conducted with the Rorschach inkblot test (Kurtz, 1948). In an attempt to determine whether the Rorschach could be of any help in selecting sales managers for life insurance agencies, this test was administered to 80 such managers. These managers had been carefully chosen from several hundred employed by eight life insurance companies, so as to represent an upper criterion group of 42 considered very satisfactory by their respective companies, and a lower criterion group of 38 considered unsatisfactory. The 80 records were studied by a Rorschach expert, who selected a set of 32 signs, or response characteristics, occurring more frequently in one criterion group than in the other. Signs found more often in the upper criterion group were scored +1 if present and 0 if absent; those more common in the lower group were scored -1 or 0. Since there were 16 signs of each type, total scores could range theoretically from -16 to +16.

When the scoring key based on these 32 signs was reapplied to the original group of 80 persons, 79 of the 80 were correctly classified as being in the upper or lower group. The correlation between test score and criterion would thus have been close to 1.00. However, when the test was cross-validated on a second comparable sample of 41 managers, 21 in the upper and 20 in the lower group, the validity coefficient dropped to a negligible .02. It was thus apparent that the key developed in the first sample had no validity for selecting such personnel.

AN EXAMPLE WITH CHANCE DATA. That the use of a single sample for item selection and test validation can produce a completely spurious validity coefficient under pure chance conditions was vividly demonstrated by Cureton (1950). The criterion to be predicted was the grade-point average of each of 29 students registered in a psychology course. This criterion was dichotomized into grades of B or better and grades below B. The "items" consisted of 85 tags, numbered from 1 to 85 on one side. To obtain a test score for each subject, the 85 tags were shaken in a container and dropped on the table. All tags that fell with numbered side up were recorded as indicating the presence of that particular item in the student's test performance. Twenty-nine throws of the 85 tags thus provided complete records for each student, showing the presence or absence of each item or response sign. Because of the procedure followed in generating these chance scores, Cureton facetiously named the test the "B-Projective Psychokinesis Test."

An item analysis was then conducted, with each student's grade-point average as the criterion. On this basis, 24 "items" were selected out of the 85. Of these, 9 occurred more frequently among the students with an average grade of B or better and received a weight of +1; 15 occurred more frequently among the students with an average grade below B and received a weight of -1. The sum of these item weights constituted the total score for each student. Despite the known chance derivation of these "test scores," their correlation with the grade criterion in the original group of 29 students proved to be .82. Such a finding is similar to that obtained with the Rorschach scores in the previously cited study. In both instances, the apparent correspondence between test score and criterion resulted from the utilization of the same chance differences both in selecting items and in determining validity of total test scores.

CONDITIONS AFFECTING VALIDITY SHRINKAGE. The amount of shrinkage of a validity coefficient in cross validation depends in part on the size of the original item pool and the proportion of items retained. When the number of original items is large and the proportion retained is small, there is more opportunity to capitalize on chance differences and thus obtain a spuriously high validity coefficient. Another condition affecting amount of shrinkage in cross validation is size of sample. Since spuriously high validity in the initial sample results from an accumulation of sampling errors, smaller groups (which yield larger sampling errors) will exhibit greater validity shrinkage.

If items are chosen on the basis of previously formulated hypotheses, derived from psychological theory or from past experience with the criterion, validity shrinkage in cross validation will be minimized. For example, if a particular hypothesis required that the answer "Yes" be more frequent among successful students, then the item would not be retained if a significantly larger number of "Yes" answers were given by the unsuccessful students. The opposite, blindly empirical approach would be illustrated by assembling a miscellaneous set of questions with little regard to their relevance to the criterion behavior, and then retaining all items yielding significant positive or negative correlations with the criterion. Under the latter circumstances, we would expect much more shrinkage than under the former. In summary, shrinkage of test validity in cross validation will be greatest when samples are small, the initial item pool is large, the proportion of items retained is small, and items are assembled without previously formulated rationale.
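
The shrinkage described above can be reproduced with a small simulation in the spirit of the chance-data example. The sketch below is only an illustration under stated assumptions: the criterion is a random pass/fail value, every "item" is a random coin toss, and items are kept by the size of their sample correlation with the criterion. That selection rule, and the function names `pearson`, `build_key`, `score`, and `simulate`, are hypothetical choices made for this sketch and are not taken from the studies cited.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def build_key(items, criterion, n_keep):
    """Keep the n_keep items with the largest |r|; weight each +1 or -1 by sign."""
    n_items = len(items[0])
    corrs = [(pearson([row[j] for row in items], criterion), j)
             for j in range(n_items)]
    corrs.sort(key=lambda t: abs(t[0]), reverse=True)
    return {j: (1 if r >= 0 else -1) for r, j in corrs[:n_keep]}


def score(items, key):
    """Total score = sum of item weights for the items a person shows."""
    return [sum(w * row[j] for j, w in key.items()) for row in items]


def simulate(n_people=29, n_items=85, n_keep=24, seed=0):
    """Return (validity in the selection sample, validity in a fresh sample)."""
    rng = random.Random(seed)

    def sample():
        # Criterion and items are independent coin tosses: no true validity.
        crit = [rng.randint(0, 1) for _ in range(n_people)]
        items = [[rng.randint(0, 1) for _ in range(n_items)]
                 for _ in range(n_people)]
        return items, crit

    items_a, crit_a = sample()   # sample used for item selection
    items_b, crit_b = sample()   # independent cross-validation sample
    key = build_key(items_a, crit_a, n_keep)
    r_same = pearson(score(items_a, key), crit_a)
    r_cross = pearson(score(items_b, key), crit_b)
    return r_same, r_cross


if __name__ == "__main__":
    for seed in range(5):
        r_same, r_cross = simulate(seed=seed)
        print(f"seed {seed}: same sample r = {r_same:.2f}, "
              f"cross validation r = {r_cross:.2f}")
```

Raising `n_people`, or setting `n_keep` closer to `n_items`, should bring the two coefficients closer together, which is the pattern the summary above leads one to expect.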
EXPLORATORY STUDIES. Insofar as diverse cultures or subcultures foster the development of different skills and knowledge, these differences will be reflected in test scores. An individual's general level of performance will be higher in those aptitudes stimulated and encouraged by his particular experiential background. A further question pertains to the relative difficulty of items for groups with dissimilar cultural backgrounds. If difficulty is measured in the usual way, in terms of percentage of respondents passing each item, will the rank order of the items be the same across groups, regardless of overall level of performance? Early investigations of this question with urban and rural children revealed a number of significant differences in relative item difficulties on the Stanford-Binet (Jones, Conrad, & Blanchard, 1930) and on a general information test (Shimberg, 1929).

A comprehensive test of group differences in relative item difficulties is provided by measures of item × group interaction, computed through an analysis of variance. Another procedure is to correlate either the percentage passing or the delta values (Δ) of the same items in two groups. If there is no significant item × group interaction, i.e., if the relative difficulties of items are the same for both groups, this correlation should be close to 1.00. These more sophisticated statistical techniques have been employed in studies with the College Board's Preliminary Scholastic Aptitude Test, administered to high school students. Relative item difficulties were investigated with reference to ethnic, socioeconomic, and urban-rural categories (Angoff & Ford, 1973; Cleary & Hilton, 1968). The results show some significant though small item × group interactions. Correlations between the delta values for two ethnic groups were slightly lower than the correlations for two random samples of the same ethnic group. Two of the bivariate distributions of such correlations are illustrated in Figures 27 and 28. When two random samples of white high school students were compared (Fig. 27), the item deltas were closely similar, yielding a correlation of .987. When black and white samples were compared (Fig. 28), the items not only proved more difficult as a whole for the black students, but they also showed more discrepancies in relative difficulty, as indicated by a correlation of .929. Efforts to identify reasons for these differences led to two suggestive

FIG. 27. Bivariate Distribution of Item Difficulties of Preliminary Scholastic Aptitude Test for Two Random Samples of White High School Students.
(From Angoff & Ford, 1973, p. 99. Reproduced by permission of National Council on Measurement in Education.)

FIG. 28. Bivariate Distribution of Item Difficulties of Preliminary Scholastic Aptitude Test for Random Samples of Black and White High School Students. [Recovered axis label: Deltas for Black Sample.]
(From Angoff & Ford, 1973, p. 100. Reproduced by permission of National Council on Measurement in Education.)

findings. First, examination of item content failed to reveal any relation between the affected items and known differences in the experiential backgrounds of the groups. Second, equating the groups on a related cognitive variable reduced both the group difference in mean scores and the item × group interaction. The latter finding suggests that the relative difficulty of items depends at least in part on the absolute performance level in the ability measured by the test. It is possible, for example, that persons at different aptitude levels utilize different work methods, problem-solving techniques, or cognitive skills in responding to the same items. Those items that prove relatively difficult when solved by method A could prove relatively easy when solved by method B, and vice versa.

It should be added that all the techniques used in studying item × group interactions in ability tests are also applicable to personality tests. In the latter tests, what is measured is not the difficulty of items but the relative frequency in choice of specific response options, as on an attitude scale or personality inventory.

ITEM SELECTION TO MINIMIZE OR MAXIMIZE GROUP DIFFERENCES. In the construction of certain tests, item × group interactions have been used as one basis for the selection of items. In the development of the Stanford-Binet, for example, an effort was made to exclude any item that favored either sex significantly, on the assumption that such items might reflect purely fortuitous and irrelevant differences in the experiences of the two sexes (McNemar, 1942, Ch. 5). Owing to the limited number of items available for each age level, however, it was not possible to eliminate all sex-differentiating items. In order to rule out sex differences in total score, therefore, the remaining sex-differentiating items were balanced, approximately the same number favoring boys and girls.

No generalization can be made regarding the elimination of sex differences, or any other group differences, in the selection of test items. While certain tests, like the Stanford-Binet, have sought to equalize the performance of the two sexes, others have retained such differences and report separate norms for the two sexes. This practice is relatively common in the case of special aptitude tests, in which fairly large differences in favor of one or the other sex have been consistently found.

Under certain circumstances, moreover, items may be chosen, not to minimize, but to maximize, sex differentiation. An example of the latter procedure is to be found in the masculinity-femininity scales developed for use with several personality inventories (to be discussed in Ch. 17). Since the purpose of these scales is to measure the degree to which an individual's responses agree with those characteristic of men or of women in our culture, only those items that differentiate significantly between the sexes are retained.

A similar diversity of procedure can be found with reference to other group differences in item performance. In the development of a socioeconomic status scale for the Minnesota Multiphasic Personality Inventory, only those items were retained that differentiated significantly between the responses of high school students in two contrasted socioeconomic groups (Gough, 1948). Cross validation of this status scale on a new sample of high school students yielded a correlation of .50 with objective indices of socioeconomic status. The object of this test is to determine the degree to which an individual's emotional and social responses resemble those characteristic of persons in upper or lower socioeconomic levels, respectively. Hence, those items showing the maximum differentiation between social classes were included in the scale, and those showing little or no differentiation were discarded. This procedure is similar to that followed in the development of masculinity-femininity scales. It is apparent that in both types of tests the group differentiation constitutes the criterion in terms of which the test is validated. In such cases, socioeconomic level and sex, respectively, represent the most relevant variables on the basis of which items can be chosen.

Examples of the opposite approach to socioeconomic or cultural differentials in test responses can also be found. An extensive project on such cultural differentials in intelligence test items was conducted at the University of Chicago (Eells et al., 1951). These investigators believed that most intelligence tests might be unfair to children from lower socioeconomic levels, since many of the test items presuppose information, skills, or interests typical of middle-class children. To obtain evidence for such a hypothesis, a detailed item analysis was conducted on eight widely used group intelligence tests. For each item, the frequencies of correct responses by children in higher and lower socioeconomic levels were compared. Following this investigation, two members of the research team prepared a special test designed to be "fair" to lower-class urban American children. In the construction of this test, an effort was made to exclude the types of items previously found to favor middle-class children.6

6 Known as the Davis-Eells Games, this test was subsequently discontinued because it proved unsatisfactory in a number of ways, including low validity in predicting academic achievement and other practical criteria. Moreover, the anticipated advantage of lower-class children on this test did not hold up in other samples.

As in the case of sex differences, no rigid policy can be laid down regarding items that exhibit cultural differentiation. Certain basic facts of test construction and interpretation should, however, be noted. First, whether items that differentiate significantly between certain groups are retained or discarded should depend on the purpose for which the test is designed. If the criteria to be predicted show significant differences between the sexes, socioeconomic groups, or other categories of persons, then it is to be expected that the test items will also exhibit such group
differences. To eliminate items showing these differences might serve only to lower the validity of the test for predicting the given criteria (see Anastasi, 1966). In the second place, tests designed to measure an individual's resemblance to one or another group should obviously magnify the differentiation between such groups. For these tests, items showing the largest group differences in response should be chosen, as in the case of the masculinity-femininity and social status scales cited above.

The third point is of primary concern, not to the test constructor, but to the test user and the general student of psychology who wishes to interpret test results properly. Tests whose items have been selected with reference to the responses of any special groups cannot be used to compare such groups. For example, the statement that boys and girls do not differ significantly in Stanford-Binet IQ provides no information whatever regarding sex differences. Since sex differences were deliberately eliminated in the process of selecting items for the test, their absence from the final scores merely indicates that this aspect of test construction was successfully executed. Similarly, lack of socioeconomic differences on a test constructed so as to eliminate such differences would provide no information on the relative performance of groups varying in socioeconomic status.

Tests designed to maximize group differentiation, such as the masculinity-femininity and social status scales, are equally unsuitable for group comparisons. In these cases, the sex or socioeconomic differentiation in personality characteristics would be artificially magnified. To obtain an unbiased estimate of the existing group differences, the test items must be selected without reference to the responses of such groups. The principal conclusion to be drawn from the present discussion is that proper interpretation of scores on any test requires a knowledge of the basis on which items were selected for that test.
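
The dependence of group comparisons on the basis of item selection can be made concrete with a few lines of arithmetic. The sketch below is an illustration only: the eight items and their pass rates in two groups are invented for the purpose, the cutoffs of .05 and .15 are arbitrary, and the function names are hypothetical. It compares the mean per-item difference in pass rate between the same two groups under three selection policies: no selection with reference to the groups, selection to minimize the group difference, and selection to maximize it.

```python
# Invented pass rates (proportion passing) for eight items in two groups.
# Each entry is (group A pass rate, group B pass rate).
ITEMS = {
    "a": (0.80, 0.60),
    "b": (0.70, 0.68),
    "c": (0.50, 0.52),
    "d": (0.60, 0.40),
    "e": (0.45, 0.55),
    "f": (0.65, 0.64),
    "g": (0.55, 0.30),
    "h": (0.40, 0.42),
}


def mean_item_difference(items):
    """Mean (group A minus group B) pass-rate difference over the items."""
    diffs = [pa - pb for pa, pb in items.values()]
    return sum(diffs) / len(diffs)


def select_minimizing(items, cutoff=0.05):
    """Keep only items whose group difference is at most the cutoff in size."""
    return {k: v for k, v in items.items() if abs(v[0] - v[1]) <= cutoff}


def select_maximizing(items, cutoff=0.15):
    """Keep only items that favor group A by at least the cutoff."""
    return {k: v for k, v in items.items() if v[0] - v[1] >= cutoff}


minimized = select_minimizing(ITEMS)
maximized = select_maximizing(ITEMS)

if __name__ == "__main__":
    print("unselected:", round(mean_item_difference(ITEMS), 4))
    print("minimized :", round(mean_item_difference(minimized), 4))
    print("maximized :", round(mean_item_difference(maximized), 4))
```

With these invented figures the same two groups appear nearly identical on the minimized scale and far more different on the maximized scale than on the unselected pool, although nothing about the groups themselves has changed.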

PART 3
Tests of General Intellectual Level

Individual Tests

validated against relatively broad criteria. They characteristically provide a single score, such as an IQ, indicating the individual's general intellectual level. A typical approach is to arrive at this global estimate of intellectual performance by "the sinking of shafts at critical points" (Terman & Merrill, 1937, p. 4). In other words, a wide variety of tasks is presented to the subject in the expectation that an adequate sampling of all important intellectual functions will thus be covered. In actual practice, the tests are usually overloaded with certain functions, such as verbal ability, and completely omit others.

Because so many intelligence tests are validated against measures of academic achievement, they are often designated as tests of scholastic aptitude. Intelligence tests are frequently employed as preliminary screening instruments, to be followed by tests of special aptitudes. This practice is especially prevalent in the testing of normal adolescents or adults for educational and vocational counseling, personnel selection, and similar purposes. Another common use of general intelligence tests is to be found in clinical testing, especially in the identification and classification of the mentally retarded. For clinical purposes, individual tests are generally employed. Among the most widely used individual intelligence tests are the Stanford-Binet and Wechsler scales discussed in this chapter.

EVOLUTION OF THE SCALES. The original Binet-Simon scales have already been described briefly in Chapter 1. It will be recalled that the 1905 scale consisted simply of 30 short tests, arranged in ascending order of difficulty. The 1908 scale was the first age scale; and the 1911 scale introduced minor improvements and additions. The age range covered by the 1911 revision extended from 3 years to the adult level. Among the many translations and adaptations of the early Binet tests were a number of American revisions, of which the most viable has been the Stanford-Binet.1 The first Stanford revision of the Binet-Simon scales, prepared by Terman and his associates at Stanford University, was published in 1916 (Terman, 1916). This revision introduced so many changes and additions as to represent virtually a new test. Over one third of the items were new, and a number of old items were revised, reallocated to different age levels, or discarded. The entire scale was restandardized on an American sample of approximately one thousand children and four hundred adults. Detailed instructions for administering and scoring each test2 were provided, and the IQ was employed for the first time in any psychological test.

1 A detailed account of the Binet-Simon scales and of the development, use, and clinical interpretation of the Stanford-Binet can be found in Sattler (1974, Chs. [illegible]).
2 The items in the Binet scales are commonly called "tests," since each is separately administered and may contain several parts.

The second Stanford revision, appearing in 1937, consisted of two equivalent forms, L and M (Terman & Merrill, 1937). In this revision, the scale was greatly expanded and completely restandardized on a new and carefully chosen sample of the U.S. population. The 3,184 subjects employed for this purpose included approximately 100 children at each half-year interval from 1½ to 5½ years, 200 at each age from 6 to 14, and 100 at each age from 15 to 18. All subjects were within one month of a birthday (or half-year birthday) at the time of testing, and every age group contained an equal number of boys and girls. From age 6 up, most subjects were tested in school, although a few of the older subjects were obtained outside of school in order to round out the sampling. Preschool children were contacted in a variety of ways, many of them being siblings of the schoolchildren included in the sample. Despite serious efforts to obtain a representative cross-section of the population, the sampling was somewhat higher than the U.S. population in socioeconomic level, contained an excess of urban cases, and included only native-born whites.

A third revision, published in 1960, provided a single form (L-M) incorporating the best items from the two 1937 forms (Terman & Merrill, 1960). Without introducing any new content, it was thus possible to eliminate obsolescent items and to relocate items whose difficulty level had altered during the intervening years owing to cultural changes. In preparing the 1960 Stanford-Binet, the authors were faced with a common dilemma of psychological testing. On the one hand, frequent revisions are desirable in order to profit from technical advances and refinements in test construction and from prior experience in the use of the test, as well as to keep test content up to date. The last-named consideration is especially important for information items and for pictorial material which may be affected by changing fashions in dress, household appliances, cars, and other common articles. The use of obsolete test content may seriously undermine rapport and may alter the difficulty level of items. On the other hand, revision may render much of the accumulated data inapplicable to the new form. Tests that have been widely used for many years have acquired a rich body of interpretive material which should be carefully weighed against the need for revision. It was for these reasons that the authors of the Stanford-Binet chose to condense the two earlier forms into one, thereby steering a course between the twin hazards of obsolescence and discontinuity. The
loss of a parallel form was not too great a price to pay for accomplishing this purpose. By 1960 there was less need for an alternate form than there had been in 1937 when no other well-constructed individual intelligence scale was available.

In the preparation of the 1960 Stanford-Binet, items were selected from forms L and M on the basis of the performance of 4,498 subjects, aged 2½ to 18 years, who had taken either or both forms of the test between 1950 and 1954. The subjects were examined in six states situated in the Northeast, in the Midwest, and on the West Coast. Although these cases did not constitute a representative sampling of American schoolchildren, care was taken to avoid the operation of major selective factors.3 The 1960 Stanford-Binet did not involve a restandardization of the normative scale. The new samples were utilized only to identify changes in item difficulty over the intervening period. Accordingly, the difficulty of each item was redetermined by finding the percentage of children passing it at successive mental ages on the 1937 forms. Thus for purposes of item analysis the children were grouped, not according to their chronological age, but according to the mental age they had obtained on the 1937 forms. Consequently, mental ages and IQ's on the 1960 Form L-M were still expressed in terms of the 1937 normative sample.

3 For special statistical analyses, there were two additional samples of California children, including 100 6-year-olds stratified with regard to father's occupation and 100 15-year-olds stratified with regard to both father's occupation and grade distribution.

The next stage was the 1972 restandardization of Form L-M (Terman & Merrill, 1973, Part 4). This time the test content remained unchanged,4 but the norms were derived from a new sample of approximately 2,100 cases tested during the 1971-1972 academic year. To achieve national representativeness despite the practical impossibility of administering individual tests to very large samples, the test publishers took advantage of a sample of approximately 20,000 children at each age level, employed in the standardization of a group test (Cognitive Abilities Test). This sample of some 200,000 schoolchildren in grades 3 through 12 was chosen from communities stratified in terms of size, geographical region, and economic status, and included black, Mexican-American, and Puerto Rican children.

4 With only two very minor exceptions: the picture on the "doll card" at age II was updated, and the word "charcoal" was permitted as a substitute for "coal" in the Similarities test at age VII.

The children to be tested with the Stanford-Binet were identified through their scores on the verbal battery of the Cognitive Abilities Test, so that the distribution of scores in this subsample corresponded to the national distribution of the entire sample. The only cases excluded were those for whom the primary language spoken in the home was not English. To cover ages 2 to 8, the investigators located siblings of the group-tested children, choosing each child on the basis of the Cognitive Abilities Test score obtained by his or her older sibling. Additional cases at the upper ages were recruited in the same way. The Stanford-Binet sample included approximately 100 cases in each half-year age group from 2 to 5½ years, and 100 at each year group from 6 to 18.

In comparison with the 1937 norms, the 1972 norms are based on a more representative sample, as well as being updated and hence reflecting any effects of intervening cultural changes on test performance. It is interesting to note that the later norms show some improvement in test performance at all ages. The improvement is substantial at the preschool ages, averaging about 10 IQ points. The test authors attribute this improvement to the impact of radio and television on young children and the increasing literacy and educational level of parents, among other cultural changes. There is a smaller but clearly discernible improvement at ages 15 and over which, as the authors suggest, may be associated with the larger proportion of students who continued their education through high school in the 1970s than in the 1930s.

ADMINISTRATION AND SCORING. The materials needed to administer the Stanford-Binet are shown in Figure 29. They include a box of standard toy objects for use at the younger age levels, two booklets of printed cards, a record booklet for recording responses, and a test manual. The tests are grouped into age levels extending from age II to superior adult. Between the ages of II and V, the test proceeds by half-year intervals. Thus, there is a level corresponding to age II, one to age II-6, one to age III, and so forth. Because progress is so rapid during these early ages, it proved feasible and desirable to measure change over six-month intervals. Between V and XIV the age levels correspond to yearly intervals. The remaining levels are designated as Average Adult and Superior Adult levels I, II, and III. Each age level contains six tests, with the exception of the Average Adult level, which contains eight.

The tests within any one age level are of approximately uniform difficulty and are arranged without regard to such residual differences in difficulty as may be present. An alternate test is also provided at each age level. Being of approximately equivalent difficulty, the alternate may be substituted for any of the tests in the level. Alternates are used if one of the regular tests must be omitted because special circumstances make it inappropriate for the individual or because some irregularity interfered with its standardized administration.

Four tests in each year level were selected on the basis of validity and representativeness to constitute an abbreviated scale for use when time
does not permit the administration of the entire scale. These tests are
marked with an asterisk on the record booklets. Comparisons between
full-scale and abbreviated-scale IQ's on a variety of groups show a close
correspondence between the two, the correlations being approximately as
high as the reliability coefficient of the full scale (Himelstein, 1966;
Sattler, 1974, p. 116; Terman & Merrill, 1973, pp. 61-62). The mean IQ,
however, tends to run slightly lower on the short scale. This discrepancy
is also found when the numbers of persons scoring higher on each
version are compared. Over 50 percent of the subjects receive lower IQ's
on the short version, while only 30 percent score higher.

FIG. 29. Test Materials Employed in Administering the Stanford-Binet.
(Courtesy Houghton Mifflin Company.)

In common with most individual intelligence tests, the Stanford-Binet
requires a highly trained examiner. Both administration and scoring are
fairly complicated for many of the tests. Considerable familiarity and
experience with the scale are therefore required for a smooth performance.
Hesitation and fumbling may be ruinous to rapport. Slight inadvertent
changes in wording may alter the difficulty of items. A further
complication is presented by the fact that tests must be scored as they
are administered, since the subsequent conduct of the examination depends
on the child's performance on previously administered levels.

Many clinicians regard the Stanford-Binet not only as a standardized
test, but also as a clinical interview. The very characteristics that make
this scale so difficult to administer also create opportunities for
interaction between examiner and subject, and provide other sources of clues
for the experienced clinician. Even more than most other individual tests,
the Stanford-Binet makes it possible to observe the subject's work methods,
his approach to a problem, and other qualitative aspects of performance.
The examiner may also have an opportunity to judge certain personality
characteristics, such as activity level, self-confidence, persistence,
and ability to concentrate. Any qualitative observations made in the
course of Stanford-Binet administration should, of course, be clearly
recognized as such and ought not to be interpreted in the same manner as
objective test scores. The value of such qualitative observations depends
to a large extent on the skill, experience, and psychological sophistication
of the examiner, as well as on his awareness of the pitfalls and limitations
inherent in this type of observation. The types of clinical observations
that can be made during an individual intelligence examination are
richly illustrated by Moriarty (1960, 1961, 1966), who sees in the testing
session an opportunity to investigate the child's behavior in meeting a
challenging, demanding, difficult, or frustrating situation.

In taking the Stanford-Binet, no one subject tries all items. Each
individual is tested only over a range of age levels suited to his own
intellectual level. Testing usually requires no more than thirty to forty
minutes for younger children and not more than one hour and a half for
older examinees. The standard procedure is to begin testing at a level
slightly below the expected mental age of the examinee. Thus, the first
tests given should be easy enough to arouse confidence, but not so easy
as to cause boredom and annoyance. If the individual fails any test
within the year level first administered, the next lower level is given. This
procedure continues until a level is reached at which all tests are passed.
This level is known as the basal age. Testing is then continued upward
to a level at which all tests are failed, designated as the ceiling age.
When this level is reached, the test is discontinued.

Individual Stanford-Binet items, or tests, are scored on an all-or-none
basis. For each test, the minimal performance that constitutes "passing"
is specified in the manual. For example, in identifying objects by use at
year level II-6, the child passes if he correctly identifies three out of six
designated objects; in answering comprehension questions at year level
VIII, any four correct answers out of six represent a passing performance.
Certain tests appear in identical form at different year levels, but
are scored with a different standard of passing. Such tests are
administered only once, the individual's performance determining the year level
at which they are credited. The vocabulary test, for example, may be
scored anywhere from level VI to Superior Adult III, depending on the
number of words correctly defined.

The items passed and failed by any one individual will show a certain
amount of scatter among adjacent year levels. We do not find that
examinees pass all tests at or below their mental age level and fail all tests
above such a level. Instead, the successfully passed tests are spread over
several year levels, bounded by the subject's basal age at one extreme
and his ceiling age at the other. The subject's mental age on the
Stanford-Binet is found by crediting him with his basal age and adding to
that age further months of credit for every test passed beyond the basal
level. In the half-year levels between II and V, each of the six tests counts
as one month; between VI and XIV each of the six tests corresponds to two
months of credit. Since each of the adult levels (AA, SA I, SA II, and
SA III) covers more than one year of mental age, the months of credit for
each test are adjusted accordingly. For example, the Average Adult
level includes eight tests, each of which is credited with two months; the
Superior Adult I level contains six tests, each receiving four months.

The highest mental age theoretically attainable on the Stanford-Binet
is 22 years and 10 months. Such a score is not, of course, a true mental
age, but a numerical score indicating degree of superiority above the
Average Adult performance. It certainly does not correspond to the
achievement of the average 22-year-old; according to the 1972 norms,
the average 22-year-old obtains a mental age of 16-8. For any adult over
18 years of age, a mental age of 16-8 yields an IQ of 100 on this scale. In
fact, above 13 years, mental ages cease to have the same significance as
they do at lower levels, since it is just beyond 13 that the mean MA
begins to lag behind CA on this scale. The Stanford-Binet is not suitable
for adult testing, especially within the normal and superior range.
Despite the three Superior Adult levels, there is insufficient ceiling for most
superior adults or even for very superior adolescents (Kennedy et al.,
1960). In such cases, it is often impossible to reach a ceiling age level at
which all tests are failed. Moreover, most of the Stanford-Binet tests
have more appeal for children than for adults, the content being of
relatively little interest to most adults.

NORMATIVE INTERPRETATION. A major innovation introduced in the
1960 Stanford-Binet was the substitution of deviation IQ's for the ratio
IQ's used in the earlier forms. These deviation IQ's are standard scores
with a mean of 100 and an SD of 16. As explained in Chapter 4, the
principal advantage of this type of IQ is that it provides comparable
scores at all age levels, thus eliminating the vagaries of ratio IQ's. Despite
the care with which the 1937 scales were developed in the effort to obtain
constant IQ variability at all ages, the SD's of ratio IQ's on these
scales fluctuated from a low of 13 at age VI to a high of 21 at age II-6.
Thus, an IQ of 113 at age VI corresponded to an IQ of 121 at age II-6.
Special correction tables were developed to adjust for the major IQ
variations in the 1937 scales (McNemar, 1942, pp. 172-174). All these
difficulties were circumvented in the 1960 form through the use of
deviation IQ's, which automatically have the same SD throughout the age
range.

As an aid to the examiner, Pinneau prepared tables in which deviation
IQ's can be looked up by entering MA and CA in years and months.
These Pinneau tables are reproduced in the Stanford-Binet manual
(Terman & Merrill, 1973). The latest manual includes both the 1972 and
the 1937 normative IQ tables. For most testing purposes, the 1972 norms
are appropriate, showing how the child's performance compares with
that of others of his own age in his generation. To provide comparability
with IQ's obtained earlier, however, the 1937 norms are more suitable.
They would thus be preferred in a continuing longitudinal study, or in
comparing an individual's IQ with the IQ he obtained on the
Stanford-Binet at a younger age. When used in this way, the 1937 standardization
sample represents a fixed reference group, just as the students taking
the College Board Scholastic Aptitude Test in 1941 provide a fixed
reference group for that test (see Ch. 4).

Although the deviation IQ is the most convenient index for evaluating
an individual's standing in his age group, the MA itself can serve a useful
function. To say that a 6-year-old child performs as well as a typical
8-year-old usually conveys more meaning to a layman than saying he has
an IQ of 137. A knowledge of the child's MA level also facilitates an
understanding of what can be expected of him in terms of educational
achievement and other developmental norms of behavior. It should be
noted, however, that the MA's obtained on the Stanford-Binet are still
expressed in terms of the 1937 norms. It is only the IQ tables that
incorporate the updated 1972 norms. Reference to these tables will show,
for example, that if a child whose CA is 5-0 obtains an MA of 5-0, his
IQ is not 100. To receive an IQ of 100 with the 1972 norms, this child
would need an MA of 5-6.

One of the advantages of the Stanford-Binet derives from the mass of
interpretive data and clinical experience that have been accumulated
regarding this test. For many clinicians, educators, and others concerned
with the evaluation of general ability level, the Stanford-Binet IQ has
become almost synonymous with intelligence. Much has been learned
about what sort of behavior can be expected from a child with an IQ of
50 or 80 or 120 on this test. The distributions of IQ's in the successive
standardization samples (1916, 1937, 1972) have provided a common
frame of reference for the interpretation of IQ's.
Because of the size of the error of measurement of a Stanford-Binet
IQ, it is customary to allow approximately a 10-point band on either
side of the obtained IQ for chance variation. Thus any IQ between 90
and 110 is considered equivalent to the average IQ of 100. IQ's above
110 represent superior deviations, those below 90 inferior deviations.
There is no generally accepted frame of reference for classifying superior
IQ's. It may be noteworthy, however, that in the classical, long-term
investigation of gifted children by Terman and his co-workers a minimum
IQ of 140 was required for inclusion in the principal part of the project
(Terman & Oden, 1959).

At the other end of the scale, a widely used educational classification
of mental retardates recognizes the educable, trainable, and custodial
categories. The educable group, in the IQ range from 50 to 75, can
advance to at least the third grade in academic work, and possibly as
far as the sixth grade, if taught in a specially adapted classroom
situation. The trainable group, with IQ's between 25 and 50, can be
taught self-care and social adjustment in a protected environment. Those
below IQ 25 generally require custodial and nursing care.

In its manual on terminology and classification, the American
Association on Mental Deficiency (AAMD) lists four levels of mental
retardation defined more precisely in terms of SD units. This classification is
given in Table 23, together with the Stanford-Binet IQ ranges
corresponding to each level and the expected percentage of cases. It will be noted
that the classification is based on a division of the lower portion of the
normal distribution curve into steps of 1 SD each, beginning at -2 SD.
The advantage of such a classification is that it can be readily translated
into standard scores or deviation IQ's in any scale. Since the
Stanford-Binet deviation IQ scale has an SD of 16, the mild level, extending from
-2 SD down to -3 SD, ranges from 68 (100 - 2 × 16) to 52 (100 -
3 × 16). The other IQ ranges can be found in a similar manner. The
percentages of cases at each level are those expected in a normal
distribution (see Fig. 6, Ch. 4). They agree quite closely with the percentages
of persons at these IQ levels found empirically in the general
population. The frequency of mental retardation in the general population is
usually estimated as close to 2 percent. The Stanford-Binet manual
contains still another classification of levels of mental retardation, based on
somewhat different IQ limits, which has been widely used as an
interpretive frame of reference by clinical psychologists (Terman &
Merrill, 1973, p. 18).

TABLE 23
Levels of Mental Retardation as Defined in Manual of American
Association on Mental Deficiency
(Data in first two columns from Grossman, 1973, p. 18. Reprinted by
permission of the American Association on Mental Deficiency.)

            Cutoff Points          Range of
            (in SD units           Stanford-Binet IQ     Percentage
Level       from Mean)             (SD = 16)             of Cases
Mild        -2                     68-52                 2.14
Moderate    -3                     51-36                 0.13
Severe      -4                     35-20                 0.003
Profound    -5                     19 and below          0.00003

The use of such classifications of IQ level, although of unquestionable
help in standardizing the interpretation of test performance, carries
certain dangers. Like all classifications of persons, it should not be rigidly
applied, nor used to the exclusion of other data about the individual.
There are, of course, no sharp dividing lines between the "mentally
retarded" and the "normal" or between the "normal" and the "superior."
Individuals with IQ's of 60 have been known to make satisfactory
adjustments to the demands of daily living, while some with IQ's close to
100 may require institutional care.

Decisions regarding institutionalization, parole, discharge, or special
training of mental retardates must take into account not only IQ but also
social maturity, emotional adjustment, physical condition, and other
circumstances of the individual case. The AAMD defines mental
retardation as "significantly subaverage general intellectual functioning existing
concurrently with deficits in adaptive behavior, and manifested during the
developmental period" (Grossman, 1973, p. 11). This definition is further
explicated in the stipulation that a child should not be classified as
mentally retarded unless he is deficient in both intellectual functioning, as
indicated by IQ level, and in adaptive behavior, as measured by such
instruments as the Vineland Social Maturity Scale or the AAMD Adaptive
Behavior Scales (to be discussed in Chapter 10).

Nor is high IQ synonymous with genius. Persons with IQ's of 160 do
occasionally lead undistinguished lives, while some with IQ's much closer
to 100 may make outstanding contributions. High-level achievement in
specific fields may require special talents, originality, persistence,
singleness of purpose, and other propitious emotional and motivational
conditions.

RELIABILITY. The reliability of the 1937 Stanford-Binet was determined
by correlating IQ's on Forms L and M administered to the
standardization group within an interval of one week or less. Such reliability
coefficients are thus measures of both short-term temporal stability and
equivalence of content across the two item samples. An exceptionally
thorough analysis of the reliability of this test was carried out with
reference to age and IQ level of subjects (McNemar, 1942, Ch. 6). In
general, the Stanford-Binet tends to be more reliable for the older than
for the younger ages, and for the lower than for the higher IQ's. Thus,
at ages 2½ to 5½, the reliability coefficients range from .83 (for IQ
140-149) to .91 (for IQ 60-69); for ages 6 to 13, they range from .91 to
.97, respectively, for the same IQ levels; and for ages 14 to 18, the
corresponding range of reliability coefficients extends from .95 to .98.

The increasing reliability of scores with increasing age is characteristic
of tests in general. It results in part from the better control of conditions
that is possible with older subjects (especially in comparison with the
preschool ages). Another factor is the slowing down of developmental
rate with age. When reliability is measured by retesting, individuals who
are undergoing less change are also likely to exhibit less random
fluctuation over short periods of time (Pinneau, 1961, Ch. 5).

The higher reliability obtained with lower IQ levels at any given CA,
on the other hand, appears to be associated with the specific structural
characteristics of the Stanford-Binet. It will be recalled that because of
the difference in number of items available at different age levels, each
item receives a weight of one month at the lowest levels, a weight of two
months at the intermediate levels, and weights of four, five, or six months
at the highest levels. This weighting tends to magnify the error of
measurement at the upper levels, because the chance passing or failure of a
single item makes a larger difference in total score at these levels than it
does at lower levels. Since at any given CA, individuals with higher IQ's
are tested with higher age levels on the scale, their IQ's will have a larger
error of measurement and lower reliability (Pinneau, 1961, Ch. 5). The
relationship between IQ level and reliability of the Stanford-Binet is also
illustrated graphically in Figure 30, showing the bivariate distribution of
IQ's obtained by 7-year-old children on Forms L and M. It will be
observed that the individual entries fall close to the diagonal at lower IQ
levels and spread farther apart at the higher levels. This indicates closer
agreement between L and M IQ's at lower levels and wider discrepancies
between them at upper levels. With such a fan-shaped scatter diagram, a
single correlation coefficient is misleading. For this reason, separate
reliability coefficients have been reported for different portions of the IQ
range.

FIG. 30. Parallel-Form Reliability of the Stanford-Binet: Bivariate
Distribution of IQ's Obtained by Seven-Year-Old Children on Forms L and M.
[Scatter diagram; horizontal axis: IQ on Form M; vertical axis: IQ on
Form L, in 5-point intervals from 40-44 to 145-149.]
(From Terman & Merrill, 1973, p. 45. Reproduced by permission of Houghton
Mifflin Company.)

On the whole, the data indicate that the Stanford-Binet is a highly
reliable test, most of the reported reliability coefficients for the various age
and IQ levels being over .90. Such high reliability coefficients were
obtained despite the fact that they were computed separately within each
age group. It will be recalled in this connection that all subjects in the
standardization sample were tested within a month of a birthday or
half-year birthday. This narrowly restricted age range would tend to produce
lower reliability coefficients than found for most tests, which employ
more heterogeneous samples. Translated in terms of individual IQ's, a
reliability coefficient of .90 and an SD of 16 give an error of measurement
of approximately 5 IQ points (see Ch. 5). In other words, the chances
are about 2:1 that a child's true Stanford-Binet IQ differs by 5 points or
less from the IQ obtained in a single testing, and the chances are 95 out
of 100 that it varies by no more than 10 points (5 × 1.96 = 9.8). Re-
flecting the same differences found in the reliability coefficients, these
errors of measurement will be somewhat higher for younger than for
older children, and somewhat higher for brighter than for duller
individuals.

VALIDITY. Some information bearing on the content validity of the
Stanford-Binet is provided by an examination of the tasks to be
performed by the examinee in the various tests. These tasks run the gamut
from simple manipulation to abstract reasoning. At the earliest age levels,
the tests require chiefly eye-hand coordination, perceptual discrimination,
and ability to follow directions, as in block building, stringing beads,
comparing lengths, and matching geometric forms. A relatively large
number of tests at the lower levels also involve the identification of
common objects presented in toy models or in pictures.

Several tests occurring over a wide age range call for practical
judgment or common sense. For example, the child is asked, "What should
you do if you found on the streets of a city a three-year-old baby that was
lost from its parents?" In other tests the examinee is asked to explain why
certain practices are commonly followed or certain objects are employed
in daily living. A number of tests calling for the interpretation of
pictorially or verbally presented situations, or the detection of absurdities in
either pictures or brief stories, also seem to fall into this category.

Memory tests are found throughout the scale and utilize a wide variety
of materials. The individual is required to recall or recognize objects,
pictures, geometric designs, bead patterns, digits, sentences, and the
content of passages. Several tests of spatial orientation occur at widely
scattered levels. These include maze-tracing, paper-folding, paper-cutting,
rearrangement of geometric figures, and directional orientation.
Skills acquired in school, such as reading and arithmetic, are required for
successful performance at the upper year levels.

The most common type of test, especially at the upper age levels, is
that employing verbal content. In this category are to be found such
well-known tests as vocabulary, analogies, sentence completion, disarranged
sentences, defining abstract terms, and interpreting proverbs. Some stress
verbal fluency, as in naming unrelated words as rapidly as possible,
giving rhymes, or building sentences containing three given words. It
should also be noted that many of the tests that are not predominantly
verbal in content nevertheless require the understanding of fairly
complex verbal instructions. That the scale as a whole is heavily weighted
with verbal ability is indicated by the correlations obtained between the
45-word vocabulary test and mental ages on the entire scale. These
correlations were found to be .71, .83, .86, and .83 for groups of examinees
aged 8, 11, 14, and 18 years, respectively (McNemar, 1942, pp. 139-140;
see also A. J. Edwards, 1963).5 The correlations are at least as high as
those normally found between tests designed to measure the same
functions, and they fall within the range of common reliability coefficients.
Insofar as all the functions listed are relevant to what is commonly
regarded as intelligence, the scale may be said to have content validity.
The preponderance of verbal content at the upper levels is defended by
the test authors on theoretical grounds. Thus, they write:

At these levels the major intellectual differences between subjects reduce
largely to differences in the ability to do conceptual thinking, and facility in
dealing with concepts is most readily sampled by the use of verbal tests.
Language, essentially, is the shorthand of the higher thought processes, and
the level at which this shorthand functions is one of the most important
determinants of the level of the processes themselves (Terman & Merrill, 1937).

It should be added that clinical psychologists have developed several
schemes for classifying Stanford-Binet tests, as aids in the qualitative
description of the individual's test performance (see Sattler, 1974, Ch.
10). Pattern analyses of the examinee's successes and failures in different
functions may provide helpful cues for further clinical exploration. The
results of such analyses, however, should be regarded as tentative and
interpreted with caution. Most functions are represented by too few
tests to permit reliable measurement; and the coverage of any one
function varies widely from one year level to another.

Data on the criterion-related validity of the Stanford-Binet, both
concurrent and predictive, have been obtained chiefly in terms of academic
achievement. Since the publication of the original 1916 Scale, many
correlations have been computed between Stanford-Binet IQ and school
grades, teachers' ratings, and achievement test scores. Most of these
correlations fall between .40 and .75. School progress was likewise found to
be related to Stanford-Binet IQ. Children who were accelerated by one or
more grades averaged considerably higher in IQ than those at normal
age-grade location, and children who were retarded by one or more grades
averaged considerably below (McNemar, 1942, Ch. 3).

Like most intelligence tests, the Stanford-Binet correlates highly with
performance in nearly all academic courses, but its correlations are
highest with the predominantly verbal courses, such as English and history.
Correlations with achievement test scores show the same pattern. In a
study of high school sophomores, for example, Form L IQ's correlated
.73 with Reading Comprehension scores, .54 with Biology scores, and .48
with Geometry scores (Bond, 1940). Correlations in the .50's and .60's
have been found with college grades. Among college students, both
selective factors and insufficient test ceiling frequently lower the
correlations.

There have been relatively few validation studies with the 1960 Form
L-M (see Himelstein, 1966). Kennedy, Van de Riet, and White (1963)
report a correlation of .69 with total score on the California Achievement
Test in a large sample of black elementary school children. Correlations
with scores on separate parts of the same battery were: Reading, .68;
Arithmetic, .64; and Language, .70.

In interpreting the IQ, it should be borne in mind that the
Stanford-Binet, like most so-called intelligence tests, is largely a measure of
scholastic aptitude and that it is heavily loaded with verbal functions,
especially at the upper levels. Individuals with a language handicap, as
well as those whose strongest abilities lie along nonverbal lines, will thus
score relatively low on such a test. Similarly, there are undoubtedly a
number of fields in which scholastic aptitude and verbal comprehension
are not of primary importance. Obviously, to apply any test to situations
for which it is inappropriate will only reduce its effectiveness. Because of
the common identification of Stanford-Binet IQ with the very concept
of intelligence, there has been a tendency to expect too much from this
one test.

Data on the construct validity of the Stanford-Binet come from many
sources. Continuity in the functions measured in the 1916, 1937, and
1960 scales was ensured by retaining in each version only those items that
correlated satisfactorily with mental age on the preceding form. Hence,
the information that clinicians have accumulated over the years regarding
typical behavior of individuals at different MA and IQ levels can be
utilized in their interpretation of scores on this scale.

Age differentiation represents the major criterion in the selection of
Stanford-Binet items. Thus, there is assurance that the Stanford-Binet
measures abilities that increase with age during childhood and
adolescence in our culture. In each form, internal consistency was a further
criterion for item selection. That there is a good deal of functional
homogeneity in the Stanford-Binet, despite the apparent variety of
content, is indicated by a mean item-scale correlation of .66 for the 1960
revision. The predominance of verbal functions in the scale is shown by
the higher correlation of verbal than nonverbal items with performance
on the total scale (Terman & Merrill, 1973, pp. 33-34).

Further data pertaining to construct validity are provided by several
independent factor analyses of Stanford-Binet items. If IQ's are to be
comparable at different ages, the scale should have approximately the
same factorial composition at all age levels. For an unambiguous
interpretation of IQ's, moreover, the scale should be highly saturated with a
single common factor. The latter point has already been discussed in
connection with homogeneity in Chapter 5. If the scores were heavily
weighted with two group factors, such as verbal and numerical aptitudes,
an IQ of, let us say, 115 obtained by different persons might indicate
high verbal ability in one case and high numerical ability in the other.

McNemar (1942, Ch. 9) conducted separate factorial analyses of
Stanford-Binet items at 14 age levels, including half-year groups from 2
to 5 and year groups at ages 6, 7, 9, 11, 13, 15, and 18. The number of
subjects employed in each analysis varied from 99 to 200, and the
number of items ranged from 19 to 35. In each of these analyses,
tetrachoric correlations were computed between the items, and the resulting
correlations were factor analyzed. By including items from adjacent year
levels in more than one analysis, some evidence was obtained regarding
the identity of the common factor at different ages. The factor loadings
of tests that recur at several age levels provided further data on this
point. In general, the results of these analyses indicated that performance
on Stanford-Binet items is largely explicable in terms of a single common
factor. Evidence of additional group factors was found at a few age
levels, but the contribution of these factors was small. It was likewise
demonstrated that the common factor found at adjacent age levels was
essentially the same, although this conclusion may not apply to more
widely separated age levels. In fact, there was some evidence to suggest
that the common factor becomes increasingly verbal as the higher ages
are approached. The common factor loading of the vocabulary test, for
example, rose from .59 at age 6 to .91 at age 18.

Other factor-analytic studies of both the 1937 and the 1960 forms have
used statistical techniques designed to bring out more fully the operation
of group factors (L. V. Jones, 1949, 1954; Ramsey & Vane, 1970; Sattler,
1974, Ch. 10; Stott & Ball, 1965). Among the factors thus identified were
several verbal, memory, reasoning, spatial visualization, and perceptual
abilities. In general, the results suggest that there is much in common in
the scale as a whole, a characteristic that is largely built into the
Stanford-Binet by selecting items that have high correlations with total scores.
At the same time, performance is also influenced by a number of special
abilities whose composition varies with the age level tested.

The rest of this chapter is concerned with the intelligence scales
prepared by David Wechsler. Although administered as individual tests and
designed for many of the same uses as the Stanford-Binet, these scales
differ in several important ways from the earlier test. Rather than being
organized into age levels, all items of a given type are grouped into subtests and arranged in increasing order of difficulty within each subtest. In this respect the Wechsler scales follow the pattern established for group tests, rather than that of the Stanford-Binet. Another characteristic feature of these scales is the inclusion of verbal and performance subtests, from which separate verbal and performance IQ's are computed.

Besides their use as measures of general intelligence, the Wechsler scales have been investigated as a possible aid in psychiatric diagnosis. Beginning with the observation that brain damage, psychotic deterioration, and emotional difficulties may affect some intellectual functions more than others, Wechsler and other clinical psychologists argued that an analysis of the individual's relative performance on different subtests should reveal specific psychiatric disorders. The problems and results pertaining to such a profile analysis of the Wechsler scales will be analyzed in Chapter 16, as an example of the clinical use of tests.

The interest aroused by the Wechsler scales and the extent of their use is attested by some 2,000 publications appearing to date about these scales. In addition to the usual test reviews in the Mental Measurements Yearbooks, research pertaining to the Wechsler scales has been surveyed periodically in journals (Guertin et al., 1956, 1962, 1966, 1971; Littell, 1960; Rabin & Guertin, 1951; Zimmerman & Woo-Sam, 1972) and has been summarized in several books (Glasser & Zimmerman, 1967; Matarazzo, 1972; Wechsler, 1958; Zimmerman, Woo-Sam, & Glasser, 1973).

ANTECEDENTS OF THE WAIS. The first form of the Wechsler scales, known as the Wechsler-Bellevue Intelligence Scale, was published in 1939. One of the primary objectives in its preparation was to provide an intelligence test suitable for adults. In first presenting this scale, Wechsler pointed out that previously available intelligence tests had been designed primarily for schoolchildren and had been adapted for adult use by adding more difficult items of the same kinds. The content of such tests was often of little interest to adults. Unless the test items have a certain minimum of face validity, rapport cannot be properly established with adult examinees. Many intelligence test items, written with special reference to the daily activities of the schoolchild, clearly lack face validity for most adults. As Wechsler (1939, p. 17) expressed it, "Asking the ordinary housewife to furnish you with a rhyme to the words, 'day,' 'cat,' and 'mill,' or an ex-army sergeant to give you a sentence with the words, 'boy,' 'river,' 'ball,' is not particularly apt to evoke either interest or respect."

The overemphasis on speed in most tests also tends to handicap the older person. Similarly, Wechsler believed that relatively routine manipulation of words received undue weight in the traditional intelligence test. He likewise called attention to the inapplicability of mental age norms to adults, and pointed out that few adults had previously been included in the standardization samples for individual intelligence tests.

It was to meet these various objections that the original Wechsler-Bellevue was developed. In form and content, this scale was closely similar to the more recent Wechsler Adult Intelligence Scale (WAIS), which has now supplanted it. The earlier scale had a number of technical deficiencies, particularly with regard to size and representativeness of normative sample and reliability of subtests, which were largely corrected in the later revision.

DESCRIPTION. Published in 1955, the WAIS comprises eleven subtests. Six subtests are grouped into a Verbal Scale and five into a Performance Scale. These subtests are listed and briefly described below in the order of their administration.

VERBAL SCALE

1. Information: 29 questions covering a wide variety of information that adults have presumably had an opportunity to acquire in our culture. An effort was made to avoid specialized or academic knowledge. It might be added that questions of general information have been used for a long time in informal psychiatric examinations to establish the individual's intellectual level and his practical orientation.
2. Comprehension: 14 items, in each of which the examinee explains what should be done under certain circumstances, why certain practices are followed, the meaning of proverbs, etc. Designed to measure practical judgment and common sense, this test is similar to the Stanford-Binet Comprehension items; but its specific content was chosen so as to be more consonant with the interests and activities of adults.
3. Arithmetic: 14 problems similar to those encountered in elementary school arithmetic. Each problem is orally presented and is to be solved without the use of paper and pencil.
4. Similarities: 13 items requiring the subject to say in what way two things are alike.
5. Digit Span: Orally presented lists of three to nine digits are to be orally reproduced. In the second part, the examinee must reproduce lists of two to eight digits backwards.
6. Vocabulary: 40 words of increasing difficulty are presented both orally and visually. The examinee is asked what each word means.

PERFORMANCE SCALE

7. Digit Symbol: This is a version of the familiar code-substitution test which has often been included in nonlanguage intelligence scales. The
key contains 9 symbols paired with the 9 digits. With this key before him, the examinee has 1½ minutes to fill in as many symbols as he can under the numbers on the answer sheet.
8. Picture Completion: 21 cards, each containing a picture from which some part is missing. Examinee must tell what is missing from each picture.
9. Block Design: This subtest uses a set of cards containing designs in red and white and a set of identical one-inch blocks whose sides are painted red, white, and red-and-white. The examinee is shown one design at a time, which he must reproduce by choosing and assembling the proper blocks (see Fig. 31).
10. Picture Arrangement: Each item consists of a set of cards containing pictures to be rearranged in the proper sequence so as to tell a story. Figure 32 shows one set of cards in the order in which they are presented to the examinee. This set shows the easiest of 8 items making up the test.
11. Object Assembly: In each of the four parts of this subtest, cutouts are to be assembled to make a flat picture of a familiar object.

Both speed and correctness of performance are taken into account in scoring Arithmetic, Digit Symbol, Block Design, Picture Arrangement, and Object Assembly.

FIG. 31. The Block Design Test of the Wechsler Adult Intelligence Scale.
(Courtesy The Psychological Corporation.)

FIG. 32. Easy Item from the WAIS Picture Arrangement Test.
(Reproduced by permission. Copyright © 1955, The Psychological Corporation, New York, N.Y. All rights reserved.)

Since the publication of the original Wechsler-Bellevue scale, a large number of abbreviated scales have been proposed. These scales are formed simply by omitting some of the subtests and prorating scores to obtain a Full Scale IQ comparable to the published norms. The fact that several subtest combinations, while effecting considerable saving in time, correlate over .90 with Full Scale IQ's has encouraged the development and use of abbreviated scales for rapid screening purposes. Extensive research has been conducted to identify the most effective combinations of two, three, four, and five subtests in predicting Verbal, Performance, and Full Scale IQ's (Doppelt, 1956; Levy, 1968; Maxwell, 1957; McNemar, 1950; Silverstein, 1970, 1971; Tellegen & Briggs, 1967). A comparative analysis of a single four-subtest combination at different age levels from 18-19 to 75 and over yielded correlations of .95 to .97 with Full Scale IQ's (Doppelt, 1956). Equally close correspondences have been found in several studies of abbreviated scales formed by reducing the number of items within subtests (see Guertin et al., 1966, pp. 388-389; Matarazzo, 1972, pp. 252-255). Much of this research has utilized the WAIS standardization data; but similar studies have been conducted on mental retardates and psychiatric patients (see Matarazzo, 1972, p. 252).

Although an excessive amount of energy seems to have been expended in assembling and checking short forms of the Wechsler scales, it is probably inadvisable to use such abbreviated versions except as rough screening devices. Many of the qualitative observations made possible by the administration of an individual scale are lost when abbreviated scales are used. Moreover, the assumption that the original Full Scale norms are applicable to prorated total scores on short scales may not always be justified.
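
The prorating step used in forming abbreviated scales can be illustrated with a minimal Python sketch. It assumes simple proportional prorating: the sum of scaled scores on the subtests actually given is scaled up to the eleven subtests of the full scale. The function name and the example scores are hypothetical, and the conversion of the prorated sum to an IQ still requires the tables in the manual, which are not reproduced here.

```python
def prorate_scaled_sum(scaled_scores, n_full=11):
    """Estimate the full-scale sum of scaled scores from a short form.

    scaled_scores: scaled scores (mean 10, SD 3) on the subtests given.
    n_full: number of subtests in the full scale (11, as described above).
    Assumption: plain proportional prorating, not a published procedure.
    """
    if not scaled_scores:
        raise ValueError("at least one subtest score is required")
    return sum(scaled_scores) * n_full / len(scaled_scores)


# Hypothetical four-subtest short form.
short_form_scores = [12, 10, 11, 9]
estimated_sum = prorate_scaled_sum(short_form_scores)
print(estimated_sum)
```

The estimate treats the omitted subtests as if the examinee had performed on them at his average level on the administered ones, which is exactly the assumption questioned above.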
NORMS. The WAIS standardization sample was chosen with exceptional care to insure its representativeness. The principal normative sample consisted of 1,700 cases, including an equal number of men and women distributed over seven age levels between 16 and 64 years. Subjects were selected so as to match as closely as possible the proportions in the 1950 United States census with regard to geographical region, urban-rural residence, race (white versus nonwhite), occupational level, and education. At each age level, one man and one woman from an institution for mental retardates were also included. Supplementary norms for older persons were established by testing an "old-age sample" of 475 persons, aged 60 years and over, in a typical midwestern city (Doppelt & Wallace, 1955).

It is admittedly difficult to obtain a representative sample of the population over 60. Although the WAIS sample is probably more nearly representative than any other elderly sample tested prior to that time, there is evidence to suggest that significant regional differences occur in the relative magnitude of Verbal and Performance scores at these age levels (Eisdorfer & Cohen, 1961). Furthermore, the current applicability of norms gathered prior to 1955 is questionable, in view of the rapidly rising educational and cultural level of the population. Especially important in this connection is the rechecking of the age decrement among older persons.

Raw scores on each WAIS subtest are transmuted into standard scores with a mean of 10 and an SD of 3. These scaled scores were derived from a reference group of 500 cases which included all persons between the ages of 20 and 34 in the standardization sample. All subtest scores are thus expressed in comparable units and in terms of a fixed reference group. Verbal, Performance, and Full Scale scores are found by adding the scaled scores on the six Verbal subtests, the five Performance subtests, and all eleven subtests, respectively. By reference to appropriate tables provided in the manual, these three scores can be expressed as deviation IQ's with a mean of 100 and an SD of 15. Such IQ's, however, are found with reference to the individual's own age group. They therefore show the individual's standing in comparison with persons of his own age level.

In the interpretation of WAIS IQ's, the relative magnitude of IQ's obtained on the Wechsler scales and on other intelligence tests should also be taken into account. It has been repeatedly found that brighter subjects tend to score higher on the Stanford-Binet than on the Wechsler scales, while duller subjects score higher on the Wechsler than on the Stanford-Binet. For example, studies of college freshmen show significantly higher mean IQ's on the Stanford-Binet than on the Wechsler, while the reverse is generally found among the mentally retarded. To some extent, the difference in standard deviation of Wechsler and Stanford-Binet IQ's may account for the differences between the IQ's obtained with the two scales. It will be recalled that the SD of the Stanford-Binet IQ is 16, while that of the Wechsler IQ is 15. The discrepancies in individual IQ's, however, are larger than would be expected on the basis of such a difference. Another difference between the two scales is that the Wechsler has less floor and ceiling than the Stanford-Binet and hence does not discriminate as well at the extremes of the IQ range.

The relationship between Stanford-Binet and Wechsler IQ's depends not only on IQ level, but also on age. Other things being equal, older subjects tend to obtain higher IQ's on the Wechsler than on the Stanford-Binet, while the reverse is true of younger subjects. One explanation for such a trend is obviously provided by the use of a declining standard in the computation of the Wechsler IQ's of older persons. On the Stanford-Binet, on the other hand, all adults are evaluated in terms of the average peak age on that scale, viz., 18 years. It is also possible that, since the Stanford-Binet was standardized primarily on children and the Wechsler on adults, the content of the former tends to favor children while that of the latter favors older persons.

RELIABILITY. For each of the eleven subtests, as well as for Verbal, Performance, and Full Scale IQ's, reliability coefficients were computed within the 18-19, 25-34, and 45-54 year samples. These three groups were chosen as being representative of the age range covered by the standardization sample. Odd-even reliability coefficients (corrected for full test length by the Spearman-Brown formula) were employed for every subtest except Digit Span and Digit Symbol. The reliability of Digit Span was estimated from the correlation between Digits Forward and Digits Backward scores. No split-half technique could be utilized with Digit Symbol, which is a highly speeded test. The reliability of this test was therefore determined by parallel-form procedures in a group specially tested with WAIS and Wechsler-Bellevue Digit Symbol subtests.

Full Scale IQ's yielded reliability coefficients of .97 in all three age samples. Verbal IQ's had identical reliabilities of .96 in the three groups, and Performance IQ's had reliabilities of .93 and .94. All three IQ's are thus highly reliable in terms of internal consistency. As might be expected, the individual subtests yield lower reliabilities, ranging from a few coefficients in the .60's found with Digit Span, Picture Arrangement, and Object Assembly, to coefficients as high as .96 for Vocabulary. It is particularly important to consider these subtest reliabilities when evaluating the significance of differences between subtest scores obtained by the same individual, as in profile analysis.
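
The odd-even procedure with Spearman-Brown correction mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item scores are invented (they are not WAIS data), the helper names are hypothetical, and the correction shown is the standard Spearman-Brown formula for a test lengthened by a given factor, r_full = k r / (1 + (k - 1) r), with k = 2 for two half-tests.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half, length_factor=2.0):
    """Estimate the reliability of a test lengthened by length_factor."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item totals, then correct to full length.

    item_scores: one row per examinee, one column per item (1 = pass, 0 = fail).
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Invented pass/fail records for five examinees on a six-item subtest.
demo_items = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
print(round(odd_even_reliability(demo_items), 2))
```

Because the two halves are taken in a single sitting, a coefficient of this kind reflects internal consistency only, which is why it is unsuitable for a highly speeded test such as Digit Symbol.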
The WAIS manual also reports standard errors of measurement for the three IQ's and for subtest scores. For Verbal IQ, such errors were 3 points in each group, for Performance IQ, just under 4 points, and for Full Scale IQ, 2.60. We could thus conclude, for example, that the chances are roughly 2:1 that an individual's true Verbal IQ falls within 3 points of his obtained Verbal IQ. The above values compare favorably with the 5-point error of measurement found for the Stanford-Binet. It should be remembered, however, that the Stanford-Binet reliabilities were based on parallel forms administered over intervals of one week or less; under such conditions we would anticipate somewhat lower reliability coefficients and greater fluctuation of scores.

VALIDITY. Any discussion of validity of the WAIS must draw on research done with the earlier Wechsler-Bellevue as well. Since all changes introduced in the WAIS represent improvements over the Wechsler-Bellevue (in reliability, ceiling, normative sample, etc.) and since the nature of the test has remained substantially the same, it is reasonable to suppose that validity data obtained on the Wechsler-Bellevue will underestimate rather than overestimate the validity of the WAIS.

The WAIS manual itself contains no validity data, but several aspects of validity are covered in the subsequent books by Wechsler (1958) and by Matarazzo (1972). Wechsler (1958, Ch. 5) argues that the psychological functions tapped by each of the 11 chosen subtests fit the definition of intelligence, that similar tests have been successfully employed in previously developed intelligence scales, and that such tests have proved their worth in clinical experience. The test author himself places the major emphasis on this approach to validity. The treatment is essentially in terms of content validity, although it has overtones of construct validity without supporting data. Much of the discussion in Matarazzo's book is of the same nature, dealing with the construct of global intelligence, but having only a tenuous relation to the evaluation of the WAIS as a measuring instrument.

Some empirical data on concurrent criterion-related validity are summarized in the two books (Matarazzo, 1972, p. 284; Wechsler, 1958, Ch. 14). These data include mean IQ differences among various educational and occupational groups, as well as a few correlations with job-performance ratings and academic grades. Most group differences, though small, are in the expected directions. Persons in white-collar jobs of different kinds and levels averaged higher in Verbal than in Performance IQ, but skilled workers averaged higher in Performance than in Verbal. In studies of industrial executives and psychiatric residents, Verbal IQ correlated in the .30's with overall performance ratings. Both groups, of course, were already selected in terms of the abilities measured by these tests. Correlations in the .40's and .50's have been found between Verbal IQ and college or engineering school grades. In all these groups, the Verbal Scale yielded somewhat higher correlations than the Full Scale; correlations with the Performance Scale were much lower. Even the correlations with the Verbal Scale, however, were not appreciably higher than those obtained with the Stanford-Binet and with well-known group tests. In studies of mental retardates, WAIS IQ's have proved to be satisfactory predictors of institutional release rate and subsequent work adjustment (see Guertin et al., 1966).

The Wechsler scales have been repeatedly correlated with the Stanford-Binet as well as with other well-known tests of intelligence (Guertin et al., 1971; Matarazzo, 1972; Wechsler, 1958). Correlations with the Stanford-Binet in unselected adolescent or adult groups and among mental retardates cluster around .80. Within more homogeneous samples, such as college students, the correlations tend to be considerably lower. Group tests yield somewhat lower correlations with the Wechsler scales, although such correlations vary widely as a function of the particular test and the nature and heterogeneity of the sample. For both Stanford-Binet and group scales, correlations are nearly always higher with the Wechsler Verbal Scale than with the Full Scale, while correlations with the Performance Scale are much lower than either. On the other hand, Performance IQ's correlate more highly than Verbal IQ's with tests of spatial abilities. For example, a correlation of .72 was found between Performance IQ and the Minnesota Paper Form Board Test in a group of 16-year-old boys and girls (Janke & Havighurst, 1945). In other studies, Performance IQ's correlated .70 with Raven's Progressive Matrices (Hall, 1957) and .30 with the Bennett Mechanical Comprehension Test (Wechsler, 1958, p. 228).

Of some relevance to the construct validity of the Wechsler scales are the intercorrelations of subtests and of Verbal and Performance IQ's, as well as factorial analyses of the scales. In the process of standardizing the WAIS, intercorrelations of Verbal and Performance Scales and of the 11 subtests were computed on the same three age groups on which reliability coefficients had been found, namely, 18-19, 25-34, and 45-54. Verbal and Performance Scale scores correlated .77, .77, and .81, respectively, in these three groups. Intercorrelations of separate subtests were also similar in the three age groups, running higher among Verbal than among Performance subtests. Correlations between Verbal and Performance subtests, although still lower on the whole, were substantial. For example, in the 25-34 year group, correlations among Verbal subtests ranged from .40 to .81, among Performance subtests from .44 to .62, and between Performance and Verbal subtests from .30 to .67. Both individual subtest
correlations and correlations between total Verbal and Performance Scale scores suggest that the two scales have much in common and that the allocation of tests to one or the other scale may be somewhat arbitrary.

Factorial analyses of the Wechsler scales have been conducted with a variety of subjects ranging from eighth-grade pupils to the old-age standardization sample (aged 60-75+) and including both normal and abnormal groups. They have also employed different statistical procedures and have approached the analysis from different points of view. Some have been directly concerned with age changes in the factorial organization of the Wechsler subtests, but the findings of different investigators are inconsistent in this regard. As an example, we may examine the factorial analyses of the WAIS conducted by J. Cohen (1957a, 1957b) with the intercorrelations of subtests obtained on four age groups in the standardization sample (18-19, 25-34, 45-54, and 60-75+). The major results of this study are in line with those of other investigations using comparable procedures, as well as with the findings of later studies by Cohen and his associates on different populations (see Guertin et al., 1962, 1966).

That all 11 subtests have much in common was demonstrated in Cohen's study by the presence of a single general factor that accounted for about 50 percent of the total variance of the battery. In addition, three major group factors were identified. One was a verbal comprehension factor, with large weights in the Vocabulary, Information, Comprehension, and Similarities subtests. A perceptual organization factor was found chiefly in Block Design and Object Assembly. This factor may actually represent a combination of the perceptual speed and spatial visualization factors repeatedly found in factorial analyses of aptitude tests. The results of an earlier investigation by P. C. Davis (1956), in which "reference tests" measuring various factors were included with the Wechsler subtests, support this composite interpretation of the perceptual organization factor.

The third major group factor identified by Cohen was described as a memory factor. Found principally in Arithmetic and Digit Span, it apparently includes both immediate rote memory for new material and recall of previously learned material. Ability to concentrate and to resist distraction may be involved in this factor. Of special interest is the finding that the memory factor increased sharply in prominence in the old-age sample. At that age level it had significant loadings, not only in Arithmetic and Digit Span, but also in Vocabulary, Information, Comprehension, and Digit Symbol. Cohen points out that during senescence memory begins to deteriorate at different ages and rates in different persons. Individual differences in memory thus come to play a more prominent part in intellectual functioning than had been true at earlier ages. Many of the WAIS subtests require memory at all ages. Until differential deterioration sets in, however, individual differences in the retentive ability required in most of the subtests are insignificant.

WECHSLER INTELLIGENCE SCALE FOR CHILDREN

FIG. 33. Materials Used with the Wechsler Intelligence Scale for Children-Revised.
(Courtesy The Psychological Corporation.)

DESCRIPTION. The Wechsler Intelligence Scale for Children (WISC) was first prepared as a downward extension of the original Wechsler-Bellevue (Seashore, Wesman, & Doppelt, 1950). Many items were taken directly from the adult test; and easier items of the same types were added to each test. A revised edition, WISC-R, was published in 1974. The WISC-R consists of twelve subtests, of which two are used only as alternates or as supplementary tests if time permits. The materials used in administering the WISC-R are pictured in Figure 33; and Figure 34 shows a child working on one of the easier items of the Object Assembly
subtest. As in the other Wechsler scales, the subtests are grouped into a Verbal and a Performance scale as follows:

VERBAL SCALE            PERFORMANCE SCALE
1. Information          2. Picture Completion
3. Similarities         4. Picture Arrangement
5. Arithmetic           6. Block Design
7. Vocabulary           8. Object Assembly
9. Comprehension        10. Coding (or Mazes)
(Digit Span)

The numbers correspond to the order in which the subtests are administered. Unlike the procedure followed in the WAIS and the earlier WISC, the Verbal and Performance subtests in the WISC-R are administered in alternating order. The Mazes subtest, which requires more time, may be substituted for Coding if the examiner so prefers. Any other substitution, including the substitution of Mazes for any other subtest and the substitution of Digit Span for any of the Verbal subtests, should be made only if one of the regular subtests must be omitted because of special handicaps or accidental disruption of testing procedure. The supplementary tests may always be administered in addition to the regular battery and their inclusion is advised because of the qualitative and diagnostic information provided. In such cases, however, their scores are not used in finding IQ's.

With regard to content, the only subtest that does not appear in the adult scale is Mazes. This test consists of nine paper-and-pencil mazes of increasing difficulty, to be completed within designated time limits and scored in terms of errors. The Coding subtest corresponds to the Digit Symbol subtest of the WAIS, with an easier part added. The remaining subtests represent downward extensions of the adult tests. The development of the WISC was somewhat paradoxical, since Wechsler embarked upon his original enterprise partly because of the need for an adult scale that would not be a mere upward extension of available children's scales. The first edition of the WISC was, in fact, criticized because its content was not sufficiently child-oriented.

In the revised edition (WISC-R), special efforts were made to replace or modify adult-oriented items so as to bring their content closer to common childhood experiences. In the Arithmetic subtest, for instance, "cigars" was changed to "candy bars" and items about a taxi and a card game were replaced. Other changes included the elimination of items that might be differentially familiar to particular groups of children and the inclusion of more female and black subjects in the pictorial content of the subtests. Several of the subtests were lengthened in order to increase reliability; and improvements were introduced in administration and scoring procedures.

FIG. 34. The Object Assembly Test of the Wechsler Intelligence Scale for Children-Revised.
(Courtesy The Psychological Corporation.)

As in the case of the WAIS, there has been considerable experimentation with abbreviated scales of the WISC. The correlations of these short forms with Full Scale IQ's run lower than in the WAIS. With scales consisting of five or six subtests from both Verbal and Performance sets, correlations in the .80's have been found with Full Scale IQ's. It should be noted, however, that these data were obtained with the earlier form of WISC. With the increased length and improved reliability of the WISC-R subtests, the correlations of abbreviated scales with Full Scale IQ's will undoubtedly be higher. Using the WISC standardization data and a procedure that takes subtest reliability into account, Silverstein (1970) identified the ten best combinations of two, three, four, and five WISC subtests. A widely used two-test combination consists of Vocabulary and Block Design. The same cautions mentioned regarding the use of WAIS short forms can be repeated here.

NORMS. The treatment of scores on the WISC-R follows the same procedures used in the adult scale, with minor differences. Raw scores on
each subtest are first transmuted into normalized standard scores within the child's own age group. Tables of such scaled scores are provided for every four-month interval between the ages of 6-0 and 16-11 years. As in the adult scales, the subtest scaled scores are expressed in terms of a distribution with a mean of 10 and an SD of 3 points. The scaled subtest scores are added and converted into a deviation IQ with a mean of 100 and an SD of 15. Verbal, Performance, and Full Scale IQ's can be found by the same method.

Although a mental age is not needed to compute a deviation IQ, the WISC-R provides data for interpreting performance on individual subtests in terms of age norms. For each subtest, the manual gives the mean raw score found in the standardization sample for each age from 6-2 to 16-10 at intervals of four months. A child's test age can be found by looking up the age corresponding to his raw score. A mean or median test age on the entire scale can be computed if desired.

The standardization sample for the WISC-R included 100 boys and 100 girls at each year of age from 6½ through 16½, giving a total of 2200 cases. Each child was tested within six weeks of his midyear. For example, the 8-year-olds ranged from 8-years 4-months 15-days to 8-years 7-months 15-days. The sample was stratified on the basis of the 1970 U.S. Census with respect to geographic region, urban-rural residence, occupation of head of household, and race (white-nonwhite). Bilinguals were included if they could speak and understand English. Institutionalized mental retardates and children with severe emotional disorders were excluded. Testing was conducted in 32 states (including Hawaii) and Washington, D.C. In many respects, the WISC-R standardization sample is more nearly representative of the U.S. population within the designated age limits than is any other sample employed in standardizing individual tests.

Unlike the earlier WISC and the WAIS, the WISC-R yields mean IQ's that are very close to Stanford-Binet IQ's (1972 norms). The increased similarity in IQ may be attributed in part to the closeness in time when the norms were gathered for the two scales, and in part to the previously mentioned improvements in content, administration, and scoring procedures in the WISC-R.

RELIABILITY. Both split-half and retest coefficients were computed for WISC-R subtests,6 as well as for Verbal, Performance, and Full Scale IQ's. [...] 10½-11½, 14½-15½), the tests being readministered after approximately a one-month interval. Average split-half reliabilities for Verbal, Performance, and Full Scale IQ's were .94, .90, and .96. The corresponding retest coefficients were .93, .90, and .95. Some practice effect was observed on the retest, with mean gains of 3½ IQ points on the Verbal Scale, 9½ on the Performance Scale, and 7 on the Full Scale. Such a practice effect should be taken into account when retesting children after short intervals of time.

Subtest reliabilities are generally satisfactory and higher than in the earlier form. Split-half reliabilities averaged across age groups range from .70 to .86; and mean retest coefficients range from .65 to .88. A noteworthy feature of the manual is the inclusion of tables giving the standard error of measurement for subtests and for Verbal, Performance, and Full Scale IQ's within each age group, as well as the minimum difference between scores required for significance at specified levels. In comparisons of Verbal and Performance IQ's, a difference of 11 or 12 points is significant at the .05 level. The standard error of the Full Scale IQ is approximately 3 points. Thus the chances are 95 out of 100 that a child's true WISC-R IQ differs from his obtained IQ by no more than ±6 points (3 × 1.96 = 5.88).

VALIDITY. No discussion of validity is included in the WISC-R manual. To be sure, normative tables of standard score equivalents for each subtest provide evidence of age differentiation, but no evaluation of the data in terms of this criterion is given. A number of independent investigators have found concurrent validity coefficients between the earlier WISC and achievement tests or other academic criteria of intelligence clustering between .50 and .60 (Littell, 1960; Zimmerman & Woo-Sam, 1972). As would be expected, the Verbal Scale tended to correlate higher than the Performance Scale with such criteria. When children in the WISC standardization sample were classified according to father's occupational level, the usual hierarchy of mean IQ's was found (Seashore, Wesman, & Doppelt, 1950). The difference tended to be slightly larger in Verbal than in Performance IQ and to decrease somewhat with age, possibly because of exposure to relatively uniform education (Estes, 1953, 1955).

The WISC-R manual reports correlations with 1972 Stanford-Binet IQ's within homogeneous age groups. The mean correlation with Full Scale IQ is .73. The Verbal Scale again correlates more highly with the Stanford-
dd-even reliabilities were computed separately within each of the 11 Bmet than does dle. Performance Scale (.71 versus ,60). Among the su b-
(Tecrroups;
.b 0 . retest coefficients were found within three age groups (61.j- tests, Vocabulary Vlelds the highest mean correlation (,69) and Codina
the lowest (.26). c
'6 Only retest coefficients are reported for Coding and Digit Span, for which split-
Addit.ional information pro\'ided in the WISC-R manual includes inter-
alf coefficients would be inappropriate.
correlatIOns among the individual sul)tests, as we]] as the Co;[cbtion of
each subtest with Verbal, Performance, and Full Scale scores, and of these three composite scores with each other. All correlations are given separately for the 200 cases in each of the 11 age groups in the standardization sample. The correlations between total Verbal and Performance scores ranged from .60 to .73 within age groups, averaging .67. Thus the two parts of the scale have much in common, although the correlations between them are low enough to justify the retention of the separate scores.

Factorial analyses of the earlier WISC subtests identified factors quite similar to those found in the adult scales, namely general, verbal comprehension, perceptual-spatial, and memory (or freedom from distractibility) factors (see Littell, 1960; Zimmerman & Woo-Sam, 1972). In a more recent study (Silverstein, 1973), the WISC subtests were factor analyzed separately in groups of 505 English-speaking white, 318 black, and 487 Mexican-American children aged 6 to 11 years. The results revealed a verbal comprehension factor having substantial correlations with the five verbal tests; and a perceptual organization factor having substantial correlations with Block Design and Object Assembly. A major finding of this study was the similarity of factor structure across the three ethnic groups, suggesting that the tests measure the same abilities in these groups. A factor analysis of the WISC-R scores of the standardization sample at 11 age levels between 6½ and 16½ years yielded clear evidence of three major factors at each age level (Kaufman, 1975a). These factors corresponded closely to the previously described factors of verbal comprehension, perceptual organization, and freedom from distractibility.

WECHSLER PRESCHOOL AND PRIMARY SCALE OF INTELLIGENCE

DESCRIPTION. In more than one sense, the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) is the baby of the series. Published in 1967, this scale is designed for ages 4 to 6½ years. The scale includes 11 subtests, only 10 of which are used in finding the IQ. Eight of the subtests are downward extensions and adaptations of WISC subtests; the other three were newly constructed to replace WISC subtests that proved unsuitable for a variety of reasons. As in the WISC and WAIS, the subtests are grouped into a Verbal and a Performance scale, from which Verbal, Performance, and Full Scale IQ's are found. As in WISC-R, the administration of Verbal and Performance subtests is alternated in order to enhance variety and help to maintain the young child's interest and cooperation. Total testing time ranges from 50 to 75 minutes, in one or two testing sessions.

In the following list the new subtests have been starred:

VERBAL SCALE                       PERFORMANCE SCALE
Information                        *Animal House
Vocabulary                         Picture Completion
Arithmetic                         Mazes
Similarities                       *Geometric Design
Comprehension                      Block Design
*Sentences (Supplementary Test)

FIG. 35. The Animal House Test of the Wechsler Preschool and Primary Scale of Intelligence. (Courtesy The Psychological Corporation.)

"Sentences" is a memory test, substituted for the WISC Digit Span. The child repeats each sentence immediately after oral presentation by the examiner. This test can be used as an alternate for one of the other verbal tests, or it can be administered as an additional test to provide further information about the child, in which case it is not included in the total score in calculating the IQ. "Animal House" is basically similar to the WAIS Digit Symbol and the WISC Coding test. A key at the top of the board has pictures of dog, chicken, fish, and cat, each with a differently colored cylinder (its "house") under it. The child is to insert the correctly colored cylinder in the hole beneath each animal on the board (see Fig.
35). Time, errors, and omissions determine the score. "Geometric Design" requires the copying of 10 simple designs with a colored pencil.

The possibility of using short forms seems to have aroused as much interest for WPPSI as it had for WAIS and WISC. Some of the same investigators have been concerned with the derivation of such abbreviated scales at all three levels, notably Silverstein (1968a, 1968b, 1970, 1971). In a particularly well-designed study, Kaufman (1972) constructed a short form consisting of two Verbal subtests (Arithmetic and Comprehension) and two Performance subtests (Block Design and Picture Completion). Within individual age levels, this battery yielded reliability coefficients ranging from .91 to .94 and correlations with Full Scale IQ ranging from .89 to .92. Half the WPPSI standardization sample of 1,200 cases was used in the selection of the tests; the other half was used to cross validate the resulting battery. Kaufman reiterates the customary caution to use the short form only for screening purposes when time does not permit the administration of the entire scale.

NORMS. The WPPSI was standardized on a national sample of 1,200 children, 100 boys and 100 girls in each of six half-year age groups from 4 to 6½. Children were tested within six weeks of the required birthday or midyear date. The sample was stratified against 1960 census data with reference to geographical region, urban-rural residence, proportion of whites and nonwhites, and father's occupational level. Raw scores on each subtest are converted to normalized standard scores with a mean of 10 and an SD of 3 within each quarter-year group. The sums of the scaled scores on the Verbal, Performance, and Full Scale are then converted to deviation IQ's with a mean of 100 and an SD of 15. Although Wechsler argues against the use of mental age scores because of their possible misinterpretations, the manual provides a table for converting raw scores on each subtest to "test ages" in quarter-year units.

RELIABILITY. For every subtest except Animal House, reliability was found by correlating odd and even scores and applying the Spearman-Brown formula. Since scores on Animal House depend to a considerable extent on speed, its reliability was found by a retest at the end of the testing session. Reliability coefficients were computed separately within each half-year age group in the standardization sample. While varying with subtest and age level, these reliabilities fell mostly in the .80's. Reliability of the Full Scale IQ varied between .92 and .94; for Verbal IQ, it varied between .87 and .90; and for Performance IQ, between .84 and .91. Standard errors of measurement are also provided in the manual, as well as tables for evaluating the significance of the difference between scores. From these data it is suggested that a difference of 15 points or more between Verbal and Performance IQ's is sufficiently important to be investigated.

Stability over time was checked in a group of 50 kindergarten children retested after an average interval of 11 weeks. Under these conditions, reliability of Full Scale IQ was .92; for Verbal IQ it was .86; and for Performance IQ, .89.

VALIDITY. As is true of the other two Wechsler scales, the WPPSI manual contains no section labeled "validity," although it does provide some data of tangential relevance to the validity of the instrument. Intercorrelations of the 11 subtests within each age level in the standardization sample fall largely between .40 and .60. Correlations between Verbal and Performance subtests are nearly as high as those within each scale. The overlap between the two scales is also indicated by an average correlation of .66 between Verbal and Performance IQ's.

The manual reports a correlation of .75 with Stanford-Binet IQ in a group of 98 children aged 5 to 6 years. As in the case of the WISC, the Stanford-Binet correlates higher with the Verbal IQ (.76) than with the Performance IQ (.56). This finding was corroborated in subsequent studies by other investigators working with a variety of groups. In thirteen studies surveyed by Sattler (1974, p. 209), median correlations of WPPSI with Stanford-Binet IQ were .82, .81, and .67 for Full Scale, Verbal, and Performance IQ's, respectively. Correlations have also been found with a number of other general ability tests (for references, see Sattler, 1974, p. 210 and Appendix B-9). Data on predictive validity are meager (Kaufman, 1973a).

A carefully designed reanalysis of the standardization sample of 1,200 cases (Kaufman, 1973b) provided information on the relation of WPPSI scores to socioeconomic status (as indicated by father's occupational level), urban versus rural residence, and geographic region. For each of these three variables, WPPSI Verbal, Performance, and Full Scale IQ's were compared between samples matched on all the other stratification variables (including the other two variables under investigation plus sex, age, and color).

Socioeconomic status yielded significant differences only at the extremes of the distribution. Children with fathers in the professional and technical categories averaged significantly higher than all other groups (Mean IQ = 110); and children whose fathers were in the unskilled category averaged significantly lower than all other groups (Mean IQ = 92.1). Geographic region showed no clear relation to scores. No significant differences were found between matched urban and rural samples, unlike earlier studies with the WISC (Seashore, Wesman, & Doppelt, 1950) and the Stanford-Binet (McNemar, 1942). The investigator attributes the discrepancy principally to the contribution of other variables,
which were controlled in this study but not in the earlier studies. Another major difference, however, is in the time when the data were gathered. The intervening 25 or 30 years have witnessed marked changes both in population movements between urban and rural environments and in the educational and cultural facilities available in those environments. It is reasonable to expect that such sociocultural changes could have cancelled out the earlier differences in intelligence test performance characterizing children in these two types of environment.

Since the publication of the WPPSI, several investigators have factor analyzed the subtest scores on samples of diverse populations (see Sattler, 1974, pp. 227-230). A comprehensive study (Hollenbeck & Kaufman, 1973) applied several factor-analytic techniques to three separate age groups in the WPPSI standardization sample. The results provided consistent evidence of a general factor in the battery as a whole, together with two broad group factors: a verbal factor with substantial loadings in the six verbal subtests in each age group; and a performance factor with substantial loadings in the five performance tests in the two older groups and somewhat lower but still appreciable loadings in the youngest group (ages 4 to 4½ years). The separation of the two group factors was generally less clear-cut in the youngest group, a finding that is in line with much of the earlier research on the organization of abilities in young children. When subtest scores were factor analyzed separately for black and white children in the standardization sample, the results in both groups were closely similar to those obtained in the total sample (Kaufman & Hollenbeck, 1974).

CONCLUDING REMARKS ON THE WECHSLER SCALES. The currently available forms of the three Wechsler scales reflect an increasing level of sophistication and experience in test construction, corresponding to the decade when they were developed: WAIS (1955), WPPSI (1967), WISC-R (1974). In comparison with other individually administered tests, their principal strengths stem from the size and representativeness of the standardization samples, particularly for adult and preschool populations, and the technical quality of their test construction procedures. The treatment of reliability and errors of measurement is especially commendable. The weakest feature of all three scales is the dearth of empirical data on validity. The factor-analytic studies contribute to a clarification of the constructs in terms of which performance on the Wechsler scales may be described; but even these studies would have been more informative if they had included more indices of behavior external to the scales themselves.

CHAPTER 10

Tests for Special Populations

THE TESTS brought together in this chapter include both individual and group scales. They were developed primarily for use with persons who cannot be properly or adequately examined with traditional instruments, such as the individual scales described in the preceding chapter or the typical group tests to be considered in the next chapter. Historically, the kinds of tests surveyed in this chapter were designated as performance, nonlanguage, or nonverbal.

Performance tests, on the whole, involve the manipulation of objects, with a minimal use of paper and pencil. Nonlanguage tests require no language on the part of either examiner or examinee. The instructions for these tests can be given by demonstration, gesture, and pantomime, without the use of oral or written language. A prototype of nonlanguage group tests was the Army Examination Beta, developed for testing foreign-speaking and illiterate recruits during World War I (Yerkes, 1921). Revisions of this test were subsequently prepared for civilian use. For most testing purposes, it is not necessary to eliminate all language from test administration, since the examinees usually have some knowledge of a common language. Moreover, short, simple instructions can usually be translated or given successively in two languages without appreciably altering the nature or difficulty of the test. None of these tests, however, requires the examinee himself to use either written or spoken language.

Still another related category is that of nonverbal tests, more properly designated as nonreading tests. Most tests for primary school and preschool children fall into this category, as do tests for illiterates and nonreaders at any age level. While requiring no reading or writing, these tests make extensive use of oral instructions and communication on the part of the examiner. Moreover, they frequently measure verbal comprehension, such as vocabulary and the understanding of sentences and short paragraphs, through the use of pictorial material supplemented with oral instructions to accompany each item. Unlike the nonlanguage tests, they would thus be unsuited for foreign-speaking or deaf persons.
Although the traditional categories of performance, nonlanguage, and nonverbal tests contribute to an understanding of the purposes that different tests may serve, the distinctions have become somewhat blurred as more and more test batteries were developed that cut across these categories. The combination of verbal and performance tests in the Wechsler scales is a classical example.

In the present chapter, tests have been classified, not in terms of their content or administrative procedures, but with reference to their principal uses. Three major categories can be recognized from this viewpoint: tests for the infant and preschool level; tests suitable for persons with diverse sensory and motor handicaps; and tests designed for use across cultures or subcultures. Such a classification must remain flexible, however, since several of the tests have proved useful in more than one context. This is especially true of some of the instruments originally designed for cross-cultural testing, which are now more commonly used in clinical testing.

Finally, although some of the tests covered in this chapter were designed as group tests, they are frequently administered individually. A few are widely used in clinical testing to supplement the usual type of intelligence test and thus provide a fuller picture of the individual's intellectual functioning. Several permit the kind of qualitative observations associated with individual testing and may require considerable clinical sophistication for a detailed interpretation of test performance. On the whole, they are closer to the individual tests illustrated in Chapter 9 than to the group tests to be surveyed in Chapter 11.

All tests designed for infants and preschool children require individual administration. Some kindergarten children can be tested in small groups with the types of tests constructed for the primary grades. In general, however, group tests are not applicable until the child has reached school age. Most tests for children below the age of 6 are either performance or oral tests. A few involve rudimentary manipulation of paper and pencil.

It is customary to subdivide the first five years of life into the infant period and the preschool period. The first extends from birth to the age of approximately 18 months; the second, from 18 to 60 months. From the viewpoint of test administration, it should be noted that the infant must be tested while he is either lying down or supported on a person's lap. Speech is of little use in giving test instructions, although the child's own language development provides relevant data. Many of the tests deal with sensorimotor development, as illustrated by the infant's ability to lift his head, turn over, reach for and grasp objects, and follow a moving object with his eyes. The preschool child, on the other hand, can walk, sit at a table, use his hands in manipulating test objects, and communicate by language. At the preschool level, the child is also much more responsive to the examiner as a person, whereas for the infant the examiner serves primarily as a means of providing stimulus objects. Preschool testing is a more highly interpersonal process, a feature that augments both the opportunities and the difficulties presented by the test situation.

The proper psychological examination of young children requires coverage of a broad spectrum of behavior, including motor and social as well as cognitive traits. This orientation is reflected in some of the pioneer developmental scales to be discussed in the next section; it has also been reaffirmed in recent analyses of the field by specialists in early childhood education (Anderson & Messick, 1974).

DEVELOPMENTAL SCALES. Following a series of longitudinal studies of the normal course of behavior development in the infant and preschool child, Gesell and his associates at Yale prepared the Gesell Developmental Schedules (Gesell & Amatruda, 1947). These schedules cover four major areas of behavior: motor, adaptive, language, and personal-social. They provide a standardized procedure for observing and evaluating the course of behavior development in the child's daily life. Although a few may be properly described as tests, most of the items in these schedules are purely observational. Data are obtained through the direct observation of the child's responses to standard toys and other stimulus objects and are supplemented by information provided by the mother. In evaluating the child's responses, the examiner is aided by very detailed verbal descriptions of the behavior typical of different age levels, together with drawings such as those reproduced in Figure 36. While extending from the age of 4 weeks to 6 years, the Gesell schedules typify the approach followed in infant testing. Items from these schedules have been incorporated in several other developmental scales designed for the infant level.

Although both observational and scoring procedures are less highly standardized in the Gesell schedules than in the usual psychological test, there is evidence that, with adequate training, examiner reliabilities over .95 can be attained (Knobloch & Pasamanick, 1960). In general, these schedules may be regarded as a refinement and elaboration of the qualitative observations routinely made by pediatricians and other specialists concerned with infant development. They appear to be most useful as a supplement to medical examinations for the identification of neurological defect and organically caused behavioral abnormalities in early life (Donofrio, 1960; Knobloch & Pasamanick, 1960).
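
The examiner reliabilities cited above are correlation coefficients between the scores that two examiners assign to the same children. A minimal sketch of such a computation follows; the function name `pearson_r`, the variable names, and the ratings themselves are invented for illustration and do not come from the Gesell materials or from the studies cited.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary within each list")
    return cross / sqrt(ss_x * ss_y)


# Invented developmental ages (in weeks) assigned by two examiners
# to the same six infants; these are not real data.
examiner_a = [16, 20, 28, 32, 40, 52]
examiner_b = [16, 24, 28, 32, 44, 52]

examiner_reliability = pearson_r(examiner_a, examiner_b)
print(round(examiner_reliability, 2))
```

With only six invented cases the coefficient is merely illustrative; a reported reliability would rest on a much larger group of children.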
FIG. 36. Drawings Employed with the Gesell Developmental Schedules to Illustrate Typical Behavior at 28 Weeks of Age. (From Developmental Diagnosis, by Arnold Gesell and Catherine S. Amatruda. Copyright 1941, 1947, by Arnold Gesell. By permission of Paul B. Hoeber, Inc., publisher.)

Another variety of developmental scale is more restricted in its behavioral coverage but extends over a much wider age range. The prototypes of such scales are the Oseretsky Tests of Motor Proficiency and the Vineland Social Maturity Scale. Although extending well beyond the preschool ages, these scales are relevant to the present discussion because of certain similarities to the Gesell scales in content and in general approach. They are also more suitable for use at the lower age and intellectual levels than at higher levels.

The Oseretsky Tests of Motor Proficiency were originally published in Russia in 1923. They were subsequently translated into several languages and used in a number of European countries. In 1946, Doll (1946), then Director of Research for the Vineland Training School, sponsored and edited an English translation of the Portuguese adaptation of these tests. A scale of motor development is especially useful in testing the mentally retarded, who are also frequently retarded in motor functions. Other applications of the Oseretsky tests are found in the testing of children with motor handicaps, minimal brain dysfunction, or learning disabilities, particularly in connection with the administration of individualized training programs.¹ The original Oseretsky tests extended from 4 to 16 years, the tests being arranged into year levels as in the Stanford-Binet. The Oseretsky scale was designed to cover all major types of motor behavior, from postural reactions and gross bodily movements to finger coordination and control of facial muscles. Administration of these tests requires only simple and easily obtainable materials, such as matchsticks, wooden spools, thread, paper, rope, boxes, and rubber ball. Directions are given orally and by demonstration.

¹ Other motor tests, designed specifically for use in the diagnosis of learning disabilities, will be considered in Chapter 16.

In 1955, the Lincoln-Oseretsky Motor Development Scale (Sloan, 1955) was issued as a revision and restandardization of the Oseretsky tests with simplified instructions and improved scoring procedures. Covering only ages 6 to 14, this revision includes 36 of the original 85 items. The tests, which in this revision are arranged in order of difficulty, were chosen on the basis of age correlation, reliability, and certain practical considerations. Tentative percentile norms were found on a standardization sample of 380 boys and 369 girls attending public schools in central Illinois. Split-half reliabilities computed for single age and sex groups fell mostly in the .80's and .90's. A one-year retest yielded a correlation of .70. A factor analysis of a slightly longer, earlier version indicated a single common factor identified as motor development.

The Vineland Social Maturity Scale (Doll, 1953, 1965) is a developmental schedule concerned with the individual's ability to look after his practical needs and to take responsibility. Although covering a range from birth to over 25 years, this scale has been found most useful at the younger age levels, and particularly with the mentally retarded. The entire scale consists of 117 items grouped into year levels. The information required for each item is obtained, not through test situations, but through an interview with an informant or with the examinee himself. The scale is based on what the individual has actually done in his daily living. The items fall into eight categories: general self-help, self-help in eating, self-help in dressing, self-direction, occupation, communication, locomotion, and socialization. A social age (SA) and a social quotient (SQ) can be computed from the person's record on the entire scale.

The Vineland scale was standardized on 620 cases, including 10 males and 10 females at each year from birth to 30 years. These norms undoubtedly need updating. Moreover, the sample contained too few cases at each age and was not sufficiently representative of the general population, most of the cases coming from middle-class homes. A retest reliability of .92 is reported for 123 cases, the retest intervals varying from one day to nine months. The use of different examiners or informants did not appreciably affect results in this group, as long as all informants had had an adequate opportunity to observe the subjects.
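
The social quotient (SQ) mentioned earlier in the discussion of this scale can be illustrated with a short sketch. It assumes the ratio form, 100 times social age divided by chronological age, by analogy with the ratio IQ; the passage itself does not state the formula, and the function name and the case shown are hypothetical.

```python
def social_quotient(social_age_years, chronological_age_years):
    """Ratio social quotient, 100 * SA / CA.

    The ratio form is an assumption made for this illustration; it is not
    quoted from the passage above.
    """
    if chronological_age_years <= 0:
        raise ValueError("chronological age must be positive")
    if social_age_years < 0:
        raise ValueError("social age cannot be negative")
    return 100.0 * social_age_years / chronological_age_years


# Hypothetical case: a 10-year-old whose record corresponds to a social age of 8.
sq = social_quotient(social_age_years=8.0, chronological_age_years=10.0)
print(sq)  # 80.0
```

Under this assumption, an SQ of 100 means that social age equals chronological age, and values below 100 indicate social development behind the age norm.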
Validity of the scale was determined chiefly on the basis of age differentiation, comparison of normals with mental retardates, and correlation of scores with judgments of observers who knew the subjects well. Correlations between the Vineland scale and the Stanford-Binet vary widely, but are sufficiently low, in general, to indicate that different facets of behavior are being tapped by the two scales. The Vineland Social Maturity Scale has proved helpful to clinicians in diagnosing mental retardation and in reaching decisions regarding institutionalization. For example, an individual who is intellectually deficient in terms of the Stanford-Binet may be able to adjust satisfactorily outside an institution if his social age on the Vineland scale is adequate.

A newer and more comprehensive instrument is the Adaptive Behavior Scale (ABS), prepared by a committee of the American Association on Mental Deficiency. Designed primarily for mental retardates, this scale can also be used with emotionally maladjusted and other handicapped persons. Adaptive behavior is defined as "the effectiveness of an individual in coping with the natural and social demands of his or her environment" (American Association on Mental Deficiency, 1974). In its 1974 revision, this scale provides a single form applicable from the age of 3 years on. Like the Vineland, it is based on observations of everyday behavior and may be completed by parents, teachers, ward personnel, or others who have been in close contact with the examinee. The information may also be obtained through questioning or interviewing of one or more observers.

The ABS consists of two parts. Part 1 is a developmental scale covering 10 behavior domains, several of which are divided into subdomains as indicated below:

Independent Functioning: eating, toilet use, cleanliness, appearance, care of clothing, dressing and undressing, travel, general independent functioning
Physical Development: sensory development, motor development
Economic Activity: money handling and budgeting, shopping skills
Language Development: expression, comprehension, social language development
Numbers and Time
Domestic Activity: cleaning, kitchen duties, other domestic activities
Vocational Activity
Self-Direction: initiative, perseverance, leisure time

Specific items with multiple responses are provided under each domain or subdomain. Special instructions are included for handling questions about activities that the individual may have had no opportunity to perform (e.g., shopping, restaurant eating).

Part 2 is designed to assess maladaptive behavior related to personality and behavior disorders. It covers 14 behavior domains, such as violent and destructive behavior, withdrawal, and hyperactive tendencies. In each of the 14 categories, specific items of behavior (e.g., "bites others," "attempts to set fires") are rated 1 if they occur occasionally and 2 if they occur frequently.

Instructions for administering and scoring are clearly set forth on the form itself and explained further in the manual. The scale yields a summary profile of percentile scores in each of the 24 behavior domains. Norms were obtained on institutionalized mental retardates of both sexes between the ages of 3 and 69 years. Percentile equivalents are reported for 11 age levels, grouped by one-year intervals at the youngest ages and by 2, 3, 10, and 20 years at the older ages. Cases were drawn from institutions throughout the United States, the numbers at each age level ranging from about 100 to slightly over 500.

In the interpretation of scores, the authors emphasize the importance of considering the individual's ability to remain and to function adequately in his own setting, community, or neighborhood. Preliminary data on rater reliability and on concurrent and construct validity are promising. The authors refer to several lines of research under way, including the application of the scale to noninstitutionalized mental retardates and to emotionally disturbed but nonretarded persons, longitudinal studies of change in scores during treatment and training programs, and investigations of various psychometric features of the instrument itself.

All the developmental scales discussed in this section fall short of psychometric requirements, particularly with reference to representativeness of normative data. Within the framework of current test-construction procedures, they all require more research to permit adequate interpretation of test results. Their principal interest stems from their indicating the scope of functions that need to be included in the examination of young children and of older children with mental or physical disorders.

BAYLEY SCALES OF INFANT DEVELOPMENT. The decades of the 1960s and the 1970s witnessed an upsurge of interest in tests for infants and preschool children. One contributing factor was the rapid expansion of educational programs for mentally retarded children. Another was the widespread development of preschool programs of compensatory education for culturally disadvantaged children. To meet these pressing practical needs, new tests have appeared and considerable research has been conducted on innovative approaches to assessment.

An especially well constructed test for the earliest age levels is the Bayley [...] problem solving, vocalization, the beginnings of verbal communication, and rudimentary abstract thinking. The Motor Scale provides measures of gross motor abilities, such as sitting, standing, walking, and stair climbing, as well as manipulatory skills of hands and fingers. At the infant level, locomotor and manipulatory development plays an important
lev Scales ot I~Ifant Development, illustrated 111 Figure ,j i. In eO! po~ aL.nb
part in the child's interactions with his environment and hence in the de-
's~me items from the Gesell schedules and other infant and preschool. test~,
velopment of his mental processes. The Infant Behavior Record is a rating
these scales represent the end-product of many years. of resear:h b)' fBar
scale completed by the examiner after the other t\\'o parts have been ad-
'lev and her co-workers, including the longitudmal mvestlgatlOHS a t le
ministered, It is designed to assess various aspects of personality de\-elop-
B~rkcley GrO\vth Study.
ment, such as emotional and social oeha\'ioL attention span, persistence,
and goal directedness.
In the technic:ll quality of their test-construction procedures, the Bayley
scales are clearly outstanding among tests for the infant leveL Norms were
established on 1,262 children, about equally distributed between the ages
'~,' ,t'~ of 2 and 30 months. The standardization sample was chosen Sl) as to be
.., j ::J:-~'"
. l:'.-'b;.
representative of the U.S. population in terms of urban·rural residence,
<:~~~l'
major geographic region, sex ratio, race (white-nonwhite), and education
Ii>
of head of household, Institutionalized children, premature babies, and
children over the age of 12 months from bilingual homes \\'ere excluded.
r (. ~lental and \Jotor ;~ales yield separate developmental indexes, expressed
as normalized standard Scores with a mean of 100 and an SD of 16 (as in
Stanford-Binet deviation IQ's). These dcvelop:nental indexes are found
within the child's own age group, classified by half-month steps from 2
to 6 months and by one-month steps from 6 to 30 months.
Split-half reliabilit:, coefficients "'ithin separate age groups ranged
from .81 to .93, with a median value of .88, for the ?vlental Scale; and
from .68 to .92, with a median of .84, for the :\lotor Scale. These coef-
ficients compare favorably with the reliabilities usuallv found in testing
infants. The manual reports standard errOrs of measurement and mini-
mum differences between indexes on the ;'lentaJ and hIotor scales re-
quired for statistical significance. Data on tester-observer agreement and
on retest reliability after a one-week intelTal are also encouraging.
Bayley obsen'es that these scales, like all infant tests, should be used
FIG. 37. Test Ob'Jec's
t En111!oved
1 with the Ba)·ley Scales of Infant Deyelop- prinCipally to assess current developmental status rather than to predict
ment.
subsequent ability levels. Development of abilities at these early ages is
(Courtesy The Psychological Corporation.) susceptible to so many intervening influences as to render long-term pre-
dictions of little value. The scales can be most helpful in the early detec-
The Bayley scales provide three complementary tools for assessing t~~
tion of sensory and new-ological defects, emotional disturbances, and
developm~ntal status of children between the ages of 2 mont,~1s and 272 environmental deficits.
vears: the ~lental Scale, the Motor Scale, and the Infant Beha\lOr Record.
'. 2 Surveys of available tests for infant and preschool levels can be f~~nd in Stott
d B II (196.5) and Thomas (1970). The \V echsler Preschool and 1 nmary Scale
lIoICCAHTHY SCALES OF CHILDREl>;'S ABILITIES. At the preschool level, a
~~l Int:lligence, discussed in Chapter 9, also belongs in the present category.
recently developed instrument is the McCarthy Scales of Children's
Abilities (MSCA), suitable for children between the ages of 2½ and 8½ years. It consists of 18 tests, grouped into six overlapping scales: Verbal, Perceptual-Performance, Quantitative, General Cognitive, Memory, and Motor. Figure 38 illustrates the Conceptual Grouping Test from the Perceptual-Performance Scale. In this test, the child is shown red, yellow, and blue circles and squares in two sizes and asked to find the pieces with stated characteristics. The General Cognitive score comes closest to the traditional global measure of intellectual development. It is found from the sum of the scores on the first three scales, which do not overlap each other but which do contain all the Memory tests and all but three of the Motor tests. The General Cognitive score is thus based on 15 of the 18 tests in the entire battery.

FIG. 38. Examiner Administering the Conceptual Grouping Test of the McCarthy Scales of Children's Abilities.
(Courtesy The Psychological Corporation.)

Scores on the General Cognitive Scale are expressed as a General Cognitive Index (GCI), which is a normalized standard score with a mean of 100 and an SD of 16, found within each 3-month age group. The manual makes it clear that, although the GCI is in the same units as traditional IQ's, the term IQ was deliberately avoided because of its many misleading connotations. The GCI is described as an index of the child's functioning at the time of testing, with no implications of immutability or etiology. Scores on the separate scales are normalized standard scores with a mean of 50 and an SD of 10 in terms of the same age groups.

The standardization sample of 1,032 cases included approximately 100 children at each of 10 age levels, by half-year steps between 2½ and 5½ and by one-year steps between 5½ and 8½. At each age level, the sample contained an equal number of boys and girls and was stratified by color (white-nonwhite), geographic region, father's occupational level, and (approximately) urban-rural residence, in accordance with the 1970 U.S. Census. Institutionalized mental retardates and children with severe behavioral or emotional disorders, brain damage, or obvious physical defects were excluded; bilinguals were tested only if they could speak and understand English.

Split-half reliability for the General Cognitive Index averaged .93 within age levels; average coefficients for the other five scales ranged from .79 to .88. The manual also reports standard errors of measurement and minimum differences between scale scores required for significance at the .05 level. Retest reliabilities over a one-month interval for 125 children classified into three age groups averaged .90 for GCI and ranged from .69 to .89 for the separate scales.

With regard to validity, the manual cites suggestive but meager data on predictive validity against an educational achievement battery administered at the end of the first grade. The initial selection of tests and their grouping into scales was based on a combination of clinical experience, findings of developmental psychology, and the results of factor-analytic research. In the course of developing the scales, a somewhat longer series of tests was factor analyzed separately within three age levels on about 60 percent of the standardization sample (Kaufman & Hollenbeck, 1973). General cognitive, memory, and motor factors were identified at all ages; other factors showed developmental trends, suggesting that different abilities might be utilized in performing the same tasks at different ages. For example, drawing tests had a large motor component at the younger ages but were predominantly conceptual at the older ages. The results of the preliminary factor analyses were substantially corroborated in subsequent factor analyses of the final version of the battery on the entire standardization sample (Kaufman, 1975b).

Other studies have investigated differences in MSCA performance in relation to sex, race (black-white), and socioeconomic status (father's occupational level). No significant sex differences were found in either GCI or any of the separate scale indexes (Kaufman & Kaufman, 1973b). Ethnic comparisons (Kaufman & Kaufman, 1973a) revealed few significant differences: in favor of blacks on the Motor Scale in the youngest age group (4-5½); and in favor of whites on the other scales in the oldest age group (6½-8½). Moreover, investigations of paternal occupational level within both ethnic groups suggest that socioeconomic status may be more important than race as a concomitant of performance on the McCarthy scales (Kaufman & Kaufman, 1975).

PIAGETIAN SCALES. Although applicable well beyond the preschool level, the scales modeled on the developmental theories of Jean Piaget have thus far found their major applications in early childhood. All such scales are in an experimental form; few are commercially available. Most have been developed for use in the authors' own research programs, although some of these scales are available to other research workers. At this stage, the major contribution of Piagetian scales to the psychological testing of children consists in their providing a theoretical framework that focuses on developmental sequences and a procedural approach characterized by flexibility and qualitative interpretation.

Some of the features of Piagetian scales, with special reference to normative interpretation of performance, were discussed in Chapter 4. Basically, Piagetian scales are ordinal in the sense that they presuppose a uniform sequence of development through successive stages. They are also content-referenced, insofar as they provide qualitative descriptions of what the child is actually able to do. Piagetian tasks focus on the long-term development of specific concepts or cognitive schemata,3 rather than on broad traits. With regard to administration, the major object of Piagetian scales is to elicit the child's explanation for an observed event and the reasons that underlie his explanation. Scoring is characteristically based on the quality of the child's responses to a relatively small number of problem situations presented to him, rather than on the number or difficulty of successfully completed items. For this purpose, misconceptions revealed by the incorrect responses are of primary interest. The examiner concentrates more on the process of problem solving than on the product.

Because of its highly individualized procedures, Piagetian testing is well suited for clinical work. It has also attracted the attention of educators because it permits the integration of testing and teaching.4 Its most frequent use, however, is still in research on developmental psychology. The three sets of tests described below have been selected in part because of their present or anticipated availability for use outside of the authors' own research, and in part because of their detailed description in published sources.5

3 "Schemata" is the plural of "schema," a term commonly encountered in Piagetian writings and signifying essentially a framework into which the individual fits incoming sensory data.
4 An example of such an application is "Let's Look at Children," to be discussed in Chapter 14.
5 Other major projects on the standardization of Piagetian scales are being conducted by S. K. Escalona at the Albert Einstein College of Medicine in New York City, R. D. Tuddenham at the University of California in Berkeley, N. P. Vinh-Bang and B. Inhelder in Piaget's laboratory in Geneva, Switzerland, and E. A. Lunzer at the University of Manchester in England.

At the University of Montreal, Laurendeau and Pinard have been engaged in an unusually comprehensive, long-term research project designed to replicate Piaget's work under standardized conditions, with large representative samples, and in a different cultural milieu (Laurendeau & Pinard, 1962, 1970; Pinard & Laurendeau, 1964). A byproduct of this research is the construction of scales of mental development that will eventually be available to other investigators. At this time, however, the authors feel it would be premature to release their scale of mental development until they have completed much more research with their tests.6

6 Personal communication from Professor A. Pinard, April 3, 1974.

In the course of their investigations, Laurendeau and Pinard administered a battery of 57 tests to 700 children ranging in age from 2 to 12 years. The tests for children under 4 were newly constructed or adapted from conventional scales, although all were chosen to assess characteristics that Piaget attributed to this developmental period. Twenty-five of the tests, designed chiefly for ages 4 and up, were modeled directly after Piagetian tasks. Thus far, the results obtained with 10 of these tests have been reported in detail in two books (Laurendeau & Pinard, 1962, 1970).

Five tests deal with causality, including the child's explanations of the nature and causes of dreams, the differences between animate and inanimate objects, what makes it dark at night, what causes the movements of clouds, and why some objects float and others sink. These tests are administered almost entirely through oral questionnaires, which fall about midway between Piaget's unstructured "méthode clinique" and the completely controlled techniques of traditional tests. All questions are standardized; but depending upon the child's initial responses, the examiner follows alternative routes in his further exploration of the child's thinking processes.

The other five tests are concerned with the child's concepts of space. They include such tasks as recognizing objects by touch and identifying them among visually presented drawings of the same objects; arranging a set of toy lampposts in a straight line between two toy houses; placing a toy man in the same spots in the child's landscape that he occupies in the examiner's identical landscape; designating right and left on the child's own body, on the examiner in different positions, and in the relation of objects on the table; and problems of perspective, in which the
child indicates how three toy mountains look to a man standing in different places. Several of these spatial tests deal with the "egocentrism" of the young child's thinking, which makes it difficult for him to regard objects from viewpoints other than his own.

The complete protocol of the child's responses to each test is scored as a unit, in terms of the developmental level indicated by the quality of the responses. Laurendeau and Pinard have subjected their tests to extensive statistical analyses. Their standardization sample of 700 cases included 25 boys and 25 girls at each six-month interval from 2 to 5 years and at each one-year interval from 5 to 12. The children were selected so as to constitute a representative sample of the French Canadian population of Montreal with regard to father's occupational level and school grade (or number of children in the family at the preschool ages). Besides providing normative age data, the authors analyzed their results for ordinality, or uniformity of sequence in the attainment of response levels by different children. They also investigated the degree of similarity in the developmental stages reached by each child in the different tests. Intercorrelations of scores on the five causality tests ranged from .59 to .78; on the five space tests, the correlations ranged from .37 to .67 (Laurendeau & Pinard, 1962, p. 236; 1970, p. 412).

The Ordinal Scales of Psychological Development prepared by Uzgiris and Hunt (1975) are designed for a much younger age level than are those of Laurendeau and Pinard, extending from the age of 2 weeks to 2 years. These ages cover approximately what Piaget characterizes as the sensorimotor period and within which he recognizes six stages. In order to increase the sensitivity of their instruments, Uzgiris and Hunt classify the responses into more than six levels, the number varying from 7 to 14 in the different scales. The series includes six scales, designated as follows:

1. Object Permanence: the child's emerging notion of independently existing objects is indicated by following an object and searching for an object after it is hidden with increasing degrees of concealment.
2. Development of Means for achieving desired environmental ends: use of own hands in reaching for objects and of other means such as strings, stick, support, etc.
3. Imitation: including both gestural and vocal imitation.
4. Operational Causality: recognizing and adapting to objective causality, ranging from visual observation of one's own hands to eliciting desired behavior from a human agent and activating a mechanical toy.
5. Object Relations in Space: coordination of schemata of looking and listening in localizing objects in space; understanding such relations as container, equilibrium, gravity.
6. Development of Schemata for relating to objects: responding to objects by looking, feeling, manipulating, dropping, throwing, etc., and by socially instigated schemata appropriate to particular objects (e.g., "driving" toy car, building with blocks, wearing beads, naming objects).

No norms are provided, but the authors collected data on several psychometric properties of their scales by administering them to 84 infants, including at least four at each month of age up to one year and at least four at each two months of age between one and two years. Most of these subjects were children of graduate students and staff at the University of Illinois. Both observer agreement and test-retest agreement after a 48-hour interval are reported. In general, the tests appear quite satisfactory in both respects. An index of ordinality, computed for each scale from the scores of the same 84 children, ranged from .802 to .991. The authors report that .50 is considered minimally satisfactory evidence of ordinality with the index employed.7

7 Procedures for the measurement of ordinality and the application of scalogram analysis to Piagetian scales are still controversial, a fact that should be borne in mind in interpreting any reported indices of ordinality (see Hooper, 1973; Wohlwill, 1970).

Uzgiris and Hunt clearly explain that these are only provisional scales, although they are available to other investigators for research purposes. Apart from journal articles reporting specific studies in which the scales were employed, the authors describe the tests in a book (Uzgiris & Hunt, 1975) and also provide six sound films demonstrating their use. The scales were originally designed to measure the effects of specific environmental conditions on the rate and course of development of infants. Studies of infants reared under different conditions (Paraskevopoulos & Hunt, 1971) and on infants participating in intervention programs (Hunt, Paraskevopoulos, Schickedanz, & Uzgiris, 1975) have thus far indicated significant effects of these environmental variables on the mean age at which children attain different steps in the developmental scales.

Unlike the first two examples of Piagetian scales, the Concept Assessment Kit-Conservation is a published test which may be purchased on the same basis as other psychological tests. Designed for ages 4 to 7 years, it provides a measure of one of the best known Piagetian concepts. Conservation refers to the child's realization that such properties of objects as weight, volume, or number remain unchanged when the objects undergo transformations in shape, position, form, or other specific attributes. The authors (Goldschmid & Bentler, 1968b) focused on conservation as an indicator of the child's transition from the preoperational to the concrete operational stage of thinking, which Piaget places roughly at the age of 7 or 8 years.

Throughout the test, the procedure is essentially the same. The child is
shown two identical objects; then the examiner makes certain transformations in one of them and interrogates the child about their similarity or difference. After answering, the child is asked to explain his answer. In each item, one point is scored for the correct judgment of equivalence and one point for an acceptable explanation. For example, the examiner begins with two standard glasses containing equal amounts of water (continuous quantity) or grains of corn (discontinuous quantity) and pours the contents into a flat dish or into several small glasses. In another task, the examiner shows the child two equal balls of Playdoh and then flattens one into a pancake and asks whether the ball is as heavy as the pancake.

Three forms of the test are available. Forms A and B are parallel, each providing six tasks: Two-Dimensional Space, Number, Substance, Continuous Quantity, Weight, and Discontinuous Quantity. The two forms were shown to be closely equivalent in means and SD's and their scores correlated .95. Form C includes two different tasks, Area and Length; it correlates .76 and .74 with Forms A and B, respectively. Administration is facilitated by printing all essential directions on the record form, including diagrams of the materials, directions for manipulating materials, and verbal instructions.

Norms were established on a standardization sample of 560 boys and girls between the ages of 4 and 8, obtained from schools, day care centers, and Head Start centers in the Los Angeles, California area. The sample included both blacks and whites and covered a wide range of socioeconomic level, but with a slight overrepresentation of the lower-middle class. Percentiles are reported for each age level. These norms, of course, must be regarded as tentative in view of the small number of cases at each age and the limitations in representativeness of the sample. Mean scores for each age show a systematic rise with age, with a sharp rise between 6 and 8 years, as anticipated from Piagetian theory.

Both in the process of test construction and in the evaluation of the final forms, the authors carried out various statistical analyses to assess scorer reliability; Kuder-Richardson, parallel-form, and retest reliability; scalability, or ordinality; and factorial composition (see also Goldschmid & Bentler, 1968a). Although based on rather small samples, the results indicate generally satisfactory reliability and give good evidence of ordinality and of the presence of a large common factor of conservation throughout the tasks.

Comparative studies in seven countries suggest that the test is applicable in widely diverse cultures, yielding high reliabilities and showing approximately similar age trends (Goldschmid et al., 1973). Differences among cultures and subcultures, however, have been found in the mean ages at which concepts are acquired, i.e., the age curves may be displaced horizontally by one or two years (see also Figurelli & Keller, 1972; Wasik & Wasik, 1971). Training in conservation tasks has been found to improve scores significantly (see also Goldschmid, 1968; Zimmerman & Rosenthal, 1974a, 1974b). The manual cites several studies on small groups that contribute suggestive data about the construct validity of the test. Some evidence of predictive validity is provided by significant correlations in the .30's and .40's with first-grade achievement, the correlation being highest with arithmetic grades (.52).

DEAFNESS. Owing to their general retardation in linguistic development, deaf children are usually handicapped on verbal tests, even when the verbal content is presented visually. In fact, the testing of deaf children was the primary object in the development of some of the earliest performance scales, such as the Pintner-Paterson Performance Scale and the Arthur Performance Scale. In Revised Form II of the Arthur scale, the verbal instructions required in the earlier form were further reduced in order to increase the applicability of the test to deaf children. Special adaptations of the Wechsler scales are sometimes employed in testing deaf persons. The verbal tests can be administered if the oral questions are typed on cards. Various procedures for communicating the instructions for the performance tests have also been worked out (see Sattler, 1974, pp. 170-172). With such modifications of standard testing procedures, however, one cannot assume that reliability, validity, and norms remain unchanged. Nonlanguage group tests, such as the Army Beta, are also used in testing the deaf.

Whether or not they require special procedural adaptations, all the tests mentioned thus far were standardized on hearing persons. For many purposes, it is of course desirable to compare the performance of the deaf with general norms established on hearing persons. At the same time, norms obtained on deaf children are also useful in a number of situations pertaining to the educational development of these children.

To meet this need, the Hiskey-Nebraska Test of Learning Aptitude was developed and standardized on deaf and hard-of-hearing children. This is an individual test suitable for ages 3 to 16. Speed was eliminated, since it is difficult to convey the idea of speed to young deaf children. An attempt was also made to sample a wider variety of intellectual functions than those covered by most performance tests. Pantomime and practice exercises to communicate the instructions, as well as intrinsically interesting items to establish rapport, were considered important requirements for such a test. All items were chosen with special reference to the limitations of deaf children, the final item selection being based chiefly on the criterion of age differentiation.
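The criterion of age differentiation mentioned for the Hiskey-Nebraska can be illustrated with a short sketch. This is only a hypothetical illustration, not the procedure actually followed in constructing that test: the function names, the rule that an item is retained when its proportion passing never drops from one age group to the next and shows a net gain, and the small data set are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical data layout: responses[age] is a list of examinee records,
# and each record is a list of 0/1 scores, one per item.
Responses = Dict[int, List[List[int]]]


def pass_rates_by_age(responses: Responses) -> Dict[int, List[float]]:
    """Proportion of examinees passing each item, computed within each age group."""
    rates: Dict[int, List[float]] = {}
    for age, records in responses.items():
        n_examinees = len(records)
        n_items = len(records[0])
        rates[age] = [
            sum(record[i] for record in records) / n_examinees
            for i in range(n_items)
        ]
    return rates


def items_showing_age_differentiation(
    responses: Responses, min_gain: float = 0.0
) -> List[int]:
    """Indexes of items whose pass rate never decreases with age and whose
    gain from the youngest to the oldest group exceeds min_gain
    (an assumed retention rule, chosen only for illustration)."""
    rates = pass_rates_by_age(responses)
    ages = sorted(rates)
    n_items = len(rates[ages[0]])
    kept: List[int] = []
    for i in range(n_items):
        series = [rates[age][i] for age in ages]
        never_drops = all(later >= earlier for earlier, later in zip(series, series[1:]))
        if never_drops and series[-1] - series[0] > min_gain:
            kept.append(i)
    return kept


# Made-up scores for three items at ages 3, 4, and 5 (four children per age).
example: Responses = {
    3: [[0, 1, 1], [0, 1, 1], [1, 1, 0], [0, 1, 1]],
    4: [[1, 1, 1], [0, 1, 0], [1, 1, 0], [0, 1, 1]],
    5: [[1, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0]],
}
print(items_showing_age_differentiation(example))
```

In the made-up data, the first item is passed by progressively more children at each age, the second is passed by everyone at every age, and the third becomes less frequent with age, so only the first would be retained under the assumed rule.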
[...]

5. Paper Folding (Patterns)
6. Visual Attention Span
8. Completion of Drawings
9. Memory for Digits
11. Picture Analogies
12. Spatial Reasoning

Norms were derived separately from 1,079 deaf and 1,074 hearing children between the ages of 3 and 17 years, tested in 10 states. Split-half reliabilities in the .90's are reported for deaf and hearing groups. Intercorrelations of the 12 subtests range from the .30's to the .70's among younger children (ages 3 to 10) and from the .20's to the .40's among older children (ages 11 to 17). Correlations of .78 to .86 were found between the Hiskey-Nebraska and either the Stanford-Binet or the Wechsler Intelligence Scale for Children in small groups of hearing children. Further evidence of validity was provided by substantial correlations with achievement tests among deaf children. The manual contains a discussion of desirable practices to be followed in testing deaf children.

BLINDNESS. Testing the blind presents a very different set of problems from those encountered with the deaf. Oral tests can be most readily adapted for blind persons, while performance tests are least likely to be applicable. In addition to the usual oral presentation by the examiner, other suitable testing techniques have been utilized, such as phonograph records and tape or wire recordings. Some tests are also available in braille. The latter technique is somewhat limited in its applicability, however, by the greater bulkiness of materials printed in braille as compared with inkprint, by the slower reading rate for braille, and by the number of blind persons who are not facile braille readers. The examinee's responses may likewise be recorded in braille or on a typewriter. Specially prepared embossed answer sheets or cards are also available for use with true-false, multiple-choice, and other objective-type items. In many individually administered tests, of course, oral responses can be obtained.

Among the principal examples of general intelligence tests that have been adapted for blind persons are the Binet and the Wechsler. The first Hayes-Binet revision for testing the blind was based on the 1916 Stanford-Binet. In 1942, the Interim Hayes-Binet8 was prepared from the 1937 Stanford-Binet (S. P. Hayes, 1942, 1943). All items that could be administered without the use of vision were selected from both Form L and Form M. This procedure yielded six tests for each year level from VIII to XIV, and eight tests at the Average Adult level. In order to assemble enough tests for year levels III to VI, it was necessary to draw on some of the special tests devised for use in the earlier Hayes-Binet. Most of the tests in the final scale are oral, a few requiring braille materials. A retest reliability of .90 and a split-half reliability of .91 are reported by Hayes. Correlations with braille editions of standard achievement tests ranged from .82 to .93. The validity of this test was also checked against school progress.

8 Originally designated as an interim edition because of the tentative nature of its standardization, this revision has come to be known by this name in the literature.

The Wechsler scales have also been adapted for blind examinees. These adaptations consist essentially in using the verbal tests and omitting the performance tests. A few items inappropriate for the blind have also been replaced by alternates. When tested under these conditions, blind persons as a group have been found to equal or excel the general seeing norms.

A different approach is illustrated by the Haptic Intelligence Scale for Adult Blind. This was developed as a nonverbal test to be used in conjunction with the Verbal Scale of the WAIS. Four of the tests are adaptations of performance tests from the WAIS, namely, Digit Symbol, Block Design, Object Assembly, and Object Completion; two were newly devised, including Pattern Board and Bead Arithmetic. The tests utilize a completely tactile approach and, if given to the partially sighted, require the wearing of a blindfold. For this reason, among others, they are probably best suited for testing the totally blind. Standardization procedures followed closely those employed with the WAIS. The blind subjects tested in the standardization sample included a proportional number of nonwhites and were distributed over the major geographical regions of the country. Subtest scores and deviation IQ's are found as in the WAIS. Split-half reliability for the entire test was found to be .95 and for subtests varied from .79 (Object Assembly) to .94 (Bead Arithmetic). A six-month retest of 136 subjects yielded a total-score reliability of .91 and subtest reliabilities ranging from .70 to .81. Correlation with WAIS Verbal Scale in the 20-34 age group of blind subjects was .65. The materials are bulky and administration time is long, requiring from 1½ to 2 hours; but blind examinees generally find the tests interesting and enjoyable. The authors caution that this is a provisional scale, requiring further research. It can provide useful information when employed by a trained clinician.

A number of group intelligence tests have likewise been adapted for use with the visually handicapped and are available in both large-type and braille editions. Examples include the School and College Ability Tests (SCAT), the College Board Scholastic Aptitude Test (SAT), and the Aptitude Test of the Graduate Record Examinations (GRE). Research with a tactile form of the Progressive Matrices has shown it to have promise as a nonverbal intelligence test for blind children between the ages of 9 and 15 years (Rich & Anderson, 1965). An adaptation of the Vineland Social Maturity Scale for blind preschool children was developed and standardized by Maxfield and Buchholz (1957).

[...] of "use" vocabulary, especially applicable to persons unable to vocalize well (such as the cerebral palsied) and to the deaf. Since they are easy to administer and can be completed in about 15 minutes or less, they are also useful as a rapid screening device in situations where no trained examiner is available but individual testing is needed.

The Peabody Picture Vocabulary Test (PPVT) is typical of these in-
, struments. It consists of a series of 150 plates, each containing four pic-
ORTHOPEDIC HAI\'DICAPS. Although usually able to receive auditory and tures. As each plate is presented, the examiner provides a stimulus word
visual stimulation, the orthopedicall:' handicapped may have such severe orally; the child responds by pointin~ to or in some other way designating
Illotor disorders as to make either oral or \\'Titten responses impracticable. the picture on the plate that best illustrates the meaning of the stimulus
The manipu ration of fonn boards or other performance materials would word. Although the entire test covers a range from 21,;', to 18 vears, each
likewise meet with difficulties. \Norking against a time limit or in strange indivichl.ll is given only the plates appropl~iate to his-own p~riormance
surroundings often increases the motor disturbrmce in the orthopedically level, as determined by a specified run of successes at one end and failures
· handicapped. TIleir greater susceptibility to fatigue makes short testing at the other. Raw scores can be converted to mental ages, deviation IQ's,
sessions necessary. or percentiles. The PPVT is untimed but requires from 10 to 15 minutes.
Some of the severest motor handicaps are found among the cerebral It is available in f\vo parallel forms which utilize the same set of cards
palsied. Yet surveys of these cases have frequently employed common in- with different stimulus words.
telligence tests such as the Stanford-Binet or the Arthur Performance The standardization sample for the PPVT included a total of 4,012
· SCCL1~. In such studies, the most severely handicapped were usuall:' ex- cases between the ages of 21.~ and IS years tested in :'\ashville, Tennessee,
: dueled as untestable. Frequently, informal adjustments in testing proce- and its environs, Alternate form reliability coefficients for different age
dure are made in order to adapt the test to the child's response capacities. le\'els within the standardization sample ranged from .67 to .84. Reliabil-
Both of these procedures, of course, are makeshifts. ity coefficients within the same range were subsequently obtained in
..\. more satisfactor:' approach lies in the development of testing instru- several mentally retarded or physically handicqpped groups. Validity was
ments suitable for even the most severely handicapped indi\iduals. A originally established in terms of age differentiation. Since its publication,
number of s[leciallv.. desianed
b tests or adaptations of existing tests
.... are now the test has been employed in a number of studies \\'ith normal, mentally
a\'ailable for this purpose, although their normative and validity data are retarded, emotionally disturbed, or physically handicapped chilc1re~.
usually meaaer.
, ~ Several of the tests to be discussed in the next section, ,
These studies have vielded validitv coefficients in the .60's 'with individual
·
oriainalh-
b .
designed
'-'
for use in cross-cultural testing,\.. have also proved and group intellige;lce scales \\'itl~in relati\'Cly homogeneous age groups.
" applicable to the handicapped. Adaptations of the Leiter International Understandablv, . these correlations were hifTher with verbal than with
0
P,:rformance Scale and the Porte us ~1azes, suitable for administration to performance tests. There is also some evidence of moderate concurrent
cerebral-palsied children, have been prep~rC'd (Allen & Collins, 19;'3; and predictive validity against academic achievement tests. A limitation
Arnold, 19,51). In both adapted tests, the examiner manipulates the te~t of this test for certaiJ~ te;ting purposes is suggested by the finding that
'~materials, while the subject responds only by appropriate head move- culturally disadvantaged children tend to perform more poorly on it than
ments. A similar adaptation of the Stanford-Binet has been proposed on other intelligence tests (Costello & Ali, 1971; Cundick, 1970; IVlilgram
I (E. Katz, 1958). The Progressive i\l atrices provide a promising tool for ~ Ozer, 1967; Rosenberg & Stroud, 1966). On the other hand, particij)ants
I. this purpose. Since this test is given with no time limit, and since the re- 111 preschool compensatory education programs showed more improve-
.~sponse may be indicated orally, in writing, or by pointing or nodding, it ment on this test than on the Stanford-Binet (Howard & Plant, 1967;
1 appears to be espeCially appropriate for the orthopedically handicapped. Klaus & Gray, 1968; j'vlilgram, 1971). Scores 011the PPVT may reflect in
,,: Despite the flexibility and simplicity of its response indicator, this test part the child's degree ~f cultural assimiLion. /
covers a wide range of difficulty and provides a fairly high test ceiling. Similar procedures of test administration have been incorporated in pic-
Successful use of this test has been reported in studies of cerebral-palsied torial classification tests, as illustrated by the Columbia Mental :'vlaturity
children and adults (Allen & Collins, 195.5; Holden, 19.51; Tracht, 1948). Scale (CMMS). Originally developed for use with cerebral-palsied chii-
Another type of test that permits the ntilization of, a simple' pointing dren, this scale comprises 92 items, each consisting of a set of thee, four,
~~ ...,...;....~hr:' ,,-.;,...f.,~(? ,,"'nrnl'111nrll
............... }p(:t Thp<,p J.":)C't~ l')l'nvlrlp :l r~nir1 mpas-
or five drawings printed on a large card. The examinee is required to
iLh~nlif\' the Jr~·l\\·in~lliat d()t?~ 11\)t b('ll.'n~j" \1. ~LL tl'lL" \)tL[r:)~ il1cHcatll1U L.>~
I.~hoice' bv. noinlina
1.':>
'or Ill)dding'-' (see
.
Fil!:~'38).
t..'
To heij2hl~ll
•...
interc:;l"'and
.J
appeal, the cards and dra\yings are varicolored. The objects depicted were THE pnOBLE:\f. The testing of persons with highly dissimilar cultural
chosen to be within the range of experience of most American children. backgrounds has received increasing attention since midcentnry. Tests are
Scores are expressed as Age Deviation Scores, which are normalized needed for the maximum utilization of hl:man resources in the newly de-
standard scores within age groups, with a mean of 100 and an SD of 16. veloping nations in Africa and elsewhere. The rapidly expandiner educa-
Percentile and stanine equivalents for these scores are also provided. To tional facilities in these countries require testing for admission ~urposcs
meet the demand for developmental norms, the manual includes a Ma- as well as for indi\idual counseling. \\'ith increasing industrialization,
: turitylndex, indicating the age group in the standardization sample there is a mounting demand for tests to aid in the job selection and place-
whose test performance is most similar to that of the child. I:1ent of personnel, particularly in mechanical, clerical, and professional
helds.
In America the practical problems of cross-cultural testing have been
associated chiefly with subcultures or minority cultures within the
dominant culture. There has been widespread concern regarding the
applicability of available tests to culturally disadvantaged groups. It should
be noted parenthetically that cultural disadvantage is a relative concept.
Objectively there is only cultural difference between any two cultures or
subcultures. Each culture fosters and encourages the development of
behavior that is adapted to its values and demands. When an individual
must adjust to and compete within a culture or subculture other than that
in which he was reared, then cultural difference is likely to become
cultural disadvantage.
Although concern with cross-cultural testing has been greatly stimulated
by recent social and political developments, the problem was recognized
at least as early as 1910. Some of the earliest cross-cultural tests
were developed for testing the large waves of immigrants coming to the
United States at the turn of the century. Other early tests originated in
basic research on the comparative abilities of relatively isolated cultural
groups. These cultures were often quite primitive and had had little or
no contact with Western civilization within whose framework most
psychological tests had been developed.

Traditionally, cross-cultural tests have tried to rule out one or more
parameters along which cultures vary. A well-known example of such a
parameter is language. If the cultural groups to be tested spoke different
languages, tests were developed that required no language on the part of
either examiner or subjects. When educational backgrounds differed
widely and illiteracy was prevalent, reading was ruled out. Oral language
was not eliminated from these tests because they were designed for
persons speaking a common language. Another parameter in which cultures
or subcultures differ is that of speed. Not only the tempo of daily life, but
also the motivation to hurry and the value attached to rapid performance
vary widely among national cultures, among ethnic minority groups
within a single nation, and between urban and rural subcultures (see,
e.g., Klineberg, 1925; Knapp, 1960). Accordingly, cross-cultural tests have
often, though not always, tried to eliminate the influence of speed by
allowing long time limits and giving no premium for faster performance.

Still other parameters along which cultures differ pertain to test
content. Most nonlanguage and nonreading tests, for example, call for items
of information that are specific to certain cultures. Thus, they may
require the examinee to understand the function of such objects as violin,
postage stamp, gun, pocketknife, telephone, piano, or mirror. Persons
reared in certain cultures may lack the experiential background to
respond correctly to such items. It was chiefly to control this type of
cultural parameter that the classic "culture-free" tests were first developed.
Following a brief examination of typical tests designed to eliminate one
or more of the above parameters, we shall turn to an analysis of
alternative approaches to cross-cultural testing.

FIG. 39. Examiner Administering Columbia Mental Maturity Scale to Child.
(From Columbia Mental Maturity Scale: Guide for Administering and Interpreting,
1972, p. 11. Copyright © 1972 by Harcourt Brace Jovanovich, Inc. Reproduced by
permission.)

The standardization sample for the CMMS comprised 2,600 children,
including 100 boys and 100 girls in each of 13 six-month age groups
between the ages of 3-6 and 9-11. The sample was stratified in terms of the
1960 U.S. Census with regard to parental occupational level, race, and
geographical region; proportion of children living in metropolitan and
nonmetropolitan areas was also approximately controlled. Split-half
reliabilities within single age groups ranged from .85 to .91. Standard errors
of measurement of the Age Deviation Scores are between 5 and 6 points.
Retest of three age groups after an interval of 7 to 10 days yielded
reliabilities of .84 to .86. A correlation of .67 with Stanford-Binet was found
in a group of 52 preschool and first-grade children. Correlations with
achievement test scores in first- and second-grade samples fell mostly
between the high .40's and the low .60's. More extensive data on validity
and on applicability to various handicapped groups are available for an
earlier form of the test.
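The CMMS figures quoted in this section (Age Deviation Scores with an SD of 16, split-half reliabilities of .85 to .91, and standard errors of measurement between 5 and 6 points) can be checked against the conventional formula SEM = SD x sqrt(1 - r). The Python sketch below does only that arithmetic. The function name is illustrative, and the sketch assumes the reported SEMs were derived with this standard formula, which the text does not state.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Conventional formula: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# SD of the Age Deviation Scores, as reported in the text.
SD_AGE_DEVIATION = 16.0

# Lower and upper ends of the reported split-half reliability range.
for r in (0.85, 0.91):
    sem = standard_error_of_measurement(SD_AGE_DEVIATION, r)
    print(f"r = {r:.2f}  ->  SEM = {sem:.1f} points")
```

Under these assumptions the SEM runs from about 4.8 points (r = .91) to about 6.2 points (r = .85), which is consistent with the reported range of 5 to 6 points.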
TYPICAL INSTRUMENTS. In their efforts to construct tests applicable
across cultures, psychometricians have followed a variety of procedures,
some of which are illustrated by the four tests to be considered in this
section. The Leiter International Performance Scale is an individually
administered performance scale. It was developed through several years
of use with different ethnic groups in Hawaii, including elementary and
high school pupils. The scale was subsequently applied to several African
groups by Porteus and to a few other national groups by other
investigators. A later revision, issued in 1948, was based on further testing
of American children, high school students, and Army recruits during
World War II. A distinctive feature of the Leiter scale is the almost
complete elimination of instructions, either spoken or pantomime. Each test
begins with a very easy task of the type to be encountered throughout
that test. The comprehension of the task is treated as part of the test.
The materials consist of a response frame, illustrated in Figure 40, with
an adjustable card holder. All tests are administered by attaching the
appropriate card, containing printed pictures, to the frame. The examinee
chooses the blocks with the proper response pictures and inserts them
into the frame.

FIG. 40. Typical Materials for Use in the Leiter International Performance
Scale. The test illustrated is the Analogies Progression Test from the Six-Year
Level.
(Courtesy C. H. Stoelting Company.)

The Leiter scale was designed to cover a wide range of functions,
similar to those found in verbal scales. Among the tasks included may be
mentioned: matching identical colors, shades of gray, forms, or pictures;
copying a block design; picture completion; number estimation; analogies;
series completion; recognition of age differences; spatial relations;
footprint recognition; similarities; memory for a series; and classification of
animals according to habitat. Administered individually, with no time
limit, these tests are arranged into year levels from 2 to 18. The scale is
scored in terms of MA and ratio IQ, although there is no assurance that
such an IQ retains the same meaning at different ages. In fact, the
published data show considerable fluctuation in the standard deviation of the
IQ's at different age levels. Split-half reliabilities of .91 to .94 are reported
from several studies, but the samples were quite heterogeneous in age
and probably in other characteristics. Validation data are based
principally on age differentiation and internal consistency. Some correlations
are also reported with teachers' ratings of intelligence and with scores on
other tests, including the Stanford-Binet and the WISC. These correlations
range from .56 to .92, but most were obtained on rather heterogeneous
groups.

The tests in year levels 2 to 12 are also available as the Arthur
Adaptation of the Leiter International Performance Scale. This adaptation,
considered most suitable for testing children between the ages of 3 and 8
years, was standardized by Grace Arthur in 1952. Its norms must be
regarded as quite limited, being derived from a standardization sample of
289 children from a middle-class, midwestern metropolitan background.
Like the original scale, the Arthur adaptation yields an MA and a
ratio IQ.

The Culture Fair Intelligence Test, developed by R. B. Cattell and
published by the Institute for Personality and Ability Testing (IPAT), is a
in the circle. This condition can be met only in the third response
alternative, which has been marked.

For Scale 1, only ratio IQ's are provided. In Scales 2 and 3, scores can
be converted into deviation IQ's with an SD of 16 points. Scales 2 and 3
have been standardized on larger samples than Scale 1, but the
representativeness of the samples and the number of cases at some age levels
still fall short of desirable test-construction standards. Although the tests
are highly speeded, some norms are provided for an untimed version.
Fairly extensive verbal instructions are required, but the author asserts
that giving these instructions in a foreign language or in pantomime will
not affect the difficulty of the test.
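The passage above contrasts two kinds of score: ratio IQ's (the only kind provided for Scale 1) and deviation IQ's with an SD of 16 (Scales 2 and 3). As a rough illustration of the difference, the Python sketch below computes both. The function names, the example child, and the age-group norm values are hypothetical. The mean of 100 for the deviation IQ is the conventional value and is assumed here, since the text states only the SD. The deviation score is computed by simple linear rescaling, with no normalization step.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio IQ: 100 * MA / CA, with both ages in the same unit."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


def deviation_iq(raw_score: float, age_group_mean: float, age_group_sd: float,
                 mean: float = 100.0, sd: float = 16.0) -> float:
    """Deviation IQ: the z score within the age group, rescaled to mean and sd.

    The default mean of 100 is an assumption; the text gives only SD = 16.
    """
    if age_group_sd <= 0:
        raise ValueError("age-group SD must be positive")
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z


# Hypothetical child: mental age 7-6 (90 months), chronological age 6-0 (72 months).
print(ratio_iq(90, 72))         # 125.0

# Hypothetical norms: raw-score mean 20 and SD 5 in the child's age group.
print(deviation_iq(25, 20, 5))  # 116.0
```

A deviation IQ of 116 always means one standard deviation above the age-group mean, whatever the age. A ratio IQ carries no such guarantee when the spread of ratio IQ's differs from one age level to another.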
paper-and-pencil test. This test is available in three levels: Scale 1, for
ages 4 to 8 and mentally retarded adults; Scale 2, for ages 8 to 13 and
average adults; and Scale 3, for grades 10 to 16 and superior adults. Each
scale has been prepared in two parallel forms, A and B. Scale 1 requires
individual administration for at least some of the tests; the other scales
may be given either as individual or as group tests. Scale 1 comprises
eight tests, only four of which are described by the author as culture-fair.
The other four involve both verbal comprehension and specific cultural
information. It is suggested that the four culture-fair tests can be used as
a sub-battery, separate norms being provided for this abbreviated scale.
Scales 2 and 3 are alike, except for difficulty level. Each consists of the
following four tests, sample items from which are shown in Figure 41.

1. Series: Select the item that completes the series.
2. Classification: Mark the one item in each row that does not belong with
the others.
3. Matrices: Mark the item that correctly completes the given matrix, or
pattern.
4. Conditions: Insert a dot in one of the alternative designs so as to meet
the same conditions indicated in the sample design. Thus, in the example
reproduced in Figure 41, the dot must be in the two rectangles, but not

FIG. 41. Sample Items from Culture Fair Intelligence Test, Scale 2.
(Copyright by Institute of Personality and Ability Testing.)

Internal consistency and alternate-form reliability coefficients are
marginal, especially for Scale 3, where they fall mostly in the .50's and .60's.
Validity is discussed chiefly in terms of saturation with a general
intellective factor (g), having been investigated largely through correlation
with other tests and through factor analysis. Scattered studies of
concurrent and predictive validity show moderate correlations with various
academic and occupational criteria. The Cattell tests have been
administered in several European countries, in America, and in certain African
and Asian cultures. Norms tended to remain unchanged in cultures
moderately similar to that in which the tests were developed; in other
cultures, however, performance fell considerably below the original norms.
Moreover, black children of low socioeconomic level tested in the United
States did no better on this test than on the Stanford-Binet (Willard,
1968).

The Progressive Matrices, developed in Great Britain by Raven, were
also designed as a measure of Spearman's g factor. Requiring chiefly the
eduction of relations among abstract items, this test is regarded by most
British psychologists as the best available measure of g. It consists of 60
matrices, or designs, from each of which a part has been removed. The
subject chooses the missing insert from six or eight given alternatives. The
items are grouped into five series, each containing 12 matrices of
increasing difficulty but similar in principle. The earlier series require accuracy
of discrimination; the later, more difficult series involve analogies,
permutation and alternation of pattern, and other logical relations. Two sample
items are reproduced in Figure 42. The test is administered with no time
limit, and can be given individually or in groups. Very simple oral
instructions are required.

Percentile norms are provided for each half-year interval between 8
and 14 years, and for each five-year interval between 20 and 65 years.
These norms are based on British samples, including 1,407 children, 3,665
men in military service tested during World War II, and 2,192 civilian
adults. Closely similar norms were obtained by Rimoldi (1943) on 1,680
children in Argentina. Use of the test in several European countries
likewise indicated the applicability of available norms. Studies in a number
of non-European cultures, however, have raised doubts about the
suitability of this test for groups with very dissimilar backgrounds. In such
groups, moreover, the test was found to reflect amount of education and
to be susceptible to considerable practice effect.

The manual for the Progressive Matrices is quite inadequate, giving
little information on reliability and none on validity. Many investigations
have been published, however, that provide relevant data on this test. In
a review of publications appearing prior to 1957, Burke (1958) lists over
50 studies appearing in England, 14 in America, and 10 elsewhere. Since
that time, research has continued at a rapid pace, especially in America
where this test has received growing recognition. The Seventh Mental
Measurements Yearbook lists nearly 400 studies, many dealing with the
use of this test with clinical patients.

Retest reliability in groups of older children and adults that were
moderately homogeneous in age varies approximately between .70 and .90. At
the lower score ranges, however, reliability falls considerably below these
values. Correlations with both verbal and performance tests of
intelligence range between .40 and .75, tending to be higher with performance
than with verbal tests. Studies with the mentally retarded and with
different occupational and educational groups indicate fair concurrent
validity. Predictive validity coefficients against academic criteria run
somewhat lower than those of the usual verbal intelligence tests. Several
factorial analyses suggest that the Progressive Matrices are heavily loaded
with a factor common to most intelligence tests (identified with
Spearman's g by British psychologists), but that spatial aptitude, inductive
reasoning, perceptual accuracy, and other group factors also influence
performance (Burke, 1958).

An easier form, the Coloured Progressive Matrices, is available for
children between the ages of 3 and 11 years and for mentally retarded adults.
A more advanced form has also been developed for superior adults, but
its distribution is restricted to approved and registered users.

FIG. 42. Sample Items from the Progressive Matrices.
(Reproduced by permission of J. C. Raven.)

A still different approach is illustrated by the Goodenough Draw-a-Man
Test, in which the examinee is simply instructed to "make a picture
of a man; make the very best picture that you can." This test was in use
without change from its original standardization in 1926 until 1963. An
extension and revision was published in 1963 under the title of
Goodenough-Harris Drawing Test (D. B. Harris, 1963). In the revision, as in
the original test, emphasis is placed on the child's accuracy of observation
and on the development of conceptual thinking, rather than on artistic
skill. Credit is given for the inclusion of individual body parts, clothing
details, proportion, perspective, and similar features. A total of 73 scorable
items were selected on the basis of age differentiation, relation to total
scores on the test, and relation to group intelligence test scores. Data for
this purpose were obtained by testing samples of 50 boys and 50 girls at
each grade level from kindergarten through the ninth grade in urban and
rural areas of Minnesota and Wisconsin, stratified according to father's
occupation.

In the revised scale, subjects are also asked to draw a picture of a
woman and of themselves. The Woman scale is scored in terms of 71
items similar to those in the Man scale. The Self scale was developed as a
projective test of personality, although available findings from this
application are not promising. Norms on both Man and Woman scales were
established on new samples of 300 children at each year of age from 5 to
10, selected so as to be representative of the United States population
with regard to father's occupation and geographical region. Point scores
on each scale are transmuted into standard scores with a mean of 100 and
an SD of 15. In Figure 43 will be found three illustrative drawings
produced by children aged 5-8, 8-8, and 12-11, together with the
corresponding raw point scores and standard scores. An alternative, simplified
scoring procedure is provided by the Quality scales for both Man and
Woman drawings. Instead of the point scoring, the Quality scales utilize
a global, qualitative assessment of the entire drawing, obtained by
matching the child's drawing with the one it resembles most closely in a graded
series of 12 samples.

The reliability of the Draw-a-Man Test has been repeatedly
investigated by a variety of procedures. In one carefully controlled study of the
earlier form administered to 386 third- and fourth-grade schoolchildren,
scales, information regarding the constrnct validity of the test i~ provided
by correlations with other intelligf'l1ce tests. These con-elations vary
widely, but the rnajorit~, arc over .50. In a study with 100 fourth-grade
children, correlations were found between the Draw-a-\1an Test and a
number of tests of known factorial composition (Ansbacher, 1952). Such
correlations indicated that, within the ages covered, the Draw-a-\lan
Test correlates highest with tests of reasoning, spatial aptitude, and per-
ceptual accuracy. \lotor coordination pla~'s a negligible role in the test
at these ages. For kindergarten children, the Draw-a-\1an Test correlated
higher with numerical aptitude and lower with perceptual speed and
accuracy than it did for fourth-grade children (D. B. Harris, 196:3). Such
findings suggest that the test may measure somewhat different functions
at different ages.
The original Draw-a-"t\'Ian Test has been administered widel" in clinics
as a supplement to the Stanford-Binet and other verbal scales.' It has also
been employed in a large number of studies on different cultural and
ethnic groups, including several American Indian samples. Such investi-
gations have indicated that performance on this test is more dependent on
differences in cultural background than was originally assumed. In a re-
"'omon, Raw Score 31 Mon, Row Score 66 view of studies pertaining to this test, Goodenough and Harris (1950, p.
Man, Raw Score 7 .-
CA 5-8 CA 8-8 CA 12-11 399) expressed the opinion that "the search for a culture-free test,
~ Standard S,ore 103 Standard Score 134
Standard Score 7 ..5 - - whether of intelligence, artistic ability, personal-social characteristics, or
FIG. 43. Specimen Drawings Obtained in Goodenough-Harris Drawing Test. an~' other measurable trait is illusorY." This view was reaffirmed bv Harris
(Courtesy Dale B. Harris. ) in his 196:3 book. :\lore recentl~', Dennis (1966) analyzed comixll'ative
data obtained with this test in 40 \Videl~' different cultural groups, prin-
the retest correlation after a one-week interval was .6'30, and split-half cipally from 6-year-old children. 1\1ean group scores appeared to be most
reliability was .89 (McCarthy, }9H). Rescoring of the identical dr.a\\-ings closel~' related to the amount of experience with representational art
\vithin each culture. In the case of groups ,,;ith little indigenous art, it
by a difT'erent scorer yielded r, scorer relia bilit\' of .90, and resconnp b]-
\\'as hypothesized that test performance reflects degree of acculturation
tl;e same scorer correlated .94. Studies with the new form (Dunn, 196 i;
to \ \' estern civilization. ~'
D. B. Harris, 1963) have yielded similar results. Readrninistration of, the
test to groups of kindergarten children on consecutive days re.vealen no Cultural differences in experiential background were again rewaled in
a well-designed comparative investigatio;1 of l\Iexican~ and American
significant difference in performance on different days. ~X~l1ml:er effect
was also found to be 11<:,gligible,as was the effect of art trammg m school. children with the Goodenough-Harris test (Laosa, Swartz. & Diaz-
The old and new scales are apparently quite similar; their scores cor- Guerrero, 1974). In studies of this test in i\igeria (Bakare, 1972) and
relate between .q and .98 in homogeneous age groups. The correlation Turkey (U9man, 1972), mean scores increased consistently and signifi-
of the }'1an and Woman scales is about as high as the split-half reliability cantly \\'ith the children's socioeconomic level. It should be added that
of the Man scale found in comparable samples. On this basis, Harris rec- these findings with the Goodenough-Harr:s test are typical of results ob-
ommends that the two scales be regarded as alternate forms and that the tained with all tests initially designed to be "culture-free" o'r "culture-
fai,r."
mean of their standard scores he used for greater reliability. The Quality
scales, representing a quicker but crueler scoring metholl, yield interscorn
Icliabilities dusterin':!, in the .80's. Correlations of about the same mao;-
nitude have been fo~md between Quality scale ratings and point scon:s AI'PHOACHLS TO CHOSS'CVLTURAL TESTI:\'f;, Theoreticallv we can identify
thr.:e npproadws to the c1evclonment of if",!, for n,C're;))],: [(,'2rce:1 in cl;{-
obl8ilipd for the same clrawin£':s.
f('rent cultUTi"S or subcultures, , . hO;i ir, ill ,-;f'tjr.(' ;",.,':,- [,,;;':r'- ";,, , : '::11
,\na~t froln the ilf:'ln-analvsis--'data q~ltht'li·-d in. the JC\'e~ 'lprncllt of the.
~ 1.." • '-'
three may be combined. The first approach involves the choice of items common to many cultures and the validation of the resulting test against local criteria in many different cultures. This is the basic approach of the culture-fair tests, although their repeated validation in different cultures has often been either neglected altogether or inadequately executed. Without such a step, however, we cannot be sure that the test is relatively free from culturally restricted elements. Moreover, it is unlikely that any single test could be designed that would fully meet these requirements across a wide range of cultures.

On the other hand, cross-cultural assessment techniques are needed for basic research on some very fundamental questions. One of these questions pertains to the generality of psychological principles and constructs derived within a single culture (Anastasi, 1958a, Ch. 18). Another question concerns the role of environmental conditions in the development of individual differences in behavior, a problem that can be more effectively studied within the wide range of environmental variation provided by highly dissimilar cultures. Research of this sort calls for instruments that can be administered under at least moderately comparable conditions in different cultures. Safeguards against incorrect interpretations of results obtained with such instruments should be sought in appropriate experimental designs and in the investigators' thorough familiarity with the cultures or subcultures under investigation.

A second major approach is to make up a test within one culture and administer it to individuals with different cultural backgrounds. Such a procedure would be followed when the object of testing is prediction of a local criterion within a particular culture. In such a case, if the specific cultural loading of the test is reduced, the test validity may also drop, since the criterion itself is culturally loaded. On the other hand, we should avoid the mistake of regarding any test developed within a single cultural framework as a universal yardstick for measuring "intelligence." Nor should we assume that a low score on such a test has the same causal explanation when obtained by a member of another culture as when obtained by a member of the test culture. What can be ascertained by such an approach is the cultural distance between groups, as well as the individual's degree of acculturation and his readiness for educational and vocational activities that are culture-specific.

As a third approach, different tests may be developed within each culture and validated against local criteria only. This approach is exemplified by the many revisions of the original Binet scales for use in different European, Asian, and African cultures, as well as by the development of tests for industrial and military personnel within particular cultures. A current example is provided by the test-development program conducted in several developing nations of Africa, Asia, and Latin America by the American Institutes for Research, under the sponsorship of the United States Agency for International Development (Schwarz, 1964a, 1964b; Schwarz & Krug, 1972). Another example is the long-term testing program of the National Institute of Personnel Research in Johannesburg (Blake, 1972). In such instances, the tests are validated against the specific educational and vocational criteria they are designed to predict, and performance is evaluated in terms of local norms. Each test is applied only within the culture in which it was developed and no cross-cultural comparisons are attempted. If the criteria to be predicted are technological, however, "Western-type intelligence" is likely to be needed, and the tests will reflect the direction in which the particular culture is evolving rather than its prevalent cultural characteristics at the time (see also Vernon, 1969, Ch. 14).

Attention should also be called to the publication, in the late 1960s and early 1970s, of several handbooks concerned with cross-cultural testing and research, and with the use of tests in developing countries (Biesheuvel, 1969; Brislin, Lonner, & Thorndike, 1974; Schwarz & Krug, 1972). All provide information on recommended tests, adaptations of standardized tests, and procedural guidelines for the development and application of tests. Further indication of the widespread interest in cross-cultural testing can be found in the report of an international conference on Mental Tests and Cultural Adaptation held in 1971 in Istanbul, Turkey (Cronbach & Drenth, 1972). The papers presented at this conference reflect the wide diversity of interests and backgrounds of the participants. The topics range from methodological problems and evaluations of specific instruments to theoretical discussions and reports of empirical studies.

The principal focus in both the handbooks and the conference report is on major cultural differences, as found among nations and among peoples at very different stages in their cultural evolution. In addition, a vast amount of literature has accumulated in the decades of the 1960s and 1970s on the psychological testing of minorities in the United States, chiefly for educational and vocational purposes. In the present book, this material is treated wherever it can be most clearly presented. Thus in Chapter 3, the focus was on social and ethical concerns and responsibilities in the use of tests with cultural minorities. The technical psychometric problems of test bias and item-group interactions were considered in Chapters 7 and 8. In the present chapter, attention was centered on instruments developed for cross-cultural ability testing. Problems in the interpretation of the results of cross-cultural testing will be considered in Chapter 12.

A final point should be reiterated about the instruments discussed in this section. Although initially developed for cross-cultural testing, several of these instruments have found a major application in the armamentarium of the clinical psychologist, both to supplement information obtained with such instruments as the Stanford-Binet and the Wechsler scales and in the testing of persons with various handicaps. This is especially true of the Goodenough-Harris Drawing Test, the Progressive Matrices, and the Arthur Adaptation of the Leiter scale.

CHAPTER 11
Group Testing
While individual tests such as the Stanford-Binet and the Wechsler scales find their principal application in the clinic, group tests are used primarily in the educational system, civil service, industry, and the military services. It will be recalled that mass testing began during World War I with the development of the Army Alpha and the Army Beta for use in the United States Army. The former was a verbal test designed for general screening and placement purposes. The latter was a nonlanguage test for use with men who could not properly be tested with the Alpha owing to foreign-language background or illiteracy. The pattern established by these tests was closely followed in the subsequent development of a large number of group tests for civilian application.

Revisions of the civilian forms of both original army tests are still in use as Alpha Examination, Modified Form 9 (commonly known as Alpha 9) and as Revised Beta Examination. In the armed services, the Armed Forces Qualification Test (AFQT) is now administered as a preliminary screening instrument, followed by classification batteries developed within each service for assignment to occupational specialties. The AFQT provides a single score based on an equal number of vocabulary, arithmetic, spatial relations, and mechanical ability items.

In this chapter, major types of group tests in current use will be surveyed. First we shall consider the principal differences between group and individual tests. Then we shall discuss the characteristics of multilevel batteries designed to cover a wide age or grade range, with typical illustrations from different levels. Finally, group tests designed for use at the college level and beyond will be examined.
ADVANTAGES OF GROUP TESTING. Group tests are designed primarily as instruments for mass testing. In comparison with individual tests, they have both advantages and disadvantages. On the positive side, group tests can be administered simultaneously to as many persons as can be fitted comfortably into the available space and reached through a microphone. Large-scale testing programs were made possible by the development of group testing techniques. By utilizing only printed items and simple responses that can be recorded on a test booklet or answer sheet, the need for a one-to-one relationship between examiner and examinee was eliminated.

A second way in which group tests facilitated mass testing was by greatly simplifying the examiner's role. In contrast to the extensive training and experience required to administer the Stanford-Binet, for example, most group tests require only the ability to read simple instructions to the examinees and to keep accurate time. Some preliminary training sessions are desirable, of course, since inexperienced examiners are likely to deviate inadvertently from the standardized procedure in ways that may affect test results. Because the examiner's role is minimized, however, group testing can provide more uniform conditions than does individual testing. The use of tapes, records, and film in test administration offers further opportunities for standardizing procedure and eliminating examiner variance in large-scale testing.

Scoring is typically more objective in group testing and can be done by a clerk. Most group tests can also be scored by computers through several available test-scoring services. Moreover, whether hand-scored or machine-scored, group tests usually provide separate answer sheets and reusable test booklets. Since in these tests all responses are written on the answer sheet, the test booklets can be used indefinitely until they wear out, thereby effecting considerable economy. Answer sheets also take up less room than test booklets and hence can be more conveniently filed for large numbers of examinees.

From another angle, group tests characteristically provide better established norms than do individual tests. Because of the relative ease and rapidity of gathering data with group tests, it is customary to test very large, representative samples in the standardization process. In the most recently standardized group tests, it is not unusual for the normative samples to number between 100,000 and 200,000, in contrast to the 2,000 to 4,000 cases laboriously accumulated in standardizing the most carefully developed individual intelligence scales.

Group tests necessarily differ from individual tests in form and arrangement of items. Although open-ended questions calling for free responses could be used (and were used in the early group tests), today the typical group test employs multiple-choice items. This change was obviously required for uniformity and objectivity of scoring, whether by hand or machine. With regard to arrangement of items, whereas the Binet type of scale groups items into age levels, group tests characteristically group items of similar content into separately timed subtests. Within each subtest, items are usually arranged in increasing order of difficulty. This arrangement ensures that each examinee has an opportunity to try each type of item (such as vocabulary, arithmetic, spatial, etc.) and to complete the easier items of each type before trying the more difficult ones on which he might otherwise waste a good deal of time.

A practical difficulty encountered with separate subtests, however, is that the less experienced or less careful examiners may make timing errors. Such errors are more likely to occur and are relatively more serious with several short time limits than with a single long time limit for the whole test. To reconcile the use of a single time limit with an arrangement permitting all examinees to try all types of items at successively increasing difficulty levels, some tests utilize the spiral-omnibus format. One of the earliest tests to introduce this format was the Otis Self-Administering Tests of Mental Ability which, as its name implies, endeavored to reduce the examiner's role to a minimum. The same arrangement is followed in the Otis-Lennon Mental Ability Test from the fourth-grade level up. In a spiral-omnibus test, the easiest items of each type are presented first, followed by the next easiest of each type, and so on in a rising spiral of difficulty level, as illustrated below:

1. The opposite of hate is:                                          Answer
   1. enemy, 2. fear, 3. love, 4. friend, 5. joy .................. (   )
2. If 3 pencils cost 25 cents, how many pencils can be bought for
   75 cents? ...................................................... (   )
3. A bird does not always have:
   1. wings, 2. eyes, 3. feet, 4. a nest, 5. a bill ............... (   )
4. The opposite of honor is:
   1. glory, 2. disgrace, 3. cowardice, 4. fear, 5. defeat ........ (   )

In order to avoid the necessity of repeating instructions in each item and to reduce the number of shifts in instructional set required of the examinees, some tests apply the spiral-omnibus arrangement not to single items but to blocks of 5 to 10 items. This practice is followed, for example, in the Armed Forces Qualification Test and in the Scholastic Aptitude Test of the College Entrance Examination Board.

DISADVANTAGES OF GROUP TESTING. Although group tests have several desirable features and serve a well-nigh indispensable function in present-day testing, their limitations should also be noted. In group testing, the examiner has much less opportunity to establish rapport, obtain cooperation, and maintain the interest of examinees. Any temporary condition of the examinee, such as illness, fatigue, worry, or anxiety, that may interfere with test performance is less readily detected in group than in individual testing. In general, persons unaccustomed to testing may be
somewhat more handicapped on group than on individual tests. There is also some evidence suggesting that emotionally disturbed children may perform better on individual than on group tests (Bower, 1969; Willis, 1970).

From another angle, group tests have been attacked because of the restrictions imposed on the examinee's responses. Criticisms have been directed particularly against multiple-choice items and against such standard item types as analogies, similarities, and classification (Hoffman, 1962; LaFave, 1966). Some of the arguments are ingenious and provocative. One contention is that such items may penalize a brilliant and original thinker who sees unusual implications in the answers. It should be noted parenthetically that if this happens, it must be a rare occurrence in view of the item analysis and validity data. Some critics have focused on the importance of analyzing errors and inquiring into the reasons why an individual chooses a particular answer, as in the typical Piagetian approach (Sigel, 1963). It is undoubtedly true that group tests provide little or no opportunity for direct observations of the examinee's behavior or for identifying the causes of atypical performance. For this and other reasons, when important decisions about individuals are to be made, it is desirable to supplement group tests either with individual examination of doubtful cases or with additional information from other sources.

Still another limitation of traditional group testing is its lack of flexibility, insofar as every examinee is ordinarily tested on all items. Available testing time could be more effectively utilized if each examinee concentrated on items appropriate to his ability level. Moreover, such a procedure would avoid boredom from working on too easy items, at one extreme, and mounting frustration and anxiety from attempting items beyond the individual's present ability level, at the other. It will be recalled that in some individual tests, such as the Stanford-Binet and the Peabody Picture Vocabulary Test, the selection of items to be presented by the examiner depends upon the examinee's prior responses. Thus in these tests the examinee is given only items within a difficulty range appropriate to his ability level.

COMPUTER UTILIZATION IN GROUP TESTING. In the effort to combine some of the advantages of individual and group testing, several innovative techniques are being explored. Major interest thus far has centered on ways of adjusting item coverage to the response characteristics of individual examinees. In the rapidly growing literature on the topic, this approach has been variously designated as adaptive, sequential, branched, tailored, individualized, programed, dynamic, or response-contingent testing (Baker, 1971; Glaser & Nitko, 1971; Weiss & Betz, 1973). Although it is possible to design paper-and-pencil group tests that incorporate such adaptive procedures (Cleary, Linn, & Rock, 1968; Lord, 1971a), these techniques lend themselves best to computerized test administration.

FIG. 44. Two-Stage Adaptive Testing with Three Measurement Levels. Each examinee takes routing test and one measurement test.

Adaptive testing can follow a wide variety of procedural models (DeWitt & Weiss, 1974; Larkin & Weiss, 1975; Weiss, 1974; Weiss & Betz, 1973). A simple example involving two-stage testing is illustrated in Figure 44. In this hypothetical test, all examinees take a 10-item routing test, whose items cover a wide difficulty range. Depending on his performance on the routing test, each examinee is directed to one of the three 20-item measurement tests at different levels of difficulty. Thus each person takes only 30 items, although the entire test comprises 70 items. A different arrangement is illustrated in the pyramidal test shown in Figure 45. In this case, all examinees begin with an item of intermediate difficulty. If an individual's response to this item is correct, he is routed upward to the next more difficult item; if his response is wrong, he moves downward to the next easier item. This procedure is repeated after each item response, until the individual has given 10 responses. The illustration shows a 10-stage test, in which each examinee is given 10 items out of the total pool of 55 items in the test. The heavy line shows the route followed by one examinee whose responses are listed as + or - along the top.

FIG. 45. Pyramidal Testing Model. Heavy line shows route of examinee whose item scores are listed across top.

Several variants of both of these adaptive testing models have been tried, with both paper-and-pencil and computer forms. More complex models, which do not use preestablished, fixed patterns of item sequences, are feasible only when a computer is available.¹ In general, research by various approaches indicates that adaptive tests can achieve the same reliability and validity as conventional tests with much less testing time (Larkin & Weiss, 1975; Weiss & Betz, 1973). Adaptive testing also provides greater precision of measurement for individuals at the upper and lower extremes of the ability range covered by the test (Lord, 1970, 1971b, 1971c; Weiss & Betz, 1973). In addition, adaptive testing is especially appropriate for use in the individualized instructional programs cited in Chapter 4, in which students proceed at their own pace and may therefore require test items at widely different difficulty levels. Computerized testing makes it possible to stop testing as soon as the individual's responses provide enough information for a decision about his content mastery.

¹ For examples of diverse variants and models, see Weiss and Betz (1973), DeWitt and Weiss (1974), and Weiss (1974).

Besides providing the opportunity to adapt testing to the abilities and needs of individual examinees, computers can help to circumvent other limitations of traditional group tests (Baker, 1971; Glaser & Nitko, 1971; B. F. Green, 1970). One potential contribution is the analysis of wrong responses, in order to identify types of error in individual cases. Another is the use of response modes that permit the examinee to try alternative responses in turn, with immediate feedback, until the correct response is chosen. Still another possibility is the development of special response procedures and item types to investigate the examinee's problem-solving techniques. For instance, following the initial presentation of the problem, the examinee may have to ask the computer for further information needed to proceed at each step in the solution; or he may be required to respond by indicating the steps he follows in arriving at the solution. Considerable research is also in progress with several relatively unrestricted response modes, such as underlining the appropriate words in sentences or constructing single-word responses.

OVERVIEW. Unlike individual scales of the Binet type and computerized or other adaptive tests, traditional group tests present the same items to all examinees, regardless of their individual responses. For this reason, any given group test must cover a relatively restricted range of difficulty, suitable for the particular age, grade, or ability level for which it is designed. To provide comparable measures of intellectual development over a broader range, series of overlapping multilevel batteries have been constructed. Thus, any one individual is examined only with the level appropriate to him, but other levels can be used for retesting him in subsequent years or for comparative evaluations of different age groups. The fact that successive batteries overlap provides adequate floor and ceiling for individuals at the extremes of their own age or grade distributions. It should be recognized, of course, that the match between item difficulty and examinee ability provided by multilevel batteries is at best approximate. Unlike the individualized procedures described in the preceding section, moreover, the match is based on prior knowledge about the examinee, such as his age or school grade, rather than on his own test responses.

Multilevel batteries are especially suitable for use in the schools, where comparability over several years is highly desirable. For this reason the levels are typically described in terms of grades. Most multilevel batteries provide a reasonable degree of continuity with regard to content or intellectual functions covered. Scores are expressed in terms of the same
scale of units throughout. The normative samples employed at different levels are also more nearly equivalent than would be true of independently standardized tests.

Table 24 contains a list of representative multilevel batteries, together with the range of grades covered by each level. It will be noted that the individual ranges are quite narrow, most levels covering two to three grades. The total range that can be uniformly tested with a given multilevel battery, on the other hand, may extend from kindergarten to college.

The names of the batteries are also of interest. Such terms as "intelligence," "general ability," "mental ability," "mental maturity," "learning potential," "school-and-college ability," and "educational ability," are used to designate essentially the same type of test. In the psychometrician's vocabulary, these terms are virtually synonymous and interchangeable. It is noteworthy that, in the most recently developed or revised batteries, the term "intelligence" has been replaced by more specific designations. This change reflects the growing recognition that the term "intelligence" has acquired too many excess meanings, which may lead to misinterpretations of test scores. One of the newest batteries listed in Table 24, entitled "Analysis of Learning Potential", is explicitly directed toward the prediction of academic performance. Utilizing several ingenious new item types, the development of this battery began with a detailed analysis of the abilities required by school learning tasks. It thus reflects a growing emphasis on the measurement of prerequisite intellectual skills for school work and other activities of daily life. Its primary function is to assess the individual's readiness for school learning at each stage in the educational process.

Most of the batteries listed in Table 24 provide deviation IQ's or similar standard scores. Some batteries furnish several types of norms, including percentiles, stanines, or grade equivalents, as well as deviation IQ's. In addition to a total, global score, most batteries also yield separate verbal and quantitative, or linguistic and nonlinguistic scores. This breakdown is in line with the finding that an individual's performance in verbal and in other types of subtests may be quite dissimilar, especially at the upper levels. Attempts have also been made (e.g., CTMM) to report norms for the interpretation of scores on separate subtests, or combinations of subtests, representing narrower breakdowns. This practice is not to be recommended, however, because these part scores are usually too unreliable and too highly intercorrelated to permit meaningful interpretation of intraindividual differences. In general, the types of tests discussed in this chapter are suitable for assessing general intellectual development rather than relative standing in different aptitudes.
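The caution about part scores can be made concrete with the standard psychometric formula for the reliability of a difference between two scores, r_diff = ((r_11 + r_22)/2 - r_12) / (1 - r_12), which assumes that both scores are expressed on scales of equal variance. The following Python sketch is offered only as an illustration: the function name and the reliability and intercorrelation values are hypothetical and are not taken from any battery listed in Table 24.

```python
def difference_score_reliability(r_11: float, r_22: float, r_12: float) -> float:
    """Reliability of the difference between two part scores.

    r_11, r_22: reliabilities of the two part scores (hypothetical inputs).
    r_12: correlation between the two part scores.
    Assumes both scores are on scales with equal variance.
    """
    if not -1.0 < r_12 < 1.0:
        raise ValueError("r_12 must lie strictly between -1 and 1")
    return ((r_11 + r_22) / 2.0 - r_12) / (1.0 - r_12)


# Hypothetical case: two subtests, each with reliability .80.
# Only the correlation between the subtests changes.
low_overlap = difference_score_reliability(0.80, 0.80, 0.30)
high_overlap = difference_score_reliability(0.80, 0.80, 0.70)

print(f"intercorrelation .30 -> difference reliability {low_overlap:.2f}")
print(f"intercorrelation .70 -> difference reliability {high_overlap:.2f}")
```

With the part-score reliabilities held constant, the reliability of the difference falls sharply as the intercorrelation rises, which is why short, highly intercorrelated subtests give little dependable information about intraindividual differences.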
To illustrate the scope of current multilevel batteries, a different level will be examined from each of three batteries. The levels chosen are typical of those suitable for primary, elementary, and high school grades, respectively.

TABLE 24
Representative Multilevel Batteries

Battery                                                    Grade Levels
Analysis of Learning Potential (ALP)                       1, 2-3, 4-6, 7-9, 10-12
California Test of Mental Maturity,
  1963 Revision (CTMM)                                     K-1, 1.5-3, 4-6, 7-9, 9-12, 12-16
Cognitive Abilities Test (CAT)                             K-1, 2-3, 3, 4, 5, 6, 7, 8-9, 10-11, 12
Henmon-Nelson Tests of Mental Ability,
  1973 Revision                                            K-2, 3-6, 6-9, 9-12
Kuhlmann-Anderson Tests: A Measure
  of Academic Potential
Otis-Lennon Mental Ability Test                            K, 1-1.5, 1.6-3.9, 4-6, 7-9, 10-12
School and College Ability Tests,
  SCAT Series II                                           4-6, 7-9, 10-12, 12-14

PRIMARY LEVEL. The youngest age at which it has proved feasible to employ group tests is the kindergarten and first-grade level. At the preschool ages, individual testing is required in order to establish and maintain rapport, as well as to administer the oral and performance type of items suitable for such children. By the age of 5 or 6, however, it is possible to administer printed tests to small groups of no more than 10 or 15 children. In such testing, the examiner must still give considerable individual attention to the children to make sure that directions are followed, see that pages are turned properly in the test booklets, and supervise other procedural details. With one or two assistant examiners, somewhat larger groups may be tested if necessary.

Group tests for the primary level generally cover kindergarten and the first two or three grades of elementary school. In such tests, each child is
provided with a booklet on which are printed the pictures and diagrams constituting the test items. All instructions are given orally and are usually accompanied by demonstrations. Fore-exercises are frequently included in which examinees try one or two sample items and the examiner or proctor checks the responses to make certain that the instructions were properly understood. The child marks his responses on the test booklet with a crayon or soft pencil. Most of the tests require only marking the correct picture out of a set. A few call for simple motor coordination, as in drawing lines that join two dots. Obviously tests for the primary level can require no reading or writing on the part of the examinee.

Reference to Table 24 shows that several of the batteries listed have tests suitable for the primary level. To illustrate the nature of tests at this level, we shall examine the Primary Level of the Otis-Lennon Mental Ability Test. The current edition of this test, published in 1967-1969, is available in two equivalent forms (J and K) at all levels. The Primary Level actually consists of two levels, Primary I for kindergarten and Primary II for the first half of grade one. These two levels are identical in content and differ only in the way the child indicates his responses. In Primary I, responses are recorded by encircling the correct alternative; for this reason, the test must be hand scored. In Primary II, a machine-scorable booklet is used, on which responses are indicated by filling in a small oval under the correct alternative (see Fig. 46).

For each item in the entire test, the examiner gives the instructions orally. By so doing, the examiner also controls the amount of time available to complete each item (about 15 seconds). The whole test requires approximately 25 to 30 minutes and is administered in two parts with a short rest period in between. Part I consists of 23 classification items; Part II contains a total of 32 items, designed to measure verbal conceptualization, quantitative reasoning, general information, and ability to follow directions. Typical items in each of these categories are shown in Figure 46.

FIG. 46. Items Illustrative of the Otis-Lennon Mental Ability Test, Primary I and Primary II Levels. (Copyright, 1967, Harcourt Brace Jovanovich, Inc.)
Part I. Classification (Pictorial; Geometric): Mark the picture that does not belong with the other three, the one that is different.
Part II. Verbal Conceptualization: Mark the picture that shows a flame.
Quantitative Reasoning: Mark the picture that shows the same number of dots as there are parts in the circle.
General Information: Mark the picture of the thing we talk into.
Following Directions: Mark the picture that shows a glass inside a square with a cross on top.

Norms for all levels of the Otis-Lennon battery were obtained on a carefully chosen representative sample of over 200,000 pupils in 100 school systems drawn from all 50 states. Scores can be expressed as deviation IQ's with an SD of 16. Percentile ranks and stanines can also be found with reference to both age and grade norms. Well-constructed tests for the primary level have generally been found to have satisfactory reliability and criterion-related validity. The Otis-Lennon Primary II yielded an alternate-form reliability of .in in a sample of 1,047 first-grade children, over an interval of two weeks. Split-half reliability in the total sample of 14,014 first-grade children was .90. A follow-up of 144 first-grade children yielded a correlation of .80 with a higher level of the battery administered a year later. Both concurrent and predictive validity coefficients against achievement test scores and end-of-year grades cluster in the .50's.
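The conversion of a raw score to a deviation IQ with an SD of 16 can be sketched as follows. This Python illustration assumes the conventional deviation-IQ mean of 100, which the passage above does not state explicitly. The function name and the normative mean and SD are hypothetical values chosen for the example and are not Otis-Lennon norms.

```python
def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float,
                 iq_mean: float = 100.0, iq_sd: float = 16.0) -> float:
    """Convert a raw score to a deviation IQ.

    norm_mean and norm_sd describe the raw-score distribution of the
    examinee's own age (or grade) group in the normative sample.
    iq_mean = 100 is an assumed convention; iq_sd = 16 follows the text.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw_score - norm_mean) / norm_sd  # standard score within the norm group
    return iq_mean + iq_sd * z


# Hypothetical norm group: mean raw score 30, SD 8.
iq_above = deviation_iq(38, norm_mean=30, norm_sd=8)   # one SD above the mean
iq_below = deviation_iq(26, norm_mean=30, norm_sd=8)   # half an SD below the mean
print(iq_above, iq_below)
```

Because the score is referred to the distribution of the examinee's own age or grade group, the same raw score may correspond to different deviation IQ's in different norm groups.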
ally accompanied by demonstrations. Fore-exercises are frequently included in which examinees try one or two sample items and the examiner or proctor checks the responses to make certain that the instructions were properly understood. The child marks his responses on the test booklet with a crayon or soft pencil. Most of the tests require only marking the correct picture out of a set. A few call for simple motor coordination, as in drawing lines that join two dots. Obviously tests for the primary level can require no reading or writing on the part of the examinee.

Reference to Table 24 shows that several of the batteries listed have tests suitable for the primary level. To illustrate the nature of tests at this level, we shall examine the Primary Level of the Otis-Lennon Mental Ability Test. The current edition of this test, published in 1967-1969, is available in two equivalent forms (J and K) at all levels. The Primary Level actually consists of two levels, Primary I for kindergarten and Primary II for the first half of grade one. These two levels are identical in content and differ only in the way the child indicates his responses. In Primary I, responses are recorded by encircling the correct alternative; for this reason, the test must be hand scored. In Primary II, a machine-scorable booklet is used, on which responses are indicated by filling in a small oval under the correct alternative (see Fig. 46).

For each item in the entire test, the examiner gives the instructions orally. By so doing, the examiner also controls the amount of time available to complete each item (about 15 seconds). The whole test requires approximately 25 to 30 minutes and is administered in two parts with a short rest period in between. Part I consists of 23 classification items; Part II contains a total of 32 items, designed to measure verbal conceptualization, quantitative reasoning, general information, and ability to follow directions. Typical items in each of these categories are shown in Figure 46.

FIG. 46. Items Illustrative of the Otis-Lennon Mental Ability Test, Primary I and Primary II Levels.
(Copyright, 1967, Harcourt Brace Jovanovich, Inc.)
Part I. Classification (pictorial and geometric items): Mark the picture that does not belong with the other three, the one that is different.
Part II. Verbal Conceptualization: Mark the picture that shows a flame.
Quantitative Reasoning: Mark the picture that shows the same number of dots as there are parts in the circle.
General Information: Mark the picture of the thing we talk into.
Following Directions: Mark the picture that shows a glass inside a square with a cross on top.

Norms for all levels of the Otis-Lennon battery were obtained on a carefully chosen representative sample of over 200,000 pupils in 100 school systems drawn from all 50 states. Scores can be expressed as deviation IQ's with an SD of 16. Percentile ranks and stanines can also be found with reference to both age and grade norms. Well-constructed tests for the primary level have generally been found to have satisfactory reliability and criterion-related validity. The Otis-Lennon Primary II yielded an alternate-form reliability of .87 in a sample of 1,047 first-grade children, over an interval of two weeks. Split-half reliability in the total sample of 14,014 first-grade children was .90. A follow-up of 144 first-grade children yielded a correlation of .80 with a higher level of the battery administered a year later. Both concurrent and predictive validity coefficients against achievement test scores and end-of-year grades cluster in the .50's.
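The derived scores named in the norms paragraph can be illustrated with a short computational sketch. It assumes that a pupil's raw score has already been converted to a normalized z score within his age or grade group, that deviation IQ's are centered on a mean of 100 (the SD of 16 is the value cited for the Otis-Lennon), and that stanine boundaries fall at half-SD steps from -1.75 to +1.75. The function names are hypothetical and are not taken from the test manual.

```python
from bisect import bisect_right
from statistics import NormalDist

# Assumed stanine boundaries in z-score units (half-SD steps around the mean).
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def deviation_iq(z, mean=100.0, sd=16.0):
    """Convert a normalized z score to a deviation IQ (mean 100 is assumed)."""
    return mean + sd * z


def percentile_rank(z):
    """Percentage of a normal distribution falling below the given z score."""
    return 100.0 * NormalDist().cdf(z)


def stanine(z):
    """Stanine (1 to 9) for a z score, using the assumed cut points above."""
    return bisect_right(STANINE_CUTS, z) + 1


if __name__ == "__main__":
    for z in (-2.0, -0.5, 0.0, 1.0):
        print(z, deviation_iq(z), round(percentile_rank(z), 1), stanine(z))
```

Under these assumptions a pupil one SD above the mean of his group receives a deviation IQ of 116 and falls in the seventh stanine.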
ELEMENTARY SCHOOL LEVEL. Group intelligence tests designed for use from the fourth grade of elementary school through high school have much in common in both content and general design. Since functional literacy is presupposed at these levels, the tests are predominantly verbal in content; most also include arithmetic problems or other numerical tests. In addition, a few batteries provide nonreading tests designed to assess the same abstract reasoning abilities in children with foreign language background, reading disabilities, or other educational handicaps.

As an example of tests for the elementary school grades, we shall consider some of the intermediate levels of the Cognitive Abilities Test. The entire series includes two primary levels (for kindergarten through the third grade) and the Multi-Level Edition covering grades 3 through 12. The Multi-Level Edition comprises eight levels (A-H) printed in a single booklet. Examinees taking different levels start and stop with different sets of items, or modules. The test is designed so that most examinees will be tested with items that are at intermediate difficulty for them, where discrimination will be most effective.

All eight levels of the Multi-Level Edition contain the same 10 subtests, grouped into three batteries as follows:

Verbal Battery--Vocabulary, Sentence Completion, Verbal Classification, Verbal Analogies.
Quantitative Battery--Quantitative Relations, Number Series, Equation Building.
Nonverbal Battery--Figure Classification, Figure Analogies, Figure Synthesis. These subtests use no words or numbers, but only geometric or figural elements; the items bear relatively little relation to formal school instruction.

FIG. 47. Typical Items from Verbal Battery of Cognitive Abilities Test. Answers: 1-D, 2-E, 3-C, 4-E.
(Reproduced by courtesy of Robert L. Thorndike and Elizabeth Hagen.)
1. Vocabulary: find the one word that means the same or nearly the same as the word in dark type at the beginning of the line.
   impolite   A unhappy   B angry   C faithless   D rude   E talkative
2. Sentence Completion: pick the one word that best fits the empty space in the sentence.
   Mark was very fond of his science teacher, but he did not ____ his mathematics teacher.
3. Verbal Classification: think in what way the words in dark type go together. Then find the word in the line below that goes with them.
4. Verbal Analogies: figure out how the first two words are related to each other. Then from the five words on the line below find the word that is related to the third word in the same way.
Each subtest is preceded by practice exercises, the same set being used for all levels. In Figures 47, 48, and 49 will be found a typical item from each of the 10 subtests, with highly condensed instructions. In difficulty level, these items correspond to items given in grades 4 to 6. The authors recommend that all three batteries be given to each child, in three testing sessions. For most children, the Nonverbal Battery does not predict school achievement as well as do the Verbal and Quantitative batteries. However, a comparison of performance on the three batteries may provide useful information regarding special abilities or disabilities.

The standardization sample, including approximately 20,000 cases in each of the 10 grade groups, was carefully chosen to represent the school population of the country. Raw scores on each battery are translated into a uniform scale across all levels to permit continuity of measurement and comparability of scores in different grades. For normative interpretations, the scores on each battery can be expressed as normalized standard scores within each age, with a mean of 100 and an SD of 16. Percentiles and stanines can also be found, within ages and within grades. The manual advises against combining scores from the three batteries into a single index.

Kuder-Richardson reliabilities of the three battery scores, computed within grades, are mostly in the .90's. The manual also reports standard errors of measurement for different grades and score levels, as well as the minimum interbattery score differences that can be considered to have statistical and practical significance. Intercorrelations of battery scores are in the high .60's and .70's; intercorrelations of subtests are also
unusually high. Factor analyses likewise showed the presence of a large general factor through the three batteries, probably representing principally the ability to reason with abstract and symbolic content.

FIG. 48. Typical Items from Quantitative Battery of Cognitive Abilities Test. Answers: 1-B, 2-B, 3-A.
(Reproduced by courtesy of Robert L. Thorndike and Elizabeth Hagen.)
1. Quantitative Relations: If the amount or quantity in Column I is more than in Column II, mark A; if it is less, mark B; if they are equal, mark C.
2. Number Series: the numbers at the left are in a certain order. Find the number at the right that should come next.
3. Equation Building: Arrange the numbers and signs below to make true equations and then choose the number at the right that gives you a correct answer.

FIG. 49. Typical Items from Nonverbal Battery of Cognitive Abilities Test. Answers: 1-B, 2-D, 3-3 & 4.
(Reproduced by courtesy of Robert L. Thorndike and Elizabeth Hagen and with permission of Houghton Mifflin Company.)
1. Figure Classification: the first three figures are alike in some way. Find the figure at the right that goes with the first three.
2. Figure Analogies: decide how the first two figures are related to each other. Then find the one figure at the right that goes with the third figure in the same way that the second figure goes with the first.
3. Figure Synthesis: For each shaded area, decide whether or not it can be completely covered by using all the given black pieces without overlapping any.

The Cognitive Abilities Test was standardized on the same normative sample as two achievement batteries, the Iowa Tests of Basic Skills (ITBS) for grades 3 to 8 and the Tests of Academic Progress (TAP) for grades 9 to 12. Concurrent validity of the Cognitive Abilities Test against the ITBS, found within single-grade groups of 500 cases from the standardization sample, ranged from the .50's to the .70's. As is generally found with academic criteria, the Verbal Battery yielded the highest correlations with achievement in all school subjects, except for arithmetic, which tended to correlate slightly higher with the Quantitative Battery. Correlations with the Nonverbal Battery were uniformly lower than with the other two batteries.
Correlations with achievement tests over a three-year interval are of about the same magnitude as the concurrent correlations. Predictive validity coefficients against school grades obtained from one to three years later run somewhat lower, falling mostly in the .50's and .60's. These correlations are probably attenuated by unreliability and other extraneous variance in grading procedures.

HIGH SCHOOL LEVEL. It should be noted that the high school levels of multilevel batteries, as well as other tests designed for high school students, are also suitable for testing general, unselected adult groups. Another source of adult tests is to be found in the tests developed for military personnel and subsequently published in civilian editions. Short screening tests for job applicants will be considered in Chapter 15.
An example of a group intelligence test for the high school level is provided by Level 2 of the School and College Ability Tests (SCAT)--Series II, designed for grades 9 to 12. At all levels of the SCAT series, tests are available in two equivalent forms, A and B. Oriented specifically toward the prediction of academic achievement, all levels yield a verbal, a quantitative, and a total score. The verbal score is based on a verbal analogies test. The items in this test, however, differ from the traditional analogies items in that the respondent must choose both words in the second pair, rather than just the fourth word. The quantitative score is derived from a quantitative comparison test designed to assess the examinee's understanding of fundamental number operations. Covering both numerical and geometric content, these items require a minimum of reading and emphasize insight and resourcefulness rather than traditional computational procedures. Figure 50 shows sample verbal and quantitative items taken from a Student Bulletin distributed to SCAT examinees for preliminary orientation purposes. The items reproduced in Figure 50 fall approximately in the difficulty range covered by Level 2. At all levels, testing time is 40 minutes, 20 minutes for each part.

FIG. 50. Typical Items from SCAT Series II, Level 2, for Grades 9 to 12. Answers: 1-B, 2-B, 3-D, 4-A, 5-C.
(From Student Bulletin--SCAT--Series II. Copyright © 1967 by Educational Testing Service. All rights reserved. Reprinted by permission.)
Part I - Verbal Ability: each item begins with two words that go together in a certain way. Find the two other words that go together in about the same way.
   1  tool : hammer ::   A table : chair   B toy : doll   C weapon : metal   D sleigh : bell
   2  braggart : humility ::   A traitor : repentance   B radical : conventionality   C precursor : foresight   D sophisticate : predisposition
Part II - Mathematical Ability: each item is made up of two amounts or quantities, one in Column A and one in Column B. You have four choices: A, B, C, or D. Choose A if the quantity in Column A is greater, B if that in Column B is greater, C if the quantities are equal, or D if there is not enough information for you to tell about their sizes.

In line with current trends in testing theory, SCAT undertakes to measure developed abilities. This is simply an explicit admission of what is more or less true of all intelligence tests, namely that test scores reflect the nature and amount of schooling the individual has received rather than measuring "capacity" independently of relevant prior experiences. Accordingly, SCAT draws freely on word knowledge and arithmetic processes learned in the appropriate school grades. In this respect, SCAT does not really differ from other intelligence tests, especially those designed for the high school and college levels; it only makes overt a condition sometimes unrecognized in other tests.

Verbal, quantitative, and total scores from all SCAT levels are expressed on a common scale which permits direct comparison from one level to another. These scores can in turn be converted into percentiles or stanines for the appropriate grade. A particularly desirable feature of SCAT scores is the provision of a percentile band in addition to a single percentile for each obtained score. Representing a distance of approximately one standard error of measurement on either side of the corresponding percentile, the percentile band gives the 68 percent confidence interval, or the range within which are found 68 percent of the cases in a normal curve. In other words, if we conclude that an individual's true score lies within the given percentile band, the chances of our being correct are 68 out of 100 (roughly a 2:1 ratio). As explained in Chapter 5, the error of measurement provides a concrete way of taking the reliability of a test into account when interpreting an individual's score.

If two percentile bands overlap, the difference between the scores can be ignored; if they do not overlap, the difference can be regarded as significant. Thus, if two students were to obtain total SCAT scores that fall in the percentile bands 55-68 and 74-84, we could conclude with fair confidence that the second actually excels the first and would continue to do so on a retest. Percentile bands likewise help in comparing a single individual's relative standing on verbal and quantitative parts of the test. If a student's verbal and quantitative scores correspond to the percentile
bands 66-86 and 58-78, respectively, we would conclude that he is not significantly better in verbal than in quantitative abilities, because his percentile bands for these two scores overlap (see Fig. 51).

FIG. 51. SCAT-II Profile, Illustrating Percentile Bands. (Profile chart with a percentile scale running from 10 to 90 and regions labeled Very Low, Low, Average, High, and Very High.)
(From Student Bulletin--SCAT--Series II. Copyright © 1967 by Educational Testing Service. All rights reserved. Reprinted by permission.)

The SCAT standardization sample of over 100,000 cases was selected so as to constitute a representative cross-section of the national school population in grades 4 through 12 and the first two years of college. For this purpose, a three-stage sampling procedure was employed, in which the units to be sampled were school systems (public and private), schools, and classrooms, respectively. Similar successive sampling procedures were followed for the college sample. The selection of the standardization sample, as well as other test-construction procedures followed in the development of SCAT, sets an unusually high standard of technical quality.

Reliability coefficients for verbal, quantitative, and total scores were separately computed within single grade groups by the Kuder-Richardson technique. The reported reliabilities are uniformly high. Within separate grade groups, from grade 4 to 14, total score reliabilities are all .90 or above; verbal and quantitative reliabilities vary between .83 and .91. These reliabilities may be spuriously high because the tests are somewhat speeded. The percentage of students reaching the last item in different grades ranged from 65 to 96 on the verbal test and from 55 to 85 on the quantitative test. Under these circumstances, equivalent-form reliability would seem more appropriate. If the reliability coefficients are in fact spuriously high, the errors of measurement are underestimated; and hence the percentile bands should be wider.

It should be noted, however, that many of the students who did not record an answer for all items may have abandoned any attempt to solve the more difficult items even though they had enough time. In the quantitative test, moreover, a student can waste an inordinate amount of time in reaching the answer to certain items by counting or computation, when the proper recognition of numerical relations would yield the answer almost instantaneously. Under these conditions, speed of performance would be highly correlated with the quantitative reasoning abilities the test is designed to measure.

In view of the stated purpose for which SCAT was developed, its predictive validity against academic achievement is of prime relevance. Follow-up data in grades 5, 8, 11, and 12 were obtained from schools participating in the standardization sample. Validity coefficients were computed in each school and then averaged for each grade across schools, the number of available schools per grade ranging from 3 to 26. For the four grades investigated, the average correlation between SCAT Total and grade-point average ranged from .59 to .68; for SCAT Verbal and English grades, the range was .41 to .69; and for SCAT Quantitative and mathematics grades, the range was .43 to .65. Because individual correlations varied widely from school to school, however, the manual recommends local validation.

Correlations with achievement tests (Sequential Tests of Educational Progress) generally fell between .60 and .80. Quantitative scores tend to correlate higher than Verbal scores with mathematics achievement; and Verbal scores tend to correlate higher than Quantitative scores with achievement in all other subjects. Total SCAT scores, however, generally yield validity coefficients as high as those of either of the two part scores or higher. Thus the effectiveness of Verbal and Quantitative scores as differential predictors of academic achievement remains uncertain. It is noteworthy in this connection that the Verbal and Quantitative scores are themselves highly correlated. These correlations are in the .70's, except at the lowest and highest grades, where they drop to the .60's. Such close similarity may result from the item types employed in the two tests, which involve largely the ability to detect and utilize relations in symbolic and abstract content. Like other tests discussed in this chapter, SCAT was designed principally as a measure of general intellectual development and only secondarily as an indicator of intraindividual aptitude differences.
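The reasoning behind the SCAT percentile bands can be made concrete with a brief sketch. It assumes the classical relation between the standard error of measurement, the SD of the scores, and the reliability coefficient, and it works in scaled-score units; converting the endpoints of a band to percentiles would require the publisher's norm tables, which are not reproduced here. The function names and the illustrative SD of 10 are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r), the classical test theory relation."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(score, sem):
    """Band of one SEM on either side of the obtained score (about 68 percent)."""
    return (score - sem, score + sem)


def bands_overlap(band_a, band_b):
    """True if two (low, high) bands share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


if __name__ == "__main__":
    # A lower reliability gives a larger SEM and therefore a wider band.
    for r in (0.91, 0.84):
        sem = standard_error_of_measurement(10.0, r)
        print(r, round(sem, 2), score_band(50.0, sem))

    # The two comparisons used in the text, expressed as percentile bands.
    print(bands_overlap((55, 68), (74, 84)))  # False: difference regarded as significant
    print(bands_overlap((66, 86), (58, 78)))  # True: difference can be ignored
```

The first loop shows why a spuriously high reliability coefficient matters: replacing .91 by .84 in this illustration widens the band by about a third.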
COLLEGE ADMISSION. A number of tests have been developed for use in the admission, placement, and counseling of college students. An outstanding example is the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board. Several new forms of this test are prepared each year, a different form being employed in each administration. Separate scores are reported for the Verbal and Mathematics sections of the test. Figures 52 and 53 contain brief descriptions of the verbal and mathematics item types, with illustrations. The items reproduced are among those given in the orientation booklet distributed to prospective examinees (College Entrance Examination Board, 1974b). Changes introduced in 1974 on an experimental basis include the addition of a Test of Standard Written English and the separate reporting of a vocabulary score (based on the antonyms and analogies items) and a reading comprehension score (based on the sentence completion and reading comprehension items).

FIG. 52. Illustrative Verbal Items from CEEB Scholastic Aptitude Test. Instructions have been highly condensed. Answers: 1-B, 2-E, 3-D.
(From About the SAT 1974-75, pp. 4, 5. College Entrance Examination Board, New York. Reprinted by permission of Educational Testing Service, copyright owner of the test questions.)
Sentence Completions: choose the one word or set of words which, when inserted in the sentence, best fits in with the meaning of the sentence as a whole.
   2. From the first the islanders, despite an outward ____, did what they could to ____ the ruthless occupying power.
      (A) harmony .. assist   (B) enmity .. embarrass   (C) rebellion .. foil   (D) resistance .. destroy   (E) acquiescence .. thwart
Analogies: select the lettered pair which best expresses a relationship similar to that expressed in the original pair.
   3. CRUTCH : LOCOMOTION ::   (A) paddle : canoe   (B) hero : worship   (C) horse : carriage   (D) spectacles : vision   (E) statement : contention
Reading Comprehension: examinee reads a passage and answers multiple-choice questions assessing his understanding of its content.

FIG. 53. Illustrative Mathematics Items from CEEB Scholastic Aptitude Test. Instructions have been highly condensed. Answers: 1-E, 2-C.
(From About the SAT 1974-75, pp. 4, 5. College Entrance Examination Board, New York. Reprinted by permission of Educational Testing Service, copyright owner of the test questions.)
Standard Multiple-Choice Questions: drawing upon elementary arithmetic, algebra, and geometry taught in the ninth grade or earlier, these items emphasize insightful reasoning and the application of principles.
Quantitative Comparisons: mark (A) if the quantity in Column A is the greater, (B) if the quantity in Column B is the greater, (C) if the two quantities are equal, (D) if the relationship cannot be determined from the information given.
   Column A: 3 X 353 X 8      Column B: 4 X 352 X 6

First incorporated into the CEEB testing program in 1926, the SAT has undergone continuing development and extensive research of high technical quality. One of the reviewers in the Seventh Mental Measurements Yearbook writes: "Technically, the SAT may be regarded as highly perfected--possibly reaching the pinnacle of the current state of the art of psychometrics" (DuBois, 1972). Another comments: "The system of pretesting of items, analysis, and standardization of new forms exemplifies the most sophisticated procedures in modern psychometrics" (W. L. Wallace, 1972). Several aspects of the SAT research have been cited in different chapters of this book to illustrate specific procedures. A detailed account of methodology and results can be found in the technical report edited by Angoff (1971b). A shorter comparable form, known as the Preliminary SAT, has also been in use since 1959. Generally taken at an earlier stage, this test provides a rough estimate of the high school student's aptitude for college work and has been employed for educational counseling and other special purposes. Both tests are restricted to the testing program administered by the College Entrance Examination Board on behalf of member colleges. All applicants to these colleges take the SAT. Some colleges also require one or more achievement tests in special fields, likewise administered by CEEB.

Another nationwide program, launched in 1959, is the American College Testing Program (ACT). Originally limited largely to state university systems, this program has grown rapidly and is now used by many
colleges throughout the country. The ACT Test Battery includes four parts: English Usage, Mathematics Usage, Social Studies Reading, and Natural Sciences Reading. Reflecting the point of view of its founder, E. F. Lindquist, the examination provides a set of work samples of college work. It overlaps traditional aptitude and achievement tests, focusing on the basic intellectual skills required for satisfactory performance in college.

Technically, the ACT does not come up to the standards set by the SAT. Reliabilities are generally lower than desirable for individual decisions. The separate scores are somewhat redundant insofar as the four parts are heavily loaded with reading comprehension and highly intercorrelated. On the other hand, validity data compare favorably with those found for other instruments in similar settings. Correlations between composite scores on the whole battery and college grade-point averages cluster around .50. Most of these validity data were obtained through research services made available to member colleges by the ACT program staff. The program also provides extensive normative and interpretive data and other ancillary services.

In addition to the above restricted tests, a number of tests designed for college-bound high school students and for college students are commercially available to counselors and other qualified users. An example is the College Qualification Tests. This battery offers six possible scores: Verbal, Numerical, Science Information, Social Studies Information, Total Information, and a Total score on the entire test. The information demanded in the various fields is of a fairly general and basic nature and is not dependent on technical aspects of particular courses. Reliability and normative data compare favorably with those of similar tests. Validity data are promising but understandably less extensive than for tests that have been used more widely.

It will be noted that, with the exception of the College Board's SAT (which can be supplemented with achievement tests), all these tests sample a combination of general aptitudes and knowledge about (or ability to handle) subject matter in major academic fields. When separate scores are available, their differential validity in predicting achievement in different fields is questionable. It would seem that total score provides the best predictor of performance in nearly all college courses. Among the part-scores, verbal scores are usually the best single predictors. Another important point to bear in mind is that scores on any of these tests are not intended as substitutes for high school grades in the prediction of college achievement. High school grades can predict college achievement as well as most tests or better. When test scores are combined with high school grades, however, the prediction of college performance is considerably improved.

GRADUATE SCHOOL ADMISSION. The practice of testing applicants for admission to college was subsequently extended to include graduate and professional schools.* Most of the tests designed for this purpose represent a combination of general intelligence and achievement tests. A well-known example is the Graduate Record Examinations (GRE). This series of tests originated in 1936 in a joint project of the Carnegie Foundation for the Advancement of Teaching and the graduate schools of four eastern universities. Now greatly expanded, the program is conducted by Educational Testing Service, under the general direction of the Graduate Record Examinations Board. Students are tested at designated centers prior to their admission to graduate school. The test results are used by the universities to aid in making admission and placement decisions and in selecting recipients for scholarships, fellowships, and special appointments.

* The testing of applicants to professional schools will be discussed in Chapter 15, in connection with occupational tests.

The GRE include an Aptitude Test and an Advanced Test in the student's field of specialization. The latter is available in many fields, such as biology, English literature, French, mathematics, political science, and psychology. The Aptitude Test is essentially a scholastic aptitude test suitable for advanced undergraduates and graduate students. Like many such tests, it yields separate Verbal and Quantitative scores. The verbal items require verbal reasoning and comprehension of reading passages taken from several fields. The quantitative items require arithmetic and algebraic reasoning, as well as the interpretation of graphs, diagrams, and descriptive data.

Scores on all GRE tests are reported in terms of a single standard score scale with a mean of 500 and an SD of 100. These scores are directly comparable for all tests, having been anchored to the Aptitude Test scores of a fixed reference group of 2,095 seniors examined in 1952 at 11 colleges. A score of 500 on an Advanced Physics Test, for example, is the score expected from physics majors whose Aptitude Test score equals the mean Aptitude Test score of the reference group. Since graduate school applicants are a selected sample with reference to academic aptitude, the means of most groups actually taking each Advanced Test in the graduate student selection program will be considerably above 500. Moreover, there are consistent differences in the intellectual caliber of students majoring in different subjects. For normative interpretation, therefore, the current percentiles given for specific groups are more relevant and local norms are still better.

The reliability and validity of the GRE have been investigated in a number of different student samples (Guide for the use of the GRE, 1973). Kuder-Richardson reliabilities of the Verbal and Quantitative scores of the Aptitude Test and of total scores on the Advanced Tests are consistently over .90. Several Advanced Tests also report scores in two or three major subdivisions of the field, such as experimental and social psychology. The reliabilities of these subscores are mostly in the .80's. The lower reliabilities, as well as the high intercorrelations among the parts, call for caution in the interpretation of subscores.
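For readers who wish to see how a Kuder-Richardson coefficient of the kind reported here is obtained, the following is a minimal sketch of Formula 20. It assumes items scored 1 (right) or 0 (wrong) and uses the variance of total scores computed over the whole group; the function name and the small response matrix are hypothetical and bear no relation to actual GRE data.

```python
from statistics import pvariance


def kr20(item_matrix):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    item_matrix holds one row per examinee and one column per item.
    """
    n_people = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    total_variance = pvariance(totals)
    if n_items < 2 or total_variance == 0:
        raise ValueError("need at least two items and some variation in total scores")

    # Sum over items of p * q, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_people
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - sum_pq / total_variance)


if __name__ == "__main__":
    # Hypothetical responses of four examinees to three items.
    responses = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0],
    ]
    print(kr20(responses))
```

With so few items the coefficient is necessarily modest; the same computation applied to a long test of homogeneous items is what yields values in the .90's.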

Predictive validity has been checked against such criteria as graduate school grade-point average, performance on departmental comprehensive examinations, overall faculty ratings, and attainment of the PhD (Willingham, 1974). In general, the GRE composite score, including Aptitude and Advanced tests, proved to be more valid than undergraduate grade-point average as a predictor of graduate school performance. Depending upon the criterion used, the difference in favor of the GRE varied from slight to substantial. Consistent with expectation, GRE-Q was a better predictor than GRE-V in those scientific fields in which mathematical ability is of major importance; the reverse was true of such fields as English. The GRE Advanced Test was the most generally valid single predictor among those investigated. Illustrative data from three fields can be seen in Figure 54, showing the percentage of students attaining the PhD in successive intervals of Advanced Test scores. The three coefficients given in Figure 54 are biserial correlations between GRE Advanced Test scores and attainment or nonattainment of the PhD.
The highest validities were obtained with weighted composites of
undergraduate grade-point average and one or more GRE scores. These
multiple correlations fell mostly between .40 and .4,5 for various criteria FIG. 54. Percentage of Students at Various Levels of GRE Advanced Test
and for different fields. It should be noted that the narrO\v range of Scores \Vho Attained the PhD within 10 Years.
talent covered by graduate school applicants necessari1. ' results in lower (Fro,:, Willingh.an:" 1974, p. 276; data from Creager, 1965. Copyright 1974 by the
correlations than arc obtained with the SAT among college applicants. American ASSOClatlon for the Advancement of Science.)
This finding does not imply that the GRE is intrinsically less valid than
the SAT; rather it means that finer discriminations are required within
the more narrowly restricted graduate school population. Percentil~ norms on the ~Ii1ler Analogies Test are given for graduate
Another widely used test for the selection of graduate students is the, and ~rofesslOnal school students in several fields and for groups of in-
~\filler Analogies Test. Consisting of complex analogies items whose sub- dustnal employees and applicants. Over half of these groups contained
ject matter is drawn from many academic fields, this test has an un- 500 or more cases and none had less than 100. ;\,'larked variations in test
usually high ceiling. Although a 50-minute time limit is imposed, the performance are found among these different samples. TIre median of one
test is primarily a power test. The !\-liller Analogies Test was first de- group, for example, corresponds to the 90th percentile of another. Means
veloped for use at the University of Minnesota, but later forms were and SD's, for. addition~l, smaller industrial samples are also reported as
made available to other graduate schools and it was subsequently pub- further aIds m normahve interpretation.
lished by the Psychological Corporation. Its administration, however, is Odd-even reliability coefficients of .92 to .95 were found in different
restricted to licensed centers in universities or business organizations. The s~mples, and alternate-form reliabiIities ranged from .85 to .90. Correla-
tpst is used both in the selection of graduate students and in the evalua- ho~s with several individual and group tests of intelligence and academic
tion of personnel for high-level job~ in industry. It is available in five aptltudes fall almost entirely between the ..50's and the .70'5. Over 100
",.,rO>llpl fnrn" nnp nF whil'h is fPs.·rvpd for reexuminat) .ns. validity coefficients are reported for graduate and professional student
groups and for a few industrial samples. These coefficients vary widely. Slightly over a third are between .30 and .60. About an equal number are clearly too low to be significant. The field of specialization, the nature of the criteria employed, and the size, heterogeneity, and other characteristics of the samples are obvious conditions affecting these coefficients. Means and SD's of several contrasting groups in different settings provide some additional promising validity data. It is evident that the validity of this test must be evaluated in terms of the specific context in which it is to be used.

SUPERIOR ADULTS. Any test designed for college or graduate students is also likely to be suitable for examining superior adults for occupational assessment, research, or other purposes. The use of the Miller Analogies Test for the selection and evaluation of high-level industrial personnel has already been mentioned. Another test that provides sufficient ceiling for the examination of highly superior adults is the Concept Mastery Test (CMT). Originating as a by-product of Terman's extensive longitudinal study of gifted children, Form A of the Concept Mastery Test was developed for testing the intelligence of the gifted group in early maturity (Terman & Oden, 1947). For a still later follow-up, when the gifted subjects were in their mid-forties, Form T was prepared (Terman & Oden, 1959). This form, which is somewhat easier than Form A, was subsequently released for more general use.

The Concept Mastery Test consists of both analogies and synonym-antonym items. Like the Miller Analogies Test, it draws on concepts from many fields, including physical and biological sciences, mathematics, history, geography, literature, music, and others. Although predominantly verbal, the test incorporates some numerical content in the analogies items.

Percentile norms are given for approximately 1,000 cases in the Stanford gifted group, tested at a mean age of 41 years, as well as for smaller samples of graduate students, college seniors applying for Ford Foundation Fellowships in Behavioral Sciences, and engineers and scientists in a navy electronics laboratory. To provide further interpretive data, the manual (with 1973 supplement) reports means and SD's of some 20 additional student and occupational samples.

Alternate-form reliabilities of the CMT range from .86 to .94. Scores show consistent rise with increasing educational level and yield correlations clustering around .60 with predominantly verbal intelligence tests. Significant correlations with grade-point averages were found in seven college samples, the correlations ranging from .26 to .59. Some suggestive findings in other contexts are also cited. For example, in two groups of managers participating in advanced management training programs, CMT scores correlated .40 and .45 with peer ratings of ability to think critically and analytically. And in a group of 200 experienced elementary and secondary school teachers, CMT correlated .54 with a scale designed to measure teachers' attitude toward gifted children. Evidently the teachers who themselves scored higher on this test had more favorable attitudes toward gifted children.

Because of its unique features, the Concept Mastery Test can undoubtedly serve a useful function for certain testing purposes. On the other hand, it is clearly not an instrument that can be used or interpreted routinely. A meaningful interpretation of CMT scores requires a careful study of all the diverse data accumulated in the manual, preferably supplemented by local norms.
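The point made earlier about the GRE, that a narrow range of talent necessarily lowers validity coefficients, can be illustrated numerically. The sketch below is not drawn from the GRE data; it assumes a hypothetical applicant pool in which test and criterion are bivariate normal with a correlation of .60, and it assumes that only the top 20 percent on the test are admitted. Both figures, and the function names, are illustrative choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_selection(rho=0.6, n_applicants=20000, top_fraction=0.2, seed=1):
    """Return (r in the full pool, r among those selected on the test).

    rho and top_fraction are assumed values chosen for illustration only.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_applicants):
        test = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Criterion shares variance with the test so that corr(test, criterion) = rho.
        criterion = rho * test + math.sqrt(1.0 - rho ** 2) * noise
        pairs.append((test, criterion))

    full_r = pearson([t for t, _ in pairs], [c for _, c in pairs])

    # Keep only the highest scorers on the test, as a selective school would.
    pairs.sort(key=lambda p: p[0], reverse=True)
    selected = pairs[: int(n_applicants * top_fraction)]
    restricted_r = pearson([t for t, _ in selected], [c for _, c in selected])
    return full_r, restricted_r


full_r, restricted_r = simulate_selection()
print(f"all applicants:      r = {full_r:.2f}")
print(f"top 20 percent only: r = {restricted_r:.2f}")
```

The test is equally related to the criterion in both computations; only the spread of test scores differs. The lower coefficient in the selected group therefore reflects the restricted range, not a weaker instrument.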
CHAPTER 12

Psychological Issues in Intelligence Testing

PSYCHOLOGICAL tests should be regarded as tools. Like all tools, their effectiveness depends on the knowledge, skill, and integrity of the user. A hammer can be employed to build a crude kitchen table or a fine cabinet, or as a weapon of assault. Since psychological tests are measures of behavior, the interpretation of test results requires knowledge about human behavior. Psychological tests cannot be properly applied outside the context of psychological science. Familiarity with relevant behavioral research is needed not only by the test constructor but also by the test user.

An inevitable consequence of the expansion and growing complexity of any scientific endeavor is an increasing specialization of interests and functions among its practitioners. Such specialization is clearly apparent in the relationship of psychological testing to the mainstream of contemporary psychology (Anastasi, 1967). Specialists in psychometrics have raised techniques of test construction to truly impressive pinnacles of quality. While providing technically superior instruments, however, they have given relatively little attention to ensuring that test users had the psychological information needed for the proper use of such instruments. As a result, outdated interpretations of test performance all too often survived without reference to the results of pertinent behavioral research. This partial isolation of psychological testing from other fields of psychology, with its consequent misuses and misinterpretations of tests, accounts for some of the public discontent with psychological testing in the decades of the [...]. The topics chosen for discussion in this chapter illustrate ways in which the findings of psychological research can contribute to the effective use of intelligence tests and can help to correct popular misconceptions about the IQ and similar scores.

An important approach to the understanding of the construct, "intelligence," is through longitudinal studies of the same individuals over long periods of time. Although such investigations may be regarded as contributing to the long-term predictive validation of specific tests, they have broader implications for the nature of intelligence and the meaning of an IQ. When intelligence was believed to be largely an expression of hereditary potential, each individual's IQ was expected to remain very nearly constant throughout life. Any observed variation on retesting was attributed to weaknesses in the measuring instrument, either inadequate reliability or poor selection of functions tested. With increasing research on the nature of intelligence, however, has come the realization that intelligence itself is both complex and dynamic. In the following sections, we shall examine typical findings of longitudinal studies of intelligence and shall inquire into the conditions making for both stability and instability of the IQ.

STABILITY OF THE IQ. An extensive body of data has accumulated showing that, over the elementary, high school, and college period, intelligence test performance is quite stable (see Anastasi, 1958a, pp. 232-238; McCall, Appelbaum, & Hogarty, 1973). In a Swedish study of a relatively unselected population, for example, Husen (1951) found a correlation of .72 between the test scores of 618 third-grade school boys and the scores obtained by the same persons 10 years later on their induction into military service. In a later Swedish study, Harnqvist (1968) reports a correlation of .78 between tests administered at 13 and 18 years of age to over 4,500 young men. Even preschool tests show remarkably high correlations with later retests. In a longitudinal study of 140 children conducted at Fels Research Institute (Sontag, Baker, & Nelson, 1958), Stanford-Binet scores obtained at 3 and at 4 years of age correlated .83. The correlation with the 3-year tests decreased as the interval between retests increased, but by age 12 it was still as high as .46. Of special relevance to the Stanford-Binet is the follow-up conducted by Bradway, Thompson, and Cravens (1958) on children originally tested between the ages of 2 and 5½ as part of the 1937 Stanford-Binet standardization sample. Initial IQ's correlated .65 with 10-year retests and .59 with 25-year retests. The correlation between the 10-year retest (Mean age = 14 years) and 25-year retest (Mean age = 29 years) was .85.
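The retest correlations just cited fall as the interval lengthens (.83 from age 3 to age 4 in the Fels data, .46 from age 3 to age 12). A minimal sketch reproduces this qualitative pattern under one strong assumption, made here purely for illustration and not taken from the studies cited: that a child's score at any age is the running sum of independent yearly increments of equal variance. Under that assumption the expected correlation between scores at ages a and b (a < b) is the square root of a/b, since the earlier score is a part of the later one. The function names are hypothetical.

```python
import math
import random


def overlap_correlation(early_age, late_age):
    """Expected correlation between cumulative scores at two ages when
    every year adds an independent increment of equal variance."""
    return math.sqrt(early_age / late_age)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_retest(early_age, late_age, n_children=5000, seed=0):
    """Simulate cumulative scores and return the early-late correlation."""
    rng = random.Random(seed)
    early_scores, late_scores = [], []
    for _ in range(n_children):
        total = 0.0
        score_at_early_age = 0.0
        for year in range(1, late_age + 1):
            total += rng.gauss(0.0, 1.0)  # this year's independent increment
            if year == early_age:
                score_at_early_age = total
        early_scores.append(score_at_early_age)
        late_scores.append(total)
    return pearson(early_scores, late_scores)


for early, late in [(3, 4), (3, 9), (3, 12), (10, 16)]:
    print(
        f"ages {early:>2} and {late:>2}: "
        f"expected r = {overlap_correlation(early, late):.2f}, "
        f"simulated r = {simulate_retest(early, late):.2f}"
    )
```

The expected values are about .87 for ages 3 and 4, .58 for 3 and 9, .50 for 3 and 12, and .79 for 10 and 16. The increments here are unrelated from year to year, yet correlations are higher for shorter intervals and for older starting ages. The sketch ignores measurement error and changes in test content, so it should not be read as a fit to any of the empirical coefficients.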
As would be expected, retest correlations are higher, the shorter the interval between tests. With a constant interval between tests, moreover, retest correlations tend to be higher the older the children. The effects of age and retest interval on retest correlations exhibit considerable regularity and are themselves highly predictable (Thorndike, 1933, 1940).

One explanation for the increasing stability of the IQ with age is provided by the cumulative nature of intellectual development. The individual's intellectual skills and knowledge at each age include all his earlier skills and knowledge plus an increment of new acquisitions. Even if the annual increments bear no relation to each other, a growing consistency of performance level would emerge, simply because earlier acquisitions constitute an increasing proportion of total skills and knowledge as age increases. Predictions of IQ from age 10 to 16 would thus be more accurate than from 3 to 9, because scores at 10 include over half of what is present at 16, while scores at 3 include a much smaller proportion of what is present at 9.

Anderson (1940) described this relationship between successive scores as the overlap hypothesis. He maintained that, "Since the growing individual does not lose what he already has, the constancy of the IQ is in large measure a matter of the part-whole or overlap relation" (p. 394). In support of this hypothesis, Anderson computed a set of correlations between initial and terminal "scores" obtained with shuffled cards and random numbers. These correlations, which depended solely on the extent of overlap between successive measures, agreed closely with empirical test-retest correlations in intelligence test scores found in three published longitudinal studies. In fact, the test scores tended to give somewhat lower correlations, a difference Anderson attributed to such factors as errors of measurement and change in test content with age.

Although the overlap hypothesis undoubtedly accounts for some of the increasing stability of the IQ in the developing individual, two additional conditions merit consideration. The first is the environmental stability characterizing the developmental years of most individuals. Children tend to remain in the same family, the same socioeconomic level, and the same cultural milieu as they grow up. They are not typically shifted at random from intellectually stimulating to intellectually handicapping environments. Hence, whatever intellectual advantages or disadvantages they had at one stage in their development tend to persist in the interval between retests.

A second condition contributing to the general stability of the IQ pertains to the role of prerequisite learning skills in subsequent learning. Not only does the individual retain prior learning, but much of his prior learning provides tools for subsequent learning. Hence, the more progress he has made in the acquisition of intellectual skills and knowledge at any one point in time, the better able he is to profit from subsequent learning experiences. The concept of readiness in education is an expression of this general principle. The sequential nature of learning is also implied in the previously discussed Piagetian approach to mental development, as well as in the various individualized instructional programs. Applications of the same principle underlie Project Head Start and other compensatory educational programs for culturally disadvantaged preschool children (Bloom, Davis, & Hess, 1965; Gordon & Wilkerson, 1966; Sigel, 1973; Stanley, 1972, 1973; Whimbey, 1975). Insofar as children from disadvantaged backgrounds lack some of the essential prerequisites for effective school learning, they would only fall farther and farther behind in academic achievement as they progressed through the school grades. It should be added that learning prerequisites cover not only such intellectual skills as the acquisition of language and of quantitative concepts, but also attitudes, interests, motivation, problem-solving styles, reactions to frustration, self-concepts, and other personality characteristics.

The object of compensatory educational programs is to provide the learning prerequisites that will enable children to profit from subsequent schooling. In so doing, of course, these programs hope to disrupt the "stability" of IQ's that would otherwise have remained low. Compensatory education programs provide one example of the interaction between initial score and treatment in the prediction of subsequent score, discussed in Chapter 7. That intellectual skills can also be effectively taught at the adult level is suggested by the promising exploratory research reported by Whimbey (1975), which he describes as "cognitive therapy."

INSTABILITY OF THE IQ. Correlational studies on the stability of the IQ provide actuarial data, applicable to group predictions. For the reasons given above, IQ's tend to be quite stable in this actuarial sense. Studies of individuals, on the other hand, may reveal large upward or downward shifts in IQ.¹ Sharp rises or drops in IQ may occur as a result of major environmental changes in the child's life. Drastic changes in family structure or home conditions, adoption into a foster home, severe or prolonged illness, and therapeutic or remedial programs are examples of the type of events that may alter the child's subsequent intellectual development. Even children who remain in the same environment, however, may show large increases or decreases in IQ on retesting. These changes mean, of course, that the child is developing at a faster or a slower rate than that of the normative population on which the test was standardized. In general, children in culturally disadvantaged environments tend to lose and those in superior environments to gain in IQ with age. Investigations of the specific characteristics of these environments and of the children themselves are of both theoretical and practical interest.

¹ See, e.g., Bayley (1955), Bayley & Schaefer (1964), Bradway (1945), Bradway & Robinson (1961), Haan (1963), Honzik, Macfarlane, & Allen (1948), Kagan & Freeman (1963), Kagan, Sontag, Baker, & Nelson (1958), McCall, Appelbaum, & Hogarty (1973), Rees & Palmer (1970), Sontag, Baker, & Nelson (1958), Wiener, Rider, & Oppel (1963). Pinneau (1961) has prepared tables showing the median and range of individual IQ changes found in the Berkeley Growth Study for each age at test and retest from 1 month to 17 years.

Typical data on the magnitude of individual IQ changes are provided by the California Guidance Study. In an analysis of retest data from 222 cases in this study, Honzik, Macfarlane, and Allen (1948) reported individual IQ changes of as much as 50 points. Over the period from 6 to 18 years, when retest correlations are generally high, 59 percent of the children changed by 15 or more IQ points, 37 percent by 20 or more points, and 9 percent by 30 or more. Nor are most of these changes random or erratic in nature. On the contrary, children exhibit consistent upward or downward trends over several consecutive years; and these changes are related to environmental characteristics. In the California Guidance Study, detailed investigation of home conditions and parent-child relationships indicated that large upward or downward shifts in IQ were associated with the cultural milieu and emotional climate in which the child was reared. A further follow-up conducted when the participants had reached the age of 30 still found significant correlations between test scores and family milieu as assessed at the age of 21 months (Honzik, 1967). Parental concern with the child's educational achievement emerged as an important correlate of subsequent test performance, as did other variables reflecting parental concern with the child's general welfare.

In the previously mentioned follow-up of the 1937 Stanford-Binet standardization sample, Bradway (1945) selected for special study the 50 children showing the largest IQ changes from the preschool to the junior high school period. Results of home visits and interviews with parents again indicated that significant rises or drops in IQ over the 10-year period were related to various familial and home characteristics.

In a reanalysis of results obtained in five published longitudinal studies, including some of those cited directly in this chapter, Rees and Palmer (1970) found changes in IQ between 6 and 12 years to be significantly related to socioeconomic status as indicated by father's educational and occupational level. A similar relationship was observed by Harnqvist (1968) in his Swedish study. In their 10- and 25-year follow-ups of children who had taken the Stanford-Binet at preschool ages, Bradway and Robinson (1961) computed an index based on parents' education, father's occupation, and occupation of both grandfathers. Although they labeled this measure an ancestral index, rather than an index of socioeconomic status, their results are consistent with those of other investigators: the index yielded significant correlations of approximately .30 with IQ's in both follow-ups. Several longitudinal studies have demonstrated a relation between amount and direction of change in intelligence test scores and amount of formal schooling the individual himself has completed in the interval between test and retest (see Harnqvist, 1973). The score differences associated with schooling are larger than those associated with socioeconomic status of the family.

Some investigators have concentrated more specifically on the personality characteristics associated with intellectual acceleration and deceleration. At the Fels Research Institute, 140 children were included in an intensive longitudinal study extending from early infancy to adolescence and beyond (Kagan & Freeman, 1963; Kagan, Sontag, Baker, & Nelson, 1958; Sontag, Baker, & Nelson, 1958). Within this group, those children showing the largest gains and those showing the largest losses in IQ between the ages of 4½ and 6 were compared in a wide variety of personality and environmental measures; the same was done with those showing the largest IQ changes between 6 and 10. During the preschool years, emotional dependency on parents was the principal condition associated with IQ loss. During the school years, IQ gains were associated chiefly with high achievement drive, competitive striving, and curiosity about nature. Suggestive data were likewise obtained regarding the role of parental attitudes and child-rearing practices in the development of these traits.

A later analysis of the same sample, extending through age 17, focused principally on patterns of IQ change over time (McCall, Appelbaum, & Hogarty, 1973). Children exhibiting different patterns were compared with regard to child-rearing practices as assessed through periodic home visits. A typical finding is that the parents of children whose IQ's showed a rising trend during the preschool years presented "an encouraging and rewarding atmosphere, but one with some structure and enforcement of policies" (McCall et al., 1973, p. 54). A major condition associated with rising IQ's is described as accelerational attempt, or the extent to which "the parent deliberately trained the child in various mental and motor skills which were not yet essential" (p. 52).

Another approach to an understanding of IQ changes is illustrated by Haan's (1963) follow-up study of 49 men and 50 women who had participated in a long-term growth study. IQ's were obtained with a group test administered when the subjects were about 12 years old and again when they were in their middle or late 30's. Personality characteristics were investigated through a self-report inventory and a series of intensive interviews conducted at the time of the adult follow-up. The upper and lower 25 percent of the group in terms of IQ change, designated as accelerators and decelerators, were compared with special reference to their reliance on coping or defense mechanisms. These mechanisms refer to contrasting personality styles in dealing with problems and frustrations.
Coping mechanisms in general represent an objective, constructive, realistic approach; defense mechanisms are characterized by withdrawal, denial, rationalization, and distortion. The results confirmed the hypothesis that accelerators made significantly more use of coping mechanisms and decelerators of defense mechanisms.

Similar results are reported by Moriarty (1966), from a longitudinal study of 65 children tested from two to four times between infancy and the early teens. On the basis of IQ changes, the children were classified into four categories: (a) relatively constant, 40 percent; (b) accelerative spurts in one or more areas of functioning, 25 percent; (c) slow, delayed, or inhibited development, 9 percent; (d) erratic score changes, inconsistent performance in different functions, or progressive IQ decline, 26 percent. Intensive case studies of the individual children in these four categories led Moriarty to hypothesize that characteristic differences in coping mechanisms constitute a major factor in the observed course of IQ over time.

Research on the factors associated with increases and decreases in IQ throws light on the conditions determining intellectual development in general. It also suggests that prediction of subsequent intellectual status can be improved if measures of the individual's emotional and motivational characteristics and of his environment are combined with initial test scores. From still another viewpoint, the findings of this type of research point the way to the kind of intervention programs that can effectively alter the course of intellectual development in the desired directions.

The assessment of intelligence at the two extremes of the age range presents special theoretical and interpretive problems. One of these problems pertains to the functions that should be tested. What constitutes intelligence for the infant and the preschool child? What constitutes intelligence for the older adult? The second problem is not entirely independent of the first. Unlike the schoolchild, the infant and preschooler have not been exposed to the standardized series of experiences represented by the school curriculum. In developing tests for the elementary, high school, and college levels, the test constructor has a large fund of common experiential material from which he can draw test items. Prior to school entrance, on the other hand, the child's experiences are far less standardized, despite certain broad cultural uniformities in child-rearing practices. Under these conditions, both the construction of tests and the interpretation of test results are much more difficult. To some extent the same difficulty is encountered in testing older adults, whose schooling was completed many years earlier and who have since been engaged in highly diversified activities. In this and the next section, we shall examine some of the implications of these problems for early childhood and adult testing, respectively.

PREDICTIVE VALIDITY OF INFANT AND PRESCHOOL TESTS. The conclusion that emerges from longitudinal studies is that preschool tests (especially when administered after the age of 2 years) have moderate validity in predicting subsequent intelligence test performance, but that infant tests have virtually none (Bayley, 1970; Lewis, 1973; McCall, Hogarty, & Hurlburt, 1972). Combining the results reported in eight studies, McCall and his associates computed the median correlations between tests administered during the first 30 months of life and childhood IQ obtained between 3 and 18 years (McCall et al., 1972). Their findings are reproduced in Table 25. Several trends are apparent in this table. First, tests given during the first year of life have little or no long-term predictive value. Second, infant tests show some validity in predicting IQ at preschool ages (3-4 years), but the correlations exhibit a sharp drop beyond that point, after children reach school age. Third, after the age of 18 months, validities are moderate and stable. When predictions are made from these ages, the correlations seem to be of the same order of magnitude regardless of the length of the retest interval.

TABLE 25
Median Correlations between Infant Tests and Childhood IQ
(From McCall, Hogarty, & Hurlburt, 1972. Copyright 1972 by the American Psychological Association. Reprinted by permission.)

Childhood Age            Age in Months at Infant Test
in Years (Retest)      1-6     7-12    13-18   19-30
8-18                   .01     .20     .21     .49
5-7                    .01     .06     .30     .41
3-4                    .23     .33     .47     .54

The lack of long-term predictive validity of infant tests needs to be evaluated further with regard to other related findings. First, a number of clinicians have argued that infant tests do improve the prediction of subsequent development, but only if interpreted in the light of concomitant clinical observations (Donofrio, 1965; Escalona, 1950; Knobloch & Pasamanick, 1960). Predictions might also be improved by a consideration of developmental trends through repeated testing, a procedure originally proposed by Gesell with reference to his Developmental Schedules.

In the second place, several investigators have found that infant tests have much higher predictive validity within nonnormal, clinical populations than within normal populations. Significant validity coefficients in the .60's and .70's have been reported for children with initial IQ's below 80, as well as for groups having known or suspected neurological abnormalities (Ireton, Thwing, & Gravem, 1970; Knobloch & Pasamanick, 1963, 1966, 1967; Werner, Honzik, & Smith, 1968). Infant tests appear to be most useful as aids in the diagnosis of defective development resulting from organic pathology of either hereditary or environmental origin. In the absence of organic pathology, the child's subsequent development is determined largely by the environment in which he is reared. This the test cannot be expected to predict. In fact, parental education and other characteristics of the home environment are better predictors of subsequent IQ than are infant test scores; and beyond 18 months, prediction is appreciably improved if test scores are combined with indices of familial socioeconomic status (Bayley, 1955; McCall et al., 1972; Pinneau, 1961; Werner, Honzik, & Smith, 1968).

NATURE OF EARLY CHILDHOOD INTELLIGENCE. Several investigators have concluded that, while lacking predictive validity for the general population, infant intelligence tests are valid indicators of the child's cognitive abilities at the time (Bayley, 1970; Stott & Ball, 1965; Thomas, 1970).

... stage, provide a framework for examining the changing nature of intelligence. McCall and his associates at the Fels Research Institute (McCall et al., 1972) have explored the interrelations of infant behavior in terms of such a Piagetian orientation. Through sophisticated statistical analyses involving intercorrelations of different skills within each age as well as correlations among the same and different skills across ages, these investigators looked for precursors of later development in infant behavior. Although the findings are presented as highly tentative and only suggestive, the authors describe the major component of infant intelligence at each six-month period during the first 2 years of life. These descriptions bear a rough resemblance to Piagetian developmental sequences. The main developmental trends at 6, 12, 18, and 24 months, respectively, are summarized as follows: (1) manipulation that produces contingent perceptual responses; (2) imitation of fine motor and social-vocal-verbal behavior; (3) verbal labeling and comprehension; (4) further verbal development, including fluent verbal production and grammatical maturity.

Apart from a wealth of provocative hypotheses, one conclusion that emerges clearly from the research of McCall and his co-workers is that the predominant behavior at different ages exhibits qualitative shifts and fails to support the conception of a "constant and pervasive" general intellectual ability (McCall et al., 1972, p. 746). The same conclusion was reached by Lewis (1973) on the basis of both his own research and his survey of published studies. Lewis describes infant intelligence test performance as being neither stable nor unitary. Negligible correlations may be found over intervals even as short as three months; and correlations
According to this view, a major reason for the negligible correlations with performance on the same or different scales at the age of two years
between infant tests and subsequent performance is to be found in the and beyond are usually insignificant. 1\10reO\'er, there is little correiation
changing nature and composition of intelligence with age. Intelligence in among different scales administered at the same age. These results have
infancy is qualitatively different from intelligence at school age; it con- been obtained with both standardized instruments such as the Bavlev
sists of a different combination of abilities. Scales of Infant DcveJormlcnt and with ordinal scales of the Piao~tia;l
This approach is consistent with the concept of developmental tasks type (Gottfried & Brody, 19"75; King & Seegmiller, 1971; Lewis &. Mc-
proposed by several psychologists in a variety of contexts (Erikson, 19,50; Gurk, 1972). In ~lace. of the traditional model of a "developmentally
Havighurst, 195:3; Super et al., 1957). Educationally and vocationally, as' constant general mte1hgence," Lewis proposes an interactionist view
well as in other aspects of human development, the individual encounters emrhasizi.n~ both. the role of experience in cognitive development and
typical behavioral demands and problems at diHerent life stages, from the speclfJclty of mtellcctual skills.
infancy to senescence. Although both the problems and the appropriate
reactions vary somewhat among cultures and subc~Jtures, modal require-
ments can be speCified within a given cultural setting. Each life stage I\fI'LICATIO:\'S FOR I:\,TEP""P,TIC)fo; PROGP~"'\fS. The bte 1960s and earl"
m:tkcs characteristic demands up()n the individual. 1\Iaste~' of the de- ~giOs witnessed scnne disillusionm~nt and considerable confusion regard-
vLloprnental tasks of earlier stages influences the individu,1l's handhng mg the. purposes, methods, and efJectiveness of compensaton' preschool
of Ihe lwhavioral demands of the next. educ~ttJonal programs, such as Project Head Start. Designed prinCipally
\\'itlJin the rnorc c:irClln1ScriLed area of cognitiVE: dcvelopment, Pia;:c- to en!J:mcc: the ac:JC1emic readiness of children frorn disadvanLczed back-
grounds, these programs differed widely in procedures and results. ~1ost
were crash projects, initiated with inadequate planning. Only a few could demonstrate substantial improvements in the children's performance, and such improvements were often limited and short-lived (Stanley, 1972).

Against this background, the Office of Child Development of the U.S. Department of Health, Education, and Welfare sponsored a conference of a panel of experts to try to define "social competency" in early childhood (Anderson & Messick, 1974). The panel agreed that social competency includes more than the traditional concept of general intelligence. After working through a diversity of approaches and some thorny theoretical issues, the panel drew up a list of 29 components of social competency, which could serve as possible goals of early intervention programs. Including emotional, motivational, and attitudinal as well as cognitive variables, these components ranged from self-care and a differentiated self-concept to verbal and quantitative skills, creative thinking, and the enjoyment of humor, play, and fantasy. Assessment of these components requires not only a wide variety of tests but also other measurement techniques, such as ratings, records, and naturalistic observations. Few if any intervention programs could undertake to implement all the goals. But the selection of goals should be explicit and deliberate; and it should guide both intervention procedures and program evaluation.

The importance of evaluating the effectiveness of an intervention program in terms of the specific skills (cognitive or noncognitive) that the program was designed to improve is emphasized by Lewis (1973). In line with the previously cited specificity of behavioral development in early childhood, Lewis urges the measurement of specific skills rather than the use of an IQ or other broad developmental indices. Training in sensorimotor functions should not be expected to improve verbal skills; training in the development of object permanence should be assessed by a test of object permanence, and so forth. In addition, the content of the intervention program should be tailored to fit the needs of the individual child, with reference to the development attained in specific skills.

Sigel (1973) gives an incisive analysis of preschool programs against the background of current knowledge regarding both child development and educational techniques. In line with available knowledge about developmental processes, he, too, recommends the use of specific achievement tests to assess progress in the skills developed by the educational programs, instead of such global scores as IQ's. He also emphasizes the importance of process, interrelation of changes in different functions, and patterns of development, as in the characteristic Piagetian approach. And he urges the reformulation of the goals of early childhood education in more realistic terms.

PROBLEMS IN THE TESTING OF ADULT INTELLIGENCE

AGE DECREMENT. A distinctive feature introduced by the Wechsler scales for measuring adult intelligence (Ch. 9) was the use of a declining norm to compute deviation IQ's. It will be recalled that raw scores on the WAIS subtests are first transmuted into standard scores with a mean of 10 and an SD of 3. These scaled scores are expressed in terms of a fixed reference group consisting of the 500 persons between the ages of 20 and 34 years included in the standardization sample. The sum of the scaled scores on the 11 subtests is used in finding the deviation IQ in the appropriate age table. If we examine the sums of the scaled scores directly, however, we can compare the performance of different age groups in terms of a single, continuous scale. Figure 55 shows the means of these total scaled scores for the age levels included in the national standardization sample and for the more limited "old-age sample" of 475 persons aged 60 years and over (Doppelt & Wallace, 1955).

FIG. 55. Decline in WAIS Scaled Scores with Age.
(From Doppelt & Wallace, 1955, p. 323.)
[Graph not reproduced. Horizontal axis: Age in Years (Midpoints of Age Groups).]

As can be seen in Figure 55, the scores reach a peak between the ages of 20 and 34 and then decline slowly until 60. A sharper rate of decline was found after 60 in the old-age sample. The deviation IQ is found by referring an individual's total scaled score to the norm for his own age
group. Thus, if he shows the same decline in performance with age as the normative sample, his IQ will remain constant. The underlying assumption is that it is "normal" for an individual's tested ability to decline with age beyond the 30's.

Two facts from the Wechsler standardization data are relevant to the interpretation of age changes. First, since the standardization sample is a normative sample, it should reflect existing population characteristics at each age level (see Anastasi, 1956). Because of the rising educational level of the general population, older groups at any one point in time will have received less education than younger groups. This educational difference is clearly reflected in the WAIS standardization sample, in which the maximum years of schooling are found at the 20-34 year levels and educational level drops consistently in the older groups. These age differences in amount of education are inevitable if the standardization sample is to be truly representative of the population of the country at the time the norms were established. Nevertheless, the educational differences complicate the interpretation of the observed score decrements. The older groups in the standardization sample may have performed more poorly on the test, not because they were growing old, but because they had received less education than the younger groups.

A second pertinent fact emerges from a comparison of the WAIS with the Wechsler-Bellevue, which was standardized approximately 15 years earlier. In the Wechsler-Bellevue standardization sample, improvement in score ceased at an earlier age and the decline set in earlier than in the WAIS sample. An examination of the educational distributions of the two samples reveals that the changes in the age curves parallel the educational changes that have occurred in the general population during the intervening period. Persons in the WAIS standardization sample had received more education on the average than persons of the corresponding age in the Wechsler-Bellevue sample, since the latter were educated 15 years earlier.

The results obtained with the standardization samples of the Wechsler scales are typical of the findings of cross-sectional studies of adult intelligence. Cross-sectional comparisons, in which persons of different ages are examined at the same time, are likely to show an apparent age decrement because cultural changes are confounded with the effects of aging. Amount of formal education is only one of many variables in which age groups may differ. Other cultural changes have occurred in our society during the past half century which make the experiential backgrounds of 20-year-olds and 70-year-olds quite dissimilar. Certainly changes in communication media, such as radio and television, and in transportation facilities have greatly increased the range of information available to the developing individual. Improvements in nutrition and medical care would also indirectly influence behavior development.

Longitudinal studies, based on retests of the same persons over periods of 5 to 40 years, have generally revealed the opposite trend, the scores tending to improve with age. Several of these longitudinal investigations have been conducted with intellectually superior groups, such as college graduates or individuals initially chosen because of high IQ's (Bayley & Oden, 1955; Burns, 1966; Campbell, 1965; Nisbet, 1957; Owens, 1953, 1966). For this reason, some writers have argued that the findings may be restricted to persons in the higher intellectual or educational levels and do not apply to the general population. However, similar results have been obtained in other longitudinal studies with normals (Charles & James, 1964; Eisdorfer, 1963; Schaie & Labouvie, 1974; Tuddenham, Blumenkrantz, & Wilkin, 1968), as well as with mentally retarded adults outside of institutions (Baller, Charles, & Miller, 1967; Bell & Zubek, 1960; Charles, 1953).

Neither cross-sectional nor longitudinal studies alone can provide a conclusive interpretation of observed age changes. Several excellent analyses of the methodological difficulties inherent in each approach, together with the required experimental designs, have been published (Baltes, 1968; Buss, 1973; Damon, 1965; Goulet & Baltes, 1970; Kuhlen, 1963; Nesselroade & Reese, 1973; Schaie, 1965). Basically, what is needed in order to tease out the effect of cultural changes is a combination of cross-sectional and longitudinal approaches. On the one hand, age differences in educational level may produce a spurious age decrement in test performance in cross-sectional studies. On the other hand, as the individual grows older, he is himself exposed to cultural changes that may improve his performance on intelligence tests.

A few studies provide data that permit at least a partial analysis of the contributing factors. Owens (1966) in his 40-year retest of Iowa State University students and D. P. Campbell (1965) in his 25-year retest of University of Minnesota students also tested present freshmen in the respective colleges. Thus, multiple comparisons could be made between the two groups tested at the same age 25 or 40 years apart, and the performance of a single group tested before and after the same time intervals. In both studies, the initial group improved over its own earlier performance, but performed about on a par with the younger group tested at the later date. Such findings suggest that it is cultural change and other experiential factors, rather than age per se, that produce both the rises and declines in scores obtained with the more limited experimental designs.

Another, particularly well-designed study utilizing this combined approach was conducted on a more nearly representative sample of the adult population (Schaie & Strother, 1968). A specially selected battery of tests was administered to a stratified-random sample of 500 persons.
The population from which this sample was drawn consisted of approximately 18,000 members of a prepaid medical plan, whose membership was fairly representative of the census figures for a large metropolitan area. The sample included 25 men and 25 women at each five-year age interval from 20 to 70. Seven years later, all the original subjects who could be located were contacted and 302 of them were given the same tests again. This subsample was shown to be closely comparable to the original group in age, sex ratio, and socioeconomic level.

The design of this study permits two types of comparisons: (1) a cross-sectional comparison among different age groups from 20 to 70 tested at the same time, and (2) a longitudinal comparison within the same individuals, initially tested at ages ranging from 20 to 70 and retested after seven years. The results of the cross-sectional comparisons showed significant intergeneration differences on all tests. In other words, those born and reared more recently performed better than those born and reared at an earlier time period. Longitudinal comparisons, on the other hand, showed a tendency for mean scores either to rise or remain unchanged when individuals were retested. The one major exception occurred in two highly speeded tests, in which performance was significantly poorer after the seven-year interval.

The contrast between the results of the cross-sectional and longitudinal approaches is illustrated in Figure 56, showing the trends obtained with four of the tests.³ Similar results were found in a second seven-year retest of 161 of the original participants (Schaie & Labouvie-Vief, 1974). Further corroboration was provided by a still different approach involving the testing of three independent age-stratified samples drawn from the same population in 1956, 1963, and 1970 (Schaie, Labouvie, & Buech, 1973).

FIG. 56. Differences in Adult Intelligence as Assessed by Cross-Sectional and Longitudinal Studies.
(From Schaie & Strother, 1968, pp. 675, 676. Copyright 1968 by the American Psychological Association. Reprinted by permission.)
[Graphs not reproduced. Panels include Estimated Age Gradients for VERBAL MEANING and for SPACE; each panel plots cross-sectional and longitudinal curves against Age.]

In general, the results of the better designed studies of adult intelligence strongly suggest that the ability decrements formerly attributed to aging are actually intergeneration or intercohort differences, probably associated with cultural changes in our society. Genuine ability decrements are not likely to be manifested until well over the age of 60. Moreover, any generalization, whether pertaining to age decrement or cohort differences, must be qualified by a recognition of the wide individual variability found in all situations. Individual differences within any one age level are much greater than the average difference between age levels. As a result, the distributions of scores obtained by persons of different ages overlap extensively. This simply means that large numbers of older persons can be found whose performance equals that of younger persons. Moreover, the best performers within the older groups excel the poorest performers within the younger groups. Nor is such overlapping limited to adjacent age levels; the ranges of performance still overlap when extreme groups are compared. Thus some 80-year-olds will do better than some 20-year-olds.

What is even more relevant to the present topic, however, is that the changes that occur with aging vary with the individual. Thus between the ages of 50 and 60, for example, some persons may show a decrease, some no appreciable change, and some an increase in test performance. The amount of change, whether it be a drop or a rise, will also vary widely among individuals. Moreover, intensive studies of persons of advanced age, extending into the seventh, eighth, and ninth decades of life, indicate that intellectual functioning is more closely related to the individual's health status than to his chronological age (Birren, 1968; Palmore, 1970).
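
The logic of the combined cross-sectional and longitudinal design discussed above can be made concrete with a small numerical sketch. The Python fragment below is only an illustration under stated assumptions: every number in it (the cohort gain per birth year, the zero aging effect, the baseline score, and the within-age standard deviation) is hypothetical and is not taken from Schaie and Strother or from any other study cited here. It shows how a population in which no individual declines with age can still yield a cross-sectional "age decrement," and how much the score distributions of the oldest and youngest groups may nevertheless overlap.

```python
from statistics import NormalDist

# All values are hypothetical, chosen only to illustrate the reasoning.
COHORT_GAIN_PER_YEAR = 0.25   # assumed score gain for each later birth year
AGING_CHANGE_PER_YEAR = 0.0   # assumed true within-person change per year of age
BASE_BIRTH_YEAR = 1886        # birth year of a 70-year-old tested in 1956
BASE_SCORE = 50.0             # assumed expected score of that oldest cohort
WITHIN_AGE_SD = 10.0          # assumed spread of individuals at any one age


def expected_score(birth_year, age):
    """Expected score as the sum of a cohort effect and an aging effect."""
    cohort = COHORT_GAIN_PER_YEAR * (birth_year - BASE_BIRTH_YEAR)
    aging = AGING_CHANGE_PER_YEAR * (age - 20)
    return BASE_SCORE + cohort + aging


def cross_sectional_means(test_year, ages):
    """Different age groups, all tested in the same year."""
    return {age: expected_score(test_year - age, age) for age in ages}


def longitudinal_changes(test_year, ages, interval):
    """The same cohorts, retested after `interval` years."""
    return {
        age: expected_score(test_year - age, age + interval)
        - expected_score(test_year - age, age)
        for age in ages
    }


def share_above(older_mean, younger_mean, sd):
    """Proportion of the older group scoring above the younger group's mean,
    assuming normally distributed scores with a common SD."""
    return 1.0 - NormalDist(older_mean, sd).cdf(younger_mean)


ages = list(range(20, 71, 5))
cross = cross_sectional_means(1956, ages)
change = longitudinal_changes(1956, ages, 7)
overlap = share_above(cross[70], cross[20], WITHIN_AGE_SD)

print("cross-sectional means:", cross)
print("seven-year changes:   ", change)
print("share of 70-year-olds above the 20-year-old mean:", round(overlap, 3))
```

Under these assumptions the cross-sectional means fall steadily from age 20 to age 70, although the seven-year change within every cohort is zero; the apparent decrement is entirely an intercohort difference. Setting AGING_CHANGE_PER_YEAR to a nonzero value would let the two sources of difference be varied independently, which is what the combined design is intended to permit.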
³ These are the tests that most closely resemble intelligence tests in their content. Of these four tests, only Reasoning showed a barely significant (p < .05) relation to age in the longitudinal study. The magnitude of the decline, however, is much smaller than in the cross-sectional comparison.

NATURE OF ADULT INTELLIGENCE. Within the life span, testing has been oriented chiefly toward the schoolchild and college student. At these levels, the test constructor can draw on the large common pool of experiences that have been organized into academic curricula. Most intelligence tests measure how well the individual has acquired the intellectual skills taught in our schools; and they can in turn predict how well he is prepared for the next level in the educational hierarchy. Tests for adults, including the Wechsler scales, draw largely on this identifiable common fund of experience. As the individual grows older and his own formal educational experiences recede farther into the past, this fund of common experience may become increasingly less appropriate to assess his intellectual functioning. Adult occupations are more diversified than childhood schooling. The cumulative experiences of adulthood may thus stimulate a differential development of abilities in different persons.

Because intelligence tests are closely linked to academic abilities, it is not surprising to find that longitudinal studies of adults show larger age increments in score among those individuals who have continued their education longer (D. P. Campbell, 1965; Härnqvist, 1968; Husén, 1951; Lorge, 1945; Owens, 1953). Similarly, persons whose occupations are more "academic" in content, calling into play verbal and numerical abilities, are likely to maintain their performance level or show improvement in intelligence test scores over the years, while those engaged in occupations emphasizing mechanical activities or interpersonal relations may show a loss. Some suggestive data in support of this hypothesis are reported by Williams (1960), who compared the performance of 100 persons, ranging in age from 65 to over 90, on a series of verbal and nonverbal tests. Rather striking correspondences were found between the individual's occupation and his relative performance on the two types of tasks. Longitudinal investigations of adults have also found suggestive relationships between total IQ changes and certain biographical inventory items (Charles & James, 1964; Owens, 1966).

Each time and place fosters the development of skills appropriate to its characteristic demands. Within the life span, these demands differ for the infant, the schoolchild, the adult in different occupations, and the retired septuagenarian. An interesting demonstration of the implications of this fact for intelligence testing was provided by Demming and Pressey (1957). These investigators began with a task analysis of typical adult functions, conducted through informal surveys of reading matter and of reported daily activities and problems. On this basis, they prepared preliminary forms of some 20 tests "indigenous" to the older years. The tests emphasized practical information, judgment, and social perception. Results with three of these tests, administered together with standard verbal and nonverbal tests to samples of different ages, showed that the older persons excelled the younger on the new tests, while the reverse relationship held for the traditional tests. All these types of research suggest that whether intelligence test scores rise or decline with increasing age in adulthood depends largely on what experiences the individual undergoes during those years and on the relationship between these experiences and the functions covered by the tests.

PROBLEMS IN CROSS-CULTURAL TESTING

The use of tests with persons of diverse cultural backgrounds has already been considered from various angles in earlier parts of this book. Chapter 3 was concerned with the social and ethical implications of such testing, particularly with reference to minority groups within a broader national culture. Technical problems pertaining to test bias and to item-group interaction were analyzed in Chapters 7 and 8. And in Chapter 10 we examined typical tests designed for various transcultural applications. In this section, we shall present some basic theoretical issues about the role of culture in behavior, with special reference to the interpretation of intelligence test scores.

LEVELS OF CULTURAL DIFFERENTIALS. Cultural differences may operate in many ways to bring about group differences in behavior. The level at which cultural influences are manifested varies along a continuum extending from superficial and temporary effects to those which are basic, permanent, and far-reaching. From both a theoretical and a practical standpoint, it is important to inquire at what level of this continuum any observed behavioral difference falls. At one extreme we find cultural differences that may affect only responses on a particular test and thus reduce its validity for certain groups. There are undoubtedly test items that have no diagnostic value when applied to persons from certain cultures because of lack of familiarity with specific objects or other relatively trivial experiential differences.

Most cultural factors that affect test responses, however, are also likely to influence the broader behavior domain that the test is designed to sample. In an English-speaking culture, for example, inadequate mastery of English may handicap a child not only on an intelligence test but also in his school work, contact with associates, play activities, and other situations of daily life. Such a condition would thus interfere with the child's subsequent intellectual and emotional development and would have practical implications that extend far beyond immediate test performance. At the same time, deficiencies of this sort can be remedied without much difficulty. Suitable language training can bring the individual up to an effective functioning level within a relatively short period.
The language an individual has been taught to speak was chosen in the above example because it provides an extreme and obvious illustration of several points: (1) it is clearly not a hereditary condition; (2) it can be altered; (3) it can seriously affect performance on a test administered in a different language; (4) it will similarly affect the individual's educational, vocational, and social activities in a culture that uses an unfamiliar language. Many other examples can be cited from the middle range of the continuum of cultural differentials. Some are cognitive differentials, such as reading disability or ineffective strategies for solving abstract problems; others are attitudinal or motivational, such as lack of interest in intellectual activities, hostility toward authority figures, low achievement drive, or poor self-concept. All such conditions can be ameliorated by a variety of means, ranging from functional literacy training to vocational counseling and psychotherapy. All are likely to affect both test performance and the daily life activities of the child and adult.

As we move along the continuum of cultural differentials, we must recognize that the longer an environmental condition has operated in the individual's lifetime, the more difficult it becomes to reverse its effects. Conditions that are environmentally determined are not necessarily remediable. Adverse experiential factors operating over many years may produce intellectual or emotional damage that can no longer be eliminated by the time intervention occurs. It is also important to bear in mind, however, that the permanence or irremediability of a psychological condition is no proof of hereditary origin.

An example of cultural differentials that may produce permanent effects on individual behavior is provided by research on complications of pregnancy and parturition (Knobloch & Pasamanick, 1966; Pasamanick & Knobloch, 1966). In a series of studies on large samples of blacks and whites, prenatal and perinatal disorders were found to be significantly related to mental retardation and behavior disorders in the offspring. An important source of such irregularities in the process of childbearing and birth is to be found in deficiencies of maternal nutrition and other conditions associated with low socioeconomic status. Analysis of the data revealed a much higher frequency of all such medical complications in lower than in higher socioeconomic levels, and a higher frequency among blacks than among whites. Here then is an example of cultural differentials producing organic disorders that in turn may lead to behavioral deficiencies. The effects of this type of cultural differential cannot be completely reversed within the lifetime of the individual, but require more than one generation for their elimination. Again it needs to be emphasized, however, that such a situation does not imply hereditary defect, nor does it provide any justification for failure to improve the environmental conditions that brought it about.

CULTURAL DIFFERENCES AND CULTURAL HANDICAP. When psychologists began to develop instruments for cross-cultural testing in the first quarter of this century, they hoped it would be at least theoretically possible to measure hereditary intellectual potential independently of the impact of cultural experiences. The individual's behavior was thought to be overlaid with a sort of cultural veneer whose penetration became the objective of what were then called "culture-free" tests. Subsequent developments in genetics and psychology have demonstrated the fallacy of this concept. We now recognize that hereditary and environmental factors interact at all stages in the organism's development and that their effects are inextricably intertwined in the resulting behavior. For man, culture permeates nearly all environmental contacts. Since all behavior is thus affected by the cultural milieu in which the individual is reared and since psychological tests are but samples of behavior, cultural influences will and should be reflected in test performance. It is therefore futile to try to devise a test that is free from cultural influences. The present objective in cross-cultural testing is rather to construct tests that presuppose only experiences that are common to different cultures. For this reason, such terms as "culture-common," "culture-fair," and "cross-cultural" have replaced the earlier "culture-free" label.

No single test can be universally applicable or equally "fair" to all cultures. There are as many varieties of culture-fair tests as there are parameters in which cultures differ. A nonreading test may be culture-fair in one situation, a nonlanguage test in another, a performance test in a third, and a translated adaptation of a verbal test in a fourth. The varieties of available cross-cultural tests are not interchangeable but are useful in different types of cross-cultural comparisons.

It is unlikely, moreover, that any test can be equally "fair" to more than one cultural group, especially if the cultures are quite dissimilar. While reducing cultural differentials in test performance, cross-cultural tests cannot completely eliminate such differentials. Every test tends to favor persons from the culture in which it was developed. The mere use of paper and pencil or the presentation of abstract tasks having no immediate practical significance will favor some cultural groups and handicap others. Emotional and motivational factors likewise influence test performance. Among the many relevant conditions differing from culture to culture may be mentioned the intrinsic interest of the test content, rapport with the examiner, drive to do well on a test, desire to excel others, and past habits of solving problems individually or cooperatively. In testing children of low socioeconomic level, several investigators have found that the examinees rush through the test, marking answers almost at random and finishing before time is called (Eells et al., 1951). The same reaction has been observed among Puerto Rican schoolchildren tested in New
1'1(')$'
",: ' ,1" 19G'0
Ort~'· 'J. l"""
v,~; \' emon, lOG'"
v 0). 1
York City and in Hawaii (Anastasi & Cordova, 1953; S. Smith, 1942). Such a reaction may reflect a combination of lack of interest in the relatively abstract test content and expectation of low achievement on tasks resembling school work. By hurrying through the test, the child shortens the period of discomfort.

Each culture and subculture encourages and fosters certain abilities and ways of behaving, and discourages or suppresses others. It is therefore to be expected that, on tests developed within the majority American culture, for example, persons reared in that culture will generally excel. If a test were constructed by the same procedures within a culture differing markedly from ours, Americans would probably appear deficient in terms of test norms. Data bearing on this type of cultural comparison are meager. What evidence is available, however, suggests that persons from our culture may be just as handicapped on tests prepared within other cultures as members of those cultures are on our tests (Anastasi, 1958a, pp. 566-568). Cultural differences become cultural handicaps when the individual moves out of the culture or subculture in which he was reared and endeavors to function, compete, or succeed within another culture. From a broader viewpoint, however, it is these very contacts and interchanges between cultures that stimulate the advancement of civilizations. Cultural isolation, while possibly more comfortable for individuals, leads to societal stagnation.

LANGUAGE IN TRANSCULTURAL TESTING. Most cross-cultural tests utilize nonverbal content in the hope of obtaining a more nearly culture-fair measure of the same intellectual functions measured by verbal intelligence tests. Both assumptions underlying this approach are questionable. First, it cannot be assumed that nonverbal tests measure the same functions as verbal tests, however similar they may appear. A spatial analogies test is not merely a nonverbal version of a verbal analogies test. Some of the early nonlanguage tests, such as the Army Beta, were heavily loaded with spatial visualization and perceptual abilities, which are quite unrelated to verbal and numerical abilities. Even in tests like the Progressive Matrices and other nonlanguage tests deliberately designed to tap reasoning and abstract conceptualization, factorial analyses have revealed a large contribution of nonverbal factors to the variance of test scores (e.g., Das, 1963).

From a different angle, there is a growing body of evidence suggesting that nonlanguage tests may be more culturally loaded than language tests. Investigations with a wide variety of cultural groups in many countries have found larger group differences in performance and other nonverbal tests than in verbal tests (Anastasi, 1961; Irvine, 1969a, 1969b; Jensen, 1968; Ortar, 1963; Vernon, 1969). In a provocative analysis of the problem, Ortar (1963, pp. 232-233) writes:

On the basis of our results it appears that, both from the practical point of view and on theoretical grounds, the verbal tests and items are better suited as intercultural measuring instruments than any other kind. They must, of course, be translated and adapted, but this adaptation is infinitely easier and more reliable than the well-nigh impossible task of "translating" and adapting a performance test. The language of performance is the cultural perception, but its words and grammar and syntax are not even completely understood, let alone organized in national entities. We do not know how to "translate" a picture into the representational language of a different culture, but we are thoroughly familiar with the technique and requirements of translating verbal contents. . . . A concept that is non-existent in a certain language simply cannot be translated into this language, a barrier which acts as a safeguard against mechanical use of a given instrument when adapting it for a different culture.

Among the examples cited by Ortar is the observation that, when presented with a picture of a head from which the mouth was missing, Oriental immigrant children in Israel said the body was missing. Unfamiliar with the convention of considering the drawing of a head as a complete picture, these children regarded the absence of the body as more important than the omission of a mere detail like the mouth. For a different reason, an item requiring that the names of the seasons be arranged in the proper sequence would be more appropriate in a cross-cultural test than would an item using pictures of the seasons. The seasons would not only look different in different countries for geographical reasons, but they would also probably be represented by means of conventionalized pictorial symbols, which would be unfamiliar to persons from another culture.

The use of pictorial representation itself may be unsuitable in cultures unaccustomed to representative drawing. A two-dimensional reproduction of an object is not a perfect replica of the original; it simply presents certain cues which, as a result of past experience, lead to the perception of the object. If the cues are highly reduced, as in a simplified or schematic drawing, and if the necessary past experience is absent, the correct perception may not follow. There is now a considerable body of empirical data indicating marked differences in the perception of pictures by persons in different cultures (Miller, 1973; Segall, Campbell, & Herskovits, 1966).

From still another angle, nonverbal tests frequently require relatively abstract thinking processes and analytic cognitive styles characteristic of middle-class Western cultures (R. A. Cohen, 1969). Persons reared in
other cultural contexts may be much less accustomed to such problem-solving approaches.

It should be added that nonverbal tests have fared no better in the testing of minority groups and persons of low socioeconomic status within the United States. On the WISC, for instance, black children usually find the Performance tests as difficult as or more difficult than the Verbal tests; this pattern is also characteristic of children from low socioeconomic levels (Caldwell & Smith, 1965; Cole & Hunter, 1971; Goffeney, Henderson, & Butler, 1971; Hughes & Lessler, 1965; Teahan & Drews, 1962). The same groups tend to do better on the Stanford-Binet than on either Raven's Progressive Matrices (Higgins & Sivers, 1958) or Cattell's Culture Fair Intelligence Test (Willard, 1968).

There is of course no procedural difficulty in administering a verbal test across cultures speaking a common language. When language differences necessitate a translation of the test, problems arise regarding the comparability of norms and the equivalence of scores. It should also be noted that a simple translation would rarely suffice. Some adaptation and revision of content is generally required. Of interest in this connection is the procedure developed in equating the scales of the CEEB Scholastic Aptitude Test (SAT) and the Prueba de Aptitud Académica (PAA), a Spanish version of the SAT (Angoff & Modu, 1973).

The PAA was originally developed for local use in Puerto Rico, but has subsequently been considered by continental American universities as a possible aid in the admission of Spanish-speaking students. With such objectives in mind, an exploratory project in scale equating was undertaken. This project provides a demonstration of an ingenious method applicable to other situations requiring testing in multiple languages. Essentially, the procedure consists of two steps. The first step is the selection of a common set of anchor items judged to be equally appropriate for both groups of students. These items are administered in English to the English-speaking students and in Spanish to the Spanish-speaking students. The performance of these groups provides the data for measuring the difficulty level (Δ value) and discriminative power (r_bis with total test score) of each item. For the final set of anchor items, any items showing appreciable item-group interaction are discarded (see Ch. 8). These would be the "biased" items that are likely to have a psychologically different meaning for the two groups. The final anchor items are those having approximately the same relative difficulty for the English-speaking and the Spanish-speaking samples, in addition to meeting the specifications for difficulty level and discriminative power.

The second step is to include these anchor items in a regular administration of the SAT and the PAA and to use the scores on the anchor items as a basis for converting all test scores to a single scale. This step involves the same equating procedure regularly followed in converting scores on successive forms of the SAT to a uniform scale.

MEANING OF AN IQ. For the general public, the IQ is not identified with a particular type of score on a particular test, but is often a shorthand designation for intelligence. So prevalent has this usage become, that it cannot be merely ignored or deplored as a popular misconception. To be sure, when considering the numerical value of a given IQ, we should always specify the test from which it was derived. Different intelligence tests that yield an IQ do in fact differ in content and in other ways that affect the interpretation of their scores. Some of these differences among tests sharing the common label of "intelligence test" were apparent in the examples considered in the preceding chapters. Nonetheless, there is a need to reexamine the general connotations of the construct "intelligence," as symbolized by the IQ. It might be added that the prevalent conception of intelligence has been shaped to a considerable degree by the characteristics of the Stanford-Binet scale, which for many years provided the only instrument for the intensive measurement of intelligence and which was often used as a criterion for validating new tests.

First, intelligence should be regarded as a descriptive rather than an explanatory concept. An IQ is an expression of an individual's ability level at a given point in time, in relation to his age norms. No intelligence test can indicate the reasons for his performance. To attribute inadequate performance on a test or in everyday-life activities to "inadequate intelligence" is a tautology and in no way advances our understanding of the individual's handicap. In fact, it may serve to halt efforts to explore the causes of the handicap in the individual's history.

Intelligence tests, as well as any other kind of tests, should be used not to label an individual but to help in understanding him. To bring a person to his maximum functioning level we need to start where he is at the time; we need to assess his strengths and weaknesses and plan accordingly. If a reading test indicates that a child is retarded in reading, we do not label him as a nonreader and stop; nor do we give him a nonverbal test to conceal his handicap. Instead we concentrate on teaching him to read.

An important goal of contemporary testing, moreover, is to contribute to self-understanding and personal development. The information provided by tests is being used increasingly to assist individuals in educational and vocational planning and in making decisions about their own lives. The attention given to effective ways of communicating test
results to the individual attests to the growing recognition of this application of testing.

A second major point to bear in mind is that intelligence is not a single, unitary ability, but a composite of several functions. The term is commonly used to cover that combination of abilities required for survival and advancement within a particular culture. It follows that the specific abilities included in this composite, as well as their relative weights, will vary with time and place. In different cultures and at different historical periods within the same culture, the qualifications for successful achievement will differ. The changing composition of intelligence can also be recognized within the life of the individual, from infancy to adulthood. An individual's relative ability will tend to increase with age in those functions whose value is emphasized by his culture or subculture; and his relative ability will tend to decrease in those functions whose value is deemphasized (see, e.g., Levinson, 1959, 1961).

Typical intelligence tests designed for use in our culture with school-age children or adults measure largely verbal abilities; to a lesser degree, they also cover abilities to deal with numerical and other abstract symbols. These are the abilities that predominate in school learning. Most intelligence tests can therefore be regarded as measures of scholastic aptitude. The IQ is both a reflection of prior educational achievement and a predictor of subsequent educational performance. Because the functions taught in the educational system are of basic importance in our culture, the IQ is also an effective predictor of performance in many occupations and other activities of adult life.

On the other hand, there are many other important functions that intelligence tests have never undertaken to measure. Mechanical, motor, musical, and artistic aptitudes are obvious examples. Motivational, emotional, and attitudinal variables are important determiners of achievement in all areas. Current creativity research is identifying both cognitive and personality variables that are associated with creative productivity. All this implies, of course, that both individual and institutional decisions should be based on as much relevant data as can reasonably be gathered. To base decisions on tests alone, and especially on one or two tests alone, is clearly a misuse of tests. Decisions must be made by persons. Tests represent one source of data utilized in making decisions; they are not themselves decision-making instruments.

HERITABILITY AND MODIFIABILITY. Much confusion and controversy have resulted from the application of heritability estimates to intelligence test scores. A well-known example is an article by Jensen (1969), which has engendered great furor and led to many heated arguments. Although there are several aspects to this controversy and the issues are complicated, a major substantive source of controversy pertains to the interpretation of heritability estimates. Specifically, a heritability index shows the proportional contribution of genetic or hereditary factors to the total variance of a particular trait in a given population under existing conditions. For example, the statement that the heritability of Stanford-Binet IQ among urban American high school students is .70 would mean that 70 percent of the variance found in these scores is attributable to hereditary differences and 30 percent is attributable to environment.

Heritability indexes have been computed by various formulas (see, e.g., Jensen, 1969; Loehlin, Lindzey, & Spuhler, 1975), but their basic data are measures of familial resemblance in the trait under consideration. A frequent procedure is to utilize intelligence test correlations of monozygotic (identical) and dizygotic (fraternal) twins. Correlations between monozygotic twins reared together and between monozygotic twins reared apart in foster homes have also been used.

Several points should be noted in interpreting heritability estimates. First, the empirical data on familial resemblances are subject to some distortion because of the unassessed contributions of environmental factors. For instance, there is evidence that monozygotic twins share a more closely similar environment than do dizygotic twins (Anastasi, 1958a, pp. 287-288; Koch, 1966). Another difficulty is that twin pairs reared apart are not assigned at random to different foster homes, as they would be in an ideal experiment; it is well known that foster home placements are selective with regard to characteristics of the child and the foster family. Hence the foster home environments of the twins within each pair are likely to show sufficient resemblance to account for some of the correlation between their test scores. There is also evidence that twin data regarding heritability may not be generalizable to the population at large because of the greater susceptibility of twins to prenatal trauma leading to severe mental retardation. The inclusion of such severely retarded cases in a sample may greatly increase the twin correlation in intelligence test scores (Nichols & Broman, 1974).

Apart from questionable data, heritability indexes have other intrinsic limitations (see Anastasi, 1971; Hebb, 1970). It is noteworthy that in the early part of the previously cited article, Jensen (1969, pp. 33-46) clearly lists these limitations among others. First, the concept of heritability is applicable to populations, not individuals. For example, in trying to establish the etiology of a particular child's mental retardation, the heritability index would be of no help. Regardless of the size of the heritability index in the population, the child's mental retardation could have resulted from a defective gene (as in phenylketonuria or PKU), from prenatal brain damage, or from extreme experiential deprivation.

Second, heritability indexes refer to the population on which they were found at the time. Any change in either hereditary or environmental
conditions would alter the heritability index. For instance, an increase in inbreeding, as on an isolated island, would reduce the variance attributable to heredity and hence lower the heritability index; increasing environmental homogeneity, on the other hand, would reduce the variance attributable to environment and hence raise the heritability index. Furthermore, a heritability index computed within one population is not applicable to an analysis of the differences in test performance between two populations, such as different ethnic groups.

Third, heritability does not indicate the degree of modifiability of a trait. Even if the heritability index of a trait in a given population is 100 percent, it does not follow that the contribution of environment to that trait is unimportant. An extreme example may help to clarify the point. Suppose in a hypothetical adult community everyone has the identical diet. All receive the same food in identical quantities. In this population, the contribution of food to the total variance of health and physical condition would be zero, since food variance accounts for none of the individual differences in health and physique. Nevertheless, if the food supply were suddenly cut off, the entire community would die of starvation. Conversely, improving the quality of the diet could well result in a general rise in the health of the community.

Regardless of the magnitude of heritability indexes found for IQ's in various populations, one empirical fact is well established: the IQ is not fixed and unchanging; and it is amenable to modification by environmental interventions. Some evidence for this conclusion was examined earlier in this chapter, in connection with longitudinal studies. There has been some progress in identifying characteristics of accelerating and decelerating environments. Rises and drops in IQ may also result from both fortuitous environmental changes occurring in a child's life and planned environmental interventions (see Anastasi, 1958a). Major changes in family structure, sharp rises or drops in family income level, adoption into a foster home, or participation in a preschool program may produce conspicuous increases or decreases in IQ.

Prediction of both criterion performance and subsequent test score can be improved by taking into account salient features of the environment in which each child has lived during the intervening period. There is some suggestive evidence that correlations with subsequent intelligence test score or academic achievement can be substantially raised when environmental variables are included along with initial test scores as predictors (Bloom, 1964, Ch. 6). For example, in a group of 40 pupils, the correlation between reading comprehension at grade 2 and at grade 8 rose from .52 to .72 when father's occupation was added as a rough index of the cultural level of the home (p. 119). Again, a reanalysis of intelligence test data from the Harvard Growth Study showed that the correlation between intelligence test scores at ages 7 and 16 could be raised from .58 to .92, when parents' educational level was included in the multiple correlation (p. 80).

Relevant data are also provided by longitudinal studies of populations, in which comparable samples of the same population are tested many years apart (see Anastasi, 1958a, pp. 209-211). When cultural conditions have improved over the interval, a significant rise in the mean intelligence test performance of the population has generally been found. Such findings are illustrated by a comparison of the test scores of American soldiers examined during World Wars I and II, spanning an interval of 25 years (Tuddenham, 1948). For this purpose, a representative sample of enlisted men in the World War II army was given both the AGCT and a revision of the Army Alpha of World War I. The distribution of this group on the AGCT paralleled closely that of the entire army. With the data from this sample as a bridge, it was possible to estimate that the median performance of the World War II army fell at the 83rd percentile of the World War I army. In other words, 83 percent of the World War I population fell below the median score of the World War II population. It is noteworthy that the average amount of education in the World War II population was 10 years, as compared with 8 years during World War I. This increase in amount of education, accompanied by improvements in communication, transportation, and other cultural changes that increased the individual's range of experience, was reflected in improved test performance.

A similar investigation on a smaller scale was conducted with schoolchildren living in certain mountainous counties in Tennessee (Wheeler, 1942). Group intelligence tests were administered in 1940 to over 3,000 children in 40 rural schools. The results were compared with those obtained with children in the same areas and largely from the same families, who had been similarly tested in 1930. During the intervening 10-year period, the economic, cultural, and educational status of these counties had improved conspicuously. Paralleling such environmental changes, a rise in IQ was found at all ages and all grades. The median IQ was 82 in the 1930 sample and 93 in the 1940 sample.

From another angle, the very composition of intelligence may alter as a result of the individual's experiences. The individual's pattern of abilities will tend to change with age as his environment fosters and encourages the development of some aptitudes and deemphasizes other aptitudes. Moreover, factor-analytic research has demonstrated that experiential differences may influence not only the level of performance reached in different abilities, but also the way in which intelligence becomes differentiated into identifiable traits. There is empirical evidence that the
number and nature of traits or abilities may change over time and may differ among cultures or subcultures (Anastasi, 1970).

An individual's intelligence at any one point in time is the end product of a vast and complex sequence of interactions between hereditary and environmental factors. At any stage in this causal chain, there is opportunity for interaction with new factors; and because each interaction in turn determines the direction of subsequent interactions, there is an ever-widening network of possible outcomes. The connection between the genes an individual inherits and any of his behavioral characteristics is thus highly indirect and devious (see Anastasi, 1958b, 1973; Hebb, 1953).

INTELLIGENCE AND PERSONALITY. Although it is customary and convenient to classify tests into separate categories, it should be recognized that all such distinctions are superficial. In interpreting test scores, personality and aptitudes cannot be kept apart. An individual's performance on an aptitude test, as well as his performance in school, on the job, or in any other context, is influenced by his achievement drive, his persistence, his value system, his freedom from handicapping emotional problems, and other characteristics traditionally classified under the heading of "personality."

Even more important is the cumulative effect of personality characteristics on the direction and extent of the individual's intellectual development. Some of the evidence for this effect, collected through longitudinal studies of children and adults, was summarized earlier in this chapter. Other investigations on groups ranging from preschool children to college students have been surveyed by Dreger (1968). Although some of the research on young children utilized a longitudinal approach, data from older subjects were gathered almost exclusively through concurrent correlations of personality test scores with intelligence test scores and indices of academic achievement. The data assembled by Dreger indicate the importance of considering appropriate personality variables as an aid in understanding an individual's intelligence test performance and in predicting his academic achievement.

It would thus seem that prediction of a child's subsequent intellectual development could be improved by combining information about his emotional and motivational characteristics with his scores on ability tests. A word should be added, however, regarding the assessment of "motivation." In the practical evaluation of schoolchildren, college students, job applicants, and other categories of persons, psychologists are often asked for a measure of the individual's "motivation." When thus worded, this is a meaningless request, since motivation is specific. What is needed is an indication of the individual's value system and the intensity with which he will strive toward specific goals. The strength of these specific motives will interact with situational factors, as well as with aptitudes, to determine the individual's actual performance in given situations.

The relation between personality and intellect is reciprocal. Not only do personality characteristics affect intellectual development, but intellectual level also affects personality development. Suggestive data in support of this relation are provided in a study by Plant and Minium (1967). Drawing upon the data gathered in five available longitudinal investigations of college-bound young adults, the authors selected the upper and lower 25 percent of each sample in terms of intelligence test scores. These contrasted groups were then compared on a series of personality tests that had been administered to one or more of the samples. The personality tests included measures of attitudes, values, motivation, and interpersonal and other noncognitive traits. The results of this analysis revealed a strong tendency for the high-aptitude groups to undergo substantially more "psychologically positive" personality changes than did the low-aptitude groups.

The success the individual attains in the development and use of his aptitudes is bound to influence his emotional adjustment, interpersonal relations, and self-concept. In the self-concept we can see most clearly the mutual influence of aptitudes and personality traits. The child's achievement in school, on the playground, and in other situations helps to shape his self-concept; and his self-concept at any given stage influences his subsequent performance, in a continuing spiral. In this regard, the self-concept operates as a sort of private self-fulfilling prophecy.

At a more basic theoretical level, K. J. Hayes (1962) has proposed a broadly oriented hypothesis concerning the relationship of drives and intellect. Regarding intelligence as a collection of learned abilities, Hayes maintains that the individual's motivational makeup influences the kind and amount of learning that occurs. Specifically, it is the strength of the "experience-producing drives" that affects intellectual development. These drives are illustrated by exploratory and manipulatory activities, curiosity, play, the babbling of infants, and other intrinsically motivated behavior. Citing chiefly research on animal behavior, Hayes argues that these experience-producing drives are genetically determined and represent the only hereditary basis of individual differences in intelligence. It might be added that the hereditary or environmental basis of the experience-producing drives need not alter the conceptualization of their role in intellectual development. These two parts of the theory may be considered independently.

Whatever the origin of the experience-producing drives, the individual's experience is regarded as a joint function of the strength of these drives and the environment in which they operate. The cumulative effect of these experiences in turn determines the individual's intellectual level at
any given time. This is a provocative hypothesis, through which Hayes integrates a considerable body of data from many types of research on both human and animal behavior.

On the basis of 25 years of research on achievement motivation, J. W. Atkinson (1974, 1976) and his co-workers have formulated a comprehensive schema representing the interrelationships of abilities, motivation, and environmental variables. The approach is dynamic in that it implies systematic change rather than constancy in the individual's lifetime. It also incorporates the reciprocal effects of aptitudes and motivational variables; and it emphasizes the contribution of motivational variance to test performance. To illustrate the application of this conceptual schema, computer simulations were employed, showing how ability and motivation can jointly influence both intelligence test performance and cumulative achievement. Some supporting empirical data are also cited regarding the high school grade-point average of boys as

environment provides the immediate task and contributes to motivational strength for this task relative to the competing motivational strengths of alternative actions. Motivation affects both the efficiency with which the task is performed and the time spent on it (e.g., studying, carrying out a job-related activity). Efficiency reflects the relationship between nature of task and current motivation. Level of performance results from the individual's relevant abilities (e.g., as assessed by test scores) and the efficiency with which he applies those abilities to the current task. The final achievement or product shows the combined effects of level of performance while at work and time spent at work. Another and highly important consequent of level of performance × time spent at work is the lasting cumulative effects of this activity or experience on the individual's own cognitive and noncognitive development. This last step represents a feedback loop to the individual's personality, whose effects are likely to be reflected in his future test scores.
predicted from earlier intelligence test scores and a measure of achieve-
Many implications of this approach for the interpretation of test scores
ment motivation (Atkinson, O'Malley & Lens, 1976).
are spelled out in the original sources cited above, which should be
A diagram of Atkinson's conceptual schema is repr~duced in ~igure
examined for further details. The schema provides a promising orientation
57. Beginning at the left, this figure shows the combmed operatlon of
for the more effective utilization of tests and for the better understandina
heredity and past formative environment in the devel~pm;nt of both of the conditions underlying human achievement. b
c02:niti"e and noncounitive aspects of the total personalIty. fhe present
" b
[Figure 57: diagram not reproduced. Legible labels: Heredity; Formative environment; Personality (Abilities; Motives; Knowledge, beliefs, and conceptions); Immediate environment as guide to action (Nature of the task); Immediate environment as goad to action (Incentives and opportunities); Strength of motivation; Strength of motivation for alternatives; Efficiency; Level of performance while at work; Time spent at work; Achievement; Cumulative effects on the self (Growth in ability, knowledge, beliefs; Changed conceptions).]

FIG. 57. Schema Illustrating Interaction of Cognitive and Noncognitive Factors in Cumulative Achievement and Individual Development.
(From Atkinson, O'Malley, & Lens, 1976. Reproduced by permission of Academic Press.)
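
The relationships named in the account of Atkinson's schema can be illustrated with a brief numerical sketch. It is not the simulation reported by Atkinson and his co-workers; it only restates the verbal description under loudly simplified assumptions. The text states that achievement reflects level of performance × time spent at work. The multiplicative combination of ability and efficiency, the rule that time spent follows the task's share of total motivational strength, and all names and units (`Individual`, `time_share`, `hours_available`) are assumptions introduced here for illustration. The feedback loop to personality is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Individual:
    ability: float                  # relevant ability, e.g. a test score (arbitrary units)
    efficiency: float               # share of ability applied to the task, 0..1 (assumed scale)
    motivation_task: float          # strength of motivation for the task (hypothetical units)
    motivation_alternatives: float  # combined strength of motivation for competing activities


def level_of_performance(p: Individual) -> float:
    # Assumption: ability and efficiency combine multiplicatively.
    return p.ability * p.efficiency


def time_share(p: Individual) -> float:
    # Assumption: the share of available time spent on the task equals
    # the task's share of total motivational strength.
    total = p.motivation_task + p.motivation_alternatives
    return p.motivation_task / total if total > 0 else 0.0


def achievement(p: Individual, hours_available: float) -> float:
    # From the text: level of performance x time spent at work.
    return level_of_performance(p) * time_share(p) * hours_available


# Hypothetical values, chosen only to show the arithmetic.
student = Individual(ability=100.0, efficiency=0.5,
                     motivation_task=3.0, motivation_alternatives=1.0)
print(level_of_performance(student))
print(time_share(student))
print(achievement(student, hours_available=40.0))
```

Under these assumptions, two individuals with the same ability can differ in achievement either because one applies his abilities less efficiently or because competing motives draw time away from the task, which is the point the schema makes about motivational variance.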
