
Journal of Clinical and Experimental Neuropsychology 2002, Vol. 24, No. 3, pp. 383–405

1380-3395/02/2403-383$16.00 © Swets & Zeitlinger

From the Binet–Simon to the Wechsler–Bellevue: Tracing the History of Intelligence Testing

Corwin Boake
Department of Physical Medicine and Rehabilitation, University of Texas-Houston Medical School and The Institute for Rehabilitation and Research, Houston, TX, USA

ABSTRACT
The history of David Wechsler's intelligence scales is reviewed by tracing the origins of the subtests in the 1939 Wechsler–Bellevue Intelligence Scale. The subtests originated from tests developed between 1880 and World War I, and were based on approaches to mental testing including anthropometrics, association psychology, the Binet–Simon scales, language-free performance testing of immigrants and school children, and group testing of military recruits. Wechsler's subtest selection can be understood partly from his clinical experiences during World War I. The structure of the Wechsler–Bellevue Scale, which introduced major innovations in intelligence testing, has remained almost unchanged through later revisions.

When future psychologists look back upon the practice of psychological testing during the 1900s, they will be struck by the profound influence of David Wechsler's intelligence scales, beginning with the publication of the Wechsler–Bellevue Intelligence Scale in 1939. Current surveys of psychologists in the U.S.A. (Camara, Nathan, & Puente, 2000) demonstrate that the Wechsler intelligence scales continue to dominate individual intelligence testing. According to Lezak (1995), administration of a Wechsler intelligence scale "typically constitutes a substantial portion of the test framework of the neuropsychological examination" (p. 688). Yet, the history of the most popular intelligence tests is unfamiliar to most of the psychologists who use them. It is timely to retrace the origins of these scales because they represent a major chapter in the history of psychological assessment and because this history might have implications for future use of the scales.

The method used in this article is to review the history of the Wechsler intelligence scales by tracing the origins of the tests that became the Wechsler subtests. The historical record demonstrates that the Wechsler subtests represent the continuation of tests that existed long before 1939 and, in many cases, before 1900. The article identifies the persons who created the tests and reveals the purposes for which the tests were developed. Finally, the article discusses implications of the fact that, in the midst of accelerating scientific progress, advances from early intelligence tests to those of today have been relatively minor.

Address correspondence to: Corwin Boake, TIRR, 1333 Moursund, Houston, TX 77030-3405, USA. Tel.: 1-713-797-5913. Fax: 1-713-797-5208. E-mail: corwin.boake@uth.tmc.edu
Accepted for publication: July 31, 2001.

EARLY MENTAL TESTING

Tests for the assessment of cognitive and perceptual abilities had been developed well before the publication of the first Binet–Simon intelligence scale in 1905 (Peterson, 1925). Form boards were used during the mid-1800s by the French physician Edouard Seguin (Pichot, 1948, 1949) for training cognitively impaired children. The form board of Halstead's (1947) Tactual Performance Test was adapted by the American psychologist Henry Goddard from one of Seguin's designs. The British scientist Francis Galton developed a set of "anthropometric" measures, such as line bisection, that were administered to persons attending the 1884 International Health Exhibition in London (Galton, 1885). The pioneering American psychologist James McKeen Cattell, who in 1890 coined the term mental test, adapted Galton's tests for research with American college students during the 1890s. The following examples of the digit span and substitution tests demonstrate that the Wechsler scales contain mental tests from the period before the Binet–Simon scale and before the concept of psychometric intelligence.

Digit Span

The digit span test was already familiar in psychology before 1900, having been used in England during the 1880s in studies by Galton (1887) and by the cultural historian Joseph Jacobs (1887). In the article introducing this test of "the power of reproducing sounds accurately," Jacobs (1887) noted that the test stimuli had been switched to digits after it was found that nonsense syllables, which had been used initially following the example of Ebbinghaus, were too difficult for school children. Jacobs (1887) described how the test instructions were developed:

It was necessary, in the first place, to adopt some uniform rate at which the dictation should be given, as the power of apprehension varied with the rate of utterance. A sound every half-second was found to be a convenient rate, and a little practice with a metronome beating twice a second gives the experimenter a sense of the proper interval. If possible, two sets of the series of sounds should be given, and the highest number correctly reproduced is to be regarded as the limit which we wish to find, and which we term here the span. The reading should be in a monotonous tone, so as not to give any perceptible accent or rhythm, either of which, it appeared, assists the power of repetition in a considerable degree. (p. 76)

Addressing "the question as to what is the exact power of the mind which is involved in reproducing these sounds," Jacobs (1887) claimed that the digit span test selectively measured a process that he termed prehension and that he defined "as the mind's power of taking on certain material" (p. 79). Jacobs concluded that since "we clearly cannot take in without first taking on ... the mental operation we have been testing thus seems a necessary preliminary to all obtaining of mental material," and he recommended "that 'span of prehension' should be an important factor in determining mental grasp, and its determination one of the tests of mental capacity" (1887, p. 79).

Substitution Test

The substitution test, the precursor of the Wechsler Digit Symbol and Coding subtests, was probably created around 1900 as an American college classroom experiment for demonstrating the processes involved in learning associations (Dearborn, 1910; Starch, 1911). In his 1911 book of classroom teaching exercises, Starch reported that the test had been "originally devised several years ago by [Joseph] Jastrow" of the University of Wisconsin. The test's rationale was explained in a 1910 compendium of mental tests authored by Guy Whipple, who described the substitution test as a measure of "the rapidity with which associations are formed by repetition" (p. 350). Whipple proposed that "in theory, an S whose nervous system is plastic and retentive, who, in other words, is a quick learner, will make the most rapid progress" (p. 350). One version of the substitution test consisted of "pages headed with an imitation typewriter keyboard" in which "each letter of the alphabet is enclosed with a number in a circle" (Starch, 1911, pp. 47–48). Below the keyboard diagram was a text passage to be "transcribed" by the subject through "substituting the numbers for the letters" in blank spaces placed alongside the text (Starch, 1911, p. 48).

The Digit Symbol subtest of the Wechsler intelligence scales probably descended from another University of Wisconsin substitution test having a smaller key in which the nine digits were each paired with a different typographical symbol (e.g., equal sign). Revisions of this nine-pair substitution test were included in early children's batteries (Healy & Fernald, 1911; Pyle, 1913) as measures of association learning. The version of the substitution test shown in Figure 1, developed by Cattell's students Robert Woodworth and F. L. Wells (Woodworth & Wells, 1911), is probably the source of the Coding subtest of the Wechsler children's scales.

Fig. 1. Substitution Test. Introduced in a 1911 monograph by Robert Woodworth and Frederic Lyman Wells, the test was designed to measure the ability to learn new associations. The test was derived from an earlier substitution test with 20 letter pairs (Kirkpatrick, 1909), by pairing digits with geometric forms in order to prevent "easy mnemonics." This Substitution Test page also served as a test of rapid naming of geometric forms. The five geometric forms were selected because they do not change appearance when rotated, which allowed the test page to be presented in different orientations. Note. From "Association tests. Being a part of the Report of the Committee of the American Psychological Association on the Standardization of Procedure in Experimental Tests," by R.S. Woodworth and F.L. Wells, 1911, Psychological Monographs, 13 (Whole no. 57), pp. 52–53. In the public domain.

BINET AND SIMON'S MEASURING SCALE OF INTELLIGENCE

In 1905, the psychologist Alfred Binet and psychiatrist Theodore Simon published a "measuring scale of intelligence" that they had developed for use with Paris school children (Binet & Simon, 1905). The scale consisted of a series of 30 brief cognitive tests, arranged in order of difficulty, which could be administered in about 40 min. The selection of tests included measures of language skills (e.g., naming, following commands, semantic judgments), memory, reasoning, digit span, and psychophysical judgments. Many of the tests had been discussed by Binet and Victor Henri in their 1895 review of "differential psychology." The scale included tests developed by others (e.g., digit span) as well as ones developed by Binet and his collaborators. Validity of the scale was demonstrated, as in Jacobs' studies, by the increase of scores with age and by the scale's ability to differentiate normal and cognitively impaired children (Peterson, 1925).

In 1908, Binet and Simon made a fundamental revision of their scale by grouping the tests into age levels. In this procedure, later known as a year scale, each test was assigned to the age level at which most children performed it successfully. For example, digit span items of different lengths were administered at various age levels, with longer spans assigned to older ages. Administration to an individual child began with the tests at the child's age level and proceeded to higher or lower age levels, until finding the age level at which most tests were failed. The child's intelligence was quantified in terms of the intellectual level (later, mental age), defined as the highest age level at which the child completed most tests successfully. Although the 1908 revision contained both verbal and non-verbal tests, the consensus of psychometric opinion was that the scale overemphasized verbal skills such as vocabulary and repetition.

The Binet–Simon scales have served as both a model of form and source of content for later intelligence tests. The basic procedure of combining different mental tests to yield a composite score is the foundation of intelligence scales. Binet emphasized that since "a particular test isolated from the rest is of little value" or even "signifies nothing," the important information contributed by intelligence scales was the subject's average performance over various tests. He facetiously wrote that "one might almost say, 'It matters very little what the tests are so long as they are numerous.'" (1911/1916, p. 329).
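
The 1908 year-scale rule described above, in which the intellectual level is the highest age level at which the child completed most tests successfully, can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the function name `intellectual_level`, the dictionary layout, and the reading of "most tests" as a strict majority are hypothetical choices made here, and the child's record is invented. It should not be read as Binet and Simon's published scoring procedure.

```python
from typing import Dict, List, Optional


def intellectual_level(results: Dict[int, List[bool]]) -> Optional[int]:
    """Return the highest age level at which most tests were passed.

    `results` maps an age level (in years) to the pass/fail outcomes of
    the tests given at that level. Treating "most" as a strict majority
    is an assumption made for this sketch.
    """
    passed_levels = [
        age
        for age, outcomes in results.items()
        if outcomes and sum(outcomes) > len(outcomes) / 2
    ]
    # No level passed: the sketch returns None rather than guessing.
    return max(passed_levels) if passed_levels else None


# Hypothetical record for one child (invented for illustration only).
child = {
    6: [True, True, True, True, True],
    7: [True, True, True, False, True],
    8: [True, True, False, True, False],
    9: [False, False, True, False, False],
}
print(intellectual_level(child))
```

For this invented record the sketch prints 8: three of the five tests at the 8-year level were passed, but only one of five at the 9-year level.
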

In addition to establishing the basic form that intelligence tests would take, the Binet–Simon scales contributed items and tests that have been recycled in later intelligence scales. Table 1 shows tests and items of the 1905 and 1908 Binet–Simon scales that have been duplicated in the Wechsler intelligence scales. For example, the Binet–Simon Unfinished pictures item, a drawing of a face missing a nose, is duplicated as an item of the Picture Completion subtest of the Wechsler–Bellevue scale and later Wechsler intelligence scales.

In only a few years, the Binet–Simon scale became widely used in Europe and North America. Goddard learned of the scale while traveling in Europe and arranged for its translation into English. From his position as research director of the Training School at Vineland, a residential center in New Jersey for children with cognitive disorders, Goddard led a drive to popularize intelligence testing that rapidly led to the use of the Binet–Simon scale in American institutions (Zenderland, 1998).

Shortly before the U.S.A. entered World War I, two major revisions of the Binet–Simon scale were produced by American psychologists. The first revision, by Robert Yerkes and James Bridges of the Boston Psychopathic Hospital, restructured the Binet–Simon scale from a year scale into a point scale termed the Yerkes–Bridges Point Scale Examination, by grouping items of similar content into a smaller number of subtests (Yerkes, Bridges, & Hardwick, 1915). For example, the Memory Span for Digits subtest of the Yerkes–Bridges scale was formed by consolidating the digit span items (Repetition of two figures, five figures, etc.) that were spread over various Binet–Simon age levels. Point-scale tests were administered beginning with the easiest item and proceeding in order of difficulty until completing the test. The correspondence between the Binet–Simon tests and the Yerkes–Bridges tests is illustrated in Table 1, where it can be seen that the titles of the Yerkes–Bridges tests were formed by abbreviating the Binet–Simon titles into descriptive phrases. The point-scale method introduced by the Yerkes–Bridges scale was the model for the tests that evolved into the Wechsler scales (Thorndike & Lohman, 1990).

The second revision, by Lewis Terman of Stanford University, extended the age range into adulthood and, most important, replaced the mental age with the intelligence quotient (IQ) as the preferred composite score. In addition, Terman supplemented the Binet–Simon tests with newer tests such as arithmetic reasoning items developed by Bonser (1910) and a form board developed by Healy and Fernald (1911). Terman's revision, the Stanford–Binet Intelligence Scale (Terman, 1916), quickly became the dominant measure in American intelligence testing.

Table 1. Binet–Simon Tests Duplicated in the Wechsler–Bellevue Scale.
Columns: (a) Test in Binet–Simon scales (1905, 1908); (b) Description/example; (c) Related test in Yerkes–Bridges Point Scale (1915); (d) Related test(s) in Stanford–Binet scale (1916).
1. (a) Unfinished pictures (1908). (b) Identify missing parts of a drawing of a face or person. (c) Perception and comparison of pictures (missing parts). (d) Finding omissions in pictures.
2. (a) Reply to an abstract question (1905); Comprehension questions (1908). (b) "When one breaks something belonging to another what must one do?" (p. 224). (c) Comprehension of questions. (d) Comprehension.
3. (a) Repetition of three figures; Immediate repetition of figures (1905). (b) Repeat spoken list of digits in same order. (c) Memory span for digits. (d) Repeating three digits, etc.; Repeating three digits reversed, etc.
4. (a) Definitions of abstract terms (1905). (b) Define abstract nouns. Score is based on abstractness of definition. (c) Definition of abstract words. (d) Defining abstract words; Vocabulary.
5. (a) Verbal definition of known objects (1905); Definition of familiar objects (1908). (b) Define concrete nouns, in terms of use or superior to use. (c) Definitions of concrete terms. (d) Giving definitions in terms of use; Giving definitions superior to use; Vocabulary.
6. (a) Resemblances of several known objects given from memory (1905). (b) Example: "In what way are a fly, an ant, a butterfly, and a flea alike?" (p. 61). (c) Comparison of three pairs of objects. (d) Giving similarities.
7. (a) Making change from 20 sous (1908). (b) Make change when 4 sous are taken from 20. (c) none. (d) Making change; Arithmetical reasoning.
Note. Quotations are from the translation of the Binet–Simon tests by Elizabeth S. Kite (Binet, 1905/1916, 1908/1916).

PERFORMANCE TESTS OF INTELLIGENCE

The need for non-verbal measures of intelligence was felt by clinicians examining subjects with limited English-language skills, such as hearing-impaired and foreign-born persons. A 1911 monograph by the psychiatrist William Healy and psychologist Grace Fernald, of the Chicago Juvenile Psychopathic Institute, presented a group of "practical" tests designed for use with juvenile delinquents. In a criticism of the Binet–Simon scale to be repeated by many other authors, Healy and Fernald complained that the scale "helps very little where the language factor is a barrier, either on account of foreign parentage or insufficient schooling, and with uneducated deaf and dumb children" (1911, p. 5). As a substitute for the Binet–Simon scale, Healy and Fernald claimed that their tests had been constructed in order "to ascertain the mental ability quite apart from the individual's experience in formal training in our language, or indeed in any language" (1911, p. 4). For example, Healy's (1914, 1921) Pictorial Completion tests consisted of picture boards of childhood scenes with empty spaces designed to be filled by different picture elements. The subject's task was to select the element that completed the picture in the most appropriate way (e.g., placing a ball in the hand of a boy throwing something). Healy and Fernald explained that, because their tests were designed "with the idea of invoking always as much interest as possible in our tests, we have ever had in mind the development of them in forms resembling games and puzzles, but really involving points much more open than puzzles to solution by use of simple reasoning ability" (1911, p. 7). This method of measuring intelligence using nonverbal tasks came to be termed performance testing.

A seminal event in the history of performance testing was the assessment program at Ellis Island, in New York harbor, by U.S. Public Health Service physicians who were responsible for screening arriving immigrants for mental and physical disorders (Knox, 1914b; Yew, 1980). The task of mental screening was complicated by the fact that many immigrants spoke no English and had little or no formal education. The Ellis Island physicians rejected the Binet–Simon scale as inappropriate for testing immigrants, on the rationale that the scale was an "arbitrary and artificial scale that was derived from experiments performed on French school children" (Knox, 1914a) and that it would be "manifestly absurd to use educational tests in the case of uneducated persons" (Knox, 1913). To substitute for the Binet–Simon scale, the Ellis Island physicians assembled various form board and puzzle assembly tasks into "a graduated scale with performance tests for determining the intelligence of aliens, and especially illiterates" (Knox, 1915). Two surviving tests from the Ellis Island testing program are the feature profile (Fig. 2) and the cube imitation test (Knox, 1914b). In the words of Howard Knox, an Ellis Island physician who may have coined the term performance tests, the tests required abilities such as "a little native ingenuity, constructive imagination and sense of form and some judgment" (1915, p. 53), but not any formal education. Collection of child and adult norms from Italian-, Spanish-, and German-speaking immigrants began at Ellis Island during 1914 but was discontinued when World War I caused immigration to decline (Mullan, 1917). In an interview conducted 60 years later, the former Ellis Island physician Grover Kempf recalled:

The mental examination of immigrants was usually done in a quiet room – it had to be done with an interpreter present, a man or woman who was well versed in the language of the immigrant. Usually two officers sat in during the examination and it was conducted in a question and answer method, and also mostly with the board tests, putting blocks back together and the Knox cube test. And I had one test there of a face, called the Kempf test, which required the immigrant to place the blocks in order to form a human face. However, none of these tests were standardized. (U.S. Public Health Service, 1977)


Fig. 2. Feature Profile Test. The test, on display at the Ellis Island Immigration Museum, was designed for mental screening of arriving immigrants at Ellis Island. It was developed by Grover Kempf and Howard Knox, Assistant Surgeons of the U.S. Public Health Service. Knox claimed that the test was "eminently fair because everyone has seen a human head" (1914b, p. 744). Note. Reproduced courtesy of the Ellis Island Immigration Museum.

Following the example of the Ellis Island testing program, Rudolf Pintner and Donald Paterson of Ohio State University developed a performance test battery, the Pintner–Paterson Performance Scale, for assessment of hearing-impaired school children (Pintner & Paterson, 1917). The rationale of the Pintner–Paterson scale was that the intelligence of these children, like that of many immigrants, would be underestimated by the verbally weighted Binet–Simon scale. Table 2 shows the tests comprising the Pintner–Paterson scale. Of the 15 tests in the scale, 4 were newly developed by Pintner and Paterson, and the majority were existing tests borrowed from the Ellis Island tests, the Healy–Fernald tests, and other sources. Although the Pintner–Paterson scale is no longer sold as a battery, some of its component tests are still in use today. For example, the Pintner Manikin Test is the beginning item of the Object Assembly subtest of the Wechsler intelligence scales.

Table 2. Performance Scales that Preceded the Wechsler–Bellevue Scale.
Pintner–Paterson performance test series (Pintner & Paterson, 1917): Mare & Foal Picture Board, Seguin Form Board, Five Figure Board, Two Figure Board, Casuist Form Board, Triangle Test, Diagonal Test, Healy Puzzle "A", Manikin Test, Feature Profile Test, Ship Test, Picture Completion Test, Substitution Test, Adaptation Board, Knox Cube Test.
Army Performance Scale (Yerkes, 1921): Ship Test, Manikin & Feature Profile, Knox Cube Imitation, Cube Construction, Form Board, Memory for Designs, Digit Symbol Test, Porteus Maze Test, Picture Arrangement, Picture Completion.
Point Scale of Performance Tests (Arthur, 1930): Knox Cube, Seguin Form Board, Two-Figure Form Board, Casuist Form Board, Manikin; Feature-Profile, Mare & Foal, Healy Picture Completion I, Porteus Maze, Kohs Block Design.
Performance Ability Scale (Cornell & Coxe, 1934): Manikin-Profile, Block-Designs, Picture-Arrangement, Memory-for-Designs, Digit-Symbol, Cube-Construction, Picture-Completion.

U.S. ARMY INTELLIGENCE TESTING DURING WORLD WAR I

Probably the most important event in the spread of intelligence testing was the testing program carried out in the U.S. Army during the first world war. Yerkes, who was president of the American Psychological Association at the time when the United States declared war, formed a committee of testing experts to design tests to determine if Army recruits were fit for military service. The committee, shown in Figure 3, created a preliminary version of a group intelligence test in its first 2-week session. Trials of the new group test were conducted during July 1917 and the data were analyzed by a statistical unit at Columbia University headed by Edward Thorndike and assisted by Arthur Otis and Louis Thurstone. The history of wartime intelligence testing in the U.S. Army was later recorded in an 890-page monograph edited by Yerkes and written partly by Terman and E. G. Boring (Yerkes, 1921; see also Yoakum & Yerkes, 1920). The monograph's technical errors and conclusions about racial and ethnic differences have been widely criticized (e.g., Gould, 1981).

The main Army intelligence tests were Group Examinations Alpha and Beta, designed to be administered to groups of recruits by trained psychological examiners. The transformation from the pre-war individual intelligence scales to the group-administered examinations was made possible by the method of multiple-choice, which was credited to Otis (Samelson, 1987; Yerkes, 1921, pp. 299–300). Group Examination Alpha was designed for the assessment of literate English speakers and Group Examination Beta for the minority of recruits who were illiterate or nonproficient in English. The complementary missions of the Alpha and Beta examinations in the
Army testing program paralleled the roles of the Binet–Simon scale and performance tests in individual intelligence testing. Both the Alpha and Beta examinations were point scales consisting of a series of subtests that could be administered in less than 1 hr. Army records estimate that, from 1917 to 1919, the Alpha and Beta examinations were administered to 1,726,966 recruits (Yerkes, 1921, p. 103).

The Army group examinations were a major source of subtests and items used in the Wechsler–Bellevue scale, as explicitly stated by Wechsler (1939) and noted by many authors (e.g., Frank, 1983). Table 3 shows the three Alpha subtests that correspond to Wechsler–Bellevue verbal subtests. The Wechsler–Bellevue Arithmetic subtest is virtually an orally administered short form of the Arithmetical Problems test of the Alpha examination. Of the 10 Wechsler–Bellevue Arithmetic items, 7 are from the Arithmetical Problems test, which contained items adapted from a test for middle-school students (Bonser, 1910). The Wechsler–Bellevue Comprehension subtest is largely derived from the Alpha Practical Judgment test, which itself borrowed items from the 1905 Binet–Simon scale, Bonser's (1910) Selective Judgment test, and pre-war tests developed at Stanford University by Terman and his students. The Alpha Information test, which appears to be the model for the Wechsler–Bellevue Information subtest, also appears to be an adaptation of earlier tests (Healy & Fernald, 1911; von Mayrhauser, 1987).

Fig. 3. Committee on the Psychological Examination of Recruits, during one of their meetings at the Vineland Training School between May and July 1917. Front: Edgar Doll, Henry Goddard, and Thomas Haines. Rear: F.L. Wells, Guy Whipple, Robert Yerkes (chairman), Walter Bingham (secretary), and Lewis Terman. Initial versions of the Army group and individual tests were developed during meetings of this committee in summer 1917. Doll, who was Goddard's assistant at the Training School, was invited to join the committee at later sessions when the individual tests were developed. Note. Reproduced with permission of the Archives of the History of American Psychology, Henry Goddard Papers.

Table 3. Alpha Examination Tests Related to Wechsler–Bellevue Subtests.
Test in Alpha examination: Arithmetical Problems; Information; Practical Judgment.
Committee member(s) responsible for test: Bingham; Wells; Goddard; Haines.
Example of item related to Wechsler–Bellevue scale:
Arithmetical Problems: "If it takes 6 men 3 days to dig a 180-ft drain, how many men are needed to dig it in half a day?" [Yerkes, 1921, p. 221]
Information: "Darwin was most famous in: literature / science / war / politics" (choose best answer) [Yerkes, 1921, p. 234]
Practical Judgment: "If you are lost in the forest in the daytime, what is the thing to do? hurry to the nearest house you know of / look for something to eat / use the sun or a compass as a guide" (choose best answer) [Yerkes, 1921, p. 229]

Three Beta tests are sources of Wechsler–Bellevue performance subtests. The Beta Digit Symbol test (Yerkes, 1921, p. 254) was simply duplicated in the Wechsler–Bellevue subtest with the same title. Although the Beta Digit Symbol test was credited to Otis, the test appears to be only a slight modification of the University of Wisconsin substitution test with nine digit–symbol pairs. The Beta Picture Completion test (Yerkes, 1921, pp. 238–239, 256) contributed items to the Picture Completion subtest of the Wechsler–Bellevue scale. Figure 4 presents an item from the Beta Picture Completion test that has been duplicated in the Wechsler intelligence scales. The Beta Picture Completion test was modeled on Pintner and Hoops' (1918) "drawing completion" technique for group administration of the Binet–Simon Unfinished pictures test (Table 1). The Beta Picture Arrangement test (Yerkes, 1921, pp. 242–243), included in a preliminary version of the Beta examination but dropped before the final version, supplied items to the Wechsler–Bellevue Picture Arrangement subtest (Wechsler, 1939, p. 90). The Beta Picture Arrangement test was modeled upon a children's test from Belgium (Decroly, 1914).

Fig. 4. Item in the Picture Completion test of Group Examination Beta. This Beta test, credited to Truman Kelley, was designed to assess intelligence of U.S. Army recruits who were illiterate or not proficient in English. Group administration was accomplished by asking subjects to draw the missing parts directly onto pictures printed on the response form. Note. From "Psychological examining in the United States Army," edited by R.M. Yerkes, 1921, Memoirs of the National Academy of Sciences, 15, p. 256. In the public domain.

Recruits who failed the group examinations were assessed with individual intelligence tests that included the Stanford–Binet, the Yerkes–Bridges point scale, and the new Army Performance Scale. It was estimated that 83,500 individual examinations were performed in the Army testing program (Fig. 5). The Army Individual Examination, an individual intelligence test developed for the Army testing program by the same committee, was to have a particularly strong influence on the Wechsler scales.

Fig. 5. Individual administration of the Pintner Manikin Test to a U.S. Army recruit during World War I. Individual testing was administered to recruits who failed the Alpha and Beta examinations. The test originated with the Pintner–Paterson scale (Pintner & Paterson, 1917) and was included in the Army Performance Scale. After the war, the test was used in the Cornell–Coxe Performance Ability Scale (Cornell & Coxe, 1934) and later in the Object Assembly subtest of the Wechsler–Bellevue scale. The original test is still commercially available as part of the Merrill–Palmer scale (Stutsman, 1931). Note. From "Psychological examining in the United States Army," edited by R.M. Yerkes, 1921, Memoirs of the National Academy of Sciences, 15, p. 91. In the public domain.

Army Individual Examination

The Army Individual Examination was a point scale designed for English-speaking recruits that consisted of 22 mostly verbal tests taken from the Yerkes–Bridges point scale or adapted from the Alpha examination. It is a historical paradox that the Individual Examination, which was to strongly influence intelligence testing in the long run, was discontinued during its standardization stage and
never came to serve as an individual intelligence test. The Individual Examination appears to have been a major source for the Wechsler intelligence and memory scales, given that some Wechsler subtests are closely related to the tests in the Individual Examination. Table 4 presents the three Individual Examination tests that correspond to subtests of the Wechsler–Bellevue scale. Half of the items in the Wechsler–Bellevue Comprehension subtest were taken from the Comprehension test of the Individual Examination.

Table 4. Army Individual Examination Tests Related to Wechsler–Bellevue Subtests.
Test in Individual Examination: Arithmetical reasoning; Comprehension; Likenesses and differences.
Committee member(s) responsible for test: Wells; Goddard; Terman; Wells; Terman.
Example of item related to Wechsler–Bellevue scale:
Arithmetical reasoning: "If a man buys 6 cents worth of postage stamps at the post office and pays a dime, how much change does he get back?" (Yerkes, 1921, p. 144)
Comprehension: "Why are people who are born deaf usually dumb?" (Yerkes, 1921, p. 143)
Likenesses and differences: "In what way are the eye and the ear alike?" (Yerkes, 1921, p. 139)

Army Performance Scale

Like the Beta examination, the Army Performance Scale was designed for recruits who performed poorly on the Army group examinations or who were not proficient in English. The composition of the Army Performance Scale, shown in Table 2, corresponded closely to the Pintner–Paterson scale. The Wechsler–Bellevue Picture Arrangement, Object Assembly, and Digit Symbol subtests correspond to the Army Performance Scale tests of the same titles. The Army Performance Scale can therefore be regarded as a transitional scale that refined available performance tests into a form leading to the Wechsler–Bellevue performance scale.

Wechsler as Wartime Psychological Examiner

In 1917 Wechsler, having completed his master's thesis with Woodworth at Columbia University, began work at an Army camp on Long Island where he scored Alpha examination protocols in a unit supervised by Boring. Wechsler then enlisted in the U.S. Army and during summer 1918 he attended the School for Military Psychology in Georgia for training as a psychological examiner. Upon graduating with the rank of corporal, he was assigned to Camp Logan near Houston, Texas, where he conducted individual psychological examinations (Edwards, 1974). Army records estimate that 319 individual examinations were performed at this base (Yerkes, 1921, p. 80). Wechsler later recalled how his wartime experiences inspired the intelligence scale he subsequently developed:

My duties there called mostly for the administration of individual psychological tests to soldiers who had failed both the Army Alpha and Army Beta, and who were being evaluated for possible discharge, or special labor assignments.
My usual examination of subjects included, in addition to a short interview, administration of the StanfordBinet or Yerkes

Point Scale, and nearly always one or more of the available performance tests. It then occurred to me that an intelligence scale, combining verbal and nonverbal tests, would be a useful addition to the psychometrist's armamentarium (Wechsler, 1979, p. 2). Wechsler often remarked that he became convinced of the shortcomings of existing intelligence tests through wartime experiences with recruits who had functioned normally as civilians before induction, but who had failed the Army group examinations and had obtained low mentalage scores on the StanfordBinet scale. He attributed these misdiagnoses to the emphasis of the StanfordBinet scale on verbal skills acquired through formal education. He recalled that the rst such subject he encountered was ``a native, white Oklahoman'' who obtained a Stanford Binet mental age of 8 years, but who ``before entering the Army F F F had gotten along very well, was supporting a family, had been working as a skilled oil-driller for several years and, at time of draft, was earning from $60 to $75 per week'' (1935, p. 256). Wechsler claimed that he had often encountered the type of subject who, like the Oklahoman, systematically rates as a mental defective on mental tests, but, who can in no way be judged as such, when diagnosed on the basis of concrete social standards, i.e., in terms of capacity to adjust to the normal demands of his social and economic environment. (1935, p. 256) The next few years of Wechsler's life were a kind of brief odyssey that widened his clinical and scientic perspectives. While still in uniform he was transferred from the Texas army camp to the University of London, where he studied with Charles Spearman and Karl Pearson. Wechsler then won a 2-year fellowship to study in France, where he worked with Henri Pieron and other Paris psychologists researching the psychophysiology of emotions. Upon returning to the U.S.A. in 1922, he spent a summer internship with Wells at the Boston Psychopathic Hospital. 
During this internship Wechsler attended teaching conferences conducted by Healy and the psychologist Augusta Bronner (Edwards, 1974). Wechsler then returned to New York where he completed his dissertation at Columbia University on physiological measures of emotion (Wechsler, 1925) and worked as a psychologist at a child guidance bureau.

Individual Mental Testing in the 1920s and 1930s

For intelligence testing, the postwar decades were a period of aggressive growth. The Army testing program had promoted the credibility of intelligence testing and trained many psychologists in test construction and interpretation. The Psychological Corporation was founded in 1921 to offer psychological testing services to industry, using the tests administered to the Columbia students as well as those developed in the Army (Sokal, 1981). The publication of new group intelligence tests dramatically expanded the application of intelligence testing, especially in schools.

Development of performance tests increased during the 1920s and 1930s, probably driven by concerns about the adequacy of the Stanford-Binet scale. One of the most successful of the new performance tests was the Block Design test, developed by Samuel Kohs as a Stanford University dissertation with Terman (Kohs, 1923). The test's origin is unusual in that it was adapted from Color Cubes, a commercially available game in which children constructed decorative mosaic designs from a set of 16 painted wooden cubes. The sides of each cube were painted red, white, blue, yellow, red-white and blue-yellow. Kohs' instructions advised that the blocks "may be secured at any of the large department stores and at the various distributing centers of Milton Bradley's" (1923, p. 64). Instructions for a later revision of the Block Design test recommended giving subjects "extended practice" on the initial designs in order "to offset in some measure the advantage possessed by those patients who have had these sets of blocks as toys" (Arthur, 1930, p. 30).
According to Kohs, the test required "first, the breaking up of each design presented into logical units, and second, a reasoned manipulation of the blocks to reconstruct the original design from these separate parts" (1923, p. 271). Kohs proposed that the test's involvement of both analytic and synthetic thinking would implement the "combination-method" that had been proposed by Ebbinghaus as a fundamental feature of intelligence tests. Kohs emphasized the test's value as a measure of intelligence, as demonstrated by correlations with the Stanford-Binet scale and other indicators of general intelligence, but did not mention such concepts as visual-spatial perception.

Some postwar performance scales pursued the Pintner-Paterson strategy of constructing intelligence scales using only nonverbal tests. The Leiter International Performance Scale was developed by Russell Leiter and later adapted by Stanley Porteus of the University of Hawaii for his research on racial differences (Leiter, 1936). Table 2 shows the tests comprising the Cornell-Coxe Performance Ability Scale (Cornell & Coxe, 1934), which includes the immediate precursors of most Wechsler-Bellevue performance subtests and of the Wechsler Memory Scale (WMS) Visual Reproduction subtest. The significance of the overlap between the Cornell-Coxe and the Wechsler-Bellevue scales will be discussed below. Grace Arthur's Point Scale of Performance Tests (Arthur, 1930), also shown in Table 2, consisted of seven Pintner-Paterson tests plus adaptations of the Porteus maze test and the Kohs Block Design test.

The impact of these tests was shown by a 1933-1934 survey of testing practices in U.S. psychological clinics (Report of Committee of Clinical Section of American Psychological Association, 1935, pp. 23-27). The survey found that while the most popular psychological test was the Stanford-Binet scale, 5 of the 9 most widely used tests were either performance tests (Healy Pictorial Completion II, Porteus Mazes) or scales composed of such tests (Arthur Point Scale, Merrill-Palmer, Pintner-Paterson).
The popularity of these performance tests suggests that many psychologists followed a practice of supplementing the Stanford-Binet scale with one or more performance measures, out of concern with using the Stanford-Binet scale as the sole measure of intelligence.

The Psychological Corporation published civilian versions of the Alpha and Beta examinations designed for educational or industrial testing (Kellogg & Morton, 1935; Wells, 1932). The Revised Beta Examination included a substitution test that used the stimulus key from the Beta Digit Symbol subtest but required the subject to translate symbols into numbers. The Revised Beta Examination also included a Picture Completion test that shared some items with the Wechsler-Bellevue subtest.

WECHSLER AND THE BELLEVUE INTELLIGENCE EXAMINATION

In 1932 Wechsler became the chief psychologist at Bellevue Psychiatric Hospital in New York. Of Wechsler's publications during the 1920s and early 1930s, few concerned intelligence testing and only one reported primary research on the topic. His most relevant pre-1939 publication is a 1932 note in which he suggested that the Army Alpha examination had the potential advantage, not available with other intelligence tests, "of analyzing the subject's performance on the individual tests which comprise the examination, in order to discover" if the subject had "any special abilities or disabilities" (Wechsler, 1932, p. 254). In contrast, profile interpretation of the Stanford-Binet scale depended on procedures that were complicated and unstandardized (Wells, 1927).

Wechsler later recalled that, in his position at Bellevue, the "immediate experience of working with, and supervising the testing of, the diverse patient population at the mental hygiene clinic, and Psychiatric Wards of Bellevue Hospital" reinforced his "growing conviction of the need for an alternate to the Binet tests, and in particular, of an intelligence scale more suitable for use with adults" (1979, p. 3). The chief rationale that he cited for the new adult intelligence test was statistical in nature. Wechsler argued that contemporary intelligence tests were applicable only to children and adolescents because of the insurmountable statistical artifacts caused by applying the mental age and ratio IQ to adults.
Repeating the views of Thurstone and others, Wechsler advocated replacing the ratio IQ with the deviation score, a method that calculated IQ by converting the sum of subtest scores into a standard score, using the mean and standard deviation at each age level. Basically this method changed the meaning of the IQ from a mental age-chronological age ratio score into a standard score having the same distribution at each age level (Thorndike & Lohman, 1990).

Wechsler's second major technical innovation was to incorporate verbal and performance tests into the same scale, enabling the Wechsler-Bellevue scale to exploit both of the major contemporary approaches to intelligence testing. The 'verbal' and 'performance' labels of the two Wechsler-Bellevue scales were already in common use. One rationale for the verbal-performance combination was to minimize the over-diagnosing of feeble-mindedness that was, he believed, caused by intelligence tests that were too verbal in content. He viewed verbal and performance tests as equally valid measures of intelligence and criticized the labeling of performance tests as measures of "special abilities" while reserving for the Binet-Simon scale the property of measuring "general intelligence." This view, he argued, was "incorrect because it not only assumes that there are different kinds of general intelligence, but because it further implies that the Performance tests are relatively unimportant as measures of general intelligence" (Wechsler, 1939, p. 138).

A further rationale cited by Wechsler in favor of the verbal-performance combination was that performance tests might be more sensitive to "temperamental and personality factors" such as "the subject's interest in doing the task set, his persistence in attacking them and his zest and desire to succeed" (Wechsler, 1939, p. 10). He argued that by including tests to measure such "capacities that cannot be defined as either purely cognitive or intellective," the relationship of test results with everyday functioning would be strengthened. Echoing Terman, he stated that his scale was "constructed on the hypothesis that an individual manifests intelligence by his ability to do things, as well as by the way he can talk about them" (Wechsler, 1939, p. 138).
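The contrast between the ratio IQ and the deviation score discussed in this section can be stated compactly. The following is only a sketch, in notation that does not appear in the sources cited here: MA and CA denote mental and chronological age, X is the sum of subtest scores, and \bar{X}_a and s_a are the mean and standard deviation of X in the normative sample at age level a. The mean M and standard deviation S chosen for the IQ scale are left symbolic, as assumptions, because their values are not given in this account.

\[
\mathrm{IQ}_{\mathrm{ratio}} = 100 \times \frac{MA}{CA},
\qquad
\mathrm{IQ}_{\mathrm{deviation}} = M + S \cdot \frac{X - \bar{X}_a}{s_a}
\]

Under these assumptions the second expression is standardized separately within each age level, so a given IQ corresponds to the same relative standing at every age, whereas the ratio form depends on a mental age that stops increasing in adulthood while chronological age continues to rise.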
These arguments suggest that Wechsler intended the performance subtests to serve not only as measures of intelligence but also as measures of personality traits similar to what would now be termed executive functions.

Constructing the Wechsler-Bellevue Scale

In retracing Wechsler's rationale for selecting the Wechsler-Bellevue subtests, a few clues may be noted. First, it is obvious that all of the Wechsler-Bellevue subtests except for Block Design were derived from the Army tests. Even the titles of several Army tests (i.e., Information, Comprehension, Vocabulary, Picture Completion, Picture Arrangement, Digit Symbol) were duplicated as Wechsler-Bellevue subtest titles. This reliance on the Army tests is understandable given Wechsler's wartime experience, the endorsement of these tests by contemporary testing experts, and their continued use by Wechsler, Wells, and other psychologists through the 1930s. Wechsler's construction of his new intelligence scale from old tests, which contrasts sharply with today's custom of constructing new scales from new tests, was standard practice among contemporary test developers (Frank, 1983). Like the authors of the performance scales in Table 2, Wechsler explicitly identified the sources of all of his subtests. The reason why he had "drawn so heavily on the experience of others," he explained, was that his "aim was not to produce a set of brand new tests but to select, from whatever source available, such a combination of them as would best meet the requirements of an effective adult scale" (Wechsler, 1939, p. 78).

A second clue is that Wechsler appears to have assigned subtests to the verbal and performance scales in accordance with existing intelligence scales, rather than on an empirical basis. The Wechsler-Bellevue verbal scale was constructed mostly from Army Alpha tests, and the Wechsler-Bellevue performance scale from contemporary performance tests. While emphasizing that most of the subtests correlated strongly with the total score, he drew attention to the fact that some subtests did not.
For example, he commented that the Object Assembly subtest, despite having serious statistical weaknesses, "was included in our battery only after much hesitation" because the subtest provided clinical information about "one's mode of perception, the degree to which one relies on trial and error methods, and the manner in which one reacts to mistakes" (1939, p. 100). Because the verbal and performance scales were based on an a priori classification that was inconsistent with correlational evidence, it is understandable that later factor analytic studies would suggest different composites.

Third, the composition of the Wechsler-Bellevue performance scale corresponds closely with the performance scale developed by Cornell and Coxe (1934). As seen in Table 2, 4 of the 7 Cornell-Coxe tests also appear in the Wechsler-Bellevue performance scale. It should be noted that the Cornell-Coxe Picture Completion test is one of the Pictorial Completion tests produced by Healy (1914, 1921), and not a missing-detail type of test like the Picture Completion tests in the Beta examination and Wechsler scales. The Memory-for-Designs test, 1 of 2 Cornell-Coxe tests not represented in the Wechsler-Bellevue scale, is closely related to the WMS Visual Reproduction subtest.

Fourth, the initial group of Wechsler-Bellevue subtests differed from the now-traditional Wechsler subtest profile. At the start of standardization, there were fewer verbal subtests (Information, Digit Span, Arithmetic, and Comprehension) than performance subtests. The performance scale consisted of the five traditional subtests plus a sixth subtest, Cube Analysis, which was based on the Beta examination subtest of the same title. Because the standardization data revealed a sex difference on the Cube Analysis subtest in favor of males, the subtest was omitted from the published scale. Wechsler explained that the Similarities and Vocabulary subtests had been omitted from the initial subtest selection "because of the mistaken belief" that they "would be unduly influenced by the language factor" (1939, p. 87) and were "seemingly unfair to illiterates and persons with foreign language handicap" (p. 100). These were the very shortcomings for which he criticized the Stanford-Binet scale and which he was constructing his own scale to overcome. The time needed to develop items for these subtests may explain why they were added so late in standardization.
The Vocabulary subtest, the last subtest added to the scale, acquired a smaller normative sample of about 400 subjects and was included in the published scale as an 'alternate' verbal subtest.

Finally, it may be significant that the Wechsler-Bellevue scale, unlike the Stanford-Binet scale and the performance scales in Table 2, does not include any memory tests. It is possible that Wechsler's selection of tests for his intelligence scale was influenced by his development of the WMS during the same period (Wechsler, 1945). Indeed, Wechsler's selection of subtests for the WMS shows some of the same influences as the Wechsler-Bellevue scale.

It appears that Wechsler began work on the new intelligence scale shortly after he came to Bellevue. In his 1939 book, The Measurement of Adult Intelligence, he stated that work on the scale had taken a little more than 7 years, of which the initial 2 years were spent in trying out tests with adult subjects. Collection of the standardization sample was funded by a grant from the Works Progress Administration (WPA), a federal agency that funded employment in public works projects. The entire Wechsler-Bellevue standardization sample, ranging from 7 to 59 years of age, was collected in the New York area. The adult sample was limited to whites who understood and wrote English. Wechsler noted facetiously that a "special source worth mentioning was the Coney Island beach, where one of our resourceful examiners went daily throughout a summer" (Wechsler, 1939, p. 114).

The adult norms were refined by a kind of stratification technique in which subjects were selected to match the proportions of the adult population employed in various occupations, according to the U.S. Census. Wechsler's use of occupational stratification was similar to the 1937 Stanford-Binet revision, in which children were selected so that their parents' occupations would match census data about the proportions of adults in various occupational categories (Terman & Merrill, 1937). The Wechsler-Bellevue children's sample was tested at schools labeled 'average' by the Board of Education of the City of New York. The children's norms were likewise stratified according to the proportions of children at each age who were enrolled in different school grades of the New York public schools.
The normative data for 7- to 9-year-olds were discarded because of floor effects, leaving a normative sample ranging from 10 to 59 years old.

Wechsler had already brought the scale to completion when he contracted with the Psychological Corporation to produce the test materials (Edwards, 1974). The examiner's manual was published as an appendix to his 1939 book. Thus, it appears that Wechsler's 7-year project of selecting subtests, creating items, supervising a standardization, and writing an introductory book and an examiner's manual was basically a one-man show. The only individual whom Wechsler (1939) acknowledged for making substantive contributions was Wells, "whose continued interest in the Scales and helpful suggestions were a source of both inspiration and encouragement" (p. vii).

AFTERMATH OF THE WECHSLER-BELLEVUE SCALE

In comparison with the Stanford-Binet scale and other contemporary intelligence tests, the advantages of the Wechsler-Bellevue scale were formidable. The Wechsler-Bellevue scale was based on familiar tests that psychologists accepted as valid. These tests were organized into verbal and performance scales that could be administered and interpreted separately. The use of deviation scores obviated statistical artifacts and provided a statistical basis for interpreting the subtest profile and verbal-performance discrepancy. The large standardization sample, spanning from childhood to adulthood, had been selected in a principled and precise manner. By incorporating these technical innovations into a single scale, Wechsler accomplished a major advance in the technology of individual intelligence testing. Published reviews of the scale drew attention to these technical advances, while Wechsler's borrowing of tests was treated as an advantage, if mentioned at all. For example, Lorge's (1943) review in the Journal of Consulting Psychology commented that Wechsler's chief contribution was "in the organization of well-known tests into a composite scale" (p. 167).

The combination of technical advances was probably more than sufficient for the Wechsler-Bellevue scale to become the dominant adult individual intelligence test. However, the impact of the scale was further magnified because it met the need created by the rapid growth of clinical psychology during the 1940s, particularly in adult psychiatry.
As described by Matarazzo, "overnight this massive social need teamed up with the new developments in professional assessment and psychotherapy which had been occurring in relative isolation among the country's few hundred full time practitioner-psychologists of the 1930s and early 1940s, and psychology in the United States found itself fully launched as a profession" (1972, p. 11).

A measure of this impact is that Wechsler's book, containing the manual for the Wechsler-Bellevue scale, went through three editions between 1939 and 1944. The second edition (Wechsler, 1941) introduced a chapter describing how profiles of subtest scores could be used in differential diagnosis, a procedure that shaped the practice of test interpretation by later clinicians. The scale was selected for the standard psychological battery recommended by prominent psychologists at the Menninger Clinic (Rapaport, Schafer, & Gill, 1944). A 1946 survey of psychological testing practices (Louttit & Browne, 1947) found that the Wechsler-Bellevue scale was one of the most widely used psychological measures, second only to the 1937 Stanford-Binet revision. Research with the scale increased rapidly and soon comprised a significant part of research in clinical psychology.

The entry of the U.S.A. into World War II created the need for new individual intelligence tests for screening and assigning recruits. An alternate form of the Bellevue Intelligence Scale was constructed, termed the Wechsler Mental Ability Scale (Wechsler, 1946). The scale was published after the war as Form II of the Wechsler-Bellevue scale. The Form II manual reported that the new scale had been standardized on a sample of 18- to 40-year-old males, but data from this standardization were not presented. Instead, Wechsler (1946) instructed the clinician to obtain IQs and subtest scaled scores using the 1939 norms, by adding a correction constant to the Form II scores in order to equate the difficulty level between the two forms.
During the war, the Army Individual Test (Staff, Personnel Research Section, Classification and Replacement Branch, The Adjutant General's Office, 1944) was developed as an individually administered IQ test that would imitate the verbal-performance structure of the Wechsler-Bellevue scale but require less time to administer. The Trail Making Test was one of the "nonverbal" subtests of this scale. After the war, Leiter (1951) published most of the subtests of the Army Individual Test as the Leiter Adult Intelligence Scale (LAIS).

The Wechsler-Bellevue scale may have put an end to the development of separate performance intelligence tests. A comparison of testing surveys in 1935 and 1946 (Louttit & Browne, 1947) revealed a sharp decline in the usage of purely performance tests during that interval. Two of the last performance batteries were created by Halstead (1947) and by Goldstein and Scheerer (1941) for testing patients with brain disorders. Both batteries consisted of performance tests that required little or no spoken response. The Goldstein-Scheerer Tests of Abstract and Concrete Thinking (Goldstein & Scheerer, 1941) included the Kohs Block Design test as a measure of "abstract attitude," a concept akin to executive function. In Halstead's battery, 3 of the 10 "neuropsychological indicators" comprising the Impairment Index were derived from the Tactual Performance Test. Halstead's adaptation of this form board followed the earlier procedure of administering three trials in order to assess learning. Administration of the Tactual Performance Test was probably based upon a procedure known as the 'tactual form,' used with subjects older than 10-12 years, in which the subject performed the test blindfolded and was instructed after the final trial to "try to sketch the positions of the forms and their shapes" (Whipple, 1914, p. 301).

In 1949 a revision of Wechsler-Bellevue Form II was published as a children's scale, the Wechsler Intelligence Scale for Children (WISC; Wechsler, 1949). Wechsler's comment that the scale had been 5 years in preparation implies that the children's revision had begun in 1944 or 1945, before Wechsler-Bellevue Form II was published. The revision from Wechsler-Bellevue Form II to the WISC involved few major changes, so that the children's scale inherited almost all the Wechsler-Bellevue Form II items.
In a clarification repeated in the manual of each WISC revision, Wechsler (1949) stated that "most of the items in the WISC are from Form II of the earlier scales, the main additions being new items at the easier end of each test" (p. 1). The WISC introduced the basic standardization procedure to be followed by later Wechsler intelligence scales. The sample consisted of 100 male and 100 female subjects at each age from 5 to 15 years. The subjects, who were all white, were selected to represent the proportions of the U.S. population residing in four geographical regions, as well as the proportion of parents working in various occupations.

The Wechsler-Bellevue scale has undergone three revisions (Wechsler, 1955, 1981, 1997) culminating in the current Wechsler Adult Intelligence Scale-III (WAIS-III). These revisions have been restricted to adults but have provided improved norms, based on samples of the adult U.S. population that are considerably larger than the Wechsler-Bellevue sample and that are geographically as well as demographically representative. While successive revisions have raised the upper end of the age range and have more accurately sampled the U.S. population's racial-ethnic composition, these revisions have retained the original subtest structure and many of the original items. The only change in the subtest profile was made by the WAIS-III, which substituted an untimed Matrix Reasoning subtest in place of the Object Assembly subtest, whose statistical weaknesses were discussed by Wechsler (1939). A final edition of Wechsler's book about adult intelligence was published in 1958, probably to coincide with the 1955 revision. The last revision of the book, which Wechsler described as a "necessary supplementary text" (1955, p. iv), was published almost 30 years ago by Matarazzo (1972).

Several factor analyses have been conducted of the Wechsler subtests in order to validate their categorization into verbal and performance scales. The factors obtained in these studies, starting with Balinsky's (1941) study of the Wechsler-Bellevue standardization sample, have tended to reproduce the historical origins of the subtests.
The verbal factor is defined by subtests derived from the Binet-Simon scale and Alpha examination, and the 'perceptual' factor by subtests derived from the Army Performance Scale and the Cornell-Coxe scale. The Digit Span and Digit Symbol subtests, which cannot be assigned to either factor, differ in their historical origins from the other subtests.

The WISC has undergone two revisions (Wechsler, 1974, 1991) which, like the adult revisions of the Wechsler-Bellevue scale, conserve the original subtests and many items of Wechsler-Bellevue Form II. Like the adult revisions, successive WISC revisions have been based on standardization samples that are larger and more racially and ethnically representative. A Wechsler intelligence scale for preschool children was published in 1967, based on the same verbal-performance structure but with some of the subtests replaced by ones designed for younger children.

Wechsler died in 1981. The newest revisions of his intelligence scales (Wechsler, 1991, 1997), which were developed long after his death, continue to list him as sole author. Because the current revisions of the WISC and WAIS were published about 16 years after the preceding versions (Wechsler, 1974, 1981, 1991, 1997), the WISC-IV could be anticipated in 2007 and the WAIS-IV in 2013.

CONCLUSIONS

From a historical perspective, the Wechsler-Bellevue Intelligence Scale is a battery of intelligence tests developed between the 1880s and World War I. In their origins, the Wechsler subtests represent the major pre-World War I approaches to cognitive assessment. These approaches (and related subtests) include anthropometrics (Digit Span), association psychology (Digit Symbol), the Binet-Simon tests (Comprehension, Similarities, Vocabulary, Picture Completion), performance testing of immigrants to the U.S.A. (Object Assembly), and group testing in American schools and businesses (Arithmetic, Information). The Wechsler-Bellevue verbal subtests were adapted mostly from the Binet-Simon and Yerkes-Bridges scales and from the Alpha examination. The Wechsler-Bellevue performance subtests were taken from the various tests developed or popularized by Healy and Fernald, the Ellis Island physicians, Pintner and Paterson, Kohs, and Cornell and Coxe.
These tests had been selected by the Army testing program for the purpose of rapid personnel screening of young, physically healthy men, and were mostly designed for testing groups rather than individuals. These Army tests were transformed directly into Wechsler-Bellevue subtests with the same titles and, in some cases, the same content.

Wechsler's recycling of old intelligence tests was similar to the construction of other intelligence scales of the period. His grouping of subtests into verbal and performance scales represented the two major contemporary approaches to individual intelligence testing. The verbal and performance scales were designed to serve as different measures of the same construct, that of general intelligence. Wechsler's major contributions were not the subtests themselves but rather the technical innovations of deviation scores, combining verbal and performance tests into a single scale, and selecting the adult standardization sample by occupation. These are the features of the Wechsler-Bellevue scale that have been emulated by later individual intelligence tests. For various reasons, including statistical innovations and historical circumstances, the Wechsler intelligence scales have achieved almost monopoly status in the field of individual adult intelligence testing. The point scale model of intelligence testing, as adapted by the Wechsler-Bellevue scale, has replaced the year scale as the preferred model for children's intelligence tests, most of which have abandoned the IQ label for their composite scores.

The history of the Wechsler intelligence scales presents several paradoxes. First, it is remarkable that the rate of change in the technology of individual intelligence testing has been so slow given the unprecedented pace of scientific progress in the century since the 1905 Binet-Simon scale. Over the century, advances in psychological testing have been significant albeit far less radical than in other technologies. Widespread advances in psychological testing include the use of factor analysis to guide scale design, selection of items using item response theory, evaluation of test-taking motivation, and computerized test administration.
However, the only significant innovation incorporated into the Wechsler intelligence scales is the availability of factor scores for the WAIS-III and WISC-III. The basic administration procedure used by the Wechsler scales and other individual intelligence tests, that of an examiner individually testing a subject with an invariant sequence of items, is not fundamentally different from how the Binet-Simon scale was administered. Therefore, it is paradoxical that the Wechsler-Bellevue scale, which was a model of technical innovation in 1939, represents in its current revision one of the oldest mental tests in continuous use. The intelligence scale that is relied upon to make medical, educational, and legal decisions does not reflect advances in understanding of cognitive functioning during the past 60 years and contains tests from the 1800s.

While the revisions of the Wechsler-Bellevue scale have introduced few changes in its basic structure, clinical interpretation of Wechsler test results has changed dramatically. Nowadays clinicians are encouraged to interpret the subtests not as measures of general intelligence, but rather as separate measures of specific cognitive abilities. For example, Lezak (1995) described the Picture Arrangement subtest as a measure of both "socially appropriate thinking" and "sequential thinking," so that the subtest can serve "as a nonverbal counterpart of that aspect of Comprehension" (p. 639). To some extent this method of subtest interpretation is inspired by changes in Wechsler's own views. Wechsler's earlier view, expressed in his 1939 book, was that the important factor underlying most of the subtests was general intelligence and not special abilities. For example, Wechsler criticized the view that the Picture Arrangement subtest measured "social intelligence," arguing that he did "not believe in such an entity" because "social intelligence is just general intelligence applied to social situations" (1939, pp. 90-91). However, his 1941 chapter introducing profile interpretation implicitly assumed that subtests could be interpreted as measures of specific abilities, particularly when testing patients whose mental disorders were characterized by strengths or impairments in these abilities.
He speculated that ``the good score frequently obtained by the psychopath on the Picture Arrangement test'' might occur because psychopaths ``generally have a grasp of social situations'' (1941, p. 153), even though they tend to use this knowledge in an anti-social way.


A third paradox is that the Wechsler children's intelligence scale, now the dominant intelligence scale administered to school-aged children, is a downward extension of tests designed for adults. Thus, the Stanford-Binet scale, which was designed for children and based on year-scale developmental norms, was replaced by an adaptation of an adult scale with neither a developmental rationale nor sensitivity to developmental stages. Revisions of the Wechsler children's scale have conserved the scale's basic structure without introducing major upgrades to improve its clinical utility.

While clinicians of the 1940s probably recognized the Wechsler-Bellevue subtests as old tests with familiar origins, currently these subtests are used without recognition of the psychologists and physicians who created them. It is paradoxical that the creators of these subtests were not individuals of little note, but rather public figures, such as Terman, who are famed for other contributions. The most direct way to recognize the tests' creators would be to rename their tests in their honor. For example, the Digit Symbol subtest could be renamed the Wisconsin-Jastrow Substitution Test. Whether or not the tests are renamed, which is unlikely, test manuals and psychology textbooks should properly credit the persons who created the tests.

Finally, it is curious that the Wechsler scales have come to play so central a role in neuropsychological assessment, given that the measurement of cognitive deficit does not appear to have been a major consideration in Wechsler's subtest selection. Only in the case of the Digit Span subtest did Wechsler (1939) indicate that the subtest was selected to measure cognitive deficit, in this case as a measure of `attention' in cognitively impaired persons. He commented that the Block Design and Digit Symbol subtests were especially sensitive to certain brain diseases, but not that these or other subtests were selected for that purpose.
Because the neuropsychological understanding of intelligence tests would have been quite limited in 1939, it would have been difficult to foresee that the use of the Wechsler intelligence scales as measures of cognitive deficit would be successful. Indeed, it might have been predicted that intelligence testing of brain-disordered patients would be more misleading than helpful. Such a warning was given by the British psychiatrist Andrew Paterson, at the time a colleague of Oliver Zangwill at the Edinburgh Brain Injuries Unit in Scotland, in a 1944 address to the Royal Society of Medicine. Referring to the wartime context in which intelligence tests were introduced into psychiatry and neurology, Paterson stated:

On the outbreak of war there was a clamour for tests for intellectual impairment. The academic psychologists through no fault of their own were encouraged to produce tests for conditions of which they had little knowledge and no clinical experience. This is the very opposite of the clinical approach where close observation should lead to the formulation of a test. In no other sphere of clinical science are tests devised before the phenomena have been studied. Such tests devised a priori tie nature down to a certain pattern of breakdown and such an assumption has always hindered progress. It also leaves out of account the variety of ways in which interference with cerebral function may express itself in the field of performance. There is more than a danger that the stereotyping of modes of investigation will force us to think along those lines only, and to close our eyes to and cease investigation of the breakdown which the hard facts of clinical observation present. (1944, p. 559)

In hindsight, Paterson's warning can be seen as only partly true. On the one hand, the sensitivity of the verbal-performance discrepancy to brain disorders contributed important evidence pointing to brain-behavior relationships. In a discussion of his own wartime experiences, Zangwill (1945) concluded that performance tests ``appear well adapted to assess the finer grades of constructive disability'' and had proven ``especially helpful in studying cases with lesions of the parieto-occipital region'' (p. 249).
On the other hand, years of inconclusive research on Wechsler subtest profiles, which have produced ``no firm rules'' (Spreen & Strauss, 1998), were probably caused by reliance on the Wechsler intelligence scales as the sole measures of cognition. By breaking away from what Zangwill (1945), echoing Paterson, called ``premature stereotyping of methods'' (p. 248), the invention of new neuropsychological tests has helped reveal brain-behavior relationships and has produced a testing technology allowing clinical neuropsychologists to exploit these discoveries.

The history of the Wechsler intelligence scales leads to the question of what the scales have to offer neuropsychological assessment in the new century. New revisions of the Wechsler scales offer the benefits of updated norms and what Lezak (1995) termed the ``hard won achievements of familiarity and experience,'' at the price of reinvesting in testing technology that is becoming obsolete. It is reasonable to anticipate that in the new century, emerging technologies using computerized administration will offer decisive advantages. Eventually, new tests based on these technologies will replace the individual intelligence test as we know it. Then it will be the job of these new tests to carry on the tradition of mental testing established by the Binet-Simon and Wechsler-Bellevue scales.

ACKNOWLEDGMENTS

An earlier version of this paper was presented at the 2000 meeting of the Midwest Neuropsychology Group in Madison, Wisconsin. I am grateful to David Baker, Bill Barr, Kevin Daley, Jeff Dosik, Janice Goldblum, Chris Grote, Kathy Hickey, Walter High, Sean Little, Kent Mercer, Mary Nowak, and Bernie Silver for their help. John Parascandola and John Wasserman shared historical materials. The Archives of the History of American Psychology and National Park Service gave permission to reproduce the figures.

REFERENCES

Arthur, G. (1930). A point scale of performance tests. New York: Commonwealth Fund.
Balinsky, B. (1941). An analysis of the mental factors of various age groups from nine to sixty. Genetic Psychology Monographs, 23, 191-234.
Binet, A. (1911/1916). New investigations upon the measure of the intellectual level among school children. In H.H. Goddard (Ed.), Development of intelligence in children (the Binet-Simon Scale) (E.S. Kite, Trans., pp. 274-328). Baltimore: Williams & Wilkins.
Binet, A., & Henri, V. (1895). La psychologie individuelle. L'Année Psychologique, 2, 411-465.
Binet, A., & Simon, T. (1905). Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. L'Année Psychologique, 11, 191-244.
Binet, A., & Simon, T. (1905/1916). New methods for the diagnosis of the intellectual level of subnormals. In H.H. Goddard (Ed.), Development of intelligence in children (the Binet-Simon Scale) (E.S. Kite, Trans., pp. 37-90). Baltimore: Williams & Wilkins.
Binet, A., & Simon, T. (1908). Le développement de l'intelligence chez les enfants. L'Année Psychologique, 14, 1-90.
Binet, A., & Simon, T. (1908/1916). The development of intelligence in the child. In H.H. Goddard (Ed.), Development of intelligence in children (the Binet-Simon Scale) (E.S. Kite, Trans., pp. 182-273). Baltimore: Williams & Wilkins.
Bonser, F.G. (1910). The reasoning ability of children of the fourth, fifth, and sixth school grades (Teachers College Contributions to Education, no. 37). New York: Teachers College, Columbia University.
Camara, W.J., Nathan, J.S., & Puente, A.E. (2000). Psychological test usage: Implications in professional psychology. Professional Psychology: Research and Practice, 31, 141-154.
Cattell, J.M. (1890). Mental tests and measurements. Mind, 15, 373-381.
Cornell, E.L., & Coxe, W.C. (1934). A performance ability scale. New York: World Book.
Dearborn, W.F. (1910). Experiments in learning. Journal of Educational Psychology, 1, 373-388.
Decroly, O. (1914). Épreuve nouvelle pour l'examen mental. L'Année Psychologique, 20, 140-159.
Edwards, A.J. (1974). Introduction. Selected papers of David Wechsler (pp. 3-29). New York: Academic Press.
Frank, G. (1983). The Wechsler enterprise: An assessment of the development, structure, and use of the Wechsler tests of intelligence. Oxford/New York: Pergamon Press.
Galton, F. (1885). On the Anthropometric Laboratory of the late International Health Exhibition. Journal of the Anthropological Institute of Great Britain and Ireland, 14, 205-221.
Galton, F. (1887). Supplementary notes on `prehension' in idiots. Mind, 12, 79-82.
Goldstein, K., & Scheerer, M. (1941). Abstract and concrete behavior. An experimental study with special tests. Psychological Monographs, 53 (Whole No. 239).
Gould, S.J. (1981). The mismeasure of man. New York: Norton.
Halstead, W.C. (1947). Brain and intelligence: A quantitative study of the frontal lobes. Chicago: University of Chicago Press.
Healy, W. (1914). A pictorial completion test. Psychological Review, 7, 140-143.


Healy, W. (1921). Pictorial Completion Test II. Journal of Applied Psychology, 5, 225-239.
Healy, W., & Fernald, G.M. (1911). Tests for practical mental classification. Psychological Monographs, 13 (Whole No. 54).
Jacobs, J. (1887). Experiments on ``prehension''. Mind, 12, 75-79.
Kellogg, C.E., & Morton, N.W. (1935). Revised beta examination. New York: Psychological Corporation.
Kirkpatrick, E.A. (1909). Studies in development and learning. Archives of Psychology, 12, 1-101.
Knox, H.A. (1913). The moron and the study of alien defectives. Journal of the American Medical Association, 60, 105-106.
Knox, H.A. (1914a). A scale, based on the work at Ellis Island, for estimating mental defect. Journal of the American Medical Association, 62, 741-747.
Knox, H.A. (1914b). Mental defectives. New York Medical Journal, 99, 215-222.
Knox, H.A. (1915). Measuring human intelligence. A progressive series of standardized tests used by the Public Health Service to protect our racial stock. Scientific American, 112, 52-53, 57-58.
Kohs, S.C. (1923). Intelligence measurement: A psychological and statistical study based upon the block design tests. New York: Macmillan.
Leiter, R.G. (1936). The Leiter International Performance Scale. Honolulu: University of Hawaii Press.
Leiter, R.G. (1951). The Leiter Adult Intelligence Scale. Psychological Service Center Journal, 3.
Lezak, M.D. (1995). Neuropsychological assessment (3rd ed.). New York: Oxford University Press.
Lorge, I. (1943). The measurement of adult intelligence [review]. Journal of Consulting Psychology, 7, 167-168.
Louttit, C.M., & Browne, C.G. (1947). The use of psychometric instruments in psychological clinics. Journal of Consulting Psychology, 11, 49-54.
Matarazzo, J.D. (1972). Wechsler's measurement and appraisal of adult intelligence (5th ed.). New York: Oxford University Press.
Mullan, E.H. (1917). Mentality of the arriving immigrant (Public Health Bulletin 90). Washington, DC: Government Printing Office.
Paterson, A. (1944). Discussion on disorders of personality after head injury (Section of Neurology. May 4, 1944). Proceedings of the Royal Society of Medicine, 37, 556-561.
Peterson, J. (1925). Early conceptions and tests of intelligence. Yonkers-on-Hudson, NY: World Book.
Pichot, P. (1948). French pioneers in the field of mental deficiency. American Journal of Mental Deficiency, 53, 128-137.
Pichot, P. (1949). Les tests mentaux en psychiatrie, I. Instruments et méthodes. Paris: Presses Universitaires de France.
Pintner, R., & Hoops, H.A. (1918). A drawing completion test. Journal of Applied Psychology, 2, 164-173.
Pintner, R., & Paterson, D.G. (1917). A scale of performance tests. New York: Appleton.
Pyle, W.H. (1913). The examination of school children: A manual of directions and norms. New York: Macmillan.
Rapaport, D., Schafer, R., & Gill, M. (1944). Manual of diagnostic psychological testing: 1: Diagnostic testing of intelligence and concept formation. New York: Josiah Macy, Jr. Foundation.
Report of Committee of Clinical Section of American Psychological Association. (1935). Psychological Clinic, 23.
Samelson, F. (1987). Was early mental testing (a) Racist inspired, (b) Objective science, (c) A technology for democracy, (d) The origin of the multiple-choice exams, (e) None of the above? (Mark the RIGHT answer). In M.M. Sokal (Ed.), Psychological testing and American society, 1890-1930 (pp. 113-127). New Brunswick, NJ: Rutgers University Press.
Sokal, M.M. (1981). The origins of the Psychological Corporation. Journal of the History of the Behavioral Sciences, 17, 54-67.
Spreen, O., & Strauss, E. (1998). A compendium of neuropsychological tests. Administration, norms, and commentary (2nd ed.). New York: Oxford University Press.
Staff, Personnel Research Section, Classification and Replacement Branch, The Adjutant General's Office. (1944). The new Army Individual Test of general mental ability. Psychological Bulletin, 41, 532-538.
Starch, D. (1911). Experiments in educational psychology. New York: Macmillan.
Stutsman, R. (1931). Mental measurement of preschool children with a guide for the administration of the Merrill-Palmer Scale of Mental Tests. New York: World Book.
Terman, L.M. (1916). The measurement of intelligence: An explanation of and a complete guide for the use of the Stanford Revision and Extension of the Binet-Simon Intelligence Scale. Boston: Houghton Mifflin.
Terman, L.M., & Merrill, M.A. (1937). Measuring intelligence: A guide to the administration of the new revised Stanford-Binet tests of intelligence. Boston: Houghton Mifflin.
Thorndike, R.M., & Lohman, D.F. (1990). A century of ability testing. Chicago: Riverside.
U.S. Public Health Service. (1977). Interviews with physicians stationed at Ellis Island in the 1910s and 1920s, by Elizabeth Yew, 1977-1978. Bethesda, MD: National Library of Medicine (unpublished transcript of interview with Grover A. Kempf, September 11, 1977).


von Mayrhauser, R.T. (1987). The manager, the medic, and the mediator: The clash of professional psychological styles and the wartime origins of group mental testing. In M.M. Sokal (Ed.), Psychological testing and American society, 1890-1930 (pp. 128-157). New Brunswick, NJ: Rutgers University Press.
Wechsler, D. (1925). The measurement of emotional reactions: Researches on the psychogalvanic reflex. Archives of Psychology, 76.
Wechsler, D. (1932). Analytic use of the Army Alpha examination. Journal of Applied Psychology, 16, 254-256.
Wechsler, D. (1935). The concept of mental deficiency in theory and practice. Psychiatric Quarterly, 9, 232-236.
Wechsler, D. (1939). The measurement of adult intelligence. Baltimore: Williams & Wilkins.
Wechsler, D. (1941). The measurement of adult intelligence (2nd ed.). Baltimore: Williams & Wilkins.
Wechsler, D. (1945). A standardized memory scale for clinical use. Journal of Psychology, 19, 87-95.
Wechsler, D. (1946). The Wechsler-Bellevue Intelligence Scale. Form II. Manual for administering and scoring the test. New York: Psychological Corporation.
Wechsler, D. (1949). Wechsler Intelligence Scale for Children. Manual. New York: Psychological Corporation.
Wechsler, D. (1955). Manual for the Wechsler Adult Intelligence Scale. New York: Psychological Corporation.
Wechsler, D. (1958). The measurement and appraisal of adult intelligence (4th ed.). Baltimore: Williams & Wilkins.
Wechsler, D. (1967). Wechsler Preschool and Primary Scale of Intelligence. New York: Psychological Corporation.
Wechsler, D. (1974). Wechsler Intelligence Scale for Children-Revised. Manual. New York: Psychological Corporation.
Wechsler, D. (1979, September). The psychometric tradition: Developing the Wechsler Adult Intelligence Scale. Paper presented at the 87th Annual Meeting of the American Psychological Association, New York.
Wechsler, D. (1981). Wechsler Adult Intelligence Scale-Revised. Manual. New York: Psychological Corporation.
Wechsler, D. (1991). WISC-III. Wechsler intelligence scale for children. Manual. San Antonio: Psychological Corporation.
Wechsler, D. (1997). Wechsler Adult Intelligence Scale-Third Edition. Administration and scoring manual. San Antonio: Psychological Corporation.
Wells, F.L. (1927). Mental tests in clinical practice. Yonkers, NY: World Book.
Wells, F.L. (1932). Army Alpha-revised. Personnel Journal, 10, 411-417.
Whipple, G.M. (1910). Manual of mental and physical tests. A book of directions compiled with special reference to the experimental study of school children in the laboratory or classroom. Baltimore: Warwick & York.
Whipple, G.M. (1914). Manual of mental and physical tests. A book of directions compiled with special reference to the experimental study of school children in the laboratory or classroom. Part I: Simple processes (2nd ed.). Baltimore: Warwick & York.
Woodworth, R.S., & Wells, F.L. (1911). Association tests. Being a part of the Report of the Committee of the American Psychological Association on the Standardization of Procedure in Experimental Tests. Psychological Monographs, 13 (Whole No. 57).
Yerkes, R.M. (Ed.) (1921). Psychological examining in the United States Army. Memoirs of the National Academy of Sciences, 15 (Parts 1-3). Washington, DC: Government Printing Office.
Yerkes, R.M., Bridges, J.W., & Hardwick, R.S. (1915). A point scale for measuring mental ability. Baltimore: Warwick & York.
Yew, E. (1980). Medical inspection of immigrants at Ellis Island, 1891-1924. Bulletin of the New York Academy of Medicine, 56, 488-510.
Yoakum, C.A., & Yerkes, R.M. (1920). Army mental tests. New York: Holt.
Zangwill, O.L. (1945). A review of psychological work at the Brain Injuries Unit, Edinburgh, 1941-1945. British Medical Journal, 2, 248-250.
Zenderland, L. (1998). Measuring minds: Henry Herbert Goddard and the origins of American intelligence testing. New York: Cambridge University Press.
