
School Psychology Review, 2002, Volume 31, No. 4, pp. 498-513

What is Measured in Mathematics Tests? Construct Validity of Curriculum-Based Mathematics Measures

Robin Schul Thurber
Puyallup School District

Mark R. Shinn
University of Oregon

Keith Smolkowski
Oregon Research Institute

Abstract. Mathematics assessment is often characterized in the literature as being composed of two broad components: Computation and Applications. Many assessment tools are available to evaluate student skill in these areas of mathematics. However, not all math tests can be used in formative evaluation to inform instruction and improve student achievement. Mathematics curriculum-based measurement (M-CBM) is one tool that has been developed for formative evaluation in mathematics. However, there is considerably less technical adequacy information on M-CBM than on CBM reading. Of particular interest is the construct that M-CBM measures: computation or general mathematics achievement. This study utilized confirmatory factor analysis procedures to determine what constructs M-CBM actually measures in the context of a range of other mathematics measures. Other issues examined in this study included math assessment in general and the role of reading in math assessment. Participants were 207 fourth-grade students who were tested with math computation, math applications, and reading tests. Three theoretical models of mathematics were tested. Results indicated that the best-fitting model was a two-factor model of mathematics in which Computation and Applications were distinct although related constructs, M-CBM was a measure of Computation, and reading skill was highly correlated with both math factors. Secondary findings included the important role that reading skills play in general mathematics assessment.

A large number of students receive special education services for academic achievement problems. According to the 19th Annual Report to Congress on the Implementation of the Individuals with Disabilities Education Act (IDEA; U.S. Department of Education, 1997), over 5 million students were served under IDEA during the 1995-96 school year. More than half of those students receiving special education services (2,597,231) were identified with achievement deficits, and therefore, served under the learning disability category. Even more students are at-risk for developing academic problems (National Center for Edu-

Authors’ Notes. This article is based on the doctoral dissertation of the first author and was supported in
part by grant No. 84.029D60057 Leadership Training in Curriculum-Based Measurement and Its Use in a
Problem-Solving Model sponsored by the US Department of Education, Office of Special Education Pro-
grams. The views expressed within this paper are not necessarily those of the USDE. Address all corre-
spondence and questions about this manuscript to Mark R. Shinn, Ph.D., College of Education, University
of Oregon, Eugene, OR 97403. E-mail: mshinn@oregon.uoregon.edu
Copyright 2002 by the National Association of School Psychologists, ISSN 0279-6015

498
Mathematics Measurement

cational Statistics, 1996). When combined with the large number of students served in special education nationally, almost one student in five receives some type of remedial education to reduce academic achievement deficits (Shinn & McConnell, 1994).

Traditionally, reading deficits have received the most attention in the education literature. For example, in a review of the literature on reading disabilities, Hallahan, Kaufman, and Lloyd (1985) asserted that reading is essential for academic functioning in nearly all subjects. However, these researchers also suggested that the common assumption that disabilities in mathematics are not as prevalent as those in reading and writing is misleading, if not completely false. Numerous evaluations reveal significant deficits in mathematics performance for students in the United States (National Assessment of Educational Progress [NAEP], 1992; Reese, Miller, Mazzeo, & Dossey, 1997). For example, more than 80% of eighth-grade students could not solve modestly difficult problems (e.g., compute with decimals, fractions, and percents; recognize geometric figures; solve simple equations) correctly from their eighth-grade math textbook (NAEP, 1992). Anrig and LaPointe (1989) found that only 16% of eighth-grade students in the U.S. mastered the content of the typical eighth-grade mathematics text. NAEP results also revealed that only 8% of eighth graders could answer mathematics questions requiring problem-solving skills (NAEP, 1992). Finally, the latest National Assessment of Educational Progress in Mathematics reported that across Grades 4, 8, and 12, 25% or fewer students were estimated to be at the Proficient level or beyond, where students should demonstrate evidence of solid academic performance. Only 2 to 4% of students attained the Advanced level, where students should demonstrate superior performance (Reese et al., 1997).

Formative Evaluation and Mathematics Curriculum-Based Measurement

It has been demonstrated repeatedly (Fuchs & Fuchs, 1986; Fuchs, Fuchs, & Hamlett, 1989; Fuchs, Fuchs, Hamlett, & Stecker, 1990) that part of effective intervention is formative evaluation. In contrast to summative evaluation that is retrospective (i.e., data are collected after completion of instruction), formative evaluation involves the collection of data during instruction as a basis for modifying that instruction (Deno & Espin, 1991). By formatively evaluating students’ mathematics progress, teachers can assess the effectiveness of their instruction within weeks to determine if their programs are working (Deno, 1986).

Because formative evaluation requires repeated, frequent measurement by classroom teachers, the measurement procedures must be technically sound, quick and easy to administer and interpret, and yield useful information about student performance in basic skills (Deno, 1985; Shinn, 1989). Curriculum-based measurement (CBM) possesses these features. CBM is a well-established technology for measuring student proficiency in reading (Deno, 1985; Shinn, 1989, 1998). Less is known about the technical adequacy of math curriculum-based measurement (M-CBM), where students write answers to standardized computation tasks drawn from the annual general curriculum on tests that vary from 2-5 minutes. Of the few studies that have been conducted, M-CBM has demonstrated high interrater agreement (.97), high 1-week test-retest reliability (.87), and moderate alternate form reliability (.66; Tindal, Marston, & Deno, 1983).

In M-CBM validity studies, the emphasis has been on concurrent validity. A relation with commercial norm-referenced math tests provides only modest support for validity. Few reported correlations exceed .60, and the median correlation is .43 with the Problem-Solving subtest and .54 with the Math Operations subtest of the Metropolitan Achievement Tests (MAT; Marston, 1989; Putnam, 1989).

Two hypotheses have been offered to explain these lower than expected correlations (Marston, 1989). First, the limited content validity of the criterion commercial mathematics tests (Freeman et al., 1983) may make them inadequate criterion measures. Second, these criterion math tests could be measuring more than just mathematics skills because many of


the items rely on silent reading of the instructions and problems. Thus, reading skills may influence performance on the mathematics test (Skiba, Magnusson, Marston, & Erickson, 1986, as cited in Marston, 1989).

Although some criterion-related validity evidence has been provided for Math-CBM, in and of itself, this type of validity is considered necessary but not sufficient to establish a measure’s technical adequacy (Messick, 1990). Construct validity is the most important type of validity evidence. Some construct validity evidence (e.g., discriminant validity) for CBM mathematics measures has been demonstrated. For example, Shinn and Marston (1985) found that Math-CBM probes differentiated students in general education, Title 1, and programs for mild disabilities at Grades 5 and 6. Students with mild disabilities also were distinguished from general education students in Grade 4. However, a preferable way to evaluate construct validity is to examine the factor structure of multiple measures (Good & Jefferson, 1998). By using confirmatory factor analysis, the correlation of each measured variable with the constructs (e.g., math competence) shared by the measures can be estimated. This approach has been used to support the validity of CBM reading measures (Shinn, Good, Knutson, Tilly, & Collins, 1992) but has not been used to examine the construct validity of CBM mathematics measures.

What Construct Does M-CBM Measure?

Although different names may be used, two broad constructs of mathematics performance are typical in the education literature: computation or operations, and applications or “problem solving.” Computation involves working math problems where students must know the concepts, strategies, and facts (Howell, Fox, & Morehead, 1993; Silbert, Carnine, & Stein, 1990). Applications are the use and understanding of math concepts to solve problems (e.g., applied word problems, measurements, temperature, volume) (Salvia & Ysseldyke, 1991). This problem solving is the functional combination of computation and application knowledge (Howell et al., 1993).

M-CBM was designed to serve as a measure of general math achievement, not specifically as a measure of only computation or applications. This theory is predicated on the hypothesized high relation between computation and application. The purpose of this study was to examine the relation of M-CBM to the constructs of general mathematics achievement, computation, and application from a theoretical perspective using confirmatory factor analysis. Three models were tested:

1. A unitary model where Computation and Applications comprise a general math competence construct that M-CBM measures accurately;
2. A two-factor model where Computation and Applications are distinct constructs and M-CBM is a measure of Computation; and
3. A two-factor model where Computation and Applications are distinct and M-CBM is a measure of Applications.

The role of reading in mathematics also was examined in each of the proposed math models.

Method

Participants

Participants were 207 fourth graders from general education classrooms in four elementary schools located in a midsized Northwestern public school district. General demographic information on the schools involved in the study is presented in Table 1. Gender was distributed about equally with 46% female and 54% male students. Most participants (74%) received all their instruction in general education; 18% received Title 1 services, and 8% received special education services in a resource room for part of the day.

Participants were obtained in accordance with University of Oregon protection of human subjects practices. Permission to conduct the study was obtained at the district administrative level, and the four elementary school principals were provided a written description of the study. Fourth-grade teachers then were contacted, and they sent a description of the study and passive consent letters to 213 parents. No parents refused. Complete data were obtained on 207


Table 1
School Demographic Information

School   Total Enrollment   Ethnicity                      Socioeconomic Status (SES)^a   State-Wide Testing^b
A        564                82% White/non-Hispanic         64%                            44% Math
                            2% Black/African American                                     48% Reading
                            12% Hispanic/Latino
B        488                88% White/non-Hispanic         46%                            56% Math
                            1% Black/African American                                     59% Reading
                            8% Hispanic/Latino
C        256                94% White/non-Hispanic         15%                            56% Math
                            <1% Black/African American                                    73% Reading
                            <1% Hispanic/Latino
D        96                 98% White/non-Hispanic         13%                            41% Math
                            <1% Black/African American                                    50% Reading

^a SES based on percent of students receiving free/reduced lunch. ^b State-wide testing percentages indicate the percent of fifth-grade students meeting or exceeding state standards.

participants as six students were absent from at least one of the three testing sessions.

Measures

Participants were administered the 12 mathematics measures shown in Table 2, selected to provide multiple measures of the hypothesized constructs of computation, applications, or general mathematics competence.

Mathematics CBM probes (M-CBM). M-CBM consisted of three fourth-grade-level math probes sampled from the annual curriculum of typical mathematics texts. The problems required a range of computation skills, from basic addition, subtraction, multiplication, and division facts to more complex use of algorithms and strategies (e.g., 362 x 25). Each basic skill problem area comprised approximately 8% of the test items, and approximately 36% of the total. The more complex computational problems constituted about 64% of the test items. Participants were given 5 minutes on each probe to complete as many problems as possible. Problems were scored by counting the number of correct digits (CD) in the process of obtaining the answer and the answer itself. As discussed earlier, some evidence suggests that M-CBM is reliable with respect to interrater scoring and alternate forms, with modest, but limited evidence of validity.

Basic math fact probes. Students were administered two math fact probes containing a combination of addition, subtraction, multiplication, and division facts. Addition and subtraction facts formed approximately 25% of the test items and were distributed equally. Basic addition facts included mathematics problems with whole numbers under 10 (e.g., 0 + 0 to 10 + 10). Basic subtraction facts included problems in which the subtrahend (i.e., subtracted number) and the difference (i.e., answer) are single-digit numbers (e.g., 1 - 0 to 10 - 9). Approximately 75% of the test items were basic multiplication and division facts that, again, were divided equally. Basic multiplication facts consisted of mathematics problems with single-digit factors (e.g., 0 x 0 to 9 x 9). Basic division facts included problems in which the divisor and quotient are single-digit numbers (e.g., 0 ÷ 0 to 81 ÷ 9) (Silbert et al., 1990). Students were given 2 minutes to complete as many problems as possible. Probes were scored by counting the number of problems correct.
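The correct-digit scoring rule just described can be sketched in a few lines of code. This is an illustrative simplification, not the published M-CBM scoring rules: it credits only digits in the final answer, aligned from the ones place, and the function name is hypothetical.

```python
def correct_digits(student_answer: str, correct_answer: str) -> int:
    """Count digits in the student's answer that match the correct
    answer when both are aligned from the rightmost (ones) place.
    Illustrative only: published M-CBM scoring also credits correct
    digits in intermediate work, which is not modeled here."""
    return sum(1 for s, c in zip(reversed(student_answer),
                                 reversed(correct_answer)) if s == c)

# For 362 x 25 = 9050, a student answer of 9040 earns 3 correct digits.
print(correct_digits("9040", "9050"))  # prints 3
```

Because digits are credited individually, a partially correct answer still earns points, which is what makes the metric sensitive to small increments of growth.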


Table 2
Measures

Name of Test                                         Math Computation           Math Applications                        Reading
Curriculum-Based Measurement (M-CBM)                 3 mixed-operation probes
Basic Math Facts                                     2 fact worksheets
Stanford Diagnostic Mathematics Test (SDMT)          Computation subtest        Applications subtest
California Achievement Tests (CAT)                   Math Computation subtest   Math Concepts and Applications subtest
National Assessment of Educational Progress (NAEP)                              Applications items
Reading Maze Test                                                                                                        3 Maze tests

Stanford Diagnostic Mathematics Test. Students also were given the Stanford Diagnostic Mathematics Test (SDMT; Beatty, Gardner, Madden, & Karlsen, 1985) Computation and Applications subtests of the Green Level test, intended to be used with students in Grades 4 and 5, and with low-achieving students in Grade 6. The Computation subtest assessed knowledge of the facts and algorithms of addition, subtraction, multiplication, and division, and methods for solving simple and compound number sentences. The Applications subtest assessed skill in applying basic math facts and principles. Items ranged in difficulty from requiring students to solve simple story problems and to select models for solving one-step problems to those that require solving multiple-step and measurement problems. Internal consistency reliability estimates for subtests by grade typically exceeded .90. Criterion-related validity evidence is reported in the test manual, with correlations between SDMT subtests and total test score and between the Stanford Achievement Test subtest and total score ranging from .64 to .89.

California Achievement Tests (CAT). Students were administered the Mathematics Computation and Mathematics Concepts and Applications subtests from the CAT (CTB/McGraw-Hill, 1992). The Mathematics Computation subtest assessed skill in solving addition, subtraction, multiplication, and division problems involving whole numbers, fractions, mixed numbers, decimals, and algebraic expressions. The Mathematics Concepts and Applications subtest assessed skill in understanding and applying a variety of mathematical concepts involving numeration, number sentences, problem solving, and measurement. Reliability evidence is restricted to internal consistency, with all but two coefficients exceeding .80. Validity evidence for the CAT is limited to a demonstration that the percentage of students mastering objectives increases with age and that the CAT is correlated with the Test of Cognitive Skills.

The National Assessment of Educational Progress (NAEP). Students also completed a set of fourth-grade NAEP items that measured mathematics applications. The NAEP mathematics assessment was designed


to report the progress of students nationally at Grades 4, 8, and 12 based on a framework influenced by the National Council of Teachers of Mathematics (NCTM, 1989). The NAEP purports to examine mathematical abilities (conceptual understanding, procedural knowledge, and problem solving) and mathematical power (reasoning, connections, and communication), but no technical adequacy information is available.

CBM Maze test. Students completed three CBM Maze reading tests (Fuchs & Fuchs, 1992; Shinn, 2002) as a measure of general reading achievement. Each test consisted of a reading passage of approximately 250 words with every 7th word deleted. Students selected responses from three choices, with two distracters and only one word that correctly completed the sentence. Participants were given 5 minutes to complete each test. Research has demonstrated the technical adequacy of the CBM Maze test as a valid measure of reading, with correlations with commercial reading tests ranging from .60 to .90 and an average correlation of .74 (Fuchs & Fuchs, 1992).

Procedures

Training of data collectors. Seven graduate students from a school psychology program at a major Pacific Northwestern university were trained as data collectors in four 1-hour training sessions. During the first session, they were trained to give M-CBM measures, basic skill probes, and CBM Maze tests with scripted directions after modeling by the first author; this was followed by a second session during which the scoring procedures were taught. The third and fourth sessions covered administration and scoring of the SDMT subtests and the NAEP math test.

Data collection and scoring. All tests were administered to groups of students in their own classrooms. The CAT was given by students’ classroom teachers as part of the district’s evaluation program. The NAEP was administered by the researchers according to the primary investigator’s modifications of the original NAEP directions. The original NAEP directions included extensive explanations for administering the NAEP as part of a comprehensive nationwide assessment of student learning. Because most of these directions were not applicable to the purpose of this study, these explanations were dropped.

Testing was completed in three sessions across 3 days in a prescribed order. The first session consisted of M-CBM and Maze tests. The second session consisted of the SDMT Computation subtest and NAEP items. The third session consisted of Basic Math Fact probes and the SDMT Concepts and Applications subtest. Each session lasted approximately 45 minutes.

Interrater Agreement

The seven data collectors and the primary investigator independently scored six of the tests for 10 participants: (a) M-CBM, (b) Basic Facts worksheet, (c) Maze Task, (d) SDMT Computation subtest, (e) SDMT Concepts and Applications subtest, and (f) NAEP items. Interrater agreement coefficients of .83, .90, .94, .94, .85, and .77 for total scores were obtained for M-CBM, Basic Facts, Maze Task, SDMT Computation, SDMT Concepts and Applications, and NAEP, respectively, using the formula: number of agreements/(number of agreements + number of disagreements) x 100.

Results

Means and standard deviations for all measures are reported in Table 3. Overall, for tests with multiple forms, scores appeared similar with respect to means and standard deviations. However, the mean of the third M-CBM probe was lower than the other two M-CBM probes (i.e., 61.0, 65.2, and 49.9).

The correlation matrix for all measures is reported in Table 4. Correlations between parallel forms of the same measure were generally high, suggesting high alternate form reliability. For example, correlations among the three M-CBM probes ranged from .90 to .92.

When examining the correlations among the math tests for evidence of concurrent validity, nearly all correlations were greater than .50. However, some general patterns are no-


Table 3
Descriptive Statistics for Student Performance on Measured Variables (n = 207)

Variable         Mean   Standard Deviation (SD)
Reading Maze 1   32.7   12.4
Reading Maze 2   37.2   13.6
Reading Maze 3   40.7   14.9
M-CBM 1          61.0   33.0
M-CBM 2          65.2   34.2
M-CBM 3          49.9   31.0
Basic Facts 1    31.2   13.9
Basic Facts 2    30.1   13.8
SDMT Comp        14.2    4.4
SDMT App         19.8    6.5
NAEP             17.4    9.2
CAT Comp         28.1    8.9
CAT App          29.4    9.9

Note. All scores are reported in raw score units. Maze scores reflect number of correct words circled; CBM scores reflect number of correct digits; Basic Fact scores reflect number of correct basic facts. M-CBM = Curriculum-Based Measurement Math; SDMT = Stanford Diagnostic Mathematics Test; NAEP = National Assessment of Educational Progress; CAT = California Achievement Tests.

ticeable. First, the measures typically conceived as measuring Computation correlated more highly with other computation measures. They also correlated lower with measures traditionally conceived as Applications. The reverse pattern was apparent with the measures conceptualized as Applications. With respect to the specific measures of interest in this study, M-CBM, these tests correlated most highly with the computation tasks tested in the Basic Facts 1 and Basic Facts 2 probes, with a median correlation of .82. M-CBM correlated less well with applications measures such as SDMT Applications, CAT Applications, and the NAEP, with median correlations of .44.

Because confirmatory factor analysis assumes multivariate normality, a preliminary analysis of the sample data was conducted to test for skewness and, more importantly, kurtosis. This test indicated positive skewness greater than 1.0 for the three M-CBM computation probes, with coefficients of 1.32, 1.32, and 1.41, respectively. Positive kurtosis greater than 1.0 was found for Maze 1 and the three M-CBM probes, with coefficients of 1.19, 1.94, 2.45, and 2.25.

Model Testing

The three models of interest were tested using the Mplus statistical analysis package (Muthén & Muthén, 2001). Because of the moderate deviations from normality for some of the measures discussed earlier, models estimated with maximum likelihood estimation were compared to the same models estimated with the robust estimators provided in Mplus (Muthén & Muthén, 2001). The fit statistics and parameter estimates differed only slightly
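The skewness and kurtosis screen described above can be sketched with the usual sample-moment formulas. This illustration uses simple biased moment estimators, which are not necessarily the exact formulas used in the original analysis.

```python
def skewness(xs):
    """Sample skewness: third central moment / SD^3 (simple biased estimator)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth central moment / variance^2, minus 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

# A score distribution with a long right tail trips the skewness > 1.0
# screening criterion used in the study (hypothetical data).
scores = [10, 12, 13, 14, 15, 16, 18, 20, 45, 60]
print(skewness(scores) > 1.0)  # prints True
```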


Table 4
Correlations Among Measured Variables

Variable           1    2    3    4    5    6    7    8    9    10   11   12
1.  Maze 1
2.  Maze 2        .88
3.  Maze 3        .87  .88
4.  M-CBM 1       .66  .59  .68
5.  M-CBM 2       .68  .60  .67  .92
6.  M-CBM 3       .65  .57  .65  .90  .91
7.  Basic Facts 1 .67  .63  .67  .82  .83  .82
8.  Basic Facts 2 .62  .59  .61  .80  .81  .82  .92
9.  SDMT Comp     .57  .54  .57  .58  .59  .54  .67  .61
10. SDMT App      .58  .55  .58  .42  .42  .36  .51  .47  .66
11. NAEP          .61  .60  .63  .44  .44  .38  .52  .45  .60  .81
12. CAT Comp      .64  .63  .66  .62  .63  .59  .66  .62  .82  .69  .66
13. CAT App       .61  .60  .63  .50  .51  .44  .55  .50  .68  .80  .80  .78

Note. M-CBM = Curriculum-Based Measurement Math; SDMT = Stanford Diagnostic Mathematics Test; SDMT Comp = Computation subtest; SDMT App = Applications subtest; NAEP = National Assessment of Educational Progress; CAT = California Achievement Tests; CAT Comp = Computation subtest; CAT App = Concepts/Applications subtest.
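The median correlations cited in the Results (.82 between M-CBM and the Basic Facts probes; .44 between M-CBM and the applications measures) can be reproduced from Table 4 in a few lines; the lists below are transcribed from the table.

```python
from statistics import median

# Correlations of the three M-CBM probes with the two Basic Facts
# probes, transcribed from Table 4.
mcbm_with_facts = [.82, .83, .82, .80, .81, .82]

# Correlations of the three M-CBM probes with the three applications
# measures (SDMT App, NAEP, CAT App), transcribed from Table 4.
mcbm_with_apps = [.42, .42, .36, .44, .44, .38, .50, .51, .44]

print(median(mcbm_with_facts))  # prints 0.82, as reported
print(median(mcbm_with_apps))   # prints 0.44, as reported
```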

(i.e., less than .01) within the levels of precision typically reported. Because differences were small and showed no substantive differences, the models reported were estimated with standard maximum likelihood.

Based on the recommendations noted by Bollen and Long (1993), multiple measures were used to evaluate fit of the hypothesized models. The Tucker-Lewis Index (TLI; also called the Nonnormed Fit Index, and ρ2 in Bollen, 1989) favors more parsimonious models, is sample-size independent, and has been recommended by Marsh (1995). The Comparative Fit Index (CFI), a truncated version of the Relative Noncentrality Index, offers an index of fit when parsimony is less important (Marsh, 1995). Both indices are normed so that they conform to a standard metric, ranging from 0 to 1, and for adequate fit, indices exceeding .95 were expected. These measures, along with the traditional χ2 and the Root Mean Square Error of Approximation (RMSEA), provide a fairly complete description of model fit. The RMSEA characterizes acceptable fit when it is below .05.

Each model tested in this study specified estimated and constrained parameters. The estimated parameters include factor loadings, residual (error) variances, and correlations that were expected to be nonzero. Constrained parameters include those factor loadings and correlations that were specified to be 0.0 for each model. All parameters are reported as standardized coefficients. For example, factor loadings are reported as correlations between measured variables with the latent factor(s) in the model.

Because of hypothesized confounds of method variance observed in previous model testing with Reading-CBM (Shinn et al., 1992), each model tested included a specific method variance factor, because a number of the mathematics measures (including M-CBM and the Maze Reading tests) were very short, timed tests. This Timed Test factor was defined so that it was uncorrelated with content factors, and thus it
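These fit indices have standard closed-form definitions in terms of chi-square statistics. The sketch below uses the common textbook formulas; the RMSEA denominator convention is an assumption, and the article does not report the baseline chi-squares needed for CFI and TLI, so those functions are shown for illustration only.

```python
import math

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation (Mplus-style, using total n;
    some programs use n - 1 in the denominator)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index from model (m) and baseline (b) chi-squares."""
    return 1.0 - max(chi2_m - df_m, 0.0) / max(chi2_m - df_m, chi2_b - df_b, 0.0)

def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis Index from model and baseline chi-squares."""
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

# The unitary model's reported chi2(54) = 190.05 with n = 207 reproduces
# the reported RMSEA of .110; the two-factor model's chi2(52) = 77.23
# reproduces the reported .048.
print(f"{rmsea(190.05, 54, 207):.3f}")  # prints 0.110
print(f"{rmsea(77.23, 52, 207):.3f}")   # prints 0.048
```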


should measure variance associated only with test methods. In addition, two pairs of measures not associated with the Timed Test factor were related beyond that captured by content factors. The two alternate forms of basic facts were allowed to correlate, as were the two CAT measures. Like the timed tests, these pairs of measures were highly similar to each other in the testing methods, yet different from other measures. It is important to account for extraneous sources of variance, such as that associated with test methods, in the models to allow the substantive factors for mathematics and reading to capture the true variance associated with those skills.

Results of model testing are represented in Figures 1 through 3. Latent constructs are displayed in ellipses, and measured variables are portrayed as squares. Directional arrows from the factors to the measured variables represent factor loadings. Curved double-headed arrows between factors or residuals indicate correlations.

Unitary model of mathematics assessment. In general, most of the indices indicated that this model was not a good fit to the data. The chi-square goodness-of-fit was significant, χ2(54) = 190.05, p < .01, indicating poor fit, and the CFI and TLI were .96 and .94, respectively, indicating a marginal fit to the data. The RMSEA was .110, well above the preferred .05 level.

Two-factor model of mathematics assessment. The unitary model of mathematics, presented in Figure 1, was hierarchically related to the two-factor model displayed in Figure 2, where Computation and Applications are distinct constructs and M-CBM is a measure of Computation. The unitary model is nested within the two-factor model in that the more restrictive unitary model was obtained by applying two constraints on the more general two-factor model (i.e., making Computation and Applications correlate 1.00 and reading correlate equally with both Computation and Applications).

Results indicate that this two-factor model provided an acceptable fit to the data. The chi-square goodness-of-fit was barely significant, χ2(52) = 77.23, p = .013. This indicates that the departure of the observed covariance matrix from the covariance matrix specified by the model may be due to chance sampling variability. The CFI and TLI, however, were both .99, indicating excellent fit to the data. In addition, the RMSEA was .048, another indication of acceptable fit.

In this model, Computation and Applications were distinct, but highly related, .83. Factor loadings on Computation ranged from .60 for CBM 3 to .93 for CAT Computation. Factor loadings on Applications ranged from .89 for SDMT Applications to .90 for NAEP and CAT Applications. Reading performance was highly correlated with both Computation (r = .76) and Applications (r = .77).

Because the unitary model was nested within the two-factor model, a chi-square difference test was calculated to determine which model was an improved fit to the data. The difference in chi-square was significant, χ2(2) = 112.82, p < .01, indicating that this two-factor model was a substantially better fit to the data.

M-CBM as a measure of Applications. Because there was theoretical and practical interest in using M-CBM to make generalizations to math applications performance, a two-factor model in which Computation and Applications were distinct math constructs and M-CBM was a measure of Applications was tested. As shown in Figure 3, reading again was highly correlated with both Computation (.69) and Applications (.76). A particularly strong correlation was found between Computation and Applications (.88). When factors are so highly correlated, it is difficult to determine that they are measuring distinct constructs. Therefore, it appears that when M-CBM is loaded on the Applications factor, the Computation and Applications constructs cannot be distinguished.

Like all the models, the chi-square goodness-of-fit statistic for this two-factor model was significant, χ2(52) = 133.66, p < .01. The CFI and TLI were .98 and .96, indicating acceptable fit to the data, but the RMSEA was .087, above the .05 criterion. Based on the chi-square and other measures of goodness-of-fit,
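The nested-model chi-square difference test can be reproduced from the reported values: the difference in chi-square is 190.05 - 77.23 = 112.82, evaluated on 54 - 52 = 2 degrees of freedom. A minimal sketch; the survival-function helper uses a closed form that is exact only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variate.
    Closed form that is exact only when df is a positive even integer."""
    assert df > 0 and df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Difference test for the unitary model nested in the two-factor model,
# using the chi-square values reported in the study.
delta_chi2 = 190.05 - 77.23  # = 112.82
delta_df = 54 - 52           # = 2
p = chi2_sf_even_df(delta_chi2, delta_df)
print(p < .01)  # prints True: the two-factor model fits significantly better
```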


Figure 1. A single-factor model of mathematics assessment inclusive of timed method variance and reading skill. (Path diagram omitted; reported fit statistics: χ2 = 190.05, df = 54, p < .01, CFI = .96, TLI = .94, RMSEA = .11.)


Figure 2. A two-factor model of mathematics assessment inclusive of timed method variance and reading skill with M-CBM as a measure of computation.


this two-factor model with M-CBM measuring Applications was not a better fit to the data than the two-factor model with M-CBM measuring Computation.

Discussion

M-CBM was developed to address the need for ongoing progress monitoring in mathematics computation. Despite a plethora of research conducted to validate the use of CBM reading, and to a lesser degree CBM written expression and spelling, the technical adequacy of M-CBM has not been investigated thoroughly (Marston, 1989; Putnam, 1989; Shinn, 1989). Important questions remain unanswered regarding which aspect(s) of mathematics, if any, are measured validly by M-CBM. The current study used confirmatory factor analysis procedures in an attempt to answer this question and to examine the potential influence of reading in mathematics assessment.

In this study, evidence of high alternate-form reliability for M-CBM was observed, with a median correlation of .91 among the three forms. However, a lower-than-expected interscorer agreement coefficient of .83 also was observed. Given that examiners used the same scoring key, this low interscorer agreement suggests that counting all the correct digits in the answer is more complex than it appears or that more scorer training is necessary.

Some convergent and divergent validity data for M-CBM also were generated. In examining the correlation matrix in Table 4, M-CBM correlated highly with other measures of basic facts computation (median r = .82) and more modestly with commercial measures of math Computation (median r = .61). Performance on M-CBM also was less related to tests conceptualized as measuring math applications (median r = .42).

Results of model testing indicate that the most defensible model was the one displayed in Figure 2. This model specified a two-factor structure of mathematics assessment in which Computation and Applications were distinct, although highly related, constructs (r = .83). Although the χ² statistic was significant for this model, all other reported fit measures indicate a good fit to the data.

In this two-factor model, the median factor loading of M-CBM on the Computation construct was .64, providing moderate evidence of its validity as a measure of mathematics Computation. Reading skill, as measured by CBM maze, also correlated highly with both the Computation and Applications constructs (.76 and .77, respectively). In addition, using the χ² difference test and other goodness-of-fit indices, a significant improvement in fit was not obtained with the other models tested. Therefore, this model cannot be rejected as the most plausible explanation of the data.

Implications

Although mathematics theory is not as well documented as reading theory, two broad factors of math emerge from the literature: Computation and Applications. In practice, these two factors typically are viewed as independent. For instance, this multidimensional theory of math is evident in the scope and sequence of traditional mathematics curricula. Typical math textbooks usually contain "(1) problem sets in which only computations are performed and (2) word problems that require selection and application of the correct algorithm and computation" (Salvia & Ysseldyke, 1991, p. 554). Math tests are constructed along these theoretical lines. It may be important to understand the degree of dependence among these constructs, however. It appears that skills in one area are necessary for success in the other.

What also has typically been ignored in mathematics assessment is the role of reading. This role has not been well researched, although speculation abounds, especially with respect to math applications. For example, Salvia and Ysseldyke (1991) caution, "Although reading level is popularly believed to affect the difficulty of word problems, its effect is less clearly established" (p. 554). Similarly, Skiba et al. (1986; as reported in Marston, 1989) and Marston (1982; as reported in Good & Jefferson, 1998) hypothesized that commercial math tests could be measuring more than just mathematics skill. Skiba and colleagues (as reported in Marston, 1989) suggested that validity coefficients among math measures were


Figure 3. A two-factor model of mathematics assessment inclusive of timed method variance and reading skill with M-CBM as a measure of application.
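The "correct digits" metric behind the interscorer agreement problem discussed above can be made concrete. The sketch below is illustrative only; the study's exact scoring rules, including credit for digits produced in the work toward the answer, are not reproduced here. It uses one common convention: digits are compared by place value, right-aligned against the answer key.

```python
def correct_digits(student_answer: str, answer_key: str) -> int:
    """Count digits of a written answer that match the answer key by place
    value (right-aligned), one common convention for CBM digit scoring.
    Illustrative sketch: scores the final answer only, not the work shown."""
    s = student_answer.strip()[::-1]  # reverse so index 0 is the ones place
    k = answer_key.strip()[::-1]
    return sum(1 for a, b in zip(s, k) if a == b and a.isdigit())

print(correct_digits("407", "417"))  # 2: ones and hundreds correct, tens wrong
print(correct_digits("417", "417"))  # 3: all digits correct
```

Because M-CBM also credits digits produced while obtaining the answer, a full scorer would need additional rules for partial work, which is one reason agreement is harder to achieve than it first appears.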


improved when reading competence was included in a prediction equation. However, the study in which these data were reported was unobtainable and thus could not be verified.

In Figure 2, correlations between the Reading and Computation (.76) and Applications (.77) constructs were high, sharing about one-half of their variance. This suggests that students who performed well in reading also tended to perform well in mathematics. Conversely, students who were not proficient in reading did not perform well on the math measures. Therefore, results indicate that reading may be a necessary and important component of overall math competence and should not be overlooked in drawing conclusions about mathematics skills.

Limitations of the Study

As with all research studies, this study possesses several limitations that affect the interpretations and generalizations of the reported results. Foremost among these limitations is lack of external validity. An effort was made to collect data across a diverse sample of students in terms of gender, educational program, and ethnicity. Regardless, students in this sample represented primarily White/non-Hispanic, general education students from one school district in one Northwestern state. Finally, as only fourth-grade students served as participants, potential developmental differences in mathematics models among different grades cannot be estimated. This study should be replicated with students who differ on these demographic features.

Another concern, as is usually the case in confirmatory factor analysis, is sample size. Currently, there is no consensus on the optimal sample size for a confirmatory factor analysis study. Fassinger (1987) reported estimates ranging from 100 for a small study to 30 participants per measured variable for larger studies. Another guideline for the minimum number of participants is 5 to 10 times the number of observed variables (Bryant & Yarnold, 1997). Finally, Marsh, Balla, and McDonald (1988) found that the effect of sample size was still significant for sample sizes as large as 400 to 1,600 individuals. Using the aforementioned guidelines, the sample size of the current study, 207 individuals, is considered adequate by some standards but too small by others.

The final major threat to the study's results is the failure to obtain a nonsignificant χ² statistic for any of the three models. In conventional interpretation (Bollen, 1989; Bryant & Yarnold, 1997; Fassinger, 1987), this failure suggests that each model did not reproduce the observed data accurately. As mentioned previously, however, there is considerable debate as to whether the χ² statistic should be used as the primary or sole indicator of model fit due to a number of limitations (e.g., it is too restrictive and sensitive to sample size and departures from multivariate normality). It is possible that a nonsignificant χ² could have been obtained by adding additional parameters (e.g., correlated errors, additional loadings), thereby improving the fit. However, it was believed that the mildly significant χ² that was obtained by testing a theory-driven model was better than a nonsignificant χ² that may have been attained by capitalizing on chance.

Summary

Theories are not proven true but are confirmed or disconfirmed by converging evidence (Stanovich, 1992). This is an important caution in interpreting theoretical research such as the current study. Confirmatory factor analysis, by nature, is designed to reject theories rather than prove them true. Instead, confirmatory factor analysis procedures determine whether the sample data confirm the hypothesized models, thus lending support to a proposed theory (Long, 1983). Although this study can be useful in making statements about this sample under these conditions, additional research is necessary to allow for the convergence of information.

The intent of the present study is to begin to explore the technical properties of M-CBM and other variables in mathematics assessment. Three outcomes strongly suggest the need for additional research to improve the technical properties of M-CBM. First, as was noted earlier, the interscorer agreement (.83) was too low, given that the examiners were given a scoring key. However, examiners were


asked to count the number of correct digits in the answer and in the process of obtaining the answer. Scoring the latter may require more examiner inference, which in turn requires more training and practice for consistency, better scoring keys, or a simplified scoring scheme.

Second, there was evidence of both positive skew and kurtosis. The former suggests that the grade-level computation tasks may have been too difficult for the students. It is worth noting that M-CBM was the only non-"broadband" mathematics measure; all the others sampled a range of across-grade problem types. With respect to the kurtosis, the reported value suggests a restricted range.

Third, the relation of M-CBM to the Computation factor, although defensible as evidence of validity, was lower than expected. Whether this correlation reflects the cumulative effects of low interscorer agreement, skewness, and kurtosis, or accurately reflects the relation of this type of measure to the construct, should be investigated.

References

Anrig, G. R., & LaPointe, A. E. (1989). What we know about what students don't know. Educational Leadership, 47(3), 4-9.
Beatty, L. S., Gardner, E. G., Madden, R., & Karlsen, B. (1985). The Stanford Diagnostic Mathematics Test (3rd ed.). San Antonio, TX: The Psychological Corporation.
Bollen, K. A. (1989). Structural equations with latent variables. New York: John Wiley & Sons.
Bollen, K. A., & Long, J. S. (1993). Testing structural equation models. Newbury Park, CA: Sage Publications.
Bryant, F. B., & Yarnold, P. R. (1997). Principal-components analysis and exploratory and confirmatory factor analysis. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding multivariate statistics (pp. 99-136). Washington, DC: American Psychological Association.
CTB/McGraw-Hill. (1992). California Achievement Tests (5th ed.). Monterey, CA: CTB Macmillan/McGraw-Hill.
Deno, S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219-232.
Deno, S. L. (1986). Formative evaluation of individual programs: A new role for school psychologists. School Psychology Review, 15, 358-374.
Deno, S. L., & Espin, C. A. (1991). Evaluation strategies for preventing and remediating basic skill deficits. In G. Stoner, M. R. Shinn, & H. M. Walker (Eds.), Interventions for achievement and behavior problems (pp. 79-97). Silver Spring, MD: National Association of School Psychologists.
Fassinger, R. E. (1987). Use of structural equation modeling in counseling psychology research. Journal of Counseling Psychology, 34, 425-436.
Freeman, D. J., Kuhs, T. M., Porter, A. C., Floden, R. E., Schmidt, W. H., & Schwille, J. R. (1983). Do textbooks and tests define a national curriculum in elementary school mathematics? Elementary School Journal, 83, 501-513.
Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53, 199-208.
Fuchs, L. S., & Fuchs, D. (1992). Identifying a measure for monitoring student reading progress. School Psychology Review, 21, 45-58.
Fuchs, L. S., Fuchs, D., & Hamlett, C. L. (1989). Effects of instrumental use of curriculum-based measurement to enhance instructional programs. Remedial and Special Education, 10, 43-52.
Fuchs, L. S., Fuchs, D., Hamlett, C. L., & Stecker, P. M. (1990). The role of skills analysis in curriculum-based measurement in math. School Psychology Review, 19, 6-22.
Good, R. H., & Jefferson, G. (1998). Contemporary perspectives on curriculum-based measurement validity. In M. R. Shinn (Ed.), Advanced applications of curriculum-based measurement (pp. 61-88). New York: Guilford Press.
Hallahan, D. P., Kauffman, J. M., & Lloyd, J. W. (1985). Introduction to learning disabilities. Englewood Cliffs, NJ: Prentice-Hall.
Howell, K. W., Fox, S. L., & Morehead, M. K. (1993). Curriculum-based evaluation: Teaching and decision making. Pacific Grove, CA: Brooks/Cole.
Long, J. S. (1983). Confirmatory factor analysis. Newbury Park, CA: Sage Publications.
Marsh, H. W. (1995). χ² and χ²I² fit indices for structural equation models: A brief note of clarification. Structural Equation Modeling, 2, 246.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103, 391-410.
Marston, D. B. (1989). A curriculum-based measurement approach to assessing academic performance: What it is and why do it. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 18-78). New York: Guilford Press.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Muthén, L. K., & Muthén, B. O. (2001). Mplus user's guide. Los Angeles: Muthén & Muthén.
National Assessment of Educational Progress. (1992). NAEP 1992 mathematics report card for the nation and the states (Report No. 23-ST02). Washington, DC: National Center for Educational Statistics.
National Center for Educational Statistics. (1996). The pocket guide to the condition of education: 1996. Washington, DC: U.S. Department of Education.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: NCTM.
Putnam, D. (1989). The criterion-related validity of CBM measures of math. Unpublished master's thesis, University of Oregon, Eugene.
Reese, C. M., Miller, K. E., Mazzeo, J., & Dossey, J. A. (1997). NAEP 1996 mathematics report card for the nation and the states. Washington, DC: National Center for Education Statistics.
Salvia, J., & Ysseldyke, J. E. (1991). Assessment (5th ed.). Boston: Houghton Mifflin.
Shinn, M. R. (Ed.). (1989). Curriculum-based measurement: Assessing special children. New York: Guilford.
Shinn, M. R. (Ed.). (1998). Advanced applications of curriculum-based measurement. New York: Guilford.
Shinn, M. R. (2002). Use of curriculum-based measurement maze in general outcome measurement. Eden Prairie, MN: Edformation.
Shinn, M. R., Good, R. H., Knutson, N., Tilly, W. D., & Collins, V. L. (1992). Curriculum-based measurement of oral reading fluency: A confirmatory analysis of its relation to reading. School Psychology Review, 21, 459-479.
Shinn, M. R., & Marston, D. (1985). Differentiating mildly handicapped, low-achieving, and regular education students: A curriculum-based approach. Remedial and Special Education, 6(2), 31-38.
Shinn, M. R., & McConnell, S. M. (1994). Improving general education instruction: Relevance to school psychologists. School Psychology Review, 23, 351-371.
Silbert, J., Carnine, D., & Stein, M. (1990). Direct instruction mathematics (2nd ed.). Columbus, OH: Merrill.
Skiba, R., Magnusson, D., Marston, D., & Erickson, K. (1986). The assessment of mathematics performance in special education: Achievement tests, proficiency tests, or formative evaluation? Minneapolis: Special Services, Minneapolis Public Schools.
Stanovich, K. E. (1992). How to think straight about psychology. New York: HarperCollins.
Tindal, G., Marston, D., & Deno, S. (1983). The reliability of direct and repeated measurement (Research Report No. 109). Minneapolis, MN: University of Minnesota Institute for Research on Learning Disabilities.
U.S. Department of Education. (1997). Nineteenth annual report to Congress on the implementation of the Individuals with Disabilities Education Act. Washington, DC: U.S. Department of Education.

Robin Schul Thurber received her Ph.D. in School Psychology from the University of
Oregon in 1999 and is a school psychologist in Puyallup, Washington. Areas of interest
include instructional design and consultation, curriculum-based measurement (CBM), and
violence prevention/social skills instruction.

Mark R. Shinn received his Ph.D. in Educational Psychology (School Psychology) from
the University of Minnesota in 1981 and is a Professor in the Special Education area at the
University of Oregon. His primary research and teaching interests are curriculum-based
measurement and its use in a problem-solving model and other needs-based service deliv-
ery systems.

Keith Smolkowski currently works as a research analyst at Oregon Research Institute. He received his Master's degree in Decision Sciences in 1995 from the University of Oregon and is pursuing a Ph.D. in Special Education there. His interests include research methods and statistics, effective literacy instruction, and positive behavior support.

