
William Molnar, Vincent Gordon, Kindra Jones, and Maria Newton-Tabon


The following response is a collaborative effort of

Vincent Gordon, Kindra Jones, William Molnar, and Maria Newton-Tabon

IEA’s Trends in International Mathematics and Science Study (TIMSS) provides a wealth of information about students’ science and mathematics achievement in an international framework. TIMSS tests students in grades four and eight and also gathers a wide range of data from their schools and teachers about curriculum and instruction in mathematics and science. TIMSS findings have been used by many countries around the world in their efforts to develop better methods of teaching science and mathematics. Involving more than 60 countries, “TIMSS 2007 is the most recent in the four-year cycle of studies to measure trends in students’ mathematics and science achievement”. The first TIMSS, in 1995, involved 41 countries; the second, in 1999, involved 38; and TIMSS 2003 included more than 50. The majority of countries participating in TIMSS 2007 will have data going back to 1995.

TIMSS Advanced assesses students in their final year of secondary school who have taken coursework in advanced mathematics and physics. Since the 1995 assessment, however, TIMSS has not assessed students who are nearing the end of high school. Recognizing the strong link between scientific competence and economic productivity, and given the relatively long period since the 1995 assessments, countries around the world have expressed interest in participating in TIMSS Advanced. They want internationally comparative data about the achievement of their students enrolled in advanced courses designed to lead into science-oriented programs in university. By joining TIMSS Advanced, “countries that participated in 1995 can determine whether the achievement

of students having taken advanced coursework has changed over time. Countries participating in TIMSS Advanced for the first time can assess their comparative standing in mathematics and physics in an international context”.

TIMSS uses the curriculum, broadly defined, as “the major organizing concept in considering how educational opportunities are provided to students, and the factors that influence how students use these opportunities”. To begin the process of defining the topics to be assessed in TIMSS Advanced, this document built on the composition of the 1995 Advanced Mathematics and Physics assessments to draft a framework for TIMSS Advanced 2008. The description of cognitive domains also benefited from the TIMSS developmental project, funded by a number of countries, to enable reporting TIMSS 2007 results according to cognitive domains. The first draft of this document was thoroughly reviewed by participating countries and updated accordingly. Countries provided comments about the topics incorporated in their advanced mathematics and physics courses, and made recommendations about the desirability and suitability of assessing particular topics. TIMSS, including TIMSS Advanced, is a major undertaking of the IEA, which has taken full responsibility for the management of the project. The “TIMSS International Study Center” coordinates with the IEA Secretariat in Amsterdam on translation, the “IEA Data Processing Center” in Germany on construction of the documents for the database, “Statistics Canada” on sampling, and “Educational Testing Service in New Jersey on the psychometric scaling of the data”.

The question that needs to be answered is whether it is appropriate to use TIMSS to evaluate the 56 differing state or state-like education entities in the United States. This

author chooses to focus on the pros of validity and reliability with regard to this particular assessment. Based on research, this assessment is reliable and valid due to the content domains, which “define the specific mathematics subject matter covered by the assessment, and the cognitive domains which define the sets of behaviors expected of students as they engage with the mathematics content. The cognitive domains of mathematics and science are defined by the same three sets of expected behaviors: knowing, applying, and reasoning”. In other words, although other variables could factor into why this particular test may not be valid or reliable, the common factors that span continents and are shared by the age groups tested (fourth and eighth graders) are “the mathematical topics or content that students are expected to learn and the cognitive skills that students are expected to have developed” (2007).

The IEA developed TIMSS to compare educational achievement around the globe. TIMSS began in the 1990s with a desire to conduct international studies of students within the same age or grade bracket. It was believed that mathematics and science education would be essential for economic development in the technological world of the future. The break-up of the Soviet Union brought new countries wanting to participate in this study so that its data could guide their educational systems.

TIMSS combined a measurement of science and mathematics achievement with questionnaires for students and teachers. The measurement covered topics in science and mathematics that students should have encountered by grades 4 and 8. The questionnaires were used to collect information on students’ backgrounds, their attitudes, and their beliefs

about school and the learning process. The school and teacher questionnaires looked at class scheduling of science and mathematics coverage, school policies, and the educational backgrounds of the teachers and their preparation.

A summated rating scale was used for measurement, as is common practice in the social sciences. The summated rating scale showed validity and reliability for the sample on which it was used. Summated rating scales were derived for each construct. One construct pertained to students’ self-interest in mathematics and the belief that motivation plays a vital role in predicting a student’s present and future achievement. The opportunity for students in both the fourth and eighth grades across continents to do well is fairly and consistently assessed, and scores by subject and grade are comparable over time (2007). When TIMSS was created, the overall scale had a mean of 500 based on the countries that took part in the testing. This testing of both grades four and eight began in 1995 and continued through 2007. “Successive TIMSS assessments since 1995, 1999, 2003, and 2007 have scaled the achievement data so that scores are equivalent from assessment to assessment” (2007). The cognitive domains form the foundation of the “Trends in International Mathematics and Science Study (TIMSS)” assessment.

In addition to assessing proficiency in science and mathematics, TIMSS collected data related to the teachers, the students, and the schools. This information was vital to help researchers understand the performance of students in their own country. A summated rating scale was developed because divulging information item by item is almost impossible. TIMSS called this scale a multi-item indicator.
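The mechanics of a summated rating scale are simple: a student’s responses to the items measuring one construct are summed into a single composite score, and the scale’s internal-consistency reliability can then be checked with a statistic such as Cronbach’s alpha. The sketch below is only an illustration of that idea, using hypothetical Likert-type responses rather than actual TIMSS data.

```python
from statistics import pvariance

def summated_score(responses):
    """Sum each student's item responses into one composite scale score."""
    return [sum(items) for items in responses]

def cronbach_alpha(responses):
    """Internal-consistency reliability of a summated scale.

    responses: one list of item scores per student.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])                 # number of items
    item_columns = list(zip(*responses))  # transpose to per-item columns
    item_vars = sum(pvariance(col) for col in item_columns)
    total_var = pvariance(summated_score(responses))
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 4-point Likert responses: 6 students x 4 items
data = [[4, 3, 4, 4], [2, 2, 1, 2], [3, 3, 3, 4],
        [1, 2, 1, 1], [4, 4, 3, 4], [2, 3, 2, 2]]
print(summated_score(data))            # [15, 7, 13, 5, 15, 9]
print(round(cronbach_alpha(data), 2))  # 0.95
```

A coefficient in this neighborhood indicates that the items hang together well enough to be reported as a single multi-item indicator.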

There are advantages to combining several items into one variable, such as “improved measurement precision, scope, and validity”. When a huge amount of information is collected and reported, a summated rating scale reduces the volume of data and makes it easier for the public to digest. “For the advantages of summated rating scales over single item measures to hold, it is important that the scale can show reasonable score reliability and validity as an indicator of a latent variable for the sample on which the scale is used”. TIMSS was created through a collaboration of participating countries, whose curriculum, measurement, and educational experts created the item pools, assessment frameworks, and questionnaires. TIMSS is grounded in school curricula and is designed to examine the provision of educational opportunities to students. TIMSS investigates two levels: “the intended curriculum and the implemented curriculum”. The “intended curriculum” is the science and mathematics that society expects students to learn, along with how the educational system is organized to reach that goal. The “implemented curriculum” is the content taught in class, how it is taught, and who teaches it. The 2003 assessment of student achievement in mathematics and science had ambitious coverage goals, reporting not only overall science and mathematics achievement scores but also scores in important content areas within these subjects. The mathematical topical or content domains (as they are referred to in TIMSS) covered in the fourth grade are “numbers, geometric shapes, measures, and data display. In the eighth grade, the content domains are numbers, algebra, geometry, data and chance. The cognitive domains in each grade are knowing, applying, and reasoning” (2007). The five domains in science are “life science, chemistry, physics, earth science, and

environmental science that defined the specific science subject matter covered by the assessment. The cognitive domains, four in mathematics (knowing facts and procedures, using concepts, solving routine problems, and reasoning) and three in science (factual knowledge, conceptual understanding, and reasoning and analysis) defined the sets of behaviors expected of students as they engaged with the mathematics and science content”. Students’ achievement was reported in terms of performance in each content area as well as in science and mathematics overall.

To give the US the opportunity to measure itself against the high-performing students of other countries, the “International Study Center at Boston College”, “The National Science Foundation”, and “The Center for Education Statistics” established the TIMSS 1999 Benchmarking study. The TIMSS achievement tests were given to students in spring 1999 in conjunction with the administration of TIMSS in other countries. “Participation in TIMSS benchmarking was intended to help states and districts understand their comparative educational standing, assess the rigor and effectiveness of their own mathematics and science programs in an international context, and improve the teaching and learning of mathematics and science”.

With regard to test reliability, statistically “the median reliabilities ranged from 0.62 in Morocco to 0.86 in Singapore. The international median, 0.80, is the median of the reliability coefficients for all countries. Reliability coefficients among benchmarking participants were generally close to the international median, ranging from 0.82 to 0.86 across states, and from 0.77 to 0.85 across districts”. One example of the validity and reliability of TIMSS is the method of assessment for both fourth and eighth grades, which is the same across continents. “TIMSS provides an overall



mathematics scale score as well as content and cognitive domain scores at each grade level. The TIMSS mathematics scale runs from 0 to 1,000, and the international mean score is set at 500, with a standard deviation of 100” (2007).
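TIMSS itself places scores on this metric through item response theory scaling with plausible values; the sketch below shows only the simpler underlying idea, a linear transformation to a mean of 500 and a standard deviation of 100, applied to made-up raw scores rather than real TIMSS data or the operational procedure.

```python
from statistics import mean, pstdev

def to_reporting_scale(raw_scores, target_mean=500.0, target_sd=100.0):
    """Linearly rescale raw scores so the group mean is target_mean
    and the standard deviation is target_sd.  TIMSS's actual scaling
    is IRT-based; this illustrates only the linear rescaling step."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]

scaled = to_reporting_scale([12, 18, 25, 31, 40])  # hypothetical raw scores
print([round(x) for x in scaled])                  # [365, 426, 498, 559, 651]
```

After rescaling, the group of scores has mean 500 and standard deviation 100 by construction, which is what makes country results comparable on one reporting metric.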

Regarding test validity, the following questions were posed by TIMSS for comparative validity:

“Are curriculum coverage, instrument translations, and target populations comparable?

Were the sampling of populations and the scoring of constructed-response items conducted correctly?

Were the achievement tests administered appropriately?”

That evaluative judgments vary between teachers is an unquestionable assertion. Another assumption concerns the idea of validity, defined as “the accuracy of assessment-based interpretations”. What standardized measures lose in validity they gain in reliability. “Our concern for the reliability of teachers’ evaluative judgments must be tempered by the realization that particular classroom assessment tasks are vital to provide a comprehensive picture of student and school success”.

Comparative validity is a hallmark of TIMSS and is the main reason that international data have become accepted as an instrument of educational policy analysis. The question often raised in relation to international testing is whether these results have meaning. International testing programs such as TIMSS have been criticized for two reasons: first, “other nations have not tested as large a percentage of their student population causing their scores to be inflated; and second, our best students are among the world’s best, with our average being brought down by a large cohort of low-achievers”. The validity of the inferences depends on the qualities of the test: “The degree to which the test items adequately represent the construct and whether the number of items administered is enough to provide scores with sufficient reliability. In addition, the scores should be reasonably free from construct-irrelevant variance”. In simpler terms, observed scores should indicate the construct of interest and not be influenced by factors that are irrelevant to it.

“One important potential threat to the validity of score-based inferences is the degree of effort devoted by examinees to the test”. When an individual takes a test, the examiner assumes the individual will want to get items correct. But there are instances when the test taker does not try his best, which leads to a false underestimation of what the test taker can do: low effort produces a negatively biased estimate of the individual’s proficiency. “Whenever test-taking effort varies across examinees, there will be a differential biasing effect, which will introduce construct-irrelevant variance into the test score data”. A low test score can mean that the test taker has low proficiency, or it can mean that the test taker has higher proficiency but is not trying his best on the test. If personal consequences such as grades were affected by the test, then low effort would not be a major validity threat. Many measurements exist, however, where scores have an impact on the test givers but no impact on the test takers. An example of this is TIMSS.

There are two sides to everything; the sword is double-edged, and the TIMSS assessment is not exempt from there being cons to all the pros. Prais (2007) explains that “when international tests were first introduced nearly two generations ago, it was widely understood that their main objective was not, as it seems to have become today, to produce international league tables of countries’ schooling attainments, but to provide broader insight into the diverse factors leading to success in learning”. Countries commit large expenditures of money and time to participate in international assessments. Participation raises a country’s educational profile and shows its level of commitment to improved global education; the other side of the sword is that these large sums of money and time could instead be used to improve the educational systems themselves (Robertson, 2005).

Much research has been completed presenting important points to consider when determining the reliability and validity of the TIMSS assessment. Robertson (2005) states that “overall findings from past international studies have been accused of overshadowing particular aspects of each country’s performance and clouding rather than clarifying issues relevant to the implementation of policies”. We are going to look at some of the cons of the TIMSS assessment and at the extent to which the implementation of such an assessment in the 56 United States educational systems could be affected by the same factors as in the European countries. These factors include student age, baseline data, motivation, and curriculum mismatch, including cultural differences and translation of the assessment. One factor found to create problems in interpreting scores with validity is the age of the students when they start school. It is important to understand what age the students were when they began school. Start times for school vary within and between countries, which can make interpreting the assessment scores difficult. As well as knowing the age of the students, Tymms, Merrell, and Jones (2004) maintain that baseline data are needed to

interpret the data properly. The TIMSS assessment only assesses students in the middle and at the end of their school careers. These testing practices measure the student’s level at the particular time of the assessment rather than measuring student progress over time.

This same consideration applies when considering a nationwide implementation of a TIMSS-like assessment in the United States. As it stands now, the age at which a child can start school varies across the country. Many states implement Head Start and pre-K programs to create opportunities to get children ready for school. Since these are not mandated, attendance is voluntary, and not all students glean the benefits of attending. This leaves students entering kindergarten at varying levels of proficiency.

What could be considered the primary factor in the validity, reliability, and differences in assessment scores is a curriculum mismatch. This mismatch includes, but is not limited to, curriculum subject matter, translation or academic vocabulary, unfamiliar context and/or cultural context, and item formats. Curriculum mismatch has been determined to be the most serious concern for the validity of international testing. How well an assessment measure matches the curriculum of a country will determine the success of the individual countries participating in the assessment. The translation of the TIMSS test has led to poor assessment scores for some students. Even though the test goes through rigorous translation practices, the vocabulary used in the context of the test questions proved difficult for some students. Item formatting has also posed a problem: while a format might be easily recognized by some students, it was not comprehensible to others. Cultural differences also

created problems with assessments. When comparing the different countries, it is important to understand their differences. It was found that the cultural emphasis placed on education had a direct correlation with students’ success. These differences also carry into test item interpretation and successful answering. These same assessment considerations apply in America. While the United States’ historical data could offer some homogeneous research findings, our current classroom demographics and research findings are widely varied. U.S. schools are filled with students of different socio-economic as well as cultural backgrounds. These differences imply a wide range of possibilities and limitations for students when we implement a blanket test without consideration for them. They include the language background, cultural background, life experiences, and educational background of the second-language learners coming to the U.S. The United States could be compared to a small world: there are populations of people represented here from all over the world. So while specific countries have their own issues with TIMSS, the U.S. faces all of these problems within its own borders. The admirable goal of universal success, which is implicit in the No Child Left Behind requirements, is simply not realistic (Holland, 2009).

Lastly, motivation and low-stakes testing go hand in hand in affecting assessment results. Low-stakes testing is testing implemented for the purpose of collecting data; the students do not get any feedback on their performance, and their scores have no impact on their educational experience. The low-stakes nature of TIMSS causes underachievement among its assessment candidates. This lack of motivation and resulting low achievement can produce a biased estimate of the student’s skill, creating a risk to the validity of the test’s outcome (Eklof, 2007). This is an area that lacks research and a place in the TIMSS test battery. Motivation is a human characteristic that knows no national boundaries. No matter where or what kind of assessment is administered, the test taker’s level of motivation will affect the score. If we were to implement a low-stakes test at this stage in the game in the U.S., it would provide some interesting results. With our students now exposed to high-stakes testing, a low-stakes test would probably create the same anxieties as the state competency tests do. The nature of the test and its ramifications would need to be fully explained to the students.

In the present literature, the TIMSS project is most often

the students. In present literature the TIMSS project is most often

valued due to its “rich comparative data about educational systems, curriculums, school

characteristics, and instructional practices. One strength of TIMSS is its attempt to link

student performance to school improvement” (Rotherg, 1998). However, Rotherg poses

several concerns about the validity of how the scores are ranked “because countries

differ substantially in such factors as student selectivity, curriculum emphasis, and the

proportion of low-income students in the test-taking population” (p.1030). “Evidence has

been produced to justify concerns at the secondary level (Bracey, 2000; Rotberg, 1998).

At the primary and middle school level researchers monitored the interpretation of

TIMSS results”. Wang (2001) suggested that “one researcher believed that since TIMSS

was not a controlled scientific study and did not measure the effectiveness of one

teaching method against another”, the findings could not support certain reforms at a

local school. Another issue that is a concern is that TIMSS failed to examine results of

diverse population.

America actually compares favorably with other nations, especially considering that 25% of the population is of under-performing African and Latino descent. The top nations are all East Asian. This study does not break down Americans by race; if it did, Asian Americans would likely score as high as Asians in their home countries, and Whites would be near the top of the European nations. (Hu, 2000, p. 8)

Several technical problems became apparent in the TIMSS database. Wang (2001) found additional technical problems that can skew the comparative results and undercut the reliability of the TIMSS benchmarking. Wang (2001) identifies several disadvantages of TIMSS: (1) “The Format of TIMSS Test Items Is Not Consistent With the Goals of Some Education Reforms”; (2) “The TIMSS Test Booklets Have Discrepant Structures”; (3) “Because of Grade-Level Differences and Content-Differences Among Countries, the TIMSS Tests Might Not Align With What Students Have Learned”; and (4) “Several Problematic Age Outliers in the TIMSS Database Are Not Adequately Explained”.

The formats of the test items have been found to be inconsistent with the goals of various school reforms. The “TIMSS test measures mostly lower learning outcomes by means of predominantly multiple choice format” (Lange, 1997, p. 3). The test items comprise 429 multiple-choice, 43 short-response, and 29 extended-response items (Lange, 1997). From the pool of questions, students were tested on subsets of questions that would not reflect outcomes arising from any reform initiative.

Regarding the test booklets, Gonzalez and Smith (1997) explained how the rotation of booklets arranged in various clusters could produce invalid results for students. Booklet eight and booklets one through seven were structured around different clusters: booklet eight focused on the breadth cluster, which could only benefit students who had studied the content in depth, whereas booklets one through seven contained more focused clusters. Because of this structural discrepancy, systematic errors can occur across test booklets.

TIMSS is usually administered to students in adjacent grades, such as third and fourth graders in the United States, which are considered primary grades. The adjacent grades resulted in grade gaps based on school experience: each grade level comes with different learning levels and school experience. Because of this difference in learning experience, Wang (2001) suggested that it is unclear whether any testing instrument could measure what students have learned at any grade level. Martin and Kelly (1997) pondered “whether the meaning of student population should be defined by chronological age or grade level as it relates to student achievement”. The authors believed that “age outliers in the TIMSS database were not adequately explained”. For example, the student population ranged from a 49.9-year-old seventh grader from a foreign country to a 10-year-old eighth grader from the United States. Since a student’s age is a factor in cognitive development, the age outliers are essential to address when analyzing data in studies such as TIMSS. Wang (2001) holds that when interpreting results of TIMSS, the test score component will come with problems of a technical nature.

In conclusion, TIMSS is a vital examination that gathers important information, makes it readily available to researchers worldwide, and may impact educational systems throughout the nations that participate. It is important that the science and mathematics instruments measure what they are designed to measure, and that the questionnaires be selected with care, scored, studied closely, and interpreted meaningfully. It is here that researchers who use TIMSS have a significant task to pursue. Therefore, on the strength of its content domains, cognitive domains, and assessment design, the “Trends in International Mathematics and Science Study (TIMSS)” has proven to be an assessment that is both valid and reliable among international educational assessments, but one that does have its drawbacks.

References

Eklof, H. (2007). Test-taking motivation and mathematics performance in TIMSS 2003. International Journal of Testing, 7, 311-326.

Gonzalez, E. J., & Smith, T. A. (1997). User guide for the TIMSS international database. Chestnut Hill, MA: TIMSS International Study Center.

Gonzales, P., Williams, T., & Jocelyn, L. (2008). Highlights from TIMSS 2007: Mathematics and science achievement of U.S. fourth- and eighth-grade students in an international context. National Center for Education Statistics.

Hu, A. (2000). TIMSS: Arthur Hu’s index. www.leconsulting.com/arthurhu/index/timss.htm

Hussein, M.G. (1992). What does Kuwait want to learn from TIMSS? Prospects (22),
275-277.

International Association for the Evaluation of Educational Achievement (IEA). (2007). Trends in International Mathematics and Science Study (TIMSS).

Lange, J.D. (1997). Looking through the TIMSS mirror from a teaching angle.
http://www.enc.org/topics/timss/additional/documents

Martin, M. O., & Kelly, D. L. (1997). Technical report volume II: Implementation and analysis. Chestnut Hill, MA: TIMSS International Study Center.

Martin, M.O. (Ed.) (2003). TIMSS 2003 User Guide for the International Database.
TIMSS & PIRLS International Study Center, Lynch School of Education, Boston
College.

Martin, M.O. & Mullis, I.V.S (2006). TIMSS in perspective: Lessons learned from
IEA’s four decades of international mathematics assessments. TIMSS & PIRLS
International Study Center, Lynch School of Education, Boston College.

Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Gregory, K. D., Garden, R. A., O’Connor, K. M., Chrostowski, S. J., & Smith, T. A. (2000). TIMSS 1999 international mathematics report: Findings from IEA’s repeat of the Third International Mathematics and Science Study at the eighth grade. Chestnut Hill, MA: Boston College.

Rose, L.C. (1998). Who cares? And so what? Responses to the results of the third
international mathematics and science study, Phi Delta Kappan, 79(10), 722.

Rotberg, I. (1998). Interpretation of international test score comparisons. Science, 280, 1030-1031.

Wang, J. (2001). TIMSS primary and middle school data: Some technical concerns. Educational Researcher, 30, 17-21.

www.minniscoms.com.au/educationtoday/articles.php?articleid=150

www.mackinac.org/article.asps?ID=6998

http://timss.bc.edu/timss1999b/sciencebench_report/t99bscience_A.html

http://nces.ed.gov/tmiss/Results03.asp

http://timss.bc.edu/timss2003.html

http://timss.bc.edu/TIMSS2007/about.html

www.iea.nl/timss2007.html

www.asanet.org/footnotes/jan05/fn10.html

www.ed.gov/inits/Math/silver.html
