
William Molnar, Vincent Gordon, Kindra Jones, and Maria Newton-Tabon


The following response is a collaborative effort of

Vincent Gordon, Kindra Jones, William Molnar, and Maria Newton-Tabon

IEA’s Trends in International Mathematics and Science Study (TIMSS) provides a wealth of information about students’ science and mathematics achievement in an international framework. TIMSS tests students in grades four and eight and also gathers a wide range of data from their schools and teachers about curriculum and instruction in mathematics and science. TIMSS findings have been used by many countries around the world in their efforts to develop better methods of teaching science and mathematics. Involving more than 60 countries, “TIMSS 2007 is the most recent in the four-year cycle of studies to measure trends in students’ mathematics and science achievement”. The first TIMSS, in 1995, involved 41 countries; the second, in 1999, involved 38; and TIMSS 2003 included more than 50. The majority of countries participating in TIMSS 2007 will have data going back to 1995.

TIMSS Advanced assesses students in their final year of secondary school who have taken coursework in advanced mathematics and physics. Since the 1995 assessment, however, TIMSS has not assessed students who are nearing the end of high school. Recognizing the strong link between scientific competence and economic productivity, and given the relatively long period since the 1995 assessments, countries around the world have expressed interest in participating in TIMSS Advanced. They want internationally comparative data about the achievement of their students enrolled in advanced courses designed to lead into science-oriented programs in university. By joining TIMSS Advanced, “countries that participated in 1995 can determine whether the achievement

of students having taken advanced coursework has changed over time. Countries participating in TIMSS Advanced for the first time can assess their comparative standing in mathematics and physics in an international context”.

TIMSS uses the curriculum, broadly defined, as “the major organizing concept in considering how educational opportunities are provided to students, and the factors that influence how students use these opportunities”. To begin the process of defining the topics to be assessed in TIMSS Advanced, this document built on the composition of the 1995 Advanced Mathematics and Physics assessments to draft a framework for TIMSS Advanced 2008. The description of cognitive domains also benefited from the TIMSS developmental project, funded by a number of countries, to enable reporting TIMSS 2007 results according to cognitive domains. The first draft of this document was thoroughly reviewed by participating countries and updated accordingly. Countries provided comments about the topics incorporated in their advanced mathematics and physics courses, and made recommendations about the desirability and suitability of assessing particular topics. TIMSS, including TIMSS Advanced, is a major undertaking of the IEA, which has taken full responsibility for the management of the project. The “TIMSS International Study Center” coordinates with the IEA Secretariat in Amsterdam on translation, the “IEA Data Processing Center” in Germany on construction of the documents for the database, “Statistics Canada” on sampling, and “Educational Testing Service in New Jersey on the psychometric scaling of the data”.

The question that needs to be answered is whether it is appropriate to use TIMSS to evaluate the 56 differing state or state-like education entities in the United States. This

author chooses to focus on the pros of validity and reliability with regard to this particular assessment. Based on research, this assessment is reliable and valid due to the content domains, which “define the specific mathematics subject matter covered by the assessment, and the cognitive domains which define the sets of behaviors expected of students as they engage with the mathematics content. The cognitive domains of mathematics and science are defined by the same three sets of expected behaviors: knowing, applying, and reasoning”. In other words, although other variables could factor into why this particular test may not be valid or reliable, the common factors that span continents and are shared by the age groups tested (fourth and eighth graders) are “the mathematical topics or content that students are expected to learn and the cognitive skills that students are expected to have developed” (2007).

The IEA developed TIMSS to compare educational achievement around the globe. TIMSS began in the 1990s with a desire to conduct international studies of students within the same age or grade bracket. It was believed that mathematics and science education would be essential for economic development in the technological world of the future. The break-up of the Soviet Union brought new countries wanting to participate in this study so that its data could guide their educational systems.

TIMSS combined a measurement of science and mathematics achievement with questionnaires for students and teachers. The measurement covered topics in science and mathematics that students should have encountered by grades 4 and 8. The questionnaires were used to collect information on students’ backgrounds, their attitudes, and their beliefs

about school and the learning process. The school and teacher questionnaires looked at class scheduling of science and mathematics coverage, school policies, and the educational backgrounds of the teachers and their preparation.

A summated rating scale was used for measurement, as is common practice in the social sciences. The summated rating scale showed validity and reliability for the sample on which it was used. Summated rating scales were derived for each construct. One construct pertained to students’ self-interest in mathematics and the belief that motivation plays a vital role in predicting a student’s present and future achievement. The opportunity for students in both the fourth and eighth grades across continents to do well is fairly and consistently assessed, and scores by subject and grade are comparable over time (2007). When TIMSS was created, the overall scale had a mean of 500 based on the countries that took part in the testing. This testing of both grades four and eight began in 1995 and continued through 2007. “Successive TIMSS assessments since 1995, 1999, 2003, and 2007 have scaled the achievement data so that scores are equivalent from assessment to assessment” (2007). The cognitive domains form the foundation of the “Trends in International Mathematics and Science Study (TIMSS)” assessment.

In addition to assessing proficiency in science and mathematics, TIMSS collected data related to the teachers, the students, and the schools. This information was vital to help researchers understand the performance of students in their own country. A summated rating scale was developed because divulging information item by item is almost impossible. TIMSS called this scale a multi-item indicator.
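The mechanics of a summated rating scale are simple: a student’s responses to the items measuring one construct are summed into a single composite score, and the scale’s internal-consistency reliability can then be checked with a statistic such as Cronbach’s alpha. The sketch below is only an illustration of that idea, using hypothetical Likert-type responses rather than actual TIMSS data.

```python
from statistics import pvariance

def summated_score(responses):
    """Sum each student's item responses into one composite scale score."""
    return [sum(items) for items in responses]

def cronbach_alpha(responses):
    """Internal-consistency reliability of a summated scale.

    responses: one list of item scores per student.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])                 # number of items
    item_columns = list(zip(*responses))  # transpose to per-item columns
    item_vars = sum(pvariance(col) for col in item_columns)
    total_var = pvariance(summated_score(responses))
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 4-point Likert responses: 6 students x 4 items
data = [[4, 3, 4, 4], [2, 2, 1, 2], [3, 3, 3, 4],
        [1, 2, 1, 1], [4, 4, 3, 4], [2, 3, 2, 2]]
print(summated_score(data))            # [15, 7, 13, 5, 15, 9]
print(round(cronbach_alpha(data), 2))  # 0.95
```

A coefficient in this neighborhood indicates that the items hang together well enough to be reported as a single multi-item indicator.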

There are advantages to combining several items into one variable, such as “improved measurement precision, scope, and validity”. When a huge amount of information is collected and reported, a summated rating scale reduces the volume of data and makes it easier for the public to digest. “For the advantages of summated rating scales over single item measures to hold, it is important that the scale can show reasonable score reliability and validity as an indicator of a latent variable for the sample on which the scale is used”. TIMSS was created through a collaboration of participating countries, whose curriculum, measurement, and educational experts created the item pools, assessment frameworks, and questionnaires. TIMSS is grounded in school curricula and is designed to examine the provision of educational opportunities to students. TIMSS investigates two levels: “the intended curriculum and the implemented curriculum”. The “intended curriculum” is the science and mathematics that society expects students to learn, along with how the educational system is organized to reach that goal. The “implemented curriculum” is the content taught in class, how it is taught, and who teaches it. The 2003 assessment of student achievement in mathematics and science had ambitious coverage goals, reporting not only overall science and mathematics achievement scores but also scores in important content areas within these subjects. The mathematical topical or content domains (as they are referred to in TIMSS) covered in the fourth grade are “numbers, geometric shapes, measures, and data display. In the eighth grade, the content domains are numbers, algebra, geometry, data and chance. The cognitive domains in each grade are knowing, applying, and reasoning” (2007). The five domains in science are “life science, chemistry, physics, earth science, and

environmental science that defined the specific science subject matter covered by the assessment. The cognitive domains, four in mathematics (knowing facts and procedures, using concepts, solving routine problems, and reasoning) and three in science (factual knowledge, conceptual understanding, and reasoning and analysis) defined the sets of behaviors expected of students as they engaged with the mathematics and science content”. Students’ achievement was reported in terms of performance in each content area as well as in science and mathematics overall.

To give the US the opportunity to measure itself against the high-performing students of other countries, the “International Study Center at Boston College”, “The National Science Foundation”, and “The Center for Education Statistics” established the TIMSS 1999 Benchmarking study. The TIMSS achievement tests were given to students in spring 1999 in conjunction with the administration of TIMSS in other countries. “Participation in TIMSS benchmarking was intended to help states and districts understand their comparative educational standing, assess the rigor and effectiveness of their own mathematics and science programs in an international context, and improve the teaching and learning of mathematics and science”.

With regard to test reliability, statistically “the median reliabilities ranged from 0.62 in Morocco to 0.86 in Singapore. The international median, 0.80, is the median of the reliability coefficients for all countries. Reliability coefficients among benchmarking participants were generally close to the international median, ranging from 0.82 to 0.86 across states, and from 0.77 to 0.85 across districts”. One example of the validity and reliability of TIMSS is the method of assessment for both fourth and eighth grades, which is the same across continents. “TIMSS provides an overall



mathematics scale score as well as content and cognitive domain scores at each grade level. The TIMSS mathematics scale runs from 0 to 1,000, and the international mean score is set at 500, with a standard deviation of 100” (2007).
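TIMSS itself places scores on this metric through item response theory scaling with plausible values; the sketch below shows only the simpler underlying idea, a linear transformation to a mean of 500 and a standard deviation of 100, applied to made-up raw scores rather than real TIMSS data or the operational procedure.

```python
from statistics import mean, pstdev

def to_reporting_scale(raw_scores, target_mean=500.0, target_sd=100.0):
    """Linearly rescale raw scores so the group mean is target_mean
    and the standard deviation is target_sd.  TIMSS's actual scaling
    is IRT-based; this illustrates only the linear rescaling step."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]

scaled = to_reporting_scale([12, 18, 25, 31, 40])  # hypothetical raw scores
print([round(x) for x in scaled])                  # [365, 426, 498, 559, 651]
```

After rescaling, the group of scores has mean 500 and standard deviation 100 by construction, which is what makes country results comparable on one reporting metric.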

Regarding test validity, the following questions were posed by TIMSS for comparative validity:

“Are curriculum coverage, instrument translations, and target populations comparable?

Were the sampling of populations and the scoring of constructed-response items conducted correctly?

Were the achievement tests administered appropriately?”

That evaluative judgments vary between teachers is an unquestionable assertion. Another assumption concerns the idea of validity, defined as “the accuracy of assessment-based interpretations”. What standardized measures lose in validity they gain in reliability. “Our concern for the reliability of teachers’ evaluative judgments must be tempered by the realization that particular classroom assessment tasks are vital to provide a comprehensive picture of student and school success”.

Comparative validity is a hallmark of TIMSS and is the main reason that international data have become accepted as an instrument of educational policy analysis. The question often raised in relation to international testing is whether these results have meaning. International testing programs such as TIMSS have been criticized for two reasons: first, “other nations have not tested as large a percentage of their student population causing their scores to be inflated; and second, our best students are among the world’s best, with our average being brought down by a large cohort of low-achievers”. The validity of the inferences depends on the qualities of the test: “The degree to which the test items adequately represent the construct and whether the number of items administered is enough to provide scores with sufficient reliability. In addition, the scores should be reasonably free from construct-irrelevant variance”. In simpler terms, observed scores should indicate the construct of interest and not be influenced by factors that are irrelevant to it.

“One important potential threat to the validity of score-based inferences is the degree of effort devoted by examinees to the test”. When an individual takes a test, the examiner assumes the individual will want to get items correct. But there are instances when the test taker does not try his best, which leads to a false underestimation of what the test taker can do: low effort produces a negatively biased estimate of the individual’s proficiency. “Whenever test-taking effort varies across examinees, there will be a differential biasing effect, which will introduce construct-irrelevant variance into the test score data”. A low test score can mean that the test taker has low proficiency, or it can mean that the test taker has higher proficiency but is not trying his best on the test. If personal consequences such as grades were affected by the test, then low effort would not be a major validity threat. Many measurements exist, however, where scores have an impact on the test givers but no impact on the test takers. An example of this is TIMSS.

There are two sides to everything; the sword is double-edged, and the TIMSS assessment is not exempt from there being cons to all the pros. Prais (2007) explains that “when international tests were first introduced nearly two generations ago, it was widely understood that their main objective was not, as it seems to have become today, to produce international league tables of countries’ schooling attainments, but to provide broader insight into the diverse factors leading to success in learning”. Countries commit large expenditures of money and time to participate in international assessments. Participation raises a country’s educational profile and shows its level of commitment to improved global education; the other side of the sword is that these large sums of money and time could instead be used to improve the educational systems themselves (Robertson, 2005).

Much research has been completed presenting important points to consider when determining the reliability and validity of the TIMSS assessment. Robertson (2005) states that “overall findings from past international studies have been accused of overshadowing particular aspects of each country’s performance and clouding rather than clarifying issues relevant to the implementation of policies”. We are going to look at some of the cons of the TIMSS assessment and at the extent to which the implementation of such an assessment in the 56 United States educational systems could be affected by the same factors as in the European countries. These factors include student age, baseline data, motivation, and curriculum mismatch, including cultural differences and translation of the assessment. One factor found to create problems in interpreting scores with validity is the age of the students when they start school. It is important to understand what age the students were when they began school. Start times for school vary within and between countries, which can make interpreting the assessment scores difficult. As well as knowing the age of the students, Tymms, Merrell, and Jones (2004) maintain that baseline data are needed to

interpret the data properly. The TIMSS assessment only assesses students in the middle and at the end of their school careers. These testing practices measure the student’s level at the particular time of the assessment rather than measuring student progress over time.

This same consideration applies when considering a nationwide implementation of a TIMSS-like assessment in the United States. As it stands now, the age at which a child can start school varies across the country. Many states implement Head Start and pre-K programs to create opportunities to get children ready for school. Since these are not mandated, attendance is voluntary, and not all students glean the benefits of attending. This leaves students entering kindergarten at varying levels of proficiency.

What could be considered the primary factor in the validity, reliability, and differences in assessment scores is a curriculum mismatch. This mismatch includes, but is not limited to, curriculum subject matter, translation or academic vocabulary, unfamiliar context and/or cultural context, and item formats. Curriculum mismatch has been determined to be the most serious concern for the validity of international testing. How well an assessment measure matches the curriculum of a country will determine the success of the individual countries participating in the assessment. The translation of the TIMSS test has led to poor assessment scores for some students. Even though the test goes through rigorous translation practices, the vocabulary used in the context of the test questions proved difficult for some students. Item formatting has also posed a problem: while a format might be easily recognized by some students, it was not comprehensible to others. Cultural differences also

created problems with assessments. When comparing the different countries, it is important to understand their differences. It was found that the cultural emphasis placed on education had a direct correlation with students’ success. These differences also carry into test item interpretation and successful answering. These same assessment considerations apply in America. While the United States’ historical data could offer some homogeneous research findings, our current classroom demographics and research findings are widely varied. U.S. schools are filled with students of different socio-economic as well as cultural backgrounds. These differences imply a wide range of possibilities and limitations for students when we implement a blanket test without consideration for them. They include the language background, cultural background, life experiences, and educational background of the second-language learners coming to the U.S. The United States could be compared to a small world: there are populations of people represented here from all over the world. So while specific countries have their own issues with TIMSS, the U.S. faces all of these problems within its own borders. The admirable goal of universal success, which is implicit in the No Child Left Behind requirements, is simply not realistic (Holland, 2009).

Lastly, motivation and low-stakes testing go hand in hand in affecting assessment results. Low-stakes testing is testing implemented for the purpose of collecting data; the students do not get any feedback on their performance, and their scores have no impact on their educational experience. The low-stakes nature of TIMSS causes underachievement among its assessment candidates. This lack of motivation and resulting low achievement can produce a biased estimate of the student’s skill, creating a risk to the validity of the test’s outcome (Eklof, 2007). This is an area that lacks research and a place in the TIMSS test battery. Motivation is a human characteristic that knows no national boundaries. No matter where or what kind of assessment is administered, the test taker’s level of motivation will affect the score. If we were to implement a low-stakes test at this stage in the game in the U.S., it would provide some interesting results. With our students now exposed to high-stakes testing, a low-stakes test would probably create the same anxieties as the state competency tests do. The nature of the test and its ramifications would need to be fully explained to the students.

In the present literature, the TIMSS project is most often

the students. In present literature the TIMSS project is most often

valued due to its “rich comparative data about educational systems, curriculums, school

characteristics, and instructional practices. One strength of TIMSS is its attempt to link

student performance to school improvement” (Rotherg, 1998). However, Rotherg poses

several concerns about the validity of how the scores are ranked “because countries

differ substantially in such factors as student selectivity, curriculum emphasis, and the

proportion of low-income students in the test-taking population” (p.1030). “Evidence has

been produced to justify concerns at the secondary level (Bracey, 2000; Rotberg, 1998).

At the primary and middle school level researchers monitored the interpretation of

TIMSS results”. Wang (2001) suggested that “one researcher believed that since TIMSS

was not a controlled scientific study and did not measure the effectiveness of one

teaching method against another”, the findings could not support certain reforms at a

local school. Another issue that is a concern is that TIMSS failed to examine results of

diverse population.

America actually compares favorably with other nations, especially considering that 25% of the population is of under-performing African and Latino descent. The top nations are all East Asian. This study does not break down Americans by race; if it did, Asian Americans would likely score as high as Asians in their home countries, and Whites would be near the top of the European nations. (Hu, 2000, p. 8)

Several technical problems became apparent in the TIMSS database. Wang (2001) found additional technical problems that can skew the comparative results and undercut the reliability of the TIMSS benchmarking. Wang (2001) identifies several disadvantages of TIMSS: (1) “The Format of TIMSS Test Items Is Not Consistent With the Goals of Some Education Reforms”; (2) “The TIMSS Test Booklets Have Discrepant Structures”; (3) “Because of Grade-Level Differences and Content-Differences Among Countries, the TIMSS Tests Might Not Align With What Students Have Learned”; and (4) “Several Problematic Age Outliers in the TIMSS Database Are Not Adequately Explained”.

The formats of the test items have been found to be inconsistent with the goals of various school reforms. The “TIMSS test measures mostly lower learning outcomes by means of predominantly multiple choice format” (Lange, 1997, p. 3). The test items comprise 429 multiple-choice, 43 short-response, and 29 extended-response items (Lange, 1997). From the pool of questions, students were tested on subsets of questions that would not reflect outcomes arising from any reform initiative.

Regarding the test booklets, Gonzalez and Smith (1997) explained how the rotation of booklets arranged in various clusters could produce invalid results for students. Booklet eight and booklets one through seven were structured around different clusters: booklet eight focused on the breadth cluster, which could only benefit students who had studied the content in depth, whereas booklets one through seven contained more focused clusters. Because of this structural discrepancy, systematic errors can occur across test booklets.

TIMSS is usually administered to students in adjacent grades, such as third and fourth graders in the United States, which are considered primary grades. The adjacent grades resulted in grade gaps based on school experience: each grade level comes with different learning levels and school experience. Because of this difference in learning experience, Wang (2001) suggested that it is unclear whether any testing instrument could measure what students have learned at any grade level. Martin and Kelly (1997) pondered “whether the meaning of student population should be defined by chronological age or grade level as it relates to student achievement”. The authors believed that “age outliers in the TIMSS database were not adequately explained”. For example, the student population ranged from a 49.9-year-old seventh grader from a foreign country to a 10-year-old eighth grader from the United States. Since a student’s age is a factor in cognitive development, the age outliers are essential to address when analyzing data in studies such as TIMSS. Wang (2001) holds that when interpreting results of TIMSS, the test score component will come with problems of a technical nature.

In conclusion, TIMSS is a vital examination that gathers important information, makes it readily available to researchers worldwide, and may impact educational systems throughout the nations that participate. It is important that the science and mathematics instruments measure what they are designed to measure, and that the questionnaires be selected with care, scored, studied closely, and interpreted meaningfully. It is here that researchers who use TIMSS have a significant task to pursue. Therefore, on the strength of its content domains, cognitive domains, and assessment design, the “Trends in International Mathematics and Science Study (TIMSS)” has proven to be an assessment that is both valid and reliable among international educational assessments, but one that does have its drawbacks.

References

Eklof, H. (2007). Test-taking motivation and mathematics performance in TIMSS 2003. International Journal of Testing, 7, 311-326.

Gonzalez, E. J., & Smith, T. A. (1997). User guide for the TIMSS international database. Chestnut Hill, MA: TIMSS International Study Center.

Gonzales, P., Williams, T., & Jocelyn, L. (2008). Highlights from TIMSS 2007: Mathematics and science achievement of U.S. fourth- and eighth-grade students in an international context. National Center for Education Statistics.

Hu, A. (2000). TIMSS: Arthur Hu’s index. www.leconsulting.com/arthurhu/index/timss.htm

Hussein, M.G. (1992). What does Kuwait want to learn from TIMSS? Prospects (22),
275-277.

International Association for the Evaluation of Educational Achievement (IEA). (2007). Trends in International Mathematics and Science Study (TIMSS).

Lange, J.D. (1997). Looking through the TIMSS mirror from a teaching angle.
http://www.enc.org/topics/timss/additional/documents

Martin, M. O., & Kelly, D. L. (1997). Technical report volume II: Implementation and analysis. Chestnut Hill, MA: TIMSS International Study Center.

Martin, M.O. (Ed.) (2003). TIMSS 2003 User Guide for the International Database.
TIMSS & PIRLS International Study Center, Lynch School of Education, Boston
College.

Martin, M.O. & Mullis, I.V.S (2006). TIMSS in perspective: Lessons learned from
IEA’s four decades of international mathematics assessments. TIMSS & PIRLS
International Study Center, Lynch School of Education, Boston College.

Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Gregory, K. D., Garden, R. A., O’Connor, K. M., Chrostowski, S. J., & Smith, T. A. (2000). TIMSS 1999 international mathematics report: Findings from IEA’s repeat of the Third International Mathematics and Science Study at the eighth grade. Chestnut Hill, MA: Boston College.

Rose, L.C. (1998). Who cares? And so what? Responses to the results of the third
international mathematics and science study, Phi Delta Kappan, 79(10), 722.

Rotberg, I. (1998). Interpretation of international test score comparisons. Science, 280, 1030-1031.

Wang, J. (2001). TIMSS primary and middle school data: Some technical concerns. Educational Researcher, 30, 17-21.

www.minniscoms.com.au/educationtoday/articles.php?articleid=150

www.mackinac.org/article.asps?ID=6998

http://timss.bc.edu/timss1999b/sciencebench_report/t99bscience_A.html

http://nces.ed.gov/tmiss/Results03.asp

http://timss.bc.edu/timss2003.html

http://timss.bc.edu/TIMSS2007/about.html

www.iea.nl/timss2007.html

www.asanet.org/footnotes/jan05/fn10.html

www.ed.gov/inits/Math/silver.html
