
LANGUAGE TESTING AND EVALUATION

WHY? WHAT? HOW? WHEN?


Associate Professor Titela Vilceanu, PhD
Teachers usually understand a great deal about the knowledge, abilities and skills
of the learners in their classroom without the need to resort to formal tests. Over periods
of time they have the opportunity to observe learners participate in a wide range of
activities and tasks, working on their own and in groups, developing their ability to
communicate with others. In the classroom the context is the learning environment,
constructed of sets of learning experiences that are designed to lead to the acquisition of
language and communication. This context is not construct-irrelevant, but directly
relevant to the assessment of the learners. How well they are progressing can be assessed
only in relation to their involvement with the context and the others with whom they
interact in the process of learning.
In a traditional large-scale language test, learners may spend anything between one and five hours responding to a large number of tasks and test items, sometimes broken down into different papers, each labelled by a skill such as reading or listening. It has become accepted that the more tasks or items a test contains, the more reliable and valid it is likely to be; further, the response to each item or task must be independent of the responses to other items or tasks. The temptation for classroom teachers is to try to copy these task and item types, especially when there is an institutional requirement for a record of progress. Yet teachers almost never design tasks that are totally independent of everything else in the learning environment, whereas the formal test needs to be as long as possible in order to collect many pieces of evidence about a learner in a short period of time.
Types of assessment
1. Diagnostic testing - it is not restricted to the beginning of a course / the school year; it can also take place whenever the teacher aims to identify the needs for remedial work.

2. Continuous assessment - it takes place during different stages of the course (ongoing) to monitor learners' progress.
3. Formative assessment - the teacher is familiar with each and every learner, and
as we have seen above, can draw on a much wider range of evidence that informs
judgments about ability. The teacher interacts with each learner, and the purpose of the
interaction is not to assess in the most neutral, non-invasive way possible. Rather, it is to
assess the current abilities of the learner in order to decide what to do next, so that further
learning can take place. In traditional terminology, this makes classroom assessment
formative.
4. Summative assessment - it is conducted at the end of a programme of study to assess whether and how far individuals or groups have been successful.
5. Proficiency testing - it is intended to measure the taker's level of proficiency with no reference to any particular course.
Controlling factors in testing and evaluation (principles)
1. Reliability - the desired consistency (or reproducibility) of test scores is called
reliability. Whenever a test is administered, the test user would like some assurance that
the results could be replicated if the same individuals were tested again under similar
circumstances. It rests on four assumptions:

Stability: the abilities of the test takers will not change dramatically over short periods of time.

Discrimination: tests are constructed in such a way that they discriminate as well as possible between higher-ability and lower-ability test takers. However, in the classroom the teacher does not often wish to separate out all individuals and rank-order them, as this serves no pedagogic purpose. Rather, the teacher wishes to know whether an individual has achieved the goals of the syllabus and can move on to learn new material, or whether it is necessary to recycle previous material.

Test length: the more items or tasks a test includes, the higher the reliability coefficient will be (see the sketch following this list). In the classroom there is very little reason to spend many hours having learners take long tests, because teachers are constantly collecting evidence about their progress. Some task types, usually involving performance, take extended periods of time, and yet these still only count as one task - one piece of evidence - when calculating reliability.

Homogeneity: the items are related, or correlated, to each other, so that each piece of information contributes independently to the test score, and the test score is the best possible representation of the knowledge, ability or skills of the test taker.
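
The sketch below (in Python) makes the test-length and homogeneity points concrete with two standard calculations: Cronbach's alpha, an internal-consistency estimate of reliability, and the Spearman-Brown prophecy formula, which predicts how the reliability coefficient rises when a test is lengthened. The item scores and all numbers are invented for illustration only.

# Illustrative reliability calculations; the item scores are invented.
# Rows: five test takers; columns: four items (1 = correct, 0 = incorrect).
items = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """Internal-consistency (homogeneity) estimate of reliability."""
    k = len(rows[0])  # number of items
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])  # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

alpha = cronbach_alpha(items)
print(f"alpha for the 4-item test: {alpha:.2f}")                       # ~0.52
print(f"predicted alpha for 8 items: {spearman_brown(alpha, 2):.2f}")  # ~0.68

Doubling the hypothetical test raises the predicted coefficient, which is exactly the test-length effect described above; in the classroom, the same evidence accrues over time without long test sessions.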
2. Validity - refers to the meaningfulness and appropriateness of the interpretations made on the basis of test scores.
3. Authenticity - test tasks should reflect the test takers' performance in real-life situations, i.e. what learners can do in the L2.
4. Interactiveness - the extent and type of involvement of the test taker's individual characteristics in accomplishing the test tasks (learning style, amount and quality of knowledge, strategic competence, etc.).
5. Practicality - the resources (human and material) required for test design and implementation. In other words, it is a question of whether the test can be developed and used at all.
Techniques for alternative assessment (also labelled informal assessment): structured interviews, observing and recording, unstructured interviews, creating portfolios, open group discussions, completing self-reflection forms, brainstorming groups, writing essays to single prompts, keeping a journal, recording peer evaluation, open questions, etc.
Integrative tests vs. discrete-point tests - although it has been argued that integrative-type tests must be used to measure communicative competence, it seems that discrete-point tests will also be used in the communicative approach. This is because such tests may be more effective than integrative tests in making the learner aware of, and in assessing the learner's control of, the separate components and elements of communicative competence. This type of test also seems easier to administer and score reliably than a more integrative type of test. For example, a test designed to assess grammatical accuracy might be considered to have more of a discrete-point orientation.

The washback effect - the term describes the effect that tests have on what goes on in the classroom, on the teaching and learning processes. For many years it was simply assumed that "good" tests would produce good washback and, conversely, that "bad" tests would produce bad washback. More recently, this assumption has been challenged, as "bad" tests can produce good effects: teachers and learners do good things they would not otherwise do, for example prepare lessons more thoroughly, do their homework, take the subject being tested more seriously, and so on.
Criterion-referenced testing (CRT) vs. norm-referenced testing (NRT)
CRT represents a test that measures knowledge, skill or ability in a specific domain.
Performance is usually measured against some existing criterion level of performance
(successful performance standards), above which the test taker is deemed to have
achieved mastery. NRT represents a test in which the score of any individual is
interpreted in relation to the scores of other test takers. Once the test is used in live
testing, any score is interpreted in terms of where it falls on the curve of normal
distribution established in the norm-setting study.
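
The contrast between the two interpretations can be shown in a short Python sketch; the raw score, cut score, norm mean and standard deviation below are invented for illustration, not taken from any real norming study.

# Illustrative sketch: the same raw score interpreted two ways.
from statistics import NormalDist

raw_score = 72  # hypothetical test taker's score

# Criterion-referenced interpretation: compare against a fixed mastery standard.
CUT_SCORE = 70  # hypothetical criterion level of performance
crt_verdict = "mastery" if raw_score >= CUT_SCORE else "non-mastery"

# Norm-referenced interpretation: locate the score on the normal distribution
# established in the (hypothetical) norm-setting study.
NORM_MEAN, NORM_SD = 65.0, 10.0
percentile = NormalDist(NORM_MEAN, NORM_SD).cdf(raw_score) * 100

print(f"CRT: {raw_score} vs cut score {CUT_SCORE} -> {crt_verdict}")
print(f"NRT: {raw_score} falls at about the {percentile:.0f}th percentile of the norm group")

Under CRT the verdict depends only on the fixed standard; under NRT the same score of 72 is read relative to the norm group, here landing at roughly the 76th percentile.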
