
UNIT 10: ASSESSING AND EVALUATING LANGUAGE COMPETENCE

1. Introduction – and some remarks on terminology

Testing can have a significant effect on teaching and learning – an effect which is known as
backwash or washback. Backwash can be harmful or beneficial. If a test is regarded as important,
then preparation for it can come to dominate all teaching and learning activities (a practice known as
'teaching to the test'), and if the test content and testing techniques are at variance with the objectives of the
course, then there is likely to be negative backwash.

An instance of this would be where students are following an English course which is meant to train
them in the language skills (including writing) necessary for university study in an English-
speaking country, but where the language test which they have to take in order to be admitted to a
university does not test those skills directly. If the skill of writing, for example, is tested only by
multiple choice items, then there is great pressure to practise such items rather than practise the skill
of writing itself. So, to avoid negative backwash, we have to test the skills the course actually asks for
and not others, since testing unrelated skills can frustrate and confuse students.

However, backwash need not always be harmful; indeed, it can be positive. One example
would be a test which includes tasks as similar as possible to those which candidates will
have to perform as undergraduates and with which they are familiar (reading textbook materials,
taking notes during lectures, and so on). Another example of beneficial backwash would be a final
oral exam, which encourages students to practise their listening and speaking skills throughout the course.

2. Test Criteria

2.1 Validity

A test is said to be valid if it measures accurately what it is intended to measure. A test is said to
have content validity if its content constitutes a representative sample of the language skills,
structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for
instance, must be made up of items testing knowledge or control of grammar. But this in itself does
not ensure content validity. The test would have content validity only if it included a proper sample
of the relevant structures. Just what are the relevant structures will depend, of course, upon the
purpose of the test. We would not expect an achievement test for intermediate learners to contain
just the same set of structures as one for advanced learners. In order to judge whether or not a test
has content validity, we need a specification of the skills or structures that it is meant to cover.
What is the importance of content validity?
1. The greater a test's content validity, the more likely it is to be an accurate measure of what it
is supposed to measure.
2. Conversely, a test which lacks content validity is likely to have a harmful backwash effect:
areas which are not tested are likely to become areas ignored in teaching and learning.

A test, part of a test, or a testing technique is said to have construct validity if it can be
demonstrated that it measures just the ability which it is supposed to measure (the 'construct':
an underlying ability or trait hypothesised in a theory of language ability). Construct validation
is a research activity, because it is the means by which theories are put to the test and are confirmed,
modified, or abandoned.

A test is said to have face validity if it seems to measure what it is supposed to measure. For
example, a test which pretended to measure pronunciation ability but which did not require the
candidate to speak (and there have been some) might be thought to lack face validity.

2.2 Reliability
A test may be called reliable if it measures consistently. On a reliable test you can be confident that
someone will get more or less the same score, whether they happen to take it on one particular day
or on the next; whereas on an unreliable test the score is quite likely to be considerably different,
depending on the day/hour on which it is taken. The more similar the scores would have been, the
more reliable the test is said to be. For example, on a reliable test a student's score should not change
much simply because the exam was taken at 3 p.m. rather than in the morning, or on Wednesday rather than
on Thursday.

However, human beings can react differently to an exam depending on certain factors, such as how they
happen to be feeling on the day. To keep a test reliable, we have to provide the best possible conditions
for its administration, score it so that the result does not depend on the judgement of a particular
teacher, and not simply assume that students will remember everything they have learnt.

It is possible to quantify the reliability of a test in the form of a reliability coefficient. A test with a
reliability coefficient of 1 is one which would give precisely the same results for a particular set of
candidates regardless of when it happened to be administered.
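
In practice no test reaches a coefficient of exactly 1, but the idea can be illustrated numerically. Below is a minimal Python sketch, estimating test-retest reliability as the correlation between the scores of the same candidates on two administrations; the score lists are invented purely for illustration:

```python
# Estimate test-retest reliability as the Pearson correlation between
# two administrations of the same test to the same candidates.
# The scores below are invented for illustration only.

from statistics import correlation  # available in Python 3.10+

scores_day1 = [72, 85, 60, 90, 78, 66, 81]  # first administration
scores_day2 = [70, 88, 63, 91, 75, 64, 83]  # second administration

r = correlation(scores_day1, scores_day2)
print(f"Estimated reliability coefficient: {r:.2f}")  # closer to 1 = more reliable
```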
2.3 Objectivity
If no judgement is required on the part of the scorer, then the scoring is objective. If judgement is
called for, the scoring is said to be subjective.
There are different degrees of subjectivity in testing. For example, a multiple choice test with the
correct responses unambiguously identified, would be a case of objective scoring. In contrast, the
marking of an essay inevitably depends on the scorer's judgement and is therefore subjective. In general,
the less subjective the scoring, the greater the agreement there will be between two different scorers.
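
Scorer agreement can itself be quantified. The sketch below uses Cohen's kappa, a standard agreement statistic (an assumption of this illustration, not something named in the text above), applied to invented pass/fail decisions from two scorers:

```python
# Quantify agreement between two scorers with Cohen's kappa:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).
# The rating lists are invented for illustration only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    # Agreement expected if both scorers assigned labels at random
    # in their observed proportions.
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 1.0 would be perfect agreement
```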

2.4 Practicality / Economy


A test may be called economic if its design, its administration, and its grading can be accomplished
in a reasonable time. What is reasonable for designing and grading a test may depend to some
extent on the individual teacher, but the time constraints on a test's administration are often
institutional. This points to a number of other institutional constraints which have to be taken into
consideration when deciding whether a test is practical, such as the financial means available.

3. Kinds of tests and testing

3.1 Proficiency tests


Proficiency tests are designed to measure people's ability in a language regardless of any training
they may have had in that language. The content of a proficiency test, therefore, is not based on the
content or objectives of language courses which people taking the test may have followed. Rather, it
is based on a specification of what candidates have to be able to do in the language in order to
be considered proficient. This raises the question of what we mean by the word 'proficient'.

In the case of some proficiency tests, 'proficient' means having sufficient command of the language
for a particular purpose.

Examples:
• A test used to determine whether a student's English is good enough to follow a course of
study at a British university.
• British examples of these would be the Cambridge examinations (First Certificate
Examination and Proficiency Examination) and the Oxford EFL examinations (Preliminary
and Higher). The function of these tests is to show whether candidates have reached a
certain standard with respect to certain specified abilities.
All proficiency tests have in common the fact that they are not based on courses that candidates
may have previously taken. On the other hand, such tests may themselves exercise considerable
influence over the method and content of language courses. Their backwash effect may be
beneficial or harmful.

3.2 Achievement tests


In contrast to proficiency tests, achievement tests are directly related to language courses, their
purpose being to establish how successful individual students, groups of students, or the courses
themselves have been in achieving objectives.

They are of two kinds:


1) Final achievement tests: are those administered at the end of a course of study. Clearly the
content of these tests must be related to the courses with which they are concerned.
2) Progress achievement tests are intended to measure the progress that students are making.
Such tests need not form part of formal assessment procedures, but they are useful for
students in evaluating their own progress.

3.3 Diagnostic tests


Diagnostic tests are used to identify students' strengths and weaknesses. For skills such as writing
or reading, we may be able to go quite far by analysing samples of a student's performance. It is not so
easy to obtain a detailed analysis of a student's command of grammatical structures, something
which would tell us, for example, whether she or he had mastered the present perfect/past tense
distinction in English. In order to be sure of this, we would need a number of examples of the
choice the student made between the two structures in every different context which we thought
was significantly different and important enough to warrant obtaining information on. A single
example of each would not be enough, since a student might give the correct response by chance.
For this reason, grammar is difficult to test in the detail that diagnosis requires.
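
To see why a single example is not enough, the arithmetic of guessing can be made explicit. A minimal Python sketch, assuming a two-way grammatical choice (e.g. present perfect vs. past simple) and arbitrary, invented item counts:

```python
# Probability that a student with no real command of the structure
# answers every item correctly by guessing, for a two-way choice.
# The item counts are arbitrary, for illustration only.
for n_items in (1, 3, 5):
    p_all_correct_by_chance = 0.5 ** n_items
    print(f"{n_items} item(s): {p_all_correct_by_chance:.1%} chance of a perfect score by luck")
```

With a single item, a student has a 50% chance of looking as if they have mastered the distinction; with five independent items, luck alone explains a perfect score only about 3% of the time.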
