
CHAPTER 2

PRINCIPLES OF TEST AND MEASUREMENTS

LEARNING OUTCOMES

Upon completion of this chapter, you should be able to:
1. Describe the different kinds of reliability in relation to testing;
2. Describe how reliability relates to testing and measurement; and
3. Describe the different kinds of validity.

INTRODUCTION

Tests need to be reliable, valid, and practical in order to be considered good tests capable of measuring the ability or knowledge we are interested in. Each of these requirements makes a test a more useful tool and the information it provides more precise. What do the three terms reliability, validity, and practicality refer to? This topic will try to answer this question.

2.1 RELIABILITY

Let's say that a student scores 35 points on a 50-point test of listening comprehension. How sure are we that this is actually the score the student should receive? One way to be more confident is to look at the reliability of the test. Reliability has to do with the consistency and accuracy of the measurement. It is reflected in several possible ways, including obtaining similar results when the measurement is repeated on different occasions or by different persons. Reliability is computed using some kind of correlation. Some of the more common ways of computing reliability are illustrated in Figure 2.1.

Figure 2.1: Ways of computing reliability (test-retest, parallel/equivalent forms, inter-rater reliability, intra-rater reliability, and split half reliability)
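All of these techniques ultimately rest on a correlation coefficient. As a concrete illustration, here is a minimal sketch in Python (not part of the original module; all scores are invented) of the Pearson correlation that underlies, for example, a test-retest or an inter-rater estimate:

    # Pearson product-moment correlation between two sets of scores, e.g.
    # two administrations of the same test (test-retest) or two raters
    # grading the same scripts (inter-rater). All scores are invented.
    def pearson_r(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    first_sitting = [35, 42, 28, 45, 38, 31, 40, 26]    # scores on day 1
    second_sitting = [37, 40, 30, 46, 36, 33, 41, 27]   # same test, weeks later
    print(f"test-retest reliability: {pearson_r(first_sitting, second_sitting):.2f}")

A coefficient close to 1 indicates that the measurement is consistent across the two occasions or raters.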


(a) Test-retest
In test-retest reliability, the same test is re-administered to the same people. The scores obtained on the first administration of the test are correlated with the scores obtained on the second administration. It is expected that the correlation between the two sets of scores will be high. However, a test-retest situation is somewhat difficult to set up, as it is unlikely that students will take the same test twice. The effects of practice and memory may also influence the correlation value.

(b) Parallel/Equivalent Forms
In this type of reliability measure, two similar tests are administered to the same sample of persons. Therefore, as in test-retest reliability, two scores are obtained. However, unlike test-retest, the parallel or equivalent forms reliability measure is protected from the influence of memory.

(c) Inter-rater Reliability
Inter-rater reliability involves two or more judges or raters. Scores on a test are independent estimates by these judges or raters. A score is a more reliable and accurate measure if two or more raters agree on it. The extent to which the raters agree determines the level of reliability of the score.

(d) Intra-rater Reliability
While inter-rater reliability involves two or more raters, intra-rater reliability is the consistency of grading by a single rater. Scores on a test are rated by a single rater or judge at different times. When we grade tests at different times, we may become inconsistent in our grading for various reasons. Some papers graded during the day may get our full and careful attention, while others graded towards the end of the day are quickly glossed over. As such, intra-rater reliability determines the consistency of our grading.

(e) Split Half Reliability
In a split half measure of reliability, a test is administered once to a group, divided into two equal halves after the students have returned it, and the two halves are then correlated. As the means of determining reliability is internal, within one administration of the test, this method of computing reliability is considered an internal consistency measure. Halves are often determined by the number assigned to each item, with one half consisting of the odd-numbered items and the other half the even-numbered items. A sketch of this computation follows.
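The following minimal sketch shows how a split-half estimate might be computed, assuming item-level scores of 1 (correct) or 0 (incorrect); the response matrix is invented, and statistics.correlation requires Python 3.10 or later. The Spearman-Brown step-up, which adjusts the half-test correlation to estimate full-length reliability, is standard practice though not named in the text above:

    from statistics import correlation

    # rows = students, columns = the ten items of one test administration
    responses = [
        [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
        [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
        [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
        [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
        [0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    ]

    # one half = odd-numbered items (indices 0, 2, ...), the other = even-numbered
    odd_half = [sum(row[0::2]) for row in responses]
    even_half = [sum(row[1::2]) for row in responses]

    r_half = correlation(odd_half, even_half)   # Pearson's r between the halves
    r_full = (2 * r_half) / (1 + r_half)        # Spearman-Brown step-up

    print(f"half-test correlation: {r_half:.2f}")
    print(f"split-half reliability (Spearman-Brown): {r_full:.2f}")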

What do you think determines the reliability of a test?


The five different measures of reliability mentioned inform us of the different techniques for computing reliability. We should now be aware that reliability concerns, first, the consistency of student performance; secondly, the teacher's grading; and thirdly, the test itself. The emphasis of the five different types of reliability with respect to these concerns is illustrated in Table 2.1.

Table 2.1: Focus, Number of Administrations, Graders, and Grading Sessions of Measures of Reliability

Type of Reliability          Focus                                   Tests or administrations   Graders   Grading sessions
Test-retest                  Student performance; the test itself    1 or 2                     1         2
Parallel/Equivalent forms    Student performance; the test itself    1 or 2                     1         2
Inter-rater                  Grading                                 1                          2         1 per grader
Intra-rater                  Grading                                 1                          1         2
Split half                   Test itself; student performance        1                          1         1

Table 2.1 describes the focus of each reliability measure, with the emphasis given to the first listed when there are two emphases. We can see from Table 2.1 that, of the five types of reliability measures, the split half is perhaps the most efficiently performed because it requires only one test administration, one grader, and one grading session.

2.1.1 ACCURACY AND ERROR

Notice that the discussion on reliability in the previous section has so far revolved around the notion of consistency. According to its definition, reliability is also an issue of accuracy. As such, it is also imperative that we examine the accuracy of a test so that we can be satisfied with its use as a measure of knowledge or ability. In this respect, it is important to remember that in any test, the obtained score is actually a mixture of the true score and some element of error. This can be represented by the following notation:

Obtained Score = True Score +/- Error

It would be extremely convenient if an obtained score were actually the student's true score. That is, if a student obtains a score of 75, it reflects his or her actual ability. Unfortunately, a student's observed or obtained score actually consists of his or her true score plus or minus some element of error.
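A toy simulation (an illustration, not part of the module) makes the notation concrete: one student's true score is held fixed, and a random error component is added at each hypothetical sitting, so the obtained score varies around the true score.

    import random

    random.seed(1)                        # fixed seed so the run is repeatable
    true_score = 75
    for sitting in range(5):
        error = random.gauss(0, 1.4)      # random error; the SD here is arbitrary
        obtained = true_score + error
        print(f"sitting {sitting + 1}: obtained = {obtained:.1f} (error = {error:+.1f})")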


This error may come from various sources: within the test taker, within the test, in the test administration, or even during scoring. Fatigue, illness, copying, or even the unintentional noticing of another student's answer all contribute to error from within the test taker. Some of these will reduce the value of the obtained score relative to the true score while others will increase it. For example, fatigue will cause the obtained score to be lower than the true score (i.e. Obtained Score = True Score - Error) while copying will cause the obtained score to be higher than the true score (i.e. Obtained Score = True Score + Error). Just as errors within the test taker affect the value of the obtained scores with respect to the true score, errors within the test, such as the use of faulty test items, a reading level that is too high, and faulty instructions, can do the same. Errors in test administration include the level of physical comfort during the test, the test administrator's attitude, as well as the use of faulty equipment such as a cassette recorder with poor sound quality in a listening test. Finally, errors in scoring are quite obvious, as graders can contribute to the error element if they lack adequate qualifications, do not follow instructions, or are themselves fatigued. In testing, we can use the Standard Error of Measurement (SEM) in order to estimate a student's true score. The formula is as follows:

SEM = SD × √(1 - r)

where r is the reliability of the test and SD is the standard deviation of the test scores. Using the normal curve presented earlier, you can estimate the student's true score to some degree of certainty based on the observed score and the Standard Error of Measurement. For example, let us take the obtained score of 75. Assuming that the standard deviation (SD) is 2.5 and the reliability is 0.7, the Standard Error of Measurement will be:

SEM = 2.5 × √(1 - 0.7) = 2.5 × √0.3 ≈ 2.5 × 0.55 ≈ 1.375

Therefore, based on the normal distribution curve, the student's true score is between:

75 - 1.375 and 75 + 1.375, or 73.625 and 76.375 (68% of the time): X +/- 1 SEM
75 - 2.75 and 75 + 2.75, or 72.25 and 77.75 (95% of the time): X +/- 2 SEM
75 - 4.125 and 75 + 4.125, or 70.875 and 79.125 (99% of the time): X +/- 3 SEM

Such information is helpful when important decisions need to be made on the basis of test scores and we would like to be fair to candidates by taking into account any error that may have occurred.
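The worked example can be checked in a few lines of Python. This sketch uses the exact square root, so the SEM comes out at about 1.369 rather than the 1.375 obtained above by rounding √0.3 to 0.55:

    from math import sqrt

    obtained = 75    # the student's observed score
    sd = 2.5         # standard deviation of the test scores
    r = 0.7          # reliability of the test

    sem = sd * sqrt(1 - r)          # about 1.369; the text rounds this to 1.375
    print(f"SEM = {sem:.3f}")

    for k, certainty in [(1, 68), (2, 95), (3, 99)]:
        low, high = obtained - k * sem, obtained + k * sem
        print(f"{certainty}% of the time: {low:.3f} to {high:.3f}")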


Do this simple exercise to assess your understanding of the concepts of accuracy and error: A student obtains a score of 63 in a test. The reliability of the test was calculated at 0.75. The standard deviation of the test is 0.5.

(a) If you wanted to select students who had scores of 65 and above, would you be willing to accept this student, assuming that there could be an error in his score? (Hint: Use 1 standard error of measure.)

(b) If you wanted to be at least 95% sure of the student's true score, what would be the range for the true score?

2.2 VALIDITY

The second characteristic of good tests is validity, which refers to whether the test is actually measuring what it claims to measure. This is important because we do not want to make claims concerning what a student can or cannot do based on a test when the test is actually measuring something else. Validity is usually determined logically, although several types of validity may use correlation coefficients.

The following are different types of validity:

(a) Face validity is validity which is determined impressionistically, for example by asking students whether the exam was appropriate to their expectations (Henning, 1987). It is important that a test looks like a test even at first impression. If students taking a test feel that the questions given to them are not a test or part of a test, then the test may not be valid, as the students may not seriously attempt to answer the questions. The test, therefore, will not be able to measure what it claims to measure.


(b) Construct validity refers to whether the underlying theoretical constructs that the test measures are themselves valid. Some authors consider construct validity to be the most critical type of validity. It is the most obvious reflection of whether a test measures what it is supposed to measure, as it directly addresses the issue of what it is that is being measured.

(c) Concurrent validity is the use of another more reputable and recognised test to validate one's own test. As it uses an external measure as a reference, concurrent validity is sometimes also referred to as criterion validity. For example, suppose you come up with your own new test and would like to determine its validity. You would look for a reputable test and compare your students' performance on your test with their performance on the reputable and acknowledged test. In concurrent validity, a correlation coefficient is obtained, giving an actual numerical value. A high positive correlation of 0.7 to 1 indicates that the learners' scores are relatively similar on the two tests or measures.

(d) Predictive validity is closely related to concurrent validity in that it too generates a numerical value. For example, the predictive validity of a university language placement test can be determined several semesters later by correlating the scores on the test with the GPAs of the students who took it. A test with high predictive validity is therefore one that yields predictable results in a later measure. A simple example of tests concerned with predictive validity is the trial national examinations conducted at schools in Malaysia, as they are intended to predict the students' performance on the actual SPM national examinations.

(e) Content validity is concerned with whether or not the content of the test is sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure (Henning, 1987). We can quite easily imagine taking a test after going through an entire language course. How would you feel if, at the end of the course, your final exam consisted of only one question covering one element of language from the many introduced in the course? If the language course was a conversational course focusing on the different social situations that one may encounter, how valid is a final examination that requires you to demonstrate only your ability to place an order at a fast food restaurant?

These different types of validity represent different concerns that many educators feel a test should address. However, it should be mentioned that in most situations, each of the kinds of validity described previously is independent of the others. A test may have face validity but lack construct validity. Similarly, it may have predictive validity but not content validity. What kind of validity we need to focus on depends on what our test is supposed to measure. A second issue is that validity is often a matter of degree. We will seldom find a language test that is a completely invalid measure of language ability. Similarly, it is also extremely unlikely that we will find a completely valid test of language.
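For the correlation-based types, (c) and (d), the validity coefficient is computed just like a reliability coefficient. Here is a minimal sketch with invented scores for a hypothetical new test validated against an established benchmark (statistics.correlation requires Python 3.10 or later):

    from statistics import correlation

    # invented scores for the same eight students on two tests
    new_test = [62, 75, 58, 81, 70, 66, 90, 55]         # your own new test
    reputable_test = [60, 78, 55, 85, 72, 63, 88, 59]   # established benchmark

    r = correlation(new_test, reputable_test)           # Pearson's r
    print(f"concurrent validity coefficient: {r:.2f}")

As the text notes, a coefficient between 0.7 and 1 would support the claim that the new test measures much the same ability as the benchmark.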


What do you think is test validity? What makes a test valid?

2.2.1 THE RELATIONSHIP BETWEEN VALIDITY AND RELIABILITY

The relationship between validity and reliability is often discussed. Questions that are asked include the following:

(a) Can a Test be Valid but Not Reliable?

In order to answer this question, we have to be aware of what sort of reliability is being measured. If the reliability in question is rater reliability, it is possible for a test to be valid but not reliable: the test could be a good measure of what it intends to measure, but the grading is not reliably done. However, if we consider split half reliability, which measures the internal consistency of the items and whether or not they work together, it will be difficult to claim that the test is valid if reliability is low. This is because when the internal consistency of the test is low, we cannot be sure of what the test is testing, as the items do not seem to measure the same construct. In this case, therefore, you must have reliability before you can consider the test to be valid. It should also be noted that sometimes the validity of a test is determined only by the scores that are given and how well those scores are seen to measure a student's ability. If validity is determined completely on the basis of scores, then reliability is critical regardless of whether it is rater reliability or internal consistency, as a score that is unreliable is not a valid measure.

(b) Can a Test be Reliable if it is Not Valid?

This question is much easier to respond to. Imagine a bull's-eye in shooting practice, and let us consider the bull's-eye as an indication of validity. If a person consistently hits the bull's-eye, then we may say that his shots are both consistent (reliable) and valid. However, if his shots consistently miss the mark, then we can consider him to be consistent (reliable) but not valid. This may be illustrated by Figure 2.2:


Figure 2.2: The relationship between validity and reliability


A test can therefore be reliable but not valid. If a test is said to measure reading ability but actually measures grammatical knowledge, then reliable scores on this test will not indicate that it is a valid test of reading ability.
(c) Which Would You Give Preference To?

In a situation where all else is equal, it is often better to consider validity as the more important of the two. This is because, as described earlier and in the previous paragraph, the worth of a test rests with how well it measures what it claims to measure. We need to analyse items in a test to ensure that they are relevant and representative of the construct that is being tested. If they are not, no degree of reliability will make it a useful test in measuring the construct we are interested in.
For language teaching, the relationship between validity and reliability is an important one, as sometimes, by increasing the degree of validity, the reliability of the scoring of the test is affected. A valid test of speaking, for example, is one that is conducted with very few controls, restrictions, or artificial interferences. However, by reducing each of these elements, the reliability of the test may quite easily be negatively affected.

2.3 PRACTICALITY

Although an important characteristic of tests, practicality is actually a limiting factor in testing. There will be situations in which, after we have already determined what we consider to be the most valid test, we need to reconsider the format simply because of practicality issues. A valid test of spoken interaction, for example, would require that the candidates be relaxed, interact with peers, and speak on topics that they are familiar and comfortable with. This sounds like the kind of conversation that people have with their friends while drinking coffee at roadside stalls. Of course, such a situation would be a highly valid measure of spoken interaction if we could set it up. But imagine if we even tried to do so: it would require hidden cameras as well as a lot of telephone calls and money. Therefore, a more practical form of the test, especially if it is to be administered nationwide as a standardised test, is to have a short interview session lasting about fifteen minutes, using perhaps a picture or reading stimulus that the candidates would describe or discuss. Practicality issues, although limiting in a sense, cannot be dismissed if we are to come up with a useful assessment of language ability. Practicality issues can involve economics or costs, administration considerations such as time and scoring procedures, as well as ease of interpretation. Tests are only as good as how well they are interpreted; tests that cannot be easily interpreted will definitely cause many problems.
Should practicality be taken into consideration when testing is conducted? What do you think?

2.4 ISSUES RELATED TO VALIDITY, RELIABILITY AND PRACTICALITY

There are numerous issues related to validity, reliability and practicality. Here, we will examine the concept of washback as well as suggested techniques for achieving acceptable standards in testing.

2.4.1 WASHBACK

Washback, also known as backwash (e.g. Hughes, 1989), refers to the extent to which a test affects teaching and learning. According to Alderson and Wall (1993), washback is seen in the actions that teachers and students perform which they would not necessarily do if there were no test. Washback can influence how a teacher teaches or what a teacher teaches. It can also influence how a student learns or what a student learns.

Bailey (1996) discusses a framework suggested by Hughes in which the washback effect is addressed with respect to the participants, processes, and products involved. Washback affects the participants involved in the test, such as students, teachers, materials writers and curriculum developers, as well as researchers, by influencing them affectively or even cognitively. It can also affect processes by way of how the participants act or do their work in relation to the test. Finally, the products are also affected, as materials and teaching programmes are clearly influenced by the processes.

Washback is especially apparent when the tests or examinations in question are regarded as being very important and as having a definite impact on the test-takers' future. We would expect, for example, that national standardised examinations would have strong washback effects compared to a school-based or classroom-based test. It is also interesting to note that a related concept, washforward, or the effect of teaching and learning on tests, has also been suggested. It is plausible that as we teach, we think of how we will later test or measure the effectiveness of our teaching. This is the essence of washforward, and it brings about the question of whether we teach what we test or test what we teach. The former refers to washback while the latter describes washforward.


2.4.2 PROMOTING POSITIVE WASHBACK

A test can affect teaching and learning either in a positive or a negative manner. If the effect is positive, then we refer to it as positive washback. If it is negative, then we have an example of negative washback. If it is true that tests affect teaching and learning, then we definitely want this effect to be positive.

What are examples of positive washback? When a student attends tuition classes in preparation for a test, does this represent positive washback? You may think that it does, as attending tuition classes means that the student will spend more time learning. However, it may not be positive washback if the tuition class teacher uses poor techniques and promotes learning strategies that may help the students do well on the test but not help them become more proficient in the language. How can such a situation occur? Let us take one possible scenario. If the test is in a multiple choice format and the teacher spends countless hours preparing students for it by drilling them with multiple choice type questions, then although the teaching and learning may be effective in terms of performance on the test, it may not actually promote language development. On the other hand, if a language test involves performance in group interaction, test preparation would also use discussion techniques. Most language learning theorists would approve of such a washback effect, as interaction is often seen as providing opportunities for language development.

Based on these two examples, we can conclude that the nature of the test itself is the cornerstone of whether there will be positive or negative washback. Good and valid tests will have a positive washback effect while tests that are not valid measures of the construct concerned will promote negative washback. As tests are in the service of teaching and learning, we must strive to achieve positive washback. Hughes (1989: 44-47) makes several suggestions on how to promote beneficial washback:

First, he suggests that we test the abilities whose development we want to encourage. Essentially, this refers to a very obvious fact: if we want to develop a particular skill such as speaking, then we should test that skill. Although this seems fairly obvious, Hughes points out that it is oftentimes not done for reasons (or rather excuses) such as the impracticality of some tests. As such, instead of testing the abilities that we want to develop, we end up testing a small portion or a poor representation of the skill.

His second suggestion is to use direct testing. We are all aware of the difference between direct tests and indirect tests. By using tests that directly assess the abilities that we are interested in, we will be able to encourage students to develop those abilities. Positive washback therefore occurs as students work towards developing abilities relevant to the tested construct rather than skills that reflect test-wiseness more than anything else.



Hughes also suggests that teachers should sample widely and unpredictably. He argues that if the test is too predictable, students will prepare for it by performing only the kinds of tasks expected in the test. Students have often been observed to show disinterest in performing tasks and activities that they immediately recognise as not part of a test, regardless of the benefits of the task or activity. Therefore, by sampling widely and unpredictably, students are expected to be more willing to prepare for the test by performing a variety of tasks related to the objectives of the teaching-learning situation.

Two other suggestions are to make testing criterion-referenced and to base achievement tests on teaching-learning objectives. Both these steps will help students become aware of what is expected of them and what kind of abilities they should be able to demonstrate. Descriptions of criterial levels help students match their current abilities to the stated criteria and develop their abilities accordingly.

Beneficial backwash can also be achieved if students and teachers are familiar with the test, its objectives, as well as its format. By becoming aware of the objectives of the test, both students and teachers can prepare for it in an organised and more directed manner. Hughes also mentions the importance of assisting teachers as they prepare their students for tests. He argues that whenever a test is intended to create positive backwash in teaching methodology, some teachers may find it difficult to adapt their teaching techniques to the demands of the test. In such situations, it becomes imperative that these teachers are assisted in order for the test to have a positive backwash effect.

Several others have also suggested ways and means to promote positive washback. Bailey (1993), for example, suggests that attention should be given to the language learning goals of the programme and the learners, and that as much authenticity as possible should be built into the tests. The importance of authenticity is echoed by Doye (1991), who advocates absolute congruence between tests and real-life situations. An authentic test is therefore "one that reproduces a real-life situation in order to examine the students' ability to cope with it" (p. 104). Other suggestions include the need to focus on learner autonomy and self-assessment. These suggestions imply that tests should provide sufficiently informative score reporting so that learners can assess their own performance and determine their own strengths and weaknesses.

It should also be noted here that high stakes tests tend to have a stronger washback effect than tests that are not considered important. Teachers often observe students not performing well on classroom tests; students are not too concerned about the outcome of these tests as they are aware that the results will have little impact on them. Hughes stresses the need to count the cost. He believes that in many situations, the cost of not achieving beneficial backwash may be higher than the cost of developing a test and testing situation that promote beneficial backwash. He points out that when we compare the cost of the test with "the waste of effort and time on the part of the teachers and students in activities quite inappropriate to their true learning goals", we are likely to decide that "we cannot afford not to introduce a test with a powerful backwash effect" (p. 47).


2.4.3 ATTAINING ACCEPTABLE STANDARDS IN TESTING LANGUAGE

As discussed earlier, tests are an integral part of much of academic as well as professional life. Many important decisions may be made on the basis of test scores. In this topic, as well as throughout most of the module, various issues related to accurate testing and assessment have been raised and discussed. In this section, some of the observations made by Linn, Baker and Dunbar (1991) and reiterated in Herman et al. (1992) will be used to conclude this topic and the entire module on tests and measurement. These concerns on how to evaluate a test have been adapted for language tests and include the following. Each concern is accompanied by a pertinent question that needs to be asked.

(a) Consequences
This issue deals with the effect tests can have on teaching and learning. The question that needs to be asked is whether the test has positive consequences, or whether there are negative and unintended washback effects such as the narrowing of the curriculum and the use of inappropriate teaching and learning strategies.

(b) Fairness
A contemporary concern in any kind of evaluation or assessment is fairness. As the cultural background of students will colour the way they use language and communicate, it is important to ask whether the ethnic and cultural backgrounds of the students have been taken into consideration in the assessment process.

(c) Transfer and Generalisability
Assessment is only useful if it can be taken out of the testing context and applied to actual contexts and situations. Therefore, we should strive to answer whether "the assessment results support accurate generalizations about student capability" (Herman et al., 1992, p. 10).

(d) Cognitive and Linguistic Complexity
In so far as we believe that true learning requires cognitive operations, we should be concerned with the level of cognitive complexity that occurs during assessment. Does the assessment require that the students use complex thinking and problem solving? Similarly, in language testing, does the test ensure that the candidates are required to display an appropriate level of linguistic complexity?

(e) Content Quality
The content of the test or assessment is always an important concern. As such, the validity of the content as representative of the construct that is being tested, assessed and measured is of prime importance. The question that needs to be asked in this context is therefore: is the selected content representative of the current understanding of the construct?



(f) Content Coverage
As we have seen in earlier topics, tests and assessments are integrated with instruction and the curriculum specifications of the teaching and learning programme. We would expect tests to cover the content presented in the curriculum. As such, we must ask whether the key elements of the curriculum are covered by the assessment.

(g) Meaningfulness
Another concern involves the meaningfulness of the tasks in tests and assessments. Only if the tasks are meaningful will we be able to get students to respond with honesty and motivation. A measure of performance when students are honest and motivated will be more accurate and trustworthy than one in which the students are not concerned. Therefore, we need to ask ourselves whether the students feel that the tasks are realistic and worthwhile.

(h) Cost and Efficiency
Finally, the issue of practicality is addressed. While practicality may be an important aspect of testing, it should not be the determining factor in developing a test. Rather, cost-effectiveness is the more important concern. In this respect, the issue is whether the information about students obtained through the test or assessment is worth the cost and time to obtain it.

(a) What are reliability and validity? What determines the reliability of a test?
(b) What are the different types of validity? Describe any three types and cite examples.

Testing and Evaluation in ESL
http://www.2dix.com/pdf-2011/testing-and-evaluation-in-eslpdf.php

Testing ESL Proficiency in ESP Context
http://www.aua.am/academics/dep/hf_publications/5%20Testing%20ESL%20proficiency%20in%20ESP%20context.pdf

Communicative Language Testing
http://www.hpu.edu/images/GraduateStudies/TESL_WPS/6_1_02Phan_a24075.pdf



SUMMARY

The concepts of reliability, validity and practicality are central to testing and measurement as they help determine the worth of the measurement. If a student is awarded 90 marks but the test is neither reliable nor valid, then the marks awarded may not be worth the certificate they are written on. This topic on reliability, validity, and practicality is therefore an important one for better understanding the importance of good testing and measurement.

GLOSSARY

Face validity: Face validity has various interpretations, but generally most agree that it refers to a basic form of validity in which a test is considered valid because it appears to measure what it claims to measure.

Internal consistency: The extent to which the items in a measurement instrument such as a test measure the same behaviour or skill in a manner consistent with the other items in the test.

Inter-rater reliability: The consistency of grades given by two different raters.

Intra-rater reliability: The consistency of grades given by a rater at different times or situations.

Parallel forms: A form of reliability in which performance on a test is compared to performance on another similar or parallel test.

Parallel forms reliability: Similar to a test-retest reliability estimate except that the two tests are similar but not the same. Also known as equivalent forms reliability.

Predictive validity: In predictive validity, the validity of a test is determined by the extent to which it is able to predict a particular level of performance in other related measures. For example, a test may have high predictive validity if it correlates highly with students' grades in the university.

Practicality: The extent to which a test is feasible in terms of costs, administration considerations such as time and scoring procedures, and ease of interpretation (see Section 2.3).

Reliability: The consistency and accuracy of a measurement.

Self assessment: Assessment in which the student assesses himself or herself with respect to how well he or she has performed or progressed. Often also referred to as self-appraisal, self evaluation, and self rating.

Split half reliability: A method of determining reliability through a single administration of the test in which performance on one half of the test is compared and correlated with performance on the other half.

Standard error of measurement (SEM): A means to estimate how accurate the test score of an individual is likely to be, through the use of statistical techniques involving sample group performance and the normal distribution.

True score: A candidate's true score is estimated from his or her actual test score (observed score), adjusted within the range of plus or minus one or more standard errors of measurement.

Test-retest: A form of reliability determined by comparing performance on a test with a second performance on the same test taken later.

Validity: The extent to which the test is actually measuring what it claims to measure.

Washback: The extent to which a test affects teaching and learning.

