Project 1

Natapon Kidrai 4436733 SCAL/M
SCLG 637 Testing and Evaluation

Project on Test Development
English Proficiency Test for M.2 Students
at Bangna Demonstration School
The test aims to measure the language proficiency of students. It
was administered by Anuchit Nasomboon, an English teacher at Bangna
Demonstration School. This private school is located in the Bangna
area, and most of its students come from wealthy families. Surprisingly,
quite a few students choose to attend this school. There are at least two
classes for each primary grade, but only one class for each secondary
grade. Some of the students come from foreign families whose parents
have come to work in Thailand. The school is attempting to create its
own curriculum for every subject, so Mr. Nasomboon set out to measure
how much his students knew before beginning the lessons. The
participants were twenty-six Mathayom Two students, and the time
allowed for the test was fifty minutes.
Test objective
This test was given to measure the students' background
knowledge. The analysis of the test scores will be used to adjust the
English curriculum for Mathayom Two at Bangna Demonstration School.
At the same time, the scores will be analyzed to see how and where each
item needs improvement.
Subjects
Twenty-six Mathayom Two students took the test, all from the same
class. Among them were one student whose family had immigrated from
the United States and another whose family came from Japan, so the test
may have seemed somewhat unequal for the Thai students. Even so, the
test scores turned out to be unexpectedly unsatisfactory.
Students Total Students Total

Narit 29 Nattaporn 13
Porntip 25 Chalrmachai 13
Maturot 19 Staporn 13
Wannisa 19 Prakorn 13
Pravee 19 Pawetre 12
Sorratat 19 Phornphan 11
Piyada 17 Witawat 11
Warunya 17 Tanasan 11
Sutthida 16 Kanok-karn 10
Wareewan 16 Julawat 9
Manecha 16 Teerapat 9
Utomphorn 16 Jinnaput 7
Wiliya 15 Chatchai 5
Mean = 14.6154 Standard Deviation = 14.6511
Table 1 Test Score
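As a cross-check on Table 1, the mean and standard deviation can be recomputed directly from the twenty-six totals. The sketch below is only an illustration using Python's standard library. Whether the population or the sample formula was intended is an assumption, so both are shown.

```python
import statistics

# Total scores copied from Table 1 (26 students).
scores = [29, 25, 19, 19, 19, 19, 17, 17, 16, 16, 16, 16, 15,
          13, 13, 13, 13, 12, 11, 11, 11, 10, 9, 9, 7, 5]

mean = statistics.mean(scores)
sd_population = statistics.pstdev(scores)  # divides by N
sd_sample = statistics.stdev(scores)       # divides by N - 1

print(f"N = {len(scores)}")
print(f"Mean = {mean:.4f}")
print(f"SD (population) = {sd_population:.4f}")
print(f"SD (sample) = {sd_sample:.4f}")
```

Run on these totals, the mean agrees with the 14.6154 printed in Table 1, while both standard deviations come out a little above 5, so the printed standard deviation may be worth rechecking.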
From the test scores, I then broke each total down into scores on
the individual items. Table 2 shows how many points each student earned
on each item. At the bottom of the table are the mean and standard
deviation of the total scores.
From the raw scores, mean, and standard deviation, I then turned
to the item facility of the test. Item Facility (IF) is calculated by
adding up the number of students who correctly answered a particular
item and dividing that sum by the total number of students who took the
test (James Brown, 1996: p. 65). The formula can be written like this:
\[ IF = \frac{N_{correct}}{N_{total}} \]
where N_{correct} = number of students answering correctly
N_{total} = number of students taking the test
The IF value can range from 0.00 to 1.00 for different items. Items left
blank are counted as incorrect answers. The IF value indicates how
difficult or easy each test item is.
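The item facility formula above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for this test: the function name and the way responses are encoded (True, False, or None for a blank) are assumptions, and the split between incorrect and blank answers in the example is hypothetical.

```python
def item_facility(responses):
    """Return IF: the proportion of test-takers who answered an item correctly.

    `responses` holds one entry per student: True for a correct answer,
    False for an incorrect one, and None for an item left blank
    (blanks are counted as incorrect).
    """
    if not responses:
        raise ValueError("at least one response is required")
    n_correct = sum(1 for r in responses if r is True)
    return n_correct / len(responses)


# Hypothetical response pattern: 18 correct, 6 incorrect, 2 blank out of 26.
example = [True] * 18 + [False] * 6 + [None] * 2
print(round(item_facility(example), 2))  # 0.69
```

Eighteen correct answers out of twenty-six students gives the IF of 0.69 reported for item 1 of Part I.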
A second useful statistic for interpreting an item builds on the IF
value. The Item Discrimination (ID) score is the degree to which an item
separates the students who performed well from those who performed
poorly. The ID score helps teachers contrast the performance of the
upper-group students on the test with that of the lower-group students.
In both Table 1 and Table 2 you can see the students divided into these
two groups.
The ID score can be calculated with this formula:
\[ ID = IF_{upper} - IF_{lower} \]
where ID = item discrimination for an individual item
IF_{upper} = item facility for the upper group on the whole test
IF_{lower} = item facility for the lower group on the whole test
Table 3 below gives the IF and ID scores of the individual items.
IF score and ID score
Part I
Item number 1 2 3 4 5 6 7 8 9 10
IF total 0.69 0.65 0.65 0.58 0.62 0.58 0.50 0.46 0.62 0.62
IF upper 0.85 0.85 0.85 0.92 0.77 0.85 0.62 0.62 0.62 0.77
IF lower 0.54 0.46 0.46 0.23 0.46 0.31 0.38 0.31 0.62 0.46
ID 0.31 0.38 0.38 0.69 0.31 0.54 0.23 0.31 0.00 0.31
Part II
Item number 1 2 3 4 5 6 7 8 9 10
IF total 0.35 0.35 0.27 0.77 0.38 0.50 0.54 0.38 0.38 0.38
IF upper 0.46 0.46 0.46 0.77 0.54 0.62 0.69 0.46 0.46 0.62
IF lower 0.23 0.23 0.08 0.77 0.23 0.38 0.38 0.31 0.31 0.15
ID 0.23 0.23 0.38 0.00 0.31 0.23 0.31 0.15 0.15 0.46
Item number 11 12 13 14 15 16 17 18 19 20
IF total 0.69 0.38 0.58 0.31 0.50 0.46 0.27 0.35 0.31 0.38
IF upper 0.77 0.38 0.85 0.38 0.69 0.62 0.38 0.31 0.38 0.46
IF lower 0.62 0.38 0.31 0.23 0.31 0.31 0.15 0.38 0.23 0.31
ID 0.15 0.00 0.54 0.15 0.38 0.31 0.23 -0.08 0.15 0.15
Table 3 IF score and ID score of the whole test
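The ID calculation behind Table 3 can be illustrated as follows. This is a sketch under stated assumptions: the function name is hypothetical, the upper and lower groups are taken to be the top and bottom thirteen students, and the counts of correct answers are inferred from the rounded IF values rather than taken from the raw data.

```python
def item_discrimination(upper_correct, lower_correct, group_size):
    """Return (IF_upper, IF_lower, ID) for one item."""
    if_upper = upper_correct / group_size
    if_lower = lower_correct / group_size
    return if_upper, if_lower, if_upper - if_lower


# Counts assumed from Table 3, Part I, item 4: IF values of 0.92 and 0.23
# with groups of 13 correspond to 12 and 3 correct answers.
if_upper, if_lower, item_id = item_discrimination(12, 3, 13)
print(f"{if_upper:.2f} {if_lower:.2f} {item_id:.2f}")  # 0.92 0.23 0.69
```

Computing ID from the unrounded group proportions, rather than from the rounded IF values, avoids differences of 0.01 in the last digit.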
Since this proficiency test is a norm-referenced test (NRT), an
ideal item should have an IF value of about 0.50 and the highest
possible ID. IF values between 0.30 and 0.70 are considered acceptable.
Ebel (1979, p. 267) has suggested the following guidelines for
making decisions based on ID:
0.40 and up Very good items
0.30 to 0.39 Reasonably good but possibly subject to
improvement
0.20 to 0.29 Marginal items, usually needing and being
subject to improvement
Below 0.19 Poor items, to be rejected or improved by
revision
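The guidelines above translate into a simple lookup. The sketch below is an illustration only: the function names are hypothetical, and treating each band as a lower threshold (so that a value such as 0.195 falls in the lowest band) is an assumption, since the quoted bands are given to two decimal places.

```python
def classify_item(item_id):
    """Label an item by its ID value using the bands quoted from Ebel."""
    if item_id >= 0.40:
        return "very good"
    if item_id >= 0.30:
        return "reasonably good"
    if item_id >= 0.20:
        return "marginal"
    return "poor"


def facility_acceptable(item_facility):
    """True when IF lies in the 0.30 to 0.70 range described above."""
    return 0.30 <= item_facility <= 0.70


# ID values taken from Table 3, Part I, items 4, 1, 7, and 9.
for value in (0.69, 0.31, 0.23, 0.00):
    print(value, classify_item(value))
```

A negative ID, such as the one for item 18 of Part II, also falls in the lowest band.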
The IF and ID values lead on to an analysis of distractor
efficiency. The goal of distractor efficiency analysis is to examine the
degree to which the distractors attract students who do not know the
correct answer, and so how efficiently the distractors are functioning.
As mentioned above, the IF value helps show which items need
improvement or elimination. For example, an item might be considered
too easy when its IF value is very high. But an easy item is sometimes
useful, because it lets the students move from the simplest item to the
harder ones. The proportion of students in each group who chose each
option is analyzed. Table 4 below shows the distractor efficiency of the
test.
Distractor Efficiency
Part I
Item IF ID Group +ing +ed Notes
1 0.69 0.31 High 0.85* 0.15 Reasonable
Low 0.54* 0.46
2 0.65 0.38 High 0.85* 0.15 Reasonable
Low 0.46* 0.54
3 0.65 0.38 High 0.15 0.85* Reasonable
Low 0.54 0.46*
4 0.58 0.69 High 0.92* 0.08 Good
Low 0.23* 0.77
5 0.62 0.31 High 0.23 0.77* Reasonable
Low 0.54 0.46*
6 0.58 0.54 High 0.15 0.85* Good
Low 0.69 0.31*
7 0.50 0.23 High 0.31 0.69* Improvement Needed
Low 0.62 0.38*
8 0.46 0.31 High 0.62* 0.38 Reasonable
Low 0.31* 0.69
9 0.62 0.00 High 0.31 0.69* Rejected
Low 0.38 0.62*
10 0.62 0.31 High 0.85* 0.15 Reasonable
Low 0.46* 0.54
*correct option
Part II
Item IF ID Group A. B. C. D. Notes
1 0.35 0.23 High 0.00 0.38 0.08 0.46* Improvement Needed
Low 0.31 0.31 0.15 0.23*
2 0.35 0.23 High 0.23 0.23 0.46* 0.08 Improvement Needed
Low 0.31 0.23 0.23* 0.23
3 0.27 0.38 High 0.46* 0.23 0.31 0.00 Reasonable
Low 0.08* 0.15 0.54 0.23
4 0.77 0.00 High 0.77* 0.08 0.08 0.08 Rejected
Low 0.77* 0.08 0.00 0.08
5 0.38 0.31 High 0.08 0.00 0.54* 0.15 Reasonable
Low 0.08 0.23 0.23* 0.46
6 0.50 0.23 High 0.31 0.62* 0.00 0.08 Improvement Needed
Low 0.31 0.38* 0.23 0.08
7 0.54 0.31 High 0.31 0.00 0.15 0.69* Reasonable
Low 0.08 0.23 0.23 0.38*
8 0.38 0.15 High 0.23 0.15 0.15 0.46* Rejected
Low 0.15 0.31 0.15 0.31*
9 0.38 0.15 High 0.23 0.46* 0.08 0.23 Rejected
Low 0.38 0.31* 0.15 0.08
10 0.38 0.46 High 0.62* 0.15 0.00 0.23 Good
Low 0.15* 0.31 0.08 0.46
11 0.69 0.15 High 0.15 0.08 0.77* 0.00 Rejected
Low 0.08 0.08 0.62* 0.08
12 0.38 0.00 High 0.46 0.08 0.38* 0.00 Rejected
Low 0.23 0.31 0.38* 0.00
13 0.58 0.54 High 0.08 0.85* 0.00 0.08 Good
Low 0.38 0.31* 0.08 0.15
14 0.31 0.15 High 0.46 0.08 0.08 0.38* Rejected
Low 0.31 0.31 0.15 0.23*
15 0.50 0.38 High 0.00 0.00 0.31 0.69* Reasonable
Low 0.00 0.08 0.54 0.31*
16 0.46 0.31 High 0.23 0.08 0.62* 0.08 Reasonable
Low 0.31 0.31 0.31* 0.00
17 0.27 0.23 High 0.38* 0.15 0.31 0.08 Improvement Needed
Low 0.15* 0.15 0.54 0.08
18 0.35 -0.08 High 0.31* 0.00 0.46 0.23 Rejected
Low 0.38* 0.23 0.15 0.15
19 0.31 0.15 High 0.15 0.38* 0.31 0.15 Rejected
Low 0.23 0.23* 0.38 0.08
20 0.38 0.15 High 0.15 0.23 0.46* 0.15 Rejected
Low 0.15 0.46 0.31* 0.08
*correct option
Table 4 Distractor Efficiency Analysis
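Each High and Low row in Table 4 is a set of proportions: the share of that group choosing each option. A minimal sketch of the tally follows. The function name is hypothetical, and the individual choices are assumed from the rounded proportions for Part II, item 10, with thirteen students per group, rather than taken from the answer sheets.

```python
from collections import Counter


def option_proportions(choices, options="ABCD"):
    """Proportion of a group choosing each option.

    Answers that are not one of `options` (for example blanks) are not
    counted for any option but still count toward the group size.
    """
    counts = Counter(choices)
    return {opt: counts[opt] / len(choices) for opt in options}


# Choices assumed from Table 4, Part II, item 10 (correct option: A).
upper = ["A"] * 8 + ["B"] * 2 + ["D"] * 3
lower = ["A"] * 2 + ["B"] * 4 + ["C"] * 1 + ["D"] * 6

for name, group in (("High", upper), ("Low", lower)):
    props = option_proportions(group)
    print(name, " ".join(f"{opt}={props[opt]:.2f}" for opt in "ABCD"))
```

With these assumed counts the output matches the two rows printed for item 10. A distractor that draws more of the Low group than the High group, as B and D do here, is working as intended.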
As you can see, Part I of the test seems very good at
discriminating good students from poor students, but Part II does not.
Many of its distractors work so well that even good students could not
answer correctly. For example, in item 12 the correct answer and two of
the distractors were chosen by nearly equal proportions of students. An
item like this is good at distracting students who do not really know the
correct answer. On the other hand, it may reflect the way the students
had been taught or how well they retained previous knowledge. Item 18
is a very bad item and should be the first to be eliminated because it
fails to discriminate between good and poor students: poor students
answered it correctly more often than good students, which was
unexpected. Items 4 and 12 are to be rejected as well, since they cannot
differentiate the good students from the rest of the class.
The rejected items should therefore be replaced by new items. The
following are substitutions for those items:
For Part I
9. You speak English very (good/well).
For Part II
4. She sang …………
A. beautiful B. beautifully C. beauty D. beautily
8. He can paint the fence ………
A. fastly B. fasten C. fastness D. fast
9. He is ……….. right.
A. quite B. quiet C. quitely D. quietly
11. He killed a cat ……….. yesterday.
A. accident B. accidental C. accidentally D. accidently
12. The employees were ………. afraid of their new boss.
A. terrifying B. terrified C. terrible D. terribly
14. They entered the room …………… because they were ………..
A. quiet, late B. quietly, late
C. quietly, lately D. quiet, lately
18. Our teacher explained things very ………… We all understand
him ……..
A. clear, perfect B. clearly, perfect
C. clear, perfectly D. clearly, perfectly
19. Please carry the glasses ………… They were very expensive.
A. careful B. carefully C. carely D. care
20. She speaks ……….. She has a ……….. voice.
A. soft, soft B. softly, soft
C. soft, softly D. softly, softly
Reliability of the test
The test may be good at discriminating students from each other,
but how reliable is it? A test should give the same results every time it
is used under the same conditions, should measure what it is supposed to
measure, and should be practical to use. Every measurement instrument,
however, inevitably has flaws that cause inaccuracies. For a language
test, there are various ways to examine reliability, depending on the
type of test.
The English Proficiency Test is an NRT. Its reliability can be
estimated using Kuder-Richardson Formula 20 (K-R20), because this
formula avoids the problem of underestimating the reliability of certain
language tests. The formula is as follows:

\[ K\text{-}R20 = \frac{k}{k-1}\left(1 - \frac{\sum IV}{S_t^2}\right) \]
where K-R20 = Kuder-Richardson Formula 20
k = number of items
IV = item variance
S_t^2 = variance for the whole test (that is, the standard deviation
of the test scores squared)
Calculating the K-R20 value involves several other variables. The
calculation of the item variances is shown below.
Calculating Item Variances

Part I
Item number IF 1-IF IF(1-IF)
1 0.6923 0.3077 0.2130
2 0.6538 0.3462 0.2263
3 0.6538 0.3462 0.2263
4 0.5769 0.4231 0.2441
5 0.6154 0.3846 0.2367
6 0.5769 0.4231 0.2441
7 0.5000 0.5000 0.2500
8 0.4615 0.5385 0.2485
9 0.6154 0.3846 0.2367
10 0.6154 0.3846 0.2367
Part II
Item number IF 1-IF IF(1-IF)
1 0.0385 0.9615 0.0370
2 0.0769 0.9231 0.0710
3 0.1154 0.8846 0.1021
4 0.1538 0.8462 0.1302
5 0.1923 0.8077 0.1553
6 0.2308 0.7692 0.1775
7 0.2692 0.7308 0.1967
8 0.3077 0.6923 0.2130
9 0.3462 0.6538 0.2263
10 0.3846 0.6154 0.2367
11 0.4231 0.5769 0.2441
12 0.4615 0.5385 0.2485
13 0.5000 0.5000 0.2500
14 0.5385 0.4615 0.2485
15 0.5769 0.4231 0.2441
16 0.6154 0.3846 0.2367
17 0.6538 0.3462 0.2263
18 0.6923 0.3077 0.2130
19 0.7308 0.2692 0.1967
20 0.7692 0.2308 0.1775
Variance Total 6.9127
Table 5 Calculation of Item Variances
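Each item variance in Table 5 is the product IF(1-IF). As an illustration, the sketch below recomputes the Part I rows from the IF values printed in the table. The variable and function names are assumptions made for this example.

```python
# IF values for Part I, items 1 to 10, copied from Table 5.
part_one_if = [0.6923, 0.6538, 0.6538, 0.5769, 0.6154,
               0.5769, 0.5000, 0.4615, 0.6154, 0.6154]


def item_variance(item_facility):
    """Variance of a right/wrong item: IF multiplied by (1 - IF)."""
    return item_facility * (1 - item_facility)


variances = [item_variance(v) for v in part_one_if]
for number, (v, iv) in enumerate(zip(part_one_if, variances), start=1):
    print(f"{number:>2} {v:.4f} {1 - v:.4f} {iv:.4f}")
print(f"Sum of Part I item variances = {sum(variances):.4f}")
```

The same function applies to the Part II items. Summing the variances over all thirty items gives the ∑IV term of the K-R20 formula.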
In addition to the values in Table 5, other values are needed. See
Table 6 for the rest of the calculation.
Test Statistics
Mean 14.615
S 14.651
K-R20 0.029
Table 6 Test Statistics
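The K-R20 formula can be sketched end to end from a matrix of right/wrong item scores. This is a minimal illustration, not the calculation used for this report: the function name is hypothetical, the use of the population variance for the whole-test variance is an assumption, and the small score matrix is invented for the example.

```python
import statistics


def kr20(responses):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    `responses` has one row per student and one column per item.
    """
    k = len(responses[0])                      # number of items
    n = len(responses)                         # number of students
    totals = [sum(row) for row in responses]   # total score per student
    total_variance = statistics.pvariance(totals)
    sum_item_variance = 0.0
    for item in range(k):
        facility = sum(row[item] for row in responses) / n
        sum_item_variance += facility * (1 - facility)
    return (k / (k - 1)) * (1 - sum_item_variance / total_variance)


# Hypothetical scores for four students on three items (not data from this test).
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(toy), 2))  # 0.75
```

For the toy matrix the item variances sum to 0.625 and the total-score variance is 1.25, so the estimate is (3/2)(1 - 0.5) = 0.75.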
The reliability of this test came out to be 0.029, which is quite low.
When the test is re-administered with new items in place of the
eliminated ones, the reliability will, of course, change. In that case,
the same participants will have to retake the test so that the group of
test-takers remains consistent. In addition, this test was scored by only
one rater, so the question of inter-rater reliability does not arise.
Conclusion
Administering the English Proficiency Test to Mathayom Two
students at Bangna Demonstration School produced a wide range of
results. This range implies that curriculum and course design
development are needed. Although the test was given to prepare the
students for the coming year, it also indicates the areas where
improvement is needed. Though many factors could have influenced the
test results, the overall scores showed that the students need an
extensive course to prepare them for the classroom. Based on the results
of the test, the course designer should make a better plan for directing
and explaining the specific skills required. For NRT developers, it is
recommended to make a test that is as long as possible, is well-designed
and carefully written, assesses relatively homogeneous material, has
items that discriminate well, is normally distributed, and is
administered to a group of students whose abilities are as wide as
logically possible within the context (James Brown, 1996: p. 209).
My reasoning about why such items should be eliminated and why
the scores were unsatisfactory rested on my assumption that a private
school provides better classroom language learning than government
schools. This assumption was proved wrong when the test was completed.
The reasons lie in the curriculum design and lesson planning. The
participants themselves were also a factor: they were not ready to take
the test, and their concentration was not on it, as the test was given
near the end of the school day.
Reference
Brown, James Dean. (1996). Testing in language programs. New Jersey:
Prentice Hall Regents.
Appendix
Circle the appropriate words in the brackets to complete the sentences.

1. I think this film is (bored/boring)………..
2. I don’t find politics (interested/interesting) ………….
3. Walking makes me (tired/tiring) …………
4. This book is really (excited/exciting) ………..
5. Kate is doing her exams and is (worried/worrying) …………
6. Are you (interested/interesting) ………….. in basketball?
7. Dang always feels (bored/boring) ……………
8. Jan finds computers (confused/confusing) …………..
9. We were all feeling (tired/tiring) ………….
10. What an (excited/exciting) ………………. day.
Circle the appropriate items to complete the sentences.

1. He bought a(n) …………. from the antique shop.
A. rosewood old round table B. old rose wood round table
C. round old rosewood table D. old round rosewood table
2. It is a(n) ……………
A. horrifying old mysterious story B. horrifying mysterious old story
C. old horrifying mysterious story D. mysterious old horrifying story
3. His voice is …………….
A. loud B. aloud C. loudly D. aloudly
4. The lesson seems …………….
A. interesting B. interested C. interestingly D. interest
5. We arrived at the destination ……………
A. save B. safe C. safely D. safety
6. I am sure the soup tastes …………..
A. well B. good C. goodness D. goodly
7. The ……….. parents scolded the child for his …………. results.
A. disappointing, disappointing B. disappointed, disappointed
C. disappointing, disappointed D. disappointed, disappointing
8. Give him that ………..
A. yellow old leather case B. old leather yellow case
C. leather yellow old case D. old yellow leather case
9. The curry smells …….. but it doesn’t taste …………
A. well, deliciously B. good, delicious
C. good, deliciously D. well, delicious
10. I feel ………… when I think of my housework.
A. bad B. badly C. badness D. worse
11. We were already ………..
A. worry B. worrying C. worried D. worrily
12. The children are ……….. by the animals.
A. frightening B. frighten C. frightened D. frightingly
13. It was a very ………… journey.
A. tired B. tiring C. tiresome D. tireness
14. We were all very ………… in what he said.
A. interesting B. interest C. interestingly D. interested
15. Why do you look so …………. at school?
A. boringly B. boredom C. boring D. bored
16. It was a terribly ………… day.
A. excited B. excitement C. exciting D. excitedly
17. Didn’t you think it was an ………….. play?
A. amusing B. amusement C. amused D. amusingly
18. We had a ………….. trip home.
A. tiring B. tiredness C. tired D. tiredly
19. The last half hour was a …………. time.
A. worry B. worrying C. worried D. worrily
20. I’ve never been so ………… in my life.
A. frightening B. frighten C. frightened D. frighteningly
