Test Construction Reliability

TYPES OF RELIABILITY ESTIMATES
HOW DO WE DETERMINE/ESTIMATE RELIABILITY?
SARMIENTO, MICAH NICOLE V. 3PSY7

• Using Two Sets of Scores
  • Test-Retest Reliability
  • Alternate-Form/Parallel-Form Reliability
• Using One Set of Scores
  • Split-Half Reliability
  • Inter-Item Correlations (Internal Consistency/Homogeneity)
• Inter-Scorer/Inter-Rater Reliability
ESTIMATING RELIABILITY USING TWO SETS OF SCORES
• Pearson's correlation coefficient is the reliability coefficient calculated for these types of reliability estimates.
• The Pearson correlation coefficient measures the strength of the linear relationship between two variables. It is often referred to as Pearson's r.

$$ r_{xx} = \frac{\frac{\sum x_1 x_2}{N} - \bar{x}_1 \bar{x}_2}{\sigma_{x_1} \sigma_{x_2}} $$

where
$x_1$ = scores in one set
$x_2$ = scores in the other set
$\sum x_1 x_2$ = sum of each $x_1$ score times its corresponding $x_2$ score
$\bar{x}_1$ = mean of the $x_1$ scores
$\bar{x}_2$ = mean of the $x_2$ scores
$\sigma_{x_1}$ = SD of the $x_1$ scores
$\sigma_{x_2}$ = SD of the $x_2$ scores
$N$ = number of people in the group



TEST-RETEST RELIABILITY
• Estimates how much error in a test score is due to problems with test administration or testing conditions (e.g., extraneous variables during testing) by administering the test at two different times.
• Administer the same test to the same participants on two different occasions.
• Correlate the test scores of the two administrations of the same test.
• It is an index of stability; it yields a coefficient of stability.
• Keywords: TIME, SAME TEST
• Procedure:
  1. Administer a test to a group of individuals.
  2. Re-administer the same test to the same group at some later time.
  3. Correlate the first set of scores with the second.
  o The length of the interval between administrations is crucial.
    • It is chosen based on research or beliefs about the stability of the characteristic being measured.

Worked example (Pearson r), N = 10 (number of people in the group):

S            A    B    C    D    E    F    G    H    I    J     X̅      σ
DAY 1 (X1)   18   16    5   13   15   16   12    5    8   10   11.8   4.42
DAY 2 (X2)   18   18    6   16   17   16   14    5    7   11   12.8   4.87
X1X2        324  288   30  208  255  256  168   25   56  110   ΣX1X2 = 1720

$x_1$ = scores in one set; $x_2$ = scores in the other set
$\sum x_1 x_2$ = sum of each $x_1$ score times its corresponding $x_2$ score (1720)
$\bar{x}_1$ = mean of the $x_1$ scores (11.8); $\bar{x}_2$ = mean of the $x_2$ scores (12.8)
$\sigma_{x_1}$ = SD of the $x_1$ scores (4.42); $\sigma_{x_2}$ = SD of the $x_2$ scores (4.87)

$$ r_{xx} = \frac{\frac{\sum x_1 x_2}{N} - \bar{x}_1 \bar{x}_2}{\sigma_{x_1} \sigma_{x_2}} = \frac{\frac{1720}{10} - (11.8)(12.8)}{(4.42)(4.87)} = \frac{172 - 151.04}{21.56} = .9723 \approx .97 $$

The denominator 21.56 is the product of the unrounded standard deviations.
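The substitution above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function name `pearson_r` is an assumption, and the score lists are the ones consistent with the listed X1X2 products and the stated means (subject F scores 16 on both days, subject D scores 16 on Day 2). It uses population standard deviations (divide by N), as the slide's formula does.

```python
from math import sqrt


def pearson_r(x1, x2):
    """Pearson r in the slide's form: (mean of products - product of means) / (SD1 * SD2)."""
    n = len(x1)
    if n == 0 or n != len(x2):
        raise ValueError("both score lists must be non-empty and of equal length")
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    mean_of_products = sum(a * b for a, b in zip(x1, x2)) / n
    # Population standard deviations (divide by N, not N - 1).
    sd1 = sqrt(sum((a - mean1) ** 2 for a in x1) / n)
    sd2 = sqrt(sum((b - mean2) ** 2 for b in x2) / n)
    return (mean_of_products - mean1 * mean2) / (sd1 * sd2)


# Scores for subjects A to J on the two administrations.
day1 = [18, 16, 5, 13, 15, 16, 12, 5, 8, 10]
day2 = [18, 18, 6, 16, 17, 16, 14, 5, 7, 11]

r = pearson_r(day1, day2)
print(round(r, 2))  # 0.97
```

The same function applies unchanged to alternate-form and split-half data, since all three estimates correlate two sets of scores.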
• PROs
  • Easier to develop than alternate/parallel forms
• CONs
  • Maturation
  • Extraneous variables during the time interval
  • Carryover effects



ALTERNATE-FORM RELIABILITY/PARALLEL-FORM RELIABILITY
DIFFERENCES:
• PARALLEL FORMS
  - For each form of the test, the means and the variances of observed test scores are equal.
  - Scores on each form correlate equally with the true score.
  - Scores on each form correlate equally with other measures.
  - Parallel-forms reliability is an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
• ALTERNATE FORMS
  - Simply different versions of a test that have been constructed so as to be parallel.
  - Do not meet the requirements for the legitimate designation "parallel".
  - Typically designed to be equivalent with respect to variables such as content and level of difficulty.
  - Alternate-forms reliability is an estimate of the extent to which these different forms of the same test have been affected by item sampling error or other error.

SIMILARITIES:
• Used to assess the consistency of the results of two tests constructed in the same way from the same content domain (similar form, content, and level of difficulty).
• Determine whether scores will generalize across different sets of items or tasks.
• Scores on the two forms of the test are correlated to yield a coefficient of equivalence.
• Determine how comparable two different versions of the same measure are.
• KEYWORDS: TWO TESTS, EQUIVALENT
• Procedure: administer two tests that measure the same knowledge or characteristics at the same time, one following the other, or after a relatively short time interval, and correlate the scores on the two forms.

Pearson r for two forms, N = 10 (number of people in the group):

$$ r_{xx} = \frac{\frac{\sum x_A x_B}{N} - \bar{x}_A \bar{x}_B}{\sigma_{x_A} \sigma_{x_B}} $$

where
$x_A$ = scores on the 1st form; $x_B$ = scores on the 2nd form
$\sum x_A x_B$ = sum of each $x_A$ score times its corresponding $x_B$ score (1720)
$\bar{x}_A$ = mean of the $x_A$ scores (11.8); $\bar{x}_B$ = mean of the $x_B$ scores (12.8)
$\sigma_{x_A}$ = SD of the $x_A$ scores (4.42); $\sigma_{x_B}$ = SD of the $x_B$ scores (4.87)

S              1    2    3    4    5    6    7    8    9   10     X̅      σ
FORM A (XA)   18   16    5   13   15   16   12    5    8   10   11.8   4.42
FORM B (XB)   18   18    6   16   17   16   14    5    7   11   12.8   4.87
XAXB         324  288   30  208  255  256  168   25   56  110   ΣXAXB = 1720

$$ r_{xx} = \frac{\frac{1720}{10} - (11.8)(12.8)}{(4.42)(4.87)} = \frac{172 - 151.04}{21.56} = .9723 \approx .97 $$

The denominator 21.56 is the product of the unrounded standard deviations.
• PROs
  • Solves the problem of carryover effects
• CONs
  • Time-consuming to develop
  • Difficult to develop

ESTIMATING RELIABILITY USING ONE SET OF SCORES


SPLIT-HALF RELIABILITY
• Divides a single test into two half-tests to create two separate forms assessing the same domain.
• Treats the halves as "alternate forms" or "mini-parallel-forms".
• The two halves need to be as similar as possible, or as nearly equal as humanly possible, in format, stylistic, statistical, and related aspects.
• Yields a correlation coefficient that reflects the equivalence of the TWO HALVES.
• Keywords: HALF, SINGLE TEST
• The test can be divided in many ways, such as:
  • Odd-even split (odd-even reliability)
  • Random assignment
• The test should not be divided in the middle.

PROCEDURE:
• Develop/use a single test.
1. Divide the test into equivalent halves.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula.

Pearson r between the two halves (scored here as Form A and Form B):

$$ r_{hh} = \frac{\frac{\sum x_A x_B}{N} - \bar{x}_A \bar{x}_B}{\sigma_{x_A} \sigma_{x_B}} $$

Spearman-Brown formula (general form, where $n$ is the factor by which the test length changes):

$$ r_{SB} = \frac{n\, r_{xx}}{1 + (n - 1)\, r_{xx}} $$

For two halves ($n = 2$):

$$ r_{SB} = \frac{2 r_{hh}}{1 + r_{hh}} $$

S              1    2    3    4    5    6    7    8    9   10     X̅      σ
FORM A (XA)   18   16    5   13   15   16   12    5    8   10   11.8   4.42
FORM B (XB)   18   18    6   16   17   16   14    5    7   11   12.8   4.87
XAXB         324  288   30  208  255  256  168   25   56  110   ΣXAXB = 1720

$x_A$ = scores on the 1st half; $x_B$ = scores on the 2nd half
$\sum x_A x_B$ = sum of each $x_A$ score times its corresponding $x_B$ score
$\bar{x}_A$ = mean of the $x_A$ scores (11.8); $\bar{x}_B$ = mean of the $x_B$ scores (12.8)
$\sigma_{x_A}$ = SD of the $x_A$ scores (4.42); $\sigma_{x_B}$ = SD of the $x_B$ scores (4.87)
N = 10 (number of people in the group)

$$ r_{hh} = \frac{\frac{1720}{10} - (11.8)(12.8)}{(4.42)(4.87)} = \frac{172 - 151.04}{21.56} = .9723 \approx .97 $$

$$ r_{SB} = \frac{2 r_{hh}}{1 + r_{hh}} = \frac{2(.97)}{1 + .97} = \frac{1.94}{1.97} \approx .98 $$

Second example (N = 150):

$$ r_{hh} = \frac{\frac{\sum x_A x_B}{N} - \bar{x}_A \bar{x}_B}{\sigma_{x_A} \sigma_{x_B}} = \frac{\frac{92177}{150} - 596.98}{28.81} = 0.733676 \approx .73 $$

$$ r_{SB} = \frac{2 r_{hh}}{1 + r_{hh}} = \frac{2(0.733676)}{1 + 0.733676} = 0.846382 \approx .85 $$
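The Spearman-Brown adjustment is simple enough to sketch in a few lines. The function name `spearman_brown` and its default argument are assumptions for illustration; the two calls reuse the half-test correlations reported in the examples above (.97 and 0.733676).

```python
def spearman_brown(r, n=2.0):
    """Spearman-Brown estimate for a test n times as long as the one that produced r.

    With n = 2 this is the split-half correction 2r / (1 + r).
    """
    denominator = 1 + (n - 1) * r
    if denominator == 0:
        raise ValueError("formula is undefined when 1 + (n - 1) * r equals 0")
    return n * r / denominator


# Half-test correlations from the two worked examples.
print(round(spearman_brown(0.97), 2))      # 0.98
print(round(spearman_brown(0.733676), 4))  # 0.8464
```

Passing a different `n` gives the general form, for example `n=0.5` to estimate the reliability of a test cut to half its length.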


OTHER INTER-ITEM CORRELATIONS OR INTERNAL CONSISTENCY/HOMOGENEITY CORRELATIONS
• Cronbach's coefficient alpha: for Likert scales
• Kuder-Richardson formulas: for items with right answers, dichotomously scored
  • KR-20 (items vary in difficulty)
  • KR-21 (all items are of about the same difficulty)
• Average proportional distance (APD)

• Cronbach's coefficient alpha

$$ r_{xx} = \alpha = \frac{k}{k - 1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_x^2}\right) $$

$k$ = number of test questions
$\sigma_x^2$ = the test variance (variance of the total scores)
$\sigma_i^2$ = the variance on a specific test item
$\sum \sigma_i^2$ = the sum of all test item variances

Question          1       2       3       4       5      Total Score
Participant A     3       4       4       3       5      19
Participant B     4       3       4       3       3      17
Participant C     2       3       3       2       3      13
Participant D     4       4       5       3       4      20
Participant E     3       2       4       3       3      15
Participant F     3       2       3       2       3      13
mean              3.17    3       3.83    2.67    3.5    Total mean = 16.17; σ²x = 7.4722
σi²               .4722   .6667   .4722   .2222   .5833  Σσi² = 2.4166

CRONBACH'S COEFFICIENT ALPHA

$$ r_{xx} = \alpha = \frac{k}{k - 1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_x^2}\right) = \frac{5}{4}\left(1 - \frac{2.4166}{7.4722}\right) = .8457 \approx .85 $$
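The alpha computation above can be reproduced from the raw item scores. This is a minimal sketch with assumed helper names (`population_variance`, `cronbach_alpha`); it uses population variances (divide by N), which is what the .4722 to .5833 item variances and the 7.4722 total variance in the table correspond to.

```python
def population_variance(values):
    """Variance with N in the denominator, matching the table's item variances."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-participant item-score lists."""
    k = len(rows[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*rows))  # transpose: one tuple of scores per item
    sum_item_variances = sum(population_variance(col) for col in item_columns)
    total_variance = population_variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Participants A to F, questions 1 to 5, from the table above.
scores = [
    [3, 4, 4, 3, 5],
    [4, 3, 4, 3, 3],
    [2, 3, 3, 2, 3],
    [4, 4, 5, 3, 4],
    [3, 2, 4, 3, 3],
    [3, 2, 3, 2, 3],
]
print(round(cronbach_alpha(scores), 2))  # 0.85
```

Using sample variances (N - 1) instead would give the same alpha, because the divisor cancels in the ratio of item variances to total variance.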

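Second example (10 participants, 10 items):

ITEMS      1       2       3       4       5       6       7       8       9       10      TOTAL
A          3       4       2       4       1       2       3       2       4       2       27
B          2       1       2       4       2       1       2       3       1       3       21
C          4       3       1       2       4       5       1       4       2       3       29
D          1       3       2       3       3       4       2       5       3       4       30
E          5       4       3       5       1       2       2       2       2       2       28
F          2       5       4       5       2       3       2       3       3       5       34
G          3       2       4       3       2       5       2       1       2       5       29
H          4       1       2       2       1       1       2       4       1       1       19
I          2       3       3       1       3       3       2       5       3       3       28
J          1       5       2       2       3       2       2       2       2       2       23
mean       2.7     3.1     2.5     3.1     2.2     2.8     2       3.1     2.3     3       26.8
item var   1.7889  2.1000  0.9444  1.8778  1.0667  2.1778  0.2222  1.8778  0.9000  1.7778  Σσi² = 14.7333
total var                                                                                  σ²x = 20.4

The variances in this table are sample variances (divided by N − 1). Alpha is unaffected as long as the item variances and the total variance use the same divisor.

CRONBACH'S COEFFICIENT ALPHA

$$ r_{xx} = \alpha = \frac{k}{k - 1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_x^2}\right) = \frac{10}{9}\left(1 - \frac{14.73}{20.4}\right) = .3088 \approx .31 $$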



• Kuder-Richardson Formula 20

$$ r_{xx} = KR\text{-}20 = \frac{k}{k - 1}\left(\frac{\sigma_x^2 - \sum p_i q_i}{\sigma_x^2}\right) $$

$k$ = number of test items
$\sigma_x^2$ = the test variance
$p_i$ = the proportion of test takers answering an item correctly (p)
$q_i$ = the proportion of test takers answering an item incorrectly (1 − p)
$\sum p_i q_i$ = the sum of each item's p value times its corresponding q value

KUDER-RICHARDSON-20 FORMULA

Question    1     2     3     4     5     6     Total Score (x)
A           Y     Y     Y     Y     N     Y     5
B           Y     N     N     Y     N     Y     3
C           Y     N     N     Y     N     N     2
D           N     Y     N     N     N     N     1
E           N     Y     N     N     Y     Y     3
pi          .6    .6    .2    .6    .2    .6
qi          .4    .4    .8    .4    .8    .4
(p)(q)      .24   .24   .16   .24   .16   .24   Σpiqi = 1.28

x̄ = 2.8, σ² = 1.76, σ = 1.3266, k = 6, N = 5

$$ r_{xx} = KR\text{-}20 = \frac{6}{6 - 1}\left(\frac{1.76 - 1.28}{1.76}\right) = .3273 \approx .33 $$
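The KR-20 substitution can be verified from the Y/N table by coding correct answers as 1 and incorrect answers as 0. This is a minimal sketch; the function name `kr20` and the 0/1 coding are assumptions for illustration, and the total-score variance is the population variance (divide by N) that gives the table's 1.76.

```python
def kr20(responses):
    """KR-20 for dichotomous items; responses is a list of per-person 0/1 lists."""
    n_people = len(responses)
    k = len(responses[0])  # number of items
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (divide by N).
    total_variance = sum((t - mean_total) ** 2 for t in totals) / n_people
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    sum_pq = 0.0
    for item in zip(*responses):  # one tuple of 0/1 answers per item
        p = sum(item) / n_people  # proportion answering correctly
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * ((total_variance - sum_pq) / total_variance)


# Test takers A to E, questions 1 to 6 (Y = 1, N = 0).
answers = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
print(round(kr20(answers), 2))  # 0.33
```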


• Kuder-Richardson Formula 21

$$ r_{xx} = KR\text{-}21 = 1 - \frac{\bar{x}(k - \bar{x})}{k\,\sigma^2} $$

$k$ = number of test items
$\bar{x}$ = the mean number of questions correct
$\sigma^2$ = total variance

Person   A   B   C   D   E   F   G   H   I   J    x̄     σ
Grade    8   6   2   3   7   6   2   3   8   6    5.1   2.2561

k = 8, σ² = (2.2561)² = 5.09

$$ r_{xx} = KR\text{-}21 = 1 - \frac{5.1(8 - 5.1)}{8(5.09)} = 1 - \frac{14.79}{40.72} = .637 \approx .64 $$
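A short script confirms the KR-21 value from the ten grades. This is a minimal sketch under stated assumptions: the function name `kr21` is hypothetical, the variance is the population variance (divide by N), and the formula is the form shown on the slide. Many textbooks multiply this expression by an additional k/(k − 1) factor, which would give a larger value.

```python
def kr21(total_scores, k):
    """KR-21 in the slide's form: 1 - mean * (k - mean) / (k * variance)."""
    n = len(total_scores)
    mean = sum(total_scores) / n
    # Population variance of the total scores (divide by N).
    variance = sum((s - mean) ** 2 for s in total_scores) / n
    if variance == 0:
        raise ValueError("KR-21 is undefined when all scores are equal")
    return 1 - (mean * (k - mean)) / (k * variance)


# Grades for persons A to J on an 8-item test.
grades = [8, 6, 2, 3, 7, 6, 2, 3, 8, 6]
print(round(kr21(grades, k=8), 2))  # 0.64
```

Unlike KR-20, this needs only each person's total score and the number of items, which is why it assumes all items are about equally difficult.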


AVERAGE PROPORTIONAL DISTANCE (APD)
• A measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.

1. Calculate the absolute difference between scores for all pairs of items.
2. Average the differences between scores.
3. Obtain the APD by dividing the average difference between scores by the number of response options on the test, minus one.

Example (three items, seven response options):

Absolute differences
Between items 1 & 2: 1
Between items 1 & 3: 2
Between items 2 & 3: 1

Average difference = (1 + 2 + 1)/3 = 4/3 = 1.33
APD = 1.33/(7 − 1) = .22
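The three APD steps can be sketched as follows. The function name `apd` is an assumption, and the item scores 4, 5, and 6 are hypothetical values chosen only because they produce the absolute differences 1, 2, and 1 used in the example; the slides do not give the underlying item scores.

```python
from itertools import combinations


def apd(item_scores, n_response_options):
    """Average proportional distance for one set of item scores."""
    if len(item_scores) < 2 or n_response_options < 2:
        raise ValueError("need at least two items and two response options")
    # Step 1: absolute difference for every pair of items.
    differences = [abs(a - b) for a, b in combinations(item_scores, 2)]
    # Step 2: average the differences.
    average_difference = sum(differences) / len(differences)
    # Step 3: divide by (number of response options - 1).
    return average_difference / (n_response_options - 1)


# Hypothetical scores on three items of a 7-point scale (differences 1, 2, 1).
print(round(apd([4, 5, 6], 7), 2))  # 0.22
```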

INTER-SCORER/INTER-RATER RELIABILITY
• Assesses the degree of agreement between two or more raters in their appraisals.
• Used for subjective methods of scoring.
• Variously referred to as scorer reliability, judge reliability, or observer reliability.
• To calculate: give the results from one test administration to two evaluators and correlate the two sets of markings from the different raters.
• Cohen's kappa can be used when there are two raters.
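For categorical ratings from two raters, Cohen's kappa compares the observed agreement with the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The sketch below is illustrative only: the function name `cohens_kappa` is an assumption, and the pass/fail ratings are hypothetical data, not taken from the slides.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning one category to each case."""
    n = len(rater1)
    if n == 0 or n != len(rater2):
        raise ValueError("both raters must rate the same non-empty set of cases")
    # Observed proportion of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal category proportions.
    counts1 = Counter(rater1)
    counts2 = Counter(rater2)
    expected = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical pass (P) / fail (F) judgments on ten essays.
rater_a = ["P", "P", "F", "P", "F", "F", "P", "P", "F", "P"]
rater_b = ["P", "F", "F", "P", "F", "P", "P", "P", "F", "P"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.58
```

Here the raters agree on 8 of 10 essays (p_o = .80) and chance agreement is .52, so kappa is .28/.48, noticeably lower than the raw agreement rate.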



