Student: ___________________________________________________________________________
1. The meaning of reliability in the psychometric sense differs from the meaning of
reliability in the "everyday" use of that word in that
A. reliability in the "everyday" sense is usually "a good thing."
B. reliability in the psychometric sense is usually "a good thing."
C. reliability in the psychometric sense has greater implications.
D. None of these
7. A Wall Street Securities firm that is actually located on Wall Street is testing a group of
candidates for their aptitude in finance and business. As the testing begins, an
unexpected "Occupy Wall Street" sit-in takes place. From a psychometric perspective in
the context of this testing, the sit-in is viewed as
A. systematic error.
B. random error.
C. test administration error.
D. background error.
8. A test entails behavioral observation and rating of front desk clerks to determine
whether or not they greet guests with a smile. Which type of error is this test most
susceptible to?
A. test administration error
B. test construction error
C. examiner-related error
D. polling error
10. Stanley (1971) wrote that in classical test theory, a so-called "true score" is "not the
ultimate fact in the book of the recording angel." By this, Stanley meant that
A. it would be imprudent to trust in Divine influence when estimating variance.
B. the amount of test variance that is true relative to error may never be known.
C. it is near impossible to separate fact from fiction with regard to "true scores."
D. All of these
11. The term test heterogeneity BEST refers to the extent to which test items measure
A. different factors.
B. the same factor.
C. a unifactorial trait.
D. a nonhomogeneous trait.
15. One of the problems associated with classical test theory has to do with
A. the fact that the notion of a "true score" on a test has great intuitive appeal.
B. the fact that CTT assumptions are often characterized as "weak."
C. its assumptions concerning the equivalence of all items on a test.
D. the fact that its assumptions allow for its application in most situations.
16. Which of the following is NOT an alternative to classical test theory cited in your
text?
A. generalizability theory
B. representational theory
C. domain sampling theory
D. latent trait theory
17. Item response theory is to latent trait theory as observer reliability is to
A. generalizability theory.
B. domain sampling theory.
C. odd-even reliability.
D. inter-scorer reliability.
18. The multiple-choice test items on this examination are all examples of
A. dichotomous test items.
B. latent trait test items.
C. polytomous test items.
D. None of these
25. Why might ability test scores among testtakers most typically vary?
A. because of the true ability of the testtaker
B. because of irrelevant, unwanted influences
C. All of the above
D. None of the above
28. Which type of reliability estimate is obtained by correlating pairs of scores from the
same person (or people) on two different administrations of the same test?
A. a parallel-forms estimate
B. a split-half estimate
C. a test-retest estimate
D. an au-pair estimate
29. Which type of reliability estimate would be appropriate only when evaluating the
reliability of a test that measures a trait that is presumed to be relatively stable over time?
A. parallel-forms
B. alternate-forms
C. test-retest
D. split-half
32. Which of the following is TRUE for estimates of alternate- and parallel-forms
reliability?
A. Two test administrations with the same group are required.
B. Test scores may be affected by factors such as motivation, fatigue, or intervening
events like practice, learning, or therapy.
C. Item sampling is a source of error variance.
D. All of these
35. Which of the following types of reliability estimates is the most expensive due to the
costs involved in test development?
A. test-retest
B. parallel-form
C. internal-consistency
D. Spearman's rho
36. What term refers to the degree of correlation between all the items on a scale?
A. inter-item homogeneity
B. inter-item consistency
C. inter-item heterogeneity
D. parallel-form reliability
37. Test-retest estimates of reliability are referred to as measures of ________, and split-
half reliability estimates are referred to as measures of ________.
A. true scores; error scores
B. internal consistency; stability
C. interscorer reliability; consistency
D. stability; internal consistency
38. Which of the following is usually minimized when using split-half estimates of
reliability as compared with test-retest or parallel/alternate-form estimates of reliability?
A. time and expense
B. reliability and validity
C. reliability only
D. time spent in scoring and interpretation
39. Which of the following factors may influence a split-half reliability estimate?
A. fatigue
B. anxiety
C. item difficulty
D. All of these
40. Internal-consistency estimates of reliability are inappropriate for
A. reading achievement tests.
B. scholastic aptitude/intelligence tests.
C. word processing tests based on speed.
D. tests purporting to measure a single personality trait.
43. Typically, adding items to a test will have what effect on the test's reliability?
A. Reliability will decrease.
B. Reliability will increase.
C. Reliability will stay the same.
D. Reliability will first increase and then decrease.
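Item 43 turns on the Spearman-Brown prophecy formula, which predicts how reliability changes when a test is lengthened or shortened by a factor n. A minimal Python sketch (the starting reliability of .70 is illustrative, not from the text):

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability of a test whose length is changed by factor n,
    given current reliability r (Spearman-Brown prophecy formula)."""
    return (n * r) / (1 + (n - 1) * r)

# Doubling a test with reliability .70 raises the predicted reliability,
# while halving it lowers it.
print(round(spearman_brown(0.70, 2.0), 2))   # lengthened: higher
print(round(spearman_brown(0.70, 0.5), 2))   # shortened: lower
```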
45. If items from a test are measuring the same trait, estimates of reliability yielded from
split-half methods will typically be ________ as compared to estimates from KR-20.
A. higher
B. lower
C. similar
D. approximately the same
46. Which of the following is NOT an acceptable way to divide a test when using the split-
half reliability method?
A. Randomly assign items to each half of the test.
B. Assign odd-numbered items to one half and even-numbered items to the other half of
the test.
C. Assign the first-half of the items to one half of the test and the second half of the
items to the other half of the test.
D. Assign easy items to one half of the test and difficult items to the other half of the
test.
47. If items on a test are measuring very different traits, estimates of reliability yielded
from split-half methods will typically be ________ as compared with estimates from KR-
20.
A. higher
B. lower
C. similar
D. approximately the same
48. KR-20 is the statistic of choice for tests with which types of items?
A. multiple-choice
B. true-false
C. All of these
D. None of these
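KR-20 applies to dichotomously scored items, whether multiple-choice or true-false, since both are marked simply right or wrong. A hedged sketch of the computation, KR-20 = k/(k-1) × (1 - Σpq / total-score variance), on a made-up 0/1 score matrix:

```python
def kr20(scores):
    """KR-20 for a persons-by-items matrix of 0/1 scores:
    k/(k-1) * (1 - sum(p*q) / total-score variance)."""
    k = len(scores[0])                       # number of items
    n = len(scores)                          # number of test-takers
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n   # proportion passing item i
        pq += p * (1 - p)
    totals = [sum(row) for row in scores]
    m = sum(totals) / n
    var_total = sum((t - m) ** 2 for t in totals) / n   # population variance
    return (k / (k - 1)) * (1 - pq / var_total)

# Five test-takers, four dichotomous items (hypothetical data):
data = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 0, 0]]
print(round(kr20(data), 2))
```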
50. Which is NOT an assumption that should be met in order to use KR-21?
A. Items should be dichotomous.
B. Items should be of equal difficulty.
C. Items should be homogeneous.
D. Items should be scorable by computer.
51. Which of the following is generally the preferred statistic for obtaining a measure of
internal-consistency reliability?
A. KR-20
B. KR-21
C. Kendall's Tau
D. coefficient alpha
52. Coefficient alpha is appropriate to use with all of the following test formats EXCEPT
A. multiple-choice.
B. true-false.
C. short-answer for which partial credit is awarded.
D. essay exam with no partial credit awarded.
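Coefficient alpha generalizes the KR-20 logic to items that can take a range of score values: alpha = k/(k-1) × (1 - Σ item variances / total-score variance). A hedged sketch on a made-up persons-by-items matrix:

```python
def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items score matrix:
    k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(scores[0])                       # number of items
    def var(xs):                             # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Four test-takers, three items scored 0-2 (hypothetical data):
data = [[2, 2, 1], [1, 1, 1], [0, 1, 0], [2, 2, 2]]
print(round(cronbach_alpha(data), 2))
```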
58. Which BEST conveys the meaning of an inter-scorer reliability estimate of .90?
A. Ninety percent of the scores obtained are reliable.
B. Ninety percent of the variance in the scores assigned by the scorers was attributed to
true differences and 10% to error.
C. Ten percent of the variance in the scores assigned by the scorers was attributed to
true differences and 90% to error.
D. Ten percent of the test's items are in need of revision according to the majority of the
test's users.
59. When more than two scorers are used to determine inter-scorer reliability, the
statistic of choice is
A. Pearson r.
B. Spearman's rho.
C. KR-20.
D. coefficient alpha.
60. For determining the reliability of tests scored using nominal scales of measurement,
the statistic of choice is
A. Kendall's Tau.
B. the Kappa statistic.
C. KR-20.
D. coefficient alpha.
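The kappa statistic corrects observed inter-scorer agreement for the agreement expected by chance alone, which is what makes it suitable for nominal category judgments. A sketch of Cohen's kappa for two raters (the ratings are made-up data):

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters using nominal categories:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ratings_a)
    cats = set(ratings_a) | set(ratings_b)
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_chance = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in cats
    )
    return (p_obs - p_chance) / (1 - p_chance)

# Two raters classifying ten cases into categories "x" and "y":
a = ["x", "x", "x", "y", "y", "x", "y", "x", "y", "x"]
b = ["x", "x", "y", "y", "y", "x", "y", "x", "x", "x"]
print(round(cohens_kappa(a, b), 2))
```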
63. If a time limit is long enough to allow test-takers to attempt all items, and if some
items are so difficult that no test-taker is able to obtain a perfect score, then the test is
referred to as a ________ test.
A. speed
B. power
C. reliable
D. valid
65. Which type(s) of reliability estimates would be appropriate for a speed test?
A. test-retest
B. alternate-form
C. split-half from two independent testing sessions
D. All of these
66. Which of the following would result in the LEAST appropriate estimate of reliability for
a speed test?
A. test-retest
B. alternate-form
C. split-half from a single administration of the test
D. split-half from two independent testing sessions
67. A Kuder-Richardson (KR) or split-half estimate of reliability for a speed test would
provide an estimate that is
A. spuriously low.
B. spuriously high.
C. insignificant.
D. equal to a test-retest method.
68. A measure of clerical speed is obtained by a test that has respondents alphabetize
index cards. The manual for this test cites a split-half reliability coefficient for a single
administration of the test of .95. What might you conclude?
A. The test is highly reliable.
B. The published reliability estimate is spuriously low and would have been higher had
another estimate been used.
C. The split-half estimate should not have been used in this instance.
D. Clerical speed is too vague a construct to measure.
69. The Spearman-Brown formula can be used for which types of tests?
A. speed and multiple-choice
B. true-false and multiple-choice
C. speed, true-false, and multiple-choice
D. trade school and driving tests
75. The fact that the length of a test influences the size of the reliability coefficient is
based on which theory of measurement?
A. classical test theory (CTT)
B. generalizability theory
C. domain sampling theory
D. item response theory (IRT)
76. Which estimate of reliability is most consistent with the domain sampling theory?
A. test-retest
B. alternate-form
C. internal-consistency
D. interscorer
77. Classical reliability theory estimates the portion of a test score that is attributed to
________, and domain sampling theory estimates ________.
A. specific sources of variation; error
B. error; specific sources of variation
C. the skills being measured; variation
D. the skills being measured; content knowledge
80. The standard deviation of a theoretically normal distribution of test scores obtained
by one person on equivalent tests is
A. the standard error of the difference between means.
B. the standard error of measurement.
C. the standard deviation of the reliability coefficient.
D. the variance.
81. Which of the following is NOT a part of the formula for the standard error of
measurement for a particular test?
A. the validity of the test
B. the reliability of the test
C. the standard deviation of the group of test scores
D. Both b and c
82. "Sixty-eight percent of the scores for a particular test fall between 58 and 61" is a
statement regarding
A. the utility of a test.
B. the reliability of a test.
C. the validity of a test.
D. None of these
83. The standard error of measurement of a particular test of anxiety is 8. A student
earns a score of 60. What is the confidence interval for this test score at the 95% level?
A. 52-68
B. 40-68
C. 44-76
D. 36-84
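The arithmetic behind item 83: a 95% confidence interval is the obtained score plus or minus roughly two standard errors of measurement (z = 1.96), so 60 ± 16 gives the 44-76 band. A quick check:

```python
def confidence_interval(score, sem, z=1.96):
    """Confidence interval around an obtained score: score +/- z * SEM."""
    return (score - z * sem, score + z * sem)

# Obtained score of 60, SEM of 8, 95% level:
lo, hi = confidence_interval(60, 8)
print(round(lo, 2), round(hi, 2))   # close to the 44-76 band
```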
84. As the confidence level increases, the range of scores into which a single test
score falls is likely to
A. decrease.
B. increase.
C. remain the same.
D. alternately decrease and increase.
86. If the standard deviations of two tests are identical but the reliability is lower for Test
A as compared to Test B, then the standard error of measurement will be ________ for Test
A as compared with Test B.
A. higher
B. lower
C. the same
D. None of these
87. Which statistic can help the test user determine how large a difference must exist for
scores yielded from two different tests to be considered statistically different?
A. standard error of measurement between two scores
B. standard error of the difference between two scores
C. observed variance minus error variance
D. standard error of the difference between two means
88. The standard error of the difference between two scores is larger than the standard
error of measurement for either score because the standard error of the difference
between the two scores is affected by
A. the true score variance of each score.
B. the standard deviation of each score summed.
C. the measurement error inherent in both scores.
D. All of these
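The claim in item 88 follows from the formula for the standard error of the difference, sqrt(SEM1^2 + SEM2^2), which is always at least as large as either SEM alone. A sketch with hypothetical SEM values:

```python
import math

def sed(sem1, sem2):
    """Standard error of the difference between two scores:
    sqrt(SEM1**2 + SEM2**2) -- never smaller than the larger SEM."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

# With hypothetical SEMs of 3 and 4, the SED exceeds both:
print(sed(3.0, 4.0))   # 5.0
```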
92. The universe score in Cronbach et al.'s generalizability theory is analogous to the
________ in classical test theory.
A. coefficient of generalizability
B. true score
C. standard deviation
D. internal-consistency estimate
93. In classical test theory, there exists only one true score. In Cronbach's generalizability
theory, how many "true scores" exist?
A. one
B. as many as the number of times the test is administered to the same individual
C. many, depending on the number of different universes
D. None of these
96. In general, which of the following is TRUE of the relationship between the magnitude
of the test-retest reliability estimate and the length of the interval between test
administrations?
A. The longer the interval, the lower the reliability coefficient.
B. The longer the interval, the higher the reliability coefficient.
C. The magnitude of the reliability coefficient is typically not affected by the length of the
interval between test administrations.
D. The magnitude of the reliability coefficient is always affected by the length of the
interval between test administrations, but one cannot predict how it is affected.
97. What is the difference between alternate forms and parallel forms of a test?
A. Alternate forms do not necessarily yield test scores with equal means and variances.
B. Alternate forms are designed to be equivalent only with regard to level of difficulty.
C. Alternate forms are different only with respect to how they are administered.
D. There are no differences between alternate and parallel forms of a test.
98. Coefficient alpha is the reliability estimate of choice for tests
A. with dichotomous items and binary scoring.
B. with homogeneous items.
C. that can be scored along a continuum of values.
D. that contain heterogeneous item content and binary scoring.
99. In which type(s) of reliability estimates would test construction NOT be a significant
source of error variance?
A. test-retest
B. alternate-form
C. split-half
D. Kuder-Richardson
100. If the variance of either variable is restricted by the sampling procedures used, then
the magnitude of the coefficient of reliability will be
A. lowered.
B. raised.
C. unaffected.
D. affected only in tests with a true-false format.
104. A psychologist administers a test and the test-taker scores a 52. If the cut-off score
for eligibility for a particular program is 50, what index will best help the psychologist
determine how much confidence to place in the test-taker's obtained score of 52?
A. the standard error of difference
B. the standard error of measurement
C. measures of central tendency: mean, median, or mode
D. measures of variability such as the standard deviation
105. Which of the following is TRUE of both the standard error of measurement and the
standard error of difference?
A. Both provide confidence levels.
B. Both can be used to compute confidence intervals for short answer tests.
C. Both can be used to compare performance between two different tests.
D. Both are abbreviated by SEM.
107. A police officer administers a breathalyzer test to a suspected drunk driver, does
not put on his glasses to read the meter, and as a result, mistakenly records the blood
alcohol level. This is the kind of mistake that is BEST associated with which type of
reliability estimate?
A. test-retest
B. interscorer
C. internal-consistency
D. situational
108. Which of the following statements is TRUE regarding the differences between a
power test and a speed test?
A. Power tests involve physical strength; speed tests do not.
B. In a power test, the testtaker has time to complete all items; in a speed test, a specific
time limit is imposed.
C. In a power test, a broad range of knowledge is assessed; in a speed test, a narrower
range of knowledge is assessed.
D. Both b and c
109. The index that allows a test user to compare two people's scores on a specific test
to determine if the true scores are likely to be different is
A. the standard error of the mean.
B. the standard error of the difference.
C. the standard deviation.
D. the correlation coefficient.
112. A test of attention span has a reliability coefficient of .84. The average score on the
test is 10, with a standard deviation of 5. Lawrence received a score of 64 on the test.
We can be 95% sure that Lawrence's "true" attention span score falls between
A. 63 and 65.
B. 62 and 66.
C. 60 and 68.
D. 54 and 74.
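Item 112 takes two steps: SEM = SD × sqrt(1 - r), then an interval of about two SEMs around the obtained score. With r = .84 and SD = 5, SEM = 2, so the 95% band is roughly 60 to 68. Checked numerically:

```python
import math

def sem_from_reliability(sd, r):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r)

sem = sem_from_reliability(5, 0.84)
print(round(sem, 2))                                 # 2.0
print(round(64 - 2 * sem), round(64 + 2 * sem))      # roughly 60 to 68
```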
113. By definition, estimates of reliability can range from _______ to _______.
A. -3.00; +3.00
B. 1; 10
C. 0; 1
D. -1 to 1
114. Using estimates of internal consistency, which of the following tests would likely
yield the highest reliability coefficients?
A. a test of general intelligence
B. a test of achievement in a basic skill such as mathematics
C. a test of reading comprehension
D. a test of vocational interest
115. What type of reliability estimate is appropriate for use in a comparison of "Form A"
to "Form B" of a picture vocabulary test?
A. test-retest
B. alternate-forms
C. inter-rater
D. internal-consistency
116. What index of reliability would you use to compare two evaluators' assessments of
a group of job applicants?
A. KR-20
B. coefficient alpha
C. the Kappa statistic
D. the Spearman-Brown correction
119. A test containing 100 items is revised by deleting 20 items. What might be
expected to happen to the magnitude of the reliability estimate for that test?
A. It will be expected to increase.
B. It will be expected to decrease.
C. It will be expected to stay the same.
D. It cannot be determined based on the information provided.
121. The greater the proportion of the total variance attributed to true variance, the
more ____________ the test.
A. scientific
B. variable
C. reliable
D. expensive
122. A score earned by a testtaker on a psychological test may BEST be viewed as equal
to
A. the raw score plus the observed score.
B. the error score.
C. the true score.
D. the true score plus error.
125. Which of the following is TRUE about systematic and unsystematic error in the
assessment of physical and psychological abuse?
A. Few sources of unsystematic error exist, due to the nature of what is being assessed.
B. Few sources of systematic error exist.
C. Gender represents a source of systematic error.
D. None of these
127. In Chapter 5 of your textbook, you read of the "writing surface on a school desk
riddled with heart carvings, the legacy of past years' students who felt compelled to
express their eternal devotion to someone now long forgotten." This imagery was
designed to graphically illustrate sources of error variance during test
A. development.
B. administration.
C. scoring.
D. interpretation.
128. In the Chapter 5 Meet an Assessment Professional feature, Dr. Bryce B. Reeve noted
the necessity for very brief questionnaires in his work due to the fact that many of his
clients were:
A. young children with very short attention spans.
B. seriously ill and would find taking tests burdensome.
C. visually impaired and unable to focus for an extended period of time.
D. All of these
129. In the Chapter 5 Meet an Assessment Professional feature, Dr. Bryce B. Reeve cited
an experience in which he learned that the "Excellent" response category on a test was
best translated as meaning ______ in Chinese?
A. "super bad"
B. "superlative"
C. "bad"
D. None of these
130. The items of a personality test are characterized as heterogeneous in nature. This
tells us that the test measures
A. aspects of family history.
B. ability to relate to the opposite sex.
C. unconscious motivation.
D. more than one trait.
B. the ___th in a series of formulas developed by Cronbach.
C. a 20th-century revision of a Galtonian expression.
D. None of these
133. Most reliability coefficients, regardless of the specific type of reliability they are
measuring, range in value from:
A. -1 to +1
B. 0 to 100
C. 0 to 1.
D. negative infinity to positive infinity
134. All indices of reliability provide an index that is a characteristic of a particular
A. test.
B. group of test scores.
C. trait.
D. approach to measurement.
135. The precise amount of error inherent in the reliability estimate published in a test
manual will vary with
A. the purchase price of the test (the more expensive, the less the error).
B. the sample of test-takers from which the data were drawn.
C. the population of test users actually using a published test.
D. All of these
137. A test of infant development contains three scales: (1) Cognitive Ability, (2) Motor
Development, and (3) Behavior Rating. Because these three scales are designed to
measure different characteristics (that is, they are not homogeneous), it would be
inappropriate to combine the three scales in calculating estimates of the test's
A. alternate-forms reliability.
B. internal-consistency reliability.
C. test-retest reliability.
D. interrater reliability.
138. The fact that young children develop rapidly and in "growth spurts" is a problem
when it comes to the estimation of which type of reliability for an infant development
scale?
A. internal-consistency reliability
B. alternate-forms reliability
C. test-retest reliability
D. interrater reliability
139. In the language of psychological testing and assessment, reliability BEST refers to
A. how well a test measures what it was originally designed to measure.
B. the complete lack of any systematic error.
C. the proportion of total variance that can be attributed to true variance.
D. whether or not a test publisher consistently publishes high quality instruments.
140. Because of the unique problems in assessing very young children, which of the
following would be the BEST practice when attempting to estimate the reliability of tests
designed to measure cognitive and motor abilities in infants?
A. Use relatively short test-retest intervals.
B. Use relatively long test-retest intervals.
C. Do not use the test-retest method for estimating reliability of the test.
D. Use only inter-scorer reliability estimates.
143. The directions for scoring a particular motor ability test instruct the examiner to
"Give credit if the child holds his hands open most of the time." Because what
constitutes "most of the time" is not specifically defined, directions such as these could
result in lowered reliability estimates for
A. test-retest reliability.
B. alternate-form reliability.
C. inter-rater reliability.
D. parallel forms reliability.
144. A vice president (VP) of personnel employs a "Corporate Screening Test" in the
hiring process. For future testing purposes, the VP maintains records of scores achieved
by __________ as opposed to ___________ in order to avoid restriction of range effects.
A. job applicants; hired employees
B. hired employees; job applicants
C. successful employees; hired employees
D. successful employees; other corporate officers
146. The Everyday Psychometrics for Chapter 5 dealt with psychometric aspects of the
Breathalyzer. We learned that in the state of New Jersey, it is legal and proper to
administer a Breathalyzer test to a drunk driver
A. only at the arrest scene.
B. at police headquarters.
C. even if the officer is intoxicated.
D. while a suspect is sucking on a breath mint.
149. Advocates of generalizability theory prefer the use of which of the following terms
as an alternative to the use of the term "reliability"?
A. generalizability
B. universality
C. regularity
D. dependability
151. As used in Chapter 5 of your text, the term inflation of variance is synonymous with
A. restriction of variance.
B. restriction of range.
C. inflation of range.
D. None of these
156. Why isn't IRT used more by "mom-and-pop" test developers such as classroom
teachers?
A. most classroom teachers were trained in generalizability theory
B. IRT has no application in classroom tests
C. applying IRT requires statistical sophistication
D. All of these
158. Which of the following is NOT an assumption attendant to the use of IRT?
A. the assumption of unidimensionality
B. the assumption of heteroscedasticity
C. the assumption of local independence
D. the assumption of monotonicity
159. In IRT, the single, continuous latent construct being measured is often symbolized
by the Greek letter:
A. alpha.
B. beta.
C. psi.
D. theta.
160. If some of the items on a test were locally dependent, it would be reasonable to
expect that:
A. all test items were designed for members of a specific culture.
B. all test items were measuring the exact same thing.
C. some test items were measuring something different than other test items.
D. some test items were structured in a dichotomous format and others were structured
in a polytomous format.
162. The probabilistic relationship between a testtaker's response to a test item and that
testtaker's level on the latent construct being measured by the test is expressed in
graphic form by
A. an item characteristic curve.
B. an item response curve.
C. an item trace line.
D. All of these
163. It's an IRT tool that is useful in helping test users better understand the range of
theta over which an item is most useful. It's called
A. an item response curve.
B. an information function.
C. an item trace line.
D. None of these
164. An IRT tool useful in helping test users abbreviate a "long form" of a test to a "short
form" is the
A. item response curve.
B. information function.
C. item trace line.
D. None of these
165. In an IRT information curve, the term information magnitude may BEST be
understood as referring to
A. theta.
B. the range of the underlying construct.
C. precision.
D. difficulty.
166. Test items with little discriminative ability prompt the test developer to consider the
possibility that
A. the content of the item does not match the construct measured by the other items in
the scale.
B. the item is poorly worded and needs to be rewritten.
C. the item is too complex for the educational level of the population.
D. All of these
168. The fact that cultural factors may be operating to weaken an item's ability to
discriminate between groups is evident from:
A. Lord's treatise entitled Item Response Theory.
B. an item characteristic curve.
C. an information function.
D. Georg Rasch's unauthorized biography, You Can Never Be Too Rich or Too "Rasch."
169. A difference between the use of coefficient alpha and IRT for evaluating a test's
reliability is that with IRT, it is possible to learn
A. how the precision of a scale varies depending on the level of the construct being
measured.
B. how the level of the construct being measured varies depending on variations in the
item characteristic curve.
C. the precise numerical value for the test's total interitem consistency.
D. All of these
c5 Key
5. Which is TRUE of measurement error?
A. Like error in general, measurement error may be random or systematic.
B. Unlike error in general, measurement error may be random or systematic.
C. Measurement error is always random.
D. Measurement error is always systematic.
Cohen - Chapter 05 #5
9. Error in the reporting of spousal abuse may occur because
A. one partner simply forgets all of the details of the abuse.
B. one partner misunderstands the instructions for reporting.
C. one partner is ashamed to report the abuse.
D. All of these
Cohen - Chapter 05 #9
10. Stanley (1971) wrote that in classical test theory, a so-called "true score" is "not the
ultimate fact in the book of the recording angel." By this, Stanley meant that
A. it would be imprudent to trust in Divine influence when estimating variance.
B. the amount of test variance that is true relative to error may never be known.
C. it is near impossible to separate fact from fiction with regard to "true scores."
D. All of these
11. The term test heterogeneity BEST refers to the extent to which test items measure
A. different factors.
B. the same factor.
C. a unifactorial trait.
D. a nonhomogeneous trait.
15. One of the problems associated with classical test theory has to do with
A. the notion that there is a "true score" on a test has great intuitive appeal.
B. the fact that CTT assumptions are often characterized as "weak."
C. its assumptions concerning the equivalence of all items on a test.
D. its assumptions allow for its application in most situations.
16. Which of the following is NOT an alternative to classical test theory cited in your
text?
A. generalizability theory
B. representational theory
C. domain sampling theory
D. latent trait theory
18. The multiple-choice test items on this examination are all examples of
A. dichotomous test items.
B. latent trait test items.
C. polytomous test items.
D. None of these
25. Why might ability test scores among testtakers most typically vary?
A. because of the true ability of the testtaker
B. because of irrelevant, unwanted influences
C. All of the above
D. None of the above
28. Which type of reliability estimate is obtained by correlating pairs of scores from the
same person (or people) on two different administrations of the same test?
A. a parallel-forms estimate
B. a split-half estimate
C. a test-retest estimate
D. an au-pair estimate
29. Which type of reliability estimate would be appropriate only when evaluating the
reliability of a test that measures a trait that is presumed to be relatively stable time?
A. parallel-forms
B. alternate-forms
C. test-retest
D. split-half
32. Which of the following is TRUE for estimates of alternate- and parallel-forms
reliability?
A. Two test administrations with the same group are required.
B. Test scores may be affected by factors such as motivation, fatigue, or intervening
events like practice, learning, or therapy.
C. Item sampling is a source of error variance.
D. All of these
35. Which of the following types of reliability estimates is the most expensive due to the
costs involved in test development?
A. test-retest
B. parallel-form
C. internal-consistency
D. Spearman's rho
36. What term refers to the degree of correlation between all the items on a scale?
A. inter-item homogeneity
B. inter-item consistency
C. inter-item heterogeneity
D. parallel-form reliability
37. Test-retest estimates of reliability are referred to as measures of ________, and split-
half reliability estimates are referred to as measures of ________.
A. true scores; error scores
B. internal consistency; stability
C. interscorer reliability; consistency
D. stability; internal consistency
39. Which of the following factors may influence a split-half reliability estimate?
A. fatigue
B. anxiety
C. item difficulty
D. All of these
43. Typically, adding items to a test will have what effect on the test's reliability?
A. Reliability will decrease.
B. Reliability will increase.
C. Reliability will stay the same.
D. Reliability will first increase and then decrease.
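The pattern behind item 43 is quantified by the Spearman-Brown prophecy formula, r' = nr / (1 + (n - 1)r), which predicts reliability when a test's length is multiplied by a factor n. A minimal Python sketch (the function name is illustrative, not from the text):

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by factor n.

    r : reliability of the original test
    n : factor by which the number of items is multiplied
    """
    return n * r / (1 + (n - 1) * r)

# Doubling a test whose reliability is .70 raises the predicted
# reliability to about .82:
print(round(spearman_brown(0.70, 2), 3))  # → 0.824
```

Note that the formula predicts an increase only when the added items measure the same trait as the originals.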
45. If items from a test are measuring the same trait, estimates of reliability yielded from
split-half methods will typically be ________ as compared to estimates from KR-20.
A. higher
B. lower
C. similar
D. approximately the same
47. If items on a test are measuring very different traits, estimates of reliability yielded
from split-half methods will typically be ________ as compared with estimates from KR-
20.
A. higher
B. lower
C. similar
D. approximately the same
48. KR-20 is the statistic of choice for tests with which types of items?
A. multiple-choice
B. true-false
C. All of these
D. None of these
51. Which of the following is generally the preferred statistic for obtaining a measure of
internal-consistency reliability?
A. KR-20
B. KR-21
C. Kendall's Tau
D. coefficient alpha
52. Coefficient alpha is appropriate to use with all of the following test formats EXCEPT
A. multiple-choice.
B. true-false.
C. short-answer for which partial credit is awarded.
D. essay exam with no partial credit awarded.
59. When more than two scorers are used to determine inter-scorer reliability, the
statistic of choice is
A. Pearson r.
B. Spearman's rho.
C. KR-20.
D. coefficient alpha.
60. For determining the reliability of tests scored using nominal scales of measurement,
the statistic of choice is
A. Kendall's Tau.
B. the Kappa statistic.
C. KR-20.
D. coefficient alpha.
63. If a time limit is long enough to allow test-takers to attempt all items, and if some
items are so difficult that no test-taker is able to obtain a perfect score, then the test is
referred to as a ________ test.
A. speed
B. power
C. reliable
D. valid
65. Which type(s) of reliability estimates would be appropriate for a speed test?
A. test-retest
B. alternate-form
C. split-half from two independent testing sessions
D. All of these
67. A Kuder-Richardson (KR) or split-half estimate of reliability for a speed test would
provide an estimate that is
A. spuriously low.
B. spuriously high.
C. insignificant.
D. equal to a test-retest method.
68. A measure of clerical speed is obtained by a test that has respondents alphabetize
index cards. The manual for this test cites a split-half reliability coefficient for a single
administration of the test of .95. What might you conclude?
A. The test is highly reliable.
B. The published reliability estimate is spuriously low and would have been higher had
another estimate been used.
C. The split-half estimate should not have been used in this instance.
D. Clerical speed is too vague a construct to measure.
69. The Spearman-Brown formula can be used for which types of tests?
A. speed and multiple-choice
B. true-false and multiple-choice
C. speed, true-false, and multiple-choice
D. trade school and driving tests
75. The fact that the length of a test influences the size of the reliability coefficient is
based on which theory of measurement?
A. classical test theory (CTT)
B. generalizability theory
C. domain sampling theory
D. item response theory (IRT)
76. Which estimate of reliability is most consistent with the domain sampling theory?
A. test-retest
B. alternate-form
C. internal-consistency
D. interscorer
77. Classical reliability theory estimates the portion of a test score that is attributed to
________, and domain sampling theory estimates ________.
A. specific sources of variation; error
B. error; specific sources of variation
C. the skills being measured; variation
D. the skills being measured; content knowledge
80. The standard deviation of a theoretically normal distribution of test scores obtained
by one person on equivalent tests is
A. the standard error of the difference between means.
B. the standard error of measurement.
C. the standard deviation of the reliability coefficient.
D. the variance.
81. Which of the following is NOT a part of the formula for the standard error of
measurement for a particular test?
A. the validity of the test
B. the reliability of the test
C. the standard deviation of the group of test scores
D. Both b and c
84. As the confidence interval increases, the range of scores into which a single test
score falls is likely to
A. decrease.
B. increase.
C. remain the same.
D. alternately decrease and increase.
87. Which statistic can help the test user determine how large a difference must exist for
scores yielded from two different tests to be considered statistically different?
A. standard error of measurement between two scores
B. standard error of the difference between two scores
C. observed variance minus error variance
D. standard error of the difference between two means
88. The standard error of the difference between two scores is larger than the standard
error of measurement for either score because the standard error of the difference
between the two scores is affected by
A. the true score variance of each score.
B. the standard deviation of each score summed.
C. the measurement error inherent in both scores.
D. All of these
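The relationship in items 87 and 88 follows from the fact that the standard error of the difference combines the measurement error of both scores: SE_diff = sqrt(SEM1^2 + SEM2^2), where each SEM = SD * sqrt(1 - r). A minimal Python sketch, assuming both tests are reported on the same scale (same standard deviation); function names are illustrative:

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement for one test."""
    return sd * sqrt(1 - reliability)

def se_diff(sd, r1, r2):
    """Standard error of the difference between scores on two tests
    that share the same standard deviation."""
    return sqrt(sem(sd, r1) ** 2 + sem(sd, r2) ** 2)

# With SD = 15 and both reliabilities at .90, the SEM of each test
# is about 4.74, but the standard error of the difference is larger:
print(round(sem(15, 0.90), 2), round(se_diff(15, 0.90, 0.90), 2))  # → 4.74 6.71
```

The comparison makes the logic of item 88 concrete: the difference score inherits error from both measurements, so its standard error always exceeds either SEM alone.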
92. The universe score in Cronbach et al.'s generalizability theory is analogous to the
________ in classical test theory.
A. coefficient of generalizability
B. true score
C. standard deviation
D. internal-consistency estimate
93. In classical test theory, there exists only one true score. In Cronbach generalizability
theory, how many "true scores" exist?
A. one
B. as many as the number of times the test is administered to the same individual
C. many, depending on the number of different universes
D. None of these
96. In general, which of the following is TRUE of the relationship between the magnitude
of the test-retest reliability estimate and the length of the interval between test
administrations?
A. The longer the interval, the lower the reliability coefficient.
B. The longer the interval, the higher the reliability coefficient.
C. The magnitude of the reliability coefficient is typically not affected by the length of the
interval between test administrations.
D. The magnitude of the reliability coefficient is always affected by the length of the
interval between test administrations, but one cannot predict how it is affected.
97. What is the difference between alternate forms and parallel forms of a test?
A. Alternate forms do not necessarily yield test scores with equal means and variances.
B. Alternate forms are designed to be equivalent only with regard to level of difficulty.
C. Alternate forms are different only with respect to how they are administered.
D. There are no differences between alternate and parallel forms of a test.
99. In which type(s) of reliability estimates would test construction NOT be a significant
source of error variance?
A. test-retest
B. alternate-form
C. split-half
D. Kuder-Richardson
100. If the variance of either variable is restricted by the sampling procedures used, then
the magnitude of the coefficient of reliability will be
A. lowered.
B. raised.
C. unaffected.
D. affected only in tests with a true-false format.
104. A psychologist administers a test and the test-taker scores a 52. If the cut-off score
for eligibility for a particular program is 50, what index will best help the psychologist
determine how much confidence to place in the test-taker's obtained score of 52?
A. the standard error of difference
B. the standard error of measurement
C. measures of central tendency: mean, median, or mode
D. measures of variability such as the standard deviation
105. Which of the following is TRUE of both the standard error of measurement and the
standard error of difference?
A. Both provide confidence levels.
B. Both can be used to compute confidence intervals for short answer tests.
C. Both can be used to compare performance between two different tests.
D. Both are abbreviated by SEM.
107. A police officer administers a breathalyzer test to a suspected drunk driver, does
not put on his glasses to read the meter, and as a result mistakenly records the blood
alcohol level. This kind of mistake is BEST associated with which type of reliability
estimate?
A. test-retest
B. interscorer
C. internal-consistency
D. situational
108. Which of the following statements is TRUE regarding the differences between a
power test and a speed test?
A. Power tests involve physical strength; speed tests do not.
B. In a power test, the testtaker has time to complete all items; in a speed test, a
specific time limit is imposed.
C. In a power test, a broad range of knowledge is assessed; in a speed test, a narrower
range of knowledge is assessed.
D. Both b and c
109. The index that allows a test user to compare two people's scores on a specific test
to determine if the true scores are likely to be different is
A. the standard error of the mean.
B. the standard error of the difference.
C. the standard deviation.
D. the correlation coefficient.
112. A test of attention span has a reliability coefficient of .84. The average score on the
test is 10, with a standard deviation of 5. Lawrence received a score of 64 on the test.
We can be 95% sure that Lawrence's "true" attention span score falls between
A. 63 and 65.
B. 62 and 66.
C. 60 and 68.
D. 54 and 74.
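The interval in item 112 can be checked with the standard error of measurement, SEM = SD * sqrt(1 - r): here 5 * sqrt(.16) = 2, and a 95% interval is roughly the observed score plus or minus 1.96 SEM. A quick check in Python:

```python
from math import sqrt

sd, reliability, observed = 5, 0.84, 64
sem = sd * sqrt(1 - reliability)   # 5 * sqrt(0.16) = 2.0
lower = observed - 1.96 * sem      # ≈ 60.1
upper = observed + 1.96 * sem      # ≈ 67.9
print(sem, round(lower, 1), round(upper, 1))  # → 2.0 60.1 67.9
```

The resulting band of about 60 to 68 matches option C; texts that use the rougher plus-or-minus 2 SEM convention arrive at exactly 60 and 68.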
115. What type of reliability estimate is appropriate for use in a comparison of "Form A"
to "Form B" of a picture vocabulary test?
A. test-retest
B. alternate-forms
C. inter-rater
D. internal-consistency
116. What index of reliability would you use to compare two evaluators' assessments of
a group of job applicants?
A. KR-20
B. coefficient alpha
C. the Kappa statistic
D. the Spearman-Brown correction
119. A test containing 100 items is revised by deleting 20 items. What might be
expected to happen to the magnitude of the reliability estimate for that test?
A. It will be expected to increase.
B. It will be expected to decrease.
C. It will be expected to stay the same.
D. It cannot be determined based on the information provided.
121. The greater the proportion of the total variance attributed to true variance, the
more ____________ the test.
A. scientific
B. variable
C. reliable
D. expensive
125. Which of the following is TRUE about systematic and unsystematic error in the
assessment of physical and psychological abuse?
A. Few sources of unsystematic error exist, due to the nature of what is being assessed.
B. Few sources of systematic error exist.
C. Gender represents a source of systematic error.
D. None of these
127. In Chapter 5 of your textbook, you read of the "writing surface on a school desk
riddled with heart carvings, the legacy of past years' students who felt compelled to
express their eternal devotion to someone now long forgotten." This imagery was
designed to graphically illustrate sources of error variance during test
A. development.
B. administration.
C. scoring.
D. interpretation.
128. In the Chapter 5 Meet an Assessment Professional feature, Dr. Bryce B. Reeve noted
the necessity for very brief questionnaires in his work due to the fact that many of his
clients were:
A. young children with very short attention spans.
B. seriously ill and would find taking tests burdensome.
C. visually impaired and unable to focus for an extended period of time.
D. All of these
129. In the Chapter 5 Meet an Assessment Professional feature, Dr. Bryce B. Reeve cited
an experience in which he learned that the "Excellent" response category on a test was
best translated as meaning ______ in Chinese.
A. "super bad"
B. "superlative"
C. "bad"
D. None of these
[Question stem and initial options missing from the source]
…th in a series of formulas developed by Cronbach.
C. a 20th-century revision of a Galtonian expression.
D. None of these
133. Most reliability coefficients, regardless of the specific type of reliability they are
measuring, range in value from:
A. -1 to +1
B. 0 to 100
C. 0 to 1.
D. negative infinity to positive infinity
135. The precise amount of error inherent in the reliability estimate published in a test
manual will vary with
A. the purchase price of the test (the more expensive, the less the error).
B. the sample of test-takers from which the data were drawn.
C. the population of test users actually using a published test.
D. All of these
137. A test of infant development contains three scales: (1) Cognitive Ability, (2) Motor
Development, and (3) Behavior Rating. Because these three scales are designed to
measure different characteristics (that is, they are not homogeneous), it would be
inappropriate to combine the three scales in calculating estimates of the test's
A. alternate-forms reliability.
B. internal-consistency reliability.
C. test-retest reliability.
D. interrater reliability.
139. In the language of psychological testing and assessment, reliability BEST refers to
A. how well a test measures what it was originally designed to measure.
B. the complete lack of any systematic error.
C. the proportion of total variance that can be attributed to true variance.
D. whether or not a test publisher consistently publishes high quality instruments.
140. Because of the unique problems in assessing very young children, which of the
following would be the BEST practice when attempting to estimate the reliability of tests
designed to measure cognitive and motor abilities in infants?
A. Use relatively short test-retest intervals.
B. Use relatively long test-retest intervals.
C. Do not use the test-retest method for estimating reliability of the test.
D. Use only inter-scorer reliability estimates.
143. The directions for scoring a particular motor ability test instruct the examiner to
"Give credit if the child holds his hands open most of the time." Because what
constitutes "most of the time" is not specifically defined, directions such as these could
result in lowered reliability estimates for
A. test-retest reliability.
B. alternate-form reliability.
C. inter-rater reliability.
D. parallel forms reliability.
144. A vice president (VP) of personnel employs a "Corporate Screening Test" in the
hiring process. For future testing purposes, the VP maintains records of scores achieved
by __________ as opposed to ___________ in order to avoid restriction of range effects.
A. job applicants; hired employees
B. hired employees; job applicants
C. successful employees; hired employees
D. successful employees; other corporate officers
149. Advocates of generalizability theory prefer the use of which of the following terms
as an alternative to the use of the term "reliability"?
A. generalizability
B. universality
C. regularity
D. dependability
151. As used in Chapter 5 of your text, the term inflation of variance is synonymous with
A. restriction of variance.
B. restriction of range.
C. inflation of range.
D. None of these
156. Why isn't IRT used more by "mom-and-pop" test developers such as classroom
teachers?
A. most classroom teachers were trained in generalizability theory
B. IRT has no application in classroom tests
C. applying IRT requires statistical sophistication
D. All of these
158. Which of the following is NOT an assumption attendant to the use of IRT?
A. the assumption of unidimensionality
B. the assumption of heteroscedasticity
C. the assumption of local independence
D. the assumption of monotonicity
160. If some of the items on a test were locally dependent, it would be reasonable to
expect that:
A. all test items were designed for members of a specific culture.
B. all test items were measuring the exact same thing.
C. some test items were measuring something different than other test items.
D. some test items were structured in a dichotomous format and others were structured
in a polytomous format.
162. The probabilistic relationship between a testtaker's response to a test item and that
testtaker's level on the latent construct being measured by the test is expressed in
graphic form by
A. an item characteristic curve.
B. an item response curve.
C. an item trace line.
D. All of these
164. An IRT tool useful in helping test users abbreviate a "long form" of a test to a "short
form" is the
A. item response curve.
B. information function.
C. item trace line.
D. None of these
165. In an IRT information curve, the term information magnitude may BEST be
understood as referring to
A. theta.
B. the range of the underlying construct.
C. precision.
D. difficulty.
166. Test items with little discriminative ability prompt the test developer to consider the
possibility that
A. the content of the item does not match the construct measured by the other items in
the scale.
B. the item is poorly worded and needs to be rewritten.
C. the item is too complex for the educational level of the population.
D. All of these
168. The fact that cultural factors may be operating to weaken an item's ability to
discriminate between groups is evident from:
A. Lord's treatise entitled Item Response Theory.
B. an item characteristic curve.
C. an information function.
D. Georg Rasch's unauthorized biography, You Can Never Be Too Rich or Too "Rasch."
169. A difference between the use of coefficient alpha and IRT for evaluating a test's
reliability is that with IRT, it is possible to learn
A. how the precision of a scale varies depending on the level of the construct being
measured.
B. how the level of the construct being measured varies depending on variations in the
item characteristic curve.
C. the precise numerical value for the test's total interitem consistency.
D. All of these
Category: Cohen - Chapter 05
# of Questions: 169