
What is Reliability?
The idea behind reliability is that any significant results must be more than a one-off finding and
be inherently repeatable.
Other researchers must be able to perform exactly the same experiment, under the same
conditions and generate the same results. This will reinforce the findings and ensure that the
wider scientific community will accept the hypothesis.
Without this replication of statistically significant results, the experiment and research have not
fulfilled all of the requirements of testability.
This prerequisite is essential to a hypothesis establishing itself as an accepted scientific truth.
For example, if you are performing a time-critical experiment, you will be using some type of
stopwatch. Generally, it is reasonable to assume that the instruments are reliable and will keep
true and accurate time. However, diligent scientists take measurements many times, to minimize
the chances of malfunction and maintain validity and reliability.
At the other extreme, any experiment that uses human judgment is always going to come under
question.
For example, if observers rate certain aspects, as in Bandura's Bobo Doll Experiment, then the
reliability of the test is compromised. Human judgment can vary wildly between observers, and
the same individual may rate things differently depending upon time of day and current mood.
This means that such experiments are more difficult to repeat and are inherently less reliable.
Reliability is a necessary ingredient for determining the overall validity of a scientific
experiment and enhancing the strength of the results.
Debate between social and pure scientists, concerning reliability, is robust and ongoing.
What is Validity?
Validity encompasses the entire experimental concept and establishes whether the results
obtained meet all of the requirements of the scientific research method.

For example, there must have been randomization of the sample groups and appropriate care and
diligence shown in the allocation of controls.
Internal validity dictates how an experimental design is structured and encompasses all of the
steps of the scientific research method.
Even if your results are great, sloppy and inconsistent design will compromise your integrity in
the eyes of the scientific community. Internal validity and reliability are at the core of any
experimental design.
External validity is the process of examining the results and questioning whether there are any
other possible causal relationships.
Control groups and randomization will lessen external validity problems, but no method can be
completely successful. This is why the statistical proofs of a hypothesis are called significant
rather than absolute truth.
Any scientific research design only puts forward a possible cause for the studied effect.
There is always the chance that another unknown factor contributed to the results and findings.
This extraneous causal relationship may become more apparent, as techniques are refined and
honed.

Reliability
Reliability refers to the extent to which assessments are consistent. Just as we enjoy having
reliable cars (cars that start every time we need them), we strive to have reliable, consistent
instruments to measure student achievement. Another way to think of reliability is to imagine a
kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the
same scale should register five pounds for the potatoes an hour later (unless, of course, you
peeled and cooked them). Likewise, instruments such as classroom tests and national
standardized exams should be reliable: it should not make any difference whether a student
takes the assessment in the morning or afternoon, or on one day or the next.
Another measure of reliability is the internal consistency of the items. For example, if you create
a quiz to measure students' ability to solve quadratic equations, you should be able to assume
that if a student gets an item correct, he or she will also get other, similar items correct. The
following table outlines three common reliability measures.

Type of Reliability: How to Measure

Stability (Test-Retest): Give the same assessment twice, separated by days, weeks, or
months. Reliability is stated as the correlation between scores at Time 1 and Time 2.

Stability (Alternate Form): Create two forms of the same test (vary the items slightly).
Reliability is stated as the correlation between scores on Test 1 and Test 2.

Internal Consistency (Alpha, a): Compare one half of the test to the other half, or use
methods such as Kuder-Richardson Formula 20 (KR20) or Cronbach's Alpha.
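As a concrete sketch of the stability measures above: the test-retest (or alternate-form) coefficient is simply the Pearson correlation between the two sets of scores. The following Python snippet is illustrative only; the student scores are invented, not taken from any real assessment.

```python
# Scores for six students who took the same quiz twice, two weeks apart.
# These numbers are made up purely for illustration.
time1 = [78, 85, 62, 90, 71, 84]
time2 = [80, 88, 65, 87, 70, 86]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(time1, time2)
print(round(r, 3))  # -> 0.972 for these invented scores
```

A coefficient this close to 1.0 would mean students kept nearly the same relative standing on both administrations; real test-retest coefficients for classroom quizzes are typically lower.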

The values for reliability coefficients range from 0 to 1.0. A coefficient of 0 means no reliability
and 1.0 means perfect reliability. Since all tests have some error, reliability coefficients never
reach 1.0. Generally, if the reliability of a standardized test is above .80, it is said to have very
good reliability; if it is below .50, it would not be considered a very reliable test.
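To make the internal-consistency measure concrete, Cronbach's Alpha can be computed directly from an item-score matrix as alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The Python sketch below uses invented right/wrong quiz data purely for illustration.

```python
# Cronbach's Alpha for a small item-score matrix.
# Rows = students, columns = quiz items (1 = correct, 0 = incorrect).
# The scores are invented for illustration only.
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]

def variance(xs):
    """Population variance (divide by n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(scores[0])  # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
totals = [sum(row) for row in scores]  # each student's total score
alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))
print(round(alpha, 3))  # -> 0.551 for these invented scores
```

By the rule of thumb in the paragraph above, a coefficient of roughly .55 sits between "not very reliable" (.50) and "very good" (.80), suggesting these hypothetical quiz items are only moderately consistent with one another.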

Validity
Validity refers to the accuracy of an assessment -- whether or not it measures what it is supposed
to measure. Even if a test is reliable, it may not provide a valid measure. Let's imagine a
bathroom scale that consistently tells you that you weigh 130 pounds. The reliability
(consistency) of this scale is very good, but it is not accurate (valid) because you actually weigh
145 pounds (perhaps you re-set the scale in a weak moment)! Since teachers, parents, and school
districts make decisions about students based on assessments (such as grades, promotions, and
graduation), the validity inferred from the assessments is essential -- even more crucial than the
reliability. Also, if a test is valid, it is almost always reliable.
There are three ways in which validity can be measured. In order to have confidence that a test is
valid (and therefore the inferences we make based on the test scores are valid), all three kinds of
validity evidence should be considered.
Type of Validity: Definition; Example/Non-Example

Content: The extent to which the content of the test matches the instructional objectives.
Non-example: a semester or quarter exam that only includes content covered during the last
six weeks is not a valid measure of the course's overall objectives; it has very low
content validity.

Criterion: The extent to which scores on the test are in agreement with (concurrent
validity) or predict (predictive validity) an external criterion. Example: if the
end-of-year math tests in 4th grade correlate highly with the statewide math tests, they
would have high concurrent validity.

Construct: The extent to which an assessment corresponds to other variables, as predicted
by some rationale or theory. Example: if you can correctly hypothesize that ESOL students
will perform differently on a reading test than English-speaking students (because of
theory), the assessment may have construct validity.

So, does all this talk about validity and reliability mean you need to conduct statistical analyses
on your classroom quizzes? No, it doesn't. (Although you may, on occasion, want to ask one of
your peers to verify the content validity of your major assessments.) However, you should be
aware of the basic tenets of validity and reliability as you construct your classroom assessments,
and you should be able to help parents interpret scores for the standardized exams.
