Reliability concerns the ability of different researchers to make the same observations of
a given phenomenon if and when the observation is conducted using the same method(s)
and procedure(s)
Definition
The reliability of an instrument is the degree of consistency with which it
measures the attribute it is supposed to be measuring.
Example: if a scale gave a reading of 120 pounds for a person's weight one minute and a reading of 150 pounds the next minute, the scale would not be reliable.
Reliability can also be defined in terms of accuracy: an instrument can be said to be reliable if its
measures accurately reflect the true scores of the attribute under investigation.
The reliability of a measuring tool can be assessed in several different ways.
The method chosen depends to a certain extent on the nature of the instrument but
also on the aspect of the reliability concept that is of greatest interest.
The aspects that have received major attention are:
Stability,
Internal consistency, and
Equivalence.
1. Stability
The stability of a measure refers to the extent to which the same results are
obtained on repeated administrations of the instrument.
CORRELATION
Meaning of correlation:
Co-efficient of correlation
The coefficient of correlation is a statistical measure that expresses a relationship on a quantitative scale.
It is a single number that tells us to what extent two things are related.
Correlation may be classified by direction (same direction: positive or direct, +; opposite direction: negative or inverse, −), by the number of variable sets involved (only two sets: simple; more than two sets: multiple; partial: the relationship between two variables is examined while the other variables are held constant), and by form (linear: a straight line; non-linear: a curve).
Types of Correlation
Positive or Direct Correlation:
Positive perfect correlation exists when, as one variable increases, the other variable also increases (or, as one decreases, the other also decreases). Its maximum value is +1.
Negative or Indirect Correlation:
Negative perfect correlation occurs when, as one variable increases, the other decreases. Its maximum (absolute) value is −1.
Zero Correlation:
Zero correlation occurs when two variables show no relationship to each other; the coefficient has a value of zero.
Simple Correlation:
When only two variables are studied, changing either in the same or in opposite directions.
Multiple Correlation:
When three or more variables are studied, changing either in the same or in opposite directions.
Linear Correlation: if, corresponding to a unit change in one variable, there is a constant change in the other variable over the entire range of values.
Non-Linear Correlation: if the variables under study are graphed and the plotted points do not form a straight line.
Methods of determining Correlation
Graphic methods: scatter (dot) diagram and correlogram.
Mathematical methods: Pearson's coefficient of correlation, rank correlation coefficient, and the least-squares method.
Worked example: scores of ten students on the two halves of a test (X and Y, e.g. odd- and even-numbered items), with deviation scores and their products.

Student  Total  X    Y    x = X − 15.2  y = Y − 15.8  x²     y²     xy
A        40     20   20    4.8           4.2          23.04  17.64   20.16
B        28     15   13   −0.2          −2.8           0.04   7.84    0.56
C        35     19   16    3.8           0.2          14.44   0.04    0.76
D        38     18   20    2.8           4.2           7.84  17.64   11.76
E        22     10   12   −5.2          −3.8          27.04  14.44   19.76
F        20     12    8   −3.2          −7.8          10.24  60.84   24.96
G        35     16   19    0.8           3.2           0.64  10.24    2.56
H        33     16   17    0.8           1.2           0.64   1.44    0.96
I        31     12   19   −3.2           3.2          10.24  10.24  −10.24
J        28     14   14   −1.2          −1.8           1.44   3.24    2.16

Mean: Total = 31, X = 15.2, Y = 15.8;  SD: X = 3.26, Y = 3.99
Σx² = 95.6, Σy² = 143.6, Σxy = 73.4

r = Σxy / √(Σx² × Σy²) = 73.4 / √(95.6 × 143.6) ≈ 0.63
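To see where the sums in the table come from, here is a small Python sketch (not part of the original handout) that recomputes the deviations and the sums Σx², Σy², and Σxy from the two half-test score columns; treating X as the odd-item half is an assumption.

```python
# Recompute the deviation-score table above (values from the handout's example).
# Assumption: X holds one half of the test (e.g. odd items), Y the other half.
X = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]
Y = [20, 13, 16, 20, 12, 8, 19, 17, 19, 14]

n = len(X)
mean_x = sum(X) / n            # 15.2
mean_y = sum(Y) / n            # 15.8

x = [xi - mean_x for xi in X]  # deviations from the X mean
y = [yi - mean_y for yi in Y]  # deviations from the Y mean

sum_x2 = sum(d * d for d in x)                   # 95.6
sum_y2 = sum(d * d for d in y)                   # 143.6
sum_xy = sum(dx * dy for dx, dy in zip(x, y))    # 73.4

r = sum_xy / (sum_x2 * sum_y2) ** 0.5            # Pearson r ≈ 0.63
print(round(sum_x2, 2), round(sum_y2, 2), round(sum_xy, 2), round(r, 2))
```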
Spearman-Brown formula:
The split-half design in effect creates two comparable test administrations.
The items in a test are split into two tests that are equivalent in content and
difficulty.
Often this is done by splitting among odd and even numbered items.
This assumes that the assessment is homogenous in content.
Once the test is split, reliability is estimated as the correlation of two separate
tests with an adjustment for the test length.
Strategies
Spearman-Brown Prophecy Formula:
r1 = 2r / (1 + r)
Where:
r = the correlation coefficient computed on the split halves
r1 = the estimated reliability of the entire test
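As an illustration, a short Python sketch applies the prophecy formula to the half-test correlation obtained from the table above (r ≈ 0.63):

```python
# Spearman-Brown prophecy formula: step up the half-test correlation to an
# estimate of the reliability of the full-length test.
def spearman_brown(r_half):
    return (2 * r_half) / (1 + r_half)

r = 0.63                             # correlation between the two halves (from the table)
print(round(spearman_brown(r), 2))   # ≈ 0.77
```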
Student number  Total score  Half A (e.g. odd items)  Half B (e.g. even items)
1               55           28                       27
2               49           26                       23
3               78           36                       42
4               37           18                       19
5               44           23                       21
6               50           30                       20
7               58           30                       28
8               62           33                       29
9               48           23                       25
10              67           28                       39
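For illustration, the split-half procedure can be run end to end on the ten students above. The sketch below (not from the original handout) correlates the two half scores and then applies the Spearman-Brown step-up; the odd/even labelling of the two halves is an assumption.

```python
# Full split-half procedure for the ten students listed above.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

half_a = [28, 26, 36, 18, 23, 30, 30, 33, 23, 28]   # e.g. odd-numbered items
half_b = [27, 23, 42, 19, 21, 20, 28, 29, 25, 39]   # e.g. even-numbered items

r_half = pearson_r(half_a, half_b)            # correlation of the two halves
reliability = (2 * r_half) / (1 + r_half)     # Spearman-Brown step-up
print(round(r_half, 2), round(reliability, 2))   # ≈ 0.67 and ≈ 0.81
```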
Kuder and Richardson devised a procedure for estimating the reliability of a test
in 1937.
It has become the standard for estimating reliability for single administration of a
single form.
Kuder-Richardson measures inter-item consistency.
When schools have the capacity to maintain item-level data, the KR20 can be used, although it is a
challenging set of calculations to do by hand.
The rationale for Kuder and Richardson's most commonly used procedure is:
1. Securing the mean inter-correlation of the number of items (k) in the test.
2. Considering this to be the reliability coefficient for the typical item in the test.
3. Stepping up this average with the Spearman-Brown formula to estimate the
reliability coefficient of an assessment of k items
Each of N = 10 students answered k = 12 items scored 0 = incorrect, 1 = correct. Total scores (X), deviations from the mean (x), and squared deviations:

Student  Score (X)  x = X − 6.5   x²
1        11          4.5          20.25
2        10          3.5          12.25
3         9          2.5           6.25
4         7          0.5           0.25
5         7          0.5           0.25
6         6         −0.5           0.25
7         5         −1.5           2.25
8         4         −2.5           6.25
9         4         −2.5           6.25
10        2         −4.5          20.25

ΣX = 65, Mean = 6.5, Σx² = 74.5
Kuder-Richardson Formula 20
KR20 = (k / (k − 1)) × (1 − Σpq / σ²)
Where:
k = the number of items
p = the proportion of students answering an item correctly
q = 1 − p, the proportion answering it incorrectly
σ² = the variance of the total scores

For the 12 items in the example:
p:  0.9  0.9  0.8  0.7  0.7  0.5  0.5  0.5  0.4  0.3  0.2  0.1
q:  0.1  0.1  0.2  0.3  0.3  0.5  0.5  0.5  0.6  0.7  0.8  0.9
pq: 0.09 0.09 0.16 0.21 0.21 0.25 0.25 0.25 0.24 0.21 0.16 0.09
Σpq = 2.21

Variance: each student's score minus the mean score gives x; these deviations are squared (x²) and summed (Σx² = 74.5); the summed squares are divided by the number of students minus 1 (N − 1), so σ² = 74.5 / 9 ≈ 8.28.

KR20 = (12 / 11) × (1 − 2.21 / 8.28) ≈ 0.80
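The same KR20 computation can be sketched in a few lines of Python using the item p values and the total-score variance from the example (a minimal illustration, not part of the original handout):

```python
# KR-20 from the item p values and total-score variance of the example.
p = [0.9, 0.9, 0.8, 0.7, 0.7, 0.5, 0.5, 0.5, 0.4, 0.3, 0.2, 0.1]
q = [1 - pi for pi in p]                          # proportion incorrect per item
sum_pq = sum(pi * qi for pi, qi in zip(p, q))     # ≈ 2.21

k = len(p)                                        # 12 items
variance = 74.5 / 9                               # Σx² / (N − 1) ≈ 8.28

kr20 = (k / (k - 1)) * (1 - sum_pq / variance)
print(round(sum_pq, 2), round(kr20, 2))           # ≈ 2.21 and ≈ 0.80
```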
Kuder-Richardson Formula 21
When item level data or technological assistance is not available to assist in the
computation of a large number of cases and items, the simpler, and sometimes
less precise, reliability estimate known as Kuder-Richardson Formula 21 is an
acceptable general measure of internal consistency.
KR21 = (k / (k − 1)) × (1 − M(k − M) / (k × σ²))
Where:
M = the assessment mean (6.5)
k = the number of items in the assessment (12)
σ² = the variance of the total scores (8.28)

KR21 = (12 / 11) × (1 − 6.5 × (12 − 6.5) / (12 × 8.28)) ≈ 0.70
The formula simplifies the computation but will usually yield, as evidenced, a
lower estimate of reliability.
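A corresponding Python sketch of KR21, using only the mean, the number of items, and the variance from the example:

```python
# KR-21 needs only the test mean, the number of items, and the variance.
M = 6.5          # assessment mean
k = 12           # number of items
variance = 8.28  # total-score variance

kr21 = (k / (k - 1)) * (1 - (M * (k - M)) / (k * variance))
print(round(kr21, 2))   # ≈ 0.70, lower than the KR-20 estimate of ≈ 0.80
```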
Cronbach's alpha (α).
It is flexible and can be used with test formats that have more than one correct
answer.
The split-half estimates and KR20 are interchangeable with Cronbach's alpha.
When the test items are divided into two parts and the scores and variances of the
two parts are calculated, the split-half formula is algebraically equivalent to
Cronbach's alpha.
When the test format has only one correct answer, KR20 is algebraically
equivalent to Cronbach's alpha.
Therefore, the split-half and KR20 reliability estimates may be considered special
cases of Cronbach's alpha.
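As an illustration of the general formula, the sketch below computes Cronbach's alpha from a small, made-up matrix of item scores; the handout's own item-level responses are not reproduced, so the numbers here are purely hypothetical.

```python
# Cronbach's alpha for a small (hypothetical) matrix of item scores.
def cronbach_alpha(items):
    """items: one list of scores per item, all covering the same students."""
    def var(xs):                                   # sample variance, N - 1 denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    k = len(items)                                 # number of items
    n = len(items[0])                              # number of students
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(item) for item in items) / var(totals))

# Made-up scores of 5 students on 4 Likert-type items (illustration only;
# more than one "correct" answer is possible, unlike KR-20).
items = [
    [4, 3, 5, 2, 4],
    [3, 3, 4, 2, 5],
    [5, 2, 5, 1, 4],
    [4, 3, 4, 2, 4],
]
print(round(cronbach_alpha(items), 2))
```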
3. Equivalence
A researcher may be interested in estimating the reliability of a measure by way of
the equivalence approach under one of two circumstances:
(1) When different observers or researchers are using an instrument to measure the same
phenomena at the same time.
(2) When two presumably parallel instruments are administered to individuals at about the
same time.
Inter-rater agreement can be computed as:
Agreement = No. of agreements / (No. of agreements + disagreements)
For example, if two raters agree on 10 students' scores and disagree on 3:
Agreement = 10 / (10 + 3) ≈ 0.77
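A one-line check of the agreement ratio in Python (values from the example above):

```python
# Inter-rater agreement ratio from the example above.
agreements, disagreements = 10, 3
print(round(agreements / (agreements + disagreements), 2))   # ≈ 0.77
```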
- e.g. researcher effects and bias are part of the story that is told; they are not
controlled for
Interpretation of Reliability Coefficient
Group-level comparisons:
0.70 is sufficient
0.80 or greater is highly desirable
Individuals:
Higher reliability is required when scores are used for decisions about individuals, e.g. when test scores are the criterion for admission to a B.Sc Nursing programme.
Instrumentation
Changes occur in measurement instrument.
Pretest and posttest different.
Person collecting data becomes more skilled on second or
subsequent measurement.
Regression artifact
Tendency for extreme scores to regress toward the mean on a second
assessment.
Occurs because chance factors contributed to extreme scores.
Differential selection.
When participants forming comparison groups have different characteristics.
Additive and interactive effects.
Bias resulting from combination of two or more basic threats, such as
Selection-history effect.
Selection-maturation effect.
Differential attrition.
Difference in kinds of people dropping out of comparison groups.
External Validity
Ability to generalize across
Different people in single population.
Different populations of people.
Different settings.
Different times.
Different outcomes.
Different treatment outcomes.
Concern about validity (as well as reliability) is the primary reason thick
description is an essential component of the qualitative research enterprise
Resources:
Handout: Different Types of Notes
Example: ACY Site Visit Toolkit
Major Types of Validity in Qualitative Research
Descriptive Validity
Interpretive Validity
Theoretical Validity
External Validity (i.e., generalizability)
Descriptive Validity
Concerned with the factual accuracy of an account (that is, making sure one is not
making up or distorting the things one hears and sees)
All subsequent types of validity are dependent on the existence of this
fundamental aspect of validity
Strategy to obtain descriptive validity:
Investigator triangulation.
use of multiple investigators to collect and interpret data.
Interpretive Validity
At issue, then, is the accuracy of the concepts as applied to the perspective of the
individuals included in the account
Interpretive accounts are grounded in the language of the people studied and rely,
as much as possible, on their own words and concepts
Accurately portraying meaning attached by participants to phenomena.
Strategies used.
Participant feedback.
Use of low-inference descriptors
Theoretical Validity
Degree to which theoretical explanation fits the data.
Theoretical validity is thus concerned, not only with the validity of the concepts,
but also their postulated relationships to one another, and thus its goodness of
fit as an explanation.
Strategies used:
Extended fieldwork.
Theory triangulation.
Pattern matching.
Peer review.
Internal validity
Strategies used in qualitative research:
Researcher as detective.
Methods triangulation.
Data triangulation
External Validity
Not typically goal of Qualitative research.
People and settings not randomly selected.
Qualitative researchers interested in documenting particularistic
rather than universalistic findings.
Some suggest generalizations can be made.
Threats to internal validity mean that we do not know for sure what caused the effects that we
observed. Naturally, we like to hope that our interventions (experimental
treatments) or other known and measured independent variables caused
the effects. Unfortunately this is often not the case. For example, because
of their multidimensionality, confounded variables (which measure more
than one entity) are a threat to internal validity.
BIAS VERSUS RANDOM ERROR
If you have tight control over your experimental treatments (and, of course,
you used randomization), hopefully the only source of variance left in your
dependent variables will be random error.
Random error is just that: it is the random variation that occurs in
measurements across administrations, situations, or time periods. If
random error is VERY large, it can pose a threat to the reliability of your
variables.
HERE ARE SOME MAJOR, WELL-KNOWN THREATS TO INTERNAL VALIDITY
Self-selection effects : When subjects can select their own treatments,
we do not know whether the intervention or a pre-existing factor of the
subject caused the outcomes we observed. Random assignment can cure
this problem. The same problem can occur with differential selection, only
in this case, the investigator (rather than the subject) uses human
judgement to assign groups or subjects to treatment. A common variation
on this one is selecting extreme groups (see below).
Experimental mortality. When subjects discontinue the study and this
occurs more in certain conditions than others, we do not know how to
causally interpret the results because we don't know how subjects who
discontinued participation differed from those who completed it. A pretest
questionnaire given to all subjects may help clarify this, but watch out for
pretesting effects (a Solomon four-group design can help here).
History: Some kind of event occurred during the study period (such as
the assaults on New York City) and it is reactions to these events that
caused the outcomes we observed. Sometimes this is a medical event
(such as a flu outbreak) and sometimes an actual political or historical
event. Random assignment and a control group helps with this problem.
Maturation effects are especially important with children and youth
(such as college freshmen) but could happen at any age. For example,
young children's speech will normally become more complex, no matter
what reading method you use. Some studies have found that most college
students pull out of a depression within six months, even if they receive no
treatment whatsoever. A certain number of people will stop smoking,
whether they receive treatment or not. Again, a randomized control group
helps.
For example, some people get nervous when a doctor or nurse takes their
blood pressure, and their blood pressure goes up.
Reactivity poses a distinct threat to internal validity because we don't know
what caused the outcome: treatment effects or reactivity. The experimental
laboratory is probably the most reactive because people have come for an
experiment and they know their behavior is being watched. That is why so
many experimenters use deception. They are trying to divert subject
attention so that the "true behavior under study" is not altered.
Demand effects, in which subjects or respondents "follow orders" or
cooperate in ways that they almost never would under their routine daily
lives.
For example, respondents may present themselves on paper and pencil measures as more tolerant than they normally are.
Most people and groups (who allow you to study them at all) try to
cooperate with researchers. But some try to discover the purpose of the
intervention and thwart it, or "wreck the study." Social Reactance effects
refer to boomerang effects in which individuals or groups "fake bad," or
deliberately deviate from study procedures. This happens more among
college students, and others who suspect that their autonomy is being
threatened.
ON REACTIVITY AND INTERNAL VALIDITY. If demand effects are specific to
a particular situation, reactivity problems may also influence generalizing,
or external validity (this is how your Wiersma book treats the term.)
However, I think reactivity introduces an alternative causal explanation for
our results: they occurred, not because of the intervention or treatment,
but because people were so self-conscious that they changed their
behavior. This is internal validity. Reactivity may also statistically interact
with the experimental manipulation. For example, if the treatment somehow
impacts on self-esteem (say you are told that the stories you tell to the TAT
pictures indicate your leadership ability), reactivity may be a greater
internal validity problem.
You may need to deceive subjects about the true purpose of the
study (if you were told the purpose of the study was to measure
leadership qualities in sports, might you try to "shape up?")
Avoid collecting your own data; don't act as your own experimenter
or interviewer. Trade off with another student or apply for a small
University or external grant to hire someone.
Almost no one who collects data "likes deception" but without at least a
little, you may introduce reactivity and bias into your study. Do the
minimum (I prefer "omission" rather than deliberate lies) and be sure to
debrief subjects after their participation in the study is completed. This
means that you tell them the true purpose of the study and any
manipulations pertinent to their role in it. Debriefing is ethically mandatory,
and is especially important if your manipulation involved lies about the
student's performance ("no, you really didn't score in the 5th percentile on
that test, all feedback was bogus") or any other aspect of the "real world."
The more reliable the measure, the fewer subjects you need to see a small change in the measure. For example, a
controlled trial with 20 subjects in each group or a crossover with 10 subjects may be
sufficient to characterize even a small effect, if the measure is highly reliable. See the
details on the stats pages.
Pilot Studies
As a student researcher, you might not have enough time or resources to get a sample of
optimum size. Your study can nevertheless be a pilot for a larger study. Perform a pilot
study to develop, adapt, or check the feasibility of techniques, to determine the reliability
of measures, and/or to calculate how big the final sample needs to be. In the latter case,
the pilot should have the same sampling procedure and techniques as in the larger study.
For experimental designs, a pilot study can consist of the first 10 or so observations of a
larger study. If you get respectable confidence limits, there may be no point in continuing
to a larger sample. Publish and move on to the next project or lab!
If you can't test enough subjects to get an acceptably narrow confidence interval, you
should still be able to publish your finding, because your study will set useful bounds on
how big and how small the effect can be. A statistician can also combine your finding
with the findings of similar studies in something called a meta-analysis, which derives a
confidence interval for the effect from several studies. If your study is not published, it
can't contribute to the meta-analysis! Many reviewers and editors do not appreciate this
important point, because they are locked into thinking that only statistically significant
results are publishable.
WHAT TO MEASURE
In any study, you measure the characteristics of the subjects, and the independent and
dependent variables defining the research question. For experiments, you can also
measure mechanism variables, which help you explain how the treatment works.
Characteristics of Subjects
You must report sufficient information about your subjects to identify the population
group from which they were drawn. For human subjects, variables such as sex, age,
height, weight, socioeconomic status, and ethnic origin are common, depending on the
focus of the study.
Show the ability of athletic subjects as current or personal-best performance, preferably
expressed as a percent of world-record. For endurance athletes a direct or indirect
estimate of maximum oxygen consumption helps characterize ability in a manner that is
largely independent of the sport.
Dependent and Independent Variables
Usually you have a good idea of the question you want to answer. That question defines
the main variables to measure. For example, if you are interested in enhancing sprint
performance, your dependent variable (or outcome variable) is automatically some
measure of sprint performance. Cast around for the way to measure this dependent
variable with as much precision as possible.
Next, identify all the things that could affect the dependent variable. These things are the
independent variables: training, sex, the treatment in an experimental study, and so on.
For a descriptive study with a wide focus (a "fishing expedition"), your main interest is
estimating the effect of everything that is likely to affect the dependent variable, so you
include as many independent variables as resources allow. For the large sample sizes that
you should use in a descriptive study, including these variables does not lead to
substantial loss of precision in the effect statistics, but beware: the more effects you look
for, the more likely the true value of at least one of them lies outside its confidence
interval (a problem I call cumulative Type 0 error). For a descriptive study with a
narrower focus (e.g., the relationship between training and performance), you still
measure variables likely to be associated with the outcome variable (e.g., age-group, sex,
competitive status), because either you restrict the sample to a particular subgroup
defined by these variables (e.g., veteran male elite athletes) or you include the variables
in the analysis.
For an experimental study, the main independent variable is the one indicating when the
dependent variable is measured (e.g., before, during, and after the treatment). If there is a
control group (as in controlled trials) or control treatment (as in crossovers), the identity
of the group or treatment is another essential independent variable (e.g., Drug A, Drug B,
placebo in a controlled trial; drug-first and placebo-first in a crossover). These variables
obviously have an effect on the dependent variable, so you automatically include them in
any analysis.
Variables such as sex, age, diet, training status, and variables from blood or exercise tests
can also affect the outcome in an experiment. For example, the response of males to the
treatment might be different from that of females. Such variables account for individual
differences in the response to the treatment, so it's important to take them into account.
As for descriptive studies, either you restrict the study to one sex, one age, and so on, or
you sample both sexes, various ages, and so on, then analyze the data with these variables
included as covariates. I favor the latter approach, because it widens the applicability of
your findings, but once again there is the problem of cumulative Type 0 error for the
effect of these covariates. An additional problem with small sample sizes is loss of
precision of the estimate of the effect, if you include more than two or three of these
variables in the analysis.
Mechanism Variables
With experiments, the main challenge is to determine the magnitude and confidence
intervals of the treatment effect. But sometimes you want to know the mechanism of the
treatment--how the treatment works or doesn't work. To address this issue, try to find one
or more variables that might connect the treatment to the outcome variable, and measure
these at the same times as the dependent variable. For example, you might want to
determine whether a particular training method enhanced strength by increasing muscle
mass, so you might measure limb girths at the same time as the strength tests. When you
analyze the data, look for associations between change in limb girth and change in
strength. Keep in mind that errors of measurement will tend to obscure the true
association.
If there is an association between change in blood lipids, say, and change in strength, then the change in strength cannot all
be a placebo effect. Unless, of course, changes in blood lipids are related to susceptibility
to suggestion...unlikely, don't you think?
Validity (does your test measure what it's supposed to?)
face (lowest: does the test seem to measure what it's supposed to?)
Risks to Validity
Regression to the Mean (subjects with extreme scores on a first test tend to have
scores closer to the mean on a second test)
Selection-Maturation Interaction
Experimenter Bias (always try to blind the experimenter)
1.0 = perfect direct correlation; 0 = no correlation
Significance Tests
't' test
Types of reliability
Intra-rater
Inter-rater
Intra-session
Inter-session
ITEM ANALYSIS
Definition:
It is a procedure used to further assess the validity of a measure by separately evaluating
each item to determine whether or not that item discriminates in the same manner in which
the overall measure is intended to discriminate.
Purposes
To determine the adequacy of the items within a test as well as adequacy of the
test itself.
The results of an item analysis provide information about the difficulty of the
items and their ability to discriminate between better and poorer students.
To obtain more information on each item in order to determine the retention,
deletion, or revision of items.
To fix marks for the current class that just wrote the test.
More diagnostic information on students - another immediate payoff of item
analysis.
Classroom level:
- will tell which questions they were are all guessing on, or if you find a questions
which most of them found very difficult, you can reteach that concept.
Individual level:
- isolate specific errors that the child made
Build future tests, revise test items to make them better
Part of your continuing professional development
item analysis will help teach you how to become a better test writer
documenting just how good your evaluation is
useful for dealing with parents or principals if there's ever a dispute.
For example, 20 subjects take a test, and it is determined that 10 subjects answered item 1 correctly and the remaining 10 did not. The index of difficulty is:
P = 10 / 20 = 0.50
INTERPRETATION
P may range from 0 to 1.00.
The closer the value of P is to 1.00, the easier the item.
The closer the value of P is to zero, the more difficult the item.
When norm-referenced measures are employed, P values between 0.30 and 0.70 are
desirable, because extremely easy or difficult items have very little power to
discriminate or differentiate among subjects.
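A minimal Python sketch of the difficulty index for the example above (10 of 20 subjects answered item 1 correctly):

```python
# Item difficulty index P for the example above.
def difficulty_index(num_correct, num_examinees):
    return num_correct / num_examinees

p = difficulty_index(10, 20)    # 0.50
print(p, 0.30 <= p <= 0.70)     # True: within the desirable 0.30-0.70 band
```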
DISCRIMINATION INDEX
Definition: The extent to which a test item discriminates between examinees who obtain
high or low scores.
It is a numerical indicator of how the poorer students answered the item as
compared to how the better students answered the item.
To determine the index of discrimination value for given data the steps are:
1. Score each answer sheet and write the total score on the corner.
2. Sort the pile into rank order from the top score to the bottom score.
3. Identify those individuals ranked in the upper 25%.
4. Identify those individuals ranked in the lower 25%.
5. Place the remaining scores aside.
6. Determine the proportion of respondents in the top 25% (or 27%) who answered the item correctly (Pu).
7. Determine the proportion of respondents in the lower 25% (or 27%) who answered the item correctly (Pl).
8. Calculate the discrimination index by subtracting Pl from Pu:
DI = Pu − Pl
E.g.: 60 students take a test. The top 16 scores and bottom 16 scores form the upper
and lower groups.
For item No. 1, 12 of the 16 students in the upper group answered the item correctly,
while 7 of the 16 students in the lower group answered correctly.
The index of discrimination for item No. 1:
DI = (12 − 7) / 16 = 0.31
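The same calculation as a short Python sketch (values from the worked example above):

```python
# Discrimination index for the worked example above.
def discrimination_index(correct_upper, correct_lower, group_size):
    pu = correct_upper / group_size   # proportion correct in the upper group
    pl = correct_lower / group_size   # proportion correct in the lower group
    return pu - pl                    # DI = Pu - Pl

print(round(discrimination_index(12, 7, 16), 2))   # ≈ 0.31
```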
Upper 1/4 (% correct)  Lower 1/4 (% correct)  Index (D)
90                     20                      0.7
80                     70                      0.1
100                    100                     0.0
100                    50                      0.5
50                     20                      0.3
60                     100                    −0.4
INTERPRETATION
The discrimination index ranges from −1.00 to +1.00.
For a small group of students, an ID that exceeds 0.20 is satisfactory for an item.
For a larger group the ID should be higher, because more difference between groups
would be expected.
A positive ID value is desirable and indicates that the item is discriminating in the
same manner as the total test.
i.e. those who score high on the test tend to respond correctly to the item, while those
who score low tend to respond incorrectly.
A negative value suggests that the item is not discriminating in the same way as
the total test.
i.e. respondents who obtain low scores on the measure tend to get the item correct,
while those who score high on the measure tend to answer the item incorrectly.
A negative value indicates that an item is faulty and needs improvement.
Use the following table as a guideline to determine whether an item (or its corresponding
instruction) should be considered for revision.
Item Discrimination (D)    Item Difficulty: High    Medium    Low
D <= 0%                    review                   review    review
0% < D < 30%               ok                       review    ok
D >= 30%                   ok                       ok        ok