
Reliability

Reliability concerns the ability of different researchers to make the same observations of
a given phenomenon if and when the observation is conducted using the same method(s)
and procedure(s)
Definition
The reliability of an instrument is the degree of consistency with which it
measures the attribute it is supposed to be measuring.
Ex: If a scale gave a reading of 120 pounds for a person's weight one minute, and a reading of 150 pounds the next minute, the scale would not be reliable.
Reliability can also be framed in terms of accuracy: an instrument can be said to be reliable if its measures accurately reflect the true scores of the attribute under investigation.
The reliability of a measuring tool can be assessed in several different ways.
The method chosen depends to a certain extent on the nature of the instrument but
also on the aspect of the reliability concept that is of greatest interest.
The aspects that have received major attention are:
Stability,
Internal consistency, and
Equivalence.
1. Stability
The stability of a measure refers to the extent to which the same results are
obtained on repeated administrations of the instrument.

Assessments of the stability of a measuring tool are derived through procedures that evaluate:
1. Test-retest reliability.
Test Retest Method
It is a more conservative method of estimating reliability.
Method:
The researcher administers the same test to a sample of individuals on two occasions and compares the scores obtained.
The comparison procedure is performed objectively by computing a reliability coefficient.
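As an illustration, here is a minimal sketch of the comparison step in Python; the two score lists are invented, and the reliability coefficient used is Pearson's r, which is introduced in the next section:

```python
# A minimal sketch of the test-retest comparison, using invented scores.
# statistics.correlation (Python 3.10+) computes Pearson's r.
from statistics import correlation

first_administration = [40, 28, 35, 38, 22, 20, 35, 33, 31, 28]
second_administration = [38, 30, 34, 40, 24, 19, 33, 35, 30, 27]

# The test-retest reliability coefficient is Pearson's r between the two
# administrations; values near 1.0 indicate a stable instrument.
r = correlation(first_administration, second_administration)
print(f"test-retest reliability coefficient = {r:.2f}")
```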

CORRELATION

Meaning of correlation:

Karl Pearson is the father of correlation.


Correlation means a possible connection or relationship or interdependence
between the values of two or more variables of the same phenomenon or
individual series.
Correlation is concerned with describing the degree of relationship between two
related variables.

Co-efficient of correlation
The co-efficient of correlation is a statistical method in which the relationship is expressed on a quantitative scale.
The co-efficient of correlation is a single number that tells us to what extent two things are related.
Meaning: The coefficient of correlation is the degree to which two variables are interrelated. It gives the degree of correlation.
Types of Correlation
On the basis of nature, correlation can be classified by:
Direction - same direction (positive / direct) or opposite direction (negative / indirect).
Number of sets - only two sets (simple), more than two sets (multiple), or partial (change in one variable depending on the others).
Form of change - linear (straight line) or non-linear (curve type).

Types of Correlation
Positive or Direct Correlation:
Positive correlation exists when, as one variable decreases, the other variable decreases, or as one variable increases, the other variable also increases. Its maximum value is +1 (perfect positive correlation).
Negative or Indirect Correlation:
Negative correlation occurs when, as one variable increases, the other decreases. Its maximum value is -1 (perfect negative correlation).
Zero Correlation:
Zero correlation occurs when two variables show no relationship to each other; the coefficient has the value of zero.

Simple Correlation:
When only two variables are studied, changing either in the same or opposite direction.
Multiple Correlation:
When three or more variables are studied, changing either in the same or opposite directions.
Linear Correlation: if, corresponding to a unit change in one variable, there is a constant change in the other variable over the entire range of values.
Non-Linear Correlation: if the variables under study are graphed and the plotted points do not form a straight line.
Methods of determining Correlation

Graphic methods:
1. Scatter diagram (dotogram)
2. Simple graph (correlogram)

Mathematical methods:
1. Karl Pearson's co-efficient of correlation (Pearson's 'r')
2. Spearman's rank correlation co-efficient
3. Deviation method
4. Least square method
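To illustrate one of the mathematical methods listed above, here is a minimal sketch of Spearman's rank correlation; the two mark lists are invented, and ties are ignored for simplicity:

```python
# A minimal sketch of Spearman's rank correlation: rank each series,
# then apply the rank-difference formula rho = 1 - 6*sum(d^2)/(n*(n^2 - 1)).
def ranks(values):
    """Rank from 1 (smallest) upward; ties are not handled in this sketch."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

marks_statistics = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
marks_anatomy = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
print(f"rho = {spearman_rho(marks_statistics, marks_anatomy):.2f}")  # about 0.67
```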


2. Internal Consistency
Composite scales and tests that involve a summation of items are often evaluated in terms of their internal consistency.
An instrument may be said to be internally consistent or homogeneous to the extent that all of its subparts measure the same characteristic.
This is the most widely used method.
These procedures are popular not only because they are economical (they require only one test administration), but also because they are the best means of assessing one of the most important sources of measurement error in psychosocial instruments: the sampling of items.
Pearson Product Moment Correlation Coefficient
The Pearson product-moment correlation coefficient is a statistical measure of the degree of relationship between the two halves of the test.
Worked example: even-item and odd-item half scores for ten students (X = score on even-numbered items, Y = score on odd-numbered items):

Student  Total  Even (X)  Odd (Y)    x      y      x²      y²      xy
A        40     20        20         4.8    4.2   23.04   17.64   20.16
B        28     15        13        -0.2   -2.8    0.04    7.84    0.56
C        35     19        16         3.8    0.2   14.44    0.04    0.76
D        38     18        20         2.8    4.2    7.84   17.64   11.76
E        22     10        12        -5.2   -3.8   27.04   14.44   19.76
F        20     12         8        -3.2   -7.8   10.24   60.84   24.96
G        35     16        19         0.8    3.2    0.64   10.24    2.56
H        33     16        17         0.8    1.2    0.64    1.44    0.96
I        31     12        19        -3.2    3.2   10.24   10.24  -10.24
J        28     14        14        -1.2   -1.8    1.44    3.24    2.16

MEAN     31     15.2      15.8                  Σ=95.6  Σ=143.6  Σ=73.4
SD              3.26      3.99

where
x = each student's score minus the mean on even-numbered items.
y = each student's score minus the mean on odd-numbered items.
N = the number of students.
SD = the standard deviation.
SD is computed by
squaring the deviation (e.g., x²) for each student,
summing the squared deviations (e.g., Σx²),
dividing this total by the number of students minus 1 (N - 1), and
taking the square root.
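For these data, the correlation between the even and odd halves (the Pearson formula in deviation form) is r = Σxy / √(Σx² × Σy²) = 73.4 / √(95.6 × 143.6) ≈ 0.63.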

The Spearman-Brown formula

Reliability Estimation Using a Split-half Methodology

Spearman-Brown formula:
The split-half design in effect creates two comparable test administrations.
The items in a test are split into two tests that are equivalent in content and
difficulty.
Often this is done by splitting among odd and even numbered items.
This assumes that the assessment is homogenous in content.
Once the test is split, reliability is estimated as the correlation of two separate
tests with an adjustment for the test length.

Strategies
Spearman-Brown Prophecy Formula

r1 = 2r / (1 + r)

Where:
r = the correlation coefficient computed on the split halves.
r1 = the estimated reliability of the entire test.
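Applied to the even/odd example above, where r ≈ 0.63: r1 = 2(0.63) / (1 + 0.63) ≈ 0.77, so the full-length test is estimated to be more reliable than either half alone.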

The Spearman-Brown formula

Ex: Self Esteem Scale data

Subject number  Total Score  Odd numbers score  Even numbers score
1               55           28                 27
2               49           26                 23
3               78           36                 42
4               37           18                 19
5               44           23                 21
6               50           30                 20
7               58           30                 28
8               62           33                 29
9               48           23                 25
10              67           28                 39
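Here is a minimal sketch of the whole split-half procedure applied to these data; the function names are illustrative, and only plain Python is used:

```python
from math import sqrt

# Self Esteem Scale half-scores from the table above.
odd_half = [28, 26, 36, 18, 23, 30, 30, 33, 23, 28]
even_half = [27, 23, 42, 19, 21, 20, 28, 29, 25, 39]

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r):
    """Step the half-test correlation up to an estimate for the full test."""
    return 2 * r / (1 + r)

r = pearson_r(odd_half, even_half)
print(f"half-test r = {r:.2f}")                                      # about 0.67
print(f"estimated full-test reliability = {spearman_brown(r):.2f}")  # about 0.81
```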

COEFFICIENT ALPHA / CRONBACH'S ALPHA FORMULA & KUDER-RICHARDSON FORMULA

Kuder and Richardson devised a procedure for estimating the reliability of a test in 1937.
It has become the standard for estimating reliability for a single administration of a single form.
Kuder-Richardson measures inter-item consistency.
It is tantamount to doing a split-half reliability on all combinations of items resulting from different splittings of the test.
When schools have the capacity to maintain item level data, the KR20, which is a challenging set of calculations to do by hand, can be computed directly from those data.
The rationale for Kuder and Richardson's most commonly used procedure is:
1. Securing the mean inter-correlation of the number of items (k) in the test.
2. Considering this to be the reliability coefficient for the typical item in the test.
3. Stepping up this average with the Spearman-Brown formula to estimate the reliability coefficient of an assessment of k items.
Worked example: a test of K = 12 items, scored 0 = incorrect / 1 = correct, given to N = 10 students.

Student score   x = score - mean   x²
11               4.5              20.25
10               3.5              12.25
 9               2.5               6.25
 7               0.5               0.25
 7               0.5               0.25
 6              -0.5               0.25
 5              -1.5               2.25
 4              -2.5               6.25
 4              -2.5               6.25
 2              -4.5              20.25

Total = 65       Mean = 6.5        Σx² = 74.5

Kuder-Richardson Formula 20

KR20 = (K / (K - 1)) × (1 - Σpq / σ²), with Σx² = 74.5

Where:
p - is the proportion of students passing a given item.
q - is the proportion of students that did not pass a given item.
σ² - is the variance of the total score on this assessment.
K - is the number of items on the test.

Item:      1     2     3     4     5     6     7     8     9     10    11    12
p values:  0.9   0.9   0.8   0.7   0.7   0.5   0.5   0.5   0.4   0.3   0.2   0.1
q values:  0.1   0.1   0.2   0.3   0.3   0.5   0.5   0.5   0.6   0.7   0.8   0.9
pq:        0.09  0.09  0.16  0.21  0.21  0.25  0.25  0.25  0.24  0.21  0.16  0.09

Σpq = 2.21

Variance
x - is the student score minus the mean score;
x² - each deviation is squared and the squares are summed (Σx² = 74.5);
σ² - the summed squares are divided by the number of students minus 1 (N - 1): σ² = 74.5 / 9 ≈ 8.28.

Kuder-Richardson Formula 21

When item level data or technological assistance is not available to assist in the computation of a large number of cases and items, the simpler, and sometimes less precise, reliability estimate known as Kuder-Richardson Formula 21 is an acceptable general measure of internal consistency.

KR21 = (K / (K - 1)) × (1 - M(K - M) / (K × σ²))

Where:
M - the assessment mean (6.5)
K - the number of items in the assessment (12)
σ² - the variance (8.28)

The formula simplifies the computation but will usually yield, as evidenced, a lower estimate of reliability.
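A minimal sketch of both formulas, assuming the summary numbers from the worked example above (K = 12, Σpq = 2.21, M = 6.5, σ² = 8.28); the function names are illustrative:

```python
# KR-20 and KR-21 from summary statistics of a 0/1-scored test.
def kr20(k, sum_pq, variance):
    """KR-20 = (K / (K - 1)) * (1 - sum(pq) / variance)."""
    return (k / (k - 1)) * (1 - sum_pq / variance)

def kr21(k, mean, variance):
    """KR-21 = (K / (K - 1)) * (1 - M * (K - M) / (K * variance))."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(f"KR-20 = {kr20(12, 2.21, 8.28):.2f}")  # about 0.80
print(f"KR-21 = {kr21(12, 6.5, 8.28):.2f}")   # about 0.70, the lower estimate
```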

Cronbach's alpha (α)

The most commonly used reliability coefficient.

It is based on the internal consistency of items in the tests.

It is flexible and can be used with test formats that have more than one correct answer.

The split-half estimates and KR20 are exchangeable with Cronbach's alpha:
When a test is divided into two parts and the scores and variances of the two parts are calculated, the split-half formula is algebraically equivalent to Cronbach's alpha.
When the test format has only one correct answer, KR20 is algebraically equivalent to Cronbach's alpha.
Therefore, the split-half and KR20 reliability estimates may be considered special cases of Cronbach's alpha.
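A minimal sketch of Cronbach's alpha computed from an item-score matrix; the five examinees and four items below are invented for illustration:

```python
# Cronbach's alpha = (k/(k-1)) * (1 - sum(item variances) / total variance).
from statistics import variance  # sample variance, n - 1 denominator

def cronbach_alpha(rows):
    k = len(rows[0])                     # number of items
    items = list(zip(*rows))             # column-wise item scores
    totals = [sum(row) for row in rows]  # each examinee's total score
    return (k / (k - 1)) * (1 - sum(variance(col) for col in items) / variance(totals))

scores = [  # rows = examinees, columns = items (invented data)
    [3, 4, 3, 5],
    [2, 2, 3, 3],
    [4, 5, 4, 5],
    [1, 2, 2, 2],
    [3, 3, 4, 4],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # about 0.95 for these data
```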
3. Equivalence
A researcher may be interested in estimating the reliability of a measure by way of
the equivalence approach under one of two circumstances:
(1) When different observers or researchers are using an instrument to measure the same
phenomena at the same time.
(2) When two presumably parallel instruments are administered to individuals at about the same time.

In both situations the aim is to determine the consistency or equivalence of the instruments in yielding measurements of the same traits in the same people.
Threat to internal consistency:
The risk of error is higher with observer ratings and classifications.
Consistency can be enhanced by:
careful training,
the development of clearly defined and non-overlapping categories,
the use of a small number of categories, and the use of behaviors that tend to be
molecular rather than molar.
Strategies
Inter-rater (or inter-observer) reliability:
Estimated by having two or more trained observers watching some event simultaneously and independently recording the relevant variables according to a predetermined plan or coding system.
The resulting records can then be used to compute an index of equivalence or agreement.
For certain types of observational data (e.g., ratings and frequencies), a correlation coefficient technique can demonstrate the strength of the relationship between one observer's ratings or frequencies and another's.
Another approach is to compute reliability as a proportion of agreements, using the formula:
No. of agreements / (No. of agreements + disagreements)
Inter-rater agreement

Ex: performance-based assessment

Preconditions are:

A scoring scale should be clear and unambiguous in what it demands of the


student by way of demonstration.
Evaluators who are fully conversant with the scale and how the scale relates to the
student performance, and who are in agreement with other evaluators on the
application of the scale to the student demonstration.
The end result is that all evaluators are of a common mind with regard to the
student performance - all evaluators should give the demonstration the same or
nearly the same ratings.
Training evaluators for consistency should be included
Gronlund (1985) identified three types of rater error:
1) Personal bias - which may occur when a rater consistently uses only part of the scoring scale, either being overly generous, overly severe, or evidencing a tendency to the center of the scale in scoring.
2) A "halo effect" - which may occur when a rater's overall perception of a student positively or negatively colors the rating given to a student.
3) A logical error - which may occur when a rater confuses distinct elements of an analytic scale. This confounds rating on the items.
Establishing Rater Agreement Percentages

Example: two raters independently score 13 students (columns: Student, Score: Rater 1, Score: Rater 2, Agreement). The raters agree on 10 students and disagree on 3.

No. of agreements / (No. of agreements + disagreements) = 10 / (10 + 3) ≈ 0.77
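A minimal sketch of this agreement index; the original table's rater scores were not preserved, so the two rating lists below are invented to reproduce 10 agreements out of 13:

```python
# Rater agreement percentage: agreements / (agreements + disagreements).
rater1 = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4, 3, 5, 2]
rater2 = [4, 3, 5, 2, 4, 3, 4, 1, 2, 4, 2, 5, 3]

agreements = sum(a == b for a, b in zip(rater1, rater2))
total = len(rater1)
print(f"agreement = {agreements} / {total} = {agreements / total:.2f}")  # 0.77
```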

Enhancing the Reliability of Qualitative Research

Researchers can enhance the reliability of their qualitative research by:

Standardizing data collection techniques and protocols

Again, documenting, documenting, documenting (e.g., time, day, and place observations were made)

Inter-rater reliability (a consideration during the analysis phase of the


research process).

Implications for Handling Threats to Validity and Reliability

In qualitative research, such prior elimination of threats to validity is less possible


because:

qualitative research is more inductive, and

it focuses primarily on understanding particulars rather than generalizing


to universals.

Qualitative researchers view threats as an opportunity for learning

- e.g. researcher effects and bias are part of the story that is told; they are not
controlled for
Interpretation of Reliability Coefficient
Group level comparison:

E.g.: Male vs Female, Smoker vs Non-Smoker, Experimental vs Control.

0.70 is sufficient.
0.80 or greater is highly desirable.
Individuals:

When making decisions about individuals, the reliability coefficient should be at least 0.90.

E.g.: test scores used as the criterion for admission to a B.Sc Nursing programme.

Validity and Reliability


Validity is not a commodity that can be purchased with techniques. Rather, validity is, like integrity, character and quality, to be assessed relative to purposes and circumstances.
Validity
In general, validity concerns the degree to which an account is accurate or truthful
In qualitative research, validity concerns the degree to which a finding is judged
to have been interpreted in a correct way.
Validity Issues in Design of Quantitative Research
Experimental reliability - repeatability of the results of a study.
Experimental validity - correctness of the inference made from the study results.
Confounding extraneous variable - an extraneous variable that, if not controlled for, will result in a confounded result.
Types of Validity in Quantitative Research
Statistical conclusion validity.
Internal validity.
Construct validity.
External validity.
Statistical Conclusion Validity
Refers to

Correctness of inference that independent and dependent variables are


related.
Checked with significance testing.
Strength of relationship between independent and dependent variables.
Checked by computing an effect size estimate.
Internal Validity
Definition - accuracy of inference that two variables are causally related.
Types of causal relationships:
causal description - IV → DV.
causal explanation - check for mediator and moderator variables.
Criteria for Inferring Causation
Types of evidence needed
1. That IV and DV are related.
2. That changes in IV precede changes in DV.
3. That relationship between IV and DV not due to a confounding extraneous variable (a
third variable).
Threats to Internal Validity
Ambiguous temporal precedence
Inability to specify which variable is causal.
Exists in non-experimental research studies.
History
Event other than treatment affects DV.
Can exist in one-group pretest-posttest design.
Maturation
Physical or mental changes occurring over time that influence DV.
Examples: age, learning, boredom, fatigue.
Testing
Changes in scores on posttest result of having taken pretest.

Instrumentation
Changes occur in measurement instrument.
Pretest and posttest different.
Person collecting data becomes more skilled on second or
subsequent measurement.
Regression artifact
Tendency for extreme scores to regress toward the mean on a second
assessment.
Occurs because chance factors contributed to extreme scores.
Differential selection.
When participants forming comparison groups have different characteristics.
Additive and interactive effects.
Bias resulting from combination of two or more basic threats, such as
Selection-history effect.
Selection-maturation effect.
Differential attrition.
Difference in kinds of people dropping out of comparison groups.
External Validity
Ability to generalize across
Different people in single population.
Different populations of people.
Different settings.
Different times.
Different outcomes.
Different treatment outcomes.

Types of External Validity and Threats to Each


Population validity - ability to generalize from sample to target population and
across its subpopulations.
Two inferential steps for population validity.
From sample to accessible population.
From accessible to target population.
Ecological validity - generalizing across settings.
Reactivity can be a threat
Alteration in performance because participants know they are in
study.
Temporal validity - generalizing across time.
Treatment variation validity - generalizing across variation in treatments.
Different researchers may not administer treatments in same way or from
time to time.
Outcome validity - generalizing across different but related DVs.
Construct Validity
Definition - extent to which higher order construct is accurately represented in
study.
Fostered by having good definition and explanation of meaning of construct and
good measure of it.
Problem - identifying prototypical features of constructs.
Operationalism - explaining the specific set of steps or operations used to measure the construct.

Threats to Construct Validity


Treatment diffusion - participants in one treatment condition are exposed to some of the other treatment condition.
Face validity = you look at the measure and it
makes sense. The weakest criterion.
Content validity = you more systematically
examine or inventory the aspects of the construct and determine whether you have
captured them in your measures
Construct Validity by Criterion =
You may ask others to assess whether your measures seem reasonable to them
Predictive: it correlates in the expected way with something it ought to be able to predict
E.g. love ought to predict marriage (or gazing?)
Concurrent: able to distinguish between groups it should be able to distinguish
between (e.g. between those dating each other and those not)
Convergent: different measures of the same construct correlate with each other (this is
what we are doing in the survey exercise)
Divergent: able to distinguish measures of this construct from related but different
constructs (e.g. loving vs. liking, or knowledge from test-taking ability)
Research Validity or Trustworthiness in Qualitative Research
Validity in Qualitative research.
Refers to plausible, credible, or trustworthy research.
Threats to validity in qualitative research:
Researcher bias - finding what you want to find; selective attention.
Key strategy to reduce researcher bias:
Reflexivity - self-reflection by researcher about his or her biases.
Assessing the Validity of Qualitative Research:
Can another researcher read your field (and other types of) notes (i.e., the explication of your logic) and come to the same understandings of a given phenomenon?

Concern about validity (as well as reliability) is the primary reason thick
description is an essential component of the qualitative research enterprise
Major Types of Validity in Qualitative Research
Descriptive Validity
Interpretive Validity
Theoretical Validity
External Validity (i.e., generalizability)
Descriptive Validity
Concerned with the factual accuracy of an account (that is, making sure one is not
making up or distorting the things one hears and sees)
All subsequent types of validity are dependent on the existence of this
fundamental aspect of validity
Descriptive Validity
Strategy to obtain descriptive validity:
Investigator triangulation.
use of multiple investigators to collect and interpret data.
Interpretive Validity
At issue, then, is the accuracy of the concepts as applied to the perspective of the
individuals included in the account
Interpretive accounts are grounded in the language of the people studied and rely,
as much as possible, on their own words and concepts
Interpretive Validity
Accurately portraying meaning attached by participants to phenomena.

Strategies used.
Participant feedback.
Use of low-inference descriptors
Theoretical Validity
Degree to which theoretical explanation fits the data.
Theoretical validity is thus concerned, not only with the validity of the concepts,
but also their postulated relationships to one another, and thus its goodness of
fit as an explanation.
Theoretical Validity
Degree to which theoretical explanation fits the data.
Strategies used:
Extended fieldwork.
Theory triangulation.
Pattern matching.
Peer review.
Internal validity
Internal validity.
Strategies used in qualitative research:
Researcher as detective.
Methods triangulation.
Data triangulation
External Validity
Not typically goal of Qualitative research.
People and settings not randomly selected.
Qualitative researchers interested in documenting particularistic
rather than universalistic findings.
Some suggest generalizations can be made.

Naturalistic generalization - based on similarity.


Replication logic - generalization based on replication of findings.
Research Validity or Legitimation in Mixed Research
Inside-outside validity - use of participants' subjective insider or "native" views and the researcher's objective outsider view.
Paradigmatic validity - researcher documents his or her philosophical beliefs about research.
Commensurability validity - researcher makes Gestalt switches between qualitative and quantitative views to create an integrated viewpoint.
Weakness minimization validity - researcher combines qualitative and quantitative approaches so that their weaknesses do not overlap.
Sequential validity - researcher addresses effects from the ordering of the qualitative and quantitative phases.
Conversion validity - accuracy of quantizing and qualitizing data.
Sample integration validity - making appropriate generalizations from mixed samples.
Political validity - addresses interests and viewpoints of multiple stakeholders.
Multiple validities - all pertinent validities addressed and resolved successfully.
Major Threats to Validity
Type I error: believing a principle to be true when it is not (i.e., mistakenly
rejecting the null hypothesis)
Type II error: rejecting a principle when, in fact, it is true (i.e., mistakenly retaining the null hypothesis)
Type III error: asking the wrong question
Reliability

Internal reliability
Neutrality is maintained
External reliability
Replicability of concepts
Types of reliability
Intra-rater
Inter-rater
Intra-session
Inter-session

Threats to internal validity are threats to causal control. They

mean that we do not know for sure what caused the effects that we
observed. Naturally, we like to hope that our interventions (experimental
treatments) or other known and measured independent variables caused
the effects. Unfortunately this is often not the case. For example, because
of their multidimensionality, confounded variables (which measure more
than one entity) are a threat to internal validity.
BIAS VERSUS RANDOM ERROR
If you have tight control over your experimental treatments (and, of course,
you used randomization), hopefully the only source of variance left in your
dependent variables will be random error.
Random error is just that: It is the random variation that occurs on
measurements across administrations, situations, or time periods. If
random error is VERY large, it can pose a threat to the reliability

(predictability, stability) of our measurements. Many political attitudes, for


example, are highly unstable or volatile.
On the other hand, because it is random, random error does not usually
pose a threat to internal validity.
Bias is systematic error, such as the scale that always weighs you in at five
pounds too light. Bias introduces a constant source of error into
measurements or results. Bias can occur when test items that favor a
particular ethnic, age, or gender group are used. For example, a "culture
exam" that asked respondents to identify songs from the 1950s and the
1960s would discriminate against younger people. Tests of "science
knowledge" often favor younger people because they use the most recent
definitions of science phenomena and thus favor those with a more recent
education. Bias in testing instruments is a threat to internal validity
because it poses an alternative explanation for the results that we found.
If we could either control bias experimentally (random assignment controls
much of it by making experimental treatment groups roughly equivalent at
the beginning of a study, thus controlling factors such as self-selection or
regression toward the mean effects) or measure the variables we suspect
cause bias and thus control them statistically, we would at least maximize
internal validity.
Unfortunately bias is often hidden, either in the variables you didn't
measure--or the variables you didn't consider at all. Thus you didn't
measure it and only discover your mistake after all your data are collected.
Confounded variables are a major threat to internal validity.

HERE ARE SOME WELL-KNOWN THREATS TO INTERNAL VALIDITY
Self-selection effects : When subjects can select their own treatments,
we do not know whether the intervention or a pre-existing factor of the
subject caused the outcomes we observed. Random assignment can cure
this problem. The same problem can occur with differential selection, only
in this case, the investigator (rather than the subject) uses human
judgement to assign groups or subjects to treatment. A common variation
on this one is selecting extreme groups (see below).
Experimental mortality. When subjects discontinue the study and this
occurs more in certain conditions than others, we do not know how to
causally interpret the results because we don't know how subjects who
discontinued participation differed from those who completed it. A pretest
questionnaire given to all subjects may help clarify this, but watch out for
pretesting effects (a Solomon four group design can help here.)
History: Some kind of event occurred during the study period (such as
the assaults on New York City) and it is reactions to these events that
caused the outcomes we observed. Sometimes this is a medical event
(such as a flu outbreak) and sometimes an actual political or historical
event. Random assignment and a control group helps with this problem.
Maturation effects are especially important with children and youth
(such as college freshmen) but could happen at any age. For example,
young children's speech will normally become more complex, no matter
what reading method you use. Some studies have found that most college
students pull out of a depression within six months, even if they receive no
treatment whatsoever. A certain number of people will stop smoking,
whether they receive treatment or not. Again, a randomized control group
helps.

Regression toward the mean effects ("statistical regression") are


especially likely when you study extreme groups. For example, students
scoring at the bottom of a test typically improve their scores at least a little
when they retake the test. Students with nearly perfect scores might miss
an item the second time around. That is, people with extreme scores, or in
extreme groups, will often fall back toward the average or "regress to the
mean" on a second administration of the dependent variable.
Regression toward the mean effects are especially likely to occur among
well-meaning investigators, who want to give a treatment that they believe
is very beneficial to the group that appears to need it the most (the top
scoring group is usually left alone.) When the scores of the worst group
improve after the intervention (and the top group scores a little lower on
the readministration), misguided investigators are even more convinced
that they have found a good treatment (instead of a methodological
artifact.) How to avoid this threat to internal validity? Either avoid extreme
groups, or if you do use them, randomly assign their members to treatment
conditions, INCLUDING A CONTROL GROUP.
Testing. Just taking a pretest can sensitize people and many people
improve their performance with practice. Almost every classroom teacher
knows that part of a student's performance on assessment tests depends
on their familiarity with the format. Solution? A Solomon Four Group
Design, wherein half the subjects do not receive a pretest is a good way to
control inferences in this case.
REACTIVITY AND THREATS TO INTERNAL VALIDITY
Reactivity refers to changes in the subjects' behavior simply because they
are being studied.

For example, some people get nervous when a doctor or nurse takes their
blood pressure, and their blood pressure goes up.
Reactivity poses a distinct threat to internal validity because we don't know
what caused the outcome: treatment effects or reactivity. The experimental
laboratory is probably the most reactive because people have come for an
experiment and they know their behavior is being watched. That is why so
many experimenters use deception. They are trying to divert subject
attention so that the "true behavior under study" is not altered.
Demand effects, in which subjects or respondents "follow orders" or
cooperate in ways that they almost never would under their routine daily
lives.

In research several decades ago, Martin Orne found that laboratory


subjects would do virtually anything an experimenter asked them to
do. They would eat bitter crackers for several minutes, they would
throw what they were told was acid at a laboratory assistant, they
would pick up snakes or prepare to eat worms.

Social Desirability effects take several forms.

Subjects may become nervous about being monitored: evaluation apprehension. When people become anxious, many things happen.
Physiological indicators, such as heart rate or blood pressure,
change. If people are slightly anxious, they may do better on tests,
performance, or assessments. However if people are very anxious
("flooded") they will almost certainly do worse

People may try to "fake good," to appear smarter, more attractive, or more tolerant than they normally are. Paper and pencil questionnaires are especially prone to these effects because often the answers are not checked for their veracity. (And, on online surveys, we may not even correctly know who anyone is.)

It is not just individuals who have social desirability effects. A


century ago, the famous writer Lev Tolstoy wrote about "Potemkin
villages." When the Czar went on cross-country trips, goverment
officials were a little ways ahead of him. In cooperation with local
government, they would erect false-fronted buildings (as on a movie
set) and the best looking young men and women of the village would
stand before these fake structures smiling and throwing flowers.
While most groups or organizations will not go to this extent, they
may "hide" their more embarassing members, "fudge" or slightly
alter records, and claim your procedures were followed when they
were not.

Most people and groups (who allow you to study them at all) try to
cooperate with researchers. But some try to discover the purpose of the
intervention and thwart it, or "wreck the study." Social Reactance effects
refer to boomerang effects in which individuals or groups "fake bad," or
deliberately deviate from study procedures. This happens more among
college students, and others who suspect that their autonomy is being
threatened.
ON REACTIVITY AND INTERNAL VALIDITY. If demand effects are specific to
a particular situation, reactivity problems may also influence generalizing,
or external validity (this is how your Wiersma book treats the term.)
However, I think reactivity introduces an alternative causal explanation for
our results: they occurred, not because of the intervention or treatment,
but because people were so self-conscious that they changed their
behavior. This is internal validity. Reactivity may also statistically interact
with the experimental manipulation. For example, if the treatment somehow
impacts on self-esteem (say you are told that the stories you tell to the TAT
pictures indicate your leadership ability), reactivity may be a greater
internal validity problem.

MORE ON GENERALIZING: "EXPERIMENTAL" VERSUS "MUNDANE"


REALITY
More of a threat to external validity is the issue of the reality of the study
setting. In many cases, such as studies of classrooms or online
environments, the setting of the study is identical to the "everyday reality"
or mundane reality in which most subjects live their lives. High mundane
reality makes it easier to generalize to people's typical settings and it
facilitates external validity. Field studies of all kinds, and ethnographies,
too, take place in typical, as opposed to unusual, settings.
However, laboratory experiments in particular may use unusual settings or
tasks. For example, some sports experiments will have subjects on a
treadmill for hours. In other studies, subjects may be injected with
substances (such as adrenaline) or take pills. Subjects may see specially
constructed movies that are nothing like they see on TV. Or may be called
upon to perform tasks (watching a light "move" in a darkened room) that
bear no resemblance to their normal environment. While these settings or
tasks may be engrossing or compelling, thus high in experimental reality,
they do not resemble the settings to which researchers may really want to
generalize.

DID ANYBODY NOTICE? I HOPE YOU USED A Manipulation check.


YOU are certain that your intervention will make life healthier or enhance
learning. But what if no one pays attention to the treatment or
comprehends its message? Then it will appear that you have no effects at
all, whereas if you had simply used a stronger manipulation, your
guesswork would have been confirmed.

Anyone doing experimental work needs to have a manipulation check, an


inclusion to measure if subjects even paid attention to factors in the
treatment and understood their messages. For example, if you show
different movies to different groups and your topic is filmed aggression,
have a short questionnaire that has subjects rate the violence of the movie.
The group receiving the more aggressive film should rate it as more violent
than those receiving an unaggressive movie. If you are trying a new
reading technique, make sure that students understand the stories they are
exposed to and remember something about them. If you try a new template
in your online learning course, did students even pay attention?
THE HUMAN FACTOR: USING DOUBLE BLIND
When the medical and pharmacy professions test a new medicine, they
don't just use a "sugar pill" placebo.
Subjects in the study do not know if they are taking a new medication, an
old medication, or a sugar pill.
The individuals who pass out the medication and assess the subjects'
health and behavior also do not know whether the person is taking a new
medication, an old medication, or a sugar pill.
Thus both those involved as subjects and those involved with collecting
data are "blind:" blind to the purposes of the study, the condition that
subjects are in, and the results expected.
This means that

You may need to deceive subjects about the true purpose of the
study (if you were told the purpose of the study was to measure
leadership qualities in sports, might you try to "shape up?")

Avoid collecting your own data; don't act as your own experimenter
or interviewer. Trade off with another student or apply for a small
University or external grant to hire someone.

Don't tell interviewers or experimenters the true purpose of the study


and don't tell them (if possible) which subject is in which condition.
You might give each person a "generic overview" of the study ("this
study is about which movies children like.")

Almost no one who collects data "likes deception" but without at least a
little, you may introduce reactivity and bias into your study. Do the
minimum (I prefer "omission" rather than deliberate lies) and be sure to
debrief subjects after their participation in the study is completed. This
means that you tell them the true purpose of the study and any
manipulations pertinent to their role in it. Debriefing is ethically mandatory,
and is especially important if your manipulation involved lies about the
student's performance ("no, you really didn't score in the 5th percentile on
that test, all feedback was bogus") or any other aspect of the "real world."


Effect of Validity and Reliability
The precision with which you measure things also has a major impact on sample size: the
worse your measurements, the more subjects you need to lift the signal (the effect) out of
the noise (the errors in measurement). Precision is expressed as validity and reliability.
Validity represents how well a variable measures what it is supposed to. Validity is
important in descriptive studies: if the validity of the main variables is poor, you may
need thousands rather than hundreds of subjects. Reliability tells you how reproducible
your measures are on a retest, so it impacts experimental studies: the more reliable a

measure, the fewer subjects you need to see a small change in the measure. For example, a
controlled trial with 20 subjects in each group or a crossover with 10 subjects may be
sufficient to characterize even a small effect, if the measure is highly reliable. See the
details on the stats pages.
Pilot Studies
As a student researcher, you might not have enough time or resources to get a sample of
optimum size. Your study can nevertheless be a pilot for a larger study. Perform a pilot
study to develop, adapt, or check the feasibility of techniques, to determine the reliability
of measures, and/or to calculate how big the final sample needs to be. In the latter case,
the pilot should have the same sampling procedure and techniques as in the larger study.
For experimental designs, a pilot study can consist of the first 10 or so observations of a
larger study. If you get respectable confidence limits, there may be no point in continuing
to a larger sample. Publish and move on to the next project or lab!
If you can't test enough subjects to get an acceptably narrow confidence interval, you
should still be able to publish your finding, because your study will set useful bounds on
how big and how small the effect can be. A statistician can also combine your finding
with the findings of similar studies in something called a meta-analysis, which derives a
confidence interval for the effect from several studies. If your study is not published, it
can't contribute to the meta-analysis! Many reviewers and editors do not appreciate this
important point, because they are locked into thinking that only statistically significant
results are publishable.
WHAT TO MEASURE
In any study, you measure the characteristics of the subjects, and the independent and
dependent variables defining the research question. For experiments, you can also
measure mechanism variables, which help you explain how the treatment works.
Characteristics of Subjects

You must report sufficient information about your subjects to identify the population
group from which they were drawn. For human subjects, variables such as sex, age,
height, weight, socioeconomic status, and ethnic origin are common, depending on the
focus of the study.
Show the ability of athletic subjects as current or personal-best performance, preferably
expressed as a percent of world-record. For endurance athletes a direct or indirect
estimate of maximum oxygen consumption helps characterize ability in a manner that is
largely independent of the sport.
Dependent and Independent Variables
Usually you have a good idea of the question you want to answer. That question defines
the main variables to measure. For example, if you are interested in enhancing sprint
performance, your dependent variable (or outcome variable) is automatically some
measure of sprint performance. Cast around for the way to measure this dependent
variable with as much precision as possible.
Next, identify all the things that could affect the dependent variable. These things are the
independent variables: training, sex, the treatment in an experimental study, and so on.
For a descriptive study with a wide focus (a "fishing expedition"), your main interest is
estimating the effect of everything that is likely to affect the dependent variable, so you
include as many independent variables as resources allow. For the large sample sizes that
you should use in a descriptive study, including these variables does not lead to
substantial loss of precision in the effect statistics, but beware: the more effects you look
for, the more likely the true value of at least one of them lies outside its confidence
interval (a problem I call cumulative Type 0 error). For a descriptive study with a
narrower focus (e.g., the relationship between training and performance), you still
measure variables likely to be associated with the outcome variable (e.g., age-group, sex,
competitive status), because either you restrict the sample to a particular subgroup
defined by these variables (e.g., veteran male elite athletes) or you include the variables
in the analysis.

For an experimental study, the main independent variable is the one indicating when the
dependent variable is measured (e.g., before, during, and after the treatment). If there is a
control group (as in controlled trials) or control treatment (as in crossovers), the identity
of the group or treatment is another essential independent variable (e.g., Drug A, Drug B,
placebo in a controlled trial; drug-first and placebo-first in a crossover). These variables
obviously have an effect on the dependent variable, so you automatically include them in
any analysis.
Variables such as sex, age, diet, training status, and variables from blood or exercise tests
can also affect the outcome in an experiment. For example, the response of males to the
treatment might be different from that of females. Such variables account for individual
differences in the response to the treatment, so it's important to take them into account.
As for descriptive studies, either you restrict the study to one sex, one age, and so on, or
you sample both sexes, various ages, and so on, then analyze the data with these variables
included as covariates. I favor the latter approach, because it widens the applicability of
your findings, but once again there is the problem of cumulative Type 0 error for the
effect of these covariates. An additional problem with small sample sizes is loss of
precision of the estimate of the effect, if you include more than two or three of these
variables in the analysis.
Mechanism Variables
With experiments, the main challenge is to determine the magnitude and confidence
intervals of the treatment effect. But sometimes you want to know the mechanism of the
treatment--how the treatment works or doesn't work. To address this issue, try to find one
or more variables that might connect the treatment to the outcome variable, and measure
these at the same times as the dependent variable. For example, you might want to
determine whether a particular training method enhanced strength by increasing muscle
mass, so you might measure limb girths at the same time as the strength tests. When you
analyze the data, look for associations between change in limb girth and change in
strength. Keep in mind that errors of measurement will tend to obscure the true
association.

This kind of approach to mechanisms is effectively a descriptive study on the difference scores of the variables, so it can provide only suggestive evidence for or against a
particular mechanism. To understand this point, think about the example of the limb
girths and strength: an increase in muscle size does not necessarily cause an increase in
strength--other changes that you haven't measured might have done that. To really nail a
mechanism, you have to devise another experiment aimed at changing the putative
mechanism variable while you control everything else. But that's another research
project. Meanwhile, it is sensible to use your current experiment to find suggestive
evidence of a mechanism, provided it doesn't entail too much extra work or expense. And
if it's research for a PhD, you are expected to measure one or more mechanism variables
and discuss intelligently what the data mean.
Finally, a useful application for mechanism variables: they can define the magnitude of
placebo effects in unblinded experiments. In such experiments, there is always a doubt
that any treatment effect can be partly or wholly a placebo effect. But if you find a
correlation between the change in the dependent variable and change in an objective
mechanism variable--one that cannot be affected by the psychological state of the
subject--then you can say for sure that the treatment effect is not all placebo. And the
stronger the correlation, the smaller the placebo effect. The method works only if there
are individual differences in the response to the treatment, because you can't get a
correlation if every subject has the same change in the dependent variable. (Keep in mind
that some apparent variability in the response between subjects is likely to be random
error in the dependent variable, rather than true individual differences in the response to
the treatment.)
Surprisingly, the objective variable can be almost anything, provided the subject is
unaware of any change in it. In our example of strength training, limb girth is not a good
variable to exclude a placebo effect: subjects may have noticed their muscles get bigger,
so they may have expected to do better in a strength test. In fact, any noticeable changes
could inspire a placebo effect, so any objective variables that correlate with the noticeable
change won't be useful to exclude a placebo effect. Think about it. But if the subjects
noticed nothing other than a change in strength, and you found an association between

change in blood lipids, say, and change in strength, then the change in strength cannot all
be a placebo effect. Unless, of course, changes in blood lipids are related to susceptibility
to suggestion...unlikely, don't you think?
Validity (does your test measure what it's supposed to?)

gold standard (highest: rarely have this in PT research!)
    equipment calibrated against an accurate standard
internal (cause-effect relationship between the independent and dependent variables)
content (is the sample representative of the population?)
    some (silly) examples with poor content validity:
        asking young people questions and then generalising to the whole population
face (lowest: does the test seem to measure what it's supposed to?)
    some silly examples with no face validity:
        gathering "normal data" from subjects who have a disease!
        using a goniometer to measure velocity

Risks to Validity

Selection (should be randomised for age, sex etc.)

History (background of subjects should be similar)

Maturation (subjects may change, e.g. fatigue, during the experiment)

Repeated Testing (subjects are affected by the test)

Instrumentation (may affect subjects)

Regression to the Mean (subjects with extreme scores on a first test tend to have
scores closer to the mean on a second test)

Experimental Mortality (subjects who drop out of the experiment)

Selection-Maturation Interaction

Experimenter Bias (always try to blind the experimenter)

Reliability (is your test repeatable?)

Scatter plots
The best way to get an initial feel for the data is to draw a scatter plot and calculate the correlation coefficient (Pearson's 'r'):

+1 = perfect direct correlation
0 = no correlation
-1 = perfect inverse correlation

generally, r > 0.8 is regarded as good reliability
the square of the correlation coefficient (r²) is equal to the proportion of variation in the dependent variable that is accounted for

Correlation does not imply causation!

Significance Tests

't' test
Analysis of Variance (ANOVA)
    Intraclass correlation coefficient (ICC)

Types of reliability
Intra-rater
Inter-rater
Intra-session
Inter-session

Statistical Testing: types of error

Type I: did we detect a difference that isn't really there?
    alpha test (p < 0.05)
Type II: is there really a difference that we didn't detect?
    beta test (statistical power - difficult to calculate!)

Qualitative Tests, or non-parametric statistics, for ordinal data (integers, categories)
Mann-Whitney
ITEM ANALYSIS
Definition:
It is a procedure used to further assess the validity of a measure by separately evaluating each item to determine whether or not that item discriminates in the same manner in which the overall measure is intended to discriminate.
Purposes
To determine the adequacy of the items within a test as well as the adequacy of the test itself.
The results of an item analysis provide information about the difficulty of the items and their power to discriminate between better and poorer students.
To obtain more information on each item in order to determine the retention, deletion, or revision of items.
To fix marks for the current class that just wrote the test.
To obtain more diagnostic information on students - another immediate payoff of item analysis.
Classroom level:
- item analysis will tell you which questions the students were all guessing on; if you find a question which most of them found very difficult, you can reteach that concept.
Individual level:
- isolate specific errors that a child made.
To build future tests and revise test items to make them better.
As part of your continuing professional development, item analysis will help teach you how to become a better test writer, and documenting just how good your evaluation is can be useful for dealing with parents or principals if there's ever a dispute.

METHODS OF ITEM ANALYSIS

1. The index of difficulty / item P level.
2. Discrimination index.
3. Item response chart or distracter analysis.
The index of difficulty / item P Level
Definition: it is the proportion of students who correctly answer an item. The
average difficulty of a test is the average of the individual item difficulties.

ID is determined by dividing the number of subjects selecting the correct / desired response to a particular item by the total number of subjects.

For maximum discrimination among students, an average difficulty of 0.60 is ideal.


E.g.: A 10-item performance measure is employed to observe the behaviours of 20 nurses working in a paediatric ambulatory centre. It is determined that 10 subjects performed correctly in response to item 1 and the remaining did not.

ID = 10 / 20 = 0.50

E.g.: If 36 students answered item No. 1 incorrectly and 14 students answered it correctly, calculate the ID level. (Here ID = 14 / (36 + 14) = 0.28, as the sketch below also shows.)
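A minimal sketch covering both examples; the function name is illustrative:

```python
# Item difficulty (P level): proportion of examinees answering correctly.
def item_difficulty(correct, total):
    """Proportion of examinees who answered the item correctly."""
    return correct / total

print(item_difficulty(10, 20))       # nurses example: 0.5
print(item_difficulty(14, 36 + 14))  # 14 of 50 students correct: 0.28
```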

INTERPRETATION
The P level may range from 0 to 1.00.
The closer the value of P is to 1.00, the easier the item.
The closer the value of P is to zero, the more difficult the item.

When norm-referenced measures are employed, P values between 0.30 and 0.70 are desirable because extremely easy or difficult items have very little power to discriminate or differentiate among subjects.
DISCRIMINATION INDEX
Definition: The extent to which a test item discriminates between examinees who obtain high or low scores.
It is a numerical indicator of how the poorer students answered the item as compared to how the better students answered the item.
To determine the index of discrimination value for given data the steps are:
1. Score each answer sheet, write the score total on the corner.
2. Sort the pile into rank order from top to bottom score.
3. Identify those individuals ranked in the upper 25%.
4. Identify those individuals ranked in the lower 25%.
5. Place the remaining scores aside.
6. Determine the proportion of respondents in the top 25% (or 27%) who answered the item correctly (Pu).
7. Determine the proportion of respondents in the lower 25% (or 27%) who answered the item correctly (Pl).
8. Calculate the discrimination index by subtracting Pl from Pu:
DI = Pu - Pl

E.g.: 60 students take a test. The top 16 scores and bottom 16 scores form the upper and lower groups.
For item No. 1, 12 of the 16 students in the upper group answered the item correctly, while 7 students in the lower group answered correctly.
The index of discrimination for item No. 1:
DI = (12 - 7) / 16 = 0.31
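A minimal sketch of this calculation; names are illustrative:

```python
# Discrimination index for equal-sized upper and lower groups.
def discrimination_index(upper_correct, lower_correct, group_size):
    """DI = Pu - Pl, the difference in the proportions answering correctly."""
    return upper_correct / group_size - lower_correct / group_size

print(f"DI = {discrimination_index(12, 7, 16):.2f}")  # 5/16, about 0.31
```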

Examples: Number of students per group = 100

Item No.   Correct answers,   Correct answers,   Item Discrimination
           upper 1/4          lower 1/4          Index
1          90                 20                  0.7
2          80                 70                  0.1
3          100                100                 0.0
4          50                 50                  0.0
5          20                 60                 -0.4

INTERPRETATION
The discrimination index ranges from -1.00 to +1.00.
For a small group of students, an ID for an item that exceeds 0.20 is satisfactory.

For a larger group the ID should be higher because more difference between the groups would be expected.
A positive ID value is desirable and indicates that the item is discriminating in the same manner as the total test,
i.e. those who score high on the test tend to respond correctly to the item, while those who score low do not.
A negative value suggests that the item is not discriminating in the same way as the total test,
i.e. respondents who obtain low scores on the measure tend to get the item correct, while those who score high on the measure tend to respond incorrectly to the item.
A negative (-ve) ID indicates that an item is faulty and needs improvement.
Use the following table as a guideline to determine whether an item (or its corresponding instruction) should be considered for revision.

                           Item Difficulty
Item Discrimination (D)    High       Medium     Low
D =< 0%                    review     review     review
0% < D < 30%               ok         review     ok
D >= 30%                   ok         ok         ok

Distracter Analysis: Definition
