Validity Overview
Overview & Definitions
One of the fundamental questions to ask and address is: 'How valid and reliable are
the study and the data collected?'
Validity of Data: Concerned primarily with the dependent variable. The instrument used
to quantify the dependent variable should be examined for its validity (ability to truly
measure what it is supposed to). If the instrument is a well-known one with established
validity, it may be enough to cite a reference where validity was examined and show that
the same protocol was followed in your study on similar subjects.
Validity of the dependent variable can be assessed:
Qualitatively
Content/logical validity
Quantitatively
Content/logical - factor analysis
Construct validity - multi-trait/multi-method procedure (correlations) or factor analysis
Reliability of Research: Essentially, this refers to the replicability of the research. The
reliability of the research is assessed qualitatively by scrutinizing the design and
methodology employed in the research.
azizi
Page 1
4/19/2015
An estimate of power (probability of correctly rejecting the null hypothesis) is also useful
in examining the stability of the research.
Reliability of Data: Concerned primarily with the dependent variable. The instrument
used to quantify the dependent variable should be examined for its reliability (accuracy
of measures reflected in consistency).
Reliability of the dependent variable can be assessed quantitatively using:
1. Coefficient alpha
2. Intraclass R
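As a minimal sketch of the first of these, coefficient alpha could be computed from a subjects-by-trials score matrix as below (the function name and data are hypothetical illustrations, not from the notes; assumes NumPy is available):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient (Cronbach's) alpha for a subjects-by-trials score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each trial/item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of subjects' total scores
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical: 5 subjects, 3 trials of the same test
data = [[4, 5, 4],
        [2, 3, 2],
        [5, 5, 5],
        [3, 3, 4],
        [1, 2, 1]]
print(round(cronbach_alpha(data), 3))
```

Values near 1 indicate consistent (reliable) measures; an intraclass R computed from a repeated-measures ANOVA would be interpreted similarly.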
Once the statement of the problem is translated into a null hypothesis, choices with
respect to statistical analysis can be made.
Definitions
RESEARCH: The systematic, replicable, empirical investigation of relationships between
or among several variables.
EXPERIMENTAL RESEARCH: Involves the manipulation of some variable by the
researcher in order to determine whether changes in a behavior of interest are effected. A
well-controlled experiment also requires that the assignment of subjects to conditions (or
vice-versa) be made on a random basis and that all variables other than the conditions
the subjects will experience be held constant.
NON-EXPERIMENTAL RESEARCH: No manipulation of variables takes place.
Observational, historical, and survey studies are typically non-experimental. Much of the
research conducted in natural settings is non-experimental since the researcher is unable
to manipulate the conditions that the subjects will experience.
INDEPENDENT VARIABLE: The variable manipulated by the experimenter. Or a
broader definition would be - any variable that is assumed to produce an effect on, or be
related to, a behavior of interest.
LEVELS OF AN INDEPENDENT VARIABLE: The various values or groupings of
values of an independent variable. Ex: a study is conducted to determine the effect of
room temperature on performance. If the experimenter tests the subject at 70, 80, and 90
degrees, there is one independent variable - room temperature - with three levels.
Ordered data: Used when reliable score data cannot be (or is not) obtained, but the
subjects can be ranked from high to low along the dimension of interest. In some cases, a
researcher may convert score data to ranks because it is believed that the measuring
instrument was not precise enough to trust the numerical scores, or that the assumptions
underlying a statistical test for score data would be badly violated by the data. Statistical
tests designed for use with ordered data generally do not make stringent assumptions about
the nature of the underlying distributions and hence are more conservative than those
designed for score data.
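The score-to-rank conversion described above can be sketched as follows (hypothetical scores; assumes SciPy is available):

```python
from scipy.stats import rankdata

# Hypothetical scores whose numerical precision we do not trust
scores = [12.1, 15.7, 15.7, 9.3, 20.4]

# Convert to ranks; tied scores share the average of their ranks
ranks = rankdata(scores)
print(ranks)  # ranks: 2, 3.5, 3.5, 1, 5 (1 = lowest score)
```

A rank-based test (e.g., Mann-Whitney) would then be applied to the ranks rather than the raw scores.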
Frequency data: Each subject is classified into a particular category. The frequency of
occurrence of subjects in each category typically provides the data from which statistical
analysis is done.
Qualitative Research
It is important to recognize the sources of qualitative data as distinct from the data used
in empirical/experimental research.
Three broad categories produce qualitative data:
General definitions:
Validity: tests/observations/instruments etc. are valid if they produce relevant and clean
measures of what you want to assess.
Reliability: tests/observations/instruments etc. are reliable if they produce accurate
measures of what you want to assess.
Validity of Research
Internal validity: Extent to which findings are free of bias, observer effects, Hawthorne
effects, etc. Extent to which findings match reality.
External validity: Extent to which findings can be generalized.
Validity of Data
Qualitative data is valid when it records relevant events/information without
interpretation, bias or filtering.
Reliability of Research
Reliability of Data
Qualitative data is reliable when it accurately (objectively) records events/information.
The validity and reliability of qualitative data depend to a great extent on the
methodological skill, sensitivity and integrity of the researcher.
All require discipline, knowledge, training, practice, and creativity on the part of the
researcher.
Qualitative and Quantitative research are not polar opposites with completely different
sets of techniques and approaches to inquiry. They exist along a continuum commonly
framed in terms of the amount of control or manipulation present.
You can think of qualitative inquiry as discovery oriented. No pre-conditions are set and
no constraints are placed on the outcomes. Variables emerge from the data.
Measurement Issues
The place to start is with how to classify data - the scale the data appropriately belongs on
will affect analysis decisions.
Measurement Scales
Categorical/nominal scale: Used to measure discrete variables that can be classified by
two or more mutually exclusive categories.
Ex: Gender is a categorically scaled variable with two categories: male & female. The
scale scores (0, 1) have no meaning.
Ordinal scale: Used to measure discrete variables that are categorical in nature and can
be ordered (meaningfully).
Ex: Undergraduate class is an ordinally scaled variable with four meaningfully ordered
categories: freshman, sophomore, junior, senior. The scale scores (1, 2, 3, 4) have meaning
in that juniors have completed more units than sophomores, who have completed more
than freshmen, and so on.
Interval scale: Used to measure continuous variables that are ordinal in nature and result
in values that represent actual and equal differences in the variable measured.
Ex: Temperature is an interval scaled variable with meaningfully ordered categories (hot,
cold) that can be measured (scale has a constant unit of measurement) to finer and finer
degrees given appropriate instrumentation.
Ratio scale: Used to measure continuous variables that have a true zero, implying total
lack of the attribute/property being measured.
Ex: Weight is a ratio scaled variable with meaningfully ordered categories (heavy, light)
that can be measured to finer and finer degrees that also has a true rather than arbitrary
zero.
Continuous variables are ones that are at least interval scaled and call for use of
parametric statistics.
Discrete variables are ones that are categorical or ordinal in nature and call for use of
non-parametric statistics.
Volunteer effect: Volunteers may be fundamentally different from the overall population
you are trying to generalize to.
Instrumentation effect: Changes in instruments can be mistaken for changes in subjects.
Pre-testing effect: Subjects can be changed or learning can take place during a pre-test
which could affect results.
Time: Over a length of time, maturation may have more of an impact than the
independent variable. Also, major events can affect subjects' behaviors and/or opinions.
Hawthorne effect: When the giving of attention rather than the independent variable is the
cause of observed differences/relationships.
Purpose       Cognitive             Motor Skills
Evaluation    Content Validity      Logical Validity
              Concurrent Validity   Concurrent Validity
Prediction    Predictive Validity   Predictive Validity
Research      Content Validity      Logical Validity
              Concurrent Validity   Concurrent Validity
              Predictive Validity   Predictive Validity
              Construct Validity    Construct Validity
When measures are found to be valid for one purpose they will not necessarily be valid
for another purpose. Validity also may not be generalizable across groups with varying
characteristics.
Content/logical validity (assessed qualitatively)
1. Clearly define what you want to measure.
2. State all procedures you will use to gather measures.
3. Have an "expert" assess whether or not you are measuring what you think you are.
3. If the correlation is > .80 for positively correlated variables, or < -.80 for
inversely related variables, your measure (x) is said to have good
concurrent validity.
Coefficient alpha
Intraclass R
Should then follow up with the SEM (standard error of measurement): a band placed
around the observed score to quantify measurement error.
The primary concern here is the accuracy of measures of the dependent variable (in a
correlational study both the independent and dependent variable should be examined).
Reducing sources of measurement error is the key to enhancing the reliability of the data.
Reliability is typically assessed in one of two ways:
1. Internal consistency - Precision and consistency of test scores on one administration of
a test.
2. Stability - Precision and consistency of test scores over time. (test-retest)
To estimate reliability you need 2 or more scores per person.
If motor skills/physiological measures collected at one time only, the most common way
of getting 2 scores per person is to split the measures in half - usually by odd/even or first
half/second half by time or trials.
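The odd/even split described above can be sketched as follows (hypothetical trial data; assumes NumPy; the Spearman-Brown step-up of the half-length correlation is an added standard detail, not stated in the notes):

```python
import numpy as np

def split_half_reliability(trials):
    """Odd/even split-half reliability for one administration,
    stepped up to full test length with the Spearman-Brown formula."""
    trials = np.asarray(trials, dtype=float)   # subjects x trials
    odd = trials[:, 0::2].sum(axis=1)          # trials 1, 3, 5, ...
    even = trials[:, 1::2].sum(axis=1)         # trials 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]      # correlation of the two halves
    return 2 * r_half / (1 + r_half)           # full-length estimate

# Hypothetical: 4 subjects x 6 trials of a motor-skill test
data = [[10, 11, 10, 12, 11, 10],
        [ 7,  8,  8,  7,  7,  8],
        [14, 13, 15, 14, 14, 15],
        [ 9,  9, 10,  9, 10,  9]]
print(round(split_half_reliability(data), 3))
```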
For survey research with multiple factors, reliability is typically assessed within factors
by examining consistency of response across items within a factor. So, for a survey with
3 factors, you will compute 3 reliability coefficients.
If every subject can be measured twice on the dependent variable then you readily have
data from which reliability can be examined.
Once you have 2 scores per person the question is how consistent overall were the scores
in order to infer how accurate scores were.
Sources of measurement error
1. Random fluctuations, or a person's inability to score the same twice or perform
consistently throughout one administration.
2. Measuring device - test
3. Researcher
4. Temporary effects - warm-up, practice
5. Testing length - time/trials
Objectivity
Whenever measures have a strong subjective component to them it is essential to
examine objectivity. Subjectivity itself is a source of measurement error and so affects
reliability and validity. Therefore, objectivity is a matter of determining the accuracy of
measures by examining consistency across multiple observations (multiple judges on one
occasion or repeated measures over time from one evaluator) that typically involve the
use of rating scales.
Objectivity (using coefficient alpha/Intraclass R)
This way of examining objectivity requires that you have either multiple evaluators
assessing performance/knowledge on one occasion, or one evaluator assessing the same
performance (videotaped)/knowledge at least twice. If the measures (typically from a
rating scale) are objective they will be consistent across the two or more measures per
subject.
This design allows the researcher to test for significant differences across the two
experimental groups and the control group after the experimental groups have received
the treatment. A one-way ANOVA would be used to statistically test the null hypothesis
that H0: μ1 = μ2 = μ3.
In this expanded design there is still one independent variable (now with 3 levels) and
one dependent variable. The independent variable is still groups or treatment condition
and the dependent variable is again the variable under study. In the text, the example is of
training at two levels (70% of VO2 max & 40% VO2 max) with the control group not
training. The dependent variable is cardiorespiratory fitness. So, at the end of training,
measures of cardiorespiratory fitness are collected on each subject and the means for each
group compared using a one-way ANOVA (if assumptions not met, a Kruskal-Wallis test
would be used).
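A sketch of that analysis with hypothetical fitness scores (the group values are illustrative only; assumes SciPy):

```python
from scipy.stats import f_oneway, kruskal

# Hypothetical cardiorespiratory fitness scores after training
g70  = [52, 55, 58, 54, 57]   # trained at 70% VO2 max
g40  = [48, 50, 49, 51, 47]   # trained at 40% VO2 max
ctrl = [44, 45, 43, 46, 44]   # control, no training

F, p = f_oneway(g70, g40, ctrl)     # one-way ANOVA of H0: mu1 = mu2 = mu3
H, p_kw = kruskal(g70, g40, ctrl)   # Kruskal-Wallis fallback if assumptions fail
print("ANOVA rejects H0:", p < 0.05)
print("Kruskal-Wallis rejects H0:", p_kw < 0.05)
```

With these illustrative numbers both tests reject; with real data the choice between them rests on the assumption checks discussed later.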
Points to consider
1. For each subject in the study there can be only one score. If many measures are taken
for each subject, these measures must be combined in some fashion (e.g., averaged) so that
there is only one score for each subject.
2. Although it is not necessary, it is usually best to have an equal number of subjects in
each of the experimental groups.
3. The number of groups compared depends on the hypothesis being examined; however,
it is rare that more than four or five groups are compared in one study.
Example
Assume that the researcher is interested in determining the effect of shock on the time
required to solve a set of difficult problems. Subjects are randomly assigned to four
experimental conditions. Subjects in Group 1 receive no shock; Group 2, very low
intensity shocks; Group 3, medium shocks; and Group 4, high intensity shocks. The total
time required to solve all the problems is the measure recorded for each subject. The
independent variable then is shock (which has 4 levels) and the dependent variable is
time.
Factorial design:
Essentially an extension of the randomized-groups design, this design has more than one
independent variable and just one dependent variable. This design requires the formation
of a group for every combination (of every level) of the two or more independent
variables.
This design allows the researcher to test for significant differences as a function of each
independent variable separately (main effects) and in combination (interaction). A two-way ANOVA would be used to statistically test the null hypothesis that means are
equivalent for the first independent variable, that means are equivalent for the second
independent variable, and that the interaction is not significant.
Points to consider
1. For each subject there can be only one score. If many measures are taken for each
subject, these must be combined (e.g., averaged) so that there is only one score for each
subject.
2. It is best to have an equal number of subjects in each group.
3. The number of groups compared depends on the hypothesis being examined; however,
it is rare that more than four or five groups are included in either of the two factors.
Example
Assume that a researcher is interested in determining the effects of high vs. low-intensity
shock on the memorization of a hard vs. an easy list of nonsense syllables. Subjects
would be randomly assigned to four experimental conditions: (1) low shock & easy list,
(2) high shock & easy list, (3) low shock & hard list, and (4) high shock & hard list. The
total number of errors made by each subject is the measure recorded. The dependent
variable then is the number of errors and the independent variables are shock (with two
levels) and list difficulty (with two levels).
This would require the formation of 4 randomly assigned groups:
                List difficulty
Shock Type      Easy       Difficult
Low             A1 B1      A1 B2
High            A2 B1      A2 B2
the number of errors made depending on shock type (regardless of list difficulty);
the number of errors made depending on list difficulty (regardless of shock type);
and the number of errors made due to the combined effect of shock type and list
difficulty.
Since 3 F statistics are examined, you would divide your alpha by 3 before comparing the
3 p values to your alpha. This is done to keep the overall study's alpha intact.
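That adjustment is simple arithmetic; a minimal sketch:

```python
# Bonferroni adjustment: split the study-wide alpha across the 3 F tests
# (two main effects and the interaction)
alpha = 0.05
n_tests = 3
per_test_alpha = alpha / n_tests
print(round(per_test_alpha, 4))   # each F is judged against roughly .0167
```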
The analysis would be referred to as a 2X2 ANOVA communicating that there are two
levels of the first independent variable and two levels of the second independent variable. The
language used to talk about the results would be the main effect for shock type, the main
effect for list difficulty, and the interaction.
Using gender and shock type as the independent variables, you would still need to form 4
groups; however, you could not randomly assign individuals to the gender categories,
though you would still randomly assign individuals to the shock category.
                Gender
Shock Type      Male       Female
Low             A1 B1      A1 B2
High            A2 B1      A2 B2
This design allows the researcher to test for significant differences or amount of change
produced by the treatment - does the experimental group change more than the control
group. Though the testing effect cannot be evaluated it is assumed to be controlled since
both the control and treatment groups are pretested. A factorial repeated measures
ANOVA is the recommended analytical procedure. With this approach you have two
independent variables (or factors) and one dependent variable. The first factor (not
repeated measures) is treatment condition and the second is test (repeated measures): pre/post. The dependent variable is what is being measured at the pre and post test.
Consider a dietary seminar intended to change eating habits particularly with respect to
consumption of fat.
Group 1:   Pre-test   Seminar   Post-test
Group 2:   Pre-test             Post-test
In this example there are two independent variables and one dependent variable. In the
situation depicted above there are two levels of each independent variable. The first
independent variable is group or treatment condition (two levels - experimental/group 1
& control/group 2). The second independent variable is test (two levels - pretest &
posttest). The dependent variable is grams of fat consumed.
subjects after they run in an urban setting, the countryside, an indoor track, and an
outdoor track. The dependent variable is exercise satisfaction and the independent
variable is exercise environment.
The major advantage of this design over the completely randomized design is that fewer
subjects are required. In addition, very often increased statistical power is gained because
the random variability of a single subject from one measure to the next is usually much
less than the variability introduced by measuring and comparing different subjects. The
major disadvantage is that there may be carry-over effects from one treatment/testing to
the next. In addition, subjects might become progressively more proficient at performing
the criterion task and show an improvement in performance more attributable to learning
than the treatment.
Points to consider
1. Each subject is tested under, and a score is entered for, each treatment condition.
2. The number of repeated measures depends on the research question; however, it is rare
to have more than four or five treatment conditions.
Unfortunately there is no way to tackle this statistically. The best to be done is a 2X2
ANOVA. One independent variable then is treatment condition (no treatment &
treatment) and the second independent variable is testing (pre tested & not pre tested).
                 No Treatment    Treatment
Pre-tested
Not pre-tested
From the ANOVA produced with this analysis you can assess the effects of pretesting
(main effect for testing), the effect of the treatment (main effect for treatment condition),
and the external validity threat of the pretest interacting with the treatment (interaction
effect). Other concerns can and should be examined descriptively.
When you compare observed data with expected results from normative values this is
called testing the null hypothesis.
You can think of hypothesis testing as trying to see if your results are unusual enough so
that they would not even be expected by chance.
By chance, your sample could lie at the extremes of the distribution and so you could
draw the wrong conclusion. These erroneous decisions are generally referred to as type I
and type II errors.
Type I error = Incorrectly deciding to reject a null hypothesis; that is, incorrectly
rejecting a true null hypothesis.
Type II error = Incorrectly deciding not to reject a null hypothesis. Failing to reject a false
null hypothesis.
Alpha = The level of risk an experimenter is willing to take of rejecting a true null
hypothesis. Often called the level of significance, this value is used in establishing a
critical value around which decisions (reject or not reject null) are made. It is also
common to define alpha as the probability of incorrectly rejecting a true null hypothesis
or the probability of making a type I error.
Beta = The level of risk (not under direct control of an experimenter) of failing to reject a
false null hypothesis. It is also common to define beta as the probability of making a type
II error.
Power = 1 - Beta. The probability of correctly rejecting a false null hypothesis.
*In practice you will never know whether or not you've made a poor decision (made a
type I or type II error) but, you can (a) set the probability that you will make a type I error
when you select your alpha, and (b) determine beta (through estimating power) to
estimate the probability that you made a type II error. Note: since sample size is directly
related to power (and so tied to beta), studies with small samples will fail to find
statistically significant results even when real effects exist.
- If you decrease alpha (more stringent), power will decrease, so beta will increase.
- As you increase sample size, power increases, so beta decreases.
- As you enhance measurement precision, beta decreases, so power increases.
- As effect size increases, beta decreases, so power increases.
Power
Power is the probability of correctly rejecting a false null hypothesis.
Ideally, power should be considered when planning a study, not after it is over. Knowing
what you would like power to be you can determine (using software or power charts)
what your sample size should be.
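One hedged way to see the sample-size/power link is by simulation rather than charts (the effect size and group sizes below are illustrative; assumes NumPy and SciPy):

```python
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(effect_size, n_per_group, alpha=0.05, reps=2000, seed=1):
    """Estimate power for a two-group t-test by repeated simulation.
    effect_size is the standardized mean difference (Cohen's d)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        if ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / reps

# Power for a medium effect (d = .5) grows with sample size
low_n = simulated_power(0.5, 20)
high_n = simulated_power(0.5, 64)
print(low_n, high_n)
```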
For studies examining differences between means, power charts are used to
determine the sample size needed to achieve a desired power.
For studies examining relationships, software is available. By hand, the
computations are very complex.
For studies estimating proportions or means, the computations can be done by
hand.
If power is not considered at the start of a study it should be estimated at the end,
particularly when non-significant results arise.
When non-significant results are obtained one of the following has occurred:
Inadequate theory
Evaluation of preliminary evidence incorrect
Design, analysis, or sample size choices faulty
The information needed to determine the sample size required to achieve a particular
level of power for differences between means includes:
Alpha
# of groups (k)
Minimum effect size you want to detect. This would come from a pilot study,
literature, or your own opinion on the size of the effect you would like to be able
to detect (.80, .50, .30).
Power desired
The information needed to determine the sample size needed to achieve a particular level
of power for relationships includes:
Alpha
Standard deviation for dependent and independent variables. This would come
from a pilot study or the literature.
Minimum effect you want to detect. Here the effect size is the correlation
coefficient itself. Again this would come from a pilot study, literature, or your
own opinion on the size of the effect you would like to be able to detect (.80, .50, .30).
Power desired.
Statistics
Descriptive Statistics
Central tendency
A technique for conveying group information regarding the middle or center of a
distribution of scores. Indicates where scores tend to be concentrated.
Variability
Measures of central tendency are often not enough by themselves to communicate useful
information about distributions of scores. A measure of central tendency is valuable to
report anytime you want to summarize data, but, the spread of scores around the center
should accompany it.
Measures of variability
Inclusive range: Crude measure. Not stable since it uses only two scores.
Signed deviation: An individual statistic. Represents the signed distance of a raw
score from its mean. Not useful as summary statistic. Only conveys information
on an individual.
Standard deviation: Roughly, the typical unsigned distance of scores from the mean;
computed as the square root of the average squared deviation.
For categorical data, frequency distribution tables or crosstabulation tables are useful in
summarizing information.
Example:
                  Male    Female
Smokes            42%     30%
Does not smoke    58%     70%
Correlation
The correlation statistic allows you to examine the strength of the relationship between
two variables. This statistic helps you answer questions like 'Do gymnasts with
considerable amounts of fast twitch muscle fibers tumble better?' The underlying
question can be phrased: is there a relationship between concentration of FTMF and
tumbling ability?
There are several types of correlation coefficients to choose from. The choice is based on
the nature of the variables being correlated.
Pearson Product Moment Correlation (PPMC)
Use when both variables continuous
Phi
Use when both variables true dichotomies
Point Biserial
Use when one variable continuous and the other a true dichotomy.
A PPMC coefficient describes the strength and direction of the linear relationship
between two variables. When two variables are not linearly related, the PPMC is likely to
underestimate the true strength of the relationship. A graph of the x and y values can
show whether or not the relationship is linear.
When the scores on two variables get larger/smaller together, the direction of the
relationship is positive. When scores on one get larger as scores on the other variable get
smaller, the direction of the relationship is negative due to the inverse relationship of the
two variables. When there is no pattern, there is no relationship.
A PPMC coefficient is a signed number between -1 and 1 where 0 represents no
relationship. Presence of a relationship should never be interpreted as demonstrating
cause and effect. Remember, the negative sign simply conveys direction. The farther away
from zero in either direction, the stronger the relationship.
The PPMC is affected by the variability of the scores collected. Other things being equal,
the more homogeneous the group (on the variables being measured) the lower the
correlation coefficient. Since small groups tend to be more homogeneous, Pearson is
most meaningful and most stable when group size is large (>50).
A point biserial correlation coefficient tells you the strength of the relationship between
one continuous and one dichotomous variable. The sign carries little meaning. It only
indicates which group tended to have higher scores. The point biserial coefficient is a
signed number between -1 and 1 where again zero represents no relationship.
A phi correlation coefficient tells you the strength of the relationship between two
dichotomous variables. The sign carries little meaning. It only indicates which diagonal
had the greater concentration of scores. The phi coefficient is a signed number between -1
and 1 where again zero represents no relationship.
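The choice among the three coefficients can be sketched as follows (all data are hypothetical measures on 8 athletes; assumes SciPy; phi is computed here as a PPMC on 0/1 codes, which is one standard way to obtain it):

```python
from scipy.stats import pearsonr, pointbiserialr

# Hypothetical measures on 8 athletes
jump    = [55, 61, 48, 66, 52, 70, 58, 45]   # continuous
agility = [50, 58, 47, 64, 55, 68, 56, 44]   # continuous
starter = [0, 1, 0, 1, 0, 1, 1, 0]           # true dichotomy
allstar = [0, 1, 0, 1, 1, 1, 0, 0]           # true dichotomy

r, _    = pearsonr(jump, agility)            # continuous-continuous: PPMC
r_pb, _ = pointbiserialr(starter, jump)      # dichotomy-continuous: point biserial
phi, _  = pearsonr(starter, allstar)         # dichotomy-dichotomy: phi (PPMC on 0/1 codes)
print(round(r, 2), round(r_pb, 2), round(phi, 2))
```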
Inferential Statistics
Once a statistical statement of the null hypothesis is made decisions must be made
regarding what statistic to use to test the null hypothesis. Two broad categories of
statistics are available: Parametric and non-parametric. All other things being equal more
power exists with parametric statistics. However, even when a researcher wants to use a
parametric statistic it is not always possible. Prior to using any statistic you must first
check to see whether the assumptions associated with the statistic are met.
false null hypothesis. Efficiency refers to the simplicity of the analysis as well as design
considerations.
All other things being equal, parametric tests are more powerful than non-parametric tests
provided all the assumptions are met. However, since power may be enhanced by
increasing N and parametric assumptions are difficult to meet, non-parametric procedures
become very important.
Advantages of Non-parametric Statistical Tests
assumption by comparing the largest and smallest standard deviations for the
groups in your sample. If they are similar (largest/smallest < 2), you have met this
assumption.
Normality - is the distribution of scores for the dependent variable in the
population normal for each level of the independent variable? You check this
assumption by examining histograms for each group. The dependent variable
should be normally distributed for each group in your sample.
Sample randomly selected. Examine whether or not you have met this assumption
by scrutinizing the sampling procedure.
Dependent variable at least interval scaled. Examine whether or not you have met
this assumption by checking to see that the dependent variable meets the
definition of an interval scaled variable.
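A sketch of how the first two checks could be run (hypothetical group scores; assumes NumPy and SciPy; the Shapiro-Wilk test stands in here for the histogram inspection the notes describe):

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical dependent-variable scores for three groups
groups = [
    [23, 25, 22, 27, 24, 26, 25, 23],
    [30, 28, 31, 29, 33, 30, 29, 32],
    [26, 27, 25, 28, 27, 26, 28, 25],
]

# Homogeneity of variance: largest SD / smallest SD should be < 2
sds = [np.std(g, ddof=1) for g in groups]
ratio = max(sds) / min(sds)
print("SD ratio:", round(ratio, 2))

# Normality: Shapiro-Wilk per group (p > .05 -> no evidence against normality)
for g in groups:
    print("Shapiro-Wilk p =", round(shapiro(g).pvalue, 3))
```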
If assumptions met you can proceed and conduct an independent t-test or ANOVA. If
distributional assumptions not met you should conduct a non-parametric test (Mann-Whitney or Kruskal-Wallis).
Dependent t-test: Statistical procedure for testing H0: two means are equivalent when the
two measures of the dependent variable are related. For example, when one group of
subjects is tested twice, the two scores are related.
Repeated Measures ANOVA: Statistical procedure for testing H0: two or more means are
equivalent when the two or more measures of the dependent variable are related.
Assumptions of the dependent t-test and Repeated Measures ANOVA procedure:
If assumptions met you can proceed and conduct a dependent t-test or repeated measures
ANOVA. If distributional assumptions not met you should conduct a non-parametric test
(Wilcoxon or Kendall).
Note, when there are more than 2 levels of an independent variable a significant F
statistic indicates only that somewhere means are different; it does not point out which
means are different. Special techniques called multiple comparison procedures or post
hoc analyses are needed to determine which means are different.
The Scheffe post hoc analysis is recommended as a follow-up to a significant overall F.
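The Scheffe procedure itself is not sketched here; as a simpler hedged illustration of controlled follow-up comparisons, pairwise t-tests with a Bonferroni-adjusted alpha convey the idea (hypothetical data; assumes SciPy):

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical group scores (e.g., from the training example)
groups = {"70% VO2": [52, 55, 58, 54, 57],
          "40% VO2": [48, 50, 49, 51, 47],
          "control": [44, 45, 43, 46, 44]}

pairs = list(combinations(groups, 2))
adj_alpha = 0.05 / len(pairs)   # Bonferroni: split alpha across the 3 comparisons

pvals = {(a, b): ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs}
for (a, b), p in pvals.items():
    print(a, "vs", b, "-> different?", p < adj_alpha)
```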
Large Sx
Linearity
Homoscedasticity - Variance of y scores similar at each x value
Data at least interval scaled
N small
Growth is present. Variance tends to increase with age.
Observations/trials truncated or insufficient practice given. Pattern may be
curvilinear.
1. Set alpha
2. Calculate rxy
3. Get critical value from table (df = N - 2)
4. If |rxy| > CV, reject H0
Ex: If rxy = .60, rxy² = .36. So, 36% of the variance in the DV can be explained by the IV.
Left unexplained is 1 - rxy² = .64.
Note: Outliers can significantly affect rxy. All outliers should be critically examined
before leaving them in the analysis. If the values are legitimate and your sample size is
substantial leave them in the analysis.
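A sketch of the significance test and the variance-explained interpretation together (hypothetical x and y; assumes SciPy, whose pearsonr reports the p value directly rather than requiring a critical-value table):

```python
from scipy.stats import pearsonr

# Hypothetical x and y measures on 10 subjects
x = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
y = [3, 5, 4, 8, 9, 11, 10, 14, 13, 17]

r, p = pearsonr(x, y)            # p tests H0: rho = 0 (df = N - 2 = 8)
print("reject H0:", p < 0.05)
print("variance explained:", round(r ** 2, 2))   # r-squared
```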
Partial correlation
Sometimes a correlation between two variables is due to their dependence on a 3rd
variable.
Ex: Any set of variables that increase with age (shoe size & intelligence) - if you remove
(control for) the effects of age, correlation could change in direction and/or strength.
A partial correlation procedure allows you to hold constant a 3rd variable and look at a
'truer' correlation between x and y.
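The shoe-size/intelligence idea above can be sketched with residuals (hypothetical data; assumes NumPy; partialling is done here by regressing each variable on age and correlating the residuals, which is one standard route to the partial correlation):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y with z held constant: correlate the
    residuals of x regressed on z with the residuals of y regressed on z."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    zd = np.column_stack([np.ones_like(z), z])           # intercept + z
    rx = x - zd @ np.linalg.lstsq(zd, x, rcond=None)[0]  # x residuals
    ry = y - zd @ np.linalg.lstsq(zd, y, rcond=None)[0]  # y residuals
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical: shoe size and vocabulary both grow with age
age   = [5, 6, 7, 8, 9, 10, 11, 12]
shoe  = [28, 29, 31, 32, 33, 35, 36, 38]
vocab = [2000, 2600, 3100, 3900, 4500, 5200, 5800, 6500]

raw = np.corrcoef(shoe, vocab)[0, 1]
controlled = partial_corr(shoe, vocab, age)
print(round(raw, 2), round(controlled, 2))
```

With age held constant, the near-perfect raw correlation collapses toward zero.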
Regression
This is the most common approach to prediction problems when you have one dependent
variable and multiple independent variables.
Assumptions
H0: b = 0
Values from an analysis of variance table (which partitions the variance due to regression
(explained) and residual (unexplained)) can be used to (a) test the lack of fit assumption,
(b) then if assumptions met, test for a significant regression, and examine practical
significance.
Factor Analysis
Also a data reduction procedure, this technique is used when the interest is in examining
the underlying structure of several variables or items on a test. Both exploratory and
confirmatory techniques exist.
Charts, frequency distribution tables, and cross-tabulated tables are informative; however, it is often desirable to conduct a test of the null hypothesis that two categorical/ordinal variables are not related.
The statistic that will test for the presence of a relationship between two categorical variables (though it can also be used on ordinal data) is the chi-square statistic. The null hypothesis used to examine this is that there is no relationship. Another way to say this is that the variables x and y are independent. In fact, the chi-square statistic is commonly referred to as the chi-square test of independence.
Assumptions
For example: Is there a relationship between level of ability of athletes and willingness to spend time on a task for someone else? Assume that the return rate (of a survey or other information) is considered willingness to spend time for someone else's benefit. The information in the table below then represents return rate by level.
Spend Time   Elite   College   Intramural
Yes            10      32         35
No             62      40         37
To examine the expected frequency assumption you need an expected frequencies table.
Each cell in the expected frequencies table should have at least 5 cases.
To determine if this chi square is statistically significant, you compare it to a critical
value found in a chi square table. The degrees of freedom for a chi square statistic are:
df = (R-1)(C-1): Where R = # of rows, and C = # of columns in the two-way table.
The degrees of freedom for this problem are 2, so the critical value for an alpha of .01 is 9.21. Therefore, if the chi-square statistic for this problem is > 9.21 you can reject the null hypothesis, which suggests that there is a statistically significant relationship between level of ability and willingness to spend time on a task for someone else.
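The worked example can be checked directly. This sketch runs SciPy's chi-square test of independence on the 2x3 table from the text; on these numbers the statistic comes out near 22.6, well past the 9.21 critical value.

```python
# Chi-square test of independence for the ability-by-willingness table.
import numpy as np
from scipy import stats

table = np.array([[10, 32, 35],   # Yes: Elite, College, Intramural
                  [62, 40, 37]])  # No

chi2, p, df, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.4f}")
print("expected frequencies (assumption: each cell >= 5):")
print(np.round(expected, 1))
# df = (R-1)(C-1) = (2-1)(3-1) = 2; critical value at alpha = .01 is 9.21
```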
This does not necessarily mean that the relationship is of any practical significance. At this point all you know is that the variables in question are not independent. You should not stop here, and you should not yet claim that you have something special to report.
Since the chi square statistic is sensitive to sample size, just about any two variables can
be found to be related statistically given a large enough sample size. So, to examine
practical significance you assess the strength of the association between variables using
phi or Cramer's V.
Use phi for 2x2 tables.
Use Cramer's V for larger tables (Cramer's V reduces to phi when the smaller dimension of the table is 2).
With chi square based measures you cannot say much beyond the strength of the
relationship. No predictive interpretation is possible.
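The strength-of-association follow-up can be sketched the same way; Cramer's V is derived directly from the chi-square statistic, here for the same 2x3 table used above.

```python
# Cramer's V for the 2x3 table: strength of association, not prediction.
import numpy as np
from scipy import stats

table = np.array([[10, 32, 35],
                  [62, 40, 37]])

chi2, p, df, expected = stats.chi2_contingency(table)
n = table.sum()
k = min(table.shape) - 1                # smaller table dimension minus 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"Cramer's V = {cramers_v:.2f}")  # modest association despite significance
```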
Meta Analysis
Not a procedure for the analytically faint of heart. Much remains unsettled and is the source of considerable disagreement among many researchers. At the very least, careful thought and research need to be undertaken before beginning a meta-analysis project regarding the questions raised in the Thomas & Nelson text:
SOURCES OF INVALIDITY
What makes a data collection process valid or invalid? A data
collection process is valid to the extent that it meets the triple
criteria of (1) employing a logically appropriate operational
definition, (2) matching the items to the operational definition,
and (3) possessing a reasonable degree of reliability. Invalidity
enters the picture when the data collection strategy fails
seriously with regard to one of these criteria or fails to lesser
degrees in a combination of these criteria.
It may be instructive to look at some examples of invalid data
collection processes. Assume that a researcher wants to
develop an intelligence test. He operationally defines
intelligence as follows: "A person is intelligent to the extent
that he/she agrees with me." He then makes up a list of 100 of
his opinions and has people indicate whether they agree or
disagree with each item on this list. A person agreeing with 95
of the items would be defined as being more intelligent than
one who agreed with 90, and so on. This is an invalid measure
of intelligence, because the operational definition has nothing
to do with intelligence as any reputable theorist has ever
defined it.
Not all invalid data collection processes are so blatantly invalid.
Indeed, one of the most heated arguments in psychology today
is over the question of what intelligence tests actually
measure. This whole question is one of validity. The advocates
of many IQ tests argue that intelligence can be defined as
general problem-solving ability. They operationally define
intelligence as something like, "People are intelligent to the
extent that they can solve new problems presented to them."
They test for intelligence by giving a child a series of problems
and counting how many she can solve. A child who can solve a
large number of problems is considered to be more intelligent
than one who can solve only a few. The opponents of such
tests argue that the tests are invalid. They say that general
problem-solving ability is not the only quality, or even the
most important one, required to do well on such tests. The
tests, they argue, really measure how well a person has
adapted to a specific middle-class culture.
ESTABLISHING VALIDITY
From the preceding discussion, it can be seen that there are
three steps to establishing the validity of a data collection
process designed to measure an outcome variable:
Assumed Outcome Variable | Operational Definition | Conceivable Real Outcome Variable

Ability to understand reading passages | … | Ability to guess from context clues
Love of Shakespearean drama | The student will carry a copy of Shakespeare's plays with him to class | Eagerness to impress professor
Knowledge of driving laws | …true-false questions right on license test | …subtle clues present in them
Friendliness toward peers | … | …
Ability to tie one's own shoes | The student will tie her own shoes after they have been presented to her untied | …
Ability to spell | … | Ability to spell correctly on essays with use of dictionary
… | … | Ability to type 60 words per minute

Operational Definition | Task on Instrument

… | Given a (culturally familiar) novel problem to solve, the test taker will be able to solve the problem
… | …prepositions in the paragraph provided
copies of all the plays and never has to borrow from any library
except her own.) Nevertheless, it may still be valid to evaluate
the group based on this operational definition. If you teach the
Shakespeare plays a certain way one year and only 2% of the
students ever borrow related books from the library, and the
next year you teach the same subject differently and 50% of
the students spontaneously borrow books, it is probably valid
to infer from their available documented records that
appreciation of Shakespeare has increased. The group decision,
at any rate, is more likely to be valid than is the individual
diagnosis.
Box 5.1
An Argument-Based Approach to Validity
Set 2.
a. The student will identify examples of the
principles of physics in the kitchen at home.
(understands principles of physics)
b. The student will choose to take optional courses
in the physical sciences. (appreciates physical
sciences)
Part 2
Write Invalid next to statements that indicate an invalid data
collection process; write Valid next to those that indicate a
valid data collection process; write N if no relevant information
regarding validity is contained in the statement.
1____ The questions were so hard that I
was reduced to flipping a coin to guess
the answers.
2____ The test measures mere trivia, not
the important outcomes of the course.
3____ To rule out the influence of
memorized information regarding a
problem, only topics that were entirely
novel to all the students were included on
the problem-solving test.
Content Validity
Content validity refers to the extent to which a data collection
process measures a representative sample of the subject
matter or behavior that should be encompassed by the
operational definition. A high school English teacher's midterm
exam, for example, lacks content validity when it focuses
exclusively on what was covered in the last two weeks of the
term and inadvertently ignores the first six weeks of the
grading period. Likewise, a self-concept test would lack content
validity if all the items focused on academic situations,
ignoring the impact of home, church, and other factors outside
the school. Content validity is assured by logically analyzing
the domain of subject matter or behavior that would be
appropriate for inclusion on a data collection process and
examining the items to make sure that a representative
sample of the possible domain is included. In classroom tests,
a frequent violation of content validity occurs when test items
are written that focus on knowledge and comprehension levels
(because such items are easy to write), while ignoring the
important higher levels, such as synthesis and application of
principles (because such items are difficult to write).
Criterion-Related Validity
Construct Validity
Construct validity refers to the extent to which the results of a
data collection process can be interpreted in terms of
underlying psychological constructs. A construct is a label or
hypothetical interpretation of an internal behavior or
psychological quality - such as self-confidence, motivation, or
intelligence - that we assume exists to explain some observed
behavior. Construct validity often necessitates an extremely
complicated process of validation. To state it briefly, the
researcher develops a theory about how people should perform
during the data collection process if it really measures the
alleged construct and then collects data to see whether this is
what really happens. The process is complicated because the
researcher is doing two separate things: (1) proving that the
data collection process possesses construct validity and (2)
refining the theory about the construct. Note that this process
of validation can never be completed; the goal of researchers
engaging in construct validation is to refine concepts and data
collection processes, not to arrive at ultimate conclusions.
Construct validity often deals with the intervening variable
(discussed in chapter 2), and it is of greatest relevance to
theoretical research (discussed in chapter 17).
Remember: The three technical types of evidence for validity
are merely tools for demonstrating that a data collection
process measures what the test designer or researcher says it
measures. The fundamental logic behind them is relatively
straightforward. The difficulty lies in carrying out the
procedures to collect these types of evidence for validity. The
information presented here (summarized in Table 5.5) should
be enough to enable you to deal with applying and interpreting
these concepts in most situations. If you find that you need
further information (for example, if your job requires that you
select people accurately for various programs), consult the
more technical references in the Annotated Bibliography at the
end of this chapter.
Table 5.5: Types of validity - definition, mnemonic, and examples of how to achieve and demonstrate each.

Content
Definition: The extent to which a data collection process measures a representative sample of the topic encompassed by the operational definition.
Mnemonic: The content of the data collection process is a good sample of the content that it should cover.
Examples of how to achieve and demonstrate it:
1. Use a plan (such as an item matrix) to plan a test so that all areas are properly represented.
2. Logically show that nothing has been omitted or overrepresented in the data collection process.

Predictive
Definition: How well a data collection process predicts some future performance.
Mnemonic: The data collection process predicts something that has not yet occurred.
Examples:
1. Select students for an advanced algebra class based on a standardized math test. Then see if those who did well on the math test actually do better in the course.
2. Give students the SAT before they enter college. Then compute a correlation coefficient with college GPA to see if the SAT accurately predicts college performance.

Concurrent
Definition: How well a data collection process correlates with a current criterion.
Mnemonic: Both data collection processes occur at the same time (concurrently). We want to demonstrate that one can be considered a substitute for the other.
Examples:
1. Determine that success in English composition classes has already demonstrated writing skill, making it unnecessary for the student to take the English exit exam (which measures the same thing).
2. Compute a correlation coefficient between the performance of students on the computerized and non-computerized versions of the GRE (so that we can consider performance on one to be equivalent to performance on the other).

Construct
Definition: The extent to which the results of a data collection process can be interpreted in terms of underlying psychological constructs.
Mnemonic: A psychological construct (accent on first syllable) is something that exists inside a person's head. We construct it (accent on second syllable) by reasoning about observable information (such as test results).
Examples:
1. A person's test results show whites are smarter than blacks. We challenge this person by demonstrating that the test measures cultural familiarity rather than intelligence.
2. A person shows that her moral development test really does measure something that can be called moral reasoning, rather than reading ability, conformity, intelligence, or some other unrelated characteristic.
4. The dean wants to make sure that all the exams in the
her final comprehensive exam really are the ones who will
have trouble with related materials the next year.
…$75).
His sister's cat (she got it free a year ago).
Dad's car keys (car is safely parked on the street).
Mother's expensive coat (worth $300).
CB radio (worth $210).
Little brother's pet gerbil.
Dad's checkbook.
Grade Level | Time Interval | Correlation
Johnny | … | .63
Johnny | … | .75
Johnny | … | .70
Billy | … | .69
Billy | … | .70

Equivalent-Forms Reliability

Grade Level | Time Interval | Correlation
… | … | .70
… | … | .64
… | … | .73
… | … | .55
… | … | .71
… | … | .73
SUMMARY
Reliability refers to the degree to which measuring techniques
are consistent rather than self-contradictory. Reliability is
important because you will want to make decisions about your
programs and students based on internally consistent and
stable data rather than fleeting information that would change
if you simply took the time to collect the information a second
time. This chapter has discussed factors that introduce
inconsistency into data collection procedures as well as
strategies for controlling these factors. In addition to
presenting these guidelines, this chapter has described
statistical procedures that can be useful tools to help assure
reliability.
ANNOTATED BIBLIOGRAPHY
The following sources provide more detailed information on the
general topics of reliability and validity:
American Educational Research Association,
American Psychological Association, & National
Council on Measurement in Education. (1985).
Standards for educational and psychological testing.
Washington, DC: American Psychological
Association. This booklet includes the guidelines for
reliability and validity recommended by the three
corporate authors. A familiarity with these guidelines
will help you make better use of published
information regarding data collection processes.
Ebel, R. L., & Frisbie, D. A. (1991). Essentials of
educational measurement (5th ed.). Englewood
Cliffs, NJ: Prentice-Hall. Chapter 5, "The Reliability of
Test Scores," presents a clear statement of the
theoretical rationale behind the traditional methods
of assessing reliability, with a special emphasis on
how to apply these methods to educational practice.
Chapter 6, "Validity: Interpretation and Use," offers
guidelines to help teachers enhance the validity of
their tests by making sure they are appropriate for
the purposes for which they are intended. Chapter
13, "Evaluating Test and Item Characteristics,"
describes important techniques for promoting
internal consistency and generally revising early
versions of data collection procedures.
Gronlund, N. E., & Linn, R. L. (1990). Measurement
and evaluation in teaching (6th ed.). New York:
Macmillan. Chapter 3, "Validity," provides some
useful guidelines for teachers to increase the validity
of their classroom tests. Part of Chapter 11,
"Appraising Classroom Tests," describes item
analysis, an important technique for promoting
internal consistency. Chapter 4, "Reliability and
Other Desired Characteristics," discusses the
The following sources are useful for readers who are interested
in more detailed information on validity:
Cole, N. S. (1989). Bias in test use. In R. L. Linn (Ed.),
Educational measurement (3rd ed.). New York:
American Council on Education. This is a detailed
treatment of one of the major sources of invalidity in
the interpretation of data collection processes.
Messick, S. (1989). Validity. In R. L. Linn (Ed.),
Educational measurement (3rd ed.). New York:
American Council on Education. This is probably the
most authoritative discussion available regarding
the status of current thought on validity. Anyone
doing serious work on validity of data collection
processes should consult this chapter.
ANSWERS TO QUIZZES
Review Quiz 5.1
1. Unreliable. If they were measuring Ralph's ability
consistently, they should be able to agree on his score.
Ralph's score depends on who scores the test, not on
what he wrote down. This is like having two parents read
the thermometer and one conclude that the child is sick,
while the other concludes that he is healthy.
2. Reliable. Of course, you may need more than two
witnesses to convince a jury. Their testimony could be
invalid (discussed elsewhere in the chapter) if they had
conspired to lie about seeing the violin case. However, in
the sense that we are using the word here, their
testimony is reliable.
3. No information is provided about reliability. It is quite
possible that her performance could have changed during
the intervening year. The difference could be the result of
unreliability, but we have no way to know. There is no
solid reason to expect the two grades to be identical.
4. Unreliable. If the same students are rating his overall
ability as a teacher, there does not seem to be any good
reason why this should change between Monday and
Wednesday. If factors such as mood swings on the part of
the students or teacher are causing the ratings to vary,
these are sources of unreliability. (Note that if the
students were rating him on how well he taught a specific
lesson, then it might be plausible to say that he did well
on one lesson and less well on another. In this case, the
different scores would be a reflection of his actual
performance, and the variation would not be evidence of
unreliability.)
5. Unreliable. Neuroses are supposed to be relatively
permanent personality characteristics. They do not come
one day and go the next. Neurosis is a vague term, and
the counselor is probably having trouble operationally
defining what she means.
Part 2
1. Invalid. The test is unreliable and therefore invalid.
2. Invalid. The tasks do not match the designated outcome
variable.
ANSWERS:
1. Information about the reliability of the unit pretests is not
clearly stated. The fact that the pretest and posttest
items were based on the same set of objectives would
tend to make it likely that changes in performance from
pretest to posttest reflected real gains rather than a
change in the test, but this report does not focus on the
pretest-posttest comparisons. The fact that the tests were
administered under standardized conditions would also
tend to make them more reliable. However, it would have
been useful to have evidence from a reliability coefficient.
2. The fact that the tests were taken from the manual
suggests good content validity for the objectives of the
unit, but we don't know how strongly the objectives of the
unit are related to the concept of problem solving. In fact,
it seems likely that the improved performance of students
on subsequent tests could be a result of generalization of
learning rather than problem solving. Actually, since the
first author of this textbook was the second author of this
study, he knows that the tests actually did measure
problem-solving ability, but this information is not clearly
expressed in the report.