METHODS:
RESEARCH THEMES
Northeastern University
2010
The Authors
Table of Contents
INTRODUCTION TO CRIMINOLOGICAL AND CRIMINAL JUSTICE RESEARCH THEMES
CHAPTER 1: MAJOR CRIME DATA SOURCES: PROMISES AND PROBLEMS
By Matthew J. Dolliver
CHAPTER 2. DATA SOURCES FOR CRIMINOLOGICAL RESEARCH
By Michael Rocque
CHAPTER 3. SAMPLING
By Stephanie Fahy
CHAPTER 4. SCALE MEASUREMENT
By Chad Posick
CHAPTER 5. EXPERIMENTAL AND MODIFIED-EXPERIMENTAL DESIGNS
By Michael Rocque and Chad Posick
CHAPTER 6. QUALITATIVE RESEARCH METHODS
By Diana K. Peel
CHAPTER 7. PROGRAM EVALUATION
By Kristin Reschenberg
CHAPTER 8. NEWER STATISTICAL METHODS (PART I)
By Diana Summers
CHAPTER 9. RESEARCH DESIGN AND NEWER STATISTICAL METHODS (PART II)
By Sean Christie
The fifth chapter, by Michael Rocque and Chad Posick, provides a basic overview of the fundamental research designs in the social sciences: experiments and quasi (modified) experiments. The first part of the chapter describes what experiments are, their strengths and weaknesses, and the ethical issues involved in such designs. The next part describes modified experiments, which are the strongest research methodology available when true experiments are not feasible or ethical.
The sixth chapter is written by Diana K. Peel, a first-year PhD student. Her theme focuses on methods of qualitative research in the social sciences. She discusses sampling methods, coding, and analysis techniques typically used in qualitative research. This topic is essential for all researchers to be familiar with in order to become competent consumers of the literature.
The seventh chapter is written by Kristin Reschenberg, a first-year PhD student. Her theme provides an introduction to the basics of program evaluation. Her chapter discusses what program evaluation is and includes a brief discussion of research designs commonly used in such evaluations. It also places an important emphasis on the political nature of program evaluations in criminal justice.
The eighth and ninth chapters are written by Diana Summers, a first-year PhD student, and Sean Christie, a second-year PhD student. These chapters are companion themes, covering new and relatively rare statistical methods. Ms. Summers's theme covers longitudinal and time-series methods, and Mr. Christie's theme covers a variety of methods from meta-analysis to survival analysis.
This guide is meant to be used as a reference for beginning and intermediate researchers in criminology and criminal justice. We hope it serves you well.
Michael Rocque
Chad Posick
April 2010
REFERENCES
Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-Experimental Designs for
Research. Chicago: Rand McNally.
Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for
field settings. Boston: Houghton-Mifflin.
DeCoster, J. (2005). Scale construction notes. Retrieved November 16, 2008, from
http://www.stat-help.com/notes.html
DeVellis, R.F. (1991). Scale Development: theory and applications (Applied Social
Research Methods Series, Vol. 26). Newbury Park: Sage.
Isaac, S. and Michael, W.B. (1995). Handbook in Research and Evaluation (3rd Edition). San
Diego CA: EdITS.
Kerlinger, F. N. and Lee, H. B. (2000). Foundations of Behavioral Research (4th Ed.). New
York, NY: Holt, Rinehart and Winston.
Spector, Paul E. (1992). Summated Rating Scale Construction: An Introduction. Newbury
Park, CA: Sage Publications.
Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd Edition). Cincinnati, OH:
Atomic Dog Publishing.
CHAPTER 1: MAJOR CRIME DATA SOURCES: PROMISES AND PROBLEMS
By Matthew J. Dolliver
PART I: INTRODUCTION
The major crime data sources and their significance
There are a number of official/major crime data sources which can be utilized by researchers. These data sources comprise the official measures of crime for the U.S. federal government. These measures can be used for a variety of projects, from large-scale secondary
data analyses to quick references. As both a popular and official measure, these data sources are
worthy of some examination. This chapter will look at major crime data sources, including the
Uniform Crime Report, The National Incident-Based Reporting System, and the National Crime
Victimization Survey.
PART II: THE THREE MAJOR CRIME DATA SOURCES
Crime in the United States: An overview of the Uniform Crime Report (UCR)
The Federal Bureau of Investigation (FBI) was tasked with publishing the Uniform Crime Report (UCR) in 1930 as an official measure of crime in the United States. The report is composed of crimes recorded and arrests made by police in a given year (FBI, 2010). This record has been used to estimate the volume of crime, including the degree to which crime fluctuates over time, space, and demographic factors throughout the U.S. (Liska and Messner, 1999).

The UCR is compiled based on crimes reported by (participating) police agencies to the FBI (FBI, 2010). These crimes are placed into two basic groups and coded as Part I and Part II offenses by the FBI. Part I, or index, offenses include homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, motor vehicle theft, and arson. These crimes are considered to be the most serious, regularly occurring, and widespread crimes in the UCR (FBI, 2010). Part II offenses include simple assault, fraud, embezzlement, gambling, driving under the influence, weapons possession, vandalism, and vagrancy. Police departments report age, sex, race, and clearance rate for Part I offenses. For Part II offenses, only arrest data are reported. The UCR reports crime to the public in three ways. First, there are raw figures given for each crime type. Second, there is the rate per 100,000 population. Finally, the UCR presents the change in the raw and rate figures over time (Pattavina, 2005). These data are presented in a variety of formats and are widely available through the FBI's website.
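To make these three presentations concrete, here is a minimal sketch in Python (the counts, years, and population are hypothetical, not actual UCR figures) showing how a raw count is converted to a rate per 100,000 and how change over time can be expressed:

def rate_per_100k(offense_count, population):
    # Offenses per 100,000 residents, the second form in which the UCR reports crime.
    return offense_count / population * 100_000

def percent_change(previous, current):
    # Change over time, expressed as a percentage of the earlier figure.
    return (current - previous) / previous * 100

burglaries_2008, burglaries_2009 = 2_150, 1_980   # hypothetical raw figures
city_population = 500_000

rate_2008 = rate_per_100k(burglaries_2008, city_population)   # 430.0
rate_2009 = rate_per_100k(burglaries_2009, city_population)   # 396.0
print(f"2009 rate: {rate_2009:.1f} per 100,000 ({percent_change(rate_2008, rate_2009):+.1f}% vs. 2008)")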
Analysis of the Uniform Crime Report Approach
One of the most widely cited weaknesses of the UCR is its validity. There are several aspects of the UCR's construction that draw its validity into question. First, it is well known that many crimes are not reported to the police. Second, a crime known to the police may not result in a report by the police or may be recorded as a different type of crime when it is reported (Pope et al., 2001). This may occur for a number of reasons. For instance, police may have differing definitions or may be looking to present a particular image. It should be noted that the FBI does provide a list of crimes and guidelines for reporting them; however, these are guidelines and are difficult if not impossible to enforce (FBI, 2010). Finally, the UCR is compiled using a hierarchy rule. This rule codes a group or string of offenses that happen together as a single crime represented by the worst offense. This coding practice does have some exceptions, for instance, when multiple homicides are involved (Pattavina, 2005).
The UCR also suffers from systematic underreporting due to the voluntary nature of the program. No law enforcement agency is required to submit data. According to the FBI, about 93% of departments participate. However, this still means that crime will be underreported. Additionally, the UCR does not account for federal crimes or arrests (FBI, 2010). Underreporting will also take place because of the UCR's narrow field of included offenses. Together, Parts I and II cover only the 29 crimes seen by the FBI as the most serious. This list excludes a number of crimes and even crime types. For example, many white-collar crimes are not included (Liska and Messner, 1999).
The evolution of crime in the U.S.: National Incident-Based Reporting System (NIBRS)
In an effort to update its approach and address some of the issues associated with the UCR, the FBI introduced the National Incident-Based Reporting System (NIBRS) in the late 1980s. As stated by the FBI, the goal of this new system of crime reporting is to enhance the quantity, quality, and timeliness of crime data, as well as to improve the methodology used for compiling, analyzing, and publishing those data (FBI, 2010a). Because NIBRS is an incident-based reporting system, it contains detailed information on individual crimes/arrests. For example, NIBRS reports include such information as location of incident, method of entry, and victim-offender relationship (FBI, 2010a), all of which can be useful to our understanding of crime.

NIBRS data are also collected and separated into two categories. Group A is comprised of 46 crimes, and Group B has an additional 11. Group A offenses expand on the index crime concept of the FBI by including crimes such as kidnapping and sex offenses. Crime in the NIBRS system can generally be thought of as one of three types: violent crime, property crime, or crimes against society. The expansion of the NIBRS system into the area of crimes against society shows a shift in development from the UCR and places an emphasis on drugs and drug-related crime (Pattavina, 2005).
Analysis of the National Incident-Based Reporting System approach
The FBI's evolution of crime reporting from the UCR to NIBRS has several notable advantages. First, the NIBRS system is able to distinguish between crimes committed in a group or series, as it does not use the hierarchy rule. Additionally, the NIBRS system of reporting is able to distinguish between attempted and completed crimes. In the same way, the NIBRS system produces better definitions of crimes for reporting and classification by law enforcement. Related to this classification, the NIBRS system also yields more complete and comprehensive statistical analysis of crime in the U.S. (Pattavina, 2005).

NIBRS data are collected based on the reports of participating police agencies. This leaves them subject to underreporting (as discussed earlier with the UCR). This point is particularly significant because only about half of the U.S. participates in reporting to NIBRS. This reporting can be seen as costly and time consuming, particularly for smaller police agencies.
NATIONAL CRIME VICTIMIZATION SURVEY (NCVS)
The BJS system and alternative methods: The National Crime Victimization Survey
The final major source of crime data that we will examine here is the Bureau of Justice
Statistics' National Crime Victimization Survey (NCVS). Because this measure of crime is conducted using a survey, it is not subject to the reporting problems associated with either the UCR or NIBRS. Approximately 150,000 interviews are conducted in two sessions a year, using U.S. Bureau of the Census personnel. This alternative method of data collection reveals some of the underreporting seen in the UCR and NIBRS. The NCVS collects demographic and other information about the persons involved, the nature and extent of the victimizations, economic consequences, and other real-world information in order to provide a more complete understanding (Pattavina, 2005).
However, as a survey, the NCVS is labor-intensive and costly. These drawbacks are felt in the collection and compilation of data. For example, compiling NCVS data takes more time, effort, personnel, and financial resources. This draws out an additional limitation of this approach. Interviewers may affect the outcomes of surveys or may encode answers incorrectly when dealing with participants. This means there is a potential for inconsistency in the coding of data, as well as the creation of bias. In the same way, subjects may fail to tell the truth or decline to participate fully. For instance, a subject may be embarrassed to reveal a criminal incident to an interviewer. Similarly, a subject's memories surrounding an incident may not be accurate. Finally, the NCVS does not collect information on a number of crimes. The survey focuses on victimization experiences. For example, the survey includes only those 12 years of age and older and therefore systematically leaves out a group of crimes (Pattavina, 2005).
PART III: CONCLUSION
We have conducted some analyses of the three major crime data sources. These sources
are presented by the United States federal government as the official measure of crime. However,
as we have seen, all of these methods for measuring and reporting crime are subject to some
limitations. They are best viewed together as an estimate of the nation's general level of crime.
These measures are also likely to shift and evolve if they are to keep up with the rapidly changing nature of crime and its scientific study.
REFERENCES
Federal Bureau of Investigation. (2010a). National Incident-Based Reporting System, Volume 1.
Available from: www.fbi.gov/ucr/nibrs/manuals/v1all.pdf
Federal Bureau of Investigation (2010). Crime in the United States, 2002. Washington,
DC: United States Government Printing Office.
www.fbi.gov/ucr.ucr.htm#cius
Liska, A. E. and S. F. Messner. (1999). Perspectives on Crime and Deviance (3rd ed). Prentice
Hall, Upper Saddle River, NJ.
Pattavina, A. (2005). Information Technology and the Criminal Justice system. Thousand
Oaks, CA: Sage publishing.
Pope, C., R. Lovell, and S. Brandl. (2001). Voices from the Field: Readings in
Criminal Justice Research. Scarborough, Ontario: Wadsworth Publishing.
The NCVS includes violent crime and property crime, analogous to the UCR.
Supplements: Crime and School Safety – includes information from US schools on:
o Alcohol and drug availability;
o Fighting, bullying, and hate-related behaviors;
Data include:
measures of self-reported offending;
indicators of repeat offending;
trends in the prevalence of offending;
trends in the prevalence and frequency of drug and alcohol use;
evidence on the links between offending and drug / alcohol use;
evidence on the risk factors related to offending and drug use; and
information on the nature of offences committed, such as the role of co-offenders and the relationship between perpetrators and victims.
For more information and to download data, go to
http://www.homeoffice.gov.uk/rds/offending_survey.html
D. International Crime Victimization Survey
The International Crime Victimization Survey (ICVS) is conducted by the United Nations and
initiated by the ICVS international working group. The ICVS was first carried out in 1987, then
again in 1992. The third wave occurred in 1996 and the fourth in 2000. The latest round (2005)
includes 78 countries and 300,000 interviews. The purpose of the survey is to generate data for
national comparisons.
Subjects are interviewed in a similar fashion to the NCVS. Households are the unit of analysis, and interviews are done predominantly over the phone using CATI methods. Screening questions are used to determine whether a person has been a victim of a crime and, if so, more detailed information is collected.
Information collected includes:
Demographic information;
5 year victimization screen;
Detailed situational information on victimization;
Whether crime was reported to police;
Victim services information; and
Seriousness of the crime
For more information download the latest ICVS report here:
http://www.unicri.it/wwd/analysis/icvs/pdf_files/ICVS2004_05report.pdf
Or download the data here:
http://www.icpsr.umich.edu/cocoon/NACJD/SERIES/00175.xml
E. European Union Crime and Safety Survey
The European Union Crime and Safety Survey (EU ICS) is modeled after the ICVS. It is conducted within the nations of the European Union and uses a slightly modified instrument from
that used in the ICVS. This survey was initiated in 2005.
The EU ICS uses Random Digit Dialing and complex survey methodology. Interviews are conducted using CATI methods (and some Web-based instruments). The total sample contains over
28,000 individuals.
Data include:
Household information
Victimization experiences
o Personal
o Property
o Motor Vehicle
o Hate crimes
Perceptions of safety
Neighborhood characteristics
For more information and to download data, go to:
http://www.europeansafetyobservatory.eu/euics_da.htm
F. Law Enforcement Management and Administrative Statistics (LEMAS)
LEMAS is a national survey of law enforcement agencies with over 100 sworn personnel. The
survey is executed on a three-year basis (Langworthy, 2002). The purpose of the survey is to
gather data on officers, hiring practices, training procedures, expenditures and agency equipment (US DOJ, 1996, as cited in Langworthy, 2002). Data are also collected on agency initiatives, such as community policing.
The survey was conducted in the following years: 1987, 1990, 1993, 1997, 1999, 2000 and 2003.
For more information or to download data files, go to
http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/04411/detail
See Langworthy (2002). LEMAS: a comparative organizational research platform. Justice Research and Statistics Association.
G. World Health Survey (World Health Organization)
The World Health Survey (WHS) is conducted by the World Health Organization. Its purpose is
to provide data for cross-national comparisons on a variety of health issues. The WHS is relevant
for criminological research because it provides measures that can serve as indicators of the wellbeing of a nation as well as information on policies within nations. Data are collected from nations across the globe, and are geocoded.
Information is collected on the following domains:
Socio-demographics (occupation, income, sex and age structure of household, household
expenditures);
Healthcare utilization; and
Health status of respondents
Sociodemographic factors;
Education;
Work;
Health;
Crime and delinquency;
Criminal justice contact;
Income;
Marriage and social relationships;
Sexual activity;
Substance and sexual abuse;
Attitudes; and
Political participation
Socioeconomic factors;
Impulsivity;
Crime/delinquency;
Attitudes;
Cognitive and language development; and
Family structure and relationships (Parental figure or caregiver)
Laub, J. H., and R. J. Sampson. (2003). Shared Beginnings, Divergent Lives: Delinquent Boys to
Age 70. Cambridge, MA: Harvard University Press.
Mosher, C.J., T. D. Miethe and D. M. Phillips (2002). The Mismeasure of Crime. Thousand
Oaks: Sage.
Sampson, R. J., and J. H. Laub. (1993). Crime in the making: Pathways and turning points
through life. Cambridge, MA: Harvard University Press.
CHAPTER 3. SAMPLING
By Stephanie Fahy
This chapter reviews sampling methods in social science research. Sampling involves selecting a
smaller number of elements (such as people or organizations) within a population of interest in
order to generalize from the sample to the population from which the elements were chosen
(Trochim, 2001:41). A sample's quality is largely based on the degree to which it is representative – the extent to which the characteristics of the sample are the same as those of the population from which it was selected (Maxfield and Babbie, 2001:242).
Key Terms
Population – the group you wish to generalize to
o Theoretical population – who we want to generalize to
o Accessible population – the population we can get access to
Sampling Frame – the listing of the accessible population from which you'll draw your sample
Sample – the group of people you select to be in your study
External validity – the degree to which the conclusions in your study would hold for other persons in other places and at other times. A threat to external validity is an explanation of how you might be wrong in making a generalization.
o One way of improving external validity is by doing a good job of drawing a sample from a population (i.e., using random selection as opposed to non-random selection)
Figure 3.1. Different groups in a sampling model (Trochim, 2001): the theoretical population (who do you want to generalize to?) and the study population (what population can you get access to?)
Even the most carefully selected sample is almost never a perfect representation of the population from which it was selected. Probability sampling methods are highly recommended for
selecting samples that will be quite representative. An important advantage of using probability
sampling methods is that they make it possible to estimate the amount of sampling error that
should be expected in a given sample (Maxfield and Babbie, 2001:242).
A basic principle of probability sampling is that a sample will be representative of the
population from which it is selected if all members of the population have an equal
chance of being selected in the sample (Maxfield and Babbie, 2001: 220); thus, the key to
this process is random selection.
o Random selection forms the basis of probability theory, which permits inferences
about how sampled data are distributed around the value found in a larger population (i.e., probability theory makes it possible to estimate sampling error and confidence intervals for the population parameter); thus allowing you to estimate the
accuracy or representativeness of the sample. In other words, you know the odds
or probability that you have represented the population well (Maxfield and Babbie, 2001).
o Random selection reduces conscious and unconscious sampling bias.
A probability sampling method is any method of sampling that utilizes some form of random selection. Random sampling gives each and every member of the population an
equal chance of being selected for the sample (Fox, Levin and Shively, 2002:158).
Types of Probability Sampling Methods (see Figure 3.2)
Simple Random Sampling – The simplest form of random sampling is appropriately called simple random sampling (Trochim, 2001). Once a sampling frame has been established, you would assign a single number to each element in the list, not skipping any number in the process. A table of random numbers, or a computer program for generating them, is then used to select elements for the sample (Maxfield and Babbie, 2001:230). Many computer programs can do this work: the elements in the sampling frame are numbered, the program generates its own series of random numbers, and it prints out the list of elements selected (Maxfield and Babbie, 2001; Trochim, 2001). Simple random sampling is easy to accomplish, and because it is a fair way to select a sample, it is reasonable to generalize the results from the sample back to the population (Trochim, 2001:51).
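As a minimal sketch of the procedure just described (the sampling frame and sample size here are hypothetical), a pseudo-random number generator can take the place of a table of random numbers:

import random

# Hypothetical sampling frame: every element in the accessible population, numbered 1 to 10,000.
sampling_frame = [f"element_{i}" for i in range(1, 10_001)]

random.seed(42)                                  # fixed seed so the draw can be reproduced
sample = random.sample(sampling_frame, k=370)    # simple random sample, drawn without replacement

print(len(sample), sample[:3])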
Stratified Random Sampling – Involves dividing your population into homogeneous subgroups or strata and then taking a simple random sample in each subgroup. Stratification is based on the idea that a homogeneous group requires a smaller sample than does a heterogeneous group (Fox, Levin and Shively, 2001:160). Stratified sampling may be preferred over simple random sampling because 1) it assures that you will be able to represent not only the overall population but also key subgroups of the population, particularly small minority groups, and 2) stratified random sampling has more statistical precision than simple random sampling if the strata or groups are homogeneous, since the variability within groups would be expected to be lower than the variability for the population as a whole¹ (Trochim, 2001).
¹ According to sampling theory, a homogeneous population produces samples with smaller sampling errors than a heterogeneous population does, and stratified sampling is based on this principle.
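A companion sketch of proportionate stratified random sampling (the strata and their sizes are hypothetical): the frame is first divided into homogeneous strata, then a simple random sample is drawn within each stratum in proportion to its share of the population.

import random

random.seed(42)

# Hypothetical frame of 10,000 inmates tagged by facility security level (the stratifying variable).
frame = ([("minimum", f"min_{i}") for i in range(6_000)] +
         [("medium", f"med_{i}") for i in range(3_000)] +
         [("maximum", f"max_{i}") for i in range(1_000)])

def stratified_sample(frame, total_n):
    # Group elements by stratum, then take a simple random sample within each stratum,
    # allocating the total sample size proportionately to stratum size.
    strata = {}
    for stratum, element in frame:
        strata.setdefault(stratum, []).append(element)
    sample = []
    for elements in strata.values():
        n_h = round(total_n * len(elements) / len(frame))
        sample.extend(random.sample(elements, n_h))
    return sample

sample = stratified_sample(frame, total_n=400)   # 240 minimum, 120 medium, 40 maximum
print(len(sample))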
while decreasing the number of elements within each cluster² (Maxfield and Babbie,
2001:233).
Eltinge and Sribney (1997) point out that an important consideration to keep in mind when collecting sample data through complex designs like stratified and multi-stage cluster sampling is
that data cannot be assumed to be independent and identically distributed (iid) and therefore it is
inappropriate to analyze data using statistical methods that are based on iid assumptions (i.e., regression commands in statistical software programs like SPSS and Stata) since estimates will
almost certainly be biased. They argue that iid-based methods do not adjust for the effects of unequal selection probabilities and that iid-based variance estimators do not account for
the loss of information that invariably occurs when using a complex design over simple random
sampling, so you would run the risk of overstating the certainty attached to estimates (Eltinge
and Sribney, 1997).
A design-based approach is preferable to model-based approaches (i.e., iid-based approach) since
it accounts for the collection of sample data through complex designs, accounting for any losses
or gains in information, which results in more robust and accurate estimates. Importantly, this
approach restricts randomness to the specific random process by which the sample was selected (i.e., random selection at each successive stage) rather than assuming a true random sample, which is the approach taken by model-based analysis (Eltinge and Sribney, 1997).
Levy and Lemeshow (1999:482) recommend the following steps for performing a design-based
analysis:
1) Identify the following elements of the sample design:
a. Stratification
b. Clustering variables used
c. Population sizes required for determination of finite population corrections
2) On the basis of the above information, determine the sampling weight for each sample
subject.
3) Determine for each sample record a final sampling weight that takes into consideration
any nonresponse and poststratification adjustments that are desired.
4) Ensure that all stratification, clustering, and population size data required for an appropriate design-based analysis are identified on each sample record.
5) Determine the procedure and the set of commands for performing the required analysis
for the particular software package that will be used.
6) Run the analysis and carefully interpret the findings.
² One way sampling error is reduced is by increasing the homogeneity of elements sampled. A sample of clusters will best represent all clusters if a large number are selected and if all clusters are very much alike. A sample of elements will best represent all elements in a given cluster if a large number are selected from the cluster and if all the elements in the cluster are very much alike (Maxfield and Babbie, 2001:233).
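As a hedged illustration of steps 2 and 3 in the Levy and Lemeshow list above (every number here is hypothetical), a final sampling weight can be built from the inverse of the selection probability and then adjusted for nonresponse and poststratification; survey-aware software then combines these weights with the stratum and cluster identifiers from step 4.

def final_weight(selection_prob, response_rate, poststrat_factor=1.0):
    # Base weight = inverse of the probability of selection; adjusted upward for
    # nonresponse and multiplied by any poststratification factor that is desired.
    base_weight = 1.0 / selection_prob
    nonresponse_adj = 1.0 / response_rate
    return base_weight * nonresponse_adj * poststrat_factor

# Hypothetical record: selected with probability 1/250 within its stratum, from a
# stratum with an 80% response rate and a poststratification factor of 1.05.
w = final_weight(selection_prob=1/250, response_rate=0.80, poststrat_factor=1.05)
print(round(w, 1))   # 328.1: this record stands in for roughly 328 population members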
Figure 3.2. Probability Sampling Techniques (Cox and Fitzgerald's figures on probability sampling)
Non-probability sampling does not involve random selection and therefore cannot depend on
the rationale of probability theory. In general, researchers prefer probabilistic or random sampling methods over non-probabilistic ones and consider them to be more accurate and rigorous;
however, in some circumstances in applied research it is not feasible, practical or theoretically
sensible to use random sampling (Trochim, 2001:55-56). Additionally, researchers may have
limited research objectives or seek to interview a population with no established sampling frame
(e.g., car thieves, bank robbers) (Maxfield and Babbie, 2001).
Non-probability sampling methods are divided into two broad types: accidental or purposive.
Types of Non-Probability Sampling Methods
Accidental, haphazard or convenience sampling – This type of sampling takes what is most quickly and easily available, often relying on available subjects (e.g., stopping people at a street corner or some other location). This type of sampling method is neither purposeful nor strategic, and there is no evidence that the subjects are representative of the populations you're interested in generalizing to (Maxfield and Babbie, 2001; Trochim, 2001:56).
Purposive sampling – This type of sampling is designed to understand certain select cases in their own right rather than to generalize results to a population (Isaac and Michael, 1995:223). A sample is selected based on your judgment and the purpose of the study, and usually you would be seeking one or more specific predefined groups (Trochim, 2001). For example, a study that looked at people's attitudes about court-ordered restitution for crime victims may want to test the questionnaire on a sample of crime victims; so rather than select a probability sample of the general population, you may select some number of known crime victims, perhaps from court records (Maxfield and Babbie, 2001:238). With this type of sampling, you are likely to get the opinions of your target population, but you are also likely to overweight subgroups in your population that are more readily accessible (Trochim, 2001:56).
The following are subcategories of purposive sampling (Trochim, 2001:56-58):
o Modal Instance Sampling – This type of sampling involves sampling the most frequent or typical case (e.g., informal public opinion polls interview the "typical" voter). An obvious problem with this approach is determining what the typical or modal case is.
o Expert Sampling – This approach involves assembling a sample of people with known expertise in a particular area. This approach can also be used to validate another sampling approach; however, the disadvantage to using this approach is that even experts can be wrong.
o Quota Sampling – This approach involves selecting people nonrandomly according to some fixed quota. There are two types of quota sampling: proportional and nonproportional. For example, with proportional quota sampling, if you knew the population has 40 percent women and 60 percent men and you wanted a total sample size of 100, you would continue sampling until you reach those percentages and then stop. So, if you already had 40 women but not 60 men, you would continue to sample men but not women because you have met your quota for women. There are a couple of problems with this approach. First, the quota frames must be accurate, and it is often difficult to get up-to-date information on the proportional breakdown of sample elements. Second, selection bias may exist
o Criterion sampling – This strategy involves studying all cases that meet some predetermined criterion of importance.
o Confirmatory or disconfirming cases – These cases are used to either support or call into question the emerging trends or patterns in the early exploratory phase of a qualitative evaluation.
o Sampling politically important cases – Because evaluation often takes place in a politically sensitive environment, it is practical to sample politically important or sensitive cases.
[Table comparing sampling methods: sampling method, when to use it, advantages, and disadvantages]
The required sample size for a population of a given size can be determined with the following formula:

s = X²NP(1 - P) / [d²(N - 1) + X²P(1 - P)]*

Where:
s = required sample size**
N = population size = 10,000
P = population proportion = 0.50 (assumed, which yields the maximum sample size)
d = degree of accuracy = 0.05
X² = table value of chi-square for 1 degree of freedom at the 95% confidence level = 3.841

*This formula and corresponding calculation assume random selection of participants.
**Results should be rounded up to include the whole person. Thus, for a population of 10,000, to obtain a sample that is representative of the population at the 95% confidence level, one should sample 370 individuals.
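The calculation can be checked in a few lines of Python; the function below implements the formula above and rounds the result up to the next whole person (the function name and default values are illustrative).

import math

def required_sample_size(N, P=0.50, d=0.05, chi_sq=3.841):
    # Required sample size for a finite population of size N at the 95% confidence level.
    s = (chi_sq * N * P * (1 - P)) / (d**2 * (N - 1) + chi_sq * P * (1 - P))
    return math.ceil(s)   # round up to include the whole person

print(required_sample_size(10_000))   # prints 370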
Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd Edition). Cincinnati, OH:
Atomic Dog Publishing.
Guttman Scaling
Similar to the Thurstone scale, the Guttman scale seeks to gauge the extremity of a respondent's position on a certain concept. Using this scale, it is hypothesized that if a respondent answers a particular question in the affirmative, all preceding, less extreme questions will also be answered in the affirmative. A criminal justice example is used below to gauge level of delinquency. Someone who has been incarcerated is likely also to have been previously convicted, arrested, and stopped by the police.
1. I have been previously stopped by the police            Yes ____   No ____
2. I have been previously arrested by the police           Yes ____   No ____
3. I have been previously convicted in a court of law      Yes ____   No ____
4. I have been previously incarcerated in prison or jail   Yes ____   No ____
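The cumulative logic behind these items can be expressed in a short Python sketch (the response patterns are hypothetical): a respondent's scale score is simply the number of affirmative answers, and a pattern is consistent with the Guttman assumption only when every "yes" is accompanied by "yes" answers to all of the less extreme items that precede it.

def guttman_score(responses):
    # Items are ordered from least to most extreme; the score is the count of "yes" (1) answers.
    return sum(responses)

def fits_guttman_pattern(responses):
    # Consistent patterns have all 1s before any 0s (e.g., 1,1,1,0 but not 0,1,0,0).
    return sorted(responses, reverse=True) == list(responses)

# Items: stopped, arrested, convicted, incarcerated (1 = yes, 0 = no).
respondent_a = [1, 1, 1, 0]   # stopped, arrested, and convicted, but never incarcerated
respondent_b = [0, 1, 0, 0]   # reports an arrest but no police stop: does not fit the scale

print(guttman_score(respondent_a), fits_guttman_pattern(respondent_a))   # 3 True
print(guttman_score(respondent_b), fits_guttman_pattern(respondent_b))   # 1 False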
Likert Scaling
Likert scaling is probably the most widely used format for scaling on surveys. This format lends itself well to quantitative analysis and is thus helpful in social science research. In Likert scales, a question or statement is presented, followed by a series of response choices. The number of response choices will vary, but a scale should include between 5 and 9 to ensure adequate variation for statistical analyses. A set of self-control items is presented in this example, rated from 1 (Fully Agree) to 5 (Fully Disagree).

1. I act on the spur of the moment without thinking                                     1 2 3 4 5
2. I do whatever brings me pleasure in the here and now                                 1 2 3 4 5
3. I am more concerned with what happens to me in the short run than in the long run    1 2 3 4 5
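Because the appeal of this format is quantitative analysis, here is a brief sketch (Python, with hypothetical responses) of how the three items above might be combined into a single summated score:

# Hypothetical responses to the three items (1 = Fully Agree ... 5 = Fully Disagree).
responses = {"item_1": 2, "item_2": 3, "item_3": 1}

summated_score = sum(responses.values())         # ranges from 3 to 15 across three items
mean_score = summated_score / len(responses)     # stays on the original 1-5 metric

print(summated_score, round(mean_score, 2))      # 6 2.0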
Semantic Differential
The semantic differential format is similar to the Likert scale and tends to obtain similar information. The response choices exist between two extremes, and the respondent chooses the option that corresponds to the extent of their attitude or belief. This is shown below.
How often do you lose your temper?
Very Often ----- ----- ----- ----- ----- ------ ------- Hardly Ever
Visual Analogue
The visual analogue is almost exactly the same as the semantic differential. However, where response choices are
given in the semantic differential, the visual analogue does not separate response choices and the respondent marks
their response along a solid continuum.
How often do you lose your temper?
Very Often -------------------------------------------------------- Hardly Ever
and reliability/validity tests should be run on the responses from the sample. Items that are poorly correlated or which lower the reliability of the scale should be dropped, or the data should be
re-examined for improper coding, missing data, etc.
Step 8: Optimize scale length
The scale should not be unduly long; an overly long scale places a burden on the participants and increases the chance of burnout and fatigue (which lowers response rates and increases acquiescent responding). When possible, splitting the sample into two smaller samples and using one group as a development sample and the other as a cross-validation sample is useful, but this is not always possible.
RULES FOR GOOD QUALITY SCALES
Once you have determined what you want to study (the concept) and how you want to set up the format of the questions and scales, it is time to take it one step further and ensure that the scale is of high quality. The following rules are the minimal steps in constructing a good-quality scale.
5 Rules for Quality Scales
1) Each item should express only one idea
Questions with two ideas (usually separated by an "and" or an "or") are confusing, and respondents may have different answers to the two ideas. These are termed double-barreled questions.
Don't do this: "I act on the spur of the moment and I do not think about the long-term consequences of my actions."
Do this instead: Someone might act on the spur of the moment but also think about long-term consequences. To limit this confusion, split the question into two separate items:
1) I tend to act on the spur of the moment.
2) I do not think about the long-term consequences of my actions.
REFERENCES
Andrich, D. (1988). Rasch Models for Measurement. Newbury Park: Sage Publications.
Bond, T. G., and C. M. Fox. (2007). Applying the Rasch model: Fundamental
Measurement in the Human Sciences. Second Edition. Mahwah: Lawrence Erlbaum Associates.
Dayton, C. M. (1998). Latent Class Scaling Analysis. Thousand Oaks: Sage Publications.
DeVellis, R. F. (1991). Scale Development: Theory and Applications. Newbury Park: Sage
Publications.
Ferrando, P. J., and C. Anguiano-Carrasco. (2010). Acquiescence and Social Desirability as Item
Response Determinants: An IRT-based Study with the Marlowe-Crowne and the EPQ
Lie Scales. Personality and Individual Differences 48: 596-600.
Kruskal, J. B., and M. Wish. (1978). Multidimensional Scaling. Newbury Park: Sage
Publications.
Lodge, M. (1981). Magnitude Scaling: Quantitative Measurement of Opinions. Newbury Park:
Sage Publications.
McIver, J. P., and E. G. Carmines. (1981). Unidimensional Scaling. Newbury Park: Sage
Publications.
Spector, P. E. (1992). Summated Rating Scale Construction: An Introduction. Newbury Park:
Sage Publications.
ADDITIONAL MATERIAL
Kaplan, D. (2000). Structural Equation Modeling: Foundations and Extensions. Thousand Oaks:
Sage Publications.
APPENDIX 4A
GLOSSARY OF TERMS
Acquiescence Set – the tendency of subjects to agree with all items of a construct regardless of content
Bi-Polar Scale – responses to scale items can vary from negative to positive points, with a zero somewhere in between
Coefficient Alpha (Cronbach's alpha) – a measure of the internal consistency of a scale
Confirmatory Factor Analysis (CFA) – a technique that tests how well the data fit a hypothesized existing structure
Convergent Validity – measures of the same construct should relate strongly with one another
Criterion-Related Validity – involves the testing of hypotheses about how the scale will relate to other variables
Discriminant Validity – measures of different constructs should only relate moderately well with one another
Eigenvalue – represents the relative proportion of variance accounted for by each factor in a factor analysis
Exploratory Factor Analysis – a technique that determines the number of separate components that exist for a group of items
Factors – sets of groups that emerge out of a larger group of items and that represent theoretical constructs
Internal-Consistency Reliability – a measure of how well multiple items, designed to measure a theoretical construct, intercorrelate with one another
Item-Remainder – also considered a part-whole or item-whole coefficient; measures how well each individual item relates to the others in the analysis
Known-Groups Validity – measuring the scores of different groups of individuals based on the hypothesis that those of different groups will answer items of theoretical constructs differentially
Latent Variable – the underlying phenomenon or construct that a scale is intended to reflect
Multitrait-Multimethod Matrix (MTMM) – a technique developed by Campbell and Fiske (1959) that simultaneously explores convergent and discriminant validity
APPENDIX 4B
THE MARLOWE-CROWNE SOCIAL DESIRABILITY SCALE
Personal Reaction Inventory
Listed below are a number of statements concerning personal attitudes and traits. Read each item
and decide whether the statement is True or False as it pertains to you personally.
1. Before voting I thoroughly investigate the qualifications of all the candidates. (T)
2. I never hesitate to go out of my way to help someone in trouble. (T)
3. It is sometimes hard for me to go on with my work, if I am not encouraged. (F)
4. I have never intensely disliked anyone. (T)
5. On occasion I have had doubts about my ability to succeed in life. (F)
6. I sometimes feel resentful when I don't get my way. (F)
7. I am always careful about my manner of dress. (T)
8. My table manners at home are as good as when I eat out in a restaurant. (T)
9. If I could get into a movie without paying and be sure I was not seen, I would probably do it.
(F)
10. On a few occasions, I have given up doing something because I thought too little of my ability. (F)
11. I like to gossip at times. (F)
12. There have been times when I felt like rebelling against people in authority even though I
knew they were right. (F)
13. No matter who I'm talking to, I'm always a good listener. (T)
14. I can remember "playing sick" to get out of something. (F)
15. There have been occasions when I took advantage of someone. (F)
16. I'm always willing to admit it when I make a mistake. (T)
17. I always try to practice what I preach. (T)
18. I don't find it particularly difficult to get along with loud-mouthed, obnoxious people. (T)
19. I sometimes try to get even rather than forgive and forget. (F)
20. When I don't know something I don't at all mind admitting it. (T)
21. I am always courteous, even to people who are disagreeable. (T)
22. At times I have really insisted on having things my own way. (F)
23. There have been occasions when I felt like smashing things. (F)
24. I would never think of letting someone else be punished for my wrongdoings. (T)
25. I never resent being asked to return a favor. (T)
26. I have never been irked when people expressed ideas very different from my own. (T)
27. I never make a long trip without checking the safety of my car. (T)
28. There have been times when I was quite jealous of the good fortune of others. (F)
29. I have almost never felt the urge to tell someone off. (T)
30. I am sometimes irritated by people who ask favors of me. (F)
31. I have never felt that I was punished without cause. (T)
32. I sometimes think when people have a misfortune they only got what they deserved. (F)
33. I have never deliberately said something that hurt someone's feelings. (T)
APPENDIX 4C
Psychometric Theory
There are two major theoretical approaches to psychometric theory: 1) classical test theory and
2) item response theory.
Classical Test Theory
Classical Test Theory uses the following formula:
X = T + E
X = Observed Score
T = True Score
E = Error
Classical test theory (also referred to as true score theory) is primarily concerned with the reliability of the psychological test. Therefore, the most utilized statistic is the alpha coefficient
(Cronbach's alpha).
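To make the statistic concrete, here is a small sketch in Python (using NumPy, with made-up item responses) that computes coefficient alpha from the standard formula: alpha = (k / (k - 1)) x (1 - sum of the item variances / variance of the summed score).

import numpy as np

def cronbach_alpha(items):
    # Coefficient alpha; rows are respondents, columns are scale items.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses from six people to three 5-point items.
responses = [[1, 2, 1],
             [2, 2, 3],
             [3, 3, 2],
             [4, 4, 5],
             [4, 5, 4],
             [5, 5, 5]]
print(round(cronbach_alpha(responses), 2))   # about 0.95 for these made-up data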
We introduce this term as a replacement for quasi-experiments, popularized by Campbell and Stanley (1963). In
our view, modified is a more appropriate way to describe experiments in which full randomization is not possible
or feasible. In these situations, researchers must seek to adapt or modify the true experimental design.
Consider a study that seeks to examine the effect of intensive probation on criminal recidivism. A true experiment would involve developing sampling criteria, use of appropriate sampling techniques, a pre-test on outcomes of interest (here, perhaps variables that may be related
to recidivism such as criminogenic attitudes), randomization to an experimental group (receiving
intensive probation) and a control group (for example, probation as usual) and a follow-up posttest (where the same variables are collected as were on the pre-test and recidivism is measured).
Turner, Petersilia, and Deschenes (1992) describe just such an experiment.
Experimenters recognize that randomization is essential in isolating the causal effect of
the intervention (Weisburd, 2003; Farrington, Loeber and Welsh, 2010). In the example above,
without a proper assignment procedure, researchers would be less confident that any difference
in recidivism between the two groups was caused by the treatment (e.g., intensive probation) and
not an unmeasured variable. Randomization means that each subject in the identified sample has
an equal chance (or the probability is known) of being in the experimental or control group.
Mathematically, this produces groups that are roughly equivalent on all measures (given a large
enough sample size) (see Bachman and Schutt, 2001). An important point is that true experiments do not control or cancel possible confounders; rather they result in such confounders being
equally distributed between both groups so that they cannot differentially influence the outcome.
II. Threats to Internal Validity
Campbell and Stanley (1963; see also Cook and Campbell, 1979) identified several
threats to internal validity in research designs as a way to illustrate the power of true experiments. In certain research designs where the intention is to isolate and describe a causal effect of some form of treatment or intervention, there are several alternative explanations that
may account for differences between the experimental and control group or observed effects of
treatment. Researchers can use various design strategies to control these threats to internal validity. These threats are summarized in Table 5.1.
Table 5.1. Threats to Internal Validity
History
Factors that affect the outcome apart from the treatment. Example: a highly publicized case of child abduction occurs during an experiment testing the effect of a new law on child predator recidivism.

Maturation
Growth or natural changes within subjects over time that would have occurred with or without the treatment. Example: a research study is examining the effect of a tutorial program on language acquisition over two years. Even without the program, the children might be expected to improve their language skills. This must be taken into account to isolate the effect of the treatment.

Testing
The very act of taking part in a study and completing questionnaires changes subject behaviors. Example: in a famous study examining the effect of changes in environment on worker production, researchers in Chicago found that no matter what they did, worker production increased. They concluded that the workers realized they were taking part in a study and changed their behavior accordingly; this is now known as the Hawthorne effect (see Maxfield and Babbie, 1998).

Instrumentation
Differences in testing procedures influence observed results. Example: a study on anxiety assigns subjects to a condition in which a scary movie is viewed or a condition in which an episode of Lassie is viewed. The subjects are given an anxiety test immediately following the viewing. Lo and behold, there is a difference in these scores, with the Lassie group showing less anxiety. However, it is discovered later that the movie group took their test in a room next to a construction site, with a jackhammer at full blast (sounding like a machine gun). It is unclear whether the higher anxiety scores in this group are the result of the movie or the jackhammer.

Statistical Regression
When subjects are selected for an experiment on the basis of extreme (high or low) scores, it is expected that their post-test scores will revert back toward the mean. Example: subjects are tested for social anxiety in order to be selected into a study. Those who score in the 90th percentile are chosen to be given a treatment. After treatment, the subjects' scores are reduced by 2 standard deviations. However, because of the high scores on the pre-test, it may be expected that these subjects' scores at a later date would be less extreme even without treatment. This is known as regression to the mean.

Selection
Without careful attention to the procedures by which subjects are assigned to experimental or control groups, differences in outcomes may be due to pre-existing differences rather than treatment. Example: a study seeks to examine the effect of rehabilitation in prison on recidivism. Subjects are selected for rehabilitation via volunteerism; only those volunteering to take part are given treatment. The experimental group is compared to a different group of prisoners not given rehabilitation. Lo and behold, two years after release from prison, the rehabilitation group has a lower rate of recidivism. Yet we cannot be sure if this difference is due to treatment or to the fact that the experimental group may have been more willing to give up crime (as evidenced by their volunteering for a rehabilitation program).

Experimental Mortality
Subjects may drop out of or refuse to take part in the study, resulting in unequal control and experimental groups. Example: in a two-year study that seeks to examine the effect of the D.A.R.E. program on drug use, students/classrooms are assigned to receive the D.A.R.E. program or no program. After two years, the D.A.R.E. group participants have HIGHER rates of cannabis use, prompting officials to drop Crime Dog McGruff. Yet post-hoc analyses find that in the control group, 15% of the students dropped out of the study (and these 15% were highly likely to be drug users). Thus, the negative effect of D.A.R.E. may have been artificially caused by drug-user drop-out in the control group, resulting in an artificially lower rate of drug use in that group at follow-up.
Diffusion of Treatments
In a study with a control and experimental group, the control group may
become aware of the treatment being offered to their counterparts, or be
given the treatment (outside of the experimental protocols). This results in
contamination. Example: a study seeks to examine the effect of counseling
on domestic violence perpetrators. Subjects are randomized to a group
that receives mandatory counseling or a group that receives treatment as
usual. Part way through the study the control group subjects become
aware of the study and engage in compensatory behavior to demonstrate they are not inferior to the experimental group. This is known as
compensatory rivalry or the John Henry effect (Cook and Campbell,
1979, cited in Bachman and Schutt, 2001). Treatment sometimes crosses over in experiments, whereby staff either unwittingly give subjects in the control group the treatment (e.g., broken assignment) or deliberately give the treatment to the control group because of perceived unfairness. This results again in contamination.
Interactions between any threats can also occur. For example, selection
and history may both take place thus contaminating results (see Isaac and
Michael, 1995).
In true experiments, these threats to internal validity are controlled to the highest extent
possible. For example, history threats (say a highly publicized child abduction case) affect both
the experimental and control group equally. However, even in true experiments, process or implementation failures can weaken internal validity. For example, in the famous Sherman and
Berk study of the effect of mandatory arrest for domestic violence, officers were to choose a response at random. However, some officers purposely ignored the response they were assigned. If
staff responsible for implementing a randomized study ignore assignment procedures, equality
between the two groups (remember, the hallmark of true experiments) may no longer hold (see
Goldkamp, 2008).
III. The Counterfactual
In research, one often hears a term called the counterfactual. This is the idea that what
researchers are really interested in is what would have happened in the absence of x. In simple
terms, if we are studying the effect of intensive probation on recidivism, the counterfactual of
recidivism outcomes for an individual who received the treatment would be recidivism outcomes
for the same individual without treatment. That is, the counterfactual is the absence of treatment
during the same period, for the same individual.
Consider Y, the outcome, t=treatment and c=control. The counterfactual is represented
by:
δi = Yit - Yic     (Equation 5.1)

where δi is the treatment effect, the subscript i indicates individual subjects, and Yit and Yic denote individual i's outcomes under treatment (t) and control (c). Equation 5.1 indicates that each individual has a potential outcome under a) treatment and b) no treatment.
In an ideal world, we would be able to observe δi and have an unbiased estimate of the treatment
effect. However, as should be clear, it is not possible to observe the counterfactual as the same
person cannot be in the experimental and control group simultaneously (Heckman and Smith,
1995; Winship and Morgan, 1999). Because random assignment has been shown to result in generally equivalent groups, it is seen as the closest researchers can come to estimating the counterfactual. Thus, what true experiments produce is given by equation 5.2
ATE = Ȳt - Ȳc     (Equation 5.2)
This provides the average effect of the treatment or average treatment effect (ATE; note
the line above each of the estimators) (see Loughran and Mulvey, 2010). In this equation, the
average scores of those individuals in the control group are subtracted from those of individuals in
the treatment group. Because of randomization, we are provided an unbiased and consistent
estimate of the treatment effect (for a more complete discussion, see Loughran and Mulvey,
2010:167; Winship and Morgan, 1999).
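The logic of Equation 5.2 can be illustrated with a short simulation in Python (the baseline rate, treatment effect, and sample size are all hypothetical): subjects are randomized to treatment or control, and the difference in group means recovers, to within sampling error, the effect that was built into the data.

import random

random.seed(1)

def simulated_outcome(treated, base_prob=0.50, true_effect=-0.10):
    # Hypothetical recidivism indicator (1 = reoffends); treatment lowers the probability.
    prob = base_prob + (true_effect if treated else 0.0)
    return 1 if random.random() < prob else 0

# Randomly assign 10,000 subjects to treatment (e.g., intensive probation) or control.
assignments = [random.random() < 0.5 for _ in range(10_000)]
outcomes = [simulated_outcome(treated) for treated in assignments]

n_treated = sum(assignments)
treated_mean = sum(y for y, t in zip(outcomes, assignments) if t) / n_treated
control_mean = sum(y for y, t in zip(outcomes, assignments) if not t) / (len(assignments) - n_treated)

ate_hat = treated_mean - control_mean   # Equation 5.2: difference in average outcomes
print(round(ate_hat, 3))                # close to the true effect of -0.10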
IV. External Validity
One common critique of experiments is that they lack generalizability. Some argue that
experiments are often too focused (idiographic) and rigid to provide valid estimates of a treatment for more than the sample studied (see Pawson and Tilley, 1994). In order to increase internal validity researchers must have as much control over extraneous factors as possible in true
experiments. As a result, true experiments may not provide a reasonable or realistic estimate of
how the treatment may operate in reality (Bachman and Schutt, 2001; Weisburd, 2000).
Weisburd (2000) argues that experiments can be made more generalizable by designing
heterogeneity into samples and including experiments in the field. In addition, multi-center
trials, in which the same experiment is conducted across various settings at the same time, can
increase external validity (Weisburd and Taxman, 2000). In general, external validity is not a
more serious threat for true experiments than for other types of research designsresearchers
must be cognizant of how their sample is selected and to whom their results can be generalized.
V. The Black Box problem
True experiments suffer from what we call the black box problem. Because true experiments, as we have mentioned, seek to maximize internal validity, their concern is with
demonstrating that x is causally related to y. What is not often a focus of true experiments is why
x and y are causally related. Thus, in the words of Shadish, Cook and Campbell (2002), true experiments provide causal description rather than causal explanation (pp. 9-10). Shadish et
al. provide a nice example of the difference between the two (2002: 9-10):
For example, most children very quickly learn the descriptive causal relationship between flicking a light
switch and obtaining illumination in a room. However, few children (or even adults) can fully explain
why that light goes on. To do so, they would have to decompose the treatment (the act of flicking a light
switch) into its causally efficacious features (e.g., closing an insulated circuit) and its nonessential features (e.g., whether the switch is thrown by hand or a motion detector). They would have to do the same
for the effect (either incandescent or fluorescent light can be produced, but light will still be produced
whether the light fixture is recessed or not). For full explanation, they would then have to show how the
causally efficacious parts of the treatment influence the causally affected parts of the outcome through
identified mediating processes (e.g., the passage of electricity through the circuit, the excitation of photons). Clearly the cause of the light going on is a complex cluster of many factors.
Thus, experiments are generally silent on the causal mechanisms linking x to y (see also
Sampson, Laub and Wimer, 2006). Experiments can build in elements that enable them to examine underlying causal mechanisms, however. For example, some experiments are complex, with
more than two groups. If researchers specify hypothesized causal mechanisms prior to conducting their study, they can test these effects. For example, suppose that researchers are interested in
the effect of education on criminal recidivism. Suppose also that these researchers hypothesize
that education reduces crime because it leads to improved chances of offenders attaining meaningful employment. Employment here represents the causal mechanism or black box linking education to crime. An experiment can be designed wherein subjects are assigned to one of four
groups: 1) educational training alone (Xt1); 2) no training (Xc1); 3) educational training and job
placement assistance (Xt2); 4) job placement assistance alone (Xc2). If the researchers are correct
that education leads to less crime because of employment but we still think that job placement
matters, we would expect the difference in recidivism to be greatest between Xt2 and Xc1. Yet the
factorial design of this study allows a more robust test of the effect of education on recidivism. If
education does not lead to reduced recidivism through (mediated by) employment, but employment is still related to reduced recidivism, then we would expect Xt1= Xt2. Thus, factorial designs, while requiring a larger sample size, provide a more powerful test of experimental conditions (see Bachman and Schutt, 2001). These designs, sometimes called a Solomon Four
Group Design, appear to be relatively rare in criminal justice research, however.
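As a rough illustration of the four-group logic described above, the sketch below (not from the chapter) simulates random assignment to the four conditions and compares recidivism rates. The recidivism probabilities are hypothetical and chosen only so the contrasts discussed in the text are visible.

```python
# A minimal sketch (assumed data, not the authors' design): simulate the four
# conditions and compare mean recidivism across groups.
import numpy as np

rng = np.random.default_rng(0)
n = 400  # hypothetical sample size per condition

# Hypothetical recidivism probabilities: education works partly through employment.
p = {"Xt1": 0.45,   # educational training alone
     "Xc1": 0.55,   # no training
     "Xt2": 0.35,   # educational training + job placement assistance
     "Xc2": 0.45}   # job placement assistance alone

recidivism = {g: rng.binomial(1, p[g], n) for g in p}

for g, y in recidivism.items():
    print(f"{g}: mean recidivism = {y.mean():.3f}")

# The contrast of interest from the text: the largest gap should appear
# between Xt2 (education + placement) and Xc1 (no training).
print("Xt2 - Xc1 difference:", recidivism["Xt2"].mean() - recidivism["Xc1"].mean())
```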
VI. Ethics and True Experiments
Finally, some have worried about the ethicality of conducting experiments, especially
with regard to criminal justice interventions. First, the term "random" sometimes appears to imply "haphazard" or "accidental" (see Rossi, Lipsey and Freeman, 2004). Thus, to those not familiar with social science research methods (such as many criminal justice practitioners), the notion
of a randomized experiment may evoke images of unsystematically assigning treatment. It is,
therefore, essential for researchers to fully explain to staff taking part in studies what true experiments are and why they are appropriate. Second, some have argued that it is unethical to
withhold treatment from individuals (which is the essence of true experiments). When a treatment is known (scientifically, that is) to be efficacious and to produce unambiguous benefits to
individuals, we agree. A randomized study that assigns families in poverty to a) receive food
stamps versus b) no governmental assistance would, in our view, violate ethical conduct. Why?
Because we know that food stamps provide a tangible benefit to individuals. Thus, withholding
them from some individuals in order to study their effects would not be ethical by any standard.
This is essentially what happened in the infamous Tuskegee syphilis study, which began in 1932 and continued for decades after
World War II. In that study, African-American men with syphilis were left without effective treatment so that the course of the disease could be observed. This withholding of treatment resulted in deaths and is thus considered an exemplar of unethical research (Oakes, 2002).
Another ethical issue related to experiments involves deception. In some research
studies, the subjects' knowledge of the treatment under study might influence their outcomes.
Thus, researchers misdirect or misinform the subjects about the intention of the study in order to
arrive at an unbiased estimate of the effect of treatment. A criminological example might be a
study that seeks to examine the effect of violent media on aggression. Subjects who know that
they are being tested with respect to their level of aggression may intentionally act calmer and
more passive. This is an example of a testing effect, described in Table 3.1. Thus, to reduce this
threat to internal validity, researchers might tell subjects that they are testing subject ratings of
favorability of different types of media. When deception occurs in research, it is essential that
subjects are informed of the true intention of the study upon completion. This is called "debriefing" (see Bachman and Schutt, 2001).
We agree with researchers such as David P. Farrington and David Weisburd that true experiments are the most desirable research design available to criminological researchers. In fact,
Weisburd has convincingly argued that in terms of ethical considerations, it is unethical for
criminologists to make recommendations on the basis of non-experimental research, when true
experiments are possible (see Farrington, 2003; Weisburd, 2003). However, in the realm of
criminological research, it is often not practical or feasible to conduct randomized experiments.
It is not possible to test the effect of marriage, for example, by randomly assigning some inmates
to a wedding group versus treatment as usual. Thus, in these situations, modifications to the
true experiment must be made. It is to this subject that we now turn.
VII. Modified Experimental Designs
The distinguishing factor between true experimental designs and quasi-experimental designs (or what we will refer to as modified experiments from now on) is random assignment.
True experiments use a truly random control group and modified experiments use a nonrandom comparison group. The value of the modified design relies on how well the comparison group is matched with the treatment/intervention group. While there are several different
modifications that can be made to experimental designs, they tend to have two general characteristics: 1) they are often retrospective in nature (they occur after the program is in place), and 2) they exhibit
questionable internal validity. Other types of modified experiments rely on statistical controls to
isolate the effect of treatment. The problem with this approach is that variables that potentially
affect treatment outcomes must be specified in advance. Despite these limitations, modified experiments have the potential to provide important and substantive information about program
effects (Bingham and Felbinger, 2002). Some of the most common designs are reviewed in Table
5.2 below.
Table 5.2 Common Modified Experimental Designs

Pretest-Posttest Comparison Group Design
One of the most frequent designs in the social science literature. Here, the evaluator assigns participants to either the experimental group (Group A) or a comparison group (Group B). The groups are matched on a common characteristic that the evaluator wishes to measure (such as race, gender, age, etc.).

              Pretest   Program   Posttest
   Group A       O         X          O
   Group B       O                    O

Interrupted Time-Series Design
The longitudinal or interrupted time-series design allows researchers to use subjects as their own control. This design is stronger than the pretest-posttest design because maturation effects can be controlled.

   Before: O(t-2)  O(t-1)    Program: X    After: O(t+1)  O(t+2)

Interrupted Time-Series Comparison Group Design
This design allows the evaluator to identify treatment effects over time in treated and untreated individuals. Time-series designs are beneficial because they are capable of identifying trends in social processes and are able to filter out "noise" departures from the underlying trend.

   Treated group 1:    O(t1,t-2)  O(t1,t-1)   X1   O(t1,t+1)  O(t1,t+2)
   Treated group 2:    O(t2,t-2)  O(t2,t-1)   X2   O(t2,t+1)  O(t2,t+2)
   Comparison group:   O(c,t-2)   O(c,t-1)         O(c,t+1)   O(c,t+2)

Counterbalance Designs
Each experimental group (E) receives every treatment (X1 through X4), but in a different order, with an observation (O) taken after each treatment.

   E:   X1O   X2O   X3O   X4O
   E:   X2O   X3O   X4O   X1O
   E:   X3O   X4O   X1O   X2O
   E:   X4O   X1O   X2O   X3O
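To make the interrupted time-series logic in Table 5.2 concrete, the following sketch (simulated data, not from the chapter) regresses a monthly outcome on time, a post-intervention indicator, and their interaction; the intervention month and variable names are hypothetical.

```python
# A minimal sketch of a simple interrupted time-series analysis with assumed data:
# the "post" coefficient captures the level shift at the intervention, and the
# interaction captures the change in slope afterwards.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
months = np.arange(24)
post = (months >= 12).astype(int)            # hypothetical intervention at month 12
y = 50 - 0.2 * months - 4 * post + rng.normal(0, 1.5, months.size)

df = pd.DataFrame({"y": y, "t": months, "post": post})
model = smf.ols("y ~ t + post + t:post", data=df).fit()
print(model.summary().tables[1])             # level and slope change after the intervention
```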
REFERENCES
Bachman, R., and Schutt, R. (2001). The Practice of Research in Criminology and Criminal
Justice. Thousand Oaks, CA: Pine Forge Press.
Bingham, R., and C. L. Felbinger. (2002). Evaluation in Practice: A Methodological Approach
(2nd ed.). New York: Seven Bridges Press.
Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-Experimental Designs for
Research. Chicago: Rand McNally.
Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for
field settings. Boston: Houghton-Mifflin.
Farrington, D. (2003). A Short History of Randomized Experiments in Criminology.
Evaluation Review, 27:218-227.
Farrington, D. P., R. F. Loeber, and B. Welsh. (2010). Longitudinal-Experimental Designs. In
A. Piquero and D. Weisburd (Eds.), Handbook of quantitative criminology (pp. 101-121).
New York: Springer.
Fienberg, S. E. and D. V. Hinkley (1980). R. A. Fisher: An Appreciation. New York: Springer-Verlag.
Fisher, R. A. (1935). The Design of Experiments. New York: Hafner.
Goldkamp, J. S. (2008). Missing the Target and Missing the Point: Successful Random
Assignment but Misleading Results. Journal of Experimental Criminology, 4(2).
Hagan, F. E. (2003). Research Methods in Criminal Justice and Criminology (6th ed.). Boston: Pearson Education.
Heckman, J. J. and J. A. Smith (1995). Assessing the Case for Social Experiments. Journal of
Economic Perspectives. 9(2): 85-110.
Heinsman, D. T., and W. R. Shadish. (1996). Assignment Methods in Experimentation: When do
Nonrandomized Experiments Approximate Answers from Randomized Experiments.
Psychological Methods 1(2): 154-169.
Hirschi, T. and H. Selvin. (1967). Delinquency Research: An Appraisal of Analytic Methods. The
Free Press.
Isaac, S. and W. B. Michael. (1995). Handbook in Research and Evaluation (3rd ed.), pp. 35-45
(Ch. 3, Planning Research Studies) and pp. 237-245 (Ch. 9, Criteria and Guidelines for Planning, Preparing, Writing, and Evaluating the Research Proposal, Report, Thesis, or Article). San Diego, CA: EdITS.
Maxfield, M., and E. Babbie. (1998). Research Methods for Criminal Justice and Criminology.
(2nd ed.). Belmont, CA: Wadsworth.
Oakes, J. M. (2002). Risks and Wrongs in Social Science Research: An Evaluator's Guide to
the IRB. Evaluation Review, 26:443-479.
Pawson, R. and N. Tilley. (1994). What Works in Evaluation Research? British Journal of
Criminology. 34:291-306.
Petersilia, J. (1989). Implementing Randomized Experiments: Lessons from BJA's Intensive
Supervision Project. Evaluation Review. 13(5):435-458.
Rossi, P. H., M. W. Lipsey, and H. E. Freeman. (2004). Evaluation: A systematic Approach (7th
ed.). Thousand Oaks, CA: Sage Publications.
Sampson, R. J., J. H. Laub, and C. Wimer (2006). Does Marriage Reduce Crime? A
Counterfactual Approach to Within-Individual Causal Effects. Criminology, 44(3): 465-508.
Human behaviour occurs in context; it is influenced by the setting and the internalised norms the person associates with it.
Experimental research affects the findings through the artificial setting; likewise, surveys set out how the researcher interprets the phenomenon of interest, and the participant has to fit their experience into that.
Qualitative research allows the researcher to understand the framework within which
people interpret their feelings, thoughts, or actions.
Meanings and processes can be identified, explored, and understood (Wilson, 1977).
REFERENCES
Bachman, R., and Schutt, R. (2003). The Practice of Research in Criminology and Criminal
Justice (2nd Ed.). Thousand Oaks, CA: Pine Forge Press
Geertz, C. (1973). Thick Description: Toward an Interpretive Theory of Culture. In C. Geertz
(Ed.), The Interpretation of Cultures: Selected Essays. New York: Basic Books
Kahn, R., and Cannell, C. (1957). The Dynamics of Interviewing. New York: John Wiley
Marshall, C., and Rossman, G. (1989). Designing Qualitative Research. Newbury Park, CA:
Sage Publications
Marshall, C. (1985). Appropriate Criteria of Trustworthiness and Goodness for Qualitative
Research on Educational Organizations. Quality and Quantity, 19, 353-373
Wilson, S. (1977). The Use of Ethnographic Techniques in Educational Research. Review of
Educational Research, 47(1).
A merely successful evaluation, in contrast, falls short of providing the best information possible
given the constraints but provides better information than would have otherwise been available
(p. 5). This suggests that the proper measure of the success of an evaluation is whether it adds to
the current knowledge, rather than what might be "nice to know" (Berk and Rossi, 1999, p. 5).
Additionally, some argue that in order to be successful, an evaluation must include some
form of advocacy or be implemented by policymakers. While these are, without a doubt, important components of evaluation, it is a risky prospect for evaluation researchers and can even
compromise the research. The evaluation can be compromised if it appears that the researcher
has a position on the issue or if it appears that the research has been tailored to one side or the
other to ensure its use (Berk and Rossi, 1999). In the view of Berk and Rossi (1999), evaluation
can be successful even if it is ignored or even if it is misused by stakeholders. Once the findings of
the evaluation are presented to the interested parties in a clear manner, the evaluation has been
concluded (Berk and Rossi, 1999).
A Brief History of Program Evaluation
Although program evaluation is a relatively recent development, the activities that make
up program evaluation are not. In fact, the roots of evaluation research extend to the 17th century, though evaluation as it is currently known is a relatively modern development. The systematic evaluation of social programs first became common in the fields of education and public
health (Rossi, Lipsey, and Freeman, 2004). The field of applied social research grew rapidly as a
result of the boost it received following its contributions during World War II. After World War
II, many federal and privately funded social programs were launched, providing services such as
urban housing, education, occupational training, and health services. These new programs required evaluation and, as a result, by the end of the 1950s program evaluation commonplace
(Rossi, Lipsey, and Freeman, 2004).
The 1960s arrived with an increase in the number of books and articles focusing on evaluation research. By the end of the decade, evaluation research represented a growth industry. The
large amount of interest in program evaluation was sparked, in part, by President Lyndon Johnson's federal "war on poverty" and the corresponding programmatic remedies (Rossi, Lipsey, and
Freeman, 2004). By the early 1970s, program evaluation research had emerged as a specialty
field. Special sessions focusing on evaluation research became commonplace at professional
meetings and conferences. In addition, professional associations were also founded (Rossi, Lipsey, and Freeman, 2004).
Eventually, changes began to occur in the field of evaluation research. Initially, the interests of researchers shaped the field. However, this evolved to a point
where the interests of the consumers of the evaluation came to shape the research (Rossi, Lipsey, and
Freeman, 2004). While the results of these evaluations are not often newsworthy, they are of
great importance to those directly or indirectly affected by the program, including concerned citizens, program sponsors, and policymakers (Rossi, Lipsey, and Freeman, 2004). As a result of the
changes that have occurred in the arena of evaluation research, program evaluations have moved
beyond the world of academic social science into the arena of political and policy decisions
(Rossi, Lipsey, and Freeman, 2004).
levels of coverage (Rossi, Lipsey, and Freeman, 2004). Bias, as applied to program coverage, is defined as the extent to which subgroups of a target population are reached unequally by a program. This can best be uncovered using comparisons of program users, eligible nonparticipants,
and dropouts (Rossi, Lipsey, and Freeman, 2004).
The task of monitoring a program's organizational functions has the purpose of determining how successful the program is at organizing its efforts and utilizing resources to achieve the
stated goals. Attention is given to identifying weaknesses and problems in the implementation of
the program that would impede the program's services from reaching the intended population
(Rossi, Lipsey, and Freeman, 2004). Some potential sources of implementation failure include
incomplete intervention, delivery of the wrong intervention, and unstandardized or uncontrolled
interventions (Rossi, Lipsey, and Freeman, 2004).
ASSESSING IMPACT OF PROGRAMS
Randomized Field Designs
According to Rossi et al., the purpose of impact assessments is to determine the effects
that programs have on their intended outcome and whether there are unintended effects (Rossi,
Lipsey, and Freeman, 2004, p. 234). It is possible to conduct impact assessments at various
stages of a program; however, since rigorous impact assessment requires the use of significant
resources, one must consider whether their use is justified by the circumstances. The methodological concepts that underlie all research designs used in impact assessment come from the logic of
randomized experiments. An essential feature of this is the use of random assignment to divide
subjects into intervention and control groups. In quasi-experiments, subjects are assigned using
something other than true random assignment. In these experiments, the evaluator must decide
what constitutes a suitable research design, keeping in mind that compromises are always inherent in the construction of a design to a certain extent (Rossi, Lipsey, and Freeman, 2004).
Randomized experiments represent an ideal choice for impact assessment because they
provide the most credible conclusions about program effects when the experiments are conducted well. The primary advantage of implementing a randomized experiment is the fact that
the effect of the intervention is isolated, ensuring that the intervention and control groups are statistically equivalent with the exception of the intervention received (Rossi, Lipsey, and Freeman,
2004). There are some procedures that can produce circumstances that are acceptable approximations to randomization, for example, assigning every other name on a list or assigning clients
to a program based on the program's ability to take additional people at a given time. However,
these alternatives are only suitable if they can generate intervention and control groups that do
not differ on any characteristics that are relevant to the expected outcome (Rossi, Lipsey, and
Freeman, 2004). The level of precision in the measurement of the outcome of an intervention
can be increased through the use of several measurements, including measures taken before an
intervention, during the intervention, as well as after the intervention. The use of multiple measures enables evaluators to more precisely determine how the intervention worked over time (Rossi,
Lipsey, and Freeman, 2004).
Although they are the most rigorous, randomized experiments may not be feasible or appropriate for all evaluations. The results may be ambiguous if the experiment is conducted in the
early stages of a program, when interventions change in ways that experiments cannot easily
capture. Additionally, stakeholders may be hesitant to allow randomized experiments if they feel
that they are engaging in unfair or unethical conduct by withholding the intervention from the control group (Rossi, Lipsey, and Freeman, 2004). It is important to keep in mind that, for all their
positives, experiments are resource intensive: they require technical expertise, research resources,
and time, as well as tolerance from the programs being studied, since their normal procedures are being
disrupted (Rossi, Lipsey, and Freeman, 2004). Experiments also have the potential to create artificial situations; for instance, the delivery of the program during the experiment may differ from
how the program is actually delivered (Rossi, Lipsey, and Freeman, 2004).
Alternative Designs
While randomized experiments are the strongest methodology for measuring the strength
of the impact of a program, there are several quasi-experimental methodologies that are also
potentially valid. These methodologies can be used when randomized experiments are not feasible or are not appropriate. A major concern that evaluators have in any impact assessment is to
reduce the bias in the estimate of program effects. As discussed earlier, bias is defined as the extent to which subgroups of a target population are reached unequally by a program (Rossi, Lipsey, and Freeman, 2004). There are several potential sources of bias in quasi-experimental designs, including selection bias, secular trends, interfering events, and maturation. Rossi et al.
(2004) define selection bias as the systematic under- or over-estimation of program effects resulting from uncontrolled differences between the intervention and control groups that would result
in differences between the groups even if the intervention were not present (Rossi, Lipsey, and Freeman,
2004).
The intervention and control groups are created using methods other than random assignment in quasi-experimental designs. As a result, there is not an assumption of equivalence
between these groups. Differences may exist between the groups that would result in differences
in outcome even if the intervention were not applied. Thus, appropriate procedures must be applied to adjust for these differences in estimations of program impacts (Rossi, Lipsey, and Freeman, 2004). In one variety of quasi-experimental methodology, matched controls are implemented. The control group is constructed by matching program nonparticipants with program
participants. This procedure can be done either on the individual-level or on the aggregate-level.
Additionally, the variables that are used in the matching procedure must include all those strongly related to outcome on which the groups would otherwise differ in order to avoid bias (Rossi,
Lipsey, and Freeman, 2004).
The intervention and control groups can also be equated through the use of statistical controls. As with other methodologies, any differences that the groups share on variables relevant to
the outcome must be identified and included in the statistical tests (Rossi, Lipsey, and Freeman,
2004). Ideally, and when it is possible, participants should be assigned to the intervention and
control groups based on quantitative measures. For example, assignment based on measures of need or merit yields estimates that are less
susceptible to bias than those from other quasi-experimental designs. It is appropriate to use
quasi-experimental designs when randomized experiments are not feasible but considerable efforts must be taken to minimize the potential for bias. It is also important to acknowledge the
limitations of quasi-experimental methodologies (Rossi, Lipsey, and Freeman, 2004).
COST-EFFECTIVENESS
Cost-benefit analysis is a useful quantitative tool for program evaluators. These analyses are especially useful when used in evaluations of existing programs to assess their success
or failure, to determine whether the program should be modified or continued, as well as to assess the likely consequences of changes to the program (Kee, 1994). There are three main steps to
cost-benefit analysis. First, the evaluator must determine the benefits of a proposed or existing
program and place a dollar value on these benefits. Second, the total costs of the program must
be calculated. Finally, the total benefits and total costs must be compared (Kee, 1994).
While these steps seem straightforward they can be quite challenging. It can sometimes
be difficult to determine the appropriate unit of analysis. Even if this is possible, placing a dollar
value on this unit can be quite challenging (Kee, 1994). An additional benefit of conducting cost-benefit analyses is that the procedure can illuminate important issues and may even lead to an
implicit valuation of some intangible ideas that are obscured by rhetoric (Kee, 1994, p. 457).
There are several types of costs and benefits that can be identified during this procedure.
Direct benefits and direct costs are those that are closely related to the main objective of the
program. In contrast, indirect benefits and indirect costs are spillover or investment effects of
the project or program. Additionally, costs and benefits can also be tangible or intangible. Tangible benefits and tangible costs are those that can be easily converted into dollars or an equivalent of dollars while intangible benefits and intangible costs are those you cannot or choose not
to assign an explicit price to (Kee, 1994).
After the evaluator has determined the range of costs and benefits associated with the
program in question and has assigned values to the costs and benefits, the next step is to present
the information to the decision maker. Kee (1994) argues that there are three ways in which this
can be done. The first option is a retrospective analysis, which involves looking at historical data
on benefits and costs and converting them into net present values for the program. The second
option, a snapshot analysis, simply looks at the costs and benefits for the current year. The third
and final option presented is a prospective analysis, which consists of an analysis that projects
future benefits and costs of the program based on the retrospective analysis (Kee, 1994).
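As a rough illustration of the comparison step with entirely made-up figures, the sketch below discounts hypothetical yearly benefits and costs to present value and reports the net present value and benefit-cost ratio; the dollar amounts and discount rate are assumptions, not from Kee (1994).

```python
# A minimal sketch with hypothetical figures: discount projected benefits and costs
# to present value and compare them, as in a prospective cost-benefit analysis.
def present_value(flows, rate):
    """Discount a list of yearly dollar flows (year 1, 2, ...) at the given rate."""
    return sum(f / (1 + rate) ** t for t, f in enumerate(flows, start=1))

benefits = [120_000, 150_000, 150_000]   # hypothetical yearly program benefits
costs    = [200_000,  60_000,  60_000]   # hypothetical yearly program costs
rate = 0.05                               # assumed discount rate

pv_benefits = present_value(benefits, rate)
pv_costs = present_value(costs, rate)
print(f"Net present value: {pv_benefits - pv_costs:,.0f}")
print(f"Benefit-cost ratio: {pv_benefits / pv_costs:.2f}")
```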
FUTURE OF EVALUATION
Rossi et al. (2004) suggest that there are a variety of reasons to believe that the field of
evaluation research will continue to grow in the future. First, stakeholders such as planners, staff,
and participants are increasingly skeptical about using common sense as a sufficient basis for the
design of social programs that will actually have the ability to achieve their intended goals (Rossi, Lipsey, and Freeman, 2004). This skepticism has led policymakers to seek out methods of
learning from past mistakes and to more quickly identify which measures work. When programs that work are identified, they can then be enhanced and used to their full potential (Rossi,
Lipsey, and Freeman, 2004).
A second reason to expect continued growth in the area of evaluation research is the ever-increasing sophistication of knowledge and technical procedures in the social sciences. These
new methodologies become a more powerful means of testing social programs when paired with
more traditional methods (Rossi, Lipsey, and Freeman, 2004). Finally, there have also been
changes in the political and social climate that are favorable to the increased use of evaluation
research. There is a desire to fix the problems that ail society, though the variety and number of
concerns that demand the attention of social science researchers can be overwhelming (Rossi,
Lipsey, and Freeman, 2004).
REFERENCES
Berk, R. A., and P. H. Rossi. (1999). Thinking about program evaluation (2nd ed.). Thousand
Oaks, CA: Sage Publications.
Kee, J. E. (1994). Benefit-cost analysis in program evaluation in J. S. Wholey, H. P. Hatry,
and K. E. Newcomer (eds.) Handbook of Practical Program Evaluation. San Francisco:
Jossey-Bass Publishers.
Rossi, P. H., M. W. Lipsey, and H. E. Freeman. (2004). Evaluation: A systematic Approach (7th
ed.). Thousand Oaks, CA: Sage Publications.
Scheirer, M. (1994). Designing and using process evaluation in J. S. Wholey, H. P. Hatry,
and K. E. Newcomer (eds.) Handbook of Practical Program Evaluation. San Francisco:
Jossey-Bass Publishers.
Shadish, W. R., T. D. Cook, and L. C. Leviton. (1991). Foundations of program evaluation:
Theories of practice. Newbury Park, CA: Sage Publications.
Note: The final two chapters are companion pieces that discuss relatively recent or rarely used
statistical analytic techniques. The first chapter covers time-series analysis, hierarchical linear
models, and Poisson regression. The second chapter discusses meta-analysis, propensity score
matching, survival analysis and spatial regression techniques. MR/CP
CHAPTER 8. NEWER STATISTICAL METHODS (PART I)
By Diana Summers
Time Series Analysis
Time series analysis is a type of regression model where observations are ordered in time
and therefore cannot be treated as statistically independent. These observations can be a person,
organization, nation, aggregated arrests, etc., and are usually reported on a consistent basis (e.g.,
yearly, monthly, quarterly, daily). Time series analyses are primarily used to aid in forecasting,
and originated in the field of economics. However, the field of criminology has benefited from
time series analysis in studying the nature of trends (in number of offenses, number of convictions, etc.). This methodology was developed to decompose a series into trend, seasonal, cyclical
and irregular components. These components of the series are each a type of difference equation, which expresses the value of a variable as a function of its own lagged values, time and
other variables. Uncovering these paths in a series improves forecasting accuracy since each of
the predictable components can be extrapolated into the future. It is possible to estimate the
properties of a single series or a vector containing many interdependent series; however, this discussion will continue with univariate time series analyses. In addition, discrete time series analyses will be discussed here, as most researchers analyze discrete time series and not continuous
time series.
It is not generally reasonable to suppose that the errors in a time series regression are independent, since time periods close together are more likely to be similar than points in time that
are relatively isolated. This similarity can extend to the errors, which represent the omitted causes of the response variable. It is therefore important to test for, and if necessary correct for, autocorrelation by analyzing the residuals (e.g., with the Durbin-Watson test statistic).
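As a rough illustration (simulated data, not from the chapter), the sketch below fits an ordinary time-series regression and checks the residuals with the Durbin-Watson statistic using statsmodels.

```python
# A minimal sketch with assumed data: fit an OLS time-series regression and check
# the residuals for first-order autocorrelation with the Durbin-Watson statistic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
t = np.arange(120)                          # e.g., monthly observations
x = rng.normal(size=t.size)
y = 2.0 + 0.5 * x + np.cumsum(rng.normal(scale=0.3, size=t.size))  # autocorrelated errors

model = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit()
dw = durbin_watson(model.resid)             # values near 2 suggest little autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")
```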
The forecasting model most often utilized in time series analyses is the ARIMA (autoregressive integrated moving average) model. Properties of stationarity are considered in this model. When an ARIMA series is stationary, the model becomes an ARMA (autoregressive moving average) model.
Stationary: the unit-root null hypothesis (H0) is rejected; the trend is mean reverting, allowing researchers to proceed with further statistical testing.
Non-stationary: researchers fail to reject H0; the trend is stochastic (random), requiring researchers to further manipulate the data (e.g., by first-differencing).
The first component of the ARIMA model (AR) considers that time series processes can
be influenced by past events or observations. It is often assumed that elements of an observed
time series are outcomes or realizations of a stochastic process. However, in econometrics this is
a more general assumption than in other fields like criminology (GNP is arguably more stably
collected than crime-related data, and people can be more easily influenced on topics such as incarceration rates and drug abuse). A discrete variable y is said to be stochastic if for any real
number r there exists a probability p(y ≤ r) that y takes on a value less than or equal to r. It is typically implied that there is at least one value of r for which 0 < p(y ≤ r) < 1. If there is some r
for which p(y = r) = 1, y is deterministic rather than stochastic. In discussing stochastic time-series models, white-noise processes should also be mentioned. A white-noise process occurs if
each value in the sequence has a mean of 0, a constant variance, and is serially uncorrelated.
The second component of an ARIMA model is the integrated process. This simply means
that the mean, variance, and covariance are not constant over time. When this is not the case, the
series is stationary and an ARMA model is employed instead.
The third component of an ARIMA model is the moving average (MA). This implies that
time series processes are driven by various shocks to the time series data. These shocks can be
defined as any major event or occurrence that can potentially significantly affect the time series
data. Mathematically, the moving average is described below:
Consider the following process built from white noise εt:
Yt = μ + εt + θεt-1,
where μ and θ could be any constants. This is an example of a first-order moving average (MA) process, where "moving average" comes from the fact that Yt is constructed from a
weighted sum, akin to an average, of the two most recent values of ε (Hamilton, 1994: 48).
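A minimal sketch of this equation, with arbitrary constants chosen only for illustration:

```python
# Simulate the first-order moving average process Y_t = mu + e_t + theta * e_{t-1}
# described above, with arbitrary constants.
import numpy as np

rng = np.random.default_rng(0)
mu, theta, n = 10.0, 0.6, 200      # arbitrary constants and series length
e = rng.normal(0, 1, n + 1)        # white noise: mean 0, constant variance, uncorrelated
y = mu + e[1:] + theta * e[:-1]    # each Y_t mixes the two most recent shocks

print(y[:5])
```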
A more formal, more statistically rigorous method used for forecasting and defining the
nature of time series data is known as unit root testing. When testing for the existence of unit
roots, Phillips-Perron or Augmented Dickey-Fuller tests can be employed. These tests help determine whether any shock to the time series will produce a temporary or permanent effect.
When a unit root test yields a stationary time series, any shock will have a temporary effect on
the variable. The effects of the shocks will be brief and will dissipate over time, causing the trajectory of the data to revert to its original mean. From these results, it is possible to then
calculate the approximate length of time the effects would last by employing the ARIMA model
and examining the size of the slope coefficient. If any shock to the variable is found to permanently affect the trajectory (non-stationary), the variable will never return to some form of long-run mean (Enders, 1995).
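As a rough illustration (the series is simulated, not real crime data), the sketch below runs an Augmented Dickey-Fuller unit root test and then fits a simple ARIMA model with statsmodels.

```python
# A minimal sketch: test for a unit root with the Augmented Dickey-Fuller test,
# then fit a simple ARIMA model to a simulated series.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200)) + 100   # random-walk-like (non-stationary) series

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
# A large p-value means we fail to reject the unit-root null: first-difference the data.

model = ARIMA(series, order=(1, 1, 1)).fit()      # ARIMA(p=1, d=1, q=1)
print(model.summary().tables[1])
```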
The value of unit root testing in criminology and criminal justice is evident, especially in
areas of policy evaluation. For instance, after conducting unit root tests on related time series
data, if the series is found to be stationary and the effects of the shock have dissipated, new policies would have to be enacted to reapply the effects. This result would also indicate some level
of predictability in the variable.
Other time series tests that involve multivariate analyses and may prove useful to criminal justice researchers are:
Vector Autoregression (VAR): particularly helpful in econometrics for estimation and forecasting. It has been described as a natural extension of the univariate autoregressive model, and is user-friendly. VAR forecasts are superior in some ways to univariate time series models, as they allow for more flexibility because they can be made conditional on the potential future paths of specified variables in the model (a brief sketch appears below).
Kalman Filtering: a discrete data filter composed of a set of equations that
provides an efficient recursive means to estimate the state of a process in a way
that minimizes the mean of the squared error. In this state-space system, one of
the ultimate objectives is to estimate the values of any unknown parameters in
the system on the basis of the given observations. It supports estimations of past,
present, and even future states, and can do so even when the precise nature of the
modeled system is unknown (Welch and Bishop, 2006). This would be helpful in
identifying missing observations in criminal justice time series data, and would
aid in forecasting efforts.
Statistical software packages for time series analysis include E-views, STATA, SAS, and
Shazam.
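The brief VAR sketch mentioned above is shown here; the two series (arrests and unemployment) are hypothetical, and the lag order is chosen only for illustration.

```python
# A minimal sketch with simulated data: fit a two-variable vector autoregression
# and produce a short forecast with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 120
arrests = 500 + np.cumsum(rng.normal(0, 5, n))        # hypothetical monthly arrests
unemployment = 6 + np.cumsum(rng.normal(0, 0.1, n))   # hypothetical unemployment rate
data = pd.DataFrame({"arrests": arrests, "unemployment": unemployment})

results = VAR(data).fit(maxlags=2)                     # small lag order for illustration
forecast = results.forecast(data.values[-results.k_ar:], steps=6)
print(forecast)                                        # six-step-ahead forecasts for both series
```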
Hierarchical Linear Modeling (HLM)
Hierarchical Linear Modeling (HLM) refers to a type of regression analysis that involves
modeling multilevel data that are inherently hierarchical. The focus of HLM is to appropriately
model relationships between variables reflecting different levels of analysis. For instance, individuals are nested or grouped within larger units, such as a neighborhood or work group. The
hierarchical nature of this type of data leads to problems with employing traditional regression
models, because the individual units of analysis that are grouped within larger units of analysis
cannot be considered independent. Individual units of analysis tend to be more similar to each
other than separate units randomly sampled from an entire population. For example, individuals
sampled from one particular work group are more similar to each other than to individuals randomly sampled from the entire company or group of companies. This is because people are not
randomly assigned to a company or work group; rather, they are selected based on skill set and
other qualifying factors.
HLM provides a method to overcome this problem of non-independent observations;
using OLS regression instead would produce standard errors that are too small and therefore a
higher probability of rejecting the null hypothesis (i.e., an inflated Type I error rate). In the traditional OLS approach, all the regression parameters are fixed, so that if a two-level approach
(such as the example described above) were utilized, the variance components would not be
separable from the individual level residual. HLM software uses a maximum likelihood estimation of the variance components, generalized least squares estimates of the level-two regression
parameters, and can yield empirical Bayes estimates of the level-one regression parameters
(Hofmann and Gavin, 1998: 626).
Below is the Level 1 regression equation:
Yij = B0j + B1j * X1ij + B2j * X2ij + rij
Where i refers to the person number and j refers to the group number.
Researchers might use HLM when determining the success or failure of ethics workshops
among certain work groups. However, since data is collected at an individual level, issues of accounting for this cross-level data arise. HLM allows researchers to separate individual and group
effects on the outcome, instead of either aggregating individuals up one level or reducing higher-level variables down to individual levels (see Bryk and Raudenbush, 1992, for further discussion).
Statistical software packages for HLM include HLM, SPSS and MLWin.
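As a rough illustration of the two-level setup described above, the sketch below fits a random-intercept model with statsmodels' MixedLM as a stand-in for dedicated HLM software; the work-group example follows the text, but the data and coefficients are simulated.

```python
# A minimal sketch with simulated data: a two-level random-intercept model
# (individuals nested in work groups) fit with statsmodels' MixedLM.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 20, 30
group = np.repeat(np.arange(n_groups), n_per)          # work-group identifier (level 2)
group_effect = rng.normal(0, 2, n_groups)[group]       # unobserved group-level variation
x1 = rng.normal(size=group.size)                       # individual-level predictor
y = 5 + 0.8 * x1 + group_effect + rng.normal(size=group.size)

df = pd.DataFrame({"y": y, "x1": x1, "group": group})
model = smf.mixedlm("y ~ x1", df, groups=df["group"]).fit()
print(model.summary())                                 # fixed effect for x1 plus group variance
```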
Poisson Regression
Poisson regression (or log linear model) is a member of a family of analyses known as
the generalized linear model where OLS regression is generalized for use with different types
of error structures and dependent variables (Coxe, West and Aiken, 2009). It is based on the
Poisson distribution, and is designed for use with count data where the dependent variable can
only take on non-negative integer values. With the basic Poisson specification, it is assumed that
the variance of the variable is equal to the mean. It is also a nonlinear, univariate distribution.
The count data reflect the number of occurrences of a behavior in a fixed period of time (e.g., the
number of drug-related arrests for an individual over the past 12 months), and Poisson regression
analysis allows for the investigation of individual factors affecting the particular count variable.
Coxe, West and Aiken (2009) warn against trying to use a count variable as an outcome
or criterion variable in OLS regression, as it can cause major problems. When the mean of the
outcome variable is relatively high, OLS regression can be applied with minimal difficulty.
However, when the mean of the outcome is low, OLS regression produces undesirable results,
including biased standard errors (p. 121). The Poisson distribution increasingly resembles the
normal distribution as the expected mean value becomes larger. Generally, a Poisson distribution
with an expected value greater than 10 will appear similar to a normal distribution in shape and
symmetry. A count variable with a very low mean count will be skewed to the right and highly
asymmetric. See Figure 8.1 below for a visual representation (as provided by Coxe, West and
Aiken, 2009):
Figure 8.1. Distributions of Arrest Counts
Even though equations for Poisson distributions may appear very similar to OLS regression equations, the predicted score is not itself a count but rather a natural logarithm of the count.
Thus it is said that Poisson regression is linear in the logarithm when given the correct combination of independent variables (Coxe, West and Aiken, 2009: 124).
Below is the equation for the probability density of a variable y with a Poisson distribution, where λ is the rate parameter:
P(y | λ) = (e^-λ · λ^y) / y!
If E(y) = λ, the Poisson process is modeled as:
ln λ = bX
Note: with the basic Poisson specification it is assumed that the variance of the variable is
equal to the mean:
Var(y) = E(y) = λ
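As a rough illustration (simulated count data, hypothetical predictor), the sketch below fits a Poisson regression with statsmodels' GLM, in the spirit of the arrest-count example discussed above.

```python
# A minimal sketch with simulated data: a Poisson regression modeling arrest counts
# as a function of one predictor; coefficients are on the log-count scale.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                      # e.g., a standardized risk score (hypothetical)
lam = np.exp(0.2 + 0.5 * x)                 # ln(lambda) is linear in x
arrests = rng.poisson(lam)                  # non-negative integer counts

X = sm.add_constant(x)
model = sm.GLM(arrests, X, family=sm.families.Poisson()).fit()
print(model.summary().tables[1])
```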
REFERENCES
Hofmann, D. and M. Gavin. (1998). Centering Decisions in Hierarchical Linear Models: Implications for Research in Organizations. Journal of Management, 24(5), 623-641.
Osborne, J. (2000). Advantages of Hierarchical Linear Modeling. Practical Assessment,
Research and Evaluation. 7(1).
Welch, G. and G. Bishop. (2006). An Introduction to the Kalman Filter. TR 95-041. University
of North Carolina at Chapel Hill.
Meta-analysis
Propensity Score Analysis with a Mahalanobis Distance Matching extension
Cox Proportional Hazards Regression model (Survival Analysis)
Spatial Regression Models
Introduction:
As the second person investigating newer statistical models, I have chosen an area typically outside of standard fare for criminology. What follows is a selection of techniques used in
research coming from the medical arena. I decided to look at this area because it has a reputation for a
more sophisticated use of statistics, and for models that many in our field, and the readings this semester,
regard as a gold standard. It, of course, cannot be said that this selection is in any way random or
representative. Further, this section does not consider research reported using a true experimental model. The reasoning for this omission is that such designs seemed superfluous to the world of
criminological study, and the resulting methods seemed overly simplistic compared to the realities of implementing even a treatment design in criminological research.
This said, there are methods that are more reflective of the issues many criminologists
face in the reality of their own research. The selected research methods reported here have been
employed in the study of epidemiology, a sub-section of the medical realm that has many parallels to the study of crime. It must be remembered by the reader that the studies here are a very
small selection of the papers available from this one journal in the past year. The American Journal of Epidemiology is a bi-weekly publication with approximately fifteen papers per issue on
new research alone, and one to two pooled, meta-analysis papers offered in each issue. That is
approximately 390 articles to be considered, an amount far in excess of the restricted needs of
this section.
Lessons that Criminal Justice Could Learn From Epidemiology:
Before looking at the methods employed in the chosen articles, it is worth noting that the field of criminal justice could take a leaf from the medical field's book. The first thing one notices is the sheer volume and pace of publication. The high number of articles published
could be a double-edged sword in criminology, possibly encouraging lower-quality work.
However, one assumes the same screening process occurs, and the articles are forced to be more
concise as a result. The flow-on effect is illustrated by the meta-analysis summary provided
later in this section. With such a large number of publications (in the heart disease example, over 1,000 studies were found and over
700 could be included), this mechanism builds the body of
evidence for one question or another at a more rapid rate than we experience in the criminal justice
field. Further, the journal has a dedicated section for pooled and meta-analyses to encourage the
analysis and discovery of the direction of the accumulating body of evidence. How much more
certain would the criminologist be of the causes of crime if the criminal justice field followed the
whole medical model and did not simply place faith in its methods?
Further, medical researchers also seem to adhere to the strict rules about models and which
statistics to employ more so than much criminological research.4 Rather than checking whether
violations skew the data to an extent that forces the researcher to shift to more complicated or
less sensitive models, the medical researcher will rely on multiple methods and other techniques
to make the results comparable, or simply report the whole picture created.
The Following Sections:
The next section summarizes some of the methods employed in the medical research. Not
all are new, but nor, as far as this reader is aware, are they all common in criminal justice research. The summary notes also illustrate how the medical researchers employ these methods and what complementary methods are used alongside them, giving a fuller picture of the actual implementation of the broad methods discussed.
Meta Analysis:
While not a new analysis technique, meta-analysis seems to be something of a rarity in
criminological research, especially when compared to medical research. The technique could
be useful to the criminological field, which could start conducting these analyses on a more regular basis
and give them more prestige. Meta-analysis seems to be a prevalent way in which a field
we often look to emulate checks to see in what direction its body of findings is pointing.
What is Meta Analysis?
Meta-analysis is essentially a synthesis of the literature/previous research. Ideally, randomized trials are used, but many other methods have been developed to use other types of studies, as
is evidenced in the first summary in Part III of the Newer Statistical Methods section. Generally,
when non-randomized samples are used in the employed data, the effect sizes reported
must, as a minimum requirement, control for any theoretically important confounders.
The broad and general steps include a search and a recording of the search strategy (captured in the abstract of the summary sheet). Once this is conducted, the set of studies eligible
for inclusion is reduced according to the comparability of the studies and whether they offer the
statistics required for a meta-analysis.
Once the researcher has entered the required data into the chosen program (usually
SAS or Stata in the medical research sampled), the researcher looks at effect sizes,
expressed either as rate differences, relative risks, or rate ratios (relative risk seems to be common in epidemiology). The type of model selected depends on the intention of the study. A fixed-effects model states that the conclusions are correct for the studies in the analysis, while random-effects models
assume the studies are a random sample of a universe of studies. After obtaining the effect size, the
confidence intervals and the Q-statistic for homogeneity, and its proportional effect on the regression slope, are determined. Finally, the meta-analysis (like medical research using many other models)
includes sensitivity tests to check the validity of the model.
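As a rough illustration of these steps with entirely hypothetical study results, the sketch below computes an inverse-variance (fixed-effect) pooled estimate of a log relative risk and Cochran's Q for homogeneity; these are standard calculations, not taken from any particular study summarized here.

```python
# A minimal sketch with made-up study results: fixed-effect pooling of log relative
# risks by inverse-variance weighting, plus Cochran's Q test for homogeneity.
import numpy as np
from scipy import stats

log_rr = np.array([0.25, 0.10, 0.40, 0.18, 0.30])   # hypothetical study effect sizes
se = np.array([0.12, 0.09, 0.20, 0.15, 0.11])       # their standard errors

w = 1 / se**2                                        # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
q = np.sum(w * (log_rr - pooled)**2)                 # Cochran's Q (homogeneity)
p_q = stats.chi2.sf(q, df=len(log_rr) - 1)

print(f"Pooled log RR = {pooled:.3f} "
      f"(95% CI {pooled - 1.96*pooled_se:.3f}, {pooled + 1.96*pooled_se:.3f})")
print(f"Q = {q:.2f}, p = {p_q:.3f}")
```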
Limitations:
The limitations of meta-analysis are nicely summarized by the quote in the summary limitation section below from the example study. Also, it must be remembered that meta-analysis
4. It is accepted that the author has only their own anecdotal familiarity with criminological research.
cannot control for any confounds not controlled for in the studies themselves. A further limitation of meta-analysis, as discussed below, derives from publication bias, as well as any biases
included but unknown in the studies employed. A technique used in some epidemiology meta-analyses to check for bias is to create a plot with the effect size on the X axis and the sample size
on the Y axis. If the plot resembles an upside-down funnel, no bias is indicated.
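A rough sketch of such a funnel-style plot, using simulated studies rather than any of the results summarized here:

```python
# A minimal sketch with simulated studies: plot effect size against sample size to
# eyeball publication bias, as described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_studies = 80
sample_sizes = rng.integers(50, 2000, n_studies)
true_effect = 0.2
effects = true_effect + rng.normal(0, 1, n_studies) / np.sqrt(sample_sizes)

plt.scatter(effects, sample_sizes, s=15)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("Effect size")
plt.ylabel("Sample size")
plt.title("Funnel plot: symmetry around the pooled effect suggests little bias")
plt.show()
```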
Figure 9.3: Number of Publications that used Propensity Score Analysis in PubMed.
What is Propensity Score Analysis?
In observational studies, or even evaluation studies, the groups may not be comparable for
many reasons covered in research texts (such as feasibility, ethics, or non-compliance by
administrators). Ultimately, we as researchers are interested in knowing whether the outcomes are due
to the treatment, and non-randomized experiments/studies make such assertions questionable.
As noted, PSA and its extensions, of which MDM is one, allow us to reduce the questionability
of the assertions made using observational data.
The first step is to model the non-random variables that set a person's propensity to be
selected for the treatment. Once this propensity score is created, the researcher can then use
matching protocols between treatment receivers and counterfactuals (Piquero and Weisburd,
2010). By using these matching protocols, a greater degree of homogeneity between the groups can be
achieved. For example, Piquero and Weisburd (2010) use a national youth study of drug use and
employment status to illustrate how the treatment (high-intensity employment) co-varies negatively with drug use; PSA indicates that this is not a treatment effect but the result of self-selection.
Something must differ in the background of these two groups.
For a much more in depth discussion and explanation on how to apply PSA please refer
to the chapter in the book referenced below.
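As a rough illustration of the two steps described above, the sketch below estimates propensity scores with a logistic regression and performs simple 1:1 nearest-neighbor matching; the covariates (age, prior arrests) and data are hypothetical, and this is not the procedure from the cited chapter.

```python
# A minimal sketch with simulated data: estimate propensity scores with a logistic
# regression, then match each treated unit to the nearest untreated unit on the score.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(25, 5, n)
prior = rng.poisson(1.5, n)                            # hypothetical prior arrests
p_treat = 1 / (1 + np.exp(-(-3 + 0.08 * age + 0.3 * prior)))
treated = rng.binomial(1, p_treat)

X = sm.add_constant(np.column_stack([age, prior]))
pscore = sm.Logit(treated, X).fit(disp=0).predict(X)   # estimated propensity to be treated

treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
matches = {i: control_idx[np.argmin(np.abs(pscore[control_idx] - pscore[i]))]
           for i in treated_idx}
print(f"Matched {len(matches)} treated units to controls.")
```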
REFERENCE
Apel, R.J. and Sweeten, G. (2010). Propensity score matching in criminology and criminal justice. In A. Piquero and D. Weisburd (Eds.). Handbook of quantitative criminology. (pp.
543-562) New York: Springer.
Purpose:
To measure obesity rates across time and neighborhoods while controlling for contextual effects.
Abstract:
Obesity (body mass index ≥ 30 kg/m2) is a growing urban health concern, but few studies have examined
whether, how, or why obesity prevalence has changed over time within cities. This study characterized the
individual- and neighborhood-level determinants and distribution of obesity in New York City from 2003
to 2007. Individual-level data from the Community Health Survey (n = 48,506 adults, 34 neighborhoods)
were combined with neighborhood measures. Multilevel regression assessed changes in obesity over time
and associations with neighborhood-level income and food and physical activity amenities, controlling for
age, racial/ethnic identity, education, employment, US nativity, and marital status, stratified by gender.
Obesity rates increased by 1.6% (P < 0.05) each year, but changes over time differed significantly between neighborhoods and by gender. Obesity prevalence increased for women, even after controlling for
individual- and neighborhood-level factors (prevalence ratio = 1.021, P < 0.05), whereas no significant
changes were reported for men. Neighborhood factors including increased area income (prevalence ratio =
0.932) and availability of local food and fitness amenities (prevalence ratio = 0.889) were significantly
associated with reduced obesity (P < 0.001). Findings suggest that policies to reduce obesity in urban environments must be informed by up-to-date surveillance data and may require a variety of initiatives that
respond to both individual and contextual determinants of obesity.
Sample:
N=10,000.
Longitudinal Study (5yr, yearly repeat). Sampled from across the 5 boroughs of New York.
Limitations:
The resources measure may not solely capture context.
Due to standardization, the method is not sensitive to small changes.
Response rates differed by group; data for women, the unemployed, and those with lower education were less complete.
Summary:
The purpose of this section looking at medical journals was to see what we as
criminologists might be able to take away from this field and apply in our own. In the field of epidemiology, it seems that, for all but the experimental designs, the research is contending with
similar issues stemming from observational research.
In addition to highlighting some methods that seem to be more common in the medical
field yet equally useful to criminology, and apart from the lessons we could take from the medical field discussed at the beginning of this section, the following also stands out to the criminological researcher looking at medical research.
The most noticeable difference between criminological research on the whole and epidemiological studies, apart from experimental designs, is that multiple methods are often used.
These appear to be driven by strict adherence to the types of variables that can be used in a given model,
as opposed to using them anyway and checking for VIFs or skewness. Further, it was very common to
see analyses include other methods as a check for reliability or validity, as opposed to relying on the checks
produced by adding commands. The medical research realm seems more comfortable with
computationally demanding tests to validate the observed results, such as sensitivity analyses.
Criminology as a field talks a lot about being or becoming a science, and yet we mimic only
what we want to. We rarefy the act of publishing more, it seems, than a field that produces
substantially more publications, which allows it to quickly build vast bodies of knowledge and to
see the direction its findings are going in sooner. Finally, as a result of searching this
literature, this author is left wondering how far we are falling behind, in statistical savvy, a field dealing with similar
issues. As a final thought, maybe it is time criminological schools
focused on criminology and used the vastly more knowledgeable mathematics departments to
equip their students with modern statistical knowledge.