METHODS:
RESEARCH THEMES
Northeastern University
2010
The Authors
Table of Contents
INTRODUCTION TO CRIMINOLOGICAL AND CRIMINAL JUSTICE RESEARCH THEMES
CHAPTER 1: MAJOR CRIME DATA SOURCES: PROMISES AND PROBLEMS
By Matthew J. Dolliver
CHAPTER 2. DATA SOURCES FOR CRIMINOLOGICAL RESEARCH
By Michael Rocque
CHAPTER 3. SAMPLING
By Stephanie Fahy
CHAPTER 4. SCALE MEASUREMENT
By Chad Posick
CHAPTER 5. EXPERIMENTAL AND MODIFIED-EXPERIMENTAL DESIGNS
By Michael Rocque and Chad Posick
CHAPTER 6. QUALITATIVE RESEARCH METHODS
By Diana K. Peel
CHAPTER 7. PROGRAM EVALUATION
By Kristin Reschenberg
CHAPTER 8. NEWER STATISTICAL METHODS (PART I)
By Diana Summers
CHAPTER 9. RESEARCH DESIGN AND NEWER STATISTICAL METHODS (PART II)
By Sean Christie
The fifth chapter, by Michael Rocque and Chad Posick, provides a basic overview of the fundamental research designs in the social sciences: experiments and quasi (modified) experiments. The first part of the chapter describes what experiments are, their strengths and weaknesses, and the ethical issues involved in such designs. The next part describes modified experiments, which are the strongest research methodology available when true experiments are not feasible or ethical.
The sixth chapter is written by Diana K. Peel, a first-year PhD student. Her theme focuses on methods of qualitative research in the social sciences. She discusses sampling methods, coding, and analysis techniques typically used in qualitative research. This topic is essential for all researchers to be familiar with in order to become competent consumers of the literature.
The seventh chapter is written by Kristin Reschenberg, a first-year PhD student. Her theme provides an introduction to the basics of program evaluation. Her chapter discusses what program evaluation is and includes a brief discussion of research designs commonly used in such evaluations. It also places an important emphasis on the political nature of program evaluations in criminal justice.
The eighth and ninth chapters are written by Diana Summers, a first-year PhD student, and Sean Christie, a second-year PhD student. These chapters are companion themes, covering new and relatively rare statistical methods. Ms. Summers's theme covers longitudinal and time-series methods, and Mr. Christie's theme covers a variety of methods from meta-analysis to survival analysis.
This guide is meant to be used as a reference for beginning and intermediate researchers in criminology and criminal justice. We hope it serves you well.
Michael Rocque
Chad Posick
April 2010
REFERENCES
Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-Experimental Designs for
Research. Chicago: Rand McNally.
Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for
field settings. Boston: Houghton-Mifflin.
DeCoster, J. (2005). Scale construction notes. Retrieved November 16, 2008, from
http://www.stat-help.com/notes.html
DeVellis, R.F. (1991). Scale Development: theory and applications (Applied Social
Research Methods Series, Vol. 26). Newbury Park: Sage.
Isaac, S. and Michael, W.B. (1995). Handbook in Research and Evaluation (3rd Edition). San
Diego CA: EdITS.
Kerlinger, F. N. and Lee, H. B. (2000). Foundations of Behavioral Research (4th Ed.). New
York, NY: Holt, Rinehart and Winston.
Spector, Paul E. (1992). Summated Rating Scale Construction: An Introduction. Newbury
Park, CA: Sage Publications.
Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd Edition). Cincinnati, OH:
Atomic Dog Publishing.
CHAPTER 1: MAJOR CRIME DATA SOURCES: PROMISES AND PROBLEMS
By Matthew J. Dolliver
PART I: INTRODUCTION
The major crime data sources and their significance
There are a number of official/major crime data sources which can be utilized by researchers. These data sources comprise the official measures of crime for the U.S. federal government. These measures can be used for a variety of projects, from large-scale secondary
data analyses to quick references. As both a popular and official measure, these data sources are
worthy of some examination. This chapter will look at major crime data sources, including the
Uniform Crime Report, The National Incident-Based Reporting System, and the National Crime
Victimization Survey.
PART II: THE THREE MAJOR CRIME DATA SOURCES
Crime in the United States: An overview of the Uniform Crime Report (UCR)
The Federal Bureau of Investigation (FBI) was tasked with publishing the Uniform Crime Report (UCR) in 1930 as an official measure of crime in the United States. The report is composed of crimes recorded and arrests made by police in a given year (FBI, 2010). This record has been used to estimate the volume of crime, including the degree to which crime fluctuates over time, space, and demographic factors throughout the U.S. (Liska and Messner, 1999).

The UCR is compiled based on crimes reported by (participating) police agencies to the FBI (FBI, 2010). These crimes are placed into two basic groups and coded as Part I and Part II offenses by the FBI. Part I, or index, offenses include homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, motor vehicle theft, and arson. These crimes are considered to be the most serious, regularly occurring, and widespread crimes in the UCR (FBI, 2010). Part II offenses include simple assault, fraud, embezzlement, gambling, driving under the influence, weapons possession, vandalism, and vagrancy. Police departments report age, sex, race, and clearance rate for Part I offenses. For Part II offenses, only arrest data are reported. The UCR reports crime to the public in three ways. First, there are raw figures given for each crime type. Second, there is the rate per 100,000 population. Finally, the UCR presents the change in the raw and rate figures over time (Pattavina, 2005). These data are presented in a variety of formats and are widely available through the FBI's website.
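To make these three presentations concrete, here is a minimal sketch in Python (the counts, years, and population are hypothetical, not actual UCR figures) showing how a raw count is converted to a rate per 100,000 and how change over time can be expressed:

def rate_per_100k(offense_count, population):
    # Offenses per 100,000 residents, the second form in which the UCR reports crime.
    return offense_count / population * 100_000

def percent_change(previous, current):
    # Change over time, expressed as a percentage of the earlier figure.
    return (current - previous) / previous * 100

burglaries_2008, burglaries_2009 = 2_150, 1_980   # hypothetical raw figures
city_population = 500_000

rate_2008 = rate_per_100k(burglaries_2008, city_population)   # 430.0
rate_2009 = rate_per_100k(burglaries_2009, city_population)   # 396.0
print(f"2009 rate: {rate_2009:.1f} per 100,000 ({percent_change(rate_2008, rate_2009):+.1f}% vs. 2008)")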
Analysis of the Uniform Crime Report Approach
One of the most widely cited weaknesses of the UCR is its validity. There are several aspects of the UCR's construction that draw its validity into question. First, it is well known that many crimes are not reported to the police. Second, a crime known to the police may not result in a report by the police or may be recorded as a different type of crime when it is reported (Pope et al., 2001). This may occur for a number of reasons. For instance, police may have differing definitions or may be looking to present a particular image. It should be noted that the FBI does provide a list of crimes and guidelines for reporting them; however, these are guidelines and are difficult if not impossible to enforce (FBI, 2010). Finally, the UCR is compiled using a hierarchy rule. This rule codes a group or string of offenses that happen together as a single crime represented by the worst offense. This coding practice does have some exceptions, for instance, when multiple homicides are involved (Pattavina, 2005).
The UCR also suffers from systematic underreporting due to the voluntary nature of the program. No law enforcement agency is required to submit data. According to the FBI, about 93% of departments participate. However, this still means that crime will be underreported. Additionally, the UCR does not account for federal crimes or arrests (FBI, 2010). Underreporting will also take place because of the UCR's narrow field of included offenses. Together, Parts I and II cover only the 29 crimes seen by the FBI as the most serious. This list excludes a number of crimes and even crime types. For example, many white-collar crimes are not included (Liska and Messner, 1999).
The evolution of crime in the U.S.: National Incident-Based Reporting System (NIBRS)
In an effort to update its approach and address some of the issues associated with the UCR, the FBI introduced the National Incident-Based Reporting System (NIBRS) in the late 1980s. As stated by the FBI, the goal of this new system of crime reporting is to enhance the quantity, quality, and timeliness of crime data, as well as to improve the methodology used for compiling, analyzing, and publishing those data (FBI, 2010a). Because NIBRS is an incident-based reporting system, it contains detailed information on individual crimes/arrests. For example, NIBRS reports include such information as location of incident, method of entry, and victim-offender relationship (FBI, 2010a), all of which can be useful to our understanding of crime.

NIBRS data are also collected and separated into two categories. Group A is comprised of 46 crimes, and Group B has an additional 11. Group A offenses expand on the index crime concept of the FBI by including crimes such as kidnapping and sex offenses. Crime in the NIBRS system can generally be thought of as one of three types: violent crime, property crime, or crimes against society. The expansion of the NIBRS system into the area of crimes against society shows a shift in development from the UCR and places an emphasis on drugs and drug-related crime (Pattavina, 2005).
Analysis of the National Incident-Based Reporting System approach
The FBI's evolution of crime reporting from the UCR to NIBRS has several notable advantages. First, the NIBRS system is able to distinguish between crimes committed in a group or series, as it does not use the hierarchy rule. Additionally, the NIBRS system of reporting is able to distinguish between attempted and completed crimes. In the same way, the NIBRS system produces better definitions of crimes for reporting and classification by law enforcement. Related to this classification, the NIBRS system also yields more complete and comprehensive statistical analysis of crime in the U.S. (Pattavina, 2005).

NIBRS data are collected based on the reports of participating police agencies. This leaves them subject to underreporting (as discussed earlier with the UCR). This point is particularly significant because only about half of the U.S. participates in reporting to NIBRS. This reporting can be seen as costly and time consuming, particularly for smaller police agencies.
NATIONAL CRIME VICTIMIZATION SURVEY (NCVS)
The BJS system and alternative methods: The National Crime Victimization Survey
The final major source of crime data that we will examine here is the Bureau of Justice
Statistics' National Crime Victimization Survey (NCVS). Because this measure of crime is conducted using a survey, it is not subject to the reporting problems associated with either the UCR or NIBRS. Approximately 150,000 interviews are conducted in two sessions a year, using U.S. Bureau of the Census personnel. This alternative method of data collection reveals some of the underreporting seen in the UCR and NIBRS. The NCVS collects demographic and other information about the persons involved, the nature and extent of the victimizations, economic consequences, and other real-world information in order to provide a more complete understanding (Pattavina, 2005).
However, as a survey, the NCVS is labor-intensive and costly. These drawbacks are felt in the collection and compilation of data. For example, compiling NCVS data takes more time, effort, personnel, and financial resources. This draws out an additional limitation of this approach. Interviewers may affect the outcomes of surveys or may encode answers incorrectly when dealing with participants. This means there is a potential for inconsistency in the coding of data, as well as the creation of bias. In the same way, subjects may fail to tell the truth or decline to participate fully. For instance, a subject may be embarrassed to reveal a criminal incident to an interviewer. Similarly, a subject's memories surrounding an incident may not be accurate. Finally, the NCVS does not collect information on a number of crimes. The survey focuses on victimization experiences. For example, the survey includes only those 12 years of age and older and therefore systematically leaves out a group of crimes (Pattavina, 2005).
PART III: CONCLUSION
We have conducted some analyses of the three major crime data sources. These sources
are presented by the United States federal government as the official measure of crime. However,
as we have seen, all of these methods for measuring and reporting crime are subject to some
limitations. They are best viewed together as an estimate of the nation's general level of crime.
These measures are also likely to shift and evolve if they are to keep up with the rapidly changing nature of crime and its scientific study.
REFERENCES
Federal Bureau of Investigation. (2010a). National Incident-Based Reporting System, Volume 1.
Available from: www.fbi.gov/ucr/nibrs/manuals/v1all.pdf
Federal Bureau of Investigation (2010). Crime in the United States, 2002. Washington,
DC: United States Government Printing Office.
www.fbi.gov/ucr.ucr.htm#cius
Liska, A. E. and S. F. Messner. (1999). Perspectives on Crime and Deviance (3rd ed). Prentice
Hall, Upper Saddle River, NJ.
Pattavina, A. (2005). Information Technology and the Criminal Justice system. Thousand
Oaks, CA: Sage publishing.
Pope, C., R. Lovell, and S. Brandl. (2001). Voices from the Field: Readings in
Criminal Justice Research. Scarborough, Ontario: Wadsworth Publishing.
The NCVS includes violent crime and property crime, analogous to the UCR.
Supplements: Crime and School Safety – includes information from US schools on:
o Alcohol and drug availability;
o Fighting, bullying, and hate-related behaviors;
Data include:
measures of self-reported offending;
indicators of repeat offending;
trends in the prevalence of offending;
trends in the prevalence and frequency of drug and alcohol use;
evidence on the links between offending and drug / alcohol use;
evidence on the risk factors related to offending and drug use; and
information on the nature of offences committed, such as the role of co-offenders and the relationship between perpetrators and victims.
For more information and to download data, go to
http://www.homeoffice.gov.uk/rds/offending_survey.html
D. International Crime Victimization Survey
The International Crime Victimization Survey (ICVS) is conducted by the United Nations and
initiated by the ICVS international working group. The ICVS was first carried out in 1987, then
again in 1992. The third wave occurred in 1996 and the fourth in 2000. The latest round (2005)
includes 78 countries and 300,000 interviews. The purpose of the survey is to generate data for
national comparisons.
Subjects are interviewed in a similar fashion to the NCVS. Households are the unit of analysis, and interviews are done predominantly over the phone using CATI methods. Screening questions are used to determine whether a person has been a victim of a crime and, if so, more detailed information is collected.
Information collected includes:
Demographic information;
5 year victimization screen;
Detailed situational information on victimization;
Whether crime was reported to police;
Victim services information; and
Seriousness of the crime
For more information download the latest ICVS report here:
http://www.unicri.it/wwd/analysis/icvs/pdf_files/ICVS2004_05report.pdf
Or download the data here:
http://www.icpsr.umich.edu/cocoon/NACJD/SERIES/00175.xml
E. European Union Crime and Safety Survey
The European Union Crime and Safety Survey (EU ICS) is modeled after the ICVS. It is conducted within the nations of the European Union and uses a slightly modified instrument from
that used in the ICVS. This survey was initiated in 2005.
The EU ICS uses Random Digit Dialing and complex survey methodology. Interviews are conducted using CATI methods (and some Web-based instruments). The total sample contains over
28,000 individuals.
Data include:
Household information
Victimization experiences
o Personal
o Property
o Motor Vehicle
o Hate crimes
Perceptions of safety
Neighborhood characteristics
For more information and to download data, go to:
http://www.europeansafetyobservatory.eu/euics_da.htm
F. Law Enforcement Management and Administrative Statistics (LEMAS)
LEMAS is a national survey of law enforcement agencies with over 100 sworn personnel. The
survey is executed on a three-year basis (Langworthy, 2002). The purpose of the survey is to
gather data on officers, hiring practices, training procedures, expenditures and agency equipment (US DOJ, 1996, as cited in Langworthy, 2002). Data are also collected on agency initiatives, such as community policing.
The survey was conducted in the following years: 1987, 1990, 1993, 1997, 1999, 2000 and 2003.
For more information or to download data files, go to
http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/04411/detail
See Langworthy (2002). LEMAS: a comparative organizational research platform. Justice Research and Statistics Association.
G. World Health Survey (World Health Organization)
The World Health Survey (WHS) is conducted by the World Health Organization. Its purpose is
to provide data for cross-national comparisons on a variety of health issues. The WHS is relevant
for criminological research because it provides measures that can serve as indicators of the wellbeing of a nation as well as information on policies within nations. Data are collected from nations across the globe, and are geocoded.
Information is collected on the following domains:
Socio-demographics (occupation, income, sex and age structure of household, household
expenditures);
Healthcare utilization; and
Health status of respondents
Sociodemographic factors;
Education;
Work;
Health;
Crime and delinquency;
Criminal justice contact;
Income;
Marriage and social relationships;
Sexual activity;
Substance and sexual abuse;
Attitudes; and
Political participation
Socioeconomic factors;
Impulsivity;
Crime/delinquency;
Attitudes;
Cognitive and language development; and
Family structure and relationships (Parental figure or caregiver)
Laub, J. H., and R. J. Sampson. (2003). Shared Beginnings, Divergent Lives: Delinquent Boys to
Age 70. Cambridge, MA: Harvard University Press.
Mosher, C.J., T. D. Miethe and D. M. Phillips (2002). The Mismeasure of Crime. Thousand
Oaks: Sage.
Sampson, R. J., and J. H. Laub. (1993). Crime in the making: Pathways and turning points
through life. Cambridge, MA: Harvard University Press.
CHAPTER 3. SAMPLING
By Stephanie Fahy
This chapter reviews sampling methods in social science research. Sampling involves selecting a
smaller number of elements (such as people or organizations) within a population of interest in
order to generalize from the sample to the population from which the elements were chosen
(Trochim, 2001:41). A sample's quality is largely based on the degree to which it is representative – the extent to which the characteristics of the sample are the same as those of the population from which it was selected (Maxfield and Babbie, 2001:242).
Key Terms
Population – the group you wish to generalize to
o Theoretical population – who we want to generalize to
o Accessible population – the population we can get access to
Sampling Frame – the listing of the accessible population from which you'll draw your sample
Sample – the group of people you select to be in your study
External validity – the degree to which the conclusions in your study would hold for other persons in other places and at other times. A threat to external validity is an explanation of how you might be wrong in making a generalization.
o One way of improving external validity is by doing a good job of drawing a sample from a population (i.e., using random selection as opposed to non-random selection)
Figure 3.1. Different groups in a sampling model (Trochim, 2001): the theoretical population (who do you want to generalize to?) and the study population (what population can you get access to?)
Even the most carefully selected sample is almost never a perfect representation of the population from which it was selected. Probability sampling methods are highly recommended for
selecting samples that will be quite representative. An important advantage of using probability
sampling methods is that they make it possible to estimate the amount of sampling error that
should be expected in a given sample (Maxfield and Babbie, 2001:242).
A basic principle of probability sampling is that a sample will be representative of the
population from which it is selected if all members of the population have an equal
chance of being selected in the sample (Maxfield and Babbie, 2001: 220); thus, the key to
this process is random selection.
o Random selection forms the basis of probability theory, which permits inferences
about how sampled data are distributed around the value found in a larger population (i.e., probability theory makes it possible to estimate sampling error and confidence intervals for the population parameter); thus allowing you to estimate the
accuracy or representativeness of the sample. In other words, you know the odds
or probability that you have represented the population well (Maxfield and Babbie, 2001).
o Random selection reduces conscious and unconscious sampling bias.
A probability sampling method is any method of sampling that utilizes some form of random selection. Random sampling gives each and every member of the population an
equal chance of being selected for the sample (Fox, Levin and Shively, 2002:158).
Types of Probability Sampling Methods (see Figure 3.2)
Simple Random Sampling – The simplest form of random sampling is appropriately called simple random sampling (Trochim, 2001). Once a sampling frame has been established, you would assign a single number to each element in the list, not skipping any number in the process. A table of random numbers, or a computer program for generating them, is then used to select elements for the sample (Maxfield and Babbie, 2001:230). Many computer programs can do this work: the elements in the sampling frame are numbered, the program generates its own series of random numbers, and it prints out the list of elements selected (Maxfield and Babbie, 2001; Trochim, 2001). Simple random sampling is easy to accomplish, and because it is a fair way to select a sample, it is reasonable to generalize the results from the sample back to the population (Trochim, 2001:51).
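As a minimal sketch of the procedure just described (the sampling frame and sample size here are hypothetical), a pseudo-random number generator can take the place of a table of random numbers:

import random

# Hypothetical sampling frame: every element in the accessible population, numbered 1 to 10,000.
sampling_frame = [f"element_{i}" for i in range(1, 10_001)]

random.seed(42)                                  # fixed seed so the draw can be reproduced
sample = random.sample(sampling_frame, k=370)    # simple random sample, drawn without replacement

print(len(sample), sample[:3])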
Stratified Random Sampling – Involves dividing your population into homogeneous subgroups or strata and then taking a simple random sample in each subgroup. Stratification is based on the idea that a homogeneous group requires a smaller sample than does a heterogeneous group (Fox, Levin and Shively, 2001:160). Stratified sampling may be preferred over simple random sampling because 1) it assures that you will be able to represent not only the overall population but also key subgroups of the population, particularly small minority groups, and 2) stratified random sampling has more statistical precision than simple random sampling if the strata or groups are homogeneous, since the variability within groups would be expected to be lower than the variability for the population as a whole¹ (Trochim, 2001).
¹ According to sampling theory, a homogeneous population produces samples with smaller sampling errors than a heterogeneous population does, and stratified sampling is based on this principle.
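A companion sketch of proportionate stratified random sampling (the strata and their sizes are hypothetical): the frame is first divided into homogeneous strata, then a simple random sample is drawn within each stratum in proportion to its share of the population.

import random

random.seed(42)

# Hypothetical frame of 10,000 inmates tagged by facility security level (the stratifying variable).
frame = ([("minimum", f"min_{i}") for i in range(6_000)] +
         [("medium", f"med_{i}") for i in range(3_000)] +
         [("maximum", f"max_{i}") for i in range(1_000)])

def stratified_sample(frame, total_n):
    # Group elements by stratum, then take a simple random sample within each stratum,
    # allocating the total sample size proportionately to stratum size.
    strata = {}
    for stratum, element in frame:
        strata.setdefault(stratum, []).append(element)
    sample = []
    for elements in strata.values():
        n_h = round(total_n * len(elements) / len(frame))
        sample.extend(random.sample(elements, n_h))
    return sample

sample = stratified_sample(frame, total_n=400)   # 240 minimum, 120 medium, 40 maximum
print(len(sample))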
while decreasing the number of elements within each cluster² (Maxfield and Babbie,
2001:233).
Eltinge and Sribney (1997) point out that an important consideration to keep in mind when collecting sample data through complex designs like stratified and multi-stage cluster sampling is
that data cannot be assumed to be independent and identically distributed (iid) and therefore it is
inappropriate to analyze data using statistical methods that are based on iid assumptions (i.e., regression commands in statistical software programs like SPSS and Stata) since estimates will
almost certainly be biased. They argue that iid-based methods do not adjust for the effects of unequal selection probabilities and that iid-based variance estimators do not account for
the loss of information that invariably occurs when using a complex design over simple random
sampling, so you would run the risk of overstating the certainty attached to estimates (Eltinge
and Sribney, 1997).
A design-based approach is preferable to model-based approaches (i.e., iid-based approach) since
it accounts for the collection of sample data through complex designs, accounting for any losses
or gains in information, which results in more robust and accurate estimates. Importantly, this
approach restricts randomness to the specific random process by which the sample was selected (i.e., random selection at each successive stage) rather than assuming a true random sample, which is the approach taken by model-based analysis (Eltinge and Sribney, 1997).
Levy and Lemeshow (1999:482) recommend the following steps for performing a design-based
analysis:
1) Identify the following elements of the sample design:
a. Stratification
b. Clustering variables used
c. Population sizes required for determination of finite population corrections
2) On the basis of the above information, determine the sampling weight for each sample
subject.
3) Determine for each sample record a final sampling weight that takes into consideration
any nonresponse and poststratification adjustments that are desired.
4) Ensure that all stratification, clustering, and population size data required for an appropriate design-based analysis are identified on each sample record.
5) Determine the procedure and the set of commands for performing the required analysis
for the particular software package that will be used.
6) Run the analysis and carefully interpret the findings.
² One way sampling error is reduced is by increasing the homogeneity of elements sampled. A sample of clusters will best represent all clusters if a large number are selected and if all clusters are very much alike. A sample of elements will best represent all elements in a given cluster if a large number are selected from the cluster and if all the elements in the cluster are very much alike (Maxfield and Babbie, 2001:233).
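As a hedged illustration of steps 2 and 3 in the Levy and Lemeshow list above (every number here is hypothetical), a final sampling weight can be built from the inverse of the selection probability and then adjusted for nonresponse and poststratification; survey-aware software then combines these weights with the stratum and cluster identifiers from step 4.

def final_weight(selection_prob, response_rate, poststrat_factor=1.0):
    # Base weight = inverse of the probability of selection; adjusted upward for
    # nonresponse and multiplied by any poststratification factor that is desired.
    base_weight = 1.0 / selection_prob
    nonresponse_adj = 1.0 / response_rate
    return base_weight * nonresponse_adj * poststrat_factor

# Hypothetical record: selected with probability 1/250 within its stratum, from a
# stratum with an 80% response rate and a poststratification factor of 1.05.
w = final_weight(selection_prob=1/250, response_rate=0.80, poststrat_factor=1.05)
print(round(w, 1))   # 328.1: this record stands in for roughly 328 population members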
Figure 3.2. Probability Sampling Techniques (Cox and Fitzgerald's figures on probability sampling)
Non-probability sampling does not involve random selection and therefore cannot depend on
the rationale of probability theory. In general, researchers prefer probabilistic or random sampling methods over non-probabilistic ones and consider them to be more accurate and rigorous;
however, in some circumstances in applied research it is not feasible, practical or theoretically
sensible to use random sampling (Trochim, 2001:55-56). Additionally, researchers may have
limited research objectives or seek to interview a population with no established sampling frame
(e.g., car thieves, bank robbers) (Maxfield and Babbie, 2001).
Non-probability sampling methods are divided into two broad types: accidental or purposive.
Types of Non-Probability Sampling Methods
Accidental, haphazard or convenience sampling – This type of sampling takes what is most quickly and easily available, often relying on available subjects (e.g., stopping people at a street corner or some other location). This type of sampling method is neither purposeful nor strategic, and there is no evidence that the subjects are representative of the populations you're interested in generalizing to (Maxfield and Babbie, 2001; Trochim, 2001:56).
Purposive sampling – This type of sampling is designed to understand certain select cases in their own right rather than to generalize results to a population (Isaac and Michael, 1995:223). A sample is selected based on your judgment and the purpose of the study, and usually you would be seeking one or more specific predefined groups (Trochim, 2001). For example, a study that looked at people's attitudes about court-ordered restitution for crime victims may want to test the questionnaire on a sample of crime victims; so rather than select a probability sample of the general population, you may select some number of known crime victims, perhaps from court records (Maxfield and Babbie, 2001:238). With this type of sampling, you are likely to get the opinions of your target population, but you are also likely to overweight subgroups in your population that are more readily accessible (Trochim, 2001:56).
The following are subcategories of purposive sampling (Trochim, 2001:56-58):
o Modal Instance Sampling – This type of sampling involves sampling the most frequent or typical case (e.g., informal public opinion polls interview the "typical" voter). An obvious problem with this approach is determining what the typical or modal case is.
o Expert Sampling – This approach involves assembling a sample of people with known expertise in a particular area. This approach can also be used to validate another sampling approach; however, the disadvantage to using this approach is that even experts can be wrong.
o Quota Sampling – This approach involves selecting people nonrandomly according to some fixed quota. There are two types of quota sampling: proportional and nonproportional. For example, with proportional quota sampling, if you knew the population has 40 percent women and 60 percent men and you wanted a total sample size of 100, you would continue sampling until you reach those percentages and then stop. So, if you already had 40 women but not 60 men, you would continue to sample men but not women because you have met your quota for women. There are a couple of problems with this approach. First, the quota frames must be accurate, and it is often difficult to get up-to-date information on the proportional breakdown of sample elements. Second, selection bias may exist
o Criterion sampling – This strategy involves studying all cases that meet some predetermined criterion of importance.
o Confirmatory or disconfirming cases – These cases are used to either support or call into question the emerging trends or patterns in the early exploratory phase of a qualitative evaluation.
o Sampling politically important cases – Because evaluation often takes place in a politically sensitive environment, it is practical to sample politically important or sensitive cases.
[Table comparing sampling methods: sampling method, when to use it, advantages, and disadvantages]
The required sample size for a population of a given size can be determined with the following formula:

s = X²NP(1 - P) / [d²(N - 1) + X²P(1 - P)]*

Where:
s = required sample size**
N = population size = 10,000
P = population proportion = 0.50 (assumed, which yields the maximum sample size)
d = degree of accuracy = 0.05
X² = table value of chi-square for 1 degree of freedom at the 95% confidence level = 3.841

*This formula and corresponding calculation assume random selection of participants.
**Results should be rounded up to include the whole person. Thus, for a population of 10,000, to obtain a sample that is representative of the population at the 95% confidence level, one should sample 370 individuals.
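The calculation can be checked in a few lines of Python; the function below implements the formula above and rounds the result up to the next whole person (the function name and default values are illustrative).

import math

def required_sample_size(N, P=0.50, d=0.05, chi_sq=3.841):
    # Required sample size for a finite population of size N at the 95% confidence level.
    s = (chi_sq * N * P * (1 - P)) / (d**2 * (N - 1) + chi_sq * P * (1 - P))
    return math.ceil(s)   # round up to include the whole person

print(required_sample_size(10_000))   # prints 370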
Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd Edition). Cincinnati, OH:
Atomic Dog Publishing.
Guttman Scaling
Similar to the Thurstone scale, the Guttman scale seeks to gauge the extremity of a respondent's position on a certain concept. Using this scale, it is hypothesized that if a respondent answers a particular question in the affirmative, all preceding, less extreme questions will also be answered in the affirmative. A criminal justice example is used below to gauge level of delinquency. Someone who has been incarcerated is likely also to have been previously convicted, arrested, and stopped by the police.
1. I have been previously stopped by the police            Yes ____   No ____
2. I have been previously arrested by the police           Yes ____   No ____
3. I have been previously convicted in a court of law      Yes ____   No ____
4. I have been previously incarcerated in prison or jail   Yes ____   No ____
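The cumulative logic behind these items can be expressed in a short Python sketch (the response patterns are hypothetical): a respondent's scale score is simply the number of affirmative answers, and a pattern is consistent with the Guttman assumption only when every "yes" is accompanied by "yes" answers to all of the less extreme items that precede it.

def guttman_score(responses):
    # Items are ordered from least to most extreme; the score is the count of "yes" (1) answers.
    return sum(responses)

def fits_guttman_pattern(responses):
    # Consistent patterns have all 1s before any 0s (e.g., 1,1,1,0 but not 0,1,0,0).
    return sorted(responses, reverse=True) == list(responses)

# Items: stopped, arrested, convicted, incarcerated (1 = yes, 0 = no).
respondent_a = [1, 1, 1, 0]   # stopped, arrested, and convicted, but never incarcerated
respondent_b = [0, 1, 0, 0]   # reports an arrest but no police stop: does not fit the scale

print(guttman_score(respondent_a), fits_guttman_pattern(respondent_a))   # 3 True
print(guttman_score(respondent_b), fits_guttman_pattern(respondent_b))   # 1 False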
Likert Scaling
Likert scaling is probably the most widely used format for scaling on surveys. This format lends itself well to quantitative analysis and is thus helpful in social science research. In Likert scales, a question or statement is presented, followed by a series of response choices. The number of response choices will vary, but a scale should include between 5 and 9 to ensure adequate variation for statistical analyses. A set of self-control items is presented in this example, rated from 1 (Fully Agree) to 5 (Fully Disagree).

1. I act on the spur of the moment without thinking                                     1 2 3 4 5
2. I do whatever brings me pleasure in the here and now                                 1 2 3 4 5
3. I am more concerned with what happens to me in the short run than in the long run    1 2 3 4 5
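Because the appeal of this format is quantitative analysis, here is a brief sketch (Python, with hypothetical responses) of how the three items above might be combined into a single summated score:

# Hypothetical responses to the three items (1 = Fully Agree ... 5 = Fully Disagree).
responses = {"item_1": 2, "item_2": 3, "item_3": 1}

summated_score = sum(responses.values())         # ranges from 3 to 15 across three items
mean_score = summated_score / len(responses)     # stays on the original 1-5 metric

print(summated_score, round(mean_score, 2))      # 6 2.0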
Semantic Differential
The semantic differential format is similar to the Likert scale and tends to obtain similar information. The response choices exist between two extremes, and the respondent chooses the option that corresponds to the extent of their attitude or belief. This is shown below.
How often do you lose your temper?
Very Often ----- ----- ----- ----- ----- ------ ------- Hardly Ever
Visual Analogue
The visual analogue is almost exactly the same as the semantic differential. However, where response choices are
given in the semantic differential, the visual analogue does not separate response choices and the respondent marks
their response along a solid continuum.
How often do you lose your temper?
Very Often -------------------------------------------------------- Hardly Ever
and reliability/validity tests should be run on the responses from the sample. Items that are poorly correlated or which lower the reliability of the scale should be dropped, or the data should be
re-examined for improper coding, missing data, etc.
Step 8: Optimize scale length
The scale should not be unduly long; an overly long scale places a burden on the participants and increases the chance of burnout and fatigue (which lowers response rates and increases acquiescent responding). When possible, splitting the sample into two smaller samples and using one group as a development sample and the other as a cross-validation sample is useful, but this is not always possible.
RULES FOR GOOD QUALITY SCALES
Once you have determined what you want to study (the concept) and how you want to set up the format of the questions and scales, it is time to take it one step further and ensure that the scale is of high quality. The following rules are the minimal steps in constructing a good-quality scale.
5 Rules for Quality Scales
1) Each item should express only one idea
Questions with two ideas (usually separated by an "and" or an "or") are confusing, and respondents may have different answers to the two ideas. These are termed double-barreled questions.
Don't do this: "I act on the spur of the moment and I do not think about the long-term consequences of my actions."
Do this instead: Someone might act on the spur of the moment but also think about long-term consequences. To limit this confusion, split the question into two separate items:
1) I tend to act on the spur of the moment.
2) I do not think about the long-term consequences of my actions.
REFERENCES
Andrich, D. (1988). Rasch Models for Measurement. Newbury Park: Sage Publications.
Bond, T. G., and C. M. Fox. (2007). Applying the Rasch model: Fundamental
Measurement in the Human Sciences. Second Edition. Mahwah: Lawrence Erlbaum Associates.
Dayton, C. M. (1998). Latent Class Scaling Analysis. Thousand Oaks: Sage Publications.
DeVellis, R. F. (1991). Scale Development: Theory and Applications. Newbury Park: Sage
Publications.
Ferrando, P. J., and C. Anguiano-Carrasco. (2010). Acquiescence and Social Desirability as Item
Response Determinants: An IRT-based Study with the Marlowe-Crowne and the EPQ
Lie Scales. Personality and Individual Differences 48: 596-600.
Kruskal, J. B., and M. Wish. (1978). Multidimensional Scaling. Newbury Park: Sage
Publications.
Lodge, M. (1981). Magnitude Scaling: Quantitative Measurement of Opinions. Newbury Park:
Sage Publications.
McIver, J. P., and E. G. Carmines. (1981). Unidimensional Scaling. Newbury Park: Sage
Publications.
Spector, P. E. (1992). Summated Rating Scale Construction: An Introduction. Newbury Park:
Sage Publications.
ADDITIONAL MATERIAL
Kaplan, D. (2000). Structural Equation Modeling: Foundations and Extensions. Thousand Oaks:
Sage Publications.
APPENDIX 4A
GLOSSARY OF TERMS
Acquiescence Set – the tendency of subjects to agree with all items of a construct regardless of content
Bi-Polar Scale – responses to scale items can vary from negative to positive points, with a zero somewhere in between
Coefficient Alpha (Cronbach's alpha) – a measure of the internal consistency of a scale
Confirmatory Factor Analysis (CFA) – a technique that tests how well the data fit a hypothesized existing structure
Convergent Validity – measures of the same construct should relate strongly with one another
Criterion-Related Validity – involves the testing of hypotheses about how the scale will relate to other variables
Discriminant Validity – measures of different constructs should only relate moderately well with one another
Eigenvalue – represents the relative proportion of variance accounted for by each factor in a factor analysis
Exploratory Factor Analysis – a technique that determines the number of separate components that exist for a group of items
Factors – sets of groups that emerge out of a larger group of items and that represent theoretical constructs
Internal-Consistency Reliability – a measure of how well multiple items, designed to measure a theoretical construct, intercorrelate with one another
Item-Remainder – also considered a part-whole or item-whole coefficient; measures how well each individual item relates to the others in the analysis
Known-Groups Validity – measuring the scores of different groups of individuals based on the hypothesis that those of different groups will answer items of theoretical constructs differentially
Latent Variable – the underlying phenomenon or construct that a scale is intended to reflect
Multitrait-Multimethod Matrix (MTMM) – a technique developed by Campbell and Fiske (1959) that simultaneously explores convergent and discriminant validity
APPENDIX 4B
THE MARLOWE-CROWNE SOCIAL DESIRABILITY SCALE
Personal Reaction Inventory
Listed below are a number of statements concerning personal attitudes and traits. Read each item
and decide whether the statement is True or False as it pertains to you personally.
1. Before voting I thoroughly investigate the qualifications of all the candidates. (T)
2. I never hesitate to go out of my way to help someone in trouble. (T)
3. It is sometimes hard for me to go on with my work, if I am not encouraged. (F)
4. I have never intensely disliked anyone. (T)
5. On occasion I have had doubts about my ability to succeed in life. (F)
6. I sometimes feel resentful when I don't get my way. (F)
7. I am always careful about my manner of dress. (T)
8. My table manners at home are as good as when I eat out in a restaurant. (T)
9. If I could get into a movie without paying and be sure I was not seen, I would probably do it.
(F)
10. On a few occasions, I have given up doing something because I thought too little of my ability. (F)
11. I like to gossip at times. (F)
12. There have been times when I felt like rebelling against people in authority even though I
knew they were right. (F)
13. No matter who I'm talking to, I'm always a good listener. (T)
14. I can remember "playing sick" to get out of something. (F)
15. There have been occasions when I took advantage of someone. (F)
16. I'm always willing to admit it when I make a mistake. (T)
17. I always try to practice what I preach. (T)
18. I don't find it particularly difficult to get along with loud-mouthed, obnoxious people. (T)
19. I sometimes try to get even rather than forgive and forget. (F)
20. When I don't know something I don't at all mind admitting it. (T)
21. I am always courteous, even to people who are disagreeable. (T)
22. At times I have really insisted on having things my own way. (F)
23. There have been occasions when I felt like smashing things. (F)
24. I would never think of letting someone else be punished for my wrongdoings. (T)
25. I never resent being asked to return a favor. (T)
26. I have never been irked when people expressed ideas very different from my own. (T)
27. I never make a long trip without checking the safety of my car. (T)
28. There have been times when I was quite jealous of the good fortune of others. (F)
29. I have almost never felt the urge to tell someone off. (T)
30. I am sometimes irritated by people who ask favors of me. (F)
31. I have never felt that I was punished without cause. (T)
32. I sometimes think when people have a misfortune they only got what they deserved. (F)
33. I have never deliberately said something that hurt someone's feelings. (T)
APPENDIX 4C
Psychometric Theory
There are two major theoretical approaches to psychometric theory: 1) classical test theory and
2) item response theory.
Classical Test Theory
Classical Test Theory uses the following formula:
X = T + E
X = Observed Score
T = True Score
E = Error
Classical test theory (also referred to as true score theory) is primarily concerned with the reliability of the psychological test. Therefore, the most utilized statistic is the alpha coefficient
(Cronbach's alpha).
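To make the statistic concrete, here is a small sketch in Python (using NumPy, with made-up item responses) that computes coefficient alpha from the standard formula: alpha = (k / (k - 1)) x (1 - sum of the item variances / variance of the summed score).

import numpy as np

def cronbach_alpha(items):
    # Coefficient alpha; rows are respondents, columns are scale items.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses from six people to three 5-point items.
responses = [[1, 2, 1],
             [2, 2, 3],
             [3, 3, 2],
             [4, 4, 5],
             [4, 5, 4],
             [5, 5, 5]]
print(round(cronbach_alpha(responses), 2))   # about 0.95 for these made-up data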
We introduce this term as a replacement for quasi-experiments, popularized by Campbell and Stanley (1963). In
our view, modified is a more appropriate way to describe experiments in which full randomization is not possible
or feasible. In these situations, researchers must seek to adapt or modify the true experimental design.
Consider a study that seeks to examine the effect of intensive probation on criminal recidivism. A true experiment would involve developing sampling criteria, use of appropriate sampling techniques, a pre-test on outcomes of interest (here, perhaps variables that may be related
to recidivism such as criminogenic attitudes), randomization to an experimental group (receiving
intensive probation) and a control group (for example, probation as usual) and a follow-up posttest (where the same variables are collected as were on the pre-test and recidivism is measured).
Turner, Petersilia, and Deschenes (1992) describe just such an experiment.
Experimenters recognize that randomization is essential in isolating the causal effect of
the intervention (Weisburd, 2003; Farrington, Loeber and Welsh, 2010). In the example above,
without a proper assignment procedure, researchers would be less confident that any difference
in recidivism between the two groups was caused by the treatment (e.g., intensive probation) and
not an unmeasured variable. Randomization means that each subject in the identified sample has
an equal chance (or the probability is known) of being in the experimental or control group.
Mathematically, this produces groups that are roughly equivalent on all measures (given a large
enough sample size) (see Bachman and Schutt, 2001). An important point is that true experiments do not control or cancel possible confounders; rather they result in such confounders being
equally distributed between both groups so that they cannot differentially influence the outcome.
II. Threats to Internal Validity
Campbell and Stanley (1963; see also Cook and Campbell, 1979) identified several
threats to internal validity in research designs as a way to illustrate the power of true experiments. In certain research designs where the intention is to isolate and describe a causal effect of some form of treatment or intervention, there are several alternative explanations that
may account for differences between the experimental and control group or observed effects of
treatment. Researchers can use various design strategies to control these threats to internal validity. These threats are summarized in Table 5.1.
Table 5.1. Threats to Internal Validity
History
Factors that affect the outcome apart from the treatment. Example: a highly publicized case of child abduction occurs during an experiment testing the effect of a new law on child predator recidivism.

Maturation
Growth or natural changes within subjects over time that would have occurred with or without the treatment. Example: a research study is examining the effect of a tutorial program on language acquisition over two years. Even without the program, the children might be expected to improve their language skills. This must be taken into account to isolate the effect of the treatment.

Testing
The very act of taking part in a study and completing questionnaires changes subject behaviors. Example: in a famous study examining the effect of changes in environment on worker production, researchers in Chicago found that no matter what they did, worker production increased. They concluded that the workers realized they were taking part in a study and changed their behavior accordingly; this is now known as the Hawthorne effect (see Maxfield and Babbie, 1998).

Instrumentation
Differences in testing procedures influence observed results. Example: a study on anxiety assigns subjects to a condition in which a scary movie is viewed or a condition in which an episode of Lassie is viewed. The subjects are given an anxiety test immediately following the viewing. Lo and behold, there is a difference in these scores, with the Lassie group showing less anxiety. However, it is discovered later that the movie group took their test in a room next to a construction site, with a jackhammer at full blast (sounding like a machine gun). It is unclear whether the higher anxiety scores in this group are the result of the movie or the jackhammer.

Statistical Regression
When subjects are selected for an experiment on the basis of extreme (high or low) scores, it is expected that their post-test scores will revert back toward the mean. Example: subjects are tested for social anxiety in order to be selected into a study. Those who score in the 90th percentile are chosen to be given a treatment. After treatment, the subjects' scores are reduced by 2 standard deviations. However, because of the high scores on the pre-test, it may be expected that these subjects' scores at a later date would be less extreme even without treatment. This is known as regression to the mean.

Selection
Without careful attention to the procedures by which subjects are assigned to experimental or control groups, differences in outcomes may be due to pre-existing differences rather than treatment. Example: a study seeks to examine the effect of rehabilitation in prison on recidivism. Subjects are selected for rehabilitation via volunteerism; only those volunteering to take part are given treatment. The experimental group is compared to a different group of prisoners not given rehabilitation. Lo and behold, two years after release from prison, the rehabilitation group has a lower rate of recidivism. Yet we cannot be sure if this difference is due to treatment or to the fact that the experimental group may have been more willing to give up crime (as evidenced by their volunteering for a rehabilitation program).

Experimental Mortality
Subjects may drop out of or refuse to take part in the study, resulting in unequal control and experimental groups. Example: in a two-year study that seeks to examine the effect of the D.A.R.E. program on drug use, students/classrooms are assigned to receive the D.A.R.E. program or no program. After two years, the D.A.R.E. group participants have HIGHER rates of cannabis use, prompting officials to drop Crime Dog McGruff. Yet post-hoc analyses find that in the control group, 15% of the students dropped out of the study (and these 15% were highly likely to be drug users). Thus, the negative effect of D.A.R.E. may have been artificially caused by drug-user drop-out in the control group, resulting in an artificially lower rate of drug use in that group at follow-up.
Diffusion of Treatments
In a study with a control and experimental group, the control group may
become aware of the treatment being offered to their counterparts, or be
given the treatment (outside of the experimental protocols). This results in
contamination. Example: a study seeks to examine the effect of counseling
on domestic violence perpetrators. Subjects are randomized to a group
that receives mandatory counseling or a group that receives treatment as
usual. Part way through the study the control group subjects become
aware of the study and engage in compensatory behavior to demonstrate they are not inferior to the experimental group. This is known as
compensatory rivalry or the John Henry effect (Cook and Campbell,
1979, cited in Bachman and Schutt, 2001). Treatment sometimes crosses over in experiments, whereby staff either unwittingly give subjects in the control group the treatment (e.g., broken assignment) or deliberately give the treatment to the control group because of perceived unfairness. This results again in contamination.
Interactions between any threats can also occur. For example, selection
and history may both take place thus contaminating results (see Isaac and
Michael, 1995).
In true experiments, these threats to internal validity are controlled to the highest extent
possible. For example, history threats (say a highly publicized child abduction case) affect both
the experimental and control group equally. However, even in true experiments, process or implementation failures can weaken internal validity. For example, in the famous Sherman and
Berk study of the effect of mandatory arrest for domestic violence, officers were to choose a response at random. However, some officers purposely ignored the response they were assigned. If
staff responsible for implementing a randomized study ignore assignment procedures, equality
between the two groups (remember, the hallmark of true experiments) may no longer hold (see
Goldkamp, 2008).
III. The Counterfactual
In research, one often hears a term called the counterfactual. This is the idea that what
researchers are really interested in is what would have happened in the absence of x. In simple
terms, if we are studying the effect of intensive probation on recidivism, the counterfactual of
recidivism outcomes for an individual who received the treatment would be recidivism outcomes
for the same individual without treatment. That is, the counterfactual is the absence of treatment
during the same period, for the same individual.
Consider Y, the outcome, t=treatment and c=control. The counterfactual is represented
by:
δi = Yit - Yic     (Equation 5.1)

where δi is the treatment effect, the subscript i indicates individual subjects, and Yit and Yic denote individual i's outcomes under treatment (t) and control (c). Equation 5.1 indicates that each individual has a potential outcome under a) treatment and b) no treatment.
In an ideal world, we would be able to observe δi and have an unbiased estimate of the treatment
effect. However, as should be clear, it is not possible to observe the counterfactual as the same
person cannot be in the experimental and control group simultaneously (Heckman and Smith,
1995; Winship and Morgan, 1999). Because random assignment has been shown to result in generally equivalent groups, it is seen as the closest researchers can come to estimating the counterfactual. Thus, what true experiments produce is given by equation 5.2
ATE = Ȳt - Ȳc     (Equation 5.2)
This provides the average effect of the treatment or average treatment effect (ATE; note
the line above each of the estimators) (see Loughran and Mulvey, 2010). In this equation, the
average scores of those individuals in the control group are subtracted from those of individuals in
the treatment group. Because of randomization, we are provided an unbiased and consistent
estimate of the treatment effect (for a more complete discussion, see Loughran and Mulvey,
2010:167; Winship and Morgan, 1999).
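The logic of Equation 5.2 can be illustrated with a short simulation in Python (the baseline rate, treatment effect, and sample size are all hypothetical): subjects are randomized to treatment or control, and the difference in group means recovers, to within sampling error, the effect that was built into the data.

import random

random.seed(1)

def simulated_outcome(treated, base_prob=0.50, true_effect=-0.10):
    # Hypothetical recidivism indicator (1 = reoffends); treatment lowers the probability.
    prob = base_prob + (true_effect if treated else 0.0)
    return 1 if random.random() < prob else 0

# Randomly assign 10,000 subjects to treatment (e.g., intensive probation) or control.
assignments = [random.random() < 0.5 for _ in range(10_000)]
outcomes = [simulated_outcome(treated) for treated in assignments]

n_treated = sum(assignments)
treated_mean = sum(y for y, t in zip(outcomes, assignments) if t) / n_treated
control_mean = sum(y for y, t in zip(outcomes, assignments) if not t) / (len(assignments) - n_treated)

ate_hat = treated_mean - control_mean   # Equation 5.2: difference in average outcomes
print(round(ate_hat, 3))                # close to the true effect of -0.10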
IV. External Validity
One common critique of experiments is that they lack generalizability. Some argue that
experiments are often too focused (idiographic) and rigid to provide valid estimates of a treatment for more than the sample studied (see Pawson and Tilley, 1994). In order to increase internal validity researchers must have as much control over extraneous factors as possible in true
experiments. As a result, true experiments may not provide a reasonable or realistic estimate of
how the treatment may operate in reality (Bachman and Schutt, 2001; Weisburd, 2000).
Weisburd (2000) argues that experiments can be made more generalizable by designing
heterogeneity into samples and including experiments in the field. In addition, multi-center
trials, in which the same experiment is conducted across various settings at the same time, can
increase external validity (Weisburd and Taxman, 2000). In general, external validity is not a
more serious threat for true experiments than for other types of research designsresearchers
must be cognizant of how their sample is selected and to whom their results can be generalized.
V. The Black Box problem
True experiments suffer from what we call the black box problem. Because true experiments, as we have mentioned, seek to maximize internal validity, their concern is with
demonstrating that x is causally related to y. What is not often a focus of true experiments is why
x and y are causally related. Thus, in the words of Shadish, Cook and Campbell (2002), true experiments provide causal description rather than causal explanation (pp. 9-10). Shadish et
al. provide a nice example of the difference between the two (2002: 9-10):
For example, most children very quickly learn the descriptive causal relationship between flicking a light
switch and obtaining illumination in a room. However, few children (or even adults) can fully explain
why that light goes on. To do so, they would have to decompose the treatment (the act of flicking a light
switch) into its causally efficacious features (e.g., closing an insulated circuit) and its nonessential features (e.g., whether the switch is thrown by hand or a motion detector). They would have to do the same
for the effect (either incandescent or fluorescent light can be produced, but light will still be produced
whether the light fixture is recessed or not). For full explanation, they would then have to show how the
causally efficacious parts of the treatment influence the causally affected parts of the outcome through
identified mediating processes (e.g., the passage of electricity through the circuit, the excitation of photons). Clearly the cause of the light going on is a complex cluster of many factors.
Thus, experiments are generally silent on the causal mechanisms linking x to y (see also
Sampson, Laub and Wimer, 2006). Experiments can build in elements that enable them to examine underlying causal mechanisms, however. For example, some experiments are complex, with
more than two groups. If researchers specify hypothesized causal mechanisms prior to conducting their study, they can test these effects. For example, suppose that researchers are interested in
the effect of education on criminal recidivism. Suppose also that these researchers hypothesize
that education reduces crime because it leads to improved chances of offenders attaining meaningful employment. Employment here represents the causal mechanism or black box linking education to crime. An experiment can be designed wherein subjects are assigned to one of four
groups: 1) educational training alone (Xt1); 2) no training (Xc1); 3) educational training and job
placement assistance (Xt2); 4) job placement assistance alone (Xc2). If the researchers are correct
that education leads to less crime because of employment but we still think that job placement
matters, we would expect the difference in recidivism to be greatest between Xt2 and Xc1. Yet the
factorial design of this study allows a more robust test of the effect of education on recidivism. If
education does not lead to reduced recidivism through (mediated by) employment, but employment is still related to reduced recidivism, then we would expect Xt1= Xt2. Thus, factorial designs, while requiring a larger sample size, provide a more powerful test of experimental conditions (see Bachman and Schutt, 2001). These designs, sometimes called a Solomon Four
Group Design, appear to be relatively rare in criminal justice research, however.
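As a rough illustration of the four-group logic described above, the sketch below (not from the chapter) simulates random assignment to the four conditions and compares recidivism rates. The recidivism probabilities are hypothetical and chosen only so the contrasts discussed in the text are visible.

```python
# A minimal sketch (assumed data, not the authors' design): simulate the four
# conditions and compare mean recidivism across groups.
import numpy as np

rng = np.random.default_rng(0)
n = 400  # hypothetical sample size per condition

# Hypothetical recidivism probabilities: education works partly through employment.
p = {"Xt1": 0.45,   # educational training alone
     "Xc1": 0.55,   # no training
     "Xt2": 0.35,   # educational training + job placement assistance
     "Xc2": 0.45}   # job placement assistance alone

recidivism = {g: rng.binomial(1, p[g], n) for g in p}

for g, y in recidivism.items():
    print(f"{g}: mean recidivism = {y.mean():.3f}")

# The contrast of interest from the text: the largest gap should appear
# between Xt2 (education + placement) and Xc1 (no training).
print("Xt2 - Xc1 difference:", recidivism["Xt2"].mean() - recidivism["Xc1"].mean())
```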
VI. Ethics and True Experiments
Finally, some have worried about the ethicality of conducting experiments, especially
with regard to criminal justice interventions. First, the term "random" sometimes appears to imply "haphazard" or "accidental" (see Rossi, Lipsey and Freeman, 2004). Thus, to those not familiar with social science research methods (such as many criminal justice practitioners), the notion
of a randomized experiment may evoke images of unsystematically assigning treatment. It is,
therefore, essential for researchers to fully explain to staff taking part in studies what true experiments are and why they are appropriate. Second, some have argued that it is unethical to
withhold treatment from individuals (which is the essence of true experiments). When a treatment is known (scientifically, that is) to be efficacious and to produce unambiguous benefits to
individuals, we agree. A randomized study that assigns families in poverty to a) receive food
stamps versus b) no governmental assistance would, in our view, violate ethical conduct. Why?
Because we know that food stamps provide a tangible benefit to individuals. Thus, withholding
them from some individuals in order to study their effects would not be ethical by any standard.
This is essentially what happened in the infamous Tuskegee syphilis study, which began in 1932 and continued for decades after
World War II. In that study, African-American men with syphilis were left without effective treatment so that the course of the disease could be observed. This withholding of treatment resulted in deaths and is thus considered an exemplar of unethical research (Oakes, 2002).
Another ethical issue related to experiments involves deception. In some research
studies, the subjects' knowledge of the treatment under study might influence their outcomes.
Thus, researchers misdirect or misinform the subjects about the intention of the study in order to
arrive at an unbiased estimate of the effect of treatment. A criminological example might be a
study that seeks to examine the effect of violent media on aggression. Subjects who know that
they are being tested with respect to their level of aggression may intentionally act calmer and
more passive. This is an example of a testing effect, described in Table 3.1. Thus, to reduce this
threat to internal validity, researchers might tell subjects that they are testing subject ratings of
favorability of different types of media. When deception occurs in research, it is essential that
subjects are informed of the true intention of the study upon completion. This is called "debriefing" (see Bachman and Schutt, 2001).
We agree with researchers such as David P. Farrington and David Weisburd that true experiments are the most desirable research design available to criminological researchers. In fact,
Weisburd has convincingly argued that in terms of ethical considerations, it is unethical for
criminologists to make recommendations on the basis of non-experimental research, when true
experiments are possible (see Farrington, 2003; Weisburd, 2003). However, in the realm of
criminological research, it is often not practical or feasible to conduct randomized experiments.
It is not possible to test the effect of marriage, for example, by randomly assigning some inmates
to a wedding group versus treatment as usual. Thus, in these situations, modifications to the
true experiment must be made. It is to this subject that we now turn.
VII. Modified Experimental Designs
The distinguishing factor between true experimental designs and quasi-experimental designs (or what we will refer to as modified experiments from now on) is random assignment.
True experiments use a truly random control group and modified experiments use a nonrandom comparison group. The value of the modified design relies on how well the comparison group is matched with the treatment/intervention group. While there are several different
modifications that can be made to experimental designs, they tend to have two general characteristics: 1) they are often retrospective in nature (they occur after the program is in place), and 2) they exhibit
questionable internal validity. Other types of modified experiments rely on statistical controls to
isolate the effect of treatment. The problem with this approach is that variables that potentially
affect treatment outcomes must be specified in advance. Despite these limitations, modified experiments have the potential to provide important and substantive information about program
effects (Bingham and Felbinger, 2002). Some of the most common designs are reviewed in Table
5.2 below.
Table 5.2 Common Modified Experimental Designs

Pretest-Posttest Comparison Group Design
One of the most frequent designs in the social science literature. Here, the evaluator assigns participants to either the experimental group (Group A) or a comparison group (Group B). The groups are matched on a common characteristic that the evaluator wishes to measure (such as race, gender, age, etc.).

              Pretest   Program   Posttest
   Group A       O         X          O
   Group B       O                    O

Interrupted Time-Series Design
The longitudinal or interrupted time-series design allows researchers to use subjects as their own control. This design is stronger than the pretest-posttest design because maturation effects can be controlled.

   Before: O(t-2)  O(t-1)    Program: X    After: O(t+1)  O(t+2)

Interrupted Time-Series Comparison Group Design
This design allows the evaluator to identify treatment effects over time in treated and untreated individuals. Time-series designs are beneficial because they are capable of identifying trends in social processes and are able to filter out "noise" departures from the underlying trend.

   Treated group 1:    O(t1,t-2)  O(t1,t-1)   X1   O(t1,t+1)  O(t1,t+2)
   Treated group 2:    O(t2,t-2)  O(t2,t-1)   X2   O(t2,t+1)  O(t2,t+2)
   Comparison group:   O(c,t-2)   O(c,t-1)         O(c,t+1)   O(c,t+2)

Counterbalance Designs
Each experimental group (E) receives every treatment (X1 through X4), but in a different order, with an observation (O) taken after each treatment.

   E:   X1O   X2O   X3O   X4O
   E:   X2O   X3O   X4O   X1O
   E:   X3O   X4O   X1O   X2O
   E:   X4O   X1O   X2O   X3O
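To make the interrupted time-series logic in Table 5.2 concrete, the following sketch (simulated data, not from the chapter) regresses a monthly outcome on time, a post-intervention indicator, and their interaction; the intervention month and variable names are hypothetical.

```python
# A minimal sketch of a simple interrupted time-series analysis with assumed data:
# the "post" coefficient captures the level shift at the intervention, and the
# interaction captures the change in slope afterwards.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
months = np.arange(24)
post = (months >= 12).astype(int)            # hypothetical intervention at month 12
y = 50 - 0.2 * months - 4 * post + rng.normal(0, 1.5, months.size)

df = pd.DataFrame({"y": y, "t": months, "post": post})
model = smf.ols("y ~ t + post + t:post", data=df).fit()
print(model.summary().tables[1])             # level and slope change after the intervention
```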
REFERENCES
Bachman, R., and Schutt, R. (2001). The Practice of Research in Criminology and Criminal
Justice. Thousand Oaks, CA: Pine Forge Press.
Bingham, R., and C. L. Felbinger. (2002). Evaluation in Practice: A Methodological Approach
(2nd ed.). New York: Seven Bridges Press.
Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-Experimental Designs for
Research. Chicago: Rand McNally.
Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for
field settings. Boston: Houghton-Mifflin.
Farrington, D. (2003). A Short History of Randomized Experiments in Criminology.
Evaluation Review, 27:218-227.
Farrington, D. P., R. F. Loeber, and B. Welsh. (2010). Longitudinal-Experimental Designs. In
A. Piquero and D. Weisburd (Eds.), Handbook of quantitative criminology (pp. 101-121).
New York: Springer.
Fienberg, S. E. and D. V. Hinkley (1980). R. A. Fisher: An Appreciation. New York: Springer-Verlag.
Fisher, R. A. (1935). The Design of Experiments. New York: Hafner.
Goldkamp, J. S. (2008). Missing the Target and Missing the Point: Successful Random
Assignment but Misleading Results. Journal of Experimental Criminology, 4(2).
Hagan, F. E. (2003). Research Methods in Criminal Justice and Criminology (6th ed.). Boston: Pearson Education.
Heckman, J. J. and J. A. Smith (1995). Assessing the Case for Social Experiments. Journal of
Economic Perspectives. 9(2): 85-110.
Heinsman, D. T., and W. R. Shadish. (1996). Assignment Methods in Experimentation: When do
Nonrandomized Experiments Approximate Answers from Randomized Experiments.
Psychological Methods 1(2): 154-169.
Hirschi, T. and H. Selvin. (1967). Delinquency Research: An Appraisal of Analytic Methods. The
Free Press.
Isaac, S. and W. B. Michael. (1995). Handbook in Research and Evaluation (3rd ed.), pp. 35-45
(Ch. 3, Planning Research Studies) and pp. 237-245 (Ch. 9, Criteria and Guidelines for Planning, Preparing, Writing, and Evaluating the Research Proposal, Report, Thesis, or Article). San Diego, CA: EdITS.
Maxfield, M., and E. Babbie. (1998). Research Methods for Criminal Justice and Criminology.
(2nd ed.). Belmont, CA: Wadsworth.
Oakes, J. M. (2002). Risks and Wrongs in Social Science Research: An Evaluator's Guide to
the IRB. Evaluation Review, 26:443-479.
Pawson, R. and N. Tilley. (1994). What Works in Evaluation Research? British Journal of
Criminology. 34:291-306.
Petersilia, J. (1989). Implementing Randomized Experiments: Lessons from BJA's Intensive
Supervision Project. Evaluation Review. 13(5):435-458.
Rossi, P. H., M. W. Lipsey, and H. E. Freeman. (2004). Evaluation: A systematic Approach (7th
ed.). Thousand Oaks, CA: Sage Publications.
Sampson, R. J., J. H. Laub, and C. Wimer (2006). Does Marriage Reduce Crime? A
Counterfactual Approach to Within-Individual Causal Effects. Criminology, 44(3): 465-508.
Human behaviour occurs in context; it is influenced by the setting and the internalised norms the person associates with it.
Experimental research affects the findings through the artificial setting; likewise, surveys set out how the researcher interprets the phenomenon of interest, and the participant has to fit their experience into that.
Qualitative research allows the researcher to understand the framework within which
people interpret their feelings, thoughts, or actions.
Meanings and processes can be identified, explored, and understood (Wilson, 1977).
REFERENCES
Bachman, R., and Schutt, R. (2003). The Practice of Research in Criminology and Criminal
Justice (2nd Ed.). Thousand Oaks, CA: Pine Forge Press
Geertz, C. (1973). Thick Description: Toward an Interpretive Theory of Culture. In C. Geertz
(Ed.), The Interpretation of Cultures: Selected Essays. New York: Basic Books
Kahn, R., and Cannell, C. (1957). The Dynamics of Interviewing. New York: John Wiley
Marshall, C., and Rossman, G. (1989). Designing Qualitative Research. Newbury Park, CA:
Sage Publications
Marshall, C. (1985). Appropriate Criteria of Trustworthiness and Goodness for Qualitative
Research on Educational Organizations. Quality and Quantity, 19, 353-373
Wilson, S. (1977). The Use of Ethnographic Techniques in Educational Research. Review of
Educational Research, 47(1).
A merely successful evaluation, in contrast, falls short of providing the best information possible
given the constraints but provides better information than would have otherwise been available
(p. 5). This suggests that the proper measure of the success of an evaluation is whether it adds to
the current knowledge, rather than what might be "nice to know" (Berk and Rossi, 1999, p. 5).
Additionally, some argue that in order to be successful, an evaluation must include some
form of advocacy or be implemented by policymakers. While these are, without a doubt, important components of evaluation, it is a risky prospect for evaluation researchers and can even
compromise the research. The evaluation can be compromised if it appears that the researcher
has a position on the issue or if it appears that the research has been tailored to one side or the
other to ensure its use (Berk and Rossi, 1999). In the view of Berk and Rossi (1999), evaluation
can be successful even if it is ignored or even if it is misused by stakeholders. Once the findings of
the evaluation are presented to the interested parties in a clear manner, the evaluation has been
concluded (Berk and Rossi, 1999).
A Brief History of Program Evaluation
Although program evaluation is a relatively recent development, the activities that make
up program evaluation are not. In fact, the roots of evaluation research extend to the 17th century, though evaluation as it is currently known is a relatively modern development. The systematic evaluation of social programs first became common in the fields of education and public
health (Rossi, Lipsey, and Freeman, 2004). The field of applied social research grew rapidly as a
result of the boost it received following its contributions during World War II. After World War
II, many federal and privately funded social programs were launched, providing services such as
urban housing, education, occupational training, and health services. These new programs required evaluation and, as a result, by the end of the 1950s program evaluation commonplace
(Rossi, Lipsey, and Freeman, 2004).
The 1960s arrived with an increase in the number of books and articles focusing on evaluation research. By the end of the decade, evaluation research represented a growth industry. The
large amount of interest in program evaluation was sparked, in part, by President Lyndon Johnson's federal "war on poverty" and the corresponding programmatic remedies (Rossi, Lipsey, and
Freeman, 2004). By the early 1970s, program evaluation research had emerged as a specialty
field. Special sessions focusing on evaluation research became commonplace at professional
meetings and conferences. In addition, professional associations were also founded (Rossi, Lipsey, and Freeman, 2004).
Eventually, changes began to occur in the field of evaluation research. Initially, the interests of researchers shaped the field. However, this evolved to a point
where the interests of the consumers of the evaluation came to shape the research (Rossi, Lipsey, and
Freeman, 2004). While the results of these evaluations are not often newsworthy, they are of
great importance to those directly or indirectly affected by the program, including concerned citizens, program sponsors, and policymakers (Rossi, Lipsey, and Freeman, 2004). As a result of the
changes that have occurred in the arena of evaluation research, program evaluations have moved
beyond the world of academic social science into the arena of political and policy decisions
(Rossi, Lipsey, and Freeman, 2004).
levels of coverage (Rossi, Lipsey, and Freeman, 2004). Bias, as applied to program coverage, is defined as the extent to which subgroups of a target population are reached unequally by a program. This can best be uncovered using comparisons of program users, eligible nonparticipants,
and dropouts (Rossi, Lipsey, and Freeman, 2004).
The task of monitoring a program's organizational functions has the purpose of determining how successful the program is at organizing its efforts and utilizing resources to achieve the
stated goals. Attention is given to identifying weaknesses and problems in the implementation of
the program that would impede the program's services from reaching the intended population
(Rossi, Lipsey, and Freeman, 2004). Some potential sources of implementation failure include
incomplete intervention, delivery of the wrong intervention, and unstandardized or uncontrolled
interventions (Rossi, Lipsey, and Freeman, 2004).
ASSESSING IMPACT OF PROGRAMS
Randomized Field Designs
According to Rossi et al., the purpose of impact assessments is to determine the effects
that programs have on their intended outcome and whether there are unintended effects (Rossi,
Lipsey, and Freeman, 2004, p. 234). It is possible to conduct impact assessments at various
stages of a program; however, since rigorous impact assessment requires the use of significant
resources, one must consider whether their use is justified by the circumstances. The methodological concepts that underlie all research designs used in impact assessment come from the logic of
randomized experiments. An essential feature of this is the use of random assignment to divide
subjects into intervention and control groups. In quasi-experiments, subjects are assigned using
something other than true random assignment. In these experiments, the evaluator must decide
what constitutes a suitable research design, keeping in mind that compromises are always inherent in the construction of a design to a certain extent (Rossi, Lipsey, and Freeman, 2004).
Randomized experiments represent an ideal choice for impact assessment because they
provide the most credible conclusions about program effects when the experiments are conducted well. The primary advantage of implementing a randomized experiment is the fact that
the effect of the intervention is isolated, ensuring that the intervention and control groups are statistically equivalent with the exception of the intervention received (Rossi, Lipsey, and Freeman,
2004). There are some procedures that can produce circumstances that are acceptable approximations to randomization, for example, assigning every other name on a list or assigning clients
to a program based on the program's ability to take additional people at a given time. However,
these alternatives are only suitable if they can generate intervention and control groups that do
not differ on any characteristics that are relevant to the expected outcome (Rossi, Lipsey, and
Freeman, 2004). The level of precision in the measurement of the outcome of an intervention
can be increased through the use of several measurements, including measures taken before an
intervention, during the intervention, as well as after the intervention. The use of multiple measures enables evaluators to more precisely determine how the intervention worked over time (Rossi,
Lipsey, and Freeman, 2004).
Although they are the most rigorous, randomized experiments may not be feasible or appropriate for all evaluations. The results may be ambiguous if the experiment is conducted in the
early stages of a program, when interventions change in ways that experiments cannot easily
capture. Additionally, stakeholders may be hesitant to allow randomized experiments if they feel
that they are engaging in unfair or unethical conduct by withholding the intervention from the control group (Rossi, Lipsey, and Freeman, 2004). It is important to keep in mind that, for all their
positives, experiments are resource intensive: they require technical expertise, research resources,
and time, as well as tolerance from the programs being studied, since their normal procedures are being
disrupted (Rossi, Lipsey, and Freeman, 2004). Experiments also have the potential to create artificial situations; for instance, the delivery of the program during the experiment may differ from
how the program is actually delivered (Rossi, Lipsey, and Freeman, 2004).
Alternative Designs
While randomized experiments are the strongest methodology for measuring the strength
of the impact of a program, there are several quasi-experimental methodologies that are also
potentially valid. These methodologies can be used when randomized experiments are not feasible or are not appropriate. A major concern that evaluators have in any impact assessment is to
reduce the bias in the estimate of program effects. As discussed earlier, bias is defined as the extent to which subgroups of a target population are reached unequally by a program (Rossi, Lipsey, and Freeman, 2004). There are several potential sources of bias in quasi-experimental designs, including selection bias, secular trends, interfering events, and maturation. Rossi et al.
(2004) define selection bias as the systematic under- or over-estimation of program effects resulting from uncontrolled differences between the intervention and control groups that would result
in differences between the groups even if the intervention were not present (Rossi, Lipsey, and Freeman,
2004).
The intervention and control groups are created using methods other than random assignment in quasi-experimental designs. As a result, there is not an assumption of equivalence
between these groups. Differences may exist between the groups that would result in differences
in outcome even if the intervention were not applied. Thus, appropriate procedures must be applied to adjust for these differences in estimations of program impacts (Rossi, Lipsey, and Freeman, 2004). In one variety of quasi-experimental methodology, matched controls are implemented. The control group is constructed by matching program nonparticipants with program
participants. This procedure can be done either on the individual-level or on the aggregate-level.
Additionally, the variables that are used in the matching procedure must include all those strongly related to outcome on which the groups would otherwise differ in order to avoid bias (Rossi,
Lipsey, and Freeman, 2004).
The intervention and control groups can also be equated through the use of statistical controls. As with other methodologies, any differences that the groups share on variables relevant to
the outcome must be identified and included in the statistical tests (Rossi, Lipsey, and Freeman,
2004). Ideally, and when it is possible, participants should be assigned to the intervention and
control groups based on quantitative measures. For example, assignment based on measures of need or merit yields estimates that are less
susceptible to bias than those from other quasi-experimental designs. It is appropriate to use
quasi-experimental designs when randomized experiments are not feasible but considerable efforts must be taken to minimize the potential for bias. It is also important to acknowledge the
limitations of quasi-experimental methodologies (Rossi, Lipsey, and Freeman, 2004).
COST-EFFECTIVENESS
Cost-benefit analysis is a useful quantitative tool for program evaluators. These analyses are especially useful when used in evaluations of existing programs to assess their success
or failure, to determine whether the program should be modified or continued, as well as to assess the likely consequences of changes to the program (Kee, 1994). There are three main steps to
cost-benefit analysis. First, the evaluator must determine the benefits of a proposed or existing
program and place a dollar value on these benefits. Second, the total costs of the program must
be calculated. Finally, the total benefits and total costs must be compared (Kee, 1994).
While these steps seem straightforward they can be quite challenging. It can sometimes
be difficult to determine the appropriate unit of analysis. Even if this is possible, placing a dollar
value on this unit can be quite challenging (Kee, 1994). An additional benefit of conducting cost-benefit analyses is that the procedure can illuminate important issues and may even lead to an
implicit valuation of some intangible ideas that are obscured by rhetoric (Kee, 1994, p. 457).
There are several types of costs and benefits that can be identified during this procedure.
Direct benefits and direct costs are those that are closely related to the main objective of the
program. In contrast, indirect benefits and indirect costs are spillover or investment effects of
the project or program. Additionally, costs and benefits can also be tangible or intangible. Tangible benefits and tangible costs are those that can be easily converted into dollars or an equivalent of dollars while intangible benefits and intangible costs are those you cannot or choose not
to assign an explicit price to (Kee, 1994).
After the evaluator has determined the range of costs and benefits associated with the
program in question and has assigned values to the costs and benefits, the next step is to present
the information to the decision maker. Kee (1994) argues that there are three ways in which this
can be done. The first option is a retrospective analysis, which involves looking at historical data
on benefits and costs and converting them into net present values for the program. The second
option, a snapshot analysis, simply looks at the costs and benefits for the current year. The third
and final option presented is a prospective analysis, which consists of an analysis that projects
future benefits and costs of the program based on the retrospective analysis (Kee, 1994).
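As a rough illustration of the comparison step with entirely made-up figures, the sketch below discounts hypothetical yearly benefits and costs to present value and reports the net present value and benefit-cost ratio; the dollar amounts and discount rate are assumptions, not from Kee (1994).

```python
# A minimal sketch with hypothetical figures: discount projected benefits and costs
# to present value and compare them, as in a prospective cost-benefit analysis.
def present_value(flows, rate):
    """Discount a list of yearly dollar flows (year 1, 2, ...) at the given rate."""
    return sum(f / (1 + rate) ** t for t, f in enumerate(flows, start=1))

benefits = [120_000, 150_000, 150_000]   # hypothetical yearly program benefits
costs    = [200_000,  60_000,  60_000]   # hypothetical yearly program costs
rate = 0.05                               # assumed discount rate

pv_benefits = present_value(benefits, rate)
pv_costs = present_value(costs, rate)
print(f"Net present value: {pv_benefits - pv_costs:,.0f}")
print(f"Benefit-cost ratio: {pv_benefits / pv_costs:.2f}")
```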
FUTURE OF EVALUATION
Rossi et al. (2004) suggest that there are a variety of reasons to believe that the field of
evaluation research will continue to grow in the future. First, stakeholders such as planners, staff,
and participants are increasingly skeptical about using common sense as a sufficient basis for the
design of social programs that will actually have the ability to achieve their intended goals (Rossi, Lipsey, and Freeman, 2004). This skepticism has led policymakers to seek out methods of
learning from past mistakes and to more quickly identify which measures work. When programs that work are identified, they can then be enhanced and used to their full potential (Rossi,
Lipsey, and Freeman, 2004).
A second reason to expect continued growth in the area of evaluation research is the ever-increasing sophistication of knowledge and technical procedures in the social sciences. These
new methodologies become a more powerful means of testing social programs when paired with
more traditional methods (Rossi, Lipsey, and Freeman, 2004). Finally, there have also been
changes in the political and social climate that are favorable to the increased use of evaluation
research. There is a desire to fix the problems that ail society, though the variety and number of
concerns that demand the attention of social science researchers can be overwhelming (Rossi,
Lipsey, and Freeman, 2004).
REFERENCES
Berk, R. A., and P. H. Rossi. (1999). Thinking about program evaluation (2nd ed.). Thousand
Oaks, CA: Sage Publications.
Kee, J. E. (1994). Benefit-cost analysis in program evaluation in J. S. Wholey, H. P. Hatry,
and K. E. Newcomer (eds.) Handbook of Practical Program Evaluation. San Francisco:
Jossey-Bass Publishers.
Rossi, P. H., M. W. Lipsey, and H. E. Freeman. (2004). Evaluation: A systematic Approach (7th
ed.). Thousand Oaks, CA: Sage Publications.
Scheirer, M. (1994). Designing and using process evaluation in J. S. Wholey, H. P. Hatry,
and K. E. Newcomer (eds.) Handbook of Practical Program Evaluation. San Francisco:
Jossey-Bass Publishers.
Shadish, W. R., T. D. Cook, and L. C. Leviton. (1991). Foundations of program evaluation:
Theories of practice. Newbury Park, CA: Sage Publications.
Note: The final two chapters are companion pieces that discuss relatively recent or rarely used
statistical analytic techniques. The first chapter covers time-series analysis, hierarchical linear
models, and Poisson regression. The second chapter discusses meta-analysis, propensity score
matching, survival analysis and spatial regression techniques. MR/CP
CHAPTER 8. NEWER STATISTICAL METHODS (PART I)
By Diana Summers
Time Series Analysis
Time series analysis is a type of regression model where observations are ordered in time
and therefore cannot be treated as statistically independent. These observations can be a person,
organization, nation, aggregated arrests, etc., and are usually reported on a consistent basis (e.g.,
yearly, monthly, quarterly, daily). Time series analyses are primarily used to aid in forecasting,
and originated in the field of economics. However, the field of criminology has benefited from
time series analysis in studying the nature of trends (in number of offenses, number of convictions, etc.). This methodology was developed to decompose a series into trend, seasonal, cyclical
and irregular components. These components of the series are each a type of difference equation, which expresses the value of a variable as a function of its own lagged values, time and
other variables. Uncovering these paths in a series improves forecasting accuracy since each of
the predictable components can be extrapolated into the future. It is possible to estimate the
properties of a single series or a vector containing many interdependent series; however, this discussion will continue with univariate time series analyses. In addition, discrete time series analyses will be discussed here, as most researchers analyze discrete time series and not continuous
time series.
It is not generally reasonable to suppose that the errors in a time series regression are independent, since time periods close together are more likely to be similar than points in time that
are relatively isolated. This similarity can extend to the errors, which represent the omitted causes of the response variable. It is therefore important to test for, and if necessary correct for, autocorrelation by analyzing the residuals (e.g., with the Durbin-Watson test statistic).
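As a rough illustration (simulated data, not from the chapter), the sketch below fits an ordinary time-series regression and checks the residuals with the Durbin-Watson statistic using statsmodels.

```python
# A minimal sketch with assumed data: fit an OLS time-series regression and check
# the residuals for first-order autocorrelation with the Durbin-Watson statistic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
t = np.arange(120)                          # e.g., monthly observations
x = rng.normal(size=t.size)
y = 2.0 + 0.5 * x + np.cumsum(rng.normal(scale=0.3, size=t.size))  # autocorrelated errors

model = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit()
dw = durbin_watson(model.resid)             # values near 2 suggest little autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")
```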
The forecasting model most often utilized in time series analyses is the ARIMA (autoregressive integrated moving average) model. Properties of stationarity are considered in this model. When an ARIMA series is stationary, the model becomes an ARMA (autoregressive moving average) model.
Stationary: the unit-root null hypothesis (H0) is rejected; the trend is mean reverting, allowing researchers to proceed with further statistical testing.
Non-stationary: researchers fail to reject H0; the trend is stochastic (random), requiring researchers to further manipulate the data (e.g., by first-differencing).
The first component of the ARIMA model (AR) considers that time series processes can
be influenced by past events or observations. It is often assumed that elements of an observed
time series are outcomes or realizations of a stochastic process. However, in econometrics this is
a more general assumption than in other fields like criminology (GNP is arguably more stably
collected than crime-related data, and people can be more easily influenced on topics such as incarceration rates and drug abuse). A discrete variable y is said to be stochastic if for any real
number r there exists a probability p(y ≤ r) that y takes on a value less than or equal to r. It is typically implied that there is at least one value of r for which 0 < p(y ≤ r) < 1. If there is some r
for which p(y = r) = 1, y is deterministic rather than stochastic. In discussing stochastic time-series models, white-noise processes should also be mentioned. A white-noise process occurs if
each value in the sequence has a mean of 0, a constant variance, and is serially uncorrelated.
The second component of an ARIMA model is the integrated process. This simply means
that the mean, variance, and covariance are not constant over time. When this is not the case, the
series is stationary and an ARMA model is employed instead.
The third component of an ARIMA model is the moving average (MA). This implies that
time series processes are driven by various shocks to the time series data. These shocks can be
defined as any major event or occurrence that can potentially significantly affect the time series
data. Mathematically, the moving average is described below:
Consider the following process built from white noise εt:
Yt = μ + εt + θεt-1,
where μ and θ could be any constants. This is an example of a first-order moving average (MA) process, where "moving average" comes from the fact that Yt is constructed from a
weighted sum, akin to an average, of the two most recent values of ε (Hamilton, 1994: 48).
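A minimal sketch of this equation, with arbitrary constants chosen only for illustration:

```python
# Simulate the first-order moving average process Y_t = mu + e_t + theta * e_{t-1}
# described above, with arbitrary constants.
import numpy as np

rng = np.random.default_rng(0)
mu, theta, n = 10.0, 0.6, 200      # arbitrary constants and series length
e = rng.normal(0, 1, n + 1)        # white noise: mean 0, constant variance, uncorrelated
y = mu + e[1:] + theta * e[:-1]    # each Y_t mixes the two most recent shocks

print(y[:5])
```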
A more formal, more statistically rigorous method used for forecasting and defining the
nature of time series data is known as unit root testing. When testing for the existence of unit
roots, Phillips-Perron or Augmented Dickey-Fuller tests can be employed. These tests help determine whether any shock to the time series will produce a temporary or permanent effect.
When a unit root test yields a stationary time series, any shock will have a temporary effect on
the variable. The effects of the shocks will be brief and will dissipate over time, causing the trajectory of the data to revert to its original mean. From these results, it is possible to then
calculate the approximate length of time the effects would last by employing the ARIMA model
and examining the size of the slope coefficient. If any shock to the variable is found to permanently affect the trajectory (non-stationary), the variable will never return to some form of long-run mean (Enders, 1995).
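As a rough illustration (the series is simulated, not real crime data), the sketch below runs an Augmented Dickey-Fuller unit root test and then fits a simple ARIMA model with statsmodels.

```python
# A minimal sketch: test for a unit root with the Augmented Dickey-Fuller test,
# then fit a simple ARIMA model to a simulated series.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200)) + 100   # random-walk-like (non-stationary) series

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
# A large p-value means we fail to reject the unit-root null: first-difference the data.

model = ARIMA(series, order=(1, 1, 1)).fit()      # ARIMA(p=1, d=1, q=1)
print(model.summary().tables[1])
```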
The value of unit root testing in criminology and criminal justice is evident, especially in
areas of policy evaluation. For instance, after conducting unit root tests on related time series
data, if the series is found to be stationary and the effects of the shock have dissipated, new policies would have to be enacted to reapply the effects. This result would also indicate some level
of predictability in the variable.
Other time series tests that involve multivariate analyses and may prove useful to criminal justice researchers are:
Vector Autoregression (VAR): particularly helpful in econometrics for estimation and forecasting. It has been described as a natural extension of the univariate autoregressive model, and is user-friendly. VAR forecasts are superior in some ways to univariate time series models, as they allow for more flexibility because they can be made conditional on the potential future paths of specified variables in the model (a brief sketch appears below).
Kalman Filtering: a discrete data filter composed of a set of equations that
provides an efficient recursive means to estimate the state of a process in a way
that minimizes the mean of the squared error. In this state-space system, one of
the ultimate objectives is to estimate the values of any unknown parameters in
the system on the basis of the given observations. It supports estimations of past,
present, and even future states, and can do so even when the precise nature of the
modeled system is unknown (Welch and Bishop, 2006). This would be helpful in
identifying missing observations in criminal justice time series data, and would
aid in forecasting efforts.
Statistical software packages for time series analysis include E-views, STATA, SAS, and
Shazam.
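The brief VAR sketch mentioned above is shown here; the two series (arrests and unemployment) are hypothetical, and the lag order is chosen only for illustration.

```python
# A minimal sketch with simulated data: fit a two-variable vector autoregression
# and produce a short forecast with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 120
arrests = 500 + np.cumsum(rng.normal(0, 5, n))        # hypothetical monthly arrests
unemployment = 6 + np.cumsum(rng.normal(0, 0.1, n))   # hypothetical unemployment rate
data = pd.DataFrame({"arrests": arrests, "unemployment": unemployment})

results = VAR(data).fit(maxlags=2)                     # small lag order for illustration
forecast = results.forecast(data.values[-results.k_ar:], steps=6)
print(forecast)                                        # six-step-ahead forecasts for both series
```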
Hierarchical Linear Modeling (HLM)
Hierarchical Linear Modeling (HLM) refers to a type of regression analysis that involves
modeling multilevel data that are inherently hierarchical. The focus of HLM is to appropriately
model relationships between variables reflecting different levels of analysis. For instance, individuals are nested or grouped within larger units, such as a neighborhood or work group. The
hierarchical nature of this type of data leads to problems with employing traditional regression
models, because the individual units of analysis that are grouped within larger units of analysis
cannot be considered independent. Individual units of analysis tend to be more similar to each
other than separate units randomly sampled from an entire population. For example, individuals
sampled from one particular work group are more similar to each other than to individuals randomly sampled from the entire company or group of companies. This is because people are not
randomly assigned to a company or work group; rather, they are selected based on skill set and
other qualifying factors.
HLM provides a method to overcome this problem of non-independent observations;
using OLS regression instead would produce standard errors that are too small and therefore a
higher probability of rejecting the null hypothesis (i.e., an inflated Type I error rate). In the traditional OLS approach, all the regression parameters are fixed, so that if a two-level approach
(such as the example described above) were utilized, the variance components would not be
separable from the individual level residual. HLM software uses a maximum likelihood estimation of the variance components, generalized least squares estimates of the level-two regression
parameters, and can yield empirical Bayes estimates of the level-one regression parameters
(Hofmann and Gavin, 1998: 626).
Below is the Level 1 regression equation:
Yij = B0j + B1j * X1ij + B2j * X2ij + rij
Where i refers to the person number and j refers to the group number.
Researchers might use HLM when determining the success or failure of ethics workshops
among certain work groups. However, since data is collected at an individual level, issues of accounting for this cross-level data arise. HLM allows researchers to separate individual and group
effects on the outcome, instead of either aggregating individuals up one level or reducing higher-level variables down to individual levels (see Bryk and Raudenbush, 1992, for further discussion).
Statistical software packages for HLM include HLM, SPSS and MLWin.
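As a rough illustration of the two-level setup described above, the sketch below fits a random-intercept model with statsmodels' MixedLM as a stand-in for dedicated HLM software; the work-group example follows the text, but the data and coefficients are simulated.

```python
# A minimal sketch with simulated data: a two-level random-intercept model
# (individuals nested in work groups) fit with statsmodels' MixedLM.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 20, 30
group = np.repeat(np.arange(n_groups), n_per)          # work-group identifier (level 2)
group_effect = rng.normal(0, 2, n_groups)[group]       # unobserved group-level variation
x1 = rng.normal(size=group.size)                       # individual-level predictor
y = 5 + 0.8 * x1 + group_effect + rng.normal(size=group.size)

df = pd.DataFrame({"y": y, "x1": x1, "group": group})
model = smf.mixedlm("y ~ x1", df, groups=df["group"]).fit()
print(model.summary())                                 # fixed effect for x1 plus group variance
```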
Poisson Regression
Poisson regression (or log linear model) is a member of a family of analyses known as
the generalized linear model where OLS regression is generalized for use with different types
of error structures and dependent variables (Coxe, West and Aiken, 2009). It is based on the
Poisson distribution, and is designed for use with count data where the dependent variable can
only take on non-negative integer values. With the basic Poisson specification, it is assumed that
the variance of the variable is equal to the mean. It is also a nonlinear, univariate distribution.
The count data reflect the number of occurrences of a behavior in a fixed period of time (e.g., the
number of drug-related arrests for an individual over the past 12 months), and Poisson regression
analysis allows for the investigation of individual factors affecting the particular count variable.
Coxe, West and Aiken (2009) warn against trying to use a count variable as an outcome
or criterion variable in OLS regression, as it can cause major problems. When the mean of the
outcome variable is relatively high, OLS regression can be applied with minimal difficulty.
However, when the mean of the outcome is low, OLS regression produces undesirable results,
including biased standard errors (p. 121). The Poisson distribution increasingly resembles the
normal distribution as the expected mean value becomes larger. Generally, a Poisson distribution
with an expected value greater than 10 will appear similar to a normal distribution in shape and
symmetry. A count variable with a very low mean count will be skewed to the right and highly
asymmetric. See Figure 8.1 below for a visual representation (as provided by Coxe, West and
Aiken, 2009):
Figure 8.1. Distributions of Arrest Counts
Even though equations for Poisson distributions may appear very similar to OLS regression equations, the predicted score is not itself a count but rather a natural logarithm of the count.
Thus it is said that Poisson regression is linear in the logarithm when given the correct combination of independent variables (Coxe, West and Aiken, 2009: 124).
Below is the equation for the probability density of a variable y with a Poisson distribution, where λ is the rate parameter:
P(y | λ) = (e^-λ · λ^y) / y!
If E(y) = λ, the Poisson process is modeled as:
ln λ = bX
Note: with the basic Poisson specification it is assumed that the variance of the variable is
equal to the mean:
Var(y) = E(y) = λ
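As a rough illustration (simulated count data, hypothetical predictor), the sketch below fits a Poisson regression with statsmodels' GLM, in the spirit of the arrest-count example discussed above.

```python
# A minimal sketch with simulated data: a Poisson regression modeling arrest counts
# as a function of one predictor; coefficients are on the log-count scale.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                      # e.g., a standardized risk score (hypothetical)
lam = np.exp(0.2 + 0.5 * x)                 # ln(lambda) is linear in x
arrests = rng.poisson(lam)                  # non-negative integer counts

X = sm.add_constant(x)
model = sm.GLM(arrests, X, family=sm.families.Poisson()).fit()
print(model.summary().tables[1])
```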
REFERENCES
Hofmann, D. and M. Gavin. (1998). Centering Decisions in Hierarchical Linear Models: Implications for Research in Organizations. Journal of Management, 24(5), 623-641.
Osborne, J. (2000). Advantages of Hierarchical Linear Modeling. Practical Assessment,
Research and Evaluation. 7(1).
Welch, G. and G. Bishop. (2006). An Introduction to the Kalman Filter. TR 95-041. University
of North Carolina at Chapel Hill.
Meta-analysis
Propensity Score Analysis with a Mahalanobis Distance Matching extension
Cox Proportional Hazards Regression model (Survival Analysis)
Spatial Regression Models
Introduction:
As the second person investigating newer statistical models, I have chosen an area typically outside of standard fare for criminology. What follows is a selection of techniques used in
research coming from the medical arena. I decided to look at this area because it has a reputation for a
more sophisticated use of statistics, and for models that many in our field, and the readings this semester,
regard as a gold standard. It, of course, cannot be said that this selection is in any way random or
representative. Further, this section does not consider research reported using a true experimental model. The reasoning for this omission is that such designs seemed superfluous to the world of
criminological study, and the resulting methods seemed overly simplistic compared to the realities of implementing even a treatment design in criminological research.
This said, there are methods that are more reflective of the issues many criminologists
face in the reality of their own research. The selected research methods reported here have been
employed in the study of epidemiology, a sub-section of the medical realm that has many parallels to the study of crime. It must be remembered by the reader that the studies here are a very
small selection of the papers available from this one journal in the past year. The American Journal of Epidemiology is a bi-weekly publication with approximately fifteen papers per issue on
new research alone, and one to two pooled, meta-analysis papers offered in each issue. That is
approximately 390 articles to be considered, an amount far in excess of the restricted needs of
this section.
Lessons that Criminal Justice Could Learn From Epidemiology:
Before looking at the methods employed in the chosen articles, it is worth noting that the field of criminal justice could take a leaf from the medical field's book. The first thing one notices is the sheer volume and pace of publication. The high number of articles published
could be a double-edged sword in criminology, possibly encouraging lower-quality work.
However, one assumes the same screening process occurs, and the articles are forced to be more
concise as a result. The flow-on effect is illustrated by the meta-analysis summary provided
later in this section. With such a large number of publications (in the heart disease example, over 1,000 studies were found and over
700 could be included), this mechanism builds the body of
evidence for one question or another at a more rapid rate than we experience in the criminal justice
field. Further, the journal has a dedicated section for pooled and meta-analyses to encourage the
analysis and discovery of the direction of the accumulating body of evidence. How much more
certain would the criminologist be of the causes of crime if the criminal justice field followed the
whole medical model and did not simply place faith in its methods?
Further, medical researchers also seem to adhere to the strict rules about models and which
statistics to employ more so than much criminological research.4 Rather than checking whether
violations skew the data to an extent that forces the researcher to shift to more complicated or
less sensitive models, the medical researcher will rely on multiple methods and other techniques
to make the results comparable, or simply report the whole picture created.
The Following Sections:
The next section summarizes some of the methods employed in the medical research. Not
all are new, but nor, as far as this reader is aware, are they all common in criminal justice research. The summary notes also illustrate how the medical researchers employ these methods and what complementary methods are used alongside them, giving a fuller picture of the actual implementation of the broad methods discussed.
Meta Analysis:
While not a new analysis technique, meta-analysis seems to be something of a rarity in
criminological research, especially when compared to medical research. The technique could
be useful to the criminological field, which could start conducting these analyses on a more regular basis
and give them more prestige. Meta-analysis seems to be a prevalent way in which a field
we often look to emulate checks to see in what direction its body of findings is pointing.
What is Meta Analysis?
Meta-analysis is essentially a synthesis of the literature/previous research. Ideally, randomized trials are used, but many other methods have been developed to use other types of studies, as
is evidenced in the first summary in Part III of the Newer Statistical Methods section. Generally,
when non-randomized samples are used in the employed data, the effect sizes reported
must, as a minimum requirement, control for any theoretically important confounders.
The broad and general steps include a search and a recording of the search strategy (captured in the abstract of the summary sheet). Once this is conducted, the set of studies eligible
for inclusion is reduced according to the comparability of the studies and whether they offer the
statistics required for a meta-analysis.
Once the researcher has entered the required data into the chosen program (usually
SAS or Stata in the medical research sampled), the researcher looks at effect sizes,
expressed either as rate differences, relative risks, or rate ratios (relative risk seems to be common in epidemiology). The type of model selected depends on the intention of the study. A fixed-effects model states that the conclusions are correct for the studies in the analysis, while random-effects models
assume the studies are a random sample of a universe of studies. After obtaining the effect size, the
confidence intervals and the Q-statistic for homogeneity, and its proportional effect on the regression slope, are determined. Finally, the meta-analysis (like medical research using many other models)
includes sensitivity tests to check the validity of the model.
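As a rough illustration of these steps with entirely hypothetical study results, the sketch below computes an inverse-variance (fixed-effect) pooled estimate of a log relative risk and Cochran's Q for homogeneity; these are standard calculations, not taken from any particular study summarized here.

```python
# A minimal sketch with made-up study results: fixed-effect pooling of log relative
# risks by inverse-variance weighting, plus Cochran's Q test for homogeneity.
import numpy as np
from scipy import stats

log_rr = np.array([0.25, 0.10, 0.40, 0.18, 0.30])   # hypothetical study effect sizes
se = np.array([0.12, 0.09, 0.20, 0.15, 0.11])       # their standard errors

w = 1 / se**2                                        # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
q = np.sum(w * (log_rr - pooled)**2)                 # Cochran's Q (homogeneity)
p_q = stats.chi2.sf(q, df=len(log_rr) - 1)

print(f"Pooled log RR = {pooled:.3f} "
      f"(95% CI {pooled - 1.96*pooled_se:.3f}, {pooled + 1.96*pooled_se:.3f})")
print(f"Q = {q:.2f}, p = {p_q:.3f}")
```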
Limitations:
The limitations of meta-analysis are nicely summarized by the quote in the summary limitation section below from the example study. Also, it must be remembered that meta-analysis
4. It is accepted that the author has only their own anecdotal familiarity with criminological research.
cannot control for any confounds not controlled for in the studies themselves. A further limitation of meta-analysis, as discussed below, derives from publication bias, as well as any biases
included but unknown in the studies employed. A technique used in some epidemiology meta-analyses to check for bias is to create a plot with the effect size on the X axis and the sample size
on the Y axis. If the plot resembles an upside-down funnel, no bias is indicated.
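A rough sketch of such a funnel-style plot, using simulated studies rather than any of the results summarized here:

```python
# A minimal sketch with simulated studies: plot effect size against sample size to
# eyeball publication bias, as described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_studies = 80
sample_sizes = rng.integers(50, 2000, n_studies)
true_effect = 0.2
effects = true_effect + rng.normal(0, 1, n_studies) / np.sqrt(sample_sizes)

plt.scatter(effects, sample_sizes, s=15)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("Effect size")
plt.ylabel("Sample size")
plt.title("Funnel plot: symmetry around the pooled effect suggests little bias")
plt.show()
```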
Figure 9.3: Number of Publications that used Propensity Score Analysis in PubMed.
What is Propensity Score Analysis?
In observational studies, or even evaluation studies, the groups may not be comparable for
many reasons covered in research texts (such as feasibility, ethics, or non-compliance by
administrators). Ultimately, we as researchers are interested in knowing whether the outcomes are due
to the treatment, and non-randomized experiments/studies make such assertions questionable.
As noted, PSA and its extensions, of which MDM is one, allow us to reduce the questionability
of the assertions made using observational data.
The first step is to model the non-random variables that set a person's propensity to be
selected for the treatment. Once this propensity score is created, the researcher can then use
matching protocols between treatment receivers and counterfactuals (Piquero and Weisburd,
2010). By using these matching protocols, a greater degree of homogeneity between the groups can be
achieved. For example, Piquero and Weisburd (2010) use a national youth study of drug use and
employment status to illustrate how the treatment (high-intensity employment) co-varies negatively with drug use; PSA indicates that this is not a treatment effect but the result of self-selection.
Something must differ in the background of these two groups.
For a much more in depth discussion and explanation on how to apply PSA please refer
to the chapter in the book referenced below.
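As a rough illustration of the two steps described above, the sketch below estimates propensity scores with a logistic regression and performs simple 1:1 nearest-neighbor matching; the covariates (age, prior arrests) and data are hypothetical, and this is not the procedure from the cited chapter.

```python
# A minimal sketch with simulated data: estimate propensity scores with a logistic
# regression, then match each treated unit to the nearest untreated unit on the score.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(25, 5, n)
prior = rng.poisson(1.5, n)                            # hypothetical prior arrests
p_treat = 1 / (1 + np.exp(-(-3 + 0.08 * age + 0.3 * prior)))
treated = rng.binomial(1, p_treat)

X = sm.add_constant(np.column_stack([age, prior]))
pscore = sm.Logit(treated, X).fit(disp=0).predict(X)   # estimated propensity to be treated

treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
matches = {i: control_idx[np.argmin(np.abs(pscore[control_idx] - pscore[i]))]
           for i in treated_idx}
print(f"Matched {len(matches)} treated units to controls.")
```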
REFERENCE
Apel, R.J. and Sweeten, G. (2010). Propensity score matching in criminology and criminal justice. In A. Piquero and D. Weisburd (Eds.). Handbook of quantitative criminology. (pp.
543-562) New York: Springer.
Purpose:
To measure obesity rates across time and neighborhoods while controlling for contextual effects.
Abstract:
Obesity (body mass index ≥ 30 kg/m2) is a growing urban health concern, but few studies have examined
whether, how, or why obesity prevalence has changed over time within cities. This study characterized the
individual- and neighborhood-level determinants and distribution of obesity in New York City from 2003
to 2007. Individual-level data from the Community Health Survey (n = 48,506 adults, 34 neighborhoods)
were combined with neighborhood measures. Multilevel regression assessed changes in obesity over time
and associations with neighborhood-level income and food and physical activity amenities, controlling for
age, racial/ethnic identity, education, employment, US nativity, and marital status, stratified by gender.
Obesity rates increased by 1.6% (P < 0.05) each year, but changes over time differed significantly between neighborhoods and by gender. Obesity prevalence increased for women, even after controlling for
individual- and neighborhood-level factors (prevalence ratio = 1.021, P < 0.05), whereas no significant
changes were reported for men. Neighborhood factors including increased area income (prevalence ratio =
0.932) and availability of local food and fitness amenities (prevalence ratio = 0.889) were significantly
associated with reduced obesity (P < 0.001). Findings suggest that policies to reduce obesity in urban environments must be informed by up-to-date surveillance data and may require a variety of initiatives that
respond to both individual and contextual determinants of obesity.
Sample:
N=10,000.
Longitudinal Study (5yr, yearly repeat). Sampled from across the 5 boroughs of New York.
Limitations:
The resources measure may not solely capture context.
Due to standardization, the method is not sensitive to small changes.
Response rates differed by group; data for women, the unemployed, and those with lower education were less complete.
Summary:
The purpose of this section looking at medical journals was to see what we as
criminologists might be able to take away from this field and apply in our own. In the field of epidemiology, it seems that, for all but the experimental designs, the research is contending with
similar issues stemming from observational research.
In addition to highlighting some methods that seem to be more common in the medical
field yet equally useful to criminology, and apart from the lessons we could take from the medical field discussed at the beginning of this section, the following also stands out to the criminological researcher looking at medical research.
The most noticeable difference between criminological research on the whole and epidemiological studies, apart from experimental designs, is that multiple methods are often used.
These appear to be driven by strict adherence to the types of variables that can be used in a given model,
as opposed to using them anyway and checking for VIFs or skewness. Further, it was very common to
see analyses include other methods as a check for reliability or validity, as opposed to relying on the checks
produced by adding commands. The medical research realm seems more comfortable with
computationally demanding tests to validate the observed results, such as sensitivity analyses.
Criminology as a field talks a lot about being or becoming a science, and yet we mimic only
what we want to. We rarefy the act of publishing more, it seems, than a field that produces
substantially more publications, which allows it to quickly build vast bodies of knowledge and to
see the direction its findings are going in sooner. Finally, as a result of searching this
literature, this author is left wondering how far we are falling behind, in statistical savvy, a field dealing with similar
issues. As a final thought, maybe it is time criminological schools
focused on criminology and used the vastly more knowledgeable mathematics departments to
equip their students with modern statistical knowledge.