
FACTORS RELATED TO THE CRIME RATE IN NORTH CAROLINA

Authors: Arushi Kapoor, Yianni Kanellopoulos, Yiqing Yang, Wei Wang
1. Background

1.1 Overview of the data


This dataset contains data for 90 of North Carolina's 100 counties. For our purposes, the dependent
variable is the crime rate, calculated as the ratio of total crimes committed in a county to the
county population, measured for each year from 1981 to 1987. Throughout this period, the crime
rate for the state as a whole remained relatively constant, starting at 0.0452 crimes per person in 1981,
reaching a local minimum of 0.0404 in 1984, and climbing back up to 0.0465 by 1987. There are 23
independent variables, including wages in specified industries, the probabilities of being convicted
and sentenced, population density, police officer density, and the percentage of young males.

1.2 Variable explanation


1.3 Purpose of our analysis

A region's crime rate is tied to numerous important factors. Regions with high crime rates experience
depopulation, which in turn reduces tax revenue and deters businesses, encouraging
economic stagnation (Cullen and Levitt 1996). Further, regions with high crime tend to have poorer
educational outcomes, as students in high-crime areas tend to be more worried about self-preservation
than classroom success.
As such, we seek to find factors that are predictive of crime rates, and to use our findings to suggest
ways in which counties can reduce their crime rates.

2. Exploratory Data Analysis


2.1 Numerical Variables

Table 1
Table 1 shows the mean, standard deviation, minimum, first quartile, median, third quartile, and
maximum of each variable. The crime rate in each county (crmrte) is the dependent variable;
the others are numerical independent variables.

2.2 Categorical Variables


Next, we examine how the mean of each numerical variable differs across the categorical variables:
smsa, year, and region.

Table 2
We notice that many variables differ significantly across SMSA status, and that SMSA counties
have a higher crime rate, which suggests that smsa might be a good predictor.

Table 3
Here, we notice that the crime rate traces a slight U shape, decreasing from 1981 to 1984 but
increasing thereafter. We also notice that wages rise steadily over the period.

Table 4
The west region has a significantly lower crime rate than the central and other regions. The
other independent variables also differ across regions, which indicates that region might be a good
predictor.

Considering that some variables take small values, like the crime rate (from 0.0018 to 0.1638),
while others take large values, like the weekly wage of federal employees (from 361.525 to
597.95), we standardize all variables to make the model more readable. An overview of the standardized
data is shown below.
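Since no code survives in this extract, here is a minimal sketch of the standardization step. The report's analysis appears to use R; this sketch uses Python with made-up values spanning the two scales mentioned above.

```python
import pandas as pd

# Hypothetical stand-in for two crime-data columns on very different scales.
df = pd.DataFrame({
    "crmrte": [0.0018, 0.0320, 0.0550, 0.1638],
    "wfed":   [361.525, 450.000, 520.000, 597.950],
})

# Standardize: subtract each column's mean and divide by its standard deviation,
# so every variable ends up on a comparable mean-0, sd-1 scale.
standardized = (df - df.mean()) / df.std()
print(standardized.round(3))
```

After this transform, a regression coefficient measures the effect of a one-standard-deviation change, which is what makes coefficients across wage and rate variables comparable.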

2.3 Data Visualization

We use ggplot to visualize the relationship between density, crime rate, smsa, and region. There is a
significant linear relationship between density and crime rate, which means density could be a useful
independent variable for predicting crime rate. The SMSA counties (blue points) have significantly
higher density and crime rates than non-SMSA counties (red points). The central region has a higher
mean crime rate and a larger range of density.

We use ggplot to see the distribution of the weekly wage of federal employees (wfed) in different years. The
shapes of the distributions are similar, but it is obvious that the wage increases with the year.

3. Linear regression
3.1 Full model

According to the results of the full model (linear regression), the data are worth analyzing. First of
all, several variables are significant. Besides, the R² of the full model is 0.7202, which means the full
model explains the data to a decent extent. We will look for better-fitting models in the following
sections.
The coefficients show each variable's relationship with the dependent variable: holding the other
variables constant, a one-unit increase in a variable changes the crime rate by that variable's
coefficient.
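The mechanics behind the full model's fit and its R² can be sketched as follows (Python on simulated data; the coefficients and R² below are illustrative, not the report's values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in: 90 "counties", 3 standardized predictors.
X = rng.normal(size=(90, 3))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=90)

# Full model: least-squares fit of y on all predictors plus an intercept.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2 = 1 - RSS/TSS: the share of variance the model explains.
resid = y - A @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(beta.round(3), round(r2, 3))
```

Holding other predictors fixed, beta[1] here is the change in the response per one-unit change in the first predictor, exactly the interpretation given above.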

3.2 Variable selection


1) Best subset regression

a. As we can see in the table, the p-values for all selected variables are smaller
than 0.05, which means that the variables are statistically significant.
b. We can infer from the last column that the year and region variables are less
significant than the others, which means that their relationship with the
crime rate is weaker.
c. The coefficients show each variable's relationship with the dependent
variable: holding the other variables constant, a one-unit increase in a
variable changes the crime rate by that variable's coefficient.
d. What's more, the adjusted R² shows to what extent the data are explained
by this model. In this case it equals 0.7153, meaning the best subset model
explains 71.53% of the variability in the data.
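Best subset selection exhaustively scores every combination of predictors. A small sketch of that search (Python, simulated data with four hypothetical predictors, scored by a Gaussian AIC):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
# Simulated stand-in: only predictors 0 and 2 truly drive the response.
X = rng.normal(size=(90, 4))
y = 1.0 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=90)
n = len(y)

def rss(cols):
    """Residual sum of squares for a least-squares fit on the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ beta) ** 2).sum())

def aic(cols):
    """Gaussian AIC up to a constant: n*log(RSS/n) + 2*(number of parameters)."""
    return n * np.log(rss(cols) / n) + 2 * (len(cols) + 1)

# Best subset: score every combination of predictors and keep the AIC winner.
subsets = [s for r in range(1, 5) for s in combinations(range(4), r)]
best = min(subsets, key=aic)
print(best)
```

With 23 predictors, as in the real data, the 2^23 combinations make this search expensive, which is one reason stepwise methods below are attractive.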

2) Stepwise AIC regression


During the selection process, we used both AIC and BIC as criteria. The
results of the AIC and BIC selections are shown below.
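A sketch of the forward variant of stepwise AIC selection (Python, simulated data; the report's actual selection was presumably done with R's stepwise tools):

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated stand-in: predictors 1 and 3 matter, the rest are noise.
X = rng.normal(size=(90, 5))
y = 1.2 * X[:, 1] - 0.9 * X[:, 3] + rng.normal(scale=0.5, size=90)
n = len(y)

def aic(cols):
    """Gaussian AIC (up to a constant) for a least-squares fit on cols."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(((y - A @ beta) ** 2).sum())
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

# Forward stepwise: repeatedly add the predictor that lowers AIC the most,
# and stop as soon as no addition improves the criterion.
selected, remaining = [], set(range(5))
while remaining:
    scores = {c: aic(selected + [c]) for c in remaining}
    best_c = min(scores, key=scores.get)
    if scores[best_c] >= aic(selected):
        break
    selected.append(best_c)
    remaining.remove(best_c)
print(sorted(selected))
```

Swapping the `2 * (len(cols) + 1)` penalty for `np.log(n) * (len(cols) + 1)` turns this into stepwise BIC, which penalizes extra variables more heavily and tends to select smaller models.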

3) Stepwise BIC regression

4) LASSO

The LASSO method estimates the coefficients and performs variable selection simultaneously. As
we can see in Table X, it forces some variable coefficients to 0, based on the best lambda, chosen
to minimize the cross-validated MSE.
In this case, we found that the coefficients of county, year, prbpris, avgsen, taxpc, smsa, wtrd, wfir,
wser, wloc, and mix are forced to 0. This method shrinks the set of variables effectively. However,
shrinking away too many variables may be undesirable here, since the crime rate should arguably be
predicted from a reasonably rich set of variables.
We will compare this model with the others in the Model Comparison section.
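To make the "forces coefficients to 0" point concrete, here is a small self-contained sketch of LASSO via coordinate descent on simulated standardized data (a real analysis would use a packaged solver, with lambda chosen by cross-validation as described above):

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated standardized data: only predictor 0 truly matters.
X = rng.normal(size=(90, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=90)
y = y - y.mean()
n, p = X.shape

def soft_threshold(z, t):
    """The operator that sets small coefficients exactly to 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

# Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1, assuming
# unit-variance columns so each coordinate update has a closed form.
def lasso(lam, iters=200):
    b = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # residual with predictor j removed
            b[j] = soft_threshold(X[:, j] @ r / n, lam)
    return b

b = lasso(lam=0.3)
print(b.round(3))
```

The soft-threshold step is what distinguishes LASSO from ridge regression: coefficients whose partial correlation with the residual falls below lambda are set exactly to zero rather than merely shrunk.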

4. Regression Tree
4.1 Large Tree
First, we let cp = 0.005 and build a large regression tree. The result is shown below.

According to the tree, people per square mile (density) is the most important independent variable.
First, we test whether the density of the county is less than 1.309; if yes, we go to the left branch
and test whether density is less than -0.267; if no, we go to the right branch and test whether the
percentage minority in 1980 (pctmin) is less than -0.4478. (These thresholds are on the standardized
scale, which is why they can be negative.) Following this logic, we traverse the tree until we reach a
terminal node, which gives the prediction.
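The rule the tree applies at each node, picking the threshold that most reduces squared error, can be sketched as follows (Python, simulated data; the step at 1.309 mimics the report's first split and is not fit to the real data):

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated stand-in: crime rate jumps once standardized density crosses 1.309.
density = rng.uniform(-1.0, 2.0, size=200)
crmrte = np.where(density < 1.309, 0.02, 0.08) + rng.normal(scale=0.005, size=200)

def best_split(x, y):
    """Return the threshold minimizing total within-child squared error."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[1:]:        # skip the smallest value so both children are non-empty
        left, right = y[x < t], y[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t

t = best_split(density, crmrte)
print(round(t, 3))
```

A full tree applies this search recursively to each child node, over every candidate variable, until a stopping rule (such as the cp threshold) is hit.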

4.2 Prune Tree


Considering that the previous tree is too complex, we check the cp table to see which cp value
achieves both low error and small tree size.

We choose 0.035 as our optimal cp and use this value to prune the tree. The result of the pruned
tree is shown below.
People per square mile (density) is still the most important variable, and the tree can be interpreted
as explained in the previous part.
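The cp choice can be made mechanical with the one-standard-error rule. The sketch below uses an invented cp table in the style of rpart's printcp output; the error values and the standard error are assumptions for illustration, not the report's numbers:

```python
# Hypothetical cp table: (complexity parameter, terminal nodes, cross-validated error).
cp_table = [
    (0.005, 12, 0.52),
    (0.010,  9, 0.50),
    (0.035,  4, 0.51),
    (0.100,  2, 0.68),
]

# One-standard-error rule: among trees whose CV error is within one standard
# error of the minimum, keep the smallest tree (i.e., the largest cp).
se = 0.02                                # assumed standard error of the CV estimate
min_err = min(err for _, _, err in cp_table)
candidates = [(cp, size) for cp, size, err in cp_table if err <= min_err + se]
best_cp, best_size = max(candidates)     # largest cp = simplest qualifying tree
print(best_cp, best_size)
```

Under these assumed numbers the rule lands on cp = 0.035, the value chosen above: the 4-leaf tree is nearly as accurate as the 9-leaf one, so the simpler tree wins.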

4.3 Random Forest


A random forest constructs a multitude of regression trees and averages their predictions to obtain a
more reliable result.
We build the random forest; since it is not a single tree, we cannot draw it here. Its in-sample and
out-of-sample performance is reported in the Model Comparison part.
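The variance-reduction idea can be sketched with bootstrapped one-split trees (Python, simulated data; real random forests also subsample predictors at each split, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated 1-D data: the response is a noisy step function of x.
x = rng.uniform(0.0, 1.0, size=300)
y = np.where(x < 0.5, 0.0, 1.0) + rng.normal(scale=0.2, size=300)

def stump_predict(xs, ys, query):
    """Fit a one-split regression tree on (xs, ys), then predict at query points."""
    best_t, best_sse = 0.5, np.inf
    for t in np.linspace(0.05, 0.95, 19):
        left, right = ys[xs < t], ys[xs >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    left, right = ys[xs < best_t], ys[xs >= best_t]
    return np.where(query < best_t, left.mean(), right.mean())

# Random forest idea: fit each tree on a bootstrap resample of the data,
# then average the trees' predictions to reduce variance.
query = np.array([0.25, 0.75])
preds = []
for _ in range(50):
    idx = rng.integers(0, len(x), size=len(x))   # bootstrap sample with replacement
    preds.append(stump_predict(x[idx], y[idx], query))
forest_pred = np.mean(preds, axis=0)
print(forest_pred.round(2))
```

Each individual stump is noisy, but the average over bootstrap resamples is much more stable, which is exactly the "more reliable result" claimed above.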

5. Model Comparison

In this section, we compare the full linear regression model, the variable selection models, the
large tree, the pruned tree, and the random forest on AIC, BIC, in-sample and out-of-sample MSE,
cross-validation score, and R², to see which model fits the data best.

According to Table X above, the MSEs of the models are quite similar but contain slight differences,
which help us identify the better model. Among the variable selection models, stepwise AIC performs
best, so we use diagnostic plots to analyze it further.
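The in-sample versus out-of-sample MSE comparison can be sketched as follows (Python, simulated data, using a 90%/10% split like the report's):

```python
import numpy as np

rng = np.random.default_rng(6)
# Simulated stand-in for the 90-county data with 3 predictors.
X = rng.normal(size=(90, 3))
y = 0.7 * X[:, 0] + rng.normal(scale=0.5, size=90)

# Hold out 10% of the counties as a test set.
idx = rng.permutation(len(y))
train, test = idx[:81], idx[81:]

# Fit on the training counties only.
A_tr = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
A_te = np.column_stack([np.ones(len(test)), X[test]])

def mse(y_true, y_pred):
    return float(((y_true - y_pred) ** 2).mean())

in_mse = mse(y[train], A_tr @ beta)    # in-sample: error on data the model saw
out_mse = mse(y[test], A_te @ beta)    # out-of-sample: error on held-out counties
print(round(in_mse, 3), round(out_mse, 3))
```

A large gap between the two numbers is the overfitting signal this comparison is designed to catch; similar values, as reported for our models, indicate the fit generalizes.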

According to the four diagnostic plots for the stepwise AIC model: in the Residuals vs Fitted and
Scale-Location plots, the residuals show a random pattern, which is good, since residuals should be
randomly scattered rather than patterned. In the Normal Q-Q plot, the standardized residuals follow
the straight line in the interval -2 to 2; beyond 2, the residuals gradually deviate from the line,
which means the model may not fit the tails as well as the middle of the distribution. In the
Residuals vs Leverage plot, most points lie within the two dotted lines (Cook's distance 0.5), so
there are few influential outliers. In conclusion, the stepwise AIC model fits the dataset well.

We also fit tree models to the dataset and found that the adjusted R² for the large tree and the
random forest is obviously larger than for the pruned tree and the variable selection models. In
addition, their in-sample and out-of-sample MSEs are similar. According to these results, the large
tree is the better model. We suspect this is because the dependent variable, crime rate, should be
predicted from a reasonably rich set of variables; if we shrink away too many variables, the
predictions will be less precise than those of a model with more variables.

6. Another Sample
We generate another random sample to check the dependability of the previous procedures and guard
against sampling bias. In this section, we resample the crime data to obtain another training set
containing 90% of the original data, with the remaining 10% as a test set, and check whether the
results differ from those of the former sample.
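A sketch of this stability check (Python, simulated data): refit on two different 90%/10% resamples and compare the coefficients and test MSEs.

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated stand-in data: 90 observations, 2 predictors.
X = rng.normal(size=(90, 2))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.3, size=90)

def fit_on_split(seed):
    """Refit on a fresh 90%/10% train/test split; return (coefficients, test MSE)."""
    r = np.random.default_rng(seed)
    idx = r.permutation(len(y))
    tr, te = idx[:81], idx[81:]
    A = np.column_stack([np.ones(len(tr)), X[tr]])
    beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    A_te = np.column_stack([np.ones(len(te)), X[te]])
    return beta, float(((y[te] - A_te @ beta) ** 2).mean())

# A stable procedure should give nearly identical fits on different resamples.
beta1, mse1 = fit_on_split(seed=1)
beta2, mse2 = fit_on_split(seed=2)
print(np.abs(beta1 - beta2).round(3), round(mse1, 3), round(mse2, 3))
```

If the coefficients or test MSEs moved substantially between resamples, that would flag the earlier results as split-dependent; small differences support the "models are stable" conclusion below.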

For most models, there are not many differences between the two samples. The in-sample and
out-of-sample MSEs are also similar, which indicates that the models are stable and worth using.

7. Conclusion

Based on the table above, we believe that the best model is the stepwise selection (AIC) model: it
heavily outperforms the full model in terms of BIC, is similar to both in terms of MSE, and
outperforms all the other models in terms of AIC and CV score. Fortunately, the model puts a highly
significant negative coefficient on year, meaning that crime tends to decrease as time passes. Based
on the model, we suggest a number of proposals to reduce crime rates. First, implement
get-tough-on-crime policies: our study shows that high probabilities of arrest and conviction have a
highly significant negative effect on the crime rate. Economically, this makes sense: as the
probabilities of arrest and conviction increase, the expected value of punishment increases, making
not committing a crime the preferred option among would-be criminals. However, a counter-intuitive
result is the positive coefficient on polpc, which measures police density. We believe this could be
because more police in an area increases the number of crimes discovered, rather than the greater
police presence causing more crime; in any case, more research should be done. Lastly, the negative
coefficient on pctymle is also counter-intuitive, and we caution against reading any causal or
policy implication into it.
