
Implementation in R

Predicting Traffic Activities

Abstract. This report presents a background investigation of the dataset used, together with the analytical software employed for the research. It then sets out the research objectives and the requirement analysis, and describes how the dataset was gathered. Finally, the data are explored to address the research questions, and a conclusion is reached through a critical review of the analysis.

Keywords: Dataset, Traffic Activity, Methodology, R, Linear Regression.


Contents
1 Introduction
2 Background Study
3 Research Objectives
4 Requirement Analysis
4.1 Dataset Description
4.2 Dataset Specifications
4.3 Methodology Used
4.4 Attributes of The Dataset
5 Research Question & Data Collection Technique
6 Data Analysis & Implementation
7 Critical Review and Results
8 Conclusion
References
1 Introduction

Advanced analytics is an expanding field of inquiry that can be used to drive changes and improvements in business practices. It is evolving to provide timely, relevant and precise information that enables continuous decision-making, not only for specific clients but for workers at every level of an organisation. Among its many use cases, it can be applied to discover patterns in large volumes of data, to make predictions, and to support complex event analysis.

2 Background Study

The tool used to investigate the dataset is the analytical software known as R. R is free software that has been built and designed to allow professional analysts, and others with sufficient programming skills, to perform complex data analysis and present the results graphically. To analyse the data, linear regression modelling will be used, which is a function within R. Regression modelling, more commonly known as regression analysis, is a statistical tool for investigating relationships between two or more variables. In its simplest form it models the relationship between one independent variable and one dependent variable, and thereby shows how variation in one variable follows variation in the other. It is important to recognise that regression analysis is fundamentally different from simply finding the correlations among variables: the strength of the relationship between variables is measured by the correlation, whereas regression attempts to describe the relationship between the variables in more detail.

3 Research Objectives

It is important to define the research goals in advance, as this is a vital part of the process. The point of this stage is to determine what I want to achieve, the objectives, and the kinds of decisions the research will help me make. My dataset contains information about police traffic and enforcement activity. It also records the violations committed by road users, together with the traffic counts and their rates. The purpose of my research is to compare the times when traffic has had the greatest impact, and to predict future traffic. The research will also analyse the rates at three different times, as well as the days on which traffic pressure is high. The rates will be analysed and it will be shown how they can be reduced. Finally, the research will develop a predictive or explanatory model that relates the rate of each incident type to the day on which it occurs; on the basis of that result we might predict whether it will increase or decrease.
4 Requirement Analysis

4.1 Dataset Description:

The dataset combines the traffic counts in Ballarat in 2012 with the police traffic enforcement activity in that region. Many types of activity are recorded in the dataset, including violations such as traffic violations, motor violations and parking violations. Locations are stated using latitude and longitude, which lets the user determine, from the incident type and incident number, where each incident took place and how many times it occurred. The dataset also records the days on which most traffic was recorded, and the traffic counts are divided into three columns: MidweekADT, WeekendADT and X7DaysADT. Together, these tell us how much traffic pressure there is.

4.2 Dataset Specifications:

The version of the analytical software R to be used is R version 3.2.3. The file format of the dataset is CSV (Comma Separated Values), which stores data in a tabular arrangement. It resembles a typical spreadsheet, just with a .csv extension; essentially, it is a text file consisting of values separated by commas.

At present, R supports several data file formats, in particular files ending in '.R', '.RData', '.txt' and '.csv', and the dataset I will use is in the '.csv' format. The dataset to be loaded can be referred to by a character string or name, and for the given dataset this will result in the creation of a single variable with the same name as the dataset.

4.3 Methodology Used:

I have used the reasoning-cycle methodology for my research. It comprises forming a hypothesis, making predictions, and testing those predictions against the results (Kothari, 1990).
4.4 Attributes of The Dataset:

The table below describes the variables in my dataset.

Variable Name    Type of Data
CountID          Numeric
Date             String
Road             String
Location         String
Direction        String
Days             Numeric
MidweekADT       Numeric
WeekendADT       Numeric
7DaysADT         Numeric
Latitude         Numeric
Longitude        Numeric
Incnum           Numeric
Inctype          String
Inctypecode      Numeric

5 Research Question & Data Collection Technique

A research question focuses on one segment of a broader topic area: it is the question that the researcher will attempt to answer when the research on the topic is done. For my research on police traffic enforcement activity, the research questions are as follows:
1. Can the predicted rate of traffic be lessened based on the results for traffic incidents and their per-day rates?
2. How does the rate of midweek traffic differ from weekend traffic?
3. On which street is the most activity taking place, according to the dataset?
4. Which incident type has a higher ratio than the others?

Data collection is a basic part of any research, and inaccurate data collection can influence the results of the study and eventually lead to invalid outcomes (Kothari, 1990). In most cases there is a great deal of data that has already been gathered by others, even though it may not yet have been analysed. Finding these sources and retrieving the data is a good starting point for any data collection effort. Because my study concerns the rate of police enforcement activity, I gathered the dataset from data.gov, an American website launched by the US government in 2009 that aims to provide free access to valuable, machine-readable datasets.

6 Data Analysis & Implementation

In this part of the research I will analyse the data within the dataset, alongside the implementation of the study. To start the analysis, I need to load the dataset into R in order to work with the different variables/objects stored inside it (Torres-Reyna, 2010). The following code has been used to load the dataset:

“ptea <- read.csv("C:\\Users\\IT\\Desktop\\PTEA.csv", header = TRUE)”

where ptea is the name of the data frame.
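Since the original PTEA.csv is not reproduced here, the loading step can be sketched with a small hypothetical CSV file (the column names are borrowed from the dataset, but the values are invented for illustration):

```r
# Write a tiny hypothetical CSV mirroring a few of the dataset's columns.
csv_path <- file.path(tempdir(), "PTEA_sample.csv")
writeLines(c(
  "CountID,Days,MidweekADT,WeekendADT",
  "1,5,1200,900",
  "2,7,3400,2100",
  "3,6,800,1100"
), csv_path)

# header = TRUE tells read.csv that the first row holds the column names.
ptea <- read.csv(csv_path, header = TRUE)

str(ptea)     # inspect the structure of the loaded data frame
nrow(ptea)    # number of rows read from the file
names(ptea)   # the column names taken from the header row
```

The same read.csv() call works for the real file once the path points at it.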

Now that the dataset has been loaded into R, the next step is to attach the data frame to the search path, so that the variables in the data frame can be referred to by their names. The following code has been used to attach the data frame:

“attach(ptea)”

However, I will not need to work with all the variables in the dataset, as the modelling only uses the numeric variables. To remove the extra variables, I will run code through the R script to nullify them. The following lines show the process:

ptea$Date <- NULL
ptea$Road <- NULL
ptea$location <- NULL
ptea$Direction <- NULL
ptea$stnum <- NULL
ptea$stname1 <- NULL
ptea$stname2 <- NULL
ptea$stname3 <- NULL

Now that the extra variables have been nullified, I will re-check to make sure they are gone. To inspect the remaining variables, I will run the following commands:

“names(ptea)” to get the variable names.

“dim(ptea)” to get the dimensions of the data frame.

Having nullified the extra variables, I now want to see a summary of the remaining ones, to view the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each variable. The following code will produce the summary:

“summary(ptea)”
The following screenshot shows the summary of each of the variables within the dataset:
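The nullify-then-summarise sequence can be sketched on a small invented data frame (hypothetical values, with one unneeded non-numeric column):

```r
# Hypothetical frame for illustration; Road stands in for the string columns.
ptea <- data.frame(
  Road = c("A", "B", "C"),
  Days = c(5, 7, 6),
  MidweekADT = c(1200, 3400, 800)
)

ptea$Road <- NULL   # assigning NULL drops the column from the data frame

names(ptea)    # confirms "Road" is gone
dim(ptea)      # rows and columns remaining
summary(ptea)  # min, quartiles, median, mean, max for each numeric column
```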

Now I will plot a histogram of the incident numbers to see which incidents occur most frequently. The following code produces the histogram, and the screenshot lets me view the region with the highest rate of incidents:

“hist(incnum, main = "Histogram for incident", xlab = "No. of Incident", border = "blue", col = "green", xlim = c(10000000, 17000000), las = 1, breaks = 4)”
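The same hist() call can be tried on invented incident numbers in roughly the range shown above (the values are illustrative only):

```r
# Invented incident numbers spanning the xlim range used in the report.
set.seed(42)
incnum <- runif(100, min = 10000000, max = 17000000)

hist(incnum,
     main = "Histogram for incident",
     xlab = "No. of Incident",
     border = "blue",
     col = "green",
     xlim = c(10000000, 17000000),
     las = 1,       # draw axis labels horizontally
     breaks = 4)    # suggested number of bins
```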
Next I will box-plot the variable “Days” against the variable “inctype” to determine which violation occurs most frequently. Because inctype is a factor, the generic plot() call below produces a box plot:

“plot(Days ~ inctype)”

Next I will plot the weekend average daily traffic against the midweek average daily traffic. This graph gives a clear idea of how the traffic in the two periods compares. The following code produces the plot, and the screenshot shows the graph:

“plot(WeekendADT ~ MidweekADT)”


Moreover, to obtain the diagnostic plots for a linear model of MidweekADT against WeekendADT, we need to type:

“plot(lm(MidweekADT ~ WeekendADT))”

This gives a set of graphs, the first of which pictures the residuals against the fitted values. The line shows the pattern the predictions will most probably follow.
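A minimal sketch of this step, assuming invented ADT values in place of the real columns:

```r
# Invented average daily traffic values, loosely linearly related.
set.seed(1)
WeekendADT <- runif(50, 500, 5000)
MidweekADT <- 1.2 * WeekendADT + rnorm(50, sd = 300)

fit <- lm(MidweekADT ~ WeekendADT)

# which = 1 draws only the Residuals vs Fitted panel;
# plot(fit) on its own cycles through all four diagnostic plots.
plot(fit, which = 1)

length(residuals(fit))   # one residual per observation
</imports>
```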

The Q-Q plot, or quantile-quantile plot, is a graphical device that helps us evaluate whether a set of data plausibly came from some theoretical distribution, such as a normal or an exponential distribution. Quantiles are the points in the data below which a given proportion of the observations fall (Selcuk Korkmaz, 2014). The graph below shows the normal Q-Q curve for my variables.

A scale-location plot is like the residuals-versus-fitted plot, except that it uses the square root of the standardised residuals. As with the first plot, there should be no discernible pattern in it.

Data points with large residuals and/or high leverage may distort the outcome and accuracy of a regression. Cook's distance measures the effect of deleting a given observation; points with a large Cook's distance are considered to merit closer examination in the analysis.
Moreover, if I want to predict at what point in the month the traffic is going to increase or decrease, I will use the “predict” command to do so, and “abline” to support my prediction.
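The predict-then-abline pattern can be sketched on invented data with a built-in negative trend (all values hypothetical):

```r
# Invented data: Days decreases slightly as MidweekADT grows.
set.seed(7)
MidweekADT <- runif(40, 500, 5000)
Days <- 7 - 0.0005 * MidweekADT + rnorm(40, sd = 0.5)

fit <- lm(Days ~ MidweekADT)

pred <- predict(fit)           # fitted values at the observed MidweekADT
plot(MidweekADT, Days)
abline(fit, col = 2, lwd = 2)  # overlay the fitted regression line in red
```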

In the graph above, the blue line shows that midweek traffic is going to increase more gradually than weekend traffic. The coefficient is zero, which is why the straight line starts from 0.

7 Critical Review and Results:

As I will be utilising the linear regression model on the dataset, the next step is to produce the linear model. I will fit the linear model of the variable Days against MidweekADT. The following code shows the linear model being applied to the variables:

“lm1 <- lm(Days ~ MidweekADT, data = ptea)”

Here lm1 is the variable holding the linear model, which is used for prediction later in the implementation. Now that the linear model has been fitted, I will view the summary of lm1 to find the residuals, coefficients and the p-value (Sykes, 1996). The following code produces the summary, and the screenshot follows:

“summary(lm1)”
Next I will produce an analysis of variance (ANOVA) table to assess the importance of the factors by comparing the response variable means at the different factor levels. The following code produces the ANOVA table, and the screenshot follows:

“anova(lm1)”

Now that the analysis of variance table has been produced, the next step is to predict from the fitted linear model. This produces the predicted values obtained by evaluating the linear regression function. The following code computes the predictions, and the screenshot shows the fitted value together with the lower and upper bounds:

“predict(lm1, interval = "prediction")”
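On invented data, the same call returns a matrix whose columns are the fitted value and the lower and upper prediction bounds:

```r
# Invented data: Days loosely increasing with MidweekADT.
set.seed(3)
MidweekADT <- runif(30, 500, 5000)
Days <- 5 + 0.0003 * MidweekADT + rnorm(30, sd = 0.4)

fit <- lm(Days ~ MidweekADT)

# A prediction interval bounds a *new* observation, not just the mean response.
# (R warns that predictions on the training data refer to future responses.)
pred_int <- predict(fit, interval = "prediction")
head(pred_int)   # columns: fit (point estimate), lwr and upr (95% bounds)
```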

Now that the prediction has been accomplished, the next and final step is to plot the predicted values against the residuals of the linear regression. The following code plots them, and the following screenshot shows the graph:

“plot(predict(lm1), residuals(lm1))”
After plotting the predictions in the graph, I will now use the “abline” command to show the overall relationship in the graph, using the code:

“abline(lm1, col = 2, lwd = 2)”

The graph is shown below:

Now, according to this graph, our prediction says that midweek days tend to have less traffic than other days, and that this is going to decrease further in the coming years. I make that judgement after plotting the “abline” on the graph: the fitted line has a negative gradient, which indicates a negative relationship. After plotting the graph, I will now test the correlation. Correlation helps us identify how well the line fits and what the relationship is, which in turn helps us judge the accuracy of the result. To test the correlation, I will use the command shown below:

“cor.test(Days, MidweekADT)”

The result has been delivered, and we can see that the correlation is -0.1635854. My plotted results have a 95% confidence interval, which means that the interval has a 0.95 probability of containing the population value (Plotts, 2011).
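A sketch of what cor.test() returns, on invented variables with a weak negative relationship:

```r
# Invented variables with a weak negative linear relationship.
set.seed(9)
Days <- rnorm(60)
MidweekADT <- -0.2 * Days + rnorm(60)

ct <- cor.test(Days, MidweekADT)

ct$estimate   # the sample Pearson correlation coefficient
ct$conf.int   # 95% confidence interval (conf.level = 0.95 is the default)
ct$p.value    # p-value for H0: the true correlation is zero
```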
We can also find the correlation by using the command “cor(Days, MidweekADT, method = "pearson")”; this is known as Pearson's method of finding the correlation. Now, the covariance of two variables x and y in a data set measures how the two are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the inverse. To get the covariance I will use the command stated below:

“cov(Days, MidweekADT)”
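The link between covariance and Pearson correlation can be checked directly on invented data: dividing the covariance by both standard deviations recovers the correlation.

```r
# Invented positively related variables.
set.seed(5)
Days <- rnorm(50)
MidweekADT <- 0.5 * Days + rnorm(50)

cv <- cov(Days, MidweekADT)
cr <- cor(Days, MidweekADT, method = "pearson")

# Pearson correlation is covariance rescaled by both standard deviations.
cv / (sd(Days) * sd(MidweekADT))   # equals cor(Days, MidweekADT)
```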

I will use the “coef(lm1)” command to determine the coefficients, including the y-intercept, of the fitted line. My intercept came out as 7.663013e+00. We can conclude that the expected result holds, because our plotted graph shows the regression line starting from that point.
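On invented data, coef() returns a named vector holding the intercept and the slope:

```r
# Invented data with a known intercept and slope.
set.seed(11)
MidweekADT <- runif(30, 500, 5000)
Days <- 6 + 0.0002 * MidweekADT + rnorm(30, sd = 0.3)

fit <- lm(Days ~ MidweekADT)

coef(fit)                  # named vector: "(Intercept)" and "MidweekADT"
coef(fit)["(Intercept)"]   # where the fitted line crosses MidweekADT = 0
```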

Now I will plot four more graphs to conclude my predictions. To do that, I need the layout command to place them all in one frame. The commands to put the linear model plots into the frame are shown below:

“layout(matrix(1:4, 2, 2))”

“plot(lm1)”

The graphs have been plotted using the “plot()” function and arranged by the “layout()” function. The figure is shown below.
8 Conclusion:

In this research I have tried to show how the rate of traffic per day can be lessened midweek. I will now discuss two further graphs: one represents which types of incident take place most frequently, and the other shows on which street the most activity has recently been taking place.

The graph above shows that the incident code that occurred most often is 1, and the least frequent incident code is 607.

These are the four street numbers where activity has been recorded. As the graph suggests, street “X” has the most activity, at more than 800, while streets “24X” and “643GX” are each supported by fewer than 100 recorded activities.
References

1. Kothari, C. (1990). Research Methodology: An Introduction, 7-10.
2. Plotts, T. (2011). A Multiple Regression Analysis of Factors, 114-118.
3. Selcuk Korkmaz, D. G. (2014). MVN: An R Package for Assessing Multivariate Normality, 155-156.
4. Sykes, A. O. (1996). An Introduction to Regression Analysis. The Inaugural Coase Lecture, 1-7.
5. Torres-Reyna, O. (2010). Getting Started in Linear Regression using R, 1-12.
