
Modelling with R: part 2

October 2, 2011
By MK

(This article was first published on We think therefore we R, and kindly contributed to R-bloggers)

I apologize for the delay in the second post (just in case anybody was waiting); I had been very
involved with work the past week. I shall try to be more regular. Well, in the previous post, we
successfully imported data into R and got a basic feel of it by looking at the various variables
present and their types as well. Now we will try to process the data to make sense of it. Data by
themselves are just space-hogging particles, aesthetically challenged and practically worthless,
unless we can get some information out of them. And to get that information,
they need to be processed, transformed, and at times coerced. This post will describe how we can
start to do that. So, let's grab its throat and make it spit out the ugly truth (excuse me for the
histrionics). Also, that last sentence bore no relation to this horrible Gerard Butler movie.
########## 1.2: Processing the data ##########
There are three main steps involved here:
1. Preliminary visualization
2. Data transformation and/or variable creation
3. Development-validation division of data set
Let's start with 1.
Suppose we want to check the distribution of amount in the given data. We can use a simple plot
command.
plot(amount, type = "l", col = "royalblue")
plot(age, type = "l", col = "brown")
Now the plots we have created present the variation in these variables in a manner that is neither
easily discernible nor clean.
It would be better if we created histograms to check the frequency distribution of these variables.
hist(amount)
hist(age)
# The hist command has a lot of options that help extend the features of the plot.
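For instance (the exact choices below are mine, not from the original post), the breaks, col, main and xlab arguments control the number of bins, the colour, the title and the x-axis label:
hist(amount, breaks = 20, col = "royalblue",
     main = "Distribution of amount", xlab = "amount")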
Now, with the hist command, we are able to see the picture clearly (literally). But it is not always
appropriate to plot a histogram; the best way to visualize the data depends largely on the problem at hand.
Suppose we had a multiple time series and we wanted to check the behaviour, then it would be better
to use the plot command which will not only present the data in a much neater way but also enable
us to compare different time series in a single plot. For a simple example, you can check
Shreyes' post.
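To give a rough idea of what I mean (this is a made-up illustration, not data from our exercise), two series can be overlaid with plot() and lines():
# Two hypothetical random-walk series compared in a single plot
ts1 <- cumsum(rnorm(100))
ts2 <- cumsum(rnorm(100))
plot(ts1, type = "l", col = "royalblue", ylim = range(c(ts1, ts2)),
     xlab = "time", ylab = "value")
lines(ts2, col = "brown")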
Coming back, we can similarly visualize the pattern, frequency and distributions of other
variables as well.
Now, in case you have ever worked on a credit scoring exercise before, you might have heard that it
is better to create categories out of continuous variables. This helps a lot while implementing the model
that we build, because it is more convenient to come up with strategies for individuals belonging to a
particular income group than for all individuals with specific incomes.
For this we need to bin some variables like amount and age. One approach to do this is to run the
following code.
DO NOT run this chunk of code. I will explain why later.
# g.data$amount <- as.factor(ifelse(amount <= 2500, "0-2500",
#                    ifelse(amount <= 5000, "2600-5000", "5000+")))
Here we are creating three categories for the variable amount. One for those with income level less
than 2500, one for income between 2500 and 5000, and one for income greater than 5000.

There is an important point to note here. Above, while creating the category variable, we overwrote
the original variable amount in the R object g.data. Ideally this is not advisable, because
if we later find that there was an error in our code or some flaw in the logic and
we need to change it, we will have to redo all the steps we have done up to this point. But there is
another side to this as well. R stores all the data and the objects that we create in the
RAM, so if the data set is of considerable size then creating additional variables by
transformation is not a very wise idea either. This trade-off needs to be balanced.
In this case, the data set is quite small and hence it would be better if we create an additional
object instead of overwriting the original one.
g.data$amt.fac <- as.factor(ifelse(amount <= 2500, "0-2500",
                    ifelse(amount <= 5000, "2600-5000", "5000+")))
head(g.data$amt.fac)
Similarly, we can do so for age.
g.data$age.fac <- as.factor(ifelse(age <= 30, "0-30",
                    ifelse(age <= 40, "30-40", "40+")))
head(g.data$age.fac)
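As an aside (this is my alternative, not the code used in this post), base R's cut() command can produce the same kind of binned factor in one step; the amt.fac2 name below is just illustrative:
g.data$amt.fac2 <- cut(g.data$amount, breaks = c(0, 2500, 5000, Inf),
                       labels = c("0-2500", "2600-5000", "5000+"))
head(g.data$amt.fac2)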
Here our dependent variable is response. It is a factor variable with 1 and 2 as the factor
levels. Now, R by itself can handle factor variables, so we do not need to transform them unless
we plan to combine categories. But I like to keep the response category coded as 1 (this is just
out of habit and nothing else). Hence, I reassign the levels to 0 and 1.
g.data$default <- as.factor(ifelse(response == 1, 0, 1))
is.factor(g.data$default)
contrasts(g.data$default)
head(g.data$default)
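As a quick sanity check (my addition, not part of the original post), we can cross-tabulate the new variable against the original response to confirm that the recoding went as intended:
table(g.data$default, response)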
We attach the data again to include the newly created variables.
attach(g.data)
In the previous post, one of the comments introduced me to the with() command as a substitute for
attach(). The with() command also serves the purpose quite well. It saves us the pain of
writing the object name with the $ sign before we can refer to a variable, but it needs to be wrapped
around every operation that we perform on the object.
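To make the difference concrete (a small illustration of my own), the same histogram can be drawn either way:
with(g.data, hist(amount))   # amount is looked up inside g.data for this call only
hist(g.data$amount)          # equivalent, spelling out the object name with $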
Now, we saw that there are a lot of categorical variables present in the data. R provides many
functions to plot categorical data.
Lets see an example.
mosaicplot(default ~ age.fac, col = T)
mosaicplot(default ~ job, col = T)
mosaicplot(default ~ chk_acct, col = T)
We can also use a spine plot.
spineplot(default ~ age.fac)
We can also check the relations between variables.
library(lattice)
xyplot(amount ~ age)
In case you don't have the lattice library installed, you can download it by running
install.packages("lattice")
We can also condition on a variable and see the interaction.
xyplot(amount ~ age | default)
The lattice package also has the option for a barchart, and it lets you plot the barchart and a histogram
for factor variables as well.
barchart(age.fac, col = "grey")
barchart(amt.fac, col = "grey")

histogram(employment, col = "grey")
histogram(sav_acct, col = "grey")
As a last step in this stage, we need to create a development sample and a validation sample. We
take about 70% of the data as the development sample and 30% as the validation sample.
d <- sort(sample(nrow(g.data), nrow(g.data)*0.7))
# The sample command here creates a random sample of the number of rows in g.data and then
# selects 70% of this random sample and assigns it to object d.
Note that here the sample is being generated from the row numbers and not the actual rows of data,
so if you inspect the object d you will see 700 randomly selected natural numbers between 1 and
1000, which are nothing but the row numbers in the data frame g.data.
# The sort command in the beginning just sorts these randomly generated row
# numbers in ascending order.
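One small aside (not in the original post): since sample() is random, re-running this line gives a different split every time. If you want the split to be reproducible, set a seed first; the number 123 below is arbitrary.
set.seed(123)
d <- sort(sample(nrow(g.data), nrow(g.data)*0.7))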
Then to create the development sample, we use the vector properties of R and assign the d rows to
the R object dev, and the remaining to the R object val.
dev <- g.data[d, ]
val <- g.data[-d, ]
After creating the sample, we can check the size of the two samples vis-a-vis the original data.
dim(g.data)
dim(dev)
dim(val)
Finally we have been able to domesticate the data. We have sliced and diced them according to our
needs. In the next post we will try to cook them in the modelling pan.
