October 2, 2011
By MK
(This article was first published on We think therefore we R, and kindly contributed to R-bloggers)
I apologize for the delay in the second post (just in case anybody was waiting); I have been very
involved with work the past week. I shall try to be more regular. Well, in the previous post, we
successfully imported data into R and got a basic feel of it by looking at the various variables
present and their types as well. Now we will try to process the data to make sense of it. Data by
themselves are just space-hogging particles, aesthetically challenged, and practically worthless
unless, well, unless we can get some information out of them. And to get that information,
they need to be processed, transformed, and at times coerced. This post will describe how we can
start to do that. So, let's grab the data by the throat and make it spit out the ugly truth (excuse me
for the histrionics). Also, this last sentence bears no relation to that horrible Gerard Butler movie.
##########-1.2: Processing the data###############
There are three main steps involved here:
1. Preliminary visualization
2. Data transformation and/or variable creation
3. Development-validation division of data set
Let's start with 1.
Suppose we want to check the distribution of amount in the given data. We can use a simple plot
command.
plot(amount, type = "l", col = "royalblue")
plot(age, type = "l", col = "brown")
Now the plots we have created present the variation in these variables in a manner that is neither
easily discernible nor clean.
It would be better if we created histograms to check the frequency distribution of these variables.
hist(amount)
hist(age)
# The hist command has a lot of options that help extend the features of the plot.
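As a rough sketch of a few of those options, the snippet below uses simulated values in place of the real amount column (the vector sim_amount and the seed are made up for illustration). The arguments breaks, main, xlab and col are standard arguments of hist().

```r
# Simulated stand-in for the amount variable (hypothetical data, not g.data)
set.seed(42)
sim_amount <- round(rlnorm(1000, meanlog = 7.8, sdlog = 0.8))

# breaks suggests the number of bins; main and xlab label the plot
h <- hist(sim_amount,
          breaks = 30,
          main   = "Distribution of amount (simulated)",
          xlab   = "Amount",
          col    = "royalblue")

# hist() invisibly returns the bin boundaries and the counts it used
head(h$counts)
```

Keeping the returned object around is handy when you want to reuse the same bins later.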
Now, with the hist command, we are able to see the picture clearly (literally). But it is not always
appropriate to plot a histogram. What is the best way depends largely on the problem at hand.
Suppose we had multiple time series and we wanted to check their behaviour; then it would be better
to use the plot command, which will not only present the data in a much neater way but also enable
us to compare different time series in a single plot. For a simple example, you can check
Shreyes' post.
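To give a feel for the single-plot comparison, here is a minimal sketch with two simulated series (series_a and series_b are invented for illustration and have nothing to do with the credit data). It draws the first series with plot() and adds the second with lines().

```r
# Two simulated monthly series (hypothetical data for illustration only)
set.seed(1)
months   <- 1:36
series_a <- cumsum(rnorm(36, mean = 0.5))
series_b <- cumsum(rnorm(36, mean = 0.2))

# Draw the first series, leaving room on the y-axis for both
plot(months, series_a, type = "l", col = "royalblue",
     ylim = range(series_a, series_b),
     xlab = "Month", ylab = "Value")

# Add the second series to the same plot
lines(months, series_b, col = "brown")

legend("topleft", legend = c("series_a", "series_b"),
       col = c("royalblue", "brown"), lty = 1)
```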
Coming back, we can similarly visualize the pattern, frequency and distributions of other
variables as well.
Now, in case you have ever worked on a credit scoring exercise before, you might have heard that it
is better to create categories out of continuous variables. This helps a lot while implementing the
model that we build because it is more convenient to come up with strategies for individuals
belonging to a particular income group rather than for all individuals with specific incomes.
For this we need to bin some variables like amount and age. One approach to do this is to run the
following code.
DO NOT run this chunk of code. I will explain why below.
# g.data$amount <- as.factor(ifelse(amount <= 2500, "0-2500",
#                  ifelse(amount <= 5000, "2500-5000", "5000+")))
Here we are creating three categories for the variable amount: one for amounts up to 2500, one for
amounts above 2500 and up to 5000, and one for amounts greater than 5000.
There is an important point to note here. Above, while creating the category variable, we overwrote
the original variable amount in the R object g.data. This is generally not advisable, because if we
later find an error in our code or a flaw in the logic and need to change it, we will have to redo
all the steps that we have done up to this point. But there is another side here as well. R, while
working, stores all the data and the objects that we create in RAM, and hence if the data set is
considerably large, creating additional variables by transformation is not a very wise idea either.
This trade-off needs to be balanced.
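One way to get a feel for the memory side of this trade-off is object.size(), which reports roughly how much memory an object occupies. The sketch below uses a made-up data frame called toy rather than g.data, so the numbers are only illustrative.

```r
# Toy data frame standing in for g.data (hypothetical values)
set.seed(3)
toy <- data.frame(amount = runif(1e5, min = 250, max = 20000))

size_before <- object.size(toy)

# Adding a derived column keeps the original but uses more memory
toy$amt.fac <- as.factor(ifelse(toy$amount <= 2500, "0-2500",
                         ifelse(toy$amount <= 5000, "2500-5000", "5000+")))

size_after <- object.size(toy)

print(size_before, units = "Kb")
print(size_after, units = "Kb")
```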
In this case, the data set is quite small, and hence it would be better to create an additional
variable instead of overwriting the original one.
g.data$amt.fac <- as.factor(ifelse(amount <= 2500, "0-2500",
                  ifelse(amount <= 5000, "2500-5000", "5000+")))
head(g.data$amt.fac)
Similarly, we can do so for age.
g.data$age.fac <- as.factor(ifelse(age <= 30, "0-30",
                  ifelse(age <= 40, "30-40", "40+")))
head(g.data$age.fac)
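The nested ifelse() style used above can also be written with the base R function cut(). This is only an alternative sketch, not what the rest of the post relies on; amount_demo is a small made-up vector, chosen so that the boundary values 2500 and 5000 are included.

```r
# Small made-up vector standing in for the amount column
amount_demo <- c(1200, 2500, 2501, 5000, 7800)

# Nested ifelse(), in the style used above
by_ifelse <- as.factor(ifelse(amount_demo <= 2500, "0-2500",
                       ifelse(amount_demo <= 5000, "2500-5000", "5000+")))

# Same bins with cut(): intervals are closed on the right by default,
# so 2500 falls in the first bin and 5000 in the second
by_cut <- cut(amount_demo,
              breaks = c(-Inf, 2500, 5000, Inf),
              labels = c("0-2500", "2500-5000", "5000+"))

# Cross-tabulate to confirm both approaches agree
table(by_ifelse, by_cut)
```

With more than three bins, the cut() version tends to stay readable where the nested ifelse() calls start to pile up.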
Here our dependent variable is response. It is a factor variable and has 1 and 2 as the factor
levels. Now, R by itself can handle factor variables, so we do not need to transform them unless
we plan to combine categories. But I like to keep the response category coded as 0 and 1 (this is
just because of habit and nothing else). Hence, I reassign the levels to 0 and 1.
g.data$default <- as.factor(ifelse(response == 1, 0, 1))
is.factor(g.data$default)
contrasts(g.data$default)
head(g.data$default)
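A quick way to convince yourself that a recoding like this did what you intended is to cross-tabulate the old and the new variable. Here is a minimal sketch on a made-up factor (response_demo and default_demo are illustrative names, not columns of g.data).

```r
# Made-up response factor with levels "1" and "2", mimicking the data set
response_demo <- factor(c(1, 2, 1, 1, 2))

# Level 1 becomes 0, everything else (here, level 2) becomes 1
default_demo <- as.factor(ifelse(response_demo == 1, 0, 1))

# Every original level should map to exactly one new level
table(original = response_demo, recoded = default_demo)
levels(default_demo)
```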
We attach the data again to include the newly created variables.
attach(g.data)
In the previous post, one of the comments introduced me to the with() command as a substitute for
attach(). The with() command also serves the purpose quite well. It spares us the pain of
writing the object name with the $ sign before we can refer to a variable, but it needs to be
included for every operation that we perform on the object.
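To make the difference concrete, here is a small sketch on a made-up data frame (toy and its values are purely illustrative).

```r
# Toy data frame standing in for g.data (hypothetical values)
toy <- data.frame(amount = c(1200, 3400, 7800, 2500),
                  age    = c(25, 37, 52, 41))

# Without attach(): every column needs the toy$ prefix
mean_amount_dollar <- mean(toy$amount)

# With with(): columns are visible by name, but only inside this call
mean_amount_with <- with(toy, mean(amount))

# Each further operation needs its own with() wrapper
age_range <- with(toy, range(age))
```

Unlike attach(), nothing stays on the search path after the with() call returns.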
Now, we saw that there are a lot of categorical variables present in the data. R provides many
functions to plot categorical data.
Let's see an example.
mosaicplot(default ~ age.fac, color = TRUE)
mosaicplot(default ~ job, color = TRUE)
mosaicplot(default ~ chk_acct, color = TRUE)
We can also use a spine plot.
spineplot(default ~ age.fac)
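Both of these plots are pictures of a contingency table, so it can help to look at the numbers behind them as well. The sketch below builds such a table from two made-up factors (default_demo and age_demo stand in for the real variables) and converts the counts to shares within each age group.

```r
# Made-up factors standing in for default and age.fac
default_demo <- factor(c(0, 1, 0, 0, 1, 0, 1, 0))
age_demo     <- factor(c("0-30", "0-30", "30-40", "40+",
                         "0-30", "30-40", "40+", "40+"))

# Counts for each combination of the two factors
counts <- table(default = default_demo, age.fac = age_demo)
counts

# Share of each default level within an age group (columns sum to 1)
shares <- prop.table(counts, margin = 2)
round(shares, 2)

# The same table can be passed straight to mosaicplot()
mosaicplot(counts, color = TRUE, main = "default by age.fac (toy data)")
```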
We can also check the relations between variables.
library(lattice)
xyplot(amount ~ age)
In case you don't have the lattice library installed, you can download it by running
install.packages("lattice")
We can also condition on a variable and see the interaction.
xyplot(amount ~ age | default)
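One thing worth knowing about lattice is that its plotting functions return an object instead of drawing directly. At the console the object is printed (and hence drawn) automatically, but inside a function or a script run through source() you need to print it yourself. A minimal sketch, using a made-up data frame called toy and the data argument instead of attach():

```r
library(lattice)

# Toy data standing in for g.data (hypothetical values)
set.seed(7)
toy <- data.frame(age     = sample(20:70, 200, replace = TRUE),
                  default = factor(sample(0:1, 200, replace = TRUE)))
toy$amount <- round(runif(200, min = 250, max = 15000))

# data = toy avoids the need for attach(); one panel per level of default
p <- xyplot(amount ~ age | default, data = toy,
            xlab = "Age", ylab = "Amount")

# Explicit print() so the plot is drawn even outside the console
print(p)
```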
The lattice package also has a barchart option, and it lets you plot the barchart and a histogram
for factor variables as well.
barchart(age.fac, col = "grey")
barchart(amt.fac, col = "grey")
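If you prefer to see the counts before plotting them, an alternative sketch is to tabulate the factor first and hand the resulting data frame to barchart() through its formula interface. The factor age_demo below is made up for illustration.

```r
library(lattice)

# Made-up factor standing in for age.fac
age_demo <- factor(c("0-30", "0-30", "30-40", "40+", "0-30", "40+"))

# Tabulate the factor, then turn the counts into a data frame
age_counts <- as.data.frame(table(age.fac = age_demo))
age_counts

# Horizontal bars: the factor goes on the left of the formula
print(barchart(age.fac ~ Freq, data = age_counts,
               col = "grey", origin = 0, xlab = "Count"))
```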