Principles of Data Visualization

Integrated Analysis with R


Integrated Analysis using R
! Overview of R
! Data Loading with R
! R based Graphics Packages
! Clustering with R
! Decision Trees with R

What is R?
! R is an open source statistical software environment
! Supported on Windows, OS X, and UNIX

RStudio

[Figure: RStudio window with its panes labeled: Text Editor, Console, Files, Packages, Plots]

R for Everyone, Jared Lander


Functions for Navigation
! getwd() - return working directory
! setwd() - set working directory
! help() - open a help page
! library(help=packageName) - help on a specific package
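
A minimal sketch of these navigation functions in use. The scratch directory from tempdir() and the choice of the stats package are assumptions made for illustration; they are not part of the slides.

```r
# Remember the current working directory so it can be restored later.
start_dir <- getwd()

# Switch to a scratch directory (tempdir() is only an illustrative target).
setwd(tempdir())
scratch_dir <- getwd()

# Return to where we started.
setwd(start_dir)

# Typing help("mean") at the console opens the help page for mean().
# library(help = "stats") returns the documentation index of a package;
# here the result is stored instead of displayed.
stats_info <- library(help = "stats")
```

Restoring the original directory at the end keeps later relative file paths (such as a csv file name) pointing where you expect.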

Data Frame
! data.frame() - a list of variables (vectors) with the same number of rows

Vectors

R Help, r-tutor.com
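
As a small illustration of a data frame built from vectors, the sketch below uses made-up names and values (name, age, member are hypothetical and not from the slides).

```r
# Three vectors of equal length, one per variable (illustrative values).
name   <- c("Ann", "Bob", "Cy")
age    <- c(28, 35, 41)
member <- c(TRUE, FALSE, TRUE)

# data.frame() binds the vectors as columns; each row is one observation.
people <- data.frame(name, age, member)

str(people)      # structure: 3 observations of 3 variables
people$age       # a single column is again a vector
```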
Additional Functions
! append() - add elements to a vector
! cbind(), rbind() - combine vectors by column / by row
! mean(x), weighted.mean(x, w), median(x), min(x), max(x),
quantile(x)
! plot() - generic R object plotting
! hist() - histogram
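
The functions above can be tried on a small vector. This is only a sketch; the values of x and the weights w are invented for illustration.

```r
x <- c(2, 4, 6, 8)
x <- append(x, 10)          # x is now 2 4 6 8 10

w <- c(1, 1, 1, 1, 4)       # illustrative weights, one per element of x
m <- cbind(x, w)            # 5 x 2 matrix: vectors combined as columns

avg      <- mean(x)              # arithmetic mean
wavg     <- weighted.mean(x, w)  # mean with the last value counted 4 times
mid      <- median(x)
smallest <- min(x)
largest  <- max(x)
quartiles <- quantile(x)         # 0%, 25%, 50%, 75%, 100%
```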

Loading Data in R
! Step 1: Open a new project in RStudio with New Directory

Loading Data in R (continued)

! Step 2: Select Empty Project

New Directory
! Step 3: Provide a name for the Directory
! Step 4: Save the New Project

Create an Analysis File
! Step 5: Create a new R file and name it KmeanSYYXXX.R
! Step 6: Place the file iris.csv in the new directory
! Step 7: Read the csv file in using the read.csv function
> iris <- read.csv("iris.csv")
The object iris is of type data frame. Check the class of an object using
class(), e.g. class(iris)
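
A self-contained sketch of Step 7. Because the course file iris.csv is not available here, the example first writes R's built-in iris data set to a temporary csv file; that temporary path and the name iris_df are assumptions for illustration.

```r
# Create a stand-in for iris.csv from the built-in iris data set.
csv_path <- file.path(tempdir(), "iris.csv")
write.csv(datasets::iris, csv_path, row.names = FALSE)

# Read the csv file; the file name must be a quoted string.
iris_df <- read.csv(csv_path)

class(iris_df)   # "data.frame"
head(iris_df)    # first six rows
```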

Graphics Packages for R
! Graphics Base Package
! Lattice
! ggplot2 (future lecture)

Graphics Package Functions

Graphics package function    Description

barplot                      Bar and column charts
hist                         Histograms
density                      Kernel density plots
plot                         Scatter plots
qqplot                       Quantile-quantile plot
pie                          Pie chart

Histogram
! Graph of a single variable
! Also known as a frequency chart or distribution
! Execute with hist(x)
! Example: > hist(diamonds$carat)
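
A minimal sketch using the built-in iris data instead of diamonds, on the assumption that diamonds may not be available (it ships with the ggplot2 package). The title and axis label are illustrative.

```r
# Histogram of a single numeric variable.
sepal_hist <- hist(iris$Sepal.Length,
                   main = "Sepal length in iris",
                   xlab = "Sepal length (cm)")

# hist() also returns the bin boundaries and the counts per bin.
sepal_hist$breaks
sepal_hist$counts
```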

Scatterplot
! Visualize relationship between 2 variables(vectors)
! Execute with plot(x, y)
! Example: > plot(diamonds$carat, diamonds$price)

Scatterplot Arguments
! (x,y) define the two vectors, or variables that you wish to
plot on the x-axis and y-axis
! Time series, formula, list or matrix can also be specified
! type = specifies the type of plot
"p" for points (default value)
"l" for lines
"o" for over-plotted points and lines
"b" for points joined by lines
"s" for stair steps
"h" for histogram-style vertical lines
"n" for no points or lines

Scatterplot Functions
! main - title of the plot
! sub - subtitle of the plot
! xlab - x-axis label
! ylab - y-axis label
! axes = TRUE - whether to draw the axes (default is TRUE)
! xlim - numeric vector specifying the x-axis limits of the plot
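
The scatterplot arguments can be combined in one call. The sketch below assumes the built-in cars and pressure data sets as stand-ins for diamonds; the titles and labels are made up.

```r
# x-axis limits: from 0 to a little beyond the fastest car.
x_limits <- c(0, max(cars$speed) + 5)

# Points (the default type) with titles, axis labels and x limits.
plot(cars$speed, cars$dist,
     type = "p",
     main = "Stopping distance by speed",
     sub  = "Built-in cars data",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     xlim = x_limits)

# The same kind of call with points joined by lines.
plot(pressure$temperature, pressure$pressure,
     type = "b",
     xlab = "Temperature", ylab = "Pressure")
```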

Creating a Matrix
! A matrix can be created from a data frame

r-tutor.com
Barplot
1. Load the data library
> library(nutshell)
2. Load the data set
> data(doctorates)
3. Transform the data frame to a matrix
> doctorates.m <- as.matrix(doctorates[2:7])
> rownames(doctorates.m) <- doctorates[, 1]
> doctorates.m

4. Create Barplot
> barplot(doctorates.m[1, ])

R in a Nutshell
Barplot Result

R in a Nutshell
Horizontal Barplot
! Display Horizontal Bars for the categories for each year
>barplot(doctorates.m, beside=TRUE, horiz=TRUE,
legend=TRUE, cex.names=.75)

R in a Nutshell
Stacked Vertical Barplot
! Display Stacked Bar Chart
>barplot(t(doctorates.m), legend=TRUE, ylim=c(0, 66000))
Note: t() is the transpose function in R

R in a Nutshell
Barplot Arguments
! beside=FALSE - the bars are stacked
! beside=TRUE - the bars are plotted adjacent to each other
! width - numeric vector giving the widths of the bars, default value 1
! space - if beside=FALSE, a numeric value indicating the amount
of space between bars. You specify the space as a fraction of the
average column width. If beside=TRUE, you can specify a
two-element vector, where the first element specifies the space
within a group and the second represents the space between
groups.
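
The barplot variants above can be reproduced without the nutshell package. This sketch assumes the built-in VADeaths matrix as a stand-in for doctorates.m; the ylim and space values are illustrative choices.

```r
deaths <- VADeaths   # 5 age groups (rows) x 4 population groups (columns)

# Simple barplot of one row of the matrix.
one_row_mids <- barplot(deaths[1, ], main = "Age group 50-54")

# Grouped horizontal bars: beside = TRUE places bars next to each other;
# space = c(0, 2) means no gap within a group and 2 bar widths between groups.
barplot(deaths, beside = TRUE, horiz = TRUE,
        legend.text = TRUE, cex.names = 0.75, space = c(0, 2))

# Stacked vertical bars: t() transposes so that each bar is an age group.
stacked <- t(deaths)
barplot(stacked, legend.text = TRUE, ylim = c(0, 250))
```

barplot() returns the midpoints of the bars it drew, which is useful if you later want to add labels above them.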

Pie chart
! It is one of the most popular charts used
! Execute with pie(data)
! Example: > x1 <- 1:9
! > pie(x1)

R in a Nutshell
Arguments for Pie Chart
! x - non-negative numeric values that will be plotted
! labels - an expression to generate labels, a vector of character
strings, or another object that can be used as labels
! edges - a numeric value indicating the number of segments
used to draw the outside of the pie. Default value 200.
! radius - a numeric value that specifies how big the pie should
be. (Parts of the pie are cut off for values over 1.) Default
value 0.8
! clockwise - a logical value indicating whether slices are
drawn clockwise or counterclockwise. Default value is FALSE
! main - a character string that represents the title.
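
A short sketch that passes each of the pie chart arguments explicitly. The slice labels and the title are invented for illustration.

```r
x1 <- 1:9

# Share of the whole represented by each slice.
share <- x1 / sum(x1)

pie(x1,
    labels    = paste0("Slice ", x1),
    edges     = 200,       # segments used to draw the outline (default)
    radius    = 0.8,       # default size
    clockwise = TRUE,      # default is FALSE (counterclockwise)
    main      = "Pie chart of 1:9")
```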

Clustering Analysis
Clustering analysis groups observations so that members of a group are similar to each
other and different from the members of other groups
The two most popular algorithms are
1. K-means
2. Hierarchical clustering

K-means Clustering
Dividing observations into discrete groups based on a distance
metric from the cluster mean
User specifies the number of clusters (groups)
Executed as > kmeans(data, no. of clusters)
> result.val <- kmeans(iris[, 1:4], centers=3)
Note: kmeans() needs numeric data, so the Species column is left out

Kmeans clustering contd
! > iris$clust <- result.val$cluster
! > ggplot(iris, aes(x=Petal.Length, y=Sepal.Length)) +
geom_point(color=iris$clust) + xlab("Petal Length") +
ylab("Sepal Length")
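
A runnable sketch of the K-means steps, assuming the built-in iris data set and base graphics in place of ggplot. The seed value, nstart = 10 and the name iris_clustered are illustrative choices, not from the slides.

```r
set.seed(42)                      # fixed seed so the clusters are repeatable

features <- iris[, 1:4]           # numeric columns only (Species is excluded)
result.val <- kmeans(features, centers = 3, nstart = 10)

# Attach the cluster number to a copy of the data.
iris_clustered <- cbind(iris, clust = result.val$cluster)

# Compare the clusters with the known species.
table(iris_clustered$Species, iris_clustered$clust)

# Scatterplot colored by cluster.
plot(iris_clustered$Petal.Length, iris_clustered$Sepal.Length,
     col = iris_clustered$clust,
     xlab = "Petal Length", ylab = "Sepal Length")
```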

Optimal number of clusters - Kmeans
! The graph below depicts the total within-cluster sum of squared errors for each number of clusters. The total
within-cluster sum of squares measures the variation left unexplained inside the clusters, so each drop in it shows how much variance an added cluster explains.
! We see below that the first cluster provides a lot of information (explains a lot of variance), and so does
the 2nd cluster (to a lesser degree), but after 3 clusters, adding another cluster does not explain much more
variance. This point forms an elbow in the graph and is taken as the optimum number of clusters.

Optimum # of clusters

Example
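
One way to draw such a curve is sketched below, assuming the built-in iris data, standardized with scale(), and 1 to 8 clusters. These choices, the seed and nstart = 20 are illustrative only.

```r
set.seed(42)
features <- scale(iris[, 1:4])    # standardize so all variables count equally

k_values <- 1:8

# Total within-cluster sum of squares for each number of clusters.
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

# Look for the elbow: the point after which the curve flattens.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```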

Optimal number of clusters - Kmeans
! The graph below depicts the total between-cluster sum of squares for each number of clusters. The total
between-cluster sum of squares measures the amount of variation between clusters.
! We see below that as we go from 4 clusters to 6 clusters, there is a considerable amount of variation
between clusters, but beyond 8 clusters the amount of variation starts reducing and further clusters do
not explain much. Hence we stop at 8 clusters. This is also the point of discontinuity in the curve, as
highlighted below.

Optimum # of clusters

Example
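
A sketch of the between-cluster view, again assuming the standardized built-in iris data rather than the data behind the slide. It relies on the fact that kmeans() reports totss, betweenss and tot.withinss, with the total equal to the sum of the other two.

```r
set.seed(42)
features <- scale(iris[, 1:4])

k_values <- 2:10
fits <- lapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)
})

# Between-cluster sum of squares for each number of clusters.
bss <- sapply(fits, function(fit) fit$betweenss)

plot(k_values, bss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total between-cluster sum of squares")

# Total variation splits into between-cluster and within-cluster parts.
splits_add_up <- sapply(fits, function(fit) {
  isTRUE(all.equal(fit$totss, fit$betweenss + fit$tot.withinss))
})
```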

Partitioning Around Medoids (PAM)
Two problems with the K-means algorithm are that it does not work
with categorical data and that its performance is affected by
outliers. Both problems can be overcome with the use of
K-medoids.
Instead of taking the mean of the cluster as its centre, K-medoids takes
an actual observation in the cluster as its centre.
PAM is the most common K-medoids algorithm. The pam() function
performs the clustering and, as with K-means clustering, we
have to specify the number of clusters to be formed as one
of the arguments to this function.
PAM handles missing data well and can take a data matrix,
a data frame, or a dissimilarity matrix as input.
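
A minimal sketch of pam(), assuming the cluster package is installed and using the built-in iris measurements with 3 clusters (both are illustrative assumptions).

```r
library(cluster)

# Cluster the four numeric columns into 3 groups around medoids.
pam_fit <- pam(iris[, 1:4], k = 3)

pam_fit$medoids            # the centres are actual observations
pam_fit$id.med             # row numbers of those observations
table(pam_fit$clustering)  # cluster sizes

# Check that each medoid really is a row of the data.
medoid_rows <- as.matrix(iris[pam_fit$id.med, 1:4])
```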

Hierarchical Clustering
A hierarchical clustering can be thought of as a tree and displayed as
a dendrogram; at the top there is just one cluster consisting of all
the observations, and at the bottom each observation is an entire
cluster.
In between are varying levels of clustering.
Unlike the K-means and K-medoids algorithms, the user need not
specify the number of clusters to be formed in this algorithm
beforehand.
There are different ways to compute the distance between the
clusters: single, complete, average and centroid.
The measure chosen to compute the distance can have a significant
impact on the results of the clustering.

Hierarchical Clustering (Contd)
The function hclust() takes a dissimilarity matrix of the data to be
clustered (for example from dist()) and the method used to compute the
distance between clusters as arguments to perform the clustering.
The resulting tree produced by hierarchical clustering can
be cut in two different ways to generate the desired number of
clusters.
We can either specify the required number of clusters, which
determines where the cut takes place, or we can specify where
to make the cut, which determines the number of clusters.
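
Both ways of cutting the tree can be sketched with cutree(). The example assumes the standardized built-in iris data, complete linkage, k = 3 and a cut height of 3, all chosen for illustration.

```r
# Pairwise distances between observations.
distances <- dist(scale(iris[, 1:4]))

# Build the tree; method can be "single", "complete", "average", "centroid".
hc <- hclust(distances, method = "complete")

# Draw the dendrogram (labels suppressed to keep it readable).
plot(hc, labels = FALSE, main = "Hierarchical clustering of iris")

# Cut 1: ask for a number of clusters; the cut height follows from it.
by_count <- cutree(hc, k = 3)

# Cut 2: give the height of the cut; the number of clusters follows from it.
by_height <- cutree(hc, h = 3)
table(by_height)
```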

Decision Trees
Decision trees are the most common predictive analysis
technique used for categorical data.
Nested structures where each node represents an attribute used to
classify data
The route from the first node (root node) to the last node (leaf node)
gives the rule
Training data and testing data are used to build and fine tune the
tree.

Decision Trees Visualization

A complete decision tree is shown on
the right with the root node and other
subsequent nodes highlighted below.

The root node always denotes the variable (vector) of the highest importance. We can
see that 71% of the passengers did not survive and 29% survived, and Passenger
Class is the variable (vector) that gives the best probability split.

Example

Decision Trees Visualization
A complete decision tree is shown on
the right with the root node and other
subsequent nodes highlighted below.

The first node to the left, following the root node, is what is shown above. Note that
the variable (vector) Sex gives the highest probability split for this node.
All the variables (vectors) that appear in the decision tree are the ones that provide
the best decision tree and hence are the most significant variables (vectors) in the
analysis.
Example

Decision Trees(Contd)
Misclassification rate is used to test the performance
The lower the misclassification rate, the better the model
Overfitting should be avoided by pruning the tree to the
appropriate level

Rpart and Rattle
These packages are not part of the standard R installation and
should be installed and loaded before use
rpart is used to build classification trees in R
The predict() method for rpart trees is used to predict the values
of the test data based on the generated tree.
Taking mean() of the mismatches between predicted and actual values gives the misclassification rate.
The fancyRpartPlot() function from the rattle package draws a
decision tree plot
The plot can be exported to various formats for future use.
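
The workflow above can be sketched as follows, assuming the rpart package is installed and using the built-in iris data in place of the Titanic data. The 100/50 split, the seed and cp = 0.05 are illustrative, and the base plot() and text() methods stand in for fancyRpartPlot() so that rattle is not required.

```r
library(rpart)

set.seed(1)
train_rows <- sample(nrow(iris), 100)   # 100 rows for training
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]            # remaining 50 rows for testing

# Build a classification tree for Species from all other columns.
tree <- rpart(Species ~ ., data = train, method = "class")

# Predict the class of each test row and compare with the truth.
predicted <- predict(tree, newdata = test, type = "class")
misclass_rate <- mean(predicted != test$Species)

# Prune to a chosen complexity parameter to limit overfitting.
pruned <- prune(tree, cp = 0.05)

# Draw the tree with base graphics.
plot(tree, margin = 0.1)
text(tree, use.n = TRUE)
```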

Export the plot
jpeg(), pdf(), bmp() and png() are functions that create a plot file in their
respective formats
Create the pdf file iris.pdf
> pdf("iris.pdf")
Execute the ggplot command to create the plot
> ggplot(iris, aes(x=Petal.Length, y=Sepal.Length)) +
geom_point(colour=iris$clust) + xlab("Petal Length") +
ylab("Sepal Length")
Close the file using the dev.off() function
Note: The file is created in your current working
directory
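
A self-contained sketch of the export steps, assuming a base graphics plot and a file in the temporary directory (both are substitutions made so that the example runs without ggplot2 or a project folder).

```r
# Open a pdf device; everything plotted next goes into this file.
pdf_path <- file.path(tempdir(), "iris.pdf")
pdf(pdf_path)

plot(iris$Petal.Length, iris$Sepal.Length,
     col = as.integer(iris$Species),
     xlab = "Petal Length", ylab = "Sepal Length")

# Close the device; the file is complete only after this call.
dev.off()

file.exists(pdf_path)
```

When a ggplot call is run inside a script or function rather than typed at the console, wrap it in print() so that the plot is actually drawn to the open device.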

Tableau - R Integration

! R functions and models can be invoked
through Tableau by creating calculated
fields. Through this calculated field we can
dynamically invoke the R engine and pass
values to R. The results are then
returned to Tableau and used by the
Tableau visualization engine.
! An example of K means clustering
visualization with R and Tableau

Tableau.com
Tableau - R Integration Connection (Step 1)

! Connections between Tableau and R are made possible
by Rserve. To start Rserve, type code in the R window
that does the following:
1. Installs the Rserve package.
2. Loads the library.
3. Starts the Rserve function.
Note that your R window should be left open after starting Rserve(). If you
close the R session, it will not be possible to connect or maintain the
connection between Tableau and R.
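
The code on the original slide was an image and is not recoverable, so the following is only a sketch of the three steps described. The wrapper function start_rserve is a hypothetical name introduced here so that nothing is installed or started until you call it.

```r
# Hypothetical helper wrapping the three steps; not from the slides.
start_rserve <- function() {
  # Step 1: install the Rserve package if it is not already present.
  if (!requireNamespace("Rserve", quietly = TRUE)) {
    install.packages("Rserve")
  }
  # Step 2: load the library.
  library(Rserve)
  # Step 3: start the Rserve server (it listens on port 6311 by default).
  Rserve()
}

# Run this line in the R window when you are ready to connect Tableau:
# start_rserve()
```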
Tableau - R Integration Connection (Step 2)
! Connect Tableau to the R Server. Open
Tableau Desktop and follow the steps
below:
! a. Go to the Help menu and select
Manage R Connection.
! b. Enter a server name of
localhost (or 127.0.0.1) and a
port of 6311.
! c. Click on the Test Connection
button to make sure everything runs
smoothly. You should see a success
message. Click OK to close.
Tableau - R Integration Connection (Step 3)

! There are four new built-in functions that are used to call specific R models and
functions.
The functions are:
SCRIPT_REAL
SCRIPT_STR
SCRIPT_INT
SCRIPT_BOOL
These functions differ only in the type of result they return: a real number, a
string, an integer, or a Boolean. The arguments you pass into each of these
functions include R-language scripts and function calls. You can pass one or more
arguments to R, which are then passed dynamically via Tableau.
Tableau - Parameters
! Parameters are dynamic values that can replace constant values in
calculations, filters, and reference lines.
! Click the drop-down arrow in the upper right corner and select Create
Parameter.
! Specify the parameter name, data type and the allowable values to be used.
Generally, we leave it at All unless we want to use a value from a list of
values, in which case we select List.
Tableau Calculated Field
! You create calculated fields in Tableau by defining a formula that is
based on existing fields and other calculated fields, using standard
functions and operators.
! To open the calculation editor, click the drop-down to the right of
Dimensions on the Data pane and choose Create Calculated
Field.
R-script within a Calculated Field in Tableau
SCRIPT_INT("
## Sets the seed. The seed determines the output of the random number
## generator, so a constant seed gives the same clusters on every run.
set.seed( .arg8[1] )
## Studentizes the variables. Variables in the dataset might have different
## units, so the inputs to kmeans clustering are standardized.
age <- ( .arg1 - mean(.arg1) ) / sd(.arg1)
edu <- ( .arg2 - mean(.arg2) ) / sd(.arg2)
gen <- ( .arg3 - mean(.arg3) ) / sd(.arg3)
car <- ( .arg4 - mean(.arg4) ) / sd(.arg4)
chi <- ( .arg5 - mean(.arg5) ) / sd(.arg5)
inc <- ( .arg6 - mean(.arg6) ) / sd(.arg6)
dat <- cbind(age, edu, gen, car, chi, inc)
## Number of clusters to be formed. Note that it is also one of the
## input parameters to the function.
num <- .arg7[1]
## Creates the clusters. This invokes the kmeans function in R.
kmeans(dat, num)$cluster
",
MAX( [Age] ), MAX( [Education ID] ), MAX( [Gender ID] ),
MAX( [Number of Cars] ), MAX( [Number of Children] ),
MAX( [Yearly Income] ), [Number of Clusters], [Seed])
Example
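
To see what the script does outside Tableau, the same logic can be sketched as a plain R function. The function names standardize and cluster_rows, the mtcars columns and the seed are assumptions for illustration; in Tableau the inputs arrive as .arg1, .arg2 and so on.

```r
# Subtract the mean and divide by the standard deviation.
standardize <- function(v) (v - mean(v)) / sd(v)

# Hypothetical stand-in for the calculated field: standardize every
# column, fix the seed, and return the cluster number of each row.
cluster_rows <- function(columns, num, seed) {
  set.seed(seed)
  dat <- sapply(columns, standardize)   # matrix of standardized columns
  kmeans(dat, num)$cluster
}

demo <- mtcars[, c("mpg", "hp", "wt")]  # illustrative numeric columns
first_run  <- cluster_rows(demo, num = 3, seed = 123)
second_run <- cluster_rows(demo, num = 3, seed = 123)

# A constant seed gives the same clusters on every run.
identical(first_run, second_run)
```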
Tableau - R Integration Visualization of kmeans
cluster in Tableau

The above snapshot is from a demo screen using dummy data from Titanic dataset.
Example
Tableau - R Integration Visualization of kmeans
cluster in Tableau

The field cluster must be set to Discrete as shown in the snapshot below. Also the other
variable within columns, Survived must be a Dimension. Note they appear in blue compared
to fields in the rows.
Tableau - R Integration Visualization of kmeans
cluster in Tableau

The highlighted area provides insight into the distribution of gender across clusters.
Cluster 2 and Cluster 4 consist mostly of male passengers, and cluster 6 consists
mostly of female passengers.
Example
Tableau - R Integration Visualization of kmeans
cluster in Tableau

The highlighted area provides insight into the distribution of embarkation points
across clusters. Clusters 4 and 6 have passengers who boarded only from
Cherbourg (blue dot).

Example
Tableau - R Integration Visualization of kmeans
cluster in Tableau

The highlighted area provides insight into the distribution of the number of siblings
across clusters. Cluster 1 has passengers with a higher number of siblings on average,
in both the survived and non-survived categories, compared to the rest of the clusters.
Example
Tableau Analyzing bar charts across clusters
! The bar chart shows the distribution of survivors and non-survivors across
clusters by gender.
! Note that cluster 2 has the highest number of non-survivors and they are all
male passengers (orange color).
! Cluster 5 has the highest number of survivors and they have almost equal
representation of male and female passengers.
Right click on the Y axis (No of Records) and click on Add Reference Line. Select
Per Pane as the Scope and Average as the value for Sum(Number of Records).
Example
Tableau Analyzing bar charts across clusters
! The chart shows the distribution of Passenger class across clusters for both
male and female passengers.
! As we see, most of the male passengers from cluster 2 who did not survive
were from passenger class 3.

This feature of Tableau is called Cards. Cards are containers for shelves, legends, and other
controls. What we see here is the color card. It contains the legend for the colors in the view and
is only available when there is at least one field on Color.
Example

Possible errors while visualizing clusters - 1

! When you drag any dimension (e.g. Embarked) into the shelf, and have
the calculated field Cluster in the column, you may encounter this
error message.
! If so, please click on OK. Click on the field under the Marks section, and select
Attribute as shown below.

Possible errors while visualizing clusters - 2

! If you still continue to encounter the below error message, please ensure
that under Analysis tab, the Aggregate Measures are turned off.

Summary
! Rpart and Rattle packages are used for Decision Tree Analysis
! It is important to identify the optimal number of clusters
before we begin our clustering exercise
! Kmeans, PAM and hclust are important functions for
Clustering Analysis
! Tableau R integration provides a visually enriching platform to
analyze results.

APPENDIX

Sorting Ascending / Descending

Discrete Sorting

Sort on Any Field

Sort Sub-Category within Category

Sort Sub-Category (label category)

Clear Sorts

Grouping Approaches
! Equal Steps
0-100, 100-200, 200-300, etc.

! Unequal Steps
2 + 5 + 8 + 11 + 14 = 40
(First + Last) * count = double the sum
(2 + 14) * 5 = 80

! Irregular Steps
Quartiles of a Distribution

Isabel Meirelles, Design for Information
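
In R, grouping a measure into steps can be sketched with cut(). The values below are invented for illustration; the last lines simply re-check the arithmetic shown for the 2, 5, 8, 11, 14 series.

```r
values <- c(12, 20, 85, 140, 160, 230, 250, 290)   # illustrative measure

# Equal steps: 0-100, 100-200, 200-300.
equal_groups <- cut(values, breaks = seq(0, 300, by = 100))
table(equal_groups)

# Irregular steps: breaks placed at the quartiles of the distribution.
quartile_groups <- cut(values, breaks = quantile(values),
                       include.lowest = TRUE)
table(quartile_groups)

# (first + last) * count is double the sum of the series.
steps <- c(2, 5, 8, 11, 14)
series_sum <- sum(steps)
doubled    <- (steps[1] + steps[length(steps)]) * length(steps)
```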


Defining Groups for a Measure

Find Meaningful Groupings

Lattice Plotting System
Splits charts into different panels.
Includes bar charts, scatter plots, dot plots, strip plots
The data assigned to each panel is referred to as a packet
Higher level functions call lower level functions
Unlike the graphics package functions, lattice functions share
common arguments
The appearance of graphs can be customized

Create a Data Frame

R in a Nutshell
Simple Lattice Plot
! Display single plot chart in lattice
>xyplot(y~x, data=d)

R in a Nutshell
Lattice Chart
! Create a Lattice Chart
>library(lattice)
>xyplot (y~x|z, data=d)

R in a Nutshell
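
A self-contained sketch of the two lattice calls, assuming the lattice package is installed. The data frame d with columns x, y and z is generated here with made-up values, since the one from the slide is not available.

```r
library(lattice)

set.seed(7)
# Illustrative data frame: x and y are numeric, z defines the panels.
d <- data.frame(x = 1:30,
                y = rnorm(30),
                z = rep(c("a", "b", "c"), each = 10))

# A single scatterplot.
single_plot <- xyplot(y ~ x, data = d)

# One panel per level of z; each panel receives its own packet of data.
paneled_plot <- xyplot(y ~ x | z, data = d)

# Lattice functions return objects; print() draws them.
print(paneled_plot)
```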
Lattice to Graphics Functions

Graphics function    Lattice function    Description

barplot              barchart            Bar and column charts
dotchart             dotplot             Cleveland dot plots
hist                 histogram           Histograms
density              densityplot         Kernel density plot
plot                 xyplot              Scatter plots

ggplot Package
The ggplot2 package is built on grid, which provides another
way to generate plots
It is not part of the standard R installation and should be
installed and loaded before using it.

Reasons for using ggplot
It provides a very powerful language for concisely expressing a
wide variety of plots
The default appearance of plots has been carefully chosen with
visual perception in mind
The arrangement of plot components and the inclusion of
legends in the plot is automated

Qplot
For very simple plots, the qplot() function in ggplot2 serves a
similar purpose to the plot() function in traditional graphics
All that is required is to specify the relevant data values and the
qplot() function produces a complete plot
>qplot(temperature, pressure, data=pressure)

R in a Nutshell
Steps to create a ggplot
1. The basic structure for ggplot2 starts with the ggplot function,
which defines the data that we want to view
2. After initializing the object, we specify the shapes or geoms
that we are going to use to view the data and add those to the
plot
3. Specify the features or aesthetics that will be used to represent
the data values with aes(). The different layers in the plot are
brought together using the + operator.

> ggplot(data=diamonds) + geom_histogram(aes(x=carat))

R in a Nutshell
Histogram in ggplot
! A histogram of the same data set using ggplot2
! Eg: ggplot(data=diamonds) + geom_histogram(aes(x=carat))
! Since, histogram is a 1-dimensional plot, we need to specify
only the x-axis in the aesthetic mapping

R in a Nutshell
Scatterplot in ggplot
! A scatterplot produced using ggplot is shown below
! Eg: > ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
! Here, we mention both the X and Y axis variables in the
aesthetic mapping function

R in a Nutshell
Scatterplot (Contd.)
We can add colors to the scatterplot to help us differentiate
different categories and interpret it better. This is the real
power of ggplot2.
It helps us to prepare more easily readable, interpretable
graphs with just a few lines of code compared to the base
graphics package.
Eg: ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(aes(color=color))

R in a Nutshell
Box plot in ggplot
! Box plot could be drawn with the help of the
geom_boxplot()
! Eg: > ggplot(diamonds, aes(y = carat, x = 1)) +
geom_boxplot()

R in a Nutshell
