
Bootstrapping

Nonparametric Bootstrapping

The boot package provides extensive facilities for bootstrapping and related resampling methods. You
can bootstrap a single statistic (e.g., a median) or a vector (e.g., regression weights). This section will
get you started with basic nonparametric bootstrapping.

The main bootstrapping function is boot( ) and has the following format:

bootobject <- boot(data= , statistic= , R= , ...) where

parameter    description
data         A vector, matrix, or data frame
statistic    A function that produces the k statistics to be bootstrapped (k=1 if
             bootstrapping a single statistic).
             The function should include an indices parameter that the boot() function
             can use to select cases for each replication (see examples below).
R            Number of bootstrap replicates
...          Additional parameters to be passed to the function that produces the
             statistic of interest

boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with
replacement, from the integers 1:nrow(data). These indices are used within the statistic function to
select a sample. The statistics are calculated on the sample and the results are accumulated in the
bootobject. The bootobject structure includes

element      description
t0           The observed values of the k statistics applied to the original data.
t            An R x k matrix where each row is a bootstrap replicate of the k statistics.

You can access these as bootobject$t0 and bootobject$t.
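
To make the mechanics concrete, here is a minimal hand-rolled sketch of what the text above describes, using base R only. It is an illustration, not the internals of boot( ): the statistic (a median), the seed, and the names med, t0 and t are assumptions chosen to mirror the bootobject elements.

```r
# Hand-rolled nonparametric bootstrap of a median (illustration only)
set.seed(42)                       # arbitrary seed, for reproducibility
x <- mtcars$mpg                    # the data: a numeric vector
R <- 1000                          # number of bootstrap replicates

# statistic function with an indices argument, as boot() expects
med <- function(data, indices) median(data[indices])

t0 <- med(x, seq_along(x))         # statistic applied to the original data
t  <- matrix(NA_real_, nrow = R, ncol = 1)   # R x k matrix of replicates (k = 1)
for (r in seq_len(R)) {
  idx <- sample(seq_along(x), replace = TRUE)  # random indices, with replacement
  t[r, 1] <- med(x, idx)
}

dim(t)                             # R rows, k columns
sd(t[, 1])                         # bootstrap standard error of the median
```

In practice boot( ) does this bookkeeping for you and adds options such as stratification.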

Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to
examine the results. If the results look reasonable, you can use the boot.ci( ) function to obtain
confidence intervals for the statistic(s).

The format is

boot.ci(bootobject, conf= , type= ) where

parameter    description
bootobject   The object returned by the boot function
conf         The desired confidence level (default: conf=0.95)
type         The type of confidence interval returned. Possible values are "norm",
             "basic", "stud", "perc", "bca" and "all" (default: type="all")

Bootstrapping a Single Statistic (k=1)


The following example generates the bootstrapped 95% confidence interval for R-squared in the linear
regression of miles per gallon (mpg) on car weight (wt) and displacement (disp). The data source is
mtcars. The bootstrapped confidence interval is based on 1000 replications.

# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.squared)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
   R=1000, formula=mpg~wt+disp)

# view results
results
plot(results)

# get 95% confidence interval
boot.ci(results, type="bca")

Bootstrapping several Statistics (k> 1)


In the example above, the function rsq returned a number and boot.ci returned a single confidence
interval. The statistic function you provide can also return a vector. In the next example we get the
95% CI for the three model regression coefficients (intercept, car weight, displacement). In this case we
add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to be
analyzed.

# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
   R=1000, formula=mpg~wt+disp)

# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp

# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
boot.ci(results, type="bca", index=2) # wt
boot.ci(results, type="bca", index=3) # disp

Going Further
The boot( ) function can generate both nonparametric and parametric resampling. For the
nonparametric bootstrap, resampling methods include ordinary, balanced, antithetic and permutation.
Stratified resampling is also supported for the nonparametric bootstrap. Importance resampling
weights can also be specified.

The boot.ci( ) function takes a bootobject and generates 5 different types of two-sided nonparametric
confidence intervals. These include the first order normal approximation, the basic bootstrap interval,
the studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap
percentile (BCa) interval.
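
As a rough illustration of how three of these interval types relate to the bootstrap replicates, the sketch below computes normal, basic and percentile intervals by hand for a sample mean. It uses base R only; the data, seed and variable names are assumptions, and boot.ci( ) uses a slightly different quantile interpolation, so its numbers will not match these exactly.

```r
# Normal, basic and percentile intervals computed by hand (illustration only)
set.seed(1)                                  # arbitrary seed
x  <- mtcars$mpg
t0 <- mean(x)                                # statistic on the original data
t  <- replicate(2000, mean(sample(x, replace = TRUE)))  # bootstrap replicates

alpha <- 0.05
q <- quantile(t, c(alpha / 2, 1 - alpha / 2), names = FALSE)

perc  <- q                                   # percentile interval
basic <- 2 * t0 - rev(q)                     # basic interval: reflect quantiles about t0
bias  <- mean(t) - t0                        # bootstrap estimate of bias
norm  <- (t0 - bias) + c(-1, 1) * qnorm(1 - alpha / 2) * sd(t)  # normal approximation

rbind(norm, basic, perc)
```

The studentized and BCa intervals need extra ingredients (a variance estimate per replicate, and bias and acceleration constants), which is where boot.ci( ) earns its keep.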

Look at help(boot), help(boot.ci), and help(plot.boot) for more details.

Learning More
Good sources of information include Resampling Methods in R: The boot Package by Angelo Canty,
Getting started with the boot package by Ajay Shah, Bootstrapping Regression Models by John Fox,
and Bootstrap Methods and Their Application by Davison and Hinkley.

Correspondence Analysis

Correspondence analysis provides a graphic method of exploring the relationship between variables in a
contingency table. There are many options for correspondence analysis in R. I recommend the ca
package by Nenadic and Greenacre because it supports supplementary points, subset analyses, and
comprehensive graphics. You can obtain the package here.

Although ca can perform multiple correspondence analysis (more than two categorical variables), only
simple correspondence analysis is covered here. See their article for details on multiple CA.

Simple Correspondence Analysis

In the following example, A and B are categorical factors.

# Correspondence Analysis
library(ca)
mytable <- with(mydata, table(A,B)) # create a 2 way table
prop.table(mytable, 1) # row proportions
prop.table(mytable, 2) # column proportions
fit <- ca(mytable)
print(fit) # basic results
summary(fit) # extended results
plot(fit) # symmetric map
plot(fit, mass = TRUE, contrib = "absolute", map =
   "rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map

The first graph is the standard symmetric representation of a simple correspondence analysis with rows
and columns represented by points.

Row points (column points) that are closer together have more similar column profiles (row profiles).
Keep in mind that you cannot interpret the distance between row and column points directly.

The second graph is asymmetric, with rows in principal coordinates and columns in reconstructions
of the standardized residuals. Additionally, mass is represented by the size of the points and columns
are represented by arrows. Point intensity (shading) corresponds to the absolute contributions for the
rows. This example is included to highlight some of the available options.
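
For readers who want to see what lies behind the maps, the following sketch carries out a simple correspondence analysis by hand with a singular value decomposition of the standardized residuals. It uses base R and the built-in HairEyeColor data as a stand-in for mydata; the variable names are assumptions, and this is a textbook-style outline rather than the code inside the ca package.

```r
# Simple correspondence analysis via the SVD (illustration only)
N <- unclass(margin.table(HairEyeColor, c(1, 2)))  # Hair x Eye two-way table
P <- N / sum(N)                         # correspondence matrix
row_mass <- rowSums(P)                  # row masses
col_mass <- colSums(P)                  # column masses

# standardized residuals: (observed - expected) / sqrt(expected)
E <- outer(row_mass, col_mass)
S <- (P - E) / sqrt(E)
dec <- svd(S)

inertia <- dec$d^2                      # principal inertias
round(100 * inertia / sum(inertia), 1)  # percent of inertia per dimension

# principal coordinates of rows and columns
row_coord <- sweep(dec$u, 1, sqrt(row_mass), "/") %*% diag(dec$d)
col_coord <- sweep(dec$v, 1, sqrt(col_mass), "/") %*% diag(dec$d)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)
```

The total inertia equals the chi-square statistic of the table divided by the sample size, which is a handy check on the arithmetic.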

Tree-Based Models

Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of
data, while developing easy to visualize decision rules for predicting a categorical (classification tree)
or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional
inference trees, and random forests.

CART Modeling via rpart

Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be
generated through the rpart package. Detailed information on rpart is available in An Introduction to
Recursive Partitioning Using the RPART Routines. The general steps are provided below followed by two
examples.

1. Grow the Tree

To grow a tree, use
rpart(formula, data=, method=, control=) where

formula    is in the format
           outcome ~ predictor1+predictor2+predictor3+etc.
data=      specifies the data frame
method=    "class" for a classification tree
           "anova" for a regression tree
control=   optional parameters for controlling tree growth. For example,
           control=rpart.control(minsplit=30, cp=0.001) requires that the minimum
           number of observations in a node be 30 before attempting a split and that a
           split must decrease the overall lack of fit by a factor of 0.001 (cost
           complexity factor) before being attempted.

2. Examine the Results
The following functions help us to examine the results.

printcp(fit)        display cp table
plotcp(fit)         plot cross-validation results
rsq.rpart(fit)      plot approximate R-squared and relative error for different splits (2
                    plots). Labels are only appropriate for the "anova" method.
print(fit)          print results
summary(fit)        detailed results including surrogate splits
plot(fit)           plot decision tree
text(fit)           label the decision tree plot
post(fit, file=)    create postscript plot of decision tree

In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs
below).

3. Prune the Tree

Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that
minimizes the cross-validated error, the xerror column printed by printcp( ).

Prune the tree to the desired size using

prune(fit, cp= )

Specifically, use printcp( ) to examine the cross-validated error results, select the complexity
parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you
can use the code fragment

fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]

to automatically select the complexity parameter associated with the smallest cross-validated error.
Thanks to HSAUR for this idea.

Classification Tree example


Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery, from age in
months (Age), number of vertebrae involved (Number), and the highest vertebrae operated on (Start).

# Classification Tree with rpart
library(rpart)

# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
   method="class", data=kyphosis)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE,
   main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps",
   title = "Classification Tree for Kyphosis")

# prune the tree
pfit <- prune(fit, cp=
   fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Classification Tree for Kyphosis")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
   title = "Pruned Classification Tree for Kyphosis")

Regression Tree example


In this example we will predict car mileage from price, country, reliability, and car type. The data
frame is cu.summary.

# Regression Tree Example
library(rpart)

# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
   method="anova", data=cu.summary)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # visualize cross-validation results

# plot tree
plot(fit, uniform=TRUE,
   main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree2.ps",
   title = "Regression Tree for Mileage ")

# prune the tree
pfit <- prune(fit, cp=0.01160389) # from cptable

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Regression Tree for Mileage")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree2.ps",
   title = "Pruned Regression Tree for Mileage")

It turns out that this produces the same tree as the original.

Conditional inference trees via party


The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored,
and multivariate responses. party: A Laboratory for Recursive Partytioning provides details.

You can create a regression or classification tree via the function

ctree(formula, data=)

The type of tree created will depend on the outcome variable (nominal factor, ordered factor,
numeric, etc.). Tree growth is based on statistical stopping rules, so pruning should not be required.

The previous two examples are re-analyzed below.

# Conditional Inference Tree for Kyphosis
library(party)
fit <- ctree(Kyphosis ~ Age + Number + Start,
   data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")

# Conditional Inference Tree for Mileage
library(party)
fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
   data=na.omit(cu.summary))

Random Forests
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based
on random samples of variables), classifying a case using each tree in this new "forest", and deciding a
final predicted outcome by combining the results across all of the trees (an average in regression, a
majority vote in classification). Breiman and Cutler's random forest approach is implemented via the
randomForest package.
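
The idea can be sketched in a few lines with rpart alone. This is a deliberately simplified illustration, not the randomForest algorithm: here the random subset of predictors is drawn once per tree rather than at every split, and the number of trees (B), the subset size, and the seed are arbitrary assumptions.

```r
# Toy "forest": bootstrapped rpart trees combined by majority vote
library(rpart)                       # provides rpart() and the kyphosis data
set.seed(123)                        # arbitrary seed

B <- 200                             # number of trees (assumption)
predictors <- c("Age", "Number", "Start")
n <- nrow(kyphosis)
votes <- matrix(NA_character_, nrow = n, ncol = B)

for (b in seq_len(B)) {
  rows <- sample(n, replace = TRUE)            # bootstrap sample of cases
  vars <- sample(predictors, 2)                # random subset of predictors
  tree <- rpart(reformulate(vars, response = "Kyphosis"),
                data = kyphosis[rows, ], method = "class")
  votes[, b] <- as.character(predict(tree, newdata = kyphosis, type = "class"))
}

# majority vote across the B trees for each case
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
table(observed = kyphosis$Kyphosis, predicted = pred)
```

Because the cases being predicted were also used to grow the trees, this table is optimistic; randomForest reports out-of-bag error instead.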

Here is an example.

# Random Forest prediction of Kyphosis data
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor

For more details see the comprehensive Random Forest website.

Going Further
This section has only touched on the options available. To learn more, see the CRAN Task View on
Machine Learning & Statistical Learning.

Cluster Analysis

R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the
many approaches: hierarchical agglomerative, partitioning, and model based. While there are no best
solutions for the problem of determining the number of clusters to extract, several approaches are
given below.

Data Preparation

Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.

# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables

Partitioning

K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the
plot similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).

# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
   centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
   ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)

A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).
The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of
clusters based on optimum average silhouette width.
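
The silhouette criterion can also be applied directly with pam( ) from the cluster package. The sketch below is an illustration under assumptions: it uses the built-in iris measurements in place of mydata and an arbitrary candidate range of 2 to 6 clusters, and it is not the code inside pamk( ).

```r
# Choose the number of clusters by average silhouette width (illustration only)
library(cluster)                     # provides pam()

x  <- scale(iris[, 1:4])             # example data, standardized
ks <- 2:6                            # candidate numbers of clusters (assumption)

# average silhouette width of the pam solution for each k
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)
names(avg_width) <- ks
avg_width

best_k <- ks[which.max(avg_width)]   # k with the widest average silhouette
fit <- pam(x, best_k)
table(fit$clustering)                # cluster sizes
```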

Hierarchical Agglomerative
There is a wide range of hierarchical clustering approaches. I have had good luck with Ward's method
described below.

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on
multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p
values. Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows.
Transpose your data before using.

# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward",
   method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)

Model Based
Model based approaches assume a variety of data models and apply maximum likelihood estimation and
Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust( )
function in the mclust package selects the optimal model according to BIC for EM initialized by
hierarchical clustering for parameterized Gaussian mixture models (phew!). One chooses the model and
number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as
best.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model

Plotting Cluster Solutions


It is always a good idea to look at the cluster results.

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
   labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)

Validating cluster solutions


The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of
two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index
and the corrected Rand index).

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)

where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors
containing classification results from two different clusterings of the same data.
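
To show what one of these criteria measures, here is a base R sketch of the corrected (adjusted) Rand index computed from the cross-tabulation of two partitions. The helper name adjusted_rand, the iris data, and the two clusterings being compared are assumptions for illustration; it is not the fpc implementation.

```r
# Corrected (adjusted) Rand index for two partitions, base R only
adjusted_rand <- function(a, b) {
  tab   <- table(a, b)                       # contingency table of the two labelings
  n     <- sum(tab)
  sum_ij <- sum(choose(tab, 2))              # pairs placed together in both partitions
  sum_a  <- sum(choose(rowSums(tab), 2))     # pairs together in partition a
  sum_b  <- sum(choose(colSums(tab), 2))     # pairs together in partition b
  expected <- sum_a * sum_b / choose(n, 2)   # agreement expected by chance
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

set.seed(7)                                  # arbitrary seed
x <- scale(iris[, 1:4])
fit1 <- kmeans(x, centers = 3, nstart = 20)  # first clustering: k-means
fit2 <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)  # second: Ward

adjusted_rand(fit1$cluster, fit2)            # 1 = identical partitions, near 0 = chance
```

The index ignores how clusters are labeled, so two solutions that differ only in cluster numbering score 1.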

Discriminant Function Analysis

The MASS package contains functions for performing linear and quadratic
discriminant function analysis. Unless prior probabilities are specified, each assumes proportional prior
probabilities (i.e., prior probabilities are based on sample sizes). In the examples below, lower case
letters are numeric variables and upper case letters are categorical factors.

Linear Discriminant Function

# Linear Discriminant Analysis with Jackknifed Prediction
library(MASS)
fit <- lda(G ~ x1 + x2 + x3, data=mydata,
   na.action="na.omit", CV=TRUE)
fit # show results

The code above performs an LDA, using listwise deletion of missing data. CV=TRUE generates jackknifed
(i.e., leave one out) predictions. The code below assesses the accuracy of the prediction.

# Assess the accuracy of the prediction
# percent correct for each category of G
ct <- table(mydata$G, fit$class)
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

lda() prints discriminant functions based on centered (not standardized) variables. The "proportion of
trace" that is printed is the proportion of between-class variance that is explained by successive
discriminant functions. No significance tests are produced. Refer to the section on MANOVA for such
tests.

Quadratic Discriminant Function


To obtain a quadratic discriminant function use qda( ) instead of lda( ). Quadratic discriminant function
analysis does not assume homogeneity of variance-covariance matrices.

# Quadratic Discriminant Analysis with 3 groups applying
# resubstitution prediction and equal prior probabilities.
library(MASS)
fit <- qda(G ~ x1 + x2 + x3 + x4, data=na.omit(mydata),
   prior=c(1,1,1)/3)

Note the alternate way of specifying listwise deletion of missing data. Re-substitution (using the same
data to derive the functions and evaluate their prediction accuracy) is the default method unless
CV=TRUE is specified. Re-substitution will be overly optimistic.
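
A quick way to see the difference is to compute both accuracies on the same data. The sketch below is an illustration that assumes the built-in iris data in place of mydata and G; the variable names are hypothetical.

```r
# Resubstitution versus jackknifed (leave one out) accuracy for QDA
library(MASS)

# resubstitution: fit on all cases, then predict those same cases
fit_resub <- qda(Species ~ ., data = iris)
acc_resub <- mean(predict(fit_resub)$class == iris$Species)

# jackknifed: each case is classified by a model fit without it
fit_cv <- qda(Species ~ ., data = iris, CV = TRUE)
acc_cv <- mean(fit_cv$class == iris$Species)

c(resubstitution = acc_resub, jackknifed = acc_cv)
```

The jackknifed figure is the more honest estimate of how the rule would perform on new cases, and the gap between the two tends to grow as the sample shrinks or the number of predictors rises.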
Visualizing the Results
You can plot each observation in the space of the first 2 linear discriminant functions using the
following code. Points are identified with the group ID.

# Scatter plot using the 1st two discriminant dimensions
plot(fit) # fit from lda

The following code displays histograms and density plots for the observations in each group on the first
linear discriminant dimension. There is one panel for each group and they all appear lined up on the
same graph.

# Panels of histograms and overlayed density plots
# for 1st discriminant function
plot(fit, dimen=1, type="both") # fit from lda

The partimat( ) function in the klaR package can display the results of a linear or quadratic
classification, 2 variables at a time.

# Exploratory Graph for LDA or QDA
library(klaR)
partimat(G~x1+x2+x3,data=mydata,method="lda")

You can also produce a scatterplot matrix with color coding by group.

# Scatterplot for 3 Group Problem
pairs(mydata[c("x1","x2","x3")], main="My Title ", pch=22,
   bg=c("red", "yellow", "blue")[unclass(mydata$G)])

Test Assumptions
See (M)ANOVA Assumptions for methods of evaluating multivariate normality and homogeneity of
covariance matrices.

Principal Components and Factor Analysis

This section covers principal components and factor analysis. The latter includes both exploratory and
confirmatory methods.

Principal Components

The princomp( ) function produces an unrotated principal component analysis.

# Principal Components Analysis
# entering raw data and extracting PCs
# from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit, type="lines") # scree plot
fit$scores # the principal components
biplot(fit)

Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat= option to
enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option
n.obs=.
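
The link between princomp( ) and the correlation matrix can be checked directly. The sketch below is an illustration that assumes four mtcars columns in place of mydata: it compares the component standard deviations with the square roots of the eigenvalues of the correlation matrix, then repeats the analysis from the matrix alone via covmat=.

```r
# PCA from raw data versus PCA from a correlation matrix (illustration only)
x <- mtcars[, c("mpg", "disp", "hp", "wt")]   # example data (assumption)

fit <- princomp(x, cor = TRUE)                # PCs from the correlation matrix
e   <- eigen(cor(x))                          # eigen decomposition of the same matrix

# component standard deviations are the square roots of the eigenvalues
all.equal(unname(fit$sdev), sqrt(e$values))

# same analysis when only the matrix is available
fit_cov <- princomp(covmat = cor(x))
summary(fit_cov)                              # variance accounted for
```

Component scores cannot be computed when only a matrix is supplied, since there are no cases to score.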

The principal( ) function in the psych package can be used to extract and rotate principal components.

# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results

mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".

Exploratory Factor Analysis

The factanal( ) function produces maximum likelihood factor analysis.

# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation
fit <- factanal(mydata, 3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load, type="n") # set up plot
text(load, labels=names(mydata), cex=.7) # add variable names

The rotation= options include "varimax", "promax", and "none". Add the option scores="regression" or
"Bartlett" to produce factor scores. Use the covmat= option to enter a correlation or covariance matrix
directly. If entering a covariance matrix, include the option n.obs=.
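
As a small illustration of the covmat= route, the sketch below fits a two factor model to ability.cov, a covariance matrix that ships with R. The choice of two factors and promax rotation is an assumption made for the example.

```r
# Factor analysis from a covariance matrix (illustration only)
# ability.cov is a list holding a covariance matrix (cov), centers, and n.obs
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "promax")
print(fit, digits = 2, cutoff = .3, sort = TRUE)

# the same kind of call with a bare matrix and an explicit n.obs
fit2 <- factanal(factors = 2, covmat = ability.cov$cov,
                 n.obs = ability.cov$n.obs)
fit2$n.obs            # number of observations used for the test of fit
```

Factor scores are not available here because no raw data were supplied.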

The factor.pa( ) function in the psych package offers a number of factor analysis related functions,
including principal axis factoring.

# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
fit # print results

mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
Rotation can be "varimax" or "promax".

Determining the Number of Factors to Extract


A crucial decision in exploratory factor analysis is how many factors to extract. The nFactors package
offers a suite of functions to aid in this decision. Details on this methodology can be found in a
PowerPoint presentation by Raiche, Riopel, and Blais. Of course, any factor solution must be
interpretable to be useful.
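
Two of the usual criteria, the eigenvalue greater than one rule and parallel analysis, are easy to sketch in base R. The code below is an illustration under assumptions (six mtcars columns as data, 100 simulated data sets, the 95th percentile as threshold); it is not the nFactors implementation.

```r
# Kaiser rule and a simple parallel analysis (illustration only)
set.seed(2024)                                 # arbitrary seed
x <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
n <- nrow(x)
p <- ncol(x)

ev <- eigen(cor(x))$values                     # observed eigenvalues

# eigenvalues from uncorrelated normal data of the same size, one column per run
sim <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
threshold <- apply(sim, 1, quantile, probs = 0.95)

n_kaiser <- sum(ev > 1)                        # eigenvalues greater than one
# parallel analysis: leading eigenvalues that exceed their random counterparts
n_parallel <- if (all(ev > threshold)) p else which(ev <= threshold)[1] - 1

c(kaiser = n_kaiser, parallel = n_parallel)
```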

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),
   rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis.
This includes the use of both quantitative and qualitative variables, as well as the inclusion of
supplementary variables and observations. Here is an example of the types of graphs that you can
create with this package.

# PCA Variable Factor Map
library(FactoMineR)
result <- PCA(mydata) # graphs generated automatically

The GPArotation package offers a wealth of rotation options beyond varimax and promax.

Structural Equation Modeling


Confirmatory Factor Analysis (CFA) is a subset of the much wider Structural Equation Modeling (SEM)
methodology. SEM is provided in R via the sem package. Models are entered via RAM specification
(similar to PROC CALIS in SAS). While sem is a comprehensive package, my recommendation is that if
you are doing significant SEM work, you spring for a copy of AMOS. It can be much more user-friendly
and creates more attractive and publication ready output. Having said that, here is a CFA example
using sem.

[Figure: path diagram of the two-factor CFA model]

Assume that we have six observed variables (X1, X2, ..., X6). We
hypothesize that there are two unobserved latent factors (F1, F2) that underlie the observed variables
as described in this diagram. X1, X2, and X3 load on F1 (with loadings lam1, lam2, and lam3). X4, X5,
and X6 load on F2 (with loadings lam4, lam5, and lam6). The double headed arrow indicates the
covariance between the two latent factors (F1F2). e1 through e6 represent the residual variances
(variance in the observed variables not accounted for by the two latent factors). We set the variances
of F1 and F2 equal to one so that the parameters will have a scale. This will result in F1F2
representing the correlation between the two latent factors.

For sem, we need the covariance matrix of the observed variables, thus the cov( ) statement in the
code below. The CFA model is specified using the specify.model( ) function. The format is arrow
specification, parameter name, start value. Choosing a start value of NA tells the program to choose a
start value rather than supplying one yourself. Note that the variances of F1 and F2 are fixed at 1 (NA in
the second column). The blank line is required to end the RAM specification.

# Simple CFA Model
library(sem)
mydata.cov <- cov(mydata)
model.mydata <- specify.model()
F1 -> X1, lam1, NA
F1 -> X2, lam2, NA
F1 -> X3, lam3, NA
F2 -> X4, lam4, NA
F2 -> X5, lam5, NA
F2 -> X6, lam6, NA
X1 <-> X1, e1, NA
X2 <-> X2, e2, NA
X3 <-> X3, e3, NA
X4 <-> X4, e4, NA
X5 <-> X5, e5, NA
X6 <-> X6, e6, NA
F1 <-> F1, NA, 1
F2 <-> F2, NA, 1
F1 <-> F2, F1F2, NA

mydata.sem <- sem(model.mydata, mydata.cov, nrow(mydata))
# print results (fit indices, parameters, hypothesis tests)
summary(mydata.sem)
# print standardized coefficients (loadings)
std.coef(mydata.sem)

You can use the boot.sem( ) function to bootstrap the structural equation model. See help(boot.sem)
for details. Additionally, the function mod.indices( ) will produce modification indices. Using
modification indices to improve model fit by respecifying the parameters moves you from a
confirmatory to an exploratory analysis.

For more information on sem, see Structural Equation Modeling with the sem Package in R, by John
Fox.

Generalized Linear Models

Generalized linear models are fit using the glm( ) function. The form of the glm function is

glm(formula, family=familytype(link=linkfunction), data=)

Family              Default Link Function
binomial            (link = "logit")
gaussian            (link = "identity")
Gamma               (link = "inverse")
inverse.gaussian    (link = "1/mu^2")
poisson             (link = "log")
quasi               (link = "identity", variance = "constant")
quasibinomial       (link = "logit")
quasipoisson        (link = "log")

See help(glm) for other modeling options. See help(family) for other allowable link functions for each
family. Three subtypes of generalized linear models will be covered here: logistic regression, poisson
regression, and survival analysis.

Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables. It is frequently preferred over discriminant function analysis because of its less
restrictive assumptions.

# Logistic Regression
# where F is a binary factor and
# x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3,data=mydata,family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals

Data Input

You can use anova(fit1, fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x
variable.
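The nested-model comparison can be sketched on simulated data. Everything below (the seed, the sample size, the coefficients, and the names fit1 and fit2) is an illustrative assumption rather than part of the original example; x2 is generated with no true effect, so the test should usually favor the smaller model.

```r
# Simulated data; all names and coefficients are illustrative
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(-0.5 + 1.2 * x1)   # x2 has no true effect
mydata <- data.frame(F = factor(rbinom(n, 1, p)), x1 = x1, x2 = x2)

fit1 <- glm(F ~ x1, data = mydata, family = binomial())       # reduced model
fit2 <- glm(F ~ x1 + x2, data = mydata, family = binomial())  # full model

# Likelihood ratio (chi-square) test of the nested models
cmp <- anova(fit1, fit2, test = "Chisq")
print(cmp)

# Odds ratios for the reduced model
exp(coef(fit1))
```

The second row of the anova table reports the change in deviance for the one extra parameter.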

Poisson Regression
Poisson regression is useful when predicting an outcome variable representing counts from a set of
continuous predictor variables.

# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) # display results

If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may
want to use quasipoisson() instead of poisson().
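The overdispersion check can be sketched on simulated data. The seed and the names n, x1, count, and ratio are illustrative assumptions; the counts are drawn from a negative binomial distribution so that they are overdispersed by construction.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
# Negative binomial counts: variance exceeds the mean
count <- rnbinom(n, mu = exp(0.5 + 0.3 * x1), size = 1)

fit <- glm(count ~ x1, family = poisson())

# Rough check: residual deviance relative to its degrees of freedom
ratio <- deviance(fit) / df.residual(fit)
ratio

# quasipoisson() keeps the same coefficients but estimates a dispersion
# parameter, which widens the standard errors
fit2 <- glm(count ~ x1, family = quasipoisson())
summary(fit2)$dispersion
```

A ratio well above 1 is the informal signal described above; the quasi-Poisson fit changes the standard errors, not the point estimates.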

Survival Analysis
Survival analysis (also called event history analysis or reliability analysis) covers a set of techniques for
modeling the time to an event. Data may be right censored - the event may not have occurred by the
end of the study or we may have incomplete information on an observation but know that up to a
certain time the event had not occurred (e.g. the participant dropped out of study in week 10 but was
alive at that time).

While generalized linear models are typically analyzed using the glm( ) function, survival analysis is
typically carried out using functions from the survival package. The survival package can handle one
and two sample problems, parametric accelerated failure models, and the Cox proportional hazards
model.

Data are typically entered in the format start time, stop time, and status (1=event occurred, 0=event
did not occur). Alternatively, the data may be in the format time to event and status (1=event
occurred, 0=event did not occur). A status=0 indicates that the observation is right censored. Data are
bundled into a Surv object via the Surv( ) function prior to further analyses.
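As a minimal sketch of the time-to-event and status format, here is a Surv object built from three made-up observations. The values and the names time, status, and m are assumptions for illustration only.

```r
library(survival)

# Three hypothetical observations: time to event (or to censoring) and status
time   <- c(5, 8, 12)
status <- c(1, 0, 1)   # 1 = event occurred, 0 = right censored

survobj <- Surv(time, status)
print(survobj)          # censored times are printed with a trailing "+"

# The Surv object stores the times and the status flags side by side
m <- unclass(survobj)
m[, "status"]
```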

survfit( ) is used to estimate a survival distribution for one or more groups.
survdiff( ) tests for differences in survival distributions between two or more groups.
coxph( ) models the hazard function on a set of predictor variables.

# Mayo Clinic Lung Cancer Data
library(survival)

# learn about the dataset
help(lung)

# create a Surv object
survobj <- with(lung, Surv(time,status))

# Plot survival distribution of the total sample
# Kaplan-Meier estimator
fit0 <- survfit(survobj~1, data=lung)
summary(fit0)
plot(fit0, xlab="Survival Time in Days",
   ylab="% Surviving", yscale=100,
   main="Survival Distribution (Overall)")

# Compare the survival distributions of men and women
fit1 <- survfit(survobj~sex, data=lung)

# plot the survival distributions by sex
plot(fit1, xlab="Survival Time in Days",
   ylab="% Surviving", yscale=100, col=c("red","blue"),
   main="Survival Distributions by Gender")
legend("topright", title="Gender", c("Male", "Female"),
   fill=c("red", "blue"))

# test for difference between male and female
# survival curves (logrank test)
survdiff(survobj~sex, data=lung)

# predict male survival from age and medical scores
MaleMod <- coxph(survobj~age+ph.ecog+ph.karno+pat.karno,
   data=lung, subset=sex==1)

# display results
MaleMod

# evaluate the proportional hazards assumption
cox.zph(MaleMod)


See Thomas Lumley's R news article on the survival package for more information. Other good sources
include Mai Zhou's Use R Software to do Survival Analysis and Simulation and M. J. Crawley's chapter
on Survival Analysis.
Matrix Algebra
Most of the methods on this website actually describe the programming of matrices. It is built deeply
into the R language. This section will simply cover operators and functions specifically suited to linear
algebra. Before proceeding you may want to review the sections on Data Types and Operators.

Matrix facilities

In the following examples, A and B are matrices and x and b are vectors.

Operator or Function   Description
A * B                  Element-wise multiplication
A %*% B                Matrix multiplication
A %o% B                Outer product. AB'
crossprod(A,B)         A'B and A'A respectively.
crossprod(A)
t(A)                   Transpose
diag(x)                Creates diagonal matrix with elements of x in the principal diagonal
diag(A)                Returns a vector containing the elements of the principal diagonal
diag(k)                If k is a scalar, this creates a k x k identity matrix. Go figure.
solve(A, b)            Returns vector x in the equation b = Ax (i.e., A^-1 b)
solve(A)               Inverse of A where A is a square matrix.
ginv(A)                Moore-Penrose Generalized Inverse of A.
                       ginv(A) requires loading the MASS package.
y <- eigen(A)          y$val are the eigenvalues of A
                       y$vec are the eigenvectors of A
y <- svd(A)            Singular value decomposition of A.
                       y$d = vector containing the singular values of A
                       y$u = matrix with columns containing the left singular vectors of A
                       y$v = matrix with columns containing the right singular vectors of A
R <- chol(A)           Choleski factorization of A. Returns the upper triangular factor, such that R'R = A.
y <- qr(A)             QR decomposition of A.
                       y$qr has an upper triangle that contains the decomposition and a lower
                       triangle that contains information on the Q decomposition.
                       y$rank is the rank of A.
                       y$qraux a vector which contains additional information on Q.
                       y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)         Combine matrices (vectors) horizontally. Returns a matrix.
rbind(A,B,...)         Combine matrices (vectors) vertically. Returns a matrix.
rowMeans(A)            Returns vector of row means.
rowSums(A)             Returns vector of row sums.
colMeans(A)            Returns vector of column means.
colSums(A)             Returns vector of column sums.
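A few of these functions can be tried on a small matrix. The 2 x 2 matrix A and vector b below are arbitrary values chosen for illustration; A is symmetric and positive definite so that chol() applies.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric, positive definite
b <- c(1, 2)

x <- solve(A, b)            # solves A x = b without forming the inverse
A %*% x                     # reproduces b (as a column)

Ainv <- solve(A)            # explicit inverse
A %*% Ainv                  # identity, up to rounding

crossprod(A)                # same as t(A) %*% A
R <- chol(A)                # upper triangular factor
t(R) %*% R                  # reproduces A

e <- eigen(A)
e$values                    # eigenvalues, largest first
```

For a linear system, solve(A, b) is generally preferable to computing solve(A) %*% b.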

Matlab Emulation
The matlab package contains wrapper functions and variables used to replicate MATLAB function calls
as best possible. This can help porting MATLAB applications and code to R.
Going Further
The Matrix package contains functions that extend R to support highly dense or sparse matrices. It
provides efficient access to BLAS (Basic Linear Algebra Subroutines), Lapack (dense matrix), TAUCS
(sparse matrix) and UMFPACK (sparse matrix) routines.
Multidimensional Scaling
R provides functions for both classical and nonmetric multidimensional scaling. Assume that we have N
objects measured on p numeric variables. We want to represent the distances among the objects in a
parsimonious (and visual) way (i.e., a lower k-dimensional space).

Classical MDS

You can perform a classical MDS using the cmdscale() function.

# classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name

d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d, eig=TRUE, k=2) # k is the number of dim
fit # view results

# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
   main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
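A small check of what classical MDS recovers, using four hypothetical objects that already lie in two dimensions. The data values and the names v1, v2, and d_fit are assumptions for illustration.

```r
# Four hypothetical objects measured on two variables
mydata <- data.frame(v1 = c(0, 3, 0, 3), v2 = c(0, 0, 4, 4),
                     row.names = c("a", "b", "c", "d"))

d   <- dist(mydata)                   # euclidean distances between the rows
fit <- cmdscale(d, eig = TRUE, k = 2)

# The recovered configuration may be rotated or reflected,
# but its pairwise distances match the input distances
d_fit <- dist(fit$points)
all.equal(as.vector(d), as.vector(d_fit))
```

Because the objects are truly two-dimensional, k=2 reproduces the distances exactly; with real data the fit is only approximate.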
Nonmetric MDS
Nonmetric MDS is performed using the isoMDS() function in the MASS package.

# Nonmetric MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name

library(MASS)
d <- dist(mydata) # euclidean distances between the rows
fit <- isoMDS(d, k=2) # k is the number of dim
fit # view results

# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
   main="Nonmetric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)

Individual Difference Scaling


3-way or individual difference scaling can be completed using the indscal() function in the SensoMineR
package. The smacof package offers a three way analysis of individual differences based on stress
minimization by means of majorization.
Time Series and Forecasting
R has extensive facilities for analyzing time series data. This section describes the creation of a time
series, seasonal decomposition, modeling with exponential and ARIMA models, and forecasting with the
forecast package.

Creating a time series

The ts() function will convert a numeric vector into an R time series object. The format is ts(vector,
start=, end=, frequency=) where start and end are the times of the first and last observation and
frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).

# save a numeric vector containing 48 monthly observations
# from Jan 2009 to Dec 2012 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2012, 12), frequency=12)

# subset the time series (June 2012 to December 2012)
myts2 <- window(myts, start=c(2012, 6), end=c(2012, 12))

# plot series
plot(myts)
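To make the start, end, and frequency arguments concrete, here is a runnable sketch in which myvector is filled with random placeholder values (an assumption; any 48 numbers would do).

```r
set.seed(123)
myvector <- rnorm(48)    # placeholder data, 48 monthly values
myts <- ts(myvector, start = c(2009, 1), end = c(2012, 12), frequency = 12)

start(myts)        # 2009 1
end(myts)          # 2012 12
frequency(myts)    # 12

myts2 <- window(myts, start = c(2012, 6), end = c(2012, 12))
length(myts2)      # 7 observations, June through December
```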

Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl() function. Note that a series with multiplicative effects can often be transformed into a series with
additive effects through a log transformation (i.e., newts <- log(myts)).

# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)

# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
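The log transformation can be illustrated with the built-in AirPassengers series, whose seasonal swings grow with the level of the series. Using this dataset, and the names comp and recon, is an assumption made for the sketch.

```r
# AirPassengers ships with R; its seasonal swings grow with the level
newts <- log(AirPassengers)
fit <- stl(newts, s.window = "period")

# stl() returns seasonal, trend, and remainder components that add up to the series
comp  <- fit$time.series
recon <- comp[, "seasonal"] + comp[, "trend"] + comp[, "remainder"]
all.equal(as.numeric(recon), as.numeric(newts))

# Back on the original scale the components combine multiplicatively
head(exp(comp[, "trend"]))
```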
Exponential Models
Both the HoltWinters() function in the base installation, and the ets() function in the forecast package,
can be used to fit exponential models.

# simple exponential - models level
fit <- HoltWinters(myts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit <- HoltWinters(myts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit <- HoltWinters(myts)

# predictive accuracy
library(forecast)
accuracy(fit)

# predict next three future values
library(forecast)
forecast(fit, 3)
plot(forecast(fit, 3))

ARIMA Models
The arima() function can be used to fit an autoregressive integrated moving averages model. Other
useful functions include:

lag(ts, k)                       lagged version of time series, shifted back k observations
diff(ts, differences=d)          difference the time series d times
ndiffs(ts)                       Number of differences required to achieve stationarity (from the
                                 forecast package)
acf(ts)                          autocorrelation function
pacf(ts)                         partial autocorrelation function
adf.test(ts)                     Augmented Dickey-Fuller test. Rejecting the null hypothesis
                                 suggests that a time series is stationary (from the tseries package)
Box.test(x, type="Ljung-Box")    Portmanteau test that observations in vector or time series x are
                                 independent

Note that the forecast package has somewhat nicer versions of acf() and pacf() called Acf() and Pacf()
respectively.
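The base functions in this list can be tried without extra packages. The series below is a made-up quadratic trend, chosen as an assumption so that the effect of differencing twice is easy to see.

```r
# A quadratic trend: hypothetical series for illustration
x <- ts((1:10)^2)

d1 <- diff(x)                   # first differences: 3, 5, 7, ...
d2 <- diff(x, differences = 2)  # second differences: constant 2

length(d2)      # two observations are lost
as.numeric(d2)

# acf() can be called without plotting to inspect the autocorrelations
r <- acf(x, plot = FALSE)
r$acf[1]        # lag 0 autocorrelation is always 1
```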

# fit an ARIMA model of order p, d, q
fit <- arima(myts, order=c(p, d, q))

# predictive accuracy
library(forecast)
accuracy(fit)

# predict next 5 observations
library(forecast)
forecast(fit, 5)
plot(forecast(fit, 5))

Automated Forecasting
The forecast package provides functions for the automatic selection of exponential and ARIMA models.
The ets() function supports both additive and multiplicative models. The auto.arima() function can
handle both seasonal and nonseasonal ARIMA models. Models are chosen to maximize one of several fit
criteria.

library(forecast)
# Automated forecasting using an exponential model
fit <- ets(myts)

# Automated forecasting using an ARIMA model


fit <- auto.arima(myts)

Going Further
There are many good online resources for learning time series analysis with R. These include A little
book of R for time series by Avril Coghlan, and Forecasting: principles and practice by Rob Hyndman and
George Athanasopoulos. Vito Ricci has created a time series reference card. There is also a time series
tutorial by Walter Zucchini and Oleg Nenadic that is quite useful.

See also the comprehensive book Time Series Analysis and its Applications with R Examples by Robert
Shumway and David Stoffer.
Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices,
data frames, and lists.

Vectors

a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector

Refer to elements of a vector using subscripts.

a[c(2,4)] # 2nd and 4th elements of vector

Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The
general format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
   dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the columns
and rows.

# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)

# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
   dimnames=list(rnames, cnames))

Identify rows, columns or elements using subscripts.

x[,4] # 4th column of matrix
x[3,] # 3rd row of matrix
x[2:4,1:3] # rows 2,3,4 of columns 1,2,3
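The effect of byrow is easiest to see side by side. The names by_col and by_row below are illustrative assumptions.

```r
cells <- 1:6
by_col <- matrix(cells, nrow = 2, ncol = 3)                # default: fill by columns
by_row <- matrix(cells, nrow = 2, ncol = 3, byrow = TRUE)  # fill by rows

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

by_row[2, 2:3]   # row 2, columns 2 and 3: 5 6
```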
Arrays
Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.

Data Frames
A data frame is more general than a matrix, in that different columns can have different modes
(numeric, character, factor, etc.). This is similar to SAS and SPSS datasets.

d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names

There are a variety of ways to identify the elements of a data frame.

myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$X1 # variable x1 in the data frame
Lists
An ordered collection of objects (components). A list allows you to gather a variety of (possibly
unrelated) objects under one name.

# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

# example of a list containing two lists
v <- c(list1,list2)

Identify elements of a list using the [[]] convention.

mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list
Factors
Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector
of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and
an internal vector of character strings (the original values) mapped to these integers.

# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)

An ordered factor is used to represent an ordinal variable.

# variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates
# 1=large, 2=medium, 3=small internally
# R now treats rating as ordinal

R will treat factors as nominal variables and ordered factors as ordinal variables in statistical
procedures and graphical analyses. You can use options in the factor( ) and ordered( ) functions to
control the mapping of integers to strings (overriding the alphabetical ordering). You can also use
factors to create value labels. For more on factors see the UCLA page.
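One way to override the alphabetical ordering is the levels argument. The rating values and the names r1 and r2 below are made up for illustration.

```r
rating <- c("large", "small", "medium", "small")

# Default: levels sorted alphabetically (large < medium < small)
r1 <- ordered(rating)
levels(r1)

# Explicit levels give the intended order (small < medium < large)
r2 <- ordered(rating, levels = c("small", "medium", "large"))
levels(r2)
as.integer(r2)   # 3 1 2 1

r2 < "large"     # comparisons respect the level order
```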

Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names

c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows

object # prints the object

ls() # list current objects
rm(object) # delete an object

newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
Date Values
Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.

# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]

Sys.Date( ) returns today's date.
date() returns the current date and time.

The following symbols can be used with the format( ) function to print dates.

Symbol   Meaning                    Example
%d       day as a number (01-31)    01-31
%a       abbreviated weekday        Mon
%A       unabbreviated weekday      Monday
%m       month (01-12)              01-12
%b       abbreviated month          Jan
%B       unabbreviated month        January
%y       2-digit year               07
%Y       4-digit year               2007

Here is an example.

# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"
Date Conversion
Character to Date
You can use the as.Date( ) function to convert character data to dates. The format is as.Date(x,
"format"), where x is the character data and format gives the appropriate format.

# convert date info in format 'mm/dd/yyyy'
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")

The default format is yyyy-mm-dd

mydates <- as.Date(c("2007-06-22", "2004-02-13"))

Date to Character
You can convert dates to character data using the as.character() function.

# convert dates to character data
strDates <- as.character(dates)

Learning More
See help(as.Date) and help(strftime) for details on converting character data to dates. See
help(ISOdatetime) for more information about formatting date/times.
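The conversions above can be checked end to end. This sketch reuses the example strings from this section and sticks to numeric format symbols, since month and weekday names depend on the locale.

```r
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")

as.character(dates)         # ISO format: "1965-01-05" "1975-08-16"
format(dates, "%d.%m.%y")   # numeric symbols do not depend on the locale

as.numeric(as.Date("1970-01-10"))   # 9 days after the origin
as.numeric(dates[1])                # negative: before 1970-01-01

mydates <- as.Date(c("2007-06-22", "2004-02-13"))
as.numeric(mydates[1] - mydates[2]) # number of days between the two dates
```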
Access to Database Management Systems (DBMS)

ODBC Interface

The RODBC package provides access to databases (including Microsoft Access and Microsoft SQL Server)
through an ODBC interface.

The primary functions are given below.

Function                                     Description
odbcConnect(dsn, uid="", pwd="")             Open a connection to an ODBC database
sqlFetch(channel, sqtable)                   Read a table from an ODBC database into a data frame
sqlQuery(channel, query)                     Submit a query to an ODBC database and return the results
sqlSave(channel, mydf, tablename = sqtable,  Write or update (append=TRUE) a data frame to a
   append = FALSE)                           table in the ODBC database
sqlDrop(channel, sqtable)                    Remove a table from the ODBC database
close(channel)                               Close the connection

# RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)

library(RODBC)
myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)

Other Interfaces
The RMySQL package provides an interface to MySQL.

The ROracle package provides an interface for Oracle.

The RJDBC package provides access to databases through a JDBC interface.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Designed by WebTemplateOcean.com
Exporting Data
There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you
will need to load the foreign package. For Excel, you will need the xlsReadWrite package.

To A Tab Delimited Text File

write.table(mydata, "c:/mydata.txt", sep="\t")

To an Excel Spreadsheet

library(xlsReadWrite)
write.xls(mydata, "c:/mydata.xls")

To SPSS

# write out text datafile and
# an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")

To SAS

# write out text datafile and
# an SAS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")

To Stata

# export data frame to Stata binary format
library(foreign)
write.dta(mydata, "c:/mydata.dta")
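A tab delimited export can be verified by reading the file back. The sketch below uses a small made-up data frame and a temporary file (both assumptions) rather than a fixed path, and sets row.names=FALSE so that row names are not written as an extra column.

```r
mydata <- data.frame(id = 1:3, color = c("red", "white", "red"),
                     stringsAsFactors = FALSE)

# tempfile() is used here instead of a fixed path such as "c:/mydata.txt"
path <- tempfile(fileext = ".txt")
write.table(mydata, path, sep = "\t", row.names = FALSE)

# Read the file back to confirm the export
check <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
all.equal(mydata, check)

unlink(path)   # remove the temporary file
```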
Importing Data
Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS I
would recommend the Hmisc package for ease and functionality. See the Quick-R section on packages,
for information on obtaining and installing these packages. Examples of importing data are provided
below.

From A Comma Delimited Text File

# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on mswindows systems

mydata <- read.table("c:/mydata.csv", header=TRUE,
   sep=",", row.names="id")

From Excel

The best way to read an Excel file is to export it to a comma delimited file and import it using the
method above. On windows systems you can use the RODBC package to access Excel files. The first row
should contain variable/column names.

# first row contains variable names
# we will read in worksheet mysheet

library(RODBC)
channel <- odbcConnectExcel("c:/myexcel.xls")
mydata <- sqlFetch(channel, "mysheet")
odbcClose(channel)

From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.

# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors

From SAS
# save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;

# in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
# character variables are converted to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")

From Systat
# input Systat file
library(foreign)
mydata <- read.systat("c:/mydata.dta")
Keyboard Input
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an
ASCII file. To create it interactively, you can do something like the following.

# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)

You can also use R's built in spreadsheet to enter the data interactively, as in the following example.

# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
   weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!

Missing Data
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing
by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for
character and numeric data.

Testing for Missing Values

is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

Recoding Values to Missing

# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA

Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.

x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

The function complete.cases() returns a logical vector indicating which cases are complete.

# list rows of data that have missing values
mydata[!complete.cases(mydata),]

The function na.omit() returns the object with listwise deletion of missing values.

# create new dataset without missing data
newdata <- na.omit(mydata)
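The steps in this section can be run together on a small data frame. The values and the column names v1 and v2 are made up for illustration.

```r
mydata <- data.frame(v1 = c(1, 99, 3, 4), v2 = c("a", "b", NA, "d"),
                     stringsAsFactors = FALSE)

mydata$v1[mydata$v1 == 99] <- NA   # recode 99 to missing

colSums(is.na(mydata))             # missing values per column
mean(mydata$v1, na.rm = TRUE)      # mean of 1, 3, 4

complete.cases(mydata)             # TRUE FALSE FALSE TRUE
newdata <- na.omit(mydata)         # listwise deletion
nrow(newdata)                      # 2
```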
Advanced Handling of Missing Data

Most modeling functions in R offer options for dealing with missing values. You can go beyond pairwise
or listwise deletion of missing values through methods such as multiple imputation. Good
implementations that can be accessed through R include Amelia II, Mice, and mitools.
Value Labels
To understand value labels in R, you need to understand the data structure factor.

You can use the factor function to create your own value labels.

# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green

mydata$v1 <- factor(mydata$v1,
   levels = c(1,2,3),
   labels = c("red", "blue", "green"))

# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High

mydata$v1 <- ordered(mydata$y,
   levels = c(1,3,5),
   labels = c("Low", "Medium", "High"))

Use the factor() function for nominal data and the ordered() function for ordinal data. R statistical
and graphic functions will then treat the data appropriately.

Note: factor and ordered are used the same way, with the same arguments. The former creates factors
and the latter creates ordered factors.
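Here is a short, self-contained sketch of attaching value labels. The code vectors v1 and y, and the names f and o, are hypothetical.

```r
v1 <- c(1, 3, 2, 1, 3)   # hypothetical codes
f <- factor(v1, levels = c(1, 2, 3), labels = c("red", "blue", "green"))

as.character(f)   # "red" "green" "blue" "red" "green"
table(f)          # counts are reported by label

y <- c(5, 1, 3)
o <- ordered(y, levels = c(1, 3, 5), labels = c("Low", "Medium", "High"))
o[1] > o[2]       # TRUE: High is above Low
```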
Variable Labels
R's ability to handle variable labels is somewhat unsatisfying.

If you use the Hmisc package, you can take advantage of some labeling features.

library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)

Unfortunately the label is only in effect for functions provided by the Hmisc package, such as
describe(). Your other option is to use the variable label as the variable name and then refer to the
variable by position index.

names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
Aggregating Data
It is relatively easy to collapse data in R using one or more BY variables and a defined function.

# aggregate data frame mtcars by cyl and vs, returning means
# for numeric variables
attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl,vs),
   FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)

When using the aggregate() function, the by variables must be in a list (even if there is only one). The
function can be built-in or user provided.

See also:

• summarize() in the Hmisc package
• summaryBy() in the doBy package
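A variation on the aggregate() call that avoids attach() and names the grouping columns. Restricting the output to mpg and hp, and the name agg2, are choices made for this sketch only.

```r
# Average mpg and hp for each combination of cyl and vs
aggdata <- aggregate(mtcars[c("mpg", "hp")],
                     by = list(cyl = mtcars$cyl, vs = mtcars$vs),
                     FUN = mean, na.rm = TRUE)
print(aggdata)

# The formula interface gives the same group means
agg2 <- aggregate(cbind(mpg, hp) ~ cyl + vs, data = mtcars, FUN = mean)
```

Naming the elements of the by list gives readable column names in the result instead of Group.1 and Group.2.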

Control Structures
R has the standard control structures you would expect. expr can be multiple (compound) statements
by enclosing them in braces { }. It is more efficient to use built-in functions rather than control
structures whenever possible.

if-else

if (cond) expr
if (cond) expr1 else expr2

for

for (var in seq) expr

while

while (cond) expr

switch

switch(expr, ...)

ifelse

ifelse(test,yes,no)
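The point about preferring built-in functions can be seen by writing the same recode twice. The vector x and the names out_loop, out_vec, type, and center are illustrative assumptions.

```r
x <- c(-2, 0, 3, 7)

# Loop with if-else
out_loop <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0) {
    out_loop[i] <- "positive"
  } else {
    out_loop[i] <- "non-positive"
  }
}

# Vectorized equivalent
out_vec <- ifelse(x > 0, "positive", "non-positive")
identical(out_loop, out_vec)

# switch() selects one branch by name
type <- "median"
center <- switch(type, mean = mean(x), median = median(x))
center   # median of x: 1.5
```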

Example

# transpose of a matrix
# a poor alternative to built-in t() function

mytrans <- function(x) {
  if (!is.matrix(x)) {
    warning("argument is not a matrix: returning NA")
    return(NA_real_)
  }
  y <- matrix(1, nrow=ncol(x), ncol=nrow(x))
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      y[j,i] <- x[i,j]
    }
  }
  return(y)
}

# try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
Built-in Functions

Almost everything in R is done through functions. Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.

Numeric Functions

Function                  Description
abs(x)                    absolute value
sqrt(x)                   square root
ceiling(x)                ceiling(3.475) is 4
floor(x)                  floor(3.475) is 3
trunc(x)                  trunc(5.99) is 5
round(x, digits=n)        round(3.475, digits=2) is 3.48
signif(x, digits=n)       signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)    also acos(x), cosh(x), acosh(x), etc.
log(x)                    natural logarithm
log10(x)                  common logarithm
exp(x)                    e^x

Character Functions

Function                        Description
substr(x, start=n1, stop=n2)    Extract or replace substrings in a character vector.
                                x <- "abcdef"
                                substr(x, 2, 4) is "bcd"
                                substr(x, 2, 4) <- "22222" is "a222ef"
grep(pattern, x,                Search for pattern in x. If fixed=FALSE then pattern is a regular
  ignore.case=FALSE,            expression. If fixed=TRUE then pattern is a text string. Returns
  fixed=FALSE)                  matching indices.
                                grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x,    Find pattern in x and replace with replacement text. If
  ignore.case=FALSE,            fixed=FALSE then pattern is a regular expression.
  fixed=FALSE)                  If fixed=TRUE then pattern is a text string.
                                sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)              Split the elements of character vector x at split.
                                strsplit("abc", "") returns a list holding the 3 element vector "a","b","c"
paste(..., sep=" ")             Concatenate strings after using sep string to separate them.
                                paste("x",1:3,sep="") returns c("x1","x2","x3")
                                paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
                                paste("Today is", date())
toupper(x)                      Uppercase
tolower(x)                      Lowercase
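The character functions in the table can be checked directly at the console. A short sketch, reusing the table's own example strings:

```r
x <- "abcdef"
piece <- substr(x, 2, 4)          # extract characters 2 through 4
substr(x, 2, 4) <- "22222"        # replace in place; only 3 characters fit

# index of the element that matches the literal string "A"
hit <- grep("A", c("b", "A", "c"), fixed = TRUE)

# replace the first whitespace character with a period
dotted <- sub("\\s", ".", "Hello There")

# strsplit() returns a list, so take its first element to get the vector
chars <- strsplit("abc", "")[[1]]

# build variable names x1, x2, x3
ids <- paste("x", 1:3, sep = "")
```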
Statistical Probability Functions

The following table describes functions related to probability distributions. For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.

Function                  Description
dnorm(x)                  normal density function (by default m=0 sd=1)
                          # plot standard normal curve
                          x <- pretty(c(-3, 3), 30)
                          y <- dnorm(x)
                          plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)                  cumulative normal probability for q
                          (area under the normal curve to the left of q)
                          pnorm(1.96) is 0.975
qnorm(p)                  normal quantile.
                          value at the p percentile of normal distribution
                          qnorm(.9) is 1.28 # 90th percentile
rnorm(n, m=0, sd=1)       n random normal deviates with mean m
                          and standard deviation sd.
                          # 50 random normal variates with mean=50, sd=10
                          x <- rnorm(50, m=50, sd=10)
dbinom(x, size, prob)     binomial distribution where size is the sample size
pbinom(q, size, prob)     and prob is the probability of a heads (pi)
qbinom(p, size, prob)     # prob of 0 to 5 heads of fair coin out of 10 flips
rbinom(n, size, prob)     dbinom(0:5, 10, .5)
                          # prob of 5 or less heads of fair coin out of 10 flips
                          pbinom(5, 10, .5)
dpois(x, lambda)          Poisson distribution with mean=variance=lambda
ppois(q, lambda)          # probability of 0, 1, or 2 events with lambda=4
qpois(p, lambda)          dpois(0:2, 4)
rpois(n, lambda)          # probability of at least 3 events with lambda=4
                          1 - ppois(2, 4)
dunif(x, min=0, max=1)    uniform distribution, follows the same pattern
punif(q, min=0, max=1)    as the normal distribution above.
qunif(p, min=0, max=1)    # 10 uniform random variates
runif(n, min=0, max=1)    x <- runif(10)
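The d/p/q/r naming pattern in the table can be verified with a few identities. This sketch only uses base R; the object names are illustrative.

```r
# Same seed, same pseudo-random draws
set.seed(1234)
a <- rnorm(5)
set.seed(1234)
b <- rnorm(5)
same_draws <- identical(a, b)

# The cumulative probability (p) is the sum of the point probabilities (d)
p_cum <- pbinom(5, 10, .5)
p_sum <- sum(dbinom(0:5, 10, .5))

# "At least 3 events" is the upper tail beyond 2
upper_a <- 1 - ppois(2, 4)
upper_b <- ppois(2, 4, lower.tail = FALSE)

# q is the inverse of p
roundtrip <- qnorm(pnorm(1.96))
```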

Other Statistical Functions

Other useful statistical functions are provided in the following table. Most have the option na.rm to strip missing values before calculations. Otherwise the presence of missing values will lead to a missing result. Object can be a numeric vector or data frame.

Function                            Description
mean(x, trim=0, na.rm=FALSE)        mean of object x
                                    # trimmed mean, removing any missing values and
                                    # 5 percent of highest and lowest scores
                                    mx <- mean(x,trim=.05, na.rm=TRUE)
sd(x)                               standard deviation of object(x). also look at var(x) for variance and
                                    mad(x) for median absolute deviation.
median(x)                           median
quantile(x, probs)                  quantiles where x is the numeric vector whose quantiles are desired and
                                    probs is a numeric vector with probabilities in [0,1].
                                    # 30th and 84th percentiles of x
                                    y <- quantile(x, c(.3,.84))
range(x)                            range
sum(x)                              sum
diff(x, lag=1)                      lagged differences, with lag indicating which lag to use
min(x)                              minimum
max(x)                              maximum
scale(x, center=TRUE, scale=TRUE)   column center or standardize a matrix.
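The effect of na.rm and trim is easiest to see on a small vector with one missing value and one outlier. The data below are made up for illustration.

```r
x <- c(2, 4, 4, 5, 7, 9, NA, 40)

plain   <- mean(x)                            # NA, because x contains a missing value
clean   <- mean(x, na.rm = TRUE)              # mean of the 7 observed values
trimmed <- mean(x, trim = 0.2, na.rm = TRUE)  # drops 1 value from each end first

# 30th and 84th percentiles of the observed values
q <- quantile(x, c(.3, .84), na.rm = TRUE)

# center to mean 0 and scale to standard deviation 1
z <- scale(c(1, 2, 3))
```

Here the trimmed mean discards the smallest value (2) and the outlier (40) before averaging, which is why it is much lower than the ordinary mean.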

Other Useful Functions

Function             Description
seq(from, to, by)    generate a sequence
                     indices <- seq(1,10,2)
                     # indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)       repeat x n times
                     y <- rep(1:3, 2)
                     # y is c(1, 2, 3, 1, 2, 3)
cut(x, n)            divide continuous variable in factor with n levels
                     y <- cut(x, 5)

Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
Merging Data

Adding Columns

To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

# merge two data frames by ID
total <- merge(dataframeA,dataframeB,by="ID")

# merge two data frames by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country"))

Adding Rows

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.

total <- rbind(dataframeA, dataframeB)

If dataframeA has variables that dataframeB does not, then either:
1. Delete the extra variables in dataframeA or
2. Create the additional variables in dataframeB and set them to NA (missing)
before joining them with rbind( ).
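Both operations can be tried on small made-up data frames. In this sketch the data frames and their columns (score, group, age) are hypothetical; all = TRUE is shown as a contrast to the default inner join.

```r
dataframeA <- data.frame(ID = c(1, 2, 3), score = c(90, 85, 70))
dataframeB <- data.frame(ID = c(2, 3, 4), group = c("a", "b", "c"))

# Inner join (the default): only IDs present in both data frames survive
total <- merge(dataframeA, dataframeB, by = "ID")

# Full outer join: keep every ID, filling the gaps with NA
all_rows <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)

# Adding rows: create the missing variable as NA before calling rbind()
first  <- data.frame(ID = 1:2, score = c(90, 85), age = c(30, 41))
second <- data.frame(ID = 3:4, score = c(70, 65))
second$age <- NA
stacked <- rbind(first, second)
```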

Operators

R's binary and logical operators will look very familiar to programmers. Note that binary operators work on vectors and matrices as well as scalars.

Arithmetic Operators

Operator     Description
+            addition
-            subtraction
*            multiplication
/            division
^ or **      exponentiation
x %% y       modulus (x mod y) 5%%2 is 1
x %/% y      integer division 5%/%2 is 2

Logical Operators

Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           Not x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE

# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10

# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
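The arithmetic operators behave the same way on scalars, vectors and matrices. A brief sketch with illustrative values:

```r
remainder <- 5 %% 2        # modulus
quotient  <- 5 %/% 2       # integer division

v <- c(1, 2, 3, 4)
doubled <- v * 2           # applied to every element

m <- matrix(1:4, nrow = 2)
m_plus <- m + 10           # applied to every cell, shape is preserved

# combine comparisons with & (AND) and | (OR), then use the result as an index
even_above_one <- v[v > 1 & v %% 2 == 0]
ends_only      <- v[v < 2 | v > 3]
```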
Reshaping Data

R provides a variety of methods for reshaping data prior to analysis.

Transpose

Use the t() function to transpose a matrix or a data frame. In the latter case, rownames become variable (column) names.

# example using built-in dataset
mtcars
t(mtcars)

The Reshape Package

Hadley Wickham has created a comprehensive package called reshape to massage data. Both an introduction and article are available. There is even a video!

Basically, you "melt" data so that each row is a unique id-variable combination. Then you "cast" the melted data into any shape you would like. Here is a very simple example.

mydata

id  time  x1  x2
1   1     5   6
1   2     3   5
2   1     6   1
2   2     2   4

# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))

mdata

id  time  variable  value
1   1     x1        5
1   2     x1        3
2   1     x1        6
2   2     x1        2
1   1     x2        6
1   2     x2        5
2   1     x2        1
2   2     x2        4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)

subjmeans

id  x1  x2
1   4   5.5
2   4   2.5

timemeans

time  x1   x2
1     5.5  3.5
2     2.5  4.5

There is much more that you can do with the melt( ) and cast( ) functions. See the documentation for more details.
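To see what melting and casting do without installing anything, the same numbers can be reproduced in base R. This is only a sketch of the idea, not how the reshape package is implemented: the "melt" step is written out by hand and tapply() stands in for cast().

```r
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# "melt" by hand: one row per id-time-variable combination
mdata <- data.frame(
  id       = rep(mydata$id, 2),
  time     = rep(mydata$time, 2),
  variable = rep(c("x1", "x2"), each = nrow(mydata)),
  value    = c(mydata$x1, mydata$x2)
)

# "cast": mean value for each id (or time) and variable
subjmeans <- tapply(mdata$value, list(mdata$id, mdata$variable), mean)
timemeans <- tapply(mdata$value, list(mdata$time, mdata$variable), mean)

print(subjmeans)
print(timemeans)
```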
Sorting Data

To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING. Prefix the sorting variable with a minus sign to indicate DESCENDING order. Here are some examples.

# sorting examples using the mtcars dataset
attach(mtcars)

# sort by mpg
newdata <- mtcars[order(mpg),]

# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]

# sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),]

detach(mtcars)
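The same sorts can be written without attach() by naming the data frame explicitly. The minus sign only works for numeric keys, so the sketch below also shows the decreasing argument for a character key; the cars data frame is built here purely for illustration.

```r
# sort by mpg, then by mpg (ascending) and cyl (descending)
by_mpg     <- mtcars[order(mtcars$mpg), ]
by_mpg_cyl <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]

# character keys cannot be negated; use decreasing = TRUE instead
cars <- data.frame(name = rownames(mtcars), mpg = mtcars$mpg,
                   stringsAsFactors = FALSE)
by_name_desc <- cars[order(cars$name, decreasing = TRUE), ]
```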
Subsetting Data

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.

Selecting (Keeping) Variables

# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]

# another method
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]

# select 1st and 5th thru 10th variables
newdata <- mydata[c(1,5:10)]

Excluding (DROPPING) Variables

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL

Selecting Observations

# first 5 observations
newdata <- mydata[1:5,]

# based on variable values
newdata <- mydata[ which(mydata$gender=='F'
  & mydata$age > 65), ]

# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)
Selection using the Subset Function

The subset( ) function is the easiest way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10. We keep the ID and Weight columns.

# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
  select=c(ID, Weight))

In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).

# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
  select=weight:income)
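Because mydata is not defined on this page, the two subset( ) patterns above can be tried on the built-in mtcars data instead. The thresholds below are arbitrary and chosen only to illustrate the OR condition and the column range.

```r
# rows with very high OR very low mpg, keeping two named columns
extremes <- subset(mtcars, mpg >= 30 | mpg < 11,
                   select = c(mpg, cyl))

# manual-transmission cars with more than 200 hp,
# keeping disp through drat (disp, drat and all columns between them)
strong_manual <- subset(mtcars, am == 1 & hp > 200,
                        select = disp:drat)
```

Inside subset( ) the column names are used bare, so neither attach( ) nor the mtcars$ prefix is needed.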
Random Samples

Use the sample( ) function to take a random sample of size n from a dataset.

# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
  replace=FALSE),]

Going Further

R has extensive facilities for sampling, including drawing and calibrating survey samples (see the sampling package), analyzing complex survey data (see the survey package and its homepage) and bootstrapping.
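A runnable version of the sampling idiom, sketched on mtcars with a sample size of 10 (the seed and the size are arbitrary choices for this example). The second draw uses replace = TRUE, which is the kind of resampling that bootstrapping relies on.

```r
set.seed(1234)   # make the draw reproducible

# 10 distinct rows, sampled without replacement
rows <- sample(1:nrow(mtcars), 10, replace = FALSE)
mysample <- mtcars[rows, ]

# a resample of the same size as the data, with replacement:
# some rows can appear more than once, others not at all
resample <- mtcars[sample(1:nrow(mtcars), nrow(mtcars), replace = TRUE), ]
```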
Data Type Conversion

Type conversions in R work as you would expect. For example, adding a character string to a numeric vector converts all the elements in the vector to character.

Use is.foo to test for data type foo. Returns TRUE or FALSE.
Use as.foo to explicitly convert it.

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()

Examples

                   to one long vector     to matrix             to data frame
from vector        c(x,y)                 cbind(x,y)            data.frame(x,y)
                                          rbind(x,y)
from matrix        as.vector(mymatrix)                          as.data.frame(mymatrix)
from data frame                           as.matrix(myframe)

Dates

You can convert dates to and from character or numeric data. See date values for more information.
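The implicit and explicit conversions described above can be checked in a few lines. The object names are illustrative.

```r
nums  <- c(1, 2, 3)
mixed <- c(nums, "a")                  # one string turns every element into character

back <- as.numeric(c("1", "2", "3"))   # explicit conversion back to numbers

m  <- cbind(x = 1:3, y = 4:6)          # two vectors to a matrix
df <- as.data.frame(m)                 # matrix to data frame
v  <- as.vector(m)                     # matrix to one long vector (column by column)
m2 <- as.matrix(df)                    # data frame back to matrix
```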

User-written Functions

One of the great strengths of R is the user's ability to add functions. In fact, many of the functions in R are actually functions of functions. The structure of a function is given below.

myfunction <- function(arg1, arg2, ...){
  statements
  return(object)
}

Objects in the function are local to the function. The object returned can be any data type. Here is an example.
# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.

mysummary <- function(x,npar=TRUE,print=TRUE) {
  if (!npar) {
    center <- mean(x); spread <- sd(x)
  } else {
    center <- median(x); spread <- mad(x)
  }
  if (print & !npar) {
    cat("Mean=", center, "\n", "SD=", spread, "\n")
  } else if (print & npar) {
    cat("Median=", center, "\n", "MAD=", spread, "\n")
  }
  result <- list(center=center,spread=spread)
  return(result)
}

# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x)
Median= 4
MAD= 1.4826
# y$center is the median (4)
# y$spread is the median absolute deviation (1.4826)

y <- mysummary(x, npar=FALSE, print=FALSE)
# no output
# y$center is the mean (4.052)
# y$spread is the standard deviation (2.01927)
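The statement that objects in a function are local to the function can be demonstrated with a tiny sketch. The function double_it and its argument name are made up for this illustration.

```r
x <- 10

double_it <- function(value) {
  x <- value * 2   # this x exists only inside the function
  return(x)
}

result <- double_it(4)
# result holds the returned value; the x defined outside is untouched
```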

It can be instructive to look at the code of a function. In R, you can view a function's code by typing the function name without the ( ). If this method fails, look at the following R Wiki link for hints on viewing function sourcecode.

Finally, you may want to store your own functions, and have them available in every session. You can customize the R environment to load your functions at start-up.
Creating new variables

Use the assignment operator <- to create new variables. A wide array of operators and functions are available here.

# Three examples for doing the same computations

mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2) / 2

attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2) / 2
detach(mydata)

mydata <- transform( mydata,
  sum = x1 + x2,
  mean = (x1 + x2) / 2
)

Recoding variables

In order to recode data, you will probably use one or more of R's control structures.

# create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70,
  c("older"), c("younger"))

# another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
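The three-category recode can also be sketched with cut(), which assigns every value to an interval in one call. This is an alternative to the indexing approach shown above, not a replacement for it; the ages are made up, and the breaks reproduce the same boundaries (45 and 75 fall in the lower category because intervals are closed on the right by default).

```r
age <- c(30, 45, 46, 75, 76)

agecat <- cut(age,
              breaks = c(-Inf, 45, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))

print(data.frame(age, agecat))
```

The result is a factor rather than a character vector, which is usually what modeling and tabulation functions expect.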

Renaming variables

You can rename variables programmatically or interactively.

# rename interactively
fix(mydata) # results are saved on close

# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))

# you can re-enter all the variable names in order
# changing the ones you need to change. the limitation
# is that you need to enter all of them!
names(mydata) <- c("x1", "age", "y", "ses")
Enhancing Output

Value Labels

R

mydata$workshop <- factor( mydata$workshop,
  levels = c(1, 2, 3, 4),
  labels = c("R", "SAS", "SPSS", "Stata") )

SPSS

CD 'C:\myRfolder'.
GET FILE='mydata.sav'.
VARIABLE LEVEL workshop (NOMINAL)
  /q1 TO q4 (SCALE).
VALUE LABELS
  /workshop
  1 'R'
  2 'SAS'
  3 'SPSS'
  4 'Stata'
  /q1 TO q4
  1 'Strongly Disagree'
  2 'Disagree'
  3 'Neutral'
  4 'Agree'
  5 'Strongly Agree'.
SAVE OUTFILE = "mydata.sav".
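The R half of the comparison above can be tried without the mydata file. This sketch applies the same levels and labels to a short made-up vector of workshop codes.

```r
# hypothetical numeric codes as they might be stored in a raw data file
workshop <- c(1, 2, 2, 4, 3, 1)

workshop <- factor(workshop,
                   levels = c(1, 2, 3, 4),
                   labels = c("R", "SAS", "SPSS", "Stata"))

# the labels now appear wherever the factor is printed or tabulated
print(workshop)
counts <- table(workshop)
print(counts)
```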

Variable Labels

# Filename: VarLabels.R
setwd("c:/myRfolder")
load(file = "mydata.RData")
options(width = 63)
mydata
# Using the Hmisc label attribute.
library("Hmisc")
label(mydata$q1) <- "The instructor was well prepared."
label(mydata$q2) <- "The instructor communicated well."
label(mydata$q3) <- "The course materials were helpful."
label(mydata$q4) <- "Overall, I found this workshop useful."

# Hmisc describe function uses the labels.
describe( mydata[ ,3:6] )

# Built-in summary function
# ignores the labels.
summary( mydata[ ,3:6] )

SPSS

* Filename: VarLabels.sps .
CD 'C:\myRfolder'.

GET FILE='mydata.sav'.
VARIABLE LABELS
  Q1 "The instructor was well prepared"
  Q2 "The instructor communicated well"
  Q3 "The course materials were helpful"
  Q4 "Overall, I found this workshop useful".

FREQUENCIES VARIABLES=q1 q2 q3 q4.
SAVE OUTFILE='mydata.sav'.


Writing HTML & LaTeX

# Filename: FormattedOutput.R
options(width = 60)
setwd("c:/myRfolder")
load("mydata.RData")
attach(mydata)
library("xtable")

# Formatting a Data Frame
print(mydata)
myXtable <- xtable(mydata)
class(myXtable)

print(myXtable, type = "html")
print(myXtable, type = "latex")

# Formatting a Linear Model
mymodel <- lm( q4 ~ q1 + q2 + q3, data = mydata)
myXtable <- xtable(mymodel)

label(myXtable) <- c("xtableOutput")
caption(myXtable) <- c("Linear model results formatted by xtable")
print(myXtable, type = "html")
print(myXtable, type = "latex")

SAS

In SAS, getting tabular results into your word processor or Web page is as easy as setting
the style when you install it and then saving the output directly in the format you need. To
get a subset of output, copy and paste works fine.

SPSS

In SPSS, getting tabular results into your word processor or Web page is as easy as
setting the style when you install it and then saving the output directly in the format you
need. To get a subset of output, copy and paste works fine.

Stata

use c:\myRfolder\mydata, list
regress q4 q1 q2 q3
outtex



Graphics, ggplot2

While R's traditional graphics offers a nice set of plots, some of them require a lot of
work. Viewing the same plot for different groups in your data is particularly difficult. The
ggplot2 package is extremely flexible and repeating plots for groups is quite easy. The
"gg" in ggplot2 stands for the Grammar of Graphics, a comprehensive theory of graphics
by Leland Wilkinson which he described in his book by the same name. In his book, The
Grammar of Graphics, Wilkinson showed how you could describe plots not as discrete
types like bar plot or pie chart, but using a "grammar" that would work not only for plots
we commonly use but for almost any conceivable graphic. From this perspective a pie
chart is just a bar chart with a circular (polar) coordinate system replacing the
rectangular Cartesian coordinate system. Wilkinson's book is perhaps the most important
one on graphics ever written. However, it is not a light read and it presents an abstract
graphical syntax that is meant to clarify his concepts. It is not a language you can use to
recreate his graphs.

The ggplot2 package is a simplified implementation of the grammar of graphics written by Hadley
Wickham for R. It is simplified only in that he uses R for data transformation and
restructuring, rather than implementing that in his syntax. Wickham's book, ggplot2:
Elegant Graphics for Data Analysis, provides a detailed presentation of the ggplot2
package. Here I will review the basic examples presented in my books. The practice data
set is shown here. The programs and the data they use are also available for download
here.

To make it easy to get started, the ggplot2 package offers two main functions: quickplot()
and ggplot(). The quickplot() function, also known as qplot(), mimics R's traditional
plot() function in many ways. It is particularly easy to use for simple plots. Below is an
example of the default plots that qplot() makes. The command that created each plot is
shown in the title of each graph. Most of them are useful except for the middle one in the left
column of qplot(workshop, gender). A plot like that of two factors simply shows the
combinations of the factors that exist, which is certainly not worth doing a graph to
discover.



[Figure: default qplot() output in six panels, each titled with its command: qplot(workshop), qplot(posttest), qplot(workshop, gender), qplot(workshop, posttest), qplot(posttest, workshop), qplot(pretest, posttest)]

While qplot() is easy to use for simple graphs, it does not use the powerful grammar of
graphics. The ggplot() function does that. To understand ggplot, you need to ask yourself,
what are the fundamental parts of every data graph? They are:

• Aesthetics - these are the roles that the variables play in each graph. A variable may
control where points appear, the color or shape of a point, the height of a bar and so
on.
• Geoms - these are the geometric objects. Do you need bars, points, lines?
• Statistics - these are the functions like linear regression you might need to draw a
line.
• Scales - these are legends that show things like circular symbols represent females
while circles represent males.
• Facets - these are the groups in your data. Faceting by gender would cause the graph
to repeat for the two genders.

In R for SAS and SPSS Users and R for Stata Users I showed how to create almost all the
graphs using both qplot() and ggplot(). For the remainder of this page I use only
ggplot() because it is the more flexible function and by focusing on it, I hope to make it
easier to learn.

Let us start our use of the ggplot() function with a single stacked bar plot.
It is not a very popular plot, but it helps demonstrate how different the grammar of
graphics perspective is. On the x-axis there really is no variable, so I plugged in a call to
the factor() function that creates an empty one on the fly. I then fill the single bar in
using the fill argument. There is only one type of geometric object on the plot, which I
add with geom_bar. The colors are a bit garish, but they are chosen so that colorblind
people (10% of males) can still read them.

> ggplot(mydata100, aes(x = factor(""), fill = workshop) ) +
+   geom_bar()

[Figure: a single stacked bar of counts, filled by workshop (R, SAS, SPSS, Stata); the x-axis is labeled factor("")]

The x-axis comes out labeled as "factor("")" but we can over-write that with a title for the
x-axis. What is particularly interesting is that this can become a pie chart simply by
changing its coordinate system to polar. The final line of code changes the label on the
discrete x-axis to blank with "".

> ggplot(mydata100,
+   aes(x = factor(""), fill = workshop) ) +
+   geom_bar() +
+   coord_polar(theta = "y") +
+   scale_x_discrete("")

[Figure: pie chart of counts by workshop (R, SAS, SPSS, Stata)]

Bar Plots

The upper left corner of the first plot above shows a bar plot of workshop
created with qplot(). From the grammar of graphics approach, that graph has only one
type of geometric object: bars. The ggplot() function itself only needs to specify the data
set to use. Note the unusual use of the plus sign "+" to add the effect of geom_bar() to
ggplot(). Only one variable plays an "aesthetic" role: workshop. The aes() function sets
that role. So here is one way to write the code:

> ggplot(mydata100) +
+   geom_bar( aes(workshop) )

[Figure: bar plot of count by workshop (R, SAS, SPSS, Stata)]

A very useful feature of the ggplot() function is that it can pass aesthetic roles to all the
functions that are "added" to it. For example, we can create exactly the same barplot with
this code:

> ggplot(mydata100, aes(workshop) ) +
+   geom_bar()

In our case it's just as easy either way but I like the first approach since it ties the
aesthetic role clearly to the bars. However, as our graphs become more complex, it can be
a big time-saver to set as many aesthetic roles as possible in the ggplot() function call and
let it pass them through to various other functions that we will add on to build a more
complex plot.

The grammar of graphics way of creating plots looks quite odd at first, especially when
you consider that qplot(workshop) also does the above plot! However, as graphs get more
complex, ggplot() can handle it using the same ideas while qplot() cannot.

Flipping from vertical to horizontal bars is easy with the addition of the coord_flip() function.

> ggplot(mydata100, aes(workshop) ) +
+   geom_bar() + coord_flip()

[Figure: horizontal bar plot of count by workshop (R, SAS, SPSS, Stata)]

If you want to fill the bars with color, you can do that using the "fill" argument.

> ggplot(mydata100, aes(workshop, fill = workshop) ) +
+   geom_bar()

[Figure: bar plot of count by workshop, each bar filled with its own workshop color (R, SAS, SPSS, Stata)]

The use of color above was, well, colorful, but it did not add any useful information.
However, when displaying bar plots of two factors, the fill argument becomes very useful.
You can display it several ways. Below I use fill to color the bars by workshop and set the
"position" to stack.

> ggplot(mydata100, aes(gender, fill = workshop) ) +
+   geom_bar(position = "stack")

[Figure: stacked bar plot of count by gender (Female, Male), filled by workshop (R, SAS, SPSS, Stata)]

In the plot above, the height of the bars represents the total number of males and
females. This is fine if you want to compare counts, but if you want to compare
proportions of each gender that took each class, you would have to make the bars equal
heights. You can do that by simply changing the position to "fill".

> ggplot(mydata100, aes(gender, fill=workshop) ) +
+   geom_bar(position="fill")

[Figure: equal-height bars for gender (Female, Male) on a 0 to 1 scale, filled by workshop (R, SAS, SPSS, Stata)]
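The proportions that position="fill" displays can be checked numerically. The sketch below uses only base R and a small made-up data frame (the values are hypothetical, not the mydata100 practice data) to compute the within-gender shares that each filled bar represents.

```r
# Hypothetical stand-in for the practice data: 8 students, 2 genders, 3 workshops
toy <- data.frame(
  gender   = c("Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"),
  workshop = c("R", "R", "SAS", "SPSS", "R", "SAS", "SAS", "SAS")
)

# Counts per gender and workshop: the heights that position = "stack" piles up
counts <- table(toy$gender, toy$workshop)

# Row-wise proportions: the heights after position = "fill" rescales each bar to 1
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Each row of shares sums to 1, which is why the filled bars all reach the same height regardless of how many males or females there are.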

Here is the same plot changing only the bar position to be "dodge".

> ggplot(mydata100, aes(gender, fill=workshop) ) +
+   geom_bar(position="dodge")

[Figure: side-by-side (dodged) bars of count by gender (Female, Male), filled by workshop (R, SAS, SPSS, Stata)]

You can change any of the above colored graphs to shades of grey by simply adding the
scale_fill_grey() function. Here is the plot immediately above repeated in greyscale.

> ggplot(mydata100, aes(gender, fill=workshop) ) +
+   geom_bar(position="dodge") +
+   scale_fill_grey(start = 0, end = 1)

[Figure: the same dodged bar plot of count by gender, with workshop shown in shades of grey]

You can get the same information that is in the above plot by making small separate plots
for one of the groups. You can accomplish that with the facet_grid() function. It accepts a
formula in the form "rows ~ columns", so using "gender ~ ." asks for two rows for the
genders (three if we had not removed missing values) and no columns.

> ggplot(mydata100, aes(workshop) ) +
+   geom_bar() + facet_grid(gender ~ .)

[Figure: bar plots of count by workshop (R, SAS, SPSS, Stata), one panel per gender stacked in two rows]

Pre-summarized Data

The ggplot2 package summarizes your data for you. If it is already summarized, you can
create a small data frame of the results to plot.

> myTemp <- data.frame(
+   myGroup=factor( c("Before","After") ),
+   myMeasure=c(40, 60)
+ )
>
> ggplot(data=myTemp, aes(myGroup, myMeasure) ) +
+   geom_bar(stat = "identity")

[Figure: bar plot of myMeasure by myGroup (After, Before)]

Dot Charts

Dot chaits are similar to bai· chaits, but since they are plotting points on both an x - and
y-axis, they reqtúre a special variable called " ..count..". It calculates the counts and lets
you plot them on the y-axis. The points use the "hin" statistic. Since dot chaits are
usually sho·wn "sideways" I an1 adding the coord_ flip() funtion.

1 1 > ggplot ( mydata100, + aes ( workshop, . . count. . ) ) +


2 + geom_ point (stat = "bin" , size = 3) + coord_flip () +
3 facet_grid (gender ~ .)

Grapbics, ggplot2 _ r4stats com htm[27/0l/2014 22:21:26]


Grnpbics, ggplot2 1r4stats com

Stata - •
SPSS - • "TI
CD

...3
SAS - • ¡¡-

a.
o
r:.
R- •
IJ>
~ Stata-
o
::

SPSS - • ...s:
¡¡-
SAS- •
R-
1 1 1 1

10 12 14 16
count
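
The "..count.." notation and stat = "bin" reflect the ggplot2 release this page was written against. As a hedged sketch, assuming ggplot2 3.3.0 or later, the same kind of chart can be written with after_stat(count) and the "count" statistic. The data frame toy is made up and only stands in for mydata100.

```r
library(ggplot2)

# Hypothetical stand-in for mydata100: 12 attendees with workshop and gender.
toy <- data.frame(
  workshop = factor(rep(c("R", "SAS", "SPSS", "Stata"), times = c(4, 3, 3, 2))),
  gender   = factor(rep(c("Female", "Male"), length.out = 12))
)

# stat = "count" tallies rows per workshop; after_stat(count) puts that tally on y.
p_dot <- ggplot(toy, aes(workshop, after_stat(count))) +
  geom_point(stat = "count", size = 3) +
  coord_flip() +
  facet_grid(gender ~ .)

# The computed counts, one per workshop within each gender panel.
dot_counts <- layer_data(p_dot)$count
```

Every row of toy lands in exactly one point, so the counts add up to the number of rows.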

Adding Titles and Labels

To add a title, use the opts() function and its title argument. Adding titles to axes is
trickier. You use four different functions depending on the axis and whether or not it is
discrete:

scale_x_discrete
scale_y_discrete
scale_x_continuous
scale_y_continuous

For a bar plot, the x-axis is discrete so I will use scale_x_discrete to assign a label to it. The
character sequence "\n" tells R to go to a new line in all R packages.

> ggplot(mydata100, aes(workshop, ..count..)) +
+   geom_bar() +
+   opts(title = "Workshop Attendance") +
+   scale_x_discrete("Statistics Package \nWorkshops")

[Figure: bar chart titled "Workshop Attendance"; x-axis "Statistics Package Workshops" (R, SAS, SPSS, Stata), y-axis count]
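
The opts() function was deprecated and later removed from ggplot2, so the call above fails on current releases. A minimal sketch of the same labelling with labs(), assuming a recent ggplot2; toy and p_titled are hypothetical names and the data are made up.

```r
library(ggplot2)

# Hypothetical stand-in for mydata100.
toy <- data.frame(workshop = factor(c("R", "R", "SAS", "SPSS", "Stata", "R")))

# labs() sets the plot title and the x-axis title in one call.
p_titled <- ggplot(toy, aes(workshop)) +
  geom_bar() +
  labs(title = "Workshop Attendance",
       x = "Statistics Package \nWorkshops")

# Bar heights are the row counts per workshop.
bar_heights <- layer_data(p_titled)$count
```

labs() works the same way for discrete and continuous axes, so the choice among the four scale functions is only needed when you also want to change breaks or limits.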

Histograms

Recall from our first example that you can use qplot to get a quick histogram:
qplot(posttest). However, as things get more complicated, ggplot() is easier to control.
The geom_histogram function is all you need. I have set the color of the bar edges to
white. Without that, the bars all run together in the same shade of grey.

> ggplot(mydata100, aes(posttest)) +
+   geom_histogram(color = "white")

[Figure: histogram of posttest, y-axis count]

You can change the number of bars used using the binwidth argument. Since this many
bars do not touch, I did not bother setting the edge color to white.

> ggplot(mydata100, aes(posttest)) +
+   geom_histogram(binwidth = 0.5)

[Figure: histogram of posttest with binwidth 0.5, y-axis count]

If you prefer a density plot, that is easy too.

> ggplot(mydata100, aes(posttest)) +
+   geom_density()

[Figure: density plot of posttest, y-axis density]

It is easy to layer many different geometric objects onto your plots. In this case to get the
same axis on the histogram as the density uses, I used a special ggplot2 variable named
"..density.." on the y-axis. I also added a "rug" of carpet-like tick marks on the x-axis
using geom_rug.

> ggplot(data = mydata100) +
+   geom_histogram(aes(posttest, ..density..)) +
+   geom_density(aes(posttest, ..density..)) +
+   geom_rug(aes(posttest))

[Figure: histogram of posttest on the density scale with a density curve and rug overlaid]
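
Why the histogram has to be put on the density scale can be checked with a little arithmetic: a bar's density is its count divided by n times the bin width, so the bar areas sum to 1, which is the scale a density curve uses. The sketch below verifies this with base R's hist() on made-up scores; scores is a hypothetical stand-in for posttest.

```r
# Hypothetical scores standing in for posttest.
set.seed(1)
scores <- round(rnorm(100, mean = 82, sd = 7))

h <- hist(scores, plot = FALSE)          # compute the bins without drawing
bin_width <- diff(h$breaks)              # width of each bin

# Density by hand: count / (n * bin width).
manual_density <- h$counts / (sum(h$counts) * bin_width)

# Total area of the bars on the density scale.
area <- sum(h$density * bin_width)
```

On the count scale the bars would tower over the density curve, which is why both layers are mapped to the same density variable.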

Comparing group histograms is easy when you facet them.

> ggplot(mydata100, aes(posttest)) +
+   geom_histogram(color = "white") +
+   facet_grid(gender ~ .)

[Figure: histograms of posttest, one row of panels per gender, y-axis count]

Normal QQ Plots

Normal QQ plots are done in ggplot with the stat_qq() function and the sample aesthetic.

> ggplot(mydata100, aes(sample = posttest)) +
+   stat_qq()

[Figure: normal QQ plot, x-axis theoretical, y-axis sample]

Strip Plots

With fairly small data sets, you can do strip plots using the point geom.

> ggplot(mydata100, aes(workshop, posttest)) +
+   geom_point()

[Figure: strip plot of posttest by workshop (R, SAS, SPSS, Stata)]

With large data sets, you can use the jitter geom instead. Our data is so small that the
default amount of jitter makes it hard to even notice where each group ends. See the
books for details on controlling the amount of jitter.

> ggplot(mydata100, aes(workshop, posttest)) +
+   geom_jitter()

[Figure: jittered strip plot of posttest by workshop (R, SAS, SPSS, Stata)]

Scatter and Line Plots

Various types of scatter and line plots can be done using different geoms as shown below.
You can, of course, add multiple geoms to a plot. For example, you might want both points
and lines, in which case you would simply add both geoms.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_point()

[Figure: scatter plot of posttest by pretest]

When you add a line geom, ggplot sorts the data along the x-axis automatically. If you
had time-series data that were not sorted by date, it would sort them for you.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_line()

[Figure: line plot of posttest by pretest]

The path geom leaves the order of the data as it is; it does not sort it before connecting
the points. See the books for more examples.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_path()

[Figure: path plot of posttest by pretest]
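
The difference between the line and path geoms is easiest to see on a few deliberately unsorted rows. This is a minimal sketch, assuming a ggplot2 release that exports layer_data(); toy, p_line and p_path are hypothetical names.

```r
library(ggplot2)

# Hypothetical data whose x values are deliberately out of order.
toy <- data.frame(x = c(3, 1, 2, 5, 4), y = c(30, 10, 20, 50, 40))

p_line <- ggplot(toy, aes(x, y)) + geom_line()   # connects points in x order
p_path <- ggplot(toy, aes(x, y)) + geom_path()   # connects points in row order

# The x values in the order each geom will connect them.
line_x <- layer_data(p_line)$x
path_x <- layer_data(p_path)$x
```

Here line_x comes back sorted while path_x keeps the row order of toy, so the path plot doubles back on itself.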

Scatter Plots for Large Data Sets

Large data sets provide a challenge since so many points are obscured by other points.
First let us create a data set with 5,000 points.

> pretest2 <- round(rnorm(n = 5000, mean = 80, sd = 5))
> posttest2 <- round(pretest2 + rnorm(n = 5000, mean = 3, sd = 3))
> pretest2[pretest2 > 100] <- 100
> posttest2[posttest2 > 100] <- 100
> temp <- data.frame(pretest2, posttest2)

Now I will plot the data using small-sized points, jittering their positions and coloring
them with some transparency (called "alpha" in computer-speak).

> ggplot(temp, aes(pretest2, posttest2)) +
+   geom_jitter(size = 2,
+     position = position_jitter(width = 2, height = 2),
+     colour = alpha("black", 0.15))

[Figure: jittered, semi-transparent scatter plot of posttest2 by pretest2]

Next I will use very small sized points and lay a set of 2D density contours on top of
them. To help see the contours more clearly, I will not jitter the points.

> ggplot(temp, aes(x = pretest2, y = posttest2)) +
+   geom_point(size = 1) + geom_density2d()

[Figure: scatter plot of posttest2 by pretest2 with 2D density contours]

Finally, I will create a hexbin plot, which replaces bunches of points with a larger
hexagonal symbol.

> ggplot(temp, aes(pretest2, posttest2)) +
+   geom_hex(bins = 30)

[Figure: hexbin plot of posttest2 by pretest2, legend: count]

Scatter Plots with Fit Lines

The ggplot() function makes it particularly easy to add fit lines to scatter plots. Simply
adding the geom_smooth() function does the trick.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_point() + geom_smooth()

[Figure: scatter plot of posttest by pretest with a smoothed fit line and confidence band]

Adding a linear regression fit requires only the addition of the "method = lm" argument.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_point() + geom_smooth(method = lm)

[Figure: scatter plot of posttest by pretest with a linear regression fit line]

To plot labels instead of point characters, add the label aesthetic. I placed "size = 3" in
the geom_text function to clarify its role. I could have put it in the aes() function call
within the ggplot() call, but then it would have added a useless legend indicating what 3
represented, when it is merely a size.

> ggplot(mydata100,
+   aes(pretest, posttest, label = as.character(gender))) +
+   geom_text(size = 3)

[Figure: scatter plot of posttest by pretest with each point drawn as the label Female or Male]

To use point shapes to represent the value of a third variable, simply set the shape
aesthetic.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_point(aes(shape = gender))

[Figure: scatter plot of posttest by pretest, point shape by gender (Female, Male)]

Scatter Plots with Linear Fits by Group

One way to use a different fit for each group is to do them on the same plot. This involves
setting aesthetics for both linetype and point shape. You can place these in the main
ggplot() function call, but since linetype applies only to geom_smooth and shape applies
only to geom_point, I prefer to place them in those function calls. I tend to think of lines
being added to the scattered points, but in this case I placed the geom_point() call last so
that the shading from the gray confidence intervals would not shade the points
themselves.

> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_smooth(aes(linetype = gender), method = "lm") +
+   geom_point(aes(shape = gender))

[Figure: scatter plot of posttest by pretest with a linear fit per gender; line type and point shape by gender (Female, Male)]

Another way to display linear fits per group is to facet the plot.

> ggplot(mydata100,
+   aes(pretest, posttest)) +
+   geom_smooth(method = "lm") +
+   geom_point() +
+   facet_grid(gender ~ .)

[Figure: scatter plots of posttest by pretest with linear fits, one row of panels per gender]

Box Plots

The ggplot package offers considerable control over how you can do box plots. Here I plot
the raw points and then the boxes on top of them. This hides the points that are actually
in the middle 50% of the data. They are usually dense and of less interest than the points
that are further out. If you have a lot of data you might consider using geom_jitter() to
spread the points around, preventing over-plotting.

> ggplot(mydata100,
+   aes(workshop, posttest)) +
+   geom_point() + geom_boxplot()

[Figure: box plots of posttest by workshop (R, SAS, SPSS, Stata) drawn over the raw points]

The ggplot package offers a nearly endless array of combinations to visualize your data. I


Grnphics, Trnditional 1 r4stats com

Graphics, Traditional
R offers three main graphics packages: traditional (or base), lattice and ggplot2. Traditional
graphics are built into R, create nice looking graphs, and are very flexible. However, they require
a lot of work when repeating a graph for different groups in your data. Lattice graphics excel at
repeating graphs for various groups. The ggplot2 package also deals with groups well and is quite
a bit more flexible than lattice graphics. This section deals just with traditional R graphics
functions. Our books devote 130 pages to describing the relationship among these packages and
explaining how to create each type of plot. However, if you look at the examples below, you will
often be able to plug your variables into the code to create the graph you need. The practice data
set is shown here. The programs and the data they use are also available for download
here.

R for SAS and SPSS Users and R for Stata Users contain examples for advanced users. Paul
Murrell's book R Graphics (right) also offers excellent coverage of traditional graphics in great
detail.

Bar Plots

Bar Plots of Counts

If you have pre-summarized data, it is easy to get a bar plot of it.

> barplot(c(40, 60))

Graphics, Trnditional _ r4stats comhtm[27/0l/2014 22:21:42]


[Figure: bar plot of the two pre-summarized values]

This bar plot summarizes the variable q4. If the data is a factor (q4 is not) then plot() will
do it automatically since it is a generic function. You can also use barplot() if the data is
summarized with table() first.

> plot(as.factor(q4))
> barplot(table(q4))

[Figure: bar plot of counts for each value of q4]

This plots a factor, gender. The plot() function recognizes that it is a factor and so it summarizes it
before plotting it. Alternatively, you can use barplot() on the frequencies obtained by table().

> plot(gender) # or...
> barplot(table(gender))

[Figure: bar plot of counts by gender (Female, Male)]

Using traditional graphics functions you can turn a graph on its side by adding the
argument, horiz = TRUE. As before, plot() will do frequencies automatically and barplot()
requires some sort of summarization, in this case table().

> plot(workshop, horiz = TRUE) # or...
> barplot(table(workshop), horiz = TRUE)

[Figure: horizontal bar plot of counts by workshop]

A stacked bar plot is like a rectangular pie chart. You can make one by converting the
output from table() into a matrix. Note that it drops the value labels so we would have to
add them to make this useful. More on that later.

> barplot(as.matrix(table(workshop)), beside = FALSE)

[Figure: single stacked bar of workshop counts, without value labels]

You can visualize frequencies on two factors by using plot(). It will label the x- and y-axes so I
have suppressed that using the arguments xlab="" and ylab="". The barplot() function can also do
this plot if you first summarize the data with another function, table() in this case. However, it
does not label the genders on the y-axis, making it a bit more work.

> plot(workshop, gender, xlab = "", ylab = "")

You can also do this plot using the mosaicplot() function. It uses table() to get frequencies. It
would display "table(workshop, gender)" in the main title if I had not suppressed it with the
argument main="".

> mosaicplot(table(workshop, gender), main = "")

[Figure: mosaic plot of gender by workshop (R, SAS, SPSS, Stata)]

The mosaicplot() function can also handle more than two factors. Our practice data set only has
two, so we will use the Titanic survivor data that comes with R. The plot below is much larger
than the others because displaying the third variable takes quite a bit more space.

> mosaicplot(~ Sex + Age + Survived,
+   data = Titanic, color = TRUE)

[Figure: mosaic plot titled "Titanic" of Sex (Male, Female) by Age by Survived (Yes, No)]

Bar Plots of Means

So far we have been plotting frequencies. An advantage barplot() has over plot() is that you can
get the height of bars to represent any measure you like. Below I use tapply() to get the means of
q1 by gender, store it in myMeans and then plot those means.

> myMeans <- tapply(q1, gender, mean, na.rm = TRUE)
> barplot(myMeans)

[Figure: bar plot of mean q1 by gender (Female, Male)]

We can get means broken down by both gender and workshop by including both of them on the
tapply() call. To include more than one factor in tapply(), you must supply the factors in a list (or
a data frame, which is a type of list). Note that we do not get labels for the workshops. You could
add them with the legend() function (see next example) but this is a good example of something
the ggplot() function would do automatically. I never use the barplot function for more than one
factor.

> myMeans <- tapply(q1, list(workshop, gender), mean, na.rm = TRUE)
> barplot(myMeans, beside = TRUE)

[Figure: side-by-side bar plot of mean q1 by workshop within gender (Female, Male), with no workshop labels]
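
To make the shape of the two-factor tapply() result concrete, here is a small sketch on made-up values; score, workshop and gender below are hypothetical vectors, not the practice data. With a list of two factors, tapply() returns a matrix with one row per level of the first factor and one column per level of the second, which is what barplot() then draws column by column.

```r
# Hypothetical ratings with one missing value.
score    <- c(4, 5, 3, 2, NA, 5, 4, 1)
workshop <- factor(c("R", "R", "SAS", "SAS", "R", "SAS", "R", "SAS"))
gender   <- factor(c("Female", "Male", "Female", "Male",
                     "Female", "Male", "Male", "Female"))

# Rows are workshops, columns are genders; na.rm = TRUE is passed on to mean().
cell_means <- tapply(score, list(workshop, gender), mean, na.rm = TRUE)

# One cluster of bars per gender; the row names supply the missing workshop labels.
barplot(cell_means, beside = TRUE, legend.text = rownames(cell_means))
```

The legend.text argument of barplot() is one way to recover the workshop labels without a separate legend() call.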

Adding Titles, Labels, Colors, and Legends

All the traditional graphics function calls can be embellished with a variety of
arguments: col for color, xlab/ylab for x- and y-axis labels, main and sub for main and
sub-titles. The legend function provides an extreme level of control over legends or scales.
However, the functions in the ggplot2 package will do very nice legends automatically.

> barplot(table(gender, workshop),
+   beside = TRUE,
+   col = c("gray90", "gray60"),
+   xlab = "Workshop",
+   ylab = "Count",
+   main = "Number of males and females \nin each workshop")
> legend("topright",
+   legend = c("Female", "Male"),
+   fill = c("gray90", "gray60"))

[Figure: side-by-side bar plot titled "Number of males and females in each workshop"; x-axis Workshop (R, SAS, SPSS, Stata), legend Female/Male]

Graphics Parameters and Multiple Plots on a Page

Graphics parameters control R's traditional plot functions. You can get a list of them by
simply running "par()". One of the parameters is mfrow. It sets up multi-frame plots in
rows. Once you have set how many rows (first value) and columns (second) then all the
plots that follow will fill the rows as we read: left to right, top to bottom. Here is an
example.

> par(mfrow = c(2, 2)) # 2 rows, 2 columns
> barplot(table(gender, workshop))
> barplot(table(workshop, gender))
> barplot(table(gender, workshop), beside = TRUE)
> barplot(table(workshop, gender), beside = TRUE)
> par(mfrow = c(1, 1)) # 1 row, 1 column (back to the default)

[Figure: 2 x 2 grid of bar plots: stacked by workshop, stacked by gender, side-by-side by workshop, side-by-side by gender]
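
A common variation, sketched here on the assumption that the layout was still at its default beforehand, is to save the value that par() returns and restore it afterwards instead of hard-coding c(1, 1). The names old_par and layout_after are illustrative.

```r
# par() returns the previous values of the parameters it changes.
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

plot(1:10, main = "Left panel")
plot(10:1, main = "Right panel")

# Restore whatever layout was in effect before.
par(old_par)
layout_after <- par("mfrow")
```

This also works when the earlier layout was not the default, which hard-coding c(1, 1) cannot guarantee.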

Pie Charts

A pie chart is easy to do using pie() but the slices are empty by default. The col argument
can fill in shades of gray or colors.

> pie(table(workshop),
+   col = c("white", "gray90", "gray60", "black"))

Dot Charts

The dotchart() function works just like barplot() in that you either provide it values
directly or use other summarization functions to get those values. Below I use table() to
get frequencies. By default dotchart() uses open circles as its plot character. The
argument pch = 19 changes that to be a solid circle, and the cex argument makes the
character bigger through character expansion.

> dotchart(table(workshop, gender), pch = 19, cex = 1.5)

[Figure: dot chart of counts by workshop (R, SAS, SPSS, Stata), grouped by gender (Female, Male)]

Histograms

A basic histogram is very easy to get. Note that it adds its own main title to the plot. You
can suppress that by adding: main = "".

> hist(posttest)

[Figure: histogram titled "Histogram of posttest"]

You can change the number of bars in the histogram with the breaks argument. The
lines() function can add a density curve to the plot, and the rug() function adds shag
carpet-like tick marks to the x-axis where each data point appears.

> hist(posttest, breaks = 20, probability = TRUE)
> lines(density(posttest))
> rug(posttest)

[Figure: histogram titled "Histogram of posttest" on the density scale, with density curve and rug]

You can select subsets of your data to plot by using logical selections as in any R function
call. Here I have selected the males. It displays the logical selection as the label on the
x-axis. You can suppress that with: xlab = "". I have also used the col argument to change
the color of the bars to gray.

> hist(posttest[gender == "Male"],
+   col = "gray60",
+   main = "Histogram for Males Only")

[Figure: histogram titled "Histogram for Males Only"; x-axis posttest[gender == "Male"], y-axis Frequency]

Normal QQ Plots

A normal QQ plot displays a fairly straight line when the data are normally distributed. R
has a qqnorm() function that is built in, but I prefer the qq.plot() function in the car
(Companion to Applied Regression) package since it includes a 95% confidence interval.

> library("car")
> qq.plot(posttest,
+   labels = row.names(mydata100),
+   col = "black")

[Figure: normal QQ plot of posttest against norm quantiles, with a 95% confidence envelope]

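
For comparison, here is a minimal sketch of the built-in route with qqnorm() and qqline(), which needs no extra package but draws no confidence envelope. The vector scores is made up and only stands in for posttest.

```r
set.seed(42)
scores <- rnorm(100, mean = 82, sd = 7)   # hypothetical stand-in for posttest

# qqnorm() draws the plot and invisibly returns the plotted coordinates:
# x = theoretical normal quantiles, y = the sample values.
qq <- qqnorm(scores, main = "Normal QQ Plot")

# qqline() adds a reference line through the first and third quartiles.
qqline(scores)
```

Points that hug the reference line suggest the data are close to normally distributed.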
Strip Charts

Strip charts are scatter plots of single variables. To prevent points from obscuring one
another, they can either be moved around a bit at random (jittered) or stacked upon one
another. The multi-frame plot below shows both approaches.

> par(mfrow = c(2, 1)) # 2 rows, 1 column.
> stripchart(posttest, method = "jitter",
+   main = "Stripchart with Jitter")
> stripchart(posttest, method = "stack",
+   main = "Stripchart with Stacking")
> par(mfrow = c(1, 1)) # Back to 1 row, 1 column.

[Figure: two strip charts of posttest, titled "Stripchart with Jitter" and "Stripchart with Stacking"]

You can display either type of strip chart by group.

> stripchart(posttest ~ workshop, method = "jitter")

[Figure: jittered strip chart of posttest by workshop (R, SAS, SPSS, Stata)]

Scatter Plots and Line Plots

Scatter plots are the default when you supply two numeric variables to the plot() function.

> plot(pretest, posttest)

[Figure: scatter plot of posttest by pretest]

The type argument controls whether plot() displays points ("p"), lines ("l"), both ("b"), or
histogram-like needles to each point ("h"). Displaying lines would make sense if our data were
collected through time and displayed overall or seasonal trends. The order of the points in our data
set means nothing so we get a mess of zig-zagging lines. Note that "main" changes the title, not the
display; that is controlled only by the type argument.

> par(mfrow = c(2, 2)) # 2 rows, 2 columns.
> plot(pretest, posttest, type = "p", main = 'type = "p"')
> plot(pretest, posttest, type = "l", main = 'type = "l"')
> plot(pretest, posttest, type = "b", main = 'type = "b"')
> plot(pretest, posttest, type = "h", main = 'type = "h"')
> par(mfrow = c(1, 1)) # Back to 1 row, 1 column.

[Figure: 2 x 2 grid of posttest by pretest plots, panels titled type="p", type="l", type="b" and type="h"]

If you have many points plotting on top of one another, as often happens with 1 to 5
Likert-scale data, you can add some jitter (random variation) to the points to get a better
view of overall trends.

> par(mfrow = c(1, 2))
> plot(q1, q4, main = "Likert Scale Without Jitter")
> plot(jitter(q1, 3), jitter(q4, 3), main = "Likert Scale With Jitter")
> par(mfrow = c(1, 1))

[Figure: two scatter plots of q4 by q1, titled "Likert Scale Without Jitter" and "Likert Scale With Jitter"]
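
What jitter() does to the values themselves can be checked numerically. A small sketch on made-up responses follows; likert is a hypothetical stand-in for a variable such as q1.

```r
set.seed(7)
likert <- sample(1:5, size = 50, replace = TRUE)   # hypothetical 1-to-5 responses

# Add a small amount of uniform random noise to every value.
jittered <- jitter(likert, factor = 3)

# Largest amount any single value was moved.
max_shift <- max(abs(jittered - likert))

# Distinct plotting positions before and after jittering.
distinct_before <- length(unique(likert))
distinct_after  <- length(unique(jittered))
```

Each response moves by less than one scale point, but the few distinct positions of the raw data spread out enough that overlapping points become visible.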

Scatter Plots of Large Data Sets

The problem of overplotting becomes severe when you have thousands of points. In the
next example I generate a new data set containing 5,000 observations. Then I plot them
first using the default settings (left). Many of the points are obscured by other points. On
the right side I plot the data using a much smaller point character and add some jitter so
you can see many more of the points.

> # Create two new variables with 5,000 values.
> pretest2 <- round(rnorm(n = 5000, mean = 80, sd = 5))
> posttest2 <- round(pretest2 + rnorm(n = 5000, mean = 3, sd = 3))

> # Make sure 100 is the largest possible score.
> pretest2[pretest2 > 100] <- 100
> posttest2[posttest2 > 100] <- 100

> par(mfrow = c(1, 2)) # 1 row, 2 columns.
> plot(pretest2, posttest2,
+   main = "5,000 Points, Default Character \nNo Jitter")
> plot(jitter(pretest2, 4), jitter(posttest2, 4), pch = ".",
+   main = "5,000 Points Using pch = '.' \nand Jitter")

[Figure: two scatter plots of posttest2 by pretest2, titled "5,000 Points, Default Character, No Jitter" and "5,000 Points Using pch='.' and Jitter"]

Another way to do scatter plots of large data sets is to replace a set of points with a single large
hexagonal point. The hexbin() function does just that. The plot below is shown larger than most
because the values in the scale showing the number of counts in each hexagon overwrite one
another in smaller sizes.

> library("hexbin")
> plot(hexbin(pretest2, posttest2))

[Figure: hexbin plot of posttest2 by pretest2, legend: Counts]

A final way to get a scatter plot for large numbers of observations is to use the smoothScatter()
function shown below. The white lines that divide the scatter into rectangles look oddly spaced in
this low-resolution image, but they look much better in a high-resolution version for publication.

> smoothScatter(pretest2, posttest2)

[Figure: smoothed scatter plot of posttest2 by pretest2]

Scatter Plots with Lines

To see what type of line might fit your data, the lowess() function is a good place to start.
The lines() function adds the lowess() fit to the data.

> plot(posttest ~ pretest)
> lines(lowess(posttest ~ pretest))

[Figure: scatter plot of posttest by pretest with a lowess fit line]

That looks like a fairly straight line so we might want to fit that with a simple linear regression.

The abline() function can add any straight line in the form y = a + bx, where the a and b arguments
provide the y-intercept and slope, respectively. Since the lm() function does linear models
that supply the intercept and slope, abline() allows you to nest lm() within it.

> plot(posttest ~ pretest)
> abline(lm(posttest ~ pretest))

[Figure: scatter plot of posttest by pretest with a linear regression line]
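
Nesting lm() inside abline() is shorthand for passing the fitted intercept and slope yourself. A minimal sketch on made-up scores (pre and post are hypothetical stand-ins for pretest and posttest):

```r
set.seed(3)
pre  <- round(rnorm(40, mean = 75, sd = 5))   # hypothetical pretest scores
post <- pre + 5 + rnorm(40, sd = 3)           # hypothetical posttest scores

fit <- lm(post ~ pre)

# coef() returns the intercept first, then the slope.
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

plot(post ~ pre)
abline(a = intercept, b = slope, lty = 2)     # draws the same line as abline(fit)
```

In abline(), a is the intercept and b is the slope, so the line drawn is y = a + b * x.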

To use point characters to identify groups in your data, you can add the pch argument. It accepts
numeric vectors, so to use a factor like gender, nest it within the as.numeric() function. By using
logical selections on the variables, such as gender == "Male", you can easily get the abline()
function to do separate lines for each group. You can also use which(gender == "Male") to
select groups while eliminating missing values a bit more cleanly; see the books for
obsessive details on that. The lty argument sets the line type for each group.

> plot(posttest ~ pretest, pch = as.numeric(gender))

> abline(lm(posttest[gender == "Male"]
+   ~ pretest[gender == "Male"]),
+   lty = 1)

> abline(lm(posttest[gender == "Female"]
+   ~ pretest[gender == "Female"]),
+   lty = 2)

> legend("topleft", c("Male", "Female"),
+   lty = c(1, 2), pch = c(2, 1))

[Figure: scatter plot of posttest by pretest with separate point characters and fit lines for Male and Female]

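
The difference between a plain logical selection and which() only shows up when the grouping variable has missing values. A small sketch on made-up data; gender and score here are hypothetical vectors, not the practice data.

```r
# Hypothetical data with one missing gender.
gender <- factor(c("Male", "Female", NA, "Male", "Female"))
score  <- c(70, 85, 90, 60, 75)

# The comparison is NA for the missing gender, and indexing by NA yields NA.
by_logical <- score[gender == "Male"]

# which() returns only the positions where the comparison is TRUE.
by_which <- score[which(gender == "Male")]
```

Here by_logical has three elements, one of them NA, while by_which holds just the two male scores.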
Plotting Labels Instead of Points

A helpful way to display group membership on a scatter plot is to plot labels instead of
other plot characters. If one character will suffice, the pch argument will do this. Since
gender is a factor and I need character values for this plot, I enclose gender in the
as.character() function. Below you can easily see the lowest scoring person on both
pretest and posttest is male.

> plot(pretest, posttest, pch = as.character(gender))

[Figure: scatter plot of posttest by pretest with each point drawn as M or F]

The pch argument will only display the first character. That's often a good idea since it minimizes
point overlap. However, you can use whole labels by suppressing all plot characters with type =
"n" and adding them with the text() function. Although this plot is quite a mess with our practice
data set, it works quite well with small data sets.

> plot(pretest, posttest, type = "n")
> text(pretest, posttest, label = as.character(gender))

[Figure: scatter plot of posttest by pretest with each point drawn as the label Female or Male]

Box Plots

Box plots are easy to do with the plot() function. R also has a boxplot() function that
allows for additional control. See books for details.

> plot(workshop, posttest)

[Figure: box plots of posttest by workshop (R, SAS, SPSS, Stata)]



Statistics 1r4s1ats com

r4stats.com Analyzing the World of Analytics

Statistics
Below is a comparison of the commands used to perform various statistical analyses in R,
SAS, SPSS and Stata. For R functions that are not included in base R, the library()
function loads the package that contains the function right before it is used. The variables
gender and workshop are categorical factors and q1 to q4, pretest and posttest are
considered continuous and normally distributed.

The practice data set is shown here. The programs and the data they use are also
available for download here. Detailed step-by-step explanations are in the books along
with the output of each analysis.

Analysis of Variance

R

myModel <- aov(posttest ~ workshop,
  data = mydata100)
summary(myModel)
pairwise.t.test(posttest, workshop)
TukeyHSD(myModel, "workshop")
plot(TukeyHSD(myModel, "workshop"))

SAS

PROC GLM;
  CLASS workshop;
  MODEL posttest = workshop;
  MEANS workshop / TUKEY;
RUN;

SPSS

UNIANOVA posttest BY workshop
  /POSTHOC = workshop (TUKEY)
  /PRINT = ETASQ HOMOGENEITY
  /DESIGN = workshop.
Statistics _ r4stats com ojooooooooooooo htm[27/0l/2014 22:22:26]


Stata

anova posttest workshop

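To try the R commands from the analysis of variance comparison without the practice data set, here is a minimal sketch on simulated data. The data frame mydemo and its group means are assumptions made up for this illustration.

```r
# Simulated stand-in for mydata100 (hypothetical values)
set.seed(42)
mydemo <- data.frame(
  workshop = factor(rep(c("R", "SAS", "SPSS", "Stata"), each = 25)),
  posttest = rnorm(100, mean = rep(c(86, 79, 81, 82), each = 25), sd = 6)
)

# One-way analysis of variance
myModel <- aov(posttest ~ workshop, data = mydemo)
summary(myModel)

# Post hoc comparisons: pairwise t-tests and Tukey's HSD
with(mydemo, pairwise.t.test(posttest, workshop))
tukey <- TukeyHSD(myModel, "workshop")
tukey
plot(tukey)
```

Using with() or the data= argument avoids having to attach the data frame before calling functions that take bare variable names.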
Correlate, Pearson

cor(mydata[3:6],
  method = "pearson",
  use = "pairwise")
cor.test(mydata$q1,
  mydata$q2, use = "pairwise")

# Again, adjusting p-values for multiple testing.
library("Rcmdr")
rcorr.adjust(mydata[3:6])

SAS

PROC CORR;
  VAR q1-q4;
RUN;

SPSS

CORRELATIONS
  /VARIABLES=q1 TO q4.

Stata

correlate q*


Correlate, Spearman

R

cor(mydata[3:6],
  method = "spearman",
  use = "pairwise")

cor.test(mydata$q1,
  mydata$q2,
  method = "spearman")

# Again, adjusting p-values for multiple testing.
library("Rcmdr")
rcorr.adjust(mydata[3:6], type = "spearman")

SAS

PROC CORR SPEARMAN;
  VAR q1-q4;
RUN;

SPSS

NONPAR CORR
  /VARIABLES=q1 to q4
  /PRINT=SPEARMAN.

Stata

spearman q*
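
If Rcmdr is not installed, the idea of adjusting correlation p-values for multiple testing can be sketched in base R. This is not the rcorr.adjust() implementation; the simulated data frame q and the choice of Holm's method are assumptions for illustration.

```r
# Simulated 5-point survey items standing in for q1 to q4 (hypothetical data)
set.seed(7)
q <- as.data.frame(matrix(sample(1:5, 400, replace = TRUE), ncol = 4))
names(q) <- paste0("q", 1:4)

# Every pair of variables, one pair per column
pairs <- combn(names(q), 2)

# Unadjusted Spearman p-value for each pair
p_raw <- apply(pairs, 2, function(v) {
  cor.test(q[[v[1]]], q[[v[2]]], method = "spearman", exact = FALSE)$p.value
})
names(p_raw) <- apply(pairs, 2, paste, collapse = "-")

# Adjust for the six tests; Holm is one common choice
p_holm <- p.adjust(p_raw, method = "holm")
cbind(raw = p_raw, holm = p_holm)
```

Adjusted p-values are never smaller than the raw ones, which is a quick sanity check on the output.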

Crosstabulation & Chi-squared

myWG <- table(workshop, gender)
chisq.test(myWG)

library("gmodels")
CrossTable(workshop, gender,
  chisq = TRUE,
  format = "SAS")

SAS

PROC FREQ;
  TABLES workshop*gender / CHISQ;
RUN;

SPSS

CROSSTABS
  /TABLES=workshop BY gender
  /FORMAT= AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS= COUNT ROW
  /COUNT ROUND CELL.

Stata

tab gender workshop, row col exact

Descriptive Statistics

summary(mydata)

library("Hmisc")
describe(mydata)

SAS

PROC MEANS;
  VAR q1--posttest;

PROC UNIVARIATE; VAR q1--posttest;

SPSS

DESCRIPTIVES VARIABLES=q1 to posttest
  /STATISTICS=MEAN STDDEV VARIANCE
  MIN MAX SEMEAN.

EXAMINE VARIABLES=q1 to posttest
  /PLOT BOXPLOT STEMLEAF NPPLOT
  /COMPARE GROUP
  /STATISTICS DESCRIPTIVES EXTREME
  /MISSING PAIRWISE.

Stata

summarize q*

summarize q*, detail

Frequencies

summary(mydata)

library ("Deducer")
frequencies(mydata)

SAS

PROC FREQ;
  TABLES workshop--q4;
RUN;

SPSS

FREQUENCIES VARIABLES= workshop TO q4.

Stata

tab1 workshop gender q*

Kruskal-Wallis

kruskal.test(posttest ~ workshop)

pairwise.wilcox.test(posttest, workshop)

SAS

PROC NPAR1WAY;
  CLASS workshop;
  VAR posttest;

SPSS

NPAR TESTS
  /K-W=posttest BY
  workshop(1 3).

Stata

kwallis q1, by(gender)

Linear Regression

myModel <- lm(q4 ~ q1 + q2 + q3, data = mydata100)
summary(myModel)
plot(myModel)

SAS

PROC REG;
  MODEL q4 = q1-q3;

SPSS

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT q4
  /METHOD=ENTER q1 q2 q3.

Stata

regress q4 q1-q3
lvr2plot

Sign Test

library("PASWR")
SIGN.test(posttest, pretest,
  conf.level = .95)

SAS

myDiff=posttest-pretest;
PROC UNIVARIATE;
  VAR myDiff;
RUN;

SPSS

NPTESTS
  /RELATED TEST(q1 q2) SIGN
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
  /CRITERIA ALPHA=0.05
  CILEVEL=95.

Stata

bitest posttest > pretest

t-Test, Independent

t.test(q1 ~ gender,
  data = mydata100)

SAS

PROC TTEST;
  CLASS gender;
  VAR q1;
RUN;

SPSS

T-TEST
  GROUPS = gender('m' 'f')
  /VARIABLES = q1.

Stata

ttest q1, by(gender) unequal

t-Test, Paired

t.test(posttest, pretest,
  paired = TRUE)

SAS

PROC TTEST;
  PAIRED pretest*posttest;
RUN;

SPSS

T-TEST
  PAIRS=pretest WITH
  posttest (PAIRED).

Stata

ttest posttest == pretest
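
Several of the SAS examples on this page work from a difference score (myDiff = posttest - pretest). The same idea applies to the paired t-test: it is a one-sample t-test on the differences. A small sketch on simulated scores (the vectors below are made up for illustration):

```r
# Simulated pretest/posttest scores (hypothetical values)
set.seed(3)
pretest  <- rnorm(30, mean = 75, sd = 8)
posttest <- pretest + rnorm(30, mean = 5, sd = 4)

# Paired t-test
paired <- t.test(posttest, pretest, paired = TRUE)

# One-sample t-test on the difference scores
myDiff <- posttest - pretest
onesample <- t.test(myDiff, mu = 0)

# Both approaches give the same t statistic and p-value
c(paired = unname(paired$statistic), onesample = unname(onesample$statistic))
c(paired = paired$p.value, onesample = onesample$p.value)
```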

Variance Test

var.test(posttest ~ gender)

SAS

It's built into other procedures, such as GLM.

SPSS

It's built into other procedures, such as GLM.

Stata

robvar posttest, by(gender)

* Or...
sdtest posttest, by(gender)

Wilcoxon Rank Sum (Mann-Whitney)

wilcox.test(q1 ~ gender,
  data = mydata100)

SAS

PROC NPAR1WAY;
  CLASS gender;
  VAR q1;
RUN;

SPSS

NPTESTS
  /RELATED TEST(pretest posttest) SIGN WILCOXON.

Stata

ranksum posttest, by(gender)

Wilcoxon Signed Rank (Paired)

wilcox.test(posttest, pretest, paired = TRUE)

SAS

myDiff=posttest-pretest;
PROC UNIVARIATE;
  VAR myDiff;
RUN;

SPSS

NPTESTS
  /RELATED TEST(q1 q2) WILCOXON
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
  /CRITERIA ALPHA=0.05 CILEVEL=95.

Stata

signrank posttest = pretest



R show data - summarize and tabulate data with R

Using R to show data


data summary & mining with R

R provides a variety of methods for summarising data in tabular and other forms.

View data structure


Before you do anything else, it is important to understand the structure of your data and that of any objects
derived from it.
A <- data.frame(a=LETTERS[1:10], x=1:10)
class(A) # "data.frame"
sapply(A, class) # show classes of all columns
typeof(A) # "list"
names(A) # show list components
dim(A) # dimensions of object, if any
head(A) # extract first few (default 6) parts
tail(A, 1) # extract last row
head(1:10, -1) # extract everything except the last element

It is sometimes useful to work with a smaller version of a large data frame, by creating a representative
subset of the data, via random sampling:
A.small <- A[sample(nrow(A), 4), ] # select 4 rows at random

Basic numerical summaries


Generate and summarise some random numbers:
a <- rnorm(50)
summary(a) # gives min, max, mean, median, 1st & 3rd quartiles
min(a); max(a) # }
range(a) # } self-explanatory
mean(a); median(a) # }
sd(a); mad(a) # standard deviation, median absolute deviation
IQR(a) # interquartile range
quantile(a) # quartiles (by default)
quantile(a, c(1, 3)/4) # specific percentiles (25% & 75% in this case)

Data frame summaries:


A <- data.frame(a=rnorm(10), b=rpois(10, lambda=10))
summary(A) # summarise data frame
apply(A, 1, mean) # calculate row means
apply(A, 2, mean) # calculate column means: same as "colMeans(A)"

"which.min" & "which.max" return the element number of the lowest/highest value:
set.seed(123) # allow reproducible random numbers

r-show_data.html[27/01/2014 22:23:59]

x <- sample(10)
> which.max(x)
[1] 7
> x[which.max(x)]
[1] 10

This can be used in a data frame to extract the corresponding row containing the min/max value of one of the
columns:
A <- data.frame(x=rnorm(10), y=runif(10))
A[which.min(A$x), ]
#--Alternatively:
subset(A, x == min(x))

Other summaries:
x <- rnorm(100)
fivenum(x) # Tukey's five number summary, used to construct a boxplot:
boxplot(x) # see "?boxplot.stats" for more details
stem(x) # A stem-and-leaf plot

Matrix summaries:
A <- matrix(rnorm(50), nrow=10) # create 10x5 random number matrix
colSums(A); rowSums(A); colMeans(A); rowMeans(A) # self-explanatory
max.col(A) # maximum position for each row of a matrix, same as:
which.max(A[1,]); which.max(A[2,]) # etc.

Tables
Load some data on a sample of 20 galaxy clusters with a categorical classification status ("cctype")
indicating whether there is a cool core or not and a factor ("det") specifying which of two detectors was used
to make the X-ray observation of the cluster:
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt"
a <- read.table(file, header=TRUE, sep="|")
#
table(a$cctype) # count numbers in each cctype category
table(a$cctype, a$det) # 2-way table
xtabs(~ cctype + det, data=a) # alternative (formula) syntax
addmargins(xtabs(~ cctype + det, data=a)) # add row/col summary (default is sum)
prop.table(xtabs(~ cctype + det, data=a)) # show counts as proportions of total

To test whether the input factors are independent of each other:


chisq.test(xtabs(~ det + cctype, data=a), simulate.p.value=TRUE)

-there is marginal evidence (p=0.07) of an interaction: clusters observed with ACIS-S are more likely to have
a cool core than not.
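
The table and test above depend on a remote file. As a self-contained sketch of the same steps, the block below builds a small data frame with invented counts (they are not the published cluster data) and runs the same calls.

```r
# Invented 20-row data set with the same two columns (counts are made up)
demo <- data.frame(
  cctype = rep(c("CC", "non-CC"), times = c(9, 11)),
  det    = c(rep("I", 2), rep("S", 7), rep("I", 8), rep("S", 3))
)

tab <- xtabs(~ det + cctype, data = demo)
addmargins(tab)   # counts with row/column totals
prop.table(tab)   # proportions of the grand total

# Monte Carlo p-value, useful when expected cell counts are small
set.seed(1)
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
res$p.value
```

With simulate.p.value=TRUE the p-value changes slightly from run to run unless a seed is set.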

Calculate aggregate statistics


Calculate numerical summaries for subsets of a data frame (using above dataset):
> aggregate( kT ~ cctype, data=a, FUN=mean)
cctype kT
1 CC 5.121111
2 non-CC 6.146364
# mean cluster redshift of each cctype for each detector:
> aggregate(z ~ cctype + det, data=a, FUN=mean)
cctype det z
1 CC I 0.06070000
2 non-CC I 0.05137500
3 CC S 0.04105714
4 non-CC S 0.03636667
#--Show mean values of a few quantities, for each cctype:
aggregate(. ~ cctype, data=a[c("cctype", "z", "kT", "Z", "S01", "index")], mean)


You can also apply multi-number summaries:


> aggregate( index ~ cctype, data=a, FUN=range)
cctype index.1 index.2
1 CC 0.714 1.120
2 non-CC 0.283 0.944
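
One detail worth knowing about multi-number summaries: aggregate() stores them as a single matrix column, even though the printout looks like separate columns. A short sketch using the built-in mtcars data (my substitution for the cluster data set):

```r
# Range of mpg for each number of cylinders
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = range)
agg

# The result has two columns; the second one is a 3 x 2 matrix
ncol(agg)
is.matrix(agg$mpg)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
```

Flattening is useful before writing the result to a file or merging it with another data frame.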

For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data
using R.

Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source
code used to create them.


Last updated: 01/27/2014 16:17:56 | info | chk

Copyright © 2010-2013 Alastair Sanderson

Recode Data in R with R Recode Examples 1 RProgramming net

Recode Data in R

How To Recode Data in R

This page will show you how to recode data in R by either replacing data
in an existing field or recoding into a new field based on criteria you
specify. This page first addresses how to recode in base R. If you're
looking for information on the recode() command in the package car,
scroll to the bottom.

Replace Data in an Existing Field in R

The first example shows how to replace the data in an existing field when
you want to replace the data for every row (no criteria). This code
replaces any data that is already in the field Grade in the data frame
SchoolData with the number 5, the text string Five, or NA.

# Replace all the data in a field with a number
SchoolData$Grade <- 5

# Replace all the data in a field with text
SchoolData$Grade <- "Five"

# Replace all the data in a field with NA (missing data)
SchoolData$Grade <- NA

The second example shows how to apply criteria so that only data in

Recode Data in R with R Recode Examples _ RProgramming net htm[27/0l/2014 22:23:47]


specific rows is replaced. Note that if you want to replace NA with some
value you cannot use ==NA. You must use is.na(). See below for an
example.

# Replace the data in a field based on equal to some value
SchoolData$Grade[SchoolData$Grade==5] <- "Grade Five"

# Or replace based on less than or equal to some value
SchoolData$Grade[SchoolData$Grade<=5] <- "Grade Five or Less"

# Or replace based on equal to some text
SchoolData$Grade[SchoolData$Grade=="Five"] <- "Grade Five"

# Or replace only missing data
# Note that ==NA does not work!
SchoolData$Grade[is.na(SchoolData$Grade)] <- "Missing Grade"
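
To see why ==NA does not work, here is a small sketch. SchoolDemo is a made-up data frame used only for this illustration.

```r
# Hypothetical data with two missing grades
SchoolDemo <- data.frame(Grade = c(5, NA, 3, NA))

# Comparing anything with NA yields NA, never TRUE
SchoolDemo$Grade == NA

# So this replacement selects no rows and the NAs remain
broken <- SchoolDemo
broken$Grade[broken$Grade == NA] <- 0
sum(is.na(broken$Grade))

# is.na() returns TRUE/FALSE, so the missing values are replaced
fixed <- SchoolDemo
fixed$Grade[is.na(fixed$Grade)] <- 0
fixed$Grade
```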


The third example shows how to replace data based on more than one
criterion. This code creates a new field SchoolType and enters
"Elementary" into it for all rows where Grade is less than or equal to 5 and
SchoolStatus is OPEN.

# Replace data based on the values in more than one field
SchoolData$SchoolType[SchoolData$Grade<=5 &
  SchoolData$SchoolStatus=="OPEN"] <- "Elementary"

Recode into a New Field

The fourth example shows how to make a copy of an existing field.
Sometimes you don't want to recode data but instead just want another
column containing the same data. This example makes a new column
called CopyOfGrade and fills it with the data from Grade. This isn't exactly
recoding but is related and comes up a lot since it is usually a good idea
to make a copy of a field and then do the recoding on the copy rather
than on the original.

# Copy a column in R
# First create the new column
SchoolData$CopyOfGrade <- NA
# Then copy the data from the existing column into the new one.
SchoolData$CopyOfGrade <- SchoolData$Grade


The fifth example shows how to recode data into a new numeric field
based on criteria from a numeric field. Note that with numeric fields you do
not surround the value with quotation marks. With a character field you do
surround the value with quotation marks (next example).

This example creates a new field called NewGrade based on the field
Grade. Note that, as with the above examples, you can again use & or
any of the other operators to produce the criteria you want.

# Recode into a new field in R
# First create the new field
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the specified rows
SchoolData$NewGrade[SchoolData$Grade==5] <- 5

The sixth example shows how to recode data into a new character field
based on criteria from a numeric field. This example again creates a new
field called NewGrade based on the field Grade.

# Recode into a new field in R
# First create the new field
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the specified rows
SchoolData$NewGrade[SchoolData$Grade==5] <- "Grade Five"

The seventh example shows how to recode data into a new character field
based on criteria from a character field. This example creates a new field
called NewGrade based on the field Grade.

# Recode into a new field in R
# First create the new field
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the specified rows
SchoolData$NewGrade[SchoolData$Grade=="Grade Five"] <- "Grade Five"

The eighth example shows how to recode data into a new numeric field
based on criteria from a character field. This example again creates a new
field called NewGrade based on the field Grade.

# Recode into a new field in R
# First create the new field
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the specified rows
SchoolData$NewGrade[SchoolData$Grade=="Grade Five"] <- 5

Recode into a New Field Using Data from an Existing Field and Criteria from Another Field

This is where things get a little weird. If you want to recode data into a
field and pull that data from another field, you have to specify the criteria
on both sides of the <-. If you don't, R will still recode but you won't get
the results you're expecting. For example, let's say you want to copy the
data from Grade into NewGrade but only where SchoolType is
"Elementary". You might think that this will work:

# Recode into a new field in R
# First create the new field
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the specified rows
SchoolData$NewGrade[SchoolData$SchoolType=="Elementary"] <-
  SchoolData$Grade

And it will run! But you won't get the results you are expecting. R won't
copy the data from Grade for only the rows where SchoolType is
Elementary. Instead, it will start at the top of Grade and copy its values,
in order, into the selected rows (usually with a warning that the number of
items to replace does not match the replacement length). To recode
correctly you have to specify the criteria on both sides of the <-, as in
example nine:

# Recode into a new field in R
# First create the new field
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the specified rows
SchoolData$NewGrade[SchoolData$SchoolType=="Elementary"] <-
  SchoolData$Grade[SchoolData$SchoolType=="Elementary"]
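
The difference between the two versions is easiest to see on a tiny data set. SchoolDemo and the column names Wrong and Right are invented for this sketch.

```r
# Hypothetical five-row data set
SchoolDemo <- data.frame(
  Grade      = c(1, 9, 3, 10, 5),
  SchoolType = c("Elementary", "High", "Elementary", "High", "Elementary"),
  stringsAsFactors = FALSE
)
is_elem <- SchoolDemo$SchoolType == "Elementary"

# Criteria on the left side only: the first three values of Grade
# (1, 9, 3) are written into the three selected rows
SchoolDemo$Wrong <- NA
suppressWarnings(SchoolDemo$Wrong[is_elem] <- SchoolDemo$Grade)

# Criteria on both sides: each selected row receives its own Grade
SchoolDemo$Right <- NA
SchoolDemo$Right[is_elem] <- SchoolDemo$Grade[is_elem]

SchoolDemo
```

Storing the criteria in a logical vector such as is_elem keeps both sides of the assignment in sync and avoids retyping the condition.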

The Recode Command from the Package Car

The recode() command from the car package is another great way to
recode data in R. Recode from car can be very powerful and is a good
alternative to the code above.

If you want to recode from car you have to first install the car package and
then load it for use.

# Install the car package
install.packages("car")

# Load the car package
library(car)

Now recode Grade from 5 to 6:

# Recode grade 5 to grade 6
SchoolData$Grade <- recode(SchoolData$Grade, "5=6")

If you want to recode based on text, use the ' mark around the text.

Now recode Grade from "Grade Five" to 5:

# Recode "Grade Five" to 5
SchoolData$Grade <- recode(SchoolData$Grade, "'Grade Five'=5")


To recode multiple values, use c():

# Recode grades 1 through 5 to Five or Less
SchoolData$Grade <- recode(SchoolData$Grade, "c(1,2,3,4,5)='Five or Less'")

Recode can recode data into a new field. This code creates a new field
called NewGrade based on Grade. Note that if you don't specify how a value
is recoded, R will just copy the existing value into the new field.

# Create a new field called NewGrade
SchoolData$NewGrade <- recode(SchoolData$Grade, "5='Elementary'")

Of course, you can convert a value to NA, or NA to a value.

# Recode grade 3 to NA
SchoolData$Grade <- recode(SchoolData$Grade, "3=NA")

# Or recode NA to 7
SchoolData$Grade <- recode(SchoolData$Grade, "NA=7")

One advantage to recode is that it can recode multiple values in one line
of code.

# Recode grade 5 to grade 6 and grade 6 to grade 7
SchoolData$Grade <- recode(SchoolData$Grade, "5=6;6=7")

Another advantage to recode is that it makes using ranges easy.

# Recode grades 1 through 5 to Elementary
SchoolData$Grade <- recode(SchoolData$Grade, "1:5='Elementary'")

One more advantage to recode is that it includes the use of the
commands lo and hi to specify a range. Lo tells recode to start the range
at the lowest value. Hi tells recode to end the range at the highest value.

# Recode the lowest grade through 5 to Elementary
SchoolData$Grade <- recode(SchoolData$Grade, "lo:5='Elementary'")

# Recode grade 9 to the highest grade to High School
SchoolData$Grade <- recode(SchoolData$Grade, "9:hi='High School'")

A final advantage to recode is that it includes the use of the command
else to specify what to do with any value that was not already
recoded. The following converts grades 1 through 5 to Elementary, 6
through 8 to Middle, and all other grades (including NA) to High.

# Recode grades
SchoolData$Grade <- recode(SchoolData$Grade,
  "1:5='Elementary';6:8='Middle';else='High'")

There are other options that can be used with recode in car. See the official
reference manual for the car package to learn more:
http://cran.r-project.org/web/packages/car/car.pdf.
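
If the car package is not available, range recoding can also be sketched with base R's cut(). This is a different technique from car's recode(), shown only for comparison; the vector grade is made up.

```r
# Hypothetical grades, including a missing value
grade <- c(1, 4, 5, 6, 8, 9, 12, NA)

# Intervals are (-Inf, 5], (5, 8] and (8, Inf) because right = TRUE by default
level <- cut(grade,
             breaks = c(-Inf, 5, 8, Inf),
             labels = c("Elementary", "Middle", "High"))

data.frame(grade, level)
```

Unlike the else rule in recode(), cut() leaves missing grades as NA rather than assigning them to a category, and it returns a factor rather than a character vector.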

[Figure: screenshot of recode in action.]

Thanks for reading! This website took a great deal of time to create. If it
was helpful to you, please show it by sharing with friends, liking, or
tweeting! If you have any thoughts regarding this R code please post in
the comments.


One thought on "Recode Data in R"

Sierra Bravo
May 7. 2013 at 9:30 am

This was greatly useful to me. Thanks for your efforts!



Bootstrapping

Nonparametric Bootstrapping

The boot package provides extensive facilities for bootstrapping and related resampling methods. You
can bootstrap a single statistic (e.g., a median), or a vector (e.g., regression weights). This section will
get you started with basic nonparametric bootstrapping.

The main bootstrapping function is boot( ) and has the following format:

bootobject <- boot(data= , statistic= , R=, ...) where

parameter   description
data        A vector, matrix, or data frame
statistic   A function that produces the k statistics to be bootstrapped (k=1 if
            bootstrapping a single statistic).
            The function should include an indices parameter that the boot() function
            can use to select cases for each replication (see examples below).
R           Number of bootstrap replicates
...         Additional parameters to be passed to the function that produces the
            statistic of interest

boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with
replacement, from the integers 1:nrow(data). These indices are used within the statistic function to
select a sample. The statistics are calculated on the sample and the results are accumulated in the
bootobject. The bootobject structure includes

element   description
t0        The observed values of k statistics applied to the original data.
t         An R x k matrix where each row is a bootstrap replicate of the k statistics.

You can access these as bootobject$t0 and bootobject$t.

Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to
examine the results. If the results look reasonable, you can use the boot.ci( ) function to obtain confidence
intervals for the statistic(s).
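
Before the regression examples, a minimal sketch of the mechanics described above may help. It bootstraps the median of mtcars$mpg; the function name med, the seed, and the choice of 500 replicates are my own assumptions for illustration.

```r
library(boot)

# Statistic function: boot() supplies the resampled indices
med <- function(data, indices) {
  median(data[indices])
}

set.seed(2014)
b <- boot(data = mtcars$mpg, statistic = med, R = 500)

b$t0      # the median of the original data
dim(b$t)  # 500 x 1: one row per replicate, one column per statistic

print(b)
boot.ci(b, type = "perc")
```

Because the data here is a plain vector, the statistic function indexes it with data[indices] rather than data[indices, ].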


The format is

boot.ci(bootobject, conf=, type= ) where

parameter    description
bootobject   The object returned by the boot function
conf         The desired confidence level (default: conf=0.95)
type         The type of confidence interval returned. Possible values are "norm",
             "basic", "stud", "perc", "bca" and "all" (default: type="all")

Bootstrapping a Single Statistic (k=1)

The following example generates the bootstrapped 95% confidence interval for R-squared in the linear
regression of miles per gallon (mpg) on car weight (wt) and displacement (disp). The data source is
mtcars. The bootstrapped confidence interval is based on 1000 replications.

# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.squared)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
   R=1000, formula=mpg~wt+disp)

# view results
results
plot(results)

# get 95% confidence interval
boot.ci(results, type="bca")

Bootstrapping Several Statistics (k>1)

In the example above, the function rsq returned a number and boot.ci returned a single confidence
interval. The statistic function you provide can also return a vector. In the next example we get the
95% CI for the three model regression coefficients (intercept, car weight, displacement). In this case we
add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to be
analyzed.

# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
   R=1000, formula=mpg~wt+disp)

# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp

# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
boot.ci(results, type="bca", index=2) # wt
boot.ci(results, type="bca", index=3) # disp

Going Further

The boot( ) function can generate both nonparametric and parametric resampling. For the
nonparametric bootstrap, resampling methods include ordinary, balanced, antithetic and permutation.
For the nonparametric bootstrap, stratified resampling is supported. Importance resampling weights can
also be specified.

The boot.ci( ) function takes a bootobject and generates 5 different types of two-sided nonparametric
confidence intervals. These include the first order normal approximation, the basic bootstrap interval,
the studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap
percentile (BCa) interval.

Look at help(boot), help(boot.ci), and help(plot.boot) for more details.
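
As one illustration of the stratified resampling mentioned above, the sketch below bootstraps the difference in mean mpg between manual and automatic cars in mtcars, resampling within each transmission group. The statistic function mean_diff, the seed, and R=500 are assumptions chosen for this example.

```r
library(boot)

# Difference in mean mpg: manual (am == 1) minus automatic (am == 0)
mean_diff <- function(data, indices) {
  d <- data[indices, ]
  mean(d$mpg[d$am == 1]) - mean(d$mpg[d$am == 0])
}

set.seed(2014)
# strata= makes boot() resample separately within each group,
# so every replicate keeps the original group sizes
strat <- boot(data = mtcars, statistic = mean_diff, R = 500,
              strata = factor(mtcars$am))

strat$t0
boot.ci(strat, type = "perc")
```

Without strata=, a replicate could by chance contain very few cars from one group, which makes the statistic less stable.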

Learning More

Good sources of information include Resampling Methods in R: The boot Package by Angelo Canty,
Getting started with the boot package by Ajay Shah, Bootstrapping Regression Models by John Fox,
and Bootstrap Methods and Their Applications by Davison and Hinkley.

Correspondence Analysis

Correspondence analysis provides a graphic method of exploring the relationship between variables in a
contingency table. There are many options for correspondence analysis in R. I recommend the ca
package by Nenadic and Greenacre because it supports supplementary points, subset analyses, and
comprehensive graphics. You can obtain the package here.

Although ca can perform multiple correspondence analysis (more than two categorical variables), only
simple correspondence analysis is covered here. See their article for details on multiple CA.

Simple Correspondence Analysis

In the following example, A and B are categorical factors.

# Correspondence Analysis
library(ca)
mytable <- with(mydata, table(A,B)) # create a 2 way table
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
fit <- ca(mytable)
print(fit) # basic results
summary(fit) # extended results
plot(fit) # symmetric map
plot(fit, mass = TRUE, contrib = "absolute", map =
   "rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map

The first graph is the standard symmetric representation of a simple correspondence analysis with rows
and columns represented by points.

Row points (column points) that are closer together have more similar column profiles (row profiles).
Keep in mind that you cannot interpret the distance between row and column points directly.

The second graph is asymmetric, with rows in the principal coordinates and columns in reconstructions
of the standardized residuals. Additionally, mass is represented by points and columns are represented
by arrows. Point intensity (shading) corresponds to the absolute contributions for the rows. This
example is included to highlight some of the available options.

Tree-Based Models
Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of
data, while developing easy to visualize decision rules for predicting a categorical (classification tree)
or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional
inference trees, and random forests.

CART Modeling via rpart
Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be
generated through the rpart package. Detailed information on rpart is available in An Introduction to
Recursive Partitioning Using the RPART Routines. The general steps are provided below followed by two
examples.

1. Grow the Tree

To grow a tree, use
rpart(formula, data=, method=, control=) where

formula     is in the format
            outcome ~ predictor1+predictor2+predictor3+etc.
data=       specifies the data frame
method=     "class" for a classification tree
            "anova" for a regression tree
control=    optional parameters for controlling tree growth. For example,
            control=rpart.control(minsplit=30, cp=0.001) requires that the minimum
            number of observations in a node be 30 before attempting a split and that a
            split must decrease the overall lack of fit by a factor of 0.001 (cost
            complexity factor) before being attempted.

2. Examine the results
The following functions help us to examine the results.

printcp(fit)        display cp table
plotcp(fit)         plot cross-validation results
rsq.rpart(fit)      plot approximate R-squared and relative error for different splits (2
                    plots). Labels are only appropriate for the "anova" method.
print(fit)          print results
summary(fit)        detailed results including surrogate splits
plot(fit)           plot decision tree
text(fit)           label the decision tree plot
post(fit, file=)    create postscript plot of decision tree

In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs
below).

3. Prune the tree
Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that
minimizes the cross-validated error, the xerror column printed by printcp( ).

Prune the tree to the desired size using

prune(fit, cp= )

Specifically, use printcp( ) to examine the cross-validated error results, select the complexity
parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you
can use the code fragment

fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]

to automatically select the complexity parameter associated with the smallest cross-validated error.
Thanks to HSAUR for this idea.

Classification Tree example


Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery, from age in
months (Age), number of vertebrae involved (Number), and the topmost vertebra operated on (Start).

# Classification Tree with rpart
library(rpart)

# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
   method="class", data=kyphosis)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE,
   main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps",
   title = "Classification Tree for Kyphosis")

# prune the tree
pfit <- prune(fit, cp=
   fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Classification Tree for Kyphosis")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
   title = "Pruned Classification Tree for Kyphosis")

Regression Tree example


In this example we will predict car mileage from price, country, reliability, and car type. The data
frame is cu.summary.
# Regression Tree Example
library(rpart)

# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
   method="anova", data=cu.summary)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # plot approximate R-squared and relative error

# plot tree
plot(fit, uniform=TRUE,
   main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree2.ps",
   title = "Regression Tree for Mileage ")

# prune the tree
pfit <- prune(fit, cp=0.01160389) # from cptable

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Regression Tree for Mileage")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree2.ps",
   title = "Pruned Regression Tree for Mileage")

It turns out that this produces the same tree as the original.

Conditional inference trees via party


The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored,
and multivariate responses. The vignette party: A Laboratory for Recursive Partitioning provides details.

You can create a regression or classification tree via the function

ctree(formula, data=)
The type of tree created will depend on the outcome variable (nominal factor, ordered factor,
numeric, etc. ). Tree growth is based on statistical stopping rules, so pruning should not be required.

The previous two examples are re-analyzed below.

# Conditional Inference Tree for Kyphosis
library(party)
fit <- ctree(Kyphosis ~ Age + Number + Start,
   data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")

# Conditional Inference Tree for Mileage
library(party)
fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
   data=na.omit(cu.summary))

Random Forests
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (each
split considering a random sample of the variables), classifying a case using each tree in this new
"forest", and deciding a final predicted outcome by combining the results across all of the trees (an
average in regression, a majority vote in classification). Breiman and Cutler's random forest approach
is implemented via the randomForest package.

Here is an example.

# Random Forest prediction of Kyphosis data
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor

For more details see the comprehensive Random Forest website.

Going Further
This section has only touched on the options available. To learn more, see the CRAN Task View on
Machine & Statistical Learning.
Cluster Analysis
R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the
many approaches: hierarchical agglomerative, partitioning, and model based. While there are no best
solutions for the problem of determining the number of clusters to extract, several approaches are
given below.

Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.

# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables

Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the
plot similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).

# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
   centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
   ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)

A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).
The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of
clusters based on optimum average silhouette width.
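
The within groups sum of squares plot described above can be tried on a built-in data set. This is a minimal sketch, not the site's own example: the USArrests data, the seed, the range of 1 to 8 clusters, and nstart=25 are all illustrative assumptions.

```r
# Sketch: within groups sum of squares for 1 to 8 clusters.
# USArrests, the seed and nstart=25 are illustrative choices.
set.seed(42)
x <- scale(na.omit(USArrests))   # standardize, as in Data Preparation

wss <- numeric(8)
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))   # one cluster: total sum of squares
for (k in 2:8) {
  # nstart=25 reruns kmeans from 25 random starts and keeps the best fit
  wss[k] <- kmeans(x, centers = k, nstart = 25)$tot.withinss
}

plot(1:8, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

Because kmeans( ) starts from random centers, fixing the seed and using several starts makes the curve more stable from run to run; the bend is still a judgment call.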

Hierarchical Agglomerative
There is a wide range of hierarchical clustering approaches. I have had good luck with Ward's method
described below.
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on
multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p
values. Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows.
Transpose your data before using.

# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward",
   method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)

Model Based
Model based approaches assume a variety of data models and apply maximum likelihood estimation and
Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust( )
function in the mclust package selects the optimal model according to BIC for EM initialized by
hierarchical clustering for parameterized Gaussian mixture models. (phew!). One chooses the model and
number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as
best.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model

Plotting Cluster Solutions


It is always a good idea to look at the cluster results.

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
   labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)

Validating cluster solutions


The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of
two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index
and the corrected Rand index).

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)

where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors
containing classification results from two different clusterings of the same data.
Discriminant Function Analysis
The MASS package contains functions for performing linear and quadratic
discriminant function analysis. Unless prior probabilities are specified, each assumes proportional prior
probabilities (i.e., prior probabilities are based on sample sizes). In the examples below, lower case
letters are numeric variables and upper case letters are categorical factors.

Linear Discriminant Function
# Linear Discriminant Analysis with Jackknifed Prediction
library(MASS)
fit <- lda(G ~ x1 + x2 + x3, data=mydata,
   na.action="na.omit", CV=TRUE)
fit # show results

The code above performs an LDA, using listwise deletion of missing data. CV=TRUE generates jackknifed
(i.e., leave one out) predictions. The code below assesses the accuracy of the prediction.

# Assess the accuracy of the prediction
# percent correct for each category of G
ct <- table(mydata$G, fit$class)
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

lda() prints discriminant functions based on centered (not standardized) variables. The "proportion of
trace" that is printed is the proportion of between-class variance that is explained by successive
discriminant functions. No significance tests are produced. Refer to the section on MANOVA for such
tests.
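
The jackknifed accuracy check above can be run end to end on a built-in data set. This is a sketch under assumptions: iris stands in for mydata and Species plays the role of the grouping factor G; neither appears in the original example.

```r
library(MASS)

# iris stands in for mydata; Species plays the role of G (illustrative choice)
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris, CV = TRUE)

# rows: actual group, columns: leave-one-out predicted group
ct <- table(iris$Species, fit$class)
diag(prop.table(ct, 1))            # proportion correct within each species
acc <- sum(diag(prop.table(ct)))   # overall proportion correct
acc
```

Note that with CV=TRUE, lda( ) returns the predicted classes and posterior probabilities rather than a fitted model object, so functions such as plot( ) and predict( ) need a separate fit without CV=TRUE.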

Quadratic Discriminant Function


To obtain a quadratic discriminant function use qda( ) instead of lda( ). Quadratic discriminant function
analysis does not assume homogeneity of variance-covariance matrices.

# Quadratic Discriminant Analysis with 3 groups applying
# resubstitution prediction and equal prior probabilities.
library(MASS)
fit <- qda(G ~ x1 + x2 + x3 + x4, data=na.omit(mydata),
   prior=c(1,1,1)/3)

Note the alternate way of specifying listwise deletion of missing data. Re-substitution (using the same
data to derive the functions and evaluate their prediction accuracy) is the default method unless
CV=TRUE is specified. Re-substitution will be overly optimistic.
Visualizing the Results
You can plot each observation in the space of the first 2 linear discriminant functions using the
following code. Points are identified with the group ID.

# Scatter plot using the 1st two discriminant dimensions
plot(fit) # fit from lda

The following code displays histograms and density plots for the observations in each group on the first
linear discriminant dimension. There is one panel for each group and they all appear lined up on the
same graph.

# Panels of histograms and overlayed density plots
# for 1st discriminant function
plot(fit, dimen=1, type="both") # fit from lda

The partimat( ) function in the klaR package can display the results of a linear or quadratic
classification, 2 variables at a time.

# Exploratory Graph for LDA or QDA
library(klaR)
partimat(G~x1+x2+x3,data=mydata,method="lda")

You can also produce a scatterplot matrix with color coding by group.

# Scatterplot for 3 Group Problem
pairs(mydata[c("x1","x2","x3")], main="My Title ", pch=22,
   bg=c("red", "yellow", "blue")[unclass(mydata$G)])

Test Assumptions
See (M)ANOVA Assumptions for methods of evaluating multivariate normality and homogeneity of
covariance matrices.
Principal Components and Factor Analysis
This section covers principal components and factor analysis. The latter includes both exploratory and
confirmatory methods.

Principal Components

The princomp( ) function produces an unrotated principal component analysis.

# Principal Components Analysis
# entering raw data and extracting PCs
# from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
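
The "variance accounted for" reported by summary( ) can be reproduced by hand from the component standard deviations. A small sketch, assuming the built-in USArrests data as a stand-in for mydata:

```r
# USArrests is an illustrative stand-in for mydata
fit <- princomp(USArrests, cor = TRUE)

eig  <- fit$sdev^2        # component variances (eigenvalues)
prop <- eig / sum(eig)    # proportion of variance per component
round(cumsum(prop), 3)    # cumulative proportion of variance
```

When the components are extracted from the correlation matrix, the eigenvalues sum to the number of variables, which is why a cutoff of 1 is often used as a rough retention rule.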

Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat= option to
enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option
n.obs=.

The principal( ) function in the psych package can be used to extract and rotate principal components.

# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results

mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".

Exploratory Factor Analysis
The factanal( ) function produces maximum likelihood factor analysis.

# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation
fit <- factanal(mydata, 3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

The rotation= options include "varimax", "promax", and "none". Add the option scores="regression" or
"Bartlett" to produce factor scores. Use the covmat= option to enter a correlation or covariance matrix
directly. If entering a covariance matrix, include the option n.obs=.

The psych package offers a number of factor analysis related functions, including factor.pa( ) for
principal axis factoring.

# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
fit # print results

mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
Rotation can be "varimax" or "promax".

Determining the Number of Factors to Extract


A crucial decision in exploratory factor analysis is how many factors to extract. The nFactors package
offers a suite of functions to aid in this decision. Details on this methodology can be found in a
PowerPoint presentation by Raiche, Riopel, and Blais. Of course, any factor solution must be
interpretable to be useful.

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),
   rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis.
This includes the use of both quantitative and qualitative variables, as well as the inclusion of
supplementary variables and observations. Here is an example of the types of graphs that you can
create with this package.

# PCA Variable Factor Map
library(FactoMineR)
result <- PCA(mydata) # graphs generated automatically

The GPArotation package offers a wealth of rotation options beyond varimax and promax.

Structural Equation Modeling

Confirmatory Factor Analysis (CFA) is a subset of the much wider Structural Equation Modeling (SEM)
methodology. SEM is provided in R via the sem package. Models are entered via RAM specification
(similar to PROC CALIS in SAS). While sem is a comprehensive package, my recommendation is that if
you are doing significant SEM work, you spring for a copy of AMOS. It can be much more user-friendly
and creates more attractive and publication ready output. Having said that, here is a CFA example
using sem.

Assume that we have six observed variables (X1, X2, ..., X6). We
hypothesize that there are two unobserved latent factors (F1, F2) that underlie the observed variables
as described in this diagram. X1, X2, and X3 load on F1 (with loadings lam1, lam2, and lam3). X4, X5,
and X6 load on F2 (with loadings lam4, lam5, and lam6). The double headed arrow indicates the
covariance between the two latent factors (F1F2). e1 through e6 represent the residual variances (variance
in the observed variables not accounted for by the two latent factors). We set the variances of F1 and
F2 equal to one so that the parameters will have a scale. This will result in F1F2 representing the
correlation between the two latent factors.

For sem, we need the covariance matrix of the observed variables, thus the cov( ) statement in the
code below. The CFA model is specified using the specify.model( ) function. The format is arrow
specification, parameter name, start value. Choosing a start value of NA tells the program to choose a
start value rather than supplying one yourself. Note that the variances of F1 and F2 are fixed at 1 (NA in
the second column). The blank line is required to end the RAM specification.

# Simple CFA Model
library(sem)
mydata.cov <- cov(mydata)
model.mydata <- specify.model()
F1 ->  X1, lam1, NA
F1 ->  X2, lam2, NA
F1 ->  X3, lam3, NA
F2 ->  X4, lam4, NA
F2 ->  X5, lam5, NA
F2 ->  X6, lam6, NA
X1 <-> X1, e1,   NA
X2 <-> X2, e2,   NA
X3 <-> X3, e3,   NA
X4 <-> X4, e4,   NA
X5 <-> X5, e5,   NA
X6 <-> X6, e6,   NA
F1 <-> F1, NA,    1
F2 <-> F2, NA,    1
F1 <-> F2, F1F2, NA

mydata.sem <- sem(model.mydata, mydata.cov, nrow(mydata))
# print results (fit indices, parameters, hypothesis tests)
summary(mydata.sem)
# print standardized coefficients (loadings)
std.coef(mydata.sem)

You can use the boot.sem( ) function to bootstrap the structural equation model. See help(boot.sem)
for details. Additionally, the function mod.indices( ) will produce modification indices. Using
modification indices to improve model fit by respecifying the parameters moves you from a
confirmatory to an exploratory analysis.

For more information on sem, see Structural Equation Modeling with the sem Package in R, by John
Fox.
Generalized Linear Models
Generalized linear models are fit using the glm( ) function. The form of the glm function is

glm(formula, family=familytype(link=linkfunction), data=)

Family              Default Link Function
binomial            (link = "logit")
gaussian            (link = "identity")
Gamma               (link = "inverse")
inverse.gaussian    (link = "1/mu^2")
poisson             (link = "log")
quasi               (link = "identity", variance = "constant")
quasibinomial       (link = "logit")
quasipoisson        (link = "log")

See help(glm) for other modeling options. See help(family) for other allowable link functions for each
family. Three subtypes of generalized linear models will be covered here: logistic regression, Poisson
regression, and survival analysis.


Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables. It is frequently preferred over discriminant function analysis because of its less
restrictive assumptions.

# Logistic Regression
# where F is a binary factor and
# x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3,data=mydata,family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals

You can use anova(fit1,fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x
variable.
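
As a concrete illustration of the calls above, here is a hedged sketch on the built-in mtcars data. The model am ~ wt (transmission type predicted from weight) is an assumption chosen only because it runs without extra data; it is not part of the original example.

```r
# mtcars and the am ~ wt model are illustrative stand-ins for mydata and F
fit  <- glm(am ~ wt, data = mtcars, family = binomial())
fit0 <- glm(am ~ 1,  data = mtcars, family = binomial())  # intercept-only model

exp(coef(fit))                         # exponentiated coefficients (odds ratios)
p <- predict(fit, type = "response")   # fitted probabilities, one per car
anova(fit0, fit, test = "Chisq")       # likelihood ratio test of the nested models
```

An exponentiated coefficient below 1 means the odds of the outcome fall as the predictor rises; here heavier cars have lower odds of a manual transmission.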

Poisson Regression
Poisson regression is useful when predicting an outcome variable representing counts from a set of
continuous predictor variables.

# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) # display results

If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may
want to use quasipoisson() instead of poisson().
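
The overdispersion check mentioned above amounts to one ratio. The sketch below uses the built-in warpbreaks data as an assumed example; its predictors are factors rather than continuous variables, which does not change the check.

```r
# warpbreaks is an illustrative data set: breaks is a count outcome
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson())

# residual deviance relative to its degrees of freedom;
# values well above 1 suggest overdispersion
ratio <- deviance(fit) / df.residual(fit)
ratio

# same mean model, but the dispersion is estimated instead of fixed at 1
fitq <- glm(breaks ~ wool + tension, data = warpbreaks, family = quasipoisson())
summary(fitq)$dispersion
```

Switching to quasipoisson() leaves the coefficient estimates unchanged and inflates their standard errors by the square root of the estimated dispersion.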

Survival Analysis
Survival analysis (also called event history analysis or reliability analysis) covers a set of techniques for
modeling the time to an event. Data may be right censored: the event may not have occurred by the
end of the study, or we may have incomplete information on an observation but know that up to a
certain time the event had not occurred (e.g. the participant dropped out of the study in week 10 but was
alive at that time).

While generalized linear models are typically analyzed using the glm( ) function, survival analysis is
typically carried out using functions from the survival package. The survival package can handle one
and two sample problems, parametric accelerated failure models, and the Cox proportional hazards
model.

Data are typically entered in the format start time, stop time, and status (1=event occurred, 0=event
did not occur). Alternatively, the data may be in the format time to event and status (1=event
occurred, 0=event did not occur). A status=0 indicates that the observation is right censored. Data are
bundled into a Surv object via the Surv( ) function prior to further analyses.

survfit( ) is used to estimate a survival distribution for one or more groups.
survdiff( ) tests for differences in survival distributions between two or more groups.
coxph( ) models the hazard function on a set of predictor variables.
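
To make the time-and-status format concrete, here is a minimal sketch with six made-up observations. The time and status vectors are hypothetical values invented for illustration only.

```r
library(survival)

# hypothetical follow-up data: time in weeks,
# status 1 = event occurred, 0 = right censored
time   <- c(5, 8, 10, 12, 12, 15)
status <- c(1, 0, 1, 1, 0, 0)

s <- Surv(time, status)   # bundle time and status into a Surv object
s                         # censored times print with a trailing "+"

km <- survfit(s ~ 1)      # Kaplan-Meier estimate for the whole sample
summary(km)
```

Printing the Surv object is a quick way to confirm that censoring was coded the way you intended before fitting any model.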

# Mayo Clinic Lung Cancer Data
library(survival)

# learn about the dataset
help(lung)

# create a Surv object
survobj <- with(lung, Surv(time,status))

# Plot survival distribution of the total sample
# Kaplan-Meier estimator
fit0 <- survfit(survobj~1, data=lung)
summary(fit0)
plot(fit0, xlab="Survival Time in Days",
   ylab="% Surviving", yscale=100,
   main="Survival Distribution (Overall)")

# Compare the survival distributions of men and women
fit1 <- survfit(survobj~sex,data=lung)

# plot the survival distributions by sex
plot(fit1, xlab="Survival Time in Days",
   ylab="% Surviving", yscale=100, col=c("red","blue"),
   main="Survival Distributions by Gender")
legend("topright", title="Gender", c("Male", "Female"),
   fill=c("red", "blue"))

# test for difference between male and female
# survival curves (logrank test)
survdiff(survobj~sex, data=lung)

# predict male survival from age and medical scores
MaleMod <- coxph(survobj~age+ph.ecog+ph.karno+pat.karno,
   data=lung, subset=sex==1)

# display results
MaleMod

# evaluate the proportional hazards assumption
cox.zph(MaleMod)

See Thomas Lumley's R news article on the survival package for more information. Other good sources
include Mai Zhou's Use R Software to do Survival Analysis and Simulation and M. J. Crawley's chapter
on Survival Analysis.
Matrix Algebra
Most of the methods on this website actually describe the programming of matrices. It is built deeply
into the R language. This section will simply cover operators and functions specifically suited to linear
algebra. Before proceeding you may want to review the sections on Data Types and Operators.

Matrix facilities

In the following examples, A and B are matrices and x and b are vectors.

Operator or Function    Description
A * B                   Element-wise multiplication
A %*% B                 Matrix multiplication
A %o% B                 Outer product. AB'
crossprod(A,B)          A'B and A'A respectively.
crossprod(A)
t(A)                    Transpose
diag(x)                 Creates diagonal matrix with elements of x in the principal diagonal
diag(A)                 Returns a vector containing the elements of the principal diagonal
diag(k)                 If k is a scalar, this creates a k x k identity matrix. Go figure.
solve(A, b)             Returns vector x in the equation b = Ax (i.e., A^-1 b)
solve(A)                Inverse of A where A is a square matrix.
ginv(A)                 Moore-Penrose Generalized Inverse of A.
                        ginv(A) requires loading the MASS package.
y <- eigen(A)           y$values are the eigenvalues of A
                        y$vectors are the eigenvectors of A
y <- svd(A)             Singular value decomposition of A.
                        y$d = vector containing the singular values of A
                        y$u = matrix with columns containing the left singular vectors of A
                        y$v = matrix with columns containing the right singular vectors of A
R <- chol(A)            Choleski factorization of A. Returns the upper triangular factor, such that
                        R'R = A.
y <- qr(A)              QR decomposition of A.
                        y$qr has an upper triangle that contains the decomposition and a lower
                        triangle that contains information on the Q decomposition.
                        y$rank is the rank of A.
                        y$qraux a vector which contains additional information on Q.
                        y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)          Combine matrices(vectors) horizontally. Returns a matrix.
rbind(A,B,...)          Combine matrices(vectors) vertically. Returns a matrix.
rowMeans(A)             Returns vector of row means.
rowSums(A)              Returns vector of row sums.
colMeans(A)             Returns vector of column means.
colSums(A)              Returns vector of column sums.
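
A few of the functions in the table can be checked against one another on a small matrix. In this sketch the 2 x 2 matrix A and the vector b are arbitrary illustrative values.

```r
# A is a small symmetric positive definite matrix (illustrative values)
A <- matrix(c(4, 2,
              2, 3), nrow = 2, byrow = TRUE)
b <- c(2, 1)

x <- solve(A, b)   # solves A x = b without forming the inverse
A %*% x            # reproduces b

R <- chol(A)       # upper triangular Choleski factor
crossprod(R)       # R'R, which equals A

e <- eigen(A)
e$values           # eigenvalues; their sum equals sum(diag(A))
```

Using solve(A, b) rather than solve(A) %*% b is generally preferable, since it avoids computing the full inverse.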

Matlab Emulation
The matlab package contains wrapper functions and variables used to replicate MATLAB function calls
as best possible. This can help porting MATLAB applications and code to R.

Going Further
The Matrix package contains functions that extend R to support highly dense or sparse matrices. It
provides efficient access to BLAS (Basic Linear Algebra Subroutines), Lapack (dense matrix), TAUCS
(sparse matrix) and UMFPACK (sparse matrix) routines.
Advanced Statistics Multidimensional Scaling
R provides functions for both classical and nonmetric multidimensional scaling. Assume that we have N
Generalized Linear Models objects measured on p numeric variables. We want to represent the distances among the objects in a

Oiscriminant Function parsimonious (and visual) way (i.e., a lower k-dimensional space).

Time Series

Factor Analysis Classical MDS


Correspondence Analysis You can perform a classical MOS using the cmdscale() function.

M11ltjdjmen5jona( Scaling
# classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name

d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d, eig=TRUE, k=2) # k is the number of dim
fit # view results

# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
  main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
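Since mydata is a placeholder, here is a minimal runnable sketch of the same steps using the built-in mtcars data. Standardizing the columns with scale() first is my own assumption (so that variables on large scales do not dominate the distances), not something the steps above require.

```r
# classical MDS on a built-in data set (mtcars stands in for mydata)
d   <- dist(scale(mtcars))            # euclidean distances on standardized variables
fit <- cmdscale(d, eig = TRUE, k = 2) # two-dimensional solution

dim(fit$points)    # one row per car, one column per dimension
fit$GOF            # goodness-of-fit measures for the k = 2 solution
head(fit$points)   # coordinates, labelled by row name
```

The row names of fit$points come from the row names of the original data, which is why each row needs a unique name for the labelled plot.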

Nonmetric MDS
Nonmetric MDS is performed using the isoMDS() function in the MASS package.

# Nonmetric MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name

library(MASS)
d <- dist(mydata) # euclidean distances between the rows
fit <- isoMDS(d, k=2) # k is the number of dim
fit # view results

# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
  main="Nonmetric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)

Individual Difference Scaling


Three-way or individual difference scaling can be performed using the indscal() function in the
SensoMineR package. The smacof package offers a three-way analysis of individual differences based on
stress minimization by means of majorization.
Time Series and Forecasting
R has extensive facilities for analyzing time series data. This section describes the creation of a time
series, seasonal decomposition, modeling with exponential and ARIMA models, and forecasting with the
forecast package.

Creating a time series
The ts() function will convert a numeric vector into an R time series object. The format is ts(vector,
start=, end=, frequency=) where start and end are the times of the first and last observation and
frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).

# save a numeric vector containing 48 monthly observations
# from Jan 2009 to Dec 2012 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2012, 12), frequency=12)

# subset the time series (June 2012 to December 2012)
myts2 <- window(myts, start=c(2012, 6), end=c(2012, 12))

# plot series
plot(myts)
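To try the ts() and window() calls without real data, here is a small sketch in which myvector is filled with simulated values. The numbers are made up purely for illustration.

```r
# simulate 48 monthly observations (illustrative values only)
set.seed(1234)
myvector <- rnorm(48, mean = 100, sd = 10)

# Jan 2009 to Dec 2012 as a monthly time series
myts <- ts(myvector, start = c(2009, 1), end = c(2012, 12), frequency = 12)

# June 2012 to December 2012
myts2 <- window(myts, start = c(2012, 6), end = c(2012, 12))

frequency(myts)   # 12
length(myts2)     # 7
start(myts2)      # 2012 6
```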

Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl() function. Note that a series with multiplicative effects can often be transformed into a series
with additive effects through a log transformation (i.e., newts <- log(myts)).

# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)

# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)

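The log transformation mentioned above can be sketched with the built-in AirPassengers series, which is commonly treated as having multiplicative seasonality. Using that data set here is my choice for illustration. The check at the end shows that the three stl() components add back up to the (logged) series.

```r
# decompose the logged series so that the components are additive
newts <- log(AirPassengers)
fit   <- stl(newts, s.window = "periodic")

# the components are stored as columns: seasonal, trend, remainder
colnames(fit$time.series)

# seasonal + trend + remainder reproduces the logged series
recon <- rowSums(fit$time.series)
all.equal(as.numeric(recon), as.numeric(newts))
```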
Exponential Models
Both the HoltWinters() function in the base installation, and the ets() function in the forecast package,
can be used to fit exponential models.

# simple exponential - models level
fit <- HoltWinters(myts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit <- HoltWinters(myts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit <- HoltWinters(myts)

# predictive accuracy
library(forecast)
accuracy(fit)

# predict next three future values
library(forecast)
forecast(fit, 3)
plot(forecast(fit, 3))
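If the forecast package is not installed, a triple exponential fit can still be explored with base R alone. This is a minimal sketch on the built-in AirPassengers series (my choice of data), using predict() in place of forecast().

```r
# triple exponential smoothing on the logged series (base R only)
fit <- HoltWinters(log(AirPassengers))

# estimated smoothing parameters for level, trend, and seasonal components
c(alpha = unname(fit$alpha), beta = unname(fit$beta), gamma = unname(fit$gamma))

# predict the next three values (still on the log scale)
pred <- predict(fit, n.ahead = 3)
exp(pred)   # back-transform to the original scale
```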

ARIMA Models
The arima() function can be used to fit an autoregressive integrated moving average model. Other
useful functions include:

lag(ts, k)                      lagged version of time series, shifted back k observations
diff(ts, differences=d)         difference the time series d times
ndiffs(ts)                      Number of differences required to achieve stationarity (from the
                                forecast package)
acf(ts)                         autocorrelation function
pacf(ts)                        partial autocorrelation function
adf.test(ts)                    Augmented Dickey-Fuller test. Rejecting the null hypothesis
                                suggests that a time series is stationary (from the tseries package)
Box.test(x, type="Ljung-Box")   Portmanteau test that observations in vector or time series x are
                                independent

Note that the forecast package has somewhat nicer versions of acf() and pacf() called Acf() and Pacf()
respectively.

# fit an ARIMA model of order p, d, q
fit <- arima(myts, order=c(p, d, q))

# predictive accuracy
library(forecast)
accuracy(fit)

# predict next 5 observations
library(forecast)
forecast(fit, 5)
plot(forecast(fit, 5))
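Here is a hedged, base R only sketch of the same workflow on the built-in AirPassengers series. The model orders below are an assumption chosen for illustration (a common seasonal specification for this data set), not a recommendation for your own series.

```r
y <- log(AirPassengers)

# first differences of the logged series
dy <- diff(y, differences = 1)

# seasonal ARIMA(0,1,1)(0,1,1)[12]; the orders are illustrative
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Ljung-Box test on the residuals
bt <- Box.test(residuals(fit), type = "Ljung-Box")
bt$p.value

# predict the next 5 observations (log scale) with standard errors
pred <- predict(fit, n.ahead = 5)
pred$pred
pred$se
```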

Automated Forecasting
The forecast package provides functions for the automatic selection of exponential and ARIMA models.
The ets() function supports both additive and multiplicative models. The auto.arima() function can
handle both seasonal and nonseasonal ARIMA models. Models are chosen to maximize one of several fit
criteria.

library(forecast)
# Automated forecasting using an exponential model
fit <- ets(myts)

# Automated forecasting using an ARIMA model


fit <- auto.arima(myts)

Going Further
There are many good online resources for learning time series analysis with R. These include A little
book of R for time series by Avril Coghlan, and Forecasting: principles and practice by Rob Hyndman and
George Athanasopoulos. Vito Ricci has created a time series reference card. There is also a time series
tutorial by Walter Zucchini and Oleg Nenadic that is quite useful.

See also the comprehensive book Time Series Analysis and its Applications with R Examples by Robert
Shumway and David Stoffer.
Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices,
data frames, and lists.

Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector

Refer to elements of a vector using subscripts.

a[c(2,4)] # 2nd and 4th elements of vector

Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The
general format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
  dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the columns
and rows.

# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)

# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
  dimnames=list(rnames, cnames))

Identify rows, columns or elements using subscripts.

x[,4] # 4th column of matrix
x[3,] # 3rd row of matrix
x[2:4,1:3] # rows 2,3,4 of columns 1,2,3
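The effect of byrow is easiest to see side by side. In this small sketch the object names m_col and m_row are made up for illustration.

```r
# same six values, filled by columns (the default) and by rows
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3

# subscripts work the same way on both
m_row[2, 3]      # 6
m_row[, 2:3]     # columns 2 and 3
```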
Arrays
Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.

Data Frames
A data frame is more general than a matrix, in that different columns can have different modes
(numeric, character, factor, etc.). This is similar to SAS and SPSS datasets.

d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names

There are a variety of ways to identify the elements of a data frame.

myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$X1 # variable x1 in the data frame

Lists
An ordered collection of objects (components). A list allows you to gather a variety of (possibly
unrelated) objects under one name.

# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

# example of a list containing two lists
v <- c(list1,list2)

Identify elements of a list using the [[]] convention.

mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list

Factors
Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector
of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and
an internal vector of character strings (the original values) mapped to these integers.

# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)

An ordered factor is used to represent an ordinal variable.

# variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates
# 1=large, 2=medium, 3=small internally
# R now treats rating as ordinal

R will treat factors as nominal variables and ordered factors as ordinal variables in statistical
procedures and graphical analyses. You can use options in the factor( ) and ordered( ) functions to
control the mapping of integers to strings (overriding the alphabetical ordering). You can also use
factors to create value labels. For more on factors see the UCLA page.
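To see the integer codes behind a factor, and how the levels argument overrides the alphabetical ordering, here is a short sketch. The rating values are made up for illustration.

```r
# nominal factor: levels are assigned alphabetically
gender <- factor(c(rep("male", 20), rep("female", 30)))
levels(gender)               # "female" "male"
table(as.integer(gender))    # code 1 (female) 30 times, code 2 (male) 20 times

# ordered factor with an explicit (non-alphabetical) level order
rating <- ordered(c("small", "large", "medium"),
                  levels = c("small", "medium", "large"))
as.integer(rating)           # 1 3 2
rating[1] < rating[2]        # TRUE: small < large
```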

Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names

c(object,object, ... ) # combine objects into a vector


cbind(object, object, ... ) # combine objects as columns
rbind(object, object, ... ) # combine objects as rows

object # prints the object

ls() # list current objects


rm(object) # delete an object

newobject <- edit(object) # edit copy and save as newobject


fix(object) # edit in place
Date Values
Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.

# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]

Sys.Date( ) returns today's date.
date() returns the current date and time.


The following symbols can be used with the format( ) function to print dates.

Symbol   Meaning                    Example
%d       day as a number (01-31)    01-31
%a       abbreviated weekday        Mon
%A       unabbreviated weekday      Monday
%m       month (01-12)              01-12
%b       abbreviated month          Jan
%B       unabbreviated month        January
%y       2-digit year               07
%Y       4-digit year               2007

Here is an example.

# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"

Date Conversion
Character to Date
You can use the as.Date( ) function to convert character data to dates. The format is as.Date(x,
"format"), where x is the character data and format gives the appropriate format.

# convert date info in format 'mm/dd/yyyy'
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")

The default format is yyyy-mm-dd

mydates <- as.Date(c("2007-06-22", "2004-02-13"))

Date to Character
You can convert dates to character data using the as.character( ) function.

# convert dates to character data
strDates <- as.character(dates)
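The following sketch pulls the date pieces together: the internal day count, date arithmetic, and conversion in both directions. Only numeric format codes are used so the output does not depend on your locale.

```r
# dates are stored as days since 1970-01-01
as.numeric(as.Date("1970-01-10"))    # 9

# difference between two dates, in days
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
days <- mydates[1] - mydates[2]
as.numeric(days)                     # 1225

# character -> Date with an explicit format, then back to character
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
as.numeric(dates[1]) < 0             # TRUE: before 1970
format(dates, "%Y-%m-%d")            # "1965-01-05" "1975-08-16"
as.character(dates)                  # same strings, default format
```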

Learning More
See help(as.Date) and help(strftime) for details on converting character data to dates. See
help(ISOdatetime) for more information about formatting date/times.
Access to Database Management Systems (DBMS)

ODBC Interface
The RODBC package provides access to databases (including Microsoft Access and Microsoft SQL Server)
through an ODBC interface.

The primary functions are given below.

Function                                 Description
odbcConnect(dsn, uid="", pwd="")         Open a connection to an ODBC database
sqlFetch(channel, sqtable)               Read a table from an ODBC database into a data frame
sqlQuery(channel, query)                 Submit a query to an ODBC database and return the results
sqlSave(channel, mydf, tablename =       Write or update (append=TRUE) a data frame to a
  sqtable, append = FALSE)               table in the ODBC database
sqlDrop(channel, sqtable)                Remove a table from the ODBC database
close(channel)                           Close the connection

# RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)

library(RODBC)
myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime") # table name is passed as a string
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)

Other Interfaces
The RMySQL package provides an interface to MySQL.

The ROracle package provides an interface for Oracle.

The RJDBC package provides access to databases through a JDBC interface.

Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Designed by WebTemplateOcean.com
Exporting Data
There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you
will need to load the foreign package. For Excel, you will need the xlsReadWrite package.

To A Tab Delimited Text File
write.table(mydata, "c:/mydata.txt", sep="\t")
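To check that an export reads back cleanly, you can round-trip a small data frame through a temporary file. This is a sketch with made-up data. The row.names=FALSE argument is my addition so that the file contains only the data columns.

```r
# small illustrative data frame
mydata <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# write to a temporary tab delimited file
path <- tempfile(fileext = ".txt")
write.table(mydata, path, sep = "\t", row.names = FALSE)

# read it back and compare with the original
back <- read.table(path, header = TRUE, sep = "\t")
all.equal(mydata, back)

unlink(path)   # clean up the temporary file
```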

To an Excel Spreadsheet
library(xlsReadWrite)
write.xls(mydata, "c:/mydata.xls")

To SPSS
# write out text datafile and
# an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")

To SAS
# write out text datafile and
# an SAS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")

To Stata
# export data frame to Stata binary format
library(foreign)
write.dta(mydata, "c:/mydata.dta")
Importing Data
Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS I
would recommend the Hmisc package for ease and functionality. See the Quick-R section on packages
for information on obtaining and installing these packages. Examples of importing data are provided
below.

From A Comma Delimited Text File
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on mswindows systems

mydata <- read.table("c:/mydata.csv", header=TRUE,
  sep=",", row.names="id")
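Here is a self-contained sketch of the same read.table() call. The file contents are made up and written to a temporary file, so no path on your machine is assumed.

```r
# create a small comma delimited file (illustrative contents)
path <- tempfile(fileext = ".csv")
writeLines(c("id,age,weight",
             "a1,25,160",
             "a2,30,110"), path)

# first row holds variable names; the id column becomes the row names
mydata <- read.table(path, header = TRUE, sep = ",", row.names = "id")

rownames(mydata)   # "a1" "a2"
names(mydata)      # "age" "weight"
mydata["a2", "weight"]   # 110

unlink(path)
```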

From Excel
The best way to read an Excel file is to export it to a comma delimited file and import it using the
method above. On Windows systems you can use the RODBC package to access Excel files. The first row
should contain variable/column names.

# first row contains variable names
# we will read in worksheet mysheet

library(RODBC)
channel <- odbcConnectExcel("c:/myexel.xls")
mydata <- sqlFetch(channel, "mysheet")
odbcClose(channel)


From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.

# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors

From SAS
# save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;

# in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
# character variables are converted to R factors

From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")

From Systat
# input Systat file
library(foreign)
mydata <- read.systat("c:/mydata.dta")
Keyboard Input
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an
ASCII file. To create it interactively, you can do something like the following.

# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)

You can also use R's built in spreadsheet to enter the data interactively, as in the following example.

# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
  weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!

Missing Data
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing
zero by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for
character and numeric data.

Testing for Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

Recoding Values to Missing
# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA

Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.

x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

The function complete.cases() returns a logical vector indicating which cases are complete.

# list rows of data that have missing values
mydata[!complete.cases(mydata),]

The function na.omit() returns the object with listwise deletion of missing values.

# create new dataset without missing data
newdata <- na.omit(mydata)
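The recoding and deletion steps can be tried together on a small made-up data frame. The variable names v1 and v2 and their values are illustrative only.

```r
# illustrative data: 99 is a missing-value code in v1, v2 already has an NA
mydata <- data.frame(v1 = c(1, 99, 3, 4),
                     v2 = c("a", NA, "c", "d"))

# recode 99 to NA
mydata$v1[mydata$v1 == 99] <- NA

# which rows are complete?
complete.cases(mydata)             # TRUE FALSE TRUE TRUE

# listwise deletion
newdata <- na.omit(mydata)
nrow(newdata)                      # 3

# missing values propagate unless removed
mean(mydata$v1)                    # NA
mean(mydata$v1, na.rm = TRUE)      # mean of 1, 3, 4

# NaN versus Inf
is.nan(0/0)                        # TRUE
1/0                                # Inf
```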

Advanced Handling of Missing Data
Most modeling functions in R offer options for dealing with missing values. You can go beyond pairwise
or listwise deletion of missing values through methods such as multiple imputation. Good
implementations that can be accessed through R include Amelia II, Mice, and mitools.
Value Labels
To understand value labels in R, you need to understand the data structure factor.

You can use the factor function to create your own value labels.

# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
  levels = c(1,2,3),
  labels = c("red", "blue", "green"))

# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
  levels = c(1,3,5),
  labels = c("Low", "Medium", "High"))
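A quick way to confirm that the labels were attached as intended is to apply them to a few known codes. The vectors below are made up for illustration.

```r
# codes 1, 2, 3 mapped to colour labels
v1 <- c(1, 3, 2, 1)
f  <- factor(v1, levels = c(1, 2, 3), labels = c("red", "blue", "green"))
as.character(f)        # "red" "green" "blue" "red"

# codes 1, 3, 5 mapped to ordered labels
y <- ordered(c(5, 1, 3), levels = c(1, 3, 5),
             labels = c("Low", "Medium", "High"))
as.character(y)        # "High" "Low" "Medium"
y[1] > y[2]            # TRUE: High > Low
```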

Use the factor() function for nominal data and the ordered() function for ordinal data. R statistical
and graphic functions will then treat the data appropriately.

Note: factor and ordered are used the same way, with the same arguments. The former creates factors
and the latter creates ordered factors.
Variable Labels
R's ability to handle variable labels is somewhat unsatisfying.

If you use the Hmisc package, you can take advantage of some labeling features.

library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)

Unfortunately the label is only in effect for functions provided by the Hmisc package, such as
describe(). Your other option is to use the variable label as the variable name and then refer to the
variable by position index.

names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
Aggregating Data
It is relatively easy to collapse data in R using one or more BY variables and a defined function.

# aggregate data frame mtcars by cyl and vs, returning means
# for numeric variables
attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl,vs),
  FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)

When using the aggregate() function, the by variables must be in a list (even if there is only one). The
function can be built-in or user provided.

See also:

• summarize() in the Hmisc package
• summaryBy() in the doBy package
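As a variation on the example above, this sketch avoids attach() and names the BY variables inside the list, which gives readable column names in the result. Restricting the aggregation to mpg and hp is my choice to keep the output small.

```r
# mean mpg and hp for each observed combination of cyl and vs
aggdata <- aggregate(mtcars[c("mpg", "hp")],
                     by = list(cyl = mtcars$cyl, vs = mtcars$vs),
                     FUN = mean, na.rm = TRUE)

names(aggdata)   # "cyl" "vs" "mpg" "hp"
nrow(aggdata)    # 5: only combinations that occur in the data appear
aggdata
```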

Control Structures
R has the standard control structures you would expect. expr can be multiple (compound) statements
by enclosing them in braces { }. It is more efficient to use built-in functions rather than control
structures whenever possible.

if-else
if (cond) expr
if (cond) expr1 else expr2

for
for (var in seq) expr

while
while (cond) expr

switch
switch(expr, ...)

ifelse
ifelse(test,yes,no)
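Each of the forms above can be tried on small values. All variable names and values in this sketch are made up for illustration.

```r
# if-else
x <- 7
if (x %% 2 == 0) parity <- "even" else parity <- "odd"
parity                       # "odd"

# for: sum the integers 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i
}
total                        # 15

# while: smallest n with 2^n >= 100
n <- 0
while (2^n < 100) {
  n <- n + 1
}
n                            # 7

# switch: pick a value by name, with a default as the last argument
label <- switch("b", a = "first", b = "second", "other")
label                        # "second"

# ifelse: vectorized test
signs <- ifelse(c(-2, 0, 3) > 0, "pos", "nonpos")
signs                        # "nonpos" "nonpos" "pos"
```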

Example
# transpose of a matrix
# a poor alternative to built-in t() function

mytrans <- function(x) {
  if (!is.matrix(x)) {
    warning("argument is not a matrix: returning NA")
    return(NA_real_)
  }
  y <- matrix(1, nrow=ncol(x), ncol=nrow(x))
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      y[j,i] <- x[i,j]
    }
  }
  return(y)
}

# try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
Built-in Functions
Almost everything in R is done through functions. Here I'm only referring to numeric and character
functions that are commonly used in creating or recoding variables.

Numeric Functions

Function                  Description
abs(x)                    absolute value
sqrt(x)                   square root
ceiling(x)                ceiling(3.475) is 4
floor(x)                  floor(3.475) is 3
trunc(x)                  trunc(5.99) is 5
round(x, digits=n)        round(3.475, digits=2) is 3.48
signif(x, digits=n)       signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)    also acos(x), cosh(x), acosh(x), etc.
log(x)                    natural logarithm
log10(x)                  common logarithm
exp(x)                    e^x

Character Functions

Function                      Description
substr(x, start=n1,           Extract or replace substrings in a character vector.
  stop=n2)                    x <- "abcdef"
                              substr(x, 2, 4) is "bcd"
                              substr(x, 2, 4) <- "22222" is "a222ef"
grep(pattern, x,              Search for pattern in x. If fixed=FALSE then pattern is a regular
  ignore.case=FALSE,          expression. If fixed=TRUE then pattern is a text string. Returns
  fixed=FALSE)                matching indices.
                              grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement,     Find pattern in x and replace with replacement text. If
  x, ignore.case=FALSE,       fixed=FALSE then pattern is a regular expression.
  fixed=FALSE)                If fixed=TRUE then pattern is a text string.
                              sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)            Split the elements of character vector x at split.
                              strsplit("abc", "") returns 3 element vector "a","b","c"
paste(..., sep="")            Concatenate strings after using sep string to separate them.
                              paste("x",1:3,sep="") returns c("x1","x2","x3")
                              paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
                              paste("Today is", date())
toupper(x)                    Uppercase
tolower(x)                    Lowercase
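The character functions in the table can be run as written. One detail worth noting in this sketch: strsplit() returns a list with one element per input string, so [[1]] is used to pull out the vector of pieces.

```r
x <- "abcdef"
substr(x, 2, 4)                              # "bcd"
substr(x, 2, 4) <- "22222"                   # only 3 characters are replaced
x                                            # "a222ef"

grep("A", c("b", "A", "c"), fixed = TRUE)    # 2
sub("\\s", ".", "Hello There")               # "Hello.There"

pieces <- strsplit("abc", "")[[1]]
pieces                                       # "a" "b" "c"

paste("x", 1:3, sep = "")                    # "x1" "x2" "x3"
paste("x", 1:3, sep = "M")                   # "xM1" "xM2" "xM3"

toupper("abc")                               # "ABC"
tolower("ABC")                               # "abc"
```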

Statistical Probability Functions
The following table describes functions related to probability distributions. For random number
generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-
random numbers.

Function                   Description
dnorm(x)                   normal density function (by default m=0 sd=1)
                           # plot standard normal curve
                           x <- pretty(c(-3,3), 30)
                           y <- dnorm(x)
                           plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)                   cumulative normal probability for q
                           (area under the normal curve to the left of q)
                           pnorm(1.96) is 0.975
qnorm(p)                   normal quantile.
                           value at the p percentile of normal distribution
                           qnorm(.9) is 1.28 # 90th percentile
rnorm(n, m=0, sd=1)        n random normal deviates with mean m
                           and standard deviation sd.
                           # 50 random normal variates with mean=50, sd=10
                           x <- rnorm(50, m=50, sd=10)
dbinom(x, size, prob)      binomial distribution where size is the sample size
pbinom(q, size, prob)      and prob is the probability of a heads (pi)
qbinom(p, size, prob)      # prob of 0 to 5 heads of fair coin out of 10 flips
rbinom(n, size, prob)      dbinom(0:5, 10, .5)
                           # prob of 5 or less heads of fair coin out of 10 flips
                           pbinom(5, 10, .5)
dpois(x, lambda)           poisson distribution with mean=variance=lambda
ppois(q, lambda)           # probability of 0, 1, or 2 events with lambda=4
qpois(p, lambda)           dpois(0:2, 4)
rpois(n, lambda)           # probability of at least 3 events with lambda=4
                           1 - ppois(2, 4)
dunif(x, min=0, max=1)     uniform distribution, follows the same pattern
punif(q, min=0, max=1)     as the normal distribution above.
qunif(p, min=0, max=1)     # 10 uniform random variates
runif(n, min=0, max=1)     x <- runif(10)
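The d, p, q, and r prefixes behave the same way across distributions. This sketch checks a few of the relationships described in the table, with the seed value taken from the text above.

```r
# p and q functions are inverses of each other
pnorm(1.96)                      # about 0.975 (area to the left of 1.96)
qnorm(0.975)                     # about 1.96

# cumulative probability is the sum of the point probabilities
sum(dbinom(0:5, 10, .5))         # same value as pbinom(5, 10, .5)
pbinom(5, 10, .5)

# complement rule: at least 3 events with lambda = 4
1 - ppois(2, 4)                  # same as 1 - sum(dpois(0:2, 4))

# reproducible random draws
set.seed(1234)
x1 <- rnorm(50, mean = 50, sd = 10)
set.seed(1234)
x2 <- rnorm(50, mean = 50, sd = 10)
identical(x1, x2)                # TRUE
```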

Other Statistical Functions
Other useful statistical functions are provided in the following table. Each has the option na.rm to strip
missing values before calculations. Otherwise the presence of missing values will lead to a missing
result. Object can be a numeric vector or data frame.

Function                   Description
mean(x, trim=0,            mean of object x
  na.rm=FALSE)             # trimmed mean, removing any missing values and
                           # 5 percent of highest and lowest scores
                           mx <- mean(x, trim=.05, na.rm=TRUE)
sd(x)                      standard deviation of object(x). also look at var(x) for variance and
                           mad(x) for median absolute deviation.
median(x)                  median
quantile(x, probs)         quantiles where x is the numeric vector whose quantiles are desired and
                           probs is a numeric vector with probabilities in [0,1].
                           # 30th and 84th percentiles of x
                           y <- quantile(x, c(.3,.84))
range(x)                   range
sum(x)                     sum
diff(x, lag=1)             lagged differences, with lag indicating which lag to use
min(x)                     minimum
max(x)                     maximum
scale(x, center=TRUE,      column center or standardize a matrix.
  scale=TRUE)
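Here is a small sketch of how na.rm changes the result, plus a check of what scale() does to each column. The data values are made up for illustration.

```r
x <- c(2, 4, 4, 5, 7, 9, NA)

mean(x)                          # NA, because of the missing value
mean(x, na.rm = TRUE)            # 31/6
median(x, na.rm = TRUE)          # 4.5
range(x, na.rm = TRUE)           # 2 9
quantile(x, c(.3, .84), na.rm = TRUE)

# lagged differences of the non-missing values
diff(x[!is.na(x)], lag = 1)      # 2 0 1 2 2

# standardize the columns of a matrix: each column gets mean 0 and sd 1
s <- scale(matrix(1:6, ncol = 2))
colMeans(s)
apply(s, 2, sd)
```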

Other Useful Functions


Function Description
seq(from, to, by)     generate a sequence
                      indices <- seq(1,10,2)
                      # indices is c(1, 3, 5, 7, 9)
rep(x, ntimes)        repeat x n times
                      y <- rep(1:3, 2)
                      # y is c(1, 2, 3, 1, 2, 3)
cut(x, n)             divide continuous variable into a factor with n levels
                      y <- cut(x, 5)

Note that while the examples on this page apply functions to individual variables, many can be applied
to vectors and matrices as well.
Merging Data

Adding Columns
To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two
data frames by one or more common key variables (i.e., an inner join).

# merge two data frames by ID
total <- merge(dataframeA,dataframeB,by="ID")

# merge two data frames by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country"))
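To make the inner join concrete, here is a sketch with two small made-up data frames that share only some ID values. The all=TRUE variant at the end is an addition of mine to contrast with the default behaviour: it keeps unmatched rows and fills the gaps with NA.

```r
# illustrative data frames; IDs 2 and 3 appear in both
dataframeA <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
dataframeB <- data.frame(ID = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

# default: inner join, only matching IDs are kept
total <- merge(dataframeA, dataframeB, by = "ID")
total$ID            # 2 3

# keep all rows from both data frames (unmatched cells become NA)
outer <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)
outer$ID            # 1 2 3 4
```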

Adding Rows
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have
the same variables, but they do not have to be in the same order.

total <- rbind(dataframeA, dataframeB)

If dataframeA has variables that dataframeB does not, then either:
1. Delete the extra variables in dataframeA or
2. Create the additional variables in dataframeB and set them to NA (missing)
before joining them with rbind( ).
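The second option can be sketched as follows. The data frames and the variable group are made up for illustration. Note that rbind() matches columns by name, so the different column order in dataframeB is not a problem.

```r
# dataframeA has an extra variable (group); dataframeB lists columns in another order
dataframeA <- data.frame(id = 1:2, score = c(10, 20), group = c("a", "b"))
dataframeB <- data.frame(score = 30, id = 3L)

# create the missing variable in dataframeB and set it to NA
dataframeB$group <- NA

# columns are matched by name; the result uses dataframeA's column order
total <- rbind(dataframeA, dataframeB)
total
```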

Operators
R's binary and logical operators will look very familiar to programmers. Note that binary operators work
on vectors and matrices as well as scalars.

Arithmetic Operators

Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y) 5%%2 is 1
x %/% y     integer division 5%/%2 is 2

Logical Operators

Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           Not x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE

# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10

# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
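
The following short sketch (variable names are illustrative) shows the arithmetic operators applied element by element, and the logical index stored in a variable so it can be reused.

```r
x <- c(1:10)

# Arithmetic operators work element by element on vectors
remainder <- x %% 3     # remainder after dividing each element by 3
quotient  <- x %/% 3    # integer part of each division
squares   <- x ^ 2

# The logical vector can be stored and reused as an index
keep <- (x > 8) | (x < 5)
selected <- x[keep]

# sum() counts the TRUE values, since TRUE is treated as 1
n_selected <- sum(keep)
```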
Reshaping Data
R provides a variety of methods for reshaping data prior to analysis.

Transpose
Use the t() function to transpose a matrix or a data frame. In the latter case, rownames become
variable (column) names.

# example using built-in dataset
mtcars
t(mtcars)

The Reshape Package

Hadley Wickham has created a comprehensive package called reshape to massage data. Both an
introduction and article are available. There is even a video!

Basically, you "melt" data so that each row is a unique id-variable combination. Then you "cast" the
melted data into any shape you would like. Here is a very simple example.

mydata

id   time   x1   x2
1    1      5    6
1    2      3    5
2    1      6    1
2    2      2    4

# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))

mdata

id   time   variable   value
1    1      x1         5
1    2      x1         3
2    1      x1         6
2    2      x1         2
1    1      x2         6
1    2      x2         5
2    1      x2         1
2    2      x2         4

# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)

subjmeans

id   x1   x2
1    4    5.5
2    4    2.5

timemeans

time   x1    x2
1      5.5   3.5
2      2.5   4.5

There is much more that you can do with the melt( ) and cast( ) functions. See the documentation for
more details.
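
As a cross-check that needs no extra package, the same melt-then-cast idea can be sketched in base R with stack( ) and tapply( ). This is a swapped-in technique for illustration, not the reshape package's own implementation. The data frame reproduces the four-row example above.

```r
mydata <- data.frame(id = c(1, 1, 2, 2), time = c(1, 2, 1, 2),
                     x1 = c(5, 3, 6, 2), x2 = c(6, 5, 1, 4))

# "melt": stack x1 and x2 into one value column, repeating the id variables
stacked <- stack(mydata[c("x1", "x2")])        # columns: values, ind
mdata <- data.frame(id = rep(mydata$id, 2), time = rep(mydata$time, 2),
                    variable = stacked$ind, value = stacked$values)

# "cast": mean of value in each id-by-variable (or time-by-variable) cell
subjmeans <- tapply(mdata$value, list(id = mdata$id, variable = mdata$variable), mean)
timemeans <- tapply(mdata$value, list(time = mdata$time, variable = mdata$variable), mean)
```

Here subjmeans and timemeans are matrices rather than data frames, with the id (or time) values as row names.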
Sorting Data
To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING. Prepend the
sorting variable with a minus sign to indicate DESCENDING order. Here are some examples.

# sorting examples using the mtcars dataset
attach(mtcars)

# sort by mpg
newdata <- mtcars[order(mpg),]

# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]

# sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),]

detach(mtcars)
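
The same sorts can be written without attach( ). The sketch below also shows one way to sort a character column in descending order, where a minus sign does not apply. The cars data frame is made up for the illustration.

```r
# Same sort as above, using explicit column references or with()
by_mpg_cyl <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]
same       <- mtcars[with(mtcars, order(mpg, -cyl)), ]

# A minus sign only works on numeric columns. For character data,
# order() accepts one decreasing flag per key when method = "radix"
cars <- data.frame(make = c("b", "a", "c", "a"), price = c(3, 1, 2, 4))
sorted <- cars[order(cars$make, cars$price,
                     decreasing = c(TRUE, FALSE), method = "radix"), ]
```

In sorted, make runs from "c" down to "a", and ties on make are broken by ascending price.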

Subsetting Data
R has powerful indexing features for accessing object elements. These features can be used to select
and exclude variables and observations. The following code snippets demonstrate ways to keep or
delete variables and observations and to take random samples from a dataset.

Selecting (Keeping) Variables

# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]

# another method
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]

# select 1st and 5th thru 10th variables
newdata <- mydata[c(1,5:10)]

Excluding (DROPPING) Variables
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL

Selecting Observations
# first 5 observations
newdata <- mydata[1:5,]

# based on variable values
newdata <- mydata[ which(mydata$gender=='F'
  & mydata$age > 65), ]

# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)

Selection using the Subset Function

The subset( ) function is the easiest way to select variables and observations. In the following
example, we select all rows that have a value of age greater than or equal to 20 or age less than 10.
We keep the ID and Weight columns.

# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
  select=c(ID, Weight))

In the next example, we select all men over the age of 25 and we keep variables weight through
income (weight, income and all columns between them).

# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
  select=weight:income)
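
To compare the two styles on real data, this sketch selects the same rows and columns of the built-in mtcars data set with bracket indexing and with subset( ). It also illustrates why which( ) appears in the bracket versions: it drops positions where the condition is NA. The thresholds and the age vector are arbitrary choices for the illustration.

```r
# Bracket indexing with which(), and the equivalent subset() call
a <- mtcars[which(mtcars$cyl == 4 & mtcars$mpg > 30), c("mpg", "cyl", "wt")]
b <- subset(mtcars, cyl == 4 & mpg > 30, select = c(mpg, cyl, wt))

# With a missing value, plain logical indexing returns an NA element,
# while which() keeps only the positions that are TRUE
age <- c(70, NA, 20)
with_na    <- age[age > 65]          # 70 NA
without_na <- age[which(age > 65)]   # 70
```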
Random Samples
Use the sample( ) function to take a random sample of size n from a dataset.

# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
  replace=FALSE),]
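
A reproducible variant on the built-in mtcars data set might look like the sketch below. The seed value and the sample size of 10 are arbitrary assumptions. Negative indexing then gives the rows that were not drawn.

```r
set.seed(42)                       # arbitrary seed, chosen only for this sketch
rows <- sample(nrow(mtcars), 10)   # 10 distinct row numbers (replace=FALSE is the default)
mysample <- mtcars[rows, ]

# the rows that were not drawn form a holdout set
holdout <- mtcars[-rows, ]
```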

Going Further
R has extensive facilities for sampling, including drawing and calibrating survey samples (see the
sampling package), analyzing complex survey data (see the survey package and its homepage) and
bootstrapping.
Data Type Conversion
Type conversions in R work as you would expect. For example, adding a character string to a numeric
vector converts all the elements in the vector to character.

Use is.foo to test for data type foo. Returns TRUE or FALSE
Use as.foo to explicitly convert it.

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()

Examples

                    to one long vector     to matrix             to data frame
from vector         c(x,y)                 cbind(x,y)            data.frame(x,y)
                                           rbind(x,y)
from matrix         as.vector(mymatrix)                          as.data.frame(mymatrix)
from data frame                            as.matrix(myframe)
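
A few of these conversions in action, as a minimal sketch with made-up vectors:

```r
x <- c(1, 2, 3)

# Adding a character string coerces every element to character
y <- c(x, "a")
is.character(y)

# Explicit conversion back: "a" has no numeric value, so it becomes NA
nums <- suppressWarnings(as.numeric(y))

# Vector to matrix to data frame, and back to one long vector
m  <- cbind(x, z = x * 2)   # 3 x 2 matrix
df <- as.data.frame(m)
v  <- as.vector(m)          # values are read column by column
```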

Dates
You can convert dates to and from character or numeric data. See date values for more information.

User-written Functions
One of the great strengths of R is the user's ability to add functions. In fact, many of the functions in R
are actually functions of functions. The structure of a function is given below.

myfunction <- function(arg1, arg2, ... ){
statements
return(object)
}

Objects in the function are local to the function. The object returned can be any data type. Here is an
example.

# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.

mysummary <- function(x,npar=TRUE,print=TRUE) {
  if (!npar) {
    center <- mean(x); spread <- sd(x)
  } else {
    center <- median(x); spread <- mad(x)
  }
  if (print & !npar) {
    cat("Mean=", center, "\n", "SD=", spread, "\n")
  } else if (print & npar) {
    cat("Median=", center, "\n", "MAD=", spread, "\n")
  }
  result <- list(center=center,spread=spread)
  return(result)
}

# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x)
Median= 4
MAD= 1.4826
# y$center is the median (4)
# y$spread is the median absolute deviation (1.4826)

y <- mysummary(x, npar=FALSE, print=FALSE)
# no output
# y$center is the mean (4.052)
# y$spread is the standard deviation (2.01927)

It can be instructive to look at the code of a function. In R, you can view a function's code by typing
the function name without the ( ). If this method fails, look at the following R Wiki link for hints on
viewing function source code.
Finally, you may want to store your own functions, and have them available in every session. You can
customize the R environment to load your functions at start-up.
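
As a small sketch of inspecting function code, the lines below use the base functions formals( ), body( ), and getS3method( ). The choice of sd and summary as examples is arbitrary.

```r
# Typing a name without parentheses prints the function's source
sd

# The pieces can also be inspected programmatically
sd_args <- formals(sd)   # the argument list
sd_body <- body(sd)      # the expression the function evaluates

# For an S3 generic such as summary, the name alone shows only the dispatcher;
# getS3method() retrieves the method for a specific class
summary_df <- getS3method("summary", "data.frame")
```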
Creating new variables
Use the assignment operator <- to create new variables. A wide array of operators and functions are
available here.

# Three examples for doing the same computations

mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2

attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)

mydata <- transform( mydata,
  sum = x1 + x2,
  mean = (x1 + x2)/2
)

Recoding variables
In order to recode data, you will probably use one or more of R's control structures.

# create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70,
  c("older"), c("younger"))

# another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
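
The three-category recode can also be sketched with the base function cut( ), an alternative to the indexed assignments above. The ages are made-up values chosen to sit on the category boundaries. By default cut( ) uses intervals that are closed on the right, which matches the <= comparisons in the original code.

```r
age <- c(30, 45, 46, 75, 76)   # hypothetical ages, chosen to hit the boundaries

# breaks define (-Inf,45], (45,75], (75,Inf]; the result is a factor
agecat <- cut(age, breaks = c(-Inf, 45, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))
```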

Renaming variables

You can rename variables programmatically or interactively.

# rename interactively
fix(mydata) # results are saved on close

# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))

# you can re-enter all the variable names in order
# changing the ones you need to change. The limitation
# is that you need to enter all of them!
names(mydata) <- c("x1", "age", "y", "ses")
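
A single variable can also be renamed in base R by indexing names( ) with a logical test, which avoids retyping every name. The data frame here is hypothetical.

```r
mydata <- data.frame(x1 = 1:3, oldname = c(2.5, 3.1, 4.0))  # hypothetical data

# replace only the name that matches "oldname"
names(mydata)[names(mydata) == "oldname"] <- "newname"
```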
Axes and Text
Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options
(as well as other graphical parameters). For example

# Specify axis options within plot()
plot(x, y, main="title", sub="subtitle",
  xlab="X-axis label", ylab="y-axis label",
  xlim=c(xmin, xmax), ylim=c(ymin, ymax))

For finer control or for modularization, you can use the functions described below.

Titles

Use the title( ) function to add labels to a plot.

title(main="main title", sub="sub-title",
  xlab="x-axis label", ylab="y-axis label")

Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in
the title( ) function.

# Add a red title and a blue subtitle. Make x and y
# labels 25% smaller than the default and green.
title(main="My Title", col.main="red",
  sub="My sub-title", col.sub="blue",
  xlab="My X label", ylab="My Y label",
  col.lab="green", cex.lab=0.75)

Text Annotations
Text can be added to graphs using the text( ) and mtext( ) functions. text( ) places text within the
graph while mtext( ) places text in one of the four margins.

text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)

Common options are described below.

option     description
location   location can be an x,y coordinate. Alternatively, the text can be placed
           interactively via mouse by specifying location as locator(1).
pos        position relative to location. 1=below, 2=left, 3=above, 4=right. If you
           specify pos, you can specify offset= in percent of character width.
side       which margin to place text. 1=bottom, 2=left, 3=top, 4=right. You can
           specify line= to indicate the line in the margin starting with 0 and moving
           out. You can also specify adj=0 for left/bottom alignment or adj=1 for
           top/right alignment.

Other common options are cex, col, and font (for size, color, and font style respectively).
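
A minimal sketch of both functions on the built-in mtcars data follows. The label text and coordinates are arbitrary choices, and the pdf(NULL) line is only there so the sketch also runs without a screen device.

```r
pdf(NULL)   # null device so the sketch runs non-interactively; omit to draw on screen

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Mileage")

# plot-region limits (x1, x2, y1, y2), handy for choosing text coordinates
rng <- par("usr")

# text(): label inside the plot region, to the right (pos=4) of the point (3.5, 30)
text(3.5, 30, "annotation", pos = 4, cex = 0.8, col = "red")

# mtext(): text in the top margin (side=3), on margin line 0.5, left-aligned (adj=0)
mtext("Source: mtcars", side = 3, line = 0.5, adj = 0, cex = 0.8)

dev.off()
```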

Labeling points
You can use the text( ) function (see above) for labeling points as well as for adding other text
annotations. Specify location as a set of x, y coordinates and specify the text to place as a vector of
labels. The x, y, and label vectors should all be the same length.

# Example of labeling points
attach(mtcars)
plot(wt, mpg, main="Milage vs. Car Weight",
  xlab="Weight", ylab="Mileage", pch=18, col="blue")
text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red")

Math Annotations
You can add mathematical formulas to a graph using TeX-like rules. See help(plotmath) for details
and examples.

Axes
You can create custom axes using the axis() function.

axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)

where

option   description
side     an integer indicating the side of the graph to draw the axis (1=bottom, 2=left,
         3=top, 4=right)
at       a numeric vector indicating where tick marks should be drawn
labels   a character vector of labels to be placed at the tickmarks
         (if NULL, the at values will be used)
pos      the coordinate at which the axis line is to be drawn.
         (i.e., the value on the other axis where it crosses)
lty      line type
col      the line and tick mark color
las      labels are parallel (=0) or perpendicular (=2) to axis
tck      length of tick mark as fraction of plotting region (negative number is outside
         graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines) default
         is -0.01
(...)    other graphical parameters

If you are going to create a custom axis, you should suppress the axis automatically generated by your
high level plotting function. The option axes=FALSE suppresses both x and y axes. xaxt="n" and
yaxt="n" suppress the x and y axis respectively. Here is a (somewhat overblown) example.

# A Silly Axis Example

# specify the data
x <- c(1:10); y <- x; z <- 10/x

# create extra margin room on the right for an axis
par(mar=c(5, 4, 4, 8) + 0.1)

# plot x vs. y
plot(x, y, type="b", pch=21, col="red",
  yaxt="n", lty=3, xlab="", ylab="")

# add x vs. 1/x
lines(x, z, type="b", pch=22, col="blue", lty=2)

# draw an axis on the left
axis(2, at=x, labels=x, col.axis="red", las=2)

# draw an axis on the right, with smaller text and ticks
axis(4, at=z, labels=round(z,digits=2),
  col.axis="blue", las=2, cex.axis=0.7, tck=-.01)

# add a title for the right axis
mtext("y=1/x", side=4, line=3, cex.lab=1, las=2, col="blue")

# add a main title and bottom and left axis labels
title("An Example of Creative Axes", xlab="X values",
  ylab="Y=X")

Minor Tick Marks

The minor.tick() function in the Hmisc package adds minor tick marks.

# Add minor tick marks
library(Hmisc)
minor.tick(nx=n, ny=n, tick.ratio=n)

nx is the number of minor tick marks to place between x-axis major tick marks.
ny does the same for the y-axis. tick.ratio is the size of the minor tick mark relative to the major tick
mark. The length of the major tick mark is retrieved from par("tck").

Reference Lines
Add reference lines to a graph using the abline( ) function.

abline(h=yvalues, v=xvalues)

Other graphical parameters (such as line type, color, and width) can also be specified in the abline( )
function.

# add solid horizontal lines at y=1,5,7
abline(h=c(1,5,7))
# add dashed blue vertical lines at x = 1,3,5,7,9
abline(v=seq(1,10,2),lty=2,col="blue")

Note: You can also use the grid( ) function to add reference lines.
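
Besides h= and v=, abline( ) also accepts a fitted linear model and draws its regression line. The sketch below uses the built-in mtcars data. The pdf(NULL) line is an assumption made so the code runs without a screen device.

```r
pdf(NULL)   # null device so the sketch runs non-interactively; omit to draw on screen

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Mileage")

# dashed blue horizontal line at the mean mileage
abline(h = mean(mtcars$mpg), lty = 2, col = "blue")

# abline() takes the intercept and slope from a fitted lm object
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lwd = 2)

dev.off()
```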

Legend
Add a legend with the legend() function.

legend(location, title, legend, ...)

Common options are described below.

option     description
location   There are several ways to indicate the location of the legend. You can give
           an x,y coordinate for the upper left hand corner of the legend. You can use
           locator(1), in which case you use the mouse to indicate the location of the
           legend. You can also use the keywords "bottom", "bottomleft", "left",
           "topleft", "top", "topright", "right", "bottomright", or "center". If you use a
           keyword, you may want to use inset= to specify an amount to move the
           legend into the graph (as fraction of plot region).
title      A character string for the legend title (optional)
legend     A character vector with the labels
...        Other options. If the legend labels colored lines, specify col= and a vector of
           colors. If the legend labels point symbols, specify pch= and a vector of point
           symbols. If the legend labels line width or line style, use lwd= or lty= and a
           vector of widths or styles. To create colored boxes for the legend (common
           in bar, box, or pie charts), use fill= and a vector of colors.

Other common legend options include bty for box type, bg for background color, cex for size, and
text.col for text color. Setting horiz=TRUE sets the legend horizontally rather than vertically.

# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Milage by Car Weight",
  yaxt="n", xlab="Milage", horizontal=TRUE,
  col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
  c("4","6","8"), fill=terrain.colors(3), horiz=TRUE)

For more on legends, see help(legend). The examples in the help are particularly informative.
Correlograms
Correlograms help us visualize the data in correlation matrices. For details, see Corrgrams:
Exploratory displays for correlation matrices.

In R, correlograms are implemented through the corrgram(x, order = , panel=, lower.panel=,
upper.panel=, text.panel=, diag.panel=) function in the corrgram package.

Options
x is a data frame with one observation per row.

order=TRUE will cause the variables to be ordered using principal component analysis of the
correlation matrix.

panel= refers to the off-diagonal panels. You can use lower.panel= and upper.panel= to choose
different options below and above the main diagonal respectively. text.panel= and diag.panel= refer
to the main diagonal. Allowable parameters are given below.

off diagonal panels
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)

main diagonal panels
panel.minmax (min and max values of the variable)
panel.txt (variable name).

# First Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
  upper.panel=panel.pie, text.panel=panel.txt,
  main="Car Milage Data in PC2/PC1 Order")

# Second Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.ellipse,
  upper.panel=panel.pts, text.panel=panel.txt,
  diag.panel=panel.minmax,
  main="Car Milage Data in PC2/PC1 Order")

# Third Correlogram Example
library(corrgram)
corrgram(mtcars, order=NULL, lower.panel=panel.shade,
  upper.panel=NULL, text.panel=panel.txt,
  main="Car Milage Data (unsorted)")

Changing the colors in a correlogram

You can control the colors in a correlogram by specifying 4 colors in the colorRampPalette( ) function
within the col.corrgram( ) function. Here is an example.

# Changing Colors in a Correlogram
library(corrgram)
col.corrgram <- function(ncol){
  colorRampPalette(c("darkgoldenrod4", "burlywood1",
  "darkkhaki", "darkgreen"))(ncol)}
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
  upper.panel=panel.pie, text.panel=panel.txt,
  main="Correlogram of Car Mileage Data (PC2/PC1 Order)")
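
To see what the palette function produces on its own, here is a minimal sketch using only base R. The names ramp and cols, and the choice of seven colors, are illustrative.

```r
# colorRampPalette() returns a function; calling that function with n
# gives n colors interpolated along the ramp
ramp <- colorRampPalette(c("darkgoldenrod4", "burlywood1",
                           "darkkhaki", "darkgreen"))
cols <- ramp(7)
cols   # seven hex color strings, running from the first color to the last
```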
Graphics with ggplot2
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating
elegant and complex plots. Its popularity in the R community has exploded in recent years. Originally
based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that
represent both univariate and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots
(i.e., conditioning) is relatively simple.

Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful
resources). There is a helper function called qplot() (for quick plot) that can hide much of this
complexity when creating standard graphs.

qplot()

The qplot() function can be used to create the most common graph types. While it does not expose
ggplot's full power, it can create a very wide range of useful plots. The format is:

qplot(x, y, data=, color=, shape=, size=,
  alpha=, geom=, method=, formula=,
  facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)

where the options are:

option          description
alpha           Alpha transparency for overlapping elements expressed as a fraction between 0
                (complete transparency) and 1 (complete opacity)
color, shape,   Associates the levels of variable with symbol color, shape, or size. For line plots,
size, fill      color associates levels of a variable with line color. For density and box plots, fill
                associates fill colors with a variable. Legends are drawn automatically.
data            Specifies a data frame
facets          Creates a trellis graph by specifying conditioning variables. Its value is expressed as
                rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use
                rowvar ~ . or . ~ colvar
geom            Specifies the geometric objects that define the graph type. The geom option is
                expressed as a character vector with one or more entries. geom values include
                "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
main, sub       Character vectors specifying the title and subtitle
method,         If geom="smooth", a loess fit line and confidence limits are added by default. When
formula         the number of observations is greater than 1,000, a more efficient smoothing
                algorithm is employed. Methods include "lm" for regression, "gam" for generalized
                additive models, and "rlm" for robust regression. The formula parameter gives the
                form of the fit.
                For example, to add simple linear regression lines, you'd specify geom="smooth",
                method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a
                quadratic fit. Note that the formula uses the letters x and y, not the names of the
                variables.

                For method="gam", be sure to load the mgcv package. For method="rlm", load the
                MASS package.
x, y            Specifies the variables placed on the horizontal and vertical axis. For univariate
                plots (for example, histograms), omit y
xlab, ylab      Character vectors specifying horizontal and vertical axis labels
xlim, ylim      Two-element numeric vectors giving the minimum and maximum values for the
                horizontal and vertical axes, respectively

Notes:

• At present, ggplot2 cannot be used to create 3D graphs or mosaic plots.

• Use I(value) to indicate a specific value. For example size=z makes the size of the plotted points or
lines proportional to the values of a variable z. In contrast, size=I(3) sets each point or line to three
times the default size.

Here are some examples using automotive data (car mileage, weight, number of gears, number of
cylinders, etc.) contained in the mtcars data frame.

# ggplot2 examples
library(ggplot2)

# create factors with value labels
mtcars$gear <- factor(mtcars$gear, levels=c(3,4,5),
  labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am, levels=c(0,1),
  labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl, levels=c(4,6,8),
  labels=c("4cyl","6cyl","8cyl"))

# Kernel density plots for mpg
# grouped by number of gears (indicated by color)
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
  main="Distribution of Gas Milage", xlab="Miles Per Gallon",
  ylab="Density")

# Scatterplot of mpg vs. hp for each combination of gears and cylinders
# in each facet, transmission type is represented by shape and color
qplot(hp, mpg, data=mtcars, shape=am, color=am,
  facets=gear~cyl, size=I(3),
  xlab="Horsepower", ylab="Miles per Gallon")

# Separate regressions of mpg on weight for each number of cylinders
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
  method="lm", formula=y~x, color=cyl,
  main="Regression of MPG on Weight",
  xlab="Weight", ylab="Miles per Gallon")

# Boxplots of mpg by number of gears
# observations (points) are overlaid and jittered
qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),
  fill=gear, main="Mileage by Gear Number",
  xlab="", ylab="Miles per Gallon")
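
For comparison, the regression example can also be built from ggplot( ) layers directly. This is a sketch assuming the ggplot2 package is installed. The copy named cars is an assumption made so the built-in mtcars is left unchanged.

```r
library(ggplot2)

# work on a copy so the built-in mtcars is left untouched
cars <- mtcars
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8),
                   labels = c("4cyl", "6cyl", "8cyl"))

# points plus one least-squares line per cylinder group
p <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Regression of MPG on Weight",
       x = "Weight", y = "Miles per Gallon")

# the plot is drawn when p is printed, e.g. by typing p at the console
```

Each qplot( ) argument maps to a layer or a label here: geom= becomes the geom_ calls, and main=, xlab=, ylab= become labs( ).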

Customizing ggplot2 Graphs

Unlike base R graphs, the ggplot2 graphs are not affected by many of the options set in the par( )
function. They can be modified using the theme() function, and by adding graphic parameters within
the qplot() function. For greater control, use ggplot() and other functions provided by the package.
Note that ggplot2 functions can be chained with "+" signs to generate the final plot.

library(ggplot2)

p <- qplot(hp, mpg, data=mtcars, shape=am, color=am,
  facets=gear~cyl, main="Scatterplots of MPG vs. Horsepower",
  xlab="Horsepower", ylab="Miles per Gallon")

# White background and black grid lines
p + theme_bw()

# Large brown bold italics labels
# and legend placed at top of plot
p + theme(axis.title=element_text(face="bold.italic",
  size=12, color="brown"), legend.position="top")

Going Further
We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston
Chang's excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data
Analysis is still the definitive book on this subject.
Interactive Graphics
There are several ways to interact with R graphics in real time. Three methods are described below.

GGobi

GGobi is an open source visualization program for exploring high-dimensional data. It is freely available
for MS Windows, Linux, and Mac platforms. It supports linked interactive scatterplots, barcharts,
parallel coordinate plots and tours, with both brushing and identification. A good tutorial is included
with the GGobi manual. You can download the software here.

Once GGobi is installed, you can use the ggobi( ) function in the package rggobi to run GGobi from
within R. This gives you interactive graphics access to all of your R data! See An Introduction to
RGGOBI.

# Interact with R data using GGobi
library(rggobi)
g <- ggobi(mydata)

iPlots
The iplots package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots
and histograms that can be linked and color brushed. iplots is implemented through the Java GUI for R.
For more information, see the iplots website.

# Install iplots
install.packages("iplots",dep=TRUE)

# Create some linked plots
library(iplots)
cyl.f <- factor(mtcars$cyl)
gear.f <- factor(mtcars$gear)
attach(mtcars)
ihist(mpg) # histogram
ibar(carb) # barchart
iplot(mpg, wt) # scatter plot
ibox(mtcars[c("qsec","disp","hp")]) # boxplots
ipcp(mtcars[c("mpg","wt","hp")]) # parallel coordinates
imosaic(cyl.f,gear.f) # mosaic plot

On Windows platforms, hold down the Ctrl key and move the mouse over each graph to get identifying
information from points, bars, etc.

Interacting with Plots (Identifying Points)

R offers two functions for identifying points and coordinate locations in plots. With identify(), clicking
the mouse over points in a graph will display the row number or (optionally) the rowname for the point.
This continues until you select stop. With locator() you can add points or lines to the plot using the
mouse. The function returns a list of the (x,y) coordinates. Again, this continues until you select stop.

# Interacting with a scatterplot
attach(mydata)
plot(x, y) # scatterplot
identify(x, y, labels=row.names(mydata)) # identify points
coords <- locator(type="l") # add lines
coords # display list

Other Interactive Graphs

See scatterplots for a description of rotating 3D scatterplots in R.
Combining Plots
R makes it easy to combine multiple plots into one overall graph, using either the
par( ) or layout( ) function.

With the par( ) function, you can include the option mfrow=c(nrows, ncols) to create a matrix of
nrows x ncols plots that are filled in by row. mfcol=c(nrows, ncols) fills in the matrix by columns.

# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
plot(wt,mpg, main="Scatterplot of wt vs. mpg")
plot(wt,disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")

# 3 figures arranged in 3 rows and 1 column
attach(mtcars)
par(mfrow=c(3,1))
hist(wt)
hist(mpg)
hist(disp)

The layout( ) function has the form layout(mat) where
mat is a matrix object specifying the location of the N figures to plot.

# One figure in row 1 and two figures in row 2
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)

Optionally, you can include widths= and heights= options in the layout() function to control the size of
each figure more precisely. These options have the form
widths= a vector of values for the widths of columns
heights= a vector of values for the heights of rows.

Relative widths are specified with numeric values. Absolute widths (in centimetres) are specified with
the lcm() function.

# One figure in row 1 and two figures in row 2
# row 1 is 1/3 of the total height (half the height of row 2)
# column 2 is 1/4 of the total width (1/3 the width of column 1)
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
  widths=c(3,1), heights=c(1,2))
hist(wt)
hist(mpg)
hist(disp)

See help(layout) for more details.
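
Before drawing into a layout it can help to preview how the figure regions will be arranged. The following is a minimal sketch using layout.show( ) from base graphics; the variable names m and n_figs are illustrative only.

```r
# Sketch: preview a layout before drawing into it
m <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)

# layout() returns (invisibly) the number of figures in the layout
n_figs <- layout(m, widths = c(3, 1), heights = c(1, 2))

layout.show(n_figs)   # outline and number each figure region
layout(1)             # return to a single-figure layout
```

The numbers drawn by layout.show( ) correspond to the order in which subsequent plots fill the regions.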

Creating a figure arrangement with fine control


In the following example, two box plots are added to a scatterplot to create an enhanced graph.

# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8), new=TRUE)
plot(mtcars$wt, mtcars$mpg, xlab="Car Weight",
  ylab="Miles Per Gallon")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8), new=TRUE)
boxplot(mtcars$mpg, axes=FALSE)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)

To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to
(1,1) in the upper right corner. The format of the fig= parameter is a numerical vector of the form
c(x1, x2, y1, y2). The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on
the y axis. The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis. I chose 0.55
rather than 0.8 so that the top figure will be pulled closer to the scatterplot. The right hand boxplot
goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis. Again, I chose a value to pull the right
hand boxplot closer to the scatterplot. You have to experiment to get it just right.

fig= starts a new plot, so to add to an existing plot use new=TRUE.

You can use this to combine several plots in any arrangement into one graph.

Visualizing Categorical Data
The vcd package provides a variety of methods for visualizing multivariate categorical data, inspired by
Michael Friendly's wonderful "Visualizing Categorical Data". Extended mosaic and association plots are
described here. Each provides a method of visualizing complex data and evaluating deviations from a
specified independence model. For more details, see The Strucplot Framework.

Mosaic Plots
For extended mosaic plots, use mosaic(x, condvar=, data=) where x is a table or formula, condvar= is
an optional conditioning variable, and data= specifies a data frame or a table. Include shade=TRUE to
color the figure, and legend=TRUE to display a legend for the Pearson residuals.

# Mosaic Plot Example
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

Association Plots
To produce an extended association plot use assoc(x, row_vars, col_vars) where x is a contingency
table, row_vars is a vector of integers giving the indices of the variables to be used for the rows, and
col_vars is a vector of integers giving the indices of the variables to be used for the columns of the
association plot.

# Association Plot Example
library(vcd)
assoc(HairEyeColor, shade=TRUE)

Going Further
Both functions are complex and offer multiple input and output options. See help(mosaic) and
help(assoc) for more details.
Copyright © 2012 Robert I. Kabacoff, Ph.D.
Graphical Parameters
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.

One way is to specify these options through the par( ) function. If you set parameter values here, the
changes will be in effect for the rest of the session or until you change them again. The format is
par(optionname=value, optionname=value, ...)

# Set a graphical parameter using par()
par()                            # view current settings
opar <- par(no.readonly=TRUE)    # make a copy of the current (settable) settings
par(col.lab="red")               # red x and y labels
hist(mtcars$mpg)                 # create a plot with these new settings
par(opar)                        # restore original settings

A second way to specify graphical parameters is by providing the optionname=value pairs directly to a
high level plotting function. In this case, the options are only in effect for that specific graph.

# Set a graphical parameter within the plotting function
hist(mtcars$mpg, col.lab="red")

See the help for a specific high level plotting function (e.g. plot, hist, boxplot) to determine which
graphical parameters can be set this way.

The remainder of this section describes some of the more important graphical parameters that you can
set.

Text and Symbol Size
The following options can be used to control text and symbol size in graphs.

option     description
cex        number indicating the amount by which plotting text and symbols should be
           scaled relative to the default. 1=default, 1.5 is 50% larger, 0.5 is 50%
           smaller, etc.
cex.axis   magnification of axis annotation relative to cex
cex.lab    magnification of x and y labels relative to cex
cex.main   magnification of titles relative to cex
cex.sub    magnification of subtitles relative to cex

Plotting Symbols
Use the pch= option to specify symbols to use when plotting points. For symbols 21 through 25, specify
border color (col=) and fill color (bg=).

[Figure: plot symbols for pch = 0 through 25]
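
As a small illustration of the col= and bg= options for the filled symbols, the sketch below draws symbols 21 through 25 with a blue border and yellow fill. The colors, sizes and the name filled_pch are arbitrary choices for this example.

```r
# Sketch: filled symbols 21-25 take a border color (col) and a fill color (bg)
filled_pch <- 21:25

plot(1:5, rep(1, 5), pch = filled_pch, cex = 3,
     col = "blue", bg = "yellow",
     xlab = "", ylab = "", yaxt = "n",
     main = "pch = 21 to 25")

# label each symbol with its pch number, placed below the point
text(1:5, rep(1, 5), labels = filled_pch, pos = 1, offset = 2)
```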

Lines
You can change lines using the following options. This is particularly useful for reference lines, axes,
and fit lines.

option   description
lty      line type. see the chart below.
lwd      line width relative to the default (default=1). 2 is twice as wide.

[Figure: line types for lty = 1 through 6]

Colors
Options that specify colors include the following.

option     description
col        Default plotting color. Some functions (e.g. lines) accept a vector of values
           that are recycled.
col.axis   color for axis annotation
col.lab    color for x and y labels
col.main   color for titles
col.sub    color for subtitles
fg         plot foreground color (axes, boxes - also sets col= to same)
bg         plot background color

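You can specify colors in R by index, name, hexadecimal, or RGB.

For example, col="white", col="#FFFFFF", and col=rgb(1,1,1) are equivalent. A numeric index such as
col=1 refers to the current palette( ), whose first entry is "black" by default; "white" is the first
entry of colors( ), not of the palette.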

The following chart was produced with code developed by Earl F. Glynn. See his Color Chart for all the
details you would ever need about using colors in R.

You can also create a vector of n contiguous colors using the functions rainbow(n), heat.colors(n),
terrain.colors(n), topo.colors(n), and cm.colors(n).

colors() returns all available color names.
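
The different ways of naming a color can be checked against one another with col2rgb( ) and rgb( ) from base R. This is only a sketch; the variable names are illustrative. It also shows that a numeric color is looked up in palette( ), not in colors( ).

```r
# Sketch: compare the different ways of naming the same color
white_by_name <- col2rgb("white")      # 3 x 1 matrix of red, green, blue (0-255)
white_by_hex  <- col2rgb("#FFFFFF")
white_by_rgb  <- rgb(1, 1, 1)          # components on a 0-1 scale; returns a hex string

all(white_by_name == white_by_hex)     # TRUE
white_by_rgb                           # "#FFFFFF"

# A numeric color is an index into palette(), not into colors()
palette("default")                     # make sure the default palette is active
palette()[1]                           # "black"
colors()[1]                            # "white"
```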

Fonts
You can easily set font size and style, but font family is a bit more complicated.

option      description
font        Integer specifying font to use for text.
            1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis   font for axis annotation
font.lab    font for x and y labels
font.main   font for titles
font.sub    font for subtitles
ps          font point size (roughly 1/72 inch)
            text size=ps*cex
family      font family for drawing text. Standard values are "serif", "sans", "mono",
            "symbol". Mapping is device dependent.

In Windows, mono is mapped to "TT Courier New", serif is mapped to "TT Times New Roman", sans is
mapped to "TT Arial", and symbol is mapped to "TT Symbol" (TT=True Type). You can add your own
mappings.

# Type family examples - creating new mappings (Windows only)
plot(1:10,1:10,type="n")
windowsFonts(
  A=windowsFont("Arial Black"),
  B=windowsFont("Bookman Old Style"),
  C=windowsFont("Comic Sans MS"),
  D=windowsFont("Symbol")
)
text(3,3,"Hello World Default")
text(4,4,family="A","Hello World from Arial Black")
text(5,5,family="B","Hello World from Bookman Old Style")
text(6,6,family="C","Hello World from Comic Sans MS")
text(7,7,family="D","Hello World from Symbol")

Margins and Graph Size


You can control the margin size using the following parameters.

option description
mar numerical vector indicating margin size c(bottom, left, top, right) in lines.
default = c(5, 4, 4, 2) + 0.1
mai numerical vector indicating margin size c(bottom, left, top, right) in inches
pin plot dimensions (width, height) in inches

For complete information on margins, see Earl F. Glynn's margin tutorial.

Going Further
See help(par) for more information on graphical parameters. The customization of plotting axes and
text annotations is covered in the next section.

Probability Plots
This section describes creating probability plots in R for both didactic purposes and for data analyses.

Probability Plots for Teaching and Demonstration
When I was a college professor teaching statistics, I used to have to draw normal distributions by hand.
They always came out looking like bunny rabbits. What can I say?

R makes it easy to draw probability distributions and demonstrate statistical concepts. Some of the
more common probability distributions available in R are given below.

distribution     R name    distribution        R name
Beta             beta      Lognormal           lnorm
Binomial         binom     Negative Binomial   nbinom
Cauchy           cauchy    Normal              norm
Chisquare        chisq     Poisson             pois
Exponential      exp       Student t           t
F                f         Uniform             unif
Gamma            gamma     Tukey               tukey
Geometric        geom      Weibull             weibull
Hypergeometric   hyper     Wilcoxon            wilcox
Logistic         logis

For a comprehensive list, see Statistical Distributions on the R wiki. The functions available for each
distribution follow this format:

name       description
dname( )   density or probability function
pname( )   cumulative distribution function
qname( )   quantile function
rname( )   random deviates

For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero).
qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution). rnorm(100)
generates 100 random deviates from a standard normal distribution.

Each function has parameters specific to that distribution. For example, rnorm(100, mean=50, sd=10)
generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
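
The naming convention can be tried out directly at the console. The sketch below applies the four prefixes to the normal distribution; the seed and the name draws are arbitrary choices for this illustration.

```r
# Sketch: the d/p/q/r naming convention applied to the normal distribution
dnorm(0)                    # density of the standard normal at 0
pnorm(0)                    # cumulative probability to the left of 0: 0.5
qnorm(0.9)                  # 90th percentile: about 1.28
pnorm(qnorm(0.9))           # the p and q functions are inverses: 0.9

set.seed(1234)              # arbitrary seed, only for reproducibility
draws <- rnorm(100, mean = 50, sd = 10)
length(draws)               # 100
```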
You can use these functions to demonstrate various aspects of probability distributions. Two common
examples are given below.

# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
  ylab="Density", main="Comparison of t Distributions")

for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}

legend("topright", inset=.05, title="Distributions",
  labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)

# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?

mean=100; sd=15
lb=80; ub=120

x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)

plot(x, hx, type="n", xlab="IQ Values", ylab="",
  main="Normal Distribution", axes=FALSE)

i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")

area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(",lb,"< IQ <",ub,") =",
  signif(area, digits=3))
mtext(result,3)
axis(1, at=seq(40, 160, 20), pos=0)

For a comprehensive view of probability plotting in R, see Vincent Zoonekynd's Probability Distributions.

Fitting Distributions
There are several methods of fitting distributions in R. Here are some options.

You can use the qqnorm() function to create a Quantile-Quantile plot evaluating the fit of sample data
to the normal distribution. More generally, the qqplot( ) function creates a Quantile-Quantile plot for
any theoretical distribution.

# Q-Q plots
par(mfrow=c(1,2))

# create sample data
x <- rt(100, df=3)

# normal fit
qqnorm(x); qqline(x)

# t(3Df) fit
qqplot(rt(1000,df=3), x, main="t(3) Q-Q Plot",
  ylab="Sample Quantiles")
abline(0,1)

The fitdistr( ) function in the MASS package provides maximum-likelihood fitting of univariate
distributions. The format is fitdistr(x, densityfunction) where x is the sample data and densityfunction
is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric",
"log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" or "weibull".

# Estimate parameters assuming log-Normal distribution

# create some sample data
x <- rlnorm(100)

# estimate parameters
library(MASS)
fitdistr(x, "lognormal")

Finally, R has a wide range of goodness of fit tests for evaluating if it is reasonable to assume that a
random sample comes from a specified theoretical distribution. These include chi-square, Kolmogorov-
Smirnov, and Anderson-Darling.
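
As one possible illustration, the sketch below fits a log-normal distribution with fitdistr( ) and then applies the Kolmogorov-Smirnov test from base R (ks.test) against the fitted distribution. The simulated data, the seed and the variable names are assumptions made for this example. Note that when the parameters are estimated from the same sample, the reported p-value tends to be optimistic, so treat it as a rough guide only.

```r
library(MASS)                          # fitdistr()

set.seed(42)                           # arbitrary seed for reproducibility
x <- rlnorm(100)                       # simulated log-normal sample

fit <- fitdistr(x, "lognormal")        # maximum-likelihood estimates
est <- fit$estimate                    # named vector: meanlog, sdlog

# Kolmogorov-Smirnov test against the fitted log-normal distribution
ks <- ks.test(x, "plnorm",
              meanlog = est["meanlog"], sdlog = est["sdlog"])
ks$statistic                           # D statistic
ks$p.value                             # large values give no evidence against the fit
```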

For more details on fitting distributions, see Vito Ricci's Fitting Distributions with R. For general (non
R) advice, see Bill Huber's Fitting Distributions to Data.

Lattice Graphs
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing
better defaults and the ability to easily display multivariate relationships. In particular, the package
supports the creation of trellis graphs: graphs that display a variable or the relationship between
variables, conditioned on one or more other variables.

The typical format is

graph_type(formula, data=)

where graph_type is selected from the list below. formula specifies the variable(s) to display and
any conditioning variables. For example, ~x|A means display numeric variable x for each level of factor
A. y~x | A*B means display the relationship between numeric variables y and x separately for every
combination of factor A and B levels. ~x means display numeric variable x alone.

graph_type    description                 formula examples
barchart      bar chart                   x~A or A~x
bwplot        boxplot                     x~A or A~x
cloud         3D scatterplot              z~x*y|A
contourplot   3D contour plot             z~x*y
densityplot   kernel density plot         ~x|A*B
dotplot       dotplot                     ~x|A
histogram     histogram                   ~x
levelplot     3D level plot               z~y*x
parallel      parallel coordinates plot   data frame
splom         scatterplot matrix          data frame
stripplot     strip plots                 A~x or x~A
xyplot        scatterplot                 y~x|A
wireframe     3D wireframe graph          z~y*x

Here are some examples. They use the car data (mileage, weight, number of gears, number of
cylinders, etc.) from the mtcars data frame.

# Lattice Examples
library(lattice)
attach(mtcars)

# create factors with value labels
gear.f <- factor(gear,levels=c(3,4,5),
  labels=c("3gears","4gears","5gears"))
cyl.f <- factor(cyl,levels=c(4,6,8),
  labels=c("4cyl","6cyl","8cyl"))

# kernel density plot
densityplot(~mpg,
  main="Density Plot",
  xlab="Miles per Gallon")

# kernel density plots by factor level
densityplot(~mpg|cyl.f,
  main="Density Plot by Number of Cylinders",
  xlab="Miles per Gallon")

# kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f,
  main="Density Plot by Number of Cylinders",
  xlab="Miles per Gallon",
  layout=c(1,3))

# boxplots for each combination of two factors
bwplot(cyl.f~mpg|gear.f,
  ylab="Cylinders", xlab="Miles per Gallon",
  main="Mileage by Cylinders and Gears",
  layout=c(1,3))

# scatterplots for each combination of two factors
xyplot(mpg~wt|cyl.f*gear.f,
  main="Scatterplots by Cylinders and Gears",
  ylab="Miles per Gallon", xlab="Car Weight")

# 3d scatterplot by factor level
cloud(mpg~wt*qsec|cyl.f,
  main="3D Scatterplot by Cylinders")

# dotplot for each combination of two factors
dotplot(cyl.f~mpg|gear.f,
  main="Dotplot Plot by Number of Gears and Cylinders",
  xlab="Miles Per Gallon")

# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
  main="MTCARS Data")

Note, as in graph 1, that specifying a conditioning variable is optional. The difference between
graphs 2 & 3 is the use of the layout option to control the placement of panels.

Customizing Lattice Graphs

Unlike base R graphs, lattice graphs are not affected by many of the options set in the par( ) function.
To view the options that can be changed, look at help(xyplot). It is frequently easiest to set these
options within the high level plotting functions described above. Additionally, you can write functions
that modify the rendering of panels. Here is an example.

# Customized Lattice Example
library(lattice)
panel.smoother <- function(x, y) {
  panel.xyplot(x, y) # show points
  panel.loess(x, y)  # show smoothed line
}
attach(mtcars)
hp <- cut(hp,3) # divide horse power into three bands
xyplot(mpg~wt|hp, scales=list(cex=.8, col="red"),
  panel=panel.smoother,
  xlab="Weight", ylab="Miles per Gallon",
  main="MPG vs Weight by Horse Power")

Going Further
Lattice graphics are a comprehensive graphical system in their own right. Deepayan Sarkar's book
Lattice: Multivariate Data Visualization with R is the definitive reference. Additionally, see the Trellis
Graphics homepage and the Trellis User's Guide. Dr. Ihaka has created a wonderful set of slides on the
subject. An excellent early consideration of trellis graphs can be found in W.S. Cleveland's classic book
Visualizing Data.

ANOVA
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R's
approach less coherent and user-friendly. A good online presentation on ANOVA in R can be found in the
ANOVA section of the Personality Project. (Note: I have found that these pages render fine in Chrome
and Safari browsers, but can appear distorted in iExplorer.)

1. Fit a Model
In the following examples lower case letters are numeric variables and upper case letters are factors.

# One Way Anova (Completely Randomized Design)
fit <- aov(y ~ A, data=mydataframe)

# Randomized Block Design (B is the blocking factor)
fit <- aov(y ~ A + B, data=mydataframe)

# Two Way Factorial Design
fit <- aov(y ~ A + B + A:B, data=mydataframe)
fit <- aov(y ~ A*B, data=mydataframe) # same thing

# Analysis of Covariance
fit <- aov(y ~ A + x, data=mydataframe)

For within subjects designs, the data frame has to be rearranged so that each measurement on a
subject is a separate observation. See R and Analysis of Variance.

# One Within Factor
fit <- aov(y~A+Error(Subject/A),data=mydataframe)

# Two Within Factors W1 W2, Two Between Factors B1 B2
fit <- aov(y~(W1*W2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2),
  data=mydataframe)
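
The rearrangement needed for within subjects designs can be done with reshape( ) from base R. The sketch below uses a small made-up data set (four subjects measured at three time points, columns t1 to t3); the data values and all variable names are hypothetical and serve only to show the wide-to-long step before calling aov( ).

```r
# Hypothetical wide data: one row per subject, one column per measurement
wide <- data.frame(Subject = factor(1:4),
                   t1 = c(5, 6, 7, 5),
                   t2 = c(6, 8, 9, 6),
                   t3 = c(8, 9, 12, 7))

# Rearrange so that each measurement on a subject is a separate row
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2", "t3"),   # columns holding the repeated measure
                v.names = "y",                   # name of the stacked outcome column
                timevar = "A",                   # name of the within-subject factor
                times   = c("t1", "t2", "t3"),
                idvar   = "Subject")
long$A <- factor(long$A)

# One within factor, as in the template above
fit <- aov(y ~ A + Error(Subject/A), data = long)
summary(fit)
```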

2. Look at Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

layout(matrix(c(1,2,3,4),2,2)) # optional layout
plot(fit) # diagnostic plots

For details on the evaluation of test requirements, see (M)ANOVA Assumptions.


3. Evaluate Model Effects
WARNING: R provides Type I sequential SS, not the default Type III marginal SS reported by SAS and
SPSS. In a nonorthogonal design with more than one term on the right hand side of the equation, order
will matter (i.e., A+B and B+A will produce different results)! We will need to use the drop1( ) function
to produce the familiar Type III results. It will compare each term with the full model. Alternatively,
we can use anova(fit.model1, fit.model2) to compare nested models directly.

summary(fit) # display Type I ANOVA table
drop1(fit,~.,test="F") # type III SS and F Tests

Nonparametric and resampling alternatives are available.
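
The effect of term order can be seen with the unbalanced mtcars data. The sketch below is an illustration only: the choice of mpg, cyl and gear, and the names d, fit_ab, fit_ba and fit_a, are assumptions made for this example. It fits the same two factors in both orders and then compares nested models directly.

```r
# Treat cyl and gear as factors (the cell sizes are unequal, so the design is nonorthogonal)
d <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

fit_ab <- aov(mpg ~ cyl + gear, data = d)   # cyl entered first
fit_ba <- aov(mpg ~ gear + cyl, data = d)   # gear entered first

# Sequential (Type I) sums of squares: rows are term 1, term 2, Residuals
ss_ab <- summary(fit_ab)[[1]][["Sum Sq"]]
ss_ba <- summary(fit_ba)[[1]][["Sum Sq"]]
ss_ab
ss_ba    # the SS for cyl and gear change with order; the residual SS does not

# Each term tested against the full model
drop1(fit_ab, ~ ., test = "F")

# Direct comparison of nested models: does gear add to a model with cyl only?
fit_a <- aov(mpg ~ cyl, data = d)
cmp <- anova(fit_a, fit_ab)
cmp
```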

Multiple Comparisons
You can get Tukey HSD tests using the function below. By default, it calculates post hoc comparisons on
each factor in the model. You can specify specific factors as an option. Again, remember that results
are based on Type I SS!

# Tukey Honestly Significant Differences
TukeyHSD(fit) # where fit comes from aov()
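
Restricting the comparisons to specific factors is done with the which= argument of TukeyHSD( ). A minimal sketch using the mtcars data follows; the model and the names d, fit and tk are chosen only for illustration.

```r
d <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
fit <- aov(mpg ~ cyl + gear, data = d)

# Post hoc comparisons for the cyl factor only
tk <- TukeyHSD(fit, which = "cyl", conf.level = 0.95)
tk          # one row per pair of cyl levels: diff, lwr, upr, p adj
plot(tk)    # confidence intervals for the pairwise differences
```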

Visualizing Results
Use box plots and line plots to visualize group differences. There are also two functions specifically
designed for visualizing mean differences in ANOVA layouts. interaction.plot( ) in the base stats
package produces plots for two-way interactions. plotmeans( ) in the gplots package produces mean
plots for single factors, and includes confidence intervals.

# Two-way Interaction Plot
attach(mtcars)
gear <- factor(gear)
cyl <- factor(cyl)
interaction.plot(cyl, gear, mpg, type="b", col=c(1:3),
  leg.bty="o", leg.bg="beige", lwd=2, pch=c(18,24,22),
  xlab="Number of Cylinders",
  ylab="Mean Miles Per Gallon",
  main="Interaction Plot")

# Plot Means with Error Bars
library(gplots)
attach(mtcars)
cyl <- factor(cyl)
plotmeans(mpg~cyl,xlab="Number of Cylinders",
  ylab="Miles Per Gallon", main="Mean Plot\nwith 95% CI")

MANOVA
If there is more than one dependent (outcome) variable, you can test them simultaneously using a
multivariate analysis of variance (MANOVA). In the following example, let Y be a matrix whose
columns are the dependent variables.

# 2x2 Factorial MANOVA with 3 Dependent Variables.
Y <- cbind(y1,y2,y3)
fit <- manova(Y ~ A*B)
summary(fit, test="Pillai")

Other test options are "Wilks", "Hotelling-Lawley", and "Roy". Use summary.aov( ) to get univariate
statistics. TukeyHSD( ) and plot( ) will not work with a MANOVA fit. Run each dependent variable
separately to obtain them. Like ANOVA, MANOVA results in R are based on Type I SS. To obtain Type III
SS, vary the order of variables in the model and rerun the analyses. For example, fit y~A*B for the
Type III B effect and y~B*A for the Type III A effect.
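
A small runnable sketch of the multivariate test followed by the univariate follow-ups is given below. It uses the built-in iris data with a single factor rather than the 2x2 layout above; the choice of data set and of the two outcome variables is an assumption made purely for illustration.

```r
# Two dependent variables from the built-in iris data
Y <- cbind(Sepal.Length = iris$Sepal.Length,
           Petal.Length = iris$Petal.Length)

fit <- manova(Y ~ Species, data = iris)

summary(fit, test = "Wilks")   # multivariate test
uni <- summary.aov(fit)        # one univariate ANOVA table per dependent variable
uni
```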

Going Further
R has excellent facilities for fitting linear and generalized linear mixed-effects models. The latest
implementation is in package lme4. See the R News Article on Fitting Mixed Linear Models in R for
details.

Assessing Classical Test Assumptions
In classical parametric procedures we often assume normality and constant variance for the model error
term. Methods of exploring these assumptions in an ANOVA/ANCOVA/MANOVA framework are discussed
here. Regression diagnostics are covered under multiple linear regression.

Outliers
Since outliers can severely affect normality and homogeneity of variance, methods for detecting
disparate observations are described first.

The aq.plot() function in the mvoutlier package allows you to identify multivariate outliers by plotting
the ordered squared robust Mahalanobis distances of the observations against the empirical distribution
function of the MD_i^2. Input consists of a matrix or data frame. The function produces 4 graphs and
returns a boolean vector identifying the outliers.

# Detect Outliers in the MTCARS Data
library(mvoutlier)
outliers <-
  aq.plot(mtcars[c("mpg","disp","hp","drat","wt","qsec")])
outliers # show list of outliers

Univariate Normality
You can evaluate the normality of a variable using a Q-Q plot.

# Q-Q Plot for variable MPG
attach(mtcars)
qqnorm(mpg)
qqline(mpg)

Significant departures from the line suggest violations of normality.

You can also perform a Shapiro-Wilk test of normality with the shapiro.test(x) function, where x is a
numeric vector. Additional functions for testing normality are available in the nortest package.
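
A minimal sketch of the Shapiro-Wilk test on the same mpg variable is shown below. The name sw is illustrative; the components accessed are those of the standard test object returned by shapiro.test( ).

```r
# Shapiro-Wilk test of normality for mpg
sw <- shapiro.test(mtcars$mpg)
sw$statistic   # W statistic (values near 1 are consistent with normality)
sw$p.value     # small p-values suggest a departure from normality
```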

Multivariate Normality
MANOVA assumes multivariate normality. The function mshapiro.test( ) in the mvnormtest package
produces the Shapiro-Wilk test for multivariate normality. Input must be a numeric matrix.

# Test Multivariate Normality
mshapiro.test(M)

If we have a p x 1 multivariate normal random vector x ~ N(µ, Σ),
then the squared Mahalanobis distance between x and µ is going to be chi-square distributed with p
degrees of freedom. We can use this fact to construct a Q-Q plot to assess multivariate normality.

# Graphical Assessment of Multivariate Normality
x <- as.matrix(mydata) # n x p numeric matrix
center <- colMeans(x) # centroid
n <- nrow(x); p <- ncol(x); cov <- cov(x);
d <- mahalanobis(x,center,cov) # distances
qqplot(qchisq(ppoints(n),df=p),d,
  main="QQ Plot Assessing Multivariate Normality",
  ylab="Mahalanobis D2")
abline(a=0,b=1)

Homogeneity of Variances
The bartlett.test( ) function provides a parametric K-sample test of the equality of variances. The
fligner.test( ) function provides a non-parametric test of the same. In the following examples y is a
numeric variable and G is the grouping variable.

# Bartlett Test of Homogeneity of Variances
bartlett.test(y~G, data=mydata)

# Fligner-Killeen Test of Homogeneity of Variances
fligner.test(y~G, data=mydata)

The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on
Brown-Forsythe. In the following example, y is numeric and G is a grouping factor. Note that G must be
of type factor.

# Homogeneity of Variance Plot
library(HH)
hov(y~G, data=mydata)
hovPlot(y~G,data=mydata)

Homogeneity of Covariance Matrices
MANOVA and LDF assume homogeneity of variance-covariance matrices. The assumption is usually
tested with Box's M. Unfortunately the test is very sensitive to violations of normality, leading to
rejection in most typical cases. Box's M is not included in R, but code is available.

Correlations
You can use the cor( ) function to produce correlations and the cov( ) function to produce
covariances.

A simplified format is cor(x, use=, method= ) where

Option   Description
x        Matrix or data frame
use      Specifies the handling of missing data. Options are all.obs (assumes no
         missing data - missing data will produce an error), complete.obs (listwise
         deletion), and pairwise.complete.obs (pairwise deletion)
method   Specifies the type of correlation. Options are pearson, spearman or kendall.

# Correlations/covariances among numeric variables in
# data frame mtcars. Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")


Unfortunately, neither cor( ) nor cov( ) produces tests of significance, although you can use the
cor.test( ) function to test a single correlation coefficient.
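
For a single pair of variables, a test with cor.test( ) might look like the sketch below. The pair mpg and wt is chosen only for illustration, and ct is an arbitrary name.

```r
# Test a single correlation: miles per gallon vs. car weight
ct <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
ct$estimate    # sample correlation
ct$p.value     # p-value for H0: true correlation is 0
ct$conf.int    # 95% confidence interval (Pearson only)
```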
The rcorr() function in the Hmisc package produces correlations/covariances and significance levels for
pearson and spearman correlations. However, input must be a matrix and pairwise deletion is used.

# Correlations with significance levels
library(Hmisc)
rcorr(x, type="pearson") # type can be pearson or spearman

# mtcars is a data frame
rcorr(as.matrix(mtcars))

You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and
the columns of Y. This is similar to the VAR and WITH commands in SAS PROC CORR.

# Correlation matrix from mtcars
# with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)

Other Types of Correlations

# polychoric correlation
# x is a contingency table of counts
library(polycor)
polychor(x)

# heterogeneous correlations in one matrix
# pearson (numeric-numeric),
# polyserial (numeric-ordinal),
# and polychoric (ordinal-ordinal)
# x is a data frame with ordered factors
# and numeric variables
library(polycor)
hetcor(x)

# partial correlations
library(ggm)
data(mydata)
pcor(c("a", "b", "x", "y", "z"), var(mydata))
# partial corr between a and b controlling for x, y, z

Visualizing Correlations
Use corrgram( ) to plot correlograms.

Use the pairs( ) or splom( ) to create scatterplot matrices.

A great example of a plotted correlation matrix can be found in the R Graph Gallery.

Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One method of obtaining
descriptive statistics is to use the sapply( ) function with a specified summary statistic.

# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)

Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.
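
When the summary function returns more than one value, as quantile does, sapply( ) returns a matrix with one column per variable. A brief sketch using three mtcars columns (an arbitrary choice for illustration; vars and q are illustrative names):

```r
vars <- mtcars[c("mpg", "hp", "wt")]

sapply(vars, mean, na.rm = TRUE)        # one value per variable: a named vector

# extra arguments (probs, na.rm) are passed on to quantile()
q <- sapply(vars, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
q                                       # rows are quantiles, columns are variables
```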
R~re55joo Djagnostjc5
There are also numerous R functions designed to provide a range of descriptive statistics at once. For
AMOVA/MAMOVA
example
(M)Al~OVA Assumptions

Resampling Stats
# mean,median,25th and 75th quartiles,min,max
Power Analysis summary(mydata)

Using With and By # Tukey min, lower-hinge, median,upper-hinge,max


fivenum(x)

R in Action
Using the Hmisc package

l i brary(Hmi se)
deseribe(mydata)
# n, nmiss, unique, mean, 5,10,25,50, 75,90,95th pereentiles
# 5 lowest and 5 highest seores

R in Action significantly expands


upon this material. Use promo Using the pastees package
code ria38 for a 38% discount.

library(pastees)
stat.desc(mydata)
Top Menu
# nbr.val, nbr.null, nbr.na, min max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, eoef . var

The R Interface
Using the ~ package
Data Input

Data Management l i brary(psyeh)


Basic Statistics describe(mydata)
#ítem name ,ítem number, nvalid, mean, sd,
Advanced Statistics # median, mad, min, max, skew, kurtosis, se
Basic Graphs

Advanced Graphs
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the ~ package.
library(psych)
describe. by(mydata, group, ... )
1
The doBy package provides much of the functionality of SAS PROC SUMMARY. It defines the desired
table using a model formula and a function. Here is a simple example.

library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
  FUN = function(x) { c(m = mean(x), s = sd(x)) } )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
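
If doBy is not installed, a similar grouped table can be sketched with aggregate( ) from base R. This is an alternative technique, not part of doBy, and it reports one statistic (the mean) per call.

```r
# Means of mpg and wt for each observed combination of cyl and vs
agg <- aggregate(cbind(mpg, wt) ~ cyl + vs, data = mtcars, FUN = mean)
agg
```

Only combinations of cyl and vs that actually occur in the data appear as rows.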

See also: aggregating data.


Frequencies and Crosstabs
This section describes the creation of frequency and contingency tables from categorical variables,
along with tests of independence, measures of association, and methods for graphically displaying
results.

Generating Frequency Tables
R provides many methods for creating frequency and contingency tables. Three are described below. In
the following examples, assume that A, B, and C represent categorical variables.

table
You can generate frequency tables using the table( ) function, tables of proportions using the
prop.table( ) function, and marginal frequencies using margin.table( ).

# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table

margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)

prop.table(mytable) # cell proportions
prop.table(mytable, 1) # row proportions
prop.table(mytable, 2) # column proportions

table( ) can also generate multidimensional tables based on 3 or more categorical variables. In this
case, use the ftable( ) function to print the results more attractively.

# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)

table( ) ignores missing values. To include NA as a category in counts, include the table option
exclude=NULL if the variable is a vector. If the variable is a factor you have to create a new factor
using newfactor <- factor(oldfactor, exclude=NULL).
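
The handling of missing values can be sketched with a tiny made-up vector (the data and the names x, oldfactor, and newfactor are illustrative only).

```r
# A character vector with one missing value (made-up data)
x <- c("a", "b", NA, "a")

table(x)                  # NA is dropped from the counts
table(x, exclude = NULL)  # NA is counted as its own category

# For a factor, rebuild it so that NA becomes an explicit level
oldfactor <- factor(x)
newfactor <- factor(oldfactor, exclude = NULL)
levels(newfactor)
```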

xtabs
The xtabs( ) function allows you to create crosstabulations using formula style input.

# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence

If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies
(useful if the data have already been tabulated).
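
For example, when the counts are already tabulated, a sketch along these lines puts the count column on the left side of the formula. The data frame counts and its numbers are made up for illustration.

```r
# Already-tabulated counts (made-up numbers): one row per cell
counts <- data.frame(
  A    = c("yes", "yes", "no", "no"),
  B    = c("low", "high", "low", "high"),
  Freq = c(12, 8, 5, 15)
)

# Freq on the left side supplies the cell counts
mytable <- xtabs(Freq ~ A + B, data = counts)
mytable
margin.table(mytable, 1)   # totals for each level of A
```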

CrossTable
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ
in SAS or CROSSTABS in SPSS. It has a wealth of options.

# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)

There are options to report percentages (row, column, cell), specify decimal places, produce
Chi-square, Fisher, and McNemar tests of independence, report expected and residual values (pearson,
standardized, adjusted standardized), include missing values as valid, annotate with row and column
titles, and format as SAS or SPSS style output!
See help(CrossTable) for details.

Tests of Independence

Chi-Square Test
For 2-way tables you can use chisq.test(mytable) to test independence of the row and column variable.
By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic.
Optionally, the p-value can be derived via Monte Carlo simulation.

Fisher Exact Test
fisher.test(x) provides an exact test of independence. x is a two dimensional contingency table in
matrix form.
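
A minimal sketch of the chi-square and Fisher tests, using a table built from the built-in mtcars data (chosen here only so the example is self-contained):

```r
# 2-way table: number of cylinders by transmission type
mytable <- table(mtcars$cyl, mtcars$am)

# Asymptotic p-value (R may warn that some expected counts are small)
chisq.test(mytable)

# Monte Carlo p-value based on 2000 simulated tables
set.seed(1)   # make the simulation repeatable
chisq.test(mytable, simulate.p.value = TRUE, B = 2000)

# Exact test
fisher.test(mytable)
```

With small expected counts such as these, the simulated or exact p-value is usually the safer one to report.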

Mantel-Haenszel test
Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null
hypothesis that two nominal variables are conditionally independent in each stratum, assuming that
there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension
refers to the strata.

Loglinear Models
You can use the loglm( ) function in the MASS package to produce log-linear models. For example, let's
assume we have a 3-way contingency table based on variables A, B, and C.

library(MASS)
mytable <- xtabs(~A+B+C, data=mydata)

We can perform the following tests:

Mutual Independence: A, B, and C are pairwise independent.

loglm(~A+B+C, mytable)

Partial Independence: A is partially independent of B and C (i.e., A is independent of the composite
variable BC).

loglm(~A+B+C+B*C, mytable)

Conditional Independence: A is independent of B, given C.

loglm(~A+B+C+A*C+B*C, mytable)

No Three-Way Interaction

loglm(~A+B+C+A*B+A*C+B*C, mytable)

Martin Theus and Stephan Lauer have written an excellent article on Visualizing Loglinear Models, using
mosaic plots. There is also a great tutorial example by Kevin Quinn on analyzing loglinear models via glm.

Measures of Association
The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency
coefficient, and Cramer's V for an rxc table. The Kappa(mytable) function in the vcd package calculates
Cohen's kappa and weighted kappa for a confusion matrix. (Note the capital K: kappa( ) in base R
computes a matrix condition number instead.) See Richard Darlington's article on Measures
of Association in Crosstab Tables for an excellent review of these statistics.

Visualizing results
Use bar and pie charts for visualizing frequencies in one dimension.

Use the vcd package for visualizing relationships among categorical data (e.g. mosaic and association
plots).

Use the ca package for correspondence analysis (visually exploring relationships between rows and
columns in contingency tables).

Converting Frequency Tables to an "Original" Flat File
Finally, there may be times that you will need the original "flat file" data frame rather than the
frequency table. Marc Schwartz has provided code on the R-help mailing list for converting a table back
into a data frame.
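
The idea can be sketched in base R as follows. This is not Marc Schwartz's code, just a minimal illustration of the same conversion, using a table built from mtcars.

```r
# A 2-way frequency table from the built-in mtcars data
mytable <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# One row per cell, with the count in a Freq column
cells <- as.data.frame(mytable)

# Repeat each cell's row Freq times, then drop the count column
flat <- cells[rep(seq_len(nrow(cells)), cells$Freq), c("cyl", "gear")]
rownames(flat) <- NULL

nrow(flat)   # one row per original observation
```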
Nonparametric Tests of Group Differences
R provides functions for carrying out Mann-Whitney U, Wilcoxon Signed Rank, Kruskal Wallis, and
Friedman tests.

# independent 2-group Mann-Whitney U Test
wilcox.test(y~A)
# where y is numeric and A is a binary factor

# independent 2-group Mann-Whitney U Test
wilcox.test(y,x) # where y and x are numeric

# dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric

# Kruskal Wallis Test One Way Anova by Ranks
kruskal.test(y~A) # where y is numeric and A is a factor

# Randomized Block Design - Friedman Test
friedman.test(y~A|B)
# where y are the data values, A is a grouping factor
# and B is a blocking factor

For wilcox.test( ) you can use the alternative="less" or alternative="greater" option to specify a
one-tailed test.

Parametric and resampling alternatives are available.

The package npmc provides nonparametric multiple comparisons. (Note: This package has been
withdrawn but is still available in the CRAN archives.)

library(npmc)
npmc(x)
# where x is a data frame containing variable 'var'
# (response variable) and 'class' (grouping variable)
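
To try the base R tests on real values, here is a short sketch using the built-in mtcars data. The choice of mpg, am, and cyl is an assumption made for illustration.

```r
# Mann-Whitney U: mileage for automatic vs. manual transmissions
# exact = FALSE requests the normal approximation, since mpg contains ties
w <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
w

# Kruskal-Wallis: mileage across the three cylinder groups
kw <- kruskal.test(mpg ~ factor(cyl), data = mtcars)
kw
```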

Visualizing Results
Use box plots or density plots to visualize group differences.
Power Analysis

Overview
Power analysis is an important aspect of experimental design. It allows us to determine the sample size
required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us
to determine the probability of detecting an effect of a given size with a given level of confidence,
under sample size constraints. If the probability is unacceptably low, we would be wise to alter or
abandon the experiment.

The following four quantities have an intimate relationship:

1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 - P(Type II error) = probability of finding an effect that is there

Given any three, we can determine the fourth.
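
The "any three determine the fourth" relationship can be sketched with power.t.test( ) from base R's stats package (used here instead of an add-on package so the example runs without installing anything). The effect size of 0.5 standard deviations is an arbitrary illustration.

```r
# Leave n out: solve for the sample size per group
need_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
ceiling(need_n$n)

# Leave power out: solve for power at that rounded-up sample size
got_power <- power.t.test(n = ceiling(need_n$n), delta = 0.5, sd = 1,
                          sig.level = 0.05)
got_power$power
```

Whichever of the four arguments is omitted (or set to NULL) is the one that gets calculated.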

Power Analysis in R
The pwr package, developed by Stéphane Champely, implements power analysis as outlined by Cohen
(1988). Some of the more important functions are listed below.

function          power calculations for
pwr.2p.test       two proportions (equal n)
pwr.2p2n.test     two proportions (unequal n)
pwr.anova.test    balanced one way ANOVA
pwr.chisq.test    chi-square test
pwr.f2.test       general linear model
pwr.p.test        proportion (one sample)
pwr.r.test        correlation
pwr.t.test        t-tests (one sample, 2 sample, paired)
pwr.t2n.test      t-test (two samples with unequal n)

For each of these functions, you enter three of the four quantities (effect size, sample size,
significance level, power) and the fourth is calculated.

The significance level defaults to 0.05. Therefore, to calculate the significance level, given an effect
size, sample size, and power, use the option "sig.level=NULL".

Specifying an effect size can be a daunting task. ES formulas and Cohen's suggestions (based on social
science research) are provided below. Cohen's suggestions should only be seen as very rough guidelines.
Your own subject matter experience should be brought to bear.

t-tests
For t-tests, use the following functions:

pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))

where n is the sample size, d is the effect size, and type indicates a two-sample t-test, one-sample
t-test or paired t-test. If you have unequal sample sizes, use

pwr.t2n.test(n1 = , n2 = , d = , sig.level = , power = )

where n1 and n2 are the sample sizes.

For t-tests, the effect size is assessed as

$$d = \frac{|\mu_1 - \mu_2|}{\sigma}$$

where µ1 = mean of group 1
      µ2 = mean of group 2
      σ² = common error variance

Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.

You can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two-tailed test is the default.
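
When pilot data are available, d can be estimated from them. The sketch below uses two made-up groups and takes the pooled standard deviation as the estimate of the common standard deviation. That estimator is an assumption on our part, and other choices are possible.

```r
# Two made-up groups of scores
g1 <- c(5.1, 4.9, 6.2, 5.8, 5.5, 6.0)
g2 <- c(4.2, 4.8, 5.0, 4.4, 5.1, 4.6)

# Pooled standard deviation as the estimate of the common sigma
n1 <- length(g1); n2 <- length(g2)
pooled_sd <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))

# Effect size: absolute mean difference in standard deviation units
d <- abs(mean(g1) - mean(g2)) / pooled_sd
d
```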

ANOVA
For a one-way analysis of variance use

pwr.anova.test(k = , n = , f = , sig.level = , power = )

where k is the number of groups and n is the common sample size in each group.

For a one-way ANOVA effect size is measured by f where

$$f = \sqrt{\frac{\sum_{i=1}^{k} p_i \,(\mu_i - \mu)^2}{\sigma^2}}$$

where p_i = n_i / N,
      n_i = number of observations in group i
      N = total number of observations
      µ_i = mean of group i
      µ = grand mean
      σ² = error variance within groups

Cohen suggests that f values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes
respectively.

Correlations
For correlation coefficients use

pwr.r.test(n = , r = , sig.level = , power = )

where n is the sample size and r is the correlation. We use the population correlation coefficient as
the effect size measure. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium,
and large effect sizes respectively.

Linear Models
For linear models (e.g., multiple regression) use

pwr.f2.test(u = , v = , f2 = , sig.level = , power = )

where u and v are the numerator and denominator degrees of freedom. We use f2 as the effect size
measure.

$$f^2 = \frac{R^2}{1 - R^2}$$

where R² = population squared multiple correlation

$$f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}}$$

where R²_A = variance accounted for in the population by variable set A
      R²_AB = variance accounted for in the population by variable set A and B together

The first formula is appropriate when we are evaluating the impact of a set of predictors on an
outcome. The second formula is appropriate when we are evaluating the impact of one set of predictors
above and beyond a second set of predictors (or covariates). Cohen suggests f2 values of 0.02, 0.15,
and 0.35 represent small, medium, and large effect sizes.
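
Both versions of f2 can be sketched from fitted models. The example below uses sample R-squared values from mtcars as stand-ins for the population quantities, with wt playing the role of set A and hp the role of set B (both choices are illustrative assumptions).

```r
# Set A: weight only; sets A and B together: weight plus horsepower
fit_A  <- lm(mpg ~ wt, data = mtcars)
fit_AB <- lm(mpg ~ wt + hp, data = mtcars)
r2_A  <- summary(fit_A)$r.squared
r2_AB <- summary(fit_AB)$r.squared

# f2 for the full predictor set: R2 / (1 - R2)
f2_full <- r2_AB / (1 - r2_AB)

# f2 for hp above and beyond wt: (R2_AB - R2_A) / (1 - R2_AB)
f2_added <- (r2_AB - r2_A) / (1 - r2_AB)

c(f2_full = f2_full, f2_added = f2_added)
```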

Tests of Proportions
When comparing two proportions use

pwr.2p.test(h = , n = , sig.level = , power = )

where h is the effect size and n is the common sample size in each group.

$$h = 2\arcsin\left(\sqrt{p_1}\right) - 2\arcsin\left(\sqrt{p_2}\right)$$

Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
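
The effect size h is easy to compute by hand. In this sketch the two proportions (0.65 and 0.50) are made-up values.

```r
# Two hypothetical proportions
p1 <- 0.65
p2 <- 0.50

# Arcsine-transformed difference between the proportions
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
round(h, 3)
```

Because of the arcsine transformation, the same raw difference in proportions gives a larger h near 0 or 1 than near 0.5.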

For unequal n's use

pwr.2p2n.test(h = , n1 = , n2 = , sig.level = , power = )

To test a single proportion use

pwr.p.test(h = , n = , sig.level = , power = )

For both two sample and one sample proportion tests, you can specify alternative="two.sided", "less", or
"greater" to indicate a two-tailed, or one-tailed test. A two-tailed test is the default.

Chi-square Tests
For chi-square tests use

pwr.chisq.test(w = , N = , df = , sig.level = , power = )

where w is the effect size, N is the total sample size, and df is the degrees of freedom. The effect size
w is defined as

$$w = \sqrt{\sum_{i=1}^{m} \frac{(p_{0i} - p_{1i})^2}{p_{0i}}}$$

where p0i = cell probability in ith cell under H0
      p1i = cell probability in ith cell under H1
      m = number of cells

Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes
respectively.

Some Examples
library(pwr)

# For a one-way ANOVA comparing 5 groups, calculate the
# sample size needed in each group to obtain a power of
# 0.80, when the effect size is moderate (0.25) and a
# significance level of 0.05 is employed.

pwr.anova.test(k=5,f=.25,sig.level=.05,power=.8)

# What is the power of a one-tailed t-test, with a
# significance level of 0.01, 25 people in each group,
# and an effect size equal to 0.75?

pwr.t.test(n=25,d=0.75,sig.level=.01,alternative="greater")

# Using a two-tailed test of proportions, and assuming a
# significance level of 0.01 and a common sample size of
# 30 for each proportion, what effect size can be detected
# with a power of .75?

pwr.2p.test(n=30,sig.level=0.01,power=0.75)

Creating Power or Sample Size Plots
The functions in the pwr package can be used to generate power and sample size graphs.

# Plot sample size curves for detecting correlations of
# various sizes.

library(pwr)

# range of correlations
r <- seq(.1,.5,.01)
nr <- length(r)

# power values
p <- seq(.4,.9,.1)
np <- length(p)

# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr,np))
for (i in 1:np){
  for (j in 1:nr){
    result <- pwr.r.test(n = NULL, r = r[j],
      sig.level = .05, power = p[i],
      alternative = "two.sided")
    samsize[j,i] <- ceiling(result$n)
  }
}

# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
  xlab="Correlation Coefficient (r)",
  ylab="Sample Size (n)" )

# add power curves
for (i in 1:np){
  lines(r, samsize[,i], type="l", lwd=2, col=colors[i])
}

# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2,
  col="grey89")
title("Sample Size Estimation for Correlation Studies\n
  Sig=0.05 (Two-tailed)")
legend("topright", title="Power", as.character(p),
  fill=colors)
Regression Diagnostics
An excellent review of regression diagnostics is provided in John Fox's aptly named Overview of
Regression Diagnostics. Dr. Fox's car package provides advanced utilities for regression modeling.

# Assume that we are fitting a multiple linear regression
# on the MTCARS data
library(car)
fit <- lm(mpg~disp+hp+wt+drat, data=mtcars)

This example is for exposition only. We will ignore the fact that this may not be a great way of
modeling this particular set of data!

Outliers

# Assessing Outliers
outlierTest(fit) # Bonferroni p-value for most extreme obs
qqPlot(fit, main="QQ Plot") # qq plot for studentized resid
leveragePlots(fit) # leverage plots

Influential Observations

# Influential Observations
# added variable plots
avPlots(fit)
# Cook's D plot
# identify D values > 4/(n-k-1)
# (length(fit$coefficients) counts the intercept, so it equals k+1)
cutoff <- 4/(nrow(mtcars)-length(fit$coefficients))
plot(fit, which=4, cook.levels=cutoff)
# Influence Plot
# (car 3.0 and later write this option as id=list(method="identify"))
influencePlot(fit, id.method="identify", main="Influence Plot",
  sub="Circle size is proportional to Cook's Distance" )
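
The same 4/(n-k-1) rule of thumb can be applied without any plotting, using cooks.distance( ) from base R. This is a minimal sketch of the cutoff described in the comments above, not a replacement for the car plots.

```r
fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)

n <- nrow(mtcars)
k <- length(coef(fit)) - 1   # number of predictors, excluding the intercept
cutoff <- 4 / (n - k - 1)

d <- cooks.distance(fit)
d[d > cutoff]                # observations exceeding the rule-of-thumb cutoff
```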

Non-normality
# Normality of Residuals
# qq plot for studentized resid
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
  main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid),max(sresid),length=40)
yfit <- dnorm(xfit)
lines(xfit, yfit)

Non-constant Error Variance
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
# plot studentized residuals vs. fitted values
spreadLevelPlot(fit)

Multi-collinearity
# Evaluate Collinearity
vif(fit) # variance inflation factors
sqrt(vif(fit)) > 2 # problem?
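
To see what a variance inflation factor measures, it can be computed by hand in base R: regress each predictor on the others and take 1/(1 - R²). The sketch below is for illustration and assumes the same model as above.

```r
predictors <- c("disp", "hp", "wt", "drat")

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the rest
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

vif_by_hand
sqrt(vif_by_hand) > 2   # same rule of thumb as above
```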
Nonlinearity
# Evaluate Nonlinearity
# component + residual plot
crPlots(fit)
# Ceres plots
ceresPlots(fit)

Non-independence of Errors
# Test for Autocorrelated Errors
durbinWatsonTest(fit)

Additional Diagnostic Help
The gvlma( ) function in the gvlma package performs a global validation of linear model assumptions
as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.

# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)

Going Further
If you would like to delve deeper into regression diagnostics, two books written by John Fox can help:
Applied regression analysis and generalized linear models (2nd ed) and An R and S-Plus companion to
applied regression.
Multiple (Linear) Regression
R provides comprehensive support for multiple linear regression. The topics below are provided in order
of increasing complexity.

Fitting the Model

# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results

# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
influence(fit) # regression diagnostics

Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)

For a more comprehensive evaluation of model fit see regression diagnostics.

Comparing Models
You can compare nested models with the anova( ) function. The following code provides a simultaneous
test that x3 and x4 add to linear prediction above and beyond x1 and x2.

# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
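
Here is a runnable version of the same comparison on the built-in mtcars data. The choice of predictors is arbitrary and only for illustration.

```r
# Does adding drat and qsec improve on a model with wt and hp?
fit_full    <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- anova(fit_reduced, fit_full)
comparison
comparison$Df[2]   # number of coefficients tested jointly
```

The F test in the second row of the table is the simultaneous test of the added predictors.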

Cross Validation
You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.

# K-fold cross-validation
library(DAAG)
cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation

Sum the MSE for each fold, divide by the number of observations, and take the square root to get the
cross-validated standard error of estimate.
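
To make the procedure concrete, the sketch below carries out 3-fold cross-validation by hand in base R on mtcars. It does not use DAAG, and it takes "sum the MSE for each fold" to mean summing the squared prediction errors over all held-out observations, which is our reading of that sentence.

```r
set.seed(42)                                   # reproducible fold assignment
k <- 3
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

sq_err <- numeric(nrow(mtcars))
for (f in 1:k) {
  test_rows <- folds == f
  fit_f <- lm(mpg ~ wt + hp, data = mtcars[!test_rows, ])   # fit on the other folds
  pred  <- predict(fit_f, newdata = mtcars[test_rows, ])    # predict the held-out fold
  sq_err[test_rows] <- (mtcars$mpg[test_rows] - pred)^2
}

# Sum of squared errors / number of observations, then the square root
cv_se <- sqrt(sum(sq_err) / nrow(mtcars))
cv_se
```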

You can assess R2 shrinkage via K-fold cross-validation. Using the crossval() function from the
bootstrap package, do the following:

# Assessing R2 shrinkage using 10-Fold Cross-Validation

fit <- lm(y~x1+x2+x3,data=mydata)

library(bootstrap)
# define functions
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

# matrix of predictors
X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of observed response values
y <- as.matrix(mydata[c("y")])

results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)
cor(y, fit$fitted.values)**2 # raw R2
cor(y,results$cv.fit)**2 # cross-validated R2

Variable Selection
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial
topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from
the MASS package. stepAIC( ) performs stepwise model selection by exact AIC.

# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results

Alternatively, you can perform all-subsets regression using the regsubsets( ) function from the leaps
package. In the following code nbest indicates the number of subsets of each size to report. Here, the
ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.).

# All Subsets Regression
library(leaps)
attach(mydata)
leaps <- regsubsets(y~x1+x2+x3+x4,data=mydata,nbest=10)
# view results
summary(leaps)
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps, scale="r2")
# plot statistic by subset size
library(car)
subsets(leaps, statistic="rsq")

Other options for plot( ) are bic, Cp, and adjr2. Other options for plotting with
subsets( ) are bic, cp, adjr2, and rss.

Relative Importance
The relaimpo package provides measures of relative importance for each of the predictors in the
model. See help(calc.relimp) for details on the four measures of relative importance provided.

# Calculate Relative Importance for Each Predictor
library(relaimpo)
calc.relimp(fit, type=c("lmg","last","first","pratt"),
  rela=TRUE)

# Bootstrap Measures of Relative Importance (1000 samples)
boot <- boot.relimp(fit, b = 1000, type = c("lmg",
  "last", "first", "pratt"), rank = TRUE,
  diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
plot(booteval.relimp(boot,sort=TRUE)) # plot result

Graphic Enhancements
The car package offers a wide variety of plots for regression, including added variable plots, and
enhanced diagnostic and scatter plots.

Going Further

Nonlinear Regression
The nls( ) function in the stats package provides nonlinear least-squares regression. See John Fox's
Nonlinear Regression and Nonlinear Least Squares for an overview. Huet and colleagues' Statistical Tools
for Nonlinear Regression: A Practical Guide with S-PLUS and R Examples is a valuable reference book.

Robust Regression
There are many functions in R to aid with robust regression. For example, you can perform robust
regression with the rlm( ) function in the MASS package. John Fox's (who else?) Robust Regression
provides a good starting overview. The UCLA Statistical Computing website has Robust Regression
Examples.

The robust package provides a comprehensive library of robust methods, including regression. The
robustbase package also provides basic robust statistics including model selection methods. And David
Olive has provided a detailed online review of Applied Robust Statistics with sample R code.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Resampling Statistics
The coin package provides the ability to perform a wide variety of re-randomization or permutation
based statistical tests. These tests do not assume random sampling from well-defined populations. They
can be a reasonable alternative to classical procedures when test assumptions can not be met. See
coin: A Computational Framework for Conditional Inference for details.

In the examples below, lower case letters represent numerical variables and upper case letters
represent categorical factors. Monte-Carlo simulation is available for all tests. Exact tests are
available for 2 group procedures.

Independent Two- and K-Sample Location Tests

# Exact Wilcoxon Mann Whitney Rank Sum Test
# where y is numeric and A is a binary factor
library(coin)
wilcox_test(y~A, data=mydata, distribution="exact")

# One-Way Permutation Test based on 9999 Monte-Carlo
# resamplings. y is numeric and A is a categorical factor
library(coin)
oneway_test(y~A, data=mydata,
  distribution=approximate(B=9999))
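
The logic behind a Monte-Carlo permutation test can be sketched in a few lines of base R. This is a hand-rolled illustration of the re-randomization idea on mtcars, not what coin does internally.

```r
set.seed(123)
y <- mtcars$mpg
A <- factor(mtcars$am)

# Observed difference in group means
obs <- diff(tapply(y, A, mean))

# Shuffle the group labels 9999 times and recompute the difference each time
perm <- replicate(9999, diff(tapply(y, sample(A), mean)))

# Two-sided Monte-Carlo p-value (the observed value counts as one resample)
p_value <- (sum(abs(perm) >= abs(obs)) + 1) / (length(perm) + 1)
p_value
```

Shuffling the labels breaks any real association between y and A, so the shuffled differences show what chance alone would produce.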

Symmetry of a response for repeated measurements

# Exact Wilcoxon Signed Rank Test
# where y1 and y2 are repeated measures
library(coin)
wilcoxsign_test(y1~y2, data=mydata, distribution="exact")

# Friedman Test based on 9999 Monte-Carlo resamplings.
# y is numeric, A is a grouping factor, and B is a
# blocking factor.
library(coin)
friedman_test(y~A|B, data=mydata,
  distribution=approximate(B=9999))

Independence of Two Numeric Variables

# Spearman Test of Independence based on 9999 Monte-Carlo
# resamplings. x and y are numeric variables.
library(coin)
spearman_test(y~x, data=mydata,
  distribution=approximate(B=9999))

Independence in Contingency Tables
# Independence in 2-way Contingency Table based on
# 9999 Monte-Carlo resamplings. A and B are factors.
library(coin)
chisq_test(A~B, data=mydata,
  distribution=approximate(B=9999))

# Cochran-Mantel-Haenszel Test of 3-way Contingency Table
# based on 9999 Monte-Carlo resamplings. A, B, are factors
# and C is a stratifying factor.
# (cmh_test is coin's CMH test; mh_test is the marginal homogeneity test)
library(coin)
cmh_test(A~B|C, data=mydata,
  distribution=approximate(B=9999))

# Linear by Linear Association Test based on 9999
# Monte-Carlo resamplings. A and B are ordered factors.
library(coin)
lbl_test(A~B, data=mydata,
  distribution=approximate(B=9999))

Many other univariate and multivariate tests are possible using the functions in the coin package. See A
Lego System for Conditional Inference for more details.
t-tests
The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default
assumes unequal variance and applies the Welch df modification.

# independent 2-group t-test
t.test(y~x) # where y is numeric and x is a binary factor

# independent 2-group t-test
t.test(y1,y2) # where y1 and y2 are numeric

# paired t-test
t.test(y1,y2,paired=TRUE) # where y1 & y2 are numeric

# one sample t-test
t.test(y,mu=3) # Ho: mu=3

You can use the var.equal = TRUE option to specify equal variances and a pooled variance estimate.
You can use the alternative="less" or alternative="greater" option to specify a one-tailed test.
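
The difference between the default Welch test and the pooled-variance test shows up in the degrees of freedom. A short sketch on mtcars (chosen only for convenience):

```r
welch  <- t.test(mpg ~ am, data = mtcars)                    # default: Welch
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)  # pooled variance

welch$parameter    # fractional degrees of freedom
pooled$parameter   # n1 + n2 - 2
```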

Nonparametric and resampling alternatives to t-tests are available.

Visualizing Results
Use box plots or density plots to visualize group differences.
Using with( ) and by( )
There are two functions that can help write simpler and more efficient code.

With
The with( ) function applies an expression to a dataset. It is similar to DATA= in SAS.

# with(data, expression)
# example applying a t-test to a data frame mydata
with(mydata, t.test(y ~ group))

By
The by( ) function applies a function to each level of a factor or factors. It is similar to BY processing in
SAS.

# by(data, factorlist, function)
# example obtain variable means separately for
# each level of byvar in data frame mydata
# (colMeans on the numeric columns, since mean( ) does not accept a data frame)
by(mydata, mydata$byvar, function(x) colMeans(x[sapply(x, is.numeric)]))
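
Both functions can be tried directly on the built-in mtcars data. The variables used here are arbitrary choices for illustration.

```r
# with( ): refer to columns without repeating the data frame name
with(mtcars, cor(mpg, wt))

# by( ): column means of mpg, hp, and wt for each number of cylinders
means_by_cyl <- by(mtcars[, c("mpg", "hp", "wt")], mtcars$cyl, colMeans)
means_by_cyl

# results for a single group can be pulled out by level name
means_by_cyl[["8"]]["mpg"]
```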

Axes and Text
Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options
(as well as other graphical parameters). For example

# Specify axis options within plot()
plot(x, y, main="title", sub="subtitle",
  xlab="X-axis label", ylab="y-axis label",
  xlim=c(xmin, xmax), ylim=c(ymin, ymax))

For finer control or for modularization, you can use the functions described below.

Titles
Use the title( ) function to add labels to a plot.

title(main="main title", sub="sub-title",
  xlab="x-axis label", ylab="y-axis label")

Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in
the title( ) function.

# Add a red title and a blue subtitle. Make x and y
# labels 25% smaller than the default and green.
title(main="My Title", col.main="red",
  sub="My sub-title", col.sub="blue",
  xlab="My X label", ylab="My Y label",
  col.lab="green", cex.lab=0.75)

Text Annotations
Text can be added to graphs using the text( ) and mtext( ) functions. text( ) places text within the
graph while mtext( ) places text in one of the four margins.

text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)

Common options are described below.

option    description
location  location can be an x,y coordinate. Alternatively, the text can be placed
          interactively via mouse by specifying location as locator(1).
pos       position relative to location. 1=below, 2=left, 3=above, 4=right. If you
          specify pos, you can specify offset= in percent of character width.
side      which margin to place text. 1=bottom, 2=left, 3=top, 4=right. You can
          specify line= to indicate the line in the margin starting with 0 and moving
          out. You can also specify adj=0 for left/bottom alignment or adj=1 for
          top/right alignment.

Other common options are cex, col, and font (for size, color, and font style respectively).
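
A short sketch combining both functions on a scatter plot of the built-in mtcars data (the coordinates and label text are arbitrary):

```r
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Mileage")

# text( ): inside the plot region, just above the point (wt = 4, mpg = 30)
text(4, 30, "heavier cars, lower mileage", pos = 3, cex = 0.8, col = "red")

# mtext( ): in the top margin (side = 3), one line out, left-aligned
mtext("Source: mtcars", side = 3, line = 1, adj = 0, cex = 0.8)
```

text( ) positions are in the units of the plotted data, while mtext( ) positions are counted in margin lines.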

Labeling points
You can use the text( ) function (see above) for labeling points as well as for adding other text
annotations. Specify location as a set of x, y coordinates and specify the text to place as a vector of
labels. The x, y, and label vectors should all be the same length.

# Example of labeling points
attach(mtcars)
plot(wt, mpg, main="Mileage vs. Car Weight",
     xlab="Weight", ylab="Mileage", pch=18, col="blue")
text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red")

Math Annotations
You can add mathematical formulas to a graph using TeX-like rules. See help(plotmath) for details
and examples.
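
A short sketch of plotmath in use: the title and an in-plot label are built with expression( ). The particular formula, its position, and the names main_label and formula_label are assumptions chosen for illustration.

```r
# Title and formula written with plotmath syntax
main_label <- expression(paste("Normal density with ", mu == 0, ", ", sigma == 1))
formula_label <- expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2})

# Plot the standard normal density
curve(dnorm(x), from=-4, to=4, ylab="density", main=main_label)

# Add the formula inside the plot, to the right of (-4, 0.3)
text(-4, 0.3, formula_label, pos=4)
```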

Axes
You can create custom axes using the axis() function.

axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)

where

option    description
side      an integer indicating the side of the graph to draw the axis (1=bottom, 2=left,
          3=top, 4=right)
at        a numeric vector indicating where tick marks should be drawn
labels    a character vector of labels to be placed at the tick marks
          (if NULL, the at values will be used)
pos       the coordinate at which the axis line is to be drawn
          (i.e., the value on the other axis where it crosses)
lty       line type
col       the line and tick mark color
las       labels are parallel (=0) or perpendicular (=2) to axis
tck       length of tick mark as fraction of plotting region (negative number is outside
          graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines); default
          is -0.01
(...)     other graphical parameters

If you are going to create a custom axis, you should suppress the axis automatically generated by your
high level plotting function. The option axes=FALSE suppresses both x and y axes. xaxt="n" and
yaxt="n" suppress the x and y axis respectively. Here is a (somewhat overblown) example.

# A Silly Axis Example

# specify the data
x <- c(1:10); y <- x; z <- 10/x

# create extra margin room on the right for an axis
par(mar=c(5, 4, 4, 8) + 0.1)

# plot x vs. y
plot(x, y, type="b", pch=21, col="red",
     yaxt="n", lty=3, xlab="", ylab="")

# add x vs. 10/x
lines(x, z, type="b", pch=22, col="blue", lty=2)

# draw an axis on the left
axis(2, at=x, labels=x, col.axis="red", las=2)

# draw an axis on the right, with smaller text and ticks
axis(4, at=z, labels=round(z, digits=2),
     col.axis="blue", las=2, cex.axis=0.7, tck=-.01)

# add a title for the right axis
mtext("y=10/x", side=4, line=3, cex.lab=1, las=2, col="blue")

# add a main title and bottom and left axis labels
title("An Example of Creative Axes", xlab="X values",
      ylab="Y=X")

Minor Tick Marks
The minor.tick() function in the Hmisc package adds minor tick marks.

# Add minor tick marks
library(Hmisc)
minor.tick(nx=n, ny=n, tick.ratio=n)

nx is the number of minor tick marks to place between x-axis major tick marks.
ny does the same for the y-axis. tick.ratio is the size of the minor tick mark relative to the major tick
mark. The length of the major tick mark is retrieved from par("tck").

Reference Lines
Add reference lines to a graph using the abline( ) function.

abline(h=yvalues, v=xvalues)

Other graphical parameters (such as line type, color, and width) can also be specified in the abline( )
function.

# add solid horizontal lines at y=1,5,7
abline(h=c(1,5,7))
# add dashed blue vertical lines at x = 1,3,5,7,9
abline(v=seq(1,10,2), lty=2, col="blue")

Note: You can also use the grid( ) function to add reference lines.
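
The snippets above assume an existing plot. A self-contained sketch, assuming the mtcars data and arbitrary colors, that combines grid( ), abline( ) at fixed values, and abline( ) applied to a fitted linear model:

```r
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Mileage")

# dotted light-gray grid at the default tick positions
grid(col="lightgray", lty="dotted")

# dashed reference lines at the sample means
abline(h=mean(mtcars$mpg), v=mean(mtcars$wt), lty=2, col="blue")

# abline() also accepts a fitted model and draws its regression line
fit <- lm(mpg ~ wt, data=mtcars)
abline(fit, col="red", lwd=2)
```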

Legend
Add a legend with the legend() function.

legend(location, title, legend, ...)

Common options are described below.

option      description
location    There are several ways to indicate the location of the legend. You can give
            an x,y coordinate for the upper left hand corner of the legend. You can use
            locator(1), in which case you use the mouse to indicate the location of the
            legend. You can also use the keywords "bottom", "bottomleft", "left",
            "topleft", "top", "topright", "right", "bottomright", or "center". If you use a
            keyword, you may want to use inset= to specify an amount to move the
            legend into the graph (as fraction of plot region).
title       A character string for the legend title (optional)
legend      A character vector with the labels
...         Other options. If the legend labels colored lines, specify col= and a vector of
            colors. If the legend labels point symbols, specify pch= and a vector of point
            symbols. If the legend labels line width or line style, use lwd= or lty= and a
            vector of widths or styles. To create colored boxes for the legend (common
            in bar, box, or pie charts), use fill= and a vector of colors.

Other common legend options include bty for box type, bg for background color, cex for size, and
text.col for text color. Setting horiz=TRUE sets the legend horizontally rather than vertically.

# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Mileage by Car Weight",
        yaxt="n", xlab="Mileage", horizontal=TRUE,
        col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
       c("4", "6", "8"), fill=terrain.colors(3), horiz=TRUE)

For more on legends, see help(legend). The examples in the help are particularly informative.
Correlograms
Correlograms help us visualize the data in correlation matrices. For details, see Corrgrams:
Exploratory displays for correlation matrices.

In R, correlograms are implemented through the corrgram(x, order = , panel=, lower.panel=,
upper.panel=, text.panel=, diag.panel=) function in the corrgram package.

Options
x is a data frame with one observation per row.

order=TRUE will cause the variables to be ordered using principal component analysis of the
correlation matrix.

panel= refers to the off-diagonal panels. You can use lower.panel= and upper.panel= to choose
different options below and above the main diagonal respectively. text.panel= and diag.panel= refer
to the main diagonal. Allowable parameters are given below.

off diagonal panels
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)

main diagonal panels
panel.minmax (min and max values of the variable)
panel.txt (variable name)

# First Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Car Mileage Data in PC2/PC1 Order")

# Second Correlogram Example
library(corrgram)
corrgram(mtcars, order=TRUE, lower.panel=panel.ellipse,
         upper.panel=panel.pts, text.panel=panel.txt,
         diag.panel=panel.minmax,
         main="Car Mileage Data in PC2/PC1 Order")

# Third Correlogram Example
library(corrgram)
corrgram(mtcars, order=FALSE, lower.panel=panel.shade,
         upper.panel=NULL, text.panel=panel.txt,
         main="Car Mileage Data (unsorted)")

Changing the colors in a correlogram
You can control the colors in a correlogram by specifying 4 colors in the colorRampPalette( ) function
within the col.corrgram( ) function. Here is an example.

# Changing Colors in a Correlogram
library(corrgram)
col.corrgram <- function(ncol){
  colorRampPalette(c("darkgoldenrod4", "burlywood1",
                     "darkkhaki", "darkgreen"))(ncol)}
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Correlogram of Car Mileage Data (PC2/PC1 Order)")
Graphics with ggplot2
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating
elegant and complex plots. Its popularity in the R community has exploded in recent years. Originally
based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that
represent both univariate and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots
(i.e., conditioning) is relatively simple.

Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful
resources). There is a helper function called qplot() (for quick plot) that can hide much of this
complexity when creating standard graphs.

qplot()
The qplot() function can be used to create the most common graph types. While it does not expose
ggplot's full power, it can create a very wide range of useful plots. The format is:

qplot(x, y, data=, color=, shape=, size=,
      alpha=, geom=, method=, formula=,
      facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)

where the options are:

option        description
alpha         Alpha transparency for overlapping elements expressed as a fraction between 0
              (complete transparency) and 1 (complete opacity)
color,        Associates the levels of a variable with symbol color, shape, or size. For line plots,
shape,        color associates levels of a variable with line color. For density and box plots, fill
size, fill    associates fill colors with a variable. Legends are drawn automatically.
data          Specifies a data frame
facets        Creates a trellis graph by specifying conditioning variables. Its value is expressed as
              rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use
              rowvar~. or .~colvar
geom          Specifies the geometric objects that define the graph type. The geom option is
              expressed as a character vector with one or more entries. geom values include
              "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
main, sub     Character vectors specifying the title and subtitle
method,       If geom="smooth", a loess fit line and confidence limits are added by default. When
formula       the number of observations is greater than 1,000, a more efficient smoothing
              algorithm is employed. Methods include "lm" for regression, "gam" for generalized
              additive models, and "rlm" for robust regression. The formula parameter gives the
              form of the fit.

              For example, to add simple linear regression lines, you'd specify geom="smooth",
              method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a
              quadratic fit. Note that the formula uses the letters x and y, not the names of the
              variables.

              For method="gam", be sure to load the mgcv package. For method="rlm", load the
              MASS package.
x, y          Specifies the variables placed on the horizontal and vertical axis. For univariate
              plots (for example, histograms), omit y
xlab, ylab    Character vectors specifying horizontal and vertical axis labels
xlim, ylim    Two-element numeric vectors giving the minimum and maximum values for the
              horizontal and vertical axes, respectively

Notes:

• At present, ggplot2 cannot be used to create 3D graphs or mosaic plots.

• Use I(value) to indicate a specific value. For example, size=z makes the size of the plotted points or
lines proportional to the values of a variable z. In contrast, size=I(3) sets each point or line to three
times the default size.

Here are some examples using automotive data (car mileage, weight, number of gears, number of
cylinders, etc.) contained in the mtcars data frame.

# ggplot2 examples
library(ggplot2)

# create factors with value labels
mtcars$gear <- factor(mtcars$gear, levels=c(3,4,5),
                      labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am, levels=c(0,1),
                    labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl, levels=c(4,6,8),
                     labels=c("4cyl","6cyl","8cyl"))

# Kernel density plots for mpg
# grouped by number of gears (indicated by color)
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
      main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
      ylab="Density")

# Scatterplot of mpg vs. hp for each combination of gears and cylinders
# in each facet, transmission type is represented by shape and color
qplot(hp, mpg, data=mtcars, shape=am, color=am,
      facets=gear~cyl, size=I(3),
      xlab="Horsepower", ylab="Miles per Gallon")

# Separate regressions of mpg on weight for each number of cylinders
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
      method="lm", formula=y~x, color=cyl,
      main="Regression of MPG on Weight",
      xlab="Weight", ylab="Miles per Gallon")

# Boxplots of mpg by number of gears
# observations (points) are overlaid and jittered
qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),
      fill=gear, main="Mileage by Gear Number",
      xlab="", ylab="Miles per Gallon")

Customizing ggplot2 Graphs
Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par( )
function. They can be modified using the theme() function, and by adding graphic parameters within
the qplot() function. For greater control, use ggplot() and other functions provided by the package.
Note that ggplot2 functions can be chained with "+" signs to generate the final plot.

library(ggplot2)

p <- qplot(hp, mpg, data=mtcars, shape=am, color=am,
           facets=gear~cyl, main="Scatterplots of MPG vs. Horsepower",
           xlab="Horsepower", ylab="Miles per Gallon")

# White background and black grid lines
p + theme_bw()

# Large brown bold italics labels
# and legend placed at top of plot
p + theme(axis.title=element_text(face="bold.italic",
          size=12, color="brown"), legend.position="top")
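
For comparison, here is a hedged sketch of roughly the same faceted scatterplot built with ggplot() and chained layers instead of qplot(). It assumes ggplot2 is installed; the copy named cars_df and the object p2 are introduced only for this illustration, and the recoding is repeated so the block does not depend on earlier changes to mtcars.

```r
library(ggplot2)

# Work on a fresh copy so the example is self-contained
cars_df <- transform(datasets::mtcars,
                     am   = factor(am, levels=c(0, 1),
                                   labels=c("Automatic", "Manual")),
                     gear = factor(gear),
                     cyl  = factor(cyl))

# Each "+" adds a layer, a facet specification, labels, or a theme
p2 <- ggplot(cars_df, aes(x=hp, y=mpg, shape=am, color=am)) +
  geom_point(size=3) +
  facet_grid(gear ~ cyl) +
  labs(title="Scatterplots of MPG vs. Horsepower",
       x="Horsepower", y="Miles per Gallon") +
  theme_bw() +
  theme(legend.position="top")

print(p2)
```

Building the plot in layers makes it easier to change one piece, such as the theme, without touching the rest.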


Going Further
We have only scratched the surface here. To learn more, see the ggplot2 reference site, and Winston
Chang's excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data
Analysis is still the definitive book on this subject.
Interactive Graphics
There are several ways to interact with R graphics in real time. Three methods are described below.

GGobi
GGobi is an open source visualization program for exploring high-dimensional data. It is freely available
for MS Windows, Linux, and Mac platforms. It supports linked interactive scatterplots, barcharts,
parallel coordinate plots and tours, with both brushing and identification. A good tutorial is included
with the GGobi manual. You can download the software here.

Once GGobi is installed, you can use the ggobi( ) function in the package rggobi to run GGobi from
within R. This gives you interactive graphics access to all of your R data! See An Introduction to
RGGOBI.

# Interact with R data using GGobi
library(rggobi)
g <- ggobi(mydata)

iPlots
The iplots package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots
and histograms that can be linked and color brushed. iplots is implemented through the Java GUI for R.
For more information, see the iplots website.

# Install iplots
install.packages("iplots", dep=TRUE)

# Create some linked plots
library(iplots)
cyl.f <- factor(mtcars$cyl)
gear.f <- factor(mtcars$gear)
attach(mtcars)
ihist(mpg) # histogram
ibar(carb) # barchart
iplot(mpg, wt) # scatter plot
ibox(mtcars[c("qsec", "disp", "hp")]) # boxplots
ipcp(mtcars[c("mpg", "wt", "hp")]) # parallel coordinates
imosaic(cyl.f, gear.f) # mosaic plot

On Windows platforms, hold down the Ctrl key and move the mouse over each graph to get identifying
information from points, bars, etc.

Interacting with Plots (Identifying Points)
R offers two functions for identifying points and coordinate locations in plots. With identify(), clicking
the mouse over points in a graph will display the row number or (optionally) the rowname for the point.
This continues until you select stop. With locator() you can add points or lines to the plot using the
mouse. The function returns a list of the (x,y) coordinates. Again, this continues until you select stop.

# Interacting with a scatterplot


attach(mydata)
plot(x, y) # scatterplot
identify(x, y, labels=row.names(mydata)) # identify points
coords <- locator(type="l") # add lines
coords # display list

Other Interactive Graphs
See scatterplots for a description of rotating 3D scatterplots in R.
Combining Plots
R makes it easy to combine multiple plots into one overall graph, using either the
par( ) or layout( ) function.

With the par( ) function, you can include the option mfrow=c(nrows, ncols) to create a matrix of
nrows x ncols plots that are filled in by row. mfcol=c(nrows, ncols) fills in the matrix by columns.

# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
plot(wt, mpg, main="Scatterplot of wt vs. mpg")
plot(wt, disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")

# 3 figures arranged in 3 rows and 1 column
attach(mtcars)
par(mfrow=c(3,1))
hist(wt)
hist(mpg)
hist(disp)

The layout( ) function has the form layout(mat) where
mat is a matrix object specifying the location of the N figures to plot.

# One figure in row 1 and two figures in row 2
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)

Optionally, you can include widths= and heights= options in the layout() function to control the size of
each figure more precisely. These options have the form
widths= a vector of values for the widths of columns
heights= a vector of values for the heights of rows.

Relative widths are specified with numeric values. Absolute widths (in centimetres) are specified with
the lcm() function.
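
A small sketch of mixing absolute and relative widths, assuming the default 7-inch graphics device: the left column is fixed at 5 cm with lcm( ) and the right column takes whatever width remains. The 5 cm value and the name nf are illustrative.

```r
# Left column is exactly 5 cm wide; the right column takes the remaining width
nf <- layout(matrix(c(1, 2), nrow=1), widths=c(lcm(5), 1))

# Outline the two figure regions to check the arrangement
layout.show(nf)

# Fill the two regions in order
boxplot(mtcars$mpg, main="mpg")
hist(mtcars$mpg, main="Histogram of mpg")
```

layout( ) returns the number of figures it defines, which is why nf can be passed straight to layout.show( ).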

# One figure in row 1 and two figures in row 2
# row 1 is 1/2 the height of row 2
# column 2 is 1/3 the width of column 1
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
       widths=c(3,1), heights=c(1,2))
hist(wt)
hist(mpg)
hist(disp)

See help(layout) for more details.

Creating a figure arrangement with fine control
In the following example, two box plots are added to a scatterplot to create an enhanced graph.

# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8))
plot(mtcars$wt, mtcars$mpg, xlab="Car Weight",
     ylab="Miles Per Gallon")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8), new=TRUE)
boxplot(mtcars$mpg, axes=FALSE)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)

To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to
(1,1) in the upper right corner. The format of the fig= parameter is a numerical vector of the form
c(x1, x2, y1, y2). The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on
the y axis. The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis. I chose 0.55
rather than 0.8 so that the top figure will be pulled closer to the scatter plot. The right hand boxplot
goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis. Again, I chose a value to pull the right
hand boxplot closer to the scatterplot. You have to experiment to get it just right.

fig= starts a new plot, so to add to an existing plot use new=TRUE.

You can use this to combine several plots in any arrangement into one graph.
Visualizing Categorical Data
The vcd package provides a variety of methods for visualizing multivariate categorical data, inspired by
Michael Friendly's wonderful "Visualizing Categorical Data". Extended mosaic and association plots are
described here. Each provides a method of visualizing complex data and evaluating deviations from a
specified independence model. For more details, see The Strucplot Framework.

Mosaic Plots
For extended mosaic plots, use mosaic(x, condvar=, data=) where x is a table or formula, condvar= is
an optional conditioning variable, and data= specifies a data frame or a table. Include shade=TRUE to
color the figure, and legend=TRUE to display a legend for the Pearson residuals.

# Mosaic Plot Example
library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

Association Plots
To produce an extended association plot use assoc(x, row_vars, col_vars) where x is a contingency
table, row_vars is a vector of integers giving the indices of the variables to be used for the rows, and
col_vars is a vector of integers giving the indices of the variables to be used for the columns of the
association plot.

# Association Plot Example
library(vcd)
assoc(HairEyeColor, shade=TRUE)

Going Further
Both functions are complex and offer multiple input and output options. See help(mosaic) and
help(assoc) for more details.
Copyright © 2012 Robert I. Kabacoff, Ph.D.
Graphical Parameters
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.

One way is to specify these options through the par() function. If you set parameter values here, the
changes will be in effect for the rest of the session or until you change them again. The format is
par(optionname=value, optionname=value, ...)

# Set a graphical parameter using par()
par()                            # view current settings
opar <- par(no.readonly=TRUE)    # make a copy of the current settable settings
par(col.lab="red")               # red x and y labels
hist(mtcars$mpg)                 # create a plot with these new settings
par(opar)                        # restore original settings

A second way to specify graphical parameters is by providing the optionname=value pairs directly to a
high level plotting function. In this case, the options are only in effect for that specific graph.

# Set a graphical parameter within the plotting function
hist(mtcars$mpg, col.lab="red")

See the help for a specific high level plotting function (e.g. plot, hist, boxplot) to determine which
graphical parameters can be set this way.

The remainder of this section describes some of the more important graphical parameters that you can
set.

Text and Symbol Size
The following options can be used to control text and symbol size in graphs.

option      description
cex         number indicating the amount by which plotting text and symbols should be
            scaled relative to the default. 1=default, 1.5 is 50% larger, 0.5 is 50%
            smaller, etc.
cex.axis    magnification of axis annotation relative to cex
cex.lab     magnification of x and y labels relative to cex
cex.main    magnification of titles relative to cex
cex.sub     magnification of subtitles relative to cex

Plotting Symbols
Use the pch= option to specify symbols to use when plotting points. For symbols 21 through 25, specify
border color (col=) and fill color (bg=).
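
One way to see the available symbols is to draw them yourself. The sketch below lays out pch values 0 through 25 on a grid and labels each with its number; the grid layout, the yellow fill, and the variable names are illustrative choices.

```r
# Lay the 26 symbols out on a 6-column grid
pch_values <- 0:25
gx <- pch_values %% 6          # column position
gy <- 4 - pch_values %/% 6     # row position, first row at the top

# bg= only affects the fillable symbols 21 through 25
plot(gx, gy, pch=pch_values, cex=2, col="black", bg="yellow",
     xlim=c(-0.5, 5.5), ylim=c(-0.5, 4.5),
     axes=FALSE, xlab="", ylab="", main="plot symbols: pch =")

# Label each symbol with its pch number, just below it
text(gx, gy, labels=pch_values, pos=1, offset=1, cex=0.8)
```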
[Figure: plot symbols for pch = 0 through 25]

Lines
You can change lines using the following options. This is particularly useful for reference lines, axes,
and fit lines.

option    description
lty       line type. See the chart below.
lwd       line width relative to the default (default=1). 2 is twice as wide.

[Figure: line types for lty = 1 through 6]
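
The line types can also be drawn directly. This sketch, with illustrative axis limits and names, draws one horizontal line for each of lty = 1 through 6.

```r
line_types <- 1:6

# Empty plot with one row per line type
plot(1, type="n", xlim=c(0, 1), ylim=c(0.5, 6.5),
     xlab="", ylab="lty", xaxt="n", las=1, main="Line Types: lty=")

# abline() recycles h= and lty= together, one line per type
abline(h=line_types, lty=line_types, lwd=2)
```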

Colors
Options that specify colors include the following.

option      description
col         Default plotting color. Some functions (e.g. lines) accept a vector of values
            that are recycled.
col.axis    color for axis annotation
col.lab     color for x and y labels
col.main    color for titles
col.sub     color for subtitles
fg          plot foreground color (axes, boxes; also sets col= to the same value)
bg          plot background color

You can specify colors in R by index, name, hexadecimal, or RGB.
For example, col="white", col="#FFFFFF", and col=rgb(1,1,1) are equivalent. A numeric index
such as col=1 refers to the current palette( ), whose first entry is black by default.

The following chart was produced with code developed by Earl F. Glynn. See his Color Chart for all the
details you would ever need about using colors in R.
You can also create a vector of n contiguous colors using the functions rainbow(n), heat.colors(n),
terrain.colors(n), topo.colors(n), and cm.colors(n).

colors() returns all available color names.
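
A quick way to check that different color specifications agree is col2rgb( ), which converts any of them to red, green, and blue components. The sketch below uses pure red as an arbitrary example; the variable names are illustrative.

```r
# Three ways to name pure red; col2rgb() converts each to RGB components
by_name <- col2rgb("red")
by_hex  <- col2rgb("#FF0000")
by_rgb  <- col2rgb(rgb(1, 0, 0))
all(by_name == by_hex, by_hex == by_rgb)

# A numeric index refers to the current palette(), not to colors()
palette()[1]

# Five contiguous colors from a built-in ramp, shown as a bar chart
ramp <- heat.colors(5)
barplot(rep(1, 5), col=ramp, axes=FALSE)
```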

Fonts
You can easily set font size and style, but font family is a bit more complicated.

option       description
font         Integer specifying font to use for text.
             1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis    font for axis annotation
font.lab     font for x and y labels
font.main    font for titles
font.sub     font for subtitles
ps           font point size (roughly 1/72 inch)
             text size=ps*cex
family       font family for drawing text. Standard values are "serif", "sans", "mono",
             "symbol". Mapping is device dependent.

In Windows, mono is mapped to "TT Courier New", serif is mapped to "TT Times New Roman", sans is
mapped to "TT Arial", and symbol is mapped to "TT Symbol" (TT=True Type). You can add your own
mappings.

# Type family examples - creating new mappings (Windows only)
plot(1:10, 1:10, type="n")
windowsFonts(
  A=windowsFont("Arial Black"),
  B=windowsFont("Bookman Old Style"),
  C=windowsFont("Comic Sans MS"),
  D=windowsFont("Symbol")
)
text(3,3, "Hello World Default")
text(4,4, family="A", "Hello World from Arial Black")
text(5,5, family="B", "Hello World from Bookman Old Style")
text(6,6, family="C", "Hello World from Comic Sans MS")
text(7,7, family="D", "Hello World from Symbol")

Margins and Graph Size


You can control the margin size using the following parameters.

option description
mar numerical vector indicating margin size c(bottom, left, top, right) in lines.
default = c(5, 4, 4, 2) + 0.1
mai numerical vector indicating margin size c(bottom, left, top, right) in inches
pin plot dimensions (width, height) in inches
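
As a hedged illustration of the margin parameters, the sketch below widens the right margin, writes a note into it, and reads the margins back in both lines and inches. The margin values and variable names are arbitrary.

```r
opar <- par(no.readonly=TRUE)      # save settings that can be restored

par(mar=c(5, 4, 4, 6) + 0.1)       # widen the right margin to about 6 lines
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Mileage")
mtext("note in the right margin", side=4, line=3)

mar_lines  <- par("mar")           # margins in lines
mar_inches <- par("mai")           # the same margins in inches

par(opar)                          # restore the original settings
```

Setting mar updates mai automatically (and vice versa), so you only need to set one of them.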

For complete information on margins, see Earl F. Glynn's margin tutorial.

Going Further
See help(par) for more information on graphical parameters. The customization of plotting axes and
text annotations is covered in the next section.
Probability Plots
This section describes creating probability plots in R for both didactic purposes and for data analyses.

Probability Plots for Teaching and Demonstration
When I was a college professor teaching statistics, I used to have to draw normal distributions by hand.
They always came out looking like bunny rabbits. What can I say?

R makes it easy to draw probability distributions and demonstrate statistical concepts. Some of the
more common probability distributions available in R are given below.
Mosaic Plots
distribution      R name    distribution         R name
Beta              beta      Lognormal            lnorm
Binomial          binom     Negative Binomial    nbinom
Cauchy            cauchy    Normal               norm
Chisquare         chisq     Poisson              pois
Exponential       exp       Student t            t
F                 f         Uniform              unif
Gamma             gamma     Tukey                tukey
Geometric         geom      Weibull              weibull
Hypergeometric    hyper     Wilcoxon             wilcox
Logistic          logis

For a comprehensive list, see Statistical Distributions on the R wiki. The functions available for each
distribution follow this format:

name        description
dname( )    density or probability function
pname( )    cumulative distribution function
qname( )    quantile function
rname( )    random deviates

For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero).
qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution). rnorm(100)
generates 100 random deviates from a standard normal distribution.

Each function has parameters specific to that distribution. For example, rnorm(100, m=50, sd=10)
generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
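
The values quoted above can be checked at the console. This is a small sketch using the normal and binomial families; the seed and the binomial parameters are arbitrary choices for illustration.

```r
set.seed(1)                      # arbitrary seed so the draws are reproducible

d0  <- dnorm(0)                  # density at 0, equal to 1/sqrt(2*pi)
p0  <- pnorm(0)                  # area to the left of zero
q90 <- qnorm(0.9)                # 90th percentile of the standard normal
draws <- rnorm(100, mean=50, sd=10)

# p and q functions are inverses of each other
all.equal(pnorm(q90), 0.9)

# The same naming scheme applies to other distributions:
# P(X <= 3) for a binomial with 10 trials and success probability 0.5
pb <- pbinom(3, size=10, prob=0.5)
```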

You can use these functions to demonstrate various aspects of probability distributions. Two common
examples are given below.

# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Comparison of t Distributions")

for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}

legend("topright", inset=.05, title="Distributions",
       labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)

# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?

mean=100; sd=15
lb=80; ub=120

x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)

plot(x, hx, type="n", xlab="IQ Values", ylab="",
     main="Normal Distribution", axes=FALSE)

i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")

area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(",lb,"< IQ <",ub,") =",
                signif(area, digits=3))
mtext(result,3)
axis(1, at=seq(40, 160, 20), pos=0)

For a comprehensive view of probability plotting in R, see Vincent Zoonekynd's Probability Distributions.

Fitting Distributions
There are several methods of fitting distributions in R. Here are some options.

You can use the qqnorm() function to create a Quantile-Quantile plot evaluating the fit of sample data
to the normal distribution. More generally, the qqplot( ) function creates a Quantile-Quantile plot for
any theoretical distribution.

# Q-Q plots
par(mfrow=c(1,2))

# create sample data
x <- rt(100, df=3)

# normal fit
qqnorm(x); qqline(x)

# t(3Df) fit
qqplot(rt(1000,df=3), x, main="t(3) Q-Q Plot",
       ylab="Sample Quantiles")
abline(0,1)

The fitdistr( ) function in the MASS package provides maximum-likelihood fitting of univariate
distributions. The format is fitdistr(x, densityfunction) where x is the sample data and densityfunction
is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric",
"log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" or "weibull".

# Estimate parameters assuming log-normal distribution

# create some sample data
x <- rlnorm(100)

# estimate parameters
library(MASS)
fitdistr(x, "lognormal")

Finally, R has a wide range of goodness of fit tests for evaluating if it is reasonable to assume that a
random sample comes from a specified theoretical distribution. These include chi-square,
Kolmogorov-Smirnov, and Anderson-Darling.
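As a minimal sketch of one such test, the Kolmogorov-Smirnov test is available in base R as ks.test( ). The simulated sample, the seed, and the parameter values below are assumptions chosen only for illustration.

```r
# Kolmogorov-Smirnov goodness-of-fit test (illustrative sketch)
set.seed(42)                        # arbitrary seed, for reproducibility only
x <- rnorm(100, mean=100, sd=15)    # simulated sample (assumed data)

# compare the sample to a fully specified normal distribution
ks <- ks.test(x, "pnorm", mean=100, sd=15)
ks$statistic   # largest gap between the empirical and theoretical CDFs
ks$p.value     # a large p-value gives no evidence against the normal model
```

Note that the reference distribution here is specified in advance. If the mean and standard deviation are estimated from the same sample, the reported p-value tends to be too large.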

For more details on fitting distributions, see Vito Ricci's Fitting Distributions with R. For general (non
R) advice, see Bill Huber's Fitting Distributions to Data.

Lattice Graphs
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing
better defaults and the ability to easily display multivariate relationships. In particular, the package
supports the creation of trellis graphs: graphs that display a variable or the relationship between
variables, conditioned on one or more other variables.

The typical format is

graph_type(formula, data=)

where graph_type is selected from the list below. formula specifies the variable(s) to display and
any conditioning variables. For example, ~x|A means display numeric variable x for each level of factor
A. y~x | A*B means display the relationship between numeric variables y and x separately for every
combination of factor A and B levels. ~x means display numeric variable x alone.

graph_type    description                 formula examples
barchart      bar chart                   x~A or A~x
bwplot        boxplot                     x~A or A~x
cloud         3D scatterplot              z~x*y|A
contourplot   3D contour plot             z~x*y
densityplot   kernel density plot         ~x|A*B
dotplot       dotplot                     ~x|A
histogram     histogram                   ~x
levelplot     3D level plot               z~y*x
parallel      parallel coordinates plot   data frame
splom         scatterplot matrix          data frame
stripplot     strip plots                 A~x or x~A
xyplot        scatterplot                 y~x|A
wireframe     3D wireframe graph          z~y*x

Here are some examples. They use the car data (mileage, weight, number of gears, number of
cylinders, etc.) from the mtcars data frame.

# Lattice Examples
library(lattice)
attach(mtcars)

# create factors with value labels
gear.f<-factor(gear,levels=c(3,4,5),
  labels=c("3gears","4gears","5gears"))
cyl.f <-factor(cyl,levels=c(4,6,8),
  labels=c("4cyl","6cyl","8cyl"))

# kernel density plot
densityplot(~mpg,
  main="Density Plot",
  xlab="Miles per Gallon")

# kernel density plots by factor level
densityplot(~mpg|cyl.f,
  main="Density Plot by Number of Cylinders",
  xlab="Miles per Gallon")

# kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f,
  main="Density Plot by Number of Cylinders",
  xlab="Miles per Gallon",
  layout=c(1,3))

# boxplots for each combination of two factors
bwplot(cyl.f~mpg|gear.f,
  ylab="Cylinders", xlab="Miles per Gallon",
  main="Mileage by Cylinders and Gears",
  layout=c(1,3))

# scatterplots for each combination of two factors
xyplot(mpg~wt|cyl.f*gear.f,
  main="Scatterplots by Cylinders and Gears",
  ylab="Miles per Gallon", xlab="Car Weight")

# 3d scatterplot by factor level
cloud(mpg~wt*qsec|cyl.f,
  main="3D Scatterplot by Cylinders")

# dotplot for each combination of two factors
dotplot(cyl.f~mpg|gear.f,
  main="Dotplot Plot by Number of Gears and Cylinders",
  xlab="Miles Per Gallon")

# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
  main="MTCARS Data")


Note, as in graph 1, that specifying a conditioning variable is optional. The difference between
graphs 2 & 3 is the use of the layout option to control the placement of panels.

Customizing Lattice Graphs


Unlike base R graphs, lattice graphs are not affected by many of the options set in the par( ) function.
To view the options that can be changed, look at help(xyplot). It is frequently easiest to set these
options within the high-level plotting functions described above. Additionally, you can write functions
that modify the rendering of panels. Here is an example.

# Customized Lattice Example
library(lattice)
panel.smoother <- function(x, y) {
  panel.xyplot(x, y) # show points
  panel.loess(x, y)  # show smoothed line
}
attach(mtcars)
hp <- cut(hp,3) # divide horse power into three bands
xyplot(mpg~wt|hp, scales=list(cex=.8, col="red"),
  panel=panel.smoother,
  xlab="Weight", ylab="Miles per Gallon",
  main="MPG vs Weight by Horse Power")


Going Further
Lattice graphics are a comprehensive graphical system in their own right. Deepayan Sarkar's book
Lattice: Multivariate Data Visualization with R is the definitive reference. Additionally, see the Trellis
Graphics homepage and the Trellis User's Guide. Dr. Ihaka has created a wonderful set of slides on the
subject. An excellent early consideration of trellis graphs can be found in W.S. Cleveland's classic book
Visualizing Data.
R Tutorials--Counts and Proportions

COUNTS AND PROPORTIONS

Binomial and Poisson Data

Count data--data derived from counting things--are often assumed to be binomially distributed or
Poisson distributed. The binomial model applies when the counts are derived from independent Bernoulli trials in which
the probability of a "success" is known, and the number of trials (i.e., the maximum possible value of the count) is also
known. The classic example is coin tossing. We toss a coin 100 times and count the number of times the coin lands heads
side up. The maximum possible count is 100, and the probability of adding one to the count (a "success") on each trial is
known to be 0.5 (assuming the coin is fair). Trials are independent; i.e., previous outcomes do not influence the probability
of success on current or future trials. Another requirement is that the probability of success remain constant over trials.

Poisson counts are often assumed to occur when the maximum possible count is not known, but is assumed to be large, and
the probability of adding one to the count at each moment (trials are often ill defined) is also unknown, but is assumed to
be small. An example may help.

Suppose we are counting traffic fatalities during a given month. The maximum possible count is quite high, we can
imagine, but in advance we can't say exactly (or even approximately) what it might be. Furthermore, the probability of a
fatal accident at any given moment in time is unknown but small. This sounds like it might be a Poisson distributed
variable. But is it?

The built in data set "Seatbelts" allows us to have a look at such a count variable. "Seatbelts" is not a data frame, it's a time
series, so extracting the counts of fatalities will take a bit of trickery...

> deaths = as.vector(Seatbelts[,1]) # All rows of column 1 extracted.


> length(deaths)
[1] 192
> mean(deaths)
[1] 122.8021
> var(deaths)
[1] 644.1386
The vector "deaths" now contains the monthly number of traffic deaths in Great Britain during the 192 months from
January 1969 through December 1984. We also have our first indication this is not a Poisson distributed variable. In the
Poisson distribution, the mean and the variance are the same. Let's continue nevertheless.
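
To put a number on that first indication, here is a quick sketch (my own addition, not part of the original session) that computes the ratio of the variance to the mean. For a Poisson variable this ratio should be close to 1. The name "dispersion" is just an illustrative choice.

```r
deaths <- as.vector(Seatbelts[, 1])        # monthly deaths, extracted as above
dispersion <- var(deaths) / mean(deaths)   # variance-to-mean ratio
dispersion   # roughly 5.2 (644.1386 / 122.8021), far from the Poisson value of 1
```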

Next we'll look at a histogram of the "deaths" vector, and plotted over top of that we will put the Poisson distribution with
mean = 122.8...
> hist(deaths, breaks=15, prob=T, ylim=c(0,.04))
> lines(x<-seq(65,195,10),dpois(x,lambda=mean(deaths)),lty=2,col="red")
And I'm not gonna lie to you. That took some trial and error! The histogram function asks for 15 break points, turns on
density plotting, and sets the y-axis to go from 0 to .04. The Poisson density function was plotted using the lines( )
function. The x-values were generated by using seq( ) and stored into "x" on the fly. The y-values were generated using
the dpois( ) function. We also requested a dashed, red line. Examination of the figure shows what we suspected. The
empirical distribution does not match the theoretical Poisson distribution.

proport.html[27/01/2014 22:19:22]

> sim.dist = rpois(192, lambda=mean(deaths))


> qqplot(deaths, sim.dist, main="Q-Q Plot")
> abline(a=0, b=1, lty=2, col="red")

Finally, you may recall (if you read that tutorial) that the qqnorm( ) function can be used to check a distribution for
normality. The qqplot( ) function will compare any two distributions to see if they have the same shape. If they do, the
plotted points will fall along a straight line. The resulting plot doesn't look too terribly bad until we realize these two
distributions should not only both be Poisson, they should also both have the same mean (or lambda value). Thus, the
points should fall along a line with intercept 0 and slope 1. The moral of the story: just because count data sound like
they might fit a certain distribution doesn't mean they will. R provides a number of mechanisms for checking this.

We could have saved ourselves a lot of trouble by looking at a time series plot to begin with. I will not reproduce it,
but the command for doing so is below. In the plot, we see the data are strongly cyclical and, therefore, that the
individual elements of the vector should not be considered independent counts. In fact, there is a yearly cycle in the
number of traffic deaths. (How would you show that?)

> plot(Seatbelts[,1]) # Not shown.
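
One possible answer to the question about the yearly cycle, offered as a sketch rather than the only way to do it: average the counts by calendar month, and look at the autocorrelation at a lag of 12 months. The object names "drivers", "month", "monthly.means", and "lag12" are my own choices.

```r
drivers <- Seatbelts[, 1]              # still a monthly time series
month <- as.vector(cycle(drivers))     # 1 = January, ..., 12 = December

# mean number of deaths for each calendar month
monthly.means <- tapply(as.vector(drivers), month, mean)
round(monthly.means, 1)

# autocorrelation between each month and the same month one year earlier
lag12 <- acf(as.vector(drivers), lag.max=12, plot=FALSE)$acf[13]
lag12
```

If the monthly means differ a lot from one calendar month to another, and the lag-12 autocorrelation is well away from zero, that is consistent with a yearly cycle.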


Furthermore, a scatterplot of "deaths" against its index values appears to show that the probability of dying in a traffic
accident is decreasing over the duration of the record...
> scatter.smooth(1:192, deaths) # Not shown.

Lots of problems here!

The Binomial Test

Suppose we set up a classic card-guessing test for ESP using a 25-card deck of Zener cards, which consists of 5 cards each
of 5 different symbols. If the null hypothesis is correct (H0 : no ESP) and the subject is just guessing at random, then we
should expect pN correct guesses from N independent Bernoulli trials on which the probability of a success (correct guess)
is p = 0.2. Suppose our subject gets 9 correct guesses. Is this out of line with what we should expect just by random
chance?


A number of proportion tests could be applied here, but the sample size is fairly small (just 25 guesses), so an exact
binomial test is our best choice...

> binom.test(x=9, n=25, p=.2)


Exact binomial test
data: 9 and 25
number of successes = 9, number of trials = 25, p-value = 0.07416
alternative hypothesis: true probability of success is not equal to 0.2
95 percent confidence interval:
0.1797168 0.5747937
sample estimates:
probability of success
0.36
Oh, too bad! Assuming we set alpha at the traditional value of .05, we fail to reject the null hypothesis with an obtained
p-value of .074. The 95% confidence interval tells us this subject's true rate of correct guessing is best approximated as
being between 0.18 and 0.57. This interval includes the null value of 0.2, so once again, we must regard the results as
being consistent with the null hypothesis.

We might argue at this point that we should have done a one-tailed test. (A two-tailed test is the default.) Of course, this
decision should be made in advance, but if the subject is displaying evidence of ESP, we would expect his success rate to
be not just different from chance but greater than chance. To take this into account in the test, we need to set the
"alternative=" option. Choices are "less", "greater", and "two.sided" (the default)...

> binom.test(x=9, n=25, p=.2, alternative="greater")


Exact binomial test
data: 9 and 25
number of successes = 9, number of trials = 25, p-value = 0.04677
alternative hypothesis: true probability of success is greater than 0.2
95 percent confidence interval:
0.2023778 1.0000000
sample estimates:
probability of success
0.36
The one-tailed test allows the null to be rejected at alpha=.05. The confidence interval says the subject is guessing with a
success rate of at least 0.202. The confidence level can also be set by changing the "conf.level=" option to any reasonable
value less than 1. The default value is .95.
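
If you're curious where those p-values come from, the following sketch recomputes them directly from the binomial distribution. This is my own illustration of the usual exact-test logic, and the names "p.greater", "p.two", and "probs" are just labels I picked.

```r
n <- 25; p <- 0.2; x <- 9

# one-tailed: probability of 9 or more correct guesses under the null
p.greater <- pbinom(x - 1, size=n, prob=p, lower.tail=FALSE)

# two-tailed: add up every outcome that is no more likely than the observed one
# (the small tolerance guards against floating point ties)
probs <- dbinom(0:n, size=n, prob=p)
p.two <- sum(probs[probs <= dbinom(x, n, p) * (1 + 1e-7)])

p.greater   # compare with 0.04677 in the output above
p.two       # compare with 0.07416 in the output above
```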

The Single-Sample Proportion Test

The subject keeps guessing because, of course, we'd like to see this above chance performance repeated. He has now made
400 passes through the deck for a total of 10,000 independent guesses. He has guessed correctly 2,022 times. What should
we conclude?

An exact binomial test is probably not the best choice here as the sample size is now very large. We'll substitute a single-
sample proportion test...

> prop.test(x=2022, n=10000, p=.2, alternative="greater")


1-sample proportions test with continuity correction
data: 2022 out of 10000, null probability 0.2
X-squared = 0.2889, df = 1, p-value = 0.2955
alternative hypothesis: true p is greater than 0.2
95 percent confidence interval:
0.1956252 1.0000000
sample estimates:
p
0.2022
The syntax is exactly the same. Notice the proportion test calculates a chi-squared statistic. The traditional z-test of a
proportion is not implemented in R, but the two tests are equivalent: the chi-squared statistic on 1 df is the square of z.
Notice also a correction for continuity is applied. If you don't want it, set the "correct=" option to FALSE. The default
value is TRUE. (This value must be set to FALSE to make the test mathematically equivalent to the uncorrected z-test of a
proportion.)
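
To see the link between the chi-squared statistic and the z-test, here is a small sketch (my addition) that computes the continuity-corrected z by hand for the ESP data and squares it. The variable names are illustrative only.

```r
x <- 2022; n <- 10000; p0 <- 0.2

# continuity-corrected z statistic for a single proportion
z <- sign(x - n*p0) * (abs(x - n*p0) - 0.5) / sqrt(n * p0 * (1 - p0))
z                 # 0.5375
z^2               # 0.2889, the X-squared value reported by prop.test( )

# one-tailed p-value from the normal distribution
p.z <- pnorm(z, lower.tail=FALSE)
p.z               # compare with 0.2955 in the output above
```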

Two-Sample Proportions Test

A random sample of 428 adults from Myrtle Beach reveals 128 smokers. A random sample of 682 adults from San
Francisco reveals 170 smokers. Is the proportion of adult smokers in Myrtle Beach different from that in San Francisco?

> prop.test(x=c(128,170), n=c(428,682),


+ alternative="two.sided",
+ conf.level=.99)
2-sample test for equality of proportions with continuity correction
data: c(128, 170) out of c(428, 682)
X-squared = 3.0718, df = 1, p-value = 0.07966
alternative hypothesis: two.sided
99 percent confidence interval:
-0.02330793 0.12290505
sample estimates:
prop 1 prop 2
0.2990654 0.2492669
Don't be upset by the fact that I typed this on multiple lines by hitting the Enter key at convenient spots. Being neat is
optional! The two-proportions test also does a chi-square test with continuity correction, which is mathematically
equivalent to the traditional z-test with correction. Enter "hits" or successes into the first vector ("x"), the sample sizes into
the second vector ("n"), and set options as you like. To turn off the continuity correction, set "correct=F". I set the
alternative to two-sided, but this was unnecessary as two-sided is the default. I also set the confidence level for the
confidence interval to 99% to illustrate this option. I made up these data, by the way.

R incorporates a function for calculating the power of a 2-proportions test. The syntax is illustrated here from the help
page...

power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05, power = NULL,


alternative = c("two.sided", "one.sided"),
strict = FALSE)
The value of n should be set to the sample size per group, p1 and p2 to the group probabilities or proportions of successes,
and power to the desired power. One and only one of these options must be passed as NULL, and R will calculate it from
the others. In the example above, what sample sizes should we have if we want a power of 90%?
> power.prop.test(p1=.299, p2=.249, sig.level=.05, power=.9,
+ alternative="two.sided")
Two-sample comparison of proportions power calculation
n = 1670.065
p1 = 0.299
p2 = 0.249
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group

We would need 1,671 subjects in each group (1670.065 rounded up, since rounding down would leave us just short of 90%).
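
The same function can be turned around to ask how much power a given sample size buys. The sketch below is my own addition. It rounds the required n up, and then leaves power as the unknown for a hypothetical 428 subjects per group (the size of the smaller sample in the example above).

```r
# required sample size per group for 90% power
res <- power.prop.test(p1=.299, p2=.249, sig.level=.05, power=.9)
n.needed <- ceiling(res$n)    # round the fractional n up
n.needed                      # 1671

# power with 428 subjects per group; power is left as NULL, so R solves for it
pow <- power.prop.test(n=428, p1=.299, p2=.249, sig.level=.05)$power
pow
```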

Multiple Proportions Test

The two-proportions test generalizes directly to a multiple proportions test. The example from the help page should suffice
to illustrate...

> example(prop.test)
> ## Data from Fleiss (1981), p. 139.
> ## H0: The null hypothesis is that the four populations from which
> ## the patients were drawn have the same true proportion of smokers.
> ## A: The alternative is that this proportion is different in at
> ## least one of the populations.
>


> smokers <- c( 83, 90, 129, 70 )


> patients <- c( 86, 93, 136, 82 )
> prop.test(smokers, patients)
4-sample test for equality of proportions without continuity
correction
data: smokers out of patients
X-squared = 12.6004, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.9651163 0.9677419 0.9485294 0.8536585
If you run this example, you will notice that I have deleted some of the output from the simpler proportion tests.

revised 2010 August 6

R Tutorials--Data Frames

DATA FRAMES

Preamble

There is plenty to say about data frames because they are the primary data structure in R. Some of what follows is essential
knowledge. Some of it will be satisfactorily learned for now if you remember that "R can do that." I will try to point out
which parts are which. Set aside some time. This is a long one!

Definition and Examples (essential)

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one
variable, and each row contains one case. As we shall see, a "case" is not necessarily the same as an experimental subject
or unit, although they are often the same. Technically, in R a data frame is a list of column vectors, although there is only
one reason why you might need to remember such an arcane thing. Unlike an array, the data you store in the columns of a
data frame can be of various types. I.e., one column might be a numerical variable, another might be a factor, and a third
might be a character variable. All columns have to be the same length (contain the same number of data items).

Let's say we've collected data on one response variable or DV from 15 subjects, who were divided into three experimental
groups called control ("contr"), treatment one ("treat1"), and treatment two ("treat2"). We might be tempted to table the
data as follows...
contr treat1 treat2
---------------------------
22 32 30
18 35 28
25 30 25
25 42 22
20 31 33
---------------------------
While this is a perfectly acceptable table, it is NOT a data frame, because values on our one response variable have been
divided into three columns (and so have values on the grouping or independent variable). A data frame has the name of the
variable at the top of the column, and values of that variable in the column under the variable name. So the data above
should be tabled as follows...
scores group
----------------
22 contr
18 contr
25 contr
25 contr
20 contr
32 treat1
35 treat1
30 treat1
42 treat1
31 treat1
30 treat2
28 treat2
25 treat2
22 treat2
33 treat2
----------------

This is a proper data frame. (Leave out the dashed lines, although in actual fact R could read this table just as you see
it here.) It does not matter what order you type the columns in, as long as each column contains values of one variable,
and every recorded value of that variable is in that column.
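
As a rough sketch of how the first (wide) table could be turned into the second (long) one inside R, the stack( ) function will do the job. This is my own illustration, and the object names "wide" and "long" are made up for the example.

```r
# the three-column table, one column per group
wide <- data.frame(contr  = c(22, 18, 25, 25, 20),
                   treat1 = c(32, 35, 30, 42, 31),
                   treat2 = c(30, 28, 25, 22, 33))

# stack( ) puts all values in one column and the column names in a second
long <- stack(wide)
names(long) <- c("scores", "group")   # rename to match the table above
dim(long)                             # 15 rows, 2 columns
head(long)
```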

dataframes.html[26/01/2014 15:20:41]

In a previous tutorial we used the data object "women" as an example of a data frame...

> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
In this data frame we have two numerical variables and no real explanatory variables (IVs) or response variables (DVs).
Notice when R prints out a data frame, it numbers the rows. These numbers are for convenience only and are not part of
the data frame, and I'll have much more to say about them shortly.

We can refer to any value, or subset of values, in this data frame using the already familiar notation...
> women[12,2] # row 12, column 2; note: square brackets
[1] 150
> women[8,] # row 8, all columns
height weight
8 65 135
> women[1:5,] # rows 1 to 5, all columns
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
> women[,2] # all rows, column 2
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> women[c(1,3,7,13),] # rows 1, 3, 7, and 13, all columns
height weight
1 58 115
3 60 120
7 64 132
13 70 154
> women[c(1,3,7,13),1] # rows 1, 3, 7, and 13, column 1
[1] 58 60 64 70
Here's the catch. Those index numbers do NOT necessarily correspond to the numbers you see printed out with the data
frame. This can be confusing at first, and it is something you need to keep in mind. I will explain in a moment.

Another built-in data object that is a data frame is "warpbreaks". This data frame contains 54 cases, so I will print out only
every third one. I do this with the sequence function, since this function creates a vector just as the c( ) function did in
the above examples...

> warpbreaks[seq(1,54,3),]
breaks wool tension
1 26 A L
4 25 A L
7 51 A L
10 18 A M
13 17 A M
16 35 A M
19 36 A H
22 18 A H
25 28 A H
28 27 B L
31 19 B L
34 41 B L
37 42 B M


40 16 B M
43 21 B M
46 20 B H
49 17 B H
52 15 B H
In this data frame we have one numerical variable (number of breaks), and two categorical variables (type of wool and
tension on the wool). We don't have to look at the data frame itself to get this information. We can also use the str( )
function, which displays a breakdown of the structure of a data frame...
> str(warpbreaks)
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

Another example is the data object "sleep"...

> sleep
extra group
1 0.7 1
2 -1.6 1
3 -0.2 1
4 -1.2 1
5 -0.1 1
6 3.4 1
7 3.7 1
8 0.8 1
9 0.0 1
10 2.0 1
11 1.9 2
12 0.8 2
13 1.1 2
14 0.1 2
15 -0.1 2
16 4.4 2
17 5.5 2
18 1.6 2
19 4.6 2
20 3.4 2
Here we have two variables, the change in sleep time a subject got ("extra"), and what drug the subject received ("group").
In this case, the first variable (the dependent variable, DV, response variable, etc.) is numerical and the second (the
independent variable, IV, explanatory variable, grouping variable, etc.) is categorical, even though the categorical variable
is coded as a number. Once again, it does not matter in what order the columns occur. Put the IV in the first column and
the DV in the second column if you want.

However, if categorical variables are coded as numbers (a common practice), R will not know this until you tell it...

> str(sleep)
'data.frame': 20 obs. of 2 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
In this case, the fact that "group" is a factor is stored internally in the data frame, but that will not always be the case. So
it's worth taking a look to make sure things you intend to be factors are being interpreted as factors by R. You can do this
with str( ), but you can also do it with summary( ), because numerical variables and factors are summarized
differently...
> summary(sleep)
extra group
Min. :-1.600 1:10
1st Qu.:-0.025 2:10
Median : 0.950
Mean : 1.540
3rd Qu.: 3.400
Max. : 5.500

Notice that numerical variables (extra) are summarized with numerical summary statistics, while factors are summarized
with a frequency table. In these data, there are 10 subjects in group 1 and 10 subjects in group 2.
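
Here is a small sketch of what "telling R" looks like when a grouping variable has been typed in as numbers. The data frame "demo" is a made-up toy example, not one of the built-in data sets.

```r
# toy data: group is entered as the numbers 1 and 2
demo <- data.frame(extra = c(0.7, -1.6, 1.9, 0.8),
                   group = c(1, 1, 2, 2))
is.factor(demo$group)       # FALSE: R sees an ordinary numeric variable

# declare the variable to be a factor
demo$group <- factor(demo$group)
is.factor(demo$group)       # TRUE
summary(demo)               # group is now summarized with a frequency table
```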

An Ambiguous Case (essential)


Entering data into a data frame sometimes involves making a tough decision as to what your variables are. The following
example is from a built-in data object called "anorexia". This data set is not in the libraries that are loaded by default when
R starts, so to see it, the first thing we need to do is attach the correct library to the search path. Let's see how that works...
> search()
[1] ".GlobalEnv" "tools:RGUI" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
This is the default search path, the one you have right after R starts. (It will be a little different in different operating
systems.) We want to see an object in the MASS library (or package), which is not currently in the search path. So to get it
into the search path, do this...
> library(MASS)
> search()
[1] ".GlobalEnv" "package:MASS" "tools:RGUI"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"

Notice we have added "package:MASS" to the search path in position 2. This means if we request an R object, R will look
first in the global environment (the workspace), and if the object is not found there, R will look next in MASS, then in
RGUI, then in stats, and so on, until the object either is found or R runs out of places to look for it. The "anorexia" data
frame is 72 cases long, so to conserve space we will look at only every fifth row of it...
> anorexia[seq(1,72,5),]
Treat Prewt Postwt
1 Cont 80.7 80.2
6 Cont 88.3 78.1
11 Cont 77.6 77.4
16 Cont 77.3 77.3
21 Cont 85.5 88.3
26 Cont 89.0 78.8
31 CBT 79.9 76.4
36 CBT 80.5 82.1
41 CBT 70.0 90.9
46 CBT 84.2 83.9
51 CBT 83.3 85.2
56 FT 83.8 95.2
61 FT 79.6 76.7
66 FT 81.6 77.8
71 FT 86.0 91.7

The data frame contains data from women who underwent treatment for anorexia. In the first column we have the
treatment variable ("Treat"). The second column contains the pretreatment body weight in pounds ("Prewt"). The third
column contains the posttreatment body weight in pounds ("Postwt"). So where is the ambiguity?

Here's the awkward question. In our analysis of these data, do we wish to treat weight as two variables (pre and post) each
measured once on each subject, or as one variable (weight) measured twice on each subject? The data frame is currently
arranged as if the plan was for an analysis of covariance, with "Postwt" being the response, "Treat" the explanatory
variable, and "Prewt" the covariate. Prewt and Postwt are treated as two variables.

If the plan was for a repeated measures ANOVA, then the data frame is wrong, because in this case, "weight" is ONE
variable measured twice ("pre" and "post") on each woman. In this analysis, we would also need to add a "subject"
variable to the data frame as well, since each subject would have two lines, a "pre" line and a "post" line.

It's not a disaster. The data frame is easy enough to rearrange on the fly, and we will do so below.

By the way, this is how you get the MASS package out of the search path if you no longer need it...

> detach("package:MASS")

Creating a Data Frame in R (essential)

The easiest way--and the usual way--of getting a data frame into the R workspace is to read it in from a file. We will do
that in the next tutorial. Sometimes it becomes necessary to create one at the console, however. Here are the steps
involved:

Type each variable into a vector.


Use the data.frame( ) function to create a data frame from the vectors.

You may remember these data from the "Objects" tutorial...

name age hgt wgt race year SAT


Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
Let's make a data frame of this...
> ls() # A clean workspace is a good start!
character(0)
> name = scan(what="character")
1: Bob Fred Barb Sue Jeff # Remember: press Enter twice to end data entry.
6:
Read 5 items
> age = scan()
1: 21 18 18 24 20
6:
Read 5 items
> hgt = scan()
1: 70 67 64 66 72
6:
Read 5 items
> wgt = scan()
1: 180 156 128 1118 202
6:
Read 5 items
> race = scan(what="character")
1: Cauc Af.Am Af.Am Cauc Asian
6:
Read 5 items
> year = scan(what="character")
1: Jr Fr Fr Sr So
6:
Read 5 items
> SAT = scan()
1: 1080 1210 840 1340 880
6:
Read 5 items
> my.data = data.frame(name, age, hgt, wgt, race, year, SAT)
> my.data
name age hgt wgt race year SAT
1 Bob 21 70 180 Cauc Jr 1080
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
4 Sue 24 66 1118 Cauc Sr 1340
5 Jeff 20 72 202 Asian So 880

Tah dah! It's as simple as that. You wouldn't want to have to do that with a large data set, however, and that's why we'll
learn how to read them in from a file in the next tutorial. DON'T clean up your workspace. We will carry this example
over into the next section.
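
If you'd rather not type at the scan( ) prompts, the same two steps can be sketched with c( ), which is handy in a script. This is my own variation. The values are copied from the table at the top of this section, and I've used a different name ("my.data.alt", purely hypothetical) and built the vectors inside the call so that nothing in your workspace is overwritten.

```r
# build the vectors and the data frame in one step
my.data.alt <- data.frame(
  name = c("Bob", "Fred", "Barb", "Sue", "Jeff"),
  age  = c(21, 18, 18, 24, 20),
  hgt  = c(70, 67, 64, 66, 72),
  wgt  = c(180, 156, 128, 118, 202),     # values as printed in the table
  race = c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian"),
  year = c("Jr", "Fr", "Fr", "Sr", "So"),
  SAT  = c(1080, 1210, 840, 1340, 880)
)
dim(my.data.alt)      # 5 rows, 7 columns
```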

Accessing Information Inside a Data Frame (essential)

First, let's look at a few functions that allow us to get general information about a data frame...

> dim(my.data) # Get size in rows by columns.


[1] 5 7
> names(my.data) # Get the names of variables in the data frame.
[1] "name" "age" "hgt" "wgt" "race" "year" "SAT"
> str(my.data) # See the internal structure of the data frame.
'data.frame': 5 obs. of 7 variables:
$ name: Factor w/ 5 levels "Barb","Bob","Fred",..: 2 3 1 5 4
$ age : num 21 18 18 24 20
$ hgt : num 70 67 64 66 72
$ wgt : num 180 156 128 1118 202
$ race: Factor w/ 3 levels "Af.Am","Asian",..: 3 1 1 3 2
$ year: Factor w/ 4 levels "Fr","Jr","So",..: 2 1 1 4 3
$ SAT : num 1080 1210 840 1340 880
These are self-explanatory, with the exception of str( ). First, notice that our character variables were entered into the
data frame as factors. This was standard in R before version 4.0.0 (newer versions keep them as character variables unless
you set stringsAsFactors=TRUE), and it may not be what you want. Second, notice on the lines giving info about
factors that there are strange numbers at the ends of those lines. You don't have to worry about these. What R is telling you
is that factors are coded internally in R as numbers. R will keep it all straight for you, so don't sweat the details.


The summary( ) function is also useful here...

> summary(my.data)
name age hgt wgt race year
Barb:1 Min. :18.0 Min. :64.0 Min. : 128.0 Af.Am:2 Fr:2
Bob :1 1st Qu.:18.0 1st Qu.:66.0 1st Qu.: 156.0 Asian:1 Jr:1
Fred:1 Median :20.0 Median :67.0 Median : 180.0 Cauc :2 So:1
Jeff:1 Mean :20.2 Mean :67.8 Mean : 356.8 Sr:1
Sue :1 3rd Qu.:21.0 3rd Qu.:70.0 3rd Qu.: 202.0
Max. :24.0 Max. :72.0 Max. :1118.0
SAT
Min. : 840
1st Qu.: 880
Median :1080
Mean :1070
3rd Qu.:1210
Max. :1340
Or at least that would be useful if the data frame were larger!

There are four ways to get at the data inside a data frame, and this is NOT one of them...

> SAT
[1] 1080 1210 840 1340 880
That only seemed to work, because remember when you created the data frame, you started by putting a vector called
"SAT" into the workspace. THAT'S what you're seeing now! You are not seeing the SAT variable from inside the data
frame.

Let's erase all those vectors EXCEPT "age", which we will keep to illustrate something that you will need to remember
about R...
> ls()
[1] "age" "hgt" "my.data" "name" "race" "SAT" "wgt"
[8] "year"
> rm(hgt, name, race, SAT, wgt, year) ### Don't erase my.data!
> ls()
[1] "age" "my.data"
Now if we try to see SAT as we did above...
> SAT
Error: object 'SAT' not found

...we get an error. R will not look inside data frames for variables unless you tell it to. Here are the four ways to do that...

by using $
by using with( )
by using data=
by using attach( )

A data frame is a list of column vectors. We can extract items from inside it by using the usual list indexing device, $. To
do this, type the name of the data frame, a dollar sign, and the name of the variable you want to work with...

> my.data$SAT
[1] 1080 1210 840 1340 880
> mean(my.data$SAT)
[1] 1070
If that dollar sign stuff gets hard to read, you can put spaces around the $ to make the command line easier to read...
> mean(my.data $ SAT)
[1] 1070

This can certainly be a nuisance, because it will mean that in some commands you have to type the data frame name
multiple times. An example is the command that calculates a correlation...
> cor(my.data$hgt, my.data$wgt)
[1] -0.2531835

In this case, you can use the with( ) function to tell R where to get the data from...
> with(my.data, cor(hgt, wgt))
[1] -0.2531835

It doesn't save much typing in this example, but there are cases where that will save a LOT of typing! Notice the syntax of
this function. You type the name of the data frame first, followed by a comma, followed by the function you want to
execute, then you close the parentheses on with( ).

As we will learn later, some functions, especially significance tests, take what's called a formula interface. When that's the
case, there is always a data= option to specify the name of the data frame where the variables are to be found. I'll just show
you an example for now. We'll have plenty of time to examine the formula interface later...

> cor.test( ~ hgt + wgt, data=my.data)


Pearson's product-moment correlation
data: hgt and wgt
t = -0.4533, df = 3, p-value = 0.6811
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9281289 0.8100218
sample estimates:
cor
-0.2531835

Finally, there is the dreaded attach( ) function. This attaches the data frame to your search path (in position 2) so that
R will know to look there for data objects that are referenced by name. Some people use this device routinely when
working with data frames, but it can cause problems, and we are about to see one...

> attach(my.data)
The following object(s) are masked _by_ .GlobalEnv :
age
Say what? When an object is masked (or shadowed) by the global environment, that means there is a data object in the
workspace that has this name AND there is a variable inside the data frame that has this name. I can now ask for any
variable inside the data frame EXCEPT age...
> SAT
[1] 1080 1210 840 1340 880
> mean(SAT)
[1] 1070
> table(year)
year
Fr Jr So Sr
2 1 1 1
> age
[1] 21 18 18 24 20

You might think you are seeing my.data$age here, but YOU ARE NOT! You're seeing "age" from the workspace. In this
case they're the same, but that won't always be true...
> age = 112
> age
[1] 112

The assignment changed the value of "age" in the workspace, but not in the data frame...
> my.data$age
[1] 21 18 18 24 20

If we remove age from the workspace, R will then search inside the data frame for it...
> rm(age)
> age
[1] 21 18 18 24 20

The lesson is, when you get one of these masking (or shadowing) conflicts, WATCH OUT! Be extra careful to know
which version of the variable you're working with. This has tripped up many an R user, including me. This is why you
want to keep your workspace as clean as possible. The best strategy here is to remove the masking variable from the
workspace. If you want to keep it, at least rename it and then remove the conflicting version from the workspace. You'll
eventually be sorry if you don't!

One more lesson...

> detach(my.data)
When you're done with an attached data frame, ALWAYS detach it. This will remove it from the search path so that R will
no longer look inside it for variables. You'll have to go back to using $ to reference variables inside the data frame after it
is detached. This isn't necessary if you're going to quit your R session right away. Quitting detaches everything that was
attached. But if you're going to continue working, detach data frames you no longer need. Otherwise, your search path will
get messy, and you'll get more and more masking conflicts as other objects are attached.
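
Here is a minimal sketch of the masking problem, and of with( ) as a safer alternative. The data frame "demo" and its values are made up for illustration and are not the tutorial's my.data; search( ) is the base R function that lists the search path.

```r
# Hypothetical toy data frame, not the tutorial's my.data
demo <- data.frame(age = c(21, 18, 18), SAT = c(1080, 1210, 840))
age <- 112                       # a workspace object with the same name

attach(demo)                     # R reports that 'age' is masked by .GlobalEnv
masked.age <- age                # finds the workspace copy, 112
attached.SAT <- mean(SAT)        # SAT is found inside the attached data frame
on.path <- "demo" %in% search()  # TRUE while demo is attached
detach(demo)                     # always clean up the search path
still.on.path <- "demo" %in% search()

# with( ) looks inside the data frame first, so the workspace copy is ignored
safe.age <- with(demo, mean(age))
```

With with( ) there is nothing to detach afterwards, which is one reason many people prefer it to attach( ).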

DON'T erase my.data. We still need it.

Data Frame Indexing and Row Names (critical)

This will cost you BIGTIME eventually if you don't pay close attention!

> ls() # Still there?
[1] "my.data"
> my.data
name age hgt wgt race year SAT
1 Bob 21 70 180 Cauc Jr 1080
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
4 Sue 24 66 1118 Cauc Sr 1340
5 Jeff 20 72 202 Asian So 880
Let's talk about those line numbers at the left edge of the printed data frame. THEY ARE NOT NUMBERS. Let me
repeat that. THEY ARE NOT NUMBERS. They are row names. So the rows and columns of this data frame are NAMED
as follows:
> dimnames(my.data)
[[1]]
[1] "1" "2" "3" "4" "5"
[[2]]
[1] "name" "age" "hgt" "wgt" "race" "year" "SAT"

What's the big deal?

Look at the printed data frame. Suppose we wanted to extract Barb's weight. That's the value in row 3 and column 4, so we
could get it this way...

> my.data[3,4] # Remember to use square brackets for indexing.
[1] 128
"Yeah, so?" We could also get it this way...
> my.data[3,"wgt"]
[1] 128

...and this way...
> my.data["3","wgt"]
[1] 128

Those last two ways seem to be the same, BUT THEY ARE NOT!!!

Let's sort the data frame using the age variable. Sorting a data frame is done using the order( ) function. Remember
how it worked when we sorted a vector? If a call to the order( ) function is put in place of the row index the data
frame will be sorted on whatever variable is specified inside that function. You will have to use the full name of the
variable; i.e., you will have to use the $ notation. (Why?) Otherwise, R will be looking in your workspace for a variable
called "age", not finding it, and giving a "not found" error. It happens to me a lot, so you might as well just get used to it!

> my.data[order(my.data$age),]
name age hgt wgt race year SAT
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
5 Jeff 20 72 202 Asian So 880
1 Bob 21 70 180 Cauc Jr 1080
4 Sue 24 66 1118 Cauc Sr 1340

Observe the row names! They were sorted along with the rows, weren't they? Let's save this into a new data object so we can play with it a
bit...
> my.data[order(my.data$age),] -> my.data.sorted # Did you remember up arrow?
> my.data.sorted
name age hgt wgt race year SAT
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
5 Jeff 20 72 202 Asian So 880
1 Bob 21 70 180 Cauc Jr 1080
4 Sue 24 66 1118 Cauc Sr 1340

Now let's try to extract Barb's weight from this new data frame...
> my.data.sorted[3,4] ### Wrong!
[1] 202
> my.data.sorted[3,"wgt"] ### Also wrong!
[1] 202
> my.data.sorted["3","wgt"] ### Correct!
[1] 128
> my.data.sorted[2,4] ### Also correct!
[1] 128

Confused yet?

Here's what you have to remember. Those numbers that often print out on the left side of a data frame ARE NOT
NUMBERS. They're row names. So data frames have both row and column names, whether you like it or not! The point
becomes clearer when we give the rows actual names. Let's erase the names from my.data and then re-enter them as row
names...
> rm(my.data.sorted) # Get rid of that first.
> my.data$name <- NULL # This is how you erase a variable.
> my.data # See?
age hgt wgt race year SAT
1 21 70 180 Cauc Jr 1080
2 18 67 156 Af.Am Fr 1210
3 18 64 128 Af.Am Fr 840
4 24 66 1118 Cauc Sr 1340
5 20 72 202 Asian So 880
> rownames(my.data) <- c("Bob","Fred","Barb","Sue","Jeff")
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 1118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
> my.data["Barb", "wgt"] # Makes getting Barb's weight a lot easier!
[1] 128
Notice the numbers are gone now because we have actual row names. And OF COURSE they sort with the rest of the data
frame...
> my.data[order(my.data$age),]
age hgt wgt race year SAT
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Jeff 20 72 202 Asian So 880
Bob 21 70 180 Cauc Jr 1080
Sue 24 66 1118 Cauc Sr 1340

It would be absolutely silly if they didn't! Just remember: Data frames ALWAYS have row names. Sometimes those row
names just happen to look like numbers. It's the row names that print out to your console when you ask to see the data
frame, or any part of it, and NOT the index numbers.
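
To make the distinction concrete, here is a small sketch using a made-up three-row data frame (the names "demo" and "sorted" are just for illustration). It shows positional indexing and row-name indexing disagreeing after a sort, plus two ways around the problem.

```r
# Hypothetical toy data; the row names start out as "1", "2", "3"
demo <- data.frame(name = c("Bob", "Fred", "Barb"),
                   age  = c(21, 18, 19),
                   wgt  = c(180, 156, 128))
sorted <- demo[order(demo$age), ]   # rows are now in the order Fred, Barb, Bob

by.position <- sorted[3, "wgt"]     # third row as printed: Bob's weight
by.rowname  <- sorted["3", "wgt"]   # the row NAMED "3": still Barb's weight

# Looking a case up by value sidesteps the issue entirely
by.value <- sorted$wgt[sorted$name == "Barb"]

# Assigning NULL resets the row names to "1", "2", "3", ...
rownames(sorted) <- NULL
after.reset <- sorted["3", "wgt"]   # row name "3" is the third row again
```

Resetting the row names is only sensible when they carry no information, which is the case for the automatic "numbers" but not for real names like "Barb".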

Don't remove my.data yet. We still need it.

Modifying a Data Frame (not so essential just now)

Rule number one with a bullet:

NEVER MODIFY AN ATTACHED DATA FRAME!

While this isn't strictly against the law, it's a bad idea and can get very confusing as to exactly what it is you've modified. I
could try to explain it, but I'm not sure I understand it myself. So just don't do it!

The time will come when you want to change a data frame in some way. Here are some examples of how to do that. You
may have noticed that Sue (in the my.data data frame) is a wee bit on the chunky side. This was an innocent mistake. I
really didn't do that on purpose. How do we fix it? The value was supposed to be 118, but let's change it to 135 just for
kicks...

> ls() # Still there?
[1] "my.data"
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 1118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
> my.data["Sue","wgt"] <- 135
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 135 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
That's all there is to it. Use any kind of indexing you like. Let's use numerical indexing to give Sue her correct weight...
> my.data[4,3] <- 118
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880

Just remember that "wgt" is now in column 3, since the row names don't count as a column.

I have to warn you about modifying data frames. It's always a good idea to make a backup copy in the workspace first,
because some commands that modify data frames can really screw things up if they go wrong! But let's live
dangerously. Suppose we wanted "wgt" to be in kilograms instead of pounds. Easy enough...
> my.data$wgt / 2.2
[1] 81.81818 70.90909 58.18182 53.63636 91.81818
> my.data # Nothing has changed yet. Why not?
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
> my.data$wgt / 2.2 -> my.data$wgt # Aha! It has to be stored back into my.data.
> my.data
age hgt wgt race year SAT
Bob 21 70 81.81818 Cauc Jr 1080
Fred 18 67 70.90909 Af.Am Fr 1210
Barb 18 64 58.18182 Af.Am Fr 840
Sue 24 66 53.63636 Cauc Sr 1340
Jeff 20 72 91.81818 Asian So 880
> round(my.data$wgt, 1) -> my.data$wgt # A little rounding for good measure.
> my.data
age hgt wgt race year SAT
Bob 21 70 81.8 Cauc Jr 1080
Fred 18 67 70.9 Af.Am Fr 1210
Barb 18 64 58.2 Af.Am Fr 840
Sue 24 66 53.6 Cauc Sr 1340
Jeff 20 72 91.8 Asian So 880
Now that we've rounded them off, we've lost the original weight data in pounds...
> my.data$wgt*2.2
[1] 179.96 155.98 128.04 117.92 201.96

We could have avoided this by making a backup copy of my.data first, or by putting the new weight in kilograms into a
new column in the data frame.
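
A minimal sketch of both precautions, using a made-up data frame rather than my.data (the names "demo", "backup" and "wgt.kg" are hypothetical):

```r
# Hypothetical toy data frame with weights in pounds
demo <- data.frame(wgt = c(180, 156, 128))
rownames(demo) <- c("Bob", "Fred", "Barb")

backup <- demo                           # a backup copy, made before any changes
demo$wgt.kg <- round(demo$wgt / 2.2, 1)  # new column; wgt itself is untouched
```

Either precaution alone would have saved the original weights here; doing both costs almost nothing on a data frame this small.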

Let's see how to create a new column in the data frame...

> my.data$IQ = c(115, 122, 100, 144, 96)
> my.data
age hgt wgt race year SAT IQ
Bob 21 70 81.8 Cauc Jr 1080 115
Fred 18 67 70.9 Af.Am Fr 1210 122
Barb 18 64 58.2 Af.Am Fr 840 100
Sue 24 66 53.6 Cauc Sr 1340 144
Jeff 20 72 91.8 Asian So 880 96
Just name it and assign values to the name in a vector. The new vector has to be the same length as the other variables
already in the data frame.

You can clean up now. We're done with this data frame.

Missing Values (kinda important, so listen up!)

Do this...

> library(MASS)
> data(Cars93)
> attach(Cars93)
> str(Cars93) # Output not shown.
This is a data frame with 93 observations on 27 variables. You can see what the variables represent by looking at the help
page for this data set: ?Cars93. We're interested in the variable "Luggage.room" in particular, which is the trunk space in
cubic feet, to the nearest cubic foot...
> summary(Luggage.room)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
6.00 12.00 14.00 13.89 15.00 22.00 11.00

This is a numerical variable, so we get the summary we are accustomed to by now. But what are those NAs? Whether we
like it or not, data sets often have missing values, and we need to know how to deal with them. R's standard code for
missing values is "NA", for "not available". The number associated with NA is a frequency. There are 11 cases in this data
frame in which "Luggage.room" is a missing value. If you looked at the help page, you know why.

Some functions fail to work when there are missing values, but this can (almost always) be fixed with a simple option...

> mean(Luggage.room)
[1] NA
> mean(Luggage.room, na.rm=TRUE)
[1] 13.89024
> mean(Luggage.room, na.rm=T)
[1] 13.89024
There is no mean when some of the values are missing, so the "na.rm" option removes them when set to TRUE (must be all
caps, but the shorter form T also works provided you haven't assigned another value to it). If you want to clean the data set
by removing casewise all cases with missing values on any variable, use the na.omit( ) function...
> na.omit(Cars93) # Output not shown.

I will not reproduce the output here because it is extensive, but it is also instructive, so take a look at it. Scroll the console
window backwards to see all of it. Of course, to use this cleaned data frame, you would have to assign it to a new data
object.
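
As a small sketch of that last point, here is na.omit( ) on a made-up data frame, with the result stored. The complete.cases( ) function, also in base R, gives the same information as a logical vector. The object names are hypothetical.

```r
# Hypothetical toy data frame with one missing value
demo <- data.frame(x = c(1, 2, NA, 4), y = c("a", "b", "c", "d"))

clean <- na.omit(demo)                 # drops every row that has an NA anywhere
n.dropped <- nrow(demo) - nrow(clean)  # how many cases were lost
kept <- complete.cases(demo)           # TRUE for rows with no missing values
```

Checking n.dropped before going on is a cheap way to find out how much data the casewise deletion cost you.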

The which( ) function combined with "== NA" does not work to identify which of the values are missing, because any
comparison with NA returns NA rather than TRUE. Use is.na( ) instead...

> which(Luggage.room == NA)
integer(0)
> is.na(Luggage.room)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[23] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[67] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[89] TRUE FALSE FALSE FALSE FALSE
> which(is.na(Luggage.room))
[1] 16 17 19 26 36 56 57 66 70 87 89
Finally, some data sets come with other codes for missing values. 999 is a common missing value code, as are blank
spaces. Blanks are a very bad idea. If you find a data set with blanks in it, it may have to be edited in a text editor or
spreadsheet before the file can be read into R. It depends on how the file is formatted. In some cases, R will automatically
assign NA to blank values, but in other cases it will not. Other missing value codes are not a problem, as they can be
recoded...
> ifelse(is.na(Luggage.room), 999, Luggage.room) -> temp
> temp
[1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 999 999
[18] 20 999 15 14 17 11 13 14 999 16 11 11 15 12 12 13 12
[35] 18 999 18 21 10 11 8 12 14 11 12 9 14 15 14 9 19
[52] 22 16 13 14 999 999 12 15 6 15 11 14 12 14 999 14 14
[69] 16 999 17 8 17 13 13 16 18 14 12 10 15 14 10 11 13
[86] 15 999 10 999 14 15 14 15
> # first we'll mess it up
> # and then we'll fix it
> ifelse(temp == 999, NA, temp) -> fixed
> fixed
[1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 NA NA 20 NA 15 14 17 11
[24] 13 14 NA 16 11 11 15 12 12 13 12 18 NA 18 21 10 11 8 12 14 11 12 9
[47] 14 15 14 9 19 22 16 13 14 NA NA 12 15 6 15 11 14 12 14 NA 14 14 16
[70] NA 17 8 17 13 13 16 18 14 12 10 15 14 10 11 13 15 NA 10 NA 14 15 14
[93] 15

The ifelse( ) function is very handy for recoding a data vector, so let me take a moment to explain it. Inside the
parentheses, the first thing you give is a test. In the second of these commands above, where we are going from the messed
up variable back to "fixed", the test was "if any value of temp is equal to 999". Notice the double equals sign meaning
"equal". (I still get this wrong a lot!) The second thing you give is how to recode those values, and finally you tell what to
do with the values that don't pass the test. So the whole command reads like this: "If any value of temp is equal to 999,
assign it the value NA, else assign it the value that is currently in temp."

In the first instance of the function, we had to use is.na, since nothing can really be "equal to" something that is not
available! Try these, and say them in words as you're typing them...

> ifelse(fixed == 10, 0, 100) # Output not shown.
> ifelse(fixed > 10, 100, 0) # Output not shown.
> ifelse(fixed > 10, "big", "small") # Output not shown.
If you stored that last one, it would create a character vector.
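
Here is a short sketch, with made-up values, of why the "equal to NA" test fails and of what ifelse( ) does when the test itself is missing:

```r
x <- c(10, NA, 25)      # made-up values, one of them missing

eq.na <- x == NA        # every element is NA: the comparison is never TRUE
is.missing <- is.na(x)  # this is the test that actually works

# An NA in the test gives an NA in the result, so missing values stay missing
size <- ifelse(x > 10, "big", "small")
```

So a recode with ifelse( ) will not silently turn a missing value into "big" or "small", which is usually what you want.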

Don't forget to clean up your workspace and search path!!

Subsetting a Data Frame (optional)

We will use a data frame called USArrests for this exercise...

> data(USArrests)
> head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Here is another useful function for looking at a data frame. The head( ) function shows the first six lines of data (cases)
inside a data frame. There is also a tail( ) function that shows the last six lines, and the number of lines shown can be
changed with an option (see the help pages).
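
The option in question is n=. A minimal sketch (the object names are just for illustration):

```r
data(USArrests)                        # ships with R in the datasets package

first.three <- head(USArrests, n = 3)  # first three rows instead of six
last.two <- tail(USArrests, n = 2)     # last two rows
```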

In this case we have a data frame with row names set to state names and containing variables that give the arrest rates (per
100,000 population) for Murder, Assault, and Rape, as well as the percentage of the population that lives in urban areas.
These data are from 1973 so are not current.

Because state names are used as row names, to see the data for any state, all we have to be able to do is spell the name of
the state...

> USArrests["Pennsylvania",] # No column index, so all columns displayed.
Murder Assault UrbanPop Rape
Pennsylvania 6.3 106 72 14.9
We do not have to figure out what the index number would be for that row. Thus, explicit row names can be very handy.
To display the entire row of data for PA, we just left out the column index, but THE COMMA STILL HAS TO BE
THERE! Otherwise, you are trying to index a two-dimensional data object using only one index, and R will tell you to
knock it off!

Let's answer the following questions from these data...

Which state has the lowest murder rate?
Which states have murder rates less than 4.0?
Which states are in the top quartile for urban population?

> min(USArrests$Murder) # What is the minimum murder rate?
[1] 0.8
> which(USArrests$Murder == 0.8) # Which line of the data is that?
[1] 34
> USArrests[34,] # Give me the data from that line.
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
>
> which(USArrests$Murder < 4.0) # Gives the result in a vector.
[1] 7 12 15 19 23 29 34 39 41 44 45 49
> USArrests[which(USArrests$Murder < 4.0),] # Use that vector as an index.
Murder Assault UrbanPop Rape
Connecticut 3.3 110 77 11.1
Idaho 2.6 120 54 14.2
Iowa 2.2 56 57 11.3
Maine 2.1 83 51 7.8
Minnesota 2.7 72 66 14.9
New Hampshire 2.1 57 56 9.5
North Dakota 0.8 45 44 7.3
Rhode Island 3.4 174 87 8.3
South Dakota 3.8 86 45 12.8
Utah 3.2 120 80 22.9
Vermont 2.2 48 32 11.2
Wisconsin 2.6 53 66 10.8
>
> summary(USArrests$UrbanPop)
Min. 1st Qu. Median Mean 3rd Qu. Max.
32.00 54.50 66.00 65.54 77.75 91.00
> USArrests[which(USArrests$UrbanPop >= 77.75),]
Murder Assault UrbanPop Rape
Arizona 8.1 294 80 31.0
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Florida 15.4 335 80 31.9
Hawaii 5.3 46 83 20.2
Illinois 10.4 249 83 24.0
Massachusetts 4.4 149 85 16.3
Nevada 12.2 252 81 46.0
New Jersey 7.4 159 89 18.8
New York 11.1 254 86 26.1
Rhode Island 3.4 174 87 8.3
Texas 12.7 201 80 25.5
Utah 3.2 120 80 22.9
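
The same three questions can be answered a little more directly. This is only a sketch of one alternative, not the way the commands above have to be written; which.min( ) and quantile( ) are standard R functions, and the object names are made up.

```r
data(USArrests)

# which.min( ) returns the position of the (first) minimum directly
lowest <- rownames(USArrests)[which.min(USArrests$Murder)]

# A logical test can be used as an index without wrapping it in which( )
low.murder <- rownames(USArrests)[USArrests$Murder < 4.0]

# quantile( ) supplies the cutoff, so 77.75 need not be typed by hand
cutoff <- quantile(USArrests$UrbanPop, 0.75)
top.quartile <- USArrests[USArrests$UrbanPop >= cutoff, ]
```

One caution: a logical index and which( ) behave differently when the test contains missing values. The logical index returns rows of NAs for them, while which( ) quietly drops them. There are no missing values in USArrests, so here the two agree.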

Suppose we wanted to work with data only from these states. How can we extract them from the data frame and make a
new data frame that contains only those states? I'm glad you asked...

> subset(USArrests, subset=(UrbanPop >=77.75)) -> high.urban
> high.urban
Murder Assault UrbanPop Rape
Arizona 8.1 294 80 31.0
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Florida 15.4 335 80 31.9
Hawaii 5.3 46 83 20.2
Illinois 10.4 249 83 24.0
Massachusetts 4.4 149 85 16.3
Nevada 12.2 252 81 46.0
New Jersey 7.4 159 89 18.8
New York 11.1 254 86 26.1
Rhode Island 3.4 174 87 8.3
Texas 12.7 201 80 25.5
Utah 3.2 120 80 22.9
The subset( ) function does the trick. The syntax is a little squirrelly, so let me go through it. The first thing you give
is the name of the data frame. That is followed by the subset= option. Then inside of parentheses (which actually aren't
necessary) give the test that defines the subset. Store the output into a new data object so that you can then work with it.
Functions that take a data= option can also take a subset option, so it's a useful thing to know.
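
As a sketch of how subset( ) relates to ordinary indexing: it also has a select= option for choosing columns, and the result is the same as what bracket indexing gives. The object names below are made up for illustration.

```r
data(USArrests)

# subset= picks the rows; select= (optional) picks the columns
high.urban <- subset(USArrests, subset = UrbanPop >= 77.75,
                     select = c(Murder, UrbanPop))

# The same rows and columns with ordinary bracket indexing
same <- USArrests[USArrests$UrbanPop >= 77.75, c("Murder", "UrbanPop")]
```

Inside subset( ) the variable names can be typed bare, without the $ notation, which is most of its appeal.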

You can clean up your workspace now.

Stacking and Unstacking (optional)

Suppose someone has retained your services as a data analyst and gives you his data (from an Excel file or something) in
this format...

contr   treat1   treat2
---------------------------
22      32       30
18      35       28
25      30       25
25      42       22
20      31       33
---------------------------
If you're working for free, you can yell at him and make him do it the right way, but if you're being paid, you probably
really shouldn't. Here's how to deal with it. First, let's get these data into a "data frame" in this format, and I will leave out
the command prompts so that you can just copy and paste these three lines directly into R...
### start copying here
wrong.data = data.frame(contr = c(22,18,25,25,20),
treat1 = c(32,35,30,42,31),
treat2 = c(30,28,25,22,33))
### stop copying here
> wrong.data
contr treat1 treat2
1 22 32 30
2 18 35 28
3 25 30 25
4 25 42 22
5 20 31 33

Now do this...
> stack(wrong.data) -> correct.data
> correct.data
values ind
1 22 contr
2 18 contr
3 25 contr
4 25 contr
5 20 contr
6 32 treat1
7 35 treat1
8 30 treat1
9 42 treat1
10 31 treat1
11 30 treat2
12 28 treat2
13 25 treat2
14 22 treat2
15 33 treat2

And there you go. Now you have a proper data frame.

There is also an unstack( ) function that does the reverse of this, and it will work automatically on a data frame that
has been created by stack( ), but otherwise is a little trickier to use. You probably won't have to use it much, so I'll
refer you to the help page if you ever need it.
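
For the record, here is a minimal sketch of both cases, using made-up data. The column names "score" and "group" are hypothetical. When the long data frame did not come from stack( ), unstack( ) needs a formula of the form values ~ groups.

```r
# Hypothetical wide data, same layout as wrong.data in the text
wide <- data.frame(contr  = c(22, 18, 25),
                   treat1 = c(32, 35, 30))
long <- stack(wide)    # columns are named "values" and "ind"
back <- unstack(long)  # works automatically on the output of stack( )

# For a long data frame made some other way, give the formula explicitly
other <- data.frame(score = c(22, 18, 32, 35),
                    group = c("contr", "contr", "treat1", "treat1"))
wide.other <- unstack(other, score ~ group)
```

This gives a data frame only when every group has the same number of values; with unequal group sizes unstack( ) returns a list instead.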

You can remove these data objects. We won't use them again.

Going From Wide to Long and Long to Wide (eventually you'll probably need to know this)

I mention this above under "An Ambiguous Case." There are two kinds of data frames in R, and in most statistical
software: wide ones and long ones. Let's fetch the "anorexia" data again (and we'll do it without attaching the MASS
package this time)...

> data(anorexia, package="MASS")


What we are about to do is a little confusing until you get some experience with it, so it will be necessary to be able to see
what's happening. The anorexia data frame is too long to print to a single console screen without causing it to scroll, so I'm
going to cut it down to only nine cases, three from each group. This will help us to see the difference between wide and
long data frames without constantly scrolling the console window...
> anorexia[c(1,2,3,27,28,29,56,57,58),] -> anor
> anor
Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
27 CBT 80.5 82.2
28 CBT 84.9 85.6
29 CBT 81.5 81.4
56 FT 83.8 95.2
57 FT 83.3 94.3
58 FT 86.0 91.5

I also shortened up the name of our data frame, because we're going to be typing it a lot.

This is a wide data frame. It's wide because each line of the data frame contains information on ONE SUBJECT, even
though that subject was measured multiple times (twice) on weight (Prewt, Postwt). So all the data for each subject goes on
ONE LINE, even though we could interpret this as a repeated measures design, or longitudinal data.

In a long data frame, each value of weight would define a case. So each of these subjects would have two lines in such a
data frame, one for the subject's Prewt, and one for her Postwt. A wide data frame would be used, for example, in analysis
of covariance. A long data frame would be used in repeated measures analysis of variance. Do we have to retype the data
frame to get from wide to long? Fortunately not! Because R has a function called reshape( ) which will do the work
for us.

It is not an easy function to understand, however (and don't count on the help page being a whole lot of help!). So let me
illustrate it, and then I will explain what's happening...

> reshape(data=anor, direction="long",
+ varying=c("Prewt","Postwt"), v.names="Weight",
+ idvar="subject", ids=row.names(anor),
+ timevar="PrePost", times=c("Prewt","Postwt")
+ ) -> anor.long
> anor.long
Treat PrePost Weight subject
1.Prewt Cont Prewt 80.7 1
2.Prewt Cont Prewt 89.4 2
3.Prewt Cont Prewt 91.8 3
27.Prewt CBT Prewt 80.5 27
28.Prewt CBT Prewt 84.9 28
29.Prewt CBT Prewt 81.5 29
56.Prewt FT Prewt 83.8 56
57.Prewt FT Prewt 83.3 57
58.Prewt FT Prewt 86.0 58
1.Postwt Cont Postwt 80.2 1
2.Postwt Cont Postwt 80.1 2
3.Postwt Cont Postwt 86.4 3
27.Postwt CBT Postwt 82.2 27
28.Postwt CBT Postwt 85.6 28
29.Postwt CBT Postwt 81.4 29
56.Postwt FT Postwt 95.2 56
57.Postwt FT Postwt 94.3 57
58.Postwt FT Postwt 91.5 58
In this example, the first argument I gave to the reshape( ) function was the name of the data frame to be reshaped,
and that was given in the data= option. Then I specified the direction= option as "long" so that the data frame would be
converted TO a long format.

In the second line of this command, I specified varying= as a vector of variable names in anor that correspond to the
repeated measures or longitudinal measures (the time-varying variables). These values will be given in one column in the
new data frame, so I named that new column using the v.names= option.

A long data frame needs two things that a wide one does not have. One of those things is a column identifying the subject
(case or experimental unit) from which the data in a row of the data frame come. This is necessary because each
subject will have multiple rows of data in a long data frame. So I used the idvar= option to specify the name of this new
column that would identify the subjects. I then used ids= to specify how the subjects were to be named. I told it to use the
row names from anor, which is a sensible thing to do.

The other thing a long format data frame needs that a wide one does not is a variable giving the condition (or time) in
which the subject is being measured for this particular row of data. In the wide format, this information is in the column
(variable) names, but that will no longer be true in the long format. We need to know which measure is Prewt and which
measure is Postwt for each subject, since these will be on different rows of the data frame in long format. I named this new
variable using the timevar= option, and I gave its possible values in a vector using the times= option. The order in which
those values should be listed is the same as the order in which the corresponding columns occur in the wide data frame.

Finally, I closed the parentheses on the reshape( ) function and assigned the output to a new data object. Done!

This can also be made to work if you have more than one repeated measures variable, in which case all I can say is may
the saints be with you!
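
For the brave, here is a sketch of that case under some assumptions: a made-up data frame with two variables (weight and height), each measured pre and post. All the column names are hypothetical. The trick is that varying= becomes a list, with one vector of column names per measured variable.

```r
# Hypothetical wide data: weight and height, each measured twice
wide <- data.frame(id = 1:3,
                   wt.pre = c(80, 85, 90), wt.post = c(82, 84, 95),
                   ht.pre = c(60, 62, 64), ht.post = c(61, 62, 65))

# varying= is a list with one vector of column names per measured variable;
# v.names= gives the matching new column names, in the same order
long <- reshape(data = wide, direction = "long",
                varying = list(c("wt.pre", "wt.post"), c("ht.pre", "ht.post")),
                v.names = c("Weight", "Height"),
                idvar = "id",
                timevar = "PrePost", times = c("pre", "post"))
```

Here the subject identifier already exists as the column "id", so idvar= just names it and no ids= option is needed. Within each vector in the varying= list, the columns must be in the same order as the values given to times=.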

If the data frame results from a reshape( ) command, then it can be converted back very simply. All you have to do is
this...

> reshape(anor.long)
Treat subject Prewt Postwt
1.Prewt Cont 1 80.7 80.2
2.Prewt Cont 2 89.4 80.1
3.Prewt Cont 3 91.8 86.4
27.Prewt CBT 27 80.5 82.2
28.Prewt CBT 28 84.9 85.6
29.Prewt CBT 29 81.5 81.4
56.Prewt FT 56 83.8 95.2
57.Prewt FT 57 83.3 94.3
58.Prewt FT 58 86.0 91.5
The row names have gone a little screwy, but all the correct information is there. This isn't very useful actually, because we
already have the data in wide format in the data frame anor, which we were smart enough not to overwrite. So let's see
how to convert from long to wide the hard way.

First, we will get rid of those ridiculous row names...

> rownames(anor.long) <- as.character(1:18) # Just do it!
> anor.long
Treat PrePost Weight subject
1 Cont Prewt 80.7 1
2 Cont Prewt 89.4 2
3 Cont Prewt 91.8 3
4 CBT Prewt 80.5 27
5 CBT Prewt 84.9 28
6 CBT Prewt 81.5 29
7 FT Prewt 83.8 56
8 FT Prewt 83.3 57
9 FT Prewt 86.0 58
10 Cont Postwt 80.2 1
11 Cont Postwt 80.1 2
12 Cont Postwt 86.4 3
13 CBT Postwt 82.2 27
14 CBT Postwt 85.6 28
15 CBT Postwt 81.4 29
16 FT Postwt 95.2 56
17 FT Postwt 94.3 57
18 FT Postwt 91.5 58
And now for the reshaping. I won't bother storing it...
> reshape(data=anor.long, direction="wide",
+ v.names=c("Weight"),
+ idvar="subject",
+ timevar="PrePost"
+ )
Treat subject Weight.Prewt Weight.Postwt
1 Cont 1 80.7 80.2
2 Cont 2 89.4 80.1
3 Cont 3 91.8 86.4
4 CBT 27 80.5 82.2
5 CBT 28 84.9 85.6
6 CBT 29 81.5 81.4
7 FT 56 83.8 95.2
8 FT 57 83.3 94.3
9 FT 58 86.0 91.5

We didn't quite recover the original table, but then we probably didn't really want to. The first two options name the data
frame we are reshaping and tell the direction we are reshaping TO. The next option, v.names=, gives the name of the time-
varying variable that will be split into two (or more) columns. The idvar= option gives the name of the variable that is the
subject identifier. Finally, the timevar= option gives the name of the variable that contains the conditions under which the
longitudinal information was collected; i.e., there were two weights, a Prewt and a Postwt. Notice these values were used to
name the two new columns of Weight data. Want a mnemonic to help you remember all that? Yeah, me too!

revised 2010 Aug 1

Return to the Table of Contents

R Tutorials--Describing Data Graphically

DESCRIBING DATA GRAPHICALLY

Introduction

The graphical procedures in R are extremely powerful. I'm told there are some people who use R not so much for data
analysis as for its ability to produce top notch publication quality graphics. I will only scratch the surface of these
capabilities here. A later tutorial will fill in a few more of the details.

R graphics functions can be grouped into three types:

High level plotting functions that will create a more or less complete graph, often with axis labels, titles, and so
forth.
Low level plotting functions that allow additional information to be added to an existing graph, or that allow graphs
to be drawn from scratch.
Interactive graphics functions that allow you to extract information from an existing graph, or to label points and so
on.

I'm going to do something unusual, and perhaps ill-advised, and cover the low level functions first, so that you will be
ready to use them in conjunction with the high level functions when we get to those. If you just want the quick and dirty
approach, then skip this first section (for now).

Low Level Plotting Functions

Just about anything can be drawn into a graphics window in R if you are clever enough. I'm not that clever, so I'll keep it
simple. To conserve space, I'm also not going to reproduce the output of every single example. If you have R open and are
following along, you can see it on your own screen.

High level plotting functions open a graphics device (window) automatically, but the low level functions do not. So to get
a graph and some axes to work with, the following command will get us started without actually drawing a graph...

> plot(1:100, 1:100, type="n", xlab="", ylab="")


The plot( ) function is high level, opening a graphics window and drawing labeled axes, but in this case we've asked it
not to plot anything with the 'type="n"' option. Now we have a palette. Let's paint on it.

One thing we can do is plot a curve from an algebraic equation. Let's say the equation is y = 0.01 x^2 ...

> curve(x^2/100, add=TRUE)


We may want to add some text to the graph, which tells our intended audience just what it is we've plotted...
> text(80, 50, "This is a graph of")
> text(80, 45, "the equation")
> text(80, 37, expression(y == frac(1,100) * x^2))

The text( ) function takes, first, arguments that give x,y-coordinates at which the text will be centered (and this can
take some careful eyeballing or some trial and error), and then it takes quoted text or a mathematical expression. The
syntax for the expression( ) function is an art form in itself (similar to LaTeX), and I have not mastered it, but it can
be used to produce some very fancy mathematical expressions. There are also options for controlling font face and size as
well as spacing, etc.

Next let's draw some points on this curve...

> points(x=c(20, 60, 90), y=c(4, 36, 81), pch=6)


The first vector gives the x-coordinates of the desired points, the second vector gives the y-coordinates, and the "pch="
option gives the point character to use. There are about twenty different point characters to choose from. To see some of
them, do this...
> points(x=rep(100,10), y=seq(0,90,10), pch=seq(1,20,2))

You can experiment for yourself to find out what the rest of them look like.

Now let's draw a straight line through a couple of those points, say the one at (20, 4) and the one at (90, 81). The
draw-a-straight-line function is abline( ), and in this case its arguments are "a=the y-intercept" and "b=the slope" of the
desired line...

> abline(a=-18, b=1.1, col="red")


And what the heck? Just to be showy, let's make it red. We can also draw horizontal and vertical lines with this function...
> abline(h=20, lty=2) # abline(h=20, lty="dashed") also works
> abline(v=20, lty=3) # abline(v=20, lty="dotted") also works

The "lty=" option specifies the line type. (1=solid, 2=dashed, 3=dotted.) You can also change the color of these lines with
col=, and the width of the lines with lwd= options. Try repeating that last command but set lty=1 and lwd=3.
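
In case you're wondering where a=-18 and b=1.1 in the abline( ) call came from, they follow from the two points the line is supposed to pass through. Here is a minimal sketch of the arithmetic (the object names x, y, a, and b are just for illustration)...

```r
# The two points the red line is meant to pass through.
x <- c(20, 90)
y <- c(4, 81)

b <- diff(y) / diff(x)   # slope: rise over run
a <- y[1] - b * x[1]     # intercept: solve y = a + b*x at the first point
c(a = a, b = b)
```

Feeding those two values to abline(a=, b=) reproduces the red line.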

We can also draw lines and/or points using the lines( ) function...

> lines(x=c(40, 40, 60, 60), y=c(80, 100, 100, 80), type="b")
> lines(x=c(40, 60), y=c(80, 80), type="l")
Once again, the first vector gives the x-coordinates, the second vector the y-coordinates, and the "type=" option tells
whether you want just points (type="p"), just lines (type="l"), or both (type="b"). Note: for just lines, use a lower case L.
This example shows that type="l" and type="b" behave a bit differently in terms of where the line begins and terminates.

Finally (at least as far as this tutorial is concerned!), titles and axis labels can be added using the title( ) function. We
already have axis labels (they are set by default in the plot( ) function), so I'll use a little trick that SOMETIMES
works to erase one of them. I'll write over it in the background color of the graph...

> title(main="A Drawing To Put On the Refrigerator!")


> title(xlab="x", col.lab="white")
> title(xlab="This is the x-axis", col.lab="black")
This example also gives a little taste of how various options can be used to control colors, fonts, text sizes, and so forth.
We'll do more of this in a future tutorial. The "xlab=" option was used to overwrite the existing x-axis label with itself
written in white, and then a new label written in black. The same thing could have been done on the y-axis using the
"ylab=" option. And now let's have a look at our masterpiece...

Beautiful! Okay, so it's a little first-graderish as R graphics go. There are entire books on R graphics, and I am but a
beginner! Here is the entire script if you just now have decided you want to see this happen on your own monitor.
# Start copying here.
plot(1:100, 1:100, type="n", xlab="", ylab="")
curve(x^2/100, add=TRUE)
text(80, 50, "This is a graph of")
text(80, 45, "the equation")
text(80, 37, expression(y == frac(1,100) * x^2))
points(c(20,60,90), c(4,36,81), pch=6)
points(rep(100,10), seq(0,90,10), pch=seq(1,20,2))
abline(a=-18, b=1.1, col="red")
abline(h=20, lty=2)
abline(v=20, lty=3)
lines(c(40,40,60,60), c(80,100,100,80), type="b")
lines(c(40,60), c(80,80), type="l")
title(main="A Drawing To Put On the Refrigerator!")
title(xlab="x", col.lab="white")
title(xlab="This is the x-axis", col.lab="black")
# Stop copying here and paste to your R Console.

High Level Plotting Functions

Usually, we don't want to fuss that much. We just want to see a graph of some data we're examining. If we want to dress it
up for publication, THEN we'll worry about the low-level functions and various options.

The basic high level plotting function is plot( ), and it works differently depending upon what you're asking it to plot.
The basic syntax is plot(x, y, ...), where x is a vector of x-coordinates, y is a vector of y-coordinates, and ...
represents further refinements and options, as will be illustrated...

> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> plot(waiting, eruptions) # x is num., y is num., plot is scatterplot
> detach(faithful)
> rm(faithful)
>
>
> data(ToothGrowth)
> attach(ToothGrowth)
> names(ToothGrowth)
[1] "len" "supp" "dose"
> plot(supp, len) # x is factor, y is num., plot is boxplots
> plot(factor(dose), len) # coercing dose to a factor
> detach(ToothGrowth)
> rm(ToothGrowth)
>
>
> data(sunspots)
> class(sunspots)
[1] "ts"
> plot(sunspots) # x is time series, y missing, plot is a
> rm(sunspots) # time-series plot
>
>
> data(UCBAdmissions)
> class(UCBAdmissions)
[1] "table"
> plot(UCBAdmissions) # x is table, y missing, plot is a mosaic plot
> rm(UCBAdmissions)
>
>
> data(mtcars)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17.0 18.6 19.4 17.0 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> plot(mtcars) # x is dataframe of num. vars., y missing,
> rm(mtcars) # plot is a scatterplot matrix
>

And so on. I think I've made my point.

As we go through the individual data analyses in future tutorials, we will see these various plots again, and we will dress
them up a bit. So for now, let me just illustrate a few other things R can do.

Piecharts and Barplots

When a single categorical variable is being graphed, the customary way is to use a piechart or a barplot. Statisticians are
somewhat biased against piecharts, and I suppose for good reason, but I'll illustrate them anyway, just in case you have a
hankerin' to flout good statistical practice.

The data set UCBAdmissions, which we were using above, is the Berkeley admissions data we used in a different form in
a previous tutorial. The data set is a 3-D table, and we need a 1-D table to illustrate a basic piechart and barplot, so...

> margin.table(UCBAdmissions, 3) # Collapse over dimensions 1 and 2.
Dept
  A   B   C   D   E   F
933 585 918 792 584 714
> margin.table(UCBAdmissions,3) -> Department
> pie(Department)
> barplot(Department, xlab="Department", ylab="frequency")
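
If it isn't obvious what margin.table( ) did there, the sketch below spells out the same collapse with apply( ). The names dept1 and dept2 are made up for the comparison, and this is just one way to check the result, not the only one...

```r
dept1 <- margin.table(UCBAdmissions, 3)   # sum over Admit and Gender, keep Dept
dept2 <- apply(UCBAdmissions, 3, sum)     # the same sums, spelled out with apply( )

dept1
all(dept1 == dept2)                       # do the two approaches agree?
```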

The pie( ) function in R is limited because, as I mentioned above, many statisticians (including the R folks) consider
pie charts to be poor statistical practice. However, if you want something flashy like a 3D exploded pie chart, you can get
it by installing an optional graphics package called "plotrix", which contains a function called pie3D( ), which has an
"explode" option. It just goes to show, if you want it, someone has probably written an R package that will do it! To see an
example of an exploded pie chart produced with this package, try this link. In fact, I recommend the plotrix package if you
want some useful extensions to the basic R graphics capabilities.

If you want to look at two categorical variables at once, a stacked barplot, or better yet, a side-by-side barplot is usually
the way to go...
> margin.table(UCBAdmissions, c(1,3)) -> Admit.by.Dept
> barplot(Admit.by.Dept)
> barplot(Admit.by.Dept, beside=T, ylim=c(0,1000), legend=T,
+ main="Admissions by Department")

Notice a stacked barplot is the default. To change that, set the "beside=" option to TRUE. Also, I dressed up the second
barplot a bit by adding a main title, and by changing the limits on the y-axis to make room for a legend. I need to adjust
the font size a bit in the legend, and maybe change its location, but that's a future tutorial!

Histograms

When you have one numerical variable to look at, a histogram is appropriate. I'll use the "faithful" data set again to
illustrate...

> data(faithful) # This is really optional.


> attach(faithful)
> hist(waiting)

It doesn't get much more straightforward than that! And by the way, in case you're wondering, I resized the graphic by
resizing the graphics device window before saving it. There are better ways, but that works in a pinch.

If you want more or fewer bars, you can refine your plot by using the "breaks=" option and defining your own
breakpoints...
> range(waiting)
[1] 43 96
> hist(waiting, breaks=seq(40,100,10))
By default, R includes the right limit (right side of the bar) but not the left limit in the intervals. Usually, I prefer it the
other way around, so I change it with the "right=" option, which by default is TRUE...
> hist(waiting, breaks=seq(40,100,10), right=F)

There are many, many other options as well, which you can examine by looking at the help page for this function: ?hist.
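
To see what the "right=" option actually does to the counts, you can ask hist( ) for its numbers without drawing anything by setting plot=FALSE. A small sketch (brks, h.right, and h.left are illustrative names)...

```r
brks <- seq(40, 100, 10)

# Default: each interval includes its right limit, (a, b]
h.right <- hist(faithful$waiting, breaks=brks, plot=FALSE)
# right=FALSE: each interval includes its left limit, [a, b)
h.left  <- hist(faithful$waiting, breaks=brks, right=FALSE, plot=FALSE)

rbind(right=h.right$counts, left=h.left$counts)
```

Any waiting time that falls exactly on a breakpoint moves from one bar to its neighbor when the option is switched, but the total count stays the same.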

R also incorporates many functions for data smoothing, including kernel density smoothing of histograms. If you'd rather
see a smooth curve than a boxy histogram, it can be done as follows...

> plot(density(waiting))
> # Or, getting fancier...
> hist(waiting, prob=T)
> lines(density(waiting))
> detach(faithful)

The density( ) function does kernel density smoothing, which can be refined by adjusting the options of the function.
To plot the smoothed curve on top of a histogram, set the "prob=" option to TRUE inside the hist( ) function. This
plots densities rather than frequencies. Also, use lines( ) rather than plot( ) to plot the smoothed curve. This low
level graphics function will add the smoothed curve to the histogram rather than drawing a new plot and thereby erasing
the histogram.
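
One of the options that refines the smoothing is "adjust=", which multiplies the default bandwidth. The sketch below overlays three amounts of smoothing on one histogram. The particular values 0.5 and 2 are arbitrary choices for illustration...

```r
w <- faithful$waiting

d.default <- density(w)               # default bandwidth
d.rough   <- density(w, adjust=0.5)   # half the default bandwidth: wigglier
d.smooth  <- density(w, adjust=2)     # twice the default bandwidth: smoother

hist(w, prob=TRUE, main="Waiting time with three density estimates")
lines(d.default)
lines(d.rough, lty=2)
lines(d.smooth, lty=3)
```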

Numerical Summaries by Groups

When you have a numerical variable indexed by a categorical variable or factor, you might want a group-by-group
summary in graphical form. The primary way R offers to achieve this is side-by-side boxplots...

> data(chickwts) # Weight gain by type of diet.


> str(chickwts)
'data.frame': 71 obs. of 2 variables:
$ weight: num 179 160 136 227 217 168 108 124 143 140 ...
$ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
> attach(chickwts)
> plot(feed, weight)
> title(main="Body Weight of Chicks by Type of Diet")
> detach(chickwts)

The function boxplot( ), which takes a formula interface, can also be used. Here is the example copied and pasted off
the "chickwts" help page...
> boxplot(weight ~ feed, data = chickwts, col = "lightgray",
+ varwidth = TRUE, notch = TRUE, main = "chickwt data",
+ ylab = "Weight at six weeks (gm)")
Warning message:
In bxp(list(stats = c(216, 271.5, 342, 373.5, 404, 108, 136, 151.5, :
some notches went outside hinges ('box'): maybe set notch=FALSE

Notice that several options are set, including an option to color the boxes, the "varwidth=" option, which sets the width of
the box according to the sample size, the "notch=" option, which gives a confidence interval around the median, and
options to print a main title and y-axis label. The procedure generated a warning message, which you will understand when
you look at the graphic (which I have not reproduced here).
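
If you'd rather see the numbers behind the boxes than squint at the picture, boxplot( ) will hand them over when plot=FALSE. The sketch below pulls out the pieces that the "varwidth=" and "notch=" options rely on, and lists the groups whose notch runs past a hinge. The object name b is arbitrary...

```r
b <- boxplot(weight ~ feed, data=chickwts, plot=FALSE)

b$names   # the six feed types
b$n       # group sizes (what varwidth= scales the boxes by)
b$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$conf    # rows: lower and upper end of each notch

# Groups whose notch extends beyond the box
b$names[b$conf[1, ] < b$stats[2, ] | b$conf[2, ] > b$stats[4, ]]
```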

Scatterplots

For examining the relationship between two numerical variables, you can't beat a scatterplot. R has several functions for
producing them, two of which will be demonstrated here...

> data(mammals, package="MASS")


> str(mammals)
'data.frame': 62 obs. of 2 variables:
$ body : num 3.38 0.48 1.35 465.00 36.33 ...
$ brain: num 44.5 15.5 8.1 423.0 119.5 ...
> attach(mammals)
> plot(log(body), log(brain))
> scatter.smooth(log(body), log(brain))
> detach(mammals)

Some explanations are in order. First, I didn't want to attach the MASS package to the search path, so I used an option
when I copied the "mammals" data frame that told R to look for it there. The data frame contains brain and body weights
from 62 species of land mammals. Second, to produce a linear plot, I had to do a log transform on both variables, and I did
that "on the fly." Third, the two functions produced the same scatterplot, but the scatter.smooth( ) function also
plots a smoothed, nonparametric regression line on the plot. This line is computed using the loess technique and is called
the "loess line" (locally weighted scatterplot smoothing, sometimes also called "lowess", although I understand some
sources use the two acronyms differently). Both functions have options that allow the plots to be modified in several ways.
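
If you want the smoothed line as numbers, or want to add it to a plot you've already drawn, loess.smooth( ) computes the same curve that scatter.smooth( ) draws. A minimal sketch, assuming the MASS package is installed (the name fit is arbitrary)...

```r
data(mammals, package="MASS")   # assumes MASS is installed

# x,y-coordinates of the smoothed curve, on the log-log scale
fit <- with(mammals, loess.smooth(log(body), log(brain)))

plot(log(brain) ~ log(body), data=mammals)
lines(fit, col="red")           # add the loess line with a low level function
```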

Interacting With Plots

R supplies several functions that allow you to interact with the graphics window, including functions that allow you to
identify and label points on the graph. See the help pages for the locator( ) and identify( ) functions for details.
I'll discuss these briefly in a later tutorial.

Remember to clean up your workspace!

revised 2010 August 4

R Tutorials--Functions and Scripts

FUNCTIONS AND SCRIPTS

Functions

One of the advantages of using a scriptable statistics language like R is, if you don't like the way it does something, you
can change it. Or if there is a function missing you'd like to have, write it. Writing basic functions is not difficult. If you
can calculate it at the command line, you can write a function to calculate it.

There is no function in the R base packages to calculate the standard error of the mean. So let's create one. The standard
error of the mean is calculated from a sample (I should say estimated from a sample) by taking the square root of the
sample variance divided by the sample size. So from the command line...

> setwd("Rspace") # if you've created this directory


> rm(list=ls()) # clean out the workspace
> ls()
character(0)
> nums = rnorm(25, mean=100, sd=15) # create a data vector to work with
> mean(nums) # calculate the mean
[1] 97.07936 # your results will differ
> sd(nums) # and the standard deviation (sample)
[1] 12.92470
> length(nums) # and the length or "sample size"
[1] 25
> sem = sqrt(var(nums)/length(nums)) # this is how the sem is calculated
> sem
[1] 2.584941
So we know how to calculate the sem at the command line. Automating this by creating an "sem( )" function is a piece of
cake...
> rm(sem) # get rid of the object we created above
> ?sem # check to see if something by this name exists
No documentation for 'sem' in specified packages and libraries:
you could try 'help.search("sem")'
> sem = function(x)
+ {
+ sqrt(var(x)/length(x))
+ }

First, we checked to make sure "sem" was not already used as a keyword by asking for a help page. (That's no guarantee,
but it's a good check.) Then we typed "sem=function(x)", which requires a bit of explanation. The name we desire for our
function is "sem", so that's the first thing we type. Then we tell R we want to define this as a function by typing
"=function". The sem is going to be calculated on a data object--a vector--so we have to pass the data to the function, and
that is the point of "(x)". This tells R to expect one argument to be passed to the function. It doesn't have to be called "x".
This is just a dummy variable, so call it "fred" if you want, as long as you call it the same thing throughout the function
definition.

After you hit the Enter key, R will see that you are defining a function, and it will give you the + prompt, meaning "tell me
more." Type an open curly brace and hit Enter again. (A more common practice is to type this on the first line, but it
doesn't matter, and I learned it otherwise.) Then type the calculations needed to get the standard error. Spacing is optional,
but I think it makes it a bit easier to understand if you use some indenting here. Hit Enter. Type a closed curly brace and hit
Enter again. Your function has been defined and is now in your workspace to be used whenever you want...

> ls()
[1] "nums" "sem"
And it will stay in your workspace for whatever working directory you are in PROVIDED you save your workspace when
you quit R. You use the function just like you use any other function in R...
> sem(nums)
[1] 2.584941
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540

So next week you fire up R, you see "sem" in your workspace, and you wonder what it is (if you're like me). Easy enough
to find out...

> class(sem)
[1] "function"
> sem
function(x)
{
sqrt(var(x)/length(x))
}
Just like any other data object, typing its name without an argument prints it out.

I don't like it. Our sem function is good enough, but if there are missing values in the data vector, sem( ) will choke...
> nums[20] = NA # create a missing value
> sem(nums)
Error in var(nums) : missing observations in cov/cor
So let's fix it...
> rm(sem) # out with the old...
> ls()
[1] "nums"
> sem = function(x)
+ {
+ n = sum(x,na.rm=T)/mean(x,na.rm=T)
+ sqrt(var(x,na.rm=T)/n)
+ }
> ls()
[1] "nums" "sem"
> sem(nums)
[1] 2.641737

By the way, we couldn't use the length( ) function in this calculation because it has no "na.rm=" option. However, another
way to get the length of the vector without counting missing values, and perhaps a more elegant way, is: n=sum(!is.na(x)).
This tests each value of the vector to see if it's missing. If it is NOT (the ! means NOT), then it returns TRUE for that
position in the vector. Finally, the values returned as TRUE are counted with sum( ).
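
Here is a sketch of the function written that second way. It is named sem2 only so it doesn't overwrite the version above. It also shows one reason to prefer counting over the sum-divided-by-mean trick: if the mean of the nonmissing values happens to be zero, that ratio is 0/0...

```r
sem2 <- function(x) {
  n <- sum(!is.na(x))             # count the nonmissing values
  sqrt(var(x, na.rm=TRUE) / n)
}

x <- c(2, 4, 4, NA, 6)
sem2(x)

# A vector whose nonmissing values average to zero
z <- c(-1, 1, NA)
sum(z, na.rm=TRUE) / mean(z, na.rm=TRUE)   # NaN
sum(!is.na(z))                             # 2
```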

The length( ) function counts NAs as data values and doesn't tell you. (Which is why we couldn't use it above--it would
have given the wrong value for n.) Let's create another function for sample size that reports on NAs...

> ?samp.size
No documentation for 'samp.size' in specified packages and libraries:
you could try 'help.search("samp.size")'
> samp.size = function(x)
+ {
+ n = length(x) - sum(is.na(x))
+ nas = sum(is.na(x))
+ out = c(n, nas)
+ names(out) = c("", "NAs")
+ out
+ }
> ls()
[1] "nums" "samp.size" "sem"
> samp.size(nums)
NAs
24 1
Now samp.size( ) returns a vector the first element of which is the number of nonmissing values in the data object we feed
into it. So...
> sqrt(var(nums,na.rm=T)/samp.size(nums)[1])
2.641737

...you can use it like this.

If you ask me, R has some annoying idiosyncrasies. Take the tapply( ) function for example. What could "tapply" possibly
mean? And who came up with that convoluted syntax? Don't like it? Then change it!

> ?calculate
No documentation for 'calculate' in specified packages and libraries:
you could try 'help.search("calculate")'
> calculate = function(FUN, of, by)
+ {
+ tapply(of, by, FUN)
+ }
> ls()
[1] "calculate" "nums" "samp.size" "sem"
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
Which makes more sense to you?! You should also know that these one-liners can be entered all on one line...
> rm(calculate)
> ls()
[1] "nums" "samp.size" "sem"
> calculate = function(FUN, of, by) tapply(of, by, FUN)
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540

You don't even need the curly braces, although you can use them if you want. I usually do.

Another R function that annoys the crap out of me is summary( ) when applied to a numerical vector. I want a standard
deviation, and I want a sample size...

> ?describe
No documentation for 'describe' in specified packages and libraries:
you could try 'help.search("describe")'
> describe = function(x)
+ {
+ m=mean(x,na.rm=T)
+ s=sd(x,na.rm=T)
+ N=sum(is.na(x))
+ n=length(x)-N
+ se=s/sqrt(n)
+ out=c(m,s,se,n,N)
+ names(out)=c("mean","sd","sem","n","NAs")
+ round(out,4)
+ }
> ls()
[1] "calculate" "describe" "nums" "samp.size" "sem"
> describe(nums)
mean sd sem n NAs
96.5680 12.9418 2.6417 24.0000 1.0000
That's better! And don't forget to SAVE YOUR WORKSPACE when you quit if you want to keep these functions.

Scripts

A script is just a plain text file with R commands in it. You can prepare a script in any text editor, such as vim,
TextWrangler, or Notepad. You can also prepare a script in a word processor, like Word, Writer, TextEdit, or WordPad,
PROVIDED you save the script in plain text (ascii) format. In Windows, this will append a ".txt" file extension to the file.
Drop the script into your working directory, and then read it into R using the source( ) function.

In a moment you're going to see a link. Click on it and a text page will appear with a sample script on it. Use your browser
to save this page to your desktop. Then move the saved file into your R working directory. Or, if you really want to be
adventurous, type the script into a text editor like Notepad, save it in your working directory, and you are ready to go.
Okay, here is the link...

Link To Sample Script

Now that you've got it in your working directory one way or another, do this in R...
> source("sample_script.txt") # Don't forget those quotes!
A note: R does not like spaces in script names, so don't put spaces in your script names! Now, what didn't happen that you
expected to happen? Go back to the link and read the script again if you have to. What happened to the mean of "y" and
the mean of "x"?

The script has created the variables "x" and "y" in your workspace (and has erased any old objects you had by that name--
sorry). You can see them with the ls( ) function. Executing a script does everything typing those commands in the Console
would do, EXCEPT print things to the Console. Do this...

> x
[1] 22 39 50 25 18
> mean(x)
[1] 30.8
See? It's there. But if you want to be sure a script will print it to the Console, you should use the print( ) function...
> print(x)
[1] 22 39 50 25 18
> print(mean(x))
[1] 30.8

When you're working in the Console, the print( ) is understood (implicit) when you type a command or data object name.
This is not necessarily so in a script.
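
Another way to get output from a script, without editing it, is the "echo=" option of source( ). The sketch below writes a two-line stand-in script to a temporary file (so the file name is not one from this tutorial) and sources it both ways, capturing whatever would have been printed...

```r
# Write a tiny script to a temporary file (a stand-in for your own script).
f <- tempfile(fileext=".txt")
writeLines(c("x <- c(22, 39, 50, 25, 18)", "mean(x)"), f)

quiet <- capture.output(source(f))              # default: nothing is printed
loud  <- capture.output(source(f, echo=TRUE))   # commands and results are echoed

length(quiet)
loud
```

With echo=TRUE each command is shown as it is run, followed by its result, much as if you had typed it at the Console.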

A script is a good way to keep track of what you're doing. If you have a long analysis, and you want to be able to recreate
it later, a good idea is to type it into a script. If you're working in the Windows R GUI (also in the Mac R GUI), there is
even a built-in script editor. To get to it, pull down the File menu and choose New Script (New Document on a Mac). A
window will open in which you can type your script. Type this script into the open window...

with(PlantGrowth, tapply(weight, group, mean))


with(PlantGrowth, aov(weight ~ group)) -> aov.out
summary.aov(aov.out)
summary.lm(aov.out)

Hit the Enter key after the last line. Now, in the editor window, pull down the Edit menu and choose Run All. (On a Mac,
highlight all the lines of the script and choose Execute.) The script should execute in your R Console. Pull down the File
Menu and choose Save As... Give the file a nice name, like "script2.txt". R will NOT save it by default with a file
extension, so be sure you give it one. Close the editor window. Now, in the R Console, do this...
> source("script2.txt")
Nothing happens! Why not? Actually, something did happen. The "aov.out" object was created in your workspace.
However, nothing was echoed to your Console because you didn't tell it to print( ). Go to File and choose New Script. In
the script editor, pull down File and choose Open Script... In the Open Script dialog that appears, change Files Of Type to
all files. Then choose to open "script2.txt". Edit it to look like this...

print(with(PlantGrowth, tapply(weight, group, mean)))


with(PlantGrowth, aov(weight ~ group)) -> aov.out
print(summary.aov(aov.out))
print(summary.lm(aov.out))

Pull down File and choose Save. Close the script editor window(s). And FINALLY...
> source("script2.txt")

Scripts! Nothing to it, right?

revised 2013 June 22

R Tutorials--Logistic Regression

LOGISTIC REGRESSION

Preliminaries

Model Formulae

You will need to know a bit about Model Formulae to understand this tutorial.

Odds, Odds Ratios, and Logit

When you go to the track, how do you know which horse to bet on? You look at the odds. In the program, you may see the
odds for your horse, Sea Brisket, are 8 to 1, which are the odds AGAINST winning. This means in nine races Sea Brisket
would be expected to win 1 and lose 8. In probability terms, Sea Brisket has a probability of winning of 1/9, or 0.111. But
the odds of winning are 1/8, or 0.125. Odds are actually the ratio of two probabilities...
odds = p(one outcome) / p(the other outcome) = p(success) / p(failure) = p / q, where q = 1 - p
So for Sea Brisket, odds(winning) = (1/9)/(8/9) = 1/8. Notice that odds have these properties:

If p(success) = p(failure), then odds(success) = 1 (or 1 to 1, or 1:1).
If p(success) < p(failure), then odds(success) < 1.
If p(success) > p(failure), then odds(success) > 1.
Unlike probability, which cannot exceed 1, there is no upper bound on odds.
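
These conversions are easy to do in R. The two helper functions below are not built in. They are hypothetical names defined here just to illustrate the arithmetic...

```r
odds <- function(p) p / (1 - p)   # probability -> odds
prob <- function(o) o / (1 + o)   # odds -> probability

odds(1/9)   # Sea Brisket: 0.125, or 1 to 8
odds(0.5)   # equal chances: 1
prob(1/8)   # and back again to 1/9
```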

The natural log of odds is called the logit, or logit transformation, of p: logit(p) = log_e(p/q). Logit is sometimes called
"log odds." Because of the properties of odds given in the list above, the logit has these properties:

If odds(success) = 1, then logit(p) = 0.
If odds(success) < 1, then logit(p) < 0.
If odds(success) > 1, then logit(p) > 0.
The logit transform fails if p = 0 or p = 1.
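
R has the logit built in as qlogis( ), and its inverse as plogis( ). A quick sketch comparing them to the by-hand calculation (the probabilities in p are arbitrary examples)...

```r
p <- c(1/9, 0.5, 0.9)

log(p / (1 - p))    # logit by hand
qlogis(p)           # the same thing, built in
plogis(qlogis(p))   # the inverse logit gets p back

qlogis(c(0, 1))     # -Inf and Inf at the extremes
```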

Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or
binary coded (0,1--failure,success) data. When the response is a binary (dichotomous) variable, and x is numerical, logistic
regression fits a logistic curve to the relationship between x and y. The logistic curve looks like the graph at right. It is an
S-shaped or sigmoid curve, often used to model population growth, survival from a disease, the spread of a disease (or
something disease-like, such as a rumor), and so on. The logistic function is...

y = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)]

Logistic regression fits b0 and b1, the regression coefficients (which were 0 and 1, respectively, for the graph above). It
should have already struck you that this curve is not linear. However, the point of the logit transform is to make it linear...

logit(y) = b0 + b1x
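
Here is a sketch that draws the logistic curve for b0 = 0 and b1 = 1 and confirms that taking the logit straightens it out. The range of x values is an arbitrary choice...

```r
b0 <- 0
b1 <- 1
x <- seq(-6, 6, by=0.5)

y <- exp(b0 + b1*x) / (1 + exp(b0 + b1*x))   # the logistic function

plot(x, y, type="l", main="Logistic curve, b0 = 0, b1 = 1")
all.equal(log(y / (1 - y)), b0 + b1*x)       # logit(y) is a straight line in x
```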

Hence, logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of
success at each value of x. However, you should avoid the temptation to do a traditional least-squares regression at this
point, as neither the normality nor the homoscedasticity assumption will be met.

Odds ratio might best be illustrated by returning to our horse race. Suppose in the same race Seattle Stew is given odds of
2 to 1, which is to say, two expected losses for each expected win. Seattle Stew's odds of winning are 1/2, or 0.5. How
much better is this than the winning odds for Sea Brisket? The odds ratio tells us: 0.5 / 0.125 = 4.0. The odds of Seattle
Stew winning are four times the odds of Sea Brisket winning. Be careful not to say "times as likely to win," which would
not be correct. The probability (likelihood, chance) of Seattle Stew winning is 1/3 and for Sea Brisket is 1/9, resulting in a
probability ratio of 3.0. Seattle Stew is three times as likely to win as Sea Brisket.
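
The distinction is easy to check at the command line. In this sketch the object names are invented for the two horses, and odds( ) is a hypothetical helper, not a built-in function...

```r
p.stew    <- 1/3   # 2 to 1 against
p.brisket <- 1/9   # 8 to 1 against
odds <- function(p) p / (1 - p)

odds(p.stew) / odds(p.brisket)   # odds ratio: 4
p.stew / p.brisket               # ratio of probabilities: 3
```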

Logistic Regression: One Numerical Predictor

In the "MASS" library there is a data set called "menarche" (Milicer, H. and Szczotka, F., 1966, Age at Menarche in
Warsaw girls in 1965, Human Biology, 38, 199-203), in which there are three variables: "Age" (average age of age
homogeneous groups of girls), "Total" (number of girls in each group), and "Menarche" (number of girls in the group who
have reached menarche)...

> library("MASS")
> data(menarche)
> str(menarche)
'data.frame': 25 obs. of 3 variables:
$ Age : num 9.21 10.21 10.58 10.83 11.08 ...
$ Total : num 376 200 93 120 90 88 105 111 100 93 ...
$ Menarche: num 0 0 0 2 2 5 10 17 16 29 ...
> summary(menarche)
Age Total Menarche
Min. : 9.21 Min. : 88.0 Min. : 0.00
1st Qu.:11.58 1st Qu.: 98.0 1st Qu.: 10.00
Median :13.08 Median : 105.0 Median : 51.00
Mean :13.10 Mean : 156.7 Mean : 92.32
3rd Qu.:14.58 3rd Qu.: 117.0 3rd Qu.: 92.00
Max. :17.58 Max. :1049.0 Max. :1049.00
> plot(Menarche/Total ~ Age, data=menarche)
From the graph at right, it appears a logistic fit is called for here. The fit would be done this way...
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+ family=binomial(logit), data=menarche)

Numerous explanations are in order! First, glm( ) is the function used to do generalized linear models, and will be explained
more completely in another tutorial. With "family=" set to "binomial" with a "logit" link, glm( ) produces a logistic
regression. Because we are using glm( ) with binomial errors in the response variable, the ordinary assumptions of least
squares linear regression (normality and homoscedasticity) don't apply. Second, our data frame does not contain a row for
every case (i.e., every girl upon whom data were collected). Therefore, we do not have a binary (0,1) coded response
variable. No problem! If we feed glm( ) a table (or matrix) in which the first column is number of successes and the
second column is number of failures, R will take care of the coding for us. In the above analysis, we made that table on the
fly inside the model formula by binding "Menarche" and "Total − Menarche" into the columns of a table using cbind( ).
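
The two-column table is not the only way in. glm( ) will also take the response as a proportion of successes, provided the group sizes are supplied through the "weights=" option. The sketch below fits the model both ways so the coefficients can be compared. It assumes MASS is installed, and the names fit.counts and fit.props are made up for the comparison...

```r
library(MASS)   # assumes MASS is installed; it supplies the menarche data

# Successes and failures bound into a two-column table
fit.counts <- glm(cbind(Menarche, Total - Menarche) ~ Age,
                  family=binomial(logit), data=menarche)

# Proportion of successes, with group sizes given as weights
fit.props  <- glm(Menarche/Total ~ Age, weights=Total,
                  family=binomial(logit), data=menarche)

rbind(counts=coef(fit.counts), proportions=coef(fit.props))
```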

Let's look at how closely the fitted values from our logistic regression match the observed values...

> plot(Menarche/Total ~ Age, data=menarche)


> lines(menarche$Age, glm.out$fitted, type="l", col="red")
> title(main="Menarche Data with Fitted Logistic Regression Line")

I'm impressed! I don't know about you. The numerical results are extracted like this...
> summary(glm.out)
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit),
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4

The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and
anova(glm.out).

Recall that the response variable is log odds, so the coefficient of "Age" can be interpreted as "for every one-year increase
in age, the odds of having reached menarche are multiplied by exp(1.632) = 5.11."
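
The arithmetic behind that statement can be sketched as follows. The estimate and standard error are copied from the print out above; the names b.age, se.age, odds.ratio, and wald.ci are made up for this illustration, and the rough 95% Wald interval is an extra that the tutorial itself does not compute.

```r
# Estimate and standard error for Age, copied from summary(glm.out).
b.age  <- 1.63197
se.age <- 0.05895

# Antilog of the coefficient: the factor by which the odds are
# multiplied for each additional year of age.
odds.ratio <- exp(b.age)
odds.ratio

# A rough 95% Wald interval, computed on the log odds scale and then
# antilogged (an addition for illustration, not part of the tutorial).
wald.ci <- exp(b.age + c(-1.96, 1.96) * se.age)
wald.ci
```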

To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the
print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean).
This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference
between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by
3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant
decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the
model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of
0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the
observed values.
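
The two chi square tests just described can be reproduced from the numbers in the print out. This is only a sketch of the arithmetic; the object names are made up, and the deviances are typed in by hand rather than extracted from glm.out.

```r
# Deviances and degrees of freedom copied from summary(glm.out).
null.dev  <- 3693.884; null.df  <- 24
resid.dev <- 26.703;   resid.df <- 23

# Test 1: did adding Age reduce the deviance significantly?
drop.dev <- null.dev - resid.dev     # about 3667
drop.df  <- null.df - resid.df       # 1
p.drop   <- pchisq(drop.dev, df = drop.df, lower.tail = FALSE)
p.drop

# Test 2: overall fit, treating the residual deviance as a chi square.
p.fit <- pchisq(resid.dev, df = resid.df, lower.tail = FALSE)
p.fit
```

Using lower.tail=FALSE gives the same result as 1 - pchisq( ), but it holds up better when the p-value is extremely small.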

Logistic Regression: Multiple Numerical Predictors

During the Fall semester of 2005, two students in our program--Rachel Mullet and Lauren Garafola--did a senior research
project in which they studied a phenomenon called Inattentional Blindness (IB). IB refers to situations in which a person
fails to see an obvious stimulus right in front of his eyes. (For details see this website.) In their study, Rachel and Lauren
had subjects view an online video showing college students passing basketballs to each other, and the task was to count the
number of times students in white shirts passed the basketball. During the video, a person in a black gorilla suit walked
through the picture in a very obvious way. At the end of the video, subjects were asked if they saw the gorilla. Most did
not!

Rachel and Lauren hypothesized that IB could be predicted from performance on the Stroop Color Word test. This test
produces three scores: "W" (word alone, i.e., a score derived from reading a list of color words such as red, green, black),
"C" (color alone, in which a score is derived from naming the color in which a series of Xs are printed), and "CW" (the
Stroop task, in which a score is derived from the subject's attempt to name the color in which a color word is printed when
the word and the color do not agree). The data are in the following table, in which the response, "seen", is coded as 0=no
and 1=yes...

seen W C CW
1 0 126 86 64
2 0 118 76 54
3 0 61 66 44
4 0 69 48 32
5 0 57 59 42
6 0 78 64 53
7 0 114 61 41
8 0 81 85 47
9 0 73 57 33
10 0 93 50 45
11 0 116 92 49
12 0 156 70 45
13 0 90 66 48
14 0 120 73 49
15 0 99 68 44
16 0 113 110 47
17 0 103 78 52
18 0 123 61 28
19 0 86 65 42
20 0 99 77 51
21 0 102 77 54
22 0 120 74 53
23 0 128 100 56
24 0 100 89 56
25 0 95 61 37
26 0 80 55 36
27 0 98 92 51
28 0 111 90 52
29 0 101 85 45
30 0 102 78 51
31 1 100 66 48
32 1 112 78 55
33 1 82 84 37
34 1 72 63 46
35 1 72 65 47
36 1 89 71 49
37 1 108 46 29
38 1 88 70 49
39 1 116 83 67
40 1 100 69 39
41 1 99 70 43
42 1 93 63 36
43 1 100 93 62
44 1 110 76 56
45 1 100 83 36
46 1 106 71 49
47 1 115 112 66
48 1 120 87 54
49 1 97 82 41
To get them into R, try this first...
> file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/gorilla.csv"
> read.csv(file) -> gorilla
> str(gorilla)

'data.frame': 49 obs. of 4 variables:


$ seen: int 0 0 0 0 0 0 0 0 0 0 ...
$ W : int 126 118 61 69 57 78 114 81 73 93 ...
$ C : int 86 76 66 48 59 64 61 85 57 50 ...
$ CW : int 64 54 44 32 42 53 41 47 33 45 ...

If that doesn't work (and it should), try copying and pasting this script into R at the command prompt...
### Begin copying here.
gorilla = data.frame(rep(c(0,1),c(30,19)),
c(126,118,61,69,57,78,114,81,73,93,116,156,90,120,99,113,103,123,
86,99,102,120,128,100,95,80,98,111,101,102,100,112,82,72,72,
89,108,88,116,100,99,93,100,110,100,106,115,120,97),
c(86,76,66,48,59,64,61,85,57,50,92,70,66,73,68,110,78,61,65,
77,77,74,100,89,61,55,92,90,85,78,66,78,84,63,65,71,46,70,
83,69,70,63,93,76,83,71,112,87,82),
c(64,54,44,32,42,53,41,47,33,45,49,45,48,49,44,47,52,28,42,51,54,
53,56,56,37,36,51,52,45,51,48,55,37,46,47,49,29,49,67,39,43,36,
62,56,36,49,66,54,41))
colnames(gorilla) = c("seen","W","C","CW")
str(gorilla)
### End copying here.

And if that doesn't work, well, you know what you have to do!

We might begin like this...

> cor(gorilla) ### a correlation matrix


seen W C CW
seen 1.00000000 -0.03922667 0.05437115 0.06300865
W -0.03922667 1.00000000 0.43044418 0.35943580
C 0.05437115 0.43044418 1.00000000 0.64463361
CW 0.06300865 0.35943580 0.64463361 1.00000000
...or like this...
> with(gorilla, tapply(W, seen, mean))
0 1
100.40000 98.89474
> with(gorilla, tapply(C, seen, mean))
0 1
73.76667 75.36842
> with(gorilla, tapply(CW, seen, mean))
0 1
46.70000 47.84211

The Stroop scale scores are moderately positively correlated with each other, but none of them appears to be related to the
"seen" response variable, at least not to any impressive extent. There doesn't appear to be much here to look at. Let's have
a go at it anyway.

Since the response is a binomial variable, a logistic regression can be done as follows...

> glm.out = glm(seen ~ W * C * CW, family=binomial(logit), data=gorilla)


> summary(glm.out)
Call:
glm(formula = seen ~ W * C * CW, family = binomial(logit), data = gorilla)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8073 -0.9897 -0.5740 1.2368 1.7362
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.323e+02 8.037e+01 -1.646 0.0998 .
W 1.316e+00 7.514e-01 1.751 0.0799 .
C 2.129e+00 1.215e+00 1.753 0.0797 .
CW 2.206e+00 1.659e+00 1.329 0.1837
W:C -2.128e-02 1.140e-02 -1.866 0.0621 .
W:CW -2.201e-02 1.530e-02 -1.439 0.1502
C:CW -3.582e-02 2.413e-02 -1.485 0.1376
W:C:CW 3.579e-04 2.225e-04 1.608 0.1078
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 65.438 on 48 degrees of freedom

Residual deviance: 57.281 on 41 degrees of freedom


AIC: 73.281
Number of Fisher Scoring iterations: 5
> anova(glm.out, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: seen
Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)


NULL 48 65.438
W 1 0.075 47 65.362 0.784
C 1 0.310 46 65.052 0.578
CW 1 0.106 45 64.946 0.745
W:C 1 2.363 44 62.583 0.124
W:CW 1 0.568 43 62.015 0.451
C:CW 1 1.429 42 60.586 0.232
W:C:CW 1 3.305 41 57.281 0.069
Two different extractor functions have been used to see the results of our analysis. What do they mean?

The first gives us what amount to regression coefficients with standard errors and a z-test, as we saw in the single variable
example above. None of the coefficients are significantly different from zero (but a few are close). The deviance was
reduced by 8.157 points on 7 degrees of freedom, for a p-value of...

> 1 - pchisq(8.157, df=7)


[1] 0.3189537
Overall, the model appears to have performed poorly, showing no significant reduction in deviance (no significant
difference from the null model).

The second print out shows the same overall reduction in deviance, from
65.438 to 57.281 on 7 degrees of freedom. In this print out, however, the
reduction in deviance is shown for each term, added sequentially first to
last. Of note is the three-way interaction term, which produced a nearly
significant reduction in deviance of 3.305 on 1 degree of freedom
(p = 0.069).

In the event you are encouraged by any of this, the following graph might
be revealing...

> plot(glm.out$fitted)
> abline(v=30.5,col="red")
> abline(h=.3,col="green")
> abline(h=.5,col="green")
> text(15,.9,"seen = 0")
> text(40,.9,"seen = 1")

I'll leave it up to you to interpret this on your own time.

Logistic Regression: Categorical Predictors

Let's re-examine the "UCBAdmissions" data set, which we looked at in a previous tutorial...

> ftable(UCBAdmissions, col.vars="Admit")


Admit Admitted Rejected
Gender Dept
Male A 512 313
B 353 207
C 120 205
D 138 279
E 53 138

F 22 351
Female A 89 19
B 17 8
C 202 391
D 131 244
E 94 299
F 24 317
The data are from 1973 and show admissions by gender to the top six grad programs at the University of California,
Berkeley. Looked at as a two-way table, there appears to be a bias against admitting women...
> dimnames(UCBAdmissions)
$Admit
[1] "Admitted" "Rejected"
$Gender
[1] "Male" "Female"
$Dept
[1] "A" "B" "C" "D" "E" "F"
> margin.table(UCBAdmissions, c(2,1))
Admit
Gender Admitted Rejected
Male 1198 1493
Female 557 1278

However, there are also relationships between "Gender" and "Dept" as well as between "Dept" and "Admit", which means
the above relationship may be confounded by "Dept" (or "Dept" might be a lurking variable, in the language of traditional
regression analysis). Perhaps a logistic regression with the binomial variable "Admit" as the response can tease these
variables apart.

If there is a way to conveniently get that flat table into a data frame (without splitting an infinitive), I don't know it. So I
had to do this...

> ucb.df = data.frame(gender=rep(c("Male","Female"),c(6,6)),


+ dept=rep(LETTERS[1:6],2),
+ yes=c(512,353,120,138,53,22,89,17,202,131,94,24),
+ no=c(313,207,205,279,138,351,19,8,391,244,299,317))
> ucb.df
gender dept yes no
1 Male A 512 313
2 Male B 353 207
3 Male C 120 205
4 Male D 138 279
5 Male E 53 138
6 Male F 22 351
7 Female A 89 19
8 Female B 17 8
9 Female C 202 391
10 Female D 131 244
11 Female E 94 299
12 Female F 24 317
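
For what it's worth, one possible route from the built-in table to a data frame like this goes through as.data.frame( ). The sketch below is only one way to do it, and the names long, adm, rej, and ucb.alt are made up for the illustration.

```r
# UCBAdmissions is a built-in 3-way table (Admit x Gender x Dept).
long <- as.data.frame(UCBAdmissions)   # columns: Admit, Gender, Dept, Freq

# Split the long form by admission status. Both halves keep the same
# Gender/Dept ordering, so their rows line up one to one.
adm <- long[long$Admit == "Admitted", ]
rej <- long[long$Admit == "Rejected", ]
stopifnot(all(adm$Gender == rej$Gender), all(adm$Dept == rej$Dept))

# Put the admitted and rejected counts side by side.
ucb.alt <- data.frame(gender = adm$Gender, dept = adm$Dept,
                      yes = adm$Freq, no = rej$Freq)
ucb.alt
```

The rows come out in a different order than in ucb.df, and the factor levels follow the table's order (Male first), so the base level in a model would differ unless you relevel. The rest of this section sticks with ucb.df as typed in by hand above.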
Once again, we do not have a binary coded response variable, so the last two columns of this data frame will have to be
bound into the columns of a table to serve as the response in the model formula...
> mod.form = "cbind(yes,no) ~ gender * dept"
> glm.out = glm(mod.form, family=binomial(logit), data=ucb.df)

I used a trick here of storing the model formula in a data object, and then entering the name of this object into the glm( )
function. That way, if I made a mistake in the model formula (or want to run an alternative model), I have only to edit the
"mod.form" object to do it.

Let's see what we have found...

> options(show.signif.stars=F) # turn off significance stars (optional)


> anova(glm.out, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: cbind(yes, no)

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)


NULL 11 877.06
gender 1 93.45 10 783.61 4.167e-22
dept 5 763.40 5 20.20 9.547e-163
gender:dept 5 20.20 0 -2.265e-14 1.144e-03
This is a saturated model, meaning we have used up all our degrees of freedom, and there is no residual deviance left over
at the end. Saturated models always fit the data perfectly. In this case, it appears the saturated model is required to explain
the data adequately. If we leave off the interaction term, for example, we will be left with a residual deviance of 20.2 on 5
degrees of freedom, and the model will be rejected (p = .001144). It appears all three terms are making a significant
contribution to the model.
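
To see the rejected model directly, you can fit it without the interaction. This is a sketch only: the data frame is rebuilt here so the block runs on its own, and the name glm.add is made up so that glm.out is left alone.

```r
# Same data as ucb.df above, rebuilt so this block is self-contained.
ucb.df <- data.frame(gender = rep(c("Male", "Female"), c(6, 6)),
                     dept = rep(LETTERS[1:6], 2),
                     yes = c(512, 353, 120, 138, 53, 22, 89, 17, 202, 131, 94, 24),
                     no = c(313, 207, 205, 279, 138, 351, 19, 8, 391, 244, 299, 317))

# Main effects only: no gender:dept interaction.
glm.add <- glm(cbind(yes, no) ~ gender + dept,
               family = binomial(logit), data = ucb.df)

# Residual deviance and its degrees of freedom (compare with the
# "dept" row of the analysis of deviance table).
deviance(glm.add)
df.residual(glm.add)

# Goodness-of-fit p-value for the additive model.
pchisq(deviance(glm.add), df = df.residual(glm.add), lower.tail = FALSE)
```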

How they are contributing appears if we use the other extractor...


> summary(glm.out)
Call:
glm(formula = mod.form, family = binomial(logit), data = ucb.df)
Deviance Residuals:
[1] 0 0 0 0 0 0 0 0 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5442 0.2527 6.110 9.94e-10
genderMale -1.0521 0.2627 -4.005 6.21e-05
deptB -0.7904 0.4977 -1.588 0.11224
deptC -2.2046 0.2672 -8.252 < 2e-16
deptD -2.1662 0.2750 -7.878 3.32e-15
deptE -2.7013 0.2790 -9.682 < 2e-16
deptF -4.1250 0.3297 -12.512 < 2e-16
genderMale:deptB 0.8321 0.5104 1.630 0.10306
genderMale:deptC 1.1770 0.2996 3.929 8.53e-05
genderMale:deptD 0.9701 0.3026 3.206 0.00135
genderMale:deptE 1.2523 0.3303 3.791 0.00015
genderMale:deptF 0.8632 0.4027 2.144 0.03206
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.7706e+02 on 11 degrees of freedom
Residual deviance: -2.2649e-14 on 0 degrees of freedom
AIC: 92.94
Number of Fisher Scoring iterations: 3
These are the regression coefficients for each predictor in the model, with the base level of each factor being suppressed.
Remember, we are predicting log odds, so to make sense of these coefficients, they need to be "antilogged"...
> exp(-1.0521) # antilog of the genderMale coefficient
[1] 0.3492037
> 1/exp(-1.0521)
[1] 2.863658

This shows that men were actually at a significant disadvantage, at least in department A. Because the model includes the
gender-by-department interaction, the genderMale coefficient applies to the base level of "dept" (department A) rather than
to all departments at once. There, the odds of a male being admitted were only 0.35 times the odds of a female being
admitted. The reciprocal of this turns it on its head. In department A, the odds of a female being admitted were 2.86 times
the odds of a male being admitted.
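
Since this is a saturated model, the genderMale coefficient can be checked by hand against the raw counts for department A, the base level of "dept". A small sketch of the arithmetic follows (the object names are made up).

```r
# Odds of admission = admitted / rejected, from the department A rows.
odds.male.A   <- 512 / 313
odds.female.A <- 89 / 19

# Odds ratio, men relative to women, in department A.
or.A <- odds.male.A / odds.female.A
or.A

# Its log should be close to the genderMale coefficient, -1.0521.
log(or.A)
```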

Each coefficient compares the corresponding predictor to the base level. So...

> exp(-2.2046)
[1] 0.1102946
...the odds of being admitted to department C were only about 1/9th the odds of being admitted to department A, for
applicants at the base level of gender (women). If you want to compare, for example, department C to department D, do this...
> exp(-2.2046) / exp(-2.1662) # C:A / D:A leaves C:D
[1] 0.962328

For women, the odds of being admitted to department C were 0.96 times the odds of being admitted to department D.

(To be honest, I'm not sure I'm comfortable with the interaction in this model. You might want to examine the interaction,
and if you think it doesn't merit inclusion, run the model again without it. Statistics are nice, but in the end it's what makes
sense that should rule the day.)

R Tutorials--Model Formulae

MODEL FORMULAE

This is a short tutorial on writing model formulae for ANOVA and regression analyses. It will be linked to from those
tutorials, but you are welcome to read it just for kicks if you'd like.

R functions such as aov( ), lm( ), and glm( ) use a formula interface to specify the variables to be included in the analysis.
The formula determines the model that will be built (and tested) by the R procedure. The basic format of such a formula
is...

response variable ~ explanatory variables


The tilde should be read "is modeled by" or "is modeled as a function of." The trick is in how the explanatory variables are
given.

A basic regression analysis would be formulated this way...

y ~ x
...where "x" is the explanatory variable or IV, and "y" is the response variable or DV. Additional explanatory variables
would be added in as follows...
y ~ x + z

...which would make this a multiple regression with two predictors. This raises a critical issue that must be understood to
get model formulae correct. Symbols used as mathematical operators in other contexts do not have their usual
mathematical meaning inside model formulae. The following table lists the meaning of these symbols when used in a
formula.

symbol   example          meaning
+        +x               include this variable
-        -x               delete this variable
:        x:z              include the interaction between these variables
*        x*z              include these variables and the interactions between them
/        x/z              nesting: include z nested within x
|        x|z              conditioning: include x given z
^        (u + v + w)^3    include these variables and all interactions up to three way
poly     poly(x,3)        polynomial regression: orthogonal polynomials
Error    Error(a/b)       specify the error term
I        I(x*z)           as is: include a new variable consisting of these variables multiplied
1        -1               intercept: delete the intercept (regress through the origin)

You may have noticed already that some formula structures can be specified in more than one way...

y ~ u + v + w + u:v + u:w + v:w + u:v:w


y ~ u * v * w
y ~ (u + v + w)^3

All three of these specify a model in which the variables "u", "v", "w", and all the interactions between them are included.
Any of these formats...
y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2

...would delete the three way interaction.
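
You don't have to take these equivalences on faith. The terms( ) function expands a formula without needing any data, so the term labels can be compared directly. In this sketch, labels.of is a made-up helper, and "y", "u", "v", and "w" are just placeholder names.

```r
# Hypothetical helper: list the terms a formula expands to.
labels.of <- function(f) attr(terms(f), "term.labels")

# Three ways of writing the full model.
full.1 <- labels.of(y ~ u + v + w + u:v + u:w + v:w + u:v:w)
full.2 <- labels.of(y ~ u * v * w)
full.3 <- labels.of(y ~ (u + v + w)^3)
setequal(full.1, full.2)
setequal(full.2, full.3)

# Three ways of writing the model without the three way interaction.
red.1 <- labels.of(y ~ u + v + w + u:v + u:w + v:w)
red.2 <- labels.of(y ~ u * v * w - u:v:w)
red.3 <- labels.of(y ~ (u + v + w)^2)
setequal(red.1, red.2)
setequal(red.2, red.3)
red.3
```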

The nature of the variables--binary, categorical (factors), numerical--will determine the nature of the analysis. For example,
if "u" and "v" are factors...

y ~ u + v
...dictates an analysis of variance (without the interaction term). If "u" and "v" are numerical, the same formula would
dictate a multiple regression. If "u" is numerical and "v" is a factor, then an analysis of covariance is dictated.

That ought to do it for now. Specific examples will appear in the tutorials devoted to specific analyses.

revised 2013 June 22

R Tutorials--More Descriptive Statistics

MORE DESCRIPTIVE STATISTICS

Categorical Data

You summarize categorical data basically by counting up frequencies and by calculating proportions and percentages.

Categorical data are commonly encountered in three forms: a frequency table or crosstabulation, a flat table, or a case-by-
case data frame. Let's begin with the last of these. Copy and paste the following lines ALL AT ONCE into R. That is,
highlight these lines with your mouse, hit Ctrl-C on your keyboard, click at a command prompt in R, and hit Ctrl-V on
your keyboard, and hit Enter if necessary, i.e., if R hasn't returned to a command prompt. On the Mac, use Command-C
and Command-V. This will execute these lines as a script and create a data frame called "ucb" in your workspace.
WARNING: Your workspace will also be cleared, so save anything you don't want to lose first.

# Begin copying here.


rm(list=ls())
gender = rep(c("female","male"),c(1835,2691))
admitted = rep(c("yes","no","yes","no"),c(557,1278,1198,1493))
dept = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
c(89,17,202,131,94,24,19,8,391,244,299,317))
dept2 = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
c(512,353,120,138,53,22,313,207,205,279,138,351))
department = c(dept,dept2)
ucb = data.frame(gender,admitted,department,stringsAsFactors=TRUE) # factors even in R >= 4.0
rm(gender,admitted,dept,dept2,department)
ls()
# End copying here.

Data sets that are purely categorical are not economically represented in case-by-case data frames, and so the built-in data
sets that are purely categorical come in the form of tables (contingency tables or crosstabulations). We have just taken the
data from one of these (the "UCBAdmissions" built-in data set) and turned it into a case-by-case data frame. It's the classic
University of California, Berkeley, admissions data from 1973 describing admissions into six different graduate programs
broken down by gender.

First, we want some general information about the data frame...


> str(ucb) # Examine the structure.
'data.frame': 4526 obs. of 3 variables:
$ gender : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
$ admitted : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ department: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
> ucb[seq(1,4526,400),] # Look at every 400th row.
gender admitted department
1 female yes A
401 female yes D
801 female no C
1201 female no D
1601 female no F
2001 male yes A
2401 male yes B
2801 male yes C
3201 male no A
3601 male no C
4001 male no D
4401 male no F
There are 4526 cases or observations or people represented in this data frame, with three variables observed on each one:
gender, admitted, and department the person applied to. All three variables are coded as factors, but this doesn't matter. For
our present purposes, categorical variables and factors are the same thing.

What would we like to know? Here's a good start...

> summary(ucb)
gender admitted department
female:1835 no :2771 A:933
male :2691 yes:1755 B:585
C:918
D:792
E:584
F:714
We now have frequency tables for each of the variables. We could get the same information for any individual variable
using the table( ) function...
> table(ucb$gender)
female male
1835 2691
>
> table(ucb$admitted)
no yes
2771 1755
>
> table(ucb$department)
A B C D E F
933 585 918 792 584 714

These tables can easily be turned into relative frequency tables using the prop.table( ) function...
> table(ucb$department) -> dept.table # Requires a table as an argument.
> prop.table(dept.table) # Calculate proportions.
A B C D E F
0.2061423 0.1292532 0.2028281 0.1749890 0.1290323 0.1577552
>
> prop.table(dept.table) * 100 # Or calculate percentages.
A B C D E F
20.61423 12.92532 20.28281 17.49890 12.90323 15.77552

As we see, 20.6% of the applicants applied to department A, 12.9% to department B, 20.3% to department C, etc.

The prop.table( ) function requires a table object as its argument, so the data must first be tabled before being
prop.tabled.

Contingency tables or crosstabs can be produced using either the table( ) or xtabs( ) function. Table is easier, so
I'll illustrate that first...

> with(ucb, table(gender, admitted)) # or table(ucb$gender, ucb$admitted)


admitted
gender no yes
female 1278 557
male 1493 1198
Within the table( ) function, the first variable named will be in the rows of the contingency table, and the second
variable named will be in the columns. If a third variable is named, it will form separate layers or strata of a three
dimensional contingency table.

When using prop.table( ) on a multidimensional table, it's necessary to specify which marginal sums you want to
use to calculate the proportions. To use the row sums, specify 1; to use the column sums, specify 2...

> with(ucb, table(gender, admitted)) -> gen.adm.table


> prop.table(gen.adm.table, 1) # With respect to row marginal sums.
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
>
> prop.table(gen.adm.table, 2) # With respect to column marginal sums.
admitted

gender no yes
female 0.4612053 0.3173789
male 0.5387947 0.6826211
The name of this option is "margin=", so this form could also have been used...
> prop.table(gen.adm.table, margin=1)
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877

Since "margin=" is the first thing prop.table( ) expects after the table name, either form will work here.

The xtabs( ) function works quite a bit differently. It uses a formula interface. The formula interface is used most often
in model building and significance testing, so we'll see it a lot, and it's discussed in detail in another tutorial. Formulas can
become quite complex, but their most basic form is as follows...

DV ~ IV1 + IV2 + IV3 + ...


First, a dependent or response variable is specified, followed by a tilde (in the upper left corner of most keyboards), which
is read as "is a function of" or "is modeled by", and finally a list of independent or explanatory variables separated by plus
signs. For xtabs( ) there is no DV (in a case-by-case data frame), so we just leave it out...
> with(ucb, xtabs(~ gender + admitted))
admitted
gender no yes
female 1278 557
male 1493 1198

Instead of using with( ) to give the name of the data frame, we could also have used the data= option, since we are
using a formula interface in the function...
> xtabs(~ gender + admitted, data=ucb)
admitted
gender no yes
female 1278 557
male 1493 1198

The resulting table could also have been stored and operated on with other functions. Here are some examples...
> xtabs(~ gender + admitted, data=ucb) -> gen.adm.table
> prop.table(gen.adm.table, 1) # Get proportions relative to row sums.
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
>
> addmargins(gen.adm.table) # Add marginal sums to table.
admitted
gender no yes Sum
female 1278 557 1835
male 1493 1198 2691
Sum 2771 1755 4526
>
> margin.table(gen.adm.table, 1) # Collapse over admitted (row marginals).
gender
female male
1835 2691
>
> margin.table(gen.adm.table, 2) # Collapse over gender (column marginals).
admitted
no yes
2771 1755

Six different crosstabulations of the entire data set are possible, depending upon the order in which we list the variables...
> with(ucb, table(gender, department, admitted))
, , admitted = no
department
gender A B C D E F
female 19 8 391 244 299 317
male 313 207 205 279 138 351
, , admitted = yes
department
gender A B C D E F
female 89 17 202 131 94 24
male 512 353 120 138 53 22
> with(ucb, table(admitted, department, gender))

, , gender = female
department
admitted A B C D E F
no 19 8 391 244 299 317
yes 89 17 202 131 94 24
, , gender = male
department
admitted A B C D E F
no 313 207 205 279 138 351
yes 512 353 120 138 53 22

Etc.

A flat table is produced from the data frame by using the ftable( ) function. Use the "col.vars=" option to control
which variable goes in the columns...

> ftable(ucb) # Last variable in columns.


department A B C D E F
gender admitted
female no 19 8 391 244 299 317
yes 89 17 202 131 94 24
male no 313 207 205 279 138 351
yes 512 353 120 138 53 22
>
> ftable(ucb, col.vars="gender") # Gender in columns.
gender female male
admitted department
no A 19 313
B 8 207
C 391 205
D 244 279
E 299 138
F 317 351
yes A 89 512
B 17 353
C 202 120
D 131 138
E 94 53
F 24 22
>
> ftable(ucb, col.vars="admitted") # Admitted in columns.
admitted no yes
gender department
female A 19 89
B 8 17
C 391 202
D 244 131
E 299 94
F 317 24
male A 313 512
B 207 353
C 205 120
D 279 138
E 138 53
F 351 22
>
> ftable(ucb, col.vars="admitted") -> my.table
> prop.table(my.table, 1) # prop.table() works here, too.
admitted no yes
gender department
female A 0.17592593 0.82407407
B 0.32000000 0.68000000
C 0.65935919 0.34064081
D 0.65066667 0.34933333
E 0.76081425 0.23918575
F 0.92961877 0.07038123
male A 0.37939394 0.62060606
B 0.36964286 0.63035714
C 0.63076923 0.36923077
D 0.66906475 0.33093525
E 0.72251309 0.27748691
F 0.94101877 0.05898123

Flat tables can also be made from contingency tables, the order of the row variables can be changed, and multiple column
variables can be specified...
> with(ucb, table(admitted,department,gender)) -> my.table # A 3D contingency table.
> ftable(my.table)
gender female male
admitted department
no A 19 313
B 8 207
C 391 205
D 244 279
E 299 138
F 317 351
yes A 89 512
B 17 353
C 202 120
D 131 138
E 94 53
F 24 22
>
> ftable(my.table, col.vars="admitted")
admitted no yes
department gender
A female 19 89
male 313 512
B female 8 17
male 207 353
C female 391 202
male 205 120
D female 244 131
male 279 138
E female 299 94
male 138 53
F female 317 24
male 351 22
>
> ftable(my.table, row.vars=c("gender","department"), col.vars="admitted")
admitted no yes
gender department
female A 19 89
B 8 17
C 391 202
D 244 131
E 299 94
F 317 24
male A 313 512
B 207 353
C 205 120
D 279 138
E 138 53
F 351 22
>
> ftable(my.table, row.vars="department", col.vars=c("gender","admitted"))
gender female male
admitted no yes no yes
department
A 19 89 313 512
B 8 17 207 353
C 391 202 205 120
D 244 131 279 138
E 299 94 138 53
F 317 24 351 22

Use the "row.vars=" option to control the order in which the row variables occur. If more than one variable name is to be
used in either the "row.vars" or "col.vars" option, use vector notation to specify them. (HINT: In R, everything is a vector,
even if it has only one element! The notation row.vars="department" is just a shortcut for row.vars=c("department").)
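
A quick check of that hint, as a sketch using the built-in "UCBAdmissions" table (whose variable names are capitalized, unlike the ones in my.table):

```r
# A bare string is already a character vector of length one.
identical("department", c("department"))
length("department")

# So the two spellings of row.vars= produce the same flat table.
ft.short <- ftable(UCBAdmissions, row.vars = "Dept")
ft.long  <- ftable(UCBAdmissions, row.vars = c("Dept"))
identical(ft.short, ft.long)
```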

You can also get a VERY useful and more efficient data frame from a contingency table as follows...
> as.data.frame(my.table) -> my.df.table
> my.df.table
admitted department gender Freq
1 no A female 19
2 yes A female 89
3 no B female 8
4 yes B female 17
5 no C female 391
6 yes C female 202
7 no D female 244
8 yes D female 131
9 no E female 299
10 yes E female 94
11 no F female 317
12 yes F female 24

13 no A male 313
14 yes A male 512
15 no B male 207
16 yes B male 353
17 no C male 205
18 yes C male 120
19 no D male 279
20 yes D male 138
21 no E male 138
22 yes E male 53
23 no F male 351
24 yes F male 22
There are only 24 possible unique cases in these data (2 genders times 2 admit statuses times 6 departments). So why put
this information into a data frame with 4500+ rows in it? The form above is much more efficient. It lists the possible
unique cases in the first three columns (every possible combination of factors) and then gives a frequency in the last
column, Freq.

When there is a Freq column in the data frame, you CANNOT use table( ) to get crosstabulations. You must use
xtabs( ), AND you must specify Freq as the DV.

> xtabs(Freq ~ gender + admitted, data=my.df.table)


admitted
gender no yes
female 1278 557
male 1493 1198
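
Here is a small sketch of what goes wrong if you forget. It uses the built-in "UCBAdmissions" table so that it runs on its own, which means the variable names are Admit, Gender, and Dept rather than the ones in my.df.table. The names freq.df, wrong, and right are made up.

```r
# A frequency-style data frame: 24 rows, one per combination, plus Freq.
freq.df <- as.data.frame(UCBAdmissions)

# table() just counts rows, so every cell comes out as 6
# (one row per department), which is not what we want.
wrong <- with(freq.df, table(Gender, Admit))
wrong

# xtabs() with Freq on the left of the tilde sums the frequencies.
right <- xtabs(Freq ~ Gender + Admit, data = freq.df)
right
```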
We'll look at graphics for categorical data in the next tutorial.
> rm(list=ls())

Numerical Data

Imagine a number line stretching all the way across the universe from negative infinity to positive infinity. Somewhere
along this number line is your little mound of data, and you have to tell someone how to find it. What information would it
be useful to give this person?

First, you might want to tell this person how big the mound is. Is she to look for a sizeable mound of data or just a little
speck? Second, you might want to tell this person about where to look. Where is the mound centered on the number line?
Third, you might want to tell this person how spread out to expect the mound to be. Are the data all in a nice, compact
mound, or are they spread out all over the place? Finally, you might want to tell this person what shape to expect the mound to
be.

The data frame "faithful" consists of 272 observations on the Old Faithful geyser in Yellowstone National Park. Each
observation consists of two variables: "eruptions" (how long an eruption lasted in minutes), and "waiting" (how long in
minutes the geyser was quiet before that eruption)...

> data(faithful)
> str(faithful)
'data.frame': 272 obs. of 2 variables:
$ eruptions: num 3.60 1.80 3.33 2.28 4.53 ...
$ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
I'm going to attach this so that we can easily look at the variables inside. Remind me to detach it when we're done!
> attach(faithful)

We'll begin with the "waiting" variable. Let's see if we can characterize it.

First, we want to know how many values are in this variable. If this variable is our mound of data somewhere on that
number line, how big a mound is it?
> length(waiting)
[1] 272
I don't suppose you were much surprised by that. We already knew that this data frame contains 272 cases, so each
variable must be 272 cases long. There is a slight catch, however. The length( ) function counts missing values (NAs)
as cases, and there is no way to stop it from doing so. That is, there is no na.rm= option for this function. In fact, there are

descriptive.html[27/01/2014 22:17:37]
R Tutorials--More Descriptive Statistics

no options at all. It will not even reveal the presence of missing values, and for this reason the length( ) function can
give a misleading result when missing values are present.

The following command will tell you how many of the values in a variable are NAs (missing values)...

> sum(is.na(waiting))
[1] 0
It does this as follows. The is.na( ) function tests every value in a vector for missingness, returning either TRUE or
FALSE for each value. Remember, in R, TRUEs add as ones and FALSEs add as zeroes, so when we sum the result of
testing for missingness, we get the number of TRUEs, i.e., the number of missing values. It's kind of a nuisance to
remember that, so I propose a new function, one that doesn't actually exist in R.
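
If you want to watch the TRUEs and FALSEs being added up, here is a small sketch. The vector "v" is made up for the illustration (it is not part of the faithful data), and so are the names n_missing and prop_missing.

```r
# A made-up vector with two missing values (an assumption for illustration).
v <- c(4, NA, 7, 9, NA, 2)

is.na(v)                          # one TRUE or FALSE per value
n_missing <- sum(is.na(v))        # TRUEs count as 1, FALSEs as 0
prop_missing <- mean(is.na(v))    # proportion of values that are missing
n_missing
prop_missing
```

The same trick works with any test that returns TRUE or FALSE, not just is.na( ).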

Here's one of the nicest things about R. If it doesn't do something you want it to do, you can write a function that does it.
We'll talk about this in detail in a future tutorial, but for right now the basic idea is this. If you can calculate it at the
command line, you can write a function to do it. In this case...
> length(waiting) - sum(is.na(waiting))
[1] 272
...would give us the number of nonmissing cases in "waiting". So do this (and be VERY CAREFUL with your typing)...
> sampsize = function(x) length(x) - sum(is.na(x))
> sampsize(waiting)
[1] 272
> ls()
[1] "faithful" "sampsize"

The first line creates a function called "sampsize" and gives it a mathematical definition (i.e., tells how to calculate it). The
variable "x" is called a dummy variable, or a place holder. When we actually use the function, as we did in the second line,
we put in the name of the vector we want "sampsize" calculated for. This takes the place of "x" in the calculation. Notice
that an object called "sampsize" has been added to your workspace. It will be there until you remove it (or neglect to save
the workspace at the end of this R session). Better yet, if you go back to your default working directory and save it there, it
will load automatically every time you start R. We'll do that at the end of this tutorial.
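
There is usually more than one way to write a function like this. As a sketch, here is an equivalent definition that counts the nonmissing values directly. The name sampsize2 and the test vector "v" are made up for the illustration.

```r
# The tutorial's definition, repeated so this block runs on its own.
sampsize <- function(x) length(x) - sum(is.na(x))

# An equivalent way to write it: count the values that are NOT missing.
sampsize2 <- function(x) sum(!is.na(x))   # hypothetical name

v <- c(10, NA, 30, 40, NA)                # made-up test vector
sampsize(v)
sampsize2(v)
```

Both versions should agree on any vector, with or without missing values.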

The second thing we want to know about our little mound of data is where it is located on our number line. Where is it
centered? In other words, we want a measure of center or of central tendency. There are three that are commonly used:
mean, median, and mode. The mode is not very useful and is rarely used in statistical calculations, so there is no R
function for it. Mean and median, on the other hand, are straightforward...

> mean(waiting)
[1] 70.89706
> median(waiting)
[1] 76
Now we know to hunt around somewhere in the seventies for our data.
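
If you ever do need the mode, one way to get it is from a frequency table. This is only a sketch, not something built into R: the function name stat_mode is made up, it assumes a numeric vector, and when several values tie for most frequent it reports only the first of them.

```r
# Tabulate the values and pick the one with the highest frequency.
# If several values tie, which.max() returns only the first of them.
stat_mode <- function(x) {                  # hypothetical name
  freq <- table(x)                          # table() drops NAs by default
  as.numeric(names(freq)[which.max(freq)])  # convert the label back to a number
}

stat_mode(faithful$waiting)
```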

How spread out should we expect it to be? This is given to us by a measure of spread, also called a measure of variability
or a measure of dispersion. There are several of these in common usage: the range, the interquartile range, the variance,
and the standard deviation. Here's how to get them...

> range(waiting)
[1] 43 96
> IQR(waiting)
[1] 24
> var(waiting)
[1] 184.8233
> sd(waiting)
[1] 13.59497
There are several things you should be aware of here. First, range( ) does not actually give the range (difference
between the maximum and minimum values), it gives the minimum and maximum value. So the values in "waiting" range
from a minimum of 43 to a maximum of 96.
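
If you want the range as a single number, subtract the minimum from the maximum. A minimal sketch, using faithful$waiting so that it works whether or not the data frame is attached:

```r
w <- faithful$waiting        # the same "waiting" values, without attach()
diff(range(w))               # maximum minus minimum
max(w) - min(w)              # the same thing, spelled out
```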

Second, the interquartile range (IQR) is defined as the difference between the value at the 3rd quartile and the value at the
1st quartile. However, there is not universal agreement on how these values should be calculated, and different software
packages do it differently. R allows nine different methods to calculate these values, and the one used by default is NOT
the one you were probably taught in elementary statistics. So the result given by IQR( ) is not necessarily the one you
would get if you did the calculation by hand (or on a TI-84 calculator). It should be very close though. If your instructor
insists that you get the "calculator" value, do the calculation this way...

> quantile(waiting, c(.75,.25), type=2)


75% 25%
82 58
...and then subtract those two values (82-58=24). In this case, the answer is the same. While we're here, this would be a
good time to mention percentiles. Do you need the 65th percentile of the "waiting" values? Here it is...
> quantile(waiting, .65)
65%
79

The value waiting = 79 is at the 65th percentile.
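
With "waiting" the two methods happened to agree, so here is a sketch with a small made-up vector where they do not. It assumes a reasonably current version of R, in which IQR( ) accepts the same type= option as quantile( ).

```r
x <- 1:10                               # made-up data

quantile(x, c(.25, .75))                # default method, type=7
quantile(x, c(.25, .75), type=2)        # the "calculator" style method

iqr_default <- IQR(x)                   # uses type=7
iqr_type2 <- IQR(x, type=2)             # same method as the second line above
iqr_default
iqr_type2
```

The default method interpolates between data values, while type=2 always lands on a data value or on the average of two adjacent ones. That is why the two answers can differ a little in small samples.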

The last thing you need to be aware of is that the variance and standard deviation calculated above are the sample
values. That is, they are calculated as estimates from a sample of the population values (the n − 1 method). There are no
functions for the population variance and standard deviation, although they are easily enough calculated if you need them.
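
Here is one way the population values might be calculated, as a sketch. The function names pop_var and pop_sd are made up (nothing by those names is built into R). The idea is just to rescale the sample variance from the n − 1 method to the n method.

```r
# Population versions: divide by n instead of n - 1.
# The names pop_var and pop_sd are made up; they are not built into R.
pop_var <- function(x) {
  x <- x[!is.na(x)]            # drop missing values first
  n <- length(x)
  var(x) * (n - 1) / n         # rescale the sample variance
}
pop_sd <- function(x) sqrt(pop_var(x))

pop_var(faithful$waiting)
pop_sd(faithful$waiting)
```

With 272 cases the difference from var( ) and sd( ) is tiny. It matters more in small samples.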

You may remember from your elementary stats course a statistic called the standard error of the mean, or SEM.
Technically, SEM is a measure of error and not of variability, but you may need to calculate it nevertheless. There is no R
function for it, so let's write one.

-----This is optional. You can skip this if you want.-----

The standard error of the sample mean of "waiting" can be calculated like this...

> sqrt(var(waiting)/length(waiting))
[1] 0.8243164
...the square root of the (variance divided by the sample size). This calculation will choke on missing values, however.
Let's see that by adding a missing value to "waiting"...
> waiting -> wait2
> wait2[100] <- NA
> sqrt(var(wait2)/length(wait2))
[1] NA

It's the var( ) function that coughed up the hairball, so...


> sqrt(var(wait2, na.rm=T)/length(wait2))
[1] 0.8248208

...seems to fix things. But it doesn't. This isn't quite the correct answer, and that is because length( ) gave us the wrong
sample size. (It counted the case that was missing.) We could try this...
> sqrt(var(wait2, na.rm=T)/sampsize(wait2))
[1] 0.8263412

...but that depends upon the existence of the "sampsize" function. If that somehow gets erased, this calculation will fail.
Here's one of the disadvantages of using R: You have to know what you're doing. Unlike other statistical packages, such as
SPSS, to use R you occasionally have to think. It appears some of that is necessary here...
> sem = function(x) sqrt(var(x, na.rm=T)/(length(x)-sum(is.na(x))))
> sem(wait2)
[1] 0.8263412
> ls()
[1] "faithful" "sampsize" "sem" "wait2"

We now have a perfectly good function for calculating SEM in our workspace, and we will not let it get away!

--------This is the end of the optional material.--------

Before we move on, I should remind you that the summary( ) function will give you quite a bit of the above information
in one go...

> summary(waiting)
Min. 1st Qu. Median Mean 3rd Qu. Max.
43.0 58.0 76.0 70.9 82.0 96.0
It will tell you if there are missing values, too...


> summary(wait2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
43.00 58.00 76.00 70.86 82.00 96.00 1.00

The number under NA's is a frequency. There is 1 missing value in "wait2".

The last thing we want to know (for the time being anyway) about our little mound of data is what its shape is. Is it a nice
mound-shaped mound, or is it deformed in some way, skewed or bimodal or worse? This will involve looking at some sort
of frequency distribution.

The table( ) function will do that as well for a numerical variable as it will for a categorical variable, but the result
may not be pretty (try this with eruptions--I dare you!)...

> table(waiting)
waiting
43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
1 3 5 4 3 5 5 6 5 7 9 6 4 3 4 7 6 4 3 4 3 2 1 1 2 4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
5 1 7 6 8 9 12 15 10 8 13 12 14 10 6 6 2 6 3 6 1 1 2 1 1
Looking at this reveals that we have one value of 43, three values of 45, five values of 46, and so on. Surely there's a better
way! We could try a grouped frequency distribution...
> cut(waiting, breaks=10) -> wait3
> table(wait3)
wait3
(42.9,48.3] (48.3,53.6] (53.6,58.9] (58.9,64.2] (64.2,69.5] (69.5,74.8]
16 28 26 24 9 23
(74.8,80.1] (80.1,85.4] (85.4,90.7] (90.7,96.1]
62 55 23 6

The cut( ) function cuts a numerical variable into class intervals, the number of class intervals given (approximately) by
the breaks= option. (R has a bit of a mind of its own, so if you pick a clumsy number of breaks, R will fix that for you.)
The notation (x,y] means the class interval goes from x to y, with x not being included in the interval and y being included.
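
To see what "x not included, y included" means in practice, here is a sketch with three made-up scores that sit right on the break points. The right= and include.lowest= options are part of cut( ); the names scores, right_closed, and left_closed are invented for the example.

```r
scores <- c(50, 55, 60)                       # made-up values

# Default: intervals are open on the left, closed on the right,
# so 50 falls in (40,50] and 60 falls in (50,60].
right_closed <- table(cut(scores, breaks=c(40, 50, 60)))
right_closed

# right=FALSE flips that: closed on the left, open on the right.
# include.lowest=TRUE then closes the last interval so 60 is not lost.
left_closed <- table(cut(scores, breaks=c(40, 50, 60),
                         right=FALSE, include.lowest=TRUE))
left_closed
```

A value sitting exactly on a break point changes class intervals depending on which side is closed, so it pays to check when your data are whole numbers.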

Another disadvantage of using R is that it is intended to be utilitarian. The output will be useful but not necessarily pretty.
We can pretty that up a little bit with a trick...

> as.data.frame(table(wait3))
wait3 Freq
1 (42.9,48.3] 16
2 (48.3,53.6] 28
3 (53.6,58.9] 26
4 (58.9,64.2] 24
5 (64.2,69.5] 9
6 (69.5,74.8] 23
7 (74.8,80.1] 62
8 (80.1,85.4] 55
9 (85.4,90.7] 23
10 (90.7,96.1] 6

You can also specify the break points yourself in a vector, if you are that anal retentive...

> cut(waiting, breaks=seq(40,100,5)) -> wait4


> as.data.frame(table(wait4))
wait4 Freq
1 (40,45] 4
2 (45,50] 22
3 (50,55] 33
4 (55,60] 24
5 (60,65] 14
6 (65,70] 10
7 (70,75] 27
8 (75,80] 54
9 (80,85] 55
10 (85,90] 23
11 (90,95] 5
12 (95,100] 1
We are getting a hint of a bimodal distribution of values here.

A better way of seeing a grouped frequency distribution is by creating a stem-and-leaf display...


> stem(waiting)
The decimal point is 1 digit(s) to the right of the |
4 | 3
4 | 55566666777788899999
5 | 00000111111222223333333444444444
5 | 555555666677788889999999
6 | 00000022223334444
6 | 555667899
7 | 00001111123333333444444
7 | 555555556666666667777777777778888888888888889999999999
8 | 000000001111111111111222222222222333333333333334444444444
8 | 55555566666677888888999
9 | 00000012334
9 | 6
> stem(waiting, scale=.5)
The decimal point is 1 digit(s) to the right of the |
4 | 355566666777788899999
5 | 00000111111222223333333444444444555555666677788889999999
6 | 00000022223334444555667899
7 | 00001111123333333444444555555556666666667777777777778888888888888889
8 | 00000000111111111111122222222222233333333333333444444444455555566666
9 | 000000123346
Use the "scale=" option to determine how many lines are in the display. The value of scale= is not actually the number of
lines, so this may take some fiddling. The first of those displays clearly shows the bimodal structure of this variable.

Oftentimes, we are faced with summarizing numerical data by group membership, or as indexed by a factor. The
tapply( ) and by( ) functions are most useful here...

> detach(faithful)
> rm(faithful, wait2, wait3, wait4) # Out with the old...
> as.data.frame(state.x77) -> states
> states$region <- state.region
> head(states)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
           region
Alabama     South
Alaska       West
Arizona      West
Arkansas    South
California   West
Colorado     West

The "state.x77" data set in R is a collection of information about the U.S. states (in 1977), but it is in the form of a matrix.
We converted it to a data frame. All the variables are numerical, so we created a factor in the data frame by adding
information about census region, from the built-in data set state.region. Now we can do interesting things like break down
life expectancy by regions of the country. But hold on there! Some very foolish person has named that variable "Life Exp",
with a space in the middle of the name. How do we deal with that?

> names(states)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
[6] "HS Grad" "Frost" "Area" "region"
> by(data=states[4], IND=states[9], FUN=mean)
region: Northeast
[1] 71.26444
------------------------------------------------------------
region: South
[1] 69.70625
------------------------------------------------------------
region: North Central
[1] 71.76667
------------------------------------------------------------
region: West
[1] 71.23462
>
> tapply(X=states[,4], IND=states[,9], FUN=mean)
Northeast South North Central West
71.26444 69.70625 71.76667 71.23462
That's one way. There are a couple things to notice here. The arguments to by( ) are listed in the default order, so you
don't really have to name them. The same is true of tapply( ). Notice also that you can be a lot more loosey goosey
with the indexing in by( ) than in tapply( ). You only have to give the column number that you want from the data
frame.
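
A caution: more recent versions of R appear to have dropped the ability of mean( ) to work on a whole data frame, so the by( ) command above may produce NA and a warning there instead of the means shown. The sketch below sidesteps that by handing the functions a plain numeric vector. It also shows two ways to refer to a column whose name contains a space. The object name life_by_region is made up.

```r
states <- as.data.frame(state.x77)      # rebuild the data frame used above
states$region <- state.region

# A name with a space in it can be quoted inside [[ ]] ...
life_by_region <- tapply(states[["Life Exp"]], states$region, mean)
life_by_region

# ... or wrapped in backticks after the $.
tapply(states$`Life Exp`, states$region, mean)

# aggregate( ) does the same job but returns a data frame.
aggregate(states[["Life Exp"]], by=list(region=states$region), FUN=mean)
```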

Of course, we could also rename the column and use the new, more sensible name...

> names(states)[4] # That's the one we want!


[1] "Life Exp"
> names(states)[4] <- "Life.Exp"
> tapply(states$Life.Exp, states$region, mean)
Northeast South North Central West
71.26444 69.70625 71.76667 71.23462
Now, before we quit and clean up, let's save those functions in our default working directory.
> ls()
[1] "sampsize" "sem" "states"
> rm(states) # Do some cleaning up first.
> setwd("..") # Go back one directory level from Rspace.
> getwd() # Check to be sure you're in the right place!
[1] "C:/Documents and Settings/kingw/My Documents"
> ### I'm in WinXP at the moment. This should be your home directory in
> ### any other OS. Now assuming you have both sampsize and sem...
> load(".RData") # To preserve anything already there...
> ls()
[1] "age.out" "m.ex" "m.ex.minussex" "outcome.out"
[5] "respiratory" "sampsize" "seizure" "sem"
[9] "visit.matrix"
> rm(age.out,m.ex,m.ex.minussex,outcome.out,
+ respiratory,seizure,visit.matrix) # I don't want any of that old junk!
> ls() # Check to see functions are still there.
[1] "sampsize" "sem"
> quit("yes") # Quit saving workspace.

The next time you start R, those functions will be loaded automatically.
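
If you would rather not depend on the default workspace, the functions can also be written to a file of their own with save( ) and read back with load( ). This is a sketch: the temporary file stands in for whatever path you would really use, and the name fun_file is made up.

```r
# The two functions from this tutorial, repeated so the block runs on its own.
sampsize <- function(x) length(x) - sum(is.na(x))
sem <- function(x) sqrt(var(x, na.rm=TRUE)/(length(x)-sum(is.na(x))))

# A temporary file stands in for a real path here (an assumption);
# in practice you would pick a file name in a directory of your choosing.
fun_file <- tempfile(fileext=".RData")
save(sampsize, sem, file=fun_file)     # write just these two objects

rm(sampsize, sem)                      # pretend this is a new session
load(fun_file)                         # bring them back
sem(c(1, 2, 3, NA))
```

Keeping functions in a named file makes it easier to copy them to another machine than fishing them out of a saved workspace.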

On to graphical summaries!

revised 2010 August 3

Return to the Table of Contents

R Tutorials--Objects

OBJECTS

Data.

Your data is the information upon which you wish to do a statistical analysis. By the way, the word "data" is plural, so
ordinarily you would not say "data is" or "data was." Correct are "data are" and "data were." I'm not the grammar police,
but I will nail you on that one!

Maintaining a data set is one of the most important things a statistician needs to know how to do. Most statistical software
requires that the data set be in a very specific format, called a data table or, in R, a data frame (one word or two, take your
pick). Data frames will be covered in detail in a future tutorial.

This is where R truly shines. R is much more flexible in that it does not require that you use the data frame format for your
data. If it is more convenient to keep your data in a contingency table, or a list, or a matrix, or a single vector, you can do
so. This flexibility has a price--more to learn. In the end, however, it makes R a much more convenient way to analyze
data sets, especially simple ones.

In the behavioral and social sciences, the unit of analysis is usually a subject, human or animal. In the more general case,
subjects are called "cases" or "observations" or "experimental units." I prefer cases. There will actually come a time when
we have to distinguish between subjects and cases, so you should not think of these two terms as being exactly equivalent.

Let's say you've collected data from five subjects: Bob, Fred, Barb, Sue, and Jeff. From each subject you have collected
information about age, height, weight, race, year in school (they are all college students), and SAT score. Your cases are
Bob, Fred, Barb, Sue, and Jeff. Age, height, weight, race, year in school, and SAT score are called variables. You would
ordinarily put this information into a data frame as follows:
name  age  hgt  wgt  race   year  SAT
Bob    21   70  180  Cauc   Jr    1080
Fred   18   67  156  Af.Am  Fr    1210
Barb   18   64  128  Af.Am  Fr     840
Sue    24   66  118  Cauc   Sr    1340
Jeff   20   72  202  Asian  So     880

Notice that the cases, or subjects, go into rows in this table, and each variable has its own column. This is the standard
form for maintaining a data table (data frame). It looks a lot like a spreadsheet, and in fact, using spreadsheet software is a
very good way to manage data. The first row in this table is called the header. It contains the variable names. Having a
header row is optional but usually a good idea.
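
For the curious, here is a sketch of how the table above could be typed straight into R as a data frame. The object name "students" is made up. Each argument becomes one column, and the argument names play the part of the header row.

```r
# Each argument to data.frame( ) becomes one column (one variable).
students <- data.frame(                     # "students" is a made-up name
  name = c("Bob", "Fred", "Barb", "Sue", "Jeff"),
  age  = c(21, 18, 18, 24, 20),
  hgt  = c(70, 67, 64, 66, 72),
  wgt  = c(180, 156, 128, 118, 202),
  race = c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian"),
  year = c("Jr", "Fr", "Fr", "Sr", "So"),
  SAT  = c(1080, 1210, 840, 1340, 880)
)
students
dim(students)        # rows (cases) first, then columns (variables)
```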

I should call your attention to the fact that we have two fundamentally different kinds of variables in this data frame. Some
are numbers, like age and weight. These are called numerical variables. Other variables contain just the names of
categories that the subject falls into. Race is an example of such a variable, called a categorical variable. It's absolutely
essential that you be able to distinguish these two types of variables. You can't do statistics otherwise! Categorical variables
are often called factors in R. Just to make matters a bit more confusing, examine the "year" variable. What would you call
it, numerical or categorical? If those were your only choices, you'd have to call it categorical. In fact, in this variable the
categories have a natural order to them: Fr, So, Jr, Sr. Sometimes such a categorical variable is called an ordered factor in
R.
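
As a sketch of what an ordered factor looks like in R, here is the "year" column from the table turned into one. The object names year and year_f are made up for the illustration.

```r
year <- c("Jr", "Fr", "Fr", "Sr", "So")      # the "year" column from the table

# levels= gives the natural order; ordered=TRUE tells R the order matters.
year_f <- factor(year, levels=c("Fr", "So", "Jr", "Sr"), ordered=TRUE)
year_f
year_f < "Jr"        # comparisons make sense for an ordered factor
table(year_f)        # counts come out in level order, not alphabetical order
```

Without levels=, R would sort the categories alphabetically (Fr, Jr, So, Sr), which is not the order of a college career.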

You may be more familiar with the terms nominal, ordinal, interval, and ratio variables. Nominal variables and categorical
variables are roughly the same thing. Factors are usually nominal. However, ordered factors are ordinal. Numerical
variables are either interval or ratio variables, and it usually doesn't matter which. One more catch to all this--examine the
column labeled "name" in the table above. Is this a variable? I suppose it is since its value is different for everyone.
Usually when we think of categorical variables or factors, we are thinking of variables that have relatively few possible
values. These values are called levels. The levels of year, for example, are Fr, So, Jr, Sr. When a variable has a different

objects.html[27/01/2014 22:19:11]

value for everyone, like the subject's name or address for example, it's often called a character variable. You will see R
make this distinction, and it's a useful one, so remember it.

You get data into R by creating data objects, so let's see how that is done.

Assignment.

In R you create things, called "objects", by a process called assignment. Start an R session and set the working directory to
Rspace. Also, clear the workspace...

> setwd("Rspace") # There is a menu item for this in the GUI, btw.
> rm(list=ls()) # Or use the menus to do this.
If you don't know what this means or have forgotten to create the Rspace directory, you can find out how in the tutorial
called Preliminaries.

There are three ways to assign data to an object name in R (actually four, but one is rarely used). Here is one way...

> x = 7
This SHOULD NOT be read as "x equals 7", which will result in confusion later. Instead, the single equals sign means
"takes the value" or "is assigned the value." R is not usually picky about spacing, so all of the following are equivalent...
> x=7
> x = 7
> x= 7
> x    =    7
> x = # Press Enter here.
+ 7 # Press Enter again.

Use spacing to make your typed input look "pretty." Or not. It's (generally) up to you. There are a few situations where R
will get uppity about spacing, but usually it is not an issue. DON'T, however, be so silly as to put a space in the middle of
the name of something. That would be bad!

Here is another way to do assignment...

> x <- 7
And here is one place where R insists on the correct spacing. The "arrow" assignment operator is actually two symbols, a
less than sign and a dash or minus (not an underline character no matter what it might look like in your browser). THERE
CANNOT BE A SPACE BETWEEN THEM! Why would anyone want to use two symbols instead of one if they do the
same thing? You'll see!

Now look at the object called "x" in your workspace...

> ls() # the "show me" function


[1] "x"
> x # print out the value of x
[1] 7
We will use the third kind of assignment to overwrite this value...
> 9 -> x # arrow points to the variable name
> x
[1] 9

Two things to note here. First, R is perfectly willing to let you be stupid and overwrite things you have in your workspace.
There is no warning. If you assign something to an object name that already exists, the old object is gone! Second, the
arrow assignment works from either direction. The equal sign does not! When using =, you must give the object name first
followed by the value you wish to give it.
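
Here is a sketch of what goes wrong in the two situations just described: a space inside the arrow, and an equals sign with the value on the left. The names cmp and result are made up. The error is caught with tryCatch( ) only so that the example runs to the end.

```r
x <- 9
cmp <- (x < - 7)     # with a space this is a comparison: "is x less than -7?"
cmp                  # FALSE
x                    # x is still 9; nothing was assigned

x <- 7               # without the space it is assignment
x

# The equals sign only works with the name on the left.
# Putting the value first, as in 7 = x, produces an error,
# which is caught here so the script keeps running.
result <- tryCatch(eval(parse(text="7 = x")),
                   error=function(e) "error")
result
```

The first case is the nasty one, because R gives no error at all. It quietly answers a question you did not mean to ask.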

Objects.

The following data objects exist in R:

vectors
lists
arrays
matrices
tables
data frames

Some of these are more important than others. And there are more, but these are the ones we need to know about for now.
Let's begin at the beginning.

Object and Variable Names.

R doesn't care much what you name things, whether they are variables or complete data objects. As noted in the last
tutorial, however, DO NOT put spaces or dashes in your names. Thus, all of these are acceptable (and different) object or
variable names:

x
X
x2
x.2
x_2
myData
MyData
my_data
my.data
my.data.from.the.learning.experiment
fred
Fred
FRED
Rutherford.B.Hayes

Be creative! But if you make your object names too long, you'll be sorry, because you'll be typing them a lot! Another
warning: It is generally safest to confine yourself to letters, numbers, dots, and underline characters and to start your
variable names with letters. Also, try to avoid using names that are also functions in R, like "mean" for example, although
R will usually work around this. The only names I would warn you against are T and F. Avoid those as variable names
because, as we will see later, R uses them to mean true and false. If you assign them another value, that could cause
trouble.
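
A sketch of the kind of trouble meant here. The object name "broken" is made up. The example assigns a value to T, shows one thing that quietly breaks, and then removes the assignment again.

```r
T                    # starts out as TRUE
T <- 0               # R lets you do this without complaint
isTRUE(T)            # FALSE now, so code that relies on T is broken

# na.rm=T now means na.rm=0, so the missing value is NOT removed.
broken <- var(c(1, 2, NA), na.rm=T)
broken               # NA

rm(T)                # remove our copy; the built-in T is visible again
isTRUE(T)
var(c(1, 2, NA), na.rm=TRUE)   # spelling out TRUE is always safe
```

The full words TRUE and FALSE cannot be reassigned, which is why many people prefer to spell them out in anything they plan to keep.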

Where The Heck Did That Come From?

Remember, R has a large number of built-in data objects. Some of them will be used below to illustrate the various kinds
of R data objects. For example, here is a data object containing the lengths of major North American rivers (in miles)...

> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315
[15] 870 906 202 329 290 1000 600 505 1450 840 1243 890 350 407
[29] 286 280 525 720 390 250 327 230 265 850 210 630 260 230
[43] 360 730 600 306 390 420 291 710 340 217 281 352 259 250
[57] 470 680 570 350 300 560 900 625 332 2348 1171 3710 2315 2533
[71] 780 280 410 460 260 255 431 350 760 618 338 981 1306 500
[85] 696 605 250 411 1054 735 233 435 490 310 460 383 375 1270
[99] 545 445 1885 380 300 380 377 425 276 210 800 420 350 360
[113] 538 1100 1205 314 237 610 360 540 1038 424 310 300 444 301
[127] 268 620 215 652 900 525 246 360 529 500 720 270 430 671
[141] 1770
(The output on your screen may be slightly different, depending upon how wide you have your R Console window set to.)

In this R output, everything is numbered, but only the number of the first item on each output line is printed. Thus, the
value 1205 (third line from the bottom, three items in; this may be different on your screen) is item number 115 in this output.
These index numbers are NOT PART OF THE DATA ITSELF! This will be made clearer in the following section. The
object "rivers" is a vector, so...


Vectors.

One kind of vector consists of numbers, as was the case just above for the vector "rivers". This is called a numerical
vector, cleverly enough. Any item in this vector can be addressed by using its index number...

> rivers[115] # "show item 115 in vector rivers"


[1] 1205
The index number must be enclosed within square brackets. Notice R prints it out as item [1], but within the "rivers" vector
it is item [115]. Don't get hung up over this. It happens because R considers this printout also to be a new vector. This can
be very useful, as we'll see. It means that, unlike other statistical software, R will allow you to use the output of a
command as input for further calculations. (If this isn't working for you, by the way, it probably means that you are using a
very old version of R. Try putting a copy of the "rivers" vector in your workspace first: data(rivers). This should
make the vector available no matter what.)

If you want to see items 10 through 20 in "rivers" do this...

> rivers[10:20]
[1] 600 330 336 280 315 870 906 202 329 290 1000
In R, a colon has two meanings. This is one of them. When two numbers are separated by a colon, it means "to" as in "10
to 20". Try this...
> 10:20 # Output not shown.

Since no function is specified to operate on these numbers, R assumes you meant print(10:20). If you want to see
items 18, 104, and 168, do this...
> rivers[c(18,104,168)]
[1] 329 380 NA

"NA" means not available, or missing. The "rivers" vector is only 141 items long, so you just asked for something that
doesn't exist. The point is, to see specific items within a vector, enter a vector of index numbers inside the square brackets.
You can also use relational operators (about which more later) to pick out certain items from a vector. If you just want to
see the data values greater than 500, do this...
> rivers[rivers > 500]
[1] 735 524 1459 600 870 906 1000 600 505 1450 840 1243 890 525 720
[16] 850 630 730 600 710 680 570 560 900 625 2348 1171 3710 2315 2533
[31] 780 760 618 981 1306 696 605 1054 735 1270 545 1885 800 538 1100
[46] 1205 610 540 1038 620 652 900 525 529 720 671 1770

I will tell you how to find out which rivers those are in a later tutorial.
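
In the meantime, here is a sketch of what is going on inside those square brackets. The comparison rivers > 500 produces a vector of TRUEs and FALSEs, one per river, and R keeps the items where it is TRUE. The object name "long" is made up.

```r
long <- rivers > 500       # logical vector: TRUE where the river is over 500 miles
head(long)
sum(long)                  # how many rivers are longer than 500 miles
which(long)[1:5]           # index numbers of the first five of them
mean(rivers[long])         # output of one command used as input to another
```

The which( ) function turns the TRUEs into index numbers, which is handy when you want to know where the values are and not just what they are.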

To create a vector, use the c( ) function (short for concatenate, or combine)...
> x = c(12, 14, 15, 17, 19, 8, 10)
> x
[1] 12 14 15 17 19 8 10
Once again, R isn't picky about spacing. None of the spaces in the above command need to be there. Or you can put more
in if you like. I won't mention this again. I assume if you get curious about some special case, you will experiment and find
the answer for yourself.

If the values you wish to enter into a vector are consecutive, then this is sufficient:

> x = 100:200 # x = c(100:200) also works (but not in older versions of R)


> x
[1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
[19] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
[37] 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
[55] 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
[73] 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189
[91] 190 191 192 193 194 195 196 197 198 199 200
And remember (also the last time I'll mention this), the old "x" has been overwritten, gone, history, is no more,
irretrievable!

Vectors can also contain words or character values. When you enter these values, they must be in double or single quotes...


> x = c("Bob","Carol","Ted","Alice")
> x
[1] "Bob" "Carol" "Ted" "Alice"
Two vectors can also be concatenated into one with the concatenate function as follows...
> y = c("John","Joy","Fred","Frances")
> z = c(x, y)
> z
[1] "Bob" "Carol" "Ted" "Alice" "John" "Joy" "Fred"
[8] "Frances"

What would have happened if, instead, you had done this?
> z = c("x", "y")
> z

It's worth finding out, so don't just sit there wondering. Type! One thing I had a bit of trouble getting used to in R is when
to put things in quotes and when not to. The basic rule is: If it's an already defined object, don't quote it. If you want to
refer to the values inside already existing x and y vectors, don't quote. If it's a new character value (i.e., a string--someone's
or something's name), use quotes. R assumes anything not in quotes is an object name (an already defined vector, list,
dataframe, etc.), and it will hunt for that object in the search path. If it doesn't find it, you will be told so...
> Joy # Print out object Joy.
Error: object "Joy" not found
> "Joy" # Print out "Joy".
[1] "Joy"
> y[2] # Print out the second value in vector y.
[1] "Joy"
> Joy = 6 # Create a new object named Joy.
> Joy
[1] 6

Now do this...
> islands # Only first four lines of output shown.
Africa Antarctica Asia Australia
11506 5500 16988 2968
Axel Heiberg Baffin Banks Borneo
16 184 23 280
...

This is called a named vector. Here is how to create one.


> x = c("Robert Culp","Natalie Wood","Elliott Gould","Dyan Cannon")
> x # The values are not named yet.
[1] "Robert Culp" "Natalie Wood" "Elliott Gould" "Dyan Cannon"
> names(x) = c("Bob","Carol","Ted","Alice")
> x # And now they are.
Bob Carol Ted Alice
"Robert Culp" "Natalie Wood" "Elliott Gould" "Dyan Cannon"
> x[Alice] # This is not correct! Why not?
Error: object "Alice" not found
> x["Alice"]
Alice
"Dyan Cannon"
> Alice = 4
> x[Alice] # Same thing as x[4].
Alice
"Dyan Cannon"

Confusing, right? You'll get used to it. This is a helpful example to study and play around with.

The vector "x" now contains the names of the actors in the movie "Bob and Carol, Ted and Alice." The names( )
function was used to label these values with the names of the characters they played in the movie. Then we used the name
of the character to retrieve the name of the actor. Dyan Cannon could also have been referred to as x[4]. Try it. (I have a
very funny story about this movie, but this is not the place for it!)

In the "islands" vector, the data values are the size of the land mass in thousands of square miles. Each data value is named
with the name of the land mass. Thus, to retrieve the area of Cuba, we do not need to know which of the data values is
Cuba. We can retrieve the value by name. The name is put inside of square brackets just as if it were an index number, and
it is quoted...

> islands["Cuba"]
Cuba
43
Cuba has a land area of 43,000 square miles. Suppose you wanted to work with this data vector, but you wanted the land
areas in square kilometers instead of square miles. The following procedure will allow this. First, use the data( )
function to write a copy of "islands" to your workspace. Then do the conversion. The converted values can either be stored
back into the "islands" vector, in which case the old values are overwritten, or they can be stored in a new vector with a
new name...
> data(islands) # writes a copy to your workspace
> ls()
[1] "Alice" "islands" "Joy" "x" "y" "z"
> km_islands <- islands * 2.59 # probably the best way
> km_islands["Cuba"]
Cuba
111.37
> islands <- islands * 2.59 # overwrites the original data values
> islands["Cuba"]
Cuba
111.37

And finally...
> ls()
[1] "Alice" "islands" "Joy" "km_islands" "x"
[6] "y" "z"
> rm(list=ls()) # clean up!
> ls()
character(0)

Vectors are used a lot in R. You should take some time to understand them.

Lists.

Lists are collections of other R objects collected into one place. To create a list, use the list( ) function...
> x=1:10 # a vector
> y=matrix(1:12,nrow=3) # a matrix
> z="Bill" # a character variable
> my.list=list(x,y,z) # create the list
> my.list # view the list
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

[[3]]
[1] "Bill"

The output of a lot of R functions is actually composed of lists. Notice that items in a list are indexed by values inside
double brackets. Thus...
> my.list[[3]] # The third item in my.list.
[1] "Bill"
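
Single brackets also work on a list, but they do something slightly different, and the difference trips people up. A sketch, with the made-up object names "a" and "b":

```r
my.list <- list(1:10, matrix(1:12, nrow=3), "Bill")   # same contents as above

a <- my.list[[3]]     # double brackets: the item itself
b <- my.list[3]       # single brackets: a list of length one holding the item
class(a)              # "character"
class(b)              # "list"
```

If a function complains that it was given a list when you thought you gave it a vector, a missing pair of brackets is a likely culprit.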

To name the items in the list...


> names(my.list) = c("my.vector","my.matrix","my.name")
> my.list
$my.vector
 [1]  1  2  3  4  5  6  7  8  9 10

$my.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

$my.name
[1] "Bill"

In R, the $ is used for list indexing. That is, it allows you to pull elements out of lists by name. First type the name of the
list, followed by $, followed by the name of the item in the list. For example...
> my.list$my.name
[1] "Bill"

Kinda trivial in this case, but it won't be when you have a much longer list. That's enough on lists for now.
> ls()
[1] "my.list" "x" "y" "z"
> rm(my.list,x,y,z) # Don't forget to clean up!
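
Here is a minimal sketch of the three ways to index a list. The list "demo" is made up for this illustration and is not part of the tutorial's workspace. Double brackets and $ hand back the item itself, while single brackets hand back a smaller list.

```r
# Hypothetical list, for illustration only
demo <- list(my.vector = 1:10, my.name = "Bill")

by.position <- demo[[2]]         # the item itself, by position
by.name     <- demo$my.name      # the same item, pulled out by name
sub.list    <- demo["my.name"]   # single brackets: a list of length 1

class(by.name)     # "character"
class(sub.list)    # "list"
```

The distinction matters when you pass the result to a function that expects a vector rather than a list.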


There is one more thing you should remember about lists. Data frames are actually lists. In fact, this is probably the most
important thing you need to remember about lists!
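
A quick check of that claim, sketched with the built-in "women" data frame. Everything that works on a list (the $, the double brackets, length) works on a data frame too.

```r
data(women)                    # built-in data frame: height and weight
is.list(women)                 # TRUE: a data frame is a list of columns
heights <- women$height        # list-style indexing by name
identical(heights, women[["height"]])   # TRUE: double brackets work too
length(women)                  # 2: the length of the list is the number of columns
```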

Matrices and Arrays.

Essentially, these are both table-like objects. You saw how to create a matrix in the last section on lists. That's really
enough for now. Except maybe for extracting values from one. The syntax is my.matrix[row,col], as follows...
> y = matrix(1:16, nrow=4) # First we need a matrix! With 4 rows.
> class(y) # y is an object of class "matrix" (R 4.0 and later also list "array")
[1] "matrix"
> y
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> y[3,2]
[1] 7

Remember this! Always put the row index first followed by the column index, and always put the indexes inside of square
brackets. The matrix( ) function, which is used to create a matrix, takes a vector as its argument, and then the option
"nrow=" tells how many rows to break the vector into. The matrix is filled "down the columns" first, although there is
another option that will change this behavior. Notice our matrix has no row names or column names. The notation [1,]
means "row one, all columns". To recall an entire row or an entire column of a matrix (or an array or a table), do this...
> y[1,] # all values in row 1
[1] 1 5 9 13
> y[,3] # all values in column 3
[1] 9 10 11 12

More later on matrices, including how to name the rows and columns.
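
As a small preview, here is a sketch of the "other option" mentioned above (byrow=) together with row and column names. The labels are invented for the example.

```r
# Fill across the rows instead of down the columns
m <- matrix(1:12, nrow = 3, byrow = TRUE)

# Row and column names (made-up labels, for illustration only)
rownames(m) <- c("r1", "r2", "r3")
colnames(m) <- c("a", "b", "c", "d")

m["r2", "c"]    # row "r2", column "c"
m[2, 3]         # the same element by position
```

Once a matrix has names, either style of index can be used inside the square brackets, still row first and column second.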

An array is like a matrix, except it can have more than two dimensions. In other words, a matrix is just a two-dimensional
array.

> y = array(1:16, dim=c(4,2,2))
> y
, , 1

     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

, , 2

     [,1] [,2]
[1,]    9   13
[2,]   10   14
[3,]   11   15
[4,]   12   16

The array( ) function creates arrays. The "dim" option gives the number of rows, columns, and layers, in that order. Of
course, this would be more useful if we were putting real data into the array rather than just the numbers 1 to 16. It was
just a quick example. To put real data into a matrix or an array, simply put the data into a vector, and replace "1:16" with
the name of the vector in the matrix( ) or array( ) function.
> x = c(1.26, 3.89, 4.20, 0.76, 2.22, 6.01, 5.29, 1.93, 3.27)
> y = matric(x, nrow=3) # Hey! Everybody makes mistakes!
Error: could not find function "matric"
> y = matrix(x, nrow=3)
> y
     [,1] [,2] [,3]
[1,] 1.26 0.76 5.29
[2,] 3.89 2.22 1.93
[3,] 4.20 6.01 3.27

Don't forget to clean up.

Tables.


If the function to create a matrix is matrix( ), and the function to create an array is array( ), I bet you can guess
what function is used to create a table. It's used quite a bit differently, however. The table( ) function is used to create
frequency tables or crosstabulations from raw data contained in a vector or a data frame. The result is something that looks,
in many cases, very much like a matrix or an array, and behaves very much like one as well. For now, we will confine
ourselves to one relatively simple example. First, we have to create some raw data...

> y = rnorm(100, mean=100, sd=15) # 100 normally distributed random nos.
> y = round(y, 0) # Rounded to zero decimal places.
Once again, don't worry about the syntax of these statements. I'm just using them to create some data to put into a table.
Since the values in the y vector are random, everyone's results here will be different. To view a frequency table (badly
formatted, but...small steps!), do this...
> table(y)
y
 64  69  73  74  77  79  80  81  82  84  85  86  87  88  89  90  91  92  93
  1   1   1   1   4   4   2   1   1   2   1   1   1   3   1   1   1   2   1
 94  95  96  97  98  99 100 101 102 103 104 105 106 107 109 110 111 112 113
  4   4   3   3   5   2   6   3   1   5   4   2   2   2   1   2   1   4   3
114 116 117 118 119 120 123 125 129
  2   2   1   1   2   1   1   2   1

The top row of numbers contains the data values, which we can see range from 64 to 129, and the bottom row of numbers
gives the frequencies. The data value (i.e., y-value) of 100, for example, occurs 6 times in the data vector. (Once again,
your result will be different.) Tables, of course, just like everything else in R, can be stored and then used for further
analysis...
> table(y) -> myTable # Store it.
> barplot(myTable)
> ls()
[1] "myTable" "y"
> rm(myTable, y) # And remember to clean up.

This table is (was!) one-dimensional. The "HairEyeColor" object we were playing with previously was a multidimensional
table of frequencies, also called a crosstabulation.
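
To see a crosstabulation built from raw data, here is a minimal sketch using the built-in "mtcars" data frame (my choice of example, not one used elsewhere in this tutorial). Giving table( ) two vectors produces a two-dimensional table.

```r
data(mtcars)    # built-in data; cyl = cylinders, gear = forward gears
xt <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
xt                     # a two-way table of counts
dim(xt)                # behaves like a matrix: rows by columns
xt["4", "4"]           # index by the data values, which are stored as names
margin.table(xt, 1)    # row totals: number of cars for each cylinder count
```

Notice the indexes are quoted. The data values have become row and column names, so "4" means the row labeled 4, not the fourth row.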

Data Frames.

Data frames are so important that I will devote an entire tutorial just to them. For now, if you want to see one, try this...

> women
The basic structure of a data frame is illustrated here. It's basically a table (in fact, it's a list of column vectors) in which
each variable goes in its own column and each case goes in its own row.

Usually, data frames are read into the R workspace from external files, which may have been created using a spreadsheet.
Small ones can be typed in at the command line, however. Let's use the data at the beginning of this tutorial to see how
that would work.

> myFirstDataframe = data.frame( # Press Enter to start a new line.
+ name=c("Bob","Fred","Barb","Sue","Jeff"),
+ age=c(21,18,18,24,20), hgt=c(70,67,64,66,72),
+ wgt=c(180,156,128,118,202),
+ race=c("Cauc","Af.Am","Af.Am","Cauc","Asian"),
+ year=c("Jr","Fr","Fr","Sr","So"),
+ SAT=c(1080,1210,840,1340,880)) # End with double close parenthesis. Why?
> myFirstDataframe
  name age hgt wgt  race year  SAT
1  Bob  21  70 180  Cauc   Jr 1080
2 Fred  18  67 156 Af.Am   Fr 1210
3 Barb  18  64 128 Af.Am   Fr  840
4  Sue  24  66 118  Cauc   Sr 1340
5 Jeff  20  72 202 Asian   So  880
That's probably not something you're going to want to do very often! In fact, I'd almost be willing to bet you got at
least one comma, one quote, or one parenthesis out of place, and the whole thing failed because of that. There is an easier
way.


Last Word.

Further details as needed on these data objects will be covered in future tutorials. For now, you should get the general idea.

revised 2010 July 27

Return to Table of Contents

R Tutorials--Resampling Techniques

RESAMPLING TECHNIQUES

Caveat emptor

Resampling techniques depend upon repeated (re)randomization or simulation of a sample. Computers do not generate
random numbers. Since an algorithm is used to produce the results of a function like runif( ), these results are technically
referred to as pseudorandom. I've done a few casual tests of the R random number generator, and it seems to be very good,
but I'm not an expert on pseudorandom number generators. So I will begin with a warning: computer intensive resampling
is only as good as your pseudorandom number generator.
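
One practical consequence of "pseudorandom" is that results can be made repeatable by setting the seed of the generator. A minimal sketch follows; the seed value 42 is arbitrary.

```r
set.seed(42)          # arbitrary seed chosen for this sketch
first <- runif(3)

set.seed(42)          # same seed again...
second <- runif(3)    # ...same "random" numbers

identical(first, second)   # TRUE
```

The examples in this tutorial do not set a seed, which is why your numbers will differ from the ones shown.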

Monte Carlo Estimation of Power

Although the term "resampling" is often used to refer to any repeated random or pseudorandom sampling simulation, when
the "resampling" is done from a known theoretical distribution, the correct term is "Monte Carlo" simulation. I will use
such a scheme here to demonstrate how power can be estimated for the two-sample t-test using Student's sleep data...

> data(sleep)
> str(sleep)
'data.frame':   20 obs. of  2 variables:
 $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
> attach(sleep)
> tapply(extra, group, mean)
   1    2
0.75 2.33
> tapply(extra, group, sd)
       1        2
1.789010 2.002249
> tapply(extra, group, length)
 1  2
10 10
> t.test(extra~group)

        Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

> power.t.test(n=10, delta=(2.33-.75), sd=1.9, sig.level=.05,
+ type="two.sample", alternative="two.sided")

     Two-sample t test power calculation

              n = 10
          delta = 1.58
             sd = 1.9
      sig.level = 0.05
          power = 0.4208457
    alternative = two.sided

NOTE: n is number in *each* group


There is the traditional analysis. R claims this t-test has a power of 0.42.

Supposedly, we have a 42% chance of finding a two-tailed significant difference between two samples of 10 chosen from
normal populations with a common standard deviation of 1.9 (the pooled sd of the two samples) but with a true difference
in means of 1.58. To do the same calculation by simulation, we will use the rnorm( ) function to draw the samples, the
t.test( ) function to get a p-value, and then we will simply look to see what percentage of those p-values are less than
alpha=.05 after running this procedure a goodly number of times. But first, a couple explanations.

What we are about to do is, perhaps, best done with a script. That way, if something goes wrong, we don't have to re-enter
multiple lines into the R console. However, we will live dangerously for the sake of illustration. To get this to work with
minimal effort on our part--always a good thing--we will need to get R to do the same set of calculations many times. This
is done with a loop, and in R, loops are most often done using for( ). The syntax is "for (counter in 1:n)", where n is the
number of simulations. The statements to be looped through repeatedly then follow enclosed in curly braces. I feel a bit like I'm
trying to describe an elephant to someone who's never seen one before. Perhaps it would be best to just show you the
elephant...

> R = 999
> alpha = numeric(R)
> for (i in 1:R) {
+ group1 = rnorm(10, mean=.75, sd=1.9)
+ group2 = rnorm(10, mean=2.33, sd=1.9)
+ alpha[i] = t.test(group1,group2)$p.value
+ }
The first line defines the number of random simulations we want and stores this value in the semi-official data object that
R typically uses for such things, namely R. We've done 999 sims, because that's the number that is normally done. The
second line creates a numeric vector with 999 elements, which will hold the results of each simulation. The third line
begins our for loop: "for i (the counter) going from 1 to 999", do the following stuff. I.e., do the following stuff 999 times,
keeping track by incrementing the counter, i, at each step. Group 1 is generated. Group 2 is generated. The t-test is done,
and the p-value is extracted and stored in "alpha", our designated storage place. In each simulation, the storage occurs in
the ith position of the alpha vector. Now all that remains is to determine how many of those values are less than .05 ...
> mean(alpha<.05)
[1] 0.3983984

To do this, we played a nice little trick. Remember, "alpha<.05" generates a logical vector of TRUEs and FALSEs, in
which TRUEs count as ones and FALSEs count as zeros. We took the mean of that vector, which is to say, we calculated
by a bit of trickery the proportion of TRUEs. Our simulation tells us we have about a 40% chance of rejecting the null
hypothesis under these conditions, pretty close to what we found above. Your results will differ, of course, because we are
using a randomizing procedure after all.
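
The same simulation can be sketched without an explicit loop by using replicate( ), which runs an expression R times and collects the results. The seed and the object names below (p.values, sim.power, sim.se, theory) are my own choices for the illustration. The sketch also computes the Monte Carlo standard error of the estimated power, which shows how much wobble to expect from 999 sims.

```r
set.seed(1)   # arbitrary seed, only so the sketch is repeatable
R <- 999
p.values <- replicate(R, {
  group1 <- rnorm(10, mean = 0.75, sd = 1.9)
  group2 <- rnorm(10, mean = 2.33, sd = 1.9)
  t.test(group1, group2)$p.value       # Welch test, as in the loop above
})
sim.power <- mean(p.values < 0.05)     # proportion of rejections

# Monte Carlo standard error of that proportion
sim.se <- sqrt(sim.power * (1 - sim.power) / R)

# The analytic value, for comparison
theory <- power.t.test(n = 10, delta = 1.58, sd = 1.9)$power
c(simulated = sim.power, se = sim.se, theory = theory)
```

With 999 sims the standard error is about .016, so a simulated power a couple of points away from the theoretical value is nothing to worry about.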

Permutation Tests

Oops! I forgot to clean up. So I will use the sleep data to illustrate our next topic as well, which is permutation tests. If you
were more on the ball than I am and detached the "sleep" data frame, reattach it now.

The logic behind a permutation test is straightforward. It says, "Okay, these are the twenty (in this example) scores we got,
and the way they are divided up between the two groups here is one possible permutation of them. There are, in fact...
> choose(20,10)
[1] 184756
...possible permutations (technically combinations--but same idea) of these data into two groups of ten, and each is equally
likely to occur if we were to pick one at random out of a hat. Most of these permutations would give no or little difference
between the group means, but a few of them would give large differences. How extreme is the obtained case?" In other
words, the logic of the permutation test is quite similar to the logic of the t-test. If the obtained case is in the most extreme
5% of possible results, then we reject the null hypothesis of no difference between the means (assuming we were looking at
differences between means). The advantage of the permutation test is that it does not assume normality. In fact, it makes
no assumption about the shape of the parent distributions. It assumes only that the scores are exchangeable between the
groups when the null hypothesis is true. Furthermore, a permutation test is generally more powerful than a "traditional"
nonparametric test.

The disadvantage of a permutation test is the number of permutations that must be generated. Even small samples, as we
have here, require a large number of repetitive calculations (184,756 t-tests), and the count grows explosively as the
samples get larger. Therefore, instead of generating all possible permutations, we generate only a random sample of them.
This procedure is often called a "randomization test" instead of a permutation test. Let's see it, and then I'll explain...


> ls()
[1] "alpha" "group1" "group2" "i" "R" "sleep"
> rm(alpha,group1,group2,i,R)
> # A little cleaning up never hurts anybody!
> R = 999
> scores = extra
> t.values = numeric(R)
> for (i in 1:R) {
+ index = sample(1:20, size=10, replace=F)
+ group1 = scores[index]
+ group2 = scores[-index]
+ t.values[i] = t.test(group1,group2)$statistic
+ }
First, we cleaned out some objects we wanted to use again and that "we" forgot to clean out when we were done with them
last time. Then we set the number of sims (replications, resamplings) to 999. Then we made a copy of the data, which was
really unnecessary. Then we established a storage vector for the results of the sims. Then we began looping. Inside the
loop, we took a random sample of the vector 1 to 20, without replacement, and stored that in an object called "index".
These index values were then used to extract the first resampled group of scores. The second group of scores was extracted
using a trick, which in words goes, "for group two take all the scores that are not indexed by the values in index." In other
words, put the remaining scores in group two. Then we did the t-test and stored the test statistic (t-value). It now remains
to compare those simulated t-values to the one we obtained above when we did the actual t-test...
> t.values = abs(t.values) # for a two-tailed test
> mean(t.values>=1.8608) # using the same trick we used above
[1] 0.0920921

The resampling scheme tells us that a little more than 9% of the sims gave us a result as extreme or more extreme than the
obtained case. This is the p-value resulting from the randomization test. We can see that it is pretty close to the p-value
obtained in the actual t-test above (.0794). The difference may be nothing more than random noise (the standard error is
.009), or we may be seeing that the t-test was a bit too generous here, perhaps due to nonnormality.
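
The same logic can be wrapped into a reusable function. What follows is only a sketch under some assumptions of mine: the function name perm.test is made up, it uses the difference in means rather than the t-value as the test statistic, it assumes the first n1 scores belong to group one (true of the sleep data), and it counts the observed arrangement as one of the possible outcomes, a common convention that keeps the p-value from ever being exactly zero.

```r
# Hypothetical helper: two-tailed randomization test on the difference in means
perm.test <- function(scores, n1, R = 999) {
  obs <- mean(scores[1:n1]) - mean(scores[-(1:n1)])    # observed difference
  sims <- replicate(R, {
    index <- sample(length(scores), size = n1)         # without replacement
    mean(scores[index]) - mean(scores[-index])
  })
  # include the observed arrangement among the outcomes
  (sum(abs(sims) >= abs(obs)) + 1) / (R + 1)
}

set.seed(2)   # arbitrary seed for repeatability
perm.test(sleep$extra, n1 = 10)
```

The result should land in the same neighborhood as the randomization p-value obtained above, although it will not match exactly because the test statistic and the random draws are different.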

And this time, let's not forget...


> detach(sleep)
> rm(list=ls())

Bootstrap Resampling

Bootstrap resampling is similar to the above randomization procedure, except the resampling is done WITH replacement.
The idea is to consider the pooled sample to be a mini-population of scores. Presumably, we were sampling from a larger
parent distribution, but we know nothing about it for certain. The only information we have about the parent distribution is
the sample we've obtained from it. Therefore, we will calculate a sampling distribution for the test statistic by resampling
from our mini-population WITH replacement, calculating the desired statistic, and taking a look at what we get...

> with(sleep, t.test(extra~group)$statistic) # for reference
        t
-1.860813
> scores = sleep$extra # the data
> R = 999 # the number of replicates
> t.values = numeric(R) # storage for the results
> for (i in 1:R) {
+ group1 = sample(scores, size=10, replace=T)
+ group2 = sample(scores, size=10, replace=T)
+ t.values[i] = t.test(group1,group2)$statistic
+ }
The only real difference here is that we have resampled with "replace=T", thus resampling with replacement from the
pooled data set. The sampling distribution of the t-statistic can be visualized in a histogram, and I will plot the obtained t-
value as a point on the x-axis...
> hist(t.values, breaks=20)
> points(-1.8608,0, pch=16)


The simulation p-value is obtained as previously...


> mean(t.values<=-1.8608) # one-tailed (lower)
[1] 0.04204204
> t.values = abs(t.values) # two-tailed
> mean(t.values>=1.8608)
[1] 0.08508509

The result is very similar to the randomization method result.

Note: R comes with a package called "boot", the main function of which is boot( ). Supposedly, this automates the whole
resampling procedure. However, I have found it almost impossible to use, and certainly more difficult than the above
methods. All the examples I've seen are much more complex than the problems above and (I might add) poorly explained.
I've yet to get the function to produce a meaningful result. When I do, there will be a revision here. If you wish to give it a
try, read the help page first, then you might try Canty (2002) and Rizzo (2008). And good luck to you. I hope you have
better luck than I have!

Bootstrapped Confidence Intervals

Bootstrap resampling is useful for estimating confidence intervals from samples when the sample is from an unknown (and
clearly nonnormal) distribution. The data set "crabs" in the MASS package supplies an example. We will look at the
carapace length of blue crabs...

> data(crabs, package="MASS")
> cara = crabs$CL[crabs$sp=="B"]
> summary(cara)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.70   24.85   30.10   30.06   34.60   47.10
> length(cara)
[1] 100
> qqnorm(cara)
Hmmmm, that could be normal, but who can say for sure? So to get a 95% CI for the mean, we will use a bootstrap
scheme rather than methods based on a normal distribution. The procedure is to take a large number of bootstrap samples
and to examine the desired statistic...
> R = 999
> boot.means = numeric(R)
> for (i in 1:R) {
+ boot.sample = sample(cara, 100, replace=T)
+ boot.means[i] = mean(boot.sample)
+ }
> quantile(boot.means, c(.025,.975))
2.5% 97.5%
28.70265 31.44220

The bootstrapped CI is reasonably close to the one obtained by traditional methods...


> mean(cara)-1.96*sd(cara)/sqrt(length(cara))
[1] 28.70507
> mean(cara)+1.96*sd(cara)/sqrt(length(cara))
[1] 31.41093

Extension to more complex problems is straightforward.
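
As one example of such an extension, here is a sketch that bootstraps a confidence interval for a correlation. The key change is that whole cases (rows) are resampled, so each x stays paired with its y. The built-in "faithful" data set is my choice for the illustration.

```r
set.seed(3)                       # arbitrary seed
data(faithful)                    # built-in: eruption length and waiting time
n <- nrow(faithful)
R <- 999
boot.cors <- numeric(R)
for (i in 1:R) {
  rows <- sample(n, n, replace = TRUE)     # resample case numbers, not values
  boot.cors[i] <- cor(faithful$eruptions[rows], faithful$waiting[rows])
}
quantile(boot.cors, c(.025, .975))         # percentile 95% CI for r
```

Resampling the two variables separately would destroy the pairing and produce an interval centered near zero, which is not what we want here.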

Update: The boot( ) Function

I finally found a more-or-less sensible explanation of the boot( ) function in Crawley (2007). I can't claim anything near
"complete understanding", but I'm working on it. Meanwhile, here is a heavily commented example showing how to
bootstrap a confidence interval...

> library(boot)
> ### The syntax is: boot(data= , statistic= , R= )
> ### Send it the name of a vector or data frame for "data". "R" is the
> ### number of replications you want. The tricky part is "statistic=",
> ### where you have to send it a function you've written. If you're
> ### going to get fancy, you might want to read the Writing Your Own
> ### Functions tutorial first.
> data(crabs, package="MASS")
> ### Not necessary if you haven't yet cleaned up from the previous section.
> cara = crabs$CL[crabs$sp=="B"]
> ### The carapace lengths of 100 blue crabs. See previous section.
> ### And now we have to write a function to calculate the means of these...
> the.means = function(cara, i) {mean(cara[i])}
> ### Finally, we run the bootstrap...
> boot(data=cara, statistic=the.means, R=999)
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = cara, statistic = the.means, R = 999)

Bootstrap Statistics :
    original     bias    std. error
t1*   30.058 0.02926827   0.6820252
You MUST write your own function to calculate the statistic you are bootstrapping, even if that function is already built
into R. If it is a built-in function, then writing your own function is a simple matter, as you witnessed above. You must
also include the index i as the second argument. On each replication, boot( ) passes a vector of resampled case numbers
through this argument, and your function uses it to pick out the bootstrap sample, as in cara[i]. This is what makes the
boot( ) function so versatile: the same mechanism works whether the data are a vector or the rows of a data frame. Finally,
the output is interpreted as follows. The value "original" is the statistic (here, the mean) calculated from the original
data vector "cara". The value "bias" is the mean of the bootstrapped values of this statistic minus "original". Clearly,
"bias" ought to be close to zero. The value "std. error" is the standard deviation of the bootstrapped means. To get a
confidence interval, the output of boot( ) should be stored...
> boot(data=cara, statistic=the.means, R=999) -> boot.out
> quantile(boot.out$t, c(.025,.975))
2.5% 97.5%
28.73275 31.42255

The result is similar, but not identical, to the result we got in the previous section.
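
The boot package also supplies boot.ci( ), which computes several kinds of interval from a stored boot object. The following is a sketch, not a recommendation of one interval type over another. It repeats the setup so it can be run on its own, and the seed is arbitrary.

```r
library(boot)
data(crabs, package = "MASS")
cara <- crabs$CL[crabs$sp == "B"]
the.means <- function(x, i) mean(x[i])

set.seed(4)                                   # arbitrary seed
boot.out <- boot(data = cara, statistic = the.means, R = 999)
ci <- boot.ci(boot.out, conf = 0.95, type = c("perc", "bca"))
ci$percent[4:5]     # lower and upper limits, percentile method
ci$bca[4:5]         # lower and upper limits, BCa (adjusted) method
```

The percentile interval is the packaged counterpart of the quantile( ) approach used above, so the two should agree closely.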

Bootstrapping is more usefully applied to statistics like the median, for which there is no formula for CIs when the
distribution is not normal...

> the.medians = function(cara, i) {median(cara[i])}


> boot(data=cara, statistic=the.medians, R=999) -> boot.out2
> boot.out2
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = cara, statistic = the.medians, R = 999)

Bootstrap Statistics :
    original     bias    std. error
t1*     30.1 0.09814815    1.477237
> quantile(boot.out2$t, c(.025,.5,.975))
2.5% 50% 97.5%
27.7 30.1 32.4
I still can't get it to do anything more complex than a simple, one-variable confidence interval though, so obviously some
R guru needs to write some COMPREHENSIBLE examples using this procedure to do things that ordinary people might
actually want to do!
> detach("package:boot")
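
For what it's worth, here is a sketch of one way boot( ) can handle something beyond a single variable: bootstrapping both coefficients of a simple regression. When the data are a data frame, the index vector selects rows, so the statistic function refits the model on d[i, ]. The function name reg.coefs, the seed, and the choice of the built-in "women" data are mine.

```r
library(boot)
data(women)     # built-in: height and weight of 15 women

# statistic: refit the regression on the resampled rows, return both coefficients
reg.coefs <- function(d, i) {
  coef(lm(weight ~ height, data = d[i, ]))
}

set.seed(5)                                   # arbitrary seed
boot.reg <- boot(data = women, statistic = reg.coefs, R = 999)
boot.reg$t0                                   # coefficients from the original data
dim(boot.reg$t)                               # 999 rows, one column per coefficient
quantile(boot.reg$t[, 2], c(.025, .975))      # percentile CI for the slope
```

Each column of boot.reg$t holds the bootstrapped values of one statistic, in the same order as boot.reg$t0.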

Bootstrapping a Oneway ANOVA

Let's do something a little more complex than a confidence interval! Yesterday, I wrote the tutorial on Oneway ANOVA,
and in that, I used the data frame "InsectSprays" for an example. However, "InsectSprays" has some severe problems with
normality and homogeneity of variance, which means the theoretical F-distribution may not apply. How can we get a
sampling distribution that does apply? First, let's use Monte Carlo simulation to reproduce the theoretical distribution...

> data(InsectSprays) # look at ?InsectSprays to get the skinny
> with(InsectSprays,tapply(count,spray,mean))
        A         B         C         D         E         F
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
> with(InsectSprays,tapply(count,spray,var))
        A         B         C         D         E         F
22.272727 18.242424  3.901515  6.265152  3.000000 38.606061
> with(InsectSprays,tapply(count,spray,length))
 A  B  C  D  E  F
12 12 12 12 12 12
> summary(aov(count~spray, data=InsectSprays))
            Df  Sum Sq Mean Sq F value    Pr(>F)
spray        5 2668.83  533.77  34.702 < 2.2e-16 ***
Residuals   66 1015.17   15.38
Those are some ugly variances! The help page for these data recommends a data transform. Let's assume the pooled
variance of 15.38 applies to all groups and get the F-distribution as follows (copy and paste this script into R)...
meanstar = mean(InsectSprays$count)
sdstar = sqrt(15.38)
simspray = InsectSprays$spray
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = rnorm(12, mean=meanstar, sd=sdstar)
  groupB = rnorm(12, mean=meanstar, sd=sdstar)
  groupC = rnorm(12, mean=meanstar, sd=sdstar)
  groupD = rnorm(12, mean=meanstar, sd=sdstar)
  groupE = rnorm(12, mean=meanstar, sd=sdstar)
  groupF = rnorm(12, mean=meanstar, sd=sdstar)
  simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
  simdata = data.frame(simcount,simspray)
  Fstar[i] = oneway.test(simcount~simspray, var.equal=T, data=simdata)$statistic
}

Go get a snack here! Well, that took a while, didn't it? "Fstar" should now contain our simulated F-distribution for df1=5
and df2=66 degrees of freedom...
> hist(Fstar, prob=T)
> x=seq(.25,5.25,.5)
> points(x,y=df(x,5,66),type="b",col="red")


Not too bad!

Now let's do the same, except this time we will use the InsectSprays data to get the Fstar-distribution. We will center all
our groups on the same mean (zero), but we will leave the variance and shape of the individual group distributions
undisturbed (copy and paste this script into R)...
rm(list=ls())
data(InsectSprays)
meanstar = with(InsectSprays, tapply(count,spray,mean))
grpA = InsectSprays$count[InsectSprays$spray=="A"] - meanstar[1]
grpB = InsectSprays$count[InsectSprays$spray=="B"] - meanstar[2]
grpC = InsectSprays$count[InsectSprays$spray=="C"] - meanstar[3]
grpD = InsectSprays$count[InsectSprays$spray=="D"] - meanstar[4]
grpE = InsectSprays$count[InsectSprays$spray=="E"] - meanstar[5]
grpF = InsectSprays$count[InsectSprays$spray=="F"] - meanstar[6]
simspray = InsectSprays$spray
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=12, replace=T)
  groupB = sample(grpB, size=12, replace=T)
  groupC = sample(grpC, size=12, replace=T)
  groupD = sample(grpD, size=12, replace=T)
  groupE = sample(grpE, size=12, replace=T)
  groupF = sample(grpF, size=12, replace=T)
  simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
  simdata = data.frame(simcount,simspray)
  Fstar[i] = oneway.test(simcount~simspray, var.equal=T, data=simdata)$statistic
}
Wait for it! (That took about a minute and 20 seconds on my computer.) We now have a bootstrapped "F"-distribution in
"Fstar" based on equal means (the null hypothesis), but normality and homogeneity are no longer assumed. Let's see what
it looks like...
> max(Fstar)
[1] 10.54805
> hist(Fstar, breaks=seq(0,11,.5), ylim=c(0,.7), prob=T)
> x=seq(.25,6.75,.5)
> points(x,y=df(x,5,66),type="b",col="red")


It's a bit different from the theoretical distribution in that it appears to be heavier in the tails...
> qf(.95,5,66)
[1] 2.353809
> quantile(Fstar,.95)
95%
2.790884

The critical value of F(5,66) at alpha=.05 is 2.35, but from the bootstrapped Fstar distribution it appears to be 2.79. Either
way, Fobt=34.7 allows the null to be rejected. It might have been a different story though if the effect had been a marginal
one.

If there is an easier way to do that using the boot( ) function, I'd love to hear about it. I can't get it to work!
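
One possibility, offered as a sketch rather than a definitive answer, is the strata= argument of boot( ), which makes it resample within each level of a factor. The script below centers each group on zero, as above, and then lets boot( ) do the within-group resampling. The names "centered" and "F.stat" and the seed are my own, and I have used R=999 instead of 10000 only to keep the run short.

```r
library(boot)
data(InsectSprays)

# Center each group on zero so the null hypothesis (equal means) is true
centered <- InsectSprays
centered$count <- with(InsectSprays, count - ave(count, spray))

# statistic: the F-value from a oneway ANOVA on the resampled rows
F.stat <- function(d, i) {
  oneway.test(count ~ spray, data = d[i, ], var.equal = TRUE)$statistic
}

set.seed(6)                      # arbitrary seed
# strata= tells boot() to resample rows separately within each spray group
boot.F <- boot(data = centered, statistic = F.stat, R = 999,
               strata = centered$spray)
quantile(boot.F$t, .95)          # bootstrapped critical value
mean(boot.F$t >= 34.702)         # proportion of Fstar at least as large as Fobt
```

Because the groups were centered first, the "original" statistic that boot( ) reports is essentially zero. The values of interest are the bootstrapped F-values stored in boot.F$t.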

Return to the Table of Contents

R Tutorials--Simple Data Entry

SIMPLE DATA ENTRY AND DESCRIPTION

A Couple Tips

One reason people don't like command line programs is because, if you make a mistake in typing a long command, you
have to start all over from scratch. Not so in R. Suppose you were trying to set your working directory to "Rspace", and
you accidentally typed this...

> setwf("Rspace") # Type this into R.


There is no "setwf( )" function in R, and R will cheerfully tell you the function was not found. Go ahead and see for
yourself. Now, instead of retyping the whole line, R will allow you to recall it to the command line and edit it. To recall
the previously typed command, just press the up arrow key. You can then use the right and left arrow keys to move
through the command line. The Backspace and Delete keys (on a Windows keyboard) can be used to erase the errors. Then
make corrections and press Enter. Your cursor does not even need to be at the end of the line when you press Enter. Try
it...
> ### Press up arrow key here...
> setwd("Rspace") # Edit command using arrow keys, press Enter.
> getwd()
[1] "C:/Documents and Settings/kingw/My Documents/Rspace"

If you continue pressing the up arrow key, R will bring older and older commands to the command line. Thus, if you did
something five commands ago, and you want to do it again, press the up arrow key five times to recall the command, then
press Enter.

Here's another tip, and one you might be a bit miffed I didn't tell you earlier. You can copy and paste stuff into R. For
example, suppose I told you to execute the following command...

> boxplot(log(islands), main="Boxplot of Islands", ylab="log(land area)")


You're saying to yourself, "Oh man! I don't want to type all that, and I'm gonna get commas in the wrong place, and come
on!" You don't have to type it. With your mouse--yes, that's right, your mouse!--highlight the line on this webpage (not
including the command prompt or > symbol). Then either go to the Edit menu of your browser and choose Copy, or press
Ctrl-C on your keyboard (hold down the Ctrl key and press c and then release both). Now, go into the R Console window
and either pull down the Edit menu (in Windows) and choose Paste, or (with the cursor at a command prompt) press Ctrl-
V. Either one will paste the command at the command prompt. Then press Enter.

Note to my Mac friends: On the Mac keyboard the shortcuts for copy and paste are Command-C and Command-V,
respectively. On older Mac keyboards, the Command key is the one to the left of the space bar with the little flowery thing
on it.

Now you know. Of course, you will have to type your own command eventually.

Creating a Vector

Using built-in data objects is fine and dandy for demonstration purposes, but eventually you're going to want to enter and
analyze your own data. If the data set is small, you can do this easily from within R. The following data were collected by
a student doing his senior research project here at CCU. The numbers represent number of items recalled correctly on a
digit span task, supposedly a measure of short term memory. The explanatory variable ("IV") was whether or not the
subject admitted to regularly smoking marijuana.

smokers      16 20 14 21 20 18 13 15 17 18
nonsmokers   18 22 21 17 20 17 23 20 22 21

It might seem a little silly to go to the trouble of formally entering such a small data set into a data frame or a spreadsheet
and then reading it into R, when the whole thing can be typed into an R console session in just a few seconds. The thing
you need to realize is that all these scores are ON THE SAME VARIABLE, the response variable, and therefore, they need
to go into the same data object or vector. So...
> scores = c(16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21)
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
> summary(scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   17.00   19.00   18.65   21.00   23.00

The scores have been entered into a vector using the c( ) function. Since that was an assignment statement, it wrote
nothing to the screen. Then we asked to see the scores, a good check, since (confession) it took me three tries to get the
scores typed in correctly. (ALWAYS double check your data entry!) Then the summary( ) function was used to produce
a preliminary descriptive summary.

That's probably the most annoying way to get data into a vector--all those commas! So here is a more convenient way
when typing data at the keyboard. First, remove the "scores" vector. Then recreate it using scan( ). The scan( )
function allows you to type in numbers one at a time, hitting Enter after each one, rather than putting commas between
them...

> rm(scores)
> scores <- scan()
1: 16 # press Enter
2: 20 # press Enter
3: 14 # etc.
4: 21
5: 20
6: 18
7: 13
8: 15
9: 17
10: 18
11: 18
12: 22
13: 21
14: 17
15: 20
16: 17
17: 23
18: 20
19: 22
20: 21
21: # press Enter here to end data input
Read 20 items
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
This is handy when you're using a numeric keypad. But it gets better. You don't have to hit the Enter key between each
data value. You only have to leave some white space...
> rm(scores)
> scan() -> scores
1: 16 20 14 21 20 18 13 15
9: 17 18 18 22 21 17 20 17 23 20
19: 22 21
21:
Read 20 items
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21

The Enter key can be hit at any time to start a new line. Items entered into scan( ) must be separated by white space: a
space or spaces, a tab, a newline, a carriage return. Notice also that it doesn't matter whether left or right arrow assignment
is used. Better still, you can copy and paste the numbers from this webpage...
> rm(scores)
> scores = scan() # The = assignment can also be used.
1: 16 20 14 21 20 18 13 15 17 18 # Copied and pasted from above.
11: 18 22 21 17 20 17 23 20 22 21 # Copied and pasted from above.
21: # Remember to hit Enter to end entry.
Read 20 items
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21

You can also copy and paste comma separated values, but not into the scan( ) function as it is used above, because by
default it expects white space between items. Copy comma separated values into c( ). However, you can copy and paste a
spreadsheet column (but not a row) into the scan( ) function.
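
If you would rather keep using scan( ) for comma separated values, one option is to tell it what the separator is. The
following is only a sketch: the object name csv.scores is made up for illustration, and the text= argument is used so
the example can run without typing anything at the keyboard.

```r
# Comma separated values held in a character string (copied from the data above).
csv.scores <- "16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21"

# sep="," tells scan() to split on commas instead of white space.
scores <- scan(text = csv.scores, sep = ",", quiet = TRUE)
length(scores)
mean(scores)
```

The same sep= option should work when typing or pasting at the scan( ) prompt.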

Now, about that summary--what we want, of course, is a summary by groups, and not of all the scores at once. You can
probably think of one way to do this...

> summary(scores[1:10]) # Summarize scores 1 to 10.
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 15.25 17.50 17.20 19.50 21.00
> summary(scores[11:20]) # Summarize scores 11 to 20.
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 18.50 20.50 20.10 21.75 23.00
Another way to do it is to create a second vector with group names (i.e., values of the explanatory variable) in it and to use
that to extract scores by group...
> rep(c("smoker","nonsmoker"),c(10,10)) -> groups
> tapply(X=scores, IND=groups, FUN=summary) # Similar to by() function.
$nonsmoker
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 18.50 20.50 20.10 21.75 23.00
$smoker
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 15.25 17.50 17.20 19.50 21.00

The syntax of the tapply( ) function can be put into words like this: "Apply the summary function to scores by
groups." The by( ) function does something similar, but the output format is a bit different...
> by(data=scores, IND=groups, FUN=summary)
groups: nonsmoker
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 18.50 20.50 20.10 21.75 23.00
----------------------------------------------------------
groups: smoker
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 15.25 17.50 17.20 19.50 21.00
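
A third possibility, offered here as a sketch rather than as the recommended method, is to put the scores and the group
labels side by side in a data frame and use aggregate( ) with a formula. The names smoke.data and group.means are made
up for this illustration.

```r
scores <- c(16,20,14,21,20,18,13,15,17,18,   # smokers
            18,22,21,17,20,17,23,20,22,21)   # nonsmokers
groups <- rep(c("smoker","nonsmoker"), c(10,10))

# Keep the response and the grouping variable together in one data frame.
smoke.data <- data.frame(scores, groups)

# One row per group; the formula reads "scores by groups".
group.means <- aggregate(scores ~ groups, data = smoke.data, FUN = mean)
group.means
```

The result is itself a data frame, which can be convenient if the group means are needed for further calculations.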

Now might be a good time to mention this. The summary( ) function is very versatile, and its output will depend upon
what you are asking for a summary of, as we will have ample opportunity to see. When a numerical vector is summarized,
the output is the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. There is a qualification. The quartiles
are calculated assuming the vector contains a continuous numerical variable. The variable in this example is not
continuous. Therefore, the quartiles may not come out to have the same values as if you'd used the method you were taught
in elementary statistics to calculate them. We will return to this in a future tutorial. For now, I'll simply say that R can use
any of nine different methods to calculate these values.
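
To see that the choice of method matters, here is a small sketch using the smokers' scores and the type= option of the
quantile( ) function. The comparison of types 7 and 6 is just an illustration. Nothing here says which one your
textbook uses.

```r
smokers <- c(16,20,14,21,20,18,13,15,17,18)

# type=7 is the default, and it is what summary() reported above.
q7 <- quantile(smokers, probs = 0.25, type = 7)

# type=6 is one of the alternatives; it gives a different first quartile here.
q6 <- quantile(smokers, probs = 0.25, type = 6)

c(type7 = unname(q7), type6 = unname(q6))
```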
> rm(list=ls()) # Clean up.

Entering Categorical Data

There is no way to get around it. Entering categorical data, or character values, is a pain in the posterior! However, once
they are entered, R handles them in a much more versatile way than any other statistical software I have ever used. For
example, if you are going to use a categorical variable (entered as character values) in a regression analysis, you do not
have to recode. R will do the appropriate recoding for you.

There are some cautions about entering character values that it will be very healthy to know about right up front. Suppose
we enter the following vector into R...

> gender = c("male","female","female","male ","Male","female","female","mail")
> summary(gender)
Length Class Mode
8 character character
> table(gender) # summary(factor(gender)) will work; try it!
gender
female mail male Male male
4 1 1 1 1
First we learn that summary( ) is not so useful for summarizing a character vector. So I used table( ) instead, which
gives me a frequency table. Notice I got what appears to be an unintended result. First, there is a misspelling. R doesn't
know you can't spell, so it assumes this is what you (I) intended. There is also a case where "Male" was capitalized, and R,
being case sensitive, counted that as a different value from the uncapitalized "male"s. The one that can really puzzle you is
the difference between "male" and "male ". This can be a real mystery, in fact, when you've entered data using another
program, like a spreadsheet, and then read it into R. Moral of the story: BE CAREFUL TYPING CHARACTER DATA! If
you put a space on the beginning or end of a value, R will assume you mean it to be that way.
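
If the damage is already done, some of it can be repaired after the fact. The following is a sketch, assuming a version
of R recent enough to have the trimws( ) function. The name gender.clean is made up for this illustration.

```r
gender <- c("male","female","female","male ","Male","female","female","mail")

# trimws() strips leading and trailing white space; tolower() removes case differences.
gender.clean <- tolower(trimws(gender))
table(gender.clean)

# Misspellings still have to be found and fixed by hand.
gender.clean[gender.clean == "mail"] <- "male"
table(gender.clean)
```

Notice that no function can know "mail" was a typo. That one had to be spotted in the frequency table first.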

Here is another way you can go wrong entering character data...

> country = c("England","Russia","United States","England","England")
> table(country)
country
England Russia United States
3 1 1
What I am attempting to illustrate is that some data entry methods in R assume that white space separates variable values.
So suppose you have a value like United States. There are some cases in which R will read that as two values, "United"
and "States". If you are typing values into a vector, the necessary quotes will take care of it. However, it might be a good
idea not to put spaces inside data values. You can type a period into what would otherwise be a space, "United.States", and
that will never cause a problem.

Now let's use scan( ) to enter the same values...

> rm(country)
> country = scan(what="character")
1: England
2: Russia
3: United States
5: England
6: England
7:
Read 6 items
> table(country)
country
England Russia States United
3 1 1 1
And there it is! Now you see the problem. Let's do it right...
> rm(country)
> country = scan(what="character")
1: England
2: Russia
3: United.States
4: England
5: England
6:
Read 5 items
> table(country)
country
England Russia United.States
3 1 1

The default data type for scan( ) is numeric. Using scan( ) to enter character data is very convenient because you
can avoid typing commas and quotes, but you do have to remember to specify that you are entering character data by using
the what= option.
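
If you really need the space inside a value, one possible workaround is to tell scan( ) that only a newline separates
items. This is a sketch, not the only way to do it. The name typed is made up, and text= is used so the example runs
without keyboard input.

```r
# Three values, one per line; the third contains a space.
typed <- "England\nRussia\nUnited States"

# sep="\n" makes each line a single item, so "United States" stays together.
country <- scan(text = typed, what = "character", sep = "\n", quiet = TRUE)
country
```

With this option you must hit Enter after every value, since white space within a line no longer separates items.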

Large data sets, however, will probably be typed into a spreadsheet and then read into R. In this case, you will have to be
careful how you tell R the file is formatted. More about that when we get to reading and writing external files.

One more thing about character data...

> summary(country)
Length Class Mode
5 character character
> country = factor(country)
> summary(country)
England Russia United.States
3 1 1
Until you declare your entered vector to be a factor, R will consider it character data. Sometimes that is what you want, but
usually not. If you mean it to be a factor, use factor( ) to declare it as such.
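
A quick way to see what declaring a factor actually did is to look at its levels and at the integer codes R stores
underneath. A minimal sketch, re-creating the vector so it can be run on its own:

```r
country <- c("England","Russia","United.States","England","England")
country <- factor(country)

levels(country)      # the distinct values, sorted
nlevels(country)     # how many distinct values there are
as.integer(country)  # the integer codes stored underneath
```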
> rm(list=ls()) # Clean up.

Entering Tabled Data

Sometimes you have data that someone has already done the work of putting into a table for you. (This happens especially
with problems out of a textbook.) The following data occur in "A Handbook of Small Data Sets" by Hand et al. (1994)...

24. Snoring and heart disease (on page 18 of Hand et al.)

Norton, P.G. and Dunn, E.V. (1985) Snoring as a risk factor for disease: an
epidemiological survey. British Medical Journal, 291, 630-632.

                                  Snore
                                  nearly    Snore
Heart      Non-      Occasional   every     every
disease    snorers   snorers      night     night
-------    ---------------------------------------
yes          24         35          21        30
no         1355        603         192       224
--------------------------------------------------
These data can be entered into a matrix, an array, or a table. I prefer to enter them into a table, so that's what I'm going to
illustrate here, along with a few pointers for making things look a little neater when R prints it out...
> row1 = c(24,35,21,30)
> row2 = c(1355,603,192,224)
> rbind(row1,row2) -> snoring.table
> snoring.table
[,1] [,2] [,3] [,4]
row1 24 35 21 30
row2 1355 603 192 224
> dimnames(snoring.table) = list("heart.disease" = c("yes","no"),
+ "snore.status" = c("nonsnorer","occasional",
+ "nearly.every.night","every.night"))
> snoring.table
snore.status
heart.disease nonsnorer occasional nearly.every.night every.night
yes 24 35 21 30
no 1355 603 192 224

First, I entered the table row by row into separate vectors. Then I used the rbind( ), or "row bind", function to bind the
rows into a table. (There is also a cbind( ) function, if you prefer to enter your tables column by column.) Then I added
names to the various dimensions of the table, making liberal use of the Enter key and space bar so the screen did not scroll
as I was typing. Notice the row names were entered first followed by the column names. The same method would be used
to name the dimensions in an array or a matrix. It's worth taking a few minutes to examine the syntax of the
dimnames( ) function. Notice it takes a list of the variable names, and the individual levels of each variable are
assigned via vectors typed within the list. Tricky!
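
The same table can be built in one step with matrix( ), supplying the dimension names as it is created. This is
offered only as an alternative sketch. The byrow=TRUE option is what lets the numbers be typed in row by row, as they
appear in the book.

```r
snoring.table <- matrix(c(24, 35, 21, 30,
                          1355, 603, 192, 224),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(heart.disease = c("yes","no"),
                                        snore.status = c("nonsnorer","occasional",
                                                         "nearly.every.night","every.night")))

# as.table() changes the class from matrix to table; the contents are unchanged.
snoring.table <- as.table(snoring.table)
snoring.table
```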

I don't like this table, because it's customary (but not required) to put the explanatory variable in the rows and the
response variable in the columns of a contingency table. So I'm going to flip it using the t( ), or "transpose",
function...

> snoring.table = t(snoring.table)
> snoring.table
heart.disease
snore.status yes no
nonsnorer 24 1355
occasional 35 603
nearly.every.night 21 192
every.night 30 224
Better! Notice also I avoided putting spaces into my variable names. This is a good practice, although since the names had
to be quoted anyway in the dimnames command, it is not strictly necessary. Also, you should ignore the fact that I am
sometimes using arrow assignment and sometimes = assignment. I'm doing it to illustrate that it usually does not matter
which you use.

Now let's look at a few functions for extracting information from this table...

> dim(snoring.table) # no. of rows by no. of columns
[1] 4 2
> dimnames(snoring.table) # We already know this, but what the heck?
$snore.status
[1] "nonsnorer" "occasional" "nearly.every.night"
[4] "every.night"
$heart.disease
[1] "yes" "no"
> snoring.table[1,] # Look at row 1.
yes no
24 1355
> snoring.table[,2] # Look at column 2.
nonsnorer occasional nearly.every.night every.night
1355 603 192 224
> snoring.table[3,2] # Look at the entry in row 3 and column 2.
[1] 192
> addmargins(snoring.table) # Show row and column sums.
heart.disease
snore.status yes no Sum
nonsnorer 24 1355 1379
occasional 35 603 638
nearly.every.night 21 192 213
every.night 30 224 254
Sum 110 2374 2484
> prop.table(snoring.table, margin=1) # Get proportions relative to row sums.
heart.disease
snore.status yes no
nonsnorer 0.01740392 0.9825961
occasional 0.05485893 0.9451411
nearly.every.night 0.09859155 0.9014085
every.night 0.11811024 0.8818898
> prop.table(snoring.table, margin=2) # Get proportions relative to column sums.
heart.disease
snore.status yes no
nonsnorer 0.2181818 0.57076664
occasional 0.3181818 0.25400168
nearly.every.night 0.1909091 0.08087616
every.night 0.2727273 0.09435552
> prop.table(snoring.table) # Get proportions relative to overall sum.
heart.disease
snore.status yes no
nonsnorer 0.009661836 0.54549114
occasional 0.014090177 0.24275362
nearly.every.night 0.008454106 0.07729469
every.night 0.012077295 0.09017713
> chisq.test(snoring.table) # You were wondering, weren't you?
Pearson's Chi-squared test
data: snoring.table
X-squared = 72.7821, df = 3, p-value = 1.082e-15
It's also easy enough to turn those proportions into percentages...
> prop.table(snoring.table, margin=1)*100
heart.disease
snore.status yes no
nonsnorer 1.740392 98.25961
occasional 5.485893 94.51411
nearly.every.night 9.859155 90.14085
every.night 11.811024 88.18898

Just multiply the entire prop.table by 100. And finally...


> rm(list=ls()) # clean up
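
For the curious, the chi-squared statistic reported above can be reproduced by hand from the observed counts. This is
a sketch of the usual textbook calculation. The names observed, expected, and x.squared are made up for this
illustration, and the table is re-entered so the example can be run on its own.

```r
# Rows are snore status; columns are heart disease yes, no.
observed <- matrix(c(24, 1355,
                     35, 603,
                     21, 192,
                     30, 224),
                   ncol = 2, byrow = TRUE)

# Expected count = (row total x column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

x.squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p.value <- pchisq(x.squared, df, lower.tail = FALSE)
c(X.squared = x.squared, df = df, p.value = p.value)
```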

revised 2010 July 29


dataentry.html[27/01/2014 22:16:01]
R Tutorials--Simple Linear Regression

SIMPLE LINEAR CORRELATION AND REGRESSION

Correlation

Correlation is used to test for a relationship between two numerical variables or two ranked (ordinal) variables. In this
tutorial, we assume the relationship (if any) is linear.

To demonstrate, we will begin with a data set called "cats" from the "MASS" library, which contains information on
various anatomical features of house cats...

> library("MASS")
> data(cats)
> str(cats)
'data.frame': 144 obs. of 3 variables:
$ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
$ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
> summary(cats)
Sex Bwt Hwt
F:47 Min. :2.000 Min. : 6.30
M:97 1st Qu.:2.300 1st Qu.: 8.95
Median :2.700 Median :10.10
Mean :2.724 Mean :10.63
3rd Qu.:3.025 3rd Qu.:12.12
Max. :3.900 Max. :20.50
"Bwt" is the body weight in kilograms, "Hwt" is the heart weight in grams, and "Sex" should be obvious. There are no
missing values in any of the variables, so we are ready to begin by looking at a scatterplot...
> with(cats, plot(Bwt, Hwt))
> title(main="Heart Weight (g) vs. Body Weight (kg)\nof Domestic Cats")

The plot( ) function gives a scatterplot whenever you feed it two numerical variables. The first variable listed
will be plotted on the horizontal axis. A formula interface can also be used, in which case the response variable
should come before the tilde and the variable to be plotted on the horizontal axis after...
> with(cats, plot(Hwt ~ Bwt))

The scatterplot shows a fairly strong and reasonably linear relationship between the two variables. A Pearson
product-moment correlation coefficient can be calculated using the cor( ) function...
> with(cats, cor(Bwt, Hwt))
[1] 0.8041274
> with(cats, cor(Bwt, Hwt))^2
[1] 0.6466209

Pearson's r = .804 indicates a strong positive relationship. To get the coefficient of determination, I just hit the up
arrow key to recall the previous command to the command line and added "^2" on the end to square it. If a test of
significance is required, this is also easily enough done using the cor.test( ) function. The function does a t-test,
gives a 95% confidence interval for the population correlation (use "conf.level=" to change the confidence level), and
reports the value of the sample statistic. The alternative hypothesis can be set to "two.sided" (the default), "less", or
"greater"...
> with(cats, cor.test(Bwt, Hwt))
Pearson's product-moment correlation
data: Bwt and Hwt
t = 16.1194, df = 142, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7375682 0.8552122
sample estimates:
cor
0.8041274
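
It can be instructive to reproduce these numbers by hand. The sketch below computes the t statistic from r and n, and
the confidence interval by way of Fisher's z transformation, which is the method the cor.test( ) help page describes
for the Pearson case. The names t.stat, half.width, and ci are made up for this illustration.

```r
library("MASS")   # provides the cats data frame

r <- with(cats, cor(Bwt, Hwt))
n <- nrow(cats)

# t statistic on n - 2 degrees of freedom.
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p.value <- 2 * pt(abs(t.stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via Fisher's z transformation (atanh).
z <- atanh(r)
half.width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half.width, z + half.width))

c(t = t.stat, lower = ci[1], upper = ci[2])
```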

Since we would expect a positive correlation here, we might have set the alternative to "greater"...
> with(cats, cor.test(Bwt, Hwt, alternative="greater", conf.level=.8))
Pearson's product-moment correlation
data: Bwt and Hwt
t = 16.1194, df = 142, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
80 percent confidence interval:
0.7776141 1.0000000
sample estimates:
cor
0.8041274

There is also a formula interface for cor.test( ), but it's tricky. Both variables should be listed after the tilde...
> with(cats, cor.test(~ Bwt + Hwt)) # output not shown

Using the formula interface makes it easy to subset the data by rows of the data frame...
> with(cats, cor.test(~ Bwt + Hwt, subset=(Sex=="F")))
Pearson's product-moment correlation
data: Bwt and Hwt
t = 4.2152, df = 45, p-value = 0.0001186
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2890452 0.7106399
sample estimates:
cor
0.5320497

The "subset=" option is not available unless you use the formula interface.
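
If you are not using the formula interface, a similar result can be had by subsetting the data frame first. A sketch,
in which the name female.cats is made up for this illustration:

```r
library("MASS")   # provides the cats data frame

# Keep only the female cats, then correlate as before.
female.cats <- subset(cats, Sex == "F")
nrow(female.cats)

r.female <- with(female.cats, cor(Bwt, Hwt))
r.female
```

The subsetted data frame can be passed to cor.test( ) in the same way if the test of significance is wanted.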

For a more revealing scatterplot, try this...
> with(cats, plot(Bwt, Hwt, type="n", xlab="Body Weight in kg",
+ ylab="Heart Weight in g",
+ main="Heart Weight vs. Body Weight of Cats"))
> with(cats,points(Bwt[Sex=="F"],Hwt[Sex=="F"],pch=16,col="red"))
> with(cats,points(Bwt[Sex=="M"],Hwt[Sex=="M"],pch=17,col="blue"))
>
Output not shown. ("Aw, man! I'm not gonna type all that!" Here's a trick for you Windows users out there. In your
browser window, copy the above lines, command prompts and all, including that last empty prompt. Then go into the R
Console window, pull down the Edit menu, and choose "Paste commands only". Sometimes that works, and sometimes it
doesn't!)

Correlation and Covariance Matrices

If a data frame (or other table-like object) contains more than two numerical variables, then the cor( ) function will result in
a correlation matrix...
> rm(cats) # if you haven't already
> data(cement) # also in the MASS library
> str(cement)
'data.frame': 13 obs. of 5 variables:
$ x1: int 7 1 11 11 7 11 3 1 2 21 ...
$ x2: int 26 29 56 31 52 55 71 31 54 47 ...
$ x3: int 6 15 8 8 6 9 17 22 18 4 ...
$ x4: int 60 52 20 47 33 22 6 44 22 26 ...
$ y : num 78.5 74.3 104.3 87.6 95.9 ...
> cor(cement)
x1 x2 x3 x4 y
x1 1.0000000 0.2285795 -0.82413376 -0.24544511 0.7307175
x2 0.2285795 1.0000000 -0.13924238 -0.97295500 0.8162526
x3 -0.8241338 -0.1392424 1.00000000 0.02953700 -0.5346707
x4 -0.2454451 -0.9729550 0.02953700 1.00000000 -0.8213050
y 0.7307175 0.8162526 -0.53467068 -0.82130504 1.0000000
If you prefer a covariance matrix, use cov( )...
> cov(cement)
x1 x2 x3 x4 y
x1 34.60256 20.92308 -31.051282 -24.166667 64.66346
x2 20.92308 242.14103 -13.878205 -253.416667 191.07949
x3 -31.05128 -13.87821 41.025641 3.166667 -51.51923
x4 -24.16667 -253.41667 3.166667 280.166667 -206.80833
y 64.66346 191.07949 -51.519231 -206.808333 226.31359

If you have a covariance matrix and want a correlation matrix...
> cov.matr = cov(cement)
> cov2cor(cov.matr)
x1 x2 x3 x4 y
x1 1.0000000 0.2285795 -0.82413376 -0.24544511 0.7307175
x2 0.2285795 1.0000000 -0.13924238 -0.97295500 0.8162526
x3 -0.8241338 -0.1392424 1.00000000 0.02953700 -0.5346707
x4 -0.2454451 -0.9729550 0.02953700 1.00000000 -0.8213050
y 0.7307175 0.8162526 -0.53467068 -0.82130504 1.0000000
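
What cov2cor( ) does can be sketched by hand: each covariance is divided by the product of the two standard deviations,
which are the square roots of the diagonal entries. The names sds and cor.by.hand are made up for this illustration.

```r
library("MASS")   # provides the cement data frame

cov.matr <- cov(cement)

# Standard deviations are the square roots of the diagonal (the variances).
sds <- sqrt(diag(cov.matr))

# Divide element [i, j] by sd_i * sd_j.
cor.by.hand <- cov.matr / outer(sds, sds)
round(cor.by.hand, 3)
```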

If you want a visual representation of the correlation matrix (i.e., a scatterplot matrix)...
> pairs(cement)

The command plot(cement) would have done the same thing.

Correlations for Ranked Data

If the data are ordinal rather than true numerical measures, or have been converted to ranks to fix some problem with
distribution or curvilinearity, then R can calculate a Spearman rho coefficient or a Kendall tau coefficient. Suppose we
have two athletic coaches ranking players by skill...

> ls()
[1] "cement" "cov.matr"
> rm(cement, cov.matr) # clean up first
> coach1 = c(1,2,3,4,5,6,7,8,9,10)
> coach2 = c(4,8,1,5,9,2,10,7,3,6)
> cor(coach1, coach2, method="spearman")
[1] 0.1272727
> cor.test(coach1, coach2, method="spearman")
Spearman's rank correlation rho
data: coach1 and coach2
S = 144, p-value = 0.72
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1272727
> cor(coach1, coach2, method="kendall")
[1] 0.1111111
> cor.test(coach1, coach2, method="kendall")
Kendall's rank correlation tau
data: coach1 and coach2
T = 25, p-value = 0.7275
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.1111111
> ls()
[1] "coach1" "coach2"
> rm(coach1,coach2) # clean up again
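
In case it helps to see where the Spearman value comes from, here is a sketch showing that it is just a Pearson
correlation calculated on the ranks. With no tied ranks, the shortcut formula from elementary statistics gives the same
answer. The vectors are re-entered since they were just removed, and the names rho.ranks and rho.formula are made up.

```r
coach1 <- c(1,2,3,4,5,6,7,8,9,10)
coach2 <- c(4,8,1,5,9,2,10,7,3,6)

# Spearman's rho is Pearson's r applied to the ranks.
rho.ranks <- cor(rank(coach1), rank(coach2))

# With no ties, the textbook shortcut formula gives the same value.
d <- rank(coach1) - rank(coach2)
n <- length(d)
rho.formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

c(rho.ranks, rho.formula, sum(d^2))
```

The sum of squared rank differences is the S = 144 reported by cor.test( ) above.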
I will not get into the debate over which of these rank correlations is better! You can also use t

simplelinear.html[27/01/2014 22:19:51]
R Tutorials--Time Series Analysis

TIME SERIES ANALYSIS

Definition

When a process is measured over time--i.e., in a sense, "time" is the independent or explanatory variable--then the
resulting sequence of measured values is called a time series. The difference between time series data and independent
measurements that just happen to be made over time is that in time series data the successive data points are often
correlated. For example, the built-in data set "sunspots" is a count of the number of sunspots observed during every month
from 1749 through 1983. The autocorrelation function reveals how successive data points in the series are correlated...

> acf(sunspots)

As seen in this graph, the data points are perfectly correlated with themselves (lag = 0), but successive points are also
highly correlated. For example, a point is correlated with the next point at higher than r = 0.9, and this
autocorrelation between points in the series does not drop to near zero until three years have passed.

This means traditional techniques that assume independent measurements should not be used in the analysis of time series
data.
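
To make the idea of autocorrelation concrete, the lag 1 value plotted by acf( ) can be sketched by hand: it is the sum
of cross products of each point with the next one, taken about the overall mean and divided by the overall sum of
squares. The names x, m, and r1 are made up for this illustration.

```r
x <- as.numeric(sunspots)   # built-in monthly sunspot series
n <- length(x)
m <- mean(x)

# Lag-1 autocorrelation: cross products of each point with the next one,
# divided by the overall sum of squares (the convention acf() uses).
r1 <- sum((x[-n] - m) * (x[-1] - m)) / sum((x - m)^2)

r1
acf(sunspots, plot = FALSE)$acf[2]   # element 1 is lag 0, element 2 is lag 1
```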

[NOTE: Who am I kidding? When I first began writing this back in 2010, I had big plans to teach myself time series
analysis. It hasn't happened. My knowledge of time series analysis is rudimentary to say the least. You should not assume
this is all that R can do with time series--it's just all I can do. In fact, R contains extensive facilities, many in optional
packages, for dealing with time series. If you want an elementary introduction to time series, I'm told that Chatfield (2003)
is an excellent source.]

revised 2013 June 22


timeseries.html[27/01/2014 22:20:05]
ANOVA
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R's
approach less coherent and user-friendly. A good online presentation on ANOVA in R can be found in the
ANOVA section of the Personality Project. (Note: I have found that these pages render fine in Chrome
and Safari browsers, but can appear distorted in iExplorer.)

1. Fit a Model
In the following examples lower case letters are numeric variables and upper case letters are factors.

# One Way Anova (Completely Randomized Design)
fit <- aov