Rcourse3 PDF

R
Data Manipulation
R.M. Ripley
Department of Statistics
University of Oxford
2012/13
R.M. Ripley (University of Oxford) R 2012/13 1 / 35

Getting Data in and out of R
Reading your Data into R
Very commonly, data is in the form of an array in a text file. Each
row contains one data record, and the parts of the record (or fields)
are separated by a special character such as Tab.
For such files use read.table() or a variant such as read.csv() or
read.csv2().
If your data is not like an array, e.g. the lines have differing
structures, use the function readLines().
To import data from another system, try to export it as a tab- or
comma-delimited file and use read.table().
If that is not possible, look in the R Data Import/Export Manual to
see if there is a facility to import such files.

More about read.table()
File formats are very varied, hence there are many options for
read.table. Here are details of a few:
file     name of file, usually a character string. Note for Windows
         users: use "/" or "\\" but not "\" as the directory
         separator.
sep      field separator e.g. sep="\t" (tab), sep=","
header   If your file has column headings in the first row, use
         header=TRUE.
fill     If empty fields at the end of lines are not present, specify
         fill=TRUE
others   To control the way R interprets your data, use options
         such as na.strings, comment.char, quote,
         as.is, colClasses, stringsAsFactors
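
As a minimal sketch of these options, the following reads a small tab-separated table. The data are supplied inline through the text argument, standing in for a file; the values and the use of "-" as a missing-value code are invented for illustration.

```r
# Hypothetical tab-separated data; "-" marks a missing value.
lines <- c("id\tgroup\tweight",
           "1\ta\t10.5",
           "2\tb\t-",
           "3\ta\t12.0")

mydata <- read.table(text = lines, sep = "\t", header = TRUE,
                     na.strings = "-", stringsAsFactors = FALSE)

str(mydata)   # id: integer, group: character, weight: numeric with one NA
```

With a real file, replace text = lines by the file name as the first argument.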

Variants of read.table()
These variants can be simpler to use for very common formats:

read.csv      calls read.table with defaults suitable for reading
              comma-separated files, where the decimal point is "."
read.csv2     calls read.table with defaults suitable for reading
              semi-colon separated files, where the decimal point is ","
read.delim    calls read.table with defaults suitable for reading
              tab separated files, where the decimal point is "."
read.delim2   calls read.table with defaults suitable for reading
              tab separated files, where the decimal point is ","
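
To illustrate the difference between the two conventions, here is a small sketch that reads the same two records once in each format. The data are invented and passed through the text argument rather than a file.

```r
# The same two records in the two conventions (values are illustrative).
d1 <- read.csv(text = "x,y\n1.5,2\n3.25,4")    # sep = ",", dec = "."
d2 <- read.csv2(text = "x;y\n1,5;2\n3,25;4")   # sep = ";", dec = ","

all.equal(d1, d2)   # compare the two data frames
```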

Data read in by read.table()
Always check the column formats by viewing in the data editor or
using summary(). For example, are all the numeric columns
displayed as numbers?
In versions of R before 4.0.0, character columns are read in as
factors by default (from R 4.0.0 they stay as character). Change if
necessary e.g. by
mydata$col1 <- as.character(mydata$col1)
or use the as.is option.
Factors coded as numbers will be read in as numeric columns. Use
factor() to correct these.
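
A short sketch of these checks, assuming a hypothetical table in which the column site holds numeric codes that are really category labels:

```r
# Hypothetical data: 'site' holds numeric codes that are really labels.
mydata <- read.table(text = "site name\n1 alpha\n2 beta\n1 gamma",
                     header = TRUE, as.is = TRUE)

sapply(mydata, class)                 # site is integer, name is character
mydata$site <- factor(mydata$site)    # treat the codes as categories
levels(mydata$site)                   # "1" "2"
```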

Saving your data out of R
write.table() writes a data frame to a delimited text file.
Some options:
file   name of file
sep    desired field separator
Others control output of column and row names, treatment of
missing data etc.
There are variants write.csv() and write.csv2() as for input.
To write a matrix rather than a data frame, use write or
write.matrix (the latter from package MASS).
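
As a rough sketch, the round trip below writes an invented data frame to a temporary tab-delimited file and reads it back. The choices row.names = FALSE and quote = FALSE are just one reasonable setting, not a recommendation from the course.

```r
# Small illustrative data frame with a missing value.
df <- data.frame(id = 1:3, score = c(2.5, NA, 4))

path <- tempfile(fileext = ".txt")          # temporary file, removed below
write.table(df, file = path, sep = "\t",
            row.names = FALSE, quote = FALSE)

back <- read.table(path, sep = "\t", header = TRUE)
unlink(path)

all.equal(df, back)   # check that nothing was lost on the way
```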

Matrices, arrays and indexing
Matrices
A matrix is a two dimensional array of objects, all of the same type.
To create a matrix, use the function matrix():

mymat <- matrix(1:12, 3, 4)
gives
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Entries go down columns unless you specify byrow=TRUE.
dim(mymat) gives the dimensions 3 4
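
The effect of byrow can be seen by building the same matrix both ways; a minimal sketch (the names bycol and byrow are chosen here for illustration):

```r
bycol <- matrix(1:12, nrow = 3, ncol = 4)                # default: fill columns
byrow <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)  # fill rows instead

bycol[2, 3]   # 8
byrow[2, 3]   # 7
```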

Arrays
Arrays are like matrices but have more than 2 dimensions.
You can create them using the function array() or by assigning
3 or more dimensions to a vector or matrix:
myarr <- mymat
dim(myarr) <- c(3, 2, 2)
myarr[, , 1]
gives
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Indexing
Indices can be used to select part of a vector, matrix or array:
    to extract these entries
    to make assignments to the selected part
The indexing operation is indicated by square brackets: [,]
To select one element of an array with 4 dimensions, say, simply
use 4 subscripts, e.g.
myarray[2, 3, 1, 4]
(just like algebra!)
Indices can have several different forms

Indexing: types of indices
Suppose
x <- c(2, 4, 6, 8, 10, 12)
names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")
A vector of positive integers, in any desired order, indicating
elements to select.
e.g. x[c(1, 3, 6, 5)] will give 2, 6, 12, 10
A logical vector. This must be of the same length as the vector.
Values corresponding to the entries TRUE will be selected.
e.g. use <- c(rep(TRUE, 3), FALSE, TRUE, FALSE)
x[use] will give 2, 4, 6, 10
A vector of negative integers, indicating elements to exclude e.g.
x[-c(1:3)] will give 8, 10, 12

Indexing: character and empty indices
Suppose
x <- c(2, 4, 6, 8, 10, 12)
names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")
A vector of character strings. Only applicable if the vector has
names.
x[c("E1", "E3")] will give 2, 6
Empty. Select all. Useful if assigning to an object as all values will
be replaced but all other aspects of the object will be unchanged.
x[] <- 0
names(x) will be the same as before
Compare with x <- 0
x will be the single number 0, and the names are lost.
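
The contrast between the two assignments can be checked directly. In this sketch y and z are copies of x made only for the comparison:

```r
x <- c(2, 4, 6, 8, 10, 12)
names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")

y <- x
y[] <- 0      # every element replaced; length and names kept
z <- x
z <- 0        # whole object replaced by a single number

names(y)      # "E1" "E2" "E3" "E4" "E5" "E6"
length(z)     # 1
names(z)      # NULL
```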

Indexing (more details)
Recycling will be used if the sub-vector selected for replacement
is longer than the right-hand side.
x[c(1, 3)] <- 4.5
Replacing to an index greater than the length of the vector
extends it, filling in with NAs.
x[10] <- 8
Extracting with an index outside the range of the vector length
returns NA.
x[11]
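
Run in sequence on the six-element vector from the earlier slides, these three statements behave as follows (a small sketch):

```r
x <- c(2, 4, 6, 8, 10, 12)

x[c(1, 3)] <- 4.5   # one value recycled over two positions
x[10] <- 8          # positions 7 to 9 are created as NA
length(x)           # 10
x[7:9]              # NA NA NA
x[11]               # NA; reading past the end does not extend x
```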

Indexing matrices and arrays
You can select using one vector for each dimension, e.g.
mymat[1:2, -2]
or use a logical matrix subscript, e.g.
mymat[mymat > 1] <- NA      # note: no comma
or a matrix of subscripts with one row per element selected, e.g.
mymat[cbind(rep(1, 3), c(2,3,4))] <- NA
If the result has length 1 in any dimension, this is dropped unless
you use the argument drop=FALSE:
mymat[1:2, 1] is a vector but
mymat[1:2, 1, drop=FALSE] is a matrix with 2 rows and one column.
Forgetting to use drop=FALSE is a common error.
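
A brief sketch of dropped dimensions and of a matrix of subscripts, using the 3 by 4 matrix from the earlier slide (idx is a name introduced here):

```r
mymat <- matrix(1:12, 3, 4)

dim(mymat[1:2, 1])                 # NULL: the result is a plain vector
dim(mymat[1:2, 1, drop = FALSE])   # 2 1: still a matrix

# Each row of idx is one (row, column) pair: (1,2), (1,3), (1,4).
idx <- cbind(rep(1, 3), c(2, 3, 4))
mymat[idx]                         # 4 7 10
```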

Indexing data frames
Data frames can be indexed like matrices, but only drop
dimensions if you select from a single column, not if you select
from a single row.
If you select rows from a data frame with only one column the
result will be a vector unless you use drop=FALSE.
Often want to select the rows of a data frame which meet some
criterion.
Use logical indexing:
mydata[mydata$weight > 400, ]
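
One point worth checking with logical indexing is what happens when the criterion is NA. The sketch below uses an invented data frame; the column names and values are assumptions.

```r
# Hypothetical data frame; the column names are assumptions.
mydata <- data.frame(id = 1:4, weight = c(380, 410, NA, 450))

mydata[mydata$weight > 400, ]       # the NA comparison yields a row of NAs
heavy <- mydata[!is.na(mydata$weight) & mydata$weight > 400, ]
heavy                               # rows with id 2 and 4 only
mydata[mydata$weight > 400, "id"]   # a single column: result is a vector
```

Testing for missing values first keeps the all-NA row out of the result.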

Writing simple functions
Function Examples
Two simple examples:

A function to calculate the standard deviation of a vector:
std.dev <- function(x) sqrt(var(x))
And one to calculate the two-tailed p-value of a t test:
t.test.p <- function(x, mu=0)
{
    n <- length(x)
    t <- sqrt(n) * (mean(x) - mu) / std.dev(x)
    2 * (1 - pt(abs(t), n-1))
}

Function Details
Note:
Arguments. May have default values. To call functions, either keep
the arguments in the same order
t.test.p(myx, 1)
or use names for each argument
t.test.p(mu=1, x=myx)
The two can be mixed, with ordered arguments first and named
ones at the end of the list.
To use more than one statement in the function, use braces {} to
define a block.
The value of the last expression evaluated will be returned.
To return more than one item, create a list using list() or a
vector using c().
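
As an illustration of returning more than one item, here is a sketch of a variant that returns both the t statistic and the p-value in a list. The function t.test.stats and the data myx are hypothetical and not part of the course material.

```r
# Hypothetical helper, not from the course: returns two results in a list.
t.test.stats <- function(x, mu = 0)
{
    n <- length(x)
    tstat <- sqrt(n) * (mean(x) - mu) / sd(x)
    list(statistic = tstat, p.value = 2 * pt(-abs(tstat), n - 1))
}

myx <- c(5.1, 4.9, 6.2, 5.8, 5.5)          # illustrative data
res <- t.test.stats(mu = 5, x = myx)       # named arguments, in any order
res$p.value
all.equal(res$p.value, t.test(myx, mu = 5)$p.value)   # agrees with t.test()
```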

Flow Control
Our t.test.p performed 3 commands in a sequence.
Often we need to make a decision or execute a loop.
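myfn <- function(n=100)
{
    tmp <- rep(NA, 3)
    tmp[1] <- mean(runif(n))
    tmp[2] <- mean(runif(n))   # fill all three slots; otherwise the
    tmp[3] <- mean(runif(n))   # NAs left in tmp make the result NA
    mean(tmp[tmp > .2])
}
Could adapt this to do any number of repeats, specified by an
argument.
Could decide to store in tmp only those values greater than 0.2.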

Flow Control: If
Decisions are controlled by the if statement:
if (condition) true.branch else false.branch
The else part is optional.
true.branch or false.branch can be compound statements enclosed
in {}.
condition must be a single logical value. From R 4.2.0 a condition
of length greater than one is an error; earlier versions used only
the first element and issued a warning.
If condition is compound, e.g. A || B, then B is not evaluated if A
is TRUE.
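
A small sketch of if, else and the short-circuit behaviour of ||. The function signlabel is a made-up example.

```r
signlabel <- function(x)    # hypothetical example function
{
    if (!is.numeric(x) || length(x) != 1) {
        # x > 0 below is only reached for a single number
        "not a single number"
    } else if (x > 0) {
        "positive"
    } else {
        "not positive"
    }
}

signlabel("a")   # "not a single number"; length(x) is never checked
signlabel(3)     # "positive"
signlabel(-2)    # "not positive"

TRUE || stop("never evaluated")   # TRUE: the right-hand side is skipped
```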

Example
myfna <- function(n=100)
{
    tmp <- rep(NA, 3)
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[1] <- x
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[2] <- x
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[3] <- x
    mean(tmp, na.rm=TRUE)
}

Flow Control: For
Loops are controlled by the statements for, while and repeat.
The for statement allows a statement to be iterated as a variable
takes values in a sequence:
for(variable in sequence) statement
e.g.
for(i in 1:10) print(x[i])
The statement can be compound, in which case it must be enclosed
in {}.

Flow Control: While, Repeat
while and repeat statements do not use loop variables.
while (condition) statement
The while statement repeats statement until condition is FALSE.
repeat statement
The repeat statement repeats statement until flow is transferred
out using the break statement.
In a loop, the command next will transfer control to the beginning
of the next iteration, break will transfer control to the statement
following the end of the loop.
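
The effect of next and break can be traced in a small loop; a sketch with arbitrary cut-offs:

```r
total <- 0
for (i in 1:10)
{
    if (i %% 2 == 0)
        next          # skip even values of i
    if (i > 7)
        break         # leave the loop when i reaches 9
    total <- total + i
}
total   # 1 + 3 + 5 + 7 = 16
```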

Example functions
myfn1 <- function(obs=10, n=100)
{
    x <- rep(NA, n)
    for (i in 1:n)
    {
        tmp <- runif(obs)
        x[i] <- mean(tmp)
    }
    c(mn=mean(x), std=sd(x))
}

Example functions
myfn2 <- function(obs=10)
{
    x <- runif(obs)
    while(mean(x) < 0.45)
    {
        obs <- 2 * obs
        x <- runif(obs)
    }
    list(mn=mean(x), std=sd(x), obs=obs)
}

Example functions
myfn3 <- function(obs=10)
{
    repeat
    {
        x <- runif(obs)
        if (mean(x) >= 0.45)
            break
        obs <- 2*obs
    }
    list(mn=mean(x), std=sd(x), obs=obs)
}

Braces, square brackets, parentheses
Just in case you are confused by the notation we have introduced, let
us recap:
brace            {, }   Used to create blocks of statements
square bracket   [, ]   Used to subscript
parenthesis      (, )   Used to enclose function arguments

More data manipulation
ifelse function
R has many functions which reduce the need for you to write loops.
This can be both easier and more efficient. One is ifelse.
Suppose x <- c(0, 1, 1, 2) and y <- c(44, 45, 56, 77).
Replace:
z <- rep(NA, 4)
for (i in 1:length(x))
{
    if (x[i] > 0)
        z[i] <- y[i] / x[i]
    else
        z[i] <- y[i] / 99
}
by
z <- ifelse(x > 0, y / x, y / 99)

apply functions
A group of functions useful for avoiding loops e.g.
lapply, sapply, apply, tapply, mapply
lapply and sapply are used to iterate along a list or a vector.
lapply(mylist, length)
will return a list whose components are the lengths of the
components of mylist.
sapply(mylist, length)
will return a vector whose elements are the lengths of the
components of mylist.
apply and tapply operate in a similar way on arrays or parts of
arrays or vectors.
mapply operates on corresponding elements of multiple lists or
vectors.
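
The difference in return type between lapply and sapply is easy to see on a small list. The list below is made up for illustration.

```r
mylist <- list(a = 1:3, b = letters[1:5], c = numeric(0))   # illustrative list

lapply(mylist, length)   # a list: $a 3, $b 5, $c 0
sapply(mylist, length)   # a named integer vector: a = 3, b = 5, c = 0
```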

apply
apply and tapply have extra parameters, to indicate which part
of the array or vector to use. For apply it is the dimension(s) of
the array over which to iterate, e.g.
apply(mymat, 2, sum, na.rm=TRUE)
will produce the sum of each column of the matrix, after missing
values have been removed.
Note how to pass extra arguments to the function to be applied.
Can be used on data frames, but will turn them into matrices first.
If there are factor variables, all the variables will end up as
character.
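
A minimal sketch of apply over columns and over rows, on a small invented matrix containing one missing value:

```r
mymat <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 3)   # columns: (1, NA, 3), (4, 5, 6)

apply(mymat, 2, sum, na.rm = TRUE)   # column sums: 4 15
apply(mymat, 1, sum, na.rm = TRUE)   # row sums:    5 5 9
```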

tapply
tapply selects parts of a vector using factors, e.g.
tapply(myvector, myfactor, mean) will calculate the
mean of values of myvector after splitting them into groups
based on the different values of myfactor.
myfactor should be a vector the same length as myvector.
The function tapply can be used on data frames by using the
by function. (?by for further details.)
The function split can be used to partition a vector or data
frame based on the values of a factor.
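
For illustration, here are tapply and split on a short invented vector with a two-level factor (grp.means is a name introduced here):

```r
myvector <- c(10, 20, 30, 40, 80)
myfactor <- factor(c("a", "b", "a", "b", "a"))

grp.means <- tapply(myvector, myfactor, mean)
grp.means                    # a = 40, b = 30
split(myvector, myfactor)    # $a 10 30 80, $b 20 40
```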

mapply, etc
mapply(sum, x, y) will return a vector containing the element-wise
sums x + y.
mapply(function(x, y) hist(x$weight, main=y),
       mylist, names(mylist))
will produce histograms for each subset of the data with the name
of the corresponding feed as the title.
There are other extensions of lapply. Worth looking for (via the
help pages) if you seem to need to write something similar.
Package parallel contains parallel versions of some of these
functions.
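
A sketch of mapply walking along two arguments in step. The list here is invented, and paste stands in for the plotting call so that the result can be inspected as text.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)
mapply(sum, x, y)                       # 11 22 33, the same as x + y here

# A function of two arguments applied to matching list elements and names.
mylist <- list(first = 1:3, second = 4:9)   # illustrative list
labels <- mapply(function(v, nm) paste(nm, length(v)),
                 mylist, names(mylist))
labels                                  # "first 3" "second 6"
```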

Manipulating data
cbind(...) will join its arguments together by columns, e.g.
cbind(c(1, 2, 3), c(4, 5, 6))
gives
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
and rbind by row:
rbind(c(1, 2, 3), c(4, 5, 6))
gives
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
When used on two or more vectors, both return matrices.

Manipulating data
cbind and rbind can be used on data frames. In this case:
For cbind, the resulting data frame may have duplicate column
names.
For rbind the column names must match, although they need
not be in the same order.
You cannot create a data frame using cbind with vectors. To do
this, use data.frame().
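
The last point can be seen by combining a numeric and a character vector both ways; a small sketch with invented vectors:

```r
a <- c(1, 2, 3)
b <- c("x", "y", "z")

m <- cbind(a, b)          # a character matrix: the numbers become strings
d <- data.frame(a, b)     # a data frame: column a stays numeric

is.matrix(m)              # TRUE
is.character(m)           # TRUE
is.numeric(d$a)           # TRUE
```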

Merge
The merge function joins together two data frames.
It joins together rows which have common values in specified
columns, producing a new row which contains the information
from both data frames.
There are many possibilities. Use ?merge for full details.
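
As a minimal sketch, two invented data frames sharing a key column id are merged below, first keeping only matching rows and then keeping all rows. The frame and column names are assumptions.

```r
# Hypothetical data frames sharing the key column 'id'.
weights <- data.frame(id = c(1, 2, 3), weight = c(380, 410, 450))
feeds   <- data.frame(id = c(2, 3, 4), feed = c("soy", "oat", "rye"))

inner <- merge(weights, feeds, by = "id")               # ids 2 and 3 only
outer <- merge(weights, feeds, by = "id", all = TRUE)   # ids 1 to 4, NA where missing
inner
outer
```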

Matrix Algebra
Matrices of the same size can be added, subtracted, and
element-wise multiplied by each other using +, -, *.
For matrix multiplication, use %*% on matrices of matching
dimensions.
Inverses of square matrices can be found using the solve command.
eigen computes eigenvalues and eigenvectors of matrices.
colSums and colMeans return the column sums or means of a matrix;
rowSums and rowMeans return the sums or means of the rows. These
can also be used on arrays.
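
These operations can be tried on a small matrix; a sketch using an arbitrary 2 by 2 example:

```r
A <- matrix(c(2, 1, 1, 3), 2, 2)   # illustrative symmetric matrix
Ainv <- solve(A)

A * A               # element-wise squares
A %*% Ainv          # identity matrix, up to rounding error
eigen(A)$values     # eigenvalues, largest first
colSums(A)          # 3 4
rowMeans(A)         # 1.5 2
```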

Exercises
Exercises 3
Generate a matrix with 10 rows and 5 columns, with random
entries between 0 and 10. (Hint: look at runif)
Write a function using for to calculate the column means of the
matrix.
Extract the even rows from the matrix.
Using the data frame you created in Exercises 2, select the rows
for which the date is after 1st June 2007.
