
R

Data Manipulation

R.M. Ripley

Department of Statistics
University of Oxford

2012/13

R.M. Ripley (University of Oxford) R 2012/13 1 / 35


Getting Data in and out of R

Reading your Data into R

Very commonly, data is in the form of an array in a text file. Each row
contains one data record, and the parts of the record (or fields)
are separated by a special character such as Tab.

For such files use read.table() or a variant such as read.csv() or
read.csv2().

If your data is not like an array, e.g. the lines have differing
structures, use the function readLines().

To import data from another system, try to export it as a tab- or
comma-delimited file and use read.table().

If that is not possible, look in the R Data Import/Export Manual to
see if there is a facility to import such files.


More about read.table()

File formats are very varied, hence there are many options for
read.table(). Here are details of a few:

file    Name of the file, usually a character string. Note for Windows
        users: use "/" or "\\" but not "\" as the directory separator.
sep     Field separator, e.g. sep="\t" (tab), sep=",".
header  If your file has column headings in the first row, use
        header=TRUE.
fill    If empty fields at the end of lines are not present, specify
        fill=TRUE.
others  To control the way R interprets your data, use options such as
        na.strings, comment.char, quote, as.is, colClasses and
        stringsAsFactors (note the capitalisation: R is case-sensitive).
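
As an illustration of these options, here is a minimal sketch that writes a small tab-separated file to a temporary location and reads it back. The file contents, the column names and the use of "-" as a missing-value marker are invented for the example.

```r
# Create a small tab-separated file; the last line has a missing final field.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tweight\tfeed",
             "1\t179\thorsebean",
             "2\t-\tlinseed",
             "3\t309"),
           tmp)

mydata <- read.table(tmp,
                     sep = "\t",                # fields separated by tabs
                     header = TRUE,             # first row holds column names
                     fill = TRUE,               # pad short lines with blank fields
                     na.strings = "-",          # treat "-" as missing
                     stringsAsFactors = FALSE)  # keep text as character
str(mydata)
unlink(tmp)                                     # remove the temporary file
```

Here str() is used to check that weight has been read as a number and feed as character.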


Variants of read.table()

These variants can be simpler to use for very common formats:


read.csv     calls read.table with defaults suitable for reading
             comma-separated files, where the decimal point is ".".

read.csv2    calls read.table with defaults suitable for reading
             semicolon-separated files, where the decimal point is ",".

read.delim   calls read.table with defaults suitable for reading
             tab-separated files, where the decimal point is ".".

read.delim2  calls read.table with defaults suitable for reading
             tab-separated files, where the decimal point is ",".
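
The difference between the variants can be seen in a small sketch. The same invented data is supplied in both conventions through the text argument of read.table(), so that no file is needed.

```r
# The same two records in the two conventions.
csv_text  <- "x,y\n1.5,2\n2.5,4\n"   # comma separator, decimal point "."
csv2_text <- "x;y\n1,5;2\n2,5;4\n"   # semicolon separator, decimal point ","

a <- read.csv(text = csv_text)
b <- read.csv2(text = csv2_text)

a$x   # numeric values read with "." as the decimal point
b$x   # the same values read with "," as the decimal point
```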


Data read in by read.table()

Always check the column formats by viewing in the data editor or
using summary(). For example, are all the numeric columns
displayed as numbers?

In versions of R before 4.0.0, character columns are read in as
factors by default (from R 4.0.0 the default is
stringsAsFactors=FALSE). Change if necessary, e.g. by
    mydata$col1 <- as.character(mydata$col1)
or use the as.is option.

Factors which are coded as numbers will be read in as numeric
columns. Use factor() to correct these.
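
A short sketch of these checks and corrections follows. The data and column names are invented, and stringsAsFactors=TRUE is set explicitly so that the behaviour is the same in any version of R.

```r
mydata <- read.csv(text = "id,group,name\n1,1,ann\n2,2,bob\n3,1,cy\n",
                   stringsAsFactors = TRUE)
sapply(mydata, class)                      # group is integer, name is a factor

mydata$name  <- as.character(mydata$name)  # factor -> character
mydata$group <- factor(mydata$group)       # numeric codes -> factor
sapply(mydata, class)                      # check the result
```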


Saving your data out of R

write.table() writes a data frame to a delimited text file.
Some options:

file    name of file
sep     desired field separator
others  control output of column and row names, treatment of
        missing data etc.

There are variants write.csv() and write.csv2() as for input.

To write a matrix rather than a data frame, use write() or
write.matrix() (the latter from package MASS).
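
A minimal round trip, writing an invented data frame to a temporary file and reading it back, might look like this. The row.names, quote and na arguments are examples of the options that control row names and missing data.

```r
mydata <- data.frame(id = 1:3, weight = c(179.5, NA, 309.25))

tmp <- tempfile(fileext = ".txt")
write.table(mydata, file = tmp, sep = "\t",
            row.names = FALSE,   # do not write the row names
            quote = FALSE,       # no quotes around strings
            na = "NA")           # how missing values are written

back <- read.table(tmp, sep = "\t", header = TRUE)
unlink(tmp)
back
```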



Matrices, arrays and indexing

Matrices

A matrix is a two dimensional array of objects, all of the same type.

To create a matrix, use the function matrix():
    mymat <- matrix(1:12, 3, 4)
gives
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12

Entries go down columns unless you specify byrow=TRUE.

dim(mymat) gives the dimensions 3 4
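
To see the effect of byrow, the following sketch builds the same twelve numbers both ways. The object names bycol and byrow are chosen for the example.

```r
bycol <- matrix(1:12, nrow = 3, ncol = 4)                # fills down columns
byrow <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)  # fills along rows

bycol[1, ]   # 1 4 7 10
byrow[1, ]   # 1 2 3 4
dim(byrow)   # 3 4
```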


Arrays

Arrays are like matrices but have more than 2 dimensions.

You can create them using the function array() or by assigning
3 or more dimensions to a vector or matrix:
    myarr <- mymat
    dim(myarr) <- c(3, 2, 2)
    myarr[, , 1]
gives
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6


Indexing

Indices can be used to select part of a vector, matrix or array, in
order to:
    extract these entries;
    make assignments to the selected part.

The indexing operation is indicated by square brackets: [,]

To select one element of an array with 4 dimensions, say, simply use
4 subscripts, e.g.
    myarray[2, 3, 1, 4]
(just like algebra!)

Indices can have several different forms


Indexing: types of indices

Suppose
    x <- c(2, 4, 6, 8, 10, 12)
    names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")

A vector of positive integers, in any desired order, indicating
elements to select, e.g.
    x[c(1, 3, 6, 5)] will give 2, 6, 12, 10

A logical vector. This should be of the same length as the vector
(a shorter one is recycled). Values corresponding to the entries TRUE
will be selected, e.g.
    use <- c(rep(TRUE, 3), FALSE, TRUE, FALSE)
    x[use] will give 2, 4, 6, 10

A vector of negative integers, indicating elements to exclude, e.g.
    x[-c(1:3)] will give 8, 10, 12


Indexing: character and empty indices

Suppose
    x <- c(2, 4, 6, 8, 10, 12)
    names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")

A vector of character strings. Only applicable if the vector has
names.
    x[c("E1", "E3")] will give 2, 6

Empty. Select all. Useful if assigning to an object as all values will
be replaced but all other aspects of the object will be unchanged.
    x[] <- 0
names(x) will be the same as before.
Compare with
    x <- 0
after which x will be the single number 0, with no names.


Indexing (more details)

Recycling will be used if the sub-vector selected for replacement is
longer than the right-hand side.
    x[c(1, 3)] <- 4.5

Replacing to an index greater than the length of the vector
extends it, filling in with NAs.
    x[10] <- 8

Extracting with an index outside the range of the vector length
returns NA.
    x[11]
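
Putting these three rules together on the vector x used on the earlier slides gives the following sketch.

```r
x <- c(2, 4, 6, 8, 10, 12)
names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")

x[c(1, 3)] <- 4.5   # 4.5 is recycled to fill both selected positions
x[10] <- 8          # x now has length 10; elements 7 to 9 are NA
length(x)
x[7:9]              # the NAs filled in by the extension
x[11]               # beyond the end of the vector: NA
```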


Indexing matrices and arrays

You can select using one vector for each dimension, e.g.
    mymat[1:2, -2]

or use a logical matrix as a single subscript, e.g.
    mymat[mymat > 1] <- NA        (note: no comma)

or use a matrix of subscripts, with one row per element to select and
one column per dimension, e.g.
    mymat[cbind(rep(1, 3), c(2, 3, 4))] <- NA

If the result has length 1 in any dimension, that dimension is dropped
unless you use the argument drop=FALSE:
    mymat[1:2, 1] is a vector, but
    mymat[1:2, 1, drop=FALSE] is a matrix with 2 rows and one column.

Forgetting to use drop=FALSE is a common error.
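
The different forms of matrix subscript, and the effect of drop, can be tried out as in this sketch. It uses the matrix mymat from the earlier slide; idx, v and m are names chosen for the example.

```r
mymat <- matrix(1:12, 3, 4)

mymat[1:2, -2]                       # rows 1 and 2, every column except the 2nd

idx <- cbind(rep(1, 3), c(2, 3, 4))  # (row, column) pairs: (1,2), (1,3), (1,4)
mymat[idx]                           # 4 7 10

v <- mymat[1:2, 1]                   # dimension dropped: a plain vector
m <- mymat[1:2, 1, drop = FALSE]     # dimension kept: a 2 x 1 matrix
dim(v)                               # NULL
dim(m)                               # 2 1
```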



Indexing data frames

Data frames can be indexed like matrices, but only drop
dimensions if you select from a single column, not if you select
from a single row.

If you select rows from a data frame with only one column the
result will be a vector unless you use drop=FALSE.

You will often want to select the rows of a data frame which meet
some criterion. Use logical indexing:
    mydata[mydata$weight > 400, ]
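
A small sketch of these rules, using an invented data frame with a weight column:

```r
mydata <- data.frame(id = 1:4, weight = c(380, 410, 450, 395))

heavy <- mydata[mydata$weight > 400, ]   # rows meeting the criterion
heavy

one_row <- mydata[1, ]                   # a single row: still a data frame
one_col <- mydata[, "weight"]            # a single column: dropped to a vector
kept    <- mydata[, "weight", drop = FALSE]  # a one-column data frame
```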



Writing simple functions

Function Examples

Two simple examples:

A function to calculate the standard deviation of a vector:
    std.dev <- function(x) sqrt(var(x))

And one to calculate the two-tailed p-value of a t-test:
    t.test.p <- function(x, mu=0)
    {
        n <- length(x)
        t <- sqrt(n) * (mean(x) - mu) / std.dev(x)
        2 * (1 - pt(abs(t), n-1))
    }


Function Details
Note:

Arguments may have default values. To call functions, either keep the
arguments in the same order
    t.test.p(myx, 1)
or use names for each argument
    t.test.p(mu=1, x=myx)
The two can be mixed, with ordered arguments first and named ones at
the end of the list.

To use more than one statement in the function, use braces {} to define
a block.

The value of the last expression evaluated will be returned.

To return more than one item, create a list using list() or a vector
using c().
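
As a sketch of returning more than one item, the t-test example can be adapted to return a list. The function name ttest_summary, the component names and the data in myx are invented for the illustration.

```r
std.dev <- function(x) sqrt(var(x))

# Hypothetical variant of t.test.p which returns several items in a list.
ttest_summary <- function(x, mu = 0)
{
    n <- length(x)
    t <- sqrt(n) * (mean(x) - mu) / std.dev(x)
    list(statistic = t, df = n - 1, p.value = 2 * (1 - pt(abs(t), n - 1)))
}

myx <- c(1.2, 0.8, 1.9, 1.4, 0.6, 1.1)
res1 <- ttest_summary(myx, 1)          # arguments by position
res2 <- ttest_summary(mu = 1, x = myx) # arguments by name, in any order
identical(res1, res2)
res1$p.value                           # extract one component of the list
```

The p-value can be compared with the one given by the built-in t.test(myx, mu = 1).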

Flow Control
Our t.test.p performed 3 commands in a sequence.
Often we need to make a decision or execute a loop.
    myfn <- function(n=100)
    {
        tmp <- rep(NA, 3)
        tmp[1] <- mean(runif(n))
        tmp[2] <- mean(runif(n))
        tmp[3] <- mean(runif(n))
        mean(tmp[tmp > .2])
    }

Could adapt this to do any number of repeats, specified by an
argument.
Could decide to store in tmp only those values greater than 0.2.

Flow Control : If

Decisions are controlled by the if statement:
    if (condition) true.branch else false.branch
The else part is optional.
true.branch or false.branch can be compound statements enclosed in {}.
condition should have length 1. If it has a vector value, older
versions of R use only the first element and issue a warning; from
R 4.2.0 this is an error.
If condition is compound, e.g. A || B, then if A is TRUE, B is not
evaluated.
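
The following sketch shows an if statement with compound branches and a compound condition. The function safe_sign is invented for the illustration. Because || stops as soon as its left-hand side is TRUE, the test on x[1] is never evaluated when x is NULL.

```r
# Hypothetical helper: classify the first element of x.
safe_sign <- function(x)
{
    if (is.null(x) || is.na(x[1])) {
        "unknown"
    } else if (x[1] > 0) {
        "positive"
    } else {
        "not positive"
    }
}

safe_sign(NULL)   # "unknown": the second test is skipped
safe_sign(NA)     # "unknown"
safe_sign(3)      # "positive"
safe_sign(-1)     # "not positive"
```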


Example

    myfna <- function(n=100)
    {
        tmp <- rep(NA, 3)
        x <- mean(runif(n))
        if (x > 0.2)
            tmp[1] <- x
        x <- mean(runif(n))
        if (x > 0.2)
            tmp[2] <- x
        x <- mean(runif(n))
        if (x > 0.2)
            tmp[3] <- x
        mean(tmp, na.rm=TRUE)
    }


Flow Control: For

Loops are controlled by the statements for, while and repeat.

The for statement allows a statement to be iterated as a variable
takes values in a sequence:
    for(variable in sequence) statement
e.g.
    for(i in 1:10) print(x[i])

The statement can be compound, in which case it must be enclosed
in {}.


Flow Control: While, Repeat

while and repeat statements do not use loop variables.

    while (condition) statement
The while statement repeats statement until condition is FALSE.

    repeat statement
The repeat statement repeats statement until flow is transferred out
using the break statement.

In a loop, the command next will transfer control to the beginning
of the next iteration, break will transfer control to the statement
following the end of the loop.
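
A small sketch of next and break inside a for loop, followed by a while loop. The variable names total and n are chosen for the example.

```r
total <- 0
for (i in 1:10) {
    if (i %% 2 == 0) next   # even i: skip to the next iteration
    if (i > 7) break        # leave the loop at the first odd i above 7
    total <- total + i
}
total                       # 1 + 3 + 5 + 7 = 16

n <- 0
while (2^n < 1000) n <- n + 1   # repeats until the condition is FALSE
n                               # 10, the smallest n with 2^n >= 1000
```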


Example functions

    myfn1 <- function(obs=10, n=100)
    {
        x <- rep(NA, n)
        for (i in 1:n)
        {
            tmp <- runif(obs)
            x[i] <- mean(tmp)
        }
        c(mn=mean(x), std=sd(x))
    }


Example functions

    myfn2 <- function(obs=10)
    {
        x <- runif(obs)
        while(mean(x) < 0.45)
        {
            obs <- 2 * obs
            x <- runif(obs)
        }
        list(mn=mean(x), std=sd(x), obs=obs)
    }


Example functions

    myfn3 <- function(obs=10)
    {
        repeat
        {
            x <- runif(obs)
            if (mean(x) >= 0.45)
                break
            obs <- 2*obs
        }
        list(mn=mean(x), std=sd(x), obs=obs)
    }


Braces, square brackets, parentheses

Just in case you are confused by the notation we have introduced, let
us recap:

brace            {, }   Used to create blocks of statements
square bracket   [, ]   Used to subscript
parenthesis      (, )   Used to enclose function arguments



More data manipulation

ifelse function

R has many functions which reduce the need for you to write loops.
This can be both easier and more efficient. One is ifelse.
Suppose x <- c(0, 1, 1, 2) and y <- c(44, 45, 56, 77).
Replace:
    z <- rep(NA, 4)
    for (i in 1:length(x))
    {
        if (x[i] > 0)
            z[i] <- y[i] / x[i]
        else
            z[i] <- y[i] / 99
    }
by
    z <- ifelse(x > 0, y / x, y / 99)


apply functions
A group of functions useful for avoiding loops, e.g.
    lapply, sapply, apply, tapply, mapply

lapply and sapply are used to iterate along a list or a vector.
    lapply(mylist, length)
will return a list whose components are the lengths of the components
of mylist.
    sapply(mylist, length)
will return a vector whose elements are the lengths of the components
of mylist.

apply and tapply operate in a similar way on arrays or parts of
arrays or vectors.

mapply operates on corresponding elements of multiple lists or
vectors.
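
The difference between lapply and sapply can be seen on a small invented list:

```r
mylist <- list(a = 1:3, b = letters[1:5], c = numeric(0))

as_list   <- lapply(mylist, length)   # a list with components a, b, c
as_vector <- sapply(mylist, length)   # a named integer vector

as_list$b      # 5
as_vector      # a = 3, b = 5, c = 0
```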

apply

apply and tapply have extra parameters, to indicate which part
of the array or vector to use. For apply it is the dimension(s) of
the array over which to iterate, e.g.
    apply(mymat, 2, sum, na.rm=TRUE)
will produce the sum of each column of the matrix, after missing
values have been removed.

Note how to pass extra arguments to the function to be applied.

Can be used on data frames, but will turn them into matrices first.
If there are factor variables, all the variables will end up as
character.
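
A sketch with a small invented matrix containing one missing value, iterating over columns (dimension 2) and then over rows (dimension 1):

```r
mymat <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)   # columns (1, NA), (3, 4), (5, 6)

col_sums <- apply(mymat, 2, sum, na.rm = TRUE)    # 1 7 11
row_sums <- apply(mymat, 1, sum, na.rm = TRUE)    # 9 10
col_sums
row_sums
```

The extra argument na.rm = TRUE is passed on by apply to sum.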


tapply

tapply selects parts of a vector using factors, e.g.
    tapply(myvector, myfactor, mean)
will calculate the mean of values of myvector after splitting them
into groups based on the different values of myfactor.

myfactor should be a vector the same length as myvector.

The function tapply can be used on data frames by using the
by function. (?by for further details.)

The function split can be used to partition a vector or data
frame based on the values of a factor.
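
A minimal sketch of tapply and split, with invented values for myvector and myfactor:

```r
myvector <- c(10, 20, 30, 40, 50, 60)
myfactor <- factor(c("a", "b", "a", "b", "a", "c"))

group_means <- tapply(myvector, myfactor, mean)
group_means                        # a = 30, b = 30, c = 60

groups <- split(myvector, myfactor)
groups$a                           # 10 30 50
```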


mapply, etc

mapply(sum, x, y) will return a vector containing the element-wise
sums x + y.

    mapply(function(x, y) hist(x$weight, main=y),
           mylist, names(mylist))
will produce a histogram for each subset of the data in mylist (a named
list of data frames, one per feed), with the name of the corresponding
feed as the title.

There are other extensions of lapply. Worth looking for (via the
help pages) if you seem to need to write something similar.

Package parallel contains parallel versions of some of these
functions.
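
A sketch of mapply follows. It assumes that mylist was made by splitting the built-in chickwts data set by feed, which the slide does not state, and it builds a text label for each subset rather than drawing a histogram.

```r
x <- 1:3
y <- 4:6
mapply(sum, x, y)    # 5 7 9, the same as x + y

# Assumption: mylist is the chickwts data split by feed.
mylist <- split(chickwts, chickwts$feed)
labels <- mapply(function(x, y) paste(y, round(mean(x$weight), 1)),
                 mylist, names(mylist))
labels               # one label per feed: its name and mean weight
```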


Manipulating data

cbind(...) will join its arguments together by columns, e.g.
    cbind(c(1, 2, 3), c(4, 5, 6))
gives
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
and rbind by row:
    rbind(c(1, 2, 3), c(4, 5, 6))
gives
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
When used on vectors, both return matrices.


Manipulating data

cbind and rbind can be used on data frames. In this case:

For cbind, the resulting data frame may have duplicate column
names.

For rbind the column names must match, although they need
not be in the same order.

You cannot create a data frame using cbind with vectors. To do
this, use data.frame().
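
These points can be checked with two small invented data frames whose columns are in different orders:

```r
a <- data.frame(id = 1:2, w = c(10, 20))
b <- data.frame(w = c(30, 40), id = 3:4)   # same columns, different order

both <- rbind(a, b)            # columns are matched by name
both

m  <- cbind(1:3, 4:6)                   # vectors only: a matrix
df <- data.frame(x = 1:3, y = 4:6)      # use data.frame() for a data frame
```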


Merge

The merge function joins together two data frames.

It joins together rows which have common values in specified
columns, producing a new row which contains all the information
from both data frames.

There are many possibilities. Use ?merge for full details.
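
Two of the possibilities are sketched below with invented data frames sharing an id column. By default only rows with matching ids are kept, whereas all=TRUE keeps every row and fills the gaps with NA.

```r
weights <- data.frame(id = c(1, 2, 3), weight = c(179, 160, 309))
feeds   <- data.frame(id = c(2, 3, 4),
                      feed = c("linseed", "soybean", "casein"))

inner <- merge(weights, feeds, by = "id")              # ids 2 and 3 only
outer <- merge(weights, feeds, by = "id", all = TRUE)  # ids 1 to 4, with NAs
inner
outer
```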


Matrix Algebra

Matrices of the same size can be added, subtracted, and
element-wise multiplied by each other using +, -, *.

For matrix multiplication, use %*% on matrices of matching
dimensions.

Inverses of square matrices can be found using the solve
command.

eigen computes eigenvalues and eigenvectors of matrices.

colSums and colMeans return the column sums or means of a
matrix, rowSums and rowMeans the sums or means of the rows.
These can also be used on arrays.
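
A short sketch of these operations on an invented 2 x 2 matrix:

```r
A <- matrix(c(2, 1, 1, 3), 2, 2)   # a symmetric 2 x 2 matrix
B <- diag(2)                       # the 2 x 2 identity matrix

A + B                  # element-wise addition
A * A                  # element-wise multiplication
A %*% A                # matrix multiplication

Ainv <- solve(A)       # inverse of A
A %*% Ainv             # the identity, up to rounding error

eigen(A)$values        # eigenvalues
colSums(A)             # 3 4
rowMeans(A)            # 1.5 2
```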



Exercises

Exercises 3

Generate a matrix with 10 rows and 5 columns, with random
entries between 0 and 10. (Hint: look at runif)

Write a function using for to calculate the column means of the
matrix.

Extract the even rows from the matrix.

Using the data frame you created in Exercises 2, select the rows
for which the date is after 1st June 2007.
