Professional Documents
Culture Documents
Ronald Geskus
based on materials made in collaboration with Perry Moerland
Oxford University Clinical Research Unit
Hospital for Tropical Diseases, Ho Chi Minh City, Viet Nam
Introduction Basics Data import Functions etc.
Part I
Day 1
Introduction Basics Data import Functions etc.
Outline
Introduction
Basics
Data structures
Data import
Course setup
R: What is it?
• On http://www.r-project.org/about.html: “a language
and environment for statistical computing and graphics”
• Free statistical package: no money and open source
• Runs on all major operating systems
Introduction Basics Data import Functions etc.
R: What is it?
• On http://www.r-project.org/about.html: “a language
and environment for statistical computing and graphics”
• Free statistical package: no money and open source
• Runs on all major operating systems
• Standard distribution with basic statistical procedures
• Extensions via packages
• Recommended; come installed together with R
• Thousands more; can be installed from the R website
• Hard to learn(?)
• Very powerful language; exponential growth in popularity
Introduction Basics Data import Functions etc.
Outline
Introduction
Basics
Data structures
Data import
R as a pocket calculator
> sqrt(2)
[1] 1.414214
> cos(pi)
[1] -1
> log10(10^3)
[1] 3
Introduction Basics Data import Functions etc.
A simple R session
y < - x
Introduction Basics Data import Functions etc.
y < - x
y <- x
Introduction Basics Data import Functions etc.
y < - x
y <- x
Functions
Packages
Help (I)
> help(mean)
Help (I)
> help(mean)
Help (II)
Help (I)
> help(mean)
Vectors (I)
Vectors (I)
Vectors (II)
Vectors (III)
> 2*x
[1] 20 18 16 24 12 10 8 6 4 2
Introduction Basics Data import Functions etc.
Matrices (I)
Matrices (II)
Matrices (III)
Help (I)
> help(mean)
Objects (I)
Help (I)
> help(mean)
Modes
• R has several modes, the most important ones are:
• numeric:
> c(1, 2, 3, 4)
• logical: Boolean values: TRUE, FALSE
> -2 < 2
[1] TRUE
• character :
> letters[1:3]
[1] "a" "b" "c"
• You can change the mode of an object
> as.character(x)
[1] "10" "9" "8" "12" "6" "5" "4" "3" "2"
[10] "1"
• Modes cannot be mixed in vectors or matrices, but they can
in lists (see later)
Introduction Basics Data import Functions etc.
Naming (I)
Naming (II)
RStudio
Lists (I)
$number
[1] 5
Lists (II)
> titanic[c(2,3),c("name","age")]
name age
2 Hudson Trevor 0.9167
3 Helen Loraine 2.0000
Outline
Introduction
Basics
Data structures
Data import
> dim(titanic3)
[1] 1309 17
> head(titanic3[,1:4])
pclass survived name sex
1 1st 1 Allen, Miss. Elisabeth Walton female
2 1st 1 Allison, Master. Hudson Trevor male
3 1st 0 Allison, Miss. Helen Loraine female
4 1st 0 Allison, Mr. Hudson Joshua Crei male
5 1st 0 Allison, Mrs. Hudson J C (Bessi female
6 1st 1 Anderson, Mr. Harry male
> str(titanic3[,1:4])
'data.frame': 1309 obs. of 4 variables:
$ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ survived: num 1 1 0 0 0 1 1 0 1 0 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hud-
son Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Cre
$ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
Recapitulation: objects
Outline
Introduction
Basics
Data structures
Data import
Missing data
Missing data
Dates
Part II
Day 2
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Outline
Graphics
Base graphics
ggplot2 graphics
Internal and external communication
The structure of R
Export to other formats
Data manipulation and inspection
Documentation and help
Model fitting; formulas
Programming and ply functions
Programming constructs
The apply family
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
●
80
●
60
●
x^3 − 13 * x^2 + 39 * x
●
40
●
●●●●●●●●●
●● ●● ●
●● ●●
● ●●
● ●
● ●
● ●
● ●
● ●
20
● ●
● ●
● ●
●
● ● ●
●
● ● ●
●
● ● ●
●
● ● ●
●
● ●
0
●
●
● ●
● ●
●
● ●
● ●
● ●
●
●● ●
●● ●
●●
−20
●●
●●● ●●
●●●●●●●
0 2 4 6 8 10
x
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
●
80
●
x^3 − 13 * x^2 + 39 * x
60
●
40
●
●●●●●●●●●
●● ●● ●
●● ●●
● ●●
● ●
● ●
● ●
● ●
20
● ●
● ●
● ●
● ●
●
● ● ●
●
● ● ●
●
● ● ●
●
● ● ●
●
0
● ● ●
●
● ●
● ●
●
● ●
● ●
● ●
●
●● ●
●● ●
−20
●● ●●
●●● ●●
●●●●●●●
0 2 4 6 8 10
x
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Enhanced plot
80
60
temperature
40
20
0
−20
0 2 4 6 8 10
time (hours)
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Enhanced plot
80
60
temperature
40
20
0
−20
0 2 4 6 8 10
time (hours)
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Plot symbols
There are 25 different plot symbols, see ?points
> plot(1:25, pch=1:25, cex=2, bg="grey")
# bg: background colors for open plot symbols
25
●
20
●
●
●
15
1:25
●
10
5
5 10 15 20 25
Index
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Sahara
Gobi
60
local maximum
temperature
40
●
20
0
−20
0 2 4 6 8 10
time (hours)
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
You can change the default value of many graphical parameters via
par (see help(par)). For example to reset the background of a
plot to green:
80
Sahara
> par(bg="green") Gobi
60
local maximum
temperature
40
●
20
0
and then rerun the plot commands
−20
0 2 4 6 8 10
time (hours)
80
> x<-(0:100)/10 60
> plot(x,x^3-13*x^2+39*x,type="l",
temperature
40
lwd=3,las=1,cex.axis=1.5,cex.lab=1.5) 0
−20
0 2 4 6 8 10
time (hours)
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Histograms
0.03
> hist(titanic3$age,breaks=15,freq=FALSE,
Density
0.02
cex.axis=1.5,cex.lab=1.5)
0.01
0.00
0 20 40 60 80
titanic3$age
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Boxplot
The function boxplot can be used on a vector
300
●
250
●
●
●
> boxplot(titanic3$fare, ●
200
●
fare
150
ylim=c(0,300),ylab="fare", ●
●
●
●
●
●
●
●
100
●
●
●
cex.axis=1.5,cex.lab=1.5) ●
●
●
●
●
●
●
●
●
●
50
0
boxplot also allows for a formula specification
300
●
250
●
200
fare
150
data=titanic3,ylim=c(0,300),ylab="fare",
100
cex.axis=1.5,cex.lab=1.5) ●
●
●
50
●
●
●
●
●
0
1st 2nd 3rd
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Grammar of Graphics
See http://ggplot2.tidyverse.org/reference
Graph mapping from data to aesthetic attributes of geometric
objects
ggplot( dataset, aes( xvar, yvar, other vars)) +
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Grammar of Graphics
See http://ggplot2.tidyverse.org/reference
Graph mapping from data to aesthetic attributes of geometric
objects
ggplot( dataset, aes( xvar, yvar, other vars)) +
Geometric Object what you see on the plot (points, lines, bars)
geom_xxx( )
At xxx: point, line, histogram, boxplot, etc.
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Grammar of Graphics
See http://ggplot2.tidyverse.org/reference
Graph mapping from data to aesthetic attributes of geometric
objects
ggplot( dataset, aes( xvar, yvar, other vars)) +
Geometric Object what you see on the plot (points, lines, bars)
geom_xxx( )
At xxx: point, line, histogram, boxplot, etc.
Scales Role of variables in plot.
Map data values to aesthetics, like colour, size, shape, location
scale_._xxx( )
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Grammar of Graphics
See http://ggplot2.tidyverse.org/reference
Graph mapping from data to aesthetic attributes of geometric
objects
ggplot( dataset, aes( xvar, yvar, other vars)) +
Geometric Object what you see on the plot (points, lines, bars)
geom_xxx( )
At xxx: point, line, histogram, boxplot, etc.
Scales Role of variables in plot.
Map data values to aesthetics, like colour, size, shape, location
scale_._xxx( )
Faceting separate panels (subfigures)
facet_wrap( )
facet_grid( )
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Outline
Graphics
Base graphics
ggplot2 graphics
Internal and external communication
The structure of R
Export to other formats
Data manipulation and inspection
Documentation and help
Model fitting; formulas
Programming and ply functions
Programming constructs
The apply family
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Structure of R
Structure of R
Structure of R
Structure of R
R OS
objects files
Workspace current folder
environments folders in “path” variable
RStudio Environment pane Explorer window
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Project management
Project management
Project management
Reproducible research
Reproducible research
Outline
Graphics
Base graphics
ggplot2 graphics
Internal and external communication
The structure of R
Export to other formats
Data manipulation and inspection
Documentation and help
Model fitting; formulas
Programming and ply functions
Programming constructs
The apply family
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Outline
Graphics
Base graphics
ggplot2 graphics
Internal and external communication
The structure of R
Export to other formats
Data manipulation and inspection
Documentation and help
Model fitting; formulas
Programming and ply functions
Programming constructs
The apply family
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Finding Information
Finding Information
Finding Information
Outline
Graphics
Base graphics
ggplot2 graphics
Internal and external communication
The structure of R
Export to other formats
Data manipulation and inspection
Documentation and help
Model fitting; formulas
Programming and ply functions
Programming constructs
The apply family
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Regression; Formulas
Regression; Formulas
Regression; Formulas
Model output
Model output
Model output
Formula structure
Outline
Graphics
Base graphics
ggplot2 graphics
Internal and external communication
The structure of R
Export to other formats
Data manipulation and inspection
Documentation and help
Model fitting; formulas
Programming and ply functions
Programming constructs
The apply family
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Statements: if-then-else
Imagine that you have to repeat the same analysis for many files
that are all in the same folder on your computer. A short solution
using an iterative construct would be
> files <- dir()
> for (filename in files){
infile <- read.table(filename, ...)
do something with infile
}
Graphics Communication Data management Help Model fitting; formulas Programming and ply fun
Apply
tapply: example
THANKS!