Arnaud Legrand
Outline
Reproducible Research
  Looks familiar?
  Many Different Alternatives
R
  General Introduction
  Reproducible Documents: knitR
  Introduction to R
  Needful Packages by Hadley Wickham
Looks familiar?
As a Reviewer
As an Author
I thought I used the same parameters but I'm getting different results!
The new student wants to compare with the method I proposed last year.
The damn reviewer asked for a major revision and wants me to change this figure. :(
Which code and which data set did I use to generate this figure?
It worked yesterday!
Why did I do that?
My Feeling
Computer scientists have incredibly poor training in probability and statistics.
Why should we? Computers are deterministic machines after all, right? ;)
Eight years ago, I started realizing how lame the articles I reviewed (as well as those I wrote) were in terms of experimental methodology.
Yeah, I know, your method/algorithm is better than the others, as demonstrated by the figures
Not enough information to discriminate real effects from noise
Little information about the workload
Would the conclusion still hold with a slightly different workload?
I'm tired of awful combinations of tools (perl, gnuplot, sql, ...) and bad methodology
Current practice in CS
Computer scientists tend to either:
vary one factor at a time, using a very fine sampling of the parameter range, or
run millions of experiments for a week, varying a lot of parameters, and then try to get something out of it. Most of the time, they (1) don't know how to analyze the results and (2) realize something went wrong...
Interestingly, most other scientists do the exact opposite.
These two flaws come from poor training and from the fact that C.S. experiments are almost free and very fast to conduct.
Most strategies of experimentation have been designed to:
provide sound answers despite all the randomness and uncontrollable factors;
maximize the amount of information provided by a given set of experiments;
reduce as much as possible the number of experiments to perform to answer a given question under a given level of confidence.
It takes a few lectures on Design of Experiments to improve, but anyone can start by reading Jain's book, The Art of Computer Systems Performance Analysis.
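As a rough illustration of an alternative to varying one factor at a time, the sketch below builds a small full factorial design in base R, replicates it, and randomizes the run order. The factor names (threads, msg_size, scheduler), their levels, and the number of replicates are made up for the example; they do not come from any experiment discussed here.

```r
# Hypothetical factors and levels, chosen only for illustration.
factors <- list(
  threads   = c(1, 4, 16),
  msg_size  = c(1e3, 1e6),
  scheduler = c("fifo", "fair")
)

# Full factorial design: every combination of levels appears once.
design <- expand.grid(factors, stringsAsFactors = FALSE)

# Replicate each configuration so that noise can be told apart from real effects.
replicates <- 3
design <- design[rep(seq_len(nrow(design)), times = replicates), ]

# Randomize the run order so that drift over time (caches, temperature,
# background load) is not confounded with any factor.
set.seed(42)
design <- design[sample(nrow(design)), ]
rownames(design) <- NULL

head(design)
```

Each row is one run to perform, in that order. Fixing the seed keeps the randomized order itself reproducible.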
[Figure: Obtain Data (particle colliders, Web, databases, sequencing machines), then Analyze/Visualize, then Publish/Share. Source: Juliana Freire, Reproducible Research 11, UBC, Vancouver]
[Figure: Obtain Data (particle colliders, Web, databases, sequencing machines), Analyze/Visualize, Publish/Share, annotated with the tools AVS, VisTrails and Taverna. Source: Juliana Freire, Reproducible Research 11, UBC, Vancouver]
Replicability
Reproducibility
Reproduction using different software, but with access to the original code
Completely independent reproduction based only on text description, without access to the original code
A Difficult Trade-off
Many Different Alternatives
The ALPS
[Figure: excerpt from a paper (corresponding author: troyer@comp-phys.org), including its Figure 3 ("In this example ...") and the parameter sets {T=0.6;}, {T=0.7;}, {T=0.75;}, {T=0.8;}]
Courtesy of Matan Gavish and David Donoho (AMP Workshop on Reproducible research)
LaTeX source
\includegraphics{figure1.eps}
[Figure: flowchart with the steps: create new record; find dependencies; get platform information; code change? (if yes, apply the policy: "diff" stores the diff, "error" raises an exception); run simulation/analysis; record time taken; find new files; add tags]
Dissemination Platforms:
ResearchCompendia.org
MLOSS.org
Open Science Framework
IPOL
Madagascar
thedatahub.org
nanoHUB.org
The DataVerse Network
RunMyCode.org
Embedded Publishing:
Kepler
GenePattern
Taverna
CDE
Synapse
Pegasus
knitR
Literate programming
Donald Knuth: explanation of the program logic in a natural language, interspersed with snippets of macros and traditional source code.
I'm way too 3l33t to program this way, but that's exactly what we need for writing a reproducible article/analysis!
Org-mode (requires emacs)
My favorite tool.
plain text, very smooth, works for both html and pdf, ...
lets me combine all my favorite languages
IPython notebook
If you are a python user, go for it! Web app, easy to use/set up...
KnitR (a.k.a. Sweave)
General Introduction
Why R?
https://www.coursera.org/course/compdata
https://www.coursera.org/course/repdata
apt-cache search r
Err, that's not very useful :) It's the same when searching on Google, but once the filter bubble is set up, it gets better...
sudo apt-get install r-base
R has its own package management mechanism, so just run R and type the following commands:
ddply (from the plyr package), reshape and ggplot2
install.packages("plyr")
install.packages("reshape")
install.packages("ggplot2")
knitR
install.packages("knitr")
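If you rerun a setup script often, reinstalling everything each time is wasteful. A minimal sketch that only reports what is missing: the helper name missing_packages is hypothetical (it is not part of R), and the actual install line is left commented out because it needs network access and a CRAN mirror.

```r
# Packages named in the install commands above.
wanted <- c("plyr", "reshape", "ggplot2", "knitr")

# Hypothetical helper: returns the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to install whatever is missing.
# if (length(to_install) > 0) install.packages(to_install)
```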
IDE
Reproducible Documents: knitR
[Figure: RStudio screenshot]
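In a Markdown document, R code chunks are enclosed between a ```{r} line and a ``` line
Inline R code: `r sin(2+2)`
Alt-Ctrl-c (in RStudio): run the current chunk
I usually work mostly with the current environment and only knit at the end
Other engines can be used (use RStudio completion)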
```{r engine='sh'}
ls /tmp/
```
Makes reproducible analysis as simple as one click
Great tool for quick analyses for yourself and colleagues, homework, ...
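To make the chunk syntax concrete, here is a minimal sketch that writes a tiny R Markdown document to a temporary file and knits it if knitr happens to be installed. The document content and the file names are made up for the example. The chunk fence is built with strrep() only to avoid nesting backticks inside this listing.

```r
# Three backticks, the chunk delimiter used in R Markdown.
fence <- strrep("`", 3)

# A tiny document: a title, one inline expression, one R chunk.
rmd_lines <- c(
  "# A tiny reproducible report",
  "",
  "The sine of 4 is `r sin(2+2)`.",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Knit only if knitr is available; the result is a plain Markdown file
# in which the chunk and the inline code are replaced by their output.
if (requireNamespace("knitr", quietly = TRUE)) {
  md_file <- knitr::knit(rmd_file, output = tempfile(fileext = ".md"), quiet = TRUE)
  cat(readLines(md_file), sep = "\n")
}
```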
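In a LaTeX document, R code chunks are enclosed between <<>>= and @
<<echo=FALSE>>=
library(knitr)
# Global chunk options: cache results, hide code, warnings and messages
opts_chunk$set(cache=TRUE,dpi=300,echo=FALSE,fig.width=7,
               warning=FALSE,message=FALSE)
@
Great for journal articles, theses, books, ...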
Introduction to R
Data frames
A data frame is a data table (with columns and rows). mtcars is a built-in data frame that we will use in what follows.
head(mtcars);
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
You can also load a data frame from a CSV file.
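A minimal sketch of a CSV round trip using base R. Since no CSV file is provided here, the example first writes mtcars to a temporary file; that file and the variable names are assumptions made for the illustration.

```r
# Write the built-in data frame to a temporary CSV file (row names included).
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file)

# Read it back; the first column holds the row names (the car models).
cars_from_csv <- read.csv(csv_file, row.names = 1)

head(cars_from_csv)
```

With your own data, replace csv_file by the path to your file and check the separator and header options of read.csv.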
names(mtcars);
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
str(mtcars);
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars);
[1] 32 11
length(mtcars);
[1] 11
summary(mtcars);
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 ...
[Figure: scatterplot matrix of the mtcars columns cyl, wt, qsec, disp and gear]
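The command that produced this scatterplot matrix is not shown, so the following is only a guess at an equivalent: a sketch using base R's pairs() on the five columns named in the figure. The column order and the output file name are assumptions.

```r
# Columns that appear in the figure; the order is a guess.
cols <- mtcars[, c("cyl", "disp", "wt", "qsec", "gear")]

# Save the scatterplot matrix to a PDF file so that the figure is
# regenerated from the data every time the script runs.
pairs_file <- file.path(tempdir(), "mtcars_pairs.pdf")
pdf(pairs_file, width = 7, height = 7)
pairs(cols)
dev.off()
```

In an interactive session, calling pairs(cols) alone displays the figure directly.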
Accessing Content
mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.
[31] 15.0 21.4
mtcars[2:5,]$mpg
mtcars[mtcars$mpg == 21.0,]
Mazda RX4
Mazda RX4 Wag
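The three access patterns above can be tried side by side. This is a small sketch on the built-in mtcars data; the variable names and the last filter (weight and mpg thresholds) are made up for the example.

```r
# Column access: `$` returns the column as a plain vector.
mpg <- mtcars$mpg

# Row slicing, then column access: mpg of rows 2 to 5.
mpg_2_to_5 <- mtcars[2:5, ]$mpg
print(mpg_2_to_5)

# Logical filtering: keep the rows whose mpg equals 21.0.
# Exact comparison works here because the value is stored as entered;
# for computed values, prefer a tolerance.
cars_21 <- mtcars[mtcars$mpg == 21.0, ]
print(rownames(cars_21))

# Conditions can be combined with & and |, and columns selected by name.
light_and_frugal <- mtcars[mtcars$wt < 2.5 & mtcars$mpg > 25, c("mpg", "wt")]
print(light_and_frugal)
```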
Extending Content
mtcars$cost = log(mtcars$hp)*atan(mtcars$disp)/
              sqrt(mtcars$gear**5);
mean(mtcars$cost);
[1] 0.345994
summary(mtcars$cost);
   Min. 1st Qu.  Median
 0.1261  0.2038  0.2353
hist(mtcars$cost,breaks=20);
[Figure: "Histogram of mtcars$cost", with Frequency on the y axis]
Needful Packages by Hadley Wickham
library(plyr)
# Group by (cyl, carb) and compute per-group counts, means and standard deviations
mtcars_summarized = ddply(mtcars,c("cyl","carb"), summarize,
    num = length(wt), wt_mean = mean(wt), wt_sd = sd(wt),
    qsec_mean = mean(qsec), qsec_sd = sd(qsec));
mtcars_summarized
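For readers without plyr at hand, the same split-apply-combine idea can be sketched with base R's aggregate() and merge(). This is a swapped-in technique, not the ddply call above; it only reproduces the group sizes and the two means, and the intermediate variable names are made up.

```r
# Per-group means of wt and qsec for each (cyl, carb) combination.
group_means <- aggregate(cbind(wt, qsec) ~ cyl + carb, data = mtcars, FUN = mean)

# Group sizes, computed the same way.
group_sizes <- aggregate(wt ~ cyl + carb, data = mtcars, FUN = length)
names(group_sizes)[3] <- "num"

# Combine both summaries into one table, one row per group.
summary_base <- merge(group_sizes, group_means, by = c("cyl", "carb"))
names(summary_base)[4:5] <- c("wt_mean", "qsec_mean")
summary_base
```

The ddply version is shorter as soon as several statistics are needed per group, which is the point of plyr.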
[Figure: plot of qsec against wt, with a legend for cyl]
[Figure: plot of qsec against wt, with a legend for factor(cyl)]
[Figure: plot of qsec against wt, with legends for factor(cyl) and factor(gear)]
[Figure: plot of qsec against wt, with legends for factor(cyl) and factor(gear)]
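The ggplot2 commands behind these figures are not shown. The sketch below is one plausible way to obtain a plot with the same axes and legends; the choice of geom_point() and the mapping of factor(cyl) to color and factor(gear) to shape are assumptions, as is the output file name. It runs only if ggplot2 is installed.

```r
# Levels that will appear in the two legends.
cyl_levels  <- levels(factor(mtcars$cyl))
gear_levels <- levels(factor(mtcars$gear))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # factor() makes cyl and gear discrete, so each level gets its own
  # color or shape instead of a continuous scale.
  p <- ggplot(mtcars, aes(x = wt, y = qsec,
                          color = factor(cyl), shape = factor(gear))) +
    geom_point()
  plot_file <- file.path(tempdir(), "qsec_vs_wt.pdf")
  ggsave(plot_file, p, width = 7, height = 5)
}
```

Replacing factor(cyl) by cyl in the aesthetics gives a continuous color scale, which is presumably the difference between the first two figures.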
[Figure: rating (y axis) against factor(year) (x axis), for the years 1893 to 2005]
[Figure: two-panel histogram of rating (x axis, 0 to 10), with count on the y axis]
https://www.coursera.org/course/compdata
https://www.coursera.org/course/repdata
pyglist/pygments