What are some valuable Statistical Analysis open source projects?

What are some valuable Statistical Analysis open source projects available right now?
Edit: as pointed out by Sharpie, valuable could mean helping you get things done faster or more cheaply.

[58 votes, 37 favorites; tags: software, open-source; community wiki; edited Feb 12 '11 at 5:50; 6 revs, 3 users; grokus]
5  Could be a poster child for argumentative and subjective. At the least, we need to define 'valuable'. – Sharpie Jul 19 '10 at 19:15
1  Define "valuable"... – Shane Jul 19 '10 at 19:20
2  Maybe the focus shouldn't be on "valuable" but rather the "pros" and "cons" of each project? – A Lion Jul 19 '10 at 19:44
   Or maybe even "How X will help you get Y done faster/cheaper and kill the germs that cause bad breath." – Sharpie Jul 19 '10 at 20:15

19 Answers
The R-project

http://www.r-project.org/

R is valuable and significant because it was the first widely accepted open-source alternative to big-box packages. Some reasons why it is useful and valuable: it is mature, well supported, and a standard within many scientific communities.

There are some nice tutorials here.

[81 votes, accepted; community wiki; edited Jul 19 '10 at 19:21; 2 revs, Jay Stevens]

11  Yes, R is nice – but WHY is it 'valuable'? – Sharpie Jul 19 '10 at 19:16
10  It's mature, well supported, and a standard within certain scientific communities (popular in our AI department, for example). – Menno Jul 19 '10 at 19:19
It's extensible and there's no statistical technique that can't be done in it. – aL3xa Jul 20 '10 at 1:22
For doing a variety of MCMC tasks in Python, there's PyMC, which I've gotten
quite a bit of use out of. I haven't run across anything that I can do in BUGS that
I can't do in PyMC, and the way you specify models and bring in data seems to
be a lot more intuitive to me.
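To give a flavour of that model-specification style, here is a minimal sketch assuming the PyMC 2.x API of that era; the data, priors, and variable names are made up for the example:

import numpy as np
import pymc

# Made-up data: 50 draws from a normal distribution.
data = np.random.normal(3.0, 1.0, size=50)

# Priors on the unknown mean and precision.
mu = pymc.Normal('mu', mu=0.0, tau=0.001)
tau = pymc.Gamma('tau', alpha=0.1, beta=0.1)

# Likelihood: the observed data, normal with unknown mean and precision.
y = pymc.Normal('y', mu=mu, tau=tau, value=data, observed=True)

# Sample from the posterior with MCMC and inspect the posterior mean of mu.
model = pymc.MCMC([mu, tau, y])
model.sample(iter=10000, burn=1000)
print(mu.stats()['mean'])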
[17 votes; community wiki; answered Jul 19 '10 at 19:26; Rich]
Two projects spring to mind:

1. Bugs - taking (some of) the pain out of Bayesian statistics. It allows the user to focus more on the model and a bit less on the MCMC.

2. Bioconductor - perhaps the most popular statistical tool in bioinformatics. I know it's an R repository, but there are a large number of people who want to learn R just for Bioconductor. The number of packages available for cutting-edge analysis makes it second to none.
[16 votes; community wiki; edited Jul 19 '10 at 20:43; 2 revs, csgillespie]

Andrew Gelman has a nice R library that links Bugs to R. – bshor Jul 19 '10 at 19:30
3  I'd rephrase that "the most popular statistical tool in bioinformatics"... Bioinformaticians doing microarray analysis use it extensively, yes. But bioinformatics is not limited to that ;) – Nicojo Jul 19 '10 at 20:25
@Nicojo - Very good point! – csgillespie Jul 19 '10 at 20:43
Incanter is a Clojure-based, R-like platform (environment + libraries) for statistical computing and graphics.

[13 votes; community wiki; answered Jul 19 '10 at 19:16; Alex Ott]

Again - why? How would I convince my boss to use this over, say, Excel? – Sharpie Jul 19 '10 at 19:18
3  If moving from Excel is the issue, you could try: coventry.ac.uk/ec/~nhunt/pottel.pdf, forecastingprinciples.com/files/McCullough.pdf, lomont.org/Math/Papers/2007/Excel2007/Excel2007Bug.pdf, csdassn.org/software_reports/gnumeric.pdf – James Jul 20 '10 at 14:44
1  @James + j.mp/9fVmtV, j.mp/aNDyf2, j.mp/9Gzzri :-) – chl Sep 3 '10 at 18:19
ggobi "is an open source visualization program for exploring high-dimensional data."

Mat Kelcey has a good 5 minute intro to ggobi.

[13 votes; community wiki; answered Aug 8 '10 at 14:33; Jeromy Anglim]
This may get downvoted to oblivion, but I happily used the Matlab clone Octave for many years. There are fairly good libraries in Octave-Forge for generating random variables from different distributions, statistical tests, etc., though it is clearly dwarfed by R. One possible advantage over R is that Matlab/Octave is the lingua franca among numerical analysts, optimization researchers, and some subset of applied mathematicians (at least when I was in school), whereas nobody in my department, to my knowledge, used R. My loss. Learn both if possible!

[13 votes; community wiki; answered Sep 3 '10 at 16:27; shabbychef]

3  True comment. But as an experienced programmer I feel dirty every time I use Matlab/Octave, which is a horribly-designed (if it's designed at all) language. Of course, I also cringe at SAS, which was obviously designed for punched cards. – Wayne May 19 '11 at 13:38
@Wayne true enough. I recall once hearing Bob Harper refer to the Matlab language as 'semantically suspect' ;) As with many languages, though, once you use it enough, you learn to cope with its oddities. – shabbychef Jun 2 '11 at 17:30
Weka for data mining - contains many classification and clustering algorithms in Java.

[12 votes; community wiki; answered Jul 19 '10 at 19:33; Fabian Steeg]

How's the performance of this? (I run screaming whenever I see the word 'Java'...) – shabbychef Sep 3 '10 at 18:01
@shabbychef Seems quite good from what I've heard, but generally Weka is used as a first step to test several algorithms and look at their performance on given data sets (or a subset thereof); then one recodes part of the core program to optimize its efficiency (e.g. with high-dimensional data calling for cross-validation or bootstrapping), sometimes in C or Python. – chl Sep 3 '10 at 18:14
2  @shabbychef: Java programs don't have to be slow monsters. I admit that well-written C code will almost always be faster than a similar implementation in Java, but well-written Java code will most likely not be super slow. Plus, developing in Java has many significant advantages. – posdef Feb 10 '11 at 12:52
There are also those projects initiated by the FSF or redistributed under the GNU General Public License, like:

PSPP, which aims to be a free alternative to SPSS

GRETL, mostly dedicated to regression and econometrics

There are even applications that were released just as companion software for a textbook, like JMulTi, but are still in use by a few people.

I am still playing with xlispstat from time to time, although Lisp has been largely superseded by R (see Jan de Leeuw's overview of Lisp vs. R in the Journal of Statistical Software). Interestingly, one of the cofounders of the R language, Ross Ihaka, argued on the contrary that the future of statistical software is... Lisp: Back to the Future: Lisp as a Base for a Statistical Computing System. @Alex already pointed to the Clojure-based statistical environment Incanter, so maybe we will see a revival of Lisp-based software in the near future? :-)

[11 votes; community wiki; answered Sep 3 '10 at 14:42; chl]
RapidMiner for data and text mining

[9 votes; community wiki; edited Apr 20 '13 at 7:21; 2 revs, Neil McGuigan]
GSL, for those of you who wish to program in C/C++, is a valuable resource, as it provides several routines for random number generators, linear algebra, etc. While GSL is primarily available for Linux, there are also ports for Windows. (See: http://gladman.plushost.co.uk/oldsite/computing/gnu_scientific_library.php and http://david.geldreich.free.fr/dev.html)
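If you want to poke at GSL interactively before committing to a full C program, one option is calling it from Python via ctypes. A rough sketch, assuming a Linux system with libgsl.so and libgslcblas.so on the loader path (library names and paths vary by platform):

import ctypes

# libgsl references CBLAS symbols, so load the CBLAS library globally first.
ctypes.CDLL("libgslcblas.so", mode=ctypes.RTLD_GLOBAL)
gsl = ctypes.CDLL("libgsl.so")

# gsl_rng_env_setup() returns the default generator type; allocate one rng.
gsl.gsl_rng_env_setup.restype = ctypes.c_void_p
gsl.gsl_rng_alloc.restype = ctypes.c_void_p
gsl.gsl_rng_alloc.argtypes = [ctypes.c_void_p]
rng = gsl.gsl_rng_alloc(gsl.gsl_rng_env_setup())

# Draw one Gaussian variate with sigma = 1.0.
gsl.gsl_ran_gaussian.restype = ctypes.c_double
gsl.gsl_ran_gaussian.argtypes = [ctypes.c_void_p, ctypes.c_double]
print(gsl.gsl_ran_gaussian(rng, 1.0))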
[6 votes; community wiki; answered Jul 19 '10 at 19:28; user28]
First of all, let me tell you that in my opinion the best tool of all by far is R, which has tons of libraries and utilities I am not going to enumerate here.

Let me expand the discussion about Weka.

There is a library for R called RWeka, which you can easily install in R and use to bring many of the functionalities of this great program into R alongside the ones already there. Let me give you a code example for building a simple classifier (IBk, a nearest-neighbour learner) on a standard dataset that comes with this package. (It is also very easy to learn and draw a decision tree instead, but I am going to let you do the research about how to do that; it is in the RWeka documentation.)
library(RWeka)

# Load the iris dataset from the ARFF file shipped with RWeka.
iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka"))

# Fit an IBk (k-nearest-neighbour) classifier predicting class from all attributes.
classifier <- IBk(class ~ ., data = iris)

# Summarize the classifier's performance on the training data.
summary(classifier)

There are also several Python libraries for doing this (Python is very, very easy to learn).

First, let me enumerate the packages you can use; I am not going to go into detail about them: Weka (yes, there is a library for Python), NLTK (the most famous open source package for text mining, besides data mining), statPy, scikits, and scipy.
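To give a quick taste of the scipy end of that list, here is a minimal sketch using scipy.stats; the two samples are made up for the example:

import numpy as np
from scipy import stats

# Two made-up samples to compare.
a = np.random.normal(0.0, 1.0, size=100)
b = np.random.normal(0.5, 1.0, size=100)

# Two-sample t-test: do the two samples share a mean?
t, p = stats.ttest_ind(a, b)
print(t, p)

# Fit a normal distribution to one sample (returns mean and std).
mu, sigma = stats.norm.fit(a)
print(mu, sigma)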
There is also Orange, which is excellent (I will also talk about it later). Here is a code example for building a tree from the data in the table cmpart1 that also performs 10-fold validation; you can also graph the tree:
import orange, orngTree

# Load the dataset (a tab-delimited Orange table).
data = orange.ExampleTable("c:\\python26\\orange\\cmpart1.tab")
domain = data.domain

# Partition the data into n buckets for 10-fold validation.
n = 10
buck = len(data) / n
l2 = []
for i in range(n):
    if i == n - 1:
        tmp = data[buck * i:]  # the last bucket also takes the remainder
    else:
        tmp = data[buck * i:buck * (i + 1)]
    l2.append(tmp)

# Counts of true/predicted class pairs for a binary (y/n) class.
di = {'yy': 0, 'yn': 0, 'ny': 0, 'nn': 0}
for i in range(n):
    # Bucket i is the test set; all other buckets form the training set.
    test = l2[i]
    train = []
    for j in range(n):
        if j != i:
            train.extend(l2[j])
    print "-----"
    train_table = orange.ExampleTable(domain, train)
    tree = orngTree.TreeLearner(train_table)
    for ins in test:
        d1 = ins.getclass()  # true class
        d2 = tree(ins)       # predicted class
        print d1
        print d2
        ind = str(d1) + str(d2)
        di[ind] = di[ind] + 1
print di

To end, here are some other packages I have used and found interesting:

Orange: data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. (I personally recommend this; I used it a lot, integrating it into Python, and it was excellent.) I can send you some Python code if you want me to.

ROSETTA: a toolkit for analyzing tabular data within the framework of rough set theory. ROSETTA is designed to support the overall data mining and knowledge discovery process: from initial browsing and preprocessing of the data, via computation of minimal attribute sets and generation of if-then rules or descriptive patterns, to validation and analysis of the induced rules or patterns. (This I also enjoyed using very much.)

KEEL: assesses evolutionary algorithms for data mining problems including regression, classification, clustering, pattern mining, and so on. It allows us to perform a complete analysis of any learning model in comparison to existing ones, including a statistical test module for comparison.

DataPlot: for scientific visualization, statistical analysis, and non-linear modeling. The target Dataplot user is the researcher and analyst engaged in the characterization, modeling, visualization, analysis, monitoring, and optimization of scientific and engineering processes.

Openstats: includes a statistics and measurement primer, descriptive statistics, simple comparisons, analyses of variance, correlation, multiple regression, interrupted time series, multivariate statistics, non-parametric statistics, measurement, statistical process control, financial procedures, neural networks, and simulation.
[6 votes; community wiki; answered Feb 14 '11 at 7:39; mariana soffer]

(+1) Very good list! – chl Jun 2 '11 at 12:00
I second that, Jay. Why is R valuable? Here's a short list of reasons: http://www.inside-r.org/why-use-r. Also check out ggplot2 - a very nice graphics package for R. Some nice tutorials here.

[5 votes; community wiki; answered Jul 19 '10 at 19:19; Stephen Turner]

9  Why ask the question here? All are community-wiki; why not just fix the canonical answer? – Jay Stevens Jul 19 '10 at 19:22
I really enjoy working with RooFit for easy, proper fitting of signal and background distributions, and TMVA for quick principal component analyses and modelling of multivariate problems with some standard tools (like genetic algorithms and neural networks; it also does BDTs). They are both part of the ROOT C++ libraries, which have a pretty heavy bias towards particle physics problems, though.

[5 votes; community wiki; answered Jul 20 '10 at 13:08; Benjamin Bannier]
A few more on top of those already mentioned:

KNIME, together with R, Python and Weka integration extensions, for data mining

Mondrian for quick EDA

And from a spatial perspective:

GeoDa for spatial EDA and clustering of areal data

SaTScan for clustering of point data

[5 votes; community wiki; answered Jul 24 '10 at 20:15; radek]

2  As a note, GeoDa and SaTScan aren't open source; they are freeware (not that it makes much difference to me, though!) – Andy W Feb 14 '11 at 14:06
1  pySal by the GeoDa Center is open source (see below). – B_Dev May 19 '11 at 13:29
Colin Gillespie mentioned BUGS, but a better option for Gibbs sampling, etc., is JAGS.

If all you want to do is ARIMA, you can't beat X12-ARIMA, which is a gold standard in the field and open source. It doesn't do real graphs (I use R to do that), but the diagnostics are a lesson on their own.

Venturing a bit farther afield to something I recently discovered and have just begun to learn...

ADMB (AD Model Builder), which does non-linear modeling based on the AUTODIF library, with MCMC and a few other features thrown in. It preprocesses the model and compiles it down to a standalone C++ executable, which is supposed to be way faster than equivalent models implemented in R, MATLAB, etc. ADMB Project

It started and is still most popular in the fisheries world, but it looks quite interesting for other purposes. It does not have the graphing or other features of R, and would most likely be used in conjunction with R.

If you want to work with Bayesian networks in a GUI: SamIam is a nice tool. R has a couple of packages that also do this, but SamIam is very nice.

[5 votes; community wiki; edited May 1 '12 at 20:56; 2 revs, Wayne]
This falls on the outer limits of 'statistical analysis', but Eureqa is a very user-friendly program for data-mining nonlinear relationships in data via genetic programming. Eureqa is not as general-purpose, but it does what it does fairly well, and the GUI is quite intuitive. It can also take advantage of available computing power via the Eureqa server.

[3 votes; community wiki; answered Feb 11 '11 at 17:52; shabbychef]
clusterPy for analytical regionalization or geospatial clustering

PySal for spatial data analysis.
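As a small taste of PySal, here is a sketch computing Moran's I for spatial autocorrelation, assuming the classic PySAL 1.x namespace (pysal.lat2W, pysal.Moran); the attribute values are random by construction:

import numpy as np
import pysal

# Rook-contiguity spatial weights for a 5x5 regular lattice.
w = pysal.lat2W(5, 5)

# Made-up attribute values, one per lattice cell.
y = np.random.random(25)

# Moran's I: global spatial autocorrelation (should be near zero here).
mi = pysal.Moran(y, w)
print(mi.I, mi.p_norm)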

[2 votes; community wiki; edited Jun 2 '11 at 5:25; 2 revs, B_Dev]
Symbolic mathematics software can be a good support for statistics, too. Here are a few GPL ones I use from time to time:

1. sympy is Python-based and very small, but can still do a lot: derivatives, integrals, symbolic sums, combinatorics, series expansions, tensor manipulations, etc. There is an R package to call it from R. (See the sketch after this list.)
2. sage is Python-based and HUGE! If sympy can't do what you want, try sage (but there is no native Windows version).
3. maxima is Lisp-based and very classical, intermediate in size between (1) and (2).

All three are in active development.
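For instance, a minimal sympy sketch of the kinds of manipulations listed above (the expressions are made up for the example):

from sympy import symbols, diff, integrate, summation, exp, oo

x, k = symbols('x k')

# Derivative: d/dx x**3 -> 3*x**2
print(diff(x**3, x))

# Gaussian integral: exp(-x**2) over the real line -> sqrt(pi)
print(integrate(exp(-x**2), (x, -oo, oo)))

# Symbolic infinite sum: 1/k**2 for k = 1..oo -> pi**2/6
print(summation(1/k**2, (k, 1, oo)))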
[2 votes; community wiki; edited May 27 '13 at 14:48; 2 revs, 2 users, kjetil b halvorsen]
Meta.Numerics is a .NET library with good support for statistical analysis.

Unlike R (an S clone) and Octave (a Matlab clone), it does not have a "front end". It is more like GSL, in that it is a library that you link to when you are writing your own application that needs to do statistical analysis. C# and Visual Basic are more common programming languages than C/C++ for line-of-business apps, and Meta.Numerics has more extensive support for statistical constructs and tests than GSL.

[1 vote; community wiki; answered Feb 20 '11 at 7:03]