
The Impoverished Social Scientist's Guide to Free Statistical Software and Resources

Last Updated: December 18, 2008

Table of Contents

General Statistics Packages
Accurate Statistics
Data Interactive Graphics (Data Visualization)
Data Plotting (and Publication Ready Graphics)
Image and Plot Analysis
Data Mining
Qualitative Data Text Manipulation, Management, Mining
Spatial Statistics and GIS
Survey Data Collection and Analysis
Agent Based Simulation
Dynamic Event Simulation
Monte Carlo and Markov Chain Monte Carlo (MCMC) Simulation
Specialized Statistical Packages
Epidemiology
Data Cleaning and Management
Matrix Algebra, Symbolic Algebra, and Computational Algebra Systems
Social Network Analysis
Differential Equations and Dynamic Simulation
Machine Learning

Free/Open Source Software


For statistical computing resources and other software for accurate computing, such as high-precision libraries, optimizers, and random number generators see our statistical computing page. And for software written by me for data distribution, accuracy, and replication see my software page. For sources of research data, see my Data Resources page.

Where to Start

The R Statistical Language: The open source statistical language of choice for most tasks. Based on the 'S' language. Thousands of contributed packages. (GPL)

Other General Statistics Packages

ADE: A modular multivariate analysis program which includes modules for spatial data analysis. Plays well with R. (GPL)
Adamsoft: A general purpose package that specializes in client-server based data management and large-data/low-memory computations. Good for large datasets. (GPL)
DataPlot: A powerful, but somewhat byzantine, package from the National Institute of Standards. (OSS)

Gretl ExaStat Macanova OpenStat PSPP

An open source econometrics package that plays nicely with R Basic statistics and regression on large data, using Windows. General package focusing on teaching, IRT.

GPL OSS

Reasonably powerful & programmable, if not easy to use. GPL OSS Aspires to replace SPSS. Reads SPSS files and provides the data manipulation functions, but is missing most of the analytical features. (GPL) Reasonably powerful with emphasis on simulation, command-line. A free Windows package for exploratory analysis, time series, and linear models. Nice interactive multidimensional table browser and interactive plots. OSS No Source

Simfit WinIADAMS

Accurate Statistics
(The following modules for R are very useful for highly accurate statistical computing on hard problems. For more resources and computing libraries, see my Resources for Accurate Computing page.)

accuracy: Sensitivity analysis and true random number generation. (GPL)
gmp: Multiple precision arithmetic. (GPL)
OpenTURNS: Tools for modeling uncertainty and risks. (GPL)
rgenoud: Optimizer using genetic algorithms and derivatives. (GPL)
rstream: Parallelizable random number generators. (GPL)
trust: Trust region based optimization. (GPL)
UNF: Universal Numeric Fingerprints -- format independent data validation. (GPL)

Data-Interactive Graphics (Data Visualization)
Also see the plotting category.

Gaugin: Grouping, glyphs, tableplots, oh my. (GPL)
GGobi: Supports data interactive visualization, exploration, comp, and analysis. Includes automated projection pursuit in high-dimensions. (GPL)
Improvise: A Java toolkit for linked visualizations. (GPL)
KLIMT: Interactive analysis of classification and regression trees. (GPL)
LabPlot: Data analysis and visualization. (GPL)
Mondrian: Mondrian is especially useful for interactive visualization of categorical data, and very large datasets. (No Source)
OPEN DX: Generates visualizations and animations for very large scale scientific data. (OSS)
ParaView: Parallel visualization of large datasets. (GPL)
prefuse: Java visualization toolkit. (OSS)
Processing: A language for rapid development of interactive data visualizations. Well integrated with Java and can produce polished visualizations. (OSS)
VISIT: Parallel large data visualization software.
VISTA: Dynamic, interactive, multi-view graphics. Plus a very interesting visual user-interface, akin to data-desk, but more advanced statistically. (GPL)

Data Plotting (and publication-ready graphics)
Almost all of the tools listed on this page have some sort of graphing capabilities. These packages specialize in it. Also see the visualization category.

Gnuplot: Command-line driven plots in 2D and 3D. (GPL)
GUPPI: Extensible plotting tool for Gnome. (GPL)
Jas3: A visualization and curve fitting package in Java. (GPL)
SciGraphica: High performance plotting package similar to Microcal Origin. (GPL)

Image and Plot Analysis
These packages can be used to manipulate images and to extract quantitative information from images, including recovering data from published plots and graphs.

DataScan: Extracts information from topographic images, microscopic images, and others. (OSS)
g3data: Specifically for extracting data from published graphs. (GPL)
Image/J: Can extract data from scanned maps, charts, graphs and even photos. (OSS)
Scion Image: Programmable image program with data capture capabilities. (No Source)

Data Mining
Also see the categories on text mining and machine learning.

Auton Labs Software: Dozens of independent packages for machine learning, including many classifiers. (Source available, registration required)
Databionic: Clustering, visualization, and classification using emergent self-organizing maps. (GPL)
Knime: Supports data pipelines for data processing, clustering, supervised learning, etc. GUI, CLI and API based. (OSS)
Orange: Predictive modeling, ensemble methods, clustering and validation, using C components and GUI widgets, and Python integration. (GPL)
Rattle: A Gnome based interface that glues together a large number of (clustering, association, machine learning, evaluation) modules in R for data mining. (GPL)
Shogun: Machine learning toolbox with multiple SVM, LDA, LPM classifiers. C++ with interfaces for Octave, R, Matlab, Python. (GPL)
Tanagra: Supports data processing streams including clustering, supervised learning, meta-spv, and cross-validation. Provides a GUI interface. (OSS)

Qualitative Data Manipulation, Management, Mining and Analysis
A list of commercial and non-commercial tools for qualitative analysis is part of the open directory project, a well-subscribed discussion list about software can be found as part of jisc, and a comparison of QDAS packages is here. The Natural Language Processing TaskView describes many R packages (interfaces to external toolkits) for text understanding. The MLInterfaces package on BioConductor provides a uniform interface to a large set of machine learning packages in R.

Advene: Video annotation. (OSS)
AnSWR: From the CDC, for mixed qualitative/quantitative analysis. (No Source)
Automap/ORA: Text tagging (similar to Atlas-TI), with more linguistic coding options, visualization and analysis of network of concepts identified. (No Source)
Elan: For complex annotation of audio and video. (GPL)
EZ-Text: From the CDC, for textual data analysis. (No Source)
Gate: A toolkit for information extraction from text. (GPL)
Judge: Performs automatic classification and clustering of documents. (GPL)
Lingpipe: Java library for linguistic processing and analysis. (No Commercial Use)
Kea: Performs automated key phrase extraction. (GPL)
Language Archiving Technology: A hosted service for text management and analysis. (Hosted)
NLTK: A Python toolkit for natural language processing. Includes tutorials on NLP. (GPL)
Perl: The programming language for supreme text mangling. (OSS)
Pliny: For annotating documents, text and images, and generating maps and graphs of relationships. (OSS)
SIL tools: If you have a lot of text on-line, the concordance, indexing, and database from the Summer Institute of Linguistics may be what you need. (No Source)
Tabari: Uses special purpose rules for categorizing news events from new text. (GPL)
Tams: Textual analysis and markup. Similar to Atlas-TI. (GPL)
TextStat: Another indexing/concordance package. (GPL)
VUE: Visual understanding environment. Allows you to create annotated networks of multimedia objects for presentation and commentary. A sort of non-linear, scholarly PowerPoint. (OSS)
Weft: For qualitative data management and coding. (No Source)
Weka: Weka is a collection of machine learning algorithms for data mining, including text mining. (R-Weka connects Weka and R, and is available on CRAN.) (GPL)
Wordfish: Scaling software for estimating political positions from texts. (GPL)
YALE (now RapidMiner): A flexible standalone package that contains many data mining algorithms. (GPL)

Spatial Statistics and GIS
In addition to the individual packages below, the Free GIS Site and OpenSourceGis sites maintain lists of many open-source GIS packages. The CISSS Tools Clearinghouse maintains links to many spatial analysis programs. Kelly Pace gives a list of links to software for advanced spatiotemporal econometrics. The AI-geostats software page has links to geo-spatial statistics programs and code. And Rgeo lists lots of contributed packages for doing geospatial statistics with R, including 'fields', 'geoR', 'graper', 'grass', and 'spatstat'.

Choroware: Choropleth maps with genetic algorithm generated class intervals. (GPL)
CrimeStat: Network, spatial and statistical analysis for crime data. Created for the National Institute of Justice. (No Source)
Fragstats: Designed to compute a wide variety of landscape metrics for categorical map patterns. (GPL)
Geoda: Unusual in its combination of GIS and spatial econometrics. (No Source)
Geovista Studio: General GIS toolkit and exploratory data analysis system. (GPL)
Grass: One of the most powerful free geographic information systems for the display of spatial data. (GPL)
LandSerf: Land surface visualization and analysis. (No Source)
SatScan: Space-time scan statistics -- for analysis of disease and other clusters distributed in space and time. (No Source)
SAGA: Combines GIS with kriging and terrain analysis. (GPL)
Spatial Econometrics Lib.: A library of Matlab functions for advanced spatial and spatiotemporal econometric analysis. (OSS)
STARS: Space time analysis of regional systems. Designed for the dynamic exploratory analysis of data measured for areal units at multiple points in time. If you have spatial time-series data, check this. (GPL)

Survey Data Collection and Analysis
The general software packages above have some facilities for survey analysis. The programs below specialize in data collection and/or the analysis of complex surveys. Also see the Epidemiology section.

AM: Handles analysis of complex survey samples, such as NAEP and TIMSS. (No Source)
dopoxtools: Free research web survey hosting. (Hosted)
Mod_survey: A very mature open source survey system. It is implemented as a drop-in Apache module. It supports creation of survey templates using XML, and export of the resulting data in a number of interchange formats. Mod_survey can be configured in a decentralized way, so that all users on a particular web server can administer their own surveys independently. (Also see YaaCs, below.) (GPL)
OpenSurveyPilot: Server based web survey system. (GPL)
PHPEsp: PHP based web survey system. (GPL)
Lime Survey: PHP based web survey system. (GPL)
PEBL: A programming environment for building interactive psychology experiments. (GPL)
protogenie: Free research web survey hosting. (Hosted)
PsychExps: A repository of experimental design scripts to be run under the Macromedia Authorware environment. (Mixed)
Quex Suite: Web based CATI system with integrated VoIP (Asterisk), XML form language, and paper form scanning capability. (GPL)
SurveyWiz: Simple JavaScript based web survey system. (GPL)
TESS: Time-Sharing Experiments for the Social Sciences. An NSF funded infrastructure to provide both web and phone surveys. (Hosted)
WebExp2: A Java-based system for on-line psych experiments. (No Source)
YaaCs: A CATI system that uses Mod_survey for the data collection, and offers additional management of other phases of the survey work flow -- questionnaire building, interviewer management, etc. (GPL)

Agent-Based Simulation
The International Society for Artificial Life maintains a list of links to many agent-based simulation frameworks.

Ascape: Agent based simulation package. (GPL)
breve: Simulation in a 3-D world, using Python or a simple scripting language. (GPL)
EVO: A simulation environment for co-evolution, based on SWARM. (OSS)
MASON: A Java-based agent-based modeling system popular in political science. (OSS)
NetLogo: An updated dialect of the Logo language for multi-agent simulation. (No Source)
REPAST: A multi agent simulation toolkit, with multiple implementations and built in adaptive features. (OSS)
Sesam: Simulation system with cool visual model building interface. (OSS)
SOAR: Agent based modeling based on cognitive/AI constructs. (GPL)
Swarm: A mature, full-featured framework for agent-based modeling, built in Objective C. (GPL)

Dynamic Event Simulation
This overlaps with Agent-Based Simulation above. I have listed only packages below, but several programming libraries are also available, including: DSOL (Java), SimPy (Python), Adevs (C++) and DeX (Python, C++, Scripting).

Desmo-J: Discrete event simulation framework. (GPL)
OMNet++: OMNeT++ is a component-based, modular and open-architecture simulation environment with strong GUI support and an embeddable simulation kernel, focusing on communication networks, but general enough to be used for network, systems, and business process simulation. (Academic Source License, not open source)
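For readers new to the idea, a discrete event simulation keeps a time-ordered queue of pending events and jumps the clock from one event to the next. The sketch below is a minimal illustration in plain Python (standard library only) of a single-server queue. It is not taken from any of the packages or libraries named above, and the function name `simulate_queue` and its parameters are assumptions made for this example.

```python
import heapq
import random


def simulate_queue(arrival_rate, service_rate, horizon, seed=0):
    """Simulate a single-server queue up to time `horizon`.

    Returns (customers served, customers still waiting).
    All names here are illustrative, not part of any listed package.
    """
    rng = random.Random(seed)
    events = []  # min-heap of (event time, tie-breaker, kind)
    seq = 0
    heapq.heappush(events, (rng.expovariate(arrival_rate), seq, "arrival"))
    waiting = 0
    busy = False
    served = 0

    while events:
        now, _, kind = heapq.heappop(events)  # jump to the next event
        if now > horizon:
            break
        if kind == "arrival":
            # Schedule the next arrival.
            seq += 1
            heapq.heappush(
                events, (now + rng.expovariate(arrival_rate), seq, "arrival")
            )
            if busy:
                waiting += 1
            else:
                # Server is idle: start service immediately.
                busy = True
                seq += 1
                heapq.heappush(
                    events, (now + rng.expovariate(service_rate), seq, "departure")
                )
        else:  # departure
            served += 1
            if waiting:
                # Pull the next customer from the line.
                waiting -= 1
                seq += 1
                heapq.heappush(
                    events, (now + rng.expovariate(service_rate), seq, "departure")
                )
            else:
                busy = False
    return served, waiting
```

Calling `simulate_queue(1.0, 2.0, 100.0)` runs the model for 100 time units with arrivals at rate 1 and service at rate 2. The dedicated packages add what this sketch leaves out, such as process abstractions, statistics collection, and model-building interfaces.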

Monte Carlo and Markov-Chain Monte Carlo (MCMC) Simulation
R and many of the other general packages above can be used for MC simulation. R also has a number of modules to perform Bayesian MCMC analysis directly, and through communicating with BUGS and JAGS.

JAGS: Just another GIBBS sampler. A program for Bayesian hierarchical models. ("Not unlike BUGS") (GPL)
MCMCpack: An R module to perform MCMC based analysis. Very easy to use, since it contains a large variety of pre-configured models. (GPL)
McSim: A specially tailored Monte Carlo simulation package. Goes well beyond general packages. (GPL)
OpenBugs: Open source rewrite of BUGS for Bayesian simulation. (GPL)
WinBUGS: Still the best BUGS for Windows, but not OSS. (No Source)

Specialized Statistical Packages

Blossom: Multi-response permutation tests. (No Source)
Fityk: Nonlinear peak fitting. (GPL)
Gambit: Game theory made simple(r). (OSS)
gSwing: Election result tracking and display. (GPL)
M.D. Anderson Cancer Center: Has useful biostat software from the biostats department. (Mixed)
MDSX: Multidimensional Scaling Routines for Windows. (No Source)
MPCA: Discrete and independent component analysis. (GPL)
MX: Structural Equation Modeling (like LISREL). (No Source)
PAST: PAlaeontological STatistics. Not strictly social science, of course, but the correspondence analysis, geometric analysis and cladistics could be applied fruitfully. (No Source)
Sitkis: Computes common bibliometric network statistics. (No Source)
Permap: Perceptual maps created through interactive multidimensional scaling. (No Source)
TETRAD: A LISREL-like structural equation modeling program. (GPL)
TDA: Transition Data Analysis. A system for analyzing event data; supports lots of options and models. (GPL)
Voteview: Voteview and nominate are for viewing and analyzing roll-call voting. (GPL)

Epidemiology
The CDC Software Page also offers a set of special packages for sampling design factors, meta-analysis, and spatial analysis. The WWW Virtual Epidemiology Library. Also see the category on survey tools.

MIX: Guided interactive meta-analysis. (GPL)
Epidata: Provides for programmed data entry and simple analysis. (No Source)
Epigrass: Epigrass is software for visualizing, analyzing and simulating epidemic processes on geo-referenced networks. (GPL)
Epi-info: Epidemiological statistics, maps, reports. (No Source)
Openepi: Javascript-based (on or off-line) simple epidemiological statistics. (OSS)
Netepi: Web based secure data entry and analysis for epidemiology. (GPL)
WinPepi: Over 75 modules for common epidemiological methods. (No Source)

Data Cleaning, and Management
For managing qualitative data, see the Text Tools section. For other database options see the Free SQL List and The ACM's Sigmod List.

Berkeley DB: A fast key-value based DB. Very lightweight (much more lightweight than SQL, and does not require a separate server running). Very fast for key-based retrievals. Also see the filehash and R.huge packages for using key-value DBs in R. (OSS)
CCOUNT: Does data cleaning, advanced cross-tabulation, and other market research functions. Also reads many mainframe-style data formats (e.g. EBCDIC, Column Binary). Modeled after SPSS Quantum. (GPL)

CSPRO DataCleaner HDF

Does form-based data entry, crosstabulation, and mapping. From the U.S. Census. (GPL) Tools for data review and editing. Hierarchical Data Format -- a portable format for representing and manipulating large scientific datasets. The latest version is compatible with netcdf. Also see the netcdf packages for R. Multiple imputation for missing data. One of the most mature and stable open source SQL databases. OSS GPL OSS

IVEware MySql

netCDF: A portable format for representing and manipulating large scientific datasets. Also see the netcdf package in R, the NCO package for manipulating netcdf data on the command line, and the Parallel-NetCDF package for high-speed access to NetCDF data. (GPL)
PostGRES: One of the most mature and stable open source SQL databases. (GPL)
R DBI: Connects R and SQL databases. (GPL)
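The key-value access pattern mentioned for Berkeley DB in this section (store a record under a key, fetch it back by that key, no server process) can be sketched with Python's standard-library `dbm` module. This is only an illustration of the pattern, not Berkeley DB itself; the function name `store_and_fetch`, the file name, and the sample records are assumptions made for the example.

```python
import dbm
import os
import tempfile


def store_and_fetch(records, key):
    """Write `records` to an on-disk key-value file, then look up one key.

    Illustrative helper only; uses the stdlib dbm module, not Berkeley DB.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "respondents")
        # "c" opens the database for read/write, creating it if needed.
        with dbm.open(path, "c") as db:
            for k, v in records.items():
                db[k] = v  # str keys and values are stored as bytes
        # Reopen read-only and retrieve a single record by key.
        with dbm.open(path, "r") as db:
            return db[key].decode("utf-8")
```

For example, `store_and_fetch({"r001": "age=34", "r002": "age=51"}, "r002")` writes two records and reads one back by key. There is no query language and no server, which is the trade-off against the SQL databases listed above.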

Matrix Algebra, Symbolic Algebra, and Computational Algebra Systems
These are standalone systems. For related programmer's libraries see my Resources for Numerical Accuracy listing. The following feature comparison contrasts these and a dozen other more specialized packages.

Axiom: Computer algebra. Lots of functions. Good documentation. (GPL)
Giac/Xcas: A computer algebra system. Includes limited compatibility with Maple, MuPad and TI89 syntax; arbitrary precision. (GPL)
Ginac: A computer algebra system. (C++ Library) (GPL)
FreeMat: Matrix algebra system. Matlab compatibility and built-in parallelization. (GPL)
GAP: Computer algebra system for group theory. Computational discrete algebra. (OSS)
JACAL: A computer algebra system. (GPL)
Magnus: Computer algebra system for group theory. (GPL)
matrex: A 'spreadsheet' where each cell is a matrix. Provides graphing, presentations, multi-threaded function-based calculations. (GPL)
Mathomatic: Yet another computer algebra system. (GPL)
Maxima: A computer algebra system. (GPL)
OCTAVE: A matrix manipulation/mathematics environment like Matlab. Mature. (GPL)
PARI/GP: A computer algebra system with arbitrary precision arithmetic, like Maple or Mathematica. (GPL)
RLAB: A matrix manipulation environment. (GPL)
SAGE: General purpose mathematical computing environment. (GPL)
SciLab: A matrix manipulation/mathematics environment like Matlab. Mature. (GPL)
Tela: Tensor computing. (GPL)
YACAS: Yet another computer algebra system. (Eponymous) Comes with Euler, for numerical programming. (GPL)
Yorick: An older matrix language. (OSS)

Social Network Analysis
Also see the Spatial category above for software with complementary and overlapping spatial network and display features.

Bibexcel: Bibliometric citation analysis. (No Source)
CiteSpace: Visualizes networks over time. (No Source)
Cfinder: Uses the clique percolation method to find overlapping dense groups of nodes. (No Source)
Egonet: Collection and analysis of egocentric network data. (No Source)
GraphViz: Mathematical graph visualization. (OSS)
Insoshi: A social network platform -- useful for data collection. (GPL)
Nettvis: Analyze and visualize social networks. Includes an on-line service. (GPL)
NetworkX: Python toolkit for visualization and analysis. (OSS)
NWD: Network workbench, visualization and descriptives. (OSS)
Pajek: Graph clustering, partitioning, citation analysis, network comparison (differences, unions), metrics. (No Source)
Proximity: Visualization and knowledge discovery from heterogeneous relational networks. (OSS)
R Modules for Network Analysis: A number of R modules maintained by Carter Butts, including SNA, network, nettheory, metamatrix. Also see Statnet for more R network packages. (OSS)
Sitkis: Computes common bibliometric network statistics. (No Source)
SocNetV: Provides core graph measures for social network analysis. (GPL)
Sonia: Animated visualizations of longitudinal social networks. (GPL)
STOCNET: Analysis of some interesting models, including evolution of social networks, blockmodeling, dyadic variable and actor analysis, maximum likelihood analysis of longitudinal (evolution of) networks (through SIENA), core network analysis. (GPL)
Tulip: Visualization for extremely large graphs. Plugins are available for clustering and core graph metrics. (GPL)
VISONE: Provides core graph measures for social network analysis. (No Source)
WinMine: Bayesian and dependency (decision-tree) network builder. (No Source)

Differential Equations and Dynamic Simulation
A good list of dynamic simulation packages is maintained by the SIAM activity group on dynamic systems.

PETC: Scientific toolkit for differential equations. (No Source)
scirun: A scientific environment for simulation and PDEs. (No Source)
SUNDIALS: Nonlinear and differential/algebraic equation solver. (OSS)

Machine Learning
A good list of machine learning tools is at mloss.org. Also see the categories on text mining and data mining.

dysii: C++ library for probabilistic learning within dynamic systems, high performance. (GPL)

Open source software, since it is inherently extensible, offers unparalleled opportunities to the researcher to do cutting edge research. Because it is free, it offers opportunities to the student or practitioner on a limited budget. This list concentrates on statistical packages that offer high-level statistical functions and that make source code freely available. Non open source free software is included only when it offers significant functionality that is not otherwise available. A number of software companies offer academic discounts, limited trials or other closed but usable software. See below for other lists that include commercial software.

Analyzing Data
There are some web-based statistics tutorials out there, but none that I like. I recommend some readings:

Introductory: Problem Solving, Chris Chatfield, Chapman & Hall, 1995. An excellent introduction to basic data analysis, from simple descriptive statistics through basic ANOVA. This is a beginner's guide that emphasizes understanding data. Half the fun is in doing the exercises.

Visualization: Visualizing Data and Elements of Graphing Data, William S. Cleveland. Not as beautiful as Tufte, but a more systematic approach to the visual analysis of data. The Visual Display of Quantitative Information, Edward Tufte. A classic, and a beautiful book. You may also wish to read his later books Envisioning Information (1990) and Visual Explanations (1997). If you are interested in presenting information with maps, you may be interested in two books by Mark Monmonier: How to Lie With Maps, and Mapping It Out.

Econometrics: Foundations of Econometrics, by Mittelhammer, Judge and Miller, is wide-ranging, and relatively gentle. Econometric Analysis, by William H. Greene, is voluminous and comprehensive. A Guide to Econometrics, by Peter Kennedy, shows everything that can go wrong with a regression and what to do about it. Unifying Political Methodology: The Likelihood Theory of Statistical Inference, by Gary King, is invaluable for the political science graduate student, although not up to date with advanced methods. It offers a consistent framework for numerous models used in political science.

Spatial Statistics: Spatial Data Analysis, by Robert Haining, and Spatial Statistics, by Brian Ripley, are great references.

Time Series: The Analysis of Time Series: An Introduction, by Chris Chatfield, is known by legions of students. Time Series Analysis, by James Hamilton, is more comprehensive. Event History Modeling, by Box-Steffensmeier and Jones, is recommended for any social scientist using this technique.

Bayesian Methods: Bayesian Data Analysis, by Gelman, Carlin, Stern and Rubin, is the textbook for Bayesian methods. Social scientists should read Bayesian Methods: A Social and Behavioral Sciences Approach, by Jeff Gill.

Statistical Computation: Numerical Issues in Statistical Computing for the Social Scientist is our book on the subject, and we think it's great for any social scientist who needs a practical introduction. Our resources page lists many others.

Statistical Humor: Not an oxymoron; see The Gallery of Statistics Jokes.

Other Lists of Statistical Software Packages


o Econometrics Journal links to software of interest to economists. Mostly commercial, but some free software is included.
o John C. Pezzullo's list of software -- lists some minor packages not listed here because their functionality is already included in other major packages, and lists commercial packages.
o Free Software by STATCON
o Free Statistical Software list by Andrea Corsini
o Gene Shackman's Sociological Research Methods Page
o Mailing lists: this list of discussion lists is a good place to start when you have questions about stat packages.
o MAS Scientific software links
o Stata Corporation maintains a list of other software packages, mostly commercial.
o Statlib: source of lots of statistical programs in SPLUS and other stat languages.
o York University's Statistical Resources Page

Caveats
"Entia non sunt multiplicanda sine necessitate" - William of Ockham's rule
"Ad indicia spectate." - Micah's corollary
"Doing econometrics is like trying to learn the laws of electricity by playing the radio." - Orcutt's observation
"One problem with political science is that its laboratories are unsecured, allowing real people to roam around inside them, spitting in test tubes and fiddling with computers" - Walter Kirn
"You can see a lot, just by looking." - Yogi Berra
Copyright 1995-2009
Micah Altman
