Data Analysis and Statistical Graphics with R


PD Dr Martin Elff, University of Konstanz
4 - 15 August (two week module / 35 hrs)
Course Outline

The course has two major parts. Apart from a basic introduction to the R environment for
data analysis, the first week mainly deals with the main steps in the workflow from data
acquisition to data analysis. The second part deals with various topics that allow participants
to use R to gain a deeper understanding of the foundations of data analysis and statistical
inference, and with some advanced aspects of statistical graphics with R. There are also a
couple of strictly optional modules on particular and/or advanced aspects of using R, which
will be covered if requested by course participants.

There is no textbook for this course. However, some further reading is suggested for those
who want to delve deeper into the topics of the course. R is best learned through practice
rather than by reading books. For this reason, participants are provided with exercises
that are done in class, the solutions of which will also be discussed during class sessions.

Week 1
Monday Basic Concepts: Data Objects, Basic Computation and Programming
This session introduces the basic concepts of R: how R differs from other statistical
software, how data are represented in R, how elementary computations can be done in R
(such as using R like a pocket calculator), and how repetitive computational tasks can be
automated with user-defined functions and control structures. The aim of this session is to
give users an orientation in the R environment for data analysis.
Suggested further reading:
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 1.
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, sections 1.1-1.3, 8.1, 8.3.
Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An
Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 1
Tuesday Data Management: Variables, Data Frames and Data Manipulation
This session deals with the crucial steps that precede any serious statistical data analysis in
R. Before data can be analysed, they must be prepared appropriately. First of all, data provided
by social science data archives often come in the binary formats of statistical packages other
than R, while in other cases data providers deliver the data in some tabular format. So this
session starts with importing data from such foreign sources. Second, data are not always
structured in a way that is appropriate for the intended analysis. Therefore another topic of
this session is data recoding, labelling and the handling of missing values. Furthermore, the
session discusses how to merge and append data sets and how to recast data sets from a wide
into a long format for repeated-measures analysis.
Suggested further reading:
Elff, Martin, 2008. Analysing the American National Election Study of 1948 using the
memisc package. R-package vignette, http://cran.r-project.org/web/packages/memisc/
Fox, John, 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, chapter 2.
Spector, Phil. 2008. Data Manipulation with R. New York: Springer.
Wednesday Summarising Data: Tables, Descriptive Statistics and Basic Graphics
Preliminary research questions can often be addressed with descriptive statistics, such as
contingency tables or conditional means. This session therefore discusses how to create
tables of frequencies and other descriptive statistics. Further, structures in the data can often
be elucidated by statistical graphics. Since R offers many facilities for creating such graphics,
and since these capabilities are a main attraction for many users of R, the creation of
diagrams, scatter plots, bar charts and mosaic displays is discussed extensively in this
session.
Suggested further reading:
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 3.
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, chapter 3.
Thursday Linear Regression with R: Model Construction and Interpretation
Linear regression and its generalisations are widely used tools for data analysis in the social
sciences. For this reason this session discusses how to construct and estimate linear
regression models in R. It covers the construction of regression models for metric dependent
and independent variables, regression models with dummy variables for categorical
independent variables, and regression models with interaction effects. Various aspects of
graphical model diagnostics are also discussed, as is the formatting of regression estimates
in the form required by many publishers.
Suggested further reading:
Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in
R. Oxford: Oxford University Press, chapter 3.
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 5, 9-10.
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, chapter 4.
Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An
Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters
5-7.
Friday Steps Beyond Linear Regression: Models for Categorical Responses, Counts,
and Survival Times
Linear regression requires the dependent variable to be metric (interval or ratio scaled). Yet
often variables that social scientists want to analyse are categorical or involve frequencies and
durations for which the classical linear regression model is not appropriate. This session
shows how to construct and estimate models for data of these kinds in R. Some challenges in
the interpretation of such models will be addressed, as well as some tricks for the graphical
presentation of their implications. Topics covered are: (a) linear vs generalised linear models,
a review; (b) logit and probit regression models; (c) Poisson regression; (d) models for
polychotomous dependent variables; (e) duration and hazard models.
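As a compact reminder for topics (a) to (c), the relation between the linear model and its generalisations can be written with a link function. The notation below (mean \mu_i, link g, standard normal distribution function \Phi) is conventional textbook notation and is not taken from the course materials.

```latex
\begin{align*}
\text{linear model:} \quad & \mathrm{E}(y_i) = \mu_i = x_i^\top \beta \\
\text{generalised linear model:} \quad & g(\mu_i) = x_i^\top \beta \\
\text{logit regression:} \quad & g(\mu) = \log\frac{\mu}{1-\mu} \\
\text{probit regression:} \quad & g(\mu) = \Phi^{-1}(\mu) \\
\text{Poisson regression (log link):} \quad & g(\mu) = \log \mu
\end{align*}
```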
Suggested further reading:
Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in
R. Oxford: Oxford University Press, chapters 4, 5, 6.
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 11, 12.
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, chapter 5.
Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An
Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 8.
Week 2
Monday Advanced Graphics: Tailor-made Diagrams, Lattice Graphics and Maps
R not only provides a set of standard statistical graphics but also makes it quite easy to
combine graphical elements, such as lines, dots, and rectangles, as building blocks of
tailor-made graphics for one's own particular purposes of representing data summaries or
estimation results. In this session we will discuss how to create new types of diagrams out of
these basic graphical elements. In addition we will discuss so-called lattice graphics, which
are a great tool for comparing relations between variables in different groups or under
different conditions, and thus for visualising interaction effects. Finally we will explore
how to create maps in R, which allow statistical summaries or model predictions to be
represented with geographical referents.
Suggested further reading:
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, chapter 7.
Murrell, Paul. 2005. R Graphics. Boca Raton: Chapman & Hall/CRC.
Tuesday Exploring the Foundations of Statistical Inference: Random Variables and
Distributions in R, Random Numbers and Monte Carlo Simulations
Most models used in conventional statistical data analysis are probability models. Thus a
grasp of the concept of probability is essential for a solid understanding of the fundamentals of
statistical inference. R provides many facilities that help in gaining such an understanding. It
provides density, probability mass and cumulative distribution functions for many common
statistical distributions, as well as excellent random number generators for these
distributions. In this session we will make use of these facilities to gain an understanding of
how point and interval estimates and statistical hypothesis tests work. We will also use
simulation studies to explore the consequences of violating the assumptions on which many
off-the-shelf statistical procedures rest.
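A simulation study of this kind might look like the following minimal sketch. It is an illustration only, not part of the course materials: the helper name `covers`, the sample size, the number of replications and the choice of an exponential distribution as the "violating" case are all assumptions made for the example. It estimates how often the nominal 95% t-interval for a mean actually contains the true mean.

```r
# Monte Carlo check of the coverage of the nominal 95% t-interval for a mean.
# All settings below are illustrative assumptions.
set.seed(42)     # arbitrary seed, only for reproducibility
n_reps <- 2000   # number of simulated samples
n_obs  <- 20     # observations per sample

# Hypothetical helper: draw one sample with the generator `draw` and
# report whether the 95% t-interval contains the true mean.
covers <- function(draw, true_mean) {
  x  <- draw(n_obs)
  ci <- t.test(x)$conf.int   # default confidence level is 0.95
  ci[1] <= true_mean && true_mean <= ci[2]
}

# Standard normal data (true mean 0): the assumptions of the t-interval hold.
cover_normal <- mean(replicate(n_reps, covers(rnorm, 0)))

# Exponential data with rate 1 (true mean 1): skewed, so normality is violated.
cover_skewed <- mean(replicate(n_reps, covers(rexp, 1)))

print(c(normal = cover_normal, exponential = cover_skewed))
```

Comparing the two estimated coverage rates with the nominal 0.95 shows how much the interval depends on the normality assumption at this sample size; increasing `n_obs` lets one watch the effect of larger samples.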
Suggested further reading:
Chihara, Laura and Tim Hesterberg. 2011. Mathematical Statistics with Resampling and R.
Hoboken, NJ: Wiley, chapters 3, 4, 6, 7, 8, appendices A, B.
Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An
Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters
2-4.
Wednesday Linear Algebra and the Geometry of Linear Regression with R
This session introduces fundamental concepts of linear algebra that are necessary to grasp
several advanced topics in multivariate data analysis and statistical inference, such as vectors
and arrays. Furthermore, it discusses arithmetic operations on vectors and matrices and
equations involving matrices and vectors. These concepts are, however, discussed in a
hands-on manner using R rather than in the abstract, in order to give participants an intuitive
understanding of how they can be put to good use. Thus the topics of this
lesson are: (a) vectors, matrices, and arrays; (b) linear systems and matrix inverses; (c) linear
regression in matrix form; (d) the geometry of least-squares solutions; (e) matrix algebra and
model-based inference.
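For orientation on topics (c) and (d), the standard least-squares results in matrix notation are summarised below. The symbols (response vector y, design matrix X, coefficient vector \beta, hat matrix H) follow common textbook conventions and are assumptions of this sketch rather than notation taken from the course.

```latex
\begin{align*}
y &= X\beta + \varepsilon
  && \text{(model; $y$ is $n \times 1$, $X$ is $n \times k$ with full column rank)} \\
\hat\beta &= (X^\top X)^{-1} X^\top y
  && \text{(least-squares estimate)} \\
\hat y &= X\hat\beta = Hy, \quad H = X (X^\top X)^{-1} X^\top
  && \text{(projection of $y$ onto the column space of $X$)} \\
X^\top (y - \hat y) &= 0
  && \text{(residuals are orthogonal to every column of $X$)}
\end{align*}
```

In R, the estimate can be computed directly from these expressions, for example with solve(crossprod(X), crossprod(X, y)), where X and y stand for a numeric design matrix and response vector.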
Suggested further reading:
Gill, Jeff 2006. Essential Mathematics for Political and Social Research. Cambridge:
Cambridge University Press, chapters 2, 3, and 4.
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks:
Sage, sections 2.3 and 8.4.
Thursday Principal Components, Factor Analysis, and Structural Equations
Regression models rest on the distinction between dependent and independent variables, or
between responses and regressors/predictors. There are however also research questions in
which such distinctions do not make too much sense and researchers are more interested in
patterns and structures residing in the data. These research questions are addressed by
methods discussed in this and the following lesson. This session focuses on methods that
emphasise relations between variables and that distinguish between latent and manifest
variables, that is, principal components and factor analysis. More specifically, the topics of this
lesson are: (a) foundations and applications of principal component analysis; (b) the general
factor model and confirmatory factor analysis; (c) systems of simultaneous equations; (d)
structural equation models with latent variables; (e) latent variable models with binary and
ordinal indicators.
Suggested further reading:
Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith 2008. Analysis of
Multivariate Social Science Data. (2nd ed.) Boca Raton: Chapman & Hall/CRC,
chapters 5 and 7.
Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New
York: Springer, section 11.3.
Friday Special and Advanced Topics of Data Analysis with R
This last session will be used to address the specific needs of the participants, to review some
topics that need a more thorough discussion, or to introduce some advanced topics in which
participants may be interested. Consequently, there are no pre-determined topics for this
lesson, although there are several possible topics that may be considered for this session.
These include: (a) cluster analysis; (b) multidimensional scaling and unfolding; (c) time series
analysis; (d) linear and generalised linear mixed-effects models; (e) models with instrumental
variables; (f) non-linear and semi-parametric extensions of the (generalised) linear model; (g)
numeric optimisation and general maximum likelihood; (h) parametric and non-parametric
bootstrapping; (i) design-based causal inference; (k) textual data and computational content
analysis; (l) advanced programming concepts (classes and methods, parallel computations).
Suggested further reading:
Adler, Joseph. 2010. R in a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O'Reilly.
Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in
R. Oxford: Oxford University Press, chapters 8, 9.
Braun, John W. and Duncan J. Murdoch. 2007. A First Course in Statistical Programming
with R. Cambridge: Cambridge University Press.
Chihara, Laura and Tim Hesterberg. 2011. Mathematical Statistics with Resampling and R.
Hoboken, NJ: Wiley, chapters 5, 10, 11.
Chambers, John M. 2008. Software for Data Analysis: Programming in R. New York:
Springer.
Gelman, Andrew and Jennifer Hill. 2007. Data Analysis using Regression and
Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York:
Springer.
Lumley, Thomas, 2010. Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ:
Wiley.
Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An
Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters
9-14.
Ritz, Christian and Jens Carl Streibig 2009. Nonlinear Regression with R. New York:
Springer.
Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New
York: Springer, chapter 8.
Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton,
FL: Chapman & Hall/CRC.
