
A Guide to Data Analysis in R Commander

Viann Nguyen-Feng, M.A.


Mark A. Stellmack, Ph.D.
University of Minnesota

Copyright 2016 by Viann Nguyen-Feng and Mark A. Stellmack


About the authors

Viann Nguyen-Feng received her M.P.H. from Eastern Virginia Medical School,
after which she completed a post-graduate epidemiology fellowship and then an M.A. at the
University of Minnesota. She is currently pursuing a Ph.D. at the University of Minnesota.

Mark A. Stellmack received his Ph.D. in Experimental Psychology from Loyola
University of Chicago, specializing in the study of auditory perception. He teaches
undergraduate statistics and research methods courses at the University of Minnesota.

Please contact the authors with questions, comments, and suggestions:

Viann Nguyen-Feng: nguy2174@umn.edu

Mark Stellmack: stell006@umn.edu


Table of Contents
The Purpose of this Guide .................................................................................................................. 5
Installing R, RStudio, and Rcmdr ..................................................................................................... 6
Running R Commander ....................................................................................................................... 7
Initial data entry .................................................................................................................................... 9
Saving a Comma-Separated Values (CSV) spreadsheet in Excel ................................................... 9
Opening a Comma-Separated Values (CSV) spreadsheet in Rcmdr ......................................... 10
Viewing a data set .......................................................................................................................................... 11
Editing a data set ............................................................................................................................................ 11
Switching between multiple data sets.................................................................................................... 11
Saving and loading a data set .................................................................................................................... 12

BASIC ANALYSES .................................................................................................................................. 15


Descriptive statistics .......................................................................................................................... 15
Mean, standard deviation, standard error of mean, interquartile range, coefficient of
variation, skewness, kurtosis, quantiles ............................................................................................... 15
Correlations .......................................................................................................................................... 18
Correlation matrix......................................................................................................................................... 18
Two-sample t-tests ............................................................................................................................. 21
Between-groups/Independent-groups t-test ...................................................................................... 21
Within-groups/Repeated-measures/Correlated-groups/Paired t-test..................................... 26
One-Way Analysis of Variance (ANOVA) and Post-Hoc Tests .............................................. 30
Two-Way Analysis of Variance (ANOVA) .................................................................................... 35
Testing main effects and interactions with multi-way ANOVAs ................................................... 35
Graphing interactions with multi-way ANOVAs ................................................................................. 38
Linear regression ................................................................................................................................ 40
Chi-square .............................................................................................................................................. 43
Chi-square using raw data .......................................................................................................................... 43
Chi-square using frequency counts ......................................................................................................... 46

ADVANCED ANALYSES ....................................................................................................................... 49


Descriptive statistics for sub-groups ........................................................................................... 49
Correlation test: Testing the significance of correlations ................................................... 53
Graphs ..................................................................................................................................................... 56
Scatterplot ........................................................................................................................................................ 56
Scatterplot by groups ................................................................................................................................... 59

Line graph......................................................................................................................................................... 61
Bar graph .......................................................................................................................................................... 64
Miscellaneous ....................................................................................................................................... 66
Opening and Entering data......................................................................................................................... 66
Opening an Excel spreadsheet .................................................................................................................. 66
Entering data directly into Rcmdr ........................................................................................................... 67
Recoding variables ........................................................................................................................................ 70
Combining items ............................................................................................................................................ 73
Converting variables from numeric to factor items .......................................................................... 75
Coding in R ............................................................................................................................................. 77
Deleting data sets........................................................................................................................................... 77
Labeling points in a scatterplot ................................................................................................................ 79
Repeated-measures ANOVA ....................................................................................................................... 82
Mixed-method ANOVA.................................................................................................................................. 89
Updates ................................................................................................................................................... 93
Updating packages ........................................................................................................................................ 93
RStudio updates ............................................................................................................................................. 93

The Purpose of this Guide
Our background is in Psychology. We teach introductory statistics and research methods
courses, which are typically required for most Psychology majors. Our statistics course
teaches the basics of descriptive and inferential statistics and our students perform
computations by hand. In our research methods course, students learn to perform the
same statistical analyses on a computer.

In the past, we used a popular software package for which our university purchased a site
license and which ran on university-owned computers. However, the software is
prohibitively expensive for students. As a result, students were able to use the software
only on university computers during times when the computers were available. We sought
an alternative software package that worked like the more expensive option but that was
more affordable and that students could use anytime on their own computers.

That led us to the R programming language. R is a free, powerful data-analysis program
that performs many complex statistical analyses, but using R requires one to learn the R
programming language. R Commander (Rcmdr) is a simple point-and-click interface to the
R language that provides easy access to the most common analyses that Psychology
students are likely to want to perform.

Our goal was not to write a statistics textbook. Thus, this guide does not contain exercises
or practice problems. Rather, our goal was to write a guide for users who already have
knowledge of basic statistical techniques. This guide provides simple, step-by-step
instructions for performing those analyses. This guide assumes that the user has
knowledge of statistics at the level of a student who has completed an introductory course,
including an understanding of the interpretation of p-values. Rcmdr is somewhat intuitive,
but there are enough quirks and hidden data-formatting requirements to bring some
analyses grinding to a halt if you do not format things properly. In addition, the Rcmdr
output sometimes can be cluttered and difficult to wade through. This guide instructs the
user on how to format the data for a particular analysis, what to click on, and where to find
the most relevant information in the output. We tried to keep the writing as brief as
possible so that this guide would be a useful, quick reference tool.

Layouts and instructions may vary depending on your operating system or computer type
(e.g., Mac or PC). The instructions were prepared using a Windows platform.

Disclaimer: All data presented in this guide are entirely fabricated and perhaps even
nonsensical. They are meant to serve an illustrative purpose in understanding the basics of
data analysis in Rcmdr. Depending on your operating system, your screen may appear
slightly different from the ones in this guide.

Installing R, RStudio, and Rcmdr
R is a free software package and programming language for performing a wide variety of
statistical analyses. RStudio and R Commander are interfaces that make it easier and more
convenient to use R. You will only need to follow the instructions on this page once, when
you first install R, RStudio, and Rcmdr.

You must first download and install R, then download and install RStudio; only then can
you install R Commander.

1. Go to one of the following websites and follow the instructions to download and
install the R software:
Windows: http://cran.r-project.org/bin/windows/base/
Macintosh: http://cran.r-project.org/bin/macosx/

2. After R is downloaded and installed, go to the following website and follow the
instructions to download the RStudio software for your operating system:
http://www.rstudio.com/products/rstudio/download/

3. After RStudio is downloaded and installed, launch RStudio.

4. When RStudio opens, at the command prompt (>) in the Console panel, type
install.packages("Rcmdr", dependencies=TRUE)
and press Enter.

Note: R and RStudio are case sensitive!

If a pop-up window appears asking you to "Please select a CRAN mirror for use in
this session," select the site closest to you, then click OK.

5. Many messages will appear in the Console window as R Commander is being
installed. When the installation is complete, the command prompt (>) will appear
again at the bottom of the Console window.

6. To open R Commander, type library(Rcmdr) at the command prompt and press
Enter.

If a pop-up window appears saying that you need to install another package and
asking if you want to do so, click on YES.

To run R Commander in the future after it is installed, you only need to launch RStudio and
type library(Rcmdr) at the command prompt in the Console window. (You will need to
click in the Console window before you can type in it.)
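
The one-time installation and the per-session launch described above can be combined into a small check. This is only a minimal sketch, not part of the original instructions; the helper name needs_install is hypothetical, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Hypothetical helper: TRUE if a package cannot be found in your R library.
# system.file() returns "" when the package is not installed.
needs_install <- function(pkg) {
  !nzchar(system.file(package = pkg))
}

if (needs_install("Rcmdr")) {
  # One-time step (requires an internet connection):
  # install.packages("Rcmdr", dependencies = TRUE)
  message("Rcmdr is not installed; run the install.packages() line above.")
} else {
  message("Rcmdr is installed; type library(Rcmdr) to launch it.")
}
```

Checking first avoids re-installing the package every time you start a new session.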

Running R Commander
To run R Commander, you must first launch RStudio (by double-clicking the RStudio icon).
The window shown below will open. The panel on the left-hand side of the RStudio
window is the Console. The > in the Console panel is the command prompt. At the
command prompt, type library(Rcmdr) and press Enter. (You probably will need to
click in the Console window before you can type in it.)

When you press Enter, Rcmdr will open. Rcmdr looks like this:

You can click on commands in the Rcmdr menus to run your analyses. All of your output
(the results of your point-and-click commands) will appear in the RStudio Console
window.

The big, white box at the bottom of the R Commander window is the R Script box of
Rcmdr. You will not need to type anything in the R Script box when you are doing
basic statistical analyses in Rcmdr. The R Script box has two purposes:
1. Script: You can enter R code into this box. This guide does not focus on
writing and entering R code, though the end of this guide provides code for
you to type in to perform some special functions. Rcmdr also generates code
that appears in this box when you point and click commands. Thus, as you
click on commands in R Commander, code will appear in the R Script box.
2. Messages: Rcmdr will give you notes, warnings, or error messages
about the commands that you execute. These messages are generated and
displayed by Rcmdr in the R Script box as you perform various operations
in Rcmdr. Notes do not require any action, warnings may require some
action, and error messages definitely require some action in order to have
the command run properly. The messages will most likely explicitly tell you
what went wrong and, thus, what must be changed in order to have the
command run properly.

Initial data entry
Here, we describe how to open a comma-separated values (CSV) spreadsheet in Rcmdr.
You can easily create a CSV file by entering your data in Excel and saving it in CSV format.

Not all versions of Rcmdr are the same. If the methods for opening a data file described
below do not work on your computer, see the Miscellaneous section of this guide (page 66)
for other methods of entering data (e.g., opening an Excel spreadsheet, manually entering
data). We highlight opening a comma-separated values (CSV) spreadsheet as the
primary data entry method because CSV files seem to open on all operating systems.

Saving a Comma-Separated Values (CSV) spreadsheet in Excel

Launch Excel and enter your data in a spreadsheet, remembering to enter each subject's
data in a different row and each variable in a different column.
Save your Excel sheet as a .csv instead of a .xls or .xlsx:
In Excel, go to File → Save As
In the dropdown menu next to Save as type, select CSV (Comma delimited). Click OK
through the remaining windows to save the file.

Opening a Comma-Separated Values (CSV) spreadsheet in Rcmdr

In Rcmdr, click on Data → Import data → from text file, clipboard, or URL

A Read Text Data From File, Clipboard, or URL window will open up. The default data set
name is Dataset. This is a label for the data set that is used within Rcmdr. You may
change this to something more meaningful by clicking in the box and typing. (This is
particularly useful if you intend to open more than one data set at a time.) Under Field
Separator, change the selection from White space to Comma because the values in a CSV
file are separated by commas. You may keep the other default values in this window.

Click OK and select the data file you want to open in the subsequent windows to create
the new data set.
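
For readers curious about what happens behind the menu, the import can be sketched in plain R with read.table(). This is a minimal sketch under assumptions: the file is created on the fly with made-up values so the example is self-contained, and the argument values are reasonable choices rather than a guaranteed copy of the code that your version of Rcmdr generates.

```r
# Create a small, made-up CSV file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("student,class,test1,test2",
             "Ann,freshman,12,40",
             "Bob,sophomore,20,52",
             "Cy,freshman,25,57"),
           csv_path)

# Read the file: first row holds the variable names, fields are
# separated by commas, and surrounding white space is stripped.
Dataset <- read.table(csv_path, header = TRUE, sep = ",",
                      na.strings = "NA", dec = ".", strip.white = TRUE)

print(Dataset)  # one row per subject, one column per variable
str(Dataset)    # shows which variables are character and which are numeric
```

If the separator were left at white space, each line would be read as a single field, which is why the Field Separator setting matters.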

Viewing a data set
When you return to the Rcmdr home screen, you can view the data you read into Rcmdr by
clicking the View data set button.

Your data set table will pop up in a new window.


You cannot edit the variable names or the data in this window.
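
The same kind of read-only inspection can be done from the command prompt. The following is a minimal sketch using base R functions; the data frame and its values are made up for illustration.

```r
# Made-up data set for illustration only.
Dataset <- data.frame(
  student = c("Ann", "Bob", "Cy", "Dee"),
  class   = c("freshman", "sophomore", "freshman", "sophomore"),
  test1   = c(12, 20, 25, 19),
  test2   = c(40, 52, 57, 50)
)

head(Dataset, n = 3)  # first three rows
dim(Dataset)          # number of rows and columns
names(Dataset)        # variable names
str(Dataset)          # type of each variable
```

None of these commands changes the data set; they only display it.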

Editing a data set


To change variable names or to change specific cell values, click on the Edit data set
button. The Data Editor window will open with your data in it. In the Data Editor window,
you can click on variable names or cell values to edit them. (For more information, refer to
the section Entering data directly into Rcmdr on page 67.)
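
Renaming a variable or changing a single cell can also be done with code rather than the Data Editor. This is a minimal sketch with made-up data; the new variable name exam1 is a hypothetical choice.

```r
# Made-up data set for illustration only.
Dataset <- data.frame(
  student = c("Ann", "Bob", "Cy"),
  test1   = c(12, 20, 25)
)

# Rename the variable test1 to exam1 (hypothetical new name).
names(Dataset)[names(Dataset) == "test1"] <- "exam1"

# Change one cell: set Bob's exam1 score to 21.
Dataset[Dataset$student == "Bob", "exam1"] <- 21

print(Dataset)
```

Selecting the row by the student's name rather than by row number keeps the edit correct even if the rows are later sorted.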

Switching between multiple data sets


You can open or create more than one data set during an Rcmdr session. In the Rcmdr
home screen, the button next to Data set: shows the data set that is currently active. All
commands that you click on will operate on the data set that is named in the button.

If you opened or created more than one data set during this Rcmdr session and you would
like to switch to a different data set, click the button next to Data set: A window labeled
Select Data Set will pop up. This window contains a list of all of the data sets that you
have created during this session.

Select the data set that you would like to use, then click OK. You will notice that the
button on your Rcmdr home screen will change to reflect the name of the data set that you
have selected. That data set is now the active data set.

Saving and loading a data set


After creating a data set, you may want to save the data set as an R Data file so that you may
easily load it into Rcmdr later instead of re-importing the CSV (or Excel, etc.) file. This is
particularly handy when you have entered the data manually (see p. 67).

Data → Active data set → Save active data set

A Save As window will pop up. Select the folder that you would like the data set to be
stored in and change the File name as appropriate. Click Save.

To load the saved data set later, go to:


Data → Load data set

An Open window will pop up. Select the R Data file in the appropriate folder. Click
Open to load the data file into Rcmdr. The data set you loaded will be the active
data set and the name that you gave the data set before you saved it will be displayed in the
Data set: button in Rcmdr.
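
Saving and loading an R Data file corresponds to the base R functions save() and load(). The sketch below assumes a made-up data set and writes to a temporary folder; in practice you would choose your own folder and file name.

```r
# Made-up data set for illustration only.
Dataset <- data.frame(test1 = c(12, 20, 25), test2 = c(40, 52, 57))

# Save the data set under its own name in an R Data file.
rdata_path <- file.path(tempdir(), "Dataset.RData")
save("Dataset", file = rdata_path)

# Remove it from the workspace, then load it back from the file.
rm(Dataset)
loaded_names <- load(rdata_path)

loaded_names  # names of the objects restored from the file
print(Dataset)
```

Because load() restores the object under the name it had when it was saved, the data set reappears with its original name.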

BASIC ANALYSES

Descriptive statistics
Mean, standard deviation, standard error of mean, interquartile range, coefficient of
variation, skewness, kurtosis, quantiles
Statistics → Summaries → Numerical summaries

A Numerical Summaries window will pop up. Select the variable(s) for which you want to
calculate descriptive statistics. Only numeric variables will be shown in this box. Hold the
Ctrl key while clicking to select more than one variable. Hold the Shift key and click to
select a range of variables that are listed directly next to each other.

Click on the Statistics tab to select which descriptive statistics you would like. The ones
selected below (mean, standard deviation, interquartile range, quantiles) are selected by
default.

Click OK. The output will appear in your RStudio console window.

Rcmdr> numSummary(OurData2[,"test1"],
Rcmdr+   statistics=c("mean", "sd", "IQR", "quantiles"),
Rcmdr+   quantiles=c(0,.25,.5,.75,1))
 mean       sd IQR 0% 25% 50% 75% 100%  n
 19.4 4.102264   7 12  16  20  23   25 15

Interpretation:
Mean (mean) = 19.4
Standard deviation (sd) = 4.102264
Interquartile range (IQR) = 7
0th percentile score (0%) = 12
25th percentile score (25%) = 16
50th percentile score (50%) = 20
75th percentile score (75%) = 23
100th percentile score (100%) = 25
Sample size (n) = 15
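
The same statistics, plus the standard error of the mean and the coefficient of variation, can be computed with base R functions. This is a minimal sketch; the scores below are made up and are not the data behind the output above.

```r
# Made-up scores for illustration only.
test1 <- c(12, 14, 15, 16, 16, 18, 19, 20, 21, 22, 23, 23, 24, 25, 25)

n   <- length(test1)                 # sample size
m   <- mean(test1)                   # mean
s   <- sd(test1)                     # standard deviation (n - 1 denominator)
se  <- s / sqrt(n)                   # standard error of the mean
cv  <- s / m                         # coefficient of variation
iqr <- IQR(test1)                    # interquartile range
q   <- quantile(test1, probs = c(0, .25, .5, .75, 1))  # quantiles

print(c(mean = m, sd = s, se = se, cv = cv, IQR = iqr, n = n))
print(q)
```

The interquartile range is simply the 75th percentile minus the 25th percentile, which you can confirm from the quantiles.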

Correlations
Correlation matrix
A correlation matrix is a table with all of the variables of interest listed in the rows and the
columns. The intersecting cell of a particular row and a particular column shows the
Pearson product-moment correlation (r) between the two variables. Correlations are
shown for all pair-wise combinations of the variables of interest.

Statistics → Summaries → Correlation matrix

A Correlation Matrix window will pop up.

-Under Variables (pick two or more): Select the variables that you want to include in
the correlation matrix. Only numeric variables appear in this list. Hold the Ctrl key
while clicking to select more than one variable. Hold the Shift key and click to select
a range of variables that are listed directly next to each other.
-Under Type of Correlations: When most people talk about a correlation, they are
referring to the Pearson product-moment correlation (r). This is the default setting.
-Under Observations to Use: The option selected here does not matter if you chose only
two variables in the box above. If you chose three or more variables, the options in this
section have the following effects:
1. Complete observations: If the value of one variable is missing for a case (or row),
then the entire case/row will be omitted from all correlation computations in the
correlation matrix. This will result in the same number of observations across all
correlations.
2. Pairwise-complete observations: If the value of one variable is missing for a case (or
row), then that case will be omitted from the analysis only for the correlations
involving the variable with the missing observation. This may result in different
numbers of observations for each correlation.
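
The difference between the two options can be seen with the base R function cor(), whose use argument accepts "complete.obs" and "pairwise.complete.obs". This is a minimal sketch with made-up scores; the variable test3 is hypothetical and is included only so that there are three variables.

```r
# Made-up scores with two missing values (NA) for illustration only.
scores <- data.frame(
  test1 = c(12, 15, 18, 20, 23, 25),
  test2 = c(40, 44, NA, 52, 55, 57),
  test3 = c(30, 28, 35, 33, NA, 41)
)

# Complete observations: any row with a missing value is dropped everywhere.
r_complete <- cor(scores, use = "complete.obs")
n_complete <- sum(complete.cases(scores))

# Pairwise-complete observations: a row is dropped only for the
# correlations that involve the variable with the missing value.
r_pairwise <- cor(scores, use = "pairwise.complete.obs")
observed   <- (!is.na(as.matrix(scores))) + 0  # 1 = present, 0 = missing
n_pairwise <- crossprod(observed)              # n used for each pair

print(r_complete); print(n_complete)
print(r_pairwise); print(n_pairwise)
```

With these data, the complete-observations matrix is based on the same four rows throughout, whereas the pairwise matrix uses five rows for some correlations and four for others.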

To calculate the p-value of the correlation (to determine if the r-value is significantly
different from zero), select the Pairwise p-values option so that a checkmark appears in
the box.

Click OK. The output will appear in the RStudio console window.
Rcmdr> rcorr.adjust(Dataset[,c("test1","test2")],
Rcmdr+   type="pearson", use="complete")
      test1 test2
test1  1.00  0.92
test2  0.92  1.00

 n= 15

 P
      test1 test2
test1       0
test2 0

 Adjusted p-values (Holm's method)
      test1 test2
test1       0
test2 0

Interpretation: The output tells us that the correlation between a student's score on test1
and his/her score on test2 is 0.92 (r = 0.92).
Looking at the p-values in the P section, we see that the p-value for this correlation is 0. As
a result, we would conclude that this r is statistically significantly different from zero
(assuming we have chosen the .05 level of significance). (Note that 0s in the P section
indicate that p < .001, not that p equals zero. The value of p is never exactly zero.)
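
To see that a displayed 0 is really a very small p-value, you can test a single correlation with the base R function cor.test(). This is a minimal sketch using made-up scores, not the data from the output above.

```r
# Made-up, strongly correlated scores for illustration only.
test1 <- c(12, 14, 15, 16, 16, 18, 19, 20, 21, 22, 23, 23, 24, 25, 25)
test2 <- c(44, 45, 47, 46, 48, 49, 51, 50, 53, 52, 55, 54, 56, 55, 57)

result <- cor.test(test1, test2, method = "pearson")

result$estimate   # the correlation, r
result$parameter  # degrees of freedom, n - 2
result$p.value    # the unrounded p-value: tiny, but greater than zero

# Format the p-value the way it is usually reported.
format.pval(result$p.value, eps = .001)
```

Reporting p < .001 rather than p = 0 reflects the fact that the value has only been rounded down to zero in the printed table.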

Two-sample t-tests
Between-groups/Independent-groups t-test
In the following example, we will perform a between-groups/independent-groups t-test in
which we want to compare how students in different classes (freshman and sophomore)
perform on a specific test. Data sets often contain more information than needed for a
particular analysis. For example, in this case, we have scores for two different tests, but we
will only compare how students performed on one test, test1. The students' names
(student variable) also are additional data that are not necessary to perform the
between-groups/independent-groups t-test in this case. This is what our data file looks
like:

IMPORTANT DATA-FORMATTING NOTE:


Rcmdr will only make the independent-samples t-test analysis available if Rcmdr can
identify a potential grouping variable in your data set. In order for the Independent
samples t-test option to be available in the Rcmdr menu, your data must have at least
one variable (column) with exactly two different values of a character variable (e.g., "a" and
"b") that can serve as the grouping variable; the character variable that you want to use
as the grouping variable cannot have more than two values. Also, the grouping
variable cannot be numeric (e.g., 1 and 2). For example, in the data set shown above,
the class variable has only the values "freshman" and "sophomore", so it is a potential
grouping variable. If the active data set does not satisfy these conditions, the Independent
samples t-test option will be grayed out (not selectable) in the Rcmdr menu.
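
If your groups were coded with numbers, one way to satisfy this requirement is to convert the numeric codes into a factor with text labels. This is a minimal sketch with made-up data; the column name group and the labels "a" and "b" are hypothetical.

```r
# Made-up data in which the two groups are coded 1 and 2.
Dataset <- data.frame(
  group = c(1, 2, 1, 2, 1, 2),
  test1 = c(14, 19, 22, 17, 25, 20)
)

# Replace the numeric codes with a two-level factor.
Dataset$group <- factor(Dataset$group, levels = c(1, 2), labels = c("a", "b"))

str(Dataset)           # group is now a factor with 2 levels
table(Dataset$group)   # number of cases in each group
```

After the conversion, the variable has exactly two non-numeric values and can serve as a grouping variable.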

Statistics → Means → Independent samples t-test

An Independent Samples t-Test window will pop up.


-Under Groups (pick one): Select the grouping variable that identifies the two groups.
This will be the independent variable. Only character variables that Rcmdr determines to
be potential grouping variables (see the IMPORTANT DATA-FORMATTING NOTE above) will
be shown in this list; in this example, student does not appear on the list because each
student name appears only once, so it does not look like a grouping variable.
-Under Response Variable (pick one): Select the variable for which you want the means
to be calculated. This will be the dependent variable. Only numeric variables will be
shown in this list.

Select the Options tab (next to the Data tab). Difference has been automatically set to
freshman - sophomore (freshman minus sophomore) because these are the two categories, in
alphabetical order, of the Groups variable (class). In this example, Rcmdr will
calculate the difference between means as the freshman mean minus the sophomore mean.
-Under Alternative Hypothesis: Select Two-sided if your alternative hypothesis is
non-directional and states that the freshman mean is different from the sophomore mean.
Select Difference < 0 if your alternative hypothesis is directional and predicts that
freshman - sophomore (i.e., the freshman mean minus the sophomore mean) is less than 0,
meaning that the sophomore mean is larger than the freshman mean. Select Difference >
0 if your alternative hypothesis is directional and predicts that freshman - sophomore
(i.e., the freshman mean minus the sophomore mean) is greater than 0, meaning that the
freshman mean is larger than the sophomore mean.
-Under Confidence Level: Alpha, your chosen level of significance, equals 1 minus the
confidence level. Setting alpha to .05 is typical, so you probably will keep the default
setting of Confidence Level = .95.
-Under Assume equal variances?: Assuming that we have checked our assumptions
beforehand, we would ideally want the variances of the two groups (freshman and
sophomore) to be equal. The default setting is No. Change this to Yes.

When you assume equal variances in the Independent Samples t-test, you are
assuming that your data meet the condition of homogeneity of variance.
Homogeneity of variance means that the variances of the populations from which
your samples were drawn are equal. The homogeneity of variance condition is most
important when there is a large difference between the sizes of the samples. If the
sample sizes and the sample variances are very different, the results of the
Independent Samples t-test will be harder to interpret. If you suspect that the
samples may have been drawn from populations with unequal variances, there are
tests for homogeneity of variance; for example, Hartley's F-max test. Refer to an
advanced statistics text for instructions on performing that test.
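
As a rough check on the equal-variances assumption, you can compare the two sample variances directly. The sketch below is only illustrative and uses made-up data: it computes the ratio of the larger to the smaller sample variance (the statistic on which Hartley's F-max test is based, without looking up a critical value), runs base R's var.test() (an F test for two variances, which is a different test from Hartley's), and shows the t-test that does not assume equal variances.

```r
# Made-up data: 7 freshmen and 8 sophomores, for illustration only.
Dataset <- data.frame(
  class = rep(c("freshman", "sophomore"), times = c(7, 8)),
  test1 = c(14, 18, 20, 22, 23, 25, 17,
            12, 15, 16, 19, 20, 21, 24, 25)
)

# Sample variance of test1 within each class.
group_vars <- tapply(Dataset$test1, Dataset$class, var)

# Ratio of the largest to the smallest sample variance.
f_max <- max(group_vars) / min(group_vars)

# F test comparing the two variances (base R).
variance_test <- var.test(test1 ~ class, data = Dataset)

# t-test that does not assume equal variances (Welch's t-test).
welch <- t.test(test1 ~ class, data = Dataset, var.equal = FALSE)

print(group_vars); print(f_max)
print(variance_test)
print(welch)
```

A variance ratio close to 1 is consistent with homogeneity of variance; when the ratio is large and the sample sizes differ, the Welch version of the test is a common alternative.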

Click OK. The output will appear in your RStudio console window. (See following pages.)

Output when Alternative Hypothesis = Two-sided
Rcmdr> t.test(test1~class, alternative='two.sided', conf.level=.95, var.equal=TRUE,
Rcmdr+   data=Dataset)

        Two Sample t-test

data:  test1 by class
t = 0.3913, df = 13, p-value = 0.7019
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.874943  5.589228
sample estimates:
 mean in group freshman mean in group sophomore
               19.85714                19.00000

Interpretation: The freshman and sophomore group means do not significantly differ from
each other, as p is greater than alpha = .05. In APA format, we would write the results:
t(13) = .39, p = .70.

Output when Alternative Hypothesis = Difference < 0


Rcmdr> t.test(test1~class, alternative='less', conf.level=.95, var.equal=TRUE,
Rcmdr+   data=Dataset)

        Two Sample t-test

data:  test1 by class
t = 0.3913, df = 13, p-value = 0.649
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 4.736207
sample estimates:
 mean in group freshman mean in group sophomore
               19.85714                19.00000

Interpretation: The sophomore mean is not significantly larger than the freshman mean, as
p is greater than alpha = .05. In APA format, we would write the results: t(13) = .39, p =
.65.

Output when Alternative Hypothesis = Difference > 0
Rcmdr> t.test(test1~class, alternative='greater', conf.level=.95, var.equal=TRUE,
Rcmdr+   data=Dataset)

        Two Sample t-test

data:  test1 by class
t = 0.3913, df = 13, p-value = 0.351
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -3.021921       Inf
sample estimates:
 mean in group freshman mean in group sophomore
               19.85714                19.00000

Interpretation: The freshman mean is not significantly larger than the sophomore mean, as
p is greater than alpha = .05. In APA format, we would write the results: t(13) = .39, p =
.35.
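
The three outputs above differ only in the alternative argument of t.test(). The following minimal sketch, which uses made-up data rather than the data behind those outputs, runs all three versions and shows how their p-values are related.

```r
# Made-up data: 7 freshmen and 8 sophomores, for illustration only.
Dataset <- data.frame(
  class = rep(c("freshman", "sophomore"), times = c(7, 8)),
  test1 = c(14, 18, 20, 22, 23, 25, 17,
            12, 15, 16, 19, 20, 21, 24, 25)
)

# Helper (hypothetical name) that runs the equal-variances t-test
# with a given alternative hypothesis.
run_t <- function(alt) {
  t.test(test1 ~ class, data = Dataset, alternative = alt,
         conf.level = .95, var.equal = TRUE)
}

two_sided <- run_t("two.sided")  # freshman mean differs from sophomore mean
less      <- run_t("less")       # freshman - sophomore < 0
greater   <- run_t("greater")    # freshman - sophomore > 0

c(t = unname(two_sided$statistic), df = unname(two_sided$parameter))
c(two_sided = two_sided$p.value, less = less$p.value, greater = greater$p.value)
```

The t statistic and degrees of freedom are identical in all three runs. The two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them, which matches the pattern in the outputs above (.649 + .351 = 1, and 2 x .351 is approximately .70).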

Within-groups/Repeated-measures/Correlated-groups/Paired t-test
In this example, we will use the same data set that we introduced in the
previous section (covering the Between-groups/Independent-groups t-test):

IMPORTANT DATA-FORMATTING NOTE:


In order for the repeated-measures t-test (Paired t-test) option to be available in the
Rcmdr menu, your data must have at least two variables (columns) containing all
numerical values. For example, in the data set shown above, the test1 and test2
variables contain all numerical values. If your data do not satisfy these conditions, the
Paired t-test option will be grayed out (not selectable) in the Rcmdr menu.

Statistics → Means → Paired t-test

A Paired t-Test window will pop up. Select one variable under First variable (pick one)
and then a different variable under Second variable (pick one). Only variables that
contain all numerical values will be shown.

Select the Options tab (next to the Data tab). Difference has been automatically set to
your First variable - your Second variable (first minus second variable), as selected on
the Data tab. In this case, Difference refers to test1 - test2. (Note that, unlike the
Independent samples t-test, the Paired t-test does not show the difference in this
window.)

-Under Alternative Hypothesis: Select Two-sided if your alternative hypothesis is
non-directional and states that the test1 mean is different from the test2 mean. Select
Difference < 0 if your alternative hypothesis is directional and predicts that test1 -
test2 (i.e., the test1 score minus the test2 score) is less than 0, meaning that the
test2 mean is larger than the test1 mean. Select Difference > 0 if your alternative
hypothesis is directional and predicts that test1 - test2 (i.e., the test1 score minus the
test2 score) is greater than 0, meaning that the test1 mean is larger than the test2
mean.
-Under Confidence Level: Alpha equals 1 minus the confidence level. Setting alpha at .05
is typical, so you probably will keep the default setting of Confidence Level = .95.

Click OK. The output will appear in your RStudio console window. (See following pages.)

Output when Alternative Hypothesis = Two-sided
Rcmdr> with(Dataset, (t.test(test1, test2, alternative='two.sided', conf.level=.95,
Rcmdr+   paired=TRUE)))

        Paired t-test

data:  test1 and test2
t = -35.515, df = 14, p-value = 4.044e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -32.87212 -29.12788
sample estimates:
mean of the differences
                    -31

Interpretation: The test1 and test2 mean scores significantly differ from each other, as
p is less than alpha = .05. In APA format, we would write the results: t(14) = -35.52, p <
.001.

Output when Alternative Hypothesis = Difference < 0


Rcmdr> with(Dataset, (t.test(test1, test2, alternative='less', conf.level=.95,
Rcmdr+   paired=TRUE)))

        Paired t-test

data:  test1 and test2
t = -35.515, df = 14, p-value = 2.022e-15
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf -29.4626
sample estimates:
mean of the differences
                    -31

Interpretation: The test2 mean score is significantly larger than the test1 mean score,
as p is less than alpha = .05. In APA format, we would write the results: t(14) = -35.52, p <
.001.

Output when Alternative Hypothesis = Difference > 0


Rcmdr> with(Dataset, (t.test(test1, test2, alternative='greater',
Rcmdr+ conf.level=.95, paired=TRUE)))

Paired t-test

data: test1 and test2
t = -35.515, df = 14, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-32.5374 Inf
sample estimates:
mean of the differences
-31

Interpretation: The test1 mean score is not significantly larger than the test2 mean
score, as p is greater than alpha = .05. In APA format, we would write the results:
t(14) = -35.52, p > .99.
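
The same three tests can be typed directly into the R console. The sketch below is only an illustration: the test1 and test2 scores are made-up values (not the data analyzed above), and the object names are our own. It also shows that a paired t-test gives the same result as a one-sample t-test on the differences (test1 - test2).

```r
# Made-up scores for illustration only; NOT the data set used in this guide.
test1 <- c(12, 15, 11, 18, 14, 16, 13, 17)
test2 <- c(14, 18, 15, 19, 17, 21, 15, 20)

# Two-sided paired t-test; the difference is test1 - test2.
two_sided <- t.test(test1, test2, alternative = "two.sided",
                    conf.level = 0.95, paired = TRUE)

# Directional tests: Difference < 0 and Difference > 0.
less    <- t.test(test1, test2, alternative = "less", paired = TRUE)
greater <- t.test(test1, test2, alternative = "greater", paired = TRUE)

# Equivalent one-sample t-test on the difference scores.
one_sample <- t.test(test1 - test2, mu = 0)

print(two_sided)
cat("p (Difference < 0):", less$p.value, "\n")
cat("p (Difference > 0):", greater$p.value, "\n")
```

Because the two directional p-values always sum to 1, a very small p-value for one direction implies a p-value near 1 for the other, which is the pattern seen in the outputs above.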

One-Way Analysis of Variance (ANOVA) and Post-Hoc Tests
Perform a one-way ANOVA if you want to compare the means of three or more samples
that differ in terms of the level of a single independent variable. If the means of the
samples are significantly different, you may want to perform a post-hoc test to test the
significance of the difference between each pairwise combination of group means in your
data set.

IMPORTANT DATA-FORMATTING NOTE:


Rcmdr will only make the one-way ANOVA analysis available if Rcmdr can identify a
potential grouping variable in your data set. In order for the One-way ANOVA option to
be available in the Rcmdr menu, your data must have at least one variable (column) with at
least two different values of a character variable (e.g., "a", "b", and "c") that can
serve as the grouping variable. The grouping variable cannot be numeric (e.g., 1, 2, and
3). If the active data set does not satisfy these conditions, the One-way ANOVA option
will be grayed out (not selectable) in the Rcmdr menu. Note that the One-way ANOVA
option will be available if your data have only two levels of the grouping variable, but it is
more appropriate to perform a t-test in that situation.

For this example, we would set up our data as shown in the following table.

Statistics → Means → One-way ANOVA

A One-Way Analysis of Variance window will pop up. You may elect to change the model
name beside Enter name for model: or you may keep the default name of
AnovaModel.1.
-Under Groups (pick one): Select the grouping variable that identifies the different levels
of the independent variable. Only character variables that satisfy the conditions described
above in the IMPORTANT DATA-FORMATTING NOTE appear in this list.
-Under Response Variable (pick one): Select the variable for which you wish to compare
sample means. This will be the dependent variable. Only numeric variables appear in this
list.
-Pairwise comparison of means: Check this box to output the contrasts, that is, the
comparisons between every possible pair of your three or more groups (in this case,
vehicle types).

Click OK. The output will appear in the RStudio console window. (See following pages.)

ANOVA output
Rcmdr> AnovaModel.1 <- aov(maxspeed ~ vehicle, data=Dataset2)

Rcmdr> summary(AnovaModel.1)
Df Sum Sq Mean Sq F value Pr(>F)
vehicle 2 1383 691.7 6.58 0.00454 **
Residuals 28 2943 105.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Rcmdr> with(Dataset2, numSummary(maxspeed, groups=vehicle,
Rcmdr+ statistics=c("mean", "sd")))
mean sd data:n
sedan 58.70000 9.592242 10
SUV 75.20000 9.330952 10
van 65.18182 11.539655 11

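Interpretation: The maxspeed means of the vehicle types are not all equal (at least one
mean differs significantly from the others), as p is less than an assumed alpha = .05. In
APA format, we would write the results, reporting both the between-groups and the
residual degrees of freedom: F(2, 28) = 6.58, p < .01.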
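
As a rough sketch of how the numbers in the APA statement can be pulled out of the ANOVA table, the code below fits a one-way ANOVA in the R console. The data frame speeds and its values are made up for illustration (they are not the data in Dataset2), and the object names are our own.

```r
# Made-up data for illustration only; NOT the guide's Dataset2.
speeds <- data.frame(
  maxspeed = c(55, 60, 58, 62, 74, 78, 71, 77, 66, 63, 69, 64),
  vehicle  = factor(rep(c("sedan", "SUV", "van"), each = 4))
)

# Fit the one-way ANOVA and extract the ANOVA table.
model <- aov(maxspeed ~ vehicle, data = speeds)
tab <- summary(model)[[1]]
print(tab)

# The two Df values are the between-groups and residual degrees of freedom.
df_between <- tab[["Df"]][1]
df_within  <- tab[["Df"]][2]
f_value    <- tab[["F value"]][1]
p_value    <- tab[["Pr(>F)"]][1]

cat(sprintf("F(%d, %d) = %.2f, p = %.3f\n",
            df_between, df_within, f_value, p_value))
```

The F value is the ratio of the two Mean Sq entries in the table (between-groups divided by residual).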

Pairwise comparison of means


Rcmdr> local({
Rcmdr+ .Pairs <- glht(AnovaModel.1, linfct = mcp(vehicle = "Tukey"))
Rcmdr+ print(summary(.Pairs)) # pairwise tests
Rcmdr+ print(confint(.Pairs)) # confidence intervals
Rcmdr+ print(cld(.Pairs)) # compact letter display
Rcmdr+ old.oma <- par(oma=c(0,5,0,0))
Rcmdr+ plot(confint(.Pairs))
Rcmdr+ par(old.oma)
Rcmdr+ })

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = maxspeed ~ vehicle, data = Dataset2)

Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
SUV - sedan == 0 16.500 4.585 3.599 0.00346 **
van - sedan == 0 6.482 4.480 1.447 0.33134
van - SUV == 0 -10.018 4.480 -2.236 0.08240 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = maxspeed ~ vehicle, data = Dataset2)

Quantile = 2.4749
95% family-wise confidence level

Linear Hypotheses:
Estimate lwr upr
SUV - sedan == 0 16.5000 5.1521 27.8479
van - sedan == 0 6.4818 -4.6052 17.5688
van - SUV == 0 -10.0182 -21.1052 1.0688

sedan SUV van
"a" "b" "ab"

Interpretation: Because the one-way ANOVA was significant, we may examine the linear
contrasts (pairwise comparisons) to determine which vehicle types differed. In the first
section labeled Multiple Comparisons of Means: Tukey Contrasts, the results of
comparisons between each pair of levels of the independent variable are shown. The left-
hand column identifies the levels being compared and the right-hand column shows the p-
value for that comparison. Each p-value is followed by a code that allows you to quickly
identify whether the difference tested on that line is statistically significant. The possible
codes are listed on the line labeled Signif. codes. For example, if ** appears after a p-
value, the difference between means tested on that line is significant at the .01 level of
significance. In the example output shown above, there was a significant difference (p <
.01) between maxspeed means of SUVs and sedans and a marginally significant difference
(p = .08) between maxspeed means of vans and SUVs. The maxspeed means of vans and
sedans did not significantly differ (p = .33).

This is also summarized at the very bottom of the output (the compact letter display), in
which sedan is given the symbol "a", SUV is given the symbol "b", and van is given the
symbol "ab". Because "ab" (van) shares a letter with both "a" (sedan) and "b" (SUV), the
maxspeed mean of van is not significantly different from that of sedan or SUV. Because
"a" (sedan) and "b" (SUV) share no letter, the maxspeed mean of sedan is significantly
different from that of SUV.
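
Rcmdr runs these comparisons with glht() from the multcomp package. As an alternative sketch that needs no extra packages, base R's TukeyHSD() performs a closely related Tukey procedure (its results need not match the glht() output exactly). The data below are made up for illustration and the object names are our own.

```r
# Made-up data for illustration only; NOT the guide's Dataset2.
speeds <- data.frame(
  maxspeed = c(55, 60, 58, 62, 74, 78, 71, 77, 66, 63, 69, 64),
  vehicle  = factor(rep(c("sedan", "SUV", "van"), each = 4))
)

model <- aov(maxspeed ~ vehicle, data = speeds)

# Tukey pairwise comparisons with 95% family-wise confidence intervals.
tukey <- TukeyHSD(model, "vehicle", conf.level = 0.95)
print(tukey)

# One row per pair; columns are diff, lwr, upr, and the adjusted p-value.
pairs <- tukey$vehicle
significant   <- pairs[, "p adj"] < 0.05
excludes_zero <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0

# A pair is significant at .05 when its 95% interval excludes 0.
print(cbind(significant, excludes_zero))
```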

95% family-wise confidence level graph
A depiction of the 95% confidence intervals for the pairwise differences between sample
means will be automatically displayed in a separate window. If an interval does not
contain 0, you can conclude with 95% confidence that the two means are not equal (i.e., the
difference between means is not equal to 0). In the example figure shown below, the
confidence interval for the SUV - sedan comparison does not include 0, which is consistent
with the difference between the SUV and sedan means being statistically significant, as
shown by the post hoc test. If you wish to save this image, right click, and select Copy as
metafile. The metafile will have higher resolution than the bitmap option, so it may be
more appropriate for use as a figure in a paper.

Two-Way Analysis of Variance (ANOVA)
Testing main effects and interactions with multi-way ANOVAs
For this example, we would set up our data as shown in the following table.

Statistics → Means → Multi-way ANOVA

A Multi-Way Analysis of Variance window will pop up. You may elect to change the
model name in the Enter name for model: text box or you may keep the default name of
AnovaModel.1.
-Under Factors (pick one or more): Select the factors whose main effects and interaction
you would like to test. These are the independent variables. Only character variables
appear in this list. Hold down the Ctrl key to select more than one variable. Hold down
the Shift key to select several variables that are listed directly next to each other.
-Under Response Variable (pick one): Select the variable for which you want to compare
the sample means. This will be the dependent variable. Only numeric variables appear in
this list.

Click OK. The output will appear in your RStudio console window.
ANOVA output
Rcmdr> AnovaModel.1 <- (lm(maxspeed ~ driver*vehicle, data=Dataset3))

Rcmdr> Anova(AnovaModel.1)
Anova Table (Type II tests)

Response: maxspeed
Sum Sq Df F value Pr(>F)
driver 0.02 1 0.0002 0.988984
vehicle 1383.46 2 6.2116 0.006456 **
driver:vehicle 159.31 2 0.7153 0.498777
Residuals 2784.00 25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The main effect of driver is not significant; in other words, the maxspeed
means do not significantly differ with the experience of the driver (new or old), as p is
greater than alpha = .05, F(1, 25) < 0.001, p = .99.
The main effect of vehicle is significant; in other words, the maxspeed means do differ
with the type of vehicle (sedan, SUV, or van), as p is less than alpha = .05,
F(2, 25) = 6.21, p < .01.
The interaction of driver and vehicle is not significant; in other words, the effect of
vehicle on the maxspeed means does not significantly depend on driver
(driver:vehicle), as p is greater than alpha = .05, F(2, 25) = 0.72, p = .50.
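
The degrees of freedom in the statements above come from the Df column of the ANOVA table. The sketch below shows where they can be read off in the R console. It rests on several assumptions: the data are made up and balanced (three observations per cell), the object names are our own, and it uses base R's anova(), which gives sequential (Type I) tests rather than the Type II tests from Anova() that Rcmdr reports. For a balanced design such as this one, the two types of tests agree.

```r
# Made-up, balanced data for illustration only; NOT the guide's Dataset3.
# In expand.grid, rep varies fastest, then driver, then vehicle.
cars <- expand.grid(rep = 1:3,
                    driver = c("new", "old"),
                    vehicle = c("sedan", "SUV", "van"))
cars$maxspeed <- c(56, 59, 57,  60, 58, 62,   # sedan: new, old
                   74, 72, 76,  77, 75, 79,   # SUV:   new, old
                   68, 66, 70,  62, 64, 63)   # van:   new, old

# Two-way model with both main effects and their interaction.
model <- lm(maxspeed ~ driver * vehicle, data = cars)
tab <- anova(model)
print(tab)

# Cell means for each combination of driver and vehicle.
cell_means <- with(cars, tapply(maxspeed, list(driver, vehicle), mean))
print(cell_means)
```

Each F test is reported with its own Df and the Residuals Df, for example F(2, 12) for the vehicle effect in this made-up data set.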

Descriptive statistics that are automatically outputted
Rcmdr> with(Dataset3, (tapply(maxspeed, list(driver, vehicle), mean,
Rcmdr+ na.rm=TRUE))) # means
sedan SUV van
new 57.0 73.6 68.0
old 60.4 76.8 61.8

Rcmdr> with(Dataset3, (tapply(maxspeed, list(driver, vehicle), sd,
Rcmdr+ na.rm=TRUE))) # std. deviations
sedan SUV van
new 8.514693 7.092249 12.58571
old 11.282730 11.798305 10.42593

Rcmdr> with(Dataset3, (tapply(maxspeed, list(driver, vehicle),
Rcmdr+ function(x) sum(!is.na(x))))) # counts
sedan SUV van
new 5 5 6
old 5 5 5

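Interpretation: The three tables show, for each combination of driver and vehicle, the
means (labeled # means), the standard deviations (labeled # std. deviations), and the
sample sizes (labeled # counts).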

Graphing interactions with multi-way ANOVAs
Graphs → Plot of means

A Plot Means window will pop up.


-Under Factors (pick one or more): Select two or more factors to plot. These are the
independent variables. Only character variables appear in this list. Hold down the Ctrl
key to select more than one variable. Hold down the Shift key to select several variables
that are listed directly next to each other.
-Under Response Variable (pick one): Select the variable for which you want the means
to be calculated. This will be the dependent variable. Only numeric variables appear in this
list.

Click on the Options tab (next to the Data tab).
-Under Error Bars: The default option is Standard errors. To simplify the graph, we
changed the selection to No error bars.
-Under Plot Labels: The default x-axis label is the name of the factor that comes first
alphabetically. The default y-axis label is mean of [the name of your response variable].
The default Graph title is Plot of Means. To change any of these default labels, click on the
white boxes to type in your new labels.

Click OK. A graph will automatically be outputted in a separate window. If you wish to
save this image, right click, and select Copy as metafile. The metafile will have higher
resolution than the bitmap option.

Linear regression
Perform this analysis when you want to find the equation of the best-fitting straight line to
a scatterplot of data involving a predictor (X) variable and a criterion (Y) variable. Finding
the best-fitting line amounts to finding a and b (the regression coefficients) in the equation
Y = a + bX, where a is the Y-intercept and b is the slope of the best-fitting line. In addition,
the standard error of estimate (sY.X) is a measure of the spread of the points in the
scatterplot about the regression line (the typical error of predictions made with the
regression equation).

For this example, we would set up our data as shown in the following table.

Click on Statistics → Fit models → Linear regression

A Linear Regression window will pop up. You may elect to change the model name in the
Enter name for model: text box or you may keep the default name of RegModel.1. This
window is set up differently from the others because the response variable (criterion) is on the
left-hand side, not the right.
-Under Response variable (pick one): Select one variable that you want to serve as the
response variable (Y; criterion, or predicted variable). Only numeric variables appear in
this list.
-Under Explanatory Variable (pick one or more): Select the variable(s) that you want to
serve as the predictor variable (X). Only numeric variables appear in this list.

Click OK. The output will appear in your RStudio console window as shown on the next
page.

Rcmdr> RegModel.1 <- lm(traveltime~maxspeed, data=Dataset3)

Rcmdr> summary(RegModel.1)

Call:
lm(formula = traveltime ~ maxspeed, data = Dataset3)

Residuals:
Min 1Q Median 3Q Max
-5457 -4356 -3140 3045 21865

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4866.1 6938.9 0.701 0.489
maxspeed 12.1 103.0 0.117 0.907

Residual standard error: 6775 on 29 degrees of freedom
Multiple R-squared: 0.0004755, Adjusted R-squared: -0.03399
F-statistic: 0.0138 on 1 and 29 DF, p-value: 0.9073

Interpretation: The regression coefficients are shown in the Estimate column of the
Coefficients: section of the output. The Y-intercept (a) is shown in the
(Intercept) row and the slope (b) is shown in the maxspeed row. In this example, a =
4866.1 and b = 12.1, so the equation of the best-fitting line is Y = 4866.1 + 12.1 X. The
standard error of estimate (sY.X) is shown in the Residual standard error: row. In
this example, sY.X = 6775.

There are several ways to test the significance of the results of the linear regression. The
question is whether the predictor variable predicts the criterion variable. One way to
answer this is to estimate the proportion of variance in one variable associated with
variance in the other variable (which you may recognize as the correlation coefficient
squared). This is expressed in the output as Multiple R-squared; in this case,
0.0004755. The p-value shown on the following line of the output tests whether the
Multiple R-squared value is significantly different from zero. In this case, p = 0.9073,
which is greater than alpha = .05, so there is not a significant relationship between
maxspeed and traveltime.

Another way to ask whether maxspeed significantly predicts traveltime is to test
whether the regression coefficients a and b are significantly different from zero. The
p-values for these tests are shown in the Pr(>|t|) column. For the Y-intercept, p = 0.489,
which indicates that the Y-intercept is not significantly different from zero. Likewise, for
the slope, p = 0.907, which indicates that the slope is not significantly different from zero.
(The p-value testing the significance of the slope is equal to the p-value testing the
significance of the Multiple R-squared because they are identical statistical tests for
linear regression involving only one predictor variable.)
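
The quantities discussed above can also be extracted from the fitted model in the R console. The following is a minimal sketch with made-up values (the data frame trips is an illustrative assumption, not Dataset3, and the object names are our own). It also checks that, with one predictor, the p-value for the slope equals the p-value for the overall F test.

```r
# Made-up data for illustration only; NOT the guide's Dataset3.
trips <- data.frame(
  maxspeed   = c(55, 60, 62, 68, 70, 74, 77, 80),
  traveltime = c(130, 118, 121, 110, 104, 101, 97, 92)
)

model <- lm(traveltime ~ maxspeed, data = trips)
fit <- summary(model)

a    <- unname(coef(model)["(Intercept)"])  # Y-intercept
b    <- unname(coef(model)["maxspeed"])     # slope
s_yx <- fit$sigma                           # standard error of estimate
r_squared <- fit$r.squared                  # Multiple R-squared

# p-value for the slope, from the Coefficients table.
slope_p <- fit$coefficients["maxspeed", "Pr(>|t|)"]

# p-value for the overall F test, printed on the last line of the summary.
f <- fit$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

cat(sprintf("Y = %.2f + %.2f X, s = %.2f, R-squared = %.4f\n",
            a, b, s_yx, r_squared))
cat("slope p:", slope_p, " overall p:", overall_p, "\n")
```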

Chi-square
Perform a chi-square test of independence when you want to test whether there are
significant differences between the proportions of observations that fall into different
categories of a nominal variable for different groups. For example, suppose we want to
determine whether freshman and sophomore students differ in their choices of lunch
options. Suppose that in this simple example, we are only examining freshman and
sophomore students and the only lunch options are pizza and salad. For each student who
enters the cafeteria, we record whether the student is a freshman or sophomore and
whether the student chooses pizza or salad. Note that both variables are measured on
nominal scales of measurement. Therefore, the question is whether different proportions
of freshman and sophomore students choose pizza and salad.

There are two ways to test this question in Rcmdr. We can either (1) enter the raw data for
individual observations or (2) enter the frequency counts (i.e., the total numbers of
freshmen and sophomores who choose each type of lunch food). We describe both
methods below.

Chi-square using raw data


For this example, we would set up our data as shown in the following table. Each line
represents a different subject that we observed. The values in each line show the food
option chosen by that subject (pizza or salad) and the class to which the subject belongs
(freshman or sophomore).

IMPORTANT DATA-FORMATTING NOTE:


Rcmdr will only make the chi-square analysis available if Rcmdr can identify at least two
character variables in your data set. In order for the Two-way table option to be
available in the Rcmdr menu, your data must have at least two variables (columns) that
contain character data (rather than numeric data). If the active data set does not contain at
least two character variables, the Two-way table option will be grayed out (not
selectable) in the Rcmdr menu.

Click on Statistics → Contingency tables → Two-way table

A Two-Way Table window will pop up. Select one variable under Row variable (pick
one) and another variable under Column variable (pick one). Only character variables
appear in these lists. (The statistical outcome will not change if you interchange your row
and column variables, but the results will be formatted differently in the output.)

Click on the Statistics tab (next to the Data tab) to select which percentages you would
like to appear.
-Under Compute Percentages: To simplify the outcome, No percentages is the default.
However, if you would like percentages in addition to frequencies to be outputted, then you
may select one of the other options.
-Under Hypothesis Tests: Chi-square test of independence is the default option. To
simplify the output, you may leave Components of chi-square statistic and Print
expected frequencies unchecked. If your data set is small enough that a cell may have
an expected frequency of less than five, then select Fisher's exact test.

Click OK. The output will appear in the RStudio console window.
Rcmdr> local({
Rcmdr+ .Table <- xtabs(~class+lunch, data=Dataset4)
Rcmdr+ cat("\nFrequency table:\n")
Rcmdr+ print(.Table)
Rcmdr+ .Test <- chisq.test(.Table, correct=FALSE)
Rcmdr+ print(.Test)
Rcmdr+ })

Frequency table:
lunch
class pizza salad
freshman 12 16
sophomore 8 24

Pearson's Chi-squared test

data: .Table
X-squared = 2.1429, df = 1, p-value = 0.1432

Interpretation: Lunch preferences (pizza or salad) do not significantly differ by class
(freshman or sophomore), as p is greater than an assumed value of alpha = .05. In APA
format, we would write the results: χ²(1, N = 60) = 2.14, p = .14.
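
To see how raw observations turn into the frequency table, the sketch below rebuilds a raw data set with one row per student, using the cell counts from the Frequency table: output above, and then runs the same test in the R console. The data frame name lunch_data and the other object names are our own.

```r
# One row per student, rebuilt from the cell counts shown in the output above.
cell_counts <- c(12, 16, 8, 24)
lunch_data <- data.frame(
  class = rep(c("freshman", "freshman", "sophomore", "sophomore"),
              times = cell_counts),
  lunch = rep(c("pizza", "salad", "pizza", "salad"),
              times = cell_counts)
)

# Cross-tabulate the raw data, then test independence of class and lunch.
counts <- xtabs(~ class + lunch, data = lunch_data)
print(counts)

test <- chisq.test(counts, correct = FALSE)
print(test)

cat("N =", sum(counts), "\n")
```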

Chi-square using frequency counts


If you already have the frequency counts of each cell of your frequency table (as in the
Frequency table: output above), then you can enter those counts directly rather than
using the raw data method to enter data for each individual subject.

Click on Statistics → Contingency tables → Enter and analyze two-way table

An Enter Two-Way Table window will pop up.

-Next to Number of Rows: Adjust the number of rows by sliding the horizontal bar. The
default is 2 rows.
-Next to Number of Columns: Adjust the number of columns by sliding the horizontal
bar. The default is 2 columns.
-Under Enter counts: Change the row and column labels ("1" and "2") to the levels of
your variables (in this example, pizza and salad for the columns and freshman and
sophomore for the rows). Enter the counts in the remaining boxes of the table.

Click OK. The output will appear in your RStudio console window.

Rcmdr> .Table <- matrix(c(12,16,8,24), 2, 2, byrow=TRUE)

Rcmdr> rownames(.Table) <- c('freshman', 'sophomore')

Rcmdr> colnames(.Table) <- c('pizza', 'salad')

Rcmdr> .Table # Counts
pizza salad
freshman 12 16
sophomore 8 24

Rcmdr> .Test <- chisq.test(.Table, correct=FALSE)

Rcmdr> .Test

Pearson's Chi-squared test

data: .Table
X-squared = 2.1429, df = 1, p-value = 0.1432

Interpretation: The output is identical to that obtained using the raw data method. Lunch
preferences (pizza or salad) do not significantly differ by class (freshman or sophomore),
as p is greater than an assumed value of alpha = .05. In APA format, we would write the
results: χ²(1, N = 60) = 2.14, p = .14.
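
For readers who want to see where X-squared comes from, the sketch below computes the statistic by hand from the counts above: each expected frequency is (row total × column total) / N, and the statistic is the sum of (observed - expected)² / expected over all cells. The object names are our own.

```r
# Observed counts from the table above.
observed <- matrix(c(12, 16, 8, 24), nrow = 2, byrow = TRUE,
                   dimnames = list(class = c("freshman", "sophomore"),
                                   lunch = c("pizza", "salad")))

# Expected frequency for each cell: row total * column total / N.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(expected)

# Chi-square statistic, degrees of freedom, and p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

cat(sprintf("X-squared = %.4f, df = %d, p-value = %.4f\n",
            chi_sq, df, p_value))
```

Printing the expected frequencies is also a quick way to check whether any cell falls below five, in which case Fisher's exact test may be the better choice.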

ADVANCED ANALYSES

Descriptive statistics for sub-groups


You can output descriptive statistics for sub-groups of your data. To do so, you can
identify a grouping variable and Rcmdr will output descriptive statistics for different values
of that variable. For example, suppose we have data on test scores and we want to see how
students in different classes (e.g., freshman or sophomore) performed on a given test
(e.g., test1). In the data below, you can see that there is a variable called class that
indicates whether the student is in the freshman or the sophomore class.

Click on Statistics → Summaries → Numerical summaries

The Numerical Summaries window will appear. In the Numerical Summaries window,
click on the variable(s) for which you want descriptive statistics (in this example, test1).

Click on the Statistics tab and select which statistics you would like to calculate (see p.
15). After you have chosen at least one variable and you have chosen which descriptive
statistics you want, click on the Data tab (if it is not already selected) and click
Summarize by groups.

A Groups window will pop up. This list contains all of the character variables in your data
set. You may select one, and only one, variable to serve as the basis of groups in your
output (in this example, class). Rcmdr will compute descriptive statistics separately for
each unique value of the Groups variable (in this example, freshman and sophomore).

Click OK. The output will appear in the RStudio console window.

Rcmdr> numSummary(Dataset[,"test1"], groups=Dataset$class,
Rcmdr+ statistics=c("mean", "sd", "IQR", "quantiles"),
Rcmdr+ quantiles=c(0,.25,.5,.75,1))
mean sd IQR 0% 25% 50% 75% 100% data:n
freshman 19.85714 4.598136 6.0 12 17.5 20 23.5 25 7
sophomore 19.00000 3.891382 6.5 14 15.0 20 21.5 24 8

Interpretation: Each row of the output shows the descriptive statistics for one of the sub-
groups. For example, the mean for those in the freshman class is equal to 19.86 while the
mean for those in the sophomore class is equal to 19.00. The column labeled data:n
shows the sample size in each group. In this example, there are 7 individuals in the
freshman group and 8 individuals in the sophomore group.
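
The numSummary() function used above is supplied by Rcmdr's supporting packages. As a sketch of the same idea in base R, tapply() applies a function separately to each sub-group. The scores below are made up for illustration (they are not the data summarized above), and the object names are our own.

```r
# Made-up scores for illustration only; NOT the data set used in this guide.
scores <- data.frame(
  class = c(rep("freshman", 4), rep("sophomore", 5)),
  test1 = c(12, 20, 23, 25, 14, 15, 20, 22, 24)
)

# Apply each statistic separately within each value of class.
group_mean <- tapply(scores$test1, scores$class, mean)
group_sd   <- tapply(scores$test1, scores$class, sd)
group_n    <- tapply(scores$test1, scores$class, length)

# One row per sub-group, one column per statistic.
summary_table <- cbind(mean = group_mean, sd = group_sd, n = group_n)
print(summary_table)
```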

Correlation test: Testing the significance of correlations
When you compute a correlation matrix (see p. 18), you can compute correlations between
many different variables at one time and you can obtain the p-values to test whether those
correlations are significantly different from zero. In this analysis, the correlation test, you
can compute the correlation between only two variables at one time. However, in the
correlation test analysis, the Student's t statistic used to calculate the p-value, as
well as a 95% confidence interval for r, are also output.

Click on Statistics → Summaries → Correlation test

A Correlation Test window will pop up. Select two, and only two, variables that you
would like to analyze. Only numeric variables appear in this list. Hold down the Ctrl key
to select more than one variable. Hold down the Shift key to select variables that are
listed directly next to each other.
-Under Type of Correlations: Select the type of correlation you would like to compute.
The default is the Pearson product-moment correlation coefficient (r).
-Under Alternative Hypothesis: Select Two-sided if you want the alternative
hypothesis to assess whether r is different from 0 (a non-directional test). Select
Correlation < 0 if you want to test whether r is significantly less than 0. Select
Correlation > 0 if you want to test whether r is significantly greater than 0.

Click OK. The output will appear in the RStudio console window.

Output when Alternative Hypothesis = Two-sided


Rcmdr> with(Dataset, cor.test(test1, test2, alternative="two.sided",
Rcmdr+ method="pearson"))

Pearson's product-moment correlation

data: test1 and test2
t = 8.269, df = 13, p-value = 1.554e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7623700 0.9723368
sample estimates:
cor
0.91665

Interpretation: The Pearson product-moment correlation between test1 and test2 is
0.91665 (i.e., r = 0.92). The correlation coefficient is significantly different from 0, as p is
less than an assumed alpha = .05 (p < .001).

Output when Alternative Hypothesis = Correlation < 0
Rcmdr> with(Dataset, cor.test(test1, test2, alternative="less",
Rcmdr+ method="pearson"))

Pearson's product-moment correlation

data: test1 and test2
t = 8.269, df = 13, p-value = 1
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
-1.0000000 0.9669085
sample estimates:
cor
0.91665

Interpretation: The Pearson product-moment correlation between test1 and test2 is
0.91665 (i.e., r = 0.92). The correlation is not significantly less than 0 because p is greater
than alpha = .05 (p approaches 1). That is, the data do not support a hypothesis that
test2 scores decrease as test1 scores increase. The output also shows that, for this
one-sided test, the 95% confidence interval for the true correlation in the population
extends from -1 to 0.9669085.

Output when Alternative Hypothesis = Correlation > 0


Rcmdr> with(Dataset, cor.test(test1, test2, alternative="greater",
Rcmdr+ method="pearson"))

Pearson's product-moment correlation

data: test1 and test2
t = 8.269, df = 13, p-value = 7.77e-07
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
0.7979031 1.0000000
sample estimates:
cor
0.91665

Interpretation: The Pearson product-moment correlation between test1 and test2 is
0.91665 (i.e., r = 0.92). The correlation is significantly greater than 0 because p is less than
alpha = .05 (p < .001). That is, we have evidence supporting our hypothesis that test2
scores increase as test1 scores increase. The output also shows that, for this one-sided
test, the 95% confidence interval for the true correlation in the population extends from
0.7979031 to 1.
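
The t statistic and p-values in these outputs follow from r and the degrees of freedom alone: t = r × sqrt(df) / sqrt(1 - r²), with df = n - 2. The sketch below reproduces them from the values reported above (r = 0.91665, df = 13), and rebuilds the two-sided confidence interval with the Fisher z transformation, which is the method documented for cor.test(). The object names are our own.

```r
# Values taken from the output above.
r  <- 0.91665
df <- 13          # n - 2, so n = 15
n  <- df + 2

# t statistic for testing whether the correlation differs from 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# p-values for the three alternative hypotheses.
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
p_less      <- pt(t_stat, df)                      # Correlation < 0
p_greater   <- pt(t_stat, df, lower.tail = FALSE)  # Correlation > 0

# Two-sided 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat(sprintf("t = %.3f\n", t_stat))
cat("p (two-sided, less, greater):", p_two_sided, p_less, p_greater, "\n")
cat("95% CI:", ci, "\n")
```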

Graphs
Scatterplot
A scatterplot typically is used to show the relationship between two variables that were
measured from a group of subjects. The researcher often computes a correlation
coefficient or performs a linear regression to further describe the relationship between the
two variables.

Click on Graphs → Scatterplot

A Scatterplot window will pop up. Select one x-variable and one y-variable. Only
numeric variables appear in these lists.

Click on the Options tab (next to the Data tab).
-Under Plot Options: Deselect all of the checked boxes. The bottom four options are
selected by default, but deselecting them will simplify your scatterplot.
-Under Identify Points: Automatically is the default. Change the default to Do not
identify to simplify your scatterplot.
-Under Plot Labels and Points: These controls allow you to customize the axis labels
and other elements that affect how your scatterplot is displayed.

Click OK. A scatterplot will be displayed in a separate window. If you wish to copy and
paste the scatterplot (into a Word document, for example), right click, and select Copy as
metafile. (The metafile will have higher resolution than the bitmap option.) You can then
click on a Word document and press Ctrl+v to paste the image.

Scatterplot by groups
Use this option if you want to plot several scatterplots with different symbol types on a
single set of axes.

Follow the above directions to return to the Scatterplot window. Select the Plot by
groups button.

A Groups window will pop up. Select the variable in your data that you would like to use
as the basis of the different groups (symbol types). Be sure that the Plot lines by group
box is checked in order to see a separate icon for each level of the chosen variable, which is
vehicle in this case. Click OK to return to the Scatterplot window and to make the
necessary changes to the Options tab, as mentioned in the Scatterplot explanation in the
section above.

After the default options in the Options tab have been changed, click OK. A scatterplot
will be displayed in a separate window. If you wish to copy and paste the scatterplot (into
a Word document, for example), right click, and select Copy as metafile. (The metafile
will have higher resolution than the bitmap option.) You can then click on a Word
document and press Ctrl+v to paste the image.

Line graph
Construct a line graph when you want to show how the value of one variable (the y-
variable) changes as the value of another variable (the x-variable) increases. (The line
graph function of Rcmdr can plot several lines on one graph to depict how several y-
variables change as the x-variable increases.)

Your data set must have one variable (column of values) to be plotted on the x-axis and at
least one other variable (column of values) to be plotted on the y-axis. Note the following
two points about the way Rcmdr handles your data in plotting a line graph:

1. Rcmdr will plot the x-y pairs in the order that they appear in rows of your data set (from
top to bottom) so you should be sure that the values of your x-variable appear in ascending
order in your data set. Rcmdr will give you a warning before drawing the line graph if your
x-values are not in order. If you choose to have Rcmdr draw the line graph anyway, the line
that it plots will zigzag back and forth to follow the ordering of the x-variable.

2. Rcmdr will plot a point on the line graph for the x-y pair in each row of your data set, so
if you have repeated x-values in your data set, Rcmdr will plot multiple points for that x-
value on the line graph (and the line graph will zigzag back and forth or it will contain
vertical segments).
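
Both points can be checked in the R console before drawing the graph. The sketch below uses a small made-up data set whose rows are deliberately out of order (the data frame weather and its values are illustrative assumptions, and the object names are our own). It tests whether the x-values are in ascending order, sorts the rows, checks for repeated x-values, and then draws a line graph with base R's plot().

```r
# Made-up data with rows deliberately out of order; for illustration only.
weather <- data.frame(
  time        = c(3, 1, 4, 2, 5),
  temperature = c(18, 12, 20, 15, 19)
)

# Point 1: are the x-values already in ascending order?
is_sorted <- !is.unsorted(weather$time)

# Sort the rows by the x-variable so the line does not zigzag.
weather_sorted <- weather[order(weather$time), ]

# Point 2: are any x-values repeated?
has_repeats <- anyDuplicated(weather_sorted$time) > 0

# Draw the line graph (points joined by lines) from the sorted data.
plot(weather_sorted$time, weather_sorted$temperature, type = "b",
     xlab = "time", ylab = "temperature")
```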

Suppose we want to construct a line graph to depict how temperature changes as time
passes. In the sample data set below, the x-variable (time) increases from the top of the
data set to the bottom and there is only one entry for each time value.

Graphs → Line graph

A Line Plot window will pop up. Select one x-variable (independent variable) and one or
more y-variables (dependent variables). Only numeric variables appear in these lists.

Click OK. A graph will automatically be outputted in a separate window. If you wish to
copy and paste the graph (into a Word document, for example), right click, and select Copy
as metafile. (The metafile will have higher resolution than the bitmap option.) You can
then click on a Word document and press Ctrl+v to paste the image.

Bar graph
A bar graph depicts the number of times each value of a nominal variable appears in your
data set.

Suppose that for the data set shown below, we want to graphically show the number of
people who chose pizza for lunch and the number of people who chose salad for lunch. The
variable for which you want to plot counts must have character values (e.g., pizza and
salad) rather than numeric values (e.g., 1 and 2).

(Note that Rcmdr can only show counts for different values of one of the nominal variables
in the data set. For example, Rcmdr can show the numbers of students who chose pizza
and salad. Rcmdr cannot, for example, show the numbers of freshmen who chose pizza and
salad and the numbers of sophomores who chose pizza and salad.)

Graphs → Bar graph

A Bar Graph window will pop up. Select one variable under Variable (pick one). Rcmdr
will count the number of times the different values appear under that variable.

Click OK. A graph will be displayed in a separate window. If you wish to copy and paste
the graph (into a Word document, for example), right click, and select Copy as metafile.
(The metafile will have higher resolution than the bitmap option.) You can then click on a
Word document and press Ctrl+v to paste the image.
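
The counting that Rcmdr performs for a bar graph can also be sketched in base R. The vector name lunch and its values are hypothetical, and this is an illustration rather than the code Rcmdr itself runs.

```r
# Hypothetical data: one lunch choice per person, stored as character values
lunch <- c("pizza", "salad", "pizza", "pizza", "salad")

# table() counts how many times each value appears
counts <- table(lunch)
print(counts)

# barplot() draws one bar per value, with bar height equal to the count
barplot(counts, xlab = "lunch", ylab = "Frequency")
```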

Miscellaneous
Opening and Entering data
There are several ways in which you can open and enter data in Rcmdr. Instructions
beginning on page 9 describe opening a .csv spreadsheet. You can also open an Excel
spreadsheet (.xls or .xlsx file) or you can type your data directly into a spreadsheet in
Rcmdr. Both methods are described below.

Opening an Excel spreadsheet


(Not all versions of Rcmdr are the same. Some versions may not show the option for
opening an Excel file that is described below.)

Data → Import data → from Excel file

An Import Excel Data Set window will pop up. The default data set name is Dataset.
You may change this by clicking in the box. By default, Rcmdr will use the first row of the
Excel spreadsheet as the variable names. Uncheck the first box if you do not want this to
happen. If the first column of your spreadsheet contains names for the rows/observations
(which is not typical), check the second box as well. By default, this is unchecked.

Click OK to create the new data set. You can click on the Edit data set or View data set
buttons in the Rcmdr window to edit and view, respectively, the active data set. (See p. 11.)

Entering data directly into Rcmdr
You can create a data set and type your data directly into Rcmdr.

Data → New data set

A New Data Set window will pop up.
Press OK

A Data Editor window will pop up.


If the Data Editor does not pop up, then you most likely have an illegal character in the
data set name. In your RStudio console, you might see an error message like one of the ones
below. If so, correct the name, and then press OK again.
RcmdrMsg: [1] ERROR: "data set" is not a valid name.
RcmdrMsg: [2] ERROR: "Dataset!" is not a valid name.
RcmdrMsg: [3] ERROR: "data$set" is not a valid name.
RcmdrMsg: [4] ERROR: "data,set" is not a valid name.
RcmdrMsg: [5] ERROR: "12set" is not a valid name.
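
The rejected names share one problem: none of them is a syntactically valid R name (they contain a space, punctuation, or a leading digit). As a rough check, base R's make.names() rewrites any string into a valid name, so a name that comes back unchanged is valid. It is an assumption that this matches Rcmdr's own check exactly; the last two candidate names are made up for illustration.

```r
# Candidate data set names, including the five rejected above
candidates <- c("data set", "Dataset!", "data$set", "data,set", "12set",
                "Dataset", "my_data2")

# A name is syntactically valid if make.names() leaves it unchanged
is_valid <- candidates == make.names(candidates)
print(data.frame(name = candidates, valid = is_valid))
```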

You can edit values in the Data Editor window by clicking on a cell and typing and/or
deleting as necessary. For example, click on the column heading in order to change the
variable name. Variable names follow the same guidelines as data set names. Be careful
when double clicking on any cell, as the value will then be changed to NA, and there is no undo
option, so you will permanently delete whatever value was in that cell.

To add more variables, select the Add column button.

To add more observations (i.e., subjects), select the Add row button.

Click OK when you are done to save your changes in the new data set and to exit the Data
Editor.

If you enter only numbers in a column, then Rcmdr will recognize the variable as
numeric. If you enter characters (i.e., at least some letters) in a column, then Rcmdr will
recognize the variable as character.
The distinction between numeric and character variables is very important because Rcmdr
will not allow you to run any statistical analyses (i.e., the variable will not appear or the
statistical test will be grayed out) for which the appropriate variable type is not set.
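
If you are unsure how R has classified each variable, you can check from the console. The sketch below uses a hypothetical data set named Dataset with made-up values.

```r
# Hypothetical data set with one numeric and one character column
Dataset <- data.frame(score = c(12, 15, 9),
                      lunch = c("pizza", "salad", "pizza"),
                      stringsAsFactors = FALSE)

# sapply() applies class() to each column and reports its type
column_types <- sapply(Dataset, class)
print(column_types)
```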

Recoding variables
Imagine you have a numeric variable that indicates the maximum speed at which each
subject drives a car during an experiment. Suppose you want to group the subjects into a
few categories based on their maximum driving speed; for example, subjects who drove 50
mph or greater would be categorized as fast and those driving 49 mph or less would be
slow. This requires you to recode the variable from a continuous, numerical variable to a
categorical variable (or factor, in Rcmdr language). Follow these steps to recode the
variable:

Data → Manage variables in active data set → Recode variables

A Recode Variables window will pop up. Select the variable that you want to recode, in
this case, maxspeed. The default setting is that each new variable will be a factor, meaning
that the variable will no longer be numeric, but categorical. This is the most common
reason to recode variables. If you want to recode your variables to numeric data, then
uncheck the box. The default variable name is variable, but you may change that by
clicking on the text box and typing.

Below, we have chosen to name the new variable speed. Type the directions to recode
the variable in the large box at the bottom of the window.
50:hi = "fast": This code indicates that we want maxspeed values of 50 or more (50:hi, i.e.,
maxspeed = 50 to the highest value of maxspeed) to be coded as fast in the new
variable called speed.
else = "slow": All other maxspeed values (i.e., values less than 50) will equal slow.

Alternatively, this code could have been used:


lo:49 = "slow": This code indicates that we want maxspeed values of 49 or less (lo:49, i.e.,
maxspeed = the lowest value of maxspeed to 49) to be coded as slow in the new
variable called speed.
else = "fast": All other maxspeed values (i.e., values of 50 or greater) will equal fast.

Click OK. If you view your data set, you will see that a new variable speed has been
created with two levels: fast and slow.

In general, when recoding a variable, indicate the range of values that you want to assign to
different categories by separating them with a colon, :. For example, we could have
created three categories (fast, medium, and slow) in this way:

50:hi = "fast": This code indicates that we want maxspeed values of 50 or more to be
coded as fast in the new variable called speed.
40:49 = "medium": maxspeed values from 40 to 49 will be assigned a value of medium.
else = "slow": All other maxspeed values (i.e., values less than 40) will equal slow.
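
The three-category recode can also be sketched in base R. This is not the code that the Recode Variables window produces; it is a plain equivalent that assumes the speeds are whole numbers, and the maxspeed values are hypothetical.

```r
# Hypothetical maximum speeds (whole numbers, in mph)
maxspeed <- c(35, 40, 49, 50, 62)

# Nested ifelse(): test the highest range first, then the middle range;
# everything left over is "slow" (the equivalent of else)
speed <- ifelse(maxspeed >= 50, "fast",
                ifelse(maxspeed >= 40, "medium", "slow"))

# Store the result as a factor, as the Recode Variables window does by default
speed <- factor(speed, levels = c("slow", "medium", "fast"))
print(data.frame(maxspeed, speed))
```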

Combining items
This is useful when you want to add, subtract, divide, multiply, etc., different variables to
form a new one.
Data → Manage variables in active data set → Compute new variable

A Compute New Variable window will pop up.

In the New variable name box, type the name of the new variable, e.g., test_total.

We want test_total to be equal to the sum of test1 and test2. We can double-click
test1 in the Current variables (double-click to expression) box to move it under
Expression to compute.

Then we can type in the operation. In this case, because we want to add the two variables,
we type in the + sign.
Other common options:
- = subtract
* = multiply
/ = divide

Then complete the remaining expression (in this case, double-click test2).

Click OK. If you view your data set, you will see that a new variable test_total has
been created as the sum of test1 and test2.
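
The same computation can be typed as a single line of R. The sketch below assumes a data set named Dataset that contains test1 and test2; the scores are hypothetical.

```r
# Hypothetical scores on two tests
Dataset <- data.frame(test1 = c(80, 72, 95), test2 = c(85, 70, 90))

# Create the new variable as the row-by-row sum of the two existing variables
Dataset$test_total <- Dataset$test1 + Dataset$test2
print(Dataset)
```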

Converting variables from numeric to factor items


If you have coded a categorical variable as numeric, then you will find that you will not be
able to run certain analyses. Rcmdr only allows you to run analyses that it thinks are most
appropriate for your data. The simplest way of converting variables from numeric to factor
(categorical) is presented below.

In this example, condition is coded as 1 or 2 for two different experimental conditions.


Though we intend for this to be a categorical or factor variable, Rcmdr is reading these
numbers as numeric measurements.

Data → Manage variables in active data set → Convert numeric variables to factors

A Convert Numeric Variables to Factors window will pop up. Select the variable that you
want to convert under Variables (pick one or more), in this case, condition. Under
Factor Levels, select Use numbers in order to change the 1 and 2 to factors. The
numbers will still appear in the condition column of your data, but they will be coded as
factor instead of numeric data. Click OK.

Your data will look the same, but condition will now be a factor variable.
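
In code, the conversion amounts to one call to as.factor(). The sketch below uses a hypothetical data set named Dataset with made-up values.

```r
# Hypothetical data: condition is coded 1 or 2 and is read as numeric
Dataset <- data.frame(condition = c(1, 2, 1, 2), score = c(10, 14, 9, 13))
was_numeric <- is.numeric(Dataset$condition)

# Convert the variable to a factor; the values still display as 1 and 2
Dataset$condition <- as.factor(Dataset$condition)
print(levels(Dataset$condition))
```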

Coding in R
If you are feeling adventurous, you can try entering your own code into the R Script box
in the R Commander window or in the Console window of RStudio. This allows you to
exert greater control (or command) over R, but it is not necessary for most basic
analyses. Some examples are shown in some of the following sections.

Deleting data sets


You may want to delete data sets from your session in order to unclutter your list of
available data sets. To do this, type rm(Dataset) into the R Script box, in which
Dataset = the name of the dataset that you wish to delete; in this case, the name of the
data set is Dataset, but you will have most likely given your unwanted data set a more
creative name.
Click the Submit button.

Your RStudio Console will say: Rcmdr> rm(Dataset).
If you do not have other data sets and if you try to click on the word Dataset next to Data
set: in the Rcmdr window, you will get these error messages in RStudio:
RcmdrMsg: [1] ERROR: the dataset Dataset is no longer available
RcmdrMsg: [2] ERROR: There are no data sets from which to
choose.

Labeling points in a scatterplot
Imagine we have a dataset named demodata that contains students' names (student),
the number of lectures each student attended (lectures), and each student's score on
exam 1 (exam1). Suppose we would like to create a scatterplot depicting number of
lectures on the x-axis and exam 1 score on the y-axis, with each point in the scatterplot
labeled with the corresponding student's name.

First, create a scatterplot with the point-and-click instructions shown previously in this
guide (p. 56). After you do so, code similar to this should be created by Rcmdr in the R
Script box:
scatterplot(exam1~lectures, reg.line=FALSE, smooth=FALSE,
spread=FALSE, boxplots=FALSE, span=0.5, data=demodata)
This code lists y then x, separated by ~: y~x (exam1~lectures).

Underneath that code, type in:
text(demodata$lectures, demodata$exam1, demodata$student)
This code uses the first variable in the list (lectures) for the x-coordinates of the labels,
the second variable (exam1) for the y-coordinates, and the third variable (student) for the
text of each label. As illustrated here, the order in which you list the variables determines
whether they are used for the x-coordinates, the y-coordinates, or the labels for the points.

Highlight all of the code starting from scatterplot and ending at demodata$student)
then click the Submit button below the Script box.

The graph will pop up in a separate window (but you may have to move your current
window in order to see the graph).
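
Labels drawn with text() are centered on the points by default, which can make them hard to read. The sketch below shows one way to offset them using the pos argument of text(). It uses base R's plot() instead of the scatterplot() function that Rcmdr calls, and the names and scores in demodata are hypothetical.

```r
# Hypothetical data set with the three variables described above
demodata <- data.frame(student = c("Ana", "Ben", "Cal", "Dee"),
                       lectures = c(10, 14, 7, 12),
                       exam1 = c(78, 91, 64, 85),
                       stringsAsFactors = FALSE)

# Base R scatterplot: x-variable first, then y-variable
plot(demodata$lectures, demodata$exam1, xlab = "lectures", ylab = "exam1")

# pos = 3 places each label just above its point instead of on top of it
text(demodata$lectures, demodata$exam1, labels = demodata$student, pos = 3)
```

Other values of pos place the label below (1), to the left of (2), or to the right of (4) each point.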

Repeated-measures ANOVA
Perform a repeated-measures ANOVA when (1) you want to compare the means of three or
more samples of scores and (2) each subject contributes a score to each sample.

For example, imagine that you perform an experiment in which you measure whether
people are more relaxed (1) playing with a puppy, (2) playing with a kitten, or (3) sitting
alone. You recruit 10 subjects. Each subject plays with a puppy for 15 minutes, then
completes a questionnaire that measures how relaxed the subject feels (where higher
scores mean greater relaxation). Then each subject plays with a kitten for 15 minutes and
completes the questionnaire again. Finally, each subject sits alone for 15 minutes, then
completes the questionnaire a third time. Thus each subject contributes a score to the
puppy group, the kitten group, and the alone group. Therefore, there are three groups of
scores, each containing 10 scores (one from each subject). The independent variable in this
experiment is the puppy/kitten/alone condition and the dependent variable is the score
on the relaxation questionnaire.

The data file would appear as shown below. The first column, subject, shows a unique
subject identifier for each subject. (Note that each identifier appears three times because
each subject participated in each of the three conditions.) The second column,
condition, shows whether that row of the table contains the subject's relaxation score
for the puppy, kitten, or alone condition (the IV). The third column, relax, shows the
relaxation score (the DV) for that subject and condition.

You will have to install a special package in RStudio that will let you run a repeated-
measures ANOVA. You will do this in a way that is similar to the way in which you first
installed the Rcmdr package, except this package is called ez instead of Rcmdr. The
package is called ez because it should make your life easier. (Get it? ez = easy! Could
they be any more clever?) Type the following into your RStudio (not Rcmdr) Console
window and press Enter:
install.packages("ez")

Once the package has been installed, return to your Rcmdr window in order to load the
package.
Tools → Load package(s)

A Load Packages window will pop up. Scroll down until you see the ez package. Select
ez then click OK.

Return to your RStudio console window. Type in the following line of code and press
Enter:
options(contrasts=c("contr.sum", "contr.poly"))
Then type in the following line of code (one continuous line of code) and press Enter:
ezANOVA(data=RM_Anova, dv=.(relax), wid=.(subject),
within=.(condition), detailed=TRUE)

The code shown above applies to this specific example. For your data, you should replace
the following items:
Replace RM_Anova with your dataset name.
Replace relax with your dependent variable.
Replace subject with the variable that identifies your subjects.
Replace condition with your independent variable.

Press Enter. The output will appear in your RStudio Console window.
You might first see a warning that says:
Warning: Converting "subject" to factor for ANOVA.
This is okay. This means that R is converting the subject variable from a numeric
variable (because we used numbers to identify individual subjects) into a factor or
categorical variable in order to meet the criteria for an ANOVA.

Underneath the warning (if it appears), you will see your repeated-measures ANOVA
results.
$ANOVA
       Effect DFn DFd       SSn  SSd          F            p p<.05       ges
1 (Intercept)   1   9 6453.3333 48.0 1210.00000 6.624759e-11     * 0.9804319
2   condition   2  18  345.8667 80.8   38.52475 3.132592e-07     * 0.7286517

Interpretation:
The output appears in a table with the following column headings:
Effect = Effect that is tested in each row of the table; we are interested in the row of the
table labeled with the name of the IV (in this case, condition).
DFn = Numerator (or between-groups) degrees of freedom
DFd = Denominator (or within-groups) degrees of freedom
SSn = Numerator (or between-groups) sum of squares
SSd = Denominator (or within-groups) sum of squares
F = F-ratio
p = p-value for the specific effect that we are looking at, which is condition in this case
p<.05 = Rcmdr will put an asterisk (*) in this column if the effect shown in the row is
significant at the .05 level of significance.
ges = Generalized eta-squared measure of effect size.
According to these results, relaxation scores for the participants in this sample differed
significantly depending on whether they interacted with a puppy or kitten or sat alone,
and the effect size was large, F(2, 18) = 38.52, p < .001, generalized η² = .73.
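
To see how the columns of the output relate to one another, the F-ratio, p-value, and generalized eta-squared for condition can be recomputed by hand from the sums of squares. The sketch below copies its numbers from the output above. The generalized eta-squared formula shown is the usual one for a design with a single within-subjects factor and reproduces the ges column here; treat it as an illustration, not a description of the internals of the ez package.

```r
# Values copied from the condition and (Intercept) rows of the output above
SSn <- 345.8667       # sum of squares for condition
SSd <- 80.8           # error sum of squares for condition
DFn <- 2
DFd <- 18
SS_subjects <- 48.0   # SSd from the (Intercept) row

# F-ratio: mean square for the effect divided by mean square for error
F_ratio <- (SSn / DFn) / (SSd / DFd)

# p-value: upper-tail area of the F distribution
p_value <- pf(F_ratio, DFn, DFd, lower.tail = FALSE)

# Generalized eta-squared: effect SS divided by effect SS plus all error SS
ges <- SSn / (SSn + SSd + SS_subjects)

print(round(c(F = F_ratio, ges = ges), 4))
print(p_value)
```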

Post-hoc test for repeated-measures ANOVA
Post-hoc tests for repeated-measures ANOVA require that you install and load another
package. The package is called agricolae. Type the following into your RStudio Console
window and press Enter:
install.packages("agricolae")

Once the package has been installed, return to your Rcmdr window in order to load the
package.
Tools → Load package(s)

A Load Packages window will pop up. Scroll down until you see the agricolae
package. Select agricolae then click OK.

Return to your RStudio window. You need to run the repeated-measures ANOVA in a way
that saves the results as an object. This just involves typing an extra word and an arrow in
front of the repeated-measures ANOVA code that you typed before. Additionally, you
would change the last bit of code to return_aov instead of detailed. So the code
would look like this (the additions are resultsname <- and return_aov=TRUE):
options(contrasts=c("contr.sum", "contr.poly"))
resultsname <- ezANOVA(data=RM_Anova, dv=.(relax),
wid=.(subject), within=.(condition), return_aov=TRUE)

resultsname may be replaced with any name you choose for the results of the analysis.
This name becomes the object name for the results within RStudio. When you provide an
object name, it gives you a way of referring to the results later so you can tell Rcmdr to do
additional things with the results. Type the code shown above into the RStudio console
window and press Enter.

Next, save the portion of the results containing the ANOVA summary. We called the entire
set of results resultsname. The part of resultsname that contains the ANOVA
summary is indicated by $aov. To save the ANOVA summary portion of the results as a
new object called resultsname2, type the following in RStudio and press Enter:
resultsname2 <- summary(resultsname$aov)
In the line of code above, resultsname is the name you gave to the ANOVA results at the
top of this page. resultsname2 can be any name you choose.

Next, type in the following code so that your degrees of freedom within (dfW) and Mean
Squares Within (MSW) can be recalled later. Hit enter after typing each line. The numbers
just refer to the specific row and columns that you want to pull from. Insert the name you
chose above for resultsname2 but do not change the numbers:
MSW <- resultsname2[[2]][[1]][2, 3]
dfW <- resultsname2[[2]][[1]][2, 1]
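
To see what the row and column numbers select, it can help to look at the summary table directly. The sketch below fits a repeated-measures model with base R's aov() on hypothetical data, assuming that the $aov object returned by ezANOVA has the same structure (one error stratum for subjects and one for the subject-by-condition error). The relaxation scores are randomly generated for illustration only.

```r
# Hypothetical long-format data: 10 subjects x 3 conditions
set.seed(1)
RM_Anova <- data.frame(
  subject   = factor(rep(1:10, each = 3)),
  condition = factor(rep(c("puppy", "kitten", "alone"), times = 10)),
  relax     = round(rnorm(30, mean = 15, sd = 3))
)

# Repeated-measures model fitted with base R's aov()
fit <- aov(relax ~ condition + Error(subject/condition), data = RM_Anova)
fit_summary <- summary(fit)

# [[2]] picks the within-subjects error stratum; [[1]] is its ANOVA table
within_table <- fit_summary[[2]][[1]]
print(within_table)

# Row 2 is Residuals; column 1 is Df and column 3 is Mean Sq
dfW <- within_table[2, 1]
MSW <- within_table[2, 3]
```

Printing within_table shows a condition row followed by a Residuals row, which is why the row index is 2 in both lines.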

Finally, to run the post-hoc test, type in this code in your RStudio window:
(HSD.test(y = RM_Anova$relax, trt = RM_Anova$condition, DFerror
= dfW, MSerror = MSW, alpha = .05))

The code shown above applies to this specific example. For your data, you should replace
the following items:
Replace RM_Anova with your dataset name (the dataset containing the raw data).
Replace relax with your dependent variable.
Replace condition with the within-group variable.

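Press Enter. The output will appear in your RStudio console window, as shown below. (The
sample output comes from a data set whose dependent variable is named score and whose
within-group variable has two levels, pre and post, so its labels differ from those in the
code above.)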

$statistics
      Mean       CV  MSerror      HSD
  66.13333 14.83639 96.27143 7.684262

$parameters
  Df ntr StudentizedRange
  14   2         3.033186

$means
     RM_Anova$score      std  r Min Max
post       66.33333 12.89334 15  48  90
pre        65.93333 11.84704 15  48  87

$comparison
NULL

$groups
   trt    means M
1 post 66.33333 a
2  pre 65.93333 a

Look under the $groups heading. If the letters under the M column are different from
each other, it means that the group means are significantly different. If the letters under
the M column are not different from each other, that means that the means are not
significantly different from each other. In this case, the means for pre- and post-test were
not significantly different from each other, as evidenced by both of them being labeled with
the letter a.
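
The HSD value in the $statistics output is the smallest difference between two means that would count as significant. It can be recomputed from the other numbers in the output, as sketched below with values copied from the sample output above. The formula is the standard Tukey HSD formula for equal group sizes.

```r
# Values copied from the $statistics, $parameters, and $means output above
MSerror <- 96.27143
Df      <- 14
ntr     <- 2      # number of groups being compared
r       <- 15     # number of scores per group

# Critical value of the studentized range at alpha = .05
q_crit <- qtukey(0.95, nmeans = ntr, df = Df)

# Honestly significant difference
HSD <- q_crit * sqrt(MSerror / r)

# Two means differ significantly only if their difference exceeds HSD
mean_diff <- 66.33333 - 65.93333
print(c(q_crit = q_crit, HSD = HSD, mean_diff = mean_diff))
is_significant <- mean_diff > HSD
print(is_significant)
```

Because the difference between the two means (0.4) is far smaller than the HSD, both groups receive the same letter.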

Here's an example of output in which there are significant differences between means:
$groups
  trt  means M
1   D 73.250 a
2   C 56.875 b
3   B 35.625 c
4   A 34.125 c

In this case, the Group A and Group B means are not significantly different from each other
(because both of their M levels are the same, c), but both of these group means are
significantly different from the Group C mean and the Group D mean (because the M levels
for Group C and Group D are b and a, respectively).

Mixed-method ANOVA
Perform a mixed-method ANOVA when you have two independent variables, where one IV
is an independent-groups (between-subjects) variable and the other IV is a repeated-
measures (within-subjects) variable. For example, suppose we want to know whether
students in three statistics classes with three different instructors learn different amounts
of the course material. We will examine five students from class A, five from class B, and
five from class C (a total of 15 students). We will give each student a pre-test at the
beginning of the semester to see how much statistics they know before taking the class.
Then we will give each student a test at the end of the semester (a post-test) to see how
much statistics they know after taking the class. In this example, the type of test, pre or
post, is a repeated-measures variable because every student takes both tests. The class, A,
B, or C, is an independent-groups variable because each student is in only one of the three
classes. The test type (pre or post) and the class are independent variables. The test score
is the dependent variable.

The following table shows a way to visualize the data for this example:

Class   Pre-test      Post-test

A       Scorepre1     Scorepost1
        Scorepre4     Scorepost4
        Scorepre7     Scorepost7
        Scorepre10    Scorepost10
        Scorepre13    Scorepost13

B       Scorepre2     Scorepost2
        Scorepre5     Scorepost5
        Scorepre8     Scorepost8
        Scorepre11    Scorepost11
        Scorepre14    Scorepost14

C       Scorepre3     Scorepost3
        Scorepre6     Scorepost6
        Scorepre9     Scorepost9
        Scorepre12    Scorepost12
        Scorepre15    Scorepost15

In the table, Scorepre1 is the pre-test score for student 1, Scorepost1 is the post-test score for
student 1, etc. There is a total of six groups of scores in this research design. Note that
each student has a score in both columns, which represent the two levels of the repeated-
measures variable. Also note that each student appears in only one of the three Class rows,
which represent the three levels of the independent-groups variable.

For this example, we can set up the data file as shown below. In each row of data,
participant is a unique identifier for each student, test indicates whether that row
contains a pre- or post-test score, class indicates whether the student is in class A, B, or C,
and score is the score on the test. Note that there are two rows for each student because
each student took both the pre-test and post-test, the two levels of the repeated-measures
variable. Also note that each student is in only one of the three classes because Class is an
independent-groups variable.

In the data shown above, the participants are listed in numerical order such that the pre-
and post-test scores alternate and the students in each class are not listed together. You
can enter the data in any order as long as the participant number and level of each variable
are shown correctly on each line.

Return to your RStudio window. Type in the following code:


summary(aov(score ~ class*test, data = Dataset))

The code shown above applies to this specific example. For your data, replace the
following items:
Replace score with your dependent variable.
Replace class with your between-groups (independent-groups) variable.
Replace test with your within-groups (repeated-measures) variable.
The order of class and test does not matter.
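
The call above does not tell R which rows belong to the same participant. A commonly used way to describe the repeated-measures structure in base R is to add an Error() term to the formula. The following is a sketch of that alternative with randomly generated, hypothetical scores; it is not the analysis reported in this guide, and its degrees of freedom differ from those in the output shown in this section.

```r
# Hypothetical data: 15 participants, 3 classes, pre- and post-test scores
set.seed(2)
Dataset <- data.frame(
  participant = factor(rep(1:15, each = 2)),
  test        = factor(rep(c("pre", "post"), times = 15)),
  class       = factor(rep(rep(c("A", "B", "C"), each = 2), times = 5)),
  score       = round(rnorm(30, mean = 66, sd = 12))
)

# Error(participant/test) marks test as varying within each participant
mixed_fit <- aov(score ~ class * test + Error(participant/test),
                 data = Dataset)
mixed_summary <- summary(mixed_fit)
print(mixed_summary)

# Residual df for the between-subjects and within-subjects strata
df_between <- mixed_summary[[1]][[1]][2, 1]
df_within  <- mixed_summary[[2]][[1]][3, 1]
```

In this specification, class is tested against the between-participants error, while test and the class-by-test interaction are tested against the within-participants error.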

Press Enter. The output will appear in your RStudio console window.
            Df Sum Sq Mean Sq F value Pr(>F)
class        2    461   230.5   1.523  0.238
test         1      1     1.2   0.008  0.930
class:test   2    198    98.8   0.653  0.530
Residuals   24   3634   151.4

Interpretation:
Each row shows the results for a test of a main effect or of the interaction, as shown in the
first column.
Df = Degrees of freedom
Sum Sq = Sum of squares
Mean Sq = Mean squares
F value = F-ratio
Pr(>F) = p-value

According to these results, there were no significant main effects of class or test, nor was
there a significant class x test interaction.
class: F(2,24) = 1.52, p = .24
test: F(1,24) = 0.01, p = .93
class:test (class by test interaction): F(2,24) = 0.65, p = .53
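
Each F-ratio in the table is the mean square for that effect divided by the residual mean square, and each p-value is the upper-tail area of the F distribution. The sketch below recomputes two of the rows from the rounded values in the output above, so the results agree with the table only to about three decimal places.

```r
# Values copied from the ANOVA table above
MS_class     <- 230.5
MS_classtest <- 98.8
MS_resid     <- 151.4
df_resid     <- 24

# F-ratio = effect mean square / residual mean square
F_class     <- MS_class / MS_resid
F_classtest <- MS_classtest / MS_resid

# p-value = upper-tail area of the F distribution (both effects have 2 df)
p_class     <- pf(F_class, 2, df_resid, lower.tail = FALSE)
p_classtest <- pf(F_classtest, 2, df_resid, lower.tail = FALSE)

print(round(c(F_class, p_class, F_classtest, p_classtest), 3))
```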

Post-hoc test for mixed-method ANOVA


In this example, the mixed-method ANOVA detected no significant effects. If any of the
effects had been significant, it would be appropriate to follow up the ANOVA with a post-
hoc test to do pair-wise comparisons of each pair of the six group means. After you
perform the mixed-method ANOVA (following the instructions in the preceding section),
then running the post-hoc test is easy.

All you have to do is place some of the mixed-method ANOVA code (all but the word
summary) in these parentheses: TukeyHSD()

Here's how the example mixed-method ANOVA post-hoc test code would look:
TukeyHSD(aov(score ~ class*test, data = Dataset))

Press Enter. The output will appear in your RStudio console window.

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ class * test, data = Dataset)

$class
    diff        lwr       upr     p adj
B-A  9.4  -4.341888 23.141888 0.2227517
C-A  3.0 -10.741888 16.741888 0.8498866
C-B -6.4 -20.141888  7.341888 0.4860130

$test
         diff       lwr      upr     p adj
pre-post -0.4 -9.673008 8.873008 0.9297983

$`class:test`
               diff      lwr     upr     p adj
B:post-A:post  10.2 -13.8615 34.2615 0.7764164
C:post-A:post  -2.0 -26.0615 22.0615 0.9998258
A:pre-A:post   -3.2 -27.2615 20.8615 0.9982935
B:pre-A:post    5.4 -18.6615 29.4615 0.9808958
C:pre-A:post    4.8 -19.2615 28.8615 0.9886974
C:post-B:post -12.2 -36.2615 11.8615 0.6262243
A:pre-B:post  -13.4 -37.4615 10.6615 0.5314074
B:pre-B:post   -4.8 -28.8615 19.2615 0.9886974
C:pre-B:post   -5.4 -29.4615 18.6615 0.9808958
A:pre-C:post   -1.2 -25.2615 22.8615 0.9999861
B:pre-C:post    7.4 -16.6615 31.4615 0.9287373
C:pre-C:post    6.8 -17.2615 30.8615 0.9491845
B:pre-A:pre     8.6 -15.4615 32.6615 0.8744693
C:pre-A:pre     8.0 -16.0615 32.0615 0.9038305
C:pre-B:pre    -0.6 -24.6615 23.4615 0.9999996

Interpretation:
According to these results, none of the pairwise comparisons is significant, including the
15 comparisons among the six individual group means, as all of the p-values (p adj) are
greater than the alpha of .05. This is expected in this specific case because the initial
mixed-method ANOVA demonstrated non-significant results. Once again, in a real-life
situation, you would not have bothered to perform the post-hoc test in this case.
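
With 15 comparisons in the interaction table, it can be easier to let R pick out the significant ones than to scan the p adj column by eye. The sketch below shows one way to do that. The data are randomly generated and hypothetical, so the comparisons it reports (if any) have no connection to the example in this guide.

```r
# Hypothetical data with the same layout as the example
set.seed(3)
Dataset <- data.frame(
  class = factor(rep(c("A", "B", "C"), each = 10)),
  test  = factor(rep(c("pre", "post"), times = 15)),
  score = round(rnorm(30, mean = 66, sd = 12))
)

tukey <- TukeyHSD(aov(score ~ class * test, data = Dataset))

# Convert the interaction comparisons to a data frame for easy filtering
cells <- as.data.frame(tukey[["class:test"]])
n_comparisons <- nrow(cells)   # one row per pair of the six cell means

# Keep only the comparisons with an adjusted p-value below .05
significant <- cells[cells[["p adj"]] < 0.05, ]
print(n_comparisons)
print(significant)
```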

Updates
Updating packages
Every once in a while, an R package will be updated. In order to check that all of your
packages are up to date, type:
update.packages()
into your RStudio (not Rcmdr) Console window and press Enter.

You will then see prompts asking you:


Update (y/N/c)?

Type:
y
into the Console window and press Enter. There will be one prompt for every package that
may be updated.

RStudio updates
To keep up with any RStudio updates and find update instructions, visit
https://blog.rstudio.org/. You likely will not be impacted unless commands fail to run.
