R Studio – Descriptive Statistics

Download links:

(Best) For Mac: https://download.cnet.com/R-for-Mac-OS-X/3000-2053_4-7831.html


For Windows/Linux/Mac : https://cran.cnr.berkeley.edu

Packages used in this chapter


The packages used in this chapter include:
• psych
• DescTools
• Rmisc
• FSA
• plyr
• boot

The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(Rmisc)){install.packages("Rmisc")}
if(!require(FSA)){install.packages("FSA")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}

Descriptive statistics are used to summarize data in a way that provides insight into the
information contained in the data. This might include examining the mean or median of
numeric data or the frequency of observations for nominal data. Plots can be created
that show the data and indicate summary statistics.

Choosing which summary statistics are appropriate depends on the type of variable being
examined. Different statistics should be used for interval/ratio, ordinal, and nominal
data.

In describing or examining data, you will typically be concerned with measures of
location, variation, and shape.

Location is also called central tendency. It is a measure of the values of the data. For
example, are the values close to 10 or 100 or 1000? Measures of location include mean
and median.

Variation is also called dispersion. It is a measure of how far the data points lie from one
another. Common statistics include standard deviation and coefficient of variation. For
data that aren’t normally-distributed, percentiles or the interquartile range might be
used.

Descriptive statistics for interval/ratio data
For this example, imagine that Ren and Stimpy have each held eight workshops
educating the public about water conservation at home. They are interested in how
many people showed up to the workshops.

Because the data are housed in a data frame, we can use the
convention Data$Attendees to access the variable Attendees within the data frame Data.
Input = ("
Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South 14
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North 15
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
")

Data = read.table(textConnection(Input), header=TRUE,
                  stringsAsFactors=TRUE)   ### Text columns become factors (not the default since R 4.0)

Data ### Will output data frame called Data

Functions sum and length


The sum of a variable can be found with the sum function, and the number of
observations can be found with the length function.
sum(Data$Attendees)

232

length(Data$Attendees)

16

Statistics of location for interval/ratio data

Mean
The mean is the arithmetic average, and is a common statistic used with interval/ratio
data. It is simply the sum of the values divided by the number of
values. The mean function in R will return the mean.
sum(Data$Attendees) / length(Data$Attendees)

14.5

mean(Data$Attendees)

14.5
Caution should be used when reporting mean values with skewed data, as the mean may
not be representative of the center of the data. For example, imagine a town with 10
families, nine of whom have an income of less than $50,000 per year, but with one
family with an income of $2,000,000 per year. The mean income for families in the town
would be $233,300, but this may not be a reasonable way to summarize the income of
the town.
Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000,
2000000)

mean(Income)

233300
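
To see how strongly the single high-income family drives this result, the mean can be recomputed without it. The following is only a small sketch; the name Income9 is introduced here for illustration and is not part of the original example.

```r
# Same ten incomes as in the example above
Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000,
           2000000)

# Drop the one family earning $2,000,000 (Income9 is a hypothetical name)
Income9 = Income[Income < 1000000]

mean(Income)    # mean of all ten families
mean(Income9)   # mean of the remaining nine families
```

The mean of the nine remaining families is $37,000, far below the $233,300 reported for all ten.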

Median
The median is defined as the value below which 50% of the observations fall. To find
this value manually, you would order the observations, and separate the lowest 50%
from the highest 50%. For data sets with an odd number of observations, the median is
the middle value. For data sets with an even number of observations, the median falls
half-way between the two middle values.

The median is a robust statistic in that it is not affected by adding extreme values. For
example, if we changed Stimpy’s last Attendees value from 23 to 1000, it would not affect
the median.
median(Data$Attendees)

15
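
The robustness claim can be checked directly. This sketch re-enters the Attendees values as a plain vector so that it runs on its own, and the copy named Extreme is a hypothetical name used only here.

```r
# Attendees values from the Ren and Stimpy data, in the original row order
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

# Copy the data and replace Stimpy's last value (23) with 1000
Extreme = Attendees
Extreme[16] = 1000

median(Attendees)   # median of the original data
median(Extreme)     # median after adding the extreme value
mean(Attendees)     # mean of the original data
mean(Extreme)       # mean after adding the extreme value
```

Both medians are 15, whereas the mean moves from 14.5 to 75.5625.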

Note that in this case the mean and median are close in value to one
another. The more skewed the data are, the more the mean and median will differ.

The median is appropriate for either skewed or unskewed data. The median income for
the town discussed above is $40,500. Half the families in the town have an income
above this amount, and half have an income below this amount.

Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000,
2000000)

median(Income)

40500

Note that medians are sometimes reported as the “average person” or “typical
family”. Saying, “The average American family earned $54,000 last year” means that the
median income for families was $54,000. The “average family” is that one with the
median income.

Mode
The mode is a summary statistic that is used rarely in practice, but is normally included
in any discussion of means and medians. When there are discrete values for a variable,
the mode is simply the value which occurs most frequently. For example, in the
Statistics Learning Center video in the Required Readings below, Dr. Nic gives an
example of counting the number of pairs of shoes each student owns. The most common
answer was 10, and therefore 10 is the mode for that data set.

For our Ren and Stimpy example, the value 15 occurs three times and so is the mode.

The Mode function can be found in the package DescTools.


library(DescTools)

Mode(Data$Attendees)

15
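
If DescTools is not available, the mode can also be found with base R by counting how often each value occurs. This is a minimal sketch rather than a replacement for the Mode function; the names Counts and ModeValue are chosen here for illustration.

```r
# Attendees values from the Ren and Stimpy data
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

# Frequency of each distinct value
Counts = table(Attendees)
Counts

# Keep every value whose count equals the highest count
ModeValue = as.numeric(names(Counts)[as.vector(Counts) == max(Counts)])
ModeValue
```

For these data ModeValue is 15. If several values tied for the highest count, this approach would return all of them.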

[Screenshot: output of the Mode function once the packages listed at the beginning of this document are installed.]

Statistics of variation for interval/ratio data

Standard deviation
The standard deviation is a measure of variation which is commonly used with
interval/ratio data. It’s a measurement of how close the observations in the data set are
to the mean.

There’s a handy rule of thumb that—for normally distributed data—68% of data points
fall within the mean ± 1 standard deviation, 95% of data points fall within the mean ± 2
standard deviations, and 99.7% of data points fall within the mean ± 3 standard
deviations.
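
The percentages in this rule of thumb can be reproduced from the normal distribution function pnorm in base R. The sketch below is illustrative only; the names k and Within are introduced here.

```r
# Number of standard deviations on either side of the mean
k = c(1, 2, 3)

# Proportion of a normal distribution lying within k standard deviations
Within = pnorm(k) - pnorm(-k)

round(Within, 4)
```

The three proportions are approximately 0.6827, 0.9545, and 0.9973.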

Because the mean is often represented with the letter mu, and the standard deviation is
represented with the letter sigma, saying someone is “a few sigmas away from mu”
indicates they are rather a rare character.
sd(Data$Attendees)

4.830459

Standard deviation may not be appropriate for skewed data.
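
The coefficient of variation mentioned earlier is the standard deviation divided by the mean, which expresses variation relative to the size of the values. A short sketch for the Attendees data, with CV as an illustrative name:

```r
# Attendees values from the Ren and Stimpy data
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

# Coefficient of variation: standard deviation relative to the mean
CV = sd(Attendees) / mean(Attendees)

CV
```

Here the standard deviation is about one third of the mean.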


Standard error of the mean
Standard error of the mean is a measure that estimates how close a calculated mean is
likely to be to the true mean of that population. It is commonly used in tables or plots
where multiple means are presented together. For example, we might want to present
the mean attendees for Ren with the standard error for that mean and the mean
attendees for Stimpy with the standard error for that mean.

The standard error is the standard deviation of a data set divided by the square root of
the number of observations. It can also be found in the output for the describe function
in the psych package, labelled se.

sd(Data$Attendees) /
sqrt(length(Data$Attendees))

1.207615
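
The same formula can be applied to each instructor separately with the base R function tapply. This is a sketch under the assumption that the data are entered as plain vectors; StdErr is a hypothetical helper defined here, not a function from any of the packages above.

```r
# Instructor and Attendees columns from the Ren and Stimpy data
Instructor = c(rep("Ren", 8), rep("Stimpy", 8))
Attendees  = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

# Hypothetical helper: standard error of the mean
StdErr = function(x) sd(x) / sqrt(length(x))

GroupMean = tapply(Attendees, Instructor, mean)     # mean for each instructor
GroupSE   = tapply(Attendees, Instructor, StdErr)   # standard error for each instructor

GroupMean
GroupSE
```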

library(psych)

describe(Data$Attendees)

  vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 16 14.5 4.83     15    14.5 4.45   6  23    17 -0.04    -0.88 1.21

[Screenshot: describe output after the packages above have been installed.]

In the describe output, se stands for the standard error of the mean.


Five-number summary, quartiles, percentiles

The median is the same as the 50th percentile, because 50% of values fall below this
value. Other percentiles for a data set can be identified to provide more
information. Typically, the 0th, 25th, 50th, 75th, and 100th percentiles are reported. This is
sometimes called the five-number summary.

These values can also be called the minimum, 1st quartile, 2nd quartile, 3rd quartile, and
maximum.

The five-number summary is a useful measure of variation for skewed interval/ratio
data or for ordinal data. 25% of values fall below the 1st quartile and 25% of values fall
above the 3rd quartile. This leaves the middle 50% of values between the 1st and
3rd quartiles, giving a sense of the range of the middle half of the data. This range is
called the interquartile range (IQR).

Percentiles and quartiles are relatively robust, as they aren’t affected much by a few
extreme values. They are appropriate for both skewed and unskewed data.
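summary(Data$Attendees)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.00   11.75   15.00   14.50   17.25   23.00

Note that the 3rd quartile is reported as 17.25, even though in the sorted data the
75th percentile falls between 17 and 18.

sort(Data$Attendees)

6 7 9 11 12 13 14 15 15 15 16 17 18 19 22 23

The reason is that there are several different methods to calculate percentiles, and they
may give slightly different answers. For details on the calculations, see ?quantile.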

For Attendees, the default type 7 calculation yields a 75th percentile value of 17.25,
whereas the type 2 calculation simply splits the difference between 17 and 18 and yields
17.5. The type 1 calculation doesn’t average the two values, and so just returns 17.

quantile(Data$Attendees, 0.75, type=7)

75%
17.25

quantile(Data$Attendees, 0.75, type=2)

75%
17.5

quantile(Data$Attendees, 0.75, type=1)

75%
17
Percentiles other than the 25th, 50th, and 75th can be calculated with the quantile
function. For example, to calculate the 95th percentile:
quantile(Data$Attendees, .95)

95%
22.25
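
Several base R functions return the five-number summary and the interquartile range directly. The sketch below re-enters the Attendees values so that it runs on its own; the names Q, Spread, and Tukey are illustrative. Note that fivenum uses Tukey's hinges, which is yet another way of locating the quartiles.

```r
# Attendees values from the Ren and Stimpy data
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

Q      = quantile(Attendees)   # 0th, 25th, 50th, 75th, 100th percentiles (default type 7)
Spread = IQR(Attendees)        # 3rd quartile minus 1st quartile (type 7)
Tukey  = fivenum(Attendees)    # Tukey's five-number summary, based on hinges

Q
Spread
Tukey
```

For these data quantile returns 6, 11.75, 15, 17.25, and 23, so the IQR is 5.5, while fivenum returns 6, 11.5, 15, 17.5, and 23.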

Statistics for grouped interval/ratio data


In many cases, we will want to examine summary statistics for a variable within
groups. For example, we may want to examine statistics for the workshops led by Ren
and those led by Stimpy.

Summarize in FSA
The Summarize function in the FSA package returns the number of observations, mean,
standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum for
grouped data.
Note the use of formula notation: Attendees is the dependent variable (the variable you
want to get the statistics for); and Instructor is the independent variable (the grouping
variable). Summarize allows you to summarize over the combination of multiple
independent variables by listing them to the right of the ~ separated by a plus sign (+).

library(FSA)

Summarize(Attendees ~ Instructor,
data=Data)

  Instructor n nvalid   mean       sd min    Q1 median    Q3 max percZero
1        Ren 8      8 13.125 5.083236   6 10.75   13.5 15.25  22        0
2     Stimpy 8      8 15.875 4.454131   9 14.00   16.0 18.25  23        0

Summarize(Attendees ~ Instructor + Location,
          data=Data)

  Instructor Location n nvalid  mean       sd min    Q1 median    Q3 max percZero
1        Ren    North 4      4 12.50 7.505554   6  6.75   11.0 16.75  22        0
2     Stimpy    North 4      4 14.75 4.031129   9 13.50   16.0 17.25  18        0
3        Ren    South 4      4 13.75 1.707825  12 12.75   13.5 14.50  16        0
4     Stimpy    South 4      4 17.00 5.163978  11 14.00   17.0 20.00  23        0
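
The same formula notation works with the base R function aggregate, which may be convenient when FSA is not installed. The sketch below rebuilds the data frame with data.frame so that it runs on its own; MeanByInstructor and MedianByGroup are illustrative names.

```r
# Rebuild the Ren and Stimpy data frame in the original row order
Data = data.frame(
  Instructor = rep(c("Ren", "Stimpy"), each = 8),
  Location   = rep(rep(c("North", "South"), each = 4), times = 2),
  Attendees  = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)
)

# One statistic per call: the mean for each instructor
MeanByInstructor = aggregate(Attendees ~ Instructor, data = Data, FUN = mean)

# The median for each combination of instructor and location
MedianByGroup = aggregate(Attendees ~ Instructor + Location, data = Data,
                          FUN = median)

MeanByInstructor
MedianByGroup
```

Unlike Summarize, aggregate applies a single function per call, so each statistic needs its own call or a custom function.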

summarySE in Rmisc
The summarySE function in the Rmisc package outputs the number of observations,
mean, standard deviation, standard error of the mean, and confidence interval for
grouped data. The summarySE function allows you to summarize over the combination
of multiple independent variables by listing them as a vector, e.g. c("Instructor",
"Location").

library(Rmisc)

summarySE(data=Data,
          measurevar="Attendees",
          groupvars="Instructor",
          conf.interval = 0.95)

  Instructor N Attendees       sd       se       ci
1        Ren 8    13.125 5.083236 1.797195 4.249691
2     Stimpy 8    15.875 4.454131 1.574773 3.723747

summarySE(data=Data,
          measurevar="Attendees",
          groupvars = c("Instructor", "Location"),
          conf.interval = 0.95)

  Instructor Location N Attendees       sd        se        ci
1        Ren    North 4     12.50 7.505553 3.7527767 11.943011
2        Ren    South 4     13.75 1.707825 0.8539126  2.717531
3     Stimpy    North 4     14.75 4.031129 2.0155644  6.414426
4     Stimpy    South 4     17.00 5.163978 2.5819889  8.217041
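
The ci column can be read as the half-width of the confidence interval around each mean. The sketch below reproduces the value for Ren's eight workshops, assuming a t-based interval (standard error multiplied by the t quantile for n - 1 degrees of freedom); the variable names are illustrative.

```r
# Attendees for Ren's eight workshops
Ren = c(7, 22, 6, 15, 12, 13, 14, 16)

n  = length(Ren)
se = sd(Ren) / sqrt(n)              # standard error of the mean
ci = se * qt(0.975, df = n - 1)     # half-width of a 95% t-based interval

ci
c(lower = mean(Ren) - ci, upper = mean(Ren) + ci)
```

Under this assumption ci comes out at about 4.25, matching the value reported for Ren in the first summarySE table.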

describeBy in psych
The describeBy function in the psych package returns the number of observations, mean,
median, trimmed means, minimum, maximum, range, skew, kurtosis, and standard error
of the mean for grouped data. describeBy allows you to summarize over the combination
of multiple independent variables by combining terms with a colon (:).

library(psych)

describeBy(Data$Attendees,
group = Data$Instructor,
digits= 4)

group: Ren
  vars n  mean   sd median trimmed  mad min max range skew kurtosis  se
1    1 8 13.12 5.08   13.5   13.12 2.97   6  22    16 0.13    -1.08 1.8
-------------------------------------------------------------------------
group: Stimpy
  vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 8 15.88 4.45     16   15.88 3.71   9  23    14 -0.06    -1.26 1.57

describeBy(Data$Attendees,
group = Data$Instructor : Data$Location,
digits= 4)

group: Ren:North
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4 12.5 7.51     11    12.5 6.67   6  22    16 0.26    -2.14 3.75
-------------------------------------------------------------------------
group: Ren:South
  vars n  mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4 13.75 1.71   13.5   13.75 1.48  12  16     4 0.28    -1.96 0.85
-------------------------------------------------------------------------
group: Stimpy:North
  vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 4 14.75 4.03     16   14.75 2.22   9  18     9 -0.55    -1.84 2.02
-------------------------------------------------------------------------
group: Stimpy:South
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4   17 5.16     17      17 5.93  11  23    12    0    -2.08 2.58
