Download links:
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(Rmisc)){install.packages("Rmisc")}
if(!require(FSA)){install.packages("FSA")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}
Descriptive statistics are used to summarize data in a way that provides insight into the
information contained in the data. This might include examining the mean or median of
numeric data or the frequency of observations for nominal data. Plots can be created
that show the data and indicate summary statistics.
Choosing which summary statistics are appropriate depends on the type of variable being
examined. Different statistics should be used for interval/ratio, ordinal, and nominal
data.
Location is also called central tendency. It describes where the values of the data are
centered. For example, are the values close to 10 or 100 or 1000? Measures of location
include the mean and median.
Variation is also called dispersion. It is a measure of how far the data points lie from one
another. Common statistics include standard deviation and coefficient of variation. For
data that aren’t normally distributed, percentiles or the interquartile range might be
used.
Descriptive statistics for interval/ratio data
For this example, imagine that Ren and Stimpy have each held eight workshops
educating the public about water conservation at home. They are interested in how
many people showed up to the workshops.
Because the data are housed in a data frame, we can use the
convention Data$Attendees to access the variable Attendees within the data frame Data.
Input = ("
Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South 14
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North 15
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
")
Data = read.table(textConnection(Input), header=TRUE, stringsAsFactors=TRUE)
sum(Data$Attendees)
232
length(Data$Attendees)
16
Mean
The mean is the arithmetic average, and is a common statistic used with interval/ratio
data. It is simply the sum of the values divided by the number of
values. The mean function in R will return the mean.
sum(Data$Attendees) / length(Data$Attendees)
14.5
mean(Data$Attendees)
14.5
Caution should be used when reporting mean values with skewed data, as the mean may
not be representative of the center of the data. For example, imagine a town with 10
families, nine of whom have an income of less than $50,000 per year, but with one
family with an income of $2,000,000 per year. The mean income for families in the town
would be $233,300, but this may not be a reasonable way to summarize the income of
the town.
Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000,
2000000)
mean(Income)
233300
Median
The median is defined as the value below which 50% of the observations fall. To find
this value manually, you would order the observations, and separate the lowest 50%
from the highest 50%. For data sets with an odd number of observations, the median is
the middle value. For data sets with an even number of observations, the median falls
half-way between the two middle values.
The median is a robust statistic in that it is not affected by adding extreme values. For
example, if we changed Stimpy’s last Attendees value from 23 to 1000, it would not affect
the median.
median(Data$Attendees)
15
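The robustness claim above can be checked directly. The following is a minimal sketch that re-enters the Attendees values as a plain vector so that it runs on its own; the names Attendees and Altered are introduced here only for illustration.

```r
# Attendees values from the example data, re-entered so the sketch is self-contained
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16,
              18, 17, 15, 9, 15, 11, 19, 23)

# Illustrative copy with Stimpy's last value changed from 23 to 1000
Altered = Attendees
Altered[length(Altered)] = 1000

median(Attendees)   # median of the original values
median(Altered)     # median after introducing the extreme value
mean(Attendees)     # mean of the original values
mean(Altered)       # mean after introducing the extreme value
```

The median is 15 in both cases, while the mean moves from 14.5 to about 75.6.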
Note that in this case the mean and median are close in value to one
another. The mean and median will differ more the more skewed the data
are.
The median is appropriate for either skewed or unskewed data. The median income for
the town discussed above is $40,500. Half the families in the town have an income
above this amount, and half have an income below this amount.
Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000,
2000000)
median(Income)
40500
Note that medians are sometimes reported as the “average person” or “typical
family”. Saying, “The average American family earned $54,000 last year” means that the
median income for families was $54,000. The “average family” is the one with the
median income.
Mode
The mode is a summary statistic that is used rarely in practice, but is normally included
in any discussion of means and medians. When there are discrete values for a variable,
the mode is simply the value which occurs most frequently. For example, in the
Statistics Learning Center video in the Required Readings below, Dr. Nic gives an
example of counting the number of pairs of shoes each student owns. The most common
answer was 10, and therefore 10 is the mode for that data set.
For our Ren and Stimpy example, the value 15 occurs three times and so is the mode.
library(DescTools)
Mode(Data$Attendees)
15
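If the DescTools package is not available, the mode can also be found with base R by counting how often each value occurs. This is only a sketch of one possible approach, not the method used by the Mode function; the vector is re-entered here so the example is self-contained, and the names Counts and ModeValue are illustrative.

```r
# Attendees values from the example data, re-entered so the sketch is self-contained
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16,
              18, 17, 15, 9, 15, 11, 19, 23)

# Count how many times each distinct value occurs
Counts = table(Attendees)

# Keep the value(s) with the highest count; ties would return several values
ModeValue = as.numeric(names(Counts)[Counts == max(Counts)])
ModeValue
```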
This screenshot shows what the output of this function should look like once the
packages listed at the beginning of this document have been installed successfully.
Standard deviation
The standard deviation is a measure of variation which is commonly used with
interval/ratio data. It’s a measurement of how close the observations in the data set are
to the mean.
There’s a handy rule of thumb that—for normally distributed data—68% of data points
fall within the mean ± 1 standard deviation, 95% of data points fall within the mean ± 2
standard deviations, and 99.7% of data points fall within the mean ± 3 standard
deviations.
Because the mean is often represented with the letter mu, and the standard deviation is
represented with the letter sigma, saying someone is “a few sigmas away from mu”
indicates they are rather a rare character.
sd(Data$Attendees)
4.830459
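The rule of thumb for normally distributed data can be verified from the normal distribution function, pnorm, in base R. The sketch below also checks, as an illustration only, what share of the Attendees values fall within one standard deviation of their mean; the names k, Within, and Share are introduced here and are not part of the original example.

```r
# Proportion of a normal distribution within k standard deviations of the mean
k = 1:3
Within = pnorm(k) - pnorm(-k)
round(Within, 4)

# Attendees values from the example data, re-entered so the sketch is self-contained
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16,
              18, 17, 15, 9, 15, 11, 19, 23)

# Share of observed values within the mean +/- 1 standard deviation
Share = mean(abs(Attendees - mean(Attendees)) <= sd(Attendees))
Share
```

With only 16 observations, the observed share need not match 68% closely; here 11 of the 16 values fall within one standard deviation of the mean.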
The standard error of the mean is the standard deviation of a data set divided by the square root of
the number of observations. It can also be found in the output for the describe function
in the psych package, labelled se.
sd(Data$Attendees) /
sqrt(length(Data$Attendees))
1.207615
library(psych)
describe(Data$Attendees)
Another screenshot showing that the packages above were installed properly:
The median is the same as the 50th percentile, because 50% of values fall below this
value. Other percentiles for a data set can be identified to provide more
information. Typically, the 0th, 25th, 50th, 75th, and 100th percentiles are reported. This is
sometimes called the five-number summary.
These values can also be called the minimum, 1st quartile, 2nd quartile, 3rd quartile, and
maximum.
Percentiles and quartiles are relatively robust, as they aren’t affected much by a few
extreme values. They are appropriate for both skewed and unskewed data.
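summary(Data$Attendees)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.00   11.75   15.00   14.50   17.25   23.00
Note that the 3rd quartile is reported as 17.25, even though the 75th percentile of the
ordered values falls between 17 and 18.
sort(Data$Attendees)
6 7 9 11 12 13 14 15 15 15 16 17 18 19 22 23
The reason is that there are several different methods to calculate percentiles, and they
may give slightly different answers. For details on the calculations, see ?quantile.
For Attendees, the default type 7 calculation yields a 75th percentile value of 17.25,
whereas the type 2 calculation simply splits the difference between 17 and 18 and yields
17.5. The type 1 calculation doesn’t average the two values, and so just returns 17.
quantile(Data$Attendees, 0.75, type=7)
75%
17.25
quantile(Data$Attendees, 0.75, type=2)
75%
17.5
quantile(Data$Attendees, 0.75, type=1)
75%
17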
Percentiles other than the 25th, 50th, and 75th can be calculated with the quantile
function. For example, to calculate the 95th percentile:
quantile(Data$Attendees, .95)
95%
22.25
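The five-number summary and the interquartile range mentioned earlier can also be obtained directly in base R. This is a short sketch with the Attendees values re-entered so that it runs on its own; the names FiveNumber and Spread are illustrative.

```r
# Attendees values from the example data, re-entered so the sketch is self-contained
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16,
              18, 17, 15, 9, 15, 11, 19, 23)

# With no probabilities given, quantile returns the 0th, 25th, 50th,
# 75th, and 100th percentiles using the default type 7 calculation
FiveNumber = quantile(Attendees)
FiveNumber

# Interquartile range: the 75th percentile minus the 25th percentile
Spread = IQR(Attendees)
Spread
```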
Summarize in FSA
The Summarize function in the FSA package returns the number of observations, mean,
standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum for
grouped data.
Note the use of formula notation: Attendees is the dependent variable (the variable you
want to get the statistics for); and Instructor is the independent variable (the grouping
variable). Summarize allows you to summarize over the combination of multiple
independent variables by listing them to the right of the ~ separated by a plus sign (+).
library(FSA)
Summarize(Attendees ~ Instructor,
data=Data)
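The same formula notation, with grouping variables joined by a plus sign, is also accepted by the base R aggregate function. The sketch below uses aggregate rather than Summarize so that it runs without any packages; it returns only the group means, not the full set of statistics. The data frame is rebuilt with data.frame so the example is self-contained, and the name Means is illustrative.

```r
# Rebuild the example data so the sketch is self-contained
Data = data.frame(
  Instructor = rep(c("Ren", "Stimpy"), each = 8),
  Location   = rep(rep(c("North", "South"), each = 4), times = 2),
  Attendees  = c(7, 22, 6, 15, 12, 13, 14, 16,
                 18, 17, 15, 9, 15, 11, 19, 23))

# Mean of Attendees for each combination of Instructor and Location
Means = aggregate(Attendees ~ Instructor + Location,
                  data = Data,
                  FUN  = mean)
Means
```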
summarySE in Rmisc
The summarySE function in the Rmisc package outputs the number of observations,
mean, standard deviation, standard error of the mean, and confidence interval for
grouped data. The summarySE function allows you to summarize over the combination
of multiple independent variables by listing them as a vector, e.g. c("Instructor",
"Student").
library(Rmisc)
summarySE(data=Data,
"Attendees",
groupvars="Instructor",
conf.interval = 0.95)
Instructor N Attendees sd se ci
1 Ren 8 13.125 5.083236 1.797195 4.249691
2 Stimpy 8 15.875 4.454131 1.574773 3.723747
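The sd, se, and ci columns above can be reproduced by hand in base R. The sketch below assumes that ci is the half-width of a confidence interval based on the t distribution with N - 1 degrees of freedom; that assumption reproduces the printed values, but it is a reconstruction rather than a statement of how summarySE is implemented. The names N, SD, SE, and CI are illustrative.

```r
# Attendees values and instructors from the example data, re-entered
# so the sketch is self-contained
Attendees  = c(7, 22, 6, 15, 12, 13, 14, 16,
               18, 17, 15, 9, 15, 11, 19, 23)
Instructor = rep(c("Ren", "Stimpy"), each = 8)

# Split the values by instructor and compute per-group statistics
Groups = split(Attendees, Instructor)
N  = sapply(Groups, length)
SD = sapply(Groups, sd)

# Standard error of the mean: standard deviation divided by sqrt(N)
SE = SD / sqrt(N)

# Assumed: half-width of a 95% t-based confidence interval
CI = qt(1 - (1 - 0.95) / 2, df = N - 1) * SE

data.frame(N = N, sd = SD, se = SE, ci = CI)
```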
summarySE(data=Data,
"Attendees",
groupvars = c("Instructor", "Location"),
conf.interval = 0.95)
library(psych)
describeBy(Data$Attendees,
group = Data$Instructor,
digits= 4)
group: Ren
  vars n  mean   sd median trimmed  mad min max range skew kurtosis  se
1    1 8 13.12 5.08   13.5   13.12 2.97   6  22    16 0.13    -1.08 1.8
-------------------------------------------------------------------------
group: Stimpy
  vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 8 15.88 4.45     16   15.88 3.71   9  23    14 -0.06    -1.26 1.57
describeBy(Data$Attendees,
group = factor(Data$Instructor) : factor(Data$Location),
digits= 4)
group: Ren:North
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4 12.5 7.51     11    12.5 6.67   6  22    16 0.26    -2.14 3.75
-------------------------------------------------------------------------
group: Ren:South
  vars n  mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4 13.75 1.71   13.5   13.75 1.48  12  16     4 0.28    -1.96 0.85
-------------------------------------------------------------------------
group: Stimpy:North
  vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 4 14.75 4.03     16   14.75 2.22   9  18     9 -0.55    -1.84 2.02
-------------------------------------------------------------------------
group: Stimpy:South
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4   17 5.16     17      17 5.93  11  23    12    0    -2.08 2.58