
Descriptive Statistics

Chapter 2
Quantitative Methods for Economics
Dr. Katherine Sauer
Metropolitan State College of Denver
Chapter Overview:
I. Working With Raw Data
II. Working With Grouped Data
III. Measures of Dispersion for Raw Data
IV. Measures of Dispersion for Grouped Data
V. Other Descriptive Statistics
I. Working with Raw Data (mean, median and mode)
Suppose you are a manager preparing a report on hours worked
by your 49 staff members.
You might like to know the average number of hours worked.
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1592.5}{49} = 32.5$
20.0 37.3 54.2 25.3 59.6 24.5 29.7
18.0 38.8 42.1 39.5 56.8 16.9 28.5
45.5 42.0 39.5 42.6 40.0 44.2 40.1
44.0 56.4 30.2 20.0 22.7 37.8 23.4
26.0 20.2 36.1 18.3 19.7 36.8 26.5
24.0 23.4 15.4 20.0 38.9 42.1 24.1
41.0 18.5 21.3 22.6 37.2 42.9 17.9
Hours worked in a given week by 49 staff members
You might also like to know the median hours worked.
- sort the data in ascending order (the sorted table reads down each column)
15.4 20.0 23.4 26.5 37.3 40.1 44.0
16.9 20.0 23.4 28.5 37.8 41.0 44.2
17.9 20.0 24.0 29.7 38.8 42.0 45.5
18.0 20.2 24.1 30.2 38.9 42.1 54.2
18.3 21.3 24.5 36.1 39.5 42.1 56.4
18.5 22.6 25.3 36.8 39.5 42.6 56.8
19.7 22.7 26.0 37.2 40.0 42.9 59.6
Hours worked in a given week by 49 staff members
The mode and median can be determined from the sorted data.

Are there any outliers we should make note of?

mean: 32.5 hours
median: 30.2 hours
mode: 20 hours
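The three summary values above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# Hours worked in a given week by 49 staff members (raw, unsorted data)
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

mean_hours = statistics.mean(hours)      # sum of values / number of values
median_hours = statistics.median(hours)  # middle (25th) value of the sorted data
mode_hours = statistics.mode(hours)      # most frequently occurring value

print(round(mean_hours, 1), median_hours, mode_hours)
```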
One final calculation we might like to make is arranging the data
into quartiles.

The lower quartile (Q1) is the item at, or closest to, position
0.25(n + 1).

Q1: 0.25(49 + 1) = 12.5
There is no 12.5th position, so we'll average the 12th and 13th
positions together.
So, Q1 = (21.3 + 22.6) / 2 = 21.95

We've already found Q2 (the median): 30.2

To find the upper quartile (Q3), use the value of the item closest
to position 0.75(n + 1).
Q3: 0.75(49 + 1) = 37.5
There is no 37.5th position, so we'll average the 37th and 38th
positions together.
So, Q3 has a value of (41 + 42) / 2 = 41.5



Q1 = 21.95 Q2 = 30.2 Q3 = 41.5
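The positional rule for quartiles can be sketched in code. This is a minimal illustration, assuming that any fractional position is handled by averaging its two neighbouring items (the slides only show the ".5" case); the helper names `value_at_position` and `quartiles` are hypothetical.

```python
import math

def value_at_position(ordered, position):
    """Value at a 1-based position; a fractional position averages its two neighbours."""
    lower = math.floor(position)
    upper = math.ceil(position)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using positions 0.25(n+1), 0.5(n+1), 0.75(n+1)."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, p * (n + 1)) for p in (0.25, 0.5, 0.75))

hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

q1, q2, q3 = quartiles(hours)
print(round(q1, 2), q2, q3)
```

The sketch does not guard against positions that fall before the first or after the last item, which can happen for very small data sets.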
Sometimes the mean is not a good representation of the data.
- a representative statistic is fairly typical of most of the
data

Outliers can skew the mean.

Ex: Suppose we have the following data on ages of students taking
piano lessons.
5,6,7,7,7,8,9,9,32

Calculate the mean, median and mode:
mean = 10, median = 7, mode = 7

Drop the outlier and re-calculate the mean, median and mode:
mean = 7.25, median = (7+7)/2 = 7, mode = 7
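To see how a single outlier moves the mean but not the median or mode, the piano example can be reproduced with a few lines of Python. This is only a sketch; the `summarize` helper is an illustrative name.

```python
import statistics

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]

def summarize(values):
    """Return (mean, median, mode) for a list of values."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

with_outlier = summarize(ages)
# Drop the one unusually high age and recompute
without_outlier = summarize([age for age in ages if age != 32])

print(with_outlier)
print(without_outlier)
```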
Graphically, skewed data has a long tail extending to the outlier.
- low outliers produce skewed to the left graphs
- high outliers produce skewed to the right graphs
For low outliers, the value of the mean will be less than the
value of the median.

For high outliers, the value of the mean will be more than the
value of the median.
II. Working with Grouped Data (mean, median and mode)

Many times it would be impractical to list all of the raw data.
Often data is first put into groups.

Example: employment data in the farming, fishing and forestry
industry
Age Group 1991 1996
15-19 4,585 2,826
20-24 11,872 9,319
25-34 27,171 24,492
35-44 31,299 28,210
45-54 31,626 30,902
55-64 33,477 25,846
65 and over 23,519 19,030
Total 163,549 140,625
Employment in the Farming, Fishing and Forestry Industry
Note: We are assuming that the values within each interval vary
uniformly between the lowest and highest values for the interval.
The mid-interval value is the midpoint of an interval. Under the
uniformity assumption above, it is also the average value of the
data in that interval.
- used to represent the group numerically

Mid-Interval Value for 15-19: (15 + 19) / 2 = 17
The age of each person in the interval is assumed to be 17.
Back to our hours worked example
15.4 20.0 23.4 26.5 37.3 40.1 44.0
16.9 20.0 23.4 28.5 37.8 41.0 44.2
17.9 20.0 24.0 29.7 38.8 42.0 45.5
18.0 20.2 24.1 30.2 38.9 42.1 54.2
18.3 21.3 24.5 36.1 39.5 42.1 56.4
18.5 22.6 25.3 36.8 39.5 42.6 56.8
19.7 22.7 26.0 37.2 40.0 42.9 59.6
Hours worked in a given week by 49 staff members
Let's group this data into a frequency distribution table.
- choose between 5 and 20 intervals

Data starts at 15.4 and goes to 59.6.
Grouping hours by 5s or 10s makes sense.
For our data, by 5s will be more revealing.
Hours Worked Frequency
15<20 7
20<25 12
25<30 5
30<35 1
35<40 9
40<45 10
45<50 1
50<55 1
55<60 3
Complete the frequency distribution table.
How to interpret:
9 people worked from 35 hours up to but not
including 40 hours.
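The frequency table can also be built programmatically. The sketch below assumes intervals of width 5 starting at 15, as chosen above, and counts how many values fall in each "lower bound up to but not including upper bound" interval; the variable names are illustrative.

```python
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

lower_bound, width, n_intervals = 15, 5, 9

counts = [0] * n_intervals
for h in hours:
    # Index of the interval [lower, lower + width) that contains h
    counts[int((h - lower_bound) // width)] += 1

for i, count in enumerate(counts):
    lower = lower_bound + i * width
    print(f"{lower}<{lower + width}: {count}")
```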
Let's calculate the mid-interval values and add them to our table.
Hours Worked Frequency Mid Interval Value
15<20 7 17.5
20<25 12 22.5
25<30 5 27.5
30<35 1 32.5
35<40 9 37.5
40<45 10 42.5
45<50 1 47.5
50<55 1 52.5
55<60 3 57.5
Interpretation:
Everyone in the first interval is assumed to have worked
17.5 hours that week.
Let's calculate the total hours worked for each interval and add to
the table.
frequency x mid-interval value
Hours Worked Frequency Mid Interval Value Sub-Group Total Hours Worked
15<20 7 17.5 122.5
20<25 12 22.5 270
25<30 5 27.5 137.5
30<35 1 32.5 32.5
35<40 9 37.5 337.5
40<45 10 42.5 425
45<50 1 47.5 47.5
50<55 1 52.5 52.5
55<60 3 57.5 172.5
We can now calculate the mean for this grouped data.

mean = (Sum of Sub-Group Total Hours Worked) / (Total Number of Workers)
Hours Worked Frequency Mid Interval Value Sub-Group Total Hours Worked
15<20 7 17.5 122.5
20<25 12 22.5 270
25<30 5 27.5 137.5
30<35 1 32.5 32.5
35<40 9 37.5 337.5
40<45 10 42.5 425
45<50 1 47.5 47.5
50<55 1 52.5 52.5
55<60 3 57.5 172.5
Total 49 1597.5
Note: When we calculated total number of hours worked from raw
data, we got 1592.5. Starting from grouped data, using the mid-
interval and the frequency to calculate the hours worked, we get
1597.5.

Mean = 1597.5 / 49 = 32.602

Our raw data mean was 32.5.
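The grouped mean calculation can be written out as a short sketch. It assumes, as the slides do, that every value in an interval equals the mid-interval value; the variable names are illustrative.

```python
intervals = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40),
             (40, 45), (45, 50), (50, 55), (55, 60)]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]

# Mid-interval value represents every observation in its interval
midpoints = [(low + high) / 2 for low, high in intervals]

# Sub-group total hours = frequency x mid-interval value
subtotals = [f * m for f, m in zip(frequencies, midpoints)]

total_hours = sum(subtotals)
grouped_mean = total_hours / sum(frequencies)

print(total_hours, round(grouped_mean, 3))
```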
To find the mode, we simply need our frequencies and intervals.
Hours Worked Frequency
15<20 7
20<25 12
25<30 5
30<35 1
35<40 9
40<45 10
45<50 1
50<55 1
55<60 3
Looking at our table, our mode will fall in which interval?
20 < 25

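Use your formula to calculate the mode:
mode = L + [l1 / (l1 + l2)] x w
where L is the lower bound of the modal interval, w is the interval
width, l1 is the modal frequency minus the frequency of the
preceding interval, and l2 is the modal frequency minus the
frequency of the following interval.

L = 20
l1 = 12 - 7 = 5
l2 = 12 - 5 = 7
w = 25 - 20 = 5

mode = 20 + (5)(5) / (5 + 7) = 22.08

Our raw data mode was 20.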
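The mode calculation can be sketched as a small function. It assumes equal-width intervals and treats a missing neighbour (before the first or after the last interval) as having frequency 0, which is an assumption not stated in the slides; `grouped_mode` is a hypothetical name.

```python
def grouped_mode(lower_bounds, frequencies, width):
    """Estimate the mode of grouped data from the modal interval and its neighbours."""
    k = max(range(len(frequencies)), key=lambda i: frequencies[i])  # modal interval
    before = frequencies[k - 1] if k > 0 else 0
    after = frequencies[k + 1] if k < len(frequencies) - 1 else 0
    l1 = frequencies[k] - before   # modal frequency minus preceding frequency
    l2 = frequencies[k] - after    # modal frequency minus following frequency
    return lower_bounds[k] + l1 / (l1 + l2) * width

lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]

mode_hours = grouped_mode(lower_bounds, frequencies, width=5)
print(round(mode_hours, 2))
```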
Now let's calculate the median and quartiles.

We'll first need to compute the cumulative frequency and add it to
our table.
Hours Worked Frequency Less Than Cumulative Frequency
15<20 7 20 7
20<25 12 25 19
25<30 5 30 24
30<35 1 35 25
35<40 9 40 34
40<45 10 45 44
45<50 1 50 45
50<55 1 55 46
55<60 3 60 49
Q1 is still positioned at 0.25(n+1), or 12.5th in the data.
Q1 will be in the 20<25 interval.
To determine the value of Q1:
- From the 7 items in the preceding interval, 5.5 more are needed to
reach the 12.5th position.
- There are 12 items in the interval that contains Q1.

From this we get: 5.5 / 12 = 0.46

Take this times the size of the interval to get: 0.46 x 5 = 2.3

Add this to the beginning of the interval to get: 2.3 + 20 = 22.3 = Q1
Q2 (the median) is still positioned at 0.5(n+1), or 25th in the data.
Q2 will be in the 30<35 interval.

From the 24 items in the preceding intervals, 1 more is needed to
reach the 25th position.

There is 1 item in the interval that contains Q2.

Since there is only 1 item in the interval, Q2 = mid-interval value
Q2 = 32.5
Q3 is still positioned at 0.75(n+1), or 37.5th in the data.
Q3 will be in the 40<45 interval.

From the 34 items in the preceding intervals, 3.5 more are needed to
reach the 37.5th position.
There are 10 items in the interval that contains Q3.
3.5 / 10 = 0.35

Times the size of the interval: 5 x 0.35 = 1.75
Add to beginning of interval: 1.75 + 40 = 41.75 = Q3
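The interpolation steps used for Q1 and Q3 can be sketched as one function. It assumes equal-width intervals and always interpolates linearly inside the interval, so it does not reproduce the special treatment of Q2 above, where the single-item interval is represented by its mid-interval value. The name `grouped_quartile` is hypothetical.

```python
def grouped_quartile(lower_bounds, frequencies, width, p):
    """Locate position p(n+1) in grouped data and interpolate within its interval."""
    n = sum(frequencies)
    position = p * (n + 1)
    cumulative = 0  # items in the preceding intervals
    for lower, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            still_needed = position - cumulative
            return lower + still_needed / f * width
        cumulative += f
    raise ValueError("position lies beyond the last interval")

lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]

q1 = grouped_quartile(lower_bounds, frequencies, 5, 0.25)
q3 = grouped_quartile(lower_bounds, frequencies, 5, 0.75)
print(round(q1, 2), round(q3, 2))
```

Because the function keeps the unrounded ratio 5.5 / 12 rather than 0.46, its Q1 differs slightly in the second decimal place from the hand calculation.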
Hours Worked
Statistic Raw Data Grouped Data
mean 32.5 32.602
median 30.2 32.5
mode 20 22.08
Q1 21.95 22.3
Q2 30.2 32.5
Q3 41.5 41.75
Weighted Averages allow us to give more importance to certain
data points.
ex: intervals with higher frequencies will make a
greater contribution to the mean than those with lower
frequencies

Ex: Consider 5 brands of wine and their price per bottle.
Wine W $8
Wine X $10
Wine Y $12
Wine Z $55
Wine Q $150

There are 3 options for purchasing the wine:
Bundle 1: 1 bottle of each wine
Bundle 2: 8 bottles of each wine
Bundle 3: 123W, 62X, 32Y, 2Z, 1Q
Let's calculate the average price per bottle for each option.

Bundle 1:
(8 + 10 + 12 + 55 + 150) / 5 = $47 per bottle

Bundle 2:
[8(8) + 10(8) + 12(8) + 55(8) + 150(8)] / [5(8)] = $47 per bottle


Bundle 2 is a weighted average, but all the weights are the same.
For Bundle 3, the weights will be different.

Bundle 3:
[8(123) + 10(62) + 12(32) + 55(2) + 150(1)] / (123 + 62 + 32 + 2 + 1)

= 2248 / 220

= $10.22 per bottle
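The three bundle averages can be reproduced with a small weighted-average sketch, where the quantities act as the weights. The function and variable names are illustrative.

```python
prices = {"W": 8, "X": 10, "Y": 12, "Z": 55, "Q": 150}

def average_price(quantities):
    """Weighted average price per bottle; the quantities are the weights."""
    total_cost = sum(prices[wine] * qty for wine, qty in quantities.items())
    total_bottles = sum(quantities.values())
    return total_cost / total_bottles

bundle_1 = {wine: 1 for wine in prices}
bundle_2 = {wine: 8 for wine in prices}
bundle_3 = {"W": 123, "X": 62, "Y": 32, "Z": 2, "Q": 1}

for name, bundle in [("Bundle 1", bundle_1), ("Bundle 2", bundle_2), ("Bundle 3", bundle_3)]:
    print(name, round(average_price(bundle), 2))
```

Equal weights (Bundles 1 and 2) give the same result as the simple mean; only unequal weights change the answer.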

Quick Summary:

A summary statistic is used to represent a typical value of our data.
- mean
- median
- mode
- quartiles

We can calculate summary statistics for raw data and grouped data.
III. Measures of Dispersion for Raw Data

A summary statistic gives no indication about the dispersion of
values within a set of data.

Ex: You are a tour operator planning activities for two different
tour groups. You are told the average age for each group is 50
years old.

When the tourists arrive you discover the ages of the individuals in
each group are as follows:

group 1: 48, 50, 52, 51, 49

group 2: 22, 85, 72, 27, 64, 39, 41
The range is the difference between the highest and lowest
value in the data set.

group 1 range = 52 - 48 = 4

group 2 range = 85 - 22 = 63


A smaller number indicates all data values are closer together.

A larger number could indicate:
1. data are dispersed
2. there are outliers
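A quick check of both ranges, as a minimal sketch (the helper name `data_range` is illustrative):

```python
group_1 = [48, 50, 52, 51, 49]
group_2 = [22, 85, 72, 27, 64, 39, 41]

def data_range(values):
    """Difference between the highest and lowest value."""
    return max(values) - min(values)

print(data_range(group_1), data_range(group_2))
```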
Variance is a way of measuring how much each data point
varies from the mean value.

Let's calculate the difference between each data point and the
mean: $(x_i - \mu)$ or $(x_i - \bar{x})$

Then, calculate the sum of the differences for each group:
$\sum (x_i - \mu)$ or $\sum (x_i - \bar{x})$

Group 1
xi xi - 50
48 -2
50 0
52 2
51 1
49 -1
Total 0

Group 2
xi xi - 50
22 -28
85 35
72 22
27 -23
64 14
39 -11
41 -9
Total 0
To overcome the problem of the differences from the mean
summing to zero:
square each difference and then sum.
$\sum (x_i - \mu)^2$ or $\sum (x_i - \bar{x})^2$
Group 1
xi xi - 50 (xi - 50)^2
48 -2 4
50 0 0
52 2 4
51 1 1
49 -1 1
Total 0 10

Group 2
xi xi - 50 (xi - 50)^2
22 -28 784
85 35 1225
72 22 484
27 -23 529
64 14 196
39 -11 121
41 -9 81
Total 0 3420
We can see that there is much larger variation from the mean in
group 2 data.
However, because our data sets are of unequal size, we should
adjust for that.

Divide the sum of squared differences by the number of
observations.

group 1: 10/ 5 = 2

group 2: 3420 / 7 = 488.57


This statistic is called the variance.
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
If n - 1 is used in the defining formula for the sample variance, then it is possible to
prove that the average value of the sample variance equals the true (population) variance.
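The claim about n - 1 can be illustrated (not proved) by brute force. The sketch below treats the five Group 1 ages as an entire population, lists every possible ordered sample of size 2 drawn with replacement, and averages the sample variance over all of them. The choice of population and of sample size 2 is an assumption made only to keep the enumeration small.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [48, 50, 52, 51, 49]

# Every ordered sample of size 2, drawn with replacement (25 samples)
samples = list(product(population, repeat=2))

# Average of the sample variance when dividing by n - 1
avg_with_n_minus_1 = mean(variance(s) for s in samples)
# Average of the variance when dividing by n instead
avg_with_n = mean(pvariance(s) for s in samples)

true_variance = pvariance(population)

print(true_variance, avg_with_n_minus_1, avg_with_n)
```

Dividing by n - 1 recovers the population variance on average, while dividing by n underestimates it.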
The square root of the variance is called the standard deviation.
- it is another way to measure the dispersion around the mean

- it is measured in the same units as the data
- unless data is a percent, then standard deviation is
in percentage points
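Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$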
For our example:

group 1: 1.41

group 2: 22.1
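The steps for the two tour groups (deviations, squared deviations, variance, standard deviation) can be traced with a short sketch. It uses the population formulas (dividing by N), matching the hand calculation; the `dispersion` helper is an illustrative name.

```python
import math

group_1 = [48, 50, 52, 51, 49]
group_2 = [22, 85, 72, 27, 64, 39, 41]

def dispersion(values):
    """Return (sum of deviations, sum of squared deviations, variance, std dev)."""
    mu = sum(values) / len(values)
    deviations = [x - mu for x in values]
    sum_of_squares = sum(d ** 2 for d in deviations)
    variance = sum_of_squares / len(values)   # population variance: divide by N
    return sum(deviations), sum_of_squares, variance, math.sqrt(variance)

for name, group in [("group 1", group_1), ("group 2", group_2)]:
    dev_sum, ss, var, sd = dispersion(group)
    print(name, dev_sum, ss, round(var, 2), round(sd, 2))
```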
In the same way that a mean can be skewed by outliers, so can the
variance and standard deviation.

Looking at the median and quartiles may be informative.

The interquartile range (IQR) is the difference between the upper
and lower quartiles.

The quartile deviation (QD), also called the semi-interquartile
range, is the interquartile range divided by 2.
Let's arrange our raw data into quartiles:

First, order the data:

group 1: 48, 50, 52, 51, 49 becomes 48, 49, 50, 51, 52

Then, find Q1, Q2, Q3:
Q2 = median = 50
Q1: 0.25(5+1) = 1.5, so average the 1st and 2nd values: Q1 = 48.5

Q3: 0.75(5+1) = 4.5, so average the 4th and 5th values: Q3 = 51.5

Now, find the IQR and QD:
IQR = 51.5 - 48.5 = 3
QD = 3/2 = 1.5
First, order the data:
group 2: 22, 85, 72, 27, 64, 39, 41 becomes 22, 27, 39, 41, 64, 72, 85

Then, find Q1, Q2, Q3:
Q2 = median = 41
Q1: 0.25(7+1) = 2, so Q1 is the 2nd value: Q1 = 27

Q3: 0.75(7+1) = 6, so Q3 is the 6th value: Q3 = 72

Now, find the IQR and QD:
IQR = 72 - 27 = 45
QD = 45/2 = 22.5


Group 1 has a much lower IQR and QD than Group 2.
Statistic Group 1 Group 2
Mean 50 50
Median 50 41
Range 4 63
Variance 2 488.57
Stand. Dev. 1.41 22.1
Q1 48.5 27
Q2 50 41
Q3 51.5 72
IQR 3 45
QD 1.5 22.5
IV. Measures of Dispersion for Grouped Data

Suppose we have the following frequency distribution table for
swimmers and their ages.
Ages Frequency (fi)
17 < 19 14
19 < 21 19
21 < 23 11
23 < 25 4
25 < 27 1
27 < 29 1
Total 50
To calculate the mean, we'll need the mid-interval values.
Let's calculate the mid-interval values.
Ages Frequency (fi) Mid-Interval Value (xi)
17 < 19 14 18
19 < 21 19 20
21 < 23 11 22
23 < 25 4 24
25 < 27 1 26
27 < 29 1 28
Total 50 na
The mean is given by

$\mu = \frac{\sum f_i x_i}{\sum f_i}$

We know the sum of the frequencies. We need to calculate the
product of the frequencies and mid-interval value and then sum.
Ages fi xi (fi)(xi)
17 < 19 14 18 252
19 < 21 19 20 380
21 < 23 11 22 242
23 < 25 4 24 96
25 < 27 1 26 26
27 < 29 1 28 28
Total 50 na 1024
So the mean for this grouped data is:

1024 / 50 = 20.48

Now that we have the mean, we can calculate the dispersion
around the mean for each mid-interval value. Then square.
- instead of taking each data point minus the mean, we
are using the mid-interval value
Multiply the squared terms by the frequency. Then sum.
Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2
17 < 19 14 18 252 -2.48 6.1504
19 < 21 19 20 380 -0.48 0.2304
21 < 23 11 22 242 1.52 2.3104
23 < 25 4 24 96 3.52 12.3904
25 < 27 1 26 26 5.52 30.4704
27 < 29 1 28 28 7.52 56.5504
Total 50 na 1024 na na
Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2
17 < 19 14 18 252 -2.48 6.1504 86.1056
19 < 21 19 20 380 -0.48 0.2304 4.3776
21 < 23 11 22 242 1.52 2.3104 25.4144
23 < 25 4 24 96 3.52 12.3904 49.5616
25 < 27 1 26 26 5.52 30.4704 30.4704
27 < 29 1 28 28 7.52 56.5504 56.5504
Total 50 na 1024 na na 252.48


We can now use our grouped data variance formula.

$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i}$

Variance = 252.48 / 50 = 5.0496

The standard deviation is 2.247
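The grouped mean, variance and standard deviation for the swimmers can be reproduced from the two table columns. This is a minimal sketch; as in the slides, each mid-interval value stands in for every age in its interval.

```python
import math

midpoints = [18, 20, 22, 24, 26, 28]     # xi
frequencies = [14, 19, 11, 4, 1, 1]      # fi

n = sum(frequencies)
mean_age = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Frequency-weighted squared deviations from the mean
weighted_squares = sum(f * (x - mean_age) ** 2 for f, x in zip(frequencies, midpoints))
variance_age = weighted_squares / n
std_dev_age = math.sqrt(variance_age)

print(mean_age, round(variance_age, 4), round(std_dev_age, 3))
```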
There is an alternative formula for calculating the variance for
grouped data:
$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left( \frac{\sum f_i x_i}{\sum f_i} \right)^2$
Let's calculate the mid-interval value squared and then multiply
it by the frequency. Then sum.
Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2 fi(xi)^2
17 < 19 14 18 252 -2.48 6.1504 86.1056 4536
19 < 21 19 20 380 -0.48 0.2304 4.3776 7600
21 < 23 11 22 242 1.52 2.3104 25.4144 5324
23 < 25 4 24 96 3.52 12.3904 49.5616 2304
25 < 27 1 26 26 5.52 30.4704 30.4704 676
27 < 29 1 28 28 7.52 56.5504 56.5504 784
Total 50 na 1024 na na 252.48 21224
$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left( \frac{\sum f_i x_i}{\sum f_i} \right)^2$
Variance = 21224/50 - (1024/50)^2
= 424.48 - 419.4304
= 5.0496
Same answer as other formula!
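That the two variance formulas agree exactly, and not just after rounding, can be checked with exact fractions. The sketch below uses Python's `fractions.Fraction` to avoid floating-point error; the variable names are illustrative.

```python
from fractions import Fraction

midpoints = [18, 20, 22, 24, 26, 28]     # xi
frequencies = [14, 19, 11, 4, 1, 1]      # fi

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

mean_age = Fraction(sum_fx, n)

# Defining formula: weighted squared deviations from the mean
by_definition = sum(f * (x - mean_age) ** 2 for f, x in zip(frequencies, midpoints)) / n
# Alternative formula: mean of the squares minus the square of the mean
by_shortcut = Fraction(sum_fx2, n) - mean_age ** 2

print(float(by_definition), float(by_shortcut))
```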
Finally, let's calculate the inter-quartile range and the quartile
deviation. We'll need the cumulative frequency to do this.
Q1: 0.25(50 + 1) = 12.75th position

Q1 falls in the first interval, so from the 0 items in preceding
intervals, 12.75 more are needed to reach the 12.75th position.

There are 14 items in the interval that
contains Q1.
From this we get: 12.75 / 14 = 0.91
Take this times the size of the interval to get: 0.91 x 2 = 1.82

Add this to the beginning of the interval to get: 1.82 + 17 = 18.82
= Q1
Ages fi cumulative
17 < 19 14 14
19 < 21 19 33
21 < 23 11 44
23 < 25 4 48
25 < 27 1 49
27 < 29 1 50
Q2: 0.5(50 + 1) = 25.5th position

From the 14 items in the preceding
interval, 11.5 more are needed to reach the
25.5th position.

There are 19 items in the interval that
contains Q2.
From this we get: 11.5 / 19 = 0.605
Take this times the size of the interval to get: 0.605 x 2 = 1.21

Add this to the beginning of the interval to get: 1.21 + 19 = 20.21
= Q2
Q3: 0.75(50 + 1) = 38.25th position

From the 33 items in the preceding
intervals, 5.25 more are needed to reach the
38.25th position.

There are 11 items in the interval that
contains Q3.
From this we get: 5.25 / 11 = 0.4772
Take this times the size of the interval to get: 0.4772 x 2 = 0.954

Add this to the beginning of the interval to get: 0.954 + 21 = 21.95
= Q3
The IQR = Q3 - Q1 = 21.95 - 18.82 = 3.13

The QD = 3.13 / 2 = 1.565



Summary of our Grouped data:
mean 20.48
variance 5.0496
st. dev. 2.247
median 20.21
Q1 18.82
Q2 20.21
Q3 21.95
IQR 3.13
QD 1.57
V. Other Descriptive Statistics

The coefficient of variation (CV) is useful for comparing two
sets of data when
- the means are close but the variances are different
- the means are different but the variances are close

CV is independent of the units of measurement.
$CV = \frac{\sigma}{\mu} \times 100$
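As an illustration, the CV can be computed for the two tour groups from earlier, which share a mean of 50 but have very different variances. The sketch assumes the population standard deviation is used; the function name is illustrative.

```python
import statistics

group_1 = [48, 50, 52, 51, 49]
group_2 = [22, 85, 72, 27, 64, 39, 41]

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (population formula)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

cv_1 = coefficient_of_variation(group_1)
cv_2 = coefficient_of_variation(group_2)
print(round(cv_1, 2), round(cv_2, 2))
```

Because the standard deviation and the mean carry the same units, the units cancel and the two percentages can be compared directly.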
Pearson's coefficient of skewness (sk) gives a measure of the
degree of skewness in a dataset.
- independent of units of measure


sk = 3(mean - median) / standard deviation


A negative value means the data is skewed to the left.

A positive value means the data is skewed to the right.
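The piano-lesson ages from earlier make a convenient check, since the high outlier (32) should produce a positive coefficient. This sketch assumes the population standard deviation is used in the denominator; the function name is illustrative.

```python
import statistics

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]

def pearson_skewness(values):
    """sk = 3(mean - median) / standard deviation (population formula)."""
    return (3 * (statistics.mean(values) - statistics.median(values))
            / statistics.pstdev(values))

sk_with_outlier = pearson_skewness(ages)
sk_without_outlier = pearson_skewness([age for age in ages if age != 32])

print(round(sk_with_outlier, 3), round(sk_without_outlier, 3))
```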
A box plot is a graphical display of the symmetry or skewness
of a dataset.
The middle bar in the box represents the median.
Each end of the box is Q1 and Q3.
The whiskers extend to the minimum and maximum data values.
- as long as the value is within (1.5)(IQR) of the box, that is, no
lower than Q1 - (1.5)(IQR) and no higher than Q3 + (1.5)(IQR)
- otherwise the value is marked with an *
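The numbers behind a box plot can be computed without drawing anything. The sketch below uses the piano-lesson ages and assumes the usual convention that "within (1.5)(IQR)" is measured from Q1 and Q3; quartiles follow the positional rule from Section I, with fractional positions averaged. The helper name `value_at_position` is hypothetical.

```python
import math

def value_at_position(ordered, position):
    """Value at a 1-based position; a fractional position averages its two neighbours."""
    return (ordered[math.floor(position) - 1] + ordered[math.ceil(position) - 1]) / 2

ages = sorted([5, 6, 7, 7, 7, 8, 9, 9, 32])
n = len(ages)

q1 = value_at_position(ages, 0.25 * (n + 1))
median = value_at_position(ages, 0.5 * (n + 1))
q3 = value_at_position(ages, 0.75 * (n + 1))
iqr = q3 - q1

# Values beyond these fences are drawn as individual points (*)
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

inside = [x for x in ages if low_fence <= x <= high_fence]
outliers = [x for x in ages if x < low_fence or x > high_fence]
whiskers = (min(inside), max(inside))

print(q1, median, q3, whiskers, outliers)
```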
Chapter Skills:
Given raw data you should be able to calculate:
mean median
mode quartiles
variance standard deviation
coefficient of variation Pearson's coefficient
box plot

Given raw data you should be able to construct a frequency
distribution table and cumulative frequency.

From grouped data you should be able to calculate:
mean median
mode quartiles
variance standard deviation
coefficient of variation Pearson's coefficient
box plot
