
Economics
Dr. Sauer

Ch 2: Descriptive Statistics

Chapter Overview:
I. Working With Raw Data
II. Working With Grouped Data
III. Measures of Dispersion for Raw Data
IV. Measures of Dispersion for Grouped Data
V. Other Measures of Dispersion

I. Working with Raw Data (mean, median and mode)

Suppose you are a manager preparing a report on hours worked by your 49 staff members.

Hours worked in a given week by 49 staff members
37.3  54.2  25.3  59.6  24.5  20.0  29.7
38.8  42.1  39.5  56.8  16.9  18.0  28.5
42.0  39.5  42.6  40.0  44.2  45.5  40.1
56.4  30.2  20.0  22.7  37.8  44.0  23.4
20.2  36.1  18.3  19.7  36.8  26.0  26.5
23.4  15.4  20.0  38.9  42.1  24.0  24.1
18.5  21.3  22.6  37.2  42.9  41.0  17.9

You might like to know the average number of hours worked.

You might also like to know the median hours worked.
- sort the data in ascending order

Hours worked in a given week by 49 staff members (sorted, reading down each column)
15.4  20.0  23.4  26.5  37.3  40.1  44.0
16.9  20.0  23.4  28.5  37.8  41.0  44.2
17.9  20.0  24.0  29.7  38.8  42.0  45.5
18.0  20.2  24.1  30.2  38.9  42.1  54.2
18.3  21.3  24.5  36.1  39.5  42.1  56.4
18.5  22.6  25.3  36.8  39.5  42.6  56.8
19.7  22.7  26.0  37.2  40.0  42.9  59.6

The mode can be determined from the sorted data.
Are there any outliers we should make note of?

mean:
median:
mode:
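As a way to check the three blanks above, here is a minimal Python sketch using the standard library. The variable names are illustrative and not part of the notes.

```python
from statistics import mean, median, multimode

# Hours worked by the 49 staff members, as listed in the notes.
hours = [
    37.3, 54.2, 25.3, 59.6, 24.5, 20.0, 29.7,
    38.8, 42.1, 39.5, 56.8, 16.9, 18.0, 28.5,
    42.0, 39.5, 42.6, 40.0, 44.2, 45.5, 40.1,
    56.4, 30.2, 20.0, 22.7, 37.8, 44.0, 23.4,
    20.2, 36.1, 18.3, 19.7, 36.8, 26.0, 26.5,
    23.4, 15.4, 20.0, 38.9, 42.1, 24.0, 24.1,
    18.5, 21.3, 22.6, 37.2, 42.9, 41.0, 17.9,
]

average_hours = round(mean(hours), 1)  # sum of all values divided by 49
median_hours = median(hours)           # middle (25th) value of the sorted data
modal_hours = multimode(hours)         # most frequent value(s), returned as a list

print(average_hours, median_hours, modal_hours)
```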

One final calculation we might like to make is arranging the data into quartiles. The lower quartile (Q1) is the value of the item closest to position 0.25(n+1).

We've already found Q2:
To find the upper quartile (Q3), use the value of the item closest to position 0.75(n+1).
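The quartile positions can be sketched in Python as follows. With n = 49 the Q1 and Q3 positions fall halfway between two items, so this sketch only reports the item on either side of each position; the helper name `neighbours` is an assumption made for illustration.

```python
import math

# The 49 sorted values from the notes.
hours_sorted = sorted([
    37.3, 54.2, 25.3, 59.6, 24.5, 20.0, 29.7,
    38.8, 42.1, 39.5, 56.8, 16.9, 18.0, 28.5,
    42.0, 39.5, 42.6, 40.0, 44.2, 45.5, 40.1,
    56.4, 30.2, 20.0, 22.7, 37.8, 44.0, 23.4,
    20.2, 36.1, 18.3, 19.7, 36.8, 26.0, 26.5,
    23.4, 15.4, 20.0, 38.9, 42.1, 24.0, 24.1,
    18.5, 21.3, 22.6, 37.2, 42.9, 41.0, 17.9,
])
n = len(hours_sorted)

def neighbours(position):
    """Return the items just below and just above a 1-based position."""
    below = hours_sorted[math.floor(position) - 1]
    above = hours_sorted[math.ceil(position) - 1]
    return below, above

q1_position = 0.25 * (n + 1)  # 12.5
q2_position = 0.50 * (n + 1)  # 25.0
q3_position = 0.75 * (n + 1)  # 37.5

print(q1_position, neighbours(q1_position))
print(q2_position, neighbours(q2_position))
print(q3_position, neighbours(q3_position))
```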

Sometimes the mean is not a good representation of the data.
- a representative statistic is fairly typical of most of the data
Outliers can skew the mean.
Ex: Suppose we have the following data on the ages of students taking piano lessons.
5, 6, 7, 7, 7, 8, 9, 9, 32
Calculate the mean, median and mode:

Drop the outlier and re-calculate the mean, median and mode:
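The effect of the outlier can be checked with a short Python sketch. Treating 32 as the outlier follows the example; the variable names are illustrative.

```python
from statistics import mean, median, multimode

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]
ages_without_outlier = [a for a in ages if a != 32]  # drop the one high value

with_outlier = (mean(ages), median(ages), multimode(ages))
without_outlier = (
    mean(ages_without_outlier),
    median(ages_without_outlier),
    multimode(ages_without_outlier),
)

# Only the mean moves noticeably; the median and mode stay at 7.
print(with_outlier)
print(without_outlier)
```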

Graphically, skewed data has a long tail extending to the outlier.
- low outliers produce graphs skewed to the left
- high outliers produce graphs skewed to the right

For low outliers, the value of the mean will be less than the value of the median. For high outliers, the value of the mean will be more than the value of the median.

II. Working with Grouped Data (mean, median and mode)

Many times it would be impractical to list all of the raw data. Often data is first put into groups.
Example: employment data in the farming, fishing and forestry industry

Employment in the Farming, Fishing and Forestry Industry
Age Group      1991       1996
15-19          4,585      2,826
20-24          11,872     9,319
25-34          27,171     24,492
35-44          31,299     28,210
45-54          31,626     30,902
55-64          33,477     25,846
65 and over    23,519     19,030
Total          163,549    140,625

Note: We are assuming that the values within each interval vary uniformly between the lowest and highest values for the interval.

The mid-interval value is the average value of the data in any interval.
- used to represent the group numerically

Mid-Interval Value for 15-19:

The age of each person in the interval is assumed to be:

Back to our hours worked example


Hours worked in a given week by 49 staff members (sorted, reading down each column)
15.4  20.0  23.4  26.5  37.3  40.1  44.0
16.9  20.0  23.4  28.5  37.8  41.0  44.2
17.9  20.0  24.0  29.7  38.8  42.0  45.5
18.0  20.2  24.1  30.2  38.9  42.1  54.2
18.3  21.3  24.5  36.1  39.5  42.1  56.4
18.5  22.6  25.3  36.8  39.5  42.6  56.8
19.7  22.7  26.0  37.2  40.0  42.9  59.6

Let's group this data into a frequency distribution table.
- choose between 5 and 20 intervals
Data starts at 15.4 and goes to 59.6. Grouping hours by 5s or 10s makes sense. For our data, grouping by 5s will be more revealing.
Complete the frequency distribution table.

Hours Worked   Frequency
15<20
20<25
25<30
30<35
35<40
40<45
45<50
50<55
55<60
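One way to tally the frequencies is sketched below in Python. The interval width of 5 and the 15 to 60 span come from the notes; the variable names are illustrative.

```python
from collections import Counter

hours = [
    37.3, 54.2, 25.3, 59.6, 24.5, 20.0, 29.7,
    38.8, 42.1, 39.5, 56.8, 16.9, 18.0, 28.5,
    42.0, 39.5, 42.6, 40.0, 44.2, 45.5, 40.1,
    56.4, 30.2, 20.0, 22.7, 37.8, 44.0, 23.4,
    20.2, 36.1, 18.3, 19.7, 36.8, 26.0, 26.5,
    23.4, 15.4, 20.0, 38.9, 42.1, 24.0, 24.1,
    18.5, 21.3, 22.6, 37.2, 42.9, 41.0, 17.9,
]
width = 5  # interval size chosen in the notes

# Map each value to the lower bound of its interval, e.g. 23.4 -> 20.
lower_bounds = [int(h // width) * width for h in hours]
counts = Counter(lower_bounds)

# One row per interval: (lower bound, upper bound, frequency).
table = [(lo, lo + width, counts.get(lo, 0)) for lo in range(15, 60, width)]
frequencies = [f for _, _, f in table]

for lo, hi, f in table:
    print(f"{lo}<{hi}: {f}")
```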

Let's calculate the mid-interval values and add them to our table.

Let's calculate the total hours worked for each interval and add it to the table.
total hours = frequency × mid-interval value

We can now calculate the mean for this grouped data.
mean = (Sum of Sub-Group Total Hours Worked) / (Total Number of Workers)

Note: When we calculated total number of hours worked from raw data, we got 1592.5. Starting from grouped data, using the mid-interval and the frequency to calculate the hours worked, we get 1597.5.
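The grouped-data mean can be sketched in Python as below. The frequencies are those of the completed table, and each mid-interval value is taken as the lower bound plus half the interval width; the variable names are illustrative.

```python
width = 5
# Lower bound of each interval mapped to its frequency.
frequency = {15: 7, 20: 12, 25: 5, 30: 1, 35: 9, 40: 10, 45: 1, 50: 1, 55: 3}

# Mid-interval value: halfway between the interval's lower and upper bound.
mid_interval = {lo: lo + width / 2 for lo in frequency}

# Total hours per interval = frequency x mid-interval value.
sub_totals = {lo: frequency[lo] * mid_interval[lo] for lo in frequency}

grouped_total = sum(sub_totals.values())              # 1597.5, as stated in the notes
grouped_mean = grouped_total / sum(frequency.values())

print(grouped_total, round(grouped_mean, 2))
```

The grouped total differs slightly from the raw-data total of 1592.5 because every value in an interval is replaced by that interval's mid-interval value.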

To find the mode, we simply need our frequencies and intervals. Looking at our table, our mode will fall in which interval? Use your formula to calculate the mode:

Now let's calculate the median and quartiles. We'll first need to compute the cumulative frequency and add it to our table.

Hours Worked   Frequency   Cumulative Frequency
15<20          7
20<25          12
25<30          5
30<35          1
35<40          9
40<45          10
45<50          1
50<55          1
55<60          3

Q1 is still positioned at 0.25(n+1) in the data:
Q1 will be in the interval:
To determine the value of Q1:
- From the 7 items in the preceding interval, 5.5 more are needed to reach the 12.5th position.
- There are 12 items in the interval that contains Q1.
From this we get:

Take this times the size of the interval to get:

Add this to the beginning of the interval to get:
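The three steps for Q1 can be sketched as a small Python function. The function name `grouped_quantile` and its parameters are assumptions made for this illustration, not notation from the notes.

```python
def grouped_quantile(lower_bounds, frequencies, width, position):
    """Locate a 1-based position inside grouped data by linear interpolation."""
    cumulative = 0  # items in all preceding intervals
    for lo, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            needed = position - cumulative   # items still needed inside this interval
            fraction = needed / f            # share of the interval to move through
            return lo + fraction * width     # add to the beginning of the interval
        cumulative += f
    raise ValueError("position lies beyond the last interval")

lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]
n = sum(frequencies)

# Position 12.5 falls in 20<25: 5.5 of its 12 items are needed.
q1 = grouped_quantile(lower_bounds, frequencies, 5, 0.25 * (n + 1))
print(round(q1, 2))
```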

To determine the value of Q2:
Q2 (the median) is positioned at 0.5(n+1) in the data:
Q2 will be in the interval:
- From the 24 items in the preceding intervals, 1 more is needed to reach the 25th position.
- There is 1 item in the interval that contains Q2.
Since there is only 1 item in the interval, Q2 = mid-interval value

To determine the value of Q3:
Q3 is still positioned at 0.75(n+1) in the data:
Q3 will be in the interval:
- From the 34 items in the preceding intervals, 3.5 more are needed to reach the 37.5th position.
- There are 10 items in the interval that contains Q3.
From this we get:
Times the size of the interval:
Add to beginning of interval:
_________________________________

Weighted Averages allow us to give more importance to certain data points.
- intervals with higher frequencies will make a greater contribution to the mean than those with lower frequencies

Ex: Consider 5 brands of wine and their price per bottle.
Wine W   $8
Wine X   $10
Wine Y   $12
Wine Z   $55
Wine Q   $150

There are 3 options for purchasing the wine:
Bundle 1: 1 bottle of each wine
Bundle 2: 8 bottles of each wine
Bundle 3: 123W, 62X, 32Y, 2Z, 1Q

Let's calculate the average price per bottle for each option.
Bundle 1:

Bundle 2:

(Bundle 2 is a weighted average, but all the weights are the same.)

Bundle 3:
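The three bundle averages can be checked with a minimal Python sketch. Prices and quantities are taken from the example; the function name `average_price` is illustrative.

```python
prices = {"W": 8, "X": 10, "Y": 12, "Z": 55, "Q": 150}

bundle_1 = {wine: 1 for wine in prices}  # 1 bottle of each wine
bundle_2 = {wine: 8 for wine in prices}  # 8 bottles of each wine
bundle_3 = {"W": 123, "X": 62, "Y": 32, "Z": 2, "Q": 1}

def average_price(quantities):
    """Weighted average price per bottle: the weights are the bottle counts."""
    total_cost = sum(prices[wine] * qty for wine, qty in quantities.items())
    total_bottles = sum(quantities.values())
    return total_cost / total_bottles

avg_1 = average_price(bundle_1)
avg_2 = average_price(bundle_2)
avg_3 = average_price(bundle_3)

print(avg_1, avg_2, round(avg_3, 2))
```

Equal weights (Bundles 1 and 2) give the same average, while Bundle 3 is pulled toward the cheap wines that dominate the order.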

III. Measures of Dispersion for Raw Data

A summary statistic such as the mean gives no indication of the dispersion of values within a set of data.
Ex: You are a tour operator planning activities for two different tour groups. You are told the average age for each group is 50 years old. When the tourists arrive you discover the ages of the individuals in each group are as follows:
group 1: 48, 50, 52, 51, 49
group 2: 22, 85, 72, 27, 64, 39, 41

The range is the difference between the highest and lowest value in the data set.
group 1 range:
group 2 range:

A smaller number indicates all data values are closer together.
A larger number could indicate:
1. the data are dispersed
2. there are outliers
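A short Python sketch of the range calculation for the two tour groups follows; the variable names are illustrative.

```python
group_1 = [48, 50, 52, 51, 49]
group_2 = [22, 85, 72, 27, 64, 39, 41]

# Both groups share the same mean age.
mean_1 = sum(group_1) / len(group_1)
mean_2 = sum(group_2) / len(group_2)

# Range = highest value minus lowest value.
range_1 = max(group_1) - min(group_1)
range_2 = max(group_2) - min(group_2)

print(mean_1, mean_2, range_1, range_2)
```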

Variance is a way of measuring how much each data point varies from the mean value. Let's calculate the difference between each data point and the mean.

Then, calculate the sum of the differences for each group.

Group 1
xi      xi - 50
48
50
52
51
49
Total

Group 2
xi      xi - 50
22
85
72
27
64
39
41
Total

To overcome the problem of the differences from the mean summing to zero: square each difference and then sum.

However, because our data sets are of unequal size, we should adjust for that. Divide the sum of squared differences by the number of observations.
group 1:
group 2:

This statistic is called the variance.

The square root of the variance is called the standard deviation.
- it is another way to measure the dispersion around the mean
- it is measured in the same units as the data
- unless the data is a percent, in which case the standard deviation is in percentage points

group 1:

group 2:
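The variance steps above can be traced in Python as follows. This sketch divides by the number of observations, as the notes do; the function and variable names are illustrative.

```python
import math

group_1 = [48, 50, 52, 51, 49]
group_2 = [22, 85, 72, 27, 64, 39, 41]

def dispersion(data):
    """Return (sum of differences, variance, standard deviation)."""
    mean = sum(data) / len(data)
    differences = [x - mean for x in data]
    squared = [d ** 2 for d in differences]  # squaring removes the signs
    variance = sum(squared) / len(data)      # divide by the number of observations
    return sum(differences), variance, math.sqrt(variance)

diff_sum_1, variance_1, sd_1 = dispersion(group_1)
diff_sum_2, variance_2, sd_2 = dispersion(group_2)

# The plain differences sum to zero, which is why they are squared first.
print(diff_sum_1, variance_1, round(sd_1, 2))
print(diff_sum_2, round(variance_2, 2), round(sd_2, 2))
```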

In the same way that a mean can be skewed by outliers, so can the variance and standard deviation. Looking at the median and quartiles may be informative.
The interquartile range (IQR) is the difference between the upper and lower quartiles.
The quartile deviation (QD), also called the semi-interquartile range, is the IQR divided by 2.

Let's arrange our raw data into quartiles:
First, order the data:
group 1: 48, 50, 52, 51, 49 becomes
Then, find Q1, Q2, Q3:

Now, find the IQR and QD:

For group 2, first, order the data:
group 2: 22, 85, 72, 27, 64, 39, 41 becomes
Then, find Q1, Q2, Q3:

Now, find the IQR and QD:
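The quartile, IQR and QD calculations for both groups can be sketched in Python. One assumption is made here that the notes do not state: when a position such as 1.5 falls between two items, the sketch interpolates linearly between them. The helper names are illustrative.

```python
def value_at(sorted_data, position):
    """Value at a 1-based position; interpolates between neighbours (assumption)."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

def quartile_summary(data):
    """Return (Q1, Q2, Q3, IQR, QD) using positions p(n+1)."""
    ordered = sorted(data)
    n = len(ordered)
    q1, q2, q3 = (value_at(ordered, p * (n + 1)) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    return q1, q2, q3, iqr, iqr / 2

summary_1 = quartile_summary([48, 50, 52, 51, 49])
summary_2 = quartile_summary([22, 85, 72, 27, 64, 39, 41])

print(summary_1)
print(summary_2)
```

For group 2 the positions 2, 4 and 6 are whole numbers, so no interpolation is involved there.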

IV. Measures of Dispersion for Grouped Data

Suppose we have the following frequency distribution table for swimmers and their ages.

Ages      fi   xi
17 < 19   14   18
19 < 21   19   20
21 < 23   11   22
23 < 25   4    24
25 < 27   1    26
27 < 29   1    28
Total     50   na

To calculate the mean, we'll need the mid-interval values. Let's calculate the mid-interval values. The mean is given by
mean = Σ(fi × xi) / Σfi

We know the sum of the frequencies. We need to calculate the product of the frequencies and mid-interval values and then sum. So the mean for this grouped data is:

Now that we have the mean, we can calculate the dispersion around the mean for each mid-interval value. Then square.
- instead of taking each data point minus the mean, we are using the mid-interval value
Multiply the squared terms by the frequency. Then sum.
We can now use our grouped data variance formula.

Variance =

Standard Deviation =

There is an alternative formula for calculating the variance for grouped data:
variance = [Σ(fi × xi^2) / Σfi] - mean^2

Let's calculate the mid-interval value squared and then multiply it by the frequency. Then sum.
Variance =
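Both routes to the grouped-data variance can be compared in a short Python sketch. The shortcut form used here (mean of the squared mid-interval values minus the squared mean) is the standard identity and is assumed to be the alternative formula the notes refer to; the variable names are illustrative.

```python
import math

frequencies = [14, 19, 11, 4, 1, 1]      # fi
mid_values = [18, 20, 22, 24, 26, 28]    # xi
n = sum(frequencies)

grouped_mean = sum(f * x for f, x in zip(frequencies, mid_values)) / n

# Route 1: frequency-weighted squared differences from the mean.
variance_direct = sum(
    f * (x - grouped_mean) ** 2 for f, x in zip(frequencies, mid_values)
) / n

# Route 2: frequency-weighted squares, then subtract the squared mean.
variance_shortcut = (
    sum(f * x ** 2 for f, x in zip(frequencies, mid_values)) / n
    - grouped_mean ** 2
)

standard_deviation = math.sqrt(variance_direct)
print(grouped_mean, round(variance_direct, 4), round(standard_deviation, 3))
```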

Finally, let's calculate the inter-quartile range and the quartile deviation.

Ages      fi   cumulative
17 < 19   14
19 < 21   19
21 < 23   11
23 < 25   4
25 < 27   1
27 < 29   1

Q1:

Q2:

Q3:

IQR =

QD =
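The swimmer quartiles can be sketched in Python by reusing the grouped-data steps from the hours-worked example: find the interval containing position p(n+1), then move the needed fraction of the way through it. The function name is an assumption made for illustration.

```python
def grouped_quantile(lower_bounds, frequencies, width, position):
    """Locate a 1-based position inside grouped data by linear interpolation."""
    cumulative = 0  # items in all preceding intervals
    for lo, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            return lo + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position lies beyond the last interval")

lower_bounds = [17, 19, 21, 23, 25, 27]
frequencies = [14, 19, 11, 4, 1, 1]
n = sum(frequencies)

q1, q2, q3 = (
    grouped_quantile(lower_bounds, frequencies, 2, p * (n + 1))
    for p in (0.25, 0.5, 0.75)
)
iqr = q3 - q1   # interquartile range
qd = iqr / 2    # quartile deviation

print(round(q1, 2), round(q2, 2), round(q3, 2), round(iqr, 2), round(qd, 2))
```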

V. Other Descriptive Statistics

The coefficient of variation (CV) is useful for comparing two sets of data when
- the means are close but the variances are different
- the means are different but the variances are close
CV is independent of the units of measurement.
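As an illustration, the CV of the two tour groups from Section III can be computed in Python. The formula does not survive in these notes, so this sketch assumes the usual definition, standard deviation divided by the mean and expressed as a percent, with the standard deviation computed by dividing by n as elsewhere in the chapter.

```python
from statistics import mean, pstdev

group_1 = [48, 50, 52, 51, 49]
group_2 = [22, 85, 72, 27, 64, 39, 41]

def coefficient_of_variation(data):
    """CV in percent: population standard deviation relative to the mean."""
    return pstdev(data) / mean(data) * 100

cv_1 = coefficient_of_variation(group_1)
cv_2 = coefficient_of_variation(group_2)

# Same mean (50), very different spread.
print(round(cv_1, 2), round(cv_2, 2))
```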

Pearson's coefficient of skewness (sk) gives a measure of the degree of skewness in a dataset.
- independent of units of measure

A negative value means the data is skewed to the left. A positive value means the data is skewed to the right.
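The sign rule can be illustrated in Python with the piano-lesson ages from Section I. The notes do not show which form of the coefficient is intended; this sketch assumes the common median-based form, sk = 3(mean - median) / standard deviation, with the standard deviation computed by dividing by n.

```python
from statistics import mean, median, pstdev

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]

def pearson_skewness(data):
    """Median-based Pearson coefficient (assumed form): 3(mean - median) / sd."""
    return 3 * (mean(data) - median(data)) / pstdev(data)

sk_with_outlier = pearson_skewness(ages)
sk_without_outlier = pearson_skewness([a for a in ages if a != 32])

# The high outlier (32) pulls the mean above the median, so sk is positive.
print(round(sk_with_outlier, 3), round(sk_without_outlier, 3))
```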

A box plot is a graphical display of the symmetry or skewness of a dataset.

The middle bar in the box represents the median. Each end of the box is Q1 and Q3.
The whiskers extend to the minimum and maximum data values.
- as long as the value is within 1.5 × IQR of the box
- otherwise the value is marked with an *
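The numbers behind a box plot can be sketched in Python without any plotting library, again using the piano-lesson ages. Two assumptions are made: quartile positions that fall between two items are interpolated linearly, and the 1.5 × IQR limit is measured from the ends of the box. The names are illustrative.

```python
ages = sorted([5, 6, 7, 7, 7, 8, 9, 9, 32])
n = len(ages)

def value_at(position):
    """Value at a 1-based position; interpolates between neighbours (assumption)."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ages[lower - 1]
    return ages[lower - 1] + fraction * (ages[lower] - ages[lower - 1])

q1, q2, q3 = (value_at(p * (n + 1)) for p in (0.25, 0.5, 0.75))
iqr = q3 - q1

# Limits beyond which a value is drawn as * instead of ending a whisker.
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

inside = [a for a in ages if low_limit <= a <= high_limit]
whiskers = (min(inside), max(inside))
starred = [a for a in ages if a < low_limit or a > high_limit]

print(q1, q2, q3, whiskers, starred)
```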

_____________________________________________________________________
Chapter Skills:

Given raw data you should be able to calculate:
- mean
- median
- mode
- quartiles
- variance
- standard deviation
- coefficient of variation
- Pearson's coefficient
- box plot

Given raw data you should be able to construct a frequency distribution table and cumulative frequency.

From grouped data you should be able to calculate:
- mean
- median
- mode
- quartiles
- variance
- standard deviation
- coefficient of variation
- Pearson's coefficient
- box plot
