
Mean, Mode, Median, and Standard Deviation

The Mean and Mode


The sample mean is the average: the sum of all the observed outcomes in the sample divided by the number of observations. We use $\bar{x}$ (read "x bar") as the symbol for the sample mean. In math terms,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $n$ is the sample size and the $x_i$ are the observed values.

Example

Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region:

34, 43, 81, 106, 106 and 115


We compute the sample mean by adding and dividing by the number of samples, 6.

$$\bar{x} = \frac{34 + 43 + 81 + 106 + 106 + 115}{6} = \frac{485}{6} \approx 80.83$$


We can say that the sample mean count of the non-indigenous weed is 80.83 per acre.

The mode of a set of data is the number with the highest frequency. In the above example 106 is the mode, since it occurs twice and the rest of the outcomes occur only once.

The population mean is the average of the entire population and is usually impossible to compute. We use the Greek letter $\mu$ for the population mean.
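The weed-count example can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and not part of the original text.

```python
from collections import Counter
from statistics import mean

# Weed counts from the six sampled acres
weed_counts = [34, 43, 81, 106, 106, 115]

# Sample mean: sum of the observations divided by the sample size
sample_mean = mean(weed_counts)

# Mode: the value with the highest frequency, and how often it occurs
mode_value, mode_count = Counter(weed_counts).most_common(1)[0]

print(round(sample_mean, 2))
print(mode_value, mode_count)
```

Counting frequencies with `Counter` also shows how many times the mode occurs, which is the reason 106 qualifies here.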

Median and Trimmed Mean


One problem with using the mean is that it often does not depict the typical outcome. If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome. Such an outcome is called an outlier. An alternative measure is the median. The median is the middle score when the data are sorted. If we have an even number of observations, we take the average of the two middle values. The median is better for describing the typical value when outliers are present. It is often used for income and home prices.

Example

Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the typical house price. In units of $100,000, the prices were

2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
If we computed the mean, we would say that the average house price is $744,000. Although this number is correct, it does not reflect the price of available housing in South Lake Tahoe. A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since there is an even number of outcomes, we take the average of the middle two:

$$\frac{3.7 + 4.1}{2} = 3.9$$

The median house price is $390,000. This better reflects what house shoppers should expect to spend.

There is an alternative value that is also resistant to outliers. This is the trimmed mean, which is the mean computed after removing the outliers, or after removing the top 5% and the bottom 5% of the data. We can also use the trimmed mean if we are concerned about outliers skewing the data; however, the median is used more often since more people understand it.

Example

At a ski rental shop, data was collected on the number of rentals on each of ten consecutive Saturdays:

44, 50, 38, 96, 42, 47, 40, 39, 46, 50.
To find the sample mean, add them and divide by 10:

$$\bar{x} = \frac{44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50}{10} = \frac{492}{10} = 49.2$$
Notice that the mean value is not a value of the sample. To find the median, first sort the data:

38, 39, 40, 42, 44, 46, 47, 50, 50, 96


Notice that there are two middle numbers, 44 and 46. To find the median we take the average of the two.

$$\text{Median} = \frac{44 + 46}{2} = 45$$

Notice also that the mean is larger than all but three of the data points. The mean is influenced by outliers while the median is robust.
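The comparison of mean, median, and trimmed mean for the ski rental data can be sketched in Python. The `trimmed_mean` helper below is a hypothetical illustration, not a standard library function. Note that with only ten observations a 5% trim removes nothing (5% of 10 is less than one value), so the sketch trims 10% from each end instead; that choice is an assumption made for the example.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

# Rentals on ten consecutive Saturdays
rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

rentals_mean = mean(rentals)
rentals_median = median(rentals)
rentals_trimmed = trimmed_mean(rentals, 0.10)  # drops 38 and 96

print(rentals_mean, rentals_median, rentals_trimmed)
```

Dropping the smallest and largest value pulls the trimmed mean close to the median, which is the sense in which both are resistant to the outlier 96.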

Variance, Standard Deviation and Coefficient of Variation


The mean, mode, median, and trimmed mean do a nice job of telling us where the center of the data set is, but often we are interested in more. For example, a pharmaceutical engineer develops a new drug that regulates sugar in the blood. Suppose she finds out that the average sugar content after taking the medication is the optimal level. This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low sugar content while the other half have dangerously high content. Instead of the drug being an effective regulator, it is a deadly poison. What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard deviation do. First we show the formulas for these measurements. Then we will go through the steps on how to use the formulas.

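We define the variance to be

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

and the standard deviation to be

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$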

Variance and Standard Deviation: Step by Step

1. Calculate the mean, $\bar{x}$.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add this column.
5. Divide by n - 1, where n is the number of items in the sample. This is the variance.
6. To get the standard deviation, take the square root of the variance.

Example

The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data.

44, 50, 38, 96, 42, 47, 40, 39, 46, 50


He calculated the mean by adding and dividing by 10 to get

$$\bar{x} = 49.2$$
Below is the table for getting the standard deviation:

| x | x - 49.2 | (x - 49.2)^2 |
|---|---|---|
| 44 | -5.2 | 27.04 |
| 50 | 0.8 | 0.64 |
| 38 | -11.2 | 125.44 |
| 96 | 46.8 | 2190.24 |
| 42 | -7.2 | 51.84 |
| 47 | -2.2 | 4.84 |
| 40 | -9.2 | 84.64 |
| 39 | -10.2 | 104.04 |
| 46 | -3.2 | 10.24 |
| 50 | 0.8 | 0.64 |
| Total | | 2599.6 |

Now

$$s^2 = \frac{2599.6}{10 - 1} \approx 288.8$$

Hence the variance is approximately 289 and the standard deviation is approximately the square root of 289, which is 17.

Since the standard deviation can be thought of as measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction. The mean for this example was 49.2 and the standard deviation was about 17. We have:

49.2 - 17 = 32.2

and

49.2 + 17 = 66.2

What this means is that most of the patrons probably spend between $32.20 and $66.20.
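The step-by-step procedure can be traced in Python using the restaurant receipts. This is a minimal sketch; the variable names are illustrative. Summing the squared deviations exactly gives 2599.6, so the variance comes out near 288.84, which rounds to the 289 used in the text.

```python
from statistics import mean, stdev, variance

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

# Step 1: calculate the mean
x_bar = mean(receipts)

# Steps 2-3: subtract the mean from each value and square the difference
squared_deviations = [(x - x_bar) ** 2 for x in receipts]

# Step 4: add the column of squared differences
total = sum(squared_deviations)

# Step 5: divide by n - 1 to get the sample variance
sample_variance = total / (len(receipts) - 1)

# Step 6: take the square root to get the sample standard deviation
sample_sd = sample_variance ** 0.5

print(round(total, 1), round(sample_variance, 2), round(sample_sd, 2))

# The standard library agrees with the hand computation
print(round(variance(receipts), 2), round(stdev(receipts), 2))
```

`statistics.variance` and `statistics.stdev` use the n - 1 divisor, matching the sample formulas; `pvariance` and `pstdev` are the population versions that divide by n.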

The sample standard deviation is denoted by $s$ and the population standard deviation by the Greek letter $\sigma$. The sample variance is denoted by $s^2$ and the population variance by $\sigma^2$.

The variance and standard deviation describe how spread out the data is. If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, $s$ will be large. Having outliers will increase the standard deviation.

One of the flaws of the standard deviation is that it depends on the units that are used. One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean, times 100%:

$$CV = \frac{s}{\bar{x}} \cdot 100\%$$

In the above example, it is

$$\frac{17}{49.2} \cdot 100\% = 34.6\%$$


This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.
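To illustrate why the coefficient of variation does not depend on units, the sketch below computes it for the receipts in dollars and again in cents. The helper name `coefficient_of_variation` is hypothetical. With the unrounded standard deviation the result is about 34.5%; the 34.6% in the text comes from rounding $s$ to 17 first.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

receipts_dollars = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
receipts_cents = [100 * x for x in receipts_dollars]

cv_dollars = coefficient_of_variation(receipts_dollars)
cv_cents = coefficient_of_variation(receipts_cents)

# Changing units scales the mean and the standard deviation alike,
# so their ratio is unchanged.
print(round(cv_dollars, 1), round(cv_cents, 1))
```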

Chebyshev's Theorem
A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean. In particular for any positive k, the proportion of the data that lies within k standard deviations of the mean is at least

$$1 - \frac{1}{k^2}$$

For example, if k = 2 this number is

$$1 - \frac{1}{2^2} = 0.75$$

This tells us that at least 75% of the data lies within two standard deviations of the mean. In the above example, we can say that at least 75% of the diners spent between

49.2 - 2(17) = 15.2


and

49.2 + 2(17) = 83.2


dollars.
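Chebyshev's bound can be compared with what actually happens in the restaurant data. The sketch below is illustrative only: both function names are hypothetical, and it uses the sample standard deviation as the text does.

```python
from statistics import mean, stdev

def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula is zero or negative, so the bound says nothing.
    return max(0.0, 1 - 1 / k ** 2)

def fraction_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    center, spread = mean(values), stdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

print(chebyshev_lower_bound(2))
print(fraction_within(receipts, 2))
```

Only the $96 receipt falls outside two standard deviations, so the observed share is well above the guaranteed minimum. The bound is a worst case that holds for any data set, which is why it is usually loose.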

A normal distribution is a very important statistical data distribution pattern occurring in many natural phenomena, such as height, blood pressure, lengths of objects produced by machines, etc. Certain data, when graphed as a histogram (data on the horizontal axis, amount of data on the vertical axis), creates a bell-shaped curve known as a normal curve, or normal distribution.

Normal distributions are symmetrical with a single central peak at the mean (average) of the data. The shape of the curve is described as bell-shaped with the graph falling off evenly on either side of the mean. Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right of the mean.

The spread of a normal distribution is controlled by the standard deviation, $\sigma$. The smaller the standard deviation, the more concentrated the data. The mean and the median are the same in a normal distribution.

[Chart: Normal Distribution Standard Deviation chart, prepared by the NY State Education Department]

Reading from the chart, we see that approximately 19.1% of normally distributed data is located between the mean (the peak) and 0.5 standard deviations to the right (or left) of the mean.
(The percentages are represented by the area under the curve.) Understand that this chart shows only percentages that correspond to subdivisions up to one-half of one standard deviation. Percentages for other subdivisions require a statistical mathematical table or a graphing calculator. (See example 4)

If you add percentages, you will see that approximately:

68% of the distribution lies within one standard deviation of the mean.
95% of the distribution lies within two standard deviations of the mean.
99.7% of the distribution lies within three standard deviations of the mean.

These percentages are known as the "empirical rule".
Note: The sums of the percentages in the chart at the top of the page differ slightly from the empirical rule values because of rounding in the chart.
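The empirical rule percentages can be reproduced from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
```

The unrounded values are roughly 68.27%, 95.45%, and 99.73%, which the empirical rule shortens to 68%, 95%, and 99.7%.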

s.d. in callout boxes = standard deviation

It is also true that 50% of the distribution lies within 0.67448 standard deviations of the mean. If you are asked for the interval about the mean containing 50% of the data, you are actually being asked for the interquartile range, IQR. The IQR (the width of an interval which contains the middle 50% of the data set) is normally computed by subtracting the first quartile from the third quartile. In a normal distribution (with mean 0 and standard deviation 1), the first and third quartiles are located at -0.67448 and +0.67448 respectively. Thus the IQR for a normal distribution is:

Interquartile range = 1.34896 x standard deviation (this will be the population IQR)

Percentiles and the Normal Curve

The mean (at the center peak of the curve) is the 50th percentile. The term "percentile rank" refers to the area (probability) to the left of the value. Adding the given percentages from the chart will let you find certain percentiles along the curve.
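The quartile positions and percentile ranks described above can be checked with the inverse and forward normal CDF. This is a sketch using `statistics.NormalDist`; the variable names are illustrative. The library returns the quartile as about 0.67449, which the text truncates to 0.67448.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles: the values with 25% and 75% of the area to their left
q1 = standard_normal.inv_cdf(0.25)
q3 = standard_normal.inv_cdf(0.75)
iqr = q3 - q1

# Percentile rank: area to the left of a value
rank_of_mean = standard_normal.cdf(0)           # the mean
rank_one_sd_above = standard_normal.cdf(1)      # one standard deviation above

print(round(q1, 5), round(q3, 5), round(iqr, 5))
print(round(100 * rank_of_mean, 1), round(100 * rank_one_sd_above, 1))
```

For a normal distribution with a different standard deviation, multiplying `iqr` by that standard deviation gives the interquartile range.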

Look for the words "normally distributed" in a question before referring to the Normal Distribution Standard Deviation chart seen on this page. When using the chart, your information should fall on the increments of one-half of one standard deviation as shown in the chart.

Examples:

1. Find the percentage of the normally distributed data that lies within 2 standard deviations of the mean.

Solution: Read the percentages from the chart at the top of this page from -2 to +2 standard deviations.
4.4% + 9.2% + 15.0% + 19.1% + 19.1% + 15.0% + 9.2% + 4.4% = 95.4%

2. At the New Age Information Corporation, the ages of all new employees hired during the last 5 years are normally distributed. Within this curve, 95.4% of the ages, centered about the mean, are between 24.6 and 37.4 years. Find the mean age and the standard deviation of the data.

Solution: As was seen in Example 1, 95.4% implies a span of 2 standard deviations on either side of the mean. The mean age is symmetrically located between -2 standard deviations (24.6) and +2 standard deviations (37.4). The mean age is (24.6 + 37.4)/2 = 31 years of age. From 31 to 37.4 (a distance of 6.4 years) is 2 standard deviations. Therefore, 1 standard deviation is (6.4)/2 = 3.2 years.

3. The amount of time that Carlos plays video games in any given week is normally distributed. If Carlos plays video games an average of 15 hours per week, with a standard deviation of 3 hours, what is the probability of Carlos playing video games between 15 and 18 hours a week?

Solution: The average (mean) is 15 hours. If the standard deviation is 3, the interval between 15 and 18 hours is one standard deviation above the mean, which gives a probability of 34.1% or 0.341, as seen in the chart at the top of this page.

4. The lifetime of a battery is normally distributed with a mean life of 40 hours and a standard deviation of 1.2 hours. Find the probability that a randomly selected battery lasts longer than 42 hours.

The most accurate answer to a problem such as this cannot be obtained by using the chart at the top of this page. One standard deviation above the mean would be located at 41.2 hours, 2 standard deviations would be at 42.4 hours, and one and one-half standard deviations would be at 41.8 hours. None of these locations corresponds exactly to the needed 42 hours. We need more power than we have in the chart to find the most accurate answer. Calculator to the rescue!

Solution: Graph the normal curve. We see from the location of 42 on the graph that the answer is going to be quite small.

Now, determine the probability of a value falling to the right of 42 hours (between 42 hours and infinity). Answer: 4.779%
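In place of a graphing calculator, Examples 2 through 4 can be sketched in Python with `statistics.NormalDist`. The helper `mean_and_sd_from_interval` and all variable names are hypothetical, introduced only for illustration.

```python
from statistics import NormalDist

def mean_and_sd_from_interval(lower, upper, k):
    """Mean and standard deviation when [lower, upper] spans k sd on each side."""
    center = (lower + upper) / 2
    sd = (upper - lower) / (2 * k)
    return center, sd

# Example 2: 95.4% of ages lie between 24.6 and 37.4 (2 sd on each side)
mean_age, sd_age = mean_and_sd_from_interval(24.6, 37.4, k=2)

# Example 3: probability of playing between 15 and 18 hours
gaming_hours = NormalDist(mu=15, sigma=3)
p_15_to_18 = gaming_hours.cdf(18) - gaming_hours.cdf(15)

# Example 4: probability that a battery lasts longer than 42 hours,
# that is, the area to the right of 42
battery_life = NormalDist(mu=40, sigma=1.2)
p_longer_than_42 = 1 - battery_life.cdf(42)

print(round(mean_age, 1), round(sd_age, 1))
print(round(p_15_to_18, 4))
print(round(p_longer_than_42, 5))
```

The results should agree with the worked answers: a mean age of 31 with a standard deviation of 3.2, a probability near 0.341 for Example 3, and about 4.779% for Example 4. Because the CDF is evaluated directly, Example 4 no longer depends on 42 hours falling on a half-standard-deviation increment of the chart.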
