where n is the sample size and the x's correspond to the observed values. Example Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region:
2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
If we computed the mean, we would say that the average house price is $744,000. Although this number is correct, it does not reflect the price of available housing in South Lake Tahoe. A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since there is an even number of outcomes, we take the average of the middle two: (3.7 + 4.1) / 2 = 3.9.
The median house price is $390,000. This better reflects what house shoppers should expect to spend. There is an alternative value that is also resistant to outliers: the trimmed mean, which is the mean after removing the outliers, or the top 5% and the bottom 5% of the data. We can also use the trimmed mean if we are concerned about outliers skewing the data; however, the median is used more often because more people understand it.
Example: At a ski rental shop, data was collected on the number of rentals on each of ten consecutive Saturdays:
44, 50, 38, 96, 42, 47, 40, 39, 46, 50.
To find the sample mean, add them and divide by 10:
(44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50) / 10 = 49.2
Notice that the mean value is not a value of the sample. To find the median, first sort the data:
38, 39, 40, 42, 44, 46, 47, 50, 50, 96
There is an even number of values, so the median is the average of the middle two:
Median = (44 + 46) / 2 = 45
Notice also that the mean is larger than all but three of the data points. The mean is influenced by outliers while the median is robust.
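The comparison between the mean and the median can be checked with a short Python sketch. It uses the two data sets listed above. The trimmed_mean helper is a hypothetical illustration, and dropping exactly one value from each end (10% of a ten-value sample) is an assumption made here, since 5% of ten values is not a whole number.

```python
from statistics import mean, median

# House prices (in units of $100,000) and Saturday rental counts from the text.
house_prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]
rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]


def trimmed_mean(values, drop_each_side=1):
    """Mean after dropping `drop_each_side` values from each end (hypothetical helper)."""
    ordered = sorted(values)
    kept = ordered[drop_each_side:len(ordered) - drop_each_side]
    return mean(kept)


house_mean = mean(house_prices)
house_median = median(house_prices)
house_trimmed = trimmed_mean(house_prices)  # assumption: trim one value per side

rental_mean = mean(rentals)
rental_median = median(rentals)

print(f"house prices: mean={house_mean:.2f} median={house_median:.2f} "
      f"trimmed mean={house_trimmed:.4f}")
print(f"rentals: mean={rental_mean:.1f} median={rental_median}")
```

In both data sets a single large value pulls the mean upward, while the median and the trimmed mean stay near the bulk of the data.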
1. Calculate the mean.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add this column.
5. Divide by n - 1, where n is the number of items in the sample. This is the variance.
6. To get the standard deviation, take the square root of the variance.
Example The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data.
x̄ = 49.2
Below is the table for getting the standard deviation:
x        x - 49.2    (x - 49.2)²
44         -5.2         27.04
50          0.8          0.64
38        -11.2        125.44
96         46.8       2190.24
42         -7.2         51.84
47         -2.2          4.84
40         -9.2         84.64
39        -10.2        104.04
46         -3.2         10.24
50          0.8          0.64
Total                 2599.6
Now
2599.6 / (10 - 1) ≈ 288.8
Hence the variance is approximately 289 and the standard deviation is the square root of 289 = 17. Since the standard deviation can be thought of as measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction. The mean for this example was about 49.2 and the standard deviation was 17. We have:
49.2 - 17 = 32.2 and 49.2 + 17 = 66.2
What this means is that most of the patrons probably spend between $32.20 and $66.20.
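The table procedure can be sketched in a few lines of Python. This is a minimal illustration that follows the steps as described (subtract the mean, square, sum, divide by n - 1, take the square root). The variable names are chosen here for illustration only.

```python
from math import sqrt

# The ten receipt amounts from the example.
receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

n = len(receipts)
x_bar = sum(receipts) / n                      # sample mean

deviations = [x - x_bar for x in receipts]     # x - mean
squared = [d ** 2 for d in deviations]         # (x - mean)^2
sum_sq = sum(squared)                          # total of the squared column

variance = sum_sq / (n - 1)                    # sample variance, divides by n - 1
std_dev = sqrt(variance)                       # sample standard deviation

low, high = x_bar - std_dev, x_bar + std_dev   # one standard deviation either side

print(f"mean={x_bar:.1f} sum of squares={sum_sq:.1f}")
print(f"variance={variance:.1f} standard deviation={std_dev:.1f}")
print(f"one-standard-deviation interval: {low:.1f} to {high:.1f}")
```

Python's standard library also offers statistics.variance and statistics.stdev, which use the same n - 1 divisor; the explicit version above is kept to mirror the columns of the table.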
The sample standard deviation will be denoted by s and the population standard deviation will be denoted by the Greek letter σ. The sample variance will be denoted by s² and the population variance will be denoted by σ². The variance and standard deviation describe how spread out the data is. If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, s will be large. Having outliers will increase the standard deviation. One flaw of the standard deviation is that it depends on the units that are used. One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean, times 100%:
CV = (17 / 49.2) x 100% ≈ 34.6%
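The following sketch illustrates why the coefficient of variation does not depend on the units. It uses the rounded mean and standard deviation from the restaurant example; the function name and the conversion to cents are assumptions added for illustration.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation as a percentage of the mean (hypothetical helper)."""
    return std_dev / mean_value * 100.0


# Values from the example, in dollars.
cv_dollars = coefficient_of_variation(17, 49.2)

# The same quantities expressed in cents: both numbers scale by 100,
# so the ratio is unchanged.
cv_cents = coefficient_of_variation(1700, 4920)

print(f"CV in dollars: {cv_dollars:.1f}%")
print(f"CV in cents:   {cv_cents:.1f}%")
```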
Chebyshev's Theorem
A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean. In particular for any positive k, the proportion of the data that lies within k standard deviations of the mean is at least
1 - 1/k²
For example, if k = 2 this number is
1 - 1/2² = 0.75
This tells us that at least 75% of the data lies within 2 standard deviations of the mean. In the above example, 49.2 - 2(17) = 15.2 and 49.2 + 2(17) = 83.2, so we can say that at least 75% of the diners spent between $15.20 and $83.20.
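As a rough check, the Chebyshev bound can be compared with the actual share of the sample that falls within k standard deviations of the mean. The sketch below uses the ten values from the example together with the rounded mean 49.2 and standard deviation 17; the function names are hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1 - 1 / k ** 2


def proportion_within(values, center, spread, k):
    """Observed share of values lying within k * spread of center."""
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)


data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
bound = chebyshev_lower_bound(2)
observed = proportion_within(data, center=49.2, spread=17, k=2)

print(f"Chebyshev bound for k=2: {bound:.2f}")
print(f"observed proportion:     {observed:.2f}")
```

The bound is a guaranteed minimum for any data set, so the observed proportion can be, and here is, noticeably higher.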
A normal distribution is a very important statistical data distribution pattern occurring in many natural phenomena, such as height, blood pressure, lengths of objects produced by machines, etc. Certain data, when graphed as a histogram (data on the horizontal axis, amount of data on the vertical axis), creates a bell-shaped curve known as a normal curve, or normal distribution. Normal distributions are symmetrical with a single central peak at the mean (average) of the data. The shape of the curve is described as bell-shaped, with the graph falling off evenly on either side of the mean. Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right of the mean. The spread of a normal distribution is controlled by the standard deviation, σ. The smaller the standard deviation, the more concentrated the data. The mean and the median are the same in a normal distribution.
Reading from the chart, we see that approximately 19.1% of normally distributed data is located between the mean (the peak) and 0.5 standard deviations to the right (or left) of the mean.
(The percentages are represented by the area under the curve.) Understand that this chart shows only percentages that correspond to subdivisions up to one-half of one standard deviation. Percentages for other subdivisions require a statistical mathematical table or a graphing calculator. (See example 4)
If you add percentages, you will see that approximately:
68% of the distribution lies within one standard deviation of the mean.
95% of the distribution lies within two standard deviations of the mean.
99.7% of the distribution lies within three standard deviations of the mean.
These percentages are known as the "empirical rule".
Note: The sums of the percentages in the chart at the top of the page differ slightly from the empirical rule values because the chart values have been rounded.
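The empirical rule percentages can be reproduced from the standard normal distribution. The sketch below assumes Python 3.8 or later, where statistics.NormalDist is available in the standard library; the coverage function is a name chosen here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Area between the mean and half a standard deviation to one side.
half_sd = standard_normal.cdf(0.5) - standard_normal.cdf(0.0)
print(f"mean to 0.5 standard deviations: {half_sd:.1%}")

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.2%}")
```

The unrounded values are slightly different from the sums of the rounded chart entries, which is the rounding effect the note above refers to.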
It is also true that 50% of the distribution lies within 0.67448 standard deviations of the mean. If you are asked for the interval about the mean containing 50% of the data, you are actually being asked for the interquartile range, IQR. The IQR (the width of an interval which contains the middle 50% of the data set) is normally computed by subtracting the first quartile from the third quartile. In a normal distribution (with mean 0 and standard deviation 1), the first and third quartiles are located at -0.67448 and +0.67448 respectively. Thus the IQR for a normal distribution is:
Interquartile range = 1.34896 x standard deviation (this will be the population IQR)
Percentiles and the Normal Curve
The mean (at the center peak of the curve) is the 50th percentile. The term "percentile rank" refers to the area (probability) to the left of the value. Adding the given percentages from the chart will let you find
Look for the words "normally distributed" in a question before referring to the Normal Distribution Standard Deviation chart seen on this page. When using the chart, your information should fall on the increments of one-half of one standard deviation as shown in the chart.
Examples:
1. Find the percentage of the normally distributed data that lies within 2 standard deviations of the mean.
Solution: Read the percentages from the chart at the top of this page from -2 to +2 standard deviations. 4.4% + 9.2% + 15.0% + 19.1% + 19.1% + 15.0% + 9.2% + 4.4% = 95.4%
2. At the New Age Information Corporation, the ages of all new employees hired during the last 5 years are normally distributed. Within this curve, 95.4% of the ages, centered about the mean, are between 24.6 and 37.4 years. Find the mean age and the standard deviation of the data.
Solution: As was seen in Example 1, 95.4% implies a span of 2 standard deviations on either side of the mean. The mean age is symmetrically located between -2 standard deviations (24.6) and +2 standard deviations (37.4). The mean age is (24.6 + 37.4) / 2 = 31 years of age. From 31 to 37.4 (a distance of 6.4 years) is 2 standard deviations. Therefore, 1 standard deviation is (6.4)/2 = 3.2 years.
3. The amount of time that Carlos plays video games in any given week is normally distributed. If Carlos plays video games an average of 15 hours per week, with a standard deviation of 3 hours, what is the probability of Carlos playing video games between 15 and 18 hours a week?
Solution: The average (mean) is 15 hours. If the standard deviation is 3, the interval between 15 and 18 hours is one standard deviation above the mean, which gives a probability of 34.1% or 0.341, as seen in the chart at the top of this page.
Solution: Graph the normal curve. We see from the location of 42 on the graph that the answer is going to be quite small.
Now, determine the probability of a value falling to the right of 42 hours (between 42 hours and infinity). Answer: 4.779%
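The arithmetic in Examples 2 and 3, and the quartile positions quoted earlier, can be checked with a short Python sketch. It assumes Python 3.8 or later for statistics.NormalDist, and the variable names are chosen here for illustration. The last example is not reproduced because its mean and standard deviation are not given in the text.

```python
from statistics import NormalDist

# Example 2: the interval 24.6 to 37.4 spans -2 to +2 standard deviations.
low_age, high_age = 24.6, 37.4
mean_age = (low_age + high_age) / 2      # midpoint of the interval
sd_age = (high_age - mean_age) / 2       # half the distance from mean to +2 sd

# Example 3: hours per week, normally distributed with mean 15 and sd 3.
hours = NormalDist(mu=15, sigma=3)
p_15_to_18 = hours.cdf(18) - hours.cdf(15)

# Quartiles of the standard normal distribution and the resulting IQR width.
q3 = NormalDist().inv_cdf(0.75)          # third quartile; the first is -q3
iqr_width = 2 * q3

print(f"Example 2: mean={mean_age:.1f} years, sd={sd_age:.1f} years")
print(f"Example 3: P(15 < X < 18) = {p_15_to_18:.4f}")
print(f"third quartile at {q3:.5f} sd, IQR = {iqr_width:.5f} x sd")
```

The computed probability for Example 3 agrees with the chart value of 34.1% once rounded to one decimal place.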