Term Project
Stat The Rainbow
Introduction
I started this project by purchasing a 2.17-ounce bag of Original Skittles. I counted and
recorded the number of candies of each color: red (13), orange (8), yellow (20), green (9), and
purple (11). The total number of candies in my bag of Skittles was 61. This information was
submitted to my instructor. All students in my class were given the same assignment. Our
instructor took the results of 38 students and reported the results. Out of 2435 candies (in 38
bags), 500 were red, 446 were orange, 474 were yellow, 503 were green, and 512 were purple.
Using this data, I developed Pie and Pareto Charts showing the number of candies by color (as
shown below and on the following page).
Organizing and Displaying Categorical Data: Colors
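[Figures: pie chart and Pareto chart, "Number of Skittles, By Color, In 38 Bags." Pareto bars in
descending order: Purple, Green, Red, Yellow, Orange; vertical axis from 0 to 500. Recovered pie
label: 500, 20.53%.]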
According to these charts, the differences among the counts for each color of candy
do not appear particularly large.
The following table demonstrates the data from my own sample bag in comparison to the
data collected from the class as a whole:
Comparison of Individual Data to Class Data

Color     My Sample   Class Sample   My Proportion   Class Proportion
Red       13          500            .213            .205
Orange    8           446            .131            .183
Yellow    20          474            .328            .195
Green     9           503            .148            .207
Purple    11          512            .180            .210
Total     61          2435           1               1
I was surprised that my own findings did not agree with those of the class. Because my own
bag had far more yellow candies than any other color (20, compared with 13 for red, the next
most common), I assumed that most bags would contain a greater number of yellow candies. Yet,
according to the class data, yellow candies were outnumbered by every other color except orange.
Doing this project has helped me to better understand the importance of using a large data
sample in order to make more accurate inferences about an entire population.
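The proportions in the comparison table can be checked with a few lines of Python. This is only a sketch using the counts reported above; the names `my_bag`, `class_bags`, and `proportions` are made up for the illustration.

```python
# Counts of each color, taken from the comparison table.
my_bag = {"Red": 13, "Orange": 8, "Yellow": 20, "Green": 9, "Purple": 11}
class_bags = {"Red": 500, "Orange": 446, "Yellow": 474, "Green": 503, "Purple": 512}


def proportions(counts):
    """Return each color's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {color: round(n / total, 3) for color, n in counts.items()}


my_props = proportions(my_bag)
class_props = proportions(class_bags)

# Print the two proportion columns side by side.
for color in my_bag:
    print(f"{color:<7} {my_props[color]:.3f}  {class_props[color]:.3f}")
```

Yellow stands out in the single bag (.328) but not in the class sample (.195), which is the gap described above.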
Organizing and Displaying Quantitative Data: the Number of Candies per Bag
Another set of information that the class data supplied was the number of candies in each
bag. There were 61 candies in my bag. As stated earlier, the total number of candies in all 38
bags was 2,435. The mean number of candies in each bag was 64.1. The standard deviation of
the number of candies per bag was 13.2; the 5-number summary (minimum, first quartile, median,
third quartile, maximum) was 45, 59, 61, 62, 114. Since my bag had 61 candies, it matched the
median for our class exactly, yet it fell below the mean. Below, you will see a histogram and
a box plot that I developed with this data.
These charts (above, on previous page) indicate a right-skewed distribution of data, with
a somewhat bell-like shape. I didn't expect to see such a gap between the third quartile and the
right whisker. When this data is drawn up in a modified box plot (as shown below), a number of
outliers are revealed. I believe this suggests the possibility that a few (I'm guessing 3) students
gathered their data from Skittles packages that were larger than the designated 2.17-ounce size. If
that was the case, then their data literally skewed the results: the modified box plot below, which
sets the outliers apart, reflects a slightly left-skewed rather than an extremely right-skewed
distribution. It would also have been an example of a non-random sampling error, since the data
wasn't collected from similar samples (sample bags of the same package size).
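The outliers in a modified box plot are usually found with the 1.5 x IQR rule. The sketch below applies that rule to the 5-number summary reported earlier; it assumes the charting software used the same convention, which the project does not state.

```python
# Five-number summary reported for the 38 bags: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = 45, 59, 61, 62, 114

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this count as outliers
upper_fence = q3 + 1.5 * iqr     # values above this count as outliers

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("minimum is an outlier:", minimum < lower_fence)
print("maximum is an outlier:", maximum > upper_fence)
```

Under this rule any bag with more than 66.5 candies is flagged, so the 114-candy bag is far outside the fence. The 45-candy minimum also falls below the lower fence of 54.5.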
are used to identify specific players, not to count or measure them. So it stands to reason that
different types of data require different charts to reflect them. When comparing the number of
different colors of candies within a sample, I used a Pie Chart and a Pareto Chart, because these
charts work best to display categorical data, such as color. On the other hand, when
demonstrating quantitative data, histograms and box plots are more appropriate. Histograms
work well for quantitative data, because they have class boundaries that range from a low limit to
a high one, and can include a full range of integers. The color of a Skittles candy doesn't fall
within a range; either it is one color, or it is another. Since Pareto Charts have gaps between bars,
and Histograms do not, it wouldn't make sense to use a Histogram to display categorical
information. Although it may sound a bit confusing and complicated at first glance, common
sense guides statisticians to recognize the appropriate display for each category of data.
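Specific Value                              Confidence Level   Confidence Interval
Proportion of yellow candies                99%                17.4% to 21.5%
Mean number of candies per bag              95%                59 to 69
Standard deviation of candies per bag       98%                (not given)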
Based on these confidence interval estimates, I can make the following statements:
I have 99% confidence that the true proportion of yellow candies among all Skittles is
between 17.4% and 21.5%.
I have 95% confidence that the true mean number of candies per bag of Skittles is between
59 and 69.
I have 98% confidence that the true standard deviation of the number of candies per bag
lies within the computed interval around the sample value of about 13 candies.
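The first two intervals can be approximately reproduced from the class summary figures. This is a sketch rather than the original work: it assumes a normal-approximation interval for the proportion and a Student's t interval for the mean, and the t critical value 2.026 (two-tailed, 95%, 37 degrees of freedom) is taken from a standard t-table rather than from this project. The interval for the standard deviation is left out because it needs chi-square critical values, which Python's standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist

# Class data reported in this project.
n_candies, n_yellow = 2435, 474
n_bags, mean_per_bag, sd_per_bag = 38, 64.1, 13.2

# 99% interval for the proportion of yellow candies (normal approximation).
p_hat = n_yellow / n_candies
z_star = NormalDist().inv_cdf(1 - 0.01 / 2)   # two-tailed critical z, about 2.576
margin_p = z_star * sqrt(p_hat * (1 - p_hat) / n_candies)
ci_yellow = (p_hat - margin_p, p_hat + margin_p)

# 95% interval for the mean number of candies per bag (Student's t).
T_STAR_95_DF37 = 2.026   # assumed value from a t-table, 37 degrees of freedom
margin_m = T_STAR_95_DF37 * sd_per_bag / sqrt(n_bags)
ci_mean = (mean_per_bag - margin_m, mean_per_bag + margin_m)

print(f"yellow proportion: {ci_yellow[0]:.3f} to {ci_yellow[1]:.3f}")
print(f"mean per bag:      {ci_mean[0]:.1f} to {ci_mean[1]:.1f}")
```

The proportion interval comes out at about 0.174 to 0.215, matching the 17.4% to 21.5% quoted above. The mean interval comes out a little narrower than 59 to 69, which is consistent with the quoted bounds having been rounded outward to whole candies.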
Hypothesis Tests
When a claim is made about a characteristic of an entire population, such as a proportion
or a mean, a hypothesis test can be run on a simple random sample to measure how consistent the
sample data are with that claim. With the results of such a test, a determination can be made,
at a specified significance level, whether there is sufficient evidence to reject the original
claim.
For instance, for the claim that 20% of all Skittles candies are red, I can run a hypothesis
test at a 0.05 significance level. Since the z-score for this test (0.65) is less than the critical value
(1.96), there isn't sufficient evidence to reject the claim that 20% of all Skittles candies are red.
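The z-score for the red-candy claim can be recomputed from the class counts. The sketch below assumes a two-tailed one-proportion z-test, which is consistent with the 1.96 critical value quoted above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n_candies, n_red = 2435, 500
p0 = 0.20       # claimed proportion of red candies
alpha = 0.05    # significance level

p_hat = n_red / n_candies
# The standard error uses the claimed proportion, as the null hypothesis assumes it is true.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n_candies)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject claim: {reject}")
```

The recomputed statistic is about 0.66, in line with the 0.65 reported, and well below 1.96, so the claim is not rejected.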
Another example would be to test the accuracy of the claim that the mean number of
candies in a bag of Skittles is 55, using a 0.01 significance level. Since the t-stat for this test
(4.250) is greater than the critical value (2.715), there is sufficient evidence to reject the claim
that the mean number of candies in a bag of Skittles is 55.
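The t-statistic can be checked against the class summary figures (mean 64.1, standard deviation 13.2, 38 bags). In the sketch below the hypothesized mean is set to 55, the value for which these figures reproduce the reported t-stat of 4.250. The critical value 2.715 is the one quoted in the text and is assumed to be two-tailed with 37 degrees of freedom.

```python
from math import sqrt

# Class summary statistics reported in this project.
n_bags, mean_per_bag, sd_per_bag = 38, 64.1, 13.2

mu0 = 55          # hypothesized mean; reproduces the reported t-stat of 4.250
T_CRIT = 2.715    # critical value quoted in the text (0.01 level, assumed two-tailed)

# One-sample t statistic: distance from the claim in standard-error units.
t_stat = (mean_per_bag - mu0) / (sd_per_bag / sqrt(n_bags))
reject = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.3f}, reject claim: {reject}")
```

Because the statistic exceeds the critical value, the sample gives sufficient evidence to reject the claimed mean.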
The work for both of these hypothesis tests can be found on the following page.