
Math 1040

A Statistical Evaluation of Skittles


Term Group Project
SLCC - Spring 2017
May 4, 2017

Chelsea Rinde
Jessica Reynolds
Paul Baker
Trisha Snow
Whytnie Heuser

Report Introduction:

We were tasked with analyzing bags of Skittles. Is there a consistent number of candies in any
given 2.17 oz bag? Is the number of candies of a given color normally distributed, or is
it random? Using a sample size based on the number of students enrolled in the current class,
we recorded all the metrics of each bag of Skittles. These metrics were then
used to answer these questions using five-number
summaries, histograms, confidence intervals, and hypothesis testing.

Organizing and Displaying Categorical Data: Colors

To visualize the proportion of each Skittle color, we created a pie chart of relative frequency.
From smallest relative frequency to largest we have: red, orange, green, yellow, and purple, with
respective relative frequencies of 0.186, 0.199, 0.204, 0.204, and 0.207.
Pie charts show proportions well but do a poor job of conveying quantity. A Pareto
chart supplements the pie chart by showing the frequency of each category together with the
cumulative relative frequency, so we can see how the total builds toward 100% with each additional
category.
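
The cumulative column of a Pareto chart can be sketched from the relative frequencies quoted above. This is only an illustration: the function name `pareto_rows` is hypothetical, and the rounded frequencies from the text stand in for the class's actual counts.

```python
# Relative frequencies as reported in the text (rounded values).
rel_freq = {
    "red": 0.186,
    "orange": 0.199,
    "green": 0.204,
    "yellow": 0.204,
    "purple": 0.207,
}

def pareto_rows(freqs):
    """Return (category, frequency, cumulative frequency) rows, largest first."""
    rows = []
    running = 0.0
    # A Pareto chart orders categories from most to least frequent.
    for color, f in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        running += f
        rows.append((color, f, round(running, 3)))
    return rows

for color, f, cum in pareto_rows(rel_freq):
    print(f"{color:<7} {f:.3f}  cumulative {cum:.3f}")
```

The last cumulative value reaches 1.000, which is the 100% line the Pareto chart climbs toward.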

In addition to the charts, it is helpful to see the
numbers broken down without visualization. The table below displays the data for the class as a
whole. The columns are the colors, and the rows
are the total count and the sample proportion.

For our individual contributions to the class sample, our own bags are
displayed in the following table. Again, the data is broken into columns by color and rows by
individual contribution. The last two rows are the totals for our group, followed by the proportions within
our group.
We found that purple was the most frequently occurring color, while red was the least.
Comparing our contributions with the class sample shows, preliminarily, that the numbers appear to be
consistent from bag to bag. This is reflected by our proportions being similar to the sample's, with
purple being the highest proportion and red being the smallest. We expect that the class
data agrees with a single bag of candy, because we also calculated the minimum and maximum
number of candies per bag for the entire class, and our group's results are consistent with these.

Organizing and Displaying Quantitative Data: The number of candies per bag

Treating the data from our whole class, not just our group, as the sample, we determined the
following.

Regarding the number of Skittles per bag:


Sample Mean x̄: 60.4
Sample Standard Deviation s: 1.74
Sample Size n: 24

There are 24 bags in total, and from them we calculated the sample mean and standard
deviation. The boxplot also visualizes this data. The data appears to be roughly symmetric, which is
consistent with a normal distribution, because the whiskers of the boxplot are approximately equal
in length. The box itself isn't very large in relation to the overall range, so there isn't much
spread (which we confirm with our sample standard deviation s).
To assist visualization further, we can show the number of Skittles per bag within our
class sample with a histogram. Using a class width of one, we can see that bags containing 59
and 61 candies occur most frequently, while bags of 62 and 63 candies occur least
frequently.
Taking an early look at our own contributions, our group's counts appear to be
similar to the sample's, with our numbers being very
close to the most frequently occurring values in the sample.

After calculating the mean number of candies per bag and a five-number summary, we find
that the distribution is bell-shaped according to the histogram. The Pareto chart cannot
be described as normally distributed, because it is used for categorical data rather than numerical data. The
five-number summary is also consistent with a normal distribution: there is no apparent skew either left or right.
This is what we expected to see, since these summaries reflect the underlying data.

Categorical vs. Quantitative Data Reflection

In this project, the categorical data is the color of the candies in each bag, while the
quantitative data is the number of candies in each bag, along with the totals for the rest of
the class. The graphs that make sense for the categorical data are pie and Pareto charts. The
graphs that make sense for the quantitative data are boxplots and histograms.

Confidence Interval Estimates

A confidence interval is a range between two values that, at a stated level of confidence,
is expected to contain a population parameter. This estimate helps to quantify the certainty, or
margin of error, associated with our sample study.

Confidence interval of proportion of yellow candy in the sample:


Using 99% confidence and the calculated proportion of yellow candy within the sample, we
can calculate the margin of error and then generate our confidence interval. With this
information we can project our proportion to the population. For the whole population of
Skittles, we can say with 99% confidence that the proportion of yellow candy is between
0.177 and 0.231.
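
As a rough check on this interval, the standard one-proportion formula p̂ ± z·sqrt(p̂(1 − p̂)/n) can be sketched as follows. The total number of candies is not stated in the report, so `n_candies = 1450` is an assumption (24 bags times the mean of 60.4 candies per bag, rounded), and `proportion_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    # Two-tailed critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumption: total candy count estimated as 24 bags * 60.4 candies per bag.
n_candies = 1450
low, high = proportion_ci(0.204, n_candies, 0.99)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumed sample size, the result lands on the interval reported above; the exact class count could shift the last digit.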

Confidence interval of the average candies per bag, sample to population:

Using 95% confidence, we can estimate the mean number of candies per bag. With our
sample of 24 bags, we can calculate the sample mean and the margin of error of our
interval. We can project this data to the population and state, with 95% confidence, that the mean
number of candies in a bag of Skittles is between 59.665 and 61.135. These numbers need to be interpreted
as discrete (unbroken) candies, so there will be between 60 and 62 candies per bag on average.
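
The interval for the mean can be reproduced from the summary statistics with x̄ ± t·s/sqrt(n). In this sketch the critical value 2.069 (two-tailed, 95%, 23 degrees of freedom) is taken from a standard t table; it is not stated in the report, so treat it as an assumption.

```python
from math import sqrt

# Summary statistics reported for the class sample.
x_bar, s, n = 60.4, 1.74, 24

# Assumption: t critical value for 95% confidence with n - 1 = 23 degrees
# of freedom, read from a t table rather than given in the report.
t_crit = 2.069

margin = t_crit * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(f"{low:.3f} to {high:.3f}")
```

A t critical value is used instead of z because the population standard deviation is unknown and the sample is small.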

Hypothesis Tests

A statistical hypothesis is an assumption about a population parameter. This assumption, or
claim, may or may not be true. Hypothesis testing is the procedure by which the claim is tested. Data
is gathered from a sample of the population and used to test the hypothesis. If the data is consistent with the null
hypothesis, we fail to reject it; if the data provides sufficient evidence against it, it is rejected.

The first hypothesis test addressed the claim that 20% of all Skittles are red, using a 0.05
significance level. This is a two-tailed test because it uses an equal vs. not equal
claim. The absolute value of the test statistic, -1.371, is less than the critical value of 1.96; therefore, we fail to reject
the null hypothesis. There is not sufficient sample evidence to warrant rejection of the claim that 20% of all
Skittles are red.
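
The red-proportion test can be sketched with the one-proportion z statistic, z = (p̂ − p₀)/sqrt(p₀(1 − p₀)/n). As before, `n_candies = 1450` is an assumed total (24 bags times 60.4 candies), and the rounded proportion 0.186 is used, so the statistic comes out close to, but not exactly, the reported -1.371.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(p_hat, p0, n):
    """z statistic for testing H0: p = p0 against a sample proportion p_hat."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

alpha = 0.05
n_candies = 1450  # assumption: total candies, not stated in the report

z = one_proportion_z(0.186, 0.20, n_candies)
# Two-tailed critical value for alpha = 0.05, about 1.96.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z) > z_crit

print(f"z = {z:.3f}, critical value = {z_crit:.2f}, reject H0: {reject_null}")
```

The decision compares the absolute value of z with the critical value, which is why a negative statistic can still be tested against a positive cutoff.
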
The second hypothesis test addressed the claim that the mean number of Skittles per bag is 55, using a
0.01 significance level. This is also a two-tailed test, due to the equal vs. not equal
claim. The test statistic, 15.211, is greater than the critical value of 2.807; therefore, we reject the
null hypothesis. There is sufficient sample evidence to warrant rejection of the claim that the mean number of Skittles in a bag is
55.
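
The test on the mean can be sketched the same way with t = (x̄ − μ₀)/(s/sqrt(n)). This uses the rounded summary statistics from earlier in the report, so the result differs slightly from the reported 15.211; the critical value 2.807 is the one quoted above, and `one_sample_t` is a hypothetical helper name.

```python
from math import sqrt

def one_sample_t(x_bar, mu0, s, n):
    """t statistic for testing H0: mu = mu0 from summary statistics."""
    return (x_bar - mu0) / (s / sqrt(n))

# Rounded summary statistics reported for the class sample.
t_stat = one_sample_t(x_bar=60.4, mu0=55, s=1.74, n=24)

# Critical value quoted in the report (two-tailed, alpha = 0.01, 23 df).
t_crit = 2.807
reject_null = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, critical value = {t_crit}, reject H0: {reject_null}")
```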

Statistical Evaluation Reflection

Confidence interval estimates for a population mean require the sampling method to be simple random
sampling, which was accomplished in our study. Each student randomly selected a Skittles bag to use from
a variety of store locations. The number and color of Skittles within each bag were also random,
as the manufacturer did not dictate exact number and color combinations for each bag. The second
condition for interval estimates is that the sampling distribution be approximately normal.
As shown by the general bell shape of the Skittles Per Bag histogram, the sample data was
approximately normally distributed.

Conditions for performing hypothesis tests also require the simple random sampling method and
that the sampling distribution is approximately normally distributed. Based on the sampling method used
and sample data, we believe these conditions have been met.

Though we feel the appropriate conditions have been met to conduct our statistical
evaluation, there are still potential errors that could have occurred. Any of the students could
have made a mistake while counting. Any of the students or the professor could have made a
typographical error while submitting the data to the class for use. We are also making the naive
assumption that our region (where all students purchased Skittles) is not special and that the
population of Skittles, as a whole, is bagged the same way regardless of geographical
destination. If there are distinct regions, we have not taken them into account, and we do not know where
their borders lie. Because of this, errors could be introduced if students in this class purchased their bags
across these borders. The sample they gathered may then not correctly reflect either the overall
population or the population of the region.

To limit the possibility of the sampling errors mentioned above, we could purchase
Skittles bags from a wider geographical area, such as across the state, from various locations in the
country, or even from across the world. By purchasing from a wider geographical area, we could
account for any differences in bagging the candy or other potential deviations made by the
manufacturer. Once the bags of candy were obtained, we could have each bag counted and the
data recorded and peer-reviewed for accuracy by different individuals to ensure the correct
sample data was obtained. This would help limit mistakes from human error.

After conducting our statistical analysis, we have found that, given our sample and assuming it
to be of adequate size and not just arbitrarily tied to the number of students enrolled in class, we can
make inferences about the population, such as the proportion of a given color of candy and how
much candy will be packaged per bag. We can also perform hypothesis tests to reject or
fail to reject claims about the population. We conclude that this data appears to be normally distributed
and that the number of candies per bag, as well as the proportion of each color, is most likely
intentionally set and not random.