Professor Oremus
Math 1040 Statistics
28 November 2018
Skittles Term Project
Throughout the semester we have been studying various aspects of statistics and how to
collect, calculate, and analyze data from real-world applications. The Skittles project was done to
show an example of how to interpret and apply statistics in a real-life setting. The first part of the
project was to collect the data we were going to be working with; in this case, it was the number
of each color of Skittles in our bag. Everyone in our class purchased a 2.17-ounce bag of Original
within the overall sample gathered by the class. To do this we first made an educated guess and
listed our own expectations of what we thought the proportions would be and why. Then we
opened the data set and computed the proportions of Red, Orange, Yellow, Green, and Purple
candies in the class data set, noting that the sample size is the total number of candies collected
by the class.
As a guess, we assumed that the proportions of Skittles colors would be more evenly
distributed across a large sample than within a small or individual sample. We predicted that
about 20% of each color would appear in the class sample, although the results might vary by a
few percentage points depending on the sample size. We had discussed how a class sample would
not be large enough to reflect all the different Skittles factories and their production accuracy,
or all Skittles consumers.
The total number of Skittles for the class was 6,680.
The total number of Red Skittles was 1,340, with a relative frequency of 0.201 or 20.1%.
The total number of Orange Skittles was 1,356, with a relative frequency of 0.203 or 20.3%.
The total number of Yellow Skittles was 1,410, with a relative frequency of 0.211 or 21.1%.
The total number of Green Skittles was 1,245, with a relative frequency of 0.186 or 18.6%.
The total number of Purple Skittles was 1,329, with a relative frequency of 0.199 or 19.9%.
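The relative frequencies above can be checked with a few lines of Python. This is only a sketch of the arithmetic (count for one color divided by the total count), not the StatCrunch steps used for the project; the variable names are chosen for illustration.

```python
# Class counts by color, as reported in the class data set.
counts = {"Red": 1340, "Orange": 1356, "Yellow": 1410, "Green": 1245, "Purple": 1329}

# The sample size is the total number of candies collected by the class.
total = sum(counts.values())

# Relative frequency = count for one color / total count.
relative_freq = {color: n / total for color, n in counts.items()}

for color, p in relative_freq.items():
    print(f"{color}: {counts[color]} of {total} = {p:.3f} ({p:.1%})")
```

Rounded to three decimal places, each value matches the relative frequency reported for that color.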
In StatCrunch we created a pie chart and a Pareto chart for the total number of candies of each
color in our class data set.
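As a rough illustration of what a Pareto chart shows, the sketch below orders the class counts from most to least frequent, which is the order of the bars in a Pareto chart. The cumulative share column is an optional extra added here for illustration, and the code is not a reproduction of the StatCrunch output.

```python
# Class counts by color, as reported in the class data set.
counts = {"Red": 1340, "Orange": 1356, "Yellow": 1410, "Green": 1245, "Purple": 1329}
total = sum(counts.values())

# A Pareto chart lists the categories from most frequent to least frequent.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Build rows of (color, count, cumulative share of all candies so far).
pareto_rows = []
running_count = 0
for color, n in ordered:
    running_count += n
    pareto_rows.append((color, n, running_count / total))

for color, n, cumulative_share in pareto_rows:
    print(f"{color:<7}{n:>6}{cumulative_share:>8.1%}")
```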
Then we decided if the class data represented a random sample and what the population was for
There were a few different opinions in our group when discussing whether the class
data would represent a random sample. It was argued that the answer depends on
what we consider the sample and the population. The sample might be
considered a convenience sample if the class members are the sample, although
our focus is on the comparison of Skittles colors and not necessarily on the people chosen to
collect the data. If we considered the entire Skittles production as the population, our data
would be nowhere near enough to represent it: because our class (online and on-campus)
doesn’t represent the entire Skittles production, it would not be accurate to conclude that
the data represent the entire population. If we were to consider all of SLCC as our
population, then our class would be considered a random sample of students who were given
the assignment to collect the Skittles data; essentially, we are a sample of
the students at the school, although one could even argue for an even bigger population, such as
the U.S. It was concluded that we could be considered a random sample because we as a class
We then created a table that displays the counts by color and total from our own bag of candies
together with the counts by color and total for the entire class sample.
The graphs of the class total counts by Skittles color are what I had expected to see.
With a total of 5 colors, I had assumed that there would be about 20% of each color within
each bag. Although I had predicted that the amounts would vary within individual bags, I
assumed that as a larger sample was gathered, the equal percentage of colors would show
more clearly. There were a few outliers within the initial data, although they seemed to be
mainly errors on the participants' part or possibly mistakes made when typing in the data.
These data were removed from the total count because they would have distorted the graphics
and summary statistics. The data from my own bag did differ slightly from the class total
data, although that was expected. The class total shows a more equal distribution of colors,
while my personal bag doesn’t quite reflect an equal distribution. I would estimate that an
even larger sample would reflect an even more equal distribution of colors, although
individual bags may not show this. From just looking at the data collected from my personal
bag, one would assume that the color distribution is not equal, although a larger sample
leads to a much different conclusion.
Using the total number of candies in each bag in our class sample, we computed the
following measures for the variable “Total candies in each bag” and reported these summary
5-number summary for the number of candies per bag: 35, 58, 59, 61, 97
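The five-number summary can also be used to check for unusually small or large bags. The sketch below applies the common 1.5 × IQR rule, which is an assumption here (the report does not say which outlier rule, if any, was used); only the five reported values are taken from the class data.

```python
# Five-number summary reported for "Total candies in each bag".
minimum, q1, median, q3, maximum = 35, 58, 59, 61, 97

# The interquartile range is the spread of the middle 50% of the bags.
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

has_low_outlier = minimum < lower_fence
has_high_outlier = maximum > upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"Minimum below lower fence: {has_low_outlier}")
print(f"Maximum above upper fence: {has_high_outlier}")
```

Under this rule the fences are 53.5 and 65.5, so both the smallest bag (35) and the largest bag (97) would be flagged.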
We then created a frequency histogram for the variable “Total candies in each bag”,
as well as a box plot for the same variable.
I then discussed my findings about the variable “Total candies in each bag” and addressed
the following in my writing: What is the shape of the distribution? Do the graphs reflect what
you expected to see? Does the overall data collected by the whole class agree with your own
data from a single bag of candies? Include the number of candies from your own bag and the
When considering the total number of Skittles in each bag in our sample, we
get a mean of about 60 Skittles in each bag. The shape of the histogram is a bit difficult
to determine when looking at the graph with a bin width of 10. Most of the data is clustered
within the range of 50-70 Skittles per bag, so the histogram mainly looks like it consists
of only two bars with a high frequency and not much else, but the graph’s shape
becomes more distinguishable as the bin width is reduced. When I changed the
bin width to 1, the shape looked much more bell-shaped, with the data clustered around
60-61 and distributed in a bell-shaped manner around that. The graphs were what I had
expected to see: Skittles are meant to be packaged in a fairly precise manner, although there is
some variation around the average of 60 Skittles per bag. The overall data from the class did
agree with my own, with the mean for the whole class being about 60 Skittles, and the count
I also explained the difference between categorical and quantitative data by addressing
the following in my writing: What types of graphs make sense and what types of graphs do
not make sense for categorical data? For quantitative data? Explain why. What types of
calculations make sense and what types of calculations do not make sense for categorical
Categorical and quantitative data are both organized through tables and graphs, although
the types of tables and graphs used to represent each type of data vary. Categorical data
are summarized mainly in frequency tables and relative frequency tables. The graphs
used to represent categorical data are bar graphs, side-by-side bar graphs, and pie charts.
These graphs best represent categorical data because they compare data whose variables
classify individuals by characteristics or attributes, as opposed to
quantitative data, whose variables are based on a numerical measure. Quantitative data, on the
other hand, can be broken down further depending on the type of data, whether
discrete or continuous. Discrete quantitative data, which have a finite or
countable number of possible values, contain values that can be broken down into
categories. Discrete quantitative data with a wide variety of outcomes and continuous data, which
have an infinite number of possible values that are not countable, contain
values that are best grouped into classes. Both discrete and continuous quantitative data are best
organized in tables such as frequency, relative frequency, cumulative frequency, and
cumulative relative frequency tables. The graphs that best represent quantitative data are
histograms, stem-and-leaf plots, dot plots, frequency polygons, ogives, and time series plots.
These tables and graphs much better represent and compare data whose variables have
numerical values.
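To illustrate how quantitative values are grouped into classes, the sketch below builds frequency, relative frequency, cumulative frequency, and cumulative relative frequency columns. The bag totals and the class width are hypothetical values made up for this example; they are not the class data.

```python
from itertools import accumulate

# Hypothetical bag totals for illustration only; these are NOT the class data.
bag_totals = [58, 61, 59, 60, 62, 57, 59, 63, 60, 61, 58, 64]

class_width = 2           # assumed class width
lowest = min(bag_totals)  # lower limit of the first class

# Count how many bags fall in each class [lower, lower + class_width).
freq = {}
for value in bag_totals:
    lower = lowest + ((value - lowest) // class_width) * class_width
    freq[lower] = freq.get(lower, 0) + 1

lowers = sorted(freq)
frequencies = [freq[lower] for lower in lowers]
n = len(bag_totals)

relative = [f / n for f in frequencies]               # relative frequency
cumulative = list(accumulate(frequencies))            # cumulative frequency
cumulative_relative = [c / n for c in cumulative]     # cumulative relative frequency

for lower, f, r, c, cr in zip(lowers, frequencies, relative, cumulative, cumulative_relative):
    print(f"{lower}-{lower + class_width - 1}: {f:>2} {r:.3f} {c:>2} {cr:.3f}")
```

Cumulative columns only make sense because the classes have a natural numerical order, which is why they are used for quantitative data and not for categories such as color.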
Construct a 99% confidence interval estimate for the population proportion of yellow candies.
$0.211 + 2.58\sqrt{\dfrac{0.211(1 - 0.211)}{6680}} = 0.224$ (upper bound)
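The interval for the proportion of yellow candies can be checked with a short script. This is a sketch of the usual large-sample interval for a proportion, using the reported count of 1,410 yellow candies out of 6,680. The critical value is computed with Python's standard library (about 2.576) instead of the rounded 2.58, which does not change the bounds at three decimal places.

```python
from math import sqrt
from statistics import NormalDist

yellow = 1410       # yellow candies in the class sample
n = 6680            # total candies in the class sample
confidence = 0.99   # confidence level

# Sample proportion of yellow candies.
p_hat = yellow / n

# Two-sided critical value from the standard normal distribution.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error: z * sqrt(p_hat * (1 - p_hat) / n).
margin = z * sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - margin
upper = p_hat + margin

print(f"99% confidence interval: {lower:.3f} to {upper:.3f}")
```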
Construct a 95% confidence interval estimate for the population mean number of candies per
bag.
Discuss and interpret (with complete sentences) the results of each of your interval estimates
3. We can be 99% confident that the proportion of yellow skittles is between 0.198 and
0.224. This means that for every hundred skittles, we would expect to find between 20-
unknown parameter that is reported with a level of confidence (a level of confidence represents
the proportion of intervals that will contain the parameter if a large number of different samples
are obtained). This basically means that the confidence interval gives a range of values, between
a lower bound and an upper bound, that is likely to contain the population parameter.
For example, if we were 95% confident that the population mean for a certain
situation lies between 73.4 and 103.7, that would mean that if repeated samples were
taken and a 95% interval computed for each sample, one would expect 95% of those
intervals to contain the population mean. It is important to consider that these calculations must
be made under certain conditions, two of the most important conditions being that the data are
obtained through a simple random sample or from a randomized experiment, and the sample size
aspects of statistical evidence to real-life applications. Throughout the process of working on this
project I have been able to understand and apply many concepts such as organizing and
analyzing data, drawing conclusions, and using confidence intervals and hypothesis tests, as well as
being able to clearly present the data collected and explain it in a report. I was also able to
understand how important and useful statistics is throughout its many applications to our daily
At the beginning of the semester, each student taking the class was asked to purchase a
2.17-ounce bag of Original Skittles to conduct our statistical research. As the semester progressed,
we were able to apply knowledge and concepts that we had learned from each chapter to the
data collected from the Skittles. The first part of the project taught me how to hypothesize, collect
data, and create visual representations of the data collected, as well as how to analyze those
results and apply meaning to them. The second part of the project helped me understand how to compute
the mean, standard deviation, frequency, etc. to be able to create a graph using StatCrunch or
a calculator, but most importantly how to analyze and derive meaning from the shape of the graphs
and boxplots made from the collected data. I was also able to gain a better understanding of
what type of data was collected (categorical and quantitative) and why that is relevant in each
In the third part of the project, I learned how to construct a confidence interval estimate
for the population proportion and mean through StatCrunch and on a calculator, as well as
interpret the results for each interval estimate. I was able to learn a lot about inference and what
can be concluded from collected data and statistical evidence throughout this project and
semester.