
Math 1040 Term Project

Analysis of Exhale Data

Professor
Rick Christensen

Team Members
Michelle Chapman
Tatum Plotner
Lonny Runsted

Date
December 2014

Part 1 Select Data Set


Part 2 Select One Categorical Variable
Part 3 Select One Quantitative Variable
Part 4 Create Confidence Intervals
Part 5 Test Hypotheses
Part 6 What was this project about?
Part 7 Reflection: What have you learned? How does it apply to you?

Michelle Chapman
MATH 1040-008
Exhale Study Term Project

Part 1 Select Data Set


The group selected the Exhale data set for the term project.
Math 1040 is an introductory statistics class. As one of our assignments, the class was
asked to form groups to complete a term project that runs through the entire semester.
The project consists of steps that draw on each topic we learn during the semester. The
professor gave us several data sets and, as a group, we chose which one to work from.
The data set that we chose is called the Exhale Study, and it will serve as the population
for the entire term project. Throughout this project we will be collecting samples and
analyzing, computing, and comparing information. We will present our work in different
graphical forms. This project will allow us to work individually as well as together and to
compare our results.

Part 2 Select One Categorical Variable


Gender was selected as the categorical variable, and 40 individuals were sampled from
the population using Simple Random and Systematic sampling methods.
The gender population proportion for the exhale data set is as follows:

          Frequency   Percentage
Male      336         51%
Female    318         49%

[Charts: Full Population, Simple Random Sampling, Systematic Sampling]

How samples were obtained:


The Simple Random sample was obtained by adding a Random column to the data set.
The RAND function in Excel was used to populate each row with a random value
between 0 and 1. The random values were then copied and pasted back in as static values to
prevent the Random column from recalculating. The data set was then sorted
by the Random column from low to high, and the first 40 records were copied into
another worksheet.
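The spreadsheet procedure above can be sketched in Python. This is only an illustration, not the workbook the group used: the record IDs 1 to 654 are a hypothetical stand-in for the data set (654 is the sum of the male and female frequencies), and `simple_random_sample` is an illustrative name.

```python
import random


def simple_random_sample(records, k, seed=None):
    """Mimic the spreadsheet method: give each record a random key in
    [0, 1), sort by that key, and keep the first k records."""
    rng = random.Random(seed)
    keyed = [(rng.random(), record) for record in records]
    keyed.sort(key=lambda pair: pair[0])
    return [record for _, record in keyed[:k]]


# Hypothetical stand-in for the data set: record IDs 1..654.
population = list(range(1, 655))
sample = simple_random_sample(population, 40, seed=1)
print(sorted(sample))
```

Python's built-in `random.sample(population, 40)` draws the same kind of sample in one call; the longer version is shown only because it follows the sort-by-random-column steps.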

The Systematic sample was obtained by choosing a random starting point, the 14th record
in the data set, and then selecting every 16th record after it. Because the sampling
interval (population size divided by sample size) was not a whole number, it was rounded
to the nearest integer, 16. This resulted in 41 records selected at
consistent intervals. The specific records were extracted in Excel by marking every 16th
record. The marked records were then sorted and copied to the spreadsheet.
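As a rough check on the count of 41 records, the selection can be sketched in Python. The population size of 654 is an assumption taken from the gender frequency table (336 + 318), and `systematic_sample` is an illustrative name.

```python
def systematic_sample(population_size, start, interval):
    """Return the 1-based record positions start, start + interval, ...
    that fall within the population."""
    return list(range(start, population_size + 1, interval))


# Assumed population size: 336 males + 318 females = 654 records.
positions = systematic_sample(population_size=654, start=14, interval=16)
print(len(positions), positions[0], positions[-1])
```

Under these assumptions the positions run from record 14 to record 654, which gives 41 records.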

Compare results from the two samples

The Simple Random sample resulted in the following frequencies and percentages:
          Frequency   Percentage
Male      21          53%
Female    19          48%

The Systematic sample results in the following frequencies and percentages:


          Frequency   Percentage
Male      22          54%
Female    19          46%

The frequencies for the male and female genders were very close in number across
both samples. There was only a frequency difference of 2 in the Simple Random
sample and a frequency difference of 3 in the Systematic sample between male and
female individuals.

Compare the results with the population

The Simple Random sample percentages were very similar to those of the Population,
differing by only 1-2%. The Systematic sample showed a larger gap of 3% from the
Population. In all three sets of data, it is interesting to note that the proportion of males
was always higher than that of the females.
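The comparison can be reproduced from the frequency tables with a few lines of Python. This is a minimal sketch that uses only the counts reported above; `percent` is an illustrative helper name.

```python
def percent(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


population_male = percent(336, 336 + 318)
srs_male = percent(21, 21 + 19)
systematic_male = percent(22, 22 + 19)

print(f"Population:    {population_male:.1f}% male")
print(f"Simple Random: {srs_male:.1f}% male")
print(f"Systematic:    {systematic_male:.1f}% male")
```

On the unrounded values the gaps from the population are about 1.1 and 2.3 percentage points, slightly smaller than the gaps between the rounded percentages.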

Part 3 Select One Quantitative Variable

Age was selected as the quantitative variable, and 40 individuals were sampled from the
population using the Simple Random and Cluster sampling methods.

Full Population - Ages


Mean: 9.931
Standard Deviation: 2.953

Five Number Summary:

Minimum:
Q1:
Median: 10
Q3: 12
Max: 19

[Histogram: Full Population - Ages (frequency by age)]

[Box Plot: Full Population - Ages]

Simple Random Sample - Ages


Mean: 9.95
Standard Deviation: 3.403

Five Number Summary:

Minimum:
Q1:
Median:
Q3: 11
Max: 18

[Histogram: Simple Random Sample - Ages (frequency by age group: 4-6, 7-9, 10-12, 13-15, 16-18)]

[Box Plot: Simple Random Sample - Ages]

Cluster Sample - Ages


Mean: 10.525
Standard Deviation: 3.658

Five Number Summary:

Minimum:
Q1:
Median: 11
Q3: 12
Max: 18

[Histogram: Cluster Sample - Ages]

[Box Plot: Cluster Sample - Ages]

The samples that the group selected for Part 3 of the project are Cluster and Simple
Random. These samples are compared here to the Full Population. There are clear
differences between the histograms. The shape of the overall population
appears approximately normal and looks like a bell: it starts low,
increases, and then decreases again. The shape of the Simple Random histogram looks
somewhat like a bell curve, although it appears to be slightly skewed to the right.
The shape of the Cluster histogram is also very close to a bell shape, although on the
right side of the chart it does rise slightly. The box plots are not as easy to read
when trying to interpret the data, so it is better to
focus on the numbers. The means of the Population (9.931) and the Simple Random sample
(9.95) are almost identical. The mean of the Cluster sample (10.525) is a little higher than
those of the Population and the Simple Random sample. The Cluster sample also has the
highest standard deviation (3.658, compared with 3.403 and 2.953).
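For reference, the summary statistics compared above can be computed with Python's standard library. The ages in this sketch are hypothetical, chosen only to show the calculation; they are not taken from the Exhale data set. Quartile conventions also differ between tools, so the quartiles may not match Excel's exactly.

```python
import statistics


def summarize(ages):
    """Return the mean, sample standard deviation, and five-number summary."""
    q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ages),
        "stdev": statistics.stdev(ages),
        "five_number": (min(ages), q1, median, q3, max(ages)),
    }


# Hypothetical ages, for illustration only (not the project's sample).
hypothetical_ages = [4, 6, 7, 8, 9, 10, 10, 11, 12, 13, 15, 18]
summary = summarize(hypothetical_ages)
print(summary)
```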

Part 4 Create Confidence Intervals


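Sample 1: Simple Random Sample Smokers by Gender

95% confidence interval for the proportion: (0.007, 0.196)

Sample 2: Systematic Sample Smokers by Gender

95% confidence interval for the proportion: (0.023, 0.073)

Sample 3: Simple Random Sample - Ages

95% confidence interval for the mean: (8.861, 11.039); df = 39

Sample 4: Cluster Sample - Ages

95% confidence interval for the mean: (9.355, 11.695); df = 39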
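The two age intervals can be reproduced from the sample means and standard deviations in Part 3, assuming the usual t interval for a mean, $\bar{x} \pm t_{\alpha/2} \cdot s/\sqrt{n}$. This is a sketch under that assumption: the critical value 2.0227 is the standard table value for 95% confidence with df = 39, and `t_interval` is an illustrative name.

```python
import math


def t_interval(mean, stdev, n, t_crit):
    """Return (lower, upper) endpoints of mean +/- t_crit * stdev / sqrt(n)."""
    margin = t_crit * stdev / math.sqrt(n)
    return mean - margin, mean + margin


# Assumed two-tailed 95% critical value for df = 39, from a t table.
T_CRIT_DF39 = 2.0227

srs_interval = t_interval(9.95, 3.403, 40, T_CRIT_DF39)
cluster_interval = t_interval(10.525, 3.658, 40, T_CRIT_DF39)
print(srs_interval)
print(cluster_interval)
```

The results agree with the reported endpoints to within rounding of the inputs and of the critical value.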

Explain the Meaning of These Confidence Intervals:


A confidence interval is a range of values used to estimate a
population parameter. We indicate a confidence interval by its endpoints. In the
samples above I used a 95% confidence level. This does not
mean there is a 95% chance that the population parameter is contained in any one
computed interval. It means that if we repeated the sampling many times, about 95% of
the intervals constructed this way would contain the population parameter. We can
increase our confidence in the estimate by widening the interval. Confidence intervals
are one way to represent how "good" an estimate is.

Discuss: Did the Intervals Capture the Parameter?

Each of the intervals captured the value of the population parameter.

Variable   Sample Type      Population Parameter   Lower Endpoint   Upper Endpoint   Contains?
Smoker     Simple Random    0.51                   0.007            0.196            Yes
Smoker     Systematic       0.51                   0.023            0.073            Yes
Age        Simple Random    9.93 years             8.861            11.039           Yes
Age        Cluster Sample   9.93 years             9.355            11.695           Yes

Part 5 Test Hypotheses


A claim is made regarding the simple random sample. The claim is that the proportion
of females is not equal to 50%.

Test: z test for a given proportion
Assumption: sample normally distributed
Test Statistic: $z = \frac{\hat{p} - p}{\sqrt{pq/n}}$
Null Hypothesis: $H_0: p = 0.5$
Alternate Hypothesis: $H_1: p \neq 0.5$
Rejection Criteria: $|z| > z_{\alpha/2}$

The following values are taken from the sample.

$\hat{p} = 0.475$, $p = 0.5$, $q = 0.5$, $n = 40$

Significance Level   Null Hypothesis   Alternate Hypothesis   Test Statistic   P-value
0.05                 $p = 0.5$         $p \neq 0.5$           -0.316           0.749

Because the P-value of 0.749 is greater than the significance level of $\alpha = 0.05$, we fail to
reject the null hypothesis and conclude that there is not sufficient evidence to support
the claim that the proportion of females in the exhale study is not equal to 50%.
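The test statistic and P-value can be checked with a short Python sketch that uses the standard one-proportion z test and the sample values reported above (0.475, 0.5, and 40). The function name is illustrative.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(p_hat, p0, n):
    """Two-tailed z test of H0: p = p0. Returns (z, p_value)."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


z, p_value = one_proportion_z_test(p_hat=0.475, p0=0.5, n=40)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

Computed without rounding z, the P-value comes out near 0.752 rather than 0.749. The small gap is consistent with looking up z rounded to -0.32 in a printed table, and the conclusion is the same either way.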

Means

A claim is made regarding the age of the participants in the exhale data set. The claim
is that the mean age of the participants is not equal to 9.93 years.

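Test: z test for a population mean
Assumption: $\sigma$ is known, sample normally distributed
Test Statistic: $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Null Hypothesis: $H_0: \mu = 9.93$
Alternate Hypothesis: $H_1: \mu \neq 9.93$
Rejection Criteria: $|z| > z_{\alpha/2}$

The following values for the hypothesis test are derived based on the sample:

$\bar{x} = 7.87$, $\mu = 9.93$, $\sigma = 2.95$, $n = 40$

Significance Level   Null Hypothesis   Alternate Hypothesis   Test Statistic   P-value
0.05                 $\mu = 9.93$      $\mu \neq 9.93$        -4.42            0.0001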

As the P-value of 0.0001 is less than the significance level of $\alpha = 0.05$, we reject the
null hypothesis and conclude that there is sufficient evidence to support the
claim that the mean age of participants is not equal to 9.93 years.
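The test statistic can be checked in the same way with a standard one-sample z test, using the reported values 7.87, 9.93, 2.95, and 40. This is a sketch, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def one_mean_z_test(x_bar, mu0, sigma, n):
    """Two-tailed z test of H0: mu = mu0 with known sigma. Returns (z, p_value)."""
    standard_error = sigma / math.sqrt(n)
    z = (x_bar - mu0) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


z, p_value = one_mean_z_test(x_bar=7.87, mu0=9.93, sigma=2.95, n=40)
print(f"z = {z:.2f}, p-value = {p_value:.6f}")
```

The computed z matches -4.42, and the exact P-value is well below the reported 0.0001, which is the smallest value most printed tables show.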

How Samples Meet the Conditions for Testing


Proportion
The sample size must be sufficiently large that $np \geq 5$ and $nq \geq 5$, so that the
sampling distribution of the sample proportion is approximately normal; this is required
in order to do the hypothesis test for proportions. Here $np = 40 \times 0.5 = 20$ and
$nq = 40 \times 0.5 = 20$, so the proportions and sample size are sufficiently large to
create products greater than 5.

Mean
In order to complete the hypothesis test for the mean, the sample must be normally
distributed and the population standard deviation $\sigma$ must be known. The histograms
for age suggest the sample is approximately normally distributed.
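The sample-size condition for the proportion test is simple arithmetic and can be written as a small check. The threshold of 5 follows the rule stated above; the function name is illustrative.

```python
def proportion_conditions_met(n, p, threshold=5):
    """Check that both n*p and n*q (with q = 1 - p) reach the threshold."""
    q = 1 - p
    return n * p >= threshold and n * q >= threshold


# Values from the proportion test: n = 40, p = 0.5, so n*p = n*q = 20.
print(proportion_conditions_met(40, 0.5))
```

A much smaller proportion with the same sample size, for example `proportion_conditions_met(40, 0.05)`, would fail the check because n*p is only 2.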

Explanation of conclusions
Proportion

The sample and population proportions were close to each other. The sample
proportion was 0.475 and the hypothesized population proportion was 0.5. Given that the
values were close to each other, it isn't surprising that we could not reject the null
hypothesis that p = 0.5.
A Type I error in this hypothesis test means rejecting the null hypothesis when it is
true; that is, the claim that the proportion of females equals 0.5 is rejected when it
is true. A Type II error in this test means failing to reject the null hypothesis when it is
false; that is, we fail to reject the claim that the
proportion is equal to 0.5 when the proportion does not equal 0.5.

Means
The sample mean (7.87 years) differs from the hypothesized mean (9.93 years) by
2.06 years, which is large relative to the standard error, so it is not surprising
that we rejected the null hypothesis.
A Type I error in this hypothesis test means the claim that the mean equals 9.93
years is rejected when it is true. A Type II error means we fail to reject the claim
that the mean is equal to 9.93 years when it does not equal 9.93 years.

Part 6 What was this project about?


This project centers on the data from an exhale study done on children and
teenagers. Gender and age were analyzed to determine the characteristics of the
data set. The sampling methods used were simple random, systematic, and cluster. These
methods helped us estimate the male/female proportion and the mean age. Results of the
analysis are shown in the Pareto, Pie and Bar charts. Using the sample data, we
estimated confidence intervals for the gender proportion and mean age. Finally,
hypothesis tests were done to determine whether the data from the sample were significantly
different from the population.

Part 7 Reflection: What have you learned? How does it apply to you?

This project has helped us gain a better understanding of sampling and hypothesis
testing. The skills we learned in this class are important tools that can be
applied in the future to help us examine data and solve problems, as well as to assess
the validity of claims.
Completing the project has also taught me that we can test different solutions, measure
the effects, and identify if there is sufficient evidence to warrant a particular conclusion
via hypothesis testing.
I have realized that statistics comes into play in everyday life. It helps people think
more clearly and critically about information. My education is taking me in the direction
of healthcare and medicine. I never really thought about the importance of statistics with
regard to healthcare, or any field for that matter, but now I see much more clearly how
it helps in our everyday lives.
Statistics is a science of decisions. It's a scientific method to process and analyze
information in a very effective and efficient manner. Statistics offers insight into
whether data and conclusions are trustworthy. This is extremely important
in healthcare, where ignorance and blind acceptance are not options. The skills I have
learned through this semester and this class project have helped me gain a better
understanding of the importance of statistics and its effects in both educational and
professional areas.
