
Tarah Van Wyngaarden MATH-1040-011-fa13 Term Project

Part One

1. We selected the Major League Baseball statistics for our data set.

Part Two

2. The categorical variable we selected was Primary Position Played. This selection included 1,340 players with the positions: First Base, Second Base, Third Base, Shortstop, Catcher, Outfielder, and Designated Hitter.

3. The population proportions are as follows:

First Base: 139/1340 = .1037 = 10.37%
Second Base: 148/1340 = .1104 = 11.04%
Third Base: 145/1340 = .1082 = 10.82%
Shortstop: 154/1340 = .1149 = 11.49%
Catcher: 254/1340 = .1896 = 18.96%
Outfielder: 492/1340 = .3672 = 36.72%
Designated Hitter: 8/1340 = .0060 = 0.60%
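The proportions above can be re-derived with a few lines of Python. This is only an illustrative sketch: the counts are copied from the list in item 3, and the variable names are my own.

```python
# Player counts per primary position, copied from item 3 above.
counts = {
    "First Base": 139,
    "Second Base": 148,
    "Third Base": 145,
    "Shortstop": 154,
    "Catcher": 254,
    "Outfielder": 492,
    "Designated Hitter": 8,
}

total = sum(counts.values())  # population size N

# Population proportion for each position: count / N.
proportions = {position: count / total for position, count in counts.items()}

for position, p in proportions.items():
    print(f"{position}: {counts[position]}/{total} = {p:.4f} = {p:.2%}")
```

The counts add up to the full population of 1,340 players, so the proportions sum to 1.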


4. We generated a Simple Random Sample (SRS) of 32 from the categorical data.

5. We then generated a Systematic Random Sample of 32 from the categorical data.


Simple Random Sample (Bar Graph, Pareto Chart, Pie Chart; respectively)

I obtained the Simple Random Sample as follows: I clicked Data, then Sample, chose the column Primary Position Played, entered 32 for my sample size, and clicked the "Compute!" button. To finish, I pressed the Graph button and selected the bar graph, Pareto chart, and pie chart.
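For readers without the sampling software, the same idea can be sketched in Python. This is not the procedure used for the project, only an assumed equivalent: the population is rebuilt from the position counts in Part Two, the one-character position codes follow the comparison tables later in this part, and the seed is arbitrary.

```python
import random

# Position codes and counts: 1/2/3 = First/Second/Third Base,
# S = Shortstop, C = Catcher, O = Outfielder, D = Designated Hitter.
counts = {"1": 139, "2": 148, "3": 145, "S": 154, "C": 254, "O": 492, "D": 8}

# One list entry per player, so every player is equally likely to be drawn.
population = [code for code, count in counts.items() for _ in range(count)]

rng = random.Random(2013)         # arbitrary seed, only for repeatability
srs = rng.sample(population, 32)  # draw 32 players without replacement

print(srs)
```

A different seed gives a different sample, which is why two simple random samples from the same population rarely match exactly.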


Systematic Random Sample (Bar Graph, Pareto Chart, Pie Chart; respectively)

I obtained the Systematic Random Sample by first determining N, n, and the sampling interval k: N = 1,340, n = 32, and k = 1,340/32 = 41 (when rounded down). I then selected a random starting number between 1 and k, which was 1.

The following row numbers were selected: 1, 42, 83, 124, 165, 206, 247, 288, 329, 370, 411, 452, 493, 534, 575, 616, 657, 698, 739, 780, 821, 862, 903, 944, 985, 1026, 1067, 1108, 1149, 1190, 1231, 1272.

Their corresponding Primary Positions Played were: O, 3, C, S, O, C, C, O, O, 1, 2, 3, O, 1, 2, O, 3, O, 3, C, 2, O, C, S, 2, 2, O, 2, C, C, S, S.

I then deleted all the unnecessary data. To finish, I pressed the Graph button and selected the bar graph, Pareto chart, and pie chart.
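The row selection and the tally of positions can be checked with a short Python sketch. The values of N, n, the starting row, and the position codes are taken from the description above; the variable names are illustrative only.

```python
from collections import Counter

N, n = 1340, 32
k = N // n   # sampling interval, rounded down
start = 1    # the random starting row reported above

# Every k-th row, beginning at the starting row.
rows = [start + i * k for i in range(n)]
print(rows)

# Positions recorded for those rows, in the same order.
positions = ("O 3 C S O C C O O 1 2 3 O 1 2 O "
             "3 O 3 C 2 O C S 2 2 O 2 C C S S").split()

# Count each position and express it as a share of the sample.
tally = Counter(positions)
for code, count in sorted(tally.items()):
    print(code, count, f"{count / n:.2%}")
```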


Comparing the Sample Results:


Simple Random Sample:
Position  Count  Percent
2         7      21.88%
3         2      6.25%
C         6      18.75%
O         11     34.38%
S         6      18.75%

Systematic Random Sample:
Position  Count  Percent
1         2      6.25%
2         6      18.75%
3         4      12.5%
C         7      21.88%
O         9      28.13%
S         4      12.5%

Population:
Position  Count  Percent
1         139    10.37%
2         148    11.04%
3         145    10.82%
C         254    18.96%
O         492    36.72%
S         154    11.49%
D         8      0.60%


The Simple Random Sample (Simple) has one fewer category than the Systematic Random Sample (Systematic). This is because Simple did not choose any First Basemen, whereas Systematic chose two. Simple and Systematic chose similar numbers of Second Basemen, seven and six respectively. Simple chose two Third Basemen while Systematic chose four, double Simple's count. The numbers of Catchers chosen were similar, six by Simple and seven by Systematic. The numbers of Outfielders were close, with Simple choosing eleven and Systematic choosing nine. Lastly, Shortstop was chosen more often by Simple, with six, while Systematic had four. Overall, the two sampling methods produced fairly similar results, with surprisingly similar graphs. In both, Outfielder had the highest percentage. I believe that if the sample size had been larger, the two sampling methods would have shown even more similar results.

The Population had one more category than both Simple and Systematic: Designated Hitter. It is understandable that neither sample picked up a Designated Hitter, since that position was only 0.6% of the population. The Population had a slightly higher percentage of Second Basemen than Third Basemen. Systematic showed the same ordering with a moderate gap, while the gap between the two was much larger in Simple. Outfielder once again had the highest percentage in all three groupings. In some ways both sampling methods were close to the actual Population, and in other ways they were further from it. Overall, Systematic was closer to the Population. I can see how both sampling methods are valid, but also how sampling isn't perfect.


Part Three

6. For our quantitative variable, we chose the comfortable decimal-free standby: Home Runs.

7. Next, we computed the population mean, as well as the population standard deviation.

Population mean = 85.1097
Population standard deviation = 97.8935 (If you can believe it!)


8. Next, we pulled a Simple Random Sample of 32 from the population data:

9. Finally, we generated a Stratified Sample of 32 from the quantitative variable of Home Runs, between 16 First Basemen and 16 Catchers:
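A stratified draw of this kind can be sketched in Python as two separate simple random samples, one per stratum. The row labels below are hypothetical placeholders (the real data set rows are not reproduced here); only the stratum sizes, 139 First Basemen and 254 Catchers, and the 16 + 16 split come from this report. The seed is arbitrary.

```python
import random

# Hypothetical row labels for each stratum; only the stratum sizes
# (139 First Basemen, 254 Catchers) come from the population counts.
first_base_rows = list(range(1, 140))
catcher_rows = list(range(1, 255))

rng = random.Random(1040)  # arbitrary seed, only for repeatability

# Draw 16 rows without replacement from each stratum separately.
stratified = {
    "First Base": rng.sample(first_base_rows, 16),
    "Catcher": rng.sample(catcher_rows, 16),
}

sample_size = sum(len(rows) for rows in stratified.values())
print(sample_size, stratified)
```

Because each stratum contributes the same number of players regardless of its size, this is an equal-allocation design rather than a proportional one.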


We generated a Simple Random Sample (SRS) of 32 from the quantitative variable of Home Runs.

Mean: 96.25 Standard Deviation: 81.368933

Frequency Histogram

Box Plot

We generated a Stratified Sample of 32 from the quantitative variable of Home Runs between 16 First Basemen and 16 Catchers.

Mean: 100.90625 Standard Deviation: 68.54401

Frequency Histogram

Box Plot


The Simple Random Sample (SRS) Frequency Histogram was not a bell-shaped (normal) distribution. The SRS's frequencies did not increase to a maximum and then decrease, and the graph didn't have symmetry. The Histogram was skewed to the right: the tallest bars were on the left and the frequencies fell off toward the right in an almost Pareto-chart fashion. The SRS had an interesting gap, which was due to the inclusion of an outlier. The Stratified Random Sample did not have the outlier, because the outlier wasn't selected in its sample.

The Stratified Sample Frequency Histogram was much closer to a bell-shaped (normal) distribution than the SRS. However, it isn't a true bell-shaped distribution. This is because the left side was uniform instead of increasing, while the right side wasn't a mirror image of the left side, nor was it steadily decreasing.

The SRS Box Plot did not indicate a bell-shaped (normal) distribution either. The data set is skewed to the right and, just like the SRS Histogram, its scale is influenced by the outlier. The Stratified Random Sample Box Plot is also skewed to the right, but not in as extreme a manner as the SRS. The Stratified Random Sample Box Plot's second quartile (the median) is more toward the middle of the box, unlike the SRS's, which is toward the left of the box.

Both the SRS and the Stratified Random Sample had similar means, with the SRS having a greater standard deviation. The Population more closely resembled the SRS than the Stratified Sample. The SRS also matched the Population's mean and standard deviation more closely. This result is not surprising, since the Stratified Random Sample only took information from two categorical groups. I believe that if the Stratified Random Sample had included samplings from all seven positions played, it would have been more accurate and in tune with the Population. However, in the end, the Population, SRS, and Stratified Random Sample all showed the same themes and portrayed fairly similar information.


Part Four

10a. In the SRS, 4 of the 32 players sampled were First Basemen, so the sample proportion of First Basemen is 0.125 (12.5%). We are 95% confident that the interval from 0.0104 to 0.2396 contains the true value of the population proportion of First Basemen.

n = 32, x = 4, p̂ = 0.125, C-level = 0.95
In other words: 0.0104 < p < 0.2396, or (0.0104, 0.2396)

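10b. Our Systematic Random Sample generated 2 First Basemen, which is a sample proportion of 0.0625 (6.25%). The confidence interval here, at a 95% confidence level, is 0 to 0.1464, or (0, 0.1464). The computed lower limit is negative, so it is reported as 0, because a proportion cannot be less than 0.

n = 32, x = 2, p̂ = 0.0625, C-level = 0.95
0 < p < 0.1464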
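The two intervals in 10a and 10b are consistent with the standard normal-approximation interval for a proportion, p̂ ± z·sqrt(p̂(1 − p̂)/n). The sketch below is a cross-check under that assumption; the function name is my own, and clipping the limits to the range 0 to 1 is my reading of how the lower limit of 0 in 10b was obtained.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = x / n
    # Two-sided critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    # A proportion cannot fall outside [0, 1], so clip the limits.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

srs_ci = proportion_ci(4, 32)  # 10a: SRS, 4 First Basemen out of 32
sys_ci = proportion_ci(2, 32)  # 10b: Systematic, 2 First Basemen out of 32

print([round(v, 4) for v in srs_ci])
print([round(v, 4) for v in sys_ci])
```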

11a. For our SRS of home run data, we are 95% confident that the population mean falls between 35.61 and 113.45.

x̄ = 74.53, s = 107.95, n = 32
CI = (35.61, 113.45)


11b. Our stratified samples for First Basemen and Catchers resulted in the respective 95% confidence intervals of (39.43, 75.32) and (99.64, 179.61).

First Basemen: x̄ = 57.375, s = 33.671, n = 16, CI = (39.43, 75.32)
Catchers: x̄ = 139.625, s = 75.042, n = 16, CI = (99.64, 179.61)
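The intervals in 11a and 11b match the usual t interval for a mean, x̄ ± t*·s/sqrt(n) with n − 1 degrees of freedom. The sketch below is a cross-check under that assumption. The critical values are hard-coded approximations of the 95% two-tailed t values for 31 and 15 degrees of freedom, standing in for a t table lookup; the function and variable names are illustrative.

```python
from math import sqrt

# Approximate two-tailed 95% critical values of Student's t,
# keyed by degrees of freedom (assumed here in place of a table lookup).
T_STAR = {31: 2.0395, 15: 2.1314}

def mean_ci(x_bar, s, n):
    """t confidence interval for a mean from summary statistics."""
    margin = T_STAR[n - 1] * s / sqrt(n)
    return x_bar - margin, x_bar + margin

srs_ci = mean_ci(74.53, 107.95, 32)          # 11a: SRS of Home Runs
first_base_ci = mean_ci(57.375, 33.671, 16)  # 11b: First Basemen stratum
catcher_ci = mean_ci(139.625, 75.042, 16)    # 11b: Catchers stratum

for name, ci in [("SRS", srs_ci), ("First Basemen", first_base_ci),
                 ("Catchers", catcher_ci)]:
    print(name, tuple(round(v, 2) for v in ci))
```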

For the sample proportion of the First Basemen categorical variable, I made a confidence interval from each of my two sampling methods: SRS and Systematic Random Sample. Each of the two confidence intervals was built to capture, with 95% confidence, the true population proportion of First Basemen. I was happily surprised that both sample confidence intervals did include the true population proportion parameter of .1037!

For the sample mean of the Home Runs quantitative variable, I made three confidence intervals: one from my SRS, and two from my Stratified Sample (one for each stratum). The SRS confidence interval did capture the population mean parameter of 85.11. The two Stratified Sample confidence intervals did not capture the overall population mean parameter; however, each did capture the population mean of its own position group.


Part Five

Hypothesis Test for Population Proportion

In Part Two, my Systematic Random Sample of 32 from the categorical data of Primary Positions Played resulted in a sample proportion of 0 out of 32 (0 or 0%) of Designated Hitters. The population proportion of Designated Hitters is supposedly 8 out of 1340 (0.0060 or 0.60%). Using my sample data, I am going to test the claim that there are 0.60% Designated Hitters in the Major League Baseball statistics data set, using the P-value Method.

Step 1: The original claim is that 0.60% of Players' Primary Position is Designated Hitter: p = 0.0060.
Step 2: The opposite of the claim is that p ≠ 0.0060.
Step 3: Therefore H0: p = 0.0060 and H1: p ≠ 0.0060.
Step 4: For the significance level, I selected α = 0.05, because it is a commonly used level that is neither too lenient nor too strict.
Step 5: We then use 1-PropZTest: p0 = 0.006, x = 0, n = 32, prop ≠ p0.
Step 6: The P-value result is 0.6603.
Step 7: Because the P-value of 0.6603 is greater than the significance level of α = 0.05, we fail to reject the null hypothesis.

Because we fail to reject H0: p = 0.0060, we fail to reject the claim that 0.60% of Players' Primary Position is Designated Hitter. Therefore, based on my sample, it is plausible that 0.60% of Players' Primary Position is Designated Hitter.

Hypothesis Test for the Population Mean

In Part Three, my Simple Random Sample of 32 from the quantitative data of Home Runs resulted in a sample mean of 96.25. The population mean of Home Runs is supposedly 85.1097. Using my sample data, I am going to test the claim that there is a Home Run mean of 85.1097 in the Major League Baseball statistics data set, using the Critical Value Approach.

Step 1: The claim that the Major League Baseball statistics data set has a Home Run population mean of 85.1097 is symbolically expressed as μ = 85.1097.
Step 2: The alternative to the original claim is μ ≠ 85.1097.
Step 3: Therefore H0: μ = 85.1097 and H1: μ ≠ 85.1097.
Step 4: For the significance level, I selected α = 0.05.
Step 5: We then use T-Test: μ0 = 85.1097, x̄ = 96.25, Sx = 81.369, n = 32, μ ≠ μ0.
Step 6: The test statistic is therefore t = 0.775.
Step 7: I found the critical value of t = 2.040 from the t Distribution Table, with df = 31 and α = 0.05 in two tails.

Because the test statistic of t = 0.775 does not fall in the critical region bounded by the critical values of t = ±2.040, we fail to reject the null hypothesis. Because we failed to reject the null hypothesis, the claim of a population mean of 85.1097 is plausible based on my sample.

It is very important that we are careful not to make a Type I error, which is to reject a true null hypothesis. It is helpful to make a distribution graph. Make sure you compare P-values with significance levels, and test statistics with critical values. Remember, you can reject a null hypothesis when the P-value is less than or equal to the significance level, or when the test statistic falls in the critical region beyond the critical value.
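Both tests can be approximately reproduced from the summary numbers in the steps above. The sketch below assumes the usual one-proportion z statistic and one-sample t statistic, which is my understanding of what the calculator's 1-PropZTest and T-Test compute. The critical value 2.040 is the table value quoted in Step 7, and the variable names are illustrative. Small differences in the last decimal place can come from rounding of the inputs.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05

# Proportion test (P-value method): H0: p = 0.006 vs H1: p != 0.006.
p0, x, n = 0.006, 0, 32
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * NormalDist().cdf(-abs(z))  # two-tailed P-value
reject_proportion = p_value <= alpha

# Mean test (critical value approach): H0: mu = 85.1097 vs H1: mu != 85.1097.
mu0, x_bar, s, m = 85.1097, 96.25, 81.369, 32
t = (x_bar - mu0) / (s / sqrt(m))
t_critical = 2.040  # two-tailed table value for df = 31, alpha = 0.05
reject_mean = abs(t) > t_critical

print(f"z = {z:.4f}, P-value = {p_value:.4f}, reject H0: {reject_proportion}")
print(f"t = {t:.3f}, critical value = {t_critical}, reject H0: {reject_mean}")
```

In both cases the decision rule leads to failing to reject the null hypothesis, which agrees with the conclusions above.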


Part Six

The purpose of the Math 1040 Term Project was to better understand and use the concepts we studied, including: collecting samples, organizing and analyzing data, drawing conclusions, and presenting work. Part of the project was done in a group, and part was done individually.

To begin, for part one, as a group we were to choose a data set that would serve as our population for the entire project. My awesome group member Steve and I chose the Major League Baseball statistics for our data set.

In part two, we chose the categorical variable of Primary Position Played and computed the population proportion for each of the baseball positions. We then used two sampling methods: the Simple Random Sample and the Systematic Random Sample. We selected n = 32 so that our sample sizes would be above 30, the usual guideline for treating the sampling distribution as approximately normal. To explore the data further, we individually prepared a Bar Graph, Pareto Chart, and Pie Chart for each of the samples. The goal was to learn how to obtain different types of random samples, and to be able to compare and contrast the samples with one another and with the population.

In part three, as a group, we selected the quantitative variable of Home Runs. We computed the population mean and population standard deviation. Using two different sampling methods, Simple Random Sampling and Stratified Sampling, we selected two samples of size 32. For the stratified sampling, we chose the two groups of 16 First Basemen and 16 Catchers. As individuals, we computed the mean and standard deviation for each sample. I created a Frequency Histogram and a Box Plot for each sample. The goal was to learn how to visualize data via graphs and numerical statistics, and to see that doing so helps to understand, compare, and contrast sample and population results.

In part four, I created a confidence interval for the proportion of First Basemen for each of my samples. I then created a confidence interval for the mean of each of my quantitative samples of Home Runs. The goal was to learn how to make confidence intervals, understand their meaning, and see that they work!

In part five, I completed a hypothesis test for the population proportion and for the population mean. The goal was to learn how to perform and understand hypothesis tests, with emphasis on Type I error.

I learned so much from this project! I never thought I could learn how to use programs like StatCrunch, but it turned out not to be too tricky. Learning how to use StatCrunch for this project taught me how to use data sets to get different samplings (such as a Simple Random Sample), do calculations (such as a mean), and make cool graphs (such as a Pareto Chart). I realized that I too could use a population to get samples, and that my samples could then capture an accurate representation of said population, for instance in confidence intervals. It felt like magic when the calculation formulas of my samples worked in the hypothesis testing of the population. I'm really glad this project had me do the hypothesis tests, because I was finally able to grasp the process and meaning behind them; before, I was mixing up the P-value, critical values, and test statistics!

Skills I learned in this project will certainly have a positive impact on my school career. I'm ready to learn new programs in future classes, knowing now that programs like StatCrunch are not sources of anxiety, but instead very helpful, intuitive applications. Having made graphs myself, I can now more easily spot graphs that enlighten and graphs that deceive, so graphs in future textbooks will be easier for me to understand and more interesting. Because of this class and project, I've already found myself critically thinking over and analyzing sampling methods and sampling results. This part of the project changed my thinking in the biggest way! Having this skill will help me both in my future classes and in life in general. For example, in school I will be able to use valid statistical information when it comes to homework research papers, and in my daily life when debating over viewpoints. I'm really impressed with the many mathematical formulas and with the technology of StatCrunch and my TI-83 Plus calculator, which incorporate them, and that I can now understand and use the formulas and tests myself!
