
Chelsea Henriquez Statistics 1040

Group Project

Introduction

We will use the statistical analysis from our Math 1040 class to describe Major League Baseball data. The two variables we are most concerned with are how a player was elected to the Hall of Fame and the number of home runs hit. We will take a simple random sample and a systematic sample for each variable. How a player was elected to the Hall of Fame is our categorical variable, and it has three possible values: the player was not elected to the Hall of Fame, the player was elected by the Baseball Writers Association (media), or the player was elected by the Old Timers Committee (former players). For this variable we will compute summary statistics, create Pareto charts and pie charts, and construct confidence intervals and hypothesis tests for the proportion of players elected by the Old Timers Committee. The quantitative variable is the number of home runs hit by a player. For this variable we will create histograms and box plots and compute a five-number summary. We will then construct confidence intervals and conduct hypothesis tests about the true population mean of home runs hit per player.

Pareto Charts

Categorical Data for Hall of Fame Membership

Population Hall of Fame Frequency

[Pareto chart. Not a member: 1216; Elected by Players: 67; Elected by Media: 57]

Systematic Sample Hall of Fame

[Pareto chart. Categories: Not a member, Elected by Media, Elected by Players. Bar label for Not a member: 35]

Random Sample Hall of Fame

[Pareto chart. Categories: Not a member, Elected by Media, Elected by Players. Bar label for Not a member: 34]

The charts that we constructed seem intuitively correct because only a select few players make it into the Hall of Fame. The numbers of players elected by the Old Timers Committee and by the media are about the same. To obtain my samples I used simple random sampling and systematic sampling. For the simple random sample I used Excel to generate random numbers to reorder my data. For the systematic sample I used the formula N/n = 1340/40 = 33.5, rounded up to 34, and selected every 34th player starting from a randomly chosen position of 21 until I had 40 players. This was to ensure that the sample was unbiased and drawn from across the whole population.

The Pareto charts (above) and the pie charts (below) visually display my data. The samples appear to be good estimates of the population data. This is generally expected when the samples are selected randomly and the sample size is greater than 30.
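As a rough sketch, the two sampling procedures described above could be reproduced in Python. The population size, sample size, step, and starting position come from the text. The wrap-around at the end of the list and the seed are assumptions for illustration only, since the text does not say how positions past 1340 were handled.

```python
import math
import random

N = 1340   # population size from the project
n = 40     # sample size from the project

# Systematic sample: every k-th row starting from a random start.
k = math.ceil(N / n)   # 1340 / 40 = 33.5, rounded up to 34
start = 21             # the random start reported in the text

# Assumption: rows are numbered 1..N and the count wraps around past row N.
systematic_rows = [((start - 1 + i * k) % N) + 1 for i in range(n)]

# Simple random sample: shuffle the row numbers and keep the first n,
# which mirrors reordering the data by random numbers in Excel.
rng = random.Random(1040)   # hypothetical seed, for repeatability only
rows = list(range(1, N + 1))
rng.shuffle(rows)
random_rows = sorted(rows[:n])

print("step:", k)
print("first systematic rows:", systematic_rows[:3])
print("random sample size:", len(random_rows))
```

Shuffling and taking the first 40 rows gives every player the same chance of selection, which is the property the Excel reordering relies on.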

Quantitative Data for Home Runs

Population Hall of Fame


[Pie chart. Not a member: 90.75%; Elected by Media: 4.25%; Elected by Players: 5%]

Random Sample Hall of Fame


[Pie chart. Not a member: 85%; Elected by Media: 7.5%; Elected by Players: 7.5%]

Systematic Sample Hall of Fame


[Pie chart. Not a member: 87.50%; Elected by Media: 7.50%; Elected by Players: 5.00%]

Quantitative Variable Analysis: Home Run Data

The quantitative variable that I chose to analyze from the population was the number of home runs hit by professional baseball players. I used StatCrunch to find statistics, including the population mean, the population standard deviation, and the five-number summary. I then took a random sample and a systematic sample using the same sampling techniques mentioned above, each with a sample size of forty. I also computed the following summary statistics for each sample.

Summary sample statistics for quantitative data

Column               n     Mean      Variance    Std. Dev.   Range  Min  Max  Q1    Q3     Median
Systematic Sampling  40    100.475   22327.64    149.42436   754    1    755  15.5  140.5  40.5
Random Sample        40    82.3      7074.2153   84.10835    337    2    339  23    121    50.5

Summary population parameters for quantitative data

Column      n     Mean     Variance  Std. Dev.  Median  Range  Min  Max  Q1  Q3
Home Runs   1340  85.1097  9590.293  97.930046  51      755    0    755  22  108

The systematic sample has a higher mean and standard deviation than both the random sample and the population. This could be because our data is skewed right and the systematic sample happened to include an unusual number of great players who hit a lot of home runs.

Population Home Run Boxplot

Random Sample Home Runs Boxplot

Systematic Sample Home Runs Boxplot

I constructed these boxplots using the five-number summaries above. The data appears to be skewed right. Hall of Fame caliber players who hit more than 500 home runs are very rare, and when they happen to be selected in our samples they can skew our results. Most players are not nearly as skilled as Hall of Fame players. The major difference between the random sample and the systematic sample is that the maximum in the random sample is 339 home runs, while the maximum in the systematic sample is 755, which is also the maximum in the entire population. This is interesting because the probability of picking Hank Aaron (the best home run hitter) in a sample of 40 out of 1340 players is very low, about 40/1340 or 3%.

Systematic Sample Histogram Home Runs

Population Histogram of Home Runs

Random Sample Histogram of Home Runs

The histograms show that our data is skewed right and confirm the points made previously. There is an outlier in our systematic sample: Hank Aaron, who hit 755 home runs, is more than three standard deviations above the mean.
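The outlier claim can be checked with a short calculation. This sketch uses the systematic sample's mean, standard deviation, and quartiles from the summary statistics. The 1.5 × IQR fence is an additional common rule of thumb that the report itself does not use, so treat it as an assumption.

```python
# Systematic sample summary statistics reported in the project.
mean, std_dev = 100.475, 149.42436
q1, q3 = 15.5, 140.5
value = 755  # Hank Aaron's home run total

# Distance from the sample mean in standard deviations.
z_score = (value - mean) / std_dev

# Assumed extra check: the 1.5 * IQR upper fence for outliers.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"z-score: {z_score:.2f}")
print(f"upper fence: {upper_fence}, outlier: {value > upper_fence}")
```

Both checks point the same way: 755 lies more than four sample standard deviations above the sample mean and far beyond the upper fence.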

Confidence Intervals

95% confidence intervals of the population proportion

We constructed two confidence intervals, one from the random sample and one from the systematic sample. They are constructed at the 95% level for the population proportion of players elected to the Hall of Fame by the Old Timers Committee. The margin of error for the simple random sample is approximately 8.16%, and the margin of error for the systematic sample is approximately 6.75%. For the random sample we are 95% confident that the true population proportion of players elected by the Old Timers Committee is between 0% and 15.66%. For the systematic sample we are 95% confident that the true population proportion is between 0% and 11.75%. When constructing the intervals we computed a negative lower bound, which is unrealistic because there cannot be a negative percentage, so we report the lower bound as 0%. The true population proportion is 5%, which is within both of these intervals.
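A minimal sketch of the proportion interval, assuming the usual normal-approximation formula p̂ ± z·sqrt(p̂(1 − p̂)/n). The counts of 3 and 2 players are inferred from the sample percentages (7.5% and 5% of 40) and are not stated directly in the report. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

# Assumed counts of players elected by the Old Timers Committee.
for label, successes in [("random", 3), ("systematic", 2)]:
    low, high, margin = proportion_ci(successes, 40)
    # A proportion cannot be negative, so the lower bound is clipped at 0.
    print(f"{label}: margin {margin:.4f}, interval ({max(low, 0.0):.4f}, {high:.4f})")
```

The raw lower bounds come out negative for both samples, which is why the intervals are reported as starting at 0%.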

95% confidence intervals of the population mean

We constructed two confidence intervals, one from the random sample and one from the systematic sample. They are constructed at the 95% level for the population mean of home runs hit by players. For the random sample we are 95% confident that the true population mean is between 55.4 and 109.2 home runs. For the systematic sample we are 95% confident that the true population mean is between 52.69 and 148.27 home runs. The true population mean is 85.11, which is within both of these intervals.
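The mean intervals can be approximated with the t-interval formula x̄ ± t·s/sqrt(n). This sketch hardcodes a critical value of 2.0227 for 39 degrees of freedom, which is a standard table value assumed here rather than one quoted in the report. The function name is hypothetical.

```python
from math import sqrt

T_CRIT = 2.0227  # assumed table value: t with 39 df, 95% two-sided

def mean_ci(sample_mean, std_dev, n, t_crit=T_CRIT):
    """t-based confidence interval for a population mean."""
    margin = t_crit * std_dev / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Sample means and standard deviations from the summary statistics.
samples = {
    "random": (82.3, 84.10835),
    "systematic": (100.475, 149.42436),
}
for label, (sample_mean, std_dev) in samples.items():
    low, high = mean_ci(sample_mean, std_dev, 40)
    print(f"{label}: ({low:.2f}, {high:.2f})")
```

The systematic interval is much wider because its standard deviation is almost twice that of the random sample.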

95% confidence intervals of the population standard deviation

We constructed two confidence intervals, one from the random sample and one from the systematic sample. They are constructed at the 95% level for the population standard deviation. For the random sample we are 95% confident that the true population standard deviation of home runs hit by Major League Baseball players is between 68.19 and 106.27. For the systematic sample we are 95% confident that the true population standard deviation is between 121.13 and approximately 188.78. The true population standard deviation is 97.93, which is within the interval from the simple random sample but not within the interval from the systematic sample. This could be because of the outlier of 755 home runs contained in the systematic sample.
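A sketch of the chi-square interval for a standard deviation, sqrt((n − 1)s²/χ²). The critical values below are the standard table entries for 40 degrees of freedom. Using that row instead of 39 degrees of freedom is an assumption, made because it appears to match the reported lower bounds. The function name is hypothetical.

```python
from math import sqrt

# Assumed chi-square table values for 40 degrees of freedom.
CHI2_UPPER = 59.342  # area of 0.025 to the right
CHI2_LOWER = 24.433  # area of 0.975 to the right

def std_dev_ci(variance, n):
    """Chi-square confidence interval for a population standard deviation."""
    low = sqrt((n - 1) * variance / CHI2_UPPER)
    high = sqrt((n - 1) * variance / CHI2_LOWER)
    return low, high

# Sample variances from the summary statistics.
for label, variance in [("random", 7074.2153), ("systematic", 22327.64)]:
    low, high = std_dev_ci(variance, 40)
    print(f"{label}: ({low:.2f}, {high:.2f})")
```

Exact critical values for 39 degrees of freedom would shift both bounds slightly upward.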

Proportion Hypothesis Test (level of significance .05)

The hypothesis tests were conducted for both the random and systematic samples, with the null hypothesis that the population proportion of players elected by the Old Timers Committee equals .05. For the random sample we do not reject the null hypothesis because the P-value of .468 > .05. We conclude that we do not have evidence to suggest the true population proportion is different from .05. The true population proportion is .05, so we have come to the correct conclusion. For the systematic sample we also do not reject the null hypothesis because the P-value of .99 > .05. We again conclude that we do not have evidence to suggest the true population proportion is different from .05. In both cases we have not made a Type I error.
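A minimal sketch of the two-sided one-proportion z-test behind these P-values. The counts of 3 and 2 players are inferred from the sample percentages and are an assumption. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

ALPHA = 0.05
# Assumed counts of players elected by the Old Timers Committee.
for label, successes in [("random", 3), ("systematic", 2)]:
    z, p_value = one_proportion_z_test(successes, 40, 0.05)
    decision = "reject H0" if p_value < ALPHA else "do not reject H0"
    print(f"{label}: z = {z:.3f}, P-value = {p_value:.3f}, {decision}")
```

For the systematic sample the sample proportion equals the hypothesized value, so the test statistic is zero and the P-value is as large as it can be, in line with the reported value of about .99.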

Hypothesis Test for the Population Mean (level of significance .05)

The hypothesis tests were conducted for both the random and systematic samples, with the null hypothesis that the population mean equals 85.11 home runs per player. For the random sample we do not reject the null hypothesis because the P-value of .833 > .05. We conclude that we do not have evidence to suggest the true population mean is different from 85.11. The true population mean is 85.11, so we have come to the correct conclusion. For the systematic sample we also do not reject the null hypothesis because the P-value of .52 > .05. We conclude that we do not have evidence to suggest the true population mean is different from 85.11. In both cases we have not made any errors (Type I or Type II), assuming that our true population data is correct.

Reflection

I thought this project helped me apply all of the concepts that I have learned in class. I am not very interested in Major League Baseball data, but it was interesting to apply confidence intervals and hypothesis tests to this subject. I would be interested in applying these concepts in my future career in the medical field. I could see this type of statistical analysis being used to test the effectiveness of various drugs, types of care, and lengths of treatment. This is very important when people's lives are at stake.

The samples we used for the proportion meet the conditions because of the way we selected the samples and the procedures we used, as described above. The samples are also less than 5% of the total population. This is the primary condition our book talks about meeting. The samples we took for the mean number of home runs should also give good estimates because the simple random and systematic selection processes we used are sound. In this case the sample sizes are greater than 30, which allows us to construct confidence intervals and conduct hypothesis tests even if the population is not normally distributed.

The conclusions that we came to in our hypothesis tests seem logical because in our project we know the true population data and we tested against the true values. In the real world we would most likely not know the true population values, and this is why we would conduct these tests in the first place. If we knew the population data there would be no need to construct confidence intervals and tests. If we had rejected any of our hypotheses based on the tests that we conducted, we would have made a Type I error, because each null hypothesis states the true population value. In all of our samples we were not able to reject the null hypothesis, so we did not make any Type I errors.
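The decisions for the two mean tests discussed above can be checked with the one-sample t statistic, t = (x̄ − μ0)/(s/sqrt(n)). Because the Python standard library has no t distribution, this sketch compares |t| with an assumed table critical value of 2.0227 (39 degrees of freedom) instead of computing P-values. The function name is hypothetical.

```python
from math import sqrt

T_CRIT = 2.0227   # assumed table value: t with 39 df, two-sided .05 level
MU_0 = 85.11      # hypothesized population mean from the project

def t_statistic(sample_mean, std_dev, n, mu_0=MU_0):
    """One-sample t statistic for H0: mu = mu_0."""
    return (sample_mean - mu_0) / (std_dev / sqrt(n))

# Sample means and standard deviations from the summary statistics.
samples = {
    "random": (82.3, 84.10835, 40),
    "systematic": (100.475, 149.42436, 40),
}
for label, (sample_mean, std_dev, n) in samples.items():
    t = t_statistic(sample_mean, std_dev, n)
    decision = "reject H0" if abs(t) > T_CRIT else "do not reject H0"
    print(f"{label}: t = {t:.3f}, {decision}")
```

Both statistics fall well inside the critical values, which agrees with the decision not to reject the null hypothesis for either sample.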

Extra Credit Regression

This regression shows the positive correlation between hits and home runs. As a player gets more hits, our regression predicts that the player will also tend to hit more home runs. A player with 1500 career hits will on average hit 106 home runs. A player who is a member of the 3000 hit club will on average have hit 231 home runs. We can see that most of our data points are close to the origin. This is because there are only a select few great players who get many hits and have many home runs. We would expect to find more Hall of Famers as we go up and to the right on our graph.
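The fitted equation itself is not given, but the two predictions quoted above are enough to recover an approximate line. This sketch assumes a simple linear model and works from the rounded predictions, so the slope and intercept are only approximations and not the regression output itself.

```python
# Two predictions quoted in the text: (career hits, predicted home runs).
x1, y1 = 1500, 106
x2, y2 = 3000, 231

# Line through the two predicted points (approximate, from rounded values).
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

def predict_home_runs(hits):
    """Predicted career home runs for a given number of career hits."""
    return intercept + slope * hits

print(f"slope: {slope:.4f} home runs per hit")
print(f"intercept: {intercept:.1f}")
print(f"prediction at 2000 hits: {predict_home_runs(2000):.1f}")
```

Under these assumptions the slope works out to roughly one additional home run for every twelve additional hits.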
