Math 241
4/17/15
Baseball Data Project
This statistical project examines baseball and a set of player statistics. Its purpose is to
determine whether there is any correlation between the salary a professional baseball player
earns and several performance variables: RBIs, walks, hits, runs, and at bats. The population
is the entirety of MLB, and the sample used for this data consists of 244 MLB players on 11
different teams.
The first part of my project was to determine the descriptive statistics of the baseball data
set. These descriptive statistics are the mean, median, mode, range, standard deviation, and
variance of the salaries earned by the players in the sample. The following table shows these
statistics.
Variable   Mean      StDev      Variance   Median    Range      Mode     N for Mode
Salary     3577408   (missing)  (missing)  1500000   (missing)  500000   21
The mean salary of this data set is $3,577,408. I treat the mean as the most useful measure of
central tendency for this project. The other statistics are informative, showing the median
salary ($1,500,000) and the most common salary ($500,000, earned by 21 players), but the mean
gives the average salary earned across the sample and is the starting point for the correlation
analysis. The most useful measure of variability in this data set is the standard deviation. It
lets us judge whether a given salary is typical, small, or large by how many standard
deviations it lies from the mean.
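As a rough illustration of counting standard deviations from the mean, the sketch below computes z-scores for three salaries mentioned in this report. The standard deviation is not taken from the descriptive statistics table; it is recovered from the SE Mean reported with the confidence interval, under the assumption that SE Mean = s / sqrt(n). The function name `z_score` is my own and is only for illustration.

```python
import math

# Values reported in this report's confidence-interval output.
n = 244
mean_salary = 3_577_408
se_mean = 275_615

# Assumption: SE Mean = s / sqrt(n), so s is recovered as SE Mean * sqrt(n).
std_dev = se_mean * math.sqrt(n)

def z_score(salary):
    """Number of standard deviations a salary lies from the sample mean."""
    return (salary - mean_salary) / std_dev

# Mode, median, and the approximate start of the boxplot outliers.
for salary in (500_000, 1_500_000, 11_000_000):
    print(f"{salary:>12,}  z = {z_score(salary):+.2f}")
```

A negative z-score marks a salary below the mean and a positive one a salary above it.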
The histogram that follows gives us a visual representation of how the salary is
distributed over the sample population. It shows us how many players earn different ranges of
salaries, and also gives us a rough visual of the distribution of salaries for the population sample.
[Figure: Histogram of Salary. Horizontal axis: Salary; vertical axis: Frequency.]
This histogram shows that most players earn a salary in the $0 to $4,000,000 range. The second
most common range is $4,000,000 to $8,000,000. The frequencies drop off dramatically after
these first two ranges, down to only one player making between $24,000,000 and $28,000,000.
The data in this histogram is substantially skewed to the right, with a long tail of high
salaries, showing that the distribution is not symmetric.
The following boxplot shows much of the same information as the histogram above. It makes
some features easier to see, though, such as outliers.
[Figure: Boxplot of Salary. Axis: Salary.]
The boxplot shows many outliers above the third quartile. The outliers start at approximately
$11,000,000 and cluster between there and roughly $16,000,000. This cluster is followed by two
more outliers even further above the others. The data is also clustered around the first and
second quartiles, showing that most salaries fall below the $5 million mark at the third
quartile. Above this cluster, the data is more spread out between the second quartile and the
third quartile.
A confidence interval gives a range of values that, at a stated level of confidence, contains
the true population mean. The following interval does this with a 95% level of confidence.
N     Mean      SE Mean   95% CI
244   3577408   275615    (3037212, 4117604)
This shows that we can be 95% confident that the true mean salary lies between $3,037,212 and
$4,117,604. The sample mean of $3,577,408 is the center of that interval.
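The interval can be checked from the N, Mean, and SE Mean values above. The sketch below assumes a normal (z) critical value, which appears to match the reported bounds to within rounding; a t-based interval with 243 degrees of freedom would be slightly wider.

```python
from statistics import NormalDist

# Values from the confidence-interval output above.
n = 244
mean_salary = 3_577_408
se_mean = 275_615

confidence = 0.95
# Two-sided critical value from the standard normal distribution (about 1.96).
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z_crit * se_mean
lower = mean_salary - margin
upper = mean_salary + margin

print(f"margin of error: {margin:,.0f}")
print(f"95% CI: ({lower:,.0f}, {upper:,.0f})")
```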
[Figure: Scatterplot of Salary vs Runs. Horizontal axis: Runs (0 to 120); vertical axis: Salary (0 to 20,000,000).]
Even though the regression equation does indicate a higher salary as runs scored increase, the
scatterplot shows no clear linear pattern between a player's earned salary and the number of
runs. There is no distinct line relating earned salary to number of runs anywhere from 0 runs
to 120 runs. This suggests that regression is not an appropriate way of predicting salary in
this specific scenario.
I continued my analysis of the data set by conducting multiple hypothesis tests to determine
whether there was a correlation between earned salary and the different variables. The first
hypothesis test I conducted was a continuation of the calculations above, correlating salary
with runs. Performing this hypothesis test, I calculated a correlation coefficient of .383
with a p-value reported as 0.0, meaning it is too small to show at the displayed precision.
With this p-value and a significance level of .05, I have sufficient evidence to reject the
null hypothesis of no correlation. This means that there is sufficient evidence to conclude
that a correlation between the number of runs and earned salary is present. A caveat, though,
is that with the correlation coefficient being so low (.383), the linear relation is positive
but weak.
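To see why the p-value is reported as 0.0, the test statistic can be recomputed from the correlation coefficient and the sample size. The sketch below uses the standard t statistic for a correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The p-value is only a normal approximation to the t distribution with 242 degrees of freedom, which is an assumption made here to avoid outside libraries; it is not the exact value the software would report.

```python
import math
from statistics import NormalDist

n = 244      # sample size
r = 0.383    # correlation coefficient, salary vs runs

# t statistic for testing the null hypothesis of zero correlation.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Assumption: with 242 degrees of freedom the t distribution is close to
# the standard normal, so a two-sided normal p-value is used as a stand-in.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

alpha = 0.05
print(f"t = {t_stat:.2f}, approximate p = {p_approx:.2e}")
print("reject null hypothesis" if p_approx < alpha else "fail to reject")
```

With 244 players, even a modest correlation produces a large test statistic, which is why a weak relation can still be statistically significant.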
I concluded my analysis by performing hypothesis tests on the remaining variables to determine
whether there was a correlation between each of them and earned salary. The following table
shows the p-value for each hypothesis test as well as the correlation coefficient.
Hypothesis Tests
Comparison            Correlation Coefficient   P-value
Salary vs RBIs        .406                      0.0
Salary vs HITS        .356                      0.0
Salary vs RUNS        .383                      0.0
Salary vs WALKS       .416                      0.0
Salary vs AT BATS     .359                      0.0
The table above shows the correlation coefficient for each of the independent variables when
compared to the salary variable. Each hypothesis test produced a p-value reported as 0.0, and
at a significance level of .05 the null hypothesis for each test can be rejected. These
rejections indicate that there is a correlation between each independent variable and the
earned salary of baseball players. But as with runs, there is still a caveat. The correlation
coefficients, while all positive, were fairly low, meaning that each linear relation is only
slight.
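One way to put a number on how slight these relations are is to square each correlation coefficient. The result, r squared, is the share of the variation in salary that a straight-line fit on that single variable would account for. The sketch below uses only the coefficients from the table; the dictionary names are my own.

```python
# Correlation coefficients with salary, taken from the table above.
correlations = {
    "RBIs": 0.406,
    "Hits": 0.356,
    "Runs": 0.383,
    "Walks": 0.416,
    "At Bats": 0.359,
}

# r squared: proportion of salary variation explained by a linear fit.
r_squared = {name: r**2 for name, r in correlations.items()}

# List the variables from strongest to weakest linear relation.
for name, value in sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<8} r = {correlations[name]:.3f}  r^2 = {value:.1%}")
```

Every variable accounts for less than 18% of the variation in salary on its own, which is consistent with describing the linear relations as slight.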
After performing multiple types of tests on this data set, I can conclude that while there is
a correlation between the salary earned by players and each independent variable, the
correlation is only slight. Fitting a regression equation and then reviewing a scatterplot gave
mixed results, but by performing hypothesis tests on each independent variable versus salary I
was able to determine a level of correlation. This suggests that players who achieve high
levels in each of the independent variables tend to have a higher salary than others. To test
this correlation further, a new data set could be built from players at both ends of the
spectrum. Players with low stats and players with high stats could both be sampled, and their
salaries then compared against their values for each independent variable.