Statistical Interpretation and Discussion For A Set of Results

Ammaar Saiyed
Interpretation and discussion of results

Hypothesis 1 -
Bar chart: I made a bar chart of the data because I thought it would let me see whether there was
any link between the two variables. I predicted that as one variable - magnitude - increases, the
other - depth - also increases. The bar chart didn't require me to make any calculations. The
conclusion that I can draw from the bar chart is that there is a very slight positive correlation:
as magnitude increases, depth tends to increase as well. I left all outliers in the bar chart.
Scatter graph with anomalies and line of best fit: I plotted this scatter graph to look at all the
data with the anomalies included, so that I could draw conclusions from the full set of data rather
than only part of it. This allowed me to see that there was a positive trend in the data. To
calculate the outlier boundary, I multiplied the interquartile range (23) by 1.5, giving 34.5. Adding
this to the upper quartile gave me 67.5, and I found that 7 values lay above this boundary and were
therefore anomalies. I circled these on the scatter graph. As there was no clear linear correlation,
I couldn't plot a straight line of best fit, and instead had to plot a 6 point polynomial line of
best fit. The data fluctuates a lot and hardly any linear correlation can be seen in the points.
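As a rough check, the outlier boundary described above can be reproduced in a few lines of Python.
This is only a sketch under stated assumptions: the names upper_fence and find_outliers are made up
for illustration, the upper quartile of 33 is not quoted in the text but is implied by 67.5 - 34.5,
and the sample depths are hypothetical values, not the real data.

```python
def upper_fence(upper_quartile, iqr, multiplier=1.5):
    """Return the boundary above which a value is treated as an outlier."""
    return upper_quartile + multiplier * iqr


def find_outliers(values, boundary):
    """Return every value that lies above the boundary."""
    return [v for v in values if v > boundary]


IQR = 23                      # interquartile range quoted in the text
UPPER_QUARTILE = 67.5 - 34.5  # = 33, inferred from the text rather than stated in it

boundary = upper_fence(UPPER_QUARTILE, IQR)
print(boundary)  # 67.5

# Hypothetical depths, used only to show how the check works.
sample_depths = [10, 25, 33, 70, 120]
print(find_outliers(sample_depths, boundary))  # [70, 120]
```

Only the upper boundary is checked here, because that is the only boundary used in the working
above.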
Spearman's Rank Correlation Coefficient - anomalies included: The Spearman's rank value gave me a
numerical value which I could use to say more definitively whether there was correlation. Using the
Spearman's rank formula, I then had to calculate 1980.9465/26970 to find out what the SRCC was.
The result I got was 0.4407, which showed me that there was a positive correlation, but not a very
strong one. It is possible that there is in fact no correlation at all: I only had 30 pieces of
data, and with so few values the correlation coefficient isn't very reliable.
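For reference, the usual textbook form of Spearman's rank is 1 - 6 x (sum of d squared) / (n(n squared - 1)),
where d is the difference between the two ranks of each item. The denominator 26970 quoted above is
consistent with n(n squared - 1) for n = 30. The sketch below shows how that formula could be computed in
Python. It is an illustration only: the function names are invented, the five data pairs are
hypothetical, and it does not attempt to reproduce the 0.4407 result, because the individual rank
differences are not given in the text.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_denominator(n):
    """The n(n^2 - 1) term in the textbook formula."""
    return n * (n ** 2 - 1)


def spearman_rank(xs, ys):
    """Textbook formula; it is exact only when there are no tied ranks."""
    n = len(xs)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d_squared / spearman_denominator(n)


print(spearman_denominator(30))  # 26970, the denominator used in the text

# Hypothetical magnitude/depth pairs, only to show the calculation.
magnitudes = [5.1, 6.3, 5.8, 7.0, 6.6]
depths = [10, 35, 33, 20, 60]
print(round(spearman_rank(magnitudes, depths), 4))  # 0.4
```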
Scatter graph - without anomalies: I plotted a scatter graph without the anomalies to see if there
was a stronger correlation between the variables. Removing outliers means that you can assess the
data without the values that lie far from the majority. However, from my graph I can see that,
contrary to my expectations, the correlation decreased, meaning there is hardly any correlation at
all. The data fluctuated a lot, going up and down with no clear link between the two variables.
Spearman's Rank Correlation Coefficient - without anomalies: To calculate this, I had to divide
6943.3268 by 19656. This calculation gave me 0.3278, meaning the correlation between my variables
actually decreased. The value 0.3278 means that there is a positive correlation, but a weak one.
Evaluation: All these calculations and graphs tell me that there is a small positive correlation in
my bivariate data, but not a strong one. Removing the outliers decreased the correlation between the
two variables, and this further suggests that my hypothesis was wrong. I think the source was too
advanced for me to use; the depths of all the earthquakes were followed by a variety of three
different letters, for a reason unknown to me. However, because the source was reliable, I still
used it. Although the first Spearman's rank value, for the data with anomalies included, showed a
positive correlation that was higher than the value without the anomalies, I think that my
hypothesis is mainly wrong. To improve this investigation and take it further, I would use more
data, maybe over 75 values, as this would make my results a lot more reliable. Many thousands of
earthquakes have occurred, so I think that I made a mistake in only using 30 for my research, and
this may be why I have such a low correlation.

Hypothesis 2 -
Data calculations with anomalies: To compare the 2 sets of data, I worked out the mean, median,
mode, range and standard deviation of both sets. The mean, median and mode serve similar but
slightly different purposes. The mean is a simple way of having one number to represent the data,
whereas the median is the actual midpoint of the data. The mode is the most frequently occurring
value. For 1990-1999 I got varied results: the mean was 2766.066, the median was 545 and there was
no mode. For 2000-2010, the mean was 22673.8, the median was 32 and the mode was 4. Comparing the
two means suggests that the deaths in 2000-2010 were a lot higher. However, this is only because
the 3 anomalies were very large, which made the mean very large. The mean of 22673.8 shows that the
data is very varied and that the largest values are extremely large, as the median is only 32 but
the mean is 22641.8 bigger than it. As the median for 1990-1999 is bigger than the median for
2000-2010, I can see that the typical values from 1990-1999 are bigger than those from 2000-2010.
From calculating the ranges of both sets, I can say that both are very wide, with ranges of 17049
and 315998. Ranges this wide suggest that there are probably outliers in the data. The standard
deviation for 1990-1999 is 4841.55716. The standard deviation measures how far the values are spread
from the mean, and one of this size is very large, so the mean isn't a true representation of the
data. This is again the case for 2000-2010, as its standard deviation is more than 10x bigger than
that of 1990-1999. What this implies is that the mean has little significance and isn't
representative of the data in any way. I was sure that the two sets had some anomalies, so I
calculated the outlier boundaries by multiplying the interquartile ranges by 1.5 and adding the
results to the upper quartiles. I found 3 anomalies in each set, and I then decided to make 2 sets
of box plots: one with the anomalies included in the calculations but not in the maximum values,
and one with the anomalies completely excluded.
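The way a few very large values drag the mean away from the median can be shown with a short Python
sketch using the standard statistics module. The death tolls below are hypothetical and are not the
real data, the function name summarise is invented, and the population standard deviation (pstdev)
is used as an assumption, since the text does not say which form of standard deviation was
calculated.

```python
import statistics


def summarise(values):
    """Return the summary statistics compared in the text."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        # Assumption: population standard deviation; the text does not specify.
        "std_dev": statistics.pstdev(values),
    }


# Hypothetical death tolls, NOT the real data.
typical = [4, 4, 10, 20, 32, 50, 75]
with_outlier = typical + [300000]

before = summarise(typical)
after = summarise(with_outlier)

# The median barely moves, but the mean, range and standard deviation jump.
print(before["median"], after["median"])
print(round(before["mean"], 1), round(after["mean"], 1))
print(before["range"], after["range"])
```

In this made-up example one extreme value changes the median only from 20 to 26, while the mean
rises from under 30 to over 37,000, which is the same pattern seen in the 2000-2010 figures.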
Box plot with anomalies (in quartiles and median calculation - not in max values):
From the 2 box and whisker diagrams I can see clearly that the death toll for the earthquakes I
measured for 1990-1999 was a lot higher than the death toll for the earthquakes I measured for
2000-2010. I can also see that the 1990-1999 death toll was very varied, as its box plot extends
from near the start to near the end of the scale. This is the reason that I excluded the outliers
from the maximum values. Although the 2000-2010 diagram doesn't look varied, it should actually
stretch over 290,000 further than it does. If I had drawn this, my diagram would have been too big
to use effectively. Another observation is that the middle 50% of the 2000-2010 data isn't very
wide, even though I did use the anomalies in calculating these values. On the other hand, the middle
50% of the 1990-1999 data has a lot of variation. From these 2 diagrams it is very clear that the
deaths in 1990-1999 were higher than in 2000-2010. This is shown by how much further along the scale
the lower box plot is than the upper one. In addition, although the two middle 50% boxes overlap,
much of the 1990-1999 box lies beyond the overlap, while little of the 2000-2010 box lies outside
it. Furthermore, the median of the 2000-2010 data is nowhere near the middle 50% of the other set
(in fact it is less than that set's minimum value), and the 1990-1999 median only just overlaps the
2000-2010 middle 50%. The medians are also far apart, with the 1990-1999 median much larger than the
other. Using all these results, which were found visually by looking at the box plots, I can say
with confidence that the earthquakes from the 1990-1999 period had more deaths than those from
2000-2010.

Data Calculations without anomalies -
After removing the anomalies, I recalculated everything I had worked out at first, for the same
reasons, but hoping for more accurate results, especially lower standard deviations. My results were
as I hoped: for 1990-1999, I obtained a mean of 682, a median of 219 and a range of 2,322. No mode
could be found, and although this range is still very big, it has decreased by nearly 15,000. For
2000-2010, the mean was 110.33, the median was 16.5, the mode was 4 and the range was 942. Again,
the range was still fairly big, but had decreased by over 315,000. This shows that the outliers that
had been removed were very large values and had a large effect on the values that I got. By
comparing the two means and the two medians, I can see that 1990-1999 had higher death tolls. The
standard deviation for 1990-1999 is 831.4053637. This is still an extremely high standard deviation
and shows that the mean isn't a very true representation of the data, but it is more representative
than the first one was. The same is true for 2000-2010, as the standard deviation was now
266.6595832. This is still a high standard deviation, but it has decreased by over 80,000! The
spread of the data about the mean was now much smaller than before, which shows how much of a
difference removing outliers can actually make.
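The size of these decreases can be checked directly from the figures quoted in the text. The short
Python sketch below does this for the two ranges and for the 1990-1999 standard deviation. The
original 2000-2010 standard deviation is not quoted as a number, so it is left out rather than
guessed. The variable names are chosen only for illustration.

```python
# (value with anomalies, value without anomalies), as quoted in the text
ranges = {
    "1990-1999": (17049, 2322),
    "2000-2010": (315998, 942),
}

range_decrease = {
    period: before - after for period, (before, after) in ranges.items()
}
print(range_decrease)  # {'1990-1999': 14727, '2000-2010': 315056}

# Standard deviation for 1990-1999, with and without anomalies
std_dev_decrease = round(4841.55716 - 831.4053637, 2)
print(std_dev_decrease)  # 4010.15
```

These results agree with the statements that the ranges fell by nearly 15,000 and by over 315,000.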
Box plots without anomalies-
From the box plots I can see very clearly that 1990-1999 had a lot more deaths than 2000-2010. The
two middle 50% boxes don't overlap at all; in fact the only overlap is that the maximum value for
2000-2010 reaches into the middle 50% of 1990-1999. As neither median falls within the other set's
middle 50%, this indicates that 1990-1999 is a lot higher than 2000-2010 in terms of death toll from
earthquakes. I can also see that the 2000-2010 earthquakes have very similar values, as the middle
50% is bunched together very closely, implying that it isn't varied, whereas the 1990-1999 box plot
is still varied, but not as much as before. The only parts of the box plots that look wrong are the
maximum values, which seem to be very far from the upper quartiles. I think this is because, in the
original sets of data, the 3 anomalies in each set were so big that these values didn't count as
anomalies. When working out the anomalies for the reduced sets of data, these values now count as
anomalies, but I left them in, as otherwise I would have had to work out the anomalies again and
again. The reason I replotted the box plots was to have a more accurate diagram of both sets of
data, so that I could compare them more accurately.
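The effect described above, where removing the largest anomalies makes other values become anomalies
in turn, can be illustrated with a short Python sketch. It is only an illustration: the data is
hypothetical, the function names are invented, only the upper boundary is checked, and the quartiles
come from statistics.quantiles with the "inclusive" method, which may not match the quartile method
used for the real calculations.

```python
import statistics


def upper_outliers(values):
    """Values above upper quartile + 1.5 x interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    boundary = q3 + 1.5 * (q3 - q1)
    return [v for v in values if v > boundary]


def remove_outliers_repeatedly(values):
    """Keep removing upper outliers until none are left.

    Returns the remaining values and a list of what was removed in each round.
    """
    remaining = sorted(values)
    rounds = []
    while len(remaining) >= 2:
        outliers = upper_outliers(remaining)
        if not outliers:
            break
        rounds.append(outliers)
        remaining = [v for v in remaining if v not in outliers]
    return remaining, rounds


# Hypothetical death tolls, NOT the real data.
sample = [2, 4, 5, 6, 8, 10, 12, 15, 60, 2000, 50000, 300000]
remaining, rounds = remove_outliers_repeatedly(sample)
print(rounds)     # [[2000, 50000, 300000], [60]]
print(remaining)  # [2, 4, 5, 6, 8, 10, 12, 15]
```

In this made-up example the value 60 is not an anomaly while the three largest values are present,
but becomes one once they are removed, because the upper quartile and interquartile range shrink.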
Evaluation:
