Hypothesis 1 - Bar chart: I made a bar chart of the data because I thought I would be able to look at it and see whether there was any link between the two variables. I predicted that as one variable - magnitude - increases, the other - depth - increases too. The bar chart didn't require me to make any calculations. The conclusion I can draw from the bar chart is that there is a very slight positive correlation: as magnitude increases, depth increases as well. I left all outliers in the bar chart.
Scatter graph with anomalies and line of best fit: I plotted this scatter graph to look at all the data with the anomalies included, so that I could draw conclusions from the full set of data rather than only part of it. This allowed me to see that there was a positive trend in the data. To calculate the outliers I multiplied the interquartile range (23) by 1.5, giving 34.5. Adding this to the upper quartile gave me 67.5, and I found that 7 values were anomalies. I circled these on the scatter graph. As there was no clear linear correlation, I couldn't plot a linear line of best fit, and instead had to plot a 6 point polynomial line of best fit. The data fluctuates a lot and you can hardly see any linear correlation in the points.
Spearman's Rank Correlation Coefficient - anomalies included: Spearman's rank gave me a numerical value which I could use to say definitively whether there was correlation. Using the Spearman's rank formula, I then had to calculate 1980.9465/26970 to find out what the SRCC was. The result I got was 0.4407, which showed me that there was a positive correlation, but not a very strong one. In fact, there may be no real correlation at all: I only had 30 pieces of data, and with so little data the correlation coefficient isn't very accurate.
Scatter graph - without anomalies: I plotted a scatter graph without the anomalies to see if there was a higher correlation between the variables.
Removing outliers means that you can assess the data without the values that lie far from the majority of the data. However, from my graph I can see that, contrary to my expectations, the correlation decreased, meaning there is hardly any correlation at all. The data fluctuated a lot, going up and down with no clear link.
Spearman's Rank Correlation Coefficient - without anomalies: To calculate this, I had to divide 6943.3268 by 19656. This calculation gave me 0.3278, meaning the correlation between my data actually decreased. The value 0.3278 means that there is positive correlation, but very close to none.
Evaluation: All these calculations and graphs tell me that there is a bit of positive correlation in my bivariate data, but not a lot. Removing the outliers decreased the correlation between the two variables, and this further supports the conclusion that my hypothesis was wrong. I think the source was too advanced for me to use; the depths of all the earthquakes were followed by one of three different letters, for a reason unknown to me. However, because the data was reliable, I still used it. Although the first Spearman's rank value, for the data with anomalies included, showed a positive correlation higher than the value without the anomalies, I think that my hypothesis is mainly wrong. To improve this research and take it further, I would use more data, maybe over 75 values, as this would make my results a lot more accurate. There are hundreds and thousands of earthquakes that have occurred, so I think that I made a mistake in only using 30 for my research, and this may be why I have such low correlation.
Ammaar Saiyed
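As a rough illustration of the Spearman's rank calculations in Hypothesis 1, the sketch below assumes the usual formula, 1 - 6 x (sum of d squared) / (n(n squared - 1)), where d is the difference between the two ranks given to each earthquake. For n = 30 the denominator n(n squared - 1) is 26970, which matches the figure quoted for the data with anomalies included. The function names and the five sample values are hypothetical and are not the data used in this investigation.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j as a 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_denominator(n):
    """The n(n^2 - 1) term that the rank differences are divided by."""
    return n * (n ** 2 - 1)


def spearman_rank(xs, ys):
    """Spearman's rank via 1 - 6*sum(d^2) / (n(n^2 - 1)); exact only without ties."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / spearman_denominator(len(xs))


# Hypothetical magnitudes and depths, for illustration only (not the coursework data).
magnitudes = [5.1, 6.3, 4.8, 7.0, 5.9]
depths = [10, 35, 12, 20, 33]

print(spearman_denominator(30))  # 26970
print(round(spearman_rank(magnitudes, depths), 4))
```

A value near 1 means the two sets of ranks agree closely, a value near 0 means there is little or no link, and a value near -1 means the ranks run in opposite directions.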
Hypothesis 2 - Data calculations with anomalies: To compare the 2 sets of data, I worked out the mean, median, mode, range and standard deviation of both sets. The mean, median and mode serve similar but slightly different purposes. The mean is a simple way of having one number to represent the data, whereas the median is the actual midpoint of the data. The mode is the most common value. For 1990-1999 I got varied results: the mean was 2766.066, the median was 545 and there was no mode. For 2000-2010, the mean was 22673.8, the median was 32 and the mode was 4. Comparing the two means suggests that the deaths in 2000-2010 were a lot higher. However, this is only because the 3 anomalies were very big, which made the mean very big. The mean of 22673.8 shows that the data is very varied and that the largest values are extremely large, as the median is only 32 but the mean is 22641.8 bigger than it. As the median for 1990-1999 is bigger than that of the other set, I can see that the typical values there are bigger than in 2000-2010. From calculating the ranges of both sets, I can say that both are very wide, with ranges of 17049 and 315998. Such wide ranges suggest that there are probably outliers in the data. The standard deviation for 1990-1999 is 4841.55716. The standard deviation measures how far the values spread from the mean, and a value of that size is very big, so the mean isn't a true representation of the data. This is again the case for 2000-2010, as the standard deviation is more than 10x bigger than the other. What this implies is that the mean is a very large value and isn't a representation of the data in any way. I was sure that the two sets had some anomalies, so I calculated the outlier boundaries by multiplying the interquartile ranges by 1.5 and adding them to the upper quartiles.
I found 3 in both sets, so I decided to make 2 sets of box plots: one with the anomalies in the calculations but not in the max values, and one with the anomalies completely excluded.
Box plot with anomalies (in quartiles and median calculation - not in max values): From the 2 box and whisker diagrams I can see clearly that the death toll for the earthquakes I measured for 1990-1999 was a lot higher than the death toll for the earthquakes I measured for 2000-2010. I can also see that the 1990-1999 death toll was very varied, as the box plot extended from near the start to near the end of the scale. This is the reason that I excluded the outliers from the maximum values. Although the 2000-2010 diagram doesn't look varied, it is actually supposed to stretch over 290,000 further than it does. If I had drawn this, my diagram would have been too big to use effectively. Another observation is that the middle 50% of the data in 2000-2010 isn't very spread out, even though I did use the anomalies in calculating these values. On the other hand, the middle 50% of the data from 1990-1999 has a lot of variance. From these 2 diagrams it is very clear that the deaths in 1990-1999 were higher than in 2000-2010. This is because of how much further along the scale the lower box plot is than the upper one. In addition, although the two middle 50% regions overlap, much more of the 1990-1999 box lies beyond the overlap, while little of the 2000-2010 box lies outside it. Furthermore, the median of the 2000-2010 data is nowhere near the middle 50% of the other set; in fact it is less than that set's minimum value, whereas the other median only just overlaps the 2000-2010 middle 50%. The medians are also far apart, with the 1990-1999 median much larger than the other.
Using all these results, which were found visually by looking at the box plots, you can certainly say that the 1990-1999 time period had more deaths than 2000-2010.
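The summary statistics and outlier boundaries used for this hypothesis can be sketched with Python's standard `statistics` module. This is only an illustration under stated assumptions: the death tolls are made up, the quartile method (`method="inclusive"`) and the population form of the standard deviation (`pstdev`) are assumptions because the write-up does not say which were used, and the function names are hypothetical.

```python
import statistics


def upper_boundary(q1, q3):
    """Upper outlier boundary: upper quartile plus 1.5 times the interquartile range."""
    return q3 + 1.5 * (q3 - q1)


def summarise(values):
    """Return the measures compared in this section for one set of death tolls."""
    # Quartile method is an assumption; other methods give slightly different quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    boundary = upper_boundary(q1, q3)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "std_dev": statistics.pstdev(values),  # population form; an assumption
        "upper_boundary": boundary,
        "outliers": [v for v in values if v > boundary],
    }


# Hypothetical death tolls, for illustration only (not the coursework data).
tolls = [2, 4, 4, 9, 16, 30, 41, 68, 20000]

with_outlier = summarise(tolls)
without_outlier = summarise([v for v in tolls if v not in with_outlier["outliers"]])

# One very large value drags the mean far above the median.
print(with_outlier["median"], round(with_outlier["mean"], 1))
# With it removed, the mean and median are much closer together.
print(without_outlier["median"], round(without_outlier["mean"], 1))
```

The same boundary rule reproduces the Hypothesis 1 figure: an interquartile range of 23 and a boundary of 67.5 imply an upper quartile of 33 and a lower quartile of 10, and `upper_boundary(10, 33)` gives 67.5.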
Data calculations without anomalies: After removing the anomalies, I recalculated everything I did at first, for the same reasons, but hoping for more accurate results, especially lower standard deviations. My results were as I hoped: for 1990-1999, I obtained a mean of 682, a median of 219 and a range of 2,322. No mode could be found, and although this range is still very big, it has decreased by nearly 15,000. For 2000-2010, the mean was 110.33, the median was 16.5, the mode was 4 and the range was 942. Again, the range was still fairly big, but had decreased by over 315,000. This shows that the outliers that had been removed were very large values and had a big effect on the values that I got. By comparing the two means and the two medians, I can see that 1990-1999 had higher death tolls. The standard deviation for 1990-1999 is 831.4053637. This is still an extremely high standard deviation and shows that the mean isn't a very true representation of the data, but it is more representative than the first one was. The same is true for 2000-2010, where the standard deviation is now 266.6595832. This is still a high standard deviation, but it has decreased by over 80,000! The spread of the data about the mean is now much smaller than before, which shows how much of a difference removing outliers can actually make.
Box plots without anomalies: From the box plots I can see very clearly that 1990-1999 had a lot more deaths than 2000-2010. The two middle 50% regions don't overlap at all; in fact the only overlap is between the maximum value for 2000-2010 and the middle 50% of 1990-1999. As neither median falls within the other set's middle 50%, it indicates that 1990-1999 is a lot higher than 2000-2010 in terms of death toll from earthquakes. I can also see that the 2000-2010 earthquakes have very similar values, as the middle 50% is bunched together very closely, implying that the data isn't varied.
In contrast, the 1990-1999 box plot is still varied, but not as much as before. The only parts of the box plots that look wrong are the maximum values, which seem to be very far from the upper quartiles. I think this is because in the original sets of data, the 3 anomalies in each set were so big that these values didn't count as anomalies. When working out anomalies for the second set of data, these values now count as anomalies, but I left them in, as otherwise I would have to work out anomalies again and again. The reason I re-plotted the box plots was to have a more accurate diagram of both sets of data, so that I could compare them more accurately.
Evaluation: