Descriptive Statistics
2014
GEOG-2210B
The histograms below have the median (diamond) and the mean (circle) ± two standard deviations (arrows) marked on them. These ranges provide a convenient (and theoretically sound) basis for identifying outliers from a designated group.
Figure 2.2. Histograms of weight by sex with mean, median and mean ± two standard deviations

Some of the calculations required below can be tedious without effective use of a calculator. Calculators with a Σ sign probably store intermediate statistics. It is worth learning this short cut now using a simple synthetic data set. For example, a data set consisting of {2, 3, 4} has n = 3, Σx = 9 and Σx² = 29.

Spreadsheets start to become a significant advantage now in taking over these tedious operations. However, tests and examinations may require some compilations of data, and spreadsheets are not allowed. Use the spreadsheet as a check on your manual calculations, to provide more professional exercise reports, and to build expertise in computer-assisted analysis.

Table 2.1 contains the height, weight and distance from parental home data set used in the previous exercise. The objective is to demonstrate the derivation of descriptive statistics and to explore the use of descriptive statistics in the efficient analysis of research questions. Many of the terms are described in the text, and a summary listing of descriptive statistics follows.
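The calculator short cut described above for the synthetic data set {2, 3, 4} can be checked with a few lines of Python. This is only a minimal sketch; the function name `intermediate_sums` is a hypothetical choice and not part of the exercise.

```python
def intermediate_sums(data):
    """Return (n, sum of x, sum of x squared) for a list of numbers."""
    n = len(data)                       # sample size
    sum_x = sum(data)                   # sum of the data
    sum_x2 = sum(x * x for x in data)   # uncorrected sum of squares
    return n, sum_x, sum_x2


print(intermediate_sums([2, 3, 4]))  # (3, 9, 29)
```

If a calculator or spreadsheet gives different values for this small set, the keying procedure is wrong and should be fixed before tackling the class data.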
The mean of the sample is:

$$\bar{x} = \frac{\sum x}{n}$$

The median is obtained by ranking the data and taking the middle value. When ranks are tied, either assign ranks in order of collection, or take the tied ranks (e.g., 2nd, 3rd) and assign the average rank to all tied values (e.g., 2.5, 2.5). If the sample size (n) is even, there is no central value; take the average of the two middlemost values.

The sample standard deviation is:

$$s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{SS_x}{n - 1}} = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$$

The last of the definitions is the easiest way of calculating standard deviation.

The coefficient of variation (%) is CV = 100(sx / x̄).

The signal to noise ratio is x̄ / sx.

Notation

The notation used here and in much of the course is simplified. The x should be subscripted with i (xi) to imply the individual members of the set. The Σ should normally indicate the range over which i varies in making the summation. Thus the mean should be defined as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The subscript (i = 1) and superscript (n) on the Σ imply that all the members of {x} should be summed. The equation can be read as "sum all members of the set x from the first (i = 1) to the last (i = n)". In most cases, Σ will imply summation over the entire set and the subscripts can be safely omitted.
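The definitions above can be sketched in Python as a check on manual work. The sketch assumes the sum-of-squares form of the standard deviation with an n − 1 divisor, and uses the female weights from Table 2.1; the function name `describe` and the dictionary keys are hypothetical.

```python
import math
import statistics


def describe(data):
    """Descriptive statistics built from the intermediate sums."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    mean = sum_x / n
    ss_x = sum_x2 - sum_x ** 2 / n      # corrected sum of squares
    s_x = math.sqrt(ss_x / (n - 1))     # sample standard deviation (assumes n - 1 divisor)
    return {
        "mean": mean,
        "median": statistics.median(data),  # averages the two middle values when n is even
        "s_x": s_x,
        "cv": 100 * s_x / mean,             # coefficient of variation (%)
        "signal_to_noise": mean / s_x,
    }


# Female weights (kg) from Table 2.1
female_weight = [54, 84, 49, 78, 69, 48, 73, 59, 73, 60, 49, 55, 52]
stats = describe(female_weight)
print({name: round(value, 1) for name, value in stats.items()})
```

Rounded to one decimal place, the mean, median, standard deviation and CV should agree with the worked female weight values (61.8, 59.0, 12.2 and 19.8).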
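Table 2.1. Height, weight and distance-to-home data for students in a Geography 2210 Class

| Sex | Height (m) | Rank | Weight (kg) | Rank | Distance (km) | Rank |
|-----|------------|------|-------------|------|---------------|------|
| M | 1.80 | 8 | 104 | | 1 | 11 |
| M | 1.84 | 6 | 68 | | 573 | 3 |
| M | 1.89 | 3 | 75 | | 192 | 8 |
| M | 1.79 | 9 | 73 | | 198 | 7 |
| M | 1.85 | 4 | 77 | | 211 | 6 |
| M | 1.62 | 11 | 65 | | 544 | 5 |
| M | 1.85 | 5 | 68 | | 588 | 2 |
| M | 1.83 | 7 | 60 | | 4 | 10 |
| M | 1.70 | 10 | 73 | | 573 | 4 |
| M | 1.93 | 1 | 85 | | 189 | 9 |
| M | 1.91 | 2 | 99 | | 4298 | 1 |
| F | 1.61 | 9 | 54 | | 2 | |
| F | 1.60 | 10.5= | 84 | | 528 | |
| F | 1.60 | 10.5= | 49 | | 0.1 | |
| F | 1.85 | 1 | 78 | | 35 | |
| F | 1.65 | 7 | 69 | | 92 | |
| F | 1.45 | 13 | 48 | | 249 | |
| F | 1.70 | 5 | 73 | | 5 | |
| F | 1.75 | 4 | 59 | | 7 | |
| F | 1.84 | 2 | 73 | | 777 | |
| F | 1.68 | 6 | 60 | | 8 | |
| F | 1.63 | 8 | 49 | | 45 | |
| F | 1.80 | 3 | 55 | | 200 | |
| F | 1.57 | 12 | 52 | | 98 | |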
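The blank rank columns in Table 2.1 can be checked with a short ranking routine. The sketch below assumes ranks run from largest (rank 1) to smallest and that tied values share the average rank, which is one of the two tie conventions the exercise allows; the function name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest, averaging tied ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for value in values:
        first = ordered.index(value) + 1          # first position held by this value
        count = ordered.count(value)              # how many observations share it
        ranks.append(first + (count - 1) / 2)     # average of the tied positions
    return ranks


# Female heights (m) from Table 2.1
female_height = [1.61, 1.60, 1.60, 1.85, 1.65, 1.45, 1.70,
                 1.75, 1.84, 1.68, 1.63, 1.80, 1.57]
print(average_ranks(female_height))
```

For the female heights this gives 10.5 for both 1.60 m values, matching the tied ranks already entered in the table. The median is then the value holding the middle rank (rank 7 of 13).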
Geography 2210B
EXERCISE 2
Descriptive Statistics
Instructions and Exercise Sheets

Name:_______________
Student #:____________
T.A.:_________________

1. Intermediate Statistics

(a) For each sample group, calculate sample size (n), the sum of data (Σx), the uncorrected sum of squares (Σx²) and the corrected sum of squares (SSx = Σx² − (Σx)²/n). The symbols and terms that you will need are defined at the end of the exercise. To start with, use the worked examples in the table to check your technique. Place your answers in Table 2.2. (6)

Table 2.2. Intermediate statistics for Geography 2210 class data

| Statistic | Height (m) Male | Height (m) Female | Weight (kg) Male | Weight (kg) Female | Distance (km) Male | Distance (km) Female |
|-----------|-----------------|-------------------|------------------|--------------------|--------------------|----------------------|
| n | | | 11 | 13 | 11 | |
| Σx | | | 848 | 803 | 7,371 | |
| Σx² | | | 67,336 | 51,391 | 1.9927E7* | |
| SSx | | | 1,962.91 | 1,790.308 | 1.4988E7* | |

* Note: E-notation is a computer shorthand form of exponential notation that is used to represent very large or very small numbers. Exponential notation splits a number into a mantissa, which gives the numerical value, and an exponent, which indicates the order of magnitude. Thus 19,927,000 is represented as 1.9927 × 10^7, or in the shorthand form, 1.9927E7. This so-called scientific notation or format is awkward to read, but allows precise application of the concept of significant figures and, more pragmatically, permits very different numbers to fit in the same space.
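The intermediate statistics and the E-notation in the note above can be reproduced for the worked distance example (n = 11, Σx = 7,371). This is a sketch only; the variable names are hypothetical, and note that Python writes the exponent as `E+07` rather than `E7`.

```python
# Male distances (km) from Table 2.1
male_distance = [1, 573, 192, 198, 211, 544, 588, 4, 573, 189, 4298]

n = len(male_distance)
sum_x = sum(male_distance)
sum_x2 = sum(x * x for x in male_distance)   # uncorrected sum of squares
ss_x = sum_x2 - sum_x ** 2 / n               # corrected sum of squares

print(n, sum_x)            # 11 7371
print(f"{sum_x2:.4E}")     # 1.9927E+07
print(f"{ss_x:.4E}")       # 1.4988E+07

# E-notation is also accepted on input
print(float("1.9927E7") == 19927000.0)   # True
```

The format code `.4E` keeps four digits after the decimal point of the mantissa, that is, five significant figures.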
2. Descriptive Statistics
(a) For each sample group, calculate the mean (x̄), median (Medx), standard deviation (sx), variance (sx²) and coefficient of variation (CV). Calculate the mean plus 1 and 2 standard deviations, and the mean minus 1 and 2 standard deviations. The symbols and terms that you will need are defined at the end of the exercise. Place your answers in Table 2.3. Use the worked examples to establish and check your technique. In particular, note that these results are all presented to three significant figures. If you use these rounded numbers in subsequent calculations, errors can result. For example, the variance of Male Distance is not equal to 1220 squared; it is computed from the raw standard deviation (~1224.264). Note that the median requires rankings to be assigned in Table 2.1. (6)
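Table 2.3. Descriptive statistics for Geography 2210 class data

| Statistic | Height (m) Male | Height (m) Female | Weight (kg) Male | Weight (kg) Female | Distance (km) Male | Distance (km) Female |
|-----------|-----------------|-------------------|------------------|--------------------|--------------------|----------------------|
| Mean (x̄) | | | 77.1 | 61.8 | 670 | |
| Median (Medx) | | | 73.0 | 59.0 | 211 | |
| Standard deviation (sx) | | | 14.0 | 12.2 | 1,220 | |
| Variance (sx²) | | | 196 | 149 | 1,500,000 | |
| Coefficient of variation (CV), (%) | | | 18.2 | 19.8 | 183 | |
| x̄ + 2sx | | | 105 | 86.2 | 3,120 | |
| x̄ + sx | | | 91.1 | 74.0 | 1,890 | |
| x̄ − sx | | | 63.1 | 49.6 | −554 | |
| x̄ − 2sx | | | 49.1 | 37.3 | −1,780 | |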
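The caution about reusing rounded results can be illustrated with the Male Distance standard deviation quoted in the instructions (1224.264). The helper `round_sig` below is a hypothetical function for rounding to significant figures; it is a sketch and not part of the exercise.

```python
import math


def round_sig(value, figures=3):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    magnitude = math.floor(math.log10(abs(value)))   # order of magnitude
    return round(value, figures - 1 - magnitude)


s_raw = 1224.264                  # raw standard deviation quoted in the text
s_rounded = round_sig(s_raw)      # 1220.0, as reported to three figures

print(round_sig(s_raw ** 2))      # 1500000.0 (variance from the raw value)
print(round_sig(s_rounded ** 2))  # 1490000.0 (variance from the rounded value)
```

The two variances differ in the third significant figure, which is why intermediate results should be carried at full precision and rounded only when reported.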
(g) Approximately how many individuals lie outside the one and two standard deviation limits on your histogram or scattergram? Express your answer as a number and an approximate percentage (100 × number outside / total number in sample). (2)
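Table 2.4. Outliers based on histograms of sample statistics

| Limits | Height (m) Male | Height (m) Female | Weight (kg) Male | Weight (kg) Female | Distance (km) Male | Distance (km) Female |
|--------|-----------------|-------------------|------------------|--------------------|--------------------|----------------------|
| Outside x̄ ± sx | | | 3 (27%) | 5 (38%) | 1 (9%) | 2 (15%) |
| Outside x̄ ± 2sx | | | 0 (0%) | 0 (0%) | 1 (9%) | 2 (15%) |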
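Counting the individuals outside the one and two standard deviation limits can also be done directly from the data rather than by eye from the histogram. The sketch below uses the female weights from Table 2.1; the function name `count_outside` is hypothetical, and `statistics.stdev` uses the n − 1 divisor.

```python
import statistics


def count_outside(data, k):
    """Count values outside mean ± k standard deviations; return (number, percent)."""
    mean = statistics.mean(data)
    s_x = statistics.stdev(data)          # sample standard deviation (n - 1 divisor)
    lower, upper = mean - k * s_x, mean + k * s_x
    number = sum(1 for x in data if x < lower or x > upper)
    return number, 100 * number / len(data)


# Female weights (kg) from Table 2.1
female_weight = [54, 84, 49, 78, 69, 48, 73, 59, 73, 60, 49, 55, 52]
for k in (1, 2):
    number, percent = count_outside(female_weight, k)
    print(k, number, round(percent))
```

For the female weights, five of the 13 values (about 38%) fall outside one standard deviation and none fall outside two.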
4. Inference
In some cases, we know so little about a phenomenon that the primary need is for basic information, or exploratory analysis. Descriptive statistics are useful in drawing out information in such cases, and often raise new questions. Once armed with some knowledge, it is possible to develop more formal questions and to infer specific answers to these questions.

Inference is the process of drawing conclusions from analysis of data. Drawing inferences from data requires formality so that your train of analysis is objective and clear to a reader. Effective inference depends upon the quality and quantity of the data, its relevance to the problem at hand, and the application of appropriate analytical methods. Relevance is achieved by narrowing the research problem to questions and then to formal hypotheses. Hypotheses are simple testable statements that lead directly to the necessary data. Hypotheses are often disappointingly narrow in their focus, but this is more than compensated for by their overt testability.

For example, the class data set might have been gathered because of curiosity about the gender of students, their morphology (shape) and the availability of home cooking. Within this broad research problem, we have decided to focus on a narrower group of questions about height, weight, sex and distance from home.

Formal hypotheses can be most readily expressed in simple terms of difference. There are three different ways of defining difference, and we can assign particular names to them.

Table 2.5. Identification and representation of tailedness of a research question

| Description of hypothesis | Language | Symbol | Words |
|---------------------------|----------|--------|-------|
| 1. One-tailed test on the positive side | Significantly greater than | > | Expect one variable to exceed another |
| 2. One-tailed test on the negative side | Significantly less than | < | Expect one variable to be less than the other |
| 3. Two-tailed test | Significantly different from | ≠ | Expect one variable to be greater or less than another |
The histograms (Figure 1.1) indicate that males are often heavier than females. A related hypothesis, therefore, would be one-tailed positive (i.e., Wtm > Wtf). But some males are heavier and others are lighter than females, so the raw data are inconclusive. The sample mean provides an excellent way to characterize a data set. Indeed, the mean weight of males (77.1 kg) is greater than the mean weight of females (61.8 kg). What if the male mean weight were only slightly larger than the female mean weight? How different do the means have to be? An objective rule is needed to decide how big a difference must be to be considered significant. A simple rule for deciding significance of a difference might be that the male average weight should be more than one standard deviation greater than the female average weight. This can be presented graphically:
Figure 2.3. Sketch of hypothesis testing using means and standard deviation

The male mean weight is clearly more than one standard deviation above the female mean weight; therefore we conclude male weight is not only greater than female weight, but that it is significantly greater. It is also possible to tackle the same hypothesis by asking if female weight is significantly less than male mean weight. In this case the following graphic applies.
Figure 2.4. Sketch of hypothesis testing using mean and standard deviation
This result confirms the previous test. Female weight is significantly less than male mean weight. Overall, the two tests both confirm that males are significantly greater in weight than females.

As there are two ways of testing the hypothesis, these may or may not corroborate one another. How might we decide which route is better? First, it is possible that one sample is more reliable than the other. Assuming good sampling procedure, the sample size is the best indication of reliability. The female sample size is the larger in this case, so perhaps the first method of testing is preferable. Alternatively, we could pool the standard deviations, and get a composite confidence interval that way.

This test procedure is only useful as a general indication. However, this style of inference testing is an extremely powerful approach to analysis. The researcher retains the primary responsibility for resolving the problem into practical hypotheses, and uses formal techniques of descriptive statistics and logic to test the hypotheses. The questions are answered directly using the data, and the grounds for drawing conclusions are explicit to readers.

Table 2.6 below summarizes the example of the inferential process, and has a couple of examples to be completed. First, identify the variable of interest (e.g., Height in the partially worked example), then state a hypothesis (e.g., we expect males to be taller than females; Htm > Htf; a one-tailed test). Present a rule for identifying differences (e.g., the male mean height should be at least one standard deviation greater than the female mean height). This is the tolerance or threshold beyond which you will reject the hypothesis. Provide the appropriate statistics (e.g., x̄ and sx) and draw a conclusion as to whether the hypothesis is supported or rejected. Conclusions should be qualified (i.e., add some cautionary words) if you have any doubts about the reliability of the test (e.g., if the results are inconsistent). Work through the examples, then complete your analysis for the hypothesis about height, and tackle a two-tailed test for gender differences in distance-to-home. (10)

Table 2.6. Hypothesis testing for Exercise 2

| Variable | Hypothesis | Rule for acceptance | Observed statistics | Tolerance (x̄ ± sx) | Conclusion |
|----------|------------|---------------------|---------------------|---------------------|------------|
| Weight (kg) | Males are heavier than females; Wtm > Wtf; one-tailed +ve | x̄m > (x̄f + sxf), or x̄f < (x̄m − sxm) | Male: x̄ = 77.1, sx = 14.0 (n = 11); Female: x̄ = 61.8, sx = 12.2 (n = 13) | Male: 63.1; Female: 74.0 | x̄m > (x̄f + sxf): 77.1 > (61.8 + 12.2); x̄f < (x̄m − sxm): 61.8 < (77.1 − 14.0); hypothesis is supported |
| Weight (kg) | Male weight is different from female weight; Wtm ≠ Wtf; two-tailed | x̄m ≠ (x̄f ± sxf), or x̄f ≠ (x̄m ± sxm) | As above | Male: 63.1 to 91.1; Female: 49.6 to 74.0 | x̄m > (x̄f + sxf): 77.1 > (61.8 + 12.2); x̄f < (x̄m − sxm): 61.8 < (77.1 − 14.0); hypothesis is supported |
| Height (m) | Males are taller than females; Htm > Htf; one-tailed +ve | x̄m > (x̄f + sxf), or x̄f < (x̄m − sxm) | | | |
| Distance (km) | | | | | |
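The one-standard-deviation rule used in the worked weight example can be written as a short check. This is a sketch of that informal rule only, not a formal significance test; the function name `supports_greater` and its argument names are hypothetical.

```python
def supports_greater(mean_a, s_a, mean_b, s_b):
    """One-tailed positive rule: is group A's mean more than one
    standard deviation above group B's mean?

    Returns the outcome of both ways of applying the rule.
    """
    first_test = mean_a > mean_b + s_b    # uses group B's standard deviation
    second_test = mean_b < mean_a - s_a   # uses group A's standard deviation
    return first_test, second_test


# Worked weight example: male 77.1 (sx = 14.0), female 61.8 (sx = 12.2)
print(supports_greater(77.1, 14.0, 61.8, 12.2))  # (True, True)
```

When the two results disagree, the conclusion should be qualified, as the instructions for the table suggest.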