
Handout 1 Characterizing Data Sets

A data set consisting of six values may speak for itself: if someone asks what the data are like, it suffices to write the six values down and hand the paper to the interlocutor. But a data set of 600, 6000 or six million values is too large for a human being to comprehend just by scanning values. For a large data set it is imperative to have a way to characterize the key features of the data set succinctly.

Characterizing the Central Tendency of a Data Set

One feature of a data set one might wish to characterize or measure is what is average, typical, what's in the center. There are many ways to give such a characterization; the measure of central tendency which is most tractable for mathematical analysis is the sample mean. This is just the average you've come to know and love: add all the values up and then divide the sum by the number of values. The reader is assumed to be comfortable with summation notation, and hence will see that this is the appropriate formal definition:

Definition Given a data set $\{x_1, x_2, \ldots, x_n\}$, the sample mean of the data is

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Example: The sample mean for the data set $\{2, 3, 4, 11\}$ is $(2 + 3 + 4 + 11)/4 = 5$.

There are other ways to measure central tendency; perhaps the most important of these is the sample median. It is left to the reader to investigate what the sample median is, and when it is useful as an alternative measure of central tendency.

Characterizing the Dispersion of a Data Set

Consider the two data sets $\{10, 40, 50, 80\}$ and $\{30, 40, 50, 60\}$. It is easy to verify that the sample mean for both data sets is 45; but there is clearly an aspect of the data that is not adequately described by the sample mean, an aspect different from the central tendency of the data. Note that the former data set is more spread out with respect to the mean. That's it! We need a characterization of the dispersion of the data.

One way to measure dispersion, which has the virtue of simplicity, is via the range of the data set. The range of a data set $\{x_1, x_2, \ldots, x_n\}$ is defined to be $x_{\max} - x_{\min}$, where $x_{\max}$ denotes the largest observation and $x_{\min}$ denotes the smallest observation. Note that the range of the former data set is $80 - 10 = 70$, and the range of the latter data set is $60 - 30 = 30$. In general, the bigger the range, the more dispersed the data are.
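The sample mean and the range of the data sets above can be checked with a few lines of Python. This is only a minimal sketch: the helper name data_range and the variable names are made up for illustration, and mean comes from Python's standard statistics module.

```python
from statistics import mean


def data_range(values):
    """Range of a data set: largest observation minus smallest observation."""
    return max(values) - min(values)


# The example data set and the two data sets compared in the text.
example = [2, 3, 4, 11]
first = [10, 40, 50, 80]
second = [30, 40, 50, 60]

print("example: mean =", mean(example))
for name, data in (("first", first), ("second", second)):
    # Both data sets share the same mean but have different ranges.
    print(name, ": mean =", mean(data), ", range =", data_range(data))
```

The two data sets report the same mean but different ranges, which is exactly the distinction the text draws between central tendency and dispersion.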

The range has the shortcoming of being a function of just two data points: the rest of the data get no vote at all, even though the value of every datum contributes to the dispersion of the data set. To illustrate this point, consider a third data set, $\{30, 30, 60, 60\}$ (also with sample mean 45). Intuitively, this data set is more spread out relative to the sample mean than is the second data set, but the range is the same, namely 30. Clearly one needs a measure of dispersion that takes all the data into account.

At the risk of being too pedestrian, let's take time to motivate the ideas that follow. The goal is to measure dispersion relative to the mean, so it's natural to look at the deviation of each observation from the mean. Using the first data set as an example, the deviations from the mean are $10 - 45$, $40 - 45$, $50 - 45$, $80 - 45$; that is, $-35$, $-5$, $5$, $35$. Since every data point should contribute to our new measure of dispersion, why not take an average of these deviations: $(-35 + (-5) + 5 + 35)/4 = 0/4 = 0$. Zero, hmm. A little odd, but what's worse (verify this fact for the other two data sets) is that the sum of the deviations from the sample mean is always zero. In other words, for any data set $\{x_1, x_2, \ldots, x_n\}$,

\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = 0, \]

and so the average deviation from the sample mean is always zero and so reveals nothing about the dispersion of the data. Just thinking about the arithmetic of the situation, if one doesn't want everything to cancel out and sum to zero, maybe one should square the deviations first! Hence the (admittedly sketchy) motivation for the following definition:

Definition Given a data set $\{x_1, x_2, \ldots, x_n\}$, the sample variance of the data is

\[ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Numerical Example: For the data set $\{10, 40, 50, 80\}$, the sample variance is $((-35)^2 + (-5)^2 + 35^2 + 5^2)/(4 - 1) = 2500/3 \approx 833.3$.

A few computations should convince one that the sample variance is doing what one wants: the more spread out the data relative to the mean, the larger the sample variance. Verify these calculations:

Data Set               Sample Variance
{ 10, 40, 50, 80 }     833.3
{ 30, 40, 50, 60 }     166.7
{ 30, 30, 60, 60 }     300
{ 45, 45, 45, 45 }     0
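The calculations above can also be checked by machine. The sketch below is illustrative only: the function names deviations and sample_variance are made up here, and the divisor n - 1 follows the numerical example above. It prints the sum of the deviations (which should be zero) and the sample variance for each of the four data sets, and compares the result with variance from Python's standard statistics module, which also divides by n - 1.

```python
from statistics import variance


def deviations(values):
    """Deviation of each observation from the sample mean."""
    m = sum(values) / len(values)
    return [x - m for x in values]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    return sum(d * d for d in deviations(values)) / (n - 1)


data_sets = [
    [10, 40, 50, 80],
    [30, 40, 50, 60],
    [30, 30, 60, 60],
    [45, 45, 45, 45],
]

for data in data_sets:
    print(
        data,
        "sum of deviations =", sum(deviations(data)),
        "sample variance =", round(sample_variance(data), 1),
        "library check =", round(variance(data), 1),
    )
```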

Remarks

(1) Why divide by $n - 1$, instead of the more intuitive $n$, in the definition of the sample variance? An intuitive answer to this question will be given later; a mathematically rigorous explanation is beyond the scope of this course.

(2) The subscript $x$ in the notation $s_x^2$ just indicates that one is referring to the sample variance for the variable $x$: in a complex experiment or survey, one might be measuring many different variables, each with its own sample mean and sample variance. If the variable is understood, then the subscript may be dispensed with.

(3) The square root of the sample variance, $s_x = \sqrt{s_x^2}$, is called the sample standard deviation. Note that both the sample variance and the sample standard deviation are always non-negative!

(4) A note on units: if the original data have units of, say, feet, then the units of the sample variance are feet squared; but the units of the sample standard deviation (and the sample mean) are feet. In general, the sample standard deviation always has the same units as the original data, which is an advantage when presenting results to a non-technical audience!

There is an alternative formula for calculating the sample variance which is sometimes more convenient to use:

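\[ s_x^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) \qquad (5) \]

A derivation of this result is given in the following appendix. You are not responsible for knowing the derivation; you are responsible for knowing this formula, and being able to use it.

Numerical Example: Take the data set $\{1, 2, 3\}$. So $n = 3$, $\bar{x} = 2$, $\sum_{i=1}^{3} x_i^2 = 1 + 4 + 9 = 14$, and one computes

\[ s_x^2 = \frac{14 - 3 \cdot 2^2}{3 - 1} = \frac{2}{2} = 1 \qquad (6) \]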

Becoming adept at computing sample variances and standard deviations requires practice. There's no need to wait for anyone to give you a data set: write one down and start computing. (You can always check the validity of your result via MINITAB.)
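For practice away from MINITAB, hand computations can also be checked with a short script. The sketch below is an illustration under stated assumptions: it takes the alternative formula in the common form (sum of squares minus n times the squared mean, all divided by n - 1), and the function names are invented for this example. It computes the sample variance of { 1, 2, 3 } both ways, then takes the square root to get the sample standard deviation.

```python
from math import sqrt


def variance_by_definition(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def variance_by_shortcut(values):
    """Assumed shortcut form: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    m = sum(values) / n
    sum_of_squares = sum(x * x for x in values)
    return (sum_of_squares - n * m * m) / (n - 1)


data = [1, 2, 3]
s2_definition = variance_by_definition(data)
s2_shortcut = variance_by_shortcut(data)
s = sqrt(s2_shortcut)  # sample standard deviation, same units as the data

print("by definition:", s2_definition)
print("by shortcut:  ", s2_shortcut)
print("standard deviation:", s)
```

The two functions should agree up to rounding. On a computer the shortcut form can lose precision when the mean is large compared with the spread of the data, so the definition is usually the safer choice in code, while the shortcut is handy for hand calculation.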

APPENDIX - Alternative Expression for the Sample Variance

Recall the definition of the sample variance of a data set $\{x_1, \ldots, x_n\}$:

Definition

\[ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

This expression is not always the most convenient one. Here's an alternative expression for the sample variance, together with a derivation:

Theorem The sample variance, as defined above, is given by:

\[ s_x^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) \]

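Proof. Write

\[ (x_i - \bar{x})^2 = x_i^2 - 2\bar{x}x_i + \bar{x}^2. \]

Now sum over the index $i$, keeping in mind that $\bar{x}$ is a constant with respect to the index $i$. So one has

\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2. \]

Now substitute $n\bar{x}$ for $\sum_{i=1}^{n} x_i$ in the last expression to obtain:

\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2. \]

Divide the last expression by $n - 1$, and the derivation is complete. QED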
