Measures of Dispersion

Applied Statistics and Computing Lab Indian School of Business
Learning goals
- To understand the need for studying dispersion
- To understand the idea behind measures of dispersion
- To study different measures of dispersion
- Additional topics:
  - Standardization of a variable
  - Skewness and kurtosis
  - Five-point summary
Need to study dispersion

Two patients are admitted to the Intensive Care Unit of a hospital. The night before their operations, the doctor makes a last visit at 9 pm; the blood pressure of Patient 1 is 110/80 and that of Patient 2 is 120/70. Although both readings are normal, the doctor asks the nurse, as a precaution, to check their blood pressure every 2 hours. At 7.30 the next morning, the nurse reports that the average blood pressure of both patients was normal, 120/80. The chart of their actual blood pressures was:

Time    Patient 1    Patient 2
11pm    120/80       110/60
1am     100/80       100/60
3am     100/60       100/70
5am     130/80       130/90
7am     150/100      160/120
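The nurse's report can be checked with a few lines of Python. This is a minimal sketch: the readings are copied from the chart, while the function name `summarize` and the dictionary keys are illustrative choices, not part of the original slides. It shows that both patients average exactly 120/80 even though their readings swing widely.

```python
from statistics import mean

# Readings (systolic, diastolic) copied from the chart, 11pm to 7am.
patient_1 = [(120, 80), (100, 80), (100, 60), (130, 80), (150, 100)]
patient_2 = [(110, 60), (100, 60), (100, 70), (130, 90), (160, 120)]

def summarize(readings):
    """Return the mean and the (min, max) span of systolic and diastolic values."""
    systolic = [s for s, _ in readings]
    diastolic = [d for _, d in readings]
    return {
        "mean": (mean(systolic), mean(diastolic)),
        "systolic_span": (min(systolic), max(systolic)),
        "diastolic_span": (min(diastolic), max(diastolic)),
    }

summary_1 = summarize(patient_1)
summary_2 = summarize(patient_2)
print(summary_1)
print(summary_2)
```

The two means are identical, but the spans differ: the average alone hides the rise to 150/100 and 160/120 at 7am.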
Need to study dispersion (contd.)

- What if the doctor decides to operate on the patients without looking at the blood pressure chart?
- What if someone decides to visit a tourist destination next week based only on last week's average temperature, as given in our data?
- What if I am interested in working for company X (which is visiting our campus) and I am given only the mean salary of its employees?
- In the extreme case, the same measure of central tendency could also describe a dataset in which every observation equals one constant value
[Figure: plot with the mode and the median marked; axis ticks 10 to 13]
Examples
- Variability in temperature through the week
- Scatter of the horsepower capacities among the cars available
- Spread of the prices at which varieties of a single product (say, rice varieties) are available
- Variability in returns on investments
Need for measures of dispersion (contd.)

- Helps determine the reliability of the measure of central tendency
- Facilitates comparison of two sets of data
- Useful for building further statistical measures
Desired properties
- A good measure should not be strongly affected when the data change slightly
- A good measure should be representative of the majority of the data
- A good measure should allow us to declare an interval within which most of the values lie, with a certain degree of confidence
Dataset
- Body measurements on 507 individuals: 247 men and 260 women
- Primarily in their 20s and 30s, with some exceptions
- All individuals exercise several hours a week
- Of the 28 variables in this dataset, we consider Gender (1 = Male, 0 = Female) and Weight (in kg)

Data source: Measurements collected by authors Grete Heinz and Louis J. Peterson for their study
Dataset (contd.)
Weight (in kg)    Min.    Max.     Mean     Median
Female            42      105.2    60.6     59
Male              53.9    116.4    78.14    77.3
Overall           42      116.4    69.15    68.2
Evaluating dispersion

Two approaches:
- Consider the boundaries (measures based on selected values)
- Consider distance from a central tendency (measures based on all the values)

Ways to express dispersion:
- Report the extreme values
- Build an absolute measure
- Calculate a coefficient
  - High coefficient: large spread, high variability
  - Small coefficient: small spread, less variability
1. Considering the boundaries

- These measures consider and report only the boundaries of the data
- They try to capture how far the values of the variable reach
- The spread of the data is not considered relative to any central tendency
- These measures overlook the patterns of values within the boundaries
Minimum and maximum values
ADVANTAGES:
- Useful when a range of tolerance exists, i.e. if values beyond a certain limit are harmful or unacceptable
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Range = (Maximum value) - (Minimum value)
ADVANTAGES:
- Easy comparison of variability across datasets
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Inter-quartile range = (3rd quartile) - (1st quartile)
ADVANTAGES:
- Highlights the middle portion of the distribution of values
- Easy to understand
DISADVANTAGES:
- More difficult to compute than min-max and range
- Ignores irregularities at the extremes
- Ignores 25% of the data on each side

           (Min. weight, Max. weight)    Weight range    Weight inter-quartile range
Female     (42, 105.2)                   63.2            11.1
Male       (53.9, 116.4)                 62.5            14.55
Overall    (42, 116.4)                   74.4            20.45
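The boundary-based measures can be sketched in Python as follows. The sample `values` is hypothetical and chosen only for illustration (the body-measurement data itself is not reproduced here). The choice of `method="inclusive"` in `statistics.quantiles` is an assumption; other quartile conventions can give slightly different results.

```python
from statistics import quantiles

# Hypothetical sample, for illustration only (not the body-measurement data).
values = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]

# Range = (maximum value) - (minimum value)
value_range = max(values) - min(values)

# quantiles(..., n=4) returns the three quartile cut points.
# method="inclusive" is one convention among several for locating quartiles.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# Inter-quartile range = (3rd quartile) - (1st quartile)
iqr = q3 - q1

print(value_range, q1, q3, iqr)
```

The range uses only two observations and the inter-quartile range only two cut points, which is why both ignore any pattern in the remaining data.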
2. Considering distance from central tendency

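Consider the deviations of the values from the measure of central tendency. What if we simply sum all these deviations?

Consider a hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7), for which Mean = Median = 4. Then

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{13} (x_i - 4) = 0$$

The positive and negative deviations cancel, so we take absolute values or squares of the deviations in order to consider only their magnitudes.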
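The cancellation can be verified directly. The sketch below uses the hypothetical dataset from this slide; the variable names are illustrative.

```python
# Hypothetical dataset from the slide; its mean and median are both 4.
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]

mean_value = sum(data) / len(data)            # 52 / 13 = 4.0
deviations = [x - mean_value for x in data]   # signed distances from the mean

sum_dev = sum(deviations)                     # positive and negative parts cancel
sum_abs = sum(abs(d) for d in deviations)     # magnitudes only
sum_sq = sum(d * d for d in deviations)       # squared magnitudes

print(sum_dev, sum_abs, sum_sq)
```

The signed sum is zero for any dataset, so it carries no information about spread; the absolute and squared sums (24 and 56 here) do.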
Absolute deviations
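For a dataset consisting of n observations $x_1, \dots, x_n$ with mean $\bar{x}$:

Absolute deviations: $|x_i - \bar{x}|$ or $|x_i - \operatorname{median}(x)|$

Mean absolute deviation from mean = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$

Mean absolute deviation from median = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \operatorname{median}(x)|$

Median absolute deviation from median = $\operatorname{median}_{i}\, |x_i - \operatorname{median}(x)|$

                                         Female weights    Male weights
Mean absolute deviation from mean        7.33              8.58
Mean absolute deviation from median      7.19              8.57
Median absolute deviation from median    5.1               7.2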
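As a minimal sketch, the three absolute-deviation measures can be computed with Python's `statistics` module. The data are the hypothetical values (1, 1, 2, ..., 7, 7) used earlier in the deck, not the weight data, and the variable names are illustrative.

```python
from statistics import mean, median

# Hypothetical dataset (mean = median = 4), not the weight data.
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]

center_mean = mean(data)
center_median = median(data)

# Mean absolute deviation from the mean
mad_from_mean = mean(abs(x - center_mean) for x in data)

# Mean absolute deviation from the median
mad_from_median = mean(abs(x - center_median) for x in data)

# Median absolute deviation from the median
median_abs_dev = median(abs(x - center_median) for x in data)

print(mad_from_mean, mad_from_median, median_abs_dev)
```

Because the mean and median coincide for this dataset, the first two measures are equal (24/13, about 1.85), while the median absolute deviation is 2.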
Measures based on squared deviation

For a dataset consisting of n observations,

Variance = $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$

To obtain a measure in the same units as the original data, we take the square root:

Standard deviation = $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

Variance (Weight, females) = 92.46
Variance (Weight, males) = 110.52
Standard deviation (Weight, females) = 9.62
Standard deviation (Weight, males) = 10.51
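The sketch below computes the variance and standard deviation for the hypothetical dataset (1, 1, 2, ..., 7, 7), not for the weight data. One point worth flagging: the definition of variance as a sum of squared deviations divided by n corresponds to `statistics.pvariance`, whereas `statistics.variance` (like R's `var()`) divides by n - 1. Which divisor produced the weight figures on the slide is not stated, so no attempt is made to reproduce them.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical dataset (mean = 4), not the weight data.
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]

pop_var = pvariance(data)    # sum of squared deviations / n       = 56 / 13
sample_var = variance(data)  # sum of squared deviations / (n - 1) = 56 / 12
pop_sd = pstdev(data)        # square root of pop_var, same units as the data

print(pop_var, sample_var, pop_sd)
```

For large n the two divisors give nearly the same value; for small samples the difference is visible, as it is here.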

Relative measures of dispersion

Coefficient of range: $\frac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}}$
- Always lies in [0, 1] when the values are non-negative
- The higher the coefficient, the broader the range
- Weight, females = 0.43; Weight, males = 0.37

Coefficient of variation: $\frac{\sigma}{\bar{x}} \times 100$
- Computes the variability per unit mean
- Indicates how consistent the data are with respect to their mean
- The higher the coefficient, the more spread out the observations
- Weight, females = 15.87; Weight, males = 13.45

Relative to their mean, the weights of females are more spread out than those of males.
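Both coefficients can be recomputed from the summary values reported in this deck. In the sketch below, the minimum, maximum, mean and standard deviation are copied from the earlier slides; the function names are illustrative.

```python
# Summary values copied from the tables in this deck (weights in kg).
groups = {
    "females": {"min": 42.0, "max": 105.2, "mean": 60.6, "sd": 9.62},
    "males": {"min": 53.9, "max": 116.4, "mean": 78.14, "sd": 10.51},
}

def coefficient_of_range(minimum, maximum):
    """(Max - Min) / (Max + Min); unit-free."""
    return (maximum - minimum) / (maximum + minimum)

def coefficient_of_variation(sd, mean):
    """Standard deviation per unit mean, expressed as a percentage."""
    return sd / mean * 100

results = {}
for name, g in groups.items():
    results[name] = (
        round(coefficient_of_range(g["min"], g["max"]), 2),
        round(coefficient_of_variation(g["sd"], g["mean"]), 2),
    )

print(results)
```

The output matches the slide: 0.43 and 15.87 for females, 0.37 and 13.45 for males. Males have the larger standard deviation in kg, yet females have the larger coefficient of variation because their mean weight is lower.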
Comparing measures of dispersion

All the measures that consider distance from a central tendency are based on all the values!
- Absolute deviations are less affected by extreme values than squared deviations
- Absolute deviations are easy to understand and interpret
- Median absolute deviation is the least affected by slight changes in the data, across all measures of dispersion
- Variance and standard deviation are the most popular measures of dispersion because they are algebraically amenable and useful in building further statistical measures
- Both play an important part in building and evaluating further statistical measures
- Standard deviation is easier to understand than variance, as it is in the same units as the original data
- Algebraic manipulation of measures based on absolute deviations is difficult
- Variance is the most affected by extreme values, as it is based on squared deviations
- Standard deviation is not very easy to compute
- Standard deviation cannot be calculated for data with open-ended classes
- Coefficients are free of units and therefore facilitate comparison
- Coefficients are useful even when two variables are measured in two different units
Standardization
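Standardized variable of $X$: $Z = \frac{X - \bar{x}}{\sigma}$, where $\bar{x}$ and $\sigma$ are the mean and standard deviation of $X$
- Mean of standardized variable = 0
- Variance of standardized variable = 1
- Standardized variables are free of units
- Therefore measures of variation of standardized variables are comparable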
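A minimal sketch of standardization, again on the hypothetical dataset (1, 1, 2, ..., 7, 7). It assumes the divide-by-n standard deviation (`statistics.pstdev`); if the n - 1 version is used instead, the n - 1 variance of the standardized values equals 1.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical dataset, not the weight data.
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]

center = mean(data)
spread = pstdev(data)  # divide-by-n standard deviation (an assumption)

# Subtract the mean and divide by the standard deviation.
z_scores = [(x - center) / spread for x in data]

z_mean = mean(z_scores)
z_var = pvariance(z_scores)

print(round(z_mean, 10), round(z_var, 10))
```

Up to floating-point rounding, the standardized values have mean 0 and variance 1, and they carry no units.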
Example
- How is the weight of a newborn affected by whether the mother smokes? Further, does smoking affect the perinatal mortality rate, which varies with birth weight?
- Yerushalmy J. found in his 1971 paper that although low birth weight is associated with an increase in the number of babies who die shortly after birth, the babies of smokers tended to have much lower death rates than the babies of nonsmokers.*
- In this study, he compared perinatal death rates by grouping birth weights
- In 1986 and 1993, Wilcox & Russell and Wilcox (respectively) strongly recommended that babies be grouped by their relative (or standardized) birth weight rather than by their absolute weight (in kg)
- What happened then?

[Table from Yerushalmy J. (1971)**; weights measured in grams]

* and ** taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications
Example (contd.)
Graphs taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications
Further to deviations
- Variance = $\sigma^2$ is the sum of squared deviations from the mean divided by n, or the expected value of the squared deviation of $X$ from its mean
- Expected values of higher powers of deviations from the mean give additional information about the distribution of the data
- The expected value of any power of the deviations of a variable $X$ from its mean (say the $r$-th power) is called the $r$-th central moment of that variable:

$$\mu_r = E\left[(X - E(X))^r\right] = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$$

- Central moments depict the spread and shape of the data
- Variance is the 2nd central moment
- Measures using the 3rd and 4th central moments are useful for understanding the shape of the distribution
Skewness
- Skewness is a measure of symmetry (or the lack of it) in a dataset
- A distribution is right-skewed or positively skewed if it stretches asymmetrically to the right
- It is left-skewed or negatively skewed if the asymmetric stretch is on the left
- Measuring skewness using moments, where $\mu_r$ denotes the $r$-th central moment:

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3}$$

- If a distribution is perfectly symmetric, $\mu_3 = 0$
- The sign of the coefficient = the sign of $\mu_3$
- A coefficient of skewness close to zero indicates a nearly symmetric distribution

Visuals from Aczel A., Sounderpandian J. Complete Business Statistics
Kurtosis
- Kurtosis is a measure of the peakedness of a distribution
- The reference value for kurtosis is 3, the kurtosis of the normal distribution; such a curve is called mesokurtic
- A value larger than 3 indicates a distribution that is more peaked with longer (heavier) tails; such a curve is called leptokurtic
- A value smaller than 3 indicates a flatter curve with shorter (lighter) tails; such a curve is called platykurtic
- Measuring kurtosis using moments, where $\mu_r$ denotes the $r$-th central moment:

$$\beta_2 = \frac{\mu_4}{\mu_2^{2}} = \frac{\mu_4}{\sigma^4}$$

In the figure:
- The red line represents a frequency curve of a long-tailed distribution
- The blue line represents a frequency curve of a short-tailed distribution
- The black line is the standard bell curve

Visual from http://whatilearned.wikia.com/wiki/File:Kurtosis.jpg
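The moment-based coefficients can be sketched in a few lines of Python. The helper names and both datasets are illustrative assumptions; the formulas are the ratios of central moments, with each moment computed by dividing by n. The kurtosis here is the raw ratio (reference value 3), not "excess" kurtosis.

```python
def central_moment(values, r):
    """r-th central moment: average of (x - mean)**r, dividing by n."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** r for x in values) / n

def skewness(values):
    """Third central moment divided by the 3/2 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by the square of the second."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical datasets: one symmetric about 4, one with a long right tail.
symmetric = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]
right_skewed = [1, 1, 1, 2, 2, 3, 10]

skew_symmetric = skewness(symmetric)
kurt_symmetric = kurtosis(symmetric)
skew_right = skewness(right_skewed)

print(skew_symmetric, kurt_symmetric, skew_right)
```

The symmetric dataset has skewness 0 and kurtosis 1.625 (below 3, so flatter than the bell curve), while the single large value in the second dataset yields a positive skewness.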
Example
Table of the gender-wise skewness and kurtosis of weights:

            Female    Male    Entire dataset
Skewness    1.14      0.29    0.40
Kurtosis    5.59      3.15    2.65
Example (contd.)
- Skewness and kurtosis capture numerically the information presented in a histogram
- The histogram of the weights of females is highly stretched to the right, leading to a high positive skewness of 1.14
- The stretch of the histogram for the entire dataset is moderate and much smaller than that for females; this is reflected in the lower skewness of 0.40
- The weights of males are stretched almost equally on both sides of the centre, giving a skewness close to zero (0.29)
- Skewness and kurtosis shed light on important characteristics such as symmetry and peakedness
- They give additional information about the distribution of the data, beyond the measures of central tendency and dispersion
Point summary
A very useful and practical application of measures of central tendency and dispersion.

5-point summary: Minimum, 1st quartile, Median, 3rd quartile, Maximum

6-point summary: Minimum, 1st quartile, Median, Mean, 3rd quartile, Maximum

Gives an idea of the extreme values, the values within which the middle 50% of the observations lie, and the centrality of the data.

6-point summary of Weights in the body measurement data:

Min.    1st Qu.    Median    Mean     3rd Qu.    Max.
42      58.4       68.2      69.15    78.85      116.4
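A 6-point summary can be assembled from standard-library functions. In this sketch the function name `six_point_summary` and the sample data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions may give slightly different quartiles.

```python
from statistics import mean, median, quantiles

def six_point_summary(values):
    """Minimum, 1st quartile, median, mean, 3rd quartile and maximum."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical dataset, not the weight data.
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7]
summary = six_point_summary(data)
print(summary)
```

Dropping the `"mean"` entry gives the 5-point summary; the middle 50% of the values lie between `q1` and `q3`.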
R code for each measure (x stands for the variable name):

Measure                                  R code
Minimum                                  min(x)
Maximum                                  max(x)
Range                                    range(x) returns (min, max); diff(range(x)) gives the difference
Inter-quartile range                     IQR(x)
Mean absolute deviation about mean       mean(abs(x - mean(x)))
Mean absolute deviation about median     mean(abs(x - median(x)))
Median absolute deviation about median   median(abs(x - median(x)))
Variance                                 var(x)
Standard deviation                       sd(x)
Coefficient of range                     (max(x) - min(x)) / (max(x) + min(x))
Coefficient of variation                 library(raster); cv(x)
Standardization of a variable            function(x) {(x - mean(x)) / sqrt(var(x))}
Skewness and Kurtosis                    library(moments); skewness(x); kurtosis(x)
6-point summary                          summary(x)

Note: var(x) and sd(x) in R divide by n - 1 rather than n.
Thank you
