Learning goals
- To understand the need for studying dispersion
- To understand the idea behind measures of dispersion
- To study different measures of dispersion
- Additional topics
  - Standardization of a variable
  - Skewness and Kurtosis
  - Five-point summary
Applied Statistics and Computing Lab
Examples
- Variability in temperature through the week
- Scatter of the horsepower capacities, within the cars available
- Spread of the prices at which varieties of a single product (say rice varieties) are available
- Variability in returns on investments
Desired properties
- A good measure should not get highly affected if the data changes slightly
- A good measure should be representative of the majority of the data
- A good measure should allow us to declare an interval within which most of the values lie, with a certain degree of confidence
Dataset
- Body measurements on 507 individuals
- 247 men and 260 women
- Primarily in 20s and 30s, with some exceptions
- All individuals exercise several hours a week
- From the 28 total variables present in this dataset, we consider the variables Gender (1=Male, 0=Female) and Weight (in Kg.)
Data source: Measurements collected by authors Grete Heinz and Louis J. Peterson for their study
Dataset (contd.)
                          Female    Male     Overall
Min. weight (in Kg.)      42        53.9     42
Max. weight (in Kg.)      105.2     116.4    116.4
Mean weight (in Kg.)      60.6      78.14    69.15
Median weight (in Kg.)    59        77.3     68.2
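Evaluating dispersion
- Consider the extremes and positional values (Measures based on only a few of the values: Min-max, range, inter-quartile range)
- Consider distance from a central tendency (Measures based on all the values)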
Min-max
ADVANTAGES:
- Useful when a range of tolerance exists, i.e. if values beyond a certain limit are harmful or unacceptable
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Range
ADVANTAGES:
- Easy comparison of variability across datasets
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Inter-quartile range
ADVANTAGES:
- Highlights the middle portion of the distribution of values
- Easy to understand
DISADVANTAGES:
- More difficult to compute than Min-max and range
- Ignores irregularities on the extremes
- Ignores 25% data on each side
                                   Female
(Min. weight, Max. weight)         (42, 105.2)
Weight range                       63.2
Weight inter-quartile range        11.1
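The three positional measures above can be sketched in R as follows. This is a minimal illustration, not the computation behind the slide's table: the vector `w` is hypothetical data invented for the example, and `IQR()` is used with R's default quantile method, which may differ slightly from other quartile conventions.

```r
# Hypothetical weights in Kg., invented purely for illustration
# (not taken from the body-measurement dataset).
w <- c(52, 55, 58, 60, 61, 63, 67, 72, 80)

w_min   <- min(w)          # smallest value
w_max   <- max(w)          # largest value
w_range <- w_max - w_min   # range as a single number
w_iqr   <- IQR(w)          # 3rd quartile minus 1st quartile (default quantile type)

print(c(min = w_min, max = w_max, range = w_range, IQR = w_iqr))
```

Note that `range(w)` in R returns the pair (minimum, maximum); `diff(range(w))` gives the same single number as `w_range`.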
Evaluating dispersion
Consider distance from a central tendency (Measures based on all the values)
Signed deviations from the mean cancel each other out and always sum to zero, so we take absolute values or squares of the deviations and consider only their magnitudes
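To see why signed deviations cannot simply be averaged, a short derivation helps. The notation is an assumption introduced here: $x_1, \dots, x_n$ are the observations and $\bar{x}$ is their mean.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Whatever the data, the average signed deviation from the mean is therefore zero, which is why the measures that follow use $|x_i - \bar{x}|$ or $(x_i - \bar{x})^2$ instead.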
Absolute deviations
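For a dataset consisting of n observations $x_1, x_2, \dots, x_n$ with mean $\bar{x}$ and median $\tilde{x}$:
Absolute deviations: $|x_i - \bar{x}|$ or $|x_i - \tilde{x}|$, for $i = 1, \dots, n$
Mean absolute deviation from mean = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Mean absolute deviation from median = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \tilde{x}|$
Median absolute deviation from median = $\operatorname{median}_{i} |x_i - \tilde{x}|$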
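Absolute deviations of the weights (table columns: Female weights, Male weights; only one column of values survives in this copy):
Mean absolute deviation from mean        8.58
Mean absolute deviation from median      8.57
Median absolute deviation from median    7.2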
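The three absolute-deviation measures can be sketched in R as below. The vector `x` is hypothetical data chosen for the example, and the variable names are illustrative only. One point worth knowing: R's built-in `mad()` multiplies the median absolute deviation by a scaling constant by default, so `constant = 1` is needed to obtain the unscaled value.

```r
# Hypothetical data, for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 11)

# Mean absolute deviation from the mean
mad_from_mean <- mean(abs(x - mean(x)))

# Mean absolute deviation from the median
mad_from_median <- mean(abs(x - median(x)))

# Median absolute deviation from the median
medad_from_median <- median(abs(x - median(x)))

# The same unscaled quantity using the built-in function
medad_builtin <- mad(x, constant = 1)

print(c(mad_from_mean, mad_from_median, medad_from_median, medad_builtin))
```

In this example the mean absolute deviation taken about the median is no larger than the one taken about the mean, which holds for any dataset.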
In order to look at a measure that has the same unit of measurement as the original data, we can take the square root of the variance:
Standard deviation = $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Variance of Weight, females = 92.46
Variance of Weight, males = 110.52
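A small R sketch of variance and standard deviation follows, using hypothetical data. It also illustrates a practical detail: R's `var()` and `sd()` divide by n - 1 rather than n, so they differ slightly from the divide-by-n definition. Which divisor produced the figures quoted above is not stated in the slides.

```r
# Hypothetical data, for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Variance with divisor n (mean of squared deviations from the mean)
var_n <- sum((x - mean(x))^2) / n
sd_n  <- sqrt(var_n)

# R's built-in functions use divisor n - 1
var_r <- var(x)
sd_r  <- sd(x)

print(c(var_n = var_n, var_r = var_r, sd_n = sd_n, sd_r = sd_r))
```

The two versions are related by `var_n == var_r * (n - 1) / n`, so the difference becomes negligible for a dataset of several hundred observations.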
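Coefficient of range: $\frac{\max - \min}{\max + \min}$
- Always lies in the interval [0, 1] for non-negative data
- The higher the coefficient, the broader the range
Coefficient of range, Weight, females = 0.43
Coefficient of range, Weight, males = 0.37

Coefficient of variation: $\frac{\sigma}{\bar{x}} \times 100$
- Computes the variability per unit mean
- Indicates how consistent the data are with respect to their mean
- The higher the coefficient, the more spread out the observations
Coefficient of variation, Weight, males = 13.45
Coefficient of variation, Weight, females = 15.87
The weights of females are more spread out than those of males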
- Coefficients are free of units and therefore facilitate comparison
- Useful even when two variables are measured in two different units
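The reported coefficients can be checked by hand from the summary figures given earlier in the deck (minimum, maximum, mean and variance of weight for each gender). The subscripts $F$ and $M$ for females and males are labels introduced here for the check.

```latex
\text{Coefficient of range}_F = \frac{105.2 - 42}{105.2 + 42} = \frac{63.2}{147.2} \approx 0.43
\qquad
\text{Coefficient of range}_M = \frac{116.4 - 53.9}{116.4 + 53.9} = \frac{62.5}{170.3} \approx 0.37

\text{CV}_F = \frac{\sqrt{92.46}}{60.6} \times 100 \approx \frac{9.616}{60.6} \times 100 \approx 15.87
\qquad
\text{CV}_M = \frac{\sqrt{110.52}}{78.14} \times 100 \approx \frac{10.513}{78.14} \times 100 \approx 13.45
```

Although the male weights have the larger variance, the female weights have the larger coefficient of variation because their mean is smaller.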
Standardization
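Standardized variable of $X$: $Z = \frac{X - \bar{X}}{\sigma_X}$, where $\bar{X}$ is the mean and $\sigma_X$ the standard deviation of $X$
- Mean of standardized variable = 0
- Variance of standardized variable = 1
- Standardized variables are free of units
- Therefore measures of variation of standardized variables are comparable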
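Standardization can be sketched in R as below. The function name `standardize` and the data are hypothetical; the sketch uses `sd()`, which divides by n - 1, so the variance of the result is exactly 1 under that same convention.

```r
# Hypothetical data, for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 11)

# Subtract the mean and divide by the standard deviation
standardize <- function(v) (v - mean(v)) / sd(v)

z <- standardize(x)

print(mean(z))  # zero up to floating-point rounding
print(var(z))   # one up to floating-point rounding
```

Base R's `scale(x)` performs the same centring and scaling, returning a one-column matrix rather than a plain vector.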
Example
How is the weight of a new-born affected by whether the mother smokes or not? Further, does smoking affect the perinatal mortality rate, which varies across birth weights?
- Yerushalmy J. found in his 1971 paper that although low birth weight is associated with an increase in the number of babies who die shortly after birth, the babies of smokers tended to have much lower death rates than the babies of nonsmokers.*
- In this study, he compared perinatal death rates after grouping the babies by birth weight
- In 1986 and 1993, Wilcox & Russell and Wilcox (respectively) strongly recommended that the babies should be grouped based on their relative (or standardized) birth weight, rather than on their absolute weights (in Kg.)
- What happened then?
Table in Yerushalmy J. (1971)** (Weights measured in grams)
* and ** taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications
Example (contd.)
Graphs taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications
Further to deviations
- Variance = $\sigma^2$ is the sum of squares of deviations from the mean divided by n, or the expected value of the squared deviation of X from its mean
- Expected values of higher powers of the deviations from the mean give additional information about the distribution of the data
- The expected value of any power of the deviations from the mean of a variable X (say power $r$) is called the $r$-th central moment of that variable:
  $\mu_r(X) = E\left[(X - E(X))^r\right] = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$
- Central moments depict the spread and shape of data
- Variance is the 2nd central moment
- Measures using the 3rd and 4th central moments are useful to understand the shape of the distribution
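The definition of a central moment translates almost directly into R. In this sketch the function name `central_moment` and the data are hypothetical, and the divisor is n, matching the definition of variance given on this slide.

```r
# r-th central moment: the mean of the r-th powers of deviations from the mean
central_moment <- function(x, r) mean((x - mean(x))^r)

# Hypothetical data, for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 11)

m1 <- central_moment(x, 1)  # first central moment, always zero
m2 <- central_moment(x, 2)  # second central moment, the divide-by-n variance
m3 <- central_moment(x, 3)  # third central moment, used for skewness
m4 <- central_moment(x, 4)  # fourth central moment, used for kurtosis

print(c(m1 = m1, m2 = m2, m3 = m3, m4 = m4))
```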
Skewness
- Skewness is a measure of symmetry (or the lack of it) in a dataset
- A distribution is right-skewed or positively skewed if it stretches asymmetrically to the right
- It is left-skewed or negatively skewed if the asymmetric stretch is on the left
Measuring skewness using moments:
Coefficient of skewness = $\frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3}$, where $\mu_r$ is the $r$-th central moment and $\sigma^2 = \mu_2$
- Important to note that if a distribution is perfectly symmetric, $\mu_3 = 0$
- The sign of the coefficient = the sign of $\mu_3$
- A coefficient of skewness closer to zero indicates a more nearly symmetric distribution
Visuals from Aczel A., Sounderpandian J., Complete Business Statistics
Kurtosis
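- Kurtosis is a measure of peakedness of a dataset
- The reference value for kurtosis is 3, the kurtosis of the normal distribution; such a curve is called the Mesokurtic curve
- A value larger than 3 indicates a distribution that is more sharply peaked with longer (heavier) tails; such a curve is called the Leptokurtic curve
- A value smaller than 3 indicates a flatter distribution with shorter (lighter) tails; such a curve is called the Platykurtic curve
Measuring kurtosis using moments:
Coefficient of kurtosis = $\frac{\mu_4}{\mu_2^{2}} = \frac{\mu_4}{\sigma^4}$, where $\mu_r$ is the $r$-th central moment and $\sigma^2 = \mu_2$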
Figure: frequency curves with different kurtosis
- The red line represents a frequency curve of a long tailed distribution
- The blue line represents a frequency curve of a short tailed distribution
- The black line is the standard bell curve
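The moment-based coefficients of skewness and kurtosis can be sketched in base R as follows. The function names and both data vectors are hypothetical, chosen only to show a symmetric case and a right-stretched case. The kurtosis here is the raw ratio whose reference value is 3, not the "excess" version that subtracts 3.

```r
# r-th central moment with divisor n
central_moment <- function(x, r) mean((x - mean(x))^r)

# Third central moment scaled by the 3/2 power of the second
skewness_coef <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# Fourth central moment scaled by the square of the second
kurtosis_coef <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Hypothetical data, for illustration only.
symmetric_x <- c(1, 2, 3, 4, 5)      # symmetric about 3
right_x     <- c(1, 1, 2, 2, 3, 10)  # one large value stretches the right side

print(skewness_coef(symmetric_x))  # zero up to rounding, since the data are symmetric
print(skewness_coef(right_x))      # positive, since the stretch is on the right
print(kurtosis_coef(symmetric_x))
print(kurtosis_coef(right_x))
```

The `skewness()` and `kurtosis()` functions of the `moments` package, listed on the R-code slide, are understood to follow the same moment-based definitions, though that is worth confirming against the package documentation.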
Example
Table of the gender-wise skewness and kurtosis of weights:
            Female    Male    Entire dataset
Skewness    1.14      0.29    0.40
Kurtosis    5.59      3.15    2.65
Example (contd.)
- We see that skewness and kurtosis capture, in numeric form, the information presented in a histogram
- The histogram of weights of females is highly stretched on the right, leading to a positive and high skewness measure of 1.14
- The stretch of the histogram for weights of the entire dataset is moderate and much less than that for weights of females. This is reflected in the lower skewness of 0.40
- The weights of males are stretched almost equally on both sides of the centre, giving a skewness measure of 0.29, which is close to zero
- Skewness and Kurtosis shed light on important characteristics such as symmetry and peakedness
- They give additional information about the distribution of the data, beyond the measures of central tendency and measures of dispersion
Point summary
A very practical use of the measures of central tendency and dispersion
5-point summary
- Minimum, 1st quartile, Median, 3rd quartile, Maximum
6-point summary
- Minimum, 1st quartile, Median, Mean, 3rd quartile, Maximum
Gives an idea about the extreme values, the values within which the middle 50% of the values lie and also the centrality of the data
6-point summary of Weights (in Kg.) in the body measurement data:
Min.    1st Qu.    Median    Mean     3rd Qu.    Max.
42      58.4       68.2      69.15    78.85      116.4
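Both summaries are available in base R. The sketch below uses a hypothetical vector rather than the body-measurement data; `quantile()` is called with its default method, and other quartile conventions (for example the hinges returned by `fivenum()`) can give slightly different quartiles on some datasets.

```r
# Hypothetical weights in Kg., invented purely for illustration.
w <- c(52, 55, 58, 60, 61, 63, 67, 72, 80)

# 5-point summary: minimum, quartiles and maximum
five_point <- quantile(w, probs = c(0, 0.25, 0.5, 0.75, 1))

# 6-point summary: the same five values plus the mean
six_point <- summary(w)

print(five_point)
print(six_point)
```

The difference between the 3rd and 1st quartile in either summary is the inter-quartile range, so the summary also conveys the spread of the middle 50% of the values.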
Measure and R-code (x stands for the variable name)
Measure                                   R-code
Minimum                                   min(x)
Maximum                                   max(x)
Range                                     diff(range(x))   [range(x) alone returns the minimum and maximum]
Inter-quartile range                      IQR(x)
Mean absolute deviation about mean        mean(abs(x - mean(x)))
Mean absolute deviation about median      mean(abs(x - median(x)))
Median absolute deviation about median    median(abs(x - median(x)))
Variance                                  var(x)
Standard deviation                        sd(x)
Coefficient of range                      (max(x) - min(x)) / (max(x) + min(x))
Coefficient of variation                  library(raster); cv(x)
Standardization of a variable             function(x) {(x - mean(x)) / sqrt(var(x))}
Skewness and Kurtosis                     library(moments); skewness(x); kurtosis(x)
6-point summary                           summary(x)
Thank you