
Descriptive Statistics

Sept 2015
Associate Professor Dr Sanjay Rampal
MBBS, MPH, PhD, CPH (US NBPHE), AMM
Faculty of Medicine, University of Malaya
srampal@ummc.edu.my / rampal.s@gmail.com

CONTENTS
Measures of central tendency
Mean, Median, & Mode
Variability and Measures of Dispersion
1) Range
2) Interquartile range
3) Variance
4) Standard deviation
5) Coefficient of variation
Other measures of location
Normal Distribution & skewness

Sanjay Rampal

Summarizing Data 1

Measures of central tendency


Central tendency is an estimate of the centre
of a distribution of values.
There are three major types of estimates of
central tendency
A) Mean
B) Median
C) Mode

Mean

(1)

The average value: the sum of all observed values
divided by the total number of observations
Mean, $\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
$\sum$ = (sigma) (means add)
$x_i$ = observed values
$n$ = total number of observations


Mean (arithmetic)

(2)

Used when the numbers can be added
(characteristics are measured on a numerical
scale)
Should not be used with qualitative data
Should not be used with ordinal data because of the
arbitrary nature of an ordinal scale
Can be estimated from a frequency table.
A weighted average estimate of the mean is
formed by multiplying each data value by its number of
observations, adding the products, and dividing the
sum by the total number of observations

Mean (arithmetic)

(2)

1, 3, 5, 7, 7, 8, 8, 9
n = 8
$\sum x_i = 1+3+5+7+7+8+8+9 = 48$
$\bar{x} = \frac{\sum x_i}{n} = \frac{48}{8} = 6$
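The same calculation can be sketched in Python. This is only an illustration: the frequency table `freq` is written out here from the eight values above to show the weighted-average form, and the variable names are assumptions of this sketch.

```python
from statistics import mean

values = [1, 3, 5, 7, 7, 8, 8, 9]

# Direct definition: sum of observed values divided by n.
arithmetic_mean = sum(values) / len(values)

# Frequency-table form of the same data: value -> number of observations.
freq = {1: 1, 3: 1, 5: 1, 7: 2, 8: 2, 9: 1}
weighted_mean = sum(v * f for v, f in freq.items()) / sum(freq.values())

print(arithmetic_mean, weighted_mean, mean(values))
```

All three results agree, because the frequency table holds exactly the same observations.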


Mean: Advantages

It is familiar to most people


It reflects the inclusion of every item in the data set
It utilizes all values
It always exists
It is unique
It is easily used with other statistical measurements
The mean is the center of gravity of the data and is easy
to understand and to calculate
It is most representative when the distribution is symmetrical
Important for statistical analyses and their applications

Mean: Disadvantages
It can be affected by extreme values in the
data set, called outliers, and therefore be
biased
Loss of accuracy when the distribution is
skewed
Including or excluding a single data value will
change the mean
More tedious to calculate by hand


Other types of means

Geometric mean
Harmonic mean
Generalized means
Weighted arithmetic mean
Truncated mean
Inter-quartile mean

Mean (Geometric)
It is an average that is useful for sets of
numbers that are interpreted according to
their product and not their sum (as is the case
with the arithmetic mean), e.g. disease rates


The geometric mean is useful to determine average factors
E.g. if the incidence rate of a disease increased by 10% in
Y1, 20% in Y2 and decreased 15% in Y3
The geometric mean of the yearly factors 1.10, 1.20 and 0.85
= $(1.10 \times 1.20 \times 0.85)^{1/3} \approx 1.039$
Conclusion: the incidence rate increased by about 3.9 percent per
year, on average
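As a rough check of this example, the geometric mean can be computed both from its definition and with the standard library. The sketch assumes Python 3.8 or later, where `statistics.geometric_mean` is available; the variable names are illustrative.

```python
from statistics import geometric_mean

factors = [1.10, 1.20, 0.85]   # yearly multipliers for Y1, Y2, Y3

# By definition: n-th root of the product of the n factors.
product = 1.0
for f in factors:
    product *= f
by_hand = product ** (1 / len(factors))

# Standard-library equivalent.
gm = geometric_mean(factors)

print(round(by_hand, 3), round(gm, 3))
```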

Arithmetic Vs Geometric Mean


Arithmetic mean is relevant any time several quantities add
together to produce a total
The arithmetic mean answers the question, "if all the
quantities had the same value, what would that value have
to be in order to achieve the same total?"
Geometric mean is relevant any time several quantities
multiply together to produce a product
The geometric mean answers the question, "if all the
quantities had the same value, what would that value have
to be in order to achieve the same product?"


Suppose there is a disease rate which increases by
10% in 2004, 50% in 2005, and 30% in 2006. What is
its average annual increase in disease incidence?
It is not the arithmetic mean, because what these
numbers signify is that in 2004 the disease
incidence was multiplied by (not added to) 1.10,
in 2005 it was multiplied by 1.50, and in 2006 it was
multiplied by 1.30
The relevant quantity is the geometric mean of these
three numbers, which is about 1.28966, or about a 29%
average annual increase in disease rates

It is important to know whether the arithmetic mean or
the geometric mean should be used
When averaging ratios, use the geometric mean
Consider two extremes.
If one experiment yields a ratio of 10,000
and the next yields a ratio of 0.0001, an arithmetic
mean would misleadingly report that the average
ratio was near 5000. Taking a geometric mean will
more honestly represent the fact that the average
ratio was 1.
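A short sketch makes the contrast concrete for the two examples above (the 2004 to 2006 growth factors and the pair of extreme ratios). It assumes Python 3.8 or later for `statistics.geometric_mean`; the variable names are illustrative.

```python
from statistics import mean, geometric_mean

# Growth factors for 2004, 2005 and 2006.
yearly_factors = [1.10, 1.50, 1.30]
average_factor = geometric_mean(yearly_factors)   # about 1.28966

# Two extreme ratios whose product is exactly 1.
ratios = [10_000, 0.0001]
arithmetic = mean(ratios)            # dominated by the large ratio
geometric = geometric_mean(ratios)   # balances the two ratios

print(round(average_factor, 5), arithmetic, round(geometric, 6))
```

The arithmetic mean of the ratios lands near 5000, while the geometric mean stays at 1 up to floating-point rounding.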


Truncated Means
This is a useful measure of central tendency in the
presence of extreme values or outliers
The observations in the dataset are truncated:
observations on either side comprising n% are
discarded and the mean is calculated on the rest, where n
ranges from 5% to 50%
90% truncated mean: 5% of the observations at each
extreme are discarded

Interquartile mean

A type of truncated mean
When the distribution is skewed or in the presence of
extreme values, an alternative measure of central
tendency is the interquartile mean
25% of the observations at either end of the
distribution are discarded,
leaving the middle 50% (Q1 to Q3);
then an arithmetic mean is calculated on the remaining
group of observations.
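A minimal sketch of a truncated mean follows. The function name `truncated_mean`, its `proportion_each_side` parameter and the data set are hypothetical, chosen only to illustrate the idea; the number of observations dropped on each side is rounded down, which is one of several possible conventions.

```python
def truncated_mean(values, proportion_each_side):
    """Mean after discarding the given proportion of observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion_each_side)   # observations dropped per side
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion_each_side leaves no observations")
    return sum(kept) / len(kept)

# Hypothetical data: 500 is an extreme value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 500]

plain_mean = sum(data) / len(data)
interquartile_mean = truncated_mean(data, 0.25)   # drops the 3 lowest and 3 highest

print(round(plain_mean, 2), interquartile_mean)
```

The plain mean is pulled far above the bulk of the data by the single extreme value, while the interquartile mean uses only the middle six observations.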


Median

(1)

The median is the middle observation point (50th percentile)
It is the point at which half of the observations are
smaller and half are larger
The median, like the mean, may also be estimated
from a frequency table

Median

(1)

Calculate the median by:


1) Arranging the observations from smallest to
largest
2) Find the middle value
e.g.

9, 7, 6, 5, 3, 1, 1


Median

(2)

Odd Number of Measurements (n=odd value)


The median is the value of the middle-most
observation in ascending order.
x = [1, 2, 3, 4, 5, 6, 7]
n = 7
median = 4 (4th observation)

Median

(3)

Even Number of Measurements (n=even value)


The median is the average value of the two
middle-most observations in ascending order.
x = [1, 2, 3, 4, 5, 6, 7, 8]
n = 8
median = (4+5)/2 = 4.5


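If there is an odd number of observations, the median is the
$\frac{n+1}{2}$-th observation
Or
If there is an even number of observations, the median is the
average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th observations:
$\text{median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$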
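The position rules for odd and even n can be sketched as follows. The function name `median_by_position` is an assumption of this sketch; positions are 1-based in the text, so the code subtracts 1 when indexing.

```python
from statistics import median

def median_by_position(values):
    """Median using the (n+1)/2 rule for odd n and the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]      # the (n+1)/2-th observation
    lower = ordered[n // 2 - 1]               # the (n/2)-th observation
    upper = ordered[n // 2]                   # the (n/2 + 1)-th observation
    return (lower + upper) / 2

print(median_by_position([1, 2, 3, 4, 5, 6, 7]))       # odd n
print(median_by_position([1, 2, 3, 4, 5, 6, 7, 8]))    # even n
```

The standard-library `statistics.median` follows the same rule, so it can be used to cross-check the results.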

Median: Advantages
Fairly easy to calculate and always exists
Relatively easy to interpret: half of the sample
(normally) lies above and half below the median
Is not affected by extreme data values
Used when the distribution of data is skewed
Does not use the values of the observations, only their
ranks
Can be used with ordinal observations because the
calculation does not use the actual values of the
observations
Does not need a complete data set to calculate the
rank


Median: Disadvantages
Manually tedious to find for a large sample which is
not in order (Requires ordering)
Does not utilize all data values

Mode

(1)

The mode of a set of observations is the specific
value that occurs with the greatest frequency
There may be more than one mode in a set of
observations, if there are several values that all
occur with the greatest frequency
A mode may also not exist; this is true if all the
observations occur with the same frequency


Mode

(2)

Arrange the numbers in order by size


Determine the number of instances of each
numerical value
The numerical value that has the most instances
is the mode
E.g.
What is the mode for the following data?
2, 4, 5, 5, 5, 7, 8, 8, 9, 12
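The counting procedure can be sketched with `collections.Counter`. The function name `modes` is hypothetical; it returns every value tied for the greatest frequency, and an empty list when all values occur equally often, following the convention stated earlier that a mode may not exist in that case.

```python
from collections import Counter

def modes(values):
    """Values with the greatest frequency; empty if all values are equally frequent."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # no value stands out: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 5, 5, 5, 7, 8, 8, 9, 12]))
print(modes([1, 1, 2, 2, 3]))   # two modes: bimodal
print(modes([1, 2, 3]))         # no mode
```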

Mode

(3)

When a set of data has two modes, it is called
bimodal
What diseases have bimodal distributions?
For a frequency table or a small number of
observations, the mode is sometimes
estimated by the modal class, which is the class having
the largest number of observations


Mode

(4)

Advantages
Quick and easy to calculate
Unaffected by extreme values

Disadvantages
May not be representative of the whole
sample as it does not use all values
Seldom gives statistical significance

1, 2, 3, 3, 4, 5
Mean ?
Median ?
Mode ?


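Mean, Median, Mode

[Figure: frequency histogram of the values 1 to 9]

Mean = $\frac{1(1) + 2(2) + \dots + 8(2) + 9(1)}{25} = 5$
Median = 5, the middle (13th) of the 25 ordered values
1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 7 7 7 8 8 9
Mode = 5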

Using central tendency

(1)

The choice of measure will depend on the following factors:
1) Scale of measurement
2) Shape of the distribution of observations
The mean is used for numerical data and symmetric (not
skewed) distributions
The median is used for ordinal data or for numerical data if
the distribution is skewed
The mode is used primarily for bimodal distributions
The geometric mean is used primarily for observations
measured on a logarithmic scale


Using central tendency

(2)

If the outlying values are small, the distribution is
skewed to the left (negatively skewed)
If the outlying values are large, the distribution is
skewed to the right (positively skewed)

Mean=median (symmetrical)
Mean>median (distribution skewed to right)
Mean<median (distribution skewed to left)



Variability / Dispersion
Dispersion is the variability of observed values around the measures of
central tendency
Data values in a sample are not all the same; the variation
between values is called dispersion
When the dispersion is large, the values are widely
scattered; when it is small they are tightly clustered
The width of diagrams such as dot plots, box plots, stem
and leaf plots is greater for samples with more dispersion
and vice versa


How spread out are the values?


a) All values the same = no variability
b) Small difference among values = small
variability
c) Big difference between values = large
variability

Variability of a sample selected


from a population


Population distributions of height & weight

Measures of Dispersion
1) Range
2) Interquartile range
3) Variance
4) Standard deviation
5) Coefficient of variation


Range

(1)

The difference between the highest and the
lowest values in a set of data
Range = Max. value - Min. value
The range is affected by the furthest outliers at either
end of the distribution
The range is of limited use as a measure of
dispersion, because it reflects information only about the
extreme values

Range

(2)

E.g.
0 1 2 3 4 5 6
Range ?
0 1 2 3 4 5 6 51
Range ?


Interquartile range
More on this in later slides

Measuring dispersion
Real difference: $x_i - m(X)$
Absolute difference: $|x_i - m(X)|$
Mean absolute difference: $\frac{1}{n}\sum_{i=1}^{n} |x_i - m(X)|$
where m(X) is a measure of central tendency (mean, median, or mode)
Note:
The sample mean absolute deviation is a biased estimator
of the population mean absolute deviation
The sample median absolute deviation is an unbiased
estimator of the population median absolute deviation


Deviation
Deviation: distance and direction from the mean
Deviation = value - mean
E.g.
Mean = 52
Scores = 45, 53, 50, 60
Deviation scores = -7, 1, -2, 8 (respectively)

Note: a deviation tells you how far a value lies from the mean,
and whether it is above or below the mean

Variance

(1)

The variance is a measure of how spread out a
distribution is
The average of the squared deviations of the data
points from the mean
Variance = $(\text{s.d.})^2 = \sigma^2 = \frac{\sum (X - \mu)^2}{N}$
E.g. for the numbers 1, 2, and 3, the mean is 2 and
the variance is:
$\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = 0.667$


Variance

(2)

The formula for the variance in a population is
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
where $\mu$ = mean and N = number of observations /
scores
The formula for the variance in a sample is
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
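The difference between the two denominators can be seen in a small sketch using the numbers 1, 2 and 3 from the earlier example. The function names are illustrative; `statistics.pvariance` and `statistics.variance` are the standard-library equivalents.

```python
from statistics import pvariance, variance

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [1, 2, 3]
print(round(population_variance(data), 3))   # 0.667
print(sample_variance(data))                 # 1.0
```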

Standard deviation

(1)

The SD is the most commonly used measure of dispersion
with medical and health data
Measure of the spread of data about their mean
(very important for statistical inference)
Numerically, the standard deviation is the square root
of the variance
Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

Standard deviation

(2)

Measure of the spread of data about their mean
(describes how observations cluster around the
mean and is very important in statistical inference)
Roughly, the average distance between each
score/datapoint and the mean
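As an illustration, the standard deviation of the four scores from the deviation example (45, 53, 50, 60, with mean 52) can be computed step by step. This is a sketch with illustrative variable names; `statistics.pstdev` and `statistics.stdev` give the same population and sample values.

```python
import math
from statistics import pstdev, stdev

scores = [45, 53, 50, 60]
mean_score = sum(scores) / len(scores)                    # 52.0

# Squared deviations from the mean: 49, 1, 4, 64.
squared_devs = [(x - mean_score) ** 2 for x in scores]

population_sd = math.sqrt(sum(squared_devs) / len(scores))         # divide by N
sample_sd = math.sqrt(sum(squared_devs) / (len(scores) - 1))       # divide by n - 1

print(round(population_sd, 2), round(sample_sd, 2))
```

The sample SD is slightly larger than the population SD because it divides by n - 1 rather than N.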

Standard deviation

(3)

To calculate the SD of a population it is first
necessary to calculate that population's
variance
Numerically, the standard deviation is the
square root of the variance
Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

Standard Deviation and Descriptive


Statistics
Remember the goal of descriptive statistics is
to summarize and describe a set of data
When you are given mean and standard
deviation, you should be able to visualize the
distribution
E.g. Mean= SD=4, tells you that the majority
of the values are within 4 points of the mean
These values are concrete and meaningful

Mean and Standard deviation


Coefficient of variation

(1)

Useful when comparing the variation of two or
more quantitative data sets that are on different
scales or units

An extension of the SD concept


A measure of relative dispersion
Adjusts the scales/units to be comparable

Coefficient of variation

(2)

An attribute of a distribution: its standard
deviation divided by its mean
$CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$
It generally expresses the standard
deviation as a percentage of the sample
mean


Coefficient of variation

(3)

Useful measure of relative spread in data
E.g.
Mean blood glucose (mg/dl) = 152.1, SD = 54.7
Mean serum cholesterol = 217.0, SD = 38.8
CV blood glucose = 54.7/152.1 × 100 ≈ 36%
CV serum cholesterol = 38.8/217.0 × 100 ≈ 18%
Variation in blood glucose > serum cholesterol
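The comparison can be reproduced with a short sketch. The helper name `coefficient_of_variation` is hypothetical; the means and SDs are the ones quoted in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_glucose = coefficient_of_variation(54.7, 152.1)
cv_cholesterol = coefficient_of_variation(38.8, 217.0)

print(round(cv_glucose), round(cv_cholesterol))   # 36 18
```

Although the two variables are on different scales, the CVs are unit-free percentages and can be compared directly.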


Other measures of location

Quantiles
Box plot
Scatter plot

Quantiles
Quantiles are a set of 'cut points' that divide
a sample of data into groups containing (as
far as possible) equal numbers of
observations
E.g. quantiles include:
quartiles, quintiles, deciles, percentiles


Quartiles
Quartiles divide an ordered data set into four
quarters
[Figure: ordered data split by the quartile cut points]
Q1 = 25%
Q2 = 50% (Median)
Q3 = 75%
Q4 = 100%

Quartiles

(2)

E.g.

Data: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36


Ordered Data: 6, 7, 15, 36, 39, 41, 41, 43, 43,
47, 49
Median (Q2) = 41
Third quartile cut off (Q3) = 43
Lower quartile cut off (Q1) = 15
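The cut-offs in this example can be reproduced with `statistics.quantiles` (Python 3.8 or later). Note that quartile conventions differ between textbooks and software; this sketch relies on the function's default "exclusive" method, which for these 11 values picks the 3rd, 6th and 9th ordered observations.

```python
from statistics import quantiles

data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4)

print(q1, q2, q3)   # 15.0 41.0 43.0
```

Other conventions, for example `method="inclusive"`, can give slightly different cut-offs for the same data.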


Quintiles
Quintiles are values that divide a sample of
data into 5 groups containing (as far as
possible) equal numbers of observations
[Figure: ordered data split by the quintile cut points]
Q1 = 20%
Q2 = 40%
Q3 = 60%
Q4 = 80%
Q5 = top of the distribution

Percentiles

[Figure: the use of percentiles in the presentation of data]
50th percentile = median

Summary of quantiles
k   | Quantile name | No of quantiles | Description in ordered set
2   | Median        | 1               | 50% of observations both above and below median
4   | Quartiles     | 3               | 25% of observations below 1st, above 3rd and between successive quartiles
5   | Quintiles     | 4               | 20% of observations below 1st, above 4th and between successive quintiles
10  | Deciles       | 9               | 10% of observations below 1st, above 9th and between successive deciles
100 | Percentiles   | 99              | 1% of observations below 1st, above 99th and between successive percentiles

Why use quantiles?


It is an efficient way of dividing data into
groups; the groups are approximately equal
sized
Useful when studying relationships of skewed
variables
Not as efficient when the data variability is
low: the range is small, so the categories do not
differ much


The Interquartile Range (IQR)


IQR = Q3 - Q1
[Figure: ordered values with Q1, the median and Q3 marked]
What are the advantages of the interquartile
range over the range?
E.g.
0 1 2 3 4 5 6 7 8 9 10
Mean? Range?
Median? IQR?
0 1 2 3 4 5 6 7 8 9 1000
Mean? Range?
Median? IQR?
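One way to explore these two data sets is the sketch below. The helper name `summary` is hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method, so other quartile conventions may give slightly different IQR values.

```python
from statistics import mean, median, quantiles

def summary(values):
    """Mean, range, median and IQR of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

clean = list(range(11))                  # 0, 1, ..., 10
with_outlier = list(range(10)) + [1000]  # 0, 1, ..., 9, 1000

print(summary(clean))
print(summary(with_outlier))
```

Replacing 10 with 1000 changes the mean and the range substantially, while the median and the IQR stay the same.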


Box-and-whisker plots (Boxplots)

(1)

[Figure: annotated boxplot. Labels from top to bottom: extreme values, outlier, whisker (Median + 1.5 IQR), Q3 = P75, median, Q1 = P25]

Box-and-whisker plots (Boxplots)

(2)

The box length represents the interquartile
range
The whiskers extend to the smallest and
largest observations that are not outliers
The outliers and extreme values are indicated
by symbols such as ○ and *


Normal distribution
The Normal Curve is bell-shaped and
symmetrical.
It is unimodal (mean = median = mode)
Tails of the normal curve are asymptotic to
the horizontal axis (from $-\infty$ to $+\infty$); i.e. the curve
approaches the horizontal axis but never
touches it


The Normal Distribution

The Normal curve is determined by its
probability density function (pdf), given by the
formula
$f(X) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right)$
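The density formula translates directly into code. The function name `normal_pdf` and its parameters `mu` and `sigma2` are assumptions of this sketch; `statistics.NormalDist` (Python 3.8 or later) serves as a cross-check and takes the standard deviation, not the variance.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma2):
    """Density of a normal distribution with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Height of the curve at its centre for two different variances.
peak_var1 = normal_pdf(5, mu=5, sigma2=1)   # about 0.399
peak_var4 = normal_pdf(5, mu=5, sigma2=4)   # about 0.199

print(round(peak_var1, 3), round(peak_var4, 3))
```

A larger variance spreads the same total area over a wider range, so the peak is lower.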

Normal distribution

(2)

Shape of curve depends on two parameters:
mean and variance ($\mu$ and $\sigma^2$)
[Figure: effects of $\mu$ on the probability density function of a normal random variable; curves with mean = 5 and mean = 6]
[Figure: effects of $\sigma^2$ on the probability density function of a normal random variable; curves with variance = 1 and variance = 4]

Properties of a Standard Normal


Distribution (3)

[Figure: normal curve with the horizontal axis marked from -3SD to +3SD]
Within ± 1SD of the mean: 68.3% of the area under the curve
Within ± 2SD of the mean: 95.5%
Within ± 3SD of the mean: 99.7%
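These proportions can be checked numerically from the cumulative distribution function of the standard normal distribution. The sketch uses `statistics.NormalDist` (Python 3.8 or later); the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def coverage(k):
    """Proportion of the distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k) * 100, 2))
```

The computed values are approximately 68.27%, 95.45% and 99.73%, which match the quoted figures up to rounding.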

Skewed Distributions
Skewness is defined as asymmetry in the
distribution of the sample data values
Values on one side of the distribution tend to be
further from the 'middle' than values on the
other side


Skewness
Skewness measures the extent a distribution
of values deviates from symmetry around the
mean
Simplest measurement is Mean-Median
If Mean-Median >0, then +ve skew
If Mean-Median <0, then -ve skew
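The sign rule can be sketched as a small helper. The function name `skew_direction` and the three data sets are hypothetical, chosen only to illustrate each case.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew from the sign of (mean - median)."""
    diff = mean(values) - median(values)
    if diff > 0:
        return "positive"
    if diff < 0:
        return "negative"
    return "none"

print(skew_direction([1, 2, 2, 3, 3, 4, 20]))        # long right tail
print(skew_direction([1, 17, 18, 18, 19, 19, 20]))   # long left tail
print(skew_direction([1, 2, 3, 4, 5]))               # symmetric
```

This is only the simplest indicator: it gives the direction of skew but does not measure its size on a standard scale.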

Skewed distribution
[Figure: two curves, one with +ve skewness and one with -ve skewness]
+ve skewness indicates a
greater number of smaller
values.
-ve skewness indicates a greater
number of larger values.

Positively Skewed Distribution
[Figure: right-skewed curve (y-axis: %) with the median and the mean marked]

Negatively Skewed Distribution
[Figure: left-skewed curve (y-axis: %) with the median and the mean marked]

For further reading


Pearson's coefficient of skewness
Developed in the 1890s by Karl Pearson
The value of sk will fall within the range of
-3 to +3, with a value of 0 associated with a
perfectly symmetrical distribution
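The formula itself does not appear in the text above. The sketch below assumes it refers to Pearson's median-based (second) skewness coefficient, sk = 3 × (mean - median) / SD, which is consistent with the stated range of -3 to +3. The function name and data sets are hypothetical.

```python
from statistics import mean, median, stdev

def pearson_median_skewness(values):
    """Assumed form: 3 * (mean - median) / sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 20]

print(pearson_median_skewness(symmetric))
print(round(pearson_median_skewness(right_skewed), 2))
```

A symmetric data set gives 0, and a data set with a long right tail gives a positive value.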


Kurtosis
Curvature
Defined as a measure reflecting the degree to
which a distribution is peaked
Provides information regarding the height of a
distribution relative to the value of its standard
deviation
Can be divided into:
Mesokurtic: bell shaped
Leptokurtic: sharper peak (clustered around the mean)
Platykurtic: flatter peak (more dispersed)

For further reading


Testing for Normality

D'Agostino-Pearson test
Kolmogorov-Smirnov test
Lilliefors test
Shapiro-Wilk W test (7 ≤ n ≤ 2000)
Shapiro-Francia W' test (5 ≤ n ≤ 5000)


For further reading


Transformation to normality
If there is evidence of marked non-normality
then we may be able to remedy this by
applying suitable transformations
The more commonly used transformations
which are appropriate for data which are
skewed to the right (positive skew) are, in order of
increasing strength, sqrt(x), log(x) and 1/x,
where the x's are the data values

Commonly used transformations

If skewed to the right (positive skew), in order of
increasing strength: sqrt(x), log(x) and 1/x
If skewed to the left (negative skew), in order of
increasing strength: squaring, cubing, and
exp(x)
where the x's are the data values
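The effect of a transformation on a right-skewed variable can be illustrated with a small sketch. The data set is hypothetical, and skewness is summarised here with the median-based coefficient 3 × (mean - median) / SD, which is an assumption of this example rather than a method prescribed by the text.

```python
import math
from statistics import mean, median, stdev

def median_skewness(values):
    """3 * (mean - median) / sample SD, used here as a simple skewness summary."""
    return 3 * (mean(values) - median(values)) / stdev(values)

# Hypothetical right-skewed data (all values positive, so log is defined).
data = [1, 2, 2, 3, 4, 5, 8, 15, 40, 120]

root = [math.sqrt(x) for x in data]
logged = [math.log(x) for x in data]

skew_raw = median_skewness(data)
skew_root = median_skewness(root)
skew_log = median_skewness(logged)

print(round(skew_raw, 2), round(skew_root, 2), round(skew_log, 2))
```

For these data the log transformation reduces the skewness coefficient compared with the raw values; whether the result is close enough to normal still has to be checked, for example with a plot or a normality test.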


Transformation when dealing with


associations between 2 variables
The circle of powers, sometimes called
the ladder of powers, provides a general
guideline for choosing an appropriate
transformation


If the plotted data resemble Quadrant I, a
transformation that is either up on x or up on
y should be used. In other words, we would raise
either x or y to a power greater than p = 1
The more curvature in the data, the higher the
value of p needed to achieve linearity
In general, we prefer to transform x whenever
possible
