Statistics I

Central Tendency and Dispersion


MEASURES OF CENTRAL TENDENCY AND DISPERSION

After collecting and tabulating numerical data, the next step is to analyze it. Analyzing data means
computing its key features (summary measures), describing the findings and, ultimately, interpreting the
data. Tables and diagrams give a basic idea of how a numerical variable is distributed, but without summary
measures we cannot describe the data completely. For example, the KIESS data contain many numerical
variables, such as WORKHRS, AGE, EDUC, EARNRS and RINCOME. For each of these variables it is desirable to
compute some summary measures, in addition to tables and diagrams, to get an idea of its distribution.

The summary measures, which are single numbers describing a feature of a data set, fall under the study of
four chief characteristics of a data set:

1. Measures of Central Tendency or Averages (middle value)
2. Measures of Partition Values or Positional Averages (partitioning values)
3. Measures of Dispersion (spread)
4. Measures of Skewness (departure from symmetry) and Kurtosis (peakedness)

In any analysis and/or interpretation, a variety of descriptive measures representing the properties of Central
tendency, dispersion, skewness and kurtosis may be used to extract and summarize the major features of the
data batch. (See Section 3.1 page 70)

1. MEASURES OF CENTRAL TENDENCY

The measures of central tendency are typical values that show where the data cluster. They indicate how
the values concentrate in the central part of a distribution. They are sometimes called measures of
location because they enable us to locate the middle point of the data.

The various measures of central tendency are:


a. Arithmetic mean (simple and weighted)
b. Geometric mean
c. Harmonic mean
d. Median
e. Mode

ARITHMETIC MEAN:

The arithmetic mean (often called simply the mean) is the most commonly used average or measure of central
tendency. It is obtained by dividing the sum of all observations by the number of observations:

Mean = (sum of all observations in a data set) / (number of observations in a data set)

For a sample containing a batch of n observations x1, x2, ..., xn, the arithmetic mean is denoted by
$\bar{x}$ (x bar) and given by

Formula 1: $\bar{x} = \frac{1}{n} \sum x$

In the case of a discrete frequency distribution, where the value x1 is repeated f1 times, x2 is repeated
f2 times, ..., xk is repeated fk times, the mean is given by

Formula 2: $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i = \frac{1}{n} \sum f x$, where n = total number of observations = $\sum f$

For grouped data (a continuous frequency distribution) we also use Formula 2, with the x values taken as
the mid-values of the corresponding classes.

Remarks (Grouping Error): When data points are put into a continuous distribution, some information is
lost because each class is represented by its mid-value. The AM calculated with Formula 2 for a
continuous distribution therefore contains some error, called the grouping error. Whenever the individual
data are available, use Formula 1 to calculate the AM.
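
The grouping error can be illustrated with a small sketch. The raw values and the class limits below are hypothetical, chosen only to show that Formula 1 (raw data) and Formula 2 (class mid-values) can disagree.

```python
# Hypothetical raw observations and exclusive-type classes (lower <= x < upper).
raw = [1, 2, 4, 12, 13, 19, 21, 22, 28]
classes = [(0, 10), (10, 20), (20, 30)]

# Tabulate: frequency and mid-value of each class.
freqs = [sum(1 for x in raw if lo <= x < hi) for lo, hi in classes]
mids = [(lo + hi) / 2 for lo, hi in classes]

# Formula 1: mean from the individual observations.
mean_raw = sum(raw) / len(raw)

# Formula 2: mean from the frequency table, x taken as the class mid-value.
mean_grouped = sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)

# The difference between the two is the grouping error.
grouping_error = mean_grouped - mean_raw
print(mean_raw, mean_grouped, grouping_error)
```

Here the grouped mean overstates the raw mean because the observations in each class sit mostly below the class mid-value.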

Step Deviation Method (Coding Method) for Calculation of Mean:

To simplify the calculation for grouped data, we use a technique called the step-deviation method whenever
the mid-point values and/or the frequencies are large. We code each mid-value x as a new variable u by

$u = \frac{x - x_0}{w}$

Where u = coded midpoint
x = original variable
x0 = value of the midpoint occurring in the middle, or nearest to the middle, of the data set
w = width of the class
Then the arithmetic mean is calculated using the formula

Formula 3: $\bar{x} = x_0 + w \frac{\sum f u}{n} = x_0 + w \bar{u}$

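
As a check on Formula 3, the following sketch codes a hypothetical frequency table (mid-values, frequencies, x0 and w are all made-up illustration values) and compares the coded mean with the mean from Formula 2.

```python
# Hypothetical grouped data: class mid-values and frequencies.
mids = [5, 15, 25, 35, 45]
freqs = [4, 6, 10, 7, 3]
x0, w = 25, 10          # assumed origin (middle mid-value) and class width

n = sum(freqs)

# Code each mid-value: u = (x - x0) / w gives -2, -1, 0, 1, 2.
us = [(x - x0) / w for x in mids]

# Mean of the coded variable, then decode with Formula 3.
u_bar = sum(f * u for f, u in zip(freqs, us)) / n
mean_coded = x0 + w * u_bar

# Formula 2 applied directly to the mid-values, for comparison.
mean_direct = sum(f * x for f, x in zip(freqs, mids)) / n
print(mean_coded, mean_direct)
```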

The weighted arithmetic mean:

The arithmetic mean discussed above is the simple arithmetic mean, in which all items are assumed to be
equally important in the distribution. In practice this may not be so: some items may be more important
than others, and in such cases proper weights should be given to the various items. Let W1, W2, ..., Wk be
the weights given to the variable values x1, x2, ..., xk respectively. Then their weighted arithmetic mean
is given by

$\bar{x}_w = \frac{W_1 x_1 + W_2 x_2 + \cdots + W_k x_k}{W_1 + W_2 + \cdots + W_k} = \frac{\sum W x}{\sum W}$

Examples: Weighted averages in Entrance Examination, Weighted average of in-semester exams, Weighted
average of price relatives (in index numbers) etc.
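
A minimal sketch of the weighted mean, using hypothetical exam scores and weights (not taken from the text), shows how it differs from the simple mean.

```python
def weighted_mean(values, weights):
    """Sum of W*x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical in-semester scores, with the last exam weighted most heavily.
scores = [70, 80, 90]
weights = [2, 3, 5]

simple = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(simple, weighted)
```

With equal weights the function reduces to the simple arithmetic mean.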

Some Properties of Arithmetic Mean:

Property 1: The algebraic sum of the deviations of a given set of observations from their arithmetic mean
is zero. That is, $\sum f (x - \bar{x}) = 0$. This provides a check for the calculation.

Property 2 (Mean of combined series): If the individual sizes and the individual means of two series are
known, the mean of the combined series can be calculated as follows:

        Series 1       Series 2       Combined Series
Size    n1             n2             n1 + n2
Mean    $\bar{x}_1$    $\bar{x}_2$    $\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$

This result may be generalized to the case of more than two series.
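
Both properties can be checked numerically. The two series below are hypothetical; the sketch compares the combined-mean formula with the mean of the pooled observations and verifies that deviations from the mean sum to zero.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Mean of the combined series from the two sizes and the two means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Hypothetical series.
series1 = [10, 12, 14, 16]
series2 = [20, 22]

mean1 = sum(series1) / len(series1)
mean2 = sum(series2) / len(series2)

# Property 2: formula versus pooling the raw observations.
pooled = combined_mean(len(series1), mean1, len(series2), mean2)
direct = sum(series1 + series2) / len(series1 + series2)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - mean1 for x in series1)
print(pooled, direct, deviation_sum)
```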

THE GEOMETRIC MEAN:
The arithmetic mean may be inappropriate when dealing with quantities that change over a period of time,
such as growth rates and depreciation rates. In such cases we use the geometric mean (GM). The GM is
defined as the nth root of the product of all observations, where n is the number of observations:

$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$

$GM = \text{antilog}\left( \frac{1}{n} \sum \log x \right)$

In other words, the logarithm of the geometric mean is the arithmetic mean of the logarithms of the
observations. It is apparent from the above formula that the geometric mean cannot be calculated when some
values are zero or negative.

For a frequency distribution, the geometric mean is given by

$GM = \left( \prod x^{f} \right)^{1/n}$

where the sign Π (capital pi) denotes the product and n is the total number of observations, n = Σf,

or, $GM = \text{antilog}\left( \frac{1}{n} \sum f \log x \right)$

However, this measure of central tendency is not used so widely because of its computational difficulty.

Remarks: When you calculate the average growth rate using GM, you should calculate the GM of growth
factors and not of growth percentages. Growth factor = 1+r/100 or Growth factor =Current period's
value/Previous period's value.
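
The remark about growth factors can be sketched as follows. The yearly growth percentages are hypothetical; the code converts them to growth factors, takes the GM both as an nth root and through logarithms, and converts the result back to an average growth rate.

```python
import math

# Hypothetical yearly growth rates in percent (one year shrinks by 5%).
rates = [10, 20, -5]

# Growth factor = 1 + r/100; the GM is taken of these, not of the percentages.
factors = [1 + r / 100 for r in rates]
n = len(factors)

# GM as the nth root of the product.
gm_root = math.prod(factors) ** (1 / n)

# GM as the antilog of the arithmetic mean of the logarithms.
gm_log = math.exp(sum(math.log(x) for x in factors) / n)

# Average growth rate per period, in percent.
avg_growth = (gm_root - 1) * 100
print(gm_root, gm_log, avg_growth)
```

Applying the GM factor n times reproduces the total growth over the whole period, which the arithmetic mean of the percentages does not.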

MEDIAN:
The median is a measure of central tendency. It is the single value that marks the middlemost item in the
data: half the values are greater than the median and half are less. In other words, the median divides
the whole data set into two equal parts, with 50% of the values lying above it and 50% lying below it.
Unlike the mean, the median is not based on all the observations; it is a positional average, which
depends on the position occupied by a value in the frequency distribution.

Calculation of Median for ungrouped data:

The first step in calculating the median is to arrange the data values in ascending order of magnitude.
The median (Md) is then given by

Md = value of the $\left( \frac{n+1}{2} \right)$th observation (when n is odd)

Md = AM of the values of the $\left( \frac{n}{2} \right)$th and $\left( \frac{n}{2} + 1 \right)$th observations (when n is even)

Here n = number of observations in the data set.
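
The odd and even cases above can be written as a short function. This is a sketch with hypothetical inputs; note that the text counts observations from 1, while Python lists are indexed from 0.

```python
def median(values):
    """Median of ungrouped data, following the odd/even rule above."""
    ordered = sorted(values)            # ascending order of magnitude
    n = len(ordered)
    if n % 2 == 1:
        # (n+1)/2-th observation, shifted by 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # AM of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_case = median([7, 1, 5])        # hypothetical data, n = 3
even_case = median([4, 1, 3, 2])    # hypothetical data, n = 4
print(odd_case, even_case)
```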

Calculation of Median for grouped data:

In a discrete frequency distribution, the median is the value of the middlemost item. It is calculated
using the following steps:
Step 1: Prepare the less-than cumulative frequency distribution from the given distribution.
Step 2: Calculate (n+1)/2. The value of the variable corresponding to the cumulative frequency just greater
than or equal to (n+1)/2 is the Median.
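
The two steps can be sketched as follows for a hypothetical discrete frequency table; the function name and the data are illustrative only.

```python
from itertools import accumulate

def median_discrete(xs, fs):
    """Median of a discrete frequency distribution (xs in ascending order)."""
    cum = list(accumulate(fs))          # Step 1: less-than cumulative frequencies
    target = (cum[-1] + 1) / 2          # Step 2: (n + 1) / 2
    for x, cf in zip(xs, cum):
        if cf >= target:                # first cumulative frequency >= (n+1)/2
            return x

# Hypothetical table: values 1..5 with frequencies summing to n = 21.
xs = [1, 2, 3, 4, 5]
fs = [2, 5, 8, 4, 2]
md = median_discrete(xs, fs)
print(md)
```

Here (n+1)/2 = 11 and the cumulative frequencies are 2, 7, 15, 19, 21, so the 11th item falls on the value 3.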


In case of continuous frequency distribution use the following steps:


Step 1: Prepare less than cumulative frequency distribution from the given distribution.
Step 2: Calculate n/2, the class in which the (n/2)th observation falls is called the Median class (because this
class contains Median). Median class is the class corresponding to cumulative frequency just greater than
(n/2).
Step 3: Use the following formula to calculate the Median.

$Md = l + \frac{w}{f} \left( \frac{n}{2} - c \right)$

Where, l = lower limit of the Median class
w = width of the Median class
f = frequency of the Median class
c = cumulative frequency of the class preceding the Median class

Note: To use the above formula the classes should be of exclusive type.
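
A minimal sketch of the three steps, assuming exclusive-type classes of equal width; the class limits and frequencies are hypothetical.

```python
from itertools import accumulate

def grouped_median(lowers, width, freqs):
    """Md = l + (w / f) * (n/2 - c) for exclusive classes of equal width."""
    n = sum(freqs)
    cum = list(accumulate(freqs))               # Step 1
    for i, cf in enumerate(cum):
        if cf > n / 2:                          # Step 2: median class
            c = cum[i - 1] if i > 0 else 0      # cumulative frequency before it
            return lowers[i] + width / freqs[i] * (n / 2 - c)   # Step 3

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lowers = [0, 10, 20, 30, 40]
freqs = [4, 6, 10, 7, 3]
md = grouped_median(lowers, 10, freqs)
print(md)
```

With n = 30 the median class is 20-30 (cumulative frequency 20 is the first to exceed 15), so l = 20, f = 10 and c = 10.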

Another Formula for calculating Median for grouped data:

This is a variation of the above formula for calculating the median of grouped data.

Step 1: Prepare the less-than cumulative frequency distribution from the given distribution.
Step 2: Calculate (n+1)/2. The class in which the $\left( \frac{n+1}{2} \right)$th observation falls is called the Median
class (because this class contains the Median). The Median class is the class corresponding to the
cumulative frequency just greater than (n+1)/2.
Step 3: Use the following formula to calculate the Median.

$Md = l_m + \frac{w}{f_m} \left( \frac{n+1}{2} - (F + 1) \right)$

Where, lm = lower limit of the median class
fm = frequency of the median class
w = class interval or width of the class
F = sum of all frequencies up to but not including the median class (the cumulative
frequency of the class preceding the median class)
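
This variant can be sketched in the same way. The frequency table is hypothetical, and the sketch takes the median class to be the first class whose cumulative frequency reaches (n+1)/2, which is an interpretation of Step 2.

```python
from itertools import accumulate

def grouped_median_alt(lowers, width, freqs):
    """Md = lm + (w / fm) * ((n+1)/2 - (F + 1)) for exclusive classes."""
    n = sum(freqs)
    target = (n + 1) / 2
    cum = list(accumulate(freqs))
    for i, cf in enumerate(cum):
        if cf >= target:                        # class holding the (n+1)/2-th item
            big_f = cum[i - 1] if i > 0 else 0  # F: frequencies before this class
            return lowers[i] + width / freqs[i] * (target - (big_f + 1))

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lowers = [0, 10, 20, 30, 40]
freqs = [4, 6, 10, 7, 3]
md_alt = grouped_median_alt(lowers, 10, freqs)
print(md_alt)
```

Algebraically this result is smaller than the n/2 version by w/(2 fm), which is 0.5 for this table.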

MODE:
Mode is the value of the variable, which is repeated most often in a data set. Around mode, the observations
are highly concentrated. In the curve of frequency distribution, the mode is the peak point of the
distribution.

Computation of Mode: To calculate the mode, we first arrange the data into a discrete or continuous
frequency distribution. In a discrete frequency distribution the mode can be found simply by inspection:
it is the value of the variable whose frequency is maximum.
In a continuous frequency distribution, the class corresponding to the maximum frequency is called the
Modal Class (because this class contains the Mode), and the value of the Mode is calculated using the
following formula.

$\text{Mode} = l + \left( \frac{d_1}{d_1 + d_2} \right) w = l + \frac{w (f_1 - f_0)}{2 f_1 - f_0 - f_2}$

Where, l = lower limit of the modal class
w = width of the modal class, d1 = f1 − f0, d2 = f1 − f2
f0 = frequency of the class preceding the modal class
f1 = frequency of the modal class
f2 = frequency of the class succeeding the modal class

Remarks:

1. Calculating the mode with the above formula requires the classes to be of exclusive type, and the
frequencies should increase and then decrease in a regular manner.
2. The above technique for calculating the mode is not practicable in the following cases:

i) If the maximum frequency is repeated.
ii) If the maximum frequency occurs at the very beginning or end of the distribution.
iii) If there are irregularities in the distribution, i.e. the frequencies increase and decrease haphazardly.
iv) If the curve of the distribution has more than one peak, so that there is more than one value around
which the frequencies concentrate. Such distributions are called bimodal (two modes) or multi-modal
(more than two modes).
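
The modal-class formula, together with the first two exceptions from the remarks, can be sketched as follows. The frequency table is hypothetical, and raising an error in the exceptional cases is a design choice of this sketch, not something the text prescribes.

```python
def grouped_mode(lowers, width, freqs):
    """Mode = l + d1 / (d1 + d2) * w for exclusive classes of equal width."""
    f1 = max(freqs)
    if freqs.count(f1) > 1:
        raise ValueError("maximum frequency is repeated")
    i = freqs.index(f1)                 # position of the modal class
    if i == 0 or i == len(freqs) - 1:
        raise ValueError("modal class is at the start or end of the distribution")
    d1 = f1 - freqs[i - 1]              # f1 - f0
    d2 = f1 - freqs[i + 1]              # f1 - f2
    return lowers[i] + d1 / (d1 + d2) * width

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lowers = [0, 10, 20, 30, 40]
freqs = [4, 6, 10, 7, 3]
mode = grouped_mode(lowers, 10, freqs)
print(mode)
```

Here the modal class is 20-30 with d1 = 4 and d2 = 3, so the mode lies four sevenths of the way into the class.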

RELATION BETWEEN MEAN, MEDIAN AND MODE

By comparing the values of the Mean (M), Median (Md) and Mode (Mo), we can get an idea of the shape of the
frequency curve.
For a distribution with a single peak, the Median usually lies between the Mean and the Mode. There may be
3 cases:

i) M = Md = Mo: such a distribution is called symmetric.
ii) M ≤ Md ≤ Mo: such a distribution is called Negatively Skewed. Its curve has a tail to the left.
iii) M ≥ Md ≥ Mo: such a distribution is called Positively Skewed. Its curve has a tail to the right.

The above three conditions are the basics of study of Skewness.

For distributions that are not highly skewed, the following empirical relation holds approximately:
Mode = 3 Median − 2 Mean
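
The three cases and the empirical relation can be put into a small sketch. The function names and the numbers passed to them are hypothetical.

```python
def empirical_mode(mean, median):
    """Empirical relation: Mode = 3 Median - 2 Mean (moderate skewness only)."""
    return 3 * median - 2 * mean

def skew_direction(mean, median, mode):
    """Classify the shape from the ordering of mean, median and mode."""
    if mean == median == mode:
        return "symmetric"
    if mean <= median <= mode:
        return "negative"       # tail to the left
    if mean >= median >= mode:
        return "positive"       # tail to the right
    return "undetermined"       # ordering matches none of the three cases

# Hypothetical summary values.
mo = empirical_mode(20, 22)
shape = skew_direction(20, 22, mo)
print(mo, shape)
```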

ADVANTAGES AND DISADVANTAGES OF MEAN MEDIAN AND MODE:

The various measures of central tendency mean, median and mode have their own relative advantages and
disadvantages. The following points give the brief discussion. (For examples see the textbook)

1. Rigidity of Definition: The Mean and Median are rigidly defined but mode is not rigidly defined, the value
of mean and median for a distribution is unique but the value of mode may not be so.

2. Comprehensibility: The mean and median are easier to understand and calculate than the mode.
3. Dependence on values: Only the mean is calculated using all the observations; the median and mode are
not.
4. Open-ended Classes: For the distribution containing open-ended classes, mean cannot be calculated; in
such cases we can only calculate Median or Mode.
5. Extreme Values: The mean is the measure which is very seriously affected by the extreme values in the
distribution, so mean may not be reliable in the decision making processes in which we have to consider
the distribution having extreme values. The mean may even mislead us to wrong conclusion. Median and
Mode do not have this problem.
6. Sampling Fluctuations: The mean is less affected by fluctuations of sampling than the median and mode,
so when estimating a population parameter of central location from a sample, the mean is more commonly
used than the median and mode.
7. Further Mathematical Treatments: The further mathematical formulas regarding advanced analysis of
data may use mean, median and mode as primary measures. However, the formulas that use mean are
easier than those using median and mode.

2. MEASURES OF PARTITION VALUES (FRACTILES):

Partition values are those values of the variable which divide the entire data set into a number of equal
parts. Thus the Median may be regarded as a particular partition value, which divides the data set into 2
equal parts. The commonly used partition values are Quartiles, Deciles and Percentiles.

Quartiles: The three values Q1, Q2, Q3 (Q1 ≤ Q2 ≤ Q3) which divide the data set into 4 equal parts are
collectively called Quartiles.

Deciles: The nine values D1, D2, ..., D9 (D1 ≤ D2 ≤ ... ≤ D9) which divide the data set into 10 equal
parts are called Deciles.

Percentiles: The ninety-nine values P1, P2, ..., P99 (P1 ≤ P2 ≤ ... ≤ P99) which divide the data set into
100 equal parts are called Percentiles.

Different fractions of the data lie above and below the partition values, so they are sometimes called
Fractiles. For example, the median is the 0.5 fractile, Q1 is the 0.25 fractile, D4 is the 0.4 fractile,
P67 is the 0.67 fractile and so on.

Computation of Partition Values ( in case of ungrouped data):

Step 1: Arrange the data in ascending order of magnitude.

Step 2: The k-fractile(any partition value which is converted into fractile, k=0.01, 0.02………0.99) is given by,

k-fractile = value of {k(n+1)}th ordered observation.

Note that, the conversion of any partition value into fractile is according to following rule:
Qj = jth quartile = j / 4 fractile (j =1, 2, 3)
Dj = jth decile = j / 10 fractile (j = 1, 2,…………….,9)
Pj= jth percentile = j / 100 fractile (j = 1, 2, ……99)
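
The k(n+1) rule can be sketched as below with hypothetical data. The text does not say what to do when k(n+1) is not a whole number; this sketch assumes linear interpolation between the two neighbouring ordered observations and clamps positions that fall outside the data, both of which are assumptions.

```python
def fractile(values, k):
    """k-fractile as the k(n+1)-th ordered observation (1-based position)."""
    ordered = sorted(values)            # Step 1: ascending order
    n = len(ordered)
    pos = k * (n + 1)                   # Step 2: position of the k-fractile
    whole = int(pos)
    frac = pos - whole
    if whole < 1:                       # assumption: clamp to the smallest value
        return ordered[0]
    if whole >= n:                      # assumption: clamp to the largest value
        return ordered[-1]
    # Assumption: interpolate linearly when the position is fractional.
    return ordered[whole - 1] + frac * (ordered[whole] - ordered[whole - 1])

# Hypothetical data set with n = 7.
data = [2, 4, 6, 8, 10, 12, 14]
q1 = fractile(data, 1 / 4)      # Q1 = 0.25 fractile
q3 = fractile(data, 3 / 4)      # Q3 = 0.75 fractile
d4 = fractile(data, 4 / 10)     # D4 = 0.4 fractile
print(q1, q3, d4)
```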

Computation of Partition Values (in case of grouped data):


Step 1: Prepare the less than cumulative frequency distribution.
Step 2: To calculate k-fractile, first calculate kn, and look in which class the (kn)th observation lies, this is
the class containing k-fractile.
Step 3: The k-fractile is given by the following formula,

$k\text{-fractile} = l + \frac{w (kn - c)}{f}$

Where, l = lower limit of the class containing the k-fractile
w = class interval
f = frequency of the class containing the k-fractile
c = cumulative frequency of the class preceding the class containing the k-fractile

For example, using this compact formula, D4 may be calculated as:

$0.4\text{-fractile} = l + \frac{w (0.4n - c)}{f}$
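
A minimal sketch of the grouped k-fractile formula, assuming exclusive-type classes of equal width and a hypothetical frequency table.

```python
from itertools import accumulate

def grouped_fractile(lowers, width, freqs, k):
    """k-fractile = l + w * (k*n - c) / f for exclusive classes of equal width."""
    n = sum(freqs)
    cum = list(accumulate(freqs))               # Step 1
    for i, cf in enumerate(cum):
        if cf >= k * n:                         # Step 2: class holding the (kn)-th item
            c = cum[i - 1] if i > 0 else 0
            return lowers[i] + width * (k * n - c) / freqs[i]   # Step 3

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 with n = 30.
lowers = [0, 10, 20, 30, 40]
freqs = [4, 6, 10, 7, 3]
d4 = grouped_fractile(lowers, 10, freqs, 0.4)
q1 = grouped_fractile(lowers, 10, freqs, 0.25)
q3 = grouped_fractile(lowers, 10, freqs, 0.75)
print(d4, q1, q3)
```

For D4, kn = 12 falls in the class 20-30, where c = 10 and f = 10.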

3. MEASURES OF DISPERSION:

The measures of central tendency alone are not sufficient to describe a data batch, because they only give
an idea of how the observations concentrate about the central part of the distribution. Different data
sets may have the same measure of central tendency and yet differ widely from each other in a number of
ways. See the following example.

Section   Marks obtained by 6 students   Total   Mean
A         15, 15, 15, 15, 15, 15         90      15
B         13, 14, 15, 15, 16, 17         90      15
C         1, 10, 14, 16, 19, 30          90      15

The students in all 3 sections have mean marks of 15, but the data sets differ very much from one another.
Therefore, the measures of central tendency must be supported and supplemented by some other measures
to describe the data set completely; one such measure is Dispersion.
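
The example table can be reproduced in a few lines. The marks are those given above; the range (highest minus lowest value) is used here as the simplest way to show that the spread differs while the mean does not.

```python
# Marks of the three sections from the example table.
sections = {
    "A": [15, 15, 15, 15, 15, 15],
    "B": [13, 14, 15, 15, 16, 17],
    "C": [1, 10, 14, 16, 19, 30],
}

# Same mean in every section ...
means = {name: sum(marks) / len(marks) for name, marks in sections.items()}

# ... but very different spread.
ranges = {name: max(marks) - min(marks) for name, marks in sections.items()}

for name in sections:
    print(name, means[name], ranges[name])
```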

The measures of dispersion are devised to measure the 'scatteredness' in a data set; we study them to get
an idea of the homogeneity (compactness, uniformity or consistency) of the distribution. They measure the
'spread' in a data set. The smaller the value of a measure of dispersion, the less the variability in the
data set.

The measures of dispersion also give additional information about the reliability of the measure of
central tendency. If the data are widely dispersed, the central location is less representative of the
data as a whole; if the data are less dispersed, the central tendency measures are more representative.

There are problems peculiar to widely dispersed data, and we must be able to recognize that data are widely
dispersed before we can tackle those problems. For example, we would not buy appliances whose lifetimes
vary widely from one unit to another. In the language of quality control we say that 'Variability is the
Enemy of Quality'.

Finally, we may sometimes wish to compare the dispersion of various samples to find out which data set is
more reliable.
The measures of dispersion are categorized into two classes. The distance measures express dispersion as
the difference between two values selected from the data set; the Range, Inter-fractile Range and
Inter-quartile Range are examples of this type. The average deviation measures deal with the average
deviation from some measure of central tendency; the Standard Deviation and Variance are examples of this
type, expressing dispersion in terms of average deviations of the values from the arithmetic mean.

Range: Range is the difference between the highest and lowest observed values in a data set. Thus,
Range = Highest Value − Lowest Value
For a grouped distribution,
Range = upper limit of highest class − lower limit of lowest class
The Range cannot be calculated for distributions with open-ended classes.

Inter-fractile and Inter-quartile range: The Inter-fractile range is a measure of the spread between two
fractiles in a frequency distribution. It is generally calculated by taking the difference between two
fractiles lying on the two sides of the median. For example, the inter-fractile range D7 − D3 gives the
spread in the middle 40% of the data; similarly, P95 − P5 gives the spread in the middle 90%. P90 − P10 is
the commonly used percentile range. One advantage of this measure over the Range is that it discards the
outliers (the extreme values on the two sides), and compared to the Range it takes more of the data into
account.

The Inter-quartile range = Q3 − Q1 is a particular inter-fractile range, which gives the spread in the
middle 50% of a given data set.

This measure is better than the range because, unlike the range, which depends on only two observations,
the inter-quartile range is based on the middle 50% of the observations. Further, this measure can be
computed for grouped data even when the frequency distribution contains open-ended classes.
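
The effect of an outlier on the two measures can be sketched with hypothetical data. The quartiles are taken from Python's `statistics.quantiles`, whose default "exclusive" method places the quartiles at positions k(n+1) with linear interpolation; using that library function here is a convenience of this sketch.

```python
import statistics

def spread(values):
    """Return (range, inter-quartile range) of ungrouped data."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # Q1, Q2, Q3
    return max(values) - min(values), q3 - q1

# Hypothetical data, and the same data with one extreme value.
clean = [2, 4, 6, 8, 10, 12, 14]
with_outlier = [2, 4, 6, 8, 10, 12, 140]

range_clean, iqr_clean = spread(clean)
range_out, iqr_out = spread(with_outlier)
print(range_clean, iqr_clean)
print(range_out, iqr_out)
```

The single extreme value changes the range greatly but leaves the inter-quartile range unchanged.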

Standard Deviation and Variance: The variance of a distribution is defined as the arithmetic mean of the
squares of the deviations of the observations from their arithmetic mean. The positive square root of the
variance is the standard deviation. For a population, the variance is denoted by σ² and the standard
deviation by σ. For a sample, the variance is denoted by s² and the standard deviation by s.
The following table gives the formulas for the variance of a population and of a sample, for ungrouped and
grouped data.

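Variance formulas for ungrouped and grouped data

Population (N = population size, µ = population mean)

Ungrouped data: $\sigma^2 = \frac{1}{N} \sum (x - \mu)^2 = \frac{\sum x^2}{N} - \left( \frac{\sum x}{N} \right)^2$

Grouped data: $\sigma^2 = \frac{1}{N} \sum f (x - \mu)^2 = \frac{\sum f x^2}{N} - \left( \frac{\sum f x}{N} \right)^2$

Sample (n = sample size, $\bar{x}$ = sample mean)

Ungrouped data: $s^2 = \frac{1}{n-1} \sum (x - \bar{x})^2 = \frac{\sum x^2}{n-1} - \frac{n \bar{x}^2}{n-1} = \frac{\sum x^2}{n-1} - \frac{(\sum x)^2}{n(n-1)}$

Grouped data: $s^2 = \frac{1}{n-1} \sum f (x - \bar{x})^2 = \frac{\sum f x^2}{n-1} - \frac{n \bar{x}^2}{n-1} = \frac{\sum f x^2}{n-1} - \frac{(\sum f x)^2}{n(n-1)}$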

The standard deviation (SD) is obtained by taking the positive square root of the variance. The least
possible value of the SD is zero, which shows that there is no variation at all in the given data. The
larger the variance or SD, the higher the variation.

Note that the variance and standard deviation use all the observations in the data set and are better
measures of dispersion than the range and inter-fractile range, but they cannot be calculated in the case
of open-ended classes.
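
The definitional and shortcut forms of the ungrouped variance can be compared numerically. The sketch uses the marks of Section C from the earlier example; the function names are illustrative.

```python
import math

def variances(values):
    """Return (population variance, sample variance) from the definitions."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)   # sum of squared deviations
    return ss / n, ss / (n - 1)

def sample_variance_shortcut(values):
    """Sample variance as sum(x^2)/(n-1) - (sum x)^2 / (n(n-1))."""
    n = len(values)
    return sum(x * x for x in values) / (n - 1) - sum(values) ** 2 / (n * (n - 1))

marks_c = [1, 10, 14, 16, 19, 30]       # Section C from the example table
pop_var, samp_var = variances(marks_c)
shortcut = sample_variance_shortcut(marks_c)
samp_sd = math.sqrt(samp_var)           # SD is the positive square root
print(pop_var, samp_var, shortcut, samp_sd)
```

The sum of squared deviations is 464, so the population variance divides it by 6 and the sample variance by 5.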

Calculation of Variance Using Coding Method:

As in the calculation of the arithmetic mean, the coding method may be employed to calculate the variance
of grouped distributions. The following steps give the method for the sample variance; the calculation of
the population variance is then straightforward.

Step 1: Code the original variable x to the variable u, using the relation $u = \frac{x - x_0}{w}$

Step 2: Compute $\bar{u} = \frac{\sum f u}{n}$

Step 3: Find the variance of the variable u: $s_u^2 = \frac{1}{n-1} \sum f (u - \bar{u})^2 = \frac{\sum f u^2}{n-1} - \frac{(\sum f u)^2}{n(n-1)}$

Step 4: Find the variance of the original variable by $s_x^2 = w^2 s_u^2$
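
The four steps can be checked against the direct grouped formula. The mid-values, frequencies, x0 and w below are hypothetical illustration values.

```python
# Hypothetical grouped data: class mid-values and frequencies.
mids = [5, 15, 25, 35, 45]
freqs = [4, 6, 10, 7, 3]
x0, w = 25, 10          # assumed origin and class width

n = sum(freqs)

# Step 1: code the mid-values.
us = [(x - x0) / w for x in mids]

# Step 2: sum of f*u (u_bar is this sum divided by n).
sum_fu = sum(f * u for f, u in zip(freqs, us))
u_bar = sum_fu / n

# Step 3: sample variance of u by the shortcut form.
sum_fu2 = sum(f * u * u for f, u in zip(freqs, us))
var_u = sum_fu2 / (n - 1) - sum_fu ** 2 / (n * (n - 1))

# Step 4: decode to the variance of x.
var_x_coded = w ** 2 * var_u

# Direct grouped sample variance, for comparison.
sum_fx = sum(f * x for f, x in zip(freqs, mids))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, mids))
var_x_direct = sum_fx2 / (n - 1) - sum_fx ** 2 / (n * (n - 1))
print(var_x_coded, var_x_direct)
```

The coded sums (Σfu = −1, Σfu² = 41) are far smaller than the direct ones (Σfx = 740, Σfx² = 22350), which is the point of the method.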

Relation Between Range and Standard Deviation: See Use of Standard Deviation page 116

Relative Measures of Dispersion: The Coefficient of Variation

All the measures described above are absolute measures: they are not free of their unit of measurement and
are expressed in the same units as the original data.

Because the standard deviation is a measure of dispersion based on the arithmetic mean, it should not be
used alone to interpret the variation in the data; it should be interpreted in relation to the mean. The
coefficient of variation, a relative measure, is used to compare the variation in two data sets, whether
they are measured in the same or in different units. It expresses the standard deviation as a percentage
of the arithmetic mean. Hence,

Population Coefficient of Variation = $\frac{\sigma}{\mu} \times 100\%$

Sample Coefficient of Variation = $\frac{s}{\bar{x}} \times 100\%$

To compare the variability of two or more data sets, the coefficients of variation (not the standard
deviations) of the data sets are compared. (See Section 3.10 page 126)
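
A short sketch of the sample coefficient of variation, applied to Sections B and C from the earlier example. Treating the marks as samples (divisor n − 1) is an assumption of this sketch.

```python
import statistics

def coefficient_of_variation(values):
    """Sample CV: s / x-bar * 100, in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Sections B and C from the example table: same mean, different spread.
marks_b = [13, 14, 15, 15, 16, 17]
marks_c = [1, 10, 14, 16, 19, 30]

cv_b = coefficient_of_variation(marks_b)
cv_c = coefficient_of_variation(marks_c)
print(cv_b, cv_c)
```

Section B has the smaller coefficient of variation, so its marks are the more consistent of the two.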

: End of Central Tendency and Dispersion:
