
Quantitative Analysis

Chapter 3

Measures of Central Tendency

Submitted to: Prof. Mamta Bhrambhatt

Date of Submission: 24/12/2009

Index

1. Measures of Central Tendency: Ungrouped Data.


• Mode
• Median
• Mean
• Percentile
• Quartiles

2. Measures of Variability: Ungrouped Data.


• Range
• Interquartile Range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Empirical Rule
• Chebyshev’s Theorem
• z Scores
• Co-efficient of Variation

3. Measures of Central Tendency and Variability: Grouped Data


• Mean
• Mode

4. Measures of Shape
• Skewness
• Kurtosis
• Box and Whisker plots

5. Measures of Association
• Correlation

1. Measures of Central Tendency: Ungrouped Data.

We can use single numbers called "summary statistics" to describe
characteristics of a data set. Two of these characteristics are particularly
important to decision makers:
1. Central tendency
2. Dispersion

Central Tendency:
Central tendency is the middle point of a distribution. Measures of
central tendency are also known as measures of location. They yield
information about the center, or middle part, of a group of numbers.
They do not focus on the span of the data set or on how far values are
from the middle numbers.

Dispersion:
Dispersion is the spread of the data in a distribution, that is, the extent
to which the observations are scattered.

Objectives:

• To use summary statistics to describe a collection of data.
• To use the mean, median and mode to describe how data "bunch up".
• To use the range, variance and standard deviation to describe how data
"spread out".

MEASURES OF CENTRAL TENDENCY: UNGROUPED DATA

Mode:

The mode is a measure of central tendency. It is the most frequently occurring value in a distribution.

E.g. the mode of 3, 4, 4, 5, 5, 5, 8 is 5, because 5 occurs more often than any other value.

• Bimodal -- Data sets that have two modes
• Multimodal -- Data sets that contain more than two modes

When to use: Use the mode when the data is non-numeric or when asked to choose
the most popular item.

• Advantages:
• Extreme values (outliers) do not affect the mode.
• Disadvantages:
• Not as popular as mean and median.
• Not necessarily unique - may be more than one answer
• When no value repeats in the data set, every value is a mode, so the mode is uninformative.
• When there is more than one mode, it is difficult to interpret and/or compare.
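
The definition of the mode can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes; the function name `modes` is a hypothetical choice, and it returns every value tied for the highest frequency so that bimodal and multimodal data sets are visible.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    # Keep all values tied for the highest count (handles bimodal data).
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([3, 4, 4, 5, 5, 5, 8]))    # single mode: [5]
print(modes([1, 1, 2, 2, 3]))          # bimodal: [1, 2]
print(modes(["red", "blue", "red"]))   # non-numeric data: ['red']
```

When no value repeats, the function returns the whole data set, which matches the disadvantage listed above.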

Median
The data must be ranked (sorted in ascending order) first. The median is the number in the
middle.

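To find the depth (position) of the median in the ranked data, there are several formulas that
could be used. The one that we will use is:
Depth of median = 0.5 * (n + 1)
If the depth is not a whole number, the median is the average of the two values on either side of
that position.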

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

When to use: Use the median to describe the middle of a set of data that does have an outlier.

• Advantages:
• Extreme values (outliers) do not affect the median as strongly as they do the mean.
• Useful when comparing sets of data.
• It is unique - there is only one answer.

Disadvantages:
• Not as popular as mean.
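
A minimal Python sketch of the depth formula follows. The function name `median` and the sample lists are illustrative assumptions; the rule for a fractional depth (average the two neighbouring values) is the usual convention for an even number of observations.

```python
def median(values):
    """Median via the depth formula: depth = 0.5 * (n + 1), 1-based."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)           # the data must be ranked first
    depth = 0.5 * (len(ordered) + 1)   # 1-based position of the median
    lower = int(depth)                 # whole-number part of the depth
    if depth == lower:
        return ordered[lower - 1]      # odd n: a single middle value
    # even n: average the two values around the fractional depth
    return (ordered[lower - 1] + ordered[lower]) / 2


print(median([3, 4, 4, 5, 5, 5, 8]))              # depth 4 -> 5
print(median([5, 12, 13, 14, 17, 19, 23, 28]))    # depth 4.5 -> 15.5
```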

Mean:-

The mean is the average of a group of numbers and is computed by summing all
the numbers and dividing by the number of numbers. The population mean is
represented by the Greek letter µ. The sample mean is represented by x̄. The
formulas for computing the population mean and the sample mean are given below.

• Population mean:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}$$

• Sample mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

When to use: Use the mean to describe the middle of a set of data that does not have an outlier.

• Advantages:
• Most popular measure in fields such as business, engineering and computer science.
• It is unique - there is only one answer.
• Useful when comparing sets of data.
• Disadvantages:
• Affected by extreme values (outliers)
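
The sensitivity of the mean to outliers can be seen in a short Python sketch. The data values below are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


data = [10, 12, 11, 13, 12]      # hypothetical data without an outlier
with_outlier = data + [100]      # the same data plus one extreme value

print(mean(data))                # 11.6
print(mean(with_outlier))        # about 26.33, pulled up by the outlier
```

A single extreme value more than doubles the mean here, which is why the median is preferred when outliers are present.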

Percentiles:
• They are measures of central tendency that divide a group of data
into 100 parts
• At least n% of the data lie below the nth percentile, and at most
(100 - n)% of the data lie above the nth percentile
• Example: 90th percentile indicates that at least 90% of the data lie
below it, and at most 10% of the data lie above it
• The median and the 50th percentile have the same value.
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
For Calculation:
• Organize the data into an ascending ordered array.
• Calculate the percentile location:

$$i = \frac{P}{100}(n)$$

where P is the percentile of interest and n is the number of values in the data set.
• Determine the percentile's location and its value.
• If i is a whole number, the percentile is the average of the values at
the i and (i+1) positions.
• If i is not a whole number, the percentile is at the position given by the
whole-number part of (i+1) in the ordered array.
FOR EXAMPLE
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile:

$$i = \frac{30}{100}(8) = 2.4$$

• The location index, i, is not a whole number; i+1 = 2.4+1=3.4; the whole
number portion is 3; the 30th percentile is at the 3rd location of the array;
the 30th percentile is 13.
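
The calculation steps above can be sketched in Python. This is a minimal illustration of the location rule described in these notes, not the interpolation method used by most statistics libraries; the function name `percentile` is hypothetical, and the sketch assumes 0 < P < 100.

```python
def percentile(values, p):
    """Percentile by the location rule i = (P / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("this sketch assumes 0 < p < 100")
    ordered = sorted(values)                 # ascending ordered array
    # Split i = p * n / 100 into its whole part and a remainder,
    # avoiding floating-point rounding in the whole-number test.
    whole, remainder = divmod(p * len(ordered), 100)
    whole = int(whole)
    if remainder == 0:
        # i is whole: average the values at positions i and i+1 (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the whole-number part of i+1 (1-based position)
    return ordered[whole]


raw = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(raw, 30))   # i = 2.4, 3rd position -> 13
```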

Quartiles
• Measures of central tendency that divide a group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
• Q1 is equal to the 25th percentile
• Q2 is located at 50th percentile and equals the median
• Q3 is equal to the 75th percentile
• Quartile values are not necessarily members of the data set

E.g.
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1:

$$i = \frac{25}{100}(8) = 2, \qquad Q_1 = \frac{109 + 114}{2} = 111.5$$

• Q2:

$$i = \frac{50}{100}(8) = 4, \qquad Q_2 = \frac{116 + 121}{2} = 118.5$$

• Q3:

$$i = \frac{75}{100}(8) = 6, \qquad Q_3 = \frac{122 + 125}{2} = 123.5$$

Measures of Variability:
Ungrouped Data
• Measures of variability describe the spread or dispersion of a set of data.

3.1 RANGE:
The range is the difference between the highest and lowest observed values.

RANGE = value of highest observation – value of lowest observation

Advantages of range:
• It is easy to understand and to find.
• It is used in quality assurance, where the range is used to construct
control charts.

Disadvantages of range:
• Its usefulness as a measure of dispersion is limited.
• It considers only the highest and lowest values of a distribution.
• It is heavily affected by extreme values.
• It cannot be used for open-ended series.
Example:

The ungrouped data are as follows:

10, 2, 5, 6, 7, 3, 4
The range is: 10 - 2 = 8

3.1 INTERQUARTILE RANGE:

The interquartile range (IQR) is the difference between the values of the third and first
quartiles. It is the range of the middle 50% of the scores in a distribution, so it is less affected
by extreme values than the range.

It is computed as follows:

IQR = 75th percentile - 25th percentile

IQR = Q3 – Q1

E.g. if the 75th percentile is 8 and the 25th percentile is 6, the interquartile
range is 8 - 6 = 2.
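
As a rough sketch, the IQR can be computed for the ordered array used in the quartile example (106, 109, ..., 129). The helper `value_at_percentile` is a hypothetical name and follows the percentile location rule from these notes; statistics libraries often use other interpolation methods and may give slightly different quartiles.

```python
def value_at_percentile(ordered, p):
    """Value at the P-th percentile of an already sorted list (location rule)."""
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # whole-number location: average positions i and i+1 (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # fractional location: whole-number part of i+1 (1-based)
    return ordered[whole]


def interquartile_range(values):
    """IQR = Q3 - Q1 = 75th percentile - 25th percentile."""
    ordered = sorted(values)
    q1 = value_at_percentile(ordered, 25)
    q3 = value_at_percentile(ordered, 75)
    return q3 - q1


data = [106, 109, 114, 116, 121, 122, 125, 129]
print(interquartile_range(data))   # 123.5 - 111.5 = 12.0
```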

3.2 VARIANCE:

Variance in population:

Variability can also be defined in terms of how close the scores in the distribution are to
the middle of the distribution. Using the mean as the measure of the middle of the distribution,
the variance is defined as the average squared difference of the scores from the mean.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where σ² is the variance, μ is the mean, and N is the number of numbers.

Example data:

16, 45, 32, 12, 34, 65, 46, 76

Variance in Sample:

If the variance in a sample is used to estimate the variance in a population, then the previous
formula underestimates the variance and the following formula should be used:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where s² is the estimate of the population variance and x̄ is the sample mean. Since, in practice,
the variance is usually computed in a sample, this formula is most often used.
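
Both definitions can be applied to the example data listed above (16, 45, 32, 12, 34, 65, 46, 76). The sketch below is illustrative; the function names are hypothetical, and the results in the comments were worked out by hand from the two definitions.

```python
def population_variance(values):
    """Average squared difference from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Estimate of the population variance from a sample (divide by n - 1)."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [16, 45, 32, 12, 34, 65, 46, 76]   # mean is 40.75
print(population_variance(data))          # 3437.5 / 8 = 429.6875
print(sample_variance(data))              # 3437.5 / 7, about 491.07
```

The sample formula always gives the larger value, because the same sum of squared differences is divided by n - 1 instead of N.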

Standard deviation:

Population Standard deviation:

• It is the square root of the population variance

$$\sigma = \sqrt{\sigma^2}$$

Sample Standard Deviation:

• It is the square root of the sample variance

$$s = \sqrt{s^2}$$
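
Python's standard `statistics` module provides both versions: `pstdev` (population) and `stdev` (sample). The sketch below reuses the variance example data from these notes; the approximate values in the comments are rounded.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [16, 45, 32, 12, 34, 65, 46, 76]

sigma = pstdev(data)   # population standard deviation, about 20.73
s = stdev(data)        # sample standard deviation, about 22.16

# Each standard deviation is the square root of the matching variance.
print(math.isclose(sigma, math.sqrt(pvariance(data))))   # True
print(math.isclose(s, math.sqrt(variance(data))))        # True
```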

Uses Of Standard Deviation:

• To determine, with a great deal of accuracy, where the values of a distribution are located
in relation to the mean.

• Useful in describing how far individual items in a distribution depart from the
mean of the distribution.
• Indicator of financial risk
• Quality control: construction of quality control charts and process capability
studies
• Comparing populations, e.g. household incomes in two cities or employee
absenteeism at two plants

EMPIRICAL RULE:

• It is an important rule of thumb that is used to state the approximate
percentage of values that lie within a given number of standard deviations
from the mean of a set of data if the data are normally distributed.
• It is also known as 68-95-99.7 rule
• In statistics, the 68-95-99.7 rule, or three-sigma rule, or empirical rule, states
that for a normal distribution, nearly all values lie within 3 standard
deviations of the mean.
• About 68% of the values lie within 1 standard deviation of the mean (or
between the mean minus 1 times the standard deviation, and the mean plus
1 times the standard deviation). In statistical notation, this is represented as:
μ ± σ.
• About 95% of the values lie within 2 standard deviations of the mean (or
between the mean minus 2 times the standard deviation, and the mean plus
2 times the standard deviation). The statistical notation for this is: μ ± 2σ.
• Nearly all (99.7%) of the values lie within 3 standard deviations of the mean
(or between the mean minus 3 times the standard deviation and the mean
plus 3 times the standard deviation). Statisticians use the following notation
to represent this: μ ± 3σ

• This rule is often used to quickly get a rough estimate of something's probability, given
its standard deviation, if the population is assumed normal, thus also as a simple test for
outliers (if the population is assumed normal), and as a normality test (if the population is
potentially not normal).

Range        Population in range
μ ± 1σ       68%
μ ± 2σ       95%
μ ± 3σ       99.7%
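
The three percentages can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `coverage` is a hypothetical choice.

```python
from statistics import NormalDist


def coverage(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints roughly 68.3, 95.4 and 99.7 percent.
    print(f"mu +/- {k} sigma: {coverage(k) * 100:.1f}%")
```

The rule's 68% and 95% are rounded versions of these values.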

CHEBYSHEV’S THEOREM

• Applies to any distribution, regardless of shape
• Places lower limits on the percentages of observations within a given number of standard
deviations from the mean
• At least (1 - 1/k²) of the elements of any distribution lie within k standard deviations of the
mean, for k > 1

CHEBYSHEV'S THEOREM

Number of Standard    Distance From    Minimum Proportion of Values
Deviations            the Mean         Falling Within Distance

k = 2                 μ ± 2σ           1 - 1/2² = 0.75

k = 3                 μ ± 3σ           1 - 1/3² = 0.89

k = 4                 μ ± 4σ           1 - 1/4² = 0.94
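
The bound in the table is a one-line calculation. A minimal sketch follows; the function name `chebyshev_min_proportion` is hypothetical.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    # Prints 0.75, 0.89 and 0.94 when rounded to two decimals.
    print(k, round(chebyshev_min_proportion(k), 2))
```

For k = 2 the bound guarantees at least 75%, compared with about 95% under the empirical rule, because Chebyshev's theorem makes no assumption about the shape of the distribution.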

6. z Scores:-

A z score represents the number of standard deviations a value (x) is
above or below the mean of a set of numbers when the data are
normally distributed. Using z scores allows translation of a value's raw
distance from the mean into units of standard deviations:

$$z = \frac{x - \mu}{\sigma}$$
If the z score is negative, the raw value (x) is below the mean. If
the z score is positive, the raw value (x) is above the mean.
For example, for a data set that is normally distributed with a mean of
50 and a standard deviation of 10, suppose a statistician wants to
determine the z score for a value of 70. The value is 20 units above the
mean, so the z value is

$$z = \frac{70 - 50}{10} = +2.00$$

This z score means that the value is two standard deviations above the
mean. The empirical rule states that 95% of all values are within two
standard deviations of the mean if the data are approximately normally
distributed.
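
The worked example (mean 50, standard deviation 10, value 70) can be reproduced with a short function. The name `z_score` is a hypothetical choice, and the second call uses an illustrative value that does not appear in the notes.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


print(z_score(70, 50, 10))   # 2.0, two standard deviations above the mean
print(z_score(35, 50, 10))   # -1.5, below the mean (illustrative value)
```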

7. Coefficient of Variation:-
The Coefficient of variation is a statistic that is the ratio of the
standard deviation to the mean expressed in percentage.

The coefficient of variation essentially is a relative comparison of a
standard deviation to its mean. The coefficient of variation can be
useful in comparing standard deviations that have been computed
from data with different means.
Suppose five weeks of average prices for stock A are 57, 68, 64,
71 and 62. To compute a coefficient of variation for these prices, first
determine the mean and standard deviation: µ = 64.40 and σ = 4.84.
The coefficient of variation is:

$$CV = \frac{\sigma}{\mu}(100) = \frac{4.84}{64.40}(100) = 7.5\%$$

The standard deviation is 7.5% of the mean.
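
The stock A figures can be reproduced with the standard library. The sketch treats the five prices as a population (so it uses `pstdev`), which is what matches the σ = 4.84 quoted in the text; the function name is hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100


prices = [57, 68, 64, 71, 62]                        # five weeks of prices, stock A
print(round(mean(prices), 2))                        # 64.4
print(round(pstdev(prices), 2))                      # 4.84
print(round(coefficient_of_variation(prices), 1))    # 7.5
```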


Sometimes financial investors use the coefficient of variation or the
standard deviation or both as measures of risk. Imagine a stock with a
price that never changes. An investor bears no risk of losing money from
the price going down because no variability occurs in the price. Suppose,
in contrast, that the price of the stock fluctuates widely. An investor who
buys at a low price and sells at a high price can make a nice profit.
However, if the price drops below what the investor paid for it, the
stock owner is subject to a potential loss. The greater the variability,
the greater the potential for loss. Hence, investors use measures of
variability such as the standard deviation or coefficient of variation to
determine the risk of a stock.

3. Measures of Central Tendency and Variability: Grouped Data


Mean of grouped data
• Weighted average of class midpoints
• Class frequencies are the weights

$$\mu = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \dots + f_i M_i}{f_1 + f_2 + f_3 + \dots + f_i}$$

Mode of Grouped Data


• Midpoint of the modal class
• Modal class has the greatest frequency
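
A short sketch of both grouped-data measures follows. The class intervals and frequencies are hypothetical values invented for illustration, and the function names are not from the original notes.

```python
# Hypothetical frequency table: ((lower limit, upper limit), frequency)
classes = [((0, 10), 3), ((10, 20), 5), ((20, 30), 2)]


def midpoint(interval):
    """Class midpoint M."""
    lower, upper = interval
    return (lower + upper) / 2


def grouped_mean(table):
    """Weighted average of class midpoints, weighted by class frequency."""
    total_frequency = sum(f for _, f in table)
    return sum(f * midpoint(c) for c, f in table) / total_frequency


def grouped_mode(table):
    """Midpoint of the modal class (the class with the greatest frequency)."""
    modal_class, _ = max(table, key=lambda item: item[1])
    return midpoint(modal_class)


print(grouped_mean(classes))   # (3*5 + 5*15 + 2*25) / 10 = 14.0
print(grouped_mode(classes))   # midpoint of the class 10-20 = 15.0
```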

4. Measures of Shape

Skewness

When they are displayed graphically, some distributions of data have many more observations on one
side of the graph than the other. Distributions with most of their observations on the left (toward
lower values) are said to be skewed right; and distributions with most of their observations on the
right (toward higher values) are said to be skewed left.

[Figure: two example distributions, one skewed left and one skewed right]
1. Arithmetic Mean:
• The arithmetic mean of a set of data is the sum of the data values
divided by the number of observations.
If the data set is from a sample, then the sample mean is:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

n = sample size and Σ means "to add".
If the data set is from a population, then the population mean, µ, is:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}$$

N = population size. x̄ is a statistic and μ is a parameter.


• Advantages of Arithmetic Mean:
• Easy to understand.
• Simple to compute.
• Based on all the observations.
• Uniquely defined.
• Disadvantages of Arithmetic Mean:
• Affected by extreme value.
• Unable to compute mean for open-ended classes.
• Tedious to compute

• The Weighted Mean:

When the observations do not all have the same importance, we use the
weighted mean. The weighted mean can be defined as

$$\bar{x}_w = \frac{\sum (w \times x)}{\sum w}$$

where x̄_w represents the weighted mean and w is the weight assigned to
each observation x.
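
A minimal sketch of a weighted mean follows, assuming the usual definition (the sum of weight times value, divided by the sum of the weights). The scores and weights are hypothetical values chosen for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [80, 90, 70]    # hypothetical observations
weights = [2, 3, 5]      # hypothetical importance of each observation

print(weighted_mean(scores, weights))     # (160 + 270 + 350) / 10 = 78.0
print(weighted_mean(scores, [1, 1, 1]))   # equal weights give the ordinary mean, 80.0
```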
