Basic Statistics

1-1
Quantitative Techniques for Geographical Data Analysis
1. Descriptive statistics
2. Spatial data analysis
3. Inferential statistics
4. Correlation & regression
1-2
Chapter 1
1. Descriptive Statistics
1-3
Descriptive Statistics
1.1. Note on descriptive and inferential statistics
1.2. Types of data and some characteristics
1.3. Percentiles, deciles and quartiles
1.4. Measures of central tendency
1.5. Measures of variability
1.6. Methods of data presentation/displaying
Table of frequency distribution
Graphs
Skewness and kurtosis
1-4
LEARNING OBJECTIVES
After studying this chapter, you should be able to:
Distinguish between descriptive and inferential statistics.
Describe nominal, ordinal, interval, and ratio scales of measurement.
Calculate and interpret quartiles, deciles and percentiles.
Explain measures of central tendency and how to compute them.
Create different types of charts that describe data sets.
1-5
1.1. Note on descriptive and inferential statistics
Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
Inferential Statistics
Collect data from population or representative samples
Predict and forecast values of population parameters
Test hypotheses about parameters
Make decisions
1-6
1.2. Types of data and some characteristics
Qualitative (categorical or nominal). Examples are:
Color
Gender
Nationality
Quantitative (measurable or countable). Examples are:
Temperatures
Salaries
Number of points scored on a 100-point exam
1-7
Some Characteristics of Data
Not all data is the same. There are some limitations as to what can and cannot be done with a data set, depending on the characteristics of the data.
Some key characteristics that must be considered are:
A. Continuous vs. Discrete
B. Grouped vs. Individual
C. Scale of Measurement
1-8
A. Continuous vs. Discrete Data
Continuous data can take any value (i.e., real numbers),
e.g., 1, 1.43, and 3.1415926 are all acceptable values.
Geographic examples: distance, tree height, amount of precipitation, etc.
Discrete data consist only of separate values, and the numbers in between those values are not defined (i.e., whole or integer numbers),
e.g., 1, 2, 3.
1-9
B. Grouped vs. Individual Data
The distinction between individual and grouped data is somewhat self-explanatory, but the issue pertains to the effects of grouping data.
While a family income value is collected for each household (individual data), for the purpose of analysis it is transformed into a set of classes
(e.g., 80 Birr/hh vs. 0-100 Birr, 100-200 Birr, 200-300 Birr, etc.)
1-10
B. Grouped vs. Individual Data (continued)
In grouped data, the raw individual data are categorized into several classes and then analyzed.
The act of grouping the data, by taking the central value of each class, can introduce a significant distortion.
Grouping always reduces the amount of information contained in the data.
1-11
C. Scales of Measurement
The data used in statistical analyses can be divided into four types:
1. The Nominal Scale
2. The Ordinal Scale
3. The Interval Scale
4. The Ratio Scale
As we progress through these scales, the types of data they describe have increasing information content.
1-12
The Nominal Scale
Nominal scale data are data that can simply be broken down into categories, i.e., having to do with names or types.
The categories cannot be ranked or ordered (no greater/less than).
It can appear in the form of:
* Dichotomous or binary
* Multichotomous
1-13
The Nominal Scale (continued)
Dichotomous or binary nominal data has just two types, e.g., yes/no, female/male, is/is not, hot/cold, etc.
Multichotomous data has more than two types, e.g., vegetation types, soil types, counties, eye color, etc.
1-14
The Ordinal Scale
Ordinal scale data can be categorized AND can be placed in an order, i.e., the categories can be assigned a relative importance and can be ranked, so that numerical category values carry an order, e.g., star-system restaurant rankings:
5 stars > 4 stars, 4 stars > 3 stars, 5 stars > 2 stars
BUT ordinal data still are not scalar, in the sense that differences between categories do not have a quantitative meaning.
1-15
The Interval Scale
Interval scale data take the notion of ranking items in order one step further, since the distances between adjacent points on the scale are equal.
For instance, the Fahrenheit scale is an interval scale, since each degree is equal but there is no absolute zero point.
This means that although we can add and subtract degrees (100 is 10 warmer than 90), we cannot multiply values or create ratios (100 is not twice as warm as 50).
1-16
The Ratio Scale
Similar to the interval scale, but with the addition of having a meaningful zero value, which allows us to compare values using multiplication and division operations, e.g., precipitation, weights, heights, etc.
e.g., rain: we can say that 2 cm of rain is twice as much rain as 1 cm of rain because this is a ratio scale measurement.
e.g., age: a 100-year-old person is indeed twice as old as a 50-year-old one.
1-17
1.3 Quartiles, Deciles and Percentiles
1-18
Quartiles (for raw data)
If a set of data is arranged in order of magnitude, quartiles divide it at every 25% of the observations.
There are only 3 quartiles.
The values are denoted by:
Q1, the 1st or lower quartile (the value below which the first 25% of the observations lie),
Q2, the 2nd quartile (the value that divides the observations into two halves of 50%), and
Q3, the 3rd quartile (the value below which 75% of the observations lie).
1-19
Example for raw data: Sales and Sorted Sales
Sales   Sorted Sales
9       6
6       9
12      10
10      12
13      13
15      14
16      14
14      15
14      16
16      16
17      16
16      17
24      17
21      18
22      18
18      19
19      20
18      21
20      22
17      24
1-20
Quartiles (contd)
This method of data classification is important when we want to describe the observations in four groups according to their order of magnitude.
The serial number (position) of the first or lower quartile, Q1, is computed by
$\frac{1}{4}(N + 1)$th
To find the position of Q1 for the sales data (N = 20), determine the data point in position
$\frac{1}{4}(20 + 1) = 21/4 = 5.25$th
1-21
Thus, Q1 is located at the 5.25th position.
The 5th observation is 13, and the 6th observation is 14.
Q1 is a point lying 0.25 of the way from 13 to 14 and is thus 13 + (14 - 13) × 0.25 = 13.25.
1-22
The serial number of the 2nd quartile (Q2) is computed by
$\frac{2}{4}(N + 1) = [2 \times 21]/4 = 42/4 = 10.5$th
Thus, Q2 is located at the 10.5th position.
The 10th observation is 16, and the 11th observation is also 16.
Thus Q2 lies halfway between the 10th and 11th values (which are both 16 in this case) and is thus 16.
1-23
The serial number of the 3rd quartile (Q3) is computed by
$\frac{3}{4}(N + 1) = [3 \times 21]/4 = 63/4 = 15.75$th
Thus, Q3 is located at the 15.75th position.
The 15th observation is 18, and the 16th observation is 19.
Thus Q3 lies 0.75 of the way from 18 to 19 and is thus 18 + (19 - 18) × 0.75 = 18.75.
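The position rule used for Q1, Q2 and Q3 can be sketched in Python. This is a minimal illustration, assuming the (N + 1) position rule with linear interpolation between neighbouring observations; the function name `partition_value` and its parameters are hypothetical, chosen only for this example.

```python
from fractions import Fraction

def partition_value(data, k, parts):
    """Return the k-th partition value using the (N + 1) position rule.

    Hypothetical helper: parts=4 gives quartiles, 10 deciles, 100 percentiles.
    """
    ordered = sorted(data)
    n = len(ordered)
    # 1-based serial number of the partition value, e.g. 21/4 = 5.25 for Q1.
    position = Fraction(k * (n + 1), parts)
    whole = position.numerator // position.denominator
    fraction = position - whole
    # Assumption: positions outside 1..n fall back to the extreme observations.
    if whole < 1:
        return float(ordered[0])
    if whole >= n:
        return float(ordered[-1])
    lower, upper = ordered[whole - 1], ordered[whole]
    # Move `fraction` of the way from the lower to the upper observation.
    return float(lower + fraction * (upper - lower))

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

q1 = partition_value(sales, 1, 4)
q2 = partition_value(sales, 2, 4)
q3 = partition_value(sales, 3, 4)
print(q1, q2, q3)  # 13.25 16.0 18.75
```

Exact fractions are used for the position so that values such as 5.25 are not affected by floating-point rounding before the interpolation step.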
1-24
Deciles (for raw data)
This method of data classification is important when we want to describe the observations in 10 groups according to their order of magnitude.
There are nine values called deciles which divide the distribution at every 10%.
Deciles divide the total distribution into 10 equal parts.
These values are denoted by D1, D2, D3, ..., D9.
1-25
Deciles (contd)
The serial number of any partition value, such as the Kth decile, can be computed by:
$\frac{K}{10}(N + 1)$th
1-26
Percentiles (for raw data)
There are ninety-nine values called percentiles which divide the distribution at every 1%.
Percentiles divide the total distribution into 100 equal parts.
These values are denoted by P1, P2, P3, ..., P99.
1-27
Percentiles (contd)
The serial number of any partition value, such as the Kth percentile, can be computed by
$\frac{K}{100}(N + 1)$th
1-28
Example: Percentiles (contd)
Find the 50th, 80th, and the 90th percentiles of this data set.
To find the 50th percentile, determine the data point in position (n + 1)P/100 = (20 + 1)(50/100) = 10.5.
Thus, the percentile is located at the 10.5th position.
The 10th observation is 16, and the 11th observation is also 16.
The 50th percentile will lie halfway between the 10th and 11th values (which are both 16 in this case) and is thus 16.
1-29
To find the 80th percentile, determine the data point in position (n + 1)P/100 = (20 + 1)(80/100) = 16.8.
Thus, the percentile is located at the 16.8th position.
The 16th observation is 19, and the 17th observation is 20.
The 80th percentile is a point lying 0.8 of the way from 19 to 20 and is thus 19.8.
1-30
To find the 90th percentile, determine the data point in position (n + 1)P/100 = (20 + 1)(90/100) = 18.9.
Thus, the percentile is located at the 18.9th position.
The 18th observation is 21, and the 19th observation is 22.
The 90th percentile is a point lying 0.9 of the way from 21 to 22 and is thus 21.9.
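The percentile results above can be cross-checked with Python's standard library. This is a sketch, assuming Python 3.8 or later: `statistics.quantiles` with `method="exclusive"` places the i-th of n cut points at position i(N + 1)/n and interpolates linearly, which matches the rule used in this example. The variable names are illustrative.

```python
import statistics

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

# n=100 returns the 99 percentile cut points P1..P99 (index 0 is P1).
percentiles = statistics.quantiles(sales, n=100, method="exclusive")
# n=10 returns the 9 decile cut points D1..D9 (index 0 is D1).
deciles = statistics.quantiles(sales, n=10, method="exclusive")

p50, p80, p90 = percentiles[49], percentiles[79], percentiles[89]
d5 = deciles[4]
print(p50, p80, p90, d5)
```

The 5th decile and the 50th percentile returned here are the same value, the median of the data.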
1-31
Relationship among quartiles, deciles & percentiles
The 2nd quartile, 5th decile and 50th percentile all correspond to the median.
The 25th and 75th percentiles correspond to the 1st and 3rd quartiles respectively.
1-32
Quartiles: Special Percentiles
Quartiles are the percentage points that break down the ordered data set into quarters.
The first quartile is the 25th percentile. It is the point below which lie 1/4 of the data.
The second quartile is the 50th percentile. It is the point below which lie 1/2 of the data. This is also called the median.
The third quartile is the 75th percentile. It is the point below which lie 3/4 of the data.
1-33
Quartiles and Interquartile Range
The first quartile, Q1 (25th percentile), is often called the lower quartile.
The second quartile, Q2 (50th percentile), is often called the median or the middle quartile.
The third quartile, Q3 (75th percentile), is often called the upper quartile.
The interquartile range is the difference between the third and the first quartiles, Q3 - Q1.
1-34
Quartiles for Grouped data
In grouped frequency data, any partition value below which a given proportion of the observations lies (Q1, Q2 or Q3) is calculated by interpolation. For Q1 the formula is:
$Q_1 = L + \frac{\frac{1}{4}(N + 1) - (\sum f)_1}{f_{Q_1}} \times c$
1-35
Contd
Where,
L = lower class boundary of the Q1 class
N = number of observations in the data (total frequency)
$(\sum f)_1$ = sum of frequencies of all classes lower than the Q1 class
$f_{Q_1}$ = frequency of the Q1 class
c = size of the class interval
1-36
Contd
For example, a grouped frequency distribution of the monthly income of X-factory's employees:
Class   Class interval (monthly salary in Dollars)   Frequency (f)
1       30-39                                        1
2       40-49                                        3
3       50-59                                        11
4       60-69                                        21
5       70-79                                        43
6       80-89                                        32
7       90-100                                       9
Total                                                120
1-37
Contd
Calculate the lower quartile (Q1) of the distribution of monthly salary and interpret the result.
1-38
Contd Contd
8 . 66 10 *
21
15 ) 1 120 (
4
1
5 . 59
1
=
+
+ = Q 8 . 66 10 *
21
15 ) 1 120 (
4
1
5 . 59
1
=
+
+ = Q
Interpretation:
About 25% of the X-factorys
employees monthly salary is up to
66.8 Dollars or lower
1-39
Percentiles for Grouped data
$P_p = L + \frac{p(N + 1) - (\sum f)_1}{f_p} \times c$
Calculate the 60th percentile of the distribution of monthly salary of employees and interpret the result.
1-40
Contd Contd
01 . 78 10 *
43
36 ) 1 120 ( 60 . 0
5 . 69 60 =
+
+ = p 01 . 78 10 *
43
36 ) 1 120 ( 60 . 0
5 . 69 60 =
+
+ = p
Interpretation:
About 60% of the X-factorys
employees monthly salary is up to
78.01 Dollars and less
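The grouped-data interpolation for Q1 and the 60th percentile can be sketched as one Python function. This is an illustration only: the name `grouped_partition_value` is hypothetical, lower class boundaries are assumed to be the lower class limits minus 0.5, and the class width is taken as 10, as in the worked calculations.

```python
def grouped_partition_value(classes, proportion, width):
    """Interpolate the value below which `proportion` of observations lie.

    classes: list of (lower_class_boundary, frequency) in ascending order.
    Uses the position proportion * (N + 1), as in the worked examples.
    """
    total = sum(freq for _, freq in classes)
    position = proportion * (total + 1)
    cumulative = 0  # sum of frequencies of all classes below the current one
    for lower_boundary, freq in classes:
        if cumulative + freq >= position:
            # The partition value falls inside this class: interpolate.
            return lower_boundary + (position - cumulative) / freq * width
        cumulative += freq
    raise ValueError("position falls beyond the last class")

# Monthly salary table: (lower class boundary, frequency), N = 120.
salary_classes = [(29.5, 1), (39.5, 3), (49.5, 11), (59.5, 21),
                  (69.5, 43), (79.5, 32), (89.5, 9)]

q1 = grouped_partition_value(salary_classes, 0.25, 10)
p60 = grouped_partition_value(salary_classes, 0.60, 10)
print(round(q1, 2), round(p60, 2))  # 66.76 78.01
```

The unrounded Q1 is about 66.76, which the worked calculation reports as 66.8.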
1-41
Summary Measures: Population Parameters & Sample Statistics
Measures of Central Tendency
Median
Mode
Mean
Measures of Variability
Range
Interquartile range
Variance
Standard Deviation
Coefficient of variation (CV)
Other summary measures:
Skewness
Kurtosis
1-42
1.4 Measures of Central Tendency or Location
Median: middle value when sorted in order of magnitude (the 50th percentile)
Mode: most frequently occurring value
Mean: average
1-43
Example: Median (data is used from Example 1-1)
The median is the middle value of raw data sorted in order of magnitude.
1-44
Median for grouped data
In the case of grouped data the median is obtained by interpolation:
$\text{Median} = L + \left(\frac{\frac{N}{2} - (\sum f)_1}{f_{median}}\right) \times c$
1-45
Contd
Where,
L = lower class boundary of the median class
N = number of observations in the data (total frequency)
$(\sum f)_1$ = sum of frequencies of all classes lower than the median class
$f_{median}$ = frequency of the median class
c = size of the median class interval
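As a hedged illustration of the grouped median, the sketch below applies the interpolation to the monthly salary table from the grouped quartile example. It assumes the median position is N/2, lower class boundaries equal to the lower class limits minus 0.5, and a class width of 10; the function name `grouped_median` is hypothetical.

```python
def grouped_median(classes, width):
    """Median of grouped data by interpolation within the median class.

    classes: list of (lower_class_boundary, frequency) in ascending order.
    """
    total = sum(freq for _, freq in classes)
    half = total / 2  # assumed position of the median: N / 2
    cumulative = 0    # frequencies of all classes below the current one
    for lower_boundary, freq in classes:
        if cumulative + freq >= half:
            return lower_boundary + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("classes must contain at least one observation")

salary_classes = [(29.5, 1), (39.5, 3), (49.5, 11), (59.5, 21),
                  (69.5, 43), (79.5, 32), (89.5, 9)]

median_salary = grouped_median(salary_classes, 10)
print(round(median_salary, 2))
```

Here the median class is 70-79, because the cumulative frequency first reaches N/2 = 60 in that class (36 observations lie below it).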
1-46
Mode
The mode of a set of data is the value that occurs with the greatest frequency.
It represents the most common value.
Note that the mode as an average may be used when a frequency distribution represents data measured only on a nominal scale.
1-47
Mode in array data
In array data the mode may not exist, and if it does exist it may not be unique. For example:
The monthly income (in Birr) of 9 employees of a small private business, 800, 950, 1200, 1300, 2000, 2500, 2800, 2900, 3000, has no mode.
1-48
Contd
The monthly income (in Birr) of 10 teachers of one elementary school, 650, 700, 700, 700, 700, 800, 950, 950, 1000, 1050, has mode 700 Birr.
A distribution with one mode is called unimodal.
The monthly income (in Birr) of 10 farmers, 100, 250, 250, 250, 300, 350, 400, 400, 400, 500, has two modes, 250 and 400, and the distribution of the variable is bimodal.
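The three income examples can be checked with a short Python sketch. The helper `modes` is hypothetical and follows the convention used in these examples: when every value occurs only once, the data are treated as having no mode.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modes; an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so no value is most frequent
    return sorted(v for v, c in counts.items() if c == highest)

employees = [800, 950, 1200, 1300, 2000, 2500, 2800, 2900, 3000]
teachers = [650, 700, 700, 700, 700, 800, 950, 950, 1000, 1050]
farmers = [100, 250, 250, 250, 300, 350, 400, 400, 400, 500]

print(modes(employees))  # []          no mode
print(modes(teachers))   # [700]       unimodal
print(modes(farmers))    # [250, 400]  bimodal
```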
1-49
Mode for grouped data
For a frequency distribution of grouped data or a histogram, the mode can be computed using the formula:
$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c$
1-50
Continued
Where,
$L_1$ = lower class boundary of the modal class (class containing the mode)
$\Delta_1$ = excess of modal frequency over frequency of the next lower class
$\Delta_2$ = excess of modal frequency over frequency of the next higher class
c = size of the modal class interval
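A minimal Python sketch of the grouped mode formula follows, applied to the monthly salary table from the grouped quartile example. The function name `grouped_mode` is hypothetical; the sketch assumes a single modal class, lower class boundaries equal to the lower class limits minus 0.5, a class width of 10, and a frequency of 0 for a missing neighbouring class.

```python
def grouped_mode(lower_boundaries, frequencies, width):
    """Mode of grouped data: L1 + delta1 / (delta1 + delta2) * c."""
    # Index of the modal class (the class with the highest frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    below = frequencies[i - 1] if i > 0 else 0
    above = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    delta1 = frequencies[i] - below  # excess over the next lower class
    delta2 = frequencies[i] - above  # excess over the next higher class
    return lower_boundaries[i] + delta1 / (delta1 + delta2) * width

lower_boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
frequencies = [1, 3, 11, 21, 43, 32, 9]

salary_mode = grouped_mode(lower_boundaries, frequencies, 10)
print(round(salary_mode, 2))
```

For this table the modal class is 70-79 (frequency 43), so the two excesses are 43 - 21 = 22 and 43 - 32 = 11.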
1-51
Means
There are four types of means:
Arithmetic mean
Weighted arithmetic mean
Geometric mean
Harmonic mean
1-52
Arithmetic Mean or Average
The mean of a set of observations is their average: the sum of the observed values divided by the number of observations.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
1-53
Weighted arithmetic mean
Sometimes collected data may not have equal weights.
In such cases weighting of data falling under different classes or categories may be important, and thus a weighting factor (w) has to be applied using the formula:
$\bar{X}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_k x_k}{w_1 + w_2 + \dots + w_k} = \frac{\sum w_i x_i}{\sum w_i}$
1-54
Arithmetic mean for grouped data
The arithmetic mean of grouped data is computed using class marks (m) and frequency distributions, by assuming that all observations of a given class coincide with the class mark or midpoint of the interval, using the formula:
$\bar{X} = \frac{m_1 f_1 + m_2 f_2 + \dots + m_k f_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum m f}{\sum f}$
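The grouped mean is a weighted mean in which the class marks are the values and the class frequencies are the weights. The Python sketch below illustrates this with the monthly salary table from the grouped quartile example; the function name `weighted_mean` is hypothetical, and each class mark is assumed to be the midpoint of the stated class limits.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Class limits and frequencies of the monthly salary table.
class_limits = [(30, 39), (40, 49), (50, 59), (60, 69),
                (70, 79), (80, 89), (90, 100)]
frequencies = [1, 3, 11, 21, 43, 32, 9]

# Class mark = midpoint of the class limits (an assumption of this sketch).
class_marks = [(low + high) / 2 for low, high in class_limits]

grouped_mean = weighted_mean(class_marks, frequencies)
print(grouped_mean)  # 8884.5 / 120 = 74.0375
```

With equal weights the same function reduces to the ordinary arithmetic mean.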
1-55
Empirical relation among Mean, Median and Mode
In the case of a symmetrical unimodal distribution the relation is:
Mean = Mode = Median.
For unimodal frequency curves which are moderately skewed (asymmetrical) the empirical relation is:
$\text{Mean} - \text{Mode} = 3(\text{Mean} - \text{Median})$
1-56
Which one is better: mean, median, or mode?
The mean is valid only for interval and ratio data.
The median is valid for ordinal, interval and ratio data.
The mode is valid for nominal, ordinal, interval, and ratio data.
Median & mode are the only measures of central tendency that can be used with ordinal data.
Mode is the only measure of central tendency that can be used with nominal data.
1-57
1.5 Measures of Variability or Dispersion
Range
Difference between maximum and minimum values
Interquartile Range
Difference between third and first quartile (Q3 - Q1)
Variance
Average* of the squared deviations from the mean
Standard Deviation
Square root of the variance
* Definitions of population variance and sample variance differ slightly.
1-58
Variance and Standard Deviation
Population variance & standard deviation:
$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}, \qquad \sigma = \sqrt{\sigma^2}$
Sample variance & standard deviation:
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}, \qquad s = \sqrt{s^2}$
1-59
Calculation of Sample Variance & Standard Deviation ($\bar{x} = 15.85$)
x      x - x̄     (x - x̄)²    x²
6      -9.85     97.0225     36
9      -6.85     46.9225     81
10     -5.85     34.2225     100
12     -3.85     14.8225     144
13     -2.85     8.1225      169
14     -1.85     3.4225      196
14     -1.85     3.4225      196
15     -0.85     0.7225      225
16     0.15      0.0225      256
16     0.15      0.0225      256
16     0.15      0.0225      256
17     1.15      1.3225      289
17     1.15      1.3225      289
18     2.15      4.6225      324
18     2.15      4.6225      324
19     3.15      9.9225      361
20     4.15      17.2225     400
21     5.15      26.5225     441
22     6.15      37.8225     484
24     8.15      66.4225     576
317    0         378.5500    5403   (totals)
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{378.55}{20 - 1} = \frac{378.55}{19} = 19.923684$
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{5403 - \frac{317^2}{20}}{20 - 1} = \frac{5403 - \frac{100489}{20}}{19} = \frac{5403 - 5024.45}{19} = \frac{378.55}{19} = 19.923684$
$s = \sqrt{s^2} = \sqrt{19.923684} = 4.46$
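The two routes to the sample variance in this calculation (squared deviations from the mean, and the shortcut using the sum of squares) can be reproduced with a few lines of Python. This is a sketch using the sorted sales data from the table; the variable names are illustrative, and `statistics.variance` is used only as an independent check.

```python
import math
import statistics

sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

n = len(sales)
mean = sum(sales) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in sales) / (n - 1)

# Shortcut: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
s2_shortcut = (sum(x * x for x in sales) - sum(sales) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
print(round(mean, 2), round(s2_definition, 6), round(s, 2))
# 15.85 19.923684 4.46
```

Both routes give the same variance up to floating-point rounding, and both agree with `statistics.variance(sales)`.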
1-60
Standard Deviation: Grouped frequencies
Standard deviation of grouped data can be calculated using:
$s = \sqrt{\frac{\sum f_i m_i^2}{\sum f} - (\bar{x})^2}$
Where,
$m_i$ : is class midpoint (class mark)
$f_i$ : is frequency
$\bar{x}$ : sample mean
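The grouped standard deviation formula can be sketched in Python as follows, using the monthly salary table from the grouped quartile example. The function name `grouped_std` is hypothetical and the class marks are assumed to be the midpoints of the stated class limits. Note that this form divides by the total frequency, not by the total frequency minus one.

```python
import math

def grouped_std(class_marks, frequencies):
    """Standard deviation of grouped data: sqrt(sum(f*m^2)/sum(f) - mean^2)."""
    total = sum(frequencies)
    mean = sum(f * m for m, f in zip(class_marks, frequencies)) / total
    mean_of_squares = sum(f * m * m for m, f in zip(class_marks, frequencies)) / total
    return math.sqrt(mean_of_squares - mean ** 2)

# Class marks (assumed midpoints) and frequencies of the salary table.
class_marks = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 95.0]
frequencies = [1, 3, 11, 21, 43, 32, 9]

salary_std = grouped_std(class_marks, frequencies)
print(round(salary_std, 2))
```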
1-61
Coefficient of variation
The actual variation or dispersion as determined from the standard deviation is called the absolute dispersion.
Absolute dispersion alone cannot tell how large the variability is relative to the size of the values.
Thus, this effect is measured by the relative dispersion or coefficient of variation,
which is generally expressed as a percentage:
$CV = \frac{s}{\bar{X}} \times 100$
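As a small illustration, the coefficient of variation of the sales data from the variance example can be computed as below. The function name `coefficient_of_variation` is hypothetical, and the sketch assumes the sample standard deviation (n - 1 in the denominator).

```python
import statistics

def coefficient_of_variation(values):
    """Relative dispersion in percent: sample std / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

cv = coefficient_of_variation(sales)
print(round(cv, 1))  # about 28.2 (percent)
```

Because the result is a percentage with no unit, it can be used to compare the variability of data sets measured in different units or with very different means.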
1-62
1.6 Methods of Data Presentation
Methods of data presentation/displaying
Graphs
Line graphs:
Ogives
Time plot
Pie charts
Bar graphs
Skewness and Kurtosis
1-63
Frequency distribution
Dividing data into groups or classes or intervals
Groups should be:
Mutually exclusive
Not overlapping - every observation is assigned to only one group
Exhaustive
Every observation is assigned to a group
Equal-width (if possible)
1-64
Frequency distribution
A large set of raw data has to be organized into classes or categories, each containing the individuals that belong to that class.
The number of individuals in a given class is known as the class frequency.
A tabular arrangement of data by classes together with the corresponding class frequencies is called a frequency distribution or frequency table.
1-65
General rules for forming frequency distribution
1. Identify the largest and the smallest numbers in the raw data and thus find the range.
For example, the largest number of the raw data in Table 2.1 is 4887, whilst the smallest number is 950.
The range is 4887 - 950 = 3937.
1-66
Contd
2. Divide the range by a convenient number of classes.
For our example 10 classes are used.
You may have a different number of classes depending on the nature and size of the data.
Thus, (class width)
$$ \frac{\text{Range}}{\text{Classes}} = \frac{3937}{10} = 393.7 \approx 394 $$
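Steps 1 and 2 can be sketched as follows. The largest and smallest values are those quoted from Table 2.1; rounding the width up with `math.ceil` is an assumption made here (the slide only shows 393.7 rounded to 394), chosen so that the classes are wide enough to cover the whole range.

```python
import math

largest = 4887     # largest value in Table 2.1
smallest = 950     # smallest value in Table 2.1
num_classes = 10   # number of classes chosen for the example

# Step 1: range of the raw data
data_range = largest - smallest

# Step 2: class width, rounded up to a whole number
class_width = math.ceil(data_range / num_classes)

print(data_range, data_range / num_classes, class_width)
```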
1-67
Contd
3. Determine the class interval for the 10 classes.
Common practice to determine class limits is as:
$$ \mathrm{lcl} = \mathrm{SRD} - 1 $$
$$ \mathrm{ucl} = (\mathrm{lcl} - 1) + \mathrm{Cw} $$
Where,
SRD is smallest number of the raw data,
lcl is lower class limit,
ucl is upper class limit
Cw is class width.
1-68
Contd
In our example:
SRD = 950 and Cw is 394.
Thus, the lower class limit of the 1st class is 950 - 1 = 949, and
the upper class limit of the same class is [(949 - 1) + 394] = 1342
(Table 2.3).
1-69
Contd
For other successive classes:
Build the lower class limit of the next higher class by adding the class width (Cw) to the lower class limit of the preceding class, and
build the upper class limit of the next higher class by adding the class width to the upper class limit of the preceding class.
For example, the class limits of the first class are
949 - 1342
The next class limits are
(949 + 394) = 1343 (lower class limit of the 2nd class)
(1342 + 394) = 1736 (upper class limit of the 2nd class)
1343 - 1736
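The rule for building successive class limits can be sketched as a short loop. The function name `class_limits` is hypothetical; the inputs (SRD = 950, Cw = 394, 10 classes) are the values of the running example.

```python
def class_limits(smallest, class_width, num_classes):
    """Build (lower, upper) class limits following the rule on the slides."""
    lower = smallest - 1                 # lcl = SRD - 1
    upper = (lower - 1) + class_width    # ucl = (lcl - 1) + Cw
    limits = []
    for _ in range(num_classes):
        limits.append((lower, upper))
        # Each further class shifts both limits by one class width
        lower += class_width
        upper += class_width
    return limits

limits = class_limits(smallest=950, class_width=394, num_classes=10)
for k, (lcl, ucl) in enumerate(limits, start=1):
    print(f"class {k:2d}: {lcl} - {ucl}")
```

The last upper limit should be at least as large as the largest observation (4887), otherwise the classes would not be exhaustive.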
1-70
Table 2.1. Raw data of agricultural production (kg ha-1 yr-1) (an example)
Plot code  Yield (kg ha-1)  Plot code  Yield (kg ha-1)  Plot code  Yield (kg ha-1)  Plot code  Yield (kg ha-1)
1          950              20         1258             39         3689             58         2058
2          1250             21         1509             40         3824             59         1020
3          1504             22         2051             41         4823             60         3687
4          2058             23         1028             42         1886             61         3891
5          1020             24         3681             43         1382             62         4230
6          3687             25         3820             44         4666             63         3228
7          3891             26         4875             45         1785             64         3468
8          4887             27         1825             46         2423             65         4356
9          1895             28         1345             47         3271             66         2598
10         1324             29         4699             48         4430             67         1050
11         4657             30         1735             49         3228             68         1258
12         1765             31         2412             50         3468             69         4130
13         2456             32         3285             51         4356             70         3228
14         3214             33         4145             52         2598             71         3468
15         4167             34         3230             53         1050             72         4356
16         3264             35         3483             54         1258             73         2598
17         3478             36         4099             55         1509             74         1050
18         4052             37         2568             56         2051             75         1258
19         2567             38         990              57         1028             76         1735
1-71
Assignment 1.
Build a frequency table for the agricultural yield shown in the previous slide.
The number of classes of the frequency distribution should be 10
Express the class intervals as class boundaries
Construct the class marks for each class
What is the upper class boundary of the 3rd class?
What is the lower class boundary of class 7?
What is the frequency of the 2nd class?
1-72
Frequency distribution can appear in
Simple frequency
Cumulative frequency
Relative cumulative frequency, or
Relative cumulative frequency percentage
1-73
Example: Frequency Distribution
x                   f(x)
Spending Class ($)  Frequency (number of customers)
0 - 100             30
100 - 200           38
200 - 300           50
300 - 400           31
400 - 500           22
500 - 600           13
Total               184
1-74
Example: Relative Frequency Distribution
x                   f(x)                             f(x)/n
Spending Class ($)  Frequency (number of customers)  Relative Frequency
0 - 100             30                               0.163
100 - 200           38                               0.207
200 - 300           50                               0.272
300 - 400           31                               0.168
400 - 500           22                               0.120
500 - 600           13                               0.071
Total               184                              1.000
Example of relative frequency for the 1st class: 30/184 = 0.163
Sum of relative frequencies = 1 (the rounded values shown add up to 1.001 because of rounding)
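The relative frequency column can be reproduced with a short sketch, dividing each class frequency by the total number of customers. The variable names are illustrative.

```python
classes = ["0 - 100", "100 - 200", "200 - 300",
           "300 - 400", "400 - 500", "500 - 600"]
freqs = [30, 38, 50, 31, 22, 13]

n = sum(freqs)                  # total number of customers
rel = [f / n for f in freqs]    # relative frequency f(x)/n of each class

for c, f, r in zip(classes, freqs, rel):
    print(f"{c:>9}  {f:3d}  {r:.3f}")
print(f"{'Total':>9}  {n:3d}  {sum(rel):.3f}")
```

The unrounded relative frequencies always sum to 1; small discrepancies appear only after rounding each value for display.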
1-75
Continued
Cumulative frequency (up to/less than)
Spending ($)  Cumulating up  Relative cumulative less than  Relative cumulative percentage less than
< 0           0              0.000                          0.0
< 100         30             0.163                          16.3
< 200         68             0.370                          37.0
< 300         118            0.641                          64.1
< 400         149            0.810                          81.0
< 500         171            0.929                          92.9
< 600         184            1.000                          100.0
Cumulative frequency (down/greater than)
Spending ($)  Cumulating down  Relative cumulative greater than  Relative cumulative percentage greater than
> 0           184              1.000                             100.0
> 100         154              0.837                             83.7
> 200         116              0.630                             63.0
> 300         66               0.359                             35.9
> 400         35               0.190                             19.0
> 500         13               0.071                             7.1
> 600         0                0.000                             0.0
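Both cumulative columns can be derived from the simple frequencies. A minimal sketch, assuming the class boundaries 0, 100, ..., 600 of the spending example: the "less than" counts are running totals, and the "greater than" counts are what remains of the total.

```python
from itertools import accumulate

boundaries = [0, 100, 200, 300, 400, 500, 600]
freqs = [30, 38, 50, 31, 22, 13]
n = sum(freqs)

# Cumulating up: customers below each boundary (nobody is below 0)
cum_less = [0] + list(accumulate(freqs))

# Cumulating down: customers at or above each boundary
cum_greater = [n - c for c in cum_less]

for b, lt, gt in zip(boundaries, cum_less, cum_greater):
    print(f"< {b:3d}: {lt:3d} ({lt / n:6.1%})    > {b:3d}: {gt:3d} ({gt / n:6.1%})")
```

At every boundary the two counts add up to the total of 184 customers, which is a quick check on a hand-built table.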
1-76
Less than cumulative frequency
It is the total frequency of all values from the lowest value up to (less than) the upper class boundary of a given class interval, including the frequency of that class.
For example, the cumulative frequency up to (less than) and including the class interval 300 - 400 in the spending table is 30 + 38 + 50 + 31 = 149, indicating that 149 customers spent less than $400.
1-77
Greater than cumulative frequency
It is the total frequency of all values from the highest value down to (more than) the lower class boundary of a given class interval, including the frequency of that class.
For example, the cumulative frequency down to (more than) and including the class interval 300 - 400 in the spending table is 13 + 22 + 31 = 66, indicating that 66 customers spent more than $300.
1-78
Contd
A graph showing the cumulative frequency less than the upper class boundary plotted against the upper class boundary is known as a less than cumulative frequency polygon, or ogive.
1-79
Continued
[Figure: less than ogive. x-axis: Spending ($), from < 0 to < 600; y-axis: Customers (number), from 0 to 200]
Figure __. Cumulative frequency distribution (less than): Ogives
1-80
Continued
A graph showing the cumulative frequency greater than the lower class boundary plotted against the lower class boundary is known as a greater than cumulative frequency polygon, or ogive.
1-81
Continued
[Figure: greater than ogive. x-axis: Spending ($), from > 0 to > 600; y-axis: Customers (number), from 0 to 200]
Figure __. Cumulative frequency distribution (greater than): Ogives
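The two ogives can be redrawn from the spending data with a sketch like the one below. The function names are hypothetical, and the plotting part assumes the matplotlib library is installed; it is imported only inside `plot_ogives`, so the points themselves can be computed without it.

```python
from itertools import accumulate

def ogive_points(boundaries, freqs):
    """Return the (boundary, count) points of the less than and greater than ogives."""
    n = sum(freqs)
    less = [0] + list(accumulate(freqs))   # cumulating up
    greater = [n - c for c in less]        # cumulating down
    return list(zip(boundaries, less)), list(zip(boundaries, greater))

def plot_ogives(boundaries, freqs):
    """Draw both ogives on one chart (requires matplotlib)."""
    import matplotlib.pyplot as plt
    less, greater = ogive_points(boundaries, freqs)
    fig, ax = plt.subplots()
    ax.plot(*zip(*less), marker="o", label="less than")
    ax.plot(*zip(*greater), marker="s", label="greater than")
    ax.set_xlabel("Spending ($)")
    ax.set_ylabel("Customers (number)")
    ax.legend()
    return fig

less, greater = ogive_points([0, 100, 200, 300, 400, 500, 600],
                             [30, 38, 50, 31, 22, 13])
print(less[4], greater[3])
```

Calling `plot_ogives(...)` followed by matplotlib's `show()` would display the chart; the less than curve rises from 0 to 184 while the greater than curve falls from 184 to 0.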
1-82
Assignment 2
Represent the data of Table 2.1 in
cumulative frequency less than ogives and
cumulative frequency greater than ogives,
and discuss some of the results
1-83
Histogram
A histogram is a chart made of bars of different heights.
Widths and locations of bars correspond to widths and locations of data groupings
Heights of bars correspond to frequencies or relative frequencies of data groupings
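To make the idea concrete without a plotting library, the sketch below prints a sideways text histogram of the spending classes, so bar length plays the role of bar height. The function name and the `scale` parameter (customers per `#` symbol) are assumptions made for this illustration.

```python
def text_histogram(lower_bounds, width, freqs, scale=2):
    """Return one text bar per class; bar length is proportional to frequency."""
    lines = []
    for lo, f in zip(lower_bounds, freqs):
        bar = "#" * round(f / scale)   # one '#' per `scale` customers
        lines.append(f"{lo:3d} - {lo + width:3d} | {bar} {f}")
    return lines

freqs = [30, 38, 50, 31, 22, 13]
lines = text_histogram(range(0, 600, 100), 100, freqs)
print("\n".join(lines))
```

Replacing the frequencies with relative frequencies (and adjusting `scale`) would give a relative frequency histogram with the same shape.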
1-84
Histogram Example
Frequency Histogram
1-85
Histogram Example
Relative Frequency Histogram
1-86
Skewness and Kurtosis
Skewness
Measure of the asymmetry (departure from symmetry) of a frequency distribution
Skewed to left
Symmetric or unskewed
Skewed to right
Kurtosis
Measure of the flatness or peakedness of a frequency distribution
Platykurtic (relatively flat)
Mesokurtic (normal)
Leptokurtic (relatively peaked)
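The slides do not give a formula for skewness or kurtosis. One common convention, assumed here, is the moment-based coefficients: skewness = m3 / m2^1.5 and excess kurtosis = m4 / m2^2 - 3, where mk is the k-th central moment. Under this convention a negative skewness indicates a distribution skewed to the left and a positive one skewed to the right, while a negative, near-zero, or positive excess kurtosis corresponds to platykurtic, mesokurtic, or leptokurtic. The small data sets below are made up for illustration.

```python
def moments_shape(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3
    return skew, excess_kurt

# Hypothetical data sets, chosen only to show the sign of the measures
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 2, 10]
left_tail = [-v for v in right_tail]

for name, data in [("symmetric", symmetric),
                   ("skewed to right", right_tail),
                   ("skewed to left", left_tail)]:
    skew, kurt = moments_shape(data)
    print(f"{name:16s} skewness = {skew:+.2f}, excess kurtosis = {kurt:+.2f}")
```

Other definitions exist (for example, sample-adjusted versions), so values from statistical software may differ slightly from this sketch.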
1-87
Skewness
Skewed to left
Mean < Median < Mode
[Figure: frequency histogram. x-axis: Monthly expenses (Dollar), from 0 to 700; y-axis: Frequency]
1-88
Skewness
Symmetric
1-89
Skewness
Skewed to right
1-90
Kurtosis
Platykurtic - flat distribution
1-91
Kurtosis
Mesokurtic - not too flat and not too peaked
1-92
Kurtosis
Leptokurtic - peaked distribution
