Descriptive Statistics

MATH30-6
Probability and Statistics
Objectives
At the end of the lesson, the students are expected to:
• Define and differentiate the various measures for describing data;
• Describe a given set of data using these measures; and
• Interpret the values that arise from computation.
Measures of Describing Data
• Measure of Central Tendency
- Also known as Measure of Center, Measure of Central
Location
- Measure of finding the mean, median or mode of the
dataset
- The midrange is rarely used. It is calculated by adding
the highest data value to the lowest data value and
dividing the sum by 2.

• Measure of Position
- Measure of finding the kth element of the distribution
- Also the quantiles or fractiles of distribution
Measures of Describing Data
• Measure of Variation
- Measure of how the data is distributed about the
mean.

• Measure of Shape
- Measure of the degree of symmetry of a distribution.
The Mean
• The most widely used measure for describing ratio-level data.
• May be classified as
- Arithmetic mean
- Weighted mean
- Geometric mean
- Harmonic mean
- Trimmed mean
- Quadratic or Root Mean Square (RMS)
Arithmetic Mean
For Discrete Case

Sample mean
\[ \bar{x} = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

Population mean
\[ \mu = \frac{\sum X}{N} \]
Arithmetic Mean
Characteristics
• All values are used.
• It is unique.
• The arithmetic mean is the only measure of central
tendency where the sum of the deviations of each
value from the mean is zero.
• It is calculated by summing the values and dividing by
the number of values.
• Every set of interval-level and ratio-level data has a
mean.
• The mean is affected by unusually large or small data
values.
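The zero-sum property of deviations listed above can be checked numerically. The following is a minimal sketch in Python; the data are the five male grades that appear later in this lesson, and the variable names are illustrative only.

```python
from statistics import mean

# Grades of the five male students used later in this lesson.
grades = [100, 65, 75, 85, 95]

x_bar = mean(grades)                      # sum of the values / number of values
deviations = [x - x_bar for x in grades]  # signed distance of each value from the mean

print(x_bar)            # the arithmetic mean
print(sum(deviations))  # positive and negative deviations cancel out
```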
Arithmetic Mean
6-1/205 Will the sample mean always correspond to one of the observations in the sample?

6-2/205 Will exactly half of the observations in a sample fall below the mean?

6-3/205 Will the sample mean always be the most frequently occurring data value in the sample?
Weighted Mean
The weighted mean of a set of numbers x1, x2, …, xn, with corresponding weights w1, w2, …, wn, is computed from the following formula:

\[ \bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_n x_n}{w_1 + w_2 + w_3 + \cdots + w_n} \]

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
Weighted Mean
Example:
1. The Carter Construction Company pays its hourly employees $16.50, $19.00, or $25.00 per hour. There are 26 hourly employees, 14 of whom are paid at the $16.50 rate, 10 at the $19.00 rate, and 2 at the $25.00 rate. What is the mean hourly rate paid to the 26 employees?
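One way to work this example is to treat the hourly rates as the x values and the employee counts as the weights. The sketch below applies the weighted mean formula directly; the variable names are illustrative.

```python
rates = [16.50, 19.00, 25.00]  # hourly rates (the x values)
counts = [14, 10, 2]           # employees paid at each rate (the weights w)

# Weighted mean: sum of w_i * x_i divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(counts, rates)) / sum(counts)

print(round(weighted_mean, 2))  # 18.12
```

The numerator is 14(16.50) + 10(19.00) + 2(25.00) = 471, and 471/26 is about $18.12 per hour.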
The Median
• The midpoint of the values after they have been
ordered from the smallest to largest
• There are as many values above the median as below it
in the data array.
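• For an even number of values, the median is the arithmetic average of the two middle values.

Sample median (where the x_i are the values arranged in increasing order)
\[ \tilde{x} = x_{(n+1)/2} \quad \text{if } n \text{ is odd,} \]
\[ \tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2} \quad \text{if } n \text{ is even.} \]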
The Median
Characteristics
• There is a unique median for each data set.
• It is not affected by extremely large or small values and
is therefore a valuable measure of central tendency
when such values occur.
• It can be computed for ratio-level, interval-level, and
ordinal-level data.
• It can be computed for an open-ended frequency distribution if the median does not lie in an open-ended class.
The Median
Example:
1. Find the median of
1.8, 2.1, 1.7, 1.6, 0.9, 2.7, and 1.8
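The odd/even rule for the sample median can be sketched as a small function. This is an illustrative implementation (the name sample_median is not from the slides); note that Python lists are 0-based, so the index is shifted by one relative to the formula.

```python
def sample_median(values):
    """Median by the odd/even rule, applied to the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # x_((n+1)/2) in 1-based terms
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle values

data = [1.8, 2.1, 1.7, 1.6, 0.9, 2.7, 1.8]
print(sorted(data))         # [0.9, 1.6, 1.7, 1.8, 1.8, 2.1, 2.7]
print(sample_median(data))  # 1.8
```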
The Mode
• The value of the observation that appears most
frequently
The Mode
Characteristics
• Used when you want to find the most frequently occurring score
• A quick approximation of the average
• An inspection average
• The least reliable of the three measures because it is undefined for some data sets (for example, when no value repeats)
• The only measure of central location that can be used for nominal data
• Usually used in polls
• A distribution with two modes is bimodal; with three, trimodal; in general, a distribution with more than one mode is multimodal.
The Mode
Example:
1. At a certain poll, the following data were recorded:
1 = Yes, 2 = No, 0 = Undecided. What is the modal choice?

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 2

\[ \hat{x} = 1 \]
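The modal choice can be found by counting how often each code occurs. The sketch below uses Python's collections.Counter and returns every value tied for the highest count, so it also handles bimodal or multimodal data; the variable names are illustrative.

```python
from collections import Counter

labels = {1: "Yes", 2: "No", 0: "Undecided"}
responses = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 2]

counts = Counter(responses)        # frequency of each response code
top = max(counts.values())         # highest frequency observed
modes = [value for value, c in counts.items() if c == top]

print(counts[1], counts[0], counts[2])  # 12 4 5
print([labels[m] for m in modes])       # ['Yes']
```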
Measures of Location
• Quantiles (or Fractiles) are points taken at regular
intervals from the cumulative distribution function of a
random variable.
• Dividing ordered data into q essentially equal-sized
data subsets is the motivation for q-quantiles; the
quantiles are the data values marking the boundaries
between consecutive subsets.
• There are q − 1 of the q-quantiles, one for each integer k satisfying 0 < k < q.
Measures of Location
Quartiles
• Dividing the dataset into 4 groups.

Deciles
• Dividing the dataset into 10 groups.

Percentiles
• Dividing the dataset into 100 groups.
Quartile
• Any of the three fractiles obtained by dividing the set of data into four equal parts
• Q1 is the lower quartile, the value below which the lowest 25% of the data fall.
• Q2 is the median, which divides the data into two equal parts.
• Q3 is the third quartile (upper quartile), the value above which the upper 25% of the data fall.
Quartile
There is no universal agreement on a single procedure for calculating quartiles, and different computer programs often yield different results. For example, if you use the data set of 1, 3, 6, 10, 15, 21, 28, and 36, you will get these results:

Program      Q1      Q2      Q3
STATDISK     4.5     12.5    24.5
Minitab      3.75    12.5    26.25
Excel        5.25    12.5    22.75
TI-83 Plus   4.5     12.5    24.5
Algorithm used in QUARTILE() function in Excel
1. Find the kth smallest member in the array of values, where:
k = (quart/4)*(n − 1) + 1
If k is not an integer, truncate it but store the fractional portion (f) for use in step 3.

quart = value between 0 and 4 depending on which quartile you want to find.
n = number of values in the array.
2. Find the smallest data point in the array of values that is greater than the kth smallest, the (k+1)th smallest member.
3. Interpolate between the kth smallest and the (k+1)th smallest values:
Output = a[k] + (f*(a[k+1] − a[k]))

a[k] = the kth smallest
a[k+1] = the (k+1)th smallest
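The three steps above can be sketched in Python as follows. This is an illustrative reimplementation of the procedure as described on the slide, not Excel's own code; the function name quartile is an assumption, and the 1-based positions of the slide are shifted to Python's 0-based indexing.

```python
import math

def quartile(values, quart):
    """Quartile by the interpolation steps described above (quart = 0..4)."""
    a = sorted(values)
    n = len(a)
    k = (quart / 4) * (n - 1) + 1  # step 1: 1-based position, possibly fractional
    k_int = math.floor(k)          # truncated position
    f = k - k_int                  # fractional portion kept for step 3
    kth = a[k_int - 1]             # kth smallest (0-based index k_int - 1)
    if f == 0:
        return kth                 # k is an integer, no interpolation needed
    nxt = a[k_int]                 # step 2: (k+1)th smallest
    return kth + f * (nxt - kth)   # step 3: linear interpolation

data = [1, 3, 6, 10, 15, 21, 28, 36]
print([quartile(data, q) for q in (1, 2, 3)])  # [5.25, 12.5, 22.75]
```

The printed values match the Excel row of the comparison table for the same data set.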
Quartile
Example:
1. Find the three quartiles in the array of values 0, 2, 3,
5, 6, 8, 9.
Measures of Location
• Quartile – One fourth
First (1/4), Second (1/2), Third (3/4)
Q1 = P25, Q2 = D5 = P50 = median, Q3 = P75
• Decile – One tenth
10%, 20%, …, 90%
D1 = P10, D2 = P20, …, D8 = P80, D9 = P90
• Percentile − One hundredth
1%, 2%, …, 99%
P1, P2, P3, …, P98, P99
Measures of Location
Examples:
1. Consider the observations 11, 14, 17, 23, 27, 32, 40, 49, 54, 59, 71, and 80. What is the 29th percentile?

2. The magazine Forbes publishes annually a list of the world’s wealthiest individuals. For 2007, the net worth of the 20 richest individuals, in billions of dollars, in no particular order, is as follows:
18, 18, 18, 18, 19, 20, 20, 20, 21, 22, 22, 23, 24, 26, 27, 32, 33, 49, 52, 56
Find the first percentile.
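The slides do not state which percentile rule these examples expect, and different conventions give different answers. As one possible approach, the sketch below assumes the Excel-style interpolation rule with the position k = (p/100)(n − 1) + 1; the function name percentile and this choice of rule are assumptions, not the method prescribed by the lesson.

```python
import math

def percentile(values, p):
    """p-th percentile by linear interpolation (assumed Excel-style rule)."""
    a = sorted(values)
    n = len(a)
    k = (p / 100) * (n - 1) + 1  # 1-based position, possibly fractional
    k_int = math.floor(k)
    f = k - k_int                # fractional part used for interpolation
    kth = a[k_int - 1]
    if f == 0:
        return kth
    return kth + f * (a[k_int] - kth)

obs = [11, 14, 17, 23, 27, 32, 40, 49, 54, 59, 71, 80]
worth = [18, 18, 18, 18, 19, 20, 20, 20, 21, 22,
         22, 23, 24, 26, 27, 32, 33, 49, 52, 56]

print(round(percentile(obs, 29), 2))  # 23.76 under this rule
print(percentile(worth, 1))           # 18.0 under this rule
```

Under this rule the 50th percentile coincides with the median, which agrees with Q2 = P50.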
Measures of Variation (Dispersion)
Why study dispersion?
• A measure of location, such as the mean or the median
does not tell us anything about the spread of the data.
• For example, if your nature guide told you that the
river averaged 3 feet in depth, would you want to wade
across on foot without additional information?
Probably not. You would want to know something
about the variation in depth.
Measures of Variation (Dispersion)
Why study dispersion?
• A second reason is to compare the spread in two or
more distributions.
• These are measures of the average distance of each
observation from the center of distribution (Mean
Absolute Deviation).
• They measure the homogeneity or heterogeneity of a
particular group.
Measures of Variation
• Range
- The difference between the largest and smallest
number in the set
• Mean Absolute Deviation (MAD)
- The average of unsigned deviations from mean
• Variance
- The average of the squared deviations from the mean
• Standard Deviation (SD)
- The population/sample standard deviation is given as
the positive square root of population/sample variance
Measures of Variation
• Coefficient of Variation (CV)
- The percentage of the ratio of standard deviation to
the mean
Range
R = H − L
where H is the highest value and L is the lowest value.
Consider the following data.

Grades in Statistics
Males           Females
Jon     100     Ann     84
Ron      65     Ria     86
Dan      75     Let     85
Tom      85     Bel     82
Bob      95     Nel     83
Range    35     Range    4
Range
Conclusion: Grades of males are more scattered while
grades of females are more compressed. Females are
more homogeneous in their math ability.

Disadvantages of the range:
1. Unstable for a very large class
2. Unreliable, since only two values are taken into account
3. The ranges of two data sets with unequal numbers of scores are not directly comparable
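The range uses only the two extreme values, which the following short sketch makes explicit for the grades example. The helper name data_range is illustrative.

```python
males = [100, 65, 75, 85, 95]
females = [84, 86, 85, 82, 83]

def data_range(values):
    """R = H - L: highest value minus lowest value."""
    return max(values) - min(values)

print(data_range(males))    # 35
print(data_range(females))  # 4
```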
Variance and Standard Deviation
• Sample variance (s²)
\[ s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n - 1} \]

\[ s^2 = \frac{n \sum x_i^2 - \left( \sum x_i \right)^2}{n(n - 1)} \]
• Sample standard deviation (s)
- Positive square root of s²
\[ s = \sqrt{s^2} \]
The quantity n − 1 is often called the degrees of freedom associated with the variance estimate.
Variance and Standard Deviation
• Population variance (σ²)
\[ \sigma^2 = \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{N} \]

• Population standard deviation (σ)
- Positive square root of σ²
\[ \sigma = \sqrt{\sigma^2} \]
Variance
Determine the variance in the previous example, treating the data first as a sample and then as a population.

Grades in Statistics
Males           Females
Jon     100     Ann     84
Ron      65     Ria     86
Dan      75     Let     85
Tom      85     Bel     82
Bob      95     Nel     83

Mean     84     Mean    84
Variance
Males
\[ s^2 = \frac{(100-84)^2 + (65-84)^2 + (75-84)^2 + (85-84)^2 + (95-84)^2}{5-1} = 205 \]

\[ \sigma^2 = \frac{(100-84)^2 + (65-84)^2 + (75-84)^2 + (85-84)^2 + (95-84)^2}{5} = 164 \]
Variance
Females
\[ s^2 = \frac{(84-84)^2 + (86-84)^2 + (85-84)^2 + (82-84)^2 + (83-84)^2}{5-1} = 2.5 \]

\[ \sigma^2 = \frac{(84-84)^2 + (86-84)^2 + (85-84)^2 + (82-84)^2 + (83-84)^2}{5} = 2 \]
Variance
Conclusion: Males showed more variability. The higher the variance, the more variable or far apart the values are from each other.

Remark: Since the variance is in squared units, it is not expressed in the units of the data being measured. The standard deviation, its positive square root, restores the original units.
Standard Deviation
Males
s = 14.3178
σ = 12.8062

Females
s = 1.5811
σ = 1.4142
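These values can be reproduced with Python's standard statistics module, where variance and stdev divide by n − 1 (sample) and pvariance and pstdev divide by N (population). A minimal sketch:

```python
from statistics import variance, pvariance, stdev, pstdev

males = [100, 65, 75, 85, 95]
females = [84, 86, 85, 82, 83]

for name, grades in (("Males", males), ("Females", females)):
    s2 = variance(grades)       # sample variance, divides by n - 1
    sigma2 = pvariance(grades)  # population variance, divides by N
    s = stdev(grades)           # sample standard deviation
    sigma = pstdev(grades)      # population standard deviation
    print(name, s2, sigma2, round(s, 4), round(sigma, 4))
```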
Mean Absolute Deviation
\[ \text{MAD} = \sum_{i=1}^{n} \frac{\left| x_i - \bar{x} \right|}{n} \]
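As an illustration, the mean absolute deviation can be computed for the two groups of grades from the range and variance examples. The function name mean_abs_dev is an assumption made for this sketch.

```python
def mean_abs_dev(values):
    """Average of the unsigned deviations from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n

males = [100, 65, 75, 85, 95]
females = [84, 86, 85, 82, 83]

print(mean_abs_dev(males))    # (16 + 19 + 9 + 1 + 11) / 5
print(mean_abs_dev(females))  # (0 + 2 + 1 + 2 + 1) / 5
```

As with the range and the variance, the males' grades show the larger spread.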
Measures of Variation
Example:

Consider the following test scores:

Test         1    2    3    4    5    6    7    8    9   10
Student A   12    6   13    2    5    0    9    6   10    7
Student B    8   10    9   12    5    1    4    7    9    3
a. Who performed better?
b. Who is more consistent?
Measures of Variation
a. Compute the average score of each student.

\[ \bar{x}_A = \frac{12 + 6 + 13 + 2 + 5 + 0 + 9 + 6 + 10 + 7}{10} = 7 \]
\[ \bar{x}_B = \frac{8 + 10 + 9 + 12 + 5 + 1 + 4 + 7 + 9 + 3}{10} = 6.8 \]

Student A performed better because of the higher computed average.
Measures of Variation
b. Compute the sample standard deviations.
\[ s_A = \sqrt{\frac{(12-7)^2 + (6-7)^2 + (13-7)^2 + \cdots + (7-7)^2}{10-1}} = 4.1366 \]

\[ s_B = \sqrt{\frac{(8-6.8)^2 + (10-6.8)^2 + (9-6.8)^2 + \cdots + (3-6.8)^2}{10-1}} = 3.4577 \]

Student B is more consistent because of the lower standard deviation.
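Both parts of this example can be verified by applying the definitions directly. The sketch below writes out the sample standard deviation formula by hand rather than calling a library routine; the function name sample_sd is illustrative.

```python
import math

def sample_sd(scores):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(scores)
    x_bar = sum(scores) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in scores) / (n - 1))

student_a = [12, 6, 13, 2, 5, 0, 9, 6, 10, 7]
student_b = [8, 10, 9, 12, 5, 1, 4, 7, 9, 3]

print(sum(student_a) / 10, round(sample_sd(student_a), 4))  # 7.0 4.1366
print(sum(student_b) / 10, round(sample_sd(student_b), 4))  # 6.8 3.4577
```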
Measures of Variation
6-4/205 For any set of data values, is it possible for the sample standard deviation to be larger than the sample mean? If so, give an example.

6-5/205 Can the sample standard deviation be equal to zero? If so, give an example.

6-6/205 Suppose that you add 10 to all of the observations in a sample. How does this change the sample mean? How does it change the sample standard deviation?
Measures of Variation
6-7/205 Eight measurements were made on the inside diameter of forged piston rings used in an automobile engine. The data (in millimeters) are 74.001, 74.003,
74.015, 74.000, 74.005, 74.002, 74.005, and 74.004.
Calculate the sample mean and sample standard
deviation, construct a dot diagram, and comment on the
data.
6-8/205 In Applied Life Data Analysis (Wiley, 1982),
Wayne Nelson presents the breakdown time of an
insulating fluid between electrodes at 34 kV. The times, in
minutes, are as follows: 0.19, 0.78, 0.96, 1.31, 2.78, 3.16,
4.15, 4.67, 4.85, 6.50, 7.35, 8.01, 8.27, 12.06, 31.75,
32.52, 33.91, 36.71, and 72.89. Calculate the sample
mean and sample standard deviation.
Measures of Variation
Remark: Standard deviation and variance are both reliable
but cannot be used in comparing two sets of data of
different units.

Example: Consistency of a player in making assists versus scoring points

Coefficient of Variation
CV = s.d./mean
\[ cv = \sigma / \mu \quad \text{or} \quad \widehat{cv} = s / \bar{x} \]
where:
s.d. = standard deviation (s or σ)
Player C’s record of assists (A) and points (P) in Game 1:
A    7   10    9    1    5    3    4    7    9    4
P   25   25   30   22   23   22   16   35   20

CV_A = \widehat{cv}_A = 0.5018

CV_P = \widehat{cv}_P = 0.2297
The player is more consistent in making points because the coefficient of variation for points is lower.
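The two coefficients can be reproduced from the listed records using the sample form s / x̄. The sketch below uses the values exactly as they appear above (ten assist values and nine point values); the function name coeff_var is illustrative.

```python
from statistics import mean, stdev

assists = [7, 10, 9, 1, 5, 3, 4, 7, 9, 4]
points = [25, 25, 30, 22, 23, 22, 16, 35, 20]

def coeff_var(values):
    """Sample coefficient of variation: s divided by the sample mean."""
    return stdev(values) / mean(values)

print(round(coeff_var(assists), 4))  # 0.5018
print(round(coeff_var(points), 4))   # 0.2297
```

Because the CV is a ratio of two quantities in the same units, it is unit-free, which is what allows assists and points to be compared.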
Measures of Shape
• Skewness
- Degree of asymmetry of distribution about a mean. It
is a measure on how the data departs from being
symmetrical
- Can be interpreted as symmetric, positively skewed or
negatively skewed

• Kurtosis
- The degree of peakedness exhibited by the distribution
Skewness
Pearsonian Coefficient of Skewness in a sample (Pearson’s Coefficient of Skewness by Karl Pearson) using the mode

\[ Sk_1 = \frac{\bar{x} - \hat{x}}{s} \]

Interpretation of values:
1. Sk < 0, “negatively skewed” or “skewed to the left”
2. Sk = 0, symmetrical
3. Sk > 0, “positively skewed” or “skewed to the right”
Skewness
Pearsonian Coefficient of Skewness in a sample (Pearson’s Coefficient of Skewness by Karl Pearson) using the median

\[ Sk_2 = \frac{3(\bar{x} - \tilde{x})}{s} \]

Interpretation of values:
1. Sk < 0, “negatively skewed” or “skewed to the left”
2. Sk = 0, symmetrical
3. Sk > 0, “positively skewed” or “skewed to the right”
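Both Pearsonian coefficients can be computed with Python's statistics module. As an illustration only, the sketch below applies them to Student A's test scores from the variation example, where the mean is 7, the median is 6.5, and the mode is 6.

```python
from statistics import mean, median, mode, stdev

scores = [12, 6, 13, 2, 5, 0, 9, 6, 10, 7]  # Student A's test scores

s = stdev(scores)                               # sample standard deviation
sk1 = (mean(scores) - mode(scores)) / s         # coefficient using the mode
sk2 = 3 * (mean(scores) - median(scores)) / s   # coefficient using the median

print(round(sk1, 4), round(sk2, 4))  # 0.2417 0.3626
```

Both values are greater than zero, so by either coefficient these scores are positively skewed.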
Skewness
• A measure of the asymmetry of the frequency distribution

a. Positive skewness: mode < median < mean
b. Symmetrical: mode = median = mean
c. Negative skewness: mode > median > mean
SKEW() Function in Excel
\[ Sk = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 \]
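The formula above can be written out term by term. The sketch below is an illustrative reimplementation of that formula, not Excel's own code; the function name skew and the two small test data sets are assumptions. It needs at least three values and a nonzero standard deviation.

```python
from statistics import mean, stdev

def skew(values):
    """Skewness by the formula above, with s the sample standard deviation."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)
    total = sum(((x - x_bar) / s) ** 3 for x in values)  # sum of cubed z-scores
    return n / ((n - 1) * (n - 2)) * total

print(skew([1, 2, 3, 4, 5]))             # essentially 0 for symmetric data
print(round(skew([1, 2, 3, 4, 10]), 4))  # 1.6971, a long right tail
```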
Kurtosis
• A measure of the degree to which a unimodal
distribution is peaked
• The state or quality of flatness or peakedness of the
curve describing a frequency distribution about its
mode

[Figure: leptokurtic, mesokurtic, and platykurtic curves]
KURT() Function in Excel
Relative Kurtosis
\[ K = \left\{ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \right\} - \frac{3(n-1)^2}{(n-2)(n-3)} \]

Interpretation of values:
1. K < 0, “platykurtic” or “relatively flat”
2. K = 0, “mesokurtic” or having the same kurtosis as the Normal Distribution. The kurtosis of a Normal Distribution is 3; K is measured relative to that value, so a Normal Distribution gives K = 0.
3. K > 0, “leptokurtic” or “relatively peaked”
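The relative kurtosis formula can be sketched the same way. This is an illustrative reimplementation of the formula, not Excel's own code; the function name kurt and the sample data are assumptions. It needs at least four values and a nonzero standard deviation.

```python
from statistics import mean, stdev

def kurt(values):
    """Relative kurtosis by the formula above, with s the sample standard deviation."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)
    total = sum(((x - x_bar) / s) ** 4 for x in values)  # sum of fourth-power z-scores
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction

print(round(kurt([1, 2, 3, 4, 5]), 4))  # -1.2, platykurtic (relatively flat)
```

For evenly spaced data such as 1 through 5, the result is negative, which matches the "relatively flat" interpretation.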
Summary
• The measures of central tendency are mean, median,
and mode. Midrange is rarely used. Midrange is found
by adding the highest data value to the lowest data
value and dividing the sum by 2.
• Different types of means (arithmetic, weighted,
geometric, harmonic, etc.) are computed depending on
the nature of data.
• The measures of location are quartiles, deciles, and
percentiles.
• The measures of variation tell us about how the data is
distributed about the mean.
• The measures of shape refer to either skewness or
kurtosis.
References
• Montgomery and Runger. Applied Statistics and Probability
for Engineers, 6th Ed. © 2014
• Microsoft® Excel
• Walpole, et al. Probability and Statistics for Engineers and
Scientists 9th Ed. © 2012, 2007, 2002
• http://irving.vassar.edu/faculty/wl/econ209/dessript.pdf
• http://www.preciousheart.net/chaplaincy/Auditor_Manual/10descsd.pdf
