Describing a Set of Data with Numerical Measures

Lesson 2(b)
 To present the important measures
and to show how to compute the
following:
 Mean
 Median
 Mode
 A measure of center is a value at the
center or middle of a data set.
...Mean
...Median
...Mode
 The (arithmetic) mean is generally the
most important of all numerical
descriptive measurements, and it is
what most people call an average
 The arithmetic mean of a set of values
is the number obtained by adding the
values and dividing the total by the
number of values. It is also referred to simply as
the mean, a term that will be used often throughout
the remainder of the course.


 The mean is denoted by $\bar{x}$
(pronounced “x-bar”) if the data set is
a sample from a larger population.
 The mean is denoted by $\mu$
(lowercase Greek mu) if all values of
the population are used.
 The Greek letter Σ (uppercase Greek
sigma) indicates that the data values
should be added.

 Formula (ungrouped data)

Sample mean: $\bar{x} = \frac{\sum x}{n}$

Notation
$\bar{x}$ denotes the mean of a set of sample values
Σ denotes the addition of a set of values
x is the variable usually used to represent the individual data values
n represents the number of values in a sample

Population mean: $\mu = \frac{\sum x}{N}$

Notation
$\mu$ denotes the mean of all values in a population
Σ denotes the addition of a set of values
x is the variable usually used to represent the individual data values
N represents the number of values in a population
 Listed below are the volumes (in
ounces) of the Coke in five
different cans. Find the mean for
this example.

12.3 12.1 12.2 12.3 12.2

Solution: $\bar{x} = \frac{12.3 + 12.1 + 12.2 + 12.3 + 12.2}{5} = \frac{61.1}{5} = 12.22$ ounces
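The computation of the mean for the five cans can be sketched in Python as follows. The function name `sample_mean` is only an illustrative choice, not part of the lesson.

```python
# Volumes (in ounces) of the five cans from the example.
volumes = [12.3, 12.1, 12.2, 12.3, 12.2]


def sample_mean(values):
    """Add the values and divide the total by the number of values."""
    return sum(values) / len(values)


mean_volume = sample_mean(volumes)
print(round(mean_volume, 2))
```

Rounding is applied only when printing, because floating-point addition can leave a tiny error in the last digits.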
 The mean is sensitive to every value, so one
exceptional value can affect it
dramatically.

 The median largely overcomes this
disadvantage.
 The median of a data set is the middle
value when the original data values are
arranged in order of increasing (or
decreasing) magnitude.

 The median is often denoted by $\tilde{x}$
(pronounced “x-tilde”, or “x-curl”).
 To find the median, first sort the values (arrange them in
order), then follow one of these two
procedures:
 If the number of values is odd, the
median is the number located in the
exact middle of the list.
 If the number of values is even, the
median is found by computing the
mean of the two middle numbers.

 Find the median of the following
salaries (in millions of dollars) paid to
female executives (based on data from
Working Woman magazine):

6.72 3.46 3.60 6.44


 Since the number of values is even,
arrange them in order and take the mean
of the two middle values:

3.46 3.60 6.44 6.72

Median = (3.60 + 6.44) / 2 = 5.02

Then the median is $5.02 million.


 Repeat Example 1, this time including
another salary of $26.70 million. That
is, find the median of the following
salaries (in millions of dollars):

6.72 3.46 3.60 6.44 26.70

 Since the number of values is odd,
arrange them in order and take the value
in the exact middle:

3.46 3.60 6.44 6.72 26.70

The exact middle value is 6.44, so the median is $6.44 million.
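The two-case median procedure (odd or even number of values) can be sketched as below and applied to both salary examples. The function and variable names are illustrative assumptions.

```python
def median(values):
    """Sort the values, then take the middle one (odd count)
    or the mean of the two middle ones (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Salaries in millions of dollars, taken from the two examples.
salaries_even = [6.72, 3.46, 3.60, 6.44]
salaries_odd = salaries_even + [26.70]

print(round(median(salaries_even), 2))
print(median(salaries_odd))
```

Adding the exceptional salary of 26.70 moves the median only from 5.02 to 6.44, which illustrates why the median is less sensitive to extreme values than the mean.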
 The mode of a data set is the value
that occurs most frequently.

 The mode is often denoted by M.


 When two values occur with the same
greatest frequency, each one is a
mode and the data set is said to be bimodal.
 When more than two values occur with
the same greatest frequency, each is a
mode and the data set is said to be
multimodal.
 When no value is repeated, we say that
there is no mode.
Find the modes of the following data
sets.

1. 5 5 5 3 1 5 1 4 3 5
2. 1 2 2 2 3 4 5 6 6 6 7 9
3. 1 2 3 6 7 8 9 10
1. The number 5 is the mode because it
is the value that occurs most often.
5 5 5 3 1 5 1 4 3 5
2. The numbers 2 and 6 are both modes
because they occur with the same
greatest frequency. This data set is
bimodal.

1 2 2 2 3 4 5 6 6 6 7 9
3. There is no mode because no value is
repeated.

1 2 3 6 7 8 9 10
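As a minimal sketch, the mode rules above (one mode, several modes, or no mode when nothing repeats) can be expressed with `collections.Counter` from the Python standard library. The function name `modes` is an assumption made for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.
    An empty list means no value is repeated, so there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([5, 5, 5, 3, 1, 5, 1, 4, 3, 5]))
print(modes([1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]))
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))
```

A result with two entries corresponds to a bimodal data set, and more than two entries to a multimodal one.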
 The midrange is the value midway between the
highest and the lowest values in the
original data set. It is found using the
formula

$\text{midrange} = \frac{\text{highest value} + \text{lowest value}}{2}$
Find the midrange of the ages of people arrested on theft charges at the Dutchess County jail.

18 16 23 25 19 18 20 38

Solution: midrange = (38 + 16) / 2 = 27
Data set: 2 2 2 20 34 45 210

Mode = 2 (the value that occurs most often)
Median = 20 (the middle value of the ordered data set)
Mean = 45 (the average value)
Midrange = (2 + 210)/2 = 106
Outlier: 210
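The four measures of center for this small data set can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

data = [2, 2, 2, 20, 34, 45, 210]

mean_value = statistics.mean(data)        # sum of the values / number of values
median_value = statistics.median(data)    # middle value of the sorted data
mode_value = statistics.mode(data)        # most frequent value
midrange_value = (min(data) + max(data)) / 2  # midway between lowest and highest

print(mean_value, median_value, mode_value, midrange_value)
```

The single extreme value 210 pulls the mean and the midrange upward, while the median and the mode stay near the bulk of the data.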
A distribution of data is skewed if it is not
symmetric and if it extends more to one
side than the other.

A distribution of data is symmetric if the
left half of its histogram is roughly a mirror
image of its right half.
Lopsided to the right = Skewed to the left =
Negatively Skewed

The mean and median are to the left of the
mode. Although not always predictable,
data of this type of distribution generally have the
mean to the left of the median.
Lopsided to the left = Skewed to the right =
Positively Skewed

The mean and median are to the right
of the mode. Although not always
predictable, data of this type of
distribution generally have the mean
to the right of the median.
Grouped Data
 When data are summarized in a
frequency table, we do not know the
exact values falling in a particular
class. To make calculations possible,
we pretend that within each class, all
sample values are equal to the class
midpoint.
 Since each class midpoint is
repeated a number of times equal to the
class frequency, the sum of all sample
values becomes Σ(f•x), where f
denotes frequency and x represents
the class midpoint.
 The total number of sample values
is the sum of frequencies, Σf.
$\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$

Example from Lesson 2(a)

Frequency Distribution

CI      Class Limits    f     x     fx
1       28 - 34         1     31    31
2       35 - 41         4     38    152
3       42 - 48         10    45    450
4       49 - 55         9     52    468
5       56 - 62         9     59    531
6       63 - 69         4     66    264
7       70 - 76         1     73    73
8       77 - 83         2     80    160
Total                   40          2129
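A short sketch of the grouped-data mean, using the class limits and frequencies from the table above. Each class is treated as if all its values equalled the class midpoint; the tuple layout and the function name are assumptions made for illustration.

```python
# (lower class limit, upper class limit, frequency) for each class interval.
classes = [
    (28, 34, 1), (35, 41, 4), (42, 48, 10), (49, 55, 9),
    (56, 62, 9), (63, 69, 4), (70, 76, 1), (77, 83, 2),
]


def grouped_mean(classes):
    """Mean of grouped data: sum of f * x divided by sum of f,
    where x is the class midpoint."""
    total_fx = 0.0
    total_f = 0
    for lower, upper, f in classes:
        midpoint = (lower + upper) / 2
        total_fx += f * midpoint
        total_f += f
    return total_fx / total_f


print(grouped_mean(classes))
```

With the totals in the table this is 2129 / 40, or about 53.2.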
Lesson 2(c)
 To discuss the following key concepts:
 Variation refers to the amount that
values vary among themselves, and it
can be measured with specific
numbers
 Values that are relatively close
together have lower measures of
variations, and values that are spread
farther apart have measures of
variation that are larger
 The Standard deviation, which is a
particularly important measure of
variation can be computed

 The values of Standard Deviation must


be interpreted correctly.
Data sets may have the same center but
look different because of the way
the numbers spread out from the center.

[Figure: distributions with the same center but different range and unequal variability]
...Range
...Variance
...Standard Deviation
...Coefficient of Variation
 The range is the difference between the largest
observation and the smallest
observation

 Its advantage is also its disadvantage:
its simplicity. Because it is calculated
from only two observations, it tells
nothing about the other observations.
 Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

 Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
 The population variance is
represented by σ² (Greek letter sigma
squared)
 To compute the sample variance s², begin by
calculating the sample mean $\bar{x}$, then compute
the difference (also known as the deviation)
between each observation and the mean
 Square the deviations and sum them; finally, divide
the sum of squared deviations by (n – 1)
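Sample: 8 4 9 11 3

The mean is $\bar{x} = \frac{8 + 4 + 9 + 11 + 3}{5} = 7$

From each observation we determine the deviation from the mean:
8 – 7 = 1
4 – 7 = -3
9 – 7 = 2
11 – 7 = 4
3 – 7 = -4

Squaring the deviations yields:
(1)² = 1
(-3)² = 9
(2)² = 4
(4)² = 16
(-4)² = 16

Summing and dividing by (n – 1):
$s^2 = \frac{1 + 9 + 4 + 16 + 16}{5 - 1} = \frac{46}{4} = 11.5$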
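The deviation-based computation can be sketched in Python as follows, using the five observations that appear in the deviation list above (8, 4, 9, 11, 3). The function name is an illustrative assumption.

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean,
    divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    squared = [d ** 2 for d in deviations]
    return sum(squared) / (n - 1)


observations = [8, 4, 9, 11, 3]
variance = sample_variance(observations)
print(variance)
```

The function needs at least two observations, since the divisor (n – 1) would otherwise be zero.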
 A deviation is the difference between the value and
the mean

 Formula: deviation = $x - \bar{x}$

 This is seldom used because of limited
utility
 The variance provides only a rough idea
about the amount of variation in the data
 It is useful when comparing two or more
sets of data
 Because the deviations from the mean are
squared, the unit attached to the
variance is also squared. This
contributes to the problem of
interpretation. The solution is the standard
deviation.
The following are the number of summer
jobs a sample of six students applied
for. Find the mean and variance of these
data

17 15 23 7 9 13
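The mean is $\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$

The sample variance is $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{9 + 1 + 81 + 49 + 25 + 1}{5} = \frac{166}{5} = 33.2$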
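 Population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

 Sample: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
 The standard deviation of a set of
sample values is a measure of
variation of values about the mean.
 Formula

(a) Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
(b) Shortcut formula for standard deviation: $s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$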
 Step 1: Find the mean $\bar{x}$ of the values
 Step 2: Subtract the mean from each
individual value to get a list of
deviations of the form $(x - \bar{x})$
 Step 3: Square each of the differences
obtained from Step 2
 Step 4: Add all the squares obtained
from Step 3 to get $\sum (x - \bar{x})^2$
 Step 5: Divide the total from Step 4 by
the number (n – 1)
 Step 6: Find the square root of the
result of Step 5.
 Knowing the mean and standard
deviation allows the statistician to
extract useful bits of information. The
information depends on the shape of
the histogram. If the histogram is
bell-shaped, the Empirical Rule is used.
[Figure: bell-shaped curve centered at µ]

Approximately 68% of all observations fall within one standard deviation of the mean
Approximately 95% of all observations fall within two standard deviations of the mean
Approximately 99.7% of all observations fall within three standard deviations of the mean
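As an illustration of the Empirical Rule, the sketch below returns the three intervals around a given mean. The mean of 100 and standard deviation of 15 are hypothetical values chosen only for the demonstration, and the rule is assumed to apply only to bell-shaped data.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals of 1, 2 and 3 standard deviations around the mean,
    paired with the approximate share of observations they contain
    when the histogram is bell-shaped."""
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {k: (mean - k * sd, mean + k * sd, share)
            for k, share in shares.items()}


# Hypothetical bell-shaped data with mean 100 and standard deviation 15.
intervals = empirical_rule_intervals(100, 15)
for k, (low, high, share) in intervals.items():
    print(f"within {k} sd: {low} to {high} (about {share:.1%})")
```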
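 Calculate the variance and standard
deviation for the five measurements
given below.

5 7 1 2 4

 Use the formulae

$s^2 = \frac{n \sum x_i^2 - (\sum x_i)^2}{n(n - 1)}$ and $s = \sqrt{s^2}$

 Solution: Given 5 7 1 2 4

Table for simplified calculation of s² and s

x_i       x_i²
5         25
7         49
1         1
2         4
4         16
Σ = 19    Σ = 95

$s^2 = \frac{5(95) - 19^2}{5(4)} = \frac{114}{20} = 5.7$ and $s = \sqrt{5.7} \approx 2.39$

 Solution: using deviations from the mean, with $\bar{x} = 19/5 = 3.8$

Computation using deviation from the mean

x_i     x_i – x̄     (x_i – x̄)²
5       1.2         1.44
7       3.2         10.24
1       -2.8        7.84
2       -1.8        3.24
4       0.2         0.04
Σ = 19  0.0         22.80

$s^2 = \frac{22.80}{4} = 5.7$ and $s = \sqrt{5.7} \approx 2.39$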
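The two routes to the variance of the five measurements, by deviations from the mean and by the shortcut based on Σx and Σx², can be compared in a short sketch. The variable names are illustrative, and the shortcut is written in one common algebraically equivalent form.

```python
import math

measurements = [5, 7, 1, 2, 4]
n = len(measurements)

# Route 1: deviations from the mean.
mean = sum(measurements) / n
variance_by_deviations = sum((x - mean) ** 2 for x in measurements) / (n - 1)

# Route 2: shortcut using the column totals of x and x squared.
sum_x = sum(measurements)                  # 19
sum_x2 = sum(x * x for x in measurements)  # 95
variance_by_shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance_by_shortcut)

print(round(variance_by_deviations, 2), round(variance_by_shortcut, 2), round(std_dev, 2))
```

Both routes give the same variance up to floating-point rounding; the shortcut avoids computing each deviation separately.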
 The coefficient of variation of a set of
observations is the standard deviation
of the observations divided by the
mean
 Population: $CV = \frac{\sigma}{\mu}$

 Sample: $cv = \frac{s}{\bar{x}}$
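A minimal sketch of the sample coefficient of variation, using the standard `statistics` module and the five measurements 5, 7, 1, 2, 4 from the earlier example. The function name is an assumption.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample)


cv = coefficient_of_variation([5, 7, 1, 2, 4])
print(round(cv, 3))
```

Because the result is a ratio, it has no unit, which makes it convenient for comparing the variation of data sets measured on different scales.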
 Calculate the variance of the following
samples

9 3 7 4 7 5 4

 Determine the variance and standard
deviation of the following samples

12 6 22 31 23 15 13 15 17 21

 Calculate the variance and standard
deviation of the following samples

6.5 6.6 6.7 6.8 7.1
7.4 7.7 7.7 7.7 7.3
Lesson 2(d)
 Provides information about the
position of particular values relative to
the entire data set
 Types
 Median
 Centiles (Percentile, Quartile,
Decile)
 Z-score
 A centile or centile point is defined as a
specific point in a distribution which has
a given percentage of the cases below it.
 Widely used in educational circles in
reporting the results of standardized
tests
 Any Centile Point

$C_p = LL + \left(\frac{pN - cf}{f_i}\right) i$

 Where
LL = lower exact limit of the interval in which we are interpolating
N = number of cases
p = proportion corresponding to the desired centile
cf = cumulative frequency of cases below the interval in which we are interpolating
f_i = frequency of the interval in which we are interpolating
i = size of the class interval
Cumulative Relative Frequency Distribution

CI      LL - UL     f     Cum Rel Freq
1       28 - 34     1     0.025
2       35 - 41     4     0.125
3       42 - 48     10    0.375
4       49 - 55     9     0.600
5       56 - 62     9     0.825
6       63 - 69     4     0.925
7       70 - 76     1     0.950
8       77 - 83     2     1.000
Total               40

[Figure: Ogive of the cumulative relative frequencies (2.50%, 12.50%, 37.50%, 60.00%, 82.50%, 92.50%, 95.00%, 100%) plotted against class intervals 1 to 8]

The curve tells that 60% of the students who took the Geography Test got a score below 56 points.

For example, the 60th centile (C60) is that point in a distribution which has 60% of the cases below it.
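The cumulative relative frequencies plotted on the ogive can be derived from the class frequencies alone. The sketch below uses `itertools.accumulate` from the standard library; the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies of the eight class intervals, lowest interval first.
frequencies = [1, 4, 10, 9, 9, 4, 1, 2]
total = sum(frequencies)

# Running total of the frequencies, divided by the number of cases.
cumulative_relative = [cf / total for cf in accumulate(frequencies)]

for interval, value in enumerate(cumulative_relative, start=1):
    print(interval, f"{value:.3f}")
```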
Cumulative Frequency and Percentage

 A frequency distribution of the scores of 376 boys on a test of mechanical ability is presented in the table below.

CI       f     cf     cP
60-64    2     376    100
55-59    12    374    99.5
50-54    20    362    96.3
45-49    32    342    91.0
40-44    46    310    82.4
35-39    58    264    70.2
30-34    64    206    54.8
25-29    58    142    37.8
20-24    42    84     22.3
15-19    23    42     11.2
10-14    15    19     5.0
5-9      4     4      1.1
Total    376
 To illustrate C50: by definition, C50 is the centile point that will have 50% of the cases above and below it.
C50 is the midpoint of the distribution and is known as the median.
Hence we are interested in finding that point in the distribution with 188 cases above and below it.
We count cases beginning from the bottom of the table until we come as close to 188 cases as possible without exceeding it.
The 188th case lies above class interval 25-29 and within class interval 30-34. The boundary between them, 29.5, has 142 cases below it, so we need 46 more cases to reach 188.
We need, therefore, to interpolate within class interval 30-34, which holds 64 cases over a width of 5:

C50 = 29.5 + (46/64)(5) ≈ 33.09
We verify by coming down from the top: 170 cases lie above 34.5, so 18 more cases are needed to reach 188, and 34.5 – (18/64)(5) ≈ 33.09.
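The interpolation for a centile point in grouped data can be sketched as below, using the distribution of the 376 scores. The tuple layout and the function name `centile` are illustrative assumptions. The exact lower limit is taken as the lower class limit minus 0.5, which assumes integer-valued scores.

```python
# (lower class limit, upper class limit, frequency), lowest interval first.
classes = [
    (5, 9, 4), (10, 14, 15), (15, 19, 23), (20, 24, 42),
    (25, 29, 58), (30, 34, 64), (35, 39, 58), (40, 44, 46),
    (45, 49, 32), (50, 54, 20), (55, 59, 12), (60, 64, 2),
]


def centile(classes, p):
    """Centile point for proportion p: LL + ((p*N - cf) / f_i) * i."""
    n = sum(f for _, _, f in classes)
    target = p * n                      # number of cases that must lie below
    cf = 0                              # cumulative frequency below the interval
    for lower, upper, f in classes:
        if f > 0 and cf + f >= target:
            ll = lower - 0.5            # lower exact limit of the interval
            width = upper - lower + 1   # size of the class interval
            return ll + (target - cf) / f * width
        cf += f
    raise ValueError("p must be between 0 and 1")


c50 = centile(classes, 0.50)
print(round(c50, 2))
```

For p = 0.50 the loop stops at interval 30-34 with 142 cases below it, so the result is 29.5 + (46/64)(5), about 33.09. Other centiles, such as the quartiles, follow by changing `p`.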
 Several of the centile points have special names:
 C10 – 1st Decile (D1)
 C20 – 2nd Decile (D2)
 C25 – 1st Quartile (Q1), the lower quartile
 C50 – Median (M)
 C75 – 3rd Quartile (Q3), the upper quartile

[Figure: distribution divided into four parts of 25% each by Q1, the median M, and Q3]
 Suppose you have been notified that
your score of 610 on the Verbal
Graduate Record Examination placed
you at the 60th percentile in the
distribution of scores. Where does
your score of 610 stand in relation to
the scores of others who took the
examination?
 Scoring at the 60th percentile means
that 60% of all examination scores
were lower than your score and 40%
were higher.

[Figure: distribution of scores with the 60th percentile marked, 60% of scores below it and 40% above]
 Sample z-score: $z = \frac{x - \bar{x}}{s}$

A z-score measures the distance between
an observation and the mean, measured in
units of standard deviation.

It is a valuable tool for determining whether the
observation under consideration is likely to
occur quite frequently or is somewhat
unusual.
 Consider the sample of 10
measurements:
1 1 0 15 2 3 4 0 1 3

The measurement x = 15 appears to
be unusually large. Calculate the z-score
for this observation and state
your conclusions.
x      x²
1      1
1      1
0      0
15     225
2      4
3      9
4      16
0      0
1      1
3      9
Σx = 30    Σx² = 266

$\bar{x} = \frac{30}{10} = 3.0$ and $s = \sqrt{\frac{10(266) - 30^2}{10(9)}} \approx 4.42$

The z-score for the suspected outlier is calculated as:

$z = \frac{x - \bar{x}}{s} = \frac{15 - 3.0}{4.42} \approx 2.71$

The measurement x = 15 lies 2.71 standard deviations
above the sample mean. Although the z-score does
not exceed 3, it is close enough that you might
suspect that x = 15 is an outlier. In that case, examine the
sampling procedure to see whether x = 15 is a faulty
observation.
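The z-score for the suspected outlier can be reproduced with the standard `statistics` module, as in this sketch. The function name `z_score` is an illustrative assumption.

```python
import statistics

sample = [1, 1, 0, 15, 2, 3, 4, 0, 1, 3]


def z_score(x, sample):
    """Distance of x from the sample mean, in units of the
    sample standard deviation."""
    return (x - statistics.mean(sample)) / statistics.stdev(sample)


z = z_score(15, sample)
print(round(z, 2))
```

Note that `statistics.stdev` divides by (n – 1), matching the sample standard deviation used throughout the lesson.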
