Sem 1/HK01
STATISTICS: CHAPTER 2
Descriptive Statistics and Frequency
Distribution
Dr. Harimi Djamila
University Malaysia Sabah
STATISTICS
DESCRIPTIVE STATISTICS
Presenting and Summarizing Data
MEASURING LOCATION
Median (Q2)
The median is the middle number of a set of data
when all observations (values) are sorted in order.
Applicable to quantitative data only.
Median
2. If the recorded values for a variable form
a symmetric distribution, the median and
mean are identical.
3. In skewed data, the mean lies further
toward the skew than the median.
[Figure: positions of the mean and median in a symmetric distribution and in a skewed distribution]
Median
The middle score or measurement in a set of ranked
scores or measurements; the point that divides a
distribution into two equal halves.
Median
Class A--IQs of 13 Students
89
93
97
98
102
106
109
110
115
119
128
131
140
Median = 109
(six cases above, six below)
Median
If the first student (IQ 89) were to drop out of Class A, there would be
a new median:
93
97
98
102
106
109
110
115
119
128
131
140
Median = 109.5
(109 + 110)/2 = 219/2 = 109.5
(six cases above, six below)
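The two cases above (an odd and an even number of observations) can be sketched in Python. This is a minimal illustration rather than part of the original slides; the function name `median` and the variable names are assumptions chosen for the example.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Class A IQs from the slide (13 students).
class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

median_all = median(class_a)             # odd n: the 7th sorted value
median_after_drop = median(class_a[1:])  # first student (89) dropped, even n

print(median_all, median_after_drop)
```

With 13 values the median is a data point; with 12 it falls halfway between the 6th and 7th sorted values.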
MEASURING LOCATION
Mean
It is commonly known as the average.
$\mu$: Population mean
$\bar{x}$: Sample mean (x-bar)
Of all the measures of location, the arithmetic mean is the most
commonly used in many statistical contexts.
Mean
Sample Mean
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
n = number of cases in the sample
$\Sigma$ = Greek letter sigma = sum or add up what follows
i = a typical case or each case in the sample (1 through n)
Population Mean
$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N}$$
N = number of cases in the population
Mean
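Class A--IQs of 13 Students
102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97
$\sum x_i = 1437$
$\bar{x}_A = \frac{\sum x_i}{n} = \frac{1437}{13} = 110.54$
Class B--IQs of 13 Students
127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105
$\sum x_i = 1433$
$\bar{x}_B = \frac{\sum x_i}{n} = \frac{1433}{13} = 110.23$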
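As a check on the arithmetic, the two class means can be recomputed with a few lines of Python. This is only a sketch: the assignment of the individual values to Class A and Class B is inferred from the stated totals (1437 and 1433), and the helper name `sample_mean` is an assumption.

```python
def sample_mean(values):
    """x-bar: the sum of the observations divided by their count."""
    return sum(values) / len(values)

# Value-to-class assignment inferred from the totals quoted on the slide.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

mean_a = sample_mean(class_a)
mean_b = sample_mean(class_b)

print(sum(class_a), round(mean_a, 2))
print(sum(class_b), round(mean_b, 2))
```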
Example of Mean
Measurements (x) and deviations (x - mean):
    x    x - mean
    3      -1
    5       1
    5       1
    1      -3
    7       3
    2      -2
    6       2
    7       3
    0      -4
    4       0
Total 40    0
MEAN = 40/10 = 4
Notice that the sum of the deviations is 0.
Notice that every single observation enters into the computation of the mean.
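The zero-sum property of the deviations can be verified directly. The sketch below assumes the ten measurements are the ones used in the sample variance example later in these slides (they total 40); the variable names are illustrative.

```python
# Ten measurements, assumed to match the variance example (sum = 40).
x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

mean = sum(x) / len(x)
deviations = [xi - mean for xi in x]

print(mean)             # the mean of the measurements
print(deviations)       # one deviation per observation
print(sum(deviations))  # deviations from the mean always cancel out
```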
Mean
1. Means can be badly affected by outliers
[Figure: data with an outlier marked]
MEASURING LOCATION
Mode
The mode is the most commonly occurring value in
the data. It is not generally used because it is often
not representative of the data, particularly when the
dataset is small.
The mode is most often used for qualitative data; it is rarely used for
quantitative data. But why?
(Forum discussion.)
Mode
The most common data point is called the
mode.
Example of Mode
Measurements
x
3
5
5
1
7
2
6
7
0
4
Example of Mode
Measurements
x
3
5
1
1
4
7
3
8
3
Mode: 3
Mode
[Figure: positions of the mean, median, and mode in a symmetric distribution and in a skewed distribution]
Mode
Mode: You could have a situation in which two or
more values occur the same number of times,
and thus there are multiple modes.
The Excel MODE function does not detect multiple
modes. If you think the mode is an appropriate
measure of central tendency for your data, you
should examine the data visually to see whether the
mode is distinct, or run a frequency distribution and
histogram.
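One way to catch multiple modes programmatically is to count every value and keep all values that share the highest count. The sketch below is an illustration, not the method used in the slides; the function name `modes` is an assumption, and the two data sets are the ones from the mode examples above.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

first_example = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
second_example = [3, 5, 1, 1, 4, 7, 3, 8, 3]

modes_first = modes(first_example)    # two values tie for most frequent
modes_second = modes(second_example)  # a single, distinct mode

print(modes_first, modes_second)
```

The first data set has two modes (5 and 7 each occur twice), which a single-valued mode function would hide.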
MEASURING LOCATION
Quartiles
The median is referred to as Q2 and is sometimes called the
50th percentile, since 50% of the numbers fall below it
and 50% fall above it.
Interquartile Range
A quartile is a value that marks one of the divisions that break a series of
values into four equal parts.
The median divides the cases in half.
The 25th percentile is the quartile that divides the first 1/4 of cases from the latter 3/4.
The 75th percentile is the quartile that divides the first 3/4 of cases from the latter 1/4.
[Figure: distribution divided into four parts, each holding 25% of cases, with divisions at the 25th, 50th, and 75th percentiles]
Interquartile Range
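Positions of the quartiles in the ordered data, where n is the number of observations:
$Q_1 = \frac{n+2}{4}$
$Q_2 = \frac{n+1}{2}$
$Q_3 = \frac{3n+2}{4}$
$IQR = Q_3 - Q_1$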
MEASURING LOCATION
Outliers
Outliers (extreme values) usually demand investigation. Often they are
errors in the data (e.g. due to instrument failure or errors in recording).
But they may also be very important (e.g. a new
scientific observation). If there is no reason to suspect
they have been wrongly recorded, you may want to use
summaries that are resistant to their influence (e.g.,
medians rather than means).
Outliers should not be discarded without good reason.
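To see why the median counts as a resistant summary, compare how the mean and the median react to a single bad value. The sketch below uses the Class A IQs from earlier and a hypothetical recording error (140 typed as 1400); the error is invented purely for illustration.

```python
import statistics

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

# Hypothetical recording error: the largest IQ entered as 1400 instead of 140.
with_error = class_a[:-1] + [1400]

mean_shift = statistics.mean(with_error) - statistics.mean(class_a)
median_shift = statistics.median(with_error) - statistics.median(class_a)

print(round(mean_shift, 2))  # how far the mean moves
print(median_shift)          # how far the median moves
```

The mean moves by almost 97 IQ points, while the median does not move at all.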
MEASURING LOCATION
Outliers
Usually outliers are identified visually, but how does a computer
identify an outlier?
Let
Q1 = 25th percentile
Q3 = 75th percentile
H = Q3 - Q1 (the interquartile range)
An outlier is defined as any value less than Q1 - 1.5*H or greater
than Q3 + 1.5*H. An extreme outlier is defined as any value less
than Q1 - 3*H or greater than Q3 + 3*H.
The interquartile range, H = Q3 - Q1, is a measure of the variability of
the distribution
(H spans the middle 50% of the observations).
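The rule above translates almost line for line into code. The following is a sketch under stated assumptions: quartiles are computed with Python's `statistics.quantiles(..., method="inclusive")`, which is one of several quartile conventions and may differ slightly from the one used in these slides; the function name `classify_outliers` is illustrative; and the data are the thirteen IQ values from the mean example whose total is 1433.

```python
import statistics

def classify_outliers(values):
    """Apply the 1.5*H and 3*H rules; extreme outliers are reported separately."""
    # Assumption: "inclusive" quartile convention from the standard library.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    h = q3 - q1  # interquartile range
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * h or v > q3 + 3 * h:
            extreme.append(v)
        elif v < q1 - 1.5 * h or v > q3 + 1.5 * h:
            mild.append(v)
    return q1, q3, mild, extreme

iq_values = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]
q1, q3, mild, extreme = classify_outliers(iq_values)

print(q1, q3)   # quartiles under the assumed convention
print(mild)     # beyond 1.5*H but within 3*H
print(extreme)  # beyond 3*H
```

Here H = 24, so the fences are 60 and 156; the value 162 is flagged as an outlier but not as an extreme outlier.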
MEASURING LOCATION
Trimmed Mean
MEASURING VARIABILITY
Range
The spread, or the distance, between the lowest and highest values
of a variable.
To get the range for a variable, you subtract its lowest value from its
highest value.
Class A--IQs of 13 Students
102
115
128
109
131
89
98
106
140
119
93
97
110
Class A Range = 140 - 89 = 51
MEASURING VARIABILITY
Variance
A measure of the spread of the recorded values on a variable. A
measure of dispersion.
The larger the variance, the further the individual cases are from
the mean.
The smaller the variance, the closer the individual scores are to
the mean.
MEASURING VARIABILITY
Example of Sample Variance
Measurements (x), deviations (x - mean), and squares of deviations:
    x    x - mean    (x - mean)^2
    3      -1             1
    5       1             1
    5       1             1
    1      -3             9
    7       3             9
    2      -2             4
    6       2             4
    7       3             9
    0      -4            16
    4       0             0
Total 40    0            54
Mean = 4
Variance = 54/9 = 6
It is a measure of spread.
Notice that the larger the deviations (positive or negative), the larger the
variance.
MEASURING VARIABILITY
Variance
The last step
The approximate average of the sum of squares (SS, the sum of squared
deviations from the mean) is the variance.
SS/N = Variance for a population.
SS/(n-1) = Variance for a sample.
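The two divisors can be compared on the ten measurements from the sample variance example. This is a minimal sketch; the variable names are assumptions, and the data are treated first as a whole population and then as a sample purely to show the difference.

```python
x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

n = len(x)
mean = sum(x) / n
ss = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations (SS)

population_variance = ss / n        # SS/N: data treated as the whole population
sample_variance = ss / (n - 1)      # SS/(n-1): data treated as a sample

print(ss, population_variance, sample_variance)
```

Dividing by n - 1 gives 6, matching the worked example; dividing by N gives the smaller value 5.4.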
MEASURING VARIABILITY
Standard Deviation
[Figure: two distributions with the same mean of 25; the left has s.d. = 3 (axis marks 19, 25, 31) and the right has s.d. = 6 (axis marks 13, 25, 37)]
s.d. = 0 only when all values are the same (only when you have a
constant and not a variable).
If you were to rescale a variable, the s.d. would change by the same
magnitude: if we changed units above so that the mean equaled 250, the s.d.
on the left would be 30, and on the right, 60.
Like the mean, the s.d. will be inflated by an outlier case value.
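The first two properties are easy to confirm numerically. The sketch below reuses the ten measurements from the variance example (an assumption made for convenience, since the figure's underlying data are not given) and multiplies them by 10 to mimic a change of units.

```python
import math
import statistics

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

sd = statistics.stdev(x)                             # sample s.d. of the original values
sd_rescaled = statistics.stdev([10 * v for v in x])  # same data in units 10 times smaller
sd_constant = statistics.stdev([25, 25, 25])         # a constant, not a variable

print(round(sd, 3), round(sd_rescaled, 3), sd_constant)
print(math.isclose(sd_rescaled, 10 * sd))  # s.d. scales by the same factor as the data
```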
Percentiles
The p-th percentile is a number such that at most p%
of the measurements are below it and at most (100 - p)%
of the measurements are above it.
Example: if the 85th percentile of a data set is 340,
then about 15% of the measurements in the data are
above 340 and about 85% of the
measurements are below 340.
Notice that the median is the 50th percentile.
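The definition can be turned into a direct check: count the measurements below and above a candidate value and compare the counts with p% and (100 - p)% of n. The function name `is_percentile` is hypothetical, and the data are the Class A IQs used earlier.

```python
def is_percentile(values, candidate, p):
    """True if at most p% of values lie below and at most (100 - p)% lie above."""
    n = len(values)
    below = sum(v < candidate for v in values)
    above = sum(v > candidate for v in values)
    return below <= p / 100 * n and above <= (100 - p) / 100 * n

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

median_is_50th = is_percentile(class_a, 109, 50)  # the median, 109
other_is_50th = is_percentile(class_a, 110, 50)   # a neighbouring value

print(median_is_50th, other_is_50th)
```

Only the median (109) satisfies the definition for p = 50 here: six values lie below it and six above, and both counts are within 6.5.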
MEASURING VARIABILITY
Coefficient of Variation
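When comparing distributions with different means and
variances, a useful measure is the coefficient of variation
(CV), the standard deviation expressed relative to the mean
(often multiplied by 100 and reported as a percentage):
$CV = \frac{s}{\bar{x}}$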
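As an illustration, the CV can be computed for the two groups of IQ scores from the mean example. The sketch assumes the usual definition CV = s / x-bar with the sample standard deviation; the function name `coefficient_of_variation` and the grouping of values into Class A and Class B (inferred from the totals 1437 and 1433) are assumptions.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values)

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

cv_a = coefficient_of_variation(class_a)
cv_b = coefficient_of_variation(class_b)

print(round(cv_a, 2), round(cv_b, 2))
```

The two classes have nearly the same mean, but Class B's larger CV shows that its scores are more spread out relative to that mean.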
Further Notes
When the Mean is greater than the Median the
data distribution is skewed to the Right.
When the Median is greater than the Mean the
data distribution is skewed to the Left.
When Mean and Median are very close to each
other the data distribution is approximately
symmetric.
PRACTICE
Bar Chart
Summarizes categorical data.
Horizontal axis represents categories,
while vertical axis represents either counts
(frequencies) or percentages (relative
frequencies).
Used to illustrate the differences in
percentages (or counts) between
categories.
Constructing Histograms
Used for numeric variables, so class intervals are needed.
Let Range = Largest - Smallest Measurement.
Break the range into (say) 5-15 intervals, depending on sample size.
Make the width of the subintervals a convenient unit, and choose
break points so that no observations fall on them.
To determine the number of classes, k, for a set of data
consisting of n observations, the formula below can be used:
$k = \log_2 n$
If the number of observations is 60, then the number of
classes is $k = \log_2 60 = 5.9 \approx 6$.
Construct Histogram
Draw bars over each subinterval with height representing
class frequency or relative frequency (shape will be the
same)
Leave no space between bars to imply adjacency of class
intervals
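The construction steps can be sketched as a small function that chooses the number of classes and counts the observations in each. Two assumptions are made here: the class-count rule is taken to be k = log2(n) rounded to the nearest integer, which reproduces the 5.9 (about 6) quoted for n = 60, and the classes are simply equal-width intervals from the smallest to the largest value, without the "convenient unit" adjustment of break points recommended above. The function name `class_intervals` is illustrative.

```python
import math

def class_intervals(values, k=None):
    """Return (edges, counts) for k equal-width classes spanning the data range."""
    n = len(values)
    if k is None:
        k = round(math.log2(n))  # assumed rule; gives 6 classes for n = 60
    low, high = min(values), max(values)
    width = (high - low) / k     # assumes the values are not all identical
    counts = [0] * k
    for v in values:
        # The largest value would land on the last edge, so keep it in the last class.
        index = min(int((v - low) / width), k - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(k + 1)]
    return edges, counts

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]
edges, counts = class_intervals(class_a)

print(edges)   # class boundaries
print(counts)  # class frequencies, i.e. the bar heights
```

For the 13 Class A IQs this gives four classes; dividing each count by n would give relative frequencies with the same shape.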
Interpreting Histograms
Probability: Heights of bars over the class intervals are
proportional to the chances an individual chosen at
random would fall in the interval
Unimodal: A histogram with a single major peak
Bimodal: A histogram with two distinct peaks
(often evidence of two distinct groups of units)
[Figure: histogram of GPA for n = 92 students; vertical axis: Frequency (Count), 0 to 6]
Dot Plot
Summarizes measurement data.
Horizontal axis represents measurement
scale.
Plot one dot for each data point.
Dot Plot
[Figure: dot plots of fastest ever driving speed for 226 Stat 100 students, Fall '98 (100 men, 126 women); horizontal axis marked from 70 to 160]
Stem-and-Leaf Plot
Summarizes measurement data.
Each data point is broken down into a
stem and a leaf.
First, stems are aligned in a column.
Then, leaves are attached to the stems.
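The stem-and-leaf construction can be sketched for whole-number data by using the tens as stems and the units digit as leaves. This is an illustrative implementation, not the software used for the slides; the function name `stem_and_leaf` is an assumption, and the data are the Class A IQs from earlier.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build display lines for non-negative integers: stem = tens, leaf = units digit."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
lines = stem_and_leaf(class_a)

print("\n".join(lines))
```

Each original value can be read back from the display, for example stem 10 with leaves 2, 6, 9 stands for 102, 106, and 109.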
Stem-and-Leaf Plots
A simple approach to obtaining the shape of a distribution without
losing individual measurements to class intervals.
Procedure:
N = 139
Depth  Stem  Leaves
  12    0    223334444444
  63    0    555555555555566666666677777778888888888888999999999
 (33)   1    000000000000011112222233333333444
  43    1    555555556667777888
  25    2    0000000000023
  12    2    5557
   8    3    0023
   4    3
   4    4    00
   2    4
   2    5    0
   1    5
   1    6
   1    6
   1    7
   1    7    5
PRACTICE
Box Plots
[Figure: annotated box plot marking Q1, Q2, and Q3, the maximum value without outliers, the outlier zones (from 1.5H to 3H beyond Q1 and Q3), and the extreme outlier zones (more than 3H beyond Q1 and Q3)]
Box Plot
Summarizes measurement data.
Vertical (or horizontal) axis represents
measurement scale.
Lines in box represent the 25th percentile
(first quartile), the 50th percentile
(median), and the 75th percentile (third
quartile), respectively.
Box Plot
Whiskers are drawn to the most extreme
data points that are not more than 1.5
times the length of the box beyond either
quartile.
Whiskers are useful for identifying outliers.
Box Plots
Box Plots - Display a box containing the middle
50% of the measurements, with a line at the median
and lines (whiskers) extending from the box. Breaks the data
into four quartiles.
Outliers - Observations falling more than
1.5*IQR above the upper quartile or below the lower quartile.
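The numbers a box plot is drawn from can be computed without any plotting library. The sketch below assumes Python's "inclusive" quartile convention (other conventions give slightly different quartiles), uses the hypothetical function name `box_plot_summary`, and takes the two groups of IQ scores from the mean example as data, with the Class A and Class B grouping inferred from the totals 1437 and 1433.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5*IQR rule."""
    ordered = sorted(values)
    # Assumption: "inclusive" quartile convention from the standard library.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)
    inside = [v for v in ordered if q1 - reach <= v <= q3 + reach]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0],    # most extreme points still within reach
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < q1 - reach or v > q3 + reach],
    }

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

summary_a = box_plot_summary(class_a)
summary_b = box_plot_summary(class_b)

print(summary_a)
print(summary_b)
```

For Class A the whiskers reach the minimum and maximum; for Class B the upper whisker stops at 131 and 162 is plotted separately as an outlier.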
BOX-Plots
[Figure: box plot of IQ on an axis from 80 to 180, annotated with the values 82, 96.5, 106.5, M = 110.5, and 123.5]
Box Plot
[Figure: box plot of the amount of sleep in the past 24 hours of Spring 1998 Stat 250 students; axis marked from 0 to 10, with one outlier marked]
Scatter Plots
[Figure: scatter plot of foot sizes of Spring 1998 Stat 250 students; both axes marked from 22 to 31]
Scatter Plots
Summarizes the relationship between two
measurement variables.
Plot one point for each pair of measurements.
No relationship
[Figure: scatter plot of lengths of left forearms and head circumferences of Spring 1998 Stat 250 students; axes marked from 22 to 32 and from 52 to 62]
Closing comments
Many possible types of graphs.
Use common sense in reading graphs.
When creating graphs, don't summarize
your data too much or too little.
When creating graphs, label everything for
others.
Remember you are trying to
communicate something to others!
Histograms in SPSS
After importing your dataset and providing names
for the variables, click on:
GRAPHS -> HISTOGRAM
Select the variable to be plotted.
Click on DISPLAY NORMAL CURVE if you want a normal
curve superimposed (see Chapter 4).
PRACTICE
SPSS
Question 1
Given the sample records:
86, 23, 41, 98, 96, 20, 40, 32, 66, 50, 92, 40, 60
Complete Table 1.
Draw a boxplot and a stem-and-leaf plot.
Comment on your results.
Table 1: Descriptive statistics
count
mean
sample standard deviation
sample variance
minimum
maximum
range
coefficient of variation (CV)
1st quartile
median
3rd quartile
interquartile range
mode
low extremes
low outliers
high outliers
high extremes
Question 2
Given the sample records:
34, 53, 23, 17, 54, 12, 78, 98, 199
Complete Table 1.
Draw a boxplot and a stem-and-leaf plot.
Comment on your results.
Table 1: Descriptive statistics
count
mean
sample standard deviation
sample variance
minimum
maximum
range
coefficient of variation (CV)
1st quartile
median
3rd quartile
interquartile range
mode
low extremes
low outliers
high outliers
high extremes
Descriptive Statistics
Now you are qualified to use descriptive statistics!