Chapter 2 Student Version

UMS-Faculty of Engineering 2016/2017
Sem 1/HK01
STATISTICS: CHAPTER 2
Descriptive Statistics and Frequency
Distribution
Dr. Harimi Djamila
University Malaysia Sabah
CHAPTERKA220102 Sem 1 2016/2017

STATISTICS
DESCRIPTIVE STATISTICS
Presenting and Summarizing Data
Why Descriptive Statistics

1. Gaining familiarity with the data.
2. Looking for maximum, minimum, and mean values, unusually
high or low values (outliers), and missing values, and making
any necessary corrections.
3. Checking the assumptions required for statistical
testing.
4. Comparing your data with other researchers' work
to uncover any unusual values in your data
that will be used for further analysis.
MEASURING LOCATION
Median (Q2)
The median is the middle number of a set of data
when all observations (values) are sorted in order.
Applicable to quantitative data only.
It is particularly useful where there are unusually low
or high values (outliers) that would render the mean
unrepresentative of the data (2).
Extreme observations are often referred to as outliers.
Note: outliers are explained a few slides later. You may need
to come back and reread these slides afterwards.
Median
2. If the recorded values for a variable form
a symmetric distribution, the median and
mean are identical.
3. In skewed data, the mean lies further
toward the skew than the median.
[Figure: a symmetric and a skewed distribution, each marked with the mean and the median]
Median
The middle score or measurement in a set of ranked
scores or measurements; the point that divides a
distribution into two equal halves.
Data are listed in order; the median is the point at
which 50% of the cases are above and 50% below.
Known as the 50th percentile.
Median
Class A--IQs of 13 Students
89
93
97
98
102
106
109
110
115
119
128
131
140
Median = 109
(six cases above, six below)
Median
If the first student (IQ 89) were to drop out of Class A, there would be
a new median:
93
97
98
102
106
109
110
115
119
128
131
140
Median = 109.5
(109 + 110)/2 = 219/2 = 109.5
(six cases above, six below)
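The two median calculations above can be checked with a short Python sketch. It assumes Python's standard `statistics` module, which is not part of the original slides; the data are the Class A IQs as listed.

```python
from statistics import median

# IQ scores for Class A, as listed on the slide (already sorted)
class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

# Odd number of cases (13): the median is the single middle value
median_all = median(class_a)

# The first student (IQ 89) drops out, leaving an even number of cases (12):
# the median is the average of the two middle values, 109 and 110
median_after_dropout = median(class_a[1:])

print(median_all, median_after_dropout)
```

With an odd count the median is an observed value; with an even count it can fall between two observations.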
MEASURING LOCATION
Mean
It is commonly known as the average.
μ: Population mean
x̄: Sample mean (x-bar)
Of all the measures of location, the arithmetic mean is the most
commonly used in many statistical contexts.
It is strongly influenced by extreme values (outliers); it
is most representative when data are symmetrically distributed.
Mean
Sample Mean
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
n = number of cases in the sample
$\sum$ = Greek letter sigma = sum or add up what follows
i = a typical case or each case in the sample (1 through n)
Population Mean
$\mu = \frac{x_1 + x_2 + \dots + x_N}{N}$
N = number of cases in the population
Mean
Class A--IQs of 13 Students
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
ΣXi = 1437
X-barA = ΣXi / n = 1437 / 13 = 110.54
Class B--IQs of 13 Students
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
ΣXi = 1433
X-barB = ΣXi / n = 1433 / 13 = 110.23
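As a quick check of the two class means, here is a minimal Python sketch. The helper name `sample_mean` is hypothetical, introduced only for illustration; the data are the Class A and Class B IQs from this chapter.

```python
# IQ scores as listed for the two classes of 13 students
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def sample_mean(values):
    """Sum of the values divided by the number of cases n (hypothetical helper)."""
    return sum(values) / len(values)

mean_a = sample_mean(class_a)
mean_b = sample_mean(class_b)

print(sum(class_a), round(mean_a, 2))
print(sum(class_b), round(mean_b, 2))
```

The two means are nearly equal even though Class B contains more extreme values (80 and 162), which is why a measure of spread is also needed.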
Example of Mean
Measurements (x)   Deviation (x - mean)
3                  -1
5                   1
5                   1
1                  -3
7                   3
2                  -2
6                   2
7                   3
0                  -4
4                   0
Total: 40           0
MEAN = 40/10 = 4
Notice that the sum of the
deviations is 0.
Notice that every single
observation enters into
the computation of the
mean.
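The claim that the deviations from the mean sum to zero can be verified with a small Python sketch. It uses the ten measurements that this chapter also uses in its mode and variance examples (3, 5, 5, 1, 7, 2, 6, 7, 0, 4), which total 40; the variable names are illustrative only.

```python
# Ten measurements whose total is 40, as used elsewhere in this chapter
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# Every observation contributes to the mean
mean = sum(measurements) / len(measurements)

# Deviation of each observation from the mean
deviations = [x - mean for x in measurements]

print(mean)             # mean of the measurements
print(deviations)       # positive and negative deviations
print(sum(deviations))  # the deviations cancel out
```

Because the deviations always cancel, their plain sum cannot measure spread, which is one motivation for squaring them in the variance.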
Mean
1. Means can be badly affected by outliers
(data points with extreme values unlike
the rest).
2. Outliers can make the mean a bad
measure of central tendency or common
experience.
[Figure: distribution with an outlier labelled]
MEASURING LOCATION
Mode
The mode is the most commonly occurring value in
the data. It is not generally used because it is often
not representative of the data, particularly when the
dataset is small.
The mode is most often used for qualitative data; it is
rarely used for quantitative data. Why?
Forum.
Mode
The most common data point is called the
mode.
Example of Mode
Measurements
x
3
5
5
1
7
2
6
7
0
4
In this case the data have
two modes:
5 and 7
Both values occur
twice.
Example of Mode
Measurements
x
3
5
1
1
4
7
3
8
3
Mode: 3
Notice that it is possible for a
data set to have no mode.
Mode
1. It may give you the most likely experience rather
than the typical or central experience.
2. In symmetric unimodal distributions, the mean, median, and
mode are the same.
3. In skewed data, the mean and median lie further
toward the skew than the mode.
[Figure: symmetric and skewed distributions with the mode, median, and mean marked]
Mode
Mode: You could have a situation in which two or
more values occur the same number of times
and thus there are multiple modes (1).
The Excel MODE function does not detect multiple
modes. If you think the mode is an appropriate
measure of central tendency for your data, you
should examine the data visually to see if the
mode is distinct, or run a frequency distribution and
histogram.
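For readers working outside Excel, multiple modes can be detected programmatically. The sketch below assumes Python 3.8 or later, whose standard `statistics.multimode` function returns every value tied for the highest count; it is offered as an illustration, not as part of the original slides. The first two data sets are the mode examples from this chapter; the third is hypothetical.

```python
from statistics import multimode

# First mode example from this chapter: 5 and 7 each occur twice
two_modes = multimode([3, 5, 5, 1, 7, 2, 6, 7, 0, 4])

# Second mode example: 3 occurs three times
one_mode = multimode([3, 5, 1, 1, 4, 7, 3, 8, 3])

# Hypothetical data with no repeated value: every value ties,
# so the data set has no meaningful mode
no_repeats = multimode([1, 2, 3])

print(two_modes, one_mode, no_repeats)
```

When the returned list contains every distinct value, as in the last case, it is reasonable to report that the data have no mode.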
MEASURING LOCATION
Quartiles
The median is referred to as Q2, sometimes called the
50th percentile, since 50% of the numbers fall below it
and 50% fall above it.
As descriptive measures, the 25th percentile is referred
to as Q1 and the 75th percentile as Q3. The median
is Q2.
Since Q1, Q2 and Q3 split the data into four sections,
they are called quartiles.
Interquartile Range
A quartile is a value that marks one of the divisions that break a series of
values into four equal parts.
The median divides the cases in half.
The 25th percentile is the quartile that divides the first 1/4 of cases from the latter 3/4.
The 75th percentile is the quartile that divides the first 3/4 of cases from the latter 1/4.
The interquartile range is the distance or range between the 25th
percentile and the 75th percentile: H = Q3 - Q1.
[Figure: percentile axis marked 0, 25, 50, 75, 100, divided into four segments each containing 25% of cases]
Interquartile Range
Position of Q1 = (n + 2)/4
Position of Q2 = (n + 1)/2
Position of Q3 = (3n + 2)/4
Interquartile range = Q3 - Q1
MEASURING LOCATION
Outliers
Outliers (extreme values) usually demand investigation.
Often they are errors in the data (e.g. due
to instrument failure or errors in recording).
But they also may be very important (e.g. a new
scientific observation). If there is no reason to suspect
they have been wrongly recorded, you may want to use
summaries that are resistant to their influence (e.g.,
medians rather than means).
Outliers should not be discarded without good reason.
MEASURING LOCATION
Outliers
Usually outliers are identified visually, but how does a computer
identify an outlier?
Let
Q1 = 25th percentile
Q3 = 75th percentile
H = Q3 - Q1 (the interquartile range)
An outlier is defined as any value less than Q1 - 1.5*H or greater
than Q3 + 1.5*H. An extreme outlier is defined as any value less
than Q1 - 3*H or greater than Q3 + 3*H.
The interquartile range, H = Q3 - Q1, is a measure of the variability of
the distribution
(H contains the middle 50% of the observations).
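The rule above translates directly into code. The following Python sketch is one possible way a program might apply it; the function name `classify` is hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which is an assumption: other software (SPSS, Excel) may use a slightly different quartile convention and give slightly different fences. The values 170 and 250 are hypothetical test values, not part of the Class B data.

```python
from statistics import quantiles

def classify(value, q1, q3):
    """Label a value using the 1.5*H and 3*H rules (hypothetical helper)."""
    h = q3 - q1  # interquartile range
    if value < q1 - 3 * h or value > q3 + 3 * h:
        return "extreme outlier"
    if value < q1 - 1.5 * h or value > q3 + 1.5 * h:
        return "outlier"
    return "not an outlier"

# Class B IQs from this chapter
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

# Quartiles under Python's default ("exclusive") method, an assumed convention
q1, q2, q3 = quantiles(class_b, n=4)

labels = {x: classify(x, q1, q3) for x in class_b}
print(q1, q3, labels[162])
```

Under this convention even the largest Class B score, 162, stays inside the upper fence, so it would not be flagged, although it is clearly the most extreme case in the class.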
MEASURING LOCATION
Trimmed Mean
A trimmed mean discards a fixed percentage of the lowest and highest
values (where outliers would lie) and averages the remaining values.
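A trimmed mean can be sketched in a few lines of Python. The version below assumes one common formulation, dropping the k smallest and k largest values before averaging; the function name `trimmed_mean` and the choice k = 1 are illustrative assumptions, and statistical packages usually express the trim as a percentage instead.

```python
def trimmed_mean(values, k):
    """Average after dropping the k smallest and k largest values (a sketch)."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this data set")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

# Class B IQs from this chapter; 80 and 162 are the most extreme values
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

plain_mean = trimmed_mean(class_b, 0)   # k = 0 is the ordinary mean
trimmed = trimmed_mean(class_b, 1)      # drops 80 and 162

print(round(plain_mean, 2), round(trimmed, 2))
```

Dropping one value from each end lowers the Class B average slightly, because the high value 162 lies further from the centre than the low value 80.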
MEASURING VARIABILITY
Range
The spread, or the distance, between the lowest and highest values
of a variable.
To get the range for a variable, you subtract its lowest value from its
highest value.
Class A--IQs of 13 Students
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Class A Range = 140 - 89 = 51
Class B--IQs of 13 Students
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
Class B Range = 162 - 80 = 82
Variance
A measure of the spread of the recorded values on a variable. A
measure of dispersion.
The larger the variance, the further the individual cases are from
the mean.
[Figure: distribution with the mean marked]
The smaller the variance, the closer the individual scores are to
the mean.
[Figure: distribution with the mean marked]
Variance for Population (σ²)

Steps:
Calculate the mean
Compute each deviation from the mean
Square each deviation
Sum all the squares
Divide by the data size of the population: N
Variance for Sample (s²)

Steps:
Calculate the mean
Compute each deviation from the mean
Square each deviation
Sum all the squares
Divide by the data size minus one: n - 1
To be continued: variance for population
Example of
Sample Variance
Measurements (x)   Deviations (x - mean)   Square of deviations
3                  -1                       1
5                   1                       1
5                   1                       1
1                  -3                       9
7                   3                       9
2                  -2                       4
6                   2                       4
7                   3                       9
0                  -4                      16
4                   0                       0
Total: 40           0                      54
Mean = 4
Variance = 54/9 = 6
It is a measure of
spread.
Notice that the larger the
deviations (positive or
negative), the larger the
variance.
Variance
If you were to add all the squared deviations
together, you'd get what we call the
Sum of Squares.
Sum of Squares: $SS = \sum_{i=1}^{n} (x_i - \text{mean})^2$
$SS = (x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_n - \text{mean})^2$
Variance
The last step
The (approximate) average of the sum of squares is the
variance.
SS/N = Variance for a population.
SS/(n - 1) = Variance for a sample.
Standard Deviation
It is defined as the square root of the variance.
In the previous example
Variance = 6
Standard deviation = Square root of the variance
= Square root of 6 = 2.45
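The steps for the sum of squares, the variance, and the standard deviation can be traced in a short Python sketch. It uses the ten measurements from the sample variance example; the variable names are illustrative only.

```python
import math

# Measurements from the sample variance example
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
n = len(measurements)

# Step 1: calculate the mean
mean = sum(measurements) / n

# Steps 2-4: deviations from the mean, squared, then summed (Sum of Squares)
ss = sum((x - mean) ** 2 for x in measurements)

# Step 5: divide by n - 1 for a sample, or by N for a whole population
sample_variance = ss / (n - 1)
population_variance = ss / n

# The standard deviation is the square root of the variance
sample_sd = math.sqrt(sample_variance)

print(ss, sample_variance, population_variance, round(sample_sd, 2))
```

Dividing by n - 1 rather than n makes the sample variance slightly larger than the population formula would give for the same numbers.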
Standard Deviation
1. Larger s.d. = greater amounts of variation around the mean.
For example:
[Figure: two distributions, each with mean x-bar = 25; left with s.d. = 3, right with s.d. = 6; axis ticks 19, 25, 31 and 13, 25, 37]
2. s.d. = 0 only when all values are the same (only when you have a
constant and not a variable)
3. If you were to rescale a variable, the s.d. would change by the same
magnitude. If we changed units above so the mean equaled 250, the s.d.
on the left would be 30, and on the right, 60
4. Like the mean, the s.d. will be inflated by an outlier case value.
Percentiles
The p-th percentile is a number such that at most p%
of the measurements are below it and at most (100 - p)%
of the data are above it.
For example, if the 85th percentile of a data set is 340,
then 15% of the measurements in the data are
above 340 and 85% of the
measurements are below 340.
Notice that the median is the 50th percentile.
Coefficient of Variation
When comparing distributions with different means and
variances, a useful measure is the coefficient of variation
(CV).
$CV = \frac{s}{\bar{x}} \times 100\%$
The rule of thumb is that the larger the percentage,
the greater the variability relative to the mean.
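As an illustration, the coefficient of variation can be computed for the two IQ classes from this chapter. The sketch assumes the usual definition, sample standard deviation divided by the mean and expressed as a percentage; the function name `coefficient_of_variation` is hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean (assumed definition)."""
    return stdev(values) / mean(values) * 100

# IQ scores for the two classes used in this chapter
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

cv_a = coefficient_of_variation(class_a)
cv_b = coefficient_of_variation(class_b)

print(round(cv_a, 2), round(cv_b, 2))
```

The two classes have almost the same mean, yet Class B has the larger CV, reflecting its wider spread of scores.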
Further Notes
When the Mean is greater than the Median the
data distribution is skewed to the Right.
When the Median is greater than the Mean the
data distribution is skewed to the Left.
When Mean and Median are very close to each
other the data distribution is approximately
symmetric.
PRACTICE
Which graph to use?
Depends on type of data

Depends on what you want to illustrate
Depends on available statistical software
Bar Chart
Summarizes categorical data.
Horizontal axis represents categories,
while vertical axis represents either counts
(frequencies) or percentages (relative
frequencies).
Used to illustrate the differences in
percentages (or counts) between categories.
Bar Chart
Constructing Histograms
Used for numeric variables, so need Class Intervals
Let Range = Largest - Smallest Measurement
Break range into (say) 5-15 intervals depending on sample size
Make the width of the subintervals a convenient unit, and make
break points so that no observations fall on them
To determine the number of classes, k, for a set of data
consisting of n observations, the formula below can be used:
$k = \log_2 n = \frac{\log n}{\log 2}$
If the number of data is 60, then the number of
classes is $k = \log_2 60 = 5.9 \approx 6$
Obtain Class Frequencies, the number in each subinterval
Obtain Relative Frequencies, proportion in each subinterval
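The procedure can be sketched in Python as follows. Two assumptions are made here and should be read as such: the number of classes is taken as k = log2(n) rounded to the nearest integer, which reproduces the value 5.9 (about 6) quoted for 60 observations, and the classes are simple equal-width intervals spanning the range rather than hand-picked convenient break points. The function names are hypothetical.

```python
import math

def number_of_classes(n):
    """Number of classes, assuming the rule k = log2(n), rounded."""
    return round(math.log2(n))

def class_frequencies(data, k):
    """Count the observations in k equal-width classes spanning the range."""
    lowest, highest = min(data), max(data)
    width = (highest - lowest) / k
    counts = [0] * k
    for x in data:
        # The largest value sits on the upper edge, so clamp it into the last class
        index = min(int((x - lowest) / width), k - 1)
        counts[index] += 1
    return counts

# Class A IQs from this chapter
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

k = number_of_classes(len(class_a))
frequencies = class_frequencies(class_a, k)
relative = [count / len(class_a) for count in frequencies]

print(number_of_classes(60), k, frequencies)
```

In practice the class width would be rounded to a convenient unit, and break points chosen so that no observation falls exactly on one, as the slide recommends.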
Construct Histogram
Draw bars over each subinterval with height representing
class frequency or relative frequency (shape will be the
same)
Leave no space between bars to imply adjacency of class
intervals
Histogram
Interpreting Histograms
Probability: Heights of bars over the class intervals are
proportional to the chances an individual chosen at
random would fall in the interval
Unimodal: A histogram with a single major peak
Bimodal: Histogram with two distinct peaks
(often evidence of two distinct groups of units)
Uniform: Interval heights are approximately equal
Symmetric: Right and Left portions are same shape
Right-Skewed: Right-hand side extends further
Left-Skewed: Left-hand side extends further
Too few categories
[Figure: histogram of Age of Spring 1998 Stat 250 Students; x-axis: Age (in years); n=92 students]
Too many categories
[Figure: histogram of GPAs of Spring 1998 Stat 250 Students; x-axis: GPA; y-axis: Frequency (Count); n=92 students]
Dot Plot
Summarizes measurement data.
Horizontal axis represents measurement
scale.
Plot one dot for each data point.
Dot Plot
[Figure: dot plots of Fastest Ever Driving Speed, 226 Stat 100 Students, Fall '98; 100 men and 126 women; x-axis: Speed, 70 to 160]
Stem-and-Leaf Plot
Each data point is broken down into a
stem and a leaf.
First, stems are aligned in a column.
Then, leaves are attached to the stems.
Stem-and-Leaf Plots
Simple approach to obtaining the shape of a distribution without
losing individual measurements to class intervals.
Procedure:
Split each measurement into 2 sets of digits (stem and leaf)
List stems from smallest to largest
Line up the corresponding leaves beside the stems, from smallest to largest
If too cramped/narrow, break stems into two groups: low with
leaves 0-4 and high with leaves 5-9
When numbers have many digits, trim off right-most (less
significant) digits. Leaves should always be a single digit.
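The procedure above can be sketched in Python for whole-number data. This is a minimal illustration that assumes the leaf is the last digit and the stem is everything before it; the function name `stem_and_leaf` is hypothetical, and the sketch neither splits stems into low and high groups nor lists stems that have no leaves. The data are the Class A IQs from this chapter.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    plot = defaultdict(list)
    for x in sorted(values):
        plot[x // 10].append(x % 10)
    return dict(plot)

# Class A IQs from this chapter
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

plot = stem_and_leaf(class_a)

# Print stems from smallest to largest with their leaves beside them
for stem in sorted(plot):
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in plot[stem])}")
```

Each printed row keeps the individual measurements visible, for example stem 9 with leaves 3, 7, 8 stands for 93, 97 and 98.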
Example Stem-and-Leaf Plot

Stem-and-leaf of Shoes   N = 139
Leaf Unit = 1.0
  12   0   223334444444
  63   0   555555555555566666666677777778888888888888999999999
 (33)  1   000000000000011112222233333333444
  43   1   555555556667777888
  25   2   0000000000023
  12   2   5557
   8   3   0023
   4   3
   4   4   00
   2   4
   2   5   0
   1   5
   1   6
   1   6
   1   7
   1   7   5
PRACTICE
BOX-Plots
A way to graphically portray almost all the
descriptive statistics at once is the box-plot.
A box-plot shows:
Upper and lower quartiles
Mean
Median
Range
Outliers (1.5 IQR)
Box Plots
[Figure: annotated box plot showing Q1, Q2, and Q3; whiskers ending at the minimum and maximum values without outliers; an outlier zone of width 1.5H beyond each whisker limit; extreme outlier zones beyond those]
Box Plot
Vertical (or horizontal) axis represents
measurement scale.
Lines in box represent the 25th percentile
(first quartile), the 50th percentile
(median), and the 75th percentile (third
quartile), respectively.
Box Plot
Whiskers are drawn to the most extreme
data points that are not more than 1.5
times the length of the box beyond either
quartile.
Whiskers are useful for identifying outliers.
Outliers, or extreme observations, are
denoted by asterisks.
Generally, data points falling beyond the
whiskers are considered outliers.
Box Plots
Box Plots - Display a box containing middle
50% of measurements with line at median
and lines extending from box. Breaks data
into four quartiles
Outliers - Observations falling more than
1.5 IQR above the upper quartile or below the lower quartile
Using Box Plots to Compare
BOX-Plots
[Figure: box plot of IQ, axis from 80.00 to 180.00; labelled values: 162, 123.5, M = 110.5, 106.5, 96.5, 82]
IQR = 27; there is no outlier.
Box Plot
[Figure: box plot of Amount of sleep in past 24 hours of Spring 1998 Stat 250 Students; axis from 0 to 10; one point labelled as an outlier]
Which graph to use when?

Stem-and-leaf plots and dot plots are good
for small data sets, while histograms and
box plots are good for large data sets.
Boxplots and dotplots are good for
comparing two groups.
Boxplots are good for identifying outliers.
Histograms and boxplots are good for
identifying shape of data.
Scatter Plots
[Figure: scatter plot of Foot sizes of Spring 1998 Stat 250 students; x-axis: Left foot (in cm); both axes from 22 to 31; n=88 students]
Scatter Plots
Summarizes the relationship between two
measurement variables.
Horizontal axis represents one variable
and vertical axis represents second
variable.
Plot one point for each pair of
measurements.
No relationship
[Figure: scatter plot of Lengths of left forearms and head circumferences of Spring 1998 Stat 250 Students; x-axis: Head circumference (in cm); n=89 students]
Closing comments
Many possible types of graphs.
Use common sense in reading graphs.
When creating graphs, don't summarize
your data too much or too little.
When creating graphs, label everything for
others.
Remember you are trying to
communicate something to others!
Descriptive Statistics-In SPSS

After importing your dataset and providing names to
variables, click on:
ANALYZE → DESCRIPTIVE STATISTICS → FREQUENCIES
Choose any variables to be analyzed and place them in the
box on the right
Options include (for categorical variables):
Frequency Tables
Pie Charts, Bar Charts
Options include (for numeric variables):
Frequency Tables (useful for discrete data)
Measures of Central Tendency, Dispersion,
Percentiles
Pie Charts, Histograms
Histograms in SPSS
After importing your dataset and providing names
to variables, click on:
GRAPHS → HISTOGRAM
Select the variable to be plotted
Click on DISPLAY NORMAL CURVE if you want a normal
curve superimposed (Next Chapter 4).
Side-by-Side Bar Charts In SPSS

After importing your dataset and providing names
to variables, click on:
GRAPHS → BAR → Clustered (Summaries for
Groups of Cases) → DEFINE
Bars Represent N of Cases (or % of Cases)
CATEGORY AXIS: Variable that represents groups to be
compared (independent variable)
DEFINE CLUSTERS BY: Variable that represents
outcomes of interest (dependent variable)
PRACTICE
SPSS
Question 1
Given sample records
Complete Table 1
Draw a boxplot and a stem-and-leaf plot.
Comment on your results
Descriptive statistics
count
mean
sample standard deviation
sample variance
minimum
maximum
range
8 2 4
coefficient of variation (CV)
1st quartile
median
3rd quartile
interquartile range
mode
low extremes
low outliers
high outliers
high extremes
9
8
9
6
2
0
4
0
3
2
6
6
86, 23, 41, 98, 96, 20, 40, 32, 66, 50, 92, 40, 60
Question 2
Given sample records
Complete Table 1
Draw a boxplot and a stem-and-leaf plot.
Comment on your results
Descriptive statistics
count
mean
sample standard deviation
sample variance
minimum
maximum
range
8 2 4
coefficient of variation (CV)
1st quartile
median
3rd quartile
interquartile range
mode
low extremes
low outliers
high outliers
high extremes
9
8
9
6
2
0
4
0
3
2
6
6
34, 53, 23, 17, 54, 12, 78, 98, 199
Descriptive Statistics
Now you are qualified to use descriptive statistics!
