
UMS-Faculty of Engineering 2016/2017

Sem 1/HK01

STATISTICS: CHAPTER 2
Descriptive Statistics and Frequency
Distribution
Dr. Harimi Djamila
University Malaysia Sabah

CHAPTER KA220102 Sem 1 2016/2017



STATISTICS
DESCRIPTIVE STATISTICS
Presenting and Summarizing Data


Why Descriptive Statistics


1. Gaining familiarity with the data.
2. Looking for maximum, minimum, mean values or
unusually high or low values (Outliers), missing
values, and making any necessary corrections.
3. Checking the assumptions required for statistical
testing.
4. Comparing your data with other researchers' work
to uncover any unusual values in your data
that will be used for further analysis.

MEASURING LOCATION

Median (Q2)
The median is the middle number of a set of data
when all observations (values) are sorted in order.
Applicable to quantitative data only.

It is particularly useful where there are unusually low
or high values (outliers) that would render the mean
unrepresentative of the data (2).
Extreme observations are often referred to as outliers.

Note: outliers are explained a few slides later. You may
need to come back and reread these slides afterwards.

Median
2. If the recorded values for a variable form
a symmetric distribution, the median and
mean are identical.
3. In skewed data, the mean lies further
toward the skew than the median.
[Figure: a symmetric and a skewed distribution, each with the mean and median marked]

Median
The middle score or measurement in a set of ranked
scores or measurements; the point that divides a
distribution into two equal halves.

When data are listed in order, the median is the point at
which 50% of the cases are above and 50% below.
Also known as the 50th percentile.

Median
Class A--IQs of 13 Students
89
93
97
98
102
106
109
110
115
119
128
131
140

Median = 109
(six cases above, six below)

Median
If the first student (IQ 89) were to drop out of Class A, there would be
a new median:
89 (dropped)
93
97
98
102
106
109
110
115
119
128
131
140

Median = 109.5
(109 + 110)/2 = 219/2 = 109.5
(six cases above, six below)
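As a quick check of the two Class A medians, the sketch below recomputes them with Python's standard `statistics.median`. The variable names are illustrative only and are not part of the slides.

```python
from statistics import median

# Class A IQs, already sorted as on the slide.
class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

# Odd number of cases (13): the median is the single middle value.
median_all = median(class_a)

# Drop the first student (IQ 89). Even number of cases (12):
# the median is the average of the two middle values, 109 and 110.
median_after_drop = median(class_a[1:])

print(median_all, median_after_drop)
```

With an odd number of cases the median is an actual observation; with an even number it may fall between two observations.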

MEASURING LOCATION

Mean
It is commonly known as the average.
$\mu$: Population mean
$\bar{x}$: Sample mean = x-bar
Of all the measures of location, the arithmetic mean is the most
commonly used in many statistical contexts.

It is strongly influenced by extreme values (outliers). It
is most representative when data are symmetrically distributed.

Mean
Sample Mean

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}$$

n = number of cases in the sample
$\sum$ = Greek letter sigma = sum or add up what follows
i = a typical case or each case in the sample (1 through n)

Population Mean

$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N}$$

N = number of cases in the population

Mean
Class A--IQs of 13 Students
102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97

$\sum x_i = 1437$
$\bar{x}_A = \frac{\sum x_i}{n} = \frac{1437}{13} = 110.54$

Class B--IQs of 13 Students
127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105

$\sum x_i = 1433$
$\bar{x}_B = \frac{\sum x_i}{n} = \frac{1433}{13} = 110.23$
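The two class means can be reproduced with a few lines of Python. This is only a minimal sketch of the sample-mean formula (sum of the values divided by n); the list names are assumptions made for the example.

```python
from statistics import mean

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

# Sample mean = (sum of all values) / (number of cases).
for name, scores in (("A", class_a), ("B", class_b)):
    total = sum(scores)
    n = len(scores)
    print(f"Class {name}: sum = {total}, n = {n}, mean = {total / n:.2f}")

# The standard library gives the same result directly.
mean_a = mean(class_a)
mean_b = mean(class_b)
```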

Example of Mean
Measurements x    Deviation (x - mean)
3                 -1
5                  1
5                  1
1                 -3
7                  3
2                 -2
6                  2
7                  3
0                 -4
4                  0
Total: 40          0

MEAN = 40/10 = 4
Notice that the sum of the
deviations is 0.
Notice that every single
observation enters into
the computation of the
mean.

Mean
1. Means can be badly affected by outliers
(data points with extreme values unlike
the rest).
2. Outliers can make the mean a bad
measure of central tendency or common
experience.

[Figure: data with an outlier marked]
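To see how strongly one extreme value moves the mean compared with the median, the sketch below appends a single hypothetical score of 300 to the Class A IQs. The value 300 is invented purely for illustration and does not come from the slides.

```python
from statistics import mean, median

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

# Hypothetical extreme value, added only to illustrate the effect.
with_outlier = class_a + [300]

mean_shift = mean(with_outlier) - mean(class_a)
median_shift = median(with_outlier) - median(class_a)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```

The mean shifts by more than 13 points while the median moves by only half a point, which is why the median is preferred when outliers are present.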

MEASURING LOCATION

Mode
The mode is the most commonly occurring value in
the data. It is not generally used because it is often
not representative of the data, particularly when the
dataset is small.
The mode is most often used for qualitative data; it is
rarely used for quantitative data. But why?

Forum.

Mode
The most common data point is called the
mode.


Example of Mode
Measurements
x
3
5
5
1
7
2
6
7
0
4

In this case the data have
two modes:
5 and 7
Both values occur twice.

Example of Mode
Measurements
x
3
5
1
1
4
7
3
8
3

Mode: 3

Notice that it is possible for a
data set not to have any mode.
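The two mode examples can be checked with `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest count. The third list is a hypothetical data set added to show the "no mode" case; treating "every value occurs equally often" as "no mode" is a convention assumed here, not something the function reports by itself.

```python
from statistics import multimode

two_modes = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]   # first slide example
one_mode = [3, 5, 1, 1, 4, 7, 3, 8, 3]       # second slide example
all_distinct = [1, 2, 3, 4]                  # hypothetical: no value repeats

def has_no_mode(data):
    # If every distinct value is returned, no value stands out as a mode.
    return len(multimode(data)) == len(set(data))

modes_first = multimode(two_modes)
modes_second = multimode(one_mode)

print(modes_first, modes_second, has_no_mode(all_distinct))
```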

Mode
1. It may give you the most likely experience rather
than the typical or central experience.
2. In symmetric unimodal distributions, the mean, median, and
mode are the same.
3. In skewed data, the mean and median lie further
toward the skew than the mode.

[Figure: a symmetric and a skewed distribution, with the mode, median, and mean marked]

Mode
Mode: You could have a situation in which two or
more values occur the same number of times,
and thus there are multiple modes (1).
The Excel MODE function does not detect multiple
modes. If you think the mode is an appropriate
measure of central tendency for your data, you
should examine the data visually to see if the
mode is distinct, or run a frequency distribution and
histogram.

MEASURING LOCATION

Quartiles
The median is referred to as Q2, sometimes called the
50th percentile, since 50% of the numbers fall below it
and 50% fall above it.

As descriptive measures, the 25th percentile is referred
to as Q1 and the 75th percentile as Q3. The median
is Q2.

Since Q1, Q2 and Q3 split the data into four sections,
they are called quartiles.

Interquartile Range
A quartile is a value that marks one of the divisions that break a series of
values into four equal parts.
The median divides the cases in half.
The 25th percentile is a quartile that divides the first 1/4 of cases from the latter 3/4.
The 75th percentile is a quartile that divides the first 3/4 of cases from the latter 1/4.

The interquartile range is the distance or range between the 25th
percentile and the 75th percentile: H = Q3 - Q1

[Figure: the cases divided into four groups of 25% each along a 0-100 percentile axis, with divisions at 25, 50, and 75]

Interquartile Range
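Position of Q1 in the ordered data: $\frac{n+2}{4}$
Position of Q2 in the ordered data: $\frac{n+1}{2}$
Position of Q3 in the ordered data: $\frac{3n+2}{4}$
$H = Q_3 - Q_1$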

MEASURING LOCATION

Outliers
Outliers (extreme values) usually demand investigation. Often they are errors in the data (e.g. due
to instrument failure or errors in recording).
But they also may be very important (e.g. a new
scientific observation). If there is no reason to suspect
they have been wrongly recorded, you may want to use
summaries that are resistant to their influence (e.g.,
medians rather than means).
Outliers should not be discarded without good reason.

MEASURING LOCATION

Outliers
Usually outliers are identified visually, but how does a computer
identify an outlier?
Let
Q1 = 25th percentile
Q3 = 75th percentile
H = Q3 - Q1 (the interquartile range)
An outlier is defined as any value less than Q1 - 1.5*H or greater
than Q3 + 1.5*H. An extreme outlier is defined as any value less
than Q1 - 3*H or greater than Q3 + 3*H.
The interquartile range, H = Q3 - Q1, is a measure of the variability of
the distribution
(the interval from Q1 to Q3 contains the middle 50% of the observations).
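The outlier rule above can be sketched in Python as follows. Note the assumption: quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several common quartile conventions and may give slightly different Q1 and Q3 than a hand calculation or SPSS. The function name `classify_outliers` is hypothetical.

```python
from statistics import quantiles

def classify_outliers(data):
    """Apply the Q1 - 1.5*H / Q3 + 1.5*H rule (and the 3*H rule for extremes)."""
    # Assumption: "inclusive" quartile convention; other conventions exist.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    h = q3 - q1  # interquartile range
    outliers = [x for x in data if x < q1 - 1.5 * h or x > q3 + 1.5 * h]
    extreme = [x for x in data if x < q1 - 3 * h or x > q3 + 3 * h]
    return {"q1": q1, "q3": q3, "h": h, "outliers": outliers, "extreme": extreme}

# Class B IQs from the earlier slides.
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]
result = classify_outliers(class_b)
print(result)
```

Under this convention Q1 = 96 and Q3 = 120, so H = 24 and the upper limit is 120 + 36 = 156; the score of 162 is flagged as an outlier but not as an extreme outlier.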

MEASURING LOCATION

Trimmed Mean

The trimmed mean discards all outliers and
averages the remaining values.

MEASURING VARIABILITY

Range
The spread, or the distance, between the lowest and highest values
of a variable.
To get the range for a variable, you subtract its lowest value from its
highest value.
Class A--IQs of 13 Students
102
115
128
109
131
89
98
106
140
119
93
97
110
Class A Range = 140 - 89 = 51

Class B--IQs of 13 Students


127
162
131
103
96
111
80
109
93
87
120
105
109
Class B Range = 162 - 80 = 82

MEASURING VARIABILITY

Variance
A measure of the spread of the recorded values on a variable. A
measure of dispersion.
The larger the variance, the further the individual cases are from
the mean.

[Figure: wide distribution around the mean]

The smaller the variance, the closer the individual scores are to
the mean.

[Figure: narrow distribution around the mean]

MEASURING VARIABILITY

Variance for Population ($\sigma^2$)

Steps:
Calculate the mean
Compute each deviation from the mean
Square each deviation
Sum all the squares
Divide by the data size of the population: N

MEASURING VARIABILITY

Variance for Sample ($s^2$)

Steps:
Calculate the mean
Compute each deviation from the mean
Square each deviation
Sum all the squares
Divide by the data size minus one: n-1
To be continued: Variance for population

Example of

Sample Variance
Measurements x    Deviations (x - mean)    Square of deviations
3                 -1                       1
5                  1                       1
5                  1                       1
1                 -3                       9
7                  3                       9
2                 -2                       4
6                  2                       4
7                  3                       9
0                 -4                       16
4                  0                       0
Total: 40          0                       54

Mean = 4
Variance = 54/9 = 6
It is a measure of spread.
Notice that the larger the deviations (positive or
negative), the larger the variance.
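The worked example can be followed step by step in Python. This is a minimal sketch of the steps listed for the sample variance (mean, deviations, squares, sum, divide by n-1); the population version is shown only for comparison, and all variable names are illustrative.

```python
import math
from statistics import variance, pvariance

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
n = len(x)

mean = sum(x) / n                          # step 1: the mean (4.0)
deviations = [xi - mean for xi in x]       # step 2: deviation of each value
squares = [d ** 2 for d in deviations]     # step 3: square each deviation
ss = sum(squares)                          # step 4: sum of squares (54.0)

sample_var = ss / (n - 1)                  # step 5 (sample): divide by n-1
pop_var = ss / n                           # step 5 (population): divide by N
sample_sd = math.sqrt(sample_var)          # standard deviation of the sample

print(sum(deviations), ss, sample_var, pop_var, round(sample_sd, 2))
```

The manual result agrees with `statistics.variance(x)` (sample) and `statistics.pvariance(x)` (population), and the square root of the sample variance gives a standard deviation of about 2.45.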

MEASURING VARIABILITY

Variance

If you were to add all the squared deviations
together, you'd get what we call the
Sum of Squares.

$$SS = \sum (x_i - \text{mean})^2 = (x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_n - \text{mean})^2$$

MEASURING VARIABILITY

Variance
The last step
The approximate average of the squared deviations is the
variance.
SS/N = Variance for a population.
SS/(n-1) = Variance for a sample.

MEASURING VARIABILITY

Standard Deviation

It is defined as the square root of the variance.

In the previous example
Variance = 6
Standard deviation = Square root of the variance
= Square root of 6 = 2.45

MEASURING VARIABILITY

Standard Deviation
1. Larger s.d. = greater amounts of variation around the mean.
For example:

[Figure: two distributions with mean x-bar = 25; the left has s.d. = 3 (axis ticks at 19, 25, 31) and the right has s.d. = 6 (axis ticks at 13, 25, 37)]

2. s.d. = 0 only when all values are the same (only when you have a
constant and not a variable)
3. If you were to rescale a variable, the s.d. would change by the same
magnitude: if we changed units above so the mean equaled 250, the s.d.
on the left would be 30, and on the right, 60
4. Like the mean, the s.d. will be inflated by an outlier case value.
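The four properties above can be illustrated numerically. The small data sets below are hypothetical values chosen so that the mean is 25 and the sample standard deviations are exactly 3 and 6; they are not the data behind the slide's figure.

```python
import math
from statistics import mean, stdev

narrow = [22, 25, 28]   # illustrative values: mean 25, s.d. 3
wide = [19, 25, 31]     # illustrative values: mean 25, s.d. 6

# 1. Larger s.d. means more variation around the same mean.
sd_narrow, sd_wide = stdev(narrow), stdev(wide)

# 2. A constant has s.d. = 0.
sd_constant = stdev([25, 25, 25])

# 3. Rescaling by 10 (mean becomes 250) multiplies the s.d. by 10.
sd_narrow_scaled = stdev([10 * v for v in narrow])
sd_wide_scaled = stdev([10 * v for v in wide])

# 4. One extreme value (hypothetical 100) inflates the s.d.
sd_with_outlier = stdev(wide + [100])

print(mean(narrow), sd_narrow, sd_wide, sd_constant,
      sd_narrow_scaled, sd_wide_scaled, round(sd_with_outlier, 2))
```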

Percentiles
The p-th percentile is a number such that at most p%
of the measurements are below it and at most (100 - p)
percent of the data are above it.
For example, if the 85th percentile of a data set is 340,
then 15% of the measurements in the data are
above 340 and 85% of the
measurements are below 340.
Notice that the median is the 50th percentile.

MEASURING VARIABILITY

Coefficient of Variation
When comparing distributions of different means and
variances, a useful measure is the coefficient of variation
(CV).

$$CV = \frac{s}{\bar{x}} \times 100\%$$

The rule of thumb is that the larger the percentage,
the greater the variability relative to the mean.
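As an illustration, the sketch below compares the coefficient of variation of the two IQ classes from the earlier slides. It assumes the usual definition, CV = (sample standard deviation / mean) x 100%; the helper name `coefficient_of_variation` is made up for this example.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    # Assumed definition: sample s.d. divided by the mean, as a percentage.
    return stdev(data) / mean(data) * 100

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

cv_a = coefficient_of_variation(class_a)
cv_b = coefficient_of_variation(class_b)

print(f"CV Class A = {cv_a:.1f}%, CV Class B = {cv_b:.1f}%")
```

The two classes have almost the same mean, but Class B has the larger CV, so its scores are more spread out relative to the mean.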

Further Notes
When the Mean is greater than the Median the
data distribution is skewed to the Right.
When the Median is greater than the Mean the
data distribution is skewed to the Left.
When Mean and Median are very close to each
other the data distribution is approximately
symmetric.


PRACTICE

Which graph to use?

Depends on type of data


Depends on what you want to illustrate
Depends on available statistical software


Bar Chart
Summarizes categorical data.
Horizontal axis represents categories,
while vertical axis represents either counts
(frequencies) or percentages (relative
frequencies).
Used to illustrate the differences in
percentages (or counts) between
categories.

Bar Chart


Constructing Histograms
Used for numeric variables, so need Class Intervals
Let Range = Largest - Smallest Measurement
Break range into (say) 5-15 intervals depending on sample size
Make the width of the subintervals a convenient unit, and make
break points so that no observations fall on them
To determine the number of classes, k, for a set of data
consisting of n observations, the formula below can be used:

$$k = \frac{\log n}{\log 2}$$

If the number of data is 60, then the number of
classes, k, is $\frac{\log 60}{\log 2} = 5.9 \approx 6$

Obtain Class Frequencies, the number in each subinterval
Obtain Relative Frequencies, proportion in each subinterval
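The construction steps can be sketched in Python as below. Two assumptions are made: the number of classes is taken as log2(n) rounded to the nearest integer (which reproduces the 5.9, rounded to 6, quoted for n = 60), and the classes are simply equal-width intervals from the smallest to the largest value rather than "convenient" hand-picked break points. The function name `class_frequencies` is hypothetical.

```python
import math

def class_frequencies(data, k):
    """Count observations in k equal-width classes spanning min..max."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The largest value is placed in the last class.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts

class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]

k = round(math.log2(len(class_a)))            # assumed rule: k = log2(n)
edges, counts = class_frequencies(class_a, k)
relative = [c / len(class_a) for c in counts]  # relative frequencies

print(k, edges, counts, [round(r, 2) for r in relative])
```

The class frequencies always add up to n and the relative frequencies to 1, which is a useful check when building the table by hand.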

Construct Histogram
Draw bars over each subinterval with height representing
class frequency or relative frequency (shape will be the
same)
Leave no space between bars to imply adjacency of class
intervals


Histogram


Interpreting Histograms
Probability: Heights of bars over the class intervals are
proportional to the chances an individual chosen at
random would fall in the interval
Unimodal: A histogram with a single major peak

Bimodal: Histogram with two distinct peaks
(often evidence of two distinct groups of units)

Uniform: Interval heights are approximately equal
Symmetric: Right and Left portions are same shape
Right-Skewed: Right-hand side extends further
Left-Skewed: Left-hand side extends further

Too few categories


[Figure: histogram "Age of Spring 1998 Stat 250 Students"; horizontal axis: Age (in years); n=92 students]

Too many categories


[Figure: histogram "GPAs of Spring 1998 Stat 250 Students"; horizontal axis: GPA; vertical axis: Frequency (Count); n=92 students]

Dot Plot
Summarizes measurement data.
Horizontal axis represents measurement
scale.
Plot one dot for each data point.


Dot Plot
[Figure: dot plot "Fastest Ever Driving Speed", 226 Stat 100 Students, Fall '98; separate rows for 100 Men and 126 Women; horizontal axis: Speed, from 70 to 160]

Stem-and-Leaf Plot
Summarizes measurement data.
Each data point is broken down into a
stem and a leaf.
First, stems are aligned in a column.
Then, leaves are attached to the stems.

Stem-and-Leaf Plots
A simple approach to obtaining the shape of a distribution without
losing individual measurements to class intervals.

Procedure:

Split each measurement into 2 sets of digits (stem and leaf)
List stems from smallest to largest
List the corresponding leaves beside their stems, from smallest to largest
If too cramped/narrow, break stems into two groups: low with
leaves 0-4 and high with leaves 5-9
When numbers have many digits, trim off right-most (less
significant) digits. Leaves should always be a single digit.
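The procedure can be sketched in a few lines of Python. This minimal version assumes non-negative whole numbers, uses the tens as the stem and the units as the leaf, and does not split stems into low and high halves; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return the lines of a simple stem-and-leaf plot (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)   # e.g. 109 -> stem 10, leaf 9
        stems[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# Class A IQs from the earlier slides.
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
lines = stem_and_leaf(class_a)
print("\n".join(lines))
```

Every original value can still be read back from the plot, which is the main advantage over a histogram.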

Example Stem-and-Leaf Plot


Stem-and-leaf of Shoes   N = 139
Leaf Unit = 1.0

  12   0  223334444444
  63   0  555555555555566666666677777778888888888888999999999
 (33)  1  000000000000011112222233333333444
  43   1  555555556667777888
  25   2  0000000000023
  12   2  5557
   8   3  0023
   4   3
   4   4  00
   2   4
   2   5  0
   1   5
   1   6
   1   6
   1   7
   1   7  5

PRACTICE

BOX-Plots

A way to graphically portray almost all the
descriptive statistics at once is the box-plot.
A box-plot shows:
Upper and lower quartiles
Mean
Median
Range
Outliers (1.5 IQR)

Box Plots
[Figure: annotated box plot showing Q1, Q2, and Q3; whiskers reach the minimum and maximum values without outliers; on each side an outlier zone lies between 1.5H and 3H beyond the quartile, and an extreme outlier zone lies beyond 3H]

Box Plot
Summarizes measurement data.
Vertical (or horizontal) axis represents
measurement scale.
Lines in box represent the 25th percentile
(first quartile), the 50th percentile
(median), and the 75th percentile (third
quartile), respectively.

Box Plot
Whiskers are drawn to the most extreme
data points that are not more than 1.5
times the length of the box beyond either
quartile.
Whiskers are useful for identifying outliers.

Outliers, or extreme observations, are
denoted by asterisks.
Generally, data points falling beyond the
whiskers are considered outliers.

Box Plots
Box Plots - Display a box containing middle
50% of measurements with line at median
and lines extending from box. Breaks data
into four quartiles
Outliers - Observations falling more than
1.5 x IQR above the upper quartile or below the lower quartile

Using Box Plots to Compare


BOX-Plots

[Figure: box plot of IQ, vertical axis from 80.00 to 180.00, with labelled values 162, 123.5, M = 110.5, 106.5, 96.5, and 82]
IQR = 27; there is no outlier.

Box Plot
[Figure: box plot "Amount of sleep in past 24 hours of Spring 1998 Stat 250 Students"; axis from 0 to 10; one point marked as an outlier]

Which graph to use when?


Stem-and-leaf plots and dot plots are good
for small data sets, while histograms and
box plots are good for large data sets.
Boxplots and dotplots are good for
comparing two groups.
Boxplots are good for identifying outliers.
Histograms and boxplots are good for
identifying shape of data.

Scatter Plots
[Figure: scatter plot "Foot sizes of Spring 1998 Stat 250 students"; both axes run from 22 to 31; horizontal axis: Left foot (in cm); n=88 students]

Scatter Plots
Summarizes the relationship between two
measurement variables.

Horizontal axis represents one variable
and vertical axis represents second
variable.
Plot one point for each pair of
measurements.

No relationship
[Figure: scatter plot "Lengths of left forearms and head circumferences of Spring 1998 Stat 250 Students"; horizontal axis: Head circumference (in cm); n=89 students]

Closing comments
Many possible types of graphs.
Use common sense in reading graphs.
When creating graphs, don't summarize
your data too much or too little.
When creating graphs, label everything for
others.
Remember you are trying to
communicate something to others!

Descriptive Statistics-In SPSS


After importing your dataset and providing names to
variables, click on:
ANALYZE → DESCRIPTIVE STATISTICS → FREQUENCIES

Choose any variables to be analyzed and place them in the
box on right
Options include (For Categorical Variables):
Frequency Tables
Pie Charts, Bar Charts
Options include (For Numeric Variables):
Frequency Tables (Useful for discrete data)
Measures of Central Tendency, Dispersion,
Percentiles
Pie Charts, Histograms

Histograms in SPSS
After importing your dataset and providing names
to variables, click on:
GRAPHS → HISTOGRAM
Select Variable to be plotted
Click on DISPLAY NORMAL CURVE if you want a normal
curve superimposed (Next Chapter 4).

Side-by-Side Bar Charts In SPSS


After importing your dataset and providing names
to variables, click on:
GRAPHS → BAR → Clustered (Summaries for
Groups of Cases) → DEFINE
Bars Represent N of Cases (or % of Cases)
CATEGORY AXIS: Variable that represents groups to be
compared (independent variable)
DEFINE CLUSTERS BY: Variable that represents
outcomes of interest (dependent variable)

PRACTICE
SPSS

Question 1
Given the sample records below:
Complete Table 1
Draw a boxplot and a stem-and-leaf plot.
Comment on your results.

Sample records:
86, 23, 41, 98, 96, 20, 40, 32, 66, 50, 92, 40, 60

Table 1: Descriptive statistics
count
mean
sample standard deviation
sample variance
minimum
maximum
range
coefficient of variation (CV)
1st quartile
median
3rd quartile
interquartile range
mode
low extremes
low outliers
high outliers
high extremes

Question 2
Given the sample records below:
Complete Table 1
Draw a boxplot and a stem-and-leaf plot.
Comment on your results.

Sample records:
34, 53, 23, 17, 54, 12, 78, 98, 199

Table 1: Descriptive statistics
count
mean
sample standard deviation
sample variance
minimum
maximum
range
coefficient of variation (CV)
1st quartile
median
3rd quartile
interquartile range
mode
low extremes
low outliers
high outliers
high extremes

Descriptive Statistics
Now you are qualified to use descriptive statistics!