Selvanathan 5e Chapter 02

Chapter 2
Graphical descriptive methods
Introduction and Re-cap

Descriptive statistics involves arranging, summarising and presenting a
set of data in such a way that useful information is produced.
(Diagram: Data → Statistics → Information)
Its methods make use of graphical techniques and numerical descriptive
measures (such as averages) to summarise and present the data.
3
Populations and Samples

(Diagram: a sample is a subset of the population.)
The graphical and tabular methods presented here apply to both entire
populations and samples drawn from populations.
4
Definitions
A variable is some characteristic of a population
or sample.
E.g. student grades.
Typically denoted with a capital letter: X, Y, Z

The values of the variable are the range of
possible values for a variable.
E.g. student marks (0 to 100)
Data are the observed values of a variable.

E.g. student marks: {67, 74, 71, 83, 93, 55, 48}
5
2.1 Types of Data

Data (at least for purposes of Statistics) fall into
three main groups:
Numerical (interval or quantitative) data
Nominal (categorical or qualitative) data
Ordinal (ranked) data
Numerical Data
Real numbers, e.g. heights, weights, prices,
waiting time at a medical practice, etc.
Also referred to as quantitative or interval.
Arithmetic operations can be performed on
numerical data, thus it's meaningful to talk
about 2*Height, or Price + $1, and so on.
Nominal Data
The values of nominal data are categories.
E.g. responses to questions about marital status are
categories, coded as: Single = 1, Married = 2,
Divorced = 3, Widowed = 4
These data are categorical in nature; arithmetic
operations don't make any sense (e.g. does
Divorced ÷ 2 = Married?!)
Nominal data are also called qualitative or
categorical.
Ordinal Data
Ordinal data appear to be categorical in
nature, but their values have an order, a
ranking to them:
E.g. University course evaluation system: poor = 1,
fair = 2, good = 3, very good = 4, excellent = 5
While it's still not meaningful to do arithmetic on
these data (e.g. does 2*fair = very good?!), we
can say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what
numeric values are assigned to each category.
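The point about order can be sketched in a few lines of Python. This is only an illustration: the second coding (10, 20, 50, 70, 100) and the list of responses are made up, and the function names are hypothetical. Ranking gives the same result under both codings, while an arithmetic result such as the mean depends on the arbitrary codes.

```python
# Two codings of the same ordinal scale. The second coding is hypothetical,
# chosen only because it keeps the same order as the first.
coding_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
coding_b = {"poor": 10, "fair": 20, "good": 50, "very good": 70, "excellent": 100}

# Made-up evaluation responses, for illustration only.
responses = ["good", "poor", "excellent", "fair", "very good", "good"]

def ranked(responses, coding):
    """Sort responses from lowest to highest category under a coding."""
    return sorted(responses, key=lambda r: coding[r])

def mean_code(responses, coding):
    """Arithmetic mean of the codes (not meaningful for ordinal data)."""
    return sum(coding[r] for r in responses) / len(responses)

same_order = ranked(responses, coding_a) == ranked(responses, coding_b)
print(same_order)                       # the ordering agrees under both codings
print(mean_code(responses, coding_a))   # depends on the arbitrary codes
print(mean_code(responses, coding_b))
```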
9
Types of data: Examples
Numerical data: age (e.g. 55, 42), income (e.g. 75 000, 68 000), weight gain (e.g. +10, +5)
Nominal data: person and married (1 yes, 2 no, 3 no); computer brand (1 IBM, 2 Dell, 3 Compaq, 4 IBM)
Ordinal data: exam grade (HD, D, C, P, F); food quality (Excellent, Good, Satisfactory, Poor)
With nominal data, all we can calculate is the proportion of data that
falls into each category.
Brand        IBM   Dell   Compaq   Other   Total
Frequency     25     11        8       6      50
Proportion   50%    22%      16%     12%
With ordinal data, all we can use is computations involving the ordering
process.
10
Calculations for Types of Data

As mentioned above,
All calculations are permitted on interval data.
No calculations are allowed for nominal data,
except counting the number of observations in
each category and calculating their proportions.
Only calculations involving a ranking process
are allowed for ordinal data.
This lends itself to the following hierarchy of data.
11
Hierarchy of Data
Numerical
Values are real numbers.
All calculations are valid.
Data may be treated as ordinal or nominal.
Ordinal
Values must represent the ranked order of the data.
Calculations based on an ordering process are valid.
Data may be treated as nominal but not as numerical.
Nominal
Values are the arbitrary numbers that represent
categories.
Only calculations based on the frequencies of occurrence
are valid.
Data may not be treated as ordinal or numerical.
12
Other Forms of Data

Cross-sectional data are collected at a certain
point in time across a number of units of interest.
E.g. a marketing survey (observe preferences by
gender, age); test scores in a statistics course exam;
starting salaries of graduates of an MBA
program in a particular year.
Time-series data are collected over successive
points in time.
E.g. weekly closing price of gold;
monthly tourist arrivals in Australia.
13
2.2 Graphical and tabular techniques for nominal data
The only allowable calculation on nominal data is
to count the frequency of each value of the
variable.
We can summarise the data in a table, called a
frequency distribution, that presents the
categories and their counts.
A relative frequency distribution lists the
categories and the proportion with which each
occurs.
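As a minimal sketch, both distributions can be built in Python with `collections.Counter`. The raw observations here are rebuilt from the computer-brand counts in the earlier examples slide (25, 11, 8 and 6 out of 50); the variable names are only illustrative.

```python
from collections import Counter

# Raw nominal observations rebuilt from the computer-brand counts
# (IBM 25, Dell 11, Compaq 8, other 6); the order of observations is arbitrary.
observations = ["IBM"] * 25 + ["Dell"] * 11 + ["Compaq"] * 8 + ["other"] * 6

# Frequency distribution: category -> count.
frequency = Counter(observations)

# Relative frequency distribution: category -> proportion of all observations.
n = len(observations)
relative_frequency = {category: count / n for category, count in frequency.items()}

for category, count in frequency.most_common():
    print(f"{category:<8}{count:>4}{relative_frequency[category]:>8.0%}")
```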
14
Introduction ...
The methods presented apply to both
the entire population, and
a sample selected from the population.
15
Graphical techniques
for nominal data
The graphical presentations shown here
are used primarily for nominal data.
These graphical tools are most appropriate
when the raw data can be naturally
categorised in a meaningful manner.
16
Bar charts
The bar chart is mainly used for nominal
data.
A bar chart graphically represents the
frequency of each category as a bar rising
vertically from the horizontal axis
The height of each bar is proportional to the
frequency of the corresponding category.
17
Pie charts
Another useful chart to present nominal
data is the pie chart.
The pie chart is a very popular tool used to
represent the proportions of appearance for
nominal data.
A pie chart is a circle that is subdivided into
slices whose areas are proportional to the
frequencies (or relative frequencies),
thereby displaying the proportion of
occurrences of each category.
18
Example 2.1
To determine the approximate market share of
various women's magazines in New Zealand, a
women's magazine readership survey was
conducted using a sample of 200 readers.
Data were collected and the count of the
occurrences (frequencies) was recorded for each
magazine.
The frequencies were presented in a bar chart.
Then the frequencies were converted to
proportions and the results were presented in a
pie chart.
19
Example 2.1
1 = Australian Women's Weekly (NZ Edition); 2 = Next;
3 = NZ New Idea; 4 = NZ Woman's Day; 5 = NZ Women's
Weekly; and 6 = That's Life.
20
Example 2.1 cont. (Excel representation)
21
The size of each slice in a pie chart is proportional
to the percentage corresponding to the category it
represents.
E.g. a category with 10% of the observations gets a slice of
(10/100)(360°) = 36°
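The slice calculation can be written as a one-line function. The sketch below is illustrative only: `slice_angle` is a hypothetical helper name and the four category counts are made up.

```python
def slice_angle(frequency, total):
    """Central angle in degrees of the pie slice for one category."""
    return frequency * 360 / total

# The 10% case from the slide.
print(slice_angle(10, 100))  # 36.0

# Hypothetical counts for four categories, for illustration only.
counts = {"A": 25, "B": 11, "C": 8, "D": 6}
total = sum(counts.values())
angles = {category: slice_angle(count, total) for category, count in counts.items()}
for category, angle in angles.items():
    print(category, round(angle, 1))
```

Because every angle is the category's share of 360°, the angles of all slices add up to a full circle.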
22
Use bar charts also when the order in which
data are presented is meaningful.
(Figure: Trend in total exports, Australia, 1992–2009)
23
2.3 Graphical Techniques for Numerical Data
There are several graphical methods that are
used when the data are numerical (i.e.
quantitative, non-categorical).
The most important of these graphical methods
is the histogram.
The histogram is not only a powerful graphical
technique used to summarise interval data, but
it is also used to help explain probabilities.
24
Example 2.5
Providing information concerning the
monthly bills of new subscribers in the
first month after signing on with a
telephone company
collect data
prepare a frequency distribution
draw a histogram.
25
Example 2.5
As part of a larger study, a long-distance company
wanted to acquire information about the monthly
bills of new subscribers in the first month after
signing with the company. The company's
marketing manager conducted a survey of 200
new residential subscribers wherein the first
month's bills were recorded. These data are stored
in file XM02-05. The general manager planned to
present his findings to senior executives. What
information can be extracted from these data?
26
Example 2.5
In Example 2.1 we created a frequency distribution
of the 6 categories. In this example we also create
a frequency distribution by counting the number of
observations that fall into a series of intervals,
called classes.
The justification for the classes chosen below will
be discussed later.
27
Example 2.5
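We have chosen eight classes defined in such a way
that each observation falls into one and only one
class. These classes are defined as follows:
Classes
Amounts that are less than or equal to 15
Amounts that are more than 15 but less than or equal to 30
Amounts that are more than 30 but less than or equal to 45
Amounts that are more than 45 but less than or equal to 60
Amounts that are more than 60 but less than or equal to 75
Amounts that are more than 75 but less than or equal to 90
Amounts that are more than 90 but less than or equal to 105
Amounts that are more than 105 but less than or equal to 120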
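Counting observations into such classes can be sketched in Python. This is a minimal illustration, assuming the eight classes run in steps of 15 from 0 to 120 with the upper limit included in each class. The bills in the list are made up (the real data are in file XM02-05) and the function names are hypothetical.

```python
from bisect import bisect_left

# Assumed upper limits of the eight classes (width 15, from 0 up to 120).
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]

def class_index(amount):
    """Index (0 to 7) of the class that contains amount.

    Each class is "more than the lower limit, less than or equal to the
    upper limit", so an amount equal to an upper limit stays in that class.
    """
    if amount < 0 or amount > upper_limits[-1]:
        raise ValueError("amount outside the range covered by the classes")
    return bisect_left(upper_limits, amount)

def frequency_distribution(amounts):
    """Count how many amounts fall into each class."""
    counts = [0] * len(upper_limits)
    for amount in amounts:
        counts[class_index(amount)] += 1
    return counts

# Made-up bills for illustration only.
bills = [0.0, 15.0, 15.01, 29.5, 30.0, 42.19, 119.63]
print(frequency_distribution(bills))
```

`bisect_left` returns the position of the first upper limit that is greater than or equal to the amount, which is what puts each observation into one and only one class.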
28
Example 2.4
29
Interpret
About half (71+37=108) of the bills are small,
i.e. less than $30.
(18+28+14=60)÷200 = 30%,
i.e. nearly a third of the phone bills are $90 or more.
There are only a few telephone
bills in the middle range.
30
Building a Histogram
1) Collect the data
2) Create a frequency distribution for the data
How?
a) Determine the number of classes to use
How?
Refer to Table 2.10:
With 200 observations, we should have
between 7 & 10 classes.
Alternatively, we could use Sturges' formula:
Number of class intervals = 1 + 3.3 log10(n)
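For the 200 observations of Example 2.5, Sturges' formula can be evaluated directly. The short Python sketch below assumes the logarithm is base 10; the function name is hypothetical.

```python
import math

def sturges_classes(n):
    """Suggested number of class intervals: 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

suggested = sturges_classes(200)
print(round(suggested, 2))   # about 8.59
print(round(suggested))      # 9 when rounded to a whole number
```

The result falls inside the 7 to 10 range suggested for 200 observations; the formula is a guide rather than a rule, and the example itself uses 8 classes.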
31
Class width
It is generally best to use equal class
widths, but sometimes unequal class
widths are called for.
Unequal class widths are used when the
frequency associated with some classes is
too low. Then,
several classes are combined together to form
a wider and more populated class
it is possible to form an open-ended class at
the higher or lower end of the histogram.
32
1) Collect the data
2) Create a frequency distribution for the data
How?
a) Determine the number of classes to use. [8]
b) Determine how wide to make each class
(assuming equal class width). How?
Look at the range of the data, that is,
Range = Largest observation − Smallest observation
Range = $119.63 − $0 = $119.63
Then each class width becomes:
Range ÷ (# classes) = 119.63 ÷ 8 ≈ 15
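The same arithmetic as a small Python sketch. Rounding the width up to a whole number is a choice made here for illustration, so that eight classes are guaranteed to cover the whole range; the function name is hypothetical.

```python
import math

def class_width(largest, smallest, n_classes):
    """Approximate equal class width: the range divided by the number of classes."""
    return (largest - smallest) / n_classes

width = class_width(119.63, 0.0, 8)
print(round(width, 2))    # 14.95
print(math.ceil(width))   # 15
```

With a width of 15, the eight classes span 0 to 120, which contains the largest observation of $119.63.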
33
34
35
Shapes of Histograms
Symmetry
A histogram is said to be symmetric if, when we
draw a vertical line down the centre of the
histogram, the two sides are identical in shape
and size.
(Figure: three symmetric histograms, Frequency against Variable)
36
Skewness
A skewed histogram is one with a long tail
extending to either the right or the left.
(Figure: two histograms, Frequency against Variable: positively skewed,
with the long tail to the right, and negatively skewed, with the long
tail to the left)
37
Modality
A unimodal histogram is one with a single peak, while
a bimodal histogram is one with two peaks.
(Figure: unimodal and bimodal histograms, Frequency against Variable)
A modal class is the class with
the largest number of observations.
38
Bell Shape
A special type of symmetric unimodal
histogram is one that is bell shaped.
(Figure: bell-shaped histogram, Frequency against Variable)
Many statistical techniques
require that the population be
bell shaped.
Drawing the histogram helps
verify the shape of the
population.
39
Histogram comparison
Compare and contrast the following histograms based
on data from Example 2.7:
(Figures: Marks (computer course); Marks (manual course))
The marks from the computer-based statistics course and the
manual statistics course have very
different histograms:
Unimodal vs. bimodal
Spread of the marks (narrower | wider)
40
Stem and Leaf Display

Retains information about individual observations that would
normally be lost in the creation of a histogram.
Split each observation into two parts, a stem and a leaf:
e.g. Observation value: 42.19
There are several ways to split it up.
We could split it at the decimal point:
Stem 42, Leaf 19
Or split it at the tens position (while rounding to
the nearest integer in the ones position):
Stem 4, Leaf 2
41
Stem & Leaf Display

Continue this process for all the observations.
Then, use the stems for the classes and each
leaf becomes part of the histogram (based on
Example 2.5 data) as follows:
Stem  Leaf
0     0000000000111112222223333345555556666666778888999999
1     000001111233333334455555667889999
2     0000111112344666778999
3     001335589
4     124445589
5     33566
6     3458
7     022224556789
8     334457889999
9     00112222233344555999
Thus, we still have access to
our original data point's value!
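A stem-and-leaf display can be built with a few lines of Python. The sketch below is one possible way, not the textbook's procedure: it splits at the tens position after rounding to the nearest integer, uses the student marks from the Definitions slide plus the 42.19 example value, and leaves out stems that have no leaves. The function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens part) to its sorted leaves (ones digits).

    Values are rounded to the nearest integer first, as in the
    "split at the tens position" option.
    """
    display = defaultdict(list)
    for value in values:
        stem, leaf = divmod(round(value), 10)
        display[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(display.items())}

# Student marks from the Definitions slide, plus the 42.19 example value.
marks = [67, 74, 71, 83, 93, 55, 48, 42.19]
for stem, leaves in stem_and_leaf(marks).items():
    print(stem, "".join(str(leaf) for leaf in leaves))
```

Each printed row lists a stem followed by its leaves, so the length of a row plays the role of the bar height in a histogram.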
42
Histogram and Stem and Leaf
Compare the overall shapes of the figures

43
Relative frequency
It is often preferable to show the
relative frequency (proportion) of
observations falling into each class,
rather than the frequency itself.
Class relative frequency = Class frequency / Total number of observations
44
Relative frequency
Relative frequencies should be used when:
the population relative frequencies are studied;
two or more histograms are compared;
the numbers of observations in the samples
studied are different.
45
Cumulative frequency of a class

This is the number of measurements less
than the upper limit of that class.
To obtain the cumulative frequency of a
class, we add the frequency of that class
and the frequencies of all previous classes.
The cumulative relative frequency of a
particular class is the proportion of
measurements that are less than the upper
limit of that class.
46
Ogive
An ogive (pronounced "oh-jive") is a graph of
a cumulative frequency distribution.
We create an ogive in three steps.
First, from the frequency distribution created
earlier, calculate relative frequencies:
Class relative frequency = Class frequency / Total number of observations
47
Relative Frequencies
For example, we had 71 observations in our
first class (telephone bills from $0.00 to
$15.00). Thus, the relative frequency for this
class is 71 ÷ 200 (the total # of phone bills) =
0.355 (or 35.5%).
48
Ogive
Is a graph of a cumulative frequency
distribution.
We create an ogive in three steps
1) Calculate relative frequencies.
2) Calculate cumulative relative
frequencies by adding the current
class relative frequency to the previous
class cumulative relative frequency.
(For the first class, its cumulative relative
frequency is just its relative frequency.)
49
Cumulative Relative Frequencies

TABLE 2.15 Cumulative relative frequencies for Example 2.5
First class: .355
Next class: .355+.185=.540
:
:
Last class: .930+.070=1.00
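The running totals can be sketched in Python as follows. Only some of the class counts appear in these slides (71, 37, 18, 28 and 14). The middle values 13, 9 and 10, the position of each count, and the class limits in steps of 15 are assumptions made here so that the total is 200 and the last step is .930 + .070.

```python
from itertools import accumulate

# Class frequencies for Example 2.5. 71, 37, 18, 28 and 14 are mentioned in
# the slides; the middle values 13, 9, 10 are ASSUMED so that the total is 200.
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]  # assumed class limits

n = sum(frequencies)
relative = [f / n for f in frequencies]

# Accumulate the counts first, then divide, to avoid floating-point drift.
cumulative_relative = [c / n for c in accumulate(frequencies)]

# Points to plot for the ogive: (upper class limit, cumulative relative frequency).
ogive_points = list(zip(upper_limits, cumulative_relative))
for limit, cum in ogive_points:
    print(f"{limit:>4}  {cum:.3f}")
```

Plotting `ogive_points` with the upper limits on the horizontal axis and joining the points with straight lines gives the ogive.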
50
Ogive
Is a graph of a cumulative frequency
distribution.
1) Calculate relative frequencies.
2) Calculate cumulative relative
frequencies.
3) Graph the cumulative relative
frequencies
Example 2.5 Ogive
51
Ogive
Example 2.5 Ogive
The ogive can be used

to answer questions
like:
What telephone bill
value is at the 50th
percentile?
around $35
(Refer also to Fig. 2.21 in your textbook.)

52
2.4 Describing Time Series Data

Observations measured at the same point in
time across individual units are called cross-sectional data.
Observations measured at successive points in
time on a single unit are called time-series
data.
Time-series data are graphed on a line chart,
which plots the value of the variable on the
vertical axis against the time periods on the
horizontal axis.
A line chart of time-series data is
alternatively known as a time-series chart.
53
Time Series Data

We recorded the value of Australian exports from
1992 to 2009 (Figure 2.22). Draw a line chart to
describe these data and briefly describe the
results.
54
Line Chart
Plot the frequency of a category above
the point on the horizontal axis
representing that category.
Use line charts when the categories are
points in time.
Line charts are particularly useful when
the trend over time is to be
emphasised.
55
Line Chart
Figure 2.22 Line chart showing change in Australian exports over time
56
Line Chart
Australian exports have had a slow but
steady increase from 1992 to 2004.
After 2004, Australian exports have
been increasing steadily at a much
higher rate.
57
2.5 Relationship between Two Variables
So far we've looked at tabular and graphical
techniques for one variable (either nominal
or numerical data).
Now we will look at the relationship between
two variables (either nominal or numerical
data) using either tabular or graphical
techniques.
58
Describing the Relationship between Two Nominal Variables
A cross-classification table (or cross-tabulation
table) is used to describe the
relationship between two nominal variables.
A cross-classification table lists the
frequency of each combination of the
values of the two variables.
59
Example 2.8
In a major Australian city there are four
competing newspapers: N1, N2, N3 and N4.
To help design advertising campaigns, the
advertising managers of the newspapers
need to know which segments of the
newspaper market are reading their papers.
A survey was conducted to analyse the
relationship between newspapers read and
occupation.
60
Example 2.8
A sample of newspaper readers was
asked to report which newspaper they
read (N1, N2, N3 or N4), and to indicate
whether they were a blue-collar worker (1),
a white-collar worker (2), or a professional
(3).
The responses are stored in file XM02-08.
61
Example 2.8
By counting the number of times each of
the 12 combinations occurs, we produced
Table 2.16.
62
Example 2.8
If occupation and newspaper are related,
then there will be differences in the

newspapers read among the occupations.
An easy way to see this is to convert the
frequencies in each row to relative
frequencies in each row.
That is, compute the row totals and divide
each frequency by its row total.
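The row calculation can be sketched in Python. The counts below are hypothetical and are NOT the values of Table 2.16; they only show the mechanics of dividing each frequency by its row total. The function name is also hypothetical.

```python
# Hypothetical cross-classification counts (NOT the values of Table 2.16):
# rows are occupations, columns are newspapers N1 to N4.
table = {
    "blue collar":  [30, 20, 40, 10],
    "white collar": [20, 40, 25, 15],
    "professional": [30, 60, 18, 12],
}

def row_relative_frequencies(table):
    """Divide each frequency by its row total."""
    result = {}
    for row_label, counts in table.items():
        row_total = sum(counts)
        result[row_label] = [count / row_total for count in counts]
    return result

row_proportions = row_relative_frequencies(table)
for row_label, proportions in row_proportions.items():
    print(f"{row_label:<13}", "  ".join(f"{p:.2f}" for p in proportions))
```

Because each row is divided by its own total, rows with different sample sizes become directly comparable.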
63
Example 2.8
Interpretation: The relative frequencies in the rows
2 and 3 are similar, but there are large differences
between rows 1 and 2, and between rows 1 and 3.
Row 1: Blue collar (1); Row 2: White collar (2);
Row 3: Professional (3)
This tells us that blue collar workers tend to read
different newspapers from both white collar
workers and professionals and that white collar
and professionals are quite similar in their
newspaper choice.
65
Example 2.8
Use the data from the cross-classification table to create
bar charts
For example,
Professionals (3)
tend to read
newspaper N2
more than twice
as often as
newspaper N3.
66
Describing the Relationship between Two Numerical Variables
Often we are interested in the relationship
between two numerical variables.
Advert   1   3   5   4   2   5   3   2
Sales   30  40  40  50  35  50  35  25
67
Example 2.9
A small-business owner wants to assess the
effects of advertising on sales levels.
Paired observation data were collected.
Each pair consisted of monthly advertising
expenditure and monthly sales levels.
68
Scatter diagram
A scatter diagram can describe the
relationship between advertising
expenditure and sales.
(Figure: Excel scatter diagram of Sales, on a vertical axis from 0 to 60,
against Advertising Expenditure, on a horizontal axis from 0 to 5)
Advertising expenditure and sales have a positive relationship.
Advertising expenditure and sales appear to have a linear relationship.
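As a rough stand-in for the Excel chart, the eight pairs of Example 2.9 (Advert 1, 3, 5, 4, 2, 5, 3, 2 and Sales 30, 40, 40, 50, 35, 50, 35, 25) can be drawn as a character grid. This is only a sketch: `text_scatter` is a hypothetical helper, and it assumes integer advertising values and sales values that are multiples of the row step.

```python
# Paired observations from Example 2.9: (advertising expenditure, sales).
advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]

def text_scatter(x_values, y_values, y_step=5):
    """Return a character-grid scatter diagram as a list of strings.

    Rows run from the largest y down to the smallest in steps of y_step;
    columns are the integer x values from 1 to max(x).
    """
    points = set(zip(x_values, y_values))
    columns = range(1, max(x_values) + 1)
    lines = []
    for y in range(max(y_values), min(y_values) - 1, -y_step):
        cells = ["*" if (x, y) in points else "." for x in columns]
        lines.append(f"{y:>3} | " + " ".join(cells))
    lines.append("      " + " ".join(str(x) for x in columns))
    return lines

print("\n".join(text_scatter(advert, sales)))
```

The stars drift from the lower left to the upper right of the grid, which is the pattern described as a positive linear relationship.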
69
Patterns of Scatter Diagrams

Linearity and direction are two concepts we are
interested in.
Positive linear relationship
Negative linear relationship
Weak or non-linear relationship
70
Chapter-Opening Example
WERE OIL COMPANIES GOUGING CUSTOMERS 1999–2006?: SOLUTION
In May 1999 the average retail price of petrol
was A$0.67 per litre in Melbourne and the price
of oil (Dubai Fateh Crude) was US$15.38 per
barrel (1 barrel = 159.18 litres).
Over the next 10 years, the price of both
substantially increased. Many drivers complained
that the oil companies were guilty of price
gouging.
That is, they believed that when the price of oil
increased the price of petrol also increased, but
when the price of oil decreased, the decrease in
the price of petrol seemed to lag behind.
71
WERE OIL COMPANIES GOUGING CUSTOMERS 1999–2006?: SOLUTION
To determine whether this perception is
accurate we determined the monthly figures
for both commodities (file CH02:\Oil).
Graphically depict these data and describe the
findings.
72
73
Interpreting the results:
The scatter diagram reveals that the two
prices are strongly related linearly.
As the oil price increases, petrol price also
increases. When the price of oil was below
A$85, the relationship between the two
variables was stronger than when the price of
oil exceeded A$85.
74
Summary I
Factors That Identify When to Use Frequency and Relative
Frequency Tables, Bar and Pie Charts
1. Objective: Describe a single set of data.
2. Data type: Nominal.
Factors That Identify When to Use a Histogram, Ogive, or Stem-and-Leaf Display
1. Objective: Describe a single set of data.
2. Data type: Interval.
Factors that Identify When to Use a Cross-classification Table
1. Objective: Describe the relationship between two variables.
2. Data type: Nominal.
Factors that Identify When to Use a Scatter Diagram
1. Objective: Describe the relationship between two variables.
2. Data type: Interval.
75
Summary II
                                     Numerical data     Nominal data
Single set of data                   Histogram          Frequency and relative frequency tables, bar and pie charts
Relationship between two variables   Scatter diagram    Cross-classification table, bar charts
76
Typical patterns
Positive linear relationship
No relationship
Negative nonlinear relationship
Negative linear relationship
Nonlinear (concave) relationship
77
