
Chapter 2

Graphical descriptive methods

Introduction and Re-cap


Descriptive statistics involves arranging, summarising, and presenting a
set of data in such a way that useful information is produced.

[Diagram: Data → Statistics → Information]

Its methods make use of graphical techniques and numerical descriptive
measures (such as averages) to summarise and present the data.

Populations and Samples


[Diagram: a sample is a subset of the population.]

The graphical and tabular methods presented here apply to both entire
populations and samples drawn from populations.

Definitions
A variable is some characteristic of a population
or sample.
E.g. student grades.

Typically denoted with a capital letter: X, Y, Z


The values of the variable are the range of
possible values for a variable.
E.g. student marks (0–100)

Data are the observed values of a variable.


E.g. student marks: {67, 74, 71, 83, 93, 55, 48}

2.1 Types of Data


Data (at least for purposes of Statistics) fall into
three main groups:
Numerical (interval or quantitative) data
Nominal (categorical or qualitative) data
Ordinal (ranked) data

Numerical Data
The values of numerical data are real numbers, e.g. heights, weights, prices,
waiting time at a medical practice, etc.
Also referred to as quantitative or interval data.
Arithmetic operations can be performed on
numerical data, thus it's meaningful to talk
about 2*Height, or Price + $1, and so on.

Nominal Data
The values of nominal data are categories.
E.g. responses to questions about marital status are
categories, coded as: Single = 1, Married = 2,
Divorced = 3, Widowed = 4
These data are categorical in nature; arithmetic
operations don't make any sense (e.g. does
Divorced ÷ 2 = Married?!)
Nominal data are also called qualitative or
categorical.

Ordinal Data
Ordinal data appear to be categorical in
nature, but their values have an order; a
ranking to them:
E.g. University course evaluation system: poor = 1,
fair = 2, good = 3, very good = 4, excellent = 5
While it's still not meaningful to do arithmetic on
these data (e.g. does 2*fair = very good?!), we
can say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what
numeric values are assigned to each category.
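As a small illustration of this last point, the Python sketch below compares two codings of the course evaluation scale. The second coding (`coding_b`) is made up for the illustration and is not from the course material. Any coding that preserves the ranking gives the same ordering, whereas arithmetic results depend on the arbitrary codes.

```python
# Two numeric codings of the same ordinal scale.
# coding_a is the 1-5 coding from the slide; coding_b is a hypothetical
# alternative that keeps the same order but uses different numbers.
RATINGS = ["poor", "fair", "good", "very good", "excellent"]
coding_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
coding_b = {"poor": 3, "fair": 10, "good": 12, "very good": 40, "excellent": 99}

def ranking(coding):
    """Return the categories sorted from lowest to highest code."""
    return sorted(coding, key=coding.get)

# The ordering is identical under both codings.
same_order = ranking(coding_a) == ranking(coding_b) == RATINGS

# Arithmetic is not stable: "2 * fair = very good" holds only by accident.
double_fair_a = 2 * coding_a["fair"] == coding_a["very good"]  # 2*2 == 4
double_fair_b = 2 * coding_b["fair"] == coding_b["very good"]  # 2*10 != 40

print(same_order, double_fair_a, double_fair_b)
```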

Types of data: Examples

Numerical data
age   income
55    75 000
42    68 000
...
weight gain
+10
+5
...

Nominal data
person   married
1        yes
2        no
3        no
...
computer   brand
1          IBM
2          Dell
3          Compaq
4          IBM
...
With nominal data, all we can calculate is the proportion of data that
falls into each category.
             IBM   Dell   Compaq   other   total
Frequency    25    11     8        6       50
Proportion   50%   22%    16%      12%

Ordinal data
exam grade: HD, D, C, P, F
food quality: Excellent, Good, Satisfactory, Poor
With ordinal data, all we can use is computations involving the ordering
process.

Calculations for Types of Data


As mentioned above,
All calculations are permitted on interval data.
No calculations are allowed for nominal data,
except counting the number of observations in
each category and calculating their proportions.
Only calculations involving a ranking process
are allowed for ordinal data.
This lends itself to the following hierarchy of data.

Hierarchy of Data
Numerical
Values are real numbers.
All calculations are valid.
Data may be treated as ordinal or nominal.
Ordinal
Values must represent the ranked order of the data.
Calculations based on an ordering process are valid.
Data may be treated as nominal but not as numerical.
Nominal
Values are the arbitrary numbers that represent
categories.
Only calculations based on the frequencies of occurrence
are valid.
Data may not be treated as ordinal or numerical.

Other Forms of Data


Cross-sectional data are collected at a certain
point in time across a number of units of interest, e.g.
marketing survey (observe preferences by
gender, age)
test scores in a statistics course exam
starting salaries of graduates of an MBA
program in a particular year.
Time-series data are collected over successive
points in time, e.g.
weekly closing price of gold
monthly tourist arrivals in Australia.

2.2 Graphical and tabular techniques for nominal data
The only allowable calculation on nominal data is
to count the frequency of each value of the
variable.
We can summarise the data in a table, called a
frequency distribution, that presents the categories and their counts.
A relative frequency distribution lists the
categories and the proportion with which each
occurs.
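The two tables described above can be sketched in a few lines of Python. The coded responses and the function name `frequency_distribution` are hypothetical, chosen only to show the counting step; they are not data from the course.

```python
from collections import Counter

def frequency_distribution(observations):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(observations)
    n = len(observations)
    return {cat: (count, count / n) for cat, count in sorted(counts.items())}

# Hypothetical coded responses, e.g. 1 = Single, 2 = Married,
# 3 = Divorced, 4 = Widowed.
responses = [1, 2, 2, 3, 2, 1, 4, 2, 1, 2]
table = frequency_distribution(responses)

for category, (freq, rel) in table.items():
    print(f"{category}: frequency={freq}, relative frequency={rel:.2f}")
```

Note that the codes are only counted, never added or averaged, which is the only calculation that nominal data permit.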

Introduction ...
The methods presented apply to both
the entire population, and
a sample selected from the population.


Graphical techniques for nominal data
The graphical presentations shown here
are used primarily for nominal data.
These graphical tools are most appropriate
when the raw data can be naturally
categorised in a meaningful manner.


Bar charts
The bar chart is mainly used for nominal
data.
A bar chart graphically represents the
frequency of each category as a bar rising
vertically from the horizontal axis.
The height of each bar is proportional to the
frequency of the corresponding category.

Pie charts
Another useful chart to present nominal
data is the pie chart.
The pie chart is a very popular tool used to
represent the proportions of appearance for
nominal data.
A pie chart is a circle that is subdivided into
slices whose areas are proportional to the
frequencies (or relative frequencies),
thereby displaying the proportion of
occurrences of each category.

Example 2.1
To determine the approximate market share of
various women's magazines in New Zealand, a
women's magazine readership survey was
conducted using a sample of 200 readers.
Data were collected and the count of the
occurrences (frequencies) was recorded for each
magazine.
The frequencies were presented in a bar chart.
Then the frequencies were converted to
proportions and the results were presented in a
pie chart.

Example 2.1
1 = Australian Women's Weekly (NZ Edition); 2 = Next;
3 = NZ New Idea; 4 = NZ Woman's Day; 5 = NZ Women's
Weekly; and 6 = That's Life.

Example 2.1 cont. (Excel representation)


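The size of each slice in a pie chart is proportional
to the percentage corresponding to the category it
represents. For example, a category that accounts for 10% of the
observations gets a slice with a central angle of
(10/100)(360°) = 36°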
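The slice calculation generalises directly: each category's central angle is its share of the total multiplied by 360 degrees. A minimal Python sketch follows; the counts are hypothetical and chosen so that category "A" is 10% of the total, giving a 36 degree slice.

```python
def slice_angles(frequencies):
    """Central angle in degrees of each pie slice."""
    total = sum(frequencies.values())
    return {cat: 360 * f / total for cat, f in frequencies.items()}

# Hypothetical counts; "A" makes up 10 of 100 observations.
angles = slice_angles({"A": 10, "B": 40, "C": 50})

for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```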


Use bar charts also when the order in which
data are presented is meaningful.
[Figure: Trend in total exports, Australia, 1992–2009]

2.3 Graphical Techniques for Numerical Data
There are several graphical methods that are
used when the data are numerical (i.e.
quantitative, non-categorical).
The most important of these graphical methods
is the histogram.
The histogram is not only a powerful graphical
technique used to summarise interval data, but
it is also used to help explain probabilities.

Example 2.5
Providing information concerning the
monthly bills of new subscribers in the
first month after signing on with a
telephone company:
collect data
prepare a frequency distribution
draw a histogram.


Example 2.5
As part of a larger study, a long-distance company
wanted to acquire information about the monthly
bills of new subscribers in the first month after
signing with the company. The company's
marketing manager conducted a survey of 200
new residential subscribers wherein the first
month's bills were recorded. These data are stored
in file XM02-05. The general manager planned to
present his findings to senior executives. What
information can be extracted from these data?

Example 2.5
In Example 2.1 we created a frequency distribution
of the 6 categories. In this example we also create
a frequency distribution by counting the number of
observations that fall into a series of intervals,
called classes.
The justification for the classes chosen below will
be discussed later.


Example 2.5
We have chosen eight classes defined in such a way
that each observation falls into one and only one
class. These classes are defined as follows:
Classes
Amounts that are less than or equal to 15
Amounts that are more than 15 but less than or equal to 30
Amounts that are more than 30 but less than or equal to 45
Amounts that are more than 45 but less than or equal to 60
Amounts that are more than 60 but less than or equal to 75
Amounts that are more than 75 but less than or equal to 90
Amounts that are more than 90 but less than or equal to 105
Amounts that are more than 105 but less than or equal to 120
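Because each class includes its upper limit but not its lower limit, an amount of exactly 15 belongs to the first class and 15.01 to the second. The Python sketch below shows one way to assign amounts to these eight classes and count them. The helper names and the sample bills are hypothetical; the bills are not the XM02-05 data.

```python
import math

CLASS_WIDTH = 15
NUM_CLASSES = 8

def class_index(amount):
    """0-based class index; class k covers (15k, 15(k+1)], and class 0 also holds 0."""
    if amount < 0 or amount > CLASS_WIDTH * NUM_CLASSES:
        raise ValueError("amount outside the range covered by the classes")
    return max(0, math.ceil(amount / CLASS_WIDTH) - 1)

def class_frequencies(amounts):
    """Count how many amounts fall into each of the eight classes."""
    freqs = [0] * NUM_CLASSES
    for amount in amounts:
        freqs[class_index(amount)] += 1
    return freqs

# Hypothetical bills, including values that sit exactly on class limits.
bills = [0.0, 15.0, 15.01, 29.99, 30.0, 42.19, 90.0, 119.63]
print(class_frequencies(bills))
```

Every amount lands in exactly one class, which is the property the class definitions are designed to guarantee.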

Example 2.5

Interpret
About half (71+37=108) of the bills are small, i.e. less than $30.
There are only a few telephone bills in the middle range.
(18+28+14=60)÷200 = 30%, i.e. nearly a third of the phone bills are $90 or more.

Building a Histogram
1) Collect the data
2) Create a frequency distribution for the data
How?
a) Determine the number of classes to use
How?
Refer to Table 2.10:
With 200 observations, we should have between 7 & 10 classes.
Alternatively, we could use Sturges' formula:
Number of class intervals = 1 + 3.3 log (n)
where n is the number of observations and log is the base-10 logarithm.
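Sturges' formula is easy to check numerically. The sketch below assumes the logarithm is base 10 (the usual reading of the rule) and uses a hypothetical function name. For the 200 observations of Example 2.5 it gives a value of about 8.6, which sits inside the suggested 7 to 10 classes.

```python
import math

def sturges_classes(n):
    """Suggested number of class intervals, 1 + 3.3 log10(n), left unrounded."""
    return 1 + 3.3 * math.log10(n)

k = sturges_classes(200)
print(f"Suggested number of classes for n=200: {k:.2f}")
```

The result is a guide rather than a rule; it is normally rounded to a nearby whole number that also yields convenient class limits.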


Building a Histogram
Class width
It is generally best to use equal class
widths, but sometimes unequal class
widths are called for.
Unequal class widths are used when the
frequency associated with some classes is
too low. Then:
several classes are combined to form
a wider and more populated class;
it is possible to form an open-ended class at
the higher or lower end of the histogram.

Building a Histogram
1) Collect the data
2) Create a frequency distribution for the data
How?
a) Determine the number of classes to use. [8]
b) Determine how wide to make each class
(assuming equal class width). How?
Look at the range of the data, that is,
Range = Largest observation − Smallest observation
Range = $119.63 − $0 = $119.63
Then each class width becomes:
Range ÷ (# classes) = 119.63 ÷ 8 ≈ 15
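The class-width arithmetic can be reproduced as follows. This is a sketch using the largest and smallest observations quoted above; rounding the raw width up to a whole number is an assumption about how the convenient width of 15 was reached.

```python
import math

def class_width(largest, smallest, num_classes):
    """Raw class width: range divided by the number of classes."""
    return (largest - smallest) / num_classes

raw_width = class_width(119.63, 0.0, 8)   # range 119.63 split into 8 classes
width = math.ceil(raw_width)              # round up to a convenient whole number
upper_limits = [width * k for k in range(1, 9)]

print(raw_width, width, upper_limits)
```

Rounding up rather than down ensures the eight classes together still cover the largest observation.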


Shapes of Histograms
Symmetry
A histogram is said to be symmetric if, when we
draw a vertical line down the centre of the
histogram, the two sides are identical in shape
and size.
[Figure: three symmetric histograms, each plotting Frequency against Variable]

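Shapes of Histograms
Skewness
A skewed histogram is one with a long tail
extending to either the right or the left.
Positively skewed: the long tail extends to the right.
Negatively skewed: the long tail extends to the left.
[Figure: a positively skewed and a negatively skewed histogram, each plotting Frequency against Variable]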

Shapes of Histograms
Modality
A unimodal histogram is one with a single peak, while
a bimodal histogram is one with two peaks.
A modal class is the class with the largest number of observations.
[Figure: a unimodal and a bimodal histogram, each plotting Frequency against Variable]

Shapes of Histograms
Bell Shape
A special type of symmetric unimodal
histogram is one that is bell shaped.
Many statistical techniques require that the population be bell shaped.
Drawing the histogram helps verify the shape of the population.
[Figure: a bell-shaped histogram plotting Frequency against Variable]

Histogram comparison
Compare and contrast the following histograms based
on data from Example 2.7.
The marks from the computer-based statistics course and the
manual statistics course have very different histograms:
Unimodal vs. bimodal
Spread of the marks (narrower | wider)
[Figures: Marks (computer course); Marks (manual course)]

Stem and Leaf Display
Retains information about individual observations that would
normally be lost in the creation of a histogram.
Split each observation into two parts, a stem and a leaf:
e.g. Observation value: 42.19
There are several ways to split it up.
We could split it at the decimal point:
Stem 42, Leaf 19
Or split it at the tens position (while rounding to the nearest
integer in the ones position):
Stem 4, Leaf 2

Stem & Leaf Display


Continue this process for all the observations.
Then, use the stems for the classes and each
leaf becomes part of the histogram (based on
Example 2.5 data) as follows:
Stem  Leaf
0     0000000000111112222223333345555556666666778888999999
1     000001111233333334455555667889999
2     0000111112344666778999
3     001335589
4     124445589
5     33566
6     3458
7     022224556789
8     334457889999
9     00112222233344555999
Thus, we still have access to
each original data point's value!

Histogram and Stem and Leaf

Compare the overall shapes of the figures



Relative frequency
It is often preferable to show the
relative frequency (proportion) of
observations falling into each class,
rather than the frequency itself.
Class relative frequency = (Class frequency) / (Total number of observations)

Relative frequency
Relative frequencies should be used
when:
the population relative frequencies are
studied
two or more histograms are compared
the numbers of observations in the samples
studied are different.

Cumulative frequency of a class


This is the number of measurements less
than or equal to the upper limit of that class.
To obtain the cumulative frequency of a
class, we add the frequency of that class
and the frequencies of all previous classes.
The cumulative relative frequency of a
particular class is the proportion of
measurements that are less than or equal to the upper
limit of that class.

Ogive
An ogive (pronounced "Oh-jive") is a graph of
a cumulative frequency distribution.
We create an ogive in three steps.
First, from the frequency distribution created
earlier, calculate relative frequencies:
Class relative frequency = (Class frequency) / (Total number of observations)

Relative Frequencies
For example, we had 71 observations in our
first class (telephone bills from $0.00 to
$15.00). Thus, the relative frequency for this
class is 71 ÷ 200 (the total # of phone bills) =
0.355 (or 35.5%).

Ogive
Is a graph of a cumulative frequency
distribution.
We create an ogive in three steps
1) Calculate relative frequencies.
2) Calculate cumulative relative
frequencies by adding the current
class relative frequency to the previous
class cumulative relative frequency.
(For the first class, its cumulative relative
frequency is just its relative frequency.)


Cumulative Relative Frequencies


TABLE 2.15 Cumulative relative frequencies for Example 2.5
First class: .355
Next class: .355 + .185 = .540
...
Last class: .930 + .070 = 1.00
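The running totals in this table can be reproduced with a short Python sketch. Only some class frequencies are stated in these notes: 71 and 37 for the first two classes and 18, 28 and 14 for the last three. The middle three counts used below (13, 9, 10) are assumed placeholders, chosen only so that the total is 200; they should be replaced with the actual table values.

```python
from itertools import accumulate

# Class frequencies for the eight classes of Example 2.5.
# 71, 37 and 18, 28, 14 appear in the notes; 13, 9, 10 are ASSUMED
# placeholders so that the frequencies sum to 200.
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
total = sum(frequencies)

relative = [f / total for f in frequencies]
# Accumulate the counts first and divide once, which avoids the small
# rounding drift that comes from adding many decimal fractions.
cumulative = [c / total for c in accumulate(frequencies)]

for rel, cum in zip(relative, cumulative):
    print(f"relative={rel:.3f}  cumulative={cum:.3f}")
```

The first, second and last two cumulative values (.355, .540, .930, 1.00) do not depend on the assumed middle counts, because they are fixed by the stated frequencies and the total of 200.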


Ogive
Is a graph of a cumulative frequency
distribution.
1) Calculate relative frequencies.
2) Calculate cumulative relative
frequencies.
3) Graph the cumulative relative
frequencies
Example 2.5 Ogive


Ogive
Example 2.5 Ogive

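The ogive can be used to answer questions like:
What telephone bill value is at the 50th percentile?
Around $27: the cumulative relative frequency is already .540 at $30, so the 50th percentile lies a little below $30.
(Refer also to Fig. 2.21 in your textbook.)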
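Reading a percentile off an ogive amounts to linear interpolation between the plotted points. The sketch below uses only cumulative relative frequencies quoted in these notes for Example 2.5 (.355 at $15 and .540 at $30) plus the assumption that the ogive starts at 0 at $0. The function name is hypothetical, and a value read by eye from the printed figure may differ from the interpolated one.

```python
def percentile_from_ogive(points, p):
    """Linearly interpolate the value at cumulative proportion p.

    points: list of (upper_limit, cumulative_relative_frequency) pairs,
    in increasing order.
    """
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= p <= c1:
            if c1 == c0:
                return x0
            return x0 + (p - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("p is outside the range of the supplied ogive points")

# (0, 0.0) is an assumed starting point; the other two pairs are quoted
# in the notes for the first two classes.
ogive_points = [(0, 0.0), (15, 0.355), (30, 0.540)]
median_bill = percentile_from_ogive(ogive_points, 0.50)

print(f"Interpolated 50th percentile: ${median_bill:.2f}")
```

Since the cumulative relative frequency already exceeds .50 at $30, the interpolated 50th percentile falls in the second class, a little under $30.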



2.4 Describing Time Series Data


Observations measured at the same point in
time across individual units are called cross-sectional data.
Observations measured at successive points in
time on a single unit are called time-series
data.
Time-series data are graphed on a line chart,
which plots the value of the variable on the
vertical axis against the time periods on the
horizontal axis.
A line chart of time-series data is
also known as a time-series chart.

Time Series Data


We recorded the value of Australian exports from
1992 to 2009 (Figure 2.22). Draw a line chart to
describe these data and briefly describe the
results.


Line Chart
Plot the frequency of a category above
the point on the horizontal axis
representing that category.
Use line charts when the categories are
points in time.
Line charts are particularly useful when
the trend over time is to be
emphasised.


Line Chart
Figure 2.22 Line chart showing change in Australian exports over time


Line Chart
Australian exports increased slowly but
steadily from 1992 to 2004.
After 2004, Australian exports
increased steadily at a much
higher rate.

2.5 Relationship between Two Variables
So far we've looked at tabular and graphical
techniques for one variable (either nominal
or numerical data).
Now we will look at the relationship between
two variables (either nominal or numerical
data) using either tabular or graphical
techniques.

Describing the Relationship between Two Nominal Variables
A cross-classification table (or cross-tabulation table) is used to describe the
relationship between two nominal variables.
A cross-classification table lists the
frequency of each combination of the
values of the two variables.

Example 2.8
In a major Australian city there are four
competing newspapers: N1, N2, N3 and N4.
To help design advertising campaigns, the
advertising managers of the newspapers
need to know which segments of the
newspaper market are reading their papers.
A survey was conducted to analyse the
relationship between newspapers read and
occupation.

Example 2.8
A sample of newspaper readers was
asked to report which newspaper they
read (N1, N2, N3 or N4) and to indicate
whether they were a blue-collar worker (1),
a white-collar worker (2), or a professional
(3).
The responses are stored in file XM02-08.

Example 2.8
By counting the number of times each of
the 12 combinations occurs, we produced
Table 2.16.

Example 2.8
If occupation and newspaper are related,
then there will be differences in the
newspapers read among the occupations.
An easy way to see this is to convert the
frequencies in each row to relative
frequencies in each row.
That is, compute the row totals and divide
each frequency by its row total.
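The row conversion can be sketched in Python as follows. The counts below are hypothetical and are not the values of Table 2.16; they are chosen so that two rows have different totals but identical row proportions, which is exactly the situation that raw frequencies hide and row relative frequencies reveal.

```python
# HYPOTHETICAL counts (not Table 2.16): rows are occupations,
# columns are newspapers N1, N2, N3, N4.
table = {
    "blue-collar":  [30, 20, 40, 10],
    "white-collar": [20, 50, 20, 10],
    "professional": [10, 25, 10, 5],
}

def row_relative_frequencies(counts):
    """Divide each frequency by the total of its own row."""
    result = {}
    for row, freqs in counts.items():
        row_total = sum(freqs)
        result[row] = [f / row_total for f in freqs]
    return result

row_props = row_relative_frequencies(table)
for row, props in row_props.items():
    print(row, [f"{p:.2f}" for p in props])
```

In this made-up table the white-collar and professional rows have the same proportions, while the blue-collar row differs, mirroring the kind of comparison made in the interpretation of Example 2.8.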



Example 2.8
Interpretation: The relative frequencies in rows
2 and 3 are similar, but there are large differences
between rows 1 and 2, and between rows 1 and 3.
Row 1: Blue-collar (1); Row 2: White-collar (2);
Row 3: Professional (3)
This tells us that blue-collar workers tend to read
different newspapers from both white-collar
workers and professionals, and that white-collar
workers and professionals are quite similar in their
newspaper choice.

Example 2.8
Use the data from the cross-classification table to create
bar charts

For example,
Professionals (3)
tend to read
newspaper N2
more than twice
as often as
newspaper N3.


Describing the Relationship between Two Numerical Variables
Often we are interested in the relationships
between two numerical variables.
Advert  Sales
1       30
3       40
5       40
4       50
2       35
5       50
3       35
2       25

Example 2.9
A small-business owner wants to assess the
effects of advertising on sales levels.
Paired observation data were collected.
Each pair consisted of monthly advertising
expenditure and monthly sales levels.


Scatter diagram
A scatter diagram can describe the
relationship between advertising
expenditure and sales.
[Figure: Excel scatter diagram of Sales (vertical axis, 0 to 60) against Advertising Expenditure (horizontal axis, 0 to 5), using the paired Advert and Sales observations]
Advertising expenditure and sales have a positive relationship.
Advertising expenditure and sales appear to have a linear relationship.
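As a rough stand-in for the Excel chart, the eight (Advert, Sales) pairs of Example 2.9 can be drawn as a plain-text scatter diagram. This is a minimal Python sketch with a hypothetical helper name; it assumes the sales values lie on a grid of multiples of `y_step`, which holds for these data.

```python
advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]

def text_scatter(xs, ys, y_step=5):
    """Return text rows, highest y first, with '*' wherever an (x, y) pair occurs.

    Assumes every y value lies on the grid max(ys), max(ys) - y_step, ...
    """
    points = set(zip(xs, ys))
    x_values = range(min(xs), max(xs) + 1)
    rows = []
    for y in range(max(ys), min(ys) - 1, -y_step):
        marks = "".join(" *" if (x, y) in points else " ." for x in x_values)
        rows.append(f"{y:>3} |{marks}")
    # Final row labels the horizontal (advertising) axis.
    rows.append("     " + "".join(f"{x:>2}" for x in x_values))
    return rows

rows = text_scatter(advert, sales)
print("\n".join(rows))
```

The stars drift from the lower left to the upper right, which is the visual signature of a positive, roughly linear relationship.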

Patterns of Scatter Diagrams


Linearity and direction are two concepts we are
interested in.

Positive linear relationship

Negative linear relationship

Weak or non-linear relationship


Chapter-Opening Example
WERE OIL COMPANIES GOUGING CUSTOMERS 1999–2006?: SOLUTION

In May 1999 the average retail price of petrol
was A$0.67 per litre in Melbourne and the price
of oil (Dubai Fateh Crude) was US$15.38 per
barrel (1 barrel = 159.18 litres).
Over the next 10 years, the price of both
increased substantially. Many drivers complained
that the oil companies were guilty of price
gouging.
That is, they believed that when the price of oil
increased the price of petrol also increased, but
when the price of oil decreased, the decrease in
the price of petrol seemed to lag behind.

Chapter-Opening Example
WERE OIL COMPANIES GOUGING CUSTOMERS 1999–2006?: SOLUTION

To determine whether this perception is
accurate we determined the monthly figures
for both commodities. CH02:\Oil
Graphically depict these data and describe the
findings.


Chapter-Opening Example
Interpreting the results:
The scatter diagram reveals that the two
prices are strongly related linearly.
As the oil price increases, petrol price also
increases. When the price of oil was below
A$85, the relationship between the two
variables was stronger than when the price of
oil exceeded A$85.


Summary I
Factors That Identify When to Use Frequency and Relative
Frequency Tables, Bar and Pie Charts
1. Objective: Describe a single set of data.
2. Data type: Nominal.
Factors That Identify When to Use a Histogram, Ogive, or Stem-and-Leaf Display
1. Objective: Describe a single set of data.
2. Data type: Interval.
Factors That Identify When to Use a Cross-classification Table
1. Objective: Describe the relationship between two variables.
2. Data type: Nominal.
Factors That Identify When to Use a Scatter Diagram
1. Objective: Describe the relationship between two variables.
2. Data type: Interval.

Summary II
                                    Numerical data    Nominal data
Single set of data                  Histogram         Frequency and relative frequency tables, bar and pie charts
Relationship between two variables  Scatter diagram   Cross-classification table, bar charts

Typical patterns
Positive linear relationship

No relationship

Negative nonlinear relationship

Negative linear relationship

Nonlinear (concave) relationship
