Week 2
Objectives
On completion of this module you should be able to:
- produce a stem-and-leaf plot (by hand and using Excel/PHStat2)
- construct a frequency distribution (by hand and using Excel/PHStat2)
- plot a histogram, ogive and scatterplot (by hand and using Excel/PHStat2)
- graph a bar chart, pie chart and grouped (side-by-side) bar chart (by hand and using Excel/PHStat2)
- interpret the data presentations listed above, and apply the results and conclusions in real world examples
- discover and describe common graphical errors, and explain how to overcome these.
Example 2-1
The following data represent the actual weight of potato chips found in bags labelled 50 grams. The manufacturer aims to overfill the bags by 5 grams to allow for settling and dehydrating of the chips prior to sale. The results of fill weights in a sample of 20 consecutive 50-gram bags are listed below (reading from left to right in the order of being filled):
59.4 56.8 56.0 57.9 59.2 51.7 57.5 54.8 52.6 51.5 51.6 55.7 53.7 54.1 59.6 52.4 55.6 54.5 50.2 56.1
(a) Stem-and-leaf
First create an ordered array (order data from smallest to largest).
50.2 51.5 51.6 51.7 52.4 52.6 53.7 54.1 54.4 54.8
55.6 55.7 56.0 56.1 56.8 57.5 57.9 59.2 59.4 59.6
Choose the stems. It is probably easiest to use the first two digits: 50, 51, 52, 53, 54, ...
The leaves will then be the digits after the decimal point: 2, 5, 6, 7, 4, ...
Write the stems down the left hand side:
50 |
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
First data point is 50.2, so add 2 after 50.
50 | 2
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
Next data point is 51.5, so add 5 after 51.
50 | 2
51 | 5
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
Continue until all data is added.
50 | 2
51 | 5 6 7
52 | 4 6
53 | 7
54 | 1 4 8
55 | 6 7
56 | 0 1 8
57 | 5 9
58 |
59 | 2 4 6
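The same procedure can be sketched in Python. This is only an illustration, not the Excel/PHStat2 method used in the course. The function name `stem_and_leaf` is made up for this sketch, and the values are taken from the ordered array above.

```python
from collections import defaultdict

# Fill weights (grams), copied from the ordered array in the example.
weights = [50.2, 51.5, 51.6, 51.7, 52.4, 52.6, 53.7, 54.1, 54.4, 54.8,
           55.6, 55.7, 56.0, 56.1, 56.8, 57.5, 57.9, 59.2, 59.4, 59.6]


def stem_and_leaf(values):
    """Return {stem: [leaves]} with whole grams as stems and tenths as leaves."""
    tenths = sorted(round(v * 10) for v in values)  # work in integer tenths
    leaves = defaultdict(list)
    for t in tenths:
        leaves[t // 10].append(t % 10)
    # Keep empty stems (such as 58) so that gaps in the data stay visible.
    first, last = tenths[0] // 10, tenths[-1] // 10
    return {stem: leaves.get(stem, []) for stem in range(first, last + 1)}


plot = stem_and_leaf(weights)
for stem, stem_leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in stem_leaves))
```

Working in integer tenths avoids floating-point surprises when splitting a value such as 54.4 into stem 54 and leaf 4.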
Data range: 59.6 - 50.2 = 9.4
This is a small data set so we choose a small number of classes: 8.
Width of interval: 9.4 / 8 = 1.175
It is easier to round this number to 1.2 (since the data is given to 1 decimal place). Read the information on class and boundary points in the study guide (p. 2-7).
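The arithmetic above can be checked with a short Python sketch. It assumes the first class starts at the smallest observation (50.2) and rounds the width up to one decimal place so that eight classes still cover the whole range. Both choices are assumptions made for this illustration, and the variable names are not from the course material.

```python
import math

# Fill weights (grams), copied from the ordered array in the example.
weights = [50.2, 51.5, 51.6, 51.7, 52.4, 52.6, 53.7, 54.1, 54.4, 54.8,
           55.6, 55.7, 56.0, 56.1, 56.8, 57.5, 57.9, 59.2, 59.4, 59.6]

n_classes = 8
lowest = min(weights)

data_range = round(max(weights) - lowest, 1)   # 59.6 - 50.2
raw_width = data_range / n_classes             # width before rounding
width = math.ceil(raw_width * 10) / 10         # round up to 1 decimal place

# Class boundaries and midpoints, rounded to 1 decimal place for display.
boundaries = [round(lowest + i * width, 1) for i in range(n_classes + 1)]
midpoints = [round(b + width / 2, 1) for b in boundaries[:-1]]

print("range:", data_range, "raw width:", raw_width, "width used:", width)
print("boundaries:", boundaries)
print("midpoints:", midpoints)
```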
Class midpoint (g)   Tally   Number of bags
50.8                 /       1
52.0                 ////    4
53.2                 //      2
54.4                 ///     3
55.6                 ////    4
56.8                 /       1
58.0                 //      2
59.2                 ///     3
Class midpoint (g)   Number of bags   Percentage of bags
50.8                 1                1/20 × 100 = 5%
52.0                 4                4/20 × 100 = 20%
53.2                 2                2/20 × 100 = 10%
54.4                 3                3/20 × 100 = 15%
55.6                 4                4/20 × 100 = 20%
56.8                 1                1/20 × 100 = 5%
58.0                 2                2/20 × 100 = 10%
59.2                 3                3/20 × 100 = 15%
Total                20               100%
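A minimal Python sketch of the tally and percentage calculation follows. It assumes eight classes of width 1.2 starting at 50.2, with each class including its lower boundary but not its upper boundary. That convention is an assumption for this sketch, chosen because it reproduces the counts shown; check the study guide for the convention used in the course.

```python
# Fill weights (grams), copied from the ordered array in the example.
weights = [50.2, 51.5, 51.6, 51.7, 52.4, 52.6, 53.7, 54.1, 54.4, 54.8,
           55.6, 55.7, 56.0, 56.1, 56.8, 57.5, 57.9, 59.2, 59.4, 59.6]

lower, width, n_classes = 50.2, 1.2, 8

counts = [0] * n_classes
for w in weights:
    # Integer tenths avoid floating-point trouble at the class boundaries.
    index = (round(w * 10) - round(lower * 10)) // round(width * 10)
    counts[index] += 1

# Percentage of bags in each class: frequency / total × 100.
percentages = [100 * c / len(weights) for c in counts]

for i, (c, p) in enumerate(zip(counts, percentages)):
    start = round(lower + i * width, 1)
    print(f"{start:.1f} to under {round(start + width, 1):.1f}: {c} bags, {p:.0f}%")
```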
[Figure: histogram of fill weights; x-axis: class midpoints (50.8, 52, 53.2, 54.4, 55.6, 56.8, 58, 59.2); y-axis: frequency]
We will discuss how to produce histograms using Excel and PHStat2 during workshops. Instructions are in the text and Excel Handbook sections included in the text. Make sure you can produce histograms (and other graphs) by hand as well!!!
(d) Percentage distribution
[Figure: percentage polygon; y-axis: percentage of bags (5% to 25%)]
Class midpoint (g)   Percentage of bags   Cumulative percentage
50.8                 5                    5
52.0                 20                   25
53.2                 10                   35
54.4                 15                   50
55.6                 20                   70
56.8                 5                    75
58.0                 10                   85
59.2                 15                   100
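The cumulative column is a running total of the percentage column. A short Python sketch (the variable names are illustrative only):

```python
from itertools import accumulate

# Percentage of bags in each of the eight classes, from the table above.
percentages = [5, 20, 10, 15, 20, 5, 10, 15]

# Each cumulative value is the sum of all percentages up to and including that class.
cumulative = list(accumulate(percentages))

for pct, cum in zip(percentages, cumulative):
    print(f"{pct:3d}%  {cum:3d}%")
```

The last cumulative value must always be 100%, which is a quick check on hand calculations.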
[Figure: ogive (cumulative percentage polygon) of the fill weights; y-axis: cumulative percentage]
Solution 2-1
(g) On the basis of the results of (a) through (f), does there appear to be any concentration of the bag weights around specific values?
There are no obvious outliers or patterns, and the data seem fairly evenly distributed from about 50 to about 60 grams.
(h) If you had to make a prediction of the weight of potato chips in the next bag, what would you predict? Why?
The best prediction would be somewhere around the middle of the data (because there is no obvious trend or pattern): we could predict about 55 grams. Note: we will learn how to make more accurate forecasts later in the course.
Example 2-2
In recent years, the cost of holiday accommodation on a particular island has been increasing. There was, however, a reduction as a reaction to reduced air travel in the aftermath of the attacks of September 11, 2001. Since then, rising fuel costs have increased the cost of commercial flights and so further discouraged travel to the island, but despite this, the cost of accommodation has continued to increase.
This data represents the cost of a double room for one night's accommodation on the island for the years 1995 to 2006.
(a) Set up a scatter diagram with cost of the double room on the y-axis and year on the x-axis.
Year
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
185
190 205
(a) Scatterplot
[Figure: scatterplot "Cost of double room"; x-axis: Year (1994 to 2008); y-axis: Cost ($) (0 to 250)]
Time series data is data that is recorded at regular time intervals (in our example it was years). A time series plot has time on the x-axis and connects the data with straight lines. Since this particular example records the data annually (i.e. at regular intervals), a time series plot is more appropriate than a scatterplot.
[Figure: time series plot of the cost of a double room; x-axis: Year (1994 to 2008); y-axis: Cost ($)]
There is a clear upward trend in room cost from 1995 to 2001. Following the September 11 attacks, the cost decreases in each of the next three years. From 2004, the cost begins to increase again, but does not (yet) return to the heights experienced prior to September 11.
Example 2-3
A DVD hire company deals with a number of complaints regarding their rental DVDs. The number of times each complaint occurred is given in the table.
Complaint                  Frequency
Scratched disc             125
Dirty disc                 116
Cracked disc               21
Wrong DVD                  54
Too expensive              39
Coarse language            26
Explicit content           41
Boring                     29
Too violent                18
Too soppy                  27
Not funny                  33
Rental period too short    14
Bad movie                  12
Rude staff                 9
Store closed               4
Normally word categories (as in this example) are listed up the y-axis and number categories (for example years, months, pay classification scales etc.) are listed across the x-axis. Make sure you are confident preparing a bar chart by hand!! Remember: always label axes and give graphs a title!
[Figure: horizontal bar chart of complaint frequencies; y-axis: Complaint; x-axis: frequency (0 to 140)]
[Figure: pie chart of complaints, with each segment labelled by category and percentage]
The default view of this pie chart is difficult to read! You'll often have to work with default graphs to improve their look (especially for assignments!!). Sometimes just resizing the graph can help!
To produce a pie chart by hand you need a protractor (to measure degrees). In the exam you will only get easy category sizes (e.g. multiples of 45°). To calculate the degrees for each category, divide its frequency by the total of the frequency column (568), then multiply by 100 for the percentage or by 360 for the degrees:

Complaint        Frequency   %                      Degrees
Scratched disc   125         125/568 × 100 = 22%    125/568 × 360 = 79°
Dirty disc       116         116/568 × 100 = 20%    116/568 × 360 = 74°
Cracked disc     21          21/568 × 100 = 4%      21/568 × 360 = 13°
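The same calculation for every category can be sketched in Python. The total is obtained by summing the frequency column of the complaints table. The dictionary and variable names are illustrative only.

```python
# Complaint frequencies from the Example 2-3 table.
frequencies = {
    "Scratched disc": 125, "Dirty disc": 116, "Cracked disc": 21,
    "Wrong DVD": 54, "Too expensive": 39, "Coarse language": 26,
    "Explicit content": 41, "Boring": 29, "Too violent": 18,
    "Too soppy": 27, "Not funny": 33, "Rental period too short": 14,
    "Bad movie": 12, "Rude staff": 9, "Store closed": 4,
}

total = sum(frequencies.values())

# Each category's share of the total, as a percentage and as a pie-slice angle.
percent = {name: 100 * f / total for name, f in frequencies.items()}
degrees = {name: 360 * f / total for name, f in frequencies.items()}

for name in frequencies:
    print(f"{name:25s} {percent[name]:5.1f}%  {degrees[name]:6.1f} degrees")
```

The angles should add up to 360 degrees (allowing for rounding), which is a useful check before drawing the chart with a protractor.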
[Figure: Pareto Diagram; x-axis: Complaint, from Scratched disc (most frequent) to Store closed (least frequent); left-hand axis: percentage (0% to 25%); right-hand axis: cumulative percentage (0% to 100%)]
Often Pareto charts have both vertical axes with the same scale; Excel and PHStat2 do not do this easily. On the following slide, the left-hand axis allows for the total frequency value (568) and this lines up exactly with 100% in the cumulative frequency on the right-hand side. This graph also groups the very small categories (in this case only the last category), calling them "other".
[Figure: Pareto diagram with matched axes; left-hand axis: frequency; right-hand axis: Percent (0 to 100)]
(d) Which graphical method do you think is best to portray this data?
Pie chart: too cramped with so many categories. The similar-sized segments are difficult to compare. In some views, the category labels overlap each other.
The Pareto chart is preferred over the bar chart since it orders categories from largest to smallest, includes the cumulative percentage polygon and makes it easy to see the most common complaints.
The two most common complaints are scratched and dirty discs (22% and 20% respectively, or 42% of complaints in total). The third most common is wrong DVD (10%).
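The ordering and cumulative percentages behind a Pareto diagram can be sketched in Python as follows. This is an illustration only, not the PHStat2 procedure. It uses the frequencies from the complaints table, and the variable names are made up for the sketch.

```python
from itertools import accumulate

# Complaint frequencies from the Example 2-3 table.
frequencies = {
    "Scratched disc": 125, "Dirty disc": 116, "Cracked disc": 21,
    "Wrong DVD": 54, "Too expensive": 39, "Coarse language": 26,
    "Explicit content": 41, "Boring": 29, "Too violent": 18,
    "Too soppy": 27, "Not funny": 33, "Rental period too short": 14,
    "Bad movie": 12, "Rude staff": 9, "Store closed": 4,
}

total = sum(frequencies.values())

# Pareto order: most frequent complaint first.
ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

# Running total of frequencies, expressed as a percentage of all complaints.
running = list(accumulate(f for _, f in ordered))
cumulative_percent = [100 * r / total for r in running]

for (name, f), cum in zip(ordered, cumulative_percent):
    print(f"{name:25s} {f:4d} {100 * f / total:5.1f}% {cum:6.1f}%")
```

Reading down the cumulative column shows how quickly a few categories account for most of the complaints.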
Women
35
30
70
[Figure: side-by-side bar chart of accountants by job position (Senior accountant, Junior accountant, Accountant) and gender (Male, Female); x-axis: 10 to 60]
Because there are clearly more men in each of the three job positions (junior accountant, accountant and senior accountant) it is difficult to comment on the ratio of men to women in each class. Note that PHStat2 has changed the order of the three categories (to alphabetical order from bottom to top). It seems as if the relative number of women drops as the job position becomes more senior. It might be better to use relative frequencies to compare gender differences.
Data-Ink Ratio
The data-ink ratio is the proportion of the graphic's ink that is devoted to the non-redundant display of data information.
Data-ink ratio = Data-ink / Total ink used to print the graphic
Aim: maximise the proportion of ink used in the graph that is devoted to data.
Source: Levine et al., 2005.
Graphical excellence
Chartjunk: decoration that is non-data-ink or redundant data-ink.
Lie factor: the ratio of the size of the effect shown in the graph to the size of the effect in the data.
The aim is to reduce both of these!
We will discuss examples in more detail in tutorials, so it is important to be there!
Source: Levine et al., 2005.
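Both ratios are simple divisions, as the Python sketch below shows. The function names and all of the numbers are hypothetical, chosen only to illustrate the definitions. Measuring an "effect" as a relative change is also an assumption of this sketch.

```python
def data_ink_ratio(data_ink, total_ink):
    """Proportion of the graphic's ink that displays non-redundant data."""
    return data_ink / total_ink


def lie_factor(effect_in_graph, effect_in_data):
    """Size of the effect shown in the graph relative to the effect in the data."""
    return effect_in_graph / effect_in_data


# Hypothetical graphic: 30 units of ink show data, out of 120 units in total.
ratio = data_ink_ratio(30, 120)

# Hypothetical distortion: the data rise by 50% (0.5) but the bars
# are drawn 150% (1.5) taller.
factor = lie_factor(1.5, 0.5)

print("data-ink ratio:", ratio)
print("lie factor:", factor)
```

A lie factor of 1 means the graph shows the effect at its true size; values well above or below 1 indicate distortion.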
- Review the lecture material
- Complete all readings
- Complete all of the recommended problems (listed in SG) from the textbook
- Complete at least some of the additional problems
- Consider (briefly) the discussion points prior to tutorials