Overall Descriptive Statistics

1
Slide
Descriptive Statistics
Data and Statistics
Applications in Business and Economics
Data
Data Sources
Statistical Inference
2

Slide
Applications in Business and Economics
Statistics is the process of collecting, organizing,
analyzing, and interpreting data in order to make
decisions.
Accounting
Public accounting firms use statistical sampling
procedures when conducting audits for their clients.
Finance
Financial analysts use a variety of statistical
information, including price-earnings ratios and
dividend yields, to guide their investment
recommendations.
Marketing
Point-of-sale scanners at retail checkout counters are
being used to collect data for a variety of marketing
research applications.
3

Slide
Production
A variety of statistical quality control charts are used
to monitor the output of a production process.
Economics
Economists use statistical information in making
forecasts about the future of the economy or some
aspect of it.
Applications in
Business and Economics
4

Slide
Why Study Statistics?

1. Numerical information is everywhere
2. Statistical techniques are used to make decisions that
affect our daily lives
3. The knowledge of statistical methods will help you to
understand how decisions are made and give you a better
understanding of how they affect you.
4. No matter what line of work you select, you will find
yourself faced with decisions where an understanding of
data analysis is helpful.
Some examples of the need for data collection.
1. Research analysts evaluate many facets of a particular
stock before making a buy or sell recommendation.
2. Marketing department managers must make
decisions about the quality of their product or service.

5

Slide
What is Meant by Statistics?
In the more common usage, statistics refers to
numerical information.
Examples: the average starting salary of college
graduates, the number of deaths due to alcoholism last
year etc.
We often present statistical information in a
graphical form for capturing reader attention.

6

Slide
Types of Statistics
Descriptive Statistics - methods of organizing,
summarizing, and presenting data in an informative
way.

Inferential Statistics: A decision, estimate, prediction, or
generalization about a population, based on a
sample.

7

Slide
Population versus Sample
A population is a collection of all possible individuals,
objects, or measurements of interest.
A sample is a portion, or part, of the population of
interest

8

Slide
Data
Elements, Variables, and Observations
Scales of Measurement
Qualitative and Quantitative Data
Cross-Sectional and Time Series Data
9

Slide
Data and Data Sets
Data are the facts and figures that are collected,
summarized, analyzed, and interpreted. E.g.,
IBM's sales revenue is $100 bn; its stock price is $80.
The data collected in a particular study are referred to
as the data set. E.g.,
The sales revenue and stock price data for a
number of firms including IBM, Dell, Apple, etc.

10

Slide
Elements, Variables, and Observations
The elements are the entities on which data are
collected. E.g.,
IBM, Dell, Apple, etc. in the previous setting.
A variable is a characteristic of interest for the
elements. E.g.,
Sales revenue, stock price (of a company)
The set of measurements collected for a particular
element is called an observation.
Sales revenue, stock price for 2003

11

Slide
Scales of measurement include:
Nominal
Ordinal
Interval
Ratio
The scale determines the amount of information
contained in the data.
The scale indicates the data summarization and
statistical analyses that are most appropriate.
12

Slide
Nominal
Data that are classified into categories and cannot be
arranged in any particular order. A numeric code may
be used. The nominal scale categorizes individuals
or groups, and responses are summarized as the percentage
in each category, e.g. male/female, Pakistani/American.
Example:
Students of a university are classified by the school
in which they are enrolled using a nonnumeric
label such as Business, Humanities, Education, and
so on.
Alternatively, a numeric code could be used for the
school variable (e.g. 1 denotes Business, 2 denotes
Humanities, 3 denotes Education, and so on).

13

Slide
Ordinal
Similar to the nominal level, with the additional property
that the order or rank of the data values is meaningful;
the size of the differences between values is not. It
categorizes and ranks the variables according to preference,
e.g. from best to worst or first to last. A numeric code
may be used, e.g. to rank job characteristics.
Example:
Students of a university are classified by their class
standing using a nonnumeric label such as
Freshman, Sophomore, Junior, or Senior.

Alternatively, a numeric code could be used for the
class standing variable (e.g. 1 denotes Freshman, 2
denotes Sophomore, and so on).

14

Slide
Interval
The data have the properties of ordinal data, and the
interval between observations is expressed in
terms of a fixed unit of measure. A typical use is
preferences on a 5- or 7-point scale. The scale also
measures the magnitude of the differences in
preferences among individuals. Interval data are always numeric.
Example:
Responses coded 1 to 5: strongly disagree, disagree,
neither agree nor disagree, agree, strongly agree.

15

Slide
Ratio
The data have all the properties of interval data and
the ratio of two values is meaningful. This scale
must contain a zero value that indicates that
nothing exists for the variable at the zero point.
Example:
Variables such as distance, height, weight, and
time use the ratio scale.

16

Slide

Ratio scale: used when exact numbers are called for,
e.g. how many orders do you operate?
Interval scale: used for responses to various items
on a 5- or 7-point scale; as with the ratio scale, the
statistical measures are the mean and standard deviation.
Ordinal scale: used for preferences; the statistical measures
are the median, range, and rank-order correlations.
Nominal scale: used for personal data.

17

Slide
Types of Variables
A. Qualitative variable - the characteristic
being studied is nonnumeric.
EXAMPLES: Gender, religious affiliation, type of
automobile owned, eye color.
Qualitative variables use either the nominal or ordinal scale of measurement.

B. Quantitative variable - information is
reported numerically.
EXAMPLES: balance in your account, minutes
remaining in class, or number of children in a family.

18

Slide
Quantitative Data
Quantitative data indicate either how many or how
much.
Quantitative data that measure how many are
discrete.
Quantitative data that measure how much are
continuous.
Quantitative data are always numeric.
Arithmetic operations (e.g., +, -) are meaningful only
with quantitative data.

19

Slide
Summary of Types of Variables
20

Slide
Cross-Sectional and Time Series Data
Cross-sectional data are collected at the same or
approximately the same point in time.
Example: data detailing the number of building
permits issued in June 2000
Time series data are collected over several time
periods.
Example: data detailing the number of building
permits issued in Texas in each of the last 36 months
21

Slide
Data Sources
Existing Sources
Data needed for a particular application might
already exist within a firm. Detailed information
is often kept on customers, suppliers, and
employees.
Substantial amounts of business and economic
data are available from organizations that
specialize in collecting and maintaining data.
Government agencies are another important
source of data.
Data are also available from a variety of industry
associations and special-interest organizations.
22

Slide
Data Sources
Internet
The Internet has become an important source of
data.
Most government agencies, like the Bureau of the
Census (www.census.gov), make their data
available through a web site.
More and more companies are creating web sites
and providing public access to them.
A number of companies now specialize in making
information available over the Internet.
23

Slide
Data Acquisition Considerations
Time Requirement
Searching for information can be time consuming.
Information might no longer be useful by the time
it is available.
Cost of Acquisition
Organizations often charge for information even
when it is not their primary business activity.
Data Errors
Using any data that happens to be available or
that were acquired with little care can lead to poor
and misleading information.
24

Slide
Descriptive statistics are the tabular, graphical, and
numerical methods used to summarize data.
25

Slide
Example: Hudson Auto Repair
The manager of Hudson Auto would like to have
a better understanding of the cost of parts used in the
engine tune-ups performed in the shop. She examines
50 customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are listed below.

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

26

Slide
Tabular Summary (Frequencies and Percent
Frequencies)

Parts Cost ($)   Frequency   Percent Frequency
50-59                2               4
60-69               13              26
70-79               16              32
80-89                7              14
90-99                7              14
100-109              5              10
Total               50             100

27

Slide
Graphical Summary (Histogram)
[Figure: histogram of parts cost. Horizontal axis: Parts Cost ($), 50 to 110; vertical axis: Frequency, 2 to 18.]
28

Slide
Numerical Descriptive Statistics
The most common numerical descriptive statistic
is the average (or mean).
Hudson's average cost of parts, based on the 50
tune-ups studied, is $79 (found by summing the
50 cost values and then dividing by 50).

29

Slide
Statistical Inference
Statistical inference is the process of using data
obtained from a small group of elements (the sample)
to make estimates and test hypotheses about the
characteristics of a larger group of elements (the
population).
30

Slide
Process of Statistical Inference
1. The population consists of all tune-ups. The average
cost of parts is unknown.
2. A sample of 50 engine tune-ups is examined.
3. The sample data provide a sample average cost of
$79 per tune-up.
4. The value of the sample average is used to make an
estimate of the population average.
31

Slide
Descriptive Statistics:
Tabular and Graphical Methods
Summarizing the Qualitative Data
Frequency Distribution
Relative Frequency
Percent Frequency Distribution
Bar Graph
Pie Chart
32

Slide
A frequency distribution is a tabular summary of
data showing the frequency (or number) of items in
each of several classes.

33

Slide
Example: Marada Inn
Guests staying at Marada Inn were asked to rate the
quality of their accommodations as being excellent,
above average, average, below average, or poor. The
ratings provided by a sample of 20 guests are shown
below.

Below Average, Average, Above Average,
Above Average, Above Average, Above Average,
Above Average, Below Average, Below Average,
Average, Poor, Poor,
Above Average, Excellent, Above Average,
Average, Above Average, Average,
Above Average, Average
34

Slide

Example: Marada Inn
Frequency Distribution

Rating           Frequency
Poor                 2
Below Average        3
Average              5
Above Average        9
Excellent            1
Total               20
35

Slide
Relative Frequency Distribution
The relative frequency of a class is the fraction or
proportion of the total number of data items
belonging to the class.
A relative frequency distribution is a tabular
summary of a set of data showing the relative
frequency for each class.
36

Slide
Percent Frequency Distribution
The percent frequency of a class is the relative
frequency multiplied by 100.
A percent frequency distribution is a tabular
summary of a set of data showing the percent
frequency for each class.

37

Slide
Example: Marada Inn
Relative Frequency and Percent Frequency
Distributions

Rating           Relative Frequency   Percent Frequency
Poor                   .10                  10
Below Average          .15                  15
Average                .25                  25
Above Average          .45                  45
Excellent              .05                   5
Total                 1.00                 100
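
The frequency, relative frequency, and percent frequency columns can be reproduced with a few lines of Python. This is only a sketch: the `ratings` list is typed in from the 20 guest responses in the example, and the variable names are illustrative.

```python
from collections import Counter

# The 20 guest ratings from the Marada Inn example, in the order listed.
ratings = [
    "Below Average", "Average", "Above Average",
    "Above Average", "Above Average", "Above Average",
    "Above Average", "Below Average", "Below Average",
    "Average", "Poor", "Poor",
    "Above Average", "Excellent", "Above Average",
    "Average", "Above Average", "Average",
    "Above Average", "Average",
]

counts = Counter(ratings)   # frequency of each class
n = len(ratings)            # total number of data items

# Present the classes in their natural order, worst to best.
order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]
table = {
    rating: {
        "frequency": counts[rating],
        "relative": counts[rating] / n,        # fraction of the total
        "percent": 100 * counts[rating] / n,   # relative frequency x 100
    }
    for rating in order
}

for rating, row in table.items():
    print(f"{rating:15s} {row['frequency']:3d} {row['relative']:.2f} {row['percent']:.0f}")
```

The relative frequencies sum to 1 and the percent frequencies to 100, which is a quick check that no response was dropped.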
38

Slide
Bar Graph
A bar graph is a graphical device for depicting
qualitative data.
On the horizontal axis we specify the labels that are
used for each of the classes.
A frequency, relative frequency, or percent frequency
scale can be used for the vertical axis.
The bars are separated to emphasize the fact that
each class is a separate category.
39

Slide
Example: Marada Inn
Bar Graph
[Figure: bar graph of the quality ratings. Horizontal axis: Rating (Poor, Below Average, Average, Above Average, Excellent); vertical axis: Frequency, 1 to 9.]
40

Slide
Pie Chart
The pie chart is a commonly used graphical device
for presenting relative frequency distributions for
qualitative data.
First draw a circle; then use the relative frequencies
to subdivide the circle into sectors that correspond to
the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a
relative frequency of .25 would consume .25(360) =
90 degrees of the circle.
41

Slide
Example: Marada Inn
Pie Chart
[Figure: pie chart of Quality Ratings. Above Average 45%, Average 25%, Below Average 15%, Poor 10%, Excellent 5%.]
42

Slide
Insights Gained from the Preceding Pie Chart
One-half of the customers surveyed gave Marada
a quality rating of above average or excellent
(looking at the left side of the pie). This might
please the manager.
For each customer who gave an excellent rating,
there were two customers who gave a poor
rating (looking at the top of the pie). This should
displease the manager.
Example: Marada Inn
43

Slide
Summarizing Quantitative Data
Relative Frequency
Percent Frequency Distributions
Cumulative Distributions
Dot Plot
Histogram
Ogive/ Frequency Polygon
44

Slide
Example: Hudson Auto Repair
The manager of Hudson Auto would like to get a
better picture of the distribution of costs for engine
tune-up parts. A sample of 50 customer invoices has
been taken and the costs of parts, rounded to the
nearest dollar, are listed below.

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

45

Slide
Guidelines for Selecting Number of Classes
Use between 5 and 20 classes.
Data sets with a larger number of elements
usually require a larger number of classes.
Smaller data sets usually require fewer classes.
46

Slide
Guidelines for Selecting Width of Classes
Use classes of equal width.
$$\text{Approximate Class Width} = \frac{\text{Largest Data Value} - \text{Smallest Data Value}}{\text{Number of Classes}}$$
47

Slide
If we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5 ~ 10

Cost ($) Frequency
50-59 2
60-69 13
70-79 16
80-89 7
90-99 7
100-109 5
Total 50
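
As a rough illustration, the class-width guideline and the resulting frequency counts can be checked in Python. The starting limit of 50 and the helper name `frequency_distribution` are assumptions made for this sketch; the slides only show the finished classes.

```python
# Hudson Auto parts costs, rounded to the nearest dollar.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

num_classes = 6
# (largest value - smallest value) / number of classes
approx_width = (max(costs) - min(costs)) / num_classes

class_width = 10         # approximate width rounded up to a convenient value
first_lower_limit = 50   # assumed lower limit of the first class


def frequency_distribution(values, lower, width, k):
    """Count how many values fall in each of k equal-width classes."""
    counts = [0] * k
    for v in values:
        counts[(v - lower) // width] += 1
    return counts


frequencies = frequency_distribution(costs, first_lower_limit, class_width, num_classes)
for j, f in enumerate(frequencies):
    low = first_lower_limit + j * class_width
    print(f"{low}-{low + class_width - 1}: {f}")
```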
48

Slide
Relative Frequency and Percent Frequency
Distributions

Cost ($)   Relative Frequency   Percent Frequency
50-59            .04                   4
60-69            .26                  26
70-79            .32                  32
80-89            .14                  14
90-99            .14                  14
100-109          .10                  10
Total           1.00                 100
49

Slide
Insights Gained from the Percent Frequency
Distribution
Only 4% of the parts costs are in the $50-59 class.
30% of the parts costs are under $70.
The greatest percentage (32% or almost one-third)
of the parts costs are in the $70-79 class.
10% of the parts costs are $100 or more.
50

Slide
Dot Plot
One of the simplest graphical summaries of
quantitative data is a dot plot.
A horizontal axis shows the range of data values.
Then each data value is represented by a dot placed
above the axis.

51

Slide
Dot Plot
[Figure: dot plot of the 50 parts costs. Horizontal axis: Cost ($), 50 to 110.]
52

Slide
Histogram
Another common graphical presentation of
quantitative data is a histogram.
The variable of interest is placed on the horizontal
axis.
A rectangle is drawn above each class interval, with its height
corresponding to the interval's frequency, relative frequency,
or percent frequency.
Unlike a bar graph, a histogram has no natural
separation between rectangles of classes.
53

Slide
Histogram
[Figure: histogram of parts cost. Horizontal axis: Parts Cost ($), 50 to 110; vertical axis: Frequency, 2 to 18.]
54

Slide
Cumulative frequency distribution -- shows the
number of items with values less than or equal to the
upper limit of each class.
Cumulative relative frequency distribution -- shows
the proportion of items with values less than or equal
to the upper limit of each class.
Cumulative percent frequency distribution -- shows
the percentage of items with values less than or equal
to the upper limit of each class.
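
A minimal Python sketch of the three cumulative distributions, assuming the Hudson Auto class frequencies (2, 13, 16, 7, 7, 5) as input. It uses a running total, which is one straightforward way to build the columns.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]   # upper limit of each class
frequencies = [2, 13, 16, 7, 7, 5]         # Hudson Auto class frequencies
n = sum(frequencies)

# Running total of the class frequencies.
cum_freq = list(accumulate(frequencies))
# Divide by n for proportions, then scale by 100 for percentages.
cum_rel = [c / n for c in cum_freq]
cum_pct = [100 * c / n for c in cum_freq]

for limit, f, r, p in zip(upper_limits, cum_freq, cum_rel, cum_pct):
    print(f"up to {limit}: {f:3d} {r:.2f} {p:.0f}")
```

The last cumulative frequency always equals the number of observations, so the last cumulative percent is 100.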
55

Slide

Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59               2                        .04                             4
≤ 69              15                        .30                            30
≤ 79              31                        .62                            62
≤ 89              38                        .76                            76
≤ 99              45                        .90                            90
≤ 109             50                       1.00                           100

56

Slide
Ogive
An ogive is a graph of a cumulative distribution.
The data values are shown on the horizontal axis.
Shown on the vertical axis are the:
cumulative frequencies, or cumulative relative
frequencies, or cumulative percent frequencies
The frequency (one of the above) of each class is
plotted as a point.
The plotted points are connected by straight lines.
57

Slide
Ogive
Because the class limits for the parts-cost data are
50-59, 60-69, and so on, there appear to be one-unit
gaps from 59 to 60, 69 to 70, and so on.
These gaps are eliminated by plotting points
halfway between the class limits.
Thus, 59.5 is used for the 50-59 class, 69.5 is used
for the 60-69 class, and so on.

58

Slide
Ogive with Cumulative Percent Frequencies
[Figure: ogive. Horizontal axis: Parts Cost ($), 50 to 110; vertical axis: Cumulative Percent Frequency, 20 to 100.]
59

Slide
Cross tabulations and Scatter Diagrams
Thus far we have focused on methods that are used
to summarize the data for one variable at a time.
Often a manager is interested in tabular and
graphical methods that will help to understand the
relationship between two variables.
Cross tabulation and a scatter diagram are two
methods for summarizing the data for two (or more)
variables simultaneously.

60

Slide
Crosstabulation
Crosstabulation is a tabular method for summarizing
the data for two variables simultaneously.
Crosstabulation can be used when:
One variable is qualitative and the other is
quantitative
Both variables are qualitative
Both variables are quantitative
The left and top margin labels define the classes for
the two variables.
61

Slide
Example: Finger Lakes Homes
Crosstabulation
The number of Finger Lakes homes sold for each
style and price for the past two years is shown below.

Price                    Home Style
Range        Colonial   Ranch   Split   A-Frame   Total

≤ $99,000       18         6      19       12       55
> $99,000       12        14      16        3       45

Total           30        20      35       15      100
62

Slide
Insights Gained from the Preceding Crosstabulation
The greatest number of homes in the sample (19)
are a split-level style and priced at less than or
equal to $99,000.
Only three homes in the sample are an A-Frame
style and priced at more than $99,000.

63

Slide
Crosstabulation: Row or Column Percentages
Converting the entries in the table into row
percentages or column percentages can provide
additional insight about the relationship between the
two variables.
64

Slide
Row Percentages

Price                    Home Style
Range        Colonial   Ranch   Split   A-Frame   Total

≤ $99,000     32.73     10.91   34.55    21.82     100
> $99,000     26.67     31.11   35.56     6.67     100

Note: row totals are actually 100.01 due to rounding.

65

Slide
Column Percentages

Price                    Home Style
Range        Colonial   Ranch   Split   A-Frame

≤ $99,000     60.00     30.00   54.29    80.00
> $99,000     40.00     70.00   45.71    20.00

Total          100       100     100      100
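
To make the conversion concrete, here is a small Python sketch that derives row and column percentages from the Finger Lakes counts. The dictionary keys and variable names are illustrative choices for this example only.

```python
styles = ["Colonial", "Ranch", "Split", "A-Frame"]

# Counts from the Finger Lakes crosstabulation, one list per price range.
counts = {
    "up to $99,000": [18, 6, 19, 12],
    "over $99,000": [12, 14, 16, 3],
}

# Row percentages: each cell divided by its row total.
row_pct = {
    price: [round(100 * c / sum(row), 2) for c in row]
    for price, row in counts.items()
}

# Column percentages: each cell divided by its column total.
col_totals = [sum(col) for col in zip(*counts.values())]
col_pct = {
    price: [round(100 * c / t, 2) for c, t in zip(row, col_totals)]
    for price, row in counts.items()
}

for price in counts:
    print(price, "row %:", row_pct[price], "column %:", col_pct[price])
```

Row percentages answer "of the homes in this price range, what share is each style?", while column percentages answer "of the homes of this style, what share is in each price range?".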
66

Slide
Scatter Diagram
A scatter diagram is a graphical presentation of the
relationship between two quantitative variables.
One variable is shown on the horizontal axis and the
other variable is shown on the vertical axis.
The general pattern of the plotted points suggests the
overall relationship between the variables.
67

Slide
Example: Panthers Football Team
Scatter Diagram
The Panthers football team is interested in
investigating the relationship, if any, between
interceptions made and points scored.

x = Number of Interceptions   y = Number of Points Scored
            1                             14
            3                             24
            2                             18
            1                             17
            3                             27
68

Slide
Scatter Diagram
[Figure: scatter diagram. Horizontal axis: x, Number of Interceptions, 0 to 3; vertical axis: y, Number of Points Scored, 0 to 30.]
69

Slide
The preceding scatter diagram indicates a positive
relationship between the number of interceptions
and the number of points scored.
Higher points scored are associated with a higher
number of interceptions.
The relationship is not perfect; not all plotted points in
the scatter diagram lie on a straight line.
70

Slide
Scatter Diagram
A Positive Relationship
[Figure: scatter diagram, x on the horizontal axis and y on the vertical axis.]
71

Slide
Scatter Diagram
A Negative Relationship
[Figure: scatter diagram, x on the horizontal axis and y on the vertical axis.]
72

Slide
Scatter Diagram
No Apparent Relationship
[Figure: scatter diagram, x on the horizontal axis and y on the vertical axis.]
73

Slide
Tabular and Graphical Procedures
Data
  Qualitative Data
    Tabular Methods: Frequency Distribution, Rel. Freq. Dist., % Freq. Dist., Crosstabulation
    Graphical Methods: Bar Graph, Pie Chart
  Quantitative Data
    Tabular Methods: Frequency Distribution, Rel. Freq. Dist., Cum. Freq. Dist., Cum. Rel. Freq. Distribution, Crosstabulation
    Graphical Methods: Dot Plot, Histogram, Ogive, Scatter Diagram
74

Slide
Descriptive Statistics: Numerical
Methods
Measures of Location
The Mean (A.M, G.M and H. M)
The Median
The Mode
Percentiles
Quartiles
75

Slide
Summary Measures
Describing Data Numerically
  Center and Location: Mean, Median, Mode, Weighted Mean, Percentiles, Quartiles
  Variation: Range, Variance, Standard Deviation, Coefficient of Variation
76

Slide
Mean
The Mean is the average of data values
The most common measure of central tendency
Mean = sum of values divided by the number of values

[Figure: data plotted on a 0 to 10 axis; Mean = 3]
[Figure: data plotted on a 0 to 10 axis; Mean = 4]
$$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$
77

Slide
Mean
The mean (or average) is
the basic measure of
location or central
tendency of the data.
The sample mean, $\bar{x}$, is a
sample statistic.
The population mean, $\mu$, is a
population parameter.
78

Slide
Mean
Sample mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Population mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

n = Sample Size
N = Population Size
79

Slide
Example: College Class Size
We have the following sample of data
for 5 college classes:
46 54 42 46 32
We use the notation $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$ to represent the
number of students in each of the 5 classes:
$x_1 = 46$, $x_2 = 54$, $x_3 = 42$, $x_4 = 46$, $x_5 = 32$
Thus we have:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{46 + 54 + 42 + 46 + 32}{5} = 44$$
The average class size is 44 students
80

Slide
Median
The median is the value in the
middle when the data are arranged in
ascending order (from smallest value
to largest value).
a. For an odd number of observations the median
is the middle value.
b. For an even number of observations the
median is the average of the two middle values.
81

Slide
The College Class Size example
First, arrange the data in ascending order:
32 42 46 46 54
Notice that n = 5, an odd number. Thus the
median is given by the middle (third) value.
The median class size is 46.
82

Slide
Median Starting Salary For a Sample
of 12 Business School Graduates
A college placement office has obtained the
following data for 12 recent graduates:
Graduate   Starting Salary   Graduate   Starting Salary
1          2850              7          2890
2          2950              8          3130
3          3050              9          2940
4          2880              10         3325
5          2755              11         2920
6          2710              12         2880
83

Slide
First we arrange the data in ascending order:
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Notice that n = 12, an even number. Thus we take an
average of the middle two observations (the 6th and 7th values, 2890 and 2920):
$$\text{Median} = \frac{2890 + 2920}{2} = 2905$$
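
For readers who want to check these figures, here is a short sketch using Python's standard `statistics` module on the class-size and starting-salary samples. The variable names are illustrative.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]
salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

# Mean: sum of the values divided by the number of values.
mean_class_size = statistics.mean(class_sizes)

# Median with an odd number of observations: the middle value.
median_class_size = statistics.median(class_sizes)

# Median with an even number of observations: the average of the
# two middle values after sorting.
median_salary = statistics.median(salaries)

print(mean_class_size, median_class_size, median_salary)
```

`statistics.median` sorts the data internally, so the input does not need to be arranged in ascending order first.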
84

Slide
Mode
The mode is the value that occurs with
greatest frequency
A measure of central tendency
Value that occurs most often
There may be no mode
There may be several modes

[Figure: data plotted on a 0 to 14 axis; Mode = 5]
[Figure: data plotted on a 0 to 6 axis; No Mode]
85

Slide
The Mode
MODE The value of the observation that appears most frequently.

86

Slide
Characteristics of the Mean
1. The most widely used measure of
location.
2. Major characteristics:
All values are used.
It is unique.
It is calculated by summing the
values and dividing by the
number of values.
3. Weakness: Its value can be distorted
when the data contain values that are
extremely large or extremely small
compared to the majority of the data.

Properties and Uses of the Median
1. There is a unique median for each data set.
2. Not affected by extremely large or small
values and is therefore a valuable measure
of central tendency when such values
occur.

Characteristics of the Mode
1. Mode: the value of the
observation that appears
most frequently.
2. Advantage: Not affected
by extremely high or low
values.
3. Disadvantages:
For many sets of data,
there is no mode
because no value
appears more than
once.
For some data sets
there is more than one
mode.
87

Slide
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.

88

Slide
Weighted Mean

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where:
$x_i$ = value of observation i
$w_i$ = weight for observation i

89

Slide
Mean for Grouped Data

Sample Data
$$\bar{x} = \frac{\sum f_i M_i}{\sum f_i}$$

Population Data
$$\mu = \frac{\sum f_i M_i}{N}$$

where:
$f_i$ = frequency of class i
$M_i$ = midpoint of class i
90

Slide
Weighted Mean
Used when values are grouped by frequency or relative
importance

Example: Sample of 26 Repair Projects

Days to Complete   Frequency
       5               4
       6              12
       7               8
       8               2

Weighted Mean Days to Complete:
$$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days}$$
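
The same calculation as a minimal Python sketch. The helper name `weighted_mean` is an illustrative choice, and the frequencies play the role of the weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]   # number of repair projects taking that many days

mean_days = weighted_mean(days_to_complete, frequency)
print(round(mean_days, 2))
```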
91

Slide
Example: Apartment Rents
Given below is a sample of 70 monthly rents
for one-bedroom apartments, presented here as grouped
data in the form of a frequency distribution.

Rent ($) Frequency
420-439 8
440-459 17
460-479 12
480-499 8
500-519 7
520-539 4
540-559 2
560-579 4
580-599 2
600-619 6
92

Slide
Mean for Grouped Data

Rent ($)    f_i    M_i      f_i M_i
420-439      8    429.5     3436.0
440-459     17    449.5     7641.5
460-479     12    469.5     5634.0
480-499      8    489.5     3916.0
500-519      7    509.5     3566.5
520-539      4    529.5     2118.0
540-559      2    549.5     1099.0
560-579      4    569.5     2278.0
580-599      2    589.5     1179.0
600-619      6    609.5     3657.0
Total       70             34525.0

$$\bar{x} = \frac{34{,}525}{70} = 493.21$$

This approximation differs by $2.41 from the actual sample
mean of $490.80.
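
The grouped-data approximation can be sketched in Python as follows. Each class is represented by its midpoint, weighted by the class frequency; the tuple layout `(lower, upper, frequency)` is an assumption made for this example.

```python
# (lower limit, upper limit, frequency) for each rent class.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

n = sum(f for _, _, f in classes)

# Midpoint of each class: halfway between its lower and upper limits.
midpoints = [(low + high) / 2 for low, high, _ in classes]

# Sum of frequency * midpoint over all classes.
total = sum(f * m for (_, _, f), m in zip(classes, midpoints))

grouped_mean = total / n
print(n, total, round(grouped_mean, 2))
```

Because every value in a class is treated as if it sat at the midpoint, the result is an approximation of the mean of the raw data, not an exact figure.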
93

Slide
Five houses on a hill by the beach
Review Example
$2,000 K
$500 K
$300 K
$100 K
$100 K
House Prices:

$2,000,000
500,000
300,000
100,000
100,000

94

Slide
Summary Statistics
Mean: ($3,000,000/5)
= $600,000

Median: middle value of ranked data
= $300,000

Mode: most frequent value
= $100,000
House Prices:

$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000
95

Slide
Percentiles
The pth percentile is a value such that at least p
percent of the observations are less than or equal to
this value and at least (100 - p) percent of the
observations are greater than or equal to this value.
Example: "I scored in the 70th percentile on the
Graduate Record Exam (GRE)," meaning I scored higher
than 70 percent of those who took the exam.
96

Slide
Calculating the pth Percentile
Step 1: Arrange the data in ascending order
(smallest value to largest value).
Step 2: Compute an index i:
$$i = \left(\frac{p}{100}\right) n$$
where p is the percentile of interest and n is the number
of observations.
Step 3: (a) If i is not an integer, round up. The next
integer greater than i denotes the position of the
pth percentile.
(b) If i is an integer, the pth percentile is the
average of the values in positions i and i + 1.
97

Slide
Example: Starting Salaries of
Business Grads
Let's compute the 85th percentile using the starting
salary data.
Step 1: Arrange the data in ascending order.
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050
3130 3325
Step 2:
$$i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right) 12 = 10.2$$
Step 3: Since 10.2 is not an integer, round up to
11. The 85th percentile is the value in the 11th position (3130).
98

Slide
Quartiles
Quartiles are just specific percentiles.
Let:
Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile
Let's compute the 1st and 3rd quartiles using the
starting salary data. Note that we already computed the
median for this sample, so we know the 2nd quartile.
99

Slide
Now find the 25th percentile:
$$i = \left(\frac{p}{100}\right) n = \left(\frac{25}{100}\right) 12 = 3$$
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050
3130 3325
Note that 3 is an integer, so to find the 25th percentile we must
average together the 3rd and 4th values:
Q1 = (2850 + 2880)/2 = 2865
Now find the 75th percentile:
$$i = \left(\frac{p}{100}\right) n = \left(\frac{75}{100}\right) 12 = 9$$
Note that 9 is an integer, so to find the 75th percentile we must
average together the 9th and 10th values:
Q3 = (2950 + 3050)/2 = 3000
100

Slide
Quartiles for the Starting Salary Data
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050
3130 3325
Q1 = 2865
Q2 = 2905 (Median)
Q3 = 3000
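
The index rule for percentiles can be written as a short Python function. This is a sketch of the rule as described in these slides, not of any library routine; statistical packages often use other interpolation rules and may return slightly different values. The function name `percentile` and the restriction to whole-number p between 1 and 99 are assumptions of this example.

```python
def percentile(values, p):
    """pth percentile by the index rule i = (p / 100) * n.

    p is assumed to be a whole number with 0 < p < 100.
    """
    data = sorted(values)               # Step 1: ascending order
    n = len(data)
    # Step 2: i = (p / 100) * n, kept in integer arithmetic so that
    # "is i an integer?" can be tested exactly.
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # Step 3(a): i is not an integer, so round up. Position
        # whole + 1 (counting from 1) is index whole (counting from 0).
        return data[whole]
    # Step 3(b): i is an integer, so average positions i and i + 1.
    return (data[whole - 1] + data[whole]) / 2


salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

q1 = percentile(salaries, 25)
q2 = percentile(salaries, 50)
q3 = percentile(salaries, 75)
p85 = percentile(salaries, 85)
print(q1, q2, q3, p85)
```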
101

Slide

Measures of Variability
Measures of Relative Location and Detecting
Outliers
Exploratory Data Analysis
Measures of Association
Between Two Variables

102

Slide
Measures of Variability
It is often desirable to consider measures of variability
(dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we
might consider not only the average delivery time for
each, but also the variability in delivery time for each.

Range
Inter-quartile Range
Variance
Standard Deviation
Coefficient of Variation

103

Slide
Measures of Variation
Variation
  Range
  Interquartile Range
  Variance: Population Variance, Sample Variance
  Standard Deviation: Population Standard Deviation, Sample Standard Deviation
  Coefficient of Variation
104

Slide
Measures of variation give information on the
spread or variability of the data values.

[Figure: distributions with the same center but different variation.]
105

Slide
Range
Simplest measure of variation
Difference between the largest and the smallest
observations:

$$\text{Range} = x_{\text{maximum}} - x_{\text{minimum}}$$

Example:
[Figure: data plotted on a 0 to 14 axis]
Range = 14 - 1 = 13
106

Slide
Range
Range = largest value - smallest value
Range = 615 - 425 = 190
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
107

Slide
Interquartile Range
The interquartile range of a data set is the difference
between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
108

Slide
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
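
A sketch of the range and interquartile range for the 70 apartment rents. The quartiles are found with the index rule i = (p/100)n; the helper `percentile` is written for this example and is not a library function.

```python
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def percentile(values, p):
    """pth percentile by the index rule; p is a whole number, 0 < p < 100."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)
    if remainder:                       # i not an integer: round up
        return data[whole]
    return (data[whole - 1] + data[whole]) / 2   # average positions i, i + 1


data_range = max(rents) - min(rents)    # largest value - smallest value
q1 = percentile(rents, 25)              # i = 17.5, so the 18th value
q3 = percentile(rents, 75)              # i = 52.5, so the 53rd value
iqr = q3 - q1                           # spread of the middle 50% of the data
print(data_range, q1, q3, iqr)
```

Unlike the range, the interquartile range is not affected by the few highest and lowest rents.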
109

Slide
Variance
The variance is a measure of variability that utilizes
all the data.
It is based on the difference between the value of
each observation ($x_i$) and the mean ($\bar{x}$ for a sample,
$\mu$ for a population).
110

Slide
Variance
The variance is the average of the squared differences
between each data value and the mean.
If the data set is a sample, the variance is denoted by
$s^2$.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

If the data set is a population, the variance is denoted
by $\sigma^2$.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
111

Slide
Variance for Grouped Data
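Sample Data
$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$

Population Data
$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$

where $f_i$ is the frequency and $M_i$ the midpoint of class i.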
112

Slide
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
The standard deviation of a data set is the positive
square root of the variance.
If the data set is a sample, the standard deviation is
denoted s.
$$s = \sqrt{s^2}$$

If the data set is a population, the standard deviation
is denoted $\sigma$ (sigma).
$$\sigma = \sqrt{\sigma^2}$$
113

Slide
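Calculation Example:
Sample Standard Deviation
Sample Data ($x_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{x}$ = 16
$$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$$
$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$$
The squared deviations are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130.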
114

Slide
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Is used to compare two or more sets of data measured
in different units
Population:
$$CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$$
Sample:
$$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$$
115

Slide
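Example: Apartment Rents

Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$$

Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$$

Coefficient of Variation
$$\frac{s}{\bar{x}} \times 100 = \frac{54.74}{490.80} \times 100 = 11.15$$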
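
These three figures can be reproduced from the 70 apartment rents with Python's standard `statistics` module, as in the sketch below. `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean_rent = statistics.mean(rents)
sample_variance = statistics.variance(rents)   # divides by n - 1
sample_sd = statistics.stdev(rents)            # square root of the sample variance
cv = sample_sd / mean_rent * 100               # standard deviation as a % of the mean

print(round(mean_rent, 2), round(sample_variance, 2),
      round(sample_sd, 2), round(cv, 2))
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by N instead.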
116

Slide
Measures of Relative Location
and Detecting Outliers
z-Scores
Detecting Outliers
117

Slide
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data
value $x_i$ is from the mean.
$$z_i = \frac{x_i - \bar{x}}{s}$$

A data value less than the sample mean will have a
z-score less than zero.
A data value greater than the sample mean will have
a z-score greater than zero.
A data value equal to the sample mean will have a
z-score of zero.
118

Slide
z-Score of Smallest Value (425)
$$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$$

Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
119

Slide
Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
It might be an incorrectly recorded data value.
It might be a data value that was incorrectly included
in the data set.
120

Slide
Detecting Outliers
The most extreme z-scores are -1.20 and 2.27.
Using |z| > 3 as the criterion for an outlier,
there are no outliers in this data set.

Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
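
A brief Python sketch of the standardization and the |z| > 3 screening rule, applied to the 70 apartment rents. The cutoff of 3 follows the criterion stated above; the variable names are illustrative.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean_rent = statistics.mean(rents)
sd_rent = statistics.stdev(rents)   # sample standard deviation

# z-score: number of standard deviations each value lies from the mean.
z_scores = [(x - mean_rent) / sd_rent for x in rents]

# Flag values more than 3 standard deviations from the mean.
outliers = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print(round(min(z_scores), 2), round(max(z_scores), 2), outliers)
```

A flagged value is only a candidate for review; it still has to be checked against the source records before it is corrected or removed.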
121

Slide
Exploratory Data Analysis
Five-Number Summary
122

Slide
Five-Number Summary
Smallest Value
First Quartile
Median
Third Quartile
Largest Value
123

Slide
Five-Number Summary
Lowest Value = 425 First Quartile = 445
Median = 475
Third Quartile = 525 Largest Value = 615
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
124

Slide
Measures of Association
between Two Variables
Covariance
Correlation Coefficient
125

Slide
Covariance
The covariance is a measure of the linear association
between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.

126

Slide
Covariance
If the data sets are samples, the covariance is denoted
by $s_{xy}$.
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

If the data sets are populations, the covariance is
denoted by $\sigma_{xy}$.
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
127

Slide
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
If the data sets are samples, the coefficient is $r_{xy}$.
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

If the data sets are populations, the coefficient is $\rho_{xy}$.
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
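
As an illustration, the sample covariance and correlation coefficient can be computed for the Panthers interceptions and points data. The slides do not state these values, so the numbers printed here come from this sketch alone; the function names are illustrative.

```python
import math


def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


interceptions = [1, 3, 2, 1, 3]
points_scored = [14, 24, 18, 17, 27]

s_xy = sample_covariance(interceptions, points_scored)
r_xy = sample_correlation(interceptions, points_scored)
print(s_xy, round(r_xy, 4))
```

A coefficient close to +1 agrees with the positive relationship seen in the scatter diagram. With only five observations it should be read as descriptive rather than conclusive.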
