
BUSINESS STATISTICS

AND APPLICATIONS
SELECTED TEXT MATERIAL FOR
ECO 3411 AT UCF
2010
AUTHORS:
DR. MARK D. SOSKIN
DR. BRADLEY BRAUN

ASSOCIATE PROFESSORS OF ECONOMICS
Not for resale or any other commercial use.
All intellectual materials in this text are the sole property of the authors.

TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION TO BUSINESS STATISTICS 6
1.1 Identify the Population and Select the Variables 8
1.2 Locate the Data 8
1.3 Analyze the Data 9
1.4 Generate a Statistical Report 10
1.5 Applying the Four Steps to Other Business Problems 11
1.6 Ethical Issues 12
CHAPTER 2 DISPLAYING DATA 13
2.1 Analyzing Data Displayed in Sorted Order 14
2.2 Analyzing Quantitative Data Graphically 18
2.3 Analyzing Bivariate and Multivariate Data Graphically 31
CHAPTER 3 SUMMARIZING DATA BY AVERAGE & VARIABILITY MEASURES 42
3.1 MEASURING THE AVERAGE 43
3.2 VARIABILITY AND RANGE MEASURES 59
3.3 MEAN VARIABILITY MEASURES 70
3.4 RANDOM SAMPLE INFERENCES ABOUT A POPULATION 77
CHAPTER 4: DESCRIBING HOW VARIABLES ARE RELATED 93
4.1 Summarizing a Variable Relationship by an Equation 94
4.2 Fitting Data to an Equation 114
4.3 Summarizing the Strength of Variable Relationships 131
4.4 Fitting Equations with More Than One Explanatory Variable 144
CHAPTER 5 COLLECTING AND GENERATING DATA 167
5.1 The Essential Role of Data in Business Decision Making 168

5.2 Collecting Data by Sampling from a Population 177
CHAPTER 6 PROBABILITY AND RANDOMNESS 190
6.1 Using Probability to Cope with Business Uncertainties 191
6.2 Probability Concepts for Understanding Statistical Inference 202
6.3 Random Variables and Expected Value 225
6.4 Principles of Decision Theory 240
CHAPTER 7 PROBABILITY DENSITY AND THE NORMAL DISTRIBUTION 258
7.1 Selecting a Distribution for Statistical Analysis 259
7.2 Families of Discrete Probability Distributions 262
7.3 Probability Concepts for Continuous Random Variables 275
7.4 The Normal Density Function 282
CHAPTER 8 INTRO TO SAMPLING DISTRIBUTIONS, ESTIMATION, TESTING 300
8.1 Introduction to Statistical Inference 301
8.2 Sampling Distributions 302
8.3 Estimation and Confidence Intervals 324
8.4 The Fundamentals of Hypothesis Testing 331
8.5 The Practice of Inferential Statistics and Common Pitfalls 348
CHAPTER 9 ONE- AND TWO-SAMPLE INFERENCES 357
9.1 One-Sample Estimation of the Mean 358
9.2 One-Sample Tests for the Mean 379
9.3 Inferences for Matched Pair Samples 393
CHAPTER 10 ANALYSIS OF VARIANCE USING EXPERIMENTAL DESIGN 397
10.2 Comparing Sample Means Under a Completely Randomized Design 406
10.3 Linear Modeling Under One-Way Analysis of Variance 412
10.4 The F Test for One-Way Analysis 417

10.5 The Analysis of Variance Table 422
10.6 Comparing Individual Treatment Means 427
10.7 Regression vs. Analysis of Variance: A Comparison 429
CHAPTER 11 REGRESSION MODELING,TESTING AND PREDICTION 437
11.1 Introduction to Regression Modeling and Statistical Inference 438
11.2 Testing and Estimation of Individual Explanatory Variables 463
11.3 Testing Regression Models for Significance 490
11.4 Making Predictions by Regression Inference 507
11.5 Forecasting Methods and Forecast Modeling 522
CHAPTER 13 NONPARAMETRIC METHODS 548
13.1 Parametric and Nonparametric Statistics 549
13.2 One-Sample and Matched Pair Tests 555






Chapter 1 Introduction to Business Statistics

Welcome to the exciting world of statistics. To understand the business world, we use
data, the factual information that describes the business world. Statistics is the field of study
dealing with the collection, analysis, interpretation, and presentation of data.
Statistics affects nearly every aspect of the modern business enterprise. Most successful
businesses have employees who know how to do the following:

- Ask the right questions and know what needs to be answered by statistical analysis.
- Efficiently design and manage a complete statistical study.
- Know what data to collect, how to collect it, and when to generate new data.
- Know what methods of analysis to use and how to analyze and interpret results.
- Write informative reports and make effective presentations to business groups.
- Act in a timely manner on statistical findings and follow up with further study.

How do individuals accomplish these tasks? By following four key steps to statistical decision
making.

Four Key Steps of Statistical Decision Making
1. Identify the relevant population and variables.
2. Locate the data.
3. Analyze the data.
4. Generate a statistical report.


We can see how these steps are used by examining a real-world example: the enormous
compensation awarded to corporate heads.
________________________________________________________________________
Case Study 1.1: CEO Multimillion-Dollar Compensation Packages
Multimillion-dollar compensation packages for top corporate executives are no longer
unusual. In fact, such packages are viewed as a suitable reward for those who provide huge
corporations with high earnings and shareholders with increasing stock value. By 1997, CEOs at
Coca-Cola, Intel, General Electric, and eight other companies each pocketed at least $40 million.
Total compensation reached $7.8 million for the average CEO at the largest U.S. corporations,
the S&P 500. These CEO compensations continued to rise at 35% annually, cresting at a record
326 times what their blue-collar employees earned. When Congress passed legislation that taxed
salaries and bonuses exceeding $1 million, corporations shifted compensation to stock-options to
avoid the tax. Boards of directors, often hand-picked by entrenched corporate management, rarely served as watchdogs over the financial interests of stockholders. Instead, corporate boards were forced into bidding wars that drove CEO earnings even higher.
When it comes time to choose a new CEO or renegotiate an old one's contract, boards are
going after the same small pool of experienced CEO candidates.... Few boards get
criticized for choosing an experienced, well-known CEO rather than taking a shot on a
promising but unproven No. 2 (Business Week, April 20, 1998, p. 68).
Suppose the board of directors at Tangerine Technologies, a large corporation in the
computer industry, is currently renegotiating the pay package of its CEO. The board is unsure as
to what a reasonable compensation offer is in the present business environment. Their current
CEO has done a good job, so why risk losing a known quantity for an uncertain replacement? On
the other hand, the board does not want to pay their CEO too much more than market value. They place a call to your department and request a statistical analysis of the compensations received
by CEOs of similar firms. You must provide a statistical report that will help them make an
informed and intelligent decision.
________________________________________________________________________

1.1 Identify the Population and Select the Variables
In our Case Study, your job is to analyze the compensations paid to CEOs by firms that
are similar to Tangerine. How close in size and product should a firm be to qualify as similar?
Although there may be more than one reasonable answer, your judgment must be logical and
consistent in answering this question if you want useful and persuasive results.
To answer this question, you must first identify the relevant population. The population
is the total collection of items (people, places, objects, etc.) under consideration. Clearly, the
items in this population are CEOs. But, which companies are relevant? Thinking this through,
you decide that you must select CEOs from companies of similar size and area of manufacturing.
You therefore choose the population of those S&P 500 corporations in the Office Equipment and
Computer industry.
Once the population has been targeted, you then must determine the appropriate
variables, or information needed to analyze the problem. Select variables that represent the most
important aspects of the decision-making problem. In our CEO case, you focus on a single
variable that combines all major sources of compensation: salary, bonuses, and long-term
earnings such as stock options. You also use a three-year average to smooth out CEOs' large stock-option claims in particular years. After you decide what it is you are measuring, you must
locate this compensation information.

1.2 Locate the Data
Where do you find relevant data on CEO compensation? If compensation figures were not
available, you might have to collect them yourself by surveying corporations in the industry.
Fortunately, you locate the data in a business publication. Business Week magazine publishes
CEO compensation figures, including the 24 CEOs of S&P 500 office equipment and computer
companies in your population. You copy the information and calculate the average total
compensations for 1995 through 1997. The compensation data are shown in Table 1.1.


Table 1.1
CEO Total Compensation for S&P 500
Office Equipment & Computer Companies (Annual
Average 1995-97 in Millions of $)
Company Name Compensation
America Online $ 19.1
Automatic Data Processing $ 2.1
Cadence Design Systems $ 23.2
Ceridian $ 6.5
Cisco Systems $ 14.2
Compaq Computers $ 16.1
Computer Associates Int'nal $ 12.4
Compuware $ 1.0
Digital Equipment $ 1.3
Electronic Data Systems $ 8.1
EMC $ 5.3
First Data $ 2.3
HBO & Co. $ 28.3
Hewlett-Packard $ 6.7
IBM $ 12.8
Microsoft $ 0.5
Oracle $ 12.0
Parametric Technology $ 12.6
Pitney Bowes $ 2.3
Sabre Group Holdings $ 1.3
Seagate Technology $ 8.2
Sterling Commerce $ 1.7
Sun Microsystems $ 9.6
3Com $ 10.3
Derived from data in Business Week, April 20, 1998, p. 100.

1.3 Analyze the Data
Now that you have located the relevant data, you must analyze it. You decide it would be
a good idea to calculate the average compensation, $9.1 million, the arithmetic mean of the data.
But you also notice considerable variation in the data. The CEO at HBO & Co. received the
most, $28.3 million, while billionaire Bill Gates (who already owns a large share of Microsoft)
trailed the pack at $0.5 million. Searching the data further, you count four CEOs earning over
$15 million, six with compensation between $10 and $15 million, six more between $5 and $10
million, and the remaining eight under $5 million. How can this data analysis help Tangerine
deal with its upcoming salary negotiations? Only if the analysis is translated into an
understandable statistical report.
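None of this arithmetic requires specialized software. As a rough illustration only, the short Python sketch below (with the Table 1.1 figures typed in by hand; the variable names are ours) reproduces the average and the counts just described:

A Rough Python Sketch: Averaging and Grouping the CEO Compensation Data

# three-year average total compensation, in millions of dollars (Table 1.1)
comp = [19.1, 2.1, 23.2, 6.5, 14.2, 16.1, 12.4, 1.0, 1.3, 8.1, 5.3, 2.3,
        28.3, 6.7, 12.8, 0.5, 12.0, 12.6, 2.3, 1.3, 8.2, 1.7, 9.6, 10.3]

mean = sum(comp) / len(comp)                       # arithmetic mean, about 9.1
over_15 = sum(1 for c in comp if c > 15)           # CEOs earning over $15 million
ten_to_15 = sum(1 for c in comp if 10 <= c <= 15)  # between $10 and $15 million
five_to_10 = sum(1 for c in comp if 5 <= c < 10)   # between $5 and $10 million
under_5 = sum(1 for c in comp if c < 5)            # under $5 million

print(round(mean, 1), over_15, ten_to_15, five_to_10, under_5)   # 9.1 4 6 6 8

A spreadsheet's AVERAGE and COUNTIF functions would do the same job; the point is only that the figures reported to the board can be checked mechanically.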


1.4 Generate a Statistical Report
Your final task is to summarize the findings in a clear and concise report. This report
must provide the board at Tangerine with the information it needs to decide how much to offer
their current CEO. The relevant job-market variable, CEO compensation in the computer and electronics industry, is the central piece of information around which the decision will be made.
Your data analysis can shed light on this decision. Decision makers tend to distrust what they do not understand. At this stage, a responsible statistician should report what the data analysis means to them. A readable report and an effective presentation are the essence of statistical communication. Your report might read as follows:
To retain our CEO, the board at Tangerine needs to design a competitive compensation package.
In the current industry environment, CEOs of similar-sized companies average about $9 million
annually in salaries, bonuses, and long-term compensation such as stock options. As you can
see from Figure 1.1, only a few CEOs in our industry are earning over $15 million per year. In
fact, one-third of the CEOs heading large office equipment and computer companies receive less
than $5 million. Therefore an offer of more than $15 million should not be necessary to keep our CEO, and $10 million would still exceed the current average compensation package.
Notice how this report translates the numbers and terminology in your data analysis into a clear
presentation of the results relevant to the specific salary negotiations facing the board. A graph is
used to convey information visually that would be difficult to explain in words. Finally, the
report offers a range of actions but does not attempt to make the decision. The ultimate decision
rests with the board itself.
You have reported the useful information from your analysis, but the board must combine
this information with other relevant factors to arrive at a decision. These factors may include the
attractiveness of the position, pressure from stockholders, and the performance record of the
CEO. For example, if the CEO job at Tangerine is viewed as a desirable post, the current CEO
may be persuaded to accept lower compensation. These and other factors will help determine
how much above or below the industry average the board of directors will ultimately offer.

Figure 1.1 CEO Total Compensation for S&P 500 Office Equipment & Computer Companies [histogram: Number of CEOs by total compensation, in millions of dollars, 1995-97 average]




1.5 Applying the Four Steps to Other Business Problems
All statistical problem solving requires these same four steps: identify the relevant
population and variables, locate the data, analyze the data, and generate a statistical report. In
other business cases, we will choose variables that contain important information on stock prices,
interest rates, product quality, sales volume, or debt. The CEO compensation case relied on a few
of the many analytical and graphical tools available to the business statistician. Chapter 2
introduces several ways to display data to give us visual insights about business data patterns.
Chapters 3 and 4 cover methods that summarize data and variable relationships quantitatively.
Performing each of the four steps is not always easy. You may have trouble deciding
which objects are similar enough to belong in the population or which variables to choose for
the problem. If data are not readily available, you may have to collect it yourself or be content to
analyze only a sample of the population data. No single statistical method is appropriate for all
business problems. You must assess the strengths and limitations of each competing method
before selecting the proper statistical method. Finally, you must correctly interpret the results of
your analysis and understand the information needs of the decision makers in order to generate a
useful report.


1.6 Ethical Issues
Sometimes, decision makers and even their statisticians are convinced that they know the answers before any data have been analyzed. Prejudices may influence our choice of population,
variables, data gathering technique, or statistical method. Prejudices arise from pre-judging a
situation before we examine the facts. Suppose you are prejudiced against the Tangerine CEO
and do not want her rehired. By analyzing data that only includes salaries and bonuses, you
calculate an average, $1.8 million, that seriously understates total compensation in the industry.
You report that an offer of $3 or $4 million would exceed the market average. However, total
compensation, $8.1 million, averages twice that recommended amount. Your choice of variable
has slanted the analysis and your report, thus increasing the risk that the CEO will reject the
board's offer.
Unfortunately, it is easier to support conventional wisdom than to defend an unpopular
finding. In statistics, however, ends seldom justify the means. It is unethical to withhold
information or distort results to obtain a preferred conclusion. Businesses only hurt themselves
when they ignore the truth or try to explain it away. Rather than shoot the messenger, effective
decision makers need to be open minded. A responsible statistician should convince decision
makers that the world is not always the way they expect it to be. Business analysts need to argue
clearly and convincingly the merits and usefulness of their statistical analysis.





CHAPTER 2 DISPLAYING DATA

Approach: Everyday experience with tally sheets, bar graphs, and pie graphs is used to
introduce diagrams that compactly display data. Diagrams may also be used to portray
variables that are not measured quantitatively. Valuable insights may be gained from
examining graphs that plot variables against one another.

Where We Are Going: Frequency tables will be used to introduce probability distributions.
The most important of these distributions may be described by using the summary
measures introduced in the next chapter. Plotting, time series graphs, and two-variable
relationships will be extended to regression modeling and quality control charts in later
chapters.
New Concepts and Terms:
sorted data
frequency distributions and histograms
mode, class intervals, cutpoints, and modal class
symmetric and skewed, unimodal and bimodal distributions
time series graphs, trend lines, and trend rates
two-dimensional scatterplots

SECTION 2.1 Analyzing Data Displayed in Sorted Order
SECTION 2.2 Analyzing Quantitative Data Graphically
SECTION 2.3 Analyzing Bivariate and Multivariate Data Graphically

2.1 Analyzing Data Displayed in Sorted Order
Business data is usually essential to making informed decisions. A plant manager needs
production data to stay on top of events on the plant floor. Personnel officers cannot function
effectively without information on current employment and the job applicant pool. Economic
indicators and market data are crucial tools for market analysts and business policy planners.
Without the appropriate data, the business world would be operating in the dark.
But data can be very messy to work with, especially if there is a lot of it. Hunting for
information buried within business data can be a time-consuming and frustrating task. To convince you, let's reexamine the 38 CEO compensations from Chapter 1. Figure 2.1 contains
instructions for displaying a column of data in Minitab. Suppose you obtained the data listing
displayed in Figure 2.2, and you needed to quickly identify the five largest CEO compensations
(measured in thousands of dollars annually):
Data Display

CEO comp
1948 964 618 1338 3565 1814 914 6219 541 981 995
711 1054 1846 5101 3144 2096 1225 1446 2987 515 487
1135 692 762 1225 1217 2965 1179 1336 1572 456 683
621 993 1000 684 1634
Figure 2.2

How long did it take you? Did you locate all five correctly (6219, 5101, 3565, 3144, and 2987), or did you miss one or two? Imagine how much more time it would take (and the errors you
might make) if you had to search through 380, instead of 38, executive compensation figures!

Sorting Data
But what if the compensation data had been listed from smallest to largest values? Then
it would have been an easy job to identify the five highest. Why weren't the data sorted this way in the first place? The answer is that business data seldom have a single "correct" order. Instead, the way numbers need to be arranged depends upon the task at hand. In the CEO data from Table 1.1 of the previous chapter, the compensation figures were listed in alphabetical
order by the company name, from Advanced Micro Development to Whirlpool. For other
purposes, we might need the data sorted by last names or social security numbers of the CEO, by
years of service with the company or years in the CEO position, by corporate sales or market
shares, or alphabetically by product name.

Data may also be sorted by size or by location, chronologically, or according to any other
meaningful criteria. An inventory control example illustrates how useful it can be to sort data in
a particular order. Suppose a small hardware store is monitoring month-end inventory levels for
twelve types of plumbing supply parts. More than 300 of any type of plumbing part wastes
valuable shelf space and ties up funds that could be better used elsewhere. However, if stocks
fall below 150, the store risks running short and angering regular customers. The usual practice
is to record information on inventory arranged by catalog number. The inventory stock sheet for
each of the twelve parts looks like the following:
Current Inventory Sheet:
143 for part A, 279 for part B, 533 for part C, 149 for part D, 1213 for part E,
135 for part F, 187 for part G, 165 for part H, 132 for part I, 144 for part J,
453 for part K, 147 for part L
This alphabetical listing by part (from letter A to L) is very helpful if the store manager wants to
check on the inventory of a particular part. For purposes of inventory control, however, the store
manager should also sort the data by inventory size for each part:
Sorted Listing of Inventory Amounts:
132 (part I), 135 (part F), 143 (part A), 144 (part J), 147 (part L), 149 (part D),
165 (part H), 187 (part G), 279 (part B), 453 (part K), 533 (part C), 1213 (part E)
It is now much clearer that a sizable portion of current inventory (1213 items) consists of one
type of plumbing part (part E). Two other parts also have stocks in excess of 300 (533 of part C
and 453 of K). Although the store manager may notice this problem from the alphabetical sort,
the second list is much better designed for monitoring the inventory distribution. With the
thousands of parts carried by hardware retailers today, it would be a difficult task indeed to
detect inventory problems unless the stock sheet is sorted by amount of inventory for each part.

Sorting data in numerical order makes it easy to correctly identify the observations with
the largest and smallest values.
But sorting data sets manually can be as time consuming as adding sums of figures by
hand. An ordinary pocket calculator is no help either. It can only do arithmetic computations.
Here is where computers come to our rescue.
Computers are commonly viewed as "number crunchers" that perform arithmetic
operations on financial and scientific data. At least as important is the sorting time and expense
that computers save businesses and government. Lightning-fast computers can now compile
enormous mailing lists alphabetically or according to zip codes, arrange detailed equipment
maintenance records in chronological order, sort personnel files by social security number or job
title and department, and rank competitors by market share.

Modern computer software is credited (or blamed) for making it easy to generate
mountains of statistical output from virtually any data set. An essential requirement for turning
out much of this output is a computer's ability to organize data so we can present information in new and exciting ways. Since the dawn of the computer age, innovative methods have improved our ability to visually display quantitative information.[1] Most of these methods, including the
ones in this chapter, rely on sorted data. Displaying a sorted data listing in Minitab or Excel is
easy to do (see instructions and sample computer screens in Figures 2.3 and 2.5).
Using Excel to Sort Spreadsheet Column(s)
(1) highlight the column(s) to be sorted
(2) press the Sort Ascending (A to Z) button on the Excel toolbar
Figure 2.3

Let's inspect the Minitab listing of sorted CEO data shown in Figure 2.7.
Using Minitab to Sort a Column of Data and Include it in the Worksheet
Pull-Down Menu sequence: Manip Sort...
Complete the Sort DIALOG BOX as follows:
(1) double click mouse on variable to be sorted
(2) click mouse inside box labeled Store Sorted Column(s) in: and type in a name of eight or
fewer characters for the new sorted column of data
(3) click on OK button, and the new sorted data column will appear in worksheet
Figure 2.5
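Readers working outside Minitab or Excel can sort just as easily in a general-purpose language. The following rough Python sketch (the list name is ours) applies the built-in sorted() function to the compensation figures from Figure 2.2:

A Rough Python Sketch: Sorting the CEO Compensation Data

# CEO compensation in thousands of dollars, in the original unsorted order
ceo_comp = [1948, 964, 618, 1338, 3565, 1814, 914, 6219, 541, 981, 995,
            711, 1054, 1846, 5101, 3144, 2096, 1225, 1446, 2987, 515, 487,
            1135, 692, 762, 1225, 1217, 2965, 1179, 1336, 1572, 456, 683,
            621, 993, 1000, 684, 1634]

ranked = sorted(ceo_comp)    # ascending order, smallest value first
print(ranked[:2])            # the two lowest paid: [456, 487]
print(ranked[-5:])           # the five highest: [2987, 3144, 3565, 5101, 6219]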

Data Display

sort CEO
456 487 515 541 618 621 683 684 692 711 762
914 964 981 993 995 1000 1054 1135 1179 1217 1225
1225 1336 1338 1446 1572 1634 1814 1846 1948 2096 2965
2987 3144 3565 5101 6219
Figure 2.7
These are the same compensation figures as those listed earlier. By sorting them
numerically, however, CEOs can be easily ranked from lowest paid, $456,000, to highest paid,
$6.219 million. It is also easy to identify the six CEOs with earnings of about three million or
more (highlighted in bold print). It is equally clear from the listing that only two CEOs earned less than half a million dollars. Reporting these high- and low-end CEO compensation figures to the Board of Directors at Tangerine should furnish them with useful information in preparing a contract renewal offer to their own CEO.

[1] The book with that same title (by Edward R. Tufte, Cheshire, Conn.: Graphics Press, 1983) has been very influential in encouraging the visual approach to data analysis.

Figure 2.11
Boxes   Total
  1       8
  2      25
  3      12
  4       6
  5       2
  6       1
  7       0
  8       1
2.2 Analyzing Quantitative Data Graphically
Tallying up numbers is as old as the hills. Suppose your club is raising money for a trip by selling boxes of greeting cards. At the end of the week you get 55 people to order cards, but some order one box while others order as many as eight boxes. To tabulate how many customers ordered one box, two boxes, three boxes, and so forth, you would probably make a tally sheet like the one shown in Figure 2.11. You would place vertical pencil marks representing the number of boxes each customer bought. Then you would write down the total number of marks on each line.
We call these totals on the right side of your tally sheet the frequencies, because they tell
us how "frequently" each value occurred. A table of these frequencies is known as a frequency
distribution.
DEFINITION: A frequency distribution is a table that reports the number of times each
numerical value or grouping of values occurs in quantitative data.
What can be learned about the pattern of greeting card sales by examining the frequency
distribution? Notice that nearly all orders were for four or fewer boxes and nobody ordered more
than eight boxes. It is also easy to spot the most common order size, two boxes. Two boxes
were ordered by 25 customers, more than twice the frequency of any other order size. We call
this the mode.
DEFINITION: The mode is the most frequently occurring value in the data.
Statisticians commonly examine frequency distributions to help them analyze the
statistical properties of data. Did you know that the tally sheets you've been preparing for years are useful tools for doing statistical analysis? Let's see how they are used in business.
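The same tally can be produced in a few lines of code. A rough Python sketch (the order quantities below are the greeting-card totals from the tally sheet in Figure 2.11):

A Rough Python Sketch: Building a Frequency Distribution and Finding the Mode

from collections import Counter

# one entry per customer: the number of boxes each of the 55 customers ordered
orders = [1]*8 + [2]*25 + [3]*12 + [4]*6 + [5]*2 + [6]*1 + [8]*1

freq = Counter(orders)            # the frequency distribution: value -> count
for boxes in sorted(freq):
    print(boxes, freq[boxes])     # reproduces the tally-sheet totals

mode, count = freq.most_common(1)[0]
print("mode:", mode, "boxes, ordered by", count, "customers")   # mode: 2 boxes, 25 customers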

Imagine that five years from now you find yourself in a staff meeting listening to a
presentation on fourth quarter regional sales of your company's copier machines. As you watch, you get the funny feeling that you've seen the bar graph on the screen (Figure 2.12)
somewhere before. It contains the same information as your old greeting card tally sheet, doesn't
it? Eight customers ordered one copier, 25 ordered two copiers, 12 ordered three, and so forth.
The only differences are that you are selling copiers instead of greeting cards and the
presentation uses a bar graph. The pencil marks have given way to bars, attractively generated
by an Excel spreadsheet. This type of bar graph is called a histogram.[2]

DEFINITION: A histogram is a bar graph display of a frequency distribution with the
frequencies represented by bar heights.


[2] Warning! Not all bar graphs you see in newspapers and business reports measure frequencies. For example, consider a bar graph that compares teacher salaries in various parts of the country. The height of each bar measures the teacher salary, not the number of teachers at each salary level. This is a bar graph, but it is not a histogram because the bar heights do not indicate frequencies.
Figure 2.12 Fourth Quarter Copier Sales [histogram: Number of Clients by Order Size, 1 through 8 copiers]

Chapter Case #1: Nothing is Certain but Death and Taxes
Early each year, a Daytona Beach CPA firm approaches the tax crunch period preceding
the April 15th tax filing deadline. Some clients require ten or more IRS forms to file their tax
return, while other clients need only one or two forms. More forms usually result in more time
spent with the client, more receipts, and more computation time. To better prepare its staff and
resources, the 53 clients' tax files were collected from the peak February and March period of the previous year. The number of tax forms is recorded in a Minitab file, and the sorted data are
displayed in Figure 2.13.
Data Display

nforms
1 2 2 2 2 2 3 3 3 3 3 3 4
4 4 4 4 4 4 4 5 5 5 5 5 5
5 6 6 7 7 7 7 8 8 8 9 9 9
9 9 10 10 10 11 14
Figure 2.13

A histogram (Figure 2.14) would better display any patterns in the number of tax forms clients
filed. This histogram was created using the Minitab instructions given in Figure 2.15.
The histogram indicates that the mode is four because eight clients needed four forms to
file their taxes. However, this time there are several close runners-up. The histogram suggests that the CPA firm's clients fall into two groups: clients needing 2 to 5 forms and a smaller group
needing at least 7 forms. The CPA firm determines that this smaller group consists primarily of
self-employed clients who typically must file several additional forms. Based on this
information, the CPA firm adopted a policy of customer screening. Self-employed clients
were rescheduled for individual attention from small business staff specialists. All other clients
were targeted for "express lane" service. Thus, histograms helped this CPA firm improve
efficiency and customer satisfaction.


Using Minitab to Graph Histograms
Pull-Down Menu sequence: Graph Histogram...
Then complete the Histogram DIALOG BOX as follows:
(1) double click names on the left to select variable(s)
(2) click OK at the bottom to obtain a computer-generated histogram
For integer data, steps (2*) through (5) instead of (2) result in a separate bar for each value:
(2*) click Options... button at the bottom to open Histogram Options DIALOG BOX
Complete the Histogram Options DIALOG BOX as follows:
(3) at the bottom of Definition of Intervals, click to highlight Midpoint/Cutpoint
(4) in Midpoint/Cutpoint box, type in the smallest and largest data values: S:L
where S is the smallest number in the data, L is the largest, separated by a colon (:)
(5) click on OK to return to Histogram dialog box, then click OK to obtain graph
Figure 2.15

Class Intervals and the Modal Class
So far, we have examined discrete data where each value could be easily represented by
a separate histogram bar. However, histograms show their greatest advantage when many distinct values are grouped into classes so that the data can be displayed compactly.

Figure 2.14 [Minitab histogram of nforms: Frequency of the number of tax forms filed, with a separate bar for each value from 1 through 14]


Chapter Case #2: Please Return Tray Tables to Their Upright Position
An Orlando-based catering company provides onboard meals to domestic airlines. The
new personnel manager notices that she hasn't seen many younger employees at the company.
To check this out, she first examines the Minitab sorted listing of age for the 76 employees
displayed in Figure 2.17.
Data Display

AgeCater
23 26 26 27 28 28 29 29 29 30 30 31 31
31 31 32 33 34 34 34 34 34 35 35 35 35
36 36 36 36 37 37 37 37 38 38 38 38 39
40 40 41 41 41 41 42 42 42 43 44 44 45
45 46 47 47 48 48 49 50 50 50 51 51 54
54 55 55 56 57 62 64 64 67 70 72
Figure 2.17

But a histogram of separate bars for each year from 23 to 72 would furnish too much detail (see
Figure 2.18). The 50 bars and low frequencies unduly complicate the histogram, making it difficult to observe broad age patterns. For example, does it matter that only one employee each is aged 32, 33, and 39, but four or more workers are 31 or 34 through 38 years of age? This may
be a case of being too close to the trees to see the forest. By grouping ages into classes, there are
fewer histogram bars to examine.

Figures 2.18 and 2.19 [Two Minitab histograms of AgeCater: Figure 2.18 with a separate bar for each age from 23 to 72; Figure 2.19 with 10-year age classes from 20 through 79]


DEFINITION: A class is an interval of values for a frequency distribution and the corresponding
histogram bar. The class width is the width of a class, and cutpoints are the beginning and
ending values of each class.
The catering employee ages may be grouped, for example, into the following six classes:
20 to 29, 30 to 39, and so on up to 70 to 79 years of age. The class widths are each ten-year
intervals.[3] This histogram is shown in Figure 2.19.
Using Minitab to Specify Histogram Cutpoints for Constant Interval Width
Pull-Down Menu sequence: Graph Histogram...
Then complete the Histogram DIALOG BOX as follows:
(1) double click names on the left to select variable(s), then click OK button
(2) click Options... button at the bottom to open the Histogram Options DIALOG BOX
Then complete the Histogram Options DIALOG BOX as follows:
(3) Under Type of Intervals, highlight Cutpoint (instead of Midpoint)
(4) at the bottom of Definition of Intervals, highlight Midpoint/Cutpoint
(5) in the Midpoint/Cutpoint box, type in the cutpoint and class width information:
S:L/W
where S is the first cutpoint, L the last cutpoint, and W is the class width for all classes
(6) click on OK to return to Histogram dialog box, then click OK to obtain graph
Figure 2.20
A similar looking histogram can be produced using Excel if cutpoints are also typed into the
spreadsheet (see instructions in Figure 2.20). Figure 2.21 contains an Excel-generated histogram
after making cosmetic adjustments to labels, dimensions, shading, and interval widths.[4]

Using Excel to Obtain a Histogram for a Spreadsheet Column of Data
First, type into a spreadsheet column a list of upper cutpoint values for each histogram class
Pull Down Menu Sequence: Tools Data Analysis...
(1) Select Histograms from Analysis Tools list in Data Analysis box, and click OK
Then complete the Histogram DIALOG BOX as follows:


(2) Click to place an x next to Labels and Chart Output
(3) Click inside box following Input Range and type in the cell range containing the variable label and data to be graphed in the histogram (or drag mouse through spreadsheet data, and this cell range will automatically appear in the box)
(4) Click inside box following Bin Range and type in the cell range containing the spreadsheet column of cutpoints (or drag mouse through cutpoint column)
(5) Click OK to produce an Excel histogram graph.
(6) Re-label classes and make bars adjacent on graph (see footnote 4 for explanation)

[3] Technically, these intervals are 20 up to but not including 30, 30 up to but not including 40, and so forth, because histogram classes must border one another even for discrete data such as this. Therefore, the cutpoints are 20, 30, 40, 50, 60, 70, and 80 for this histogram.

[4] Excel generates histograms that are confusing in two major respects. First, its histogram bars are not adjacent as they should be. This can be corrected by changing the gap width to 0%. In addition, Excel frequency distribution and histogram classes are labeled by their upper cutpoints. To avoid confusion, the Bin column in the frequency output table may be edited to include the lower cutpoint for each interval.

Notice how compact the display becomes with 10-year age classes. This graph better indicates
that 30 employees are in the 30 to 39 age group, ten more than in any other age group. This is
the modal class.
DEFINITION: The modal class is that class containing the largest number of observations in a
histogram.
The modal interval locates an interval of data occurring most frequently. In this example, nearly
two-fifths of the catering company's employees (30 out of 76) were in the modal class. By contrast, the
mode reports rare (and perhaps coincidental) repetitions of particular values. Only five of the 76
employees (under 7 percent) had the most commonly occurring age, 34.
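Finding a modal class by computer amounts to tallying classes instead of individual values. A rough Python sketch (using the catering company ages from Figure 2.17 and ten-year classes):

A Rough Python Sketch: Grouping Ages into Classes and Finding the Modal Class

from collections import Counter

ages = [23, 26, 26, 27, 28, 28, 29, 29, 29, 30, 30, 31, 31,
        31, 31, 32, 33, 34, 34, 34, 34, 34, 35, 35, 35, 35,
        36, 36, 36, 36, 37, 37, 37, 37, 38, 38, 38, 38, 39,
        40, 40, 41, 41, 41, 41, 42, 42, 42, 43, 44, 44, 45,
        45, 46, 47, 47, 48, 48, 49, 50, 50, 50, 51, 51, 54,
        54, 55, 55, 56, 57, 62, 64, 64, 67, 70, 72]

width = 10
classes = Counter((age // width) * width for age in ages)   # lower cutpoint of each class

for cutpoint in sorted(classes):
    print(cutpoint, "to", cutpoint + width - 1, ":", classes[cutpoint], "employees")

modal_cutpoint, modal_count = classes.most_common(1)[0]
print("modal class:", modal_cutpoint, "to", modal_cutpoint + width - 1)   # 30 to 39, with 30 employees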
Figure 2.21 [Excel histogram of Employee Age: Number of Employees in classes 20-29 through 70-79; the 30-39 class is tallest, with 30 employees]


For continuous data and for discrete data with many different values, the modal class is
nearly always more informative than the mode.
The histogram also finds substantial proportions of workers in the 40 to 49 and 50 and
over age groups, but only nine employees under 30. Armed with this information, the personnel
manager investigated why so few younger employees are on the payroll. She soon discovered
that occasional slumps in the airline catering business had resulted in layoffs of the least senior
employees. Over the years, newly-trained workers were let go, leaving behind an aging
workforce. A compact histogram display of age classes allowed this company to recognize its
personnel problems and take action to insulate some of its younger workers from short-term
business fluctuations.
Symmetric and Skewed Distributions
Often it is possible to observe identifiable patterns in histograms. One common type of
frequency distribution is called symmetric because of the symmetry of its histogram.
DEFINITION: A frequency distribution is symmetric if the left half of the histogram is the
mirror image of the right half.
Histograms of symmetric data distributions look like the snowflakes you used to cut from folded
up paper. When you unfold the paper, the patterns you cut would be repeated on both sides of
the fold. For example, consider monthly passenger arrivals at Orlando International Airport
graphed in the Minitab histogram in Figure 2.22. Notice how the bar heights of the class intervals
drop off in a similar fashion as we move away from 850 in either direction. Therefore, the
histogram is nearly symmetric around the 850 cutpoint.
The passenger arrival data also is an example of unimodal data.
DEFINITION: A frequency distribution is unimodal if the histogram has one mode or modal
class and is bimodal if the histogram bars group around two separate modes.

By contrast, the tax forms histogram pictured earlier (Figure 2.14) was bimodal. It had two
modes, one at four tax forms and a second at nine. Bimodal distributions may also be symmetric.
The tax form histogram was almost symmetric because the frequency patterns around the first
mode resembled the pattern around the other mode at nine.
Most distributions encountered in business data are unimodal but usually not symmetric.
Many unimodal histograms have a series of bars that tail off on one side of the graph. A
histogram with one of these tails is said to be skewed.

DEFINITION: A unimodal distribution is skewed right if its histogram has a long tail to the
right of its modal class and skewed left if its tail lies only to the left.

Most skewed distributions in business are skewed right. For example, the catering company
employee age histogram in Figure 2.19 gradually tailed off for higher age groups.
Figure 2.22 Distribution of Monthly Passenger Arrivals: Jan. 1989 - July 1994 [Minitab histogram of AirTraf: Frequency by class, with cutpoints from 550 to 1150, nearly symmetric around 850]

Chapter Case #3: It All Comes Out in the Wash
As a final example of the construction and advantages of histograms, we examine a large
data set with over two thousand possible values. The owner of a self-service laundromat in
Longwood, Florida wanted to know if he had the proper number of machines. With too many
machines, he'll be paying for them to stand idle. Too few machines and revenue will be lost.
Long waits for an idle machine also will reduce customer satisfaction and hamper future
business. The owner has in fact received a few complaints about lengthy waits, but he is not sure
whether these occur rarely or are a continual problem.


Suppose the owner knows that daily revenue is about $600 when every machine in the laundry is operating for most of the day. Daily washer and dryer revenue data are collected over the entire
course of the year, 358 observations (because the laundry was closed seven days for holidays).
Obviously, these are far too many observations to analyze a data listing. Histograms enjoy their
greatest advantage for data sets of this type by providing a compact, concise picture.
Figure 2.23b [Histogram of Mach Rev: Frequency of daily laundromat revenue in $50-wide classes from $100 to $600; skewed right]

The owner's records show that daily revenue never reached $600 or fell below $100. He
therefore constructs a histogram with class widths of $50 and cutpoints from $100 to $600. The
result is the skewed-right histogram in Figure 2.23b with ten classes for the frequency
distribution of laundromat daily revenues.
Histograms are best for larger data sets containing many data values because little
important information is lost and analysis is clearer by grouping data into classes.
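With the data in machine-readable form, tabulating such a frequency distribution takes only a few lines of code. The rough Python sketch below uses numpy.histogram with the owner's cutpoints; the short revenue list here is made up purely for illustration, standing in for his 358 actual daily figures:

A Rough Python Sketch: Tabulating Frequencies with Chosen Cutpoints

import numpy as np

# hypothetical daily washer and dryer revenue, in dollars
revenue = [142, 236, 188, 305, 251, 199, 410, 167, 223, 274, 312, 138, 590]

cutpoints = np.arange(100, 601, 50)               # $100, $150, ..., $600
counts, edges = np.histogram(revenue, bins=cutpoints)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(int(low), "up to", int(high), ":", int(n), "days")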

Multiple Choice Review:
2.12 Which of the following methods of displaying data would be best for a data set with
450 observations?
a. a sorted listing
b. a listing of descriptive summary statistics
c. a histogram
d. all of the above are equally informative
e. it depends on the analytical problem at hand

2.13 A histogram presents quantitative data as
a. a table of frequencies
b. a pie chart of relative frequencies
c. bar graph of class frequencies
d. a sorted listing
e. a listing of descriptive statistics

2.14 The modal class is
a. the class with the largest observations
b. the class with the most observations
c. the class with the widest interval width
d. the most frequently occurring observation
e. the class containing the mode


CASE MINI-PROJECT:
A university will move up to Division I-A football next season in hopes of attaining the stature,
recognition, and alumni support accorded other universities its size. Home attendance (in
thousands per game) for 92 current I-A universities is presented in the following sorted data listing
and summary statistics:

Data Display

AttendAv
4.7 9.3 10.4 11.6 12.4 12.6 12.7 14.0 15.9
15.9 16.0 16.8 18.0 18.1 18.8 19.0 20.0 22.0
22.0 22.5 23.3 23.5 23.8 24.0 24.4 25.1 25.3
25.8 25.9 26.0 26.8 26.9 28.9 30.3 30.7 31.0
31.0 32.2 32.7 33.1 33.2 33.2 33.8 34.4 35.1
35.6 37.3 37.4 37.4 37.6 38.9 39.0 39.4 39.5
40.0 40.5 40.8 40.9 41.4 43.4 43.4 46.9 47.9
48.1 48.7 50.8 51.0 51.6 51.9 52.1 57.8 59.0
59.1 59.6 60.3 61.5 62.1 62.4 65.2 66.8 67.5
68.5 74.0 75.6 75.7 78.1 81.1 84.5 92.2 94.0
95.3 105.7

1. By examining the histogram on the right, complete the following sentence about the modal class: More than 40% of universities, a total of ____ schools, had between ____ and ____ thousand in attendance at their I-A football games.
2. The mode from the sorted data listing is ____, which lies (inside / outside) [circle one] the modal class. The modal class is a superior measure because there are ____ universities in the modal class but only ____ universities are at the mode.
Answers:
1. 38, 20, 40
2. modes: 15.9, 22.0, 31.0, 33.2, 37.4, and 43.4 (each occurring twice); all but 15.9 and 43.4 lie inside the modal class; 38 universities are in the modal class but only 2 are at any one mode.
[Mini-project histogram of AttendAv: Frequency of average home attendance in classes of width 20 thousand; bar heights 16, 38, 20, 12, 5, and 1 for the 0-20 through 100-120 classes]

2.3 Analyzing Bivariate and Multivariate Data Graphically
So far, we have learned how to display data in ways that helped us analyze business
problems. But sorted data and histograms rely on univariate analysis and ignore the role that
time can play in the business world.
Chapter 1 introduced us to other types of data (time series, bivariate, and multivariate data) which can also be displayed graphically. We discussed the unique position of time series
data in business and economics. We also learned that many business decisions rely on bivariate
and multivariate data to better understand the relationship among variables. In this section, we
will learn how graphical display helps us unlock some of the secrets buried in time series,
bivariate, and multivariate data.

Displaying Time Series Data
In the previous chapter, we saw an example of time series data and discussed how it
differs from cross-section data. To show how these differences affect the way we display and
analyze time series data, we look back at the landmark election campaign of 1992.

Chapter Case #5: "That Dog Won't Hunt"
The 1992 presidential race was a different sort of campaign. There were three major
candidates instead of the usual two, the TV networks lost influence to cable news, and issues
took center stage over image and fluff. Out of nowhere appeared Ross Perot, a billionaire Texan
buying half-hour TV ads interspersed with dozens of graphs.
Like most successful business executives today, Perot relied heavily on the tools of data
analysis to locate problems and suggest solutions. Although he is known for his folksy sayings,
Perot borrowed a method from modern business presentations to hone his message. He used
graphs and diagrams to present data visually and compactly. Business schools instruct students
that tables are too complex and hard to read to hold an audience's attention and communicate
important points.
At the time, press and political consultants firmly believed that people are too ignorant to understand statistics. The conventional wisdom also held that the public considers graphs and numbers boring.[5] Instead, campaign commercials of previous elections emphasized suffering
steel worker testimonials or feel good "Morning in America" images. Yet statisticians have long
known that most high-school graduates readily comprehend and appreciate many types of tables
and graphs. Statistics can be a powerful tool to take us beyond the nebulous images and
anecdotal evidence. Time series graphs can be an especially compact way to display broad,
measurable patterns and trends.
A central election issue for Mr. Perot was the failure of politicians to responsibly address major problems such as the skyrocketing national debt and the high trade deficit. Perot's effective
use of graphics in the campaign continually forced the political debate back to these issues.
Following the election, deficit, waste, and spending were more seriously addressed. As they
present their argument before Congress, members of both major parties usually display graphs
and charts.
In 1992, many incumbents were defeated by challengers running against business as
usual. Suppose a campaign adviser to one of these challengers had access to Table 2.2, data on
annual Federal expenditures for each of the previous 15 years (since 1992 data would not have
been available yet).
How could the economic adviser use this information to present the candidate with
meaningful analysis? What if he only relied on display methods introduced so far in the text?
Based on the histogram of 1977-1991 government spending shown in Figure 2.39, the adviser would have reported that government spending during the 15-year period was approximately evenly distributed from $400 to $1400 billion. Although this report is truthful, the adviser failed to find the most important information in the Table 2.2 data. See if you can spot his mistake.

Figure 2.39 [Histogram of Gov Exp: annual government spending, spread roughly evenly from $400 to $1,400 billion]
Table 2.2
U.S. Government Expenditure (Billions of Dollars) During the Carter, Reagan, and Bush Years

Year  Spending    Year  Spending    Year  Spending
1977    426.4     1982    770.9     1987   1065.6
1978    469.3     1983    840.0     1988   1109.0
1979    520.3     1984    892.7     1989   1179.4
1980    613.1     1985    969.9     1990   1270.1
1981    697.8     1986   1028.2     1991   1321.7

[5] The unexpected success of the national newspaper U.S.A. Today, however, has been attributed in part to its extensive use of multi-color graphics and tables to summarize news, sports, and weather.


Did you notice the upward trend in government spending? This is the information that
the candidate most needed to learn from the government spending data.
Why is this trend pattern in the data not detectable from a histogram? The reason is that
histograms are a univariate method of displaying data. Univariate methods analyze each variable
in isolation and ignore how related variables are behaving. For time series data, there is always
another variable that belongs in the picture, time itself. Each observation in time series data is
tied to the period at which it occurs. In this case, each government spending figure in Table 2.2
takes place in a different year.
Because Table 2.2 contains bivariate data on government spending and year, the advisor
needs to use bivariate analysis to understand the relationship between these two variables.
However, bivariate tables can be more confusing than the univariate data lists we tried to
decipher earlier in this chapter. That is why time series data are generally displayed in a time
series graph.
DEFINITION: A time series graph is a two-dimensional plot of time series data with the time
series variable plotted on the vertical axis and time itself plotted on the horizontal axis.
The time variable is always placed on the horizontal axis so we can track our time series variable over time. Figure 2.40 contains a Minitab time series plot of the government expenditure data,
plotted using directions shown in Figure 2.41.
Using Minitab to Display a Time Series Plot
Pull-Down Menu sequence: Graph Time Series Plot...
Then Complete the Time Series Plot DIALOG BOX as follows:
(1) double click mouse on variable to select it for plotting
(2) click to highlight Calendar or Clock, click button, and select period of the data
(3) click Options... button to enter the Time Series Options DIALOG BOX
(4) type the start time (the year, for example) of first observation in the data and click OK
(5) click the OK button to plot the graph
Figure 2.41
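A comparable time series plot can be drawn with general-purpose software as well. The rough Python sketch below uses the matplotlib library to plot the Table 2.2 figures, with time on the horizontal axis as the definition requires (the labels are ours):

A Rough Python Sketch: Drawing a Time Series Plot

import matplotlib.pyplot as plt

years = list(range(1977, 1992))
spending = [426.4, 469.3, 520.3, 613.1, 697.8, 770.9, 840.0, 892.7,
            969.9, 1028.2, 1065.6, 1109.0, 1179.4, 1270.1, 1321.7]

plt.plot(years, spending, marker="o")   # time series variable plotted against time
plt.xlabel("Year")
plt.ylabel("Government Expenditure (billions of dollars)")
plt.title("U.S. Government Expenditure, 1977-1991")
plt.show()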
Notice how the steady upward trend is more clearly seen from this graph than from the table. Here, a picture may be worth a thousand words (or at least several dozen). From the graph, it is apparent that the trend is almost, but not quite, a straight line. This imaginary line is called a trend line and its slope is the trend rate.

Figure 2.43 [Minitab time series plot of Gov Exp ($400 to $1,400 billion) against Year, 1977-1991]


DEFINITION: A trend line is a straight line that approximates the linear relationship between a
time series variable and time itself. The slope of the trend line is the trend rate.
We can easily guess the approximate trend rate. The graph shows government expenditures
rising by about $900 billion (from over $400 to more than $1300 billion) during the 14-year
period, 1977-1991. Using the "rise over the run" method of calculating slopes, the annual rise in
government expenditures was 900/14, or about $64 billion.
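The rise-over-run arithmetic is easy to verify by computer, and a least-squares fit (a preview of Chapter 4) gives nearly the same answer. A rough Python sketch using the Table 2.2 figures:

A Rough Python Sketch: Estimating the Trend Rate

import numpy as np

years = np.arange(1977, 1992)
spending = np.array([426.4, 469.3, 520.3, 613.1, 697.8, 770.9, 840.0, 892.7,
                     969.9, 1028.2, 1065.6, 1109.0, 1179.4, 1270.1, 1321.7])

# "rise over run": total change divided by the number of years elapsed
rise_over_run = (spending[-1] - spending[0]) / (years[-1] - years[0])
print(round(float(rise_over_run), 1))      # about 64 billion dollars per year

# the slope of a least-squares trend line is nearly the same
slope, intercept = np.polyfit(years, spending, 1)
print(round(float(slope), 1))              # about 64.6, in close agreement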
Many politicians in 1992 ran for reelection by blaming politics, the Congress, or a particular president or political party for high government spending. Other incumbents
claimed that higher government spending was a temporary result of recessionary episodes or the
latest Middle East war.
The time series plot, however, clearly reveals a disturbingly constant upward march in
expenditures. In every year, spending was tens of billions higher than in the preceding year.
Spending rose during recessions and economic booms, wartime and peace, high inflation and
stable prices, Democratic and Republican presidents.

The campaign adviser should recommend that the congressional challenger not blame
either party or economic events for this steady increase in government spending. Instead, the
relentless trend indicates that something about the established interests in Washington has
allowed the regular increase in spending. By showing voters graphs such as Figure 2.43, running
as an outsider became a winning strategy for challengers in 1992.[6] In Chapter 4, we will learn about other properties of trend lines.


[6] Of course, identifying the specific causes of the trend and effective solutions requires further analysis. Parts IV and V of this text will introduce statistical methods to explore questions of this sort.
Figure 2.44 Quarterly U.S. Trade Deficits 1960-1991 (Billions $) [time series plot of Trade Deficit against Year]

Thus, the chamber of commerce may have a better chance of taking its streets back if growth
slows and elderly groups support them, but an enlightened citizenry would also help.

Multiple Choice Review:
2.44 Bivariate data are data gathered on
a. one variable for each observation
b. two variables for each observation
c. at least one variable for each observation
d. at least two variables for each observation
e. none of the above
2.45 A scatterplot is a
a. two dimensional plot of data
b. frequency distribution
c. histogram
d. trend line
e. random array of points

2.46 An index
a. summarizes a group of related variables
b. is the average of several variables
c. is often used to represent overall movements in stock prices
d. is often used to represent overall movements in consumer prices
e. all of the above

2.48 Which of the following is a guideline for constructing time series graphs?
a. place time on the vertical axis
b. always draw a trend line rather than connect the plotted points
c. beware of graphs that omit the most recently available data
d. compare each time series observation with its cross section counterpart
e. graph two (or more) time series variables on the same graph to compare them

Chapter Review Summary
We can often learn much by examining the entire data set. Even sorted listings, however,
are too cumbersome to observe general patterns in the data.
A histogram, by graphing the information from a frequency distribution, yields an even
simpler diagram to observe data patterns. However, histograms sacrifice all detail about location
of individual observations within a class.
The modal class is a measure of the average for a histogram. This class locates the most
frequently occurring values in the data. While the mode is a better measure with categorical or
some discrete variables, the modal class is generally the proper choice with all continuous data.
A frequency distribution is unimodal if the histogram has one mode or modal class and
is bimodal if the histogram bars group around two separate modes. A unimodal distribution is
skewed right if its histogram has a long tail to the right of its modal class and skewed left if its
tail lies only to the left.
Bivariate data cannot be fully analyzed using univariate statistics. In order to examine
the relationship between the paired variables, we may present bivariate data graphically in
scatterplots. A trend line may be used to summarize time series data in a scatterplot with the
time period represented on the horizontal axis.
Chapter Case Study Exercises:
1. An independent insurance agency examines the histogram, descriptive statistics, and printed
listing of premiums paid by its auto insurance policyholders. The computer printout is the
following:
Data Display

Premiums
303 467 494 497 539 569 575 584 609 609 611
613 672 704 737 746 758 773 805 815 819 886
898 915 928 931 946 946 946 946 985 1179 1179
1198 1248 1265 1265 1329 1361 1413 1450 1522 1635 1680
1789 1965 2027 2078 2162 2204 2221 2638 3040 3040 3204

(a) In two sentences, describe the information furnished by the modal interval.
Premiums are most frequently in the $600 to $1000 interval. Of the 55 policyholders, 23 pay premiums in this range, nearly three times the number of policyholders in any other interval.
(b) Why does the modal class provide more useful information to the agency than the mode for this population? (one sentence)
The mode, 946, reflects premiums for only four policyholders, far less than the 23 in the modal class.
(c) How does the modal class compare with the mode?
The modal class ($600 to $1000) contains the mode (946).

2. A company considers reimbursing its 51 workers for their annual premiums paid on $10,000
life insurance policies. Insurance premium charges vary from worker to worker. The histogram,
descriptive statistics, and sorted listing of annual premiums charged (in dollars) to the N = 51
workers are provided below.
(a) Complete the following sentence describing the information ONLY found from examining
the above histogram's modal class interval:
"If you reimburse your workers for their life insurance payments, your company ...."


[Histogram for Exercise 1, the 55 auto insurance premiums ("Premiums"): frequencies for $400-wide classes from $200 to $3,400 are 8, 23, 8, 6, 4, 2, 1, and 3, reading from the lowest class upward; the modal class, $600 to $1,000, contains 23 policyholders.]
Data Display

premium
39 44 44 57 57 62 71 71 86 92 93
95 109 111 121 121 128 135 140 141 154 157
157 159 163 166 188 190 202 202 211 213 223
241 261 274 300 308 312 334 355 363 372 444
446 461 463 593 598 775 781

(b) Why does the modal class interval provide more useful information about "average"
premiums than does the mode for this population?
Answers:
(a) If you reimburse your workers for their life insurance payments, your company will pay out between $100 and $200 annually to nearly one-third of its workers (16 of the 51 employees).
(b) While the modal class describes the premiums of nearly one-third of the workers, the most frequently occurring value, the mode, occurs only twice and is not even unique. Instead, there are six modes ($44, $57, $71, $121, $157, and $202).

[Histogram of the 51 life insurance premiums ("lifeins"): frequencies for $100-wide classes from $0 to $800 are 12, 16, 8, 7, 4, 2, 0, and 2, reading from the lowest class upward; the modal class, $100 to $200, contains 16 workers.]
3. A manufacturer of parachutes examines the histogram, descriptive statistics, and printed
listing of units sold ("unitsold") each month over the past seven years (N = 84 months). The
computer printout is the following:


Data Display

unitsld
18 26 31 33 34 37 38 40 40 41 43
43 43 45 46 47 47 47 48 49 51 52
52 52 53 53 53 53 54 54 54 55 55
55 56 56 56 58 58 58 59 59 59 60
60 62 62 62 63 64 64 67 67 67 68
68 71 71 71 72 73 76 76 76 77 77
78 79 79 80 81 81 82 82 87 89 89
89 94 95 100 105 122 138

(a) In two sentences, describe the modal interval in the context of this case.
(b) Why does the modal interval provide more useful information than the mode for this
population?
Answers: (a) In 20 of the 84 months, nearly one-quarter of the time, between 55 and 64 parachutes were sold.
(b) The mode of 53 occurred only four times, while several other sales totals recurred in three different months. By contrast, sales within the modal class occurred in 20 of the months, or five times as often as the mode.
[Histogram of monthly units sold ("unitsld"): frequencies for 10-unit-wide classes spanning roughly 0 to 150; the modal class, 55 to 64 units, contains 20 of the 84 months.]



Chapter 3 Summarizing Data by Average and
Variability Measures

Approach: Data may often be summarized by measures of average and variability. Basic
notions of average are compared. The importance of data variability is stressed through
the everyday idea of the range and a common sense way of understanding standard
deviation. Summary statistics can also help us make informed decisions when only a
sample is available.
Where We Are Going: The mean and standard deviation provide insights about the
distribution of business data. The sample mean and sample standard deviation will be
important for making inferences about the parent populations prevailing in business
environments.
New Concepts And Terms:
mean, median, and mode
range, outliers, and resistant measures
percentiles, quartiles, interquartile range, and box plots
variance, standard deviation, coefficient of variation, and mean absolute deviation
random sampling, sample mean, and sample standard deviation
descriptive statistics, inference, parameters, and bias

SECTION 3.1 Measuring the Average
SECTION 3.2 Variability and Range Measures
SECTION 3.3 Mean Variability Measures
SECTION 3.4 Using Random Samples to Make Inferences about a Population

3.1 MEASURING THE AVERAGE
Business executives today spend too much of their valuable workday in meetings.
During a lengthy meeting presentation, a frustrated manager will interrupt by saying, "What is
the bottom line?" or "Let's cut to the chase!" Not everything can be equally important.
Managers also confront stacks of lengthy reports. Fortunately, business reports begin
with a brief Executive Summary whose boldfaced, capitalized type seems to shout,
THESE ARE THE IMPORTANT POINTS!
Summary statistics can often supply the business analyst with information similar to an executive
summary.
In Chapter 2, we showed the usefulness of examining an entire data set visually in sorted lists, charts, or graphs. Data sets, however, can be too complex for our eyes to grasp efficiently, and identifying the most important features of a particular histogram may be difficult. Many decision situations also require us to attach numbers that quantify our overall assessments of the data. In this chapter we learn how to summarize the information in a data set by measures of the average and variability.¹ With just a single number, we can inform a decision maker about the average value of the data. Using another descriptive statistic, we can also describe the variability of the data.
DEFINITION: Descriptive statistics quantify the information in data and summarize its
different characteristics.
Descriptive statistics are convenient when we have a very small data set of 10 observations, but
become an absolute necessity when the number of observations reaches 30, 100, or 1000. We
begin by introducing descriptive statistics that measure the average.

The Mean
If asked what single measurement best summarizes a spreadsheet column of sales data,
most people would say the average. As with many words in the English language, the dictionary
gives us several different meanings for average. For example, the plant manager may interpret
the following worker evaluation:


¹ Statisticians often refer to the average as the central tendency or location of a distribution, and to variability as its dispersion. Distributions will be discussed in greater detail in Chapter 6.
He is doing average work.
to indicate mediocre, substandard job performance. However, the production supervisor
intended the evaluation to reflect typical performance, another meaning of average. Because
misinterpreted information can be worse than no information at all, imprecise terminology has no
place in objective data analysis. Even if statistical analysis involves methods too complicated for
decision makers to understand, we still have the responsibility to describe the properties and
limitations of the statistic being reported.
Business statistics carefully defines each type of descriptive statistic to communicate
specific information to decision makers precisely.
More than one popular descriptive statistic is defined to measure the average for business data.
Rather than call any one (or all) of these the average, we assign special names to each.
For example, which "average" would you give the Board that best summarizes the
compensation data on the 38 CEO compensations? Most people would suggest we total up the
compensations and divide this sum by the number of CEOs. In statistics, we call this arithmetic
average the mean of the data.
2

DEFINITION: The mean is the arithmetic average of a data set calculated by summing the data
and dividing the total by the number of observations in the set. The population mean, , is the
population sum divided by the population size, N.
In Greek, the letter with an "M" sound (for mean) is , and pronounced "M-yu" (rhymes with
"view").
3
Suppose we have N observations X1, X2, ..., XN on a variable x. Then, we may represent the population mean by the following:

    μ = (X1 + X2 + . . . + XN) / N

or in the more compact summation notation:

    μ = ΣX / N



² In fact, μ is sometimes called the arithmetic mean or simple mean to distinguish it from other types of means such as the weighted mean, geometric mean, and harmonic mean.

³ Greek letters are often used in statistics to describe characteristics of a population.
The computations are easy to show for small populations. For example, if a company has only
five workers whose weekly salaries are $300, $300, $325, $450, and $475, finding the mean
involves the following simple calculations:

    μ = (300 + 300 + 325 + 450 + 475) / 5 = 1850 / 5 = 370


It takes about 30 seconds on a pocket calculator to sum these and divide by N = 5 and arrive at
the population mean of $370.
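The same arithmetic can be sketched in a few lines of Python (an illustration of ours, not part of the original example; the variable names are our own):

# Population mean of the five weekly salaries, computed exactly as above
salaries = [300, 300, 325, 450, 475]
N = len(salaries)             # population size, N = 5
mu = sum(salaries) / N        # (300 + 300 + 325 + 450 + 475) / 5
print(mu)                     # prints 370.0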
These calculations become laborious for larger populations. But once the numbers are
stored in a computer file, a spreadsheet program or other software can do lightning-fast, error-
free arithmetic calculations. A large variety of summary statistics, including the mean, may be
obtained from Excel or Minitab, for example, by using the Descriptive Statistics procedures
outlined in Figures 3.1 and 3.2.

Using Minitab to Obtain Descriptive Statistics on Columns of Data
Pull-Down Menu sequence: Stat > Basic Statistics > Descriptive Statistics...
Complete the Descriptive Statistics DIALOG BOX as follows:
(1) double click names on the left to select variables for descriptive statistics
(2) click on OK button to obtain descriptive statistics
Figure 3.1

Using Excel to Obtain Descriptive Statistics from Spreadsheet Data
Pull Down Menu Sequence: Tools > Data Analysis...
(1) Select Descriptive Statistics from Data Analysis box, and click OK
Then complete the Descriptive Statistics DIALOG BOX as follows:
(2) Click to place an x next to Labels in the First Row.
(3) Click to place an x next to Summary Statistics at the bottom.
(4) Click inside box following Input Range and type cell range of label and data for the
variables (or drag mouse through spreadsheet data so cell range appears in the box)
(5) click on OK to obtain descriptive statistics in a new sheet
Figure 3.2
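For readers working outside Excel or Minitab, a rough Python analogue of these Descriptive Statistics procedures can be sketched with the standard library's statistics module (the sketch is ours; the text itself relies only on Excel and Minitab):

import statistics as st

data = [300, 300, 325, 450, 475]        # any column of data can be substituted here
print("N      ", len(data))
print("MEAN   ", st.mean(data))         # 370
print("MEDIAN ", st.median(data))       # 325
print("MIN    ", min(data))
print("MAX    ", max(data))
print("STDEV  ", st.pstdev(data))       # population standard deviation (variability measures come in Sections 3.2 and 3.3)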

We may use Excel and Minitab to produce descriptive statistics for the 38 CEO compensations in the data file we examined in the first two chapters. The Minitab output is shown in Figure 3.5.



Descriptive Statistics

N MEAN MEDIAN TRMEAN STDEV SEMEAN
CEO comp 38 1544 1157 1365 1256 204

MIN MAX Q1 Q3
CEO comp 456 6219 706 1822
Figure 3.5

Notice that Minitab provides information about the number of observations N, the mean, and several other statistical summary measures that we'll explain in this and later chapters. The Excel output (see Figure 3.6) lists many of the same summary statistics. For now, our interest is only in the mean, 1544, reported in both computer outputs.
Just as it did with the worker salary data earlier, the computer summed the data, divided by the number of observations N (38 CEOs in this case), and reported the quotient. The Excel output gives us this Sum, 58663, near the bottom. Check with your pocket calculator that the reported mean, 1544, is in fact equal to the quotient 58663 / 38.
Because the data are expressed in thousands of dollars, the mean is $1,544,000. You could therefore report to the Tangerine corporate management that mean compensation in the computer and electronics industry is slightly over one-and-a-half million dollars. We will have a lot more to say about the mean and its uses, but first let's introduce the other primary measure of the average.

The Median
The mean is not the only measure of the average for a set of data, and for many purposes it is not the best. A descriptive statistic exists that does not require us to total the data or do any arithmetic at all. Instead, this alternative measure of the average relies on data sorted by size. In Chapter 2, we already saw how sorted quantitative data are much easier to display or group into class intervals. The median focuses on a particular value in the sorted listing: the one found halfway through the sorted data.
CEO comp
Mean 1543.763
Standard Error 203.7692
Median 1157
Mode 1225
Standard Deviation 1256.118
Sample Variance 1577831
Kurtosis 5.412895
Skewness 2.233408
Range 5763
Minimum 456
Maximum 6219
Sum 58663
Count 38
Confidence Level(95.000%) 399.3797
Figure 3.6 Descriptive Stats in Excel
DEFINITION: The median, M, of a population is the middle value when the data are sorted according to size. An equal number of values will lie above and below the median.
Consider again the five-worker salary example presented earlier. Recall that the mean of $370 is the combined weekly salary of $1850 spread equally across the five workers. On the other hand, the median (or middle) value of $300, $300, $325, $450, and $475 is clearly $325 because two salaries lie above it and two below. Notice how the median is based upon examining sorted data instead of summing data and computing quotients. Because the median is not calculated by arithmetic, the median and mean are seldom equal. In fact, they often have considerably different values, as we will see shortly.
First, let's examine how to find the median under various circumstances. Observe that the median of the five salaries was the middle observation. A middle value will always exist for an odd number of observations N. For N = 5, the median is the third largest. The median is also the third smallest; it doesn't matter which, because the middle is the same number of observations from either end of the data. An odd number of sorted observations can always be divided into two equal parts and still leave a middle value, our median.
If the population size N is an odd number, the median M is the (N + 1) / 2 largest value in the population.
For N = 25 observations, (25 + 1) / 2 is 13. The median is then the 13th largest observation, with 12 observations larger than M and the remaining 12 smaller.
What do we do for populations with no middle observation? Data sets with an even number of observations have no middle value. We can't look for the (N + 1) / 2 largest value because (N + 1) / 2 is no longer an integer. Suppose N = 24. Then (24 + 1) / 2 is 12.5. It is easy to see why the two observations closest to the middle fail to satisfy our definition. The thirteenth largest has 12 observations larger but only 11 that are smaller. The twelfth largest value faces the opposite situation: 11 larger and 12 smaller.
For the five-employee example, consider what happens when we hire an additional worker at a salary of $375. The six salaries 300, 300, 325, 375, 450, and 475 now have no middle observation. But what if we choose a median between the 3rd and 4th largest? By selecting a median of $350, we satisfy our definition for the median because three observations are above and three are below $350. The same is also true of $355 and $371.50. Which do we choose? With even-sized populations, it is customary to select a median midway between the two observations closest to the middle.

If the population size N is an even number, the median M is the value lying halfway between the N/2 and (N/2 + 1) largest values in the population.

Thus, the median for the six workers would be (375 + 325) / 2, or $350.
Just such a situation occurs for the CEO data, whose 38 compensations represent an even number. The median is halfway between the 19th and 20th largest observations because N / 2 is 19 and N / 2 + 1 = 20.
sort CEO
456 487 515 541 618 621 683 684 692 711 762
914 964 981 993 995 1000 1054 1135 1179 1217 1225
1225 1336 1338 1446 1572 1634 1814 1846 1948 2096 2965
2987 3144 3565 5101 6219
Figure 3.7

We can locate the median by examining the sorted data from Chapter 2 (see Figure 3.7).
Counting from either end, it is easy to spot the 19th and 20th compensations, 1135 and 1179.
Therefore, the median must lie midway between these values, or
M = (1135 + 1179) / 2 = 1157
In practice, letting the computer find the median is much easier. In fact, this median M =
1157 was already listed following the mean on the descriptive statistics output of Excel and
Minitab. Your report to the Board of Directors can therefore cite a median compensation of
under $1.2 million. Notice that this median falls well below the $1.544 million mean CEO
compensation computed earlier.
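The (N + 1) / 2 and N / 2 location rules are easy to translate into code. The following Python sketch is our own illustration and applies them to the two salary examples:

def median_by_rule(values):
    x = sorted(values)                   # the rules assume data sorted by size
    N = len(x)
    if N % 2 == 1:                       # odd N: the (N + 1)/2 ranked value
        return x[(N + 1) // 2 - 1]       # subtract 1 to adjust for zero-based indexing
    else:                                # even N: halfway between the N/2 and N/2 + 1 ranked values
        return (x[N // 2 - 1] + x[N // 2]) / 2

print(median_by_rule([300, 300, 325, 450, 475]))        # 325
print(median_by_rule([300, 300, 325, 375, 450, 475]))   # 350.0

The built-in statistics.median function applies the same convention and is the easier choice in practice.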

The Mean as an Accounting Average
Descriptive statistics are designed to summarize data. The business analyst who relies exclusively on descriptive statistics may lose touch with the actual data. It is unreasonable to expect one or two numbers to summarize everything about a set of data. Only if all observations have the same value will the mean or median tell us everything about the data. Generally, summary statistics cost us some information about the data. Besides the mean, the median and mode are also popular measures of the average. We now investigate the strengths and weaknesses of each as a measure of the average.
The mean is usually the best measure when we have in mind an accounting definition of
the average. Recall that the mean is the total divided evenly across all observations. Suppose
Burger King wants to know how many burgers on average it must serve at each of its franchises
in order to sell 50 million hamburgers per year. The mean is the best descriptive statistic here
because it is directly related to the sum total. If there are 2000 franchises, for example, then each
franchise must average (in the mean sense) annual sales of 25,000.
The information provided by the mean also operates in the reverse direction. By knowing μ and N, we can always find the total from their product. If 2,000 franchises average 25,000 hamburgers each, total sales for the whole chain must be:
Total = μ N = (25,000)(2,000) = 50,000,000 hamburgers
Similarly, a mean of 15 course credits per term for eight semesters will give you (8)(15), or a total of 120 credits. For a quality control example, suppose that the mean number of defective computer chips is three per box shipped and 5,000 boxes are shipped to assembly plants. Then the company can expect (3)(5,000) = 15,000 defective chips, with dissatisfied customers demanding refunds or replacements.
By contrast, the median has no such accounting relationship with the sum. The median sales for Burger King franchises, for example, may be above or below the 25,000 mean for the chain. This median is the number midway between sales at the 1000th and 1001st ranked Burger King franchises. Because it is not related to the sum of the population values, the median does not provide the information necessary to calculate the total.
One common reason for reporting the mean is its accounting relationship with the sum of all observations in the population:
Total = μ N
There is no equivalent connection between the total and the median.
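Because the relationship is just multiplication, it is trivial to verify with the Burger King figures (a Python sketch of ours, not part of the original text):

mu = 25_000          # mean annual hamburger sales per franchise
N = 2_000            # number of franchises
print(mu * N)        # 50000000, the 50 million hamburger total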
The Effect of Outliers on the Mean and Median
This computational advantage the mean enjoys can be a mixed blessing. Because it is
arithmetically related to the sum of the data, the mean can also be highly sensitive to anything
that dramatically affects that sum. Many common types of business data contain one or more
outliers, unusually large or small values in the population.
DEFINITION: An outlier is an observation whose value is much larger or smaller than that of any other observation in the data.
Since the mean is derived from the sum of all values, a few outliers can radically affect the mean. In our salary example, suppose the boss hires another worker (perhaps his brother-in-law) at $2350 / week. This one unusually large salary is enough to raise the total payroll from $1850 to $4200 (300 + 300 + 325 + 450 + 475 + 2350). The mean nearly doubles from $370 to $4200 / 6 = $700. The median, however, found by counting the sorted entries, treats the new salary as just another above-average salary. With N now 6 instead of 5, the median increases only a modest amount: from $325 to (325 + 450) / 2 = $387.50. Notice that outliers exceeding the median elevate the mean well above the median, while the median itself barely moves. This property makes the median a resistant measure.
DEFINITION: Statistics such as the median that are not strongly affected by outliers are called
resistant measures.
On the other hand, outliers much smaller than the rest of the data pull the mean well
below the median. Sometimes, the presence of both high- and low-end outliers results in
offsetting effects on the mean.
The mean can be dramatically shifted in the direction of a few outliers, unless other outliers
offset this effect. The median, as a resistant measure, is often a much better measure of the
average for data containing outliers.
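A quick computation (our own Python sketch) shows how strongly the single $2,350 salary moves the mean while barely moving the median:

import statistics as st

salaries = [300, 300, 325, 450, 475]
with_outlier = salaries + [2350]                          # add the brother-in-law's salary

print(st.mean(salaries), st.median(salaries))             # 370 and 325
print(st.mean(with_outlier), st.median(with_outlier))     # 700 and 387.5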
Because of their influence on the mean, outliers should be carefully scrutinized. If
unusually large or small values can be traced to a legitimate mistake, we may be justified in
correcting or removing outliers from the data. By double checking an outlier, we may discover
that a value was measured or recorded incorrectly. For example, an 11,000 square-foot residential lot (about one-quarter acre) was the only value entered onto a spreadsheet without first converting its measurement to acres. Fortunately, the 11,000 showed up on the histogram as an
outlier against the much smaller residential lot data recorded in acres. Once the error was
discovered, the spreadsheet entry was corrected from 11,000 to 0.25. In other cases, the source
of an outlier is simply a typing error causing a misplaced decimal point or too many digits. It is
a simple matter to correct the mistake and avoid an embarrassing retraction later.
Different remedies are required when outliers are traceable to improper data sources.
Rather than residential lots, suppose a farm house on a hundred-
acre agricultural tract was inadvertently included in a
residential property study. The number, 100 acres, may be
correct but originates from the wrong population. This outlier
clearly does not belong in the residential property population
being studied and would make mean lot size much too large. It
should therefore be deleted from the data.
However, extreme caution should be exercised before
outliers are deleted. What if outliers cannot be traced to either
of these types of errors? Business analysts may be accused of
discarding outliers merely to make the mean a more acceptable
measure of the average. Thus, despite the problems an outlier may cause, never remove outliers for reasons other than population or recording errors. Instead of throwing out such observations, analysts detecting outliers may rely on descriptive statistics, such as the median, that are not sensitive to outliers.
No observations, even outliers that severely affect the mean, should be removed unless they
can be shown to be incorrectly recorded or originate from the wrong population.
For the CEO case study, 6219 and 5101 were considerably larger than the other compensations.
But they represented valid $5 and $6 million pay packages negotiated by companies in the
computer and electronics industry. Therefore, neither of these observations should be omitted
from the data analysis.

The Most Frequently Occurring Value
Recall from Chapter 2 that the mode and modal class find the most common value or
interval of values. Frequently occurring values are often viewed as typical values. Thus, if the
mode or modal class contain a large enough portion of the data, they may be considered
alternative ways of expressing the average.
Rather than relying on arithmetic or sorted data, the mode reports the value repeated most
often. The appeal of the mode is that it represents more observations than any other. If only a
few different values occur in the population, the mode can be a highly-useful descriptive statistic.
When the AAA (the American Automobile Association) relocated its corporate headquarters, it hired many new employees from the local labor force. Shortly after the move, the mode number of years worked at the firm was only one year. This was a useful measure of the average because it represented the most common experience of its workers. The mean and even the median were considerably larger than the mode due to the many staff members transferred from its previous location.

vacation
Mean 882.575
Standard Error 144.4139
Median 512.5
Mode 0
Standard Deviation 913.3537
Sample Variance 834215
Kurtosis -0.82358
Skewness 0.809083
Range 2750
Minimum 0
Maximum 2750
Sum 35303
Count 40
Confidence Level(95.000%) 283.0456
Figure 3.9
The mode has a similar advantage when a particular value is a typical outcome for continuous data. A survey uncovered the annual vacation expenditures of 40 graduate business students (see Figure 3.8).
Data Display
Vacation
0 0 0 0 0 0 0 0 0 150 150
200 225 250 250 300 300 400 400 425 600 650
750 800 900 980 1000 1000 1023 1600 2000 2000 2000
2125 2200 2250 2450 2500 2675 2750
Figure 3.8
The mode of zero results from a lack of time and resources to take a vacation until students
finished their coursework. Because nine of the 40 expenditures were zero, this mode may be
reported as the typical, or average, expenditure. As can be seen on the Excel output in Figure
3.9, the median exceeds $500 and the mean is nearly twice the median. Neither of these is very
descriptive of typical vacation spending by graduate students.
The modal class interval is useful for finding the most frequently occurring interval of values. The modal class is an important feature of the frequency distribution and histogram.

The Average for Symmetric, Unimodal and Skewed Distributions
Let's examine the relationship between the mean, median, and modal class for two
important types of frequency distributions introduced in the previous chapter, symmetric and
skewed distributions. Figure 3.10 reproduces the approximately symmetric distribution of the
passenger arrival data from Chapter 2. Because of this symmetry, small values that pull the
mean below the median are offset by values on the other end of the distribution. As a result,
Figure 3.11 reports that the 829.5 mean is nearly the same as the median (822).



Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
AirTraf 67 829.5 822.0 828.0 110.9 13.5

Variable Min Max Q1 Q3
AirTraf 597.0 1086.0 751.0 906.0
Figure 3.11

When distributions are symmetrical, the mean and median will always have this close and convenient relationship to one another. Because the mean and median are essentially the same for symmetric data, the reported average has all the desirable qualities of the mean and the median with none of the limitations of either.
Because it is the median, 822,000 separates the 33 months with fewer arrivals from an equal
number of months with more than 822,000 passenger arrivals. In addition, the nearly identical
mean, 829,500 may be used as an accounting average representing the monthly average air
traffic volume.
[Figure 3.10: histogram of AirTraf titled "Distribution of Monthly Passenger Arrivals: Jan. 1989 - July 1994," showing frequencies for classes from 550 to 1,150 thousand arrivals.]

This relationship holds for any symmetric distribution. If a symmetric distribution is also unimodal, however, all three types of averages converge on the same value.

For unimodal, symmetrical distributions, the mode, median, and mean are identical.
Thus, the modal class for the air traffic histogram spans the interval from 750,000 up to 850,000.
Both the mean and median are inside this interval because the data have a symmetric, unimodal
frequency distribution. Therefore, the average of monthly
passenger arrivals can be represented by the interval 750,000 to
850,000 no matter which average we are interested in: (1) the
most frequently occurring monthly amount, (2) the middle
ranked value, or (3) the arithmetic average derived from the
overall sum of the data.
The happy coincidence among the three measures of the average fails to occur with other
types of data. In fact, skewed distributions often have mean, median, and modal class that are
quite different from one another. Skewed data, however, does have a predictable relationship
among the three types of averages.
Consider, for example, the catering firm's skewed employee age distribution from Chapter 2, shown again in Figure 3.12. The top part of the Excel descriptive statistics output (Figure 3.13) tells us that the mean is 41.4, nearly three years more than the median, 38.5.

AgeCater
Mean 41.38158
Standard Error 1.2606
Median 38.5
Figure 3.13

[Figure 3.12: histogram of AgeCater, the catering firm's employee ages, showing frequencies for age classes from 20 to 80.]

Just like high-end outliers, the large values in the tail of skewed-right data pull the mean upward. The
mean is thus greater than the median. Unlike a data set with outliers, no values are much larger
than the rest of the data. Instead, the tail displays a gradual decline in frequencies. However,
the many observations in the tail can affect the mean as much as one or two outliers. Notice also
that the mean lies outside the modal class, employee ages from 30 to 39. The median, on the
other hand, falls within the modal class. If the skewness is extreme, the tail contains so many
observations that the median will also lie beyond the modal class but can never be shifted as
much as the mean. For skewed-left distributions, the relationship among the mean, median, and modal class is just the reverse.
The mean is greater than the median and often lies to the right of the modal class for distributions
that are skewed right. The opposite ordering of mean, median, and modal class occurs for
skewed left distributions. For highly skewed distributions, even the median will fall outside the
modal class.
Multiple Choice Review:
3.1 The mean for the salaries (measured in thousands of dollars / year) 10, 20, 20, 20, 30,
100, 500 is
a. 10
b. 20
c. 30
d. 65
e. 100

3.2 For 120 private colleges, the mean number of accounting majors is 84, the median is 59,
the mode is 64. In the histogram of accounting majors at the 120 colleges, the modal
class
a. has a class midpoint of 59 accounting majors
b. has a class midpoint of 64 accounting majors
c. has a class midpoint of 84 accounting majors
d. will not occur because this is not quantitative data
e. cannot be found from information given in the problem

3.3 A housing tract builder needs to decide on the number of bedrooms to put into its most
commonly-built homes. Which measure of the average should the builder determine?
a. the mean number of bedrooms for new home sales
b. the median number of bedrooms for new home sales
c. the mode for bedrooms of new home sales
d. the mean family size of new home buyers
e. none of the above
3.4 In calculating the standard deviation which operation is not involved?
a. summation over the number of observations
b. squaring differences from the mean
c. averaging over the number of observations
d. taking the square root
e. all the above operations are involved

3.5 The median is
a. the arithmetic average
b. the most frequently occurring value in the data set
c. the 50th percentile
d. the middle value if there is an even number of observations
e. all of the above

3.6 For which of the following are the mean, median, and mode identical?
a. 1 3 3
b. 1 2 2 4
c. 1 1 4 4 4 8 8
d. 1 10 10 19
e. 1 5 5 5 10

3.7 The median for the salaries (measured in thousands of dollars / year) 10, 20, 20, 20, 30,
100, 500 is
a. 10
b. 20
c. 30
d. 65
e. 100

3.8 The mode for the salaries (measured in thousands of dollars / year) 10, 20, 20, 20, 30,
100, 500 is
a. 10
b. 20
c. 30
d. 65
e. 100

3.9 If mean and median salaries are very similar at a firm, then you should conclude that
a. most employees make about the same salary
b. if some salaries are far below the mean, other salaries are considerably above the
mean
c. no salaries are far from the mean
d. some salaries may be far above the mean, but none can be far below it
e. the similarity between mean and median tells us nothing about the distribution of the
data

3.10 Given the following means, for which would you also want to know the dispersion?
a. choosing among three mutual funds which averaged 15, 12, and 8 percent return
b. hiring a new engineer from among graduates with 3.3, 3.1, and 3.0 grade point
averages
c. attempting to make vacation reservations at one of three resort areas with average
annual occupancy rates of 60, 70, and 80 percent
d. all of the above
e. none of the above

3.11 If the mean weight of 50 parts in a shipment is 2.2 pounds and the median weight is 0.8
pounds, the total weight of the shipment is
a. 40 pounds
b. 50 pounds
c. 110 pounds
d. 220 pounds
e. cannot be determined from the information provided

3.12 If the mean weight of 50 parts in a shipment is 2.2 pounds and the median weight is 0.8
pounds, then
a. half the parts weigh more than 0.8 pounds
b. 26 parts weigh less than 0.8 pounds
c. 25 parts weigh between 0.8 and 2.2 pounds
d. all of the above
e. none of the above

3.13 For the data set 5 4 3 2 1 0
a. the mean and median are the same
b. the mean has a value that does not occur in the data
c. the median has a value that does not occur in the data
d. all of the above
e. none of the above

Calculator Exercises:
3.14 Show that the mean is the same for each of the following seven data sets. How do these
sets differ from each other? Explain why these differences do not result in different
means.
(a) 1 2 3 4
(b) 4 3 2 1
(c) 1 1 4 4
(d) 2 2 3 3
(e) 1 1 1 7
(f) 1 1 2 2 3 3 4 4
(g) 2.5

3.15 Use a calculator to find the mean for each of the following data sets:
(a) 20, 50, 80, 200, 500, 1000, 1200
(b) 0.1, 0.1, 0.3, 0.5, 0.6, 0.9, 1.2, 1.5
(c) -15, -12, -10, -5, -2, -2, -1

3.16 Use a calculator to find the mean for each of the following data sets:
(a) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
(b) 0, 0, 0, 10, 10, 10, 30, 30, 30
(c) -50, -25, -10, 10, 25, 50, 70

3.17 What is the median for the following data sets: 1 2 3, 1 1 2 3 3, 1 2 2, 2 2 2 2 10, and -20
2 50? Discuss what factors do and do not appear to affect the median.
3.18 Find the mode or modes from the following data sets: 1 2 2 3, 1 1 2 2, 1 3 3 3 10, 1 2 3 4
8 8, 1 1 4 5 6 20 20 20. Compare the mode with the means and medians for the same
data. For which of these is the mode an acceptable measure of the average? For which is
the mode more misleading than illuminating? Explain.

3.19 Verify algebraically that for N an odd number, M is the (N + 1) / 2 ranked observation.
[Hint: Use the N = 2k + 1 formula for any odd number, solve for the integer k, and show
that k observations will lie above and also below M.]

3.20 Determine where the median will be in the sorted data for populations of
(a) N = 17
(b) N = 39
(c) N = 101

3.21 Determine the method for calculating the median for each of the following population
sizes:
(a) N = 32
(b) N = 10
(c) N = 250

3.2 VARIABILITY AND RANGE MEASURES
We have seen how measures of the average can help us make business decisions. In many situations, all we may need to know is the average. If most values in the population are similar, then a single number, the mean or median, will summarize the data quite nicely.
For example, consider the prices consumers pay for milk. The median supermarket price for a half-gallon of milk in Kansas City, say $1.50, may be a few cents higher or lower than the price at any particular Safeway or other supermarket in that city. But no one would fault you for summarizing the supermarkets' price information by stating, "the price of a half gallon of milk in Kansas City this month is about $1.50." Most people interpret this statement to mean that the average price is $1.50 and variations around that price are relatively small.
Not so with other kinds of purchases. Many things, such as housing, transportation, or
repair services, are often much more expensive in cities than in small towns. A national average
of living costs would be too low to represent urban costs and too high as a rural measure. If we
wish to portray the price of fresh fruits and vegetables, an annual average would gloss over the
seasonal swings. Crop prices vary substantially between harvest-time and midwinter when
produce must be shipped in from Chile and other far away ports.
The average for a bimodal distribution is also a very poor summary measure. Consider a
company where approximately half its employees are high-paid and the rest are low-paid.
Suppose the engineers, accountants, programmers, and top management all earn over $75,000 a
year while the secretaries, delivery staff, and maintenance personnel are paid under $30,000
each. Then it is quite likely that the mean will lie somewhere between $30,000 and $75,000.
Since the salary of no employee at the company lies in this salary interval, it would be a mistake
to interpret the mean as a representative value.
Earlier in this chapter, we saw how the median is often the superior measure of the
average. In this example, however, the median may be even more misleading. When higher
paid employees are in the majority, the median will make average salary appear deceptively
high. Thus if there are 501 high-paid workers and 500 low-paid workers, the median must
exceed $75,000. But adding just two secretaries to the staff causes the median to drop below
$30,000 because the middle value now lies within the lower paying group. Most of us would be
uncomfortable relying exclusively upon a measure which is so sensitive to minor changes in the
data.
In general, decision makers can make costly judgment errors if they only consider the
average. The following case illustrates how such errors can easily arise.

Chapter Case #1: Meet Me in St. Louis
Suppose your company is considering a move of its home office from its current San
Francisco operations to similar facilities in St. Louis. Management asks you to estimate the
effect of the move upon heating and cooling costs. Since temperature is the most important
factor determining heating and air-conditioning expenses, you decide to consult an almanac for
30-year monthly average temperatures (see Table 3.1).

Table 3.1
Average Temperatures (Degrees Fahrenheit)

San Francisco St. Louis
January 49 29
February 52 34
March 53 43
April 55 56
May 58 66
June 61 75
July 62 79
August 63 77
September 64 70
October 61 58
November 55 45
December 49 34

Source: The World Almanac and Book of Facts

The San Francisco and St. Louis temperature data in the chapter data file may be summarized by descriptive statistics. The first line of the Minitab output for each city is shown in Figure 3.14.
Descriptive Statistics

N MEAN MEDIAN TRMEAN STDEV SEMEAN
San Fran 12 56.83 56.50 56.90 5.39 1.56
St.Louis 12 55.50 57.00 55.80 18.13 5.23
Figure 3.14
The mean temperatures in the two cities are within a degree or two of each other. So too are the
medians.
Cities with either very high or very low average annual temperatures are associated with
substantial heating or cooling costs. Thus, Minneapolis and Buffalo businesses experience high
heating bills, while Houston- and Tucson-based companies find resources drained by high air
conditioning costs. What about St. Louis and San Francisco, two cities with similar average
annual temperatures? Should we conclude that heating and air-conditioning costs should also be
similar for the two cities? No, not unless the variability around their averages is similar. Here's
why.
Heating and cooling an office is only necessary when the outside temperature is
considerably below or above a comfortable level, say 70 degrees. When the temperature is in the
fifties or low sixties outside, the combination of sun hitting the building, body heat of workers,
and heat generated by office equipment will keep offices inside quite comfortable. A glance at
the data in Table 3.1 tells us that San Francisco, with the moderating effects on its climate from
ocean breezes, is a good location for low heating and air-conditioning costs.
Despite a similar mean and median, St. Louis weather is an entirely different story. Look
again at the second column in Table 3.1. The hot summers make air conditioning essential in St.
Louis, and indoor heating will be needed during the cold November through March months.
Thus, the proposed St. Louis headquarters will be running either its furnaces or air conditioners
two-thirds of the year, resulting in greater energy cost at the new location.
This case study illustrates why the mean and median may not be enough. Information
about how much data varies from the average can be equally important.

The Range
Are there other summary measures that tell us about how much a data set varies? There
certainly are. In fact, you know one already. You may have described a set of numbers in the
following terms:
The instructor said the scores on the first exam ranged from 62 to 98
Last summer my friends had jobs ranging from $4.50 to $8.75 an hour

Statisticians also use the term range as a measure of variation.
DEFINITION: The range is the difference between the largest and smallest value in a population
or sample. Range = Maximum - Minimum
In statistics, a measure such as the range is known as a measure of dispersion, in contrast to the measures of the average discussed in the preceding section.
DEFINITION: A summary measure of a data set that describes some aspect of the variability of
the data is called a measure of dispersion.
How can we calculate the range for the city temperature data? There are only twelve
months of data, and it is obvious that a summer month will contain the highest temperature and a
winter month the lowest. A quick glance at Table 3.1 finds for us 64 minus 49, a 15-degree
range of temperatures for San Francisco, and a 79 - 29 = 50-degree range for St. Louis.
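The same subtraction is easy to reproduce from Table 3.1 (a Python sketch of ours):

san_francisco = [49, 52, 53, 55, 58, 61, 62, 63, 64, 61, 55, 49]   # January through December
st_louis      = [29, 34, 43, 56, 66, 75, 79, 77, 70, 58, 45, 34]

for name, temps in [("San Francisco", san_francisco), ("St. Louis", st_louis)]:
    print(name, max(temps) - min(temps))     # ranges of 15 and 50 degrees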
For larger data sets, we usually let the computer sort the data by size, list the sorted data, and pick out the largest and smallest values. The minimum and maximum values are reported in the descriptive statistics output of both Minitab and Excel, and Excel also reports the range explicitly.
Since the mean (and median) temperatures are nearly the same, variability becomes the
crucial factor describing temperature's effect upon heating and cooling costs. For St. Louis, the
range of monthly averages is more than three times that of San Francisco. The management at
the relocating company will certainly want to know that combined heating and air-conditioning
costs should be substantially higher. Even if the move to St. Louis still makes sense (for example, because of lower labor costs and access to Midwestern markets), your warning about increased
office operating expenses will permit the company to factor these costs into its plans.

The Range and Quality Control
The pursuit of quality is as much a concern about undesirable dispersion as it is about
means. A college administrator may take pride in the high quality of student services such as
advising, counseling, and job placement. Perhaps the average student is receiving high quality
assistance. Yet if procedures are not carefully designed and monitored, even the best institutions
let too many students fall through the cracks in the system while resources are squandered on
seldom-needed services. Measures of the average, such as mean waiting time to see a counselor
or annual average number of students placed in jobs, may disguise unacceptably large
dispersions. Perhaps the placement center caters to graduates of professional programs but has
little success placing arts and science majors. Or the counseling center may experience periodic
staff shortages that result in canceled appointments or long delays. These variations in service
quality can cause horror stories that can do long term harm to the reputation of a college. Poor
quality is hard to keep a secret, and can be very costly. Eventually, the press, high school
guidance advisers, and alumni hear from those students receiving unsatisfactory service.
For the principles of TQM to be successful, employees at all levels must be involved in
quality control and improvement. Since business operations are an ongoing process, employees
who record the quality measurements are in the best position to monitor the process. However,
data collection is only part of the problem. A post mortem of the day's quality charts can prevent
customer dissatisfaction and lost future orders from shipping a bad production batch or
continuing poor service. Far better to be on the production front lines when unacceptable quality
variation occurs! If workers monitoring the data determine that a quality problem exists,
production can be shut down or backup systems shifted into place until a high quality process
can be restored. Being on the scene gives workers another advantage. They can alert a quality
control manager before the trail has gone cold. Perhaps large variability occurs only at a certain
time of the day or week.
The range is an especially common measure of dispersion with very small data sets,
especially in production situations involving quality control. Management must make reasonable
demands of its workers. Employees working in noisy, tiring, dangerous, dirty environments can
quickly lose enthusiasm for the goals of quality management unless the distractions of data
recording are minor. In many process control situations, workers need only sample five or six
data measurements at regular intervals, say every 20 minutes. The measurements are then
recorded on a chart, perhaps hanging from a chain next the employee's machine station.
Suppose the worker is supposed to monitor the variability of some critical aspect of
quality, such as the calibrated diameter of a camshaft coming off a
lathe. The range is an especially convenient dispersion measure to
track in busy business situations. Finding the maximum and
minimum from only five or six numbers is easy, so the sample
range involves only subtracting these two numbers. If the worker
notices the range growing beyond some prescribed limit, the
machine can be shut down before an unacceptable production
batch must be scrapped. An entire area called statistical process
control (SPC) has been developed to design procedures for
reading and acting on this and other types of control chart information.

Drawbacks of the Range
The range suffers from one major limitation: it uses information from only two
observations in the data. If the large range results from one or two outliers, we may get a
distorted picture of the typical variability in the data. In fact, any two data sets with an identical
maximum and minimum would have the same range, even if the other values within this range
were quite different.
A simple numerical example illustrates this point. Examine the four Excel columns of sorted data copied in Figure 3.16. Notice that each column of six observations has the same mean and median (10), maximum (20), and minimum (0). Yet the identical ranges of 20 are deceiving. The variability of some columns appears greater than that of others. See if you can rank these columns from least to most variable.
Most people would conclude that col 1 has the least variability, followed by col 2, col 3, and col 4. Four of the six col 1 observations do not vary at all from the mean. There couldn't be any less variability and still have a range of 20. By contrast, data in col 2 are spread equally across the range from 0 to 20. Even greater variability is apparent in col 3, where the data fall into two groups: three larger values and three well below the mean. The case of greatest dispersion occurs for col 4, with three values each at the minimum and the maximum.
col 1 col 2 col 3 col 4
0 0 0 0
10 4 2 0
10 8 4 0
10 12 16 20
10 16 18 20
20 20 20 20
Figure 3.16
Thus, we often need a dispersion measure that does more than tell us the difference
between the most extreme values in the data. Fortunately, other measures don't suffer from this
limitation.
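The four columns can be typed in directly to confirm that the range alone cannot tell them apart (our own Python sketch):

cols = {
    "col 1": [0, 10, 10, 10, 10, 20],
    "col 2": [0, 4, 8, 12, 16, 20],
    "col 3": [0, 2, 4, 16, 18, 20],
    "col 4": [0, 0, 0, 20, 20, 20],
}
for name, data in cols.items():
    print(name, "range =", max(data) - min(data))    # every column reports a range of 20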

Percentiles and the Interquartile Range
Wouldn't it be nice if we had a measure of dispersion that was as easy to interpret as the
range yet not affected by outliers? Perhaps we can again rely on what we already know about
statistics. So far we have only examined three portions of the sorted data: the maximum,
minimum, and the median. Besides the middle and extremes, a data set contains observations for
all other rankings from the second smallest up to the second largest. Since the sample or
population size will vary, these other positions in the sorted data are best referred to by their
relative ranking (from the bottom) measured in percentage terms, or percentile, out of the total
number of observations.
DEFINITION: An observation is in the pth percentile if p percent of the observations in the
data set have values as small or smaller and 100 - p percent have larger values.
You are probably familiar with the reporting of standard aptitude tests such as the SAT exams.
A raw numerical score is usually meaningless unless you know where it ranks you among all
those being tested. The percentile is the percentage ranking for those taking this type of exam.
Scoring a 1050 out of the 1600 possible on the SAT may rank you in the 80th percentile. This
means that 80 percent of those tested scored less than 1050 and only 20 percent performed
better.
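A percentile rank is simply the percentage of observations at or below a given value. The Python sketch below is our own, and the ten scores in it are invented solely for illustration (they are not actual SAT data):

def percentile_rank(value, data):
    at_or_below = sum(1 for x in data if x <= value)    # count observations no larger than value
    return 100 * at_or_below / len(data)

scores = [950, 980, 1000, 1020, 1050, 1100, 1150, 1210, 1300, 1400]   # hypothetical scores
print(percentile_rank(1050, scores))                    # 50.0 in this small made-up group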
We could also measure the difference in scores between various percentiles. A girl in the
92nd percentile may receive a score of 1210, 160 points better than a boy who scored in the
80th percentile. These comparisons often rely on round number percentiles: deciles, for every 10
percent (90th, 80th, etc.), and quartiles, for each percentile divisible by 25. We are already
familiar with the second quartile that occupies the 50th percentile. This percentile is more
commonly called the median, since it divides the upper from the lower half of the data.
In Chapter 5, we will extend this notion of percentiles to relative frequency distributions.
At this point, however, our goal is finding an ordinal measure of variation. We therefore direct
our attention to the upper ends of the first and third quartiles, located at the 25th (Q1) and 75th (Q3) percentiles, respectively. The interquartile range is defined in terms of these two quartiles.
DEFINITION: The interquartile range, or IQR, measures the difference between the first and third quartile observations in a sample or population:
IQR = Q3 - Q1
Like the range, the interquartile range has intuitive appeal. Both measures are found by
subtracting two meaningful numbers. By comparing the observations at the 25th and 75th
percentiles, the interquartile range measures the difference between values in the middle 50
percent of the data.
Unlike the range, the interquartile range is not influenced by outliers. As we learned from our discussion of medians, the value of the 50th percentile, the median, depends only on the order, and not the sizes, of observations in the data. The same is true for any other percentile value, such as the first and third quartiles. If these are not influenced by outliers, then their difference cannot be affected either.
Determining the IQR is usually more difficult than finding the range. Like the median, the task of finding Q1 and Q3 is essentially one of locating observations corresponding to the desired percentiles. If N is odd, the median was the value of the observation with ranking (N + 1) / 2 in the sorted data. The IQR seeks the values one-fourth and three-fourths of the way along the sorted data. The rules for locating Q1 and Q3 require us to look for observations with rankings of (N + 1) / 4 and 3(N + 1) / 4. Recall that if N is an even number, there is no observation at the middle of the sorted data. Then, we chose the number halfway between the two values closest to the middle. A similar procedure is again required to deal with the even more common situation in which no actual observation lies precisely at the 25th or 75th percentile.
For example, suppose the local newspaper surveys regular gas prices at the 11 local service stations. For N = 11 observations, (N + 1) / 4 is 12 / 4, or 3, and 3(N + 1) / 4 is 9. Thus, the 25th percentile is the 3rd lowest price and the 75th percentile is the 9th lowest. If the sorted data are $1.22, $1.30, $1.36, $1.37, $1.40, $1.42, $1.45, $1.46, $1.55, $1.59, and $1.75, then Q1 is $1.36 and Q3 (the 9th lowest price) is $1.55. Therefore, the interquartile range is $1.55 - $1.36, or $0.19. The middle 50 percent of gas stations have a 19-cent range of prices. This IQR is much less than the 53-cent range in the prices. Nineteen cents may be the more appropriate measure of price dispersion if newspaper readers want to know the general variability of prices across the middle half of the population.
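The same quartiles can be checked with Python's statistics.quantiles, whose default 'exclusive' method locates cut points at the (N + 1) / 4, 2(N + 1) / 4, and 3(N + 1) / 4 positions, matching the rule above (the sketch itself is ours):

import statistics as st

prices = [1.22, 1.30, 1.36, 1.37, 1.40, 1.42, 1.45, 1.46, 1.55, 1.59, 1.75]
q1, q2, q3 = st.quantiles(prices, n=4)                 # quartile cut points
print(q1, q3, round(q3 - q1, 2))                       # 1.36, 1.55, and an IQR of 0.19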

Chapter Case #2: Take Two Aspirin and Call Me in the Morning
In the 1990s, the rising cost of health care has captured the attention of the news media,
politicians, businesses, and ordinary citizens. The United States economy pays more and gets
less coverage than Japan, Canada, and European Community countries. Rising Medicare and
Medicaid costs are the single largest barrier to reducing the federal budget deficit. General
Motors now pays more for health care expenses than it does for steel. American companies have
trouble competing in the world economy when every product contains the high U.S. medical
costs on its workers. To avoid these costs, laid off workers are replaced by temporaries, part-
timers, or more overtime by current staff. Fear of losing medical insurance traps people into jobs
they don't like.
Although blame is often leveled at doctors, hospitals, insurance companies, and
government, a favorite target is the drug industry. Drug companies are responsible for only a
small percentage of the total national health care bill, but medicine is a highly visible expense
because it is often paid out-of-pocket (unlike the pages of hospital bills sent directly to medical
insurers). College students sometimes have a similar experience if their tuition, room, and board are billed to their parents or paid by loans, while textbooks, though only a small fraction of college costs, are bought separately and paid for out of summer job earnings.
The prices charged by drug companies have risen three times as fast as inflation, and identical drugs are often sold overseas at a small fraction of their U.S. prices.
Dramatically escalating vaccine prices caused parents to forego immunizing their toddlers,
resulting in measles outbreaks in several urban areas. Pharmaceutical ads in medical journals
and free promotional samples encourage over-prescribing of certain drugs, and a recent
campaign has bypassed doctors entirely with "ask your doctor to prescribe" ads. Drug
manufacturers have always claimed that their high profits are necessary because of the high risks
of launching new drugs, but their stable of past successful products has kept most large
companies from losing money. The drug industry never seems to experience a recession.
A common justification voiced by the industry for rising prices is the need to recover past
research expenditures and finance the development costs of the miracle drugs of the future. The
companies cite the increasing expense of gene splicing procedures, state-of-the-art lab
equipment, and lengthy testing requirements.
Suppose the National Association for Retired Persons is lobbying to cap the prices of
prescription medicine. Perhaps dispersion analysis can shed light on the claim by drug
companies that research and development (R&D) costs are an enormous drain on revenues.
Using R&D expenses as a percentage share of sales, the NARP analyst measures the variation
among the 26 largest drug companies. These shares, already sorted from lowest to highest, are
presented in Table 3.2 (and found in the chapter worksheet file).
Table 3.2
Research & Development Expenditures as a Percent of Sales Revenue
for Drugs and Drug Research Companies with Sales over $100 Million

Company Name R&D Share
Medco Containment Services 0.1 %
Perrigo 0.6
SPI Pharmaceuticals 1.3
ICN Pharmaceuticals 1.4
R. P. Scherer 2.6
A. L. Laboratories 4.2
IVAX 6.0
American Home Products 6.1
Forest Laboratories 6.5
Life Technologies 7.6
Carter-Wallace 7.7
Alza 8.3
Warner-Lambert 8.4

Company Name R&D Share
Allergan 8.9
Bristol-Myers Squibb 8.9
Genzyme 9.7
Pfizer 10.9
Merck 11.5
Rhone-Poulenc Rorer 11.6
Schering-Plough 11.8
Eli Lilly 13.4
Marion Merrell Dow 13.8
Upjohn 14.3
Amgen 15.8
Syntex 17.4
Genetech 46.3
Data derived from Business Week's R&D Scoreboard, June 29, 1992, pp. 114-115.

The analyst could report the variability by the difference between the maximum and minimum
values in the population. With the data already sorted by R&D share, the range is easily
calculated to be
Range = 46.3 - 0.1 = 46.2
But the 46.3 percent share for Genetech, a company on the cutting edge of genetic engineering
technology, clearly represents an outlier in this population. At the other end of the listing is a
handful of corporations with very low R&D shares. The interquartile range might present a more
representative view of the typical dispersion in drug company R&D shares.

Data Display

r&d drug
 0.1   0.6   1.3   1.4   2.6   4.2   6.0   6.1   6.5   7.6   7.7
 8.3   8.4   M   8.9   8.9   9.7  10.9  11.5  11.6  11.8  13.4  13.8
14.3  15.8  17.4  46.3

Q1 = 5.55     Q3 = 12.2     (M marks the position of the median)
Figure 3.17


A sorted array of data is like a road map or a painting (see Figure 3.17). We can focus on
one aspect and then shift attention to another. Looking at the first and last values, it is easy to find
the maximum (46.3), the minimum (0.1), and thus the range (46.2). Since there are two rows of 11
entries and a third row of four, the median (halfway between the 13th and 14th values, both in the
second row) must be (8.4 + 8.9) / 2, or 8.65.
For the drug industry data, Minitab supplied range information on the second line of its
Descriptive Statistics output (Figure 3.18).
Descriptive Statistics

N MEAN MEDIAN TRMEAN STDEV SEMEAN
r&d drug 26 9.81 8.65 8.70 8.82 1.73

MIN MAX Q1 Q3
r&d drug 0.10 46.30 5.55 12.20

Figure 3.18

The minimum (0.1) and maximum (46.3) are present for calculating the range, and the quartile
values Q1 = 5.55 and Q3 = 12.2 allow us to find the IQR.
The mean of nearly 10 percent, almost triple the overall mean for all industries, provides
evidence of the high average level of research and development support by the drug companies.
However, the large mean is misleading because of the large variability in R&D effort. Nor is
this variability confined to one or two outliers. The staff analyst reports that the middle 50
percent of the drug companies ranged widely in research and development expenditures, from
5.55 to 12.20 percent of their sales revenue, and the interquartile range is 12.20 - 5.55, or 6.65
percentage points. The first quartile contains several large drug companies with relatively low
R&D efforts. This does cast doubt on the claim that drug companies generally plow back into
research substantial shares of their revenues from high-priced medicines.
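For the drug data the rankings (N + 1) / 4 = 6.75 and 3(N + 1) / 4 = 20.25 fall between observations, so interpolation is needed. A small Python sketch of that calculation (our own helper, not a Minitab command) reproduces, at least for this data set, the quartiles on the Minitab output above:

rd = [0.1, 0.6, 1.3, 1.4, 2.6, 4.2, 6.0, 6.1, 6.5, 7.6, 7.7,
      8.3, 8.4, 8.9, 8.9, 9.7, 10.9, 11.5, 11.6, 11.8, 13.4,
      13.8, 14.3, 15.8, 17.4, 46.3]          # Table 3.2, sorted R&D shares

def value_at_rank(data, rank):
    # 1-based ranking; interpolate when the ranking is fractional
    k = int(rank)
    fraction = rank - k
    value = data[k - 1]
    if fraction > 0:
        value += fraction * (data[k] - data[k - 1])
    return value

n = len(rd)                                   # 26 drug companies
q1 = value_at_rank(rd, (n + 1) / 4)           # ranking 6.75 -> 5.55
q3 = value_at_rank(rd, 3 * (n + 1) / 4)       # ranking 20.25 -> 12.2
print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))   # IQR = 6.65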
The interquartile range has a limitation that is the opposite of the one associated with the range.
Because it is unaffected by extremes, the IQR provides no information about the overall spread of the
data. Fortunately, this isn't an either-or issue. By examining both range measures in a
graphical diagram, we can obtain a more thorough summary of the variation in the data. In
addition, we may also rely on measures of average variability, our topic for the next section.
3.3 MEAN VARIABILITY MEASURES
Range information is one way of summarizing the dispersion in a data set. However, we
may want to measure the average variability among all the data, not just between a couple of
observations.
The Standard Deviation and Variance
The most commonly used numerical measure of variability is the standard deviation, σ. The symbol
σ (pronounced "sigma") is the lowercase Greek letter corresponding to our "S"; recall that we use the
uppercase sigma, Σ, for summation. The standard deviation was first developed a century ago.
[Footnote 5: The term was first coined by Karl Pearson in 1893. Cited in Stephen M. Stigler's
The History of Statistics, Cambridge, MA: Belknap Press, 1986, p. 328.]

The standard deviation for a population is defined as follows:
DEFINITION: The population standard deviation, σ, is the square root of the mean of the squared
differences from the population mean, μ.
As we saw, statistical ideas often benefit from the compactness and precision of mathematical
formulation. Algebraically, the verbal definition of σ can be expressed by the following:

    σ = √[ Σ(Xi - μ)² / N ]

where we use the same notation developed in the formulas for the mean. This time, the expression
to be summed is more complicated: the squared difference of each of the Xi observations from
the population mean, (Xi - μ).
Until now, we have emphasized the intuitive appeal of statistics such as the mean,
median, mode, range, and interquartile range. Suddenly, we encounter a descriptive statistic that
appears unnecessarily complicated. If our goal is measuring average variability, why not design
a measure of average variability that parallels our definition of the mean? As you remember, μ is
the arithmetic average found by summing the data and dividing by the population size N. One
way to measure variability is by the amount each observation differs, or "deviates," from the
mean. Hence the term deviation, the numerical difference from the mean.
These deviations can be found in the calculations for σ in each (Xi - 15) term. For the realty
agency data, the four deviations from μ = 15 are (10 - 15), (12 - 15), (16 - 15), and (22 - 15), or
-5, -3, +1, and +7. Notice that deviations may be positive or
negative, depending on whether home sales for an agency were
above or below mean sales. We cannot simply define our
measure of average variability by the mean deviation,
    mean deviation = Σ(Xi - μ) / N
If we try to use this formula, the sum of deviations in the
numerator will cancel to zero.
(-5) + (-3) + (+1) + (+7) = 0
Was this result a fluke? No. The mean deviation must
always be zero. There are no deviations on average because the
mean is defined to leave no net deviation. To see this, notice
that Σ(Xi - μ) is the sum of the Xi data, ΣXi, followed by N
successive subtractions of the mean μ, or ΣXi - Nμ. But Nμ
equals ΣXi because μ = ΣXi / N, as we learned earlier in this
chapter. Therefore, summing the deviations amounts to
subtracting the sum of the data from itself.
The reason we cannot use the mean deviation is that the
algebraic sum allows positive and negative deviations to offset
each other. We would like our measure of average variability to
combine negative and positive deviations so they don't cancel
out. Thus, we have to find a way to eliminate the negative
signs. Squaring the deviations is one way to assure us of only positive numbers to sum. But
when we square the deviations, we also create a new problem. Our average is now a mean for a
sum of squared deviations, not of the deviations themselves. At some point we have to "un-
square" the results. This is the reason for taking the square root.
The mean of the squared deviations is itself a fundamental concept in statistics and is
called the variance.



R&D drug deviations dev square
0.1 -9.7 94.09
0.6 -9.2 84.64
1.3 -8.5 72.25
1.4 -8.4 70.56
2.6 -7.2 51.84
4.2 -5.6 31.36
6 -3.8 14.44
6.1 -3.7 13.69
6.5 -3.3 10.89
7.6 -2.2 4.84
7.7 -2.1 4.41
8.3 -1.5 2.25
8.4 -1.4 1.96
8.9 -0.9 0.81
8.9 -0.9 0.81
9.7 -0.1 0.01
10.9 1.1 1.21
11.5 1.7 2.89
11.6 1.8 3.24
11.8 2 4
13.4 3.6 12.96
13.8 4 16
14.3 4.5 20.25
15.8 6 36
17.4 7.6 57.76
46.3 36.5 1332.25
Figure 3.24
DEFINITION: The population variance, σ², is the mean of the squared deviations from the
population mean μ, or

    σ² = Σ(Xi - μ)² / N

The standard deviation thus may be calculated directly from the variance, its square:

    σ = √σ²
We will learn about several other uses for variance in later chapters.
The standard deviation is one of the most important and yet most challenging concepts in
learning statistics. Let's develop it in a step-by-step way by continuing our drug industry case
study. What if the NARP staff analyst is not content to express variability as a range and wants
to use all the information in the data to calculate the average variation of R&D shares? Recall
that the mean μ for the R&D shares is about 9.8 percent. Using a spreadsheet, we create a column of
deviations from the mean. We merely subtract 9.8 from each observation in the R&D share
column. We next create a second column containing squared deviations. The resulting Excel
spreadsheet is displayed in Figure 3.24.
It is easy to verify that the deviations from the mean are the result of subtracting 9.8, the
industry mean, from each R&D share (the r&d drug column). Notice that the deviations (in
the column labeled deviations) are negative for the first 16 rows because the companies have
R&D shares below 9.8 percent. By contrast, the positive deviations in the bottom ten rows result
from above-average R&D shares. Convince yourself that each number in the right-hand column
(dev square) is the square of the value to its left. Notice also how negative deviations are
converted to squared deviations with positive signs. The sum of squared deviations is 1945.4.
When divided by N = 26, the variance turns out to equal 74.8. This mean may be easily found
using the spreadsheet function for the average. The square root of this variance of 74.8 is 8.65.
Verify this result with your calculator. Thus the standard deviation, σ, of R&D shares is 8.65
percent. Whenever σ approaches the size of μ, this indicates a large variability in the population.
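The spreadsheet calculation in Figure 3.24 can be replicated in a few lines of Python (a sketch; like the text, it uses the rounded mean of 9.8 rather than the exact 9.81 from the Minitab output):

rd = [0.1, 0.6, 1.3, 1.4, 2.6, 4.2, 6.0, 6.1, 6.5, 7.6, 7.7,
      8.3, 8.4, 8.9, 8.9, 9.7, 10.9, 11.5, 11.6, 11.8, 13.4,
      13.8, 14.3, 15.8, 17.4, 46.3]        # R&D shares from Table 3.2, in percent

mean = 9.8                                 # rounded industry mean used in Figure 3.24
squared_devs = [(x - mean) ** 2 for x in rd]

sum_sq = sum(squared_devs)                 # about 1945.4
variance = sum_sq / len(rd)                # about 74.8
sigma = variance ** 0.5                    # about 8.65 percent
print(round(sum_sq, 1), round(variance, 1), round(sigma, 2))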

The Normal Distribution Rule
By now, you are surely curious about what our staff analyst plans to do with the standard
deviation. How may it be interpreted? Although Part III of the text will explore the
role played by the standard deviation and variance in decisions involving sample inferences, we can
preview one use right now. Often, just over two-thirds of a data set lies within one standard
deviation of the mean and nearly all of the observations lie within two standard deviations.
Normal Distribution Rule: If a population has a symmetric, unimodal, and bell-
shaped distribution, approximately 68 percent of the observations will lie within
one standard deviation of the mean and about 95 percent within two standard
deviations of the mean.
Many real-world populations have these bell-shaped distributions, called normal distributions.
These percentage rules may often be used as rough guidelines when we know μ and σ.
[Footnote 6: The extreme alternative, Chebyshev's inequality, sets mathematical minimum percentages.
For example, only 75 percent of the observations must lie within 2σ of μ. G. Cobb (Journal of the
American Statistical Association, March 1987, p. 324) argues that this standard is too conservative a
rule for practical business decision-making situations.]

For example, we determined earlier that the air traffic case contained symmetric,
unimodal data.
Suppose all we had were descriptive statistics about airport arrivals, but we knew that
monthly figures historically have a normal distribution. The first line of the Minitab Descriptive
Statistics output furnishes us with the information in Figure 3.25.
Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
AirTraf 67 829.5 822.0 828.0 110.9 13.5
Figure 3.25

How well does application of the normal distribution rule do? In this case, very well. We know
that μ = 829.5 thousand passengers and σ = 110.9. Two standard deviations is 2(110.9), or
221.8. We expect about 95 percent of the months to have arrival totals within two standard
deviations of the mean, or between μ - 2σ = 607.7 and μ + 2σ = 1051.3. Inspection of the sorted
data (Figure 3.26) reveals that only two months had more than 1051 arrivals and one other had
fewer than 607.7. The remaining 64 out of 67 months, or 95.5 percent, are within the 2σ
interval.

Data Display

AirTraf
597 633 650 653 680 681 683 704 713 723 723
727 739 743 747 750 751 752 754 756 758 762
766 771 776 781 792 792 793 795 801 813 819
822 830 831 833 834 844 850 850 853 859 874
877 882 884 884 900 903 906 909 916 917 921
941 942 947 947 964 990 1028 1032 1040 1047 1054
1086
Figure 3.26

With a bit more effort, we can count 46 months between 718.6 and 940.4, the boundary
points one standard deviation from the air arrival mean of 829.5. Forty-six out of 67 months is
68.7 percent, almost exactly the 68 percent the normal distribution rule predicts. Thus, knowing
only the mean and standard deviation of a bell-shaped distribution permits us to determine
intervals containing different portions of the data.
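A short Python sketch can confirm these counts from the sorted arrival data and the Minitab mean and standard deviation (the helper function below is ours, not Minitab output):

air = [597, 633, 650, 653, 680, 681, 683, 704, 713, 723, 723,
       727, 739, 743, 747, 750, 751, 752, 754, 756, 758, 762,
       766, 771, 776, 781, 792, 792, 793, 795, 801, 813, 819,
       822, 830, 831, 833, 834, 844, 850, 850, 853, 859, 874,
       877, 882, 884, 884, 900, 903, 906, 909, 916, 917, 921,
       941, 942, 947, 947, 964, 990, 1028, 1032, 1040, 1047, 1054,
       1086]                                # monthly arrivals (thousands), Figure 3.26

mean, sd = 829.5, 110.9                     # from the Minitab output in Figure 3.25

def share_within(data, center, half_width):
    # count and fraction of observations falling in center +/- half_width
    inside = [x for x in data if center - half_width <= x <= center + half_width]
    return len(inside), len(inside) / len(data)

print(share_within(air, mean, sd))          # (46, 0.687): close to the 68 percent rule
print(share_within(air, mean, 2 * sd))      # (64, 0.955): close to the 95 percent rule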
Even for moderately skewed distributions, the two-standard-deviation interval is often a
decent approximation. Suppose that the staff analyst for the NARP has descriptive statistics
about the chemical industry. The chemical industry is the most closely related to the drug industry
in production and research, and it therefore serves as a good comparison for judging how
intensive R&D effort is in each industry. Table 3.3 contains the population data for the 42 large
chemical companies sorted by R&D share of sales revenue. Figure 3.27 presents the Minitab
histogram.
[Minitab histogram of r&d chem: frequency versus R&D share]
Figure 3.27
Table 3.3
Research & Development Expenditures as a Percent of Sales Revenue
for Large Chemical Companies

Company Name R&D Share
Mississippi Chemical 0.4
First Mississippi 0.8
Trans Resources 0.9
Grow Group 1.1
NL Industries 1.1
Scott 1.3
Rexene 1.4
LeaRonal 1.6
Ferro 1.7
Witco 1.7
Olin 1.8
Engelhard 1.9
Quantum Chemical 1.9
H. B. Fuller 2.0
Crompton & Knowles 2.1
G-I Holdings 2.2
Arco Chemical 2.3
Cambrex 2.3
Cabot 2.5
W. R. Grace 2.5
Lawter International 2.5

Company Name R&D Share
Air Products & Chemical 2.7
Ethyl 2.7
Hercules 2.9
Great Lakes Chemical 3.0
B. F. Goodrich 3.1
Morton International 3.1
Union Carbide 3.2
DuPont 3.4
International Specialty 3.4
Petrolite 3.5
Nalco Chemical 3.8
Betz Laboratories 4.0
Loctite 4.0
UCC Investors Holdings 4.0
MacDermid 4.2
Dexter 4.5
Lubrizol 5.4
Dow Chemical 6.2
Rohm & Haas 6.6
Monsanto 7.1
American Cyanamid 9.9
Source: Business Week's R&D
Scoreboard, June 29, 1992.
Clearly, the distribution of R&D shares is not bell-shaped, primarily due to a few larger
values. However, unimodal business data that is only slightly skewed often still approximates
the one- and two-standard deviation rules. Let's see if the NARP analyst can do this for the chemical
industry data. Suppose they had only been furnished with the Minitab descriptive statistics
output (see Figure 3.28).

Descriptive Statistics
Variable N Mean Median TrMean StDev SEMean
r&d chem 42 3.017 2.600 2.855 1.871 0.289
Figure 3.28

If the distribution were bell-shaped, we would expect about 95 percent of the companies
to have shares within two standard deviations of the mean. How well does the rule work for this
modestly skewed distribution? Rounded to one decimal place, μ = 3.0 and σ = 1.9. Two
standard deviations is 2(1.9), or 3.8. Inspection of Table 3.3 reveals only two companies,
American Cyanamid at 9.9 and Monsanto at 7.1, above μ + 2σ = 3.0 + 3.8 = 6.8. The remaining 40 of
42 companies, or 95 percent, are within the 2σ interval. What about companies at the low end of
R&D effort? Because μ - 2σ is a negative number (3.0 - 3.8 = -0.8) and it is impossible to spend
negative amounts on R&D, no company can fall below the lower end of the interval.

Weaknesses of the Standard Deviation
There are many other important uses for the standard
deviation that cannot be discussed until Parts II and III of this text. But as useful as the standard
deviation is, it has several major weaknesses. As we already emphasized, the standard deviation
is one of the least intuitive descriptive statistics. Even for small data sets, its value is difficult to
guess. Once it is computed, standard deviations may be less useful than other measures of
dispersion for business decision making. Although the standard deviation has convenient
relationships for the normal and other distributions, applying the two-standard deviation rule to
skewed distributions can lead to highly erroneous business decisions.
Finally, squaring the deviations has an often-undesirable side effect. For example, notice
that 2² is four times as large, not twice as large, as 1². The observations furthest from the mean
produce the largest squared deviations, accentuating the influence of these observations on the
standard deviation.
To illustrate the potential effect of an outlier, the last six lines of the drug industry
deviation output are reprinted in Figure 3.29. The last output line corresponds to the data for
Genetech, whose large deviation (36.5) creates a squared deviation (1332) that swamps those
of the other companies in the industry. Recall that the sum of the entire column was only 1945.4!
Therefore, squaring the deviations enlarges the impact of the Genetech outlier, causing it to be
responsible for over two thirds of the variance.

R&D drug   deviations   dev square
13.4        3.6          12.96
13.8        4            16
14.3        4.5          20.25
15.8        6            36
17.4        7.6          57.76
46.3       36.5        1332.25
Figure 3.29
Its complex arithmetic relationship to the mean makes the standard deviation less intuitive
than other measures of dispersion, especially if data are not distributed normally. The
standard deviation is also highly sensitive to outliers in the data.
Multiple Choice Review:
3.65 Which of the following is not a measure of dispersion?
a. standard deviation
b. range
c. maximum
d. mean absolute difference
e. all of the above are measures of dispersion

3.66 The problem with using the range to determine dispersion is that the range does not take
account of
a. the smallest value
b. the largest value
c. the difference between the largest and smallest value
d. values other than the largest and smallest
e. none of the above

3.67 In calculating the standard deviation which operation is not involved?
a. summation over the number of observations
b. squaring differences from the mean
c. averaging over the number of observations
d. obtaining the square root
e. all the above operations are involved

3.4 RANDOM SAMPLE INFERENCES ABOUT A POPULATION
Suppose you have the time and resources to survey only a portion of the 38 largest
computer and electronics firms. In Chapter 1, we introduced the idea of collecting a sample from
a population and the enormous cost- and time-saving advantages samples often offer us. When
limited to a sample, we may use statistical principles to guide us in making the best use of
available information.
If we must use a sample to make an educated guess about the population mean, μ,
statisticians usually advise us to draw a random sample because of its desirable properties.
DEFINITION: A sample of size n drawn from a population is a random sample if every
possible group of n observations in that population has the same chance of being selected.
[Footnote 8: This is actually called a simple random sample to distinguish it from other types of
random samples discussed in Chapter 5.]

Random sampling may be explained in terms of a lottery. Most states now sell lottery
tickets, and winning tickets are those that match a sequence of randomly selected numbers. The
usual selection method is to set spinning balls marked with all possible numbers (say one to 40)
in a row of clear containers. A numbered ball then pops up into the chamber atop each container.
Imagine a lottery machine containing balls marked with the CEO compensations for each
of the 38 firms in the population. Then our random sample could be collected by selecting n =
10 different balls from the bowl. [Footnote 9: For this type of sampling, we eliminate the
possibility of getting the same ball twice by setting aside each drawn ball.]
The sample is random because the balls are of the same size
and weight and are thoroughly mixed so that each has the same chance of popping up.
How would a random sample be collected in a business statistics situation? Although we
do not find population data spinning in lottery containers, we can often reproduce the random-
selection conditions present in a lottery. A numbered listing of the 38 electronics and computer
firms could be used to randomly select ten firms for your random sample. [Footnote 10: Such a
listing is often called the sampling frame. Even if we cannot list everything in a sampling frame,
we can still construct our samples if we can specifically describe the characteristics of the
population.] One method would
be to place the numbers 1 through 38 on identical slips of paper in a bowl, mix very well (don't
forget the draft lottery), and draw ten slips. You would identify the corporation names
associated with each of the ten numbers drawn and restrict your survey of CEO compensation to
these ten firms. Today's computers can help us design, collect, and even simulate a random
sampling process. We will see how beginning in Chapter 7.
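For example, the slips-of-paper procedure can be simulated in a couple of lines of Python (a sketch; the seed and variable names are arbitrary):

import random

random.seed(1234)                     # any seed; fixing it makes the draw repeatable
firm_numbers = range(1, 39)           # the 38 electronics and computer firms, numbered 1 to 38

# Draw n = 10 numbers without replacement, like slips pulled from a well-mixed bowl
sample_ids = random.sample(firm_numbers, k=10)
print(sorted(sample_ids))             # survey CEO compensation only at these ten firms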

Descriptive and Inferential Statistics
If we know the entire population, the characteristics of each desired variable can be
calculated directly from the data. We called this type of analysis descriptive statistics. The
population mean , median M, and many other statistical summary measures may be used to
describe the variables needed for decision making. These characteristics of a population are
called parameters.
DEFINITION: A parameter describes some characteristic of a variable in the population.
The value for a particular parameter will vary from one population to the next. The mean CEO
compensation will be different in other industries and will probably vary from year to year. For
example, μ may be only $600,000 in the airlines industry and M might have fallen to $1.2
million by 1994 for electronic industry CEOs. However, once we establish which population is
relevant to the decision-making problem, the parameters for that population are fixed.
What if the population data are not available to us? Without having data on the entire
population, how do you provide the Board of Directors at Tangerine with guesses of the
unknown population mean and median? You must use what information you have: sample
evidence.


A favorite trick of Sherlock Holmes was to use items sampled from the crime scene, such as a
hair found on a hat or dirt scraped from a shoe, to deduce the motive or identity of the murderer.
Statisticians also use sample evidence to deduce, or "infer," something about the population.
This procedure is called statistical inference.
DEFINITION: Statistical inference is the process of using sample data to make judgments about
the parent population.
By applying formulas called estimators to sample data, we can calculate numerical estimates of
the unknown population parameters such as μ.
DEFINITION: An estimate is a numerical guess for the unknown parameter when the
population data are unavailable; an estimator is an expression or formula used to calculate the
estimate for any particular set of sample data.
Suppose we have completed a survey of CEO compensations and stored the random
sample in the computer. Since our random sample also has a mean and median, you can report
these in place of the unknown population mean and median. Simply apply to the sample data the
formula for the mean and the procedure for finding the median. These methods for finding the
sample mean and sample median become our estimators of the unknown population mean and
median.
DEFINITION: The sample mean X̄ is an estimator of μ, and the sample median, m, is an
estimator of M.
We refer to the sample mean by the notation X̄ (pronounced "X-bar"); placing the bar over a
variable is the shorthand way of saying "take the mean of the sample data on that variable."
Notice that the formula for X̄ is the same as that for μ, except now we sum over n, the
sample size, rather than all N observations in the population. If we want to estimate M, we could
use the sample median m. The sample median also is located from a sample size n rather than
N.

Bias
Because we are only working with a random sample, X̄ seldom (if ever) exactly equals μ
and may often be quite different. However, the sample mean is correct "on average." In other
words, if we were to draw thousands of random samples of size n = 10 observations from the
population of CEO compensation, the sample means would average pretty close to 1544, the true
population mean. This property of X̄ can be verified by simulations we will do in Chapter 8. We
therefore say that X̄ is an unbiased estimator of μ. Estimators which lack this property are biased.
DEFINITION: An estimator is an unbiased estimator of a population parameter if the mean of
very many sample estimates is the same as that parameter. If this mean is larger or smaller than
the parameter, the estimator is biased.
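Unbiasedness is easy to illustrate by simulation, even before the formal treatment in Chapter 8. The sketch below uses a made-up population of 38 CEO compensations (the chapter's actual data set is not reproduced here), draws many random samples of size n = 10, and checks that the sample means average out very close to the population mean:

import random
from statistics import mean

random.seed(7)

# Hypothetical population of N = 38 CEO compensations (in $ thousands)
population = [random.randint(300, 5000) for _ in range(38)]
mu = mean(population)

# Average the sample mean over many random samples of size n = 10
sample_means = [mean(random.sample(population, 10)) for _ in range(10000)]

print(round(mu, 1))                  # the population parameter
print(round(mean(sample_means), 1))  # lands very close to mu: X-bar is unbiased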
Although unbiasedness is an appealing quality for an estimator to possess, unbiased
estimators are not always superior to biased ones. For example, we told you that X̄ is unbiased.
Yet outliers in the population may occasionally cause large differences between X̄ and μ from
one sample to the next. Furthermore, we stated earlier that μ may not be a very appropriate
measure of the average when outliers pull the mean far from the median.
By contrast, the sample median is not in general an unbiased estimator of the population
median. For the CEO compensation data, the sample median on average is a bit smaller than the
population median. Yet we saw previously that the sample medians for this type of data did a
much better job than the sample means did in guessing the respective population parameters.
The sample median, although biased, may be the better of the two estimators for reporting on
CEO compensation.
The task of choosing the best statistical methods, procedures, or measures is a general
one for the data analyst. We will continually confront this issue as we introduce other methods
of statistical analysis.

The Sample Standard Deviation
As often happens in business situations, we cannot obtain the population data in a timely
manner and within the prescribed budget. We just learned that an unbiased estimator of the
population mean μ is its sample counterpart, X̄. The parameter we now wish an unbiased
estimator of is the population standard deviation σ. We call its estimator the sample standard
deviation, s.

DEFINITION: The sample standard deviation, s, is the square root of the sum of squared
deviations from the sample mean divided by n - 1, where n is the sample size:

    s = √[ Σ(Xi - X̄)² / (n - 1) ]

The smaller denominator produces a larger result than if we divided by n.
For the mean, recall that we computed X̄ by a formula that contains the sample size n
rather than the population size N. It seems logical again to replace N with n in the formula for σ
to estimate σ from sample data. Unfortunately, this formula for s would be a biased estimator of σ
because even the average over very many samples would be too small. By using a smaller
divisor, n - 1 instead of n, we increase the quotient to overcome this downward bias problem.
[Footnote 11: More precisely, it is s², the sample variance, that is an unbiased estimator of σ².
Although dividing by n - 1 prevents bias in s², we still cannot claim that s is an unbiased
estimator of σ. In most situations, s will suffer from a slight downward bias. The Exercises
contain a numerical illustration of how s can be biased even if s² is not.]


Degrees of Freedom
Understanding the sample standard deviation is important to appreciating statistical
inference. So it is a fair question to ask where the n - 1 came from. The secret behind the n - 1
divisor involves degrees of freedom, an indispensable statistical concept we will apply over and
over again in this text.
DEFINITION: Degrees of freedom count the separate pieces of information present.
A sample of n observations starts out containing n pieces of information. What causes a
sample to lose some of its original information? Sometimes we must first use up information
before we can even begin our computations. In calculating s, we run into this very problem. To
calculate deviations from the sample mean, we must first calculate X̄ from the sample data.
After "mining" the data for one piece of information, the sample mean, we have one fewer
degree of freedom remaining in the sample. Using up one degree of freedom makes it proper to
average the squared deviations over only the remaining n - 1 degrees of freedom.
Another way of understanding the n - 1 divisor is to examine very small samples. In the
extreme case where n = 1, the sample mean is the single observation itself. However, computing
s is then impossible because we cannot divide by n - 1 = 0. We therefore can say nothing about
the variability in the population we sampled. That makes sense. How can we estimate
variability from a single observation? Only by observing variation in the sample can we draw


inferences about population variability. After extracting the single piece of information in the
sample, we have nothing left with which to calculate variability.
For n = 2, there is only one way to find variation in the sample data: the difference
between the two observations. So the average variation should logically be this difference itself.
By using n - 1 = 1 in the formula for s, we do in fact divide by one.
Finally, the case of n = 3 gives us only two unique differences, since the third difference
can always be found from the other two. For example, the sample 2, 5, 6 contains the differences
5 - 2 = 3, 6 - 5 = 1, and 6 - 2 = 4. But (6 - 2) is merely the sum of the other two, (5 - 2) + (6 - 5).
Thus, a divisor of n - 1 = 2 is appropriate for calculating s. [Footnote 12: This explanation is
based on the author's "Teaching Sample Standard Deviation in Applied Statistics," in American
Statistical Association 1990 Proceedings of the Section on Statistical Education, Alexandria, VA:
American Statistical Association, 1990.]
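A simulation makes the role of the n - 1 divisor concrete. The sketch below (ours, with an arbitrary made-up population) draws many samples with replacement and shows that dividing the squared deviations by n systematically underestimates the population variance, while dividing by n - 1 does not; as footnote 11 notes, the same is not exactly true of s itself.

import random
from statistics import mean, pvariance

random.seed(3)

population = [random.gauss(100, 15) for _ in range(1000)]   # any population will do
sigma_sq = pvariance(population)                            # true population variance

def sum_sq_dev(sample):
    xbar = mean(sample)
    return sum((x - xbar) ** 2 for x in sample)

n = 10
divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(20000):
    s = random.choices(population, k=n)        # sample of size n, drawn with replacement
    divide_by_n.append(sum_sq_dev(s) / n)
    divide_by_n_minus_1.append(sum_sq_dev(s) / (n - 1))

print(round(sigma_sq, 1))
print(round(mean(divide_by_n), 1))             # noticeably below sigma_sq: biased downward
print(round(mean(divide_by_n_minus_1), 1))     # close to sigma_sq: the bias is removed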

In the next chapter, this same logic will be extended to count degrees of freedom
remaining after more than one piece of information is used up. The basic principle remains the
same. Figure 3.31 summarizes in a flow chart the procedure involved with the computation of
population and sample standard deviations.

Sample versus Population Standard Deviations in Practice
The vacation expenditure survey from section 1 is an example of a random sample.
Recall that a random sample of n = 40 graduate students was asked about their spending this year
on vacations. The first line of Descriptive Statistics (Figure 3.32) furnishes us with the sample
mean, median, and standard deviation.
Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
vacation 40 883 512 830 913 144

Figure 3.32

Notice that the standard deviation (StDev) reported in Minitab is 913. This same value also may
be found near the middle of the Excel output (see Figure 3.33). But which standard deviation
does the computer give us: the one averaged over n or n - 1?



Minitab and Excel, like other statistical software and spreadsheets, compute s rather than
σ. Statistics typically involves making inferences about the population from a random sample,
and the standard deviation is one of the most powerful measures used in formulating these
inferences. The sample standard deviation is what we normally seek, so most computer packages
assume we are always working with sample data.
What if we did have the complete population of all graduate students, not a random
sample of them? Can we use the Descriptive Statistics output (which involves the n - 1 divisor)
to find σ? Yes. Simply multiply s by the square root of (N - 1) / N. The adjustment factor here
is the square root of (40 - 1) / 40, or nearly 0.987. Multiplying 913 by 0.987 yields 901. For
smaller populations the downward correction will be a bit more. For example, a population with
N = 10 would adjust by the square root of 9 / 10, or about 0.95.
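The adjustment takes one line of code. A sketch of the conversion just described:

from math import sqrt

s = 913            # sample standard deviation reported by Minitab and Excel (rounded)
N = 40             # treat all 40 graduate students as the population

sigma = s * sqrt((N - 1) / N)          # about 901, as computed in the text
print(round(sigma, 1))

print(round(sqrt((10 - 1) / 10), 3))   # for N = 10 the correction factor is about 0.949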
vacation
Mean 882.575
Standard Error 144.4139
Median 512.5
Mode 0
Standard Deviation 913.3537
Sample Variance 834215
Kurtosis -0.82358
Skewness 0.809083
Range 2750
Minimum 0
Maximum 2750
Sum 35303
Count 40
Confidence Level(95.0%) 283.0456
Figure 3.33
CASE MINI-PROJECT:

A miniature golf course had the following customer attendance last year for each of 52 weeks:
Data Display

1342 1433 1286 1853 2161 3127 3547 4170 5437 5220 4891
4660 4170 5682 5990 4338 4387 4072 3126 2420 2067 2490
2466 2920 3926 3769 4011 4726 4290 5210 5062 4827 4317
4180 3294 3110 2627 2212 1643 1482 1272 1622 1937 1363
1594 1167 1202 1307 1195 558 673 1240

1. The sorted listing below is more useful for finding the following statistics (circle any of the
words that apply): mean, median, mode, range, standard deviation
Data Display

558 673 1167 1195 1202 1240 1272 1286 1307 1342 1363
1433 1482 1594 1622 1643 1853 1937 2067 2161 2212 2420
2466 2490 2627 2920 3110 3126 3127 3294 3547 3769 3926
4011 4072 4170 4170 4180 4290 4317 4338 4387 4660 4726
4827 4891 5062 5210 5220 5437 5682 5990

2. The median, 3015, is not an actual weekly attendance figure in the data because the number
of weeks of data is a(n) ________ number. The median in this example was calculated by
taking the average of the following two numbers: ________ and ________.

3. The mean, 3021, is almost exactly the same as the median, 3015. An examination of the
sorted data explains the reason for this similarity: there are no ________s in the data to
shift the mean away from the median.

4. The range of the attendance encountered over the past year is ________.

5. The total attendance for the previous year was ________, which can easily be calculated by
multiplying the number of weeks by the following statistic: ________.

6. The mode for attendance was ________, but this number is not a very representative measure of
the average here because the mode only occurred ________ times out of the 52 weekly observations.
Multiple Choice Review:
3.80 Random samples
a. must be collected from lottery machines
b. are the results of simulations
c. have the same chance of occurring as any other sample of the same size
d. require a computer to collect
e. none of the above

3.81 Which of the following is an example of random sampling?
a. people attending a shopping mall on a weekday afternoon
b. respondents to an evening phone survey
c. a mailed questionnaire sent to people with drivers licenses
d. phone response to a 900 number announced during the network evening news
e. drawing 10 names from a well-mixed hat containing names of everyone in the
population

3.82 A description of a variable in the population is a(n)
a. estimator
b. estimate
c. parameter
d. sample
e. none of the above

3.83 Statistical inference is the process of
a. describing a population from population data
b. describing a sample using sample data
c. making guesses about sample estimates by using population parameters
d. making guesses about population parameters by using sample estimates
e. all of the above

3.84 ________ is the parameter but ________ is an estimator of the mean
a. X̄ and μ
b. X̄ and M
c. m and X̄
d. μ and X̄
e. none of the above

3.85 If the mean for a very large number of samples is equal to the parameter being estimated,
the estimator is
a. unbiased
b. a sample
c. not an outlier
d. random
e. none of the above
3.86 An outlier is defined as
a. an observation considerably larger than any other in the sample
b. an observation considerably smaller than any other in the sample
c. an observation considerably larger or smaller than any other in the sample
d. observations that are measured incorrectly
e. observations that should be deleted from the sample

3.87 Which of the following conclusions does not involve potentially large sampling bias?
a. poverty cannot be difficult to overcome. All the people I know that came from the
ghetto are doing well financially now.
b. I can't see why our product isn't selling. Our most trusted customers tell me we
have the best product on the market.
c. I knew I wanted to become a computer programmer. I loved computers, playing
around with the one at home for years.
d. I took art courses in school, so I know that I am not artistically inclined.
e. All of the above involve sampling bias.

3.88 To ensure that s has the same units and scale as the variable in the data,
a. we divide by n - 1
b. we square the differences from the mean
c. we sum all the squared differences
d. we take the square root of the final mean sum of squares
e. none of the above

3.89 To adjust for the lost degree of freedom in the sample estimator s
a. we divide by n - 1
b. we square the differences from the mean
c. we sum all the squared differences
d. we take the square root of the final mean sum of squares
e. none of the above

3.90 For a utility company responding to the 200 service outages last month, median repair
time was 45 minutes, the mean was one and one-half hours, the minimum was 15
minutes, the maximum was 1 day, and the standard deviation was 3 hours. The range of
service outage times last month was
a. 23.85 hours
b. 23.75 hours
c. 23.55 hours
d. 23.25 hours
e. 22.50 hours

Chapter Review Summary
Descriptive statistics quantify the information in data and summarize its different
characteristics. Business statistics carefully defines each type of descriptive statistic to precisely
communicate specific information to decision makers.
The mean, median, and mode are commonly used measures of the average. The mean is
an arithmetic measure related to the sum of the data while the median is defined in terms of
sorted data. One important purpose for using the mean is to calculate the total by the formula
Total = Nμ. There is no analogous way to determine the total from the median.
If the population size N is an odd number, the median M is the (N + 1) / 2 largest (or
smallest) value in the population. If the population size N is an even number, the median M lies
halfway between the (N / 2)th and (N / 2 + 1)th largest (or smallest) values in the population.
An outlier is an observation whose value is much larger or smaller than that of any
other observation in the data. The mean can be dramatically shifted in the direction of a few
outliers, unless other outliers offset this effect. Statistics such as the median that are not strongly
affected by outliers are called resistant measures. The median, as a resistant
measure, is often a much better measure of the average for data containing outliers. No
observations, even outliers that severely affect the mean, should be removed unless they can be
shown to be incorrectly recorded or to originate from the wrong population.
Because the mean and median are the same for symmetric data, either measure has all the
desirable qualities of the mean and the median with none of the limitations of either. For
unimodal, symmetrical distributions, the mode, median, and mean are identical. The mean is
greater than the median and often lies to the right of the modal class for distributions that are
skewed right. The opposite ordering of mean, median, and modal class occurs for skewed left
distributions. For highly skewed distributions, even the median will fall outside the modal class.
For most data sets, the variability is sufficiently large to merit description along with
measures of the average. With extreme amounts of variability, any measure of the average may
provide meaningless or misleading summary information. Data sets with similar averages may
differ considerably in their variability.
The range is determined only from two data points, the maximum and the minimum, while
the standard deviation averages the variability in the entire data set.
Where a population census cannot be obtained, a random sample should be used to make
inferences from sample estimates about the unknown population parameters. Computer
simulation can replicate the process of random sampling so that we can investigate the properties
of sample inference.
Unbiased estimators have the desirable quality of equaling their population parameter
counterpart on average. However, sometimes biased estimators may be preferred over unbiased
ones.
In calculating the sample standard deviation, we need to adjust for loss of one degree of
freedom so as to remove most of the bias. The mean absolute deviation avoids the distortion in
the standard deviation caused by squaring deviations.
For normal distributions, 68 and 95 percent of the data lie within one and two standard
deviations, respectively, of the mean. The interquartile range locates the middle 50 percent for
any data set.
Chapter Case Study Exercises
A. The list price of new mid- and full-sized cars models was surveyed and the summary
statistics reported below:

Variable N Mean Median TrMean StDev SEMean
PRICE 45 22795 20389 22620 6486 967

Variable Min Max Q1 Q3
PRICE 13206 35553 17091 27357

Answer the following four questions regarding the data described above:
1. The median price, $20,389, must be an actual car price in the data set because:
a. there is an odd number of car models in the data set
b. the average of the minimum and maximum is the median
c. the median and mode always are actual observed values, but the mean may not be
d. the median is less than the mean
e. the median does not have to be an actual car price in this data set

2. $22,347 is
a. The interquartile range
b. The range
c. The standard deviation
d. The mode
e. None of the above

3. If the distribution of prices is approximately bell-shaped, then we would expect about
95% of the prices to lie between:
a. $16,309 and $29,281
b. $12,972 and $33,361
c. $9823 and $35,767
d. $17,091 and $27,357
e. None of the above

4. If a car dealer were to stock one of each of the 45 car models, the total value of its
inventory would be approximately:
a. $600,000
b. $750,000
c. $900,000
d. $1 million
e. Cannot be calculated from the information given

Answers: A. 1a, 2b, 3c, 4d

B. In the 1993 NBA college draft, the 54 players drafted signed for annual salaries (in thousands
of dollars) given in the following sorted data listing, summary statistics, and histogram:
SALARY
125 125 125 125 125 145 160 170 175 200 220
245 275 300 305 325 365 380 400 430 475 500
550 575 600 625 660 700 745 775 800 825 865
900 950 1000 1100 1250 1300 1430 1500 1600 1700 1780
1875 1945 2000 2100 2200 2260 2335 2350 2400 2500

Variable N Mean Median TrMean StDev SEMean
SALARY 54 924 680 881 752 102
Variable Min Max Q1 Q3
SALARY 125 2500 294 1525

[Histogram of SALARY: frequency versus annual salary (thousands of dollars), with class marks at 100, 400, 700, ..., 2500]

Answer the following six questions based on this information:
1. The median annual salary is
a. the average of the 27th and 28th highest salary
b. half-way between $660,000 and $700,000
c. not an actual observation in the data set
d. all of the above are true
e. none of the above are true

2. The mode for this data set is
a. $125,000
b. $250,000
c. $680,000
d. $924,000
e. none of the above

3. The modal class interval is
a. $100,000 to $400,000
b. $125,000 to $2,500,000
c. $294,000 to $1,525,000
d. $924,000 to $2,628,000
e. none of the above

4. If a players union representative
stated that one-third of the first-year
players drafted earned less than
$400,000 per year, he would be
reporting information about the
a. mean
b. median
c. mode
d. modal class
e. minimum

5. The fiftieth percentile for annual
salary of NBA rookies was
a. more than a million dollars
b. just over $900,000 dollars
c. about half-a-million dollars
d. under $700,000
e. cannot be determined from the information given

6. By only examining the histogram for the salary data, we can immediately conclude that
a. The mean will be substantially larger than the median
b. There are outliers in the data
c. Salaries do not have a bell-shaped distribution
d. The standard deviation will be relatively large
e. All of the above

Answers: B. 1d, 2a, 3a, 4d, 5d, 6c

C. A regional supervisor for Coors has the price per 6-pack of its beer surveyed from the 25
supermarket chains in Los Angeles. Describe in a couple sentences the average and variability
of beer prices from the following computer output of descriptive statistics:

N MEAN MEDIAN TRMEAN STDEV SEMEAN
coors 25 3.6620 3.6500 3.6548 0.2446 0.0487

MIN MAX Q1 Q3
coors 3.2900 4.2900 3.4900 3.8200

Answer: Average beer price in L.A. supermarkets is about $3.65 a six-pack, but
prices range from as high as $4.29 to as low as $3.29.

D. New television shows are notoriously risky ventures. Nielsen ratings are the main variable
used to assess the success of a show and the rates that can be charged advertisers. A network
programmer would like you to summarize the Nielsen ratings for the 1993-94 season crop of
new shows. You obtain the following computer printout:

Data Display

NIELSEN
4.8 5.0 5.3 5.5 5.8 6.1 7.0 7.2 7.4 7.6 7.6
7.8 7.9 8.0 8.3 8.3 8.5 9.0 9.1 9.1 9.2 9.4
9.5 10.2 10.3 10.4 10.4 10.5 10.6 10.7 10.8 10.9 11.3
11.3 11.4 11.4 11.9 12.0 12.0 12.0 12.1 12.2 12.2 12.2
12.7 12.7 13.4 13.6 14.0 14.1 14.3 14.5 14.8 15.1 15.2
15.3 15.7 15.9 17.4 17.6 17.8 19.3 20.5

Descriptive Statistics
N MEAN MEDIAN TRMEAN STDEV SEMEAN
NIELSEN 63 11.176 10.900 11.077 3.586 0.452

MIN MAX Q1 Q3
NIELSEN 4.800 20.500 8.300 13.600

a) Based on the DESCRIBE information, locate and mark the mean and median Nielsen rating
on the sorted data listing above. [No explanation please]
b) Is the median an actual value in the data? Why or why not? In two sentences, explain
carefully how the median was calculated from this data set.
c) By examining the Nielsen data printed above, explain why the mean is not very different from
the median.
d) What is the RANGE for the Nielsen data (single number answer)?
e) Calculate (from the DESCRIBE output above) an interval two standard deviations on either
side of the mean? Are about 95 percent of the Nielsen ratings within this interval? Which
ratings (if any) are above and below these limits?
f) Write a one-sentence report summarizing to the network programmer your findings about
averages and dispersion of new program Nielsen ratings.
Answers:
(a) Circle 10.9 for the median and a point between 10.9 and 11.3 on the data listing as the mean.
(b) Yes, because n is odd. The observation with ranking (n + 1) / 2 in the sorted
listing is the median. Since (63 + 1) / 2 is 32, the median must be the 32nd largest (or smallest)
rating, or 10.9.
(c) No extremely large or small ratings occur, nor are there an overabundance of larger ratings
not offset by about the same number of lower ratings.
(d) Range = Max - Min = 20.5 - 4.8 = 15.7
(e) The interval two standard deviations on either side of the mean is
(11.176 - 2(3.586), 11.176 + 2(3.586)) = (4.00, 18.35).
Of the n = 63 ratings observed, 61 of the 63 are within this interval, about 97% and close to
95%. None of the show ratings were below the interval, but two were above (19.3 and 20.5).
(f) Nielsen ratings for 1993-94 new shows average approximately 11, but ratings ranged from as
low as slightly under 5 to one show that exceeded 20.



CHAPTER 4: DESCRIBING HOW
VARIABLES ARE RELATED
Approach: Without using probabilistic inference methods, we extend our description of data
from univariate measures to relationships among variables. In Chapter 2 we displayed bivariate and
multivariate relationships graphically. We now summarize these relationships by fitting
the data to linear equations and measure the strength of the fits. These equations are used
to predict outcomes and pose "what if?" questions that quantify individual variable
effects.
Where We Are Going: We will explore sample inference questions of variable relationships
after we introduce probability and inferential methods in Part II. Chapters in the last half
of the text on experimental design and regression modeling will extend the descriptive
linear relationship methods introduced here.
New Concepts and Terms:
linear fit, least-squares, and regression equations
goodness of fit, standard error of the estimate, and coefficient of determination
correlation and covariance
simple and multivariate regression
prediction and forecasting

SECTION 4.1 Summarizing a Variable Relationship by an Equation
SECTION 4.2 Fitting Data to an Equation and Measuring How Well it Fits
SECTION 4.3 Summarizing a Variable Relationship by a Single Number
SECTION 4.4 Fitting Equations with More than One Explanatory Variable

4.1 Summarizing a Variable Relationship by an Equation

We have devoted much of our attention to analyzing univariate data. However,
univariate summary statistics have their limits: they can only examine statistical properties of
individual variables. Our first chapter case showcases the usefulness of bivariate methods.

Chapter Case #1: Neither Rain, Nor Snow, Nor Dead of Night...
The U.S. Postal Service (USPS) has a long history of offering low cost, fast, and
convenient mail service. For decades, the media and talk show hosts have ridiculed the USPS
for waste, poor management, disgruntled postal workers, lost or late deliveries, and higher
stamp prices. Experts have predicted the post office's demise following the introduction of every
new communication technology the telephone, fax machines, e-mail, and overnight delivery
such as Federal Express. Yet the USPS has confounded the experts by adopting streamlined
procedures, new technology, quality control methods, and modern marketing methods.
However, the Postal Service must continually anticipate changing demand for its
services. If the volume of mail exceeds the ability of staff, truck fleet, and post offices to
process and deliver it, delays and lost business will result. If mail volume is below projected
levels, on the other hand, USPS will have to pay for idle workers and equipment.
Suppose a resource planner at USPS has 1960-1995 data on total annual postal volume
for first class mail. This type of mail has long been the major revenue source for USPS because
Congress forbids competitors from using mail boxes for pick up and delivery. In addition, the
resource planner also collects data on the fundamental customer base, U.S. population. The 36
years of first-class volume and population totals are listed in Table 4.1:

Table 4.1
U.S. First-Class Postal Service Volume and U.S. Population, 1960-1995
(Mail Volume in Billions and Population in Millions)

Volume Population
1960 33.2 180
1961 34.3 183
1962 35.3 186
1963 35.8 188
1964 36.9 191
1965 38.1 194
1966 40.4 196
1967 42.0 197
1968 43.2 199
1969 46.4 201
1970 48.6 204
1971 51.5 207
Volume Population
1972 50.3 209
1973 52.3 211
1974 52.9 213
1975 52.5 215
1976 52.5 218
1977 53.7 220
1978 56.0 222
1979 58.0 225
1980 60.3 227
1981 61.5 229
1982 62.3 232
1983 64.3 234
Volume Population
1984 68.5 236
1985 72.5 238
1986 76.3 240
1987 78.9 242
1988 82.4 244
1989 85.8 247
1990 89.9 249
1991 90.3 252
1992 90.8 255
1993 92.2 258
1994 95.3 260
1995 96.3 262

In the previous chapter, we found important uses for univariate summary measures such
as the mean and standard deviation. If the Postal Service planner relies strictly on univariate
statistics from Chapter 3, however, she obtains only the average and variability of first-class mail
volume and of the U.S. population data (see Figure 4.1).
Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
1stClVol 36 60.60 54.85 60.08 19.81 3.30
PopUSA 36 221.22 221.00 221.22 24.14 4.02

Variable Min Max Q1 Q3
1stClVol 33.20 96.30 44.00 78.25
PopUSA 180.00 262.00 199.50 241.50
Figure 4.1

Mean volume was more than sixty billion pieces, with a standard deviation of about twenty and a
range of about sixty-three. She could also have reported summary statistics for population from this
computer output.
For planning purposes, however, she needs to know about the bivariate relationship
between mail volume and population. Univariate measures prove inadequate for many types of
business problems. Business decisions often require our understanding of the relationship
among variables. Examples are the following:
How do profits respond to increased investment?
Do education and on-the-job experience increase worker salaries?
How much will rising gas prices depress car sales?

Success in business often means getting an edge on the competition by making best use
of information. Knowledge about a related variable can give us that competitive advantage if we
also know the relationship. For example, the Postal Service planner might want to predict first-
class volume when the population reaches 290 million, or determine how much volume
increases every time the population increases by a million. For problems such as these, an algebraic
equation can come in handy.

The Usefulness of Equations in Business Analysis
How are equations used in business? To see, let's review a little algebra. Recall that a
linear equation in slope-intercept form has the following form:
Y = m X + b
where Y is the dependent variable and X is the
independent variable. For example, the equation
graphed in Figure 4.2:
Y = 3 + 2 X
describes a line that intercepts the Y-axis at Y = +3
and has a slope of +2. This is a highly
economical way of summarizing a variable
relationship, reducing all the information
in a table or graph into a brief equation.
In addition, if we are given the value of X,
the equation allows us to calculate Y. If
X = 4, then
Y = 3 + 2(4) = 11
A third use for an equation is that its slope tells us the rise-over-the-run. Therefore,
    change in Y = slope × change in X
[Graph of the line Y = 3 + 2X, plotted from the points x = 0, 1, 2, 3, 4 and y = 3, 5, 7, 9, 11]
Figure 4.2

If we use Δ (delta) to mean the change in a variable,
ΔY = m ΔX
Because the slope is positive in this equation, any increase in X will also increase Y. The slope of +2 tells us that if X increases by one, Y will increase by two. If X increases by three, Y must increase by 2(3), or six, because
ΔY = m ΔX = (+2)(+3) = +6
Similarly, a decrease in X will prompt a twofold decrease in Y.
Negative slopes may also be represented by
equations. Suppose the relationship between X and
Y is the following:
Y = 20 - 4 X
Because the slope is negative, -4, every change in X
will be associated with Y moving in the opposite
direction (see Figure 4.3). For example, a 2-unit increase in X will decrease Y by 8 because
ΔY = m ΔX = (-4)(+2) = -8.
Businesses can often use these three properties of an
equation.
A linear equation relating business variables lets us perform the following tasks:
(1) summarize the relationship among variables by an equation
(2) make predictions about a variable based on data for related variables
(3) measure the sensitivity of one variable to changes in a related variable

Suppose, for example, workers at a construction site are paid according to the following
equation:
Figure 4.3  (graph of the line Y = 20 - 4X, with X from 0 to 5 on the horizontal axis and Y from 0 to 20 on the vertical axis)

Figure 4.4  (graph of the equation wage = 6 + 0.5 exper, with experience in years on the horizontal axis and wages in dollars per hour on the vertical axis)

wage = 6 + 0.5 exper
where wage denotes wages (in dollars per hour) and exper stands for years of experience. This
simple equation contains all the information in the graph (Figure 4.4) or a wage-experience table.
The construction contractor can also use the equation to predict wages for a worker with seven
years of experience:
wage = 6 + 0.5(7) = 9.50
or $9.50 per hour. Finally, the contractor can use his knowledge of the slope, +0.5, to answer "what if?" questions. For example, if he adds a work crew whose members each have four additional years of experience, the contractor will be paying his new workers $2.00 more per hour because
Δwage = m Δexper = (+0.5)(+4) = +2
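To make these three uses concrete, here is a minimal Python sketch (our own illustration, not part of the text's Minitab and Excel examples; the function names are hypothetical) that evaluates the wage equation for prediction and for "what if?" questions.

    # Sketch of the contractor's pay equation: wage = 6 + 0.5 * exper
    # (hypothetical helper names; wage in dollars per hour, exper in years)

    def predicted_wage(exper_years):
        """Predict the hourly wage from years of experience."""
        return 6 + 0.5 * exper_years

    def wage_change(slope, exper_change):
        """Marginal effect: change in wage = slope * change in experience."""
        return slope * exper_change

    print(predicted_wage(7))     # 9.5  -> $9.50 per hour for seven years of experience
    print(wage_change(0.5, 4))   # 2.0  -> $2.00 more per hour for four extra years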

Using Regression Equations to Examine Variable Relationships
In business statistics, we often encounter similar problems that require analysis of
bivariate and multivariate data by using linear equations. For example, our Postal Service
planner may need to do the following:
summarize quantitatively the relationship of the U.S. population and mail volume data
guess first-class volume based solely on knowledge of the U.S. population
determine how much mail volume responds to increases in population
Can a linear equation give her the information she needs? It just so happens that a statistical
method exists that addresses these same three types of questions. It is called regression analysis
and is one of the most widely used methods in business.
DEFINITION: Regression analysis is a statistical method for finding and analyzing
relationships among bivariate or multivariate data.

As we have already learned, numbers do not tell us their story unless we use statistical methods
to interpret their message. Regression analysis can serve as our interpreter of quantitative data relationships.

In Chapter 2, we learned that graphical display of bivariate and multivariate data can
uncover variable relationships hidden from the view of univariate histograms. In particular, we
observed that the graphical plot of quantitative variables may display an approximately linear
pattern.
In regression analysis, independent variables are called explanatory variables or
predictors because they help us predict the dependent variable and explain why it varies. The
regression equation is the algebraic equation that quantifies the linear data pattern.
DEFINITION: A regression equation is an equation summarizing the data relationship of a
dependent variable with one or more explanatory variables. Simple regression involves equations with only one explanatory variable.
The dependent variable is also known as the response variable, especially when using
experimental data.
Thus, simple regression uses a linear equation to describe how bivariate data are related.
The simple regression equation summarizing the data relationship between dependent variable Y
and explanatory variable X has the following form:
Ŷ = b₀ + b₁ X        Form of the Simple Regression Equation
The look of the regression equation differs in two ways from other types of equations we have
seen. The first difference is a minor one. It is traditional to label the intercept b₀ instead of b and the slope b₁ instead of m. The replacement of Y with Ŷ, however, represents a fundamental distinction between a mathematical equation and a regression equation, which is statistical. Real-world data relationships usually can't be perfectly described by an equation. Bivariate data
hardly ever lie exactly along a straight line. Yet decision makers still may benefit from having an
equation that summarizes the general linear pattern in the data. The equations we use therefore
reflect only an overall slope and a formula that approximates the relationship prevailing in the
data.
While mathematics does not know how to deal with approximations (an answer is either right or wrong), in statistics we avoid this problem by giving the approximation a special name. In adventure movies, the dangerous action sequences often are filmed with stunt doubles. Just as the stuntman must wear the trademark Indiana Jones hat when he doubles for Harrison Ford, we place a hat-shaped symbol (^) over the dependent variable in the regression equation. A hat over a variable indicates we aren't dealing with the actual values of that variable. Instead, Ŷ contains approximate values for Y calculated from the regression equation that summarizes overall patterns in the data.

DEFINITION: The predicted value of Y, written Ŷ (pronounced "Y-hat"), represents the approximate values of the dependent variable calculated from the regression equation.
Can a business really benefit from an equation that provides wrong answers? Yes, it can,
especially if the answers aren't so very wrong or at least are better than the next best option.
Because of their compactness in summarizing relationships among larger data sets, regression
equations can help a manager or client see the forest for the trees. Regression can also be an invaluable tool for making rough predictions and finding approximate answers to pesky "what if?" questions. Let's return to our Chapter Case and see how the USPS resource planner can use
regression analysis to tackle her three types of data analysis needs.
Examine Figure 4.5, a Minitab plot of the bivariate Postal Service data. Because
regression has even more complicated and less intuitive formulas than those for the standard
deviation, we'll postpone until the next section (Section 4.2) our discussion of how regression
equations are computed. From the nearly linear pattern in Figure 4.5, it isn't very difficult to
guess the slope of the regression line. The graph indicates that first-class mail volume (billions)
rose about 65 (that is, from the low 30s to the upper 90s). At the same time, U.S. population (in
millions) increased from 180 to 262, or approximately 82. The rise-over-the-run quotient is 65 /
82, for a slope of about 0.8. When we run the bivariate data through the computer, the regression
Figure 4.5  (Minitab scatterplot of 1stClVol, in billions, against PopUSA, in millions)

output seems to concur. The first lines of the Minitab output are copied in Figure 4.6, generated
by the steps described in Figures 4.7a and 4.7b.¹

Regression Analysis

The regression equation is
1stClVol = - 117 + 0.805 PopUSA
Figure 4.6

Using Minitab to Run a Simple Regression
Pull-Down Menu sequence: Stat Regression Regression...
Then complete the Regression DIALOG BOX as follows:
(1) find dependent variable on listing and double click to place it in Response: box
(2) double click on explanatory variable in listing to place it in Predictor: box.
(3) click on the OK button
Figure 4.7

Notice that the slope coefficient of PopUSA in the regression equation is 0.805. The USPS planner can therefore report that an additional million people translates to about 0.8 billion, or 800 million, more first-class mail that must be processed and delivered. If increased immigration pushes U.S. population up by ten million, she uses the following calculation:
Δ1stClVol = b₁ ΔPopUSA = (+0.8)(+10) = +8
to project an eight billion increase in first-class mail volume.
Marginal Effects Rule for Simple Regression: For a given change in the explanatory variable,
the average change in the dependent variable may be calculated by the following rule:
ΔŶ = b₁ ΔX
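Readers who want to verify these numbers outside Minitab or Excel can do so with a short Python sketch like the one below (our own, assuming the numpy library is available; the variable names are ours, and the data are keyed in from Table 4.1).

    # Sketch: reproducing the simple regression 1stClVol = b0 + b1 * PopUSA
    # by least squares, using numpy's polynomial fit.
    import numpy as np

    pop = np.array([180, 183, 186, 188, 191, 194, 196, 197, 199, 201, 204, 207,
                    209, 211, 213, 215, 218, 220, 222, 225, 227, 229, 232, 234,
                    236, 238, 240, 242, 244, 247, 249, 252, 255, 258, 260, 262])
    vol = np.array([33.2, 34.3, 35.3, 35.8, 36.9, 38.1, 40.4, 42.0, 43.2, 46.4,
                    48.6, 51.5, 50.3, 52.3, 52.9, 52.5, 52.5, 53.7, 56.0, 58.0,
                    60.3, 61.5, 62.3, 64.3, 68.5, 72.5, 76.3, 78.9, 82.4, 85.8,
                    89.9, 90.3, 90.8, 92.2, 95.3, 96.3])

    b1, b0 = np.polyfit(pop, vol, 1)   # slope first, then intercept
    print(b0, b1)                      # roughly -117 and 0.805, as in Figure 4.6

    print(b1 * 10)                     # Marginal Effects Rule: about +8 billion pieces
    print(b0 + b1 * 275)               # prediction at 275 million people: about 104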
Now that she has interpreted the slope for "what if?" projections, the Postal Service planner needs the entire regression equation to summarize and predict. Yet what is the meaning of the
negative intercept term, - 117, reported in the Minitab output? In and of itself, the intercept has
little meaning. It is merely one part of the regression equation for summarizing or predicting the
dependent variable. To check that this equation does do a good job of summarizing the data


¹ Until now, computer printout has been relatively simple. An elaborate set of statistics has evolved for regression because of its
importance to statistics. We can only examine a portion of the regression printout in this chapter. The remainder must await introduction of
other concepts.

relationship, try it out on a few actual values. For example, Table 4.1 reports that mail volume was
33.2 billion when population was 180 million. How did the regression equation do?
Predicted 1stClVol = - 117 + 0.805 PopUSA = - 117 + 0.805(180) = - 117 + 145 = 28
A predicted value of 28 billion pieces of mail is in the right ballpark. If we plug in other
population values from Table 4.1, the predicted values are as close or closer. Remember that the
regression equation only summarizes the relationship of actual data.
If the regression equation is valid, why did we get such a preposterous intercept? Mail
volume cannot be negative, let alone - 117 billion! Regression lines commonly have negative
intercepts when only positive values are possible for a dependent variable, such as sales or
wages. A glance at Figure 4.8 explains what the intercept is really telling us. Remember that the
intercept is the value of the dependent variable when the explanatory variable is zero. Unlike the
previous plot, Figure 4.8 extends the population axis back to zero. Sure enough, the regression
line intercepts the Y-axis at -117. Of course, barring nuclear war, it doesn't make sense to
predict mail volume for U.S. populations of zero or even 150 million. Besides, the lowest value
of population in the data was 180 million.
Figure 4.8  (plot of the mail volume regression line with the PopUSA axis extended back to zero; the line intercepts the vertical axis at -117)

Unrealistic regression intercept values are common when values near zero of the
explanatory variable are impossible or do not occur in the data.
The final task for the postal planner is to predict mail volume when population reaches 275
million around the year 2000.

Predicted 1stClVol = - 117 + 0.805(275) = - 117 + 221 = 104

Thus, she predicts the volume of first-class mail to exceed 100 billion pieces annually by the turn
of the century. In her report, she cautions USPS management that this prediction is based on a
regression equation summarizing the overall linear relationship between population and volume.
In the next section, we will learn how to attach a margin of error to regression predictions,
similar to the margin of error we learned about in Chapter 3.
By the way, spreadsheet software can also furnish us with regression equations. Finding the equation is just slightly harder. At the bottom of the Excel output is the table reproduced in Figure 4.9. Notice that the intercept and slope, -117.4866 and 0.8049997, are listed in the second column, entitled Coefficients. Recall that Minitab gave us these same values, rounded to three figures. Figures 4.10a and 4.10b provide instructions and screen examples showing how to produce Excel regressions.
Using Excel to Perform Simple Regression Analysis
Pull Down Menu Sequence: Tools Data Analysis...
(1) Select Regression from Analysis Tools list in Data Analysis box, and click OK
Then complete the Regression DIALOG BOX as follows:
(2) Click to place an x next to Labels
(3) Click inside box following Input Y Range and type cell range of label and data for the
dependent variable (or drag mouse through spreadsheet data, and this cell range will
automatically appear in the box)
(4) Click inside box following Input X Range and type cell range of label and data for the
explanatory variable (or simply drag mouse through the spreadsheet data)
(5) Click OK to produce regression output
Figure 4.10a
             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.000%   Upper 95.000%
Intercept -117.4866 6.101318791 -19.2559 7.3E-20 -129.886 -105.087 -129.8859635 -105.0872366
PopUSA 0.8049997 0.027421819 29.35617 9.74E-26 0.749272 0.860728 0.749271897 0.860727506
Figure 4.9


Regression Equations and Time Trend Data
Now we have better tools to analyze the trend lines we observed in Chapter 2. We can
use regression to quantify these graphical patterns by an algebraic equation. As we discovered,
time trend data are really bivariate relationships in disguise. To represent a linear trend line
relationship, the explanatory variable is time t.
The regression equation of a trend line for a time series variable Y is
Ŷ = b₀ + b₁ t
and b₁ is the overall trend rate.

Depending on the type of data series, the time period between observations may be a year, a
quarter, a month, a week, a day, or even an hour. The next chapter case examines trends for
annual stock market data.
Figure 4.11  (Minitab time series plot of the S&P 500, 1982 through 1991)

Chapter Case #2: The Great Bull Market
It was 1992. The Savings and Loan collapse had sent interest rates plummeting. Attracted by the higher earnings in the stock market, small investors were nevertheless wary about risking principal from their retirement savings. Even when the overall market is rising, individual stock prices are often highly volatile. Putting "all your eggs in one basket" exposes investors to
unacceptably large risk. To reduce overall risk, investment counselors advise clients to hold a
balanced portfolio. But small investors don't have the time, expertise, or capital to manage such
a portfolio. In addition, many brokers charge a commission on each stock market transaction.
However, small investors may buy into a diversified portfolio with a single purchase of a
mutual fund. For a modest fund management fee, stock mutual funds offer investors the reduced
risk and high return previously available only to large investors. The returns earned on one of
these funds are based on the average of all the stocks in that fund's portfolio.
One of the broadest stock market measures is Standard and Poor's index of 500 stocks.
The S&P 500 is the weighted average of stock prices for 500 established corporations listed on
the New York Stock Exchange and other major stock exchanges. Investments in a mutual fund
of S&P 500 stocks more than tripled their value between 1982 and 1991. Figure 4.11 displays
this 10-year growth of the S&P 500 in a Minitab time series plot.
The graph clearly shows a strong time trend, though not a perfectly linear one: increases were greater in some years than in others. The S&P 500 even declined in one year (1988) and failed to rise at all in another (1984). In Chapter 2, we learned that a trend rate guessed from the graph may describe trend data. We may represent this trend rate in a regression equation with Year as the explanatory variable. The regression of S&P500 data on Year yields the
Minitab output found in Figure 4.12.
Figure 4.12
Regression Analysis
The regression equation is
S&P500 = - 2182 + 28.1 Year

The trend rate, +28.1, from the regression equation
Predicted S&P500 = - 2182 + 28.1 Year
is very close to what we would have guessed from the approximately 260 point rise in nine years
(1982 to 1991). A rise-over-run of 260 / 9 is less than thirty points per year in the S&P 500. A
market analyst may report that the stock index trended upward an average of 28 points per year

from 1982 to 1991. As before, the negative intercept should not be interpreted separately. Its
purpose is to make the equation predictions match the data well.
In Figure 4.13, we have graphed the trend line in the scatterplot. Although points lie
above and below our estimated trend line, it appears to do a good job of summarizing the general
pattern of the plotted data.
Of course, stock markets have been known to go down as well as up. Suppose that the
regression equation for the early part of next century is
Predicted S&P500 = 1900 - 15 Yr
The negative slope would describe a stock index that trends downward at fifteen points per year.
We can also use the regression equation to make special types of predictions called
forecasts.
DEFINITION: A forecast is a prediction about the future value of a variable.
By plugging in a year sometime in the future, for example, a market analyst can forecast the S&P
500. The forecast for 1994 is
Figure 4.13  (scatterplot of the S&P 500 against Year with the fitted trend line drawn through the points)

Predicted S&P500 = - 2182 + (28.1)(94) = 455
which turned out to be extremely close to the 459 closing value for the S&P that year. However,
our luck would have run out had we forecast 1998. The surging S&P 500 spent much of that
year above 1100, but our regression equation forecasts the market level at only half as high:
Predicted S&P500 = - 2182 + (28.1)(98) = 572
The lesson here is to be extremely careful when making forecasts. By their very nature, trend
forecasts require us to extend a trend line beyond the period used to fit that line. This is an
example of extrapolation.
DEFINITION: Extrapolation is prediction based on values outside the range of the data used to
make the prediction.
Trend forecasting always involves extrapolation, since we have no direct data about the future on
which to base our forecasts. Because stock prices turned up more sharply in 1995 and 1996,
setting one record after another, the ten-year trend equation forecasts too low. On the other hand
the market could have flattened out or turned downward instead.
Always include the following warning when forecasting from time trends:
All forecasts are based on the assumption that past trends continue.
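To see how such a trend forecast is produced and why extrapolation is risky, here is a brief Python sketch (our own, again assuming numpy; the yearly S&P 500 closes are the ones used in this section, with years coded 82 through 91).

    # Sketch: fitting a time trend and extrapolating it beyond the data.
    import numpy as np

    year = np.arange(82, 92)                           # 1982 through 1991
    sp500 = np.array([120, 160, 160, 187, 236, 287, 265, 323, 335, 376])

    b1, b0 = np.polyfit(year, sp500, 1)                # trend rate and intercept
    print(b0, b1)                                      # roughly -2182 and 28.1

    print(b0 + b1 * 94)                                # 1994 forecast: about 455
    print(b0 + b1 * 98)                                # 1998 forecast: about 572, far below
                                                       # the actual market that year

Both forecasts are extrapolations; the first happened to work out, the second did not, which is exactly the point of the warning above.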
We will have a lot more to say about time series and forecasting in other chapters. But to
continue any further, we need to understand more about regression equations themselves. For
example, how do we know we have the right one? In the next section, we investigate the
properties of a regression equation by comparing the trend line we just used with other possible
contenders. In the process, we will find ways to measure how well the regression equation
summarizes the data relationship.


CASE MINI-PROJECT:
French workers are concerned about the effect of the new European free trade agreement on
unemployment rates in their country. A bank staffer collects French unemployment rates for the
past thirty years, 1965-94. Based on the descriptive statistics below, the bank staffer reports that
unemployment has averaged under 7 percent. He concludes that average unemployment in
France is in good enough shape to withstand shocks from ending trade protection.

N MEAN MEDIAN TRMEAN STDEV SEMEAN
unemp FR 30 6.759 6.300 6.528 3.355 0.648

MIN MAX Q1 Q3
unemp FR 1.600 10.700 3.100 9.700

The bank economist is concerned because information in the following plot was not examined:

unemp FR- * * *
- * * * * *
- * *
9.0+ *
- * *
- *
-
- *
6.0+ *
- *
- *
- * *
-
3.0+ * * * *
- * * *
- * * *
-
--+---------+---------+---------+---------+---------+--------year
65 70 75 80 85 90

1. The data in this case study is (cross section / time series) [circle one]
2. Descriptive statistics fail to alert us to the trend in unemployment because only ______variate analysis is reported. The plot reveals this trend because the graph involves ______variate analysis.
unemp FR = - 22 + 0.35 year
3. Use the regression equation to forecast unemployment in 1999 (Hint: when year = 99).
4. Unemployment rate rises an average of 0.____ points per year, or ____ every four years.

Multiple Choice Review:
4.1 By knowing only the trend line, you know all except which one of the following things
about time series data?
a. fit around the trend
b. steepness of the trend
c. location of the trend
d. the average rate of increase of the time series data for each time period
e. all of the above can be determined from the trend line

4.2 If the trend line for inventories, fit from 20 consecutive quarters (1 through 20), is given by the equation
predicted inventory = 50 + 150 Quarter
then the prediction for quarter 21 would be
a. forecasting
b. extrapolation
c. 3200
d. all of the above
e. none of the above

4.3 If the trend line for inventories is given by the equation
predicted inventory = 50 + 150 Quarter
you can predict inventory levels for quarter number 5 to be
a. 700
b. 800
c. 900
d. 1000
e. cannot be predicted from the information given

4.4 If the trend line for inventories is given by the equation
predicted inventory = 50 + 150 Quarter
then inventory levels on average
a. rise 600 per year
b. rise 200 per quarter
c. rise 150 per month
d. fall 200 per quarter
e. none of the above


4.6 The problem with trend line regressions is that
a. eventually all trends must end
b. there may not be any trend upward or downward
c. they can only apply to time series data
d. they do not explain why a trend occurs
e. all of the above

4.7 Based on the trend line for sales of a fast growing company given by the equation
predicted sales = 2500 + 500 Year
the fitted value of sales for Year 7 is
a. 2500
b. 3500
c. 5000
d. 6000
e. cannot be determined from the information given

4.8 Based on the trend line for sales given by the equation
predicted sales = 3000 + 600 Year
sales on average
a. rises 50 per month
b. rises 150 per quarter
c. rises 600 per year
d. all of the above
e. none of the above

4.9 A simple regression equation can always be used to predict a dependent variable if
a. the explanatory variable is positive
b. the explanatory variable lies within the range of data values
c. the explanatory variable has a realistic value
d. the predicted value of the dependent variable is realistic
e. none of the above

4.10 The intercept in a simple regression equation may always be interpreted as
a. the change in the dependent variable for zero change in the explanatory variable
b. the value of the dependent variable when the explanatory variable is zero
c. the value we add to b₁X to predict in-range values of the dependent variable
d. the mean of the dependent variable
e. all of the above

Answer the next two questions based on the following case and statistical output:
A tourism promoter collects data on air traffic (in thousands of travelers) at Orlando International
Airport for n = 67 consecutive months from 1989 through mid-1994. From the descriptive
statistics below, the promoter reports that air traffic over these 5 years averaged about 830,000
per month. She concludes that the airport shopping mall should base its current and future
inventory and staffing around this average traffic figure.

N MEAN MEDIAN TRMEAN STDEV SEMEAN
AirTraf 67 829.5 822.0 828.0 110.9 13.5

MIN MAX Q1 Q3
AirTraf 597.0 1086.0 751.0 906.0
(Time series plot of AirTraf, in thousands of travelers, against Month)
The airport mall manager is concerned that the information in the plot above was not examined:
4.11 By itself, the analysis of descriptive statistics is inadequate because it fails to examine
a. Univariate statistics
b. Bivariate statistics
c. Multivariate statistics

d. All of the above
e. The descriptive statistics provide adequate information for proper analysis

4.12 Which of the following equations would be a reasonable least-squares line for the plotted
data above:
a. predicted AirTraf = 800 + 5 Time
b. predicted AirTraf = 800 + 9 Time
c. predicted AirTraf = 650 + 5 Time
d. predicted AirTraf = 650 + 9 Time
e. none of these

Concepts and Applications:
4.13 Are there decision situations in which forecasting exactly (or nearly so) is the only guess
that is rewarded? Suppose a market analyst achieves acclaim only when his or her
forecasts are right on the mark, but investors and the business press have a short memory
concerning predictions that never materialized. Explain why a least-squares forecast may
not be preferable for such an individual.

4.14 Determine what the average trend rate is for each of the following (notice that the final
example represents a downward trend):
(a) sales = 500 + 2 Week
(b) profits = 80,000 + 1,500 Year
(c) inventory = 85 + 150 Quarter
(d) workers = 850 - 18 Month

4.15 For the examples in the preceding exercise, forecast the value of the dependent time
series variable for
(a) Week 53
(b) Year 21
(c) the 17th quarter
(d) the 25th month

4.16 Examine the points and trend line for the late-Spring S&P500 data plotted in this section.
(a) For how many of the days were the trend values greater than the observed values?
(b) Explain why a good trend line doesn't necessarily require that one-half the points lie
above (and the other half below) the line.

4.21 The humorous maxim often quoted in the stock market is "buy low and sell high." But
without a time machine, crystal ball, or illegally-obtained insider information, there is no

way to know when a stock has peaked or bottomed out. Discuss the problem of deciding
when to buy and sell with regard to the four sets of 1987 S&P 500 series examined in this
chapter.

4.22 Discuss difficulties with forecasting from past trend lines in each of the following cases:
(a) Before the oil crises of the 1970s, extrapolations of trends in electricity usage forecast
the tremendous need for new generators by electric utilities. The higher fuel prices
depressed demand for electricity, but led to forecasts of exhausted oil reserves until new
discoveries of oil resulted in an oil glut.
(b) The arrival of computers led to forecasts of a paperless office, until fax machines,
laser printers, and inexpensive, imported copiers reversed the trend.
(c) Financial markets are full of gurus whose expensive newsletters report astounding
performance while conveniently forgetting their failures.
4.23 Explain why trends are unlikely to persist for long in each of the following examples:
(a) rapid rate of new product development by a computer firm
(b) the rising cost of health care
(c) increasing enrollment in business schools
(d) declining petroleum costs on world markets

4.28 The fitted equation predicts that firms with about 100,000 employees would have about
$6 billion in sales, since 0.06(100,000) is $6,000 million (rounding away the intercept of
"only" $10 million). Motorola, one of the largest firms in this industry with 97,700
employees in 1987 had $6.7 billion in sales that same year. Considering that the
regression equation was based on data that contained no firm with more than 1200
workers, interpret the possible reasons for the accuracy of this extrapolation.


4.2 Fitting Data to an Equation

A simple regression equation summarizes the relationship between two variables. It does this by
locating the equation that best describes how data on the dependent variable relate to the
corresponding values of the explanatory variable. If your foot is a size 8C, only shoes with that
length and width will fit your feet. Similarly, a simple regression equation must have the right
combination of intercept and slope to fit the data. Thus, the regression equation is also known
as the fitted equation, and the predicted values of the dependent variable are called the fitted
values. Let's show what we mean by examining another graph of the S&P 500 data introduced
in the previous section.
Figure 4.14 again plots the actual S&P values and the regression equation. In addition,
however, we have added two other lines. Can you see why the regression line (the solid line in
this graph) summarizes the data better than the dashed and dotted lines? Unlike the regression
line, the dotted line does not rise steeply enough to summarize the plotted market data. As a
result, the plotted points are well below the dotted line for 1982-1985 and well above it for 1989-
1991. The dashed line appears to have the proper slope. But because its y-axis intercept is too
high, the dashed line tracks considerably above most of the plotted S&P values.
Figure 4.14  (S&P 500 data, 1982-1991, plotted with the least-squares regression line (solid) and two alternative lines, one dashed and one dotted)

What can we learn from this graphical exercise? Recall from algebra that one line can
only differ from another if its intercept or slope is different. Therefore, fitting data to a
regression equation boils down to selecting the slope and intercept that best match the overall
data pattern. Unfortunately, finding the best-fitting linear equation is usually not obvious from a
graph. We need the assistance of a statistical method to help find the intercept and slope
coefficient. To do this, we must first establish standards for what we mean precisely by a best-
fitting equation.
Back in Chapter 3, we saw that the mean X̄ could be used to summarize univariate data. We then measured variability around the mean by the standard deviation. The standard deviation was based upon deviations from the mean, X - X̄. To predict a single value for a univariate variable, we would guess the mean and use the standard deviation to obtain a margin of error.
In regression, we also use a deviations-based measure of variability. Only now, these are differences from the fitted value, or Y - Ŷ. These differences are the errors.
DEFINITION: The error, e, in regression for each observation in the fitted data is the difference between the actual value of the dependent variable Y and its predicted value Ŷ:
e = Y - Ŷ

The univariate differences X - X̄ vary from one observation to the next in the data only because the X-value changes. By contrast, both parts of Y - Ŷ vary because the regression equation is calculated at different X-values. For example, S&P500 was 287 in 1987. Substituting Year = 87 into the regression equation,
Ŷ = - 2182 + 28.1 Year = - 2182 + (28.1)(87) = 262.7
Rounding to 263, the error for the 1987 observation is 24 because
e = Y - Ŷ = 287 - 263 = 24



Geometrically, these deviations are simply the vertical distances between the scatterplot points
and fitted values displayed along the trend line. The 1987 error just calculated is displayed as a
vertical distance in Figure 4.15.
Figure 4.15  (plot of the regression equation for the S&P 500; for 1987, the vertical distance between the actual value of the S&P 500 and the predicted value on the line is shown as the error: error = actual - predicted)


Errors can be calculated for every observation. For example, Figure 4.16 shows the errors
graphically for the S&P 500 data. Notice that the S&P 500 for 1983, 1987, and 1989 lie above
the regression line, so the actual market index exceeded the predicted value. The errors for these
observations are therefore positive. For 1984, 1985, 1988, and 1990, errors are clearly negative
because the plotted points lie below the regression line.
To obtain the best fitting equation, we must take into account all these errors. To do this,
we combine errors mathematically. As we learned in Chapter 3, negative and positive
differences can cancel if we sum them. If negative errors are just as bad as positive ones, we can
make all terms in the sum positive by the same method used with standard deviations. By summing the squared errors, e², all observations contribute positively. The equation that minimizes the error sum of squares, SSE, is called the least-squares regression equation.
DEFINITION: The least-squares regression equation is the equation that fits the data with the
minimum possible error sum of squares.
On spreadsheet programs and most statistical software, regression equations are
computed by a least-squares analysis. All the regressions discussed in this text are least-squares
Figure 4.16  (S&P 500 scatterplot, 1982-1991, with the regression line; the errors appear as vertical distances between the plotted points and the line)

Year   actual   fit   error   e²
82 120 122 -2 4
83 160 150 10 100
84 160 178 -18 324
85 187 207 -20 400
86 236 235 1 1
87 287 263 24 576
88 265 291 -26 676
89 323 319 4 16
90 335 347 -12 144
91 376 375 1 1
regressions.² In Section 4.4, we will generalize least-squares analysis to multiple regression,
which involves more than one explanatory variable.
If we follow the script from Chapter 3, it is now time to present the least-squares
formulas necessary to fit the data to regression equations. In fact, two generations of business
students devoted valuable study time and energy to learning how to calculate least-squares
equations from canned formulas. Thankfully, these complex, time-consuming computation
exercises have been relegated to the scrap heap next to procedures to calculate square roots by
hand. Today, learning and using these formulas has become an
unproductive activity. Virtually all regression computations today
are done on computer software.
Moreover, least-squares formulas are derived from calculus
to ensure an intercept-slope combination that minimizes SSE. Thus, hand-calculator exercises lack the intuitive lessons that we obtained from working through univariate formulas.³ Anyway,
the most common type of business regression uses multivariate
data (see Section 4.4). Multivariate regression formulas are so
complicated they must be stated in matrix algebra and are beyond
the reach of even sophisticated pocket calculators.
Does this mean we must view regression as a black box that magically fits data to an
equation? Not at all! We always need to understand how the methods we use are computed and
what are the properties and limitations of the statistics we report. All we will need to meet this
standard is our definition of least-squares which assures us that the regression equation best fits
the data. In Chapter 10, we will learn that least-squares equations are best in another sense if we
are willing to impose several assumptions.
How can we tell that the regression equation reported by the computer does in fact have
the lowest possible SSE? Let's use a trial-and-error approach. To show that we have the smallest SSE, we can calculate the fits and errors from the regression equation and check whether we can further reduce SSE by choosing another equation. The "fit," "error," and "e²" columns in the table shown earlier (Figure 4.17) were generated in Excel. The fits are calculated by the regression equation reported earlier:


² Least squares dates from 1805, but Sir Francis Galton first coined the term "regression" 80 years later to account for the deviation of children's heights from those of their parents. It was not until Yule, in 1896, that the two terms were finally linked. Recently, maximum likelihood
and nonparametric procedures have become popular ways of computing regression equations. These methods often enjoy advantages over least
squares and are used extensively in business research. See S. M. Stigler, The History of Statistics, Cambridge: Harvard Univ. Press, 1986.

³ Least-squares formulas are derived from multivariate calculus applied to algebra or matrix algebra.
Figure 4.17

Predicted S&P500 = - 2182 + 28.1 Year
and then rounded to simplify things. The errors, actual S&P 500 minus fitted value, range from -26 to +24. Squaring turns all the errors positive, and the sum of the e² column,
4 + 100 + 324 + 400 + 1 + 576 + 676 + 16 + 144 + 1,
is over 2000.
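The same bookkeeping can be traced in a short Python sketch (our own, using numpy and the exact least-squares coefficients rather than the rounded ones), which shows where the error sum of squares in the table below comes from.

    # Sketch: fitted values, errors, and the error sum of squares SSE
    # for the S&P 500 trend regression.
    import numpy as np

    year = np.arange(82, 92)
    sp500 = np.array([120, 160, 160, 187, 236, 287, 265, 323, 335, 376])

    b1, b0 = np.polyfit(year, sp500, 1)   # least-squares slope and intercept
    fitted = b0 + b1 * year               # predicted S&P 500 for each year
    errors = sp500 - fitted               # e = actual - predicted
    sse = np.sum(errors ** 2)             # error sum of squares
    print(sse)                            # about 2097, matching the ANOVA table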
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 64932 64932 247.76 0.000
Error 8 2097 262
Total 9 67029
Figure 4.18

The precise error sum of squares (without rounding) may be found in the Minitab output
in the table entitled Analysis of Variance (Figure 4.18). It is in the SS column (which stands for
"sum of squares") and on the second-from-the-bottom line labeled Error. For the regression on
the S&P 500 index data, SSE is 2097.
The Excel regression (Figure 4.19) has the same information in its ANOVA table (which
stands for ANalysis Of VAriance). The error sum of squares is still in
the same place on the SS column,
but the row is labeled Residual, another name for error. Analysis of
variance occupies a vital role in business analysis. We will discuss
the rest of this important table later in this and other chapters.
If we truly have the least-squares equation, no other fitted
equation will give us as small an SSE. Suppose, for example, we increase the slope by a small amount, from 28.1 to 28.2. The Excel chart of fits, errors, and errors squared in Figure 4.20 indicates a poorer fit and larger errors. Just this small change increases SSE from
ANOVA
df SS MS F Significance F
Regression 1 64932.25 64932.25 247.7556 2.65109E-07
Residual 8 2096.655 262.0818
Total 9 67028.9
Year   actual   fit   error   e²
82 120 130 -10 100
83 160 159 1 1
84 160 187 -27 729
85 187 215 -28 784
86 236 243 -7 49
87 287 271 16 256
88 265 300 -35 1225
89 323 328 -5 25
90 335 356 -21 441
91 376 384 -8 64
Figure 4.20
Figure 4.21

2097 to 3674, nearly twice as large as before. Three of the squared errors alone (729, 784, and 1225) sum to more than 2700. Similar increases in the error sum of squares result from tinkering with the slope or intercept. No matter how we try, we cannot find another equation that results in an SSE any smaller than the one we got from our original equation (see Figure 4.21).
Measuring How Well an Equation Fits
In Section 4.1, we learned that real-world data relationships can't be perfectly described by an equation, yet decision makers still may benefit from a summary equation that predicts and answers "what if?" questions. Unlike an algebra equation, regression only approximates the overall linear pattern in the data. Least squares is like a "good news, bad news" story. The good news is that regression gives us the best-fitting equation. The bad news is that sometimes
the best isn't very good. If it fits the data well, the regression equation can be relied on to
summarize the variable relationship and to make useful predictions. When the fit is not as good,
however, the regression equation may still be useful but only as a general guide.
To be useful to business, regression analysis must provide us with two types of information:
(1) the best-fitting equation, and (2) how well the data fit that equation.

The Standard Error of the Estimate
How can we measure how well or poorly a regression fits the data? Fortunately, least-
squares automatically furnishes us with all the information we need to calculate a fit measure.
When regression obtains the best-fitting equation, the SSE it minimizes may be used to compute a measure of the regression fit called the standard error of the estimate, SEE.
DEFINITION: The standard error of the estimate, SEE, in regression is the square root of the mean of the squared errors. For simple regression on a bivariate data set of n observations,
SEE = √[ SSE / (n - 2) ]

Recall that we calculated the standard deviation s as the square root of Σ(X - X̄)² / (n - 1), the mean of the squared deviations. SEE serves as our standard deviation for regression.


The standard error of the estimate measures average deviation of the dependent variable
from the fitted regression equation in the same way that standard deviation measures
variability from the mean.
Why do we average SSE over (n - 2) instead of n? In our univariate analysis of Chapter 3, only (n - 1) pieces of information remained with which to calculate the standard deviation. To enable the deviations (X - X̄) from the mean to be calculated, we consumed one degree of freedom to find the mean X̄ of the data. In regression, the deviations are the errors e. These errors can only be found after we have fitted values from the regression equation Ŷ = b₀ + b₁X. We divide SSE by n - 2 because finding the intercept b₀ and slope b₁ costs us two degrees of freedom. In our S&P 500 example, SSE was 2097 for n = 10 years of data, so
SSE / (n - 2) = 2097 / (10 - 2) = 262
whose square root is 16.19.
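A few lines of Python (our own sketch) retrace the arithmetic.

    # Sketch: the standard error of the estimate, SEE = sqrt(SSE / (n - 2)).
    import math

    n = 10                       # ten years of data
    sse = 2097                   # error sum of squares from the regression
    mse = sse / (n - 2)          # mean square error; two df lost to b0 and b1
    see = math.sqrt(mse)
    print(mse, see)              # about 262 and 16.19, as in Figure 4.22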
This value for SEE, along with its step-by-step calculations, may be found in either the Minitab or Excel regression output. In the Minitab output of Figure 4.22, SEE is referred to as s and located on the line just above the Analysis of Variance table.
s = 16.19 R-sq = 96.9% R-sq(adj) = 96.5%

Analysis of Variance
SOURCE DF SS MS F p
Regression 1 64932 64932 247.76 0.000
Error 8 2097 262
Total 9 67029
Figure 4.22

The Analysis of Variance table helps us trace how s = 16.19 was computed. We already discovered that the SS column gives the sum of squares, so the Error row tells us SSE = 2097. This time, notice the DF and MS columns in the table. DF is the degrees of freedom we discussed, and MS stands for the mean of the sum of squares. Now we can find all the ingredients to calculate SEE. The Error row contains the degrees of freedom, n - 2 = 8, the SSE, and the mean square error MSE = 2097 / 8 = 262.


DEFINITIONS: The mean square error, MSE, is the sum of squared errors averaged over the number of degrees of freedom: MSE = SSE / DF.


The Analysis of Variance table contains squared data (sums of squares, mean squares, etc.). To
exit this table, we need to un-square a number.
The standard error of the estimate is the square root of the mean square error.
The Excel output contains all the same information in nearly the same arrangement. See if you
can find the standard error in the top part of the accompanying output and the degrees of freedom,
sum of squares, and mean squares for the errors in the ANOVA table.
The standard error of the estimate, as its name implies, often allows us to attach an approximate margin of error to our regression predictions and forecasts. Because some data fit a linear equation better than others, it is seldom proper to report a predicted value without including a margin of error. We already saw in Chapter 3 how reporting a two-standard-deviation margin of error helped our business consultant keep his reputation and the business avoid disaster. If the errors have a bell-shaped distribution, we may often be able to apply a similar margin of error to make crude approximations of regression predictions.
For normally distributed errors, predictions based on regression have a margin of error
roughly approximated at plus or minus two standard errors.
Recall that several of the errors from predicting the S&P 500 were between 18 and 26. Two
standard errors is 2(16.19), or 32.38. Thus, approximately 95 percent of the actual S&P 500
should lie roughly within 32 or 33 points of the predicted values. In this case, all ten data
observations fell within this margin of error.⁴ The standard error of the estimate occupies a
crucial place in inferential methods of testing and forecasting, as we will see in Chapters 8
through 11. In Chapter 11, we will present a much more refined method for finding prediction
intervals.

The R² Measure of Fit
As we learned in Section 4.1, describing a bivariate relationship with univariate summary statistics may sacrifice important information. In our stock market index example, the mean was 245 over the ten-year period. If we ignored the bivariate time trend relationship completely, we could still summarize the market by the mean value of the S&P 500. How much do we gain by using the regression? That is the question raised by our other fit measure, R². Before we
formally define this measure, we must first quantify the errors from ignoring bivariate data
patterns.

Regression equations have the form:
Ŷ = b₀ + b₁ X
To require Ŷ to equal the mean, Ȳ, regardless of the value of the explanatory variable, we must disconnect X from the equation. We unplug X from having any effect on Ŷ by making its slope b₁ zero:
Ŷ = Ȳ + 0 X
For the stock market data, this amounts to assigning Ŷ to equal 245 in every year. Figure 4.24 graphs this horizontal line and shows the errors that would result from using this line to make predictions.





Notice that two years are near the mean of 245, but most were 50, 100, or more
points away. Our regression line never erred by more than 26 points. We can usually
obtain better overall fits from a bivariate relationship, and in this case, we can do much
better.
In extreme cases, we can get almost as good a fit by guessing the mean of the dependent variable whatever the value of the explanatory variable. If there is little relationship in the bivariate data, the best fit won't do much better than the Ŷ = Ȳ formula. To make this comparison, we need to measure the error sum of squares if we always predict by Ȳ. This is called the total sum of squares, SST.
Figure 4.24  (S&P 500 data, 1982-1991, plotted with the horizontal line Ŷ = Ȳ = 245; the errors are the vertical distances from this line)

DEFINITION: In regression analysis, the total sum of squares, SST, is the sum of squared deviations of the dependent variable from its mean. Thus,
SST = Σ(Y - Ȳ)²

Notice that SST is the same sum used to calculate the standard deviation for a single variable. In regression, we can also interpret SST as the sum of squared errors if we used Ŷ = Ȳ as our fitted equation. Thus, SSE can be no greater than SST. Even if the regression line adds nothing to the fit, the worst least squares can do is guess Ȳ, resulting in SSE = SST.

SSE is nearly always smaller than SST, and much smaller if knowledge of X provides strong evidence about the value of Y. In the case of a perfect fit, SSE is zero.
Decision makers need to know how well a regression equation fits the data. A popular goodness-of-fit measure is the coefficient of determination, more commonly called R² (pronounced "R-square"). R² is based on the ratio of SSE to SST. We can interpret the ratio SSE / SST as that part of total variability not explained by the regression. Then we let R² represent the fraction the regression does explain.
DEFINITION: R² measures the goodness of fit for a regression by the portion of variation in the dependent variable accounted for by the explanatory variable.
R² = 1 - SSE / SST
or, in percentage terms: R² = (1 - SSE / SST) × 100%
R² ranges from 1.0 for a perfect fit to a minimum of zero if the regression accounts for no variation about the mean of Y. R² is reported as either a percentage or a decimal fraction. For example, if SST is 24 and SSE is 6, then R² = 1 - 6 / 24 = 0.75, or 75 percent.
To see how this is reported in Minitab and Excel output, refer one final time to the
regression output for the S&P 500 example. R² in the Minitab output is called "R-sq" and is printed just to the right of the standard error s. The R² of 96.9 percent tells us that the regression line accounts for nearly all the variation in the S&P 500. Over the 10-year period, the least-squares equation provided an excellent summary of behavior in the stock index.
The total sum of squares may be found on the Total line of the Minitab Analysis of Variance table, immediately below the SSE value. The R² is so large because the SST of 67,029 has been reduced by using regression to only 2097. Thus, we can find R² directly from these two numbers:
R² = 1 - (2097) / (67,029) = 1 - 0.031 = 0.969

or 96.9 percent. These same values for R² (called "R Square"), SSE, and SST are reported in our Excel output as well.
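A two-line check in Python (our own sketch) ties the R² formula to the sums of squares reported above.

    # Sketch: R-squared as the share of total variation explained.
    sse = 2097                   # error sum of squares
    sst = 67029                  # total sum of squares
    print(1 - sse / sst)         # about 0.969, i.e. 96.9 percent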
Let's pull together what we have learned about fit measures.
R² measures how good a fit is while SEE reports the variability around the fit. SEE is measured in the same units as the dependent variable so it can inform us about the margins of error. R² is a relative measure to let us determine what percentage or fraction of a perfect fit we have achieved. Therefore, these two fit measures complement one another.
We conclude our discussion of fit measures with a business case involving a weaker regression
fit.

Chapter Case #3: Let's Win One for the Gipper
The president of a financially strapped university was being pressured by alumni to hire a
top coach and pour extensive resources into the football program. Many U.S. universities have
long relied on sports in general, and football in particular, to maintain financial solvency. A
successful football program can bring lucrative television contracts, bowl bids, and even support
from alumni in the state legislatures. When Northwestern University turned its football program
around, alumni contributions and insignia sportswear sales rose substantially. Despite the travel
and equipment expenses, football can be cost effective because the players aren't paid salaries.
Faced with declining enrollments from rising tuition costs, even the university admissions
director reports that her task would be easier if the team were more successful. A sports
consulting firm was contracted to assess the benefits of a successful program.
One good measure of financial success of a football program is attendance at the games.
Besides ticket and concession revenues at the game, high attendance can demonstrate fan loyalty,
a large television market, and wealthy alumni support. Any team can turn out fans for a
traditional rival, so the consultants used average attendance (in thousands) for the 1993 season as
their dependent variable.

I-A TEAM AttendAv Top25CNN
AIRFORCE 38.9 3
ALABAMA 75.7 10
ARIZONA 50.8 2
ARIZONAST. 51.6 1
ARKANSAS 43.4 4
ARMY 33.8 2
AUBURN 81.1 8
BALLST. 10 0
BAYLOR 34.4 2
BOSTONCOLL 33.2 5
BOWLINGGRE 12.7 0
BYU 65.2 7
CALIFORNIA 40.5 2
CENTRALMICH 18 0
CINCINNATI 15.9 0
CLEMSON 66.8 7
COLORADO 51.9 6
COLORADOST. 22 0
DUKE 22.5 0
EASTCAROLIN 26.9 1
EASTERNMICH 12.4 0
FLORIDA 84.5 7
FLORIDA ST. 74 10
FRENSO ST. 39.5 1
GEORGIA 78.1 8
GEORGIATECH 41.4 2
HAWAII 40.8 1
HOUSTON 23.5 3
ILLINOIS 51 2
INDIANA 37.4 2
IOWA 68.5 7
IOWA ST. 35.6 0
KANSAS 35.1 1
KANSAS ST. 31 1
KENT 4.7 0
KENTUCKY 52.1 1
LOUISIANAST 60.3 5
LOUISVILLE 37.6 2
MARYLAND 37.4 1
MEMPHISST 30.3 0
MIAMI (FL) 47.9 10
MIAMI (OH) 14 0
MICHIGAN 105.7 9
MICHIGAN ST. 61.5 3
MINNESOTA 40 0
MISSISIPPI 32.7 2
MISSISIPPIST. 33.2 0
MISSOURI 37.4 0
NAVY 25.8 0
NEBRASKA 75.6 11
NEVADA-LV 15.9 0
NEW MEXICO 25.3 0
NEW MEXICOST. 16.8 0
N. CAROLINA 48.7 2
N. CAROLINAST 40.9 3
N. ILLINOIS 12.6 0
NORTHWESTERN 31 0
NOTRE DAME 59.1 8
OHIO 9.3 0
OHIO ST 92.2 8
OKLAHOMA 62.1 9
OKLAHOMA ST. 24.4 1
OREGON 36.4 0
OREGON ST. 28.9 0
PACIFIC (CA) 10.4 0
PENN ST. 94 9
PITTSBURGH 26 3
PURDUE 43.4 0
RICE 23.8 0
RUTGERS 30.7 0
SAN DIEGO ST 37.3 1
A successful team can't expect to raise its attendance
overnight. Alumni willing to cross the state to endure hours of
freezing weather on hard benches usually attended games as
students. Team success over the preceding eleven years was
therefore selected. Success in the national rankings was chosen
rather than winning percentage. Teams from weak conferences or
those who play easy schedules are unlikely to go to major bowls or
be seen on national TV. The consultants therefore chose as their
explanatory variable the number of years the college was ranked among the top 25 in the CNN-USA Today Poll. Data for 95 Division I-A teams were collected (see the data from the Excel spreadsheet in Figure 4.25).
Figure 4.25

After running the regression, the Minitab output is shown in Figure 4.26.
Regression Analysis
The regression equation is
AttendAv = 25.4 + 5.67 Top25CNN

s = 12.51 R-sq = 68.2% R-sq(adj) = 67.9%

Analysis of Variance
SOURCE DF SS MS F p
Regression 1 31205 31205 199.51 0.000
Error 93 14546 156
Total 94 45750
Figure 4.26
One young consultant looked only at the regression equation
Predicted AttendAv = 25.4 + 5.67 Top25CNN
and recommended that this equation be the basis for their report. He argued that the rest of the
output was unimportant. Using the equation alone, the sports consultants can predict attendance
of nearly 71,000 per game if the university targets a top-25 ranking in 8 of every 11 years
Figure 4.27  (scatterplot of AttendAv against Top25CNN for the Division I-A teams)

because
Predicted AttendAv = 25.4 + 5.67(8) = 70.8 (AttendAv measured in thousand / game)
Can you find where the young consultant went wrong? Did you notice the relatively large value
of the standard error of the estimate, 12.5 thousand? Thankfully, more experienced heads prevailed. The sports consultants included a rough margin of error of ±2 SEE with their report. They therefore predict the following average game attendance if the target success rate is reached:
71 thousand ± 25 thousand
The university president would then know that attendance could be as low as 46,000 or as high
as 96,000. The president won't be misled into thinking that 71,000 is the likely outcome. On the
other hand, the information gathered from this regression is still very useful. If the university is
currently averaging only 30,000 per game, these regression results can tell that president that a
successful program almost certainly increases attendance and could even triple it.
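The consultants' calculation, including the rough two-SEE margin of error, can be sketched in Python as follows (our own illustration, using the coefficients and SEE reported in Figure 4.26).

    # Sketch: predicted attendance with a rough +/- 2 SEE margin of error.
    b0, b1 = 25.4, 5.67              # intercept and slope (AttendAv in thousands)
    see = 12.51                      # standard error of the estimate

    top25_years = 8                  # target: top-25 ranking in 8 of every 11 years
    predicted = b0 + b1 * top25_years
    low, high = predicted - 2 * see, predicted + 2 * see
    print(predicted, low, high)      # roughly 71, 46, and 96 thousand per game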
Finally, let's examine the regression variables as plotted in Figure 4.27. It is clear from
the broad dispersion of points in the graph that no regression line can possibly fit the data well.
That is why we got a lower R² and a sizable standard error SEE. Nevertheless, the regression line summarizes the general pattern in the data. After all, the R² of 68.2 percent tells us that more than two-thirds of the variation in attendance is explained by the regression.
Time series regressions like the post office example in Case #1 generally yield much better fits than comparable regressions on cross-sectional data. It is not unusual for R² to be 98 or 99 percent for regressions with time series data, whereas regressions on cross-section data seldom produce fits exceeding 75 or 80 percent. Thus, the fit for the attendance-ranking regression may be judged quite adequate for the consultants to use in predicting attendance.
It is difficult to get good fits from regressions that address questions such as:

Why do some students perform better than others on exams?
Why do some businesses have lower employee morale than other businesses?
Why do some cities have lower unemployment or crime rates than other cities?

because the answers often involve fundamental differences among individuals or business
organizations. A time series equivalent to each of the preceding questions would be:

Why does my performance vary from one exam to the next?
Why is my company's employee morale higher in some months than in others?

Why has New York City's crime rate declined dramatically in recent years?

Regression based on each of these questions promises to yield a better fit. With time series
regression, we don't have to explain these differences because all the data are from the same
student, business, or city. For example, I may do better on exams if I get more sleep the night
before, but this may not be as true for others.
Though people, businesses, and cities do change over time, change often occurs at a
glacial pace. We all have friends who are set in their eating habits and music tastes. Similarly, a
conservative company is unlikely to change abruptly into a risk-taking enterprise. Finally, time
series variables often move together over time because of common movements in prices, population, and technology. We will have more to say about comparing fits among regressions
in Section 4.4.
Accounting for variation in a single product, company, stock, or country is generally easier than explaining why different items are different. Thus, time series regressions often produce higher R²s than do regressions on cross-sectional data.
One final point about regression fits must also be mentioned. Recall that both SEE and R² depend on SS_E from the fitted equation. However, this error sum of squares is highly sensitive to observations with larger errors, called outliers. Squaring the large errors associated with outliers may dramatically increase SS_E, resulting in a substantially lower R² and a much larger SEE. The regression fit is also heavily influenced by observations that would otherwise produce large errors. To minimize SS_E, the least-squares equation may be excessively shifted toward these influential observations. Methods to detect outliers and influential observations, and ways to measure their effect on fit measures and the regression equation, will be discussed extensively in Chapter 11.
Multiple Choice Review:
4.29 If SS_E = 200 and SS_T = 500, then R² is equal to
a. 60 percent
b. 40 percent
c. 30 percent
d. 20 percent
e. none of the above

4.30 The least-squares line has which of the following properties?
a. minimizes the sum of errors
b. passes through the most points possible
c. passes through the origin
d. minimizes the SS_E
e. all of the above

4.31 To determine the standard error of the estimate for simple regression, we divide by n - 2
in the denominator prior to taking the square root because
a. we lose two pieces of information to first determine the error
b. we lose two degrees of freedom
c. we must first estimate the slope and intercept of the fitted equation
d. all of the above
e. none of the above

4.32 If SS_E = 35 and SS_T = 140, then R² is equal to
a. 75 percent
b. 60 percent
c. 35 percent
d. 25 percent
e. none of the above

4.33 If R² is zero for a regression, then
a. the intercept for the regression equation must be zero
b. the slope for the regression equation must be zero
c. all the fitted values of the dependent variable must be zero
d. all of the above
e. none of the above


Calculator Problems:
4.34 Calculate R² and SEE from the following information:
(a) SS_E = 120, SS_T = 1000, n = 25
(b) SS_E = 8, SS_T = 24, n = 33
(c) SS_E = 60, SS_T = 64, n = 65
(d) SS_E = 0.4, SS_T = 1, n = 41
(e) SS_E = 44,100, SS_T = 250,000, n = 22

Question 4.34 a-e: (rounded) a is 88%, b is 67%, c is 6%, d is 60%, e is 82%
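As a quick self-check on problems like 4.34, both fit measures can be computed directly from SS_E, SS_T, and n. The short Python sketch below applies the formulas used in this section for simple regression, R² = 1 − SS_E/SS_T and SEE = √(SS_E/(n − 2)); the function name is ours, chosen only for illustration.

```python
from math import sqrt

def fit_measures(ss_e, ss_t, n):
    """Return (R-squared, SEE) for a simple regression, dividing SS_E by n - 2 before the root."""
    r_squared = 1.0 - ss_e / ss_t
    see = sqrt(ss_e / (n - 2))
    return r_squared, see

# Part (a) of 4.34: SS_E = 120, SS_T = 1000, n = 25
print(fit_measures(120, 1000, 25))   # R-squared = 0.88 (88%); SEE is about 2.28
```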
4.3 Summarizing the Strength of Variable Relationships
The goal of simple regression analysis is to determine the equation that best describes the
relationship of a dependent with an explanatory variable. But what if our aspirations are more limited? Suppose we merely want a numerical measure of linear association that tells how closely movements in one variable are paralleled by movements in a second variable. Unlike regression, measures of linear association don't require us to specify one variable as dependent and the other as explanatory. The reason is that we only want to measure the degree of association, not explain or predict one variable by another.
Figure 4.28 (scatterplot of AttendAv versus Top25CNN, with reference lines at the mean of each variable)
Measures of Linear Association: Covariance and Correlation
We begin by exploring what might be useful properties for a measure of association.
Let's re-examine the relationship between attendance and football ranking by plotting a
horizontal line at the mean attendance, 40,000, and a vertical line at the mean number of top 25
rankings, 2.6 (see Figure 4.28).
Next, we subtract 40 from the attendance data and 2.6 from the top 25 ranking data. We
now have data that measure differences from each variable's respective mean. These differences
are plotted in Figure 4.29.
Notice that this scatterplot looks just like the one in Figure 4.28 except that now the origin (0, 0)
has moved from the lower left corner inward nearer the middle of the graph. This new origin is
located at the same position as the mean of the original attendance and poll ranking data (40,
2.6). Thus, The difference-from-the-mean plot displays the same relationships among the data
points.
To assist us in analyzing this scatterplot, we have drawn the new X and Y axes, labeled
Quadrants I to IV, and identified five points A through E on the graph. Quadrant I corresponds
Figure 4.29 (scatterplot of the attendance and Top25CNN differences from their means, with Quadrants I through IV and points A through E labeled)
to universities with both attendance and ranking above the average for each of these variables.
Conversely, the third quadrant locates universities where both variables are below average. The
direct relationship between attendance and poll ranking is reflected in the predominance of
points in these two quadrants. Our measure of association should therefore reflect this by having
a positive sign. Second and fourth quadrant universities are less common in the data, alerting us
that few of the universities had one variable below its mean when the other was above its mean.
In fact, only point E and two others out of the 95 universities had above-average ranking and
below-average attendance. If points primarily occupied the second and fourth quadrants, we
would have an indirect relationship and a negative association between the variables. Finally, if
the quadrants were about equally populated, no relationship should be reported and a measure
should be near zero.
Unfortunately, when the pattern in a scatterplot is less clear-cut, the direction of association is more difficult to assess. In addition, we would like to assign a numerical value to our measure of association so we could claim one relationship is stronger or weaker than another. We next refine these results, with the ultimate goal of quantifying our measure of association.
Recall that univariate statistics use deviations from the mean to assign a value to
dispersion measures of variance and its square root, the standard deviation. For bivariate data,
we suddenly have two differences to consider: the deviation of each variable from its mean. To
measure the degree of linear association between two variables X and Z, we form the cross products (X − X̄)(Z − Z̄) of these deviations for each observation in the data. The mean of these cross products is called the covariance, so named because it measures the average co-variation between two variables.
DEFINITION: The covariance is a measure of linear association equal to the mean of cross-product deviations. The covariance of X and Z, s_XZ, is
s_XZ = Σ(X − X̄)(Z − Z̄) / (n − 1) ⁵
⁵ Despite calculating two means, we need to subtract only one degree of freedom because each column of data is used only once to derive a mean.
For the football attendance
case, sample computations for the first
seven of the 95 universities were
calculated in Excel and are listed in
Figure 4.30. The first university, The
Air Force Academy, had attendance of
38,900 and made the top 25 rankings 3
times in 11 years. Their attendance
was slightly below the mean of
40,050, so the difference column
Attend Dif registers a negative value
of -1.15 thousand. Three top-25
rankings were only slightly more than the 2.58 average, resulting in a small positive Top25 Dif value of 0.42. The -0.483 cross
product of these two differences, a negative and a positive, is negative but small because each
variable was so near its mean. By contrast, the two large cross product magnitudes, 264.5 and
222.5 (on lines 2 and 7 of the Excel table), are both positive and from deep inside Quadrant I of
the plot. Each represents a football powerhouse (Alabama and Auburn, respectively) with attendance well
above average and perennially ranked among the top 25.
The rest of the table continues in this fashion with most cross products positive,
especially the ones with the largest magnitudes. The sum of the cross products is 5505.
Averaging this total over n - 1 degrees of freedom, we obtain a covariance of
s_AttendAv, Top25CNN = 5505 / (95 − 1) = 58.56
Let's check this result on the computer. The Minitab instructions are listed in Figure 4.31.
Using Minitab to Obtain a Variance-Covariance Matrix
Pull-Down Menu sequence: Stat > Basic Statistics > Covariance...
Complete the Correlation (or Covariance) DIALOG BOX as follows:
(1) double click on each variable to be used in the covariance matrix
(2) click on the OK button
Figure 4.31

Covariances
AttendAv Top25CNN
AttendAv 486.7070
Top25CNN 58.5633 10.3315
Figure 4.32
AttendAv Top25CNN Attend Dif Top25 Dif cross product
38.9 3 -1.15 0.42 -0.483
75.7 10 35.65 7.42 264.523
50.8 2 10.75 -0.58 -6.235
51.6 1 11.55 -1.58 -18.249
43.4 4 3.35 1.42 4.757
33.8 2 -6.25 -0.58 3.625
81.1 8 41.05 5.42 222.491
Figure 4.30
Minitab furnishes us with variances for each of the two variables as well as the covariance.
Consequently, this matrix is called a variance-covariance matrix. The variances of AttendAv and
Top25CNN are on the main diagonal and the covariance, 58.56, is at the lower left of the Minitab
output (see Figure 4.32). Notice this is the same value we just calculated from the mean cross
products.
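Readers without Minitab can reproduce this kind of check with a few lines of Python. The sketch below computes the same mean cross product with an n − 1 divisor; the two small lists hold only the seven illustrative observations from Figure 4.30, not the full 95-university data set, so its output will not equal 58.56.

```python
# Covariance as the mean cross product of deviations, using an (n - 1) divisor.
attend = [38.9, 75.7, 50.8, 51.6, 43.4, 33.8, 81.1]   # AttendAv (thousands), first 7 schools
top25 = [3, 10, 2, 1, 4, 2, 8]                         # Top25CNN finishes for the same schools

def covariance(x, z):
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    cross_products = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    return cross_products / (n - 1)

print(covariance(attend, top25))   # covariance of the 7-school subsample only
```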
Covariance has some of the attributes we want in a measure of linear association: (1) it is
positive when most points are in Quadrants I and III and negative when points fall mainly in the
other two quadrants, and (2) the magnitude of s_XZ is greater when large deviations in X and Z
both occur for the same observations. An observation contributes a positive cross product to the
average when both variables are either above or both below their respective means. Thus,
positive cross products result from points in the first and third quadrants. If one is above while
the other is below, a negative product results from these second and fourth quadrant points. A
data set with mostly positive products results in a positive average cross product; mostly negative
cross products yield an average that is negative. The combined presence in the data of many
positive and negative products tends to offset one another, yielding an average near zero.
In addition to the sign, the magnitude of the deviations (indicated by the location of a
point within a quadrant) affects the cross product average. Consider points labeled A and D in
the attendance-poll ranking scatterplot. Since both are in Quadrant I, each should contribute a
positive product. But should they each have the same impact on the average cross product?
Point A ought to add more because it is further from the origin (0,0). In fact, nearby points just
below D on the graph should affect the average very little since the attendance difference is near
zero. By multiplying the differences from the mean, we can reflect this distinction.
But covariance has limitations that often disqualify it as a useful measure of association.
First, covariance is difficult to interpret. An average cross-product of deviations equal to 4600 is
not something easily explained to decision makers. In addition, the covariance is highly
sensitive to the units in which X and Z are measured. To better understand this sensitivity to the
choice of units, suppose that our data on attendance had been measured in actual attendance instead of in thousands. The data would still contain the same information, but numbers such as 10 and 95 would become 10,000 and 95,000. The deviations from the attendance mean would also be 1000 times larger, so the cross products would all have three extra zeros at the end. A covariance of 58.56 would become 58,560. Alternatively, if attendance were expressed
in tens of thousands instead of thousands, the resulting covariance would shrink by a factor of
ten to 5.856. Covariance has many other business uses, especially in the field of finance, but it is
only second best as a measure of linear association.
Because its value depends on the units used to measure the variables, the covariance has
limited appeal as a measure of linear association.
To remove this unit sensitivity, we divide the covariance by the average variability of each variable. In the process of dividing by the standard deviations of X and Z, denoted by s_X and s_Z, we remove the unit dependence from our linear association measure. We are left with a unitless measure called the correlation, which is confined to the interval from minus one to plus one.
DEFINITION: For bivariate data X and Z, the correlation, r_XZ, is a measure of linear association. It is the covariance scaled by the standard deviations of each variable: ⁶
r_XZ = s_XZ / (s_X s_Z)
⁶ This measure is often called simple correlation to distinguish it from partial and multiple correlation.
Now, let's use the computer to check out this relationship between correlation and covariance.
The instructions for obtaining these in Minitab are given in Figure 4.33.

Using Minitab to Obtain Correlations
Pull-Down Menu sequence: Stat > Basic Statistics > Correlation...
Complete the Correlation (or Covariance) DIALOG BOX as follows:
(1) double click on each variable to be correlated
(2) click on the OK button
Figure 4.33
For convenience, we have reprinted the variance-covariance matrix for the football attendance
case in Figure 4.34.

Covariances

AttendAv Top25CNN
AttendAv 486.7070
Top25CNN 58.5633 10.3315
Figure 4.34

The standard deviations of attendance and top 25 rankings are the square roots of these variances.
The correlation reported by Minitab is found in Figure 4.35.

Correlations (Pearson)

Correlation of AttendAv and Top25CNN = 0.826
Figure 4.35

Dividing 58.56 by the square roots of 486.71 and 10.33, we get
58.56 / (22.06 × 3.214) = 58.56 / 70.9 = 0.826
which is identical to the correlation the computer reports. This same relationship between
correlation and covariance may be discovered using Excel (see instructions in Figure 4.36).
Using Excel to Obtain Correlations or a Covariance Matrix
Pull-Down Menu sequence: Tools > Data Analysis...
(1) Select Correlation or Covariance from Analysis Tools and click OK
Then complete the Correlation (or Covariance) DIALOG BOX as follows:
(2) Click to place an x next to Labels
(3) Click inside box following Input Range and type cell range of label and data of the
designated variables (or drag mouse through adjacent spreadsheet data columns, and this cell
range will automatically appear in the box)
(4) Click OK to produce correlation or covariance matrix output
Figure 4.36
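The same scaling can also be verified outside of Minitab or Excel. The brief Python sketch below divides the covariance by the product of the two standard deviations, plugging in the values from the variance-covariance matrix in Figure 4.34; the variable names are ours, used only for illustration.

```python
from math import sqrt

# Values taken from the variance-covariance matrix in Figure 4.34
var_attend = 486.7070        # variance of AttendAv
var_top25 = 10.3315          # variance of Top25CNN
cov_attend_top25 = 58.5633   # covariance of the two variables

# Correlation = covariance divided by the product of the two standard deviations
r = cov_attend_top25 / (sqrt(var_attend) * sqrt(var_top25))
print(round(r, 3))   # 0.826, matching the Minitab correlation output in Figure 4.35
```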

Let's summarize in Table 4.2 some of the most useful properties of correlation.

Table 4.2
Properties of Correlation
1. Correlation is a unitless measure of linear association over the range −1 ≤ r ≤ +1.
2. The larger the magnitude of r, the stronger the linear association.
3. Variables are uncorrelated if r = 0 and perfectly correlated if r = +1 or −1.
4. Variables are positively correlated if they are directly related, and negatively correlated if
variables are inversely related.

For example, r = 0.25 is a weak, positive correlation, and r_XZ = −0.91 means that X and Z have a high, negative correlation.

Correlation Versus Regression and Cause-and-Effect Issues
For the two-variable case, correlation and regression are related mathematically in two ways. The first is that the correlation r_XY, when squared, is the same as R². This is true whether the regression uses Y or X as its dependent variable. In fact, r² is often used in place of R² in simple regressions.
But knowing R² alone does not permit calculation of r because there are two square roots for any positive number: a positive and a negative value. R² can only tell us the magnitude. Fortunately, if we know R², we must already have run the regression and found the fitted equation. To determine the correlation, simply attach the sign of the slope coefficient b_1 to the known magnitude for r, √R². We can do this because the correlation and slope coefficient of the regression share the same sign. An upward-sloping regression line means a positive correlation, and a negative slope for the regression also indicates a negative correlation.
R² for simple regression is the same as the squared correlation between the same two variables. The correlation between the variables in simple regression has the square root of R² for its magnitude and the same sign as the regression slope.
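This sign-and-magnitude rule is easy to apply mechanically. Here is a minimal Python sketch, written only as an illustration of the rule just stated, that recovers r from a simple regression's R² and slope:

```python
from math import sqrt, copysign

def correlation_from_simple_regression(r_squared, slope):
    """Magnitude is the square root of R-squared; the sign is copied from the slope b1."""
    return copysign(sqrt(r_squared), slope)

# Football attendance example: R-squared of 0.682 with a positive slope of 5.67
print(round(correlation_from_simple_regression(0.682, 5.67), 3))   # about 0.826
```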
Despite their similarities, correlation and regression also have very important differences.
Because it does not require us to decide which variables are dependent and which are
explanatory, correlation is often the best for exploratory data analysis. When we are tackling a
new subject that we know little about, correlation allows us to discover the direction and strength
of variable relationships in the data. Based on the correlation findings, we then may recommend
further investigation of these variable relationships. In Chapter 13, we will also use correlation
to help design and troubleshoot sophisticated regression studies.
Sometimes, however, something more than a descriptive measure of association is required. Because it finds a fitted equation, regression can make predictions or time-series forecasts and answer "What if?" questions. For example, a purchasing manager may use regression to predict his company's heating oil bills next winter. He first fits an equation to data on previous market conditions, and then inserts values for the explanatory variables to obtain next year's prediction.
Correlation alone cannot provide that information. Only regression analysis can.
In addition, if we know which of the two variables we want as the dependent variable, it
usually makes sense to run a regression. Like marriage and business contracts, regression entails
a commitment. Rather than committing to a single spouse or business partner, the regression
commits us to a particular dependent variable. Regression should be used whenever our purpose
is to predict or understand a particular dependent variable. All correlation tells us is the
direction and degree of association between two variables. As a result, r_XY means the same and contains the same computations as r_YX. Not so for regression equations. The simple regression
equation of Y on X cannot be found algebraically from the regression equation of X on Y.
Notice that it is not necessary for either variable to be the cause of variation in the other.
For example, higher tuition does not cause textbook prices to rise, nor do book prices dictate
tuition rates. Instead, college tuition and textbook prices move together over time because both
are driven up by inflationary costs throughout the economy.
A clear-cut case for regression is when cause-and-effect can be identified. Sometimes,
the laws of nature or the business world dictate that our dependent variable is determined by an
obvious group of explanatory variables. For example, the price of a hi-tech stock (StockPr) is
affected by profitability (Profit) and market share (MktShare) of the company. Finance theory,
logic, and experience in the stock market tell us that cause-and-effect flows in one direction: higher profits or greater market share will push up the stock price. Thus, the regression equation
predicted-StockPr = b_0 + b_1 Profit + b_2 MktShare
fit to company records over the past several years may be used to predict stock prices at
particular levels of profits and market share.
Regression and correlation each measure the direction and degree of linear
association or fit. Because regression requires us to choose a dependent variable, we are
rewarded with an equation for predicting and forecasting, dealing with "What if?"
questions, and addressing cause-and-effect issues.
However, one more caution is necessary before we close this section. Not every
regression involves a cause-and-effect relationship. Just because explanatory variables produce
good-fitting regression equations does not mean these variables cause the variation in the
dependent variable. There are several other possibilities. The most dangerous occurs when we
set up the regression backwards. Then, the dependent variable can be the cause and the
explanatory variable becomes the effect. For example, a hotel-motel association in a seasonal
resort area discovers that occupancy rates (OccRate) are highest when room rates (RoomRate) are highest in the following regression:
predicted-OccRate = b_0 + b_1 RoomRate
The excellent fit for this regression leads the hotel-motel association to conclude that raising room rates will increase occupancy rates. But this recommendation could be disastrous for the association members. The fit results primarily because the high occupancy rates in the peak tourist season enable the hotels and motels to charge higher rates. Thus, occupancy rate may
affect room rates. Chapter 13 discusses issues of reverse and joint causation more extensively.
More often, regressions do not represent cause-and-effect at all. Instead, the regression
merely reflects relationships between the dependent and explanatory variables. Other factors
may be the cause of variability in both the explanatory and dependent variables. For example, an
admissions director at an Ivy League college may use the following regression to predict the
potential performance of prospective applicants:
predicted-GPA = b_0 + b_1 SAT + b_2 HSGrades
where GPA is grades at the college, SAT is SAT exam score, and HSGrades is grade point
average in high school for students admitted in the past several years. Clearly, doing well on the SAT exam or receiving higher grades does not cause a student to perform better in college. Students who are intelligent and hard working tend to do well in high school, on standardized tests like the SAT, and thereafter in college. The regression fit for the admissions director results from the high correlation of intelligence and dedication to both the explanatory variables and the dependent variable. But it is difficult to measure intelligence and dedication to work. Therefore, colleges rely on regressions such as this one to predict performance and select their freshman classes.
Many regressions do not indicate a cause-and-effect relationship. Often, the regression fit
results when explanatory and dependent variables are jointly affected by the same factors
not included in the regression. The dependent variable may even be a cause of variation in
the explanatory variable.
CASE MINI-PROJECT:
Monthly data on short-term interest rates for commercial paper from November 1989 through August 1993 were examined by a financial analyst, using the following variables:
USA Int U.S. short term interest rates (in percent) in month t
Month a time trend variable for each of the n = 46 months: month = 1, 2, ..., 46
The regression output is:

The regression equation is
USA Int. = 8.95 - 0.146 Month
[other output not relevant is omitted here]

s = 0.4899 R-sq = 94.2% R-sq(adj) = 94.1%

Analysis of Variance
SOURCE DF SS MS F p
Regression 1 171.69 171.69 715.35 0.000
Error 44 10.56 0.24
Total 45 182.25

(1) The monthly trend rate in short term interest may be described by: Interest rates (decreased / increased) at an average rate of ______ percentage points per month.
(2) Assuming past trends continued, forecast interest rates for December 1993 (Month = 50): ______
(3) What percent of variation in interest rates is explained by the regression? ______%. This percentage may be verified by subtracting from 1 the following ratio: ______ / ______.
(4) The standard error of the estimate, 0.4899, is the square root of which number? ______
(5) To capture the actual interest rate outcome about 95% of the time, predictions based on the fitted equation may report a margin of error of roughly plus or minus ______ (please round).
(6) The correlation is negative because of the negative sign of the ______. The correlation coefficient between interest rate and Month is −0.______ [show work]
Multiple Choice Review:
4.43 The mean cross-product deviation for bivariate data is the
a. correlation
b. coefficient of determination
c. standard error of the estimate
d. covariance
e. none of the above

4.44 Which of the following is a unitless measure ranging from -1 to +1?
a. correlation
b. coefficient of determination
c. standard error of the estimate
d. covariance
e. all of the above

4.45 Which of the following is not a difference between regression and correlation?
a. correlation requires us to first plot the data
b. regression requires that we designate a dependent variable
c. regression allows us to make predictions
d. correlation is merely a measure of association
e. correlation is more appropriate for exploring a new area of inquiry

4.46 If r_XZ = 0, then we say that X and Z are
a. uncorrelated
b. perfectly correlated
c. negatively correlated
d. both a and b
e. both a and c

4.47 If the correlation between two variables is 0.2, then R² for the simple regression is
a. 0.40
b. 0.20
c. 0.04
d. 0.02
e. cannot be calculated with the information given

4.48 If a simple regression has an R² of 49 percent, then the correlation between the two
variables is
a. 0.2401
b. 0.49
c. 0.07
d. 0.70
e. cannot be calculated from the information given
Answer the next four questions based on the following case and statistical output:
The primary business of a newspaper is to sell readers to advertisers. Data are collected for a Florida newspaper on the following two variables:
advert advertising space (in thousands of inches) purchased during each month
month a time trend variable for each of n = 50 months: month = 1, 2, ..., 50

The regression equation is
advert = 102 − 0.501 month
[intervening output omitted]

s = 7.276 R-sq = 50.7% R-sq(adj) = 49.7%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 2612.4 2612.4 49.34 0.000
Error 48 2541.4 52.9
Total 49 5153.8

4.49 According to the regression output, advertising space in this newspaper
a. Increased at about 102,500 inches per month
b. Increased at about 500 inches per month
c. Decreased at about 102,500 inches per month
d. Decreased at about 500 inches per month
e. Cannot be determined from the information provided

4.50 The least-squares equation reduces the total variation in predicting advert
a. From 5154 to 2541
b. From 2612 to 52.9
c. From 2541 to 52.9
d. From 50.7 to 49.7
e. From 49 to 48

4.51 We may predict advertising space for month 20 to be approximately
a. 92,000 inches with a margin of error of 15,000 inches
b. 82,000 inches with a margin of error of 15,000 inches
c. 92,000 inches with a margin of error of 7,300 inches
d. 82,000 inches with a margin of error of 7,300 inches
e. 82,000 inches with a margin of error of 7,300 inches

4.52 The correlation coefficient between advertising space and month was
a. +0.71
b. - 0.71
c. +0.26
d. - 0.26
e. Insufficient information to determine answer

Calculator Problem:
4.59 Given that r_XY = −0.80, determine the following:
(a) R² for the regression of Y on X and the sign of the slope coefficient b_1
(b) R² for the regression of X on Y and the sign of the slope coefficient b_1

4.59 a and b have the same answer: 64% and a negative sign


4.4 Fitting Equations with More Than One Explanatory Variable
So far we have only considered simple regression equations, those containing exactly
one explanatory variable. Wouldn't it be great if business and economic phenomena were all that
simple to explain? Unfortunately, the world is usually much more complex than that. Prices, sales, currency exchange rates, and the number of housing starts are each too complex to be adequately explained by a single explanatory variable. We learned about the importance of multivariate data in Chapter 1, and saw in Chapter 2 how three variables could be displayed
graphically. We now need equations that quantify the relationship between a dependent and two
or more explanatory variables. Therefore, regression analysis must be extended to its most
common business use, multiple regression.
DEFINITION: A multiple regression equation is a least-squares equation with at least two explanatory variables; the general form for k explanatory variables X_1, X_2, ..., X_k is
Ŷ = b_0 + b_1 X_1 + b_2 X_2 + . . . + b_k X_k

Limitations and Dangers of Business Analysis by Simple Regression
If you've been shopping for a computer lately, you probably noticed large differences in prices among seemingly identical machines. Suppose a salesperson convinces you to buy a particular model that has just as much memory as the one priced a thousand dollars more.
After you get it home, however, you discover that the microprocessor slows your favorite
software to a crawl and the small monitor screen strains your eyes. The simple regression
Predicted-Price = b_0 + b_1 Memory
will probably be of limited use to computer shoppers because it portrays only a partial picture of
what makes computer prices vary. In this case, a simple regression approach may even cause
you to make a misguided buying decision. One warning sign is the poor fit that often results
from an incomplete simple regression equation. Because processor speed and monitor size are
important considerations that affect price, a more complete equation should also include them.
Including these two variables would change the simple regression to a multiple regression. A
multiple regression equation with the following form:
Predicted-Price = b_0 + b_1 Memory + b_2 Speed + b_3 Monitor
could have explained the thousand-dollar price differential by differences in processor speed and
monitor size. The addition to the equation of relevant explanatory variables will improve the fit
of the regression equation. The result will be a higher R² and a smaller standard error of the
estimate.
Including several potentially important explanatory variables is usually required to obtain an acceptable regression fit and to disentangle the distinct relationships among the variables.

Fundamental Similarities Between Simple and Multiple Regression
Have we been wasting our time in this chapter learning only how to analyze unrealistic
bivariate regression? Not at all. While the simple regressions we have been using may be too
simple for most business applications, nearly all of the regression analysis we learned already
can be immediately applied to multiple regression.
Most of the principles, terms, and methods used in simple regression apply directly or by
easy extension to working with multiple regression.
Notice, for example, that multiple regression equations still contain an intercept term b_0
and a coefficient for every explanatory variable. The only difference is now we have generalized
the regression equation to whatever number of explanatory variables we need to explain
movements in the dependent variable.
As we did with simple regression, the fitted regression equation is found using least
squares. The errors are still calculated by
e = Y − Ŷ
If an equation has five explanatory variables, for example, then finding the regression equation requires values for six different numbers: b_0, b_1, b_2, b_3, b_4, and b_5. In general, equations with k explanatory variables contain k + 1 values to find: an intercept and k variable coefficients. As before, the least-squares equation results in a smaller error sum of squares than any other possible choice of these k + 1 values. ⁷
⁷ You might expect even a high-speed modern computer to spend enormous amounts of time searching out the least-squares equation if k is large, say 20. But the computer doesn't have to search at all because there is a direct route to the exact solution via matrix algebra no matter how large k is. Without a computer, however, the computational task would be daunting indeed.

Multiple regression also uses the fit measures R² and SEE we learned about in Section 4.2. R², now called the coefficient of multiple determination, may be obtained from an Analysis of Variance table in the output by the expression
R² = 1 − SS_E / SS_T
where SS_E = Σe² is the same formula we used before for simple regression. Later in this section we will introduce an adjusted version of R² to give us greater flexibility in comparing fits.
Only the formula for the standard error, SEE, needs to be generalized for the number of degrees of freedom remaining after fitting the equation from sample data. For a sample of n observations, the intercept and k variable coefficients leave us with only n − k − 1 degrees of freedom.
The standard error of the estimate SEE in multiple regression is
SEE = √MS_E = √( SS_E / (n − k − 1) )
This is the general formula for SEE because for k = 1 explanatory variable, n − k − 1 reduces to the same n − 2 denominator we used with simple regression. For example, suppose that SS_E = 120, SS_T = 200, n = 36 observations, and there are five explanatory variables. Then k = 5, R² = 1 − 120/200 = 1 − 0.60 = 0.40. In addition, SEE is the square root of 120 / (36 − 5 − 1), or 2. In Chapter 10, we will see the fundamental importance of SEE in multiple regression forecasting and testing.
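The generalized formulas are easy to apply by hand or in a few lines of code. The Python sketch below, a simple illustration using our own function name, reproduces the worked example just given (SS_E = 120, SS_T = 200, n = 36, k = 5):

```python
from math import sqrt

def multiple_regression_fit(ss_e, ss_t, n, k):
    """Return (R-squared, SEE) using n - k - 1 error degrees of freedom."""
    r_squared = 1.0 - ss_e / ss_t
    see = sqrt(ss_e / (n - k - 1))
    return r_squared, see

print(multiple_regression_fit(120, 200, 36, 5))   # (0.4, 2.0), as in the worked example
```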
Consider the following regression equation fit of a rent equation from a sample of
apartments in a college town:
predicted Rent = −100 + 2.0 Size − 20 Walk
where Rent is monthly rent (in dollars), Size is measured by the apartment floor area (in square
feet), and Walk is the walking time to campus (in minutes).
You could use this regression equation to find affordable housing or locate relative
bargains and avoid the overpriced apartments. Suppose you and your roommates can afford no
more than $500 for rent. For a large apartment of 1000 square feet and 5 minutes from campus,
you should expect to pay -100 + 2(1000) - 20(5) = -100 + 2000 - 100, or $1800. To find an
affordable place, a smaller apartment further from campus must be selected. For a 600 square foot apartment requiring a 30-minute walk, rent should average only $500 because
−100 + 2(600) − 20(30) = −100 + 1200 − 600
Notice once again that the negative intercept is no cause for concern unless absurdly low (or even negative) rent would be fit from realistic values for Size and Walk.
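A fitted multiple regression equation like this one is just a formula, so a prediction is a single evaluation. The hypothetical helper below, a minimal Python sketch of the idea rather than anything produced by Minitab or Excel, evaluates the rent equation for the two apartments discussed above:

```python
def predicted_rent(size_sqft, walk_minutes):
    """Evaluate the fitted equation Rent = -100 + 2.0*Size - 20*Walk (dollars per month)."""
    return -100 + 2.0 * size_sqft - 20 * walk_minutes

print(predicted_rent(1000, 5))   # 1800.0: the large apartment close to campus
print(predicted_rent(600, 30))   # 500.0: the smaller apartment with a 30-minute walk
```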
It is easy to lose an intuitive feel for regression once we depart from bivariate analysis and simple regression. With even two explanatory variables, X_1 and X_2, there is a total of three variables, and plotting the regression relationship requires three-dimensional graphics: two axes for X_1 and X_2 and another for Y. For the general case of k explanatory variables in the multiple regression equation, there are k + 1 variables to consider. The least-squares equation becomes impossible to portray graphically. Nevertheless, the least-squares regression equation retains its algebraic meaning as the best fitting formula.
Multiple regression equations summarize the linear relationship of the dependent variable
with the explanatory variables and may be used to predict the dependent variable for any
particular values of these explanatory variables.
Least-squares computations for multiple regression are complex and cumbersome, but
they are easily performed with computer software such as Minitab or Excel. As a case example,
again consider the football attendance example from Section 4.2. Recall that the simple regression using top-25 poll ranking as the explanatory variable explained more than two-thirds of the variation in average attendance. The far-from-perfect fit was evidenced by a scatterplot with large dispersion around the regression line. We would have been surprised if R² were substantially higher than the 68.2 percent we obtained. After all, football attendance is determined
by a complex set of factors, and success in the polls only captures one aspect. Older universities
have an advantage in reputation, tradition, and football lore to attract people to the game. Larger
universities have an automatic source of attendance on campus and a larger alumni network of
past graduates. Realizing the need for a more comprehensive regression equation, suppose the
sports consulting firm decided to represent these two factors by two additional explanatory
variables, program age (in years) and university enrollment (in thousands). The regression
equation therefore took on a multivariate form:
predicted AttendAv = b_0 + b_1 Top25CNN + b_2 ProgAge + b_3 Enrollmt
It is just as easy to generate multiple regression output as it was for simple regressions.
Instructions and computer screens for Minitab and Excel are in Figures 4.37 through 4.40.
Using Minitab to Run Multiple Regression
Pull-Down Menu sequence: Stat > Regression > Regression...
Complete the Regression DIALOG BOX as follows:
(1) click on dependent variable from the variable listing and press Select button to have this
variable appear in the Response: box on the right.
(2) double click on each explanatory variable from variable list and they will appear in the
Predictor: box.
(3) click on the OK button when all the regression variables have been added
Figure 4.37

Using Excel to Perform Multiple Regression Analysis
Pull-Down Menu sequence: Tools > Data Analysis...
(1) Select Regression from Analysis Tools list in Data Analysis box, and click OK
Then complete the Regression DIALOG BOX as follows:
(2) Click to place an x next to Labels
(3) Click inside box following Input Y Range and type cell range of label and data for the
dependent variable (or drag mouse through spreadsheet data, and this cell range will
automatically appear in the box)
(4) Click inside box following Input X Range and type cell range of label and data for all
explanatory variables (or drag mouse through adjacent columns of spreadsheet data)
(5) Click OK to produce regression output
Figure 4.38

Regression Analysis
The regression equation is
AttendAv = - 0.73 + 5.19 Top25CNN + 0.196 ProgAge + 0.430 Enrollmt
[other output not relevant to this problem]

s = 11.02 R-sq = 75.8% R-sq(adj) = 75.0%

Analysis of Variance

SOURCE DF SS MS F p
Regression 3 34698 11566 95.23 0.000
Error 91 11053 121
Total 94 45750
Figure 4.39
First observe how similar the Minitab output in Figure 4.39 looks to the simple
regressions we observed earlier. All the entries appear to be in their proper places and have the
identical labels as before. The only obvious difference is that the regression equation is longer.
The same is true for the corresponding Excel output in Figure 4.40.

As we expected, the multiple regression equation produced a better fit. R² increased more than 7 percentage points, from 68.2 to 75.8 percent. The new equation accounts for over three-fourths of the variation in attendance. Another sign of the improved fit was the decline in the standard error of the estimate. SEE went from 12.51 in the simple regression to 11.02 in the multiple regression. Translating this to attendance in thousands, the rough approximation for the ±2 SEE margin of error fell from about 25,000 to 22,000.
Suppose the university commissioning this study intends to make the CNN top 25 in 8 of
the next 11 years, has a 100-year-old football program, and has an enrollment of 25,000.
predicted AttendAv = - 0.73 + 5.19 Top25CNN + 0.196 ProgAge + 0.430 Enrollmt
= - 0.73 + 5.19 (8) + 0.196 (100) + 0.430 (25)
= - 0.73 + 41.52 + 19.6 + 10.75
= 71.14

Figure 4.40
or about 71,000 attendance. Clearly, the additional two variables had an impact on predicted
attendance. For example, if the program were 50 years old rather than 100, ten thousand less
attendance would have been predicted. A university with 46,000 students, on the other hand,
would have led to a prediction of 80,000 attendance. As mentioned earlier, a margin of error of
roughly 22,000 may be included with these predictions.

Marginal Effects: Other Things Equal Interpretations
The coefficients in multiple regression still may be interpreted as slopes, but only for a line graphed in (k + 1) dimensions. We may restore the analysis to a two-dimensional plane if we hold all but one explanatory variable fixed during the analysis. By invoking this "other things equal" condition, we may answer "What if?" questions about the effects of changes in one particular explanatory variable.
Marginal Effects for Multiple Regression: If all other explanatory variables remain constant, the regression coefficient b_j measures the average individual effect on Y of a one-unit increase in the jth explanatory variable. The predicted change in Y, ΔŶ, from a change of ΔX_j in X_j is
ΔŶ = b_j ΔX_j

As we did in Section 4.1, we have again used the delta notation (Δ) to indicate the change in a variable. Consider the college apartment rent example, with regression equation
predicted Rent = −100 + 2.0 Size − 20 Walk
Using ΔŶ = b_j ΔX_j, we interpret the regression equation to mean that among apartments equally far from campus, rent will be $2 higher on average for every additional square foot of floor area.
Armed with this information, suppose you have narrowed your search to apartments 20 minutes' walking distance from campus. Then you should expect to pay 2(120) = $240 higher rent for an additional 120 square feet (the approximate area of an extra bedroom). To justify more than this $240 rent differential, the landlord must offer other features (dishwasher, pool, quiet study conditions, or attractive neighbors) not included in the equation. Otherwise, you should take your business elsewhere.
Suppose instead you are willing to walk a bit more, or drive in and pay for campus parking, if you can save enough on rent. Because of the negative sign attached to the coefficient of Walk (that is, b_2 = −20), the fitted equation says that equal-sized apartments cost on average $20 less rent for each extra minute of walking time. Since ΔRent = (−20)(15) = −$300, you can save an average of $300 for equal-sized apartments if you extend your search to places located 15 minutes further out from campus. In Part IV we will add the statistical inference tools of testing and interval estimation to enhance these conclusions.
The marginal effects analysis is also very enlightening for our college football attendance
case study. Recall that the fitted equation computed by Minitab and Excel was
AttendAv = - 0.73 + 5.19 Top25CNN + 0.196 ProgAge + 0.430 Enrollmt
Thus, each extra time the team finishes in the top 25 is worth more than 5,000 in attendance.
Enrollment growth of 1,000 translates to 430 extra people in the stands at game time. Finally,
when a program ages ten years, its attendance rises an average of nearly 2,000. These are each
potentially useful insights gained by running a multiple regression equation.
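Because each marginal effect is just a coefficient times a change, these "other things equal" calculations can be scripted directly. The short Python sketch below, our own illustration of the ΔŶ = b_j ΔX_j rule, reproduces the attendance effects quoted above (all attendance figures in thousands):

```python
# Fitted multiple regression coefficients for AttendAv, measured in thousands per game
coefficients = {"Top25CNN": 5.19, "ProgAge": 0.196, "Enrollmt": 0.430}

def marginal_effect(variable, change):
    """Predicted change in AttendAv when one explanatory variable changes, others held fixed."""
    return coefficients[variable] * change

print(marginal_effect("Top25CNN", 1))   # 5.19 thousand per extra top-25 finish
print(marginal_effect("Enrollmt", 1))   # 0.43 thousand (430 people) per 1,000 extra students
print(marginal_effect("ProgAge", 10))   # about 1.96 thousand as a program ages ten years
```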

Comparing Fits Among Different Equations
On the surface, it seems that a higher R² would also mean a better fitting regression equation. This is not necessarily so. There are three situations where a higher R² may not indicate a preferable regression equation: (1) the data analyzed are different, (2) the dependent variable is different, or (3) the regression equations do not contain the same number of explanatory variables.
The fits among different regression equations may be compared only if the data sets may be considered samples from the same population. Otherwise, the fits are not comparable. For example, cross-sectional data nearly always produce substantially lower R²s than time series data. After all, it is easier to identify the cause of a sales increase at a particular company than to account for why one company's sales are markedly higher than another's. For similar reasons, cross-sectional data are likely to produce better fits the more narrowly we define the population. Our regression of attendance on top-25 ranking applies only to I-A football programs. Sampling from a more broadly-defined population, such as smaller I-AA and Division II programs, would reduce the fit. We would need additional explanatory variables, such as athletic scholarships awarded, to account for the differences between big-time and smaller football programs.
The same regression equation may also experience varying degrees of success in fitting different time periods. For example, the upheavals in financial, commodity, and international markets during the 1970s caused many economic analysts to switch professions. The 1982-1989 era, by contrast, a period of general expansion and stable prices, was conducive to excellent fits for regression equations. The implications for forecasting will be explored in Chapter 12.
Even if samples are drawn from the same population, regression R²s are not directly comparable if the dependent variables in the regression equations differ. Consider an alternative regression equation in which the dependent variable is the ticket and concessions revenue rather than attendance at the games. Then the R²s of these two regression equations are not directly comparable because the two regressions are explaining different variations. Remember that R²
tells us the percentage of total variation in the dependent variable explained by the explanatory variables. If we change dependent variables, we also alter the amount and pattern of variation to be explained. We would expect the game revenue regression to yield the lower R² because revenue is affected not only by attendance but also by ticket and concession prices.
Finally, consider the case of two regressions run on the same set of data and using the
same dependent variable and all the same explanatory variables except for one difference: there
is one (or more) explanatory variable in the second regression not present in the first regression.
Surely we can directly compare these Rs. Regressions containing additional variables cannot
have a smaller R and will nearly always have a larger R. If added variables explain variation in
the dependent variable not explained by variables in the smaller equation, R must be greater in
the larger equation.
Is a regression equation with more explanatory variables necessarily better than one with
fewer? A strong case can often be made for using fewer variables. Regression equations are by
definition simplifications of reality. If we can give decision makers a fitted equation involving
only a handful of explanatory variables, the analysis is easier to understand and translate into decision-making criteria. Thus, simpler is often better, as long as the regression equation
provides an acceptable fit and includes the most important explanatory variables. We have just
finished making a comparison in the football attendance case example. The simple regression
equation contained only top 25 ranking information, while the second regression equation
included additional variables, program age and student enrollment. We opted for the latter
regression because these added variables account for important aspects that improved the fit
considerably.
But extra variables provide a misleading advantage if fits are compared using the R² measure. To see why, consider the extreme case of sample data for which the program age and student enrollment variables are totally unrelated to game attendance. Suppose the least-squares procedure were to furnish us with the following regression equation:
predicted AttendAv = 25.4 + 5.67 Top25CNN + 0 ProgAge + 0 Enrollmt
This is the same as the least-squares equation for the simple regression because zero times any quantity is zero. Thus, the R² for either regression equation would be identical. But even if the added variables are unrelated to the dependent variable in the population, the sample coefficients will almost never be exactly zero, so the R² will almost surely be larger because least-squares gives us the best-fitting equation for the sample data. Consequently, fluctuations in these variables can and will be translated into fit improvement even if they don't reflect any real relationship in the underlying population.
Therefore it is conventional to report the adjusted R² fit, as well as (or in place of) R².
DEFINITION: Adjusted R² for a regression is a measure of fit that makes a downward adjustment to R² to correct for the average contribution to the sample fit if the explanatory variables were unrelated in the population.
The adjustment is greatest for very small sample n or when the number of explanatory variables k is a substantial fraction of n (the actual formula is given in the Exercises). Unlike R², adjusted R² is lower for equations with more explanatory variables if the downward adjustment between the larger and smaller equation exceeds the contribution to R² from the additional variables. Regression equations with different k or n are usually compared for fit by using their adjusted R².
Minitab and Excel each report this statistic; in Minitab, it is abbreviated "R-sq(adj)" and printed to the right of R-sq. For the case of the two regressions we ran on attendance, the R²s rose from 68.2 to 75.8 percent, an increase of 7.6 percentage points. Meanwhile, the adjusted R²s increased only 7.1 percentage points, from 67.9 to 75.0 percent. Due to the large sample size and the small number of explanatory variables, the adjustments were relatively minor. Either the sample size or the improvement in R² would need to be much smaller to produce a meaningful decline in the adjusted R². Therefore, even after comparing adjusted R²s, it appears that the inclusion of the program age and enrollment variables improved the overall fit and added useful insights to the regression analysis.
When comparing the fit of two different regression equations, be sure that the dependent variable and type of data are the same. Even if they are, adjust for differences in the number of explanatory variables before comparing R²s.
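The text leaves the exact adjustment formula to the Exercises, but a common form of the adjustment is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). Assuming that standard formula is the one intended, the hedged Python sketch below reproduces the adjusted figures quoted for the two attendance regressions:

```python
def adjusted_r_squared(r_squared, n, k):
    """Downward adjustment of R-squared for sample size n and k explanatory variables."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

# Attendance regressions on n = 95 universities
print(round(adjusted_r_squared(0.682, 95, 1), 3))   # about 0.679 for the simple regression
print(round(adjusted_r_squared(0.758, 95, 3), 3))   # about 0.750 for the multiple regression
```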
CASE MINI-PROJECT:
The following multiple regression determines insurance charged to short haul trucking firms:
premium = b_0 + b_1 fleetsz + b_2 popden
where the variables in the regression equation are defined as:
premium = insurance premiums charged each firm (in dollars per truck)
fleetsz = number of trucks owned by each firm
popden = county population density (people / square mile) where company is located
From a random sample of n = 32 trucking firms surveyed in 1993, the regression output is:

The regression equation is
premium = 779 - 5.49 fleetsz + 0.885 popden
[other output not relevant is omitted here]
s = 332.5 R-sq = 36.6% R-sq(adj) = 32.2%

Analysis of Variance

SOURCE DF SS MS F p
Regression 2 1850413 925206 8.37 0.001
Error 29 3206046 110553
Total 31 5056459

1. This data set consists of time series / cross section (circle one) data.
2. This is multivariate regression because there is more than one ______ variable.
3. The ______ statistic tells us we have explained over one-third of the variation in premiums.
4. The degrees of freedom for the error sum of squares = ______, found by ______ − ______ − 1.
5. Use the fitted equation to predict premiums for a firm with 10 trucks in a county with 50 people / square mile.
[show work and round answer to the nearest dollar] $______
6. For the following, show computations assuming no other explanatory variable changes:
(a) Companies increasing fleet size by 20 trucks are charged $110 lower premiums on average.
b_1 Δfleetsz =
(b) Companies moving to counties with 200 fewer people / square mile are charged $177 less on average.
b_2 Δpopden =
Multiple Choice Review:
4.72 Multiple regression means that
a. we run more than one regression equation
b. we run the same regression equation more than once
c. we have more than one dependent variable
d. we have more than one explanatory variable
e. none of the above

4.73 When one regression equation contains an additional explanatory variable not present in a
second, the first equation will nearly always have a
a. higher adjusted R²
b. lower adjusted R²
c. higher R²
d. lower R²
e. both R² and adjusted R² may be higher or lower depending on the data

For each of the following four questions, assume that the regression equation for the apartment
rent example is
Rent = - 200 + 1.0 Size - 10 Walk
4.74 The estimate of monthly rent for a 700 square foot apartment that is a 20 minute walk
from campus is
a. $200
b. $300
c. $400
d. $500
e. cannot be calculated from the information given

4.75 How much should you expect rents to change on average if you moved to the same-sized
apartment, but 7.5 minutes further walking distance from campus?
a. $15 less
b. $15 more
c. $75 less
d. $75 more
e. exactly the same amount

4.76 The monthly rent for a 1200 square foot apartment that is a 2 minute walk from campus
may be estimated to be
a. $1180
b. $980
c. $1220
d. $1420
e. $760

4.77 How much should you expect rents to change on average if you moved to an equal-sized apartment 10 minutes further walking distance from campus?
a. $100 less
b. $100 more
c. $20 more
d. $200 more
e. exactly the same amount

4.78 We should not expect R² from two regressions to be directly comparable if
a. the samples for the two regressions were gathered from very different time periods
b. the sample for one of the regressions was from a more broadly-defined population
c. the dependent variables used in each regression were defined differently
d. all of the above
e. none of the above

4.79 Suppose a particular regression equation is utterly worthless in accounting for variation in
the dependent variable over the entire population. Then we should expect that the least-
squares equation from a random sample of that population will yield an R² = 0
a. only very rarely
b. less than 50 percent of the time
c. most of the time
d. nearly always
e. always

4.80 Which is not true about the difference between R² and the adjusted R²?
a. the difference is greater for smaller sample sizes
b. the difference is greater if R² is large
c. the difference is greater if there are more explanatory variables in the equation
d. adjusted R² cannot be greater than R²
e. adjusted R² cannot be greater than 100 percent

4.81 Multiple regression is so often used in business and economics today because
a. most variables we seek to explain are affected by a complex set of factors.
b. an acceptable fit often cannot be obtained with only a single explanatory variable.
c. controlled experiments are often too difficult to conduct.
Business Statistics and Applications --Soskin and Braun Page 157

d. the abundance of data and speed of modern computers make multiple regression a
practical option.
e. all of the above.

4.82 In comparing regression fits on cross section and time series data
a. R² is usually lower for cross section data because it is easier to explain why different items are different
b. R² is usually higher for cross section data because it is easier to explain why different items are different
c. R² is usually lower for cross section data because it is more difficult to explain why different items are different
d. R² is usually higher for cross section data because it is more difficult to explain why different items are different
e. there is no systematic difference between fits for either type of data

Answer the next five questions based on the following case description and statistical output:
We next examine a regression equation where advertising space is a function of newspaper sales
and the state of the economy:
advert advertising space (in thousands of inches) purchased during each month
circ the monthly level of circulation (measured in millions)
jobless the monthly unemployment rate (in percent)
The regression output is:

The regression equation is
advert = 125 + 1.95 circ - 6.13 jobless

4.83 Forecast advertising space next month when circulation is 4 million and unemployment
rate is 6 percent.
a. 22,000 inches
b. 29,000 inches
c. 96,000 inches
d. 132,000 inches
e. None of the above

4.84 If the circulation numbers were instead reported in thousands (rather than millions) of
newspapers, what would the coefficient of circ now have to be for the fitted equation to
tell exactly the same story?
a. 1950
b. 1.95
c. 0.00195
d. 0.00000195
e. Cannot be determined from the information provided.

4.85 Assuming unemployment is the same, a circulation decline of one million will result in a
a. 1950 inch increase in advertising space
b. 127,000 inch increase in advertising space
c. 1950 inch decrease in advertising space
d. 127,000 inch decrease in advertising space
e. None of the above

4.86 For a one percentage point increase in the unemployment rate, other things equal, advertising
space will
a. increase by about six thousand inches
b. increase by about 119 thousand inches
c. decrease by about six thousand inches
d. decrease by about 119 thousand inches
e. none of the above

4.87 Which of the following describes the data set and regression equation:
a. data set is time series data and the regression is multivariate regression
b. data set is cross section data and the regression is multivariate regression
c. data set is time series data and the regression is simple regression
d. data set is cross section data and the regression is simple regression
e. not enough information provided to determine the data set and regression type.

Calculator Problems:
4.88 Determine ΔŶ = b₁ ΔX for each of the following cases:
(a) b₁ = 15, ΔX = 10
(b) b₁ = 15, ΔX = 100
(c) b₁ = 1.5, ΔX = 100
(d) b₁ = -5, ΔX = 10
(e) b₁ = 15, ΔX = -10

4.88: a is +150, b is +1500, c is +150, d is -50, e is -150


Chapter Review Summary
A linear equation relating business variables lets us summarize the relationship among
variables by an equation, make predictions about a variable based on data for related variables, and
measure the sensitivity of one variable to changes in a related variable.
Regression analysis is a statistical method for analyzing linear relationships among
bivariate or multivariate data. A regression equation is an equation summarizing the linear data
relationship of a dependent variable with one or more explanatory variables. Simple
regression involves equations with only one explanatory variable. The form of the Simple
Regression Equation is Ŷ = b₀ + b₁X. The predicted value of Y, Ŷ, represents the approximate
values of the dependent variable calculated from the regression equation. Unrealistic regression
intercept values are common when values near zero of the explanatory variable are impossible or
do not occur in the data.
The Marginal Effects Rule for Simple Regression is that, for a given change ΔX in the
explanatory variable, the average change in the dependent variable is ΔŶ = b₁ ΔX.
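To make the rule concrete, here is a minimal Python sketch (not part of the Minitab workflow used in this text) that applies it to the apartment rent equation from the review questions above, Rent = -200 + 1.0 Size - 10 Walk:

b_walk = -10                     # fitted slope on walking time (dollars per minute)
delta_walk = 7.5                 # move 7.5 minutes farther from campus
delta_rent = b_walk * delta_walk
print(delta_rent)                # -75.0: rent averages about $75 less

The same one-line multiplication reproduces the answers to review questions 4.75 and 4.77.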
The regression equation of a trend line for a time series variable Y is Ŷ = b₀ + b₁ t,
and b₁ is the overall trend rate. A forecast is a prediction about the future value of a variable.
Extrapolation is prediction based on values outside the range of the data used to make the
prediction. Include the warning when forecasting from time trends: assuming past trends
continue.
To be useful to business, regression analysis must provide us with two types of
information: (1) the best-fitting equation, and (2) how well the data fits that equation. The
standard error of the estimate, SEE, in regression is the square root of the mean of the squared
errors. The standard error of the estimate measures average deviation of the dependent variable
from the fitted regression equation in the same way that standard deviation measures variability
from the mean.
The mean square error, MSE, is the sum of squared errors averaged over the number of
degrees of freedom: MSE = SSE / DF. The standard error of the estimate is the square root of the
mean square error. For normally distributed errors, predictions based on regression often may be
roughly approximated with a margin of error of two standard errors.
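As a quick illustration (a Python sketch rather than anything produced by Minitab), the error sum of squares and degrees of freedom from the housing starts trend regression in Case Study Exercise III below reproduce the MS and s values shown in its ANOVA table:

import math

SSE = 437648                # error sum of squares from the ANOVA table
DF = 46                     # error degrees of freedom (n - k - 1 = 48 - 1 - 1)
MSE = SSE / DF              # about 9514, the mean square error
SEE = math.sqrt(MSE)        # about 97.54, matching s in the Minitab output
print(round(MSE), round(SEE, 2))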
In regression analysis, the total sum of squares, SST, is the sum of squared deviations of
the dependent variable from its mean. SSE is nearly always smaller than SST, and much smaller
if knowledge of X provides strong evidence about the value of Y. In the case of a perfect fit, SSE
is zero.
R² measures the goodness of fit for a regression by the portion of variation in the
dependent variable accounted for by the explanatory variable(s). R² = 1 - SSE / SST, or in
percentage terms, R² = (1 - SSE / SST) × 100%. R² measures how good a fit is while SEE reports
the variability around the fit. SEE is measured in the same units as the dependent variable so it
can inform us about the margins of error. R² is a relative measure that lets us determine what
percentage or fraction of a perfect fit we have achieved. Therefore, these two fit measures
complement one another. Accounting for variation in a single product, company, stock, or
country is generally easier than explaining why different items are different. Thus, time series
regressions often produce higher R² values than do regressions on cross-sectional data.
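For example, a short Python check (again just a sketch, not Minitab output) using the hospital pharmacy regression in Case Study Exercise IV below recovers its reported R²:

SSE = 8892                         # error sum of squares from the ANOVA table
SST = 29284                        # total sum of squares
r_squared = 1 - SSE / SST
print(round(100 * r_squared, 1))   # 69.6, matching R-sq = 69.6% in the output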
The covariance is a measure of linear association equal to the mean of cross-product
deviations. Because its value depends on the units used to measure the variables, the covariance
has limited appeal as a measure of linear association.
For bivariate data X and Z, the correlation, rXZ, is a measure of linear association. It is
the covariance scaled up or down by the standard deviations of each variable:
rXZ = sXZ / (sX × sZ), where sXZ is the covariance and sX and sZ are the standard deviations.
Correlation is a unitless measure of association defined over the range -1 to +1. The larger the
magnitude of r, the stronger the association. Variables are uncorrelated if r = 0 and perfectly
correlated if r = +1 or r = -1. Variables are positively correlated if they are directly related, and
negatively correlated if they are inversely related.
R² for simple regression is the same as the squared correlation between the same two
variables. The correlation between the variables in simple regression has the square root of R²
for its magnitude and the same sign as the regression slope. Regression and correlation share
certain computations, and both measure the direction and degree of linear association or fit.
Because regression requires us to choose a dependent variable, we are rewarded with an equation
for predicting and forecasting, dealing with "What if?" questions, and addressing cause-and-
effect issues. However, many regressions do not indicate a cause-and-effect relationship.
Often, the regression fit results when explanatory and dependent variables are jointly affected by
the same factors not included in the regression. The dependent variable may even be a cause of
variation in the explanatory variable.
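A small Python sketch confirms this square-root relationship for the two trend regressions in the Case Study Exercises below; the gas price trend has a positive slope, the housing starts trend a negative one:

import math

print(round(math.copysign(math.sqrt(0.669), +0.414), 2))   # 0.82 for gas price (R-sq = 66.9%)
print(round(math.copysign(math.sqrt(0.757), -12.2), 2))    # -0.87 for housing starts (R-sq = 75.7%)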
A multiple regression equation is a least-squares equation with at least two explanatory
variables; the general form for k explanatory variables X₁, X₂, ..., Xₖ is
Ŷ = b₀ + b₁X₁ + b₂X₂ + . . . + bₖXₖ
Including several potentially important explanatory variables is usually required to obtain an
acceptable regression fit and to disentangle the distinct relationships among the variables. Most of
the principles, terms, and methods used in simple regression apply directly or by easy extension
to working with multiple regression. Multiple regression equations summarize the linear
relationship of the dependent variable with the explanatory variables and may be used to predict
the dependent variable for any particular values of these explanatory variables.
Marginal Effects for Multiple Regression: If all other explanatory variables remain
constant, the regression coefficient bⱼ measures the average individual effect on Y of a one unit
increase in the jth explanatory variable. The predicted change in Y, ΔŶ, from a change in Xⱼ of
ΔXⱼ is ΔŶ = bⱼ ΔXⱼ.

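As an illustration, here is a Python sketch (illustrative only) using the newspaper advertising equation from review questions 4.83-4.86, advert = 125 + 1.95 circ - 6.13 jobless:

b0, b_circ, b_jobless = 125, 1.95, -6.13

def predict_advert(circ, jobless):
    # predicted advertising space in thousands of inches
    return b0 + b_circ * circ + b_jobless * jobless

print(round(predict_advert(4, 6)))     # about 96 thousand inches (question 4.83)
print(b_jobless * 1)                   # -6.13: a one-point rise in the unemployment
                                       # rate cuts space by about six thousand inches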
The standard error of the estimate, SEE, in multiple regression is
SEE = √MSE = √( SSE / (n - k - 1) )

Adjusted R² for a regression is a measure of fit that makes a downward adjustment to R²
to correct for the average contribution to the sample fit that would arise even if the explanatory
variables were unrelated to the dependent variable in the population. When comparing the fit of two
different regression equations, be sure that the dependent variable and type of data are the same.
Even if they are, adjust for differences in the number of explanatory variables before comparing R² values.
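The text does not spell out the adjustment formula here, but one standard version (and the one that matches the Minitab output used in this text) is R²(adj) = 1 - (SSE/SST)(n - 1)/(n - k - 1). A short Python check against the CPA exam regression in Case Study Exercise II below reproduces both reported fit measures:

SSE, SST = 3685.78, 4942.25     # from the ANOVA table
n, k = 49, 3
r_sq = 1 - SSE / SST
r_sq_adj = 1 - (SSE / SST) * (n - 1) / (n - k - 1)
print(round(100 * r_sq, 1), round(100 * r_sq_adj, 1))   # 25.4 20.5, matching the output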


Case Study Exercises
I. Answer the next four questions based on the following case and statistical output:
The following regression equation:
Predicted P GAS = b₀ + b₁ month
where the variables are
P GAS    monthly average price at the pump for regular gasoline (in cents / gallon)
month    month in the time series, with months numbered from 1 to 53
was regressed for n = 53 months from April 1986 to August 1990. The Minitab output was:

The regression equation is
P GAS = 83.8 + 0.414 month
[other output not relevant is omitted here]

s = 4.532 R-sq = 66.9% R-sq(adj) = 66.3%

1. Rounded to the nearest cent, the annual trend rate (i.e., every 12 months) in gas price
increases is about
a. 1 cent b. 5 cents c. 84 cents d. 89 cents e. None of the above

2. The forecast of gas prices in March 1991 (i.e., month = 60), assuming past trends
continue is approximately
a. 25 cents
b. 35 cents
c. 84 cents
d. 109 cents
e. None of the above

3. The variation in gas prices explained by this trend equation is approximately what
fraction?
a. One third
b. One half
c. Two thirds
d. Three quarters
e. Cannot be determined from the output above


4. The correlation between gas price and month is
a. 0.44
b. 0.66
c. 0.67
d. 0.82
e. Insufficient information provided to determine the answer

II. Answer the next four questions based on the following case and statistical output:
Data on the following four variables were collected from a random sample of n = 49 persons
taking the business law section of the CPA (Certified Public Accountant) exam:
LAWSCORE = each person's score on the business law section of the CPA exam
HOURS = number of hours that person studied per week to prepare for the exam
GPA = undergraduate grade point average of the person taking the exam
WORKEXP = number of years work experience of the person taking the exam
for the regression equation in the following form:
Predicted LAWSCORE = b₀ + b₁ HOURS + b₂ GPA + b₃ WORKEXP
When the regression was run, the following output results:

The regression equation is
LAWSCORE = 53.8 + 0.400 HOURS + 3.45 GPA + 0.272 WORKEXP
[other output not relevant is omitted here]

s = 9.050 R-sq = 25.4% R-sq(adj) = 20.5%

Analysis of Variance

SOURCE DF SS MS F p
Regression 3 1256.46 418.82 5.11 0.004
Error 45 3685.78 81.91
Total 48 4942.25

1. The degrees of freedom for the error sum of squares is calculated as follows:
a. 48 - 1 - 1 = 46 degrees of freedom
b. 50 - 2 = 48 degrees of freedom
c. 45 - 4 = 41 degrees of freedom
d. 49 - 3 - 1 = 45 degrees of freedom
e. None of the above

2. If a second regression equation reports an R² = 28.6% and an adjusted R² = 19.2%, then
this second equation has
a. A better fit than the regression equation above.
b. A worse fit than the regression equation above.
c. The same fit as the regression equation above.
d. Not enough information provided to answer this question.

3. Assuming the other explanatory variables don't change, ten more hours study weekly
a. increases exam scores an average of 0.4 points
b. increases exam scores an average of 4 points
c. increases exam scores an average of 40 points
d. increases exam scores an average of 58 points
e. None of the above

4. The exam score for a person studying 30 hours per week who earned a 3.0 grade point
average in college and has ten years work experience is predicted to be approximately
a. 72 b. 79 c. 85 d. 91 e. None of these

III. At the beginning of 1991, a national realtors association wants you to analyze trends in the
home construction industry over the preceding four years and forecast new construction activity
this year.
Predicted H starts = b₀ + b₁ Month
The variables in the regression equation are, for month t:
H starts    number of housing starts (in thousands) during month t
Month       month t (= 1 to 48, from January 1987 to December 1990)
The regression is fit yielding the following Minitab output:

The regression equation is
H starts = 1727 - 12.2 month
[other output omitted]

s = 97.54 R-sq = 75.7% R-sq(adj) = 75.2%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 1366204 1366204 143.60 0.000
Error 46 437648 9514
Total 47 1803852

Now answer the following questions:
(1) What was the monthly trend rate in housing starts over the period examined?
Housing starts (increased / decreased) at an average rate of ____ thousand per month
decreased, 12.2
(2) What percent of variation in housing starts was explained by this trend equation? 75.7%
(3) Calculate the correlation between Month and H starts.

r = -√R² = -√0.757 = -0.87, because the slope b₁ < 0
(4) Verify with a calculator that the standard error of the estimate has the proper relationship to
the mean square error MSE.
SEE = √MSE = √9514 = 97.54
(5) Verify with a calculator that R² has the expected relationship to the error and total sums of
squares, SSE and SST.
R² = 1 - SSE / SST = 1 - 437,648 / 1,803,852 = 0.757, or 75.7%
(6) Remember that month = 48 was December 1990. Use a calculator to forecast housing starts
for January 1991 and again for February 1991 from the fitted regression equation.
predicted H starts = 1727 - 12.2(49) = 1129.2 thousand for January 1991
predicted H starts = 1727 - 12.2(50) = 1117.0 thousand for February 1991
(7) The actual values for housing starts were 844 thousand in January 1991 and 1008 thousand in
February 1991. Compare your answers in question (6) above with these actual outcomes, and discuss
how close each of your forecasts was.
Both forecasts were too high; however, the second forecast (for February 1991) was much
closer, missing the actual outcome by only 109 thousand rather than 285 thousand.
(8) Below is the time series plot of housing starts from 1987 through the end of 1991 (i.e., for an
additional 12 months beyond the data for which the regression equation was fitted):

-
- **
1750+ *
- * * *
H strts - ***** *
- * 2 *** *
- **2 * *
1400+ ** * * ** *
- ** * * *
- * *
- ****
- 2 * ***
1050+ * **2
- * ***
- *
- *
-
--+---------+---------+---------+---------+---------+----month
0 12 24 36 48 60

(9) Based on your examination of this plot, discuss how you can tell that the regression equation
we estimated would yield very bad forecasts for mid- and late-1991 (months 54 to 60).

The downward trend over the first four years (48 months) was used to fit the regression
equation. However, the trend may have reversed and turned into an upward trend
thereafter. Thus, forecasts from the fitted line will be increasingly wrong and too low.
(10) If a different regression equation were to be used instead of a time trend equation, suggest
one or more explanatory variables that would explain the variation in housing starts in the U.S.
Acceptable answers: interest rates, construction costs, unemployment, and population.

IV. The director of a hospital pharmacy wants to determine how staffing level affects the rate at
which prescriptions are processed. She decides to use the following regression equation.
prescrip = b₀ + b₁ staff + b₂ in-pat
where the variables are defined as follows:
staff       average number of staff on duty on day t
prescrip    prescriptions processed per hour during day t
in-pat      number of in-patients at the hospital on day t
Forty-seven days are sampled and a regression yields the following output:

The regression equation is
prescrip = - 67.1 + 21.0 staff + 0.395 in-pat.
[other output]

s = 14.22 R-sq = 69.6% R-sq(adj) = 68.3%

Analysis of Variance

SOURCE DF SS MS F p
Regression 2 20392 10196 50.45 0.000
Error 44 8892 202
Total 46 29284

(1) Verify that the error degrees of freedom should be 44, given this regression and sample size.
DF = n - k - 1 = 47 - 2 - 1 = 44
(2) Using the marginal effects "delta" (Δ) formula: Based on the fitted equation, determine the
expected change in the prescription processing rate, assuming the other explanatory variable in
the regression equation does not change. [In each case, show the delta formula computations and
report answers with the correct direction of change, numerical magnitude, and correct units.]
(a) The staff level at the pharmacy is increased by two persons.
Δprescrip = b₁ Δstaff = (+21)(+2) = +42, an increase of 42 prescriptions / hour
(b) The number of patients at the hospital decreases by 100.
Δprescrip = b₂ Δin-pat = (+0.395)(-100) = -39.5, a decrease of about 40 prescriptions / hour
(3) Prediction:
(a) Predict the prescription processing rate when there are 200 patients in the hospital and 10
persons maintained on staff at the pharmacy.

predicted prescrip = -67.1 + 21.0(10) +.395(200) = -67.1 + 210 + 79
= 222 prescriptions / hour
(b) Why shouldn't you worry about the negatively signed intercept term, -67.1, in the fitted
equation? [Hint: use information from the descriptive statistics below]
Even if both variables are at their minimum values, the predicted prescrip would still have a
positive value.
N MEAN MEDIAN TRMEAN STDEV SEMEAN
staff 47 9.512 9.563 9.509 1.037 0.151
prescrip 47 170.55 174.93 170.68 25.23 3.68
in-pat. 47 95.98 97.00 95.72 15.40 2.25

MIN MAX Q1 Q3
staff 7.375 12.125 8.625 10.187
prescrip 110.00 227.86 153.57 188.14
in-pat. 72.00 129.00 81.00 108.00

(c) What is the danger of using the fitted equation to predict "prescrip" if only four staffers are
on duty at the pharmacy?
The range of the staff data used to fit the regression was 7.375 to 12.125, so predicting for a
staff of only four would be extrapolation, and such predictions cannot be trusted.





CHAPTER 5 COLLECTING AND
GENERATING DATA


Approach: We expand our knowledge about types and sources of business data and how to
collect and record data. We use computer software to collect random samples from a
population and explore the most common types of random sampling employed in survey
research.

Where We Are Going: Having completed Part I, we are ready to introduce inferential
statistics as soon as the fundamental concepts of probability and probability distributions
are covered. The data gathering methods discussed here will be applied throughout the
remainder of the text.

New Concepts And Terms:
primary and secondary data, observational and experimental data
random numbers and random sampling
anecdotal evidence, nonresponse bias, and systematic sampling

SECTION 5.1 The Essential Role of Data in Business Decision Making
SECTION 5.2 Collecting Data by Sampling from a Population

5.1 The Essential Role of Data in Business Decision Making
In the first four chapters, we have seen some of the uses of statistical analysis. We
demonstrated how displaying and summarizing data can help businesses deal with day-to-day
problems, make informed decisions, and plan their future course of action. But before we can
examine a histogram or find the median, we must first have the data to analyze. An essential and
often overlooked area of business statistics revolves around obtaining the proper data. We
therefore conclude Part I by delving into the ways business data is obtained.
Opponents of data analysis often claim that data is too expensive or unnecessary to solve
a problem. But data can be a real bargain compared to the cost of making the wrong decision.
The most commonly-voiced excuse is, "We don't need to waste time and money collecting data
because everybody knows that..." Unfortunately, what "everybody knows" to be true may actually
be wrong! Management may unanimously believe that company sales are rising and customer
satisfaction is at a record high. Since these claims involve measurable data, the company should
rely on actual sales data and survey customers.
Business executives who never did well in math classes often feel uncomfortable and
insecure around numbers and graphs. But "ignorance is bliss" is a poor excuse for not gathering
important business data. Operating in the dark inevitably results in ill-advised and uninformed
decisions. Data is necessary to keep people honest. Without data to monitor what's happening
and track where it has been, a business can easily become inefficient and drift aimlessly.
Data is also important to challenge prejudice in the workplace. Racist and sexist
stereotypes are often deeply-held in traditionally all-white male factory and office settings.
Sensitivity training workshops to raise worker consciousness may be unsuccessful unless data
is produced showing that women and minority employees are no less dedicated and productive.
It is easy to fall prey to quack remedies when hard facts aren't available. In many
counties, blame was recently leveled at the local airport authority for mismanagement and lack
of vision. It seems that major air carriers announced they were no longer serving many local
airports. A host of costly and radical quick-fix solutions were immediately proposed in each
locality. Some advocated starting up their own airline, while others argued passionately for
taxpayer-subsidized financial guarantees to attract airlines back.
Fortunately, cooler heads usually prevailed. Most local governments decided to gather
data before rushing to judgment. A national survey of airports made it apparent that fewer and
fewer major airlines were operating out of most small-city airports. Clearly, the loss of air
carriers was a more general phenomenon and could not be blamed on the incompetence of any
single airport authority. The survey also discovered that passenger departures, total number of
flights, and airport revenues at most of these smaller urban airports had not declined, and airfares
did not rise after the remaining airlines added flights to replace the lost ones. Thus, because
there were no major harmful results, most local governments decided that a wait-and-see
approach was warranted until data indicated fares or overall service were adversely affected. In the
few localities that did not assemble the relevant data, however, unwarranted blame, panic, and
pie-in-the-sky panaceas dominated the debate.

Population and Variable Selection
But not just any data will suffice. Data must be gathered from the proper population. In
Chapter 1, we learned that a population consists of data on all relevant objects or events.
Therefore, to identify the proper population, we must decide which persons, things, or events are
relevant. For example, we chose for our compensation case a population consisting of all CEOs
from large computer-related corporations.
Sometimes a complete listing of the population is not available or even possible. Every
year, thousands of small businesses fail, relocate, change their name, or switch their line of
business. Such volatility makes it impossible for even the most conscientious chamber of
commerce to maintain up-to-date listings of local businesses. Other populations, such as all
feasible car designs, cannot be listed because there is an endless set of designs possible.
In such cases, a description of the population is used in place of a listing. For example,
the chamber of commerce instead may describe the characteristics of small businesses, such as
the population of privately owned, for-profit enterprises with fewer than 100 employees
currently operating within city limits. Similarly, the population of car designs may be described
by what makes a car a car and not a bus, a truck, a motor home, or a minivan.
Populations that cannot be listed explicitly can usually be defined by a set of conditions or
characteristics that describe only the members of that population.
Besides designating the population from which to collect data, we must also decide on
appropriate variables in the population to collect. Although crucial to statistical decision
making, variable selection can also be a confusing task. There are many wrong turns we can
make and few road signs to get us safely to our destination.
The most important consideration is to select variables that can be measured in a
consistent and acceptable way. Most people report to pollsters that they are in favor of freedom,
the environment, and progress but oppose crime, welfare, and big government. However, words
like freedom and the size of government may mean different things to different people. To
convert a concept like welfare into a measurable variable, it has to be carefully defined. But

once a specific variable definition is established, the popular consensus usually dissolves. For
example, when farm subsidies, social security, and corporate welfare are included in the
definition, recipients of these government programs voice considerably less opposition to
welfare.
Some variables are so elusive that definitions may be meaningless. Economists, for
example, base many of their theories on the principle that the profit motive governs business
behavior. Even if profits in the short run suffer when a business sponsors a little league team or
pursues market share, these actions may be motivated by a desire to raise long-run profits. This conclusion
may be valid if the little league team fosters goodwill toward the business or if increasing market
share helps to corner the product market. But there is a risk of engaging in circular reasoning if
we allow all business actions to be explained as profit motivated. To guard against this
temptation, precise variable definitions should be adopted beforehand and not revised whenever
new data fail to match our preconceived ideas.
Poorly-defined variables may also be contaminated by a second variable that exaggerates
a point or even reverses a conclusion. A leading news magazine cover story once reported that
recreational vehicle (RV) sales had doubled in the past decade. However, the statistics ignored
the doubling of prices that occurred over that same period. Had the number of RVs been
measured, rather than their dollar value (or had prices been adjusted for inflation), the story
might not have rated a feature article. Increasing government spending is a recurrent theme of
newspaper editorials. As with the RV example, inflation artificially magnifies spending
increases. In addition, population growth creates more taxpayers to fund government
expenditures but more people needing government services such as education and roads. It turns
out that inflation-adjusted government spending per capita has been fairly stable in recent years.
Sometimes private or public decisions are based on the wrong variables. A quality
control officer at a hotel chain may measure quality by the percentage of rooms cleaned each day
by noon, while the chain's reputation suffers from a different lack of quality control: a
reservation system that causes frequent overbooking of rooms. A widely-publicized news story
reported that illiteracy levels had risen so high that one-third of Massachusetts job seekers were
unable to complete the state unemployment form. Anyone who has tried filling out a
government form might object to this as a measure of functional literacy. Even if the form was a
valid test for literacy, the news report did not provide comparison data from previous years. It is
possible that the rate used to be even higher than one-third. Illiteracy may be severe, but we will
never know until we collect appropriate data.
For years, analysts advised U.S. steel firms to increase investment enough to catch up
with their more efficient German and Japanese rivals. But the U.S. firms were investing
primarily in outmoded technology, so worker productivity could not rise much. Industry

analysts should have recommended measuring only investment in modern technologies (such as
basic oxygen furnaces and continuous casting).
The management of a food processing company, trying to avoid a proxy fight from its
stockholders, cites the company's excellent performance in its industry. Dividends and stock
prices may be performing better than many other industry stocks. Nevertheless, a stockholder
group could still successfully challenge management based on low company profits in recent
quarters. Unless profits are monitored as well, management may not effectively guard itself
against a proxy challenge.
Often a variable is either accidentally or deliberately selected which slants the results in a
desired direction. A Japanese copier firm claimed to be the market leader. Since the firm
specializes in small, low-cost machines, the firm is number one only in sales units shipped and
not among the top five in total value of copier shipments. Even a specific variable such as
product weight may require additional clarification. Consumers comparing hamburgers among
fast-food chains need to know that "quarter-pounders" are named for their pre-cooked weight,
before the patties lose their fat and water on the grill.
As a final example of misleading and inappropriate variables, supermarket chains are
fond of reporting to consumer groups that, after meeting expenses, their profits are only one or
two percent of the product price. This rate is substantially smaller than most manufacturing
profit margins. But wait a minute! Manufacturers actually produce products. By contrast,
supermarkets mainly assemble, unpack, shelve, sell, and bag the goods they sell. Because the
manufacturer is responsible for producing the product, a better choice of variable might be
profits as a percent of what the supermarket contributes to the product's value, called value-
added. Suppose services offered by supermarkets add 15 cents on the dollar to the value of a
product, and manufacturing makes up another 60 percent (with the remaining 25 percent going to
distributors and raw material suppliers). Because they contribute only one-fourth as much to
value-added, supermarkets appear much less profitable relative to manufacturing based on price
margins. Using a profit share of value-added as our variable, we can directly compare profits
among different types of businesses from retailing to manufacturing.
Data collection should focus on variables whose scope matches the problem being
investigated and whose information is not contaminated by extraneous variables.




Data Precision and Rounding
Collecting data on quantitative variables presents special challenges. In Chapter 1, we
learned that units and scaling factors are necessary to bring business data to life. The precision
of a measurement is also an important factor in how we should record data.
Remember your last eye exam. When you tried to read the bottom line of an eye chart, a
fuzzy letter that looks like a P may actually be a B, D, or R. Like your vision, scientific
instruments and accounting records may be accurate only to the first few figures (engineers call
these significant figures). Data should always be rounded so that the last digit reported doesn't
exceed the precision of the measurements.
Did you ever watch a friend or relative figure how many miles their new car gets to the
gallon of gas? Suppose the trip odometer registers 331 miles since the last fill up and the tank
took 15.9 gallons to fill. Dividing 331 by 15.9, the new car owner proudly proclaims the
calculator's answer of 20.81761 miles / gallon. But the closest answer that should be reported is
20.8 miles / gallon, rounded to the 3-figure precision of numbers used in calculating this
quotient. Even 3-figure precision may be unjustified. Filling the tank to the same level as last
time is difficult because gasoline expands at higher temperature or the car may be refilled while
parked at an angle. Thus, the actual precision justifies rounding only to the first two figures, or
21 miles / gallon.
Sometimes, rounding is done for reasons other than data precision. It is easier to work
with simpler, rounded numbers if they don't sacrifice any meaningful detail. For example, would
the Board of Directors really have benefited by knowing CEO compensation to the nearest dollar
(e.g., $1,215,735)? Unfortunately, newspaper articles, corporate reports, and political campaign
literature often report data to excessive precision to dazzle readers, stockholders, or the voting
public.
Data should be rounded to the number of figures that best reflects the amount of useful
information present in the data.

Data Sources
Explosive growth in business news magazines, newspapers, and Internet services has
made market, financial, trade, accounting, and other secondary data routinely accessible to
nearly everyone.

DEFINITION: Secondary data is data collected for purposes other than those intended by the
end user.
Each public corporation, for example, must satisfy government reporting requirements.
Governments provide us with a wide range of secondary data on economic conditions such as
trade, the money supply, tax revenues, interest rates, unemployment, and inflation. Other
valuable sources of secondary data are trade associations, chambers of commerce, and
international organizations.
However, secondary data may not satisfy all our particular needs. After all, government
agencies and the financial world don't check with us before deciding which variables to collect or
populations to sample. Often, in fact, the kind of data and the way it is recorded is justified
primarily by government regulations or the all-purpose excuse, We've always done it that way.
. Thus, some important information never is recorded or gets lumped together with other
information.
Suppose, for example, you need information on how heavily laundry detergent brands are
advertised on television. Advertising trade publications may only report overall promotional
expenditures without breaking out TV spending from other types of radio, magazines, billboards,
discount coupons, and free samples. Even worse, the data combines promotional expenses for
all products of the parent corporation, not just laundry detergents. Some laundry brands may be
unrepresented if they are produced by nonpublic corporations. These problems are common
whenever we use secondary data. In that case, the only way to get the crucial data you need is to
observe and record it first hand. The data we need must then be collected on our own as primary
data.
DEFINITION: Primary data are data collected or generated by the end users for their own
analytical purposes.
In some cases, obtaining primary data merely involves recording observational data.
Hiring a firm to record traffic flows past your business at different times of day may be far more
useful to you than the Department of Transportation's average daily traffic volume counts at the
intersection down the block. On the other hand, primary data collection can be a time-
consuming and expensive alternative to secondary data. The benefits of specialized but
expensive primary data should always be weighed against the convenience and cost savings from
more generic secondary data sources.
DEFINITION: Observational data are data that can be measured and recorded merely by being
present when the data is generated.


Observational data does not require direct intervention in the process. The data is continually
and automatically produced. All anyone has to do is be there to observe and measure the data
when it occurs.
Another common source of primary data is a survey, such as those used in new product
marketing and customer satisfaction studies. Questionnaires are designed and either mailed out or
administered by phone interviews. Data on the desired variables for the relevant populations can
be obtained if the survey design and collection process is carefully planned out. Surveys are
also an excellent source of data on attitudes and expectations that are not easily available from
secondary data sources. A survey, for example, can provide advance insight into how much
consumers are willing to borrow in the coming year. Survey design methods will be explored
later in this chapter.
With surveys, data is merely collected from respondents to a questionnaire. In a sense,
we are still talking about observational data because we are simply observing product
preferences or household debt that already exist. We need only commission a survey to collect
the data. On the other hand, survey data is not strictly observational because subjects must be
contacted and then decide to respond before information can be collected.
By contrast, we sometimes require a type of primary data that is not observational at all
because it doesn't exist. The only way to bring it into existence is by designing and conducting
an experiment.
DEFINITION: Experimental data is data specially generated by a carefully-designed
experiment that measures the response by subjects to different treatments under a controlled set
of conditions.
Like the science projects you did back in high school, business experiments usually
require careful planning plus a generous dose of time and resources. In this rapidly changing
world, businesses often confront decisions that have little parallel to previous experiences.
Suppose a production manager wants to determine whether flexible work scheduling will reduce
employee absenteeism. If this type of production operation has never tried flextime before, there is no
data to analyze. One method to obtain relevant data would be to design an experiment where
some workers are assigned to flextime, others are not, and absenteeism, morale, and productivity
rates are compared. As with surveys, experiments like this one are often not approved because
of their costs and the time necessary to design and conduct them. Poorly designed experiments
can result in useless or even misleading results. But information from a carefully conducted
experiment can provide a business with essential advice about whether to proceed with a major
new system changeover. The final section of this chapter will discuss much more about
experimental design, and later in this text we will examine inferential methods for analyzing data
generated from experiments.

Time Considerations in Business Data
In the first four chapters, we saw many examples of time series and cross sectional cases.
With cross section data, businesses can determine what their universe looks like at a particular
point, a snapshot in time. For example, a multinational enterprise surveys its work force and
finds that experience and wages are highest at its U.S. plants but that education is greatest at its
Western European subsidiaries. A quality control officer ranks the various causes of production
line stoppages at a factory last year; she discovers that the production line was stopped 47 times
because equipment needed adjustment and only 12 times was the line stoppage due to worker
error.
Statistical analysis of time series data has assumed a crucial role in business and
government policy formulation. To comply with tax laws and regulations, businesses are
required to periodically report all sorts of financial and operational data. The business press and
Internet have enormously expanded our access to business and economic data. Time series
information to analyze business trends and construct economic forecasts is now easily available
online at our desktop computer. Variables such as net worth, gross receipts, wages and salaries,
depreciation, unemployment, gross domestic product, and interest rates are reported and recorded
as "series" and are therefore referred to as serial data. If a long term contract offer is being
considered, the Board at Tangerine may also want you to forecast the trend in CEO
compensations based on industry patterns of increases. You might then perform time series
analysis on the yearly average compensations in the industry over the preceding 15 years. The
annual time series data contains a sequence of industry compensation figures from every recent
year.
Confusion between time series and cross section data is a common source of erroneous
business decisions. For example, a company catering to the rapidly growing retired population
may need to predict future markets for its products. But it is improper to expect the elderly of
the future to have buying habits that mirror those of today's elderly. Just as an old family
photograph freezes forever in time people's age, hair style, and clothing fashion, cross section
data records the information about corporations, cities, executives, or consumers at one period in
time. Cross sectional market surveys of today's senior citizens reflect values and tastes shaped
by the previous half-century. The elderly today are often more thrifty and traditional because
they grew up under Depression hardships, World War II rationing, and more restrictive moral
codes. An understanding of serial data is essential for predicting what the elderly will buy in the
future, because people retiring in future decades will be the postwar "baby-boomers." Wise
marketing and product development planners will tailor advertising campaigns and new product
designs to a generation raised in a prosperous and less restrictive society.

Relying on cross-sectional data also makes our analysis highly sensitive to the
circumstances at the time the data were gathered. If the most recent period was unusual in some
respect (inflation or depression, wartime, energy crisis and gas rationing), then statistical
analysis of data from that period might only apply to times with similar circumstances. In such
cases, use of data from the more distant past may inject a sense of perspective to our analysis.
Business analysts also should be wary of before-and-after comparisons. Substantial shocks such
as major mergers or adopting new accounting practices may alter the variables being measured.
Either the data must be adjusted to make earlier measurements comparable, or the analysis should be
restricted to observations occurring since the change.
Time series data should also be used with care. People who base their arguments around
time series data are often the ones who misuse it the most. We are cautioned to remember the
"lessons of history" else we are doomed to repeat the tragedies that befell our predecessors.
Unfortunately, history is difficult to decipher and easily interpreted to support many different
theses. Some try so hard to avoid past mistakes that they fail to act at all or go too far in the
opposite direction. General Motors had poor performance with a president who emphasized
marketing and finance, but they saw little improvement when they replaced him with a CEO
whose primary expertise was engineering. Blind imitation of previous successes, such as
television spinoffs and movie sequels, can be equally foolhardy. To reduce these problems,
overly simplistic analysis of time series data should be avoided.
Finally, time considerations are involved when decisions are implemented. Don't be too
quick to proclaim success or failure for recent business or public policy decisions. Spectacular
early triumphs often do not translate into permanent success. A "quick fix" solution may patch
up a crisis only temporarily. On the other hand, it may take time for a bad situation to be
reversed. TQM is but one example in which front-end investment in training and reorganization
requires that we emphasize long-term performance.
Confusing time series and cross section data can lead to erroneous conclusions. The
problem with cross section data is that it reflects conditions present at the time the data
was collected. By contrast, analysts using serial data should beware of reaching simplistic
before-and-after comparisons or making premature judgments.

Multiple Choice Review:
5.1 Which of the following variables would best measure the quality of education offered
among a population of colleges?
a. SAT exam scores of entering students at each college
b. salaries of professors at each college
c. average grades of all graduates at each college
d. satisfaction surveys for graduates of each college
e. the tuition costs at each college

5.2 When making decisions based on time series data, one rule of advice is
a. base your decisions only on the most recent data available
b. don't be too quick to proclaim success or failure of a new policy
c. we must learn the lessons of history so that we never risk repeating past mistakes
d. the behavior of people in each age group today can be used to predict the behavior
of those same age groups in the future
e. time series data is superior to cross section data for decision making

5.3 What is wrong with deciding whether a business has grown over the past twenty years by
comparing the sales revenues?
a. the variable is contaminated by price changes from inflation
b. this involves circular reasoning
c. sales revenue usually cannot be measured
d. sales revenue is not a variable
e. there is nothing wrong

5.4 Which of the following is not an example of faulty reasoning involving time?
a. A firm suffering a temporary cash flow crisis fires a large portion of its workforce.
b. Things were better in the good old days, so we should do things the way they used to
be done.
c. Luxury car sales increased in the 1980s because of ads using 1960s rock music, so the
same music is approved for ad campaigns in the 1990s.
d. all of the above
e. none of the above


5.2 Collecting Data by Sampling from a Population
In Chapters 1 and 3, we learned that we are frequently unable to collect data from the
entire population. Because of the competitive demands of the business world on our time and
resources, businesses must often content themselves with a sample drawn from the population.
While they can offer us enormous savings in cost and time, samples that represent only a tiny
fraction of the population can reveal all the information we may need about the complete
population.


Chapter Case #1: Problems Measuring the U.S. Populations
Many terms in statistics, such as population and census, owe their origin to census
collections. The U.S. government is bound by the Constitution to conduct a national census, a
complete count of the population, every ten years. The U.S. Census findings are very important
because they determine the number of congressional representatives each state gets and the
distribution of federal funding for numerous government programs. Economists, sociologists,
and business planners also use Census information in their analysis. For example, a hotel chain
may not want to target a new franchise to a region with declining incomes. The county school
board uses population counts for the youngest age groups to decide whether additional schools
should be budgeted.
But a census collection can be a formidable task, especially for a nation as large and
diverse as the United States. Many people value their privacy and are suspicious of government
intrusions. The enormous cost of the Census (half a million Census workers and billions of
dollars budgeted) precludes more frequent data collection, such as annually. Unless
information is timely, it may not be very useful to decision makers. Some Census data from
1990 were not current enough for businesses to seize a market opportunity in 1994. For states
experiencing heavy migration, such as Washington and Florida, the 1990 data were too outdated
for solving the problems of 1995. Those in need of more timely information must rely upon
estimates from sample data.
Besides providing infrequent and expensive information, there is another, equally-serious
deficiency with Census data. In the United States, despite the massive expenditures, years of
planning, and survey reliability controls, the decennial Census reports inaccurate information. In
every recent census, millions remained uncounted; evidence from survey samples verifies the
huge scope of these errors.
The undercounting errors are most severe in large cities and among African Americans
and Latinos. Illegal aliens and urban minorities are less likely to be reachable by Census
enumerators. In addition, the "plight of the homeless" is a Census-taker's nightmare and a social
problem because the homeless can't be mailed forms or reached by phone. The highly publicized
effort in 1990 to locate and interview the homeless in subway stations and public shelters yielded
mixed success. As a result, states with the heaviest concentrations of uncounted residents, such as New
York, lost federal funds and political clout in Congress because the Census reported their
populations wrong.
Why doesn't the Census Bureau use statistical sampling adjustments to reduce these
enumeration errors? Although these sampling techniques were recommended by nearly every
prominent statistician, the Census Bureau was ordered in 1987 by the Secretary of Commerce to

abandon its planned post-enumeration survey.⁷ The result was an undercount of some five
million people. The Secretary was accused of playing politics with data collection, since
undercounting occurred more in regions populated by the opposition party and by recipients of
federal welfare programs. Similar political roadblocks prevented the use of sampling for the
2000 Census.
In statistics, we attempt to collect sample data that provides a reasonably accurate picture
of the parent population. A process called random sampling is used to obtain sample data that
represents the information in the population.

Random and Simple Random Sampling
Market research firms and sample survey teams are also hired by the business community
to collect primary data when conventional data sources fail to address important problems. How
are these survey samples generally collected? We discussed how descriptive statistics such as
the sample mean gives us some idea about the population mean. We also learned about drawing
samples and discovered the importance of collecting random samples.
If we must use a sample to make an educated guess about the population mean, μ,
statisticians usually advise us to draw a random sample because of its desirable properties.
DEFINITION: A sample of size n drawn from a population is a random sample if every
observation in that population has the same chance of being selected as any other.
In practice, things are usually a bit more complicated. There is more than one type of
random sample from which to choose. The most common form is simple random sampling.
DEFINITION: A random sample of n observations is a simple random sample if every possible
sample of n observations has the same chance of being selected as any other.
The easiest way to understand how simple random samples are selected is to examine a
familiar example: a fair lottery drawing. Most states now sell lottery tickets, and winning tickets
are those that match all numbers from a simple random sample. The typical selection method is
to set spinning balls marked with all possible numbers (say, 1 to 49) in a clear container. Then
numbered balls pop up into the chamber atop the container and six balls are displayed as the
winning combination. One week the winning combination might be 3, 5, 17, 32, 42, and 46.


⁷ A fascinating account of the 1992 trial that grew out of New York City's challenge to the accuracy of the 1990 Census is found in
Chance (1992) vol 5, Nos. 3-4, pp. 28-38.

This is a simple random sample because the set of six numbers had the same chance of winning
as any other six numbers (about one in 14,000,000).⁸

Now imagine the lottery machine contains balls marked with the CEO compensations for
each of the 38 major electronic and computer firms in the industry population. Then our simple
random sample could be collected by selecting n = 6 different balls. The sample is random if the
balls are the same size and weight and are thoroughly mixed so that each has the same chance of
popping up.

Using the Computer to Simulate Random Sampling
How would such a simple random sample be collected in an actual business situation?
Although we do not find population data balls spinning in lottery containers, we can easily
reproduce the random-selection conditions present in a lottery. A numbered listing of the
companies, stocks, employees, cities, or any other type of population could be used to randomly
select a sample.⁹ Today's statistical and spreadsheet software can help us collect these random
samples.
What does a random sample look like and how much does it resemble the population
being sampled? If this were a science text, you would have classroom demonstrations and
laboratory work to convince you that the laws of nature you are reading about are true. We too
can conduct experiments to demonstrate the fundamental ideas in business statistics, but these
simulations today use computers rather than laboratory test tubes and Bunsen burners.
DEFINITION: A simulation reproduces a real-world phenomenon by artificially duplicating its
most essential features.
Simulation often provides a convenient alternative, especially when the actual experiment
involves high costs, time delays, or excessive risks.¹⁰ Airline pilots (as well as video game
enthusiasts) use a flight simulator to practice take-offs and landings without the expense, time
delays, and risk to human life associated with the real thing. The crew of the space shuttle


⁸ For this type of sampling, we eliminate the possibility of getting the same ball twice by setting aside (not replacing) each drawn ball. This is sampling without replacement.

⁹ Such a listing is often called the sampling frame. Even if we cannot list everything in a sampling frame, we can still construct our
samples if we can specifically describe the characteristics of the population.

¹⁰ Monte Carlo simulation, named after the gambling capital of Europe, has attained special prominence in academic research because
analytical solutions are often elusive for complex statistical questions applicable to the business world.

practices long hours in simulators for an in-space repair of a satellite. Both Minitab and Excel
may be used to simulate the random sampling process.

Chapter Case #2: Please Turn Out the Lights Before You Leave
The 1970s and 1980s were not a good time to work in a Midwest or Northeast U.S.
factory. Foreign manufacturers led by the Germans and Japanese were snatching ever increasing
shares of U.S. domestic auto, steel, electronics, and textile markets. Particularly hurt was New
York City, once the primary headquarters for multinational corporate giants. According to data
collected by the New York Department of Economic Development, New York City experienced
1246 manufacturing plant closings during the 1976-1990 period. Many factors were blamed for
New York's loss of its high-paying manufacturing base. Some claimed it was high state taxes
and excessive regulations, others said it was the declining quality of life, and still others called it
a natural result of New York City's transition to a financial services economy while other nations
regained their pre-World War II market shares.
One way to discover what, if anything, was responsible would be to gather information
from site visits to the closed factories. The city's industrial development analysts could not
possibly spare the staff and travel budget to examine every factory closure in the New York City
area. Of the 1246 plant closings, suppose the budget was only sufficient to study a sample of
thirty closed plants. Starting from a listing of all factory closures, the city's analysts would
probably let the computer draw a simple random sample of 30 different numbers between 1 and
1246. For example, Figure 5.1 contains a sorted listing of 30 random numbers generated by
Minitab using the instructions listed in Figure 5.2.

Data Display

Random30
44 48 66 72 102 107 149 159 264 299 340
390 409 496 497 506 545 579 602 613 712 735
808 831 991 1058 1224 1225 1234 1243
Figure 5.1

Using Minitab to Draw a Set of Random Integers

Pull-Down Menu sequence: Calc > Random Data > Integer...
Complete the Integer Distribution DIALOG BOX as follows:
(1) type the number of random integers needed in the box between Generate and rows of data
(2) click inside the box labeled Store in column(s): and type a name of eight or fewer characters for the new column of data
(3) specify the range of possible integers by typing the minimum and maximum numbers in the boxes to the right of Minimum value: and Maximum value:
(4) click the OK button, and the random numbers will appear in the new worksheet column
Figure 5.2
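
For readers who prefer a scripting approach, the same draw can be sketched in a few lines of Python. (This sketch is ours, not part of the Minitab workflow above; the seed value is arbitrary and only makes the illustration reproducible.)

import random

# Draw 30 distinct plant ID numbers between 1 and 1246 (sampling without replacement),
# mirroring the Minitab example above; results differ from run to run unless a seed is set.
random.seed(2010)
random30 = sorted(random.sample(range(1, 1247), 30))
print(random30)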

This simple random sample therefore consists of those thirty factories on our sequential listing:
#44, #48, #66, #72, and so forth.11

Remember, when you try doing this yourself, you will almost certainly wind up with a
different combination of thirty random numbers. Because these are random numbers and there
were 1246 to choose from, repeating the Minitab procedure would almost surely result in a
different sample of 30 numbers.12 For example, a second sample that Minitab chose for us
contains an entirely different set of numbers (see Figure 5.3).
Data Display

Random30
22 39 227 309 322 372 511 515 586 710 733
742 750 750 824 834 891 892 906 943 1023 1032
1095 1101 1107 1109 1116 1125 1181 1188
Figure 5.3

If the plant identification numbers are already listed in our computer spreadsheet or
worksheet, we can easily use Excel or Minitab to select our simple random sample (see Figure
5.4). The advantage of this approach is that ID numbers do not need to be sequential.
Using Excel to Draw a Simple Random Sample

Pull-Down Menu sequence: Tools > Data Analysis...
(1) Select Sampling from the Data Analysis box, and click OK
Then complete the Sampling DIALOG BOX as follows:
(2) click inside the box following Input Range and type the cell range for the identification numbers (or drag the mouse through the spreadsheet data so the cell range appears in the box)
(3) click inside the box following Number of Samples: and type the number of observations in the sample you wish to draw
(4) click OK to obtain a simple random sample of data in a new sheet
Figure 5.4

11 Because the same number can come up twice, it is best in practice to generate more than 30 random numbers and then select the first 30 different numbers from the sorted listing.

12 Computers must use algorithms to generate these numbers, so they are not strictly random, only pseudo-random. However, they can be considered random for practical purposes.
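
The same selection can be sketched in Python when the IDs are already stored in a list or worksheet column. (This is our illustration; the ID numbers below are hypothetical and deliberately nonsequential.)

import random

# Hypothetical, nonsequential plant ID numbers; in practice these would be read
# from the worksheet column referred to in the Sampling dialog box above.
plant_ids = [1003, 1017, 1029, 2044, 2051, 3088, 3090, 4121, 4166, 5200,
             5213, 6247, 6258, 7301, 7344]

random.seed(5)                            # optional: makes the illustration reproducible
sample_ids = random.sample(plant_ids, 5)  # simple random sample of 5 IDs, drawn without replacement
print(sorted(sample_ids))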

How much can a sample, especially a small one, tell us about the complete population?
Could a random sample of 30 or even 100 plants tell us a lot about the overall population, or
would the sample be more misleading than helpful? In Chapter 8, we will conduct additional
simulations that demonstrate the ability of random samples to reflect population information
accurately.

Warning Signs of Nonrandom Sampling
There is a tremendous advantage to having random samples. By matching estimators and
other statistics with sampling distributions, we may infer confidence intervals and utilize other
methods of quantifying uncertainty. By contrast, if certain values in the sample space have a
greater probability of entering the sample while others have virtually no chance, estimates will
be biased in the direction of the oversampled values. Thus, wherever possible, we should try to
obtain random samples. Unfortunately, many of the most common sampling methods result in
nonrandom samples.
For example, many business leaders pride themselves on being graduates of the "school
of hard knocks" and love to tell colorful stories, or anecdotes, about their experiences. These
stories may be useful for illustrating a general problem or business opportunity. Although
anecdotal evidence may be interesting and informative, it is usually too limited and
unrepresentative to use for decision making.
DEFINITION: Anecdotal evidence is information on a very limited number of data observations
selected in a nonrandom manner.
Unfortunately, many managers base decisions on experience of only one or two cases.
"We tried that once and it didn't work" is often the kiss of death for any similar proposals.
Managers seldom recognize this as a decision based on a sample containing only a single
observation. Worse yet, anecdotal evidence is often remembered primarily because it was
exceptional in some way or other. A decision based on anecdotal evidence may therefore
reflect the exception instead of the rule. A fluke success for a new product line, unusual TV
show, or initial public stock offering is likely to call forth dozens of lemmings marching into a
sea of losses.

Seemingly random sampling methods may sometimes not be random after all.
Throughout most of the Viet Nam War, the United States used a draft instead of
recruiting volunteers. The draftees selected, however, were not a random sample of young
males. Wealthier and educated youths found many ways to avoid the draft, leaving the armed
forces populated disproportionately with minorities and the poor. To improve the fairness of the
process, a lottery was initiated for the 1970 draft. Into a large bowl were placed 366 capsules,
each containing a paper marked with one day of the year (with one for February 29th). The
capsules were supposed to be thoroughly mixed before they were drawn one-by-one from the
bowl. The order of selection determined the order of the draft for all eligible males that year: if
your birthday was among the first hundred or so dates drawn, you would be drafted. However,
something went wrong. Most of the first hundred dates selected were from the last few months
of the year.
The lottery was not truly random. The capsules were placed into the bowl by month
beginning with January. Although the capsules were supposed to be mixed thoroughly, this
appears not to have been the case. Capsules containing dates from September through December
remained near the top of the bowl. Since the sampling procedure was not truly random, this draft
lottery replaced a system that disproportionately selected from the poor and minorities with one
that discriminated against draft-age males born later in the year.
Another lottery example involved a criminal cause of nonrandomness. A few years ago,
the same numbers popped up on all the machines during a Pennsylvania Lottery broadcast. The
procedure used numbered ball machines like those described here. But balls with three of the
numbers were the only ones able to pop out the top. All the other balls had been weighted down
by a gambling syndicate which bet heavily upon the few remaining combinations among those
three numbers. Suspicion was triggered when, through bad luck (for the conspirators), a "3"
came up on each machine. Review of the videotape revealed that only three of the balls the
unweighted ones were circulating vigorously inside each machine. What appeared to be
random sampling was anything but!
As mentioned earlier, a common basis for selecting a sample is a directory listing: for
example, of clients, suppliers, employees, real estate, or stocks. A random selection method may
be to match directory entries with a series of random numbers generated by flipping a coin,
rolling dice, or using the random number generating capabilities of a computer. For example, a
one-in-six sample could be obtained by selecting only those entries with the last digit in the
directory ID number matching the number on the face of a die. Before computers could easily
turn out random number lists, random selection methods were a time-consuming process.
Because of its ease of use, the most popular alternative was systematic sampling.


DEFINITION: Sampling a population in some regular fashion is called systematic sampling.
For example, every sixth entry from a directory could be selected rather than referring to random
numbers. To monitor performance standards, a microprocessor assembly work station might be
sampled every 20 minutes and the operating characteristics of a computer chip tested. Every
eighth house in a targeted neighborhood may be surveyed for a potential market impact study.
Yet unless we design systematic samples carefully, a nonrandom sample may result. If
we select the first listing from each page of a telephone directory, pages that begin with a new
letter may start with names such as AAAA Auto Repair. Thus, businesses tend to be
overrepresented in such samples. If the directory doesn't break apart multiple listings, such as a
doctor with two offices and a home address, these individuals are also more likely to be at the
top of a page and thus wind up in our sample. Similarly, production quality may be sensitive to
the time of day. Consequently, sampling on the hour may primarily examine work immediately
before or after lunch breaks and shift changes when quality may be abnormally high or low. If
multiples (or fractions) of eight houses are on each block, sampling every eighth house in a
neighborhood may oversample corner houses. Since corner lots tend to be larger, these houses
cost more and are occupied by higher-income families than others in the same neighborhood.
Although systematic sampling is often a convenient way to collect data, we should make sure
that the regularity in collection does not itself create nonrandom sampling biases.
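
A systematic sample is equally easy to sketch in code. The short Python illustration below (ours, with a hypothetical directory of 60 entries) takes every k-th entry after a randomly chosen starting point, one common way to reduce the regularity problems just described.

import random

def systematic_sample(listing, k):
    # Return every k-th element of the listing, starting from a random
    # position within the first k entries.
    start = random.randrange(k)
    return listing[start::k]

# Example: a one-in-six systematic sample from a directory of 60 hypothetical entry IDs.
directory = list(range(1, 61))
print(systematic_sample(directory, 6))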
A particular problem with collecting survey data is nonresponse. We have all discarded
marketing survey questionnaires sent to us in the mail, walked the long way around people with
clipboards at shopping malls, or hung up on callers asking for "a minute of our time to answer a
few questions." And these don't include all the times the phone rang while we were out and mail
that hasn't reached us since we moved. It is often just as difficult to get businesses to respond to
surveys. They may be unwilling to devote staff time to filling out questionnaires or reluctant to
disclose information that could be used against them by rivals or regulators. The large number of
bankruptcies, name changes, and relocations among smaller firms makes it next to impossible to
obtain an up-to-date listing from which to sample in the first place. It is not unusual for a mailed
questionnaire to have a response rate of 5 or 10 percent.
Those who respond are likely to be a nonrandom subsample of the survey's original target
sample. The result is nonresponse bias.
DEFINITION: If those responding differ from non-respondents in any of the variables being
surveyed, estimates from that sample will contain nonresponse bias.


Respondents share certain traits that could cause nonresponse bias. For example, they
often have higher incomes than nonrespondents and tend to be more educated. Thus, a survey
asking about support for lowering taxes on the wealthy or cleaning up toxic waste dumps may
report estimates that contain an upward bias. Any survey that reports results qualified by the
phrase "among those responding to a survey" should be carefully judged by its response rate and
methods used to address nonresponse bias. If our directory listing is incomplete or contains
errors, biased estimates can also result, since those who cannot be contacted have no chance of
responding.
To reduce the potential for nonresponse bias, several methods are used to improve
response rates. Listings and directories are updated and compared to ensure accuracy and
maximum coverage of the intended population. Mailed survey forms may include a dollar bill to
compensate you for your time filling out the questionnaire. Sometimes contests with prize
awards accompany the survey to increase participation. Businesses are encouraged to respond
by offering to share the results of the survey findings or assuring confidentiality of their answers.
Response rates are usually very high if a reputable authority supports the survey. Examples are
employee surveys authorized by the employer or a survey of retail establishments by the U.S.
Department of Commerce.
One of the most common methods of increasing response rates is by conducting one or
more follow-up surveys on nonrespondents. Call-backs of not-at-homes, repeated mailings, or
return visits often successfully capture a sizable percentage and bring into the sample
respondents who by their absence would cause bias. Following the mailing of 1990 Census
forms, thousands of Census workers were dispatched to interview nonrespondents and search out
the more transient population missing or incorrect on the original mailing lists.13

Even if the sampling method used gives us a nonrandom sample, it may be possible to
adjust our estimation procedure to avoid bias. This adjustment can be achieved if we can weight
the sample observations by their true proportion in the population. To illustrate one such
technique, suppose a phone survey is conducted by a beef marketing council to discover how
often Americans serve their product. The first calling from a randomly selected sample of
households yields a 50 percent response rate and a mean of 2.1 meals a week. A call-back
attempt raises the response rate by 25 percent and the new respondents report a mean of 3.9
meals of beef per week. Since only half as many respondents are in the call-back survey, the
relative frequency weights used to calculate the sample mean are 2 / 3 and 1 / 3 and the sample
mean is therefore 2.7 meals / week. This estimate is likely to be biased due to the 25 percent of
the original sample missed by both survey attempts. However, the nonrespondents may share
similar behavior (including beef consumption patterns) with the call-back respondents, because
neither group is at home as much as those responding to the initial call. If we use the call-back
group mean to also estimate the 25 percent nonrespondents, the estimate for the entire population
should then involve equal weights (50 percent each) and a larger mean of 3.0 meals per week.
More generally, the final follow-up survey is often used to estimate those in the sample who did
not respond to any of the surveys.

13 Nevertheless, millions of people again went uncounted.
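
The weighting arithmetic in the beef-survey example can be verified with a short calculation. (The sketch below is ours; the response rates and group means are the ones given above.)

# Group means and response shares from the beef-survey example.
mean_initial, mean_callback = 2.1, 3.9       # meals of beef per week
share_initial, share_callback = 0.50, 0.25   # fractions of the original sample responding

# Weights among respondents only (2/3 and 1/3): likely biased toward the initial group.
respondent_total = share_initial + share_callback
biased_mean = (share_initial * mean_initial + share_callback * mean_callback) / respondent_total
print(round(biased_mean, 1))    # 2.7 meals per week

# Treating the 25 percent who never responded as resembling the call-back group
# gives equal 50/50 weights for the two behavior types.
adjusted_mean = 0.50 * mean_initial + 0.50 * mean_callback
print(round(adjusted_mean, 1))  # 3.0 meals per week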

Multiple Choice Review:
5.25 A business school accreditation organization circulates class evaluation forms to students
attending class during the next-to-last week of the term. All except one of the following would
likely cause bias in estimating the educational effectiveness of the courses:
(a) students place too much faith on instructors who say the courses are effective.
(b) students underestimate the usefulness of courses until they obtain full time jobs in
business.
(c) the evaluation forms contain questions primarily focused on whether courses meet
minimum standards.
(d) students who dropped or withdrew from the course before the final week are not
sampled.
(e) all of the above are likely sources of bias.

5.26 A state in fiscal crisis is considering various tax reform proposals. To get a feeling for
what voters would support, the governor decides to conduct a survey before deciding
which reforms to propose to the state legislature. Which of the following survey designs
would you recommend?
(a) set up a series of 900 numbers for people to call to express their preference.
(b) mail a survey form to all registered voters, and determine the preferences from those
who respond.
(c) conduct extensive "man-on-the-street" interviews at all major shopping malls.
(d) survey a random sample of political science professors and other experts as to which
reform plan they believe the voters will support.
(e) none of the above.


Chapter Review Summary
Populations that cannot be listed explicitly can usually be defined by a set of conditions
or characteristics that only describe the members of that population. Data collection should
focus on variables whose scope matches the problem being investigated and whose information
is not contaminated by extraneous variables. Data should be rounded to the number of figures
that best reflects the amount of useful information present in the data.
Primary data is data collected or generated by the end user for their own analytical purposes.
Secondary data is data collected for purposes other than those intended by the end user.
Observational data is data that can be measured and recorded merely by being present when the
data is generated. Experimental data is data specially generated by a carefully-designed
experiment that measures the response by subjects to different treatments under a controlled set
of conditions.
Confusing time series and cross section data can lead to erroneous conclusions. The
problem with cross section data is that it reflects only conditions present at the time the data was
collected. By contrast, analysts using serial data should beware of making simplistic before-and-
after comparisons or premature judgments.
A sample of size n drawn from a population is a random sample if every observation in
that population has the same chance of being selected as any other. A random sample of n
observations is a simple random sample if every possible sample of n observations has the same
chance of being selected as any other.
Anecdotal evidence is information on a very limited number of data observations selected
in a nonrandom manner. Sampling a population in some regular fashion is called systematic
sampling. If those responding differ from non-respondents in any of the variables being
surveyed, estimates from that sample will contain nonresponse bias.



Chapter 6 Probability and Randomness
Approach: Because business outcomes are never certain, we need to draw upon the laws of
probability to survive in a world of uncertainty. We build upon everyday encounters with chance
events to introduce probability. Our primary emphasis is on probability concepts most
applicable to statistical inference, such as random variables, expected value, and statistical
independence. We then use probability theory and expected value to understand Bayes Theorem,
construct payoff matrices, and introduce expected monetary value and related decision theory.
Where We Are Going: Probability will allow us to quantify how much uncertainty we face
in business situations. We will use our new understanding about probability to explore
how probability is distributed for the most commonly encountered statistical inference situations.

New Concepts and Terms:
experiments, outcomes, events, sample spaces, and Venn diagrams
probability and odds
frequency-based, classical, and subjective probability
mutually exclusive, exhaustive, and complementary events
addition and multiplication rules and statistical independence
marginal, conditional, and joint probability
relative frequency and weighted sums
probability distribution and cumulative probability distribution
expected value definitions of mean, standard deviation, and variance
Bayes Theorem, actions, states of nature, payoffs, and payoff matrices
expected monetary value (EMV) and EMV strategies

SECTION 6.1 Using Probability to Cope with Business Uncertainties
SECTION 6.2 Probability Concepts for Understanding Statistical Inference
SECTION 6.3 Random Variables and Expected Value
SECTION 6.4 Principles of Decision Theory

6.1 Using Probability to Cope with Business Uncertainties
In Part One, we learned a wide range of descriptive statistics methods. To understand
these methods, all we needed was a basic working knowledge of graphing and simple algebra.
Most of the remainder of this text is devoted to statistical inference. Although most inference
methods are based on descriptive statistics we have already learned, inference also relies heavily
on fundamental principles about probability and probability distributions. The language of
probability, for example, helps us measure the amount of confidence we have in estimates based
on sample data. Our conclusions may include a margin of error for these estimates or an
assessment of the uncertainty of our findings. Thus, we have gone about as far as we can in statistics until we
acquire additional knowledge about probability.
Statistics relies on probability theory to quantify the amount of uncertainty we associate
with estimates and other inferences based on sample data.
Business statistics shows us how to make the best use of information in the face of
uncertainty. What would it be like to live and work in a world where everything that happens
is known in advance with complete certainty? There would be no unpleasant surprises. You'd
never again waste time in a course that you know you won't pass or apply for a job you won't get.
A manager would always know in advance the results of a price increase, a merger, or a new ad
campaign. But although a world of certainty might eliminate worry and ulcers, it would also put
an end to pleasant surprises that spice up our lives, such as surprise parties and an unexpected
bonus. Anyway, the world we inhabit continually requires us to deal with uncertainty.
Perhaps you have heard about people who refuse to travel or even leave their secure
homes because they are frightened that something bad might happen. Fortunately, most people,
business, and public institutions find better ways to cope with uncertainty. Families stock up on
flashlights, water jugs, and canned goods in case a natural disaster leaves them without power
and transportation. McDonald's reacts to the risk of future beef shortages by signing long term
supply contracts. Because trial evidence cannot prove guilt absolutely, juries are instructed to
convict the defendant if the evidence has convinced them "beyond a reasonable doubt."

Experiments, Events, and Sample Spaces
To learn how to make decisions in an uncertain world, we must first examine the process
that creates uncertainty. If one particular outcome is not certain, then something else must be
possible as well. A mortgage rate of 8% next month, for example, is only one possibility; a rate
higher or lower than 8% could occur instead. Experiments are responsible for generating
mortgage rates and all other business and social data; the data are the outcomes of these experiments.

DEFINITION: An experiment is any process that generates distinct results, called outcomes.
In Chapter 5, we learned that carefully designed experiments are sometimes the best way, or
perhaps the only way, to obtain the data we need. In fact, all data can be considered the result of
the haphazard experiments performed in our society. Even if we are not aware of them,
experiments are constantly going on around us and we all are rats in this laboratory.
As a college student, you daily participate in many experiments. This morning, for
instance, will you be late or on time for class? What score will you get on next week's history
exam? Will you get accepted into a fraternity or sorority? Go out with Pat on Saturday night?
Lose financial aid? Get into the classes you need next semester? Afford tuition if it increases
next year? Receive any job offers when you graduate, and if so, how many?
After college, you will have to consider outcomes of other experiments. How many sales
commissions will you earn this month or what will be your year-end bonus? Will you be
promoted, transferred, or be downsized? Will you win the weekly football pool or even the
state lottery? How much will your stock or real estate investments pay off (or lose)? Will
interest rates fall enough to let you finance a house? Will your marriage fail or your kids turn to
drugs or crime? What are your chances of living to retirement age or of dying prematurely from
cancer or AIDS? Will you develop high blood pressure or hair loss (perhaps from worrying
about everything else that might happen)?
For each of these experiments, we can list (or at least describe) all possible outcomes in
the sample space.
DEFINITION: The listing or description of all possible outcomes resulting from an experiment
is called the sample space.
In a 45-question multiple choice exam, for example, the sample space for the experiment is the
raw score integers from zero to 45. Your particular outcome, 34, is therefore one of the
outcomes in this sample space. By contrast, the sample space for the experiment of asking Pat
out Saturday consists of only two possible outcomes: Yes and No. In the tuition increase
experiment, describing the sample space may be better than trying to list all the possibilities.
Suppose financial and political realities rule out any tuition cut or an increase by more than
$2000 a year. Then, the sample space may be described as the set of all increases from $0 to
$2000.
We can more compactly express sample spaces using set notation by placing brackets
around the listing or description. Our three sample spaces are the following sets:
S = {0, 1, 2, . . . , 45} for the multiple choice exam

S = {Yes, No} for dating
S = {$0 ≤ Tuition Increase ≤ $2000} for the change in tuition

You have a stake in the experiments just described because each outcome affects your
fortunes or happiness differently. Often, it makes sense to group similar types of outcomes into
events.

DEFINITION: An event is a collection of one or more outcomes that is meaningful to a decision
maker.
In the 45-question exam, a score of 40 or more may earn a grade of "A", 35 to 39 a "B", and so
forth. Then the set of events
{A, B, C, D, F}
may in turn be defined by five event sets. One possible breakdown of exam scores is the
following:
A = {40, 41, . . . , 45}
B = {35, 36, . . . , 39}
C = {30, 31, . . . , 34}
D = {25, 26, . . . , 29}
F = {0, 1, . . . , 24}

In the tuition example, suppose that an increase of $750 or more will force you to work
more hours and become a part-time student, and an increase of $1500 will cause you to drop out
of college entirely. Then the set of events that have meaningful consequences to you is:
{full time, part time, drop out}
with events defined by the following ranges of outcomes:
full time = {$0 ≤ x < $750}
part time = {$750 ≤ x < $1500}
drop out = {$1500 ≤ x ≤ $2000}

As important as events may be to the business or individual, much of what affects us is
beyond our control. A power outage last night may cause your alarm to go off late. Your car
may not start or a highway accident or storm may snarl you in rush hour traffic.

Similarly, your exam grade will depend not just on how hard you studied. The questions
asked may not be the ones you prepared for most. If you found yourself in a class of bright
students, a good score may still rank you lower on the grading curve. Your exam performance
will be adversely affected if you caught a flu bug. As with most other things, our fate is not
strictly in our own hands!

What is Probability?
To operate successfully in business, we need methods to make decisions under
uncertainty. Statistics developed these methods by drawing upon basic principles from
probability theory. Probability theory admits that we cannot be certain any particular event in
the sample space will occur. But we can do the next best thing: determine the chance, or probability,
of each event occurring.
DEFINITION: The probability of an event measures the chance that event will occur when an
experiment is conducted.
For example, the radio may declare a 30 percent probability of rain today. Thus, meteorologists
expect it will rain an average of three out of every 10 days under the current weather conditions
(regardless of whether you just washed your car or forgot to take your umbrella).
To save space, we often use the P( ) notation instead.
The notation P(A) refers to the probability of event A occurring.
For example, P(Rain) = 0.20 means that the probability of rain today is 20 percent. Notice also
that business statistics traditionally uses the decimal fraction, such as 0.20, rather than the
percentage equivalent commonly expressed in our everyday speech.
The range of probabilities is limited by its two opposite extremes.
Probability ranges from 0, for an event with no chance of occurring, to 1,
for an event which is certain to happen. In terms of the P( ) notation:
0 ≤ P(A) ≤ 1
for all events A in a sample space

The lower extreme of zero is reached precisely only for an impossible event, such as your
college instructor being picked first in the next NBA draft. A highly unlikely event, such as
winning the grand prize in the weekly state lottery by buying only one ticket, still has a

probability slightly greater than zero (less than one in a million). Sometimes, the laws of nature
reduce the sample space to a single certain outcome, such as the probability of the sun rising
tomorrow or an unavoidable meltdown at a nuclear reactor once the core reaches a critical
temperature. Alternatively, certainty may result from defining an event to include all possible
outcomes in the sample space. Thus, the probability is 1.0 that a person leaving a jewelry store
has or has not purchased something. For decisions not involving certain or impossible events,
we need to attach a probability to each event reflecting its uncertainty.
Thus, the importance we give a particular event depends on its probability of occurring as
well as its numerical value. Why else would folks risk a $150 traffic ticket and higher car
insurance rates when they are fairly sure there are no speed traps nearby? Yet the same drivers
may observe No Parking signs because they know these are regularly patrolled, even though
the fines amount to only $5 or $10.
We often see chance reported as odds rather than probability, especially in monetary or
life-threatening situations such as gambling, insurance, and financial market risks. Odds are
expressed as a ratio of two numbers. Suppose, for example, the odds are 4-to-1 against a
particular merger being approved by the target firm's management. This means that management
approval will occur on average once for every four disapproval votes. By contrast, a labor expert
who reports 3-to-2 odds in favor of a wage hike tells us that higher wages will occur one-and-
one-half-times as often as no wage increase under this negotiating climate. Because four is 80
percent of (4 + 1), there is an 80 percent probability the merger will not be approved.
Consequently, there remains only a 1-in-5 chance, or 20 percent probability, of approval. See if
you can use similar logic to show the wage hike has a 60 percent probability when 3-to-2 odds
favor it.
An odds ratio is converted to probability by expressing one number in the odds ratio as a
percentage of the sum of the two numbers in the ratio.
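
A tiny helper function (our sketch, not a formula from the text) makes the conversion mechanical:

def odds_to_probability(for_count, against_count):
    # Convert odds of for_count-to-against_count in favor of an event into a probability.
    return for_count / (for_count + against_count)

# 4-to-1 odds against approval are 1-to-4 in favor, so the probability of approval is 0.20.
print(odds_to_probability(1, 4))   # 0.2
# 3-to-2 odds in favor of a wage hike correspond to a 0.60 probability.
print(odds_to_probability(3, 2))   # 0.6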
Working with probability has developed into an important facet of the business world.
Benjamin Franklin once remarked that the only things we can count on with certainty are death
and taxes. However, even death and taxes involve important aspects of probability. Anybody
who has been through one knows that a tax audit is time consuming and emotionally trying.
Cheating or making undocumented deductions can expose a taxpayer to fines or even a jail term.
Some people go so far as to overpay taxes each year so they won't have to worry about being
audited. However, the probability of being audited by the I.R.S. is quite small in any individual
year, and that probability is higher if you declare large deductions for things such as charity or
medical care. A few enterprising authors claim to know these secret probabilities, and publish
strategies that will reduce your chances of being audited.

Comedians joke that life insurance is a strange gamble. The only way to win is to lose:
you have to die to collect. People who don't want to put their children at financial risk will
purchase insurance against the chance they will die early or require expensive medical care.
Statisticians employed by these insurance companies are called actuaries. Insurance companies
don't know when individual policyholders will die, but mortality rates measure the actuarial
probability of each one dying this year. For policyholders with high mortality rates such as
heavy smokers or elderly males the company may raise their rates or cancel their coverage.
The risk to heavy smokers, for example, may be reflected in a one-in-three chance (33 percent
probability) for contracting lung cancer by age 60.

The Frequency Approach to Probability
Where do these probabilities come from? Since probability is an essential part of
statistical decision making, we need methods to calculate the probabilities of any event. There is
no single method suitable for all purposes. Instead, the rich history of probability theory offers
business statisticians three very different approaches to find probability.
One approach is empirical, depending on data evidence to justify and calculate
probability. This method of finding probabilities is based on the frequency definition of
probability.
DEFINITION: According to the frequency definition of probability, the probability of an event
is the percentage of all outcomes in which that event occurs.
Thus, the frequency definition is observation based and relies on collecting the results of many
repeated experiments, counting the number of occurrences of an event, and comparing that
frequency to the total number of observations. The resulting ratio is the relative frequency.
DEFINITION: The relative frequency of an event is r / n, the ratio of its frequency r to the total
number of observations n.
If we knew the frequency of event A in the entire population, then the frequency
definition of probability tells us to use the relative frequency as P(A), the probability of A
occurring. In 1992, for example, 18.05 million of the 108.44 million Americans employed in
nonagricultural jobs worked in manufacturing.1 Thus, the probability was 0.1665, or about one
in six, that a worker selected at random from the nonagricultural worker population would be a
manufacturing employee. Lacking this population information, we may instead collect a large
random sample of nonagricultural workers (say 10,000), determine the fraction in manufacturing,
and use this as an excellent estimate of the probability.

1 Figures are for October 1992. Source: Establishment Survey, U.S. Dept. of Labor, Employment and Earnings.
The Frequency Approach to Probability: The probability of an event is its relative
frequency for an experiment repeated a very large number of times. If event A occurs with
frequency r in n repetitions of the experiment, then for very large n,
P(A) = r / n

Frequentists recommend that random samples be used to estimate unknown population
probabilities. Relative frequencies derived from frequency distributions or histograms
provide estimates for event probabilities in the sample space. In Section 3, we will
investigate the important case of relative frequencies of quantitative data.
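
A quick simulation (our own sketch) shows the frequency approach at work: as the number of repetitions n grows, the relative frequency r / n settles down near the underlying probability. Here the true manufacturing share is taken to be 0.1665, the figure from the employment example above.

import random

random.seed(1)
p_manufacturing = 0.1665    # assumed population share, from the 1992 employment example

for n in (100, 1000, 10000, 100000):
    # Count how many of n randomly selected workers fall in manufacturing.
    r = sum(random.random() < p_manufacturing for _ in range(n))
    print(n, round(r / n, 4))   # the relative frequency approaches 0.1665 as n grows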

What if the event frequencies for the population cannot be counted because the
population itself is virtually infinite? Suppose the state gaming commission in Nevada suspects
that a roulette wheel (or the casino management) is not on the level. Or perhaps you have
devised a system of keeping track of face cards played so you can win at blackjack. A coin can
be flipped, a pair of dice rolled, a deck of cards dealt, or a roulette wheel spun a limitless number
of times. When the cards become sticky or worn, a virtually identical deck takes its place and
the game goes on. 17th century probabilists examining various games of chance first analyzed
this type of question.
Businesses confront this same situation with time series data such as sales, inventories,
bank loan interest rates, and production processes. Fast-food hamburger preparation and
customer order processing, for example, never come to an end. New workers replace those that
quit and occasionally the flame broiler needs repair or the franchise is sold. Nevertheless, the
business goes on.
To deal with questions involving an ongoing process, frequentists suggest measuring
probabilities by counting events and determining the relative frequencies.
For ongoing processes, the probability of an event is the average rate at which that event is
observed.
The frequentist approach to ongoing processes is often applied to statistical process control
situations, where management enlists employees to monitor relative frequencies in production or
service quality. To decide if process probabilities have shifted from current standards, periodic
samples are drawn and relative frequencies are then compared against established norms.
For example, suppose the percentage of misrouted package deliveries, camshaft diameters
beyond design specs, or student registration complaints substantially exceeds the target level.
Employees immediately alert the quality control officer that the probabilities of these low-quality
production events may have increased. Before the situation deteriorates any further, the
process identified as "out of control" is halted temporarily. This action prevents any further
costs, such as product recalls, lost customers, and damage to company reputation. The shut-down
process immediately undergoes an intensive examination by a quality control team that
includes front-line employees. Ideally, the cause of the probability changes can be identified and
remedied quickly. Statistical process control will be examined in greater detail in other chapters.

Classical and Subjective Approaches to Probability *

Finding probability from frequency information, however, has its limits. Each
experiment (a production run, a new fast-food franchise location, a space shuttle launch) can
be costly, time consuming, or even risky to life and limb. New or changing conditions may
make previous relative frequencies unsuitable to be used as event probabilities. Moreover, the
frequency approach can alert us to a change in an event's probability, but it offers few clues about
what caused the process to change. To control and improve quality, modern businesses often require
these deeper insights. The classical approach furnishes us with an alternative means for
calculating probabilities.
DEFINITION: In the classical approach, probability is calculated analytically based on a
detailed understanding of the process by which experimental outcomes are generated.
Thus, the probability of each event may be determined merely by inspecting the
experimental apparatus and the conditions prescribed for conducting the experiment. Physical
inspection of a coin or die can determine the probability from a fair toss or roll. If the coin is
not two-headed, then we expect equal probability for each possible outcome in the sample space.
The die is more difficult to inspect. Its cubic shape may be verified by careful measurements,
and the die may even be X-rayed to see if it is "loaded" by a weight embedded near one of the
faces.
But what if the coin flip or dice roll is not fair? To understand the process that
generates outcomes, we must also examine the operators and experimental procedure. As we
saw in Chapter 5 with the draft lottery, imperfect efforts at randomizing may affect the
probabilities of events. If a deck of cards is not shuffled properly between poker hands, some
card patterns (such as a flush or full house) may be dealt more often. With practice, cheaters can
learn to deal from the bottom of the deck or even control the number of rotations in a coin flip.
The conduct of an experiment may have as much effect on probability as the equipment used.

Casino management therefore trains its employees extensively and uses supervisors and
overhead cameras to monitor activities at the gaming tables.2

Classical probabilities are applicable to many types of business operations. Knowledge
of the production process may substitute for sample evidence from actual production line
experiments. By understanding the design of factory, office, or transportation equipment and the
way employees are trained to operate in these environments, we may determine the probability
of each event without counting outcome frequencies.
Quite often, however, applying the classical approach is not practical. Increasingly
sophisticated production technologies and interactive teamwork may enhance productivity and
quality. However, classical probabilities of these production processes are too complex to
calculate. Management researchers are still struggling to understand what makes one production
or design team efficient and another one not.
A third approach to probability allows management, industrial engineers, and other
business professionals to make an educated guess about probabilities. Experts can use their
training, experience, and intuition to assess the subjective probability of an event occurring.
DEFINITION: Subjective probability is based on informed human judgments about the
probability of an event.
In business, these judgments may be formulated from experience with similar situations,
knowledge of human behavior and the functioning of institutions, or simply hunches.
The empirical counting methods of the frequency approach and the intellectual analysis
of the classical approach are each based on objective information. Therefore, two individuals
examining the identical available information should reach the same probability judgments. By
contrast, two people may arrive at different subjective probabilities despite identical experiences.
Subjective probability expresses our personal degree of doubt or belief. Assigning a
subjective probability of nearly 1.0 reflects a very strong belief an event will occur, and a
probability close to zero indicates major doubt that the event will be observed.
Movie reviewers reach different judgments about the chances a particular movie will win the
best picture Oscar. Outside consultants may disagree about the likelihood of success for a new
product line. Experienced surgeons often differ about a patient's probability of surviving a
transplant operation. In addition, expertise does not protect doctors, lawyers, economists, tax
accountants, and scientists from bias. On the contrary, experts often display substantial bias
because of their professional commitment, narrow experience, and emotional involvement. This
variability and bias in subjective probability make some "frequentists" and classical probabilists
uncomfortable. But a biased judgment by an expert can be far superior to an uninformed guess.

2 Even in casinos, quality control must be carefully enforced. Monitoring is required to protect profits from cheaters and convince gamblers that nobody has an unfair advantage. However, monitoring must not be intrusive enough to interfere with efficient operations or annoy customers.
There are also common methods to lessen bias from subjective probabilities. Many
insurance companies require a second opinion from another surgeon before an operation will be
authorized. The consensus of a panel of experts or focus groups may be used to decide whether
to launch a new product. Even a single subjective probability may be corrected for bias.
Suppose a knowledgeable architectural consultant inspects your Miami Beach hotel plans and
reports a 30 percent probability that this structural design would sustain major damage from
hurricane-force winds. Based on hotel damage sustained from hurricanes Hugo and Andrew, we
learn that previous assessments by this consultant overestimated probability by 10 percentage
points. Therefore, we decide on 20 percent as our unbiased probability of major damage
following a hurricane.
These approaches to probability provide us with three options. Sometimes one is best, in
other cases another is more appropriate.
When historical data are available or easily gathered, relative frequencies are generally
used to estimate probabilities. For simple, well-understood processes, classical
probabilities may be preferable. But in new, complex, and hard-to-quantify situations,
asking experts for subjective probability is usually advisable.
In many situations, these approaches may be used in tandem by blending two (or
occasionally all three) to obtain the best possible estimate of probability. Many financial
advisors, for example, forecast future asset yields by effectively combining government and
market information, analytical models, and subjective probability judgments.3 For processes too
complex to solve analytically by classical methods, the basic features can be programmed on
computers to simulate enough observations for precise frequency estimates of process
probabilities.




3 L. Brown, G. Richardson, and S. Schwager, "An Information Interpretation of Financial Analyst Superiority in Forecasting Earnings," Journal of Accounting Research (Spring 1987): 49-67.

Multiple Choice Review:
6.1 A personnel manager uses her insights gathered from experience and training to quantify
the probability that the new salesperson will be caught stealing merchandise. The
approach used to arrive at this probability is best described as the
a. frequency approach
b. subjective approach
c. classical approach
d. probability approach
e. none of the above

6.2 If the set S is defined by
S = {Toyota Corolla, Nissan Sentra, Honda Civic, Mazda MPV Minivan, Ford trucks},
which of the following dealerships could have its new vehicle inventory described by S?
a. a Japanese car dealer
b. a foreign car dealer
c. a car dealer
d. a car, truck, and van dealer
e. each of the above

6.3 An insurance agent calculates that one in twenty unsolicited letters to prospective clients
results in a new policy sold. What is the probability that any given letter will yield a
sale?
a. 0.95
b. 0.50
c. 0.20
d. 0.05
e. insufficient information to answer

6.4 Subjective probability
a. is derived from human judgments.
b. expresses our personal degree of belief.
c. is small when there is great doubt that an event will occur.
d. is most appropriate in new, complex, and difficult to quantify situations.
e. all of the above.



6.2 Probability Concepts for Understanding Statistical Inference
Probability is the key to unlock the mysteries of inferential statistics. However,
probability theory is a distinct and enormous field of study. Because this is a business statistics
text, we only need to outline the areas of probability necessary for understanding statistical
inference.4 Other areas of probability theory are essential to important business applications
outside statistical analysis, such as queuing theory, the study of lines (called queues). Queues
of impatient customers, callers waiting on hold, and backlogs of new orders can quickly mount
up unless the queue is serviced efficiently.

Combining Probabilities for Mutually Exclusive and Exhaustive Events
Consider a weekly magazine such as Newsweek. Although subscriptions and newsstand
sales are important, Newsweek's revenue comes primarily from selling advertising. Among the
companies willing to pay the highest ad rates are auto manufacturers, computer companies, and
brokerage firms. These advertisers are primarily interested in reaching a readership with high
incomes and heavy consumption in their company products and services. Thus, it is not
surprising that professionals, such as doctors, lawyers, and engineers, are sought-after
subscribers.
To examine its ability to appeal to advertisers, the circulation department at Newsweek
regularly surveys its subscribers to find out their occupations. The frequentist approach to
probability may then be applied to the survey findings to find the probability that a randomly
selected subscriber will have a particular occupation.
Applying the terms introduced in Section 1, the experiment consists of Newsweek
marketing its magazine and the outcomes are the occupations of its subscribers. A large random
sample of these experimental outcomes is collected by conducting the subscriber survey. Each
occupation is a possible outcome in the sample space S.
S = {all possible occupations}


4 The final portion of this section is optional. This additional material on probability is included for those who plan to cover the decision theory in Chapter 15.

Since S contains every possible outcome, the probability of S, P(S), must equal 1.0. For this
example, there is a 100 percent chance that any randomly selected subscriber will have some
occupation in S.5

The outcomes in S may often be grouped into events meaningful to decision makers.
Suppose the circulation department is most interested in distinguishing between two kinds of
events: subscribers who are professionals and those who are not. If the probability is 0.20 that a
subscriber is a professional, then there is only a one-in-five chance that a magazine carrying a
company's ads will reach the primary target audience of professionals. The company may decide
to take its advertising to a magazine with a higher professional subscriber probability.
What about the other four-fifths of subscribers? The event Non-Professional is the
complement of the event Professional.


5 Any exceptions can be avoided by carefully defining S broadly enough. Thus, an unemployed or hospitalized engineer is still considered an engineer, and S includes "occupations" such as homemaker, retired, prisoner, and student.

Figure 6.1: Venn diagram with two non-overlapping regions labeled Professional and Non-Professional.

DEFINITION: For any event A, its complement A^c consists of the set of all outcomes in the
sample space that are not associated with A.
Because an event and its complement have no outcomes in common, A and A^c are said to be
mutually exclusive.
DEFINITION: A set of events is mutually exclusive if these events have no outcomes in
common.
Since, by definition, non-professionals is a different set of occupations than professionals, these
complementary events are mutually exclusive.
We can use Venn diagrams to represent probabilities and mutually exclusive events.
DEFINITION: A Venn diagram is a rectangular-shaped diagram portraying event probabilities
by regions whose relative areas reflect the probability of each event.
P(A) = (area of A) / (diagram area)
Figure 6.1 represents probabilities of the two events of interest to Newsweek's circulation
department. As with any Venn diagram, the entire area within the diagram contains 100 percent
of the probability. The probability of any event is represented by how much of the diagram it
occupies. The larger its area, the larger the probability. In the figure, the oval area labeled
Professional occupies about one-fifth of the diagram area because P(Professional) = 0.20.
Professionals and non-professionals are mutually exclusive, so their regions do not overlap.
Figure 6.2: Venn diagram with three non-overlapping regions labeled Professional, Executive, and All Others.

If events have no overlapping areas in a Venn diagram, they are mutually exclusive.
Advertisers also consider executives valuable subscribers because they decide which
copiers, computers, fleet cars, or pension and health care plans to use at their company. Suppose
the Newsweek folks also include executives in their survey and discover that P(Executive) =
0.12. Then the Venn diagram for three subscriber types (Professional, Executive, and All
Others) is displayed in Figure 6.2.
As in the previous example, these events are portrayed as mutually exclusive. No overlap
of event areas occurs if we are careful in defining professionals and executives. For example, a
doctor is reclassified from Professional to Executive once she becomes a medical school dean or
head of the hospital. In this manner, no subscriber will be listed as both professional and
executive.
When events are mutually exclusive, their probabilities may be combined in a
particularly simple manner.
Addition Rule for Two Mutually Exclusive Events: If A and B are mutually exclusive
events, the probability of either A or B (or both) occurring is the sum of the probabilities of
each event. In notational shorthand,
P(A or B) = P(A) + P(B)
where "or" separating two events indicates that either or both events occur.

We can sum these areas together without worrying about counting the probability of any
outcome twice because the events are mutually exclusive. No outcome probability is double
counted because no outcomes are common to both events. The combined area occupied in
Figure 6.2 by Professional and Executive represents the sum of their probabilities.
P(Professional) + P(Executive) = 0.20 + 0.12 = 0.32
Advertisers can be informed of a nearly one-in-three chance (32 percent) that a magazine
carrying their ads will reach a subscriber who is either a professional or an executive.
If all events are mutually exclusive, their event probabilities may be summed.
Addition Rule for Mutually Exclusive Events: If events are mutually exclusive, the
probability of at least one of them occurring is the sum of their individual probabilities. If
A, B, C, and D are mutually exclusive,
P(A or B or C or D) = P(A) + P(B) + P(C) + P(D)


For example, suppose survey evidence finds that the market for submarine sandwiches consists
of the following: (1) walk-in customers, (2) drive-thru window traffic, (3) phone orders, (4)
faxed orders, and (5) special party catering. Since these are mutually exclusive events, their
probabilities may be summed. If P(drive-thru) = 0.20, P(phoned in) = 0.15, and P(faxed in) =
0.05, then
P(drive-thru or phoned in or faxed in) = 0.20 + 0.15 + 0.05 = 0.40
A sub shop that does not offer any of these three methods of ordering will have a 40 percent
chance of losing each potential customer.
We learned earlier that certainty is present only if the probability is 1.0. Recall also from
Section 1 that a sample space S consists of all possible distinct outcomes of an experiment and
an event is a collection of outcomes meaningful to a decision maker. Thus, the probability of
being in the sample space, P(S), is 1.0 because S is an exhaustive listing of all possible
outcomes.
DEFINITION: An exhaustive set of events contains every possible outcome and therefore has a
probability of 1.0.
Thus, the three events described in Figure 6.2 are exhaustive. The pair of complementary events
portrayed in Figure 6.1, Professionals and Non-professionals, is also exhaustive.
The set containing an event and its complement must be exhaustive.
Moreover, we previously found the events in both these Venn diagrams (Figures 6.1 and
6.2) to be mutually exclusive. Therefore, their event probabilities must also add up to 100 percent.
The probabilities for a set of mutually exclusive and exhaustive events must sum to 1.0.
If {A, B, C, D} is a mutually exclusive and exhaustive set, then
P(A) + P(B) + P(C) + P(D) = 1.0

For example, the three subscriber events sum to 1.0.
P(Professional) + P(Executive) + P(All Others) = 1.0
If we are provided with the probabilities for all but one of these events, we may easily
calculate the probability of the remaining event.
For any set of mutually exclusive and exhaustive events {A, B, C, D, . . . }, the
probability of any particular event, say A, is
P(A) = 1.0 - P(B) - P(C) - P(D) - . . .


This rule is most commonly used to calculate the probability of a catch-all event such as all
others, everything else, or none of the above. For the three occupations in Figure 6.2, for
instance, we know that P(Professional) = 0.20 and P(Executive) = 0.12. Therefore,
P(All Others) = 1.0 - 0.20 - 0.12 = 0.68.
Consequently, there is a 68 percent chance that any given magazine will be purchased by a
subscriber not in the desirable target audience of professionals and executives.
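These bookkeeping rules are easy to automate. The following is a minimal Python sketch (our own illustration, not part of the text's spreadsheet examples) that sums the probabilities of the mutually exclusive target events assumed above and then applies the complement rule to find the catch-all probability.

    # Probabilities of mutually exclusive subscriber events (from the magazine survey example)
    p_professional = 0.20
    p_executive = 0.12

    # Addition rule for mutually exclusive events
    p_target = p_professional + p_executive        # P(Professional or Executive)

    # Complement rule: the catch-all "All Others" event
    p_all_others = 1.0 - p_target

    print(f"P(Professional or Executive) = {p_target:.2f}")   # 0.32
    print(f"P(All Others)                = {p_all_others:.2f}")   # 0.68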
Since any event and its complement are mutually exclusive and exhaustive, the addition
rule also yields a sum of 100 percent.
For any event A in the sample space,
P(A) + P(Aᶜ) = 1 and P(A) = 1 - P(Aᶜ).

We recognized this relationship intuitively by remarking about Figure 6.1 that the other
four-fifths must be non-professional subscribers. This formula is especially useful whenever working
with the complement of an event is easier. By solving for the probability that an event does not
occur, the probability of the event itself can easily be found.

Conditional Probability, Independent Events, and Randomness
Until now, we have only worked with probabilities called marginal probabilities.
DEFINITION: Marginal probability is the overall probability a single event occurs.
Examples of marginal probability are P(Professional), P(a job offer from DuPont), and
P(winning the office football pool this week). A second kind of probability we'll also need to
understand is conditional probability.
DEFINITION: The probability of event A occurring given that event B occurs is the conditional
probability, denoted P(A | B).

This probability tells us the chance of an event A occurring under special "conditions," namely
that another event B also occurs. The vertical line separating the two events translates to the
words given that. Thus, we may read P(A | B) as follows: "the probability that A occurs given
that B occurs."

For any particular event A, conditional probability P(A |B) may vary depending on which
event B is given. For example, at the beginning of the term, your chance of failing your finance
course may be only 0.05. This unconditional probability is the marginal probability P(failing
finance). The conditional probability of failing finance may be considerably greater, say 0.40,
given that you failed the midterm exam. By contrast, the conditional probability of a failing
course grade might shrink to 0.01 given you receive a midterm average of 85.
A common application of conditional probability occurs when two (or more) attributes
are measured for each experiment, such as the multivariate data discussed in Part One. A
weather broadcast may report temperature, humidity readings, cloud cover, air pollution levels,
barometric pressure, and wind speed and direction. Some of these factors are interrelated. The
probability of a very hot summer temperature is lower given that rain is forecast. The statistical
models introduced in Part Four estimate one variable conditional on a particular combination of
other events occurring.
However, one situation exists in which conditional and marginal probabilities are not
different: when events are independent.
DEFINITION: Event A is independent of event B if (and only if) P(A) = P(A|B).
The term "independence" is a good word to use here, because it suggests objects that are free
from each other's influence. The Declaration of Independence "dissolved the bonds" of the
thirteen American colonies from control by England.
Independence can be difficult to achieve in practice. Moving away from home may still
not make you independent if you "depend" on parents for college tuition or to do laundry you
bring home every other weekend. Even if you are financially independent, your choice of a
marriage partner may not be independent of your parents' wishes. Similarly, the success of any
company is dependent on a host of factors outside the control of management, such as the
economic climate, the prices of production inputs, and the pricing strategies of rival companies.
Often, we expect unrelated events to be independent. Although astrology is a lucrative
practice, your astrological sign should be independent of how much money you make or whether
today is a good day for taking an ocean cruise. The same is true for many superstitions and
prejudices. Wars and plague cannot be prevented by expelling ethnic minorities. Walking under
a ladder or breaking a mirror will not increase the probability of misfortune. Quack medicines
won't improve your chances of surviving cancer.
Many lottery ticket buyers carefully examine previous winning numbers in an attempt to
gain a betting advantage. Some people believe that numbers not drawn recently are overdue,
while others believe that betting recent winning combinations will improve their odds. However,

past results would be useful only if the numbers drawn in one lottery are somehow related to the
outcomes of subsequent drawings. If each lottery represents a fair experiment, the winning
numbers are independent from week to week.
A crucial property of independence is that it operates in both directions.
If A is independent of B, then B is independent of A. Consequently,
P(A) = P(A | B) and P(B) = P(B | A).

Thus, if the probability of a new Broadway musical being successful is the same whether or not
its cast contains a big-name star, then the probability of attracting a major star will be identical
regardless of whether the play becomes a hit.
Independence also may be applied to any number of events that are independent of one
another, or mutually independent.
If events are mutually independent, any conditional probability is simply the
corresponding marginal probability. Thus, if A, B, C, and D are mutually independent
events, then
P(A | B,C, and D) = P(A)

Working with mutually independent events can simplify probability computations immensely.
For example, a waiter at a popular restaurant determines from years of observation that the
following are all independent events: (1) a big tipper, (2) a male customer, and (3) a person
ordering alcohol with the meal. Then the probability of receiving big tips is the same for
male and female customers, regardless of whether they order alcohol. In addition, customers
ordering liquor won't tip any more or less than teetotalers, and males are just as likely to order
alcohol as female customers.
Independence is among the most important concepts in statistics and is critically important to
statistical inference. To use the inferential methods developed in this text, we will often need to
rely on independence among statistical events. Finally, the techniques to design
controlled experiments and eliminate confounding (described in Chapter 5) are efforts to obtain
independence.
Independence has perhaps its most important implications for random survey sampling
and random experimental processes. Each observation drawn by simple random sampling, for
example, must be independent of all other draws in the same sample. Thus, the selection of a
particular value from the sample space has no memory of the draws that have gone before it.

A random experiment results from repeated independent observations drawn from the
same sample space.
Independence can often be verified by examining frequency information from large samples.
Are conditional and marginal probabilities approximately the same? If so, there is independence.
Even small samples may provide clues about whether independent sampling is taking place.
Unfortunately, most people have a mistaken impression of what random sequences look like.
Consider Figure 6.3a, three sequences of heads (H) and tails (T). See if you can guess
which one was not the result of 20 successive coin flips.
Sequence 1: T H T H H H H H T H T T T H T T T T H T
Sequence 2: T T H T T H H H T T H T H H H H H H H T
Sequence 3: T T H T H H T H T H T H T H T H H T H T
Figure 6.3a

Did you guess Sequence 1 or 2 because their long runs of consecutive heads or tails just don't
feel random? For example, Sequence 1 has a run of five successive heads and then runs of three
and four tails (see Figure 6.3b). Sequence 2 has an even longer run of seven heads.
The Three Sequences with Runs of Three or More Highlighted
Sequence 1: T H T H H H H H T H T T T H T T T T H T
Sequence 2: T T H T T H H H T T H T H H H H H H H T
Sequence 3: T T H T H H T H T H T H T H T H H T H T
Figure 6.3b
When it comes to probability, however, our common sense often leads us astray. We
intuitively feel that runs of several heads or tails should not occur if the coin flips are
independent of each other. Therefore, we expect successive coin flips to alternate heads and tails
more often than they actually do. Most people are therefore surprised to learn that Sequence 3 is
the only one we did not generate by a random process of coin flips! [6]

With a little reflection about what we now know of independence, it is easy to see why
Sequence 3 is least likely to be a random sequence. If this is a fair coin,
P(H) = P(T) = 0.5


[6] For a very readable discussion of why our intuitions are often incorrect, see Clifford Konold, "Issues in Assessing Conceptual Understanding in Probability and Statistics," Journal of Statistics Education, v. 3, n. 1 (1995).

For the coin flips to be mutually independent experiments, each H should have an equal chance
of being followed by a T or another H. Thus, runs of three or four heads should occur frequently,
and runs of seven or more will happen occasionally. The same should be true for runs of tails.
We should only rarely observe the alternating sequence H T H T H T . . . However, we see that
pattern predominating in the third sequence. Eight of nine tails are followed by heads, and eight
of ten heads are followed by tails in Sequence 3. By contrast, notice that alternating heads and
tails occur only about half the time in the first two sequences we obtained by independent coin
flips. This independence property of random sampling is essential to the sample inference
methods we will develop in later chapters.
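To convince yourself that long runs are a normal feature of independent flips, a quick simulation helps. The sketch below is a hypothetical Python illustration (not part of the text's examples): it generates many 20-flip sequences of a fair coin and reports how often the longest run of identical outcomes reaches five or more.

    import random

    def longest_run(flips):
        # Length of the longest streak of identical outcomes in the sequence
        best = current = 1
        for prev, nxt in zip(flips, flips[1:]):
            current = current + 1 if nxt == prev else 1
            best = max(best, current)
        return best

    random.seed(1)                  # fixed seed so the sketch is reproducible
    trials = 10_000
    longest = [longest_run([random.choice("HT") for _ in range(20)])
               for _ in range(trials)]

    share = sum(run >= 5 for run in longest) / trials
    print(f"Share of 20-flip sequences with a run of 5 or more: {share:.2f}")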
Another common misconception about randomness is that each event will occur at
approximately its probability, even in small samples. For example, suppose a manager has six
assistants: Jim, Barb, John, Dave, Jenny, and Karen. She has recently been accused of favoring
three assistants (Jim, Dave, and Karen) based on her 60 most recent selections for first-class air
travel on business trips. While Jim, Dave, and Karen each went first class at least 12 times,
Barb, John, and Jenny enjoyed that perk only seven times apiece (see Figure 6.4a). The manager
protests that she did allocate the limited number of company first-class seats randomly: she
assigned each employee a number from 1 to 6 and rolled a die before each trip to decide which
of her assistants flew first class. In fact, the frequency pattern in Figure 6.4a was produced
randomly by the computer equivalent of rolling a die (like the simulations in Chapter 5). So was
Figure 6.4b, which appears to discriminate against John and favor Dave most.
Figure 6.4a: Frequency of first-class trips for each of the six employees, generated by random die rolls.
Figure 6.4b: A second randomly generated frequency pattern for the same six employees.
The source of such workplace misunderstandings is the unreasonable expectation that
randomness leads to predictable relative frequencies in smaller samples. Any deviations often
lead to a witch hunt for the cause of a problem that doesn't even exist. Substantial variations
from one sample to the next naturally occur in random processes much more often than we want
to believe.
Randomness is commonly misinterpreted and random data are seldom identified correctly
in practice. We subconsciously expect random sequences to be much too regular, and we
mistakenly extend the law of large numbers to small samples.
Finally, remember that a lack of independence does not necessarily mean that one event
causes the other. Causation requires controlled research designs like the ones discussed in
Chapter 5. Is there any chemical in cigarette smoke that increases the risk of a premature death?
Only after decades of medical research was a cause-and-effect link finally established between
exposure to chemicals in tobacco smoke and developing lung cancer. [7] However, living near
high-power lines or having silicone breast implants has yet to be causally linked to harmful
health effects despite extensive research studies.

Chapter Case #1: How Probability Affects News Coverage
A favorite concern of the public and politicians is "media bias" in television news. News
directors claim that professional ethics prevents them from injecting bias into their reporting.
However, both sides of this debate may be ignoring a more important source of bias. Media bias
arises from selecting which stories to cover and the amount of coverage, not from slanting the
stories covered.
Local crimes and accidents seem to receive an undue amount of air time. By contrast,
major events that affect us are less often covered, and we seldom hear about developments in
science. The blame is usually placed on ratings which determine the advertising rates that TV
sponsors pay. In the competitive race for ratings, news coverage is determined by what viewers
want to watch. However, probabilities also have a role in determining which news stories are
reported.
Unlike other types of programming, television news is dominated by uncertainty. News
can occur anywhere in the world. Sometimes a single story dominates the news for days, but on
other days several stories share the spotlight. A major news event can break at any time, forcing
TV news directors to redesign their entire news programs.


[7] A very readable account of the smoking issue is presented in "Statistics, Scientific Method, and Smoking," by B. W. Brown, Jr., in Statistics: A Guide to the Unknown, J. Tanur (Ed.), San Francisco: Holden-Day, 1978.

Early on any given morning (or the previous day) a news director must decide how to
cover the likeliest news stories. The news director tries to protect her job by relying on
probabilities. It is too risky to send camera crews to remote locations on the off chance that a
story will occur in the vicinity. In addition, distant wars and natural disasters such as
earthquakes, hurricanes, and volcanoes may not receive much coverage because travel and
communications delays reduce the probability of timely reporting. These major stories might no
longer be news by the time the report is ready for broadcast. Instead, news bureaus are located
only in the world's largest cities, where the probability of news occurring is greatest.
Local stations are accused of taking an "If it bleeds, it leads!" attitude toward news
coverage. Yet crime and traffic accidents are certain to produce daily stories, and police serve as
on-the-scene reporters. Monitoring a police band radio will locate crimes and accidents, assuring
the local news director that crews she dispatches will get pictures and tearful interviews in time
for the six and eleven o'clock news. At the opposite extreme are stories involving local politics,
business development, and social conditions. A local news director will seldom cover these
stories because newsworthy events are not expected on a daily basis.
A majority of today's stories assigned to camera and reporting crews will follow up the
biggest stories from the preceding day. The reason for this allocation of resources is conditional
probability. If something was interesting to viewers yesterday, such as the Oklahoma City
bombing or Princess Diana's death, it is likely that enough will happen to make it "news" today.
Politicians also manipulate the news by playing the probability game. Self-promoters are
skilled at catering to the media's preference for certainty in news events. By alerting the media
that a news conference has been scheduled for an important announcement two hours before the
evening news, a politician can ensure coverage by guaranteeing the news director a story.
An encouraging trend in news is the increasing coverage of important issues such as
corporate downsizing or sexual harassment in the workplace that lack the headline impact of
breaking news. Because these stories can be assembled weeks in advance and inserted on
"slow news" days, news directors can use them to reduce uncertainty.

Further Concepts in Probability
We now have all the probability tools that we will need to study statistical inference. If
we explore a bit deeper into probability theory, however, we can discover many other useful
business problem-solving tools. For example, the addition rule for combining probabilities
assumes mutually exclusive events. What do we do if events are not mutually exclusive? When
events share overlapping areas in their Venn diagrams, a more complicated, general addition
rule must be applied instead. Consider the Venn diagram in Figure 6.5a, where the events are
not mutually exclusive.
Figure 6.5a: Venn diagram of subscribers with overlapping events Professional and Over $100,000; the remaining subscribers are All Others.
This figure differs from the earlier Venn diagram in Figure 6.1 because subscribers with
incomes over $100,000 are now represented. Although the survey indicates that P(over
$100,000) = 0.15, the circulation department must be careful not to overstate the probability of
either of two events occurring. Adding P(Professional) to 0.15 is not permissible because some
professionals make over $100,000. This probability is represented in Figure 6.5a by the
overlapping area for the two events. If we add probabilities, we will count this area twice.
So far, we have worked with two different types of probabilities. The simplest is
marginal probability, such as P(job offer from DuPont). The second type is conditional
probability, such as P(job offer from DuPont | graduated with a Chemistry degree). There is one
other important kind of probability called joint probability.
DEFINITION: Joint probability is the probability of two or more events occurring together.
P(A and B) is the probability of events A and B both occurring.


We underline "and" to give it a special meaning. The expression A and B signifies that BOTH
events occur. This meaning is similar to the use of "joint" in business and law, such as joint
custody of a child following a divorce or a joint venture by several oil companies to cooperate in
constructing a major pipeline. In Chapter 15, we will show how joint probabilities are related to
marginal and conditional probabilities.
The overlapping area in Figure 6.5a portrays this third type of probability. In the
subscriber example, the joint probability is:
P(Professional and over $100,000)
Although we will soon see that joint probability is important in its own right, we may use
this concept immediately to modify our addition rule.
Addition Rule for Any Two Events: If A and B are two events, the probability of either A
or B (or both) occurring is the sum of their separate probabilities minus their joint
probability. Thus,
P(A or B) = P(A) + P(B) - P(A and B)

By subtracting the joint probability, we remove the probability counted twice. The logic is easily
understood from the Venn diagram. If we add the areas for the two ovals in Figure 6.5a, we
double count the overlapping area. Thus, the sum of 0.20 and 0.15, 0.35, contains the joint
probability twice. Suppose the survey estimates the value for the joint probability to be 0.05. By
subtracting this overlap, we obtain the correct value for P(professional or over $100,000), 0.30.
What if the events are mutually exclusive? The general rule still applies, but the joint
probability is zero because no outcomes are common to the two events.
The joint probability of mutually exclusive events is zero. If A and B are mutually
exclusive,
P(A and B) = 0.

How about more complicated situations such as the one pictured in Figure 6.5b? Some
advertisers may also want to target female readers. However, the Venn diagram reflects a
probability P(female) = 0.10 for this magazine, perhaps because the publisher has failed to
address news of interest to working women. As in the case for two events, adding probabilities
can overstate the chance of a subscriber having any of the three desired attribute? Some women
are professional, and some are not. Joint probabilities are again a factor that should not be
ignored. The three overlapping areas that represent the joint probabilities again must be
subtracted in the formula. However, this procedure reduced the total by too much because the
three overlapping areas themselves overlap. We therefore need to add back in the joint
probability of all three events.

Addition Rule for Any Three Events: If A, B, and C are three events, the probability of at
least one of them occurring is:
P(A or B or C) = P(A) + P(B) + P(C) - P(A and B) - P(A and C)
- P(B and C) + P(A and B and C)

In Figure 6.5b, it is easy to spot the central area belonging to all three event areas. This
represents female professionals earning over $100,000. Suppose that the following joint
probabilities are known from the subscriber survey:
P(Professional and over $100,000) = 0.05
P(Professional and female) = 0.05
P(over $100,000 and female) = 0.04
P(Professional and over $100,000 and female) = 0.02
Then, applying the three event addition rule we obtain the following probability:
P(Professional or over $100,000 or female) = 0.20 + 0.15 + 0.10 - 0.05 - 0.05 - 0.04 + 0.02
= 0.33
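The two- and three-event addition rules are simple enough to check by hand, but a few lines of code make the bookkeeping explicit. The sketch below is a minimal Python illustration using the subscriber probabilities assumed above (the function names are ours, not the text's).

    def union_two(p_a, p_b, p_a_and_b):
        # Addition rule for any two events: subtract the double-counted overlap
        return p_a + p_b - p_a_and_b

    def union_three(p_a, p_b, p_c, p_ab, p_ac, p_bc, p_abc):
        # Addition rule for any three events (inclusion-exclusion)
        return p_a + p_b + p_c - p_ab - p_ac - p_bc + p_abc

    # Subscriber survey probabilities assumed in the example above
    p_two = union_two(0.20, 0.15, 0.05)
    p_three = union_three(0.20, 0.15, 0.10, 0.05, 0.05, 0.04, 0.02)
    print(f"P(Professional or over $100,000) = {p_two:.2f}")             # 0.30
    print(f"P(Professional or over $100,000 or Female) = {p_three:.2f}")  # 0.33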

Figure 6.5b: Venn diagram with three overlapping subscriber events: Professional, Over $100,000, and Female.
Notice also that the three events pictured in Figure 6.5b are neither mutually exclusive nor
exhaustive. Events not only overlap, but also fail to cover the Venn diagram area. There
obviously are many subscribers with none of the three traits sought by advertisers. But what is
that remaining probability? Having already solved for the combined area occupied by the three
events, we may easily determine the probability of the complement: subscribers with none of the
three desired attributes. Since P(Professional or over $100,000 or female) was found equal to
0.33, the probability of a male, nonprofessional with income no more than $100,000 must be 1.0
- 0.33, or 0.67.
University admissions directors are wary of this double counting problem. Suppose a
prestigious, residential college boasts admitting 10 percent racial minorities and 10 percent economically
disadvantaged in its freshman class. With random room assignments, the probability of being
assigned a roommate of either type may still be only slightly more than 10 percent if nearly all
the minority admissions are also economically disadvantaged. The college is using most of the
same people to support its admissions claims of racial diversity and economic opportunity. The
Venn diagram would then look like Figure 6.6.
Earlier, we introduced the concept of independence. One primary benefit of working
with independent events is the ease of calculating joint probabilities.

Figure 6.6: Venn diagram in which the Minority and Disadvantaged admissions events overlap almost completely.

Product Rule for Independent Events: For independent events, joint probability is the
product of marginal probabilities. If A, B, C, D, . . . are independent events,
P(A and B and C and D and . . . ) = P(A) P(B) P(C) P(D) . . .
This product rule does not use conditional probabilities because independence assures us that
conditional and marginal probabilities are identical. Thus, the chance that events occur together
is merely the product of their probabilities of occurring separately. In Chapter 15, we will show
how joint probabilities must be calculated from conditional probabilities when events are not
mutually independent.
One important business application of this product rule involves experiments with a large
number of independent repetitions. A critical machine used in an auto assembly plant may have
only a 0.001 probability of jamming for each auto going through the production line. Since the
probability is 0.001, or one in 1000, many people interpret this to mean that the machine jams
every 1000 cars. An understanding of the product rule and complementary events allows us to
correct this misconception. The complement, No Jam, has a probability of 1.0 - 0.001 = 0.999.
If P(No Jam) = 0.999 for each car assembly, the probability of a production run without a
machinery jam may be calculated by the product rule for independent events. We simply
multiply 0.999 by itself n times, where n is the number of assemblies in the production run. The
resulting joint probabilities are illuminating for a plant maintenance manager. For example,
P(No Jam after n = 128 cars) = (0.999)^128 = 0.88
and
P(No Jam after n = 1000 cars) = (0.999)^1000 = 0.37.

Using the rules for calculating probabilities of complementary events, there is a 12 percent
chance (1.0 -0.88) of one or more machinery jams in a production run of 128 cars. The
probability rises to 63 percent (1.0 -0.37) of at least one jam during a 1000 car production run.
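The same calculation is easy to reproduce for any jam probability or run length. The following hypothetical Python sketch (our own illustration, with the per-car jam probability of 0.001 assumed above) applies the product rule for independent events and then the complement rule to find the chance of at least one jam.

    def prob_at_least_one_jam(p_jam_per_car, n_cars):
        # Product rule for independent events: P(no jam in n cars) = (1 - p)^n,
        # then the complement rule gives P(at least one jam)
        p_no_jam_run = (1.0 - p_jam_per_car) ** n_cars
        return 1.0 - p_no_jam_run

    for n in (128, 1000):
        p = prob_at_least_one_jam(0.001, n)
        print(f"P(at least one jam in {n} cars) = {p:.2f}")
    # Prints roughly 0.12 for 128 cars and 0.63 for 1000 cars, matching the text.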
Because it requires several events to occur together, a joint probability may be
unexpectedly small. Suppose that a hostage rescue mission will succeed only if each of three
events occurs during rescue operations. The weather must cooperate and be overcast to prevent
the squad from being spotted by moonlight. Secondly, each of three helicopters must be
operational throughout the mission. Finally, sufficient squad members must survive the shootout
to rescue the hostages. If the following marginal probabilities are estimated by the squad leader:
P(overcast weather) = 0.3
P(helicopter operational) = 0.8 for each helicopter
P(shootout survival) = 0.25


then the joint probability, the chance of mission success, is:
P(mission success) = (0.3)(0.8)(0.8)(0.8)(0.25) = 0.0384.
Although each marginal probability may appear to involve acceptable risk, the less than 4 percent
chance of mission success does not recommend it. Many corporate strategies suffer from this
same deficiency. We often blame Murphy's law: If anything can go wrong, it probably will.
Most often, however, failure occurs because we expect too many things to go right!

Multiple Choice Review:
6.16 If the probability of an event A is P(A) = .25, then the probability of its complement C,
P(C) must be
a. 0
b. 0.25
c. 0.5
d. 0.75
e. 1.0

6.17 If this year, P(Recession) = 0.25, P(Mideast War) = 0.10, and
P(Recession and Mideast War) = 0.05, then P(Recession or Mideast War) is
a. 0.025
b. 0.20
c. 0.30
d. 0.35
e. 0.40

6.18 Upon entering an intersection where two roads cross, cars have a 0.2 probability of
turning left. Then the event Right Turn
a. has a probability of 0.8
b. is the complementary event of Left Turn
c. is mutually exclusive of Left Turn
d. is independent of Left Turn
e. all of the above

6.19 Suppose that of all newly-trained employees at a fast-food franchise, fifty percent are still
working there one year later but another thirty percent do not last more than the first two
months. Therefore, there is a twenty percent chance that a new employee will
a. be around after two months

b. not last a year
c. be fired immediately
d. work there more than two months but no more than a year
e. insufficient information to answer

6.20 New products of the type your company is considering have achieved market success in
200 out of 1000 cases. Which of the following correctly describes this record?
a. the probability of success is 0.20
b. the probability of not achieving success is 80 percent
c. the odds are four-to-one against success
d. the chance of success is one in five
e. all of the above

6.21 If there are even odds that you will get a job offer from Acme, Inc., then the probability
P(Acme job offer) equals
a. 100 percent
b. 75 percent
c. 50 percent
d. 33 percent
e. not enough information provided

6.22 Which of the following does not indicate statistical independence among events?
a. Students in the honors program have the same chance of passing this course as any
other type of student.
b. The chance of a small firm failing is identical to that for any size firm.
c. The odds of the boss' son getting a promotion are no better or worse than for any other
employee of the firm
d. People who saw their new commercial were no more likely to shop at K-Mart than
those who didn't watch it.
e. All of the above indicate independent events.

6.23 For a product to be delivered on time, all of the following must occur: the order is
processed within one work day, the product is shipped the following day, and the
shipment is routed through the proper regional distribution center. If these events are
independent and their probabilities are 0.8, 0.8, and 0.5, the probability of on-time
delivery is
a. 0.50
b. 0.48
c. 0.40
d. 0.32

e. 0.24

6.24 If S is the set containing events describing the time between servicing of a copying
machine under warranty, which of the following would be a possible event in S?
a. five months
b. customer complained
c. asked for money back
d. cartridge need replacing
e. all of the above

6.25 If events S = {A, B, C} is an exhaustive set of events, then
a. P(A) + P(B) + P(C) = 1
b. P(S) = 1
c. A, B, and C must be mutually exclusive events
d. all of the above
e. none of the above

6.26 If S = {A, B, C} and P(A) + P(B) + P(C) = 1.0, then
a. A, B, and C are mutually exclusive and exhaustive events.
b. A, B, and C are mutually exclusive but not exhaustive events.
c. A, B, and C are exhaustive but not mutually exclusive events.
d. A, B, and C are complements.
e. A, B, and C are independent events.

6.27 If the sample space consists of a listing of each product sold at a supermarket, then frozen
foods and fresh produce are
a. outcomes
b. events
c. sample spaces
d. experiments
e. none of the above

6.28 If the sample space consists of a listing of each product sold at a supermarket, then frozen
foods and fresh produce are
a. independent
b. mutually exclusive
c. exhaustive
d. complementary
e. all of the above


6.29 Getting promoted to a senior vice-president position has a probability of 0.50 if you are
head of the marketing but only 0.20 if you are head of research. These are examples of
a. marginal probabilities
b. joint probabilities
c. conditional probabilities
d. independent probabilities
e. none of the above

6.30 Which of the following sets consists of mutually exclusive events?
a. Marital Status = {single, married, divorced, widowed}
b. Education = {high school degree, college degree}
c. Region = {South, West, East, North}
d. Occupation = {office worker, clerk, receptionist}
e. none of the above

6.31 If events A and B are statistically independent, and P(A) = 0.5 and P(B) = 0.2, then the
joint probability P(A and B) must be
a. 0.1
b. 0.2
c. 0.5
d. 0.9
e. cannot calculate without knowing the conditional probabilities

U.S. manufacturers are surveyed about whether they have fewer than 50 workers and if they do
any exporting of their product. Answer the following questions based on the Venn Diagram:
6.32 Which of the following is described in the Venn diagram?
a. Exporting has a greater probability than Less Than 50 Workers
b. Less Than 50 Workers has a greater probability than At Least 50 Workers.
c. At Least 50 Workers has a greater probability than Not Exporting.
d. All of the above are portrayed in the diagram.
e. Cannot be answered by examining the diagram.


6.33 Which of the following are described in the Venn diagram?
a. Exporting and Less Than 50 Workers are exhaustive events.
b. Exporting and Less Than 50 Workers are mutually exclusive events.
c. Exporting and Less Than 50 Workers are complementary events.
d. All of the above
e. None of the above

6.34 Which of the following conditional probabilities is most closely approximated in the Venn
diagram above?
a. P(Exports | Less Than 50 Workers) is 0.50
b. P(Less Than 50 Workers | Exports) is 0.50
c. P(Exports | Less Than 50 Workers) is 0.85
d. P(Less Than 50 Workers | Exports) is 0.85
e. Cannot be answered with information presented in the diagram.

6.35 A new comedy series has a 20 percent chance of being renewed for a second season. On
average, 60 percent of second-season comedies are renewed. What is the probability a
new comedy will still be on the air for its third season?
a. 0.48
b. 0.32
c. 0.12
d. 0.08
e. cannot be determined from the information given

6.36 If events A and B are statistically independent, then
a. P(A) = P(B)
b. P(A) + P(B) = 1
c. P(A|B) = P(A)
d. P(A|B) = P(B)
e. all of the above


Figure 6.7: Venn diagram of U.S. Manufacturing Companies showing the events Export and Less Than 50 Workers.

6.37 Which of the following pairs of events are most likely to be independent?
a. a rusting fender and a car less than two years old
b. a person who has attended college and an athlete over 7 feet tall
c. a company president earning more than $500,000 a year and a Fortune 500 company
d. an A on the first exam and final course grade of D
e. having blue eyes and majoring in accounting

6.38 If the probability of an event A is P(A) = .25, then the probability of its complement C,
P(C) must be
a. 0
b. 0.25
c. 0.5
d. 0.75
e. 1.0

6.39 People usually express reluctance to offer a needy relative one of their two kidneys for
transplant surgery. They are often convinced to become donors when they learn that the
same diseases that cause one kidney to fail will also damage the other kidney. This
argument relies on explaining that the failures of the two kidneys are not
a. exhaustive events
b. mutually exclusive events
c. independent events
d. complementary events
e. all of the above

6.40 If events A and B are statistically independent, and P(A) = 0.5 and P(B) = 0.4, then the
joint probability P(A and B) must be
a. 0.1
b. 0.2
c. 0.5
d. 0.9
e. cannot calculate without knowing the conditional probabilities

6.41 For a product to be delivered on time, all of the following must occur: the order is
processed within one work day, the product is shipped the following day, and the
shipment is routed through the proper regional distribution center. If these events are
independent and their probabilities are 0.8, 0.6, and 0.5, the probability of on-time
delivery is
a. 0.50
b. 0.48

c. 0.40
d. 0.30
e. 0.24

Calculator Problems:
6.42 If A and B are mutually exclusive events, determine the probability P(A or B) under each
of the following circumstances:
(a) P(A) = 0.05 and P(B) = 0.20
(b) P(A) = 0.50 and P(B) = 0.30
(c) P(A) = 0.50 and P(B) = 0.49

6.43 If A, B, C, D, E, and F are mutually exclusive and exhaustive events and P(A) = 0.10, P(B) =
0.20, P(C) = 0.30, P(D) = 0.15, P(E) = 0.23, and P(F) = 0.02, determine each of the
following:
(a) P(A or B)
(b) P(B or C or E)
(c) P(Aᶜ)
(d) P(Bᶜ)

6.44 Use the general addition rule to calculate P(A or B) in each of the following cases:
(a) P(A) = 0.60, P(B) = 0.50, and P(A and B) = 0.40
(b) P(A) = 0.10, P(B) = 0.10, and P(A and B) = 0.05
(c) P(A) = 0.90, P(B) = 0.30, and P(A and B) = 0.20
(d) P(A) = 0.50, P(B) = 0.30, and P(A and B) = 0.30
Draw Venn diagrams that accurately reflect the marginal and joint probabilities given in
each part of the preceding problem.


6.3 Random Variables and Expected Value
We now apply our newly-acquired knowledge of experiments, sample spaces, and
probability to define random variables.
DEFINITION: A random variable is a variable whose quantitative values are outcomes of an
experiment.
Understanding and predicting random variables such as sales, stock prices, and interest rates is
the focus of much business analysis. You have worked with variables since your first algebra

course. Part One of the text explained how to define, measure, display, and summarize random
variables and their multivariate relationships.
Recall that quantitative data may be discrete or continuous. The sample space of a
random variable determines whether random variables are discrete or continuous.
DEFINITION: A discrete random variable is a random variable consisting of discrete data. A
continuous random variable is one with continuous data in its sample space.
Using definitions of discrete data from Chapter 1, a few examples of random variables and their
sample spaces are:
(1) Meet, number of staff in meetings, has sample space {0, 1, 2, 3, 4}.
(2) Score, finance score for a 100-point exam, has sample space {0, 1, . . . , 100}.
(3) Age, employee age in company sales force, has sample space {18, 19, . . . , 65}.
(4) BillErr, billing errors in a lot of 20 orders, has a sample space {0, 1, . . . , 20}.

In this chapter, we restrict our discussion to discrete random variables and postpone until
Chapter 7 a discussion of probabilities for continuous random variables which present special
challenges.

Probability Distributions for Discrete Random Variables
Unlike the variables you remember from algebra classes, the values of these and other
discrete random variables may not be determined with certainty. Usually, the best we can do is
state the probability of each numerical outcome. To do so, we must match each value in the
sample space with its probability of occurring. This is called the probability distribution.
DEFINITION: The probability distribution, P(X), for a discrete random variable X is a listing
of the probabilities for each possible value of that variable.
In everyday speech, the word random often suggests disorder. Random acts of
senseless violence suggest a breakdown in our social fabric. In statistics, however, random does
not mean utter chaos where anything can happen. The value of a random variable does vary, but
each possible value occurs with a specific probability. Thus, random variables contain elements
of order and disorder.
Although random variables may vary from one experiment to the next, outcomes are
restricted to the sample space and likelihoods are precisely regulated by a probability
distribution.

How are probability distributions for random variables discovered? According to the
frequency approach to probability discussed in section 1, probabilities may be approximated
from the frequency distribution of a large sample. Histograms based on frequency distribution
data are easily converted to relative shares. Recall from Section 1 that the relative frequency
of an event is the ratio of its frequency to the sample size.
For each possible outcome xi of the random variable X occurring ni times in a sample of n
observations, its relative frequency f(xi) is
f(xi) = ni / n
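In code, converting a sample into approximate probabilities is a one-line frequency count. The sketch below is a hypothetical Python illustration (the sample of meeting counts is invented for the example, not taken from the text).

    from collections import Counter

    # A hypothetical sample of observed outcomes (e.g., number of staff at meetings)
    sample = [4, 4, 0, 1, 4, 2, 3, 4, 0, 4, 1, 4, 2, 4, 4, 0, 4, 3, 4, 1]

    n = len(sample)
    counts = Counter(sample)                                   # n_i for each outcome x_i
    rel_freq = {x: counts[x] / n for x in sorted(counts)}      # f(x_i) = n_i / n

    for x, f in rel_freq.items():
        print(f"f({x}) = {f:.2f}")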

In the preceding section, we discovered properties for mutually exclusive and exhaustive
events. Conveniently, all outcomes of a random variable must be mutually exclusive and
exhaustive. For example, consider the sales by a machine parts distributor. The random variable
Sales measuring annual sales (in millions of dollars) must have a single numerical value. Sales
may be 12, 50, or 212 million dollars, but sales cannot be more than one of these values in the
same year for the same parts distributor. Consequently, a random variable X obeys the addition
rule for mutually exclusive events, and probabilities must sum to 100 percent because random
variable numerical values are exhaustive.

If X is a discrete random variable with sample space {x1, x2, . . . , xk}, then [8]
P(x1) + P(x2) + . . . + P(xk) = 1
[8] For clarity, we use capitalized letters (such as X and Y) to identify the random variables themselves and lowercase letters (such as x1 or yn) for particular numerical values of X and Y.

Government labeling requirements state that prescription medicines must provide instructions
for proper use, possible side effects, and warnings to high-risk patients. As future users of
statistical methods, you will need a similar set of warnings. A list of assumptions, often stated in
the language of probability, alerts us that certain conditions should (or should not) be present for
the statistical method to work as advertised. Independence is one of the most commonly made
assumptions, and marginal and conditional probability concepts from Section 2 also may be
applied to random variables.

Two random variables X and Z are independent if their conditional and marginal
probabilities are equal for all values of the random variables.
P(X) = P(X|Z) for all values in the sample space of X and Z

If two random variables are independent, the probability distribution for either variable is
unaffected by the value of the other variable. For example, the number of textbooks you need to



buy this term is likely to be independent of the number of brothers and sisters you have.
However, the number of textbooks is dependent on the number of classes you take. For
example, the probability P(2 books|5 classes) should be close to zero, but P(2 books|2 classes)
may be fairly large, say 0.75.
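Independence of two discrete random variables can be checked numerically by comparing each conditional probability with the corresponding marginal. The sketch below is a minimal Python illustration with an invented joint distribution (the numbers are ours, chosen only to show the mechanics; this particular joint factors into its marginals, so the check reports independence).

    # Hypothetical joint distribution P(X = x and Z = z) for two discrete random variables
    joint = {
        (0, 0): 0.12, (0, 1): 0.18,
        (1, 0): 0.24, (1, 1): 0.36,
        (2, 0): 0.04, (2, 1): 0.06,
    }

    xs = sorted({x for x, _ in joint})
    zs = sorted({z for _, z in joint})

    p_x = {x: sum(joint[(x, z)] for z in zs) for x in xs}   # marginal P(X = x)
    p_z = {z: sum(joint[(x, z)] for x in xs) for z in zs}   # marginal P(Z = z)

    # X and Z are independent if P(X = x | Z = z) equals P(X = x) for every pair
    independent = all(
        abs(joint[(x, z)] / p_z[z] - p_x[x]) < 1e-9
        for x in xs for z in zs
    )
    print("Independent?", independent)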

Chapter Case #2: A Meeting of the Minds
Suppose you are the new manager of a customer service department. You have inherited
a pile of customer complaints about service delays and a staff of four technicians who seem
overworked. Your predecessor attributed these problems to an inefficient and lazy staff. Before
you can assess the situation, you first should gather distributional information. After studying
the time logs from the past year, you discover that two or more of your staff are frequently tied
up in company meetings. The meeting logs reveal the probability distribution and histogram in
Figures 6.12 and 6.13 for the discrete random variable Meet, the number of staff in meetings at
any time during the work day.

Figure 6.13: Discrete probability distribution P(Meet) for the number of staff at meetings.

Meet P(Meet)
0 0.2
1 0.1
2 0.1
3 0.1
4 0.5
Figure 6.12

Outcomes of a discrete random variable may be combined into a single event by
summing their probabilities. Thus, the probability that at least two members of your staff are in
meetings is 0.7. The calculations involve the simple application of the addition rule:
P(Meet ≥ 2) = P(Meet = 2) + P(Meet = 3) + P(Meet = 4) = 0.10 + 0.10 + 0.50 = 0.70
Probabilities may also be combined using a cumulative probability distribution.
DEFINITION: A cumulative probability distribution, F(x), for a random variable X is the
probability P(X ≤ x) for each possible value of X.
This is simply a direct extension of the running totals for cumulative frequencies we discussed in
Chapter 2. For example, F(1) for the Meet variable is the probability that no more than one
technician is in meetings. This is the combined, or accumulated, probabilities P(0) and P(1).
Figure 6.14: Cumulative probability distribution F(Meet) for the number of staff at meetings.

Thus,
F(1) = P(Meet ≤ 1) = P(0) + P(1) = 0.2 + 0.1 = 0.3
Likewise, F(2) would include P(2) = 0.1 in addition, for a combined probability of 0.4. The
cumulative probability for the largest possible value, in this case F(4), will be 1.0 because we can
always be certain of being at or below the maximum. Figure 6.14 displays this cumulative
probability distribution.
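Building F(x) from P(x) is just a running total. The following minimal Python sketch (our own illustration of the Meet distribution given in Figure 6.12) accumulates the probabilities and reproduces the values discussed above.

    from itertools import accumulate

    meet_values = [0, 1, 2, 3, 4]
    p_meet = [0.2, 0.1, 0.1, 0.1, 0.5]       # probability distribution from Figure 6.12

    cumulative = list(accumulate(p_meet))    # F(x) = P(Meet <= x)

    for x, F in zip(meet_values, cumulative):
        print(f"F({x}) = {F:.1f}")
    # F(0) = 0.2, F(1) = 0.3, F(2) = 0.4, F(3) = 0.5, F(4) = 1.0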
Probability distributions can say a lot about a business problem. This one tells us that the
department is operating without half or more of its service staff 70 percent of the time. Even
more alarming, there is a 50 percent chance that no technicians will be available when a service
request comes in. No wonder there are customer complaints and dejected employees!
A major obstacle to productivity in the workplace is lost time at endless meetings. If
meetings were limited to addressing specific problems or objectives, only those individuals with
relevant expertise would need to attend. Not only are smaller, task-specific meetings more
effective, they also free up time for everyone else at the company to be more productive. In this
example, the previous service manager insisted that all or most technicians attend most company
staff meetings. By sending only one representative to these meetings, your department is now
able to dig out from the backlog of service requests.

The Mode and Range for a Probability Distribution
As we learned in Chapter 3, summary measures such as the mean, mode, range, and
standard deviation may often help decision makers make sense of large data sets and complex
distributions. [9]
However, the methods we used to calculate these measures of average and
variability relied on knowledge of the population data set. In many business situations, we
cannot obtain data on the entire population of a random variable because businesses are ongoing
enterprises. Fortunately, we don't need the population or any data at all if we know the
probability distributions. These distributions can often be closely approximated from relative
frequencies available from business records or by subjective probabilities from experts
experienced in market analysis. We can then use probability distribution information to find the
mean and other summary measures using definitions and formulas that closely parallel the ones
for population data. Let's see how.


[9] The median may often be found at the value of X where the cumulative probability F(x) is 0.50. However, no unique median exists for many discrete probability distributions. In the current chapter case, for example, three or fewer were at meetings 50% of the time but four were at meetings the other 50% of the time. The middle of the distribution is neither 3 nor 4. Some convention, such as halfway between 3 and 4, must then be followed to establish the median. We encountered a similar problem in Chapter 3 when the median fell between two discrete values.

The mode and range are the simplest, but sometimes the most useful, summary measures
to find from a probability distribution. The range is merely the difference between the largest
and smallest values in the sample space. In the case example, zero to four technicians may be in
meetings. To maintain operations and reduce customer complaints, you may want to establish a
rule that no more than three members of your staff may attend meetings at the same time.
Because then at least one person is always on duty to field service calls, the range of possible
outcomes shrinks from four to three under the new rule.
To find the mode, we look for the most likely outcome.
The mode of a discrete random variable is the outcome with the largest probability, and the
range is the range of outcomes described by the sample space.
The mode in the service department case occurs at Meet = 4 technicians because P(4) is 0.5,
which is greater than for any other outcome. The mode is an especially important summary
measure here because it alerts the new department manager that having all four technicians away
at meetings is clearly the most frequent situation.

Using Probability Weights to Find the Mean of a Discrete Random Variable
Recall from Chapter 3 that μ was defined as the arithmetic mean of population data.
If the population contains N observations of the random variable X, the population mean is:
μ = Σ xi / N.
For the typical discrete random variables, the limited number of outcomes will occur repeatedly
in the population. In the customer service department case, for example, either 0, 1, 2, 3, or 4
technicians will be at staff meetings at any given time. If the entire population were known, we
could group repeating values together and calculate μ for Meet as follows:
μ_Meet = [(0 + 0 + . . . + 0) + (1 + 1 + . . . + 1) + (2 + 2 + . . . + 2) + (3 + 3 + . . . + 3) + (4 + 4 + . . . + 4)] / N



By multiplying the five possible outcomes by the number of times each outcome occurs in
the population, we obtain the formula:
μ_Meet = (0·N0 + 1·N1 + 2·N2 + 3·N3 + 4·N4) / N
where N0 is the number of times zero technicians are in meetings, N1 is the number of times one
is in meetings, and so forth. To illustrate, suppose the population was very small, say only N =
10, with Meet = 0 occurring once, Meet = 1 three times, and Meet = 2, 3, and 4 two times each.
Substituting, we find the following mean for department staff in meetings:
μ_Meet = [1(0) + 3(1) + 2(2) + 2(3) + 2(4)] / 10

As mentioned earlier, however, calculating the mean this way is often not practical.
Because meetings may start and end anytime, data on Meet collected every minute results in a
population of nearly 125,000 observations annually (52 weeks x 40 hours / week x 60 minutes /
hour)! Thankfully, collecting and analyzing such cumbersome data sets is seldom necessary if
we know the probability distribution for the random variable. What happens when we divide N
across each term in the numerator of the μ_Meet formula? Then N0 / N is the relative frequency of
Meet = 0 occurrences, and similar meanings can be attached to N1 / N, N2 / N, N3 / N, and N4 / N.
Each of the five possible values of Meet has been assigned a specific importance, or "weight,"
according to its rate of occurrence. Values with high probabilities are assigned larger weights,
while less likely values are down-weighted. The mean then equals the sum of each outcome
weighted by its relative frequency or probability.
You learn about weighted sums when you calculate your course grade. For example, you
ask your management instructor why you received a B for the course when your test scores of
100, 100, and 70 average to 90. However, this would be true only if each of the three exams
had the same weight. The instructor says your average is only 85 because the first two tests
count only 25 percent of the grade each while the final exam has a 50 percent weight. The
following weighted average calculation shows that your B grade was proper:
Course Average = Σ (test score × test weight)
= 100 (0.25) + 100 (0.25) + 70 (0.50)
= 25 + 25 + 35 = 85

A weighted average that uses probabilities as weights relies on the idea of expected
value.


DEFINITION: The expected value, E(X), of a discrete random variable X with probability
distribution P(X) is the sum of each value of X weighted by its probability. If the sample space
of X consists of the set of the k numerical outcomes {x1, x2, . . . , xk}, then
E(X) = x1 P(x1) + x2 P(x2) + . . . + xk P(xk)
or in summation notation,
E(X) = Σ xi P(xi)

According to this definition, the expected value of a random variable is its population mean.

Expected Value Definition of the Mean: The mean of a random variable X is
μ = E(X)

For example, the mean number of technicians tied up at meetings is calculated as follows:
μ = E(Meet) = Σ Meet · P(Meet)
= (0)(0.2) + (1)(0.1) + (2)(0.1) + (3)(0.1) + (4)(0.5) = 2.6
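The same probability-weighted sum takes one line of code. Below is a minimal Python sketch of the expected value calculation for the Meet distribution (our own illustration of the spreadsheet computation in Figure 6.18).

    meet_values = [0, 1, 2, 3, 4]
    p_meet = [0.2, 0.1, 0.1, 0.1, 0.5]

    # Expected value: each outcome weighted by its probability
    expected_meet = sum(x * p for x, p in zip(meet_values, p_meet))
    print(f"E(Meet) = {expected_meet:.1f}")   # 2.6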

Did you notice that μ is 2.6 rather than 2.0, the arithmetic mean of the five possible
outcomes 0, 1, 2, 3, and 4? As in the management course grade example, an unweighted average
only applies when all outcomes have the same weight. That is clearly not true for the probability
weights here. The probability is much greater that all four technicians are in meetings (0.5) than
that none are (0.2). The greater probability weight for the Meet = 4 outcome raises the mean
substantially above 2.0.
If several outcomes are possible, it is usually easier to do these computations for the mean
with spreadsheet software such as Excel. For our staff meeting case, we start by typing the five
possible outcomes of Meet in one column of a spreadsheet and the corresponding probabilities
into another column (like the one labeled P(Meet) in Figure 6.18). Next,
activate the Formula Bar by typing an equal sign in a new column and type (or click the mouse)
in cell locations to multiply outcomes and their corresponding probabilities. Finally, use the
AutoSum button (Σ) on the toolbar to sum this column. [10]


[10] If you are not a regular user of spreadsheet software, check a software manual or the Help menu to familiarize yourself with the formula bar, the toolbar, and basic data entry.
Meet P(Meet) Meet x P(Meet)
0 x 0.2 = 0.0
1 x 0.1 = 0.1
2 x 0.1 = 0.2
3 x 0.1 = 0.3
4 x 0.5 = 2.0
E(Meet) = Mean = 2.6
Figure 6.18

As we will see shortly, spreadsheets can be even more convenient in finding the variance and
standard deviation of discrete random variables.
The word expected does not indicate what we expect to happen nor even the most
likely outcome. The mode is most likely. In this case, the mode is 4.0 and not 2.6. In fact, as
we learned back in Chapter 3, the mean may not be in the sample space at all. For example, it is
not possible for 2.6 technicians to be in meetings. Instead, expected value refers only to the
importance given to outcomes in proportion to their probability. The expected value has its greatest
usefulness for large samples, and smaller samples may have means quite different from the
population mean. Small sample means can be especially different if the probability distribution
includes extreme value outcomes that occur very rarely. State lotteries are one example of this.

Chapter Case #3: You Gotta Play If You Want to Win
Despite demands for better schools and more prisons, little popular support exists for tax
hikes to fund these government programs. To prevent budget deficits, most states have recently
legalized gambling such as lottery drawings. A typical version is the weekly lottery (described
in Chapter 5) which sells dollar tickets with six different numbers from 1 through 49. Typical
prizes range from $4 million for successfully matching all six numbers drawn to only $5 for
matching any three numbers.
These prizes and the approximate probabilities of holding each winning ticket are shown
in spreadsheet form (see Figure 6.19) along with weighted average computations. Notice that
larger prizes have progressively lower probabilities. [11]


[11] The probabilities are calculated by extending the rules developed earlier in this chapter. For example, the chance of winning the grand prize is so small because it involves a joint probability of one of your six numbers matching the first number drawn (6/49), one of your remaining five matching the second number drawn (5/48), and so on. The product of these six event probabilities is only one chance in 14 million!
Numbers Matched on Lottery Ticket    Prize         P(Prize)       Prize x P(Prize)
Match All 6 Numbers                  $4,000,000    0.00000007     $0.28
Match 5 of 6 Numbers                 $1,500        0.00002        $0.03
Match 4 of 6 Numbers                 $70           0.001          $0.07
Match 3 of 6 Numbers                 $5            0.02           $0.10
2 or Fewer Matches                   -$1           0.98           -$0.98
                                     E(Prize) = Mean = -$0.50
Figure 6.19

Notice also that the most commonly occurring prize is the loss of $1, the price of a lottery ticket.
This lost dollar therefore has a negative sign in the Prize column of Figure 6.19. The expected
value of $ -0.50 for playing the
lottery is also negative, because the largest probability weight (nearly 0.98) is for losing your
dollar.
Why do people buy lottery tickets if the mean payoff for each ticket is a negative 50 cents? After all, many investments have positive expected payoffs, such as stocks, bonds, or savings accounts. Perhaps lottery players are compulsive gamblers. Maybe they are swayed by deceptive ad campaigns that glamorize the lottery. Some may even play because states use lottery profits to support popular programs such as education. However, the primary reason lotteries attract so many customers is that the 50-cent expected loss only applies to someone buying millions of tickets over a lifetime. Remember that people lose a dollar on nearly every ticket. For most people, only the occasional $5 and even rarer $70 ticket matches raise their mean loss slightly above -$1. Extremely small probabilities for the top prizes are deliberately set to assure only a few big weekly winners who wind up with large, positive lifetime payoff averages. For example, buying ten tickets each week for ten years (about 5,000 tickets) will leave most people at least $4,000 poorer. Some will have matched 5 of 6 numbers for a $1,500 prize, leaving them down at least $2,500. However, the hundreds of grand prize winners over those ten years would average approximately $4 million / 5,000, or about +$800 payoff per ticket.

12 The top prize is usually advertised as even larger only because multiple winners split up the prize money and the winnings are paid over many years while the state pockets the interest on the money.
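The odds quoted in footnote 11 and the expected value in Figure 6.19 can both be reproduced with a short calculation. Here is a brief Python sketch, using the approximate prizes and probabilities from the figure; the $1 ticket price is entered as a "prize" of -$1.

    from math import prod

    # Chance of matching all six of the 49 numbers (footnote 11):
    # the product of the six conditional matching probabilities.
    p_grand = prod((6 - k) / (49 - k) for k in range(6))
    print(round(1 / p_grand))        # about 1 chance in 13,983,816

    # Expected prize per ticket: weight each payoff by its probability
    # (approximate probabilities from Figure 6.19).
    prizes = [(4_000_000, 0.00000007), (1_500, 0.00002),
              (70, 0.001), (5, 0.02), (-1, 0.98)]
    print(round(sum(m * p for m, p in prizes), 2))   # about -0.50 dollars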

Standard Deviation and Risk-Return Analysis*
We may also find a measure of variability from a probability-weighted average. To find the variance and standard deviation of the random variable, we focus on its squared deviations from the mean.


                                     Deviations   Squared                     Weighted
Meet   P(Meet)   Meet x P(Meet)      from Mean    Deviations    P(Meet)       Squared Deviations
 0   x   0.2   =   0.0                 -2.6          6.76     x   0.2   =       1.352
 1   x   0.1   =   0.1                 -1.6          2.56     x   0.1   =       0.256
 2   x   0.1   =   0.2                 -0.6          0.36     x   0.1   =       0.036
 3   x   0.1   =   0.3                  0.4          0.16     x   0.1   =       0.016
 4   x   0.5   =   2.0                  1.4          1.96     x   0.5   =       0.980
       E(Meet) = Mean = 2.6                                     Variance   =    2.640
                                                      Standard Deviation  =    1.625
Figure 6.20

Expected Value Definitions of σ² and σ: The variance σ² for a discrete random variable X is
     σ² = E[(X - μ)²]
and the standard deviation σ is the square root of the variance.

Once again, expected values are calculated as a weighted average, using the probabilities P(X) as the weights. To determine the variance, compute the weighted average
     σ² = Σᵢ (xᵢ - μ)² P(xᵢ)
If its value is not already known, μ may first be found by the expected value formula.
The expression (xᵢ - μ)² was also used to define the population variance in Chapter 3. Before, however, we multiplied these squared deviations by 1/N, the relative frequency of each observation in the population. As we discovered earlier, P(X) contains the proper weights for working with probability distribution information.
As we also explained in Chapter 3, the square root restores the squared deviations to the same scale and units of measurement as the mean. Thus, we may directly compare the sizes of μ and σ. Variance measures will be used extensively in later chapters when we evaluate regression and analysis of variance models.
We may examine the intermediate calculations of σ² most easily using a spreadsheet (see Figure 6.20). For our service technicians case example, the Deviations from Mean column contains Meet minus 2.6, the mean we found earlier, for each of the five possible outcomes. The next column contains the Squared Deviations, which are then multiplied by the probability weights. The sum of these Weighted Squared Deviations is the variance, 2.64, and its square root (1.625) is the standard deviation. Meet has a moderate amount of variability, since the coefficient of variation is 100 (1.625) / (2.6), or a CV of about 62.5 percent.
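The same weighted-average arithmetic extends directly to the variance and standard deviation. A short Python sketch for the Meet distribution, offered only as an illustration outside the spreadsheet:

    # Variance, standard deviation, and coefficient of variation of Meet,
    # following the squared-deviation calculation in Figure 6.20.
    meet_dist = {0: 0.2, 1: 0.1, 2: 0.1, 3: 0.1, 4: 0.5}
    mu = sum(x * p for x, p in meet_dist.items())                # 2.6
    var = sum((x - mu) ** 2 * p for x, p in meet_dist.items())   # 2.64
    sd = var ** 0.5                                              # about 1.625
    cv = 100 * sd / mu                                           # about 62.5 percent
    print(round(var, 2), round(sd, 3), round(cv, 1))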
Numbers                                 Prize x           Deviations              Squared                            Weighted
Matched          Prize      P(Prize)    P(Prize)          from Mean               Deviations           P(Prize)      Squared Dev
Match All 6   $ 4,000,000 x 0.00000007 = $  0.28        $ 4,000,000.5   16,000,004,000,000.25  x  0.00000007  =  1,120,000.3
Match 5 of 6  $     1,500 x 0.00002    = $  0.03        $     1,500.5            2,251,500.25  x  0.00002     =         45.0
Match 4 of 6  $        70 x 0.001      = $  0.07        $        70.5                4,970.25  x  0.001       =          5.0
Match 3 of 6  $         5 x 0.02       = $  0.10        $         5.5                   30.25  x  0.02        =          0.6
2 or Fewer    $        -1 x 0.98       = $ -0.98        $        -0.5                    0.25  x  0.98        =          0.2
                        E(Prize) = Mean = $ -0.50                                                 Variance    =  1,120,051.1
                                                                                        Standard Deviation    =  $ 1,058
Figure 6.21

By contrast, the lottery case has a relatively large standard deviation ($1,058) because the
two largest prizes have enormous deviations from the mean (see Figure 6.21). The thousand
dollar standard deviation reflects the possibility of huge winnings that attract lottery ticket
buyers.

Chapter Case #4: Risky Business
Standard deviation is often used in finance to measure the risk associated with different investments. Investors can make rational choices by comparing the mean rates of return and the levels of risk of each asset. For example, suppose a couple goes to an investment counselor. The counselor narrows the choices to two alternatives for the couple's long-term financial goals: (1) a blue-chip stock fund, and (2) a piece of undeveloped land near the expanding western outskirts of town. Based on past performance of the stock market and similar real estate, the investment counselor presents them with the distributions of annual returns in Figure 6.22.
Blue Chip Stock Fund          Undeveloped Land
Return      P(Stock)          Return      P(Land)
  50 %        0.20              40 %        0.20
  20          0.30              25          0.20
   0          0.30              10          0.25
 -20          0.20             -10          0.35
Figure 6.22
Blue Chip Stock Fund
Return                   Return x      Deviations     Squared                     Weighted
Annual %    P(Stock)     P(Stock)      from Mean      Deviations    P(Stock)      Squared Dev
  50      x   0.20    =    10             38             1444     x   0.20    =     288.8
  20      x   0.30    =     6              8               64     x   0.30    =      19.2
   0      x   0.30    =     0            -12              144     x   0.30    =      43.2
 -20      x   0.20    =    -4            -32             1024     x   0.20    =     204.8
          E(Stock) = Mean = 12                                       Variance  =     556.0
                                                            Standard Deviation =      23.6

Undeveloped Land
Return                   Return x      Deviations     Squared                     Weighted
Annual %    P(Land)      P(Land)       from Mean      Deviations    P(Land)       Squared Dev
  40      x   0.20    =   8.00            28              784     x   0.20    =     156.8
  25      x   0.20    =   5.00            13              169     x   0.20    =      33.8
  10      x   0.25    =   2.50            -2                4     x   0.25    =       1.0
 -10      x   0.35    =  -3.50           -22              484     x   0.35    =     169.4
          E(Land) = Mean = 12.00                                     Variance  =     361.0
                                                            Standard Deviation =      19.0
Figure 6.23

Obviously, neither investment is a sure thing. Each may lead to higher, lower, or even negative
returns.
Figure 6.23 contains the spreadsheet work for each investment. The stock fund has a 12 percent mean return but a level of risk of σ = 23.6. Because the couple can achieve the same 12 percent return at a somewhat lower risk (σ = 19.0), they select the land investment.
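The counselor's figures can be verified with the same two formulas. A brief Python sketch over the two distributions in Figure 6.22, included only as an illustration:

    # Mean return and risk (standard deviation) for each investment,
    # reproducing the spreadsheet work in Figure 6.23.
    def mean_sd(dist):
        # dist maps each annual percentage return to its probability
        mu = sum(r * p for r, p in dist.items())
        var = sum((r - mu) ** 2 * p for r, p in dist.items())
        return round(mu, 1), round(var ** 0.5, 1)

    stock = {50: 0.20, 20: 0.30, 0: 0.30, -20: 0.20}
    land = {40: 0.20, 25: 0.20, 10: 0.25, -10: 0.35}
    print(mean_sd(stock))   # (12.0, 23.6)
    print(mean_sd(land))    # (12.0, 19.0)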
Newcomers to investing often expect to find risk-free assets with high returns. Perhaps their friends are always bragging about risky investments that paid off. However, high average return usually comes at a price: shouldering higher risk. This tradeoff between risk and return is a natural result of open trading in asset markets. In our example, the risk-return relationship on land and stocks would not persist for very long. Other investors would choose land over stock, and the depressed stock prices would drive up the effective return on stocks relative to the newly inflated land prices. Although stocks would remain riskier, their higher yield would attract investors.
In practice, reputable investment counselors guide new investors into balanced
portfolios composed of riskier and safer assets tailored to the needs and tax circumstances of
each client. By holding a well-managed, diversified portfolio, investors can obtain the proper
balance of liquidity, risk, and return. A carefully designed portfolio can even reduce overall risk
below that of the safest asset in that portfolio. This seemingly absurd result is possible because
some assets enjoy their highest yield when other assets are doing their worst. For example, auto
stocks do well when the economy is booming, but these stocks plummet when recession curtails
auto sales. By contrast, new home sales are often weakened by the high mortgage interest rates
of economic boom periods, but sales usually recover when a slack economy makes more
mortgage funds available. Although auto and home construction industry stocks may have high
risk individually, a portfolio containing both may produce high yet dependable overall return
throughout the business cycle.
We now know enough about probability to introduce many of the fundamental concepts and methods in business statistics. With knowledge of the proper distribution, we will incorporate probability into our sample inferences about unknown populations. Although we can seldom be certain of our estimates, we will use probability concepts to quantify our level of certainty. That knowledge is often enough to give us a competitive edge in an uncertain world.

Multiple Choice Review:
6.64 If X is a random variable with sample space {2, 6} and P(X=2) = 0.5, P(X=6) = 0.5, then
4 equals
a. the mean
b. the variance
c. the standard deviation
d. both a and b
e. both a and c

6.65 In Tampa, the probability distribution for the random variable X measuring the price
charged for weekday video tape rental is found to be:

Price Charged P(X)
$ 1.50 0.10
2.00 0.60
2.50 0.10
3.00 0.20

For a randomly selected video store, the probability P($1.00<x<$2.25) is
a. 0.10
b. 0.20
c. 0.60
d. 0.70
e. 0.80

Calculator Problems:
6.66 Suppose the number of mergers exceeding $1 billion in assets for the acquired company occurring each year in the U.S. is determined to have the following distribution: P(0) = 0.05, P(1) = 0.05, P(2) = 0.2, P(3) = 0.2, P(4) = 0.2, P(5) = 0.2, P(6) = 0.05, P(7) = 0.05, with all other outcome values having 0 probability.
(a) What is the probability of at least 4 of these major mergers this year?
(b) Find μ and σ by using the calculator. Is the mean the same as the median in this case? Explain how you found the median to make this comparison. Where is the mode, or is there one?
(c) Now calculate on a calculator [(0 + 1 + 6 + 7) + 4(2 + 3 + 4 + 5)] / 20. Explain why these calculations also yield the mean for this example.

6.67 Suppose the probability distribution for the random variable X measuring the price
charged for weekday video tape rental is found to be:
Price Charged P(X)
$ 0.99 0.10
1.50 0.15
2.00 0.60
2.25 0.10
2.50 0.05
Determine the following probability for prices charged at a randomly selected video
store:
(a) P(x > 1)
(b) P(x ≤ 2)
(c) P(1 < x < 2)
(d) P(2 ≤ x ≤ 3)

6.68 A magazine subscription service offers a set of cash prizes available if you send back the reply envelope. Although there is no obligation to subscribe to anything, you have to pay the postage for the return letter. By Federal law, the service is required to inform you of the probabilities of winning each prize.
the probabilities of winning each prize.
(a) Type the following probability distribution into two columns of Minitab, one column
for the prize outcomes and a second for the corresponding probabilities. Remember not
to type the commas in numbers like 25,000. Create and print the product column, and
decide whether the expected value exceeds the $0.25 price of postage.

Prize Probability of Prize
$10,000,000 .00000001
25,000 .000001
100 .0005
0 .99949899

(b) Which prizes contribute the most to the expected value?
(c) Why wasn't it necessary to type the fourth row of the distribution to calculate
expected value?
(d) Explain why we could have come up with a decision by changing the fourth "prize"
value from zero to minus .25.
(e) Why might a person still decide to return the envelope even if the expected value is less than 25 cents? Would you be tempted to return it?
(f) Determine the standard deviation from the variance computed by expected values. Explain why σ is larger than μ.


6.4 Principles of Decision Theory
Earlier, we introduced the concept of statistical independence. We learned that a simple
product rule applies to joint probabilities of independent events. In many business decisions,
however, events are not independent. This simple product rule then became a special case of the
more general product rule involving both the conditional and marginal probabilities.
General Product Rules for Two Events: The joint probability of two events is the marginal probability of one event multiplied by the conditional probability of the other event given the first event. The joint probability for two events A and B, P(A and B), is
     P(A and B) = P(A)P(B|A)

Since the joint probabilities P(A and B) and P(B and A) are the same, the product rule may equivalently be stated:
     P(A and B) = P(B)P(A|B)
The conditional probability is required when the chance of one event occurring depends on whether the other event occurs. Consider, for example, the lack of independence associated with job advancement. Several restaurant chains have been found in recent years to favor their white employees for promotion. Women have also encountered a "glass ceiling" blocking their promotion. Suppose the marginal probabilities are P(Male) = 0.50 and P(Promotion) = 0.20. As we learned earlier in this chapter, joint probability is represented in Venn diagrams by areas common to both events. In this case, suppose the size of the overlapping area P(Male and Promotion) is 0.15. The events are not independent because the joint probability does not equal the product of the marginals:
     P(Male and Promotion) = 0.15  ≠  P(Male)P(Promotion) = (0.50)(0.20) = 0.10
Switching to the general product rule instead, we must replace one of the marginal probabilities by its conditional probability counterpart. Let's use the following application of the formula to deal with our example:
     P(Male and Promotion) = P(Male)P(Promotion|Male)
The conditional probability P(B|A) is the probability of event B if we are assured that event A occurs as well. The conditional probability is therefore equal to the proportion of area A taken up by event B in the diagram. This fraction is the overlapping area divided by the total area of A.
Conditional probability may be expressed as the ratio of a joint probability to a marginal probability in the following way:
     P(B|A) = P(A and B) / P(A)
This result is equivalent to the formula for joint probability given at the beginning of this section. Multiply both sides by P(A) to convince yourself that this is the same as the general product rule. The conditional probability in our example is therefore
     P(Promotion|Male) = P(Male and Promotion) / P(Male) = (0.15) / (0.50) = 0.30
By contrast, female employees have only a 10 percent chance, one-third that of males, of obtaining a promotion. We can verify this conclusion by first noticing that
     P(Female and Promotion) = 0.05
the remaining probability of promotions not going to males. Therefore,
     P(Promotion|Female) = P(Female and Promotion) / P(Female) = (0.05) / (0.50) = 0.10
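The promotion example can be traced step by step in a few lines. A small Python sketch of the ratios used above, with the values taken from the text:

    # Conditional probability of promotion given gender, computed as the
    # ratio of a joint probability to the corresponding marginal probability.
    p_male = 0.50
    p_promo = 0.20
    p_male_and_promo = 0.15

    p_promo_given_male = p_male_and_promo / p_male              # 0.30
    p_female_and_promo = p_promo - p_male_and_promo             # 0.05
    p_promo_given_female = p_female_and_promo / (1 - p_male)    # 0.10
    print(round(p_promo_given_male, 2), round(p_promo_given_female, 2))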

Lack of independence may result from other factors besides discrimination, of course.
For example, oversleeping alters your chances of being on time for work or for a morning class.
Your boss, professor, and alarm clock are not discriminating against you, yet the events are still
not independent. Similarly, the probability of workplace errors will be greater immediately
following a change in the workplace such as bringing in newly trained workers, reassigning
supervisors, installing new computers, or revising inventory reporting procedures.
Marketing researchers know that the probability of a consumer purchasing any product (a sports car, high-potency vitamins, an exercise treadmill, an Xbox video game, a diet cola, an ocean cruise ticket, or a jet ski) is not independent of demographic factors like age, gender, race, income, region, and education. A higher probability of buying a product is also associated with previous purchases of related products.
To take advantage of this lack of independence, sales information from inventory tracking software helps companies know how much of each product to stock at each regional store during each season of the year and how to market it. WalMart even determined which items sell best when a hurricane is about to hit an area: not only the obvious products like bottled water, flashlights, generators, and tarps, but also unexpected items like Strawberry Pop-Tarts and other foods that require no preparation or refrigeration.

Chapter Case #2: Let's Order Out
Decision Theory
As a final application of probability, we briefly explore the field of decision theory. The buck stops at the manager's desk. In Chapter 2, we summarized the procedure for decision makers based on statistical analysis, data collection, and a clear problem statement. But even
with knowledge of all relevant probabilities, uncertainty remains. How does the decision maker
process information and what does an informed decision-making process look like?
DEFINITION: Events that may affect the success of a decision but are beyond the control
of the decision maker are called states of nature.
Important states of nature could be future weather conditions for a pilot or construction
contractor. Or they could be tomorrow morning's rush hour traffic flow for a taxi cab dispatcher,
trucking company, or parade organizer. They could be the preferences of next year's television
viewers, toy buyers, or perfume wearers for businesses that design these products. Which state
of nature will occur is not known at the time a decision must be made. Therefore, we will have
to rely on their probabilities and their effects on the outcome of each decision.
Actions, on the other hand, lie within our control.
DEFINITION: The alternatives available to decision makers are called actions.
Decision makers must decide on actions for each situation discussed in the preceding paragraph. For example, which route and at what altitude should the pilot fly? To what completion schedule should the contractor commit in contract bidding? Should the toy manufacturer expand its computer game or action figure lines?
The final feature of decision analysis is a set of numbers called payoffs.
DEFINITION: A payoff is the value of a random variable of interest to the decision
maker that results from an action and occurrence of a particular state of nature.
Payoffs of concern to pilots may be flight time or number of passenger complaints about a
bumpy ride. For the contractor, the payoff may be the cost of overtime pay and late charges for
exceeding the scheduled completion date. For the toy company and perfume manufacturer, the
payoff may be revenues, units sold, or profits. Payoff for the television networks may be sweeps
month ratings, which directly translate into advertising rates that can be charged.
We now examine a decision problem containing all these features. A few years ago, a
new addition to the pizza delivery wars was the oversized, rectangular pizza. Little Caesar's was
the first onto the market with its "Big, Big Cheese," followed by Pizza Hut's "Bigfoot" and
Domino's "The Dominator". Americans spend about $20 billion a year on pizza. Domino's is
the market leader in delivered pizza but must always be aware of changing tastes and customer
needs. Recent success of two-for-one price discounts has indicated that customers are
consuming larger amounts per order. Let's go back to the beginning of 1993 when Domino's was
struggling with the decision of what to do about the newly emerging oversized-pizza market.

The following are the three alternative actions considered by Domino's:
(1) Continue producing only small to extra-large round pizza sizes (the Round action)
and hope oversized pizzas are a short-lived fad or fade to a small portion of the
market.
(2) Expand its line with a copy of the other 1-by-2 foot, rectangular pizzas currently
on the market (the Copy action).
(3) Heavily promote an oversize that is 20 percent larger than others on the market
(the Big action).
Three states of nature are considered possible. Each is an alternative market size for oversized
pizzas: Major (one-third of the market), Modest (one-fifth of the market), and Weak (one-tenth
of the market). There are three actions and three states of nature that affect action payoffs.
Thus, there are nine resulting profit payoffs (in millions) for Domino's.
The payoffs from the Round action indicate that this decision does best if oversized
pizzas are poor sellers. Profits of $300 million result because no expenses are incurred to design,
launch, and promote a new product line. At the other extreme, a loss of $200 million is a logical
consequence if oversized pizzas really catch on among customers and Domino's doesn't have
one. An ad campaign emphasized that their rivals do not offer cheeseburger topping or clear
cola. Similar analysis of payoff patterns reveals that the Copy action yields its greatest payoff
($200 million) under Modest market conditions. The catch-up penalty for entering the market
late leads to lower profits ($100 million) if oversized pizzas are very popular, and a weak market
also hurts profits due to the wasted development and advertising costs of a product line that
doesn't carry its own weight. Finally, Domino's will make enormous profits ($500 million) from
a Big action if the market is Major but profits disappear if the market is Weak.
However, each of these strategies ignores payoff probabilities. The Big action looks extremely attractive if we can virtually guarantee a major market share for oversized pizzas. On the other hand, Round would be the odds-on favorite if oversized pizzas are strongly suspected to be a passing fad. Expected value is the most common way to weight payoffs by their probability. Expected values for each action may be calculated using the formulas presented earlier in this chapter. All we need to do is change over to decision theory notation.


14 Strategic responses by Domino's rivals will not be considered here. For a readable account of game theory, which has revolutionized economics, see Dixit and Nalebuff, Thinking Strategically: The Competitive Edge in Business, Politics, and Everyday Life, New York: Norton, 1991.

When payoffs are measured in dollar values (as they so often are in business), expected values are called expected monetary values, or EMVs.
Expected Monetary Value of an Action: For an action a subject to the n possible states of nature s₁, s₂, ..., sₙ having probabilities P(s₁), P(s₂), ..., P(sₙ) and resulting in n payoffs M₁, M₂, ..., Mₙ, the expected monetary value of action a, EMV(a), is
     EMV(a) = M₁P(s₁) + M₂P(s₂) + ... + MₙP(sₙ) = Σᵢ MᵢP(sᵢ)

Decision makers can utilize historical frequency data and expert judgments to estimate
probabilities for each state of nature. The success rates of deep dish and double topping pizzas,
for example, can assist Domino's in guessing probabilities for the latest product line. Suppose
Domino's estimates the following probabilities for the oversize pizza market.
P(Major) = 0.2
P(Modest) = 0.3
P(Weak) = 0.5

By comparing these EMVs, a probability-based EMV strategy may be selected.

DEFINITION: The EMV strategy is to select the action having the maximum expected
monetary payoff.
From the payoff matrix, this strategy for Domino's is Big, since its EMV ($160 million) exceeds
the EMVs ($140 and $130 million) for the other two actions under consideration.
DEFINITION: A payoff matrix is a table of payoffs for each action and state of nature
combination.
The payoff matrix for Domino's is the following:



                    State of Nature
Action          Major      Modest      Weak
Round            -200         100       300
Copy              100         200       100
Big               500         200         0


The expected monetary values may be calculated for each row of payoffs in the table. For
example, the EMV for the first row is $140 million by the following calculations:

EMV = (-200)(0.2) +(100)(0.3) + (300)(0.5) = -40 + 30 + 150 = 140
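The other two EMVs follow the same pattern. A brief Python sketch of all three weighted rows (payoffs in millions of dollars, probabilities as given above):

    # Expected monetary value of each Domino's action: weight each row of
    # the payoff matrix by the state-of-nature probabilities and sum.
    probs = {"Major": 0.2, "Modest": 0.3, "Weak": 0.5}
    payoffs = {
        "Round": {"Major": -200, "Modest": 100, "Weak": 300},
        "Copy":  {"Major":  100, "Modest": 200, "Weak": 100},
        "Big":   {"Major":  500, "Modest": 200, "Weak":   0},
    }
    emv = {a: round(sum(payoffs[a][s] * probs[s] for s in probs), 1)
           for a in payoffs}
    print(emv)                     # Round 140, Copy 130, Big 160
    print(max(emv, key=emv.get))   # the EMV strategy: Big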
Earlier in this chapter, we learned that expected value measures the mean for a very large sample. Domino's may not be able to wait long enough to benefit from choosing expected value strategies. A few disastrous decisions could put the company in the red, lose its market share lead, or even threaten bankruptcy. Many of the largest retailing chains, including Southland (parent of 7-11 convenience stores), Sears, Ames, Hills, Montgomery Ward, and Channel, have suffered severely from one or two ill-fated decisions.
Even if their unwieldy bureaucracies would allow it, established market leaders seldom pioneer new products and technologies. Instead, leading firms generally copy and co-opt once success is demonstrated in the industry. Thus, the low-risk strategy for Domino's is to Copy and use its vast network of 5,000 stores to increase market share. Expected profits are somewhat less, only $130 million, but Domino's avoids the downside risks (zero profits and $200 million in losses) of the other two actions. The payoff variances (using methods developed earlier in this chapter) verify that the Copy action has the smallest standard deviation.
Wouldn't it be wonderful to know before you make a decision which state of nature will
occur? What would this "perfect information" be worth? We can calculate this.
The value of perfect information is the difference between the EMV of perfect
information and the EMV of the EMV strategy.
Referring back to Domino's payoff matrix, we discover that perfect information would allow
them to choose Big and pocket $500 million if Major will occur. If Modest is going to occur,
they can select either Copy or Big to obtain $200 million. The $300 million from choosing
Round is assured if they learn that the market will be Weak. Weighting these best possible
payoffs by the state of nature probabilities, we calculate the EMV under perfect information to
be the following:
(500)(0.2) + (200)(0.3) + (300)(0.5) = 100 + 60 + 150 = 310.
We found earlier that the Big action has an EMV of $160 million. Thus, perfect information has an expected value for Domino's of $310 million, or $150 million more than expected profits under the EMV strategy. If Domino's employs an EMV strategy for its decisions, it should therefore be willing to pay no more than $150 million for a perfectly dependable market survey.
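The same figures can be checked in a couple of lines once the best payoff under each state is read off the matrix; a minimal Python sketch:

    # Expected value of perfect information: the probability-weighted best
    # payoff under each state, minus the EMV (160) of the EMV strategy (Big).
    probs = {"Major": 0.2, "Modest": 0.3, "Weak": 0.5}
    best_payoff = {"Major": 500, "Modest": 200, "Weak": 300}
    emv_perfect = sum(best_payoff[s] * probs[s] for s in probs)
    print(round(emv_perfect), round(emv_perfect - 160))   # 310 and 150 ($ millions)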


Multiple Choice Review:
6.69 A manager may not select the investment alternative with the highest expected profit if
another investment
a. has a lower standard deviation and the manager is averse to risk
b. has a higher standard deviation and the manager is averse to risk
c. has a lower standard deviation and the manager is risk loving
d. both a and c are possible explanations of the manager's behavior
e. both b and c are possible explanations of the manager's behavior

The questions that follow are related to the following decision making problem: A manufacturer
has to decide whether to replace (R), fix (F), or ignore (I) its aging factory equipment this year.
If R is chosen, there is a 0.8 chance of no production stoppages (NO) and a 0.2 chance of minor
stoppages (MIN). If F is selected, on the other hand, P(NO) falls to 0.5 and P(MIN) increases to
0.5. If I is chosen by the manufacturer, P(MIN) = 0.6 and there is now a 0.4 chance for major
stoppages (MAJ). Because of the higher costs associated with replacing equipment, profits from
choice R will only be 15 if NO occurs and 5 if MIN results. For choice F, a NO outcome yields
profits of 18 and MIN results in profits of 8. In the case of choice I, MIN produces profits of 20
but MAJ cause profits of 0.

6.70 The number of decision forks faced by this manufacturer is
a. 0
b. 1
c. 2
d. 3
e. 4

6.71 The number of chance forks faced by this manufacturer is
a. 0
b. 1
c. 2
d. 3
e. 4

6.72 The greatest expected profits for this manufacturer are derived by choosing
a. R
b. F
c. I
d. either a or b
e. either b or c

6.73 The expected profits from R and the expected profits from I, respectively, are
a. 7 and 8
b. 7 and 12
c. 13 and 8
d. 13 and 12
e. none of the above

6.74 In problems using Bayes' theorem, we
a. always seek to determine a conditional probability as our answer
b. assume that outcomes are each statistically independent
c. look for keywords such as "who", "what", and "how" to determine whether we are dealing with marginal probabilities
d. all of the above
e. none of the above

6.75 In problems using Bayes' theorem, we
a. always seek to determine a conditional probability as our answer
b. assume that events are not statistically independent
c. look for the presence of keywords such as "given that", "if", and "when" to identify
conditional probabilities
d. all of the above
e. none of the above

Use the information from the following situation to answer questions below:
A study of past shuttle launch attempts reveals that the probability of a launch (L) taking place was 40 percent if a heavy cloud cover (HC) was forecast, 60 percent if a light cloud cover (LC)
was forecast, and 80 percent if no clouds (NC) were forecast. A survey of weather forecasts for
the Cape informs us that clear days are forecast three-quarters of the time and light clouds are
forecast 20 percent of the time. Assume that there are only three kinds of forecasts: HC, LC, and
NC.
6.76 The probability P(HC) is
a. 0 percent
b. 5 percent
c. 30 percent
d. 37.5 percent
e. 55 percent

6.77 The 40, 60, and 80 percent probabilities given in the problem are
a. marginal probabilities
b. conditional probabilities
c. joint probabilities
d. Bayesian probabilities
e. all of the above

6.78 Using Bayes' theorem, we may use the shuttle launch and weather information to solve
for
a. P(HCL)
b. P(LCL)
c. P(NCL)
d. all of the above
e. none of the above

6.79 The denominator 0.74 in Bayes' formula from the launch and weather probabilities is
calculated from the following sum:
a. 0.01 + 0.24 + 0.49
b. 0.02 + 0.12 + 0.60
c. 0.04 + 0.08 + 0.62
d. 0.12 + 0.24 + 0.38
e. 0.20 + 0.30 + 0.24

6.80 The probability that heavy clouds were forecast if a launch is known to have occurred
that day is approximately
a. 22 percent
b. 12 percent
c. 7 percent
d. 3 percent
e. 1 percent


Calculator Problems:
6.81 Use the general multiplication rule to calculate the joint probability P(A and B) in each of the following cases:
(a) P(A) = 0.60, P(B|A) = 0.80
(b) P(A) = 0.10, P(B|A) = 0.20
(c) P(B) = 0.90, P(A|B) = 0.10
(d) P(B) = 0.60, P(B|A) = 0.80, P(A|B) = 0.30 [Hint: use only two of these]


6.82 A company hires only high school and college degree holders. P(promotion | high school degree) = 0.10 and P(promotion | college degree) = 0.35. Overall, P(college) = 0.20 among workers up for promotion.
(b) Set up and solve Bayes' formula to find P(college degree | promotion).
(c) Set up and solve Bayes' formula for P(high school | no promotion).

6.83 Suppose that in the auto sales example from the text, the auto maker only distinguished between High sales (sales greater than 2 million cars) and NotHigh sales (sales less than 2 million). If historical data indicate P(High) = 0.50, P(Recession | High) = 0.10, and P(Recession | NotHigh) = 0.40, calculate the probability P(High | Recession). [Hint: don't forget to first find the complementary probability P(NotHigh).] Also find the probability P(NotHigh | Recession).

6.84 For the auto sales example, why shouldn't we expect the conditional probabilities
provided in the auto maker problem, 0.10, 0.20, and 0.45, to sum up to 1.0? [Hint: Notice that
the denominator measures P(Recession) and was found to be 0.20; thus, recessions constituted
only 20 percent of the years studied.]

6.85 For the Bayes' formula pizza example in the text, calculate the conditional
probabilities that a randomly selected round pizza delivery will come from Domino's. From
Pizza Hut. From Little Caesar's.

6.86 Consider the following payoff matrix associated with whether to proceed with the
building of a new ocean-front hotel in Miami, under various possible future gas price levels.


                     Future Gas Prices
Construction
Action          High      Moderate      Low
Build           -400         200        900
Postpone        -100           0        200
Cancel           300         100          0

Suppose the probabilities are P(High) = 0.20, P(Moderate) = 0.60, and P(Low) = 0.20.
(c) What is the EMV of deciding to Build?
(d) What is the EMV strategy?

6.87 Suppose new international petroleum market instability causes the builder to revise the probabilities to P(High) = 0.50, P(Moderate) = 0.30, and P(Low) = 0.20. Determine the new EMV strategy.

Section Review Summary
The joint probability of any two events A and B is P(A and B) = P(A)P(B|A) = P(B)P(A|B). Conditional probability may be expressed as P(B|A) = P(A and B)/P(A). Bayes' formula is derived from these rules, and relates the conditional probability for one of n mutually exclusive and exhaustive events to marginal and conditional probabilities involving those events.
A payoff matrix is a table of payoffs for each action and state of nature combination.
The EMV strategy is to select the action having the maximum expected monetary payoff.
The value of perfect information is the difference between the EMV of perfect
information and the EMV of the EMV strategy.

Section Review Exercises
6.95 A business school finds that 20 percent of its graduates received an "A" in statistics, but only 5 percent received an "A" among those who never graduated. If 80 percent of all students at the business school do graduate, use Bayes' formula to find the probability of a student graduating if he or she receives an "A" in statistics. Why is your answer so close to 100 percent?

6.96 The makers of wax paper are considering their advertising budget. New microwave oven uses will be promoted in ads for this venerable product. However, the producer is uncertain about how well the ads will be received. Suppose the state of nature probabilities are assumed to be P(Receptive) = 0.6 and P(Ignore) = 0.4. Answer the following based on the payoff matrix:


                  Consumer Attitudes
Advertising
Action          Receptive      Ignore
Heavy              120           -50
Moderate            60           -30
Low                 30           -10
None                 0             0

(c) What is the EMV of Heavy advertising? Of Moderate advertising?
(d) What is the EMV strategy?

6.97 Suppose that the state of nature probabilities from the previous exercise are assumed to be P(Receptive) = 0.3 and P(Ignore) = 0.7. Determine the EMV strategy.

6.98 Based on the payoff matrix and probabilities given in the preceding exercise.


Chapter Review Summary
An experiment is any process that generates distinct results, called outcomes.
Simulation duplicates, in a laboratory or with a computer, a real-world experiment we want to
better understand. The listing or description of all possible outcomes resulting from an
experiment is called the sample space. An event is a collection of one or more outcomes that is
meaningful to a decision maker.
Statistics relies on probability theory to quantify the amount of uncertainty we associate
with estimates and other inferences based on sample data. The probability of an event measures
the chance that event will occur when an experiment is conducted. Probability ranges from 0, for
an event with no chance of occurring, to 1, for an event which is certain to happen. An odds ratio
is converted to probability by expressing one number in the odds ratio as a percentage of the sum
of the two numbers. A Venn diagram is a rectangular-shaped diagram portraying event
probabilities by regions whose relative areas reflect the probability of each event. If events have
no overlapping areas in their Venn diagram, they are mutually exclusive. The probability of an
event is the fraction of diagram area it occupies.
According to the frequency definition of probability, the probability of an event is the
percentage of all outcomes in which that event occurs. The relative frequency of an event is r /
n, the ratio of its frequency r to the total number of observations n. For ongoing processes, the
probability of an event is the average rate at which that event is observed.
In the classical approach, probability is calculated analytically based on a detailed
understanding of the process by which experimental outcomes are generated. Subjective
probability is based on informed human judgments about the probability of an event.
Subjective probability expresses our personal degree of doubt or belief. Assigning a subjective
probability of nearly 1.0 reflects a very strong belief an event will occur, and a probability close
to zero indicates major doubt that the event will be observed. When historical data are available
or easily gathered, relative frequencies are generally used to estimate probabilities. For simple,
well-understood processes, classical probabilities may be preferable. But in new, complex, and
hard-to-quantify situations, asking experts for subjective probability is usually advisable.
For any event A, its complement Aᶜ consists of the set of all outcomes in the sample space that are not associated with A. A set of events is mutually exclusive if these events have no outcomes in common. A set of events is exhaustive if every outcome in the sample space is associated with at least one event in the set; any exhaustive set has a probability of 1.0. The set consisting of an event A and its complement Aᶜ is exhaustive. If A and B are mutually exclusive
events, the probability of either A or B (or both) occurring is the sum of the probabilities of each
event. If events are mutually exclusive, the probability of at least one of them occurring is the
sum of their individual probabilities. The sum of the probabilities for a set of mutually exclusive
and exhaustive events must sum to 1.0. If A and B are two events, the probability of either A or
B (or both) occurring is the sum of their separate probabilities minus their joint probability.
Marginal probability is the overall probability a single event occurs. The probability of
event A occurring given that event B occurs is the conditional probability, denoted P(A|B).
Joint probability is the probability of two or more events occurring together. The joint
probability of mutually exclusive events is zero. P(A and B) is the probability of events A and B
both occurring.
Event A is independent of event B if (and only if) P(A) = P(A|B). If A is independent of
B, then B is independent of A. For independent events, joint probability is the product of
marginal probabilities. If two or more events are mutually independent, then the marginal
probability of any one of these events is identical to the probability of that event conditional on
any combination of the other events occurring. A random experiment results from repeated
independent observations drawn from the same sample space. Randomness is commonly
misinterpreted and random data are seldom identified correctly in practice. We subconsciously
expect random sequences to be much too regular, and we mistakenly extend the law of large
numbers to small samples.
A random variable is a variable whose quantitative values are outcomes of an experiment. A random variable has outcomes confined only to its sample space, and the likelihood of those outcomes varies according to their probabilities. A discrete random variable is a random variable consisting of discrete data. For each possible outcome xᵢ of the discrete random variable X occurring nᵢ times in a sample of n observations, its relative frequency f(xᵢ) is f(xᵢ) = nᵢ / n. A continuous random variable is one with continuous data in its sample space.
The probability distribution, P(X), for a discrete random variable X is a listing of the
probabilities for each possible value of that variable.
The expected value, E(X), of a discrete random variable X with probability distribution P(X) is the sum of each value of X weighted by its probability. If the probability distribution of a random variable X is known, the population mean of X is defined by μ = E(X). The variance σ² for a discrete random variable X is E[(X - μ)²], and the standard deviation σ is the square root of the variance. If X is a discrete random variable with mode x_mode, then P(x_mode) is the maximum probability for the distribution. A cumulative probability distribution, F(X), for a random variable X is the probability P(X ≤ x) for each possible value x of X.

Chapter Review Cases: Expected Value Computations
1. Suppose three long-term investment options are available: a bond, a stock, and an investment in a condominium. All three of these assets have risk associated with them. However, the probabilities for each possible payoff are known. The bond returns $6 per $100 investment in 60% of cases, $8 in 30% of cases, and $10 in only 10% of cases. For the stock investment, the figures are a loss of $20, a gain of $10, and a gain of $40 with probabilities 25%, 50%, and 25%, respectively. For the condominium investment, the figures are a loss of $50, no gain, and a gain of $150 with probabilities 10%, 80%, and 10%, respectively.
Set up these payoffs and probabilities in table form, and use this table to calculate the mean and standard deviation of each investment. Which assets have the highest expected yield? Which has the lowest? Why would you not automatically eliminate the asset with the lowest yield? If you were required to select one of the two highest yielding investments, which would you choose? Why?
Answers:
Bond      P(Bond)     B·P(B)      B - μ     (B - μ)²     (B - μ)² P(B)
 $ 6        0.6         3.6         -1          1             0.6
   8        0.3         2.4          1          1             0.3
  10        0.1         1.0          3          9             0.9
                      μ = $7.0                σ² = 1.8       σ = $1.34

Stock     P(Stock)    S·P(S)      S - μ     (S - μ)²     (S - μ)² P(S)
 -$20       0.25        -5         -30        900           225
   10       0.50         5           0          0             0
   40       0.25        10          30        900           225
                      μ = $10                 σ² = 450      σ = $21.2

Condo     P(Condo)    C·P(C)      C - μ     (C - μ)²     (C - μ)² P(C)
 -$50       0.1         -5         -60       3600           360
    0       0.8          0         -10        100            80
  150       0.1         15         140      19600          1960
                      μ = $10                 σ² = 2400     σ = $49
Both the stock and the condominium have $10 expected yields, considerably higher than the $7 return from buying the bond. However, the standard deviation for stock yields is about 15 times greater than that for the bond, and the condo is more than twice as risky as the stock. Thus, many risk-averse investors may still be attracted to the lower expected yield of the bond because of its relatively safe, dependable return. Between the stock and the condo, which have equal expected returns, most investors will choose the stock because of its substantially lower
riskiness, as measured by standard deviation. Only risk-loving individuals who are willing to
take the long odds of a 150% return should choose the condo investment.
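These answers are easy to double-check with a short program. A Python sketch (not part of the original solution) over the three payoff distributions:

    # Mean and standard deviation of each investment's payoff distribution.
    def mean_sd(dist):
        mu = sum(x * p for x, p in dist.items())
        sd = sum((x - mu) ** 2 * p for x, p in dist.items()) ** 0.5
        return round(mu, 2), round(sd, 2)

    bond = {6: 0.6, 8: 0.3, 10: 0.1}
    stock = {-20: 0.25, 10: 0.50, 40: 0.25}
    condo = {-50: 0.1, 0: 0.8, 150: 0.1}
    for name, dist in [("bond", bond), ("stock", stock), ("condo", condo)]:
        print(name, mean_sd(dist))   # (7.0, 1.34), (10.0, 21.21), (10.0, 48.99)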

2. Complete the two columns in the following table, and use your results to determine the mean and then the standard deviation for the distribution of new plant expansion payoffs. The probability distribution for the random variable X measuring the payoff (in millions of $) is:

Payoff (X)     P(X)     X P(X)     (X - μ)² P(X)
    0          0.75
    5          0.20
   20          0.05

Answers:
Payoff (X)     P(X)     X P(X)     X - μ     (X - μ)²     (X - μ)² P(X)
    0          0.75      0           -2          4             3.0
    5          0.20      1            3          9             1.8
   20          0.05      1           18        324            16.2
                      μ = $2.0 million        σ² = 21.0      σ = $4.6 million

3. The probability distribution for the random variable X measuring the number of cars owned by a family:

Number of Cars (X)     P(X)     X P(X)     (X - μ)² P(X)
       0               0.32
       1               0.18
       2               0.37
       3               0.11
       4               0.02

Complete the two columns in the above table, and use your results to determine the mean and then the standard deviation for the distribution of household car ownership.
Answers:
Number of Cars (X)     P(X)     X P(X)     X - μ     (X - μ)²     (X - μ)² P(X)
       0               0.32      0         -1.33       1.77          0.57
       1               0.18      0.18      -0.33       0.11          0.02
       2               0.37      0.74       0.67       0.45          0.17
       3               0.11      0.33       1.67       2.79          0.31
       4               0.02      0.08       2.67       7.13          0.14
                              μ = 1.33 cars           σ² = 1.21     σ = 1.1 cars


4. Complete the two columns in the following table, and use your results to determine the mean and then the standard deviation for the distribution of new product payoffs. The probability distribution for the random variable X measuring the payoff (in millions of $) of launching a new product is:

Payoff (X)     P(X)     X P(X)     (X - μ)² P(X)
    0          0.60
    2          0.25
    5          0.10
   20          0.05

Answers:
Payoff (X)     P(X)     X P(X)     X - μ     (X - μ)²     (X - μ)² P(X)
    0          0.60      0           -2          4             2.4
    2          0.25      0.5          0          0             0
    5          0.10      0.5          3          9             0.9
   20          0.05      1.0         18        324            16.2
                      μ = $2.0 million        σ² = 19.5      σ = $4.4 million

CASE MINI-PROJECT:
Your movie studio has just finished production on Welcome Back Kotter, The Movie. Market research into other TV shows made into movies (based on ticket sales, worldwide distribution, and tape and TV rights) indicates the following probabilities for each possible revenue outcome. Complete the columns in the following table, and use your results to determine the mean and then the standard deviation for the movie's revenue.

Revenue (X)     P(X)     X P(X)     X - μ     (X - μ)²     (X - μ)² P(X)
(in millions $)
    10          0.3
    40          0.6
   130          0.1

μ = $ ______ million        σ² = ______        σ = $ ______ million




Chapter 7 Probability Density Functions and
Normal Distribution

Approach: From discrete random variables, we move on to distributions associated with
continuous random variables. The most important of these, the normal distribution, is given
special attention.

Where We Are Going: By matching the distributions of sample statistics to known
probability distributions, we will be able to apply probability notions to sample inference problems.

New Concepts and Terms:
families of distributions
tails of a distribution and infinitely long tails
probability density functions and cumulative distribution
normal density functions
standard normal and log normal

SECTION 7.1 Selecting a Distribution for Statistical Analysis
SECTION 7.2 Families of Discrete Distributions
SECTION 7.3 Probability Concepts for Continuous Random Variables
SECTION 7.4 The Normal Density Function


7.1 Selecting a Distribution for Statistical Analysis
To make statistical decisions, we must rely on a probability distribution applicable to the
situation at hand. If we know the probability distribution for a random variable, we can specify
our level of uncertainty by the language of probability. For example, we may inform the
management of a seafood restaurant franchise that we are 95 percent confident their clientele will
spend between $80 and $120 weekly dining out this year, based on survey sampling and
knowledge of the underlying probability distribution.
But there is an important catch, one that does not appear on the restaurant's menu. How
do we know which distribution applies? In many business statistics situations, the difficult task
is recognizing which probability distribution is the proper one. Once the correct distribution is
found, the answers may fall out more or less automatically. Of course, there is no guarantee that
we will like the statistical results. Sometimes, the results are simply that we can't say much with
any degree of certainty. Nevertheless, at least we will learn what we do know and how well we
know it. We may then base decisions on our level of knowledge or ignorance.
We will use several general techniques over and over again to determine the proper
distribution. We preview these methods in Table 7.1.
Table 7.1
Methods for Identifying the Appropriate Distribution
1. Find the distribution from classical analysis of the experimental process or use
subjective probability of experts or your own experience and intuition.
2. For larger samples, use the relative frequency distribution to identify the
distribution of the parent population.
3. Rely on large-sample properties and statistical theory to approximate the
distribution of an estimator or other sample statistic.
4. Choose a distribution applicable to the type of problem and data being used.

Grouping Probability Distributions into Families
Suppose we can always identify a suitable distribution in each situation. Won't we still
need to familiarize ourselves with a different distribution for each decision we face? Thankfully,
this is seldom necessary in practice. We only need to stock a handful of distributions in our tool
kit. In this chapter, we introduce some of the most frequently-used distributions, two for discrete
and two others for continuous random variables. In later chapters, we will need to introduce only
a few others.
Why do we only need a few distributions to do most statistical analysis? We will soon
discover that a common set of distributions may be applied to the statistical methods most
commonly used in business. Moreover, business and economic variables tend to have a few
standard types of distributional patterns. The final reason is that distributions may be classified
into a few major families, each applicable to a variety of situations.
DEFINITION: A family of distributions is a set of similar distributions that differ from one
another only by the values of their parameters.
In this chapter, we introduce four distribution families, each defined in terms of its population
parameters. Only a few more distributional families will be introduced as we need them.
In your algebra courses, you used one such family: the straight line. Recall from Chapter 4 that all straight lines may be described by the expression y = a + bx. What makes one line different from another are the values assigned to the parameters a and b, the intercept and slope. Thus, the line y = 10 + 5x has a greater slope than the line y = 10 + 2x and a smaller intercept than y = 20 + 5x. But the linear form also imposes limitations. Try as we might, we cannot give curvature to a line by changing the parameters a and b. This can be accomplished only by switching to another form such as the quadratic: y = a + bx + cx².
The same basic idea applies to each distributional family. The probabilities of a distributional family may be found from an algebraic equation that contains one or more parameters. Rather than slopes and intercepts, the parameters we'll encounter will be statistical quantities, such as μ and σ. By changing the parameter values of a distribution, we can alter its shape or position the same way that changing the slope or intercept modifies a linear equation. But like the line, a member of a statistical family must retain its family characteristics. No change in the parameters can give it the shape of a different distribution family.
Thus, to describe our statistical universe we need rely on only a few reliable families of distributions. Once we identify which of them applies to our business analysis, we need only describe the specific distributional shape by its parameter values. Although the algebraic formulas for probability distributions are generally complex, the probabilities are accessible through the same statistical software that we've already been using to conduct our statistical analysis. Thus, we won't need to bother learning to work with algebraic formulas or carry around reams of arcane distribution tables.


Multiple Choice Review:
7.1 Which of the following is not a method to determine the appropriate distribution for a
decision making problem?
a. make reasonable assumptions about the distribution in the population
b. apply well-understood distributions as approximations for large-sample situations
c. apply well-understood distributions by using ordinal measures
d. discern the population distribution from the sample's distribution in large sample
situations
e. all of the above are valid methods

7.2 In many situations, we may be able to assign a distribution for our analysis by
a. examining distributions of earlier studies
b. asking experts about the process by which the data are generated
c. using our subjective feeling
d. all of the above
e. none of the above

7.3 Most statistical analysis uses only a few distributions for all but one of the following
reasons:
a. we may use large sample properties
b. business variables are distributed in only a few different manners
c. we can represent many distributions by a handful of distribution families
d. business statistics relies primarily on discrete distributions
e. all of the above are explanations

7.4 In distributions that are skewed to the right,
a. the median will lie to the left of the mean
b. the mean will lie to the left of the median
c. the median and mean will be identical
d. the relationship between the median and mean will depend on the specific distribution
e. skewed distributions do not have a median


7.2 Families of Discrete Probability Distributions
The Binomial Distribution
The first distribution we introduce is one of the oldest known and still the most useful
discrete probability distribution. To understand this discrete probability distribution, we first
need to introduce the idea of a Bernoulli trial.
DEFINITION: A Bernoulli trial is an experiment involving two possible outcomes,
success S and its complement, failure F, whose probabilities are
P(S) = p and P(F) = 1 - p = q.
Success is a relative term in a Bernoulli trial and does not necessarily represent a
desirable outcome. Success merely denotes the event we wish to monitor. Many business
decisions are based on binary data which may be viewed as outcomes of Bernoulli trials.
DEFINITION: Binary data is categorical data consisting of only two possible outcomes.
Examples of binary data encountered in the business and economic world are the following:
a merger occurs or it does not
a manager hired is a male or a female
a sale takes place before or after lunch
the stock market goes up or it does not go up
a company adopts TQM methods or it does not adopt them
a student graduates this year or does not graduate this year
a car buyer chooses the dealer's bank financing or does not

Obviously, some of these examples have outcomes that do not classify as success or
failure in the conventional sense. With an unfriendly takeover, management of the target firm
may not consider it a success for the merger to go through. Hiring a female manager is no less a
success than hiring a male. Instead, we may label male a success simply because we are
investigating gender bias in promotional patterns. Alternatively, female would have been the
successful outcome if we were monitoring compliance with affirmative action goals.
Notice that some binary examples that we listed are clearly derived from quantitative
data. The stock market may go up 10 points or 50, down five points or 75, or remain unchanged.
The binary outcomes up and not up reduce this quantitative information to two states, thereby
assisting a pension fund manager in deciding whether to issue a buy order on a stock portfolio. Similarly, the time
of a sale is also quantitative. By converting this time-of-day information to before-and-after
lunch break outcomes, the store manager may be better able to decide about staffing work shifts.
However, evidence from a single trial is seldom revealing. The observed outcome may
be a coincidence. Anyway, most companies are in business for the long haul. They are
interested in the Bernoulli trial outcomes performed throughout the year. Workers continually
come up for promotion and dealers try to sell cars all year long. Let's use a case example to
develop a special type of discrete random variable based around counting Bernoulli trial
successes.

Chapter Case #1: Keep on Truckin'
A truck owned by a shipping company may suffer from a broad set of mechanical
problems. One of the most important concerns is how many trucks in the fleet can be operated
on any given day. If a particular truck is operational today, we may label that a success S. If on
the other hand a truck must be repaired or serviced today, for whatever the reason, the company
loses revenue from lower distribution capacity and adds to its maintenance expenses. This
outcome may then be classified as a failure F.
We often can obtain P(S) for Bernoulli trials from previous sample information or
understanding of the mechanical process. Company records can provide reliable historical
frequency evidence for p, the probability that any individual truck cannot operate today.
The shipping company is not interested in whether a particular truck can operate today.
Instead, the company is concerned with how many trucks are in operating condition, that is, the
number of successes in a series of Bernoulli trials. Each truck in the fleet is a separate trial and a
success is recorded if that truck is in operating condition today.
In statistics, each trial is a sample observation. In Bernoulli trials, each trial must be
independent of the others. The number of successes is then a discrete random variable with a
binomial distribution.
DEFINITION: A discrete random variable X that measures the number of successes x in
n independent trials, each with the same probability of success, has a binomial
distribution.
Suppose that trial success for each truck is independent of success for any other truck. Then the
number of trucks in operation on a given day is a binomial random variable.

Discrete random variables have probability distributions. How do we find binomial
probabilities? In Chapter 6, we primarily relied on relative frequency distributions to discover
these probabilities. For the binomial and other common distributions, the process that generates
the random variable also supplies us with the distributional probabilities.
Beginning at the extreme cases, P(x = n) and P(x = 0), our probability calculations are
straightforward. Notice that there is only one way to get x = n successes: if every Bernoulli trial
is a success. Because the binomial distribution deals with independent events, trial outcome
probabilities may be multiplied using the simple product rule developed in Chapter 6. For
example with n = 3 trucks, x = 3 can occur only if all three trials result in success. Then the
probability P(3) must equal p × p × p, or p^3. If p = 0.60, then
P(3) = (0.60)^3 = (.6)(.6)(.6)
so P(3) is 0.216. The opposite extreme, no successes (x = 0), is the same as having all n trials
result in failure. Thus, P(0) must be q^n, where q = 1 - p was defined earlier. For the shipping company
example, q = 1 - 0.60 = 0.40. Therefore, the probability that none of the three trucks will be
operational today is
P(0) = (0.40)^3 = (0.4)(0.4)(0.4) = 0.064.
However, we have only accounted for 0.216 and 0.064 of the binomial distribution's
probability in the trucking example. What about the remaining 1.0 - 0.216 - 0.064, or 72
percent? This probability must be captured by the two other possible outcomes: x = 1 and x = 2
trucks in operation. The computations of these probabilities, however, become more
complicated than those for P(0) and P(n). When x is between zero and n, the Bernoulli trial
results contain a mixture of successes and failures. Our product calculations must therefore
include both p and q. Specifically, for x successes in n trials, p occurs x times in the product and
q occurs the remaining (n - x) times. Thus, the probability of x successes followed by (n - x)
failures is p^x q^(n - x). In fact, this must also be the probability for any other combination of x successes and
n - x failures.
The probability of n independent trials resulting in a particular arrangement of x
successes is
p^x q^(n - x)
However, this product is only a fraction of P(x). To explain why, we again examine the
shipping company. Suppose the company is concerned with x = 2 trucks being operational.
One way that could occur is if trucks #1 and #2 are each operational, but #3 is not. The
probability of this pattern of trial outcomes, S-S-F, is p p q, or (0.6)(0.6)(0.4) = 0.144. Yet x = 2
success may be achieved by two other patterns: one with only truck #2 non-operational (S-F-S)
and the other with truck #1 non-operational (F-S-S). The probabilities of these are p q p and
q p p, respectively. These two products also equal 0.144 because they each contain the same
number of p's and q's as p p q. Because the three are mutually exclusive outcomes, we may apply
the simple addition rule to sum the probabilities. Thus,
P(2) = 0.144 + 0.144 + 0.144 = (3)(0.144)
so P(2) is 0.432. Applying similar logic to the three trial combinations that result in x = 1 truck
operational, P(1) can be shown to be (3)(0.6)(0.4)^2, or 0.288. Now we have the complete binomial
distribution for the trucking case. Figure 7.1 lists this distribution together with its
corresponding cumulative probabilities F(X), the probability of values no larger than x.
x P(X) F(X)
0 .064 .064
1 .288 .352
2 .432 .784
3 .216 1.0
Figure 7.1

On any given day, the shipping company should be informed that the chance of all three
trucks operating (21.6 percent) is half the probability (43.2 percent) that exactly two will be in
working order. Based on the cumulative F(x) probabilities, the company also can learn that no
more than one truck will be operational more than one third of the time because
F(1) = P(X ≤ 1) = P(0) + P(1) = 0.352.
The complementary event, that at least two trucks will operate, has probability P(X ≥ 2) = 1 - F(1), or 0.648. Of
course, this same result could be obtained by adding P(2) to P(3). It's amazing what we can
discover from knowing the type of distribution, the number of trials n, and the probability of
success on each trial p.
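For readers who prefer to check these numbers with a short program instead of Minitab or Excel, the following sketch (Python and its standard itertools module are an assumption here, not software used in this text) enumerates all 2^3 success/failure patterns for the three-truck fleet and reproduces the distribution and cumulative probabilities in Figure 7.1.

# Enumerate every S/F pattern for n = 3 independent trials with p = 0.6
from itertools import product

p, q, n = 0.6, 0.4, 3
dist = {x: 0.0 for x in range(n + 1)}

for pattern in product("SF", repeat=n):        # e.g. ('S', 'S', 'F')
    x = pattern.count("S")                     # number of operational trucks
    dist[x] += (p ** x) * (q ** (n - x))       # independent trials: multiply probabilities

cum = 0.0
for x in range(n + 1):
    cum += dist[x]
    print(x, round(dist[x], 3), round(cum, 3)) # matches Figure 7.1: .064 .288 .432 .216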
If the truck fleet were larger, however, it would be more time consuming to count the number of
different combinations that result in x successes. Even with only n = 5 trucks, there are ten
different ways to obtain x = 3 successes. Two of these are S-S-F-F-S and S-F-S-S-F. An easier
method is to use the combinatorial formula.
Combinatorial Formula: The number of combinations Cₓⁿ in which n objects can be
assigned to x of one type and (n - x) of another type is determined by the formula
Cₓⁿ = n! / [x! (n - x)!]
Recall that the exclamation marks "!" in the formula symbolize factorials.¹ The n "objects" are
the number of Bernoulli trials and x is the number of successes. For our first truck example,
having two out of three trucks operational corresponds to n = 3, x = 2. The number of
combinations is therefore
3!/(2! 1!) = (3 × 2 × 1)/[(2 × 1)(1)] = 6/2 = 3.
However, we had already figured that out by listing the S-S-F, S-F-S, and F-S-S possibilities. How
about the problem of 3 operational trucks in the 5-truck fleet? Then
C₃⁵ = 5!/(3! 2!) = (5 × 4 × 3 × 2 × 1)/[(3 × 2 × 1)(2 × 1)] = 120/12 = 10
confirming the presence of ten possible combinations.
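As a quick check of these combinatorial counts, a sketch like the following (Python's math module is assumed here; the text itself relies on hand calculation or Minitab/Excel) computes the same values directly.

# Combinations: how many arrangements of x successes among n trials
from math import comb, factorial

print(comb(3, 2))                                    # 3 ways: S-S-F, S-F-S, F-S-S
print(comb(5, 3))                                    # 10 ways for 3 successes in 5 trials
print(factorial(5) // (factorial(3) * factorial(2))) # same result from n!/(x!(n - x)!)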
Multiplying the probability of any particular sequence of Bernoulli trials by Cₓⁿ gives us
the general formula for binomial probabilities.
Formula for the Binomial Distribution: The probability of x successes out of n
independent trials is
P(x) = Cₓⁿ p^x (1 - p)^(n - x) = [n! / (x! (n - x)!)] p^x (1 - p)^(n - x)
For example, what is the probability of having exactly three operational trucks in the 5-truck
shipping company? Each of the ten combinations has a p^3 q^2 probability because of x = 3
successes and (n - x) = 2 failures. Therefore, P(3) is calculated from the binomial probability
formula as follows:
P(3) = 10 (0.6)^3 (0.4)^2 = 0.3456
or slightly more than one-third.
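The binomial formula is also easy to program. The minimal sketch below (again in Python, an assumption rather than the Minitab/Excel route described next) codes P(x) = Cₓⁿ p^x (1 - p)^(n - x) and reproduces P(3) = 0.3456 for the 5-truck fleet, along with the cumulative probability of one or fewer operational trucks.

# Binomial probability from the formula
from math import comb

def binom_pmf(x, n, p):
    # P(x) = C(n, x) * p**x * (1 - p)**(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(binom_pmf(3, 5, 0.6), 4))                        # 0.3456
print(round(binom_pmf(0, 5, 0.6) + binom_pmf(1, 5, 0.6), 4)) # F(1) = 0.087 for n = 5, p = 0.6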
For even larger values of n, it is easier to have the computer solve the formula and give
us the probabilities we need. We may use Minitab or Excel to obtain binomial probabilities (as
well as most other commonly used distributions). Figures 7.2 and 7.3 contain these instructions.
A copy of Minitab output (Figure 7.6) verifies our earlier hand computation that P(2) = 0.432
when n = 3 and p = 0.6.


¹ Recall from algebra that for a positive integer m, m factorial is m! = m × (m - 1) × (m - 2) × ... × 2 × 1, and 0! is defined to equal 1. Thus, 3! = (3)(2)(1) = 6.

Using Minitab to Find Probabilities for a Binomial Distribution
Pull Down Menu Sequence: Calc Probability Distributions Binomial...
Complete the Binomial Distribution DIALOG BOX as follows:
(1) click to highlight circle next to Probability or Cumulative probability
(2) type in the number of trials n in the box following Number of Trials:
(3) type in the probability of success p in the box following Probability of Success:
(4) highlight circle to the left of Input Constant: and type the number of successes x
(5) click OK button to find the binomial probability
Figure 7.2
Using Excel to Find Probabilities for a Binomial Distribution
Click on the fx button to open the Function Wizard - Step 1 of 2 Box:
Select BINOMDIST from Function Name listing
Click on the Next button to open the Function Wizard - Step 2 of 2 Box:
(1) type in the number of successes x in the box following Number_s
(2) type in the number of trials n in the box following Trials
(3) type in the probability of success p in the box following Probability_s
(4) type FALSE in the box after cumulative (or TRUE for the cumulative probability)
(5) the binomial probability will appear at the upper right in the box after Value:
Figure 7.3
Probability Density Function

Binomial with n = 3 and p = 0.600000

x P( X = x)
2.00 0.4320
Figure 7.6

Cumulative probabilities for the binomial distribution may also be used to answer important
questions (see Figure 7.7). Because the cumulative probability F(1) is 8.7 percent, the 5-truck
shipping company should be prepared for the occasions when it has no more than one truck
operational. If such situations are unacceptable, the company needs to maintain its trucks in
better condition or invest in a new or more reliable fleet.
Cumulative Distribution Function

Binomial with n = 5 and p = 0.600000

x P( X <= x)
1.00 0.0870
Figure 7.7

The binomial distribution is a family of distributions with two parameters, n and p. Thus,
the distribution looks different for different values of n and p. The binomial distribution is more
flexible than the purely symmetrical distributions we will explore later in this chapter.
Let's examine the symmetric version of the binomial for a ten-trial case with p = 0.5; the
distribution copied from Minitab output is also instructive (see Figure 7.9).² Notice the symmetry of
probabilities around the mode, x = 5. Thus, P(4) equals P(6), P(3) equals P(7), and so forth. The
three events nearest the center of the distribution (x = 4, 5, and 6) together capture nearly
two-thirds of the probability. Toward the extremes, probabilities become very small. P(0) and
P(10) are only 1/10th of one percent, and x = 1 and x = 9 each have about a one percent chance
of occurring.
Probability Density Function
Binomial with n = 10 and p = 0.500000

x P( X = x)
0.00 0.0010
1.00 0.0098
2.00 0.0439
3.00 0.1172
4.00 0.2051
5.00 0.2461
6.00 0.2051
7.00 0.1172
8.00 0.0439
9.00 0.0098
10.00 0.0010
Figure 7.9

Binomial distributions may also be skewed.
The binomial family of distributions is symmetrical if p = 0.5, but skewed to the right for p
< 0.5 and skewed left for p > 0.5. The closer p is to 0 or 1, the greater the skewness.


² To obtain the entire binomial distribution shown in this figure, highlight Input column: and type the column number containing the numbers from 0 to n, the number of trials (or use the Set Patterned Data dialog box in the Calc menu to generate this integer sequence).

Chapter Case #2: How Much is That Doggie in the Window?
Today, purebred pets can command high prices. Unfortunately, the unscrupulous
inbreeding practices of puppy mills lead to high rates of blindness, hip dysplasia, and other
hereditary conditions. Suppose one in ten dogs sold by a breeder carries a hereditary gene for
one of these conditions. However, an inspector only has time to randomly sample n = 4 dogs
from each breeder in her territory. If at least three dogs in her sample are not carriers, the
breeder's license to make interstate sales will not be suspended. How can the breeder determine
the probability of not being suspended? The answer is to use the binomial distribution with n = 4
for the sample size and p = 0.9 as the probability of success, i.e., that a dog is not a carrier. The
binomial distribution from Minitab is shown in Figure 7.11.
Probability Density Function
Binomial with n = 4 and p = 0.900000

x P( X = x)
0 0.0001
1 0.0036
2 0.0486
3 0.2916
4 0.6561
Figure 7.11

We can calculate the probability that at least three dogs in her sample are not carriers:
P(X ≥ 3) = P(3) + P(4) = .2916 + .6561 = .9477
Thus, the breeder has a 95 percent chance of retaining her license.
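If a statistical package with a binomial routine is available, the same license probability can be confirmed in a few lines. The sketch below uses Python's scipy.stats.binom (an assumption of this sketch; the text's Figure 7.11 comes from Minitab).

# Dog breeder case: n = 4 sampled dogs, p = 0.9 that a dog is not a carrier
from scipy.stats import binom

n, p = 4, 0.9
print(round(binom.pmf(3, n, p), 4))                      # 0.2916
print(round(binom.pmf(4, n, p), 4))                      # 0.6561
print(round(binom.pmf(3, n, p) + binom.pmf(4, n, p), 4)) # P(X >= 3) = 0.9477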
Notice how this distribution is highly skewed to the left and has most of its probability in
the two largest values of x. In fact the mean for this distribution is 3.6, because
E(X) = (0)(.0001) + (1)(.0036) + (2)(.0486) + (3)(.2916) + (4)(.6561) = 3.6
using the expected value method presented in Chapter 6. Even more computational effort is required
to discover that the standard deviation is 0.6. Fortunately, an added bonus of using distributions
like the binomial is that we often may bypass the expected value calculation. Two simple formulas are
available to find the mean and standard deviation for any binomial distribution.
For a binomially distributed random variable X where sample size is n and P(S) = p,
μ = np and σ = √(npq).

The formula for the mean is logical. If the probability is 9 in 10 for a single trial, then the mean
number of successes in four trials should be nine-tenths of the way along a line from zero to four.
Four times 0.9 is 3.6, the same mean we found by the expected value method. Next, we apply
the second formula to the dog breeder example:
npq = 4(.9)(.1) = 0.36,
whose square root gives us the standard deviation σ = 0.6.
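The shortcut formulas are simple enough to verify directly. The sketch below (Python is assumed) computes μ = np and σ = √(npq) for the dog breeder case and checks the mean against the longer expected-value calculation based on the distribution in Figure 7.11.

# Shortcut formulas versus the expected-value method
from math import sqrt

n, p = 4, 0.9
q = 1 - p
print(round(n * p, 2), round(sqrt(n * p * q), 2))          # mean 3.6, standard deviation 0.6

pmf = {0: 0.0001, 1: 0.0036, 2: 0.0486, 3: 0.2916, 4: 0.6561}
print(round(sum(x * prob for x, prob in pmf.items()), 2))  # expected value also 3.6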
So far, we have worked with known binomial distributions. In inferential statistics, we
will often need to work in the opposite direction: inferring the unknown value p from a random
sample of n independent trials. In Chapter 9, we will learn how to estimate the population
proportion p from the sample proportion x/n of successes. A nonparametric method of Chapter
14, the sign test, will also use the binomial distribution. In addition, sample inference about
proportions is especially important to the statistical process control analysis in Chapter 15.


CASE MINI-PROJECT:
An average of 3 out of every 10 shoppers entering a mall shoe store will actually purchase shoes
at the store.
1. The probability that the next shopper will buy shoes is 0.___.
2. The probability that each of the next three shoppers will buy shoes is 0.___.
3. The probability that none of the next three shoppers will buy shoes is 0.___.
4. The probability that exactly one of the next three shoppers will buy shoes is ___ multiplied
by (.3)(.7)^2.
Suppose 20 shoppers enter the store this morning. The computer reports the following binomial
probabilities:

Probability Distribution for Binomial with n = 20 and p = 0.3
x P( X = x)
0 0.0008
1 0.0068
2 0.0278
3 0.0716
4 0.1304
5 0.1789
6 0.1916
7 0.1643
8 0.1144
9 0.0654
10 0.0308
11 0.0120
12 0.0039
13 0.0010
14 0.0002
15 0.0000

5. The most likely outcome is ___ pairs sold this morning, but this mode occurs less than one-
fifth of the time.
6. The probability that fewer than 4 pairs of shoes are sold this morning is 0.___.
7. The probability that at least 11 pairs of shoes are sold this morning is 0.___.
8. This distribution is / is not (circle one) symmetrical because the probability of a trial success
is not equal to 0.___.
9. The mean for this distribution is ___ pairs sold and the standard deviation equals ___ pairs sold.

Multiple Choice Review:
A maintenance firm is concerned about the cost of servicing computers at a college. Suppose the
probability that any particular computer in a student lab breaks down this month is 0.2. If there
are 10 computers in the lab, use the output below to answer the following questions.

BINOMIAL WITH N = 10 P = 0.200000
K P( X = K)
0 0.1074
1 0.2684
2 0.3020
3 0.2013
4 0.0881
5 0.0264
6 0.0055
7 0.0008
8 0.0001
9 0.0000

7.5 The probability that exactly two computers in the lab will break down this month is
approximately
a. 0.1
b. 0.2
c. 0.3
d. 0.4
e. 0.5

7.6 The probability that no more than two computers will break down this month is
approximately
a. 0.3
b. 0.4
c. 0.5
d. 0.6
e. 0.7

7.7 If the cost of repairing a computer is $300, the approximate probability that service to the
lab will cost the maintenance firm at least $1800 is
a. 0.001
b. 0.006
c. 0.01
d. 0.06
e. 0.1

7.8 Which of the following is not an example of a Bernoulli trial:
a. the merger goes through or it does not go through
b. the manager hired is a male or a female
c. the sale takes place before or after lunch
d. the stock market goes up today or it does not go up
e. all of these are examples of a Bernoulli trial

7.9 A company either adopts TQM methods or it does not adopt them. If the probability of
adoption is 0.4, then, in the notation of Bernoulli trials,
a. P(S) = 0.4
b. P(F) = 0.6
c. p = 0.6
d. q = 0.4
e. all of the above

7.10 If the probability of any particular car buyer choosing the dealer's bank financing is 0.7,
then the probability that the next four buyers will each choose the dealer's bank financing
is approximately
a. 0.53
b. 0.49
c. 0.34
d. 0.24
e. 0.17

7.11 For the preceding example, out of n = 5 buyers the expected number of buyers who
accept the dealer's bank financing is
a. 0.7
b. 1.05
c. 2.5
d. 3.2
e. 3.5

7.12 The standard deviation for the expected number of buyers in the preceding example is
a. 3.50
b. 1.87
c. 1.22
d. 1.05
e. 1.02

Calculator Problems:
7.13 Suppose the truck fleet discussed in the case example ages over time until p declines
from 0.6 to only 0.4. Recompute the probability P(X ≥ 2) for n = 3.

7.15 Reverse the set up of the dog breeder case example by requesting the table for n = 4 but p
= 0.1. Define the probability of success from the inspector's point of view and find
P(x>1) where x is now the number of dogs carrying the disease. How is the binomial
distribution you used here different from and how is it similar to the distribution for n =
4, p = .9?

7.16 Suppose the inspector in the case example examines 5 dogs instead of 4 and uses the
same criteria that all but one must be noncarriers to pass inspection. What is the
probability of the breeder retaining his license now?
7.17 Another breeder in the case example whose practices are more lax has one in five of its
dogs carriers of the disease. Determine the appropriate probability table and find the
probability of this breeder retaining its license. Compare your answer with the first breeder's
probability, and account for the sizeable difference.

7.18 Suppose the probability is p = 0.25 of a new product launched this year becoming an
eventual market success. Find the probability of
(a) no more than 2 new products out of eight being successful
(b) at least four of 10 new products being successful

7.20 What happens to σ for a given value of n as p gets farther away from 0.5? Use the formula
for σ with n = 9 and p = .5, .6, .75, .9 as evidence to answer this question.

7.38 The probability of matching all 5 numbers in the Florida lottery is 1 in 14 million. If
number combinations are selected at random by ticket buyers, find the following
probabilities:
(a) of no winners if 7 million tickets are sold
(b) of more than 2 winners if 21 million tickets are sold
(c) of from 4 to 8 winners if 35 million tickets are sold


7.3 Probability Concepts for Continuous Random Variables
Although discrete random variables play a prominent role, most random variables we
encounter in business are continuous. Sales, production, time worked, distance to market, and
many other critical decision variables are measured in continuously divisible units of money,
time, distance, or size. Even if variables start out as discrete variables, they become continuous
when expressed as percentages of something or financial ratios between two variables.
Because of their essential role in inferential statistics, continuous random variables need
to receive extra care in understanding their often perplexing nature. Probability for
continuous random variables presents us with special problems. By the conclusion of this
chapter, you will be prepared to tackle sample inference methods that rely on continuous
distributions. As a side benefit, you will also learn about normal and uniform random
variables and discover when a continuous distribution can approximate a binomial
distribution.

The Zero Probability Paradox and Probability Density
Probability for a continuous random variable differs from that of a discrete variable in
one major respect.
The Zero Probability Paradox: The probability of each possible value occurring is
essentially zero for continuous random variables.
On the surface, the idea of zero probability for every possible outcome sounds absurd. In any
single experiment, some event has to occur. That event must therefore have started with a
probability greater than zero. Furthermore, the probability of all possible values for a random
variable must sum to one. If all event probabilities are zero, where did all the probability go?
This apparent contradiction is caused by the infinite number of values possible for a
continuous random variable. Suppose the sample space for X, a continuous random variable,
consists of all x in the interval from a to b. We denote this interval by [a, b].
A continuous random variable has a sample space consisting of all values over one (or
more) intervals.
Even if [a, b] has finite length, the number of possible values x is still infinite.
Our solution to the zero probability paradox is to switch from probability to probability
density.

DEFINITION: The probability density at any possible value of a continuous random variable is
the rate at which probability accumulates in intervals around that value.
We therefore represent continuous random variables by probability density functions instead of
probability distributions.
DEFINITION: A probability density function, or pdf, is an algebraic function f(X) (or the
graph of this function) that assigns a probability density to each possible value x in the sample
space of a continuous random variable X.

Probability for Density Functions
Most shoe stores have peak times, perhaps around lunch breaks and again when people
get off from work or out of school. Figure 7.20 presents a pdf with this pattern of peak shopping
times. Notice that peaks occur around 12 noon and following 5 P.M. (x = 17). Although the
densities vary, we can read the densities f(X) off the graph as the height of the curve for each
value x. Figure 7.20 examines one of these values, x = 15.5. Observe from the graph that
f(15.5) is approximately 0.12, but that density does not apply for other x values. The varying
density makes it difficult to calculate interval probabilities from areas under the pdf.
In a small interval around x = 15.5, for example [15.25, 15.75], the interval probability
may be approximated by a rectangular area. The calculation for this height-times-width
approximation is
P(15.25 ≤ X ≤ 15.75) ≈ (x₂ - x₁) f(x) = (15.75 - 15.25)(0.12) = 0.06

Thus, the manager concludes that about 6 percent of sales occur between 3:15 and 3:45 P.M.
If we need probabilities for broader intervals, however, this procedure will result in worse
approximations. An elaborate series of rectangle computations may be required instead. Suppose
the store manager wants to estimate the interval probability for sales between 1:00 and 5:00 in
the afternoon. The rectangular sum method is diagramed in Figure 7.21 to approximate the area
under the pdf, 0.43, displayed in Figure 7.22.
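Because the store's pdf is available only as a graph, we cannot reproduce Figure 7.21 exactly, but the rectangular-sum idea itself is easy to illustrate on a density whose areas can also be obtained exactly. The sketch below (Python with scipy is assumed; this is an illustrative stand-in, not the text's calculation) approximates the area under the standard normal pdf between z = 1 and z = 2 with 100 thin rectangles and compares it with the area obtained from cumulative probabilities.

# Rectangular (midpoint) sums versus the exact area from cumulative probabilities
from scipy.stats import norm

a, b, k = 1.0, 2.0, 100                  # interval endpoints and number of rectangles
width = (b - a) / k
area = sum(norm.pdf(a + (i + 0.5) * width) * width for i in range(k))

print(round(area, 4))                    # rectangular-sum approximation, about 0.1359
print(round(norm.cdf(b) - norm.cdf(a), 4))   # exact area from F(2) - F(1)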

Distributional Characteristics of Density Functions
The preceding analysis is illuminating, for it shows geometrically how interval
probabilities relate to density functions. Time-consuming approximation methods are seldom
necessary, however. Instead, practitioners find probability areas under commonly-used density
functions directly from the computer or from statistical tables. Statistics students in previous
generations devoted many hours learning to wade through and interpolate distributional tables
derived from each density function formula. The modern approach used in this text relies on
obtaining more precise and informative distributional information directly from computer
spreadsheet and statistical software. Our computer software can find probabilities for density
functions as easily as it did for discrete distributions such as the binomial.

Like discrete distributions, density functions may be symmetrical or skewed, unimodal or
bimodal. Figure 7.23 portrays four examples of pdfs. Unlike the uniform and bimodal pdfs
graphed in previous figures of this section, all four of these density functions are unimodal.

In addition, the two pdfs pictured at the top of Figure 7.23 are symmetrical. As we found for
discrete distributions, these two symmetrical density functions have identical mean, median, and
mode, located in this figure at 16 and 10, respectively.
between mean, median, and mode for the skewed right and left pdfs in Figure 7.23 that we found
for other types of distributions.
However, the median, mode, and cumulative probability must be reinterpreted for use with
density functions.¹⁵ For example, the mode for pdfs needs to be redefined in terms of density rather than
probability.
If X is unimodal, the mode occurs where f(x) is greatest.
Cumulative distribution functions may also be derived from density functions.


¹⁵ The mean of continuous random variables is still defined as the expected value. However, the summations of discrete values must be replaced by "integration" across an interval of values, which requires mathematics beyond the scope of this text.

The cumulative probability function, F(x) = P(X ≤ x), for a continuous random variable X is
measured by the cumulative area under the density function f(x).
As it did for discrete distributions, cumulative probabilities help us explain the median of a
continuous random variable.
For a continuous random variable X with density function f(x), the median M is at the
value of x where the cumulative probability F(x) is 0.50.
A final characteristic of density functions is especially important. Recall that the tails of
a distribution contain the extreme values relatively far from the median. Notice that the two
skewed density curves in Figure 7.23 each have one such tail, while the graph on the upper right contains two
tails. Each of these tails is infinitely long.

DEFINITION: A density function has an infinitely long tail if the density f(x) is positive no
matter how far you go into that tail.

Thus, a nonzero area always remains under the density function regardless of how deep you are
in the tail.
However, "possible" is not the same as "likely." If we can dismiss a large range of
possible values as highly improbable, we may devote most of our attention to the remaining interval.
If the total area under a portion of the density function is very small, it is unlikely that a
value from that interval will occur.
We constantly make plans for tomorrow knowing there is always a slight chance today will be
our last. We could die in an auto accident, be struck down by a heart attack, or share death with
millions of others from a nuclear holocaust or devastating earthquake. The reason we don't
allow our daily decisions to be influenced by these possible events is that each has a tiny
probability of occurring. We tend to ignore in our day-to-day decisions the dire results of an
earthquake or fatal heart attack because we know their probabilities are very small.
For the remainder of this chapter, we will explore the properties of the density function in
the upper right graph of Figure 7.23, the normal pdf. We will soon discover the useful qualities
of this symmetric, unimodal density function with infinitely long tails.

Multiple Choice Review:
7.42 A probability density function applies only to
a. continuous random variables
b. discrete random variables
c. distributions having both tails infinitely long
d. distributions having at least one infinitely-long tail
e. any random variables


7.4 The Normal Density Function
In Chapter 3, we informally introduced the normal distribution as a bell-shaped curve
primarily concentrated within two standard deviations of its mean. We are now prepared to
examine this normal curve properly: as a density function.
The normal pdf continues to be one of the most useful in business statistics. As
mentioned in the previous section, probabilities cannot easily be calculated from most density
functions. Because the normal pdf does not have a uniform density, graphs can only
approximate interval probabilities by areas of rectangles. Fortunately, probabilities for the normal
and many other density functions may be requested from statistical software such as Minitab and
Excel.¹⁶ Later in this section, we will use the computer for this purpose. For now, let's postpone
our search for specific probabilities. Instead, we concentrate on the properties that make the
normal distribution so interesting and useful. Two essential characteristics of a normally
distributed random variable are its symmetry and central peak.
The normal distribution is unimodal and symmetric with infinitely-long tails in each
direction.
As we learned in Chapter 3, a symmetric and unimodal distribution has the three popular
measures of the average (mean, median, and mode) all equal. The presence of two infinitely-
long tails allows the normal distribution to describe random variables with extensive sample
spaces.


¹⁶ As previously mentioned, prior to the general availability of this software, tabulated distributions were the primary source of probability information. Distribution tables are cumbersome to use, incomplete, and often require interpolation.

Four members of the normal family are graphed in Figures 7.24 and 7.25, two in each figure. Clearly, all four curves
display symmetric and unimodal characteristics. Yet each is a different member of the normal
family of density functions. Observe that the two graphs in Figure 7.24 have identical shapes,
but the one on the right is centered at x = 10 instead of x = 5. In the next figure, (Figure 7.25)
both are centered at the same mean, x = 4, but one has more of its probability dispersed over a
wide area. Like other distributions presented in this chapter, the normal pdf is a whole family of
distributions described by its parameters.
The normal pdf is a two-parameter family of density functions whose members differ only
by the values of their parameters: mean μ and standard deviation σ.
The distributions in Figure 7.24 have different means, one with μ = 5 and the other with μ =
10. However, their identical shapes indicate that σ is the same for each density curve. The
normal pdfs in Figure 7.25, by contrast, each have μ = 4. The difference in σ is responsible for
their dissimilar shapes. The curve with much of its probability area concentrated around the
mean has the smaller standard deviation; the curve with the greater σ has more probability in the tails.
A nonzero area is in the tails of every normal pdf, but the probability density falls off
rapidly as we move toward extreme values. We can virtually ignore any value beyond three
standard deviations from the mean.
The Empirical Rule: If X is normally distributed, x will lie within one, two, and three
standard deviations of the mean with probability 68 percent, over 95 percent, and 99.7
percent, respectively.
For example, suppose cast-iron cam shafts trimmed on a lathe have diameters that are normally
distributed with μ = 3.00 inches and σ = 0.05 inches. Then we know there is a 68 percent chance
that a randomly selected shaft diameter will lie in the interval [2.95", 3.05"], which is one
standard deviation on either side of the mean. Furthermore, there is more than a 95 percent
chance for the cam shaft diameter to fall within the [2.90", 3.10"] interval and a 99.7 percent
probability it is in [2.85", 3.15"]. Of course, it is also possible for the diameter to be as large as
3.50, ten standard deviations above the mean. However, the minuscule probability this far into
the right tail of the normal pdf allows the quality control officer to dismiss the possibility that a
3.50 inch cam shaft will occur.
On the other hand, a normal density function is little comfort to a nervous arts festival
sponsor if the standard deviation is relatively large. Suppose that arts festival receipts indicate
that annual attendance has been normally distributed with μ = 30 thousand visitors but σ = 8
thousand. Then, the 68, 95, and 99.7 percent intervals (expressed in units of thousands of
visitors) are [22, 38], [14, 46], and [6, 54]. Thus, nearly one-third of the years the festival is
held, attendance will fall below 22,000 or be above 38,000. Unfortunately, one bad attendance
year is sometimes enough to kill a regional event that depends on volunteers, contributions, and
civic support.
The smaller the σ in relation to μ, the narrower the interval needed to capture any given
probability for a normally distributed variable.
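A short computation can confirm the 68, 95, and 99.7 percent figures for both examples above. The sketch below (Python with scipy.stats.norm is assumed; the text works from the empirical rule directly) evaluates the probability inside μ ± kσ for the cam shaft and arts festival distributions.

# Empirical rule check: probability within k standard deviations of the mean
from scipy.stats import norm

for label, mu, sigma in [("cam shaft diameter (inches)", 3.00, 0.05),
                         ("festival attendance (thousands)", 30.0, 8.0)]:
    for k in (1, 2, 3):
        lo, hi = mu - k * sigma, mu + k * sigma
        inside = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
        print(label, k, round(lo, 2), round(hi, 2), round(inside, 3))  # 0.683, 0.954, 0.997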

The Standard Normal
We have seen that the only things that make one normally distributed variable different
from another are the values for the parameters and . The member of the normal family of
distributions that is easiest to interpret is the standard normal.
DEFINITION: A continuous random variable, Z, has a standard normal pdf if it is normally
distributed with μ = 0 and σ = 1.
For the standard normal, we may condense a statement such as the following:
"There is a 68 percent chance x will lie within one standard deviation of the mean."
to the following instead:
"There is a 68 percent chance that z will lie in the interval [-1, +1]."
By using Z, we have eliminated the need for words like standard deviation or mean. Because the
mean is zero, the 68, 95, and 99.7 percent z intervals are also centered at zero. If σ also is one,
the units automatically count the standard deviations from the mean. For the cam shaft example,
the interval [2.90 inches, 3.10 inches] for diameter X becomes simply [-2, +2] for the standard
normal Z. Due to the compact and unitless form of the standard normal, transforming normally
distributed variables to a standard normal Z variable is often convenient.
How is this transformation accomplished? Recall from Figure 7.24 that two normal
densities with the same have the identical shapes. They are merely displaced from one another
by their different means. By subtracting the mean, we can move the center of each density
function to zero. We also discovered from Figure 7.25 that two normally distributed variables
with the same mean will have different dispersions. Because these dispersions are proportional
to their standard deviations, dividing by give each density function a standard deviation of one.
Any normally distributed random variable X with mean and standard deviation may
be converted to standard normal by subtracting and then dividing by .
z = (x - μ)/σ for each value of x in the sample space
While X is measured in units of the original variable, such as millions of dollars, thousands of
employees, or percent change in stock prices, Z is measured in standard deviations from the
mean.
The standard normal variable Z is measured in units of standard deviations from the mean
of the original X variable.

Chapter Case #6: Sold to the Highest Bidder
Suppose, for example, a school board is considering bids submitted to construct a new
high school. If the amount of the bid X is distributed normally with mean of $20 million and
standard deviation of $4 million, X (measured in millions of dollars) may contain bid data such
as 24, 16, 20, 26, and 18. Standardizing X involves first subtracting 20 to relocate the mean to
zero, and then dividing by four to scale down the dispersion to one. In the process, the X data
just listed are converted to standardized Z values of 1, -1, 0, 1.5, and -0.5 based on calculations
(24 -20)/4, (16 -20)/4, and so forth. The Z data contains the same information as the X data, if
each is interpreted correctly. Thus, x = 24 in units of millions of dollars is equivalent to z = 1
standard deviation above the mean of X because (24 - 20)/4 is +1. Using similar reasoning, x = 18 is the
same as z = -0.5 standard deviations from the mean. Probabilities derived from the standard
normal pdf may then be translated back into the variable units of X for final decision making
purposes.
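The standardizing transformation amounts to one line of arithmetic per bid. A minimal sketch (Python is assumed) reproduces the z values listed above.

# Standardize the bid data: z = (x - mu) / sigma with mu = 20, sigma = 4 (millions of dollars)
mu, sigma = 20.0, 4.0
bids = [24, 16, 20, 26, 18]

z_values = [(x - mu) / sigma for x in bids]
print(z_values)                          # [1.0, -1.0, 0.0, 1.5, -0.5]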

The Importance of the Normal Distribution
Why do we devote so much time analyzing one specific density function? First, the
normal is a commonly observed distribution for many types of business and economic variables.
It closely approximates observed distributions of many financial ratios and market variables, for
example. Secondly, its symmetry, central peakedness, and infinitely long tails make it very easy
to work with statistically. In addition, its well-understood shape can be controlled completely by
changing only two parameters, the mean and standard deviation.
Instead of trying to match each situation with a unique and exotic distribution, the normal
distribution is often used whenever it reasonably approximates reality. And it often does.
Under many conditions, we may also use the normal even when the random variables are
clearly not normal. Furthermore, the normal distribution plays a prominent role in several
statistical methods commonly used in business. A large-sample property we will introduce in
Chapter 8 broadens the applicability of the normal distribution to many types of decision
problems whatever the distribution of the random variables involved.
The normal pdf is the most important distribution because many random variables
encountered in business are approximately normal, while others may be described by the
normal under many conditions. The normal distribution also arises in commonly-used
statistical methods.
How can we tell whether a random variable is in fact approximately normal? And if the
variable is not approximately normal, when can it be converted to a normally-distributed
variable? Occasionally, we don't know anything about the distribution. Even lacking
knowledge about a random variable, we may seek insights from a random sample. If the
distribution of the sample data appears symmetrical with a single, dominant mode, we may infer
that the parent population is approximately normal. A histogram may indicate whether the data
appears to be normal. For smaller sample sizes, however, inferring any particular distribution
from so few observations is often difficult.
Most of the time, we go into a problem with some information about the distribution.
One of the most useful ways of deciding about normality is to investigate studies using similar
data. If others have found a random variable X to be normal in previous years or for similar
products or situations, X is likely to be normal for your data set as well. In addition, the type of
variable also provides important clues that a variable is not normally distributed. Variables that
measure total amounts, such as sales, employment, or assets, are usually highly skewed toward
the few largest members in the population. The wealth of U.S. families, for example, will
contain a handful of billionaires such as Bill Gates, Ted Turner, and the heirs to the Sam Walton
fortune.
However, normally distributed variables important to decision makers are often derived
from variables that are not normal. The companies in the recording industry, for instance, have
net incomes and net worth highly-skewed toward a few very large companies like Sony. But the
rate of return, the ratio of after-tax earnings to net worth, is approximately normal. Although the
Japanese conglomerate Sony may have 1000 times the recording income of a small new-age
music rival, the net worth of the two enterprises may also differ by 1000 to one. Thus,
rates of return for these and other companies in the vast recording industry are likely to be normally
distributed. The same is true for many other financial ratios, such as debt-to-equity and price-
earnings ratios on stocks.
Time series data, especially after adjustment for inflation, seldom have the skewness
encountered with cross-sectional measurements of different objects in the population. Tracking
a single corporation's sales or stock price over a period of time generally yields a narrower range
of values than a cross-sectional collection of sales or stock prices among different corporations.
For the recording industry example, Sony's sales are unlikely to vary by more than 20 or 30
percent over time.
Variables that measure total amounts may be highly skewed for cross-sectional data but
time series data and financial ratios are likely to be normally distributed.
Even when time series data are not normal, informative variables derived from the data
may be normal. Time series data on the stock market indexes, housing starts, and the consumer
price index are seldom normally distributed. However, most business analysts are more
interested in the changes in these measures anyway. Changes in daily stock prices are often the
first thing reported on the business news (The Dow was up 55 points today). These changes
tend to be normally distributed. We even have special names for some of these changes. For
example, the percentage change in consumer prices we call the inflation rate.
Change or percentage change in a variable may be normally distributed even if the
variable itself is not.
For a normal distribution, one standard deviation on either side of the mean should
contain 68 percent of the communications companies. One standard deviation around the mean,
2.5, of the logged employment data is the interval [1.5, 3.5] because the standard deviation of
log10(employ) is 1.0. To convert 1.5 and 3.5 to meaningful employment numbers, we need to
unlog them by raising ten to the power of these logged numbers.¹⁷ Since 10^1.5 is about 32 and
10^3.5 is about 3200, the interval [1.5, 3.5] translates to [32, 3200]. Assuming the distribution is basically
unchanged from 1992, the editor of the industry guidebook can report that just over two-thirds of
the communications firms have between 32 and 3200 employees.

Working with Normal Densities and Probabilities
We now are ready to use the computer to find the actual densities and probabilities under
the normal pdf curve. The Minitab and Excel instructions and sample computer screens for
finding normal densities are shown in Figures 7.31 and 7.32.
Using Minitab to Find Densities and Probabilities for a Normal Distribution
Pull Down Menu Sequence: Calc Probability Distributions Normal...
Complete the Normal Distribution DIALOG BOX as follows:


¹⁷ Recall from algebra that the inverse operation of log10(x) is 10^x.
(1) click to highlight circle next to Probability or Cumulative probability
(2) type in the mean in the box following Mean:
(3) type in the standard deviation in the box following Standard deviation:
(4) highlight Input Constant: and type the value for x (or highlight Input Column:)
(5) click OK button to find the normal density or cumulative probability
Figure 7.31

Using Excel to Find Densities and Probabilities for a Normal Distribution
Click on the fx button to open the Function Wizard - Step 1 of 2 Box:
Select NORMDIST from Function Name listing
Click on the Next button to open the Function Wizard - Step 2 of 2 Box:
(1) type in the value for x in the box following x
(2) type in the mean in the box following mean
(3) type in the standard deviation in the box following standard-dev
(4) type FALSE in the box after cumulative (or TRUE for the cumulative probability)
(5) the normal density or cumulative probability appears on the upper right after Value:
Figure 7.32

First, let's verify the symmetry of the normal pdf by finding the height of the density curve one
standard deviation above and below the mean. For example, suppose annual sales volume is
normally distributed at gas stations franchised by a major petroleum corporation. If mean volume
(measured in millions of gallons/year) is 1.5 and standard deviation is 0.2, then one σ above is
1.7 and one σ below is 1.3 million gallons/year. Figure 7.35 shows the normal densities at 1.7
and 1.3 as Minitab output. Notice that the probability density for each is identical, 1.21.

Probability Density Function

Normal with mean = 1.50000 and standard deviation = 0.200000

x P( X = x)
1.7000 1.2099

Probability Density Function

Normal with mean = 1.50000 and standard deviation = 0.200000

x P( X = x)
1.3000 1.2099
Figure 7.35
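The same two densities can be confirmed outside Minitab. The sketch below (Python with scipy.stats.norm is assumed) evaluates the normal density one standard deviation above and below the gas station mean.

# Symmetry check: densities one standard deviation on either side of the mean
from scipy.stats import norm

mu, sigma = 1.5, 0.2
print(round(norm.pdf(1.7, mu, sigma), 4))   # 1.2099, matching Figure 7.35
print(round(norm.pdf(1.3, mu, sigma), 4))   # 1.2099, identical by symmetry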

The mode is the greatest density and should occur at the mean. The densities should drop off
slowly at first and then much more rapidly as we move farther and farther from the mean. Recall
that the standard normal Z variable has a zero mean and standard deviation of one. Using the
column input option of Minitab, Figure 7.36 contains densities at 0.1, 0.2, and 0.3 standard
deviations above the mean. The last few entries in the figure are densities for 2, 3, 4, and 5
standard deviations above the mean.

Probability Density Function

Normal with mean = 0 and standard deviation = 1.00000

x P( X = x)
0.0000 0.3989
0.1000 0.3970
0.2000 0.3910
0.3000 0.3814
2.0000 0.0540
3.0000 0.0044
4.0000 0.0001
5.0000 0.0000
Figure 7.36

Observe that the density does in fact decline very slowly, from 0.40 to 0.38, as we move from the
mode to 0.3 standard deviations away. By contrast, once we are two standard deviations from
the mean, the density drops to 0.054. The normal density declines rapidly toward zero, though
never actually equaling zero. At five standard deviations from the mean, f(z) is still not zero,
just too small to register a number even in the fourth decimal place.
In Section 7.3, we constructed rectangles to approximate the probability area under pdf
curves. Because f(z) remains nearly 0.4 for values near the mode, the probability that a
normally-distributed random variable X lies within 0.1 standard deviations of the mean is
represented in standard normal Z notation as
P(-0.1 < Z < 0.1)
This probability is approximated by the rectangular area, the product of the height and width. The
probability density is 0.4 and the width of the interval [-0.1, 0.1] is 0.2. We approximate P(-0.1
< Z < 0.1) to equal (0.4)(0.2) = 0.08, or 8 percent of the total probability area. For the school
board example, the board should expect about 8 percent of the construction bids to come in within 0.1
standard deviations of the mean if the bids are normally distributed. Since μ = $20 million and
σ = $4 million, 0.1σ is $0.4 million. Therefore, about 8 percent of bids will fall within $400,000 of
$20 million.
For wider intervals, however, rectangular probability estimates become increasingly poor
approximations. With relative frequencies for discrete random variables, we could sum different
probabilities as cumulative measures. Knowledge of cumulative probabilities helps us
accomplish the same result for density functions.

Using the standard normal Z for cumulative probabilities is usually easier. As we saw
with discrete random variables, the cumulative probability F(z) is the probability P(Z < z). The
same Minitab instructions may be used to obtain cumulative probabilities that we used with the
discrete distributions. Figure 7.37 displays the pattern of accumulated probability for z values of
-3, -2, -1, 0, 1, 2, and 3.
Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1.00000
x P( X <= x)
-3.0000 0.0013
-2.0000 0.0228
-1.0000 0.1587
0.0000 0.5000
1.0000 0.8413
2.0000 0.9772
3.0000 0.9987
Figure 7.37
Thus, F(-2) is about 2.3 percent and F(+1) is over 84 percent. We notice immediately several
facts about the normal pdf. First, as we accumulate area under the density curve, the cumulative
probability becomes larger for larger z.
F(z₁) ≤ F(z₂) if z₁ ≤ z₂
However, very little probability area is accumulated at the beginning where the left tail densities
are nearly zero. By the time we pass three standard deviations to the right of the mean, we have
accumulated virtually all (99.87 percent) of the probability. As a final observation, we note that
the median for the standard normal occurs at z = 0, as seen from the F(0) = 0.50. As expected,
half the area (50 percent) under this symmetrical pdf is located to the left of the mean.
Values generated from F(Z) may also be used to find probability areas for z greater than
a particular value or within an interval of z values. To find the first of these, we subtract the
cumulative probability of the complementary event from 100 percent.
Finding Probabilities of Complements of F(Z): For any value z₁,
P(Z > z₁) = 1 - F(z₁)
For example, Z > 1 has the complementary event Z ≤ 1 (remember that P(Z = 1) = 0 for a
density function, so it makes no difference whether the endpoint is included). Therefore,
P(Z > 1) = 1 - F(1). Since we found F(1) to equal 0.8413,
P(Z > 1) = 1 - 0.8413 = 0.1587, or about 16 percent.
For probabilities within an interval [z₁, z₂], we merely subtract the accumulated area at
the left endpoint from the area at the right endpoint.
Finding Interval Probabilities: For z₁ less than z₂,
P(z₁ < Z < z₂) = F(z₂) - F(z₁)
Thus, P(1 < Z < 2) is the difference F(2) - F(1). From the computer output for these areas found
earlier, this probability is 0.9772 - 0.8413, or 0.1359. Thus, 13.6 percent of the pdf lies between
one and two standard deviations above the mean.
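Both rules are one-liners once a cumulative probability function is available. The sketch below (Python with scipy is assumed; the text reads F(z) from the Minitab output in Figure 7.37) reproduces P(Z > 1) and P(1 < Z < 2).

# Complement and interval probabilities from the standard normal cumulative function
from scipy.stats import norm

F = norm.cdf                         # cumulative probability F(z) = P(Z <= z)

print(round(1 - F(1), 4))            # P(Z > 1) = 1 - F(1) = 0.1587
print(round(F(2) - F(1), 4))         # P(1 < Z < 2) = F(2) - F(1) = 0.1359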
This method of determining interval probabilities for Z may then be applied to any
normally distributed random variables. For the school board case, suppose the existing high
school is so overcrowded or dilapidated that they plan to accept the first bid under $28 million,
the largest amount that won't require passage of a new bond issue. Assuming that bids are
independent, what is the probability that the first bid will be accepted?
To answer this question, we first must translate values of the X variable (the bids on high
school construction) into Z values and then determine the probabilities from the appropriate
cumulative probabilities. Recall that the bids are normally distributed with μ = $20 million and
σ = $4 million. Therefore, the probability of a single bid coming in under $28 million is the same
as P(Z < 2), because z = (28 - 20)/4 = 2. But we earlier found P(Z < 2) = F(2) = 0.9772 by using
the computer. Thus, there is almost a 98 percent chance that the first bid would be acceptable to
the board. The complementary event has a probability of P(X > 28) = 1 - P(Z < 2) = 1 - 0.9772 =
0.0228. It is therefore unlikely that the board will have to wait on more bids.
Notice also from this example that the cumulative probability is approximately 2.3
percent for z = -2. Because the normal pdf is symmetrical, the same percentage, 2.3 percent, lies
more than two standard deviations above the mean. Together, these two tails of the distribution
contain less than 5 percent of the probability, leaving more than 95 percent for the middle
portion. This is the 95 percent rule for the normal we presented earlier in this section. More
precisely, the values for Z are -1.96 and 1.96.
Ninety-five percent of the area under the normal pdf curve is within 1.96 standard
deviations of the mean. The tails each contain 2.5 percent of the probability.
P(-1.96 < Z < 1.96) = .95
In terms of any normal variable X,
P(μ - 1.96σ < X < μ + 1.96σ) = .95

We may generalize this relationship to a definition for z_α/2, where the Greek letter alpha (α)
refers to the combined area in the tails. Thus, if α/2 = .025, then α = .05, leaving 95 percent for
the interval between the tails. More generally,
DEFINITION: z_α/2 is the value of Z necessary to produce a single tail area of α/2 and a combined
area of the tails of α.
P(Z < -z_α/2) = α/2 and P(Z > z_α/2) = α/2
For α = .05, the z_α/2 we want is z_.025 = 1.96. In later chapters, we will use z_α/2 values for the Z
random variable to carry out inferential statistics.
To verify that z_.025 = 1.96, we merely reverse the procedure used to find probabilities
from z values. Now, we determine the z value for a specific cumulative probability α/2. This
inverse procedure is accomplished in Minitab by highlighting Inverse cumulative probability
in the Normal Distribution dialog box. The instructions and sample dialog box are shown in
Figures 7.39 and 7.40. Notice that the inverse cumulative probability of .025 corresponds to
z = -1.95996.
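The same inverse step can be done with any package that supplies an inverse cumulative (quantile) function. The sketch below (Python with scipy is assumed) recovers -1.95996 from the tail area .025.

# Inverse cumulative probability (quantile) of the standard normal
from scipy.stats import norm

print(round(norm.ppf(0.025), 5))     # -1.95996, so z_.025 = 1.96
print(round(norm.ppf(0.975), 5))     # +1.95996 by symmetry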
Using Excel to Find Inverse Standard Normal Cumulative Probabilities
Click on the fx button to open the Function Wizard - Step 1 of 2 Box:
Select NORMSINV from Function Name listing
Click on the Next button to open the Function Wizard - Step 2 of 2 Box:
(1) type in the cumulative (lower tail) probability, such as .025, in the box following Probability
(2) the corresponding z value will appear in the box after Value:
Figure 7.39

This confirms that a left tail area of 2.5 percent (needed to place 2.5 percent in each tail of the normal
and 95 percent in between) requires a z value of -1.96. Thus, F(-1.96) = .025 and z_.025 = 1.96.
This process may also be used to determine z_α/2 for other percentages, such as 99 and 90 (see the
Exercises).
We may also tabulate and plot density curves and cumulative probabilities using the
computer. To do this in Minitab or Excel, we first must make a sequence of z values for the
horizontal axis of our graph. For the standard normal, nearly all the probability area is confined
to two or three standard deviations around zero. For illustration purposes, we have chosen a
sequence of values starting from -2.5 and continuing in increments of 0.1 until +2.5 is reached.
Rather than type all 51 numbers (zero and 25 more on either side of zero), we can let the
computer do most of the work. In Minitab, the Set Patterned Data dialog box in the Calc menu can
easily generate this column. In Excel, the sequence need only be started by typing the first two
numbers, -2.5 and -2.4, in a column. Then extend the sequence by highlighting these two cells,
clicking on the lower right corner, and dragging the mouse down the column far enough to create
a cell with +2.5.
Next, we find the standard normal densities corresponding to each of the z values in your
newly created column. In the Normal Distribution dialog box of Minitab, highlight Input
Column and put the column of z values in the adjoining box. In Excel, use the NORMDIST
function in the Function Wizard to obtain the density or cumulative probability by referencing
the first cell in the z-value column containing -2.5. Then simply drag the lower right corner of
the cell containing the normal probability downward to create the normal densities or
probabilities corresponding to the remaining values of z. Figures 7.41 and 7.42 reproduce these
two Excel tables for the normal pdf. Before the general availability of computers, it was
necessary to compile tables such as this (some filling entire volumes) for every sort of useful
distribution.
These tables may be easily graphed to replicate the standard normal density and
cumulative probability curves. In Minitab, a line graph is generated from the Graph menu by
selecting a line plot in the Plot dialog box. The Chart Wizard in Excel produces similar graphs
from plots of densities or cumulative probabilities. These two Excel plots are displayed in
Figures 7.43 and 7.44.

Observe that the average density of the curve appears to be about 0.25 over the range from -2 to +2 that contains most of the probability area. Thus, a crude measurement of the total
probability area under the density curve is (0.25)(4) = 1.0, or 100 percent. Notice also that rather
than being bell shaped like the normal pdf, the cumulative probability curve rises throughout,
most steeply around z = 0 where the density values of the standard normal are the greatest.
Business Statistics and Applications--Soskin and Braun Page 297

Multiple Choice Review:
7.55 By examining the algebraic function for the normal pdf, it is easy to see that the density
a. is determined from only two parameters, μ and σ
b. is the same for x = μ - 1.5σ as it is for x = μ + 1.5σ
c. always has a positive value
d. is a maximum at x = μ
e. all of the above

7.59 Which of the following is not a characteristic of the normal pdf?
a. it is unimodal
b. it is symmetrical
c. the mean and mode are identical
d. the mean and median are identical
e. all of the above are characteristics of the normal pdf

7.60 Which of the following would be most likely to be approximately normal?
a. the distribution of annual income of U.S. households
b. distribution of city and town populations in Texas
c. age distribution of workers at Kodak
d. number of McDonalds franchises in each of the fifty states
e. sales of manufacturing corporations in the United States

7.61 Which of the following is true about the standard normal distribution:
a. the mean is 0
b. the standard deviation is 1
c. the variance is 1
d. all of the above
e. a and b only

7.63 Ninety-five percent of the area under the normal pdf is
a. within 1.96 standard deviations of the mean
b. within 1.96 of the mean
c. within 1.96 of zero
d. beyond 1.96 of zero
e. none of the above

7.64 If class size at a college is normally distributed, approximately 68 percent of the class
sizes at the college are
a. within one standard deviation of zero
b. within one student of the mean
Business Statistics and Applications--Soskin and Braun Page 298

c. within one percent of the mean
d. within one standard deviation of the mean
e. within one student of the standard deviation
7.65 If class size at a college is normally distributed, then approximately 95 percent of the
class sizes at the college are
a. within two standard deviations of zero
b. within two students of the mean
c. within two percent of the mean
d. within two standard deviations of the mean
e. within two students of the standard deviation

Chapter Case Study Exercises
1. A hardware store has weekly sales that are normally distributed with mean of $18,000
and standard deviation of $4,500. What can you conclude about the median and modal
class midpoint, and approximate percentage of weeks with sales between $9,000 and
$27,000?

Average sale, whether measured by the mean, median, or mode, should be about $18,000
per week because the distribution is normal. Approximately 95 percent of the weeks
should experience sales between $9,000 and $27,000 because 95% of the population for
a normally distributed variable should have values within two standard deviations (2
times $4,500) of the mean, $18,000.
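For readers who want a numerical check of this answer, here is a brief Python aside (assuming scipy is available); the exact normal probability within two standard deviations is about 95.45 percent:

    from scipy.stats import norm

    # P($9,000 < weekly sales < $27,000) for a normal with mean 18,000 and sd 4,500
    prob = norm.cdf(27000, loc=18000, scale=4500) - norm.cdf(9000, loc=18000, scale=4500)
    print(round(prob, 4))   # about 0.9545, i.e., roughly 95 percent of weeks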

2. A quality control chart records the lead impurity levels for 60 motor oil samples collected
today. From years of operation, the production manager knows that impurity levels for
properly functioning machinery are normally distributed with mean of 80 parts per
million (ppm) lead impurities and standard deviation of 12 ppm.
(a) What can you conclude about the median and mode for this distribution?
(b) What is an interval of lead impurity levels wide enough to include approximately 57
(or 95 percent) of the 60 oil samples?
(c) What is an interval of impurity levels wide enough to include virtually all 60 (about
99.5 percent) of the samples?
(d) How many standard deviations from the mean would an impurity level of 140 ppm
be?
(e) How many standard deviations from the mean would an impurity level of 62 ppm be?
(f) To standardize the impurity data to a standard normal variable Z, I would subtract the
number: ______ from each value and then divide by the number: ______ .
By doing so, the new data Z would have a mean equal to ______ and a standard deviation
equal to ______ .
Business Statistics and Applications--Soskin and Braun Page 299


Answers:
(a) median and mode will be the same as the mean, 80 ppm.
(b) plus or minus 2σ is 80 ± 24, or an interval of (56 ppm, 104 ppm)
(c) plus or minus 3σ is 80 ± 36, or an interval of (44 ppm, 116 ppm)
(d) (140 - 80)/12 = 5 standard deviations above the mean
(e) (62 - 80)/12 = -18/12 = -1.5, that is, 1.5 standard deviations below the mean
(f) subtract 80 and divide by 12; the mean of Z is 0 with standard deviation of 1.


CASE MINI-PROJECT:
Since the local pizza parlor opened several years ago, the number of orders per day has been
normally distributed with mean of 32 and standard deviation of 7. The manager tries to have the
proper amount of dough, toppings, and staff on hand: too much increases the operating cost and
too little results in lost customers. From this information, answer the following:
1. The 50th percentile for this distribution is ______ pizzas ordered and the most commonly
occurring order is ______ pizzas ordered. (Note: the answer to each blank is a number, not
pepperoni.)

2. What is an interval wide enough to include the number of pizzas ordered on approximately 95
out of 100 days?

3. If the manager hires enough staff to handle between 25 and 39 pizza orders per day, she can
expect to have anticipated the correct staffing on approximately what percentage of the days that
the pizza parlor operates?

4. Four pizzas ordered today would correspond to how many standard deviations below the
mean?

5. How many standard deviations above the mean would orders for 42 pizzas be? (express
answer as decimal fraction)

6. To convert the number of pizzas ordered X to a standard normal variable Z, first subtract
______ from X and then divide by ______ . (Note: the answer to each blank must be a number!)
Business Statistics and Applications--Soskin and Braun Page 300




Chapter 8 Intro to Sampling Distributions,
Estimation, and Testing

Approach: The concepts developed regarding probability, random variables, and the standard
normal distributions are used to extend our knowledge about sample inference. Simulation,
sampling techniques, and density functions are applied to introduce the general subject of
sampling distributions, estimation, and hypothesis testing.
Where We Are Going: The inference techniques introduced in this chapter will be
specifically applied to univariate and multivariate model inference methods. Additional
techniques and sampling distributions will be introduced as needed.
New Concepts and Terms:
statistics and sampling distribution
t distribution
point and interval estimation
sampling error and bias
confidence interval, confidence level, and confidence coefficient
standard error and estimated standard error of the mean
testing, null and alternative hypotheses, one- and two-sided tests, type I and II errors
statistical significance, significance level, p-value decision rule, test statistics

SECTION 8.1 Introduction to Statistical Inference
SECTION 8.2 Sampling Distributions
SECTION 8.3 Estimation and Confidence Intervals
SECTION 8.4 The Fundamentals of Hypothesis Testing
SECTION 8.5 The Practice of Inferential Statistics and Common Pitfalls
Business Statistics and Applications--Soskin and Braun Page 301

8.1 Introduction to Statistical Inference
How do people survive, let alone succeed, in a business world where randomness is
everywhere you look? On any particular day, you could get fired or promoted. This year, your
company could set earnings records or absorb crippling losses. Fortunately, as we learned in
Chapters 6 and 7, knowing the probability of each possible outcome can help us sort through and
make informed decisions in our uncertain world. In fact, we may even use probability
information in place of data. Armed with only probabilities from a probability distribution or
density function, we can find the mean, standard deviation, and other summary measures of
important random variables.
This chapter bridges the remaining gap between the descriptive data chapters of Part I
and the upcoming chapters on statistical inference. Recall that sample data allows us to infer
information about a variable or relationships among variables when complete data from the
population is too expensive, time consuming, or otherwise unobtainable. Samples only contain a
portion of the information in a population, so sample inferences about a population will almost
always be wrong. The critical question is: How wrong and in which direction?
For example, suppose you conduct taste tests and find that 40 percent of the 100 soft drink consumers in your sample prefer your company's new cola over the market leader. How close is
this figure to the preference percentage of all cola drinkers? Forty percent could be too high or
too low, very close or far away from the actual population percentage. You can only know for
sure by launching the cola in the market, a costly and potentially disastrous decision.
Although sample data cannot provide answers with absolute certainty, statistical
inference methods we will be learning can get us the next best information: probability-based
insights. For example, inference methods in Chapter 9 may allow you to be reasonably sure that
between 32 and 48 percent of the cola-drinking population prefer your new cola brand. We will
even choose how reasonably sure a decision-maker needs to be: 80 percent or 99 percent.
To make this kind of inference, however, we must first learn how reliable our sample
information is. Random sampling and carefully-designed experiments may protect our results
from bias, but samples still must vary or they wouldn't be random. Before making inferences
about the population percentage from cola taste tests, for example, we need to know the
probability distribution of the sample percentage. That percentage is a random variable because
it varies from one random sample to the next with specific probabilities.


Business Statistics and Applications--Soskin and Braun Page 302

8.2 Sampling Distributions
After our exploration of probability distributions and random variables, we are now better
equipped to find probabilities associated with sample inference. For continuous random
variables, we learned that each range of values has a probability corresponding to an area under
the density curve. For example, the probability area for the standard normal may be calculated
from the z values. More generally, if we know the distribution family and the value of its
parameters (such as μ and σ for the normal pdf), we may find any probability we need for that
distribution.
In Part II, we only considered a particular type of experiment: random variables involving
individual observations. For example, if the random variable under investigation is annual
income, our experiment consisted of recording annual income for a randomly-selected engineer
in the population.
Statistical inference extends experiments to the full sample. Once we have an entire
random sample as our observation, random variables may be derived from the information in that
sample. Some examples of random variables based on sample information are the sample mean
of CEO salaries, the sample proportion of managers holding a Masters degree, or the regression
coefficient of population for a sample drawn from postal volume data. Each of these is a random
variable because they will vary from one sample to the next according to some probability
distribution. Because these random variables, or statistics, are gathered from information in a
whole sample, their probability distribution is called a sampling distribution.
DEFINITION: A statistic is a random variable derived from information in a random sample,
and its probability distribution is the sampling distribution for that statistic.
1

Back in Chapter 3, we learned about a particular type of statistic called an estimator. The sample mean X̄ is an estimator of μ, the population mean of X. Therefore, the density function for X̄ is a very useful sampling distribution indeed. If we know this sampling distribution, we can find the probability that the sample mean will fall within any particular range of values.
We may use knowledge about the sampling distribution to make important inferences about mean income. In a random sample of engineers, for example, the sample mean income is an estimator. After all, the sample mean is a random variable because it has numerical values and has a probability distribution. The probability of each possible value of X̄ is determined by the probability of drawing a random sample whose mean is that value of X̄.


1
Don't get confused. The plural, statistics, is the name of the subject of this textbook. Also, most people commonly use statistics
to refer to demographic and financial numbers reported in newspapers and government documents.
Business Statistics and Applications--Soskin and Braun Page 303

To illustrate how this works, suppose a large software design company pays its thousands of engineers either $40,000 or $80,000 annually, with an equal number of engineers at each of the two possible pay scales. In a random sample of four observations, X̄ = $40,000 can only occur if all four engineers earn $40,000. Thus, P(X̄ = 40,000) is (1/2)^4, or 1/16. Similarly, X̄ = $80,000 requires all four salaries at that level, so P(X̄ = 80,000) is again 1/16. The other possible values, $50,000 and $70,000, each result from four equally likely sample sequences. For example, a sample mean of $50,000 requires that one engineer make $80,000 while the other three earn $40,000. However, the $80,000 engineer can occur first, second, third, or fourth in the sample. Thus, P(X̄ = 50,000) = 4(1/16), or 1/4. The remaining possibility, X̄ = $60,000, occurs on average in 6/16, or 3/8, of samples. The sampling distribution for this statistic X̄ is summarized in Figure 8.1. Just like any other distribution, the probabilities for this sampling distribution sum to 1.0.
X̄          P(X̄)
$40,000     1/16
$50,000     4/16
$60,000     6/16
$70,000     4/16
$80,000     1/16
Total       16/16 = 1
Figure 8.1
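Because the population here has only two equally likely salaries, the probabilities in Figure 8.1 can be verified by brute-force enumeration. The following Python sketch is an illustrative aside (any software would do); it lists all 2^4 = 16 equally likely samples of four engineers and tallies the resulting sample means:

    from itertools import product
    from collections import Counter

    salaries = (40000, 80000)                    # two equally likely pay scales
    counts = Counter()
    for sample in product(salaries, repeat=4):   # all 16 possible samples of n = 4
        counts[sum(sample) / 4] += 1
    for xbar in sorted(counts):
        print(f"{xbar:8.0f}  {counts[xbar]}/16")
    # prints 40000 1/16, 50000 4/16, 60000 6/16, 70000 4/16, 80000 1/16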

Because of its simplicity, we could calculate the sampling distribution here for X̄. Engineering income was a discrete random variable with two equally-likely outcomes. In practice, sampling distributions are too complex to be found this way. Discrete random variables normally involve a more diverse range of outcomes and event probabilities. Moreover, companies usually have nearly continuous pay scales, so X̄ is also a continuous random variable
with a density function for its sampling distribution. Sampling distributions for estimators and
other statistics, such as the sample proportion, will also be continuous random variables though
the proportion is calculated from discrete variables. The sample proportion of engineers holding
a Masters degree is a continuous random variable in the interval [0, 1] although both the
denominator number of engineers and the numerator engineers with a Masters degree are
discrete random variables.
Business Statistics and Applications--Soskin and Braun Page 304

In Part II, we dealt with uncertainty by attaching probabilities to our findings. In
estimation problems, these probabilities are found by matching estimators to known density
functions such as the normal pdf. While sampling distributions for the mean and proportion help
us estimate, another type of inferential statistic will allow us to test whether some vital business
condition is true or not. If we know the sampling distribution for an estimator or test statistic (at
least approximately), we may make important statistical inferences from the sample data.
The sampling distribution of a statistic is used to make inferences about the unknown
information from populations that can only be randomly sampled.
One of our most important tasks, then, is to find sampling distributions for business
situations. Although they cannot be calculated directly, sampling distributions can usually be
obtained by following a simple strategy. For now, we will only illustrate this strategy by
exploring the sampling distribution of the sample mean under alternative realistic conditions.
The simplicity of the results may very well surprise you. Our findings here will set the pattern
for the sampling distributions we will need for inferences involving proportions, regression and
experimental design models, and many other types of statistical analysis. We introduce the
distribution for the sample mean by a case study from recent business history.

Chapter Case #1: The Urge to Merge
A massive merger wave helped mark the 1980s as the Decade of Greed. Not since the
Robber Barons had we seen this many large corporate acquisitions. Unlike previous merger
waves, multibillion dollar giants were vulnerable to takeover. Colorful corporate raiders like T.
Boone Pickens, junk-bond king Michael Milken, and faceless investment bankers assembled
unprecedented amounts of capital for their takeover bids. Companies that were household names
on the corporate landscape for more than a century disappeared overnight. All of the major food
manufacturers and half the largest oil companies were gobbled up. Firms that acquired banks,
airlines, and book publishers were themselves bought up.
Then as suddenly as it began, the merger wave subsided. After peaking around 1987,
mergers declined rapidly. The savings and loan scandal and crackdown on junk bonds dried up
the capital required to mount corporate raids. To retain power, managers of target firms pushed
through stockholder voting rules that blocked takeover challenges. State and Federal
government policies shifted toward increased scrutiny of mergers.

Business Statistics and Applications--Soskin and Braun Page 305

Looking back, were government policymakers wrong to permit this merger wave to
persist or did they unnecessarily impede a healthy reorganization process? The threat of
takeover may prod management of merger targets to be more efficient and responsive.
Alternatively, it also encourages unproductive defensive tactics to ward off potential suitors.
How can we tell if mergers are healthy for the economy and stockholders?
One way is to examine the profit performance of the acquiring firms. Merger can
improve efficiency if the acquiring firm has superior management skills that result in higher
profits. On the other hand, poorly performing companies may use takeovers to distract their
disgruntled stockholders. This second type of company will report lower profits.
Suppose an antitrust economist was preparing a report to help the Justice Department
reconsider its merger guidelines just as the merger wave was peaking. To investigate whether
acquiring firms were marked by high or low profit performance, the economist selected 1987
profit performance for the largest merger participants. To avoid making statistical inference, the
economic analyst would need complete profit data on all large acquisition-prone firms. For
example, there were 164 corporations with 1987 acquisitions totaling at least $100 million.
2
A
common measure of profit performance is the rate of return (ROR) on corporate assets. Figure
8.2 displays Minitab descriptive statistics on ROR, net income as a percent of total assets.
Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
ROR 164 3.421 3.000 3.230 5.812 0.454

Variable Min Max Q1 Q3
ROR -13.000 38.000 0.000 6.000
Figure 8.2


2
All data for this case study were derived from Compustat Public Disclosure tapes. Rate-of-return data have been rounded to the
nearest whole percentage point.
Business Statistics and Applications--Soskin and Braun Page 306

The antitrust economist could then use procedures
from Chapter 3 to report the central tendency and
variability measures of profit performance. Although
rates of return ranged from losses (denoted by a
negative sign) of 13 percent to profits of 38 percent,
the overall population mean μ = 3.4 percent was
considerably below average profits earned by other
large corporations. In fact, the Q3 quartile value
indicates that only one-fourth of these acquisition-
prone firms earned a rate of return exceeding a barely-
respectable 6 percent. Rather than displaying superior
management skills, acquisitions may be better
interpreted as attention-getting distractions from
mediocre profits. Armed with complete population
data, the economist would report that tougher Justice Department merger guidelines would have
little adverse impact on profits in our economy. In fact, large corporations might be forced to
improve their own internal efficiency if the merger option is less available to them.
However, these findings are based on having complete population information. Critical
population data are seldom available in a timely fashion for decision makers. Perfect 20-20
hindsight is of little use to a policymaker at the time. The clear-and-present economic danger
from the 1980s runaway merger wave may have prevented the Justice Department policymakers
from working with full population data. At the time for intervention, information would not yet
be publicly available on ROR for most acquisition-prone firms. Instead, the antitrust economist
would need to infer population information from research data on a small sample. What if staff
time and budget resources only permitted her to obtain reliable profit data from a random sample
of n = 16 corporations in the population?
What would such a sample look like and how would it compare with the full population
examined earlier? Could a sample of a handful of firms have told her very much about the true
population mean, or would the sample be as likely to be misleading as helpful? Before we try to
answer this, let's first examine several random samples.
samp1 samp2 samp3 samp4 samp5
3 5 1 2 -1
9 1 1 6 0
-2 4 4 -1 -3
4 -1 5 5 1
-4 3 2 -1 -1
8 5 7 1 -3
-6 8 13 6 6
11 1 -2 3 4
1 0 6 -4 -7
8 5 2 4 -1
6 -3 8 6 0
-9 4 9 -10 7
0 2 6 6 5
2 4 13 0 4
4 3 0 1 2
7 1 1 9 1
Figure 8.3
Business Statistics and Applications--Soskin and Braun Page 307

In Chapter 5, we learned about simple random samples and used the computer to simulate
the random sampling process that generates survey and experimental data. Because we have information that was unavailable at the time (actual 1987 population data for the 164 acquisition-prone corporations), we can simulate random samples like the ones the antitrust economist would collect. Figure 8.3 contains five simple random samples, each containing n = 16 observations drawn from the 1987 population of corporate ROR data.
3

It is immediately apparent that the samples are all different. With so many different
profit rates in the population, finding two out of five samples the same would be highly unlikely.
Of course, if we repeated this simulation experiment, five different samples would almost surely
be drawn.
As a result of the small sample size and the variability in the profit data, the sample
means vary from sample to sample. The relevant portion of Excel descriptive statistics is
displayed in Figure 8.4. Notice that X̄ for sample 3 is the greatest (4.75) primarily because it
contains two firms with large ROR values of 13 percent. Sample 5, which contains the most
ROR observations with losses (i.e., negative values), also has the smallest sample mean (0.875).
Nevertheless, the sample means are remarkably close to the actual population mean, μ = 3.42 percent. In fact, the means for samples 1, 2, 3, and 4 are all within 1.4 percentage points of μ. The differences we do find are not due to bias, because we learned in Chapter 3 that X̄ is an unbiased estimator of μ. Thus, the average value of X̄ for a very large number of these samples must approximately equal μ.


3
The samples were obtained by repeated use of the Sampling dialog box from the Data Analysis tools menu (explained in Chapter 5).
The Input Range selected is the ROR population in the chapter spreadsheet.
samp1 samp2 samp3 samp4 samp5
Mean 2.625 2.625 4.75 2.063 0.875
Standard Error 1.426 0.682 1.116 1.174 0.926
Standard Deviation 5.702 2.729 4.465 4.697 3.704
Minimum -9 -3 -2 -10 -7
Maximum 11 8 13 9 7
Count 16 16 16 16 16
Figure 8.4
Business Statistics and Applications--Soskin and Braun Page 308

Unfortunately, our antitrust economist back in 1987 could not benefit from the average of
repeated samples. She would only have time to collect a single sample! She might have drawn
samples like the five we drew here, or any one of the trillions of equally-probable samples of 16
observations from the 164 in the population. Some of these samples have X̄ close to μ while others are far away. How could the economist judge which kind of sample mean she has? To infer
anything about the population mean from only one sample, she needed to know the sampling
distribution for the sample mean.

Sampling Distribution of Sample Mean for Known Standard Deviation
We begin our search for sampling distributions with the estimator X̄ when X is normally distributed and the standard deviation σ is already known. Later in this section, we will relax each of these two conditions and investigate situations involving unknown σ and non-normal X. In later chapters, sampling distributions for proportions, regressions, and other important inference situations will use many of the same concepts and procedures developed here.
If X is normally distributed, the sampling distribution for X̄ will also be normally distributed. If X is centered around μ, the sampling distribution will also be centered around μ.
[Histogram of the 1987 population of rates of return (inc/ta) for acquisition-prone firms; vertical axis: Frequency]
Figure 8.5
Business Statistics and Applications--Soskin and Braun Page 309

If X is a normally distributed random variable with mean μ, then X̄ will also be normally distributed with mean μ.
We learned in Chapter 7 that certain types of business-related random variables are approximately normal. For example, financial ratios such as the rate of return being examined by our antitrust economist are often normally distributed. Though total asset data are highly skewed toward a few very large corporations, net income is likely to be similarly skewed as well. The ratio (or percentage) of these two skewed variables may approximate a normal distribution. Examination of the histogram (see Figure 8.5) indicates that the population of profit rate data is approximately normally distributed. Even though the antitrust economist did not have access to this histogram information on the full population, experience with previous years' profit data (or with similar financial data) led her to assume a normal distribution for 1987 profits. Then, whichever sample she collected, the analyst could consider the sample mean an unbiased and normally-distributed estimator of μ.
She therefore knew the distribution of X̄ and had an unbiased estimator of one of its parameters, μ. But we learned from Chapter 7 that the normal pdf is a two-parameter family of distributions. Recall that any normally distributed random variable X may be converted to the standard normal Z by the formula Z = (X - μ) / σ. If X is normally distributed, with its mean estimated by X̄ itself, then all we lack is the standard deviation for the sampling distribution.
The population standard deviation σ measures the variation of a random variable X. For the sampling distribution of X̄, we are after another type of standard deviation: the average variation from sample to sample of X̄. This is the standard error of the mean.
DEFINITION: The standard error of the mean, σX̄, is the standard deviation of X̄ and measures how much the sample mean varies in repeated random sampling of X.
We will need to know σX̄ to make sample inferences about μ. As you might expect from the similarity in the notation, σX̄ is related to σ and may be derived directly from it.
If X is a random variable with a known standard deviation σ, then the standard error of the mean σX̄ is directly proportional to σ and inversely proportional to the square root of the sample size n:
σX̄ = σ / √n
Business Statistics and Applications--Soskin and Braun Page 310

The direct relationship with σ is logical. The greater the variability in the data, the more X̄ will vary among samples. Variation in X̄ results from the fact that samples can look substantially different from their parent population. But very different samples are most likely when the population data is highly variable. In Chapter 3 we learned that σ is a measure of the average variability of X. Thus, σ is also a primary source of variability for X̄. If σ is relatively small, sample means will be close to the population mean. A large σ, on the other hand, can allow an individual sample to contain observations that are all far more than (or far less than) μ.
However, so long as samples contain more than a single observation, X̄ must vary less than X itself. The larger the sample size, the more similar the X̄'s will be to one another and to the population mean μ. Larger random samples are more likely to resemble the population in every way, including sharing similar means. Why do we divide by √n instead of by n? One explanation is that the standard deviation is itself a square root, as we learned in Chapter 3.
These appeals to logic and intuition are necessary here because the derivation of the sampling distribution for X̄ involves math beyond the scope of this text. To add further support, we will examine simulation evidence shortly.
The direct relationship between σX̄ and σ indicates that, for equal-sized samples, doubling σ will also double σX̄. If σ = 10 and n = 25, for example, then
σX̄ = 10 / √25 = 10 / 5 = 2.
But if σ doubles to 20, then σX̄ = 20 / 5 = 4, or twice as large.
Dividing by √n has the opposite effect. Quadrupling the sample size, for instance, causes the standard error of the mean to be cut in half. If σ = 10 and n is increased fourfold from 25 to 100, σX̄ declines from 2 to
σX̄ = 10 / √100 = 10 / 10 = 1.
We may combine all our results into a single set of conclusions.


Business Statistics and Applications--Soskin and Braun Page 311

If X is normally distributed with mean μ and standard deviation σ, X̄ will also be normally distributed with mean μ and standard deviation σX̄ = σ / √n. Therefore, the statistic
(X̄ - μ) / σX̄
has a standard normal sampling distribution Z.

For the antitrust economist to determine σX̄, she first needed to know the approximate value of σ. Knowledge about σ is usually gained from previous or similar studies. Such information is often available in quality control analysis and other business situations where continual monitoring of similar production systems is conducted. Suppose in this case that there was reliable information about the true population value of σ: profit data on acquisition-prone corporations from the early 1980s was available in 1987. If the variability of ROR data was fairly constant over this period, σ for this historical data could be used. The antitrust economist would then divide this number by the square root of the sample size to arrive at the standard error of the mean. Using the known value σ = 5.8 and sample size n = 16,
σX̄ = 5.8 / √16 = 5.8 / 4 = 1.45
She could use this value in combination with the sample mean to infer things about the population mean. For example, suppose her survey resulted in the third sample, where X̄ = 4.75. Knowing that sample means have a standard deviation of 1.45, she would not have ruled out the possibility that μ is as large as 6 or 7, or as small as its actual value, 3.42. In Chapter 9, we will apply information on σX̄ more formally to estimation and hypothesis testing problems. Right now, our objective is to continue exploring sampling distributions under alternative inference
conditions.
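The behavior of σX̄ = σ / √n is easy to confirm numerically. The Python sketch below is an illustrative aside that reproduces both the σ = 10 examples above and the economist's value of 5.8 / √16 = 1.45:

    import math

    def std_error(sigma, n):
        """Standard error of the mean for a known population standard deviation."""
        return sigma / math.sqrt(n)

    print(std_error(10, 25))    # 2.0
    print(std_error(20, 25))    # 4.0  (doubling sigma doubles the standard error)
    print(std_error(10, 100))   # 1.0  (quadrupling n halves the standard error)
    print(std_error(5.8, 16))   # 1.45 for the ROR sample of 16 corporations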

The t Distribution When the Standard Deviation Is Unknown
What if the value of σ is not known? After all, if we don't know σ, we probably don't know much about the average variability around μ either. However, there is no need to start over from scratch. By analyzing those cases where we do know σ, we also introduced many of the same concepts and procedures for sample inference when σ is not known. We thus may use a modified version of the formulas developed in the previous section. All we need to do is replace σ by an estimator for σ.
Business Statistics and Applications--Soskin and Braun Page 312

In fact, we already know about this estimator: the sample standard deviation s introduced in Chapter 3. By substituting s for σ in our definition of the standard error of X̄, we have an estimator of σX̄ called the estimated standard error of the mean.
DEFINITION: The estimated standard error of the mean, sX̄, is
sX̄ = s / √n
where s is the sample standard deviation and n is the sample size.

Recall that we are already using X̄ as an estimator of the mean for the sampling distribution. We are only going one step further, using sX̄ to estimate σX̄, the other parameter of the sampling distribution. Like a sugar substitute, we may use sX̄ in place of σX̄ when σ is not known.
The estimated standard error of the mean, sX̄, may be used in place of σX̄ when σ is unknown.
How closely do our five samples from the 1987 ROR data estimate σ? Figure 8.6 reproduces the earlier descriptive statistics results. The sample standard deviation for sample 1, s = 5.7, is very close to the actual population value, σ = 5.8. The other samples do not fare quite as well. The worst estimate is from sample 2, whose s = 2.729 is less than half of σ. Notice that although the first two samples have identical sample means (2.625), sample 2 has less than half the sample standard deviation because of its narrower range of values.
Excel and Minitab also report the standard errors in their descriptive statistics. In Figure 8.6, these five standard errors obey the relationship to s and n prescribed by our definition of sX̄. The five samples contain n = 16 observations, so each estimated standard error is one-fourth the size of the sample standard deviation. For example, s = 3.704 for sample 5. The estimated standard error reported in Figure 8.6 is 0.926, which is one-fourth of s. Because it does best at estimating σ, sample 1 also comes the closest to estimating σX̄ (1.426 versus 1.45).
samp1 samp2 samp3 samp4 samp5
Mean 2.625 2.625 4.75 2.063 0.875
Standard Error 1.426 0.682 1.116 1.174 0.926
Standard Deviation 5.702 2.729 4.465 4.697 3.704
Minimum -9 -3 -2 -10 -7
Maximum 11 8 13 9 7
Count 16 16 16 16 16
Figure 8.6
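These descriptive statistics are simple to reproduce. The Python sketch below is an aside that uses the sample 1 data from Figure 8.3 to compute the sample mean, the sample standard deviation s, and the estimated standard error s / √n:

    import statistics, math

    samp1 = [3, 9, -2, 4, -4, 8, -6, 11, 1, 8, 6, -9, 0, 2, 4, 7]   # n = 16 ROR values
    n = len(samp1)
    xbar = statistics.mean(samp1)          # 2.625
    s = statistics.stdev(samp1)            # 5.702 (divides by n - 1)
    se = s / math.sqrt(n)                  # 1.426, one-fourth of s since sqrt(16) = 4
    print(xbar, round(s, 3), round(se, 3))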
Business Statistics and Applications--Soskin and Braun Page 313

Unknown σ also creates another difficulty. If σ is not known, we also can no longer use the normal distribution as our sampling distribution. Because s is only an estimate of σ, using sX̄ in place of σX̄ introduces additional sampling error. The resulting increase in uncertainty alters the shape of the sampling distribution. The smaller the sample size used to calculate s, the more uncertainty is introduced. A new and more flexible density function is therefore required to meet our sampling distribution needs. This is called the Student t, or simply the t, distribution.
DEFINITION: The t distribution is a family of density functions, t(k), with the k parameter distinguishing one member of the family from another.
"Student" was the pen name of William S. Gosset. Gosset discovered the t distribution nearly a century ago while he was employed at Guinness, a beer company better known today for its book of world records.
4

The t distribution has precisely the properties we need for the sampling distribution of X̄ (see Table 8.1).
Table 8.1
Properties of the t Distribution

1. Symmetrical and unimodal
2. Infinitely long tails
3. Mean μ = 0
4. Standard deviation σ > 1, with tails thicker than the Z distribution
5. The thicker tails and larger σ are most noticeable when k is very small
6. Approximates the Z distribution for large values of k

Notice that the t and standard normal distribution Z share the first three properties. But unlike Z,
the parameter k allows the shape of the t distribution to vary. Figure 8.7 shows t distributions for
two different values of k, 3 and 15, on the same graph with the standard normal. For t(3), there
is substantially more area under the tails of the t density curve than there is for t(15). The tail
areas for the Z distribution dissipate even more abruptly. When k is large, the t and Z density
curves are virtually indistinguishable from one another.


4
In addition to deriving and tabulating this distribution, Gosset also verified its correctness by plotting X for hundreds of samples of
n = 4 finger lengths of criminals! A readable account is found in D. B. Owen (ed.), On the History of Statistics and Probability, (New York:
Marcel Dekker, 1976), Chapter 1.
Business Statistics and Applications--Soskin and Braun Page 314

What does all this have to do with sampling distributions? Recall that (X̄ - μ) / σX̄ has a standard normal distribution. Its counterpart for unknown σ, (X̄ - μ) / sX̄, has a t distribution.
For a random sample of n observations of a normally distributed random variable X, (X̄ - μ) / sX̄ has a t(n-1) distribution with parameter k equal to (n - 1).
Because its tails vary in thickness, the t distribution is able to incorporate the added uncertainty introduced when σ must be estimated by s. When we have lots of information about the population, s will generally be an excellent estimator for σ. The t distribution behaves in precisely this manner. For large n, t(n-1) closely resembles Z. However, for small samples, s may vary substantially from σ. The t distribution reflects this added uncertainty by producing thicker tails than Z.
[Density curves for the standard normal, t(3), and t(15) distributions plotted from -3 to +3: "Normal and t Distributions"]
Figure 8.7
The errors in estimating σ by s are determined by the amount of information available to
Business Statistics and Applications--Soskin and Braun Page 315

calculate s. As we discussed in Chapter 3, degrees of freedom are the separate pieces of information remaining after all parameters have been estimated. In a sample of size n, we begin with n degrees of freedom. However, calculation of s uses up one degree of freedom to first calculate X̄. That was the reason for k = n - 1 degrees of freedom in the sampling distribution.
The sampling distribution for X̄ when σ is unknown has a t distribution whose shape varies with the sample size. Instead of zα/2 values, we now need tα/2 values from the sampling distribution. Because the t(n-1) density curve changes its height and tail thickness for different sized samples, however, the tα/2(n-1) values also vary with n.
tα/2(n-1) is the number of estimated standard errors required to place α/2 probability in each tail and (1 - α) in the middle of the t(n-1) distribution.
We may again use Minitab or Excel to obtain probabilities and tα/2 values. Instructions and dialog box examples are provided in Figures 8.8 through 8.10.
Using Minitab to Find Densities, Probabilities, and t Values for a t Distribution
Pull Down Menu Sequence: Calc > Probability Distributions > T...
Complete the T Distribution DIALOG BOX as follows:
(1) highlight the circle next to Probability density, Cumulative probability, or Inverse cumulative probability.
(2) type in the degrees of freedom, n - 1, in the box following Degrees of freedom:
(3) highlight Input Constant: and type the value for x (or highlight Input Column:)
(4) click OK to find the t density, cumulative probability, or inverse cumulative probability.
Figure 8.8

Using Excel to Find Probability Areas in the Tails of a t Distribution
Click on the fx button to open the Function Wizard - Step 1 of 2 Box:
Select TDIST from Function Name listing
Click on the Next button to open the Function Wizard - Step 2 of 2 Box:
(1) type in the absolute value of x in the box following x
(2) type in the degrees of freedom, n - 1, in the box following degrees-freedom
(3) type 1 for the probability area of one tail (or 2 for both tails) in the box following tails
(4) the t distribution tail probability appears on the upper right after Value:
Figure 8.9
Business Statistics and Applications--Soskin and Braun Page 316

Using Excel to Find t Distribution Values for Two-Tailed Probability Areas
Click on the fx button to open the Function Wizard - Step 1 of 2 Box:
Select TINV from Function Name listing
Click on the Next button to open the Function Wizard - Step 2 of 2 Box:
(1) type in the combined probability area of the two tails in the box following probability
(2) type in the degrees of freedom, n - 1, in the box following degrees-freedom
(3) the inverse t distribution value appears on the upper right after Value:
Figure 8.10

For the economist's survey of 16 corporations, there were n - 1 = 15 degrees of freedom. The t value necessary to capture 95 percent of the probability is t.025(15). Observe from Figure 8.12 that t.025(15) = 2.13 is larger than z.025 = 1.96 for the standard normal. Because of the thicker tails of the t distribution, we therefore must go an additional 1/6 of a standard deviation farther into each tail to accumulate 95 percent of the probability under the sampling distribution.
If she had used an even smaller sample, say n = 4, the t value would have increased to t.025(3) = 3.18, 60 percent larger than 1.96. On the other hand, large samples have tα/2(n-1) very close to zα/2. Consider the case of n = 64. With 63 degrees of freedom, t.025(63) becomes only 2.00, which is barely greater than 1.96. The Exercises at the end of this section contain opportunities to generate other tα/2 values and explore additional properties of the t
distribution.
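The tα/2(n-1) values quoted in this discussion can be confirmed with most statistical software. The brief Python sketch below (an aside, assuming scipy) looks up the two-tailed 95 percent values for 3, 15, and 63 degrees of freedom and compares them with z.025:

    from scipy.stats import t, norm

    for df in (3, 15, 63):
        print(df, round(t.ppf(0.975, df), 2))   # 3.18, 2.13, 2.00
    print(round(norm.ppf(0.975), 2))            # 1.96 for comparison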
One other point about the t distribution deserves to be mentioned. Until recently, it was accepted practice to revert to zα/2 approximations of tα/2(n-1) whenever sample size exceeded 30.
5

Rather than introduce as much as a 3 to 7 percent downward bias (see Exercises), statistical software now makes it just as easy to use the precise t distribution values, regardless of sample size. Thus, the t distribution is applicable to all cases of normally distributed X and unknown σ.


5
One justification for this practice was that abbreviated t tables often omit values for more than 30 degrees of freedom or listed t
values only for a few larger degrees of freedom.
Business Statistics and Applications--Soskin and Braun Page 317

The t distribution is perhaps the most commonly-used sampling distribution in business
statistics. In addition to inferences about population means, the t distribution also arises in
regression modeling and experimental design situations. Chapters 9, 10, and 11 will discuss
these applications extensively.

The Central Limit Theorem: Approximating the Sampling Distribution
In many instances, assuming a population is approximately normal is reasonable. But
what if nothing is known about the population distribution or what if experience suggests the
distribution is not normal? Can we then say anything about the sampling distribution?
Surprisingly, we can if the sample size is large enough. This conclusion results from something
with an imposing title, the central limit theorem, one of the most useful results in statistics.
6

One convenient version of this theorem states the following:
Central Limit Theorem: The sampling distribution of X̄ for a random variable X whose mean is μ and standard deviation is σ will approximate a normal pdf with mean μ and standard deviation σX̄ = σ / √n if the sample size n is sufficiently large.
The larger the sample size n, the closer the approximation is to normality.
How large is sufficiently large? The answer depends on the distribution of X. For distributions that look very unlike the normal density curve (such as highly-skewed, thick-tailed, multi-modal, or uniform distributions), samples of 50 or even as many as 200 observations may be needed for the sampling distribution of X̄ to become approximately normal. If fairly symmetrical, unimodal populations are sampled, on the other hand, the sampling distribution of X̄ may closely resemble the normal pdf for samples as small as 10 or 15 observations.
The central limit theorem occupies an indispensable position in statistical inference. By
using only the normal pdf, we may make inferences from the sample mean (and from other
sample statistics) regardless of the distribution of the population sampled. What about situations
where σ is unknown? The central limit theorem allows us to approximate the sampling distribution by a t distribution that again uses the estimated standard error sX̄.


6
A theorem is the mathematical equivalent to theory in the sciences. As we shall discuss in the next chapter, a theory is accepted by
testing it against experimental data. Rather than using data as proof, a theorem employs other mathematical relationships which are assumed or
have been proven previously.
Business Statistics and Applications--Soskin and Braun Page 318

Central Limit Theorem for Unknown σ: The sampling distribution of X̄ when X has mean μ will approximate a t distribution with mean μ and standard deviation sX̄ = s / √n if the sample size n is sufficiently large.
For now, we examine only applications in univariate data. However, the usefulness of the
central limit theorem extends to analysis of more than one variable, as we will see in the next
few chapters.
Although its proof requires mathematics beyond the level of this text, the central limit
theorem may be easily verified by computer simulation experiments. Let's return to our antitrust
economist. Besides profit rates for acquisition-prone corporations, she also wanted information
about their indebtedness. An interest rate rise, recession, or stock market calamity could throw
heavily leveraged firms into insolvency. To assess this aspect of the merger mania, the debt-to-
asset ratio needed to be examined. The Minitab descriptive statistics and histogram of the debt-
Figure 8.13
[Histogram of 1987 Debt-Asset Ratios for Acquisition-Prone Firms; horizontal axis: Debt-to-Asset Ratio (0.0 to 1.0), vertical axis: Frequency]
Business Statistics and Applications--Soskin and Braun Page 319

to-asset population data are presented in Figures 8.13 and 8.14. The population mean was nearly
0.32 and the standard deviation .244.
Descriptive Statistics
Variable N Mean Median TrMean StDev SEMean
dbt/asst 154 0.3195 0.2500 0.3043 0.2439 0.0197

Variable Min Max Q1 Q3
dbt/asst 0.0100 0.9800 0.1400 0.4900
Figure 8.14
Unlike the ROR data, however, debt-asset ratios do not appear to be normally distributed.
A large group of corporations had debts that are under about 40 percent of their assets. A
smaller but substantial group of firms had debts amounting to half or more of total assets.
Lacking information on the full population, the antitrust economist once more collected a
random sample of n = 16 corporations. This time, however, she was sampling from a debt-asset
ratio population that was not normally distributed. Because the ROR data was normally
distributed, she could identify the sampling distribution. Whether the standard deviation was
known or was estimated from the sample data, she could make inferences about the population
mean. Was she still justified in using approximately these same sampling distributions? The
answer is yes, and we can prove it by the following computer simulation.
If the sample size is large enough, the central limit theorem concludes that X̄ has approximately a normal distribution with a standard error σX̄ = σ / √n. To examine the shape of the actual sampling distribution, we will need a very large number of random samples. A program was written in Minitab (explained in the Exercises) to repeatedly draw random samples of n = 16 observations from the debt-to-asset population data. The mean for each of the 1000 samples was then stored in a worksheet column. Figures 8.15 and 8.16 present the Minitab descriptive statistics and histogram for these 1000 sample means. What these figures describe is the sampling distribution for X̄ (as precisely as 1000 samples can portray it).
Descriptive Statistics
Variable N Mean Median TrMean StDev SEMean
x-bars 1000 0.31917 0.31688 0.31838 0.05813 0.00184

Variable Min Max Q1 Q3
x-bars 0.14812 0.49937 0.27750 0.35750
Figure 8.15
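The Minitab simulation described above can be mimicked in a few lines of Python. The sketch below is an illustrative aside: because the actual debt-to-asset population data (N = 154 in Figure 8.14) are not reproduced in the text, it substitutes a similarly skewed stand-in population, so its printed numbers will only be close to, not identical with, Figures 8.15 and 8.16:

    import numpy as np

    rng = np.random.default_rng(1)
    # Stand-in for the 154 debt-to-asset ratios (the actual data are not reproduced here);
    # a beta distribution gives a similarly skewed population on [0, 1].
    population = rng.beta(1.5, 3.5, size=154)

    xbars = np.array([rng.choice(population, size=16, replace=False).mean()
                      for _ in range(1000)])                 # 1000 sample means, n = 16 each
    print(population.mean(), population.std() / np.sqrt(16)) # mu and sigma / sqrt(n)
    print(xbars.mean(), xbars.std(ddof=1))                   # simulation counterparts
    # The sample means should average close to the population mean, spread out by
    # roughly sigma / sqrt(n), and form an approximately bell-shaped histogram.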

With the speed and availability of computers today, why don't we use this method to find all our sampling distributions? The answer is that we can't. These simulations relied on
knowing the actual population in the first place. But if we have the population data, there is no
Business Statistics and Applications--Soskin and Braun Page 320

need to make inferences from sample data. This simulation is therefore only useful to verify the
central limit theorem, which itself furnishes us with the approximate sampling distributions.
What are the results of our simulation experiment, and do they support the central limit
theorem? Examine the histogram in Figure 8.16. One thing is immediately clear. The
distribution of sample means does not resemble the distribution in Figure 8.13 of the debt-asset
ratios we sampled. Instead, the sampling distribution for X shown in Figure 8.16 is unimodal,
very nearly symmetrical, and approximately bell-shaped.
The descriptive statistics report that sample means average 0.31917, about the same as
the population mean (0.3195). The standard deviation of the sample means, 0.058, is also close
to the standard error. Recall that σ was 0.2439. Dividing by the square root of n for samples of
16 observations, we obtain a standard error of approximately 0.244 / 4 = 0.061. Thus, a sample
size of only 16 was sufficiently large to apply the central limit theorem approximations for this
non-normally distributed case example. Because she knew about the central limit theorem, the
[Histogram of Sample Means for Debt-Asset Ratios; horizontal axis: Debt/Asset Ratio (roughly 0.1 to 0.5), vertical axis: Number of Sample Means]
Results of Minitab Simulation: 1000 random samples of 16 observations.
Figure 8.16
Business Statistics and Applications--Soskin and Braun Page 321

antitrust economist could use information from a random survey of 16 corporations to make
important inferences about 1987 mean debt ratios.
We have seen how normal and t distribution formulas may be applied to large samples by
using the central limit theorem. However, what can we do for samples too small to invoke the
central limit theorem? A sample drawn from an extremely-skewed population, for example,
would have to be quite large to meet the standards of the central limit theorem. Smaller samples may still give decision makers rough estimates of the mean. Logarithms also may be used to convert a non-normal variable to an approximately normal one. Then, we may make inferences about the log of μ by using the Z or t distribution and the log of the sample data.
7
Alternatively,
nonparametric methods which do not require normal distributions may be employed.
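As a rough illustration of the logarithm approach described above, consider the Python aside below (the skewed data are simulated purely for the example, not drawn from the text):

    import numpy as np

    rng = np.random.default_rng(2)
    sales = rng.lognormal(mean=4.0, sigma=0.8, size=12)   # small, highly skewed sample
    logs = np.log(sales)                                  # roughly normal after the transform
    log_mean = logs.mean()
    se = logs.std(ddof=1) / np.sqrt(len(logs))            # estimated standard error of the log mean
    print(log_mean, se)
    print(np.exp(log_mean))   # convert back by raising the log base (here e) to the estimate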
Finally, we will need a few other commonly-encountered sampling distributions in our
toolkit. One of these is discrete, the binomial about which we have already learned. Two others,
the F and chi-square distributions, will be introduced in other chapters as we need them. Our
task for the remainder of this chapter, however, is to develop a better understanding of inferential
techniques. After discussing estimation and hypothesis testing methods, we can make much
better use of the sampling distributions we have learned.

Multiple Choice Review:
8.1 Statistical inference favors the use of larger samples because large sample size do each of
the following except:
a. reduce the thickness of the t distribution tails.
b. increase the √n divisor in the standard error calculations.
c. tend to make central limit theorem approximations appropriate.
d. permit the use of the normal distribution for our sampling distribution.
e. all of the above

8.2 If X is a normally distributed random variable with mean μ and standard deviation σ, then X̄ will be a random variable with
a. a normal distribution
b. a mean of μ
c. a standard deviation of σ / √n
d. all of the above
e. none of the above



7
To convert back to un-logged form, raise the base used for the logarithms to the power of the estimate.
Business Statistics and Applications--Soskin and Braun Page 322

8.3 If X is a random variable, the sampling distribution of X̄
a. is normally distributed if X is normal with a known value of σ
b. has a t distribution if X is normal and σ is unknown
c. is approximately normal for large samples and a known value of σ
d. has approximately a t distribution for large samples and unknown σ
e. all of the above


8.4 Which of the following are not common characteristics of both the normal distribution
and t distributions:
a. Their shape depends on the number of degrees of freedom.
b. They are symmetrical.
c. Each have two infinitely-long tails.
d. They each have a single mode.
e. All of the above are characteristics of both distributions.

8.5 If mean auto sales for a random sample of n = 9 salespersons last year was $300,000 and
the sample standard deviation was $90,000, then we should use as the estimated standard error of the mean a value of
a. $60,000
b. $30,000
c. $15,000
d. $7500
e. answer depends on the choice of sampling distribution.

8.6 If σ = 8 for a random variable X, then σX̄ for a sample size of 16 is
a. 0.5
b. 2
c. 4
d. 8
e. 32

8.7 If s = 24 for a random sample, then sX̄ for a sample size of 36 is
a. 0.67
b. 3
c. 4
d. 8
e. 144
Business Statistics and Applications--Soskin and Braun Page 323


8.8 In comparing the t and standard normal Z distributions, which of the following is not
true?
a. the t distribution is not symmetrical for small samples
b. the t distribution converges to the normal distribution as the sample size increases
c. the t distribution has thicker tails than the normal
d. only the t distribution varies with the number of degrees of freedom
e. all of the above are true

8.9 The central limit theorem is applied to inference situations where
a. the sample size is large
b. the sample size is small
c. the population is normally distributed
d. the sample is not a random sample
e. the population standard deviation is unknown

Below is a histogram for a sample of n = 60 employee wages at a factory.
[Histogram of the 60 sampled WAGES (roughly 0.5 to 3.1); vertical axis: Frequency]
21. Assuming that σ is unknown, why can a confidence interval for the population mean be estimated by a t sampling distribution?
a. because the population is very nearly normally distributed
b. because the central limit theorem usually can be applied for such large samples
c. because the law of large numbers ensures us of randomness
d. because t distributions may be used whenever distributions are skewed
Business Statistics and Applications--Soskin and Braun Page 324

e. all of the above

Calculator Problems:
8.10 With a calculator, determine σX̄ from σ and n. Summarize the relative effect of σ and n upon σX̄ from these results.
(a) σ = 10, n = 9
(b) σ = 10, n = 81
(c) σ = 5, n = 36
(d) σ = 1, n = 900


8.3 Estimation and Confidence Intervals
What is inference? By collecting a random sample, we may infer the unknown value of the population mean μ from the sample mean, X̄. Using the terms we have since become familiar with, X̄ is a random variable and an estimator of μ. We call the specific value of X̄ for any particular sample a point estimate of μ.
DEFINITION: A point estimate is a numerical guess for an unknown population parameter.
The formula for X uses random sample data to obtain each estimate.
Inferences based upon sample evidence can assist decision making. Decision makers
display a striking paradox in their reactions to analysis based on sample information. Even
highly educated people too often distrust sample inferences based on small sample size,
especially if the population size is very large. Common sense counsels them to reject lab tests or
market surveys based on only a tiny fraction of all consumers. Here, however, our common
sense provides incorrect guidance about the value of small sample information. Recall from
Chapter 5 that anecdotal evidence is a bad basis for decisions: the sample consists of only one
observation and may be biased.

Estimation Bias and Sampling Error
When business decisions must rely on sample inferences, estimators that are unbiased are
usually preferred. We learned in Chapter 3, for example, that the sample mean is an unbiased
estimator of μ. Many other estimators we examine in this text will possess this property as well.
Business Statistics and Applications--Soskin and Braun Page 325

From Chapter 6, we learned that expected value weights outcomes by their probability of
occurring. Thus, expected value is a natural vehicle for restating more concisely the notion of
unbiasedness.
An estimator of a parameter is unbiased if its expected value equals that parameter.
For example, since X̄ is an unbiased estimator of μ, then E(X̄) = μ is also true. If, on the other
hand, an estimator is biased, then its bias may be determined by using expected values.
Bias is the difference between a parameter and the expected value of its estimator.
To illustrate a biased estimator, recall from Chapter 3 that the sample variance s², defined
by the formula
    s² = Σ(xᵢ - X̄)² / (n - 1),
is an unbiased estimator of σ². Thus, E(s²) must equal σ². If we substitute n in place of (n - 1),
we obtain another estimator of σ², which we will call S². Then S² is defined by the formula
S² = Σ(xᵢ - X̄)² / n, the formula we would use if we mistakenly believed we were working with
the complete population. Yet S² is merely s² multiplied by (n - 1) / n, and s² is an unbiased
estimator of σ². Thus, E(S²) = σ²(n - 1) / n, which is smaller than σ² because (n - 1) / n is a
fraction smaller than one. Therefore, our S² estimator of σ² displays a downward bias.
For example, if σ² = 30 and we are using a sample of n = 10 to estimate it, then repeated
random samples will produce an average of 30 for s². By contrast, S² will average σ²(n - 1) / n =
30 × 9/10, or only 27. Like a clock you've forgotten to set an hour ahead for daylight-saving time,
a biased estimator never averages out to the true parameter value even in the long run.
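A short simulation makes this downward bias visible. The sketch below is only an illustration (it assumes NumPy is available and uses an arbitrary normal population with variance 30); averaging each estimator over many repeated samples of n = 10 reproduces the 30-versus-27 comparison just described.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, n, reps = 30.0, 10, 100_000

    s2_values, S2_values = [], []
    for _ in range(reps):
        x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=n)
        squared_devs = np.sum((x - x.mean()) ** 2)
        s2_values.append(squared_devs / (n - 1))   # unbiased estimator s^2
        S2_values.append(squared_devs / n)         # biased estimator S^2

    print(np.mean(s2_values))   # close to 30
    print(np.mean(S2_values))   # close to (n - 1)/n * 30 = 27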
However, what if your estimate is actually much higher or lower than the population
value we are estimating? We can never be sure this isn't the case if we are making inferences
from a sample. For the Tangerine case in Part 1, an underestimate of mean compensation could
lead the board of directors to offer too little to retain the current CEO. This could result in a long
and frustrating search for a replacement. On the other hand, an overestimate for these high-
priced individuals might unnecessarily strain the finances of the business and cause stockholder
opposition. The stakes can be high, and more information would clearly assist the board of
directors in assessing the variability of sample estimates for μ. It would be helpful to let the board
know the margin of error for your estimates of μ, or the chances of being more than, say,
$100,000 off in either direction. In this chapter, we focus attention on methods of determining
and reporting the degree of uncertainty or level of confidence in our estimates from sample data.

As appealing as unbiased estimators are, they do not resolve the basic limitation of a
point estimate. While "on average" an unbiased estimator is correct, its estimates are still going
to be wrong for most samples. Remember from Chapter 6 that expected values refer to the mean
of many, many observations, while business decisions are usually based on evidence from a
limited number of observations. The underlying problem with point estimates is that they
communicate nothing about the size of the sampling error.
DEFINITION: sampling error is the difference between a parameter and the value of its
unbiased estimator in a particular sample.
Thus for most decision situations involving sample inference, there will be sampling error. Our
point estimate will almost surely be wrong and sometimes very wrong.
Consider, for example, the often critical need to estimate mean warehouse inventory μ.
Suppose μ = 531 cartons. However, X̄ is unlikely to exactly equal μ. A random sample of
warehouse stock will have an X̄ that may be low (518 cartons), high (570), very close (529), or far
off the mark (1344). Each results in an error. Unfortunately, point estimates alone cannot
inform us about the magnitude or sign of this sampling error. Given the uncertainties, the
warehouse manager understandably should demand to know more before acting on the
X̄ information.

Anatomy of a Confidence Interval
Faced with the uncertainty surrounding point estimation from sample data, we are
tempted to pursue one of two radical responses. At one extreme, we could ignore the sample
entirely and base decisions purely on guesswork. But that would be throwing the baby out with
the bath water. In decision making, some information is better than none. At the other extreme,
we could regard the point estimate as precisely correct and ignore possible variability among
sample estimates. The danger with this tactic is that decision makers may then place undue faith
in unreliable information.
Novices to business statistics are often afraid of appearing unsure. They try to provide
definitive solutions, even when such precision is unjustified. In statistics, we learn that the only
thing certain is uncertainty itself. We need some way to utilize the sample information, and yet
simultaneously attach a measure of uncertainty to this estimate. The point estimate still may
serve as the focus of the sample estimates reported. Yet this estimate must be augmented by
information that reflects uncertainty and variability.

In business, we encounter variability measures accompanying point estimates. Consumer
confidence polls and market surveys report a margin of error. The following are examples
involving estimates of averages:
I'd say, as a rough guess, we rework an average of 15 assemblies each month.
Our new boss reprimands a dozen or so employees a week.
Flights average 10 minutes late give or take six minutes.
GM's market share averaged about 40 percent in each of the past three decades.

Along with a point estimate, each of these examples implies variability through phrases such as
"rough guess," "or so," "give or take," and "about." A
common way to incorporate variability is by the familiar "plus or minus" form. The first two
cases are vague versions of plus or minus language. Familiarity with the speaker will usually tell
the listener that "a rough guess" implies ± 2 assemblies and a "dozen or so" indicates 12 ± 4
employees. The airline example specifies the variability explicitly: flights average 10 ± 6
minutes behind schedule. The final example involves rounding. Rounding an estimate often
indicates variability. A market share of "about 40" means closer to 40 than to 30 or 50, and thus a
40 ± 5 share for GM.
Alternatively, we may combine point estimate and variability by reporting an interval
estimate.
DEFINITION: An interval estimate (a, b) is a range of values from a to b likely to contain the
population parameter being estimated.
The advantage of interval estimates over point estimates is obvious. A point estimate is like a
harpoon. Direct hits are required. By contrast, an interval estimate is like a fishing net. A fish
will be caught if it lies anywhere inside the range of your net. Unlike harpoons, the net size may
be varied. If the water is murky or the fish are elusive, a larger net may be required.
The previous plus-or-minus expressions may each be translated to interval estimates. For
the first example, 15 ± 2 is equivalent to an interval estimate of 13 to 17 assemblies reworked.
Similarly, 12 ± 4 converts to 8 to 16 employees, 10 ± 6 becomes 4 to 16 minutes late, and 40 ± 5
means a 35 to 45 percent market share.
But how do we determine how wide to make these intervals? If an interval is too narrow, the
parameter may not fall within it. If we play it safe by choosing a very wide interval,
we may be squandering our information and not helping decision makers. Suppose, for example,
that sample information indicates with near certainty that flights for your company's airline
average 10 ± 3 minutes behind schedule. You reported 4 to 16 minutes just to be on the safe
side. If new federal regulations mandate averages no greater than 15 minutes, your caution may
cause management to expend millions of dollars eliminating a problem that does not exist. On the

other hand, if (7, 13) is a carelessly narrow interval estimate, you may unnecessarily expose your
company to federal fines and lost reputation.
To resolve this dilemma, we select an interval width just wide enough to meet the
confidence level acceptable to decision makers.
DEFINITION: The confidence level is our degree of confidence that the unknown population
parameter being estimated lies within an interval estimate.
Like probabilities, confidence levels may be measured in percentages. Commonly-used
confidence levels are 90, 95, and 99 percent. An interval estimate reported with a confidence
level becomes a confidence interval.
DEFINITION: A confidence interval is a combination of interval estimate and confidence level
information about an unknown population parameter.
For example, if (11.5, 16.7) is the 95% confidence interval for mean number of assemblies
reworked weekly, then we are 95 percent confident that the population mean lies between 11.5
and 16.7 assemblies per week. A 90% confidence interval would be narrower, perhaps (11.9,
16.3), because we need to be only 90 percent confident. Conversely, a 99% confidence interval
has to be wider, say (10.6, 17.6), to capture the mean weekly reworking μ with such near certainty.
In the inset at the conclusion of this section, we discuss reasons for using the term "confidence."
To construct a confidence interval, we therefore need three ingredients: a point estimate,
an interval width, and a confidence level. All these ingredients may be obtained by applying the
theory of probability distributions to estimators derived from random samples.
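As a preview of formulas developed more fully in the next chapter, the sketch below shows how the three ingredients fit together for a population mean. It is only a sketch: the data are hypothetical, and SciPy is assumed to be available for the t distribution.

    import numpy as np
    from scipy import stats

    # Hypothetical sample of assemblies reworked in ten recent periods
    x = np.array([14, 17, 12, 15, 16, 13, 18, 15, 14, 16])

    n = len(x)
    point_estimate = x.mean()                              # ingredient 1: point estimate
    conf_level = 0.95                                      # ingredient 3: confidence level
    t_crit = stats.t.ppf((1 + conf_level) / 2, df=n - 1)
    half_width = t_crit * x.std(ddof=1) / np.sqrt(n)       # ingredient 2: interval width

    print(point_estimate - half_width, point_estimate + half_width)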

Why We Say Confidence Instead of Probability
Why do we refer to "confidence" rather than "probability" in our interval estimates? The
explanation lies in the nature of the estimation problem itself. Because point estimates convey a
false impression of precision, we instead tell decision makers there is a very good chance that the
unknown population parameter lies within a confidence interval.
However, the word probability implies that one of several alternative events is about to
happen. By contrast, estimation is an attempt to capture the population parameter within an
interval. After selecting our sample and generating a confidence interval from it, the true
population value being estimated either is or is not within that interval (although we may never
know which). Strictly speaking, then, we say we are 95 percent confident, rather than that there is a
95 percent probability, that the population parameter is in the interval. Instead of fishing nets, this
time consider the analogy of catching flies in your hand.

You may swipe at one, but you can't be sure you caught it without opening your hand and
allowing it to escape. You may still have a level of confidence that you caught the fly, although
the actual probability is either 0 or 100 percent.
Why do we need a confidence level at all? Can't we just go far enough to the right and
left to capture all the probability under the density function? The reason that we cannot is most
sampling distributions have infinitely-long tails in one or both directions. No matter how wide
we make the interval, there will be a nonzero probability area remaining outside the interval.
However, we learned in Chapter 7 that just because anything is possible doesn't mean everything
is equally probable. Most sampling distributions we encounter will have infinitely long tails
converging rapidly to the horizontal axis. The result is a negligible probability area beyond a
few standard errors of the mean.
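How quickly do those tail areas become negligible? For a standard normal sampling distribution, the probability of falling more than k standard errors above the mean can be computed directly (a sketch assuming SciPy):

    from scipy import stats

    for k in (1, 2, 3, 4):
        # Upper-tail area beyond k standard errors for a standard normal distribution
        print(k, stats.norm.sf(k))
    # Roughly .16, .023, .0013, and .00003, so the area is negligible beyond a few standard errors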
There is a tradeoff between interval width and level of confidence. The greater the
confidence level, the wider the interval is.
A fundamental principle in economics states there is no such thing as a free lunch. This
same principle applies in statistics to selecting confidence intervals. The narrower the interval
the less likely we are to capture the population parameter. If we want a small range of values for
our decision, we must be willing to take our lumps if the population value lies above or below
this range (as it well may). In the extreme case, insisting on an interval with no width is the
same as a point estimate, which has a negligible chance of being exactly right. At the other extreme, an
unacceptably wide confidence interval is often needed to achieve near certainty of capturing the
population parameter. Thus, selection of interval and confidence level are opposite sides to the
same process.
Most often, a 95 percent confidence interval is selected. This level says that our interval
estimate will contain the population parameter an average of 19 times out of 20. Other
commonly selected levels of confidence are 90 and 99 percent. Which should you choose? That
depends on the relative importance you place on reporting a more precise range (i.e., a narrower
interval) versus the embarrassment if a population value is outside your interval. One way
around this tradeoff dilemma is to collect a larger sample. Later in this chapter, we will explore
methods to determine the optimal sample size.
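The tradeoff is easy to quantify. Using standard normal critical values for simplicity and a hypothetical standard error, the half-width of the interval grows as the confidence level rises (again a sketch assuming SciPy):

    from scipy import stats

    std_error = 1.3   # hypothetical standard error of the sample mean
    for level in (0.90, 0.95, 0.99):
        z = stats.norm.ppf((1 + level) / 2)
        print(level, round(z * std_error, 2))
    # 0.90 -> 2.14, 0.95 -> 2.55, 0.99 -> 3.35: higher confidence means a wider interval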

Multiple Choice Review:
8.31 The difference between a parameter and a sample point estimate of that parameter is
called the
a. sampling error
b. bias

c. variance
d. distribution
e. degrees of freedom

8.32 Which of the following does not suggest an interval estimate for the population mean:
a. As a rough guess, I'd say we rework an average of 15 assemblies each month.
b. Our new boss reprimands a dozen or so employees a week.
c. Flights average 10 minutes late give or take five minutes.
d. GM's market share averaged about 40 percent in each of the past three decades.
e. All of the above suggest interval estimates.

8.33 Which of the following does not involve statistical inference?
a. descriptive statistics
b. point estimation
c. interval estimation
d. hypothesis testing
e. forecasting

8.34 Which information is generally found in a confidence interval?
a. a point estimate
b. an interval width
c. a confidence level
d. all of the above
e. a and c only

8.35 The probability distribution of an estimator or statistic is a
a. random sample
b. confidence interval
c. null hypothesis
d. summary statistic
e. sampling distribution

8.36 Analysts for a tire manufacturer estimate from a random sample that mean tread life for
its new radial belt tire is between 20,000 and 50,000 miles. Which of the following
strategies might be used to obtain a narrower confidence interval?
a. Use a point estimate.
b. Collect a larger sample.
c. Use a smaller value for α.
d. Use a larger confidence level.
e. All of the above.

8.4 The Fundamentals of Hypothesis Testing
In Chapter 1, we presented a general outline of the procedure for business decision
making under uncertainty. We are now ready to expand upon this procedure by connecting it to
a method well known in the sciences.

Hypotheses and the Scientific Method
You probably recall from some science course how your teachers would talk about
something called the "scientific method." You learned that scientists relentlessly attempt to
unravel the mysteries of the universe by formulating hypotheses, possible ways that the physical
or biological world operates. Data on events predicted by the hypotheses are then gathered
through controlled laboratory experiments, if feasible, or from painstaking observations of
nature, geological strata, or the astronomical horizon. Next, the scientists test the hypothesis by
checking whether the data are consistent with that hypothesis. Vessels containing gas are
carefully measured, trajectories of galaxies hurtling through space are plotted, or the
chromosomes of parents and offspring are compared. If the data do not refute the hypothesis
being tested, the hypothesis is elevated to a theory, a hypothesis which has successfully passed
this testing process.
DEFINITIONS: A hypothesis is a statement that can be subjected to empirical evidence. A
theory is a hypothesis that is found consistent with empirical evidence.
Hypotheses are usually presented as two contradictory statements: a null hypothesis H₀ and an
alternative hypothesis Hₐ.
DEFINITION: In stating a hypothesis, all possible conditions of the unknown population are
reduced to one of two states: an alternative hypothesis Hₐ, the condition the test is trying to
establish; and a null hypothesis H₀, the state we wish to reject.
H₀ is sometimes called a "straw man" which is placed alongside the hypothesis we are really
testing, Hₐ. Rejecting H₀ means we have knocked down this straw man and support Hₐ. Not
rejecting H₀ produces a negative inference about the alternative hypothesis.
Hypothesis testing consists of rejecting or not rejecting H₀ based on sample evidence.
Rejecting H₀ allows us to accept Hₐ.
Some examples of currently accepted scientific theories which began as hypotheses are
the following:

The universe started from a "big bang."
Blond hair color (the natural kind) is a genetically recessive trait.

The null hypothesis that blond- and brown-haired couples have the same chance of producing
blond- or brown-haired offspring was rejected long ago by observing relative frequencies.
The big bang theory was promoted from hypothesis to theory about three decades ago. The null
hypothesis, that all the matter in the universe has always been spread about, was rejected when
evidence became overwhelming that galaxies are moving away from one another at
extraordinary speeds. Examples of null hypotheses that experimental evidence has failed to
reject include the following:
Heavier objects (dropped by Galileo from the Leaning Tower of Pisa) fall no faster than
lighter objects of the identical shape and size
The sun is the center of the solar system
Taking quack medicine does nothing to cure cancer

"Quantum theory" is only 70 years old, yet it explains much of what we know today in
physics. The success possible from using the scientific method was described by the late
Richard Feynman, a Nobel prize winning physicist.
Just to give you an idea of how [quantum] theory has been put through the wringer, I'll
give you some recent numbers: experiments have Dirac's number [a fundamental measure in the
theory] at 1.00115965221 (with an uncertainty of about 4 in the last digit); the theory puts it at
1.00115965246 (with an uncertainty of about five times as much). To give you a feeling of the
accuracy of these numbers, it comes out something like this: If you were to measure the distance
from Los Angeles to New York to this accuracy, it would be exact to the thickness of a human
hair. That's how delicately quantum electrodynamics has, in the past fifty years, been checked--
both theoretically and experimentally. ....Things have been checked at distance scales that range
from one hundred times the size of the earth down to one-hundredth the size of an atomic
nucleus. These numbers are meant to intimidate you into believing that the theory is probably not
too far off! [8]

Nevertheless, scientists continually caution us that theories cannot be proven, only
disproved. It may always be possible to design an experiment that forces an established theory
to be modified or discarded. Both Mendel and Darwin gathered evidence to overturn accepted
theories in their fields. Each of their theories, in turn, has been extensively modified and refined
in this century. One theory may turn out to be only a special case of a yet to be discovered


[8] This and the two subsequent quotes are from Richard P. Feynman, QED: The Strange Theory of Light and Matter. Princeton, NJ: Princeton Univ. Press, 1985, pp. 7-10.

general theory. Newtonian laws of motion were such a special case of Einstein's theory of
relativity.
Hypotheses can never be proved, only rejected. Accepted theories are replaced when other
theories do a better job of explaining and predicting.

Because a theory can never be proved, the best we can do is reject the null hypothesis and accept
for now the relationship stated in the alternative hypothesis.
By contrast, some statements do not require testing because they are true by definition.
We don't need to test whether football fields are 100 yards long; the rule book tells us they must
be. Some statements are portrayed as hypotheses when they are actually tautologies.
Tautologies are always true because they involve circular reasoning and terms are constantly
redefined to fit any exceptions that arise. A statement is not a hypothesis if it cannot ever be
wrong.
Other statements are not hypotheses because they cannot be subjected to verification by
data collection and testing procedures. Ethics, values, and religious precepts are not testable,
since they are not subject to empirical verification. The existence of God is not empirically
testable. It is a matter of faith. Besides, an all-powerful deity could manipulate time and space
in any experimental measurement. We help the homeless and oppose murder because of our
values, not because of some theory.

Formulating and Testing Business Hypotheses
In business statistics, we also formulate hypotheses, gather data to test hypotheses, and
reject or elevate them to the status of a theory depending on the results of these tests. If
subsequent tests with new data call an existing theory into question, revision or replacement of
the theory may be necessary.
However, there are several important differences. Unlike the physical and biological
universe, few permanent laws govern the economic behavior of corporations, workers, or
households. Behavior instead varies from one country to another and from one generation to the
next. Cultural differences, shifts in tastes and values, and changes in institutions eventually
reduce the usefulness of prevailing theories. Revisions to accepted theories will be required to
understand and forecast the business environment of the 21st century.

Another problem is the importance of human behavior in business and economics.
Human beings are complex creatures who exercise individual choice in their actions. Buyer
decisions and worker motivation are never as predictable as planetary orbits or bacteria growth in
a petri dish. Since business organizations also are determined by decisions of individuals, they
too are subject to the whims of human action.
One final difference distinguishes the sciences from business statistics. According to
Feynman, although a scientist can describe to you "how Nature works, you won't understand
why Nature works that way. But you see, nobody understands that." Understanding the
motivations of managers, workers, and customers is useful and often essential. We may use this
information to predict how they react and adapt to changes in the business environment.
Theories in business are seldom valid for all time and all situations. Their precision is
limited by the complexity of human behavior. Unlike the sciences, business theories seek to
explain the motivations behind the observed data.
Though typically less precise than laboratory measurements, data collection in business
and economics benefits from the great attention society gives to all things material, such as
inflation, trade, income, sales, employment, and taxes. Archaeologists tell us that records of
commerce sometimes are our only information about an ancient civilization. What has improved
through the ages is the timeliness of information, so crucial to decision making. Today, with
computers and data retrieval systems, we can pull together in seconds business information that
would have taken months to assemble ten or twenty years ago.
Larger measurement errors and behavioral uncertainties make it essential that statistical
analysis be applied to hypothesis testing in business. Business statistics has the imposing
challenge of testing hypotheses in a world where theories are frequently in flux and seldom fully
explain individual and business phenomena. To conduct a statistical test, we use many of the
same terms and methods we employed in interval estimation.
Both estimation and testing use sample evidence to make statistical inferences about
population parameters. Instead of estimating confidence intervals, testing attempts to
determine whether these parameters have particular hypothesized values.
Where do we get our hypotheses in business and economics? Often they are extensions
of theories from psychology, sociology, industrial engineering, or economics which then may be
tested in business contexts. As an example using ideas from environmental psychology, a
personnel manager may test whether teaming up experienced field representatives with trainees
decreases policy sales at an insurance company.

Alternatively, testable hypotheses may be generated from previous findings or from
similar situations. Reports that compensation levels are falling in the hotel industry may prompt
the motel owners association to test whether the same wage declines are occurring among its
members. During the 1970s, studies revealed that female college graduates earn the same
amount as males with only a high school diploma. A later study may use this pay pattern
as a reference point: Are gender disparities today as large as they were in the 1970s?
We begin our introduction to statistical hypothesis testing with a quality control
example.

Chapter Case #2: Snap, Crackle, and Pop
Quality control at a breakfast cereal plant includes regular examination of product
samples to decide whether the proper amount of vitamin mixture is sprayed onto wheat flakes to
justify their advertising claims. The ads state that its Healthy Choice cereal supplies 100 percent
of the minimum daily requirements of vitamin C. If tests conclude that the cereal averages a
vitamin content less than the level advertised, the flow levels on the nozzle jets must be adjusted
appropriately. How do we construct hypotheses to test the production samples?
Although hypotheses are sometime presented verbally, words generally lack the clarity
and conciseness of mathematical expressions. If the form of the hypothesis is clear from the
accompanying verbal description of the decision problem, test results are sometimes reported
with no formal statement of H₀ and Hₐ. However, a formal statement of the hypotheses avoids
possible misinterpretation of what is being tested. It is also good practice for students to state
hypotheses formally.
The customary method for stating hypotheses is a two-line format. "H₀:" is written on
the first line and "Hₐ:" on the second. Following the colon (":") on each line, the null and
alternative hypotheses are then expressed as algebraic expressions. To ensure that all possible
population states are represented by H₀ and Hₐ, the null hypothesis is defined to be the
complementary event "not Hₐ."
Hₐ and H₀ are mutually exclusive and exhaustive events by defining the null hypothesis as
the complement of the alternative hypothesis.




Types of Tests and Types of Error
The complementary relationship is accomplished by equality and inequality signs in the
algebraic relationships. The expression on the H₀ line always includes an equality condition, so
an equal sign (=) should be used. For tests in this chapter (and several other tests presented in
later chapters), the Hₐ expression involves strict inequality ("<", ">", or "≠" signs), the opposite
of the equality in H₀. Which of these signs we choose depends on the type of test described by
the hypothesis. With many tests, including all those introduced in this chapter, either a one-sided
or a two-sided test may be selected.
DEFINITION: A one-sided test is required to test an alternative hypothesis containing a one-
way inequality, either strictly greater than or strictly less than; a not-equal sign in Hₐ directs us
to perform a two-sided test.
As we will see shortly, the name originates from the tails of the sampling distribution used to
assign probabilities to the tests.
The alternative hypothesis for tests on the population mean, for example, may be stated in
one of three possible ways. In tests of whether the population mean μ equals a particular value μ₀,
a two-sided test and two one-sided tests are available:

Two-sided test:
    H₀: μ = μ₀
    Hₐ: μ ≠ μ₀

One-sided test (greater-than alternative):
    H₀: μ = μ₀
    Hₐ: μ > μ₀

One-sided test (less-than alternative):
    H₀: μ = μ₀
    Hₐ: μ < μ₀

Notice that the format of each of these tests follows the rules stated earlier.
When should we choose a one-sided test and when should we use a two-sided test? We
should choose the form that best describes our hypothesis. If we are solely interested in testing
whether the population mean is less than a particular value μ₀, we should select the less-than
version of the one-sided test. If we instead hypothesize that μ is greater than μ₀, the greater-than
form should be our choice. The two-sided test is more appropriate for a hypothesis that tests
whether μ is different from μ₀ in either direction.
Each of these may be examined by returning to our breakfast cereal case. Suppose that
0.4 grams of ascorbic acid (vitamin C) per 3 ounce serving provides 100 percent of the minimum
daily vitamin C requirements. If the cereal is found to contain a mean below the amount claimed
on the box, a false advertising or labeling charge could be filed at the Federal Trade Commission
(FTC). To decide whether the vitamin spray nozzle requires adjusting, we test the hypothesis
that the mean concentration of ascorbic acid per serving is less than 0.4. However, if we are
testing whether μ < 0.4 is true for the population, we must be able to reject the event that μ = 0.4.
We therefore state the one-sided test with a less-than alternative:
    H₀: μ = 0.4
    Hₐ: μ < 0.4
Rejecting the null hypothesis will lead the cereal company to increase the nozzle flow. On the
other hand, if tests do not allow rejection of H₀, the company decides not to alter the production
process.
With the inequality reversed to a greater-than alternative, we obtain the other one-sided
test:
    H₀: μ = 0.4
    Hₐ: μ > 0.4
Suppose advertising claims are seldom investigated (a situation that occurred during the 1980s
when regulatory agencies had their budgets severely reduced). Then the cereal maker may
primarily be concerned about whether or not μ is greater than 0.4. Since vitamins cost more than
wheat flakes, too much ascorbic acid adds to production costs and hurts company profits. Thus,
rejecting H₀ alerts the company to reduce the nozzle flow. Not rejecting H₀ means the company
does not need to make these changes.
Suppose the company is also concerned about cost control. A test is conducted to
investigate whether μ is different from 0.4. The alternative hypothesis statement must now
correspond to the two-sided test so that Hₐ includes differences above and below 0.4:
    H₀: μ = 0.4
    Hₐ: μ ≠ 0.4
Rejection of the null hypothesis will suggest actions to remedy the problem: increase the nozzle
spray to fulfill advertising claims or decrease the spray to reduce costs. If H₀ cannot be rejected,
the company will not need to intercede in the production process in either direction.
Notice that the hypotheses H₀ and Hₐ each referred to μ, not X̄. We don't need to "test"
anything about the value of X̄, since its precise value may always be determined from the
sample data itself. The sample data is merely a means to an end. Remember that a sample is
collected to make inferences about the population.
H₀ and Hₐ refer to the unknown population parameters, not their sample estimators.
In the cereal example, we may be concerned about whether sample data supports the proposition
μ < 0.4. By contrast, the expression X̄ < 0.4 is not a hypothesis, because any specific sample
mean either is or is not less than 0.4 grams per 3 oz. serving.
As we learned in Section 8.2, however, sample inferences are hardly perfect. Failure to
reject H₀ may in fact mean that H₀ is the true state of the population. On the other hand, Hₐ may
be true but we cannot substantiate that fact from the available sample data. Perhaps the nozzle
sprays an average of less than 0.4 grams per serving, yet a random sample of production output
may not indicate anything is wrong. Because of this uncertainty, we can't be absolutely certain
which is the case. Perhaps the sample size is too small or the data being sampled are too variable
for H₀ to be rejected.
We would commit the opposite sin if we rejected H₀ when we should not have. Even if H₀
is true for the population, drawing a random sample very different from the population conditions
described in H₀ is always possible. Thus, we may commit two opposite kinds of errors in
hypothesis testing: type I error and type II error.
DEFINITION: If we reject the null hypothesis when in fact H₀ is the true state of nature for the
population, we commit a Type I error; conversely, failure to reject H₀ when Hₐ is true for the
population involves a Type II error.
For example, employers increasingly administer blood tests to job applicants to avoid hiring
possible drug users. However, these tests are not perfectly reliable. Diet or prescription
medicine may trigger a "false positive" reading on these tests. In addition, applicants can avoid
drug use before the interview to obtain a "false negative" on their blood test.
Either we commit one of these two errors or we correctly identify the actual state for the
population. We may portray the decision matrix and possible consequences in Table 8.2.


Table 8.2
Decisions and Outcomes of a Hypothesis Test

                                        True State of the Population
                                        H₀ True             Hₐ True
Decision Based      Do Not Reject H₀    correct decision    type II error
on Test Results     Reject H₀           type I error        correct decision

For the breakfast cereal vitamin case, suppose the test involved the following:
    H₀: μ = 0.4
    Hₐ: μ < 0.4
Then, a Type I error occurs if the quality control manager rejects H₀ when 0.4 grams per serving is
the true average. A Type II error results from failure to reject H₀ when an insufficient amount of
vitamin C is being applied to the product.
Type I and Type II errors are assigned probabilities designated by the Greek letters α and β.
DEFINITION: In hypothesis testing, α is the probability of committing a type I error and β is the
probability of a type II error.
To prevent making a wrong decision, we would like to keep α and β both as small as possible
and thus minimize the chances of both types of errors occurring. However, a tradeoff necessarily
exists between α and β.
The smaller the value we assign to α, the larger β will be, and vice versa.
We can show this tradeoff by reducing the chance of type I error. This can be accomplished by
rejecting H₀ only in the presence of extremely convincing sample evidence. However, this
strategy will result in an excessive risk of type II error. Similarly, setting the standard for rejecting
H₀ too low reduces type II errors at the expense of errors of the type I variety.
Because a major spill could cost a billion dollars in clean up costs, petroleum companies
may choose a large α. The chance of the company wrongly removing a nonuser from
captaining an oil tanker may need to be large. Otherwise β, the chance of giving the green
light to an oil tanker captain who is a substance abuser, may be unacceptably high. In less
sensitive jobs, employers are more apt to give experienced employees the benefit of the doubt in

drug testing. A small α is used because replacing a valued employee results in high hiring and
training costs for the company.
In actual situations, we can control the size of α if we can determine the sampling
distribution under the null hypothesis condition. The size of β, on the other hand, will vary
according to the specific distribution identified with the alternative hypothesis. In hypothesis
testing, we generally do not specify this distribution or its parameters. Thus, we usually know
little about the size of β. Nevertheless, for any given alternative distribution, the tradeoff
between α and β still applies. By setting the level of α, the decision maker automatically assigns
a corresponding, but unknown, value to β. Changing α will change β in the opposite direction,
for any given population and sample size. One way to decrease the risks of both types of error is
to collect larger samples.
By assigning a value to α, we automatically determine β. Although usually unknown, β
may be reduced by adopting a larger α or increasing the sample size.
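The tradeoff can be computed directly once an alternative distribution is assumed. The sketch below is purely illustrative: it supposes vitamin readings are normal with σ = 0.05 grams per serving, a sample of n = 25, a null mean of 0.40, and one particular alternative mean of 0.38 (all hypothetical values). Lowering α pushes the rejection cutoff farther from 0.40 and raises β.

    import numpy as np
    from scipy import stats

    sigma, n = 0.05, 25                   # assumed population sd and sample size
    se = sigma / np.sqrt(n)               # standard error of the sample mean

    for alpha in (0.10, 0.05, 0.01):
        cutoff = 0.40 + stats.norm.ppf(alpha) * se         # reject H0 if x-bar falls below cutoff
        beta = stats.norm.sf(cutoff, loc=0.38, scale=se)   # P(fail to reject | mu = 0.38)
        print(alpha, round(cutoff, 4), round(beta, 3))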

Statistical Significance and p-Values
Earlier in this chapter, we used α as part of the (1 - α)·100% confidence level. For testing
purposes, the symbol α is used again, this time to designate the significance level.
DEFINITION: The probability of a type I error, α, is the significance level for the test.
The significance level is also called the level of significance for the test. Rejection of the null
hypothesis allows us to conclude that the relationship described in Hₐ is statistically significant.
DEFINITION: If the test results in rejection of H₀, the inequality expressed by Hₐ is considered
statistically significant at the α significance level; if H₀ cannot be rejected at the α significance
level, then the expression described by Hₐ is not statistically significant.
For the cereal case, suppose α is assigned a value of .01 by the quality control manager.
Then production should be shut down whenever a randomly collected cereal sample yields test
results that reject the null hypothesis. Rejecting H₀ is equivalent to concluding that μ < 0.4 is
significant at the .01 level. Then, at the .01 significance level, sampling evidence indicates that
the cereal's concentration of vitamin C is less than the advertised 0.4 grams per serving.
The most commonly selected significance levels are .01, .05, and .10, but decision
makers should choose an α according to the risk they are willing to take. Shutting down
production is costly and perhaps unnecessary, but so is recalling thousands of cereal boxes and
damaging the company's reputation among retailers and consumers.

When test findings are reported in the press or to decision makers, statistical significance
is often abbreviated to the single word significance and the significance level used is generally
omitted. A few years ago, after tests found it caused significant reductions in cholesterol, oat
bran was touted as a miracle cure. Manufacturers who put oat bran in all their products took
heavy losses when scientists explained to the public that the benefits from eating oat bran are
statistically significant but small. Section 8.5 discusses the causes of this and other
misunderstandings about significance.
Once we translate our verbal statements into the correct H₀ and Hₐ form and establish an
acceptable α level, we are ready to gather a sample and use the sample evidence to conduct the
hypothesis test itself. The outcome of a test is determined by applying the proper decision rule.
DEFINITION: A decision rule for a test is the criterion for rejecting or not rejecting the null
hypothesis.
Decision rules are based on the properties of the test statistic used for the test.
DEFINITION: A test statistic is a random variable constructed from sample information and the
assumption that the null hypothesis is true.
Like any random variable, a test statistic has a distribution. Because its distribution depends on
sample information, a test statistic has a sampling distribution.
The sampling distribution of a test statistic is the distribution the test statistic will have if
we assume that the null hypothesis is true.
Instead of using sampling distributions for estimation purposes, we now use them for testing.
Test statistics allow us to decide whether to reject H₀. One form of the decision rule involves
comparing test statistic values to a rejection region.
DEFINITION: A rejection region is the range of test statistic values improbable enough to
reject the null hypothesis at the α level of significance.
The α area of probability lies in a single tail for a one-sided test; it is split into two tails of α / 2 each for a two-sided test.
Decision Rules Using Test Statistics: We reject H₀ if the test statistic lies within the
rejection region. Otherwise, we cannot reject H₀.
Because our hypotheses are expressed as decisions between two competing statements, H₀ and
Hₐ, the decision rule gives us an "either-or" outcome for selecting one of these two. If the test
statistic lies within the rejection region, we reject H₀; otherwise we cannot reject H₀.

Figure 8.17 portrays a one-sided test for the first breakfast cereal hypothesis test. The
curve on the right is the distribution for X̄ centered on 0.4, the value associated with H₀. If X̄ is
far to the left of 0.4, the sample is unlikely to have been drawn from the population described by
H₀. This range of X̄ values is the rejection region and corresponds to that portion of the
distribution with area of probability α (shown as the dark shaded area). It should now be clear
why this is called a lower-tail one-sided test.
Hypothesis testing does not usually specify the distribution if Hₐ is true for the
population. However, we have illustrated a possible one in Figure 8.17 to show β, the probability
area for type II error. If H₀ is rejected, it may be because this other distribution for X̄ is the
true one. If X̄ does not lie within the rejection region, we cannot reject H₀ even if Hₐ is true.
The right tail of this alternative distribution, the crosshatched area β, represents this Type II error
probability. In Chapter 9, we will apply these concepts to tests for μ based on standardized test
statistics.
Figure 8.18 shows α and β for the upper-tail version of the one-sided test. If cost controls
rather than regulatory standards are the concern, the curve on the right is an alternative
distribution to the H₀ sampling distribution for X̄ centered at 0.4 grams per serving. As before,
the α area is confined to a single tail, but this time it is the right tail.
Finally, for two-sided tests, we would use both tails of the distribution, each containing
only α / 2 of the probability. As in confidence interval estimation, the two tails combine for a
total of α. Sampling distributions consistent with the alternative hypothesis could then be
centered either above or below 0.4. We reject H₀ if X̄ lies in the rejection region in either tail.
Recently, a much easier and more flexible alternative to test statistic decision rules has
become the standard for business statistics. To explain this option, we must define the p-value.
DEFINITION: The p-value is the probability of obtaining a test statistic as extreme as or more
extreme than the one obtained from the sample, assuming H₀ is true for the population.
For example, the test statistic for a particular sample may have a 20 percent chance of occurring
given that the null hypothesis is true. Then its p-value is 0.20. Consequently, the p-value is often
called the observed significance level. Before computers, p-values were seldom used
because they were difficult to determine from statistical tables. Statistical packages like Minitab
easily compute and report these p-values.
The p-value gives us a probability that is directly comparable to the α values established
for the test. If we compare the p-value to α, the result is an alternative decision rule.

The p-Value Decision Rule for Hypothesis Testing: If the p-value is less than the
significance level α, reject the null hypothesis; if it is not less than α, do not reject the null
hypothesis.
    If p < α, reject H₀
    If p ≥ α, do not reject H₀
Suppose the p-value, the probability of a sample mean of 0.35 grams per serving or less, is 12 percent.
A Type I error would occur if production were shut down unnecessarily. If the acceptable
probability of this error is established at 5 percent, then p = 0.12 exceeds α = .05. Thus, the null
hypothesis cannot be rejected. The mean spray level per serving is not significantly less than 0.4,
and production should not be halted. If a later sample with X̄ = 0.31 has a p-value of .03, p would
then be less than α. H₀ should therefore be rejected, and a significant difference is concluded.
Production would be interrupted while the spray nozzles (or other possible causes of the shortfall
in vitamin C) are inspected.
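To make the rule concrete, the sketch below computes a lower-tail p-value for the cereal test from a small set of hypothetical readings (SciPy assumed; the t sampling distribution anticipates formulas developed in the next chapter) and then applies the decision rule:

    import numpy as np
    from scipy import stats

    mu0, alpha = 0.4, 0.05
    x = np.array([0.36, 0.41, 0.33, 0.38, 0.40, 0.35, 0.37, 0.39])   # hypothetical readings

    t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
    p_value = stats.t.cdf(t_stat, df=len(x) - 1)    # lower-tail area, since HA is mu < 0.4

    print(t_stat, p_value)
    print("reject H0" if p_value < alpha else "do not reject H0")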
The p-value test has a couple of important advantages over the test statistic decision rule.
First, the p-value decision rule looks identical for any type of test, regardless of the test statistic
or sampling distribution. This uniform decision rule ("Is p less than α?") makes seldom-used
statistical tests easier to apply and interpret.
Second, by reporting the p-value, decision makers may assign their own significance
level instead of having a single rejection region imposed by the person who ran the statistical
program. Suppose the marketing department presents test market success with a p = .07
probability of a type I error. If you don't believe the company should take any more than a 5
percent risk of a failed venture this year, you can argue against going ahead with this project.
A p-value decision rule is equivalent to the corresponding test statistic decision rule, and
the p-value rule usually is more flexible and easier to use. Now that computer software
reports them, p-values are used to conduct most statistical tests today.
In practice, test statistic decision rules are seldom used anymore. Is it any wonder why? Not only
do we have to calculate t statistics, we also must convert α into a t distribution value by checking
the inverse cumulative distribution. The p-value decision rule is far easier and more flexible.
The p-value for a test reports the observed significance level appropriate to the sampling
distribution and null hypothesis. Therefore, the decision rule reduces to:
    if p is less than α, reject H₀;
    if p is not less than α, we cannot reject H₀.

From here on, we will rely only on the p-value decision rule to conduct all our tests.

The procedures we have developed in this section for testing hypotheses are
summarized in Table 8.3 below; a brief software sketch of the later steps follows the table.
Table 8.3
Procedures for Hypothesis Testing

1. Express the hypothesis Hₐ to be tested in terms of population parameters
   and assign the intended inequality sign.
2. State the equality condition for the null hypothesis H₀ as the complementary
   event to Hₐ.
3. Assign a significance level α according to the decision maker's willingness to risk a
   Type I error (and the implicit tradeoff with Type II error).
4. Collect a random sample from the population.
5. Obtain the p-value for the sample data from the area of the tail(s) in the
   sampling distribution of the test statistic.
6. Use the p-value decision rule to decide whether to reject the null hypothesis.
7. Translate test results into conclusions meaningful to decision makers.
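Statistical software bundles steps 4 through 6 of Table 8.3. As one illustration (hypothetical data again; the alternative keyword requires a reasonably recent SciPy release), the entire lower-tail t test of the cereal hypotheses reduces to a single call whose reported p-value is then compared with the chosen α:

    import numpy as np
    from scipy import stats

    x = np.array([0.36, 0.41, 0.33, 0.38, 0.40, 0.35, 0.37, 0.39])   # hypothetical sample

    # One-sided test of H0: mu = 0.4 against HA: mu < 0.4
    result = stats.ttest_1samp(x, popmean=0.4, alternative='less')
    print(result.statistic, result.pvalue)   # compare the p-value with the chosen alpha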

Although we have used examples of tests relating to the population mean, much of what
has been presented so far applies to hypothesis testing in general. To proceed further, we must
switch from a general discussion of testing to specific types of decision problems. In this chapter
we will base our tests on the sampling distributions developed earlier. However, do
not lose sight of the general procedures and principles of hypothesis testing outlined in Table 8.3.
Test statistics and sampling distributions will vary from test to test, but attaching the general
meaning to specific test statistic formulas is easy once these principles are understood. The tests
introduced in this chapter may not be the ones you will commonly encounter in your career.
Many business statisticians prefer alternative tests which make fewer distributional assumptions.
Economic statisticians (called econometricians) may go years without needing any of the tests
described in this chapter. [9] The tests discussed in Chapters 10, 11, and 14 are often more useful
in business statistics today.



[9] Some econometric software packages omit these tests entirely.

Multiple Choice Review:
8.41 A statement capable of being subjected to empirical evidence is
a. a hypothesis
b. a tautology
c. a theory
d. a parameter
e. an assumption

8.42 One difference between business statistics and statistics in the sciences is
a. business statistics cannot use the scientific method
b. sciences do not need to formulate hypotheses
c. sciences do not test hypotheses
d. human behavior is not as predictable as planets and viruses
e. all of the above are differences

8.43 Each of the following is a major source of measurement error in business and economic
data except
a. many societies fail to record business and economic data
b. governments suppress data to protect confidentiality
c. people are reticent to disclose financial information
d. businesses are reticent of making information available to rivals
e. all of the above are sources of measurement error

8.44 In traditional hypothesis testing, we
a. try to reject the null hypothesis
b. set up the null hypothesis as a "straw man"
c. make the null and alternative hypotheses exhaustive
d. make the null and alternative hypotheses mutually exclusive
e. all of the above

8.45 What is wrong with the following hypothesis test:
    H₀: X̄ = 12
    Hₐ: X̄ ≠ 12
a. should be stated in terms of μ₀, not 12
b. should involve population parameters
c. does not tell us which is significant
d. does not involve a confidence interval
e. all of the above


8.46 A type II error is
a. rejecting H₀ when we shouldn't have
b. rejecting Hₐ when we shouldn't have
c. not rejecting H₀ when we should have
d. not rejecting Hₐ when we should have
e. the answer depends on the particulars of the problem

8.47 Alpha, α, is
a. the level of significance
b. the probability of a type I error
c. the chance we take that our significance judgments are wrong
d. the chance we take that we are incorrect when rejecting H₀
e. all of the above

8.48 In a two-sided test, each rejection region corresponds to a tail containing probability area
equal to
a. 1 - α
b. 1 - β
c. α / 2
d. β / 2
e. α

8.49 A formula that may be calculated from sample information and hypothesized values in H₀
is called
a. a rejection region
b. a decision rule
c. a test statistic
d. a significance level
e. an hypothesis test

Concepts and Applications:
8.58 State what is wrong about each of the following statements.
(a) I have a theory about why this happens, and someday I am going to test it.
(b) I have a theory, which probably cannot be tested, about why this happens.
(c) That may be true in theory, but in the real world it doesn't happen that way.

8.59 Find what is wrong with the format for each of the following hypothesis statements:
(a) H₀: μ = 5
(b) H₀: μ > 5
    Hₐ: μ = 5
(c) H₀: μ = 5
    Hₐ: μ > 6
(d) H₀: μ = 5
    Hₐ: μ = 5
(e) H₀: X̄ = 5
    Hₐ: X̄ > 5

8.60 Translate the following hypotheses into a verbal statement:
(a) H₀: μ = 20 percent
    Hₐ: μ > 20 percent
(b) H₀: μ = $3.5 billion
    Hₐ: μ > $3.5 billion
(c) H₀: μ = 300 employees
    Hₐ: μ ≠ 300 employees

8.61 State H₀ and Hₐ for each of the following hypotheses:
(a) Mean production in our assembly plants exceeds 1400 machines per day.
(b) The mean time spent in meetings by corporate executives is less than five hours per
week.
(c) The average increase in cost of sending packages first class is more than $100 per
month.
(d) Interest rates offered by banks this year are significantly different from the 5 percent
average available last year.

8.62 Decide which type of error (type I or II) we are referring to under each of the following
situations:
(a) There is a 10 percent chance that rejection of the null hypothesis was an incorrect
decision
(b) although we did not reject H₀, the probability is 5 percent that Hₐ is true instead
(c) the level of significance is .01
(d) α is .10
(e) β is .05

8.63 Which level of significance, .10 or .01, would you select for rejecting the null hypothesis
that there is no problem for each of the following circumstances? Justify your answers.
(a) inspecting trouble-free production processes for abnormal operations
(b) pre-screening a group of high-risk hospital staff employees for AIDS
(c) the need to modify the current marketing campaign


8.64 Headlines are always reporting a scientific study that proves some food or lifestyle
increases your risk of cancer. A few months later, another study refutes those findings.
Explain how the small sample sizes for these studies and a large β may be the source of
these conflicting results.

8.65 Determine the test results for each of the following:
(a) p = .07 and α = .05
(b) p = .07 and α = .10
(c) p = .17 and α = .10
(d) p = .07 and α / 2 = .10
(e) p = .01 and α = .01


8.5 The Practice of Inferential Statistics and Common Pitfalls
Before applying the statistical inference concepts introduced in this chapter to one- and two-
sample problems, regression models, experimental designs, and other forms of inferential statistics,
let's stand back to examine estimation and testing in perspective. The following
discussion will help you avoid several common errors and popular misinterpretations related to
statistical inference.

Hypothesis Testing versus Interval Estimation
Interval estimation and hypothesis testing are both forms of inferential statistics. As
such, testing also makes use of sampling distributions to reach statistical conclusions. A two-
sided test and a confidence interval both use the same t (or z) value for comparable confidence
and significance levels. You also may have noticed that the same symbol, α, is used for the type I
error probability and significance level as well as for the α in the (1 - α) confidence level. For
example, if α = .05, then the corresponding confidence level is 1 minus .05, or 95 percent. This
correspondence is no accident, since both represent areas under the tails of the t distribution.
However, there are also some important differences between testing and interval
estimation. Rather than estimate confidence intervals, the intent of hypothesis testing is to reach
conclusions about hypothesized relationships. Moreover, while the mathematics of two-sided
tests may resemble confidence interval estimation, the mean of the density function for the t (or
Z) distribution is generally different. In addition, the math will only be the same if the levels of
significance and confidence correspond. For example, we may test at the .01 level and then
seek 90 percent confidence intervals.
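The correspondence can be checked directly: at matching levels, a two-sided test rejects H₀ exactly when the confidence interval fails to cover μ₀. A minimal sketch with hypothetical data (SciPy assumed):

    import numpy as np
    from scipy import stats

    x = np.array([11.2, 13.9, 9.5, 12.8, 10.1, 14.6, 12.3, 11.7])   # hypothetical data
    mu0, alpha = 10.0, 0.05

    # Two-sided test at significance level alpha
    t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)

    # Confidence interval at the matching (1 - alpha) level
    n = len(x)
    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    lower, upper = x.mean() - half, x.mean() + half

    # The two decisions agree: reject H0 exactly when mu0 lies outside the interval
    print(p_value < alpha, not (lower <= mu0 <= upper))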
There is also a fundamental conceptual difference between estimation and testing. Notice
that confidence intervals for the mean are centered at the point estimate calculated from the sample
data. By contrast, the center for hypothesis testing is the null hypothesis value μ₀, known
before any sample is drawn. When we test, we construct an interval around the null hypothesis
condition to decide whether the actual sample statistic lies inside it. We then reject H₀ if it is
unlikely that the estimate would lie this many standard deviations away. On the other hand, a
confidence interval attempts to capture the unknown value of the population parameter.
In estimation, the mean of the sampling distribution is the value of the estimator of the
unknown parameter. In testing, the mean of the sampling distribution is the hypothesized
value of the parameter specified in the null hypothesis.
The above distinction leads us to another difference between testing and estimation.
Although statistics texts usually introduce testing after estimation, this ordering is due to the
greater ease of teaching estimation. In actual practice, testing should logically precede estimation.
It makes little sense to report that an additional expenditure of $10 million on advertising adds $3
million, plus or minus $6 million, to profits. Or suppose we conclude from sample evidence that
sales representatives spend an average of 12.5 minutes longer with each client, plus or minus
20.2 minutes. How would a corporate vice president react to this finding? In each instance, the
analyst has placed the cart before the horse. Before you can estimate the size of an increase, you
must first test whether the increase occurred in the first place. For each of these two examples,
the increase, although positive, is not statistically significant because the confidence interval
includes zero and overlaps into effects of the opposite sign. Interval estimates are only
meaningful when applied to statistically significant test results.
Often the decision making problem dictates whether testing or estimation is the main
priority of statistical analysis. Estimation is the only means for quantifying the impact of one
variable on another or forecasting the range of an important time series measure. With very
large samples, in fact, statistical significance loses much of its importance. On the other hand,
testing is highly recommended if we want to determine the validity of a new theory or discover
the effect (if any) of one variable upon another.
Finally, there are situations where testing and estimation have no parallels. While one-
sided tests are often justified, confidence intervals are nearly always constructed from both tails.
In later chapters we will introduce tests for which no meaningful estimation problem exists.

Testing is similar to interval estimation in several ways including the computations.
However, testing is a distinct and complementary category of inferential methods that often
addresses a different class of decision making questions.

The Importance of Negative Findings and Ethical Testing Practices
Despite the formal structure of the process, hypothesis testing is sometimes as much an
art as a science. As with other methods of statistical analysis we have introduced, testing is also
subject to unscrupulous practices. People who want to convince you to do something or justify
their own decisions may manipulate statistical tests for their own purposes. Even honest
practitioners may be subconsciously steered toward tests that favor preconceived results. Others
may find their judgment clouded by using an "ends justifies the means" rationale to promote a
just cause. Reporters and your colleagues at work are of little help either. Just as the news
media doesn't print headlines like "No Planes Crashed Today" or "Food Additive Not Shown to
Cause Cancer," there is bias toward reporting only significant results. Even academic journals
are not immune to this bias. They are less likely to publish articles reporting no significant result
(and researchers tend to not even submit such results) especially if these negative findings upset
established theory. We are tempted to search for excuses too small a data set, measurement
errors, unusual circumstances, etc. when our negative results may simply mean that there is
no significance to be found!
Yet many of the most important findings are those unable to reject the null
hypothesis. For example, a recent study concluded that billions of dollars in health care
costs could be saved by eliminating most heart bypass surgery. This surgical procedure
does not increase life spans significantly when compared with survival among patients
undergoing nonsurgical treatments. Negative test results, especially if unexpected, can be
as important as findings that confirm conventional wisdom.
Fortunately, there are often telltale signs that things are not all they appear, and we have
accepted guidelines for conducting our own tests. The first area to be careful of is in designating
a null hypothesis. If it is too easy to reject, claims of statistical significance may have a hollow
ring. Like the champion boxer who passes up top contenders to fight glass-jawed opponents, an
unrealistic or improbable null hypothesis makes rejection a foregone conclusion. A toothpaste
manufacturer once announced significant reductions in cavities. It turned out that the
comparison was with those who did not brush, so the null hypothesis amounted to the unlikely
equality of means between the brushing and non-brushing groups.
Another way testing may be abused is in selection of the type of test and the level of significance. As we stated earlier, we should construct our hypothesis and designate α before gathering and analyzing a sample. The reason for this sequence is to prevent the sample patterns from influencing these choices. The value of α should be selected according to the Type I error acceptable for the problem. We should also decide beforehand whether a one-sided test is justified based upon previous studies and theory on the subject. If instead we peek at the data or play around with some statistics, ethical procedures may be compromised.
Suppose we had intended to set α at .01 but a p-value of .03 resulted from the test. We
could argue that our standard for rejecting the null hypothesis had been to test at the .05 level all
along. But would we have reported a .01 significance standard had the p-value been .007
instead? If our answer is yes, then we are behaving less than ethically.
Similarly, maintaining a .05 significance level may easily be diluted to a weaker α = .10 merely by switching from a two-sided to a one-sided test. The wider rejection region of a one-sided test makes it more accommodating to modest t-ratios if they are in the expected direction. By contrast, the two-sided test resembles the density function graph for a one minus α confidence interval, with one-half α in each tail. The distance (i.e., the number of standard deviations) from the middle of the distribution to the beginning of the tail will therefore be less for the one-sided test. Thus, weaker sample evidence allows us to reject H0 for one-sided tests that would not be rejected using a two-sided test.
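The dilution is easy to verify from the critical values themselves. The sketch below (Python with scipy assumed; df = 15 is an arbitrary choice for illustration) compares the one-sided and two-sided cutoffs at the same α = .05.

# Sketch: critical t values at alpha = .05 for a one-sided versus a
# two-sided test (df = 15 chosen only for illustration).
from scipy import stats

alpha, df = 0.05, 15
t_one_sided = stats.t.ppf(1 - alpha, df)        # all of alpha in one tail
t_two_sided = stats.t.ppf(1 - alpha / 2, df)    # alpha/2 in each tail

print(f"one-sided cutoff:  {t_one_sided:.3f}")  # about 1.753
print(f"two-sided cutoff:  {t_two_sided:.3f}")  # about 2.131
# A t-ratio of, say, 1.9 in the "expected" direction is significant under the
# one-sided rule but not under the two-sided rule at the same alpha.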
Which form we choose therefore depends upon our previous knowledge of the decision
making problem. If theory or findings from prior studies indicate that rejection of the null
hypothesis can occur in only one direction (say, greater than), then we are justified in conducting
a one-sided test. If either direction of inequality is conceivable, a two-sided test is in order.
Suppose you have inadvertently observed the data or statistical output. What should you do
then? One solution is to ask yourself what you would do if the sign of t were reversed. You
should be able to honestly claim that you would not have reported significance. More
unscrupulous analysts use one-sided tests whenever these tests enable the desired conclusions to
be reached.
Ethical practices should be observed in selection of the null hypothesis, the type of test, and
the significance level.

When Significance May Not Be Significant
For much of the century, hypothesis testing of the form outlined in this chapter, often
called classical hypothesis testing, has been the dominant form of statistical analysis used in
business and economics. In recent years, classical testing has come under attack for reasons
cited already and because of its inherent limitations from constructing H0 and HA.
As we mentioned at the beginning of this section, it is important to construct a null hypothesis that represents the opposite of the alternative hypothesis. Even so, rejecting H0 does not give us a specific alternative HA. Before accepting a hypothesis into the ranks of established
theory, it is essential that test results be replicated. In other words, the test should be repeated
under a variety of conditions to check whether statistical significance is generally found. In
some sciences, statistical testing is bypassed altogether in favor of many replications until the
theory is confirmed. However in business and economic statistics, repeated "verification" of
statistically significant test results may bring us no closer to eliminating competing alternative
hypotheses. We may be forced to wait for a sampling situation that lets us discriminate between
them (see Exercises).
Perhaps the most common confusion about hypothesis testing is with the term
"significance." If you check a thesaurus, you'll find synonyms such as important, substantial,
vital, or pivotal. In most cases, substantial differences of sample means are also statistically significant. Conversely, unimportant effects are usually not statistically significant. The parallel between important and significant is especially likely for smaller samples. Only large or persistent differences can produce the sizable numerators in the t-ratio necessary to conclude statistical significance. But because the value assigned to μ0 is unlikely to be precisely true for the population, virtually any null hypothesis can be rejected if the sample is large enough. One alternative is to replace the single value μ0 with an interval of values to eliminate this bias (see Exercises). For very large samples, estimates such as the summary measures used in Part One (and extended for bivariate data in Chapter 5) are usually more useful instruments for decision makers. Even with smaller samples, to prevent confusion it is good practice to report these estimates along with more glamorous test findings.¹⁰
Statistical significance does not indicate how substantial a result is, especially for larger
samples.



10 Stimulating discussions of these points are found in A. Sawyer and J. Peter, "The Significance of Statistical Significance Tests in Marketing Research," Journal of Marketing Research (May 1983): 122-133; and D. McCloskey, "The Bankruptcy of Statistical Significance," Eastern Economic Journal (Summer 1992): 359-361.

Multiple Choice Review:
8.66 Which of the following is not a limitation of statistical significance?
a. in large samples, significance can be found from minor patterns
b. in small samples, even substantial effect size may not result in significance
c. although we can select α, we usually don't know β
d. rejecting H0 does not necessarily support any particular HA
e. all of the above are limitations

8.67 Which of the following is not an unethical practice in hypothesis testing?
a. choosing an H0 that is easy to reject
b. always conducting one-sided tests
c. conducting hypothesis tests prior to estimating confidence intervals
d. choosing significance levels just large enough to obtain significance
e. all of the above are unethical

8.68 A one-sided test that yields significance at the α = .05 level is equivalent to significance
for the corresponding two-sided test at the
a. α = .10 level
b. α = .05 level
c. α = .025 level
d. α = .01 level
e. none of the above

8.69 One difference between hypothesis testing and interval estimation is
a. hypothesis tests may have no meaningful estimation counterpart
b. hypothesis testing involves inferential statistics
c. hypothesis tests examine the distribution centered around x̄
d. hypothesis testing loses much of its usefulness for small samples
e. hypothesis testing is more important in forecasting problems

Chapter Review Summary
A statistic is a random variable derived from information in a random sample, and its
probability distribution is the sampling distribution for that statistic.
The t distribution is a family of density functions, t(k), with the k parameter distinguishing one member of the family from another. Properties of the t distribution: symmetrical and unimodal; infinitely long tails; μ = 0; σ > 1 and tails thicker than the Z distribution (the thicker tails and larger σ are most noticeable if k is very small); and approximates the Z distribution for large values of k.
For a random sample of size n observations of a normally distributed random variable X, (x̄ - μ)/s_x̄ has a t(n-1) distribution with parameter k equal to (n - 1). t_α/2(n-1) is the number of estimated standard errors required to place α/2 probability in each tail and (1 - α) in the middle of the t(n-1) distribution.
Central Limit Theorem: The sampling distribution of x̄ for a random variable X whose mean is μ and standard deviation is σ will approximate a normal pdf with mean μ and standard deviation σ_x̄ = σ/√n if the sample size n is sufficiently large. Central Limit Theorem for Unknown σ: The sampling distribution of x̄ when X has mean μ will approximate a t distribution with mean μ and standard deviation s_x̄ = s/√n if the sample size n is sufficiently large.
A point estimate is a numerical guess for an unknown population parameter. An
estimator of a parameter is unbiased if its expected value equals that parameter. Bias is the
difference between a parameter and the expected value of its estimator. Sampling error is the
difference between a parameter and the value of its unbiased estimator in a particular sample.
An interval estimate (a, b) is a range of values from a to b likely to contain the
population parameter being estimated. The confidence level is our degree of confidence that the
unknown population parameter being estimated lies within an interval estimate. A confidence
interval is a combination of interval estimate and confidence level information about an
unknown population parameter. There is a tradeoff between interval width and level of
confidence. The greater the confidence level, the wider the interval is.
A hypothesis is a statement that can be subjected to empirical evidence. A theory is a
hypothesis that is found consistent with empirical evidence. In stating a hypothesis, all possible
conditions of the unknown population are reduced to one of two states: an alternative hypothesis HA, the condition the test is trying to establish; and a null hypothesis H0, the state we wish to reject. Hypothesis testing consists of rejecting or not rejecting H0 based on sample evidence. Rejecting H0 allows us to accept HA. Hypotheses can never be proved, only rejected.
Accepted theories are replaced when other theories do a better job of explaining and predicting.
Theories in business are seldom valid for all time and all situations. Their precision is
limited by the complexity of human behavior. Unlike the sciences, business theories seek to
explain the motivations behind the observed data. Both estimation and testing use sample
evidence to make statistical inferences about population parameters. Instead of estimating
confidence intervals, testing attempts to determine whether these parameters have particular
hypothesized values.
HA and H0 are made mutually exclusive and exhaustive events by defining the null hypothesis as the complement of the alternative hypothesis. A one-sided test is required to test an alternative hypothesis containing a one-way inequality, either strictly greater than or strictly less than; a not-equal sign in HA directs us to perform a two-sided test. In tests for whether the population mean is equal to μ0, two one-sided tests and a two-sided test are available. H0 and HA refer to the unknown population parameters, not their sample estimators.
If we reject the null hypothesis when in fact H0 is the true state of nature for the population, we commit a Type I error; conversely, failure to reject H0 when HA is true for the population involves a Type II error. In hypothesis testing, α is the probability of committing a Type I error and β is the probability of a Type II error. The smaller the value we assign to α, the larger will be the β, and vice versa. By assigning a value to α, we automatically determine β. Although usually unknown, β may be reduced by adopting a larger α or increasing the sample size.
The probability of a Type I error, α, is the significance level for the test.
If the test results in rejection of H0, the inequality expressed by HA is considered statistically significant at the α significance level; if H0 cannot be rejected at the α significance level, then the expression described by HA is not statistically significant.
A decision rule for a test is the criterion for rejecting or not rejecting the null hypothesis. A test statistic is a random variable constructed from sample information and the assumption that the null hypothesis is true. The sampling distribution of a test statistic is the distribution the test statistic will have if we assume that the null hypothesis is true. A rejection region is the range of test statistic values improbable enough to reject the null hypothesis at the α level of significance. Decision Rules Using Test Statistics: We reject H0 if the test statistic lies within the rejection region. Otherwise, we cannot reject H0.
The p-value is the probability of obtaining a test statistic as or more extreme than the one obtained from the sample, assuming H0 is true for the population. The p-Value Decision Rule for Hypothesis Testing: If the p-value is less than the significance level α, reject the null hypothesis; if it is not less than α, do not reject the null hypothesis. A p-value decision rule is equivalent to the corresponding test statistic decision rule, and the p-value rule usually is more flexible and easier to use. Now that computer software reports them, p-values are used to conduct most statistical tests today. The p-value for a test reports the observed significance level appropriate to the sampling distribution and null hypothesis. Therefore, the decision rule reduces to: if p is less than α, reject H0; if p is not less than α, we cannot reject H0.
Procedures for Hypothesis Testing: express the hypothesis HA to be tested in terms of population parameters and assign the intended inequality sign; state the equality condition for the null hypothesis H0 as the complementary event to HA; assign a significance level α based on the decision maker's willingness to risk a Type I error (and the implicit tradeoff with Type II error); collect a random sample from the population; obtain the p-value for the sample data from the area of the tail(s) in the sampling distribution of the test statistic; use the p-value decision rule to decide whether to reject the null hypothesis; and translate test results into conclusions meaningful to decision makers.
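For readers who want to see these steps strung together, here is a minimal sketch in Python (scipy assumed; the data, μ0, and α are hypothetical) of a one-sided, one-sample t test carried through the p-value decision rule.

# Sketch of the p-value decision rule for a one-sided, one-sample t test.
# (Hypothetical data; alpha is chosen before looking at the sample.)
from scipy import stats

alpha = 0.05
mu0 = 50                                             # value stated in H0
sample = [53, 48, 55, 52, 57, 51, 54, 49, 56, 52]    # hypothetical data

t_stat, p_two_sided = stats.ttest_1samp(sample, mu0)
# Convert to an upper-tail p-value for HA: mu > mu0
p_value = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 in favor of HA: mu > {mu0}")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject H0")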
In estimation, the mean of the sampling distribution is the value for the estimator of the unknown parameter. In testing, the mean of the sampling distribution is the hypothesized value of the parameter specified in the null hypothesis. Testing is similar to interval estimation in several ways including the computations. However, testing is a distinct and complementary category of inferential methods that often addresses a different class of decision making questions.
Negative test results, especially if unexpected, can be as important as findings that
confirm conventional wisdom. Ethical practices should be observed in selection of the null
hypothesis, the type of test, and the significance level. Statistical significance does not indicate
how substantial a result is, especially for larger samples.

Chapter Review Exercises:
8.76 "Conclusive evidence" from extensive taste tests showed Coca Cola management that a
new sweeter cola formula was preferred over its current version. As a result, the largest
soft drink company changed over to the new formula in the mid-1980s. Negative public
reaction led Coke to bring back their "classic" formula. Discuss whether this episode
illustrates an example of improper choice of α. To what extent may there have been a
misapplication by management of the test results?

8.77 We may often infer the plus or minus variation of an estimate from the precision or extent
of rounding reported. For each of the following, state what the uncertainty is:
(a) The average cross-country fare has fallen about $120 since deregulation of the airlines
(b) Costs have risen approximately $21 million in our foreign subsidiaries
(c) The average number of rejections has been about 1000 engines per week in the
assembly phase
(d) The average wear on the cylinder is .034 millimeters per year for the life of the machine






CHAPTER 9 ONE- AND TWO-SAMPLE
INFERENCES

Approach: This is the first of several chapters that apply the inference concepts and methods
learned in Chapter 8. Here we explore the estimation and hypothesis testing of single
population means and proportions and differences between the means of two populations.
Sampling distributions, all of which we learned about in Chapters 7 and 8, are applied to these problems and used to calculate confidence intervals and test hypotheses.
Where We Are Going: Very similar estimation and testing procedures may be used to
estimate and test hypotheses associated with regression, experimental design, and other
inference situations commonly arising in business statistics. Only a few new sampling
distributions will be needed to supplement our current arsenal, each relying on a
particular set of distributional requirements. Even when these conditions are not met,
nonparametric methods will provide us with a safety net of powerful inferential
procedures.
New Concepts And Terms:
finite population correction factor
pooled sample standard deviation
inference for proportions and two-sample differences

SECTION 9.1 One-Sample Estimation of the Mean
SECTION 9.2 One-Sample Tests for the Mean
SECTION 9.3 Inference for Matched Pair Samples


9.1 One-Sample Estimation of the Mean
It is now time to start applying our new knowledge of sampling distributions and
inference procedures. Business decisions often rely on estimating a parameter or assessing the true state of that unknown parameter for a single population. When an automaker uses a random sample of 20 sport utility vehicles to estimate the mean number of defects, how wide is the 95 percent confidence interval? How can survey results from 200 randomly selected consumers be used to decide whether a significantly larger proportion of the population prefers Burger King's new french fries to those at McDonald's? Can a personnel manager at a major corporate headquarters test whether employee age has a significant effect on the number of sick days a worker calls in each year? In this and the following three chapters, we roll up our sleeves and get our hands dirty tackling these and many other kinds of estimation and testing problems.
We can't expect to solve each of these problems using the same statistical methods. In Part I, we found that different population measures and statistical methods are often needed to describe different kinds of data: univariate, bivariate, and multivariate data; categorical, discrete, and continuous data; and survey, experimental, and observational data. For example, we described univariate population data by means, medians, or proportions, but regression was capable of fitting data as a relationship among variables. It is therefore not surprising that sample inference methods may also differ from one type of problem to the next.
In this chapter we consider some of the most commonly used one- and two-sample
inference problems. The statistics for these methods all involve sampling distributions we have previously seen: the normal, t, and binomial distributions. We will therefore be able to draw upon the results of the central limit theorem and other sampling distribution insights developed in the previous chapter. In later chapters, we will learn about multivariate methods and other types of univariate inference that use other sampling distributions and rely on additional understanding of sampling distributions. What all the inference methods do share is a
common vocabulary and standard analytical procedure. Although the problem context and data
may require a different sampling distribution and statistical technique, each estimation and
testing method uses the same general approach, analytical steps, and terminology outlined in
Chapter 8.
We begin here with one-sample estimation of the mean. For convenience, the
organization of this section parallels the discussion of sampling distributions in the previous
chapter. The next section examines hypothesis testing of the sample mean, including the
important quality control applications. Sample inference for proportions and two-sample
inference are covered in the last two sections.

However, an important warning is in order before we get started. Because material in
this chapter is traditionally covered first, business statistics students are easily tempted to identify
the specific methods in this chapter as all there is to inferential statistics. Nothing could be
further from the truth. Many consultants and business policy analysts who conduct statistical
analysis every day might go years without performing a method from this chapter. For example,
most economic statisticians (known as econometricians) rely exclusively on multivariate
regression and never use one- and two-sample methods. Finally, many business statisticians
strongly prefer nonparametric methods, such as those presented in Chapter 12, because of the
types of small and non-normal data sets they typically analyze.
Many other types of statistical inference problems besides one- and two-sample inference
are important concerns to business analysts. Recognizing which method is proper for each
kind of problem is essential. However, the dozens of inferential methods share common
procedures and terminology with one another.

Estimating the Mean when Standard Deviation is Known
We begin by constructing confidence intervals for the mean when x is a normally distributed random variable and the standard deviation σ is already known. This procedure will then permit us to examine interval estimates for the case of acquisition-prone corporations. Although a majority of populations in business and economics are non-normal, many common types of business-related random variables are approximately normal. Later in this section, we will extend the procedure introduced here to situations involving unknown σ and non-normal x.
Recall from Chapter 8 that 1.96 standard deviations below and above μ will capture 95 percent of the probability for any normally distributed random variable x, leaving two equal-sized tails each containing 2 1/2 percent of the probability. We now have everything we need to construct 95 percent confidence intervals for μ. In probability notation,
P( μ - 1.96σ < x < μ + 1.96σ ) = 0.95
We also learned in the previous chapter that if x is normal, then x̄ must also be normally distributed. For each random sample that we collect, x̄ will then have a 95% chance of falling within 1.96 σ_x̄ of μ.
P( μ - 1.96σ_x̄ < x̄ < μ + 1.96σ_x̄ ) = 0.95
This equation provides us with a confidence interval for x̄, but not for μ. By rearranging the terms in this expression algebraically, the following probability statement may be derived:
P( x̄ - 1.96σ_x̄ < μ < x̄ + 1.96σ_x̄ ) = 0.95
This expression provides us with a 95 percent confidence interval for μ, the mean of x.
If x is normally distributed, x̄ is the point estimator of μ and the 95% confidence interval has endpoints 1.96 standard errors of the mean on either side of x̄:
95% C. I. = ( x̄ - 1.96σ_x̄ , x̄ + 1.96σ_x̄ )
This formula requires that we already know the value of σ. The 95% confidence interval is shown in relation to the normal distribution in Figure 9.1.
Let's return to the antitrust economist from last chapter's case study. A 95 percent confidence interval for the mean ROR could be calculated if the population standard deviation, 5.8, were known from previous years' information and the population were presumed to be approximately normal. Suppose her sample was the third of the five we drew, samp3 (see Figure 9.1, reproduced from Chapter 8). Notice that the mean x̄ was 4.75 for this sample, more than a full percentage point above the population mean for ROR, 3.42 percent. Reporting 4.75 without any warnings about the effects of sampling error would have misled the Justice Department policy makers. Once the actual population data became available, the economist would be looking for a new job.
By reporting a confidence interval instead, the economist would have incorporated the effects of sampling error in her estimated size of ROR. In Chapter 8, we found σ_x̄ = 1.45 by dividing σ = 5.8 by the square root of the sample size, n = 16. Thus,
1.96 σ_x̄ = (1.96)(1.45) = 2.84
and the 95 percent confidence interval based on this sample is 4.75 ± 2.84, or the following interval:
samp1 samp2 samp3 samp4 samp5
Mean 2.625 2.625 4.75 2.063 0.875
Standard Error 1.426 0.682 1.116 1.174 0.926
Standard Deviation 5.702 2.729 4.465 4.697 3.704
Minimum -9 -3 -2 -10 -7
Maximum 11 8 13 9 7
Count 16 16 16 16 16
Figure 9.1

95% C. I. for μ = (1.91, 7.59), approximately (1.9, 7.6)
The economist would then report she was 95 percent confident that the mean rate of return for
acquisition-prone corporations in 1987 was between 1.9 and 7.6 percent. This interval does
contain the true population mean, 3.42 percent. Had she drawn a different sample instead, say
samp2 with a mean of about 2.62, the confidence interval would have been different:
95% C. I. for μ = 2.62 ± 2.84 = (-0.22, 5.46), approximately (-0.2, 5.5)
This interval, however, also contains 3.42. In fact, all five samples described in Figure 9.1 have means that lie within 2.84 points of the population mean, μ = 3.42. Thus, the 95 percent confidence intervals for all five samples were wide enough to contain the population mean. This result is not very surprising when you recall that an average of 19 out of 20 (95%) of random samples should produce intervals that contain μ.
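These hand calculations are easy to reproduce. The following sketch (Python; only the sample means from Figure 9.1 and the known σ = 5.8 are needed) recomputes the five 95 percent z-intervals.

# Sketch: 95% z-intervals for the five ROR samples of n = 16, known sigma = 5.8.
import math

sigma, n, z = 5.8, 16, 1.96
se = sigma / math.sqrt(n)                      # sigma_xbar = 1.45
sample_means = {"samp1": 2.625, "samp2": 2.625, "samp3": 4.75,
                "samp4": 2.063, "samp5": 0.875}

half = z * se                                  # 1.96 * 1.45 = 2.84
for name, xbar in sample_means.items():
    print(f"{name}: ({xbar - half:.2f}, {xbar + half:.2f})")
# samp3 gives (1.91, 7.59); every interval here happens to contain mu = 3.42.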
Minitab may be used to compute these confidence intervals directly (see the instruction and sample dialog boxes in Figure 9.2).

Using Minitab to Obtain Z-Intervals from Columns of Sample Data
Pull-Down Menu sequence: Stat Basic Statistics 1-Sample z...
Complete the 1-Sample z DIALOG BOX as follows:
(1) click mouse on variables to be estimated and press Select button
(2) type in the (1 - α)100% confidence level value in the box next to Confidence interval:
(3) type in the known value of the standard deviation (σ) in the box next to Sigma:
(4) click on the OK button
Figure 9.2

The ROR samples result in the five confidence intervals of the Minitab output (see Figure 9.4).
Confidence Intervals

The assumed sigma = 5.80

Variable N Mean StDev SE Mean 95.0 % C.I.
samp1 16 2.62 5.70 1.45 ( -0.22, 5.47)
samp2 16 2.62 2.73 1.45 ( -0.22, 5.47)
samp3 16 4.75 4.46 1.45 ( 1.91, 7.59)
samp4 16 2.06 4.70 1.45 ( -0.78, 4.91)
samp5 16 0.87 3.70 1.45 ( -1.97, 3.72)
Figure 9.4


The assumed sigma value, 5.80, was supplied in the dialog box (step 3 in Figure 9.2) and used
to calculate the standard error (1.45) reported in the fifth column of all five samples.
Recall that we calculated a 95% confidence interval of (1.91, 7.59) for the third sample,
the same interval reported for samp3 on the right end of the Minitab output. Now it is easy to
see that all five of the confidence intervals contain the true value of μ = 3.42. In fact, only the
fifth sample has an interval endpoint at all close (3.72) to the population mean.
More generally, we can label the combined area of the two distribution tails α, as we did in Chapter 8. Then 95 percent confidence intervals become a special case where α = .05. Thus, the confidence level is the probability area (1 - α)100% between these tails. The confidence coefficient, (1 - α), is the decimal fraction equivalent to the confidence level percentage.
In Chapter 7, we defined z_α/2 to be the number of standard deviations from the cumulative standard normal pdf that leaves α/2 probability in each tail and (1 - α) in between. Using this notation, we may state the formula for any confidence interval under the assumptions so far considered.
For a normally distributed random variable X and known σ, the (1 - α)100% confidence interval for μ is ( x̄ - z_α/2 σ_x̄ , x̄ + z_α/2 σ_x̄ ).
Confidence intervals other than 95% may be constructed merely by changing the value of α in z_α/2. For example, the 99 percent confidence interval results from specifying z_.005. The general case for the (1 - α)100% confidence interval is shown in Figure 9.5.
What if the Justice Department policy makers had been satisfied with confidence intervals containing the true ROR in an average of 4 out of 5 samples? Figure 9.6 contains the z-intervals for an 80% level of confidence.
Confidence Intervals

The assumed sigma = 5.80

Variable N Mean StDev SE Mean 80.0 % C.I.
samp1 16 2.62 5.70 1.45 ( 0.77, 4.48)
samp2 16 2.62 2.73 1.45 ( 0.77, 4.48)
samp3 16 4.75 4.46 1.45 ( 2.89, 6.61)
samp4 16 2.06 4.70 1.45 ( 0.20, 3.92)
samp5 16 0.87 3.70 1.45 ( -0.98, 2.73)
Figure 9.6

Because the intervals are based on information in the same five samples, the only things different
in this Minitab output are the confidence intervals themselves. The intervals are narrower than
their 95% counterparts because they only need to contain the actual population mean 80 percent
of the time. Now, only four of the five confidence intervals contain 3.42. The fifth sample's interval lies completely below the true population mean. For an 80 percent confidence level, we would expect an average of one out of five samples to produce a confidence interval that does not contain μ. Of course, that is just an average rate of occurrence. Another set of five random samples might have only three or all five intervals containing μ.
Wouldn't you rather report the (2.89, 6.61) interval from Figure 9.6 for the third sample than the (1.91, 7.59) found earlier for the 95 percent confidence level? But remember the price for obtaining the narrower interval. Remember that the interval for samp5 does not even include μ = 3.42. If the economist had been unfortunate enough to draw the fifth sample, the lower confidence level would no longer have protected her from reporting a range completely below the mean rate of return for the acquisition-prone firms. She might have mistakenly concluded from this narrow interval that profit performance was even lower than it actually was, perhaps even indicating losses that threatened eventual insolvency. The wider 95 percent interval includes low enough profits to dissuade her from this radical and incorrect conclusion about merger motives.
For the 95 percent confidence intervals in Figure 9.7, each tail contained (5 percent)/2, or .025. At the 80 percent confidence level of Figure 9.8, the z value must only be far enough from the mean to include 80 percent of the area under the normal curve. The remaining 20 percent is split between the two equal-sized tails, so the area under each tail must be 10 percent, or .10. The z_.10 value is substantially smaller, so the confidence interval has a narrower range.

Chapter Case #1: Insured by the Good Hands People
Air bags and crackdowns on auto theft rings and drunk drivers are among
the factors allowing auto insurers to lower their rates recently. In preparing an
article on this subject, a local newspaper reporter wants to estimate area auto
insurance premiums charged to safe drivers. Using subscriber lists, she conducts a
telephone survey to collect a random sample of 12 annual premiums. It is clear
from the sample data in Figure 9.9 that insurance rates can vary considerably even
for safe-driver policyholders. She knows enough about statistics not to report the
sample mean as nearly $700 without also including a margin of error based on
sample inference.
Examination of insurance data from previous years convinces her that the
population is approximately normal with a population standard deviation of
prem
850
610
600
550
600
1056
750
650
545
600
1200
360
Figure 9.9
approximately $200. The reporter therefore is justified in using the normal distribution to
construct a confidence interval for the mean.
Using the Excel Confidence function (see instructions in Figure 9.10), the reporter obtains a 95% confidence interval half-width of $113 (shown in Figure 9.11). She therefore writes that although the insurance industry does not publish average rates broken down by region, a scientific study of the paper's subscribers indicates that the annual premium for safe drivers averages about $700. She then adds the important qualification that this estimated mean is based
on a random sample of 12 subscribers and is subject to a margin of error of $113 and a 5%
chance of a larger error. Alternatively, she may report her findings as the following interval
estimate: estimated premiums average between $585 and $810.
Using Excel to Find Confidence Interval Half Width for Known Sigma
Click on the fx button to open the Function Wizard - Step 1 of 2 Box:
Select CONFIDENCE from Function Name listing
Click on the Next button to open the Function Wizard - Step 2 of 2 Box:
(1) type in the value of α (1 minus the confidence level) in the box following alpha
(2) type in the population standard deviation in the box following standard_dev
(3) type in the sample size in the box following size
(4) the confidence interval half width appears on the upper right after Value:
Figure 9.10
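The same half-width can be computed outside Excel. Below is a brief Python sketch (scipy assumed; the function name half_width is ours) applied to the reporter's inputs of α = .05, σ ≈ $200, and n = 12.

# Sketch: confidence-interval half width for known sigma, mirroring
# Excel's CONFIDENCE(alpha, standard_dev, size).
import math
from scipy import stats

def half_width(alpha, sigma, n):
    """z_{alpha/2} * sigma / sqrt(n)."""
    return stats.norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)

print(round(half_width(0.05, 200, 12)))   # about 113 dollars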

Optimal Sample Size and Correcting for Finite Populations*
There is another way to obtain narrower intervals without reducing the confidence level.
All too often, the sample has already been collected or the sample size is otherwise outside our
control. But if a statistician is involved early enough in the process, she can make the sample
size a crucial ingredient in the study design.
The confidence interval formulas hold the key to finding the sample size that is required for any combination of confidence level and interval width. The general formula for a confidence interval for μ (for normally-distributed X and a known value of σ) is
x̄ ± z_α/2 σ_x̄
Thus, half the width of a confidence interval is z_α/2 σ_x̄. Let's call this interval half-width d. Then, according to the definition for σ_x̄, the size of d is determined by three factors: the z value, the standard deviation, and the sample size.



d = z_α/2 σ/√n
Solving for n and squaring both sides, we arrive at a formula for sample size n. The result allows us to present the rule for selecting the optimal sample size.
The sample size needed for a (1 - α)100% confidence interval of μ with width 2d is
n = ( z_α/2 σ / d )²

Therefore, we should choose a sample size directly proportional to the square of the standard deviation and to the square of the z distribution value for the desired confidence level. In addition, the sample size is inversely proportional to the square of the interval width we want.
As an example of how to approximate the sample size needed, again consider rate-of-return data for corporations heavily involved with acquisitions. What if the antitrust economist needed a 95 percent confidence interval no wider than 4 percentage points (a half-width d of 2), and she knew beforehand that σ was roughly 6?¹ The formula for optimal sample size yields the following result:
n = ( z_α/2 σ / d )² = [(1.96)(6)/2]², or approximately 36
On the other hand, if Justice Department policy makers only needed the less-inclusive 68 percent confidence interval, she could use a z_α/2 value only half as large. Thus, a sample of [(1)(6)/2]² = 9 corporations would suffice. This is the same sample size needed to obtain a 95 percent interval with twice the width (8 percentage points). Thus, we see that doubling the desired interval width or halving z_α/2 will reduce the required n by a factor of 2², or 4 times. This result is exactly what we would predict from our formula.
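A small helper function makes the sample-size rule easy to apply. The sketch below (Python with scipy; the function name required_n is ours) uses the ballpark σ of 6 and a half-width of 2; the small difference from the rounded figure above comes from using z = 1.96 rather than 2.

# Sketch: required sample size for a (1 - alpha) z-interval of half-width d.
import math
from scipy import stats

def required_n(alpha, sigma, d):
    """n = (z_{alpha/2} * sigma / d)^2, rounded up to a whole observation."""
    z = stats.norm.ppf(1 - alpha / 2)
    return math.ceil((z * sigma / d) ** 2)

print(required_n(0.05, 6, 2))   # (1.96*6/2)^2 = 34.6, rounded up to 35 firms
print(required_n(0.32, 6, 2))   # 68% confidence: z is roughly 1, giving 9 firms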
To illustrate, a random sample of only n = 4 firms was collected from the 1987 ROR population data. The four observations, -1, 3, 8, and 9, have a mean (4.75) identical to that of the third sample of n = 16 firms (refer back to Figure 9.6). A 68 percent confidence interval is calculated in Figure 9.12.



1 All we usually want is a rough estimate for sample size, so our knowledge about σ merely has to be a ballpark guess.

Confidence Intervals
The assumed sigma = 5.80

Variable N Mean StDev SE Mean 68.0 % C.I.
samp n=4 4 4.75 4.65 2.90 ( 1.87, 7.63)
Figure 9.12

As expected, the σ_x̄ = 2.90 for the n = 4 sample is twice as large as the 1.45 used earlier for n = 16. However, the lower confidence level, 68% rather than 95%, offsets this. The net result after rounding is identical confidence intervals: (1.9, 7.6). Thus, the 68% confidence interval for a sample of only four acquisition-prone firms has the same width as a 95% confidence interval for a sample of 16 firms.
Larger samples reduce σ_x̄, resulting in narrower confidence intervals. However, larger samples require additional time and resources, scarce commodities for business decision-makers. This trade-off between sample size and interval width is a fact of life among professional pollsters and marketing survey firms.¹ Businesses requiring immediate, quick-and-dirty interval estimates often are willing to accept the moderate sampling errors associated with small survey sizes. More cautious survey sampling requires much larger samples, and consequently the higher survey costs usually reserved for final pre-production market tests. Beyond sample sizes of five to ten thousand, there is little additional reduction in sampling error regardless of population size (see Exercises for examples).
Optimal sample size is based on three factors: variability of the population data, estimation
precision needed, and confidence level acceptable to the decision makers.
Strictly speaking, the standard errors we have presented here and in Chapter 8 apply only to random samples from infinite-sized populations. What if populations are finite and we sample a large portion of that population? Of course, if sample size n equals the population size N, we have collected the entire population. Confidence intervals then shrink down to a single point: the actual measurement of μ, not an estimate of it.
If n is only a fraction of N, on the other hand, σ_x̄ values based on σ/√n are inflated and
the resulting confidence intervals are too wide. In practice, if the sample size n is a small
fraction of N, the formulas we have been using yield a close enough approximation. For n/N in
excess of about 1/20 or 1/10, however, a finite population correction factor is usually necessary.

1
Polls and surveys also use special types of random samples, called stratified and cluster samples, discussed in Chapter 5, to further reduce the
standard error for a given sample size, time, and money expended.

DEFINITION: The finite population correction factor, fpc, adjusts standard errors downward for the effects of sampling a substantial fraction of a population. The fpc and the finite population formula for σ_x̄,f are the following:
FORMULA: fpc = √[(N - n)/(N - 1)] and therefore σ_x̄,f = fpc × σ_x̄ (or, when σ is estimated, s_x̄,f = fpc × s_x̄)
For the rate of return case from Chapter 8, the population N consisted of 164 corporations and the sample sizes n were 16. Since
(N - n)/(N - 1) = (164 - 16)/163 = 0.908
the square root, 0.953, is the finite population correction factor. Therefore,
σ_x̄,f = (1.45)(0.953) = 1.38
Replacing σ_x̄ in our earlier calculations with σ_x̄,f, the 95 percent confidence interval for our third sample shrinks from (1.9, 7.6) to (2.0, 7.5). The economist's guess about the true value of σ would probably cause at least this much estimation error. Thus, the economist was justified in disregarding the fpc in her analysis.
Not so were n a substantial fraction of N. Suppose the antitrust economist had the time
and resources to randomly sample n = 64 out of the 164 acquisition-prone firms in the
population. The uncorrected confidence interval for one such sample is presented in Figure 9.13.

Confidence Intervals

The assumed sigma = 5.80

Variable N Mean StDev SE Mean 95.0 % C.I.
samp 64 64 2.281 4.617 0.725 ( 0.860, 3.703)
Figure 9.13

Because of the larger sample size, the (0.9, 3.7) interval is not as wide as those for n = 16. But it should be even narrower! The fpc is only 0.783, the square root of (164 - 64)/163. Thus, σ_x̄,f = 0.568 should be used in place of σ_x̄ = 0.725. With a smaller interval half-width of (1.96)(0.568) = 1.11, the finite-population-adjusted 95% confidence interval (1.2, 3.4) should be reported.
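The correction itself takes only a few lines. Here is a Python sketch using the summary numbers above (N = 164, n = 64, σ = 5.8, and the sample mean 2.281).

# Sketch: finite population correction applied to the n = 64 ROR sample.
import math

N, n, sigma, xbar, z = 164, 64, 5.8, 2.281, 1.96

se = sigma / math.sqrt(n)                       # uncorrected: 0.725
fpc = math.sqrt((N - n) / (N - 1))              # 0.783
se_f = fpc * se                                 # corrected: 0.568

half = z * se_f                                 # about 1.11
print(f"fpc = {fpc:.3f}, corrected SE = {se_f:.3f}")
print(f"95% CI: ({xbar - half:.1f}, {xbar + half:.1f})")   # (1.2, 3.4)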


Finite population correction may be ignored unless over 10 percent of a population is
sampled. If a substantial fraction of the population is sampled, however, only the finite
population correction factor allows the proper inferences to be made.

Estimating the Mean for Unknown Standard Deviation
What if the value of σ is not known? As we discussed in Chapter 8, this is the case more often than not. By substituting s for σ when calculating the standard error of x̄, we have an estimator of σ_x̄, the estimated standard error of the mean s_x̄. We may use this s_x̄ in place of σ_x̄ to construct confidence intervals whenever σ is not known.
We also can no longer use the normal distribution. Recall from the previous chapter that the shape of the sampling distribution of (x̄ - μ)/s_x̄ is described by the more flexible t distribution. For a random sample of size n collected on a normally distributed random variable X, (x̄ - μ)/s_x̄ has a t distribution with n - 1 degrees of freedom. The t distribution resembles the normal distribution for larger values of n, but has much thicker tails for very small samples.
When σ must be estimated by s, σ_x̄ gets replaced by s_x̄ and the sampling distribution changes from Z to t. It thus becomes a straightforward task to modify the formula for constructing confidence intervals.
For a normally-distributed random variable X and unknown σ, the (1 - α)100% confidence interval for μ is calculated from a random sample of n observations by
(1 - α)100% C. I. for μ = ( x̄ - t_α/2(n-1) s_x̄ , x̄ + t_α/2(n-1) s_x̄ ).
As we did with the Z distribution, t_α/2(n-1) is the t value that leaves an α/2 probability area in each tail of the distribution. Figure 9.14 shows the 95% confidence interval for a sample size of n = 9 and unknown σ.
In Chapter 8, we also learned that the estimated standard deviation s differs from one sample to the next, not fixed at a known value σ. For the third sample we drew in the Chapter 8 rate-of-return case, the sample standard deviation s was 4.46, somewhat less than the actual 5.81 value for the population parameter σ. By contrast, s = 5.70 for the first sample was nearly the same as σ, while s = 2.73 for the second sample was less than half the size of σ. The other three sample standard deviations (4.46, 4.70, and 3.70) were in between those of samples 1 and 2.
We reprint the top half of Minitab summary statistics for all five samples in Figure 9.15.

Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
samp1 16 2.62 3.50 2.86 5.70 1.43
samp2 16 2.625 3.000 2.643 2.729 0.682
samp3 16 4.75 4.50 4.64 4.46 1.12
samp4 16 2.06 2.50 2.43 4.70 1.17
samp5 16 0.875 0.500 1.000 3.704 0.926
Figure 9.15

Because s varies from the true population parameter σ,¹ s_x̄ will also differ from σ_x̄ = 1.45. Dividing s by √n in the third sample, we find that
s_x̄ = 4.46/4 = 1.12 for sample 3
considerably greater than
s_x̄ = 2.73/4 = 0.682 for sample 2
Notice that this information is also provided in the last column of the Minitab printout labeled SEMean, the standard error of the mean. Whenever s is smaller than σ, s_x̄ is smaller than σ_x̄. Had s been larger than σ, then s_x̄ would have been larger than σ_x̄.
Unlike the z_α/2 standard normal values, the t_α/2 values vary with sample size. This also affects the width of confidence intervals. The 16-firm samples have n - 1 = 15 degrees of freedom, so the t_α/2(n-1) necessary to construct a 95% confidence interval is t_.025(15).
Inverse Cumulative Distribution Function

Student's t distribution with 15 d.f.

P( X <= x) x
0.0250 -2.1315
Figure 9.16

Based on the Minitab inverse cumulative value reported in Figure 9.16, an interval 2.13 estimated standard errors on either side of the sample mean locates μ with a 95% level of confidence. Observe that t_.025(15) = 2.13 is larger than the z_α/2 = 1.96 we used earlier. Because of the thicker tails of the t distribution, we have to travel about an additional 1/6th of a standard deviation (2.13 - 1.96 = 0.17) farther into each tail to obtain an interval containing 95% of the sampling distribution.

1 Technically, it is s² that is an unbiased estimator of σ², the population variance. The square root of the expected value is not precisely the same as the expected value of the square root, hence the modest bias.

The antitrust economist would then construct a 95% confidence interval t standard errors around the sample mean. This confidence interval has endpoints
t_.025(15) s_x̄ = (2.13)(1.12) = 2.39 percentage points
on either side of 4.75, the mean for sample 3. The resulting interval is
4.75 ± 2.39, or (2.36, 7.14)
Other confidence levels, such as 80 or 99 percent, could have been constructed in the same way we did earlier from z values. Here, however, t values are used. Thus, an 80% confidence interval is obtained by using t_.10(15), and a 99% interval requires us to use t_.005(15).
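The same arithmetic can be scripted directly. The sketch below (Python with scipy assumed) recomputes the t critical value and the 95 percent interval for sample 3 from its summary statistics.

# Sketch: 95% t-interval for sample 3 (n = 16, xbar = 4.75, s_xbar = 1.12).
from scipy import stats

n, xbar, se = 16, 4.75, 1.12
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, n - 1)      # t_.025(15), about 2.13
half = t_crit * se                              # about 2.39
print(f"t critical value = {t_crit:.2f}")
print(f"95% CI: ({xbar - half:.2f}, {xbar + half:.2f})")   # about (2.36, 7.14)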
In practice, we let statistical software perform the computations and obtain the t
distribution values for us (in Figure 9.19). The instructions for using Minitab are provided in the
accompanying instruction and dialog boxes (see Figures 9.17 and 9.18).
Using Minitab to Obtain a Student-t Confidence Interval
Pull-Down Menu sequence: Stat Basic Statistics 1-Sample t...
Complete the 1-Sample t DIALOG BOX as follows:
(1) click mouse on variable to be estimated and press Select button
(2) type the (1 - α)100% confidence level value in the box next to Confidence interval:
(3) click on the OK button
Figure 9.17

Confidence Intervals

Variable N Mean StDev SE Mean 95.0 % C.I.
samp1 16 2.62 5.70 1.43 ( -0.41, 5.66)
samp2 16 2.625 2.729 0.682 ( 1.170, 4.080)
samp3 16 4.75 4.46 1.12 ( 2.37, 7.13)
samp4 16 2.06 4.70 1.17 ( -0.44, 4.57)
samp5 16 0.875 3.704 0.926 ( -1.099, 2.849)
Figure 9.19

Again, except for rounding, all results are the same as our earlier computations from the t distribution. Nearly all important information (n, x̄, s, and s_x̄) is provided to construct or verify the intervals at the end of each line of output. The economist who drew sample 3 but lacked knowledge about σ could still report a confidence interval based on an estimated standard deviation and the relevant densities of the t distribution.

Returning to the chapter case, the newspaper reporter would probably not have access to reliable historical data about the standard deviation of auto insurance rates. Her subscriber survey sample of 12 premiums could still be used to find a confidence interval based on an estimated standard deviation and t values. Figure 9.20 displays the Minitab t interval output. Notice that the 95% confidence interval (550, 850), rounded to two figures, is centered at the sample mean, about $700. However, this interval is quite a bit wider than the (585, 810) interval found earlier for known σ. A wider interval is usually (but not always) the price we pay for the additional uncertainty introduced in estimating σ from a small sample.
Confidence Intervals

Variable N Mean StDev SE Mean 95.0 % C.I.
prem 12 697.6 234.4 67.7 ( 548.6, 846.6)
Figure 9.20

Based on these subscriber survey findings, the reporter writes that the average annual premium for safe drivers is between $550 and $850, with a 5% chance that the mean lies outside this range. Alternatively, she could report a point estimate and interval half-width, (846.6 - 548.6)/2 = 298/2 = 149. This version would state that the estimated mean is about $700, subject to a $150 margin of error.
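Starting instead from the twelve raw premiums in Figure 9.9, a few lines of Python (scipy assumed) reproduce essentially the same interval the reporter obtained from Minitab.

# Sketch: 95% t-interval computed from the raw premium data of Figure 9.9.
from scipy import stats

premiums = [850, 610, 600, 550, 600, 1056, 750, 650, 545, 600, 1200, 360]

xbar = sum(premiums) / len(premiums)            # about 697.6
se = stats.sem(premiums)                        # s / sqrt(n), about 67.7
lo, hi = stats.t.interval(0.95, len(premiums) - 1, loc=xbar, scale=se)

print(f"mean = {xbar:.1f}, SE = {se:.1f}")
print(f"95% CI: ({lo:.1f}, {hi:.1f})")   # close to Minitab's (548.6, 846.6) in Figure 9.20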

Estimation of the Mean for Large Samples
In Chapter 8, we discovered that the central limit theorem provides large-sample distributional properties for x̄ that approximate the sampling distributions when X is normally distributed. We therefore may apply the same confidence intervals derived earlier in this section. Even if σ must also be estimated from the sample data, we simply use confidence intervals based on s and the t distribution.
The confidence interval for μ based on a large sample size n collected on a random variable X may be approximated by
( x̄ - z_α/2 σ_x̄ , x̄ + z_α/2 σ_x̄ ) if σ is known,
and by
( x̄ - t_α/2(n-1) s_x̄ , x̄ + t_α/2(n-1) s_x̄ )
if σ is unknown, regardless of the distribution of X.

We apply the central limit theorem to interval estimation by returning to the CPA chapter case
discussed in Chapter 2.

Chapter Case #2: Forms Over Substance?
Every spring, millions of people spend their weekends rummaging through drawers and
shoeboxes full of crumpled receipts. Despite reforms to simplify the tax code and filing process,
many of us lack the patience and competence to file tax returns ourselves. Instead, Americans
rely on the services of tax preparers to interpret the complex language, recent code changes, and
latest tax court rulings. Clients may even come out ahead if preparers find enough money-saving
deductions.
As tax season approaches, a small but rapidly growing CPA firm is interested in the
average number of IRS forms and schedules filed per client. The most time-consuming part of
tax preparation is filling out the forms. If the managing partner in the CPA firm has a good
estimate of the number of forms each client needs, he can better decide how many professionals
to staff this tax season.
Figure 9.21  Histogram for CPA Firm, n = 45 Clients (horizontal axis: Number of Tax Forms and Schedules; vertical axis: Number of Clients)

Because there were only a few tax changes this year, the managing partner bases his
estimates on a random sample of 45 client bills from the previous spring. Unfortunately, the
population standard deviation is unknown, and the Minitab histogram of sample data (see Figure
9.21) strongly suggests that the number of forms per client has a non-normal distribution. An
examination of the CPA firms clientele indicates the sources of this bimodal distribution. Those
using the "short" and "EZ" 1040 forms seldom hire the higher-priced services of a CPA. Thus,
only one client in the sample needed a single form to file. However, different types of clients
being served explains the two modes (one around 3 or 4 and another near 9 forms). Many
smaller CPA firms have two classes of clients: salaried clients who need relatively few forms
and self-employed business operators who require many more forms.
Nevertheless, the partner is still justified in using the t distribution to closely approximate
the 80% confidence interval he needs to make staffing decisions. The central limit theorem may
be invoked, because the sample size is reasonably large for a distribution whose histogram is not
excessively skewed.
Confidence Intervals

Variable N Mean StDev SE Mean 80.0 % C.I.
n forms 45 6.022 3.187 0.475 ( 5.404, 6.640)
Figure 9.22

Figure 9.22 displays the Minitab t interval output for the sample data. Thus, the managing partner can be 80 percent confident that μ, the mean number of forms per client, is 6.0, with a margin of error of 0.6 forms. If he also knows the average time to process each form and the number of new clients during the mid-February to mid-April filing crunch, the managing partner can decide how many additional CPAs and support staff need to be hired this year.
We have seen how normal and t sampling distributions may be used to construct
confidence intervals from large samples. But what can we do if samples are too small to invoke
the central limit theorem? A sample drawn from a highly skewed population or a population
with extreme outliers, for example, would have to be very large to meet the standards for the
central limit theorem. The reason is that the sample mean remains highly sensitive to whether an extreme value makes it into the sample.
Smaller samples may still provide decision-makers with estimates of the mean. Logarithms also may be used to convert a non-normal variable to an approximately normal one. Then, we may compute confidence intervals for the log of μ by using the Z or t distribution and the log of the sample data (a brief sketch of this approach appears after Table 9.1).¹ Alternatively, nonparametric methods may be used because they require fewer restrictions, such as assuming normally distributed populations. Several of these methods will be presented in Chapter 12. Table 9.1 summarizes the estimation methods discussed here. In the next section, however, we extend the inferential methods learned in this section to investigate hypothesis testing of the mean.

1 To convert back to un-logged form, raise estimates to the power of the base used for the logarithms.
Table 9.1
Summary of Sampling Distributions for One-Sample Estimation of the Mean

                     Normal Distribution                Distribution Not Normal
              Large Sample    Small Sample     Large Sample              Small Sample
σ known       z intervals     z intervals      approximate z interval    use nonparametric
σ unknown     t intervals     t intervals      approximate t interval      inference or
                                                                           transform data
CASE MINI-PROJECT:
A premium is the price charged to keep insurance coverage. A random sample of n = 58 life
insurance policyholders is surveyed. Below is the histogram for this sample of insurance policy
premiums.

Histogram of premium N = 58

Midpoint Count
5.0 15 ***************
15.0 17 *****************
25.0 10 **********
35.0 7 *******
45.0 5 *****
55.0 2 **
65.0 0
75.0 2 **

1. The sample indicates that insurance premiums are / aren't distributed approximately
normally. [circle one]
2. According to the ________ theorem, if σ is known / unknown (circle one), then the
mean premium should have an approximately normal sampling distribution because the sample
size, n = 58, is relatively large. If σ is known / unknown (circle one), the sampling
distribution approximates a Student-t distribution instead.

Confidence Intervals

Variable N Mean StDev SE Mean 90.0 % C.I.
premium 58 22.64 17.44 2.29 ( 18.81, 26.47)

Based on the above output for annual policy premium (in dollars), answer the following:
3. The standard error of the mean, 2.29, is about one-eighth the sample standard deviation,
17.44, because we must first divide it by the ____________ of the sample size, 58.
We are ____ percent confident that life insurance premiums have a mean between $18.81 and
$________ for the overall population.
4. Approximate a 95% confidence interval for mean premium. [Hint: x̄ ± 2s_x̄]
Multiple Choice Review:
9.1 The finite population correction factor (fpc) should be used if
a. the sample size is a substantial fraction of the population size
b. the sample size is larger than 30.
c. the population mean is small.
d. populations are not normally distributed.
e. any of the above indicate that the fpc should be used.

9.2 A confidence interval based on the Student t sampling distribution will
a. tend to be wider than an interval based on the normal distribution.
b. always be wider than an interval based on the normal distribution.
c. tend to be narrower than an interval based on the normal distribution.
d. always be narrower than an interval based on the normal distribution.
e. tend to have the same width as an interval based on the normal distribution.

9.3 Suppose that analysis of sample data results in an estimated mean for insurance adjustors
of 12.5 claims processed per day with a margin of error of 1.6 and a 95 percent level of
confidence. Then the confidence interval reported is
a. (11.7, 13.3)
b. (10.9, 14.1)
c. (12.5, 14.1)
d. (12.5, 15.7)
e. unable to determine from the information provided

9.4 Based on the preceding problem, 95 percent represents
a. the likelihood that interval contains the population mean
b. the likelihood that interval does not contain the population mean
c. the likelihood that interval contains the sample mean
d. the likelihood that interval does not contain the sample mean
e. none of the above

9.5 The log transformation may be used when working with variables having a
a. bimodal distribution
b. symmetrical distributions
c. uniform distributions
d. distribution with two infinitely-long tails
e. highly skewed distribution


9.6 Which of the following variables would be a likely candidate for log transformation?
a. the wealth of students' parents at your college
b. the height of students at your college
c. the grade point average of students at your college
d. the length of textbooks used at your college
e. all of the above

Answer the next three questions based on the following case and statistical output:
A manufacturer collects a sample of 24 monthly sales (in millions of dollars) to estimate
mean monthly sales for the population:

Confidence Intervals

Variable N Mean StDev SE Mean 90.0 % C.I.
sales 24 10.796 4.273 0.872 ( 9.301, 12.291)

9.7 For this case, the standard error of the mean is a little more than one-fifth the size of the
sample standard deviation because:
a. The square-root of the sample size is slightly less than five
b. The standard deviation is a bit less than five
c. Half the mean is slightly greater than five
d. Twice the width of the confidence interval is slightly greater than five
e. None of the above facts are relevant here

9.8 Which of the following can we conclude?
a. the population mean is $10.8 million
b. we are confident that sales are between $9.3 and $12.3 million in 90% of all months in
the population
c. population mean is $10.8 million and we are 90% confident that the sample mean lies
between $9.3 and $12.3 million
d. The sample mean has an 80% chance of lying within the interval from $9.3 million to
$10.8 million
e. The sample mean is $10.8 million and we are 90% confident that the population mean
is between $9.3 and $12.3 million

9.9 The confidence interval reported is substantially narrower than four-standard-errors-of-
the-mean wide because
a. 90% confidence intervals are narrower than 95% intervals
b. confidence intervals using the t-distribution are narrower than those using the z-
distribution
c. the standard deviation is fairly small in this sample
d. we must first divide by the square root of the sample size
e. all of the above

Calculator Problems:
9.10 With a calculator, determine σ_x̄ from σ and n. Summarize the relative effect of σ and n
upon σ_x̄ from these results.
(a) σ = 10, n = 9
(b) σ = 10, n = 81
(c) σ = 5, n = 36
(d) σ = 1, n = 900

9.11 Determine the optimal sample size n (rounded to the nearest integer) given z_α/2 = 1.96
for each of the following cases:
(a) d = 2 and σ = 25
(b) d = .01 and σ = 0.1
(c) d = 300 and σ = 10,000

9.12 Find the optimal sample size n in the preceding exercise for z_α/2 equal to 2.58 and 1.65.

9.13 Use the finite population correction factor to revise the 95 percent confidence interval for
the samples 2 through 5 presented in the chapter case example.

9.14 Determine the correction factor in each of the following cases
(a) n = 25 and N = 200
(b) n = 4 and N = 100
(c) n = 64 and N = 2000
(d) n = 4 and N = 20
In which cases will the correction factor make much of a difference? Explain.

9.15 For each part of the preceding exercise, calculate the 95 percent confidence interval if x̄ =
22.5 and σ = 10.

9.16 Compare the correction factor when n = 4 and N = 12 with the case of n = 100 and N =
300. Despite having identical n/N ratios, why do the correction factors differ? Explain.

9.2 One-Sample Tests for the Mean
In this section, we explore the other type of inference about the population mean of a
random variable x: hypothesis testing. In Chapter 8, we introduced you to the special vocabulary
associated with hypothesis testing: null and alternative hypotheses, type I and type II error, one-
and two-sided tests, test statistics, p-values and significance levels, and decision rules and
statistical significance. We also outlined the steps for stating hypotheses and conducting
statistical tests. Recall that these steps involve the following:
(1) state the null and alternative hypotheses and assign a significance level
(2) collect a random sample from the population
(3) obtain the p-value tail area(s) from the sampling distribution of the test statistic
(4) apply the p-value decision rule and translate the test results meaningfully for decision makers

To state the hypotheses, we must select a null hypothesis value μ0 and decide whether a
one-sided or two-sided test of the mean is appropriate. There are two possible directions for the
inequality in the one-sided test. The formats for the three versions are:

One-Sided Tests    H0: μ = μ0        or    H0: μ = μ0
                   HA: μ > μ0              HA: μ < μ0

Two-Sided Test     H0: μ = μ0
                   HA: μ ≠ μ0

Sampling Distributions for the Test Statistic

Most of what we learned about the sampling distribution of x̄ in Chapter 8 may be applied
equally well to either testing or interval estimation. In particular, we discovered in the previous
chapter that if X is normally distributed and σ is known, then x̄ from a random sample is also
distributed normally with standard deviation σ_x̄. By the central limit theorem, this sampling
distribution for x̄ is approximately valid for large samples even if X is not normal. Finally, if σ is
not known, we need only replace the standard normal distribution with the t distribution and
substitute the estimated standard error, s_x̄, for the unknown σ_x̄. We now apply these sampling
distribution properties to the general testing procedure outlined in Chapter 8.

One-sample tests and estimates of the mean rely on the same sampling distributions. Thus,
random samples from a normal distribution have normally-distributed means if σ is known
and t distributions if σ is unknown. These sampling distributions apply approximately for
large enough samples from non-normal populations.
For samples from a normally-distributed population with σ known, x̄ is normally distributed
with mean μ and standard deviation σ_x̄. Remember how we standardized a normal distribution in
Chapter 7? By subtracting the mean and dividing by the standard deviation, we created a
random variable z = (x − μ)/σ with a zero mean and a standard deviation of one. We can do the
same with the sample mean.
If X is normally distributed with mean μ and a known standard deviation σ, then
(x̄ − μ)/σ_x̄ has a standard normal distribution.
To conduct a test, we start by assuming the null hypothesis is true. When H0: μ = μ0 is true,
(x̄ − μ0)/σ_x̄ will have a standard normal distribution. The t distribution replaces this sampling
distribution when the standard deviation must also be estimated from the sample data.

If the null hypothesis is true, (x̄ − μ0)/σ_x̄ has a standard normal sampling
distribution, provided X is normally distributed or the sample is sufficiently large. If σ is unknown,
then (x̄ − μ0)/s_x̄ has a t(n−1) distribution instead.

This gives us the test statistics for inferences about the population mean, also called the z- and t-statistics.¹

DEFINITION: The z-statistic (x̄ − μ0)/σ_x̄ and the t-statistic (x̄ − μ0)/s_x̄ are the test statistics for
the population mean from samples with known and unknown standard deviations, respectively.

Like other standardized random variables, the z- and t-statistics measure the number of standard
errors from the mean.

The z- and t-statistics for one-sample mean tests measure how many standard errors x̄ is
from the null hypothesized value μ0.

A small test statistic of 0.5, for example, tells us that the sample mean is only half a standard
error from μ0. This minor difference between x̄ and μ0 could easily be due to sampling error. On

¹ We will also use t-tests in Chapter 11 for testing the significance of independent variables in regression models.
the other hand, a t- or z-statistic equal to 4 indicates a sample mean four standard errors from μ0.
This large difference is unlikely enough to justify rejection of the hypothesis H0 that μ equals μ0.
How many standard errors is large enough to reject the null hypothesis? That depends on
the sampling distribution probabilities and the significance level assigned to the decision. To
reject the null hypothesis, the test statistic must lie within the rejection region of the sampling
distribution.

In two-sided tests of the mean, the test statistic falls in one of the two rejection-region tails if
z-statistic:  (x̄ − μ0)/σ_x̄ > z_α/2   or   < −z_α/2        (σ known)
t-statistic:  (x̄ − μ0)/s_x̄ > t_α/2(n−1)   or   < −t_α/2(n−1)        (σ unknown)
In one-sided tests, the test statistic lies within the single rejection-region tail if
test statistic > z_α or t_α(n−1)    for HA: μ > μ0
test statistic < −z_α or −t_α(n−1)    for HA: μ < μ0

Examples of rejection regions for one-sided and two-sided t-tests are illustrated in Figures 9.23
and 9.24. Notice that the rejection regions for the two-tailed test begin farther from μ0 because
each tail contains only half as much probability (α/2) as the one-sided tests.
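If printed t and z tables are not handy, the critical values that define these rejection regions can be looked up with a statistical library. The short Python/SciPy sketch below is our own illustration, not part of the text's Minitab examples; it prints the one- and two-sided critical values for α = .05 and a sample of n = 19.

# Illustrative critical values for one- and two-sided tests at alpha = .05 with n = 19.
from scipy import stats

alpha, n = 0.05, 19
print(stats.norm.ppf(1 - alpha))              # one-sided z critical value, about 1.645
print(stats.norm.ppf(1 - alpha / 2))          # two-sided z critical value, about 1.960
print(stats.t.ppf(1 - alpha, df=n - 1))       # one-sided t critical value, about 1.734
print(stats.t.ppf(1 - alpha / 2, df=n - 1))   # two-sided t critical value, about 2.101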

Interpreting the p-Value Decision Rule
Test statistics and rejection regions help us understand how we are able to infer statistical
significance from even very small samples. In practice, however, the p-value decision rule
introduced in Chapter 8 is far easier to use. The p-value reports the observed significance
level, the area in the sampling distribution tail (or tails for two-sided tests) beyond the test
statistic. In the case of one-sample inference of the mean, the p-value measures the probability
of obtaining a sample mean at least as far from μ0 as the x̄ we got.

p-Value Decision Rule for One-Sample Tests of the Mean:
if p < α, reject H0 and conclude μ is significantly greater/less than μ0
if p ≥ α, cannot reject H0 and conclude μ is not significantly different from μ0
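The decision rule is easy to automate. The hedged sketch below wraps the whole procedure in a small helper function of our own devising (the function name and arguments are illustrative, not from the text), computing the t-statistic and its tail area with SciPy; calling it with the summary statistics of the chapter case that follows reproduces the Minitab results in Figure 9.28.

# A sketch of the p-value decision rule for a one-sample t-test (helper function is ours).
from math import sqrt
from scipy import stats

def one_sample_t_test(xbar, s, n, mu0, alpha=0.05, alternative="greater"):
    se = s / sqrt(n)                               # estimated standard error of the mean
    t_stat = (xbar - mu0) / se                     # standard errors between x-bar and mu0
    if alternative == "greater":
        p = stats.t.sf(t_stat, df=n - 1)           # right-tail area
    elif alternative == "less":
        p = stats.t.cdf(t_stat, df=n - 1)          # left-tail area
    else:
        p = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided: double the tail area
    return t_stat, p, ("reject H0" if p < alpha else "cannot reject H0")

print(one_sample_t_test(7.568, 3.847, 19, 6.0))    # roughly (1.78, 0.046, 'reject H0')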

Chapter Case #3: Money Makes the World Go Round
To see hypothesis testing in action, consider an example from public monetary policy.
The U. S. Federal Reserve Board, or FED for short, is arguably the most powerful government
agency in the world. Although unelected, FED directives are designed to control the levels of
national income, unemployment, and inflation. Because the U.S. economy exercises enormous
influence, the global economy is also greatly affected by FED policy.
The FED oversees monetary policy by using various tools at its disposal to control the
money supply, a common narrow definition of which is M1 (pronounced "Em-One"). However,
the FED is pulled in two opposing directions. Since a volatile money supply can destabilize the
entire economy, the FED attempts to maintain a stable, or "target," rate of growth in M1. To
monetarists like Nobel-laureate Milton Friedman, instability in the money supply is a major
source of business cycle fluctuations. For more than a quarter century, monetarist members who
claim allegiance to a monetary growth target have dominated the FED.
However, Dr. Friedman contends that the FED often veers off track and overshoots the
growth rate target. Especially during severe recessions, such as 1974-75, 1980-82, and 1990-91,
the FED faces tremendous political pressure to increase the money supply enough to fuel
economic recovery. Although the FED is free to formulate monetary policy, a disgruntled
Congress could revoke the FED's independence. Like a dieter who fails to attain his weight loss
goals because of occasional eating binges, a FED that fights recessions may elevate average
growth in M1 significantly above the stated target rate.
The FED's target growth rate for M1 over the last quarter century
has been about 6 percent, sufficient to allow for population and
productivity growth in the economy. Has the FED exceeded this target?
To shed light on this question, a bank economist conducts a one-sided test
at the α = .05 significance level on the following hypotheses:
H0: μ = 6 percent
HA: μ > 6 percent
The economist begins by collecting M1 data and calculating its
annual percentage growth rate from 1976-1977 through 1994-1995 (see
Figure 9.24).¹ The 19 observations may be treated as a random sample of
opportunities for the FED to change M1 over this period.
A portion of the descriptive summary statistics from the Excel
output is shown in Figure 9.25. The sample mean is 7.57 percent, more
than 1.5 points above the 6 percent target rate. However, is this
difference statistically significant? After all, the data range of 16 points
is relatively large: M1 rose by as little as 0.8 percent and as much as 16.8 percent.

¹ Derived from first quarter M1 data (currency and demand deposits, in billions of U.S. dollars, average of daily figures, seasonally
adjusted). Source: Board of Governors of the Federal Reserve System via Dow Jones Statistical Release H.6 and The Federal Reserve Bulletin.

Figure 9.24
Time Interval    M1 % Growth
1994-1995            0.8
1993-1994           10.1
1992-1993           11.9
1991-1992           10.9
1990-1991            4.3
1989-1990            1.8
1988-1989            3.4
1987-1988            3.8
1986-1987           16.8
1985-1986           11.5
1984-1985            6.5
1983-1984            9.1
1982-1983            9.3
1981-1982            6.6
1980-1981            6.9
1979-1980            7.7
1978-1979            7.4
1977-1978            7.8
1976-1977            7.2
Percentage changes in economic and business
time series variables generally have approximately
normal distributions. In addition, frequency
distributions of money supply growth rate data appear
normally distributed. Although the sample size is not
very large, a t-test is justified based on the test statistic (x̄ − μ0)/s_x̄. We now enlist the
computer to find all the information needed to conduct this test. Figure 9.26 presents the
instructions and a dialog box example for performing Minitab 1-Sample t tests, and Figure 9.28
contains the results for the M1 growth rate case study. Later in this section, we will show how to
conduct z-tests using computer output.

Using Minitab to Conduct a One Sample t-Test
Pull-Down Menu sequence: Stat Basic Statistics 1-Sample t...
Complete the 1-Sample t DIALOG BOX as follows:
(1) click mouse on variable to be tested and press Select button
(2) click mouse on Test mean: to run a t-test (instead of Confidence interval:)
(3) type the value of μ0 from your null hypothesis into the box next to Test mean:
(4) choose Alternative: hypothesis (not equal for a two-sided test, less than or greater
than for the two varieties of one-sided tests)
(5) click on the OK button
Figure 9.26

T-Test of the Mean

Test of mu = 6.000 vs mu > 6.000

Variable N Mean StDev SE Mean T P-Value
M1Growth 19 7.568 3.847 0.883 1.78 0.046
Figure 9.28

Figure 9.25
M1 % Growth
Mean                 7.57
Standard Error       0.88
Standard Deviation   3.85
Range               16.00
Minimum              0.80
Maximum             16.80
Count               19.00

All the relevant information for understanding and conducting the t-test is provided in the
Minitab output of Figure 9.28. The first line
Test of mu = 6.000 vs mu > 6.000
states the null and alternative hypotheses, indicating that μ0 is 6 and we want to conduct a one-
sided test with HA: μ > 6. Inspecting the remaining two printout lines from left to right:
Variable N Mean StDev SE Mean T P-Value
M1Growth 19 7.568 3.847 0.883 1.78 0.046

we see that the first four entries contain information also found earlier in the Descriptive
Statistics output. The standard error of the mean (SE Mean) reported is easily verified from
information provided on the sample standard deviation and sample size:
s_x̄ = s/√n = 3.847/√19 = 0.883
Similarly, we can also check the correctness of the test statistic reported, 1.78:
t = (x̄ − μ0)/s_x̄ = (7.568 − 6)/0.883 = 1.568/0.883 = 1.78
Thus, the sample mean is less than two standard errors above the target rate.
How often should we expect a sample of 19 observations to have a mean that is 1.78
standard errors more than the population mean? The answer to this question is provided by the
p-value, the final and most useful element of the t-test output. The p-value of 0.046 tells us the
chance that the observed difference between x̄ and μ0 = 6 was due simply to sampling error.
Sample means at least 1.78 standard errors above μ0 will occur only 4.6 percent of the time when
H0 is true. We may verify this p-value directly from the t-distribution. The inverse cumulative t-
distribution for n 1 = 18 degrees of freedom accumulates 4.6 percent of its probability area in a
single tail at 1.78 standard errors from the mean (see Figure 9.29). Before computers, however,
p-values were seldom used in testing because they were difficult to find precisely.
Inverse Cumulative Distribution Function

Student's t distribution with 18 d.f.
P( X <= x) x
0.0460 -1.7798
Figure 9.29
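Readers without Minitab can verify the same tail area with any statistical package; the sketch below uses SciPy's t distribution (our choice of tool, not part of the text's output) to reproduce both the 0.046 p-value and the inverse cumulative value shown in Figure 9.29.

# Verifying the one-sided p-value for t = 1.78 with 18 degrees of freedom.
from scipy import stats

t_stat, df = (7.568 - 6) / 0.883, 18
print(t_stat)                       # about 1.78
print(stats.t.sf(t_stat, df))       # upper-tail area, about 0.046
print(stats.t.ppf(0.046, df))       # inverse cumulative value, about -1.78, as in Figure 9.29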

We now may apply the p-value decision rule. As you recall, this rule is exquisitely
simple: if p < α, we reject H0. Otherwise, we cannot reject H0. Since the p-value, 0.046, is indeed
less than α = .05,¹ we may reject the null hypothesis at the .05 significance level and
conclude that the actual rate was significantly greater than the FED's target rate.
This finding allowed us to promote our hypothesis to a legitimate theory that has stood up
to hypothesis testing against actual data. Until other data reveals a different pattern, we maintain
that the FED significantly exceeds its M1 targets over the long run.
For this one-sided test, the rejection region was all in the right tail. What if a two-sided
hypothesis were justified instead? What if the bank economist also considers the possibility that
below-target M1 growth may occur? The FED is composed of bankers, a profession that fears
inflation more than recession. The surest way to soak up excess inflationary pressure is a tight
money policy of slow M1 growth. If departures in either direction from the 6 percent target rate
are possible, the economist should conduct the following two-sided hypothesis test:
H0: μ = 6
HA: μ ≠ 6
Using a t-test procedure to conduct a two-sided test, we obtain the Minitab output
reproduced in Figure 9.30.
T-Test of the Mean
Test of mu = 6.000 vs mu not = 6.000

Variable N Mean StDev SE Mean T P-Value
M1Growth 19 7.568 3.847 0.883 1.78 0.092
Figure 9.30

The output is the same as before except for two crucial differences. The first line of the output
replaces > with not = (not equal) to inform us that a two-sided test was performed. The
other difference is that the p-value, 0.092, is twice as great as the one-sided value (0.046). This
larger p-value for the two-sided test is no longer less than α. Thus, we cannot reject the null
hypothesis, nor can we conclude that the FED allowed the annual rate of growth in M1 to drift
significantly above the 6 percent target rate.
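The doubling of the p-value is easy to confirm numerically. Assuming SciPy is available, the brief check below (ours, not Minitab's) doubles the single-tail area beyond t = 1.78 and recovers the 0.092 reported in Figure 9.30.

# The two-sided p-value is double the one-sided tail area beyond t = 1.78 (18 d.f.).
from scipy import stats

p_one_sided = stats.t.sf(1.78, df=18)   # about 0.046
print(2 * p_one_sided)                  # about 0.092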
Why are we not able to reject H0 using a two-sided test? We explained in Chapter 8 how
α in a one-sided test is confined to only a single tail of the t distribution, making the rejection
region wider. Although the significance level (α = .05) remains the same, t = 1.78 no longer lies
inside the rejection region. Our ability to obtain significance was due to the one-sided nature of
the earlier test. Because we expected any departures by the FED to be errors on the high
side of the target rate, the rejection region could include a wider range of t-statistic values. The
price we pay for this advantage is the sacrifice of the entire rejection region for negative t values.
As we mentioned in Chapter 8, the ethical implications should be considered whenever choosing
between one- and two-sided tests.

¹ Did you hesitate before agreeing with this conclusion? Although 46 is greater than 5, the first is carried to one more decimal place. Be careful
when you compare decimal fractions of different length. Before comparing the p-value with α, add enough extra trailing zeros to the shorter
decimal fraction. In this case, for example, compare 0.046 with 0.050.
Finally, data on the M1 growth rate in previous years (1966-1975) had a standard
deviation of 4.55. If the bank economist assumed that σ = 4.55 for the subsequent period as well,
a z-test could be performed instead. The Minitab output would have looked like Figure 9.31.
The instructions and sample dialog box are shown in Figure 9.32.

Using Minitab to Conduct a One Sample z-Test
Pull-Down Menu sequence: Stat Basic Statistics 1-Sample z...
Complete the 1-Sample z DIALOG BOX as follows:
(1) click mouse on variable to be tested and press Select button
(2) click mouse on Test mean: to run a z-test (instead of Confidence interval:)
(3) type the value of μ0 from your null hypothesis into the box next to Test mean:
(4) choose Alternative: hypothesis (not equal for a two-sided test, less than or greater
than for the two varieties of one-sided tests)
(5) type in the known value of σ in the box next to Sigma:
(6) click on the OK button
Figure 9.32

Z-Test

Test of mu = 6.00 vs mu > 6.00
The assumed sigma = 4.55

Variable N Mean StDev SE Mean Z P-Value
M1Growth 19 7.57 3.85 1.04 1.50 0.067
Figure 9.31

Generally, the thinner tails of the normal distribution produce smaller p-values than
comparable t-distributions. However, the population standard deviation that we imposed (4.55)
exceeds the sample standard deviation (3.85) that would be used for a t-test. Thus, the p-value of
0.067 for this z-test is too large to reject the null hypothesis.
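The z-test in Figure 9.31 can also be checked by hand or with a few lines of code. The sketch below is an illustration assuming SciPy, not part of the text's output; it plugs the assumed σ = 4.55 into the z-statistic and finds its upper-tail area.

# Checking the z-test of Figure 9.31 with sigma assumed known to be 4.55.
from math import sqrt
from scipy import stats

xbar, mu0, sigma, n = 7.57, 6.0, 4.55, 19
z = (xbar - mu0) / (sigma / sqrt(n))    # about 1.50
print(z, stats.norm.sf(z))              # one-sided p-value, about 0.067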
On the other hand, if growth rates of M1 are non-normal and the sample is too small for the
large-sample approximations to hold, t- and z-tests should not be used. Instead, growth rates
could be transformed into logarithms, which may have a normal distribution. Alternatively,
nonparametric tests, such as those discussed in Chapter 12, should be considered. These tests
impose less-restrictive distributional assumptions than do the tests considered in the present
chapter.

Significance Testing in Quality Control*
In industrial quality control, periodic sampling of production is used to figure out whether
equipment is operating at designed specifications. If current operations are outside specified
standards, the source of the problem is identified and its cause corrected. Rather than risk a
product recall or customer dissatisfaction, a manager may temporarily shut down an entire
facility until quality is restored.
Frequent samples must be collected to provide continual monitoring of quality. Sample
sizes are usually small to keep monitoring costs down. Can hypothesis testing help us make
quality control decisions based on the information from a very small sample?

Chapter Case #4: Coming Up Short
Recently, a major dairy products company faced embarrassing publicity. A national
consumer group claimed the public was getting less yogurt than the company's 8 oz. label
advertised on the container. After weighing a randomly collected sample of yogurt containers
from grocer shelves, the consumer group reported to the press that mean product weight was
substantially less than 8 ounces. The public relations director immediately went into damage
control by questioning the size of the sample. Can a handful of containers really say very much
about the mean weight of the millions of yogurt cartons the company produces each year?
To demonstrate how convincing test results can arise from even very small samples, lets
examine one-sided test results for different samples. For the yogurt case, the null and alternative
hypotheses are the following:
H0: μ = 8 ounces
HA: μ < 8 ounces

From years of production data, suppose it is well established that content weights are
approximately normal. Consider the two alternative samples of n = 4
observations shown in Figure 9.33. Observe that none of the four containers in
sample 1 have contents weighing more than a quarter ounce less than eight ounces. By
contrast, the third and fourth containers in sample 2 have weights 1.5 and 2.0
ounces below the amount listed on the yogurt container. The mean for sample 2 is
7.0 ounces, which is about 6 times farther from 8 ounces than the mean of sample 1
(7.85 ounces). But before we jump to the conclusion that sample 2 is the
better candidate for rejecting H0, glance at the standard deviations for the
two samples in the t-test output from Minitab (see Figure 9.34).
T-Test of the Mean

Test of mu = 8.0000 vs mu < 8.0000

Variable N Mean StDev SE Mean T P-Value
sample 1 4 7.8500 0.0258 0.0129 -11.62 0.0007
sample 2 4 7.0000 0.8907 0.4453 -2.25 0.055
Figure 9.34

The tightly-arrayed pattern of weights in sample 1 results in a standard
deviation (0.0258) that is about one-thirty-fourth the size of s = 0.8907 for
sample 2, whose values are more widely spread. Because both samples
contain the same number of observations, the standard errors of the mean
retain this one-to-34 ratio.¹ The two t-statistics are each calculated
from the ratio (x̄ − 8)/s_x̄. Therefore, a numerator that is 6 times greater
for sample 2 is more than offset by a 34-times-larger denominator. The result is
a t-statistic for sample 2 that is less than one-fifth the size (6 ÷ 34 ≈ 0.18) of the t-statistic for
sample 1.
The actual t-statistics, (7.85 − 8)/0.0129 = −11.6 and (7 − 8)/0.4453 = −2.25, bear this out: the
first is more than five times the magnitude of the second. Note that the signs are negative
because mean sample weights are below μ0 = 8 ounces. Even if the consumer group uses α = .01
as its significance level, the p-value 0.0007 for sample 1 is tiny enough to easily reject H0. The
first sample of only four observations provides sufficient evidence to conclude that the mean
weight of all yogurt containers produced by the company is significantly less than claimed on the
package.
By contrast, sample 2, with its greater variability, casts reasonable doubt on whether the
full one-ounce shortfall was due to anything more than sampling error. The p-value of 0.055 is not
less than α = .05, so the null hypothesis of an 8 ounce mean cannot be rejected. Thus, a small but
consistent shortfall in weight is more conclusive evidence than a larger difference arising from a
sample of more volatile observations.

Figure 9.33
sample 1    sample 2
  7.88        7.70
  7.82        7.80
  7.86        6.00
  7.84        6.50

Figure 9.35
sample 2 (replicated to n = 8)
  7.70
  7.80
  6.00
  6.50
  7.70
  7.80
  6.00
  6.50
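The two t-tests in Figure 9.34 can be reproduced directly from the eight raw weights. The sketch below is our own calculation in Python/SciPy, with an illustrative helper name; it computes each left-tail test of H0: μ = 8.

# Re-computing the one-sided t-tests of Figure 9.34 from the raw sample weights.
from statistics import mean, stdev
from math import sqrt
from scipy import stats

def left_tail_t_test(data, mu0):
    n, xbar, s = len(data), mean(data), stdev(data)
    t_stat = (xbar - mu0) / (s / sqrt(n))
    return t_stat, stats.t.cdf(t_stat, df=n - 1)      # left-tail p-value

print(left_tail_t_test([7.88, 7.82, 7.86, 7.84], 8))  # about (-11.62, 0.0007)
print(left_tail_t_test([7.70, 7.80, 6.00, 6.50], 8))  # about (-2.25, 0.055)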
We have just observed how small-sample evidence of persistent differences from μ0 can
provide decision makers with definitive conclusions in quality control settings. For more erratic
patterns such as those found in sample 2, larger sample sizes are therefore necessary. Suppose the
pattern in sample 2 persists over a sample of n = 8 observations; that is, we replicate the sample 2
data with four more identical observations (see Figure 9.35). The t-test output is shown in Figure 9.36.
T-Test of the Mean

Test of mu = 8.000 vs mu < 8.000

Variable N Mean StDev SE Mean T P-Value
sample 2 8 7.000 0.825 0.292 -3.43 0.0055
Figure 9.36

Because the sample contains the same data weights, the mean remains at 7 and the
sample standard deviation has a similar value. However, the persistence of the data pattern
across a larger sample produces a much smaller p-value, sufficient to reject the null hypothesis at
the .01 significance level. The greater t-statistic magnitude (3.43 instead of 2.25) results
primarily from dividing the standard deviation by the square root of 8 rather than 4.² The
explanation is clear: such a small sample mean is far less likely from a sample of n = 8 than it is
from a sample of only four observations. The consumer group facing large variations like those
in sample 2 would be well advised to examine larger samples, even if that means less frequent or
more costly monitoring of the firm's output.
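The same kind of check works for the replicated sample. Assuming SciPy again, the sketch below (ours, not Minitab's) doubles the four sample 2 weights and recovers the t-statistic and p-value of Figure 9.36.

# Checking Figure 9.36: the sample 2 pattern replicated to n = 8 observations.
from statistics import mean, stdev
from math import sqrt
from scipy import stats

data = [7.70, 7.80, 6.00, 6.50] * 2          # Figure 9.35: the four weights repeated twice
n, xbar, s = len(data), mean(data), stdev(data)
t_stat = (xbar - 8) / (s / sqrt(n))          # about -3.43
print(t_stat, stats.t.cdf(t_stat, df=n - 1)) # left-tail p-value, about 0.0055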
Statistical significance requires a persistent pattern in small sample data. If
considerable dispersion exists in the population, larger samples should be selected.
For very large samples, on the other hand, statistical significance may lose much of its usefulness
if minor patterns are accorded undeserved importance. This idea was discussed in section 5 of
Chapter 8.
If the FDA or FTC agencies later gain access to company production records, they might
conduct follow-up tests based on a known value for σ. They could then base their conclusions on
a z-test and the thinner-tailed normal distribution.

¹ Notice that SE Mean, s_x̄, is half of StDev for each sample because we divide by 2, the square root of n = 4.
² The reduction in StDev from 0.8907 to 0.825 occurs because the divisor becomes (8 − 1), over twice the (4 − 1) divisor in the original sample, while the sum of squared deviations merely doubles.
CASE MINI-PROJECT:
Worried that inflation has pushed up material costs above the $2 million/month budgeted, a
manufacturer collects a sample of n = 24 months of expense data (measured in millions of
dollars). A test is conducted at the α = .05 significance level to determine whether monthly
expenses average significantly more than $2 million.
1. Complete the alternative hypothesis below:

H0: μ = $2 million

HA: μ > $2 million

The sample data results in the following t-test output:

T-Test of the Mean

Test of mu = 2.000 vs mu > 2.000

Variable N Mean StDev SE Mean T P-Value
MATERIAL 60 2.462 1.567 0.202 2.28 0.013

2. To conduct the test by the p-value decision rule, we note that p = ______ is <, > (circle one) .05, the

significance level chosen by the manufacturer. Thus, we reject, cannot reject (circle one) the
null hypothesis H0.

3. Based on these test findings, the manufacturer should conclude that mean monthly materials
cost are / are not (circle one) significantly greater than $2 million.

4. Verify arithmetically that the t-ratio reported in the computer output above tells us that the
sample mean is 2.28 standard-errors-of-the-mean greater than 2.0.
Multiple Choice Review:
Answer the next two questions based on the following case and statistical output:
The manager of the water utility is worried that demand has changed from the 110 million
gallons monthly average that was experienced during the previous decade. Demand for water (in
millions of gallons/month) is sampled for 60 recent months.

T-Test of the Mean

Test of mu = 110.00 vs mu not = 110.00

Variable N Mean StDev SE Mean T P-Value
DEMAND 60 117.70 14.42 1.86 4.14 0.0001

9.54 A test is conducted at the α = 0.01 level to determine if average monthly demand now is
significantly different from 110 million gallons. We may conclude that
a. p > α and therefore reject H0
b. p < α and therefore reject H0
c. p < α and therefore cannot reject H0
d. p > α and therefore cannot reject H0
e. Insufficient information provided to reach a conclusion

9.55 For this large sample, we can be 95 percent confident that monthly water demand now
averages between approximately
a. 114.0 and 121.4 million gallons
b. 103.3 and 132.1 million gallons
c. 114.6 and 121.8 million gallons
d. 115.8 and 119.6 million gallons
e. None of the above

Answer the next four questions based on the following case and statistical output:
A premium is the price charged to maintain insurance coverage. A random sample of n = 40 life
insurance policyholders is surveyed. An insurance company tests whether premiums charged its
nonsmoker policyholders are significantly below $25 for $1000 coverage.

T-Test of the Mean

Test of mu = 25.00 vs mu < 25.00

Variable N Mean StDev SE Mean T P-Value
NonSmoke 40 18.94 16.39 2.59 -2.34 0.012

9.56 The null and alternative hypotheses for this test on mean nonsmoker premiums are:
a. H0: μ = $25    HA: μ > $25
b. H0: μ = $25    HA: μ ≠ $25
c. H0: μ = $25    HA: μ < $25
d. H0: μ > $25    HA: μ < $25
e. H0: μ < $25    HA: μ > $25
9.57 According to the printout, the mean premium for this sample of nonsmokers
a. is 16.94 standard errors below $25
b. is 16.39 standard errors below $25
c. is 2.59 standard errors below $25
d. is 2.34 standard errors below $25
e. is 0.012 standard errors below $25

9.58 The mean premium tests significantly less than $25 (per $1000 coverage) at any of the
following significance levels except:
a. α = .20 significance level
b. α = .10 significance level
c. α = .05 significance level
d. α = .01 significance level
e. tests significant at any of the levels selected above

9.59 If a two-tailed test had been conducted instead, the computer output would have been
exactly the same except
a. the t-ratio would have been twice as large: t = 4.68
b. the t-ratio would have been half as large: t = 1.17
c. the t-ratio would have been positive: t = +2.34
d. the p-value would have been twice as large: p = 0.024
e. the p-value would have been half as large: p = 0.006

Calculator Problems:
9.60 Calculate t in each of the following situations:
(a) μ0 = 5, x̄ = 2, and s_x̄ = 1
(b) μ0 = 15, x̄ = 12, and s_x̄ = 1
(c) μ0 = 5, x̄ = 8, and s_x̄ = 1
(d) μ0 = 5, x̄ = 2, and s_x̄ = 3
(e) μ0 = 50, x̄ = 20, and s_x̄ = 10
(f) μ0 = 5, x̄ = 2, and s_x̄ = 1
9.3 Inferences for Matched Pair Samples
So far, we have only considered one-sample inference of the mean. Sometimes,
however, we must also compare means from two different populations. Collecting independent
random samples from the two populations may be used to estimate or test hypotheses about the
differences in these means. Alternatively, paired sample data may be used to make these
statistical inferences. Businesses most often encounter these comparison questions in the
following situations:
I. before-and-after comparisons
II. comparing information about one category with that of another category
III. experimental designs comparing treatment and control groups

Testing whether employee turnover has decreased since the pay hike and estimating the
increase in market share since mergers are examples of before-and-after inference. Businesses
also compare two categories that coexist in the same time period. Marketing analysts test
whether women respond as positively as men to an ad campaign. Similar tests could be
conducted between Baby Boomers and Generation-X, urban and suburban, or college educated
and high school graduates.
For example, a merger between two aerospace giants is held up because production costs
may average significantly more at one of the two companies. Random samples are collected
from plants belonging to each company, and production cost at each is measured. If mean costs
differ significantly, the two samples may be used to estimate cost differences and a margin
of error reported. Stockholders should prevent the merger from going through unless downsizing
and capital investment can bring efficiency to the higher-cost firm.
As mentioned in Chapter 5, statistical comparisons may be misleading. If confounding
factors have also changed meanwhile, before-and-after analysis may be unreliable. The higher
cost firm may produce more labor-intensive products. If sufficient time, resources, and authority
are allocated to a study, carefully designed experiments can provide more persuasive statistical
conclusions. In the simplest experiments, a treatment is compared with a control group. For
example, a before-and-after study would compare work performance after changing to a casual-
dress policy in all offices. A categorical comparison would compare performance at casual-dress
companies with business-attire companies. A controlled experiment, by contrast, might
randomly assign offices to one of two samples, one for each type of dress code. Then,
confounding factors do not vary between the samples.
Businesses making inferences about the difference in population means may conduct
controlled experiments, before-and-after analysis, and two-category comparisons.
Comparing Means for Matched Pair Sample Data
Matched pair data have a major advantage over two samples from
different populations because the same subjects are used. As we discussed in
Chapter 5, this type of design can prevent confounding problems. For
instance, a market taste test has each subject evaluate Burger King's new
french fries and McDonald's fries. Let's examine a case where paired data are
available and useful for inferential analysis.

Chapter Case #5: Give Until It Hurts
Each year, employees are encouraged to support a charity such as United
Way to meet some corporate goal. A regional coordinator of a national
charitable organization is in charge of organizing this year's corporate giving
campaign. Recent cuts in welfare and other programs for the needy have
increased the pressure on charitable organizations to fill the gap. Fortunately,
people have more to donate because of rising incomes and stock market
gains. However, donations are treated less favorably by the tax laws, and
contributors may be discouraged by media reports of inefficiency at
charitable organizations. The regional coordinator decides to test whether
corporate charitable giving has changed significantly in the past year.
Because there are thousands of companies in the region, he collects a random
sample of 55 companies and compares employee contributions this year with
those from last year. Figure 9.44 displays the paired data from the Excel
spreadsheet.
Figure 9.44
GiveNow GiveLast
$3,060 $4,613
$17,098 $24,875
$10,339 $19,854
$4,241 $4,460
$5,407 $8,410
$6,115 $4,407
$18,859 $28,208
$9,562 $7,339
$409 $230
$740 $594
$8,545 $7,955
$8,909 $5,635
$13,344 $17,288
$14,379 $11,206
$1,034 $2,093
$3,115 $9,789
$100 $100
$200 $1,260
$5,957 $7,580
$1,240 $1,933
$11,749 $15,011
$5,157 $9,759
$1,639 $500
$5,219 $1,807
$460 $5,033
$6,840 $4,271
$3,257 $9,246
$1,267 $1,457
$22,368 $21,676
$1,227 $1,500
$3,425 $3,246
$1,278 $2,780
$4,567 $5,781
$5,291 $2,467
$5,487 $6,398
$9,856 $7,864
$5,218 $4,651
$2,156 $1,423
$5,548 $4,511
$4,157 $5,368
$309 $242
$2,589 $1,895
$2,317 $2,581
$807 $1,154
$2,712 $1,099
$4,872 $3,297
$3,572 $4,293
$5,581 $6,007
$11,204 $10,915
$9,866 $10,128
$2,975 $2,805
$1,123 $1,561
$3,484 $4,319
$7,859 $6,997
$1,469 $1,262
Clearly, the two columns reveal that some companies raised more last year while others
are giving more this year. A preliminary study based on a random sample of paired data shows a
drop-off in per-company charitable contributions this year, from $6200 to $5450. But these are
only sample means. Can we infer a statistically significant change in contributions among all
companies in the region? For this test of paired data, we may use the same type of one-sample t-
test developed in section 2.
Because the data are in pairs, the differences in corporate giving at each company may be
tested with the following hypotheses:
H0: μ_GiveNow − μ_GiveLast = 0
HA: μ_GiveNow − μ_GiveLast ≠ 0
After using our spreadsheet to compute these differences, the column of data is subjected to a
one-sample t-test. The Minitab output is presented in Figure 9.45.
T-Test of the Mean
Test of mu = 0 vs mu not = 0

Variable N Mean StDev SE Mean T P-Value
GiveChg 55 -756 2887 389 -1.94 0.057
Figure 9.45

The computer output reports a mean shortfall of $756 in this year's donations relative to last
year's for the 55 companies sampled. However, the estimated standard error of the mean, $389,
is also large. The resulting p-value of 0.057 exceeds α = .05, so he cannot reject the null
hypothesis. Therefore, the regional coordinator cannot conclude at the .05 level that a significant
change in corporate giving has occurred this year.
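Because a matched-pair test is just a one-sample t-test on the column of differences, the Figure 9.45 results can be reproduced from the reported summary statistics alone. The sketch below (Python/SciPy, our own check rather than the coordinator's spreadsheet) does exactly that.

# Reproducing the matched-pair t-test of Figure 9.45 from its summary statistics.
from math import sqrt
from scipy import stats

n, dbar, s_d = 55, -756, 2887                 # differences: GiveNow minus GiveLast
t_stat = dbar / (s_d / sqrt(n))               # about -1.94
p = 2 * stats.t.sf(abs(t_stat), df=n - 1)     # two-sided p-value, about 0.057
print(t_stat, p)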

For paired data, one-sample tests are used to test the difference in population means.
This section dealt with analysis of two normally distributed populations. For large
samples, the central limit theorem extends these procedures to use with non-normal populations.
The nonparametric procedures of Chapter 12 provide alternative methods that do not require
normally distributed populations. In Chapter 10, we will deal with situations involving
comparisons of more than two populations. Then, we will make greater use of pooled standard
deviations.

Case Study Exercise:
For sales revenue figures (measured in dollars) during 84 days at a college town yogurt store:
Confidence Intervals

Variable N Mean StDev SE Mean 80.0 % C.I.
yogurt 84 595.8 110.9 12.1 ( 580.2, 611.4)

Using the information above, answer the following questions:
(A) By examining the size of the sample and the approximate distribution in your histogram,
state two alternative justifications for constructing t intervals.
1.
2.
(B) Why is STDEV more than nine (9) times larger than SE MEAN ?

(C) Show that x̄ + t_α/2 s_x̄ = 611.4, the right end of the confidence interval.
(D) Complete the following interpretation of this interval estimate:
At the ____ percent confidence level, we estimate that daily yogurt sales are between $580
and $611.
Suppose that we wish to test whether mean yogurt sales are significantly less
than $625/day at the 0.01 significance level.
(a) Formally complete the null and alternative hypotheses in the following:
H0:
HA:
(b) Which of the following two tests would you use? Conduct the test and state the results:

T-Test of the Mean

Test of mu = 625.0 vs mu not = 625.0

Variable N Mean StDev SE Mean T P-Value
yogurt 84 595.8 110.9 12.1 -2.41 0.018

T-Test of the Mean

Test of mu = 625.0 vs mu < 625.0

Variable N Mean StDev SE Mean T P-Value
yogurt 84 595.8 110.9 12.1 -2.41 0.0090



Chapter 10 Analysis of Variance Using
Experimental Design¹

Approach: In Chapters 8 and 9, we learned how to use sample data to make statistical
inferences about a population. We calculated confidence intervals and conducted hypothesis
tests about the population mean μ. In this chapter we will address the general class of problems
analyzed by a method called analysis of variance to test whether several populations share the
same mean. In the process, we shall apply experimental design to linear models.
New Concepts And Terms:
experimental design and controlled experiments
response variables, treatments, and experimental unit
factors and confounding
randomization, completely randomized designs, and balanced designs
sum of squares, mean square, and AOV table
F distribution and F statistic
treatments or populations
models and model assumptions
partitioning variation and one-way AOV

SECTION 10.1 Generating Data from a Designed Experiment
SECTION 10.2 Comparing Means for a Completely Randomized Design
SECTION 10.3 Linear Modeling Under One-Way Analysis of Variance
SECTION 10.4 The F test for One-Way Analysis
SECTION 10.5 Navigating the AOV Table
SECTION 10.6 Comparing Individual Treatment Means
SECTION 10.7 Analysis of Variance Versus Regression Analysis

1
Note to instructor: Except for the final section, this chapter relies only on material from Parts I, II, and III of this text.
Thus, analysis of variance may be covered prior to Chapters 4 and 10. An optional section (Section 10.7) allows
instructors to compare both types of linear modeling: analysis of variance and regression.
10.1 Generating Data from a Designed Experiment
When government agencies or private business information services collect data, we have
no direct say in what or who gets measured and when or how that measurement takes place. For
example, most countries report unemployment rate based on sample data they collect from the
labor force. Suppose a corporation is considering whether to relocate one of its production
plants. In decision making situations, we ideally would like to control the process by which the
data for our analysis are produced. This process is known as experimental design.
DEFINITION: An experimental design is the process used to produce data best able to help us
answer a specific set of questions.
As already mentioned, an experiment does not require scientists in white lab coats or beakers
bubbling with green liquids. Experiments do require subjects, often called experimental units.
DEFINITION: Experimental units are the subjects of the experiment.
Experimental units may be smokers in a medical experiment, classroom students in an education
study, or test tube samples of reagents in chemical research. For business studies, experimental
units may be customers, office employees, products tested, or factories, for example.

Response Variables, Factors and Treatments
The object of experiments is to observe a response variable X under varying sets of
circumstances orchestrated by the analyst. The X response is measured at a range of different
settings, called treatments, for one or more other variables, the factors.
DEFINITIONS: The response variable is the quantitative variable observed as the outcome of
an experiment. The factors are categorical variables whose settings, the treatments, elicit
variations in the response variable. Treatments may be any possible values, levels, or states for a
factor.
The terms used -- such as "response" and "treatment" -- owe their origins to well-planned
scientific experiments. In fields such as medical science, researchers administer different drug
treatments to a series of subjects and then measure patient response, perhaps in terms of blood
pressure, heart rate, weight loss, or rate of recovery.
In business, similar treatment-response measurement often takes place. For example, an
insurance company may study whether one factor, the type and level of educational background
of its field representatives, affects policy sales. One version of the design may then involve
individual field reps as the experimental units, policy sales in a particular month as the response
variable, and education of field reps as the factor with its three treatments: no college education,
some college education but no degree, and a four-year college degree.
An airline may want to determine whether any differences exist among the maintenance
cost of the four different types of airplanes in its current fleet, and also whether the age of its
planes has any influence on maintenance costs. The design analyst for the airline decides to
select individual planes in the fleet as the experimental units, annual maintenance cost of parts
and labor for each plane as the response variable, and model and age category of the plane as the
two factors.
Remember that the purpose of a well-designed experiment or study is to produce useful
data. We gather data to shed light on decision making problems. When data does not address
our needs, time and money are wasted and we are still no closer to an informed decision.
If the approach is "Let's collect some data and see what it tells us," most of the time it
won't tell us much about the questions we need answered.
Whenever business or public-sector decision makers are in a position to generate data
specific to a particular problem, it is absolutely essential to enlist the aid of a statistician to help
in the design phase. The road to business failure is littered with cases of business statisticians left
"out of the loop" in the crucial early stages of data generation. All too often, management brings
in these experts only for data analysis. However, even the best statistician cannot conclude much
from a poorly designed study.
What are some of the features of a good design? What are the pitfalls we should try to
avoid? In Chapter 1, we discussed the need to identify the proper population and relevant
variables, and use random samples to avoid biased estimation. We later saw that random
sampling was also essential for reaching meaningful conclusions about a population.

Confounding in Poorly-Designed Experiments
These same recommendations may be extended to more controlled experimental design
situations. However, additional difficulties occur when investigating whether a relationship
exists between a factor and a response variable. One of the most troublesome barriers to data
analysis occurs when the effects of several factors are confounded.
DEFINITION: The effects of factors are said to be confounded in the data if we cannot
determine which factor is producing the observed effect on the response variable.
For example, suppose that insurance company field reps with college degrees are found
to average substantially higher monthly policy sales. However, suppose it is also discovered that
college degree holders tend to be assigned to more lucrative sales territory. The company cannot
be certain whether the higher sales are due to better educational qualifications or to the favorable
territory assignments. In this case, territory serves to confound the analysis. It could be that
education has no influence on sales. If the territory has a large enough role in affecting sales, it
is even possible that college education may be a liability to sales performance. We may never
know as long as the design does not take territory assignments into consideration in some way.
If factor effects are confounded, data patterns observed between responses and any single
factor cannot be trusted. Incorrect or biased results are a potential outcome. In any case, doubt
regarding the validity of the conclusions may prevent decision makers from using the results.
Experimental design can help untangle confounded factors by exercising some control
over the process by which sample data are generated. There are several common forms of
control, any or all of which may be present in a well-designed experiment. These include (1)
holding other factors constant, (2) randomization, and (3) a balanced design.

Controlled Experiments
One method to isolate the effects of a factor from those of other factors is to maintain all
other factors at the same level or status throughout the experiment. Since factors that are
constant cannot be the cause of changes in the response variable, any changes must be due to the
factor under investigation.
Difference in treatment effects for one factor may be examined by holding constant all
other factors that may otherwise affect the response variable.
For the insurance example, a study might control for territory variation by examining
policy sales of field reps from the same territory. Because more insurance policies may be sold
in some parts of the year than in others, comparisons among field reps are made over an entire
year. Some years have experienced slower sales, so it is also important to measure each field
rep's policy sales in the same year. If the insurance company sells a broad range of policies, the
study may restrict analysis to field representatives who specialize in the same type of policies
and customers, such as small business policyholders. By holding constant all these potentially
confounding factors (as well as other factors, such as age, gender, experience, etc.), any
substantial differences in policy sales may be traced to education differences.
One major limitation evident from this example is that holding all other factors constant
may be a difficult and tricky proposition. Time and money, usually scarce resources in business,
must be expended to do a first rate job. And there is always the possibility that one or more
confounding factors have been overlooked. The carefully controlled procedures used by product
testing labs come under fire whenever a new car model is recalled, an over-the-counter drug has
unreported side effects, or the shelf life of milk fails to reach the expiration date coded on the
container.
The firms developing these new products are caught in a dilemma. They want to avoid
damaging losses of sales and harm to the reputation of their brand names. On the other hand,
competitive pressures to catch the market at the right time and ahead of their rivals make time a
luxury. Testing short cuts are often employed. Rather than monitoring the effects of a
medication for ten or twenty years in human subjects, animals with much shorter life spans may
be used. Test conditions are accelerated to simulate a much longer period of standard usage.
Shelf life conditions for pasteurized milk properly stored in a refrigerator may be tested under
room temperatures to speed up bacterial growth.
A related problem arises if potentially confounding factors are numerous or cannot all be
identified. A cost effective means to hold constant many of these factors (even some of the
unknown ones) is by conducting experiments within the same environment at the same time and
circumstances. As we saw, there are many possible factors that determine the effectiveness of an
insurance salesperson. Any listing of these factors would probably omit some we hadn't thought
of. But by limiting the study to a particular year, territory, and insurance specialty, we should
automatically be able to fix any factors that primarily vary over time, space, or sales condition.
For example, restricting analysis to a single territory controls for factors such as driving time
needed to cover a territory, average income of clientele, and prevailing regional customs and
business practices.
A final limitation to explicit control of other factors is that the experimental conditions
selected may not be typical of the application for which the tests are intended. New car dealers
must provide fuel efficiency information for highway and city driving, but the MPG (miles per
gallon) numbers originally reported were for lab tests where cars were placed on treadmill
devices. The resulting MPG figures were inflated because no account was taken of wind
resistance, cornering, or passenger and cargo weight loads that tend to reduce fuel efficiency.
If other factors are held constant in the design, it is vital that conditions for the
experiment simulate as closely as possible the circumstance of the decision environment. Test
marketing of new products may be targeted to places like Columbus or Orlando to ensure the
presence of a diverse pattern of demographic factors (such as race, age groups, and consumer
tastes).
In addition, a description of the experimental conditions should be attached to the
statistical findings to guard against misapplication. Thus, attached to the fuel efficiency
numbers posted on new cars is a disclaimer stating that mileage may vary depending on driving conditions
and that these figures should be used for comparison purposes only. By contrast, subfreezing
temperatures were recorded at the Florida launch site on the morning the space shuttle exploded.
However, the laboratory tests that verified the sealing effectiveness of the space craft's "O-rings"
were conducted at 55 degrees and higher, the minimum temperature usually encountered at the
Cape.
Whenever the real world context for the decision changes, the experiment must then be
redone to check whether the results remain valid. In the airline maintenance example, the
increased wear and tear on planes since deregulation should make ten-year-old study results
extremely suspect. Time tested practices need to be re-examined every so often in the light of
changing factors in the market environment.

Randomized and Balanced Designs
A second design feature, randomization, is nearly always incorporated in business
experiments. We saw how random sampling can prevent bias and allow us to make conclusions
that quantify our degree of uncertainty. Random sampling has its broadest powers when we have
control over an experiment that generates the data. After holding constant the factors that we
want to and are able to control, a randomization of the restricted population is then attempted.
DEFINITION: An experimental design uses randomization if some portion of the data
production process relies on random sampling.
According to the properties of random sampling discussed in Part II, if we assign each
factor treatment randomly to a large sample of experimental units, then each subsample will tend
to have similar distributions for all other factors. Randomization may be accomplished
by drawing samples from mailing lists, membership directories, or product warranty
information files.
For example, theaters make most of their revenue from food and beverage sales, not from
ticket revenues which mostly go to the movie distributor. Suppose consultants are hired by a
major movie theater chain to determine whether the types of food sold at its snack bar affect the
amount spent on concessions. If each type of menu is offered at a different theater or in different
weeks at the same theater, variations in concession revenue may be caused by neighborhood
differences or by the type and number of customers attracted to the movies playing that week.
Therefore, the consultants select a random sample of theaters and weeks to test, and then
assign each menu at random among the theaters and weeks in the sample. For instance, suppose
there are three menus being considered and the sample contains nine different theaters and five
different weeks during the year. Then each menu is assigned at random to 15 of the 45 possible
(i.e., 9 theaters x 5 weeks) theater-week combinations.
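A completely randomized assignment of this kind is easy to generate by computer. The short Python sketch below is only an illustration of the idea; the theater, week, and menu labels are invented and do not come from the consultants' actual study. It shuffles the 45 theater-week combinations and deals them out 15 at a time to the three menus.

import random

theaters = [f"Theater {t}" for t in range(1, 10)]   # 9 theaters
weeks = [f"Week {w}" for w in range(1, 6)]          # 5 weeks in the sample
menus = ["Menu A", "Menu B", "Menu C"]              # 3 candidate menus

# All 45 theater-week combinations serve as the experimental units
units = [(t, w) for t in theaters for w in weeks]

random.seed(1)          # fix the seed so the random assignment is reproducible
random.shuffle(units)   # put the experimental units in random order

# Deal the shuffled units out 15 at a time, one block of 15 per menu
assignment = {menu: units[i * 15:(i + 1) * 15] for i, menu in enumerate(menus)}

for menu, block in assignment.items():
    print(menu, "gets", len(block), "theater-weeks, e.g.,", block[:2])

Because every theater-week had the same chance of receiving each menu, differences in neighborhoods, movies, and seasons tend to be spread evenly across the three menu groups.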
Notice that rather than fix the treatment values of factors other than menu, the consulting
firm allows them to vary. However, random assignment of the different menu treatments tends
to create similar sample distributions of the other factors. Thus, factor effects that might
otherwise be confounded will instead be distributed independently of one another. If the samples
from each treatment group have similar distributions of every other factor, then any significant
difference in the mean of the response variable may be attributed to the factor under investigation.
In the example, each menu is subjected to a random combination of weeks and theaters. If one
of the menus generates significantly higher revenues than the other two, it cannot be claimed that
the higher sales may have been caused by differences in audience tastes, neighborhood food
preferences, movies playing, or time of year.
You may have observed another advantage from using a randomized design. Not only
were the effects of confounded factors disentangled from the menu factor, but these other factors
were allowed to take on realistic values as well. Unlike the limitations we faced by fixing
factors, random selection of experimental units (in this case, theater weeks) and random
assignment of treatments resulted in a representative set of times of the year and neighborhood
theaters. Had the consultants selected a single theater and week and presented moviegoers with
a choice of the three menus, the movie chain might be concerned that menu differences won't
apply at other times of the year or in other parts of the country. Even if several theaters and
weeks were chosen by some nonrandom method, subconscious biases in the selection process
might cloud the findings. Suppose, for example, that a disproportionate number of theaters from the
Southwest were included in the test; the results might then be biased in favor of a Tex-Mex menu.
Randomization is used in experimental design to obtain a realistic range of situations and
prevent bias from confounded factors, especially for factors that either cannot easily be
controlled explicitly or are unknown to the experimenter.
A one-factor experiment that relies primarily on random sampling of each treatment
population is called a completely randomized design.
DEFINITION: A completely randomized design involves randomly assigning treatments to a
random sample of experimental units.
The theater consultant study is an example of such a design. Rather than attempt to fix the
environment to a particular prescribed condition, treatments are assigned randomly to a random
sample of experimental units.
Randomized designs are not always possible if human subjects are involved. Random
assignments may require a degree of coercion inconsistent with a democratic society. There are
laws that limit the use of people as guinea pigs. In many cases, compliance may be smoothed by
providing participants with monetary compensation or free products for responding to a
questionnaire, sampling a product, or participating in a simulated experiment. Another setting
for ensuring cooperation from subjects is the workplace, where compliance is ensured by
management control over the work environment, offering either rewards (promotion, salary,
perks) or punishment (termination, work assignment).
If randomized assignments are infeasible, experiments often rely on self selection.
Examples of self selection are volunteers for a laboratory study and survey respondents who
agree to be interviewed or mail in a questionnaire. If experimental units are allowed to choose
themselves (rather than being assigned at random), there is a strong chance of bias. Bias may
occur if the selection process is not independent of the factor and response variables. Suppose a
state agency concludes that the type of safety inspection program used -- regular visits,
unannounced visits, or no visits -- does not affect the number of safety violations observed. If
the subjects consist only of manufacturers that agreed to participate in the study, self selection
bias may have occurred. The factories with the worst safety records, the least likely to participate,
may also be the most likely to alter their unsafe plant conditions if either of the first two inspection
treatments is implemented.
A third useful device often found in experiments is a balanced design.
DEFINITION: A design is balanced if the data analyzed contain an equal number of
observations from each treatment group or, in the case of more than one factor, from each
treatment combination.
Balanced designs are desirable for several reasons. For a comparison of response to different
treatments, it seems logical to equally sample each treatment population. If we generate few
responses to one of the treatments, we will lack information about the effects of that treatment.
It usually makes little sense to install one of the three menus at only two theaters and the other
two menus at 10 theaters each. A balanced design also prevents factor effects from being
confounded when there are two or more factors under investigation. Later in this chapter you
will learn about other advantages of a balanced design.
Balanced designs may also be possible with observational data. Business research often
must rely on secondary data sources available from Web sites and government sources, but even
these may be turned into after-the-fact balanced, randomized designs. Secondary data are first
grouped by factor treatment. Then equal-sized random samples are drawn from each treatment
population for a randomized and balanced design. If confounding with other factors is a concern,
more complicated sampling designs are also possible (see the Exercises).
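As a sketch of how such an after-the-fact design might be built, the Python fragment below groups a hypothetical secondary data set by treatment and then draws an equal-sized random sample from each group; the data frame and column names are invented for illustration only.

import pandas as pd

# Hypothetical secondary data: one row per firm, labeled by treatment group
data = pd.DataFrame({
    "treatment": ["A"] * 5 + ["B"] * 8 + ["C"] * 6,
    "response":  [10, 12, 11, 13, 9,
                  20, 22, 19, 21, 18, 23, 20, 22,
                  15, 14, 16, 17, 15, 16],
})

# Draw the same number of observations at random from every treatment group
n_per_group = 5
balanced = data.groupby("treatment").sample(n=n_per_group, random_state=1)

print(balanced["treatment"].value_counts())   # each treatment now appears 5 times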
Constructing a balanced design requires information about experimental units. Still,
balance may not be feasible if some treatment combinations do not occur often enough in the
population. The requirement for a balanced design is that there be a sufficient number of
observations to sample from for each treatment. For example, the FDIC may be investigating the
flight of deposits from banks in five different stages of insolvency. If the rarest of the five stages
occurred in only four instances, a balanced design would be constrained to a total
sample size of n = 20 (i.e., four cases from each of the five treatment stages). If a larger sample is
desired, another option is to combine two of the stages into a single treatment.

Multiple Choice Review:
10.1 Data on the outcome of an experiment is found in the
a. experimental unit
b. response variable
c. factor or factors
d. treatments
e. experimental design

10.2 Which of the following methods is often used in analysis of variance to control for factors
that may "confound" the test results?
a. replication
b. balanced design
c. randomized design
d. all of the above
e. none of the above

10.3 Which of the following is not a problem associated with trying to control confounding
factors in an experimental design:
a. can be time consuming and costly
b. may overlook some factors that need to be controlled for
c. need to fix factors at realistic values for the results to be applicable
d. may prevent achieving a balanced design
e. all of the above are common problems encountered in controlling for other variables

10.4 Which of the following is not a problem associated with constructing a completely
randomized design:
a. may not be able to compel participation in an experiment
b. it may be unethical to force participants to serve as guinea pigs
c. if only volunteers participate, self selection bias may result
d. participation may require costly compensation
e. all of the above are common problems encountered in constructing completely
randomized designs

10.5 A property assessor wishes to determine whether property values vary among houses on
different sized lots. If the assessor measures property value only for a random sample
of houses in the Atlanta area in October 1993, the relation between property value and lot
size may be confounded if
a. property values differ between different regions of the country
b. property values vary from one year to the next
c. property values vary among houses different distances from downtown Atlanta jobs
and shopping
d. property values differ among houses, apartments, and commercial properties
e. none of the above are capable of producing confounding effects for this experimental
design


10.2 Comparing Sample Means Under a Completely Randomized Design
As we learned back in Chapter 6, we ideally would like to control the process by which
the data for our analysis are collected. We called this process experimental design. The
purpose of a well-designed experiment or study is to facilitate the collection of useful data.
Remember that we gather data to shed light on decision making problems. When data collected
do not address our needs, time and money are wasted and we are still no closer to an informed
decision.
Recall that experiments involve a special set of terminology. The subjects are called
experimental units, which may range from customers to employees to products to factories,
depending on the business decision objective. The object of experiments is to observe a
response variable X under varying sets of circumstances orchestrated by the analyst. The X
response is measured at a range of different settings, called treatments, of one or more other
variables, or factors.
We also learned that the effects of factors are confounded in the data if we cannot
determine which factor is producing the observed effect on the response variable. Experimental
design can help untangle confounded factors by exercising some control over the process of
collecting sample data. We learned about several common forms of control, any of which may
be present in a well-designed experiment. These include 1) holding other factors constant, 2)
randomization, and 3) a balanced design. Randomization is used in experimental design to
obtain a realistic range of situations and prevent bias from confounded factors, especially for
factors that either cannot easily be controlled explicitly or are unknown to the experimenter.
In Chapter 9, we dealt with tests about a single population or between two populations.
In this chapter we will consider a more complicated type of experiment than the ones from Part
III. When we want to compare several population means, we conduct analysis of variance (or
AOV for short) experiments. A one-factor experiment that relies primarily on random sampling
of each treatment population is called a completely randomized design. A design is balanced
if the data analyzed contain an equal number of observations from each treatment group or, in
the case of more than one factor, from each treatment combination.
In this chapter, we apply some of these design techniques, beginning with completely
randomized one-factor designs. We begin with an introduction to simple analysis of variance
models, and later we shall explore some two-factor designs. In the process, we will learn some
of the terms and tools of analysis of variance. Let's start by putting analysis of variance in
context by discussing a case example of a completely randomized design.

Chapter Case #1: Grading Among Majors in a Business Program
A dean at a Florida university seeks to determine whether graduates from the college of
business tend to perform equally well academically, regardless of major within the school. The
dean is reluctant to measure academic performance by overall grade point average because
students in different majors are required to enroll in different courses. It is possible that grading
may be tougher or more lenient for the courses in one department than in those of another, thus
making overall GPA comparisons inappropriate. There is, however, a core of six upper-division
business courses required of all majors in the college: business statistics, finance, marketing,
management organization, production, and business policies. Graduates were therefore
compared by their mean grade in these six common courses.
To control for differences in preparation that might "confound" the results, one part of the
study draws a random sample of 35 recent business graduates. The sample contained seven
randomly chosen graduates from each of the five majors: accounting, finance, management,
marketing, and general business.
1
The data on six-course grade point average (GPA) is located in
columns C21 through C25 of the chapter worksheet, with one column assigned to each major.


1
Economics majors were omitted because students may opt for a major either within or outside the college of business, thus introducing
self-selection difficulties for the study.
ROW ACCT FIN MGMT MKTG GEN

1 2.16667 1.83333 1.83333 2.00000 2.00000
2 2.33333 2.33333 2.00000 2.16667 2.00000
3 2.50000 2.50000 2.50000 2.50000 2.00000
4 2.66667 3.00000 2.50000 2.66667 2.00000
5 3.33333 3.33333 3.00000 2.66667 2.33333
6 3.33333 3.50000 3.16667 2.83333 2.33333
7 3.50000 3.83333 4.00000 3.00000 3.00000

It is evident from the data listing that there is considerable variation in GPA within each major.
Every column contained one or more students with a GPA of at least a B average (3.0) and at least
one with a GPA close to (or below) a C average (2.0).
Despite the large variation within each major, let's see if there are any systematic
differences in grade performance between the majors. We first combine the 35 data observations
into a single column and then examine the univariate statistics.
N MEAN MEDIAN TRMEAN STDEV SEMEAN
ACCT 7 2.833 2.667 2.833 0.544 0.206
FIN 7 2.905 3.000 2.905 0.713 0.269
MGMT 7 2.714 2.500 2.714 0.744 0.281
MKTG 7 2.548 2.667 2.548 0.356 0.135
GEN 7 2.238 2.000 2.238 0.371 0.140
COMBINED 35 2.6476 2.5000 2.6183 0.5869 0.0992

From the first half of the DESCribe output reprinted above, we learn that the combined mean is
2.6476, or about 2.65. However, the means also vary by major. Finance majors in the sample
had the highest mean GPA (2.905), followed by accounting majors (2.833) and management
majors (2.714), with marketing (2.548) and general business majors (2.238) bringing up the rear.
Analysis of variance deals with ways of dividing up, or "partitioning," the variation of a
response variable around its overall mean. The variations are measured by sums of squared
deviations around the mean. We begin by partitioning the total variation, the total sum of squares SS_T.
DEFINITION: The total sum of squares, SS_T, is the sum of the squared differences of each
observation in the sample, x_i,j, from the overall sample mean x̄. Algebraically, the formula is
SS_T = Σ (x_i,j - x̄)²
By subtracting and then adding the treatment mean for each sample, x̄_j, the difference inside the
parentheses of the SS_T formula may be re-expressed as follows:
(x_i,j - x̄) = (x_i,j - x̄_j + x̄_j - x̄)
            = (x_i,j - x̄_j) + (x̄_j - x̄)
Thus, an individual observation's difference from the overall mean (x_i,j - x̄) can be partitioned
into the sum of its two parts:
1) its difference from its treatment mean (x_i,j - x̄_j), and
2) the difference between the treatment mean and the overall mean (x̄_j - x̄).

If we square both sides of the above equality and then sum both sides over all observations on all
treatments, it can be shown that
1
Σ (x_i,j - x̄)² = Σ (x_i,j - x̄_j)² + Σ (x̄_j - x̄)²
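For readers who want to see the footnoted step, a brief sketch of the algebra follows. Squaring the two-part partition and summing over all observations produces a cross-product term that vanishes, because the deviations within each treatment sum to zero around their own treatment mean:

\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{i,j}-\bar{x}\right)^2
  = \sum_{j}\sum_{i}\left(x_{i,j}-\bar{x}_j\right)^2
  + \sum_{j}\sum_{i}\left(\bar{x}_j-\bar{x}\right)^2
  + 2\sum_{j}\left(\bar{x}_j-\bar{x}\right)\sum_{i}\left(x_{i,j}-\bar{x}_j\right),
\qquad\text{with }\ \sum_{i}\left(x_{i,j}-\bar{x}_j\right)=0\ \text{ for every treatment } j.

Dropping the zero cross-product term leaves exactly the identity stated above.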
The summation on the left of the equal sign is merely SS_T, defined previously. The first of the
two sums to the right of the equal sign is the error sum of squares.
DEFINITION: The error sum of squares, SS_E, is the sum of squared differences of each
observation x_i,j from its sample treatment mean x̄_j.
This sum measures the variation in the response variable X that cannot be explained by
differences among the treatments. The individual variation around the treatment means is called
the "error" because it measures the variation not attributable to systematic population differences.
The final summation involves differences between the k treatment means and the overall
mean. This term therefore is called the treatment sum of squares, SS_TR.
DEFINITION: The treatment sum of squares, SS_TR, measures that portion of total variation
accounted for by treatment differences on the response variable.
Thus, we have partitioned the total sum of squares into treatment and error sums of squares.
SS_T = SS_E + SS_TR
Based on this relationship among the three types of sums of squares, we can determine the
treatment sum of squares.


1
The three terms in this equality involve summation over each i observation for each of the j = 1, ..., k treatments. Thus, the three
summations here are "double sums" over i and j.
The treatment sum of squares, SS_TR, may be determined from SS_TR = SS_T - SS_E.
If mean response does not vary much among the k treatments, then SS_E will be nearly as large
as SS_T, and SS_TR will be much smaller. On the other hand, if each treatment contains fairly
uniform responses but responses vary substantially between the treatments, then SS_TR will be
similar in size to SS_T and the SS_E term will be relatively small.
To compute SS_T in our case example, we first find the GPA difference for each student
by subtracting the sample mean of 2.6476 from each GPA in the combined sample
data. Although some of these differences are small (such as the 0.019 difference for a GPA of
2.66667), nearly half of the GPAs differ from the overall mean by at least one-half (0.5) of a grade
point. We then square those differences and sum them to obtain the total sum of squares SS_T, 11.71,
for the sample of 35 graduates.
We then ask how much of this total variation belongs to SS_E, that is, cannot be traced to
differences in mean GPA for the five different business majors. Errors arise from the failure of
the treatment means to match the actual values of the response variable in the corresponding
experimental units. For this example, the errors are the differences of the graduates' GPAs from
the sample mean GPA in their own major.
We next determine these differences by subtracting the treatment means -- 2.833, 2.905,
2.714, 2.548, and 2.238 -- from the GPAs of the seven graduates in the corresponding majors.
These differences are displayed in the table below.
ROW ACCT dif FIN dif MGMT dif MKTG dif GEN dif

1 -0.66633 -1.07167 -0.88067 -0.548000 -0.23800
2 -0.49967 -0.57167 -0.71400 -0.381330 -0.23800
3 -0.33300 -0.40500 -0.21400 -0.048000 -0.23800
4 -0.16633 0.09500 -0.21400 0.118670 -0.23800
5 0.50033 0.42833 0.28600 0.118670 0.09533
6 0.50033 0.59500 0.45267 0.285330 0.09533
7 0.66700 0.92833 1.28600 0.452000 0.76200

For example, the 0.667 for the accounting graduate in Row 7 results from an
individual GPA of 3.50 that is 0.667 grade points above the mean for accounting majors, 2.833.
Notice how large many of these differences are. Fourteen of the 35 differences from treatment
means are at least 0.5 grade points in magnitude, which suggests that SS_E contains much of the
SS_T variation. This conclusion is verified when we compute the error sum of squares SS_E by
squaring each difference above and summing. The result is SS_E = 9.73, nearly as large as the SS_T
we calculated earlier, 11.71.
Thus, most of the GPA variation in the sample is not accounted for by differences among
the majors. The treatment sum of squares, SS_TR, can most easily be found from the relationship
SS_T = SS_TR + SS_E. Using the sums of squares already computed, SS_TR = SS_T - SS_E = 11.71 -
9.73 = 1.98. Thus, only about one-sixth (SS_TR/SS_T = 0.169) of the total GPA variation for the
sample is attributable to differences in GPA by major.
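The hand calculations above are easy to reproduce by computer. The Python sketch below (the variable names are ours; the GPA values are the ones listed in the data printout) computes SS_T and SS_E directly and recovers SS_TR from their difference.

# Six-course GPAs by major, as listed in the chapter data printout
gpa = {
    "ACCT": [2.16667, 2.33333, 2.50000, 2.66667, 3.33333, 3.33333, 3.50000],
    "FIN":  [1.83333, 2.33333, 2.50000, 3.00000, 3.33333, 3.50000, 3.83333],
    "MGMT": [1.83333, 2.00000, 2.50000, 2.50000, 3.00000, 3.16667, 4.00000],
    "MKTG": [2.00000, 2.16667, 2.50000, 2.66667, 2.66667, 2.83333, 3.00000],
    "GEN":  [2.00000, 2.00000, 2.00000, 2.00000, 2.33333, 2.33333, 3.00000],
}

all_gpas = [x for major in gpa.values() for x in major]
grand_mean = sum(all_gpas) / len(all_gpas)                 # about 2.6476

# Total variation of every GPA around the overall mean
ss_t = sum((x - grand_mean) ** 2 for x in all_gpas)

# Variation of every GPA around its own major's (treatment) mean
ss_e = 0.0
for values in gpa.values():
    treatment_mean = sum(values) / len(values)
    ss_e += sum((x - treatment_mean) ** 2 for x in values)

ss_tr = ss_t - ss_e   # the partition SS_T = SS_TR + SS_E

print(round(ss_t, 2), round(ss_e, 2), round(ss_tr, 2))     # roughly 11.71, 9.73, 1.98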
Reexamination of the data in the original printout tells us why SS_E is so large for this
case study. While the means for each treatment vary considerably, within each major there is
even greater variation. To illustrate, general business graduates recorded a substantially poorer
average on the six required courses than did finance majors (2.238 versus 2.905). Nevertheless,
one of the seven finance grads posted a GPA of only 1.83, and one of the general business students
in the sample boasted a 3.0 GPA.
But these comparisons result from sample information. The dean wants to decide
whether the parent populations have different means as well, or if these differences in sample
means could have arisen purely by chance. We therefore need to test whether the GPA
differences found for the sample are statistically significant ones. Do the differences in sample
means for each major reflect actual differences in population means, or could such patterns have
occurred due to sampling error? The 2.905 average for finance majors and 2.833 for accounting
majors may appear to be considerably greater than the 2.548 and 2.238 among market and
general business majors. The question is are we justified in rejecting the null hypothesis that all
five majors achieve the same GPA in the six required core business course?
To help the dean decide whether or not all majors have the identical means, we first must
construct a model of how student major is related to the GPA. This is our task in the next
section. From this analysis of variance model, we will be able to apply statistical testing to base
our final decisions.

Multiple Choice Review:
10.15 What distinguishes one-way analysis of variance from other types of AOV is
a. there is only one response variable
b. there is only one factor
c. there is only one treatment
d. both b and c
e. none of the above are unique to one-way AOV

10.3 Linear Modeling Under One-Way Analysis of Variance
A model is a simplified statement, commonly expressed algebraically, about a
relationship that we wish to better understand.
1
Due to the complexities of business and
economic relationships, decision-makers are usually forced to use simplified relationships that
include only the most important forces involved in the phenomena to be explained or predicted.
Less important factors are omitted from models, and those factors that are included may be
subject to measurement errors. Therefore, statistical models are generally used. Statistical
models contain two parts: a pattern common to the individual observations, called the systematic
portion, and differences from that pattern, designated by ε (the Greek letter "epsilon"). Although
we did not state the model formally back in Chapters 7 and 8, the inferences we made about a
single population mean were derived from the following model:
x_i = μ + ε_i
where μ is the population mean and x_i is the ith observation on variable X in the population. If the
population does not contain identical values for X, individual observations will typically lie
above or below μ. The ε_i term, the difference from μ for the ith observation, reflects these
individual variations.
If, for example, mean sales at a frozen yogurt shop were known to be $1000 per week, a
simple model for weekly sales at this shop would be
x_i = 1000 + ε_i
Using this model allows the shop manager to predict sales to be $1000 in any given week and to
average $1000/week over any particular time span. Most weeks, however, should not have
exactly $1000 in sales, and many weeks might have sales quite different from this population
mean. All kinds of factors may cause sales to vary from the average. Suppose one week
experienced sales of x = 1500, implying that ε must be 500. Thus, sales that week exceeded the
mean by $500. Another week might have below-average sales of $800. Then the individual
term, ε = -200, would be negative to reflect sales $200 under the average. If the manager knows
the statistical distribution of ε, she can determine the probability of sales exceeding some value,
say $1200, or falling in the interval $900 to $1100 in any given week. We found out how to
determine these probabilities in Chapter 7.
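To make this concrete, suppose the manager is willing to assume that ε follows a normal distribution with a standard deviation of $150 per week; that figure is purely an illustrative assumption, not a value given in the example. The probabilities just mentioned could then be computed as in this short Python sketch.

from scipy.stats import norm

mu = 1000      # known population mean of weekly sales
sigma = 150    # assumed standard deviation of the weekly disturbance (illustrative only)

# Probability that sales in a given week exceed $1200
p_above_1200 = norm.sf(1200, loc=mu, scale=sigma)

# Probability that sales fall between $900 and $1100
p_900_to_1100 = norm.cdf(1100, loc=mu, scale=sigma) - norm.cdf(900, loc=mu, scale=sigma)

print(round(p_above_1200, 3), round(p_900_to_1100, 3))   # about 0.091 and 0.495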
But as we learned in Chapter 8, the manager usually doesn't know the value of the model
parameter μ. Instead, a sample must be collected to estimate μ by x̄. In Chapter 9, we
discovered how to test whether μ is different from a particular value, μ_0. Null and alternative
hypotheses, H_0 and H_A, were constructed. Then, decision rules were formulated from test
statistics and p-values. You may want to review this material from Part III at this time before
reading further.
Now consider a more sophisticated model involving more than one population. To allow
the response variable X to vary systematically by treatment, we state the model as follows:
x_i,j = μ_j + ε_i,j
where
x_i,j is the value of X for the ith observation on the jth treatment,
μ_j is the population mean for the jth treatment, and
ε_i,j is the individual difference (positive or negative).
As before, ε distinguishes an observation from its mean, but in this model the mean is each
treatment's population mean. For our six-course GPA example, the model would be
x_i,FIN = μ_FIN + ε_i,FIN
for finance majors. Analogous versions would apply for each of the other four
majors. If μ_FIN = 2.80, then a graduate with a GPA of 3.00 would have an individual difference
from the population mean of 3.00 - 2.80 = 0.20 grade points. Similarly, students with GPAs of
2.00 and 3.25 imply values for ε of -0.80 and 0.45, respectively. Notice that a negative value for
ε occurs whenever an observation is below the mean: 2.00 is 0.80 below the mean of 2.80, so
2.00 = 2.80 + (-0.80).
Because population parameters cannot be known unless we have information on the
entire population, parameters such as μ_FIN may only be estimated or their values tested. Analysis
of variance allows us to test whether the response variable in the model varies by treatment. In
our case example, this translates to testing for significant differences among the GPA means of
the business majors.
In one-way analysis of variance, we extend our notions of hypothesis testing employed
in Chapter 9 to problems of this type.
2
We test the null hypothesis that the population means are
identical, against the alternative hypothesis that at least one of the population means is different
from the rest.

1
A more extensive discussion of models, from a regression context, is contained in Chapters 9 and 10.
2
Analysis is labeled "one-way" because the set of treatment levels corresponds to a single factor, such as college major. In section 10.7,
we shall consider two-way analysis of variance involving models where two factors affect the response variable.
Think about what it means for the one-way model to be worthless: the response variable
X has the same population mean at all treatments. For example, we should expect little from a
model that attempts to account for grade differences by knowing student eye color, sales
differences from the race of the corporation's president, or stock price differences by the number
of vowels in the firm's name. Presuming that such models have no explanatory or predictive
value, we revert to our simpler model
x_i,j = μ + ε_i,j
where μ is the overall population mean among all treatments. In the case example, μ would be
the mean GPA for all business school graduates, regardless of major. If this model performs as
well as the treatment model, we can estimate X from a sample by ignoring all information about
the treatments and always guessing x̄, the unbiased estimator of μ.
The alternative formulation of these models guides us in stating the null and alternative
hypotheses to decide which model to choose.
In testing the one-way analysis of variance model
x_i,j = μ_j + ε_i,j
the null and alternative hypotheses may be stated by
H_0: μ_1 = μ_2 = ... = μ_k = μ
H_A: at least one μ_j is different from μ, for j = 1 to k
H_0 and H_A obey the basic format rules for stating hypotheses (refer to Chapter 9). The null
hypothesis contains equal signs, and the alternative hypothesis describes the inequality condition.
Moreover, the hypotheses are each stated in terms of the unknown population parameters for the
model, in this case the k population means μ_j.
Yet there is also an obvious difference in this type of hypothesis test. Rather than a
simple hypothesis containing a single parameter, analysis of variance tests generally involve
several conditions and parameters. This compound set of conditions may be seen more easily by
restating H_0 as
H_0: μ_1 = μ, μ_2 = μ, ..., and μ_k-1 = μ
Why do we only require k - 1 conditions? Because if all but one of the k population means are
equal to the mean for the combined population, μ, then the kth mean must be as well. In the extreme
case, with only one treatment (k = 1), then k - 1 = 0 indicates that there are no treatment mean
differences to test. With two treatments, k = 2 and k - 1 = 1. Then there is only one condition
associated with the null hypothesis, H_0: μ_1 = μ_2 = μ, because μ_1 = μ_2 forces the average of these
two means, μ, to equal them as well. For cases of more than two treatments (i.e., k > 2), testing
will always involve a compound set of conditions.
When conditions are tested as a group, not separately, we cannot reject any of them if the
null hypothesis is to be retained. Thus, rejecting the null hypothesis doesn't tell us which or how
many equalities were violated. Compared to the single equality hypotheses of Chapter 9,
grouped hypotheses may be less informative. To conclude significance for the model, we only
need to demonstrate that a single treatment mean μ_j is different from the other means!
In our GPA example, k = 5 populations and the hypothesis to be tested is stated as
follows:
H_0: μ_ACCT = μ_FIN = μ_MGMT = μ_MKTG = μ_GEN
H_A: at least one μ_j is different from the overall μ, j = 1, ..., 5
In order to derive a test statistic with a known sampling distribution, we must make
distributional assumptions about the model.
Assumptions of the One-Way Analysis of Variance Model
1. Although they may have different means μ_j, each of the k populations has
the identical standard deviation σ_j = σ and is normally distributed
2. Analysis is conducted on independent random samples drawn from each of
the k populations
Alternatively, these assumptions may be expressed as distributional assumptions about the ε
term: the ε_i,j for the j = 1, ..., k treatments are independent, normally distributed random variables with
a zero mean and a common standard deviation σ.
Although the first assumption is often not met, the test described in the following section
may still be used so long as there are no substantial departures from normality or large differences among
the population standard deviations σ_j. Tests for normality will be discussed in Chapter 14, and
transformations of nonnormal data to normal were addressed in Chapter 8. Sample standard
deviations usually give an indication of whether population variances between treatments are
substantially different. In Chapter 14, a test for equality of standard deviations is presented. The
second assumption has been shown to be more important than the first.
The second assumption, use of independent, random samples, is especially important.
The usual cause for failure in this assumption is inadequate attention to potentially confounding
factors. As we discussed in Section 10.1, a well-designed experiment should generate an
independent, random sample. On the other hand, test results are unreliable if other factors that
may affect the response variable are neither controlled nor randomly distributed among treatment
samples. In some cases, however, a confounding factor that has not been dealt with adequately
in the design phase may instead be incorporated into the analysis of variance model. Sections
10.7 and 10.8 will discuss methods for analyzing models in which the response variable is
affected by a second factor.
Notice that there is no requirement that sample sizes n_j be equal for each population.
Although our case example involves n_j = 7 for each of the five majors, one-way analysis of
variance could have been conducted even if there were more graduates from some majors than
from others (refer to the case example in section 10.6). However, if the treatment sample sizes n_j
are not approximately equal, the assumption of equal variances becomes a more essential
requirement.

Multiple Choice Review:
10.20 Which of the following is not an assumption of analysis of variance models?
a. zero overall population mean
b. constant standard deviation among the treatment population
c. normally distributed treatment populations
d. independent, random samples
e. all of the above are assumptions of AOV models

10.21 Which of the following is an assumption about the random disturbance term ε in analysis
of variance models?
a. constant standard deviation of ε
b. normal distribution of ε
c. both a and b
d. none of the above

10.22 If treatment sample sizes are approximately equal, which of the following can be said
about the sensitivity of the assumptions for one-way analysis of variance:
a. moderate departures from normality do not invalidate AOV test results
b. moderate differences among treatment standard deviations do not invalidate AOV test
results
c. use of non-independent or nonrandom samples does not invalidate AOV test results
d. a and b are true
e. only balanced designs may be subjected to analysis of variance tests

10.23 If we wish to test whether construction worker wages are significantly different among
four different states, the number of treatments necessary for analysis of variance must be
a. 1
b. 2
c. 3
d. 4
e. insufficient information provided to answer this question

10.24 In an AOV test of whether mean construction worker wages are significantly different
among four different states, not rejecting the null hypothesis implies that
a. mean construction worker wages are the same as mean wages in other occupations
b. mean construction worker wages are the same from one year to the next
c. mean construction worker wages are the same for each state
d. wages are the same among all construction workers
e. all of the above

10.25 In order to reject the null hypothesis that mean MPG (miles per gallon) is the same
among subcompact, compact, and mid-sized cars, we must conclude that
a. sample mean MPGs are all different
b. at least one sample mean is different from the sample mean MPG for the other two car
sizes
c. population mean MPGs are all different
d. at least one population mean is different from the population mean MPG for the other
two car sizes
e. none of the above


10.4 The F Test for One-Way Analysis
As always, testing requires a test statistic with a known sampling distribution so we may
construct an appropriate decision rule. Our task then is to find a test statistic for the model, and
its sampling distribution. At first glance, it would appear that a test statistic for the analysis of
variance model might use the ratio of the two components partitioned from the total variation
SS_T in section 10.2. The GPA variation accounted for by knowing a student's major is measured
by SS_TR, and the variation of GPA within majors is represented by SS_E. If the ratio SS_TR/SS_E is
small, the variation associated with differences in treatment means is likely due to sampling
error. For example, SS_TR = 1.98 is small relative to SS_E = 9.73 in our business major GPA
example. However, this analysis is incomplete because it compares total rather than mean
variations.
Recall that the sample variance s² (and its square root s, the sample standard deviation) was
defined in terms of the sum of squared differences from the sample mean. This sum of squares is
identical to the total variation measure SS_T. But because a measure of average (rather than total)
variability was needed, we divide by n - 1, the number of degrees of freedom (i.e., pieces of
information) in the sample. (Recall that we don't divide by n because one degree of freedom is
lost in the process of estimating the unknown population mean from that same sample.)
This idea of averaging total variation to obtain variance measures may be extended to
SS_TR and SS_E, the components of SS_T. The term "analysis of variance" arises from its
comparison of these variances derived from components of the total sum of squares. In testing
an analysis of variance model, the numerator and denominator of the F-ratio test statistic are each
expressed in terms of mean sums of squares. The average of a sum of squares is called a mean
square, and the test statistic is the ratio of these mean squares. As with the sample variance, the
divisor used to compute the mean is the number of degrees of freedom associated with the sum
of squares.
We first define the treatment mean square, the numerator in the ratio that we want.
DEFINITION: The treatment mean square, MS_TR, is SS_TR averaged over one fewer than the
number of treatments.
MS_TR = SS_TR/(k - 1)
This mean square is the mean contribution of the model per treatment difference. We average
over k - 1 instead of k because it is the difference among treatment means that causes this
variance. In the extreme case of k = 1 treatment, the overall mean would be identical with the
single treatment mean. Two treatments (k = 2) would create only one source of difference:
population one's mean versus population two's.
In the case example, there are five majors, so there are four degrees of freedom to
average over. If we know four differences among the treatment means, we can find all other possible differences
(see the Exercises for an example). Thus, the 1.98 treatment sum of squares translates to an MS_TR
of 0.495 (i.e., 1.98/4).
For the denominator of the F-ratio, we seek the mean square error.

DEFINITION: The mean square error, MS_E, is the average variability of the observations about
their treatment means, as measured by the mean of the error sum of squares SS_E. Thus,
MS_E = SS_E/(n - k)
Rather than divide SS_E by the sample size n, or even by n - 1, the proper number of degrees of
freedom is n - k. There are k sample means that must first be extracted from the information in
the sample: the k treatment means.
1
This leaves us with only n - k pieces of information, or
degrees of freedom. For the case of six-course GPAs, n - k is 35 - 5 = 30 and MS_E = 9.73/30 =
0.324.
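As a quick check on these two mean squares, the arithmetic amounts to nothing more than dividing each sum of squares by its degrees of freedom, as the following lines of Python confirm for the GPA example.

k, n = 5, 35              # five majors, 35 graduates in the sample
ss_tr, ss_e = 1.98, 9.73  # sums of squares computed in section 10.2

ms_tr = ss_tr / (k - 1)   # treatment mean square: 1.98 / 4  = 0.495
ms_e = ss_e / (n - k)     # mean square error:     9.73 / 30 = 0.324

print(round(ms_tr, 3), round(ms_e, 3))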
To test for the significance of an analysis of variance model, the appropriate test statistic
is the ratio of the treatment mean square to the mean square error, and this ratio has a pdf
called an F distribution.
The F-ratio to test the significance of an analysis of variance model with k treatments and
sample size n is the ratio MS_TR/MS_E; the sampling distribution for this ratio is an F
distribution with v_1 = k - 1 and v_2 = n - k degrees of freedom.
As always, testing requires a test statistic with a known sampling distribution. The t-distribution
is inadequate to represent the compound set of equalities described by H_0. However,
the F distribution that we used with analysis-of-variance tests in regression models is up to the
task. In fact, F-tests are flexible enough to test any of the linear models in this chapter.
2

Recall from Chapter 11 that the F distribution is a unimodal family of right-skewed
probability density functions defined for nonnegative values. The specific shape of each F distribution
family member is determined by its two parameters, v_1 and v_2.
Our task now is to argue that a test statistic for the analysis of variance (often abbreviated
AOV or ANOVA) model has F for its sampling distribution. For AOV analysis, the F-ratio
compares the average between-treatment differences with the average within-treatment
variability. The F distribution is then the sampling distribution that represents the null
hypothesis that no difference exists in the population between the treatment means. To
reject the null hypothesis and conclude that significant treatment differences exist in the
sample data, the F-ratio needs to be too large to have plausibly occurred merely from random data
patterns.


1
The overall mean does not require any more information because it represents an average of the treatment means.
2
The F-distribution may also be used for other types of compound hypotheses, such as testing groups of variables (see Chapter 13).
In testing an AOV model, the numerator and denominator of this test statistic are expressed in terms
of sums of squares like those discussed in Chapter 11. For the denominator of the F-ratio, we use
the average variability around each of the treatment means, based on SS_E. If the assumptions of the
analysis of variance model are valid, then falling in the rejection region of the F distribution
allows the decision maker to reject the null hypothesis and conclude that there is significant
variation in response by treatment.
As in the Chapter 11 regression F test, decision rules may be formulated from the
probability that sampling error alone caused the observed value of the test statistic. For
significance tests in AOV, this rule is the following:
Decision Rule for Testing an AOV Model: To test at the α significance level whether a
difference exists among the k population means, the following F-ratio rule applies:
if F-ratio > F_1-α(k - 1, n - k), reject H_0
otherwise, do not reject H_0
Rejecting H_0 in this test also allows us to conclude that the analysis of variance model is
statistically significant. This decision rule is portrayed in the accompanying graph.
Since the F-ratio is the ratio of these two variance measures, the F statistic for our GPA
example is 0.495/0.324, or 1.53. A glance at the F distribution indicates that the right tail
contains substantial probability area beyond F = 1.5. If H_0 were true for the population, F-ratios
as large as 1.53 (or even larger) would occur fairly often in sample data. To formally confirm
that H_0 should not be rejected for our case example, we request the specific F_1-α distribution
value for the rejection region. We may use the EXCEL built-in statistical function FINV
(inverse of the cumulative probability for the F distribution). Simply click on an empty cell in a
spreadsheet and type
=FINV(alpha, df1, df2)
where you must insert numerical values for alpha, df1, and df2 as follows:
alpha   the significance level you want to use for the AOV test
df1     the degrees of freedom for the treatment sum of squares, k - 1
df2     the degrees of freedom for the error, n - k
Excel will return in that spreadsheet cell the critical F-ratio beyond which the tail area contains alpha
probability, F_1-α(v_1, v_2). For example, an alpha = .05 significance level and degrees of freedom v_1 = 4
and v_2 = 30 require us to type =FINV(.05,4,30) in the cell. Excel will report a critical
value of F_.95(4,30) = 2.68962, which rounds to 2.69.
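The same critical value, and the p-value that goes with the observed F-ratio, can also be obtained outside of Excel. The Python lines below use scipy's F distribution functions and are offered only as an illustrative alternative to FINV, not as part of the original spreadsheet instructions.

from scipy.stats import f

alpha, df1, df2 = 0.05, 4, 30   # significance level, k - 1, and n - k

f_crit = f.ppf(1 - alpha, df1, df2)   # critical value, matching Excel's FINV(.05, 4, 30)
p_value = f.sf(1.53, df1, df2)        # right-tail area beyond the observed F-ratio of 1.53

print(round(f_crit, 2), round(p_value, 3))   # about 2.69 and 0.22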

According to the F-ratio decision rule, the F statistic for our business GPA sample is not in the
rejection region because F = 1.53 < F_.95(4,30) = 2.69. The dean therefore does not reject H_0. He
concludes instead that there was no difference at the .05 significance level in grade performance
among business majors.

Multiple Choice Review:
10.33 The F-ratio for one-way analysis of variance is the ratio of
a. the treatment mean to the overall mean
b. the treatment mean square to the mean square error
c. the treatment sum of squares to the error sum of squares
d. the alpha value to the p-value
e. the p-value to the alpha value

Answer the following questions about the analysis of variance table below:
SOURCE    DF    SS    MS    F
FACTOR     3    18
ERROR     20
TOTAL           48

10.34 The number of treatments for the factor is
a. 1
b. 2
c. 3
d. 4
e. cannot be determined from the information given

10.35 The sample size for the experimental design is
a. 22
b. 23
c. 24
d. 25
e. 26
10.36 The error sum of squares, SS_E, and mean square error, MS_E, are
a. 30 and 10
b. 30 and 1.5
c. 66 and 33
d. 66 and 3.3
e. not enough information to determine

10.37 The F ratio is
a. 1
b. 3
c. 4
d. 5
e. 8

Calculator Problems:
10.38 Using a calculator, determine MS_TR and MS_E from the following information:
(a) SS_TR = 30, SS_E = 480, n = 30, and k = 6
(b) SS_TR = 300, SS_E = 48, n = 30, and k = 6
(c) SS_TR = 30, SS_E = 480, n = 60, and k = 12
(d) SS_TR = 30, SS_E = 480, n = 12, and k = 3
(e) SS_TR = 30, SS_E = 48, n = 12, and k = 4

10.39 Using a calculator, determine the F-ratio from the following information:
(a) MS_TR = 400, MS_E = 120
(b) MS_TR = 0.50, MS_E = 0.020
(c) SS_TR = 30, SS_E = 480, n = 12, and k = 3
(d) SS_TR = 300, SS_E = 48, n = 30, and k = 6


10.5 The Analysis of Variance Table
Because of the extensive chain of logic and computations to finally arrive at the F test,
the intermediate results are usually presented in an analysis of variance table.
DEFINITION: An analysis of variance table is a table that contains the components used to
determine the F-ratio and conduct the F test; the table includes information about the degrees of
freedom, sums of squares, and mean squares for each portion of the variation involved in the analysis.
The AOV table has the following general form:
SOURCE       DEGREES OF FREEDOM    SUM OF SQUARES              MEAN SQUARES            F
Treatment    k - 1                 SS_TR = Σ (x̄_j - x̄)²        MS_TR = SS_TR/(k - 1)   F = MS_TR/MS_E
Error        n - k                 SS_E = Σ (x_i,j - x̄_j)²     MS_E = SS_E/(n - k)
Total        n - 1                 SS_T = Σ (x_i,j - x̄)²


The primary advantage of this analysis of variance table is its clear presentation of the
process by which the test statistic, the F-ratio, is derived. The AOV table displays the
partitioning of both the total sum of squares (SS_T) and the total degrees of freedom (n - 1) into
treatment and error components. Notice how the treatment and error lines sum to the total on the
bottom line for the sum of squares column. For the degrees of freedom column, the total n - 1
equals the sum of the treatment degrees of freedom k - 1 and the error degrees of freedom n - k.
The AOV table also provides a compact display of the intermediate computation process.
As our eye works its way across the table from left to right, we see that the mean squares are
derived from the ratios of the two preceding columns, the sums of squares and the degrees of
freedom. The F-ratio in turn is the quotient of the mean squares.
The above calculations were carried out step-by-step for the case example in order to
demonstrate the procedure by which one-way analysis of variance is conducted. As with other
statistical methods discussed earlier, however, Minitab can provide all the information you need
to perform AOV within a single menu operation, which combines all these steps and presents
relevant statistical information to help us reach meaningful decisions.
For our case study, each column contains the GPAs for a specific major, so five columns
are necessary in the command line. We therefore apply the AOVOneway command to these five
columns and obtain the analysis of variance table and descriptive sample statistics.
The output of AOVOneway has two sections. The bottom section contains summary statistics
(sample counts N, means, and standard deviations) for the response variable (GPA) at each of the
five treatment "levels" -- the five business majors. Notice that these are the same values found
in the DESCribe output from section 10.2. The diagram of "95 PCT CI'S FOR MEAN" and the
"POOLED STDEV" will be discussed later.


ANALYSIS OF VARIANCE
SOURCE DF SS MS F p
FACTOR 4 1.979 0.495 1.53 0.220
ERROR 30 9.730 0.324
TOTAL 34 11.710
INDIVIDUAL 95 PCT CI'S FOR MEAN
BASED ON POOLED STDEV
LEVEL N MEAN STDEV -----+---------+---------+---------+-
ACCT 7 2.8333 0.5443 (--------*-------)
FIN 7 2.9048 0.7127 (--------*--------)
MGMT 7 2.7143 0.7436 (--------*--------)
MKTG 7 2.5476 0.3563 (--------*--------)
GEN 7 2.2381 0.3709 (--------*--------)
-----+---------+---------+---------+_
POOLED STDEV = 0.5695 2.00 2.50 3.00 3.50

The top half of the output presents the AOV table. The first five columns of the AOV
table have the same format as the one presented above, with degrees of freedom, sum of squares,
and mean square abbreviated by DF, SS, and MS, respectively. Notice that the DF, SS, MS, and
F columns contain the precise results we determined in our step-by-step computations on this
sample. Let's postpone for the moment a discussion of the p-value reported in the last column.
By checking the relationships in the AOV table, we often gain valuable insights about the
significance test on the model. First, it is easy to verify on a calculator that the treatment and
error sums of squares add up to the total: SS_TR + SS_E = 1.979 + 9.730 = 11.710 = SS_T. These are
the same values (except for rounding) that we calculated earlier. The degrees of freedom are
k - 1 = 4 for treatments and n - k = 30 for the errors, out of n - 1 = 34, the total for variation around
the mean. The mean squares column, MS, is found from the quotient of the two preceding
columns, SS/DF, the sum of squares averaged over the degrees of freedom. The F-ratio, in turn,
is the ratio of these mean squares. Thus, the computer output table compactly displays all the
intermediate measures we derived separately from the business GPA data.
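Readers who do not have Minitab can verify the table directly from the raw data. For instance, scipy's one-way analysis of variance routine, shown below purely as an illustrative alternative, returns the same F-ratio of about 1.53 and p-value of about 0.220 for the five columns of GPAs.

from scipy.stats import f_oneway

acct = [2.16667, 2.33333, 2.50000, 2.66667, 3.33333, 3.33333, 3.50000]
fin  = [1.83333, 2.33333, 2.50000, 3.00000, 3.33333, 3.50000, 3.83333]
mgmt = [1.83333, 2.00000, 2.50000, 2.50000, 3.00000, 3.16667, 4.00000]
mktg = [2.00000, 2.16667, 2.50000, 2.66667, 2.66667, 2.83333, 3.00000]
gen  = [2.00000, 2.00000, 2.00000, 2.00000, 2.33333, 2.33333, 3.00000]

# One-way AOV across the five majors: returns the F-ratio and its p-value
f_ratio, p_value = f_oneway(acct, fin, mgmt, mktg, gen)
print(round(f_ratio, 2), round(p_value, 3))   # about 1.53 and 0.220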
Instead of first determining F_1-α, it is easier from computer output to apply the p-value
decision rule. The sixth column of the AOV table in the computer output provides the p-value.
As we discussed in Chapter 9, the p-value tells us the probability of obtaining a test statistic as
large (or larger), given that the null hypothesis is the true state of nature. In the case of a one-
way AOV test, the test statistic is the F-ratio and the null hypothesis relates to equality among
treatment means.

p-Value Decision Rule for Testing an AOV Model: At the α significance level, differences
among the k population treatment means may be tested by the following p-value decision rule:
if p < α, reject H_0
if p ≥ α, do not reject H_0

In the case example, the p-value of .220 indicates that we cannot reject the null hypothesis at the
.05 significance level. As with the F-ratio decision rule, we again advise the college dean that no
statistically significant differences were found. Average performance in the six required courses
of the business curriculum did not vary significantly among the five majors.
What about the assumptions of the AOV model? Were they adequately met? The
Minitab output provides sample evidence about the equality of standard deviations assumption.
Remember that analysis of variance tests assume that each of the k treatment populations has a
similar standard deviation. The sample standard deviations, provided in the second portion of the
AOVOneway output, allow us to observe large departures from this assumption. In order to
establish the proper denominator for the F-ratio, the MS_E, the F test assumes identical population
standard deviations. An estimate of that common standard deviation is obtained by pooling the
variation within each of the k samples, reported as the "pooled" standard deviation.
In the case example, notice that two pairs of the sample standard deviations are similar
and another lies in between. Finance and management have the largest dispersion (s_j = 0.71 to
0.74), marketing and general business the lowest (0.36 to 0.37), and accounting has an
intermediate value (0.54). The pooled standard deviation, approximately 0.57, is reported
on the final line of the Minitab output. It is easy to verify that this pooled standard deviation is
the square root of the pooled variance, since 0.5695 squared is the MS_E = 0.324 (found in the
analysis of variance table). You may also check with a calculator that 0.324 is the mean of the
squared standard deviations for the five individual s_j's: 0.54, 0.71, 0.74, 0.36, and 0.37.
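Because each major contributes the same number of graduates, the pooled variance is simply the average of the five sample variances, a relationship easy to confirm with a few lines of Python using the standard deviations reported in the Minitab output.

import math

stdevs = [0.5443, 0.7127, 0.7436, 0.3563, 0.3709]   # sample s_j from the output above

pooled_variance = sum(s ** 2 for s in stdevs) / len(stdevs)   # equals MS_E when the n_j are equal
pooled_stdev = math.sqrt(pooled_variance)

print(round(pooled_variance, 3), round(pooled_stdev, 4))      # about 0.324 and 0.5695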
As mentioned earlier, AOV tests are reasonably insensitive to moderate departures from
equal standard deviations. Nevertheless, two of the standard deviations are only half as large as
the s_j for two of the others. This result may cause more cautious investigators to switch to tests that
do not rely on the assumption of similar standard deviations across treatments (see Chapter 14).
1

The dean should also check that the samples were collected in a random and independent
manner, and that approximately normal distributions occur for each population studied.
In the next section, we examine a case in which significant differences are found among
population means. We shall then be able to investigate the nature of the alternative hypothesis
that applies when we decide to reject H0.

¹ You may have noticed that majors with the smallest means also have the smallest standard deviations. This relationship between μj and σj is often observed in business modeling.

Multiple Choice Review:
10.42 Tests to compare individual treatment means may be conducted only if
a. the null hypothesis that all treatment means are equal was rejected
b. at least one of the sample treatment means is different from the rest
c. the sample size is large (at least 30)
d. there are at least four treatments being compared
e. not all the assumptions of the AOV model are valid

10.43 If we can reject H0: μ1 = μ2 = μ3, then
a. μ1 is different from μ2
b. μ1 is different from either μ2 or from μ3 or from both
c. μ1 is different from μ2 and μ2 is different from μ3
d. one of the three μ's is different from the other two, which are equal
e. either c or d must be true

Calculator Problems:
10.44 For each of the following, complete the omitted values for DF, SS, MS, or F.
(a) SOURCE DF SS MS F
FACTOR 4 20
ERROR 30 60
TOTAL 34 80

(b) SOURCE DF SS MS F
FACTOR 3 20
ERROR 20 60
TOTAL

(c) SOURCE DF SS MS F
FACTOR 5 20
ERROR
TOTAL 29 80

(d) SOURCE DF SS MS F
FACTOR 8.5
ERROR 21 63
TOTAL 23 80

(e) SOURCE DF SS MS F
FACTOR 4 4.5
ERROR 15
TOTAL 14

10.6 Comparing Individual Treatment Means
In the early 1990s, a rash of bank failures caused banks to monitor more closely their
loan approval process. Consider a bank manager who wants to determine whether his five loan officers are approving unsecured consumer loans at different rates of interest.¹ A random sample of unsecured consumer loans approved by each loan officer was collected. The interest rates on the sampled loans are printed in five columns below.
ROW   lender1  lender2  lender3  lender4  lender5

  1    12.00    12.50    13.5     13.00    12.00
  2    12.25    12.50    13.5     13.50    12.50
  3    12.25    13.50    14.0     13.75    13.25
  4    12.50    13.75    14.0     14.00    13.50
  5    12.50    14.00    14.5     14.50    13.50
  6    13.00    15.00    15.0     15.00    13.75
  7    13.00    15.00    16.0     15.00    13.75
  8    13.50    15.25    16.5     15.50    14.00
  9    13.50    15.50    18.0              14.25
 10                                        15.00

N MEAN MEDIAN TRMEAN STDEV SEMEAN
lender1 9 12.722 12.500 12.722 0.551 0.184
lender2 9 14.111 14.000 14.111 1.146 0.382
lender3 9 15.000 14.500 15.000 1.541 0.514
lender4 8 14.281 14.250 14.281 0.860 0.304
lender5 10 13.550 13.625 13.563 0.848 0.268

Observe that the number of loans from each treatment varied slightly in the final sample. Unequal treatment sample sizes often happen even when efforts are made to obtain equal nj counts. For example, respondents may drop out of the survey sample due to illness, and employee subjects may quit or be fired before the study is completed. In this case, ten loans were sampled from each population. However, five of the records were subsequently discarded from the sample because they had been mislabeled as unsecured loans but turned out to be business or secured loans. As mentioned at the end of Section 10.3, analysis of variance tests may be used with unbalanced designs so long as there are not large differences among the nj treatment sample sizes. If there are large differences among the nj, a stricter reliance must be placed on the constant standard deviation assumption.


¹ Unsecured consumer loans are among the riskiest assets held by a bank because they lack the collateral backing in case the borrower defaults on the loan.

From the sample treatment means, it is evident that lender 3 averaged the highest interest
rates (15.0%) and lender 1 the lowest (12.7%). But were any of these differences
significant? A one-way AOV test was conducted with α established at .05.
ANALYSIS OF VARIANCE
SOURCE DF SS MS F p
FACTOR 4 26.15 6.54 6.00 0.001
ERROR 40 43.60 1.09
TOTAL 44 69.75
INDIVIDUAL 95 PCT CI'S FOR MEAN
BASED ON POOLED STDEV
LEVEL N MEAN STDEV ----------+---------+---------+------
lender1 9 12.722 0.551 (------*------)
lender2 9 14.111 1.146 (------*------)
lender3 9 15.000 1.541 (------*------)
lender4 8 14.281 0.860 (-------*------)
lender5 10 13.550 0.848 (------*-----)
----------+---------+---------+------
POOLED STDEV = 1.044 13.0 14.0 15.0

Although SS_TR = 26.15 is smaller than SS_E = 43.60, the F-ratio is 6.00. The reason such a large F-ratio statistic occurs is that the degrees of freedom are ten times larger (40 versus 4) for the error sum of squares. Thus, 26.15/43.60 = 0.60, but the ratio of the mean squares is ten times as large, or 6.00. Because p = .001 is less than α = 0.05, the bank manager rejects H0 that all loan officers approve loans at the same average rate.
Having established that at least one difference exists, the manager may want to determine
which specific version of the alternative hypothesis applies. One attractive choice is the loan
officer with the most unusual sample mean. In this sample, lender 3 approved loans at the highest average interest rate (15.0%), 0.72 percentage points higher than lender 4, the next highest. But
lender 3 also had the largest standard deviation (1.54%). From the data listing, the manager also
notes that nearly half of lender 3's loans were approved at 14% or less, below the mean rates on
loans approved by lenders 2 and 4. Therefore, the bank manager again resorts to statistical
analysis.
Minitab provides a diagram of one-sample 95% confidence intervals for each of the five lenders. Like those found in Chapter 8, these intervals are computed from the sample means and standard deviations (printed on the left side of the output) and use the t.025(n-1) distribution values.¹ Searching the diagram for intervals that do not overlap generally provides a rough indication of where treatment differences exist.²
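A rough version of these individual intervals can also be computed directly. The Python sketch below (assuming the scipy library) follows the output header's note that the plotted intervals are based on the pooled standard deviation, so it uses the pooled s = 1.044 and the error degrees of freedom (40) reported in the AOV table together with the sample sizes and means printed above; intervals built instead from each lender's own standard deviation would differ somewhat in width.

from scipy.stats import t

pooled_s = 1.044   # pooled standard deviation from the Minitab output
df_error = 40      # error degrees of freedom from the AOV table
t_value = t.ppf(0.975, df_error)

lenders = {"lender1": (9, 12.722), "lender2": (9, 14.111), "lender3": (9, 15.000),
           "lender4": (8, 14.281), "lender5": (10, 13.550)}

# Individual 95% confidence interval for each treatment mean
for name, (n, mean) in lenders.items():
    half_width = t_value * pooled_s / n ** 0.5
    print(name, round(mean - half_width, 2), round(mean + half_width, 2))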

After examining the loan officer confidence intervals for overlaps, the bank manager
concludes that loan officer 1 approves loans at lower interest rates than loan officers 3 and 4.
Loan officer 1 may be giving preferential rates to friends, or perhaps loan officers 3 and 4 (and
even 2) are more successful in obtaining higher rates for the bank. On the other hand, two or
three of the loan officers may be costing the bank potential borrowers who are attracted by lower interest rates at competing banks. The manager therefore decides to schedule a workshop for loan officers
to make the loan approval process more uniform. To reduce the discretion of loan officers, he
also designates a narrower range of interest rates for unsecured consumer loans.

10.7 Regression vs. Analysis of Variance: A Comparison³

By now, you have surely noticed certain similarities between analysis of variance and the
regression analysis presented in Chapters 9-10. These similarities are numerous and are not
coincidental. Both methods are employed to analyze problems in which variation in a dependent
variable is related to other variables. In analysis of variance the dependent variable is called the
response variable and the explanatory variables are the factors. The treatments for each factor
are analogous to the values for each explanatory variable.
The AOV models resemble regression models. The one-way analysis with a single factor
parallels the simple regression model containing only one explanatory variable. The two factors
in two-way models are like the two explanatory variables in a multiple regression model. The
model assumptions of normal distribution, constant standard deviation, and independent random
samples overlap considerably with the assumptions listed in Chapter 11.
In regression, the F test for the model tests whether the explanatory variables as a group
have any effect on explaining variation in the dependent variable. In a similar fashion, the F
tests in AOV models test whether the set of treatments have a significant effect on the response
variable. In each case, we are testing whether several parameters are different from a reference
value indicating no effect. Analysis of variance tests whether any of the treatment population

1
If the manager originally suspected that one (or more) specific loan officers had average rates different from the rest, two-sample t intervals may
be constructed among pairs of columns (by the TWOSample-t command discussed in Chapter 9) . However, hunting for significant differences
by running all possible pairwise tests has been shown to exaggerate the number of significant differences. Comparisons of one-sample intervals
is a more conservative test, and thus more appropriate in this latter situation.
2
Depending on the type and number of multiple comparisons, any of three other methods (Tukey, Scheffe, and Bonferroni) may be more
appropriate. For an description, comparison, and examples of each method, see Chapter 17 of Applied Linear Statistical Models (2nd ed.) by
Neter, Wasserman, and Kutner.
³ The material in this section assumes students have also completed the first two chapters of Part IV. Instructors who cover analysis of variance before regression should postpone this optional section until Chapters 9 and 10 are completed.

means μj are different from the overall mean μ. Regression tests whether any of the coefficients βj differ from zero.
Both types of models may be portrayed by an AOV table. The only difference is that the regression row is replaced by one or more factor (and interaction) rows. In each case, we call the total variation from the mean SS_T, the total sum of squared deviations. Both methods also share SS_E, the error sum of squares. In regression, the errors result from deviations from the fitted equation. Similarly, the errors in AOV occur due to failure of the treatment means to match the values in the corresponding experimental units. As with regression, it can be shown algebraically that the explained portion of SS_T, called SS_TR rather than SS_R, is related to SS_T and SS_E by the expression SS_T = SS_TR + SS_E.
As with the regression F test, we construct an AOV table from the sum of squares,
dividing each by the appropriate number of degrees of freedom to find the mean squares, and
subsequently using the ratio of the mean squares, the F-ratio, as our test statistic.¹
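These relationships can be summarized in a few lines of code. The Python sketch below is a generic illustration, not a calculation from any case in the text (the function name and the example numbers at the end are made up): given SS_TR, SS_E, the number of treatments k, and the total sample size n, it fills in the remaining entries of a one-way AOV table exactly as described above.

def one_way_aov_table(ss_tr, ss_e, k, n):
    """Complete a one-way AOV table from the two sums of squares,
    the number of treatments k, and the total sample size n."""
    df_tr, df_e = k - 1, n - k
    ms_tr, ms_e = ss_tr / df_tr, ss_e / df_e
    return {"SS_T": ss_tr + ss_e,          # SS_T = SS_TR + SS_E
            "MS_TR": ms_tr, "MS_E": ms_e,
            "F": ms_tr / ms_e,             # F-ratio = MS_TR / MS_E
            "DF": (df_tr, df_e, n - 1)}

# Hypothetical values: 4 treatments, 24 observations in total
print(one_way_aov_table(ss_tr=30.0, ss_e=60.0, k=4, n=24))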

In fact, it is often possible to construct equivalent regression and AOV models and obtain
essentially identical results when using the same sample data. Some statistical computer
packages use the same command to run either regression or AOV models, grouped within the
generic title "general linear models."
Regression analysis and analysis of variance are very similar in their models, hypotheses,
and tests.
With all these similarities, there remain some important distinctions as well. Regression
is designed for analysis of unbalanced samples with confounded variables, so often the product of secondary data generated outside our control. This fact is what makes regression so popular in
business and economics. Opportunities for laboratory testing are the exception to the rule. A lot
is riding on each business action and our economic well-being depends on wise public policy.
With the stakes so high, corporate and government decision makers are less likely to turn
decisions about tax legislation, merger deals, product redesign, foreign investment, and
advertising budgets over to business researchers merely to settle disputed intellectual issues.
Therefore, we use regression analysis whenever we cannot control the data collection
process and must accept confounded factors. However, by including all important factors explicitly as explanatory variables within a multiple regression model, individual variable contributions can still be identified.

¹ Even the k - 1 degrees of freedom for k treatments correspond to the k explanatory variables in a regression model lacking an intercept term.

But if time and budget make it feasible, a carefully designed experiment is always
preferable. For all its usefulness, regression analysis often has a weak link: the sample data.
Regression cannot get blood from a stone. There must be sufficient variability among the
explanatory variables to separately isolate relationships with the dependent variable. If
explanatory variables in a model are highly correlated, it may be impossible to distinguish with
statistical techniques the separate influences of each. For example, people with poverty level
income usually have little education but are also typically the products of low-income parents. It
is therefore difficult to determine the individual contributions of either education or parent
income if the bulk of any random sample of those in poverty possess deficiencies in both factors.
If we attempt to clarify the situation by omitting one of the correlated explanatory variables from the regression model, we are apt to inject biased coefficient estimation instead.¹

With regression analysis, we often have little control over the experiment generating our
sample, so correlation among important explanatory variables is quite common and may be dealt
with only by the statistical procedure discussed in the preceding paragraph. If a controlled
experiment is possible, we saw in Section 10.1 how confounded factors may be disentangled by
a careful design. By fixing certain factors, randomizing others, and collecting data using a balanced design, we may then apply simple analysis of variance models.
Analysis of variance serves as a complement to regression analysis. The former is employed when controlled experiments can be conducted, the latter when unbalanced samples arise from uncontrolled data collection.
A second fundamental difference between regression and AOV is that analysis of
variance factors have their treatments in the form of categories rather than as quantitative
measures. Discrete quantitative variables, if they contain only a few possible values, may also be
represented as treatments. Other quantitative variables such as income and age may be used as
factors once they have been categorized into value ranges suitable as treatments.
However, grouping quantitative variables into a series of treatments may involve the loss of information. Suppose that each extra year of experience at a business results in approximately a $500 increase in salary. A regression model Salary = β0 + β1 Exper + ε would better capture this relationship (and account for a greater share of SS_T) than the corresponding AOV model x_ij = μj + ε_ij in which the μj are experience group means. The reason for the reduction in explanatory ability for the AOV model is the loss of precision in experience information. By coding Exper into a few treatments, say 1 = under 5 years, 2 = 6 to 10 years, and so forth, less variation in Salary is accounted for. In addition, there is also an arbitrariness in the selection of range sizes and break points in the designation of treatments; this limitation of AOV modeling parallels the histogram design problems discussed in Chapter 2.

1
Chapter 12 discusses this "multicollinearity" problem in greater detail.

On the other hand, recall that regression models assume that explanatory variables have a constant slope relationship to the dependent variable. Thus,
Salary = 10,000 + 500 Exper + ε
means that each year of experience results, on average, in a $500 higher salary. But what if such a rigid relationship does not prevail over all experience ranges? Suppose that experience initially results in small pay increases, perhaps $200 per year, that the increase grows substantially to $1,200 per year at mid-career, and that it fades to $400 per year after 30 years on the job. Analysis of variance can capture this and other types of relationships. AOV models fit each treatment mean separately rather than forcing a slope fit where none exists.
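A tiny numerical sketch makes the contrast concrete. The Python code below uses made-up salary figures that follow the uneven pattern just described (slow early raises, large mid-career raises, smaller late raises); the straight-line fit forces one slope on the whole range, while the AOV-style fit simply uses each experience group's own mean. The numbers and group labels are illustrative assumptions, not data from the text.

import numpy as np

# Made-up mean salaries (in $1,000s) for four experience groups: 5, 15, 25, and 35
# years, reflecting raises of roughly $200, $1,200, and $400 per year between stages.
exper = np.array([5.0, 15.0, 25.0, 35.0])
salary = np.array([31.0, 33.0, 45.0, 49.0])

# Simple regression forces a single slope across the whole experience range
b1, b0 = np.polyfit(exper, salary, 1)
print("straight-line predictions:", np.round(b0 + b1 * exper, 1))

# An AOV-style fit uses each group's own mean as its fitted value
print("treatment means:          ", salary)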
The price for this flexibility in AOV models is a greater sacrifice in degrees of freedom. You may have wondered why an additional explanatory variable in a regression model accounts for the same number of degrees of freedom, one, as the addition of one treatment. The reason is now apparent. In fitting a regression model, the only parameter we estimate for each variable is the slope coefficient βj. An additional factor with a treatments, by contrast, requires the estimation of (a - 1) differences in treatment means. The more treatment categories we designate for a factor, the more degrees of freedom used. Greater degrees of freedom affect AOV test results adversely in two ways. The treatment mean square in the numerator of the F-ratio is reduced if we average over a larger (a - 1) amount without proportionally increasing the treatment SS explained by the factor. Moreover, the denominator of the F-ratio is increased because SS_E is averaged over fewer remaining degrees of freedom. Thus, statistical significance of a factor may be sacrificed if a factor is designated by excessively numerous treatments. Even more degrees of freedom are expended if interactions are included in the model.
Of course, a larger sample size provides us with the luxury to examine factors in greater
detail. In addition, a carefully controlled experiment may not require the inclusion of the many
control variables necessary in the corresponding multiple regression model; therefore, degrees of
freedom may also be saved in AOV models.
In comparison with regression models, analysis of variance models allow for more flexible
relationships between factors and response variables, but at the cost of additional degrees
of freedom.


Multiple Choice Review:
10.78 Which of the following is not common to both regression and analysis of variance?
a. involve linear models
b. use an F-test to test for the significance of the model
c. use quantitative explanatory variables
d. use a dependent variable that is measured quantitatively
e. all of the above are common to regression and analysis of variance

10.79 Regression models are similar to analysis of variance models in all except which of the
following ways:
a. the dependent variable in regression is like the response variable in AOV
b. the explanatory variables in regression are like the factors and treatments in AOV
c. both use F tests for significance of the model
d. both report analysis of variance tables
e. all of the above are similarities between regression and analysis of variance

10.80 Which of the following advice should you give someone deciding between the use of
regression and analysis of variance models?
a. use analysis of variance if you have a balanced design
b. use analysis of variance if you need to estimate a slope coefficient for an explanatory
variable
c. use regression if you have a controlled experiment with most confounding factors held
fixed
d. use regression if you have only one or two explanatory variables, each of which is
categorical
e. all of the above are good advice to give


CASE MINI-PROJECTS:
1. A two-way model (without interaction) is run to test whether brand of computer and capacity
of hard disk have a significant effect on the price of a 386 computer. The 24 data observations are stored in three columns with the response variable and two factors defined as follows:
pr 386 response variable: dollar price of 386 computer
brand factor 1: coded 1 = non-name brand, 2 = name brand, 3 = premium brand
(IBM or Compaq brands)
harddisk factor 2: coded 1 = small capacity hard disk (less than 40 megabytes), 2 =
larger capacity hard disk (60 or more megabytes)

ROWS: harddisk COLUMNS: brand
1 2 3 ALL
1 1963.0 1911.0 2685.0 2186.3
2 2861.0 1931.3 3589.0 2793.8
ALL 2412.0 1921.1 3137.0 2490.0

CELL CONTENTS --
pr 386:MEAN

Analysis of Variance for pr 386

Source DF SS MS F P
brand 2 5986494 2993247 7.41 0.004
harddisk 1 2213730 2213730 5.48 0.030
Error 20 8075865 403793
Total 23 16276089

Use the p-values to conduct the F tests at the α = .05 level for each factor, interpret your test
results verbally to a person shopping for a 386 computer, and examine the table of means to
show that a brand name computer with large capacity hard disk appears to be an excellent buy.

2. A one-way analysis of variance test is conducted on whether the amount a customer spends in
a jewelry store varies with the time a salesperson spends with that customer. The columns
contain random samples of customer purchase amounts at three treatments for salesperson time
spent: 5, 10, and 15-30 minutes.
5 min purchase when salesperson spent about 5 minutes with customer
10 min purchase when salesperson spent about 10 minutes with customer
15-30min purchase when salesperson spent 15 to 30 minutes with customer
The sample data on customer dollar purchase amounts are sorted and printed below:

ROW   5 min   10 min   15-30min
  1       0        0          0
  2      10        0          0
  3      10        0          0
  4      18        0          0
  5      50       25         10
  6      50       30         35
  7      78       50         70
  8               60        100
  9              100        200
 10              100        200
 11              150        300
 12              419        400
 13              575        400
 14                         560
 15                         575
 16                         599

and a one-way analysis of variance is conducted:

ANALYSIS OF VARIANCE
SOURCE DF SS MS F p
FACTOR 2 181681 90841 2.60 0.089
ERROR 33 1151870 34905
TOTAL 35 1333551
INDIVIDUAL 95 PCT CI'S FOR MEAN
BASED ON POOLED STDEV
LEVEL N MEAN STDEV ----------+---------+---------+------
5 min 7 30.9 28.7 (-----------*-----------)
10 min 13 116.1 178.2 (--------*-------)
15-30min 16 215.6 225.9 (-------*-------)
----------+---------+---------+------
POOLED STDEV = 186.8 0 120 240

(1) State the alternative hypothesis given that the null hypothesis is
H0: μ5 min = μ10 min = μ15-30min
HA: at least one μj is different from the other means
(2) Show the results of the p-value decision rule for this test at the α = .05 significance level and state your test findings in one sentence.
p = .089 > α = .05. No significant spending differences were found regardless of the amount of time the salesperson devoted to the customer.
(3) If the manager only examines the means of the three treatment samples, he might conclude that 10 minutes with the customer nearly quadrupled purchase amounts ($116.1 versus $30.9), and 15 to 30 minutes doubled that figure again ($215.6 versus $116.1). Use the data listing and confidence interval diagram to respond to this mistaken conclusion. The overlapping confidence intervals and the huge range of data within the last two categories indicate that within-category variation dominates between-category variation.

(4) Is the design balanced? Explain.
The design is unbalanced because there are fewer observations from "5 min" than from the other two categories.
3. Many companies have United Way drives among their employees and set goals to surpass each year. Test whether marital status and the size of one's salary have a significant effect on the percentage of salary donated to charity. A factorial design with replications from random sampling results in 30 employee observations stored in three columns with the response variable and two factors defined as follows:
donate% response variable: percentage of salary donated to charity
marstat factor 1: coded 1 = single, 2 = married
paycode factor 2: coded 1 = salary under $20,000, 2 = salary $20,000 to $30,000,
and 3 = salary at least $30,000
A table of means is first generated:

ROWS: marstat COLUMNS: paycode

1 2 3 ALL

1 0.9284 1.4323 2.2490 1.5366
2 0.9508 1.4080 1.7954 1.3848
ALL 0.9396 1.4202 2.0222 1.4607

CELL CONTENTS --
donate%:MEAN
Then a two-way model (with interactions) is run, with the following results:

Analysis of Variance for donate%

Source DF SS MS F P
marstat 1 0.1729 0.1729 0.85 0.367
paycode 2 5.8841 2.9420 14.39 0.000
Interaction 2 0.3442 0.1721 0.84 0.443
Error 24 4.9053 0.2044
Total 29 11.3064

(1) Use the p-values to conduct the F tests at the α = .05 level for each factor, interpret your test
results verbally.
(2) Use the ALL column or ALL row of the table of means to quantify any significant patterns
found in question #1. [made up example: "The significant factor, marital status, was associated
with single employees donating at nearly twice the rate (4.2%) of married employees' donations
(2.2%)."]
(3) Conduct a test at the .05 level for possible interactions between marital status and salary.
What does the test result allow you to conclude about whether an additivity assumption would
have been valid?



CHAPTER 11 Regression Modeling,
Testing, and Prediction
Approach: The inference procedures of estimation and hypothesis testing developed in
Chapter 8 are applied to regression analysis. We discover conditions under which least-squares
estimators have desirable statistical properties and familiar sampling distributions. We use these
properties to obtain confidence intervals for predictions, forecasts, and marginal effect analysis.
We also extend testing methods to decide which variables in our models are statistically
significant and learn how to test the entire model for significance.

New Concepts and Terms:
population and sample regression equations
efficiency and the Gauss-Markov theorem
random disturbances and the systematic portion of the population regression
significance tests of explanatory variables
F distribution and F-ratio test statistic for testing regression models
analysis of variance table in regression
standard error of the conditional mean
prediction and forecast intervals
turning points
ex post and ex ante forecasting
conditional and unconditional forecasting, contingent forecasts
forecasting bias, MAE, RMSE, and MAPE

SECTION 11.1 Introduction to Regression Modeling and Statistical Inference
SECTION 11.2 Testing and Estimation of Individual Explanatory Variables
SECTION 11.3 Testing Regression Models for Significance
SECTION 11.4 Making Predictions by Regression Inference
SECTION 11.5 Forecasting Methods and Forecast Modeling

11.1 Introduction to Regression Modeling and Statistical Inference
Suppose you were majoring in another subject, perhaps education, biology, psychology,
or engineering. You would still be required to learn statistics to become a competent
professional in these areas. However, no other profession relies on regression as much as
business to solve its important statistical problems. To many applied business statisticians,
regression is the first and only line of attack they need to consider. This is true for virtually all
economic problems and many others from marketing, finance, accounting, and management.
Business problems tend to be multivariate, and regression is often the easiest to use and interpret
for the non-experimental data most common in business. Businesses have a critical need to
forecast the future, speculate about What If? questions, and know which factors are and are not
related to business target variables. This chapter will demonstrate some of the reasons why business analysts use regression so extensively in statistical inference situations.
We now may return to regression with our toolbox full of sample inference concepts,
distributions, and methods. Many of the same statistical inference principles introduced in
Chapter 8 apply for regression analysis. In this second chapter on regression, we convert
regression from a descriptive tool to an inferential method by identifying regression parameters,
sampling distributions, and estimation and hypothesis testing methods.
We first encountered regression in Chapter 4 as a descriptive statistical method for
summarizing variable relationships. These regression equations were used to make predictions
and answer What if? marginal effects questions. Recall that the fitted equation for simple
regression is
Ŷ = b0 + b1X
and is merely a special case of the multiple regression equation
Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
Thus, the difference between simple and multiple regression is the number of explanatory variables. Moreover, multiple regression is more commonly used in business because business and economic phenomena are seldom simple enough to be explained adequately by simple regression.
Let's review descriptive regression analysis in a regional industrial development case
study. We will then use this case example to introduce inference methods in regression.


Chapter Case #1: Industrial Development in New York
Most communities value their manufacturing plants because their traditionally high
wages and spending generate tax revenues and support local businesses. In the 1970s and
1980s, a wave of plant closings in the Midwest and Northeast rust belts of
the United States devastated regional economies. As a result, most regions
have created industrial development agencies to retain existing
manufacturing jobs, encourage factory expansions, and lure new plants to
the region.
Industrial development corporations (IDCs) are often empowered to
offer vacant "shell" structures in tax-free industrial parks, financial
assistance from industrial development bonds, tax abatements or deferrals,
job training, or outright subsidies. IDCs sometimes receive bad press and
threats of funding cuts when previous financial assistance to plant startups
did not result in as many jobs as expected. IDCs and factory owners may
understate or exaggerate job creation potential of new plants. However,
erroneous job predictions may not be the fault of the IDC or the factory
owner. When the decision to open a factory is made, the plant size and
investment are well known. By contrast, employment at the plant is less
certain, dependent on fluctuating market conditions and success in winning
contracts.
The low-income counties of northern New York State have lost
thousands of jobs from aluminum refinery closings along the St. Lawrence
Seaway. As part of its effort to restore jobs to this depressed part of New
York, an IDC examines the statistical relationship between the number of
jobs created by a new venture and the plant size and the amount invested.
The multiple regression equation model has the following form:
predicted jobs = b0 + b1 size + b2 invest
where the jobs variable measures the number of jobs created, size is the new
floor area added (in thousands of square feet), and invest is the new
investment in plant and equipment (in millions of dollars) at each plant.
The IDC then collects a random sample of 56 manufacturing plant
expansions since 1980 from a six-county area of northern New York. The
sample data on job creation, new plant area and investment are listed in
jobs   size   invest
  2     8.0    0.158
  4     5.0    0.150
  5     1.0    0.008
  5    19.5    0.059
  5     3.5    0.200
  6     2.0    0.040
  6    16.3    1.600
  7    12.5    0.700
  8     4.0    0.200
  8     5.8    0.222
 10    24.0    0.350
 10    51.8    1.000
 10    10.0    0.500
 10     4.8    0.250
 10    10.0    0.204
 10    10.0    0.025
 12     8.1    0.344
 12     8.0    0.155
 15    24.0    0.550
 15    25.0    1.200
 18    12.5    0.350
 19    38.4    1.049
 20    17.0    0.375
 20    20.0    0.250
 20    20.0    0.240
 20    19.2    0.300
 20    18.5    0.665
 20    19.5    0.500
 20    15.0    1.600
 20   122.0    1.000
 20    10.0    0.700
 25     9.0    0.310
 25    20.0    1.400
 25    60.0    1.400
 25    20.0    1.000
 25    10.0    1.000
 29    20.0    0.900
 30     6.0    0.200
 30    20.0    0.670
 30     7.5    0.225
 31    60.0    1.500
 40     4.0    0.750
 40    10.0    1.250
 40    17.5    1.500
 40    30.0    0.750
 44    18.0    0.465
 45    26.3    1.800
 45    50.0    3.331
 45    40.0    0.750
 50    95.0   10.000
 50    16.9    1.720
 55    60.0    1.000
 60    20.0    0.150
 60    21.0    0.660
 68    57.6    7.500
 72   122.0    3.687
Figure 11.1

Figure 11.1, sorted by number of jobs created.¹ Figures 11.2 and 11.3 contain the portions of the Minitab and Excel regression output we examined in Chapter 4.
Regression Analysis

The regression equation is
jobs = 16.7 + 0.175 size + 4.02 invest

s = 14.73 R-sq = 33.3% R-sq(adj) = 30.8%
Figure 11.2

Both outputs report the fitted regression equation as
predicted jobs = 16.7 + 0.175 size + 4.02 invest
In Chapter 4, we made predictions from fitted equations like this. To predict jobs for a factory expansion of 10 thousand square feet and $1 million, for example, insert size = 10 and invest = 1 into the fitted equation as follows:
predicted jobs = 16.7 + 0.175(10) + 4.02(1) = 16.7 + 1.75 + 4.02 = 22.5 jobs
Thus, the regression predicts 22 or 23 jobs created. However, a glance at the data in Figure 11.1 shows that an expansion of size = 10 and invest = 1 created 25 jobs, an error of 2.5 jobs. Even larger errors occur for many of the 55 other data observations.
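The prediction and its error are simple enough to compute by hand, but the short Python sketch below spells out the arithmetic. The coefficients are those reported in the output, and the 25-job figure is the actual value from Figure 11.1; the function name is just a label for this illustration.

# Fitted coefficients from the Minitab and Excel output
b0, b1, b2 = 16.7, 0.175, 4.02

def predicted_jobs(size, invest):
    """Predicted jobs for a given floor area (thousands of sq ft) and investment ($ millions)."""
    return b0 + b1 * size + b2 * invest

fitted = predicted_jobs(size=10, invest=1)    # about 22.5 jobs
error = 25 - fitted                           # actual minus predicted, about 2.5 jobs
print(round(fitted, 1), round(error, 1))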
Recall that we used least squares as a descriptive tool to summarize bivariate or multivariate relationships. Equations seldom fit the data precisely. Thus, the predicted value Ŷ of the dependent variable differs from the actual value Y, resulting in a different error
e = Y - Ŷ
for each observation. The least-squares fit identifies the unique set of numbers b0, b1, b2, ..., bk that yields the smallest possible error sum of squares SS_E. We used SS_E to obtain the standard error of the estimate and R² measures of how well the regression equation fits the data. For example, both outputs report 14.73 for

¹ Data is derived from the Industrial Migration File, New York State Department of Economic Development, Albany, NY.
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.57697
R Square             0.3329
Adjusted R Square    0.30772
Standard Error       14.728
Observations         56

ANOVA
             df
Regression    2
Residual     53
Total        55

             Coefficients
Intercept     16.7223
size           0.17521
invest         4.02079
Figure 11.3

the standard error and R² = 33.3% in the IDC case. Earlier, we expected a fairly weak regression relationship because of the many other factors that can affect job creation. The small R² warns us that the equation only accounts for one-third of the variation in the jobs data. Moreover, the relatively large standard error, nearly 15 jobs, indicates that fitted values are subject to a margin of error of about 30 jobs.
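For readers working outside Minitab or Excel, the same fit can be reproduced with a general-purpose least-squares routine. The Python sketch below (using numpy) shows the mechanics on only the first five observations from Figure 11.1 to keep it short, so the coefficients and R² it prints will differ from the full 56-observation results quoted above; rerunning it with all of Figure 11.1 should reproduce b0 = 16.7, b1 = 0.175, b2 = 4.02, and R² = 33.3% up to rounding.

import numpy as np

# First five observations from Figure 11.1 (jobs, size, invest); the full sample has 56
jobs = np.array([2, 4, 5, 5, 5], dtype=float)
size = np.array([8.0, 5.0, 1.0, 19.5, 3.5])
invest = np.array([0.158, 0.150, 0.008, 0.059, 0.200])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(len(jobs)), size, invest])
b, *_ = np.linalg.lstsq(X, jobs, rcond=None)

# Error sum of squares and R-squared from the fitted values
errors = jobs - X @ b
ss_e = np.sum(errors ** 2)
ss_t = np.sum((jobs - jobs.mean()) ** 2)
r_squared = 1 - ss_e / ss_t
print(np.round(b, 3), round(r_squared, 3))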

Population and Sample Regression Equations
What if we want to use regression for statistical inference? IDCs are interested in predicting job creation in the population, not the sample. After all, the sample data already contain these actual job figures! In sample inference applications, the least-squares equation becomes the sample regression equation.
DEFINITION: A sample regression equation is the fitted equation for the sample data.
Ŷ = b0 + b1X1 + b2X2 + ... + bkXk

Instead, the sample regression is only a means to an end. We estimate the variable relationship
for the overall population to be the same as the sample regression equation.
In Chapter 8, we discovered that decisions must be based on sample data when access to the complete population is not practical. In our case study, most IDCs have small staffs and
industry surveys are notoriously time-consuming and labor intensive. Thus, a comprehensive
expansion survey would not be feasible. Fortunately, even small random samples can provide
highly informative inferences about the population parameters in a regression equation. We then
may use fitted sample-data equations to estimate parameters in the population regression
equation.
DEFINITION: The population regression equation is the relationship between dependent
variable Y and the k explanatory variables in the population
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
where β0, β1, β2, ..., βk are the population parameters and ε indicates the random disturbance for each observation, the variation not explained by the rest of the regression.
As we did previously with μ and σ, the Greek alphabet is used to represent population characteristics in regression. The letters β and ε are pronounced beta and epsilon. In Chapter 9, the parameters of interest, the population mean and proportion, summarized single

variables. By contrast, we may need many regression parameters β0, β1, β2, ..., βk to summarize a population relationship among many variables.
The population regression equation consists of two parts:
(1) a systematic portion β0 + β1X1 + ... + βkXk that accounts for variation in Y by a linear combination of the explanatory variables, and
(2) random disturbances, ε, the unaccounted variation in Y in each observation.
We'll have much more to say about random disturbances shortly.
Did you notice the three essential differences between a sample regression equation and its population counterpart? If you didn't, compare the two together for the industrial development regression.
jobs = β0 + β1 size + β2 invest + ε     population regression
predicted jobs = b0 + b1 size + b2 invest     sample regression
First, the b's represent least-squares estimators of the β parameters in the population relationship. The β parameters are unknown unless we have complete population data, so we must settle for estimates based on sample data. For example, our sample of 56 factory expansions produced estimates b0, b1, and b2 of 16.7, 0.175, and 4.02 for the fitted equation:
predicted jobs = 16.7 + 0.175 size + 4.02 invest
However, another sample would have given us a different equation. Secondly, the fitted equation finds predicted, not actual, job creation amounts. The third difference is the absence of the random disturbance ε in the fitted equation. These random disturbances are not observable. For each observation,
ε = jobs - β0 - β1 size - β2 invest
Thus, ε cannot be calculated from the population regression without first knowing the β values. However, the β's are unknown population parameters that we can only approximate at best by the b's in a sample regression.
Because of their similar notation, errors e in sample regressions are easily confused with the random disturbances ε in the population regression. However, the two differ because the b's in a sample regression are merely estimates of the β parameters. For example, the errors for the industrial development sample regression are:

e = jobs - predicted jobs
  = jobs - b0 - b1 size - b2 invest
These errors would equal the random disturbances only in the unlikely event that b0, b1, and b2 are the same as their population counterparts, β0, β1, and β2. Although sample regression errors will differ from the random disturbances, we will soon see how analyzing the patterns and distribution of e will yield important insights about ε.
distribution of e will yield important insights about .
The only exception would be a population regression containing zero random disturbances for all observations. Then no errors could occur either, because every sample observation would also perfectly fit the population equation. Therefore, all sample regressions would estimate the population relationship precisely. This precision is nearly achieved in the physical sciences because simple and precise physical laws of nature are observed using extremely accurate instruments. In business regressions, however, complex relationships require us to simplify most models, and measurement errors abound.
Sample regressions differ from population regressions in three respects: a sample regression contains b estimates instead of β parameters, predicted rather than actual values of the dependent variable, and no random disturbances ε. Analysis and plotting of errors is essential to understanding ε.

Sources of Random Disturbances: Modeling and Measurement Error
Adding random disturbances to the population regression converts an exact mathematical relationship into a statistical equation involving uncertainty, probability, and error. Since it is the statistical properties of regressions we investigate in this chapter, we first need to understand the reasons for including the ε term. For business applications, there are two fundamental sources of random disturbance: random modeling error and random measurement error.
How do we decide which variables to select for any given regression? Because the world
is complex, most multivariate relationships must be simplified by constructing a model.
DEFINITION: A model helps us make timely, low-cost, and manageable decisions by reducing
the complexity of a real-world problem to only its essential aspects.


As simplifications of reality, models are easier to understand and explain to a client or
boss. An equation with only a few variables has a compelling elegance, especially if most
variation in the dependent variable can be explained. Limited time, money, sample size, and experience also require us to simplify population regression equations by retaining only the most important explanatory variables in the model. In any case, human behavior and the business institutions that people create are much too complex to be perfectly explained by even a thousand variables. Of course, "most important" is a relative term. If resources, time, and
information are plentiful or the study is vital enough, a regression may contain dozens or even
hundreds of explanatory variables. For example, to decide where to locate new franchises, gas
station chains use complex regression models containing 80 or more explanatory variables that
describe highway flows, local demographics, site characteristics, and market competition. This
regression is still a model because it omits hundreds of less important variables.
A population regression equation is a statistical model because only those explanatory
variables that we consider most important can be included.
Thus, business models cannot be expected to fit population data perfectly because
workable models must simplify the infinite complexity of human behavior, corporate
bureaucracies, and government regulations. Though relatively minor individually, these omitted
variables together can account for a sizable portion of the total variation in Y. For example, a
complex set of factors affect the profitability of a new product or the productivity in a workplace
setting. Even after including several explanatory variables, we should not expect our model to
explain profits or productivity very well. What the model fails to explain from its systematic
portion is therefore relegated to the ε term. These modeling errors are especially large for cross-sectional data and are the primary cause of the poor fits discussed in Chapter 4.²

Even if a regression could incorporate all real-world complexities, measurement errors
would still require us to add a random disturbance term. Precise measurements may be difficult
due to recording and rounding errors in variables such as employee expense account information.
An equally important source of error occurs if the ideal variable cannot be measured. In
Chapter 1, we emphasized the importance of basing business decisions on the most appropriate
variables. However, when proprietary business data is unavailable, we often must settle for less
than perfect measures as regression variables. Other times, there is no perfect measure. If a
model contains a variable that measures promotional efforts, which types of expenditures do we
include? A narrow definition may only include ads in the print and broadcast media while
broader definitions may add expenses for free samples and coupon mailings, wining and dining
distributors and clients, or sponsoring celebrity golf tournaments. Even if we decide which


² Even if we include all relevant variables, the linear form of regression still may introduce a source of error to our model. In Chapter 13, we will examine nonlinear equations and other aspects of modeling besides variable selection.

definition is best, businesses may report another measure, reporting practices across an industry may also vary, and human behavior can be hard to quantify in the first place.
Where no data exist for an important variable, it is usually preferable to use an imperfect substitute that proxies for the ideal variable.³ For example, we may construct a measure of customer dissatisfaction at a retail clothing chain by summing the complaints registered in person, by phone, and by mail. Use of a proxy variable contributes to measurement sources of random disturbance in a regression model.
To acknowledge these modeling and measurement limitations, we add ε to all regression models. What cannot be explained by the systematic portion of the regression equation is relegated to the random disturbance term.
The random disturbance ε in population regression equations is due to modeling error, the omission of less important explanatory variables, and measurement error from using difficult-to-measure and proxy variables.
Although size of the expansion and amount of investment are the two most important
requirements for job creation, our industrial development case study omits variables that could
create modeling error. For example, knowing how automated the factory is helps us determine
the number of workers hired. In addition, fewer new jobs will be created if the plant is already
overstaffed. Measurement errors are also present in our IDC regression. Plant and capital
equipment investment expenditure may be difficult to value accurately, especially if machines
are reconditioned or transferred from other plants. Size of the expansion is imperfectly proxied
for by floor area, another source of measurement error. These and other modeling and
measurement errors wind up in the disturbance term.
Next, we investigate the distributional properties of least-squares estimators that apply
under a particular set of conditions. In ensuing sections, we use these distributional properties to
test the significance of regression models and individual variables and to estimate confidence
intervals for the dependent variable and variable coefficients.

Regression Assumptions and Useful Inference Properties
In Chapter 4, we introduced the principles behind least-squares regression. Multivariate
data was fit to an equation by finding the unique combination of intercept and slope coefficients
that gives the smallest error sum of squares. The resulting regression equation fits the data


³ In Chapter 13, we shall also discover that omission of important variables may cause biased estimation of regression coefficients.

best, if by "best" we mean minimizing SS_E. But who is to say that an equation based on some other data-fitting rule, such as minimizing the sum of either the absolute errors |e| or the square roots of the errors, might not be better?
More fundamentally, least-squares regression suffers from the same limitations that we
saw with univariate descriptive statistics such as the mean. How closely will a particular sample
regression mirror the relationships in the population? For our industrial development case, our
sample regression was
predicted jobs = 16.7 + 0.175 size + 4.02 invest
with an R² of 33.3%. How can we tell if the coefficients of size or invest would not be much larger (or smaller) or have negative signs for a regression on all plant expansions, not just a random sample of them? Perhaps the population data would produce a much better fit than 33.3% or, alternatively, yield no fit at all. Can we establish a margin of error around the 22.5 jobs the sample regression predicts for a 10,000 square-foot, $1 million factory expansion? Is sample regression evidence sufficiently persuasive to conclude that the size of an expansion has a direct and statistically significant impact on job creation? Before you finish this chapter, you will be able to answer all these questions and more.
From your study of Chapter 8, you probably recognized that the questions in the
previous paragraph involve interval estimation and hypothesis testing. Recall that to construct
intervals or conduct tests, we first must identify the sampling distributions of meaningful
statistics derived from the sample data. Moreover, these distributions applied only under certain
conditions, such as normal populations, independent random sampling, or sufficiently large
samples. Without understanding more about the sampling distribution properties of least-squares
estimators, we cannot use sample regressions to form probability-based inferences about
population relationships.
We have already encountered several inference properties of estimators that will be
useful for working with least-squares estimators. One of these properties is unbiasedness. Recall
from Chapter 8 that an unbiased estimator has an expected value equal to the parameter being
estimated.
If the least-squares estimators b0, b1, ..., bk in the sample regression are unbiased,
E(b0) = β0, E(b1) = β1, ..., E(bk) = βk.
By contrast, biased estimators would tend to be too large or too small. A sample regression containing these estimators is inclined to overstate (or understate) predictions and marginal effects of explanatory variables.
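Unbiasedness is easy to see in a small simulation. The Python sketch below is purely illustrative: it invents a population regression with known parameters (β0 = 5, β1 = 2, and σ = 3 are arbitrary choices, not values from the text), draws many random samples with normal disturbances, and averages the least-squares slope estimates. The average settles near the true β1 even though individual estimates scatter around it.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 5.0, 2.0, 3.0      # made-up population parameters
x = np.linspace(0, 10, 30)               # fixed explanatory variable values

slopes = []
for _ in range(2000):
    eps = rng.normal(0.0, sigma, size=x.size)   # random disturbances
    y = beta0 + beta1 * x + eps                 # population regression plus noise
    b1, b0 = np.polyfit(x, y, 1)                # least-squares fit for this sample
    slopes.append(b1)

print(round(float(np.mean(slopes)), 2))   # close to the true slope of 2.0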

However, unbiasedness merely assures us that the mean of a very large number of sample estimates will equal the population parameter. Although unbiased estimators are appealing for being correct on average over many repeated samples, estimates are still wrong for nearly every particular sample. Most regression studies fit data from only one random sample, and the b values in a sample regression may be far from the parameters being estimated. Thus, we not only want unbiased estimators. We usually want least-squares regression estimators also to have the smallest possible variability around the true parameter values. In previous chapters, we often measured variability by the standard deviation. Therefore, we want estimators to have the smallest standard deviation from the true population parameters. We may combine these two sampling properties into the concept of efficiency.⁴
DEFINITION: Minimum standard deviation among all unbiased estimators is called efficiency, and estimators with this property are efficient.
Notice that efficiency refers only to comparisons among unbiased estimators. If we are willing to accept estimators that are biased, it might be possible to secure still smaller standard deviations. Recall that the sample median is usually biased while the sample mean is not. However, the variability of these two estimators often favors the sample median for highly skewed business data. In Chapter 13 we will discuss a method of biased estimation that may be preferable when regression estimates are extremely volatile.
Finally, we need one additional property for our least-squares estimators. Efficiency assures us of sample regression equations with intercept and slope coefficients that are unbiased and have the least variability from the correct parameters. As a result, we want to use efficient
regression estimators to obtain our point estimates. However, we still have no information to
generate interval estimates or conduct hypothesis testing. To make use of statistical inference
methods fully, we need to know the sampling distribution for the most important regression
statistics. Furthermore, inference is not feasible if we must start from scratch every time we do
another regression study. Thus, we want the sampling distributions to be well-known, easily
accessed, and very few in number. This was the case for univariate inferences of the mean.
Under certain, commonly occurring conditions, we could rely on the t and normal distributions
for exact or approximate estimates and test statistics. In the previous chapter, we added the F
distribution to our mix for analysis of variance model tests. As we will discover shortly, the
normal, t, and F will be the relevant distributions for regression inference.



⁴ Technically, the least-squares estimators are efficient only if the explanatory variables are fixed rather than random variables. The X's are fixed if data is generated by controlled experiments. For regressions on observational data, efficiency may be obtained by considering regression estimates as conditional on the values of the explanatory variables observed in the sample data.

It is generally preferable to conduct inference under conditions where estimators are
efficient and statistics are restricted to very few, well-known sampling distributions.
What conditions are needed for least-squares estimators to be efficient and for regression statistics to have familiar and dependable sampling distributions such as t and F? To obtain these desirable properties in regression, we must make several assumptions. We will see some parallels to univariate assumptions. However, because multivariate analysis furnishes us with more information than univariate analysis, the distributional properties also are more complex. To accommodate this complexity, a longer list of assumptions must be invoked and different sampling distributions must be applied to each inferential task. In mathematical statistics, this efficiency property is the result of the Gauss-Markov theorem.⁵ One of the simplest ways to present this theorem and the accompanying assumption list is the following:

The Gauss-Markov Theorem: Regression estimators are efficient if the following assumptions are true of the population regression equation for all observations:
(1) the β parameters each have constant values
(2) the expected value of the random disturbances ε is zero
(3) the explanatory variables are not correlated with ε
(4) the standard deviation of the random disturbances, σε, is constant
(5) the random disturbances for different observations are uncorrelated
Before we explore the rationale for and limitations of this imposing set of assumptions, let's discuss some important implications of the Gauss-Markov theorem. First, if σb1 is the standard deviation for the least-squares estimator b1, σb2 for b2, and so forth, efficiency implies that least-squares estimators have the smallest σb's among all unbiased estimators of the β's.
So far, we have discussed only the sampling properties of the least-squares coefficients. What is the implication of the Gauss-Markov theorem for least-squares estimates of the dependent variable Y? In Chapter 8, we found probability density functions to match sampling distributions for univariate estimators such as Ȳ. But least squares involves bivariate and


⁵ Like the other important theorem introduced in Chapter 8, the central limit theorem, the proof of the Gauss-Markov theorem involves mathematics beyond the scope of this text.

multivariate data, so the appropriate estimates represent conditional probabilities. Recall from Chapter 6 that the conditional probability of Y = 2 given that X = 5 is denoted by P(Y = 2 | X = 5). In regression, the dependent variable is conditional on all k explanatory variables in the equation, so the relevant parameter is the conditional mean of Y.

DEFINITION: The conditional mean of the dependent variable, μY | X1, X2, ..., Xk, is the population mean of Y for a particular set of explanatory variable values in the regression.

Because dependent variables in regression are generally continuous random variables, the
probability of Y is represented by a conditional probability density function rather than a single
pdf. Despite the introduction of conditional probabilities and density functions, least-squares
predictions of the dependent variable will also be unbiased.
Under the assumptions of the Gauss-Markov theorem, Ŷ estimated by a least-squares equation is an efficient estimator of the conditional mean of Y. Thus,
E(Ŷ | X1, X2, ..., Xk) = μY | X1, X2, ..., Xk


For example, a model of production quality control at an assembly line might be
Rejects = β0 + β1 Speed + β2 Service + ε
Rejects, the number of audio speaker assemblies discarded per hour due to poor quality, is modeled as a linear function of the speed of the assembly line (in output per hour) and the number of days since the most recent servicing of the assembly equipment. Since b0, b1, and b2 are efficient estimators of their β counterparts, μRejects | Speed, Service may be estimated from the least-squares fit of data collected at randomly sampled production hours. Suppose one such sample produces the following regression equation:
predicted Rejects = -4.0 + 0.05 Speed + 0.3 Service
We estimate μY | Speed = 60, Service = 10 by
predicted Rejects = -4.0 + (0.05)(60) + (0.3)(10) = 2.0.


Thus, for this sample, use of an efficient estimator results in a conditional sample mean of
two assemblies rejected per hour.
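To verify the arithmetic, here is a minimal Python sketch (our own illustration; the function name predicted_rejects is hypothetical, and the coefficients are simply the sample values quoted above):

# Evaluate the fitted quality-control equation at Speed = 60 and Service = 10.
def predicted_rejects(speed, service):
    return -4.0 + 0.05 * speed + 0.3 * service

print(predicted_rejects(60, 10))   # prints 2.0 rejects per hour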
Unfortunately, these five assumptions alone don't result in the sampling distributions we want for least-squares estimators and test statistics. For inferences from the sample mean, we assumed in Chapter 9 a normally distributed population or a sample size large enough to invoke the central limit theorem. Either assumption led to a normally distributed standard error of the mean when σ is known, or to a t distribution whenever σ is unknown. The Z or t distribution was then used to construct confidence intervals or conduct tests. For testing and estimating in the linear regression, we again assume normality or a sufficiently large sample.⁶
To perform testing and estimation, we add a sixth assumption to regression:
(6) the random disturbances are distributed normally
Assuming a normal distribution allows us again to use the t statistic for estimation and most testing (and the more general F distribution to test the entire model for significance). If the random disturbances are normal and the other five assumptions are valid, we can specify the conditional distribution of Y.
If all six assumptions hold, the conditional distribution of the dependent variable Y for any given values of the explanatory variables is normally distributed with a mean of β0 + β1 x1 + ... + βk xk and constant standard deviation σ_ε.
Figure 11.4 provides a graphical portrayal of this conditional distribution for the simple regression case.
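A small Python simulation sketch makes this concrete (the values β0 = 2, β1 = 0.5, and σ_ε = 1 are our own illustrative choices, not from the text): for a fixed x, repeated draws of Y cluster normally around the regression line.

import numpy as np

# Simulate Y = beta0 + beta1*x + eps at a fixed x, where eps is normal with sd sigma_eps.
rng = np.random.default_rng(0)
beta0, beta1, sigma_eps = 2.0, 0.5, 1.0
x = 10.0
y_given_x = beta0 + beta1 * x + rng.normal(0.0, sigma_eps, size=10_000)

print(y_given_x.mean())   # close to the conditional mean 2 + 0.5*10 = 7
print(y_given_x.std())    # close to the constant standard deviation 1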



⁶ Without normality, nonparametric methods (discussed in Chapter 12) may be applied to estimate or test regressions.

In what sense can we say that time series data give us a random sample for making
statistical inferences? On the surface, a series of the fifty most-recent weekly sales figures or the
last five years of monthly unemployment rates looks nothing like a sample, and especially not a
random sample. Yet, inferences may be applied to serial observations if we consider them as a
sequence of experimental outcomes. Each experiment randomly generates one out of all
outcomes that could have occurred under circumstances prevailing at the time. The experimental
conditions vary from one period to the next as the explanatory variables change their values.
However, all observations are sampled from the same population as long as no fundamental
shifts occur in the business environment to alter underlying variable relationships. Extending
this argument, forecasting inferences require us to assume that model relationships continue to
apply in the future. Section 5 introduces the exciting field of regression forecasting methods.




Making Sense of the Regression Assumptions
The Gauss-Markov properties of least-squares regression rely upon five assumptions.
We can learn much by inspecting the assumptions needed to make this theorem valid. The sixth
assumption also merits additional clarification for its importance to testing and interval
estimation. With each assumption, we will attempt to answer the following questions: what is its
meaning, under what conditions is it likely to be approximately true, what causes it to fail, what
happens if it is not true, and what can be done to prevent it from failing? In the final section of
this chapter, we will show how to detect failure of certain assumptions by examining residual
plots. Chapter 13 also includes methods for correcting or preventing breakdowns in these
assumptions.
We should always approach statistical analysis like we do a bottle of medicine. The
warnings on the bottle describe conditions under which undesirable or even fatal side-effects
may occur. Do not be misled by easy-to-use computer software, voluminous output, and the
popularity of regression in modern business applications. The ability of regression analysis to
shed light on decision making problems rests on properties derived from the six assumptions.
Although these assumptions are approximately true for a broad range of business situations,
several notable exceptions deserve our careful attention.

(1) Constant Values for the Parameters: The first assumption, constant parameters, is necessary so that the same systematic expression applies to the entire population. In our industrial development case example, this assumption means that β1 and β2 are constant for all types of factories in northern New York. Suppose instead that each million dollars of investment creates fewer jobs on average for older factories where most investment goes to replace outmoded equipment. Then, β2 may be smaller, say 2, for the older factories and perhaps fairly large, perhaps 6, for newer factories. A regression fit to a sample of both types of factories might estimate β2 somewhere in between, perhaps 4. This slope coefficient doesn't describe the effect of investment on job creation for either older or newer factories.
The assumption of constant parameters would also fail if the relationship with an explanatory variable is not linear. For example, advertising may be related to profits, but that relationship may be best described by a quadratic function. In many product markets, there is little impact on profits from small advertising budgets because they fall below a threshold where consumers remember the ads. At the other extreme, excessive advertising among rivals such as beer or cola companies tends to be self-canceling and yields only minimal profits. By contrast, the most profitable companies may be the ones that advertise at levels between the two extremes. The slope coefficient is not constant: it starts off positive as advertising adds more and more to profits, but it eventually flattens and may even turn negative if excessive ads become wasteful.

There are two general methods to ensure constant parameters. The first is to adopt
models that adequately describe the processes that determine the dependent variable. To explain
profits by advertising, we may need a quadratic regression model. The job creation model may
need to include a categorical variable distinguishing between older and newer factories.
Chapter 13 presents modeling methods to estimate nonlinear regressions and incorporate
categorical variables. But there is often an alternative to using these modeling methods. If the
parameters are not constant for the population under consideration, perhaps we can restrict
analysis to portions of the population where the parameters are roughly constant. Rather than
investigate all factories, we might limit ourselves to only newer factories. Instead of
investigating the relationship for all advertising levels, we may only sample companies with low
to moderate advertising-to-sales ratios. The more similar the factories are, the more likely that the same job creation relationship will apply to all factories studied. And a slope coefficient is
more likely to remain constant over a narrower range of advertising levels. Similarly, regression
on time series data will obey the constant parameter assumption if the time period sampled is not
so lengthy as to include radically different industrial or market environments.
A carefully constructed model or narrowly-defined population can ensure constant
regression parameters.
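As a rough illustration of the first remedy, the following Python sketch (using the statsmodels package on synthetic data of our own making) fits the kind of quadratic profit-advertising model described above simply by including advertising squared as a second explanatory variable:

import numpy as np
import statsmodels.api as sm

# Synthetic data: profit rises with advertising up to a point, then falls off.
rng = np.random.default_rng(1)
adv = rng.uniform(0, 10, size=100)
profit = 2 + 3 * adv - 0.25 * adv**2 + rng.normal(0, 1, size=100)

# Quadratic model: both adv and adv squared enter as explanatory variables.
X = sm.add_constant(np.column_stack([adv, adv**2]))
print(sm.OLS(profit, X).fit().params)   # estimates near the true 2, 3, and -0.25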
(2) Random Disturbances Have Expected Value of Zero: The remaining four assumptions all relate to the random disturbances ε. Assumption 2 implies that the mean of these random disturbances is zero. This requirement ensures that the systematic portion of the population regression is correct on average. Otherwise, the least-squares equation will produce biased estimates of the dependent variable. Fortunately, if this assumption is violated, the intercept b0 in the fitted regression will absorb any positive (or negative) effect from measurement errors, omitted minor variables, or most other sources of bias. As a result, the variable coefficients b1, b2, ..., bk will retain their unbiased properties. Although both Excel and Minitab allow us to fit regressions without an intercept (see Exercises), concern for Assumption 2 makes it advisable to include an intercept β0 in all regressions. In simple regression, for example, omission of β0 forces the regression line to go through the origin, resulting in a misleading estimate of the slope. Recall from Chapter 4 that the intercept often has no useful interpretation anyway if the sample data do not include the origin or if the reality of the case prevents all the X variables from equaling zero. Even if Assumption 2 is valid and the intercept does not belong, the regression should reflect that fact by estimating β0 at or near zero.
(3) Explanatory Variables Uncorrelated with ε: If the explanatory variables have fixed values, Assumption 3 clearly holds because constants cannot be correlated with ε, a random variable. If the data result from a controlled experiment, explanatory variables are preset at designed levels. In many business decisions, the explanatory variables are also directly controlled. Employment, investment, product specifications, and price are established by the firm and not directly affected by random processes. In other situations, however, some or all of the explanatory variables are not under the control of the business. Interest rates in financial models, foreign exchange rates in export trading models, and weather conditions in construction industry models are examples of uncontrolled explanatory variables. Then, omission of important variables from the regression model is the most common cause of correlation between an explanatory variable and the random disturbances.
(4) Constant Standard Deviation of the Random Disturbances: The first three assumptions are all we need for unbiasedness. Assumptions 4 and 5 are required only to ensure efficiency. Under the fourth assumption, Y will have identical dispersion around the systematic portion of the regression equation. As we will discuss later in this chapter, least-squares equations are highly sensitive to large residuals e. If the magnitude of ε tends to be larger for extreme values of the explanatory variables, sample observations collected at those values will tend to influence the least-squares outcome greatly. Depending on which observations get included in a particular sample, the least-squares coefficients may be well above or well below their corresponding population parameter. If σ_ε is not constant, least-squares estimators are not efficient, although they retain their unbiasedness. Least-squares estimators will not have the smallest possible variability from one sample to the next if Assumption 4 does not hold.
Nonconstant standard deviation arises most often in business analyses of wide-ranging cross sectional data. The validity of Assumption 4 often may be restored by redefining the variables to create a smaller range of their values in the population. In the industrial development case study, for example, factory investments range from under $10,000 to as much as $10 million. Logs would shrink this range, because log10 of 10,000 is 4 and log10 of 10 million is only 7! Methods of model transformation will be covered in Chapter 13. An even more common approach used in business research is to sample from a narrower range of the variables in the population. One reason we analyzed only Division I-A college football in the Chapter 4 case study was to restrict data to older football programs with large stadiums and budgets. Later in this chapter we will examine error plots that shed light on these issues.
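A two-line Python check of the log compression just mentioned (our own illustration):

import numpy as np

# Log base 10 turns the $10,000-to-$10,000,000 investment range into 4 through 7.
invest = np.array([10_000, 100_000, 1_000_000, 10_000_000])
print(np.log10(invest))   # [4. 5. 6. 7.]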
(5) Uncorrelated Random Disturbances: Assumption 5 refers primarily to time series data, where observations are ordered by the time they occurred. In time series regression, random disturbances in nearby time periods are often correlated. Because it occurs primarily with time series data, this correlation is called serial correlation. In the most common case of first-order autocorrelation, only random disturbances in adjacent time periods (εt and εt-1) are correlated. Since these random disturbances are not statistically independent, least-squares estimators are unbiased but no longer efficient. This inefficiency is caused by the inability of ε to fully reflect the conditions at one period of time because of the influence of random disturbances from the preceding period.

A time trend model of the volume of mail provides a simple example. Although the general upward trend in mail volume has been fairly constant, random disturbances from the trend leave aftereffects on the following periods' volumes. When volume falls below the trend line (as in 1959 and again in 1975) or above the trend (in 1946, 1969, and 1980), volume for the next few periods remains on the same side of the line as well.
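The sketch below (Python; the autocorrelation parameter 0.7 is an illustrative choice, not estimated from the mail data) simulates first-order autocorrelated disturbances of the kind just described, where each disturbance carries over part of the preceding period's disturbance:

import numpy as np

# First-order autocorrelation: eps_t = rho * eps_(t-1) + u_t, with u_t independent noise.
rng = np.random.default_rng(2)
rho, n = 0.7, 500
u = rng.normal(0, 1, size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + u[t]

# Adjacent disturbances are correlated (close to rho), violating Assumption 5.
print(np.corrcoef(eps[1:], eps[:-1])[0, 1])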
(6) Normally Distributed Random Disturbances: As with the univariate tests and estimates for the mean, we may arrive at a normal distribution in one of two ways. One way is to assume approximate normality for the random variable being sampled.
For sufficiently large samples, however, the assumption of normality in ε will usually follow automatically even if none of the variables in the regression have normal distributions. From our earlier discussion of the sources of ε, random disturbances arise from factors such as omitting less important explanatory variables and from measurement limitations on the variables we do include. If the explanatory variables result in one or more random measurement errors, the sum of these errors would be captured within the random disturbances. Omission of several minor variables causes ε to include the sum of their effects as well. As we saw in the earlier discussion of Assumption 2, E(ε) = 0 derives from the offsetting contributions to ε from these sources of error.
If the sources of the random disturbances are also independent and identically distributed, their sum will be approximately normal. This property of sums of random variables is a logical extension of the central limit theorem. Recall that this theorem applied to the mean, which is merely the sum divided by n. Thus, the central limit theorem should also extend to these sums.
The assumption of normally distributed ε is approximately true for larger samples if all sources of error have similar distributions and are independent.
If the sample is large enough, we may also examine the least-squares errors, e. For most regression applications, such as those in Chapter 4, the errors are distributed approximately normally even when the dependent and explanatory variables are highly skewed or have uniform distributions. If these errors are approximately normal, then the unmeasurable ε values are also likely to be distributed normally. Thus, a histogram plot of regression residual errors that is approximately bell shaped indicates that the normality assumption is valid. These results lend strong support for the testing and estimation methods used in this chapter.
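One way to carry out that residual check is sketched below in Python (statsmodels and matplotlib, applied to synthetic data we generate ourselves; it is an illustration, not the text's case data):

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic regression data: a skewed explanatory variable but normal disturbances.
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200)
y = 5 + 1.5 * x + rng.normal(0, 2, size=200)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# A roughly bell-shaped histogram of the residuals supports the normality assumption.
plt.hist(fit.resid, bins=20)
plt.xlabel("residual")
plt.ylabel("frequency")
plt.show()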



Multicollinearity and Misspecification
Before we move on to regression testing and estimation, there are two final statistical concepts left to tackle: multicollinearity and model specification bias. This pair of statistical concerns may be the cause of more badly conducted regression analyses than any of the other issues covered in this section. So pay close attention here, or you might commit those mistakes too.
Recall from Chapter 4 how we introduced you to the index of linear association called
correlation. We pointed out that regression analysis is a more sophisticated and powerful method
of linear analysis. Yet correlation can still help us anticipate and isolate a special problem with
regression estimation known as multicollinearity.
DEFINITION: A regression may suffer from multicollinearity if one (or more) explanatory
variables can be approximately expressed as a linear function of the other explanatory variables.
The best way to understand the causes of multicollinearity and potential harm it may
cause is to first discuss the extreme case of perfect collinearity.
DEFINITION: Perfect collinearity occurs if one (or more) explanatory variables can be
expressed exactly as a linear function of the other explanatory variables.
The simplest case of perfect collinearity occurs between two explanatory variables. For
example, consider the following fitted regression on a sample of college students:
Predicted-Wt = -380 + 96 HtFt
where Wt is weight in pounds and HtFt is height in feet. Thus, the equation predicts that a 5 foot
student would weigh 100 pounds because
Predicted-Wt = -380 + 96 (5) = 100
To set up our case of perfect collinearity, notice that if we converted the HtFt data into height in
inches (by multiplying the HtFt data by 12), least squares would have fit the following equation:
Wt = -380 + 8 HtInch
where HtInch is height in inches. Notice that the coefficient of HtInch has to be one-twelfth as
large (96/12 = 8) because the data on height are 12 times larger. Thus, this equation generates the
identical predictions as the one based on HtFt. For example, 5 feet is 5x12 = 60 inches, so
Wt = -380 + 8 (60) = 100

But suppose an analyst is careless and accidentally includes both measures of height in the regression. The result would be perfect collinearity, because the two explanatory variables are linear combinations of one another (e.g., HtInch = 0 + 12 HtFt). If he tried to run such a regression, the computer would throw out one of the height variables because it would be unable to calculate a unique least-squares fitted equation. The perfect collinearity creates not one, but an infinite number of equivalent least-squares regression equations!
To observe what these equations look like, begin with the two we already know:
Wt = -380 + 8 HtInch + 0 HtFt
Wt = -380 + 0 HtInch + 96 HtFt
These two trivial versions merely involve a zero coefficient for one of the two height measures, so the prediction is generated entirely from one variable. But here's a third equivalent equation where neither coefficient is zero:
Wt = -380 + 4 HtInch + 48 HtFt
Notice how each variable contributes half the prediction, but the predictions are still the same. Instead, what if HtInch contributes twice the amount of the prediction but HtFt offsets it?
Wt = -380 + 16 HtInch - 96 HtFt
By the same principle, we get one full prediction from 11 minus 10 predictions in this equation:
Wt = -380 + 88 HtInch - 960 HtFt
Thus, there are an infinite number of equations that generate the same least-squares predictions. Worse yet, the coefficients of each variable can be any magnitude as long as the other variable's coefficient is a sufficiently large offsetting amount. That is, the individual coefficients are arbitrary and have infinite variability (i.e., standard deviations) under perfect collinearity.
Why did this regression indeterminacy happen? Remember that least-squares finds the unique combination of variable coefficients that minimizes the error sum of squares in the linear regression equation. However, perfect collinearity means that one explanatory variable is a linear combination of the other explanatory variables. Therefore, one variable contains information already present in the equation. This is regression overkill: redundant information in the form of extra, unnecessary variables. The solution is to throw out one (or more) of the collinear variables, thus eliminating perfect collinearity; that is what the computer automatically does because our careless analyst forgot to. Height in feet and inches contains the same information, so only one of the two can be included.
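The indeterminacy is easy to see numerically. In this Python sketch (hypothetical student heights of our own invention), the design matrix containing both height measures has deficient rank, so least squares has no unique solution:

import numpy as np

# Hypothetical heights in feet; height in inches is an exact multiple of the same data.
ht_ft = np.array([5.0, 5.5, 6.0, 5.25, 5.75])
ht_inch = 12 * ht_ft

# Design matrix: intercept column, HtFt, and HtInch.
X = np.column_stack([np.ones_like(ht_ft), ht_ft, ht_inch])

# Rank 2 instead of 3: one column is redundant, so unique coefficients cannot be found.
print(np.linalg.matrix_rank(X))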

A glance at the correlation matrix would also have alerted the analyst to the perfect correlation. HtInch and HtFt must be perfectly correlated because they are linear combinations of each other, and a correlation of +1 (or -1) among any pair of explanatory variables causes perfect collinearity. A similar situation would have occurred if the analyst included in the regression two variables that are not simple multiples, such as temperature in both Fahrenheit (F) and Celsius (C), because they are still linear combinations of each other (i.e., F = 32 + 1.8 C).¹

Regressions with perfect collinearity yield an infinite number of solutions, so variables must be removed until no redundant information remains. Perfect correlation between any two explanatory variables results in one of the two being dropped from the regression.

¹ Perfect collinearity can also involve more complicated linear combinations among the explanatory variables that may not be revealed by correlation statistics. For example, net worth = assets - debts, so including all three as explanatory variables involves a linear combination and hence perfect collinearity among 3 rather than 2 variables. However, correlations will not be perfect between any two of these variables.

Ordinary Multicollinearity
We just learned that least squares cannot deal with perfect collinearity, but fortunately this problem is easily detected and resolved in regression programs like Minitab and Excel. Not so, however, if multicollinearity is not perfect, such as when two or more regressors are highly correlated. As with perfect collinearity, ordinary multicollinearity involves regressors so correlated that least squares cannot disentangle which independent variable explains what. Because the variables provide approximately the same information, multicollinearity problems can be the same as those of perfect collinearity we observed in the weight-height case: multicollinearity can cause arbitrary, volatile, and unreliable regression coefficients, resulting in inflated standard errors for the estimated regression coefficients, reduced t-ratios, and high p-values. If these effects of multicollinearity are strong enough, the consequence will be that important variables may not test significant!
Thankfully, if the explanatory relationships in the regression data are strong enough, significance tests may still survive even high correlations among independent variables. Moreover, multicollinearity cannot harm other objectives of regression analysis such as the F-test of the model or R², predictions and forecasts, or even significance tests of independent variables in the model that are not correlated.
As every physician knows, a particular symptom may be associated with more than one possible explanation. For example, fever may indicate you have a cold or an infection. So it is with multicollinearity, whose symptom is nonsignificant independent variables. But we know there is a simpler, more obvious explanation for lack of significance: the variables tested don't matter! How can we discriminate between the two possible explanations of the same symptom? If explanatory variables that are correlated with one another do not test significant, we cannot determine whether or not damaging multicollinearity was the culprit. However, quite often these correlations are not high, allowing us to disregard anyone trying to rationalize with, "well, it would have been significant were it not for multicollinearity." That excuse holds no more water than that of a student who cuts class all term and then tries to explain away his low grade by complaining about how unclear the lectures were.

Multicollinearity is damaging only if it prevents independent variables from testing significant. But if correlations among independent variables are low, lack of significance cannot be due to multicollinearity. And multicollinearity cannot adversely affect the regression fit and predictions, nor the significance tests of the overall model and of uncorrelated explanatory variables.
What can cause multicollinearity problems in the first place? High correlations among the independent variables are always a possibility when the data are not the result of a controlled experiment like those discussed in Chapter 10. With the secondary data typically available for regression analysis, the business world and society are busy generating data from their own experiments. For example, market forces such as product demand, income trends, interest rates to finance purchases, and many other factors outside a business analyst's control will affect monthly auto sales. If some of these factors move together over the period under study, the result is high correlations. For example, falling auto prices were highly correlated in the 2007-2008 period with a simultaneous decline in interest rates and incomes. A regression of auto sales on these three variables might obtain a high R² fit without one, two, or perhaps all three of these independent variables testing significant because of multicollinearity. Because the individual effects on sales of price, interest, and income could not be disentangled, the model tests significant, but which variable or variables did the heavy lifting could not be discerned.
The other major reason for multicollinearity is a poor choice of variable definitions in the regression model. By giving a little thought to these variable definitions, we can prevent a lot of unnecessary correlation that could have led to multicollinearity. For instance, suppose our model to explain variation in police force per capita in a sample of cities includes variables that measure total crime, violent crime, and population. If we define the first two of these variables as their totals (number of crimes and number of violent crimes), we are inviting high correlation among all three independent variables. More populated cities naturally have more crimes, and a certain portion of these will be violent crimes. But it is easy to redefine these variables so that the risks of multicollinearity virtually disappear. Simply replace number of crimes by the crime rate (say, per 10,000 population). By dividing by population, we obtain a crime rate measure that is no longer naturally correlated with population. Similarly, we disengage correlations involving violent crime by choosing the variable Percent of Violent Crime, i.e., dividing the number of violent crimes by total crimes.
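A quick Python simulation (entirely synthetic city data of our own making) illustrates how the rescaling removes the built-in correlation:

import numpy as np

# Synthetic cities: crime counts grow roughly in proportion to population.
rng = np.random.default_rng(4)
pop = rng.uniform(50_000, 2_000_000, size=200)
crimes = 0.03 * pop * rng.normal(1.0, 0.2, size=200)

crime_rate = crimes / pop * 10_000   # crimes per 10,000 residents

print(np.corrcoef(pop, crimes)[0, 1])       # high: crime totals track population
print(np.corrcoef(pop, crime_rate)[0, 1])   # near zero: the rate does not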

Multicollinearity is common in time series data because variables tend to move together as they
are affected by common forces such as inflation, population growth, and technology. We may
often prevent multicollinearity by carefully defining independent variables to reduce correlation.
What we have discussed so far should ease most of your concerns about multicollinearity.
Unfortunately, business analysts are too often so fearful of multicollinearity that they figuratively
leap from the frying pan into the fire. They swap away the mere possibility of damaging
multicollinearity in return for dangerous model specification bias, a very poor tradeoff indeed!
DEFINITION: Misspecification results from removing important independent variables from a model. Specification bias is the biased estimation of regression coefficients resulting from omitting important independent variables, a violation of Assumption 3. This bias may raise or lower estimated coefficient values, change their signs, or reverse the results of significance tests.
The bias occurs whenever an omitted variable is correlated with, and therefore proxies for, variables retained in the model. When least squares searches for the omitted important variable, it instead finds its proxy and awards the proxy's coefficient with the missing variable's effect. Thus, misspecified model coefficients are prone to bias because they combine the net effect of their own impact on the dependent variable with the effect of any variables they proxy for.
For example, suppose we had omitted the number of crimes variable from the police force model in order to avoid multicollinearity. Since number of crimes is highly correlated with population, the population variable proxies for the number-of-crimes information missing from the regression. Now suppose that population doesn't really affect police force per capita, i.e., big, small, and medium cities have a similar number of cops per capita, and that police force size is instead driven by crime rates. Population may nevertheless test significant because its data resemble crime totals. Thus, model misspecification caused wrong significance results.
But what if we were fortunate, and no bias occurred? Perhaps the omitted variable was not related to police force size. Or maybe we removed several important variables from the model, but their effects canceled each other out, leaving unbiased coefficient estimates. Ironically, being right for the wrong reason is of little benefit to us in such cases. Because the model omitted crime, an obviously important variable, few would trust your test results and estimates no matter how correct they happened to be. In statistics, being right is of little help unless you obtain your results by proper analytical methods as well.

Never omit a variable that clearly belongs in the model. If you do, you risk specification bias. Even if no bias occurs, the results cannot be trusted because the model was misspecified.
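A simulation sketch in Python (synthetic data loosely patterned on the police example; every number is invented) shows the proxy effect: when the crime variable is omitted, the population coefficient absorbs its influence.

import numpy as np
import statsmodels.api as sm

# Synthetic cities: crimes rise with population; police levels depend only on crimes.
rng = np.random.default_rng(5)
pop = rng.uniform(1, 20, size=300)                     # population, in 100,000s
crimes = 50 * pop + rng.normal(0, 20, size=300)
police = 0.5 * crimes + rng.normal(0, 10, size=300)

full = sm.OLS(police, sm.add_constant(np.column_stack([pop, crimes]))).fit()
short = sm.OLS(police, sm.add_constant(pop)).fit()     # crimes omitted

# Full model: pop coefficient near zero. Misspecified model: pop proxies for
# crimes, so its coefficient is biased far upward (near 0.5 * 50 = 25).
print(full.params[1], short.params[1])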


Multiple Choice Review:
11.1 Which of the following is in proper form for a regression?
a. predicted Sales = b0 + b1 Invest + b2 Employ
b. Sales = β0 + β1 Invest + β2 Employ + ε
c. predicted Sales = β0 + β1 Invest + β2 Employ
d. Sales = b0 + b1 Invest + b2 Employ + ε
e. all of the above

11.2 Which of the following is a source of random disturbance due to modeling error?
a. using a proxy variable instead of the ideal variable needed for your model
b. using a sample to estimate population characteristics
c. including only the most important variables for your model
d. using less than precise instruments to make your data readings
e. none of the above

11.3 A proficiency exam score used in a regression model to "proxy for" worker efficiency is
an example which source of random disturbance?
a. sampling error
b. measurement error
c. modeling error
d. all of the above
e. b and c only

11.4 The random disturbance term is required in population regressions because of
a. sampling error
b. measurement error
c. modeling error
d. all of the above
e. b and c only

11.5 If you have data for the entire population, which of the following will no longer be a
factor?
a. sampling error
b. measurement error
c. modeling error
d. errors in judgment
e. all of the above

11.6 If all regression assumptions are valid, least-squares estimators

a. are unbiased
b. have minimum standard deviation among all unbiased estimators
c. are efficient
d. all of the above
e. none of the above

11.7 Which of the following is not true about autocorrelation?
a. it results in inefficient estimation
b. it is a problem only for time series data
c. it means that εi and εj are correlated for i ≠ j
d. it results in biased estimation
e. all of the above are true

11.8 Omitting an intercept, or constant, term from a regression equation
a. is recommended whenever we suspect that E(ε) is not zero
b. means the equation cannot be estimated with least-squares analysis
c. usually improves the regression fit
d. should be avoided even if the intercept is meaningless in the equation
e. all of the above

11.9 Which of the following is not a regression assumption?
a. the parameters are constant
b. E(ε) is zero
c. ε is uncorrelated with each of the explanatory variables
d. each ε is uncorrelated with every other ε
e. all of the above are regression assumptions

11.10 A changing slope in a simple regression equation means that which assumption is violated?
a. the parameters are constant
b. E(ε) is zero
c. ε is uncorrelated with each of the explanatory variables
d. each ε is uncorrelated with every other ε
e. the random disturbance is normal

11.11 If all regression assumptions are valid except there is nonconstant σ_ε, estimators are
a. unbiased but not efficient
b. efficient but not unbiased
c. unbiased and efficient
d. neither unbiased nor efficient


11.12 According to the Gauss-Markov theorem
a. the assumption of normally distributed ε is approximately true for large samples
b. least-squares estimators are efficient under the regression assumptions
c. the least-squares equation minimizes the sum of squared errors
d. all of the above
e. none of the above

11.13 In many regression situations, according to the central limit theorem,
a. the assumption of normally distributed ε is approximately true for large samples
b. least-squares estimators are efficient under the regression assumptions
c. the least-squares equation minimizes the sum of squared errors
d. all of the above
e. none of the above


11.2 Testing and Estimation of Individual Explanatory Variables
We are now prepared to make inferences from regression. We begin with the two most
common forms of hypothesis tests for regression analysis: significance tests for the individual
explanatory variables, followed by tests for the regression model itself. In section 4, we will
modify the standard error to find confidence intervals around the fitted Y-value, Ŷ, and apply
this method to construct forecast intervals from a simple time trend model.
Hypothesis testing with regression commonly takes two forms:
(1) testing if individual explanatory variables have a significant relationship with the
dependent variable, and
(2) testing the significance of the entire model
The second of these will be postponed until Section 11.3. In this section, we introduce
hypothesis testing and interval estimation of explanatory variables.
For many business and economic consultants, this section contains the most valuable
statistical tools they will ever use. Managers, if they are wise, want to know if they are doing the
right things in running their businesses. Testing explanatory variables for statistical significance
often provides the answers that managers need.
Unfortunately, too many businesses spin their wheels by blindly following procedures and practices unrelated to product quality, productivity, cost control, market share, and the bottom line. They adopt or maintain standards for hiring, promoting, planning, marketing, and all other phases of company operations without ever subjecting these practices to statistical analysis. Did you ever take a summer or part-time job which had you wasting a lot of time in unproductive or inefficient activities? In poorly run businesses, these untested practices may be defended with arguments such as:
"everybody else does it"
"we've always done it that way"
"everyone knows that . . ."

Without data analysis, none of these arguments is convincing. Usually the only way to decide which variables are related to the dependent variable is to collect a random sample and conduct statistical tests on them.
In this section, we examine three cases to observe how explanatory variable significance tests may be used in business. Let's begin with the industrial development case introduced in the previous section. Recall that the IDC collected a random sample of 56 manufacturing plant expansions in northern New York. The regression of job creation on new plant area and investment was the following:
predicted jobs = 16.7 + 0.175 size + 4.02 invest
Yet this equation only applies to the sample data. It is always possible there is no relationship
between these variables in the overall population. The fitted equation could have resulted from
sampling error.
We experienced similar sampling errors when we used the sample mean to make
inferences about the population mean. As we learned in earlier chapters, random data often
display patterns where none exist in the population. These coincidental patterns are all the more
likely in smaller samples. Justifiably, the IDC cannot conclude that jobs are related to either size
or invest without conducting the proper statistical tests. Fortunately, the same methods of
testing we used in preceding chapters may be extended to testing explanatory variables for
significance.

Formulating Hypotheses for Variable Significance Tests
The IDC needs to determine whether the relationship between either explanatory variable and job creation is a statistically significant one. To design a test for significance of a particular explanatory variable, we first must state a null hypothesis describing no relationship. Then, if the test rejects this H0, we may conclude that the variable is significant.
What is the form of H0 for these significance tests? Consider again the population equation:
jobs = β0 + β1 size + β2 invest + ε
If size is unrelated to jobs, then changes in factory size cannot have any impact on jobs. However, the only way that can happen is to disconnect the size variable from the equation. We pull the plug connecting the dependent variable to an explanatory variable by making its regression coefficient zero.

Statistical significance of an explanatory variable is established by rejecting the null hypothesis that the variable's coefficient is zero in the population equation.

Thus, to test if size has a significant relationship with the jobs variable, we test whether the β1 coefficient is zero. Similarly, significance for invest involves rejecting the null hypothesis that β2 is zero. As with other two-sided tests, the alternative hypothesis contains an inequality sign.
sign.
The two-sided test for significance of the variable Xj in the population equation
Y = β0 + β1 X1 + β2 X2 + . . . + βk Xk + ε
where j is a particular integer from 1 to k, has the null and alternative hypotheses:
H0: βj = 0        HA: βj ≠ 0

Recall that rejecting the null hypothesis lets us claim significance for whatever is described in the alternative hypothesis. If the IDC finds that coefficient β1 is significantly different from zero, then the IDC should conclude that variation in plant size is significantly related to the number of jobs created. Similarly, rejecting H0 in the test for β2 means that the relationship of investment with jobs is significant.
In regression, a relationship may be in one of two possible directions depending on the sign of βj. If βj > 0, the coefficient is positive and a direct relationship exists between the explanatory and dependent variables. On the other hand, βj < 0 indicates these variables are related but move in opposite directions. We learned in Chapter 8 that two-sided tests are used whenever a reasonable argument may be made for alternative hypotheses in either direction. This naive approach to regression modeling is especially common when we do not have much experience or theory to guide us. When the first worldwide energy crisis occurred in 1973-1974, market analysts were uncertain how particular U.S. stock prices would respond. For example, higher oil prices may lower auto stock prices if people don't have enough money left to buy new cars after spending more on gas and heating oil. On the other hand, higher oil prices make older, gas-guzzling cars too expensive to drive, boosting sales of fuel-efficient new cars and lifting auto stock prices. Thus, a direct or inverse relationship is possible. A two-sided test under these circumstances was therefore justified.
However, if the test results in statistical significance, we may conclude that the
population relationship is in the direction observed in the sample regression.
If an explanatory variable is significant by a two-sided test, a positive sample
regression coefficient indicates a significant, direct relationship with the dependent
variable. Conversely, a negative coefficient indicates a significant, inverse relationship.
For example, oil prices tested significant in regressions on auto stock prices, based on sample data from the 1970s. The sample regressions yielded a negative regression coefficient, so analysts concluded there was a significant, inverse relationship between oil prices and auto stocks. Not only were many used cars fuel-inefficient; new U.S. cars were also gas guzzlers. Only fuel-efficient Japanese auto makers profited from the energy crisis.
In contrast to two-sided tests, one-sided tests involve alternative hypotheses restricted to only one direction. In one-sided regression variable significance tests, the alternatives to H0: βj = 0 are either βj > 0 or βj < 0 (but not both).

There are two possible one-sided tests for significance of explanatory variable Xj:
H0: βj = 0   and   H0: βj = 0
HA: βj > 0         HA: βj < 0

One-sided tests are more common if we enter a study understanding something about the subject. For example, recurrent crises in the Middle East have provided the information about stock market behavior that allows us to conduct one-sided tests. As the United States became ever more oil-dependent, stock market analysts correctly anticipated that rising oil prices during the 1990-1991 Gulf War would briefly depress the U.S. economy and its non-petroleum stock prices. On the other hand, falling inflation-adjusted oil prices during most of the 1980s and 1990s contributed to the tenfold climb in the overall U.S. stock market. Regressions using stock market indices as the dependent variable now expect a negative sign for the coefficient of the world oil price variable.
In the industrial development case, logic and experience tell the IDC that if factory space has any effect at all on jobs, the relationship should be a direct one. A larger operation, measured by larger floor space, should require more employees to run the operation. Thus, the following one-sided test is justified:

H0: β1 = 0
HA: β1 > 0

Even if much is already known about the situations under analysis, however, two-sided tests still may be advisable. Consider the test for the other explanatory variable. Investment clearly belongs in the industrial development model. For example, job creation often requires adding buildings and equipment to a factory. Thus, the IDC might at first consider conducting the one-sided test:
H0: β2 = 0
HA: β2 > 0

Fortunately, a quick-thinking IDC staff member points out that some factory investment is designed to replace workers. Automation in manufacturing has been as much a source of declining jobs as plant closures from overseas competition. Paper mills that once provided hundreds of secure jobs are now fully automated, leaving only jobs for forklift operators, security guards, and a person to monitor the computer panel.
Because investment may have a negative instead of a positive coefficient, the IDC properly decides to conduct the two-sided test:
H0: β2 = 0
HA: β2 ≠ 0

If the sample regression coefficient b2 turns out to have a positive sign, this two-sided test still allows the IDC to test whether this direct relationship between investment and job creation is statistically significant. However, what if the sample produces a negative regression coefficient for invest? Recall from Chapter 8 that it is unethical to begin with a one-sided test and switch to a two-sided test after examining sample statistics. Thus, if the IDC establishes a direct relationship as their alternative hypothesis, they cannot alter the test if the sample regression yields a negative sign for b2. Unless we are willing to ignore results that strongly suggest that the relationship is in the opposite direction, we should avoid one-sided tests except in clear-cut situations.

A one-sided test of an explanatory variable is justified only if one possible direction of the relationship can be disregarded. Altering an alternative hypothesis to accommodate an unexpected coefficient sign in the sample regression is unethical.
Observe how similar all these tests are to one another. By contrast, the value (and units) of μ0 changed from one example to the next in univariate tests of the mean. In Chapter 9, for example, μ0 was 8 ounces for the yogurt case, while μ0 was 6 percent growth in M1 in the FED case. Here, however, if the null hypothesis is true, βj must be zero.⁹ Thus, the null hypothesis always has the same format, H0: βj = 0. The only choices remaining to us are deciding whether a one- or two-sided test is proper and selecting the explanatory variables to test.
Before we show how to conduct these significance tests, did you notice that we apparently forgot to discuss the test involving β0? The two-sided test for whether β0 is significantly different from zero is:
H0: β0 = 0
HA: β0 ≠ 0

In Section 1, we advised you to include an intercept term in regression equations to improve the fit and absorb the effect of measurement errors and other sources of bias. As a result of absorbing these other effects, however, the test is less reliable. Fortunately, this test is also not as informative because β0 is the only coefficient not associated with an explanatory variable. Thus, no variable relationships are involved in the test of β0. Furthermore, recall from Chapter 4 that the intercept has no special meaning unless zero values can and do occur for all explanatory variables.

Significance tests involving β0 are usually not too informative or statistically reliable.

The Test Statistic and Decision Rule for Variable Significance Tests
We just learned that the dependent variable is unrelated to an explanatory variable only if its coefficient is zero. In the previous three chapters, we used sample data to make inferences about population parameters. Here, we use the sample regression coefficient in the fitted equation to test whether or not a particular βj is zero. Yet how far must bj be from zero for us to reject the null hypothesis that βj = 0? A large sample regression coefficient does not, by itself, indicate a significant relationship. To understand why not, consider again the sample regression in the industrial development case:
predicted jobs = 16.7 + 0.175 size + 4.02 invest
The coefficient of the size variable, 0.175, is much smaller than 4.02, the coefficient of invest. Does that imply that invest is significant or that size is not? Not necessarily!

⁹ Testing whether a particular βj is different from some nonzero value may also be conducted, but such tests are not variable significance tests.

Suppose the IDC was fortunate enough to have four other random samples of 56 New York factories, and the fitted equations on these samples resulted in the following coefficients for size: 0.162, 0.184, 0.178, and 0.170. Because these coefficients were positive and hardly varied at all from one sample to the next, the IDC could be nearly certain that jobs and size are directly related and that the population coefficient is somewhere around 0.15 or 0.20. By contrast, suppose the invest coefficients varied considerably from 4.02 in these four random samples: 12.5, 0.8, 7.1, and -3.4. The relatively larger variability in sample coefficients calls into question whether jobs and invest are related at all in the population.
As we learned in previous chapters, sample inference doesn't work like this in practice. Studies such as this one collect only a single sample, so no direct information is available about how regression coefficients vary from one sample to the next. Instead, the variability of regression coefficient estimates must also be inferred from standard errors estimated using the sample data. Each of these is an estimated standard error of the regression coefficient, s_bj.

DEFINITION: The estimated standard error of bj, s_bj, is the sample standard deviation of bj, and there is a different s_bj for each explanatory variable Xj in the sample regression.

These are estimated standard errors because the actual standard error of bj, σ_bj, is another population parameter we may only estimate from sample data.
Observe the parallels with univariate inference described in Chapter 9. Instead of using X̄ obtained from sample data to estimate μ, we now estimate each βj by bj in the sample regression. Before, we estimated σ_X̄, the standard error of X̄, by its sample data version, s_X̄, and used it to measure the variability of X̄ from one sample to the next. Here, we do the same thing for bivariate and multivariate relationships by estimating s_bj. In regression inference, the estimated standard error, s_bj, is used to measure the variability of bj from sample to sample.
To continue the analogy with univariate inference, a test statistic is needed. Recall from Chapter 9 how the statistic for testing the mean measured departures of the estimator from the null-hypothesized μ0 value. This X̄ - μ0 difference was converted to standard errors by dividing by s_X̄. The result was a test statistic (X̄ - μ0)/s_X̄ that tells us how many standard errors the sample mean is from μ0. The null hypothesis was rejected if the sample mean was enough standard errors above or below μ0.
0
.

In regression tests of explanatory variables, the test statistic also draws upon the null hypothesis, the estimate, and its estimated standard error. However, H0 is even simpler now because βj must be zero if the null hypothesis is true. Thus, (bj - 0)/s_bj measures the number of standard errors that the sample regression coefficient departs from zero. The test statistic is therefore the simple ratio bj/s_bj.

The statistic for testing significance of explanatory variable Xj: bj/s_bj, where bj is the sample regression coefficient and s_bj is the estimated standard deviation of bj.
To construct confidence intervals and conduct tests, a test statistic must have a known sampling distribution. Like the univariate test of the mean, the sampling distribution of this test statistic is a t-ratio, assuming that ε is normally distributed and the other five assumptions of Section 1 hold. A t-distribution rather than a z-distribution applies for regression inference because of the thicker density function tails resulting from estimating the standard error.

The test statistic bj/s_bj has a t-distribution (at least approximately) with (n - k - 1) degrees of freedom for the test under the six assumptions of regression modeling.

In the regression case, the outcome of the test again relies on the relative magnitudes of the numerator and denominator of the t-ratio. We reject H0 if the sample regression coefficient is enough standard deviations s_bj from zero to conclude that bj did not occur by chance. In the industrial development regression, suppose the size variable coefficient b1 were 0.10 and s_b1 equaled 0.20. Then, the t-ratio of 0.5 (0.10/0.20) says we are only half a standard error from the null hypothesis for the population coefficient, β1 = 0. Such a small t-ratio could easily have occurred by chance and would not justify the IDC rejecting H0, so they would conclude that size is not a significant variable for explaining job creation. On the other hand, the t-ratio would have been 5 if b1 = 0.10 but s_b1 were only 0.02. Five standard errors from the mean of the t-distribution is extremely unlikely.

The p < α decision rule may then be adapted to explanatory variable significance tests. Recall from Chapter 8 that we reject H0 if the test statistic lies within the rejection region, the tail (or tails) of the sampling distribution containing a total area of α probability. For a two-sided test, the rejection regions have α/2 probability under each tail. p measures the combined area of the two tails at least t = bj/s_bj standard errors from zero. The accompanying figure portrays this two-sided test graphically (see Figure 11.9). The decision rule is similar to others we have learned in previous chapters.

Decision Rule for Two-Sided Tests of Explanatory Variables:
If p < α, reject H0 and conclude that the explanatory variable is related to the dependent variable at the α significance level.
If p ≥ α, we cannot reject H0, so no significant relationship has been found.

Unlike the standard errors for univariate tests, the estimated standard errors s_bj in regression generally involve complex formulas and extensive computations. Thus, we enlist the aid of the computer to obtain sample regression coefficients and their standard errors, t-ratios, and p-values. We return to the industrial development regression output, but this time we examine inference information from Minitab and Excel that we have ignored until now.
Regression Analysis

The regression equation is
jobs = 16.7 + 0.175 size + 4.02 invest

Predictor Coef Stdev t-ratio p
Constant 16.722 2.718 6.15 0.000
size 0.17521 0.09468 1.85 0.070
invest 4.021 1.483 2.71 0.009
Figure 11.10

Notice that the Minitab table immediately following the regression equation (see Figure 11.10) is nearly identical to the table we extracted from the bottom of the Excel regression output (Figure 11.11). Each table contains five columns. The first column presents a list of the explanatory variables, preceded by a separate line entitled Constant (in Minitab) or Intercept (in Excel) for the intercept term in the regression. A regression with more independent variables would have more lines. The next column (Coef or Coefficients) lists the sample regression coefficients: b0, b1, and b2.
The other three columns have titles familiar to us from the univariate tests of Chapter 9. The estimated standard error of each bj, s_bj, is reported in the third column (Stdev in Minitab or Standard Error in Excel). The t-ratio (labeled t Stat in Excel) is found in column four, followed by the p-value (p or P-value) in the last column.
As with univariate tests, the computer output presents information that allows us to understand the test better. Here, we may easily verify that the t-ratio reported is correctly derived from the sample regression coefficient and its standard error. By examining the quotient of the preceding two columns, the t-ratio of 2.71 for the invest variable may be checked by dividing b2 = 4.0208 by s_b2 = 1.4834. Thus, the coefficient of invest is 2.71 standard errors above zero.
Coefficients Standard Error t Stat P-value
Intercept 16.7223 2.7182 6.15 0.0000
size 0.1752 0.0947 1.85 0.0698
invest 4.0208 1.4834 2.71 0.0090
Figure 11.11
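For readers working outside Minitab or Excel, the same five columns can be produced in Python with the statsmodels package. The sketch below uses synthetic stand-ins for the IDC data, so its numbers will not reproduce the output shown above:

import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the 56 IDC observations; units and ranges are invented.
rng = np.random.default_rng(6)
size = rng.uniform(10, 200, size=56)
invest = rng.uniform(0.1, 10.0, size=56)
jobs = 16.7 + 0.175 * size + 4.02 * invest + rng.normal(0, 8, size=56)

X = sm.add_constant(np.column_stack([size, invest]))
fit = sm.OLS(jobs, X).fit()

# The summary lists, for each term, the coefficient, standard error,
# t-ratio, and two-sided p-value: the same columns discussed above.
print(fit.summary())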

Is 2.71 standard errors enough to conclude there is a significant relationship between invest and jobs? The easiest way to answer this is to apply the p-value decision rule.¹⁰ Suppose the industrial development corporation establishes an α = .05 significance level for its two-sided test. The p-value is the probability of obtaining a test statistic as (or more) extreme as the one actually found if H0 is true for the population. The p-value of .0090 indicates that 0.9% of the probability under the t-distribution lies beyond 2.71 standard errors from the β2 = 0 specified by the null hypothesis. Thus, the IDC has a 0.9% chance of being wrong if they reject the null hypothesis. However, the industrial development corporation is conducting the test with α = .05, so they are willing to be wrong less than 5% of the time. Because p = .009 is less than α = .05, the test supports a finding of significance for the invest variable. Therefore, the IDC concludes that the investment amount has a significant and direct relationship to factory job creation.
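The reported p-value can also be checked directly against the t-distribution. This Python sketch uses scipy, with n - k - 1 = 56 - 2 - 1 = 53 degrees of freedom:

from scipy import stats

# Two-sided p-value for the invest t-ratio of 2.71 with 53 degrees of freedom.
t_ratio = 2.71
df = 56 - 2 - 1
p_two_sided = 2 * stats.t.sf(abs(t_ratio), df)
print(round(p_two_sided, 4))   # approximately 0.009, matching the output above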
What about one-sided tests, such as the test for the size variable? Earlier, we justified a one-sided test because experience and logic strongly suggest that more floor space should necessitate more employees to run the operation. The IDC established the one-sided test:
H0: β1 = 0
HA: β1 > 0
Just as we did with univariate inference in Chapter 9, one-sided variable significance tests have a rejection region on only one side of the sampling distribution. Thus, all of the α probability area is located under a single tail of that distribution. The IDC knows that any relationship between size and jobs must be a direct relationship. Therefore, if β1 in the population equation is not zero, it must be positive (see Figure 11.12). An example of a one-sided test involving inverse variable relationships will be examined shortly.
Implementing the p-value decision rule is slightly different because
the Minitab and Excel regression output only report p-values for two-sided
tests. Fortunately, this problem is easily corrected. The computer p-values


10 Before computers provided us with p-values, cumbersome decision rules involved comparing the t-ratio with values of the t-distribution at the rejection region boundaries. These critical values of tα/2 were found in t-distribution tables for the degrees of freedom and significance level applicable to each case.
Figure 11.12

(Sample data for Chapter Case #2: the variables sick, age, yrs, and avail for each of the 72 sampled employees.)

report the area on both tails, but one-sided tests restrict the alternative hypothesis to only one tail.
Because the two tails have the same area, we only need to divide the p-value in half to obtain the
probability in a single tail.
Decision Rule for One-Sided Tests of Explanatory Variables:
If p/2 < α, reject H0 and conclude that the explanatory variable is related to the dependent variable at the α significance level.
If p/2 ≥ α, we cannot reject H0, so no significant relationship has been found.

Halving the p-value of .070 yields .035, which is less than the significance level of .05 set by the industrial development corporation. Based on this one-sided test, the IDC rejects the null hypothesis and concludes that plant size is directly related to job creation.
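
The decision rule itself is simple enough to script. A minimal Python sketch applying it to the size variable, using the two-sided p-value of .070 reported above (for a one-sided test we should also confirm that the coefficient carries the anticipated sign, which it does: b1 = 0.175 > 0):

alpha = 0.05
p_size = 0.070            # two-sided p-value reported for size
p_one_tail = p_size / 2   # 0.035, the probability in a single tail

if p_one_tail < alpha:
    print("Reject H0: size is directly related to jobs")   # this branch runs
else:
    print("Cannot reject H0: no significant relationship found")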
Because this tail has twice the area, the rejection region is fewer standard errors away from the center of the distribution. Again, our prior knowledge about what we are testing gives us a better chance to obtain a significant test result. Observe that a two-sided test on the size variable would have used a p-value of .070, which is not less than α = .05. In this case, therefore, the one- and two-sided tests yield different results. Only the one-sided test lets us conclude that size and jobs are directly related at the .05 significance level. But as we discussed earlier, this advantage of one-sided tests should not be abused. It is unethical to switch from a two- to a one-sided test to accommodate the sample regression results. The next chapter case illustrates how the proper use of variable significance testing can help guide a business in becoming more productive.




Chapter Case #2: Sorry (Cough, Cough), I Won't Be In Today
An airline catering company was considering policies to deal with its high absentee rate.
Management believed that newer and younger employees called in sick most often because
young people party more and newer workers are less loyal to the company. Although this
proposition had never been tested, management believed it to be true because they themselves
were older and more senior and too dedicated to call in sick. If the catering company went ahead
with its plan to increase monitoring of younger, more recent hires, some dedicated workers may
have quit while many who stayed would have become alienated and less loyal to the company.
Ironically, the management plans were based on incorrect conclusions. Younger workers were
absent from work no more often than their older co-workers, and newer employees actually took
even fewer sick days than colleagues with greater seniority at the company.
Luckily for the catering company, a management trainee was enrolled in a university business statistics course that required a regression project. Her regression analysis on a random sample of 72 recent employment records investigated the relationship between

sick days absent in the past year
and the following two explanatory variables:
age employee age
yrs years seniority at the company
for each employee at the company. Thus, the population regression was modeled as follows:
sick = β0 + β1 age + β2 yrs + ε
After discussing the relationships with personnel management experts and talking with focus groups, the management trainee did not feel justified in running one-sided tests. Some supported the reasoning of management. However, others she interviewed argued that newly-hired or younger workers have too little job security to risk taking many sick days. Thus, the more secure, senior workers may be the ones abusing sick-day privileges. Experts also told her that older workers may play hooky from work because they are bored with their jobs or were passed over for promotion. Finally, she noticed that management overlooked the fact that sick days may be higher for older workers simply because our health tends to decline as we age. Because the arguments for direct relationships were at least as reasonable as those for inverse ones, the trainee decided that two-sided variable significance tests were proper for each variable.
Figures 11.13 and 11.14 display the explanatory variable table from the regression
outputs of Excel and Minitab.
The positive signs for both variable coefficients in the sample regression do not support the inverse relationships anticipated by the catering company management. If anything, older and more senior employees called in sick more often than did the younger workers.
Regression Analysis

The regression equation is
sick = 2.22 + 0.0321 age + 0.227 yrs

Predictor Coef Stdev t-ratio p
Constant 2.220 1.936 1.15 0.255
age 0.03208 0.04412 0.73 0.470
yrs 0.2266 0.1037 2.18 0.032

s = 4.126 R-sq = 7.6% R-sq(adj) = 4.9%
Figure 11.14

Coefficients Standard Error t Stat P-value
Intercept 2.2203 1.9356 1.15 0.255
age 0.0321 0.0441 0.73 0.470
yrs 0.2266 0.1037 2.18 0.032
Figure 11.13
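
Output like that in Figures 11.13 and 11.14 is not limited to Minitab and Excel. The following Python sketch shows one way the trainee's regression could be reproduced with the statsmodels library, assuming the 72 employee records were saved in a file named sickdays.csv (a hypothetical file name) with columns sick, age, and yrs:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical file holding the 72 sampled employee records
employees = pd.read_csv("sickdays.csv")

# fit sick = b0 + b1*age + b2*yrs by least squares
model = smf.ols("sick ~ age + yrs", data=employees).fit()

print(model.summary())   # coefficients, standard errors, t statistics, p-values, R-squared
print(model.pvalues)     # just the p-value column used for the significance tests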

Two-sided tests at the α = .05 level, however, result in only one of these variables testing significant. Because p = .032 is less than α = .05, sick days increase significantly with rising seniority. Thus, the management trainee concludes that length of service has a direct relationship with absenteeism. By contrast, we cannot infer that sick days are related to employee age at the .05 significance level. The p-value of .470 indicates that sample data relationships this strong or stronger will occur 47% of the time when age and sick days are unrelated in the overall population.
Did you also notice that R² was only 7.6% and the adjusted R² reported in the Minitab output was even smaller (see Figure 11.14)? The extremely weak fit suggests that absenteeism may be an essentially random event. Alternatively, specific worker situations such as health condition, family responsibilities, and job stress may be more important determinants of worker absences. Quick-fix solutions based on simplistic explanations are doomed to failure if we are dealing with random or highly complex phenomena. Studies such as this prevent companies from pursuing ineffective and discriminatory policies toward newer and younger employees. Instead, adopting morale-building programs for senior workers and preventive health care for all workers may reduce excessive absences. We'll be able to shed more light on this case after introducing model significance tests in section 3.
We now summarize in Table 11.2 the procedure for testing and estimation of explanatory
variables in a regression. The final step in the table will be discussed next.
Table 11.2
Procedure for Inferences about Explanatory Variables

1. Design regression model to include all important explanatory variables.
2. Decide which variables to test and which to include merely as control variables.
3. Determine which variables are eligible for one-tailed tests and in which direction.
4. Assign α according to your willingness to be wrong when you claim a variable is significant.
5. Collect the largest possible random sample within resource limits of the study.
6. Use p-value decision rule to conduct tests, but use p/2 to conduct one-tailed tests.
7. Translate test results into verbal conclusions about which variables test significant.
8. Find interval estimates for significant variables and interpret as marginal effects.

Marginal Effects Analysis with Regression Inference
In addition to significance testing, the industrial development corporation may want to
estimate the effect on job creation of changes in either explanatory variable. In Chapter 4, we
learned how the regression coefficients could be used to answer "What if?" questions. The average change in the dependent variable caused by a change in one explanatory variable was calculated using the rise-over-the-run delta formula ΔŶ = bj ΔXj. We interpreted these as marginal effects by qualifying our results with the comment "other things equal."
Although marginal effects analysis is often most useful in inferential regression studies,
additional uncertainty is also introduced. As we have seen, a sample regression may not reflect
the actual relationships in the population equation. Thus, we must always test a variable for
significance before we attempt to quantify its relationship with the dependent variable.
If an explanatory variable does not test significant, its marginal effect on the dependent
variable should not be estimated.
For example, no significant relationship between absenteeism and age was found. Calculating marginal effects using the sample regression coefficient b1 = .0321 is meaningless and misleading if the null hypothesis that β1 is zero cannot be rejected.
An important but less obvious problem remains even if we restrict marginal effects
analysis to significant variables. After finding a significant, direct relationship between job
creation and investment, suppose the IDC wants to determine the average impact of an additional
$1 million investment on factory employment (assuming plant size is unchanged). Using the
delta formula:
Δpredicted-jobs = b2 Δinvest = (4.02)(+1) = +4.02
because the invest variable is measured in millions of dollars. However, we should no longer use this result to predict an average of 4 more jobs because the calculations were only based on a point estimate of the population coefficient. The value of the sample coefficient, 4.02, is almost surely different from the actual population parameter, β2.

Fortunately, this margin of error may be incorporated within interval estimates centered around each estimated coefficient. The IDC should therefore construct a confidence interval for β2. Because regression coefficients have a t-distribution, we construct confidence intervals from sample regression coefficients and their standard errors.
If sample sizes are large, t-distribution values are only slightly larger than those for the normal distribution. Thus, 95% confidence intervals may be approximated by two standard errors around the estimated coefficient. When samples are smaller or more precision is desired, intervals based on t-distribution values should be reported instead.


The (1 − α)·100 percent confidence interval for a regression coefficient βj is
( bj − tα/2(n−k−1) s_bj ,  bj + tα/2(n−k−1) s_bj )
The 95% confidence interval for βj for larger samples is approximated by
( bj − 2 s_bj ,  bj + 2 s_bj )

In this case, t.025 = 2.01 for n − k − 1 = 53 degrees of freedom, so the approximation is nearly perfect. The 95% confidence interval for β2 is approximately b2 = 4.02 plus or minus 2(1.48), or about 4 ± 3. The IDC may then be 95% confident that the coefficient of invest is between 1 and 7 and that an additional $1 million in investment results in between 1 and 7 new jobs. Notice how close this approximation is to the (1.05, 7.00) interval based on the exact t-distribution values. This information is automatically reported in the Excel output in the Lower 95% and Upper 95% columns (see Figure 11.15).
Often such a high confidence level is not required. For example, a 68% confidence interval can be approximated by only one s_bj on either side of bj. By lowering the confidence level to roughly two out of three, the interval narrows to 4.02 ± 1.48, or approximately (2.5, 5.5). Thus, comparing two plants with the same floor area, the average effect of an extra $1 million invested is between roughly 2.5 and 5.5 additional jobs.
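
All of these intervals come from the same two numbers, the coefficient and its standard error. A short Python sketch reproducing the exact and approximate intervals for the invest coefficient:

from scipy import stats

b2, se_b2, df = 4.0208, 1.4834, 53

t_crit = stats.t.ppf(0.975, df)                          # about 2.01
exact_95 = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)    # about (1.05, 7.00)
approx_95 = (b2 - 2 * se_b2, b2 + 2 * se_b2)             # roughly 4 plus or minus 3
approx_68 = (b2 - se_b2, b2 + se_b2)                     # about (2.5, 5.5)

print(exact_95, approx_95, approx_68)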
We can also demonstrate the consequence of putting the cart before the horse when it
comes to estimating regression coefficients. For example, suppose the trainee skipped the testing
phase and jumped directly to estimating the marginal effect of age on sick days. Recall that the
age coefficient is b1 = 0.032 in the sample regression on catering company employees, but the standard error s_b1 is even larger, 0.044. The resulting 95% confidence interval for β1 is
0.032 ± 2(0.044) = 0.032 ± 0.088
or (−0.056, 0.120). Because this interval contains negative as well as positive values, marginal effects may be direct or inverse. In fact, the confidence interval includes β1 = 0, the null hypothesis. This result should not be surprising given that b1 was not found significantly different from 0 at the α = .05 level.
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 16.72 2.72 6.15 0.000 11.27 22.17
size 0.18 0.09 1.85 0.070 -0.01 0.37
invest 4.02 1.48 2.71 0.009 1.05 7.00
Figure 11.15

Chapter Case #3: Would You Like a
Good Deal on a Used Car?
The final case in this section shows how to test explanatory variables for significance and find specific marginal effects for regression coefficients in a regression where both direct and inverse relationships are present.
Finance companies need a method to decide on the loan value of used cars. A car dealer wants to start doing its own financing, and so designs the following model for used car prices:
Price = β0 + β1 EngSize + β2 HP + β3 Age + ε
where Price is the price of each used car, EngSize is engine size (in cubic inches), HP is horsepower, and Age is measured in years.
For significance tests of these three explanatory variables, one-sided tests are clearly justified. Larger engine size and greater horsepower should command a premium price on a used car. There is greater demand for bigger and more powerful engines for hauling cargo and passengers, entering expressway traffic, and climbing hills. Also, these used cars are likely to last longer because their larger, high-powered engines won't have been overworked under previous owners. We should therefore expect a direct relationship between Price and EngSize or HP, and positive values for β1 and β2.
A one-sided test is also proper for testing the age variable, but this time an inverse relationship is anticipated.

Figure 11.16  Used car sample data: Make, Price, EngSize, HP, and YrsOld for each of the 54 cars.

Among cars with comparable new car prices, older cars experience deteriorating looks and performance,
often require more repairs not covered under new car warranties, and may possess out-of-date technology or styling. Since age serves as a liability to the sale of a used car, the sign of β3 should be negative. The three one-sided significance tests are:
H0: β1 = 0          H0: β2 = 0          H0: β3 = 0
HA: β1 > 0          HA: β2 > 0          HA: β3 < 0

To conduct these tests at the .05 level and estimate marginal effects for variables that test significant, a least-squares equation was fit from a random sample of 54 used cars (see Figure 11.16).¹¹ The sample data resulted in the Excel regression output reproduced in Figure 11.17.
As expected, the estimated coefficient of age is negative (b3 = −1113.6), indicating an inverse relationship with used car prices. The anticipated direct relationships of Price with the two engine variables are also reflected in the positive values of b1, 0.2661, and b2, 49.83. To test whether these estimated coefficients represent significant departures from zero, we again employ the p-value decision rule. Notice that engine size does not test significant because
p/2 = (0.6834)/2 ≈ 0.34
is not less than α = .05 for the one-sided test. By contrast, the halved p-values for the other two variables are easily less than .05. We therefore conclude that a significant, inverse relationship exists between used car prices and age and a direct relationship with horsepower, but no significant relationship has been found with engine size.
Notice that Figure 11.17 not only includes the 95% confidence interval for each coefficient, but also 90 percent confidence intervals. To obtain this additional pair of columns, go to the Excel Regression dialog box, check the Confidence Level box, and type in the percent confidence level desired (in this case 90). The 90% confidence interval for β3 runs from −1321.4 to −905.8. As expected, this interval is somewhat narrower than the 95% one on its left, (−1362.6, −864.5). We are therefore 90 percent confident that the average used car declines in price


11 The sample is from The Black Book: Official Used Car Market Guide, Florida Edition, actually used by car dealers.
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 90.000% Upper 90.000%
Intercept 4642 1135 4.09 0.0002 2362 6922 2740 6544
EngSize 0.2661 0.6487 0.41 0.6834 -1.037 1.569 -0.821 1.353
HP 49.83 13.46 3.70 0.0005 22.80 76.86 27.28 72.38
YrsOld -1113.6 124.0 -8.98 0.0000 -1362.6 -864.5 -1321.4 -905.8
Figure 11.17

between approximately $1320 and $910 per year. The 90% confidence interval for β2 is (27.28, 72.38), so the marginal effect of an extra fifty horsepower is:
Δpredicted-price = b2 ΔHP = (27.28)(+50) = +1364 at the lower end
= (72.38)(+50) = +3619 at the upper end
Thus, fifty more horses under the hood translates to between $1400 and $3600 higher average prices for used cars, other things equal.
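
Because Excel reports the 90% interval endpoints directly, the marginal-effect range requires nothing more than multiplication. A brief Python sketch using the limits from Figure 11.17:

# 90% confidence limits reported in Figure 11.17
hp_low, hp_high = 27.28, 72.38        # coefficient of HP
age_low, age_high = -1321.4, -905.8   # coefficient of YrsOld

delta_hp = 50   # a typical 50-horsepower difference between cars
print(hp_low * delta_hp, hp_high * delta_hp)   # about 1364 to 3619 dollars, other things equal
print(age_low, age_high)                       # average price decline per additional year of age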

Limitations and Pitfalls of Variable Significance Tests
A few additional cautions are in order regarding significance tests. As we indicated in
Chapter 8, sometimes there is a difference between importance and statistical significance. To
compare relative impacts on the dependent variable, examine the marginal effects for typical
variation in each explanatory variable. For example, used car prices do not rise much if HP
increases by one horsepower (less than $100, based on the regression coefficient). However,
nobody markets their cars as more powerful because they have just one or two more horsepower.
Thus, the auto dealer examined typical differences of 50 horsepower. The influence of
horsepower on price was more than $1000, important indeed! Similarly, although b2 is more than 20 times larger than b1 (4.02 versus 0.175) in the IDC case study, a glance at the data (back in Table 11.1) reveals that the size values also appear to be about 20 times those of invest. Thus, in relative terms, changes in each variable have similar effects on job creation.
Practical significance is reflected by the marginal effect of a typical change in the
explanatory variable.
Several years ago, books about how to become a one-minute manager were all the rage.
Today, with accessible computers and data bases, variable significance testing is also
surprisingly easy to do. A business consultant can fly down a column of p-values, and in no time
at all report the test results of dozens of significance tests from an imposing regression. Still,
anything that powerful can be dangerous in the hands of an untrained or unscrupulous
practitioner. We have already demonstrated the dangers from improper use of one-sided tests.
Most of the other ethical considerations cited in Chapter 8 for univariate tests extend to
significance tests of explanatory variables in a regression. In Chapter 13, we will discuss other
unethical practices associated with regression inference and model selection. However, one
practice is so often abused that it deserves mention before we proceed any further.
Because variable significance testing is so easy and the p-values await expectantly on the
computer printout, the common instinct is to test every variable in the regression. Unfortunately,

the more variables we test from a particular regression equation, the more likely a variable that is
unrelated to the dependent variable will test significant. To understand how mistaken
conclusions become more probable, suppose we are testing several explanatory variables at a .05
significance level. We expect only a 1-in-20 chance of incorrectly claiming significance.
However, this .05 probability only applies if we are conducting a single t-test in the regression.
Each additional variable we test is like playing Russian Roulette, spinning the chamber and
pulling the trigger one more time. Thus, the probability of at least one explanatory variable
testing significant is substantially greater than .05 if we test five or ten variables. Even if no
population relationships exist between Y and any explanatory variable, we can virtually assure a
significant variable is uncovered by testing enough variables (see Exercises for an example).
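
To get a feel for how quickly the effective error rate grows, suppose (purely for illustration) that m unrelated variables are each tested at α = .05 and that the tests are independent; t-tests from the same regression are generally correlated, so this Python sketch is only a rough guide. The chance that at least one variable tests significant is then 1 − (1 − α)^m:

alpha = 0.05
for m in (1, 2, 5, 10):
    prob_at_least_one = 1 - (1 - alpha) ** m
    print(m, round(prob_at_least_one, 3))   # 0.05, 0.098, 0.226, 0.401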
Fortunately, we do not need, nor are we expected, to test all explanatory variables in a
regression model. In some models, most explanatory variables are included primarily to control
for their effect on Y. Only then is it possible to isolate the variable relationships in which we are
interested. Remember that if the most important variables are not present, we risk estimating
biased coefficients that will invalidate our significance tests. This bias problem will be discussed
further in Chapter 13.
If we need to test several variables in a regression model, what should we do? One option is to choose an α even smaller than the significance level for the test. For example, to test four variables at the .05 level, you might want to compare the p-value to .01 instead of .05. Alternatively, use an α of .05, but interpret the test results as if the significance level were a larger value. For testing two or three variables, rejecting the null hypothesis with α = .05 may indicate significance at the .10 level instead.
The significance level of explanatory variable tests becomes diluted if additional variables are tested from the same regression. Fortunately, variables included solely to improve the fit and prevent estimation bias often do not need to be tested.
A potentially greater limitation of explanatory variable tests is that they usually cannot be
used to infer significance for the entire regression model. Fortunately, in the next section we introduce another type of test that avoids this problem by testing all variables in the model as a group.

Inside the Estimated Standard Error of Regression Coefficients
In Chapter 8 we learned that to avoid the embarrassments caused by false claims of
significance, we accept the risk of overlooking significant relationships. In the Chapter 9
discussion of optimal sample size, we found out how to reduce type II errors without adding to

the likelihood of type I errors if we had some control over the model design and data gathering.
A larger sample also lessens the tradeoff between confidence level and estimation precision.
Many of these same issues arise in explanatory variable testing. If we could increase the
t-ratio, we would be more likely to reject H0 for any given sample. Because bj is an unbiased estimator of βj, the primary opportunity for obtaining larger t-ratios is to shrink the size of s_bj.
But what factors determine the magnitude of the estimated standard error of a regression
coefficient? Although its general formula is too complex to examine here, many of the properties of s_bj in simple regression also apply to multiple regression.
DEFINITION: The estimated standard deviation of bj, s_bj, in simple regression is
s_bj = SEE / ( √(n − 1) · s_X )
where n is the sample size and s_X is the standard deviation of X. In multiple regression, s_bj retains its direct proportionality with SEE and inverse relation to the square root of (n − 1).

Thus, t-ratios will be larger if SEE can be made smaller. We learned in Chapter 4 that SEE is the square root of Σe²/(n − k − 1). Moreover, Σe² is the error sum of squares SS_E, and is related to the fit measure R² by the formula R² = 1 − SS_E/SS_T. The better the fit, the smaller will be the values of SS_E, SEE, and s_bj.
If the sample size is large enough, a small s_bj can occur even with a fairly small R² because s_bj decreases with increasing n. For example, SEE = 2 and s_X = 1 could result in an s_bj of either 0.5 or 0.1 depending on whether n = 17 or 401. If b = 0.3 in each instance, the t-ratio would be 0.3/0.5 = 0.6 for the smaller sample but 0.3/0.1 = 3.0, large enough for significance, for the larger sample. Moreover, the greater degrees of freedom (n − k − 1) associated with larger n result in thinner tails resembling the normal distribution and its smaller p-value probability areas.
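
A quick Python check of the simple-regression formula with the numbers just quoted:

import math

def s_b(SEE, s_x, n):
    # estimated standard deviation of b in simple regression: SEE / (sqrt(n - 1) * s_x)
    return SEE / (math.sqrt(n - 1) * s_x)

print(s_b(2, 1, 17), s_b(2, 1, 401))               # 0.5 and 0.1
print(0.3 / s_b(2, 1, 17), 0.3 / s_b(2, 1, 401))   # t-ratios of 0.6 and 3.0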
For smaller samples, the number of explanatory variables k becomes especially important to achieving a small s_bj. A larger k reduces the number of degrees of freedom (n − k − 1), the information remaining to estimate the error standard deviation by SEE. The result is a reduced t-ratio and larger p-values from a thicker-tailed t-distribution. For example, if n = 19, SS_E = 144, and k = 2 explanatory variables, then n − k − 1 is 16 and SEE = 3, the square root of 144/16. By contrast, the same SS_E and n but k = 9 explanatory variables lowers the degrees of freedom to 19 − 9 − 1 = 9, and SEE increases to 4 (the square root of 144/9). In addition, areas under the tails of the t-distribution rise. Thus, more elaborate models require stronger relationships or larger samples to overcome the lost degrees of freedom.
The s_X in the denominator is the final factor relevant to the outcome of explanatory variable t-tests. Without this term, the simple regression standard deviation has approximately the same form as the s/√n used for univariate tests of the mean. The greater the dispersion of an explanatory variable, the smaller is the standard deviation of bj and consequently the larger its t-ratio. Thus, data collected from a wide rather than a narrow range of x-values is more likely to produce significant test results and greater precision. Figures 11.18 and 11.19 portray two extreme cases: one where the range of x-values is narrow, the other where it is broad. In the first case, slight variations from one sample to the next are capable of producing enormous differences in the slope (and intercept) of the fitted equation. In the second case, the wider range of x-values does not allow substantial sampling error.
A wide-ranging explanatory variable and a large sample give a weak relationship its best chance to test significant. When only a small sample is feasible, only a few explanatory variables should be fit.

CASE MINI-PROJECT:
An expressway planner in Orlando, Florida is preparing an analysis of which factors affect traffic through a major toll plaza. She has time series data for 72 months and the following model:
TRAFFIC = β0 + β1 METROPOP + β2 SALETAX + β3 UNEMPLOY + β4 AUTOTOUR + β5 AIRARRIV + ε
with variables defined as:
TRAFFIC monthly traffic volume measured at expressway toll plaza
METROPOP monthly population estimates for the metropolitan area (in thousands)
SALETAX monthly state sales tax collected (in millions of dollars)
UNEMPLOY monthly unemployment rate (in percentage points)
AUTOTOUR monthly volume of automobile visitors to the state (in thousands)
AIRARRIV monthly passengers arriving at metropolitan airport (in thousands)
and the regression output is:

The regression equation is
TRAFFIC = 8559 + 115 METROPOP - 90.3 SALETAX - 1639 UNEMPLOY
+ 88.8 AUTOTOUR + 228 AIRARRIV

Predictor Coef Stdev t-ratio p

Constant 8559 85392 0.10 0.920
METROPOP 115.2 138.0 0.83 0.407
SALETAX -90.33 70.69 -1.28 0.206
UNEMPLOY -1639 4574 -0.36 0.721
AUTOTOUR 88.76 19.29 4.60 0.000
AIRARRIV 227.97 20.57 11.08 0.000

1. Complete the alternative hypothesis for a two-tailed significance test on the SALETAX variable.
H0: β2 = 0
HA: β2 ______ (complete this line)
Using the p-value decision rule to conduct this test at the α = .05 level, we find that p is (less / greater) than α, so we (cannot reject / reject) the null hypothesis, and we therefore conclude that sales tax revenue (is / is not) significantly related to expressway traffic.
2. A one-tailed test is conducted on whether airport arrivals are directly related to expressway traffic; then
H0: β5 = 0
HA: β5 ______ (complete this line)
3. A one-tailed test for the AIRARRIV variable is justified in this model because more people
flying into Orlando should: (circle correct answer): (a) mean more rental car traffic on the
expressway, (b) result in more expressway traffic to pick up friends and relatives at the airport
(c) both a and b.
4. Using the p-value decision rule to conduct a one-tailed test for METROPOP at the α = .05 level, we find that p/2 is (less / greater) than α, so we (cannot reject / reject) the null hypothesis,
and we therefore conclude that the metropolitan area population (has / does not have) a
significant effect on expressway traffic.
5. Using a two-standard-deviation margin of error, we can be approximately 95% confident that each additional one thousand auto tourists adds about 89 cars/month to the traffic, plus or minus ______ cars [please round].




Multiple Choice Review:
11.25 A valid decision rule for a two-sided test of a regression coefficient is
a. t-ratio > tα/2(n − k − 1)
b. t-ratio < tα/2(n − k − 1)
c. t-ratio > tα(n − k − 1)
d. t-ratio < tα(n − k − 1)
e. none of the above

11.26 A valid decision rule for a two-sided test of a regression coefficient is
a. p > α
b. p < α
c. p/2 > α
d. p/2 < α
e. none of the above

11.27 Which of the following steps does not belong in the inference process for explanatory
variables?
a. use the t-ratio or p-value decision rules to determine test results
b. test alternative models at several different significance levels
c. translate test results on each coefficient into significance about the corresponding
explanatory variable
d. interpret estimates of regression coefficients as slopes
e. all of these steps belong

11.28 Which of the following steps is out of sequence in the inference process for explanatory
variables?
a. state the regression model
b. collect the sample data and estimate the regression equation
c. decide which variables are to be tested
d. determine which variables are eligible for one-sided tests
e. assign a level of significance

11.29 Which of the following does not belong with the rest in interpreting the findings of a two-sided test of an explanatory variable X1?
a. X1 is directly related to the dependent variable in the model
b. β1 is significantly different from zero
c. we reject the null hypothesis
d. X1 is statistically significant
e. all of the above are equivalent

11.30 For the one-tailed test, a valid decision rule for an explanatory variable to test significant is that the regression coefficient has the anticipated sign and
a. p > α/2
b. p < α/2
c. p/2 > α
d. p/2 < α
e. none of the above

An insurance company investigates the determinants of life insurance rates. A survey of n = 58 policyholders collects information on the three variables in the following regression model:
premium = β0 + β1 Age + β2 mortrate + ε
where
premium annual premium for each $1000 in life insurance coverage (in dollars)
Age age of policyholder (in years)
mortrate mortality rate (number of deaths per 1000) based on gender, health, and other policyholder characteristics.
Answer the following six questions based on the following regression output:

Regression Analysis

The regression equation is
premium = - 0.77 + 0.364 Age + 1.40 mortrate

Predictor Coef Stdev t-ratio p
Constant -0.765 3.463 -0.22 0.826
Age 0.3636 0.1158 3.14 0.003
mortrate 1.4038 0.2283 6.15 0.000

11.31 Which are the null-alternative hypotheses for a two-tailed significance test on the Age variable?
a. H0: β1 = 0, HA: β1 > 0
b. H0: β1 = 0, HA: β1 ≠ 0
c. H0: β1 = 0, HA: β1 < 0
d. H0: β1 > 0, HA: β1 < 0
e. H0: β1 < 0, HA: β1 > 0

11.32 According to the regression printout above,
a. Age is more than three standard deviations greater than mean age
b. Age is about 0.36 standard deviations greater than zero
c. The sample regression coefficient of Age is about 0.36 standard deviations greater than zero
d. The sample regression coefficient of Age is more than three standard deviations greater than zero
e. The population regression coefficient of Age is about 0.36 standard deviations greater than zero

11.33 If Age is tested at the α = .05 significance level, then we may conclude each of the following except:
a. p < α
b. Reject H0
c. The coefficient of Age is significantly different from zero
d. Age has a significant and direct effect on life insurance premiums
e. All of the above are valid

11.34 Each additional year of age adds an average of how much to insurance premiums (other
things equal)?
a. 36 cents with a margin of error of about 3 cents
b. 36 cents with a margin of error of about 23 cents
c. 11 cents with a margin of error of about 3 cents
d. $3.14 with a margin of error of 12 cents
e. Cannot be answered because Age is not statistically significant

11.35 Construct the null-alternative hypotheses for a one-tailed significance test that mortrate is directly related to premium.
a. H0: β2 = 0, HA: β2 > 0
b. H0: β2 = 0, HA: β2 ≠ 0
c. H0: β2 = 0, HA: β2 < 0
d. H0: β2 > 0, HA: β2 < 0
e. H0: β2 < 0, HA: β2 > 0

11.36 Mortality rates would have a significant, direct relationship with premiums at any of the following significance levels except:
a. α = .20 significance level
b. α = .10 significance level
c. α = .05 significance level
d. α = .01 significance level
e. tests significant at any of the levels selected above

Calculator Problems:
11.37 Determine the t-ratio under each of the following conditions:
(a) b = 10, s_b = 2
(b) b = 8, s_b = 3.5
(c) b = 1, s_b = 2
(d) b = .01, s_b = .001
(e) b = 3500, s_b = 2100


11.38 Determine s_b for a simple regression given the following:
(a) s = 12, s_X = 45, n = 36
(b) s = 1.2, s_X = 4.5, n = 36
(c) s = 6, s_X = 45, n = 9
(d) s = 12, s_X = 22.5, n = 144
Explain the relationship in your answers based upon the trade-offs between s, s_X, and n in the formula for s_b.


11.3 Testing Regression Models for Significance
Were you ever on a vacation where everything seemed to go wrong: the car had a flat tire
on the way to the airport, your flight was overbooked, they lost your luggage, and it rained the
entire trip. At some point, you probably asked yourself: Was this trip necessary? A similar
question may be asked about regression models: Does the model as a whole have any
explanatory value or is it totally worthless?

Constructing the Compound Hypotheses for Model Testing
First, we need to restate this question as a statistical test with null and alternative hypotheses. H0 should correspond to a worthless model, one in which all explanatory variables are unrelated to Y. Suppose your boss insists that you use the following ill-advised model to predict the inflation rate:
inflation = β0 + β1 bowlpts + β2 oscar + β3 lottery + ε
where bowlpts is the points scored in the Super Bowl, oscar is the length of this year's Oscar winner, and lottery is the sum of winning numbers in the Pennsylvania lottery. If this model has no predictive value, we would do just as well with an even simpler model,
inflation = μinflation + ε
in which we always estimate the dependent variable by its mean. Recall from Chapter 4 that we unplug explanatory variables by giving them zero coefficients.
If a model with k explanatory variables X1, X2, . . . , Xk has no predictive value, then
Y = μY + 0 X1 + 0 X2 + . . . + 0 Xk + ε


To appease your boss (and save your job), you decide to predict inflation from these three
irrelevant variables by the following method:
predicted-inflation = (mean of inflation) + 0 bowlpts + 0 oscar + 0 lottery
Unfortunately, most business models are not so obviously worthless. Thus, we must adapt
testing methods to decide whether a model is statistically significant.
The above reformulation guides us in stating the null and alternative hypotheses to test
whether an entire model is statistically significant.
To test the regression model
Y = β0 + β1 X1 + β2 X2 + . . . + βk Xk + ε
for significance, the null and alternative hypotheses are
H0: β1 = β2 = . . . = βk = 0
HA: βj ≠ 0 for at least one j from 1 to k

As always, H0 and HA are complementary events. The null hypothesis describes the equality conditions for population parameters, and the alternative hypothesis describes all possible departures from equality.
Yet there is also an obvious difference with this type of hypothesis test. Rather than a simple hypothesis containing a single parameter, the test for a multiple regression model involves several conditions and parameters. This compound set of conditions may be seen more easily by restating H0 as follows:
H0: β1 = 0, β2 = 0, . . . , and βk = 0
These k equalities are tested as a group, not separately, and must all be true for the null hypothesis to hold up. If we cannot reject the null hypothesis, therefore, no other tests need to be run. Notice that β0 = 0 is not included in the list. In fact, its value is μY if H0 is true.
By contrast, rejecting the null hypothesis often doesn't tell us very much. Compared to the single equality hypotheses we used in t-tests of individual variables, rejecting a grouped null hypothesis can be far less informative. Conditions for model significance are described by the alternative hypothesis:
HA: βj ≠ 0 for at least one j from 1 to k
This may mean that any one variable is significant, or all k of them, or any combination of significant and nonsignificant variables in between. In the used car case study, for example,
Price = β0 + β1 EngSize + β2 HP + β3 Age + ε

so there are seven different ways that the model may test significant: β1 ≠ 0, β2 ≠ 0, β3 ≠ 0, β1 ≠ 0 and β2 ≠ 0, β1 ≠ 0 and β3 ≠ 0, β2 ≠ 0 and β3 ≠ 0, or β1 ≠ 0 and β2 ≠ 0 and β3 ≠ 0. The number of ways increases geometrically as explanatory variables are added to the model, exceeding one thousand in a model with only ten explanatory variables! Thus, separate explanatory variable tests are required to investigate individual relationships.
However, often we are not interested in any specific explanatory variables or their
marginal effects. If predicting the dependent variable is our sole objective, then a test of the
entire model is all that is necessary. Section 4 will discuss inference methods in prediction and
forecasting.
Significance of a model does not tell us which or how many explanatory variables are
significant. On the other hand, individual variable tests are unnecessary if the overall
model does not test significant or if prediction is our only goal.

Analysis of Variance in Regression
As always, testing requires a test statistic with a known sampling distribution. The test statistics for univariate mean tests and explanatory variable tests for regressions are ratios. These ratios measure how much the sample findings depart from the null hypothesis and compare that to typical variability among random samples. If sample outcomes are unlikely to result from sampling error, then H0 may not be true. However, the t-distribution is not adequate to represent the compound set of equalities described by H0.
Fortunately, the F distribution that we used with analysis-of-variance models is up to the task. In fact, F-tests are flexible enough to test any multiple regression in this text, no matter how many explanatory variables it contains.¹² Recall from Chapter 10 that the F distribution is a unimodal family of right-skewed density functions defined for nonnegative values. The specific shape of each F distribution family member is determined by its two parameters.¹³

Our task now is to argue that a regression model may be tested with an F sampling distribution. Like the other test statistics we have seen, the F-ratio for testing regression models also compares departures from H0 with average sample variability. However, the numerator and denominator of this test statistic are expressed as sums of squares like the error sum of squares SS_E discussed in Chapter 4. For the denominator of the test statistic, we use our measure of


12 The F-distribution may also be used for other types of compound hypotheses, such as testing groups of variables (see Chapter 13).
13 If you have not covered Chapter 10 yet, read about the F distribution in section 10.3.

average sample variability, the mean square error, MS_E = SS_E/(n − k − 1). Remember that MS_E is the square of the standard error of the estimate, SEE, and is the variance of Y in the sample around the fitted value Ŷ.¹⁴

All we need now is a numerator to measure departures from H0, the contention that the model is worthless. Least squares fits the sample data to a regression equation, but coincidental patterns can occur in a finite sample. For each observation in the sample, the variation of Y around its mean Ȳ may be divided into the part explained by the regression and an unexplained error portion. This relationship also holds among sums of squares:
Total Variation = Variation Explained + Unexplained Error
Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²
If we always guessed Ȳ as our estimate of Y, we would commit errors of SS_T = Σ(Y − Ȳ)². By fitting the sample data to the regression model, we have reduced errors to SS_E = Σ(Y − Ŷ)². The regression sum of squares, SS_R, is therefore the difference between two other sums of squares and measures the total contribution of the regression model.

DEFINITION: The regression sum of squares, SS_R, is the Y variation explained by the sample regression, and the difference between total and error sums of squares:
SS_R = SS_T − SS_E
The numerator for our F-ratio contains another mean square. This measures the average contribution by each of the k explanatory variables to the overall fit. This is the mean square for the regression, MS_R.
DEFINITION: The mean square for the regression, MS_R = SS_R/k, is the contribution per explanatory variable.
The reason we divide by k is that larger models have a natural advantage over models with fewer explanatory variables. Least squares is much more likely to find coincidental sample-data relationships among a large number of variables. If the null hypothesis is true, fits due purely to sampling error will increase SS_R for each additional variable in the model. The test statistic compensates for models of different sizes by measuring MS_R.


14 Mean square error in its general form that includes bias is often used to evaluate a model's forecasting performance (see Chapter 13).

Notice that our ratio of mean squares cannot be negative because sums of squares are not
negative. As we learned in Chapter 10, the ratio of mean squares has an F distribution with two
different degrees of freedom.
The F-ratio for testing a regression model is the ratio MS_R/MS_E. The sampling distribution for this ratio is an F distribution with k and (n − k − 1) degrees of freedom, where k is the number of explanatory variables and n is the sample size.
A useful way to present the building blocks for calculating and testing the model is an
analysis of variance table, or ANOVA table.
DEFINITION: A regression analysis of variance table displays in tabular form the information
on degrees of freedom and total and mean sum of squares for each component of variation used
in finding the F statistic for testing the model.
Chapter 10 devoted much of its attention to constructing these tables. Here, the format of the
analysis of variance table is similar, but has characteristics appropriate to regression models (see
Table 11.2).
Table 11.2 Analysis of Variance Table for Regression
SOURCE        DEGREES OF FREEDOM    SUM OF SQUARES        MEAN SQUARES               F
Regression    k                     SS_R = Σ(Ŷ − Ȳ)²      MS_R = SS_R / k            F = MS_R / MS_E
Error         n − k − 1             SS_E = Σe²            MS_E = SS_E / (n − k − 1)
Total         n − 1                 SS_T = Σ(Y − Ȳ)²

By checking the relationships in the ANOVA table, we can gain valuable insights about the
significance test on the model. Moving from left to right, the ANOVA table shows how the F-
ratio is constructed. The mean squares are sample variances and the F-ratio is a variance ratio,
hence the name analysis of variance.
Let's return to our job creation case and examine its ANOVA table in the Minitab regression output of Figure 11.21. We'll look at an Excel output shortly.


Regression Analysis

The regression equation is
jobs = 16.7 + 0.175 size + 4.02 invest

Predictor Coef Stdev t-ratio p
Constant 16.722 2.718 6.15 0.000
size 0.17521 0.09468 1.85 0.070
invest 4.021 1.483 2.71 0.009

s = 14.73 R-sq = 33.3% R-sq(adj) = 30.8%

Analysis of Variance
SOURCE DF SS MS F p
Regression 2 5736.9 2868.5 13.22 0.000
Error 53 11496.5 216.9
Total 55 17233.4
Figure 11.21

Observe that the table entitled Analysis of Variance is at the bottom of the output.¹⁵ Back in Chapter 4, we used information in this table to show how SEE and R² were computed. Notice that the ANOVA table has the format we discussed, but now it contains the actual numbers for the model and sample data. The column headers DF, SS, and MS are abbreviations for degrees of freedom, sum of squares, and mean square.
The advantage of the analysis of variance table is its clear presentation of the variation explained and unexplained by the model. Notice how the regression and error lines sum to the totals on the last line of the table. To fit the model, the least-squares equation consumes k additional pieces of information from the sample, leaving us with n − k − 1 degrees of freedom assigned to the error variance. The degrees of freedom are k = 2 for regression and n − k − 1 = 53 for the errors. These sum to n − 1 = 55 degrees of freedom for total variation from the mean of the dependent variable. Remember that one degree of freedom is lost because we must compute Ȳ from the same sample data.
Verify on a calculator that the regression and error sums of squares also add up to the total:
SS_R + SS_E = 5,736.9 + 11,496.5 = 17,233.4 = SS_T
The mean squares column, MS, is found from the quotient of the two preceding columns, SS/DF, the sum of squares averaged over the degrees of freedom:


15 Minitab also often furnishes other information following the analysis of variance table. We will discuss some of these later on in this chapter.

MS_R = SS_R / k = (5736.9)/2 ≈ 2868.5
MS_E = SS_E / (n − k − 1) = (11,496.5)/53 = 216.9
Finally, the F-ratio is the ratio of mean squares:
F = MS_R / MS_E = 2868.5/216.9 = 13.22
This F-ratio compares the average explained to the average unexplained. Even if the null
hypothesis is true, sampling error may often result in F-ratios substantially greater than zero. For
example, the F distribution in Figure 11.22 shows sizeable areas in the tails beyond F of 1 or 2.
On the other hand, the tail of this distribution has very little probability area beyond F = 5 or 6.
If H0 were true for the population, F-ratios as large as 13.22 should therefore be extremely unlikely in sample data.
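
Every entry in this ANOVA table, and the p-value itself, can be rebuilt from the two sums of squares. A Python sketch for the job creation regression (scipy supplies the upper-tail F probability):

from scipy import stats

SS_R, SS_E = 5736.9, 11496.5
n, k = 56, 2

SS_T = SS_R + SS_E                      # 17,233.4
MS_R = SS_R / k                         # about 2868.5
MS_E = SS_E / (n - k - 1)               # about 216.9
F = MS_R / MS_E                         # about 13.22
p_value = stats.f.sf(F, k, n - k - 1)   # upper-tail area; rounds to 0.000

print(round(F, 2), round(p_value, 5))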



The F-Test in Regression
Because the F distribution will vary by sample and model size, we again use a p-value decision rule to conduct the test. In significance tests for a regression model, the decision rule portrayed in Figure 11.22 may be interpreted as follows:
p-Value Decision Rule for Testing a Regression Model: At significance level α,
if p < α, reject H0
if p ≥ α, do not reject H0
Rejecting H0 allows us to conclude that the model is statistically significant.

To apply this rule, the test statistic must be converted to a p-value. In model tests, this p-value is the probability that sampling error alone caused an F-ratio as large as the one observed. The probability under this tail of the F distribution is provided in the p column at the right end of the ANOVA computer output (Figure 11.21). As we did in section 2, interpret p = .000 as a p-value that rounds to zero in the third decimal place. Even if the industrial development corporation established an α = .01 significance level, p is still less than α. Thus, the IDC rejects the null hypothesis H0: β1 = β2 = 0, and they conclude that the job creation model
jobs = β0 + β1 size + β2 invest + ε
is statistically significant at the .01 level.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.275
R Square 0.076
Adjusted R Square 0.049
Standard Error 4.126
Observations 72
ANOVA
df SS MS F Significance F
Regression 2 96.42 48.21 2.832 0.066
Residual 69 1174.46 17.02
Total 71 1270.88
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 2.2203 1.9356 1.15 0.255 -1.641 6.082
age 0.0321 0.0441 0.73 0.470 -0.056 0.120
yrs 0.2266 0.1037 2.18 0.032 0.020 0.434
Figure 11.23

They may then wish to conduct t-tests for individual explanatory variables, because there are three alternatives possible:
HA: either β1 or β2 (or both) ≠ 0
On the other hand, the IDC may not need any further testing if all they need is an equation that
predicts factory job creation significantly better than always guessing mean job creation.
Next, we examine the sick day case study using the Excel ANOVA table, and see what a
nonsignificant model looks like. The only differences with the Minitab output are that Excel
places a table called ANOVA in the middle of the regression output (see Figure 11.23), the error
source is titled Residual, and the p-value is labeled Significance F.
Although k = 2 as in the previous case, the sample size n is 72. Thus, the degrees of
freedom are 2 and 69. Notice that SS_R is less than one-tenth the size of SS_E. However, the mean square for the regression is nearly three times MS_E after dividing by 2 and 69, respectively. Nevertheless, the p-value is 0.066 for this F-ratio of 2.832. If the catering
company is conducting this test at the .05 level, they would have to conclude that the model is
not significant.
But doesn't this contradict the t-test results reported in section 2? Recall that the management trainee found one of the two variables significant. By comparing p = .032 with α = .05, she concluded that length of service has a significant and direct relationship with absenteeism. What should we say if model and individual variable tests yield opposite conclusions?
We already gave you the answer: always test the model first. If it does not test significant, no other tests are needed. In the sick day case study, how could one of the explanatory variables test significant if the model does not test significant? This dilemma too was explained earlier. In section 2, we cautioned that the more explanatory variables we test, the more diluted the significance level becomes. Even though the management trainee only tested two variables, the effective α was inflated above .05. To test whether any part of the model is significant, first test the model. The F-test is designed to tackle the compound conditions for this test, so its significance level may be trusted. If the sick day model had tested significant, then individual t-tests could be used to locate specific significant relationships. However, the model was not significant, so there is no sense looking for something that is not there!
There is one exception where testing the model and testing an explanatory variable in the
model are equivalent. The exception applies to testing any simple regression model, as the next
case example illustrates.


Chapter Case #4: Bedding Down for the Night
The advent of HMOs and Medicare price controls has stemmed the runaway health care
costs that once threatened to bankrupt Medicare and private health insurance programs. One of
the most controversial tactics has been placing limits on hospital stays. In the past, the fear of
malpractice suits led doctors to prevent earlier release of their patients. Although hospital
charges of several hundreds of dollars per day are now being saved, these cost containment
programs have come under fire by the press and Congress for sending sick patients home
prematurely. Major financial losers are hospitals themselves, and empty hospital beds forced
hundreds of hospitals to close. A hospital administrator collects data on patient age and hospital
stay (hospstay) for a random sample of 93 cardiac patients to investigate its patient release
practices. The sample was fit to the following simple regression model:
hospstay = β0 + β1 age + ε
Notice that the model and the age variable each test significant and have the same, minuscule p-value of p = .000000001. What is the chance of that happening? Well actually, the p-values for the t-test and F-test will always be identical in simple regression. The two tests are
equivalent because the only way a regression model with one explanatory variable can be
significant is if that variable is significant. Moreover, the test statistics are also related to one
another in simple regression. Because the ANOVA table contains sums of squares, the F-ratio is
the square of the t-ratio.
Significance of the model is equivalent to significance of the independent variable only in
simple regression.
In the hospital stay example, the t-ratio is 6.729, whose square is 45.28, the F-ratio. The hospital administrator concludes that hospital stays average 1.0 to 1.8 days longer for a cardiac patient who is ten years older. He conducted only a t-test because it is equivalent to the F-test for this simple regression model.
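To see why the two tests must agree, a quick numerical check can help. The minimal sketch below (in Python, assuming the SciPy library is available; the variable names are ours, not Minitab's) squares the reported t-ratio and confirms that the t-test and F-test p-values coincide for the hospital stay regression.

import scipy.stats as st

t_ratio = 6.729                  # t-ratio reported for age in the hospital stay regression
df_error = 93 - 1 - 1            # n - k - 1 for a simple regression on 93 patients
f_ratio = t_ratio ** 2           # 45.28: in simple regression the F-ratio is the squared t-ratio
p_t = 2 * st.t.sf(abs(t_ratio), df_error)   # two-sided p-value for the t-test
p_f = st.f.sf(f_ratio, 1, df_error)         # p-value for the F-test with 1 and 91 df
print(round(f_ratio, 2), p_t, p_f)          # the two p-values are identical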

Testing for a Significant Fit
There is another use for the F-test of regression models. Remember that least squares
finds the best fitting equation for the sample data, even if no such fit exists in the population.
We call the population fit ρ² (pronounced "rho squared").


DEFINITION: The population fit, ρ², is the R² for data on the entire population.
The regression equation will nearly always account for some variation in Y due to chance patterns in the sample. The result is that the sample fit, R², contains an upward bias in estimating the corresponding population parameter ρ². Thus, even if ρ² = 0, R² will exceed zero in nearly all random samples and sometimes R² may be quite large.¹⁶

How can we infer from R² that our model has any fit in the population? It just so happens that the F-test also tests whether the sample fit could have occurred by chance.
The test for significance of model fit:
H0: ρ² = 0
HA: ρ² > 0
may be tested by the F-test for significance of the model itself.

If H0 is rejected, we conclude that ρ² is significantly greater than zero and that the regression model has at least some fit in the population.
Why can't we tell whether a fit is significant merely by examining R²? After all, both R² and the F-ratio contain sums of squares from the ANOVA table. However, F-ratios also include k and n, factors not found in the R² formula. In fact, we can express the F-ratio in terms of R², n, and k:¹⁷

F = (n − k − 1) R² / [k (1 − R²)]
What does this formula tell us about getting F-ratios large enough for R² (and the model) to test significant? First, a higher R² is more likely to test significant: the F-ratio is greater because a high R² increases the numerator and reduces (1 − R²) in the denominator. However, a larger sample size can accomplish the same thing by increasing (n − k − 1) in the numerator of the F-ratio. Finally, fewer explanatory variables result in both a smaller k in the denominator and a larger (n − k − 1) in the numerator, again producing a larger F-ratio.
A significant fit depends on more than just the size of R². The larger the sample, the smaller the R² needed to produce a significant fit. In addition, models with fewer explanatory variables also need a smaller R² (or a smaller sample) for a significant fit.


¹⁶ While the adjusted R² measure (introduced in Chapter 4) attempts to correct for this bias, the F-test discussed here assumes a null hypothesis that there is no population fit.

¹⁷ Using R² = 1 − SSE/SST and SSR = SST − SSE, first show that R²/(1 − R²) = SSR/SSE. Then substitute for SSR/SSE in the formula for F.

In the industrial development case, for example, the Minitab output (Figure 11.21) reported R² = 33.3% for a sample of n = 56 factories and a model containing k = 2 explanatory variables. Based on the formula above,

F = (n − k − 1) R² / [k (1 − R²)] = (56 − 2 − 1)(.333) / [2(1 − .333)] = 53(.333) / [2(.667)] = 13.23

the same value reported (except for rounding) for the F-ratio in the ANOVA table.
However, a considerably smaller R² can produce a large F-ratio if the sample is large enough. Suppose R² were 3%, less than one-tenth as large as before, but the sample contained n = 860 factories. Then we would have obtained about the same F-ratio:

F = (860 − 2 − 1)(.03) / [2(1 − .03)] = 857(.03) / [2(.97)] = 13.25

On the other hand, even a large R² may not test significant for small samples, especially if the model contains many explanatory variables. Consider a model with k = 10 variables that we insist on testing with an inadequately small sample of n = 16. We are momentarily pleased by a regression whose R² is 60%. However, we are returned to reality when the model does not test significant. It is easy to see how this happened once we see the very small F-ratio that results:

F = (16 − 10 − 1)(.60) / [10(1 − .60)] = 5(.6) / [10(.4)] = 0.75

To investigate a model with so many explanatory variables, a larger sample is necessary. For example, if n = 60, the same sample R² would have produced an F-ratio of

F = (60 − 10 − 1)(.60) / [10(1 − .60)] = 49(.6) / [10(.4)] = 7.35

large enough to test significant at the .01 level.
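These calculations are easy to automate. The short Python sketch below (assuming SciPy is available; the function name f_from_r2 is ours) reproduces the four F-ratios above directly from R², n, and k and attaches a p-value to each.

import scipy.stats as st

def f_from_r2(r2, n, k):
    # F-ratio implied by the sample fit R-squared with k explanatory variables and n observations
    return (n - k - 1) * r2 / (k * (1 - r2))

for r2, n, k in [(0.333, 56, 2), (0.03, 860, 2), (0.60, 16, 10), (0.60, 60, 10)]:
    f = f_from_r2(r2, n, k)
    p = st.f.sf(f, k, n - k - 1)             # upper-tail area of the F distribution
    print(f"R-sq = {r2:.0%}, n = {n}, k = {k}:  F = {f:.2f},  p = {p:.4g}")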

CASE MINI-PROJECT:
An expressway planner is preparing her analysis about which factors affect the traffic through a
major toll plaza. Time series data is collected for 72 months and the following model is used:
TRAFFIC = β0 + β1 POP + β2 TAX + β3 UNEMP + β4 AUTO + β5 AIR + ε
with variables defined as:
TRAFFIC monthly traffic volume at toll plaza (in thousands)
POP monthly population estimates for the metropolitan area (in thousands)
TAX monthly state sales tax collected (in millions of dollars)
UNEMP monthly unemployment rate (in percentage points)
AUTO monthly volume of automobile visitors to the state (in thousands)
AIR monthly passenger volume arriving at metropolitan airport (in thousands)
and the regression output is:

The regression equation is
TRAFFIC = 8.6 + 0.115 POP - 0.0903 TAX - 1.64 UNEMP + 0.0888 AUTO + 0.228 AIR
[other output not relevant to this case]

s = 26.51 R-sq = 83.1% R-sq(adj) = 81.8%

Analysis of Variance
SOURCE DF SS MS F p
Regression # 227866 %#$@* &%#@! 0.000
Error 66 46388 703
Total 71 274254

1. Complete the null and alternative hypotheses lines for the F-test of this model:
H0: β1 = ⋯ = ______ = 0
HA: βj ≠ 0 for at least one j, where j = 1 through ______.
2. Using the p-value decision rule, p is (less / greater) than α, so we (reject / cannot reject) the null hypothesis, and we therefore conclude that this expressway traffic model (tests / does not test) significant at the α = .01 level.
3. This test is also equivalent to the following test for the population R-square, ρ²:
H0: ρ² = 0
HA: ρ² ______ (complete this line)
Test findings from question #2 allow us to conclude that the model fit ( is / is not ) significant.
4. A garbled fax transmission made some items in the output unreadable. From your knowledge of the analysis of variance table, the degrees of freedom for the regression must equal ______, the mean square regression is therefore ______, and the F-ratio must then be equal to ______.

Multiple Choice Review:
11.45 If for a particular explanatory variable, the sample regression coefficient is 10.5 and its
standard deviation is 3.5, then the t-ratio is
a. 7.0
b. 14
c. 0.33
d. 3.0
e. insufficient information provided to determine the t-ratio

11.46 Which of the following can be determined from t-tests on regression results?
a. whether the model is significant
b. whether specific explanatory variables are significant
c. whether the regression fit is significant
d. all of the above
e. none of the above

11.47 In simple regression models, the F-test and the t-test
a. test statistically equivalent hypotheses
b. always yield the identical p-values
c. result in an F-ratio that is the square of the t-ratio
d. all of the above
e. none of the above

11.48 The alternative hypothesis for the F-test on the multiple regression model
Y = β0 + β1 X1 + β2 X2 + ε can be stated as follows:
a. HA: either β1 ≠ 0 or β2 ≠ 0, but not both
b. HA: either β1 ≠ 0 or β2 ≠ 0, or both
c. HA: both β1 ≠ 0 and β2 ≠ 0
d. HA: β1 = β2
e. HA: β1 ≠ β2

11.49 If SSE = 10, SSR = 10, n = 36, and k = 5, then the F-ratio is
a. 6
b. 2
c. 1
d. 0.33
e. none of the above



11.50 If R² = 80%, SST = 120, n = 15, and k = 2, then the F-ratio is
a. 24
b. 12
c. 6
d. 4
e. 2

11.51 If MSR = 600 and s = 20, then the F-ratio is
a. 30
b. 15
c. 10
d. 3
e. 1.5

Answer the following three questions based on a regression analysis of variance table; unfortunately, a printer defect has left several of its entries unreadable:
SOURCE DF SS MS F p
Regression 4 136.400 @#!@$# &%$#@ 0.000
Error 22 #$@&#@$ %$#@##
Total $@ 189.200

11.52 The sample used to generate this AOV table must contain
a. n = 26 observations
b. n = 27 observations
c. n = 18 observations
d. n = 19 observations
e. not enough information provided

11.53 MSR = ______ and SSE = ______, where the blanks are equal to
a. 27.28 and 52.8
b. 34.1 and 325.6
c. 34.1 and 52.8
d. 27.28 and 325.6
e. not enough information provided

11.54 The F-ratio for the table is approximately
a. 14.2
b. 11.4
c. 10.8

d. 9.57
e. not enough information provided

11.55 The F distribution has each of the following traits except:
a. is defined only for nonnegative values of F
b. contains an infinitely-long tail
c. is skewed
d. is unimodal
e. all of the above are traits of the F distribution

11.56 Which of the following has an F distribution?
a. the ratio of sums of squares
b. mean squares
c. sums of squares
d. the ratio of mean squares
e. all of the above

11.57 A regression model is more likely to test significant if
a. the sample size n is large
b. there are many explanatory variables in the model
c. R² is small
d. the α used for the test is small
e. all of the above

11.58 If SSE = 600, SSR = 300, n = 29, and k = 4, then the F-ratio is
a. 6
b. 4
c. 3
d. 1/2
e. none of the above

Answer the following three questions based on a regression analysis of variance table; unfortunately, a printer defect has left several of its entries unreadable:

SOURCE DF SS MS F p
Regression 3 *$#@!# @#!@$# &%$#@ 0.000
Error &# 84.80 %$#@##
Total 19 204.05


11.59 The sample used to generate this AOV table must contain
a. n = 20 observations
b. n = 19 observations
c. n = 17 observations
d. n = 16 observations
e. not enough information provided

11.60 SSR is equal to
a. 288.85
b. 119.25
c. 39.75
d. 20
e. not enough information provided

11.61 The F-ratio for the table is approximately
a. 8.91
b. 7.50
c. 1.406
d. 0.469
e. not enough information provided


Calculator Problems:
11.62 Using a calculator, solve for the F-ratio given the following information:
(a) n = 30, k = 4, SSR = 200, SSE = 1000
(b) MSR = 60, s = 20
(c) n = 36, k = 2, SST = 120, R² = 0.60

11.63 Complete the following ANOVA table by calculating the MS column and F ratio:
SOURCE DF SS MS F
Regression 4 136
Error 21 630
Total 25 769

11.64 Complete the omitted information from the following ANOVA table if the model contains
3 explanatory variables and the sample size is 32.
SOURCE DF SS MS F
Regression 60
Error
Total 90


11.65 Complete the following ANOVA table, determine R², and test whether R² is significantly greater than 0 at the .01 level.
SOURCE DF SS MS F
Regression 6 240
Error 120
Total 18

11.66 Complete the following ANOVA table, determine R² and s, and test whether the model is significant at the .01 level.
SOURCE DF SS MS F
Regression 2 0.30
Error 50 0.04
Total

11.67 Determine four things that are wrong in the following ANOVA table:
SOURCE DF SS MS F p
Regression 4 180 40 2.0 .001
Error 25 500 20
Total 30 640


11.4 Making Predictions by Regression Inference
A distinguishing characteristic of business statistics is its emphasis on prediction and forecasting. Anticipating outcomes is crucial to the survival and success of most businesses, whether that means predicting the market share of your new product or the time it will take to complete a contract if your company wins the bid.
Regression analysis has become one of the most effective forecasting methods in use by
businesses today. Because it explains variation in a single target variable by fitting
scientifically-collected data to a well-designed model, regression can be the best tool for
predicting outcomes and forecasting the future. In many business applications, estimating the
dependent variable is the primary reason for designing a model and fitting a regression equation.
We saved this important topic until now in order to use many statistics and modeling concepts
developed in preceding sections.
In Chapter 4, we used regression to forecast the stock market and predicted the volume of
mail. We generalized this procedure to make point predictions from any simple or multiple
regression model. By substituting the values for each explanatory variable, the estimated
regression equation may be used to make these predictions. To estimate the employment for

plant size of 25 thousand square feet and $1 million investment, for example, we supply the
values size = 25 and invest = 1 to the regression equation that we fit from the sample data:
predicted-jobs = 16.7 + 0.175 size + 4.02 invest
The point estimate therefore rounds to 25 jobs because
predicted-jobs = 16.7 + 0.175(25) + 4.02(1) = 25.1
As we learned from Chapter 8, however, there may be little chance that this particular
factory creates exactly 25 jobs unless the margin of error for point estimates is very small. To
indicate that 25 jobs is only the likeliest outcome among many others possible, the IDC should
report a range likely to contain the number of jobs actually created. Thus, confidence intervals
are preferable for decisions involving estimation.
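As a quick illustration of how such a point prediction is obtained, the sketch below (in Python; the function name is ours) simply plugs the supplied values into the coefficients reported for the fitted IDC equation.

def predicted_jobs(size, invest):
    # point prediction from the fitted IDC equation reported above
    return 16.7 + 0.175 * size + 4.02 * invest

print(predicted_jobs(25, 1))   # 25.095, which rounds to about 25 jobs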

Estimating the Conditional Mean of the Dependent Variable
Before we can define or calculate this interval, we must first distinguish between the two
types of Y-estimation problems confronting business:
predicting individual observations of Y
estimating the conditional mean of Y
Recall that the conditional mean of the dependent variable, μ_{Y | X1, X2, . . . , Xk}, is the population mean of Y for a particular set of explanatory variable values in the regression.
To illustrate prediction of an individual observation, suppose the industrial development corporation learns that a factory in its region has announced an expansion. What if the parent firm has yet to disclose planned employment? Or perhaps hiring plans haven't been finalized, so the firm doesn't know itself. The IDC can use the least-squares equation to predict new plant employment and thereby update school enrollment, utility usage, and tax revenue projections for the region.
On the other hand, suppose the IDC administers a regional grant program that assists many expansions, each with 25,000 square feet and $1 million of investment. The IDC would then need to estimate the mean job creation, a conditional mean because it is conditional on specific values of the size and invest variables.

In section 1, we learned that Ŷ is an efficient estimator of the conditional mean and also of individual predictions. Thus, both types will have the same point prediction we calculated earlier:

predicted new jobs at one factory = estimated conditional mean at many factories = 25

However, the confidence intervals for each kind of prediction may have very different widths.
To understand why, we begin by comparing the conditional mean to another mean that we previously examined, the univariate mean. To estimate μ, we used the unbiased estimator X̄, the estimated standard error s/√n, and a t distribution for the sampling distribution. The interval for μ_{Y | X1, X2, . . . , Xk} is also centered at an unbiased estimator, which is the fitted value Ŷ in regression. If the random disturbances ε are approximately normal, the sampling distribution for unknown σ is again a t distribution. The final parallel to univariate interval estimation is in the formula for s_Ŷ, the estimated standard error of the conditional mean.
DEFINITION: The estimated standard error of the conditional mean, s_Ŷ, is directly proportional to SEE, attains a minimum of SEE/√n when all explanatory variables are at their means, and increases as explanatory variables move farther away from their means.
Like s/√n, s_Ŷ at its minimum is also inversely proportional to the square root of the sample size. The explanation is logical. The sample mean of the dependent variable is Ȳ. The advantage of regression comes from using variable relationships to alter our predictions according to how far the explanatory variables are above or below their means. If none of the explanatory variables departs from its mean, however, we lose this advantage and an unbiased regression equation should estimate Y simply by Ȳ. Knowledge that the explanatory variables are at their means results in a smaller standard error (see Exercises for a comparison).
However, s_Ŷ is influenced by yet another factor. Variability in estimating the conditional mean increases with the distance of the explanatory variables from their sample means. This relationship is difficult to show in general, because the complex multivariate computations of s_Ŷ are usually left to the computer. Nevertheless, the general properties of s_Ŷ may be discerned from the special case of simple regression. If the regression contains only one explanatory variable, s_Ŷ has a relatively simple formula:

DEFINITION: In simple regression, when the value of the explanatory variable equals X_P, s_Ŷ is

s_Ŷ = SEE √[ 1/n + (X_P − X̄)² / Σ(X − X̄)² ]

Notice that this reduces to s_Ŷ = SEE/√n if X_P is the mean X̄, because the second term under the square root is then zero. The reason for this second term is that sampling error causes errors in estimating the regression coefficients βj. Any sampling error present in bj becomes magnified in s_Ŷ the farther X_P is from X̄. This is why the second term contains (X_P − X̄)².
Assembling all these elements, confidence intervals take on a familiar form.
Confidence Intervals for the Conditional Mean of Y:

(1 − α)·100% C.I. for μ_{Y | X1, X2, . . . , Xk} = Ŷ ± t_α/2(n−k−1) s_Ŷ

Confidence intervals for conditional means are narrower for larger samples, better fitting models (because SEE is smaller), and explanatory variables near their means.
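For the simple-regression case, the formula for s_Ŷ and the interval above can be computed directly. The sketch below (Python, with NumPy and SciPy assumed available; the function name is ours) implements exactly those formulas for a sample of x and y values and a chosen value X_P.

import numpy as np
import scipy.stats as st

def conditional_mean_ci(x, y, x_p, conf=0.95):
    # CI for the conditional mean of y at x_p in a simple regression, using the formulas above
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    see = np.sqrt(np.sum(resid ** 2) / (n - 2))                   # standard error of the estimate
    s_yhat = see * np.sqrt(1 / n + (x_p - x.mean()) ** 2 / sxx)   # std. error of the conditional mean
    y_hat = b0 + b1 * x_p
    t_crit = st.t.ppf(1 - (1 - conf) / 2, n - 2)
    return y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat

Because s_yhat grows as x_p moves away from the sample mean of x, the interval returned by this function widens at the edges of the data, which is exactly the pattern shown in Figure 11.26 below.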
[Fitted line plot: Hospital Stay (Days) versus Patient Age, showing the regression line Y = 6.78 + 0.14 X (R-Squared = 0.332), the 95% CI band, and the title "Confidence Intervals for the Conditional Mean".]
Figure 11.26

Figure 11.26 presents graphically the estimated conditional mean and surrounding
confidence interval for a simple regression, the hospital stay case from section 3. The Minitab
commands and menu sequence are found in Figure 11.27. Observe that the confidence interval
band widens as we move further in either direction from the mean of the explanatory variable.
Using Minitab to Plot Conditional Mean or Prediction Interval Confidence Bands
Pull-Down Menu sequence: Stat > Regression > Fitted Line Plot...
Then complete the Regression DIALOG BOX as follows:
(1) find dependent variable on listing and double click to place it in Response (Y): box
(2) place explanatory variable (simple regressions only) in Predictor (X): box.
(3) check Display Options... box(es) to Display confidence bands or prediction bands
(4) change Confidence Level: or add Title: if desired, then click OK to obtain bands plot
Figure 11.27

Let's examine confidence intervals for μ_{Y | X1, X2, . . . , Xk} for the industrial development case. Figure 11.29 describes the steps for obtaining these confidence intervals using Minitab.
Using Minitab to Obtain Predictions and Conditional Means from Regressions
Pull-Down Menu sequence: Stat > Regression > Regression...
Then complete the Regression DIALOG BOX as follows:
(1) find dependent variable on listing and double click to place it in Response: box
(2) double click on explanatory variable in listing to place it in Predictor: box.
(3) click on the Options... at the bottom to open the Regression Options dialog box
(4) click in area below Prediction intervals for new observations: and type the values of each
explanatory variable in the same order as these variables are listed in the regression.
(5) click on OK in this dialog box and then on OK in the Regression dialog box.
Figure 11.29

Regression Analysis

The regression equation is
jobs = 16.7 + 0.175 size + 4.02 invest

[rest of output is same as before]

Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
25.12 1.97 ( 21.17, 29.08) ( -4.69, 54.93)
Figure 11.31


The regression output will contain the same information as before, but now two new lines containing the prediction and the confidence interval for the conditional mean are printed at the bottom (see Figure 11.31). The Fit column reports the point prediction for factory expansions of 25,000 square feet and $1 million (i.e., size = 25 and invest = 1). The fit of 25.1 jobs matches the value we just calculated directly from the regression equation. Next is printed s_Ŷ, called Stdev.Fit (short for standard deviation of the fit). The multiple regression formula for s_Ŷ normally makes it difficult to verify with a calculator. In this example, however, s_Ŷ = 1.97 happens to equal SEE/√n. You can check this yourself because the Minitab output reports SEE = 14.73 and the sample contains 56 factories. Thus,

SEE/√n = 14.73/√56 = 1.97.
The descriptive statistics in Figure 11.32 allow us to solve this little mystery. Notice that the means for size (24.75) and invest (1.051) among all factory expansions in the sample are nearly identical to the values we supplied to the equation.
Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
size 56 24.75 18.25 20.81 26.30 3.51
invest 56 1.051 0.663 0.752 1.679 0.224
Figure 11.32

The third column at the bottom of the regression output (Figure 11.31) contains the 95% confidence interval for the conditional mean (labeled 95.0% C.I.). This interval will be centered at Ŷ = 25.1 and, for larger samples such as this, will have endpoints approximately two s_Ŷ on either side of Ŷ. In this case,

Ŷ ± 2 s_Ŷ = 25.12 ± 2(1.97) = (21.18, 29.06)

is nearly the same as the confidence interval reported. With a 95% level of confidence, the IDC therefore concludes that its industrial expansion program will create between 21 and 29 jobs on average at factories receiving regional grants.
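Using the exact t multiplier instead of the rough factor of two reproduces the Minitab interval almost to the penny. A minimal check (Python, with SciPy assumed available):

import scipy.stats as st

fit, s_yhat, n, k = 25.12, 1.97, 56, 2
t_crit = st.t.ppf(0.975, n - k - 1)                  # about 2.006 with 53 degrees of freedom
print(fit - t_crit * s_yhat, fit + t_crit * s_yhat)  # roughly (21.2, 29.1), matching the 95.0% C.I.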


Predicting Individual Outcomes of the Dependent Variable
Confidence intervals for the conditional mean have only limited application in business statistics. Such applications must concern the mean of many observations with identical values of all explanatory variables. We just examined such an example involving many factory expansions with the same investment and floor space. For an application to the used car case, national rental car companies buy thousands of the same car models each year in order to enjoy quantity discounts and standardize their service departments. Companies often dispose of these fleets after two years. They might use estimates of the conditional mean to anticipate the average price at which they could sell these used fleet cars.
However, the most frequent application of regression analysis is to predict or forecast
individual outcomes. Earlier in this section, we discussed this second type of problem involving
estimation of Y. The industrial development corporation must predict the number of new jobs
resulting from a particular factory expansion. Unlike the previous examples, you are not
interested in a range for the mean job creation of many identical expansions. Instead, you are
only concerned about a single factory with a specific but unknown number of new jobs.
Remember from Chapter 9 that the sample means vary far less than the individual
observations. Thus, individual predictions are usually subject to substantially larger margins of
error than estimates of the conditional mean. Because confidence intervals for conditional means
are narrower, it is especially tempting to report them in place of prediction intervals.
Unfortunately, the confidence interval for the conditional mean is misleading and inappropriately
precise when applied to individual predictions.
Business problems about conditional means arise far less often than individual predictions because the circumstances leading to each outcome are seldom repeated. When making a prediction, never report the narrower confidence interval of the conditional mean.
How are prediction intervals calculated, and what determines how much wider they are than the conditional mean intervals? In section 2, we fit a random sample of 54 used cars to a model relating car prices to engine size, horsepower, and age. Suppose your old car is on its last legs, and you notice an ad in the classifieds that describes a four-year-old car with a 2200 cubic centimeter, 120 horsepower engine. To decide whether the $6000 asking price is a bargain, you could make a point prediction of about $6750 using the sample regression results from section 2:

predicted-Price = 4642 + 0.266 EngSize + 49.8 HP − 1114 YrsOld
= 4642 + (0.266)(2200) + (49.8)(120) − (1114)(4)
= 4642 + 585.2 + 5976 − 4456 = 6747.2


Although the asking price is about $750 less, a confidence interval centered at $6750 would be a better basis for your purchasing decision. This interval might indicate that even better bargains are common in the market. Fortunately, inferential methods can also help us find this interval. Unlike the rental car company, however, you are not interested in an interval for the mean price of an entire car fleet. Your more modest concern is estimating an individual car price. Predictions are subject to two types of error.

Prediction errors in regression come from two different sources:
(1) Sampling errors in estimating the population equation.
(2) Individual population data variation around the population equation.

Confidence intervals for the conditional mean involve only sampling errors, which are estimated by s_Ŷ. For example, the coefficients in our sample regression are estimates of the unknown population parameters β0, β1, β2, and β3 in the used car equation. These are the only errors involved with estimating the conditional mean of car prices. Sampling errors also contribute to overall prediction errors because predictions must use the sample equation rather than the population equation.
By contrast, the variability of the data around the population equation is estimated by the standard error of the estimate, SEE. Population equations such as the used car model are only a
simplified version of reality. Only the most important variables are included and data on these
variables are subject to measurement error. For example, optional equipment such as automatic
transmission and sports packages may also have some effect on used car prices. Furthermore,
the age of a car is an imperfect measure of aging technologically, old-fashioned styling, and
general wear and tear. Thus, even the true population equation could not prevent substantial
prediction errors. On the other hand, because they result from variation around the mean, these
errors do not contribute to errors in estimating the conditional mean.
The estimated standard error of the prediction, s_P, combines the effect of both sources of prediction error.

DEFINITION: The estimated standard error of the prediction, s_P, is

s_P = √( SEE² + s_Ŷ² )

The confidence interval for predictions, the prediction interval, may then be constructed.


Prediction Intervals for an Individual Observation:

(1 − α)·100% P.I. for Y = Ŷ ± t_α/2(n−k−1) s_P


After specifying 2200, 120, and 4 in the Prediction intervals for new observations:
area of the Options... dialog box, the Minitab regression output for the used car model is shown
in Figure 11.33. Notice first that the point prediction Fit, $6753, is the same (except for
rounding errors) as we calculated earlier. The 95% prediction interval, 95% P.I., is printed on
the far right of the same line: ($3547, $9959). Why is it 7.4 times wider than the ($6319, $7187)
confidence interval for the conditional mean?

Regression Analysis

The regression equation is
Price = 4642 + 0.266 EngSize + 49.8 HP - 1114 YrsOld
[intermediate output omitted here]
s = 1581 R-sq = 79.4% R-sq(adj) = 78.2%
[rest of output is same as before]


Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
6753 216 ( 6319, 7187) ( 3547, 9959)
Figure 11.33

The answer reflects the relationship between s_P and s_Ŷ, the only difference between the two types of intervals. Observe from Figure 11.33 that SEE is 1581 while s_Ŷ is only 216. Thus, the estimated standard error of the prediction is:

s_P = √( SEE² + s_Ŷ² ) = √( 1581² + 216² ) = 1596

which is also about 7.4 times larger than s_Ŷ = 216. Four-year-old used cars with 2200 cubic centimeter, 120 horsepower engines may sell for anywhere from about $3500 to $10,000. A dealer asking price of $6000 suddenly doesn't look like such a bargain. Sure, it is $750 below the point prediction. However, the much lower left end of the prediction interval indicates room for haggling or shopping around for substantially better deals.
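The same arithmetic, combined with the exact t multiplier, recovers both s_P and the reported prediction interval. A minimal check (Python, with SciPy assumed available; the numbers come from Figure 11.33):

import math
import scipy.stats as st

see, s_yhat = 1581.0, 216.0              # SEE and Stdev.Fit from Figure 11.33
s_p = math.sqrt(see ** 2 + s_yhat ** 2)  # about 1596
fit, n, k = 6753.0, 54, 3
t_crit = st.t.ppf(0.975, n - k - 1)      # 50 degrees of freedom
print(round(s_p), fit - t_crit * s_p, fit + t_crit * s_p)   # roughly ($3548, $9958), matching the 95.0% P.I.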
Most students at this point are disappointed that their prediction intervals are not narrower. After all, R² tells us that almost four-fifths of the variation in price is explained by the model. Yet the best we can do is an interval more than $6400 wide! Before you throw in the towel and give up on regression, remember that this is a 95% prediction interval. Because the sample is fairly large, the 95% P.I. is closely approximated by Ŷ ± 2 s_P; in this case, 6750 ± 2(1600) = 6750 ± 3200. Many people would accept an interval containing two out of three car prices in the population, so a margin of error of approximately one s_P would suffice. Suddenly, our prediction interval would be half as wide, from roughly $5150 to $8350.
Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
EngSize 54 2140.7 2000.0 2118.7 608.0 82.7
HP 54 117.67 115.00 116.98 30.48 4.15
YrsOld 54 4.019 4.000 4.021 1.898 0.258

Variable Min Max Q1 Q3
EngSize 1000.0 3800.0 1750.0 2400.0
HP 53.00 205.00 95.25 140.00
YrsOld 1.000 7.000 2.750 6.000
Figure 11.34

As the descriptive statistics in Figure 11.34 show, the used car whose price we predicted had engine characteristics and age very close to the sample means of 2140.7 cubic centimeters, 117.67 horsepower, and 4.019 years of age. Earlier, we stated that confidence intervals for the conditional mean widen when explanatory variables are not near their means. Let's see the effect on prediction intervals for a used car with values farther from those means: a two-year-old car with a smaller, 1200 cubic centimeter, 75 horsepower engine. The prediction line of the Minitab output for this car is shown in Figure 11.35.
Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
6472 503 ( 5460, 7483) ( 3138, 9805)
Figure 11.35

Although the explanatory variables are now closer to their minimum or maximum values than to
their means, the new prediction interval ($3138, $9805) is $6667 wide, only 4% wider than
before. By contrast, the confidence interval for the conditional mean ballooned 133% in width
(from $868 to over $2000!). Using Minitab's fitted line plots again, Figure 11.36 shows on a single graph the confidence and prediction interval bands for the hospital stay case.
Compared with confidence intervals for the conditional mean, prediction interval widths are not reduced much by larger samples or by explanatory variables near their means. Good-fitting models usually produce the narrowest prediction intervals.

Figure 11.36

Recall the plus-or-minus two SEE rough approximation used in Chapter 4. We now know that s_P must be larger than SEE. But because SEE usually dominates the smaller s_Ŷ in the s_P formula, we can refine this approximation a bit. The following advice is serviceable if we have to make quick-and-dirty predictions out in the field.

Except for smaller samples and predictions from extreme values of explanatory variables, the standard error of the estimate is only slightly less than s_P. Thus, a 95% prediction interval is seldom more than 10% wider than ± 2 SEE.

For example, our first used car prediction interval had a width only 1% more than 2 SEE would have estimated. Even with the more extreme values used for the second used car prediction, the interval width increased to only 5% over the SEE approximation.

Extrapolation Revisited
In Chapter 4, you were cautioned about another possible source of prediction error: extrapolation. Because time was an explanatory variable in the trend line equations of Chapter 4, extrapolation was readily apparent there. Recall that trend forecasts, by definition, are based on time periods later than those used to fit the trend data. Thus, trend forecasting always involves extrapolation, since we have no access to future data. We saw that extrapolating the Great Bull Market trend resulted in a 100% forecast error in 1998 because previous trends failed to persist. At the time, we confined the concern to time trend forecasting, so our advice was to include the warning: assuming past trends continue. However, extrapolation error can occur in any type of regression prediction, and often we can do more than just warn people about it.

DEFINITION: Extrapolation error is error caused by making predictions with values of one or more explanatory variables outside the range of the data used to fit the sample regression.
Until now, we assumed that inferences about predictions and conditional means were based on fitting a sample from the relevant population. Extrapolation errors can result if we fit a sample that does not cover the range of values that will be used to make predictions. Just as with simple trend models, different coefficients may apply to variable relationships outside the sampled population ranges. In extreme cases, coefficient signs may reverse, or other, more important explanatory variables may need to be added to the model.
If we blindly forged ahead, the computer would still allow us to make predictions. For example, what if we attempted to predict the price of a ten-year-old used car even though only cars no more than seven years old were sampled? Figure 11.37 displays the bottom lines of the Minitab regression output for a ten-year-old car with EngSize = 2200, HP = 120, and YrsOld = 10.
Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
72 774 ( -1483, 1627) ( -3465, 3608) X
X denotes a row with X values away from the center
Figure 11.37
Extrapolation error led to an unrealistic point prediction of $72, and nearly half the prediction interval (and the confidence interval for the conditional mean) is negative.¹⁸ Even ten-year-old cars are worth a lot more than that!
Extrapolation is an especially troubling source of error because there is no way to adjust the prediction interval for its effects. Extrapolation introduces an unknown amount of error, so we cannot know how closely older cars conform to the fitted relationships found for more recent cars. Lacking any alternative, Minitab reports the 95.0% C.I. and 95.0% P.I. calculated as if no extrapolation had occurred. Fortunately, a warning is issued alongside this information when extrapolation is detected. Notice the X at the right and the message:¹⁹

X denotes a row with X values away from the center

This alerts us that the prediction requires extrapolation beyond the range of at least one explanatory variable. To identify which variables are the culprits, check their values against the Min and Max sample range limits in the descriptive statistics (Figure 11.34). Only YrsOld = 10 lies outside these min-max ranges, so age is solely responsible for the extrapolation.
The worst thing you can do with extrapolation problems is to ignore them. Sometimes the predictions are valid because the variable relationships extend unchanged beyond the sample that we fit. Nevertheless, you should always clearly alert whoever uses your predictions that the inferences were based on inappropriate data ranges.
Happily, an ounce of prevention is worth a pound of cure for most extrapolation
problems. A carefully designed sample may prevent extrapolation errors entirely. In the used
car case, for example, we should sample from a population more closely related to the kinds of
cars targeted for prediction: older cars or at least a wider range of car ages. Older cars depreciate
much more slowly, so the coefficient of YrsOld should naturally be smaller. For very old cars, in


¹⁸ Check the prediction interval for the IDC (back in Figure 11.31). A negative lower end of a prediction interval may occur even when extrapolation is not a problem. If the variable we are predicting should not be negative, such as job creation, the sensible practice is to replace this value with zero.

¹⁹ If severe extrapolation is involved, an XX or XXX may be printed instead.

fact, this coefficient could turn positive. The popularity of classic auto shows and magazines
attest to the high prices of old cars.
One final comment about extrapolation: don't confuse it with the wider intervals that result from explanatory variables not near their means. Those prediction intervals have already been adjusted for the increased margin of error. Intervals reported in the Minitab output are also wider when an out-of-range value is used for prediction; however, using out-of-range values to make predictions additionally causes extrapolation errors that are not adjusted for in the output.
The unknown effects of extrapolation cannot be corrected. Extrapolation may be
avoided by including in sampled data the values you will use for predictions.
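Minitab's X warning can be mimicked by a simple range check before any prediction is made. The sketch below (Python; the function and variable names are ours) compares each new value against the sample minimum and maximum from Figure 11.34 and reports which variables would require extrapolation.

def extrapolation_flags(new_values, sample_min, sample_max):
    # return the names of explanatory variables whose new values fall outside the sampled range
    return [name for name, v in new_values.items()
            if v < sample_min[name] or v > sample_max[name]]

new_car = {"EngSize": 2200, "HP": 120, "YrsOld": 10}
mins = {"EngSize": 1000, "HP": 53, "YrsOld": 1}
maxs = {"EngSize": 3800, "HP": 205, "YrsOld": 7}
print(extrapolation_flags(new_car, mins, maxs))   # ['YrsOld'] -- only age lies outside the sample range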

CASE MINI-PROJECT:
United Way predicts the donations of two firms using information on wages, employment, and last year's giving. It collects a random sample of 55 participating companies and fits the model:
Giving = β0 + β1 Wages + β2 EMP + β3 GiveLast + ε
where the variables in the model are defined as:
Giving Total amount raised for United Way at each company (in dollars)
Wages Ave employee annual wage (in thousands of dollars)
EMP Number of employees at the company
GiveLast Total amount raised last year at the same company (in dollars)
Use the descriptive statistics and regression results to answer the subsequent questions:

Variable N Mean Median TrMean StDev SEMean
Wages 55 27.42 28.00 26.98 10.41 1.40
EMP 55 203.0 150.0 180.9 143.0 19.3
GiveLast 55 6202 4460 5425 6238 841

Variable Min Max Q1 Q3
Wages 12.00 60.00 17.00 34.00
EMP 100.0 800.0 120.0 220.0
GiveLast 100 28208 1807 7955

The regression equation is
Giving = - 449 + 43.8 Wages + 2.93 EMP + 0.661 GiveLast
[Other output omitted here]
s = 2210 R-sq = 81.1% R-sq(adj) = 80.0%
[Other output omitted here]
Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
5440 298 ( 4841, 6038) ( 962, 9917)
PREDICTION FOR Wages = 27.5, EMP = 200, and GiveLast = 6200 FIRM 1

26104 12251 ( 1503, 50704) ( 1106, 51101) XX
PREDICTION FOR Wages = 45, EMP = 5000, and GiveLast = 15000 FIRM 2
X denotes a row with X values away from the center
XX denotes a row with very extreme X values

1. Check with the fitted equation that $5440 is the prediction of Firm 1's giving (Show work).
2. We are 95% confident that Firm 1 will donate from $______ to $______ this year.
3. Firm 1's 95% P.I. is about $5440 ± 2s because Wages, EMP, and GiveLast are near their ______ values.
4. The XX extrapolation warning attached to the Firm 2 prediction results from trying to predict
corporate giving based on a level of Wages, EMP, GiveLast (circle one) outside the sample
data range used to fit the model.


Multiple Choice Review:
11.77 Which of the following measures the interval for the average value of the dependent
variable given particular values of the explanatory variables?
a. the prediction interval for Y
b. the confidence interval for the conditional mean of Y
c. the forecast interval for Y
d. the univariate confidence interval for Y
e. none of the above

11.78 For which of the following situations would I use a prediction interval?
a. estimating the number of defects in a car rolling off the assembly line at 4 P.M.
b. estimating the mean time spent by sales clerks with customers of a particular age
c. estimating the average number of years that CEOs retain their jobs if they have been with that same corporation 10 years
d. all of the above
e. none of the above

11.79 The standard error of the conditional mean s_Ŷ is related to the standard error of the estimate s in that the former is equal to
a. s
b. s/√n
c. s only when all explanatory variables are at their means
d. s/√n only when all explanatory variables are at their means
e. always larger than s


11.80 Which of the following is true about the standard error of the conditional mean s_Ŷ?
a. s_Ŷ is larger the farther the explanatory variables are from their means
b. s_Ŷ is larger the closer the explanatory variables are to their means
c. s_Ŷ is unaffected by the explanatory variables; only Ŷ is affected
d. s_Ŷ is larger the closer the explanatory variables are to one another
e. none of the above

11.81 In comparing the confidence interval for the conditional mean with the prediction
interval, which of the following is true?
a. prediction intervals are larger and more affected by extreme values for explanatory
variables
b. prediction intervals are larger but less affected by extreme values for explanatory
variables
c. prediction intervals are smaller but more affected by extreme values for explanatory
variables
d. prediction intervals are smaller and less affected by extreme values for explanatory
variables
e. comparisons depend on the specific model and sample being analyzed

11.5 Forecasting Methods and Forecast Modeling
Forecasting has always been an essential part of business because the decisions made today affect the future. Quality employees cannot be recruited at a moment's notice, so the personnel department needs advance notice of increased staffing needs. Thanking customers for making their product number one is embarrassing if sales drop to second place by the time the ad campaign reaches the airwaves. Sales, inventory, interest rate, and unemployment forecasts are often crucial to informed action by decision makers. This section is devoted to regression forecasting methods and how to evaluate the forecast performance of models.

Forecast Intervals in Regression Models
Regression predictions and their estimated standard errors were introduced in section 4. In Chapter 4, we defined forecasting as a form of prediction involving time series data analysis. Most of these prediction interval concepts and methods may be transferred directly to time series forecasting. The (1 − α)·100% prediction interval then becomes our forecast interval, based on an estimated standard error of the forecast, a point estimate, and a t distribution for the sampling distribution.
DEFINITION: In time series forecasting, the estimated standard error of the forecast, s_F, is the estimated standard error for the prediction, and the forecast interval is the prediction interval Ŷ ± t_α/2 s_F.
In Chapter 4, we used simple regression with a time trend variable to introduce forecasting of stock market indexes. We now examine a more sophisticated and realistic regression model and apply inference methods to stock market forecasting.

Chapter Case # 5: Wall Street Lays an Egg
During the past two decades, athletic shoes have moved from the track to the workplace, and the leader of this footwear revolution has been Nike. By stressing quality, comfort, and style, Nike elevated the humble sneaker to a status symbol with $100 price tags and high profit margins. Constant design changes, cost-conscious contracting, and endorsements from sports celebrities like Michael Jordan held off Reebok and other market share challengers. A darling of Wall Street as well, Nike stock outperformed even the runaway Dow by increasing thirtyfold.

Suppose a market analyst constructs the following model to forecast Nike stock prices, Stock P, in dollars per share. Figure 11.38 from the Excel spreadsheet displays the time series data for 51 quarters gathered on Stock P and three explanatory variables:²⁴

DJIA Dow-Jones Industrial Average
Revenue Nike Sales (in billions of dollars)
Dividend Nike Quarterly Dividend ($/share)

Data from the second quarter of 1983 through the end of 1995 were fit to the following model:

Stock P = β0 + β1 DJIA + β2 Revenue + β3 Dividend + ε

The prediction options of Minitab were used to forecast the Nike stock price for each of the four quarters of 1996 (see Figure 11.39).
The model contains variables generally believed to have a major impact on stock prices. A major corporate stock such as Nike tends to follow the overall ups and downs of the Dow, an index of blue chip stocks. However, the success of a company's products, measured here by sales revenue, also affects its stock market performance. Finally, dividends are paid out of profits. If investors interpret higher dividends as funds not reinvested in modernizing and expanding the company, the stock price may decline.
The market analyst notes that the Minitab regression output reports a fit close to 90% and that all explanatory variables test significant at the .05 significance level.

Regression Analysis


²⁴ Nike stock prices (after adjusting for stock splits in 1984 and 1990), dividends, and the DJIA are from The Daily Stock Price Record (OTC until 1990, NYSE thereafter). Revenues are found on Nike's history page of its website at www.nike.com.
Year Qtr Stock P DJIA Revenue Dividend
1983 2 4.125 1221.96 0.87 0.00
1983 3 4.875 1233.13 0.87 0.00
1983 4 5.000 1258.64 0.92 0.00
1984 1 5.750 1164.89 0.92 0.10
1984 2 5.000 1132.40 0.92 0.10
1984 3 4.500 1206.71 0.92 0.10
1984 4 3.375 1211.58 0.96 0.10
1985 1 4.375 1266.00 0.96 0.10
1985 2 5.000 1301.72 0.96 0.10
1985 3 6.875 1326.63 0.96 0.10
1985 4 7.625 1546.67 1.07 0.10
1986 1 9.875 1818.61 1.07 0.10
1986 2 9.750 1892.72 1.07 0.10
1986 3 5.125 1767.58 1.07 0.10
1986 4 5.875 1895.95 0.88 0.10
1987 1 9.625 2304.69 0.88 0.10
1987 2 8.375 2418.53 0.88 0.10
1987 3 11.375 2596.28 0.88 0.10
1987 4 9.750 1938.83 1.20 0.10
1988 1 11.125 1988.06 1.20 0.10
1988 2 14.250 2141.71 1.20 0.10
1988 3 14.750 2112.91 1.20 0.10
1988 4 13.250 2168.57 1.70 0.15
1989 1 16.875 2293.62 1.70 0.15
1989 2 20.375 2440.06 1.70 0.15
1989 3 33.250 2692.82 1.70 0.15
1989 4 26.625 2753.20 2.00 0.20
1990 1 33.875 2707.21 2.00 0.20
1990 2 38.375 2880.69 2.00 0.20
1990 3 31.250 2452.48 2.00 0.20
1990 4 40.250 2633.66 3.00 0.20
1991 1 46.125 2913.86 3.00 0.14
1991 2 36.125 2906.75 3.00 0.14
1991 3 54.125 3016.77 3.00 0.14
1991 4 72.375 3168.83 3.40 0.15
1992 1 67.375 3235.47 3.40 0.15
1992 2 62.125 3318.52 3.40 0.15
1992 3 79.500 3271.66 3.40 0.15
1992 4 83.000 3301.11 3.90 0.20
1993 1 76.625 3435.11 3.90 0.20
1993 2 55.125 3516.08 3.90 0.20
1993 3 45.000 3555.12 3.90 0.20
1993 4 46.250 3754.09 4.17 0.20
1994 1 53.000 3635.96 4.17 0.20
1994 2 59.750 3624.96 4.17 0.20
1994 3 59.250 3843.19 4.17 0.20
1994 4 74.625 3834.44 4.28 0.25
1995 1 75.000 4157.69 4.28 0.25
1995 2 84.000 4556.10 4.28 0.25
1995 3 111.125 4789.08 4.28 0.25
1995 4 138.500 5117.12 4.97 0.15
Figure 11.38


The regression equation is
Stock P = - 26.2 + 0.0175 DJIA + 12.8 Revenue - 93.2 Dividend

Predictor Coef Stdev t-ratio p
Constant -26.201 5.236 -5.00 0.000
DJIA 0.017457 0.004281 4.08 0.000
Revenue 12.828 3.104 4.13 0.000
Dividend -93.16 44.06 -2.11 0.040

s = 10.93 R-sq = 89.2% R-sq(adj) = 88.5%

Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
121.11 7.79 ( 105.44, 136.79) ( 94.10, 148.13) XX
122.29 8.01 ( 106.17, 138.42) ( 95.01, 149.57) XX
126.27 8.80 ( 108.56, 143.97) ( 98.02, 154.51) XX
142.73 12.17 ( 118.23, 167.23) ( 109.80, 175.66) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values
Figure 11.39





To make the four quarterly forecasts, 1996 data for the three explanatory variables were supplied to the sample regression (see Figure 11.40). Figure 11.39 contains these four lines of forecast information.²⁵ By 1996, the Dow and Nike sales revenue were over twice the means of the sample data used to fit the model (see the descriptive statistics in Figure 11.41). Only dividends remained near their mean of 0.14. From section 4, we know that the forecast intervals will therefore be widened. In addition, the continued rise of the Dow above its sample maximum (5117) will produce forecasts with an unknown amount of extrapolation error.


²⁵ Minitab allows you to make any number of predictions or forecasts from the same Regression Options dialog box. The 1996 data on DJIA, Revenue, and Dividend were copied into three new columns of the Minitab worksheet, and the column names (rather than the data values) were typed into the area following Prediction intervals for new observations.
Year Qtr Stock P DJIA Revenue Dividend
1996 1 162.125 5587.14 4.97 0.15
1996 2 102.750 5654.63 4.97 0.15
1996 3 121.750 5882.17 4.97 0.15
1996 4 120.000 6448.27 5.12 0.10
Figure 11.40

Descriptive Statistics

Variable N Mean Median TrMean StDev SEMean
DJIA 51 2602 2596 2550 1028 144
Revenue 51 2.287 1.700 2.233 1.367 0.191
Dividend 51 0.14255 0.15000 0.14489 0.06082 0.00852

Variable Min Max Q1 Q3
DJIA 1132 5117 1819 3319
Revenue 0.870 4.970 0.960 3.900
Dividend 0.00000 0.25000 0.10000 0.20000
Figure 11.41

It is easy to verify that the point forecast Fits are calculated correctly (except for
rounding errors). For example, the first quarter forecast for 1996 is:
forecast-Stock P = -26.2 + 0.0175 DJIA + 12.8 Revenue - 93.2 Dividend
= -26.2 + 0.0175 (5587.14) + 12.8 (4.97) - 93.2 (0.15)
= 121.2 dollars/share

Compare these fits with the actual 1996 stock prices for Nike compiled conveniently in Figure
11.42. Of the four forecasts (121, 122, 126, and 143), only the third quarter forecast is very close
to the actual Nike stock price ($121.75/share). Two other forecasts had errors of about
$20/share, and the first quarter forecasting error was the largest of all (over $40/share). Later,
we will show how to use this error information to evaluate the forecasting performance of a
model. As we saw with prediction, however, individual forecasts should be reported as intervals.

Fit Stdev.Fit 95.0% C.I. 95.0% P.I. Actual
121.11 7.79 ( 105.44, 136.79) ( 94.10, 148.13) XX 162
122.29 8.01 ( 106.17, 138.42) ( 95.01, 149.57) XX 103
126.27 8.80 ( 108.56, 143.97) ( 98.02, 154.51) XX 122
142.73 12.17 ( 118.23, 167.23) ( 109.80, 175.66) XX 120
Figure 11.42

The market analyst may use the 95% P.I. column in Figure 11.42 to find forecast intervals. Unfortunately, as we anticipated from the sample descriptive statistics, the XX warns about severe out-of-range extrapolation errors. Moreover, the forecast intervals are 25% to 50% wider than the ± 2 SEE intervals we would have obtained from fitting data nearer the sample means. Nevertheless, notice that three of the four quarterly forecast intervals included the actual stock price for Nike. The model failed to capture only the temporary upsurge in the first quarter.
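One simple way to summarize this error information, ahead of the formal forecast-evaluation measures discussed later, is to compute the individual forecast errors and their average absolute size. A minimal sketch in Python, using the actual prices from Figure 11.40 and the point forecasts from Figure 11.42:

actual   = [162.125, 102.750, 121.750, 120.000]   # 1996 Nike prices (Figure 11.40)
forecast = [121.11, 122.29, 126.27, 142.73]       # point forecasts (Figure 11.42)
errors = [a - f for a, f in zip(actual, forecast)]
mae = sum(abs(e) for e in errors) / len(errors)    # mean absolute forecast error
print([round(e, 1) for e in errors], round(mae, 1))
# errors of roughly +41, -20, -5, and -23 dollars per share, for a mean absolute error near $22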


Sources of Forecasting Errors and How to Reduce Them
Based on what we learned in section 4, smaller samples tend to increase s_F. Patterns that occur in a few observations may differ from the variable relationships in the overall population. For example, hiring over the past eight quarters may have coincided with increasing sales, but this direct relationship between sales and staff size could easily be due to chance. It is therefore important to resist the temptation to forecast only from the most recent observations.
Regression predictions usually must rely on samples large enough to apply the central limit theorem. For small samples, the t distribution is inappropriate and forecast intervals based on this distribution will be wrong.²⁶ Even if ε is approximately normal, small samples result in wider forecast intervals because of the thicker-tailed t distributions.
Although larger samples are preferred in statistical inference, forecasting relies on time
series data that present special limitations. Information on all variables in our model may not
have been collected prior to a particular date. For example, an airline may have neglected to
keep detailed records on the number of hours the computer reservation system was down until
the company discovered a need to forecast "down time." In other situations, substantial changes in a business can cause data prior to those changes to be incomparable with subsequent data. As General Motors makes major changes in accounting practices and organizational structure to cope with declining market share, data before and after may not be comparable for forecasting. Similarly, General Electric's profit forecasts based on data prior to the recent sale of its enormous consumer-appliance division would also risk an "apples and oranges" comparison. If a
sample goes far enough back in time, we always risk changes in the fundamental processes being
modeled. A quarter century ago, families were larger, unions and job security were common, the
elderly experienced greater poverty, households saved more and were less willing to run large
debts, fewer women had careers, and a host of other differences existed. A market forecast
model fit from 25 years of data may need to be highly complex to account for all these
demographic changes. In extreme cases, the historical record of comparable data is so short that
we must postpone forecasting until enough time passes to generate additional data.
To increase sample size, businesses sometimes attempt to collect time series observations
at more frequent intervals. Inventory stocks formerly measured quarterly may be collected
monthly or weekly. Two years that furnished only eight quarterly observations now yield 24
months or about 104 weeks of data.


²⁶ Then, nonparametric methods that reduce forecasting precision should be substituted for regression (see discussion in Chapter 14).


However, there are drawbacks to this approach to the small sample problem. If variables
in the model, such as corporate dividends or economic growth rates, are only reported on an
annual or quarterly basis, monthly analysis merely repeats the same information. Variables
capable of more frequent collection may not add much information, like the soup that a cook
waters down when unexpected guests arrive. Decisions about new investment, new product,
advertising, and even price changes are often made by management on a quarterly or annual
basis. Observing these processes more often may be as uninformative as checking your mailbox
more often long before the normal postal delivery. Frequently-collected data may resemble a
video taped movie shown in slow motion: it appears to take forever for the villain to fall after
being shot. Like a daily tracking poll before presidential elections, hourly observations on financial market response to an interest rate movement may contain mostly noise and static and little unique information.²⁷ Finally, analysis of data confined to a short-run period of a year or
two may fail to measure important long-run forces such as the business cycle, evolving consumer
tastes, and trends in supply prices. If one of these changes occurs in the period we want to
forecast, the sample used to fit the model will not reflect relationships essential to making a
reliable forecast.
Although larger samples improve forecasting precision, samples may be limited if older
data are unavailable or not comparable. Data collected more frequently increases sample
size but may not add much information.
We also saw that forecasting errors increase if explanatory variables are far from their
means. These errors are more common in forecasting because many explanatory variables, such
as prices and population, grow over time. The effect is a wider forecast interval. And if these
variables grow beyond values observed during the sample period, there is an additional danger:
extrapolation error. We encountered each of these problems with company revenues and the Dow
index in the Nike case.
As with other types of prediction, interval widths for forecasts tend to be highly sensitive
to the overall fit. If the model explains most variation in Y, a large R² will go hand-in-hand with a
relatively small SEE. As explained in Chapter 4, time series regression often achieves a better
fit than a similar model regressed on cross-section data. However, models that provide good
sample fits may still fail to produce good forecasts. This discouraging outcome may occur for
two reasons: forecasting's versions of modeling and measurement error.
Models tend to evolve as time passes, causing another form of extrapolation error when
we try forecasting from an older model. This problem cannot be avoided because we are unable
to sample data from the future. Thus, our sample regression won't tell us whether the model will


27
Too frequent data collection is thus a source of inefficient estimation from serial correlation, discussed earlier in section 1.

fit future data as well as it fit the past. We can only hope that the future looks enough like the
past to maintain stable variable relationships through the periods we need to forecast. This leap
of faith is a type of extrapolation, and risks large, unknown forecasting errors. Furthermore, the
further we forecast into the future, the greater the chance for model relationship shifts.
Regression model relationships may change over time in several ways. One way is a
change between past and future magnitudes or signs of the variable coefficients, violating the
assumption of constant β's (discussed in section 1). Because the sample can only estimate the
old coefficients, the forecasts will contain an unknown amount of bias. For example, suppose
the growth of Nike to multinational stature makes it less sensitive to fluctuations in the Dow. If
the stock market slips next year, Nike stock price would be forecast too low if the sample
regression coefficient 0.0175 were used. Perhaps the future coefficient of DJIA is only 0.0100,
but unfortunately we cannot know this yet. If coefficient signs reverse in the future, direct
relationships become inverse ones and vice versa. Higher dividends in the future may tend to
drive up Nike stock prices, for instance, flipping the negative coefficient to positive.
Even worse forecasting problems may occur when variables that once belonged in the
model should be replaced by newly important ones. Unlike the immutable laws of nature,
human behavior and business institutions can change. Variables that strongly affected Y in the
past may not do so in the future. For example, the decline in unions has made work-days-lost-to-
strikes a far less important factor affecting industrial profits. Conversely, variables not
appropriate for past models may assume prominence at a later point in time. When international
trade was only a minor part of most industries, currency exchange rates were not vital to the
bottom line for most business. Today, businesses newly-active in export markets suffer when the
dollar rises on world markets because U.S.-made goods then cost too much for foreign citizens to
buy.
Because forecast inferences cannot be based on future data, extrapolation of past
relationships results in an unknown amount of bias, especially if
(1) explanatory variables move outside their historical range
(2) some variables no longer are important while others become important
(3) variable coefficients change substantially or switch signs

Another source of severe forecasting error is measurement errors in the explanatory
variables. No matter how well the equation fits the sample data, forecasts will contain
substantial errors if inaccurate measurements of the explanatory variables are used. With other
types of prediction, measurement error is generally not a major issue. If these errors are
excessive and if sufficient time and resources are available, the usual cure is more sensitive
instruments, retraining of survey workers, or recording data to another decimal place. In
forecasting, however, measurement error is often the largest source of error.

To determine how many copies to print in its new edition, suppose a publishing company
forecasts orders for its popular business statistics textbook. The following model fits time series
data for the past 25 years very well:
predicted-Copies = -20,000 + 0.20 Enroll
where Copies is the number of copies ordered and Enroll is total college enrollment of business
majors. Even if other types of errors are small, measurement errors of next year's enrollment
may result in enormous forecasting error. Future enrollment data cannot be collected directly; so
the publisher must first guess future enrollment. An overestimate of enrollment by 50,000
students can cause the company to take losses on (0.20) (50,000) = 10,000 unsold books next
year. Forecast errors are largely determined by how closely we can guess the business
enrollment. Conditional forecasting therefore introduces an additional source of error to our
forecasts of the dependent variable.
DEFINITION: conditional forecasting occurs when one or more explanatory variables must be
guessed because their values are not known with certainty for the period forecast.
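To see how a guess error in an explanatory variable passes through to the forecast, here is a minimal Python sketch of the publisher example; the true enrollment figure is a made-up assumption, and only the 50,000-student overestimate and the fitted coefficients come from the text.

# Conditional forecasting: an error in the guessed explanatory variable
# propagates to the forecast through the slope coefficient.
b0, b1 = -20_000, 0.20                    # fitted intercept and slope from the text

true_enroll = 400_000                     # hypothetical actual enrollment (illustration only)
guessed_enroll = true_enroll + 50_000     # publisher overestimates by 50,000 students

forecast_from_guess = b0 + b1 * guessed_enroll
forecast_if_known = b0 + b1 * true_enroll

# The forecast error caused by the guess is the slope times the guess error:
print(forecast_from_guess - forecast_if_known)   # 0.20 * 50,000 = 10,000 copies too many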
What are some ways to guess future enrollment? One method would be to design another
forecasting model. But if that model also contains explanatory variables whose future values
cannot be known, we may not be much better off. For example, enrollment is itself heavily
dependent on the college age population, which will definitely rise steeply after more than a
decade of only modest growth. However, enrollment is also affected by employment
opportunities and the cost of attending college. New federal aid programs have made it possible
for many more students to afford college, but tuition has also increased. Employment
opportunities depend on the future state of the economy.
An easier way to obtain future values for explanatory variables is to make an educated
guess or rely on outside authorities. Often, closeness to the pulse of customers, suppliers, and
distributors provides business decision makers with a sixth sense about the future course of
important variables. The most naive approach assumes that hard-to-predict variables won't
change. Thus, the value from the previous period is assigned to the unknown explanatory
variable. Alternatively, a simple rule-of-thumb is used, such as a 5 percent increase in
enrollment each year. Notice that information about current year enrollment must be used in this
forecast. Market researchers, financial and industry analysts, and economic experts make their
living from selling informed predictions.
28
The textbook publisher may use future business
enrollment forecasts made by the AACSB, an accrediting association.


28
There is evidence that financial analysts incorporate such informed judgment to outperform forecasts from statistical models.

In the Nike study, large errors are likely in estimating the future values of the DJIA and
revenue variables. Stock market and industry sales experts may help us reduce these errors.
A more flexible version of conditional forecasting is called contingency forecasting.
DEFINITION: Contingency forecasting involves generating several forecasts, one for each
alternative set of circumstances, or scenario, that is likely to arise.
The future is seldom certain. However, by forecasting Y for the most-likely combinations of
explanatory variable outcomes, a decision maker can be well prepared. For example, the
publisher may identify three likely possibilities for business enrollment in the coming year: (1) a
rosy scenario with business enrollment of 450,000 students, (2) a normal year enrollment of
375,000, and (3) a substantial decline in majors to 300,000. Using a sample equation given
earlier, point forecasts are 70,000, 55,000, and 40,000 orders, respectively. The publisher
therefore only prepares for forecast intervals around these three amounts, rather than worry that
orders might fall to 25,000 or surge much beyond 70,000. Similar scenarios could be
constructed for likely future combinations of Nike sales revenue and the Dow index.
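A short Python sketch of how the three contingency forecasts can be generated from the fitted equation follows; the scenario labels are informal, and everything else is taken from the publisher example above.

# Contingency forecasting: one point forecast per enrollment scenario.
b0, b1 = -20_000, 0.20                      # fitted equation from the text

scenarios = {"rosy": 450_000, "normal": 375_000, "decline": 300_000}

for name, enroll in scenarios.items():
    copies = b0 + b1 * enroll               # point forecast of copies ordered
    print(f"{name:8s} enrollment {enroll:,} -> forecast {copies:,.0f} copies")
# Prints 70,000, 55,000, and 40,000 copies, matching the text.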
The alternative to conditional forecasting is unconditional forecasting.
DEFINITION: Unconditional forecasting occurs when the future values of all explanatory
variables are known with certainty.
We have already encountered one method, trend forecasting, that guarantees unconditional
forecasts. Suppose that a reasonably good fit can be obtained from the following model:
Copies = β0 + β1 Year + ε
where Year is a trend variable that measures time in years. The advantage of using a trend model
to guess next year's enrollment is that future values of time can be measured.
29
If the year is Year
= 2000, then the next year must be Year = 2001. Even if better fits may be obtained from more
sophisticated models, their measurement errors in explanatory variables may still result in
inferior forecasts.
For Nike, the long-term rise in its stock prices yields a respectable fit of nearly 80% for
the simple, time trend model (see selected portions of Minitab output in Figure 11.43). Here, the
only explanatory variable is qtr, a sequence of integers from 1 to 51 (for the 51 quarters in the
historical time series). Forecasts for 1996 are generated for t = 52 through 55. However, these
trend forecasts are all well below the 103 to 162 dollars/share of Nike stock during 1996.

Regression Analysis

The regression equation is
Stock P = - 15.0 + 1.93 t

s = 14.64 R-sq = 79.7% R-sq(adj) = 79.3%

Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
85.58 4.16 ( 77.22, 93.95) ( 54.99, 116.17)
87.52 4.28 ( 78.91, 96.13) ( 56.86, 118.18)
89.45 4.41 ( 80.60, 98.31) ( 58.72, 120.18)
91.39 4.53 ( 82.28, 100.49) ( 60.59, 122.19)
Figure 11.43
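Because future values of a trend variable are known, the 1996 trend forecasts can be reproduced by simply plugging t = 52 through 55 into the printed equation. The sketch below uses the rounded coefficients from Figure 11.43, so its values land slightly below Minitab's unrounded fits (85.58 through 91.39).

# Unconditional trend forecasts: future values of t are known with certainty.
b0, b1 = -15.0, 1.93            # rounded coefficients from the Minitab output

for t in range(52, 56):         # the four quarters of 1996
    print(t, round(b0 + b1 * t, 2))
# 52 -> 85.36, ..., 55 -> 91.15 (slightly below the unrounded Minitab fits)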

There are other opportunities for obtaining unconditional forecasts. Some actions may be
locked-in long before they actually occur. Major investment projects such as plant construction
may be financed and committed a year or more in advance. Thus, a model that uses investment
as an explanatory variable may forecast for at least a year unconditionally. Network advertising
buys of air-time blocks are also formalized many months in advance. Union contracts, long-
term raw material purchases, mortgages, hotel convention bookings, and land leases are other
examples of variables whose values are known well into the future. In the Nike case, a dividend
payout policy may be well established. Perhaps preliminary paid registration may be used to
measure enrollment next year, because tuition refund penalties are substantial.
30

In conditional forecasting, errors may be huge because we first must forecast values of the
explanatory variables. Only unconditional forecasts are free of these errors.

Evaluating Model Performance by Ex Post Forecasting
As we have discovered, a model's ability to forecast the future may not depend primarily
on its fit of past data. Good forecasts are so financially rewarding that model evaluation has now
become accepted practice. Like road testing a car before you buy it, model performance should
be evaluated before it is used to make important forecasts. In cross sectional analysis, we can
field test a model before adopting it. For example, a marketing model is first tested in a sample


29
Alternatively, a quadratic or geometric model may better describe trends in enrollment. Nonlinear functional forms are discussed in
Chapter 13.

30

Chapter 13 discusses another method to ensure unconditional forecasts, the use of explanatory variables lagged one or more time periods.

of representative communities to assess the model's ability to correctly predict customer
response to changes in product price, quality, or packaging.
Unfortunately, we cannot evaluate forecasting performance of a model by conventional
ex ante forecasting until the future outcomes occur. By that time, it will be too late. Fortunately,
a testing ground called ex post forecasting permits us to evaluate time series models. We do this
by adjusting the estimation period so that the forecasted outcomes are already known.
DEFINITIONS: The estimation period is the time series data used to fit a forecasting model.
Ex post forecasting involves forecasting the most recent observations after withholding them
from the estimation period. By contrast, ex ante forecasting uses an estimation period that
includes the most recent observations.
Ex is Latin for "from," post means "after," and ante translates to "before" (think of the "postwar"
period and you "ante" before a poker hand). Thus, ex post forecasts are made after observing the
outcomes, and ex ante forecasts are made before these outcomes can be observed. Clearly, all
real business forecasting is ex ante. Why would you want to forecast what you already know?
The answer is that the ex post forecasting allows us to simulate the performance of a model.
Let's examine the nuts and bolts of ex post forecasting of Nike stock prices. Suppose
that we move the clock ahead to today, and the same market analyst wants to see how well the
model would have forecast for some year in the past, say 1996. He therefore fits the model to
data from an estimation period ending with 1995. He cannot include 1996 data because that
would bias the fitted equation in favor of closer forecasts. The sample regression is then used to
forecast each of the four quarters in 1996. This is exactly what we did earlier with our ex ante
forecasts of Nike stock. The difference with ex post analysis is that now we don't wait a year to
discover how well (or poorly) our model performed. Benefitting from 20-20 hindsight, we may
directly compare forecasts with actual 1996 stock prices. Thus, ex post forecasts simulate the
real, ex ante forecasts the model will produce once it is in use.
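For readers who want to experiment, the following Python sketch mimics the ex post procedure with ordinary least squares from numpy; the short quarterly series is made up purely for illustration, since the Nike data are not reproduced here.

# Ex post forecasting: withhold the most recent observations, fit on the rest,
# then compare forecasts with the (already known) withheld outcomes.
import numpy as np

# Illustrative quarterly series (a stand-in for the Nike data).
y = np.array([52., 55., 59., 58., 64., 70., 75., 81., 88., 95., 103., 110.])
t = np.arange(1, len(y) + 1)

holdout = 4                                  # forecast the last four quarters ex post
t_est, y_est = t[:-holdout], y[:-holdout]    # estimation period
t_fc,  y_act = t[-holdout:], y[-holdout:]    # ex post forecast period

# Fit a simple time trend y = b0 + b1*t on the estimation period only.
X = np.column_stack([np.ones_like(t_est, dtype=float), t_est])
b0, b1 = np.linalg.lstsq(X, y_est, rcond=None)[0]

forecasts = b0 + b1 * t_fc
errors = forecasts - y_act                   # forecast error = forecast minus actual
print("forecasts:", forecasts.round(1))
print("errors:   ", errors.round(1))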
Earlier in this section we generated 1996 forecasts from two models:
(1) a multiple regression (MR) model, using the Dow index, revenues, and dividends as
explanatory variables, and
(2) a simple time trend (TT) model.
Which of the two models performs better? Both models tested significant and fit the sample data
well, but the multiple regression had the better fit. Its R² (even adjusted for degrees of freedom)
was nearly ten points higher.

However, what if the stock market analyst only needs the model for its forecasts?
Then, he prefers the model that yields better forecasts. The quarterly point forecasts from the
MR and the TT models were copied to an Excel spreadsheet and placed alongside the actual
Nike stock prices for 1996 (see Figure 11.44). Using the computation capabilities of the
spreadsheet, two additional columns were included, one containing the forecasting errors for
each of the two models.
Forecast errors are calculated to conform to our intuitive notions of forecasting error.
DEFINITION: The forecast error for each observation is
e = YF - YA
where YA is the actual value of Y being forecast and YF is the forecasted value.

For example, both models forecast the first-quarter 1996 stock price too low. The corresponding
forecasting errors in Figure 11.44 are, and should be, negative. Signs of the errors may also be
used to interpret the other forecasts. While the MR model overestimates Nike stock price for the
remaining three quarters, the TT model consistently falls short of the actual price.
Aside from their direction, forecasting errors for both models vary considerably. However, the
multiple regression model had the smaller error for three of the four quarters, and the time
trend model edged out MR only for the second quarter. On this basis, the multiple regression
model displays slightly better forecasting performance.
Although ex post forecasting is a powerful tool, certain cautions need to be mentioned.
First, forecast performance may be sensitive to which ex post forecasting period we select. For
example, Figure 11.45 shows the Excel spreadsheet error computations if we choose to forecast
for 1995 and 1996.
MR Model TT Model Actual Nike Error for Error for
Qtr Forecasts Forecasts Stock Price MR Model TT Model
1 121 86 162 -41 -76
2 122 88 103 19 -15
3 126 89 122 4 -33
4 143 91 120 23 -29
Figure 11.44
MR Model TT Model Actual Nike Error for Error for
Year Qtr Forecasts Forecasts Stock Price MR Model TT Model
1995 1 72 70 75 -3 -5
1995 2 75 71 84 -9 -13
1995 3 77 73 111 -34 -38
1995 4 92 75 139 -47 -64
1996 1 95 76 162 -67 -86
1996 2 95 78 103 -8 -25
1996 3 97 80 122 -25 -42
1996 4 104 81 120 -16 -39
Figure 11.45

Although the multiple regression model performs better, the time trend model does nearly as well
for most forecasts. Also, both models now consistently forecast stock prices that are too low.
In addition, evaluating forecasts for several periods is best because chance variations can
make a model appear to perform well or poorly for one or two forecasts. For example, the first
three quarterly forecasts of 1995 for the two models are only $2 to $4 per share apart (see Figure
11.45). However, the time trend model commits $17 to $19 per share larger errors for the next four
quarterly forecasts. Sometimes, the performance winner among several models is not clear even
after several observations are forecast. One model may clearly do better for near-term
forecasting (say, two or three quarters), but another performs best for long-term forecasts. Short-
term forecasting models often contain explanatory variables that measure temporary shocks such
as oil prices and seasonal weather conditions, while long-term models emphasize variables such
as population and productivity that have more lasting impacts.
Conditional forecasting errors (discussed earlier in this section) may arise if information is
required about explanatory variables in a model. Although ex post forecasts are better for the
Nike multiple regression model once actual values for the three explanatory variables are
inserted, ex ante forecast errors may be outrageous. As we discussed earlier, the Dow index and
Nike revenues may introduce unacceptable amounts of measurement error to these forecasts. For
a fair performance evaluation, try guessing ex post values of DJIA and revenue by the same
techniques you will use in the ex ante forecasts (see exercises).
Ex post forecasting is a valuable method for evaluating performance of time series
models. Before making a judgment, however, forecast several observations,
examine model performance under different choices of ex post forecasting period, consider only models
defensible on prior grounds, and include conditional forecasting errors in your
analysis.

Average Forecast Error, Forecast Bias, and Turning Points
It is often difficult to decide which of several alternative models has the best forecasting
performance based only on their forecast errors. One model may do best forecasting some
observations and worst for others. Another model might never provide the smallest forecast
error but always does well enough to stay in the running for best overall. The stock market
analyst may prefer a way to combine all forecasting errors into summary measures.
One such measure is forecast bias.
DEFINITION: The estimated forecast bias B for a model is the mean forecast error

B = (1/p) Σ ei = (1/p) Σ (YF,i - YA,i), where the sums run over the p forecasts
Thus, forecast bias B may be estimated by the average error of the ex post forecasts. Note that
average forecasting errors may be very small even if individual errors are large because negative
and positive errors often cancel.
A positive bias describes forecasts that are too high on average. Negative bias results if
average forecasts are too low. Bias near zero occurs when all errors are small, but zero
bias also happens if positive and negative forecast errors cancel out.
In the Nike stock case (Figure 11.44), the MR model's forecast errors for 1996 were -41, 19, 4,
and 23. These sum to only +5 because the -41 nearly offsets the three positive forecasting errors.
The estimated forecasting bias is
B = (+5)/4 = 1.25
Therefore, the stock market analyst concludes from ex post analysis that the multiple regression
model produces nearly unbiased forecasts. By contrast, forecast errors for the TT model were -
76, -15, -33, and -29. Because no positive error offset the negative ones, estimated bias is
negative and much larger:
B = (-76 -15 -33 -29)/4 = (-153)/4 = -38.25
The conclusion is that the time trend model results in forecasts that average $35 to $40 below
actual future Nike stock prices.
In earlier chapters, we chose estimators that were unbiased. How can we now have bias?
The answer is that these estimators only yield unbiased predictions if the future population resembles the
historical sample. Because forecasts apply to a different time period than the estimation period,
bias is a legitimate concern. By comparing mean forecast errors, we may decide whether a
model forecasts too high or too low on average and by how much.
Although bias is usually an important consideration, an even greater concern is the size of
forecasting errors regardless of their signs. Often, large errors in either direction are equally
disastrous for a business. For example, if a production manager uses a forecast that
underestimates next year's sales, the resulting backorders and lost sales will cost the company
profits and reputation. On the other hand, a forecast that overshoots sales will raise production
costs and storage costs (for unsold inventory). If sales revenue is insufficient to cover costs, the
company may absorb severe losses.

We are therefore interested in a measure of average forecasting error that does not allow
positive and negative errors to offset one another. Two alternatives are in common use, and each
measure has its strengths.
One obvious measure is a forecasting version of the standard deviation, the root mean
square error (RMSE), for estimating average forecasting variability.
DEFINITION: The root mean square error, RMSE, is the square root of the mean square
forecasting error, which for p forecasts and errors ei is 31
RMSE = √[ (1/p) Σ ei² ]
For the four quarterly forecasts of Nike stock, the computations are
MSE = [(-41)² + (19)² + (4)² + (23)²] / 4 = 646.75
whose square root is
RMSE = 25.4
In contrast to its near-zero bias, the magnitude of RMSE for the MR model is so large because
both negative and positive errors contribute to this measure instead of canceling. However, all
errors for the TT model have the same sign:
MSE = [(-76)² + (-15)² + (-33)² + (-29)²] / 4 = 1982.75
Thus, RMSE is 44.5, a magnitude similar to the bias we found earlier (B = -38.25).
As you may recall from our discussion of the standard deviation in Chapter 3, sum-of-square
measures of variability place enormous weight on the largest deviations. One way we avoided
this problem was to use the mean absolute deviation. For forecasting, we call this mean absolute
error, or MAE.
The mean absolute error, MAE, is the mean magnitude of the forecasting errors, or
MAE = (1/p) Σ |ei|




31
Unlike previous uses of MSE, RMSE includes bias and does not consume degrees of freedom because forecasts are not part of the sample
data.

Absolute values strip off negative signs and make everything positive. Because the MAE does
not give extra weight to the largest forecast errors, MAE is nearly always less than RMSE.
Thus, the MAE computations for each of the two models in the Nike case are:
MAE = (|-41| + |19| + |4| + |23|) / 4 = (41 + 19 + 4 + 23) / 4 = 87/4 = 21.75
MAE = (|-76| + |-15| + |-33| + |-29|) / 4 = (76 + 15 + 33 + 29) / 4 = 153/4 = 38.25

Notice that each of these is less than its RMSE counterpart calculated earlier. Did you also
notice that the MAE for the time series model looked familiar? It has the same magnitude as the
bias, 38.25. This will always occur if all forecast errors have the same sign.

Unless all forecast errors have identical magnitudes, RMSE will be larger than MAE. This
difference is greatest if one error is much larger than the rest. Also, the magnitude of the
bias B will equal MAE if all forecast errors have the same sign.
Often we need a relative measure that shares the desirable properties of MAE. For this
task, the mean absolute percent error, or MAPE, is used.

The mean absolute percent error, MAPE, is the mean absolute error expressed as a percent of
the actual value being forecast, or
MAPE = (1/p) Σ |ei / YA,i| × 100%
Applying the MR model forecasts for the Nike stock case, we find the MAPE:
Forecasts Actual Error Squared errors Absolute Errors Abs Pct Errors
Year Qtr MR TT Price MR TT MR TT MR TT MR TT
1996 1 121 86 162 -41 -76 1681 5776 41 76 25.3% 46.9%
1996 2 122 88 103 19 -15 361 225 19 15 18.4% 14.6%
1996 3 126 89 122 4 -33 16 1089 4 33 3.3% 27.0%
1996 4 143 91 120 23 -29 529 841 23 29 19.2% 24.2%
Bias = 1.25 -38.25 646.75 1982.75 MAE = 21.75 38.25 MAPE = 16.6% 28.2%
RMSE 25.4 44.5
Figure 11.46

MAPE = (|100%(-41 / 162)| + |100%(19 / 103)| + |100%(4 / 122)| + |100%(23 / 120)|) / 4
= (25.3% + 18.4% + 3.3% + 19.2%) / 4 = (66.2%)/4 = 16.6%
Thus, errors average about one-sixth (16.6%) of the stock prices forecast by the multiple
regression model. Similar computations for the time trend model result in an MAPE nearly
twice as large, 28.2%. These calculations are easier using a spreadsheet (see Figure 11.46).
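For readers who prefer code to a spreadsheet, a minimal Python sketch of the same computations follows, using the forecasts and actual 1996 prices from Figure 11.44; it reproduces the bias, RMSE, MAE, and MAPE values reported in the text (to rounding).

# Bias, RMSE, MAE, and MAPE for the 1996 ex post forecasts (Figure 11.44).
import math

actual = [162, 103, 122, 120]            # actual Nike stock prices, 1996 Q1-Q4
mr     = [121, 122, 126, 143]            # multiple regression forecasts
tt     = [ 86,  88,  89,  91]            # time trend forecasts

def accuracy(forecasts, actuals):
    errors = [f - a for f, a in zip(forecasts, actuals)]   # e = forecast - actual
    p = len(errors)
    bias = sum(errors) / p
    rmse = math.sqrt(sum(e**2 for e in errors) / p)
    mae  = sum(abs(e) for e in errors) / p
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actuals)) / p
    return bias, rmse, mae, mape

print("MR:", [round(x, 2) for x in accuracy(mr, actual)])  # about 1.25, 25.4, 21.75, 16.6%
print("TT:", [round(x, 2) for x in accuracy(tt, actual)])  # about -38.25, 44.5, 38.25, 28.2%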
The MAPE measure often provides decision makers with a useful perspective about the
relative size of forecasting errors. Here, we found that MAE is nearly twice as large for the TT
model as for the MR model. This difference is important for the Nike case because it
translates to MAPEs for the two models of 28.2% versus only 16.6%. However, the same doubling
would have been trivial if the two MAPEs were 0.2% and 0.1%. Use of MAPE is also
essential for comparing models applied to different data sets, such as sales forecasts for several
firms. Even if the firms have vastly different sizes, a relative measure allows us to compare
relative forecasting performances directly (see Exercises).
What should the stock market analyst conclude about the two models? The multivariate
model has a small forecasting bias. The MR model also experiences substantially smaller MAE
and RMSE. Because the average relative error (MAPE) can be reduced from 28% to 17% by
choosing the multivariate model, these forecasting gains should not be ignored. Although it
loses some of its bias advantage, the MR model retains most of its edge in MAE and RMSE for
the alternative, eight-quarter ex post analysis (see Figure 11.46).
[Chart: Nike Stock Price (Adjusted for Splits), in dollars per share, plotted by year from 1983 through 1996]

However, a time trend model offers us the security of unconditional forecasts because
future values of the time variable are known with certainty. That same claim cannot be made for
variables in the MR model. Unless Nike revenues and the blue-chip stock indexes can be
predicted with reasonable precision, the trend model may be the safer choice. On the other hand,
if the long upward trend in Nike stock ends (see Figure 11.47), the TT model will blindly
forecast steady stock price increases. Thus, the multivariate model is the only model capable of
forecasting a turning point in Nike stock prices.
DEFINITION: A turning point is a single time period or narrow range of periods that precedes
a sharp (and usually long-term) change in direction of a time series variable.
A model that can accurately forecast turning points may be far more useful than other models
with much smaller average forecasting errors and bias. Knowing when to get out of or back into
the stock, real estate, commodities, or futures markets can make (or save) a fortune. Forecasting
turning points may be the number one objective for modeling business and economic variables
that experience the strongest and sharpest changes, such as interest rates, foreign exchange rates,
housing starts, crude oil prices, and auto sales. On the other hand, variables that are not as
volatile, such as population, productivity, and wages, are forecast by models that place a
premium on reducing MAE (or RMSE) and bias. In the case of Nike, the bubble finally burst
beginning with a sales slide of 8% in the third quarter of 1997. After peaking early that year,
Nike stock prices fell by nearly 40% over the next year despite spectacular run-ups in the Dow
index. Perhaps the market analyst should add to the MR model an explanatory variable for the
market share of Nike's resurgent rivals, Reebok, Adidas, and New Balance.
Once the market analyst decides which model to use, he is then ready to generate ex ante
forecasts using historical data that includes the most recent quarterly data (see Exercises).

Forecasts Actual Error Squared errors Absolute Errors Abs Pct Errors
Year Qtr MR TT Price MR TT MR TT MR TT MR TT
1995 1 72 70 75 -3 -5 9 25 3 5 4.0% 6.7%
1995 2 75 71 84 -9 -13 81 169 9 13 10.7% 15.5%
1995 3 77 73 111 -34 -38 1156 1444 34 38 30.6% 34.2%
1995 4 92 75 139 -47 -64 2209 4096 47 64 33.8% 46.0%
1996 1 95 76 162 -67 -86 4489 7396 67 86 41.4% 53.1%
1996 2 95 78 103 -8 -25 64 625 8 25 7.8% 24.3%
1996 3 97 80 122 -25 -42 625 1764 25 42 20.5% 34.4%
1996 4 104 81 120 -16 -39 256 1521 16 39 13.3% 32.5%
Bias = -26.1 -39.0 1111.1 2130.0 MAE = 26.1 39.0 MAPE = 20.3% 30.8%
RMSE 33.3 46.2
Figure 11.47

If MAPE is tolerably small for all models, then model selection may be based on criteria
other than MAE and RMSE. The ability to forecast turning points and avoid conditional
forecasting errors are factors that also should be considered.
Multiple Choice Review:
11.90 In regression models, forecast intervals are a type of
a. confidence interval for the conditional mean
b. univariate confidence interval for the variable being forecast
c. prediction interval
d. confidence interval for the regression coefficient
e. none of the above

Chapter Review Summary
The random disturbance term ε is needed because most business regression models must
exclude all but the most important explanatory variables and must represent
difficult-to-measure variables with imperfect proxy variables.
Under the regression assumptions, the predicted value Ŷ defined by the least-squares equation is an efficient
estimator of the conditional mean of Y. In addition, the conditional distribution of Y, for any
given set of values of the k explanatory variables, is normally distributed with mean
β0 + β1 x1 + ... + βk xk and constant standard deviation σ.
A carefully constructed model or narrowly-defined population may be necessary to
ensure constant regression parameters. The assumption of a normally distributed disturbance ε is
approximately true for large samples if all sources of error have similar distributions and are
independent.
Hypothesis testing with regression commonly takes two forms: 1) testing whether individual
explanatory variables have a statistically significant effect on the dependent variable, and 2)
testing the significance of the entire model.
In regression analysis, testing for statistical significance of Xj, j = 1, ..., k, requires a test
on whether its coefficient βj is significantly different from zero. For a two-sided test, the null
and alternative hypotheses are H0: βj = 0 and HA: βj ≠ 0.
The statistic for testing the statistical significance of variable Xj in a regression model is a
t-ratio equal to bj/sbj. The number of degrees of freedom for the test is n - k - 1. At significance
level α, if t-ratio > tα/2(n-k-1) or t-ratio < -tα/2(n-k-1), reject H0; if -tα/2(n-k-1) ≤ t-ratio ≤ tα/2(n-k-1),
do not reject H0. Alternatively, if |t-ratio| > tα/2(n-k-1), reject H0 and conclude that Xj is
statistically significant; otherwise do not reject H0.
Regression tests should be based on data from as large a sample as practical and over a
wide range of values for each explanatory variable, especially if the relationship being tested is

likely to be weak. If only a small sample size is feasible, simple models involving very few
explanatory variables should be fit.
For two-sided tests, if the p-value for testing Xj is less than α, reject H0 and
conclude that Xj is significant in the regression model. If the p-value is greater than α, do not
claim significance for Xj. Either of the one-sided hypotheses may be tested at significance level α
using the t-ratio rule: if t-ratio > tα(n-k-1), reject H0; if t-ratio ≤ tα(n-k-1), do not reject H0.
Alternatively, the p-value decision rule is: if p/2 < α, reject H0; if p/2 ≥ α, do not reject H0.
The (1 - α)·100 percent confidence interval for a least-squares estimate bj is
(bj - tα/2(n-k-1) sbj , bj + tα/2(n-k-1) sbj).
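A compact sketch, assuming Python with scipy is available, ties together the t-ratio rule, the p-value rule, and the coefficient confidence interval just summarized; the illustrative numbers are borrowed from the customer coefficient in the county power-usage output in the case study exercises below.

# Significance test and confidence interval for one regression coefficient.
from scipy import stats

b_j, s_bj = 4.379, 1.084       # coefficient and standard error (customer variable)
n, k, alpha = 28, 3, 0.05      # sample size, number of explanatory variables, level
df = n - k - 1

t_ratio = b_j / s_bj
t_crit = stats.t.ppf(1 - alpha / 2, df)              # two-sided critical value
p_value = 2 * stats.t.sf(abs(t_ratio), df)           # two-sided p-value

reject = abs(t_ratio) > t_crit                       # same decision as p_value < alpha
half_width = t_crit * s_bj
print(f"t = {t_ratio:.2f}, p = {p_value:.3f}, reject H0: {reject}")
print(f"{100*(1-alpha):.0f}% CI: ({b_j - half_width:.2f}, {b_j + half_width:.2f})")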
Procedure for inferences about explanatory variables: state regression model; decide
which variables are to be tested and which are included merely as control variables; determine
which variables are eligible for one-tailed tests; state the null hypotheses formally; assign a level
of significance; collect sample data and estimate the regression equation; use the t-ratio or p-
value decision rules to determine test results; for one-tailed tests, divide p-values by two or use
the tα(n-k-1) rejection region of the t distribution; translate test results on each bj into
significance (or lack of significance) for Xj; interpret point estimates of the coefficients for the
significant variables as geometric slopes; and compare each variable's effect on the dependent
variable based on its slope coefficient for typical changes in a variable.
In the significance test for the regression model, the null and alternative hypotheses may
be stated by
H0: β1 = β2 = ... = βk = 0
HA: βj ≠ 0 for at least one j from 1 to k
The F-ratio to test the significance of a regression model with k explanatory variables and
sample size n is the ratio MSR/MSE; the sampling distribution for this ratio is F with ν1 = k and
ν2 = n - k - 1 degrees of freedom.
At significance level α, if F-ratio > Fα(k, n-k-1), reject H0; if F-ratio ≤ Fα(k, n-k-1), do
not reject H0. Alternatively, the p-value decision rule is: if p < α, reject H0; if p ≥ α, do not reject
H0. Rejecting H0 in this test allows us to conclude that the model is statistically significant. R² =
SSR/SST and is tested with the same F-test. The larger the sample, the lower the sample fit
necessary for a significant R²; models with more explanatory variables also require a higher R² or
larger sample sizes to test significant.
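The model F-test can be sketched in code the same way, again assuming scipy; the sums of squares below are taken from the power-usage ANOVA table in the case study exercises that follow.

# F-test for overall model significance from the ANOVA sums of squares.
from scipy import stats

ss_r, ss_e = 1043.06, 254.28   # regression and error sums of squares (power model)
n, k, alpha = 28, 3, 0.01

ms_r = ss_r / k                 # about 347.7
ms_e = ss_e / (n - k - 1)       # about 10.6
f_ratio = ms_r / ms_e           # about 32.8, as in the printout

p_value = stats.f.sf(f_ratio, k, n - k - 1)
print(f"F = {f_ratio:.1f}, p = {p_value:.4f}, significant at {alpha}: {p_value < alpha}")

r_squared = ss_r / (ss_r + ss_e)    # R-sq = SSR/SST, about 80.4%
print(f"R-squared = {r_squared:.3f}")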
If only one Xj departs from its mean, where xp,j is the value of Xj for predicting
Errors in predicting individual observations from a regression model occur for two
reasons: (1) because of sampling errors in estimating the true population regression equation, and
(2) because there is dispersion around even the unknown population relationship because of
modeling and measurement errors. For large samples, sP will be no more than 10 percent greater
than s unless predictions are based upon extreme values of the explanatory variables; therefore, a
95 percent large sample prediction interval may be approximated by the prediction ± 2s.


Chapter Case Study Exercises
A consortium of Florida cities hires a budget analyst to model the factors affecting police force
size and predict the police force for three newly incorporated cities. The model is fit for n = 56
middle-sized cities (populations 25,000 to 100,000) using the following model:
FORCE = β0 + β1 VIOL + β2 PRTX + β3 POP + β4 OLD% + ε
where the variables in the model are defined as:
FORCE police force size (number of officers on the city police force)
VIOL violent crime rate (as a percent of total crimes)
PRTX property tax (city tax in dollars per capita)
POP city population (in thousands)
OLD% elderly share of population (percent at least 65)
After generating descriptive univariate statistics on the independent variables, the regression
model is fit and three predictions made:
Variable N Mean Median TrMean StDev SEMean
VIOL 56 7.661 6.750 6.952 5.468 0.731
PRTX 56 164.8 103.0 142.1 173.0 23.1
POP 56 52.38 47.55 51.57 20.44 2.73
OLD% 56 10.900 10.550 10.616 4.995 0.668

Variable Min Max Q1 Q3
VIOL 0.500 33.000 3.975 9.525
PRTX 18.0 977.0 59.2 181.5
POP 25.70 95.90 34.05 72.13
OLD% 2.100 32.600 8.125 13.300

Predict for VIOL = 7.5,PRTX = 165,POP = 53,OLD% = 11;CITY1
Predict for VIOL = 20,PRTX = 300,POP = 60,OLD% = 25;CITY2
Predict for VIOL = 3,PRTX = 200,POP = 5,OLD% = 4 CITY3
The regression equation is
FORCE = - 52.5 + 3.24 VIOL + 0.0641 PRTX + 1.59 POP + 2.10 OLD%
[output omitted to save space]
s = 20.36 R-sq = 85.0% R-sq(adj) = 83.8%

Analysis of Variance

SOURCE DF SS MS F p
Regression 4 119858 0.000
Error 51 21144 415
Total 55 141002

Fit Stdev.Fit 95.0% C.I. 95.0% P.I.
89.71 2.73 ( 84.24, 95.19) ( 48.46, 130.96) PREDICTION1
179.34 9.33 ( 160.60, 198.08) ( 134.36, 224.31) PREDICTION2
66.19 18.90 ( 28.24, 104.13) ( 10.41, 121.97) XX THIRD PRED.
XX denotes a row with very extreme X values

Answer the following questions based on the preceding output and model:
12. A printing error caused the MSR to be missing from the Analysis of Variance table. The
MSR is equal to
a. 290
b. 29,965
c. 98,714
d. 479,432
e. none of the above

13. The same printing error caused the F-ratio to also be omitted from the Analysis of
Variance table. The F-ratio is approximately
a. 12
b. 17
c. 72
d. 98
e. none of the above

14. The null hypothesis associated with the F test is
a. H0: β0 = β1 = β2 = β3 = β4
b. H0: β0 = β1 = β2 = β3 = β4 = 0
c. H0: β1 = β2 = β3 = β4
d. H0: β1 = β2 = β3 = β4 = 0
e. none of the above

15. After conducting the F-test at the α = .01 significance level using the p-value decision
rule, what should we do?
a. reject H0 and conclude that the model is significant
b. reject H0 and conclude that the model is not significant
c. cannot reject H0 and conclude that the model is significant
d. cannot reject H0 and conclude that the model is not significant
e. insufficient information provided to conduct the test

16. We are 95% confident that the mean police force for hundreds of cities with the same
characteristics as the first city is between approximately which two values?
a. 90 to 95
b. 84 to 90
c. 84 to 95
d. 48 to 131
e. 69 to 110


17. Which of the following is not true about the prediction for the first city?
a. the point prediction is based on explanatory variables near their sample means
b. the prediction interval is approximately four times the standard error of the estimate
c. the prediction interval is narrower than for virtually any other possible prediction
d. all of the above are true
18. The prediction interval is wider for the second city than for the first city because
a. the second prediction involves extrapolation
b. the second prediction involves explanatory variables that are not near the sample means
c. the second prediction involves a larger sized police force
d. the second prediction is made after the first
e. none of the above


19. In this case, the prediction for the third city causes an "XX" warning to be issued on the
printout because
a. one of the explanatory variables lies outside the range of the sample data
b. two of the explanatory variables lie outside the range of the sample data
c. three of the explanatory variables lie outside the range of the sample data
d. all four of the explanatory variables lie outside the range of the sample data
e. the warning is not relevant to this case because no time series forecasting is involved

B. To perform rate studies and issue municipal bonds, counties need to model and estimate
electrical power usage needs. Counties also use the resulting information to determine whether it is
more profitable to purchase on the spot market, enter into long-term supply contracts with
utilities, or build their own generating facilities. Monroe County in the Key West resort area of
Florida gathers quarterly time series data on T = 28 quarters from 1982-1988 on the following
variables:

power_t      residential power usage (in millions of kilowatt hours) during the tth quarter
DD cool_t    cooling degree days (a measure of temperature)
customer_t   number of billed residences (in thousands) during the tth quarter
retail_t     Florida taxable retail sales (in billions of dollars) during the tth quarter

A consultant for the county's utility board obtains the following regression output on this data:

Predict for ddcool = 400, customer = 15, retail = 6;
Predict for ddcool = 1197, customer = 16.36, retail = 6.77.




The regression equation is
power = - 48.8 + 0.0104 DD cool + 4.38 customer - 0.21 retail

Predictor Coef Stdev t-ratio p
Constant -48.83 11.12 -4.39 0.000
DD cool 0.010438 0.001517 6.88 0.000
customer 4.379 1.084 4.04 0.000
retail -0.210 1.452 -0.14 0.886

s = 3.325 R-sq = 80.4% R-sq(adj) = 78.0%

Analysis of Variance
SOURCE DF SS MS F p
Regression 3 1043.06 347.7 32.8 0.000
Error 24 254.28 10.6
Total 27 1297.34

Fit Stdev.Fit 95% C.I. 95% P.I.
19.771 1.642 ( 16.382, 23.160) ( 12.116, 27.425)
33.884 0.628 ( 32.587, 35.181) ( 26.899, 40.868)
MEAN 'DD cool'
MEAN = 1197.0
MEAN 'customer'
MEAN = 16.359
mean 'retail'
MEAN = 6.7732

(1) Modeling and F-Test of the Model: Formally present the model being estimated. Use variable
names (DD cool, customer, and retail) and the beta (β) parameters as variable coefficients.
power = β0 + β1 DD cool + β2 customer + β3 retail + ε
The hypotheses associated with the F test on the model may be constructed in terms of the betas
(β's) of the model:
H0: β1 = β2 = β3 = 0
HA: at least one βj different from zero
Conduct the F-test at the α = .01 significance level using the p-value decision rule (show your
work), and then state your conclusion in one sentence.
p = .000 < α = .01, reject H0. Therefore, the model to explain power usage tested significant at
the .01 level.
(2) Check that the 19.771 Fit from the first PREDict subcommand (second to last line of computer
output) is correct:
Use the fitted equation to predict power usage when there are 400 cooling degree days during the
quarter, 15 thousand customers, and retail sales are $6 billion. [Show intermediate calculations
and note that some variables are measured in thousands or billions]
-48.8 + 0.0104(400) + 4.38(15) - 0.21(6) = 19.8 (versus 19.771)
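The same check can be run in a couple of lines of Python; the small gap from 19.771 comes only from the rounding of the printed coefficients.

# Check the first PREDict fit using the rounded printed coefficients.
ddcool, customer, retail = 400, 15, 6
fit = -48.8 + 0.0104 * ddcool + 4.38 * customer - 0.21 * retail
print(round(fit, 2))   # 19.8, versus Minitab's 19.771 from the unrounded coefficients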

(3) Based on the first PREDict output line (the one predicting power = 19.771), which 95%
confidence intervals would you report if
(a) you were forecasting power usage for a particular quarter with those explanatory variable
values.
95% P.I.: (12.1, 27.4)
(b) you wanted to capture average power usage for many quarters with those explanatory
variable values.
95% C.I.: (16.4, 23.2)
(4) Examine the second PREDict output line based on 1197 cooling degree days, 16,360
customers, and $6.77 billion in retail sales. This prediction is based on the sample mean of each
explanatory variable (note the MEANs at the end of the output). Explain why the standard deviation of the fit for this
prediction is so much smaller (only 0.63) than the "Stdev.Fit" for the first prediction (1.64).
The standard deviation of the fit is smallest when the explanatory variables are near or at
their mean; the first prediction is based on cooling degree days well below the mean for
that variable.

C. A bank analyst collects monthly data on the economy for the period just prior to the 1990-92
recession. The sample consists of monthly time series data on the U.S. economy from 1988 until
the middle of 1990, 30 consecutive months. The variables in the model are defined

unemp monthly unemployment rate (in percentage points)
conf monthly consumer confidence index (in 1967 the index was 100)
starts monthly housing starts (in millions)
invent monthly manufacturing inventories (in billions of dollars)

and the regression model produces the following regression results:

The regression equation is
unemp = 9.51 - 0.00445 conf - 0.287 starts - 0.00948 invent

Predictor Coef Stdev t-ratio p
Constant 9.512 1.104 8.61 0.000
conf -0.004453 0.008559 -0.52 0.607
starts -0.2865 0.1971 -1.45 0.158
invent -0.009478 0.001652 -5.74 0.000

1. Modeling and Constructing the Tests
a. Formally present in equation form the model being estimated on your screen (No
words, please!). Use the variable names (unemp, conf, starts, and invent) instead of Y, X1, X2,
and X3, and don't forget to use the beta (β) parameters as variable coefficients.

b. Complete below the two lines of null-alternative hypotheses for a two-tailed
significance test of the 'invent' variable. [No words please and use the proper βj]
H0:
HA:
c. Next, construct the one-tailed hypothesis to test whether there is an inverse relationship
of housing starts with the dependent variable. [No words and use the proper βj]
H0:
HA:
d. Explain in one sentence why we are justified in conducting the one-tailed test in
question (1c) above.
e. Why is the "conf" variable included in the model if we don't intend to test it, and how
does including a variable we never test help us with testing "starts" and "invent", the
variables we do need to test? (three sentence discussion, please)
2. Conducting the Tests
a. Using the p-value decision rule, conduct each of the two tests from questions 1b and 1c
at the α = .05 level; in each case, show which two numbers you compared and determine whether
each null hypothesis can or cannot be rejected.
b. In one sentence entirely without symbols or the words "hypothesis", "test", or
"coefficient" (but you may use the words "significant" or "not significant"), verbally
communicate your results of the test on the inventory variable to an audience of business
persons.
"Our statistical analysis concludes that manufacturing inventory levels
c. Using the p-value decision rule again, would your test result for the housing starts
variable have been reversed if:
A two-sided test instead of a one-tailed test had been conducted?
An α = .10 (instead of α = .05) had been chosen for the one-tailed test?
3. Point and Interval Estimation of Regression Coefficients
a. Using the "delta formula" for multiple regression, a rise of $100 billion in
manufacturing inventories will yield an average (increase/decrease) [circle one] in the
unemployment rate of ____ percentage points, other things equal. [show work]
b. Use the approximate formula b ± 2 sb to construct a confidence interval for the
coefficient of the inventory variable. Show work and express your answer in "±" form and [low,
high] interval form.






Chapter 13 Nonparametric Methods

Approach: To use the inference methods introduced so far, a set of assumptions had to be
made. The test results may not be valid if these assumptions do not hold. In this chapter, we
discuss the sensitivity of inferential methods to these assumptions and present methods to test
whether the assumptions hold. We also explore a couple of alternative methods that do not rely on
as many or as restrictive a set of assumptions.

New Concepts and Terms:
parametric vs. nonparametric statistics
robustness
sign test and Wilcoxon signed-rank test

SECTION 13.1 Parametric and Nonparametric Statistics
SECTION 13.2 One-Sample Nonparametric Tests


13.1 Parametric and Nonparametric Statistics
Statistical inference techniques use a random sample to construct confidence intervals or
conduct hypothesis tests about unknown population parameters. In each statistical procedure
(e.g. one- and two-sample methods, regression, or analysis of variance), a knowledge of
sampling distributions allows us to make probability inferences from sample statistics. In earlier
chapters, we applied a t distribution to the one-sample t test statistic and used F distributions for
analysis of variance tests of linear models in regression and experimental design situations. In
like fashion, intervals were derived from sampling distributions for estimating a population
mean or predicting individual observations from a fitted regression model.
However, each of these inferential procedures requires specific distributional properties,
especially normality, in the population data being sampled. We call statistical procedures
parametric if they rely on restrictive distributional assumptions.
DEFINITION: Inferential methods are parametric procedures if their sampling distributions
depend on imposing restrictive distributional assumptions.
Recall from Chapter 7 that a distribution is actually an entire family of specific distributions,
each distinguished by its parameter values. For example, the normal pdf is a two-parameter
distribution. Given a normal distribution, we can know everything about its shape and location
from its mean and standard deviation σ. Parametric procedures utilize distributional
assumptions to arrive at a sampling distribution having only a few parameters. The sample data is used
to estimate the parameters for that distribution. Then, the probabilities from this sampling
distribution are aligned with the significance or confidence levels assigned by the decision-
maker. However, unwarranted confidence intervals and test conclusions can result if the
distributional assumptions do not apply.
The distribution family for a parametric procedure follows directly from the assumptions
of that procedure. If these assumptions are approximately valid for the populations being
sampled, a parametric method yields efficient and powerful inferences. If instead the
assumptions are inappropriate, parametric statistics can cause misleading results and
incorrect decisions.
For example, to determine the rejection region from a t density function for a one-sample
test, the parametric test statistic (x̄ - μ0)/(s/√n) must approximately have t for its sampling
distribution. One way to assure this outcome is for X to be normally distributed. We may
already know that X has a normal distribution from examining frequency distributions of similar
studies. Alternatively, industry experts with an understanding of the data generating process
may anticipate that X will be approximately normal.
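As a reminder of what the parametric route involves, here is a minimal Python sketch of the one-sample t statistic computed by hand and with scipy; the data values and the hypothesized mean are made up for illustration.

# Parametric one-sample t test: relies on (approximate) normality of X.
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.4, 10.0])   # illustrative sample
mu_0 = 10.0                                                     # hypothesized mean

t_stat = (x.mean() - mu_0) / (x.std(ddof=1) / np.sqrt(len(x)))
p_value = 2 * stats.t.sf(abs(t_stat), df=len(x) - 1)            # two-sided p-value
print(round(t_stat, 3), round(p_value, 3))

# scipy provides the same test directly:
print(stats.ttest_1samp(x, mu_0))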

The z, t, and F statistics used in parametric procedures may be valid even for populations
that do not conform to the assumed normal distribution. We learned from the central limit
theorem how approximately normal distributions will arise if sample sizes are large enough.
Moreover, logarithms (or other transformations) may be used to convert highly-skewed or
outlier-prone data to a normal distribution. In regression and analysis of variance models, the
distributional assumptions are imposed on the random disturbance term ε. The central limit
theorem again operates for linear models, resulting in approximately normal ε if samples are
large. Furthermore, we discussed in Chapters 10 through 11 methods of experimental design,
data sampling, variable definition, and model construction that can often prevent a breakdown in
the model assumptions.
If a parametric procedure is robust, we may still be able to use that procedure with
populations that are not distributed as assumed.
DEFINITION: A statistical procedure is robust if inferences are not affected adversely by
moderate departures from the distributional assumptions.
The parametric procedures discussed in earlier chapters are fairly robust so long as the departures
from normality are not severe.
42

Popular parametric procedures have achieved their high rate of acceptance because of
their common applicability and robustness.
There are many situations in which parametric procedures are found wanting. Because
the central limit theorem does not apply for smaller samples, parametric analysis is most
susceptible to accusations that restrictive assumptions are not being met. Any inferences we
make will then be suspect. A small sample size also hampers us from examining the sample for
distributional violations. Similarly, the central limit theorem may require very large samples
before normality is attained if distributions are exceedingly nonnormal or contain frequent
outliers. Pioneering investigations and exploration of new data may also concern researchers
considering the application of parametric methods.
Two of the most common distributional requirements of parametric procedures are
normality and, for linear and two-sample models, constant standard deviations. For
large enough samples, we may infer whether the sample data (or residuals in the case of models)
is normally distributed from inspection of the histogram shape. Sample standard deviations may
also be compared for approximate equality. In the latter part of this chapter, we shall discuss


42
Recently, several parametric procedures have been developed that are even more robust; although often involving complex sets of
calculations, these procedures are increasingly accessible with high-speed computers.

more formal methods of testing whether a sample originates from a normal distribution and
equality of standard deviation tests.
Suppose we suspect from examining previous studies that distributional assumptions are
not met. Or maybe the sample data set itself indicates that variables are not normal or standard
deviations are substantially different. Perhaps the population variables, such as the time at which
a sale occurs, are known to follow a uniform rather than normal pdf. Or perhaps we have
insufficient sample information to test for distribution properties.
In each of these situations, should we choose a parametric procedure or instead select a
nonparametric method?
DEFINITION: A nonparametric method is a statistical procedure that relies on fewer or less
restrictive distributional assumptions than does its analogous parametric cousin.
At least one nonparametric alternative exists for most common parametric methods. Before
introducing these new procedures, we must first determine when to use them. To help us decide,
we next discuss some of the pros and cons of parametric and nonparametric methods.
Like the parametric methods we have studied, nonparametric procedures also employ
sampling distributions to determine confidence intervals and significance test results. But
because they rely on fewer or weaker assumptions to derive these sampling distributions,
nonparametric methods are applicable to a wider range of problems than are their parametric
equivalents.
One major advantage of nonparametric methods is their relative insensitivity to extreme
values and skewed distributions. For example, in Chapter 2 we discovered that the sample mean
can be highly sensitive to outliers. Because arithmetic computations such as sums and quotients
are required to derive means and standard deviations, these measures do not appear in the
nonparametric statistics formulas nor are they objects of population parameter estimation.
Instead, ordinal measures such as the median are emphasized. Unlike the mean, the median of a
sample is insensitive to numerical variations in the data that do not alter the ranking of the
middle observation. Thus, hypothesis tests and interval estimates for the median can be
constructed using ordinal properties that are virtually "distribution free." Inferences about the
median are applicable to a very broad set of data distributions.
Most nonparametric methods are not completely distribution free, but instead utilize
weaker assumptions such as symmetrical population distributions. Normality is a special case of
symmetry, but symmetry is also consistent with many nonnormal
distributions including the uniform and even some bimodal distributions. Thus, a nonparametric

test that makes an assumption of symmetry is likely to be more appropriate than a similar
parametric procedure.
Because they use less sensitive ordinal measures and less restrictive distributional
assumptions, nonparametric methods are applicable to a broader range of problems than
are parametric procedures.
Why not always choose a nonparametric method? If parametric methods make
assumptions that are not satisfied for a particular problem, the statistical findings may be
sensitive to departures from these assumptions and lead us to make an improper decision. Often
we cannot be certain that the assumptions are valid, so why not err on the safe side? The less
restrictive distributional assumptions of the nonparametric method will be satisfied more
generally, and the results will therefore be valid more often.
In the past, major barriers existed to selection of nonparametric statistics. Being a newer
area of statistical interest, nonparametric methods first had to be developed and their properties
understood, information disseminated to practitioners used to traditional parametric procedures,
and computer software altered to include ordinal computations and applicable sampling
distributions. Now that many of these barriers have been overcome, the major reason for not
using nonparametric statistics is their relative lack of power.
In statistical decision making, we always attempt to bring all relevant information to bear
on a problem. The more information, the narrower the confidence intervals we can estimate and
the more powerful are our tests. We have already encountered this phenomenon in parametric
analysis. When we estimate a confidence interval for a univariate mean, knowing the population standard deviation σ allows us to use the standard normal Z distribution. If we are instead forced to estimate σ as well from the sample data, the price we pay is the thicker-tailed t distribution for our inferences. The result of less information is a wider, less informative interval estimate.
If our decisions involve inferences about the mean (or any other arithmetically based parameter such as σ), we need to use a method that is sensitive to data distributions. In nonparametric methods, sample statistics composed of ordinal measures allow us to avoid making distributional assumptions (or at least to loosen these assumptions and shorten the list). The reason the parent population distributions may be ignored is that statistics based on ordinal measures are less sensitive to distributional differences. But there is a cost to this as well: less informative findings.
Statistics derived from ordinal rankings may sacrifice important quantitative distinctions
or create differences that are numerically trivial. For example, consider the samples 1, 5, 6, 8,
200 and 4, 5, 6, 7, 8. As noted in Chapter 2, both have the same sample median, 6, despite radically different means (44 versus 6). Alternatively, considering only the ranking and not the
numerical value of each sample observation may magnify differences in the data. Thus, the two
nine-observation samples, 0, 0, 0, 0, 9, 9, 9, 9, 9 and 0, 0, 0, 0, 0, 9, 9, 9, 9 have similar means (5
versus 4) but very different medians (9 versus 0). Perhaps you have experienced the effects of
information loss from the use of ordinal measures when you missed an "A" in a course by a
single point while a friend with a much lower average just squeezed by with a "B".
This lost quantitative information in nonparametric methods often translates into less
powerful tests and less precision in estimation intervals. In hypothesis testing, nonparametric
tests are less able to distinguish between the null and alternative hypotheses because much of the sample information is ignored and less stringent distributional assumptions are made.
Alternatively, a larger sample may be necessary to obtain the power or precision available from
the equivalent parametric method.
But if the parametric assumptions were valid all along, we are needlessly sacrificing
testing power and estimation precision. The nonparametric test may therefore not allow us to
conclude statistical significance that the parametric version would have uncovered. The
resulting business decision can be impaired or misguided. Similarly, an overly broad prediction
interval from a nonparametric procedure may lead to wasteful precautions against highly unlikely
outcomes. The prediction interval using the t distribution could have saved the company time
and resources and better focused business strategies on the most likely outcomes.
However, the ordinal information is frequently good enough. Acceptably precise
estimates and correct test results may be attained by nonparametric procedures. Because nonparametric procedures do not impose the distributional constraints of their parametric counterparts, their statistically significant results are more resistant to attack from your "what if" critics. On the other hand, if the parametric assumptions are not in fact valid, a nonsignificant finding may represent the true state of nature, and the parametric method is more likely to lead decision makers toward false inferences.
If the assumptions of a parametric procedure are valid, the nonparametric counterpart will provide less powerful test results and furnish reduced estimating precision. Conversely, if
their distributional assumptions are not approximately true, parametric procedures can
exaggerate the precision or significance of statistical analysis.

Which type of method should we use? Our answer depends on how sensitive the
parametric procedure is to departures from the distributional assumptions, how great that
departure is, how much tolerance the decision maker has to errors from using an inappropriate
procedure, and how good the alternative nonparametric methods are. In what follows, we
investigate a few commonly used nonparametric methods.

Why can't we simply let the computer tell us which method to choose? With most
computational barriers removed, modern computer software allows us to choose from a wide
range of statistical procedures. However, there is also the danger that we will select an
inappropriate procedure. Computers can be programmed to perform any calculations.
Computers can mistakenly issue pink slips firing every employee or send customers a bill for
$0.00. In a like manner, the computer may permit us to select an inferior or unsuitable statistical
procedure to construct intervals or conduct tests.
One of the most important responsibilities of business statistics courses today is to teach
students which statistical methods should be used in different situations. "Statistical computing
programs have given statisticians a rare additional opportunity to educate others in the proper use
of statistical techniques." [43]


Multiple Choice Review:
13.1 Which of the following is a characteristic of nonparametric statistics?
a. they are used for testing rather than for estimation
b. they rely on ordinal measures
c. they make no assumptions about the population distribution
d. they result in more powerful test conclusions than comparable parametric tests
e. all of the above are characteristics of nonparametric statistics

13.2 Which of the following is an advantage of nonparametric statistics over their parametric
cousins?
a. easier to calculate
b. relies on fewer assumptions
c. less sensitive to sample outliers
d. all of the above
e. none of the above

13.3 The problem with using a parametric method when it is inappropriate is that
a. you may not get the proper results
b. you may not be able to trust your results
c. decision makers may act on the basis of improper findings
d. others may justifiably criticize your findings
e. all of the above


43. Gerard E. Dallal (1990), "Statistical Computing Packages: Dare We Abandon Their Teaching to Others?" American Statistician 44(4): 265-267.


13.4 Using a parametric method when only a nonparametric method is justified yields
statistical inference conclusions that will always be
a. wrong
b. substantially different from what would have occurred had a nonparametric method
been used
c. modestly different from those of the corresponding nonparametric method
d. statistically indefensible
e. none of the above


13.2 One-Sample and Matched Pair Tests
In Part III of this text, we introduced a parametric procedure for one-sample inference. We learned that if x̄ is normally distributed, then the sample statistic (x̄ − μ₀)/s_x̄ has a t distribution (or a Z distribution if σ is known). This t sampling distribution may be used to construct confidence intervals and conduct hypothesis tests for the population mean. We also discovered that x̄ is normally distributed under two alternative conditions: (1) X is a normally distributed random variable, or (2) the sample size is large enough for the central limit theorem to ensure approximate normality. [44] What if neither of these conditions is present, i.e., what if the sample is small and drawn from a nonnormal distribution? Then blindly proceeding with t-tests and calculating t-intervals may yield highly misleading results. One commonly used nonparametric alternative available for such cases is called the sign test.
The sign test is merely an application of the binomial distribution to continuous data.
Recall from Chapter 6 that Bernoulli trials describe categorical data with only two possible
outcome states S and F (success and failure, or any other two states such as hired or not hired,
merge or don't merge, etc.) Suppose each category occurs with equal probability, i.e., P(S) =
P(F) = 1/2. Then for a sample of size n, the expected number of successes (or of failures) would
be E(X) = n/2. For example, a sample of 8 should average 4 outcomes of S and 4 of F.
However, we would not be surprised to find 5 or 6, or even an occasional 7 of the 8 successes or
failures in any particular sample. By contrast, it is unlikely that all 8 would be of one type of
outcome.
The binomial distribution formula provides a way to determine P(x), the probability of any particular number of outcomes x. In the case of p = 1/2, the formula from Chapter 6 simplifies to [45]

P(x) = n! / (2^n x!(n − x)!)


44. Chapter 8 also discussed a method to transform X by taking its logarithm when X is highly skewed to the right.
45. The 2^n in the denominator results from the p^x (1 − p)^(n−x) term reducing to (1/2)^n for p = 1/2.

As we saw in Chapter 6, calculations from this formula are rather laborious (see the Exercises
for calculation of the following from the formula). The cumulative distribution command CDF
in Minitab provides us with the probabilities P(x ≤ K), where K is any desired number of S outcomes.
BINOMIAL WITH N = 8 P = 0.500000
K P( X LESS OR = K)
0 0.0039
1 0.0352
2 0.1445
3 0.3633
4 0.6367
5 0.8555
6 0.9648
7 0.9961
8 1.0000

From this cumulative distribution table, we learn that P(x = 0) = .0039, P(x ≤ 1) = .0352, P(x ≤ 2) = .1445, and so forth. Because P(F) = 1 − P(S) = P(S) = 1/2, the same probabilities govern the probabilities of 0 failures (i.e., 8 successes), 1 or fewer failures (at least 7 successes), etc. Therefore the probability of all S or all F is .0039 + .0039 = .0078. Similarly, the probability of 7 or more of either one of the two types of trial outcomes is 2(.0352), or .0704. The symmetry of the p = 0.5 binomial distribution allows us to double the cumulative probability.
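
For readers who want to reproduce this table outside Minitab, the following is a minimal sketch in Python, assuming the SciPy library is available; the code is our own illustration, not part of the Minitab session shown above.

    # Cumulative binomial probabilities P(x <= K) for n = 8 trials with p = 0.5
    from scipy.stats import binom

    for k in range(9):
        print(k, round(binom.cdf(k, 8, 0.5), 4))
    # Output begins 0 0.0039, 1 0.0352, 2 0.1445, ... matching the table above.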
What does all this have to do with quantitative variables and nonparametric statistics? We mentioned in section 13.1 that nonparametric methods use ordinal measures such as the median. The sign test uses a unique property of the median: regardless of the distribution, fifty percent of the population lies above the median and fifty percent falls below the median. If the null hypothesis that the population median M = M₀ is true, then the expected number of observations above (and below) M₀ will be one-half the sample size. Furthermore, T⁺, the number of observations above the median, will always have a binomial distribution. So will T⁻, the sample frequency of observations below the median.
In a random sample of any random variable X, the number of observations above (or
below) the population median of X will have a binomial distribution.
This distributional property for T⁺ is correct whether X has a normal, uniform, bimodal, skewed, or any other distribution.
One way to obtain the sample value for T⁺ is to subtract M₀ from each observation and count the number of differences with positive signs, hence the name "sign" test. The sign test works like a test of binomial categorical data, because all sample information is disregarded except for how many observations lie on either side of the hypothesized median.


DEFINITION: The two-sided sign test tests whether the population median of a random variable X is significantly different from M₀. The test statistic is T⁺, the sample frequency of observations greater than M₀, and the sampling distribution is the cumulative binomial distribution.
Formally, the two-sided hypothesis test may be stated as follows:
H₀: M = M₀
Hₐ: M ≠ M₀
The condition M = M₀ for the population median of X is analogous to μ = μ₀ for the Chapter 8 parametric test of the mean. We reject the null hypothesis if there are too many (or too few) observations greater than M₀ for the sample to have come from a population with a median M = M₀. One-sided sign tests may be conducted by considering only probabilities on one (rather than either) end of the binomial distribution.
H₀: M ≥ M₀          or          H₀: M ≤ M₀
Hₐ: M < M₀                       Hₐ: M > M₀
Since the mean and median are identical for symmetric distributions, the nonparametric sign test
may also be used to test hypotheses about the mean for symmetric X.
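
To make the counting mechanics concrete before turning to an example, here is a minimal sketch of the sign test in Python, assuming SciPy (version 1.7 or later) is available; the helper name sign_test and its arguments are our own illustration, not a Minitab or textbook command.

    # Sketch of a sign test: count observations above and below the hypothesized
    # median m0 and obtain the p-value from the binomial distribution with p = 1/2.
    from scipy.stats import binomtest

    def sign_test(data, m0=0.0, alternative="two-sided"):
        above = sum(1 for x in data if x > m0)
        below = sum(1 for x in data if x < m0)
        n = above + below            # observations equal to m0 are dropped by convention
        result = binomtest(above, n, p=0.5, alternative=alternative)
        return above, below, result.pvalue

Passing alternative="greater" corresponds to the one-sided pair H₀: M ≤ M₀ versus Hₐ: M > M₀ discussed above.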

Chapter Case #1: Were Insurance Companies Gouging the Public?
As an example of the sign test, consider a case involving matched pair differences, which
we learned from Chapter 8 may also be analyzed by one-sample methods. During the mid-
1980s, accident and health insurance rates were raised frequently and by amounts much greater than an allowance for inflation would have merited. Insurance carriers claimed that, despite these rate hikes, premiums were barely keeping pace with excessive jury awards, settlements, and escalating health care costs. Consumer watchdog groups disputed this contention. They
argued that the insurance industry was collecting premiums significantly above its awards in
order to cover investment losses from previous years. Which side was right?

To shed light on this question, we collected a 1987 sample of 17 insurance companies. [46]
For each company, we measure total premiums on policies ('policy$') and benefits paid out ('paid out'), reported in units of millions of dollars. A column of matched pair differences ('diff') is generated by subtracting the second column from the first.
ROW policy$ paid out diff
1 53.1 56.8 -3.7000 (negative)
2 41.7 38.7 3.0000
3 28.8 12.9 15.9000
4 36.3 33.7 2.6000
5 55.2 55.6 -0.4000 (negative)
6 27.5 14.6 12.9000
7 50.0 54.2 -4.2000 (negative)
8 39.9 11.0 28.9000
9 19.6 16.6 3.0000
10 93.5 78.7 14.8000
11 14.8 10.0 4.8000
12 81.8 58.8 23.0000
13 84.2 67.8 16.4000
14 24.5 10.1 14.4000
15 46.8 41.0 5.8000
16 71.4 45.8 25.6000
17 81.6 62.5 19.1000

If a t-test with μ₀ = 0 is run with this sample, we get the following results:
TEST OF MU = 0.000 VS MU G.T. 0.000

N MEAN STDEV SE MEAN T P VALUE
diff 17 10.700 10.203 2.475 4.32 0.0003
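
As a cross-check outside Minitab, the same t-test can be reproduced from the 17 differences keyed in from the table above; this is a sketch assuming SciPy 1.6 or later, and the variable names are ours.

    # One-sided, one-sample t-test of the matched pair differences against mu0 = 0
    from scipy import stats

    diff = [-3.7, 3.0, 15.9, 2.6, -0.4, 12.9, -4.2, 28.9, 3.0,
            14.8, 4.8, 23.0, 16.4, 14.4, 5.8, 25.6, 19.1]

    res = stats.ttest_1samp(diff, popmean=0, alternative="greater")
    print(round(res.statistic, 2), round(res.pvalue, 4))   # roughly 4.32 and 0.0003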

However, the small p-value of .0003 might not put to rest legitimate concerns (perhaps
encouraged by the insurance industry) that the small sample size makes the t-test inappropriate if
the population is not normal.
To deal with these concerns, we may employ the sign test derived from binomial
probabilities. A count of the signs of the differences indicates that all but three (-3.7, -0.4, and -4.2) exceed M₀ = 0. The number of sample observations in which premiums exceed benefits paid is 14 (5.5 more than n/2 = 8.5). The watchdog groups cite the predominance of positive sample differences as evidence that the median is not zero. To conduct the test, we determine the probability of obtaining such a lopsided split of 14 positive differences to 3 negative ones if the median difference were in fact zero.

46. Sample drawn for the class 8 size category from Best's Insurance Reports--Life/Health, 1988. These are moderate-sized companies with 1987 premiums written between $10 and $100 million.

H₀: M ≤ 0
Hₐ: M > 0
Since P(T⁺ ≥ 14) is the same as P(T⁺ ≤ 3), we may determine the second of these cumulative probabilities.
K P( X LESS OR = K)
3.00 0.0064
There is less than a one percent chance (0.64 percent) that 14 of the 17 differences would be
positive if the population median difference were actually zero. This .0064 may be used as the
p-value for the nonparametric significance test. If an α = .01 is required to convince the state insurance commissions to act against the insurance companies, then p = .0064 is determined to be less than the level of significance α. The median difference is found to be significantly greater than zero. The sign test results in rejection of the null hypothesis and a conclusion that the median difference of premiums less benefits for 1987 was positive among the class of insurers under study. Unlike the t-test, this test result supports the consumer groups' arguments without relying on assumptions of normality.
Instead of counting the positive signs and referring to the binomial distribution, it is easier to use the Sign Test command (in the Nonparametric statistics menu of Minitab) to conduct the sign test directly. Simply type in the same information you entered in the Minitab dialog box for the t-test.
Applying this test to the insurance company differences, premiums minus payouts, we
obtain the following output:

SIGN TEST OF MEDIAN = 0.00000 VERSUS G.T. 0.00000

N BELOW EQUAL ABOVE P-VALUE MEDIAN
C1 17 3 0 14 0.0064 12.90

The Sign Test output states the one-sided test in terms of the median, and then counts the 14 sample differences above and 3 below M₀ = 0. The results also duplicate the correct binomial distribution p-value of .0064. Had there been observations equal to M₀, their number would have appeared in the "EQUAL" column; it is conventional to subtract these from the sample size before computing the binomial probabilities.
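
The same p-value can also be checked outside Minitab from the counts alone, either with the sign_test helper sketched earlier or directly with SciPy's exact binomial test; the call below is illustrative.

    # Sign test check: 14 of 17 nonzero differences are positive; H0: M <= 0 vs HA: M > 0
    from scipy.stats import binomtest

    print(round(binomtest(14, 17, p=0.5, alternative="greater").pvalue, 4))  # about 0.0064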


The main problem with the sign test is that much of the quantitative information contained in the sample is sacrificed. The sign test doesn't consider how far away observations are from M₀, only how many are above and below M₀. For example, the sign test would count three observations above and six below M₀ for the sample 0, 0, 1, 2, 3, 4, 100, 200, 500 whether M₀ in the null hypothesis is 5, 20, or 99. If we do not know anything about the distribution of the variable X being sampled, the sign test makes the most of our limited statistical options. Unfortunately, the sign test also lacks the power to reject H₀ as often as a parametric test derived from the actual distribution of X. Even if we can only make a weak distributional assumption about X, we can get a lot more mileage from a sample. In particular, if X may be assumed to have a symmetrical distribution, we can make full use of the ordinal information contained in each sample.
If the population being sampled is symmetric, the appropriate nonparametric test instead
is the Wilcoxon signed-rank test.
DEFINITION: The Wilcoxon signed-rank test is a nonparametric method used with symmetric
distributions for inferences about the mean and median of one-sample problems or matched pair
differences.
Because symmetric distributions have the same mean and median, the signed-rank test permits hypotheses to be expressed in terms of means as well as medians. Moreover, the left and right sides are mirror images of each other in a symmetric distribution. Therefore, if the symmetric population has a median of M₀ as hypothesized in H₀, differences from M₀ can be compared ordinally by ranking their magnitudes. That is precisely what this Wilcoxon test does.
For the insurance company case example, it is reasonable to assume that premiums minus
benefits paid are symmetrical in the population. The histogram also provides support for
assuming symmetry.
Histogram of diff N = 17
Midpoint Count
-5 2 **
0 1 *
5 5 *****
10 0
15 5 *****
20 1 *
25 2 **
30 1 *

Notice that the lower half nearly mirrors the upper half of the histogram. By contrast, the dearth
of observations in the middle makes it risky to impose normality and run the t-test.

The Wilcoxon signed-rank test considers the rankings of the differences as well as the number of observations above and below M₀.
In the Wilcoxon signed-rank test, the differences of each sample observation from M₀, the hypothesized population median, are ranked by magnitude. The test statistic, W, is the sum of ranks for those observations above M₀. If the null hypothesis is true, W will have an approximately normal distribution with mean n(n + 1)/4, one half the maximum possible rank sum.
The W statistic is found by first ranking the magnitudes of these differences from smallest to largest and then summing the rankings only of observations with a positive sign. Because magnitudes are considered in these rankings, this test is more powerful than the sign test, i.e., we are able to correctly reject H₀ and thus avoid type II errors more often. On the other hand, use of the ordinal aspects of the data does not require the normal distribution assumption of the t-test.
Any quantitative data may be converted to a set of rankings. Thus, the data 4, 12, 6.5,
and 0 have ranks of 2, 4, 3, and 1, respectively. To observe these signed rankings in operation,
let's return to the insurance company matched pair differences discussed earlier. We find the
magnitudes of this column of data by stripping off any negative signs. Then we determine the
ranking of these magnitudes (from lowest = 1 to highest = 17).
ROW diff abs diff ranking
--> 1 -3.7000 3.7000 5.0
2 3.0000 3.0000 3.5
3 15.9000 15.9000 12.0
4 2.6000 2.6000 2.0
--> 5 -0.4000 0.4000 1.0
6 12.9000 12.9000 9.0
--> 7 -4.2000 4.2000 6.0
8 28.9000 28.9000 17.0
9 3.0000 3.0000 3.5
10 14.8000 14.8000 11.0
11 4.8000 4.8000 7.0
12 23.0000 23.0000 15.0
13 16.4000 16.4000 13.0
14 14.4000 14.4000 10.0
15 5.8000 5.8000 8.0
16 25.6000 25.6000 16.0
17 19.1000 19.1000 14.0

MTB > SUM C15
SUM = 153.00

The sum of the ranks from 1 through 17, or 153, is the maximum possible Wilcoxon statistic for a sample of n = 17. [47] If the null hypothesis is true, then rankings will be randomly distributed among the positively and negatively signed differences. We should therefore expect the sum of rankings for only the positively signed observations to be approximately one-half of 153, or near 76.5. Instead, the Wilcoxon test statistic, W, is much closer to 153.
W is so large for two reasons. First, summing the ranks of the positive differences involves adding 14 of the 17 rankings, since only the observations in rows 1, 5, and 7 have 'diff' values that are negative. In addition, unlike the sign test, which counts these 14 equally, the Wilcoxon signed-rank test evaluates their ranks as well. In this example, the three negative differences are among the smallest, containing the first, fifth, and sixth lowest-ranked magnitudes. The high rankings of the positive differences add further to the sum. Thus, W = 141 is only 12 (1 + 5 + 6) below the 153 maximum possible sum. If the three negative differences had larger magnitudes, W would have been considerably smaller and the p-value commensurately larger as a result.
To conduct this test in one simple step, we may once again refer to the Minitab nonparametric statistics menu, this time selecting the Wilcoxon test entry. For the insurance sample of 17 companies, the Minitab output for the one-sided test is the following:
TEST OF MEDIAN = 0.000000000 VERSUS MEDIAN G.T. 0.000000000

N FOR WILCOXON ESTIMATED
N TEST STATISTIC P-VALUE MEDIAN
diff 17 17 141.0 0.001 10.30

Notice that W = 141 agrees with our earlier computations. The p-value, .001, is computed by
comparing the sample statistic for this test with the appropriate normal distribution. The p-value
is smaller than that found for the sign test once the ordinal rankings are considered.
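
For readers working outside Minitab, the calculation behind that p-value can be sketched as follows. Under H₀ the mean of W is n(n + 1)/4 = 17(18)/4 = 76.5, and its standard deviation, from the standard formula n(n + 1)(2n + 1)/24 for the variance of W (ignoring the small adjustment for tied ranks), is the square root of 446.25, or about 21.1. Thus W = 141 lies roughly three standard deviations above its expected value, which is consistent with a one-sided p-value near .001. The Python sketch below, assuming SciPy is installed and with variable names of our own choosing, recomputes W and the approximate p-value.

    # Recompute the Wilcoxon signed-rank statistic W and its one-sided p-value
    import numpy as np
    from scipy import stats

    diff = np.array([-3.7, 3.0, 15.9, 2.6, -0.4, 12.9, -4.2, 28.9, 3.0,
                     14.8, 4.8, 23.0, 16.4, 14.4, 5.8, 25.6, 19.1])

    ranks = stats.rankdata(np.abs(diff))     # tied magnitudes receive averaged ranks
    W = ranks[diff > 0].sum()                # sum of ranks of the positive differences
    print(W)                                 # 141.0

    res = stats.wilcoxon(diff, alternative="greater")
    print(res.statistic, res.pvalue)         # W = 141.0, p roughly 0.001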


47. When two or more data values are alike (and thus equally ranked), they are each assigned the average rank. Notice, for example, that the two 3.0 values from C13 have ranks of 3.5, the mean of their 3rd and 4th lowest rankings.

Multiple Choice Review:
13.10 Which of the following is not an example of a nonparametric method:
a. t test
b. sign test
c. Wilcoxon test
d. all of the above are examples of nonparametric methods

13.11 One difference between the sign test and the t-test is that the sign test
a. does not assume a normal distribution of the sampling statistic
b. tests hypotheses related to the mean
c. involves only the sum of ranks
d. all of the above
e. none of the above

13.12 One difference between the sign test and the Wilcoxon test is that the Wilcoxon test
a. does not assume a normal distribution of the sampling statistic
b. tests hypotheses related to the mean
c. involves only the sum of ranks
d. all of the above
e. none of the above

13.13 Given a null hypothesis of M = 10, for which of the following samples would the sign
test yield different results:
a. 5, 5, 5, 5, 5, 15
b. 1, 1, 1, 1, 9, 15
c. 5, 5, 5, 5, 5, 1000
d. 5, 5, 5, 5, 15, 15
e. all of the above would yield the same sign test results

13.14 Given a null hypothesis of M = 10, for which of the following samples would the
Wilcoxon test yield different results
a. 1, 2, 3, 4, 15
b. 6, 7, 8, 9, 15
c. 1, 2, 3, 4, 1000
d. 9, 9, 9, 9, 12
e. all of the above would yield the identical Wilcoxon test results


13.15 For nonparametric tests on the two samples 5, 6, 7, 8, 9, 20 and 4, 5, 6, 7, 8, 11 using a
null hypothesis of M = 10, we would be
a. less likely to reject H₀ with the first sample if we used the Wilcoxon test
b. less likely to reject H₀ with the first sample if we used the sign test
c. less likely to reject H₀ with the first sample if we used either test
d. equally likely to reject H₀ for the first and second sample if we used the sign test
e. both a and d

13.16 One possible justification for using the t-test is
a. the population is known to be symmetrical
b. the sample is small
c. the histogram for the sample strongly suggests a normal population
d. the data is ordinal rather than quantitative
e. all of the above

13.17 If the population is assumed to be symmetrical
a. we must use nonparametric methods
b. nonparametric tests apply only to the mean
c. nonparametric tests apply only to the median
d. nonparametric tests apply to both the mean and median
e. we should not use nonparametric tests

Calculator Problems:
13.18 Verify from P(x) = n!/(2^n x!(n − x)!), the binomial distribution formula with p = 0.5, the
cumulative probabilities P(x = 0) = .0039 and P(x ≤ 1) = .0352 for a sample of n = 8.

13.19 The algebraic formula for the sum of integers from 1 to n is n(n + 1)/2. For example,
1 + 2 = 2(3)/2 = 3 and 1 + 2 + 3 + 4 = 4(5)/2 = 10. Use this formula to determine the highest
possible value for W in samples of
(a) n = 8
(b) n = 20
(c) n = 40

Case Study Exercises
1. The following sample of quarterly time series data is contained in column C38:
fuelcost cost of fuel, in cents per gallon, during the 1980s
To test the null hypothesis that fuel prices averaged 90 cents/gallon, parametric and
nonparametric tests were conducted.


TEST OF MU = 90.000 VS MU N.E. 90.000
N MEAN STDEV SE MEAN T P VALUE
fuelcost 32 82.425 15.591 2.756 -2.75 0.0099

SIGN TEST OF MEDIAN = 90.00 VERSUS N.E. 90.00
N BELOW EQUAL ABOVE P-VALUE MEDIAN
fuelcost 32 20 0 12 0.2153 86.65

TEST OF MEDIAN = 90.00 VERSUS MEDIAN N.E. 90.00
N FOR WILCOXON ESTIMATED
N TEST STATISTIC P-VALUE MEDIAN
fuelcost 32 32 155.5 0.043 84.72

(1) Conduct each test at the .05 significance level. Are test results consistent with one another?
The t-test and Wilcoxon test are consistent with each other; each test finds fuel costs significantly less than 90 cents because both p-values, .0099 and .043, are less than α = .05. However, no significant difference from 90 cents is found using the sign test.
(2) Use the histogram to decide which test (or tests) are justified by the distributional
assumptions. Carefully explain your reasoning.

Histogram of fuelcost N = 32

Midpoint Count
50 1 *
55 4 ****
60 1 *
65 1 *
70 1 *
75 0
80 3 ***
85 6 ******
90 6 ******
95 4 ****
100 4 ****
105 1 *

The histogram of sample data does not indicate that fuel costs are normally or even symmetrically distributed. If the sample size of n = 32 observations is considered large enough to invoke the central limit theorem, then the t-test is justified. Alternatively, if the sample is not considered large enough, we should rely on the sign test results (but not on the Wilcoxon test, which assumes a symmetric distribution).
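
Although the raw fuelcost data are not listed here, the sign test p-value depends only on the counts reported in the output, so it can be checked from them alone; the call below is an illustrative sketch assuming SciPy is available.

    # Two-sided sign test check from the reported counts: 12 above and 20 below 90
    from scipy.stats import binomtest

    print(round(binomtest(12, 32, p=0.5, alternative="two-sided").pvalue, 4))  # about 0.2153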
