
Applied Managerial Statistics

Steven L. Scott Winter 2005-2006

Copyright © 2002-2005 by Steven L. Scott. All rights reserved. No part of this work may be reproduced, printed, or stored in any form without prior written permission of the author.

Contents

1 Looking at Data
  1.1 Our First Data Set
  1.2 Summaries of a Single Variable
      1.2.1 Categorical Data
      1.2.2 Continuous Data
  1.3 Relationships Between Variables
  1.4 The Rest of the Course

2 Probability Basics
  2.1 Random Variables
  2.2 The Probability of More than One Thing
      2.2.1 Joint, Conditional, and Marginal Probabilities
      2.2.2 Bayes' Rule
      2.2.3 A Real World Probability Model
  2.3 Expected Value and Variance
      2.3.1 Expected Value
      2.3.2 Variance
      2.3.3 Adding Random Variables
  2.4 The Normal Distribution
  2.5 The Central Limit Theorem

3 Probability Applications
  3.1 Market Segmentation and Decision Analysis
      3.1.1 Decision Analysis
      3.1.2 Building and Using Market Segmentation Models
  3.2 Covariance, Correlation, and Portfolio Theory
      3.2.1 Covariance
      3.2.2 Measuring the Risk Penalty for Non-Diversified Investments
      3.2.3 Correlation, Industry Clusters, and Time Series
  3.3 Stock Market Volatility

4 Estimation and Testing
  4.1 Populations and Samples
  4.2 Sampling Distributions
      4.2.1 Example: log10 CEO Total Compensation
  4.3 Confidence Intervals
      4.3.1 Can we just replace σ with s?
      4.3.2 Example
  4.4 Hypothesis Testing: The General Idea
      4.4.1 P-values
      4.4.2 Hypothesis Testing Example
      4.4.3 Statistical Significance
  4.5 Some Famous Hypothesis Tests
      4.5.1 The One Sample T Test
      4.5.2 Methods for Proportions (Categorical Data)
      4.5.3 The χ² Test

5 Simple Linear Regression
  5.1 The Simple Linear Regression Model
      5.1.1 Example: The CAPM Model
  5.2 Three Common Regression Questions
      5.2.1 Is there a relationship?
      5.2.2 How strong is the relationship?
      5.2.3 What is my prediction for Y and how good is it?
  5.3 Checking Regression Assumptions
      5.3.1 Nonlinearity
      5.3.2 Non-Constant Variance
      5.3.3 Dependent Observations
      5.3.4 Non-normal residuals
  5.4 Outliers, Leverage Points and Influential Points
      5.4.1 Outliers
      5.4.2 Leverage Points
      5.4.3 Influential Points
      5.4.4 Strategies for Dealing with Unusual Points
  5.5 Review

6 Multiple Linear Regression
  6.1 The Basic Model
  6.2 Several Regression Questions
      6.2.1 Is there any relationship at all? The ANOVA Table and the Whole Model F Test
      6.2.2 How Strong is the Relationship? R²
      6.2.3 Is an Individual Variable Important? The T Test
      6.2.4 Is a Subset of Variables Important? The Partial F Test
      6.2.5 Predictions
  6.3 Regression Diagnostics: Detecting Problems
      6.3.1 Leverage Plots
      6.3.2 Whole Model Diagnostics
  6.4 Collinearity
      6.4.1 Detecting Collinearity
      6.4.2 Ways of Removing Collinearity
      6.4.3 General Collinearity Advice
  6.5 Regression When X is Categorical
      6.5.1 Dummy Variables
      6.5.2 Factors with Several Levels
      6.5.3 Testing Differences Between Factor Levels
  6.6 Interactions Between Variables
      6.6.1 Interactions Between Continuous and Categorical Variables
      6.6.2 General Advice on Interactions
  6.7 Model Selection/Data Mining
      6.7.1 Model Selection Strategy
      6.7.2 Multiple Comparisons and the Bonferroni Rule
      6.7.3 Stepwise Regression

7 Further Topics
  7.1 Logistic Regression
  7.2 Time Series
  7.3 More on Probability Distributions
      7.3.1 Background
      7.3.2 Exponential Waiting Times
      7.3.3 Binomial and Poisson Counts
      7.3.4 Review
  7.4 Planning Studies
      7.4.1 Different Types of Studies
      7.4.2 Bias, Variance, and Randomization
      7.4.3 Surveys
      7.4.4 Experiments
      7.4.5 Observational Studies
      7.4.6 Summary

A JMP Cheat Sheet
  A.1 Get familiar with JMP
  A.2 Generally Neat Tricks
      A.2.1 Dynamic Graphics
      A.2.2 Including and Excluding Points
      A.2.3 Taking a Subset of the Data
      A.2.4 Marking Points for Further Investigation
      A.2.5 Changing Preferences
      A.2.6 Shift Clicking and Control Clicking
  A.3 The Distribution of Y
      A.3.1 Continuous Data
      A.3.2 Categorical Data
  A.4 Fit Y by X
      A.4.1 The Two Sample T-Test (or One Way ANOVA)
      A.4.2 Contingency Tables/Mosaic Plots
      A.4.3 Simple Regression
      A.4.4 Logistic Regression
  A.5 Multivariate
  A.6 Fit Model (i.e. Multiple Regression)
      A.6.1 Running a Regression
      A.6.2 Once the Regression is Run
      A.6.3 Including Interactions and Quadratic Terms
      A.6.4 Contrasts
      A.6.5 To Run a Stepwise Regression
      A.6.6 Logistic Regression

B Some Useful Excel Commands

C The Greek Alphabet

D Tables
  D.1 Normal Table
  D.2 Quick and Dirty Normal Table
  D.3 Cook's Distance
  D.4 Chi-Square Table
Don't Get Confused


1.1  Standard Deviation vs. Variance
2.1  Understanding Probability Distributions
2.2  The difference between X1 + X2 and 2X
3.1  A general formula for the variance of a linear combination
3.2  Correlation vs. Covariance
4.1  Standard Deviation vs. Standard Error
4.2  Which One is the Null Hypothesis?
4.3  The Standard Error of a Sample Proportion
5.1  R² vs. the p-value for the slope
6.1  Why call it an ANOVA table?

The "Don't Get Confused" call-out boxes highlight points that often cause new statistics students to stumble.

Not on the Test


4.1  Does the size of the population matter?
4.2  What are "Degrees of Freedom"?
4.3  Rationale behind the χ² degrees of freedom calculation
5.1  Why sums of squares?
5.2  Box-Cox transformations
5.3  Why leverage is "Leverage"
6.1  How to build a leverage plot
6.2  Making the coefficients sum to zero
6.3  Where does the Bonferroni rule come from?

There is some material that is included because a minority of students are likely to be curious about it. Much of this material has to do with minor technical points or questions of rationale that are not central to the course. The "Not on the Test" call-out boxes explain such material to interested students, while letting others know that they can spend their energy reading elsewhere.


Preface
An MBA statistics course differs from an undergraduate course primarily in terms of the pace at which material is covered. Much of the material in an MBA course can also be found in an undergraduate course, but an MBA course tends to emphasize topics that undergraduates never get to because they spend their time focusing on other things. Unfortunately for MBA students, most statistics textbooks are written with undergraduates in mind. These notes are an attempt to help MBA statistics students navigate through the mountains of material found in typical undergraduate statistics books.

Most undergraduate statistics courses last for a semester and conclude with either one way ANOVA or simple regression. Though it has a similar number of contact hours, our MBA course lasts for eight weeks and covers up to multiple regression. To get to where we need to be at the course's end we must deviate from the usual undergraduate course and make regression our central theme. In doing so, we condense material that occupies several chapters in an undergraduate textbook into a single chapter on confidence intervals and hypothesis tests. Undergraduate textbooks often present this material in four or more chapters, usually with other material that does not help prepare students to study regression, and in many cases with an undue emphasis on having students do calculations themselves. Our philosophy is that by condensing the non-regression hypothesis testing material into one chapter we can present a more unified view of how hypothesis testing is used in practice. Furthermore, the one-sample problems typically found in these chapters are much less compelling than regression problems, so packing them into one chapter lets us get to the good stuff more quickly. Material such as the two sample t test and the F test from one way ANOVA are presented as special cases of regression, which reduces the number of paradigms that students must master over an eight week term.

A textbook for a course serves three basic functions. A good text concisely presents the ideas that a student must learn. It illustrates those ideas with examples as the ideas are presented. It is also a source of problems and exercises for students to work on to reinforce the ideas from the reading and from lecture. At some point these notes may evolve into a textbook, but they're not there yet. The notes are evolving into a good presentation of statistical theory, but the hardest part of writing a textbook is developing sufficient numbers of high quality examples and exercises. At present, we have borrowed and adapted examples and exercises from three primary sources, all of which are required or optional course reading:

- Statistical Thinking for Managers, 4th Edition, by Hildebrand and Ott, published by Duxbury Press.
- Business Analysis Using Regression, by Foster, Stine, and Waterman, published by Springer.
- JMP Start Statistics, by John Sall, published by Duxbury.

Each of these sources provides data sets for their problems and examples, H&O from the diskette included with their book, FSW from the internet, and Sall from the CD containing the JMP-IN program. We will distribute the FSW data sets electronically.

Chapter 1

Looking at Data
Any data analysis should begin with looking at the data. This sounds like an obvious thing to do, but most of the time you will have so much data that you can't look at it all at once. This Chapter provides some tools for creating useful summaries of the data so that you can do a cursory examination of a data set without having to literally look at each and every observation. Other goals of this Chapter are to illustrate the limitations of simply looking at data as a form of analysis, even with the tools discussed here, and to motivate the material in later Chapters.

1.1 Our First Data Set

Consider the data set forbes94.jmp (provided by Foster et al., 1998), which lists the 800 highest paid CEOs of 1994, as ranked by Forbes magazine. When you open the data set in JMP (or any other computer software package) you will notice that the data are organized by rows and columns. This is a very common way to organize data. Each row represents a CEO. Each column represents a certain characteristic of each CEO, such as how much they were paid in 1994, the CEO's age, and the industry in which the CEO's company operates. In general terms each CEO is an observation and each column of the data table is a variable. Database people sometimes refer to observations as records and variables as fields.

One reason the CEO compensation data set is a good first data set for us to look at is its size. There are 800 observations, which is almost surely too many for you to internalize by simply looking at the individual entries. Some sort of summary measures are needed. A second feature of this data set is that it contains different types of variables. Variables such as the CEO's age and total compensation are continuous¹ variables, while variables like the CEO's industry and MBA status (whether or not each CEO has an MBA) are categorical². The distinction between categorical and continuous variables is important because different summary measures are appropriate for categorical and continuous variables. Regardless of whether a variable is categorical or continuous, there are numerical and graphical methods that can be used to describe it (albeit different numerical and graphical methods for different types of variables). These are described below.

¹For our purposes, continuous variables are numerical variables where the numbers mean something, as opposed to being labels for categorical levels (1=yellow, 2=blue, etc.). There are stricter, and more precise, definitions that could be applied.
²There are actually two different types of categorical variables. Nominal variables are categories like "red" and "blue," with no order. Ordinal variables have levels like "strongly disagree," "disagree," "agree," ..., which are ordered but with no meaningful numerical values. We will treat all categorical variables as nominal.

Figure 1.1: The first few rows and columns of the CEO data set.

1.2 Summaries of a Single Variable

1.2.1 Categorical Data

An example of a categorical variable is the CEO's industry (see Figure 1.2). The different values that a categorical variable can assume are called levels. The most common numerical summary of a categorical variable is a frequency table or contingency table, which simply counts the number of times each level occurred in the data set. It is sometimes easier to interpret the counts as fractions of the total number of observations in the data set, also known as relative frequencies.

(a) Histogram and Mosaic Plot
(b) Frequency Distribution

Figure 1.2: Graphical and numerical summaries of CEO industries (categorical data). The mosaic plot is very useful because you can put several of them next to each other to compare distributions within different groups (see Figure 1.6).

The choice between viewing the data as frequencies or relative frequencies is largely a matter of personal taste. If the categorical variable contains many levels then it will be easier to look at a picture of the frequency distribution such as a histogram or a mosaic plot. A histogram is simply a bar-chart depicting a frequency distribution. The bigger the bar, the more frequent the level. Histograms have been around more or less forever, but mosaic plots are relative newcomers in the world of statistical graphics. A mosaic plot works like a pie chart, but it represents relative frequencies as slices of a stick (or a candy bar) instead of slices of a pie. Mosaic plots have two big advantages over pie charts. First, it is easier for people to see linear differences than angular differences (okay, maybe that's not so big since you've been looking at pie charts all your life). The really important advantage of mosaic plots is that you can put several of them next to each other to compare categorical variables for several groups (see Figure 1.6).

The summaries in Figure 1.2 indicate that Finance is by far the most frequent industry in our data set of the 800 most highly paid CEOs. The construction industry is the least represented, and you can get a sense of the relative numbers of CEOs from the other industries.
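As a small illustration of frequencies and relative frequencies (JMP produces these automatically; the sketch below uses plain Python and a made-up industry list, not the Forbes data):

```python
# Frequency table and relative frequencies for a categorical variable.
# The industry list is a toy example, not the Forbes data.
from collections import Counter

industry = ["Finance", "Finance", "Utilities", "Chemicals", "Finance", "Construction"]
counts = Counter(industry)                                    # frequency table
rel_freq = {k: v / len(industry) for k, v in counts.items()}  # relative frequencies

print(counts)     # e.g. Counter({'Finance': 3, 'Utilities': 1, ...})
print(rel_freq)   # e.g. {'Finance': 0.5, ...}
```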

1.2.2 Continuous Data

An example of a continuous variable is a CEO's age or salary. It is easier for many people to think of summaries for continuous data because you can imagine graphing them on a number line, which gives the data a sense of location. For example, you have a sense of how far an 80 year old CEO is from a 30 year old CEO, but it is nonsense to ask how far a Capital Goods CEO is from a Utilities CEO. Summaries of continuous data fall into two broad categories: measures of central tendency (like the mean and the median) and measures of variability (like standard deviation and range). Another way to classify summaries of continuous data is whether they are based on moments or quantiles.

Moments (mean, variance, and standard deviation)

Moments (a term borrowed from physics) are simply averages. The first moment is the sample mean

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]

You certainly know how to take an average, but it is useful to present the formula for it to get you used to some standard notation that is going to come up repeatedly. In this formula (and in most to follow) n represents the sample size. For the CEO data set n = 800. The subscript i represents each individual observation in the data set (imagine i assuming each value 1, 2, . . . , 800 in turn). For example, if we are considering CEO ages, then x_1 = 52, x_2 = 62, x_3 = 56, and so on (see Figure 1.1). The summation sign simply says to add up all the numbers in the data set. Thus, this formula says to add up all the numbers in the data set and divide by the sample size, which you already knew. FYI: putting a bar across the top of a letter (like \bar{x}, pronounced "x bar") is standard notation in statistics for "take the average."

The second moment is the sample variance

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

It is the second moment because the thing being averaged is squared.³ The sample variance looks at each observation x_i, asks how far it is from the mean (x_i − \bar{x}), squares each deviation from the mean (to make it positive), and takes the average.

³The third moment has something cubed in it, and so forth.

It would take you a while to try to remember the formula for s² by rote memorization. However, if you remember that s² is the average squared deviation from the mean then the formula will make more sense and it will be easier to remember. There are two technical details that cause people to get hung up on the formula for sample variance. First, why divide by n − 1 instead of n? In any data set with more than a few observations, dividing by n − 1 instead of n makes almost no difference. We just do it to make math geeks happy, for reasons explained (kind of) in the call-out box on page 69. Second, why square each deviation from the mean instead of doing something like just dropping the minus signs? This one is a little deeper. If you're really curious you can check out page 89 (though you may want to wait a little bit until we get to Chapter 5).

So the sample variance is the average squared deviation from the mean. You use the sample variance to measure how spread out the data are. For example, the variance of CEO ages is 47.81. Wait, 47.81 what? Actually, the variance is hard to interpret because when you square each CEO's x_i − \bar{x} you get an answer in years squared. Nobody pretends to know what that means. In practice, the variance is computed en route to computing the standard deviation, which is simply the square root of the variance

\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \]

The standard deviation of CEO ages is 6.9 years, which says that CEOs are typically about 7 years above or below the average. Standard deviations are used in two basic ways. The first is to compare the variability of two or more groups. For example: the standard deviation of CEO ages in the Chemicals industry is 3.18 years, while the SD for CEOs in the Insurance industry is 8.4 years. That means you can expect to find more very old and very young CEOs in the Insurance industry, while CEOs in the Chemicals industry tend to be more tightly clustered about the average CEO age in that industry. The second, and more widespread, use of standard deviations is as a standard unit of measurement to help us decide whether two things are close or far apart. For example, the standard deviation of CEO total compensation is $8.3 million. It so happens that Michael Eisner made over $200 million that year. The average compensation was $2.8 million, so Michael Eisner was 24 standard deviations above the mean. That, we will soon learn, is a lot of standard deviations.
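As a concrete sketch of these moment calculations (plain Python; the salary list below is made up for illustration and is not the Forbes data):

```python
# Minimal sketch of the moment-based summaries (illustrative numbers only).
from math import sqrt

salaries = [1.3, 0.8, 2.5, 1.1, 4.0, 2.8, 0.9, 202.0]  # made-up values, $ millions

n = len(salaries)
mean = sum(salaries) / n                                # first moment: x-bar
var = sum((x - mean) ** 2 for x in salaries) / (n - 1)  # second moment: s^2
sd = sqrt(var)                                          # standard deviation

# How many SDs above the mean is the largest observation?
z = (max(salaries) - mean) / sd
print(f"mean={mean:.2f}  s^2={var:.2f}  s={sd:.2f}  z(max)={z:.1f}")
```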

Don't Get Confused! 1.1 Standard Deviation vs. Variance.
Standard deviation and variance both measure how far away from your best guess you can expect a typical observation to fall. They measure how spread out a variable is. Variance measures spread on the squared scale. Standard deviation measures spread using the units of the variable.

Quantiles

Quantiles (a fancy word for percentiles) are another method of summarizing a continuous variable. To compute the pth quantile of a variable, simply sort the variable from smallest to largest and find which number is p% of the way through the data set. If p% of the way through the data set puts you between two numbers, just take the average of those two numbers. The most famous quantiles are the median (50th percentile), the minimum (0th percentile), and the maximum (100th percentile). If you're given enough well chosen quantiles (say 4 or 5) you can get a pretty good idea of what the variable looks like.

The main reason people use quantiles to summarize data is to minimize the importance of outliers, which are observations far away from the rest of the data. Figure 1.3 shows the histogram of CEO total compensation, where Michael Eisner is an obvious outlier. A big outlier like Eisner can have a big impact on averages like the mean and variance (and standard deviation). With Eisner in the sample the mean compensation is $2.82 million. The mean drops to $2.57 million with him excluded. Eisner has an even larger impact on the standard deviation, which is $8.3 million with him in the sample and $4.3 million without him. The median CEO compensation is $1.3 million with or without Michael Eisner. Outliers have virtually no impact on the median, but they do impact the maximum and minimum values. (The maximum CEO compensation with Eisner in the data set is $202 million. It drops to $53 million without him.) If you want to use quantiles to measure the spread in the data set it is smart to use something other than the max and min. The first and third quartiles (aka the 25th and 75th percentiles) are often used instead. The first and third quartiles are $787,000 and $2.5 million regardless of Eisner's presence. Quantiles are useful summaries if you want to limit the influence of outliers, which you may or may not want to do in any given situation. Sometimes outliers are the most interesting points (people certainly seem to find Michael Eisner's salary very interesting).
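A small sketch of this robustness property (assuming NumPy; the compensation values are made up for illustration, not taken from the data set):

```python
# Quantiles are robust to outliers while the mean and SD are not.
import numpy as np

comp = np.array([0.8, 0.9, 1.1, 1.3, 1.3, 2.5, 2.8, 4.0])   # $ millions, made up
with_outlier = np.append(comp, 202.0)                       # add an Eisner-sized value

for data, label in [(comp, "without outlier"), (with_outlier, "with outlier")]:
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    print(f"{label}: mean={data.mean():.2f}  median={med:.2f}  "
          f"Q1={q1:.2f}  Q3={q3:.2f}  sd={data.std(ddof=1):.2f}")
```

The median and quartiles barely move when the extreme value is added, while the mean and standard deviation shift dramatically.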

Graphical Summaries

Boxplots and histograms are the best ways to visualize the distribution of a continuous variable. Histograms work by chopping the variable into bins and counting frequencies for each bin.⁴

Figure 1.3: Histogram of CEO total compensation (left panel) and log10 CEO compensation (right panel). Michael Eisner made so much money that we had to write his salary in scientific notation. On the log scale the skewness is greatly reduced and Eisner is no longer an outlier.

For boxplots, the top of the box is the upper quartile, i.e. the point 75% of the way through the data. The bottom of the box is the lower quartile, i.e. the point which 25% of the data lies below. Thus the box in a boxplot covers the middle half of the data. The line inside the box is the median. The lines (or "whiskers") extending from the box are supposed to cover almost all the rest of the data.⁵ Outliers, i.e. extremely large or small values, are represented as single points. Histograms usually provide more information than boxplots, though it is easier to see individual outliers in a boxplot. The main advantage of boxplots, like mosaic plots, is that only one dimension of the boxplot means anything (the height of the boxplot in Figure 1.4 means absolutely nothing). Therefore it is much easier to look at several boxplots than it is to look at several histograms. This makes boxplots very useful for comparing the distribution of a continuous variable across several groups (see Figure 1.7).

⁴At some point someone came up with a good algorithm for choosing histogram bins, which you shouldn't waste your time thinking about.
⁵The rules for how long to make the whiskers are arcane and only somewhat standard. You shouldn't worry about them.

Quantiles of CEO age shown in Figure 1.4:

  100.0%  (maximum)   81
   99.5%              77
   97.5%              69
   90.0%              64
   75.0%  (quartile)  61
   50.0%  (median)    57
   25.0%  (quartile)  52
   10.0%              48
    2.5%              42
    0.5%              36
    0.0%  (minimum)   29

Figure 1.4: Numerical and graphical summaries of CEO ages (continuous data). The normal curve is superimposed. The mean of the data is 56.325 years. The standard deviation is 6.9 years. How well do the quantiles in the data match the predictions from the normal model?

The Normal Curve


Often we can use the normal curve, or bell curve, to model the distribution of a continuous variable. Although many continuous variables don't fit the normal curve very well, a surprising number do. In Chapter 2 we will learn why the normal curve occurs as often as it does. If the histogram of a continuous variable looks approximately like a normal curve then all the information about the variable is contained in its mean and standard deviation (a dramatic data reduction: from 800 numbers down to 2). The normal curve tells us what fraction of the data set we can expect to see within a certain number of standard deviations away from the mean. In Chapter 2 we will learn how to use the normal curve to make very precise calculations. For now, some of the most often used normal calculations are summarized by the empirical rule, which says that if the normal curve fits well then (approximately) 68% of the data is within 1 SD of the mean, 95% within 2 SD, and 99.75% within 3 SD.

To illustrate the empirical rule, consider Figure 1.4, which lists several observed quantiles for the CEO ages. The data appear approximately normal, so the empirical rule says that about 95% of the data should be within 2 standard deviations of the mean. That means about 2.5% of the data should be more than 2 SDs above the mean, and a similar amount should be more than 2 SDs below the mean. The mean is 56.3 years, and the size of an SD is 6.9 years. So 2 SDs above the mean is about 70. The 97.5% quantile is actually 69, which is pretty close to what the normal curve predicted.

Of course you can't use the empirical rule if the histogram of your data doesn't look approximately like a normal curve. This is a subjective call which takes some practice to make. Figure 1.5 shows the four most common ways that the data could be non-normal. The distribution can be skewed, with a heavy tail trailing off in one direction or the other. The direction of the skewness is the direction of the tail, so CEO compensation is right skewed because the tail trails off to the right. A variable can have fat tails like in Figure 1.5(b). You can think of fat tailed distributions as being skewed in both directions. The most common fat tailed distributions in business applications are the distributions of stock returns (closely related to corporate profits). Figure 1.5(c) shows evidence of discreteness. It shows a variable which is continuous according to our working definition, but with relatively few distinct values. Finally, Figure 1.5(d) shows a bimodal variable, i.e. a variable whose distribution shows two well-separated clusters.

A more precise way to check whether the normal curve is a good fit is to use a normal quantile plot (aka quantile-quantile plot, or Q-Q plot). This plots the data, ordered from smallest to largest, versus the corresponding quantiles from a normal curve. If the data look normal, the quantile plot should be approximately a straight line. If the dots deviate substantially from a straight line this indicates that the data do not look normal. For more details see the discussion of Q-Q plots on page 41.
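One way to check the empirical rule on any roughly normal column (assuming NumPy; the ages below are simulated with the quoted mean and SD rather than read from the CEO data):

```python
# Sketch: checking the empirical rule for a roughly normal variable.
import numpy as np

rng = np.random.default_rng(1)
ages = rng.normal(56.3, 6.9, size=800)   # simulated stand-in for the age column

mean, sd = ages.mean(), ages.std(ddof=1)
for k in (1, 2, 3):
    frac = np.mean(np.abs(ages - mean) <= k * sd)
    print(f"within {k} SD: {frac:.1%}")   # expect roughly 68%, 95%, 99.7%
```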

1.3 Relationships Between Variables

There are two main reasons to look at variables simultaneously:

- To understand the relationship, e.g. if one variable increases, what happens to the other (if I increase the number of production lines, what will happen to profit)?
- To use one or more variables to predict another, e.g. using Profit, Sales, PE ratio, etc. to predict the correct value for a stock. We will then purchase the stock if its current value is under what we think it should be.

If we want to use one variable X to predict another variable Y then we call X the predictor and Y the response. The way we analyze the relationship depends on the types of variables X and Y are. There are four possible situations depending on whether X and Y are categorical or continuous (see page 179). Three are described below. The fourth (when Y is categorical and X is continuous) is best described using a model called logistic regression, which we won't see until Chapter 7. If X is categorical then it is possible to simply do the analysis you would do for Y separately for each level of X. If X is continuous then this strategy is no longer feasible.

(a) Skewness: CEO Compensation (top 20 outliers removed)
(b) Heavy Tails: Corporate Profits
(c) Discreteness: CEO's age upon obtaining undergraduate degree (top 5 outliers excluded)
(d) Bimodal: Birth Rates of Different Countries

Figure 1.5: Some non-normal data.

Figure 1.6: Contingency table and mosaic plot for auto choice data.

Categorical Y and X
Just as with summarizing a single categorical variable, the main numerical tool for showing the relationship between categorical Y and categorical X is a contingency table. Figure 1.6 shows data collected by an automobile dealership listing the type of car purchased by customers within different age groups. This type of data is often encountered in Marketing applications. The primary difference between two-way contingency tables (with two categorical variables) and one-way tables (with a single variable) is that there are more ways to turn the counts in the table into proportions. For example, there were 22 people in the 29-38 age group who purchased work vehicles. What does that mean to us? These 22 people represent 8.37% (=22/263) of the total data set. This is known as a joint proportion because it treats X and Y symmetrically. If you really want to think of one variable explaining another, then you want to use conditional proportions instead. The contingency table gives you two groups of conditional proportions because it doesn't know in advance which variable you want to condition on. For example, if you want to see how automobile preferences vary by age then you want to compute the distribution of TYPE conditional on AGEGROUP. Restrict your attention to just the one row of the contingency table corresponding to 29-38 year olds. What fraction of them bought work vehicles?

There are 133 of them, so the 22 people represent 16.54% of that particular row. Of that same group, 63% purchased family vehicles and 20% purchased sporty vehicles. To see how auto preferences vary according to age group, compare these row percentages for the young, middle, and older age groups. It looks like sporty cars are less attractive to older customers, family cars are more attractive to older customers, and work vehicles have similar appeal across age groups.

You could also condition the other way, by restricting your attention to the column for work vehicles. Of the 44 work vehicles purchased, 22 (50%) were purchased by 29-38 year olds. The younger demographic purchased 39% of work vehicles, while the older demographic purchased only 11%. By comparing these distributions across car type you can see that most family and work cars tend to be purchased by 29-38 year olds, while sporty cars tend to be purchased by 18-28 year olds.

Finally, the margins of the contingency table contain information about the individual X and Y variables. Because of this, when you restrict your attention to a single variable by ignoring other variables you are looking at its marginal distribution. The same terminology is used for continuous variables too. Thus the title of Section 1.2 could have been "looking at marginal distributions." We can see from the margins of the table that the 29-38 age group was the most frequently observed, and that family cars (a favorite of the 29-38 demographic) were the most often purchased.

Far and away the best way to visualize a contingency table is through a side-by-side mosaic plot like the one in Figure 1.6. The individual mosaic plots show you the conditional distribution of Y (in this case TYPE) for each level of X (in this case AGEGROUP). The plot represents the marginal distribution of X by the width of the individual mosaic plots: the 39+ demographic has the thinnest mosaic plot because it has the fewest members. The marginal distribution of the Y variable is a separate mosaic plot serving as the legend to the main plot. Finally, because of the way the marginal distributions are represented, the joint proportions in the contingency table correspond to the area of the individual tiles. Thus you can see from Figure 1.6 that family cars purchased by 29-38 year olds is the largest cell of the table. Side-by-side mosaic plots are a VERY effective way of looking at contingency tables. To see for yourself, open autopref.jmp and construct separate histograms for TYPE within each level of AGEGROUP (use the "by" button in the "Distribution of Y" dialog box).
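The three kinds of proportions can be computed directly from the counts quoted above (only the cells mentioned in the text are used here; the full table is in Figure 1.6):

```python
# Joint and conditional proportions from the counts quoted in the text.
n_total = 263         # customers in the data set
n_row_2938 = 133      # customers aged 29-38
n_col_work = 44       # customers who bought work vehicles
n_cell = 22           # 29-38 year olds who bought work vehicles

joint = n_cell / n_total             # joint proportion,  ~ 0.084
cond_on_age = n_cell / n_row_2938    # P(work | age 29-38), ~ 0.165
cond_on_type = n_cell / n_col_work   # P(age 29-38 | work), = 0.50

print(f"joint={joint:.3f}  given age={cond_on_age:.3f}  given type={cond_on_type:.2f}")
```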

Continuous Y and Categorical X


If you want to see how the distribution of a continuous variable varies across several groups you can simply list means and standard deviations (or your favorite quantiles) for each group.

Figure 1.7: Side-by-side boxplots comparing log10 compensation for CEOs in different industries.

Graphically, the best way to do the comparison is with side-by-side boxplots. If there are only a few levels (2 or 3) you could look at a histogram for each level (make sure the axes all have the same scale), but beyond that boxplots are the way to go. Multiple histograms are harder to read than side-by-side boxplots because each histogram has a different set of axes. Consider Figure 1.7, which compares log10 CEO compensation for CEOs in different industries. As with mosaic plots, the width of the side-by-side boxplots depicts the marginal distribution of X. Thus the finance industry has the widest boxplot because it is the most frequent industry in our data set. Compensation-wise, the finance CEOs seem fairly typical of other CEOs on the list. The aerospace-defense CEOs are rather well paid, while the forest and utilities CEOs haven't done as well. To convince yourself of the value of side-by-side boxplots, try doing the same comparison with 19 histograms. Yuck!
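For readers working outside JMP, a minimal matplotlib sketch of side-by-side boxplots (the groups and numbers below are invented placeholders, not the CEO data):

```python
# Side-by-side boxplots of a continuous variable across groups (toy data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
groups = {"Finance": rng.normal(6.2, 0.5, 120),
          "Utilities": rng.normal(5.9, 0.3, 40),
          "Aerospace": rng.normal(6.4, 0.4, 30)}

data = list(groups.values())
plt.boxplot(data)                                        # one box per group
plt.xticks(range(1, len(data) + 1), list(groups.keys()))
plt.ylabel("log10 compensation")
plt.show()
```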

Continuous Y and X
The best graphical way to show the relationship between two continuous variables is a scatterplot like the one in Figure 1.8. Each dot represents a CEO. Dots on the right are older CEOs. Dots near the top are highly paid CEOs. From the Figure it appears that if there is a relationship between a CEO's age and compensation it isn't a very strong one.

Figure 1.8: Scatterplot showing log10 compensation vs. age for the CEO data set. The best fitting line and quadratic function are also shown.

Of course the Figure is plotted on the log scale, and small changes in log compensation can be large changes in terms of real dollars. Could there be a trend in the data that is just too hard to see in the Figure? We can use regression to compute the straight line that best⁶ fits the trend in the data (a short code sketch after the questions below shows one way such fits can be computed). The regression line has a positive slope, which indicates that older CEOs tend to be paid more than younger CEOs. Of course, the regression line also raises some questions.

1. The slope of the line isn't very large. How large does a slope have to be before we conclude that it isn't worth considering?

2. Why are we only looking at straight lines? We can also use regression to fit the best quadratic function to the data. The linear and quadratic models say very different things about CEO compensation. The linear model says that older CEOs make more than younger CEOs. The quadratic model says that a CEO's earning power peaks and then falls off. Which model should we believe?

3. The regression line only describes the trend in the data. Our previous analyses (such as comparing log10 compensation by industry) actually described the data themselves (both center and spread). Is there some way to numerically describe the entire data set and not just the trend?

4. What if a CEO's compensation depends on more than one variable?

⁶The regression line is "best" according to a specific criterion known as least squares, which is discussed in Chapter 5.
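A sketch of how a fitted line and quadratic like those in Figure 1.8 could be computed with least squares (assuming NumPy; the age and compensation arrays below are simulated placeholders, not the Forbes data):

```python
# Fitting a line and a quadratic to a scatterplot by least squares.
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(35, 75, size=200)                      # placeholder predictor
log_comp = 5.5 + 0.01 * age + rng.normal(0, 0.5, 200)    # placeholder response

b1, b0 = np.polyfit(age, log_comp, deg=1)        # least squares line: slope, intercept
c2, c1, c0 = np.polyfit(age, log_comp, deg=2)    # best-fitting quadratic

print(f"line: slope={b1:.4f}, intercept={b0:.2f}")
print(f"quadratic: {c2:.5f}*age^2 + {c1:.4f}*age + {c0:.2f}")
```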

1.4 The Rest of the Course

The questions listed above are all very important, and we will spend much of the rest of the course understanding the tools that help us answer them. Procedurally, questions 1 and 2 are answered by something called a p-value, which is included in the computer output that you get when you fit a regression. Chapter 4 is largely about helping you understand p-values. To do so you need to know a few basic facts about probability, the subject of Chapters 2 and 3. Section 3.2.3 and Chapter 5 return to the more interesting topic of relationships between variables. Question 3 will be dealt with in Chapter 5 once we learn a little more about the normal curve in Chapter 2. Question 4 may be the greatest limitation of analyses which consist only of looking at data. To measure the impact that several X variables have on Y requires that you build a model, which is the subject of Chapter 6. By the end of Chapter 6 you will have a working knowledge of the multiple regression model, which is one of the most flexible and the most widely used models in all of statistics.


Chapter 2

Probability Basics
This Chapter provides an introduction to some basic ideas in probability. The focus in Chapter 1 was on looking at data. Now we want to start thinking about building models for the process that produced the data. Throughout your math education you have learned about one mathematical tool, and then learned about its opposite. You learned about addition, then subtraction. Multiplication, then division. Probability and statistics have a similar relationship. Probability is used to define a model for a process that could have produced the data you are interested in. Statistics then takes your data and tries to estimate the parameters of that model. Probability is a big subject, and it is not the central focus of this course, so we will only sketch some of the main ideas. The central characters in this Chapter are random variables. Every random variable has a probability distribution that describes the values the random variable is likely to take. While some probability distributions are simple, some of them are complicated. If a probability distribution is too complicated to deal with we may prefer to summarize it with its expected value (also known as its mean) and its variance. One probability distribution that we will be particularly interested in is the normal distribution, which occurs very often. A bit of math known as the central limit theorem (CLT) explains why the normal distribution shows up so much. The CLT says that sums or averages of random variables are normally distributed. The CLT is so important because many of the statistics we care about (such as the sample mean, sample proportion, and regression coefficients) can be viewed as averages.

2.1 Random Variables

Definition: A number whose value is determined by the outcome of a random experiment. In effect, a random variable is a number that hasn't happened yet.

Examples:
- The diameter of the next observed crank shaft from an automobile production process.
- The number on a roll of a die.
- Tomorrow's closing value of the Nasdaq.

Notation
Random variables are usually denoted with capital letters like X and Y. The possible values of these random variables are denoted with lower case letters like x and y. Thus, if X is the number of cars my used car lot will sell tomorrow, and if I am interested in the probability of selling three cars, then I will write P(X = 3). Here 3 is a particular value of lower-case x that I specify.

The Distribution of a Random Variable

By definition it is impossible to know exactly what the numerical value of a random variable will be. However, there is a big difference between not knowing a variable's value and knowing nothing about it. Every random variable has a probability distribution describing the relative likelihood of its possible values. A probability distribution is a list of all the possible values for the random variable and the corresponding probability of that value happening. Values with high probabilities are more likely than values with small probabilities. For example, imagine you own a small used-car lot that is just big enough to hold 3 cars (i.e. you can't sell more than 3 cars in one day). Let X represent the number of cars sold on a particular day. Then you might face the following probability distribution:

  x          0    1    2    3
  P(X = x)  0.1  0.2  0.4  0.3

From the probability distribution you can compute things like the probability of selling 2 or more cars, which is 70% (= .4 + .3). Pretty straightforward, really.
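The same calculation as a tiny Python sketch, with the distribution stored as a lookup table:

```python
# The used-car-lot distribution from the text, stored as a lookup table.
dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}   # P(X = x)

p_two_or_more = sum(p for x, p in dist.items() if x >= 2)
print(round(p_two_or_more, 2))   # 0.7
```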

Don't Get Confused! 2.1 Understanding Probability Distributions
One place where students often become confused is the distinction between a random variable X and its distribution P(X = x). You can think of a probability distribution as the histogram for a very large data set. Then think of the random variable X as a randomly chosen observation from that data set. It is often convenient to think of several different random variables with the same probability distribution. For example, let X1, . . . , X10 represent the numbers of dots observed during 10 rolls of a fair die. Each of these random variables has the same distribution P(X = x) = 1/6, for x = 1, 2, . . . , 6. But they are different random variables because each one can assume different values (i.e. you don't get the same roll for each die).

Where Probabilities Come From

Probabilities can come from four sources.

1. Classical symmetry arguments
2. Historical observations
3. Subjective judgments
4. Models

Classical symmetry arguments include statements like "all sides of a fair die are equally likely, so the probability of any one side is 1/6." They are the oldest of the four methods, but are of mainly mathematical interest and not particularly useful in applied work.

Historical observations are the most obvious way of deriving probabilities. One justification of saying that there is a 40% chance of selling two cars today is that you sold two cars on 40% of past days. A bit of finesse is needed if you wish to compute the probability of some event that you haven't seen in the past. However, most probability distributions used in practice make use of past data in some form or another.

Subjective judgments are used whenever experts are asked to assess the chance that some event will occur. Subjective probabilities can be valuable starting points when historical information is limited, but they are only as reliable as the expert who produces them.

The most common sources of probabilities in business applications are probability models. Models are useful when there are too many potential outcomes to list individually, or when there are too many uncertain quantities to consider simultaneously without some structure. Many of the most common probability models make use of the normal distribution, and its extension the linear regression model. We will discuss these two models at length later in the course.

The categories listed above are not mutually exclusive. For example, probability models usually have parameters which are fit using historical data. Subjective judgment is used when selecting families of models to fit in a given application.

2.2 The Probability of More than One Thing

Things get a bit more complicated if there are several unknown quantities to be modeled. For example, what if there were two car salesmen (Jim and Floyd) working on the lot? Then on any given day you would have two random variables: X, the number of cars that Jim sells, and Y , the number of cars that Floyd sells.

2.2.1 Joint, Conditional, and Marginal Probabilities

The joint distribution of two random variables X and Y is a function of two variables P(x, y) giving the probability that X = x and Y = y. For example, the joint distribution for Jim and Floyd's sales might be:

                    Y (Floyd)
                  0     1     2     3
  X (Jim)  0    .10   .10   .10   .10
           1    .10   .20   .10   .00
           2    .10   .05   .00   .00
           3    .05   .00   .00   .00

Remember that there are only 3 cars on the lot, so P(x, y) = 0 if x + y > 3. As with the distribution of a single random variable, the joint distribution of two (or more) random variables simply lists all the things that could happen, along with the corresponding probabilities. So in that sense it is no different than the probability distribution of a single random variable; there are just more possible outcomes to consider. Just to be clear, the distribution given above says that the probability of Jim selling two cars on a day that Floyd sells 1 is .05 (i.e. that combination of events will happen about 5% of the time).

Marginal Probabilities

If you were given the joint distribution of two variables, you might decide that one of them was irrelevant for your immediate purpose.

For example, Floyd doesn't care about how many cars Jim sells; he just wants to know how many cars he (Floyd) will sell. That is, Floyd wants to know the marginal distribution of Y. (Likewise, Jim may only care about the marginal distribution of X.) Marginal probabilities are calculated in the obvious way: you simply sum over any variable you want to ignore. The mathematical formula describing the computation looks worse than it actually is:

\[ P(Y = y) = \sum_x P(X = x, Y = y). \tag{2.1} \]
All this says is the following. The probability that Floyd sells 0 cars is the probability that he sells 0 cars and Jim sells 0, plus the probability that he sells 0 cars and Jim sells 1, plus . . . . Even more simply, it says to add down the column of numbers in the joint distribution that correspond to Floyd selling 0 cars. In fact, the name marginal suggests that marginal probabilities are often written on the margins of a joint probability distribution. For example: Y (Floyd) 1 2 .10 .10 .20 .10 .05 .00 .00 .00 .35 .20

X(Jim) 0 1 2 3

0 .10 .10 .10 .05 .35

3 .10 .00 .00 .00 .10

.40 .40 .15 .05 1.00

The marginal probabilities say that Floyd has a 10% chance (and Jim a 5% chance) of selling three cars on any given day. Notice that if you have the joint distribution you can compute the marginal distributions, but you cant go the other way around. That makes sense, because the two marginal distributions have only 8 numbers (4 each), while the joint distribution has 16 numbers, so there must be some information loss. Also note that the word marginal means something totally dierent in probability than it does in economics. Conditional Probabilities Each day, Floyd starts out believing that his sales distribution is Num. Cars (y) Prob 0 .35 1 .35 2 .20 3 .10

What if Floyd somehow knew that today was one of the days that Jim would sell 0 cars. What should he believe about his sales distribution in light of the new information? This situation comes up often enough in probability that there is

22

CHAPTER 2. PROBABILITY BASICS

standard notation for it. A vertical bar | inside a probability statement separates information which is still uncertain (on the left of the bar) from information which has become known (on the right of the bar). In the current example Floyd wants to know P (Y = y|X = 0). This statement is read: The probability that Y = y given that X = 0. The updated probability is called a conditional probability because it has been conditioned on the given information. How should the updated probability be computed? Imagine that the probabilities in the joint distribution we have been discussing came from a data set describing the last 1000 days of sales. The contingency table of sales counts would look something like

X(Jim) 0 1 2 3

0 100 100 100 50 350

Y (Floyd) 1 2 3 100 100 100 200 100 0 50 0 0 0 0 0 350 200 100

400 400 150 50 1000

If Floyd wants to estimate P (Y |X = 0), he can simply consider the 400 days when Jim sold zero cars, ignoring the rest. That is, he can normalize the (X = 0) row of the table by dividing everything in that row by 400 (instead of dividing by 1000, as he would to get the joint distribution). If Floyd didnt have the original counts he could still do the normalization, he would simply do it using probabilities instead of counts. This thought experiment justies the denition of conditional probability P (Y = y|X = x) = P (Y = y, X = x) . P (X = x) (2.2)

Notice that the denominator of equation (2.2) does not depend on y. It is simply a normalizing factor. Also, notice that if you summed the numerator over all possible values of y, you would get P (X = x) in the numerator and denominator, so the answer would be 1. The equation simply says to take the appropriate row or column of the joint distribution and normalize it so that it sums to 1. We can easily compute all of the possible conditional distributions that Floyd would face if he were told X = 0, 1, 2, or 3, and the conditional distributions that Jim would face if he were told Floyds sales.

2.2. THE PROBABILITY OF MORE THAN ONE THING Y (Floyd) 1 2 .25 .25 .50 .25 .33 .00 .00 .00 Y (Floyd) 1 2 .29 .50 .57 .50 .14 .00 .00 .00 1.00 1.00

23

X(Jim) 0 1 2 3

0 .25 .25 .67 1.00

3 .25 .00 .00 .00

1.00 1.00 1.00 1.00

X(Jim) 0 1 2 3

0 .29 .29 .29 .13 1.00

3 1.00 .00 .00 .00 1.00

Floyds conditional probabilities given Jims sales P (Y |X)

Jims conditional probabilities given Floyds sales P (X|Y )

So what does the information that X = 0 mean to Floyd? If we compare his marginal sales distribution to the his conditional distribution given X = 0 No information Jim sells 0 cars .35 .25 .35 .25 .20 .25 .10 .25

it appears (unsurprisingly) that Floyd has a better chance of having a big sales day if Jim sells zero cars. Putting It All Together Lets pause to summarize the probability jargon that weve introduced in this section. A joint distribution P (X, Y ) summarizes how two random variables vary simultaneously. A marginal distribution describes variation in one random variable, ignoring the other. A conditional distribution describes how one random variable varies if the other is held xed at some specied value. If you are given a joint distribution you can derive any conditional or marginal distributions of interest. However, to compute the joint distribution you need to have the marginal distribution of one variable, and all conditional distributions of the other. This is a consequence of the denition of conditional probability (equation 2.2) which is sometimes stated as the probability multiplication rule. P (X, Y ) = P (Y |X)P (X) = P (X|Y )P (Y ) (2.3)

Equations (2.2) and (2.3) are the same, just multiply both sides of (2.2) by P (X = x). However, Equation (2.3) is more suggestive of how probability models are actually built. It is usually harder to think about how two (or more) things vary simultaneously than it is to think about how one of them would behave if we knew the other. Thus most probability distributions are created by considering the marginal distribution of X, and then considering the conditional distribution of Y given X. We will illustrate this procedure in Section 2.2.3.


2.2.2 Bayes Rule

Probability distributions are a way of summarizing our beliefs about uncertain situations. Those beliefs change when we observe relevant evidence. The method for updating our beliefs to reflect the new evidence is called Bayes rule. Suppose we are unsure about a proposition U which can be true or false. For example, maybe U represents the event that tomorrow will be an up day on the stock market, and notU means that tomorrow will be a down day. Historically, 53% of days have been up days, and 47% have been down days, so we start off believing that P (U ) = .53.¹ But then we find out that the leading firm in the technology sector has filed a very negative earnings report just as the market closed today. Surely that will have an impact on the market tomorrow. Let's call this new evidence E and compute P (U |E) (the probability of U given E), our updated belief about the likelihood of an up day tomorrow in light of the new evidence. Bayes rule says that the updated probability is computed using the following formula:

P (U |E) = P (E|U )P (U ) / P (E)
         = P (E|U )P (U ) / [P (E|U )P (U ) + P (E|notU )P (notU )].     (2.4)

The first line here is just the definition of conditional probability. If you know P (E) and P (U, E) then Bayes rule is straightforward to apply. The second line is there in case you don't have P (E) already computed. You might recognize it as equation (2.1), which we encountered when discussing marginal probabilities. If not, then you should be able to convince yourself of the relationship P (E) = P (E|U )P (U ) + P (E|notU )P (notU ) by looking at Figure 2.2.

An Example Calculation Using Bayes Rule

In order to evaluate Bayes rule we need to evaluate P (E|U ), the probability that we would have seen evidence E if U were true. In our example this is the probability that we would have seen a negative earnings report by the leading technology firm if the next market day were to be an up day. We could obtain this quantity by looking at all the up days in market history and computing the fraction of them that were preceded by negative earnings reports. Suppose that number is P (E|U ) = 1% = 0.010. While we're at it, we may as well compute the percentage of down days (notU ) preceded by negative earnings reports. Suppose that number is
1 These numbers are based on daily returns from the S&P 500, which are plotted in Figure 3.4 on page 28.


Figure 2.1: The Reverend Thomas Bayes, 1702–1761. He's even older than that Gauss guy in Figure 2.10.

Figure 2.2: Venn diagram illustrating the denominator of Bayes rule: The probability of
E is the probability of E and U plus the probability of E and NotU.

P (E|notU ) = 1.5% = 0.015. It looks like such an earnings report is really unlikely regardless of whether or not we're in for an up day tomorrow. However, the report is certainly less likely to happen under U than notU . Bayes rule tells us that the probability of an up day tomorrow, given the negative earnings report today, is

P (U |E) = P (E|U )P (U ) / [P (E|U )P (U ) + P (E|notU )P (notU )]
         = (.010)(.53) / [(.010)(.53) + (.015)(.47)]
         = 0.429.

Keeping It All Straight

Bayes rule is straightforward mathematically, but it can be confusing because there are several pieces to the formula that are easy to mix up. The formula for Bayes rule would be a lot simpler if we didn't have to worry about the denominator. Notice that

P (U |E) = P (E|U )P (U ) / [P (E|U )P (U ) + P (E|notU )P (notU )]

and

P (notU |E) = P (E|notU )P (notU ) / [P (E|U )P (U ) + P (E|notU )P (notU )]

both have the same denominator. When we're evaluating Bayes rule we need to compute P (E|U ) and P (E|notU ) to get the denominator anyway, so what if we just wrote the calculation as P (U |E) ∝ P (E|U )P (U )? The ∝ sign is read "is proportional to," which just means that there is a constant multiplying factor which is too big a bother to write down. We can recover that factor because the probabilities P (U |E) and P (notU |E) must sum to one. Thus, if the equation for Bayes rule seems confusing, you can remember it as the following procedure.

1. Write down all possible values for U in a column on a piece of paper.

2. Next to each value write P (U ), the probability of U before you learned about the new evidence. P (U ) is sometimes called the prior probability.

3. Next to each prior probability write down the probability of the evidence if U had taken that value. This is sometimes called the likelihood of the evidence.

4. Multiply the prior times the likelihood, and sum over all possible values of U . This sum is the normalizing constant P (E) from equation (2.4).

5. Divide by P (E) to get the posterior probability P (U |E) = P (U )P (E|U )/P (E).

This procedure is summarized in the table below.

         prior   likelihood   prior*like   posterior
Up       0.53    0.010        0.00530      0.4291498 = 0.00530/0.01235
Down     0.47    0.015        0.00705      0.5708502 = 0.00705/0.01235
                              -------
                              0.01235

Once you have internalized either equation (2.4) or the five-step procedure listed above, you can remember them as: The posterior probability is proportional to the prior times the likelihood.
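As a concrete illustration, here is a minimal sketch of the five-step procedure in Python (the function name is ours, not something from the text); it reproduces the up-day table above.

    def bayes_update(priors, likelihoods):
        """Posterior probabilities proportional to prior times likelihood."""
        joint = {state: priors[state] * likelihoods[state] for state in priors}
        normalizer = sum(joint.values())      # this is P(E)
        return {state: p / normalizer for state, p in joint.items()}

    posterior = bayes_update({"Up": 0.53, "Down": 0.47},
                             {"Up": 0.010, "Down": 0.015})
    print(posterior)    # {'Up': 0.429..., 'Down': 0.570...}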

Why Bayes Rule is Important

The first time you see Bayes rule it seems like a piece of trivia. After all, it is nothing more than a restatement of the multiplication rule in equation (2.3) (which was a restatement of equation (2.2)). However, it turns out that Bayes rule is the foundation of rational decision making, and may well be the e = mc² of the 21st century. One example where Bayes theorem has made a huge impact is artificial intelligence, which means programming a computer to make intelligent-seeming decisions about complex problems. In order to do that you need some way to mathematically express what a computer should believe about a complex scenario. The computer also needs to learn as new information comes in. The computer's beliefs about the complex scenario are described using a complex probability model. Then Bayes theorem is used to update the probability model as the computer learns about its surroundings. We will see several examples of Bayesian learning in Chapter 3.

2.2.3 A Real World Probability Model

The preceding sections have illustrated some of the issues that can arise when two uncertain quantities are considered. In the interest of simplicity we have dealt mainly with toy examples, which can mask some of the issues that come up in more realistic settings. Let's work on building a realistic probability model for a familiar process: the daily returns of the S&P 500 stock market index. That sounds a bit daunting, so let's limit the complexity of our task by only considering whether each day's returns are up (positive return) or down (negative return). We want our model to compute the probability that the next n days will follow some specified sequence (e.g. with n = 4 we want to compute P (up, up, down, up)), and we want it to work with any value of n. One thing worth noticing is that the terms joint, conditional, and marginal become a bit ambiguous when there are several random variables floating about. For example, suppose stock market returns over the next 4 days are denoted by X1 , . . . , X4 . Suppose we're told that day 1 will be an Up day, and we want to consider what happens on days 2 and 3. Then P (X2 , X3 |X1 ) is a joint, marginal, and conditional distribution all at the same time. It is joint because it considers more than one random thing (X2 and X3 ). It is conditional because something formerly random (X1 ) is now known. It is marginal because it ignores something random (X4 ). The second thing we notice is that the probability multiplication rule starts to look scary. When applied to many random variables, the multiplication rule


Figure 2.3: Daily returns for the S&P 500 market index. The vertical axis excludes a
few outliers (notably 10/19/1987) that obscure the pattern evident in the remainder of the data.

becomes

P (X1 , . . . , Xn ) = P (X1 ) P (X2 |X1 ) P (X3 |X2 , X1 ) · · · P (Xn−1 |Xn−2 , . . . , X1 ) P (Xn |Xn−1 , . . . , X1 ).     (2.5)

That is, you can factor the joint distribution P (X1 , . . . , Xn ) by multiplying the conditional distributions of each Xi given all previous X's. Why is that scary? Remember that each of the random variables can only assume one of two values: Up or Down. We can come up with P (X1 ) simply enough, just by counting how many up and down days there have been in the past. These probabilities turn out to be

x              Down    Up
P (Xi = x)     0.474   0.526

Finding P (X2 |X1 ) is twice as much work: we have to count how many (UU), (UD), (DU), and (DD) transitions there were. After normalizing the transition counts we get the conditional probabilities

                    Xi = Down    Xi = Up
Xi−1 = Down         0.519        0.481       1.00
Xi−1 = Up           0.433        0.567       1.00
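The counting itself is mechanical. Here is a sketch of how the marginal and transition probabilities could be estimated from a sequence of up/down days (the sequence shown is made up for illustration, not the actual S&P 500 data).

    from collections import Counter

    days = "UUDUDDUUUDUDDDUU"      # hypothetical sequence of Up/Down days
    n = len(days)

    marginal = Counter(days)
    print({d: marginal[d] / n for d in "DU"})

    # Transition counts: how often each value follows each value.
    transitions = Counter(zip(days, days[1:]))
    for prev in "DU":
        total = sum(transitions[(prev, nxt)] for nxt in "DU")
        print(prev, {nxt: round(transitions[(prev, nxt)] / total, 3) for nxt in "DU"})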


Finding P (X3 |X2 , X1 ) is twice as much work as P (X2 |X1 ): we need to find the number of times each pattern (DDD), (DDU), (DUD), (DUU), (UDD), (UDU), (UUD), (UUU) was observed.

Xi−2     Xi−1      Xi = Down    Xi = Up
Down     Down      0.501        0.499       1.00
Down     Up        0.412        0.588       1.00
Up       Down      0.539        0.461       1.00
Up       Up        0.449        0.551       1.00

Notice how each additional day we wish to consider doubles the amount of work we need to do to derive our model. This quickly becomes an unacceptable burden. For example, if n = 20 we would have to compute over one million conditional probabilities (2^20 is about 1.05 million). That is far too many to be practical, especially since there are only 14,000 days in the data set. The obvious solution is to limit the amount of dependence that we are willing to consider. The two most common solutions in practice are to assume independence or Markov dependence.

Independence

Two random variables are independent if knowing the numerical value of one does not change the distribution you would use to describe the other. Translated into probability-speak, independence means that P (Y |X) = P (Y ). If we were to assume that returns on the S&P 500 were independent, then we could compute the probability that the next three days' returns are (UUD) (two up days followed by a down day) as follows. The general multiplication rule says that

P (X1 , X2 , X3 = U U D) = P (X1 = U ) P (X2 = U |X1 = U ) P (X3 = D|X2 = U, X1 = U ).     (2.6)

If we assume that X1 , X2 , and X3 are independent, then P (X2 |X1 ) = P (X2 ) and P (X3 |X1 , X2 ) = P (X3 ), so the probability becomes

P (U U D) = P (X1 = U ) P (X2 = U ) P (X3 = D) = (.526)(.526)(.474) = 0.131.     (2.7)

The numbers here come from the marginal distribution of X1 on page 28. We have assumed that the marginal distribution does not change over time, which is a common assumption in practice.
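Under independence the calculation is just a product of marginal probabilities, as this short sketch shows (probabilities copied from the marginal table above).

    marginal = {"U": 0.526, "D": 0.474}

    def prob_independent(pattern):
        p = 1.0
        for x in pattern:
            p *= marginal[x]
        return p

    print(prob_independent("UUD"))    # (.526)(.526)(.474), about 0.131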


(a) Diameters of automobile crank shafts.

(b) International airline passenger traffic.

Figure 2.4: The crank shaft diameters appear to be independent. The airline passenger
series exhibits strong dependence.

Independence is a strong assumption, but it is reasonable in many circumstances. Many of the statistical procedures we will discuss later assume independent observations. You can often plot your data, as we have done in Figure 2.4, to check whether it is reasonable to assume independence. The left panel shows data from a production line which produces crank shafts to go in automobile engines. The crank shafts should ideally be 815 thousandths of an inch in diameter, but there will be some variability from shaft to shaft. Each day five shafts are collected and measured during quality control checks. Some shafts measure greater than 815, and some less. But it does not seem like one shaft being greater or less than 815 influences whether the next shaft is likely to be greater or less than 815. That's what it means for random variables to be independent. Contrast the shaft diameter data set with the airline passenger data set shown in the right panel of Figure 2.4. The airline passenger data series exhibits an upward trend over time, and it also shows a strong seasonal pattern. The passenger counts in any particular month are very close to the counts in neighboring months. This is an example of very strong dependence between the observations in this series.

Markov Dependence

Independence makes probability calculations easy, but it is sometimes implausible. If you think that Up days tend to follow Up days on the stock market, and vice versa, then you should feel uncomfortable about assuming the returns to be independent. The simplest way to allow dependence across time is to assume Markov dependence. Mathematically, Markov dependence can be expressed as

P (Xn |Xn−1 , . . . , X1 ) = P (Xn |Xn−1 ).     (2.8)


Simply put, Markov dependence assumes that today's value depends on yesterday's value but, once yesterday is known, not on the days before that. A sequence of random variables linked by Markov dependence is known as a Markov chain. Let's suppose that the sequence of S&P 500 returns follows a Markov chain and compute the probability that the next 4 days X1 , . . . , X4 follow the pattern U U DU . The general multiplication rule says that

P (U U DU ) = P (X1 = U ) P (X2 = U |X1 = U ) P (X3 = D|X2 = U, X1 = U ) P (X4 = U |X3 = D, X2 = U, X1 = U ).     (2.9)

Markov dependence means that P (X3 |X1 , X2 ) = P (X3 |X2 ) and P (X4 |X3 , X2 , X1 ) = P (X4 |X3 ), so the probability becomes

P (U U DU ) = P (X1 = U ) P (X2 = U |X1 = U ) P (X3 = D|X2 = U ) P (X4 = U |X3 = D)
            = (.526)(.567)(.433)(.481)
            = 0.062.     (2.10)

Again, the numbers here are based on the distributions on page 28.

Which Model Fits Best?

Now we have an embarrassment of riches. We have two probability models for the S&P 500 series. Which one fits best? There is a financial/economic theory called the random walk hypothesis that suggests the independence model should be the right answer. The random walk hypothesis asserts that markets are efficient, so if there were day-to-day dependence in returns, arbitrageurs would enter and remove it. Even so, the Markov chain model has considerable intuitive appeal. How can we tell which model fits best? One way is to use Bayes rule. The thing we're uncertain about here is which model is the right one. Let's call the model M . The evidence E that we observe is the sequence of up and down days in the S&P 500 data. To use Bayes rule we need the prior probabilities P (M = Markov) and P (M = Indep) as well as the likelihoods: P (E|M = Markov) and P (E|M = Indep). Before looking at the data we might have no reason to believe in one model over another, so maybe P (M = Markov) = P (M = Indep) = .50. This is clearly a subjective judgment, and we need to check its impact on our final analysis, but let's go with the 50/50 prior for now. The likelihoods are easy enough to compute. We just extend the computations earlier in this section to cover the whole data set. We end up with

Model           Likelihood
Markov          e^−9659
Independence    e^−9713

The e's show up because we had to compute the likelihood on the log scale for numerical reasons.2 Don't be put off by the fact that the likelihoods are such small numbers. There are a lot of possible outcomes over the next 14,000 days of the stock market. The chance that you will correctly predict all of them simultaneously is very small (like e^−9659). What you should observe is that the data are e^53 times more likely under the Markov model than under the independence model. If we plug these numbers into Bayes rule we get

P (M = Markov|E) = (.5)(e^−9659) / [(.5)(e^−9659) + (.5)(e^−9713)] = 1 / (1 + e^−53) ≈ 1.

Or, equivalently,

P (M = Indep|E) = (.5)(e^−9713) / [(.5)(e^−9659) + (.5)(e^−9713)] = e^−53 / (1 + e^−53) ≈ 0.

The evidence in favor of the Markov model is overwhelming. When we say that P (M = Markov|E) ≈ 1 there is an implicit assumption that the Markov and Independence models are the only ones to be considered. There are other models that fit these data even better than the Markov chain, but given a choice between the Markov chain and the Independence model, the Markov chain is the clear winner. With such strong evidence in the likelihood, the prior probabilities that we chose make little difference. For example, if we were strong believers in the random walk hypothesis we might have had a prior belief that P (M = Indep) = .999 and P (M = Markov) = .001. In that case we would wind up with

P (M = Markov|E) = (.001)e^−9659 / [(.001)e^−9659 + (.999)e^−9713] = 1 / (1 + 999e^−53) ≈ 1 / (1 + e^−46) ≈ 1.

If there is strong evidence in the data (as there is here) then Bayes rule forces rational decision makers to converge on the same conclusion even if they begin with very different prior beliefs.
2 Computers store numbers using a finite number of 0s and 1s. When the stored numbers get small enough, the computer gives up and calls the answer 0. This is an easy problem to get around: just add log probabilities instead of multiplying raw probabilities.
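A minimal sketch of that log-scale trick, using the log likelihoods quoted in the table above: add log prior to log likelihood, subtract the largest value before exponentiating, and normalize.

    import math

    log_like = {"Markov": -9659.0, "Indep": -9713.0}     # values quoted above
    log_prior = {"Markov": math.log(0.5), "Indep": math.log(0.5)}

    log_post = {m: log_prior[m] + log_like[m] for m in log_like}
    biggest = max(log_post.values())
    # Subtracting the largest value before exponentiating avoids underflow.
    weights = {m: math.exp(v - biggest) for m, v in log_post.items()}
    total = sum(weights.values())
    print({m: w / total for m, w in weights.items()})   # Markov near 1, Indep near 0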


2.3 Expected Value and Variance

Let's return to the auto sales probability distribution from Section 2.1. It is pretty simple, but Section 2.2 showed that probability distributions can get sufficiently complicated that we may wish to summarize them somehow instead of working with them directly. We said earlier that you can think of a probability distribution as a histogram of a long series of future data. So it makes sense that we might want to summarize a probability distribution using tools similar to those we used to summarize data sets. The most common summaries of probability distributions are their expected value (aka their mean) and their variance.

2.3.1 Expected Value

One way we can guess the value that a random variable will assume is to look at its expected value, E(X). The expected value of a random variable is its long run average. If you repeated the experiment a large number of times and took the average of all the observations you got, that average would be about E(X). We can calculate E(X) using the formula

E(X) = Σx x P (X = x),

where the sum runs over all possible values x.

Returning to the used car example, we don't know how many cars we are going to sell tomorrow, but a good guess is

E(X) = (0 × 0.1) + (1 × 0.2) + (2 × 0.4) + (3 × 0.3) = 1.9.

Of course X will not be exactly 1.9, so what is good about it? Suppose you face the same probability distribution for sales each day, and think about the average number of cars per day you will sell for the next 1000 days. On about 10% of the days you would sell 0 cars, about 20% of the time you would sell 1 car, etc. Add up the total number of cars you expect to sell (roughly 100 0's, 200 1's, 400 2's, and 300 3's), and divide by 1000. You get 1.9, E(X), the long run average value. Note we sometimes write E(X) as μ. It means exactly the same thing. The E() operator is seductive3 because it takes something that you don't know, the random variable X, and replaces it with a plain old number like 1.9. Thus it is tempting to stick in 1.9 wherever you see X written. Don't! Remember that 1.9 is only the long run average for the number of cars sold per day, while X is specifically the number you will sell tomorrow (which you won't know until tomorrow). The expected value operator has some nice properties that come in handy when dealing with sums (possibly weighted sums) of random variables. If a and b are
3 It is the Austin Powers of operators.


known constants (weights) and X and Y are random variables, then the following rules apply.

E(aX + bY ) = aE(X) + bE(Y )        Example: E(3X + 4Y ) = 3E(X) + 4E(Y )
E(aX + b) = aE(X) + b               Example: E(3X + 4) = 3E(X) + 4

We will illustrate these rules a little later, in Section 2.3.3.

2.3.2 Variance

Expected value gives us a guess for X. But how good is the guess? If X is always very close to E(X) then it will be a good guess, but it is possible that X is often a long way away from E(X). For example X may be an extremely large value half the time and a very small value the rest of the time. In this case the expected value will be half way between, but X is always a long way away. We need a measure of how close X is on average to E(X) so we can judge how good our guess is. This is what we use the variance, V ar(X), for. V ar(X) is often denoted by σ², but they mean the same thing. It is calculated using the formula

V ar(X) = σ² = Σx (x − μ)² P (X = x) = E[(X − μ)²].

Remember that E() just says "take the average," so the variance of a random variable is the average squared deviation from the mean, just like the variance of a data set. If you like, you can think of x̄ and s² as the mean and variance of past data (i.e. data which has already happened and is in your data set), while E(X) and V ar(X) are the mean and variance of a long series of future data that you would see if you let your data producing process run on for a long time. To illustrate the variance calculation, the variance of our auto sales random variable is 0.89, calculated as follows.

x    P (X = x)   (x − μ)   (x − μ)²   P (X = x)(x − μ)²
0    0.1         −1.9      3.61       0.3610
1    0.2         −0.9      0.81       0.1620
2    0.4          0.1      0.01       0.0040
3    0.3          1.1      1.21       0.3630
     1.0                              0.8900
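The same table is easy to reproduce in code. A short sketch using the car-sales probabilities:

    pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

    mean = sum(x * p for x, p in pmf.items())                  # 1.9
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())     # 0.89
    sd = var ** 0.5                                            # about 0.94
    print(mean, var, sd)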


Variance has a number of nice theoretical properties, but it is not very easy to interpret. What does a variance of .89 mean? It means that the average squared distance of X from its mean is 0.89 cars squared. Just as in Chapter 1, variance is hard to interpret because it squares the units of the problem.

Standard Deviation

The standard deviation of a random variable is defined as SD(X) = σ = √V ar(X).

Taking the square root of variance restores the natural units of the problem. For example the standard deviation of the above random variable is SD(X) = √0.89 = 0.94 cars. Standard deviations are easy to calculate, assuming you've already calculated V ar(X), and they are a lot easier to interpret. A standard deviation of 0.94 means (roughly) that the typical distance of X from its mean is about 0.94. If SD(X) is small then when X happens it will be close to E(X), so E(X) is a good guess for X. There is no threshold for SD(X) to be considered small. The smaller it is, the better guesser you can be.

Rules for Variance

We mentioned that variance obeys some nice math rules. Here's the main one. Note that the following rule applies to V ar ONLY if X and Y are independent.4 That contrasts with the rules for expected value on page 34, which apply all the time. Remember that X and Y are random variables, and a and b are known constants.

V ar(aX + bY ) = a² V ar(X) + b² V ar(Y )     (if X and Y are independent)

Here are some examples:

V ar(3X + 4Y ) = 3² V ar(X) + 4² V ar(Y ) = 9V ar(X) + 16V ar(Y )
V ar(X + Y ) = V ar(X) + V ar(Y )     (a = 1, b = 1)
V ar(X − Y ) = V ar(X) + V ar(Y )     (a = 1, b = −1, the −1 gets squared)

4 Section 3.2.3 explains what to do if X and Y are dependent. It involves a new wrinkle called covariance that would just be a distraction at this point.


Note that there are no rules for manipulating standard deviation. To determine SD(aX + bY ) you must first calculate V ar(aX + bY ) and then take its square root.

2.3.3 Adding Random Variables

Expected value and variance are very useful tools when you want to build more complicated random variables out of simpler ones. For example, suppose we know the distribution of daily car sales (from Section 2.1) and that car sales are independent from one day to the next. We're interested in the distribution of weekly (5 day) sales, W . In principle we could list out all the possible values that W could assume (in this case from 0 to 15), then think of all the possible ways that daily sales could happen. For example, to compute P (W = 6) we would have to consider all the different ways that daily sales could total up to 6 (3 on the first day and 3 on the last, one each day except for two on Thursday, and many, MANY more). It is a hard thing to do because there are so many cases to consider, and this is just a simple toy problem! It is much easier to figure out the mean and variance of W using the rules from the previous two Sections. The trick is to write W = X1 + X2 + X3 + X4 + X5 , where X2 is the number of cars sold on day 2, and similarly for the other X's. Then

E(W ) = E(X1 + X2 + X3 + X4 + X5 )
      = E(X1 ) + E(X2 ) + E(X3 ) + E(X4 ) + E(X5 )
      = 1.9 + 1.9 + 1.9 + 1.9 + 1.9 = 9.5

and

V ar(W ) = V ar(X1 + X2 + X3 + X4 + X5 )
         = V ar(X1 ) + V ar(X2 ) + V ar(X3 ) + V ar(X4 ) + V ar(X5 )
         = .89 + .89 + .89 + .89 + .89 = 4.45,

which means that SD(W ) = √4.45 = 2.11. That was MUCH easier than actually figuring out the distribution of W and tabulating E(W ) and V ar(W ) directly. What if we cared about the weekly sales figure expressed as a daily average instead of the weekly total? Then we just consider W/5, with E(W/5) = 9.5/5 = 1.9 and V ar(W/5) = 4.45/5² = .178, so SD(W/5) = √.178 = .422. Of course we still don't know the entire distribution of W . All we have are these two useful summaries. It would be nice if there were some way to approximate the

Don't Get Confused! 2.2 The difference between X1 + X2 and 2X. In examples like our weekly auto sales problem many people are tempted to write W = 5X instead of W = X1 + · · · + X5 . The temptation comes from the fact that all five daily sales figures come from the same distribution, so they have the same mean and the same variance. However, they are not the same random variable, because you're not going to sell exactly the same number of cars each day. The distinction makes a practical difference in the variance formula (among other places). If V ar(X1 ) = σ², then V ar(5X1 ) = 25σ², but V ar(X1 + · · · + X5 ) = 5σ². Is this sensible? Absolutely. If you were gambling in Las Vegas, then X1 + · · · + X5 might represent your winnings after five $1 bets while 5X would represent your winnings after one $5 bet. The one big bet is much riskier (i.e. has a larger variance) than the five smaller ones.


distribution of W based just on its mean and variance. We'll get one in Section 2.5, but we have one more big idea to introduce first.
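For reference, the weekly-sales arithmetic above amounts to a few lines of code (a sketch, using the daily mean and variance already computed):

    daily_mean, daily_var = 1.9, 0.89

    weekly_mean = 5 * daily_mean          # 9.5
    weekly_var = 5 * daily_var            # 4.45, since variances add under independence
    weekly_sd = weekly_var ** 0.5         # about 2.11

    avg_var = weekly_var / 5 ** 2         # 0.178 for the daily average W/5
    print(weekly_mean, weekly_sd, avg_var ** 0.5)    # 9.5  2.11  0.42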

2.4 The Normal Distribution

Thus far we have restricted our attention to random variables that could assume a denumerable set of values. While that is natural for some situations, we will often wish to model continuously varying phenomena such as fluctuations in the stock market. It is impractical to list out all possible values (and corresponding probabilities) of a continuously varying process. Mathematical probability models are usually used instead. There are many probability models out there,5 but the most common is the normal distribution, which we met briefly in Chapter 1. We will use normal probability calculations at various stages throughout the rest of the course. If the distribution of X is normal with mean μ and standard deviation σ then we write X ∼ N (μ, σ). This expression is read "X is normally distributed with mean μ and standard deviation σ." Section 7.3 describes some non-normal probability models. If you thought X obeyed one of them you would write E or P (or some other letter) instead of N .
5 A few are described in Section 7.3.


Calculating probabilities for a normal random variable

The normal table is set up6 to answer the question "what is the probability that X is less than z standard deviations above the mean?" So the first thing that must be done when calculating normal probabilities is to change the units of the problem into standard deviations above the mean. You do this by subtracting the mean and dividing by the standard deviation. Thus you replace the probability calculation P (X < x) with the equivalent event

P ((X − μ)/σ < (x − μ)/σ) = P (Z < z).

Subtracting the mean and dividing by the standard deviation transforms X into a standard normal random variable Z. The phrase "standard normal" just means that Z has a mean of 0 and standard deviation 1 (i.e. Z ∼ N (0, 1)). Subtracting the mean and dividing by the standard deviation changes the units of x from whatever they were (e.g. cell phone calls) to the units of z, which are standard deviations above the mean. Subtracting the mean and dividing by the standard deviation is known as z-scoring. Figure 2.5 illustrates the effect that z-scoring has on the normal distribution. Procedurally, normal probabilities of the form P (X ≤ x) can be calculated using the following two-step process.

1. Calculate the z score using the formula

z = (x − μ)/σ

Note that z is the number of standard deviations that x is above μ.

2. Calculate the probability by finding z in the normal table.

For example if X ∼ N (3, 2) (i.e. μ = 3, σ = 2) and we want to calculate P (X ≤ 5) then

z = (5 − 3)/2 = 2/2 = 1.

In other words we want to know the probability that X is less than one standard deviation above its mean. When we look up 1 in the normal table we see that P (X < 5) = P (Z < 1) = 0.8413.7 About 84 in every 100 occurrences of a normal random variable will be less than one standard deviation above its mean.
6 Different books set up normal tables in different ways. You might have seen a normal table organized differently in a previous course, but the basic idea of how the tables are used is standard.
7 We can afford to be sloppy with < and ≤ here because the probability that a normal random variable is exactly equal to any fixed number is 0.


Figure 2.5: Z-scoring. Suppose X ∼ N (3, 2). The left panel depicts P (X < 5). The right panel depicts P (Z < 1), where 1 is the z-score (5 − 3)/2. The only difference between the figures is a centering and rescaling of the axes.

The table tells us P (Z < z), so if we want P (Z > z) we need to rewrite this as 1 − P (Z < z), i.e.

P (X > 5) = 1 − P (X < 5) = 1 − 0.8413 = 0.1587.

Because the normal distribution is symmetric, P (Z > −z) = P (Z < z). For example (draw a pair of pictures like Figure 2.5):

P (X > 1) = P (Z > −1) = P (Z < 1) = 0.8413.

Finally, P (Z < −z) = P (Z > z) = 1 − P (Z < z). For example:

P (X < 1) = P (Z < −1) = 1 − P (Z < 1) = 0.1587.

Further examples

1. If X ∼ N (5, 3), what is P (3 < X < 6)? This is the same as asking for P (X < 6) − P (X < 3), so we need to calculate two z scores:

z1 = (6 − 5)/√3 = 0.58,     z2 = (3 − 5)/√3 = −1.15.



(a) P (3 < X < 6)

(b) P (6 < X < 8)

Figure 2.6: To compute the probability that a normal random variable is in an interval (a, b), compute P (X < b) and subtract off P (X < a).

Therefore (see Figure 2.6(a)),

P (X < 6) − P (X < 3) = P (Z < 0.58) − P (Z < −1.15)
                      = P (Z < 0.58) − (1 − P (Z < 1.15))
                      = P (Z < 0.58) − P (Z > 1.15)
                      = 0.7190 − (1 − 0.8749)
                      = 0.5939.

2. If X ∼ N (5, 3), what is P (6 < X < 8)? This is the same as asking for P (X < 8) − P (X < 6), so we need to calculate two z scores:

z1 = (8 − 5)/√3 = 1.73,     z2 = (6 − 5)/√3 = 0.58.

Therefore (see Figure 2.6(b)),

P (X < 8) − P (X < 6) = P (Z < 1.73) − P (Z < 0.58) = 0.9582 − 0.7190 = 0.2392.
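These normal probabilities can also be computed directly, without a table, using the standard normal CDF. The sketch below uses math.erf from the Python standard library; the standard deviations passed in (2 and √3) are assumptions chosen to match the z-scores worked out in the examples above.

    from math import erf, sqrt

    def normal_cdf(x, mu, sigma):
        z = (x - mu) / sigma                    # step 1: z-score
        return 0.5 * (1 + erf(z / sqrt(2)))     # step 2: P(Z < z)

    print(normal_cdf(5, 3, 2))                                      # about 0.8413
    print(normal_cdf(6, 5, sqrt(3)) - normal_cdf(3, 5, sqrt(3)))    # about 0.594
    print(normal_cdf(8, 5, sqrt(3)) - normal_cdf(6, 5, sqrt(3)))    # about 0.239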


Figure 2.7: Normal quantile plot for CEO ages.

Checking the Normality Assumption


The normal distribution comes up so often in a statistics class that it is important to remember that not all distributions are normal. The most effective way to check whether a distribution is approximately normal is to use a normal quantile plot, otherwise known as a quantile-quantile plot or a Q-Q plot. Recall from Chapter 1 the data set listing the 800 highest paid CEOs in 1994. One of the variables that looked approximately normally distributed was CEO ages. Figure 2.7 shows the normal quantile plot for that data. To form the normal quantile plot, the computer orders the 800 CEOs from smallest to largest. Then it figures out how small you would expect the smallest observation from 800 normal random variables to be. Then it figures out how small you would expect the next smallest observation to be, and so on. The actual data from each CEO are plotted against the data you would expect to see if the variable were normally distributed. If what you do see is about the same as what you would expect to see if the data were normal, then the dots in the normal quantile plot will follow an approximate straight line. (If both axes were on the same scale, then this line would be the 45 degree line.) The most interesting things in a normal quantile plot are the dots, although JMP provides some extra assistance in interpreting the plot. The reference line in the plot is the 45 degree line that shows where you would expect the dots to lie. The bowed lines on either side of the reference line are guidelines to help you decide how far the dots can stray from the reference line before you can claim a departure from normality. Don't take one or two observations on the edge of the plot very seriously.


(a) Skewness: CEO Compensation (top 20 outliers removed)

(b) Heavy Tails: Corporate Profits

Figure 2.8: Examples of normal quantile plots for non-normal data.

If you notice a strong bend in the middle of the plot then that is evidence of non-normality. It is easier to see departures from normality in normal quantile plots than in histograms or boxplots because histograms and boxplots sometimes mask the behavior of the variable in the tails of the distribution. Examples of non-normal variables from the CEO data set are shown in Figure 2.8. The histograms for these variables appear in Figure 1.5 on page 10.
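The mechanics of a normal quantile plot are simple enough to sketch by hand (this assumes numpy, scipy, and matplotlib are available; the data are simulated stand-ins, not the CEO ages).

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    data = np.random.normal(loc=56, scale=7, size=800)   # stand-in for 800 ages

    ordered = np.sort(data)
    n = len(ordered)
    # Expected standard normal quantile for each ordered observation.
    theoretical = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

    plt.scatter(theoretical, ordered, s=5)
    plt.xlabel("Expected normal quantile")
    plt.ylabel("Observed value")
    plt.show()    # close to a straight line when the data are close to normal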

2.5 The Central Limit Theorem

The central limit theorem (CLT) explains why the normal distribution comes up as often as it does. The CLT, which we won't state formally, says that the sum of several random variables has approximately a normal distribution. The approximation gets better as more variables are included in the sum. How many do you need before the CLT kicks in? The answer depends on how close the individual random variables that you're adding are to being normal themselves. If they're highly skewed (like CEO compensation) then you might need a lot. Otherwise, once you've got around 30 random variables in the sum, you can feel pretty comfortable assuming the sum is normally distributed. Recall our interest in the distribution of weekly car sales from Section 2.3.3.


Figure 2.9: Distribution of weekly car sales, and a normal approximation.

Figure 2.9 shows the actual distribution of weekly car sales (the histogram) along with the normal curve with the mean and standard deviation that we derived in Section 2.3.3. The fit is not perfect (the weekly sales distribution is skewed slightly to the left), but it is pretty close. The weekly sales distribution is the sum of only 5 random variables. The normal approximation would fit even better to the distribution of monthly sales.
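A quick simulation makes the same point as Figure 2.9. The sketch below (numpy, scipy, and matplotlib assumed available) adds five independent daily sales draws many times and overlays the normal curve with mean 9.5 and standard deviation 2.11.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    daily = rng.choice([0, 1, 2, 3], p=[0.1, 0.2, 0.4, 0.3], size=(100_000, 5))
    weekly = daily.sum(axis=1)

    plt.hist(weekly, bins=range(0, 17), density=True, align="left", rwidth=0.9)
    w = np.linspace(0, 16, 200)
    plt.plot(w, stats.norm.pdf(w, loc=9.5, scale=4.45 ** 0.5))
    plt.xlabel("Weekly sales")
    plt.show()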

Figure 2.10: The German 10 Mark bank note, showing the normal distribution (faintly, but right in the center) and Carl Friedrich Gauss (1775–1855), who first derived it.

There are many phenomena in life that are the result of several small random components. The CLT explains why you would expect such phenomena to be normally distributed. There are a few caveats to the CLT which help explain why not every random variable is normally distributed. The random variables being added


are supposed to be independent and all come from the same probability distribution. In practice that isn't such a big deal. The CLT works as long as the dependence between the variables isn't too strong (i.e. they can't all be exactly the same number), and the random variables being added are on similar enough scales that one or two of them don't dominate the rest. The normal distribution and the central limit theorem have had a huge impact on science. So much so that Germany placed a picture of the normal curve and its inventor, a German mathematician named Gauss, on its 10 Mark bank note (before the switch to the Euro, of course).

Chapter 3

Probability Applications
For many students who are learning about probability for the first time, the subject seems abstract and somehow divorced from the real world. Nothing could be further from the truth. Probability models play a central role in several business disciplines, including finance, marketing, operations, and economics. This Chapter focuses on applying the probability rules learned in Chapter 2 to problems faced in these disciplines. Obviously we won't be able to go very deep into each area. Otherwise we won't have time to learn what probability can tell us about statistics and data analysis, which is a central goal of this course. Our goal in this Chapter is to present a few fundamental problems from basic business disciplines and to see how these problems can be addressed using probability models. In the process we will learn more about probability.

3.1 Market Segmentation and Decision Analysis

One of the basic tools in marketing is to identify market segments containing similar groups of potential customers. If you know the defining characteristics of a market segment then (hopefully) you can tailor a marketing strategy to each segment and do better than you could by applying the same strategy to the whole market. Market segmentation is a good illustration of decision theory, or using probability to help make decisions under uncertain circumstances. The uncertainty comes from the fact that you don't know with absolute certainty the market segment to which each of your potential customers should belong. However, it is possible to assign each potential customer a distribution describing the probability of segment membership. Presumably this distribution depends on observable characteristics such as age, credit rating, etc. Decision theory is about translating each probability


distribution into an action. For example, you may have developed two marketing strategies for the new gadget your firm has developed. Strategy E targets Early Adopters, and Strategy F targets Followers. Which approach should you apply to Joe, who is 32 years old, makes between $60K-80K per year, and owns an Ipod?

3.1.1 Decision Analysis

Decision analysis is about making a trade-off between the cost of making a bad decision and the probability of making a good one. To continue with the market segmentation example, suppose you've determined that there is a 20% chance that Joe is an Early Adopter, and an 80% chance that he is a Follower. Joe's age, income, and Ipod ownership were presumably used to arrive at these probabilities. Section 3.1.2 explores segment membership probabilities in greater detail. Intuitively it looks like you should treat Joe as a Follower, because that is the segment he is most likely to belong to. However, if Early Adopters are more valuable customers than Followers, it might make sense to treat Joe as belonging to something other than his most likely category.

Actions and Rewards

You can apply one of two strategies to Joe, and Joe can be in one of two categories. (More strategies and categories are possible, but let's stick to two for now.) In order to use decision analysis you need to know how valuable Joe will be to you under all four combinations. This information is often summarized in a reward matrix. A reward matrix explains what will happen if you choose a particular action when the world happens to be in a given state. For example, suppose each Early Adopter you discover is worth $1000, and each Follower is worth $100. Early Adopters are worth more because Followers will eventually copy them. However, to get these rewards you have to treat each group appropriately. If you treat an Early Adopter like a Follower then he may think your product isn't sufficiently cutting edge to warrant his attention. If you treat a Follower like an Early Adopter then he may decide your technology is too complicated. Thus you may have the following reward matrix.

                  Strategy E   Strategy F
Early Adopter         1000          150
Follower                10          100

These are high stakes in that you lose about 90% of the customer's potential value if you choose the wrong strategy.

Risk Profile

The mechanics of decision theory involve combining the information in the reward matrix with a probability distribution about which state is correct. The result is a risk profile, a set of probability distributions describing the reward that you will experience by taking each action. You have two options with Joe, so the risk profile here involves two probability distributions. Recall that Joe is 80% likely to be a Follower, so if you treat him as an Early Adopter you have an 80% chance of making only $10, but a 20% chance of making $1000. If you treat him as a Follower you have an 80% chance of making $100, but a 20% chance of making $150. Thus, the risk profile you face is as follows.

Reward          $10    $100   $150   $1000
Strategy E       .8      0      0      .2
Strategy F        0     .8     .2       0

Choosing an Action

Once a risk profile is computed, your decision simply boils down to deciding which probability distribution you find most favorable. When decisions involve only a moderate amount of money, the most reasonable way to distinguish among the distributions in your risk profile is by their expected value. The expected return under Strategy E is (.8)(10) + (.2)(1000) = $208. The expected return under Strategy F is (.8)(100) + (.2)(150) = $110. So Joe should be treated as an Early Adopter, even though it is much more likely that he is a Follower. Expected value is a good summary here because you are planning to market to a large population of people like Joe, about 20% of whom are Early Adopters. As you repeat the experiment of advertising to customers like Joe over a large population your average reward per customer will settle down close to the long run average, or expected value. If a decision involves substantially greater sums of money, such as deciding whether your firm should merge with a large competitor, then expected value should not be the only decision criterion. For example, if the stakes above were changed from dollars to billions of dollars, representing the market value of the firm after the action is taken, then many people will find a guaranteed market value of $100-150 billion preferable to the high chance of $10 billion, even if there is a chance of a home run turning the company into a trillion dollar venture.

Is this realistic?

As with probability models, decision analysis is as realistic as its inputs. For decision analysis to produce good decisions you need realistic reward information and a believable probability model describing the states of the unknown variables.
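The expected-reward comparison for Joe is a one-liner once the reward matrix and segment probabilities are written down; a sketch:

    rewards = {
        "Strategy E": {"Early Adopter": 1000, "Follower": 10},
        "Strategy F": {"Early Adopter": 150,  "Follower": 100},
    }
    segment_probs = {"Early Adopter": 0.2, "Follower": 0.8}

    expected = {action: sum(segment_probs[s] * r for s, r in payoff.items())
                for action, payoff in rewards.items()}
    print(expected)                           # {'Strategy E': 208.0, 'Strategy F': 110.0}
    print(max(expected, key=expected.get))    # Strategy E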


The entries in the reward matrix often make decision analysis seem artificial. It is certainly not believable that every Early Adopter will be worth exactly $1,000, for example. However, experts in marketing science have reasonably sophisticated probability models that they can employ to model things like the amount and probability of a customer's purchase under different sets of conditions. Expected values from these types of models can be used to fill out a reward matrix with numbers that have some scientific validity. If actions are chosen based on the highest expected reward then it makes sense for the entries of the reward matrix to be expected values, because it is legal to take averages of averages. The details of the models are too complex to discuss in an introductory class, but they can be found in Marketing elective courses. Decisions can also involve choosing more than one action. In fact, there is an entire discipline devoted to the theory of complex decisions that expands on the basic principles outlined above. Interested students can learn more in Operations Management elective courses.

3.1.2 Building and Using Market Segmentation Models

Although we will say nothing further about the models used to fill out the reward matrix, we can introduce a bit more realism about the probability of customers belonging to a particular market segment. One way to determine segment membership is simply to ask people if they engage in a particular activity. Collect this type of information from a subset of the market you want to learn about, and also collect information that you can actually observe for individuals in the broader market. For example,1 suppose the producers of the television show Nightline want to market their show to different demographic groups. They have an advertising campaign designed to reinforce the opinions of viewers who already watch Nightline, and another to introduce the show to viewers who do not watch it. The producers take a random sample of 75 viewers and get the results shown below.

Level   Number   Mean      Std Dev
No      50       34.9600    5.8238
Yes     25       57.9200   11.1502

Assuming the broader population looks about like the sample (an assumption we will examine more closely in Chapter 4), we might assume that 66.66% of the population are non-watchers with ages that are distributed approximately N (35, 5.8), and the remaining 33.33% of the population are watchers with ages distributed approximately N (58, 11). That is, we could simply fit normal models to each observed
1 This example is modified from Albright et al. (2004).


segment. (We could fit other models too, if we knew any and thought they might fit better.) Now suppose we have information (i.e. age) about a potential subject in front of whom we could place one of the two ads. Clearly, older viewers tend to watch the program more than younger viewers. The age of the potential viewer in question is 42, which is between the mean ages of watchers and non-watchers. What is the chance that he is a Nightline watcher? This is clearly a job for Bayes rule. We have a prior probability of .33 and we need to update this probability to reflect the fact that we know this person to be 42 years old. Let W denote the event that the person is a watcher, and let A denote the person's age. Bayes rule says

P (W |A = 42) ∝ P (W )P (A = 42|W )    and    P (notW |A = 42) ∝ P (notW )P (A = 42|notW ).

Clearly P (W ) = .33 and P (notW ) = .66. To get P (A = 42|W ) we have two choices. We could use the techniques described in Section 2.4 to compute P (A < 43) − P (A < 42) for a normal distribution with μ = 58 and σ = 11 (or μ = 35 and σ = 5.8 for notW ). We could also approximate this quantity by the height of the normal curve evaluated at age = 42. (See page 185 for an Excel command to do this.) The two methods give very similar answers, but the second is more typical. We get P (A = 42|W ) = 0.0126 and P (A = 42|notW ) = 0.0332. Now Bayes rule is straightforward:

segment    prior   likelihood   prior*like   posterior
W          .33     .0126        0.004158     0.16
notW       .66     .0332        0.021912     0.84
                                --------
                                0.026070
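The same update is easy to program, which is how a curve like Figure 3.1(b) is produced. A sketch using the normal density heights as likelihoods (means and standard deviations from the sample summary above):

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    prior = {"watcher": 1 / 3, "non-watcher": 2 / 3}
    likelihood = {"watcher": normal_pdf(42, 58, 11),         # about 0.0126
                  "non-watcher": normal_pdf(42, 35, 5.8)}    # about 0.0332

    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnorm.values())
    print({s: round(v / total, 2) for s, v in unnorm.items()})   # watcher: 0.16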

Thus there is a 16% chance that the 42-year-old subject is a Nightline watcher. It is easy enough to program a computer to do the preceding calculation for several ages and plot the results, which are shown in Figure 3.1. We see that the probability of viewership increases dramatically as Age moves from 40 to 50.

More Complex Settings

The approach outlined above is very flexible. It can obviously be extended to any number of market segments by simply fitting a different model to each segment. It can incorporate several observable characteristics (such as age, income, and geographic region) by developing a joint probability model for the observed characteristics for each market segment. The multivariate normal distribution (which we will not discuss) is a common choice.

50

CHAPTER 3. PROBABILITY APPLICATIONS

0.07

Watchers NonWatchers

0.06

0.05

P(nightline|age) 20 30 40 50 Age 60 70 80

0.03

0.04

0.02

0.00

0.01

0.0 20

0.2

0.4

0.6

0.8

1.0

30

40

50 age

60

70

80

(a)

(b)

Figure 3.1: (a) Age distributions for Nightline watchers and non-watchers. (b) P (W |A): the probability of watching Nightline, given Age.

3.2 Covariance, Correlation, and Portfolio Theory

Investors like to get the most return they can with the least amount of risk. It is generally accepted that a well diversified investment portfolio lowers an investor's risk. However there is more to portfolio diversification than simply purchasing multiple stocks. For example, a portfolio consisting only of shares from firms in the same industry cannot be considered diversified. This section seeks to quantify the additional risk incurred by investors who own shares of closely related financial instruments.

3.2.1 Covariance

The covariance between two variables X and Y is defined as

Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))].

For context, imagine that X is the return on an investment in one stock, and Y the return on another. If the joint distribution of X and Y is unavailable (as is often the case in practice) the covariance can be estimated from a sample of n pairs

(xi , yi ) using the formula

Cov(X, Y ) = (1/(n − 1)) Σi (xi − x̄)(yi − ȳ),

where the sum runs over the n pairs.

If Cov(X, Y ) is positive we say that X and Y have a positive relationship. This means that when X is above its average then Y tends to be above its average as well. Note that this does not guarantee that any particular Y will be large or small, only that there is a general tendency towards big Y 's being associated with big X's. If Cov(X, Y ) is negative we say that X and Y have a negative relationship. This means that as X increases Y tends to decrease. If X and Y are independent then Cov(X, Y ) = 0. Note, however, that a covariance of zero does not necessarily imply that X and Y are independent. A covariance of zero means that there is no linear relationship between X and Y , but there could be a nonlinear relationship (e.g. a quadratic relationship). One of the most common uses of covariance occurs when calculating the variance of a sum of random variables. Recall that if X and Y are independent then V ar(aX + bY ) = a²V ar(X) + b²V ar(Y ). If X and Y are not independent then we can still perform the calculation using Cov(X, Y ):

V ar(aX + bY ) = a²V ar(X) + b²V ar(Y ) + 2abCov(X, Y ).

Notice that if X and Y are independent, then Cov(X, Y ) = 0, so the general formula contains the simple formula as a special case. The good news is that you don't have to remember two different formulas. The bad news is that the formula you have to remember is the more complicated of the two. The sidebar on page 54 contains a trick to help you remember the general variance formula, even if there are more than two random variables involved. The best way to look at covariances for more than two variables at a time is to put them in a covariance matrix like the one in Table 3.1. The diagonal elements in a covariance matrix are the variances of the individual variables (in this case the variances of monthly stock returns). The off-diagonal elements are the covariances between the variables representing each row and column. A covariance matrix is symmetric about its diagonal because Cov(X, Y ) = Cov(Y, X).

3.2.2 Measuring the Risk Penalty for Non-Diversified Investments

Suppose you invest in a stock portfolio by placing w1 = 2/3 of your money in Sears stock and 1/3 of your money in Penney stock. What is the variance of your stock


portfolio? Let S represent the return from Sears and P the return from Penney. The total return on your portfolio is T = w1 S + w2 P . The variance formula says (using numbers from Table 3.1)

V ar(T ) = w1² V ar(S) + w2² V ar(P ) + 2w1 w2 Cov(S, P )
         = (2/3)²(0.00554) + (1/3)²(0.00510) + 2(2/3)(1/3)(0.00335)
         = 0.00452.

So SD(T ) = √0.00452 = 0.067. Notice that the variance of your stock portfolio is less than the variance of either individual stock, but it is more than it would be if Sears and Penney had been uncorrelated. If Cov(S, P ) = 0 then you wouldn't have had to add the factor of 2(2/3)(1/3)(0.00335), which you can think of as a covariance penalty for investing in two stocks in the same industry. Thus covariance explains why it is better to invest in a diversified stock portfolio. If Sears and Penney had been uncorrelated, then you would have had V ar(T ) = 0.00303, or SD(T ) = 0.055. What if you wanted to short sell Penney in order to buy more shares of Sears? In other words, suppose your portfolio weights were w1 = 1.5 and w2 = −0.5. Then

V ar(T ) = w1² V ar(S) + w2² V ar(P ) + 2w1 w2 Cov(S, P )
         = (1.5)²(0.00554) + (−.5)²(0.00510) + 2(1.5)(−.5)(0.00335)
         = 0.008715.

So SD(T ) = √0.008715 = 0.093. The formula can be extended to as many shares as you like (see the sidebar on page 54). It can be shown that for a portfolio with a large number of shares the factor that determines risk is not the individual variances but the covariance between shares. If all shares had zero covariance with one another, the portfolio variance would approach zero as the number of shares grows.
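The double sum in the sidebar on page 54 is easy to evaluate by machine. A sketch, using the Sears and Penney entries of Table 3.1, that reproduces the two standard deviations computed above:

    def portfolio_variance(weights, cov):
        n = len(weights)
        return sum(weights[i] * weights[j] * cov[i][j]
                   for i in range(n) for j in range(n))

    cov = [[0.00554, 0.00335],    # Sears
           [0.00335, 0.00510]]    # Penney

    print(portfolio_variance([2/3, 1/3], cov) ** 0.5)     # about 0.067
    print(portfolio_variance([1.5, -0.5], cov) ** 0.5)    # about 0.093 (short position)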

3.2.3 Correlation, Industry Clusters, and Time Series

It is hard to use covariances to measure the strength of the relationship between two variables because covariances depend on the scale on which the variables are measured. If we change the units of the variables we will also change the covariance. This means that we have no idea what a "large" covariance is. Therefore, if we want to measure how strong the relationship between two variables is, we use correlation rather than covariance. The correlation between two variables X and Y is defined as

Corr(X, Y ) = Cov(X, Y ) / (SD(X)SD(Y )).

(a) Covariance matrix

           Sears    K-Mart   Penney   Exxon    Amoco    Imp_Oil  Delta    United
Sears      0.00554  0.00382  0.00335  0.00109  0.00094  0.00120  0.00254  0.00340
K-Mart     0.00382  0.00725  0.00415  0.00077  0.00042  0.00055  0.00302  0.00378
Penney     0.00335  0.00415  0.00510  0.00042 -0.00002  0.00022  0.00252  0.00333
Exxon      0.00109  0.00077  0.00042  0.00230  0.00211  0.00202  0.00083  0.00071
Amoco      0.00094  0.00042 -0.00002  0.00211  0.00444  0.00257  0.00050  0.00028
Imp_Oil    0.00120  0.00055  0.00022  0.00202  0.00257  0.00621  0.00049  0.00083
Delta      0.00254  0.00302  0.00252  0.00083  0.00050  0.00049  0.00762  0.00667
United     0.00340  0.00378  0.00333  0.00071  0.00028  0.00083  0.00667  0.01348

(b) Correlation matrix

           Sears    K-Mart   Penney   Exxon    Amoco    Imp_Oil  Delta    United
Sears      1.0000   0.6031   0.6307   0.3060   0.1900   0.2042   0.3907   0.3928
K-Mart     0.6031   1.0000   0.6820   0.1882   0.0734   0.0817   0.4062   0.3829
Penney     0.6307   0.6820   1.0000   0.1220  -0.0045   0.0400   0.4043   0.4023
Exxon      0.3060   0.1882   0.1220   1.0000   0.6598   0.5338   0.1974   0.1276
Amoco      0.1900   0.0734  -0.0045   0.6598   1.0000   0.4894   0.0862   0.0358
Imp_Oil    0.2042   0.0817   0.0400   0.5338   0.4894   1.0000   0.0709   0.0906
Delta      0.3907   0.4062   0.4043   0.1974   0.0862   0.0709   1.0000   0.6580
United     0.3928   0.3829   0.4023   0.1276   0.0358   0.0906   0.6580   1.0000

Table 3.1: Correlation and covariance matrices for the monthly returns of eight stocks in three different industries.

Correlation is usually denoted with a lower case r or the Greek letter ρ (rho). Correlation has a number of very nice properties. Correlation is always between −1 and 1. A correlation near 1 indicates a strong positive linear relationship. A correlation near −1 indicates a strong negative linear relationship. Correlation is a unitless measure. In other words, whatever units (feet, inches, miles) we measure X and Y in, we get the same correlation (but a different covariance). Just as for covariance, a correlation of zero indicates no linear relationship. Correlations are often placed in a matrix just like covariances. One of the things you can see from the correlation matrix in Table 3.1(b) is that stocks in the same industry are highly correlated with one another, relative to stocks in different industries. For example, the correlation between Sears and Penney is .63, while the


Don't Get Confused! 3.1 A general formula for the variance of a linear combination. The formula for the variance of a portfolio which is composed of several securities is as follows:

V ar(Σi wi Xi ) = Σi Σj wi wj Cov(Xi , Xj ),

where both sums run from 1 to n. This formula isn't as confusing as it looks. What it says is to write down all the covariances in a big matrix. Multiply each covariance by the product of the relevant portfolio weights, and add up all the answers.

                Portfolio Weights
                 w1     w2     w3
         w1      V1     C12    C13
         w2      C21    V2     C23
         w3      C31    C32    V3
                Covariance Matrix

If you think about the variance formula using this picture it should (among other things) help you remember the formula for V ar(X − Y ) when X and Y are correlated. If there are more than two securities then you would probably want to use a computer to evaluate this formula.

correlation between Penney and Imperial Oil is .04. The correlation between Sears and the oil stocks is higher, presumably because Sears has an automotive division and Penney does not. These same relationships are present in the covariance matrix, but they are harder to see because some stocks are more variable than others. We've established what it means for a correlation to be 1 or −1, but what does a correlation of .6 mean? We're going to have to put that off until we learn about something called R² in Chapter 5. In the meantime, Figure 3.2 shows the correlations associated with a few scatterplots so that you can get an idea of what a strong correlation looks like.

Correlation and Scatterplots

We can get more information from a scatterplot of X and Y than from calculating the correlation. So why bother calculating the correlation? In fact one should always plot the data. However, the correlation provides a quick idea of the relationship.
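Converting a covariance matrix into a correlation matrix is a mechanical application of the definition above; a sketch using the first three stocks in Table 3.1:

    cov = [[0.00554, 0.00382, 0.00335],    # Sears, K-Mart, Penney
           [0.00382, 0.00725, 0.00415],
           [0.00335, 0.00415, 0.00510]]

    sd = [cov[i][i] ** 0.5 for i in range(len(cov))]
    corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]
    print([round(c, 2) for c in corr[0]])    # [1.0, 0.6, 0.63]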

3.2. COVARIANCE, CORRELATION, AND PORTFOLIO THEORY

55

Figure 3.2: Some plots and their correlations.


Don't Get Confused! 3.2 Correlation vs. Covariance

Correlation and covariance both measure the strength of the linear relationship between two variables. The only difference is that covariance depends on the units of the problem, so a covariance can be any real number. Correlation does not depend on the units of the problem. All correlations are between −1 and 1.

This is especially useful when we have a large number of variables and are trying to understand all the pairwise relationships. One way to do this is to produce a scatterplot matrix where all the pairwise scatterplots are produced on one page. However, a plot like this starts to get too complex to absorb easily once you include about 7 variables. On the other hand, a table of correlations can be read easily with many more variables. One can rapidly scan through the table to get a feel for the relationships between variables. Because correlations take up less space than scatterplots, you can include a correlation matrix in a report to give an idea of the relationships. By contrast, including a similar number of scatterplots might overwhelm your readers.

Autocorrelation

Correlation also plays an important role in the study of time series. Recall that in Section 2.2.3 we determined that there was "memory" in the S&P 500 data series, meaning that the Markov model was clearly preferred to a model that assumed up and down days occur independently. Correlation allows us to measure the strength of the day-to-day relationship. How? Correlation measures the strength of the relationship between different variables, but the time series of returns occupies only one column in the data set. The answer is to introduce lag variables. A lag variable is a time series that has been shifted up one row in the data set, as illustrated in Figure 3.3(b). The lag variable at time t has the same value as the original series at time t − 1, so it represents what happened one time period ago. The correlation between the original series and the lag variable can therefore be interpreted as the correlation in the series from one time period to the next. The name autocorrelation emphasizes that the correlation is between present and past values of the same time series, rather than between totally distinct variables. Notice that one could just as easily shift the series by any number k of rows to compute the correlation between the current time period and k time periods ago. A graph of autocorrelations at the first several lags is called the autocorrelation function. Figure 3.3(a) shows the autocorrelation function for the S&P 500 data series.
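To make the lag-variable idea concrete, here is a minimal sketch (with made-up return numbers rather than the actual S&P 500 series) showing how the lag-1 autocorrelation can be computed; shifting by k positions instead of 1 gives the lag-k autocorrelation.

    import numpy as np

    # Hypothetical daily returns; in practice this would be the S&P 500 series.
    returns = np.array([0.004, -0.002, 0.001, 0.003, -0.005, 0.002, 0.000, -0.001])

    # The lag variable at time t holds the value of the series at time t-1,
    # so pair each observation (from the second one on) with its predecessor.
    current = returns[1:]
    lagged = returns[:-1]

    # The lag-1 autocorrelation is the ordinary correlation between the
    # series and its lagged copy.
    lag1_autocorrelation = np.corrcoef(current, lagged)[0, 1]
    print(lag1_autocorrelation)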


The lag 1 autocorrelation is only .08, which is not very large (relative to the correlations in Figure 3.2, anyway). Thus, while we can be sure that there is some memory present in the time series, the autocorrelations say that the memory is weak.


Figure 3.3: (a) Autocorrelations for the S&P 500 data series. October 19, 1987 has been
excluded from the calculation. (b) Illustration of the lag variable used to compute the autocorrelations.

3.3 Stock Market Volatility

One of the features of financial time series data is that market returns tend to go through periods of high and low volatility. For example, consider Figure 3.4, which plots the daily return for the S&P 500 market index from January 3, 1950 to October 21, 2005.² Notice that the overall level of the returns is extremely stable, but that there are some time periods when the wiggles in the plot are more violent than others. For example, 1995 seems to be a period of low volatility, while 2001 is a period of high volatility. After the fact it is rather clear which time periods belong to high and low volatility states, but it can be hard to tell whether or not a transition is occurring when the process is observed in real time. Clearly there is a first-mover advantage to be had for analysts that can correctly identify the transition. For example, if an analyst is certain the market is entering a high volatility period then

² Data downloaded from finance.yahoo.com.


Figure 3.4: Daily returns for the S&P 500 market index. The vertical axis excludes a
few outliers (notably 10/19/1987) that obscure the pattern evident in the remainder of the data.

the analyst can move his customers into less risky positions (e.g. more bonds and fewer stocks). If an analyst has stock and bond strategies designed to be used during low and high volatility periods he could presumably write down the expected returns under each strategy in a reward matrix. Then the analyst just needs to know when the market's volatility state changes. Notice that the analyst is in a sticky situation. He needs to react quickly to volatility changes to serve his customers well, but if he reacts to every blip in the market then his clients will become annoyed with him, perhaps thinking he is making pointless trades with their money to collect commissions.

One way to model data like Figure 3.4 is to assume that the (unobserved) high/low volatility state follows a Markov chain, and that data from high and low volatility states follow different normal distributions. Because we don't get to see which days belong to which states, estimating the parameters of a model like this is hard, and requires special software. However, suppose the transition probabilities for the Markov chain and the parameters for the normal distributions were estimated to be

    State   Mean     SD
    Lo       0.0    .007
    Hi       0.0    .020

                   Today
    Yesterday    Lo      Hi
    Lo          .997    .003
    Hi          .005    .995

The transition probabilities suggest that high and low volatility states persist for long periods of time. That is, if today is in a low volatility state then the probability that tomorrow is low volatility is very high (.997), and similarly for high volatility. Each day the analyst can update the probability that the market is in a high volatility state based on that day's market return.

Figure 3.5: Distribution of returns under the low and high volatility states. The dotted vertical line shows today's data.

Denote the volatility state at time t by St. Suppose that yesterday's (i.e. time t − 1) probability was P(St−1 = Hi) = .001 and that today's market return was Rt = .02. How do we compute the probability for today, P(St | Rt)? This is a Bayes rule problem. Yesterday we had a probability distribution describing what would happen today. We need to update that probability distribution based on new information (today's return). We can get the marginal distribution for St because we have a marginal distribution for St−1 and conditional distributions P(St | St−1). That means the joint distribution is

                       St = Lo                     St = Hi
    St−1 = Lo   (.999)(.997) = 0.996003     (.999)(.003) = 0.002997
    St−1 = Hi   (.001)(.005) = 0.000005     (.001)(.995) = 0.000995

Thus P(St = Lo) = 0.996003 + .000005 = .996 and P(St = Hi) = .004. To update these probabilities based on today's return Rt = .02 we need to compute P(Rt = .02) using the height of the normal curves at .02 for each volatility state (see Figure 3.5). We use a computer to compute the height of the normal curves,³

³ If you feel uncomfortable doing this you can instead compute the probability that Rt is in a small interval around .02, say (.019, .021). You get different likelihoods, but the ratio between them will be approximately 12:1, so you will get about the same answers out of Bayes rule.


and plug them into Bayes rule as follows:

    State    prior    likelihood    pri*like     post
    Low      0.996       0.9          0.896      0.949
    High     0.004      12.1          0.048      0.051
                                     -------
                                      0.944

Notice what happened. The analyst saw data that looked 12 times more likely to have come from the high volatility state than the low volatility state. But because he knew that yesterday was low volatility, and that volatility states are persistent, he regarded the new evidence skeptically and still believes that it is much more likely that today is a low volatility state than a high one.
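The update is easy to reproduce numerically. The sketch below assumes the estimated model above (mean-zero normal returns with standard deviations .007 and .020, and the stated transition probabilities); the function name is just for illustration.

    from scipy.stats import norm

    # Estimated model from the text: mean-zero normal returns in each state.
    sd = {"Lo": 0.007, "Hi": 0.020}
    # Transition probabilities P(today's state | yesterday's state).
    P = {"Lo": {"Lo": 0.997, "Hi": 0.003},
         "Hi": {"Lo": 0.005, "Hi": 0.995}}

    def update_state_probability(p_yesterday, r_today):
        """One day of the Bayes rule update for the volatility state."""
        # Step 1: push yesterday's probabilities through the Markov chain.
        prior = {s: sum(p_yesterday[y] * P[y][s] for y in ("Lo", "Hi"))
                 for s in ("Lo", "Hi")}
        # Step 2: weight each state by the likelihood of today's return,
        # i.e. the height of that state's normal curve at r_today.
        like = {s: norm.pdf(r_today, loc=0.0, scale=sd[s]) for s in ("Lo", "Hi")}
        unnormalized = {s: prior[s] * like[s] for s in ("Lo", "Hi")}
        total = sum(unnormalized.values())
        return {s: unnormalized[s] / total for s in ("Lo", "Hi")}

    # Yesterday: P(Hi) = .001.  Today's return: .02.
    print(update_state_probability({"Lo": 0.999, "Hi": 0.001}, 0.02))
    # Roughly {'Lo': 0.95, 'Hi': 0.05}, matching the table above.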

Chapter 4

Principles of Statistical Inference: Estimation and Testing


This Chapter introduces some of the basic ideas used to infer characteristics of the population or process that produced your data based on the limited information in your dataset. The main objective of estimation is to quantify how confident you can be in your estimate using something called a confidence interval. The main goal of hypothesis testing is to determine whether patterns you see in the data are strong enough so that we can be sure they are not just random chance. The material on probability covered in Chapter 2 plays a very important role in estimation and testing. In particular, the Central Limit Theorem from Section 2.5, which describes how averages behave, is especially important because many of the quantities we wish to estimate and theories we wish to test involve averages.

4.1 Populations and Samples

A population is a large collection of individuals, entities, or objects that we would like to study. For example, we might be interested in some feature describing the population of publicly held corporations, or the population of customers who have purchased your product, or the population of red blood cells in your body. Most of the time it will be impossible, or at least highly impractical, to take the measurements we would like for every member of the population. For example, the population may be effectively infinite, such as in quality control problems where the population is all the future goods your production process will ever produce. A sample is a subset of a population. Not all samples are created equally.


Figure 4.1: A sample is a subset of a population. Sample statistics are used to estimate
population parameters.

In this Chapter and in most that follow we will assume that the sample is a simple random sample. Think of a simple random sample¹ as if the observations were drawn randomly out of a hat. An optional Section (7.4) in the Further Topics Chapter discusses the key ingredients of a good sampling scheme and what can go wrong if you have a bad one.

In Chapter 1 we learned that even if we had the whole population in front of us we would have to summarize it somehow. We can use the summaries from the sample to estimate the population summaries, but we will need some way to distinguish the two types of summaries in our discussions. Population summaries are known as parameters and are denoted with Greek letters (see Appendix C). It is customary to denote a population mean by μ and a population standard deviation by σ. Sample summaries are known as statistics. We have already seen the established notation x̄ and s for the sample mean and standard deviation. By themselves, sample statistics are of little interest because the sample is just a small fraction of the population. However, the magic of statistics (the field of study) is that statistics (the numbers) can tell us something about the population parameters we really care about.

¹ There is a mathematical definition of simple random sampling which involves something called the hypergeometric distribution. It is mind-numbingly dull, even to statisticians.

4.2 Sampling Distributions (or, What is Random About My Data?)

It is hard for some people to see where randomness and probability enter into statistics. They look at their data sets and say "These numbers aren't random. They're right there!" In a sense that is true. The numbers in your data set are fixed numbers. They will be the same tomorrow as they are today. However, there was a time before the sample was taken when these concrete numbers were random variables. Your data aren't random anymore, but they are the result of a random process. The trick to understanding how sample statistics relate to population parameters is to mentally put yourself back in time to just before the data were collected and think about the process of random sampling that produced your data.

Suppose the population you wish to sample from has mean μ and standard deviation σ (calculated as in Chapter 1). If you randomly select one observation from that population, then that one observation is a random variable X1 with expected value μ and standard deviation σ (in the sense of Chapter 2). The probability distribution of X1 is the histogram that you would plot if you could see the entire population. The observation gets promoted from the random variable X1 to the data point x1 once you actually observe it. The same is true for X2, X3, . . . , Xn. If the data in your data set are the result of a random process, then the sample statistics describing your data must be too. (If you took another sample, you would get a different mean, standard deviation, etc.) A sampling distribution is simply the probability distribution describing a particular sample statistic (like the sample mean) that you get by taking a random sample from the population. Don't let the name confuse you; it is just like any other probability distribution except that it is for a special random variable: a sample statistic. We need to understand as much as we can about a statistic's sampling distribution, because we only get to see one observation from that sampling distribution (e.g. each time we take a sample we see only one sample mean).

Let's think about the sampling distribution of X̄. We know three key facts.

1. First, E(X̄) = μ. You can show this is true using the rules for expected values on page 33. What this says is that the sample mean is an unbiased estimate of the population mean. Sometimes the sample mean will be too big, sometimes it will be too small, but on average it gives you the right answer.

2. We know that Var(X̄) = σ²/n because of the rules for variance on page 35. This is important, because the smaller Var(X̄) is the better chance X̄ has of being close to μ.

Because variance is hard to interpret, we usually look at the standard deviation of X̄ instead. When we talk about the standard deviation of a statistic we call it the standard error. The standard error of X̄ is²

    SE(X̄) = √Var(X̄) = σ/√n.

For example, if the standard error of X̄ is 1 we know that X̄ is typically about 1 unit away from μ. Generally speaking, the smaller the standard error of a statistic is, the more we can trust it. The formula for SE(X̄) tells us X̄ is a better guess for μ when σ is small (the individual observations in the population have a small standard deviation) or n is large (we have a lot of data in our sample).

3. The third key idea is the central limit theorem. It says that the average of several random variables is normal even if the random variables themselves are not normal. Remember that if the data are normally distributed then any individual observation has about a 95% chance of being within 2 standard deviations of μ. If the data are not normally distributed we can't make that statement. The central limit theorem is so important because it says "X̄ occurs within 2 standard errors of μ 95% of the time" even if the data (the individual observations in the sample or population) are non-normal.

What these three facts tell us is that the number x̄ in our data set is the result of one observation drawn from a normal distribution with mean μ and standard error σ/√n. We still don't know the numerical values of μ and σ, but Sections 4.3 and 4.5.1 show how to use this fact to back in to estimates of μ.

4.2.1 Example: log10 CEO Total Compensation

To make the idea of a sampling distribution concrete, consider the CEO compensation data set from Chapter 1. Imagine the collection of 800 CEOs is a population from which you wish to draw a sample, and that you can afford to obtain information from a sample of only 20 CEOs. Figure 4.2 shows the histogram of log10 CEO compensation for all 800 CEOs in the data set. It is somewhat skewed to the right, so you might not feel comfortable modeling this population using the normal distribution. In this contrived example we can actually compute the population mean, μ = 6.17.

² The X̄ in SE(X̄) is important. We're focusing on X̄ right now, but in the coming Chapters there are other statistics we will care about, such as the slope of a regression line. These other statistics have standard errors too, which will come from different formulas than SE(X̄).


Figure 4.2: The white histogram bars are log10 CEO compensation, our hypothetical
population. The gray histogram bars are the means of 1000 samples of size 20 randomly selected from the population of 800 CEOs. They represent the sampling distribution of X̄. If you took a random sample of 20 CEOs and constructed its mean, you would get one observation from the gray histogram.

The Figure also shows a gray histogram which was created by randomly drawing many samples of size 20 from the CEO population. We took the mean of each sample, then plotted a histogram of all those sample means. The gray histogram, which is the sampling distribution of X̄ in this problem, has all the properties advertised above: it is centered on the population mean of 6.17, it has a much smaller standard deviation than the individual observations in the population (by a factor of √20, though you can't tell by just looking at the Figure), and it is normally distributed even though the population is not. Remember Figure 4.2 fondly, because it is the last time you're going to see an entire population or an entire sampling distribution. In practice you only get to see one observation from the gray histogram. The trick is to know enough about how sampling distributions behave so that you have some idea about how far that one x̄ from your sample might be away from the population you wish you could see.
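If you would like to recreate a picture like Figure 4.2 yourself, a short simulation will do it. The sketch below substitutes a made-up skewed population for the actual CEO data (which is not reproduced here), draws 1000 samples of size 20, and compares the spread of the sample means to the spread of the population.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for the population of 800 log10 CEO compensations: any
    # right-skewed population makes the same point.
    population = 6 + rng.gamma(shape=2.0, scale=0.25, size=800)

    # Draw many samples of size 20 and record each sample mean.
    sample_means = np.array([rng.choice(population, size=20, replace=False).mean()
                             for _ in range(1000)])

    print(population.mean(), sample_means.mean())  # both are close to the population mean
    print(population.std(), sample_means.std())    # the sample means are less spread out,
                                                    # by roughly a factor of sqrt(20)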

4.3 Confidence Intervals

In the previous Section we saw that we could use X̄ as a guess for μ, and that the standard error of X̄ (SE(X̄) = σ/√n) gave us an idea of how good our guess is.

Don't Get Confused! 4.1 Standard Deviation vs. Standard Error

SD measures the spread of the data. If the SD is big then it would be hard for you to guess the value of a single observation drawn at random from the population. SE measures the amount of trust you can put in an estimate such as X̄. If the standard error of an estimate is small then you can be confident that it is close to the true population quantity it is estimating (e.g. that x̄ is close to μ).

Confidence intervals build on this idea. A confidence interval gives a range of possible values (e.g. 10 to 20) which we are highly confident μ lies within. For example, if we said that a 95% confidence interval for μ was 10 to 20, this would mean that we were 95% sure that μ lies between 10 and 20. The question is, how do we calculate the interval?

A confidence interval for μ with σ known

First assume that the population we are looking at has mean μ and standard deviation σ. Then X̄ has mean μ and standard error σ/√n. The central limit theorem also assures us that X̄ is normally distributed. This is useful because we know that a normal random variable is almost always (95% of the time) within 2 standard deviations of its mean. So X̄ will almost always be within 2σ/√n of μ. That means we can be 95% certain that μ is no more than 2σ/√n away from x̄.³ Therefore, if we take the interval

    [x̄ − 2σ/√n,  x̄ + 2σ/√n]

we have a 95% chance of capturing μ. (In fact, if we want to be exactly 95% sure of capturing μ we only need to use 1.96 rather than 2, but this is a minor point.)

What if we want to be 99% sure or only 90% sure of being correct? If you look in the normal tables you will see that a normal will lie within 2.57 standard deviations of its mean 99% of the time and within 1.645 standard deviations of its mean 90% of the time. Therefore in general we get

    [x̄ − zσ/√n,  x̄ + zσ/√n]

where z is 1.96 for a 95% interval, 1.645 for a 90% interval and 2.57 for a 99% interval. This formula applies to any other certainty level as well. Just look up z in the normal table.

³ Note the switch. In the previous sentence μ was known and X̄ was not. This sentence marks our move into the real world where we see x̄ and are trying to guess μ.

Not on the test 4.1 Does the size of the population matter?

The formula for the standard error of X̄ assumes the individual observations are independent. When taking samples from a finite population, the observations are actually very slightly correlated with one another because each unit in the sample reduces the chances of another unit being in the sample. If you take a simple random sample of size n from a population of size N then it can be shown that

    SE(X̄) = (σ/√n) √((N − n)/(N − 1)) ≈ (σ/√n) √(1 − n/N).

The extra factor at the end is called the finite population correction factor (FPC). In practice the FPC makes almost no difference unless your sample size is really big or the population is really small. For example, if your population has a billion people in it and you take a HUGE sample of a million people then the FPC is √(1 − 1 million/1 billion) = 0.9995. Contrast that with the 1/√n factor of 1/√(1 million) = .001, and it soon becomes apparent that the size of the sample is much more important than the size of the population.

4.3.1 Can we just replace σ with s?

The confidence interval formula for μ uses σ, which is the population standard deviation. But if we don't know μ, we probably don't know σ either. A natural alternative is to use the sample standard deviation s instead of the population standard deviation σ. We expect s to be close to σ, so why not use

    [x̄ − z s/√n,  x̄ + z s/√n]?

If we simply replace σ with s then we ignore the uncertainty introduced by using an estimate (s) in place of the true quantity σ. Our confidence intervals would tend to be too narrow, so they would not be correct as often as they should be. For example, if we use z = 1.96, so that we would expect to get a 95% confidence interval, the interval may in fact only be correct (cover μ) 80% of the time. To fix this problem we need to make the intervals a little wider so they really are correct 95% of the time. To do this we use something called the t distribution. All you really need to know about the t distribution is

1. You use it instead of the normal when you have to guess the standard deviation.


Figure 4.3: The normal distribution and the t distribution with 3, 10, and 30 degrees of
freedom. As the sample size (DF ) grows, the normal and t distributions become very close.

2. It is very similar to the normal, but with fatter tails.

3. There is a different t distribution for each value of n. When n is large⁴ the t and normal distributions are almost identical.

To make a long story short, the confidence interval you should be using is

    [x̄ − t s/√n,  x̄ + t s/√n].

Note the difference between t and z. Both measure the number of standard errors a random variable is above its mean, but z counts the number of true standard errors, while t counts estimated standard errors. The best way to find t is to use a computer, because there is a different t distribution for every value of n. There are tables that list some of the more interesting calculations for the t distribution for a variety of degrees of freedom, but we won't bother with them.

⁴ Greater than 30 is an established rule of thumb.

Not on the test 4.2 What are Degrees of Freedom?

Suppose you have a sample with n numbers in it and you want to estimate the variance using s². The first thing you do is calculate the mean, x̄. Then you calculate (x1 − x̄), (x2 − x̄), . . . , (xn − x̄), square them, and take their average. If you didn't square the deviations from the mean, their average would always be zero! Because the deviations from the mean must sum to zero, they don't represent n numbers' worth of information. There are n numbers there, but they are not all free because they must obey a constraint. The phrase "degrees of freedom" means the number of free numbers available in your data set. Generally speaking, each observation in the data set adds one degree of freedom. Each parameter you must estimate before you can calculate the variance takes one away. That is why we divide by n − 1 when calculating s².

Figure 4.3 plots the t distribution for a few different sample sizes next to the normal curve. When the sample size is small (there are few degrees of freedom) the t-distribution has much heavier tails than the normal. To capture 95% of the probability you might have to go well beyond 2 standard errors. Thus if you had a very small sample size your formula for a 95% confidence interval might be [x̄ ± 3SE(X̄)] or [x̄ ± 4SE(X̄)]. As the sample size grows, s becomes a better guess for σ and the formula for a 95% confidence interval soon becomes very close to what it would be if σ were actually known.
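A quick way to see how much wider the t-based interval is for small samples is to look up the 95% multiplier for a few sample sizes. This is a minimal sketch using a statistics library; the multiplier depends only on the degrees of freedom.

    from scipy.stats import norm, t

    # Number of (estimated) standard errors needed for a 95% interval.
    for n in (3, 10, 30, 100):
        multiplier = t.ppf(0.975, df=n - 1)
        print(n, round(multiplier, 2))           # 4.30, 2.26, 2.05, 1.98

    print("normal:", round(norm.ppf(0.975), 2))  # 1.96, the large-sample limit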

4.3.2 Example

A human resources director for a company wants to know the average annual life insurance expenditure for the members of a union with which she is about to engage in labor negotiations. She is considering whether it would be cost effective for the company to provide a life insurance benefit to the union members. The HR director obtains a random sample of 61 employees who provide the data in Figure 4.4. Find a 95% confidence interval for the union's average life insurance expenditure.

In this case finding the interval is easy, because it is included in the computer output: (429.48, 532.55). How did the computer calculate the interval? It is best to think of it in three steps. First, find the point estimate, or the single best guess for the thing you're trying to estimate. The best point estimate for a population mean is a sample mean, so the point estimate here is x̄ = 481. Second, compute the standard error of the point estimate, which is 201.2195/√61 = 25.7635. Third, compute the 95% confidence interval as x̄ ± 2SE(x̄). If you carry this calculation out by hand you may notice that you don't get exactly the same answer found in the computer output. That's because we used 2 for a number t that the computer looked up on its internal t-table with α = .05 and 60 = 61 − 1 degrees of freedom.

    Mean               481.0164
    Std Dev            201.2195
    Std Err Mean        25.7635
    upper 95% Mean     532.5511
    lower 95% Mean     429.4817
    N                        61

Figure 4.4: Annual life insurance expenditures for 61 union employees.

The computer's answer is more precise than ours, but we're pretty close. The confidence interval is shown graphically in Figure 4.4 as the diamond in the boxplot. Notice that the diamond covers very little of the histogram. A common misperception that people sometimes have is that 95% confidence intervals are supposed to cover 95% of the data in the population. They're not. A 95% confidence interval is supposed to have a 95% chance of containing the thing you're trying to estimate, in this case the mean of the population. After observing the sample of 61 people, the HR director still doesn't know the overall average life insurance expenditure for the entire union, but a good bet is that it is between $429 and $532.

What if the interval is too wide for the HR director's purpose? There are only two choices for obtaining a shorter interval, and both of them come at a cost. The first is to obtain more data, which will make SE(X̄) smaller. However, obtaining more data is costly in terms of time, money, or both. The second option is to accept a larger probability of being wrong (i.e. a larger α, the probability that the interval really doesn't contain the mean of the population), by going out fewer SEs from the point estimate. If you want, you can have an interval of zero width, but you will have a 100% chance of being wrong!

How much more data does the HR director need to collect? That depends on how narrow an interval is desired. Right now the interval has a width of about $100, or a margin of error of about $50. (The margin of error is the ± term in a confidence interval.) Suppose the desired margin of error E is $25. The margin of error for a confidence interval is

    E = t s/√n.

The HR director wants a 95% confidence interval, so t ≈ 2, and we can feel pretty good about s ≈ 200 based on the current data set. She can then solve for n ≈ 256. Of course you can solve the formula for n without substituting for the other

letters, which gives

    n = (ts/E)².

Assuming you have done a pilot study or have some other way to make a guess at s, this is an easy back-of-the-envelope calculation to tell you how much data you need to get an interval with the desired margin of error E and confidence level (determined by t).
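Both calculations are easy to reproduce from the summary statistics alone. The sketch below recomputes the interval in Figure 4.4 (using a statistics library for the t multiplier) and the back-of-the-envelope sample size for a $25 margin of error.

    import math
    from scipy.stats import t

    # Summary statistics from Figure 4.4.
    xbar, s, n = 481.0164, 201.2195, 61

    # 95% confidence interval using the t multiplier with n - 1 degrees of freedom.
    t_mult = t.ppf(0.975, df=n - 1)
    se = s / math.sqrt(n)
    print(xbar - t_mult * se, xbar + t_mult * se)   # about (429.5, 532.6)

    # Sample size needed for a $25 margin of error, using t ~ 2 and s ~ 200.
    E = 25
    print(math.ceil((2 * 200 / E) ** 2))            # 256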

4.4 Hypothesis Testing: The General Idea

Hypothesis testing is about ruling out random chance as a potential cause for patterns in the data set. For example, suppose the human resources manager from Section 4.3.2 needs to be sure that the per-capita insurance premium for union employees is less than $500. The sample of 61 people has an average premium of $481. Is that small enough so that we can be sure the population mean is less than $500? Obviously not, since the 95% confidence interval for the population mean stretches up to $532. It could easily be the case that μ = $500 and we just saw x̄ = $481 by random chance.

Our example makes it sound like hypothesis tests are simply an application of confidence intervals. In many instances (the so-called t-tests) the two techniques give you the same answer. However, confidence intervals and hypothesis tests are different in two respects. First, a hypothesis test compares a statistic to a prespecified value, like the HR director's $500 threshold. Hypothesis tests are often used to compare differences between two statistics, such as two sample means. In such instances the specified value is almost always zero. Second, hypothesis tests can handle some problems that confidence intervals cannot. For example, Section 4.5.3 describes a hypothesis test for determining whether two categorical variables are independent. There is no meaningful confidence interval to look at in that context.

Here is how hypothesis tests work, step by step.

1. You begin by assuming a null hypothesis, which is that the population quantity you're testing is actually equal to a specified value, and that any discrepancy you see in your sample statistic is just due to random chance.⁵ The null hypothesis is written H0, such as in H0: μ = $500 or H0: the variables are independent. Remember that a hypothesis test uses the sample to test something about the population. Thus it is incorrect to write H0: x̄ = 500 or H0: x̄ = 481. You can see x̄. You don't need to test for it.

⁵ It sounds like the null hypothesis is a pretty boring state of the world. If it were exciting, we wouldn't give it a hum-drum name like "null."

Don't Get Confused! 4.2 Which One is the Null Hypothesis?

There is a bit of art involved in selecting the null hypothesis for a hypothesis test. You will get better at it as you see more tests. Here are some guidelines. First, the null hypothesis has to specify an exact model, because in order to compute a p-value you (or some math nerd you keep around for problems like this) need to be able to figure out what the distribution of your test statistic would be if the null hypothesis were true. For example, μ = 815 or μ = 0 are valid null hypotheses, but μ > 815 is not. Consequently, the null hypothesis is almost always simpler than the alternative. For example, if you are testing the hypotheses "two variables are independent" and "two variables are dependent" then "independent" is the natural null hypothesis because it is simpler. Also, be sure to remember that a hypothesis test is testing a theory about a population or a process, not a sample or a statistic. Thus it is incorrect to write something like H0: x̄ = 815 when you really mean H0: μ = 815.

2. The next step is to determine the alternative hypothesis you want to test against. Often the alternative hypothesis is simply the opposite of the null hypothesis. It is written as H1 or Ha. For example, H1: μ ≠ $500 or Ha: the variables are related to one another. Sometimes you will only care about an alternative hypothesis in a specified direction. The HR director needs to show that μ < $500 before she is authorized to offer the life insurance benefit in negotiations, so her natural alternative hypothesis is Ha: μ < $500.

3. Identify a test statistic that can distinguish between the null and alternative hypotheses. The choice of a test statistic is obvious in many cases. For example, if you are testing a hypothesis about a population mean, a difference between two population means, or the slope of the population regression line, then good test statistics are the sample mean, the difference between two sample means, and the slope of the sample regression line. Some test statistics are a bit more clever. We'll see an example in Section 4.5.3. Test statistics are often standardized so that they do not depend on the units of the problem. Instead of x̄, you might see a test statistic represented as

    t = (x̄ − μ0) / SE(X̄),

where SE(X̄) = s/√n and μ0 is the value specified in the null hypothesis, like $500.

A test statistic which has been standardized by subtracting off its hypothesized mean and dividing by its standard error is known by a special name. It is a t-statistic. You should recognize this standardization as nothing more than the z-scoring that we learned about in Section 2.4. A t-statistic tells how many standard errors the first thing in its numerator is above the second. Because it has been standardized, you get the same t-statistic regardless of whether x̄ is measured in dollars or millions of dollars.

4. The final step of a hypothesis test looks at the test statistic to determine whether it is large or small (i.e. close to or far from the value specified in H0). Some people skip this step when working with t statistics, because we have some sense of what a big t looks like (i.e. around 2). However, there are other test statistics out there, with names like χ² and F, where the definition of "big" isn't so obvious. Even with t statistics, especially with small sample sizes of n < 30, the magic number needed to declare a test statistic significantly large will be different for different sample sizes.

The p-value is the main tool for measuring the strength of the evidence that a test statistic provides against H0. A p-value is the probability, calculated assuming H0 is true, of observing a test statistic that is as or more extreme than the one in our data set. In our HR director example, if the population mean had been μ = 500 then the probability of seeing a sample mean of 481 or smaller is 0.2321. In other words, it wouldn't be particularly unusual for us to see sample means like $481 if the true population mean were $500.
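For the HR director's test the whole calculation takes a few lines. This sketch uses the summary statistics from Figure 4.4 and a statistics library for the t distribution.

    from scipy.stats import t

    xbar, s, n, mu0 = 481.0164, 201.2195, 61, 500
    se = s / n ** 0.5                      # standard error of the sample mean
    t_stat = (xbar - mu0) / se             # about -0.74
    p_lower = t.cdf(t_stat, df=n - 1)      # lower-tail p-value, about 0.23
    print(t_stat, p_lower)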

4.4.1 P-values

Once you have a p-value, hypothesis testing is easy. The rule with p-values is: Small p-value ⇒ Reject H0. The smaller the p-value, the stronger the evidence against H0.

Here's why. If the p-value is very small then there are two possibilities.

1. The null hypothesis is true and we just got very strange data by bad luck.

2. The null hypothesis is false and we should conclude that the alternative is in fact correct.

The p-value measures how unlucky we would have to be to see the data in our data set if we were in case 1. If the p-value is small enough (say less than 5%) then we conclude that there's no way we're that unlucky, so we must be in case 2 and we can reject the null hypothesis. On the other hand, if the p-value is not too small then we fail to reject the null hypothesis.

This does not mean that we are sure the null hypothesis is true, just that we don't have enough evidence to be certain it is false. Roughly speaking, here is the language you can use for different p-values.

    p-value        Evidence against H0
    > .1           None
    .05 to .1      Weak
    .01 to .05     Moderate
    < .01          Strong

The p-value is not the probability that H0 is true. If it were, we would begin to prefer Ha as soon as p < .5 rather than .05. Instead, the p-value is a kind of "what if" analysis. It measures how likely it would be to see our data, if H0 had been true. The less likely the data would be if H0 were true, the less comfortable we are with H0. It takes some mental gymnastics to get your mind around the idea, but that's the trade-off with p-values. Using them is easy; understanding exactly what they say is a little harder.

4.4.2 Hypothesis Testing Example

Hypothesis tests have been developed for all sorts of questions: are two means equal? Are two variances equal? Are two distributions the same? In an ideal world you would know the details about how each of these tests worked. However, you often may not have time to get into the nitty gritty details of each test. Instead, you can find a hypothesis test with a null and alternative hypothesis that fit your problem, plug your data into a computer, and locate the p-value.

For example, suppose you want to know if two populations have different standard deviations. This might occur in a quality control application, or you might want to compare two stocks to see if one is more volatile than the other. Figure 4.5 shows computer output from a sample of two potato chip manufacturers. To the eye it appears that brand 2 has a higher standard deviation (i.e. is less reliable) than brand 1. Is this a real discrepancy or could it simply be due to random chance?

    Test              F Ratio    DFNum    DFDen    Prob > F
    O'Brien[.5]        9.0356      1        46      0.0043
    Brown-Forsythe    10.9797      1        46      0.0018
    Levene            15.2582      1        46      0.0003
    Bartlett          10.0318      1         .      0.0015

Figure 4.5: Potato chip output. P-values are located in the column marked "Prob > F".

Figure 4.5 presents output from four hypothesis tests. Each test compares the null hypothesis of equal variances to an alternative that the variances are unequal using a slightly different test statistic. Each test has a significant (i.e. small) p-value, which indicates that the variances are not the same (a small p-value says to reject the null hypothesis that the variances are the same). Thus the difference we see is too large to be the result of random chance. Brand 1 is more reliable (i.e. has a smaller standard deviation) than brand 2.

Notice the advantage of using p-values. We don't have to know how large a Brown-Forsythe F ratio has to be in order to understand the results of the Brown-Forsythe test. We can tell what Brown and Forsythe would say about our problem just by knowing their null and alternative hypothesis and looking at their p-value.

4.4.3 Statistical Significance

When you reject the null hypothesis in a hypothesis test you have found a statistically significant result. For example, the standard deviations in the potato chip output were found to be significantly different. All that means is that the difference is too large to be strictly due to chance. You should view statistical significance as a minimum standard. You should ignore small patterns in the data set that fail tests of statistical significance. When you find a statistically significant result, you know the result is real, but there is no guarantee that it is important to your decision making process. People sometimes refer to this distinction as statistical significance vs. practical significance. For example, suppose it is very expensive to adjust potato chip filling machines, and there are industry guidelines stating that bags must be filled to within 1 oz. Then the statistically significant difference between the standard deviations of the two potato chip processes is of little practical importance, because both processes are well within the industry limit. Of course, if calibrating the filling machines is cheap, then Figure 4.5 says brand 2 should recalibrate.

4.5 Some Famous Hypothesis Tests

Section 4.4.2 seems to argue that all you need to do a hypothesis test is the null hypothesis and the p-value. To a large extent that is true, but there are a few extremely famous hypothesis tests that you should know how to do by hand (with the aid of a calculator). This Section presents three tests which come up frequently enough that it is worth your time to learn them. The three tests are: the one sample t-test for a population mean, the z-test for a proportion, and the χ² test for independence between two categorical variables.

4.5.1 The One Sample T Test

You use the one sample t-test to test whether the population mean is greater than, less than, or simply not equal to some specified value μ0. Thus the null hypothesis is always H0: μ = μ0, where you specify a number for μ0. We have already seen one example of the one sample t-test performed by our friend the HR director. All the one sample t-test does is check whether x̄ is close to μ0 or far away. "Close" and "far" are measured in terms of standard errors using the t-statistic⁶

    t = (x̄ − μ0) / SE(X̄),

where SE(X̄) = s/√n. The only complication comes from the fact that there are three different possible alternative hypotheses, and thus three different possible p-values that could be computed. You only want one of them. The appropriate Ha depends only on the setup of the problem. It does not depend at all on the data in your data set.

One Tail, or Two? (How to Tell and Why it Matters)

The one sample t-test tests the null hypothesis H0: μ = μ0 where μ0 is our hypothesized value for the mean of the population. There are 3 possible alternative hypotheses:

(a) Ha: μ ≠ μ0. This is called a two tailed (or two sided) test.

(b) Ha: μ > μ0. This is called a one tailed test.

(c) Ha: μ < μ0. This is called a one tailed test.

⁶ In a hypothesis test you calculate t and use it to compute a p-value. This is the opposite of confidence intervals, which look up t so that it matches a pre-specified probability such as 95%.
    (a) p = 0.4642        (b) p = .7679        (c) p = 0.2321

Figure 4.6: The three p-value calculations for the three possible alternative hypotheses in the one sample t-test. Based on the data from Figure 4.4.

The null hypothesis determines where the sampling distribution for X̄ is centered. The alternative hypothesis determines how you calculate your p-value from the sampling distribution. This can be a bit confusing, but it helps to remember that a p-value provides evidence for rejecting the null hypothesis in favor of a specified alternative. For the one sample t-test you can think of the p-value as the probability of seeing an X̄ that supports Ha even more than the x̄ in your sample. Our HR director was trying to show that μ < $500, so the relevant calculation is the probability that she would see an X̄ even smaller than $481 (her sample mean) if μ really was $500 and the random sampling process were repeated again. Her calculation is depicted in Figure 4.6(c).

The other two probability calculations in Figure 4.6 are irrelevant to the HR director, but they illustrate how the p-values would be calculated under the other alternative hypotheses. What type of x̄ would support Ha if it had been μ ≠ 500? An x̄ far away from $500 in either direction. Thus if Ha is μ ≠ μ0 then the p-value calculates the probability that you would see future X̄s that are even farther away from μ0 than the one in your sample if H0 were true. Likewise, if Ha is μ > μ0 then the p-value is the probability that you would see X̄s even larger than the x̄ in your sample if H0 were true and the sampling process were repeated. Obviously, the upper and lower tail p-values must sum to 1, and the two tailed p-value is twice the smaller of the one tailed p-values.

Remember that you choose the relevant alternative hypothesis from the context of the problem without regard to whether you saw x̄ > μ0 or x̄ < μ0. Otherwise, people would only ever do one tailed tests. If you are not sure which alternative hypothesis to use, pick the two tailed test. If you're wrong then all you've done is apply a tougher standard than you needed to. That could keep you out of court when you become CEO.⁷

⁷ Mandatory Enron joke of 2003.

Example 1

One of the components used in producing computer chips is a compound designed to make the chips resistant to heat. Industry standards require that a sufficient amount of the compound be used so that the average failing temperature for a chip is 300 degrees (Fahrenheit). The compound is relatively expensive, so manufacturers don't want to use more than is necessary to meet the industry guidelines. Each day a sample of 30 chips is taken from the day's production and tested for heat resistance, which destroys the chips. Suppose, on a given day, the average failure point for the sample of 30 chips is 305 degrees, with a standard deviation of 8 degrees. Does it appear the appropriate heat resistance target is being met?

Let μ be the average population failure temperature. The null hypothesis is H0: μ = 300. What is the appropriate alternative? We are interested in deviations from the target temperature on either the positive or negative side, so the best alternative is Ha: μ ≠ 300. The standard error of the mean is 8/√30 = 1.46, so the t-statistic is 3.42. We can use the normal table to construct the p-value because there are at least 30 observations in the data set. What p-value should we calculate? We want to know the probability of seeing sample means at least as far away from 300 degrees as the mean from our sample, so the p-value is P(X̄ ≥ 305) + P(X̄ ≤ 295) ≈ P(Z > 3.42) + P(Z < −3.42). (The approximate equality is because we're using the normal table instead of the inconvenient t-table.) Looking up 3.42 on the normal table gives us a p-value of .0003 + .0003 = .0006. If the true heat tolerance in the population of computer chips was 300 degrees, we would only see a sample mean this far from 300 about 6 times out of 10,000. That makes us doubt that the true mean really is 300 degrees. It looks like the average heat tolerance is higher than we need it to be.

Example 2

An important use of the one sample t-test is in "before and after" studies where there are two observations for each subject, one before some treatment was applied and one after. You can use the one sample t-test to determine whether there is any benefit to the treatment by computing the difference (after − before) for each person and testing whether the average difference is zero. This type of test is so important it merits its own name: the paired t-test, even though it is just the one sample t-test applied to differences.

For example, suppose engineers at Floogle (a company which designs internet search engines) have developed a new version of their search engine that they would like to market as "our fastest ever." They test the new version against their previous "fastest ever" search engine by running both engines on a suite of 40 test searches, randomly choosing which program goes first on each search to avoid potential biases. The difference in search times (new − old) is recorded for each search in the test suite. The average difference is −3.775 (the average search for the new engine was 3.775 milliseconds less than under the old engine). The 40 differences have a standard deviation of 21.4 milliseconds. Do the data support the claim that the new engine is their fastest ever?

Let μ be the difference in average search times between the two search engines. The appropriate null hypothesis here is μ = 0 (no difference in average speed). What should the alternative be? The engineers want to show that the new engine is significantly faster than the old one, so the natural alternative hypothesis is μ < 0. The standard error of the differences is SE(X̄) = 21.4/√40 = 3.38, so the t-statistic is t = −3.775/3.38 = −1.12. Should we compute an upper tail, lower tail, or two tailed p-value? Our alternative hypothesis is μ < 0, so we should compute a lower-tailed p-value. From the normal table (because n > 30) we find p ≈ 0.1314. The large p-value says that the new search engine didn't win the race by a sufficient margin to show that it is faster than the old one.
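The paired-test arithmetic is easy to check. This sketch reproduces the Floogle calculation from the summary statistics, using the normal approximation the text relies on when n > 30.

    import math
    from scipy.stats import norm

    # Summary statistics for the 40 search-time differences (new - old), in ms.
    mean_diff, sd_diff, n = -3.775, 21.4, 40

    se = sd_diff / math.sqrt(n)    # about 3.38
    t_stat = mean_diff / se        # about -1.12
    p_lower = norm.cdf(t_stat)     # lower-tail p-value, about 0.13
    print(t_stat, p_lower)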

4.5.2 Methods for Proportions (Categorical Data)

Recall from Chapter 1 that continuous data are often summarized by means, and categorical data are summarized by the proportion of the time they occur. Let p denote the proportion of the population with a specified attribute. The sample proportion is p̂ (pronounced "p-hat"). You compute p̂ by simply dividing the number of successes in your sample by the sample size. For a concrete example, suppose a car stereo manufacturer wishes to estimate the proportion of its customers who, one year after purchasing a car stereo, would recommend the manufacturer to a friend. Suppose the manufacturer obtains a random sample of 100 customers, 73 of whom respond positively. Then our best guess at p is p̂ = .73.

There is good news, and even more good news about proportions. The good news is that proportions are actually a special kind of mean. Imagine that person i is labeled with a number xi, which is 1 if they would recommend our car stereo to a friend, and 0 if they would not. We can calculate the proportion of favorable reviews by averaging all those 0s and 1s.

Thus, everything we learned about sample means carries over to sample proportions. In particular, the central limit theorem implies p̂ obeys a normal distribution.

The second bit of good news about proportions is that it is even easier to calculate the standard error of p̂ than it is for X̄. Here's why. Data which only assume the values 0 and 1 (or sometimes −1 and 1) are called indicator variables or dummy variables. Note the following useful fact about dummy variables. Let

    Xi = 1 with probability p,
         0 with probability 1 − p.

Then it is easy to show E(Xi) = p and Var(Xi) = p(1 − p). (Try it and see! Hint: in this one special case Xi² = Xi.) What's so useful about our useful fact? Because proportions are just means in disguise, and we know that SE(X̄) = SD(X)/√n, our useful fact says

    SE(p̂) = √( p(1 − p) / n ).

So with proportions, if you have a guess at p you also have a guess at SE(p̂).

Don't Get Confused! 4.3 The Standard Error of a Sample Proportion

There are three ways to calculate SE(p̂) in practice. All of them recognize SE(p̂) = √(p(1 − p)/n). They differ in what you should plug in for p.

    Guess for p    Used in . . .                       Rationale
    p̂              Confidence intervals                Best guess for p.
    p0             Hypothesis testing                  You're assuming H0 is true. H0 says p = p0.
    1/2            Estimating n for a future study     A conservative worst case estimate of p. Usable even with no data.

Confidence Intervals

To produce a confidence interval for p we use the formula

    p̂ ± z SE(p̂)


where SE(p̂) = √(p̂(1 − p̂)/n) and z comes from the normal table.⁸ Suppose, for example, that we sample n = 100 items from a shipment and find that 15 are defective. What is a 95% confidence interval for the proportion of defective items in the whole shipment? The point estimate is p̂ = 15/100 = .15 and its standard error is SE(p̂) = √((.15)(.85)/100) = 0.0357, so the confidence interval is 0.15 ± 1.96 × 0.0357 = [0.08, 0.22]. The shipment contains between 8% and 22% defective items (with 95% confidence).

⁸ We use the normal table and not the t table here because one of the t assumptions is that the data are normally distributed, which can't be true when the data are 0s and 1s. Therefore the results for proportions presented in this section are for large samples with n > 30.

You may notice that, underneath the square root sign, SE(p̂) is a quadratic function of p. The standard error is zero if p = 0 or p = 1. That makes sense, because if the shipment contained either all defectives or no defectives then every sample we took would give us either p̂ = 1 or p̂ = 0 with no uncertainty at all. We get the largest SE(p̂) when p = 1/2. It is useful to assume p = 1/2 when you are planning a future study and want to know how much data you need to achieve a specified margin of error. Recall from page 71 that to get a margin of error E you need roughly n = (ts/E)² observations. If you want a 95% confidence interval then t ≈ 2, and if you assume p = 1/2 then s (the standard deviation of an individual observation) is √(p(1 − p)) = √((1/2)(1/2)) = 1/2. Therefore to estimate a proportion to within E you need roughly n ≈ 1/E² observations. To illustrate, suppose you wish to estimate the proportion of people planning to vote for the Democratic candidate in the next election to within 2%. You would want a sample of n = 1/(.02)² = 1/(.0004) = 2500 voters. To get the margin of error down to 1% you would need 1/(.0001) = 10,000 voters. Standard errors decrease like 1/√n, so to cut SE(p̂) in half you need to quadruple the sample size.

Hypothesis Tests

Suppose your plant employs an extremely fault tolerant production process, which allows you to accept a shipment as long as you can be confident that there are fewer than 20% defectives. Should you accept the shipment with 15 defectives in the sample of 100? You could answer with a hypothesis test of H0: p = .2 versus Ha: p < .2. Then SE(p̂) = √((.2)(.8)/100) = .04 and the test statistic is

    z = (0.15 − 0.2) / 0.04 = −1.25.

We used .2 instead of .15 in the formula for SE(p̂) because hypothesis tests compute p-values by assuming H0 is true, and H0 says p = .2. Our test statistic says that p̂ is 1.25 standard errors below .2. For the alternative hypothesis Ha: p < p0 the p-value is P(Z < −1.25) = 0.1056 from the normal table. If H0 were true and p = .2 then we would see p̂ ≤ .15 about 10% of the time, which is not all that unusual. We should reject the shipment, because we can't reject H0.
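Both the interval and the test for the shipment example can be checked in a few lines. Note how the standard error uses p̂ for the confidence interval but the null value p0 for the hypothesis test.

    import math
    from scipy.stats import norm

    n, defects = 100, 15
    p_hat = defects / n

    # 95% confidence interval for the shipment's defect rate.
    se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - 1.96 * se_ci, p_hat + 1.96 * se_ci)   # about (0.08, 0.22)

    # Test of H0: p = .2 versus Ha: p < .2 (the SE uses the null value .2).
    p0 = 0.2
    se_test = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_test
    print(z, norm.cdf(z))                               # -1.25 and about 0.106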

4.5.3 The χ² Test for Independence Between Two Categorical Variables

The final hypothesis test in this Chapter is a little different from the others because there is no confidence interval to which it corresponds. The test is called the χ² test. (χ is the Greek letter chi, which is pronounced with a hard "k" sound and rhymes with "sky.") The χ² test investigates whether there is a relationship between two categorical variables. For example, is there a relationship between gender and type of car purchased? Figure 4.7 describes the type of car purchased by a random sample of 263 men and women. In the sample women buy a slightly higher percentage of family cars than men, and men buy a slightly higher percentage of sporty cars than women. The hypotheses for the χ² test are

    H0: The two variables are independent (i.e. no relationship)
    Ha: There is some sort of relationship

To decide which we believe we produce a contingency table for the two variables and calculate the number of people we would expect to fall in each cell if the variables really were independent. If the observed numbers (i.e. the numbers that actually happened) are very different from what we would expect if X and Y were independent then we will conclude there must be a relationship.

The first step is to compute how many observations we would expect to see in each cell of the table if the variables were actually independent. Recall that if two random variables X and Y are independent then P(X = x and Y = y) = P(X = x)P(Y = y). From the table in Figure 4.7 we see that the proportion of men in the sample is .5475 (= 144/263) and the proportion of sporty cars in the sample is .3004 (= 79/263). If TYPE and GENDER were independent we would expect the proportion of men with sporty cars in the data set to be (.5475)(.3004), which means we would expect to see 263 × (.5475)(.3004) = 43.25 observations in that cell of the table. In more general terms,

    Ei = n px py = nx ny / n,

where Ei is the expected cell count in the ith cell of the table, and px and py are the marginal proportions for that cell. The first equation is how you should think about Ei being calculated when someone else (i.e. the computer) does the calculation for you. If you have to do the calculation yourself use the second form, which is a shortcut you get by noticing px = nx/n and py = ny/n, where nx and ny are the marginal counts for the cell (e.g. number of men and number of sporty cars).

    Test        ChiSquare    Prob > ChiSq
    L. Ratio      1.420         0.4915
    Pearson       1.417         0.4924

Figure 4.7: Automobile preferences for men and women.

Once you have determined how many observations you would expect to see in each cell of the table, you need to see whether the table you actually observed is close to or far from the table you would expect if the variables were independent. The Pearson χ² test statistic⁹ fits the bill:

    X² = Σᵢ (Ei − Oi)² / Ei,

where Ei and Oi are the expected and observed counts for each cell in the table. You divide by Ei because you expect cells with bigger counts to also be more variable. Dividing by Ei in each term puts cells with large and small expected counts on a level playing field. Otherwise X² is just a way to measure the distance between your observed table and the table you would expect if the variables were independent. How large does X² need to be to conclude X and Y are related? That depends on the number of rows and columns in the table. If there are R rows and C columns, then you compare X² to the χ² distribution with (R - 1)(C - 1) degrees of freedom. (Imagine chopping off one row and one column of the table and counting the cells
9. There is another chi-square statistic called the likelihood ratio statistic, denoted G², that we won't cover. Even though they are calculated differently, G² ≈ X² and both statistics are compared to the same reference distribution.


Not on the test 4.3 Rationale behind the χ² degrees of freedom calculation
The χ² test compares two models. The degrees of freedom for the test is the difference in the number of parameters needed to fit each model. The simpler model assumes the two variables are independent. To specify that model you need to estimate R - 1 probabilities for the rows and C - 1 probabilities for the columns. The -1 comes from the fact that probabilities have to sum to one. The more complicated model assumes that the two variables are dependent, so each cell in the table gets its own probability. That's RC - 1 free probabilities. A little arithmetic shows that the complicated model has (R - 1)(C - 1) more parameters than the simple model.

in the smaller table.) The χ² distribution, like the t distribution, is a probability distribution we think you should know about, but not have to learn tables for. Figure 4.8 shows the χ² distribution with 5 degrees of freedom. The χ² distribution with d degrees of freedom has mean d and standard deviation √(2d). Thus for larger tables you need a larger X² to declare statistical significance, which makes sense because larger tables contribute more terms to the sum. Large X² values make you want to reject H0 (and say there is a relationship between the two variables), so the p-value for the χ² test must be the probability in the upper tail of Figure 4.8, i.e. the probability that you would see an observed table even farther away from the expected table if the variables were truly independent and a second random sample was taken. In the auto choice example, X² = 1.417 on (3-1)(2-1) = 2 degrees of freedom, for a p-value of .4924. If TYPE and GENDER really were independent, we would see observed tables this far or farther from the expected tables about half the time. The small differences we see between men and women in the sample could very easily be due to random chance. To see what a relationship looks like, recall Figure 1.6 on page 11, which showed the preferences for automobile type across three different age groups. The data in Figure 1.6 produce X² = 29.070 on (3-1)(3-1) = 4 degrees of freedom, yielding a p-value of p < .0001. There is very strong evidence of a relationship between TYPE and AGEGROUP. The χ² test is a test for the entire table. In principle you could design a hypothesis test to examine subsets of a contingency table, but we will not discuss them here. A good strategy with contingency tables is to perform a χ² test first, to see if there is any relationship between the variables in the table. If there is, you can use methods similar to Section 4.5.2 to compare individual proportions.
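For readers who want to see the mechanics, here is a small Python sketch (not part of the course software) that computes expected counts from marginal totals and looks up χ² p-values. The individual cell counts of Figure 4.7 are not reproduced in these notes, so the function is shown generically; the printed numbers match the ones quoted above.

    import numpy as np
    from scipy.stats import chi2

    # Expected count for the (men, sporty) cell: Ei = nx * ny / n
    n, n_men, n_sporty = 263, 144, 79
    print(round(n_men * n_sporty / n, 2))          # 43.25

    def pearson_x2(observed):
        """observed: an R x C array of cell counts."""
        observed = np.asarray(observed, dtype=float)
        row = observed.sum(axis=1, keepdims=True)
        col = observed.sum(axis=0, keepdims=True)
        expected = row * col / observed.sum()      # Ei = nx * ny / n for every cell
        x2 = ((expected - observed) ** 2 / expected).sum()
        df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
        return x2, df, chi2.sf(x2, df)             # p-value from the upper tail

    print(round(chi2.sf(1.417, 2), 4))             # 0.4924, TYPE vs. GENDER
    print(chi2.sf(29.070, 4))                      # far below .0001, TYPE vs. AGEGROUP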


Figure 4.8: The χ² distribution with 5 degrees of freedom. p-values come from the upper tail of the distribution.

Caveat There is one important caveat to keep in mind for the χ² test. The p-values are generally not trustworthy if the expected number of observations in any cell of the table is less than 5. However, if that is the case then you may form a new table by collapsing one or more of the offending categories into a single category (i.e. replacing "red" and "blue" with "red or blue").


Chapter 5

Simple Linear Regression


In this Chapter and in Chapter 6 we will consider relationships between two or more variables. The two most common reasons for modeling joint relationships are to understand how one variable will be affected if we change another (for example, how will profit change if we increase production) and to actually predict one variable using the value of another (e.g. if we price our product at $X what will be our sales Y). There are any number of ways that two variables could be related to each other. We will start by examining the simplest form of relationship, linear or straight line, but it will become clear that we can extend the methods to more complicated relationships.

5.1

The Simple Linear Regression Model

The idea behind simple linear regression is very easy. We have two variables X and Y and we think there is an approximate linear relationship between them. If so then we can write the relationship as

    Y = β0 + β1 X + ε,

where β0 is the intercept term (i.e. the value of Y when X = 0), β1 is the slope (i.e. the amount that Y increases by when we increase X by 1) and ε is an error term (see Figure 5.1). The error term comes in because we don't really think that there is an exact linear relationship, only an approximate one. A fundamental regression assumption is that the error terms are all normally distributed with the same standard deviation σ. An equivalent way of writing the linear regression model is Y ~ N(β0 + β1 X, σ).


Figure 5.1: An illustration of the notation used in the regression equation.

Recall that Y ~ N(μ, σ) means Y is a normally distributed random variable with mean μ and standard deviation σ. Writing the regression model this way helps make the connection to Chapters 2 and 4. All that has changed is that now we're letting μ depend on some background variable that we're calling X. If we are interested in how Y changes when we change X then we look at the slope β1. If we wish to predict Y for a given value of X we use ŷ = β0 + β1 x. In practice there is a problem: β0 and β1 are unknown (just like μ was unknown in Chapter 4) because we can't see all the X's and Y's in the entire population. Instead we get to see a sample of X and Y pairs and need to use these numbers to guess β0 and β1. For any given values of β0 and β1 we can calculate the residuals ei = yi - ŷi, which measure how far the y that we actually saw is from its prediction. The residuals are signed so that a positive residual means the point is above the regression line, and a negative residual means the point is below the line. We choose the line (i.e. choose β0 and β1) that makes the sum of the squared residuals as small as possible (we square the residuals to remove the sign). Said another way, regression tries to minimize

    SSE = e1² + e2² + ⋯ + en².

SSE stands for the "sum of squared errors" and the estimates for β0 and β1 are called b0 and b1. Note that the line that we get is just a guess for the true line (just as x̄ is a guess for μ). Therefore b0 and b1 are random variables (just like x̄) because if we took a new sample of X and Y values we would get a different line. You obtain guesses for β0 and β1 by minimizing SSE. What about a guess for σ?

Not on the test 5.1 Why sums of squares?
There are a number of reasons why the sum of squared errors is the model fitting criterion of choice. The first is that it is a relatively easy calculus problem. You could (but thankfully don't have to) take the derivative of SSE with respect to β0 and β1 (viewing all the xi's and yi's as constants) and solve the system of two equations without too much difficulty. On a deeper level, sums of squares tell us how far one thing is from another. Do you remember the distance formula from high school algebra? It says that the squared distance between two ordered pairs (x1, y1) and (x2, y2) is d² = (x2 - x1)² + (y2 - y1)². The corresponding formula works for ordered triples plotted in 3-space, and for ordered n-tuples in higher dimensions. The sum of squared errors is the squared distance from the observed vector of responses (y1, . . . , yn) to the prediction vector (ŷ1, . . . , ŷn). By minimizing SSE, regression gets those two vectors to be as close as possible.


In earlier Chapters, we said that variance was the average squared deviation from the mean. The same is true here, but in a regression model the mean for each observation depends on X. So instead of averaging (yi - ȳ)², we average (yi - ŷi)². One other difference is that in a regression model we must estimate two parameters (the slope and intercept) to compute ŷi, so we use up one more degree of freedom than we did when our best guess for Y was simply ȳ. Thus the estimate of σ² is

    s² = Σi (yi - ŷi)² / (n - 2).

Now look at the sum in the numerator of s². That's SSE! People sometimes call s² the "mean square error," or MSE, because to calculate the variance you take the mean of the squared errors. Those same people would call s = √s² the "root mean square error," or RMSE. When you estimate a regression model you estimate the three numbers b0, b1, and s. The first two tell you where the regression line goes. The last one gives you a sense of how spread out the Y's are around the line.
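If you would rather see b0, b1, and s computed than read them off JMP output, the sketch below fits a simple regression to a made-up data set in Python. The closed-form least squares solution (b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)², b0 = ȳ - b1 x̄) is standard, although it is not derived in these notes.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)                 # made-up data for illustration
    y = 3 + 2 * x + rng.normal(0, 1.5, size=50)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()                   # the line passes through (x-bar, y-bar)

    resid = y - (b0 + b1 * x)                       # ei = yi - yhat_i
    s = np.sqrt(np.sum(resid ** 2) / (len(y) - 2))  # RMSE: SSE divided by n - 2
    print(b0, b1, s)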

5.1.1

Example: The CAPM Model

Figure 5.2 shows the results obtained from regressing monthly stock returns for Sears against a value weighted stock index representing the overall performance of the stock market. The capital asset pricing model (CAPM) from finance says that you can expect the returns for an individual stock to be linearly related to the


overall stock market. The equation of the estimated line describing this relationship can be found in the Parameter Estimates portion of the computer output. The first column lists the names of the X variables in the regression. The second column lists their coefficients in the regression equation.¹ So the regression equation is

    Sears = -0.003307 + 1.123536 (VW Return).

You can find s in the Summary of Fit table under the heading "Root Mean Square Error." In this example s = 0.056324.

    Summary of Fit
      RSquare                  0.426643
      RSquare Adj              0.424106
      Root Mean Square Error   0.056324
      Mean of Response         0.009779
      Observations             228

    Parameter Estimates
      Term        Estimate    Std Error   t Ratio   Prob>|t|
      Intercept   -0.003307   0.003864    -0.86     0.3931
      VW Return    1.123536   0.086639    12.97     <.0001

    Figure 5.2: Fitting the CAPM model for Sears stock returns.

The CAPM model refers to the estimated equation as the "security market line." The slope of the security market line is known as the stock's "beta" (a reference to the standard regression notation). It provides information about the stock's volatility relative to the market. If β = 2 then a one percentage point change in the market return would correspond to a two percentage point change in the return of the individual security. Thus if β > 1 the stock is more volatile than the market as a whole, and if β < 1 the stock is less volatile than the market. You can find regression-based volatility information in the prospectus for your investments. Figure 5.3 shows the volatility measurements from the prospectus of one of Fidelity's more aggressive mutual funds. It lists β, which we just discussed; R², which will be discussed in Section 5.2.2; and "standard deviation," which is s.
1. All statistics programs organize their estimated regression coefficients in exactly this way.


Figure 5.3: Volatility measures for an aggressive Fidelity fund. Source: www.fidelity.com.

5.2

Three Common Regression Questions

There are three common questions often asked of a regression model.

1. Can I be sure that X and Y are actually related?

2. If there is a relationship, how strong is it? An equivalent way to phrase this question is: What proportion of the variation have I explained?

3. What Y would I predict for a given X value and how sure am I about my prediction?

5.2.1

Is there a relationship?

This question really comes down to asking whether β1 = 0, because if the slope is zero X disappears out of the regression model and we get Y = β0 + ε. The question is, "Is b1 far enough away from 0 to make us confident that β1 ≠ 0?" This should sound familiar because it is exactly the same idea we used to test a population mean in the one sample t-test. (Is x̄ far enough from μ0 that we can be sure μ ≠ μ0?) We perform the same sort of hypothesis test here. Start with H0: β1 = 0 versus Ha: β1 ≠ 0. We calculate b1 and its standard error and then take the ratio

    t = b1 / SE(b1).

As before, the t-statistic counts the number of standard errors that b1 is from zero. If the absolute value of t is large then we will reject the null hypothesis and conclude there must be a relationship between X and Y. As always, we determine whether t is large enough by looking at the p-value it generates. If the p-value is small we will conclude that there is a relationship. We usually hope that the p-value is small, otherwise we might as well just stop because there is no evidence that X and Y are related at all. The small p-value for the slope in Figure 5.2 leaves no doubt that there is a relationship between Sears stock and the stock market.


Don't Get Confused! 5.1 R² vs. the p-value for the slope
R² tells you how strong the relationship is. The p-value for the slope tells you whether R² is sufficiently large for you to believe there is any relationship. Put another way, the p-value answers a yes-or-no question about the relationship between X and Y. The smaller the p-value the more sure you can be that there is at least some relationship between the two variables. R² answers a "how much" question. You shouldn't even look at R² unless the p-value is significant.

The only difference between the t-test for a regression coefficient and the t-test for a mean is that the standard error is computed using a different formula:

    SE(b1) = s / (sx √(n - 1)).

This formula says that b1 is easier to estimate (its SE is small) if

1. s is small. s is the standard deviation of Y around the regression line, so if the points are tightly clustered around the regression line it is easier to guess the slope.

2. sx is large. sx is the standard deviation of the X's. If the X's are highly spread out, it is easier to guess the regression slope.

3. n is large. As with most things, the estimate of the slope improves as more data are available.

You could also use SE(b1) to construct a confidence interval for β1. Just as with means, you can construct a 95% confidence interval for β1 as b1 ± 2SE(b1). How meaningful this confidence interval is depends on whether you can give β1 a meaningful interpretation. In Figure 5.2, β1 is the stock's volatility measure, so a confidence interval for β1 represents a range of plausible values for the stock's true volatility. The standard error of b1 in Figure 5.2 is about .086, so a 95% confidence interval for Sears stock is roughly (1.12 ± .17) = (.95, 1.29). The confidence interval contains 1, which says that Sears stock may be no more volatile than the stock market itself.
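As a quick arithmetic check, the t ratio and the rough confidence interval above can be reproduced from the two numbers printed in Figure 5.2 (a Python sketch):

    b1, se_b1 = 1.123536, 0.086639         # from Figure 5.2
    print(round(b1 / se_b1, 2))            # 12.97, the t Ratio in the output
    print(b1 - 2 * se_b1, b1 + 2 * se_b1)  # roughly (0.95, 1.30); the text rounds to (.95, 1.29)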

5.2.2

How strong is the relationship?

Once we have decided that there is some sort of relationship between X and Y we want to know how strong the relationship is. We measure the strength of the


relationship between X and Y using a quantity called R². To calculate R² we use the formula

    R² = 1 - SSE/SST,

where SST = Σ(yi - ȳ)². Think of SST as the variability you would have if you ignored X and estimated each y with ȳ (the variability of Y about its mean). Think of SSE as the variability left over after fitting the regression (the variability of Y about the regression line). Then R² calculates the proportion of variability that you have explained using the regression. R² is always between zero and one. A number close to one indicates a large proportion of the variability has been explained. A number close to zero indicates the regression did not explain much at all! Note that

    R² ≈ 1 - se²/sy²,

where se is the standard deviation of the residuals and sy is the standard deviation of the Y's.² For example, if the sample standard deviation of the y's (about the average y) is 1, and the standard deviation of the residuals (about the regression line) is 0.5 then R² ≈ .75 because 0.5² is about 75% smaller than 1². You might be wondering why we wouldn't just use the correlation coefficient we learned about in Section 3.2.3 to measure how closely X and Y are related. In fact it can be shown that R² = r², so the two quantities are equivalent. An advantage of R² is that it retains its meaning even if there are several X variables, which will be the case in Chapter 6 and in most real life regression applications. Correlations are limited to describing one pair of variables at a time. Of course, correlations tell you the direction of the relationship in addition to the strength of the relationship. That's why we keep them around instead of always using R².
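The claim R² = r² is easy to verify numerically. The sketch below uses made-up data and only standard Python libraries; it is an illustration, not course material.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)                        # made-up data
    y = 1 + 0.5 * x + rng.normal(scale=0.8, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)          # variability left after the fit
    sst = np.sum((y - y.mean()) ** 2)               # variability around y-bar
    r = np.corrcoef(x, y)[0, 1]
    print(1 - sse / sst, r ** 2)                    # the two numbers agree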

5.2.3

What is my prediction for Y and how good is it?

Once we have an estimate for the line, making a prediction of Y for a given x is easy. We just use ŷ = b0 + b1 x. However, there are two sorts of uncertainty about this prediction.
2. The approximate equality is because se² = SSE/(n - 2), while sy² = SST/(n - 1). If n is large then (n - 1)/(n - 2) ≈ 1.


1. Remember that b0 and b1 are only guesses for β0 and β1 so ŷ = b0 + b1 x is a guess for the population regression line. I.e. ŷ is a guess for the μ associated with this particular x. We use a confidence interval to determine how good a guess it is.

2. Even if we knew the parameters of the regression model perfectly the individual data points wouldn't lie exactly on the line. A prediction interval provides a range of plausible values for an individual future observation with the specified x value. The interval combines our uncertainty about the location of the population regression line with the variation of the points around the population regression line.

Figure 5.4 illustrates the difference between these two types of predictions. It shows the relationship between a firm's monthly sales and monthly advertising expenses. Suppose you are thinking of setting an advertising policy of spending a certain amount each month for the foreseeable future. You are taking a long term view so you probably care more about the long run monthly average value of sales than you do month to month variation. In that case you want to use a confidence interval. However, if you are considering a one month media blitz and you want to forecast what sales will be if you spend $X on advertising, then the prediction interval is what you want.

Confidence Intervals for the Regression Line

Suppose you are interested in predicting the numerical value of the population regression line at a new point x* of your choosing. (Translation: suppose you want to estimate the long run monthly average sales if you spend $2.5 million per month on advertising.) You would compute your confidence interval by first calculating ŷ* = b0 + b1 x*. Then the interval is ŷ* ± 2SE(ŷ*), where

    SE(ŷ*) = s √( 1/n + (x* - x̄)² / ((n - 1)sx²) ).

This is another one of those formulas that isn't as bad as it looks when you break it down piece by piece. First, pretend the second term under the square root sign isn't there. Then SE(ŷ*) is just s/√n like it was when we were estimating a mean. The ugly second term increases SE(ŷ*) when x* (the x value where you want to make your prediction) moves away from x̄. All that says is that it is harder to guess where the regression line goes when you're far away from x̄. "Far" is defined


Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   66151.758   11709.47    5.65      <.0001
  ADVEXP      23.32421    6.349464    3.67      0.0008

  Mean of ADVEXP: 1804.444     SD of ADVEXP: 386.096

  RSquare   0.28412
  RMSE      14503.29
  N         36

Figure 5.4: Regression output describing the relationship between sales and advertising expenses ($1000s), including (a) confidence and (b) prediction intervals. There is a bow effect present in both sets of intervals, but it is more pronounced in (a).

relative to sx, the standard deviation of the X's. Finally, you have (x* - x̄)² and sx² because the formula for variances is based on squared things. The ugly second term is the thing that makes the confidence intervals in Figure 5.4(a) bow away from the regression line far from the average ADVEXP. The formula for SE(ŷ*) is useful because it helps us understand how confidence intervals behave. However, the best way to actually construct a confidence interval is on a computer. Suppose we wanted an interval estimate of long run average monthly sales assuming we spend $2.5 million per month on advertising. The regression equation says that ŷ = 66151 + 23.32(2500) = 124,451, or about $124 million. The 95% confidence interval when x = 2500 is (114230, 134694), or between $114 and $135 million. You can see these numbers in Figure 5.4(a) by going to x = 2500 on the X axis and looking straight up.
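Here is a sketch that plugs the summary numbers from Figure 5.4 into the SE(ŷ*) formula. Using the t quantile with n - 2 = 34 degrees of freedom, rather than the rough multiplier of 2, reproduces the interval quoted above almost exactly; the last two lines preview the prediction interval formula (the extra +1 under the square root) that is derived in the next subsection.

    import numpy as np
    from scipy.stats import t

    b0, b1 = 66151.758, 23.32421       # from Figure 5.4
    s, n = 14503.29, 36
    xbar, sx = 1804.444, 386.096
    xstar = 2500

    yhat = b0 + b1 * xstar                            # about 124,460 ($124 million)
    core = 1 / n + (xstar - xbar) ** 2 / ((n - 1) * sx ** 2)
    tq = t.ppf(0.975, n - 2)
    se_line = s * np.sqrt(core)                       # SE for the regression line
    print(yhat - tq * se_line, yhat + tq * se_line)   # about (114230, 134694)
    se_pred = s * np.sqrt(core + 1)                   # SE for an individual observation
    print(yhat - tq * se_pred, yhat + tq * se_pred)   # about (93263, 155662)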


Prediction Intervals for Predicting Individual Observations

SE(ŷ*) measures how sure you are about the location of the population regression line at a point x* of your choosing. Predicting an individual y value (let's call it y*) for an observation with that x* is more difficult, because you have to combine your uncertainty about the location of the regression line with the variation of the y's around the line. If you knew β0, β1, and σ² then your 95% prediction interval for y* would be (μ ± 2σ). As it is, we recognize that y* = ŷ* + residual, so

    Var(y*) = s²( 1/n + (x* - x̄)²/((n - 1)sx²) ) + s²,

where the first term is Var(ŷ*) and the second term is Var(residual). Consequently, the standard error for a predicted y value is

    SE(y*) = s √( 1/n + (x* - x̄)²/((n - 1)sx²) + 1 ).

An ugly looking formula, to be sure. But notice that if you had all the data in the population (i.e. if n → ∞) then it would just be s. Also notice that under the square root sign you have the same formula you had for SE(ŷ*) plus a 1, which would actually be s² if you distributed the leading factor of s through the radical. The extra 1 under the square root sign means that prediction intervals are always wider than confidence intervals. Everything else under the square root sign means that prediction intervals are always wider than they would be if you knew the equation of the true population regression line. The 95% prediction interval when ADVEXP = 2500 is (93263, 155662), or from $93 to $156 million.

Interpolation, Extrapolation, and the Intercept

Prediction and confidence intervals both get very wide when x* is far from x̄. Using your model to make a prediction at an x that is well beyond the range of the data is known as extrapolation. Predicting within the range of the data is known as interpolation. Wide confidence and prediction intervals are the regression model's way of warning you that extrapolation is dangerous. An additional problem with extrapolation is that there is no guarantee that the relationship will follow the same pattern outside the observed data that it follows inside. For example a straight line may fit well for the observed data but for larger values of X the true relationship may not be linear. If this is the case then an extrapolated prediction will be even worse than the wide prediction intervals indicate.


The notion of extrapolation explains why we rarely interpret the intercept term in a regression model. The intercept is often described as the expected Y when X is zero. Yet we very often encounter data where all the Xs are far from zero, so predicting Y when X = 0 is a serious extrapolation. Thus it is better to think of the intercept term in a regression model as a parameter that adjusts the height of the regression line than to give it a special interpretation.

5.3

Checking Regression Assumptions

It is important to remember that the regression model is only a model, and the inferences you make from it (i.e. prediction intervals, confidence intervals, and hypothesis tests) are subject to criticism if the model fails to fit the data reasonably well. We can write the regression model as Y ~ N(μY, σ) where μY = β0 + β1 X. This simple equation contains four specific assumptions, known as the regression assumptions. In order of importance they are:

1. Linearity (Y depends on X through a linear equation)

2. Constant variance (σ does not depend on X)

3. Independence (The equation for Y does not depend on any other Y's)

4. Normality (Observations are normally distributed about the regression line)

This section is about checking for violations of these assumptions, what happens if each one is violated, and possible means of correcting violations. The main tool for checking the regression assumptions is a residual plot, which plots the residuals vs. either X or Ŷ. It doesn't matter which, because Ŷ is just a linear function of X. For simple regressions we typically put X on the X axis of a residual plot. In Chapter 6 there will be several X variables, so we will put Ŷ (a one number summary of all the X's) on the X axis instead. Why look at residuals? The residuals are what is left over after the regression has done all the explaining that it possibly can. Thus if the model has captured all the patterns in the data, the residual plot will just look like random scatter. If there is a pattern in the residual plot, that means there is a pattern in the data that the model missed, which implies that one of the regression assumptions has been violated. When we detect an assumption violation we want to do something to get the pattern out of the residuals and into the model where we can actually use it to make better predictions. The "something" is usually either changing the scale of


Figure 5.5: Tukey's Bulging Rule. Some suggestions for transformations you can try to correct for nonlinearity. The appropriate transformation depends on how the data are bulging. (The four quadrants of the diagram suggest transformations such as x², y², √x, log x, 1/x, √y, log y, 1/y, and exp(x), depending on the direction of the bulge.)

the problem by transforming one of the variables, or adding another variable to the model.

5.3.1

Nonlinearity

Nonlinearity simply means that a straight line does a poor job of describing the trend in the data. The hallmark of nonlinearity is a bend in the residual plot. That is, look for a pattern of increasing and then decreasing residuals (e.g. a U shape). You can sometimes see strong nonlinear trends in the original scatterplot of Y vs. X, but the bend is often easier to see using the residual plot.

Trick: If you are not sure whether you see a bend, try mentally dividing the residual plot into two or three sections. If most of the residuals in one section are on one side of zero, and most in another are on the other side, then you have a bend. You can also test for nonlinearity by fitting a quadratic (i.e. both linear and squared terms) and testing whether the squared term is significant.

Fixing the problem: There are two fixes for nonlinearity.

1. If there is a U shape in the scatterplot (or an inverted U) then fit a quadratic instead of a straight line.

2. If the trend in the data bends, but does not completely change direction, you should consider transforming either the X and/or Y variables. The most popular transformations are those that raise a variable to a power like X², X^(1/2) (the square root of X), or X^(-1) = 1/X. There is a sense³ in which X⁰ corresponds to log X, which for some reason is usually a good transformation to start with.
3. See Not on the Test 5.2 on page 99.

Not on the test 5.2 Box-Cox transformations
There is a theory for how to transform variables known as Box-Cox transformations. Denote your transformed variable by

    w = (x^λ - 1) / λ.

If λ ≠ 0 then w is "morally" x^λ. If you remember L'Hôpital's Rule from calculus then you can show that as λ → 0, w → loge x.
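A quick numerical check of that limit (a Python sketch, not course material): as λ shrinks toward zero the Box-Cox transform of x approaches log x.

    import numpy as np

    x = 7.0
    for lam in (1.0, 0.1, 0.01, 0.001):
        print(lam, (x ** lam - 1) / lam)   # creeps toward log(7) = 1.9459 as lam -> 0
    print(np.log(x))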

The most common way to carry out the transformation is by creating a new column of numbers in your data table containing the transformed variable. Then you can run an ordinary linear regression using the transformed variable. There is something of an art to choosing a good transformation. There is some trial and error involved, and there could be several choices that fit about equally well. Tukey's bulging rule provides some advice on how to get started. You should think about transformations as stretching or shrinking the axis of the transformed variable. Powers greater than one stretch, and powers less than one shrink. For example, if the trend in the data bends like the upper right quadrant of Figure 5.5 then you can straighten out the trend by stretching either the X or Y axes, i.e. by raising them to a power greater than 1. If the trend looks like the lower left quadrant of Figure 5.5 then you want to contract either the X or Y axis. Given a choice, we typically prefer transformations that contract the data because they can also limit the influence of unusual data points, which we will discuss in Section 5.4.

Nonlinearity Example 1

A chain of liquor stores is test marketing a new product. Some stores in the chain feature the product more prominently than others. Figure 5.6 shows weekly sales figures for several stores plotted against the number of display feet of space the product was given on each store's shelves. There is an obvious bend in the relationship between display feet and sales, which you could characterize as diminishing returns. The bend does not change direction, so we prefer to capture it using a transformation rather than fitting a quadratic. The bend looks like the upper left quadrant of Figure 5.5, so we could straighten it out by either stretching the Y axis or compressing the X axis. Let's start with a transformation from DisplayFeet to LogDisplayFeet. Figure 5.6(c) shows what the scatterplot looks like when we plot Sales vs. LogDisplayFeet, along


with the estimated regression line. The curved line in Figure 5.6(a) shows the same estimated regression line but plotted on the original scale. The log transformation changes the scale of the problem from one where a straight line assumption is not valid to another scale where it is. The log transformation fits the data quite nicely, but there are other transformations that would fit just as well. Figure 5.7 compares computer output for the log and reciprocal (1/X) transformations. Graphically, both transformations seem to fit the data pretty well. Numerically, R² is about the same for both models.⁴ It is a little higher for the reciprocal model, but not enough to get excited about. So which transformation should we prefer? We want the model to be as interpretable as possible. That means we want a transformation that fits the data well, but also has some economic or physical justification. The question is, what do we think the relationship between Sales and DisplayFeet will look like as DisplayFeet grows? If we think sales will continue to rise, albeit at a slower and slower rate, then the log model makes more sense. If we think that sales will soon plateau, and won't increase no matter how many feet of display space the product receives, then the reciprocal model is more appropriate. Statistics can tell you that both models fit about equally well. The choice between them is based on your understanding of the context of the problem.⁵ Let's suppose we prefer the reciprocal model.

Figure 5.6: Relationship between amount of display space allocated to a new product and its weekly sales. (a) Original scale showing linear regression and log transform. (b) Residual plot from linear regression. (c) Linear regression on transformed scale.
4. You can compare R² for these models because Y has not been transformed. Once you transform Y then R² is measured differently for each model. I.e. it doesn't make sense to compare the percent variation explained for log(sales) to the percent variation explained for 1/sales.

5. That's good news for you. Your ability to use judgment in situations like this makes you more valuable than a computer. Human judgment, even more than Arnold Schwarzenegger or Keanu Reeves, will keep computers from taking over the earth.


Log model:
  RSquare  0.815349     RMSE  41.3082
  Term            Estimate    Std Error   t Ratio   Prob>|t|
  Intercept       83.560256   14.41344    5.80      <.0001
  Log(DispFeet)   138.62089   9.833914    14.10     <.0001

Reciprocal model:
  RSquare  0.826487     RMSE  40.04298
  Term              Estimate    Std Error   t Ratio   Prob>|t|
  Intercept         376.69522   9.439455    39.91     <.0001
  1/(DisplayFeet)   -329.7042   22.51988    -14.64    <.0001

Figure 5.7: Comparing the log (heavy line) and reciprocal (lighter line) transformations for the display space data.

Now that we have it, what can we do with it? Suppose that the primary cost of stocking the product is an opportunity cost of $50/foot. That is, the other products in the store collectively generate about $50 in sales per foot of display space. How much of the new product should we stock? If we stock x display feet then our marginal profit will be

    π(x) = β0 + β1/x - 50x,

where β0 + β1/x is the extra sales revenue and 50x is the opportunity cost. If you know some calculus⁶ you can figure out that the optimal value of x is

    x = √(-β1/50).

Our estimate for β1 is b1 = -329, so our best guess at the optimal display amount is √(329/50) ≈ 2.5 feet. How sure are we about this guess? Well, β1 is plausibly between -329 ± 2(22.5) = (-374, -284), so if we plug these numbers into our formula for the optimal x we find it is between 2.38 and 2.73 feet.
6. If not then you can hire one of us to do the calculus for you, for an exorbitant fee!
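The calculus is short: setting the derivative of π(x) = β0 + β1/x - 50x to zero gives x = √(-β1/50). The Python sketch below only re-checks the arithmetic using the estimates from Figure 5.7.

    import numpy as np

    b1, se_b1 = -329.7042, 22.51988          # reciprocal model, Figure 5.7
    print(np.sqrt(-b1 / 50))                 # about 2.57 feet, i.e. roughly 2.5
    lo, hi = b1 - 2 * se_b1, b1 + 2 * se_b1  # plausible range for beta_1
    print(np.sqrt(-hi / 50), np.sqrt(-lo / 50))   # about 2.39 to 2.74 feet
                                                  # (the text rounds to 2.38 and 2.73)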


For budgeting purposes we want to know the long run average weekly sales we can expect if we give the product its optimal space. Our best guess for weekly sales is 376.69522 - 329.7042/2.5 = $244.81, but that could range between $232.62 and $257.00. These numbers come from our confidence interval for the regression line at x = 2.5, which we obtained from the computer. Will it be profitable to stock the item over the long run? Stocking the product at 2.5 feet of display space will cost us $125 per week in opportunity cost, so we estimate that the long run average weekly profit is somewhere between $107.62 and $127. That means we're pretty sure the product will be profitable over the long haul. Finally, the 95% prediction interval for a given week's sales with 2.5 feet of display space goes from $163 to $326. If a store sells less than $163 of the product in a given week they might receive a visit from the district manager to see what the problem is. If they sell more than $326 then they've done really well.


Linear model:
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   258.87651   105.6601    2.45      0.0173
  Age         5.0732059   2.3444      2.16      0.0345
  RSquare  0.073527     RMSE  195.3152

Quadratic model:
  Term             Estimate    Std Error   t Ratio   Prob>|t|
  Intercept        360.62865   95.48353    3.78      0.0004
  Age              4.6138293   2.05667     2.24      0.0287
  (Age-43.787)^2   -0.717539   0.16517     -4.34     <.0001
  RSquare  0.300978     RMSE  171.1107

Figure 5.8: Output from linear and quadratic regression models for insurance premium regressed on age. The residual plot is from the linear model.

up by enough to justify the term's inclusion. Figure 5.9 plots the residuals from the quadratic model. The residuals are more evenly scattered throughout the plot, particularly at the edges, which makes us comfortable that we've captured the bend. The quadratic model differs from the linear model because it says that older members pay more for life insurance up to a point, then their monthly premiums begin to decline. Perhaps the older employees locked into their life insurance premiums long ago. The equation for the quadratic model is

    Cost = 360.63 + 4.6(Age) - 0.72(Age - 43.787)².

The computer centers the quadratic term around the average age in the data set (i.e. x̄ = 43.787) to prevent something called collinearity that we will learn about in Chapter 6. A bit of calculus⁷ shows that if you have a quadratic equation written as a(x - x̄)² + bx + c then the optimal value of x is x̄ - b/(2a). Thus our regression model predicts that 47 year old union employees pay the most for life insurance.
7. Or a HUGE consulting fee.
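Checking the vertex formula against the estimates in Figure 5.8 (a short arithmetic sketch in Python):

    xbar = 43.787                    # centering point used by the computer
    a, b = -0.717539, 4.6138293      # quadratic and linear coefficients from Figure 5.8
    print(round(xbar - b / (2 * a), 1))   # 47.0: premiums peak near age 47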


Figure 5.9: Residuals from the quadratic model fit to the insurance data.

The old geezers standing in the way of a deal probably wouldn't be swayed by a life insurance benefit.

5.3.2

Non-Constant Variance

Non-constant variance⁸ means that points at some values of X are more tightly clustered around the regression line than at other values of X. To check for non-constant variance look for a funnel shape in the residual plot, i.e. residuals close to zero to start with and then further from zero for larger X values. Often as X increases the variance of the errors will also increase.

Trick: Try mentally dividing the plot in half. If the residuals on one half seem tightly clustered, while the residuals on the other half seem more scattered, then you have non-constant variance.

Consequence: There are two consequences of non-constant variance. First, points in low variance regions deserve more weight than points in high variance regions, so the estimated slope and intercept are not as good as they could be. More seriously, all the formulas for prediction intervals, confidence intervals, and hypothesis tests that we like to use for regression depend on s. If the residuals have non-constant variance then s is meaningless because it doesn't make sense to summarize the variance with a single number.

Fixing the Problem: If Y is always positive, try transforming it to log(Y). There is another solution called weighted least squares which you may read about, but we will not discuss.
8. Also known as "heteroscedasticity," although the term is currently out of favor with statisticians because we've learned that 8 syllable words make people not like us. At least we're assuming that's the reason.


Figure 5.10: Cleaning data. (a) Regression line fit to raw data and log y transformation. (b) Regression of log y on log x. (c) Regression of √y on √x.

Example

Goldilocks Housekeeping Service is a contractor specializing in cleaning large office buildings. They want to build a model to forecast the number of rooms that can be cleaned by X cleaning crews. They have collected the data shown in Figure 5.10. The data exhibit obvious non-constant variance (the points on the right are far more scattered than those on the left). The company tries to fix the problem by transforming to log(rooms), but that creates a non-linear effect because there was roughly a straight line trend to begin with. Linearity is restored when the company regresses log(rooms) on log(crews), but then it looks like the non-constant variance goes the other way because the points on the left side of Figure 5.10(b) seem more scattered than those on the right. Goldilocks replaces the log transformations with square roots, which don't contract things as much as logs do. Figure 5.10(c) shows the results, which look just right. When plotted on the data's original scale, the regression line for √rooms vs. √crews looks almost identical to the line for rooms vs. crews. So what has Goldilocks gained by addressing the non-constant variance in the data? Consider Figure 5.11. The left panel shows the prediction intervals from the data modeled on the raw scale. Notice how the prediction intervals are much wider than the spread of the data when the number of crews is small, and narrower than they need to be when the number of crews is large. Figure 5.11(b) shows prediction intervals constructed on the transformed scale where the constant variance assumption is much more reasonable. The prediction intervals in Figure 5.11(b) track the variation in the data much more closely than in panel (a).


Figure 5.11: The effect of non-constant variance on prediction intervals. Panel (a) is fit on the original scale. The model in panel (b) is fit on the transformed scale, then plotted on the original scale.

5.3.3

Dependent Observations

Dependent observations usually means that if one residual is large then the next one is too. To find evidence of dependent observations, look for "tracking" in the residual plot. The best way to describe tracking is when you see a residual far above zero, it takes many small steps to get to a residual below zero, and vice versa. The technical name for this tracking pattern is autocorrelation. Unfortunately, it can sometimes be difficult to distinguish between autocorrelation and non-linearity. The good news is that autocorrelation only occurs with time series data, so if you don't have time series data you don't have to worry about this one.

Trick: The X variable in your plot must represent time in order to see tracking. This becomes even more important when we deal with several X variables in multiple regression.

Consequence: If you see tracking in the residual plot then that means today's residual could be used to predict tomorrow's residual. This means you could be doing a better job than you would by just including the long run trend (i.e. a regression where your X variable is time). The obvious thing to do here is to put today's residual in the model somehow. We will learn how to put both trend and autocorrelation in the model when we learn about multiple regression in Chapter 6. For now, the best way to deal with autocorrelation may be to simply regress yt on the lag variable yt-1. You can think of a lag variable as "yesterday's y." Regressing


Figure 5.12: Cell phone subscriber data (a) on raw scale with log and square root transformations, (b) after y^(1/4) transformation, (c) residuals from panel (b).

a variable on its lag is sometimes called an autoregression.

Example

Figure 5.12 plots the number of cell phone subscribers, industry wide, every six months from December 1984 to December 1995. Nonlinearity is the most serious problem with fitting a linear regression to these data. Figure 5.12(a) shows the results of fitting a linear regression to log y, which bends too severely, and √y, which doesn't bend quite enough. Figure 5.12(b) shows the scatterplot obtained by transforming Subscribers to the 1/4 power. That's not a particularly interpretable transformation, but it definitely fixes our problem with nonlinearity. Figure 5.12(c) shows the residual plot obtained after fitting a linear model to Figure 5.12(b). There is definite tracking in the residual plot, which is evidence of autocorrelation, where tomorrow's residual is correlated with today's residual. If you wanted to forecast the number of cell phone subscribers in June 1996 (the next time period, period 24) using these data then it looks like a regression based on Figure 5.12(b) would do a very good job. After all, it has R² = .997, the highest we've seen so far. However you could do even better if you could also incorporate the pattern in Figure 5.12(c), which says that you should probably increase your forecast for period 24. That's because the residuals for the last few time periods have been positive, so the next residual is likely to be positive as well. Consider Figure 5.13, which shows output from the regression of subscribers to the 1/4 power on a lag variable. Both models have a very high R². The model based on the long run trend over time has R² = .997, while the autoregression has R² = .999. That doesn't seem like a big change in R², but notice that s in the autoregression is only half as large as it is in the long run trend model. Perhaps the best reason to prefer the lag model is that when you plot its residuals vs. time,


Lag Variable                   Trend
  RSquare   0.999063             RSquare   0.996987
  RMSE      0.51842              RMSE      0.972197
  N         22                   N         23

Figure 5.13: Output for the lag model compared to the trend model in Figure 5.12. (a) Subscribers to the 1/4 power vs. its lag. (b) Plot of residuals vs. time. (c) Summary of fit for the lag model. (d) Summary of fit for the trend model.

as in Figure 5.13(b), there is a much weaker pattern than the residual plot in Figure 5.12(c). Table 5.1 gives the predictions for each model, obtained by calculating point estimates and prediction intervals on the y^(1/4) scale, then raising them to the fourth power. The small differences on the fourth root scale translate into predictions that differ by several million subscribers. The autoregression predicts more subscribers than the trend model, which is probably accurate given that the last few residuals in Figure 5.12(c) are above the regression line. The prediction intervals for the autoregression are tighter than the trend model. It looks like the autoregression
Model   Point Estimate   Lower 95% Prediction   Upper 95% Prediction
Lag     39.347           37.023                 41.778
Trend   34.276           30.498                 38.394

Table 5.1: Point and interval forecasts (in millions) for the number of cell phone subscribers in June 1996 based on the trend model and the autoregression.


Figure 5.14: Number of seats in a car as a function of its weight (left panel). Normal
quantile plot of residuals (right panel).

can predict to within about 2 million subscribers, whereas the trend model can predict to within about 4 million. The period 24 point estimate for the trend model is only very slightly larger than the actual observed value for period 23.
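The mechanics of an autoregression are simple: build a column holding yesterday's y and regress today's y on it. The Python sketch below does this for a made-up upward-trending series, since the cell phone data themselves are not reproduced in these notes.

    import numpy as np

    rng = np.random.default_rng(2)
    y = np.cumsum(10 + rng.normal(0, 3, size=23))   # made-up series with an upward trend

    y_today, y_lag = y[1:], y[:-1]                  # drop period 1: it has no lag value
    b1 = (np.sum((y_lag - y_lag.mean()) * (y_today - y_today.mean()))
          / np.sum((y_lag - y_lag.mean()) ** 2))
    b0 = y_today.mean() - b1 * y_lag.mean()

    forecast = b0 + b1 * y[-1]                      # next-period forecast from the last y
    print(b0, b1, forecast)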

5.3.4

Non-normal residuals

Normality of the residuals is the least important of the four regression assumptions. That's because there is a central limit theorem for regression coefficients just like the one for means. Therefore your intervals and tests for b1 and the confidence interval for the regression line are all okay, even if the residuals are not normally distributed. Non-normal residuals are only a problem when you wish to infer something about individual observations (e.g. prediction intervals). If the residuals are non-normal then you can't assume that individual observations will be within 2σ of the population regression line 95% of the time. You check the normality of the residuals using a normal quantile plot, just like you would any other variable. Unfortunately, JMP doesn't make a normal quantile plot of the residuals by default. You have to save them as a new variable and make the plot yourself.

Example

One place where you can expect to find non-normal residuals is when modeling a discrete response. For example, Figure 5.14 shows that the number of seats in a car tends to be larger for heavier cars. The slope of the line is positive, and it has a significant p-value. The p-value is trustworthy, but it would be difficult to use this model to predict the number of seats in a 3000 pound car. You could estimate


Figure 5.15: Outliers (two in panel (a)) are points with big residuals. Leverage points (one in panel (b)) have unusual X values. Both plots show regression lines with and without the unusual points. None of the points influence the fitted regression lines.

the average number of seats per car in a population of 3000 pound cars, but for an individual car the number of seats is certainly not normally distributed, so our technique for constructing a prediction interval wouldn't make sense.

5.4

Outliers, Leverage Points and Influential Points

In addition to violations of formal regression assumptions, the subject of Section 5.3, you should also check to see if your analysis is dominated by one or two unusual points. There are two types of unusual data points that you should be aware of.

1. An outlier is a point with an unusual Y value for its X value. That is, outliers are points with big residuals.

2. A high leverage point is an observation with an X that is far away from the other X's.

Outliers and high leverage points affect your regression model differently. It is possible to have a point that is both a high leverage point and an outlier. Such points are guaranteed to have substantial influence over the regression line. An influential point is an observation whose removal from the data set would substantially change the fitted line. Regardless of whether an outlier or high leverage point influences the fitted line, these observations can make a serious impact on the standard errors we use


in constructing confidence intervals, prediction intervals, and hypothesis tests. We have discussed three types of standard errors in this Chapter: for the slope of the line, for the y value of the regression line at a point x*, and for an individual observation y*.

    SE(b1) = s / (sx √(n - 1))
    SE(ŷ*) = s √( 1/n + (x* - x̄)² / (sx²(n - 1)) )
    SE(y*) = s √( 1/n + (x* - x̄)² / (sx²(n - 1)) + 1 )

Note that all three standard errors depend on s, the residual standard deviation, and sx, the standard deviation of the x's. Outliers inflate s, which increases standard errors and makes us less sure about things. High leverage points inflate sx, which is in the denominator of all three standard error formulas. That means high leverage points make us more sure about our inferences. That extra certainty comes at a cost: we can't check the assumptions of the model, particularly linearity, between the high leverage point and the rest of the data.

5.4.1

Outliers

Figure 5.15(a) shows the relationship between average house price ($thousands) and per-capita monthly income ($hundreds) for 50 ZIP codes obtained from the 1990 Census. The data were collected by an intern who was unable to locate the average house price for two of the ZIP codes. The intern entered 0 for the observations where the house price was unavailable. Figure 5.15(a) shows regression lines fit with and without the outliers generated by the intern recording 0 for the house prices. The outliers have very little effect on the regression line. They decrease the intercept somewhat, but the slope of the line is nearly unchanged. Outliers typically do not affect the fitted line very much unless they are also high leverage points. The two outliers in Figure 5.15(a) are observations with typical income levels, so they are not high leverage points. Figure 5.16 shows the computer output for the two regression lines in Figure 5.15(a). Indeed the estimated regression equations are very similar. However, RMSE is roughly a factor of 4.5 larger if the outliers are included in the data set. The effects of the increased RMSE can be seen in R² (.660 vs. .069) and the standard error for the slope, which is directly proportional to s. Because SE(b1) is increased, the t-statistic and the p-value for the slope are insignificant when the outliers are included, even though the slope is highly significant without the outliers. How big does a residual have to be before we call it an outlier? There is no hard and fast rule, but you should measure the residuals relative to s. Keep in mind that you expect about 5% of the residuals to be more than 2 residual standard deviations from zero, and only 0.3% of the points to be more than 3s from zero.


Full Data Set:
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   139.91994   40.26643    3.47      0.0011
  INCOME      5.3208392   2.821008    1.89      0.0653
  RSquare  0.069002     RMSE  46.51617     N  50

Outliers Excluded:
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   137.53823   9.256721    14.86     <.0001
  INCOME      6.1341704   0.649089    9.45      <.0001
  RSquare  0.660041     RMSE  10.66548     N  48

Figure 5.16: Regression output for the house price data.

The two 0s in Figure 5.15(a) are between 4 and 5 residual standard deviations from zero, so they are clearly outliers. Generally speaking, outliers don't change the fitted line by very much unless they are also high leverage points, but they can affect the certainty of our inferences a great deal.

5.4.2

Leverage Points

There is one month in Figure 5.15 where the value weighted stock market index lost roughly 20% of its value. That is the infamous 1987 crash where the Dow fell over 500 points in a single day. That would be a large one day decline today (when the Dow is at about 9000). It was catastrophic in 1987, when the Dow was trading at just over 2000, and people were marveling that it was that high. The point is a high leverage point. October 1987 may have been disastrous for the stock market, but it is pretty innocuous as a data point. Figure 5.17 shows the computer output for the models fit with and without October 1987. RMSE is virtually unchanged, though R² is a little higher. The slope and intercept of the line barely move when the point is added or deleted. The high leverage point has the expected effect on SE(b1), but the effect is minor. High leverage points get their name from their ability to exert undue leverage on the regression line. Imagine each data point is attached to the line by a spring. The farther a point is from the regression line the more force its spring is exerting. When an observation has an extreme X value, all the other observations collectively act like the fulcrum of a lever centered at x̄. The farther a high leverage point is from x̄, the less work its spring has to do to pull the line towards it, just like it was pulling on a very long lever. October 1987 happens to have a Y value which is right where the regression line predicts it would be, so its spring isn't pulling the

Full Data:
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   -0.003307   0.003864    -0.86     0.3931
  VW Return   1.123536    0.086639    12.97     <.0001
  RSquare  0.426643     RMSE  0.056324     N  228

Leverage Point Excluded:
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   -0.002627   0.003917    -0.67     0.5031
  VW Return   1.0890647   0.092626    11.76     <.0001
  RSquare  0.380581     RMSE  0.056311     N  227

Summary of leverages:   Mean  0.0087719     N  228

Figure 5.17: Computer output for the CAPM model fit with and without the high leverage point.

line very hard. The next Section shows an example of a high leverage point that moves the line a lot. There is an actual number, denoted hi, that can be calculated to determine the leverage of each point.⁹ You don't need to know the formula for hi (but see page 114 if you're interested), but you should know that it depends entirely on xi. The farther xi is from x̄, the more leverage for the point. Each hi is always between 1/n and 1, and the hi's for all the points sum to 2. Some computer programs (but not JMP) warn you if a data point has hi greater than three times the average leverage (i.e. 2/n). Figure 5.17 plots the leverages for the CAPM data set. Sure enough, October 1987 stands out as a high leverage point.

5.4.3

Influential Points

This last example of unusual points shows an instance of a high leverage point that does move the line a lot.
9 The letter h is used for leverages because they are computed from something called the hat matrix which is beyond our scope, even in a Not on the test box.


Not on the test 5.3: Why leverage is Leverage
For simple regression the formula for leverage is

    hi = 1/n + (xi − x̄)² / ((n − 1) sx²).

You may recognize this formula as the thing under the square root sign in the formula for SE(ŷ). The intuition behind leverage is that the fitted regression line is very strongly drawn to high leverage points. That means high leverage points tend to have smaller residuals than typical points, so the residual for a high leverage point ought to have smaller variance than the other points. It turns out that Var(ei) = s²(1 − hi), where ei is the ith residual. If an observation had the maximum leverage of 1 then Var(ei) = 0, because its pull on the fitted line is so strong that the line is forced to go directly through the leverage point. Most observations have hi close to 0, so that Var(ei) ≈ s². For multiple regression the formula for hi becomes sufficiently complicated that we can't write it down without matrix algebra, which is beyond our scope.
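The leverage formula in the box is easy to compute yourself. Here is a minimal Python sketch (not part of these JMP-based notes; the x values are hypothetical). It also checks the facts quoted in the text: each hi lies between 1/n and 1, the leverages sum to 2 in simple regression, and points beyond three times the average leverage can be flagged.

    import numpy as np

    # Hypothetical x values for a simple regression (e.g., monthly market returns).
    x = np.array([0.01, -0.03, 0.02, 0.05, -0.20, 0.00, 0.04, -0.01])
    n = len(x)

    # h_i = 1/n + (x_i - xbar)^2 / ((n - 1) * s_x^2)
    h = 1.0 / n + (x - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))

    print(h.round(3))         # leverage of each observation
    print(h.sum())            # sums to 2 (intercept plus one slope)
    print(h > 3 * h.mean())   # flag points with more than 3x the average leverage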

A construction company that builds beach cottages typically builds them between 500 and 1000 square feet in size. Recently an order was placed for a 3500 square foot cottage, which yielded a healthy profit. The company wants to explore whether they should be building more large cottages. Data from the last 18 cottages built by the company (roughly a year's work) are shown in Figure 5.18.

5.4.4 Strategies for Dealing with Unusual Points

When a point appreciably changes your conclusions you can perform your analyses with and without the point and report both results. Or use transformations to work on a scale where the point is not as influential. For example, if a point has a large x value and we transform x by taking log(X) then log(X) will not be nearly as large.

It is okay to delete unusual points if:
- the point was recorded in error, or
- the point has a big impact on the model, and you only want to use the model to predict typical future observations.

It is not okay to delete unusual points:
- just because they don't fit your model. Fit the model to the data, not vice versa.
- when you want to predict future observations like the unusual point (e.g. large cottages).


Point Included:
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  -416.8594   1437.015      -0.29     0.7755
SqFeet     9.7505469   1.295848       7.52     <.0001

RSquare 0.779667   RMSE 3570.379   N 18

Point Excluded:
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  2245.4005   4237.249       0.53     0.6039
SqFeet     6.1370246   5.556627       1.10     0.2868

RSquare 0.075205   RMSE 3633.591   N 17

Figure 5.18: Computer output for cottages data with and without the high leverage point. The first two panels plot confidence intervals for the regression line fit (a) with and (b) without the high leverage point. Panel (c) plots hi for the regression: max hi = .94680.

5.5 Review

Correlation and covariance measure the strength of the linear relationship between two variables. Regression models the actual relationship. It is hard to test whether a correlation is zero, but easy to test whether a population line has zero slope (which amounts to the same thing). The slope is the most interesting part of the regression.


Chapter 6

Multiple Linear Regression


Multiple regression is the workhorse that handles most statistical analyses in the real world. On one level the multiple regression model is very simple. It is just the simple regression model from Chapter 5 with a few extra terms added to ŷ. However, multiple regression can be more complicated than simple regression because you can't plot the data to eyeball relationships as easily as you could with only one Y and one X. The most common uses of multiple regression fall into three broad categories:

1. Predicting new observations based on specific characteristics.
2. Identifying which of several factors are important determinants of Y.
3. Determining whether the relationship between Y and a specific X persists after controlling for other background variables.

These tasks are obviously related. Often all three are part of the same analysis. One thing you may notice about this Chapter is that there are many fewer formulas. Multiple regression formulas are most often written in terms of matrix algebra, which is beyond the scope of these notes. The good news is that the computer understands all the matrix algebra needed to produce standard errors and prediction intervals.

6.1 The Basic Model

Multiple linear regression is very similar to simple linear regression. The key difference is that instead of trying to make a prediction for the response Y based on only one predictor X we use many predictors, which we call X1, X2, ..., Xp. In most real life situations Y does not depend on just one predictor, so a multiple regression can considerably improve the accuracy of our prediction. Recall the simple linear regression model is

    Yi = β0 + β1 Xi + εi.

Multiple regression is similar except that it incorporates all the X variables,

    Yi = β0 + β1 Xi,1 + ... + βp Xi,p + εi

where Xi,1 indicates the ith observation from variable 1, etc. We have all the same assumptions as for simple linear regression, i.e. εi ~ N(0, σ) and independent. The multiple regression model can also be stated as Yi ~ N(β0 + β1 Xi,1 + ... + βp Xi,p, σ), which highlights how the multiple regression model fits in with Chapters 2, 4, and 5.

The regression coefficients should be interpreted as the expected increase in Y if we increase the corresponding predictor by 1 and hold all other variables constant. To illustrate, examine the regression output from Figure 6.1(a), in which a car's fuel consumption (number of gallons required to go 1000 miles) is regressed on the car's horsepower, weight (in pounds), engine displacement, and number of cylinders. Figure 6.1(a) estimates a car's fuel consumption using the following equation

    Fuel = 11.49 + 0.0089 Weight + 0.080 HP + 0.17 Cylinders + 0.0014 Disp.

You can think of each coefficient as the marginal cost of each variable in terms of fuel consumption. For example, each additional unit of horsepower costs 0.080 extra gallons per 1000 miles. Figure 6.1(b) provides output from a simple regression where Horsepower is the only explanatory variable. In the simple regression it looks like each additional unit of horsepower costs .18 gallons per 1000 miles, over twice as much as in the multiple regression! The two numbers are different because they are measuring different things. The Horsepower coefficient in the multiple regression is asking "how much extra fuel is needed if we add one extra horsepower without changing the car's weight, engine displacement, or number of cylinders?" The simple regression doesn't look at the other variables. If the only thing you know about a car is its horsepower, then it looks like each horsepower costs .18 gallons of gas. Some of that .18 gallons is directly attributable to horsepower, but some of it is attributable to the fact that cars with greater horsepower also tend to be heavier cars. Horsepower acts as a proxy for weight (and possibly other variables) in the simple regression.
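These notes use JMP to fit regressions, but the least squares machinery is easy to reproduce. The following is a minimal Python sketch (the data are simulated, loosely patterned on the car example, not the actual data set behind Figure 6.1):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100

    # Hypothetical predictors: weight in pounds and horsepower correlated with weight.
    weight = rng.uniform(2000, 4500, n)
    hp = 0.05 * weight + rng.normal(0, 20, n)
    fuel = 11.5 + 0.009 * weight + 0.08 * hp + rng.normal(0, 3, n)  # gal per 1000 mi

    # Design matrix with an intercept column; b minimizes the sum of squared errors.
    X = np.column_stack([np.ones(n), weight, hp])
    b, *_ = np.linalg.lstsq(X, fuel, rcond=None)

    print(b)  # [intercept, coefficient on weight, coefficient on hp]
    # Each slope estimates the change in fuel consumption for a one-unit increase
    # in that predictor, holding the other predictor fixed.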

(a) Multiple regression

Summary of Fit:  RSquare 0.852963   RMSE 3.387067   N 111

ANOVA Table
Source   DF    Sum of Sq.   Mean Square   F Ratio
Model      4    7054.3564       1763.59   153.7269
Error    106    1216.0559         11.47   Prob > F
Total    110    8270.4123                   <.0001

Parameter Estimates
Term           Estimate    Std Error   t Ratio   Prob>|t|
Intercept      11.49468    2.099403       5.48     <.0001
Weight(lb)     0.0089106   0.001066       8.36     <.0001
Horsepower     0.0804557   0.013982       5.75     <.0001
Cylinders      0.1745649   0.553151       0.32     0.7529
Displacement   0.0014316   0.013183       0.11     0.9137

(b) Simple regression

Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    25.324465   1.494657      16.94     <.0001
Horsepower   0.1808646   0.011501      15.73     <.0001

RSquare 0.690198   RMSE 4.877156   N 113

Figure 6.1: (a) Multiple regression output for a car's fuel consumption (gal/1000mi) regressed on the car's weight (lbs), horsepower, engine displacement and number of cylinders. (b) Simple regression output including only HP.

6.2 Several Regression Questions

Just as with simple linear regression there are several important questions we are going to be interested in answering.

1. Does the entire collection of Xs explain anything at all? That is, can we be sure that at least one of the predictors is useful?
2. How good a job does the regression do overall?
3. Does a particular X variable explain something that the others don't?
4. Does a specific subset of the Xs do a useful amount of explaining?
5. What Y value would you predict for an observation with a given set of X values, and how accurate is the prediction?


6.2.1 Is there any relationship at all? The ANOVA Table and the Whole Model F Test

This question is answered by a hypothesis test known as the whole model F test. An F statistic compares a regression model with several Xs to a simpler model with fewer Xs (i.e. with the coefficients of some of the Xs set to zero). The simpler model for the whole model F test has all the slopes set to 0. That is, the null hypothesis is

    H0: β1 = β2 = ... = βp = 0

versus the alternative Ha: at least one β is not equal to zero. In practical terms the whole model F test asks whether any of the Xs help explain Y. The null hypothesis is "no." The alternative hypothesis is "at least one X is helpful."

So how does an F statistic compare the two models? Remember that the job of a regression line is to minimize the sum of squared errors (SSE). The F statistic checks whether including the Xs in the regression reduces SSE by enough to justify their inclusion. In the special case of a regression with no Xs in it at all the sum of squared errors is called SST (which stands for "total sum of squares") instead of SSE. That is,

    SST = Σi=1..n (yi − ȳ)².

SST measures Y's variance about the average Y, ignoring X. Thus SST/(n − 1) is the sample variance of Y from Chapter 1. Despite its name, we still think of SST as a sum of squared errors because if all the slopes in a regression are zero then the intercept is just ȳ. By contrast,

    SSE = Σi=1..n (yi − ŷi)²

measures the amount of variation left after we use the Xs to make a prediction for Y. In other words, SSE/(n − p − 1) is the variance of Y around the regression line (aka the variance of the residuals). These are the same quantities that we saw in Chapter 5 except that now we are using several Xs to calculate ŷ. If SST is the variation that we started out with and SSE is the variation that we ended up with, then the difference SST − SSE must represent the amount explained by the regression. We call this quantity the model sum of squares, SSM. It turns out that

    SSM = Σi=1..n (ŷi − ȳ)².

If SSM is large then the regression has explained a lot, so we should say that at least one of the Xs helps explain Y. How large does it need to be?


Clearly we need to standardize SSM somehow. For example, if Y measures heights in feet we can make SSM 144 times as large simply by measuring in inches instead (think about why). To get around this problem we divide SSM by our estimate for σ², i.e.

    s² = MSE = SSE / (n − p − 1)

where MSE stands for "mean square error."1 We also expect SSM to be larger if there are more Xs in the model, even if they have no relationship to Y. To get around this problem we also divide SSM by the number of predictors,

    MSM = SSM / p

where MSM stands for the Mean Square explained by the Model. (Okay, that's a dumb name, but it helps remind you of MSE, to which MSM is compared. Plus, we didn't make it up.) You can think about MSM as the amount of explaining that the model achieves per degree of freedom (aka per X variable in the model). If we combine these two ideas together, i.e. dividing SSM by the MSE and also by p, we get the F statistic

    F = MSM / MSE = (SSM/p) / MSE.

The F statistic is independent of units. When F is large we know that at least one of the X variables helps predict Y. The computer calculates a p-value to help us determine whether F is large enough.2 To compute the p-value the computer needs to know how many degrees of freedom were used in the numerator of F (i.e. for computing MSM) and how many were in the denominator (for computing MSE). The F-statistic in Figure 6.1 has 4 DF in its numerator and 106 in its denominator. All these sums of squares are summarized in something called an ANOVA (Analysis of Variance) table.

Source   df          SS     MS                      F         p
Model    p           SSM    MSM = SSM/p             MSM/MSE   p-value
Error    n − p − 1   SSE    MSE = SSE/(n − p − 1)
Total    n − 1       SST

1 It turns out that we've been using this rule all along. In Chapter 5 we had p = 1, so we divided by n − 2. Chapters 1 and 4 had p = 0, so we divided by n − 1. Now we have p predictors in the model so we divide SSE by n − p − 1.
2 The p-value here is the probability that we would see an F statistic even larger than the one we saw, if H0 were true and we collected another data set.


Don't Get Confused! 6.1 Why call it an ANOVA table? The name ANOVA table misleads some people into thinking that the table says something about variances. Actually, the object of an ANOVA table is to say something about means, or in this case a regression model, which is a model for the mean of each Y given each observation's Xs. It is called an ANOVA table because it uses variances to see whether our model for means is doing a good job.

For example, the ANOVA table in Figure 6.1 tells us that a model with 4 variables has been fit, the number of observations was 111 (111 − 1 = 110 total degrees of freedom), our estimate for σ² is 11.47 (MSE), and MSM is 1763.59. Furthermore, MSM was significantly larger than MSE (the F ratio is 153.7269), and the probability of seeing this if none of the 4 variables had any relationship to Y is less than 0.0001. The small p-value says that at least one of the variables is helping to explain Y.

One way to think about the ANOVA table is as the balance sheet for the regression model. Think of each degree of freedom as money. You spend degrees of freedom by putting additional X variables into the model. The sum of squares column explains what you got for your money. If you don't use any degrees of freedom, then you will have SSE = SST. In the cars example, we had 110 degrees of freedom to spend, and we spent 4 of them to move 7054 of our variability from the unexplained box (SSE) to the explained box (SSM) of the table. We are left with 1216 which remains unexplained. The mean square column tries to decide if we got a good deal for our money. The model mean square (MSM) answers the question, "How much explaining, on average, did you get per degree of freedom that you spent?" If MSM is large then our degrees of freedom were well spent. As usual, the definition of "large" depends on the scale of the problem, so we have to standardize it somehow. It turns out that the right way to do this is to divide by the variance of the residuals (MSE), which gives us the F-ratio.
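The F ratio and its p-value can be reproduced directly from the sums of squares in Figure 6.1's ANOVA table. A minimal Python sketch (not part of the JMP workflow in these notes; it assumes scipy is available):

    from scipy.stats import f

    # Numbers from the ANOVA table in Figure 6.1(a).
    SSM, SSE = 7054.3564, 1216.0559
    p, n = 4, 111

    MSM = SSM / p                # about 1763.59
    MSE = SSE / (n - p - 1)      # about 11.47
    F = MSM / MSE                # about 153.7

    # p-value: chance of an even larger F if H0 (all slopes zero) were true.
    p_value = f.sf(F, p, n - p - 1)
    print(F, p_value)            # the p-value is far below .0001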

6.2.2 How Strong is the Relationship? R²

The ANOVA table and whole model F test try to determine whether any variability has been explained by the regression. R² estimates how much variability has been explained. Explaining variability works in exactly the same way as for simple regression. We still look at

    R² = 1 − SSE/SST = SSM/SST.


If this number is close to 1 then our predictors have explained a large proportion of the variability in Y. If R² is close to zero then the predictors do not help us much. Notice that correlation is not a meaningful way of describing the collective relationship between Y and all the Xs, because correlation only deals with pairs of variables. This is one of the reasons we use R²: it gives an overall measure of the relationship. One fact about R² that should be kept in mind is that even if you add a variable that has no relationship to Y the new R² will be higher! In fact if you add as many predictors as data points you will get R² = 1. This may sound good but in fact it usually means that any future predictions that you make are terrible. We'll show an example in Section 6.7 where we can add enough garbage variables to predict the stock market extremely well (high R²) for data in our data set but do a lousy job predicting future data.
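The claim that R² never goes down when you add a predictor, even a pure noise predictor, is easy to see in a small simulation. A sketch with made-up data (not the stock market example promised for Section 6.7):

    import numpy as np

    def r_squared(X, y):
        """R^2 from regressing y on X (X already contains an intercept column)."""
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    rng = np.random.default_rng(1)
    n = 40
    x = rng.normal(size=n)
    y = 2 + 3 * x + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x])
    print(round(r_squared(X, y), 4))        # R^2 with the one real predictor

    # Add 10 garbage predictors that have nothing to do with y: R^2 only creeps upward.
    for _ in range(10):
        X = np.column_stack([X, rng.normal(size=n)])
        print(round(r_squared(X, y), 4))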

6.2.3 Is an Individual Variable Important? The T Test

One of the first steps in a regression analysis is to perform the whole model F test described in Section 6.2.1. If its p-value is not small then we might as well stop, because we have no evidence that regression is helping. However, if the p-value is small, so that we can conclude that at least one of the predictors is helping, then the question becomes "which ones?" The main tool for determining whether an individual variable is important is the t-test for testing H0: βi = 0. This is the same old t-test from Chapters 4 and 5 that computes how many SEs each coefficient is away from zero. All variables with small p-values are probably useful and should be kept. However, because of issues like collinearity (discussed in Section 6.4) some of the variables with large p-values might be useful as well.3 That is, an apparently insignificant variable might become significant if another insignificant variable is dropped. This leads into model selection and data mining ideas which we discuss in Section 6.7. For now let's just say that the right way to drop insignificant variables from the model is to do it one at a time. That way you can be sure that you don't accidentally throw away something valuable.

When you test an individual coefficient in a multiple regression you're asking whether that coefficient's variable explains something that the other variables don't. For example, in Figure 6.1 the coefficient of Cylinders appears insignificant. Does that mean that the number of cylinders in a car's engine has nothing to do with its fuel consumption? Of course not. We all know that 8 cylinder cars use more gas than 4 cylinder cars.

3 There is another issue, called multiple comparisons, that we will ignore for now but pick up again in Section 6.7.2.


The large p-value for Cylinders is saying that if you already know a car's weight, horsepower, and displacement then you don't need to know Cylinders too. (Actually you don't need Displacement either, but you don't know that until you drop Cylinders first.) The order that you enter the X variables into the regression has no effect on the p-values of the individual coefficients. Each p-value calculation is done conditional on all the other variables in the model, regardless of the order in which they were entered into the computer.

It is desirable to get insignificant variables out of the regression model. If a variable is not statistically significant then you don't have enough information in the data set to accurately estimate its slope. Therefore, if you include such a variable in the model all you're doing is adding noise to your predictions. To illustrate this point we re-ran the regression from Figure 6.1 after setting aside 25 randomly chosen observations (i.e. we excluded them from the regression). We used the fitted regression model to predict fuel consumption for the observations we set aside, and we computed the residuals from these predictions. Then we dropped the insignificant variables Cylinders and Displacement (dropping them one at a time, to make sure Cylinders did not become significant when Displacement was dropped) and made the same calculations. The two regressions produced the following results for the 25 holdout observations:

variables in model    SD(residuals)
all four              3.57
only significant      3.48

The model with more variables does worse than the model based only on significant variables. Even experienced regression users sometimes forget that keeping insignificant variables in the model does nothing but add noise.
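The holdout comparison described above is easy to replicate in outline. A sketch using simulated data (the analysis in the text used the car data and JMP; the exact numbers here will depend on the random seed):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 110
    X_useful = rng.normal(size=(n, 2))               # stand-ins for Weight and HP
    X_noise = rng.normal(size=(n, 2))                # stand-ins for the insignificant variables
    y = 1 + X_useful @ np.array([2.0, 1.0]) + rng.normal(size=n)

    holdout = rng.choice(n, size=25, replace=False)  # set aside 25 observations
    train = np.setdiff1d(np.arange(n), holdout)

    def holdout_sd(X):
        """Fit on the training rows, return SD of residuals on the holdout rows."""
        Xi = np.column_stack([np.ones(n), X])
        b, *_ = np.linalg.lstsq(Xi[train], y[train], rcond=None)
        return np.std(y[holdout] - Xi[holdout] @ b, ddof=1)

    print(holdout_sd(np.column_stack([X_useful, X_noise])))  # all four predictors
    print(holdout_sd(X_useful))                               # only the useful ones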

6.2.4 Is a Subset of Variables Important? The Partial F Test

Sometimes you want to examine whether a group of Xs, taken together, does a useful amount of explaining. The test for doing this is called the partial F test. The partial F test works almost exactly the same way as the whole model F test. The difference is that the whole model F test compares a big regression model (with many Xs) to the mean of Y. The partial F test compares a big regression model to a smaller regression model (with fewer Xs). To calculate the partial F statistic you will need the ANOVA tables from the full (big) model and the null (small) model. The formula for the partial F statistic is

    F = (ΔSSE / ΔDF) / MSE_full

where ΔSSE and ΔDF are the differences between the two models' error sums of squares and error degrees of freedom, and MSE_full is the mean square error from the full model.

(a) Full Model
Source   DF    Sum of Squares   Mean Square   F Ratio
Model      3   0.46388379       0.154628      38.4811
Error    115   0.46210223       0.004018      Prob > F
Total    118   0.92598602                     < .0001

(b) Null Model
Source   DF    Sum of Squares   Mean Square   F Ratio
Model      1   0.44897622       0.448976      110.1240
Error    117   0.47700979       0.004077      Prob > F
Total    118   0.92598602                     < .0001

Table 6.1: ANOVA tables for the partial F test. Singly boxed items are used in the numerator of the test. Doubly boxed items are used in the denominator.

Note the similarity between the partial F statistic and the whole model F statistic. In the whole model F we called ΔSSE = SSM and ΔDF = p. Also notice that the whole model F test and the partial F test use the MSE from the big regression model, not the small one. For example, Table 6.1(a) contains the ANOVA table for the regression of Sears stock returns on the returns from IBM stock, the VW stock index, and the S&P 500. Table 6.1(b) is the ANOVA table for the regression with just the VW stock index as a predictor; IBM and S&P500 have been dropped. The question is whether both IBM and the S&P500 can be safely dropped from the regression. The partial F statistic is

    F = [(0.47700979 − 0.46210223)/(3 − 1)] / 0.004018 = 1.855097.

To find the p-value for this partial F statistic you need to know how many degrees of freedom were used to calculate the numerator and the denominator. The numerator DF is simply the difference in the number of parameters for the two models. Here the numerator DF is 3 − 1 = 2. The denominator DF is the number of degrees of freedom used to calculate MSE. In our example the denominator DF is 115. By plugging the F statistic and its relevant degrees of freedom into a computer which knows about the F distribution, we find that the p-value for our F statistic is p = 0.1610887. This is pretty large as p-values go. In particular it is larger than .05, so we can't reject the null hypothesis that the variables being tested have zero coefficients. The big p-value says it is safe to drop both IBM and S&P500 from the model.

The partial F test is the most general test we have for regression coefficients. It can test any subset of coefficients you like. If the subset is just a single coefficient then the partial F test is equivalent to the t test from Section 6.2.3.4 If the subset is everything then the partial F test is the same as the whole model F test. The most common use of the partial F test is when one of your X variables is a categorical variable with several levels, which we will see in Section 6.5. The partial F test is particularly relevant for categorical Xs because each categorical variable gets split into a collection of simpler dummy variables which we will either exclude or enter into the model as a group.
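The partial F calculation for the Sears example can be checked with a few lines of code, using the numbers from Table 6.1. A minimal sketch, assuming scipy is available:

    from scipy.stats import f

    # From Table 6.1: error SS and error DF for the null (small) and full (big)
    # models, plus the MSE of the full model.
    sse_null, df_null = 0.47700979, 117
    sse_full, df_full = 0.46210223, 115
    mse_full = 0.004018

    F = ((sse_null - sse_full) / (df_null - df_full)) / mse_full  # about 1.855
    p_value = f.sf(F, df_null - df_full, df_full)                  # about 0.161
    print(F, p_value)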

6.2.5 Predictions

Finally, to make a prediction we just use

    Ŷ = b0 + b1 X1 + ... + bp Xp

where b0, ..., bp are the estimated coefficients. As with simple linear regression, we know that this guess at Y is not exactly correct, so we want an interval that lets us know how sure we can be about the estimate. If we are trying to guess the average value of Y (i.e. just trying to guess the population regression line) then we calculate a confidence interval. If we want to predict a single point (which has extra variance because even if we know the true regression line the point will not be exactly on it) we use a prediction interval. The prediction interval is always wider than the confidence interval.

The standard error formulas for obtaining prediction and confidence intervals in multiple regression are sufficiently complicated that we won't show them. The intervals are easy enough to obtain using a computer (see page 181 for JMP instructions). The regression output omits information that you would need to create the intervals without a computer's help.5 Here are prediction and confidence intervals for the car example in Figure 6.1. We dropped the insignificant variables from the model, so now the only factors are Weight and HP. Suppose we want to estimate the fuel consumption for a 4000 pound car with 200 HP.

Interval     Gal/1000mi        mi/gal
Confidence   (63.45, 66.62)    (15.01, 15.76)
Prediction   (57.92, 72.15)    (13.86, 17.27)

4 In this special case you get F = t², where t is the t-statistic for the one variable you're testing.
5 The estimated regression coefficients are correlated with one another. To compute the intervals yourself you would need the covariance matrix describing their relationship. Then you could use the general variance formula on page 54.


The confidence and prediction intervals were constructed on the scale of the Y variable in the multiple regression model (chosen to avoid violating regression assumptions). Then we transformed them back to the familiar MPG units in which we are used to measuring fuel economy by simply transforming the endpoints of each interval. It looks like our car is going to be a gas guzzler.
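Transforming the interval endpoints back to miles per gallon is just arithmetic: a car that needs G gallons to go 1000 miles gets 1000/G miles per gallon. A quick Python check of the table above:

    # Interval endpoints on the gal/1000mi scale (from the table above).
    conf = (63.45, 66.62)
    pred = (57.92, 72.15)

    to_mpg = lambda g: 1000.0 / g
    # Note the endpoints swap order: more gallons per mile means fewer miles per gallon.
    print(sorted(map(to_mpg, conf)))   # about [15.01, 15.76]
    print(sorted(map(to_mpg, pred)))   # about [13.86, 17.27]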

6.3 Regression Diagnostics: Detecting Problems

Just as in simple regression, you should check your multiple regression model for assumption violations and for the possible impact of outliers and high leverage points. Multiple regression uses the same assumptions as simple regression.

1. Linearity.
2. Constant variance of error terms.
3. Independence of error terms.
4. Normality of error terms.

Multiple regression uses the same types of tools as simple regression to check for assumption violations, i.e. look at the residuals. As with simple regression the main tool to solve any violations is to transform the data. Outliers also have the same definition in multiple regression as in simple regression. An outlier is a point with a big residual, so the best place to look for outliers is in a residual plot. The definition of a high leverage point changes a bit. In multiple regression a high leverage point is an observation with an unusual combination of X values. For example, it is possible for a car to have rather high (but not extreme) horsepower and rather low (but not extreme) weight. Neither variable is extreme by itself, but the car can still be a high leverage point because it is an unusual HP-weight combination.

There are two different families of tools for detecting assumption violation and unusual point issues in multiple regression models. You can either use tools that work with the entire model at once, or you can use tools that look at one variable at a time. Section 6.3.1 describes the best one variable at a time tool, known as a leverage plot, or added variable plot. Section 6.3.2 describes a variety of whole model tools.

6.3.1 Leverage Plots

Leverage plots show you the relationship between Y and X after the other Xs in the model have been accounted for. See the call-out box on page 129 for a more technical definition. Figure 6.2 compares the leverage plot showing the relationship between fuel consumption and the number of cylinders in a car's engine, and the corresponding scatterplot. Points on the right of the leverage plot are cars with more cylinders than we would have expected given the other variables in the model. Points near the top of the leverage plot use more gas than we would have expected given the other variables in the model (weight, HP, displacement). The leverage plot and scatterplot reinforce what we said about Cylinders in Section 6.2.3. In a simple regression there is a relatively strong relationship between fuel consumption and Cylinders, but the relationship is better described by other variables in the multiple regression.

Figure 6.2: (a) Leverage plot and (b) scatterplot showing the relationship between a car's fuel consumption and number of cylinders. The leverage plot is from a model that also includes Horsepower, Weight, and Displacement.

When you look at a leverage plot, look at the points on the plot, not the lines. The lines that JMP produces are there to help you visualize the hypothesis test for whether each β = 0. You can get the same information from the table of parameter estimates, so the lines are not particularly helpful. However, the true value of a plot is that the points in the plot let you see more than you could in a simple numerical summary. Leverage plots are great places to look for bends and funnels which may indicate regression assumption violations. They are also great places to look for high leverage points.6 Leverage plots can be read just like scatterplots, but they are better
than scatterplots because they control for the impact of the other variables in the regression. Figure 6.2(a) shows no evidence of any assumption violations or unusual points. It simply shows that Cylinders is an unimportant variable in our model.

6 The word "leverage" means different things in "leverage plot" and "high leverage point."

Not on the test 6.1: How to build a leverage plot
Here is how you would build a leverage plot if the computer didn't do it for you. For concreteness imagine we are creating the leverage plot for Cylinders shown in Figure 6.2.
1. Regress Y on all the X variables in the model except for the X in the current leverage plot (e.g. everything but cylinders). The residuals from this regression are the information about Y that is not explained by the other Xs in the model.
2. Now regress the current X on all the other Xs in the model (e.g. cylinders on weight, HP, and displacement). The residuals from this regression are the information in the current X that couldn't be predicted by the other Xs already in the model.
By plotting the first set of residuals against the second, the leverage plot shows just the portion of the relationship between Y and X that isn't already explained by the other Xs in the model (a small code sketch of this recipe appears after the box). A minor detail: because a leverage plot is plotting one set of residuals against another, you might expect both axes to have a mean of zero. When JMP creates leverage plots it adds the mean of X and Y back into the axes, so that the plot is created on the scale of the unadjusted variables. That's a nice touch, but if I had to create a leverage plot myself I probably wouldn't bother adding the means back in.
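The two-step recipe in the box translates directly into code. A minimal Python sketch with simulated data (JMP of course does all of this for you; the variable names are hypothetical stand-ins for the car data):

    import numpy as np

    def residuals(X, y):
        """Residuals from regressing y on X (with an intercept column added)."""
        Xi = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(Xi, y, rcond=None)
        return y - Xi @ b

    rng = np.random.default_rng(3)
    n = 100
    weight = rng.normal(3000, 500, n)
    cylinders = np.round(4 + weight / 1500 + rng.normal(0, 1, n))
    fuel = 10 + 0.009 * weight + rng.normal(0, 2, n)   # cylinders adds nothing extra

    # Step 1: the part of fuel not explained by the other Xs (here, just weight).
    fuel_resid = residuals(weight.reshape(-1, 1), fuel)
    # Step 2: the part of cylinders not explained by the other Xs.
    cyl_resid = residuals(weight.reshape(-1, 1), cylinders)

    # Plotting cyl_resid (horizontal) against fuel_resid (vertical) gives the leverage
    # plot for Cylinders; the slope of a line fit to these points equals the Cylinders
    # coefficient in the full multiple regression.
    print(np.polyfit(cyl_resid, fuel_resid, 1)[0])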

6.3.2 Whole Model Diagnostics

The only real drawback to leverage plots is that if you have many variables in your regression model then there will be many leverage plots for you to examine. This Section describes some whole model tools that you can use instead of or in conjunction with leverage plots. Some people prefer to look at these tools first to suggest things they should look for in leverage plots. Leverage plots and whole model diagnostics complement one another. Which ones you look at first is largely a matter of personal preference.


Figure 6.3: (a) Residual plot and (b) whole model leverage plot for the regression of fuel consumption on Weight, Horsepower, Cylinders, and Displacement.

The Residual Plot: Residuals vs. Fitted Values

When someone says "the residual plot" in multiple regression they are talking about a plot of the residuals versus Ŷ, the fitted Y values. Why Ŷ? In Chapter 5 the residual plot was a plot of residuals versus X. In multiple regression there are several Xs, and Ŷ is a convenient one-number summary of them. You use the residual plot in multiple regression in exactly the same way as in simple regression. A bend is a sign of nonlinearity and a funnel shape is a sign of non-constant variance. If you see evidence of nonlinearity then look at the leverage plots for each variable to see which one is responsible. If the bend is present in several leverage plots then try transforming Y. Otherwise try transforming the X variable responsible for the bend, or add a quadratic term in that variable (see page 182). The residual plot for the regression in Figure 6.1 is shown in Figure 6.3(a). The residuals look fine.

The (other) Residual Plot: Residuals vs. Time

If your regression uses time series data you should also plot the residuals vs. time. Examine the plot for tracking (aka autocorrelation) just like in Chapter 5.

Normal Quantile Plot of Residuals

The last residual plot worth looking at in a multiple regression is a normal quantile plot of the residuals. Remember that the normality of the residuals is ONLY important if you want to use the regression to predict individual Y values (in which case it is very important). Otherwise a central limit theorem effect shows up to save the day. You interpret a normal quantile plot of the residuals in exactly the same way you interpret a normal quantile plot of any other variable: if the points are close to a straight line on the plot then the residuals are close to normally distributed.

An Unhelpful Plot: The Whole Model Leverage Plot

This plot (see Figure 6.3(b)) isn't really useful for diagnosing problems with the model, which is too bad because JMP puts it front and center in the multiple regression output. The whole model leverage plot is a plot of the actual Y value vs. the predicted Ŷ. You can think of it as a picture of R², because if R² = 1 then all the points would lie on a straight line. The confidence bands that JMP puts around the line are a picture of the whole model F test. You can get the same information from the whole model leverage plot and the residual plot, but the whole model leverage plot makes regression assumption violations harder to detect because the baseline is a 45 degree line instead of a horizontal line. Our advice is to ignore this plot.

Leverage (for finding high leverage points)

High leverage points were easy to spot in simple linear regression. You just look for a point with an unusual X value. However, when there are many predictors it is possible to have a point that is not unusual for any particular X but is still in an unusual spot overall. Consider Figure 6.4(a), which plots Weight vs. HP for the car data set. The general trend is that heavier cars have more horsepower; however, notice the two specially marked cars, the Ford Mustang and the Chevy Corvette (marked with a +). These cars have a lot of horsepower, but not that much more than several other cars. However, they are lighter than other cars with similarly large horsepower. When we compute the leverage of each point in the regression model7 the Mustang and Corvette show up as the points with the most leverage because they have the most unusual combination of weight and HP. Notice that the same two cars show up on the right edge of the HP leverage plot, because they have more HP than one would expect for cars of their same weight.
7 See page 181 for JMP tips.


Figure 6.4: (a) Scatterplot of Weight vs. HP, (b) histogram of leverages, and (c) leverage
plot for HP for the car data set. The maximum leverage in panel (b) is .138. The average is .027, out of 111 observations.

Recall that we denote the leverage of the ith data point by hi. A larger hi means a point with more leverage. As with simple regression we have 1/n ≤ hi ≤ 1. In multiple regression the hi for all n points sum to p + 1. Thus if we added all the numbers in Figure 6.4(b) we would get 3 (two X variables plus the intercept). Some computer programs (though not JMP) warn you about points where hi is greater than 3 times the average leverage (i.e. hi > 3(p + 1)/n). Rather than use such a hard rule, we suggest plotting a histogram of leverages like Figure 6.4(b) and taking a closer look at points on the right that appear extreme.

Cook's Distance: measuring a point's influence

Each point in the data set has an associated Cook's distance, which measures how far (in a funky statistical sense) the regression line would move if the point were deleted. Luckily, the computer does not actually have to delete each point and fit a new regression to calculate the Cook's distance. It can use a formula instead. The formula is complicated,8 but it only depends on two things: the point's leverage and the size of its residual. High leverage points with big residuals have large Cook's distances. A large Cook's distance means the point is influential for the fitted regression equation. There is no formal hypothesis test to say when a Cook's distance is too large. Some guidance may be obtained from the appropriate F table, which will depend on the number of variables in the model and the sample size. A table of suggested cutoff values is available in Appendix D.3. Don't take the cutoff values in Appendix D.3 too seriously. They are only there to give you guidance about what a large Cook's distance looks like.

8 If you must know the formula, it is di = hi ei² / (p s² (1 − hi)²). Definitely not on the test.


Figure 6.5: Cook's distances for the car data. The largest Cook's distance is well below the lowest threshold in Appendix D.3.

In practice, Cook's distances are used in much the same way as leverages (i.e. you plot a histogram and look for extreme values). If you identify a point with a large Cook's distance you should try to determine what makes it unusual. Attempt to find the point in the leverage plots for the regression so that you can determine the impact it is having on your analysis. If you think the point was recorded in error or represents a phenomenon you do not wish to model then you may consider deleting it. If the point legitimately belongs in the data set then you might consider transforming the scale of the model so that the point is no longer influential. Figure 6.5 shows the Cook's distances from the car data set. None of the Cook's distances look particularly large when compared to the table in Appendix D.3 with 3 model parameters (two slopes and the intercept) and about 100 observations. Figure 6.5 highlights the Mustang and Corvette that attracted our attention as high leverage points in Figure 6.4. High leverage points have the opportunity to influence the fitted regression a great deal. However, Figure 6.5 says that these two points are not influential.
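Given the leverages and residuals, the footnote's formula for Cook's distance is a one-liner. A minimal Python sketch (the residuals and leverages below are hypothetical, not the car data):

    import numpy as np

    def cooks_distance(e, h, p, s):
        """Cook's distance from residuals e, leverages h, parameter count p, and RMSE s."""
        return h * e**2 / (p * s**2 * (1 - h)**2)

    # Hypothetical residuals and leverages for a model with p = 3 parameters
    # (two slopes plus the intercept) and RMSE s = 2.0.
    e = np.array([1.2, -0.5, 3.1, -2.2, 0.3])
    h = np.array([0.05, 0.03, 0.40, 0.10, 0.02])
    d = cooks_distance(e, h, p=3, s=2.0)

    print(d.round(3))
    # A point needs both a sizable residual and high leverage to get a big Cook's
    # distance; plot a histogram of d and look for values that stand out.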

6.4 Collinearity

Another issue that comes up in multiple regression is collinearity,9 where two or more predictor variables are closely related to one another. The classic example is regressing Y on someone's height in feet, and on their height in inches. Even if height is an important predictor, the model can't be sure how much of the credit for explaining Y belongs to height in feet and how much to height in inches, so the standard errors for both coefficients become inflated. If all the Xs in your model are highly collinear, you could even have a very high R² and significant F, but none of the individual variables shows up as significant.

Physically, you can understand collinearity by imagining that you are trying to rest a sheet of plywood on a number of cinder-blocks (the plywood represents the regression surface and the blocks the individual data points). If the blocks are fairly evenly distributed under the plywood it will be very stable, i.e. you can stand on it anywhere and it will not move. However, if you place all the blocks in a straight line along one of the diagonals of the plywood and stand on one of the opposite corners, the plywood will move (and you will fall; don't try this at home). The second situation is an example of collinearity where two X variables are very closely related, so they lie close to a straight line and you are trying to rest the regression plane on these points.

9 Some people call it multicollinearity, which we don't like because it is an eight syllable word. See the footnote on page 104.

(a)
Term       Estimate    Std Error   t Ratio   Prob>|t|   VIF
Intercept  0.0134394   0.007179       1.87     0.0638   .
VW         2.5729753   1.028958       2.50     0.0138   76.618188
SP500      -1.56224    1.066683      -1.46     0.1458   78.399527
IBM        0.2070931   0.132193       1.57     0.1200   1.7847938
PACGE      0.0798655   0.130059       0.61     0.5404   1.1560317

(b)
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  0.0195894   0.006075       3.22     0.0016
VW         1.2392068   0.118087      10.49     <.0001

Figure 6.6: Regression output for Walmart stock regressed on (a) several stocks and stock indices, (b) only the value weighted stock index.

6.4.1 Detecting Collinearity

The best way to detect collinearity is to look at the variance inflation factor (VIF).10 We can calculate the VIF for each variable using the formula

    VIF(bi) = 1 / (1 − R²(Xi | X−i))

where R²(Xi | X−i) is the R² we would get if we regressed Xi on all the other Xs. If R²(Xi | X−i) is close to one, then the other Xs can accurately predict Xi. In that case, the computer can't tell if Y is being explained by Xi or by some combination of the other Xs that is virtually the same thing as Xi.

The VIF is the factor by which the variance of a coefficient is increased relative to a regression where there was no collinearity. A VIF of 4 means that the standard error for that coefficient is twice as large as it would be if there were no collinearity. If the VIF is 9 then the standard error is 3 times larger than it would have otherwise been. As a rough rule, if the VIF is around 4 you should pay attention (i.e. do something about it if you can, and it is not too much trouble). If it is around 100 you should start to worry (i.e. don't even think about this as your final regression model). In Figure 6.6(a) the standard error of VW is about 1, with a VIF of around 76. In Figure 6.6(b), where there is no collinearity because there is only one X, the standard error of VW is about .118. Thus, the standard error of VW in the multiple regression is about √76 ≈ 8.7 times as large as it is in the simple regression (with no collinearity). If a variable can be perfectly predicted by the other Xs, like height in feet and height in inches, then R²(Xi | X−i) = 1, so VIF(bi) = ∞ and you will get an error message when the computer divides by zero calculating SE(bi).

Because collinearity is a strong relationship among the X variables, you might think that the correlation matrix would be a good place to look for collinearity problems. It isn't bad, but VIFs are better because it is possible for collinearity to exist between three or more variables where no individual pair of variables is highly correlated. The correlation matrix only shows relationships between one pair of variables at a time, so it can fail to alert you to the problem.

10 See page 181 for JMP tips on calculating VIFs.
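The VIF definition can be computed directly by regressing each X on the other Xs. A minimal Python sketch with made-up data containing two nearly redundant predictors (not the stock data from Figure 6.6):

    import numpy as np

    def r2(X, y):
        """R^2 from regressing y on X (with an intercept added)."""
        Xi = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(Xi, y, rcond=None)
        resid = y - Xi @ b
        return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(4)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly the same variable as x1
    x3 = rng.normal(size=n)                   # unrelated predictor
    X = np.column_stack([x1, x2, x3])

    # VIF(b_i) = 1 / (1 - R^2 from regressing X_i on the other Xs)
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        vif = 1.0 / (1.0 - r2(others, X[:, i]))
        print(f"VIF for x{i + 1}: {vif:.1f}")   # large for x1 and x2, near 1 for x3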

6.4.2 Ways of Removing Collinearity

Collinearity happens when you have redundant predictors. Some common ways of removing collinearity are:

- Drop one of the redundant predictors. If two or more Xs provide the same information, why bother keeping all of them? In most cases it won't matter which of the collinear variables you drop: that's what redundant means! In Figure 6.6 just pick one of the two stock market indices. They provide virtually identical information, so it doesn't matter which one you pick.

- Combine them into a single predictor in some interpretable way. For example, if you are trying to use CEO compensation to predict performance of a company you may have both base salary and other benefits in the model. These two variables may be highly correlated, so they could be combined into a single variable "total compensation" by adding them together.

- Transform the collinear variables. If one of your variables can be interpreted as size then you can use it in a ratio to put other variables on equal footing. For example, a chain of supermarkets might have a data set with each store's total sales and number of customers. You would expect these to be collinear because more customers generally translates into larger sales. Consider replacing total sales with sales/customer (the average sale amount for each customer). In the car example HP and Weight are correlated with r = .8183. Using Weight and HP/pound reduces the correlation to .2546. Taking the difference between two collinear variables can also help, but it isn't very interpretable unless the two variables have the same units. For example, VW − S&P500 makes sense as the amount by which VW outperformed the S&P 500 in a given month, but TotalSales − Customers wouldn't make much sense. A third example of a transformation that reduces collinearity is centering polynomials. Recall the quadratic regression of the size of an employee's life insurance premium on the employee's age from Chapter 5 (page 103). The same output is reproduced in Figure 6.7 with and without centering. The centered version produces exactly the same fitted equation (notice that RMSE is exactly the same), but with much less collinearity (a short numerical demonstration of centering follows Figure 6.7).

(a) Without centering
Term       Estimate    Std Error   t Ratio   Prob>|t|   VIF
Intercept  -1015.103   307.5207      -3.30     0.0017   .
Age        67.451425   14.50506       4.65     <.0001   49.872
Age*Age    -0.717539   0.165171      -4.34     <.0001   49.872

RSq 0.300978   RMSE 171.1107   N 61

(b) With the squared term centered
Term            Estimate    Std Error   t Ratio   Prob>|t|   VIF
Intercept       360.62865   95.48353       3.78     0.0004   .
Age             4.6138293   2.056672       2.24     0.0287   1.0027
(Age-43.78)^2   -0.717539   0.165171      -4.34     <.0001   1.0027

RSq 0.300978   RMSE 171.1107   N 61

Figure 6.7: Output for a quadratic regression (a) without centering (b) with the squared term centered.
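Centering is easy to demonstrate numerically: the correlation between Age and Age² is typically very high, while the correlation between Age and (Age − mean)² is near zero. A quick Python sketch with made-up ages (not the insurance data behind Figure 6.7):

    import numpy as np

    rng = np.random.default_rng(5)
    age = rng.uniform(25, 65, 200)

    print(np.corrcoef(age, age ** 2)[0, 1])                    # close to 1: severe collinearity
    print(np.corrcoef(age, (age - age.mean()) ** 2)[0, 1])     # near 0 after centering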

6.4.3 General Collinearity Advice

Collinearity is undesirable, but not disastrous. It is a less important problem than regression assumption violations or extreme outliers. You will rarely want to omit a significant variable simply because it has a high VIF. Such a variable is significant despite the high VIF. If you can remove the collinearity using transformations then the significant variable will be even more significant.

Finally, keep in mind that it is easier to accidentally engage in extrapolation when using a model fit with collinear Xs. In the context of multiple regression, extrapolation means predicting Y using an unusual combination of X variables. For example, consider the prediction intervals for the monthly return on Walmart stock based on a model with both VW and the S&P500 shown in Table 6.2. The RMSE from the regression is about .06, so if the regression equation were well estimated we would expect the margin of error for a prediction interval to be about .12. If we predict Walmart's return with VW = S&P = 0.1, a typical combination of values, then we get a prediction interval not much wider than that. The same is true when we predict with VW = S&P = 0.0. However, when we try to predict Walmart's return when VW = 0.1 and S&P = 0.0 the prediction interval is almost twice as wide. Even though .1 and 0 are typical values for both variables, the (0.1, 0.0) combination is unusual.

VW    S&P   Lower PI   Upper PI
0.0   0.0   -0.1119    0.1421    (Interpolation)
0.1   0.1    0.0061    0.2642    (Interpolation)
0.1   0.0    0.0296    0.4923    (Extrapolation)

Table 6.2: Prediction intervals for the return on Walmart stock using a regression on VW and the S&P 500.

6.5 Regression When X is Categorical

The two previous Sections dealt with problems you can encounter with multiple regression. Now we return to ways of applying the multiple regression model to real problems. We often encounter data sets containing categorical variables. A categorical variable in a regression is often called a factor. The possible values of a factor are called its levels. For example, Sex is a two-level factor with Male and Female as its levels. Section 6.5.1 describes how to include two-level factors in a regression model using dummy variables. Section 6.5.2 explains how to extend the dummy variable idea to multi-level factors.

6.5.1 Dummy Variables

A factor with only two levels can be incorporated into a regression model by creating a dummy variable, which is simply a way of creating numerical values out of categorical data. For example, suppose we wanted to compare the salaries of male and female managers. Then we would create dummy variables like

    Sex[Female] = 1 if subject i is Female, 0 otherwise
    Sex[Male]   = 1 if subject i is Male, 0 otherwise.

A regression model using both these variables would look like

    Yi = β0 + β1 Sex[Male] + β2 Sex[Female]
       = β0 + β1   if subject i is Male
       = β0 + β2   if subject i is Female.

Estimating this model is problematic because the two dummy variables are perfectly collinear.11 In order to fit the model we must constrain the dummy variables' coefficients somehow. The two most popular methods are setting one of the coefficients to zero (i.e. removing one of the dummy variables from the model) or forcing the coefficients to sum to zero. When you include a categorical variable in a regression, such as Sex in Figure 6.8, JMP automatically creates Sex[Female] and Sex[Male] and includes both variables using the sum to zero convention. To use the leave one out convention you must create your own dummy variable, such as SexCodes in Figure 6.8, and use it in the regression instead.12

Figure 6.8: JMP will make dummy variables for you. You don't have to make them by hand as shown here with SexCodes.

11 Sex[Female] = 1 − Sex[Male].
12 We doubt you will want to do this. Creating your own dummy variables is a lot of work when you can let the computer do it for you.

(a) Sum-to-zero constraint
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     142.28851   0.883877     160.98     <.0001
Sex[female]   -1.821839   0.883877      -2.06     0.0405
"Sex[male]"   1.821839    0.883877       2.06     0.0405

(b) Leave-one-out constraint
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   140.46667   1.43514       97.88     <.0001
SexCodes    3.6436782   1.767753       2.06     0.0405

(c) Two-sample t-test
Estimate (Diff.)   Std Error   t-Test   DF    Prob>|t|   Lower 95%   Upper 95%
-3.6437            1.7678      -2.061   218   0.0405     -7.1278     -0.1596

Figure 6.9: Regression output comparing male and female salaries based on (a) the sum-to-zero constraint, (b) the leave-one-out constraint. Panel (c) compares men's and women's salaries using the two-sample t-test.

Figure 6.9 shows regression output illustrating the two conventions. In panel (a) the intercept term is a baseline. The average women's salary is 1.82 (thousand dollars) below the baseline. The average men's salary is 1.82 above the baseline. That means the difference between the average salaries of men and women is 1.82 × 2 = 3.64. The difference is statistically significant (p = .0405), but just barely. The line for "Sex[Male]" is in quotes because it is not always reported by the computer,13 although the coefficient is easy to figure out if the computer doesn't report it. The coefficients for Sex[Male] and Sex[Female] must sum to zero, so one coefficient is just −1 times the other. Figure 6.9(b) shows the same regression using the leave one out convention. In panel (b) the intercept represents the average salary for women (where SexCodes = 0), and the coefficient of SexCodes represents the average difference between men's and women's salaries. The difference is significant, with exactly the same p-value as in panel (a).

Both regressions in Figure 6.9 describe exactly the same relationship. They just parameterize it differently. Both models say that the average salary for women is $140,466 and that the average salary for men is $3,643 higher. Furthermore, all the p-values, t-statistics, and other important statistical summaries of the relationship are the same for the two models. In that sense, it does not matter which dummy variables convention you use, as long as you know which one you are using.

13 To see it you have to ask for expanded estimates. See page 182.

Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     135.01757   1.881391      71.76     <.0001
Sex[female]   0.0146588   0.949791       0.02     0.9877
YearsExper    0.7496472   0.173053       4.33     <.0001

[Plot: Salary (in $1000s) vs. YearsExper with parallel fitted lines for women and men]

Figure 6.10: Regression output comparing men's and women's salaries after controlling for years of experience. Adding dummy variables to a regression gives women (solid line) and men (dashed line) different intercepts. The lines are parallel because the dummy variables do not affect the slope.

The leave one out convention is a little easier if you have to create your own dummy variables by hand. The sum to zero convention is nice because it treats all the levels of a categorical variable symmetrically. Henceforth we will use JMP's sum to zero convention. The last panel in Figure 6.9 shows that you get the same answer when you compare the sample means using regression or using the two sample t-test. The two sample t-test turns out to be a special case of regression, which is why we left it as an exercise in Chapter 4.

The advantage of comparing men's and women's salaries using regression is that we can control for other background variables. For example, the subjects in Figure 6.8 come from a properly randomized survey, so the statistically significant difference between their salaries generalizes to the population from which the survey was drawn. However, the subjects were not randomly assigned to be men or women,14 so we can't say that the difference between men's and women's salaries is because of gender differences. There are other factors that could explain the salary difference, such as men and women having different amounts of experience.

Figure 6.10 shows output for a regression of Salary on Sex and YearsExper. Note the change from Figure 6.9. Now the baseline for comparison is a regression line Y = 135 + .75 YearsExper. A woman adds .014 to the regression line to find her expected salary. A man with the same number of years of experience subtracts .014 from the line to find his expected salary (remember that the coefficients for Sex must sum to zero). That is only a $28 difference. The large p-value for Sex in Figure 6.10 says that, once we control for YearsExper, Sex does not have a significant effect on Salary. Said another way, multiple regression estimates the impact of changing one variable while holding another variable constant. Therefore the coefficient of Sex can be viewed as comparing the salaries of men and women with the same number of years of experience. Because the p-value for Sex is insignificant, the model says that men and women with the same experience are being paid about the same. You can think about the contributions from the Sex dummy variable as adding or subtracting something from the intercept term, which means that the regression lines for men and women are parallel. Section 6.6 shows how to expand the model if you want to consider non-parallel lines.

14 Obviously very difficult to do.
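The two dummy variable conventions are easy to compare in code. A minimal Python sketch with simulated salaries (the numbers are made up, not the survey data behind Figure 6.9); both codings reproduce the same fitted values:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 220
    female = rng.integers(0, 2, n)                    # 1 = female, 0 = male
    salary = 142 + np.where(female, -1.8, 1.8) + rng.normal(0, 9, n)

    def fit(X):
        Xi = np.column_stack([np.ones(n), X])
        b, *_ = np.linalg.lstsq(Xi, salary, rcond=None)
        return Xi, b

    # Leave-one-out coding: the dummy is 1 for men, 0 for women.
    X_loo, b_loo = fit(1 - female)
    # Sum-to-zero coding: +1 for men, -1 for women (JMP's convention).
    X_stz, b_stz = fit(np.where(female, -1, 1))

    print(b_loo)   # [average female salary, male-minus-female difference]
    print(b_stz)   # [baseline, half the male-female difference]
    # The fitted values are identical under either coding.
    print(np.allclose(X_loo @ b_loo, X_stz @ b_stz))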

6.5.2 Factors with Several Levels

A factor with L levels can be included in a multiple regression by splitting it into L dummy variables. For example, suppose each observation in the data represents a pair of pants ordered from a clothing catalog. The color of each pair can be natural, dusty sage, light beige, bone, or heather. To include Color in the regression, simply make a dummy variable for each color, like

    X[DS] = 1 if dusty sage, 0 otherwise.

Just as with the male/female dummy variables in Section 6.5.1, we must constrain the coefficients of the L dummy variables in order to avoid a perfect, regression-killing collinearity. The same constraints from Section 6.5.1 apply here as well. We can either leave one dummy variable out of the model or constrain the coefficients to sum to zero. The sum-to-zero constraint is more appealing when dealing with multi-level factors.

The output from a regression on a multi-level factor can appear daunting because each factor level introduces a coefficient. For example, Figure 6.11 gives output from a regression of log10 CEO compensation on the CEO's age, industry, and his or her company's profits. Industry is a categorical variable with 19 different levels. Three more coefficients are added by the intercept, Age, and Profits, so there are 22 regression coefficients in total. It helps to remember that all but one of the dummy variables will be zero for any given observation. Therefore, only 4 of the 22 coefficients are needed for any given CEO. To illustrate, let's estimate the log10 compensation for a 60 year old CEO in the entertainment industry whose company made 72 million dollars in profit.


************ Expanded Estimates **************
Nominal factors expanded to all levels
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       5.7275756   0.118957    48.15     <.0001
Ind[AeroDef]    0.1476387   0.088284    1.67      0.0949
Ind[Business]   0.0156033   0.072287    0.22      0.8292
Ind[Cap. gds]   -0.078135   0.085355    -0.92     0.3603
Ind[Chem]       -0.020486   0.075052    -0.27     0.7850
Ind[CompComm]   0.0704833   0.049318    1.43      0.1534
Ind[Constr]     0.1317273   0.116558    1.13      0.2588
Ind[Consumer]   0.0601128   0.053002    1.13      0.2571
Ind[Energy]     -0.057568   0.059622    -0.97     0.3346
Ind[Entmnt]     0.1106206   0.07352     1.50      0.1328
Ind[Finance]    -0.116069   0.03313     -3.50     0.0005
Ind[Food]       -0.04783    0.049415    -0.97     0.3334
Ind[Forest]     -0.123455   0.085613    -1.44     0.1497
Ind[Health]     0.0093132   0.056013    0.17      0.8680
Ind[Insurance]  0.0116971   0.052934    0.22      0.8252
Ind[Metals]     -0.089493   0.087778    -1.02     0.3083
Ind[Retailing]  0.0509543   0.058207    0.88      0.3816
Ind[Transport]  0.0759199   0.095629    0.79      0.4275
Ind[Travel]     0.1955346   0.09896     1.98      0.0485
Ind[Utility]    -0.346568   0.049066    -7.06     <.0001
Age             0.0080405   0.002075    3.87      0.0001
Profits         0.0001485   0.000024    6.25      <.0001

RSquare 0.15152   RMSE 0.38498   N 786

ANOVA
Source   DF    Sum Sq.     Mean Sq.   F Ratio   Prob > F
Model    20    20.24733    1.01237    6.8306    <.0001
Error    765   113.38040   0.14821
Total    785   133.62774

Effect Tests
Source    Nparm   DF   Sum Sq.    F Ratio   Prob>F
Ind       18      18   11.63339   4.3607    <.0001
Age       1       1    2.224885   15.0117   <.0001
Profits   1       1    5.794551   39.0970   <.0001

Figure 6.11: Output for the regression of log10 CEO compensation on the CEO's age, industry, and corporate profits. The Ind[Utility] variable is left out of the usual parameter estimates table. We can either find it in the expanded estimates table (easy) or compute it from the other Ind[x] coefficients (hard).

His expected log10 compensation is

Y = 5.73 + .11 + .00804(60) + .000149(72) = 6.333.

The same CEO in the chemicals industry could expect a log10 compensation of

Y = 5.73 − .02 + .00804(60) + .000149(72) = 6.203.

That doesn't look like a big difference, but remember we're on the log scale. The first CEO expects to make 10^6.333 = $2,152,782; the second, 10^6.203 = $1,595,879, about half a million dollars less. Each coefficient of Ind[X] in Figure 6.11 represents the amount to be added to the regression equation for a CEO in industry X. If you like, you can interpret Figure 6.11 as 19 different regression lines with the same slopes for Age and Profit but different intercepts for each industry.

15 Profit in the data table is recorded in millions of dollars.
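The arithmetic is easy to verify with a few lines of code. The sketch below uses the unrounded coefficients from Figure 6.11, so its answers differ slightly from the rounded hand calculation above.

```python
# Quick check of the two CEO predictions (coefficients copied from Figure 6.11).
intercept, age_coef, profit_coef = 5.7275756, 0.0080405, 0.0001485
ind = {"Entmnt": 0.1106206, "Chem": -0.020486}

def log10_comp(industry, age, profit_millions):
    """Predicted log10 compensation for a CEO of the given age and industry."""
    return intercept + ind[industry] + age_coef * age + profit_coef * profit_millions

for industry in ("Entmnt", "Chem"):
    y = log10_comp(industry, age=60, profit_millions=72)
    print(industry, round(y, 3), f"${10 ** y:,.0f}")
```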

Not on the test 6.2: Making the coefficients sum to zero

The computer uses a trick to force the coefficients of the dummy variables to sum to zero. The trick is to leave one of the dummy variables out, but to define all the others a little differently. For example, suppose the Utility industry in Figure 6.11 is the level left out. The other dummy variables are

Ind[AeroDef] = 1 if subject i is in the aerospace-defense industry, −1 if subject i is in the utilities industry, 0 otherwise,

Ind[Finance] = 1 if subject i is in the finance industry, −1 if subject i is in the utilities industry, 0 otherwise,

and so forth, always assigning −1 to the level that doesn't get its own dummy variable. If b1, b2, . . . , b18 are the coefficients for the other 18 industries, then a CEO in the utilities industry adds b19 = −(b1 + b2 + · · · + b18) to the regression equation, because he gets a −1 on every dummy variable in the model. Clearly, b1 + · · · + b19 = 0. By using this trick, the computer only estimates the first 18 b's, and derives b19. That explains why you don't see Ind[Utility] in the regular parameter estimates box. The computer didn't actually estimate a parameter for it. What makes this such a neat trick is that all 19 dummy variables behave as desired (as if one of them was 1 and the rest were 0) in Figure 6.11.
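For readers who want to see the trick concretely, here is a small sketch (not from the text) that builds effect-coded columns by hand for a toy three-level factor and recovers the coefficient of the dropped level.

```python
# Building sum-to-zero ("effect") coded columns by hand for a toy factor.
import numpy as np
import pandas as pd

levels = ["AeroDef", "Finance", "Utility"]          # toy factor; "Utility" is dropped
ind = pd.Series(["AeroDef", "Utility", "Finance", "AeroDef"], name="Ind")

# One column per level except the dropped one. Rows in the dropped level get
# -1 in every column; all other rows get the usual 0/1 coding.
X = pd.DataFrame({
    f"Ind[{lev}]": np.where(ind == levels[-1], -1.0, (ind == lev).astype(float))
    for lev in levels[:-1]
})
print(X)

# After the regression is fit, the dropped level's coefficient is minus the
# sum of the estimated ones, so all the coefficients sum to zero.
b = np.array([0.1476, -0.1161])                     # say, the fitted b1, b2
print(-b.sum())
```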

Testing a Multilevel Factor

To test the significance of a multi-level factor, you test whether the collection of dummy variables that represent that factor, taken as a whole, do a useful amount of explaining. The partial F test from Section 6.2.4 answers that question. The partial F test for each variable in the model is shown in the Effects Test table in Figure 6.11. The partial F tests for Age and Profits are equivalent to the t-tests for those variables, because each variable only introduces one coefficient into the model. The partial F test for the industry variable Ind examines whether the 18 free dummy variables reduce SSE by a sufficient amount to justify their inclusion. These 18 dummy variables reduce SSE by 11.63, compared to the model with just Age and Profit. To calculate the F statistic, the computer uses the formula from Section 6.2.4,

F = (11.63/18) / 0.14821 = 4.3607,

where 0.14821 is the MSE from the ANOVA table. The p-value generated by this partial F statistic, with 18 and 765 degrees of freedom, is p < .0001, which says



that at least some of the dummy variables are helpful. Once you see that a categorical factor is helpful, there is nothing that says you have to keep all the individual dummy variables in the model. However, if you want to drop insignificant dummy variables from the model you will need to create all the individual dummy variables by hand. That is a big hassle that doesn't seem to have a big payoff. We make an exception to the "no garbage variables" rule when it comes to individual dummy variables that are part of a factor which is significant overall.
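As a sanity check, the partial F statistic and its p-value can be recomputed from the numbers in Figure 6.11; a sketch using scipy:

```python
# Reproducing the partial F test for Ind by hand (numbers from Figure 6.11).
from scipy import stats

drop_in_sse = 11.63339   # SSE reduction from adding the 18 industry dummies
extra_df = 18
mse_full = 0.14821       # MSE of the full model, with 765 error df

F = (drop_in_sse / extra_df) / mse_full
p_value = stats.f.sf(F, extra_df, 765)
print(round(F, 4), p_value)   # roughly 4.36 and p < .0001
```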

6.5.3 Testing Differences Between Factor Levels

Once you determine that there is a difference between a subset of the factor levels, you will often want to run subsequent tests to see where the differences are. The t-statistics for the individual dummy variables are set up to compare the dummy variable to the baseline regression, not to compare two dummy variables to one another. The tool you use to compare dummy variables is called a contrast. A contrast is a weighted average of dummy variable coefficients where some of the weights are positive and some are negative. The weights can be anything you like as long as the positive and negative weights each sum to 1. Often the positive components of the contrast are equally weighted, and likewise for the negative components. For example, suppose you wanted to compare average CEO compensation in the Finance and Insurance industries with average compensation in the Entertainment, Retailing, and Consumer Goods industries, after controlling for Age and Profits.16 Number these five industries 1 through 5. The contrast you want to compute is17

.5β1 + .5β2 − .333β3 − .333β4 − .333β5

where βi is the coefficient of the dummy variable for the ith industry. The results of this contrast are shown in the box below. The estimate is negative, which says that the average compensation in the Entertainment/Retail/Consumer industries (which have negative weights) is larger than the average in the Finance/Insurance industries, after controlling for Age and Profits. The small p-value says that the difference is significant.
Estimate   Std Error   t Ratio   Prob>|t|   SS
-0.126     0.0474      -2.658    0.008      1.047

Sum of Squares   Numerator DF   Denominator DF   F Ratio        Prob > F
1.0469633847     1              765              7.0640690154   0.0080286945

16 We don't know why someone might want to do this, but it illustrates the point. A more meaningful example appears in Section 6.6.1.
17 See page 182 for computer tips.
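The point estimate in that box is just the weighted combination of the industry coefficients from Figure 6.11, which is easy to verify by hand; the standard error additionally requires the estimated covariance matrix of the coefficients, which we don't reproduce here.

```python
# Reproducing the contrast estimate from the Figure 6.11 coefficients.
coef = {"Finance": -0.116069, "Insurance": 0.0116971,
        "Entmnt": 0.1106206, "Retailing": 0.0509543, "Consumer": 0.0601128}
weights = {"Finance": 0.5, "Insurance": 0.5,
           "Entmnt": -1/3, "Retailing": -1/3, "Consumer": -1/3}

estimate = sum(weights[k] * coef[k] for k in coef)
print(round(estimate, 3))    # about -0.126, matching the contrast box
# The standard error needs the covariance matrix of the coefficients
# (e.g. fit.cov_params() after fitting the model in statsmodels).
```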



[Figure 6.12 diagrams: (a) typical regression, (b) collinearity, (c) interaction, (d) autocorrelation]

Figure 6.12: Collinearity is a relationship between two or more X's. Interaction is when the strength of the relationship between X1 and Y depends on X2, and vice-versa. Autocorrelation is when Y depends on previous Y's.

6.6 Interactions Between Variables

Now that we can incorporate either categorical or continuous X's, multiple regression looks like a very flexible tool. It gets even more flexible when we introduce the idea of interaction, where the relationship between Y and X depends on another X. Interaction can be a hard idea to internalize because there are three variables involved. Mathematically, interaction means that the slope of X1 depends on X2. The practical implications of interactions become much easier to understand if you think of the regression slopes as meaningful real world quantities instead of the generic "a one unit change in X leads to a . . . ". For example, the CEO compensation data set has a variable containing the percentage of a company's stock owned by the CEO. Suppose we wish to examine the effect of stock ownership on CEO compensation. From Section 6.5.2 we know that log10 compensation also depends on profits.18 The standard way of including Profit and Stock ownership in a regression model is

Y = β0 + β1 Stock + β2 Profit.

This model says that increasing a CEO's ownership of his company's stock by 1% increases his expected compensation by β1, regardless of the company's profit level. That seems wrong to us. Instead it seems like CEOs with more stock should do even better if they run profitable companies than if they don't. Said another way, we think the coefficient of Stock should be larger if the CEO's company is profitable, and smaller if it is not. Consider the regression model where

Y = β0 + β1 Stock + β2 Profit + β3 (Stock)(Profit).

The term (Stock)(Profit) is called the interaction between Stock and Profit. It is simply the product of the two variables. The individual variables Stock and Profit are called main effects or first order effects of the interaction. To determine the
18 ...and on Age, Industry, and presumably other variables as well. Ignore them for the moment.



Term                   Estimate    Std Error   t Ratio   Prob>|t|   VIF
Intercept              5.7103165   0.118212    48.31     <.0001     .
Ind[aerosp]            0.1351741   0.087518    1.54      0.1229     4.130047
 ... (other industry dummy variables omitted)
Ind[Travel]            0.2064925   0.098087    2.11      0.0356     4.8915569
Age                    0.0087606   0.002067    4.24      <.0001     1.0939624
Profits                0.0001199   0.000029    4.15      <.0001     1.5551463
Stock%                 -0.009774   0.002402    -4.07     <.0001     1.1822928
(Stock%-2.17974)*
  (Profit-244.202)     -0.000013   0.000009    -1.40     0.1630     1.6220983

Figure 6.13: Regression of log10 CEO compensation on Age, Industry, Profit, the percent of a company's stock owned by the CEO, and the Profit*Stock interaction.

interaction's effect on the slope of Stock, ignore all terms in the model that don't have Stock in them, then factor Stock out of all that do.19 The slope of Stock is

βStock = β1 + β3 Profit.

If β3 is positive then the slope of Stock is larger when Profit is high, and smaller when Profit is low. The key to understanding interactions is to come up with a real world meaning for the slopes of the variables involved in the interaction. In the current example, βStock is the impact that stock ownership has on a CEO's overall compensation. If β3 > 0 then stock ownership has a more positive impact on compensation for CEOs of more profitable companies, and a more negative impact on compensation for CEOs of money losing companies. The interaction can be interpreted the other way as well. βProfit = β2 + β3 Stock, so if β3 > 0 then the model says that the profits of a CEO's company have a larger impact on a CEO's compensation if the CEO owns a lot of stock. Figure 6.13 tests our theory about the relationship between Stock, Profit, and log10 compensation by incorporating the Stock*Profit interaction into the regression from Figure 6.11. The first thing we notice is that the interaction term is insignificant (p = .1630). That means that the effect of stock ownership on CEO compensation is about the same for CEOs of highly profitable companies and CEOs of companies with poor profits. So much for our theory! We also notice that the coefficient of Stock is negative. Perhaps CEOs who own large amounts of stock
19 If you know about partial derivatives, we're just taking the partial derivative with respect to Stock, treating all other variables as constants.



choose to forgo huge compensation packages in the hopes that their existing shares will increase in value. The leverage plot for Stock seems to fit with that explanation. It seems we had a flaw in our logic. CEOs that own a lot of stock obviously do well when their companies are profitable, but an increase in the value of their existing shares does not count as compensation, so it won't show up in our data. Just for practice, let's try to interpret what the interaction term in Figure 6.13 says, even though it is insignificant and should be dropped from the model. Before creating the interaction term, the computer centers Stock and Profit around their average values to avoid introducing excessive collinearity into the model.20 Figure 6.13 estimates the slope of Stock as

bStock = −0.009774 − 0.000013(Profit − 244.202).

Notice that the mean of Stock (2.17 . . . ) multiplies Profit, not Stock, when you multiply out the interaction term, so it doesn't affect bStock. The estimate says that each million dollars of profit over the average of $244 million decreases the slope of Stock by 0.000013. The estimated slope of Profit is

bProfit = 0.0001199 − 0.000013(Stock − 2.17974),

which says that each percent of the company's stock owned by the CEO, over the average of 2.18%, decreases the slope of profit by 0.000013.
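To see what the (statistically insignificant) interaction estimate would imply, you can trace out the fitted slope of Stock at a few profit levels; a quick sketch using the estimates quoted above:

```python
# Implied slope of Stock at different profit levels (estimates from Figure 6.13).
b_stock, b_inter = -0.009774, -0.000013
profit_mean = 244.202                      # millions of dollars

def stock_slope(profit_millions):
    """Change in predicted log10 compensation per 1% more stock owned."""
    return b_stock + b_inter * (profit_millions - profit_mean)

for profit in (0, 244, 1000):
    print(profit, round(stock_slope(profit), 5))
```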

6.6.1 Interactions Between Continuous and Categorical Variables

Interactions say that the slope for one variable depends on the value of another variable. When one variable is continuous and the other is categorical, interactions mean that there is a different slope for each level of the categorical variable. For example, suppose a factory supervisor oversees three managers, conveniently named a, b, and c. The supervisor decides to base each manager's annual performance review on the typical amount of time (in man-hours) required for them to complete a production run. The supervisor obtains a random sample of 20 production runs from each manager. To make the comparison fair, the size (number of items produced) of each production run is also recorded. The supervisor regresses RunTime on RunSize and Manager to obtain the regression output in Figure 6.14(a). The regression estimates the fixed cost of starting up a production run at 176 man-hours, and the marginal cost of each item produced is about .25 man-hours (i.e. it takes each person about 15 minutes to produce an item, once the process is up and running). The fixed cost for manager a is 38.4 man-hours above the baseline. Manager b is 14.7 man-hours below the baseline, and manager c is 23.8 man-hours
20 See page 136.



(a) Expanded Estimates (no interaction)
Nominal factors expanded to all levels
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    176.70882   5.658644    31.23     <.0001
Manager[a]   38.409663   3.005923    12.78     <.0001
Manager[b]   -14.65115   3.031379    -4.83     <.0001
Manager[c]   -23.75851   2.995898    -7.93     <.0001
Run Size     0.243369    0.025076    9.71      <.0001

(b) Effect Tests (no interaction)
Source     Nparm   DF   Sum of Sq.   F Ratio   Prob > F
Manager    2       2    44773.996    83.4768   <.0001
Run Size   1       1    25260.250    94.1906   <.0001

(c) Expanded Estimates (with interaction)
Term                   Estimate    Std Err.   t Ratio   Prob>|t|
Intercept              179.59191   5.619643   31.96     <.0001
Manager[a]             38.188168   2.900342   13.17     <.0001
Manager[b]             -13.5381    2.936288   -4.61     <.0001
Manager[c]             -24.65007   2.887839   -8.54     <.0001
Run Size               0.2344284   0.024708   9.49      <.0001
Mgr[a]*(RS-209.317)    0.072836    0.035263   2.07      0.0437
Mgr[b]*(RS-209.317)    -0.09765    0.037178   -2.63     0.0112
Mgr[c]*RS              0.0248147   0.032207   0.77      0.4444

(d) Effect Test (with interaction)
Source       Nparm   DF   Sum Sq.     F Ratio   Prob > F
Manager      2       2    43981.452   89.6934   <.0001
Run Size     1       1    22070.614   90.0192   <.0001
Manager*RS   2       2    1778.661    3.6273    0.0333

Figure 6.14: Evaluating three managers based on run-time: (a-b) without interactions, (c-d) with the interaction between manager and run size. Manager a (broken line), Manager b (+, light line), Manager c (heavy line).

below the baseline. The effects test shows that the Manager variable is significant, so the differences between the managers are too large to be just random chance. Upon determining that there actually is a difference between the three managers, the supervisor computed two contrasts (shown in Figure 6.15), indicating that the setup time for manager a is significantly above the average times of managers b and c on runs of similar size, and that the difference between managers b and c is not statistically significant. The supervisor decides to put manager a on probation, and give managers b and c their usual bonus. While meeting with the managers to discuss the annual review, the supervisor learned that manager b used a different production method than the other two managers. Manager b's method was optimized to reduce the marginal cost of each unit produced. The supervisor began to worry that he hadn't judged



             a    b     c     Estimate   Std Error   t Ratio   Prob>|t|   SS
Contrast 1   0    1     -1    9.1074     5.2243      1.7433    0.0868     814.99
Contrast 2   1    -0.5  -0.5  57.614     4.5089      12.778    <.0001     43788

Figure 6.15: Contrasts based on the model in Figure 6.14(a).


manager b fairly, because his analysis assumed that all three managers had the same marginal cost of .243 man-hours per item. If manager b's method was truly effective (which the supervisor isn't sure about and would like to test), manager b would have an advantage on large jobs that the supervisor's analysis ignored. Unhappy with his current model, the supervisor calls you to help. The supervisor's question suggests an interaction between the Manager variable and the RunSize variable. Interactions say that the slope of one variable (in this case RunSize) depends on another (in this case Manager). So in the current example an interaction simply means each manager has his own slope, or even more specific to this example: his own marginal cost of production. After hiring you to explain all this (and paying you a huge consulting fee) the supervisor decides to run a new regression that includes the interaction between Manager and RunSize, shown in Figure 6.14(c). The interaction term enters into the model as a product of RunSize with all the Manager[x] dummy variables, after centering the continuous variable RunSize to limit collinearity.21 Because the interaction enters into the model in the form of several variables at once, the right way to test its significance is through the partial F test shown in the Effects Test table. The interaction is significant with p = .0333, which says that at least one of the managers has a different slope (marginal cost) than the other two. The regression equations for the three managers are:

RunTime = 179.59 + 38.19 + .234 RunSize + .073(RS − 209.3)   if Manager a
RunTime = 179.59 − 13.54 + .234 RunSize − .098(RS − 209.3)   if Manager b
RunTime = 179.59 − 24.65 + .234 RunSize + .025(RS − 209.3)   if Manager c.

The baseline marginal cost per unit of production is .2344 man-hours per unit. Manager a adds .073 man-hours per unit, manager b reduces the baseline marginal cost by .098 man-hours per unit, and manager c adds .025 man-hours per unit. Notice that the coefficients of the dummy variables in the interaction terms sum
21 The centering term is not present in the Mgr[c] line of the computer output, but that is a typo on the part of the computer programmers. The output should read Mgr[c]*(RS-209.317).



to zero just like the coefficients of the Manager main effects. By adding like terms we find that manager a's slope is 0.307, manager b's is 0.136, and manager c's is 0.259. JMP doesn't provide a method equivalent to a contrast for testing interaction terms, but clearly manager b has a smaller slope (i.e. smaller marginal cost per item produced) than the other two.
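A sketch of the same kind of fit in Python, on hypothetical data with assumed per-manager slopes and intercepts (not the supervisor's data): the sum (effect) coding mimics JMP's sum-to-zero convention, though unlike JMP, patsy does not center RunSize before forming the interaction.

```python
# Fitting RunTime on RunSize, Manager, and their interaction (made-up data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
manager = np.repeat(["a", "b", "c"], 20)
run_size = rng.integers(50, 350, size=60).astype(float)
slopes = {"a": 0.31, "b": 0.14, "c": 0.26}        # assumed marginal costs
intercepts = {"a": 215.0, "b": 166.0, "c": 155.0}
run_time = np.array([intercepts[m] + slopes[m] * s for m, s in zip(manager, run_size)])
run_time += rng.normal(0, 15, size=60)
df = pd.DataFrame({"RunTime": run_time, "RunSize": run_size, "Manager": manager})

fit = smf.ols("RunTime ~ C(Manager, Sum) * RunSize", data=df).fit()
print(fit.params)
# Each manager's slope = the RunSize coefficient plus that manager's interaction
# coefficient; the third manager's interaction term is minus the sum of the others.
```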

6.6.2 General Advice on Interactions

Students often ask how you know if you need to include an interaction term. The simple answer is that you try it in the model and look at its p-value, just like any other variable. The trick is having some idea which interaction to check, which requires some contextual knowledge about the problem. You should think to try an interaction term whenever you think that changing a variable (either categorical or continuous) changes the dynamics of the other variables. To develop that intuition you should think about the economic interpretation of the coefficients as marginal costs, marginal revenues, etc. There is no plot or statistic you look at to say "I need an interaction here." The variables involved in an interaction may or may not be related to one another. Interactions are relatively rare in regression analysis. Many people don't bother to check for them because of the intuition required for the search. However, when you find a strong one it is a real bonus because you've learned something interesting about the process you're studying. Finally, we interpret interactions as the effect that one variable has on the slope of another. That interpretation becomes more difficult if the slope of either variable is excluded from the model. Therefore, when fitting models with interactions remember the hierarchical principle, which says that if you have an interaction in the model, you should also include the main effects of each variable, even if they are not significant. The exception to the hierarchical principle is when the variable you create by multiplying two other variables is interpretable on its own merits. In that case you have the option of keeping or dropping the main effects as conditions warrant.

6.7 Model Selection/Data Mining

Model selection is the process of trying to decide which are the important variables to include in your model. This is probably the most difficult part of a typical regression problem. Model selection is one of the key components in the field of Data Mining, which you have probably heard about before. Data mining has arisen because of the huge amounts of data that are now becoming available as a result of new technologies such as bar code scanners. Every time you go to the



supermarket and your products are scanned at the checkout, that information is sent to a central location. There are literally terabytes of information recorded every day. This information can be used to plan marketing campaigns, promotions, etc. However, there is so much data that it is virtually impossible to manually decide which of the thousands of recorded variables are important for any given response and which are not. The field of data mining has sprung up to try to answer this problem (and related ones). However, the problem has existed in statistics for a long time before data mining arrived. There are some established procedures for deciding on the "best" variables, which we discuss below. In the interests of fairness we should point out that this is such a hard problem that answering it is somewhat more of an art than a science!

6.7.1 Model Selection Strategy

When deciding on the best regression model you have several things to keep your eye on. You should check for (1) regression assumption violations, (2) unusual points, and (3) insignificant variables in the model. You should also keep your eye on the VIFs of the variables in the model to be aware of any collinearity issues. When faced with a mountain of potential X variables to sort through, a good strategy is to start off with what significant predictors you can find by trial and error. Each time you fit a model, glance at the diagnostics in the computer output to see if you can spot any glaring assumption violations or unusual points. Try to economically interpret the coefficients of the models you fit ("β3 is the cost of raw materials"). Being able to interpret your coefficients this way will help you understand the model better, make it easier for you to explain the model to other people, and perhaps give you some insight on interactions that you might want to test for. This stage of the analysis generally requires quite a bit of work and a lot of practice. However, there is really no good substitute for this kind of approach. Once you've gone through a few iterations and are more or less happy with the situation regarding regression assumptions and unusual points, you eventually get to the stage where a computer can help you figure out which variables will provide significant p-values. Be careful that you don't jump to this step too soon, though. Computers are really good at churning through variables looking for significant p-values, but they are really bad at identifying and fixing assumption violations, suggesting interpretable transformations to remove collinearity problems, and making judgments about what to do with unusual points.



6.7.2 Multiple Comparisons and the Bonferroni Rule

At several points throughout this Chapter we have urged caution when looking at p-values for individual variables. The primary reason is the problem of multiple comparisons. In previous Chapters each problem had only one hypothesis test for us to do. Our .05 rule for p-values meant that if an X was a garbage variable, we had only a 5% chance of mistaking it for a significant one. Regression analysis involves many different hypothesis tests and a lot more trial and error, which means there is a much greater opportunity for garbage variables to sneak into the regression model by chance. Consequently, our .05 rule for p-values no longer protects us from spurious significant results as well as it did before. One obvious solution is to enact a tougher standard for declaring significance based on the p-values of individual variables in a multiple regression. The question is how much tougher? Unfortunately there is no way of knowing, but there is a conservative procedure known as the Bonferroni adjustment which we know to be stricter than we need to guarantee a particular level of significance. The Bonferroni adjustment says that if you are going to select from p potential X variables, then replace the .05 threshold for significance with .05/p. Thus if there were 10 potential X variables we should use .005 as the significance threshold instead of .05. The Bonferroni rule is a rough guide. It is important to remember that it is tougher than it needs to be, simply because we have no way of knowing precisely how much we should relax it to maintain a true .05 significance level. At some point (a practical limit is .0001, the smallest p-value that the computer will print) the Bonferroni adjusted threshold gets low enough that we don't penalize it any further.
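The adjustment itself is one line of arithmetic; here is a small helper that also applies the practical .0001 floor mentioned above.

```python
# Bonferroni-adjusted significance threshold when screening p candidate X's.
def bonferroni_threshold(p_candidates, alpha=0.05, floor=0.0001):
    """Divide alpha by the number of candidates, but don't go below the
    practical floor suggested in the text."""
    return max(alpha / p_candidates, floor)

print(bonferroni_threshold(10))    # 0.005
print(bonferroni_threshold(1000))  # 0.0001 (floored)
```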

6.7.3 Stepwise Regression

In regression problems with only a few X variables it is best to select the variables in the regression model by hand. However, if you have a mountain of X variables staring you in the face you're going to need some help. Some sort of automated approach is required to choose a manageable subset of the variables to consider in more detail. There are three common approaches, collectively called stepwise regression.

Forward Selection. With this procedure we start with no variables in the model and add to the model the variable with lowest p-value. We then add the variable with next lowest p-value (conditional on the first variable being in the model). This approach is continued until the p-value for the next variable to add is above some threshold (e.g. 5%, adjusted by the Bonferroni rule), at which point we stop. The threshold must be chosen by the user.
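Here is a rough sketch (not JMP's implementation) of forward selection by p-value, using statsmodels; the column names and the noise data are made up for illustration.

```python
# Forward selection by smallest p-value, stopping at a chosen threshold.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(y, X, threshold=0.005):
    """X: DataFrame of candidate predictors, y: response. Returns chosen columns."""
    chosen = []
    remaining = list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            design = sm.add_constant(X[chosen + [col]])
            pvals[col] = sm.OLS(y, design).fit().pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] > threshold:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# With pure noise and a tough (Bonferroni-style) threshold, the procedure
# should usually return an empty list.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(48, 10)), columns=[f"X{i}" for i in range(10)])
y = pd.Series(rng.normal(size=48))
print(forward_select(y, X))
```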

Not on the test 6.3: Where does the Bonferroni rule come from?

Suppose X1, . . . , Xp are all variables with no relationship to Y. Now suppose we run a regression and pick out only the X's that have significant individual p-values. Each Xi has a probability α of spuriously making it into the model, where our usual rule is α = .05. From very basic probability we know that if A and B are two events, then

P(A or B) = P(A) + P(B) − P(A and B) ≤ P(A) + P(B).

Therefore, the probability that at least one of the Xi's makes it into the model is

P(X1 or X2 or · · · or Xp) ≤ P(X1) + · · · + P(Xp) = pα.

Thus if we replace the .05 rule with a .005 rule, then we can look at 10 t-statistics and still maintain only a 5% chance of allowing at least one garbage variable into the regression.


Backward Selection. With backward selection we start with all the possible variables in the model. We then remove the variable with largest p-value. Then we remove the variable with next largest p-value (given the new model with the first variable removed). This procedure continues until all remaining variables have a p-value below some threshold, again chosen by the user.

Mixed Selection. This is a combination of forward and backward selection. We start with no variables in the model and, as with forward selection, add the variable with lowest p-value. The procedure continues in this way except that if at any point any of the variables in the model have a p-value above a certain threshold they are removed (the backward part). These forward and backward steps continue until all variables in the model have a low enough p-value and all variables outside the model would have a large p-value if added. Notice that this procedure requires selecting two thresholds. If the two thresholds are close to one another the procedure can enter a cycle where including a variable makes a previously included variable insignificant; dropping that variable then makes the first variable significant again, so it is re-entered, and the cycle repeats. If you encounter such a cycle then simply choose a tougher threshold for including variables in the model.

JMP (or most other statistical packages) will automatically perform these procedures for you.22 The only inputs you need to provide are the data and the
22 See page 182 for computer tips.



thresholds. Note that the default thresholds that JMP provides are TERRIBLE. They are much larger than .05, when they should be much smaller. To illustrate, we simulated 10 normally distributed random variables, totally independently, for use in a regression to predict the stock market (60 monthly observations of the VW index from the early 1990s). The last 12 months were set aside, and the remaining 48 were used to fit the model. Obviously our X variables are complete gibberish, as is reflected in the ANOVA table for the regression of VW on X1, . . . , X10 shown in Figure 6.16. The second ANOVA table in Figure 6.16 was obtained using the Mixed stepwise procedure, with JMP's default thresholds, on all possible interaction and quadratic terms for the 10 X variables. Notice R2 = .92! The whole model F test is highly significant. Figure 6.16(a) shows a time series plot of the predictions from this regression model on the same plot as the actual data. For the 48 months that the model was fit on, the predicted and actual series track nearly perfectly. For the last 12 months, which were not used to fit the model, the two series are completely unrelated to one another. Figure 6.16(b) makes the same point using a different view of the data. It plots the residuals versus the predicted returns for all 60 observations. At least half of the 12 points not used in fitting the model could be considered outliers. The point of this exercise is that stepwise regression is a greedy algorithm for choosing X's to be in the model. Give it a chance and it will include all sorts of spurious variables that happen to fit just by chance. When using the stepwise procedure remember to protect yourself by setting tough significance thresholds. When we tried our same experiment using .001 as a threshold (the lowest threshold that JMP allows), we ended up with an empty model, which in this case is the right answer.
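You can reproduce the flavor of this experiment without JMP. The sketch below (random numbers only; the VW data are not reproduced here) simply fits 26 noise regressors to 48 noise observations by least squares, which already inflates the in-sample R2; a stepwise search over hundreds of candidate interaction terms makes the problem far worse.

```python
# Overfitting illustration: in-sample R^2 vs. a 12-observation holdout.
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, k = 48, 12, 26          # 26 ~ number of terms stepwise kept
X = rng.normal(size=(n_train + n_test, k))
y = rng.normal(scale=0.03, size=n_train + n_test)

Xtr, ytr = X[:n_train], y[:n_train]
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n_train), Xtr]), ytr, rcond=None)

def r_squared(Xpart, ypart):
    pred = beta[0] + Xpart @ beta[1:]
    return 1 - np.sum((ypart - pred) ** 2) / np.sum((ypart - ypart.mean()) ** 2)

print("in-sample R^2:", round(r_squared(Xtr, ytr), 2))
print("holdout R^2:  ", round(r_squared(X[n_train:], y[n_train:]), 2))
```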



ANOVA Table for ordinary regression on 10 Random Noise variables
Source   DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model    10   0.00409569       0.000410      0.2906    0.9791
Error    36   0.05074203       0.001410
Total    46   0.05483773

Mixed Stepwise with JMP's Defaults (all possible interactions considered)
Source   DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model    26   0.05058174       0.001945      9.1422    <.0001
Error    20   0.00425599       0.000213
Total    46   0.05483773

RSquare 0.922389   RMSE 0.014588   N 47

[Figure 6.16 plots: (a) actual and predicted series over time; (b) residuals versus predicted values]

Figure 6.16: Regression output for the value weighted stock market index regressed on 10
variables simulated from random noise.



Chapter 7

Further Topics
7.1 Logistic Regression

Often we wish to understand the relationship between X and Y where X is continuous but Y is categorical. For example a credit card company may wish to understand what factors (X variables) affect whether a customer will default or whether a customer will accept an offer of a new card. Both "default" and "accept new card" are categorical responses. Suppose we let

Yi = 1 if the ith customer defaults, 0 if not.

Why not just fit the regular regression model, Yi = β0 + β1 Xi + εi, and predict Y as Ŷ = b0 + b1 X? There are two problems with this prediction. First, if Ŷ = 0.5, for example, we know this is an incorrect prediction because Y is either 0 or 1. Second, depending on the value of X, Ŷ can range anywhere from negative to positive infinity. Clearly this makes no sense because Y can only take on the values 0 or 1.

A New Model

The first problem can be overcome by treating Ŷ as a guess, not for Y itself, but for the probability that Y equals 1, i.e. P(Y = 1). So, for example, if Ŷ = 0.5 this would indicate that the probability of a person with this value of X defaulting is



50%. Thus any prediction between 0 and 1 now makes sense. However, this does not solve the second problem because a probability less than zero or greater than one still has no sensible interpretation. The problem is that a straight line relationship between X and P(Y = 1) is not correct. The true relationship will always be between 0 and 1. We need to use a different function, i.e. not linear. There are many possible functions, but the one that is used most often is the following:

p = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

where p = P(Y = 1). This curve has an S shape. Notice that as β1 X gets close to infinity p gets close to one (but never goes past one) and when β1 X gets close to negative infinity p gets close to zero (but never goes below zero). Hence, no matter what X is we will get a sensible prediction. If we rearrange this equation we get

p / (1 − p) = e^(β0 + β1 X).

The quantity p/(1 − p) is called the odds. It can go anywhere from 0 to infinity. A number close to zero indicates that the probability of a 1 (default) is close to zero while a number close to infinity indicates a high probability of default. Finally, by taking logs of both sides we get

log( p / (1 − p) ) = β0 + β1 X.

The left hand side is called the log odds or logit. Notice that there is a linear relationship between the log odds and X.

Fitting the Model


Just as with regular regression, β0 and β1 are unknown. Therefore to understand the relationship between X and Y and make future predictions for Y we need to estimate them using b0 and b1. These estimates are produced using a method called Maximum Likelihood, which is a little different from Least Squares. However, the details are not important because most of the ideas are very similar. We still have the same questions and problems as with standard regression. For example b1 is only a guess for β1, so it has a standard error which we can use to construct a confidence interval. We are still interested in testing the null hypothesis H0: β1 = 0 because this corresponds to no relationship between X and Y. We now use a χ2 (chi-square) statistic rather than t, but you still look at the p-value to see whether you should reject H0 and conclude that there is a relationship. The interpretation of β1 is a little more difficult than with standard regression. If β1 is positive then increasing X will increase p = P(Y = 1), and vice versa for β1



negative. However, the effect of increasing X by one on the probability is less clear. Increasing X by one changes the log odds by β1. Equivalently, it multiplies the odds by e^β1. However, the effect this will have on p depends on what the probability is to start with.

Making Predictions
Suppose we fit the model using average debt as a predictor and get b0 = −3 and b1 = .001. Then for a person with 0 average debt we would predict that the probability they defaulted on the loan would be

p = e^(−3 + 0.001·0) / (1 + e^(−3 + 0.001·0)) = e^−3 / (1 + e^−3) = 0.047.

On the other hand a person with $2,000 in average debt would have a probability of default of

p = e^(−3 + 0.001·2,000) / (1 + e^(−3 + 0.001·2,000)) = e^−1 / (1 + e^−1) = 0.269.
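The same two predictions, checked in a few lines of code:

```python
# Logistic predictions for the single-predictor default example.
import math

b0, b1 = -3.0, 0.001

def p_default(debt):
    """Predicted P(Y = 1) at a given average debt."""
    eta = b0 + b1 * debt
    return math.exp(eta) / (1 + math.exp(eta))

print(round(p_default(0), 3))      # 0.047
print(round(p_default(2000), 3))   # 0.269
```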

Multiple Logistic Regression


The logistic regression model can easily be extended to as many X variables as we like. Instead of using

p = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

we use

p = e^(β0 + β1 X1 + · · · + βp Xp) / (1 + e^(β0 + β1 X1 + · · · + βp Xp)),

or equivalently

log( p / (1 − p) ) = β0 + β1 X1 + · · · + βp Xp.

The same questions from multiple regression reappear. Do the variables overall help to predict Y? (Look at the p-value for the whole model test.) Which of the individual variables help? (Look at the individual p-values.) What effect does each X have on Y? (Look at the signs on the coefficients.) In the credit card example it looked like average balance and income had an effect on the probability of default. Higher average balances caused a higher probability (b1 was positive) and higher incomes caused a lower probability (b2 was negative). Therefore we are especially concerned about people with high balances and low incomes.



Making predictions with several variables


Suppose we fit the model using average debt and income as predictors and get b0 = 1, b1 = .001, and b2 = −.0001. Then for a person with 0 average debt and income of $50,000 we would predict that the probability they defaulted on the loan would be

p = e^(1 + 0.001·0 − 0.0001·50,000) / (1 + e^(1 + 0.001·0 − 0.0001·50,000)) = e^−4 / (1 + e^−4) = 0.018.

On the other hand a person with $2,000 in average debt and income of $30,000 would have a probability of default of

p = e^(1 + 0.001·2,000 − 0.0001·30,000) / (1 + e^(1 + 0.001·2,000 − 0.0001·30,000)) = e^0 / (1 + e^0) = 0.5.
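And the same calculation with two predictors:

```python
# Logistic predictions for the two-predictor (debt, income) example.
import math

b0, b_debt, b_income = 1.0, 0.001, -0.0001

def p_default(debt, income):
    eta = b0 + b_debt * debt + b_income * income
    return math.exp(eta) / (1 + math.exp(eta))

print(round(p_default(0, 50_000), 3))      # 0.018
print(round(p_default(2_000, 30_000), 3))  # 0.5
```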

7.2 Time Series

The Difference Between Time Series and Cross Sectional Data


Cross Sectional Data: A car's weight predicts/explains fuel consumption. Two different variables, no concept of time.

Time Series: Past number of deliveries explains future number of deliveries. Same variable, different times.

Autocorrelation
When we have data measured over time we often denote the Y variable as Yt, where t indicates the time that Y was measured at. The lag variable is then just Yt−1, i.e. all the Y's shifted back by one. A standard assumption in regression is that the Y's are all independent of each other. In time series data this assumption is often violated. If there is a correlation between yesterday's Y, i.e. Yt−1, and today's Y, i.e. Yt, this is called autocorrelation. Autocorrelation means today's residual is correlated with yesterday's residual. An easy way to spot it is to plot the residuals against time. Tracking, i.e. a pattern where the residuals follow each other, is evidence of autocorrelation. In cross sectional data there is no time component, which means no autocorrelation. On the other hand time series data almost always has some autocorrelation.

Impact of Autocorrelation


On predictions. The past contains additional information about the future not captured by the current regression model. Today's residual can help predict tomorrow's residual. This means that we should incorporate the previous (lagged) Y's in the regression model to produce a better estimate of today's Y.

On parameter estimates. Yesterday's Y can be used to predict today's Y. As a consequence today's Y provides less new information than an independent observation. Another way to say this is that the equivalent sample size is smaller. For example, 100 Y's that have autocorrelation may only provide as much information as 80 independent Y's. Since we have less information, the parameter estimates are less certain than in an independent sample.

Testing for Autocorrelation

One test for autocorrelation is to use the Durbin-Watson statistic. It is calculated using the following formula:

DW = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t².

It compares the variation of the residuals about the lagged residuals (the numerator) to the variation in the residuals (the denominator). The Durbin-Watson statistic assumes values between 0 and 4. A value close to 0 indicates strong positive autocorrelation. A value close to 2 indicates no autocorrelation and a value close to 4 indicates strong negative autocorrelation. Basically look for values a long way from 2. It can be shown that DW ≈ 2 − 2r, where r is the autocorrelation between the residuals. Therefore, an alternative to the Durbin-Watson statistic is to simply calculate the correlation between the residuals and the lagged residuals. Notice that if the correlation is zero the DW statistic should be close to 2.

What to do when you Detect Autocorrelation

The easiest way to deal with autocorrelation is to incorporate the lagged residuals in the regression as a new X variable. In other words, fit the model with whatever X variables you want to use, save the residuals, lag them, and refit the model including the lagged residuals. This will generally improve the accuracy of the predictions as well as remove the autocorrelation.
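A short sketch of both checks described above, applied to a simulated residual series:

```python
# Durbin-Watson statistic and lag-1 residual correlation for a residual vector.
import numpy as np

def durbin_watson(residuals):
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)
e = rng.normal(size=100)
e = e + 0.6 * np.concatenate([[0.0], e[:-1]])   # inject positive autocorrelation

dw = durbin_watson(e)
r = np.corrcoef(e[1:], e[:-1])[0, 1]
print(round(dw, 2), round(2 - 2 * r, 2))        # the two should be close
```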



Short Memory versus Long Memory


Autocorrelation is called a short memory phenomenon because it implies that the Y's are affected by (or "remember") what happened in the recent past. Long memory can also be important in making future predictions. Some examples of long memory are:

Trend. This is a long run change in the average value of Y. For example, industry growth.

Seasonal. This is a predictable pattern with a fixed period. For example, if you are selling outdoor furniture you would expect your summer sales to be higher than winter sales irrespective of any long term trend.

Cyclical. These are long run patterns with no fixed period. An example is the boom and bust business cycle where the entire economy goes through a boom period where everything is expanding and then a bust where everything is contracting. Predicting the length of such periods is very difficult, and unless you have a lot of data it can be hard to differentiate between cycles and long run trends. Since we have so little time on time series data we won't worry about cyclical variation in this class.

To model a long term trend in the data we treat time as another predictor. Just as with any other predictor we need to check what sort of relationship there is between it and Y (i.e. none, linear, non-linear) and transform as appropriate. To model seasonal variation we should incorporate a categorical variable indicating the season. For example, if we felt that sales may depend on the quarter, i.e. Winter, Spring, Summer or Autumn, we would add a categorical variable indicating which of these 4 time periods each of the data points corresponds to. On the other hand we may feel that each month has a different value, in which case we would add a categorical variable with 12 levels, one for each month. When we fit the model JMP will automatically create the correct dummy variables to code for each of the seasons. As with any other predictors we should look at the appropriate p-values and plots to check whether the variables are necessary and none of the regression assumptions have been violated. By incorporating time, seasons and lagged residuals in the regression model we can deal with a large range of possible time series problems.
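Outside JMP, the same recipe (trend, seasonal dummies, lagged residual) looks roughly like this; the monthly sales data here are simulated purely for illustration.

```python
# Trend + quarterly dummies + lagged residual, on made-up monthly sales data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "sales": 100 + 0.5 * np.arange(60) + rng.normal(0, 5, 60),
    "t": np.arange(60),
})
df["quarter"] = ((df["t"] // 3) % 4 + 1).astype(str)   # quarters 1-4, repeating

# Step 1: time trend plus seasonal dummies (C() builds the quarter dummies).
fit1 = smf.ols("sales ~ t + C(quarter)", data=df).fit()

# Step 2: add the lagged residual as an extra predictor and refit.
df["lag_resid"] = fit1.resid.shift(1)
fit2 = smf.ols("sales ~ t + C(quarter) + lag_resid", data=df.dropna()).fit()
print(fit2.params)
```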

7.3 More on Probability Distributions

This section explains more about some standard probability distributions other than the normal distribution. You may encounter some of these distributions in your operations class.




Figure 7.1: PDF (left) and CDF (right) for the normal distribution.

7.3.1 Background

There are two basic types of random variables: discrete and continuous. Discrete random variables are typically integer valued. That is: they can be 0, 1, 2, . . . . Continuous random variables are real valued, like 3.141759. All random variables can be characterized by their cumulative distribution function (or CDF), F(x) = Pr(X ≤ x).

CDFs have three basic properties: they start at 0, they stop at 1, and they never decrease as you move from left to right. We are already familiar with one CDF: the normal table! If a random variable is continuous you can take the derivative of its CDF to get its probability density function (or PDF). The normal PDF and CDF are plotted in Figure 7.1. If you want to understand a random variable it is easier to think about its PDF because the PDF looks like a histogram. However, the CDF is a handy thing to have around, because you can calculate the probability that your random variable X is in any interval (a, b] by F(b) − F(a), just like we did with the normal table in Section 2.4. If a random variable is discrete you can calculate its probability function P(X = x).




Figure 7.2: The exponential distribution (a) PDF (b) CDF.

7.3.2 Exponential Waiting Times

The shorthand notation to say that X is an exponential random variable with rate λ is X ~ E(λ). The exponential distribution has the density function f(x) = λe^(−λx) and CDF F(x) = 1 − e^(−λx) for x > 0. These are plotted in Figure 7.2. One famous property of the exponential distribution is that it is memoryless. That is, if X is an exponential random variable with rate λ, and you've been waiting a day for the event to occur, the distribution of the additional time you still have to wait is the same E(λ) distribution you started with.
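A quick numerical illustration of memorylessness (using an assumed rate λ = 2 and scipy's exponential distribution, which is parameterized by scale = 1/λ):

```python
# Checking the memoryless property numerically.
from scipy import stats

lam = 2.0
X = stats.expon(scale=1 / lam)

s, t = 1.0, 0.5
p_conditional = X.sf(s + t) / X.sf(s)    # P(X > s + t | X > s)
p_fresh = X.sf(t)                        # P(X > t)
print(round(p_conditional, 6), round(p_fresh, 6))   # equal
```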

7.3.3 Binomial and Poisson Counts

The binomial probability function is

P(X = x) = (n choose x) p^x (1 − p)^(n−x).
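If you need these probabilities in practice, scipy computes them directly; for example:

```python
# Binomial and Poisson probabilities with scipy.
from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.2))   # P(X = 3) for B(10, 0.2)
print(stats.poisson.pmf(3, mu=3))        # P(X = 3) for Po(3)
```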




Figure 7.3: Poisson (a) probability function (b) CDF with λ = 3.

7.3.4 Review

7.4 Planning Studies

7.4.1 Different Types of Studies

In this Chapter we talk about three different types of studies, which are primarily distinguished by their goal. The concept of randomization plays an important role in study design. Different types of randomization are required to achieve the different goals.

Experiment: The goal of an experiment is to determine whether one action (called a treatment) causes another (the effect) to occur. For example, does receiving a coupon in the mail cause a customer to change the type of beer he buys in the grocery store? In an experiment, sample units (the basic unit of study: people, cars, supermarket transactions) are randomly assigned to one or more treatment levels (e.g. bottles/cans) and an outcome measure is recorded. Experiments can be designed to simultaneously test more than one type of treatment (bottles/cans and 12-pack/6-pack).

Survey: The goal of a survey is to describe a population by collecting a small sample from the population. Surveys are more interested in painting an accurate picture of the population than determining causal relationships between variables.

Distribution          f(x)                              CDF           E(X)   Var(X)    Type        Model for
Normal N(μ, σ)        (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))    tables        μ      σ²        continuous  lots of stuff
Exponential E(λ)      λ e^(−λx)                         1 − e^(−λx)   1/λ    1/λ²      continuous  waiting times
Poisson Po(λ)         λ^x e^(−λ) / x!                   tables        λ      λ         discrete    counts (no known max)
Binomial B(n, p)      (n choose x) p^x (1−p)^(n−x)      tables        np     np(1−p)   discrete    counts (known max n)
Table 7.1: Summary of random variables.

The key issue in a survey is how to decide which units from the population are to be included in the sample. The best surveys randomly select units from the overall population. Note that there are many different strategies that may be employed to randomly select units for inclusion in a survey, and the survey data should be analyzed with the randomization strategy in mind. A survey in which the entire population is included is called a census. Conducting a census is expensive, time consuming, and it may even be impossible for practical or ethical reasons. There can also be some subtle problems with a census, such as what to do if some people refuse to respond. A carefully conducted survey can sometimes produce more accurate results than a census!

Observational Study: Like an experiment, the goal of an observational study is to draw causal conclusions. However, the sample units in an observational study are not under the control of the study designer. The conclusions drawn from an observational study are always subject to criticism, but observational studies are the least expensive type of study to perform. As a result they are the most common.

7.4.2 Bias, Variance, and Randomization

Whenever a study is performed to estimate some quantity there are two possible problems, bias and variance. To say that a study is biased means that there is something systematically wrong with the way the study is conducted. Mathematically speaking, bias is the difference between the expected value of the sample statistic the study is designed to produce and the population parameter that the statistic estimates. For example, in a survey obtained by simple random sampling, we know



that E(X̄) = μ, so the bias is zero. If the bias is zero we call the estimator unbiased. If an unbiased study were performed many times, sometimes its estimates would be too large and sometimes they would be too small, but on average they would be correct. In the context of study design, variance simply means that if you performed the study again you would get a slightly different answer. As long as a study is unbiased, we have very effective tools (confidence intervals and hypothesis tests) at our disposal for measuring variance. If these tools indicate that we have not measured a phenomenon accurately enough, then we can reduce the variance by simply gathering a larger sample. If an estimator is biased it will get the wrong answer even if we have no variance. In large studies (where the variance is small because n is so large) bias is usually a much worse problem than variance. Ideally we want to produce an estimate that has both low bias and low variance. Unfortunately bias is often hard (and sometimes impossible) to detect. That is why randomization is so important in designing studies. We can't control bias, but we can control variance by collecting more data. Randomization is a tool for turning bias (which we can't control) into variance (which we can).

7.4.3 Surveys

Randomization in Surveys

The whole idea of a survey is to generalize the results from your sample to a larger population. You can only do so if you are sure that your sample is representative of the population. Randomly selecting units from the population is vital to ensuring the representativeness of your sample. This is counter-intuitive to some people. You may think that if you carefully decide which units should be included in the sample then you can be sure it is representative of the population. However, bias can sneak into a deterministic sampling scheme in some very subtle ways. A famous example is the Dewey/Truman U.S. presidential election, where the Gallup organization used quota sampling in its polling. The survey was conducted through personal interviews, where each interviewer was given a quota describing the characteristics of who they should interview (e.g. 12 white males over 40, 7 Asian females under 30). However, there is a limit to the precision that can be placed on quotas. The interviewers followed the quotas, but still managed to show a bias towards interviewing Republicans who planned to vote for Dewey. In fact Truman, the Democratic candidate, won the election.



Figure 7.4: The famous picture of Harry S. Truman after he defeated Thomas E. Dewey
in the 1948 U.S. Presidential election.

Steps in a Survey

Define population. Clearly defining the population you want to study is important because it defines the sampling unit, the basic unit of analysis in the study. If you are not clear about the population you want to study, then it may not be obvious how you should sample from the population. For example, do you want to randomly sample transactions or accounts (which might contain several transactions)? If you sample the wrong thing you can end up with size bias, where larger units (e.g. transactions occurring in busy accounts) have a higher probability of being selected than smaller units (transactions occurring in small accounts).

Construct sampling frame. The sampling frame is a list of almost all units in the population. For example, the U.S. Census is often used as a sampling frame. The sampling frame may have some information about the units in the population, but it does not contain the information needed for your study. Frame coverage bias occurs when there are some units in the population that do not appear on the sampling frame. Sometimes the sampling frame is an explicit list. Sometimes it is implicit, such as "the people watching CNN right now." Explicit sampling frames give your results greater credibility.

Select sample. There are several possible strategies that can be employed to randomly sample units from the sampling frame. The simplest is a simple random sample. Simple random sampling is equivalent to drawing units out of a hat. There are other sampling methods, such as stratified random sampling,



cluster sampling, two stage sampling, and many others. The thing that determines whether a survey is scientific is whether the sampling probabilities are known (i.e. the probability of each individual in the population appearing in the sample). If the sampling probabilities are unknown, then the survey very likely suffers from selection bias. A common form of selection bias (called self-selection) occurs when people (on an implicit sampling frame) are encouraged to phone in or log on to the internet and voice their opinions. Typically, only the strongest opinions are heard. Another form of selection bias occurs with a convenience sample composed of the units that are easy for the study designer to observe. We experience selection bias from convenience samples every day. Have you ever wondered how Boy Band X can have the number one hit record in the nation, but you don't know anyone who owns a copy? Your circle of friends is a convenience sample. Convenience samples occur very often when the first n records are extracted from a large database. Those records might be ordered in some important way that you haven't thought of. For example, employment records might have the most senior employees listed first.

Collect data. Particularly when you are surveying people, just because you have selected a unit to appear in your survey does not mean they will agree to do so. Non-response bias occurs when people that don't answer your questions are systematically different from those people that do. The fraction of people who respond to your survey questions is called the response rate. The response rate is typically highest for surveys conducted in person. It is much lower for phone surveys, and lower still for surveys conducted by mail. Many survey agencies actually offer financial incentives for people to participate in the survey in order to increase the response rate.

Analyze data. The data analysis must be done with the sampling scheme in mind. The techniques we have learned (and will continue to learn) are appropriate for a simple random sample. If applied to data from another sampling scheme (stratified sampling, cluster sampling, etc.) they can produce biased results. The methods we know about can be modified to handle other sampling schemes, typically by assigning each observation a weight which can be computed using knowledge of the sampling strategy. We will not discuss these weighting schemes.

Other Types of Random Sampling


Cluster sampling.
Stratified sampling.
Two stage sampling.

7.4.4 Experiments

Unlike surveys, the theory of experiments does not concern itself with generalizing the results of the experiment to a larger population. The goal of an experiment is to infer a causal relationship between two variables. The first variable (X) is called a treatment and is set by the experimenter. The second variable (Y) is called the response and is measured by the experimenter. Experiments can be conducted with several different treatment and response variables simultaneously. The study of the right way to organize and analyze complicated experiments falls under a sub-field of statistics called experimental design. Because the goal of an experiment is different than that of a survey, a different type of randomization is required. Surveys randomly select which units are to be included in the study to ensure that the survey is representative of the larger population. Experiments randomly assign units to treatment levels to make sure there are no systematic differences between the different groups the experiment is designed to compare. For example, in testing a new drug experimental subjects are randomly assigned to one of two groups. The treatment group is given the drug, while the control group is given a placebo. We then compare the two groups to see if the treatment seems to be helping.

Issues in Experimental Design

Confounding. This is where, for example, all women are given the drug and all men are given the placebo. If we then find that the treatment group did better than the control group we have a problem because we can't tell whether this is because of the drug or because of the gender.

Lurking Variables. You would obviously never conduct an experiment by assigning all the men to one treatment level and all the women to another. But what if there were some other systematic difference between the treatment and control groups that was not so obvious to you? Then any difference you observe between the groups might be because of that other variable. For example, the fact that people who smoke get cancer does not prove that smoking causes cancer. It is possible that there is an unobserved variable (e.g. a defective gene) that causes people to both smoke and develop cancer. The variable is unobserved (lurking) so it is difficult to tell for sure. The best way to overcome

7.4. PLANNING STUDIES

171

this problem is to randomly assign people to groups. Random assignment ensures that any systematic dierences among individuals, even those that you might not know to account for, are evenly spread between two groups. If units are randomly assigned to treatment levels, then the only systematic dierence between the groups being compared is the treatment assignment. In that case you can be condent that a statistically signicant dierence between the groups was caused by the treatment. Placebo Eect People in treatment group may do better simply because they think they should! This problem can be eliminated by not telling subjects which group they are in, or telling observers which group they are measuring. This is called a double blind trial.
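To make the mechanics of random assignment concrete, here is a minimal sketch in Python; the subject IDs are hypothetical and the half-and-half split is just one common choice.

    import random

    # A minimal sketch of random assignment to treatment levels (hypothetical subject IDs).
    subjects = [f"subject_{i:02d}" for i in range(1, 21)]

    random.seed(1)                 # fixed seed so the assignment is reproducible
    random.shuffle(subjects)       # put the subjects in a random order
    half = len(subjects) // 2
    treatment = subjects[:half]    # these subjects receive the drug
    control = subjects[half:]      # these subjects receive the placebo

    print(treatment)
    print(control)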

7.4.5 Observational Studies

An observational study is an experiment where randomization is impossible or undesirable. For example, it is not ethical to randomly make some people smoke and make others not smoke. Studies involving smoking and cancer rates are observational studies because the subjects decide whether or not they will smoke. Observational studies are always subject to criticism due to possible lurking variables. Another way to say this is that observational studies suffer from omitted variable bias. Omitted variable bias is illustrated by Simpson's paradox, which simply says that conditioning an analysis on a lurking variable can change the conclusion. Here is an example. We are comparing the batting averages for two baseball players. (Batting averages are the fraction of the time that a player hits the ball, times 1000. So someone who bats 250 gets a hit 25% of the time.) First we look at the overall batting average for each player for the entire season.

Player   Whole Season
A        251
B        286

It appears that player B is the better batter. However, if we look at the averages for the two halves of the season we get a completely different conclusion.

Player   First Half   Second Half
A        300          250
B        290          200

How is it possible that player A can have a higher average in both halves of the season but a lower average overall? Upon closer examination of the numbers we see that

              First Half           Second Half
Player    Hits    at-Bats      Hits    at-Bats
A         3       10           100     400
B         58      200          2       10
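A quick calculation confirms the paradox. The sketch below (in Python) recomputes each player's half-season and whole-season averages directly from the hit and at-bat counts in the table above.

    # Hits and at-bats for each half of the season, taken from the table above.
    first = {"A": (3, 10), "B": (58, 200)}
    second = {"A": (100, 400), "B": (2, 10)}

    for player in ("A", "B"):
        h1, ab1 = first[player]
        h2, ab2 = second[player]
        print(player,
              round(1000 * h1 / ab1),                  # first-half average
              round(1000 * h2 / ab2),                  # second-half average
              round(1000 * (h1 + h2) / (ab1 + ab2)))   # whole-season average
    # Output: A 300 250 251 and B 290 200 286, matching the tables in the text.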

The batting average for both players is lower during the second half of the season (maybe because of better pitching, or worse weather). Most of player A's attempts came during the second half, and vice-versa for player B. Therefore if we condition on which half of the season it is we get a quite different conclusion.
Here is another example based on a study of admissions to Berkeley graduate programs.

Gender    Admitted    Not Admitted    Totals    % Admitted
Men       3738        4704            8442      44.3
Women     1494        2827            4321      34.6
Totals    5232        7531            12763

44.3% of men were admitted while only 34.6% of women were. On the surface there appears to be strong evidence of gender bias in the admission process. However, look at the numbers if we do a comparison on a department by department basis.

                    Men                            Women
Program    No. Applicants    % Admitted    No. Applicants    % Admitted
A          825               62            108               82
B          560               63            25                68
C          325               37            593               34
D          417               33            375               35
E          191               28            393               24
F          373               6             341               7
Total      2691              45            1835              30

The percentage of women admitted is generally higher within each department. Only departments C and E have slightly lower rates for women, and department A seems, if anything, to be biased in favor of women. How is this possible? Notice that men tend to apply in greater numbers to the easy programs (A and B) while women go for the harder ones (C, D, E and F). Hence the lower overall percentage of women admitted is simply a result of women applying to the harder programs. Unfortunately, statistics cannot determine whether this means men are smart or lazy.
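The department-level and overall figures are consistent with each other: weighting each program's admission rate by its number of applicants reproduces the Total row of the table, as the short Python check below shows.

    # Applicant counts and admission percentages from the department table above.
    men   = {"A": (825, 62), "B": (560, 63), "C": (325, 37),
             "D": (417, 33), "E": (191, 28), "F": (373, 6)}
    women = {"A": (108, 82), "B": (25, 68),  "C": (593, 34),
             "D": (375, 35), "E": (393, 24), "F": (341, 7)}

    def overall_rate(table):
        applicants = sum(n for n, _ in table.values())
        admitted = sum(n * pct / 100 for n, pct in table.values())
        return 100 * admitted / applicants

    print(overall_rate(men))    # about 44.5, matching the 45 in the Total row
    print(overall_rate(women))  # about 30.3, matching the 30 in the Total row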

Multiple regression is the main statistical tool for dealing with lurking variables in observational studies: by including a potential lurking variable as an additional X variable, we can ask whether an apparent difference persists once that variable is held fixed. The connection runs both ways, because regression coefficients estimated from observational data carry the same caveat: they can only adjust for the lurking variables we have actually measured.
Using linear regression has several advantages over the two sample t-test. The first is that linear regression allows you to incorporate other variables (once we learn about multiple regression). Recall that possible lurking variables often make it hard to determine whether an apparent difference between, say, Males and Females, is caused by gender or by some other lurking variable. If a difference between genders is still apparent after including possible lurking variables, then this suggests (though does not prove) that the difference is really caused by gender. Using linear regression we can incorporate any other possible lurking variables. The two sample t-test does not allow this. For example, in the compensation example males were paid significantly more than females. However, when we incorporated the level of responsibility we got

Yi = 112.8 + 6.06 Position + 1.86   if the ith person is Female
Yi = 112.8 + 6.06 Position - 1.86   if the ith person is Male,

which works out to 6.06 Position + 114.7 for Females and 6.06 Position + 110.9 for Males.

So the conclusion was reversed. It turns out that it is simply the fact that there are more women at a lower level of responsibility that makes it seem that women are being discriminated against. This of course leaves open the question of why women are at a lower level of responsibility. The second reason for using linear regression is that it facilitates a comparison of more than two groups, whereas the two sample t-test can only handle two groups. However, the two sample t-test is commonly used in practice, so it is important that you understand how it works.
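To see how a regression with a gender dummy can reverse a raw comparison, here is a minimal sketch with simulated data (not the course's compensation data): women are placed mostly at lower Position levels, salary is generated with a positive Female effect, and the regression that includes Position recovers that effect even though the raw means favor men.

    import numpy as np

    # Simulated compensation-style data in which women sit at lower Position levels.
    rng = np.random.default_rng(0)
    n = 200
    female = rng.integers(0, 2, size=n)                # 1 = Female, 0 = Male
    position = np.where(female == 1,
                        rng.integers(1, 4, size=n),    # Females mostly at levels 1-3
                        rng.integers(3, 7, size=n))    # Males mostly at levels 3-6
    salary = 100 + 6 * position + 4 * female + rng.normal(0, 5, size=n)

    # Raw comparison (what a two sample t-test looks at): men appear to be paid more.
    print("raw gap (Male - Female):",
          salary[female == 0].mean() - salary[female == 1].mean())

    # Regression with Position and a Female dummy: the adjusted Female effect is positive.
    X = np.column_stack([np.ones(n), position, female])
    coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
    print("adjusted Female coefficient:", coef[2])     # close to the +4 built into the simulation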

7.4.6 Summary

Surveys: Random selection ensures the survey is representative. Randomized surveys can generalize their results to the population.

Experiments: Random treatment assignment prevents lurking variables from interfering with causal conclusions. Randomized experiments allow you to conclude that differences in the outcome variable are caused by the different treatments.

Observational Studies: No randomization is possible. Control for as many things as you can to silence your critics. If a relationship persists, perhaps it's real. Do an experiment (if possible/ethical) to verify.

Congratulations!!!
You've completed statistics (unless you skipped to the back to see how it all ends, in which case: get back to work)! You may be wondering what happens next. No, we mean after the drinking binge. Statistics classes tend to evoke one of two reactions: either you're praying that you never see this stuff again, or you're intrigued and would like to learn more. If you're in the first camp I've got some bad news for you. Be prepared to see regression analysis used in several of your remaining Core classes and second year electives. The good news is that all your hard work learning the material here means that these other classes won't look nearly as frightening. For those of you who found this course interesting, you should know that it was a REAL statistics course. It qualifies you to go compete with the Whartons and Chicagos of the world for quantitative jobs and summer internships. If you want to see more, here are some courses you should consider.

Data Mining: Arif Ansari
Data mining is the process of automating information discovery. The course is focused on developing a thorough understanding of how business data can be efficiently stored and analyzed to generate valuable business information. Business applications are emphasized. The amount of data collected is growing at a phenomenal rate. The users of the data are expecting more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers' past purchases as well as predictions of future purchases. Simple structured query language queries are not adequate to support these increased demands for information. Data mining steps in to solve these needs. In this course you will learn the various techniques used in data mining, like decision trees, neural networks, CART, association rules, etc. This course gives you hands-on experience on how to apply data mining techniques to real world business problems.

Data Mining is especially useful to a marketing organization, because it allows you to profile customers to a level not possible before. Distributors of mass mailers today generally all use data mining tools. In a few years data mining will be a requirement of marketing organizations.

IOM 522 - Time Series Analysis for Forecasting
Professor Delores Conway, winner of the 1998 University Associates award for excellence in teaching. Forecasts of consumer demand, corporate revenues, earnings, capital expenditures, and other items are essential for marketing, finance, accounting and operations. This course emphasizes the usefulness of regression, smoothing and Box-Jenkins forecasting procedures for analyzing time series data and developing forecasts. Topics include the concept of stationarity, autoregressive and moving average models, identification and estimation of models, prediction and assessment of model forecasts, seasonal models, and intervention analysis. Students obtain practical experience using ForecastX (a state-of-the-art, Excel based package) to analyze data and develop actual forecasts. The analytical skills learned from the class are sophisticated and marketable, with wide application.

Appendix A

JMP Cheat Sheet


This guide will tell you the JMP commands for implementing the techniques discussed in class. This document is not intended to explain statistical concepts or give detailed descriptions of JMP output. Therefore, don't worry if you come across an unfamiliar term. If you don't recognize a term like variance inflation factor then we probably just haven't gotten that far in class.

A.1 Get familiar with JMP.

You should familiarize yourself with the basic features of JMP by reading the first three chapters of the JMP manual. You don't have to get every detail, especially on all the Formula Editor functions, but you should get the main idea about how stuff works. Some of the things you should be able to do are:
- Open a JMP data table.
- Understand the basic structure of data tables. Rows are observations, or sample units. Columns are variables, or pieces of information about each observation.
- Identify the modeling type (continuous, nominal, ordinal) of each variable. Be able to change the modeling type if needed.
- Make a new variable in the data table using JMP's formula editor. For example, take the log of a variable.
- Use JMP's tools to copy a table or graph and paste it into Word (or your favorite word processor).
- Use JMP's online help system to answer questions for you.

A.2 Generally Neat Tricks

A.2.1 Dynamic Graphics

JMP graphics are dynamic, which means you can select an item or set of items in one graph and they will be selected in all other graphs and in the data table.

A.2.2 Including and Excluding Points

Sometimes you will want to determine the impact of a small number of points (maybe just a single point) on your analysis. You can exclude a point from an analysis by selecting the point in any graph or in the data table and choosing Rows Exclude/Unexclude. Note that excluding a point will remove it from all future numerical calculations, but the point may still appear in graphs. To eliminate the selected point from graphs, select Rows Hide/Unhide. You can re-admit excluded and/or hidden points by selecting them and choosing Rows Exclude/Unexclude a second time. An easy way to select all excluded points is by double clicking the Excluded line in the lower left portion of the data table. You can also choose Rows Row Selection Select Excluded from the menu.

A.2.3 Taking a Subset of the Data

The easiest way to take a subset of the data is by selecting the observations you want to include in the subset and choosing Tables Subset from the menu. For example, suppose you are working with a data set of CEO salaries and you only want to investigate CEOs in the finance industry. Make a histogram of the industry variable (a categorical variable in the data set), and click on the Finance histogram bar. Then choose Tables Subset from the menu and a new data table will be created with just the finance CEOs.

A.2.4 Marking Points for Further Investigation

Sometimes you notice an unusual point in one graph and you want to see if the same point is unusual in other graphs as well. An easy way to do this is to select the point in the graph, and then right click on it. Choose Markers and select the plotting character you want for the point. The point will appear with the same plotting character in all other graphs.

A.2.5 Changing Preferences

There are many ways you can customize JMP. You can select different toolbars, set different defaults for the analysis platforms (Distribution of Y, Fit Y by X, etc.), choose a default location to search for data files, and several other things. Choose File Preferences from the menu to explore all the things you can change.

A.2.6 Shift Clicking and Control Clicking

Sometimes you may want to select several variables from a list or select several points on a graph. You can accomplish this by holding down the shift or control keys as you make your selections. Note that shift and control clicking in JMP works just like it does in other Windows applications.

A.3 The Distribution of Y

All the following instructions assume you have launched the Distribution of Y analysis platform.

A.3.1 Continuous Data

By default you will see a histogram, boxplot, a list of quantiles, and a list of moments (mean, standard deviation, etc.) including a 95% confidence interval for the mean.
1. One Sample T Test. Click the little red triangle on the gray bar over the variable whose mean you want to do the T-test for. Select Test Mean. A dialog box pops up. Enter the mean you wish to use in the null hypothesis in the first field. Leave the other fields blank. Click OK.


2. Normal Quantile Plot (or Q-Q Plot). Click the little red triangle on the gray bar over the variable you want the Q-Q plot for. Select Normal Quantile Plot.

A.3.2 Categorical Data

Sometimes categorical data are presented in terms of counts, instead of a big data set containing categorical information for each observation. To enter this type of data into JMP you will need to make two columns. The first is a categorical variable that lists the levels appearing in the data set. For example, the variable Race may contain the levels Black, White, Asian, Hispanic, etc. The second is a numerical list revealing how many times each level appeared in the data set. Suppose you name this variable counts. When you launch the Distribution of Y analysis platform, select Race as the Y variable and enter counts in the Frequency field.

A.4 Fit Y by X

All the following instructions assume you have launched the Fit Y by X analysis platform. The appropriate analysis will be determined for you by the modeling type (continuous, ordinal, or nominal) of the variables you select as Y and X.

A.4.1 The Two Sample T-Test (or One Way ANOVA)

In the Fit Y by X dialog box, select the variable whose means you wish to test as Y. Select the categorical variable identifying group membership as X. For example, to test whether a significant salary difference exists between men and women, select Salary as the Y variable and Sex as the X variable. A dotplot of the data will appear.
1. How to do a T-Test. Click on the little red triangle on the gray bar over the dotplot. Select Means/ANOVA/T-test.
2. Manipulating the Data. Sometimes the data in the data table must be manipulated into the form that JMP expects in order to do the two sample T test. The Tables menu contains two options that are sometimes useful. Stack will take two or more columns of data and stack them into a single column. Split will take a single column of data and split it into several columns.
3. Display Options. The Display Options sub-menu under the little red triangle controls the features of the dotplot. Use this menu to add side-by-side boxplots or means diamonds, to connect the means of each subgroup, or to limit over-plotting by adding random jitter to each point's X value.

A.4.2 Contingency Tables/Mosaic Plots

In the Fit Y by X dialog box select one categorical variable as Y and another as X. As far as the contingency table is concerned, it doesn't matter which is which. The mosaic plot will put the X variable along the horizontal axis and the Y variable on the vertical axis (as you would expect).


1. Entering Tables Directly Into JMP. Sometimes categorical data are presented in terms of counts, instead of a big data set containing categorical information for each observation. To enter this type of data into JMP you will need to make three columns. The first is a categorical variable that lists the levels appearing in the X variable. For example, the variable Race may contain the levels Black, White, Asian, Hispanic, etc. The second is a list of the levels in the Y variable. For example, the variable Sex may contain the levels Male, Female. Each level of the X variable must be paired with each level of the Y variable. For example, the first row in the data table might be Black, Female. The second row might be Black, Male. The final column is a numerical list revealing how many times each combination of levels appeared in the data set. Suppose you name this variable counts. When you launch the Fit Y by X analysis platform, select Race as the X variable, Sex as the Y variable, and enter counts in the Frequency field.
2. Display Options for a Contingency Table. By default, contingency tables show counts, total %, row %, and column %. You can add or remove listings from cells of the contingency table by using the little red triangle in the gray bar above the table.

A.4.3 Simple Regression

In the Fit Y by X dialog box select the continuous variable you want to explain as Y and the continuous variable you want to use to do the explaining as X. A scatterplot will appear. The commands listed below all begin by choosing an option from the little red triangle on the gray bar above the scatterplot.
1. Fitting Regression Lines. Select Fit Line from the little red triangle.
2. Fitting Non-Linear Regressions.
(a) Transformations. Choose Fit Special from the little red triangle. A dialog box appears. Select the transformation you want to use for Y, and for X. Click Okay. If the transformation you want to use does not appear on the menu you will have to do it by hand using JMP's formula editor. Simply create a new column in the data set and fill the new column with the transformed data. Then use this new column as the appropriate X or Y in a linear regression.
(b) Polynomials. Choose Fit Polynomial from the little red triangle. You should only use a degree greater than two if you have a strong theoretical reason to do so.
(c) Splines. Choose Fit Spline from the little red triangle. You will have to specify how wiggly a spline you want to see. Trial and error is the best way to do this.
3. The Regression Manipulation Menu. You may want to ask JMP for more details about your regression after it has been fit. Each regression you fit causes an additional little red triangle to appear below the scatterplot. Use this little red triangle to:
- Save residuals and predicted values
- Plot residuals
- Plot confidence and prediction curves


A.4.4 Logistic Regression

In the Fit Y by X dialog box choose a nominal variable as Y and a continuous variable as X. There are no special options for you to manipulate. You have more options when you fit a logistic regression using the Fit Model platform.

A.5 Multivariate

Launch the Multivariate platform and select the continuous variables you want to examine.
1. Correlation Matrix. You should see this by default.
2. Covariance Matrix. You have to ask for this using the little red triangle.
3. Scatterplot Matrix. You may or may not see this by default. You can add or remove the scatterplot matrix using the little red triangle. You read each plot in the scatterplot matrix as you would any other scatterplot. To determine the axes of the scatterplot matrix you must examine the diagonal of the matrix. The column the plot is in determines the X axis, while the plot's row in the matrix determines the Y axis. The ellipses in each plot would contain about 95% of the data if both X and Y were normally distributed. Skinny, tilted ellipses are a graphical depiction of a strong correlation. Ellipses that are almost circles are a graphical depiction of weak correlation.

A.6 Fit Model (i.e. Multiple Regression)

The Fit Model platform is what you use to run sophisticated regression models with several X variables.

A.6.1 Running a Regression

Choose the Y variable and the X variables you want to consider in the Fit Model dialog box. Then click Run Model.

A.6.2 Once the Regression is Run

1. Variance Inflation Factors. Right click on the Parameter Estimates box and choose Columns VIF in the menu that appears.
2. Confidence Intervals for Individual Coefficients. Right click on the Parameter Estimates box and choose Columns Lower 95% in the menu that appears. Repeat to get the Upper 95%, completing the interval.
3. Save Columns. This menu lives under the big gray bar governing the whole regression. Options under this menu will save a new column to your data table. Use this menu to obtain:
- Cook's Distance
- Leverage (or Hats)
- Residuals
- Predicted (or Fitted) Values
- Confidence and Prediction Intervals
This menu is especially useful for making new predictions from your regression. Before you fit your model, add a row of data including the X variables you want to use in your prediction. Leave the Y variable blank. When you save predicted values and intervals JMP should save them to the row of fake data as well.
4. Row Diagnostics. This menu lives under the big gray bar governing the whole regression. Options under this menu will add new tables or graphs to your regression output. Use this menu to obtain:
- The Residual Plot (Residuals by Predicted Values)
- The Durbin Watson Statistic
- A Plot of Residuals by Row

5. Expanded Estimates The expanded estimates box reveals the same information as the parameter estimates box, but categorical variables are expanded to include the default categories. To obtain the expanded estimates box select Estimates Expanded Estimates from the little red triangle on the big gray bar governing the whole regression.

A.6.3 Including Interactions and Quadratic Terms

To include a variable as a quadratic, go to the Fit Model dialog box. Select the variable you want to include as a quadratic. Then click on the Macros menu and select Polynomial to Degree. The Degree field under the Macros button controls the degree of the polynomial. The Degree field shows 2 by default, so the Polynomial to Degree macro will create a quadratic. There are two basic ways to include an interaction term. The first is to select the two variables you want to include as an interaction (perhaps by holding down the Shift or Control key) and hitting the Cross button. The second is to select two or more variables that you want to use in an interaction (using the Shift or Control key) and select Macros Factorial to Degree.

A.6.4 Contrasts

To test a contrast in a multiple regression, go to the leverage plot for the categorical variable whose levels you want to test. Click LS Means Contrast. In the table that pops up, click the +/- signs next to the levels you want to test until you get the weights you want. Then click Done to compute the results of the test.

A.6.5 To Run a Stepwise Regression

In the Fit Model dialog box, change the Personality to Stepwise. Enter all the X variables you wish to consider (including any interactions; see the instructions on interactions given above). Then click Run Model. The Stepwise Regression dialog box appears. Change Direction to Mixed and make the probability to enter and the probability to leave small numbers. The probability to enter should be less than or equal to the probability to leave. Then click Go. When JMP settles on a model, click Make Model to get to the familiar regression dialog box. You may want to change Personality to Effect Leverage (if necessary) so that you will get the leverage plots.


One of the odd things about JMP's stepwise regression procedure is that it creates an unusual coding scheme for dummy variables. Suppose you have a categorical variable called color, with levels Red, Blue, Green, and Yellow. JMP's stepwise procedure may create a variable named something like color[Red&Yellow-Blue&Green]. This variable assumes the value 1 if the color is red or yellow. It assumes the value -1 if the color is blue or green. This type of dummy variable compares colors that are either red or yellow to colors that are either blue or green.
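If it helps to see what such a coded variable looks like, here is a small sketch in Python (the color values are made up); it simply assigns +1 to Red or Yellow and -1 to Blue or Green.

    import numpy as np

    # Hypothetical colors; color[Red&Yellow-Blue&Green] is +1 for Red/Yellow, -1 for Blue/Green.
    color = np.array(["Red", "Blue", "Green", "Yellow", "Red", "Green"])
    contrast = np.where(np.isin(color, ["Red", "Yellow"]), 1, -1)
    print(contrast)   # [ 1 -1 -1  1  1 -1]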

A.6.6 Logistic Regression

Logistic regression with several X variables works just like regression with several X variables. Just choose a binary (categorical) variable as Y in the Fit Model dialog box. Check the little red triangle on the big gray bar to see the options you have for logistic regression. You can save the following for each item in your data set: the probability an observation with the observed X would fall in each level of Y, the value of the linear predictor, and the most likely level of Y for an observation with those X values.


Appendix B

Some Useful Excel Commands


Excel contains functions for calculating probabilities from the normal, T, χ², and F distributions. Each distribution also has an inverse function. You use the regular function when you have a potential value for the random variable and you want to compute a probability. You use the inverse function when you have a probability and you want to know the value to which it corresponds.

Normal Distribution
Normdist(x, mean, sd, cumulative). If cumulative is TRUE this function returns the probability that a normal random variable with the given mean and standard deviation is less than x. If cumulative is FALSE, this function gives the height of the normal curve evaluated at x.
Example: The salaries of workers in a factory are normally distributed with mean $40,000 and standard deviation $7,500. What is the probability that a randomly chosen worker from the factory makes less than $53,000? Normdist(53000, 40000, 7500, TRUE).

Norminv(p, mean, sd) returns the pth quantile of the specified normal distribution. That is, it returns a value x such that a normal random variable has probability p of being less than x.
Example: In the factory described above find the 25th and 75th salary percentiles. 25th: Norminv(.25, 40000, 7500). 75th: Norminv(.75, 40000, 7500).
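For readers working outside Excel, the same calculations can be done with a statistics library; the following Python sketch uses scipy.stats to mirror the two examples above (the numbers are the ones from the factory example).

    from scipy import stats

    # P(X < 53000) for X ~ Normal(mean 40000, sd 7500); plays the role of Normdist(..., TRUE).
    print(stats.norm.cdf(53000, loc=40000, scale=7500))

    # 25th and 75th salary percentiles; plays the role of Norminv.
    print(stats.norm.ppf(0.25, loc=40000, scale=7500))
    print(stats.norm.ppf(0.75, loc=40000, scale=7500))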


T-Distribution
Tdist(t, df, tails). The tails argument is either 1 or 2. If tails=1 then this function returns the probability that a T random variable with df degrees of freedom is greater than t. For reasons known only to Bill Gates, the t argument cannot be negative. Note that you have to standardize the t statistic yourself before using this function, as there are no mean and sd arguments like there are in Normdist.
Example: For a test of H0: μ = 3 vs. Ha: μ ≠ 3 we get a t statistic of 1.76. There are 73 observations in the data set. What is the p-value? Tdist(1.76, 72, 2) (df = 73 - 1 = 72, and tails = 2 because it is a two tailed test.)

Example: For a test of H0: μ = 3 vs. Ha: μ ≠ 3 we get a t statistic of -1.76. There are 73 observations in the data set. What is the p-value? Tdist(1.76, 72, 2) (The T distribution is symmetric, so ignoring the negative sign makes no difference.)
Example: For a test of H0: μ = 3 vs. Ha: μ > 3 we get a t statistic of 1.76. There are 73 observations in the data set. What is the p-value? Tdist(1.76, 72, 1) (The p-value here is the probability above 1.76.)
Example: For a test of H0: μ = 3 vs. Ha: μ > 3 we get a t statistic of -1.76. There are 73 observations in the data set. What is the p-value? =1-Tdist(1.76, 72, 1) (The p-value is the probability above -1.76, which by symmetry is the probability below 1.76.)

Tinv(p, df). Returns the value of t that you would need to see in a two tailed test to get a p-value of p.
Example: We have 73 observations. How large a t statistic would we have to see to reject a two tailed test at the .13 level? Tinv(.13, 72)
Example: With 73 observations, what t would give us a p-value of .05 for the test H0: μ = 17 vs. Ha: μ > 17? Because of the alternative hypothesis, the p-value is the area to the right of t. Thus the answer here is the value of t that would give a p-value of .10 on a two tailed test: Tinv(.10, 72).
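A Python sketch of the same T-distribution calculations, using scipy.stats; the t statistics and degrees of freedom are the ones from the examples above.

    from scipy import stats

    print(2 * stats.t.sf(abs(-1.76), df=72))  # two tailed p-value, like Tdist(1.76, 72, 2)
    print(stats.t.sf(1.76, df=72))            # one tailed p-value for t = 1.76, like Tdist(1.76, 72, 1)
    print(stats.t.sf(-1.76, df=72))           # one tailed p-value for t = -1.76, like 1 - Tdist(1.76, 72, 1)
    print(stats.t.ppf(1 - 0.13 / 2, df=72))   # two tailed critical value, like Tinv(.13, 72)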

χ² (chi-square) distribution
Chidist(x, df). Returns the probability to the right of x on the chi-square distribution with the specified degrees of freedom.
Example: A χ² test statistic turns out to be 12.7 on 9 degrees of freedom. What is the p-value? Chidist(12.7, 9)
ChiInv(p, df). p is the probability in the right tail of the χ² distribution. This function returns the value of the corresponding χ² statistic.

Example: In a χ² test on 9 degrees of freedom, how large must the test statistic be in order to get a p-value of .02? ChiInv(.02, 9)
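The corresponding chi-square calculations in Python, again using scipy.stats with the numbers from the examples above.

    from scipy import stats

    print(stats.chi2.sf(12.7, df=9))    # plays the role of Chidist(12.7, 9)
    print(stats.chi2.isf(0.02, df=9))   # plays the role of ChiInv(.02, 9)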

F Distribution
Fdist(F, NumDF, DenomDF). Returns the p-value from the F distribution with NumDF degrees of freedom in the numerator and DenomDF in the denominator.
Example: If F = 12.7, the numerator df = 3 and the denominator df = 102, what is the p-value? Fdist(12.7, 3, 102)
Finv(p, NumDF, DenomDF). Returns the value of the F statistic needed to achieve a p-value of p.
Example: If there were 3 numerator and 102 denominator degrees of freedom, how large an F statistic would be needed to get a p-value of .05? Finv(.05, 3, 102)
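And the F distribution analogues in Python, using scipy.stats with the same degrees of freedom as the examples above.

    from scipy import stats

    print(stats.f.sf(12.7, dfn=3, dfd=102))    # plays the role of Fdist(12.7, 3, 102)
    print(stats.f.isf(0.05, dfn=3, dfd=102))   # plays the role of Finv(.05, 3, 102)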


Appendix C

The Greek Alphabet


lower case   upper case   letter
α            Α            alpha
β            Β            beta
γ            Γ            gamma
δ            Δ            delta
ε            Ε            epsilon
ζ            Ζ            zeta
η            Η            eta
θ            Θ            theta
ι            Ι            iota
κ            Κ            kappa
λ            Λ            lambda
μ            Μ            mu
ν            Ν            nu
ξ            Ξ            xi
ο            Ο            omicron
π            Π            pi
ρ            Ρ            rho
σ            Σ            sigma
τ            Τ            tau
υ            Υ            upsilon
φ            Φ            phi
χ            Χ            chi
ψ            Ψ            psi
ω            Ω            omega


Appendix D

Tables


D.1 Normal Table
0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.10 1.20 1.30 1.40 1.50 1.60 1.70 1.80 1.90 2.00 2.10 2.20 2.30 2.40 2.50 2.60 2.70 2.80 2.90 3.00 3.10 3.20 3.30 3.40 3.50

0.00 0.5000 0.5398 0.5793 0.6179 0.6554 0.6915 0.7257 0.7580 0.7881 0.8159 0.8413 0.8643 0.8849 0.9032 0.9192 0.9332 0.9452 0.9554 0.9641 0.9713 0.9772 0.9821 0.9861 0.9893 0.9918 0.9938 0.9953 0.9965 0.9974 0.9981 0.9987 0.9990 0.9993 0.9995 0.9997 0.9998 0.01 0.5040 0.5438 0.5832 0.6217 0.6591 0.6950 0.7291 0.7611 0.7910 0.8186 0.8438 0.8665 0.8869 0.9049 0.9207 0.9345 0.9463 0.9564 0.9649 0.9719 0.9778 0.9826 0.9864 0.9896 0.9920 0.9940 0.9955 0.9966 0.9975 0.9982 0.9987 0.9991 0.9993 0.9995 0.9997 0.9998 0.02 0.5080 0.5478 0.5871 0.6255 0.6628 0.6985 0.7324 0.7642 0.7939 0.8212 0.8461 0.8686 0.8888 0.9066 0.9222 0.9357 0.9474 0.9573 0.9656 0.9726 0.9783 0.9830 0.9868 0.9898 0.9922 0.9941 0.9956 0.9967 0.9976 0.9982 0.9987 0.9991 0.9994 0.9995 0.9997 0.9998 0.03 0.5120 0.5517 0.5910 0.6293 0.6664 0.7019 0.7357 0.7673 0.7967 0.8238 0.8485 0.8708 0.8907 0.9082 0.9236 0.9370 0.9484 0.9582 0.9664 0.9732 0.9788 0.9834 0.9871 0.9901 0.9925 0.9943 0.9957 0.9968 0.9977 0.9983 0.9988 0.9991 0.9994 0.9996 0.9997 0.9998 0.04 0.5160 0.5557 0.5948 0.6331 0.6700 0.7054 0.7389 0.7704 0.7995 0.8264 0.8508 0.8729 0.8925 0.9099 0.9251 0.9382 0.9495 0.9591 0.9671 0.9738 0.9793 0.9838 0.9875 0.9904 0.9927 0.9945 0.9959 0.9969 0.9977 0.9984 0.9988 0.9992 0.9994 0.9996 0.9997 0.9998 0.05 0.5199 0.5596 0.5987 0.6368 0.6736 0.7088 0.7422 0.7734 0.8023 0.8289 0.8531 0.8749 0.8944 0.9115 0.9265 0.9394 0.9505 0.9599 0.9678 0.9744 0.9798 0.9842 0.9878 0.9906 0.9929 0.9946 0.9960 0.9970 0.9978 0.9984 0.9989 0.9992 0.9994 0.9996 0.9997 0.9998 0.06 0.5239 0.5636 0.6026 0.6406 0.6772 0.7123 0.7454 0.7764 0.8051 0.8315 0.8554 0.8770 0.8962 0.9131 0.9279 0.9406 0.9515 0.9608 0.9686 0.9750 0.9803 0.9846 0.9881 0.9909 0.9931 0.9948 0.9961 0.9971 0.9979 0.9985 0.9989 0.9992 0.9994 0.9996 0.9997 0.9998 0.07 0.5279 0.5675 0.6064 0.6443 0.6808 0.7157 0.7486 0.7794 0.8078 0.8340 0.8577 0.8790 0.8980 0.9147 0.9292 0.9418 0.9525 0.9616 0.9693 0.9756 0.9808 0.9850 0.9884 0.9911 0.9932 0.9949 0.9962 0.9972 0.9979 0.9985 0.9989 0.9992 0.9995 0.9996 0.9997 0.9998

0.08 0.5319 0.5714 0.6103 0.6480 0.6844 0.7190 0.7517 0.7823 0.8106 0.8365 0.8599 0.8810 0.8997 0.9162 0.9306 0.9429 0.9535 0.9625 0.9699 0.9761 0.9812 0.9854 0.9887 0.9913 0.9934 0.9951 0.9963 0.9973 0.9980 0.9986 0.9990 0.9993 0.9995 0.9996 0.9997 0.9998

0.09 0.5359 0.5753 0.6141 0.6517 0.6879 0.7224 0.7549 0.7852 0.8133 0.8389 0.8621 0.8830 0.9015 0.9177 0.9319 0.9441 0.9545 0.9633 0.9706 0.9767 0.9817 0.9857 0.9890 0.9916 0.9936 0.9952 0.9964 0.9974 0.9981 0.9986 0.9990 0.9993 0.9995 0.9997 0.9998 0.9998

The body of the table contains the probability that a N(0, 1) random variable is less than z. The left margin of the table contains the first two digits of z. The top row of the table contains the third digit of z.



D.2 Quick and Dirty Normal Table
z -4.00 -3.95 -3.90 -3.85 -3.80 -3.75 -3.70 -3.65 -3.60 -3.55 -3.50 -3.45 -3.40 -3.35 -3.30 -3.25 -3.20 -3.15 -3.10 -3.05 -3.00 -2.95 -2.90 -2.85 -2.80 -2.75 -2.70 -2.65 -2.60 -2.55 -2.50 -2.45 -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05



Pr(Z<z) 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0010 0.0011 0.0013 0.0016 0.0019 0.0022 0.0026 0.0030 0.0035 0.0040 0.0047 0.0054 0.0062 0.0071 0.0082 0.0094 0.0107 0.0122 0.0139 0.0158 0.0179 0.0202 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | z -2.00 -1.95 -1.90 -1.85 -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25 -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 Pr(Z<z) 0.0228 0.0256 0.0287 0.0322 0.0359 0.0401 0.0446 0.0495 0.0548 0.0606 0.0668 0.0735 0.0808 0.0885 0.0968 0.1056 0.1151 0.1251 0.1357 0.1469 0.1587 0.1711 0.1841 0.1977 0.2119 0.2266 0.2420 0.2578 0.2743 0.2912 0.3085 0.3264 0.3446 0.3632 0.3821 0.4013 0.4207 0.4404 0.4602 0.4801 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | z 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 Pr(Z<z) 0.5000 0.5199 0.5398 0.5596 0.5793 0.5987 0.6179 0.6368 0.6554 0.6736 0.6915 0.7088 0.7257 0.7422 0.7580 0.7734 0.7881 0.8023 0.8159 0.8289 0.8413 0.8531 0.8643 0.8749 0.8849 0.8944 0.9032 0.9115 0.9192 0.9265 0.9332 0.9394 0.9452 0.9505 0.9554 0.9599 0.9641 0.9678 0.9713 0.9744 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | z 2.00 2.05 2.10 2.15 2.20 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70 3.75 3.80 3.85 3.90 3.95 Pr(Z<z) 0.9772 0.9798 0.9821 0.9842 0.9861 0.9878 0.9893 0.9906 0.9918 0.9929 0.9938 0.9946 0.9953 0.9960 0.9965 0.9970 0.9974 0.9978 0.9981 0.9984 0.9987 0.9989 0.9990 0.9992 0.9993 0.9994 0.9995 0.9996 0.9997 0.9997 0.9998 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 1.0000 1.0000

The table gives the probability that a N (0, 1) random variable is less than z. It is less precise than Table D.1 because z increments by .05 rather than .01, but it may be easier to use.


D.3 Cook's Distance
Shocking 0.4897 0.7435 0.8451 0.8988 0.9319 0.9544 0.9705 0.9828 0.9923 1 1.0063 1.0116 1.016 1.0199 1.0232 1.0261 1.0287 1.031 1.0331 1.0349 1.0366 1.0381 1.0395 1.0408 1.042 1.0431 1.0441 1.045 1.0459 1.0467 Sample Size 100 Odd Surprising Shocking 0.0159 0.0645 0.4583 0.1055 0.2236 0.698 0.1944 0.3351 0.7941 0.2647 0.4115 0.8449 0.3199 0.467 0.8762 0.3641 0.5094 0.8974 0.4005 0.5429 0.9127 0.4309 0.5703 0.9242 0.4568 0.5931 0.9332 0.4792 0.6125 0.9405 0.4987 0.6292 0.9464 0.516 0.6438 0.9514 0.5314 0.6567 0.9556 0.5453 0.6681 0.9592 0.5578 0.6784 0.9624 0.5691 0.6876 0.9651 0.5795 0.6961 0.9675 0.5891 0.7037 0.9697 0.5979 0.7108 0.9716 0.606 0.7173 0.9734 0.6136 0.7233 0.9749 0.6206 0.7289 0.9764 0.6272 0.7341 0.9777 0.6334 0.7389 0.9789 0.6392 0.7434 0.98 0.6446 0.7477 0.981 0.6498 0.7517 0.982 0.6546 0.7555 0.9828 0.6592 0.759 0.9837 0.6636 0.7624 0.9844 1000 Odd Surprising Shocking 0.0158 0.0642 0.4553 0.1054 0.2232 0.6936 0.1948 0.3351 0.7892 0.2658 0.4121 0.8397 0.3218 0.4684 0.8709 0.367 0.5114 0.892 0.4043 0.5457 0.9072 0.4356 0.5738 0.9186 0.4625 0.5973 0.9276 0.4858 0.6173 0.9348 0.5062 0.6347 0.9407 0.5244 0.6499 0.9457 0.5406 0.6634 0.9498 0.5552 0.6754 0.9534 0.5685 0.6862 0.9566 0.5807 0.696 0.9593 0.5918 0.705 0.9617 0.6021 0.7132 0.9639 0.6116 0.7207 0.9658 0.6204 0.7277 0.9675 0.6287 0.7342 0.9691 0.6364 0.7402 0.9705 0.6436 0.7458 0.9718 0.6504 0.7511 0.973 0.6568 0.7561 0.9741 0.6629 0.7607 0.9751 0.6686 0.7651 0.9761 0.674 0.7693 0.9769 0.6792 0.7733 0.9778 0.6841 0.777 0.9785

Number of Params 10 Odd Surprising 1 0.0166 0.0677 2 0.1065 0.2282 3 0.1912 0.3357 4 0.2551 0.4066 5 0.3033 0.4563 6 0.3405 0.4931 7 0.37 0.5215 8 0.394 0.544 9 0.4139 0.5623 10 0.4306 0.5775 11 0.4448 0.5903 12 0.4571 0.6013 13 0.4678 0.6108 14 0.4772 0.6191 15 0.4856 0.6264 16 0.4931 0.6329 17 0.4998 0.6387 18 0.5058 0.644 19 0.5113 0.6487 20 0.5163 0.653 21 0.5209 0.6569 22 0.5251 0.6606 23 0.529 0.6639 24 0.5326 0.6669 25 0.536 0.6698 26 0.5391 0.6724 27 0.542 0.6748 28 0.5447 0.6771 29 0.5472 0.6793 30 0.5496 0.6813

The numbers in the table are approximate cutoff values for Cook's distances with the specified number of model parameters (intercept + number of slopes), and nearest sample size. An odd point is one about which you are mildly curious. A shocking point clearly influences the fitted regression.


D.4 Chi-Square Table
DF 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 50 60 70 80 90 100

0.9 2.71 4.61 6.25 7.78 9.24 10.64 12.02 13.36 14.68 15.99 17.28 18.55 19.81 21.06 22.31 23.54 24.77 25.99 27.20 28.41 29.62 30.81 32.01 33.20 34.38 35.56 36.74 37.92 39.09 40.26 51.81 63.17 74.40 85.53 96.58 107.57 118.50 0.95 Probability 0.99 0.999 6.63 9.21 11.34 13.28 15.09 16.81 18.48 20.09 21.67 23.21 24.73 26.22 27.69 29.14 30.58 32.00 33.41 34.81 36.19 37.57 38.93 40.29 41.64 42.98 44.31 45.64 46.96 48.28 49.59 50.89 63.69 76.15 88.38 100.43 112.33 124.12 135.81 10.83 13.82 16.27 18.47 20.51 22.46 24.32 26.12 27.88 29.59 31.26 32.91 34.53 36.12 37.70 39.25 40.79 42.31 43.82 45.31 46.80 48.27 49.73 51.18 52.62 54.05 55.48 56.89 58.30 59.70 73.40 86.66 99.61 112.32 124.84 137.21 149.45 0.9999 15.13 18.42 21.10 23.51 25.75 27.85 29.88 31.83 33.72 35.56 37.36 39.13 40.87 42.58 44.26 45.93 47.56 49.19 50.79 52.38 53.96 55.52 57.07 58.61 60.14 61.67 63.17 64.66 66.15 67.62 82.06 95.97 109.50 122.74 135.77 148.62 161.33

3.84 5.99 7.81 9.49 11.07 12.59 14.07 15.51 16.92 18.31 19.68 21.03 22.36 23.68 25.00 26.30 27.59 28.87 30.14 31.41 32.67 33.92 35.17 36.42 37.65 38.89 40.11 41.34 42.56 43.77 55.76 67.50 79.08 90.53 101.88 113.15 124.34

The table shows the value that a chi-square random variable must attain so that the specified amount of probability lies to its left.
[Figure: chi-square density curve; horizontal axis: Chi Square (the number in the table body); vertical axis: Probability.]


Bibliography
Albright, S. C., Winston, W. L., and Zappe, C. (2004). Data Analysis for Managers with Microsoft Excel, 2nd Edition. Brooks/Cole-Thomson Learning.
Foster, D. P., Stine, R. A., and Waterman, R. P. (1998). Basic Business Statistics. Springer.


Index
added variable plot, 127 alternative hypothesis, 72 ANOVA table, 121 autocorrelation, 56, 106, 107 autocorrelation function, 56 autoregression, 107 bimodal, 9 Bonferroni adjustment, 152 Box-Cox transformations, 99 boxplot, 7 categorical, 2 central limit theorem, 64, 66, 80 chi square test, 82 collinearity, 123, 133135 conditional probability, 22 conditional proportions, 11 condence interval, 126, 158 for a mean, 65, 67 for a proportion, 80 for the regression line, 94, 126 for the regression slope, 92 contingency table, 2 continuous, 2 contrast, 144 Cooks distance, 132 correlation, 52 covariance, 50 covariance matrix, 51 decision theory, 45 discreteness, 9 198 dummy variable, 138 dummy variables, 80 empirical rule, 8 expected value, 33 extrapolation, 96 factor, 137 fat tails, 9 elds, 1 rst order eects, 146 frequency table, 2 heteroscedasticity, 104 hierarchical principle, 150 high leverage point, 110 histogram, 3, 7, 13 independence, 29 indicator variables, 80 inuential point, 110 interaction, 145 interpolation, 96 joint distribution, 20 joint proportion, 11 lag variable, 107 levels, 2, 137 leverage plot, 127 linear regression model, 20 main eects, 146 margin of error, 70

INDEX marginal distribution, 12, 21 market segments, 45 Markov chain, 31 Markov dependence, 29 model sum of squares, 120 moments, 4 mosaic plots, 3 multiple comparisons, 152 nominal variable, 2 normal distribution, 8, 20, 37 normal quantile plot, 9, 41 null hypothesis, 71 observation, 1 ordinal variable, 2 outlier, 110 outliers, 6 p-value, 73 parameters, 62 point estimate, 69 population, 61 prediction interval, 94, 126 Q-Q plot, see normal quantile plot quantile-quantile plot, see normal quantile plot quantiles, 4 quartiles, 6 random variable, 18, 35, 38, 39, 51, 63, 64, 66, 88 randomization, 165, 167, 171 in experiments, 170, 173 in surveys, 166, 167, 173 records, 1 regression assumptions, 97 relative frequencies, 2 residual plot, 97, 98, 104, 106, 130

199 residuals, 93, 96, 98, 104, 109, 122, 127, 130, 160162 reward matrix, 46 risk prole, 47 sample, 61 sampling distribution, 63 scatterplot, 13 simple random sample, 62 skewed, 9 slope, 87, 88, 91 SSE, 120 SSM, 120 SST, 120 standard error, 91 standard deviation, 5, 35, 3739, 64, 66, 67, 93 standard error, 6466, 96, 158 statistics, 62 stepwise regression, 152 t distribution, 67 t-statistic, 73 t-test for a regression coecient, 91, 123 one sample, 76 paired, 79 two-sample, 140 test statistic, 72 trend in a time series, 106 Tukeys bulging rule, 99 variable, 1 variance, 3436, 51, 52, 54, 63, 64, 97, 122, 126, 127, 166, 167 non-constant, 104 of a random variable, 34 residual, 120 variance ination factor, 134 VIF, see variance ination factor

z-score, 38

