
Multiple Regression: Predicting One Factor from Several Others

2/10/2012

Predicting a single Y variable from two or more X variables

Describe and Understand the Relationship


Understand the effect of one X variable while holding the others fixed

Forecast (Predict) a New Observation


Lets you use all available information (the X variables) to find out about what you don't know (the Y variable for this new situation)

Adjust and Control a Process


because the regression equation (you hope) tells you what would happen if you made a change


n cases (elementary units)
k explanatory X variables

          Y (dependent variable    X1 (first independent or    . . .   Xk (last independent or
          to be explained)         explanatory variable)               explanatory variable)
Case 1    10.9                     2.0                         . . .   12.5
Case 2    23.6                     4.0                         . . .   12.3
. . .     . . .                    . . .                       . . .   . . .
Case n    6.0                      0.5                         . . .   7.0


Intercept: a

Predicted value for Y when every X is 0

Regression Coefficients: b1, b2, ..., bk

The effect of each X on Y, holding all other X variables constant

Prediction Equation or Regression Equation


(Predicted Y) = a + b1 X1 + b2 X2 + ... + bk Xk
The predicted Y, given the values for all X variables

Prediction Errors or Residuals


(Actual Y) - (Predicted Y)
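As a rough illustration of how a, b1, ..., bk and the residuals might be computed, here is a minimal sketch using least squares in NumPy. The function names and the five-case data set are hypothetical and are not from the slides; the data are built exactly from Y = 1 + 2 X1 + 3 X2 so that the fit can be checked by eye.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit; returns (a, b) where b holds the k coefficients."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first fitted value is the intercept a.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def predict(a, b, X):
    """(Predicted Y) = a + b1 X1 + ... + bk Xk for each case (row) of X."""
    return a + np.asarray(X, dtype=float) @ b

# Hypothetical data: n = 5 cases, k = 2 X variables, no random noise.
X_demo = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y_demo = [1 + 2 * x1 + 3 * x2 for x1, x2 in X_demo]

a_hat, b_hat = fit_multiple_regression(X_demo, y_demo)
# Residuals: (Actual Y) - (Predicted Y); all essentially 0 for this noise-free data.
residuals = np.asarray(y_demo, dtype=float) - predict(a_hat, b_hat, X_demo)
print(a_hat, b_hat, residuals)
```

With real data the residuals would not be zero; their typical size is what the standard error of estimate summarizes.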


Standard Error of Estimate: Se or S

Approximate size of errors made predicting Y

Coefficient of Determination: R2

Percentage of variability in Y explained by the X variables as a group

F Test: Significant or Not Significant

Tests whether the X variables, as a group, can predict Y better than just randomly


t Tests for Individual Regression Coefficients


Significant or not significant, for each X variable
Tests whether a particular X variable has an effect on Y, holding the other X variables constant
Should be performed only if the F test is significant

Standard Errors of the Regression Coefficients


Sb1, Sb2, ..., Sbk (with n - k - 1 degrees of freedom)
Indicate the estimated sampling standard deviation of each regression coefficient
Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients


Input Data

To predict cost of ads from magazine characteristics


Magazine        Y: Page Costs (color ad)   X1: Audience (thousands)   X2: Percent Male   X3: Median Income
Audubon         $25,315                    1,645                      51.1               $38,787
Better Homes    198,000                    34,797                     22.1               41,933
. . .           . . .                      . . .                      . . .              . . .
YM              73,270                     3,109                      14.4               43,696


Predicted Page Costs


= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)

Intercept a = $4,043
Essentially a base rate, representing the cost of advertising in a magazine that has no audience, no male readers, and zero income level
But there are no such magazines, so the intercept a is merely there to help achieve the best predictions


Predicted Page Costs


= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)

Regression coefficient b1 = 3.79


All else equal: The effect of Audience on Page Costs, while holding Percent Male and Median Income constant
The effect of Audience on Page Costs, adjusted for Percent Male and Median Income
On average, Page Costs are estimated to be $3.79 higher for a magazine with one more (thousand) Audience, as compared to another magazine with the same Percent Male and Median Income


Predicted Page Costs


= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)

Regression coefficient b2 = -124

All else equal: The effect of Percent Male on Page Costs, while holding Audience and Median Income constant
The effect of Percent Male on Page Costs, adjusted for Audience and Median Income
On average, Page Costs are estimated to be $124 lower for a magazine with one more percentage point of male readers, as compared to another magazine with the same Audience and Median Income


But don't believe it! We will see that this coefficient (b2, for Percent Male) is not significant

Predicted Page Costs


= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)

Regression coefficient b3 = 0.903


All else equal: The effect of Median Income on Page Costs, while holding Audience and Percent Male constant
The effect of Median Income on Page Costs, adjusted for Audience and Percent Male
On average, Page Costs are estimated to be $0.903 higher for a magazine with one more dollar of Median Income, as compared to another magazine with the same Audience and Percent Male
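The "all else equal" reading of each coefficient can be checked numerically. The sketch below plugs a hypothetical baseline magazine (the values 1,000 thousand readers, 50% male, and $40,000 median income are invented for illustration) into the fitted equation, with the Percent Male term taken as negative, and then raises one X variable at a time by one unit.

```python
def predicted_page_costs(audience, percent_male, median_income):
    """Fitted equation from the slides (Percent Male coefficient is -124)."""
    return 4043 + 3.79 * audience - 124 * percent_male + 0.903 * median_income

# Hypothetical baseline magazine, not one of the magazines in the data set.
base = predicted_page_costs(1000, 50.0, 40000)

# Change one X by one unit while holding the other two fixed.
diff_audience = predicted_page_costs(1001, 50.0, 40000) - base  # one more thousand readers
diff_male = predicted_page_costs(1000, 51.0, 40000) - base      # one more percentage point
diff_income = predicted_page_costs(1000, 50.0, 40001) - base    # one more dollar

print(round(diff_audience, 3), round(diff_male, 3), round(diff_income, 3))
```

Each difference equals the corresponding regression coefficient, whatever baseline is chosen, because the equation is linear in each X.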


Predicted Page Costs for Audubon


= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
= $4,043 + 3.79(1,645) - 124(51.1) + 0.903(38,787)
= $38,966

Actual Page Costs are $25,315
Residual is $25,315 - $38,966 = -$13,651

Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics
(Audience, Percent Male, and Median Income)
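The Audubon arithmetic can be reproduced in a few lines. This is only a sketch of the calculation, using the coefficients and Audubon's values from the slides; the variable names are chosen here for illustration.

```python
# Fitted coefficients (the Percent Male coefficient is negative).
a, b1, b2, b3 = 4043, 3.79, -124, 0.903

# Audubon's characteristics and actual page costs from the input data.
audience, percent_male, median_income = 1645, 51.1, 38787
actual = 25315

predicted = a + b1 * audience + b2 * percent_male + b3 * median_income
residual = actual - predicted  # (Actual Y) - (Predicted Y)

print(round(predicted))  # 38966
print(round(residual))   # -13651
```

The negative residual is what the slide describes in words: actual Page Costs fall below the prediction.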


Standard Error of Estimate Se


Indicates the approximate size of the prediction errors
About how far are the Y values from their predictions?
For the magazine data

Se = S = $21,578
Actual Page Costs are about $21,578 from their predictions for this group of magazines (using regression)
Compare to SY = $45,446: Actual Page Costs are about $45,446 from their average (not using regression)
Using the regression equation to predict Page Costs (instead of simply using the average) reduces the typical error from $45,446 to $21,578


Coefficient of Determination R2
Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables
How well do the X variables explain Y?
For the magazine data

R2 = 0.787 = 78.7%
The X variables (Audience, Percent Male, and Median Income) taken together explain 78.7% of the variance of Page Costs
This leaves 100% - 78.7% = 21.3% of the variation in Page Costs unexplained
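R2, SY, and Se are linked by a standard identity: the unexplained sum of squares is (1 - R2) times the total sum of squares (n - 1) SY^2, and Se^2 is that unexplained sum divided by n - k - 1. The sketch below applies it to the magazine figures, taking n = 55 and k = 3 from the F test discussion; because the inputs are rounded, the result should only roughly match the reported Se = $21,578.

```python
import math

n, k = 55, 3   # cases and X variables for the magazine data
r2 = 0.787     # coefficient of determination (rounded)
s_y = 45446    # standard deviation of Page Costs, SY

total_ss = (n - 1) * s_y ** 2         # total variation in Y
unexplained_ss = (1 - r2) * total_ss  # variation left after regression
se = math.sqrt(unexplained_ss / (n - k - 1))

print(round(se))  # close to the reported 21,578, up to rounding of the inputs
```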

Linear Model for the Population


Y = (α + β1 X1 + β2 X2 + ... + βk Xk) + ε = (Population relationship) + Randomness

Where ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another
An assumption needed for statistical inference
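One way to get a feel for the linear model is to simulate from it. The sketch below draws Y values as a population relationship plus independent normal randomness; the parameter values, X rows, and function name are all hypothetical and chosen only for illustration.

```python
import random

def simulate_y(alpha, betas, x_row, sigma, rng):
    """One draw of Y = (alpha + beta1 X1 + ... + betak Xk) + epsilon."""
    population_relationship = alpha + sum(b * x for b, x in zip(betas, x_row))
    epsilon = rng.gauss(0.0, sigma)  # normal, mean 0, constant standard deviation
    return population_relationship + epsilon

rng = random.Random(0)  # seeded so the sketch is repeatable

# Hypothetical population parameters and X values (k = 2).
alpha, betas, sigma = 10.0, [2.0, -1.0], 3.0
x_rows = [[1.0, 4.0], [2.0, 0.5], [3.0, 2.0]]

# Each case gets its own independent draw of the randomness.
ys = [simulate_y(alpha, betas, row, sigma, rng) for row in x_rows]
print(ys)
```

In practice the population parameters are unknown; only the simulated Y values and the X values would be available for estimating them.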


Table 12.1.7

                           Population (parameters:     Sample (estimators:
                           fixed and unknown)          random and known)
Intercept or constant      α                           a
Regression coefficients    β1, β2, ..., βk             b1, b2, ..., bk
Uncertainty in Y           σ                           S or Se


Is the regression significant?

Do the X variables, taken together, explain a significant amount of the variation in Y?
The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0
H0: β1 = β2 = ... = βk = 0

The research hypothesis claims that, in the population, at least one of the X variables does help explain Y
H1: At least one of β1, β2, ..., βk ≠ 0


Three equivalent methods for performing F test; they always give the same result

Use the p-value


If p < 0.05, then the test is significant
Same interpretation as p-values in Chapter 10

Use the R2 value


If R2 is larger than the value in the R2 table, then the result is significant
Do the X variables explain more than just randomness?

Use the F statistic


If the F statistic is larger than the value in the F table, then the result is significant


For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs

The p-value, listed as 0.000, is less than 0.0005, and is therefore very highly significant (since it is less than 0.001)
The R2 value, 78.7%, is greater than 27.1% (from the R2 table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant
The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant
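The R2 and F versions of the test agree because F can be written in terms of R2, n, and k: F = (R2 / k) / ((1 - R2) / (n - k - 1)). The sketch below applies this standard identity to the magazine figures; since R2 = 0.787 is rounded, the result should only approximately match the reported F = 62.84.

```python
n, k = 55, 3  # cases and X variables for the magazine data
r2 = 0.787    # coefficient of determination (rounded)

# Explained variation per X variable, relative to unexplained variation
# per residual degree of freedom.
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(f_stat, 2))  # close to the reported 62.84, up to rounding of R2
```

The value is far above the F table range of 6.171 to 7.054 quoted for level 0.1%, which is the same conclusion the slide reaches.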

A t test for each regression coefficient


To be used only if the F test is significant


If F is not significant, you should not look at the t tests

Does the jth X variable have a significant effect on Y, holding the other X variables constant?
Hypotheses are H0: βj = 0, H1: βj ≠ 0

Test using the confidence interval bj ± t Sbj

use the t table with n - k - 1 degrees of freedom

Or use the t statistic, t = bj / Sbj

compare to the t table value with n - k - 1 degrees of freedom


Testing b1, the coefficient for Audience


b1 = 3.79, t = 13.5, p = 0.000
Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income

Testing b2, the coefficient for Percent Male


b2 = -124, t = -0.90, p = 0.374
Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income

Testing b3, the coefficient for Median Income


b3 = 0.903, t = 2.44, p = 0.018
Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male
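The confidence-interval form of each test can be sketched from the figures above. Since t = bj / Sbj, each standard error can be backed out as Sbj = bj / t. Two assumptions are made here: the t table value for n - k - 1 = 51 degrees of freedom at the 5% level is taken as roughly 2.008, and b2 and its t statistic are both treated as negative, consistent with the Audubon prediction of $38,966.

```python
T_TABLE = 2.008  # assumed approximate two-sided 5% t table value, 51 degrees of freedom

# (coefficient, t statistic) pairs from the slides; signs for Percent Male assumed negative.
coefficients = {
    "Audience": (3.79, 13.5),
    "Percent Male": (-124.0, -0.90),
    "Median Income": (0.903, 2.44),
}

def implied_interval(b, t_stat, t_table=T_TABLE):
    """95% confidence interval b +/- t_table * Sb, with Sb backed out as b / t."""
    sb = b / t_stat
    half_width = t_table * sb
    return b - half_width, b + half_width

intervals = {name: implied_interval(b, t) for name, (b, t) in coefficients.items()}

# A coefficient is significant at the 5% level when its interval excludes 0.
significant = {name: not (low <= 0 <= high) for name, (low, high) in intervals.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: ({low:.3f}, {high:.3f}) significant={significant[name]}")
```

Only the Percent Male interval contains 0, which matches the three conclusions stated on the slide.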


Standardized Regression Coefficients

Indicate relative importance of the information each X variable brings in addition to the others
Ordinary regression coefficients are in different units
And cannot be compared without standardization

Defined as bj SXj / SY for the jth X variable
Compare the absolute values
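A minimal sketch of the standardization, assuming the definition bj SXj / SY with sample standard deviations. The data and function name below are hypothetical; the example also shows that re-expressing X in different units changes the ordinary coefficient but not the standardized one.

```python
import statistics

def standardized_coefficient(b_j, x_values, y_values):
    """b_j * S_Xj / S_Y, using sample standard deviations."""
    return b_j * statistics.stdev(x_values) / statistics.stdev(y_values)

# Hypothetical data in which Y = 2 * X exactly, so the ordinary coefficient is 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
std_original = standardized_coefficient(2.0, x, y)

# Same X measured in units 1,000 times smaller: the ordinary coefficient
# becomes 0.002, but the standardized coefficient is unchanged.
x_rescaled = [value * 1000 for value in x]
std_rescaled = standardized_coefficient(0.002, x_rescaled, y)

print(std_original, std_rescaled)
```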

Correlation Coefficients

Indicate relative importance of the information each X variable brings without adjusting for the other X variables


Multicollinearity

When some X variables are too similar to one another


Might do a good job of explaining and predicting Y
But the t tests might not be significant, because no X variable is bringing new information

Variable Selection

How to choose from a long list of X variables?


Too many: waste the information in the data
Too few: risk ignoring useful predictive information

Model Misspecification
Perhaps the multiple regression linear model is wrong
