
Chapter 17

Multiple Regression

Introduction
In this chapter we extend the simple linear regression model to allow for any number of independent variables. We expect the resulting model to fit the data better than the simple linear regression model does.

Introduction
We all believe that weight is affected by the amount of calories consumed. Yet the actual effect differs from one individual to another, so a simple linear relationship leaves much unexplained error.

[Scatter plot: Weight vs. Calories consumed]

Introduction

[Scatter plot: Weight vs. Calories consumed]

In an attempt to reduce the unexplained error, we'll add a second explanatory (independent) variable.

Introduction

[Scatter plot: Weight vs. Calories consumed]

If we believe a person's height explains his/her weight too, we can add this variable to our model. The resulting multiple regression model is:

Weight = β0 + β1Calories + β2Height + ε

Introduction
We shall use the computer printout to:
- Assess the model: How well does it fit the data? Is it useful? Are any required conditions violated?
- Employ the model: interpret the coefficients, make predictions using the prediction equation, and estimate the expected value of the dependent variable.

17.1 Model and Required Conditions


We allow k independent variables to potentially explain the dependent variable:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

where y is the dependent variable, x1, ..., xk are the independent variables, β0, β1, ..., βk are the coefficients, and ε is the random error variable.
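A minimal sketch (not from the slides) of fitting such a model by least squares in Python; the data, ranges, and "true" coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
calories = rng.uniform(1500, 3500, n)   # x1 (invented range)
height = rng.uniform(150, 200, n)       # x2, in cm (invented range)
e = rng.normal(0, 5, n)                 # error: mean 0, constant sd
weight = 10 + 0.02 * calories + 0.3 * height + e  # assumed "true" model

X = np.column_stack([np.ones(n), calories, height])  # design matrix
b, sse, rank, sv = np.linalg.lstsq(X, weight, rcond=None)
print(b)  # estimates b0, b1, b2 of the coefficients
```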

Model Assumptions: Required Conditions for ε


- The error ε is normally distributed.
- The mean of ε is zero, and its standard deviation (σε) is constant for all values of y.
- The errors are independent.

17.2 Estimating the Coefficients and Assessing the Model


The procedure used to perform regression analysis:
- Obtain the model coefficients and statistics using statistical software.
- Diagnose violations of the required conditions, and try to remedy problems when they are identified.
- Assess the model fit using statistics obtained from the sample.
- If the model assessment indicates a good fit to the data, use the model to interpret the coefficients and generate predictions.

Estimating the Coefficients and Assessing the Model, Example


Example 1: Where to locate a new motor inn?
La Quinta Motor Inns is planning an expansion. Management wishes to predict which sites are likely to be profitable. Several areas where predictors of profitability can be identified are:
- Competition
- Market awareness
- Demand generators
- Demographics
- Physical quality

Estimating the Coefficients and Assessing the Model, Example


Profitability is measured by the operating margin. The profitability factors and their proposed predictor variables:
- Competition: x1 (Rooms), the number of hotel/motel rooms within 3 miles of the site.
- Market awareness: x2 (Nearest), the distance to the nearest La Quinta inn.
- Customers (demand generators): x3 (Office space), the amount of office space (in 1,000 sq ft) near the site, and x4 (Enrollment), college enrollment.
- Community (demographics): x5 (Income), median household income.
- Physical: x6 (Distance), distance to downtown.

Estimating the Coefficients and Assessing the Model, Example


Data were collected from 100 randomly selected La Quinta inns, and the following suggested model was run:

Margin = β0 + β1Rooms + β2Nearest + β3Office + β4College + β5Income + β6Disttwn + ε

The data (first six inns):

INN  MARGIN  ROOMS  NEAREST  OFFICE  COLLEGE  INCOME  DISTTWN
1    55.5    3203   4.2      549     8        37      2.7
2    33.8    2810   2.8      496     17.5     35      14.4
3    49      2890   2.4      254     20       35      2.6
4    31.9    3422   3.3      434     15.5     38      12.1
5    57.4    2687   0.9      678     15.5     42      6.9
6    49      3759   2.9      635     19       33      10.8
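A sketch of fitting this model with statsmodels, assuming the data sit in a CSV file with the column names above (the file name is hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("laquinta.csv")  # hypothetical file name
X = sm.add_constant(df[["ROOMS", "NEAREST", "OFFICE",
                        "COLLEGE", "INCOME", "DISTTWN"]])
fit = sm.OLS(df["MARGIN"], X).fit()
print(fit.summary())  # coefficients, standard error, R^2, ANOVA F, t tests
```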

Regression Analysis, Excel Output


La Quinta

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.724611
R Square           0.525062
Adjusted R Square  0.49442
Standard Error     5.512084
Observations       100

ANOVA
            df  SS        MS        F         Significance F
Regression  6   3123.832  520.6387  17.13581  3.03E-13
Residual    93  2825.626  30.38307
Total       99  5949.458

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     38.13858      6.992948        5.453862  4.04E-07  24.25197   52.02518
Number        -0.00762      0.001255        -6.06871  2.77E-08  -0.01011   -0.00513
Nearest       1.646237      0.632837        2.601361  0.010803  0.389548   2.902926
Office Space  0.019766      0.00341         5.795594  9.24E-08  0.012993   0.026538
Enrollment    0.211783      0.133428        1.587246  0.115851  -0.05318   0.476744
Income        0.413122      0.139552        2.960337  0.003899  0.135999   0.690246
Distance      -0.22526      0.178709        -1.26048  0.210651  -0.58014   0.129622

This is the sample regression equation (sometimes called the prediction equation):

MARGIN = 38.14 - 0.0076(ROOMS) + 1.65(NEAREST) + 0.02(OFFICE) + 0.21(COLLEGE) + 0.41(INCOME) - 0.23(DISTTWN)

Model Assessment: Standard Error of Estimate


A small value of se indicates (by definition) a small variation of the errors around their mean. Since the mean is zero, small variation of the errors means the errors are close to zero. So we would prefer a model with a small standard deviation of the error rather than a large one. How can we determine whether the standard deviation of the error is small/large?

Model Assessment: Standard Error of Estimate


The standard deviation of the error, σε, is estimated by the standard error of estimate, se:

se = sqrt[ SSE / (n - k - 1) ]

The magnitude of se is judged by comparing it to the sample mean ȳ.
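A quick check of this formula against the printout, using the ANOVA numbers shown above:

```python
import math

SSE, n, k = 2825.626, 100, 6
s_e = math.sqrt(SSE / (n - k - 1))
print(round(s_e, 4))  # 5.5121, matching "Standard Error" in the output
```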

Standard Error of Estimate


From the printout (Standard Error), se = 5.5121. Calculating the mean value of y, we have ȳ = 45.739.

Model Assessment: Coefficient of Determination


In our example it seems that se is not particularly small, or is it? If se is small, the model fits the data well and is considered useful. The usefulness of the model is evaluated by the amount of variability in the y values explained by the model. This is measured by the coefficient of determination:

R² = SSR / SST = (SST - SSE) / SST = 1 - SSE / SST

As you can see, SSE (and thus se) affects the value of R².
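A check of R² from the ANOVA sums of squares in the printout:

```python
SSR, SST = 3123.832, 5949.458
print(round(SSR / SST, 4))  # 0.5251, equivalently 1 - SSE/SST
```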

Coefficient of Determination
From the printout, R Square = 0.5251; that is, 52.51% of the variability in the margin values is explained by this model.

Testing the Validity of the Model


We pose the question: is there at least one independent variable linearly related to the dependent variable?
To answer the question we test the hypotheses
H0: β1 = β2 = ... = βk = 0
H1: at least one βi is not equal to zero.
If at least one βi is not equal to zero, the model has some validity.

Testing the Validity of the Model


The total variation in y (SS(Total)) can be explained in part by the regression (SSR), while the rest remains unexplained (SSE):

SS(Total) = SSR + SSE, or Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

Note that if all the data points satisfy the linear equation without error, yi and ŷi coincide, and thus SSE = 0. In this case all the variation in y is explained by the regression (SS(Total) = SSR).

If errors exist only in small amounts, SSR will be close to SS(Total) and the ratio SSR/SSE will be large. This leads to the F-ratio test presented next.

Testing for Significance


Define the mean square for regression (MSR) and the mean square for error (MSE):

MSR = SSR / k
MSE = SSE / (n - k - 1)

The ratio MSR/MSE is F-distributed with k and n - k - 1 degrees of freedom:

F = MSR / MSE = [SSR / k] / [SSE / (n - k - 1)]
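A check of the F statistic from the ANOVA numbers above; the p-value and critical value are computed with scipy (not part of the slides):

```python
from scipy import stats

SSR, SSE, n, k = 3123.832, 2825.626, 100, 6
MSR = SSR / k                             # 520.64
MSE = SSE / (n - k - 1)                   # 30.38
F = MSR / MSE                             # 17.14
p_value = stats.f.sf(F, k, n - k - 1)     # ~3.03e-13
F_crit = stats.f.ppf(0.95, k, n - k - 1)  # critical value for alpha = .05
print(F, p_value, F_crit)
```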

Testing for Significance


Rejection region: F > Fα,k,n-k-1

Note: a large F results from a large SSR, which indicates that much of the variation in y is explained by the regression model; this is when the model is useful. Hence, the null hypothesis (which states that the model is not useful) should be rejected when F is sufficiently large. Therefore, the rejection region has the form F > Fα,k,n-k-1.

Testing the Model Validity of the La Quinta Inns Regression Model


The F-ratio test is performed using the ANOVA portion of the regression output:

ANOVA
            df          SS        MS        F         Significance F
Regression  k = 6       3123.832  520.6387  17.13581  3.03382E-13
Residual    n-k-1 = 93  2825.626  30.38307
Total       n-1 = 99    5949.458

Here the SS column gives SSR and SSE, MSR = SSR/k, MSE = SSE/(n-k-1), and F = MSR/MSE.

Testing the Model Validity of the La Quinta Inns Regression Model


If α = .05, the critical value is Fα,k,n-k-1 = F.05,6,100-6-1 ≈ 2.17, and F = 17.14 > 2.17. Also, the p-value is 3.033 × 10⁻¹³ < 0.05 = α.

Conclusion: there is sufficient evidence to reject the null hypothesis in favor of the alternative. At least one βi is not equal to zero; thus, the independent variable associated with it is linearly related to y. This linear regression model is useful.

Interpreting the Coefficients


b0 = 38.14. This is the y-intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.

Interpreting the coefficients b1 through bk: increasing x1 by one unit while holding the other variables fixed changes y by b1:

y = b0 + b1x1 + b2x2 + ... + bkxk
y' = b0 + b1(x1 + 1) + b2x2 + ... + bkxk = b0 + b1x1 + b2x2 + ... + bkxk + b1

Interpreting the Coefficients


b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.0076% (assuming the other variables are held constant).

Interpreting the Coefficients


b2 = 1.65. In this model, for each additional mile between the nearest competitor and a La Quinta inn, the average operating margin increases by 1.65% when the other variables are held constant.
b3 = 0.02. For each additional 1,000 sq-ft of office space, the average operating margin increases by 0.02%.
b4 = 0.21. For each additional thousand students, the average operating margin increases by 0.21% when the other variables remain constant.

Interpreting the Coefficients


b5 = 0.41. For each additional $1,000 of median household income, the average operating margin increases by 0.41% when the other variables remain constant.
b6 = -0.23. For each additional mile to the downtown center, the average operating margin decreases by 0.23%.

Testing the Coefficients


The hypotheses for each coefficient are
H0: βi = 0
H1: βi ≠ 0

Test statistic (bi and s_bi come from the Coefficients and Standard Error columns of the Excel printout):

t = (bi - βi) / s_bi,  d.f. = n - k - 1

For example, a test for β1: t = (-0.007618 - 0)/0.001255 = -6.068. Suppose α = .01; then t.005,100-6-1 ≈ 2.63, and |t| = 6.068 > 2.63, so there is sufficient evidence to reject H0 at the 1% significance level. Moreover, the p-value of the test is 2.77 × 10⁻⁸, so H0 is strongly rejected: the number of rooms is linearly related to the margin.

              Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept     38.13858      6.992948        5.453862  4.04E-07  24.25196697   52.02518
Number        -0.007618     0.00125527      -6.06871  2.77E-08  -0.010110585  -0.00513
Nearest       1.646237      0.63283691      2.601361  0.010803  0.389548431   2.902926
Office Space  0.019766      0.00341044      5.795594  9.24E-08  0.012993078   0.026538
Enrollment    0.211783      0.13342794      1.587246  0.115851  -0.053178488  0.476744
Income        0.413122      0.1395524       2.960337  0.003899  0.135998719   0.690246
Distance      -0.225258     0.17870889      -1.26048  0.210651  -0.580138524  0.129622
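The same t test computed directly, with the p-value and critical value from scipy (coefficient and standard error taken from the printout):

```python
from scipy import stats

b1, se_b1, n, k = -0.007618, 0.001255, 100, 6
t = b1 / se_b1                              # about -6.07
p = 2 * stats.t.sf(abs(t), n - k - 1)       # two-sided p-value, ~2.8e-8
t_crit = stats.t.ppf(1 - 0.005, n - k - 1)  # critical value for alpha = .01
print(t, p, t_crit)
```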

Testing the Coefficients


The same hypotheses, H0: βi = 0 versus H1: βi ≠ 0, are tested for each coefficient using the t Stat and P-value columns of the printout above. See next the interpretation of the p-value results.

Interpretation
Interpretation of the regression results for this model
- The number of hotel and motel rooms, the distance to the nearest motel, the amount of office space, and the median household income are linearly related to the operating margin.
- Student enrollment and distance from downtown are not linearly related to the margin.
- Preferable locations have only a few other motels nearby, much office space, and affluent surrounding households.

Using the Regression Equation


The model can be used for making predictions by:
- producing a prediction interval estimate of a particular value of y, for given values of the xi;
- producing a confidence interval estimate for the expected value of y, for given values of the xi.

The model can also be used to learn about the relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients bi.

La Quinta Inns, Predictions


La Quinta: predict the operating margin of an inn at a site with the following characteristics:
- 3,815 rooms within 3 miles
- closest competitor 0.9 miles away
- 476,000 sq-ft of office space
- 24,500 college students
- $35,000 median household income
- 11.2 miles to the downtown center

MARGIN = 38.14 - 0.0076(3815) + 1.646(0.9) + 0.02(476) + 0.212(24.5) + 0.413(35) - 0.225(11.2) ≈ 37.1%
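A check of this point prediction using the more precise coefficients from the printout:

```python
import numpy as np

b = np.array([38.13858, -0.007618, 1.646237, 0.019766,
              0.211783, 0.413122, -0.225258])
x = np.array([1, 3815, 0.9, 476, 24.5, 35, 11.2])  # 1 for the intercept
print(b @ x)  # ~37.09, matching the interval output on the next slide
```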

La Quinta Inns, Predictions


Interval estimates by Excel (Data Analysis Plus):

Prediction Interval (Margin)
Predicted value = 37.09149
Lower limit = 25.39527
Upper limit = 48.78771

Interval Estimate of Expected Value
Lower limit = 32.96972
Upper limit = 41.21326

It is predicted that the operating margin will lie between 25.4% and 48.8%, with 95% confidence. The average operating margin of all sites that fit this category is expected to fall between 33% and 41.2%, with 95% confidence.

The average inn would not be profitable (less than 50%).
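A sketch of producing the same two intervals with statsmodels, assuming `fit` is the fitted OLS model from the earlier sketch:

```python
x_new = [[1, 3815, 0.9, 476, 24.5, 35, 11.2]]  # constant first
frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)
print(frame[["mean",
             "obs_ci_lower", "obs_ci_upper",      # prediction interval
             "mean_ci_lower", "mean_ci_upper"]])  # interval for E(y)
```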

18.2 Qualitative Independent Variables


In many real-life situations one or more independent variables are qualitative. Including qualitative variables in a regression model is done via indicator variables. An indicator variable (I) can assume one of two values, zero or one:

I = 1 if the first condition out of two is met
I = 0 if the second condition out of two is met

For example:
I = 1 if the degree earned is in Finance; I = 0 if the degree earned is not in Finance
I = 1 if the temperature was below 50°; I = 0 if the temperature was 50° or more
I = 1 if the data were collected before 1980; I = 0 if the data were collected after 1980

Qualitative Independent Variables; Example: Auction Car Price (II)


Example 2 (continued)
Recall: a car dealer wants to predict the auction price of a car. The dealer now believes that both the odometer reading and the car's color affect the price. Three color categories are considered:
- White
- Silver
- Other colors

Note: color is a qualitative variable.

Qualitative Independent Variables; Example: Auction Car Price (II)


Example 2 (continued)

I1 = 1 if the color is white; 0 if the color is not white
I2 = 1 if the color is silver; 0 if the color is not silver

The category "Other colors" is defined by: I1 = 0 and I2 = 0.

How Many Indicator Variables?


Note: To represent the situation of three possible colors we need only two indicator variables. Generally to represent a nominal variable with m possible values, we must create m-1 indicator variables.
38
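A sketch of building the two indicators with pandas; the column names and sample values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"color": ["white", "silver", "blue", "white"]})

# Two indicators for three categories; "other colors" is the baseline.
df["I1"] = (df["color"] == "white").astype(int)
df["I2"] = (df["color"] == "silver").astype(int)

# Equivalently, pd.get_dummies(df["color"], drop_first=True) creates
# m - 1 indicator columns, with the dropped category as the baseline.
print(df)
```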

Qualitative Independent Variables; Example: Auction Car Price (II)


Solution
The proposed model is

y = β0 + β1(Odometer) + β2I1 + β3I2 + ε

The data (enter them in Excel as usual; first six cars shown):

Price  Odometer  I-1  I-2
14636  37388     1    0   (white)
14122  44758     1    0   (white)
14016  45833     0    0   (other colors)
15590  30862     0    0   (other colors)
15568  31705     0    1   (silver)
14718  34010     0    1   (silver)

Example: Auction Car Price (II) The Regression Equation


From Excel we get the regression equation

PRICE = 16.837 - 0.0591(Odometer) + 0.0911(I-1) + 0.3304(I-2)

This equation describes three parallel price-vs-odometer lines, one per color category:

White:  Price = 16.837 - 0.0591(Odometer) + 0.0911(1) + 0.3304(0)
Silver: Price = 16.837 - 0.0591(Odometer) + 0.0911(0) + 0.3304(1)
Other:  Price = 16.837 - 0.0591(Odometer) + 0.0911(0) + 0.3304(0)
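Evaluating the fitted equation for each color category; the assumption here is that price and odometer are measured in thousands, which is consistent with the interpretations on the next slide:

```python
def price(odometer, i1, i2):
    # Fitted equation from the printout (price and odometer in thousands).
    return 16.837 - 0.0591 * odometer + 0.0911 * i1 + 0.3304 * i2

odo = 40  # 40,000 miles, hypothetical
print(price(odo, 1, 0))  # white
print(price(odo, 0, 1))  # silver
print(price(odo, 0, 0))  # other colors
```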

Example: Auction Car Price (II) The Regression Equation


Interpreting the equation

From Excel we get the regression equation

PRICE = 16.837 - 0.0591(Odometer) + 0.0911(I-1) + 0.3304(I-2)

- For each additional mile, the auction price decreases on average by 5.91 cents.
- A white car sells, on average, for $91.10 more than a car in the "Other colors" category.
- A silver car sells, on average, for $330.40 more than a car in the "Other colors" category.

Example: Auction Car Price (II) The Regression Equation


SUMMARY OUTPUT (Car Price-Dummy)

Regression Statistics
Multiple R         0.837135
R Square           0.700794
Adjusted R Square  0.691444
Standard Error     0.304258
Observations       100

ANOVA
            df  SS         MS        F        Significance F
Regression  3   20.814919  6.938306  74.9498  4.65E-25
Residual    96  8.8869809  0.092573
Total       99  29.7019

           Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%
Intercept  16.83725      0.1971054       85.42255   2.28E-92  16.446     17.2285
Odometer   -0.059123     0.0050653       -11.67219  4.04E-20  -0.069177  -0.049068
I-1        0.091131      0.0728916       1.250224   0.214257  -0.053558  0.235819
I-2        0.330368      0.0816498       4.046157   0.000105  0.168294   0.492442

There is insufficient evidence to infer that a white car and a car of another color sell for different auction prices. There is sufficient evidence to infer that a silver car sells for a higher price than a car in the "Other colors" category.

Qualitative Independent Variables; Example: MBA Program Admission (II)


Recall: the Dean wanted to evaluate applications for the MBA program by predicting the future performance of the applicants. The following three predictors were suggested:
- Undergraduate GPA
- GMAT score
- Years of work experience

It is now believed that the type of undergraduate degree should be included in the model. Note: the undergraduate degree is qualitative.

Qualitative Independent Variables; Example: MBA Program Admission (II)


I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise

The category "Other" is defined by: I1 = 0, I2 = 0, I3 = 0.

Qualitative Independent Variables; Example: MBA Program Admission (II)


MBA-II

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.746053
R Square           0.556595
Adjusted R Square  0.524151
Standard Error     0.729328
Observations       89

ANOVA
            df  SS        MS        F         Significance F
Regression  6   54.75184  9.125307  17.15544  9.59E-13
Residual    82  43.61738  0.531919
Total       88  98.36922

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  0.189814      1.406734        0.134932  0.892996  -2.60863   2.988258
UnderGPA   -0.00606      0.113968        -0.05317  0.957728  -0.23278   0.22066
GMAT       0.012793      0.001356        9.432831  9.92E-15  0.010095   0.015491
Work       0.098182      0.030323        3.237862  0.001739  0.03786    0.158504
I-1        -0.34499      0.223728        -1.54199  0.126928  -0.79005   0.100081
I-2        0.705725      0.240529        2.934058  0.004338  0.227237   1.184213
I-3        0.034805      0.209401        0.166211  0.8684    -0.38176   0.45137

Applications in Human Resources Management: Pay-Equity


Pay equity can be handled in two different forms:
- Equal pay for equal work
- Equal pay for work of equal value

Regression analysis is extensively employed in cases of equal pay for equal work.

Human Resources Management: Pay-Equity


Example 3
Is there sex discrimination against female managers in a large firm? A random sample of 100 managers was selected and data were collected on:
- Annual salary
- Years of education
- Years of experience
- Gender

Human Resources Management: Pay-Equity


Solution
Construct the following multiple regression model:

y = β0 + β1Education + β2Experience + β3Gender + ε

Note the nature of the variables:
- Education: quantitative
- Experience: quantitative
- Gender: qualitative (Gender = 1 if male; 0 otherwise)
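A sketch of fitting this model with statsmodels; the file and column names are hypothetical, and the Gender recoding assumes the raw column holds strings:

```python
import pandas as pd
import statsmodels.api as sm

hr = pd.read_csv("human_resource.csv")  # hypothetical file name
hr["Gender"] = (hr["Gender"] == "male").astype(int)  # 1 if male, 0 otherwise
X = sm.add_constant(hr[["Education", "Experience", "Gender"]])
print(sm.OLS(hr["Salary"], X).fit().summary())
```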

Human Resources Management: Pay-Equity


Solution, continued (HumanResource)

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.83256
R Square           0.693155
Adjusted R Square  0.683567
Standard Error     16273.96
Observations       100

ANOVA
            df  SS        MS        F         Significance F
Regression  3   5.74E+10  1.91E+10  72.28735  1.55E-24
Residual    96  2.54E+10  2.65E+08
Total       99  8.29E+10

            Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept   -5835.1       16082.8         -0.36282  0.71754   -37759.2   26089.02
Education   2118.898      1018.486        2.08044   0.040149  97.21837   4140.578
Experience  4099.338      317.1936        12.92377  9.89E-23  3469.714   4728.963
Gender      1850.985      3703.07         0.499851  0.618323  -5499.56   9201.527

Analysis and interpretation:
- The model fits the data quite well and is very useful.
- Experience is a variable strongly related to salary.
- There is no evidence of sex discrimination.

Human Resources Management: Pay-Equity


Solution, continued (HumanResource)

Analysis and interpretation: studying the data further, we find:
- Average experience for women is 12 years; for men, 17 years.
- Average salary for a female manager is $76,189; for a male manager, $97,832.

Review problems

