
Regression Analysis, 14th Week (12/15/2006)

Type I SS and Estimable Functions

The Type I SS and the associated hypotheses they test are by-products of the modified sweep operator used to compute a generalized inverse of X'X and a solution to the normal equations. For the model E(Y) = X1 B1 + X2 B2 + X3 B3, the Type I SS for each effect correspond to

Effect   Type I SS
B1       R(B1)
B2       R(B2 | B1)
B3       R(B3 | B1, B2)

The Type I SS are model-order dependent; each effect is adjusted only for the preceding effects in the model.

Type II SS and Estimable Functions

For main effects models and regression models, the general form of estimable functions can be manipulated to provide tests of hypotheses involving only the parameters of the effect in question. The same result can also be obtained by entering each effect in turn as the last effect in the model and obtaining the Type I SS for that effect. These are the Type II SS. Using a modified reversible sweep operator, it is possible to obtain the Type II SS without actually rerunning the model. Thus, the Type II SS correspond to the R notation in which each effect is adjusted for all other effects possible. For a regression model such as E(Y) = X1 B1 + X2 B2 + X3 B3, the Type II SS correspond to

Effect   Type II SS
B1       R(B1 | B2, B3)
B2       R(B2 | B1, B3)
B3       R(B3 | B1, B2)
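For concreteness, each conditional term in the R notation is an extra sum of squares. For example,

R(B2 | B1) = R(B1, B2) - R(B1)

that is, the increase in the model sum of squares when X2 B2 is added to a model that already contains X1 B1; the cigarette example later in these notes computes such quantities numerically.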

Table 61.2: MODEL Statement Options

Option          Description

Model Selection and Details of Selection
SELECTION=      specifies model selection method
BEST=           specifies maximum number of subset models displayed or output to the OUTEST= data set
DETAILS         produces summary statistics at each step
DETAILS=        specifies the display details for forward, backward, and stepwise methods
GROUPNAMES=     provides names for groups of variables
INCLUDE=        includes first n variables in the model
MAXSTEP=        specifies maximum number of steps that may be performed
NOINT           fits a model without the intercept term
PCOMIT=         performs incomplete principal component analysis and outputs estimates to the OUTEST= data set
RIDGE=          performs ridge regression analysis and outputs estimates to the OUTEST= data set
SLE=            sets criterion for entry into model
SLS=            sets criterion for staying in model
START=          specifies number of variables in model to begin the comparing and switching process
STOP=           specifies number of variables in model at which selection stops

Fit Statistics
ADJRSQ          computes adjusted R2
AIC             computes Akaike's information criterion
B               computes parameter estimates for each model
BIC             computes Sawa's Bayesian information criterion
CP              computes Mallows' Cp statistic
GMSEP           computes estimated MSE of prediction assuming multivariate normality
JP              computes Jp, the final prediction error
MSE             computes MSE for each model
PC              computes Amemiya's prediction criterion
RMSE            displays root MSE for each model
SBC             computes the SBC statistic
SP              computes Sp statistic for each model
SSE             computes error sum of squares for each model

Data Set Options
EDF             outputs the number of regressors, the error degrees of freedom, and the model R2 to the OUTEST= data set
OUTSEB          outputs standard errors of the parameter estimates to the OUTEST= data set
OUTSTB          outputs standardized parameter estimates to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
OUTVIF          outputs the variance inflation factors to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
PRESS           outputs the PRESS statistic to the OUTEST= data set
RSQUARE         has same effect as the EDF option

Regression Calculations
I               displays inverse of sums of squares and crossproducts
XPX             displays sums-of-squares and crossproducts matrix

Details on Estimates
ACOV            displays asymptotic covariance matrix of estimates assuming heteroscedasticity
COLLIN          produces collinearity analysis
COLLINOINT      produces collinearity analysis with intercept adjusted out
CORRB           displays correlation matrix of estimates
COVB            displays covariance matrix of estimates
PARTIALR2       displays squared semi-partial correlation coefficients using Type I sums of squares
PCORR1          displays squared partial correlation coefficients using Type I sums of squares
PCORR2          displays squared partial correlation coefficients using Type II sums of squares
SCORR1          displays squared semi-partial correlation coefficients using Type I sums of squares
SCORR2          displays squared semi-partial correlation coefficients using Type II sums of squares
SEQB            displays a sequence of parameter estimates during selection process
SPEC            tests that first and second moments of model are correctly specified
SS1             displays the sequential sums of squares
SS2             displays the partial sums of squares
STB             displays standardized parameter estimates
TOL             displays tolerance values for parameter estimates
VIF             computes variance-inflation factors

Predicted and Residual Values
CLB             computes 100(1-α)% confidence limits for the parameter estimates
CLI             computes 100(1-α)% confidence limits for an individual predicted value
CLM             computes 100(1-α)% confidence limits for the expected value of the dependent variable
DW              computes a Durbin-Watson statistic
DWPROB          computes a Durbin-Watson statistic and p-value
INFLUENCE       computes influence statistics
P               computes predicted values
PARTIAL         displays partial regression plots for each regressor
R               produces analysis of residuals

Display Options and Other Options
ALL             requests the following options: ACOV, CLB, CLI, CLM, CORRB, COVB, I, P, PCORR1, PCORR2, R, SCORR1, SCORR2, SEQB, SPEC, SS1, SS2, STB, TOL, VIF, XPX
ALPHA=          sets significance value for confidence and prediction intervals and tests
NOPRINT         suppresses display of results
SIGMA=          specifies the true standard deviation of error term for computing CP and BIC
SINGULAR=       sets criterion for checking for singularity
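A minimal sketch of combining a few of these options on one MODEL statement (the data set name and regressors here are hypothetical):

proc reg data=mydata outest=est;   /* hypothetical data set; OUTEST= receives estimates */
   model y = x1 x2 x3
         / ss1 ss2                 /* sequential (Type I) and partial (Type II) SS */
           stb vif                 /* standardized estimates and variance inflation */
           clb alpha=0.05;         /* 95% confidence limits for the parameters */
run;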

Autocorrelation in Time Series Data

When regression is performed on time series data, the errors may not be independent. Often errors are autocorrelated; that is, each error is correlated with the error immediately before it. Autocorrelation is also a symptom of systematic lack of fit. The DW option provides the Durbin-Watson d statistic to test that the autocorrelation is zero:

d = [ Σ(i=2..n) (ei - e(i-1))2 ] / [ Σ(i=1..n) ei2 ]

The value of d is close to 2 if the errors are uncorrelated. The distribution of d is reported by Durbin and Watson (1951). Tables of the distribution are found in most econometrics textbooks, such as Johnston (1972) and Pindyck and Rubinfeld (1981).

The sample autocorrelation estimate is displayed after the Durbin-Watson statistic. The sample autocorrelation is computed as

r = [ Σ(i=2..n) ei e(i-1) ] / [ Σ(i=1..n) ei2 ]

This autocorrelation of the residuals may not be a very good estimate of the autocorrelation of the true errors, especially if there are few observations and the independent variables have certain patterns. If there are missing observations in the regression, these measures are computed as though the missing observations did not exist.

Positive autocorrelation of the errors generally tends to make the estimate of the error variance too small, so confidence intervals are too narrow and true null hypotheses are rejected with a higher probability than the stated significance level. Negative autocorrelation of the errors generally tends to make the estimate of the error variance too large, so confidence intervals are too wide and the power of significance tests is reduced. With either positive or negative autocorrelation, least-squares parameter estimates are usually not as efficient as generalized least-squares parameter estimates. For more details, refer to Judge et al. (1985, Chapter 8) and the SAS/ETS User's Guide.

The following SAS statements request the DW option for the US population data (see Figure 61.55):

proc reg data=USPopulation;
   model Population=Year YearSq / dw;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Durbin-Watson D               1.191
Number of Observations           22
1st Order Autocorrelation     0.323

Figure 61.55: Regression Using DW Option

Model-Selection Methods
The nine methods of model selection implemented in PROC REG are specified with the SELECTION= option in the MODEL statement. Each method is discussed in this section.

Full Model Fitted (NONE)

This method is the default and provides no model selection capability. The complete model specified in the MODEL statement is used to fit the model. For many regression analyses, this may be the only method you need.

Forward Selection (FORWARD)

The forward-selection technique begins with no variables in the model. For each of the independent variables, the FORWARD method calculates F statistics that reflect the variable's contribution to the model if it is included. The p-values for these F statistics are compared to the SLENTRY= value that is specified in the MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If no F statistic has a significance level greater than the SLENTRY= value, the FORWARD selection stops. Otherwise, the FORWARD method adds the variable that has the largest F statistic to the model. The FORWARD method then calculates F statistics again for the variables still remaining outside the model, and the evaluation process is repeated. Thus, variables are added one by one to the model until no remaining variable produces a significant F statistic. Once a variable is in the model, it stays.

Backward Elimination (BACKWARD)

The backward elimination technique begins by calculating F statistics for a model, including all of the independent variables. Then the variables are deleted from the model one by one until all the variables remaining in the model produce F statistics significant at the SLSTAY= level specified in the MODEL statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the variable showing the smallest contribution to the model is deleted.

Stepwise (STEPWISE)
The stepwise method is a modification of the forward-selection technique and differs in that variables already in the model do not necessarily stay there. As in the forward-selection method, variables are added one by one to the model, and the F statistic for a variable to be added must be significant at the SLENTRY= level. After a variable is added, however, the stepwise method looks at all the variables already included in the model and deletes any variable that does not produce an F statistic significant at the SLSTAY= level. Only after this check is made and the necessary deletions accomplished can another variable be added to the model. The stepwise process ends when none of the variables outside the model has an F statistic significant at the SLENTRY= level and every variable in the model is significant at the SLSTAY= level, or when the variable to be added to the model is the one just deleted from it.
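A minimal sketch of requesting these three methods (the data set and regressors are hypothetical; SLENTRY= and SLSTAY= override the defaults described above):

proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 / selection=forward  slentry=0.10;   /* add while p-value < SLENTRY */
   model y = x1 x2 x3 x4 x5 / selection=backward slstay=0.05;    /* drop while p-value > SLSTAY */
   model y = x1 x2 x3 x4 x5 / selection=stepwise
                              slentry=0.15 slstay=0.15;           /* add, then re-check for deletions */
run;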

Maximum R2 Improvement (MAXR)

The maximum R2 improvement technique does not settle on a single model. Instead, it tries to find the "best" one-variable model, the "best" two-variable model, and so forth, although it is not guaranteed to find the model with the largest R2 for each size.

The MAXR method begins by finding the one-variable model producing the highest R2. Then another variable, the one that yields the greatest increase in R2, is added. Once the two-variable model is obtained, each of the variables in the model is compared to each variable not in the model. For each comparison, the MAXR method determines if removing one variable and replacing it with the other variable increases R2. After comparing all possible switches, the MAXR method makes the switch that produces the largest increase in R2. Comparisons begin again, and the process continues until the MAXR method finds that no switch could increase R2. Thus, the two-variable model achieved is considered the "best" two-variable model the technique can find. Another variable is then added to the model, and the comparing-and-switching process is repeated to find the "best" three-variable model, and so forth.

The difference between the STEPWISE method and the MAXR method is that all switches are evaluated before any switch is made in the MAXR method. In the STEPWISE method, the "worst" variable may be removed without considering what adding the "best" remaining variable might accomplish. The MAXR method may require much more computer time than the STEPWISE method.

Minimum R2 Improvement (MINR)

The MINR method closely resembles the MAXR method, but the switch chosen is the one that produces the smallest increase in R2. For a given number of variables in the model, the MAXR and MINR methods usually produce the same "best" model, but the MINR method considers more models of each size.

R2 Selection (RSQUARE)

The RSQUARE method finds subsets of independent variables that best predict a dependent variable by linear regression in the given sample. You can specify the largest and smallest number of independent variables to appear in a subset and the number of subsets of each size to be selected. The RSQUARE method can efficiently perform all possible subset regressions and display the models in decreasing order of R2 magnitude within each subset size. Other statistics are available for comparing subsets of different sizes. These statistics, as well as estimated regression coefficients, can be displayed or output to a SAS data set.

The subset models selected by the RSQUARE method are optimal in terms of R2 for the given sample, but they are not necessarily optimal for the population from which the sample is drawn or for any other sample for which you may want to make predictions. If a subset model is selected on the basis of a large R2 value or any other criterion commonly used for model selection, then all regression statistics computed for that model under the assumption that the model is given a priori, including all statistics computed by PROC REG, are biased.

While the RSQUARE method is a useful tool for exploratory model building, no statistical method can be relied on to identify the "true" model. Effective model building requires substantive theory to suggest relevant predictors and plausible functional forms for the model.


The RSQUARE method differs from the other selection methods in that RSQUARE always identifies the model with the largest R2 for each number of variables considered. The other selection methods are not guaranteed to find the model with the largest R2. The RSQUARE method requires much more computer time than the other selection methods, so a different selection method such as the STEPWISE method is a good choice when there are many independent variables to consider.
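A minimal sketch of an all-subsets run (hypothetical data set and regressors; BEST=, START=, and STOP= are the options summarized in Table 61.2):

proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 x6
         / selection=rsquare start=1 stop=4 best=3   /* 3 best subsets of each size 1 through 4 */
           adjrsq cp;                                /* also report adjusted R2 and Cp */
run;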

Adjusted R2 Selection (ADJRSQ)

This method is similar to the RSQUARE method, except that the adjusted R2 statistic is used as the criterion for selecting models, and the method finds the models with the highest adjusted R2 within the range of sizes.

Mallows' Cp Selection (CP)

This method is similar to the ADJRSQ method, except that Mallows' Cp statistic is used as the criterion for model selection. Models are listed in ascending order of Cp.


Additional Information on Model-Selection Methods

If the RSQUARE or STEPWISE procedure (as documented in SAS User's Guide: Statistics, Version 5 Edition) is requested, PROC REG with the appropriate model-selection method is actually used.

Reviews of model-selection methods by Hocking (1976) and Judge et al. (1980) describe these and other variable-selection methods.

Criteria Used in Model-Selection Methods

When many significance tests are performed, each at a level of, for example, 5 percent, the overall probability of rejecting at least one true null hypothesis is much larger than 5 percent. If you want to guard against including any variables that do not contribute to the predictive power of the model in the population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE methods.

In most applications, many of the variables considered have some predictive power, however small. If you want to choose the model that provides the best prediction using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size, so you should use a moderate significance level, perhaps in the range of 10 percent to 25 percent.

In addition to R2, the Cp statistic is displayed for each model generated in the model-selection methods. The Cp statistic was proposed by Mallows (1973) as a criterion for selecting a model. It is a measure of total squared error, defined as

Cp = SSEp / s2 - (N - 2p)

where s2 is the MSE for the full model, and SSEp is the sum-of-squares error for a model with p parameters including the intercept, if any. If Cp is plotted against p, Mallows recommends the model where Cp first approaches p. When the right model is chosen, the parameter estimates are unbiased, and this is reflected in Cp near p. For further discussion, refer to Daniel and Wood (1980).
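As a quick worked illustration using the cigarette regressions later in these notes (this treats the three-variable fit as the full model, so s2 = 2.09012 and N = 25): the model with Tar alone has p = 2 and SSEp = 44.86909, so

Cp = 44.86909 / 2.09012 - (25 - 4) = 21.47 - 21 = 0.47

which is close to (indeed below) p = 2, so this subset shows no evidence of bias by the Cp criterion.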

The Adjusted R2 statistic is an alternative to R2 that is adjusted for the number of parameters in the model. The adjusted R2 statistic is calculated as

ADJRSQ = 1 - ((n - i)(1 - R2)) / (n - p)

where n is the number of observations used in fitting the model, and i is an indicator variable that is 1 if the model includes an intercept, and 0 otherwise.
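As a check against the cigarette regressions later in these notes: the regression of CO on Tar has n = 25, p = 2, i = 1, and R2 = 0.917, so

ADJRSQ = 1 - (24)(1 - 0.917) / 23 = 1 - 0.087 = 0.913

matching the reported R-Sq(adj) of 91.3%.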


Limitations in Model-Selection Methods

The use of model-selection methods can be time-consuming in some cases because there is no built-in limit on the number of independent variables, and the calculations for a large number of independent variables can be lengthy. The recommended limit on the number of independent variables for the MINR method is 20+i, where i is the value of the INCLUDE= option.

For the RSQUARE, ADJRSQ, or CP methods, with a large value of the BEST= option, adding one more variable to the list from which regressors are selected may significantly increase the CPU time. Also, the time required for the analysis is highly dependent on the data and on the values of the BEST=, START=, and STOP= options.


Model Fit and Diagnostic Statistics

This section gathers the formulas for the statistics available in the MODEL, PLOT, and OUTPUT statements. The model to be fit is Y = Xβ + ε, and the parameter estimate is denoted by b = (X'X)- X'Y. The subscript i denotes values for the ith observation, the parenthetical subscript (i) means that the statistic is computed using all observations except the ith observation, and the subscript jj indicates the jth diagonal matrix entry. The ALPHA= option in the PROC REG or MODEL statement is used to set the α value for the t statistics.

Table 61.6 contains the summary statistics for assessing the fit of the model.

Table 61.6: Formulas and Definitions for Model Fit Summary Statistics

Statistic    Definition or Formula

n            the number of observations
p            the number of parameters including the intercept
i            1 if there is an intercept, 0 otherwise
s2           the estimate of pure error variance from the SIGMA= option or from fitting the full model
SST0         the uncorrected total sum of squares for the dependent variable
SST1         the total sum of squares corrected for the mean for the dependent variable
SSE          the error sum of squares
MSE          SSE / (n - p)
R2           1 - SSE / SSTi  (SSTi is SST1 if there is an intercept, SST0 otherwise)
ADJRSQ       1 - ((n - i)(1 - R2)) / (n - p)
AIC          n ln(SSE / n) + 2p
BIC          n ln(SSE / n) + 2(p + 2)q - 2q2, where q = n s2 / SSE
CP (Cp)      SSE / s2 - (n - 2p)
GMSEP        MSE (n + 1)(n - 2) / (n(n - p - 1)) = (1/n) Sp (n + 1)(n - 2)
JP (Jp)      ((n + p) / n) MSE
PC           ((n + p) / (n - p)) (1 - R2) = Jp (n / SSTi)
PRESS        the sum of squares of predri (see Table 61.7)
RMSE         sqrt(MSE)
SBC          n ln(SSE / n) + p ln(n)
SP (Sp)      MSE / (n - p - 1)

Table 61.7 contains the diagnostic statistics and their formulas; these formulas and further information can be found in Chapter 2, "Introduction to Regression Procedures," and in the "Influence Diagnostics" section. Each statistic is computed for each observation.

Table 61.7: Formulas and Definitions for Diagnostic Statistics

Statistic         Formula

PRED (Ŷi)         Xi b
RES (ri)          Yi - Ŷi
H (hi)            xi (X'X)- xi'
STDP              sqrt(hi s2)
STDI              sqrt((1 + hi) s2)
STDR              sqrt((1 - hi) s2)
LCL               Ŷi - t(α/2) STDI
LCLM              Ŷi - t(α/2) STDP
UCL               Ŷi + t(α/2) STDI
UCLM              Ŷi + t(α/2) STDP
STUDENT           ri / STDRi
RSTUDENT          ri / (s(i) sqrt(1 - hi))
COOKD             (1/p) STUDENT2 (STDP2 / STDR2)
COVRATIO          det(s(i)2 (X(i)'X(i))-) / det(s2 (X'X)-)
DFFITS            (Ŷi - Ŷ(i)) / (s(i) sqrt(hi))
DFBETASj          (bj - b(i)j) / (s(i) sqrt((X'X)-jj))
PRESS (predri)    ri / (1 - hi)
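A minimal sketch of requesting several of these diagnostics in PROC REG (the data set and variable names are hypothetical):

proc reg data=mydata;
   model y = x1 x2 / r influence;   /* prints residual analysis and influence diagnostics */
   output out=diag p=pred r=resid h=leverage
          student=sres cookd=cd;    /* saves Table 61.7 statistics to a data set */
run;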

Testing for Heteroscedasticity

The regression model is specified as yi = xiβ + εi, where the εi's are identically and independently distributed: E(ε) = 0 and E(εε') = σ2 I. If the εi's are not independent or their variances are not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is inconsistent. In the case of heteroscedasticity, the ACOV option provides a consistent estimate of the covariance matrix. If the regression data are from a simple random sample, the ACOV option produces the covariance matrix. This matrix is

(X'X)-1 (X' diag(ei2) X) (X'X)-1

where ei = yi - xi b.

The SPEC option performs a model specification test. The null hypothesis for this test maintains that the errors are homoscedastic, independent of the regressors, and that several technical assumptions about the model specification are valid. For details, see theorem 2 and assumptions 1-7 of White (1980). When the model is correctly specified and the errors are independent of the regressors, the rejection of this null hypothesis is evidence of heteroscedasticity. In implementing this test, an estimator of the average covariance matrix (White 1980, p. 822) is constructed and inverted. The nonsingularity of this matrix is one of the assumptions in the null hypothesis about the model specification. When PROC REG determines this matrix to be numerically singular, a generalized inverse is used and a note to this effect is written to the log. In such cases, care should be taken in interpreting the results of this test.

When you specify the SPEC option, tests listed in the TEST statement are performed with both the usual covariance matrix and the heteroscedasticity consistent covariance matrix. Tests performed with the consistent covariance matrix are asymptotic. For more information, refer to White (1980).

Both the ACOV and SPEC options can be specified in a MODEL or PRINT statement.
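A minimal sketch of requesting both options (hypothetical data set and regressors):

proc reg data=mydata;
   model y = x1 x2 / acov spec;   /* heteroscedasticity-consistent covariances and White's test */
run;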


Collinearity Diagnostics
When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors. This problem is called collinearity or multicollinearity. It is a good idea to find out which variables are nearly collinear with which other variables. The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.

The COLLIN option in the MODEL statement requests that a collinearity analysis be performed. First, X'X is scaled to have 1s on the diagonal. If you specify the COLLINOINT option, the intercept variable is adjusted out first. Then the eigenvalues and eigenvectors are extracted. The analysis in PROC REG is reported with eigenvalues of X'X rather than singular values of X. The eigenvalues of X'X are the squares of the singular values of X.

The condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. The largest condition index is the condition number of the scaled X matrix. Belsley, Kuh, and Welsch (1980) suggest that, when this number is around 10, weak dependencies may be starting to affect the regression estimates. When this number is larger than 100, the estimates may have a fair amount of numerical error (although the statistical standard error almost always is much greater than the numerical error).

For each variable, PROC REG produces the proportion of the variance of the estimate accounted for by each principal component. A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.

The VIF option in the MODEL statement provides the Variance Inflation Factors (VIF). These factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. There are no formal criteria for deciding if a VIF is large enough to affect the predicted values.

The TOL option requests the tolerance values for the parameter estimates. The tolerance is defined as 1/VIF.
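As a quick check against Figure 61.41 below: RunTime has tolerance 0.62859, and 1/0.62859 = 1.59087, which is exactly its reported variance inflation factor.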

For a complete discussion of the preceding methods, refer to Belsley, Kuh, and Welsch (1980). For a more detailed explanation of using the methods with PROC REG, refer to Freund and Littell (1986).

This example uses the COLLIN option on the fitness data found in Example 61.1. The following statements produce Figure 61.41.


proc reg data=fitness;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
         / tol vif collin;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               6         722.54361      120.42393      22.43    <.0001
Error              24         128.83794        5.36825
Corrected Total    30         851.38154

Root MSE          2.31695    R-Square    0.8487
Dependent Mean   47.37581    Adj R-Sq    0.8108
Coeff Var         4.89057

Parameter Estimates

                     Parameter     Standard                               Variance
Variable     DF       Estimate        Error    t Value    Pr > |t|    Tolerance    Inflation
Intercept     1      102.93448     12.40326       8.30      <.0001            .            0
RunTime       1       -2.62865      0.38456      -6.84      <.0001      0.62859      1.59087
Age           1       -0.22697      0.09984      -2.27      0.0322      0.66101      1.51284
Weight        1       -0.07418      0.05459      -1.36      0.1869      0.86555      1.15533
RunPulse      1       -0.36963      0.11985      -3.08      0.0051      0.11852      8.43727
MaxPulse      1        0.30322      0.13650       2.22      0.0360      0.11437      8.74385
RestPulse     1       -0.02153      0.06605      -0.33      0.7473      0.70642      1.41559

Collinearity Diagnostics

Number    Eigenvalue    Condition Index
     1       6.94991            1.00000
     2       0.01868           19.29087
     3       0.01503           21.50072
     4       0.00911           27.62115
     5       0.00607           33.82918
     6       0.00102           82.63757
     7    0.00017947          196.78560

[Proportion of Variation columns for Intercept, RunTime, Age, Weight, RunPulse, MaxPulse, and RestPulse omitted]

Figure 61.41: Regression Using the TOL, VIF, and COLLIN Options


Case Study: Using Cigarette Data for An Introduction to Multiple Regression

Abstract
The CIGARETTE dataset contains measurements of weight and tar, nicotine, and carbon monoxide content for 25 brands of domestic cigarettes. The dataset is useful for introducing the ideas of multiple regression and provides examples of an outlier and a pair of collinear variables.

1. Introduction
The Federal Trade Commission annually rates varieties of domestic cigarettes according to their tar, nicotine, and carbon monoxide content. The United States Surgeon General considers each of these substances hazardous to a smoker's health. Past studies have shown that increases in the tar and nicotine content of a cigarette are accompanied by an increase in the carbon monoxide emitted from the cigarette smoke. This information motivates our goal of predicting the variable CO. The preliminary analysis involves two simple linear regressions. We find it helpful to write down the models for the two simple linear regressions. We plot the data and perform the two regressions. Each plot (CO*Tar and CO*Nicotine) shows a positive linear relationship. Both regressions have slopes that are significantly larger than zero. From these simple analyses it is easy to motivate the use of multiple regression, using the variables tar and nicotine simultaneously. We give a short description of the multiple regression model.


2. The Computer Demonstration

Matrix Plot of Co, weight, NI, tar

[Scatterplot matrix of Co, weight, NI, and tar; each off-diagonal panel plots one pair of variables]

Correlations: Co, weight, NI, tar

             Co    weight        NI
weight    0.464
          0.019

NI        0.926     0.500
          0.000     0.011

tar       0.957     0.491     0.977
          0.000     0.013     0.000

Cell Contents: Pearson correlation
               P-Value


Regression Analysis: Co versus NI

The regression equation is
Co = 1.66 + 12.4 NI

Predictor      Coef    SE Coef       T      P
Constant     1.6647     0.9936    1.68  0.107
NI           12.395      1.054   11.76  0.000

S = 1.82845   R-Sq = 85.7%   R-Sq(adj) = 85.1%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  462.26  462.26  138.27  0.000
Residual Error  23   76.89    3.34
Total           24  539.15

Unusual Observations
Obs    NI      Co     Fit  SE Fit  Residual  St Resid
  3  2.03  23.500  26.827   1.270    -3.327   -2.53RX
 19  0.96  17.500  13.564   0.376     3.936    2.20R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Regression Analysis: Co versus tar

The regression equation is
Co = 2.74 + 0.801 tar

Predictor      Coef    SE Coef       T      P
Constant     2.7433     0.6752    4.06  0.000
tar         0.80098    0.05032   15.92  0.000

S = 1.39672   R-Sq = 91.7%   R-Sq(adj) = 91.3%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  494.28  494.28  253.37  0.000
Residual Error  23   44.87    1.95
Total           24  539.15

Unusual Observations
Obs   tar      Co     Fit  SE Fit  Residual  St Resid
  3  29.8  23.500  26.612   0.928    -3.112   -2.98RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Regression Analysis: Co versus NI, tar

The regression equation is
Co = 3.09 - 2.65 NI + 0.962 tar

Predictor      Coef    SE Coef       T      P
Constant     3.0896     0.8438    3.66  0.001
NI           -2.646      3.787   -0.70  0.492
tar          0.9625     0.2367    4.07  0.001

S = 1.41252   R-Sq = 91.9%   R-Sq(adj) = 91.1%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  495.26  247.63  124.11  0.000
Residual Error  22   43.89    2.00
Total           24  539.15

Source  DF  Seq SS
NI       1  462.26
tar      1   33.00

Unusual Observations
Obs    NI      Co     Fit  SE Fit  Residual  St Resid
  3  2.03  23.500  26.399   0.987    -2.899   -2.87RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.


3. Getting the Data

Brand Name          Tar(mg)   Nicotine(mg)   Weight(g)   CO(mg)
Alpine                 14.1           0.86      0.9853     13.6
Benson&Hedges          16             1.06      1.0938     16.6
BullDurham             29.8           2.03      1.165      23.5
CamelLights             8             0.67      0.928      10.2
Carlton                 4.1           0.4       0.9462      5.4
Chesterfield           15             1.04      0.8885     15
GoldenLights            8.8           0.76      1.0267      9
Kent                   12.4           0.95      0.9225     12.3
Kool                   16.6           1.12      0.9372     16.3
L&M                    14.9           1.02      0.8858     15.4
LarkLights             13.7           1.01      0.9643     13
Marlboro               15.1           0.9       0.9316     14.4
Merit                   7.8           0.57      0.9705     10
MultiFilter            11.4           0.78      1.124      10.2
NewportLights           9             0.74      0.8517      9.5
Now                     1             0.13      0.7851      1.5
OldGold                17             1.26      0.9186     18.5
PallMallLight          12.8           1.08      1.0395     12.6
Raleigh                15.8           0.96      0.9573     17.5
SalemUltra              4.5           0.42      0.9106      4.9
Tareyton               14.5           1.01      1.007      15.9
TRUE                    7.3           0.61      0.9806      8.5
ViceroyRichLight        8.6           0.69      0.9693     10.6
VirginiaSlims          15.2           1.02      0.9496     13.9
WinstonLights          12             0.82      1.1184     14.9


4. Explain the meanings of the regression coefficients in the multiple regression equation

The regression equation is Co = 3.09 - 2.65 NI + 0.962 tar. With Nicotine (NI) held fixed, CO is estimated to increase by 0.962 mg when Tar increases by 1 mg; with Tar held fixed, CO is estimated to decrease by 2.65 mg when Nicotine increases by 1 mg. (Q: Is that reasonable??)


Review Sheet - Regression (14th week, 12/15/2006)

Class:            Name:            Score:

1. Cigarettes data: Y: CO (mg), X1: TAR (mg), X2: NICOTINE (mg), X3: WEIGHT (g)

SSR(X1) = 494.28131             SSE(X1) = 44.86909
SSR(X2) = 462.25591             SSE(X2) = 76.89449
SSR(X3) = 116.05651             SSE(X3) = 423.09389

SSR(X1, X2) = 495.25546         SSE(X1, X2) = 43.89494
SSR(X1, X3) = 494.30638         SSE(X1, X3) = 44.84402
SSR(X2, X3) = 462.25639         SSE(X2, X3) = 76.89401

SSR(X2 | X1) =                  SSR(X3 | X1, X2) =
SSR(X1 | X2, X3) =              SSR(X2 | X1, X3) =

(1) Compute R2Y1 =
(2) Compute R2Y2|1 =
(3) Compute R2Y3|12 =
(4) Compute R2Y1|23 =
(5) Compute R2Y2|13 =
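As a sketch of the first two computations, using the standard coefficient-of-determination definitions R2Y1 = SSR(X1)/SSTO and R2Y2|1 = SSR(X2 | X1)/SSE(X1), with SSTO = SSR(X1) + SSE(X1) = 539.15040:

R2Y1 = 494.28131 / 539.15040 = 0.9168
R2Y2|1 = (495.25546 - 494.28131) / 44.86909 = 0.97415 / 44.86909 = 0.0217

the second of which matches the Squared Partial Corr Type I reported for Nicotine in the output below.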


proc reg; model CO=TAR NICOTINE / SS1 SS2; run;

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         495.25546      247.62773     124.11    <.0001
Error              22          43.89494        1.99522
Corrected Total    24         539.15040

Parameter Estimates

                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|     Type I SS    Type II SS
Intercept     1      3.08961     0.84377       3.66      0.0014    3923.76960      26.75166
Tar           1      0.96247     0.23666       4.07      0.0005     494.28131      32.99954
Nicotine      1     -2.64627     3.78720      -0.70      0.4920       0.97415       0.97415
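Note how these columns line up with the R notation introduced earlier: the Type I SS for Nicotine is the extra sum of squares SSR(X2 | X1) = 495.25546 - 494.28131 = 0.97415, and the Type II SS for Tar is SSR(X1 | X2) = 495.25546 - 462.25591 = 32.99955 (reported as 32.99954, a rounding difference).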

proc reg; model CO=TAR WEIGHT / SS1 SS2; run;

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         494.30638      247.15319     121.25    <.0001
Error              22          44.84402        2.03836
Corrected Total    24         539.15040

Parameter Estimates

                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|     Type I SS    Type II SS
Intercept     1      3.11433     3.41620       0.91      0.3718    3923.76960       1.69404
Tar           1      0.80419     0.05904      13.62      <.0001     494.28131     378.24987
Weight        1     -0.42287     3.81299      -0.11      0.9127       0.02507       0.02507


proc reg; model CO=NICOTINE WEIGHT / SS1 SS2; run;

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         462.25639      231.12820      66.13    <.0001
Error              22          76.89401        3.49518
Corrected Total    24         539.15040

Parameter Estimates

                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|     Type I SS     Type II SS
Intercept     1      1.61398     4.44663       0.36      0.7201    3923.76960        0.46047
Nicotine      1     12.38812     1.24473       9.95      <.0001     462.25591      346.19988
Weight        1      0.05883     5.02395       0.01      0.9908     0.00047919     0.00047919


proc reg; model CO=TAR NICOTINE WEIGHT / SS1 SS2 PCORR1 PCORR2; run;

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         495.25781      165.08594      78.98    <.0001
Error              21          43.89259        2.09012
Corrected Total    24         539.15040

Parameter Estimates

                   Parameter    Standard                                                     Squared Partial
Variable     DF     Estimate       Error    t Value    Pr > |t|     Type I SS   Type II SS       Corr Type I
Intercept     1      3.20219     3.46175       0.93      0.3655    3923.76960      1.78844                 .
Tar           1      0.96257     0.24224       3.97      0.0007     494.28131     33.00142           0.91678
Nicotine      1     -2.63166     3.90056      -0.67      0.5072       0.97415      0.95143           0.02171
Weight        1     -0.13048     3.88534      -0.03      0.9735       0.00236      0.00236        0.00005370

Parameter Estimates

                   Squared Partial
Variable     DF      Corr Type II
Intercept     1                 .
Tar           1           0.42918
Nicotine      1           0.02122
Weight        1        0.00005370


*SAS program;
dm 'LOG; CLEAR;OUTPUT;CLEAR;';
Title 'Read Cigarette data from EXCEL';
Proc import datafile="L:\cigarettes_SAS.xls" out=cigarettes;
   Getnames=Yes;
run;
proc reg; model CO=TAR; run;
proc reg; model CO=NICOTINE; run;
proc reg; model CO=WEIGHT; run;
proc reg; model CO=TAR NICOTINE / SS1 SS2; run;
proc reg; model CO=NICOTINE WEIGHT / SS1 SS2; run;
proc reg; model CO=TAR WEIGHT / SS1 SS2; run;
proc reg; model CO=TAR NICOTINE WEIGHT / SS1 SS2 PCORR1 PCORR2; run;


Review Sheet HW - Regression (14th week, 12/15/2006)

Class:            Name:            Score:

1. Cigarettes data: Y: CO (mg), X1: TAR (mg), X2: NICOTINE (mg), X3: WEIGHT (g)

SSR(X1) = 494.28131             SSE(X1) = 44.86909
SSR(X2) = 462.25591             SSE(X2) = 76.89449
SSR(X3) = 116.05651             SSE(X3) = 423.09389

SSR(X1, X2) = 495.25546         SSE(X1, X2) = 43.89494
SSR(X1, X3) = 494.30638         SSE(X1, X3) = 44.84402
SSR(X2, X3) = 462.25639         SSE(X2, X3) = 76.89401

SSR(X1, X2, X3) = 495.25781     SSE(X1, X2, X3) = 43.89259
a. Test whether X2 can be dropped from the regression model given that X1 is retained. Use the general linear test statistic F* and α = 0.05.

b. Test whether X1 and X3 can be dropped from the regression model given that X2 is retained. Use the general linear test statistic F* and α = 0.05.

c. Test whether X2 can be dropped from the regression model given that X1 and X3 are retained. Use the general linear test statistic F* and α = 0.05.

d. Test whether X2 and X3 can be dropped from the regression model given that X1 is retained. Use the general linear test statistic F* and α = 0.05.
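As a sketch of the intended computation for part a, using the sums of squares above (the general linear test compares the reduced model containing X1 to the full model containing X1 and X2):

F* = [ (SSE(X1) - SSE(X1, X2)) / 1 ] / [ SSE(X1, X2) / 22 ]
   = (44.86909 - 43.89494) / 1.99522
   = 0.97415 / 1.99522 = 0.49

Since F* = 0.49 < F(0.95; 1, 22) = 4.30, X2 can be dropped when X1 is retained. As a check, 0.49 = (-0.70)2, the square of Nicotine's t value in the two-variable output above.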

