
Multiple Regression Analysis

Purpose of multiple regression


The purpose of multiple regression is to analyze the relationship between metric or dichotomous independent variables and a metric dependent variable. If there is a relationship, using the information in the independent variables will improve our accuracy in predicting values for the dependent variable.
SW388R7 Data Analysis & Computers II

Types of multiple regression


There are three types of multiple regression, each of which is designed to answer a different question:

Standard multiple regression is used to evaluate the relationships between a set of independent variables and a dependent variable.

Hierarchical, or sequential, regression is used to examine the relationships between a set of independent variables and a dependent variable, after controlling for the effects of some other independent variables on the dependent variable.

Stepwise, or statistical, regression is used to identify the subset of independent variables that has the strongest relationship to a dependent variable.

Standard multiple regression


In standard multiple regression, all of the independent variables are entered into the regression equation at the same time. Multiple R and R² measure the strength of the relationship between the set of independent variables and the dependent variable. An F test is used to determine whether the relationship can be generalized to the population represented by the sample. A t-test is used to evaluate the individual relationship between each independent variable and the dependent variable.
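As a sketch of these quantities, the snippet below fits a standard regression with all predictors entered at once and derives Multiple R, R², the overall F statistic, and the per-coefficient t statistics from first principles. The data, seed, and coefficients are invented for illustration; this is not the worked example from the slides.

```python
import numpy as np

# Synthetic illustrative data (names and coefficients are invented)
rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

# All independent variables entered into the equation at the same time
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
yhat = Xd @ beta

# Multiple R and R-squared: strength of the overall relationship
ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - yhat) ** 2)
r_squared = 1 - ss_resid / ss_total
multiple_r = np.sqrt(r_squared)

# F test: can the relationship be generalized to the population?
f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# t statistic for each coefficient: individual relationship with the DV
mse = ss_resid / (n - k - 1)
se = np.sqrt(np.diag(mse * np.linalg.inv(Xd.T @ Xd)))
t_stats = beta / se
```

In practice SPSS reports all of these in the Model Summary, ANOVA, and Coefficients tables of a single Enter-method run.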

Hierarchical multiple regression


In hierarchical multiple regression, the independent variables are entered in two stages. In the first stage, the independent variables that we want to control for are entered into the regression. In the second stage, the independent variables whose relationship we want to examine after the controls are entered. A statistical test of the change in R² from the first stage is used to evaluate the importance of the variables entered in the second stage.
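The two-stage logic can be sketched as follows: fit the controls-only model, fit the full model, and apply an F test to the R² change. The data, variable names (`control`, `x1`), and seed are invented for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared from OLS of y on X (X already includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Invented illustrative data: one control variable, one predictor of interest
rng = np.random.default_rng(1)
n = 120
control = rng.normal(size=n)
x1 = 0.5 * control + rng.normal(size=n)
y = 1.0 + 0.7 * control + 0.9 * x1 + rng.normal(size=n)

ones = np.ones(n)
r2_stage1 = r_squared(np.column_stack([ones, control]), y)       # controls only
r2_stage2 = r_squared(np.column_stack([ones, control, x1]), y)   # controls + x1

# F test of the change in R-squared between the two stages
m, k_full = 1, 2   # variables added in stage 2; total predictors in full model
f_change = ((r2_stage2 - r2_stage1) / m) / ((1 - r2_stage2) / (n - k_full - 1))
```

A large `f_change` indicates that the stage-2 variables matter even after the controls are taken into account.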

Stepwise multiple regression


Stepwise regression is designed to find the most parsimonious set of predictors that is most effective in predicting the dependent variable. Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables. When none of the possible additions can make a statistically significant improvement in R², the analysis stops.
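A minimal forward-selection sketch of this procedure is shown below. The data are invented (predictors 0 and 2 carry signal, 1 and 3 are noise), and the fixed F-to-enter threshold of 4.0 is a rough stand-in for the significance test SPSS applies; it is not the exact SPSS criterion.

```python
import numpy as np

def r2(cols, y):
    """R-squared of y regressed on the given predictor columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Invented data: predictors 0 and 2 matter, 1 and 3 are pure noise
rng = np.random.default_rng(2)
n = 150
X = rng.normal(size=(n, 4))
y = 1.2 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(size=n)

selected, remaining, current_r2 = [], list(range(4)), 0.0
F_TO_ENTER = 4.0   # rough stand-in for a significance test of the R-squared change

while remaining:
    # R-squared each remaining candidate would give if added next
    trials = {j: r2([X[:, i] for i in selected + [j]], y) for j in remaining}
    best = max(trials, key=trials.get)
    k = len(selected) + 1
    f_change = (trials[best] - current_r2) / ((1 - trials[best]) / (n - k - 1))
    if f_change < F_TO_ENTER:
        break   # no possible addition makes a significant improvement: stop
    selected.append(best)
    remaining.remove(best)
    current_r2 = trials[best]
```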

Assumption of independence of errors - 1

Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals or differences between the actual score for a case and the score estimated using the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case. The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.
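The Durbin-Watson statistic described above can be computed directly from the residuals as the ratio of summed squared successive differences to summed squared residuals. The snippet below, using invented data, shows the statistic landing near 2 for independent errors and near 0 for strongly serially correlated ones.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared
    residuals; ranges from 0 to 4, with ~2 meaning no serial correlation."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(3)
independent_errors = rng.normal(size=500)            # no serial correlation
correlated_errors = np.cumsum(rng.normal(size=500))  # strong serial correlation

dw_ok = durbin_watson(independent_errors)    # lands near 2
dw_bad = durbin_watson(correlated_errors)    # lands near 0
```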


Assumption of independence of errors - 2


Serial correlation is more of a concern in analyses that involve time series. If it does occur in relationship analyses, its presence can usually be diagnosed by changing the sequence of cases and running the analysis again. If the problem with serial correlation disappears, it may be ignored. If the problem with serial correlation remains, it can usually be handled by using the difference between successive data points as the dependent variable.
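The differencing remedy in the last sentence can be sketched as follows: a serially correlated series fails the Durbin-Watson check, but its first differences fall back into the acceptable range. The data and seed are invented, and the demeaned series stands in for regression residuals.

```python
import numpy as np

def durbin_watson(residuals):
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

# Invented serially correlated dependent variable
rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=300))

# Using the demeaned series as a stand-in for regression residuals
dw_before = durbin_watson(y - y.mean())      # far below the 1.50 - 2.50 range

# Remedy: use the difference between successive data points as the DV
dy = np.diff(y)
dw_after = durbin_watson(dy - dy.mean())     # back near 2
```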

Multicollinearity - 1
Multicollinearity is a problem in regression analysis that occurs when two or more independent variables are highly correlated, e.g. r = 0.90 or higher. The relationship between the independent variables and the dependent variable is distorted by the very strong relationship among the independent variables, leading to the likelihood that our interpretation of relationships will be incorrect. In the worst case, if the variables are perfectly correlated, the regression cannot be computed. SPSS guards against the failure to compute a regression solution by arbitrarily omitting the collinear variable from the analysis.

Multicollinearity - 2
Multicollinearity is detected by examining the tolerance for each independent variable. Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables. Tolerance values less than 0.10 indicate multicollinearity. If we discover multicollinearity in the regression output, we should judge the interpretation of the relationships to be false until the issue is resolved. Multicollinearity can be resolved by combining the highly correlated variables through principal component analysis, or by omitting one of the variables from the analysis.
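Tolerance is 1 minus the R² obtained by regressing each predictor on all the others, which the sketch below computes directly. The data are invented: `b` is nearly a copy of `a`, so both fail the 0.10 cutoff, while the unrelated `c` does not.

```python
import numpy as np

def tolerance(X):
    """Tolerance of each predictor: 1 minus the R-squared obtained by
    regressing that predictor on all of the other predictors."""
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        tol[j] = resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return tol

# Invented data: b is nearly a copy of a, c is unrelated
rng = np.random.default_rng(5)
n = 200
a = rng.normal(size=n)
b = a + 0.05 * rng.normal(size=n)
c = rng.normal(size=n)

tol = tolerance(np.column_stack([a, b, c]))  # tol[0], tol[1] < 0.10; tol[2] near 1
```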

Adjusted R²
The coefficient of determination, R², which measures the strength of a relationship, can usually be increased simply by adding more variables. In fact, if the number of variables equals the number of cases, it is often possible to produce a perfect R² of 1.00.

Adjusted R² is a measure that attempts to reduce the inflation in R² by taking into account the number of independent variables and the number of cases. If there is a large discrepancy between R² and Adjusted R², extraneous variables should be removed from the analysis and R² recomputed.
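One common formula for this adjustment is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of independent variables. The sketch below (with invented numbers) shows how the same R² of 0.50 survives the adjustment with many cases and few predictors, but collapses when predictors nearly match cases.

```python
def adjusted_r2(r2, n, k):
    """Deflate R-squared for the number of predictors k and the number of cases n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared of 0.50, very different adjusted values
comfortable = adjusted_r2(0.50, n=100, k=2)   # barely penalized
overfitted = adjusted_r2(0.50, n=20, k=10)    # penalty wipes out the apparent fit
```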

Influential cases
The degree to which outliers affect the regression solution depends upon where the outlier is located relative to the other cases in the analysis. Outliers whose location has a large effect on the regression solution are called influential cases. Whether or not a case is influential is measured by Cook's distance. Cook's distance is an index measure; it is compared to a critical value based on the formula 4 / (n - k - 1), where n is the number of cases and k is the number of independent variables. If a case has a Cook's distance greater than the critical value, it should be examined for exclusion. In this example, the criterion for identifying an influential case is a Cook's distance of 0.0160 or greater.
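Cook's distance combines a case's residual with its leverage. The sketch below, on invented data with one deliberately planted influential case, computes it via the hat matrix and flags cases exceeding the 4 / (n − k − 1) cutoff from the text.

```python
import numpy as np

# Invented data with one planted influential case
rng = np.random.default_rng(6)
n, k = 250, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)
y[0] += 10.0   # make case 0 an influential outlier

Xd = np.column_stack([np.ones(n), X])
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T    # hat matrix
h = np.diag(H)                              # leverage of each case
e = y - H @ y                               # residuals
mse = e @ e / (n - k - 1)

# Cook's distance combines the residual and the leverage of each case
cooks_d = (e ** 2 / ((k + 1) * mse)) * (h / (1 - h) ** 2)

critical = 4 / (n - k - 1)                  # rule-of-thumb cutoff from the text
influential = np.flatnonzero(cooks_d > critical)
```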

Steps in regression analysis: running the baseline model


The following is a guide to the decision process for answering problems about the complete regression analysis:
Is the dependent variable metric, and are the independent variables metric or dichotomous?
No: Inappropriate application of a statistic.
Yes: continue.

Is the ratio of cases to independent variables at least 5 to 1?
No: Inappropriate application of a statistic.
Yes: Run the baseline regression, using the method for including variables identified in the research question. Record R² for evaluation of transformations and removal of outliers and influential cases. Record the Durbin-Watson statistic for the assumption of independence of errors.


Steps in regression analysis: evaluating assumptions - 1


Is the dependent variable normally distributed?
No: Try (1) a logarithmic transformation, (2) a square root transformation, (3) an inverse transformation. If unsuccessful, add a caution for violation of regression assumptions.
Yes: continue.

Are the metric independent variables normally distributed and linearly related to the dependent variable?
No: Try (1) a logarithmic transformation, (2) a square root transformation, (3) a square transformation, (4) an inverse transformation. If unsuccessful, add a caution for violation of regression assumptions.
Yes: continue.


Steps in regression analysis: evaluating assumptions - 2


Is the dependent variable homoscedastic across the categories of the dichotomous independent variables?
No: Add a caution for violation of regression assumptions.
Yes: continue.

Are the residuals independent, i.e. is the Durbin-Watson statistic between 1.5 and 2.5?
No: Add a caution for violation of regression assumptions.
Yes: continue.


Steps in regression analysis: evaluating outliers


Request statistics for detecting outliers and influential cases by running a standard multiple regression, using the Enter method to include all variables and substituting any transformed variables.

Are there univariate outliers on the dependent variable, multivariate outliers on the independent variables, or influential cases?
Yes: Remove the outliers and influential cases from the data set.
No: continue.

Is the ratio of cases to independent variables still at least 5 to 1?
No: Restore the outliers and influential cases to the data set, and add a caution to the findings.
Yes: continue.


Steps in regression analysis: picking regression model for interpretation


Were transformed variables substituted, or outliers and influential cases omitted?
No: Pick the baseline regression for interpretation.
Yes: Evaluate the impact of the transformations and the removal of outliers by running the regression again, using the method for including variables identified in the research question.

Is R² for the evaluated regression greater than R² for the baseline regression by 2% or more?
Yes: Pick the regression with transformations and with outliers omitted for interpretation.
No: Pick the baseline regression for interpretation.


Steps in regression analysis: overall relationship is interpretable


Is the probability of the ANOVA test of the regression less than or equal to the level of significance?
No: The answer to the problem is false.
Yes: continue.

Is the tolerance for all independent variables greater than 0.10, indicating no multicollinearity?
No: The answer to the problem is false.
Yes: continue.


Steps in regression analysis: validation - 1


Are there enough valid cases to split the sample and keep the 5 to 1 ratio of cases to variables?
Yes: Set the random seed and compute the split variable. Re-run the regression with split = 0, then re-run the regression with split = 1.
No: Set the first random seed and compute the split1 variable, and re-run the regression with split1 = 1. Then set the second random seed and compute the split2 variable, and re-run the regression with split2 = 1.

Is the probability of the ANOVA test less than or equal to the level of significance for both validation analyses?
No: The answer to the problem is false.
Yes: continue.
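The split-sample validation can be sketched as below: seed the random generator, compute a split variable, and re-fit the regression on each half to check that the relationship holds in both. The data, seed, and R² threshold here are invented for illustration, and the sketch uses R² rather than the ANOVA probability SPSS would report.

```python
import numpy as np

# Invented data for a split-sample validation
rng = np.random.default_rng(7)   # "set the random seed"
n = 300
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

split = (rng.random(n) < 0.5).astype(int)   # compute the split variable

def r2_on(mask):
    """Fit the regression on the masked subset and return its R-squared."""
    Xd = np.column_stack([np.ones(mask.sum()), X[mask]])
    beta, *_ = np.linalg.lstsq(Xd, y[mask], rcond=None)
    resid = y[mask] - Xd @ beta
    return 1 - resid @ resid / np.sum((y[mask] - y[mask].mean()) ** 2)

r2_split1 = r2_on(split == 1)   # re-run regression with split = 1
r2_split0 = r2_on(split == 0)   # re-run regression with split = 0
```

If both halves show a comparable, significant relationship, the findings generalize beyond the particular sample split.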
