
Running head: TUTORIAL ON ASSUMPTIONS IN MULTIPLE REGRESSION

Tutorial on Assumptions in Multiple Regression

Jaylene Bettcher

APSY 607 Multivariate Design and Analysis

Dr. David Nordstokke

June 23, 2011

Tutorial on Assumptions in Multiple Regression

Multiple regression (MR) is a statistical technique that allows researchers to predict a single dependent variable (DV) from a set of two or more predictor or independent variables (IVs) (Stevens, 2009). In order for researchers to attain an accurate understanding of the relationship between the predictors (IVs) and the dependent variable via MR, stringent assumptions need to be met. These assumptions allow for accurate inferences not only about the data, but also about the population at hand. In MR it is assumed that the dependent variable is a linear function of the independent variables, the errors or residuals are independent, the errors follow a normal distribution, and the errors have constant variance. The following tutorial will explain both the importance of these assumptions and how to test for them.

Overview of Multiple Regression

This overview of multiple regression (MR) is intended for readers who have some background knowledge in statistics. Although this may be a review for some readers, it is important that the fundamentals of MR are examined to ensure basic understanding. The overview will encompass the criteria for MR, the three types of MR, when it is appropriate to use MR, and an example of MR.

Criteria for MR

According to Stevens (2009), MR is an intermediate prediction method that allows researchers to predict a single dependent variable (DV) from a set of two or more predictor or independent variables (IVs). The DV must be a continuous variable, which is a variable that can take on any value within the limits of the variable's range, such as age. Although the IVs are typically continuous, they may also be categorical, whereby each value falls into a distinct category, for instance sex (male or female) (Stevens, 2009).

Keith (2006) recommends that researchers convert categorical variables into dummy variables to ensure that the variables are correctly analyzed and sound inferences are drawn. (Although this tutorial will not go into further detail on dummy variables, the following link may be of interest: http://www.socialresearchmethods.net/kb/dummyvar.php). Furthermore, Harlow (2005) explains that IVs should be relatively uncorrelated with each other, but correlated with the DV, as this allows for variance within and between IVs and a linear relationship between the IVs and the DV.

Types of MR

There are three types of multiple regression: hierarchical, stepwise, and standard. In hierarchical regression, which is also referred to as sequential regression, the researcher enters each variable (or set of variables) into the equation in a specific order. This method provides an estimate of the effects of one variable on another (given that the variables are entered in the correct order), allowing the researcher to potentially find a multiple correlation (Keith, 2006). In stepwise regression the computer, rather than the researcher, chooses the order in which the variables are entered, based on which IVs are most highly correlated with the DV. This method examines the IVs that have the strongest correlations with the DV; however, results from a stepwise regression can be misleading because the method is prone to bias (Keith, 2006). In standard regression, which is also referred to as simultaneous regression, the IVs are entered into the regression equation simultaneously in order to make inferences about the impact of each variable (Keith, 2006). This method allows the researcher to examine how groups of IVs relate to the DV through regression coefficients and statistical significance (Keith, 2006).

For the purpose of this tutorial, we will only use examples involving standard regression, and as such standard regression will be discussed in further detail. As previously stated, standard regression is used to examine the relationship between groups of IVs and the DV, which allows the researcher to find multiple correlations. By using regression coefficients and statistical significance, the researcher is able to make inferences about the importance of each variable in both explanatory and predictive contexts (Keith, 2006). Keith (2006) explains that standard regression is advantageous because it gives estimates of the direct effects of each IV on the outcome or DV. Keith (2006) further explains that a disadvantage of standard regression is that the regression coefficients may change according to the variables that are included in the equation. For instance, if the researcher aspires to find that the enjoyment of golf has a moderate to strong effect on an individual's handicap, then they may decide not to include stress associated with golf in the analysis. Therefore, to an extent, the researcher is able to bias their results to adhere to their hypotheses, which makes extensive knowledge of the subject and honesty important values.

When to use MR

Although researchers tend to use MR when they are interested in predicting a DV from a set of IVs, it may be difficult to recognize when a MR model is appropriate. According to Brace et al. (2003), it is appropriate to use MR when:
1. Exploring linear relationships. MR may be used when the relationship between the IVs and the DV is linear (when the relationship follows a straight line).
2. When the outcome variable is continuous. The DV should be measured on a continuous scale, for instance, an interval or ratio scale.

3. When predictors are either continuous or are easily transformed into dummy variables. The predictor variables should be continuous, for instance, measured on a ratio, interval, or ordinal scale. It is appropriate to use nominal predictor variables if they are dichotomous (the value can only belong to one of two groups), for example, sex, whereby males are coded as 0 and females are coded as 1 (or vice versa). As previously mentioned, a dummy variable is also acceptable (see the syntax sketch following this list).
4. When there are a large number of subjects in the experiment. In order to make an accurate inference about the data and the given population, it is important that the number of participants considerably exceeds the number of predictor variables in the equation. Keith (2006) suggests that an acceptable ratio is 10 subjects to one predictor variable; however, Brace et al. (2003) believe that the ratio should be as high as 40 subjects to one predictor variable.
5. When the researcher wishes to either explain or predict the relationship between the IVs and the DV. The researcher uses the IVs to explain how an effect came about, which subsequently allows the researcher to discuss probable impacts of the IVs on the DV. However, it is important to remember that a correlation does not signify a clear inference of cause and effect (Keith, 2006). Researchers also use MR to predict the impact that IVs have on the DV; for instance, does the number of games golfed per year significantly affect an individual's handicap?
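To make point 3 concrete, the following is a minimal sketch of how a dichotomous variable could be turned into a 0/1 dummy variable using SPSS syntax. The variable names (sex, sex_dummy) and codings are hypothetical:

* Hypothetical coding: sex is recorded as 1 = male, 2 = female.
RECODE sex (1=0) (2=1) INTO sex_dummy.
VALUE LABELS sex_dummy 0 'Male' 1 'Female'.
EXECUTE.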

MR example

The following fabricated paradigm is intended to provide the reader with an example of when a researcher would use MR, as well as a visual of the relationship between predictor and outcome variables. It will also provide the reader with concise directions for executing a standard MR equation in SPSS.

Steve, a golf coach who aspires to train male junior golfers to compete in the PGA (Professional Golfers' Association), is interested in investigating what factors predict a low handicap (for those who are unfamiliar with golf terms, a low or negative handicap is desirable). After much research and contemplation, Steve believes that games golfed per week, years of experience, enjoyment of golf, and desire to become a professional golfer may be correlated with a young male golfer's handicap. With the help of a statistics company, Steve creates a survey that encompasses questions based on experience, enjoyment of golf, and desire to make golf a profession. Steve randomly selects 275 male junior golfers at private and public golf courses to fill out the survey; out of the 275 surveys that were handed out, 250 were returned completed. Steve would like the statistics company to use his data to predict which IVs have a moderate to strong correlation with an individual's golf score. This example will be referred to throughout the tutorial.

Figure 1. Depiction of Golf Related Predictors on an Individual's Golf Score

[Figure 1: a path diagram in which four predictors (games golfed per week, years of experience, enjoyment of golf, and desire to be a professional golfer) point to the outcome variable, handicap.]


Figure 1 is a depiction of a standard MR with four predictors and one outcome variable. The lines connecting the four IVs represent correlations among the predictors, while the arrow pointing toward the outcome variable (handicap) represents prediction error.

MR in SPSS

In order to run a standard MR in SPSS, the data must be properly entered and screened for missing values, outliers, implausible means and standard deviations, and so on. For more information about data screening, please read chapter four of Tabachnick and Fidell (2007). The following link provides an excellent step-by-step tutorial on standard multiple regression in SPSS: http://calcnet.mth.cmich.edu/org/spss/V16_materials/Video_Clips_v16/21lin_regress1/21lin_regress1.swf. Basic instructions are also written out here to assist the reader's understanding of MR and to foster the reader's ability to run MR in SPSS.

Click on ANALYZE > REGRESSION > LINEAR > from here you will insert the dependent variable (handicap) and the independent variables (games golfed per week, years of experience, enjoyment of golf, and desire to be a professional golfer) into the correctly labelled boxes. To use a standard regression model, keep the METHOD at ENTER, which is the program's default setting. If you click on STATISTICS (upper right corner), a box titled Linear Regression: Statistics will appear; here you can choose the regression coefficients (estimates, model fit, R squared change, descriptives, part and partial correlations, and collinearity diagnostics) and residuals (Durbin-Watson). If you click on PLOTS (below STATISTICS), a box titled Linear Regression: Plots will appear; here you can choose scatterplots (ZPRED is entered into the Y box and the ZRESID score is entered into the X box). Furthermore, you can choose the standardized residual plots (histogram and normal probability plot). Below PLOTS is the SAVE button; for the purposes of this tutorial we will not be reviewing this feature, but the tutorial from the previous link explains it. Below SAVE is OPTIONS; it may be wise to choose EXCLUDE CASES PAIRWISE instead of EXCLUDE CASES LISTWISE to ensure that missing values do not have a large effect on the analysis. Clicking on OK will run the standard regression model. The following link explains in detail how to interpret output data for MR in SPSS: http://www.statisticshell.com/multireg.pdf
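For readers who prefer syntax to menus, clicking PASTE instead of OK writes the equivalent commands to a syntax window. The following is a rough sketch of what that syntax might look like for the golf example; the variable names (games_week, years_exp, enjoyment, desire) are hypothetical stand-ins for Steve's data:

REGRESSION
  /MISSING PAIRWISE
  /DESCRIPTIVES MEAN STDDEV CORR
  /STATISTICS COEFF R ANOVA CHANGE ZPP COLLIN TOL
  /DEPENDENT handicap
  /METHOD=ENTER games_week years_exp enjoyment desire
  /RESIDUALS DURBIN.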

Assumptions in MR

Assumptions are important because they allow us to make valid inferences about our hypotheses with few or no biases. If the assumptions are not met, the results will be biased and the inferences will be inaccurate. The following are the assumptions underlying the multiple regression model:

1. The dependent variable is a linear function of the independent variables
2. Errors are independent
3. Errors follow a normal distribution
4. Errors have constant variance

The importance of these assumptions and how to test for them will be discussed in detail in the following sections.
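Before examining each assumption, it may help to write out the model the assumptions refer to. For the golf example, the standard MR equation takes the general form (a sketch; b0 through b4 are generic symbols for the intercept and regression coefficients):

handicap = b0 + b1(games golfed per week) + b2(years of experience) + b3(enjoyment of golf) + b4(desire to become a professional) + e

where e is the error, or residual, term. Assumption 1 concerns the form of the equation itself, while assumptions 2 through 4 all concern the behaviour of e.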

Linearity

The first and foremost assumption that will be discussed in this tutorial is linearity. It is assumed that the DV is a linear function of the IVs, and consequently it is also assumed that a linear model is appropriate (Keith, 2006). It is important to note that curvilinear models may be used in multiple regression; however, for the purpose of this tutorial we will focus solely on linear models. If the assumption of linearity is violated, the estimates and inferences from the regression may be biased, which means that the inferences may not be an accurate reflection of the given population (Keith, 2006).

To test the assumption of linearity, a bivariate scatterplot may be used, whereby a single predictor variable is plotted against the DV (Keith, 2006). Figure 2 displays two examples of bivariate scatterplots that test the assumption of linearity. The scatterplot on the left is an example of a linear relationship because the values fall along a straight line and there is variance among the values. Conversely, the scatterplot on the right is not an example of a linear relationship because the values are plotted tightly together and form a curvilinear pattern, which may create error.

Figure 2. Bivariate Scatterplots to Test the Assumption of Linearity

Figure 2 was retrieved from David Nordstokke's week 3 PowerPoint, Introduction to Multiple Regression. To create a bivariate scatterplot in SPSS, click on GRAPHS > LEGACY DIALOGS > SCATTER/DOT > from there you will choose SIMPLE SCATTERPLOT > click DEFINE. Enter your dependent variable (handicap) into the Y axis and the independent variable (e.g., games golfed per week) into the X axis. Click OK to create the scatterplot.
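The equivalent syntax is short; a sketch, again using the hypothetical variable name games_week:

GRAPH
  /SCATTERPLOT(BIVAR)=games_week WITH handicap.

In SCATTERPLOT(BIVAR), the variable before WITH goes on the X axis and the variable after WITH goes on the Y axis, matching the dialog settings above.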

Errors are Independent

The second assumption that will be examined in this tutorial is the independence assumption, whereby we assume that the subjects are responding independently of one another (Keith, 2006). Returning to the golf example, this assumption may be violated if the data were taken from junior male golfers who were members of a private golf club that accepts applicants based on their enjoyment of golf and desire to become professional golfers. If this assumption is violated, then the risk of error increases significantly, making the study unreliable (Harlow, 2005).

To test the assumption that errors are independent, boxplots may be utilized. Boxplots allow the researcher to observe the variability between the lowest value and the highest value (Keith, 2006). The middle line through the box represents the median, the box represents the middle half of the values (from the 25th to the 75th percentile), and the extended lines display high and low values (excluding outliers and extreme values) (Keith, 2006). Figure 3 is a boxplot of all of the predictor variables from the golf example. It is evident that there is sufficient variance between the low and high values for years of experience, enjoyment of golf, and desire to become a professional golfer; however, games golfed per week has a lower variance, as the median is merely a few values away from the lowest value. Typically, these values should be further examined, but in this case it seems plausible that the majority of golfers golf fewer than five times per week.




Figure 3. Boxplot of Golf Related Predictor Variables

To create a boxplot in SPSS, go to GRAPHS > LEGACY DIALOGS > BOXPLOT > ensure that SIMPLE is highlighted and that SUMMARIES OF SEPARATE VARIABLES is chosen to display the variance of each variable. Insert the desired variables into the BOXES REPRESENT box and click OK to create the boxplot.
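For these dialog choices, SPSS pastes an EXAMINE command; the following sketch (hypothetical variable names again) should produce comparable boxplots:

EXAMINE VARIABLES=games_week years_exp enjoyment desire
  /COMPARE VARIABLES
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.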

Errors follow a normal distribution

The third assumption that will be examined in this tutorial is the assumption that errors follow a normal distribution with even variance (Stevens, 2009). It is important that errors are normally distributed to ensure that inferences are accurate and that the probability of error remains low. According to Keith (2006), if the values of the residuals are plotted, they should ultimately depict a normal curve.

In order to test whether errors follow a normal distribution, histograms and p-p plots may be used. If a histogram shows a normal curve without skewness or kurtosis, the assumption is met and the errors follow a normal distribution. Figure 4 is a histogram depicting the effect that enjoyment of golf has on the desire to become a professional golfer. Although the curve in Figure 4 appears to resemble a normal distribution, it may be beneficial to run a p-p plot, as it is easier to notice a deviation from a line than a deviation in a curve (Keith, 2006). According to Keith (2006), like a q-q plot, a p-p plot uses cumulative frequencies, and the errors are normally distributed if the residuals fall close to the straight diagonal line. Figure 5 is a p-p plot that is analogous to the histogram in Figure 4; however, it is easily noticed that the residuals do not adhere to the line in certain areas, which may indicate that responses from certain subjects violate the assumption.

Figure 4. Histogram of Golf Related Predicting Variables


Figure 5. P-P Plot of Golf Related Predicting Variables



To simultaneously create a histogram and p-p plot for MR in SPSS, click ANALYZE > REGRESSION > LINEAR > enter the dependent and independent variables. Click on PLOTS (located on the right), enter ZPRED into Y and ZRESID into X, check HISTOGRAM and NORMAL PROBABILITY PLOT > CONTINUE > click OK to create the graphs.
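In syntax form, the same residual plots can be requested on the REGRESSION command itself; a sketch using the hypothetical variable names from earlier:

REGRESSION
  /DEPENDENT handicap
  /METHOD=ENTER games_week years_exp enjoyment desire
  /SCATTERPLOT=(*ZPRED ,*ZRESID)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).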

Errors have a constant variance

The fourth and final assumption that will be discussed in this tutorial is the assumption that errors have a constant variance, which is also referred to as homoscedasticity (Keith, 2006). This assumption supposes that the variance of the errors around the regression line is fairly consistent across levels of the X axis (Keith, 2006). If this assumption is violated, standard errors and statistical significance will be considerably affected, thus lowering the power of the analysis and the accuracy of the inferences (Keith, 2006).

To test whether errors have constant variance, scatterplots of residuals against predicted values are often beneficial (Keith, 2006). Figure 6 features two scatterplots depicting desirable and undesirable variances. The scatterplot on the left depicts a desirable variance, as the values are randomly scattered and the variances are consistently equal, while the scatterplot on the right depicts an undesirable variance, as the values are neither randomly scattered nor consistently equal in variance.

Figure 6. Scatterplots Depicting Desirable and Undesirable Variance


Figure 6 was retrieved from David Nordstokke's week 3 PowerPoint, Introduction to Multiple Regression. To create a residual scatterplot in SPSS, click ANALYZE > REGRESSION > LINEAR > enter the dependent and independent variables. Click SAVE (located on the right) > click UNSTANDARDIZED in both the PREDICTED VALUES and RESIDUALS boxes. Click CONTINUE > click OK. There will now be two new columns labelled RES_1 and PRE_1 in the data window. Click GRAPHS > SCATTER > SIMPLE > DEFINE. Insert RES_1 in the Y axis box and PRE_1 in the X axis box. Click OK.
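A sketch of the equivalent syntax: the /SAVE subcommand creates the PRE_1 and RES_1 columns, and a GRAPH command then plots them (variable names hypothetical):

REGRESSION
  /DEPENDENT handicap
  /METHOD=ENTER games_week years_exp enjoyment desire
  /SAVE PRED RESID.
GRAPH
  /SCATTERPLOT(BIVAR)=PRE_1 WITH RES_1.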

Conclusion

Multiple regression (MR) is a statistical technique that allows researchers to predict a single DV from a set of two or more predictors or IVs (Stevens, 2009). Researchers can only be confident in their analysis if the assumptions, which allow for accurate inferences about the data and the given population, have been fulfilled. This tutorial aimed to provide the reader with a brief overview of MR, describe and explain the importance of assumptions in MR, and give instruction on how to test for assumptions in MR. Although the tutorial was merely a brief overview, I anticipate that an understanding of the importance of assumptions in MR and how to test for them was attained. Furthermore, I strongly encourage the exploration of the three links that were provided, as they offer a deeper understanding of both the significance and the interpretation of assumptions.

References

Brace, N., Kemp, R., & Snelgar, R. (2003). An introduction to logistic regression (Section 4). In SPSS for psychologists (2nd ed.). New York: Palgrave Macmillan.

Harlow, L. L. (2005). Chapter 4: Multiple regression. In The essence of multivariate thinking (pp. 43-61). New Jersey: Lawrence Erlbaum Associates, Inc.

Keith, T. (2006). Chapter 9: Multiple regression summary, further study, and problems. In Multiple regression and beyond (pp. 180-211). Montreal: Allyn and Bacon.

Lee, Famoye, & Sheldon. (2008). SPSS training workshop. Retrieved June 22, 2011, from http://calcnet.mth.cmich.edu/org/spss/V16_materials/Video_Clips_v16/21lin_regress1/21lin_regress1.swf

Research methods in psychology: Multiple regression. (2008). Retrieved June 21, 2011, from http://www.statisticshell.com/multireg.pdf

Research Methods Knowledge Base. (2006). Dummy variables. Retrieved June 21, 2011, from http://www.socialresearchmethods.net/kb/dummyvar.php

Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York: Routledge.

Tabachnick, B. G., & Fidell, L. S. (2007). Chapter 4: Cleaning up your act: Screening data prior to analysis. In Using multivariate statistics (5th ed., pp. 60-116). Boston: Pearson Education, Inc. / Allyn and Bacon.

