
Applied Linear Regression

Exercises
Quantitative Research, March 2009

TOPICS
1. Simple and multiple linear regression (SPSS)
2. Preparing the data for analysis
3. Inspecting frequencies and diagnostic plots
4. Outliers and influential cases: diagnostics and solutions
5. Interpreting the regression output
6. Saving predicted and residual regression scores
7. Introducing non-linear effects in the linear regression framework (recoding ordinal and interval level predictors into multiple categories, interaction effects, polynomial terms)
8. Variable transformations (ln, square)
9. Centering predictors
10. R square change
11. Multicollinearity and solutions

Databases for the lab examples and the homework assignment are available here: http://web.me.com/paula.tufis/QR

EXAMPLE 1. SIMPLE LINEAR REGRESSION
Model: Effects of education on occupational prestige
Dataset: ISSP_1999_Slovakia.sav (International Social Survey Programme, 1999, Inequality Module, www.issp.org); subsample of employed respondents with nonfarm origins.
Variables:
Outcome variable: SIOPS_R (Respondent's occupational prestige coded using the Standard International Occupational Prestige Scale - SIOPS, with a theoretical range from 6, occupations with low prestige, to 78, occupations with high prestige)
Predictor variable: EDUC_YRS (Respondent's schooling measured in completed years of education)

Analysis steps:
1. Conceptualization and specification of the model (hypotheses; include relevant predictors, exclude irrelevant predictors), operationalization of variables and concepts.
Hypothesis: people with higher levels of education tend to have jobs with higher occupational prestige.
This is a simple example, so while education might be a relevant predictor of occupational prestige, it probably isn't the only one; the model therefore fails to include some relevant predictors. Education is measured using completed years of schooling and occupational prestige is measured using SIOPS.
2. Inspect and recode the data (missing on NA/DK, direction of scaling, invalid variable values, dichotomizing categorical predictors, transforming highly skewed variables, and so on).
Both variables are already recoded for this example. In real-life situations use your statistical data manipulation software of choice to recode variables (SPSS, version 18.0: use Transform → Compute Variable or Transform → Recode into Same/Different Variables). Visually inspect histograms for each variable (SPSS: Graphs → Chart Builder → Histogram, or use the Charts option in Analyze → Descriptive Statistics → Frequencies).
3. Inspect the relationship between the two variables using a scatterplot, looking for departures from linearity and for potential outliers. In the multiple regression case, use a scatterplot matrix. SPSS: Graphs → Chart Builder → Scatter/Dot.
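If you prefer syntax to the menus, a minimal sketch of the plots in steps 2 and 3 (variable names are taken from the list above):

*Histograms for each variable.
GRAPH /HISTOGRAM=educ_yrs.
GRAPH /HISTOGRAM=siops_r.
*Scatterplot of occupational prestige against education.
GRAPH /SCATTERPLOT(BIVAR)=educ_yrs WITH siops_r.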

[Scatterplot annotation: potential outlier (case no. 267); we'll do more extensive tests later.]

4. Run the OLS regression (SPSS: Analyze → Regression → Linear).
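The equivalent syntax, trimmed to the essentials of what the Linear Regression dialog pastes:

*OLS regression of occupational prestige on education.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT siops_r
  /METHOD=ENTER educ_yrs.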



5. Diagnostics for possible violations of OLS regression assumptions
- Scatterplot/scatterplot matrix for checking the assumption of linearity of relationships.
- Plot the distribution of the regression standardized residuals to check the assumption of normally distributed residuals (the distribution should be normal). Alternative: look at the normal P-P plot (expected cumulative probability plotted against observed cumulative probability of standardized residuals; the line should be at 45 degrees). SPSS: in the Plots option of the Linear Regression menu choose Histogram and Normal Probability Plot under Standardized Residual Plots, or save the standardized residuals (use the Save option of the Linear Regression menu) and look at the distribution of the saved variable.

- Durbin-Watson test (SPSS: Linear Regression → Statistics → Durbin-Watson) to test the assumption of independence of errors (values close to 2 indicate no error autocorrelation). The Durbin-Watson test is mainly relevant for time-series data: in cross-sectional samples autocorrelation is likely to be less of a problem, but error autocorrelation might be present for spatially clustered data (in which case, run a hierarchical linear regression model).
- Plot standardized residuals (ZRESID) against standardized predicted values (ZPRED) to check the assumptions of homoskedasticity and linearity (SPSS: Linear Regression → Plots). Funnel shapes denote heteroskedasticity and curved shapes indicate the relationships might be nonlinear.
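All of these residual diagnostics can be requested in a single run; a trimmed sketch of the pasted syntax:

*Residual diagnostics: Durbin-Watson, residual histogram, normal P-P plot, ZRESID vs. ZPRED.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT siops_r
  /METHOD=ENTER educ_yrs
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID).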

6. Detect outliers and influential cases
To detect outliers, SPSS: Linear Regression → Statistics → Casewise Diagnostics (detects cases with standardized residuals with absolute values greater than 3 standard deviations).
The same case is identified here as on the scatterplot. You might consider excluding it if you think the case is affected by measurement error (examine the variable values for this case and the influence statistics). For a group of outliers you might need a separate model. Variable transformations can also help in some cases to pull in outlier cases.

To detect influential cases, there are a variety of statistics. Some of the most often used: Cook's Distance, Leverage Values, DFBeta. SPSS: Linear Regression → Save →
- Cook's Distance (cases with values over 1 are influential cases)
- Leverage Values (cases with values more than 3 times the average leverage, or values greater than .5, are cause for concern)
- DFBeta (cases with absolute values greater than 2 are cause for concern)
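The casewise diagnostics and influence statistics can also be requested through syntax; a sketch (the saved variables get automatic names such as COO_1 and LEV_1, and OUTLIERS(3) matches the 3-standard-deviation criterion above):

*Flag cases with |standardized residual| > 3 and save influence statistics.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT siops_r
  /METHOD=ENTER educ_yrs
  /CASEWISE PLOT(ZRESID) OUTLIERS(3)
  /SAVE COOK LEVER SDBETA.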

7. If everything looks fine, interpret the OLS regression results.

INDIVIDUAL ASSIGNMENT 1 (OPTIONAL)
Re-estimate and interpret on your own the simple regression model discussed above using data for Romania, 2008. Before interpreting regression coefficients, examine diagnostic tests and plots to check for possible violations of regression assumptions.
Dataset: EVS_2008_Romania_r80.sav (European Values Survey 2008); random 80% subsample from the nationally representative sample.
Subsample for analysis: employed respondents with nonfarm origins. Since the dataset contains both employed and unemployed respondents and respondents with farm and nonfarm origins, you need to select just the respondents who are employed and have nonfarm origins and run your analyses on this subsample. The variable EMPL identifies employed and unemployed respondents and the variable NONFARM identifies respondents with farm and nonfarm origins.
To select a subsample in SPSS: Data → Select Cases → If condition is satisfied → If → type "empl=1 and nonfarm=1" (without the quotes) in the upper right box → Continue. You can choose either to filter out cases (the default and recommended option: unselected cases remain in the dataset, but statistical analyses will not take them into account), to copy selected cases to a new dataset (in this case save the resulting dataset with a new name), or to delete unselected cases from the existing dataset (!!!Warning: using this option followed by Save overwrites the original dataset).
Variables:
Outcome variable: SIOPS_R (Respondent's occupational prestige coded using the Standard International Occupational Prestige Scale - SIOPS, with a theoretical range from 6, occupations with low prestige, to 78, occupations with high prestige)
Predictor variable: EDUC_YRS (Respondent's schooling measured in completed years of education)
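The Select Cases dialog pastes syntax along these lines (trimmed); this sketch uses the recommended filter option, so unselected cases stay in the dataset:

*Select employed respondents with nonfarm origins (unselected cases are kept but excluded from analyses).
USE ALL.
COMPUTE filter_$=(empl=1 AND nonfarm=1).
FILTER BY filter_$.
EXECUTE.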

EXAMPLE 2. MULTIPLE LINEAR REGRESSION
Model: Effects of socio-demographic characteristics on income.
Dataset: VF2008_Romania_r80 (Family Life, 2008, Soros Foundation Romania); random 80% subsample from the nationally representative sample.
Variables:
Outcome variable: INCOMER (Respondent's personal income)
Predictor variables:
GENDERR (Respondent's gender, 1 = male, 2 = female)
REDUC (Respondent's educational level, 1 = no schooling ... 9 = university and postgraduate degree)
EMPL (Employment status, 1 = full time employment, 2 = part time employment, 3 = self employed, 4 = inactive/unemployed, 99 = DK/NA)
MARSTAT (Marital status, 1 = married, 2 = cohabiting, 3 = single)
AGER (Age, 18-91, 99 = NA)
LOCSIZE (Locality type and size, 1 = big city, 2 = medium sized city, 3 = small city, 4 = village administrative center, 5 = other villages)

Some notes on variable recoding
- Income variables are generally positively skewed and need to be transformed; recommended transformation: ln(incomer). Examine the distribution of the INCOMER variable (SPSS: Analyze → Descriptive Statistics → Frequencies → Charts → Histogram with normal curve). To construct the natural logarithm of INCOMER, in SPSS: Transform → Compute Variable → fill in a new name under Target Variable and fill in "ln(incomer)" (without the quotes) under Numeric Expression.
- Dichotomous predictors such as GENDERR should be dummy coded (for example, code 0 for females and 1 for males).
- For the purposes of this example, we'll look at differences between employed (EMPL = 1 thru 3) and unemployed people (EMPL = 4), so 2 categories and 1 dummy variable introduced as a predictor. SPSS: Transform → Recode into Different Variables.
- Categorical predictors (such as MARSTAT) should be dummy coded, resulting in k dummy variables (where k is the number of categories) and using k-1 dummy variables as predictors in the regression equation. For the purposes of this example use "single" as the reference category (omitted dummy variable).
- Locality size could be considered an ordinal level variable with a clear ordering according to population size in the first 3 categories (big, medium, and small cities), but the presence of the last two categories (administrative center villages and other villages) makes the variable only partially ordered. As such, we will treat it as a categorical variable and construct 5 dummy variables. Use "other villages" as the reference category.
- For variables with several categories, explore whether there are statistically significant differences by interpreting the sizes and significances of dummy variable coefficients and by varying the reference category used. You can examine the overall statistical significance of a set of dummy variables (e.g. the locality type dummies) by introducing the associated dummy variables in a separate block in the regression equation. Request the R squared change statistic under Statistics in the Linear Regression procedure in SPSS. Interpret the overall significance of the variable by using the F test for R square change.

Regression assumption diagnostics
- The same as for the simple regression case.
- In addition, examine the partial regression plots to assess linearity and the presence of outliers (applicable only to interval level predictors).
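A minimal syntax sketch of the recodes described above. The variable names on the left come from the dataset; the new names (lninc, male, employed, married, cohab, bigcity, medcity, smallcity, admvill) are our own illustrative choices:

*Natural log of income (zero incomes become system-missing).
COMPUTE lninc=LN(incomer).
*Gender dummy: 1 = male, 0 = female.
RECODE genderr (1=1) (2=0) INTO male.
*Employment dummy: employed (1 thru 3) vs. inactive/unemployed (4).
RECODE empl (1 thru 3=1) (4=0) (99=SYSMIS) INTO employed.
*Marital status dummies; "single" is the reference category.
RECODE marstat (1=1) (ELSE=0) INTO married.
RECODE marstat (2=1) (ELSE=0) INTO cohab.
*Locality size dummies; "other villages" is the reference category.
RECODE locsize (1=1) (ELSE=0) INTO bigcity.
RECODE locsize (2=1) (ELSE=0) INTO medcity.
RECODE locsize (3=1) (ELSE=0) INTO smallcity.
RECODE locsize (4=1) (ELSE=0) INTO admvill.
EXECUTE.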

Multicollinearity diagnostics and solutions
- Start by inspecting a bivariate correlation matrix between predictor variables. Correlations over .8 or .9 suggest a moderate to high degree of collinearity of predictors.
- In SPSS, in the Linear Regression procedure, under Statistics, ask for Collinearity diagnostics. Tolerance less than .2 suggests high collinearity; VIF greater than 4 (conservative criterion) or 7 (liberal criterion) suggests high collinearity; a condition index over 30 suggests multicollinearity.
- Solutions in case of multicollinearity: depending on the context, either construct a scale using the highly collinear variables (if the variables measure the same dimension) or delete one of the highly collinear variables from the model. If you have a substantive interest in estimating the effects of all of the highly collinear variables but they do not form a scale, you can use a block regression with these predictors in a separate block, and compare regression results with and without these variables.
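In syntax, the correlation matrix and collinearity statistics can be requested as follows; this sketch reuses the illustrative variable names from the recoding sketch above:

*Bivariate correlations between predictors.
CORRELATIONS /VARIABLES=male employed reduc ager married cohab bigcity medcity smallcity admvill.
*Regression with Tolerance/VIF and condition index diagnostics.
REGRESSION
  /STATISTICS COEFF R ANOVA COLLIN TOL
  /DEPENDENT lninc
  /METHOD=ENTER male employed reduc ager married cohab bigcity medcity smallcity admvill.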

Example 2a. Recode variables and estimate the regression model in one block. Examine the effects of the dummy variables and decide which categories to collapse. Re-estimate the model using blocks for sets of dummies and examine diagnostic tests and plots.

Handling nonlinear effects
- You can split ordinal and interval level predictors into categories if you suspect that their effect on the dependent variable is non-linear. For example, the relationship between age and income might be nonlinear. You can explore that by using 10-year or 5-year age groups instead of the original age variable. It might also be appropriate to account for the squared effect of age on income if the relationship is curvilinear.
- Interaction effects can also be introduced in the linear regression model if you think the effect of one predictor is moderated by another predictor.

Centering predictors
Centering is subtracting the mean from the variable values. It is done for two different purposes:
- To make the intercept meaningful (the sizes of slopes are not affected by centering).
- To avoid multicollinearity between a predictor and power terms of the predictor (e.g. age and age squared).
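A hedged sketch of both ideas for the AGER variable; the 10-year cut points are one possible choice, and the centering constant (45.2) is a placeholder to be replaced with the actual sample mean of AGER:

*10-year age groups (cut points are an illustrative choice).
RECODE ager (99=SYSMIS) (18 thru 29=1) (30 thru 39=2) (40 thru 49=3) (50 thru 59=4) (60 thru 91=5) INTO agegrp.
*Centered age and its square (replace 45.2 with the actual mean of AGER).
COMPUTE ager_c=ager-45.2.
COMPUTE ager_c_sq=ager_c*ager_c.
EXECUTE.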

INDIVIDUAL ASSIGNMENT 2 (OPTIONAL)
1. Estimate a regression model predicting satisfaction with life, using 2005 data from the Public Opinion Barometer (dataset: bd bop noiembrie 2005.sav). Satisfaction with life is variable V22. You might explore as predictors of life satisfaction: marital status (V55), number of children (V56), gender (V235), age (V237), educational level (V238), and household income per capita (V253, household income, divided by b65, number of persons in the household).
Tips for recoding variables:
- Start by assigning system missing values to NA/DK values on each variable.
- Recode marital status, in this example using a simple dichotomy between legally married respondents and all other respondents.
- Dichotomize the gender variable.
- Recode the educational level variable into fewer categories so that the resulting variable is an ordinal level variable (in the original version the categories of the variable are only partially ordered). For the purposes of this example, use a smaller number of broad categories (this will make it easier to manage the dichotomized version of the variable later in the exercise).
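A possible sketch for the marital status and gender recodes. The category codes here (1 = legally married on V55; 1 = male, 2 = female on V235) are assumptions: check the value labels in the dataset and assign system-missing to any NA/DK codes first.

*Marital status dummy (assumes 1 = legally married; adjust after checking value labels).
RECODE v55 (MISSING=SYSMIS) (1=1) (ELSE=0) INTO married.
*Gender dummy (assumes 1 = male, 2 = female).
RECODE v235 (MISSING=SYSMIS) (1=1) (ELSE=0) INTO male.
EXECUTE.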
* SPSS syntax for recoding education.
RECODE v238 (99=SYSMIS) (1=1) (2 thru 3=2) (4 thru 6=3) (7 thru 9=4) (10 thru 12=5) (13 thru 14=6) INTO educ.
EXECUTE.

Household income per capita is computed by dividing the household income by the number of persons in the household.
*SPSS syntax for computing household income per capita.
COMPUTE hhinc=v253/b65.
EXECUTE.

2. Explore the possibility of a nonlinear effect of age on satisfaction with life by using a squared age variable as a predictor in addition to age. Ask for collinearity diagnostics and look at the VIF and Tolerance values for age and age squared. Center the age variable and recompute the age squared variable using the centered age. Re-run the model and look again at the values of the collinearity diagnostics.
Tip for computing power terms: use the Transform → Compute Variable menu.
*SPSS syntax for computing age squared.
COMPUTE agesq=v237*v237.
EXECUTE.

Tip for centering variables: run a frequency for the variable you want to center and ask for the mean of the variable in the Statistics menu of the Frequencies procedure. Compute the new, centered variable by subtracting the mean from the original variable.
*Requesting the mean for the age variable.
FREQUENCIES VARIABLES=v237
  /FORMAT=NOTABLE
  /STATISTICS=MEAN
  /ORDER=ANALYSIS.
*Constructing the centered age and centered age squared terms.
COMPUTE age_c=v237-48.68.
EXECUTE.
COMPUTE age_c_sq=age_c*age_c.
EXECUTE.

3. Explore whether marital status acts as a moderator in the relationship between education and satisfaction with life (test for an interaction effect between marital status and education).
Tip: for OLS regression in SPSS you have to manually construct the interaction term (as a product of the two main variables).
- To test the interaction between two interval level variables, you will construct a single interaction term, equal to the product of the two variables.
- To test the interaction between an interval level variable and a dichotomous variable, the strategy is the same as for two interval level variables.
- To test the interaction between an interval level variable and a nominal/ordinal variable represented by k-1 dummies in the regression, you will have to construct k-1 interaction terms (the interval level variable multiplied by the first dummy, then by the second dummy, and so on). The same strategy applies for a dichotomous variable interacted with a nominal/ordinal variable with k-1 dummies.
*SPSS syntax for constructing the interaction term between education and marital status.
COMPUTE interact1=educ*married.
EXECUTE.

4. Explore the possibility that education has nonlinear effects on satisfaction with life by dichotomizing the educational level categories and introducing the education dummies in a separate block in the regression equation. Delete the previous interaction from the model (education is measured here in a different way and consequently the interaction term no longer makes sense).
Tip: collapse the first two education categories, since the first category contains a very small number of cases. Use EDUC1, resulting from the syntax below, as the reference category in the regression equation.
*SPSS syntax for constructing education dummies.
*The first two education categories are collapsed since the first category contains a very small number of cases.
RECODE educ (SYSMIS=SYSMIS) (1 thru 2=1) (ELSE=0) INTO educ1.
RECODE educ (SYSMIS=SYSMIS) (3=1) (ELSE=0) INTO educ2.
RECODE educ (SYSMIS=SYSMIS) (4=1) (ELSE=0) INTO educ3.
RECODE educ (SYSMIS=SYSMIS) (5=1) (ELSE=0) INTO educ4.
RECODE educ (SYSMIS=SYSMIS) (6=1) (ELSE=0) INTO educ5.
EXECUTE.

5. Re-test for an interaction effect between marital status and education (measured with dummies). Introduce the 4 interaction terms in a separate block in the regression equation and interpret the R squared change statistic.
Tip: you will have to construct 4 separate interaction terms: educ2*married, educ3*married, educ4*married, and educ5*married. Introducing them as a separate block lets you test the overall effect of the interaction between marital status and education measured in this way.
*SPSS syntax for constructing the 4 interaction terms.
COMPUTE educ2mar=educ2*married.
COMPUTE educ3mar=educ3*married.
COMPUTE educ4mar=educ4*married.
COMPUTE educ5mar=educ5*married.
EXECUTE.

6. Choose what you think is the best model and interpret the regression results for this model (regression coefficients, R squared coefficients).

HOMEWORK ASSIGNMENT (DUE NEXT WEEK)
Using a database you are familiar with, run a multiple regression model on a dependent variable of your choice with several predictors that you consider relevant. You can use the Public Opinion Barometer from 2005 (link available on the lab webpage). Examine the diagnostic tests and plots that we discussed during the lab and write a short interpretation of these. You might try some of the solutions we discussed if you find that there are marked violations of the regression assumptions. Interpret the regression coefficients and the R square for your model. Try to think about and present the substantive findings of your model. Please include a print-out of the relevant parts of the output and a discussion of your results (half a page to one page) when you turn in your assignment.
