[Scatterplot of the seven sample Y scores (vertical axis, 0 to about 26) plotted against X (horizontal axis, 0 to 7)]
Sample data:
  X:  3   4   2   1   5   3   6
  Y: 14  18  10   6  22  14  26
The constant, representing the intercept, is the value that the dependent variable would take when all the predictors are at a value of zero. In some treatments this is called b0 instead of a.
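To make the arithmetic concrete, here is a minimal ordinary-least-squares sketch in Python (an illustration, not part of the SPSS workflow these slides describe); it recovers the constant a = 2 and slope b = 4 that the SPSS output below reports for these data:

```python
import numpy as np

# The seven sample cases from the slides
x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# Ordinary least squares: fit Y = a + b*X
b, a = np.polyfit(x, y, 1)   # polyfit returns [slope, intercept] for degree 1
print(f"a (constant) = {a:.3f}, b (slope) = {b:.3f}")  # a = 2.000, b = 4.000
```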
What is the Regression Equation when the Scores are in Standard (Z) Units?
When the scores on X and Y have been converted to Z scores, the intercept disappears (because the two sets of scores are expressed on the same scale) and the equation for predicting Y from X becomes ZY = (Beta)ZX, where Beta is the standardized coefficient reported in your SPSS regression procedure output.
Coefficients(a)

Model 1        Unstandardized Coefficients    Standardized Coefficients
               B        Std. Error            Beta                         t    Sig.
(Constant)     2.000    .000                                               .    .
X              4.000    .000                  1.000                        .    .

a. Dependent Variable: Y

(The t and Sig. columns show dots because these data fit the line Y = 2 + 4X perfectly, so the standard errors are zero.)
In the bivariate case, where there is only one X and one Y, the standardized beta weight will equal the correlation coefficient. Let's confirm this by seeing what happens if we convert our raw scores to Z scores.
Coefficients(a)

Model 1        Unstandardized Coefficients    Standardized Coefficients
               B        Std. Error            Beta                         t    Sig.
(Constant)     .000     .000                                               .    .
Zscore(X)      1.000    .000                  1.000                        .    .

a. Dependent Variable: Zscore(Y)

As with the raw-score output, the zero standard errors reflect the perfect fit, and the beta of 1.000 equals the correlation between X and Y.
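As a quick numerical check (again in Python rather than SPSS), standardizing both variables and refitting shows the slope equal to the Pearson correlation:

```python
import numpy as np

x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# Convert both variables to Z scores (sample SD, ddof=1, as SPSS does)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

slope, intercept = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
# slope = 1.000, intercept = 0.000, r = 1.000: the beta equals the correlation
```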
In the Output Viewer, double-click on the chart to bring up the Chart Editor; go to Elements, select Fit Line at Total, then select Linear and click Close.
Scatterplot of Relationship between Female Life Expectancy and Daily Caloric Intake
From the scatterplot it would appear that there is a strong positive correlation between X and Y (as daily caloric intake increases, life expectancy increases), and X can be expected to be a good predictor of as-yet-unknown cases of Y. Note, however, that there is a lot of scatter about the line, and we may need additional predictors to soak up some of the variance left over after this particular X has done its work. Also consider loess regression: in the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods, where the radius of each neighborhood is chosen so that it contains a specified percentage of the data points.
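For readers who want to try a loess-style smooth outside SPSS, here is a sketch using the lowess smoother from statsmodels (note this implementation fits local linear, not quadratic, functions; the data below are made up purely for illustration):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Made-up data loosely shaped like the calories/life-expectancy example
rng = np.random.default_rng(0)
calories = rng.uniform(1500, 3500, size=100)
life_exp = 40 + 0.012 * calories + rng.normal(0, 5, size=100)

# frac sets the share of the data in each local neighborhood,
# matching the "specified percentage of the data points" idea above
smoothed = lowess(life_exp, calories, frac=0.5)  # returns sorted [x, fitted] pairs
print(smoothed[:5])
```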
The significance test of the constant is of little use: it just says that the constant differs significantly from zero (i.e., that when X is zero, Y is not zero).
Coefficients(a)

Model 1                 Unstandardized Coefficients    Standardized Coefficients
                        B         Std. Error           Beta                        t         95% CI for B (Lower, Upper)
(Constant)              25.904    4.175                                            6.204     (17.583, 34.225)
Daily calorie intake    .016      .001                 .775                        10.491    (.013, .019)

a. Dependent Variable: Average female life expectancy
If the data were expressed in standard scores, the equation would be ZY = .775ZX + e, and .775 is also the correlation between X and Y. This is a standard-score regression equation.
ANOVA(b)

Model 1       Sum of Squares    df    Mean Square    F          Sig.
Regression    5792.910          1     5792.910       110.055    .000(a)
Residual      3842.477          73    52.637
Total         9635.387          74

a. Predictors: (Constant), Daily calorie intake
b. Dependent Variable: Average female life expectancy
Model Summary

Model 1    R          R Square    Adjusted R Square    Std. Error of the Estimate
           .775(a)    .601        .596                 7.255
Residual SS is the sum of the squared deviations between the known values of Y and the predicted values of Y based on the equation.
Regression SS is the sum of the squared deviations of the predicted values of Y about the mean of Y.
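These definitions can be verified directly from the ANOVA table above; a small Python check:

```python
# Checking the ANOVA decomposition with the values reported above
ss_regression, ss_residual = 5792.910, 3842.477
df_regression, df_residual = 1, 73

ms_regression = ss_regression / df_regression        # 5792.910
ms_residual = ss_residual / df_residual              # about 52.637
f_ratio = ms_regression / ms_residual                # about 110.05
r_square = ss_regression / (ss_regression + ss_residual)  # about .601
print(f"F = {f_ratio:.2f}, R^2 = {r_square:.3f}")
```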
Looking at the Model Summary, we can see that R (which in the bivariate case is also the standardized version of b) is .775. Thus, if ZX is the Z score corresponding to a particular calorie level, predicted life expectancy in standard units is .775(ZX), with a typical prediction error, the standard error of the estimate, of 7.255 years.
SEE = the SD of Y multiplied by the square root of the coefficient of nondetermination (1 - r2). It says what an error standard score of 1 is equal to in terms of Y units.
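In symbols (the first form is the conceptual one just described; the second is the residual-based form SPSS reports, which applies a small degrees-of-freedom adjustment; numbers come from the ANOVA and Model Summary tables above):

```latex
\mathrm{SEE} \approx s_Y \sqrt{1 - r^2}
\qquad\text{(conceptual form)}
\qquad
\mathrm{SEE} = \sqrt{\frac{SS_{\mathrm{residual}}}{n - k - 1}}
             = \sqrt{\frac{3842.477}{73}} \approx 7.255
```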
Multivariate Analysis
Multivariate analysis is a term applied to a related set of statistical techniques that seek to assess, and in some cases summarize or make more parsimonious, the relationships among a set of independent variables and a set of dependent variables. Multivariate analyses seek to answer questions such as:
- Is there a linear combination of personal and intellectual traits that will maximally discriminate between people who successfully complete the freshman year of college and people who drop out? What linear combination of characteristics of the tax return and the taxpayer best distinguishes between those whom it would and would not be worthwhile to audit? (Discriminant Analysis)
- What are the underlying factors of a 94-item statistics test, and how can a more parsimonious measure of statistical knowledge be achieved? (Factor Analysis)
- What are the effects of gender, ethnicity, and language spoken in the home, and their interaction, on a set of ten socio-economic status indicators? Even if none of these is significant by itself, will their linear combination yield significant effects? (MANOVA, Multiple Regression)
Multiple regression is a relative of simple bivariate or zero-order correlation (two interval-level variables). In multiple regression, the investigator is concerned with predicting a dependent or criterion variable from two or more independent variables. The regression equation (raw-score version) takes the form Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + e.
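A minimal sketch of this raw-score equation fitted in Python on synthetic data (variable names and values are invented for illustration):

```python
import numpy as np

# Synthetic data for a two-predictor raw-score equation Y = a + b1*X1 + b2*X2 + e
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=50)

# Design matrix with a leading column of ones for the constant a
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")  # close to 1.0, 2.0, -0.5
```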
Canonical correlation is perhaps a more classic multivariate procedure, with multiple dependent as well as multiple independent variables.
One motivation for doing this is to be able to predict scores for cases on which measurements have not yet been obtained or might be difficult to obtain. The regression equation can be used to classify, rate, or rank new cases.
[Example table: new cases to be scored with the equation, e.g., Subject 1 (African-American), Subject 2 (African-American), Subject 3 (Other), each receiving a predicted rating such as Medium Status]
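A minimal sketch of scoring and ranking new cases, reusing the toy equation Y = 2 + 4X fitted earlier in these notes (the subject labels and new X values here are hypothetical):

```python
# Toy illustration: score, rank, and classify new cases with the fitted
# equation Y = 2 + 4X from the small example earlier in these notes.
def predict(x: float) -> float:
    a, b = 2.0, 4.0          # intercept and slope from that example
    return a + b * x

new_cases = {"Subject 1": 2.5, "Subject 2": 4.0, "Subject 3": 7.0}  # hypothetical X values
ranked = sorted(new_cases, key=lambda s: predict(new_cases[s]), reverse=True)
for subject in ranked:
    print(f"{subject}: predicted Y = {predict(new_cases[subject]):.1f}")
```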
It is also possible to test the significance of the difference between two predictors: is one a significantly better predictor than the other? These coefficients vary from sample to sample, so it's not prudent to generalize too much about the relative ability of two predictors to predict. It's also the case that, in the context of the regression equation, the variable that is a good predictor is not the original variable, but rather a residualized version from which the effects of all the other variables have been removed. So the magnitude of its contribution is relative to the other variables, and only holds for this particular combination of variables included in the predictive equation.
Substituting the correlations we already have into the formula, we find that the beta weight for the predictive effect of variable X1 on Y is

  Beta YX1.X2* = (rYX1 - rYX2 rX1X2) / (1 - rX1X2^2) = (.776 - (.869)(.682)) / (1 - (.682)^2) = .342

To compute the second weight, Beta YX2.X1, we just switch the first and second terms in the numerator. Now let's see that in the context of an SPSS-calculated multiple regression.
*Read this as the Beta weight for the regression of Y on X1 when the effects of X2 have been removed
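We can check both hand calculations in a few lines of Python (correlations taken from the text; the small differences from the SPSS output reflect rounding of the correlations to three decimals):

```python
# Correlations as reported in the text
r_y1 = 0.776   # life expectancy with daily calorie intake (r YX1)
r_y2 = 0.869   # life expectancy with people who read %    (r YX2)
r_12 = 0.682   # correlation between the two predictors    (r X1X2)

beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)  # Beta YX1.X2
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)  # Beta YX2.X1 (terms switched)
print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")
# beta1 = 0.343, beta2 = 0.635; SPSS reports .342 and .636 from unrounded correlations
```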
Download the World95.sav data file and open it in SPSS Data Editor.
In Data Editor, go to Analyze / Regression / Linear and click Reset.
- Put Average Female Life Expectancy into the Dependent box.
- Put Daily Calorie Intake and People who Read % into the Independent(s) box.
- Under Statistics, select Estimates, Confidence Intervals, Model Fit, Descriptives, Part and Partial Correlation, R Square Change, and Collinearity Diagnostics, then click Continue.
- Under Options, check Include Constant in the Equation, click Continue, and then OK.
- Compare your output to the next several slides.
Correlations

Pearson Correlation               Avg. female        Daily calorie    People who
                                  life expectancy    intake           read (%)
Average female life expectancy    1.000              .776             .869
Daily calorie intake              .776               1.000            .682
People who read (%)               .869               .682             1.000

(Annotated: rYX1 = .776, rYX2 = .869, rX1X2 = .682. The full SPSS output also reports 1-tailed Sig. values and N for each pair.)
The Coefficients table (below) gives the raw (unstandardized) and standardized regression weights for the regression of female life expectancy on daily calorie intake and percentage of people who read. Consistent with our hand calculation, the standardized regression coefficient (beta weight) for daily calorie intake is .342. The beta weight for percentage of people who read is much larger, .636. What this weight means is that for every increase of one standard deviation on the people-who-read variable, Y (female life expectancy) will increase by .636 standard deviations. Note that both beta coefficients are significant at p < .001.
The model summary for this equation reports some important statistics. It gives us R and R square for the regression of Y (female life expectancy) on the two predictors: R is .905, a very high correlation, and R square, the proportion of the variation in female life expectancy explained by the two predictors, is a very high .818. It also gives the standard error of the estimate; together with the coefficient standard errors, this output lets us put confidence intervals around predictions and around the unstandardized regression coefficients.
[ANOVA table for the two-predictor model; footnotes: a. Predictors: (Constant), People who read (%), Daily calorie intake; b. Dependent Variable: Average female life expectancy]
Next we look at the F test of the significance of the regression equation, ZY = .342 ZX1 + .636 ZX2 (in standardized form). Is this so much better a predictor of female life expectancy (Y) than simply using the mean of Y that the difference is statistically significant? The F test is a ratio of the mean square for the regression equation to the mean square for the residual (the departures of the actual scores on Y from what the regression equation predicted). In this case we have a very large value of F, which is significant at p < .001. Thus it is reasonable to conclude that our regression equation is a significantly better predictor than the mean of Y.
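The exact F is not reproduced in this excerpt, but it can be back-calculated from R square; a rough Python check (n = 75 is an assumption, implied by the earlier ANOVA table's total df of 74):

```python
# Back-calculating F for the two-predictor equation from R^2.
# n = 75 is assumed from the bivariate ANOVA above (df total = 74);
# the F in the actual SPSS output may differ slightly from this rough check.
r2, n, k = 0.818, 75, 2
f_ratio = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f"F({k}, {n - k - 1}) = {f_ratio:.1f}")  # roughly 162, p < .001
```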
Coefficients(a)

Model 1                 Unstandardized Coefficients    Standardized Coefficients    95% CI for B            Correlations
                        B         Std. Error           Beta                         (Lower, Upper)          Partial
(Constant)              25.838    2.882                                             (20.090, 31.585)
Daily calorie intake    .007      .001                 .342                         (.004, .010)            .506
People who read (%)     .315      .034                 .636                         (.247, .383)            .738

a. Dependent Variable: Average female life expectancy
Finally, your output provides confidence intervals around the unstandardized regression coefficients. Thus we can say with 95% confidence that the unstandardized weight applied to daily calorie intake to predict female life expectancy lies between .004 and .010, and that the unstandardized weight applied to percentage of people who read lies between .247 and .383.
Multicollinearity
One of the requirements for a mathematical solution to the multiple regression problem is that the predictors (independent variables) not be highly correlated. If two predictors are perfectly correlated, the analysis cannot be completed. Multicollinearity (the case in which two or more of the predictors are too highly correlated) also leads to unstable partial regression coefficients that won't hold up when applied to a new sample of cases. Further, if predictors are too highly correlated with each other, their shared variance with the dependent (criterion) variable may be redundant, and it's hard to tell using statistical procedures alone which variable is producing the effect. Moreover, the regression weights for the predictors would look much like their zero-order correlations with Y only if the predictors were independent; if the predictors are highly correlated, the weights may not really reflect the independent contribution to prediction of each of the predictors.
Multicollinearity, cont'd
As a rule of thumb, bivariate zero-order correlations between predictors should not exceed .80
Also, no predictor should be totally accounted for by a combination of the other predictors
Look at tolerance levels. Tolerance for a predictor variable is equal to 1 - R2 for an equation in which that predictor is regressed on all of the other predictors. If the predictor is highly correlated with (explained by) the combination of the other predictors, it will have a low tolerance, approaching zero, because the R2 will be large. So tolerance near zero = BAD; tolerance near 1 = GOOD, in terms of the independence of a predictor. (A sketch of this computation appears after this list.)
Pairwise multicollinearity is easy to check for: run a complete analysis of all possible pairs of predictors using the correlation procedure.
The best prediction occurs when the predictors are relatively independent of each other but each is highly correlated with the dependent (criterion) variable Y. Some interpretive problems resulting from multicollinearity can be resolved using path analysis (see Chapter 3 in Grimm and Yarnold).
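Here is the tolerance sketch promised above, in Python on synthetic data (variable names and values are invented for illustration); each predictor is regressed on the others and tolerance = 1 - R2:

```python
import numpy as np

def tolerance(X: np.ndarray, j: int) -> float:
    """Tolerance of predictor j: 1 - R^2 from regressing column j
    on all the other predictor columns (constant included)."""
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
    resid = X[:, j] - design @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 - r2

# Hypothetical predictors: x2 is largely built from x1, so tolerance is low
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])
print([round(tolerance(X, j), 3) for j in range(X.shape[1])])  # both near 0.1
```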
Recall the correlation matrix for our example: the two predictors correlate rX1X2 = .682 with each other, while each correlates strongly with Y (rYX1 = .776 for daily calorie intake, rYX2 = .869 for people who read %).
In the case of our two predictors, there is some indication of multicollinearity (rX1X2 = .682), but not enough, by the .80 rule of thumb, to throw out one of the variables.
Specification Errors
One type of specification error occurs when the relationship among the variables you are looking at is not linear; e.g., you know that Y peaks at high and low levels of one or more predictors (a curvilinear relationship) but you are using linear regression anyhow. Options for nonlinear regression are available and should be used in such a case. Another type of specification error occurs when you have either underspecified or overspecified the model by (a) failing to include all relevant predictors (for example, including weight but not height in an equation for predicting obesity) or (b) including predictors that are not relevant. Most irrelevant predictors will not even show up in the final regression equation unless you insist on it, but they can affect the results if they are correlated with at least some of the other predictors. For proper specification, nothing beats a good theory (as opposed to launching a fishing expedition).
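A small Python illustration of the first kind of specification error, using made-up curvilinear data: a straight line fits poorly, while adding a quadratic term captures the peak.

```python
import numpy as np

# Hypothetical curvilinear data: Y peaks in the middle of the X range
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 80)
y = -1.0 * (x - 5) ** 2 + 30 + rng.normal(scale=2, size=x.size)

# A straight line misspecifies this relationship; a quadratic captures it
linear = np.polyfit(x, y, 1)
quadratic = np.polyfit(x, y, 2)

def r_square(coeffs):
    resid = y - np.polyval(coeffs, x)
    return 1 - resid.var() / y.var()

print(f"linear R^2 = {r_square(linear):.3f}, quadratic R^2 = {r_square(quadratic):.3f}")
# linear R^2 near 0; quadratic R^2 near 1 for these data
```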
As mentioned in class before, the Bonferroni procedure is sometimes used, but it's hard to swallow: you have to divide the usual alpha level of .05 by the number of tests you expect to perform, so if you are conducting thirty tests, you have to set your alpha level at .05/30, or about .0017, for each test. With stepwise regression it's not clear in advance how many tests you will have to perform, although you can estimate it from the number of predictor variables you intend to start with.
An issue of HCR (July 2003) devoted several papers to exploring this question