
Chapter 20: Linear Regression and Correlation

1. Introduction p. 441
   a. For variables measured on at least an ordinal scale, you must determine whether there is a relationship between the two variables, and if so, you must describe that relationship.
   b. Plotting the values of two variables is an essential first step in examining the nature of their relationship. From a plot, you can tell whether there is some type of pattern between the values of the two variables or whether the points appear to be randomly scattered. If you see a pattern, you can try to summarize the overall relationship by fitting a mathematical model to the data. For example, if the points on the plot cluster around a straight line, you can summarize the relationship by finding the equation for the line. Similarly, if some type of curve appears to fit the data points, you can determine the equation for the curve.
2. Choosing the Best Line p. 444
   a. The line of best fit passes close to most of the observed data points.
3. The Least-Squares Line p. 444-445
   a. "Least-squares regression line" means that of all possible lines that can be drawn on the plot, the line of best fit has the smallest sum of squared vertical distances between the points and the line.
   b. For each of the data points, you can calculate the distance between the point and the regression line by drawing a vertical line from the point to the regression line.
   c. If you calculate the vertical distances between each of the points and the line, square each of them, and then add them up for all of the points, you have the sum of squared distances between the points and the regression line. (You use squared distances, since you don't want positive and negative differences between the points and the line to cancel out.)
   d. The least-squares regression line is the line that has the smallest sum of squared vertical distances between the points and the line.
   e. Any other line you draw through the points will have a larger sum of squared distances, as the sketch below illustrates.
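A minimal sketch of the idea in section 3, using hypothetical data (the x and y values below are made up for illustration) and Python with numpy, which the textbook itself does not use:

    import numpy as np

    # Hypothetical data points
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    def sum_squared_distances(a, b):
        """Sum of squared vertical distances from the points to the line y = a + b*x."""
        return np.sum((y - (a + b * x)) ** 2)

    # Least-squares slope and intercept (np.polyfit lists the slope first for deg=1)
    b_ls, a_ls = np.polyfit(x, y, deg=1)

    print(sum_squared_distances(a_ls, b_ls))        # the smallest possible sum
    print(sum_squared_distances(a_ls + 0.5, b_ls))  # any other line gives a larger sum
    print(sum_squared_distances(a_ls, b_ls + 0.3))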

4. The Equation for a Straight Line p. 445-448
   a. Consider the equation for a straight line. If y is the variable plotted on the vertical axis and x is the variable plotted on the horizontal axis, the equation for a straight line is y = a + bx.
   b. Intercept - The value a is called the intercept. It is the predicted value for y when x is 0.
   c. Slope
      i. The value b is called the slope. It is the change in y when x changes by one unit.
      ii. Equivalently, it is the difference between the values of the y variable for two values of x that are one unit apart.
   d. Y is often called the dependent variable, since you try to predict its values based on the values of x, the independent variable.
   e. If the value for the slope is positive, you know that as the values of one variable increase, so do the values of the other variable. If the slope is negative, you know that as the values of one variable increase, the values of the other variable decrease. If the slope is large, the line is steep, indicating that a small change in the independent variable results in a large change in the dependent variable. If the slope is small, there is a more gradual increase or decrease. If the slope is 0, changes in x have no effect on y.
   f. You should be careful in interpreting the intercept. Unless a value of 0 makes sense for the independent variable, the intercept may not have any substantive meaning.
   g. The slope and the intercept are not symmetric measures: it matters which variable is the dependent variable and which is the independent variable.
5. Calculating the Least-Squares Line p. 448-449
   a. It's easy to calculate the slope and the intercept when all of the points fall exactly on a straight line. It's somewhat more complicated to calculate them for a least-squares regression line when the points don't fall exactly on the line.
   b. The slope and intercept values from the least-squares regression are shown in the column labeled B. The regression equation is y = intercept + slope(x).
   c. Calculate the slope as b = Σ(xi - x̄)(yi - ȳ) / ((N - 1)sx²).
   d. For each case, subtract the mean of the independent variable, x̄, from the case's value for the independent variable, xi. Then subtract the mean of the dependent variable, ȳ, from the case's value for the dependent variable, yi. Multiply these two differences for each case and then sum them over all of the cases. For the last step, divide this sum by the product of the number of cases minus 1 (N - 1) and the variance of the independent variable (sx²). Since the least-squares regression line passes through the point (x̄, ȳ), you can calculate the intercept as a = ȳ - b(x̄). The sketch below works through these formulas.
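A minimal sketch of the slope and intercept formulas in 5c-5d, again with hypothetical data and Python/numpy (the textbook itself works from SPSS-style output); the hand-computed values are checked against numpy's own least-squares fit:

    import numpy as np

    # Hypothetical data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    N = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    s2_x = x.var(ddof=1)  # sample variance of the independent variable

    # b = sum((xi - x_bar)(yi - y_bar)) / ((N - 1) * s2_x)
    b = np.sum((x - x_bar) * (y - y_bar)) / ((N - 1) * s2_x)
    # The line passes through (x_bar, y_bar), so a = y_bar - b * x_bar
    a = y_bar - b * x_bar

    # Cross-check against numpy's built-in least-squares fit
    b_check, a_check = np.polyfit(x, y, deg=1)
    print(a, b)              # intercept and slope from the formulas
    print(a_check, b_check)  # should match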

6. Calculating Predicted Values and Residuals p. 450
   a. Use the least-squares regression line to predict the value of the dependent variable.
   b. Residual
      i. The residual for a case is nothing more than the vertical distance from the point to the line. The sign tells you whether the observed point is above or below the least-squares regression line. Another way of saying that the least-squares line has the smallest sum of squared vertical distances from the points to the line is to say that the least-squares line has the smallest sum of squared residuals.
      ii. Calculated as the difference between the observed and predicted values of the dependent variable.
      iii. Determine the smallest and the largest residual in absolute value.
      iv. A positive residual means that the observed value is greater than the predicted value.
      v. A negative residual means that the observed value is smaller than the predicted value.
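A minimal sketch of section 6's definitions on the same hypothetical data (Python/numpy assumed):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    b, a = np.polyfit(x, y, deg=1)  # least-squares slope and intercept

    predicted = a + b * x        # predicted values from the regression line
    residuals = y - predicted    # observed minus predicted

    # Positive residual: point above the line; negative: point below it
    print(residuals)

    # Smallest and largest residuals in absolute value
    print(np.abs(residuals).min(), np.abs(residuals).max())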

7. Determining How Well the Line Fits p. 451-454
   a. The least-squares regression line is the line that fits the data best in the sense of having the smallest sum of squared residuals. However, this does not necessarily mean that it fits the data well. Before you use the regression line for making predictions or describing the relationship between the two variables, you must determine how well the line fits the data. If the line fits poorly, any conclusions based on it will be unreliable.
   b. Slope - The value of the slope depends not only on how closely two variables are related but also on the units in which they are measured.
8. The Correlation Coefficient
   a. To describe how well the model fits the data, you want an absolute measure that doesn't depend on the units of measurement and is easily interpretable.
   b. The statistic most frequently used for this purpose is the Pearson correlation coefficient (r).
      i. It ranges in value from -1 to +1.
      ii. If all points fall exactly on a line with a positive slope, the correlation coefficient has a value of +1.
      iii. If all points fall exactly on a line with a negative slope, the correlation coefficient is -1.
      iv. The absolute value of r tells you how closely the points cluster around a straight line.
      v. Both large positive values (near +1) and large negative values (near -1) indicate a strong linear relationship between the two variables; the points cluster close to the line.
   c. You can't assume that because two variables are correlated, one of them causes the other.
   d. Unlike the slope and the intercept, the correlation coefficient is a symmetric measure.
   e. "Symmetric measure" means you get the same value regardless of which of the two variables is the dependent variable.
   f. Calculate the correlation coefficient from the slope using the formula r = b(sx/sy), where sx and sy are the standard deviations of the independent and dependent variables. You can see from the formula that if the dependent and independent variables are standardized to have standard deviations of 1, the correlation coefficient and the slope are equal. (See the sketch below.)
   g. If the correlation coefficient for one plot is larger in absolute value than the correlation coefficient for a second plot, then the points in the first plot cluster more tightly about the line.
   h. If there is no linear relationship between the two variables, the correlation coefficient is close to 0.
      i. A correlation coefficient of 0 does not mean that there isn't any type of relationship between the two variables.
      ii. It is possible for two variables to have a correlation coefficient close to 0 and yet be strongly related in a nonlinear way.
   i. Always plot the values of the two variables before you compute a regression line or a Pearson correlation coefficient.
   j. Plotting allows you to detect nonlinear relationships for which the regression line and correlation coefficient are not good summaries.
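A minimal sketch of the relationship in 8f and of symmetry (8d-8e), with the same hypothetical data (Python/numpy assumed): r computed from the slope matches the directly computed correlation, and swapping the two variables leaves r unchanged:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    b, a = np.polyfit(x, y, deg=1)           # least-squares slope and intercept
    s_x, s_y = x.std(ddof=1), y.std(ddof=1)  # sample standard deviations

    r_from_slope = b * (s_x / s_y)           # r = b(sx/sy)
    r_direct = np.corrcoef(x, y)[0, 1]
    print(r_from_slope, r_direct)            # the two values should agree

    # Symmetry: swapping dependent and independent variables leaves r unchanged
    print(np.corrcoef(y, x)[0, 1])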

9. Explaining Variability p. 455-456
   a. Square of the Correlation Coefficient - Tells you what proportion of the variability of the dependent variable is explained by the regression model.
   b. If there is a perfect relationship between the two variables, you can attribute all of the observed differences in one variable to differences in the other variable.
      i. One variable is said to explain all of the observed variability.
      ii. In this situation, the correlation coefficient and its square are both equal to 1 in absolute value.
   c. When all of the data points don't fall exactly on the regression line, you calculate how much of the observed variability in one variable can be attributed to differences in the other (see the sketch at the end of this section).
      i. Residual Sum of Squares
         1. Measures how much of the variability is not explained by the regression.
         2. Square the residuals for all of the cases and add them up.
      ii. Total Sum of Squares (the regression sum of squares plus the residual sum of squares) - Obtain the total variability of the dependent variable by calculating its variance and multiplying it by the number of cases minus 1.
      iii. Proportion of Variability Not Explained by the Regression - Divide the residual sum of squares by the total sum of squares.
      iv. Proportion of Variability Explained by the Regression
         1. 1 minus the proportion not explained.
         2. Equal to the square of the correlation coefficient between the dependent and independent variables.
      v. Adjusted R²
         1. An estimate of how well the model would fit another data set from the same population.
         2. Since the slope and the intercept are based on the values in your data set, the model fits your data somewhat better than it would another sample of cases.
         3. The value of adjusted R² is always smaller than the value of R².
   d. Warnings p. 457-458
      i. Don't use the linear regression equation to make predictions when values of the independent variable are outside of the observed range. Restrict predictions to cases whose independent-variable values fall within the range observed in the data set.
      ii. Don't calculate a regression equation unless the relationship between the two variables appears to be linear over the entire observed range of the independent variable.
      iii. Beware of a relationship that depends heavily on a single point.
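A minimal sketch of the variability bookkeeping in section 9, once more on the hypothetical data (Python/numpy assumed). The adjusted R² line uses the standard one-predictor adjustment formula, which is an assumption here rather than something quoted from the textbook:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    N = len(x)

    b, a = np.polyfit(x, y, deg=1)
    residuals = y - (a + b * x)

    ss_residual = np.sum(residuals ** 2)    # variability not explained
    ss_total = (N - 1) * y.var(ddof=1)      # total variability of y
    ss_regression = ss_total - ss_residual  # variability explained

    r_squared = 1 - ss_residual / ss_total          # proportion explained
    print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)  # the two should match

    # Adjusted R^2 (standard formula for k = 1 predictor; assumed, not from the text)
    k = 1
    adj_r_squared = 1 - (1 - r_squared) * (N - 1) / (N - k - 1)
    print(adj_r_squared)  # always smaller than r_squared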
