
This module introduces some of the most important statistical concepts for businesses and governments: scatterplots, correlation, and regression. The U.S. military relies on linear and nonlinear regression analysis to predict equipment and program costs, often paying high consulting fees to those who can come up with the best regression equations. Major corporations also rely on regression analysis to predict future trends in costs, profits, and other parameters. Just being able to plot data and show trends can boost careers and enhance presentations, so correlation and regression are valuable tools to learn. However, one must remember that correlation, even high correlation, does not necessarily imply a cause-and-effect relationship between variables. There may be a third, hidden variable, or the correlation may be spurious. Statisticians usually require controlled experiments to definitively declare cause and effect, but correlation and regression can still be useful for making predictions. Think about the businesses you know and what correlations and regression predictions would be useful in growing and improving those businesses.

Scatter Diagram

A scatter diagram, or scatterplot, is a two-dimensional xy-coordinate graph that shows the relationship between two quantities. The focus here is the correlation between the x and y variables and developing the best-fit linear equation to represent the relationship between the variables. Two variables can be positively associated (y tends to increase as x increases) or negatively associated (y tends to decrease as x increases). To measure the strength of the correlation, use the linear correlation coefficient, r. The linear correlation coefficient is always between -1 and 1, inclusive, and the closer it is to -1 or 1, the stronger the linear relationship between the variables. The equation for computing the sample linear correlation coefficient is

(Eqn. 1)
$$ r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$

MAT 130 Module Four

where $\bar{x}$ is the sample mean of the explanatory (independent) variable, $s_x$ is the sample standard deviation of the explanatory variable, $\bar{y}$ is the sample mean of the response (dependent) variable, $s_y$ is the sample standard deviation of the response variable, and $n$ is the number of data items in the sample. As you can see, the equation involves many computations. This is where software tools help: they carry out the computations so that the focus can stay on interpreting the value of r.
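As a rough illustration of how a software tool might carry out Eqn. 1, the sketch below computes r step by step using only Python's standard library. The function name `correlation` and the five data pairs are hypothetical, chosen only to keep the arithmetic easy to follow; they do not come from the module.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample linear correlation coefficient r, following Eqn. 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)    # sample means
    s_x, s_y = stdev(xs), stdev(ys)      # sample standard deviations (n - 1 divisor)
    # Sum of the products of the standardized x and y values.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4))  # 0.7746
```

A value of about 0.77 would indicate a fairly strong positive linear association for these made-up points.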

Least-Squares Regression

Given two variables, one can compute linear equations that relate the variables. These linear equations, each of the form y = mx + b, can be used to make predictions for values of y. The ultimate goal is to find the linear equation that best fits the points. The least-squares regression line is the line that minimizes the sum of the squared errors (the squared vertical distances between the observed and predicted values of y), which makes it the best-fitting line by that criterion. The equation of the least-squares regression line is

(Eqn. 2)
$$ \hat{y} = b_1 x + b_0 , $$

where $b_1$ is the slope and $b_0$ is the y-intercept. The following scatter diagram, generated using Excel, shows club-head speed and the distance a golf ball travels.

The scatter diagram includes the data points and a trendline. The difference between the observed value of y and the predicted value of y from the trendline is known as the residual.
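To make the ideas of a trendline and its residuals concrete, here is a minimal sketch that fits a least-squares line and lists the residuals. It uses the standard closed-form expressions for the slope and intercept (slope = sum of cross-deviations divided by sum of squared x-deviations; intercept = mean of y minus slope times mean of x). The function names and the data are hypothetical and are not the golf data from the Excel figure.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (b1, b0): slope and y-intercept of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx              # slope
    b0 = y_bar - b1 * x_bar       # y-intercept
    return b1, b0


def residuals(xs, ys, b1, b0):
    """Observed y minus predicted y for each data point."""
    return [y - (b1 * x + b0) for x, y in zip(xs, ys)]


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, b0 = least_squares_line(x, y)
res = residuals(x, y, b1, b0)

print(round(b1, 4), round(b0, 4))      # 0.6 2.2
print([round(e, 2) for e in res])      # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

A positive residual means the point lies above the trendline and a negative residual means it lies below. For a least-squares line with an intercept, the residuals sum to zero apart from rounding.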


Coefficient of Determination

After noticing that a relation exists, it is important to analyze how well the least-squares regression line describes the relation between the explanatory and response variables. The coefficient of determination, $r^2$, measures the proportion of the total variation in the response variable that is explained by the least-squares regression line. It is a number between 0 and 1; the closer it is to 1, the more of the variation the line explains. Calculating $r^2$, in short, helps determine how well the regression line describes the relation.
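One common way to compute the coefficient of determination is to compare the variation left over after fitting the line (the sum of squared residuals) with the total variation in y. The sketch below does this for a simple least-squares line. The function name `r_squared` and the data are hypothetical, used only to illustrate the calculation.

```python
from statistics import mean


def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope and intercept of the least-squares regression line.
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    # Unexplained variation: sum of squared residuals.
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y about its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared(x, y)
print(round(r2, 4))  # 0.6
```

Here the line would explain about 60% of the variation in y. For a simple linear regression, this value equals the square of the linear correlation coefficient r.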
