REGRESSION
Regression
The idea behind the calculation of the coefficient of correlation is that the scatter plot of the data corresponds to a cloud that follows a straight line. This idea can be formalized by regression methods. In this class we will: consider the definition of simple linear regression; find a method to predict an individual value; use the normal curve to estimate percentile ranks; describe the regression effect; compute the regression errors and their RMS; and study the behavior of regression errors.
Regression
The regression method describes how one variable depends on another.
The Northern California temperature data have an average altitude of 3,524 feet with a SD of 1,839 feet, and an average temperature of 70.3 degrees with a SD of 6.5 degrees. The correlation between temperature and altitude is -0.76.
Regression
The cloud of points shows a mild negative association between the two variables, as does the value of r. Can we use the values of altitude to estimate the average values of temperature?
Regression
How does the regression line work? Associated with an increase of one SD in x there is an increase of r SDs in y on average.
Clearly, if the correlation coefficient is negative, then the average value of y decreases as x increases. In the temperature and altitude example, an increase in altitude of 1,839 feet (one SD) is associated with a change of -0.76 × 6.5 ≈ -4.94 degrees in the average temperature, that is, a drop of about 5 degrees.
Regression
How do we use the method to predict an individual value? If we consider two variables x and y and we want to predict the value of y for a specific value of x, we use the average value of y that corresponds to the value of x according to the regression method.
Example: The first-year GPAs and the Math SAT scores for the students of a university produce the following data:
average SAT score = 550, SD = 80
average 1st-year GPA = 2.6, SD = 0.6
r = 0.40
We want to predict the 1st-year GPA of a student with a SAT score of 650.
Regression
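The student's SAT score in standard units is
(650 - 550)/80 = 1.25
So the student is 1.25 SDs above the average SAT score. The regression method predicts a GPA that is r × 1.25 = 0.40 × 1.25 = 0.5 SDs above the average GPA, which corresponds to 0.5 × 0.6 = 0.3 GPA points. The predicted 1st-year GPA is 2.6 + 0.3 = 2.9.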
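The prediction for the student with a SAT score of 650 can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `regression_predict` and its parameter names are hypothetical choices. It converts x to standard units, multiplies by r, and converts back to the units of y.

```python
def regression_predict(x, avg_x, sd_x, avg_y, sd_y, r):
    """Predict y from x with the regression method (hypothetical helper)."""
    z_x = (x - avg_x) / sd_x      # x in standard units
    z_y = r * z_x                 # predicted y in standard units: r SDs per SD of x
    return avg_y + z_y * sd_y     # back to the units of y


# SAT / GPA example from the text
predicted_gpa = regression_predict(650, avg_x=550, sd_x=80,
                                   avg_y=2.6, sd_y=0.6, r=0.40)
print(f"predicted 1st-year GPA: {predicted_gpa:.2f}")
```

The same helper applies to any pair of variables once the two averages, the two SDs and r are known, subject to the warning that new subjects must be similar to those used to compute these summaries.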
Regression
WARNING: You can use the regression method on new subjects provided that they are similar to the ones that were used to produce the averages, SDs and r used in the regression method. In the previous example the method will not be valid for students of a different institution.
Regression
We can use the regression method and the normal curve to produce estimates of the percentile ranks.
Example: In the previous example suppose a student has a percentile rank of 90% for the SAT scores. That is, only 10% of the scores are higher than his. What is the predicted percentile rank for the 1st-year GPA of this student? Using the normal curve, a 90% probability corresponds to a z score of about 1.3. This means that the student's SAT score is 1.3 SDs above average. This corresponds to being 0.4 × 1.3 ≈ 0.5 SDs above the average GPA, which corresponds to an accumulated probability, under the normal curve, of approximately 69%.
Regression
So the percentile rank on 1st-year GPA of a student with a percentile rank on SAT score of 90% is predicted to be 69%. Notice that the student with a SAT percentile rank of 90% was 'pulled down' to only 69% by the regression method. Why is that? Suppose the correlation were perfect, r = 1; then 90% would convert to 90%. The other extreme is that there is no correlation, r = 0, so, in the absence of any information, the best guess is the median, or 50th percentile. The regression method produces a rank that is somewhere between these two extremes.
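The percentile-rank calculation can be sketched with Python's standard-library `statistics.NormalDist`. This is an illustrative sketch that assumes both variables follow the normal curve; the function name `predicted_percentile` is hypothetical.

```python
from statistics import NormalDist


def predicted_percentile(percentile_x, r):
    """Predicted percentile rank of y, given the percentile rank of x (as a fraction)."""
    std_normal = NormalDist()                # normal curve: mean 0, SD 1
    z_x = std_normal.inv_cdf(percentile_x)   # percentile rank -> standard units
    z_y = r * z_x                            # regression method
    return std_normal.cdf(z_y)               # standard units -> percentile rank


rank = predicted_percentile(0.90, r=0.40)
print(f"predicted GPA percentile rank: {100 * rank:.0f}%")

# The two extremes discussed in the text
print(predicted_percentile(0.90, r=1.0))   # perfect correlation keeps the rank
print(predicted_percentile(0.90, r=0.0))   # no correlation gives the median
```

Because the code does not round the z scores to one decimal place, its result can differ by about one percentage point from the value obtained by hand with a normal table.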
Example
The shoe sizes and the heights of 14 men are recorded. The shoe size average is 10.46 with a SD of 1.21. The average height is 70.45 inches with a SD of 2.45 inches. The correlation is 0.93. What is the average height of a man who wears shoes of size 11.5? We convert 11.5 to standard units
(11.5 - 10.46)/1.21 = 0.859
so the shoe size is 0.859 SDs above average. This means that the height will be 0.859 × 0.93 × 2.45 ≈ 1.96 inches above average. So the average height of a man with shoe size 11.5 will be 70.45 + 1.96 = 72.41 inches.
Regression effect
Galton, a British statistician, studied the relationship between the heights of fathers and sons in 1,078 families. He noticed that tall fathers tended to have sons who were shorter than themselves, and short fathers tended to have sons who were taller than themselves. He termed this fact regression to mediocrity. This is where the term regression comes from.
Example: Children are tested for IQ before and after taking a preschool program. In both cases the scores average 100 and the SD is 15. So, on average, there seems to be no effect. Nevertheless, children below average on the first test had an average gain of 5 IQ points and those above average had an average loss of 5 IQ points. This is the regression effect.
Regression effect
A model for the test-retest situation is
observed test score = true score + chance error
Suppose that the chance error can be either positive or negative. Suppose that the true scores in the population follow the normal curve with an average of 100 and a SD of 15. Consider the children who scored 140 on the first test. There are two possibilities:
true score below 140, with a positive chance error
true score above 140, with a negative chance error
Which one is more likely? According to the normal curve, the first possibility is more likely, since the mean is 100 and so the interval above 140 has less probability than the one below 140. Under this scenario, the second test is more likely to produce a value below 140.
Regression effect
A symmetric situation is valid for those scoring, say, 80 IQ points. It is likely that the true score is above 80 with a negative chance error, and so the second score is likely to be above the first. In other words, if a student scores above average on the first test, it is likely that the true score is lower than the observed one. If the student takes the test again, chances are that the second score will be lower than the first. A symmetric situation is true for a person scoring below average on the first test. This explains the regression effect.
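The test-retest model can be explored with a small simulation. The sketch below is illustrative only: the chance-error SD of 5 points, the number of children and the random seed are assumptions that do not come from the slides. No treatment effect is built in, so any average change between the two tests is purely the regression effect.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the run is reproducible

N_CHILDREN = 100_000
ERROR_SD = 5    # assumed size of the chance error; not given in the text

# observed score = true score + chance error, with a fresh error on each test
true_scores = [random.gauss(100, 15) for _ in range(N_CHILDREN)]
first = [t + random.gauss(0, ERROR_SD) for t in true_scores]
second = [t + random.gauss(0, ERROR_SD) for t in true_scores]


def average_change(keep):
    """Average of (second - first) over children whose first score satisfies keep."""
    changes = [s - f for f, s in zip(first, second) if keep(f)]
    return fmean(changes)


change_high = average_change(lambda f: f >= 140)  # high scorers on the first test
change_low = average_change(lambda f: f <= 80)    # low scorers on the first test

print(f"first score >= 140: average change {change_high:+.1f}")
print(f"first score <=  80: average change {change_low:+.1f}")
```

In this sketch the high scorers drop on average and the low scorers gain on average, even though every child's true score is the same on both tests.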
Regression errors
The regression method can be used to predict y from x. But actual values differ from predictions. These differences are the regression errors:
error = actual value of y - predicted value of y
Some of the errors defined in this way are positive and some are negative, reflecting the fact that some observations are above and some are below the regression line. How do we measure the error in a regression? The overall size of the error is measured using the root-mean-square (RMS), as we did to obtain the SD. This is equal to
RMS error = √( ((error 1)² + (error 2)² + ... + (error n)²) / n )
where n is the number of observations.
Regression errors
What if we ignore the values of x? Then our prediction for y is the average of y. In this case the RMS error coincides with the SD of y.
RMS error of the regression line = √(1 - r²) × SD of y
We observe the following features:
The units of the RMS error are the same as the units of the variable being predicted.
Perfect correlation (r = 1 or r = -1) corresponds to zero RMS error.
Zero correlation corresponds to maximum RMS error (equal to the SD of y).
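The shortcut formula for the RMS error can be checked numerically. The sketch below uses made-up simulated data (the slope 2.0, the noise SD 1.5 and the seed are arbitrary assumptions) and compares the RMS of the regression errors, computed directly, with √(1 - r²) × SD of y. The SDs are computed with `statistics.pstdev`, which divides by n, matching the RMS definition of the SD.

```python
import math
import random
from statistics import fmean, pstdev

random.seed(1)  # arbitrary seed, for reproducibility only

# Made-up data with a linear trend plus noise (illustrative assumption)
x = [random.gauss(0, 1) for _ in range(500)]
y = [2.0 * xi + random.gauss(0, 1.5) for xi in x]

avg_x, avg_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)

# r = average of the products of the values in standard units
r = fmean([((xi - avg_x) / sd_x) * ((yi - avg_y) / sd_y) for xi, yi in zip(x, y)])

# Regression predictions and regression errors
predicted = [avg_y + r * (xi - avg_x) / sd_x * sd_y for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]

rms_error = math.sqrt(fmean([e * e for e in errors]))   # direct computation
shortcut = math.sqrt(1 - r * r) * sd_y                  # formula from the text

print(f"r = {r:.3f}")
print(f"RMS error (direct)   = {rms_error:.4f}")
print(f"RMS error (shortcut) = {shortcut:.4f}")
```

The two values agree up to floating-point rounding, and the errors average out to zero, as expected for the regression line.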
We expect to see no trends or clusters in the residuals (the regression errors).
There should be about the same number of positive as negative residuals.
A histogram of the residuals should look symmetric around zero.
Problem
The following results are taken from a study of about 1,000 families:
average height of husband = 68 inches, SD = 2.7 inches
average height of wife = 63 inches, SD = 2.5 inches
r = 0.25
Predict the height of a wife when the height of her husband is
1. 72 inches
The husband is 4 inches above average height. This is 4/2.7 ≈ 1.5 SDs above the average. So the wife is predicted to be r × 1.5 = 0.25 × 1.5 ≈ 0.4 SDs above average, which corresponds to 0.4 × 2.5 = 1 inch. The predicted height of the wife is 63 + 1 = 64 inches.
2. 68 inches
The husband is right on the average, so the wife is predicted to be right on the average as well, at 63 inches.
(75 - 68)/10 = 0.7
This corresponds to a right-hand tail of about 24% under the normal curve.
Among the students who scored 165 in the LSAT, the spread of the first-year scores around their predicted average is given by the RMS error of the regression line:
√(1 - r²) × SD of y = √(1 - 0.6²) × 10 = 8 points
This new SD can be used to convert to standard units
(75 - 71)/8 = 0.5
and, using the normal curve, we obtain an area of 31% above 0.5. This is the percentage of students scoring more than 75 in the first year among those who scored 165 in the LSAT. Notice that this percentage is higher than the 24% we obtained before. This is because we have focused on a smaller portion of the sample, which has a higher average (71 instead of 68) and a smaller SD.
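The two tail areas in this problem can be computed with `statistics.NormalDist`. This is a hedged sketch that takes the numbers appearing in the calculation above (average 68, SD 10, r = 0.6, predicted average 71 for the subgroup) and assumes the scores follow the normal curve; the variable names are illustrative.

```python
import math
from statistics import NormalDist

SD_Y = 10   # SD of first-year scores
R = 0.6     # correlation used in the RMS error calculation

# All students: average 68, SD 10
all_students = NormalDist(mu=68, sigma=SD_Y)

# Students who scored 165 in the LSAT: predicted average 71,
# spread given by the RMS error of the regression line
rms_error = math.sqrt(1 - R**2) * SD_Y
subgroup = NormalDist(mu=71, sigma=rms_error)

pct_all = 1 - all_students.cdf(75)   # right-hand tail above 75, all students
pct_sub = 1 - subgroup.cdf(75)       # right-hand tail above 75, subgroup

print(f"RMS error: {rms_error:.1f} points")
print(f"scoring over 75, all students: {100 * pct_all:.0f}%")
print(f"scoring over 75, LSAT of 165:  {100 * pct_sub:.0f}%")
```

The printed tail areas can be compared with the values read from a normal table.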
The intercept is the height of the line when x = 0. The slope is the rate at which y increases per unit increase in x. If the slope is negative then y decreases as x increases.
The slope of the regression line is r × SD of y / SD of x. The intercept is given by
average of y - slope × average of x
The equation for the regression line is called the regression equation and can be written as
y = slope × x + intercept
So, for our example, we have that
predicted income = ($1,400 per year of education) × (years of education) + $4,000
Example
Back to our shoe size example. The shoe sizes and the heights of 14 men are recorded. The shoe size average is 10.46 with a SD of 1.21. The average height is 70.45 inches with a SD of 2.45 inches. The correlation is 0.93. The slope of the regression line is
(r × SD of height) / (SD of shoe size) = (0.93 × 2.45) / 1.21 ≈ 1.88 inches per shoe size
To obtain the intercept we consider a shoe size of zero. This is 10.46 units below average and so will correspond to a height that is 1.883 × 10.46 ≈ 19.70 inches below average (using the unrounded slope). So it corresponds to a height of 70.45 - 19.70 = 50.75 inches. The regression line is
height = 1.88 × shoe size + 50.75 inches
Q: What is the predicted height of a man with a shoe size of 9?
A: Using the regression equation we have 1.88 × 9 + 50.75 = 67.67 inches.
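The slope and intercept for the shoe size example can be computed directly from the five summary statistics. The following is a minimal sketch; the function name `regression_line` is a hypothetical choice, not something defined in the slides.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the regression line for predicting y from x."""
    slope = r * sd_y / sd_x               # r SDs of y per SD of x
    intercept = avg_y - slope * avg_x     # height of the line at x = 0
    return slope, intercept


# Shoe size (x) and height in inches (y), from the example
slope, intercept = regression_line(avg_x=10.46, sd_x=1.21,
                                   avg_y=70.45, sd_y=2.45, r=0.93)
height_9 = slope * 9 + intercept

print(f"slope     = {slope:.2f} inches per shoe size")
print(f"intercept = {intercept:.2f} inches")
print(f"predicted height for shoe size 9 = {height_9:.2f} inches")
```

Because the code keeps the slope unrounded, the predicted height can differ in the second decimal place from a hand calculation that rounds the slope to 1.88 first.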
Least Squares
Consider a cloud of points produced by obtaining the scatter diagram of observations corresponding to two variables x and y. There are many lines that we can draw through the cloud. Which is the straight line that fits the points best? The regression line is a possible solution to this problem.
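Among all possible lines, the regression line of y on x is the one with the smallest RMS error in predicting y from x. This is the reason why the regression line is called the least squares line.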
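The least squares property can be illustrated numerically by comparing the RMS error of the regression line with that of a few nearby lines. The data below are simulated (the trend 3.0 + 0.5x, the noise level and the seed are arbitrary assumptions), and the helper `rms_error_of_line` is a hypothetical name.

```python
import math
import random
from statistics import fmean, pstdev

random.seed(2)  # arbitrary seed, for reproducibility only

# Made-up cloud of points around a straight line (illustrative assumption)
x = [random.uniform(0, 10) for _ in range(200)]
y = [3.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]


def rms_error_of_line(slope, intercept):
    """RMS of the vertical distances from the points to the given line."""
    return math.sqrt(fmean([(yi - (slope * xi + intercept)) ** 2
                            for xi, yi in zip(x, y)]))


# Regression line from the averages, the SDs and r
avg_x, avg_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)
r = fmean([((xi - avg_x) / sd_x) * ((yi - avg_y) / sd_y) for xi, yi in zip(x, y)])
slope = r * sd_y / sd_x
intercept = avg_y - slope * avg_x

best_rms = rms_error_of_line(slope, intercept)

# Lines obtained by nudging the slope and/or the intercept
other_rms = [rms_error_of_line(slope + d_slope, intercept + d_int)
             for d_slope in (-0.2, 0.0, 0.2)
             for d_int in (-1.0, 0.0, 1.0)
             if (d_slope, d_int) != (0.0, 0.0)]

print(f"regression line RMS error: {best_rms:.3f}")
print(f"smallest RMS error among the nudged lines: {min(other_rms):.3f}")
```

Every nudged line has a larger RMS error than the regression line. This comparison is only a spot check on a handful of lines, not a proof.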
Least Squares
Example: Let b be the length of a spring with no load. If a load x is attached to the spring, the stretch is proportional to x. Thus the length of the spring is y = mx + b, where m and b are constants that depend on the spring. An experiment is run to determine the constants for a given spring; the data are shown in the table.
The correlation coefficient is r = 0.999, so the points are very close to a straight line. But they are not exactly on a straight line. This is probably due to measurement error. The regression line for these data produces estimates of b and m, given, respectively, by the intercept and the slope of the line. The values are m ≈ 0.5 cm per kg and b ≈ 439.01 cm. These are the least squares estimates of m and b.
Problem
Find the regression equation for predicting final score from midterm score, based on the following information:
average midterm score = 70, SD = 10
average final score = 55, SD = 20
r = 0.60
The slope of the line can be obtained as