
11. REGRESSION
Objectives: At the end of this unit, a student should be able to:
- Understand regression
- Estimate the regression equation
- Understand the relation between correlation and regression
- Use regression analysis
- Understand the application of regression analysis in business
- Appreciate regression analysis

Structure
11.1 Introduction
11.2 Equation of a straight line
11.3 Two Lines of Regression
11.4 Properties of Regression Lines
11.5 Key words
11.6 Suggested readings
11.7 Review exercise

11.1 Introduction
The coefficient of correlation gives the magnitude of the association between two variables. The next step is to obtain an expression for the relationship between the variables. We derive the equation that defines this relationship, which is linear, since we defined linear correlation in the previous sections. The functional relation between the variables is called the regression equation. The word regression means a tendency to return to the mean. For example, in the correlation of heights of fathers and sons, a tendency of the human race to return, or regress, to the average height is observed.

11.2 Equation of a straight line
The equation of a straight line is $Y = a + bX$, where a and b are constants. a is the Y-intercept, i.e. the point where the line $Y = a + bX$ cuts the Y axis, and b is the slope of the line, which gives the rate of change of Y with respect to X. We can find the values of a and b using the normal equations, which are obtained as follows. Start from $Y = a + bX$.

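Taking the sum of both sides over the n observations, we get

$$\sum Y = na + b\sum X \qquad (1)$$

since a is constant and therefore $\sum a = na$.

Multiplying the equation by X and taking the sum of both sides, we get

$$\sum XY = a\sum X + b\sum X^2 \qquad (2)$$

Solving equations (1) and (2), we get

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad a = \bar{Y} - b\bar{X}$$

After obtaining the values of a and b we get the estimating equation

$$\hat{y} = a + bX$$

where $\hat{y}$ is the estimated value of y when the value of X is given.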
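The formulas for a and b can be sketched in Python as below. This is a minimal illustration rather than part of the original unit; the function names `fit_line` and `estimate` are chosen here for the example only.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x.

    Uses b = (sum(XY) - n*mean(X)*mean(Y)) / (sum(X^2) - n*mean(X)^2)
    and  a = mean(Y) - b*mean(X).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = sum_xx - n * mean_x ** 2
    if denom == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    b = (sum_xy - n * mean_x * mean_y) / denom
    a = mean_y - b * mean_x
    return a, b


def estimate(a, b, x):
    """Estimated value of y for a given x on the fitted line."""
    return a + b * x
```

For points that lie exactly on a line, such as (0, 1), (1, 3), (2, 5), `fit_line` recovers that line's intercept and slope.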

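Illustration: Obtain the regression equation for the following data.

X: 10 9 7 8 11
Y: 6 3 2 4 5

We find the values of a and b using the above data.

| X | Y | XY | X² |
|---|---|---|---|
| 10 | 6 | 60 | 100 |
| 9 | 3 | 27 | 81 |
| 7 | 2 | 14 | 49 |
| 8 | 4 | 32 | 64 |
| 11 | 5 | 55 | 121 |
| Total: 45 | 20 | 188 | 415 |

$$\bar{X} = \frac{45}{5} = 9, \qquad \bar{Y} = \frac{20}{5} = 4$$

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{188 - 5(9)(4)}{415 - 5(9)^2} = \frac{8}{10} = 0.8$$

$$a = \bar{Y} - b\bar{X} = 4 - 9 \times 0.8 = -3.2$$

so that

$$\hat{y} = -3.2 + 0.8X$$

is the estimating equation.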
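As a cross-check of this illustration, the two normal equations can also be solved directly as a 2 x 2 linear system using Cramer's rule. The sketch below is one way to do that; the variable names are illustrative and not part of the original text.

```python
# Data of the illustration
xs = [10, 9, 7, 8, 11]
ys = [6, 3, 2, 4, 5]

n = len(xs)
sum_x = sum(xs)                                # 45
sum_y = sum(ys)                                # 20
sum_xy = sum(x * y for x, y in zip(xs, ys))    # 188
sum_xx = sum(x * x for x in xs)                # 415

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_xx
det = n * sum_xx - sum_x * sum_x
a = (sum_y * sum_xx - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

print(f"a = {a:.4f}, b = {b:.4f}")
```

Solving the system this way gives the same a and b as the mean-based formulas, because those formulas are the closed-form solution of the same two equations.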

Exercise: Obtain an estimating equation for the data given below:

X: 5 3 7 4 8 2 10 6 8 7 9 11
Y: 8 6 8 5 9 6 8 5 11 7 8 10

11.3 Two Lines of Regression
For bivariate data $(X_i, Y_i)$, the relationship may be that Y depends on X or that X depends on Y. If Y depends on X, the regression line is Y on X: Y is the dependent variable and X is the independent variable. If X depends on Y, the regression line is X on Y: X is the dependent variable and Y is the independent variable.

The regression equation of Y on X, $Y = a + bX$, is used to estimate the value of Y when X is known. The regression equation of X on Y, $X = c + dY$, is used to estimate the value of X when Y is given. Here a, b, c and d are constants. In $Y = a + bX$, a can be interpreted as the average value of Y when X is zero; in $X = c + dY$, c is the average value of X when Y is zero.

The slopes of the equations Y on X and X on Y are denoted $b_{yx}$ and $b_{xy}$ respectively. Their values are

$$b_{yx} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}, \qquad b_{xy} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(Y)}$$

Simplifying, we get

$$b_{yx} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad b_{xy} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$$

$b_{yx}$ and $b_{xy}$ are the coefficients of regression. After we obtain their values, we obtain the regression equations of Y on X and X on Y by substituting in the following equations:

$$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$$

$$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$$
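The two regression coefficients and the two estimating equations can be sketched in Python as follows. This is an illustrative helper written for this section, not a library routine; the names `regression_coefficients`, `predict_y` and `predict_x` are assumptions made for the example. Covariance and variance are computed with divisor n; the divisor cancels in each ratio, so the coefficients are unaffected by that choice.

```python
def regression_coefficients(xs, ys):
    """Return (mean_x, mean_y, byx, bxy) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    byx = cov_xy / var_x   # slope of the line Y on X
    bxy = cov_xy / var_y   # slope of the line X on Y
    return mean_x, mean_y, byx, bxy


def predict_y(x, mean_x, mean_y, byx):
    """Y on X: (Y - mean_y) = byx * (X - mean_x)."""
    return mean_y + byx * (x - mean_x)


def predict_x(y, mean_x, mean_y, bxy):
    """X on Y: (X - mean_x) = bxy * (Y - mean_y)."""
    return mean_x + bxy * (y - mean_y)
```

Keeping the two predictors separate mirrors the text: the line Y on X is used only to estimate Y, and the line X on Y only to estimate X.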

The value of b in the previous section is the same as byx.

Illustration:

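The table below gives the stopping distance d (in feet) of an automobile travelling at speed V (in miles per hour) at the instant danger is sighted.

Speed V (miles per hour): 20 30 40 50 60 70
Stopping distance d (ft): 54 90 138 206 292 396

Estimate the distance when the speed is 45 miles per hour. Estimate the speed when the distance travelled before stopping the automobile is 100 feet.

We have to obtain the estimating equations, so we calculate byx and bxy, taking speed as X and distance as Y.

| X | Y | XY | X² | Y² |
|---|---|---|---|---|
| 20 | 54 | 1080 | 400 | 2916 |
| 30 | 90 | 2700 | 900 | 8100 |
| 40 | 138 | 5520 | 1600 | 19044 |
| 50 | 206 | 10300 | 2500 | 42436 |
| 60 | 292 | 17520 | 3600 | 85264 |
| 70 | 396 | 27720 | 4900 | 156816 |
| Total: 270 | 1176 | 64840 | 13900 | 314576 |

$$\bar{X} = \frac{270}{6} = 45, \qquad \bar{Y} = \frac{1176}{6} = 196$$

$$b_{yx} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{64840 - 6(45)(196)}{13900 - 6(45)^2} = \frac{11920}{1750} = 6.811429$$

$$b_{xy} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} = \frac{11920}{314576 - 6(196)^2} = \frac{11920}{84080} = 0.141770$$

Substituting in the regression equations of Y on X and X on Y,

$$(Y - \bar{Y}) = b_{yx}(X - \bar{X}), \qquad (X - \bar{X}) = b_{xy}(Y - \bar{Y})$$

we get (Y - 196) = 6.811429(X - 45), which simplifies to

Y = 6.811429X - 110.514

and (X - 45) = 0.141770(Y - 196), which simplifies to

X = 0.141770Y + 17.213

When the speed is 45 miles per hour, the line Y on X gives an estimated stopping distance of Y = 6.811429(45) - 110.514 = 196 feet. When the stopping distance is 100 feet, the line X on Y gives an estimated speed of X = 0.141770(100) + 17.213 = 31.39 miles per hour.

Observe that the values of byx and bxy have the same sign.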
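The stopping-distance illustration can be recomputed with a short script. The sketch below uses the six (speed, distance) pairs exactly as listed in the problem statement, with 138 ft as the third distance, and then answers the two questions posed. Variable names are illustrative.

```python
# Data as listed in the problem statement
speeds = [20, 30, 40, 50, 60, 70]            # X, miles per hour
distances = [54, 90, 138, 206, 292, 396]     # Y, feet

n = len(speeds)
mean_x = sum(speeds) / n
mean_y = sum(distances) / n
sum_xy = sum(x * y for x, y in zip(speeds, distances))
sum_xx = sum(x * x for x in speeds)
sum_yy = sum(y * y for y in distances)

# Both coefficients share the same numerator.
numerator = sum_xy - n * mean_x * mean_y
byx = numerator / (sum_xx - n * mean_x ** 2)   # slope of Y on X
bxy = numerator / (sum_yy - n * mean_y ** 2)   # slope of X on Y

# Y on X is used to estimate distance; X on Y to estimate speed.
distance_at_45 = mean_y + byx * (45 - mean_x)
speed_at_100 = mean_x + bxy * (100 - mean_y)

print(f"byx = {byx:.6f}, bxy = {bxy:.6f}")
print(f"estimated distance at 45 mph: {distance_at_45:.2f} ft")
print(f"estimated speed for 100 ft:   {speed_at_100:.2f} mph")
```

Because 45 miles per hour happens to equal the mean speed, the estimated distance is simply the mean distance: every Y on X line passes through the point of means.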

Exercise: For the data below, construct a scatter diagram. Find the least square regression lines Y on X and X on Y.

Grade on first quiz X: 6 5 8 8 7 6 10 4 9 7
Grade on second quiz Y: 8 7 7 10 5 8 10 6 8 6

11.4 Properties of Regression Lines
The regression equations Y on X and X on Y have the following properties:

a) The lines of regression meet at the point whose co-ordinates are $(\bar{X}, \bar{Y})$. The averages of both X and Y lie on both lines of regression.

b) The regression coefficients byx and bxy and the correlation coefficient r have the same sign: either all three are positive or all three are negative.

c) There is an angle formed between the two lines of regression. Let the angle be denoted by $\theta$. If the correlation is perfect, then $\theta = 0$ and the two lines coincide exactly. As the correlation becomes weaker and weaker, $\theta$ increases.

d) The correlation coefficient r is the geometric mean of the regression coefficients. The sign given to r, + or -, is the sign shared by byx and bxy.

$$r = \pm\sqrt{b_{yx} \cdot b_{xy}}$$

e) The regression coefficients can be written in terms of r and the standard deviations:

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \qquad \text{and} \qquad b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$
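Properties a), b), d) and e) can be checked numerically. The sketch below does so for the five-point data set used in section 11.2. It also evaluates the angle of property c) with the standard formula for the angle between two straight lines of slopes m1 and m2, tan θ = |(m2 - m1) / (1 + m1 m2)|, which is an addition not derived in this unit. Variable names are illustrative.

```python
import math

xs = [10, 9, 7, 8, 11]
ys = [6, 3, 2, 4, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squared deviations and cross-products
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

byx = sxy / sxx
bxy = sxy / syy

# d) r as the signed geometric mean of byx and bxy, versus r computed directly
r_direct = sxy / math.sqrt(sxx * syy)
r_from_b = math.copysign(math.sqrt(byx * bxy), byx)

# e) byx = r * sigma_y / sigma_x
sigma_x = math.sqrt(sxx / n)
sigma_y = math.sqrt(syy / n)
byx_from_r = r_direct * sigma_y / sigma_x

# c) angle between the lines; in the (X, Y) plane the line X on Y has
#    slope 1 / bxy (this assumes bxy is not zero)
m1 = byx
m2 = 1 / bxy
theta_degrees = math.degrees(math.atan(abs((m2 - m1) / (1 + m1 * m2))))

print(f"r = {r_direct:.4f}, from coefficients = {r_from_b:.4f}")
print(f"angle between the lines = {theta_degrees:.2f} degrees")
```

Property a) needs no separate computation here: both lines are written in deviation form around the means, so substituting the mean of X into either one returns the mean of Y.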

Illustration:

The two lines of regression are 5x + 6y = 160 and 2x + 4y = 80. Find
1. the mean values of X and Y
2. the regression coefficients
3. the correlation coefficient
4. the variance of Y if the standard deviation of X is $\sqrt{60}$.

1. We have 5x + 6y = 160 and 2x + 4y = 80. Since both lines of regression pass through the point of means, we first solve these equations simultaneously. To eliminate x, multiply the first equation by 2 and the second by 5:

10x + 12y = 320
10x + 20y = 400

Subtracting, -8y = -80, so y = 10. Substituting in either equation we get x = 20. Hence $\bar{X} = 20$ and $\bar{Y} = 10$.

2. The regression equations are known, but we do not know which is Y on X and which is X on Y. We assume that 5x + 6y = 160 is Y on X and 2x + 4y = 80 is X on Y, and rearrange them into the forms Y = a + bx and X = c + dy to find the regression coefficients:

$$6y = -5x + 160, \qquad y = -\frac{5}{6}x + \frac{160}{6}, \qquad b_{yx} = -\frac{5}{6}$$

$$2x = -4y + 80, \qquad x = -2y + 40, \qquad b_{xy} = -2$$

The correlation coefficient would then satisfy

$$|r| = \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{\frac{5}{6} \times 2} = \sqrt{\frac{5}{3}} > 1$$

which is wrong, as $-1 \le r \le 1$. Our assumption is wrong, so we reverse it.

Now let 5x + 6y = 160 be X on Y and 2x + 4y = 80 be Y on X. Then

$$5x = -6y + 160, \qquad x = -\frac{6}{5}y + \frac{160}{5}, \qquad b_{xy} = -\frac{6}{5}$$

$$4y = -2x + 80, \qquad y = -\frac{1}{2}x + 20, \qquad b_{yx} = -\frac{1}{2}$$

3. Both regression coefficients are negative, so r takes the negative sign:

$$r = -\sqrt{b_{yx} \cdot b_{xy}} = -\sqrt{\frac{6}{5} \times \frac{1}{2}} = -0.774597$$

4. Substituting in the equation $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$ and squaring both sides, we get

$$b_{yx}^2 = r^2\,\frac{\sigma_y^2}{\sigma_x^2}$$

$$\frac{1}{4} = \frac{3}{5} \cdot \frac{\sigma_y^2}{60}$$

$$\sigma_y^2 = 25$$
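The trial-and-error step in this illustration, assuming one assignment and reversing it if |r| exceeds 1, can be sketched as a small Python function. The function name `analyse_regression_lines` and its return format are assumptions made for this example. Each line is passed as a tuple (A, B, C) meaning Ax + By = C, all coefficients are assumed non-zero, and the variance of X is taken as 60, the value used in the worked solution.

```python
import math


def analyse_regression_lines(line1, line2, var_x=None):
    """Identify Y-on-X and X-on-Y among two lines A*x + B*y = C.

    Returns a dict with the means, byx, bxy and r, plus var_y when
    var_x is supplied.
    """
    (a1, b1, c1), (a2, b2, c2) = line1, line2

    # Both regression lines pass through the point of means (Cramer's rule).
    det = a1 * b2 - a2 * b1
    mean_x = (c1 * b2 - c2 * b1) / det
    mean_y = (a1 * c2 - a2 * c1) / det

    # Try each assignment; a valid one needs 0 <= byx * bxy <= 1.
    # If both assignments pass, the first one is returned.
    for y_on_x, x_on_y in ((line1, line2), (line2, line1)):
        byx = -y_on_x[0] / y_on_x[1]   # y = -(A/B) x + C/B
        bxy = -x_on_y[1] / x_on_y[0]   # x = -(B/A) y + C/A
        product = byx * bxy
        if 0 <= product <= 1:
            r = math.copysign(math.sqrt(product), byx)
            result = {"mean_x": mean_x, "mean_y": mean_y,
                      "byx": byx, "bxy": bxy, "r": r}
            if var_x is not None and product > 0:
                # byx^2 = r^2 * var_y / var_x, solved for var_y
                result["var_y"] = byx ** 2 * var_x / product
            return result
    raise ValueError("neither assignment gives |r| <= 1")


result = analyse_regression_lines((5, 6, 160), (2, 4, 80), var_x=60)
print(result)
```

For the two lines of the illustration the first assignment is rejected because the product of the coefficients is 5/3, and the second is accepted, matching the reasoning in the text.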

Exercise: In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible:
Variance of X = 9
Regression equations: 8x - 10y + 66 = 0; 40x - 18y = 214
What were (a) the mean values of x and y, (b) the standard deviation of y, (c) the coefficient of correlation between x and y?


11.5 Key words

Regression: A general process of predicting one variable from another by statistical means using previous data.
Regression line: A line fitted to a set of data points to estimate the relationship between the variables.
Dependent variable: The variable we are trying to predict.
Independent variable: The known variable in regression analysis.

11.6 Suggested readings

Anderson et al., Statistics for Business and Economics, eighth edition, 2002, Thomson Asia Pvt. Ltd., Singapore.
R. Levin and D. Rubin, Statistics for Management, seventh edition, 1997, Prentice Hall of India, New Delhi.
Frank and Althoen, Statistics: Concepts and Applications, 1994, Cambridge University Press, Cambridge.
A. D. Aczel and J. Sounderpandian, Complete Business Statistics, 2002, Tata McGraw Hill, New Delhi, India.
W. J. Stevenson, Business Statistics: Concepts and Applications, 1978, Harper and Row Publishers, New York, USA.

11.7 Review exercise

1. A computer, while calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, obtained the following:

$$N = 25, \quad \sum X = 125, \quad \sum Y = 100, \quad \sum X^2 = 650, \quad \sum Y^2 = 460, \quad \sum XY = 508$$

Find the correlation coefficient of X and Y, the mean values of X and Y, and the regression equations of Y on X and X on Y.
2. A furniture retailer in a locality is interested in studying whether some relationship exists between the number of building permits issued in that locality in past years and the volume of sales in those years. He has accordingly collected data on the sales (Y) and the number of building permits issued (X) in the past 10 years. The results are as follows: $\sum X = 200$, $\sum Y = 2200$, $\sum XY = 45800$, $\sum X^2 = 4600$ and $\sum Y^2 = 490400$. Using the appropriate regression equation, find the level of sales expected next year when 2000 building permits are to be issued.

3. To the Internal Revenue Service, the reasonableness of total itemized deductions depends on the taxpayer's adjusted gross income. Large deductions, which include charity and medical deductions, are more reasonable for taxpayers with large adjusted gross incomes. If a taxpayer claims larger than average itemized deductions for a given level of income, the chances of an IRS audit are increased. Data (in $1000s) on adjusted gross income and the average or reasonable amount of itemized deductions follow.

Adjusted gross income ($1000s): 22 27 32 48 65 85 120
Total itemized deductions ($1000s): 9.6 9.6 10.1 11.1 13.5 17.7 25.5

Use the least squares method to develop the estimated regression equation. Estimate a reasonable level of total itemized deductions for a taxpayer with an adjusted gross income of $52,000. If this taxpayer has claimed total itemized deductions of $20,00, would the IRS agent's request for an audit appear justified? Explain.

4. In a laboratory experiment on a correlation research study, the equations of the two regression lines were found to be 2X - Y + 1 = 0 and 3X - 2Y + 7 = 0. Find the means of X and Y. Also work out the values of the regression coefficients and the coefficient of correlation between the two variables X and Y. Given that the variance of X is 9, find the standard deviation of Y.

5. The two lines of regression based on 100 observations were 20X - 9Y - 106 = 0 and 4X - 5Y + 30 = 0. Determine the coefficient of correlation, and calculate the variance of Y if the variance of X is 9.
