
Jurong Junior College 2011 JC2 Mathematics H2 (9740)
Tutorial 20: Correlation and Regression (Solutions)

Basic Category

1 In a chemical reaction, it is known that the amount, w grams, of a certain compound produced has an approximate linear relationship with the temperature, t °C. Eight trial runs of this reaction are performed, two at each of four different temperatures. The observed values of w are subject to error. The results are shown in the table below.

t   10  10  15  15  20  20  25  25
w   10  12  15  12  18  16  16  20

(i) State, giving a reason, which of the least squares regression lines, t on w or w on t, should be used to express a possible linear relation between w and t.
(ii) Calculate the equation of the line chosen in (i).
(iii) State how you could assess how well the data fit the calculated equation.
[Ans: (ii) w = 0.49t + 6.3]

Solutions **Refer to notes Example 3**
1 (i) The least squares regression line of w on t should be used because T is the controlled variable while the observed values of the random variable W are subject to error.
(ii) Using a GC, w = 0.49t + 6.3.
(iii) We can draw a scatter diagram with the regression line superimposed on it.
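The GC result in (ii) can be cross-checked by computing the least squares line of w on t directly. This is a minimal sketch, assuming the eight readings pair up as two values of w at each of t = 10, 15, 20, 25; the helper name `fit_line` is ours and is not part of the tutorial.

```python
def fit_line(xs, ys):
    """Return (gradient, intercept) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    return gradient, y_bar - gradient * x_bar


# Assumed pairing: two observed w values at each controlled temperature t.
t = [10, 10, 15, 15, 20, 20, 25, 25]
w = [10, 12, 15, 12, 18, 16, 16, 20]

gradient, intercept = fit_line(t, w)
print(f"w = {gradient:.2f}t + {intercept:.1f}")
```

The printed line should agree with the answer quoted in (ii).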

2 Iron rusts when it is exposed to air, so a chemical is applied to prevent rust on iron. An experiment is conducted in which different quantities, x ml, of the chemical are applied to identical samples of iron. The times, t hours, for the iron to start to rust are measured. The results are given in the table.

x   2.3  3.1  3.8  4.9  5.9  6.7   8.0
t   3.3  5.6  6.9  8.4  8.7  10.1  11.0

(i) Draw a scatter diagram for the data. Identify the point P for which the value of t appears to be incorrect.
(ii) Omitting P, explain why the scatter diagram for the remaining points may be consistent with a model of the form t = a + b ln x. Hence calculate the least squares estimates of a and b.
[Ans: (i) (5.9, 8.7), (ii) a = -1.45, b = 6.09] [N08 H2 P2 Q8 modified]

Solutions
2 (i) [Scatter diagram: t (hours) against x (ml)]
P = (5.9, 8.7)
(ii) The remaining points follow a shape similar to the graph of t = a + b ln x: t increases with x, but at a decreasing rate. Using a GC, the least squares estimates of a and b are -1.45 and 6.09 respectively.
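The fit in (ii) amounts to an ordinary linear regression of t on u = ln x once P is removed. The sketch below assumes the table pairs each x with the t value listed directly after it; the variable names are illustrative only.

```python
import math

# Data from the table with the suspect point P = (5.9, 8.7) omitted.
x = [2.3, 3.1, 3.8, 4.9, 6.7, 8.0]
t = [3.3, 5.6, 6.9, 8.4, 10.1, 11.0]

# Linearise the model t = a + b ln x by regressing t on u = ln x.
u = [math.log(value) for value in x]
n = len(u)
u_bar = sum(u) / n
t_bar = sum(t) / n
s_ut = sum((ui - u_bar) * (ti - t_bar) for ui, ti in zip(u, t))
s_uu = sum((ui - u_bar) ** 2 for ui in u)

b = s_ut / s_uu
a = t_bar - b * u_bar
print(f"a = {a:.2f}, b = {b:.2f}")
```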

3 (a) Explain why it is advisable to plot a scatter diagram before interpreting a correlation coefficient calculated for a sample drawn from bivariate data.
(b) What is meant by saying that the product moment correlation coefficient is independent of the scale of measurement?
(c) An engineering company makes cranes. The number, x, sold in each three-month period for two years, together with the profits, y thousand dollars, on the sale of these cranes are given in the following table.

x   15   17   13   21   16   22   14   18
y   290  350  270  430  340  410  300  360

Draw a scatter diagram for the data. Calculate the product moment correlation coefficient, and comment on its value in relation to your scatter diagram.
[N08 H1 Q11] [Ans: r = 0.969]

Solutions
3 (a)

The scatter diagram gives an overview of how the variables are related. It enables a suitable correlation coefficient to be calculated to measure the strength of the association between the variables.
(b) The product moment correlation coefficient, r, is a measure of the strength and direction of the linear relationship between two variables. It is unchanged by a change of origin or scale.
(c) [Scatter diagram: y (profit in thousands of dollars) against x (number of cranes sold)]
r = 0.969 (3 s.f.) The r value is close to +1, which suggests a strong positive linear relationship between x and y, as illustrated by the points lying close to a positively sloped straight line in the scatter diagram.
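Parts (b) and (c) can be illustrated together: the sketch below computes r for the crane data and then recomputes it after an arbitrary change of origin for x and a change of units for y (thousands of dollars to dollars). The function name `pmcc` and the particular shift and scale are our own choices for illustration.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [15, 17, 13, 21, 16, 22, 14, 18]
y = [290, 350, 270, 430, 340, 410, 300, 360]

r = pmcc(x, y)
# Shift the origin of x by 10 and express y in dollars instead of thousands.
r_rescaled = pmcc([xi - 10 for xi in x], [1000 * yi for yi in y])
print(round(r, 3), round(r_rescaled, 3))
```

Both values should coincide, which is what "independent of the scale of measurement" means in (b).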

4 A physician is interested in determining whether there is a linear relationship between the number of years an emphysema patient smoked (X) and the physician's subjective evaluation of the extent of lung damage (Y) of the patient. The latter variable is measured on a scale of 0 to 100. Measurements taken on 10 patients are summarised as follows:
$\sum x = 319$, $\sum x^2 = 11053$, $\sum y = 530$, $\sum y^2 = 30600$, $\sum xy = 18055$
(i) Calculate the equation of the estimated regression line of y on x.
(ii) It is given that the extent of lung damage of an emphysema patient is 45. Use the equation found in (i) to estimate the number of years, to the nearest integer, the patient smoked.
(iii) Calculate the linear product moment correlation coefficient of the data. Is there strong evidence of a linear relationship?
(iv) Comment on the reliability of the value obtained in (ii).
[Ans: (i) y = 1.31x + 11.2 (ii) 26 (iii) 0.774]


Solutions
4 (i) $\bar{x} = \frac{\sum x}{n} = \frac{319}{10} = 31.9$, $\bar{y} = \frac{\sum y}{n} = \frac{530}{10} = 53$

$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{18055 - \frac{(319)(530)}{10}}{11053 - \frac{(319)^2}{10}} = \frac{1148}{876.9} = 1.309$

The equation of the estimated regression line of y on x is $y - \bar{y} = b(x - \bar{x})$, i.e. y = 1.31x + 11.2.

(ii) When y = 45, 45 = 1.309x + 11.2429, so x ≈ 26 (nearest integer).

(iii) $r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} = \frac{1148}{\sqrt{(876.9)\left(30600 - \frac{(530)^2}{10}\right)}} = 0.774$

This value does not provide strong evidence of a linear relationship.

(iv) Since X is the variable of interest and r is not close to 1 or -1, it is not appropriate to use the regression line of y on x to find the value of x. Therefore the estimated value of x found in (ii) is not reliable.
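The arithmetic in (i) to (iii) uses only the five summary totals, so it can be reproduced without the raw data. A minimal sketch follows; the variable names are ours, and the estimate in (ii) is computed with unrounded coefficients, which may differ slightly from the hand calculation in later decimal places.

```python
import math

n = 10
sum_x, sum_x2 = 319, 11053
sum_y, sum_y2 = 530, 30600
sum_xy = 18055

# Corrected sums of products and squares.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

b = s_xy / s_xx                      # gradient of y on x
a = sum_y / n - b * sum_x / n        # intercept, since the line passes through the means
r = s_xy / math.sqrt(s_xx * s_yy)    # product moment correlation coefficient

# Part (ii): invert the y-on-x line at y = 45.
x_at_45 = (45 - a) / b
print(f"y = {b:.3f}x + {a:.2f}, r = {r:.3f}, x = {round(x_at_45)}")
```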

5 The effect of a fertilizer on a particular plant is being tested under controlled conditions. The table below shows the exact amount of fertilizer applied, x kg, to nine different plots, each of area 1 m², and the eventual yields from these plots, y kg. It also gives the values of lg x.

Fertilizer applied (x kg)   25    50    75    100   125   150   187.5  225   275
Yield (y kg)                2.03  2.42  2.69  2.95  3.04  3.18  3.30   3.41  3.20
lg x                        1.40  1.70  1.88  2.00  2.10  2.18  2.27   2.35  2.44

(a) Plot the values of y against x on a graph and state why it would not be sensible to fit a regression line for y on x from these data. Find the product moment correlation coefficient between y and lg x, giving your answer correct to four decimal places. Does it suggest a linear relationship between y and lg x?
(b) Calculate the least squares regression line of y on lg x, giving the gradient of the regression line to four decimal places.
(c) Explain why we should use the equation in (b) to estimate how much fertilizer would be needed to obtain a yield of 3 kg.
[Ans: (a) 0.9712 (b) y = 1.3227(lg x) + 0.2209]
Solutions
5 (a) [Scatter diagram: y (kg) against x (kg)]
The graph of y against x shows a curvilinear rather than a linear relationship. Therefore it is not suitable to fit a regression line for y on x.
Using a GC, the product moment correlation coefficient between y and lg x is r = 0.9712 (4 d.p.). Since r is close to 1, this suggests a strong positive linear correlation between y and lg x.
(b) Using a GC, the equation of the least squares regression line of y on lg x is y = 1.3227(lg x) + 0.2209.
(c) The amount of fertilizer is the controlled variable while the yield is the random variable. Therefore we cannot use the regression line of lg x on y to estimate the value of x when y = 3. We should use the equation in (b).
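The solution to (c) stops at the choice of line. As a hedged illustration of the estimate itself, the sketch below inverts the line of y on lg x at y = 3, assuming the coefficients quoted in (b) and taking lg to mean the base-10 logarithm. The numerical result is not given in the tutorial and is only as accurate as those rounded coefficients.

```python
# Coefficients of the regression line of y on lg x, as quoted in (b).
GRADIENT = 1.3227
INTERCEPT = 0.2209

target_yield = 3.0  # kg

# Solve 3 = 1.3227 (lg x) + 0.2209 for lg x, then undo the base-10 logarithm.
lg_x = (target_yield - INTERCEPT) / GRADIENT
x_estimate = 10 ** lg_x
print(f"lg x = {lg_x:.3f}, x = {x_estimate:.0f} kg")
```

Because y = 3 lies within the observed yields, this is an interpolation rather than an extrapolation.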

Standard Category

6 It is believed that the probability p of a randomly chosen pregnant woman giving birth to a Down's Syndrome child is related to the woman's age x, in years. The table gives the observed values of p for 5 different values of x.

x   25       30       35       40       45
p   0.00067  0.00125  0.00333  0.01000  0.03330

(i) Draw a scatter diagram for the data.
(ii) State, with a reason, which of the following would be an appropriate model to represent the above data.
A: $p = a + \frac{b}{x}$, B: $p = a + be^{-x}$, C: $\ln p = a + bx$,
where a and b are constants and b > 0.
(iii) For the appropriate model, calculate the values of a and b, and find the product moment correlation coefficient.
(iv) Obtain an estimate of the probability of a 19-year-old woman giving birth to a Down's Syndrome child. Comment on the reliability of your answer.
[Ans: (ii) C. p increases as x increases. (iii) a = -12.5, b = 0.198, r = 0.994 (iv) p = 0.000166]

Solutions
6 (i)
[Scatter diagram: p (probability) against x (age in years)]

(ii) Model C. Possible reasons:
i. The points follow the shape of an exponential graph.
ii. p increases as x increases (the other two models have p decreasing as x increases).

(iii) Using a GC, ln p = -12.465 + 0.19783x
a = -12.465 ≈ -12.5
b = 0.19783 ≈ 0.198
r = 0.99373 ≈ 0.994

(iv) ln p = -12.465 + 0.19783(19) = -8.7062
p = 0.00016555 ≈ 0.000166 (3 s.f.)

The estimate is an extrapolation, since x = 19 lies outside the data range 25 ≤ x ≤ 45. There may not be a linear relationship between ln p and x for x < 25. Therefore the answer obtained is not reliable.
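The values in (iii) and (iv) come from regressing v = ln p on x. The following sketch redoes that calculation from the table; it is an illustration with our own variable names, not the GC procedure the solution refers to.

```python
import math

x = [25, 30, 35, 40, 45]
p = [0.00067, 0.00125, 0.00333, 0.01000, 0.03330]

# Model C: regress v = ln p on x.
v = [math.log(pi) for pi in p]
n = len(x)
x_bar = sum(x) / n
v_bar = sum(v) / n
s_xv = sum((xi - x_bar) * (vi - v_bar) for xi, vi in zip(x, v))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_vv = sum((vi - v_bar) ** 2 for vi in v)

b = s_xv / s_xx
a = v_bar - b * x_bar
r = s_xv / math.sqrt(s_xx * s_vv)

# Extrapolated estimate at x = 19, below the observed range 25 to 45.
p_19 = math.exp(a + b * 19)
print(f"ln p = {a:.3f} + {b:.5f}x, r = {r:.3f}, p(19) = {p_19:.3g}")
```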

7 In a chemical reaction, a compound is predicted to speed up the rate of reaction between two substances. Twelve trials of this reaction are performed in Laboratory Alpha. The amounts, x grams, of the compound used in the reaction and the resulting times, y minutes, taken to complete the reaction are recorded. The results are given in the following table.

x   0.2   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5
y   6.04  2.95  2.11  1.66  1.52  1.38  1.33  1.25  1.24  1.20  1.19  1.10

(i) Draw a scatter diagram for the data.
(ii) It is suggested that x and y are related by the equation $y = \frac{k}{x} + c$, where k and c are constants. By finding the appropriate regression line, estimate the values of k and c.
(iii) Estimate the time taken to complete the reaction when 10 grams of the compound is used. Comment on the reliability of the result.
(iv) A chemist in Laboratory Beta claims that the results of the trials are faked. By calculating the linear product moment correlation coefficient between y and $\frac{1}{x}$, suggest a reason for the chemist's claim.
(v) Laboratory Alpha counters the claim by saying that the results would be different if x is measured in kg and y in hours. State, with a reason, whether you agree with this comment.
[Ans: (ii) k ≈ 1.01, c ≈ 0.984 (iii) 1.09 (iv) 0.999]

Solutions
7 (i) [Scatter diagram: y (minutes) against x (grams)]

(ii) Let $w = \frac{1}{x}$. Then y = kw + c.
Using a GC, y = 1.01w + 0.984, so k ≈ 1.01 and c ≈ 0.984.

(iii) When x = 10, $y \approx \frac{1.01}{10} + 0.98388 = 1.08488 \approx 1.09$.
When 10 grams of the compound is used, it takes about 1.09 minutes to complete the reaction. This result is not reliable as the value x = 10 falls outside the data range [0.2, 5.5].

(iv) Using a GC, r = 0.999. The results of the trials might be faked as the data fit the regression curve too well.

(v) I disagree with the comment given by Laboratory Alpha because r is independent of the units of x and y.
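Parts (ii) to (iv) all follow from a single regression of y on w = 1/x. The sketch below carries it out on the tabulated data; the variable names are illustrative and the calculation stands in for the GC steps, which the tutorial does not spell out.

```python
import math

x = [0.2, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
y = [6.04, 2.95, 2.11, 1.66, 1.52, 1.38, 1.33, 1.25, 1.24, 1.20, 1.19, 1.10]

# Substituting w = 1/x turns y = k/x + c into the straight line y = kw + c.
w = [1 / xi for xi in x]
n = len(w)
w_bar = sum(w) / n
y_bar = sum(y) / n
s_wy = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(w, y))
s_ww = sum((wi - w_bar) ** 2 for wi in w)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

k = s_wy / s_ww
c = y_bar - k * w_bar
r = s_wy / math.sqrt(s_ww * s_yy)

# Part (iii): extrapolated reaction time at x = 10 grams.
y_10 = k / 10 + c
print(f"k = {k:.3f}, c = {c:.3f}, r = {r:.3f}, y(10) = {y_10:.2f}")
```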


8 An athlete's best times for various distances are shown in the following set of data.

Distance (x metres)     100   200   400   800    1500   10 000
Best time (t seconds)   11.2  21.8  51.5  110.3  220.3  1775

It is suspected that x and t are related according to the formula $t = ax^b$, where a and b are constants.
(i) Show that the relation between lg t and lg x is linear. State which of the least squares regression lines, lg t on lg x or lg x on lg t, should be used. By considering the chosen regression line, find the estimated values of a and b.
(ii) Find the product moment correlation coefficient between lg t and lg x, giving your answer to four decimal places. Comment on this value.
(iii) Estimate the athlete's time for a 42.2 km marathon and comment on the reliability of your answer.
[Ans: (i) a = 0.0657, b = 1.11, (ii) 0.9998, (iii) 8830]

Solutions
8 (i) $t = ax^b$
$\lg t = \lg(ax^b)$
$\lg t = \lg a + b \lg x$
Hence the relation between lg x and lg t is linear, and the regression line of lg t on lg x should be used.
Using a GC, the regression line of lg t on lg x is lg t = 1.10879 lg x - 1.18259.
b = 1.10879 = 1.11 (3 s.f.)
lg a = -1.1826, so $a = 10^{-1.1826} = 0.0657$ (3 s.f.)

(ii) Using a GC, the product moment correlation coefficient between lg t and lg x is 0.9998 (to 4 d.p.).
There is a strong positive linear correlation between lg t and lg x, which means that $t = ax^b$ is an appropriate model that provides a good fit to the data points.

(iii) When x = 42 200, lg t = 1.10879 lg(42200) - 1.18259, so t = 8828.97 s ≈ 8830 s.
The answer is not reliable as the value x = 42 200 is outside the data range for x, where the linear relation between lg x and lg t may no longer hold.
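The power-law fit can be reproduced by regressing lg t on lg x and converting the intercept back to a. The sketch below does this with base-10 logarithms; the names are our own, and the marathon estimate uses unrounded coefficients, so it may differ from the hand-calculated figure by a few seconds.

```python
import math

x = [100, 200, 400, 800, 1500, 10000]
t = [11.2, 21.8, 51.5, 110.3, 220.3, 1775]

# Taking lg of t = a x^b gives lg t = lg a + b lg x, a straight line in (lg x, lg t).
u = [math.log10(xi) for xi in x]
v = [math.log10(ti) for ti in t]
n = len(u)
u_bar = sum(u) / n
v_bar = sum(v) / n
s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
s_uu = sum((ui - u_bar) ** 2 for ui in u)
s_vv = sum((vi - v_bar) ** 2 for vi in v)

b = s_uv / s_uu
lg_a = v_bar - b * u_bar
a = 10 ** lg_a
r = s_uv / math.sqrt(s_uu * s_vv)

# Part (iii): extrapolated time for 42 200 m, far outside the data range.
t_marathon = a * 42200 ** b
print(f"a = {a:.4f}, b = {b:.3f}, r = {r:.4f}, t = {t_marathon:.0f} s")
```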


9 A random sample of eight pairs of values of x and y is used to obtain the following equations of the regression lines of y on x and x on y respectively.
$y = -\frac{7}{10}x + \frac{151}{10}$, $x = -\frac{7}{6}y + 20$
Seven of the pairs of data are given in the table.

x   10  11  12  11  17  14  19
y   9   8   7   6   5   4   1

Find the eighth pair of values of x and y.
Determine the value of the product moment correlation coefficient, and say what it leads you to expect about the scatter diagram for this sample.
Let Y be the value obtained by substituting a sample value of x into the equation of the regression line of y on x. Evaluate Y for each of the eight values of x and verify that $\sum (y - Y)^2 = 8.8$.
For each of the eight sample values of x, Y' is given by Y' = a + bx, where a and b are any constants. What can you say about the value of $\sum (y - Y')^2$?

[Ans: 10, 8; -0.904]


Solutions
9 $y = -\frac{7}{10}x + \frac{151}{10}$ ----------(1)
$x = -\frac{7}{6}y + 20$ ----------(2)
Solving (1) and (2), we have x = 13 and y = 6. Since both regression lines pass through $(\bar{x}, \bar{y})$, $\bar{x} = 13$ and $\bar{y} = 6$.
$\frac{10 + 11 + 12 + 11 + 17 + 14 + 19 + x_8}{8} = 13 \Rightarrow x_8 = 10$
$\frac{9 + 8 + 7 + 6 + 5 + 4 + 1 + y_8}{8} = 6 \Rightarrow y_8 = 8$
The required values of x and y are 10 and 8 respectively.

Using a GC, r = -0.904.
Since r ≈ -1, we expect the points in the scatter diagram for this sample to be close to a straight line with negative gradient.

x   10   11   12   11   17   14   19   10
y   9    8    7    6    5    4    1    8
Y   8.1  7.4  6.7  7.4  3.2  5.3  1.8  8.1

$\sum (y - Y)^2 = (0.9)^2 + (0.6)^2 + (0.3)^2 + (-1.4)^2 + (1.8)^2 + (-1.3)^2 + (-0.8)^2 + (-0.1)^2 = 8.8$

Since $\sum (y - Y)^2$ is the least squares error, $\sum (y - Y')^2 \geq 8.8$.
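The steps above can be checked numerically. The sketch below recovers the eighth pair from the means, obtains r from the two regression gradients (using the standard identity that r² equals their product, which is a different route from the GC calculation in the solution), and compares the residual sum of squares of the regression line with that of one arbitrarily chosen alternative line. The helper name `sum_sq_residuals` and the alternative line are our own.

```python
import math

x7 = [10, 11, 12, 11, 17, 14, 19]
y7 = [9, 8, 7, 6, 5, 4, 1]

# Both regression lines pass through the point of means (13, 6).
n, x_bar, y_bar = 8, 13, 6
x8 = n * x_bar - sum(x7)
y8 = n * y_bar - sum(y7)

xs = x7 + [x8]
ys = y7 + [y8]

# r^2 is the product of the two regression gradients; r takes their common sign.
r = -math.sqrt((7 / 10) * (7 / 6))


def sum_sq_residuals(a, b):
    """Sum of (y - (a + b x))^2 over the eight sample points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


sse_regression = sum_sq_residuals(15.1, -0.7)  # the line of y on x
sse_other = sum_sq_residuals(15.0, -0.7)       # an arbitrary alternative line
print(x8, y8, round(r, 3), round(sse_regression, 2), round(sse_other, 2))
```

Any choice of a and b other than the regression coefficients should give a sum at least as large as `sse_regression`.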

10 A study comparing the amount of advertising time on TV per week for a product and the number of sales per week for the same product was conducted. The results over eight weeks are given below:

Advertising time (minutes), x   10   12   15  14   17   16   22   20
Sales (thousands), y            2.3  2.8  k   3.1  3.2  2.9  5.0  4.0

(i) Find the coordinates of the point through which the regression line of y on x and that of x on y both pass. Give your answer in terms of k.
(ii) Given that the regression line of y on x is y = 0.197x + 0.184, find k. Hence find the linear product moment correlation coefficient r between advertising time and sales per week.
(iii) Draw a scatter diagram of y against x.
(iv) State, with a reason, the effect on r if the advertising time was in hours instead of minutes.
(v) Give two reasons why it is reasonable to use the regression line of y on x to estimate the value of x when y = 3.4.
[Ans: (i) $\bar{x} = 15.75$, $\bar{y} = \frac{23.3 + k}{8}$ (ii) k = 2.99, r = 0.929]

Solutions
10 (i) $\bar{x} = 15.75$, $\bar{y} = \frac{23.3 + k}{8}$

(ii) The regression line y = 0.197x + 0.184 passes through $(\bar{x}, \bar{y})$, so
$\frac{23.3 + k}{8} = 0.197(15.75) + 0.184$
k = 2.994
r = 0.92856


(iii) [Scatter diagram of y against x]

(iv) r remains unchanged because r is a measure of the degree of scatter about a straight line, and this is unchanged by a change of scale.

(v) It is reasonable because y = 3.4 lies within the range of the given data and, since r = 0.92856 ≈ 1, the regression line of y on x is almost identical to that of x on y.
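The value of k and the resulting r can be reproduced as follows. This sketch assumes that the missing sales figure in the table is the one paired with x = 15, and it uses the rounded regression coefficients quoted in the question; the variable names are ours.

```python
import math

x = [10, 12, 15, 14, 17, 16, 22, 20]
known_y_sum = 2.3 + 2.8 + 3.1 + 3.2 + 2.9 + 5.0 + 4.0  # the seven known sales values
n = len(x)

# The line y = 0.197x + 0.184 passes through (x_bar, y_bar), which fixes k.
x_bar = sum(x) / n
y_bar = 0.197 * x_bar + 0.184
k = n * y_bar - known_y_sum

# Assumed position of k: the week with x = 15 minutes of advertising.
y = [2.3, 2.8, k, 3.1, 3.2, 2.9, 5.0, 4.0]
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"x_bar = {x_bar}, k = {k:.3f}, r = {r:.3f}")
```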

11 The table gives the world record time, in seconds above 3 minutes 30 seconds, for running 1 mile as at 1st January in various years.

Year, x   1930  1940  1950  1960  1970  1980  1990  2000
Time, t   40.4  36.4  31.3  24.5  21.1  19.0  16.3  13.1

(i) Draw a scatter diagram to illustrate the data.
(ii) Comment on whether a linear model would be appropriate, referring both to the scatter diagram and the context of the question.
(iii) Explain why in this context a quadratic model would probably not be appropriate for long-term predictions.
(iv) Fit a model of the form ln t = a + bx to the data, and use it to predict the world record time as at 1st January 2010. Comment on the reliability of your prediction.
[Ans: (iv) ln t = 34.853 - 0.016128x, 3 minutes 41.4 seconds] [N09 H2 P2 Q6]


Solutions
11 (i) [Scatter diagram: t (seconds) against x (year)]

(ii) A linear model is not appropriate as the points in the scatter diagram show a curvilinear relationship. A linear model is also not a good one in the long run as it suggests that the world record time will eventually become negative.

(iii) A quadratic model is not suitable for long-term predictions as there will be a point in time after which the value of t increases as x increases. This is not appropriate since a world record time cannot be higher than an earlier record.

(iv) Using a GC, the equation of the least squares model is ln t = 34.853 - 0.016128x.
When x = 2010, ln t = 34.853 - 0.016128(2010) = 2.43572, so $t = e^{2.43572} = 11.4$ (3 s.f.).
Therefore the world record time as at 1st January 2010 is predicted to be 3 minutes 41.4 seconds. The estimate is not reliable as x = 2010 is outside the range of values of x given by the data.
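The fit in (iv) is a linear regression of ln t on x, followed by exponentiating the prediction. The sketch below repeats it from the table; it is an illustration with our own variable names and uses unrounded coefficients, so the prediction may differ from the hand calculation in the second decimal place.

```python
import math

x = [1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
t = [40.4, 36.4, 31.3, 24.5, 21.1, 19.0, 16.3, 13.1]  # seconds above 3 min 30 s

# Regress v = ln t on x.
v = [math.log(ti) for ti in t]
n = len(x)
x_bar = sum(x) / n
v_bar = sum(v) / n
b = sum((xi - x_bar) * (vi - v_bar) for xi, vi in zip(x, v)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = v_bar - b * x_bar

# Extrapolated prediction for 1st January 2010.
t_2010 = math.exp(a + b * 2010)
print(f"ln t = {a:.3f} + ({b:.6f})x, predicted record = 3 min {30 + t_2010:.1f} s")
```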
