Chapter 4
4.1 (a) Yes, the scatterplot below (left) shows a linear relationship between the cube root of
weight, $\sqrt[3]{\text{weight}}$, and length.
[Figure: scatterplot of cube root of weight versus length (cm) (left); residual plot versus length (cm) (right)]
(b) Let x = length and y = $\sqrt[3]{\text{weight}}$. The least-squares regression line is
$\hat{y} = -0.0220 + 0.2466x$. The intercept of $-0.0220$ has no practical interpretation in this
situation, since weight and the cube root of weight must be positive. The slope 0.2466 indicates
that for every 1 cm increase in length, the cube root of weight will increase, on average, by
0.2466. (c) $\sqrt[3]{\widehat{\text{weight}}} = -0.0220 + 0.2466(36) \approx 8.8556$, so the
predicted weight is $8.8556^3 \approx 694.5$ g. The predicted weight with this model is slightly
higher than the predicted weight of 689.9 g with the model in Example 4.2. (d) The residual plot
above (right) shows the residuals are negative for lengths below 17 cm, positive for lengths
between 18 cm and 27 cm, and have no clear pattern for lengths above 28 cm. (e) Nearly all
(99.88%) of the variation in the cube root of the weight can be explained by the linear
relationship with the length.
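The back-transformation in part (c) can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the negative intercept is inferred from the arithmetic in part (c), where 0.2466(36) minus 0.0220 gives 8.8556.

```python
# Fitted line on the transformed scale (coefficients from part (b);
# the negative intercept is inferred from the arithmetic in part (c)).
INTERCEPT = -0.0220
SLOPE = 0.2466


def predict_weight(length_cm: float) -> float:
    """Predict weight in grams from length in cm via the cube-root model."""
    cube_root = INTERCEPT + SLOPE * length_cm  # prediction of weight ** (1/3)
    return cube_root ** 3                      # undo the cube-root transformation


predicted = predict_weight(36)
print(round(predicted, 1))
```

The key step is cubing the fitted value: the regression predicts the cube root of weight, not the weight itself.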
4.2 (a) The scatterplot below (left) shows a positive association between length and period, with
one very unusual point (106.5, 2.115) in the top right corner.
[Figure: scatterplot of period (s) versus length (cm) (left); residual plot versus length (cm) (right)]
(b) The residual plot above (right) shows that the residuals tend to be small or negative for small
lengths and then get larger for lengths between 40 and 50 cm. The residual for the one very large
length is negative again. Even though the value of $r^2$ is 0.983, the residual plot suggests that a
model with some curvature (or a linear model after a transformation) might be better. (c) The
information from the physics student suggests that there should be a linear relationship between
period and $\sqrt{\text{length}}$. (d) A scatterplot (left) and residual plot (right) are shown
below for the transformed data. The least-squares regression line for the transformed data is
$\hat{y} = -0.0858 + 0.210\sqrt{\text{length}}$. The value of $r^2$ is slightly higher, 0.986 versus
0.983, and the residual plot looks better, although the residuals for the three smallest lengths
are positive and the residuals for the next six lengths are negative.
[Figure: scatterplot of period versus square root of length (left); residual plot versus square root of length (right)]
(e) According to the theoretical relationship, the slope in the model for (d) should be
$2\pi/\sqrt{980} \approx 0.2007$. The estimated model appears to agree with the theoretical
relationship because the estimated slope is 0.210, an absolute difference of about 0.0093. (f) The
predicted period of an 80-centimeter pendulum is
$\hat{y} = -0.0858 + 0.210\sqrt{80} \approx 1.7925$ seconds.
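A short Python check of parts (e) and (f) is sketched here. It assumes g = 980 cm/s^2, as in the solution, and a negative intercept of 0.0858, which is inferred from the arithmetic in part (f). The function name is illustrative.

```python
import math

G = 980.0  # acceleration due to gravity in cm/s^2, as used in the solution

# Theory: period = (2*pi / sqrt(g)) * sqrt(length), so the slope on sqrt(length) is:
theoretical_slope = 2 * math.pi / math.sqrt(G)

# Fitted model from part (d); the negative intercept is inferred from part (f).
INTERCEPT, SLOPE = -0.0858, 0.210


def predict_period(length_cm: float) -> float:
    """Predicted period in seconds for a pendulum of the given length."""
    return INTERCEPT + SLOPE * math.sqrt(length_cm)


print(round(theoretical_slope, 4), round(predict_period(80), 4))
```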

4.3 (a) A scatterplot is shown below (left). The relationship is strong, negative and slightly
nonlinear (or curved), with no outliers.
[Figure: scatterplot of pressure (atmospheres) versus volume (cubic cm) (left); scatterplot of pressure (atmospheres) versus 1/Volume (right)]
(b) Yes, the scatterplot for the transformed data (above on the right) shows a clear linear
relationship. (c) The least-squares regression equation is $\hat{P} = 0.3677 + 15.8994(1/V)$. The
square of the correlation coefficient, $r^2 = 0.9958$, indicates almost a perfect fit. The residual
plot (below) shows a definite pattern, which should be of some concern, but the model still
provides a good fit.

More about Relationships between Two Variables

[Figure: residual plot versus 1/Volume]
(d) Letting y = 1/P, the least-squares regression line is $\hat{y} = 0.1002 + 0.0398V$. The
scatterplot (below on the left), the value of $r^2 = 0.9997$, and the residual plot (below on the
right) indicate that the linear model provides an excellent fit for the transformed data. This
transformation also achieves linearity because $V = k/P$.
[Figure: scatterplot of 1/Pressure versus volume (cubic cm) (left); residual plot versus volume (cubic cm) (right)]
(e) When the gas volume is 15 cm$^3$, the model in part (c) predicts the pressure to be
$\hat{P} = 0.3677 + 15.8994(1/15) \approx 1.4277$ atmospheres, and the model in part (d) predicts
the reciprocal of pressure to be $0.1002 + 0.0398(15) = 0.6972$, or
$\hat{P} = 1/0.6972 \approx 1.4343$ atmospheres. The predictions are the same to the nearest
one-hundredth of an atmosphere.
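The comparison in part (e) can be reproduced with a small sketch. The two function names are hypothetical, and the coefficients are the rounded values quoted in parts (c) and (d).

```python
def pressure_model_c(volume: float) -> float:
    """Part (c): pressure regressed on 1/volume."""
    return 0.3677 + 15.8994 * (1 / volume)


def pressure_model_d(volume: float) -> float:
    """Part (d): 1/pressure regressed on volume, then inverted."""
    reciprocal = 0.1002 + 0.0398 * volume
    return 1 / reciprocal


p_c = pressure_model_c(15)
p_d = pressure_model_d(15)
print(round(p_c, 4), round(p_d, 4), round(abs(p_c - p_d), 4))
```

Note that the second model predicts 1/P, so its output must be inverted before the two predictions are compared.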

4.4 (a) The scatterplot below (left) shows that the relationship between $\text{period}^2$ and
length is roughly linear.
[Figure: scatterplot of period squared versus length (cm) (left); residual plot versus length (cm) (right)]
(b) The least-squares regression line for the transformed data y = $\text{period}^2$ and x = length
is $\hat{y} = -0.1547 + 0.0428x$. The value of $r^2 = 0.992$ and the residual plot above (right)
indicate that the linear model provides a good fit for the transformed data. As we noticed in
Exercise 4.2 part (d), the residual plot looks better, but there is still a pattern with the
residuals for the three smallest lengths being positive and the residuals for the next six lengths
being negative. (c) According to the theoretical relationship, the slope in the model should be
$4\pi^2/980 \approx 0.0403$. The estimated model appears to agree with the theoretical relationship
because the estimated slope is 0.0428, an absolute difference of about 0.0025. (d) The predicted
value of $\text{period}^2$ for an 80-centimeter pendulum is
$\hat{y} = -0.1547 + 0.0428(80) \approx 3.2693$, or a period of 1.8081 seconds. The two models
provide very similar predicted values, with an absolute difference of only 0.0156.
4.5 (a) A scatterplot is shown below (left). The relationship is strong, negative and nonlinear (or
curved).
[Figure: scatterplot of light intensity (lumens) versus depth (meters) (left); scatterplot of ln(light intensity) versus depth (meters) (right)]
(b) The ratios (120.42/168, 86.31/120.42, 61.87/86.31, 44.34/61.87, 31.78/44.34, and
22.78/31.78) are all 0.717. Since the ratios are all the same, the exponential model is
appropriate. (c) Yes, the scatterplot (above on the right) shows that the transformation achieves
linearity. (d) If x = Depth and y = ln(Light Intensity), then the least-squares regression line is
$\hat{y} = 6.7891 - 0.3330x$. The intercept 6.7891 provides an estimate for the average value of the
natural log of the light intensity at the surface of the lake. The slope, $-0.3330$, indicates that
the natural log of the light intensity decreases on average by 0.3330 for each one meter increase
in depth. (e) The residual plot below (left) shows that the linear model on the transformed data is
appropriate. (Some students may suggest that there is one unusually large residual, but they need
to look carefully at the scale on the y-axis. All of the residuals are extremely small.) (f) If x =
Depth and y = Light Intensity, then the model after the inverse transformation is
$\hat{y} = e^{6.7891}e^{-0.333x}$ or $\hat{y} = 888.1139 \times 0.7168^x$. The scatterplot below
(right) shows that the exponential model is excellent for these data.

[Figure: residual plot versus depth (meters) (left); scatterplot of light intensity (lumens) versus depth (meters) with the exponential model (right)]
(g) At 22 m, the predicted light intensity is $\hat{y} = 888.1139e^{-0.333(22)} \approx 0.5846$
lumens. No, the absolute difference between the observed light intensity 0.58 and the predicted
light intensity 0.5846 is very small (0.0046 lumens) because the model provides an excellent fit.
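The inverse transformation in parts (f) and (g) can be sketched as follows. The variable and function names are illustrative, and the minus sign on the slope is inferred from the negative association and the quoted decay factor of 0.7168.

```python
import math

# ln(intensity) = 6.7891 - 0.3330 * depth (part (d); the minus sign is
# inferred from the negative association and the decay factor 0.7168).
A, B = 6.7891, -0.3330

multiplier = math.exp(A)  # predicted intensity at depth 0
decay = math.exp(B)       # factor applied for each additional meter


def predict_intensity(depth_m: float) -> float:
    """Predicted light intensity in lumens at the given depth."""
    return multiplier * decay ** depth_m


print(round(multiplier, 2), round(decay, 4), round(predict_intensity(22), 4))
```

Exponentiating the intercept gives the multiplier, and exponentiating the slope gives the per-meter decay factor, which is close to the common ratio of 0.717 found in part (b).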
4.6 (a) A scatterplot is shown below (left).
[Figure: scatterplot of acres versus year (left); scatterplot of log(acres) versus year (right)]
(b) The ratios are 226,260/63,042 = 3.5890, 907,075/226,260 = 4.0090, and 2,826,095/907,075 =
3.1156. (c) The transformed values of y are 4.7996, 5.3546, 5.9576, and 6.4512. A scatterplot of
the logarithms against year is shown above (right). (d) Minitab output is shown below.
The regression equation is
log(Acres) = - 1095 + 0.556 year

Predictor      Coef   SE Coef       T      P
Constant   -1094.51     29.26  -37.41  0.001
year        0.55577   0.01478   37.60  0.001

S = 0.0330502   R-Sq = 99.9%   R-Sq(adj) = 99.8%


(e) If x = year and y = acres, then the model after the inverse transformation is
$\hat{y} = 10^{-1094.51} \times 10^{0.5558x}$. The coefficient of $10^{0.5558x}$ is 0.0000 (rounded
to 4 decimal places), so all of the predicted values would be 0. (Note: If properties of exponents
are not used to simplify the right-hand side, then some calculators will be able to do the
calculations without having serious overflow problems.) (f) The least-squares regression line of
log(acres) on years since 1977 is $\hat{y} = 4.2513 + 0.5558x$. (g) The residual plot below shows
no clear pattern, so the linear regression model on the transformed data is appropriate.

[Figure: residual plot versus years since 1977 (left); scatterplot of acres versus years since 1977 with the exponential model (right)]

(h) If x = years since 1977 and y = acres, then the model after the inverse transformation is
$\hat{y} = 10^{4.2513} \times 10^{0.5558x} \approx 17{,}836.1042 \times 10^{0.5558x}$. A
scatterplot with the exponential model superimposed is shown above (right). The exponential model
provides an excellent fit. (i) The predicted number of acres defoliated in 1982 (5 years since
1977) is $\hat{y} \approx 17{,}836.1042 \times 10^{0.5558(5)} = 10{,}722{,}597.42$ acres.
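The numerical problem described in part (e) can be illustrated in Python. This is a sketch using the rounded Minitab coefficients for the calendar-year fit, so its 1982 prediction only roughly matches the value in part (i), which comes from the rescaled fit; the function name is hypothetical.

```python
# Fit of log10(acres) on the calendar year, from the Minitab output.
A, B = -1094.51, 0.55577

# Evaluating the leading factor on its own underflows to 0.0 in double precision.
leading_factor = 10.0 ** A
print(leading_factor)


def predict_acres(year: int) -> float:
    """Add the exponents first, then exponentiate once."""
    return 10.0 ** (A + B * year)


print(f"{predict_acres(1982):,.0f}")
```

Keeping the calculation on the log scale until the last step avoids the underflow, which is the same idea as not simplifying the right-hand side with properties of exponents. Re-expressing x as years since 1977, as in part (f), removes the problem altogether.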
4.7 (a) If y = number of transistors and x = number of years since 1970, then
$y(1) = ab^1 = 2250$ and $y(4) = ab^4 = 9000$, so $a = 2250/\sqrt[3]{4} \approx 1417.4112$ and
$b = (9000/1417.4112)^{0.25} \approx 1.5874$. This model predicts the number of transistors in
year x after 1970 to be $\hat{y} = 1417.4112 \times 1.5874^x$. (b) Using the natural logarithm
transformation on both sides of the model in (a) produces the line
$\ln \hat{y} = 7.2566 + 0.4621x$. (c) The slope for Moore's model (0.4621) is larger than the
estimated slope in Example 4.6 (0.332), so the actual transistor counts have grown more slowly than
Moore's law suggests.
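Solving for a and b from the two conditions in part (a) can be sketched as below. The variable names are illustrative; the approach divides one equation by the other to eliminate a.

```python
import math

# The two conditions from part (a): y(1) = 2250 and y(4) = 9000.
x1, y1 = 1, 2250
x2, y2 = 4, 9000

# Dividing y2 = a * b**x2 by y1 = a * b**x1 eliminates a.
b = (y2 / y1) ** (1 / (x2 - x1))
a = y1 / b ** x1

print(round(a, 4), round(b, 4))
# Intercept and slope of the line for ln(y) in part (b).
print(round(math.log(a), 4), round(math.log(b), 4))
```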
4.8 (a) According to the claim, the number of children killed doubled every year after 1950.
Year               1951  1952  1953  1954  1955  1956  1957  1958  1959  1960
Number of deaths      2     4     8    16    32    64   128   256   512  1024
(b) A scatterplot showing the exponential relationship is shown below (left).
[Figure: scatterplot of number of deaths versus year (left); scatterplot of log(number of deaths) versus year (right)]
(c) According to the paper, the number of children killed x years after 1950 is $2^x$. Thus,
$2^{45} = 3.5184 \times 10^{13}$, or approximately 35 trillion children, were killed in 1995. This
is clearly a mistake. (d) A scatterplot of the logarithms against year (above on the right) shows
a strong, positive linear relationship. (e) The least-squares regression line for predicting the
logarithm of y = deaths from x = year is approximately $\hat{y} = -587.0 + 0.301x$. Thus, the
predicted value in 1995 is $\hat{y} = -587.0 + 0.301(1995) \approx 13.495$. As a check,
$\log(2^{45}) \approx 13.5463$. The absolute difference in these two predictions, 0.0513, is
relatively small.
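The check in parts (c) and (e) can be reproduced with a few lines of Python. The variable names are illustrative, and the negative intercept is inferred from the arithmetic in part (e).

```python
import math

deaths_1995 = 2 ** 45  # doubling every year for 45 years after 1950
log_exact = math.log10(deaths_1995)

# Fitted line for log10(deaths) on calendar year; the negative intercept
# is inferred from the arithmetic in part (e).
log_fitted = -587.0 + 0.301 * 1995

print(deaths_1995, round(log_exact, 4), round(log_fitted, 3))
print(round(log_exact - log_fitted, 4))
```

The slope 0.301 is close to log10(2), which is what a quantity that doubles every year should give on the log scale.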
4.9 (a) A scatterplot is shown below.
[Figure: scatterplot of population (in millions) versus time (since 1790) (left); scatterplot of log(population) versus time (since 1790) (right)]
(b) In the scatterplot above (right), the transformed data appear to be linear from 0 to 90 (or 1790
to about 1880), and then linear again, but with a smaller slope. The linear trend indicates that the
exponential model is still appropriate and the smaller slope reflects a slower growth rate. (c) The
least-squares regression line for predicting y = log(population) from x = time since 1790 is
$\hat{y} = 1.329 + 0.0054x$. Transforming back to the original variables, the estimated population
size is $21.3304 \times 1.0125^x$. A scatterplot with this regression line is shown below (left).
(d) The residual plot (below on the right) shows random scatter and $r^2 = 0.995$, so the
exponential model provides an excellent fit.
[Figure: scatterplot of log(population) versus time (since 1790) with the regression line (left); residual plot versus time (since 1790) (right)]
(e) The predicted value of log(population) in 2010 is $\hat{y} = 1.329 + 0.0054(220) \approx 2.517$,
so the predicted population is about $10^{2.517} \approx 328.8516$ million people. The prediction
is probably too low, because these estimates usually do not include homeless people and illegal
immigrants.
4.10 (a) A scatterplot of distance versus height is shown below (left).

[Figure: scatterplot of distance versus height (left); scatterplot of distance versus square root of height (right)]
(b) The curve tends to bow downward, which resembles a power curve $x^p$ with $p < 1$.
Since we want to pull in the right tail of the distribution, we should apply a transformation
$x^p$ with $p < 1$. (c) A scatterplot of distance against the square root of height (shown above,
right) straightens the graph quite nicely.
4.11 (a) Let x = Body weight in kg and y = Life span in years. Scatterplots of the original data
(left) and the transformed data (right), after taking the logarithms of both variables, are shown
below. The linear trend in the scatterplot for the transformed data suggests that the power model
is appropriate.
[Figure: scatterplot of life span (years) versus weight (kg) (left); scatterplot of log(life span) versus log(weight) (right)]
(b) The least-squares regression line for the transformed data is
$\log \hat{y} = 0.7617 + 0.2182\log x$. The residual plot (below on the left) shows fairly random
scatter about zero and $r^2 = 0.7117$. Thus, 71.17% of the variation in the log of the life spans
is explained by the linear relationship with the log of the body weight.

[Figure: residual plot versus log(weight) (left); scatterplot of life span (years) versus transformed weight (right)]
(c) The inverse transformation gives the estimated power model
$\hat{y} = 10^{0.7617}x^{0.2182} \approx 5.7770x^{0.2182}$. (d) This model predicts the average
life span for humans to be $\hat{y} \approx 5.7770(65)^{0.2182} = 14.3642$ years, considerably
shorter than the expected life span of humans. (e) According to the biologists, the power model is
$y = ax^{0.2}$. The easiest and best option is to plot a graph of
$(\text{weight}^{0.2}, \text{lifespan})$ and then fit a least-squares regression line using the
transformed weight as the explanatory variable. The scatterplot (above on the right) shows that
this model provides a good fit for the data. The least-squares regression line is
$\hat{y} = -2.70 + 7.95x^{0.2}$ with a predicted average life span of
$\hat{y} = -2.7 + 7.95(65)^{0.2} \approx 15.62$ years for humans. Note: Students may try some other
models, which are not as good. For example, raising both sides of the equation to the fifth power,
the model becomes $y^5 = a^5x$, which is a linear regression model with no intercept parameter (or
an intercept of zero). After transforming life span $y$ to $y^5$, the estimated model is
$\hat{y}^5 = 30{,}835x$. This model predicts the average life span of humans to be
$\hat{y} = (30{,}835 \times 65)^{0.2} \approx 18.2134$ years. Another option is to try plotting a
graph of $(\text{weight}, \text{lifespan}^5)$ to achieve linearity. The least-squares regression
line for this set of transformed data is $\hat{y}^5 = 1{,}389{,}463 + 30{,}068x$ with a predicted
average life span of $\hat{y} = (1{,}389{,}463 + 30{,}068 \times 65)^{0.2} \approx 20.1767$ years
for humans. Note that none of the models provides a reasonable estimate for the average life span
of humans.
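The four predictions for a 65 kg human can be compared side by side. The sketch below uses the rounded coefficients quoted in the solution; the dictionary keys are informal labels, and the negative intercept of the second model is inferred from its stated prediction of 15.62 years.

```python
HUMAN_WEIGHT_KG = 65  # body weight used for humans in the solution

# Each model maps body weight in kg to predicted life span in years.
models = {
    "log-log fit": lambda x: 10 ** 0.7617 * x ** 0.2182,
    "life span on weight^0.2": lambda x: -2.70 + 7.95 * x ** 0.2,
    "(life span)^5 on weight, no intercept": lambda x: (30_835 * x) ** 0.2,
    "(life span)^5 on weight": lambda x: (1_389_463 + 30_068 * x) ** 0.2,
}

predictions = {name: model(HUMAN_WEIGHT_KG) for name, model in models.items()}
for name, years in predictions.items():
    print(f"{name}: {years:.2f} years")
```

All four predictions fall far below a realistic human life span, which is the point of the closing note.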


4.12 (a) The power model would be more appropriate for these data. The scatterplot of the log
of cost versus diameter (below on the left) is linear, but the plot of the log of cost versus the log
of diameter (below on the right) shows almost a perfect straight line.

[Figure: scatterplot of log(cost) versus diameter (inches) (left); scatterplot of log(cost) versus log(diameter) (right)]
(b) Let y = the cost of the pizza and x = the diameter of the pizza. The least-squares regression
line is $\log \hat{y} = -1.5118 + 2.1150\log x$. The inverse transformation gives the estimated
power model $\hat{y} = 10^{-1.5118}x^{2.115} \approx 0.0308x^{2.115}$. (c) According to this model,
the predicted costs of the four different size pizzas are $4.01, $5.90, $8.18, and $13.91, from
smallest to largest. There are only slight differences between the predicted costs for the model
and the actual costs, so an adjustment does not appear to be necessary based on this model.
(d) According to our estimated power model in part (b), the predicted cost for the new soccer team
pizza is $\hat{y} = 0.0308(24)^{2.115} \approx \$25.57$. (e) An alternative model is based on
setting the cost proportional to the area, or the power model of the form
cost $\propto (\pi/4)x^2$. Most students will square the diameter and then fit a linear model to
obtain the least-squares regression line $\hat{y} = -0.506 + 0.0445x^2$. The estimated price of the
soccer team pizza is $\hat{y} = -0.506 + 0.0445(24)^2 \approx \$25.13$. Alternatively, this model
can be rewritten as $\sqrt{y} = bx$. Using least-squares with no intercept, the value of b is
estimated to be 0.2046, so the predicted cost of the soccer team pizza is
$\hat{y} = (0.2046 \times 24)^2 \approx \$24.11$.
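The three price estimates for the 24-inch pizza can be checked with a short sketch. The variable names are illustrative, and the negative intercepts are inferred from the stated results. The first estimate uses the unrounded multiplier $10^{-1.5118}$, so it differs by a couple of cents from the $25.57 obtained with 0.0308.

```python
DIAMETER = 24  # inches, the soccer team pizza

# Part (b): power model fitted on the log-log scale.
power_fit = 10 ** -1.5118 * DIAMETER ** 2.115

# Part (e): cost regressed on diameter squared, with an intercept.
area_fit = -0.506 + 0.0445 * DIAMETER ** 2

# Part (e): sqrt(cost) = b * diameter, fitted with no intercept.
sqrt_fit = (0.2046 * DIAMETER) ** 2

print(round(power_fit, 2), round(area_fit, 2), round(sqrt_fit, 2))
```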
4.13 (a) As height increases, weight increases. Since weight is a 3-dimensional characteristic
and height is 1-dimensional, weight should be proportional to the cube of the height. A model of
the form $\text{weight} = a(\text{height})^b$ would be a good place to start. (b) A scatterplot of
the response variable y = weight versus the explanatory variable x = height is shown below.
[Figure: scatterplot of weight (pounds) versus height (inches)]
(c) Calculate the logarithms of the heights and the logarithms of the weights. The least-squares
regression line for the transformed data is $\log \hat{y} = -1.3912 + 2.0029\log x$.
$r^2 = 0.9999$; almost all (99.99%) of the variation in log of weight is explained by the linear
relationship with log of height. (d) The residual plot below for the transformed data shows that
the residuals are very close to zero with no discernible pattern. This model clearly fits the
transformed data very well.
[Figure: residual plot versus log(height)]
(e) The inverse transformation gives the estimated power model
$\hat{y} = 10^{-1.3912}x^{2.0029} \approx 0.0406x^{2.0029}$. The predicted weight of a 5'10" (70")
adult is $\hat{y} = 0.0406(70)^{2.0029} \approx 201.4062$ lbs, and the predicted weight of a
7' (84") adult is $\hat{y} = 0.0406(84)^{2.0029} \approx 290.1784$ lbs.


4.14 Who? The individuals are hearts from various mammals. What? The response variable y
is the weight of the heart (in grams) and the explanatory variable x is the length of the left
ventricle (in cm). Why? The data were collected to explore the relationship between these two
quantitative measurements for hearts of mammals. When, where, how, and by whom? The data
were originally collected back in 1927 by researchers studying the physiology of the heart.
Graphs: A scatterplot of the original data is shown below (left). The nonlinear trend in the
scatterplot makes sense because the heart weight is a 3-dimensional characteristic which should
be roughly proportional to the cube of the length of the cavity of the left ventricle. A
scatterplot, after transforming the data by taking the logarithms of both variables, shows a clear
linear trend (below, right) so the power model is appropriate.
[Figure: scatterplot of heart weight (grams) versus cavity length (cm) (left); scatterplot of log(heart weight) versus log(cavity length) (right)]
Numerical Summaries: The correlation between log of cavity length and log of heart weight is
0.997, indicating a near perfect association. Model: The power model is
$\text{weight} = a \times \text{length}^b$. After taking the logarithms of both variables, the
least-squares regression line is $\log \hat{y} = -0.1364 + 3.1387\log x$. Approximately 99.3% of
the variation in the log of heart weight is explained by the linear relationship with log of cavity
length. The residual plot below suggests that there may be a little bit of curvature remaining, but
nothing to get overly concerned about.

[Figure: residual plot versus log(cavity length)]

Interpretation: The inverse transformation gives the estimated power model
$\hat{y} = 10^{-0.1364}x^{3.1387} \approx 0.7305x^{3.1387}$, which provides a good fit for these
data.

4.15 (a) The scatterplot below (left) shows that the relationship between y = distance and x =
time is strong, positive, and nonlinear (curved).
[Figure: scatterplot of distance (cm) versus time (seconds) (left); residual plot versus time squared (right)]
(b) The least-squares regression line for the transformed data is $\hat{y} = 0.990 + 490.416x^2$.
(c) The residual plot above (right) shows random scatter and $r^2 = 0.9984$, so 99.84% of the
variability in the distance fallen is explained with this linear model. (d) Yes, the scatterplot
below (left) shows that this transformation does a very good job creating a linear trend. The
least-squares regression line for the transformed data is
$\widehat{\sqrt{y}} = 0.1046 + 22.0428x$.
[Figure: scatterplot of square root of distance versus time (seconds) (left); residual plot versus time (seconds) (right)]
(e) The residual plot above (right) shows no obvious pattern and $r^2 = 0.9986$. This is an
excellent model. (f) The predicted distance that an object had fallen after 0.47 seconds is
109.32 cm using the model from (b) and 109.51 cm using the model from (d). There is very little
difference in the predicted values, but most students will probably pick the prediction from (d)
because $r^2$ is a little higher and the residual plot shows less variability about the regression
line.
4.16 (a) We are given the model $\ln y = -2.00 + 2.42\ln x$. Using properties of logarithms, the
power model is $e^{\ln y} = e^{-2.00 + 2.42\ln x}$ or $y = e^{-2.00}x^{2.42}$. (b) The estimated
biomass of a tree with a diameter of 30 cm is $\hat{y} = e^{-2.00}(30)^{2.42} \approx 508.2115$ kg.
4.17 Who? The individuals are carnivores. What? The response variable y is a measure of
abundance and the explanatory variable x is the size of the carnivore. Why? Ecologists were
interested in learning more about nature's patterns. When, where and how? The data were
collected before 2002 (the publication date) by relating the body mass of the carnivore to the
number of carnivores. Rather than simply counting the total number of observed carnivores, the
researchers created a measure of abundance based on a count relative to the size of prey in an
area. Graphs: A scatterplot of y = abundance versus x = body mass (on the left below) shows a
nonlinear relationship. Using the log transformation for both variables provides a moderately
strong, negative, linear relationship (see the scatterplot below on the right).
[Figure: scatterplot of abundance versus body mass (kg) (left); scatterplot of log(abundance) versus log(body mass) (right)]
Numerical Summaries: The correlation between log body mass and log abundance is $-0.912$.
Model: The least-squares regression line for the transformed data is
$\log \hat{y} = 1.9503 - 1.0481\log x$, with $r^2 = 0.8325$ and a residual plot (below) showing no
obvious patterns.
[Figure: residual plot versus log(body mass)]
Interpretation: The inverse transformation gives the estimated power model
$\hat{y} = 10^{1.9503}x^{-1.0481} \approx 89.1867x^{-1.0481}$, which provides a good fit for these
data.
4.18 Let x = the breeding length, length at which 50% of females first reproduce and y = the
asymptotic body length. The scatterplot (left) and residual plot (right) below show that the linear
model does not provide a great fit for these body measurements of this fish species. Most of the
residuals are negative for breeding lengths below 30 cm and above 150 cm.
[Figure: scatterplot of asymptotic body length (cm) versus breeding length (cm) (left); residual plot versus breeding length (right)]
Applying the log transformation to both lengths produces better results. The scatterplot (left)
and residual plot (right) below show that a linear model provides a very good fit. The
least-squares regression model for the transformed data is $\log \hat{y} = 0.3011 + 0.9520\log x$,
with $r^2 = 0.898$ and a residual plot with very little structure, although most of the residuals
are still negative when the explanatory variable is above 1.9.
[Figure: scatterplot of log(asymptotic body length) versus log(breeding length) (left); residual plot versus log(breeding length) (right)]
The inverse transformation gives the estimated power model
$\hat{y} = 10^{0.3011}x^{0.952} \approx 2.0003x^{0.952}$, which provides a good fit for these data.
4.19 (a) Scatterplots of the original data (left) and the transformed data (right) are shown below.

[Figure: scatterplot of mean colony size versus time (hours) (left); scatterplot of log(mean colony size) versus time (hours) (right)]
(b) The first phase is from 0 to 6 hours when the mean colony size actually decreases. This
decrease is hard to see on the graph of the original data, but is more obvious on the graph of the
transformed data. In the second phase, from 6 to 24 hours, the mean colony size increases
exponentially. Both graphs show this phase clearly, but it is most noticeable from the linear
trend on the graph of the transformed data for this time period. At 36 hours, mean growth is in
the third phase where growth is still occurring, but at a lower rate than the previous phase. The
point in the top right corner of both graphs clearly shows the new phase because this point does
not fit the pattern for phase two. (c) Let y = mean colony size and x = time. The least-squares
regression line for the transformed data is $\log \hat{y} = -0.5942 + 0.0851x$. Using the inverse
transformation, the predicted size of a colony 10 hours after inoculation is
$\hat{y} = 10^{-0.5942} \times 10^{0.0851(10)} = 10^{0.2568} \approx 1.8063$.

4.20 The correlation for time (hours 6 to 24) and log(mean colony size) is $r = 0.9915$. The
correlation for time (hours 6 to 24) and log(individual colony size) is $r = 0.9846$. As expected,
the correlation for the individual colony size is smaller than the correlation for the mean colony
size because individual measurements have more variability. The scatterplots below show the
differences in the relationships for mean colony sizes (left) and individual colony sizes (right).
[Figure: scatterplot of log(mean colony size) versus time (hours) (left); scatterplot of log(colony size) versus time (hours) (right)]

4.21 (a) $\text{Weight} = c_1(\text{height})^3$ and $\text{strength} = c_2(\text{height})^2$, so
$\text{strength} = c_3(\text{weight})^{2/3}$, where $c_1$, $c_2$, and $c_3$ are arbitrary
constants. (b) The graph of $y = x^{2/3}$ below shows that strength does not increase linearly with
body weight, as would be the case if a person 1 million times as heavy as an ant could lift
1 million times more than the ant. Strength increases more slowly. For example, if weight is
multiplied by 1000, strength will increase by a factor of $1000^{2/3} = 100$.

[Figure: graph of strength versus weight, $y = x^{2/3}$]

4.22 (a) Answers will vary. (b) The population of cancer cells after $n - 1$ years is
$P = P_0(7/6)^{n-1}$. The population of cancer cells after $n$ years is
$P = P_0(7/6)^{n-1} + (1/6)\left(P_0(7/6)^{n-1}\right) = P_0(7/6)^n$.
(c) Answers will vary, but the exponential model should provide a good fit for the data collected.
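The induction step in part (b) can be verified numerically. The sketch below uses exact fractions; the function name and the starting size of 6^5 are hypothetical choices made only to keep the values whole.

```python
from fractions import Fraction


def population(p0: Fraction, years: int) -> Fraction:
    """Grow the population one year at a time: each year adds one sixth."""
    p = p0
    for _ in range(years):
        p = p + Fraction(1, 6) * p
    return p


P0 = Fraction(6 ** 5)  # hypothetical starting size, chosen to keep values whole

# The year-by-year rule should agree with the closed form P0 * (7/6)**n.
for n in range(6):
    assert population(P0, n) == P0 * Fraction(7, 6) ** n

print(population(P0, 5))
```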
4.23 (a) The sum of the six counts is 10+9+24+61+206+548 = 858 people. (b) The sum of the
top row shows 10+9+24 = 43 people had arthritis. (c) The marginal distribution of participation
in soccer is shown below.
          Elite   Non-elite   Did not play
Count        71         215            572
Percent    8.3%       25.1%          66.7%
(d) The percent of each group who have arthritis is 14.08% for the elite soccer players, 4.19% for
the non-elite soccer players, and 4.20% for the people who did not play. This suggests an
association between playing elite soccer and developing arthritis.
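The marginal and conditional percents can be computed directly from the six counts. In this sketch the assignment of counts to groups is an assumption inferred from the column totals 71, 215, and 572 quoted in part (c).

```python
# Counts from parts (a) and (b); the column order (elite, non-elite,
# did not play) is inferred from the totals in part (c).
groups = ["Elite", "Non-elite", "Did not play"]
arthritis = [10, 9, 24]
no_arthritis = [61, 206, 548]

column_totals = [a + b for a, b in zip(arthritis, no_arthritis)]
grand_total = sum(column_totals)

# Marginal distribution of soccer participation, in percent.
marginal = [100 * t / grand_total for t in column_totals]
# Percent with arthritis within each group.
conditional = [100 * a / t for a, t in zip(arthritis, column_totals)]

for g, m, c in zip(groups, marginal, conditional):
    print(f"{g:13s} {m:5.1f}% of participants, {c:5.2f}% with arthritis")
```

The marginal percents divide by the grand total, while the conditional percents divide by each group's own total, which is the distinction parts (c) and (d) rely on.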
4.24 The percents should add to 100% because they provide a breakdown of all participants
according to one categorical variable. The sum is 8.3% + 25.1% + 66.7% = 100.1 %. If one
more decimal place is included in each of the percents, then the sum is 8.28% + 25.06% +
66.67% = 100.01%. The percents do not add to 100% because of rounding.
4.25 (a) The sum of the six counts is 5375 students. (b) The proportion of these students who
smoke is 1004/5375 = 0.1868, so the percent of smokers is 18.68%. (c) The marginal
distribution of parents' smoking behavior is shown below.
          Neither parent smokes   One parent smokes   Both parents smoke
Count                      1356                2239                 1780
Percent                  25.23%              41.66%               33.12%
(d) The three conditional distributions are shown in the table below.
                         Neither parent smokes   One parent smokes   Both parents smoke
Student does not smoke                  86.14%              81.42%               77.53%
Student smokes                          13.86%              18.58%               22.47%
The conditional distributions reveal what many people expect: parents have a substantial
influence on their children. Students who smoke are more likely to come from families where
one or more of their parents smoke.

4.26 (a) The two-way table is shown below. (b) The percents of eggs in each group that hatched
are 59.26% in a cold nest, 67.86% in a neutral nest, and 72.12% in a hot nest. The percents
indicate that hatching increases with temperature. The cold nest did not prevent hatching, but
made it less likely.
              Cold   Neutral   Hot
Hatched         16        38    75
Not hatched     11        18    29
Total           27        56   104
4.27 (a) The two conditional distributions are shown in the table below. The biggest difference
between men and women is in Administration: a higher percentage of women chose this major.
A greater percent of men chose the other fields, especially finance. (b) A total of 386 students
responded, so 722 - 386 = 336 did not respond. About 46.54% of the students did not respond.
                 Female     Male
Accounting       30.22%   34.78%
Administration   40.44%   24.84%
Economics         2.22%    3.73%
Finance          27.11%   36.65%
4.28 Two examples are shown below. In general, choose a to be any number from 0 to 50, and
then all the other entries can be determined.
25  25        10  40
35  15        50   0
Note: This is why we say that such a table has one degree of freedom: We can make
one (nearly) arbitrary choice for the value of a, and then have no more decisions to make.
4.29 (a) The two-way table is shown below. (b) Overall, 11.88% of white defendants and
10.24% of black defendants receive the death penalty. For white victims, 12.58% of white
defendants and 17.46% of black defendants receive the death penalty. For black victims, 0% of
white defendants and 5.83% of black defendants receive the death penalty. (c) The death penalty
is more likely when the victim was white (14.02%) rather than black (5.36%). Because most
convicted killers are of the same race as their victims, whites are more often sentenced to death.
                   Death penalty    No death penalty
White defendant         19                141
Black defendant         17                149
4.30 (a) The two-way table is shown below. (b) Overall, 70% of male applicants are admitted,
while only 56% of females are admitted. (c) In the business school, 80% of male applicants are
admitted, compared with 90% of females. In the law school, 10% of males are admitted,
compared with 33.33% of females. (d) Six out of 7 men apply to the business school, which
admits 82.5% of all applicants, while 3 out of 5 women apply to the law school, which admits
only 27.5% of its applicants.
          Admit    Deny
Male       490     210
Female     280     220


4.31 The table below gives the two marginal distributions. The marginal distribution of marital
status is found by taking, e.g., 337/8235 ≈ 4.1%. The marginal distribution of job grade is found
by taking, e.g., 955/8235 ≈ 11.6%.
Single    Married    Divorced    Widowed
4.1%      93.9%      1.5%        0.5%

Grade 1   Grade 2    Grade 3     Grade 4
11.6%     51.5%      30.2%       6.7%
As rounded here, both sets of percents add up to 100%. If students round to the nearest whole
percent, the marital status numbers add up to 101%. If they round to two places after the
decimal, the job grade percents add up to 100.01%.
4.32 The percent of single men in grade 1 jobs is 58/337 ≈ 17.21%. The percent of grade 1 jobs
held by single men is 58/955 ≈ 6.07%.
4.33 Divide the entries in the first column by the first column total; e.g., 17.21% ≈ 58/337.
These should add to 100% (except for rounding error). The percentages in the table below add to
100.01%.
Job grade    % of single men
1            17.21%
2            65.88%
3            14.84%
4             2.08%
If the percents are rounded to the nearest tenth, 17.2%, 65.9%, 14.8%, and 2.1%, then they add to
100%.
4.34 (a) We need to compute percents to account for the fact that the study included many more
married men than single men, so that we would expect their numbers to be higher in every job
grade (even if marital status had no relationship with job level). (b) A table of percents is below;
descriptions of the relationship may vary. Single and widowed men had higher percents of grade
1 jobs; single men had the lowest (and widowed men the highest) percents of grade 4 jobs.
Job grade    Single    Married    Divorced    Widowed
1            17.21%    11.31%     11.90%      19.05%
4             2.08%     6.90%      5.56%       9.52%
4.35 Age is the main lurking variable: Married men would generally be older than single men,
so they would have been in the work force longer, and therefore had more time to advance
in their careers.
4.36 (a) A bar graph is shown below; 58.33% of desipramine users did not have a relapse,
while 25.0% of lithium users and 16.7% of those who received a placebo succeeded in breaking
their addictions. (b) Because random assignment was used, there is statistical evidence for
causation (though there are other questions we need to consider before we can reach that
conclusion).

[Bar graph: percent with no relapse for the desipramine, lithium, and placebo groups.]

4.37 (a) To find the marginal distribution of opinion, we need to know the total numbers of
people with each opinion: 49/133 ≈ 36.84% said higher, 32/133 ≈ 24.06% said the same,
and 52/133 ≈ 39.10% said lower. The numbers are summarized in the first table below. The
main finding is probably that about 39% of users think the recycled product is of lower quality.
This is a serious barrier to sales. (b) There were 36 buyers and 97 nonbuyers among the
respondents, so (for example) 20/36 ≈ 55.56% of buyers rated the quality as higher. Similar
arithmetic with the buyers and nonbuyers rows gives the two conditional distributions of opinion,
shown in the second table below. We see that buyers are much more likely to consider recycled
filters higher in quality, though 25% still think they are lower in quality. We cannot draw any
conclusion about causation: It may be that some people buy recycled filters because they start
with a high opinion of recycled products, or it may be that use persuades people that the quality
is high.
Higher    The same    Lower
36.84%    24.06%      39.10%

             Higher    The same    Lower
Buyers       55.56%    19.44%      25.00%
Nonbuyers    29.90%    25.77%      44.33%

4.38 (a) The two-way table is shown below. (b) The overall batting averages are 0.240 for Joe
and 0.260 for Moe. Moe has the better overall batting average.
        Hit    No hit
Joe     120     380
Moe     130     370
(c) Two separate tables, one for each type of pitcher, are shown below. Against left-handed
pitchers, Joe's batting average is 0.200 and Moe's batting average is 0.100. Against
right-handed pitchers, Joe's batting average is 0.400 and Moe's batting average is 0.300. Joe is
better against both kinds of pitchers.
Left-handed pitchers              Right-handed pitchers
        Hit    No hit                     Hit    No hit
Joe      80     320               Joe      40      60
Moe      10      90               Moe     120     280
(d) Both players do better against right-handed pitchers than against left-handed pitchers. Joe
spent 80% of his at-bats facing left-handers, while Moe only faced left-handers 20% of the time.
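The reversal between the overall and the per-pitcher comparisons can be verified from the counts in the tables. This is a small sketch only; the data structure and function names are illustrative, and at-bats are taken as hits plus no-hits.

```python
# (hits, at-bats) by pitcher type, from the tables in part (c).
records = {
    "Joe": {"left": (80, 400), "right": (40, 100)},
    "Moe": {"left": (10, 100), "right": (120, 400)},
}


def batting_average(hits, at_bats):
    """Hits divided by at-bats."""
    return hits / at_bats


def overall_average(player):
    """Pool hits and at-bats over both pitcher types before dividing."""
    hits = sum(h for h, _ in records[player].values())
    at_bats = sum(n for _, n in records[player].values())
    return batting_average(hits, at_bats)


for player, splits in records.items():
    by_type = {kind: batting_average(*pair) for kind, pair in splits.items()}
    print(player, by_type, f"overall={overall_average(player):.3f}")
```

Joe is ahead within each pitcher type, yet Moe is ahead overall, because most of Joe's at-bats came against the left-handers that both players hit poorly.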


4.39 Examples will vary, of course; one very simplistic possibility is shown below. The key is
to be sure that there is a lower percentage of overweight people among the smokers than among
the nonsmokers.
Combined (all people)
                  Early death
                  Yes    No
Overweight         41    59
Not overweight     50    50

Smokers
                  Early death
                  Yes    No
Overweight         10     0
Not overweight     40    20

Nonsmokers
                  Early death
                  Yes    No
Overweight         31    59
Not overweight     10    30

4.40 Who? The individuals are students. What? The categorical variables of interest are
educational level or degree (Associate's, Bachelor's, Master's, Professional, or Doctor's) and
gender (male or female). Why? The researchers were interested in checking if the participation
of women changes with level of degree. When, where, how, and by whom? These projections,
in thousands, were made for 2005-2006 by the National Center for Education Statistics. Graphs:
The conditional distributions of sex for each degree level are shown in the bar graph below (left).
The conditional distributions of degree level for each gender are shown in the bar graph below
(right).
[Bar graphs: conditional distributions of gender for each degree level (left) and of degree
level for each gender (right); the vertical axes show percent.]

Numerical summaries: The software output below from Minitab provides the joint distribution,
marginal distributions, and conditional distributions in one consolidated table. The first entry in
each cell is the count, the second entry is the % of the row (or the conditional distribution of
gender for each type of degree), the third entry is the % of the column (or the conditional
distribution of degree for each gender), and the fourth entry is the overall %.

Rows: Degree   Columns: Gender

                  Female        Male         All

Associate's          431         244         675
                   63.85       36.15      100.00
                   26.85       21.90       24.83
                  15.851       8.974      24.825

Bachelor's           813         584        1397
                   58.20       41.80      100.00
                   50.65       52.42       51.38
                  29.901      21.478      51.379

Doctor's              21          24          45
                   46.67       53.33      100.00
                    1.31        2.15        1.66
                   0.772       0.883       1.655

Master's             298         215         513
                   58.09       41.91      100.00
                   18.57       19.30       18.87
                  10.960       7.907      18.867

Professional          42          47          89
                   47.19       52.81      100.00
                    2.62        4.22        3.27
                   1.545       1.729       3.273

All                 1605        1114        2719
                   59.03       40.97      100.00
                  100.00      100.00      100.00
                  59.029      40.971     100.000

Cell Contents:      Count
                    % of Row
                    % of Column
                    % of Total

Interpretation: Women earn a majority of associate's, bachelor's, and master's degrees, but fall
slightly below 50% for professional and doctoral degrees. The distributions of degree level are
very similar for females and males.
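The "% of Row" entries that support this interpretation can be recomputed from the counts (in thousands) in the Minitab output. The sketch below is illustrative only; the variable names are not from the exercise.

```python
# (female, male) counts in thousands, copied from the Minitab output.
counts = {
    "Associate's": (431, 244),
    "Bachelor's": (813, 584),
    "Master's": (298, 215),
    "Professional": (42, 47),
    "Doctor's": (21, 24),
}

# Conditional distribution of gender within each degree level (% of row).
female_share = {
    degree: 100 * female / (female + male)
    for degree, (female, male) in counts.items()
}

for degree, pct in female_share.items():
    side = "majority" if pct > 50 else "minority"
    print(f"{degree:13s} {pct:6.2f}% female ({side})")
```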
4.41 No. Rich nations have more TV sets than poor nations. Rich nations also have longer life
expectancies because they have better nutrition, clean water, and better health care. There is
a common response relationship between TV sets and length of life.

[Diagram: wealth affects both x = number of TV sets and y = average life span.]

4.42 In this case, there may be a causative effect, but in the direction opposite to the one
suggested: People who are overweight are more likely to be on diets, and so choose artificial
sweeteners over sugar. (Also, heavier people are at a higher risk to develop diabetes; if they do,
they are likely to switch to artificial sweeteners.)

[Diagram: use of sweeteners and weight gain.]

4.43 No. The number of hours standing up is a confounding variable in this case. The diagram
below illustrates the confounding between exposure to chemicals and standing up.

[Diagram: the effect of exposure to chemicals on miscarriages (marked ?) is confounded with
time standing up.]
4.44 Well-off people tend to have more cars. They also tend to live longer, probably because
they are better educated, take better care of themselves, and get better medical care. The cars
have nothing to do with it. The relationship between number of cars and length of life is
common response.

[Diagram: wealth affects both number of cars and length of life.]

4.45 It could be that children with lower intelligence watch many hours of television and get
lower grades as well. It could also be that children from lower socio-economic households
watch more television and earn lower grades, because their parents are less likely to limit
television viewing and are unable to help with schoolwork when the parents themselves lack
education. The variables number of hours watching television and grade point average change
in common response to socio-economic status or IQ.

[Diagram: IQ or socioeconomic status affects both number of hours spent watching TV and GPA.]

4.46 Single men tend to have a different value system than married men. They have many
interests, but getting married and earning a substantial amount of money are not among their top
priorities. Confounding is the best term to describe the relationship between marital status and
income.

[Diagram: the effect of marital status on annual income (marked ?) is confounded with values.]

4.47 The effects of coaching are confounded with those of experience. A student who has taken
the SAT once may improve his or her score on the second attempt because of increased
familiarity with the test. The student may also have increased knowledge from additional math
and science courses.

[Diagram: the effects of the coaching course and of experience on SAT score are confounded.]

4.48 A reasonable explanation is that the cause-and-effect relationship goes in the other
direction: Doing well makes students feel good about themselves, rather than vice versa.

[Diagram: self-esteem and quality of work.]

CASE CLOSED!
1. (a) Let y = premium and x = age. Scatterplots of the original data (left) and transformed data
(right) after taking the logarithms of both variables are shown below. The plot of the original
data shows a strong nonlinear relationship. The plot for the transformed data shows a clear
linear trend, so the power model is appropriate.

[Scatterplots: premium ($) versus age (years) (left); log(premium) versus log(age) (right).]

(b) A scatterplot of the logarithm of premium versus age is shown below (left). The linear trend
suggests that the exponential model is appropriate.
[Figures: scatterplot of log(premium) versus age (left); residual plot versus age (right).]

(c) Since the association between the log of premium and age is nearly perfect, the exponential
model is most appropriate. The least-squares regression line for the transformed data is
$\log \hat{y} = -0.0275 + 0.0373x$. Using the inverse transformation, the predicted premium is
$\hat{y} = 10^{-0.0275} \cdot 10^{0.0373x} \approx 0.9386 \cdot 10^{0.0373x}$. (d) The predicted monthly premiums are
$\hat{y} = 0.9386 \cdot 10^{0.0373 \cdot 58} \approx \$136.74$ for a 58-year-old and
$\hat{y} = 0.9386 \cdot 10^{0.0373 \cdot 68} \approx \$322.76$ for a 68-year-old. (e) You should feel very
comfortable with these predictions. The residual plot above (right) shows no clear patterns
and $r^2 = 99.9\%$, so the exponential model provides an excellent fit.
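The inverse transformation in parts (c) and (d) can be sketched in a few lines. The coefficients are the ones reported above (with a negative intercept, which is what makes the multiplier 0.9386); the function and constant names are illustrative.

```python
# Fitted line on the log scale: log10(premium) = INTERCEPT + SLOPE * age.
INTERCEPT = -0.0275
SLOPE = 0.0373


def predicted_premium(age):
    """Undo the base-10 log to get the predicted monthly premium in dollars."""
    return 10 ** (INTERCEPT + SLOPE * age)


multiplier = 10 ** INTERCEPT  # the constant in front of 10**(SLOPE * age)
print(f"multiplier: {multiplier:.4f}")
for age in (58, 68):
    print(f"age {age}: ${predicted_premium(age):.2f}")
```

Small differences in the last digit compared with the answers above can arise because the answers use the rounded multiplier 0.9386.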
2. (a) The entries in each column are only from these six selected causes of death. There are
other causes of death so the total number of deaths in each age group is higher than the sum of
the deaths for these six causes. (b) Percents should be used to compare the age groups because
the age groups contain different numbers of individuals. (c) The conditional distributions are
shown in the table below. Each entry is obtained by dividing the count for that cause of death by
the appropriate column total.
                 15 to 24 years    25 to 44 years    45 to 64 years
Accidents            45.32%            21.60%             5.42%
AIDS                  0.52%             5.34%             1.35%
Cancer                4.93%            14.77%            33.16%
Heart disease         3.28%            12.63%            23.27%
Homicide             15.59%             5.71%             0.63%
Suicide              11.87%             8.73%             2.30%

Two different bar graphs below show the conditional distributions.
[Bar graphs: percent of deaths by cause, grouped by age group (left) and grouped by cause
(right).]

(d) The leading cause of death for the youngest age group is accidents, followed by homicide
and suicide. For the middle age group, accidents are still the leading cause of death, but cancer
and heart disease are second and third, respectively. For the oldest age group, cancer is the
leading cause of death, with heart disease running a close second.
3. (a) The chance of dying for men over 65 who walk at least 2 miles a day is half that of men
who do not exercise. (b) Individuals who exercise regularly have many other habits and
characteristics that could contribute to longer lives.
4.49 Spending more time watching TV means that less time is spent on other activities. Answers
will vary, but some possible lurking variables are: the amount of time parents spend at home, the
amount of exercise and the economy. For example, parents of heavy TV watchers may not
spend as much time at home as other parents. Heavy TV watchers may not get as much exercise
as other adolescents. As the economy has grown over the past 20 years, more families can afford
TV sets (many homes now contain more than two TV sets), and as a result, TV viewing has
increased and children have less physical work to do in order to make ends meet.
4.50 (a) Let y = intensity and x = distance. A scatterplot of the original data is shown below
(left). The data appear to follow a power law model of the form $y = ax^b$, where b is some
negative number.
[Scatterplots: intensity (candelas) versus distance (meters) (left); log(intensity) versus
log(distance) (right).]

(b) A scatterplot of the transformed data (above on the right), after taking the logarithms of both
variables, shows a clear linear trend, so the power model is appropriate. The least-squares
regression line for the transformed data is $\log \hat{y} = -0.5235 - 2.0126 \log x$. (c) The residual plot
below shows no obvious patterns and $r^2 = 99.9\%$, so this linear model on the transformed data
provides an excellent fit.
[Figures: residuals versus log(distance) (left); intensity (candelas) versus distance (meters)
with the predicted values from the power model (right).]

(d) Using the inverse transformation to find the predicted intensity gives
$\hat{y} = 10^{-0.5235} x^{-2.0126} \approx 0.2996 x^{-2.0126}$. The plot of the original data with this model is shown above
(right). (e) The predicted intensity of the 100-watt bulb at 2.1 meters is
$\hat{y} = 0.2996 \cdot 2.1^{-2.0126} \approx 0.0673$ candelas.
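As a check on parts (d) and (e), the power model can be evaluated directly from the log-log regression coefficients. This is only a sketch; the constant and function names are illustrative.

```python
# Fitted line on the log-log scale:
# log10(intensity) = LOG_INTERCEPT + LOG_SLOPE * log10(distance).
LOG_INTERCEPT = -0.5235
LOG_SLOPE = -2.0126


def predicted_intensity(distance_m):
    """Power model y = a * x**b with a = 10**LOG_INTERCEPT and b = LOG_SLOPE."""
    a = 10 ** LOG_INTERCEPT
    return a * distance_m ** LOG_SLOPE


a = 10 ** LOG_INTERCEPT
print(f"a = {a:.4f}, b = {LOG_SLOPE}")
print(f"predicted intensity at 2.1 m: {predicted_intensity(2.1):.4f} candelas")
```

The slope on the log-log scale becomes the exponent of the power model, and the intercept becomes the multiplier after undoing the logarithm.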
4.51 (a) Yes, this transformation achieves linearity; see the scatterplot below.
[Scatterplot: intensity (candelas) versus 1/(distance squared).]

(b) Let x = distance and y = intensity. The least-squares regression line for the transformed data
is $\hat{y} = -0.0006 + 0.30 \cdot \frac{1}{x^2}$. (c) The predicted intensity of the 100-watt bulb at 2.1 meters is
$\hat{y} = -0.0006 + 0.30 \cdot \frac{1}{2.1^2} \approx 0.0674$ candelas. (d) Writing the model from part (d) of Exercise
4.50 in a slightly different form shows that the models are very similar:
$\hat{y} = -0.0006 + \frac{0.3}{2.1^2}$ versus $\hat{y} \approx \frac{0.3}{2.1^{2.0126}}$. The absolute difference in the predicted
values is 0.0001. Thus, the inverse square law provides an excellent model.
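To see how close the two models are beyond the single point x = 2.1, both can be evaluated over a grid of distances. The sketch below assumes the coefficients reported in Exercises 4.50 and 4.51; the grid from 1.0 to 2.1 meters and all names are illustrative choices.

```python
def inverse_square_model(x):
    """Model from Exercise 4.51: intensity regressed on 1 / distance**2."""
    return -0.0006 + 0.30 / x ** 2


def power_model(x):
    """Model from Exercise 4.50 after undoing the log-log transformation."""
    return 0.2996 * x ** (-2.0126)


# Distances 1.0, 1.1, ..., 2.1 meters (illustrative grid).
distances = [1.0 + 0.1 * k for k in range(12)]
gaps = [abs(inverse_square_model(x) - power_model(x)) for x in distances]
max_gap = max(gaps)

for x, gap in zip(distances, gaps):
    print(f"x = {x:.1f} m  difference = {gap:.5f}")
print(f"largest difference: {max_gap:.5f}")
```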


4.52 The explanatory variable is the amount of herbal tea and the response variable is a measure
of health and attitude. The most important lurking variable is social interaction: many of the
nursing home residents may have been lonely before the students started visiting.
4.53 (a) The column sums are shown below.
Single:      10,949 + 7,653 + 4,009 + 720 = 23,331
Married:     2,472 + 19,640 + 32,183 + 8,539 = 62,834
Widowed:     16 + 228 + 2,312 + 8,732 = 11,288
Divorced:    155 + 2,904 + 7,898 + 1,703 = 12,660
The sum of these column totals is 23,331 + 62,834 + 11,288 + 12,660 = 110,113, which is not
equal to 110,115. The difference is due to rounding. (b) The marginal distributions, conditional
distributions, and joint distribution are shown in the software output from Minitab below.
Rows: Age   Columns: Marital Status

          divorced     married      single     widowed         All

15-24          155        2472       10949          16       13592
              1.14       18.19       80.55        0.12      100.00
              1.22        3.93       46.93        0.14       12.34
             0.141       2.245       9.943       0.015      12.344

25-39         2904       19640        7653         228       30425
              9.54       64.55       25.15        0.75      100.00
             22.94       31.26       32.80        2.02       27.63
             2.637      17.836       6.950       0.207      27.631

40-64         7898       32183        4009        2312       46402
             17.02       69.36        8.64        4.98      100.00
             62.39       51.22       17.18       20.48       42.14
             7.173      29.227       3.641       2.100      42.140

65+           1703        8539         720        8732       19694
              8.65       43.36        3.66       44.34      100.00
             13.45       13.59        3.09       77.36       17.89
             1.547       7.755       0.654       7.930      17.885

All          12660       62834       23331       11288      110113
             11.50       57.06       21.19       10.25      100.00
            100.00      100.00      100.00      100.00      100.00
            11.497      57.063      21.188      10.251     100.000

Cell Contents:      Count
                    % of Row
                    % of Column
                    % of Total

The table below provides just the marginal distribution for marital status.
Single    Married    Widowed    Divorced
21.19%    57.06%     10.25%     11.50%
A bar chart of the marginal distribution is shown below.

[Bar chart: percent of women by marital status (single, married, widowed, divorced).]

(c) The two conditional distributions are shown in the table below.
Age      Single    Married    Widowed    Divorced
15-24    80.55%    18.19%      0.12%      1.14%
40-64     8.64%    69.36%      4.98%     17.02%
Among the younger women, more than 4 out of 5 have not yet married, and those who are
married have had little time to become widowed or divorced. Most of the older group is or has
been married; only about 8.64% are still single. (d) Among single women, 46.93% are 15-24,
32.80% are 25-39, 17.18% are 40-64, and 3.09% are 65 or older.
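The conditional distribution in part (d) comes from dividing each entry in the Single column by that column's total. A minimal sketch, using the counts (in thousands) from part (a) with illustrative names:

```python
# Single women by age group, in thousands, from the column sum in part (a).
single = {"15-24": 10949, "25-39": 7653, "40-64": 4009, "65+": 720}

column_total = sum(single.values())
age_given_single = {
    age: 100 * count / column_total for age, count in single.items()
}

print(f"column total: {column_total}")
for age, pct in age_given_single.items():
    print(f"{age:6s} {pct:5.2f}%")
```

These are the "% of Column" entries in the single column of the Minitab output, whereas the table in part (c) uses the "% of Row" entries.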
4.54 (a) The scatterplots below show a strong nonlinear relationship for the original data (left)
and a nearly perfect, negative linear association for the transformed data (right).
[Scatterplots: height (feet) versus bounce number (left); log(height) versus bounce number
(right).]

Not only is the linear association between the log(height) and bounce stronger than the linear
association between the logarithms of both variables, but there is also a value of zero for the
bounce number, which means that the logarithm cannot be used for this point. The exponential
model is more appropriate for predicting y = height from x = bounce number. (b) The
least-squares regression line for the transformed data is $\log \hat{y} = 0.4610 - 0.1191x$. The residual plot
below shows that the first two residuals are positive and the next three residuals are negative, but
the residuals are all very small. The value of $r^2$ is 0.998, which indicates that 99.8% of the
variability in log(height) is explained by the linear relationship with bounce. This model provides
an excellent fit.

[Residual plot: residuals versus bounce number.]

(c) The inverse transformation gives a predicted height of
$\hat{y} = 10^{0.4610} \cdot 10^{-0.1191x} \approx 2.8907 \cdot 10^{-0.1191x}$. The predicted height on the 7th bounce is
$\hat{y} = 2.8907 \cdot 10^{-0.1191 \cdot 7} \approx 0.4239$ feet.
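One way to read the exponential model in part (c) is through its per-bounce ratio: each additional bounce multiplies the predicted height by the same factor, 10 raised to the slope. The sketch below uses the fitted coefficients; the names are illustrative.

```python
# Fitted line on the log scale: log10(height) = INTERCEPT + SLOPE * bounce.
INTERCEPT = 0.4610
SLOPE = -0.1191


def predicted_height(bounce):
    """Predicted height in feet after undoing the base-10 logarithm."""
    return 10 ** (INTERCEPT + SLOPE * bounce)


# Constant factor by which the predicted height shrinks from one bounce to the next.
ratio_per_bounce = 10 ** SLOPE

print(f"initial multiplier: {predicted_height(0):.4f} feet")
print(f"ratio per bounce:   {ratio_per_bounce:.4f}")
print(f"7th bounce:         {predicted_height(7):.4f} feet")
```

Under this model each bounce reaches roughly 76% of the height of the previous one.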
4.55 The lurking variable is temperature or season. More flu cases occur in winter when less ice
cream is sold, and fewer flu cases occur in the summer when more ice cream is sold. This is an
example of common response.

[Diagram: season or temperature affects both the number of flu cases reported and the amount
of ice cream sold.]

4.56 Who? The individuals are randomly selected people from three different locations. What?
The response variable is whether or not the individual suffered from CHD and the explanatory
variable is a measure of how prone an individual is to sudden anger. Both variables are
categorical, with CHD being yes or no and the level of anger being classified as low, moderate,
or high. Why? The researchers wanted to see if there was an association between these two
categorical variables. When, where, how, and by whom? In the late 1990s a random sample of
almost 13,000 people was followed for four years. The Spielberger Trait Anger Scale was used
to classify the level of anger and medical records were used for CHD. Graphs: A bar graph of
the conditional distributions of CHD for each level of anger is shown below (left). To see the
increase in the percent of individuals with CHD in each group, a separate bar graph is shown
(right). Notice how the change in scale changes your impression of the effect.
[Bar graphs: percent with and without CHD for each anger level (left); percent with CHD for
each anger level on an expanded scale (right).]

Numerical summaries: The software output below from Minitab shows the marginal
distributions, conditional distributions, and joint distribution.
Rows: CHD   Columns: Anger

              high         low    moderate         All

No             606        3057        4621        8284
              7.32       36.90       55.78      100.00
             95.73       98.30       97.67       97.76
             7.151      36.075      54.532      97.758

Yes             27          53         110         190
             14.21       27.89       57.89      100.00
              4.27        1.70        2.33        2.24
             0.319       0.625       1.298       2.242

All            633        3110        4731        8474
              7.47       36.70       55.83      100.00
            100.00      100.00      100.00      100.00
             7.470      36.700      55.830     100.000

Cell Contents:      Count
                    % of Row
                    % of Column
                    % of Total

The most important numbers for comparison are the percents of each anger group
that experienced CHD: 53/3110 ≈ 1.70% of the low-anger group, 110/4731 ≈ 2.33% of the
moderate-anger group, and 27/633 ≈ 4.27% of the high-anger group.
Interpretation: Risk of CHD increases with proneness to sudden anger. It might be good to
point out to students that results like these are typically reported in the media with a reference to
the relative risk of CHD; for example, because 4.3%/1.7% ≈ 2.5, we might read that subjects in the
high-anger group had 2.5 times the risk of those in the low-anger group.
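The relative risk mentioned in the interpretation is the ratio of two conditional percents. A short sketch, using the counts from the Minitab output and treating the low-anger group as the baseline (the names are illustrative):

```python
# (CHD cases, group size) for each anger level, from the Minitab output.
chd_counts = {"low": (53, 3110), "moderate": (110, 4731), "high": (27, 633)}


def risk(level):
    """Proportion of the anger group that experienced CHD."""
    cases, group_size = chd_counts[level]
    return cases / group_size


def relative_risk(level, baseline="low"):
    """Risk in one group divided by risk in the baseline group."""
    return risk(level) / risk(baseline)


for level in chd_counts:
    print(f"{level:9s} risk = {100 * risk(level):.2f}%  "
          f"relative risk vs. low = {relative_risk(level):.2f}")
```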
4.57 Who? The individuals are cultures of marine bacteria. What? The two quantitative
variables are x = time (minutes) and y = count (number of surviving bacteria in hundreds). Why?
Researchers wanted to see if the bacteria would decay exponentially over time when exposed to
X-rays. When, where, how, and by whom? It is not clear when or where the data were collected,
but the counts were obtained after exposing cultures to X-rays for different lengths of time.


Graphs: Scatterplots below show the original data (left) and the transformed data (right) after
taking the logarithm of count. Both plots suggest that the exponential decay model is appropriate
for these data.
[Scatterplots: count (in hundreds) versus time (minutes) (left); log(count) versus time
(minutes) (right).]

Numerical summaries: The least-squares regression line for the transformed data is
$\log \hat{y} = 2.5941 - 0.0949x$. Using the inverse transformation, the predicted count is
$\hat{y} = 10^{2.5941} \cdot 10^{-0.0949x} \approx 392.7354 \cdot 10^{-0.0949x}$. Interpretation: The residual plot below shows no
clear pattern and $r^2 = 98.8\%$, so the exponential decay model provides an excellent model for
the number of surviving bacteria after exposure to X-rays.
[Residual plot: residuals versus time (minutes).]

4.58 (a) The two-way table below was obtained by adding the corresponding entries for each
age group. The proportion of smokers who stayed alive for 20 years is 443/582 ≈ 0.7612, or
76.12%, and the proportion of nonsmokers who stayed alive is 502/732 ≈ 0.6858, or 68.58%.
         Smoker    Not
Dead      139      230
Alive     443      502
(b) For the youngest group, 269/288 or 93.40% of the smokers and 327/340 or 96.18% of the
nonsmokers survived. For the middle group, 167/245 or 68.16% of the smokers and 147/199 or
73.87% of the nonsmokers survived. For the oldest group, 7/49 or 14.29% of the smokers and
28/193 or 14.51% of the nonsmokers survived. The results are reversed when the data for the
three age groups are combined. (c) The percents of smokers in the three age groups are
288/628 × 100 ≈ 45.86% for the youngest group, 245/444 × 100 ≈ 55.18% for the middle-aged
group, but only 49/242 × 100 ≈ 20.25% for the oldest group.
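The reversal can be understood by writing each overall survival rate as a weighted average of the age-specific rates, with weights equal to the share of that group in each age category. The sketch below uses the counts from part (b); the function names are illustrative.

```python
# (survivors, group size) for each age group, from part (b).
smokers = {"youngest": (269, 288), "middle": (167, 245), "oldest": (7, 49)}
nonsmokers = {"youngest": (327, 340), "middle": (147, 199), "oldest": (28, 193)}


def overall_rate(groups):
    """Pooled survival rate: total survivors over total people."""
    alive = sum(a for a, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return alive / total


def weighted_rate(groups):
    """The same rate written as a weighted average of age-specific rates."""
    total = sum(n for _, n in groups.values())
    return sum((n / total) * (a / n) for a, n in groups.values())


for label, groups in (("smokers", smokers), ("nonsmokers", nonsmokers)):
    total = sum(n for _, n in groups.values())
    for age, (a, n) in groups.items():
        print(f"{label:10s} {age:8s} rate={a / n:.4f} weight={n / total:.4f}")
    print(f"{label:10s} overall  rate={overall_rate(groups):.4f}")
```

The oldest group, which has by far the lowest survival rate, carries a weight of about 8% among smokers (49/582) but about 26% among nonsmokers (193/732), and that pulls the nonsmokers' overall rate down.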
