Chapter 13 Solutions Develop Your Skills 13.1 1. The scatter diagram is shown below.
[Figure: Hendrick Software Sales scatter diagram. x-axis: Number of Sales Contacts; y-axis: Total Sales ($000); fitted line: y = 6.6519x + 4.7013]
The least-squares regression line is: total sales ($000) = 6.6519(number of sales contacts) + 4.7013
Interpretation: Each new sales contact results in an increase in sales of approximately $6,652. The y-intercept should not be interpreted, since the sample data did not contain any observations of 0 sales contacts.
2. The equation of the least-squares regression line is: monthly spending on restaurant meals = 0.024144(monthly income) + $44.90
Interpretation: Each new dollar in monthly income increases spending on restaurant meals by about 2.4 cents.
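As a minimal sketch of how a least-squares line is obtained and then used, the Python below computes the slope and intercept from the usual sums of squares and evaluates the two fitted lines quoted above. The toy data and all function names are hypothetical illustrations, since the exercise data sets are not reproduced in these solutions.

```python
def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical toy data (NOT from the textbook), used only to show the mechanics
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
toy_slope, toy_intercept = fit_least_squares(toy_x, toy_y)

# Exercise 1 line as quoted in the solution: predicted sales ($000) for 10 contacts
hendrick_sales_at_10 = predict(6.6519, 4.7013, 10)

# Exercise 2 slope expressed in cents per extra dollar of monthly income
restaurant_cents_per_dollar = 0.024144 * 100

print(toy_slope, toy_intercept)
print(hendrick_sales_at_10, restaurant_cents_per_dollar)
```

Evaluating the Exercise 1 line only makes sense for numbers of contacts inside the range of the sample data, which is why the intercept is not interpreted.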
3.
[Figure: Smith and Klein Manufacturing scatter diagram. x-axis: Promotion Expenditure; y-axis: Sales; fitted line: y = 30.21x - 148770]
The least-squares regression line is: annual sales = 30.21(annual promotion spending) - $148,770
Interpretation: Each new dollar in promotion spending results in an increase in annual sales of approximately $30.21. The y-intercept should not be interpreted, since the sample data did not contain any observations of $0 annual promotion spending.
4. The response variable is the semester average mark, and the explanatory variable is the total number of hours spent working during the semester. The relationship is unlikely to be positive. y = 0.1535x + 90.241 suggests that a student who worked no hours would get a mark of 90%, which seems a little high (but this intercept may not be reasonable to interpret this way, depending on the range of hours worked in the sample data). It also suggests that for each hour worked, the student's mark would increase by 0.1535, which seems unlikely. It is more likely that the student's mark would decrease for each hour worked.
5.
Because of the way the researcher has posed the question, the response variable is revenues, and the explanatory variable is the number of employees. The scatter diagram is shown below:
The least-squares regression line is: revenue (US$ millions) = 0.1338(number of full-time employees) + 140.56
Interpretation: Each additional employee results in increased revenue of US$0.1338 million (or US$133,800). The y-intercept should not be interpreted, since the sample data did not contain any observations of 0 employees.
Develop Your Skills 13.2 6. The scatter diagram showed an apparently linear relationship between software sales and the number of sales contacts (see Develop Your Skills 13.1, Exercise 1).
[Figure: residual plot. x-axis: Number of Sales Contacts]
The residual plot shows residuals centred on zero, with fairly constant variability. There is no indication that the error terms are not independent. The data were collected over a random sample of months, but the dates of collection are not included, so it is not possible to check for independence of the residuals over time. A histogram of the residuals appears to be approximately normal.
[Figure: Hendrick Software Sales Residuals histogram. x-axis: Residual; y-axis: Frequency]
A check of the scatter diagram and the standardized residuals does not reveal any outliers. There are no obvious influential observations. It appears that the sample data meet the requirements of the theoretical model. 7. The scatter diagram does not contain much of a pattern, but if there is a relationship, it appears to be linear.
[Figure: Monthly Income Residual Plot. x-axis: Monthly Income; y-axis: Residuals]
The residual plot shows a fairly constant variability, although the residuals appear to be a little larger on the positive side (except in the area of monthly incomes of around $3,500). There is no obvious dependence among the residuals.
Copyright 2011 Pearson Canada Inc.
[Figure: histogram of residuals. x-axis: Residual; y-axis: Frequency]
A check of the scatter diagram and the standardized residuals reveals six points that could be considered outliers. They are circled on the scatter diagram below.
[Figure: Spending on Restaurant Meals and Income scatter diagram, with the possible outliers circled. x-axis: Monthly Income; y-axis: Monthly Spending on Restaurant Meals]
The presence of so many outliers is a cause for concern. [If we had access to the original data set, we would check to see that these observations were accurate.] These outliers obviously increase the variability of the error terms. Even if the data points identified as outliers are correct, they are an indication that the model will probably not be very useful for prediction purposes. There are two points in the data set that may be influential observations. They are indicated in the scatter diagram below.
[Figure: Spending on Restaurant Meals and Income scatter diagram, with the possible influential observations indicated. x-axis: Monthly Income; y-axis: Monthly Spending on Restaurant Meals]
To investigate, each point is removed from the data set, to see the effect on the least-squares regression line. The least-squares line for the original sample data set was y = 0.0241x + 44.903. Without the circled point on the right-hand side, the equation changes to y = 0.0214x + 50.639, which is not that much of a change, relatively speaking. Similarly, the outlier at (1258.97, 154.68) could be having a large effect on the least-squares line. Removing it changes the equation to y = 0.0262x + 39.292, which has more of an effect. Still, neither point appears to be affecting the regression relationship by a large amount (relatively speaking).
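The check described above (drop one suspect point, refit, compare the equations) can be sketched as a leave-one-out loop. This is only an illustration on hypothetical data, since the restaurant data set is not reproduced here; the function names are assumptions, not part of the textbook or of Excel.

```python
def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def leave_one_out_fits(xs, ys):
    """Refit the line once per observation, each time omitting that observation."""
    fits = []
    for i in range(len(xs)):
        xs_without = xs[:i] + xs[i + 1:]
        ys_without = ys[:i] + ys[i + 1:]
        fits.append(fit_least_squares(xs_without, ys_without))
    return fits


# Hypothetical data (NOT the restaurant data): the last point sits far to the right
x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 20]

full_slope, full_intercept = fit_least_squares(x, y)
fits = leave_one_out_fits(x, y)
slope_without_last, intercept_without_last = fits[-1]

print("all points:        ", full_slope, full_intercept)
print("without last point:", slope_without_last, intercept_without_last)
```

A point is worth a closer look when removing it changes the slope or intercept by an amount that matters for the intended use of the model; the solution above makes that judgment informally by comparing the equations.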
However, at this point in the analysis, it would be useful to go back to the beginning. It does not appear that monthly income is a strong predictor of monthly restaurant spending. There is too much variability in the restaurant spending data, for the various income levels, for us to develop a useful model. 8. The scatter diagram shows the points arranged in a linear fashion. However, the scatter around the regression line appears to widen as the amount of promotional spending increases. This shows quite clearly in the residual plot.
[Figure: Promotion Expenditure Residual Plot. x-axis: Promotion Expenditure; y-axis: Residuals]
At this point, it is clear that the data do not meet the requirements of the theoretical model. [For completeness, we will continue to check the other requirements.]
This is time-series data, and so the residuals should be plotted against time. The resulting plot shows a definite pattern over time, with the residuals widening in more recent years. This again indicates a problem; the current model does not meet the requirements of the theoretical model.
At this point, it is clear that the model should be re-specified. Introducing time as an explanatory variable would probably be of interest.
9.
With the two erroneous data points removed, the scatter diagram looks as shown below.
[Figure: Hours of Work and Semester Marks scatter diagram. x-axis: Total Hours at Paid Job During Semester; y-axis: Semester Average Mark; fitted line: y = -0.144x + 89.175]
[Figure: residual plot. y-axis: Residuals]
The residuals appear centred on zero, with fairly constant variability, although variability seems greatest in the middle of the range of hours worked.
There is no indication that the residuals are dependent. A histogram of the residuals is shown below.
[Figure: histogram of residuals. x-axis: Residual; y-axis: Frequency]
The histogram is quite normal in shape. A check of the standardized residuals does not reveal any that are ≤ -2 or ≥ +2, although there is one observation with a standardized residual of -1.99. This is the observation (72, 65). [If we could, we would check this data point to make sure that it is accurate.] This point is quite obvious in both the scatter diagram and residual plot (the point is circled in these two graphs). There are no obvious influential observations, except perhaps for the almost-outlier. Removing this point from the data set does not affect the least-squares regression line significantly. Despite the one troublesome point, the data set does appear to meet the requirements of the theoretical model.
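The outlier screen used throughout these solutions flags observations whose standardized residual is at or beyond 2 in absolute value. A minimal sketch follows, assuming the simple convention of dividing each residual by the sample standard deviation of the residuals (other software may standardize differently). The residual values are hypothetical, and the function names are illustrative only.

```python
import statistics


def standardized_residuals(residuals):
    """Divide each residual by the sample standard deviation of the residuals.

    Assumes the residuals come from a least-squares fit with an intercept,
    so their mean is zero and no centring is needed.
    """
    sd = statistics.stdev(residuals)  # n - 1 in the denominator
    return [r / sd for r in residuals]


def flag_outliers(residuals, cutoff=2.0):
    """Return the positions of residuals whose standardized value is >= cutoff in size."""
    z = standardized_residuals(residuals)
    return [i for i, value in enumerate(z) if abs(value) >= cutoff]


# Hypothetical residuals (NOT from the textbook data)
hypothetical_residuals = [-1, 1, -1, 1, -1, 1, -1, 1, -6, 6]
flagged = flag_outliers(hypothetical_residuals)
print(flagged)
```

A hard cutoff is only a screening device: a value such as -1.99 is not flagged by it, which is why the solution above still inspects that point by eye.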
10. The relationship between revenues and number of employees appears to be linear. The residual plot is shown below.
[Figure: Full Time Employees Residual Plot. x-axis: Full Time Employees; y-axis: Residuals]
The residuals do not appear to be centred on zero, and the variability is not constant. At this point, it appears that this sample data set does not meet the requirements of the theoretical model. A histogram of the residuals is shown below.
[Figure: histogram of residuals. x-axis: Residuals]
The histogram of residuals confirms what we saw in the residual plot. The residuals are highly skewed to the right. There is one observation with a standardized residual of 3.8. The corresponding point is circled on the residual plot above.
Develop Your Skills 13.3
11. Since the sample data meet the requirements, it is acceptable to proceed with the hypothesis test.
H0: β1 = 0 (that is, there is no linear relationship between the number of sales contacts and sales)
H1: β1 > 0 (that is, there is a positive linear relationship between the number of sales contacts and sales)
α = 0.05
From the Excel output, t = 7.64. The p-value is 9.38E-08, which is very small. The p-value for the one-tailed test is only half of this value, and is certainly < α. In other words, there is almost no chance of getting sample results like these, if in fact there is no linear relationship between the number of sales contacts and sales. Therefore, we can (with confidence) reject the null hypothesis and conclude there is evidence of a positive linear relationship between the number of sales contacts and sales for the Hendrick Software Sales Company.
12. We already expect that the model will not be particularly useful. The number of data points with standardized residuals ≥ +2 or ≤ -2 is a concern. However, the hypothesis test provides some evidence that there is a linear relationship between monthly income and monthly spending on restaurant meals.
H0: β1 = 0 (that is, there is no linear relationship between monthly income and monthly spending on restaurant meals)
H1: β1 > 0 (that is, there is a positive linear relationship between monthly income and monthly spending on restaurant meals)
α = 0.05
From the Excel output, t = 4.6. The p-value on the output is 1.338E-05, and the p-value for the one-tailed test is half of this. Reject H0 and conclude there is evidence of a positive linear relationship between monthly income and monthly spending on restaurant meals.
13. Since the sample data do not meet the requirements of the theoretical model, it is not appropriate to conduct a hypothesis test.
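The step from the two-tailed p-value on the Excel output to the one-tailed p-value can be sketched as below. Halving is only valid when the sign of t agrees with the direction of H1; otherwise the one-tailed p-value is 1 minus half the two-tailed value. The function name and its arguments are illustrative assumptions, and the inputs are the t and p-values quoted in Exercises 11 and 12.

```python
def one_tailed_p(two_tailed_p, t_stat, alternative):
    """Convert a two-tailed p-value for a slope test into a one-tailed p-value.

    alternative: "greater" for H1: slope > 0, "less" for H1: slope < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    if alternative == "greater":
        sign_agrees = t_stat > 0
    else:
        sign_agrees = t_stat < 0
    half = two_tailed_p / 2
    # If the sample slope points the wrong way, the evidence for H1 is weak
    return half if sign_agrees else 1 - half


ALPHA = 0.05

p_ex11 = one_tailed_p(9.38e-08, 7.64, "greater")   # Exercise 11
p_ex12 = one_tailed_p(1.338e-05, 4.6, "greater")   # Exercise 12

print(p_ex11, p_ex11 < ALPHA)
print(p_ex12, p_ex12 < ALPHA)
```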
14. Since the sample data meet the requirements, it is acceptable to proceed with the hypothesis test.
H0: β1 = 0 (that is, there is no linear relationship between the number of hours worked during the semester and the semester average grade)
H1: β1 < 0 (that is, there is a negative linear relationship between the number of hours worked during the semester and the semester average grade)
α = 0.05
From the Excel output, t = -10.01. The p-value is 2.47086E-12, which is very small. The p-value for the one-tailed test is only half of this value, and is certainly < α. In other words, there is almost no chance of getting sample results like these, if in fact there is no linear relationship between the number of hours worked during the semester and the semester average grade. Therefore, we can (with confidence) reject the null hypothesis and conclude there is evidence of a negative linear relationship between the number of hours worked during the semester and the semester average grade.
15. Since the sample data do not meet the requirements of the theoretical model, it is not appropriate to conduct a hypothesis test.
Develop Your Skills 13.4
16. From the Excel output, R² = 0.72. This means that 72% of the variation in sales is explained by the number of sales contacts. This suggests a fairly strong linear association between the two variables, which is not surprising. Assuming the original data was collected correctly, it is possible that the other factors affecting sales have been randomized. In such a case, it would seem reasonable to conclude that increasing sales contacts would lead to increased sales. However, there will likely be limits to the positive impact that could be created. Presumably, salespeople contact their best prospective clients first, so additional contacts may not be as productive. As well, increasing the number of contacts may reduce the quantity of time spent with each contact, which could have a detrimental effect on sales.
17. The R² value for this data set is only 0.18. This is not surprising, because the scatter diagram of the relationship revealed scarcely any perceivable pattern. Only 18% of the variation in monthly spending on restaurant meals is explained by income. Earlier investigations suggested this model was not worth pursuing, and the low R² value reinforces that.
18. The R² value is fairly high, at 0.83. This means that 83% of the variation in Smith and Klein's sales is explained by sales promotion spending. However, while there is a strong association between the two variables, the linear regression model is not a good one.
19. The R² value, at 0.72, suggests that 72% of the variation in semester average marks is explained by hours spent working during the semester. (Note that this is for the amended data set, where the two erroneous grades have been removed; see Develop Your Skills 13.2, Exercise 9.) Obviously, there are many factors that affect semester average marks, for example, ability, study habits, past educational experience, and so on. If the original data were collected in a truly random fashion, these factors may have been randomized. It seems reasonable to conclude that students who work less will have more time for their studies, and it seems reasonable to think that marks improve with time spent studying. However, this data set does not guarantee that reducing work will lead to improved marks.
20. The R² value is 0.93. Notice that this value looks very promising. Remember, though, that the sample data did not meet the requirements of the theoretical model. A high R² value does not guarantee a cause-and-effect relationship, or a useful model.
Develop Your Skills 13.5
21. Since the requirements are met, it is appropriate to create a confidence interval. The Excel output is shown below.
Prediction Interval            Confidence Interval
Lower limit   Upper limit      Lower limit   Upper limit
44.96826      97.471443        66.068659     76.37104
With 98% confidence, the interval ($66,069, $76,371) contains the average sales for 10 sales contacts. 22. We have already established this is not a good model. However, even if it were a good model, we would not use it to predict monthly spending on restaurant meals based on a monthly income of $6,000. The highest monthly income in the sample data set is $4,056, and so we should not rely on our model to make predictions for a monthly income of $6,000. 23. Since the requirements are not met, it is not appropriate to create a confidence interval.
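A quick consistency check on the Exercise 21 output: both intervals should be centred on the same point prediction, and the prediction interval should be the wider of the two. The sketch below uses only the limits quoted in the Excel output and the Exercise 1 equation; the helper name is an illustrative assumption.

```python
def midpoint_and_half_width(lower, upper):
    """Return the centre and the half-width (margin of error) of an interval."""
    return (lower + upper) / 2, (upper - lower) / 2


# Limits quoted in the Excel output for Exercise 21 (sales in $000, 10 contacts)
pi_mid, pi_half = midpoint_and_half_width(44.96826, 97.471443)
ci_mid, ci_half = midpoint_and_half_width(66.068659, 76.37104)

# Point prediction from the Exercise 1 line at 10 sales contacts
y_hat = 6.6519 * 10 + 4.7013

print("prediction interval: centre", pi_mid, "half-width", pi_half)
print("confidence interval: centre", ci_mid, "half-width", ci_half)
print("point prediction:", y_hat)
```

The small gap between the interval centres and the point prediction comes from the rounded coefficients in the printed equation.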
24. The Excel output is shown below (note that this is for the amended data set, where the two erroneous grades have been removed; see Develop Your Skills 13.2, Exercise 9).
Prediction Interval            Confidence Interval
Lower limit   Upper limit      Lower limit   Upper limit
46.027952     74.74128231      58.1586403    62.61059452
With 95% confidence, the interval (58.2, 62.6) contains the mean semester average mark when students work 200 hours in paid employment during the semester.
25. Since the requirements are not met, it is not appropriate to construct a prediction interval.
Chapter Review Exercises
1. The hypothesis test is only valid if the required conditions are met. If you don't check conditions, you may rely on a hypothesis test when it is misleading.
2. Regression prediction intervals are wider than confidence intervals because the interval has to account for the distribution of y-values around the regression line. The regression confidence interval has to take into account only that the sample regression line may not match the true population regression line.
3. A lower standard error means that confidence and prediction intervals will be narrower. Predictions made with the model will therefore be more useful.
4. You should not make predictions outside the range of the sample data on which the regression relationship is based because the relationship may be very different there. For example, a linear model may provide a good approximation of a portion of a relationship that is actually a curved line. However, if the line is extended beyond this portion, it could be quite misleading.
5. It is always tempting to just remove problem data points. However, if you do this, you will often find that the remaining data points also have outliers. If you persist in the practice of removing troublesome data points, you may not have much data left! Careful thinking is a better approach. The outlier may be telling you something really important about the actual relationship between the explanatory and response variables. You wouldn't want to miss this important clue to what is really going on.
6.
[Figure: List Price and Odometer Reading for 2006 Honda Civic Sedan (as of Fall 2008) scatter diagram. x-axis: Odometer Reading; y-axis: List Price; fitted line: y = -0.0374x + 18017]
The relationship is: list price ($) = -0.0374(odometer reading in kilometres) + $18,017
For this small car, the base asking price is $18,017, which is reduced by about 3.7 cents for every kilometre on the odometer. However, note that this base asking price should not be trusted for any cars with fewer than 8,600 kilometres, since no cars in the data set had odometer readings below that.
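One way to make the caution about the sample range concrete is to have the prediction function refuse to extrapolate. The sketch below uses the fitted line and the 8,600 km lower bound stated above. The upper bound of 120,000 km is an assumption taken from the plot axis, not a confirmed sample maximum, and the function name is hypothetical.

```python
SLOPE = -0.0374            # dollars per kilometre, from the fitted line
INTERCEPT = 18017.0        # dollars, from the fitted line
MIN_ODOMETER_KM = 8_600    # lowest odometer reading in the sample, per the solution
MAX_ODOMETER_KM = 120_000  # ASSUMPTION: end of the plot axis, not a confirmed sample maximum


def predict_list_price(odometer_km):
    """Predict list price, refusing to extrapolate outside the sampled range."""
    if not MIN_ODOMETER_KM <= odometer_km <= MAX_ODOMETER_KM:
        raise ValueError(
            f"{odometer_km} km is outside the sampled range "
            f"({MIN_ODOMETER_KM} to {MAX_ODOMETER_KM} km)"
        )
    return SLOPE * odometer_km + INTERCEPT


price_at_50k = predict_list_price(50_000)
print(price_at_50k)

# A brand-new car (0 km) is outside the data, so the model declines to answer
try:
    predict_list_price(0)
    extrapolation_blocked = False
except ValueError as error:
    extrapolation_blocked = True
    print("refused:", error)
```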
7.
We have already examined the scatter diagram, which suggests a negative linear relationship. The residual plot is shown below. It has the desired appearance of constant variability, with the residuals centred on zero.
[Figure: Odometer Residual Plot. x-axis: Odometer; y-axis: Residuals]
A histogram of the residuals is shown below. The histogram is not perfectly normally-distributed, but it is approximately so.
[Figure: histogram of residuals. x-axis: Residual; y-axis: Frequency]
There are no standardized residuals ≥ +2 or ≤ -2. It appears the sample data meet the requirements of the theoretical model, and so it would be appropriate to use odometer readings to predict the list prices of these used cars. A 95% prediction interval for the list price for one of these cars with 50,000 kilometres on the odometer is ($12,683, $19,608). The Excel output is shown below.
Prediction Interval            Confidence Interval
Lower limit   Upper limit      Lower limit   Upper limit
12683.4909    19607.9242       15259.8312    17031.584
8.
A scatter diagram showing the two stock market indexes is shown below. Note that the data used are the "adjusted close" figures. You must take care to match the dates; there are a few instances when one market is open and the other is not. Observations that did not have a match were removed from the data set.
[Figure: TSX and DJI, January to June 2009, scatter diagram. x-axis: Dow Jones Industrial Average; y-axis: S&P/TSX Composite Index; fitted line: y = 1.2553x - 894.84]
The estimated relationship is as follows: TSX Composite Index = 1.255(DJI) - 895
Note that the choice of variable on the x or y axis is somewhat arbitrary here. Because Canada's economy is so dependent on exports to the US, the DJI is placed as the "explanatory" variable, but the cause and effect is not direct. 9. The coefficient of determination for the TSX and the DJI over the first six months of 2009 is 0.72. This measure suggests that 72% of the variation in the TSX is explained by variation in the DJI.
10. This data set is not a random sample, because it includes all matched observations over the period studied. Could this be considered a random sample? Probably not. The credit crisis and the recession that were having impacts on the stock markets in the first six months of 2009 made this period unreliable as a model of how the two indexes behave during more normal times. However, it is interesting to examine the patterns in the indexes over the period. The indexes were more closely related at the beginning of 2009 than they were later in the period. A time-series plot reveals this quite clearly.
[Figure: TSX and DJI, January to June 2009, time-series plot. y-axis: Index Values; series: DJI, TSX]
The required conditions are not met (as we might expect, given the graph above).
[Figure: DJI Residual Plot. x-axis: DJI; y-axis: Residuals]
[Figure: histogram of residuals. x-axis: Residual]
[Figure: Student Marks in Statistics scatter diagram. x-axis: Mark on Test #2; y-axis: Mark on Final Exam; fitted line: y = 0.9586x + 0.4464]
The estimated relationship is as follows: Mark on final exam = 0.9586 (Mark on Test #2) + 0.4464 In other words, it appears the mark on the final exam is about 96% of the mark on Test #2.
[Figure: Mark on Test #2 Residual Plot. y-axis: Residuals]
[Figure: histogram of residuals. x-axis: Residual; y-axis: Frequency]
There are no obvious influential observations or outliers. It appears that the sample data conform to the requirements of the theoretical model.
13. Since the sample data meet the requirements, it is acceptable to proceed with the hypothesis test.
H0: β1 = 0 (that is, there is no linear relationship between the mark on Test #2 and the final exam mark in Statistics)
H1: β1 > 0 (that is, there is a positive linear relationship between the mark on Test #2 and the final exam mark in Statistics)
α = 0.05
From the Excel output, t = 16.5. The p-value is 2.96E-14, which is very small. The p-value for the one-tailed test is only half of this value, and is certainly < 5%. In other words, there is almost no chance of getting sample results like these, if in fact there is no linear relationship between the mark on Test #2 and the final exam mark in Statistics. Therefore, reject H0 and conclude there is strong evidence of a positive linear relationship between the mark on Test #2 and the final exam mark in Statistics.
14a. The Excel output is shown below.
[Excel output: Prediction Interval and Confidence Interval limits]
b. c.
The 95% confidence interval estimate for the average exam mark of students who had a mark of 65% on the second test in the Statistics course is (60.5, 65). The 95% prediction interval estimate for the exam mark of a student who had a mark of 65% on the second test in the Statistics course is (51.8, 73.75). This interval is wider, because it has to take into account the variability in individual marks of the students. The regression prediction interval is always wider than the confidence interval. The prediction interval has to take account of the distribution of exam marks around the regression line.
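The reason the prediction interval is always wider can be seen from the standard formulas for simple linear regression. These are given here as a reminder in common textbook notation, which is an assumption since the solutions do not state them: x_0 is the specified x-value, s_e is the standard error of the estimate, and n is the number of observations.

```latex
% Confidence interval for the mean of y at x_0
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s_e
\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

% Prediction interval for an individual y at x_0
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s_e
\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The extra 1 under the square root represents the scatter of individual values around the regression line, so the prediction interval's margin of error exceeds the confidence interval's for every x_0.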
[Figure: Aries Car Parts scatter diagram. x-axis: Recorded Parts Inventory Value; y-axis: Auditor's Inventory Value; fitted line: y = 0.9806x + 25.233]
If the inventory records are generally accurate, we would expect the slope of the regression line to be very close to 1, as it appears to be. It appears there is a strong positive relationship between the recorded inventory value and the audited inventory value. The relationship is as follows: auditor's inventory value = 0.9806(recorded parts inventory value) + $25.23
16. As the scatter diagram created for Exercise 15 indicates, there appears to be a fairly strong positive linear relationship between the recorded and audited inventory values. The residual plot is shown below.
[Figure: Recorded Parts Inventory Value Residual Plot. x-axis: Recorded Parts Inventory Value; y-axis: Residuals]
The residual plot shows residuals fairly randomly distributed around zero, with about the same variability for all x-values. There are two residuals that show unusual variability. They are circled in the plot. The data were all collected at about the same point in time, so there is no need to check residuals against time. A review of the standardized residuals reveals two outliers, observation #1 and observation #25 (these are the two points that are circled in the residual plot). Since the auditor has realized that he misread the written records for both data points, we will amend the data, and re-do the analysis.
[Figure: Aries Car Parts scatter diagram, amended data. x-axis: Recorded Parts Inventory Value; y-axis: Auditor's Inventory Value; fitted line: y = 0.9783x + 25.227]
The new regression relationship is as follows: audited inventory value = 0.9783(recorded inventory value) + $25.23
The residual plot for the amended data set is shown below.
[Figure: Recorded Parts Inventory Value Residual Plot, amended data. x-axis: Recorded Parts Inventory Value; y-axis: Residuals]
The residual plot for the amended data set looks acceptable. A histogram of the residuals for the amended data set is shown below.
[Figure: histogram of residuals, amended data. x-axis: Residual; y-axis: Frequency]
The histogram of residuals shows some positive skewness, and this is a cause for concern, suggesting caution in the use of the model.
A check of the standardized residuals does not reveal any outliers. There are no obviously influential observations. It appears the corrected data set meets the requirements for the linear regression model, although the distribution of the residuals is not as normal in shape as is desired.
17. While we have some concern about the distribution of residuals, we will proceed with the hypothesis test.
H0: β1 = 0 (that is, there is no linear relationship between the recorded inventory values and the audited inventory values)
H1: β1 ≠ 0 (that is, there is a linear relationship between the recorded inventory values and the audited inventory values)
α = 0.05
An excerpt of Excel's regression output is shown below.
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.995213711
R Square             0.99045033
Adjusted R Square    0.990160946
Standard Error       16.61634358
Observations         35
ANOVA
              df    SS             MS             F
Regression    1     944994.372     944994.372     3422.616936
Residual      33    9111.394836    276.1028738
Total         34    954105.7668
From the Excel output, t = 58.503. The p-value is 6.47389E-35, which is very small, and certainly < 5%. In other words, there is almost no chance of getting sample results like these, if in fact there is no linear relationship between the recorded inventory values and the audited inventory values. Therefore, reject the null hypothesis and conclude there is evidence of a linear relationship between the recorded and audited inventory values.
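The figures in the Excel excerpt for Exercise 17 are tied together by a few identities, which makes them easy to cross-check. The sketch below recomputes the mean squares, F, R Square, Adjusted R Square, the standard error, and the slope's t statistic (for simple regression, t squared equals F) from the sums of squares in the output. Variable names are illustrative.

```python
import math

# Values copied from the ANOVA section of the Excel excerpt
n = 35
ss_regression = 944994.372
ss_residual = 9111.394836
ss_total = 954105.7668

df_regression = 1
df_residual = n - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual            # mean square error
f_stat = ms_regression / ms_residual

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
standard_error = math.sqrt(ms_residual)            # standard error of the estimate

# With one explanatory variable, the slope's t statistic is the square root of F
t_from_f = math.sqrt(f_stat)

print(f"MS residual    = {ms_residual:.7f}")
print(f"F              = {f_stat:.6f}")
print(f"R Square       = {r_squared:.8f}")
print(f"Adj. R Square  = {adj_r_squared:.9f}")
print(f"Standard Error = {standard_error:.8f}")
print(f"|t| for slope  = {t_from_f:.3f}")
```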
18. The coefficient of determination for the amended (corrected) data on actual and recorded inventory values for Aries Car Parts is 0.9905. This means that a little over 99% of the variation in the audited inventory values is explained by differences in the recorded inventory values. Such a strong relationship suggests confidence in the recorded inventory values. 19. The scatter diagram for these data is shown below.
[Figure: scatter diagram. x-axis: Revenue (000); y-axis: Profit (000); fitted line: y = 5.8784x + 478280]
Notice that the trendline is greatly influenced by the three data points from the three largest organizations in the data set. If we remove these observations, the scatter diagram looks as shown on the next page.
[Figure: scatter diagram with the three largest organizations removed. x-axis: Revenue (000); y-axis: Profit (000)]
The coefficient of determination for the full data set is 0.88, which is quite high. However, the measure is misleading. When the three largest data points are removed, the coefficient of determination is only 0.04, which seems more appropriate. A high coefficient of determination never guarantees that a relationship is a good model, and it certainly does not in this case.
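The effect described above, where a few very large observations inflate the coefficient of determination, is easy to reproduce. The sketch below uses hypothetical data (not the profit and revenue data from the exercise): a cluster of small points with almost no relationship, plus three large points that line up. The function name is illustrative.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    # For one explanatory variable, R squared is the squared correlation
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data (NOT from the textbook)
small_x = [1, 2, 3, 4, 5, 6]        # many small organizations, no clear pattern
small_y = [3, 1, 4, 1, 5, 2]
large_x = [100, 200, 300]           # three very large organizations
large_y = [600, 1200, 1700]

r2_full = r_squared(small_x + large_x, small_y + large_y)
r2_trimmed = r_squared(small_x, small_y)

print(f"R squared, all points:            {r2_full:.3f}")
print(f"R squared, large points removed:  {r2_trimmed:.3f}")
```

In this toy case the high value for the full data set mostly reflects the distance between the small cluster and the large points, not a relationship that holds within the cluster.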
[Figure: scatter diagram. x-axis: Final Overall Average Grade]
It appears there is a positive linear relationship between the final overall average grade and the score on the test given during the job interview. The regression relationship is as follows: score on test given during job interview = 0.6421(final overall average grade) + 4.98 This is promising. Since the grades are marked out of 100, and the test scores are out of 70, the slope would be 0.70 if the relationship was perfect.
21. As discussed in Exercise 20 above, there appears to be a positive linear relationship between the final overall average grade and the score on the test given during the job interview. The residual plot is shown below.
[Figure: Final Average Mark Residual Plot. x-axis: Final Average Mark; y-axis: Residuals]
The residuals appear randomly distributed around zero, with the same variability for all x-values. A histogram of the residuals is shown below.
[Figure: histogram of residuals. x-axis: Residual; y-axis: Frequency]
There are no outliers or obviously influential observations in the data set. It appears these data meet the requirements for the linear regression model.
22. Since the requirements are met, it is appropriate to test for a positive linear relationship.
H0: β1 = 0 (that is, there is no linear relationship between the final average mark and the score on the test given during the job interview)
H1: β1 > 0 (that is, there is a positive linear relationship between the final average mark and the score on the test given during the job interview)
α is not given.
We are provided with only an excerpt of Excel output. However, we know that
We can approximate the p-value using a t-table, with n-2 = 28 degrees of freedom. Since t0.005 = 2.763, we know p-value is considerably less than 0.005. In other words, there is almost no chance of getting sample results like these, if in fact there is no linear relationship between the overall average mark of the graduate and the company test scores. Therefore, we can (with confidence), reject the null hypothesis and conclude there is evidence of a positive linear relationship. 23. Since the requirements are met, it is appropriate to create a confidence interval estimate. The Excel output is shown below.
Point Number
Prediction Interval            Confidence Interval
Lower limit   Upper limit      Lower limit   Upper limit
43.9917459    62.278853        51.4810585    54.78954
With 98% confidence, we estimate that the interval (51.5, 54.8) contains the average test score of graduates with an overall average mark of 75.
24. Refer back to the output shown above in the solution to Exercise 23. With 98% confidence, we estimate that the interval (44.0, 62.3) contains the test score of a student with an overall average mark of 75. It is difficult to decide if the company should continue to administer its own test. The answer depends on how reliable a predictor of future performance the test has been, and what the costs of administering the tests have been. If the company test makes a major distinction between the predicted performance of someone with a test score of 44 and someone with a test score of 62, then the overall average grade may not be a good substitute. However, there is a fairly strong relationship between the two variables. Perhaps the company could pilot using the overall average grade with a random sample of graduates, to see how well they do.
25. No, it would not be appropriate to use package weight as a predictor of shipping cost. We can see from the residual plot that variability increases as package weight increases.
26. It is often suggested that the Canadian stock market is very closely tied to the price of oil. A data set of weekly values for the Toronto Stock Exchange Composite Index (TSX) and the Canadian spot price of oil in dollars per barrel for the period from January 2000 to June 2009 was examined. The scatter diagram (shown below) suggests that while there may be a relationship between the two variables, it is not linear.
[Figure: scatter diagram. x-axis: Weekly Canadian Par Spot Price (Dollars per Barrel); y-axis: S&P/TSX Composite Index]
[Figure: residual plot. x-axis: Weekly Canadian Par Spot Price FOB (Dollars per Barrel); y-axis: Residuals]
[Figure: Residuals Over Time, TSX and Oil Price Model. x-axis: date; y-axis: Residual]
There appears to be a time-related pattern in the residuals. This is also apparent in the patterns of extreme residuals (those with standardized residuals ≥ +2 or ≤ -2). Predictably, they occur in August 2000, January to July 2007, July 2008 and September-October 2008. While the model could probably be improved by the addition of a time variable, it is not clear how this could be used for predictive purposes. It would probably be more useful to investigate what other explanatory variables were affecting the stock market over this period. As well, non-linear models could be explored.