
STATISTICS FOR BUSINESS I

STAT 371 Course Notes

SPRING 2011

JOCK MACKAY
rjmackay@uwaterloo.ca


Index

Chapter 1   The Need for Statistics in Business
Chapter 2   Models linking explanatory and response variates
Chapter 3   Making Inferences from Regression Models
Chapter 4   The Analysis of Variance
Chapter 5   Assessing Model Fit
Chapter 6   Model Building
Chapter 7   Sample Survey Issues
Chapter 8   Probability Sampling
Chapter 9   Ratio and Regression Estimation with SRS
Chapter 10  Stratified Random Sampling
Appendix 1  R
Appendix 2  Properties of vectors and matrices of random variables
Appendix 3  Gaussian Quantile-Quantile Plots
Statistical Tables
Solutions to Exercises
Old Midterms and Exams

Please email me with any errors or points of clarification. These notes are a work in progress.

Data Sets

You can download all data sets in the notes and exercises from the file stat371.zip on the Angel course web page. You can access the individual files at the same site.


Chapter 1  The Need for Statistics in Business

"There is no substitute for knowledge." (W. Edwards Deming)
"The greatest obstacle to discovery is not ignorance; it is the illusion of knowledge." (Daniel Boorstin)

The purpose of Stat 371 and 372 is to provide a unified set of strategies and tools to apply Statistical Method in business and industry. In particular, the goal is to learn how to:
- pose clear questions
- collect the right data efficiently and effectively (a good plan)
- provide useful conclusions
- communicate the conclusions, and the method by which they are reached, to a non-technical audience

Statistics, or better Statistical Method, is a powerful, widely applicable process that we can use to learn about business processes and markets (populations). Statistical Method is empirical, that is, based on observational and experimental investigations. By collecting and analyzing the right data, we can increase our knowledge of the market, the products and services we produce (and plan to produce) and the processes we use in this production. We may then use this knowledge to make better decisions to improve the business.

Example 1  The maker of frost-free refrigerators in temperate New Zealand decided to expand their market to tropical south-east Asia. There were immediately numerous complaints about frost build-up in the fridges from the new market. The company interviewed 25 recent purchasers in each of the two markets and found that there were large differences in ambient environmental conditions (temperature and humidity) and usage (frequency of door openings, amount of food introduced at one time) in the two markets (investigation 1). They were convinced that these factors were the cause of the frost build-up in the tropical market.

To solve the problem, they decided to try to redesign the fridge to make it more robust to ambient environmental conditions and usage factors. In an experimental investigation, they built 8 prototype fridges in which four design inputs were changed simultaneously. They then tested each prototype under two conditions defined by the extremes of the environmental and usage factors. The response variate was the temperature of the cooling plate in the fridge after 30 minutes of operation; low constant values mean that there will be no frost build-up. The experimental plan and data are:

Treatment   D1         D2         D3         D4         Normal   Extreme
1           new        new        new        new          0.7      2.1
2           new        new        original   original     2.9      4.8
3           new        original   new        original     2.4      9.6
4           new        original   original   new          3.8      5.9
5           original   new        new        original     1.9      4.0
6           original   new        original   new         -0.2      0.1
7           original   original   new        new         -0.1      3.5
8           original   original   original   original     0.2      7.2

(The last two columns give the cooling plate temperature under the Normal and Extreme environmental conditions.)
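If you want to examine the treatment comparison yourself, the data are small enough to type directly into R; a minimal sketch (the variable names below are ours, not from the course data sets):

normal  <- c(0.7, 2.9, 2.4, 3.8, 1.9, -0.2, -0.1, 0.2)   # cooling plate temperature, normal conditions
extreme <- c(2.1, 4.8, 9.6, 5.9, 4.0, 0.1, 3.5, 7.2)     # extreme conditions
plot(1:8, extreme, pch=2, ylim=range(c(normal, extreme)),
     xlab='treatment', ylab='cooling plate temperature')
points(1:8, normal, pch=1)
# treatment 6 has low values under both conditions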

Looking at the data, we can see that there are several promising designs (e.g. treatment 6). After further analysis and a review of the costs, the company adopted the combination in treatment 6 as the new design. The complaints about frost build-up disappeared.

Example 2  Municipal taxes in Ontario are based on the market value of the property. Where possible, the market or assessed value is determined by predicting the market value of the property using the prices from recent sales of comparable properties. A property owner may choose to appeal the assessed value. A large company felt that the assessed value of its very large property (an automobile assembly plant) was too high. To argue their case, they collected data on 38 large plants that had been sold in the last 10 years throughout Canada and the USA. The first few records are:
size (sq ft/10^6)   age (years)   percent office   build/land ratio   location   value ($/sq ft)
0.848               35            5.8              26.6               usa        4.32
1.813               37            3.2              17.3               usa        6.74
1.297               50            19.0             45.1               usa        6.36
1.747               23            10.2             13.3               usa        5.95

The idea was to predict the value of the building in question using a model constructed from the data and the known values of the explanatory variates size, age etc. Here the prediction was a failure, as there were many problems with the data and how it was collected.

We use PPDAC (Problem, Plan, Data, Analysis, Conclusion) to describe Statistical Method, the process we use to learn empirically. The purpose of each stage is:

Problem:    Develop clear questions about attributes of the population/process of interest
Plan:       Develop a plan to answer the questions posed
Data:       Execute the Plan to collect the required data
Analysis:   Analyze the data based on the Plan and a model to address the question
Conclusion: Answer the questions and report uncertainties and limitations

The following should remind you of the language of PPDAC and how we apply the process.

[Diagram: the language of PPDAC. Target Population → Study Population → Sample → Measured variate values; Conclusions are drawn back to the Target Population through a (model-based) analysis.]

PPDAC is a process that we use to plan and execute empirical investigations so that we get reliable conclusions at a reasonable cost. There must be a good reason to undertake the investigation in the first place, and resolve to take action and make decisions based on the Conclusions. Governments are famous for avoiding decisions by saying that another study is required.

The two courses are organized by the nature of the Plan and the models used in the Analysis. In Stat 371 we concentrate on applications of regression models and sample surveys. In Stat 372, we look at issues of data collected over time (time series, control charting) and the use of experimental plans.

Exercises

1. (A true story, believe it or not) To improve the shifting of the transmission, an automobile manufacturer organizes a clinic in which about 100 people evaluate the feel of 6 transmissions on different models from low to (very) high cost. Each person is asked to rate each transmission on several dimensions. The idea is to use the data to help design a new transmission that will have good feel, to improve the perceived quality of the vehicle and hence improve sales/market share. To save money in organizing the clinic, the company uses the engineers at its development center, of which 90% are males under the age of 35. What changes to this plan would you recommend? Why?

2. Write a brief description of the 6 Sigma program. Where does Statistical Method fit in 6 Sigma? What advantages and disadvantages can you see in an organization adopting such a program?

3. What is a software usability trial? What are two key issues in the design of such a trial? How does Statistical Method fit into a usability trial?

4. Give two examples of how you might use Statistical Method in market research.


Chapter 2  Models linking explanatory and response variates

In this chapter, we look at regression models and how to fit them to a set of data. Suppose we have a set of n units selected from a population and, for each unit i, we have the values of the response variate $y_i$ and p explanatory variates $x_{i1}, x_{i2}, \ldots, x_{ip}$. The statistical problem is to fit a regression model of the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + r_i$$

where the parameters $\beta_0, \beta_1, \ldots, \beta_p$ and the residuals $r_i$, $i = 1, \ldots, n$ are unknown. There are many applications of such models. We give three here.

Example 1  The CAPM model is used to measure the risk of a single asset relative to that of a portfolio. For example, suppose we want to assess the risk of an IBM share relative to the S&P 500 index. The theoretical CAPM model describes the excess return (actual return minus risk-free return) for an IBM share over a period of time as a constant times the excess return of the portfolio. If we model the excess IBM return as a random variable $Y$ and the portfolio excess return as a random variable $X$, then we have $Y = \beta X$ and $\beta = \frac{\mathrm{stdev}(Y)}{\mathrm{stdev}(X)}$. That is, the parameter $\beta$ measures the relative volatility of the IBM excess return. The common interpretation is that $\beta > 1$ corresponds to an asset riskier than the portfolio. In many empirical applications, percentage returns are collected for the asset ($y_i$) and the portfolio ($x_i$) over a number of periods (e.g. days, months) and a linear model of the form $y_i = \beta_0 + \beta_1 x_i + r_i$ is fit to the data. The risk-free return is not included in the model.

The month over month returns from Jan 2001 to March 2003 for IBM and the S&P 500 are given in the file IBM.txt. The variate names are sp.ret and ibm.ret. We see a scatterplot of the data on the next page, created with the R code

plot(sp.ret, ibm.ret, xlab='SP500 return', ylab='IBM return',
     main='IBM vs. S&P 500 Monthly Returns')

From the plot, the model should provide a reasonable fit to the data. The purpose of this modeling is to estimate an attribute of the population of monthly returns. There are many issues about the time period (months) and the sampling period. Note we could fit a model that includes the risk-free return if the data were available; since this return is small, measured on a monthly basis, the fit will not change markedly.

Example 2  In Chapter 1, we introduced the problem of determining the market value of a property that is not sold, using known explanatory variates and the market values of similar properties that were actual sales. The data are in the file assessment.txt. There are 38 units (large sales) with 5 explanatory variates (size, age, office, ratio and location) and the response variate value ($ per square ft). In this example, we first fit a regression model using the data from the actual sales. We then use the model to predict the market value of the property that was not sold. The values of the explanatory variates for the unsold building are size = 13,825, age = 21, percent office = 3.8, building/land ratio = 53, location = 0 (Canada). There are issues about which properties to include in the data set and which explanatory variates to include in the model. In Ontario, there is a private organization that makes extensive use of regression modeling to provide market values to municipalities for all properties; these values provide the basis for property taxes. There are many applications of regression where the object is to predict the unknown response variate for a given set of values of the explanatory variates.

Example 3  A service organization has 24 offices. In the planning of an audit, the accountant looks at the stated overhead from the current and past year for each office. He also has access to the office size and age, the number of employees and clients, and the relative cost of living in the city where the office is located. The data are in the file analytic.txt. The auditor plans to fit a model relating overhead to the explanatory variates in order to look for outliers, offices for which the relationship is different. He will devote more audit resources to any such office. This is an example of an analytic method in auditing. Another similar application is to look at salaries of employees relative to the work they do. The goal is to look for exceptional cases for which the relationship between the explanatory variates and the response is very different.

Fitting the Model: Least Squares

By "fitting the model", we mean that we estimate the unknown model parameters using the data. To do so, we represent the data model in terms of vectors and matrices. Let $y$ be an $n \times 1$ column vector containing the response variate values, $x_j$ a column vector containing the values of the jth ($j = 1, \ldots, p$) explanatory variate and $r$ a column vector of the unknown residuals. Also let $\mathbf{1} = (1, \ldots, 1)^t$ be a column vector of n 1s and $X = (\mathbf{1}, x_1, \ldots, x_p)$ an $n \times (1+p)$ matrix with columns corresponding to the explanatory variates. Finally, let $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^t$ be a $(1+p) \times 1$ column vector of the unknown coefficients. Then we can write the model in terms of these vectors as


$$y = \beta_0\mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + r = (\mathbf{1}, x_1, \ldots, x_p)\beta + r$$

or more compactly as

$$y = X\beta + r$$

We have written $y$ as the sum of two vectors. We can picture the model in $R^n$ as shown below.

[Figure: the vector $y$ and the vector $\beta_0\mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p$ lying in the subspace span$(\mathbf{1}, x_1, \ldots, x_p)$]

The span$(\mathbf{1}, x_1, \ldots, x_p)$ is the subspace of $R^n$ spanned by the columns of $X$. We assume that this subspace has dimension $p+1$ or, equivalently, that the columns of $X$ are linearly independent.

To fit the model, we use least squares. That is, we find the value for $\beta$ that minimizes the function

$$W(\beta) = \sum_i (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 = \|y - X\beta\|^2 = \|r\|^2$$

To minimize the squared length of $r$, we project $y$ orthogonally onto span$(\mathbf{1}, x_1, \ldots, x_p)$ as shown below.

[Figure: $y$ and its orthogonal projection $\hat\beta_0\mathbf{1} + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$ onto span$(\mathbf{1}, x_1, \ldots, x_p)$]

The estimated residual vector $\hat r = y - X\hat\beta$ is orthogonal to span$(\mathbf{1}, x_1, \ldots, x_p)$ or, equivalently, to every column of $X$. That is, we have

$$\mathbf{1}^t\hat r = 0,\quad x_1^t\hat r = 0, \ldots, x_p^t\hat r = 0$$

We can write these equations more compactly as $X^t\hat r = 0$. Substituting for $\hat r$, we get $X^t(y - X\hat\beta) = 0$ and, after rearrangement, $\hat\beta = (X^tX)^{-1}X^ty$. Note that $X^tX$ has an inverse because we assume that $X$ has full rank (i.e. $p+1$ linearly independent columns). We label the projection (called the vector of fitted values) as $\hat\mu$, so

$$\hat\mu = \hat\beta_0\mathbf{1} + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p = X\hat\beta = X(X^tX)^{-1}X^ty = Hy$$

and the estimated residual is $\hat r = y - X\hat\beta = (I-H)y$, where the matrix $H = X(X^tX)^{-1}X^t$ is called the hat-matrix and is the projection onto the subspace span$(\mathbf{1}, x_1, \ldots, x_p)$. H has several interesting properties - see the exercises. Note that we have decomposed the vector $y = Hy + (I-H)y$ into two orthogonal components.
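The algebra above is easy to check numerically. A sketch, using artificial data (none of these names appear in the course files), that computes $\hat\beta$, $H$, $\hat\mu$ and $\hat r$ directly and compares the result with lm():

set.seed(1)
x1 <- runif(20); x2 <- runif(20)
y  <- 1 + 2*x1 - x2 + rnorm(20)
X  <- cbind(1, x1, x2)                    # the n x (1+p) matrix (1, x1, x2)
beta.hat <- solve(t(X)%*%X) %*% t(X)%*%y  # (X^t X)^{-1} X^t y
H      <- X %*% solve(t(X)%*%X) %*% t(X)  # the hat matrix
mu.hat <- H %*% y                         # fitted values Hy
r.hat  <- y - mu.hat                      # estimated residuals (I-H)y
max(abs(t(X) %*% r.hat))                  # essentially 0: r-hat is orthogonal to the columns of X
coef(lm(y ~ x1 + x2))                     # agrees with beta.hat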
Example We use R to fit the empirical CAPM model to the IBM returns vs S&P 500 returns. The following code produces the given output.

a <- read.table('IBM.txt', header=TRUE)
attach(a)
b <- lm(ibm.ret ~ sp.ret)
summary(b)
fitted(b)
plot(sp.ret, ibm.ret, main='IBM monthly return vs S&P 500 monthly return')
abline(b)


The output is:

Call:
lm(formula = ibm.ret ~ sp.ret)

Residuals:
    Min      1Q  Median      3Q     Max
-10.001  -5.876  -1.023   4.834  19.633

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.066      1.334   0.799    0.431
sp.ret         1.742      0.255   6.832 2.01e-07 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 7.261 on 28 degrees of freedom
Multiple R-Squared: 0.625,  Adjusted R-squared: 0.6117
F-statistic: 46.68 on 1 and 28 DF,  p-value: 2.013e-07

          1           2           3           4           5           6
  1.6636134   3.8917891   3.0381487   9.9334725  15.1842320   2.5224802
          7           8           9          10          11          12
 -1.8955444  -3.7090947  -9.4441644  11.0083625  16.1267206 -18.1007748
         13          14          15          16          17          18
  1.9162213 -12.6967085 -11.5573599  -0.5157827  -9.6340558   7.4666260
         19          20          21          22          23          24
 -2.5523248  -1.6464207   2.3848525  14.1633477   4.2193082 -13.1705660
         25          26          27          28          29          30
-10.1026870  -0.8102016  -3.2892430   1.9528059  14.4473138 -10.1183661


Notes on the R code:

1. The function lm(y~x1+x2+...+xp) fits the regression model that includes a constant term. The constant term can be omitted using lm(y~-1+x1+x2+...+xp).

2. The output of the function is assigned to the model object b <- lm(y~x1+x2+...+xp). We can look at the contents of the model object b with the commands:
   summary(b): the table of estimated coefficients and related statistics
   fitted(b): the vector of fitted values, in the same order as the original data
   resid(b): the estimated residuals
   anova(b): an Analysis of Variance table
   coefficients(b): the estimated coefficients

3. abline(b) adds the fitted line to the scatterplot when there is a single explanatory variate.

To interpret the output, we note that $\hat\beta_0 = 1.066$ and $\hat\beta_1 = 1.742$. Since the estimated slope is greater than 1, we know that the IBM share is more volatile than the market as defined by the S&P 500. We will interpret most of the other statistics when we look at formal inference procedures for the corresponding response (probability) model.

R-squared (usually written $R^2$) is defined as


$$R^2 = 1 - \frac{\text{residual sum of squares from the fitted model}}{\text{residual sum of squares from the model with only a constant term}} = 1 - \frac{\|\hat r\|^2}{\|y - \bar y\mathbf{1}\|^2}$$

where $\bar y$ is the sample average of the response variate.

- $R^2$ is always between 0 and 1 and is often quoted as a percentage.
- $R^2$ is 1 when the length of the residual vector $\hat r$ is 0, i.e. $y$ lies in span$(\mathbf{1}, x_1, \ldots, x_p)$.
- $R^2$ is 0 when $\hat r = y - \bar y\mathbf{1}$, i.e. $\hat\beta_0 = \bar y$ and $\hat\beta_1 = 0, \ldots, \hat\beta_p = 0$.

In other words, if we can write $y$ as a linear combination of $\mathbf{1}, x_1, \ldots, x_p$, then $R^2$ is 1, and if the fitted model does not involve $x_1, \ldots, x_p$, then $R^2$ is 0. In some sense, $R^2$ measures how well the model fits the data, but you need to be very careful with this interpretation; see Exercise 7. Another interpretation is that $100R^2$ is the percentage of the variation in the response variate explained by the explanatory variates. In the example, we can say that the S&P 500 returns explain 62.5% of the variation in the IBM returns. But note that neither the numerator nor the denominator is the usual measure of variation in the response variate. Again, we need to be sure to explain what this means.
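As a check on the definition, $R^2$ can be computed from the residuals directly; a sketch, assuming IBM.txt has been read and attached as in the example above:

b   <- lm(ibm.ret ~ sp.ret)
rss <- sum(resid(b)^2)                   # ||r-hat||^2, from the fitted model
tss <- sum((ibm.ret - mean(ibm.ret))^2)  # ||y - ybar 1||^2, constant-only model
1 - rss/tss                              # 0.625, the Multiple R-Squared in summary(b)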
Exercises

1. We use this artificial example to help you review the basic concepts of fitting a model using least squares. The data are shown below.
x1     x2     y
2.6    3.5    3.7
4.8    1.9    11.8
3.1    2.8    6.5
3.4    5.8    3.5
2.1    11     -6
1.6    4.8    1.6
10.3   5.9    16.8
1.7    1.1    4.8
3      5.2    3.7
2.9    2.2    7.8

Using R to fit the model $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + r_i$, $i = 1, \ldots, 10$, we get the summary in the following text box.
a) What are the estimates $\hat\beta$?
b) Calculate $\hat\mu_1$ and $\hat r_1$.


Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max
-1.09907 -0.28650  0.03148  0.52948  0.87919

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.87190    0.56017   6.912 0.000229 ***
x1           2.01086    0.09959  20.192 1.83e-07 ***
x2          -1.26481    0.08845 -14.300 1.95e-06 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7581 on 7 degrees of freedom
Multiple R-Squared: 0.9879,  Adjusted R-squared: 0.9844
F-statistic: 285.3 on 2 and 7 DF,  p-value: 1.958e-07

2. Use R to fit the model $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + r_i$ to the assessment data (assessment.txt) with
a) all 5 explanatory variates
b) only age and size
c) Do the estimated coefficients change? Why?

3. Suppose we have the returns on an asset $y_i$, the return on the market $x_{i1}$ and the risk-free return $x_{i2}$ for n periods. Consider three regression models:
Model 1: $(y_i - x_{i2}) = \beta(x_{i1} - x_{i2}) + r_i$
Model 2: $y_i = \beta_0 + \beta_1 x_{i1} + r_i$
Model 3: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + r_i$
When we fit each model, will the coefficient of $x_1$, the measure of volatility, change? Explain.

4. Suppose we have a response variate $y_i$ and a single explanatory variate $x_{i1}$ for each of n units sampled from a population. Consider the two models
Model 1: $y_i = \beta_0 + \beta_1 x_{i1} + r_i$
Model 2: $y_i = \beta_0 + \beta_1(x_{i1} - \bar x_1) + r_i$
where $\bar x_1$ is the sample average of the explanatory variate.
a) Show that the vectors $x_1 - \bar x_1\mathbf{1}$ and $\mathbf{1}$ are orthogonal.


b) Why is span$(\mathbf{1}, x_1)$ = span$(\mathbf{1}, x_1 - \bar x_1\mathbf{1})$?
c) In fitting models 1 and 2, we project onto a subspace. How are those projections different?
d) What is the relationship between the estimated coefficients in fitting the two models?
e) How does the result in a) simplify the calculation of $\hat\beta$ when fitting model 2?

5. We defined the hat matrix $H = X(X^tX)^{-1}X^t$, the projection onto span$(\mathbf{1}, x_1, \ldots, x_p)$. Show that
a) $H^t = H$
b) $H^2 = H$. Interpret this result geometrically.
c) $(I-H)^2 = (I-H)$ and $H(I-H) = 0$
d) $0 \le h_{ii} \le 1$, where $h_{ii}$ is the ith diagonal element of H.

6. Some questions about $R^2$:
a) In question 2, which model gave a larger value for $R^2$?
b) Show that $R^2$ cannot decrease if we add extra terms to a model.

7. The data in the file anscombe.txt were produced by F.J. Anscombe (American Statistician 27, 17-21) to demonstrate the difficulty of using $R^2$ as a measure of fit and the importance of plotting the data. The file contains 4 sets of $(x, y)$ vectors, labeled x1-x4 and y1-y4.
a) For each pair, fit a straight line model and report the estimated parameters and the coefficient of determination $R^2$.
b) For each pair, construct a scatterplot of y versus x and add the fitted line.
c) Comment.


Chapter 3  Making Inferences from Regression Models

In this chapter, we look at formal inference procedures such as hypothesis tests, confidence intervals and prediction intervals for regression models. We use these procedures to help answer questions of interest such as:
- Is there evidence that IBM returns are more volatile than the S&P 500 index?
- What is a range of plausible values for the value of an unsold property, based on the values of its explanatory variates?

To start, we consider a statistical model to describe the repeated application of the Plan. This model uses random variables to replace the response variate values and residuals in the data model. Suppose we have a set of n units selected from a population and, for each unit i, we have the values of the response variate $y_i$ and p explanatory variates $x_{i1}, x_{i2}, \ldots, x_{ip}$. A statistical regression model is

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + R_i, \quad R_i \sim G(0, \sigma), \quad i = 1, \ldots, n \text{ independent}$$

Note that, in the model:
- $E(Y_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$, so we can interpret $\beta_j$ as the change in $E(Y_i)$ when the jth explanatory variate changes by 1 unit with all other explanatory variates held fixed. If $x_j$ is continuous, we can interpret $\beta_j$ in terms of the partial derivative $\frac{\partial E(Y_i)}{\partial x_j} = \beta_j$, the rate of change of $E(Y_i)$ as $x_j$ changes, again with all other explanatory variates held fixed.
- $\mathrm{stdev}(Y_i) = \mathrm{stdev}(R_i) = \sigma$ is constant.
- We treat the explanatory variates as constants (not random variables) in the model.

We can combine the n independent gaussian random variables $R_1, \ldots, R_n$ into a vector $R \sim N(0, \sigma^2 I)$. The column vector 0 gives the component means. The variance-covariance matrix $\sigma^2 I$ gives the variance $\sigma^2$ of $R_i$ in the ith diagonal position and the covariance $\mathrm{Cov}(R_i, R_j) = 0$ in the ijth position (see Appendix 2). We write the model more compactly as

$$Y = X\beta + R, \quad R \sim N(0, \sigma^2 I)$$

so $Y \sim N(X\beta, \sigma^2 I)$.

We use the model to describe how the estimates would behave if we were to repeat the Plan over and over. The estimator $\tilde\beta = (X^tX)^{-1}X^tY$ (a $(1+p) \times 1$ vector of random variables) describes the behaviour of $\hat\beta$. Using the properties of expectation and variance of linear combinations of random variables (Appendix 1), we have

$$E(\tilde\beta) = E((X^tX)^{-1}X^tY) = (X^tX)^{-1}X^tE(Y) = (X^tX)^{-1}X^tX\beta = \beta$$

$$\mathrm{Var}(\tilde\beta) = \mathrm{Var}((X^tX)^{-1}X^tY) = [(X^tX)^{-1}X^t]\,\sigma^2 I\,[(X^tX)^{-1}X^t]^t = \sigma^2(X^tX)^{-1}$$

and, using the properties of the multivariate normal distribution, $\tilde\beta \sim N(\beta, \sigma^2(X^tX)^{-1})$. That is, each component of $\tilde\beta$ is gaussian, i.e. $\tilde\beta_j \sim G(\beta_j, d_j\sigma)$, where $d_j$ is the square root of the jth diagonal element of $(X^tX)^{-1}$. Note that the components of $\tilde\beta$ are not independent unless $(X^tX)^{-1}$ is diagonal or, in other words, the columns of X are orthogonal.

To estimate $\sigma$, we use the sum of squares of the estimated residuals divided by the degrees of freedom:

$$\hat\sigma^2 = \frac{\sum_i \hat r_i^2}{n-(p+1)} = \frac{\|\hat r\|^2}{n-(p+1)} = \frac{\|(I-H)y\|^2}{n-(p+1)}$$

The corresponding estimator is

$$\tilde\sigma^2 = \frac{\|\tilde r\|^2}{n-(p+1)} = \frac{\|(I-H)Y\|^2}{n-(p+1)} = \frac{\|(I-H)R\|^2}{n-(p+1)}$$

since $(I-H)X = 0$. Note that $E(\|(I-H)R\|^2) = [n-(p+1)]\sigma^2$ (see the exercises), which partially justifies the denominator. We also have the unproven result that

$$\tilde\sigma/\sigma \sim K_{n-(p+1)}$$

We can also easily show that $\mathrm{Cov}(\tilde\beta, \tilde r) = 0$, so that $\tilde\beta$ and $\tilde r$ are statistically independent (see the exercises). Since $\tilde\sigma$ is a function of $\tilde r$, it then follows that

$$\frac{\tilde\beta_j - \beta_j}{d_j\tilde\sigma} \sim t_{n-(p+1)}$$

We use this t-distribution to test hypotheses and find confidence intervals for the individual parameters $\beta_j$. Tables for the t-distribution are given in Appendix 3.
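The standard errors reported by summary() are exactly the quantities $d_j\hat\sigma$ above, so the result can be checked numerically; a sketch for the CAPM example (vcov() returns the estimated variance-covariance matrix $\hat\sigma^2(X^tX)^{-1}$):

b <- lm(ibm.ret ~ sp.ret)
X <- model.matrix(b)                      # the matrix (1, x)
sigma.hat <- summary(b)$sigma             # the residual standard error
sigma.hat * sqrt(diag(solve(t(X)%*%X)))   # d_j * sigma-hat: the Std. Error column
sqrt(diag(vcov(b)))                       # the same values, via vcov()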
Example 1  We fit an empirical CAPM model to a series of monthly returns from an IBM share versus the corresponding returns of the S&P 500 index in the file IBM.txt. The summary R output is given in Chapter 2. Note that the estimate of $\sigma$ is called the residual standard error in the summary; in this case, $\hat\sigma = 7.261$, and the standard error of $\hat\beta_1$ is $d_1\hat\sigma = 0.255$.

One question of interest is to see if $\beta_1$ is different from 1. In words, is the volatility of an IBM share different from that of the index? We consider a test of the hypothesis $\beta_1 = 1$. We use the same 5-step procedure as in the beloved Stat 231.

Step 1: (Formulate) Suppose $\beta_1 = 1$.

Step 2: (Estimate) We have $\hat\beta_1 = 1.742$ from the R output.

Step 3: (Calculate the discrepancy measure) The distance from the estimated parameter to the hypothesized value is

$$d = \frac{|\hat\beta_1 - 1|}{d_1\hat\sigma} = \frac{|1.742 - 1|}{0.255} = 2.91$$

Note that the denominator is called the standard error of $\hat\beta_1$. The standard error is the estimated standard deviation of the corresponding estimator $\tilde\beta_1$ and is given in the R output for each estimated coefficient.

Step 4: (Calculate the p-value) To assess if this distance is large or small, we calculate the p-value

$$\Pr(|t_{28}| \ge 2.91) = 0.007$$

The degrees of freedom correspond to the denominator in the calculation of $\hat\sigma$. The p-value is the chance that we get such a large discrepancy between the estimated and hypothesized value if the hypothesis is true. We can calculate this probability in R with the command 2*(1-pt(2.91,28)).


Step 5: (Interpret) Since the p-value is so small (less than 0.01), we say that there is strong evidence that $\beta_1$ is different from 1. The conclusion in the example is that there is strong evidence that the volatility of the IBM share is different from that of the index. More generally, we interpret a p-value according to the table:

Range of p-value           Interpretation
greater than 0.10          no evidence against the hypothesis
between 0.05 and 0.10      weak evidence against the hypothesis
between 0.01 and 0.05      some evidence against the hypothesis
less than 0.01             strong evidence against the hypothesis

We can also summarize our knowledge of $\beta_1$ using a confidence interval. The confidence interval shows us how precisely we have estimated the parameter. Recall that the general form of a confidence interval (based on a t-distribution) is

estimate ± c × standard error(estimate)

where the constant c is chosen from the t-tables so that $\Pr(-c \le t_{df} \le c)$ is the confidence level. In the example, for a 95% confidence interval, we have, from the tables, c = 2.05, and the interval is $1.742 \pm 2.05 \times 0.255$ or $1.742 \pm 0.523$. Note that, as expected, 1 is outside of the confidence interval and is not a plausible value for $\beta_1$ based on the data.
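In R the interval can be computed by hand or with the built-in confint(); a sketch for the example:

cc <- qt(0.975, 28)               # Pr(-cc <= t_28 <= cc) = 0.95, so cc = 2.05
1.742 + c(-1, 1)*cc*0.255         # estimate +/- cc x standard error(estimate)
b <- lm(ibm.ret ~ sp.ret)
confint(b, 'sp.ret', level=0.95)  # the same interval directly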
Example 2 A marketing firm wants to test two sales promotions. In a pilot project, 30 stores are selected and divided at random into three groups of 10. One group is given promotion 1, one group is given promotion 2 and the third group acts as a control. For each store, the firm measures the total sales over a two week period before and after the promotion is in place and calculates the percent change. They also measure the sales of competing products during the promotion period. The data are stored in the file trial.txt with columns comp.sales and percent.change to indicate the measured variates. The promotion for each store is specified by two indictor variates x1 and x2. promotion control promotion 1 promotion 2 x1 0 1 0 x2 0 0 1

The statistical model is Yi = 0 + 1 xi1 + 2 xi 2 + 3comp. salesi + Ri , Ri ~ G(0, ), i = 1,2,...,30 independent Stat 371 R.J. MacKay University of Waterloo 2009 III-4

Note the interpretation of the parameters 1 and 2 . Holding x2=0 and comp.sales fixed, 1 represents the increase in the mean response (percent change in sales) if we change from the control to promotion 1. That is, 1 measures the effect of promotion 1. We have a similar interpretation for 2 . We plot the data by promotion using the R code a <- read.table('trial.txt',header=T) attach(a) p <- c(rep("c",10),rep("1",10),rep("2",10)) plot(comp.sales, percent.change, xlab='competing sales', ylab='percent change', main='Percent Change in Sales vs Competing Sales by Promotion', type='n') text(comp.sales, percent.change,p) The 30 1 vector p is a string of characters corresponding to the promotion. The type='n' in the plot command suppresses the plotting of any points but sets up the axes and labels. The text command adds the points using the text characters in the vector p as the plotting symbols.

Note that the plotting symbol corresponds to the promotion. Fitting the model, we get the summary output:


Call:
lm(formula = percent.change ~ x1 + x2 + comp.sales)

Residuals:
      Min        1Q    Median        3Q       Max
-12.66432  -3.06860  -0.03009   3.47944   9.47980

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0917034  2.1969230  -0.042  0.96702
x1           8.4187456  2.3884539   3.525  0.00159 **
x2           2.8245616  2.3996781   1.177  0.24984
comp.sales  -0.0003564  0.0006446  -0.553  0.58509
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 5.334 on 26 degrees of freedom
Multiple R-Squared: 0.3316,  Adjusted R-squared: 0.2545
F-statistic: 4.3 on 3 and 26 DF,  p-value: 0.01367

To examine the effect of promotion 1, we consider the hypothesis $\beta_1 = 0$, corresponding to no effect. By default, the R output gives the results for the corresponding t test, in this case with p-value 0.00159. There is strong evidence that promotion 1 has a positive effect, with $\hat\beta_1 = 8.42$ (standard error 2.39), if all other explanatory variates in the model are fixed. There is no evidence that promotion 2 has an effect, with $\hat\beta_2 = 2.82$ (standard error 2.40).

Although it is clear here that the effects of the two promotions differ, suppose we were interested in the parameter $\theta = \beta_1 - \beta_2$ that measures the difference in effects of the two promotions. We have $\hat\theta = \hat\beta_1 - \hat\beta_2 = 5.60$. How can we get the standard error of this estimate?

We use vectors to represent $\theta = \beta_1 - \beta_2$. If we let $a = (0, 1, -1, 0)^t$, then $\theta = a^t\beta$, $\hat\theta = a^t\hat\beta$ and $\tilde\theta = a^t\tilde\beta$. Since $\tilde\beta \sim N(\beta, \sigma^2(X^tX)^{-1})$, we have $\tilde\theta \sim N(a^t\beta, \sigma^2 a^t(X^tX)^{-1}a)$, and hence the standard error of $\hat\theta$ is $\hat\sigma\sqrt{a^t(X^tX)^{-1}a}$. We can calculate the standard error in R using the following statements. Note the comments after the #.

b <- lm(percent.change~x1+x2+comp.sales)  # fit the model
X <- model.matrix(b)                      # extract the X matrix
W <- solve(t(X)%*%X)                      # find (XtX)^{-1}; note the transpose function t(),
                                          # matrix multiplication %*% and inverse function solve()
a <- c(0,1,-1,0)                          # define the vector a
theta.hat <- t(a)%*%coef(b)               # calculate theta-hat
st.err <- 5.334*sqrt(t(a)%*%W%*%a)        # calculate the standard error SE(theta-hat);
                                          # the estimate sigma-hat = 5.334 comes from summary(b)
st.err                                    # display SE(theta-hat)

In this case, we get the standard error SE($\hat\theta$) = 1.10. Using the fact that $\Pr(-2.06 \le t_{26} \le 2.06) = 0.95$, the 95% confidence interval for the difference in the effects of the two promotions is $5.60 \pm 2.27$. We can be confident that promotion 1 produces an average percent change in sales between 3.33% and 7.87% greater than promotion 2, with competing sales held fixed. We can draw conclusions about any linear combination of the coefficients using the same methodology.
Promotion 1 looks promising in terms of its effect on sales.
Notes

1. In example 2, you might wonder why we did not create an explanatory variate x3 for the control promotion. That is, we have

promotion     x1   x2   x3
control       0    0    1
promotion 1   1    0    0
promotion 2   0    1    0

This will create problems in the fitting, since x1 + x2 + x3 = 1, and so the columns of the matrix X are linearly dependent. We proceed by deleting x3 from the model as in the example, or by deleting the intercept with the R command

b <- lm(percent.change~-1+x1+x2+x3+comp.sales)

The -1 in the model specification suppresses the intercept. Since we are projecting onto the same subspace in each model, the estimates of the parameters corresponding to comparisons (e.g. promotion 2 vs promotion 1) are identical, with the same standard error.

2. Also in example 2, it is tempting to simplify the modeling by using a single vector x with elements 0, 1, 2 corresponding to the promotion (control, one or two). That is, we can fit the model $\text{percent.change} = \beta_0\mathbf{1} + \beta_1 x + \beta_2\,\text{comp.sales} + r$. When you try to interpret $\beta_1$, you can see the problem. There is no reason to suspect that changing from the control to promotion 1 has the same effect as changing from promotion 1 to promotion 2, as implied by this model. You need to be careful to recognize categorical explanatory variates that are coded as integers for convenience. In this case you need to set up indicator variates to represent the categories, as described in Note 1.


Prediction Intervals

Suppose we want an interval of plausible values of the response variate for a unit with known values of the explanatory variates. This is the problem we need to solve in the market-value assessment example discussed in Chapter 1.

In general, let $u^t = (1, x_1, \ldots, x_p)$ be the values of the explanatory variates for the unit whose response variate has not been measured. From the model, we can describe the behaviour of the response variate by the random variable

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + R = u^t\beta + R, \quad R \sim G(0, \sigma)$$

so that $Y \sim G(u^t\beta, \sigma)$. We also know that $u^t\tilde\beta \sim G(u^t\beta, \sigma\sqrt{u^t(X^tX)^{-1}u})$, so $Y - u^t\tilde\beta \sim G(0, \sigma\sqrt{1 + u^t(X^tX)^{-1}u})$ and

$$\frac{Y - u^t\tilde\beta}{\tilde\sigma\sqrt{1 + u^t(X^tX)^{-1}u}} \sim t_{n-(p+1)}$$

We use this t distribution to produce prediction intervals for Y. If $\Pr(-c \le t_{n-(p+1)} \le c) = 0.95$, then, rearranging the inequality, we have

$$\Pr\left(u^t\tilde\beta - c\tilde\sigma\sqrt{1 + u^t(X^tX)^{-1}u} \le Y \le u^t\tilde\beta + c\tilde\sigma\sqrt{1 + u^t(X^tX)^{-1}u}\right) = 0.95$$

We get the prediction interval by replacing the estimators with the corresponding estimates:

$$u^t\hat\beta - c\hat\sigma\sqrt{1 + u^t(X^tX)^{-1}u} \le y \le u^t\hat\beta + c\hat\sigma\sqrt{1 + u^t(X^tX)^{-1}u}$$

To illustrate how we can get this interval using R, suppose in Example 2 we want to predict the percent change in sales for a store that uses promotion 1 with competing sales of $3000.

b <- lm(percent.change~x1+x2+comp.sales)
new <- data.frame(x1=1, x2=0, comp.sales=3000)
p <- predict(b, newdata=new, interval="prediction", level=0.95)
p

The second line creates a new data set (a data.frame in R-speak) with the values of the explanatory variates for which we want to make the prediction. The third line calculates the interval; the option interval="prediction" produces a prediction interval at the given values of the explanatory variates. The last line prints the fitted value $u^t\hat\beta$ and the prediction interval. In the example, we get


          fit       lwr      upr
[1,] 7.257904  -4.26998  18.78579

That is, we predict the percent change in sales to be between -4.3% and 18.8%. This interval is wide because of the high variation within stores ($\hat\sigma = 5.334$) using the same promotion in the investigation.
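A related calculation, not used in the example: with interval="confidence" in place of interval="prediction", predict() gives the narrower confidence interval for the mean percent change $u^t\beta$ over all such stores, rather than the range for a single new store. A sketch, using the objects b and new defined above:

predict(b, newdata=new, interval="confidence", level=0.95)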
Exercises

1. Some ideas about confidence intervals:
a) Using the R output given for the sales promotion example, find a 99% confidence interval for the effect of competing sales on the percent change in sales. What can you conclude?
b) How does the confidence interval change as we increase the confidence level?
c) Suppose we have $\tilde\theta \sim G(\theta, d\sigma)$, the estimator for a parameter $\theta$, and the statistically independent $\tilde\sigma$ with $n-(p+1)$ degrees of freedom. Derive the confidence interval for $\theta$.
d) Show that 0 is in the 95% confidence interval for $\theta$ if and only if the p-value for the test of the hypothesis $\theta = 0$ exceeds 5%.

2. In a small study, a company that manufactures candle wax examined 20 candles made from batches of wax that have different amounts of fragrance oil added. The company was interested in understanding the relationship between the hardness of the candles (a technical measurement) and the amount of fragrance oil added. The data are stored in the file hardness.txt. The variates are named hardness and frag.oil. Consider the simple model, where we assume separate batches are independent:

$$\text{hardness} = \beta_0 + \beta_1\,\text{frag.oil} + R, \quad R \sim G(0, \sigma)$$

a) Construct a scatterplot of hardness vs the amount of fragrance oil.
b) Interpret the parameter $\beta_0$.
c) Find a 95% confidence interval for $\beta_0$.
d) Find a 95% confidence interval for the average hardness of candles made with 2% fragrance oil.
e) Add a quadratic term to the model (in R, f2 <- frag.oil*frag.oil creates a vector with components the squares of those in frag.oil). Is there any evidence of curvature in the relationship?

3. Using the data in the promotion trial described in this chapter:
a) Find a 95% prediction interval using promotion 2 for a large store where the competing sales are $30000. Can you see any difficulty with this prediction?
b) Construct a prediction interval for the change in sales if promotion 1 is used rather than promotion 2 for the same store (i.e. competing sales are fixed). [You will need to go back to first principles.]

4. Prove that the components of $\tilde\beta$ are independent if and only if the columns of X are orthogonal.


5. Prove that $\mathrm{Cov}(\tilde\beta, \tilde r) = 0$.

6. The estimate of $\sigma^2$ is $\hat\sigma^2 = \hat r^t\hat r/(n-(p+1))$. One (poor) justification for the denominator $n-(p+1)$ is based on the fact that with this choice we have $E(\tilde\sigma^2) = \sigma^2$. Here, we verify this result. Let $\tilde r = (I-H)R$. We want to show that $E(\tilde r^t\tilde r) = (n-p-1)\sigma^2$. Recall that for any square matrix A, the trace of the matrix is $\mathrm{tr}(A) = \sum_i a_{ii}$.
a) Show that if C is $n \times k$ and D is $k \times n$, then $\mathrm{tr}(CD) = \mathrm{tr}(DC)$.
b) Show that $E(\tilde r^t\tilde r) = \mathrm{tr}(E(RR^t)(I-H)) = \sigma^2\,\mathrm{tr}(I-H)$.
c) Use the result from a) and the fact that $H = X(X^tX)^{-1}X^t$ to evaluate $E(\tilde r^t\tilde r)$.


Chapter 4  The Analysis of Variance

In Chapter 3, we saw how to make formal inference statements about any component $\beta_j$ or any linear combination $\theta = a^t\beta$ of the coefficient vector $\beta$. The confidence intervals and hypothesis tests for these one-dimensional parameters are based on a t distribution of the corresponding estimator. That is,

$$\frac{\tilde\theta - \theta}{d\tilde\sigma} \sim t_{n-(p+1)}$$

where the constant d is determined by finding $\mathrm{stdev}(\tilde\theta) = d\sigma = \sigma\sqrt{a^t(X^tX)^{-1}a}$.

In this chapter, we look at hypothesis tests that involve several of the parameters simultaneously. That is, the hypothesis is not one-dimensional; it cannot be framed in terms of a single parameter.

Example 1  In the problem of predicting the market value of a large plant (see Chapter 2), we fit a model with 5 explanatory variates (age, size, ratio of building size to the land area, percentage of office space and the country of sale (1 = USA, 0 = Canada)) to the adjusted value (current $ per square foot) for 38 sales. The data are in the file assessment.txt. In fitting the full model, the summary output from R is given below.

Call:
lm(formula = value ~ size + age + office + ratio + location)

Residuals:
     Min       1Q   Median       3Q      Max
-10.6911  -3.8670  -0.8164   2.7848  14.5435

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.41588    4.77137   4.069 0.000288 ***
size        -2.41526    1.34128  -1.801 0.081181 .
age         -0.52300    0.10820  -4.833 3.22e-05 ***
office       0.03786    0.14776   0.256 0.799388
ratio        0.11653    0.08139   1.432 0.161926
location     3.49361    3.72137   0.939 0.354867
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 5.993 on 32 degrees of freedom
Multiple R-Squared: 0.4537,  Adjusted R-squared: 0.3683
F-statistic: 5.315 on 5 and 32 DF,  p-value: 0.001150


The output includes the results of a t test of the hypothesis that each coefficient is 0, given that all the other explanatory variates are in the model. We can also ask questions such as:

- Is there any evidence that the explanatory variates together explain a significant portion of the variation in the response variate or, in terms of the parameters, is there any evidence that any one of $\beta_1, \ldots, \beta_5$ differs from 0? The corresponding hypothesis is $\beta_1 = 0, \beta_2 = 0, \ldots, \beta_5 = 0$.
- Is there any evidence that only age is an important explanatory variate or, in terms of the parameters, is there any evidence that any one of $\beta_1, \beta_3, \beta_4, \beta_5$ differs from 0? The corresponding hypothesis is $\beta_1 = 0, \beta_3 = 0, \beta_4 = 0, \beta_5 = 0$.

The defining relationships in the hypothesis hold simultaneously for all the parameters listed.

Example 2  To compare 5 different versions of a product to the current version, an R&D department conducted a clinic in which each of the 6 versions was assessed by 8 different subjects. After trying the product, the subjects completed a questionnaire to determine:
- a score to measure past experience with similar products
- a score to measure satisfaction with the proposed version

The first question of interest was to see if any of the new versions were different from the original, adjusting for constant background experience. To model the data, let

$y_i$: satisfaction score for subject i, $i = 1, \ldots, 48$
$x_{ij} = 1$ if subject i used version j, $x_{ij} = 0$ otherwise, where $j = 1, \ldots, 6$ and $j = 1$ corresponds to the current version
pst.score$_i$: past experience score for subject i

Then, in terms of the corresponding vectors, we write the data model as

$$y = \beta_1 x_1 + \cdots + \beta_6 x_6 + \beta_7\,\text{pst.score} + r$$

Note that there is no intercept term in the model. The data are stored in the file product.txt. There is no difference in the versions if $\beta_1 = \beta_2 = \cdots = \beta_6$. We want to examine this multi-dimensional hypothesis.

To test multi-dimensional hypotheses such as those described in Examples 1 and 2, the basic idea is to construct a discrepancy measure that is the ratio of two estimates of the residual variance $\sigma^2$. The first estimate is valid whether or not the hypothesis is true. The second is a valid estimate of $\sigma^2$ only if the hypothesis is true. The ratio will tend to be different from 1 if the hypothesis is not true. We call this procedure the Analysis of Variance (ANOVA). The basic steps are:

Step 1: Fit the full model to get the usual estimate of the variance $\hat\sigma^2 = \hat r^t\hat r/(n-p-1)$ with $n-p-1$ degrees of freedom. Note that the sum of squares of the estimated residuals $\hat r^t\hat r = \sum \hat r_i^2$ is an estimate of $(n-p-1)\sigma^2$.

Step 2: Fit the reduced model, assuming the hypothesis is true, to get the residual sum of squares $\hat r_H^t\hat r_H$. If the reduced model has $q+1$ parameters, then $\hat r_H^t\hat r_H$ is an estimate of $(n-q-1)\sigma^2$ with $n-q-1$ degrees of freedom.

Step 3: The second estimate of $\sigma^2$ is based on the so-called additional sum of squares $\hat r_H^t\hat r_H - \hat r^t\hat r$, which estimates $(p-q)\sigma^2$ if the hypothesis is true. Hence, if this is the case, $(\hat r_H^t\hat r_H - \hat r^t\hat r)/(p-q)$ estimates $\sigma^2$ with $p-q$ degrees of freedom.

Step 4: The discrepancy measure is

$$f = \frac{(\hat r_H^t\hat r_H - \hat r^t\hat r)/(p-q)}{\hat\sigma^2}$$

Step 5: To calculate the p-value, we find $\Pr(F \ge f)$ where $F \sim F_{p-q,\,n-p-1}$ has an F distribution with $p-q$ (numerator) and $n-p-1$ (denominator) degrees of freedom.
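Once the full model b and the reduced model c have been fit with lm(), the steps can be carried out by hand; a sketch (deviance() returns the residual sum of squares and df.residual() its degrees of freedom; Note 2 below shows the anova() function that packages the same calculation):

rss.full <- deviance(b); df.full <- df.residual(b)   # Step 1: full model
rss.red  <- deviance(c); df.red  <- df.residual(c)   # Step 2: reduced model
f <- ((rss.red - rss.full)/(df.red - df.full)) / (rss.full/df.full)  # Steps 3 and 4
1 - pf(f, df.red - df.full, df.full)                 # Step 5: the p-value Pr(F >= f)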
Example 1 (continued)  We can illustrate the steps with the assessment data where we have, in the full model, n = 38 and p = 5. The hypothesis is $\beta_1 = 0, \beta_3 = 0, \beta_4 = 0, \beta_5 = 0$.

Step 1: Fit the full model. From the R output, we have $\hat\sigma^2 = 5.993^2 = 35.916$ and the residual sum of squares $32(5.993)^2 = 1149.314$ with 32 degrees of freedom.

Step 2: Fit the reduced model value = $\beta_0\mathbf{1} + \beta_2\,\text{age} + r$ with $\beta_1 = 0, \beta_3 = 0, \beta_4 = 0, \beta_5 = 0$.

Call:
lm(formula = value ~ age)

Residuals:
   Min     1Q Median     3Q    Max
-7.823 -3.823 -0.968  2.097 19.432

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.34243    3.00833   6.762 6.75e-08 ***
age         -0.45498    0.09971  -4.563 5.66e-05 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 6.084 on 36 degrees of freedom
Multiple R-Squared: 0.3664,  Adjusted R-squared: 0.3488
F-statistic: 20.82 on 1 and 36 DF,  p-value: 5.665e-05

The residual sum of squares is $36(6.084)^2 = 1332.542$ with 36 degrees of freedom.

Step 3: The second estimate of $\sigma^2$, assuming the hypothesis is true, is

$$\frac{1332.542 - 1149.314}{5-1} = 45.81$$

Step 4: The discrepancy measure is

$$f = \frac{45.81}{35.92} = 1.275$$

Step 5: Using the R function 1-pf(1.275,4,32) or the tables in the Appendix, we see that $\Pr(F_{4,32} \ge 1.275) = 0.30$. Since the p-value is so large, there is no evidence against the hypothesis. In other words, there is no evidence that the model is improved by adding all of the other explanatory variates once age is included in the model.

Notes

1. F distribution and F tables. Mathematically, an $F_{num,den}$ random variable with num and den degrees of freedom is defined as

$$F = \frac{K_{num}^2/num}{K_{den}^2/den}$$

where the numerator and denominator are independent. You may have wondered at Step 2 why we could not have used the estimate of $\sigma^2$ produced from fitting the reduced model directly, rather than the estimate in Step 3 based on the change in the residual sum of squares. The reason is that by subtraction we get independent estimators, which are required for the F distribution. An F random variable is always positive and has mean close to 1. There are tables (in the same format as the t tables) in the Appendix. For each tail probability, there is one page of tables, with a column for the numerator degrees of freedom and a row for the denominator degrees of freedom.

2. Calculations with R. We can use R to perform all of the calculations. We fit both the full model and the reduced model and then apply the anova() function. For Example 1, the code is

b <- lm(value~size+office+age+ratio+location)
c <- lm(value~age)
anova(c,b)

and the corresponding output is

Model 1: value ~ age
Model 2: value ~ size + office + age + ratio + location
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     36 1332.76
2     32 1149.20  4    183.56 1.2778 0.2992

Note that in the function anova( , ) we put the reduced model first.

3. In the summary output from fitting with lm(), the last line is an F test for the hypothesis that $\beta_1 = \beta_2 = \cdots = \beta_p = 0$. In words, if this hypothesis is true, none of the explanatory variates is important in explaining variation in the response variate. In Example 1, with the full model, the output gives the F ratio f = 5.315 with a p-value (0.00115) that is very small, so there is very strong evidence that one or more of the coefficients differ from 0.

Example 2 (continued)  There are six indicator variables to index the product version and one other explanatory variate. Note that the full model does not have an intercept term. The sum of the vectors corresponding to the six indicators is $x_1 + \cdots + x_6 = \mathbf{1}$; if we include an intercept term, then the columns of X are linearly dependent. We are interested in the hypothesis $\beta_1 = \beta_2 = \cdots = \beta_6$, which corresponds to all versions being the same. We fit the full model with the code

b <- lm(sat.score~-1+x1+x2+x3+x4+x5+x6+pst.score)

Note that the -1 tells R not to include a constant term (i.e. to fit a model without $\beta_0$). Next we fit the reduced model with $\beta_1 = \beta_2 = \cdots = \beta_6 = \beta$. Since $x_1 + \cdots + x_6 = \mathbf{1}$, this corresponds to the model with a constant term and the single explanatory variate pst.score.

c <- lm(sat.score~pst.score)

To test the hypothesis, we calculate the additional (residual) sum of squares with

anova(c,b)

with output

Model 1: sat.score ~ pst.score
Model 2: sat.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score
  Res.Df     RSS Df Sum of Sq      F   Pr(>F)
1     46 2.58217
2     41 1.60391  5   0.97826 5.0014 0.001146 **

Since the F ratio is so large (p-value 0.001), there is strong evidence of differences among the 6 versions if pst.score is held fixed.


Exercises

1. Suppose we have a discrepancy measure with an F distribution with 3 and 30 degrees of freedom.
a. Find $\Pr(F \ge 3)$.
b. Find a constant c so that $\Pr(F \ge c) = 0.05$.
c. What is the distribution of 1/F?

2. In an industrial example, the manufacturer collects 60 observations to build a model to relate a product property y to two quantitative explanatory variates $x_1$ and $x_2$. The data are stored in the file ch4exercise2.txt. Theory suggests that a linear model of the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + r$ should describe the data. However, the analyst worries that additional second order terms of the form $x_1^2$, $x_2^2$, $x_1x_2$ should be included in the model. Does the addition of the extra terms contribute significantly to the fit of the model? [Note: In R you can create new variables such as x22 <- x2*x2 to represent the quadratic terms.]

3. In the product testing example (Example 2 in Chapter 4), use an F test to address the following questions.
a. Is there any evidence of differences among the new versions 2 to 6?
b. Versions 4, 5 and 6 share a common feature. Is there any evidence that these versions have significantly different average satisfaction scores?

4. If we have a single parameter $\theta$, we can test a hypothesis $\theta = 0$ in two ways.
a. Explain how we can test the hypothesis using a t-test.
b. Explain how we can test the hypothesis using an F test.
c. Consider again the product testing example described in Exercise 3. Consider the hypothesis that the coefficient $\beta_7$ of the explanatory variate pst.score is 0. Test the hypothesis in the two ways and show that the p-value is identical. [This is always true, although a nuisance to prove.]
d. If $t \sim t_k$, show that $t^2$ has an F distribution. What are the degrees of freedom?

5. Some theory.
a. In the construction of the F test, explain why the additional sum of squares is always non-negative.
b. Consider the model $y = \beta_0\mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + r$. Show that if we replace each vector $x_j$ by the vector $x_j^* = x_j - \bar x_j\mathbf{1}$, the model becomes $y = \beta_0^*\mathbf{1} + \beta_1 x_1^* + \cdots + \beta_p x_p^* + r$. That is, the coefficients of the explanatory variates do not change.
c. Explain why testing the hypothesis $\beta_1 = \beta_2 = \cdots = \beta_p = 0$ will yield identical results for either formulation of the model.
d. In the revised model, show that $x_j^* \perp \mathbf{1}$ for all j.
e. In testing the hypothesis, show that the additional sum of squares is $\hat\beta_*^t(X_*^tX_*)\hat\beta_*$, where $\beta_* = (\beta_1, \ldots, \beta_p)^t$ and $X_* = (x_1^*, \ldots, x_p^*)$. This quantity is often called the regression sum of squares.


Chapter 5  Assessing Model Fit

To this point, we have built, fit and used a model for a given set of data without questioning any of the underlying assumptions. In this chapter, we examine the problem of model fit. Are the assumptions reasonably well met and, if not, what do we do about it?

In fitting the model $y = \beta_0\mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + r$ and using the corresponding estimators to construct formal statistical procedures, we are making a number of assumptions about the underlying probability model

$$Y = \beta_0\mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + R, \quad R \sim N(0, \sigma^2 I)$$

For example, we are assuming that:
- the mean vector $E(Y)$ is the specified linear function of the explanatory variates
- the residuals are gaussian and independent, with constant standard deviation, for each unit in the sample

We can assess these assumptions in several ways. In the exercises, we consider two situations to look at the first bullet. If we have units in the sample in which the explanatory variates are identical, we can use ANOVA to assess the fit. Also, we can add extra terms (squares, cross products etc.) to the proposed model and test if the additional terms have significant effects. If not, then we have greater confidence in the form of the mean function in the original model.

Looking at the Estimated Residuals

We also assess fit by looking for patterns that would be unusual if the model were true. If we find such patterns, we are suspicious about the assumptions underlying the model. This approach to assessing fit is informal and subjective; we need to be careful not to over-interpret the plots when looking for patterns. The estimated residuals, the components of the vector $\hat r$, are derived from the given model:

$$\hat r = y - \hat\mu = y - X\hat\beta$$

The corresponding estimator $\tilde r = Y - X\tilde\beta = (I-H)R$ is a linear combination of the components of R and hence, according to the model, $\tilde r \sim N(0, \sigma^2(I-H))$. Recall that $H = X(X^tX)^{-1}X^t$ depends only on X. We also know that $\hat r$ and $\hat\mu$ are orthogonal and, according to the model, $\tilde r$ and $\tilde\mu$ are independent (see the exercises).

If we plot the individual components, the estimated residual $\hat r_i$ versus the fitted value $\hat\mu_i$ for $i = 1, \ldots, n$, we should see a plot with no obvious patterns.


Example 1  Consider again the assessment data discussed in previous chapters and found in the file assessment.txt. If we fit a model with the 5 explanatory variates (size, age, office, ratio and location) to the measured value, we can create a plot of the estimated residuals versus the fitted values with the R code

b <- lm(value~size+age+ratio+office+location)
plot(fitted(b), resid(b))

Does this plot raise any suspicions about the proposed model? The answer is yes, since it would be surprising (assuming that the model is correct) if the two largest estimated residuals correspond to the two largest fitted values, as seen in the plot. The remedy here is to repeat the fitting and analysis with these cases removed, to see if the conclusions are substantially affected. If they are influential, then we need to decide (not on a statistical basis) how to proceed. Otherwise, we can ignore the poor fit.

In the example, we can delete cases 18 and 27 by editing the data frame a. Change all the variate values for cases 18 and 27 to NA.

aa <- edit(a)
detach(a)
attach(aa)

Then refit the same model using the 36 units in aa. The plot of the estimated residuals versus the fitted values looks much better. Note that age is still the only significant explanatory variate, but the estimated coefficient changes from -0.52 to -0.17, so the two cases are very influential if we want to predict the value of an unsold building. We make the decision to omit or include the two cases on non-statistical grounds. Here the decision was to proceed without these two sales, since they corresponded to buildings that were very different from the building in question.

We may see a funnel shape on the plot of the estimated residuals versus the fitted values. This indicates that the standard deviation is not constant but is a function of the mean $\mu(x)$.

Example 2  Here is an artificial example to demonstrate the problem and the remedy. The data are stored in the file ch5example2.txt. The 50 observations were created from the model

$$Y = (2 + 3x_1 - 2x_2)(3 + R)$$

where $x_1$ and $x_2$ are uniform on the interval (0,1) and $R_i \sim G(0,1)$ independently. In this model, the standard deviation of $Y_i$ depends on the mean: $E(Y) = 3(2 + 3x_1 - 2x_2)$ and $\mathrm{stdev}(Y) = |2 + 3x_1 - 2x_2|$. When we fit the linear model lm(y~x1+x2), the plot of the estimated residuals versus the fitted values (see the left panel on the next page) shows the classic funnel shape, indicating that the standard deviation is not constant. The remedy is to transform the response variate. That is, we fit a model of the form


$$f(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + r$$

Standard choices for the transformation include the functions $\log(y)$, $\sqrt{y}$ and $1/y$. The right hand panel shows the plot of the estimated residuals versus the fitted values after applying the log transformation.
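The effect of the transformation is easy to reproduce; a sketch for this example (the file stores the variates as y, x1 and x2, matching the lm() call above):

a <- read.table('ch5example2.txt', header=TRUE)
attach(a)
par(mfrow=c(1,2))              # two panels side by side
b1 <- lm(y ~ x1 + x2)
plot(fitted(b1), resid(b1))    # left panel: the funnel shape
b2 <- lm(log(y) ~ x1 + x2)
plot(fitted(b2), resid(b2))    # right panel: the funnel is removed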

After taking logarithms, the funnel effect is removed, in spite of the fact that the we know that the original model is not linear in x1 and x2 on this scale. The plot of the estimated residuals versus the fitted values is the single most useful diagnostic tool. We can also plot the estimated residuals versus the explanatory variates and, if the form of the model is adequate, we expect to see no patterns on these plots. There are many other such diagnostic plots. We can examine the gaussian assumption with a quantile-quantile (usually abbreviated qq) plot. Under the model assumption we have ~ ~ N (0, 2 ( I H )) or r ~ ~ G(0, 1 h ) . To make the standard deviation constant, we standardize and define ri ii the standardized residuals as ri zi = 1 hii If the gaussian assumption in the model is more or less correct, then we will see a straight line on the qq plot of the standardized residuals. Large deviations from the line indicate that the gaussian assumption is likely false. Again the usual remedy is to transform the


response variate or to delete cases with large (positive or negative) estimated residuals. See Appendix 2 for the concepts underlying the qq plot. For the assessment data with all 38 cases (and the edited file with cases 18 and 27 removed), the qq plots of the standardized residuals are shown in the left and right panels respectively. The plot on the left picks up, to some degree, the two exceptional cases with large estimated residuals. These residuals are larger than can be expected if the estimated residuals are a sample from a gaussian distribution. After removing these two cases, the plot on the right provides no such evidence against the gaussian assumption.

Using R, with the results of lm() stored in b, we can calculate the standardized residuals and create the qq plot with the code

s <- resid(b)/sqrt(1-hatvalues(b))
qqnorm(s)

Sensitivity (Case) Analysis
We use case analysis to determine if any of the units correspond to an outlier. We call a case an outlier if the conclusion of the analysis is materially changed by omitting the unit from the data. Note that a case can have unusual values for the explanatory variates, the response variate or both. We show an extreme example of each type of outlier for a single explanatory variate on the next page.

We look for outliers in the space of the explanatory variates using the following argument. Recall that $\tilde{r}_i \sim G(0, \sigma\sqrt{1 - h_{ii}})$ where $h_{ii}$ is the ith diagonal element of the hat matrix $H = X(X^tX)^{-1}X^t$. We call $h_{ii}$ the leverage of the ith case since, if $h_{ii}$ is close to 1,


the standard deviation of $\tilde{r}_i$ is close to 0 and hence we know that $\hat{r}_i$ will be close to 0. In other words, the fitted value $\hat{\mu}_i$ will be close to $y_i$. Note that $h_{ii}$ depends only on the explanatory variates, and so a case has high leverage ($h_{ii}$ close to 1) regardless of the observed response variate. Deletion of a case with high leverage relative to the others may change the fit of the model substantially.

We can extract the $h_{ii}$ from any model fit b <- lm(y ~ ...) using the R function hatvalues(b). For the assessment data (with the two cases deleted), the plot of the leverage versus the index (case number) is shown below. There are no exceptionally large values.
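As a sketch, the leverage plot can be drawn with the code below. The dashed reference line at twice the average leverage is a common rule of thumb that we add here for convenience; it is not part of the notes.

h <- hatvalues(b)
plot(h, xlab="case number", ylab="leverage")
# the h_ii sum to p+1, so mean(h) = (p+1)/n; flag cases above twice this average
abline(h=2*mean(h), lty=2)
which(h > 2*mean(h))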


To look for outliers in the response variate, for each case i, we compare $y_i$ to $\hat{y}_{(i)}$, the predicted value of $y_i$ if we delete the ith case. We
- delete the ith row $u_i^t$ from the matrix X to get $X_{(i)}$ and refit the model to get parameter estimates $\hat{\beta}_{(i)}$, $\hat{\sigma}_{(i)}$
- calculate the predicted value $\hat{y}_{(i)} = u_i^t\hat{\beta}_{(i)}$
- calculate a t-statistic

$t_i = \dfrac{y_i - \hat{y}_{(i)}}{\hat{\sigma}_{(i)}\sqrt{1 + u_i^t(X_{(i)}^tX_{(i)})^{-1}u_i}}$

If any $t_i$ is large (i.e. bigger than 2.5), then we know the response variate of the corresponding case is an outlier. The calculation of $t_i$ looks formidable. We are saved by a remarkable formula (see the Exercises on the rank 1 update) which gives an explicit formula for $(X_{(i)}^tX_{(i)})^{-1}$ in terms of $(X^tX)^{-1}$. Using this formula, we can rewrite $t_i$ as

$t_i = s_i\sqrt{\dfrac{n-p-2}{n-p-1-s_i^2}}$

a monotone function (as $s_i$ increases, $t_i$ increases) of the studentized residual

$s_i = \dfrac{\hat{r}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$

A large studentized residual (say greater than 2.5) corresponds to a large value of $t_i$, which in turn corresponds to an outlier in the response variate. We calculate these residuals for the model b <- lm(...) using the R function rstudent(b), which returns the $t_i$. For the above assessment data, the plots of the studentized residuals [plot(rstudent(b))] against the index number show two exceptional values for the unedited data (left) and none for the edited data (right). Note the difference in vertical and horizontal scales.
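As a quick numerical check (our own sketch, not part of the notes), we can verify in R that rstudent(b) agrees with the monotone transformation of the studentized residuals; here p is the number of explanatory variates in the fit b.

r <- resid(b)
h <- hatvalues(b)
s <- r/(summary(b)$sigma*sqrt(1-h))   # studentized residuals s_i (rstandard(b) gives the same)
n <- length(r); p <- length(coef(b))-1
ti <- s*sqrt((n-p-2)/(n-p-1-s^2))     # the t-statistics t_i
max(abs(ti - rstudent(b)))            # should be essentially 0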


We conclude that there are no other cases that (singly) are highly influential. Note that there are many other ways to measure the influence of single cases. We have looked at cases one at a time, not in groups, so there may still be highly influential small groups of cases. This issue is beyond the scope of the course.

In summary, if we find influential cases, we should repeat the analysis with these cases deleted and see if the Conclusion is materially changed. Do not forget that the Conclusion is driven by the original Problem. Changes in the fitted model that do not affect the Conclusion are not important.


Exercises
1. Consider the assessment data with the simple model value $= \beta_0 + \beta_1\,age + \beta_2\,size + residual$. Use the methods in this chapter to assess the fit of the model and to suggest remedies. Is the prediction of value for a building with size 13.9 and age 30 sensitive to any particular cases?
2. In an experimental Plan, there were three explanatory variates $x_1, x_2, x_3$ that were each assigned two values, here coded as -1 and +1. There are 8 combinations. As well, the investigators looked at the response variate for the so-called center point $x_1 = 0, x_2 = 0, x_3 = 0$. The data are shown below and can be found in the file ch5Exercise2.txt.

x1   x2   x3       y
-1   -1    1   11.54
-1    1   -1    5.45
-1    1    1    7.34
 1   -1   -1   17.21
 1   -1    1   17.87
 1    1   -1    9.40
 1    1    1   11.30
 0    0    0   11.57
 0    0    0   11.97
 0    0    0   11.89
 0    0    0   12.15

Suppose we fit a model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + r$. The summary output from R is

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2527 -0.1360  0.1465  0.4419  0.7615

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.6181     0.2372  48.985 3.87e-10 ***
x1            2.4654     0.3000   8.218 7.68e-05 ***
x2           -3.1071     0.3000 -10.357 1.70e-05 ***
x3            0.5329     0.3000   1.776    0.119
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7649 on 7 degrees of freedom
Multiple R-Squared: 0.969, Adjusted R-squared: 0.9558
F-statistic: 73 on 3 and 7 DF, p-value: 1.203e-05

We drop $x_3$ from the model. To assess the fit of the model, consider two formal approaches.
a) Add the quadratic terms $x_1^2, x_1x_2, x_2^2$ to the model and then test the hypothesis that the additional terms are unnecessary.
b) Consider an extended model in which the mean of Y is a function $\mu(x_1, x_2)$ with no further specification. Show that the residual sum of squares from fitting this model is $\sum_{ij}(y_{ij} - \bar{y}_i)^2$ where i indexes the unique sets of explanatory variate values and j indexes the replicated observations within these sets. For the given data, use the additional residual sum of squares to test the hypothesis that the extended model is necessary. This is called a pure residual test of fit.

3. Consider the data described in Chapter 3 (the file is trial.txt) in which a marketing firm wanted to compare two sales promotions against a control. The response variate is the weekly sales and there are four explanatory variates, two of which index the promotion used.
a) After fitting the full model, is there any evidence of lack of fit?
b) Suppose the primary question is to compare the two promotions adjusting for past and competitors' sales. Are there any cases that have a large influence on the conclusion about this comparison?

4. If the model is correct, show that $Cov(\tilde{r}, \tilde{\mu}) = 0$ and hence that $\tilde{\mu}$ and $\tilde{r}$ are independent. What does this suggest about the plot of the estimates $\hat{r}_i$ versus $\hat{\mu}_i$?

5. We give the basic mathematics behind the arithmetic that we use for the calculations when deleting a single case. The key step is to find an expression for the inverse of $X_1^tX_1$ where $X_1$ is the matrix X with the first row $u_1^t$ omitted.
a) Suppose u and v are two $n \times 1$ column vectors and $A = I + vu^t$. Find the constant a so that $(I + vu^t)^{-1} = I + avu^t$. [This is known as a rank one update.]
b) If $C = B + uu^t$ where B is invertible, find an expression for $C^{-1}$.
c) Suppose we consider dropping the first case when fitting the model $y = X\beta + r$. Show that $X^tX = X_1^tX_1 + u_1u_1^t$ and hence find an expression for $(X_1^tX_1)^{-1}$.


Chapter 6 Model Building
In many applications, we have a number of explanatory variates that we can choose to include in or delete from the model. For example, if the problem is to
- predict the value of the response variate for a given set of values for the explanatory variates
- assess the effect of a particular explanatory variate (or variates) on the response variate when controlling for a number of other explanatory variates

we may or may not include some of the explanatory variates in the model. We want to use the data to decide which variates to include or delete. This decision is important if we want to get a final model that is as simple and useful as possible and fits the data well. If we include unnecessary terms, we add to the model complexity and we can also distort the conclusion. We consider three strategies:

1. (Forward selection) Start with the simple one-variate models and select the one that best explains the variation in the response variate, i.e. select the one with the highest value of $R^2$ or, equivalently, the model with the largest F-ratio. Then add the second variate that maximizes the increase in $R^2$ and has a coefficient significantly different from 0. Continue until we can find no more important variates to add.

2. (Backwards elimination) Start by fitting the full model. If any coefficient is judged not significantly different from 0, leave out the least important variate, based on the p-value for the test of the hypothesis that the corresponding coefficient is 0. Keep deleting until all of the included variates have coefficients significantly different from 0.

3. (All regressions) Fit all possible models (there are $2^p - 1$ if we have p explanatory variates). Select two or three of each size and then use a criterion that balances the value of $R^2$ against the addition of extra variates to pick the best model.

You might expect that strategies 1 and 2 would get the same answer but this is not always the case. We create an artificial example to demonstrate this point. The data are stored in ch6example1.txt.

Example 1
We create the data with the code:

u1<-rnorm(100); u2<-rnorm(100); u3<-rnorm(100); u4<-rnorm(100); r<-rnorm(100)
x1<-u1; x2<-u1+u2; x3<-u1+2*u2-u3; x4<-u1+u4
y<-x1+1.2*x2-0.5*x3+2*r


Note that the second line introduces relationships among x1, x2, x3 and x4. The corresponding vectors are not close to orthogonal. We then try strategy 1, where we add significant variates. We start by fitting all of the one-variate models and picking the one with a significant coefficient (p-value less than 10%) and the highest value of $R^2$. The results are

Model      Significant variates    R^2
lm(y~x1)   x1                      0.4547
lm(y~x2)   x2                      0.2491
lm(y~x3)   x3                      0.0408
lm(y~x4)   x4                      0.2439

We select x1 and proceed by fitting the 3 two-variate models that include x1.

Model         Significant variates    R^2
lm(y~x1+x2)   x1                      0.4567
lm(y~x1+x3)   x1                      0.4550
lm(y~x1+x4)   x1                      0.4560

Using strategy 1, we stop, and the selected model includes only x1.

Now using strategy 2, we start with the four-variate model and work to eliminate variates. In the full model we have x1, x2, x3 with coefficients judged significantly different from 0 (p-value < 10%) and $R^2 = 0.4744$. We drop x4 and fit the three-variate model, in which we find all coefficients significant and $R^2 = 0.4743$. According to strategy 2, we stop and select the three-variate model that includes x1, x2, x3.

There are many other similar strategies, often called stepwise procedures, that involve adding or eliminating variates based on testing the significance of their coefficients in a given fit. The reason that we get different answers from such procedures is largely due to the correlation among the explanatory variates. If the vectors $1, x_1, ..., x_p$ are mutually orthogonal, then $X^tX$ is a diagonal matrix with entries $x_j^tx_j$, and $(X^tX)^{-1}$ is also diagonal with entries $1/x_j^tx_j$. The estimates of the coefficients $\hat{\beta}_j = x_j^ty/x_j^tx_j$ and the corresponding estimators $\tilde{\beta}_j \sim G(\beta_j, \sigma/\sqrt{x_j^tx_j})$ depend only on $x_j$. If we fit any sub-model (i.e. leave out some of the explanatory variates), $\hat{\beta}_j$ and $\tilde{\beta}_j$ do not change. In testing the significance of the coefficients, only the estimate of $\sigma$ and the degrees of freedom change from the full model.

The extreme opposite of orthogonality occurs if the vectors $1, x_1, ..., x_p$ are linearly dependent or collinear. In this case we have trouble with least squares since $X^tX$ is singular and there are many models that give the same minimum value for the sum of squares of the estimated residuals. The closer the explanatory variates are to collinear, the more difficult is our problem to select the best model.

With the invention of clever algorithms that can fit all possible models without re-inversion of large matrices, strategy 3 is the preferred approach. The only difficulty is to specify a criterion for choosing among all of the models. We want to select a model that explains a large portion of the variation in the response variate. Recall that $R^2$ must increase when we add additional variates to the model because the sum of squares of the estimated residuals must get smaller. On the other hand, we want a simple model, and adding explanatory variates increases the complexity. We pick a criterion that balances these two requirements. Suppose that we have p+1 terms in the full model (p explanatory variates plus a constant) and k+1 terms in a sub-model. Then two criteria that balance the requirements are based on the following quantities. First,

$\hat{\sigma}^2(\text{sub-model}) = \dfrac{\text{estimated residual sum of squares}}{n - (k+1)}$

Note that both the numerator and denominator decrease as we add terms to the model. We want this quantity to be small. Second,

$\text{estimated residual sum of squares} + ck$

where $c > 0$ is chosen for calibration. Again we want this quantity to be small.

We specify the first criterion by the so-called adjusted $R^2$ for a given sub-model:

$R^2_{adj} = 1 - \dfrac{\hat{\sigma}^2(\text{sub-model})}{\hat{\sigma}^2(\text{null-model})}$

where $\hat{\sigma}^2(\text{null-model}) = \sum_i (y_i - \bar{y})^2/(n-1)$ is the estimate of $\sigma^2$ if none of the explanatory variates are included in the model. Note that $R^2$ has the same form, with sums of squares of estimated residuals in the numerator and denominator. We want $R^2_{adj}$ to be large (i.e. $\hat{\sigma}^2(\text{sub-model})$ to be small).

We specify the second criterion using Mallows' $c_p$ statistic for a given sub-model, where

$c_p = \dfrac{\text{estimated residual sum of squares}}{\hat{\sigma}^2} - n + 2(k+1)$

and the denominator $\hat{\sigma}^2$ is the estimate of $\sigma^2$ from the fit of the full model. The constants are chosen so that $c_p \approx k+1$ for good sub-models with k explanatory variates and a constant term.
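As a minimal sketch (the helper name crit is ours, not part of the notes), both criteria can be computed directly for any sub-model; full is the fit with all p explanatory variates and sub is a candidate with k explanatory variates.

crit <- function(sub, full, y) {
  n <- length(y)
  rss <- sum(resid(sub)^2)
  adjr2 <- 1 - (rss/df.residual(sub))/var(y)      # var(y) is the null-model estimate of sigma^2
  sigma2 <- sum(resid(full)^2)/df.residual(full)  # sigma^2 estimated from the full model
  cp <- rss/sigma2 - n + 2*(n - df.residual(sub)) # n - df.residual(sub) equals k+1
  c(adjr2=adjr2, cp=cp)
}
# e.g. crit(lm(y ~ x1+x2+x3), lm(y ~ x1+x2+x3+x4), y)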


We can use the package leaps in R to evaluate these criteria for a large number of sub-models. We start with the command

library(leaps)

This loads a set of functions that we need to fit a large number of models simultaneously. If you have not downloaded leaps, use the Package menu to get it. We use the artificial example to demonstrate the calculations. The R code is

library(leaps)
e<-regsubsets(y~x1+x2+x3+x4, data=a, nbest=2)  # finds the best two sub-models of each size
f<-summary(e)                                  # extracts useful information from e
detach(a)
attach(f)
cbind(which,cp,adjr2)                          # gets model, Cp and adjusted R-squared

The output is

  (Intercept) x1 x2 x3 x4        cp     adjr2
1           1  1  0  0  0  2.566221 0.4491371
1           1  0  1  0  0 39.737843 0.2413939
2           1  1  1  0  0  4.211471 0.4454612
2           1  1  0  0  1  4.339054 0.4447408
3           1  1  1  1  0  3.032519 0.4578213
3           1  1  1  0  1  6.023301 0.4407583
4           1  1  1  1  1  5.000000 0.4523016

Note that the two best models of each size are chosen. For the full model, $c_p = p + 1$ by definition. In the example, the best model includes x1, x2 and x3, with $c_p = 3.03 \approx 3$ and $R^2_{adj} = 0.458$.

If we fit the model with explanatory variates x1, x2 and x3, we get the summary output

Residuals:
    Min      1Q  Median      3Q     Max
-4.2513 -1.4289 -0.1651  1.2520  6.2827

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04579    0.21029  -0.218 0.828073
x1           1.33241    0.38243   3.484 0.000746 ***
x2           0.91862    0.49019   1.874 0.063970 .
x3          -0.36646    0.20450  -1.792 0.076281 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.056 on 96 degrees of freedom
Multiple R-Squared: 0.4743, Adjusted R-squared: 0.4578
F-statistic: 28.87 on 3 and 96 DF, p-value: 2.171e-13

Just for fun, note that the selected model and parameter estimates match the true model and parameter values used to generate the data. This was likely fortunate since there are other good models that we might have selected.

We end with another market value assessment example, this time using data from house sales, with the object of predicting the market value for other homes in the same region.

Example 2
The sample is 100 homes that have been sold in a given region. For each home, the data are:
- size: the size in m2
- baths: the number of bathrooms (bathrooms with only a basin and toilet count)
- rooms: the number of rooms at or above ground level
- age: age in years
- lotsize: size of the lot in m2
- basement: whether or not the basement is substantially finished
- garage: whether or not the house has a garage
- value: the selling price in $000

The data are stored in the file mkvalue.txt. We start the analysis by fitting a full model and examining how well this model fits. Below we give a summary of the fit and
- a plot of the estimated residuals versus the fitted values
- a qq plot of the standardized residuals
- a plot of the leverages $h_{ii}$, the diagonal elements of H
- a plot of the studentized residuals

            Estimate  Std. Error t value Pr(>|t|)
(Intercept) -65.92366   39.47105  -1.670 0.098321 .
size          1.57860    0.21994   7.178 1.86e-10 ***
stories      17.19540    9.78106   1.758 0.082105 .
baths         7.65047   16.81162   0.455 0.650142
rooms        -1.31162    5.03983  -0.260 0.795258
age          -3.73867    0.78717  -4.750 7.57e-06 ***
lotsize       0.16106    0.04331   3.719 0.000345 ***
basement      1.50462    9.11014   0.165 0.869185
garage      -42.11920   15.04650  -2.799 0.006253 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 41.19 on 91 degrees of freedom
Multiple R-Squared: 0.6704, Adjusted R-squared: 0.6414
F-statistic: 23.14 on 8 and 91 DF, p-value: < 2.2e-16

The relatively low value of $R^2 = 0.67$ indicates that we are likely to have large prediction error. We continue in any case. There is no apparent lack of fit and there are no influential cases, as demonstrated by the four plots. We now search for a simpler model by looking at the best 2 of all possible models. The output of regsubsets() is presented on the next page. There are several attractive models, but the best (simplest) includes size, age, lotsize and garage. For this model, we have $R^2_{adj} = 0.644$ and $c_p = 4.13 < 5$. We can refit this simpler model and assess the fit as above.
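A minimal sketch of the refit (assuming the variates in mkvalue.txt are named as in the list above and read into a data frame a, an assumption on our part):

a <- read.table("mkvalue.txt", header=TRUE)
b2 <- lm(value ~ size + age + lotsize + garage, data=a)
summary(b2)
plot(fitted(b2), resid(b2))   # re-check the fit as in Chapter 5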


  (Intercept) size stories baths rooms age lotsize basement garage adjr2    cp
1           1    1       0     0     0   0       0        0      0 0.549 27.15
1           1    0       0     1     0   0       0        0      0 0.317 90.63
2           1    1       0     0     0   1       0        0      0 0.572 21.55
2           1    1       1     0     0   0       0        0      0 0.560 25.00
3           1    1       0     0     0   1       1        0      0 0.618 10.14
3           1    1       0     0     0   1       0        0      1 0.593 16.86
4           1    1       0     0     0   1       1        0      1 0.644  4.13
4           1    1       1     0     0   1       1        0      0 0.625  9.32
5           1    1       1     0     0   1       1        0      1 0.651  3.32
5           1    1       0     0     1   1       1        0      1 0.641  6.09
6           1    1       1     1     0   1       1        0      1 0.648  5.09
6           1    1       1     0     1   1       1        0      1 0.648  5.21
7           1    1       1     1     1   1       1        0      1 0.645  7.02
7           1    1       1     1     0   1       1        1      1 0.645  7.06
8           1    1       1     1     1   1       1        1      1 0.641  9.00

Exercises

1. Suppose the columns of X are orthogonal. Show that the estimate of $\beta_j$, the coefficient of $x_j$, does not depend on which other columns of X are included in the model.
2. Show that $c_p = p + 1$ for the full model that includes all p explanatory variates.
3. The file ch6exercise3.txt contains a response variate y and 10 explanatory variates $x_1, ..., x_{10}$ for 100 cases. These data were created artificially for practice. The model used to generate the data was $Y = 3x_1 + 0.3x_2 - 2x_4 + x_7 - x_9 + R$, $R \sim G(0, 2)$. Note that the columns of X are not orthogonal.
a) Fit a model using forward selection. At each step, use a p-value of 0.05 to decide to proceed.
b) Fit a model using backwards selection, using a p-value of 0.05 to decide to proceed at each step.
c) Use leaps to investigate all possible models. Pick a reasonable model.
d) How do the results of the three strategies compare in this case?


Chapter 7 Sample Survey Issues
In the second half of the course, we consider the planning and analysis of simple sample surveys. For the most part, we follow the book Sampling: Design and Analysis by S.L. Lohr. We will cover most of the material in Chapters 1-4. There are multiple copies of the book on reserve (UWD1510) in the Davis Centre Library. See the reference list attached to the course outline. In this chapter, we deal with
- the language of sample surveys
- examples of sampling protocols
- classification of error (the difference between the estimate and the attribute)
- assessment of error

Sample surveys are widely used to estimate attributes of interest in a specified target population. The survey can be one-time only and informal (e.g. the daily poll on the Netscape home page http://www.netscape.com/) or highly complex and regular (e.g. the Canadian Labour Force Survey that estimates unemployment rates across Canada on a month-to-month basis; for details, see http://www.statcan.ca/english/sdds/3701.htm). Surveys are used to estimate attributes of human populations, as in the above examples, and also of any other collection of objects, such as financial records. A census is an investigation of a population where we try to examine every unit. The reasons for using a sample survey rather than a census of the target population to learn about attributes are
- cost
- timeliness
- ethical issues relating to efficient use of resources
- the improved quality of the estimates available from a carefully conducted survey rather than a sloppy census.

We use some specialized language to describe survey methodology within the PPDAC framework. For formal surveys, we concentrate on the sampling protocol. Note that we often select units in clusters to implement a sampling protocol.

Example
In the Labour Force Survey (the quoted material is from the above web site), the units are defined as:

LFS covers the civilian, non-institutionalised population 15 years of age and over. Excluded from the survey's coverage are residents of the Yukon, Northwest Territories and Nunavut, persons living on Indian Reserves, full-time members of the Canadian Armed Forces and inmates of institutions. These groups together represent an exclusion of less than 2% of the population aged 15 and over.


The sampling protocol does not choose units (people who meet the inclusion criteria) directly. Instead, a sample of households is selected and then variates are measured on every appropriate unit in the selected households.

The LFS uses a probability sample that is based on a stratified multi-stage design. Each province is divided into large geographic strata. The first stage of sampling consists of selecting smaller geographic areas, called clusters, from within each stratum. The second stage of sampling consists of selecting households from within each selected cluster.

We call the households the sampling units. The frame is the list of sampling units on which the sampling protocol operates. The frame defines the study population. Developing a good frame (one which covers the target population) is often one of the most expensive components of conducting the survey. Formal surveys have a frame. Note that informal surveys such as the Netscape poll do not use a frame since the units are self-selecting. In this instance, the study population is only vaguely specified. In the Labour Force Survey, there are separate frames for each stage of the sampling. One frame is a list of clusters within each geographic stratum. The second frame is a list of households within each selected cluster.

Sampling Protocols
There are many sampling protocols that can be used to select the sample from the study population. A probability sampling protocol uses a probability distribution to select the sample from the frame. More formally, if the frame is denoted by $U = \{1, 2, ..., N\}$, then a probability sampling protocol assigns a probability to every subset of U and the sample is selected according to this distribution. We will look at ways to implement such a protocol later.

Example: Suppose an auditor has a file of 1220 records and plans to select a sample of 20 records to examine the quality of the file. The auditor decides to use simple random sampling (SRS), a protocol in which all samples of size 20 have the same probability of selection. Here we write the frame as $U = \{1, 2, ..., 1220\}$ and, for any subset S of U, we have

$\Pr(S) = 1\Big/\tbinom{1220}{20}$ if S has size 20, and $\Pr(S) = 0$ otherwise
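As a one-line sketch in R (using only the base function sample(); the seed is our addition, for reproducibility only), the auditor's SRS of record numbers could be drawn as follows.

set.seed(371)             # any seed; reproducibility only
s <- sample(1:1220, 20)   # SRS without replacement from the frame {1, ..., 1220}
sort(s)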

Example: For the Labour Force Survey, the sampling protocol is described as:

The LFS uses a probability sample that is based on a stratified multi-stage design. Each province is divided into large geographic strata. The first stage of sampling consists of
selecting smaller geographic areas, called clusters, from within each stratum. The second stage of sampling consists of selecting dwellings from within each selected cluster. Since July 1995, the monthly LFS sample size has been approximately 54,000 households, resulting in the collection of labour market information for approximately 100,000 individuals.

There are many non-probability sampling protocols. Some examples are:
- convenience sampling: take what you can get, e.g. a survey of people in a mall by a marketing firm
- self-selection sampling: units choose themselves, usually with little control, e.g. many internet polls
- quota sampling: units are selected so that some attributes of the sample match known attributes in the target population, e.g. in a marketing survey, each interviewer is directed to find a sample whose attributes match the local population in terms of age, income profile and gender
- judgment sampling: units are selected so that the samplers think that the sample will be representative of the target population (i.e. match the target population with respect to the attributes of interest)

We concentrate on formal surveys that use probability sampling protocols since we can use mathematical tools to assess, at least partially, the error that occurs in drawing conclusions about the target population from the sample. This is a major advantage of these sampling protocols.

Errors
In applying PPDAC to estimate population attributes, we can classify errors as:
- Study error: the difference in attributes of interest between the target and study population.
- Sample error: the difference in attributes of interest between the study population and the sample.
- Measurement error: the difference in the attributes of interest due to the difference between the true and measured values of the variates on the units in the sample.

In the context of sample surveys, study error is called frame error. The attributes of the units listed in the frame may not match those of the target population. For surveys of human populations, an important component of sample error is non-response error. Suppose the attributes of interest in the respondent and non-respondent populations are different. Then the sample attributes may not match those in the frame because one or more units in the sample may have refused to provide data.


If we divide the frame into those units that would respond and those that would not, we can see the effect of non-response error.

[Figure: the frame divided into non-respondents and respondents, showing the intended sample overlapping both groups and the actual sample containing respondents only.]

The actual sample is different from the intended sample and the attributes in the actual sample may not match those in the frame.

Measurement error may occur because of systematic differences in interviewers; people in the sample may lie, forget or modify their answers to please the interviewer. Interviewers may influence the responses by using a different protocol for asking the questions. Measurement error may also occur if the question we pose does not match the question used to define the response variate in the target population.

We have all seen statements such as "19 times out of 20, a sample of this size is accurate to within 3 percentage points" at the bottom of the conclusions from a survey. This confidence interval captures the uncertainty due to a component of the sample and measurement error. We use the probability model that generated the sample to describe how the sample attributes would behave if we were to repeat the same sampling protocol over and over. The confidence interval does not capture uncertainty due to frame error, non-response, systematic errors in the sampling protocol and measurement system, etc. We can control these latter sources of error only through good planning and execution of the survey.

Questionnaire Design
Here is a brief set of considerations in designing the instrument (the questionnaire) for the survey of a human population. This is a very complex subject. There are many books and papers written on questionnaire design. If you are involved in an important survey, hire an expert. The proper design of the questionnaire and a good plan for its administration can substantially reduce non-response error, recall error, error due to misunderstanding the question and so on. The following list is adapted from Lohr, pages 10-15.


- Decide what you want to find out (understand the Problem)
- Keep the questions clear and simple
- Use specific instead of general questions
- Decide whether to use open-ended or closed questions
- Ask only one concept in each question
- Use forced choice rather than agree/disagree questions
- Avoid leading questions and contexts
- Relate each question to your objective: what will you do with the data?
- Keep the questionnaire short
- Explain the purpose of the survey
- Ensure confidentiality
- Pay attention to question-order effects
- Test your questions before the survey
- Plan to report the actual questions used


Chapter 8 Probability Sampling
Formal surveys use probability sampling, a protocol that selects units for the sample based on a probability model on subsets of the frame. The major advantage of probability sampling is that the sampling protocol produces a statistical model that we can use to assess sample error, i.e. to generate confidence intervals and hypothesis tests for model parameters that represent attributes of interest in the study population. If we execute the protocol as planned, then we know that the model is appropriate. In this chapter, we first examine several probability sampling protocols and then look at simple random sampling (SRS) in detail.

Denote the frame by the set $U = \{1, 2, ..., N\}$ so that there are N units in the frame. Then a probability sampling protocol specifies the probability that the sample is s, for any subset $s \subseteq U$. We consider protocols where the sample size n is fixed, so that the only subsets with positive probability have n units. Here are some common sampling protocols, explained in terms of an example.

Example: Suppose N = 10,000 and n = 100.

Simple random sampling: all $\tbinom{10000}{100}$ samples of size 100 have the same probability.

Stratified random sampling: divide the frame into sub-frames called strata. For example, $U_1 = \{1, ..., 1000\}, ..., U_{10} = \{9001, ..., 10000\}$. For each stratum, select a simple random sample of size 10. There are $\tbinom{1000}{10}^{10}$ possible samples, each with the same probability.

Cluster sampling: Divide the frame into clusters, for example $C_1 = \{1, ..., 10\}, C_2 = \{11, ..., 20\}, ..., C_{1000} = \{9991, ..., 10000\}$. Select 10 clusters using simple random sampling. The sample is the 100 units in the 10 selected clusters. There are $\tbinom{1000}{10}$ possible samples, each with the same probability.

Systematic sampling: Define clusters with n = 100 units per cluster, $C_1 = \{1, 101, ..., 9901\}, C_2 = \{2, 102, ..., 9902\}, ..., C_{100} = \{100, 200, ..., 10000\}$. Use simple random sampling to select one cluster as the sample. We call this protocol systematic because we can select the sample by choosing the first unit from $\{1, ..., 100\}$ at random and then taking every subsequent 100th unit. There are 100 possible samples, each with the same probability.

Two-stage sampling: Select the sample in two stages. For example,


Stage 1: Select two strata (here called primary units) from the 10 described above under stratified random sampling, using simple random sampling. There are $\tbinom{10}{2}$ possible samples of primary units.
Stage 2: Select 50 units from each of the two selected primary units using simple random sampling.

There are $\tbinom{10}{2}\tbinom{1000}{50}^2$ possible samples, each with the same probability. One advantage of two-stage sampling is that we only need to build a frame at the second stage for those primary units selected in the first stage.

The inclusion probability $p_i$ for any unit i in the frame is the probability that the unit is included in the sample. In the above example, you can show that $p_i = \frac{1}{100}$ for each unit i, for each of the described sampling protocols (see the exercises). You should be able to provide definitions of these sampling protocols in general.

Note that complex surveys such as the labour force survey use multi-stage sampling with stratification in the primary stage and cluster sampling (the clusters are households) in the ultimate stage. Also note that we use SRS within each of the above protocols, so we need to understand the properties of this most important protocol.

Simple Random Sampling
For simple random sampling, we select n units from a frame of N units so that each sample of size n has the same probability of selection. Since there are $\tbinom{N}{n}$ possible samples, each has probability $1/\tbinom{N}{n}$. The inclusion probability for unit i in the frame is

$p_i = \tbinom{N-1}{n-1}\Big/\tbinom{N}{n} = \dfrac{n}{N}$

Note that the numerator is the number of samples that contain the particular unit. Technically, this protocol is often called simple random sampling without replacement since we do not allow the same unit to be included in the sample more than once. We use the shortened form of the name.

As we saw in the above example, there are many sampling protocols with the same inclusion probabilities as SRS. We cannot define simple random sampling by saying that every unit has the same chance of being included in the sample.


Let $y_i$ be the value of the response variate for unit i. Suppose that we are interested in estimating the average response in the target population. We denote this average in the frame (study population) by $\mu$ and the sample average by

$\hat{\mu} = \dfrac{\sum_{i \in s} y_i}{n}$

where s is the selected sample. Similarly, we denote the standard deviation in the frame by $\sigma$ and the sample standard deviation by

$\hat{\sigma} = \sqrt{\dfrac{\sum_{i \in s}(y_i - \hat{\mu})^2}{n-1}} = \sqrt{\dfrac{\sum_{i \in s}\hat{r}_i^2}{n-1}}$

where $\hat{r}_i = y_i - \hat{\mu}$ is the estimated residual.

We do not build a response model here as we did in the earlier part of the course. Instead, we use the probability mechanism that generated the sample to look at the properties of the estimates if we were to repeat the sampling over and over. That is, we define the estimators

$\tilde{\mu} = \dfrac{\sum_{i \in S} y_i}{n}, \qquad \tilde{\sigma} = \sqrt{\dfrac{\sum_{i \in S}(y_i - \tilde{\mu})^2}{n-1}}$

where S is a random subset with $\Pr(S = s) = 1\Big/\tbinom{N}{n}$ for every subset $s \subseteq U$ of size n. It is convenient to re-express the estimators in terms of random variables rather than a random subset. Let

$I_i = \begin{cases} 1 & \text{if unit } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, ..., N$

Then we can write

$\tilde{\mu} = \dfrac{\sum_{i \in U} I_i y_i}{n}, \qquad \tilde{\sigma} = \sqrt{\dfrac{\sum_{i \in U} I_i (y_i - \tilde{\mu})^2}{n-1}}$

in terms of the indicator random variables $I_1, ..., I_N$. Note that the sums are over the entire frame U.

We cannot calculate the exact distribution of the estimator $\tilde{\mu}$. However, we can find many of its properties. We have the following important results:

For simple random sampling of n units from a frame of N units, we have

$E(\tilde{\mu}) = \mu, \qquad stdev(\tilde{\mu}) = \sqrt{1 - \dfrac{n}{N}}\;\dfrac{\sigma}{\sqrt{n}}$


We call $\tilde{\mu}$ an unbiased estimator of $\mu$ since $E(\tilde{\mu}) = \mu$. To prove this statement, note that $\Pr(I_i = 1) = \frac{n}{N}$, so $E(I_i) = 0\left(1 - \frac{n}{N}\right) + 1\left(\frac{n}{N}\right) = \frac{n}{N}$, and hence we have

$E(\tilde{\mu}) = E\left(\dfrac{\sum_{i \in U} I_i y_i}{n}\right) = \dfrac{\sum_{i \in U} E(I_i)\, y_i}{n} = \dfrac{\sum_{i \in U} (n/N)\, y_i}{n} = \mu$

To prove the formula for $stdev(\tilde{\mu})$, we need the result from Stat 230 that, for any linear combination of dependent random variables $V_1, ..., V_n$,

$Var\left(\sum_i a_iV_i\right) = \sum_i a_i^2\, Var(V_i) + \sum_{i \ne j} a_i a_j\, Cov(V_i, V_j)$

where $Cov(V_i, V_j) = E(V_iV_j) - E(V_i)E(V_j)$.

Applying this result, we have

$Var(\tilde{\mu}) = \dfrac{1}{n^2}\left\{\sum_{i \in U} y_i^2\, Var(I_i) + \sum_{i \ne j,\; i,j \in U} y_i y_j\, Cov(I_i, I_j)\right\}$

Since $I_i$ is an indicator random variable, we have

$Var(I_i) = E(I_i^2) - E(I_i)^2 = \Pr(I_i = 1) - \Pr(I_i = 1)^2 = \dfrac{n}{N}\left(1 - \dfrac{n}{N}\right)$

To find the covariance of $I_i$ and $I_j$, we need to find $E[I_iI_j]$. Since the product is 0 unless $I_i = 1$ and $I_j = 1$, we have

$E(I_iI_j) = \Pr(\text{units } i \text{ and } j \text{ are both in the sample}) = \tbinom{N-2}{n-2}\Big/\tbinom{N}{n} = \dfrac{n(n-1)}{N(N-1)}$

so the covariance is

$Cov(I_i, I_j) = E(I_iI_j) - E(I_i)E(I_j) = \dfrac{n(n-1)}{N(N-1)} - \dfrac{n^2}{N^2} = -\dfrac{n}{N}\left(1 - \dfrac{n}{N}\right)\dfrac{1}{N-1}$

Combining the above results we get

$Var(\tilde{\mu}) = \dfrac{1}{n^2}\left\{\dfrac{n}{N}\left(1 - \dfrac{n}{N}\right)\sum_{i \in U} y_i^2 - \dfrac{n}{N}\left(1 - \dfrac{n}{N}\right)\dfrac{1}{N-1}\sum_{i \ne j,\; i,j \in U} y_i y_j\right\} = \dfrac{1}{n}\left(1 - \dfrac{n}{N}\right)\dfrac{1}{N}\left\{\sum_{i \in U} y_i^2 - \dfrac{\sum_{i \ne j,\; i,j \in U} y_i y_j}{N-1}\right\}$

We can simplify the expression inside the braces with a bit of algebra:

$\sum_{i \in U} y_i^2 - \dfrac{\sum_{i \ne j,\; i,j \in U} y_i y_j}{N-1} = \dfrac{1}{N-1}\left\{(N-1)\sum_{i \in U} y_i^2 - \sum_{i \ne j,\; i,j \in U} y_i y_j\right\}$
$= \dfrac{1}{N-1}\left\{N\sum_{i \in U} y_i^2 - \left(\sum_{i \in U} y_i^2 + \sum_{i \ne j,\; i,j \in U} y_i y_j\right)\right\}$
$= \dfrac{1}{N-1}\left\{N\sum_{i \in U} y_i^2 - \left(\sum_{i \in U} y_i\right)^2\right\}$
$= \dfrac{1}{N-1}\left\{N\sum_{i \in U} y_i^2 - N^2\mu^2\right\} = \dfrac{N}{N-1}\left\{\sum_{i \in U} y_i^2 - N\mu^2\right\}$

Multiplying by the factor $1/N$ gives $\left\{\sum_{i \in U} y_i^2 - N\mu^2\right\}/(N-1) = \sum_{i \in U}(y_i - \mu)^2/(N-1) = \sigma^2$, so, at last, we have

$Var(\tilde{\mu}) = \left(1 - \dfrac{n}{N}\right)\dfrac{\sigma^2}{n} \quad\text{and}\quad stdev(\tilde{\mu}) = \left(1 - \dfrac{n}{N}\right)^{1/2}\dfrac{\sigma}{\sqrt{n}}$

as required, where $\sigma^2 = \sum_{i \in U}(y_i - \mu)^2/(N-1)$.


The formula for the standard deviation of an average is the basic result in the sampling part of the course. We use the result repeatedly so it is worthwhile learning it. The factor $1 - \frac{n}{N} = 1 - f$ is called the finite population correction factor (fpc), where f is the sampling fraction, the proportion of the population units included in the sample. The standard deviation of the estimator corresponding to the sample average, using the model generated by simple random sampling, is the usual standard deviation for an average, $\sigma/\sqrt{n}$, multiplied by the square root of the correction factor. The factor arises because the terms in the sum that defines the estimator are dependent. If the sampling fraction is appreciable, then there can be a significant reduction in the standard deviation. For many applications, the sampling fraction is negligible and we ignore the fpc.

We can also show that $E(\tilde{\sigma}^2) = \sigma^2$ (see the exercises).

To use the above results, we need one more fact, a version of the Central Limit Theorem for the average of a sequence of dependent random variables. Here we state one of the consequences of the theorem, avoiding all technicalities.

If N, n and N - n are suitably large, then

$\dfrac{\sqrt{n}\,(\tilde{\mu} - \mu)}{\tilde{\sigma}\sqrt{1 - f}} \sim G(0, 1)$ approximately.

We use this result to build confidence intervals for various attributes of interest.
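As a minimal sketch (the function name srs.ci is ours, not part of the notes), an approximate confidence interval for the frame average under SRS follows directly from this result; for a total, multiply all three numbers by N.

srs.ci <- function(y, N, c=1.96) {
  n <- length(y)
  se <- sqrt(1 - n/N)*sd(y)/sqrt(n)   # standard error including the fpc
  c(estimate=mean(y), lower=mean(y)-c*se, upper=mean(y)+c*se)
}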


Example 1
An auditor has a file of stated counts and prices for N = 1256 items stored in a warehouse of a producer of small automotive parts. The total stated value was $4,311,712. The auditor planned to use SRS to select a sample of item numbers and then physically count the actual number of those selected items. The purposes of the sampling were to assess:
- the true total value of the inventory
- the average dollar error per item
- the proportion of items with counts in error

The auditor selected a simple random sample of n = 50 items and re-counted (actually a pair of co-op students did the counts). The first 10 lines of the data are shown below and the complete data set is available in the file inventory.txt. The average actual value of the sampled items was $2895.29, with corresponding sample standard deviation $1997.22. There were 11 items with count errors in the sample. The sample average dollar error (actual value minus stated value) was $2.33 and the sample standard deviation of the dollar errors was $41.93.

item  stated.number  item.price  stated.value  actual.number  actual.value
   1           1335        0.61        814.35           1335        814.35
  25           1192        1.64       1954.88           1192       1954.88
  39           1294        1.50       1941.00           1294       1941.00
  53           1269        2.04       2588.76           1269       2588.76
  56           1427        3.24       4623.48           1419       4597.56
 121           1529        2.81       4296.49           1529       4296.49
 207           1446        2.48       3586.08           1446       3586.08
 212           1106        1.13       1249.78           1106       1249.78
 223            847        4.95       4192.65            847       4192.65
 225           1016       10.27      10434.32           1016      10434.32

Table 8.1 Part of the Sampled Data

To estimate the total actual value $\tau$, we start by estimating the population average $\mu$ and then use the fact that $\tau = N\mu$. If $y_i$ is the actual value of item number i, then we estimate $\mu$ using the sample average $\hat{\mu} = 2895.29$. To assess the precision of this estimate, we find an approximate 95% confidence interval based on the approximation to the distribution of $\tilde{\mu}$ described above. The form of the interval is the same as usual:

estimate $\pm\; c \times$ standard error(estimate)

where the standard error is the estimate of the standard deviation of the estimator $\tilde{\mu}$. Here the standard error is

$\left(1 - \dfrac{50}{1256}\right)^{1/2}\dfrac{1997.22}{\sqrt{50}} = 276.77$

For a 95% confidence interval, we use c = 1.96 from the standard gaussian (last row of the t tables). Substituting, the confidence interval for $\mu$ is $2895.29 \pm 542.47$. The average actual value is poorly estimated. We are interested in the total actual value $\tau = N\mu$ and we can get a 95% confidence interval for $\tau$ by multiplying the above interval by N = 1256. The interval is

$3{,}636{,}484 \pm 681{,}342$

This interval is very wide, meaning we have estimated the total value of the inventory very imprecisely. Compared to the stated value of $4,311,712, it is difficult to assess whether or not there are material errors in the inventory. One possibility is to increase the sample size, but see below.

We use the same methodology to estimate the average error. The sample average error is $\hat{\mu}_{error} = 2.33$ with corresponding sample standard deviation 41.93. A 95% confidence interval for the average error is

$\hat{\mu}_{error} \pm c \times$ standard error

where the standard error is $\sqrt{1 - \dfrac{50}{1256}}\;\dfrac{41.93}{\sqrt{50}} = 5.81$. Hence the 95% confidence interval for the average error is $\$2.33 \pm \$11.39$.

The average error is more precisely estimated than is the average actual value. We can exploit this result and the fact that we know the total stated value to get a better estimate of the total actual value. Since the error was defined as the actual value minus the stated value, we estimate the true total value as the total stated value ($4,311,712) plus the total dollar error. The estimate of the total error is $\hat{\tau}_{error} = N\hat{\mu}_{error} = 1256 \times 2.33 = \$2926.48$, so the 95% confidence interval for $\tau_{error}$ is $1256 \times (2.33 \pm 11.39)$ or $2926 \pm 14306$. A 95% confidence interval for the total true value of the inventory is then $4311712 + 2926 \pm 14306$ or

$\$4{,}314{,}638 \pm \$14{,}306$

The confidence limits are about 0.3% of the estimated total, so the difference between the stated and actual value of the inventory is likely immaterial. Note that by exploiting the known information (the total stated value) and the fact that we can estimate the average error with much smaller standard error than the average actual value, we can get a much more precise estimate of the total actual value. This procedure is called difference estimation.
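A minimal R sketch of the difference estimate (assuming the column names shown in Table 8.1 for inventory.txt, which we have not verified against the file):

a <- read.table("inventory.txt", header=TRUE)
N <- 1256; n <- nrow(a)                    # n = 50
err <- a$actual.value - a$stated.value     # dollar error per sampled item
se <- sqrt(1 - n/N)*sd(err)/sqrt(n)        # about 5.81
total <- 4311712 + N*mean(err)             # difference estimate of the total actual value
c(total=total, margin=1.96*N*se)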


To estimate the proportion $\pi$ of items with counts in error, let $y_i = 1$ if the ith item is in error and $y_i = 0$ otherwise. The attribute of interest is $\pi = \sum_{i \in U} y_i/N$, the population average of the binary variate. We can use the same theory as above. The sample average is $\hat{\pi} = \frac{11}{50}$ and, since $y_i^2 = y_i$ for a binary variate, the sample standard deviation is

$\hat{\sigma} = \sqrt{\dfrac{\sum_{i \in s}(y_i - \bar{y})^2}{49}} = \sqrt{\dfrac{\sum_{i \in s} y_i - 50\bar{y}^2}{49}} = \sqrt{\dfrac{50\,\hat{\pi}(1 - \hat{\pi})}{49}} = 0.418$

Note that for a binary response variate, the sample standard deviation is a function of the sample average proportion. An approximate 95% confidence interval for $\pi$ is

$\hat{\pi} \pm 1.96\sqrt{1 - \dfrac{50}{1256}}\;\dfrac{0.418}{\sqrt{50}} \quad\text{or}\quad 0.22 \pm 0.11$

There is considerable evidence that more than 10% of the item counts are in error, though the average error is likely to be small and the total error immaterial.

Notes
1. We use the same theory to get estimates and approximate confidence intervals for averages, totals and proportions. The form of the interval is always

estimate $\pm\; c \times$ standard error(estimate)

2. In the above example, the finite population correction factor $1 - f = 1 - \frac{50}{1256} = 0.96$ played a very small role (especially as it enters the calculations as $\sqrt{1-f} = 0.98$) and we could have safely ignored it.

3. In the case of a binary response, the sample standard deviation is

$\sqrt{\dfrac{n}{n-1}\,\hat{\pi}(1 - \hat{\pi})} \approx \sqrt{\hat{\pi}(1 - \hat{\pi})}$

We usually ignore the factor $\sqrt{n/(n-1)}$ when we apply this formula.


Example 2
To assess the quality of a shipment of 2000 cartons of headlights (packed 12 to a carton), a manufacturing organization decides to select a sample of 30 cartons using SRS for inspection. The attribute of interest is the proportion of headlights that are defective. A headlight is declared defective if it fails to pass any one of a large number of tests. The data (number of defective items per sampled carton) are shown below.

0  1  1  4  0  0  0  0  2  3
0  0  3  2  0  0  0  1  1  0
0  0  2  2  0  1  0  1  0  0

The sample average and standard deviation are 0.80 and 1.13 respectively. Here we are using cluster sampling with the clusters defined as the cartons. If $\mu$ is the average number of defectives per carton, then the proportion of defective items in the population is

$\pi = \dfrac{2000\,\mu}{12(2000)} = \dfrac{\mu}{12}$

A 95% confidence interval for $\mu$ is $0.80 \pm 1.96\left(1 - \dfrac{30}{2000}\right)^{1/2}\dfrac{1.13}{\sqrt{30}}$ or $0.80 \pm 0.40$. Hence a 95% confidence interval for $\pi$ is $0.067 \pm 0.033$ or $6.7\% \pm 3.3\%$. The proportion of defective headlights is poorly estimated but is significantly larger than 0. Note how we adapt the results from SRS to apply to cluster sampling.
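A quick R check of these numbers, entering the 30 carton counts directly:

y <- c(0,1,1,4,0,0,0,0,2,3,
       0,0,3,2,0,0,0,1,1,0,
       0,0,2,2,0,1,0,1,0,0)
mean(y); sd(y)                          # 0.80 and 1.13
se <- sqrt(1 - 30/2000)*sd(y)/sqrt(30)
mean(y) + c(-1,1)*1.96*se               # CI for mu: 0.80 +/- 0.40
(mean(y) + c(-1,1)*1.96*se)/12          # CI for pi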
Sample Size Determination
We can use the same theory to answer the most common question in Statistics:

How large a sample do I need?

The obvious answer is "What is your objective?" It may take some effort to elicit a specific response but, with some guidance, you can determine the target population, the attributes of interest, a possible frame and the required precision for the estimates. Since we will select only one sample, we base the sample size determination on the attribute of primary interest. We suppose that we can state the precision in terms of the length of a confidence interval for this attribute. See the Exercises for another formulation of the precision in terms of relative error.

Suppose that we are interested in estimating a population average and we want the confidence interval to be of length 2l (i.e. the confidence interval should be $\hat{\mu} \pm l$). From the above results, assuming we use SRS or a close facsimile, we have

$l = c\sqrt{1 - \dfrac{n}{N}}\;\dfrac{\sigma}{\sqrt{n}}$

or, solving for the sample size,

$n = \dfrac{1}{\dfrac{1}{N} + \dfrac{l^2}{c^2\sigma^2}}$

To determine the required sample size, we need to specify the confidence level to find c and, with more difficulty, guess the value of $\sigma$. If the second term in the denominator is much larger than the first, we can ignore the term 1/N and then the required sample size is approximately $c^2\sigma^2/l^2$. The answer is very sensitive to the value of $\sigma$. In other words, we often do not know enough to give a good answer to the question. One way to get an idea of $\sigma$ is to carry out a small pilot survey. We get an estimate of $\sigma$ to help determine the sample size in the main survey and we can also use the pilot study to test the questionnaire and the rest of the proposed methodology. Sometimes we can use the results of previous surveys with similar response variates on the same population to get an idea of the value of $\sigma$.
Example
In the audit example, suppose that the above description was a pilot survey and the overall goal was to estimate the average error with 95% confidence to within plus or minus one dollar. How many more items do we need to include in the sample?

Here we have $l = 1$, $c = 1.96$, $N = 1256$ and $\hat{\sigma} = 41.93$ from the initial survey. To achieve the required precision, we need

$n = \dfrac{1}{\dfrac{1}{1256} + \dfrac{1}{1.96^2 \times 41.93^2}} = 1059$

Here, because $\hat{\sigma} = 41.93$ is so large, we are forced to examine an extra 1009 items to achieve the desired precision. Since this is most of the frame, we would likely recommend a complete census.
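A minimal sketch of this calculation as an R function (the name n.srs is ours); setting N = Inf drops the 1/N term.

n.srs <- function(l, sigma, N=Inf, c=1.96) 1/(1/N + l^2/(c^2*sigma^2))
round(n.srs(1, 41.93, 1256))        # 1059: the audit example
round(n.srs(0.025, 0.5, c=2.57))    # 2642: the polling example below, with pi = 0.5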
Example
A polling firm has been hired to conduct a cross-Canada survey to solicit opinions from adults on a number of issues. The primary question has a Yes/No answer and the sample size is selected based on estimating the proportion $\pi$ of adult Canadians who would answer Yes to the question. The client asks for a confidence interval of length 5 percentage points (0.05) with 99% confidence, so we have $l = 0.025$, $c = 2.57$. The estimated standard deviation will be $\sqrt{\frac{n}{n-1}\hat{\pi}(1-\hat{\pi})} \approx \sqrt{\pi(1-\pi)}$. The required sample size is

$n = \dfrac{1}{\dfrac{1}{N} + \dfrac{0.025^2}{2.57^2\,\pi(1-\pi)}} \approx \dfrac{2.57^2\,\pi(1-\pi)}{0.025^2} = 10568\,\pi(1-\pi)$

where we ignore the term 1/N since it is so small. Here the required sample size is bounded because the function $\pi(1-\pi)$ has maximum value 0.25 when $\pi = 0.5$. We know that if we choose $n = 10568 \times 0.25 = 2642$, we will meet the requirements. If we have a better idea of $\pi$ from a pilot survey or elsewhere, we may be able to reduce the sample size from this upper bound.

Note that these sample size determinations do not take frame error, non-response error and other such errors into account.
When and How to Implement SRS

Here we briefly look at when SRS should be used and how to implement the sampling protocol. To implement SRS, we need a frame for the target population of interest. If the frame consists of a list of items or people, we can assign each unit a unique number from 1 to N and then use available software to select a sample of n units using simple random sampling. In R, the command

s <- sample(x, n)

selects a random sample of size n from the vector x and stores the result in s. We must be able to examine the selected units. For example, if the units are water heaters packed in cartons stored in large stacks, we can select the sample of identifiers using SRS, but we are unlikely to find someone willing to sort through the cartons to find the selected units.

SRS is the simplest probability sampling protocol. Because of the difficulty of completing a frame, we may use cluster or multi-stage sampling instead. With cluster sampling the frame can be the list of clusters. With multi-stage sampling we can build the frame as we go. For many populations, it is more efficient (e.g. a shorter confidence interval with a smaller sample size) to stratify the population and use stratified random sampling. See Chapter 10.

Exercises
1. Consider the sampling protocols defined in Example 1.
a) Show that the inclusion probability for each unit in the frame is 1/100 for every protocol.
b) On a final examination, a student once defined simple random sampling as follows: "simple random sampling is a method of selecting units from a population so that every unit has the same chance of selection." Is this a correct answer?
c) Show that the estimator corresponding to the sample average $\hat{\mu} = \sum_{i \in s} y_i/n$ is unbiased for $\mu$ for each of the protocols.

2. Consider the estimate

$\hat{\sigma}^2 = \dfrac{\sum_{i \in s}(y_i - \bar{y})^2}{n-1}$

and the corresponding estimator $\tilde{\sigma}^2$.
a) For SRS, show that $\tilde{\sigma}^2$ is an unbiased estimator for $\sigma^2$. [Hint: use the fact that $\sum_{i \in s}(y_i - \bar{y})^2 = \sum_{i \in s} y_i^2 - n\bar{y}^2$.]
b) Is $\tilde{\sigma}$ unbiased for $\sigma$?

3. To estimate the total number of male song sparrows in a 10 km by 10 km square (http://www.birdsontario.org/atlas/atlasmain.html) for a breeding bird atlas, a simple random sample of 50 one-hectare plots (a hectare is 100 m by 100 m) is selected. Using a GPS system, your intrepid instructor visits each of the selected plots (after dawn but before 9:00 am, between May 24 and July 6) and counts the number of singing male song sparrows detected in a 10 minute period. The data are summarized below.

# of sparrows   0   1   2   3   4
# of plots     28  13   5   3   1

a) Find a 95% confidence interval for the total number of male song sparrows in the square.
b) Suppose that I wanted to estimate the total number of male song sparrows to within 1000 with 95% confidence. How many additional plots are needed?

4. Suppose we want to estimate a population average so that the relative precision is specified. That is, we want to find the sample size required (SRS) so that the length of the confidence interval 2l divided by the sample average is pre-determined.
a) For a given confidence level and required precision p%, find a formula for the required sample size.
b) What knowledge of the population attributes do we need to make this formula usable?

5. One cheap but poor way to check the quality of a batch of items is called acceptance sampling. Suppose that there are N = 1000 items in a shipment and you cannot tolerate more than 1% defective (your first mistake: why should you tolerate any defective items from your supplier?). You decide to select and inspect a sample of 20 items and accept the shipment if you find 0 defectives. If you find 1 or more defective items, you inspect the complete shipment.
a) How would you select the sample?
b) Calculate the probability $p(\pi)$ that you accept the shipment as a function of $\pi$, the percentage of defective items in the shipment.
c) Graph $p(\pi)$ for $0 \le \pi \le 10\%$.
d) Given the results in c), you decide to increase the sample size so that there is only a 5% chance of accepting a shipment with 1% defective. What sample size do you recommend?


Chapter 9 Ratio and Regression Estimation with SRS
In this chapter, we consider two related problems:
- estimating a ratio, such as the proportion or average response of a subpopulation (domain) with unknown size
- improving the sample average as an estimate of the frame average by using explanatory variates

Estimating a Ratio
In Chapter 8, we looked at assessing the estimates of the frame average (or total) when the sampling protocol is SRS and the estimate is the sample average. Here we consider estimating a ratio. The distinguishing feature is that both the numerator and denominator would change if we were to repeat the sampling protocol over and over. Consider again the inventory example from the previous chapter. Suppose we want to estimate the average size of the error in those files that are in error. We can write this attribute as

$\theta = \dfrac{\sum_{i \in U} y_i z_i}{\sum_{i \in U} z_i} = \dfrac{\sum_{i \in U} y_i z_i / N}{\sum_{i \in U} z_i / N} = \dfrac{\mu}{\pi}$

where $z_i = 1$ if the ith file is in error and 0 otherwise. Note that $\sum_{i \in U} y_i z_i = \sum_{i \in U} y_i$ here because $y_i$ is the error in the ith account. The parameter $\mu$ is the average error per file and $\pi$ is the proportion of files in error.
We use the estimate $\hat{\theta} = \hat{\mu}/\hat{\pi}$ with corresponding estimator $\tilde{\theta} = \tilde{\mu}/\tilde{\pi}$. To assess the estimate and produce confidence intervals for $\theta$, we find the (approximate) distribution of $\tilde{\theta}$ by finding its mean and variance and then using a gaussian approximation.

To derive the approximation, we use Taylor's theorem for a function of two variables. Recall that we can expand $f(x, y)$ about the point $(x_0, y_0)$ to get a linear approximation

$f(x, y) \approx f(x_0, y_0) + \dfrac{\partial f(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial f(x_0, y_0)}{\partial y}(y - y_0)$

The linear function on the right has the same value and first partial derivatives as $f(x, y)$ at the point $(x_0, y_0)$. If $f(x, y) = x/y$, then we have $\dfrac{\partial f(x_0, y_0)}{\partial x} = \dfrac{1}{y_0}$, $\dfrac{\partial f(x_0, y_0)}{\partial y} = -\dfrac{x_0}{y_0^2}$ and

$\dfrac{x}{y} \approx \dfrac{x_0}{y_0} + \dfrac{1}{y_0}(x - x_0) - \dfrac{x_0}{y_0^2}(y - y_0)$


Replacing $(x, y)$ by the random variables $(\tilde{\mu}, \tilde{\pi})$ and $(x_0, y_0)$ by $(\mu, \pi)$, we have

$\tilde{\theta} \approx \theta + \dfrac{1}{\pi}(\tilde{\mu} - \mu) - \dfrac{\mu}{\pi^2}(\tilde{\pi} - \pi)$

For large sample sizes, the approximation is reasonable since we expect $(\tilde{\mu}, \tilde{\pi})$ to be close to $(\mu, \pi)$. Hence we have

$E(\tilde{\theta}) \approx \theta + \dfrac{1}{\pi}E(\tilde{\mu} - \mu) - \dfrac{\mu}{\pi^2}E(\tilde{\pi} - \pi) = \theta$

$Var(\tilde{\theta}) \approx \dfrac{1}{\pi^2}\,Var(\tilde{\mu} - \theta\tilde{\pi})$

The estimate is approximately unbiased (but see Exercise 1). We can write the variance in several forms. Notice that the estimate corresponding to $\tilde{\mu} - \theta\tilde{\pi}$ is $\sum_{i \in s}(y_i - \theta z_i)/n$, which is the sample average of $r_1, ..., r_n$ where $r_i = y_i - \theta z_i$. Using the basic formula for the variance of an average with SRS,

$Var(\tilde{\theta}) \approx \dfrac{1}{\pi^2}(1 - f)\dfrac{\sigma_r^2}{n}$

where $\sigma_r$ is the frame standard deviation of the $r_i$.

We can estimate this variance by the corresponding sample variance:

$\dfrac{(1-f)}{n}\,\dfrac{\sum_{i \in s}(r_i - \bar{r})^2}{n-1} = \dfrac{(1-f)}{n}\,\dfrac{\sum_{i \in s}[y_i - \bar{y} - \theta(z_i - \bar{z})]^2}{n-1} = \dfrac{(1-f)}{n}\,\dfrac{\sum_{i \in s}(y_i - \hat{\theta}z_i)^2}{n-1}$

where we replace $\theta$ by its estimate $\hat{\theta} = \bar{y}/\bar{z}$ in the second step (note that $\bar{y} - \hat{\theta}\bar{z} = 0$) and $f = n/N$ is the sampling fraction as usual. The estimate of the variance of the estimator is then

$\widehat{Var}(\tilde{\theta}) = \dfrac{1}{\hat{\pi}^2}\,\dfrac{(1-f)}{n}\,\dfrac{\sum_{i \in s}(y_i - \hat{\theta}z_i)^2}{n-1}$

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-2

Note that the last factor is the sample variance of the estimated residuals y1 z1 ,..., yn zn . To construct a confidence interval for , for large values of n and N, the estimator is approximately gaussian so the confidence interval has the standard form estimate c stdev( )
2.33 In the example see the data file inventory.txt, we have = = = 10.57 . To find 0.22 the estimate of the standard deviation, first calculate the standard deviation of r1 = y1 z1 ,..., rn = yn zn in R by creating the vector r < y theta _ hat * z . Then 1 multiply by the factor 1 f . In the example, we have the standard error is 26.26. n A 95% confidence interval for is 10.57 51.48 so we can estimate the average error in accounts with errors very imprecisely.

Also note that this confidence interval is wider than that for , the average error, because we also have uncertainty about the proportion of files in error. We can use the same approach via Taylors theorem to estimate any other function of variate averages in which we have interest. Ratio Estimation of the Average Suppose the purpose of the survey is to estimate the study population average ( y ) for some variate y. Note the change in notation to explicitly include y in the definition of the attribute. In many surveys, there are other (explanatory) variates that can be measured on each unit in the sample and for which we have complete knowledge of their attributes in the population. For example, in the inventory survey, we know the stated value and the stated number of items for each file in the population and hence we can calculate population attributes for these variates. In many surveys of human populations, the demographics (gender ratio, age distribution etc.) of the population are known, perhaps from a census. When we get the sample, we can determine the values of the response variate y and the explanatory variates, say gender and age for each person in the sample. The idea of the methods discussed here is to adjust the sample average ( y) based on differences between the sample and (known) population attributes of the explanatory variates. For simplicity, we consider only one explanatory variate. Example In the assessment of a lot of 10000 incoming molded parts, a company selects a sample of 40 parts to check the average length of a critical dimension. From previous experience, Stat 371 R.J. MacKay, University of Waterloo, 2009 IX-3

they know that the dimension is related to part weight so they measure the weight of each part in the sample and also the weight of the entire shipment. The sample data are included in the file molded.txt and are plotted below. The plot shows a strong correlation between the length and weight.
Length versus weight
48 47.5 47 46.5 46 45.5 45 44.5 44 43.5 31.5 32 32.5 33 33.5 weight (g) 34 34.5 35 35.5

The average and standard deviations for the two variates are:
x 33.24 0.691 y 45.56 1.005

sample average sample st. dev.

The population average weight is ( x ) = 33.10 grams determined as the total weight (measured all at once) divided by the number of pieces N = 10,000 . ( y) ( x ) = ( x ) where ( x ), ( y ) are the ( x) sample averages for x and y and = ( y ) / ( x ) . The ratio estimate of ( y ) is ( y ) ratio = The sample is collected haphazardly since it is too expensive to create a frame. We develop the estimators and their properties assuming SRS this corresponds to assuming that the haphazard sampling protocol mirrors SRS if the protocol is repeated over and over. We use the results on the estimation of a ratio to derive an approximation for the mean and standard deviation of ( y ) ratio .

Stat 371 R.J. MacKay, University of Waterloo, 2009

length (micron)

IX-4

E[ ( y ) ratio ] = E[ ] ( x ) ( x ) = ( y ) Var[ ( y ) ratio ] = ( x ) 2Var[ ] 1 Var[ ( y ) ( x )] ( x )2 = Var[ ( y ) ( x )] Using the results on the estimation of a ratio, we estimate the variance of ( y ) ratio by ( yi xi )2 1 ri2 1 Var[ ( y ) ratio ] = (1 f ) i s = (1 f ) i s n 1 n n 1 n = ( x )2

where ri = yi xi as before.

In the example we have = 1.371 and


population average is ( y ) ratio =

( y
i s

xi )2 = 0.147 so the ratio estimate of the

n 1

45.56 33.10 = 45.37 with corresponding standard error 33.24 0.060. An approximate 95% confidence interval for the population average length based on the ratio estimate is 45.37 0.12 microns.
Here the ratio estimate is more precise than the sample average ( y ) = 45.56 since this ( y) or estimate gives a confidence interval (ignoring the fpc) ( y ) 1.96 40 45.56 0.31 .

We can compare the estimated variance of ( y ) ratio versus that of ( y ) , the estimator based on the sample average, in general. Consider

Var[ ( y ) ratio ] =

1 (1 f ) i n

( y
s

xi )2

n 1

versus Var[ ( y )] =

1 (1 f ) i n

( y
s

y )2

n 1

The ratio estimate is more precise (i.e. gives a shorter confidence interval) than the sample average if

( y
i s

xi ) 2 < ( yi y ) 2
i s

The expression on the left is the residual sum of squares if we fit a line through the origin to the sample scatterplot. The expression on the right is the total sum of squares. Qualitatively, the ratio estimate is more precise if a line through the origin explains some of the variation in the response variate. This gain in precision is the major advantage of the ratio estimate.

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-5

Notes 1. To apply ratio estimate effectively, we need to measure the explanatory variate xi for each unit i in the sample

to know ( x ) , the population average of the explanatory variate a relationship of the form y = x + noise , a straight line through the origin, between x and y in the study population. The smaller the noise, the greater the benefit in using the ratio estimate.

2. If we think of ratio estimation in terms of fitting a line to the scatterplot, then the estimate is an adjustment based on the fact that the sample average x is different than the population average ( x ) .
x

fitted line y = x
x

( y ) ratio

x x

( y)
x x

x x x

( x)

In this case, the sample average x is smaller than the population average ( x ) so we adjust the estimate of ( y ) upward using the relationship between y and x. The closer x is to ( x ) , the smaller is the adjustment. We can also see the adjustment by rewriting the ratio estimate as ( x ) ( y) ( y)ratio = ( x ) 3. If we fit a response model to the above data (e.g. Yi = xi + Ri , Ri ~ G (0, ) ) then we

x y /n Since = x /n
i i
2 i

x y estimate the slope using = x


i i s 2 i i s

. This suggests another estimate ( x ) for ( y ) .

i s

can be written as the ratio of two averages, we can derive its

i s

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-6

variance as we did for and hence find the variance of ( x ) . You may wonder how the precision of this estimate compares to that of the ratio estimate. y With ratio estimation as described above, we estimate the slope using = . If we x start with a response model Yi = xi + Ri , Ri ~ G (0, xi ) where the variation

around the line increases as x increases, then we can divide by

xi to get the model

Yi / xi = xi + Ri* , Ri* ~ G(0, ) with constant standard deviation. You can easily y verify that the least squares estimate of in this model is = . x If there is constant variation about the line, we expect the estimator based on to be superior. If the variation increases as x increases, then we expect the ratio estimate to be better. In either case, because we are exploiting structure in the study population, the estimates will be superior to the sample average.
4. Suppose the response variate y is binary and the goal is to estimate the population proportion . If we have a continuous explanatory variate x, we need more complex models (and subsequent analysis) to exploit the relationship between the variates in the study population. If the explanatory variate is binary or categorical we can use post-stratification (see the Chapter 9) to improve the precision of the estimation of .
Regression Estimation of the Average

Once we discussed ratio estimation in terms of fitting a line through the origin to the data in the sample, you will have considered what happens if the line does not go through the origin. Here we look at using information on the explanatory variate if the relationship between y and x is linear with constant variation about the line. In other words, we have the conditions necessary for fitting the response model Yi = + ( xi x ) + Ri , Ri ~ G (0, )
to the data in the sample. Note that x = ( x ) is the sample average for the explanatory variate.

To produce the regression estimate ( y ) reg , we

fit a the model using least squares to estimate and to get

= y = ( y ),

(x x) y = (x x)
i i s 2 i i s

substitute the known mean ( x ) into the fitted line

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-7

( y ) reg = ( y ) + [ ( x ) ( x )]
We can view ( y ) reg as an adjustment to the sample average ( y ) as we did with the

ratio estimate. Suppose that is positive so that in the study population larger values of x correspond to larger values of y. If the sample average ( x ) of the explanatory variate is less than the known population average ( x ) , we adjust the estimate of ( y ) upward. The adjustment is shown on the following plot.

fitted line y = ( y) + ( x ( x ))
x

( y ) reg

x x

( y)
x x

x x x

( x)

The properties of the estimator ( y ) reg = ( y ) + [ ( x ) ( x )] are complicated because of the three random components. We can simplify the argument with the following handwave. Rewrite the estimator as

( y ) reg ( y ) = [ ( y ) ( y )] + [ ( x ) ( x )] + [ ][ ( x ) ( x )]
In large samples, we expect each of the terms within the brackets [ ] to be small. The right-most term, a product of two small quantities, is an order smaller than the other two terms. Hence we can say that

( y ) reg ( y ) [ ( y ) ( y )] + [ ( x ) ( x )]
and we have E [ ( y ) reg ( y )] E [ ( y ) ( y )] + E [ ( x ) ( x )] = 0 . That is, the regression estimate is approximately unbiased.

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-8

is the n sample average of r1 ,..., rn where ri = yi ( xi ( x )) and r = y ( x ( x )) . Using the basic result for the variance of an average with SRS, we have [ r ( r )]2 1 i Var ( ( y ) reg ) = (1 f ) iU which can be estimated by n N 1 1 ^ Var ( ( y ) reg ) = (1 f ) n

( We can estimate Var ( ( y ) reg ) by noting that y ) reg =

[ y
i s

( xi ( x ))]

[r r ]
i i s

n 1

1 = (1 f ) i s n

[ yi y ( xi x )]2 n 1

where we have replaced by the estimate . The last factor is the sample variance of the estimated residuals from the least squares fit of the line to the sample data.

Example The volume of useable wood y in a Douglas fir is related to the basal area, x , the crosssectional area of the tree measured at breast height. Volume is expensive to measure because it requires that the tree be destroyed. To estimate the total volume in a section of forest that was to be sold, a sample of 25 trees was selected by dividing the section into small subsections. A SRS of 25 sub-sections was selected and then a tree was selected at random within the sub-section. We will treat this protocol as if it were SRS. The selected trees were sacrificed and the basal area and volume were measured. The data and fitted line are plotted below.
Volume versus Basal Area
7.50 7.00 6.50 volume 6.00 5.50 5.00 4.50 4.00 3.50 3.00 0.50 0.70 0.90 1.10 1.30 1.50 1.70 1.90

basal area

The equation of the fitted line is y = 6.17 + 1.51( x 1.31) and the residual sum of squares is 2.268

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-9

A second much larger (and cheaper) survey was carried out to estimate the total number of trees N = 56800 and the average basal area ( x ) = 1.40 . We assume that the errors in estimating ( x ) and N are negligible.
The regression estimate is ( y ) reg = 6.17 + 1.51(1.40 1.31) = 6.31 and the standard error of the estimate is

1 2.268 = 0.061 25 24 The approximate 95% confidence interval for ( y ) based on the regression estimator is 6.31 1.96 0.061 = 6.31 0.12 . The 95% confidence interval for the total volume, ( y ) = 56800 ( y ) , is then 358 408 6816 .
The estimate for ( y ) based on the sample average ( y ) gives a 95% confidence interval 6.17 0.22 since the sample standard deviation of the 25 measured volumes is 0.111. The regression estimate is more precise in this case.

^ stdev ( ( y ) reg )

In general, we can compare the precision of the sample average, the ratio estimate and the regression estimate by looking at the sum of squares of the estimated residuals under the three least squares fits Sample average: ( yi y )2
i s

Ratio estimate: Regression estimate:

( y
i s

xi ) 2 where = y / x

( y
i s

y ( xi x )) 2 where is the estimated slope

The major reason for using ratio and regression estimates is the gain in precision.
Notes 1.

To use the regression estimate effectively, we need a continuous response variate y and a continuous explanatory variate x knowledge of the study population average of the explanatory variate a linear relation between y and x with smaller residual variation leading to a more precise estimate.

2. A special simple case of regression estimation is to use the difference d i = yi xi as the response variate and then estimate the population average by ( y ) diff = ( d ) + ( x ) This estimate is more precise than the sample average if the variation in the differences d1 ,..., d n is less than the variation in y1 ,..., yn . We used a difference estimate in the inventory example to estimate the total true value of the files. 3. Regression estimation can be extended to multiple explanatory variates and nonlinear relationships. We use least squares to estimate the relationship between the Stat 371 R.J. MacKay, University of Waterloo, 2009 IX-10

response and explanatory variates in the sample and then adjust the sample average using the fitted model and the sample averages for the explanatory variates. Note this adjustment accounts for differences in the sample attributes of the explanatory variates and the known population attributes.

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-11

Exercises

1.

Find the quadratic expansion of f ( x, y) = y / x about the point ( ( x ), ( y)) to ~ ~ ~ estimate the bias in the estimator = ( y) / ( x ) . Note that the general form of the expansion is
f ( x 0 , y0 ) f ( x 0 , y0 ) 2 f ( x 0 , y0 ) ( x x 0 ) 2 2 f ( x 0 , y0 ) 2 f ( x 0 , y0 ) ( y y0 ) 2 ( x x0 ) + ( y y0 ) + + ( x x 0 )( y y0 ) + 2 x y x 2 xy y 2 2

f ( x , y ) f ( x 0 , y0 ) +

This quadratic function has the same value, first and second derivatives at the point ( x0 , y0 ) as does f ( x, y) . You can easily check this statement by differentiating the right side of the expression. 2. In order to count the number of small items in a large container, a shipping company selects a sample of 25 items and weighs them. They then weigh the whole shipment (excluding the container). Assume that there is small error in weighing and act as if SRS is used - it is not, the sampling is haphazard. Let the weight of the ith item in the population be yi and the total known weight be a) Show that an estimate of the population size is N =

y
i s

/ 25

b) Find the (approximate) mean and standard deviation of the corresponding estimator ~ N. c) In the example, the sample average weight is 75.45 g, the sample standard deviation is 0.163 g and the total weight is 154.2 kg. Find a 95% confidence interval for the total number of items in the container. 3. Briefly describe when you would use the ratio or regression estimate instead of the sample average to estimate the population average. Many bird species have specialized habitat. We can exploit this knowledge when we are trying to estimate population totals or density. For example, wood thrush are a forest dwelling bird that live in the hardwood forests of eastern North America. Suppose we wanted to estimate the number of wood thrush pairs nesting within the region of Waterloo, an area of highly fragmented forest patches. Using aerial photography, we know that there are 1783 such patches (minimum size 3 ha) with an average size 13.4 ha. A simple random sample of 50 woodlots is selected and the number of nesting pairs yi is counted in each woodlot by counting the number of singing males. The area xi of each sampled woodlot is also recorded. The data are available in the file thrush.txt. Find a 95% confidence intervals for the total number of thrushes based on the a) sample average y b) ratio estimate c) regression estimate

4.

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-12

5. The City of Waterloo wants to estimate the average amount of water per house ( y) that is used to water lawns and gardens in the month of July. A SRS of 50 houses is selected and special metering units are installed to measure the volume of water y from external taps. The total volume of water x is measured by the regular meter. From water records, it is known that the average total water consumption per house is ( x ) = 15.6 cubic metres. The data are stored in the file water.txt. a) Prepare a scatterplot of y versus x b) Estimate ( y) using the sample average, the ratio estimate and the regression estimate. c) Find 95% confidence intervals based on each estimate. d) Which estimation procedure is preferable here? Why?

Stat 371 R.J. MacKay, University of Waterloo, 2009

IX-13

Chapter 10 Stratified Random Sampling In the previous chapter, we looked at ways to use an explanatory variate with known attributes to improve on the sample average as an estimate of the study population average with SRS. The basic idea was to exploit a structural relationship between the response and explanatory variates with known attributes in the population. In this chapter, we change both the sampling protocol and the estimate to get a procedure that usually produces a better estimate of the study population average. The idea is to divide the study population in sub-populations, called strata, and sample independently using SRS from each stratum. Then combine the estimates of each stratum average to get an estimate of the population average. Some examples of possible strata are: Provinces and large urban centers in national opinion surveys Small and large accounts in auditing a population of accounts Home faculties in a survey of UW students Sites in a survey of employees in a multi-site company In many examples, we have questions about the strata averages as well as the overall population average. Stratified sampling gives information about these averages and often an improved estimate of the overall average. Suppose that we divide the population U into H mutually exclusive strata U1 ,..., U H with sizes N1 ,..., N H so that N = N1 +...+ N H . For the variate of interest, we denote the stratum averages and standard deviations by h , h , h = 1,..., H . With this notation, we write
N1 1 +...+ N H H = W1 1 +...+ WH H N the weighted average of the stratum averages. We call Wh = Nh / N the stratum weight, the proportion of the total units found in that stratum.

Now suppose for each stratum, we independently select a sample of size nh from stratum h using SRS and calculate the sample average h . We can combine these estimates to get the stratified estimate of the population average

strat = W1 1 +...+ WH H
~ ~ ~ The corresponding estimator is strat = W1 1 +...+ WH H . Since we used SRS within each stratum, we have

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-1

~ ~ ~ E( strat ) = W1 E( 1 )+...+WH E( H ) = W1 1 +...+ WH H =


and

~ 2 Var ( strat ) = W12 (1 f1 ) 1 +...+ WH (1 f H ) H n1 nH


2 2

where fh = nh / Nh is the sampling fraction and h is the standard deviation of the response variate for stratum h. We can estimate the variance by

2 ^ ~ Var( strat ) = W12 (1 f1 ) 1 +...+ WH (1 fH ) H n1 nH


2 2

is the sample standard deviation within stratum h and y jh is nh 1 the response variate for the jth unit in the sample from stratum h . Example To estimate average water quality and the proportion of wells with contamination, a survey of residential wells was carried out in the rural part of the region of Waterloo. The population of 13 345 wells was identified from assessment records. Three strata were created Stratum Size Weight Sample Size farms with animals 2365 0.177 150 farms without animals 1297 0.097 100 houses 9683 0.726 250 A random sample of wells was selected from each stratum and the water was tested for a large number of characteristics. Here we look at only two:

2 where h =

(y
j sh

hj

h )2

y: u:

sodium (Na) concentration (mg/L) the water was contaminated by coliform bacteria (u = 1) or not (u = 0 )

The data are summarized below. Stratum farms with animals farms without animals houses Average Na 237.3 245.6 220.1 St Dev Na 41.45 37.62 51.23 % contaminated 17.2 11.4 13.2

The estimate of the population average Na concentration is

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-2

strat = 0.177(237.3) + 0.097(245.6) + 0.726(220.1) = 225.6 mg/L


~ The estimated variance of the estimator strat is

150 41.452 100 37.62 2 250 51.232 ^ ~ Var( strat ) = (.177)2 (1 ) + (.097)2 (1 ) + (.726) 2 (1 ) = 5.845 2365 150 1297 100 9683 250
and so the standard error is 2.418 mg/L. An approximate 95% confidence interval for is 225.6 4.7 mg/L. For the binary response variate u, the estimate of the proportion of contaminated wells = W1 1 + W2 2 + W3 3 is strat = 0.177(.172 ) + 0.097(.114) + 0.726(.132) =.137 or 13.7%. The estimated variance of the associated estimator is
150 [.172(1.172)] 100 [.114(1.114)] 250 [.132(1.132)] ^ ~ Var ( strat ) = (.177)2 (1 ) + (.097)2 (1 ) + (.726)2 (1 ) = 0.000272 2365 150 1297 100 9683 250

and so the standard error is 0.0165. Note that we used the approximation
1 fh ~ Var ( h ) h (1 h ) nh in each stratum. The 95% confidence interval for is 0.137 0.032 or 13.7% 3.2%

Note that you need to be careful moving from percentages to proportions. The formulae are expressed in terms of proportions. We can compare the strata means and proportions see Exercise 1. There are several questions of interest. When is stratified sampling more efficient than SRS? How should we allocate the total sample among the strata? Can we combine ratio/regression estimation with stratified sampling? The answer to the last question is the easiest. We can estimate the strata averages in the best way possible e.g. ratio or regression estimation if appropriate, and then combine the estimates using the strata weights. Since each stratum is sampled independently, the estimated variance is the sum of the squared strata weights times the variance of the stratum estimates. Note that these variances are calculated using the formulae for ratio or regression estimates.

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-3

Comparison to SRS To examine the efficiency of stratified sampling, we need to consider the sampling weights wi = ni / n , the strata weights Wi = Ni / N and the relative sizes of 1 ,..., H ~ versus . Looking at the variance of strat

~ 2 Var ( strat ) = W12 (1 f1 ) 1 +...+ WH (1 f H ) H n1 nH


2 2

2 WH 2 1 W12 2 [ 1 +...+ H] n w1 wH we see that if we were to give a very small sample weight to a stratum with high stratum ~ ~ weight and 2 > 2 , then Var ( strat ) > Var ( ) . In other words, there is no uniform result; h it is possible to have larger variance with stratified sampling. However, this is a contrived situation and in most cases, if we construct the strata with care, stratified sampling will be much better than SRS.

To confirm this point, consider proportional allocation where, except for rounding, we have wh = Wh or, in other words, nh N h . We also ignore the finite population corrections. Substituting Wh = wh = nh / n , we have

~ Var ( strat ) = W1 (n1 / n) 1 +...+ WH (nH / n) H n1 nH


2 2

1 2 (W1 1 +...+ WH 2 ) H n and the variance of the stratified estimator will be less than the variance of the sample average from SRS if =
2 W1 1 +...+ WH 2 < 2 H

The left side is the weighted average of the within strata variances. If we form the strata so that these variances are small i.e. form the strata so that there is greater consistency within strata compared to the whole population, then the weighted average will be less than the overall variance. Another way to make the same point is to use the ANOVA partition of the total sum of squares into two components, within and between strata. Suppose that y jh is the response variate for the jth unit in stratum h. Consider each stratum as a treatment and recall for an unbalanced design, we can partition the total sum of squares as. Between strata (treatment): Within strata (treatment): Total:

N ( (y
h h

)2

hj

h ) 2 = ( Nh 1) 2 h
h

( yhj )2 = ( N 1) 2
h j

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-4

where the sums are over the whole population. We have

2 =

N (
h h

)2 +

(N
h

1) 2 h

N 1

N 1

Wh ( h ) 2 + Wh 2 h
h h

the difference in variance for stratified versus sample average estimator is proportional to Wh ( h )2 . For proportional allocation (and for any other allocation), we should
h

make the strata means as different as possible in order to achieve the greatest gain over SRS. Optimal allocation Suppose that at the Plan stage, we decide that we can afford to select a sample of size n. How should we divide the sampling effort among the strata if the objective is to minimize the variance of the resulting estimator? This is the allocation problem. We want to ~ determine n1 ,..., nH so that n1 +...+ nH = n and Var ( strat ) is minimized. We treat n1 ,..., nH as continuous variables and use a Lagrange multiplier. That is, we find the critical point of the function

f (n1 ,..., nH , ) = W12 (1 f1 ) = W12 (

2 1

n1

2 +...+ WH (1 fH )

2 H
nH

+ (n1 +...+ nH n)

1 1 2 1 2 1 ) 1 +...+ WH ( ) 2 + (n1 +...+ nH n) H n1 N1 nH N H

The partial derivatives are these to 0, we get nh =

Wh h

f f W 2 2 = h 2 h + , h = 1,..., H , = n1 +...+ nH n . Setting nh nh and solving for ,


W1 1 +...+ WH H . Hence we have, for optimal allocation n nh = Wh h n W1 1 +...+ WH H

W1 1

+...+

WH H

= n or =

or more simply nh Wh h . We allocate more sampling effort to those strata that have higher weight or larger within stratum standard deviation. If we ignore the fpc, then for the optimal allocation, we get 1 ~ Var ( strat ) = (W1 1 +...+ WH H ) 2 n Note that proportional allocation is optimal if h is the same for each stratum. In order to use the optimal allocation, we need to know (unlikely) or have an estimate of the within-stratum standard deviations, perhaps from a pilot survey, before selecting the sample. If we do so and decide to use optimal allocation, then we can use the preceding

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-5

formula to select the total sample size n to achieve a confidence limit with a predetermined length. That is, for a given level of confidence, the approximate confidence interval has length 2c 2l = (W1 1 +...+WH H ) n c2 so we select a sample with total size n = 2 (W1 1 +...+ WH H )2 where c is a value from l the G(0,1) tables determined by the level of confidence.
Forming the Strata Stratified sampling can produce large increases in precision (i.e. shorter confidence intervals) compared to SRS for the same total sample size. Put in another way, for a given level of precision, we can use a smaller sample size with stratified sampling. However, stratification adds complexity; for example, we need to identify the stratum for each unit in the frame before we begin.

The first consideration is the purpose of the survey. In many cases such as the labour force survey, we want to estimate the rate of unemployment in each province so it is natural to stratify by province. Since we are interested in the provincial rates, we need to ensure that each province gets a large enough sample to estimate the within-stratum rate so here the allocation problem is very different. If we are interested only in the overall frame average or total, we form the strata so that the averages are likely to be very different. If we have complete knowledge of some explanatory variate that we believe to be related to the response variate, we can use the values of the explanatory variate to form the strata. Proportional allocation is popular because we do not need to know the within strata standard deviations and we can be almost sure to do better than with SRS. We can use estimates from a pilot study or an earlier version of the survey to optimally allocate the sample across the strata. In some cases, a particular stratum may be so important that we do a complete census. For example, in many applications in auditing, accounts are stratified on the basis of stated value. The likelihood of large errors is greater in larger accounts so every account in the stratum of the largest accounts is included in the sample.
Post Stratification We now return to an issue discussed in Chapter 9. Suppose there is a discrete explanatory variate such as gender or age class and we know the proportion of the population that falls in each class. This corresponds to knowing the mean for a continuous explanatory variate that we might use in a ratio or regression estimate.

We cannot use the discrete variate to form strata since we do not know the value of the variate for every unit in the frame. Instead, we know the population proportions or weights W1 ,..., WH for the H classes. Stat 371 R.J. MacKay, University of Waterloo, 2007

X-6

We select a sample using SRS from the frame and observe n1 ,..., nH units in each class. The sample sizes are not controlled and if we were to repeat the sampling they would change. A natural estimate of the population average is

post = W1 1 +...+WH H
We call this the post- stratification estimate because we do not establish the stratum for each unit in the sample until after it is selected. The estimate looks like the stratified ~ estimate the estimators are different because the denominator of h is random for the post-stratification estimator. To determine the properties of this procedure, we have a small aside. Suppose X and Y are two discrete random variables with probability function Pr( X = x, Y = y) . Then we can write E( X ) = x Pr( X = x, Y =y) = [ x Pr( X = x| Y =y)]Pr(Y = y)
y x y x

The expression in [ ] is the conditional expected value of X for a given value Y = y and is written E( X| Y = y) . Note that E( X| Y = y) is a function of y only because we have added over all values of x. With this notation we have E( X ) = E( X | Y = y) Pr(Y = y)
y

The right side is the expected value of the function E( X| Y = y) so we write

E( X ) = E[ E( X| Y = y)]
In words, we can calculate the expected value of X in two steps. First, find the conditional expectation for each value of y and second, find the expected value of the conditional expectation over the distribution of Y.

~ ~ ~ We use this result to find E( post ) . Consider the two random variables h , nh . Then we have ~ ~ ~ E ( h ) = E[ E ( h | nh = nh )]
~ ~ As long as nh 0 , we know using the results from SRS that E ( h | nh = nh ) = h . If we ignore the event nh = 0 (which happens with small probability in large samples) we have to a good approximation ~ ~ ~ E ( h ) = E[ E ( h | nh = nh )] E ( h ) = h Hence we have ~ ~ ~ E( post ) = E(W1 1 +....+WH H ) = W1 1 +....+WH H =

The post stratified estimate is unbiased (almost).

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-7

To find the variance, we need a second result. Var ( X ) = E[ X 2 ] 2


= E[ E( X 2 | Y = y)] 2 = E[ E( X 2 | Y = y) E( X| Y = y)2 ] + E[ E( X | Y = y) 2 ] E[ E( X | Y = y)]2 since = E[ E( X| Y = y)] . We now interpret the two pieces. The expression inside the first [ ] is Var( X| Y = y) so the first term is E[Var( X| Y = y)]. The second and third terms are Var[ E( X| Y = y)] so we have the result

Var( X ) = E[Var( X| Y = y)] + Var[ E( X| Y = y)]


~ ~ ~ To find Var( post ) , we condition on n1 = n1 ,..., nH = nH .
~~ ~ Since E( | n1 = n1 ,..., nH = nH ) = for all values of n1 ,..., nH , we have ~ ~ ~ Var[ E( post | n1 = n1 ,..., nH = nH )] = Var[ ] = 0 . Also, from SRS, we have
1 1 2 1 2 1 ~ ~ ~ Var ( post | n1 = n1 ,..., nH = nH ) = W12 ( ) 1 +...+ WH ( ) 2 H n1 N1 nH N H and so

1 1 2 1 1 2 ~ ~ ~ E[Var ( post | n1 = n1 ,..., nH = nH )] = W12 ( E[ ~ ] ) 1 +...+ WH ( E[ ~ ] ) 2 H n1 N1 nH NH

Combining the two pieces we get


1 1 2 1 1 ~ 2 Var ( post ) = W12 ( E[ ~ ] ) 1 +...+ WH ( E[ ~ ] ) 2 H n1 N1 nH NH

We approximate this variance by 1 1 2 1 2 1 ^ ~ Var ( post ) = W12 ( ) 1 +...+ WH ( ) 2 H n1 N1 nH N H that is identical to the variance of the stratified estimator for the observed allocation n1 ,..., nH .
Example A market research organization interviews a randomly selected sample of 300 households in a community to estimate the average amount of money spent on DVD/video rental and movies in the previous week. From census data, they know the distribution of household

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-8

size in the community but this information is not available in the frame for each unit. They post stratify the data as follows.
Household size Population weight (census) 0.232 0.381 0.193 0.123 0.071 Sample size Sample weight Sample average Sample standard deviation 3.67 4.56 5.89 5.23 6.77

1 2 3 4 >4

87 109 54 27 23

0.290 0.363 0.180 0.090 0.077

13.45 20.22 25.67 28.21 28.10

. The estimate is post = 0.232 13.45+...+0.071 2810 = $19.91 and the estimated standard deviation of the corresponding estimator is (ignoring fpcs) 0.292. An approximate 95% confidence interval for the population average amount spent is $19.91 0.57 . Note that the sample average is $19.47 has been adjusted upward because of the over-representation of households of size 1 in the sample.
In the above example, we ignored non-response. The company telephoned many more than 300 households to get the required number of completions. Non-response is a major source of error when sampling human populations. The confidence intervals that we have constructed do not take this error into account. There are many analytic methods and sampling strategies to deal with this important issue.
Exercises 1. In many surveys, there is interest in estimating strata averages or differences in strata averages. ~ a. In general, for SRS, write down the distribution for the estimators h and ~ ~ h k . b. In the well survey, find a 95% confidence interval for the proportion of wells in farms with animals that are contaminated c. In the well survey, find a 95% confidence interval for the average Na difference between the two types of farm wells.

2. Suppose that the purpose of the survey is to estimate a population proportion . If there are H strata, a) Write down the stratified estimate of and the variance of the corresponding estimator. ~ b) What is the variance of strat for proportional allocation? c) How should the strata be formed so that the stratified sampling protocol is superior to SRS? 3. Suppose the well survey was to be re-done with the same overall sample size 500. How would you recommend allocating the sample to the strata if a) Estimating the average Na level was the primary goal.

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-9

b) Estimating the proportion of contaminated wells was the primary goal. ~ ~ c) For each case, compare the predicted standard deviations of strat and strat to what occurred in the current survey.
~ 4. Consider the difference of the variances of strat under proportional and optimal allocation for a sample of size n. Ignore the fpc. 1 a) Show that this difference can be written as ( h )2 Wh where n h = h Wh is the weighted average standard deviation over the H strata.
h

b) When will the gain be large with optimal allocation relative to proportional allocation? 5. In an informal sample of math students at UW, 100 people were asked their opinion (on a 5 point scale) about the core courses and their value. One particular statement was (with scores): All mathematics students are required to take Stat 231? strongly agree 1 agree 2 neutral 3 disagree- 4 strongly disagree - 5 The sample results, broken down by year are shown below. Estimate the average score for all math students and find an approximate 95% confidence interval for the population average note that SRS was not used here so were are making assumptions about the estimators that may be unwarranted. There are about 3300 students in the faculty. Year Sample size Population Average score Standard weight deviation 1 39 0.31 2.8 1.22 2 23 0.24 3.5 1.09 3 26 0.23 3.2 1.03 4 12 0.22 3.1 0.87

Stat 371 R.J. MacKay, University of Waterloo, 2007

X-10

Appendix 1 An Introduction to R R is a high level language with many useful statistical functions. In this document all R commands and objects are given in italics. I assume that you are using the Windows version of R. You can look at or download this document from the course web page so the links will be active. Getting Started 1. Where to find R A Windows version of R is available free at http://www.r-project.org/ For help with installation and implementation see C:\Program Files\R\rw1062\doc\html\rw-FAQ.html - Introduction R is available on the faculty PCs, specifically in rooms MC 3006 and 3009. R is available on the math faculty unix machines type R to start the program 2. Where to find help Use the Help menu for on-line assistance. The manual An Introduction to R can be downloaded in pdf format. The online FAQ for Windows can be searched for help with almost anything. Within R, if you know the function name, use the command help(function) for assistance. If you do not know the function, try help.search( what you are looking for) 3. Starting and Quitting R To start R, create a shortcut on your desktop from the program. R will open and restore the previously saved workspace. It is a good idea to clear the workspace if you plan to start a new project see the Misc menu. To quit R, type q(). Note that R will let you save the workspace in the current working directory. See 4 below. Look at the web page http://www.stats.uwaterloo.ca/Stats_Dept/StatSoftware/R/ Try the R tutorial. Try the sample session in the An Introduction to R manual 4. Reading Data into R All data sets used in the course notes and lectures will be posted on the course web page in a .zip file that you can download to your own machine. The files have variate names in the first row and the variate values in rows, one row per unit in the sample. To get a data set, use the command a <- read.table(file path and namel, header=T) The data are stored in the data frame a for further use.

Appendix I -1

If the file is stored on your own machine, you can avoid long path names by setting the working directory to the directory containing the file. Look under the file menu on the R gui to set the working directory. You can also read the data files from my web page with the command a < -read.table('http://www.math.uwaterloo.ca/~rjmackay/stat371/file name.txt',header=T) For a variate named sales in the .txt file, the R variate name is a$sales. If there is a single data frame a, you can simplify the name with the command attach(a) so the awkward a$ notation is avoided. To restore the full name, use detach(a). If you want to use R on other data, create the data set in EXCEL and then save it in tab delimited .txt format in your working directory.

Working with R 1. Commands can be typed directly into R. I prefer to type them in Word or Notepad and then paste into the R gui.. This makes editing easy and preserves a record for reuse. 2. Using the up and down arrow keys in R displays past and subsequent command lines which can be edited and re-executed 3. R output and plots can be copied and pasted into Word to create reports. 4. Here are some R functions and objects used repeatedly in STAT 371. These can be stored e.g w <- mean(y) or immediately displayed e.g. mean(y) function mean(y) sd(y) summary(y) tapply(x,y, function) x <- u+v (x <- u*v) sqrt(y) A%*%B t(A) solve(A) b <- lm(y~x1+x2+) b <- lm(y~-1+x1+x2+) summary(b) resid(b) fitted(b) anova(b) hatvalues(b) rstudent(b) anova(c,b) b <- regsubsets(y~x1+x2+, nbest =k)) purpose calculates the average of the variate y calculates the st dev of the variate y calculates a 5 number summary of the variate y calculates the function e.g. mean, applied to y for each value of x creates the sum (element-wise product) of two vectors calculates the element-wise square root of y matrix product gives the transpose of A gives the inverse of A fits the linear model y = 0 + 1 x1 +...+ r and stores the results in the lm object b fits the linear model y = 1 x1 +...+ r without intercept details of the fit the estimated residuals the fitted values the analysis of variance from the fit the diagonal elements of the hat matrix the studentized residuals analysis of variance to compare the sub-model c to b fits the k best subsets for 1, 2,, p variate models

Appendix I -2

plot(x,y, main='title',xlab='xx', etc) abline(b) hist(y) qqnorm(y) par(mfrow=c(j,k))

scatterplot of y vs. x with title , x-axis label etc. adds the fitted line from b <- lm(y~x) to the scatterplot histogram of the values in y a gaussian qq plot of the values in y creates a graphic window with j rows and k columns for the next jk plots

Appendix I -3

Appendix 2 Properties of vectors and matrices of random variables Recall that if X and Y are two random variables and a, b are constants then we have
E( aX + b) = aE ( X ) + b E ( X + Y ) = E ( X ) + E (Y ) Var ( aX + b) = a 2 Var ( X ) Var ( X + Y ) = Var ( X ) + Var (Y ) + Cov( X , Y )

where Cov( X, Y ) = E{( X E( X ))(Y E(Y ))} Now suppose we have two vectors of k random variables U t = (U1 ,..., Uk )t W t = (W1 ,..., Wk )t written as the transpose of row vectors to save space. Definitions: The expected value of U is the vector E(U ) with ith element E(Ui ) The variance-covariance matrix of U is the matrix Var(U ) with ijth element E{(Ui E (Ui )(U j E (U j )} . Note that the diagonal elements are the variances and the off-diagonal elements are the covariances of the components of U. The covariance matrix of U and W is the matrix Cov(U, W ) with ijth element E{(Ui E (Ui )(Wj E (Wj )} , the covariance of Ui and Wj . Properties: These follow from the properties of expectation. 1. If a is a vector and A is a matrix of constants, then E( a + U ) = a + E(U ) and E( AU ) = AE(U ) . 2. The expected value of the sum of two vectors is the sum of the expected values That is E(U + W ) = E(U ) + E(W ) . 3. Var (U ) = E{(U E (U ))(U E (U )) t }. We can see that this result is true by noting that the ijth element of xx t is xi x j for any vector x. 4. Var(a + U ) = Var(U ) 5. Var ( AU ) = AVar (U ) A t . This useful result is easy to show using properties 1 and 3. 6. Cov(U , W ) = E{(U E (U ))(W E (W ))t } . This follows using the same argument as in 3.

Statistics 371 R.J.MacKay University of Waterloo 2005

Appendix 2

7. Cov(a + U, b + W ) = Cov(U, W ) 8. Cov( AU , W ) = ACov(U , W ) and Cov(U , BW ) = Cov(U , W ) B t from 6. and 1. Multivariate Normal Distribution If Z t = ( Z1 , Z2 ,..., Zk )t is a vector of k independent gaussian G(0,1) random variables, we say that Z has a multivariate normal distribution with mean vector 0 and variancecovariance matrix I . We write Z ~ N (0, I ) Properties 1. Suppose is a vector and A is a matrix of constants. If U = + AZ , then the mean is E(U ) = and the variance covariance matrix is Var (U ) = AA t = and U has a multivariate normal distribution. We use the notation U ~ N ( , ) . 2. The component Ui of U is a constant i plus a linear combination of Z1 ,..., Zk and hence is gaussian with mean i and standard deviation the square root of the ith diagonal element of Var(U ) . 3. More generally, if U ~ N ( , ) , then BU ~ N ( B , BB t ) . In words, linear combinations of the components of a multivariate normal random vector are multivariate normal with the appropriate mean and variance-covariance matrix. 4. (An important special case of 3) If a t = (a1 ,..., ak ) is a vector of constants, then a tU is gaussian with mean a tU and standard deviation a t a . 5. The components Ui and U j are independent random variables if and only if Cov(Ui , U j ) = 0 . 6. More generally, the vectors BU and CU are independent iff Cov( BU , CU ) = BVar (U )C t = 0 . These results follow

Statistics 371 R.J.MacKay University of Waterloo 2005

Appendix 2

Appendix 3: Gaussian Quantile Quantile Plots We use a gaussian quantile-quantile (qq) plot to assess if a set of n values looks like a random sample of size n from a gaussian distribution. To explain the plot, consider the figure below which shows a G(0,1) density function and 5 bins each with probability 0.20.

Suppose we have a random sample of 5 values z1 , z2 ,..., z5 from this distribution. With such a sample, the expected number of values in each bin is 1. Denote the sample values in increasing order by z(1) , z( 2 ) ,..., z( 5) . Then we would expect z(1) to fall in the first bin, z( 2 ) to fall in the second and so on. Let the probabilistic centers of the bins be q(1) , q( 2 ) ,..., q( 5) . That is Pr( Z q(1) ) = 1 / 10, Pr( Z q( 2 ) ) = 3 / 10,..., Pr( Z q( 5 ) ) = 9 / 10 . These centers are shown with arrows on the plot. Note in general for a sample of size n, we have Pr( Z q( i ) ) =
i 1 1 2i 1 . + (1 / 2) = n n 2n

If the sample z1 , z2 ,..., z5 is from a G(0,1) distribution, then we expect that z(1) q(1) , z( 2 ) q( 2 ) ,..., z( 5 ) q( 5) or equivalently, if we plot the points ( q(1) , z(1) ),( q( 2 ) , z( 2 ) ),...,( q( 5) , z( 5) ) , we should see a straight line through the origin with slope 1. If the points deviate from this line substantially, then we decide that the gaussian assumption is not tenable. Now suppose that u1 , u2 ,..., un is a sample form a G( , ) distribution. Then we have ui = + zi where z1 , z2 ,..., zn is a sample from a G(0,1) distribution. Since u( i ) = + z( i ) , a plot of the points ( q(1) , u(1) ),( q( 2 ) , u( 2 ) ),...,( q( n ) , u( n ) ) will be approximately

Stat 371 R.J. MacKay University of Waterloo 2009

Appendix 3

a straight line with slope and y-intercept . If the points deviate from a line substantially, then we decide that the gaussian assumption is not tenable. We call this a qq plot and use R to construct the plot with the function qqnorm(). For example, to construct the qq plot of the estimated residuals from the fit b, we use the code qqnorm(residual(b)). The plots for the residuals in the assessment data (full data on the left, two cases omitted on the right) are shown below.

The plot on the left shows that it is not reasonable to suppose that the residuals from fitting the model to the full data set are gaussian. There is no evidence against this assumption once the two cases are deleted. You need to be careful not to over-interpret these plots. The plots on the next page are based on 9 random samples of size 50 from a G(0,1) distribution. Note how several of the plots appear non-linear or have apparent outliers. To see the behaviour of the plots for a non-gaussian distribution, the final three plots correspond from left to right to a sample of 50 values from a G(0,1) distribution, the square of the values and the reciprocal of the values. If the qq plot of the estimated residuals is systematically non-linear, then we can try transforming the values of the response variate before fitting the linear model.

Stat 371 R.J. MacKay University of Waterloo 2009

Appendix 3

Stat 371 R.J. MacKay University of Waterloo 2009

Appendix 3

t-table (right tail) For each row (degrees of freedom k ) and column (right tail probability ), the table entry e satisfies Pr( t k e) = . Note that the t-distribution is symmetric about 0.
degrees of freedom 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 35 40 45 50 gaussian right tail probability 0.10 0.05 0.025 3.078 6.314 12.706 1.886 2.920 4.303 1.638 2.353 3.182 1.533 2.132 2.776 1.476 2.015 2.571 1.440 1.943 2.447 1.415 1.895 2.365 1.397 1.860 2.306 1.383 1.833 2.262 1.372 1.812 2.228 1.363 1.796 2.201 1.356 1.782 2.179 1.350 1.771 2.160 1.345 1.761 2.145 1.341 1.753 2.131 1.337 1.746 2.120 1.333 1.740 2.110 1.330 1.734 2.101 1.328 1.729 2.093 1.325 1.725 2.086 1.323 1.721 2.080 1.321 1.717 2.074 1.319 1.714 2.069 1.318 1.711 2.064 1.316 1.708 2.060 1.315 1.706 2.056 1.314 1.703 2.052 1.313 1.701 2.048 1.311 1.699 2.045 1.310 1.697 2.042 1.306 1.690 2.030 1.303 1.684 2.021 1.301 1.679 2.014 1.299 1.676 2.009 1.282 1.646 1.962

0.25 1.000 0.816 0.765 0.741 0.727 0.718 0.711 0.706 0.703 0.700 0.697 0.695 0.694 0.692 0.691 0.690 0.689 0.688 0.688 0.687 0.686 0.686 0.685 0.685 0.684 0.684 0.684 0.683 0.683 0.683 0.682 0.681 0.680 0.679 0.675

0.01 31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.438 2.423 2.412 2.403 2.330

Stat 371 R.J. MacKay, University of Waterloo, 2008

table -1

F-table (right tail) = 0.10 For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies P ( F ( j , k ) e) = .
numerator degrees of freedom 1 2 3 4 5 6 7 8 9 10 20 30 1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 61.74 62.26 2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 9.44 9.46 3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23 5.18 5.17 4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 3.84 3.82 5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 3.21 3.17 6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 2.84 2.80 7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 2.59 2.56 8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 2.42 2.38 9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 2.30 2.25 10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32 2.20 2.16 11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 2.12 2.08 12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 2.06 2.01 13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 2.01 1.96 14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10 1.96 1.91 15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06 1.92 1.87 16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 1.89 1.84 17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00 1.86 1.81 18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 1.84 1.78 19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96 1.81 1.76 20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 1.79 1.74 21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92 1.78 1.72 22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90 1.76 1.70 23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89 1.74 1.69 24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88 1.73 1.67 25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 1.72 1.66 30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82 1.67 1.61 40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79 1.76 1.61 1.54 50 2.81 2.41 2.20 2.06 1.97 1.90 1.84 1.80 1.76 1.73 1.57 1.50 100 2.76 2.36 2.14 2.00 1.91 1.83 1.78 1.73 1.69 1.66 1.49 1.42

Stat 371 R.J. MacKay, University of Waterloo, 2008

denominator degrees of freedom

table -2

F-table (right tail) = 0.05 For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies P ( F ( j , k ) e) = .

1 1 161.45 2 18.51 3 10.13 4 7.71 5 6.61 6 5.99 7 5.59 8 5.32 9 5.12 10 4.96 11 4.84 12 4.75 13 4.67 14 4.60 15 4.54 16 4.49 17 4.45 18 4.41 19 4.38 20 4.35 21 4.32 22 4.30 23 4.28 24 4.26 25 4.24 30 4.17 40 4.08 50 4.03 100 3.94

2 199.50 19.00 9.55 6.94 5.79 5.14 4.74 4.46 4.26 4.10 3.98 3.89 3.81 3.74 3.68 3.63 3.59 3.55 3.52 3.49 3.47 3.44 3.42 3.40 3.39 3.32 3.23 3.18 3.09

3 215.71 19.16 9.28 6.59 5.41 4.76 4.35 4.07 3.86 3.71 3.59 3.49 3.41 3.34 3.29 3.24 3.20 3.16 3.13 3.10 3.07 3.05 3.03 3.01 2.99 2.92 2.84 2.79 2.70

numerator degrees of freedom 4 5 6 7 8 9 224.58 230.16 233.99 236.77 238.88 240.54 19.25 19.30 19.33 19.35 19.37 19.38 9.12 9.01 8.94 8.89 8.85 8.81 6.39 6.26 6.16 6.09 6.04 6.00 5.19 5.05 4.95 4.88 4.82 4.77 4.53 4.39 4.28 4.21 4.15 4.10 4.12 3.97 3.87 3.79 3.73 3.68 3.84 3.69 3.58 3.50 3.44 3.39 3.63 3.48 3.37 3.29 3.23 3.18 3.48 3.33 3.22 3.14 3.07 3.02 3.36 3.20 3.09 3.01 2.95 2.90 3.26 3.11 3.00 2.91 2.85 2.80 3.18 3.03 2.92 2.83 2.77 2.71 3.11 2.96 2.85 2.76 2.70 2.65 3.06 2.90 2.79 2.71 2.64 2.59 3.01 2.85 2.74 2.66 2.59 2.54 2.96 2.81 2.70 2.61 2.55 2.49 2.93 2.77 2.66 2.58 2.51 2.46 2.90 2.74 2.63 2.54 2.48 2.42 2.87 2.71 2.60 2.51 2.45 2.39 2.84 2.68 2.57 2.49 2.42 2.37 2.82 2.66 2.55 2.46 2.40 2.34 2.80 2.64 2.53 2.44 2.37 2.32 2.78 2.62 2.51 2.42 2.36 2.30 2.76 2.60 2.49 2.40 2.34 2.28 2.69 2.53 2.42 2.33 2.27 2.21 2.61 2.45 2.34 2.25 2.18 2.12 2.56 2.40 2.29 2.20 2.13 2.07 2.46 2.31 2.19 2.10 2.03 1.97

10 241.88 19.40 8.79 5.96 4.74 4.06 3.64 3.35 3.14 2.98 2.85 2.75 2.67 2.60 2.54 2.49 2.45 2.41 2.38 2.35 2.32 2.30 2.27 2.25 2.24 2.16 2.08 2.03 1.93

20 248.02 19.45 8.66 5.80 4.56 3.87 3.44 3.15 2.94 2.77 2.65 2.54 2.46 2.39 2.33 2.28 2.23 2.19 2.16 2.12 2.10 2.07 2.05 2.03 2.01 1.93 1.84 1.78 1.68

30 250.10 19.46 8.62 5.75 4.50 3.81 3.38 3.08 2.86 2.70 2.57 2.47 2.38 2.31 2.25 2.19 2.15 2.11 2.07 2.04 2.01 1.98 1.96 1.94 1.92 1.84 1.74 1.69 1.57

Stat 371 R.J. MacKay, University of Waterloo, 2008

denominator degrees of freedom

table -3

F-table (right tail) = 0.01 For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies P ( F ( j , k ) e) = .
numerator degrees of freedom 4 5 6 7 8 9 5624 5764 5859 5928 5981 6022 99.25 99.30 99.33 99.36 99.38 99.39 28.71 28.24 27.91 27.67 27.49 27.34 15.98 15.52 15.21 14.98 14.80 14.66 11.39 10.97 10.67 10.46 10.29 10.16 9.15 8.75 8.47 8.26 8.10 7.98 7.85 7.46 7.19 6.99 6.84 6.72 7.01 6.63 6.37 6.18 6.03 5.91 6.42 6.06 5.80 5.61 5.47 5.35 5.99 5.64 5.39 5.20 5.06 4.94 5.67 5.32 5.07 4.89 4.74 4.63 5.41 5.06 4.82 4.64 4.50 4.39 5.21 4.86 4.62 4.44 4.30 4.19 5.04 4.69 4.46 4.28 4.14 4.03 4.89 4.56 4.32 4.14 4.00 3.89 4.77 4.44 4.20 4.03 3.89 3.78 4.67 4.34 4.10 3.93 3.79 3.68 4.58 4.25 4.01 3.84 3.71 3.60 4.50 4.17 3.94 3.77 3.63 3.52 4.43 4.10 3.87 3.70 3.56 3.46 4.37 4.04 3.81 3.64 3.51 3.40 4.31 3.99 3.76 3.59 3.45 3.35 4.26 3.94 3.71 3.54 3.41 3.30 4.22 3.90 3.67 3.50 3.36 3.26 4.18 3.85 3.63 3.46 3.32 3.22 4.02 3.70 3.47 3.30 3.17 3.07 3.83 3.51 3.29 3.12 2.99 2.89 3.72 3.41 3.19 3.02 2.89 2.78 3.51 3.21 2.99 2.82 2.69 2.59

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 30 40 50 100

1 4052 98.50 34.12 21.20 16.26 13.75 12.25 11.26 10.56 10.04 9.65 9.33 9.07 8.86 8.68 8.53 8.40 8.29 8.18 8.10 8.02 7.95 7.88 7.82 7.77 7.56 7.31 7.17 6.90

2 4999 99.00 30.82 18.00 13.27 10.92 9.55 8.65 8.02 7.56 7.21 6.93 6.70 6.51 6.36 6.23 6.11 6.01 5.93 5.85 5.78 5.72 5.66 5.61 5.57 5.39 5.18 5.06 4.82

3 5404 99.16 29.46 16.69 12.06 9.78 8.45 7.59 6.99 6.55 6.22 5.95 5.74 5.56 5.42 5.29 5.19 5.09 5.01 4.94 4.87 4.82 4.76 4.72 4.68 4.51 4.31 4.20 3.98

10 20 30 6056 6209 6260 99.40 99.45 99.47 27.23 26.69 26.50 14.55 14.02 13.84 10.05 9.55 9.38 7.87 7.40 7.23 6.62 6.16 5.99 5.81 5.36 5.20 5.26 4.81 4.65 4.85 4.41 4.25 4.54 4.10 3.94 4.30 3.86 3.70 4.10 3.66 3.51 3.94 3.51 3.35 3.80 3.37 3.21 3.69 3.26 3.10 3.59 3.16 3.00 3.51 3.08 2.92 3.43 3.00 2.84 3.37 2.94 2.78 3.31 2.88 2.72 3.26 2.83 2.67 3.21 2.78 2.62 3.17 2.74 2.58 3.13 2.70 2.54 2.98 2.55 2.39 2.80 2.37 2.20 2.70 2.27 2.10 2.50 2.07 1.89

Stat 371 R.J. MacKay, University of Waterloo, 2008

denominator degrees of freedom

table -4

Exercise Solutions Chapter 2 1. From the R output we have the following a) t = (3.87, 2.01, 1.26) b) 1 = 0 + 1 x11 + 2 x21 = 4.67, r1 = y1 1 = 0.97 2. Use R to fit the model yi = 0 + 1 xi1 +...+ p xip + ri to the assessment data (assessment.txt) with a) all 5 explanatory variates I used the R code a<-read.table("assessment.txt",header=T) attach(a) b<-lm(value~size+age+office+ratio+location) summary(b) to produce the output Call: lm(formula = value ~ size + age + office + ratio + location) Residuals: Min 1Q Median 3Q Max -10.6911 -3.8670 -0.8164 2.7848 14.5435 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 19.41588 4.77137 4.069 0.000288 *** size -2.41526 1.34128 -1.801 0.081181 . age -0.52300 0.10820 -4.833 3.22e-05 *** office 0.03786 0.14776 0.256 0.799388 ratio 0.11653 0.08139 1.432 0.161926 location 3.49361 3.72137 0.939 0.354867 --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 5.993 on 32 degrees of freedom Multiple R-Squared: 0.4537, Adjusted R-squared: 0.3683 F-statistic: 5.315 on 5 and 32 DF, p-value: 0.001150 b) only age and size To fit the model with only age and size, use the R code

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -1

c<-lm(value ~ size + age) summary(c) with output Call: lm(formula = value ~ size + age) Residuals: Min 1Q Median 3Q Max -9.171 -3.629 -1.535 2.682 18.456 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 22.93698 3.65633 6.273 3.38e-07 *** size -1.56577 1.27249 -1.230 0.227 age -0.45850 0.09905 -4.629 4.89e-05 *** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 6.042 on 35 degrees of freedom Multiple R-Squared: 0.3927, Adjusted R-squared: 0.358 F-statistic: 11.32 on 2 and 35 DF, p-value: 0.0001620

c) Do the estimated coefficients change? Why? Yes, especially the estimated coefficient for size. If we write X = ( X1 X2 ) where X corresponds to the full model and X1 corresponds only to the intercept, size and age, then X1t X1 X1t X2 t we have X X = but note that , in general, the top left corner of ( X t X ) 1 is t t X2 X1 X2 X2 t not equal to ( X1 X1 ) 1 unless X1t X2 = 0 . Hence the estimates found by

F I GH JK F I F X X X X I calculating = G J = G H X X X X JK c X X hy will differ unless X X H K


1 t 1 1 t 1 1 2 2 t 2 1 t 2 t 1 t 2
t 1

= 0 . Note you

can interpret this last condition geometrically The product is 0 if the columns of X1 are orthogonal to the columns of X2 . In this example, we do not have this orthogonality. 3. Suppose we have the returns on an asset yi , the return on the market xi1 and the risk free return xi2 for n periods. Consider three regression models: Model 1: ( yi xi 2 ) = ( xi1 xi 2 ) + ri Model 2: yi = 0 + 1 xi1 + ri Model 3: yi = 0 + 1 xi1 + 2 xi 2 + ri

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -2

If we fit each model will the coefficient of x1 , the measure of volatility, change? Explain. Yes the coefficient of x1 is likely to change for each model. We can write the first two models as special cases of the third. Model 1: yi = (0)1 + ( ) xi1 + (1 ) xi 2 + ri Model 2: yi = 0 + 1 xi1 + (0) x2 i + ri Model 3: yi = 0 + 1 xi1 + 2 xi 2 + ri In fitting the models, we are projecting onto different subspaces in each case the coefficient will change. Again the result depends on the orthogonality of the vectors 1 x1 x2 as in Question 1. 4. Suppose we have a response variate yi and a single explanatory variate xi1 for each of n units sampled from a population. Consider the two models Model 1: yi = 0 + 1 xi1 + ri Model 2: yi = 0 + 1 ( xi1 x1 ) + ri where x1 is the sample average of the explanatory variate. a) Show that the vectors x1 x11 and 1 are orthogonal. We have 1t ( x1 x11) = ( xi1 x1 ) = 0 since x1 is the sample average of the explanatory
i

variate. b) Why is span(1, x1 ) = span(1, x1 x11) ? Since x1 x11 is a linear combination of 1, x1 and 1, x1 x11 are orthogonal (hence linearly independent, the two spans are the same subspace. c) In fitting models a) and b), we project onto a subspace. How are those projections different? Since the two subspaces are the same, the projections are the same vector, d) What is the relationship between the estimated coefficients in fitting the two models? Since the projections are the same, we must have 0 1 + 1 x1 = 0 1 + 1 ( x1 x11) = ( 0 1 x1 )1 + 1 x1 and since 1, x1 are linearly independent, we have 0 = 0 1 x1 , 1 = 1 . e) How does the result in a) simplify the calculation of when fitting model 2?

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -3

Since 1, x1 x11 are orthogonal, the matrix X t X is diagonal and hence the inverse is found by inverting the diagonal elements. 5. We defined the hat matrix H = X ( X t X ) 1 X t , the projection onto span(1, x1 ,..., x p ) . Show that a) H t = H
H t = ( X ( X t X ) 1 X t )t = X[( X t X ) 1 ]t X using the result that ( AB)t = B t A t . Now consider the transpose of the matrix X t X . We have ( X t X )t = X t X so this matrix is symmetric. Now consider the inverse S 1 of any symmetric matrix S . We have ( S 1S ) t = I t = I and also ( S 1S )t = S t ( S 1 ) t = S( S 1 )t since S is symmetric. Combining the two equations, we have S( S 1 ) t = I and hence ( S 1 )t = S 1 since the matrix inverse is unique. We conclude that the inverse of a symmetric matrix is symmetric and hence [( X t X ) 1 ]t = ( X t X ) 1 and Ht = H .

b) H 2 = H . Interpret this result geometrically.


H 2 = ( X ( X t X ) 1 X t )( X ( X t X ) 1 X t ) = X ( X t X ) 1 ( X t X )( X t X ) 1 X t = X ( X t X ) 1 X t = H

Since H is the projection onto the column space of X , applying H to a vector in this space has no effect. Hence for any vector y , we know that H ( Hy) = Hy since Hy is in the column space of X . c) ( I H ) 2 = ( I H ), H ( I H ) = 0 We have ( I H )2 = ( I H )( I H ) = I H H + H2 =IH and H ( I H ) = H H 2 = 0 as required. d) 0 hii 1 where hii is the diagonal element of H. 2 Since H = H 2 = H t H , we have hii = hi2 + hi22 +...+ hi2( p +1) or hii hii = hi2 + hi22 +...+ hi2( p +1) 1 1
2 where the hii term is removed from the right hand side. Hence we have hii (1 hii ) 0 or equivalently 0 hii 1 .

6. Some questions about R 2 a) In question 1, which model gave a larger value for R 2 ? The model with more explanatory variates gave a larger value of R 2 (0.4537 versus 0.3927) b) Show that R 2 cannot decrease if we add extra terms to a model? Stat 371 R.J. MacKay, University of Waterloo 2009 Exercise Solution -4

R2 = 1

residual sum of squares from fitting the model (y i y )2

As we add terms to a model, the residual sum of squares must go down (at least it cannot go up) since it is the minimum value of the function || y X ||2 . 7. The data in the file anscombe.txt were produced by F.J. Anscombe, American Statistician 27, 17-21 to demonstrate the importance of plotting the data. The file contains 4 sets of ( x, y) vectors, labeled x1-x4, y1-y4 a) For each pair, fit a straight line model and report the estimated parameters and the coefficient of determination R 2 b) For each pair, construct a scatterplot of y versus x and add the fitted line. c) Comment. I used the following R code to fit the four lines A<-read.table(anscombe.txt, header=T) attach(A) B1<-lm(y1~x1);summary(B1) B2<-lm(y2~x2);summary(B2) B3<-lm(y3~x3);summary(B3) B4<-lm(y4~x4);summary(B4) The estimated parameters and R 2 for each case are Case 1 2 3 4

0 3.0001 3.001 3.0025 3.0017

1
0.5001 0.500 0.4997 0.4999

1.237 1.237 1.236 1.236

R2 0.6665 0.6662 0.6663 0.6667

The fitted models and residual sum of squares are virtually identical. The plots on the next page show that the data relationships between y and x are very different so it is a mistake to interpret R 2 as a measure of fit of the model to the data. Hence the need to plot the data (or the estimated residuals) to understand the fit.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -5

Chapter 3 Solutions 1. Some ideas about confidence intervals a) Using the R-output given for the sales promotion example, find a 99% confidence interval for the effect of past sales on the response. What can you conclude? From the output, we have the estimate 0.942 and Std. Error 0.026 for the coefficient of past sales. The underlying variability is estimated with 25 degrees of freedom and for 99% confidence we have Pr(| t25 | 2.79) = 0.99 . The confidence interval is 0.942 2.79 0.026 or 0.942 0.073 . With all the other explanatory variates in the model, the effect of past sales is close to 1 i.e. current sales are close to a constant plus a term proportional to past sales. b) How does the confidence interval change as we increase the confidence level?

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -6

The form of the confidence interval is estimate c st. err.where Pr(| t25 | c) is the confidence level. As the confidence level increases, the constant c increases and so the interval gets wider (the center stays the same). c)

~ Suppose we have ~ G( , d ) , the estimator for a parameter and the ~ statistically independent with n ( p + 1) degrees of freedom. Derive the confidence interval for ~
~

~ ~ G(0,1 ) . Also we have / ~ Kn ( p +1) . Taking the Since ~ G( , d ) , we have d ~ ~ G(0,1 ) = tn ( p +1) . For a particular level of confidence CL, ratio, we get ~d = ~ ~ d Kn ( p +1) /
we have Pr(| tn ( p +1) | c ) = CL . To find the confidence interval, we have ~ ~ ~ ~ ~ Pr(| ~ | c) = CL or re-arranging the inequality Pr( cd + cd ) = CL . d Finally, we replace the estimators by the estimates to get ( cd , + cd ) . Show that 0 is in the 95% confidence interval for if and only if the pvalue for the test of the hypothesis = 0 exceeds 5%.

d)

0 is in the interval ( cd , + cd ) if and only if cd 0 + cd 0 | | c where Pr(| tn ( p +1) | c ) = 0.95 or, equivalently, Pr(| tn ( p +1) | c ) = 0.05 . The d 0 |) . This probability is greater p-value for the hypothesis = 0 is Pr(| tn ( p +1) | | d 0 | c as required. than 0.05 if and only if | d
2. In a small study, a company that manufactures candle wax examined 20 candles made from batches of wax that have different amounts of fragrance oil added. The company was interested in understanding the relationship between the hardness of the candles (a technical measurement) and the amount of fragrance oil added. The data are stored in the file hardness.txt. The variates are named hardness and frag.oil. Consider the simple model hardness = 0 + 1frag.oil + residual . a) Interpret the parameter 0 . 0 represents the average hardness in the study population of candles if the level of fragrance oil is 0. b) Find a 95% confidence interval for this parameter.

We can fit the simple model using the R statements

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -7

a<-read.table("hardness.txt",header=T) attach(a) b<-lm(hardness~frag.oil) summary(b) with output Call: lm(formula = hardness ~ frag.oil) Residuals: Min 1Q Median 3Q Max -0.95868 -0.34417 0.01758 0.37530 0.94132 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.1725 0.3203 3.66 0.00179 ** frag.oil 6.9724 0.3042 22.92 9. 07e-15 *** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.4683 on 18 degrees of freedom Multiple R-Squared: 0.9669, Adjusted R-squared: 0.965 F-statistic: 525.2 on 1 and 18 DF, p-value: 9.067e-15 From the output, the estimate of 0 is 1.1725 with associated standard error 0.3203. Since the underlying variability is estimated with 18 degrees of freedom, we have Pr(| t18 | 2.10) = 0.95 and hence the 95% confidence interval is 1.1725 2.10 0.3203 or 1.1725 0.672 Find a 95% confidence interval for the average hardness of candles made with 2% fragrance oil. The parameter of interest is = 0 + 1 (0.02). We can estimate by c)

= 0 + 1 (0.02) = 1.3119 but need to use R to find the corresponding standard error. There are two approaches. If we let u t = (1,0.02 ) , then we can calculate u t ( X t X ) 1 u with the statements u<-c(1.0,0.02) X<-model.matrix(b) sterr<-0.4683*sqrt(t(u)%*%solve(t(X)%*%X)%*%u) sterr
to get 0.3146 and hence the 95% confidence interval is 1.3119 2.10 0.3146 or 1.3119 0.661 . Alternately, we can use the statements

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -8

new<-data.frame(frag.oil=0.02) p<-predict(b, interval=c,newdata=new,level=0.95) p to get the output fit lwr [1,] 1.311952 0.6510668 upr 1.972837

Note the argument interval=c produces a confidence interval for the mean when frag.oil=0.02, not a prediction interval. d) Add a quadratic term to the model (in R, f2<-frag.oil*frag.oil creates a vector with components the square of those in frag.oil). Is there any evidence of curvature in the relationship?

The output is shown on the next page. To test the hypothesis that the coefficient of the square term is 0, the p-value is 0.9301 so there is no evidence that the coefficient is different from 0. In other words, there is no evidence of curvature in the relationship between hardness and the amount of fragrance oil.

Call: lm(formula = hardness ~ frag.oil + f2) Residuals: Min 1Q Median 3Q Max -0.94504 -0.35467 0.02659 0.36635 0.95496 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.081 1.081 1.000 0.3313 frag.oil 7.180 2.357 3.046 0.0073 ** f2 -0.104 1.168 -0.089 0.9301 --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.4818 on 17 degrees of freedom Multiple R-Squared: 0.9669, Adjusted R-squared: 0.963 F-statistic: 248.1 on 2 and 17 DF, p-value: 2.635e-13

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -9

3. Using the data in the promotion trial described in this chapter, Find a 95% prediction interval using promotion 2 if the past sales are a) $10000 and the competitor sales are $3000. Can you see any difficulty with this prediction? We use the R statements to fit the model and produce the prediction interval a<-read.table(trial.txt,header=T) attach(a) b<-lm(response~x1+x2+pst.sales+comp.sales) new<-data.frame(x1=0,x2=1,pst.sales=10000,comp.sales=3000) p<-predict(b,interval=p, newdata=new, level=0.95) p with output fit lwr upr [1,] 9494.152 8996.027 9992.277 so the 95% prediction interval is (8996,9992). If we look at the original data set [ the R statement summary(a) is helpful], we see that the largest pst.sales value is $1918 so the value of $10000 is an extreme extrapolation. We have no idea if the model fits in this unexplored region. b) Construct a prediction interval for the change in sales if promotion 2 is used rather than promotion 1 for the same store (i.e. past and competitor sales are fixed). [You will need to go back to first principles.]

If promotion 2 is used let the response be Y (2) = 0 + 2 + 3 pst. sales + 4 comp. sales + R(2) and the corresponding response for promotion 1 isY (1) = 0 + 1 + 3 pst. sales + 4 comp. sales + R(1) and hence the difference is Y (2 ) Y (1) = 2 1 + R(2) R(1) ~ G( 2 1 , 2 ) so (Y (2) Y (1)) ( 2 1 ) ~ ~ G(0,1) . If we replace by we get a t distribution with 25 2 degrees of freedom. The 95% prediction interval for the change of response is 2 1 c 2 or 25.31 60.09 ~ 4. Prove that the components of are independent if and only if the columns of X are orthogonal. ~ ~ We know that ~ N ( , 2 ( X t X ) 1 ) and hence the components of are independent if ( X t X ) 1 is diagonal. This matrix is diagonal if and only if X t X is diagonal and hence when the columns of X are orthogonal. ~ r 5. Prove Cov( , ~) = 0.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -10

~ We have = ( X t X ) 1 X t Y = + ( X t X ) 1 X t R and ~ ~ = Y X r
= Y X ( X t X ) 1 X t Y = ( I H )Y = ( I H )( X + R) = (I H)R Hence ~ Cov( , ~) = Cov(( X t X ) 1 X t R,( I H ) R) r

= ( X t X ) 1 X t Cov( R, R)( I H )t
Now Cov( R, R) = Var ( R) = 2 I and ( I H )t = I H = I X ( X t X ) 1 X t . Substituting, we get ~ Cov( , ~) = 2 ( X t X ) 1 X t ( I X ( X t X ) 1 X t ) r = 2 [( X t X ) 1 X t ( X t X ) 1 X t X ( X t X ) 1 X t ] =0 as required.

Chapter 4 Solutions

1. Suppose we have a discrepancy measure with an F distribution with 3 and 30 degrees of freedom. a. Find Pr( F 2) Using the tables, we have Pr( F3,32 2.92) = 0.05 so all we know is that Pr( F 2) > 0.05. From R we can use the statement 1-pf(2,3,30) to find Pr( F 2) = 0.1352 b. find a constant c so that Pr( F c) = 0.05 From the tables we find c = 2.92 c. What is the distribution of 1 / F 1 K2 K32 so = 30 ~ F30,3 2 F K32 K30 2. In an industrial example, the manufacturer collects 60 observations to build a model to relate a product property y to two quantitative explanatory variates x1 and x2 . The data are stored in the file exercise2.txt. Theory suggests that a linear model of the form y = 0 + 1 x1 + 2 x2 + r should describe the data. However, the analyst worries 2 that additional second order terms of the form x12 , x 2 , x1 x 2 should be included in the Note that F is the ratio of two K distributions, F =

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -11

model. Does the addition of the extra terms contribute significantly to the fit of the model? [Note: In R you can create new variables such as x 22 < x 2 * x 2 to represent the quadratic terms.] We use the following R statements to fit the full and reduced model and then carry out the ANOVA a<-read.table(exercise2.txt,header=T) attach(a) x11<-x1*x1;x12<-x1*x2;x22<-x2*x2 b<-lm(y~x1+x2+x11+x12+x22) c<- lm(y~x1+x2) anova(c,b) with output Analysis of Variance Table Model 1: y ~ x1 + x2 Model 2: y ~ x1 + x2 + x11 + x12 + x22 Res.Df RSS Df Sum of Sq F Pr(>F) 1 57 14.5765 2 54 12.1980 3 2.3785 3 .5098 0.02123 * --so there is some evidence that one or more of the second order terms is necessary in the Signif. model. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 3. In the product testing example (Example 2 in Chapter 4), use an F test to address the following questions? a. Is there any evidence of differences among the new versions 2 to 6? We fit the full model and then the model under the hypothesis 2 = 3 =... = 6 = . Note that in the reduced model the explanatory variate corresponding to is x = x2 + x3 +...+ x6 . Note that the data set contains one other explanatory variates pst.score The R code is a<-read.table(product.txt,header=T) attach(a) b<-lm(sat.score~-1+x1+x2+x3+x4+x5+x6+pst.score) x<-x2+x3+x4+x5+x6 c<-lm(sat.score~-1+x1+x+pst.score) anova(c,b)

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -12

The output is Analysis of Variance Table Model 1: sat.score ~ -1 + x1 + x + pst.score Model 2: sat.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score Res.Df RSS Df Sum of Sq F Pr(>F) 1 45 2.29400 2 41 1.60391 4 0.69009 4.4101 0.00467 ** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 There is strong evidence of differences among the 5 new versions, after controlling for pst.score. b. Versions 4,5 and 6 share a common feature. Is there any evidence that these versions have significantly different average satisfaction scores?

If 4 = 5 = 6 = , the model becomes Y = 1 x1 + 2 x2 + 3 x3 + ( x 4 + x5 + x6 ) + 7 pst. score + R To test the hypothesis of no difference among versions 4,5 and 6, we fit the reduced model and use the change in the residual sum of squares as the basis for the discrepancy measure. The following R statements produce the F test x<-x4+x5+x6 c<-lm(sat.score~-1+x1+x2+x3+x+pst.score) anova(c,b) The output is Analysis of Variance Table Model 1: sat.score ~ -1 + x1 + x2 + x3 + x + pst.score Model 2: sat.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score Res.Df RSS Df Sum of Sq F Pr(>F) 1 43 2.12109 2 41 1.60391 2 0.51717 6.6101 0.003249 ** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

There is strong evidence of differences among versions 4,5 and 6. 4. If we have a single parameter , we can test a hypothesis = 0 in two ways. a. Explain how we can test the hypothesis using a t-test

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -13

| || ~ where stdev( ) = d and calculate the p-value d as Pr(| tdf | d ) where df are the degrees of freedom associated with the residual sum of squares.
We can use the discrepancy measure b. Explain how we can test the hypothesis using an F test

We can fit the full model to find the residual sum of squares. This sum of squares divided by the associated degrees of freedom df (same as in a.) is the denominator of the discrepancy measure. Then we can fit a reduced model in which excludes the explanatory variate associated with (since under the hypothesis = 0 ) and again calculate the residual sum of squares. The difference in the residual sum of squares (here with 1 degree of freedom) is the numerator of the discrepancy measure. We calculate the p-value by finding Pr( F1,df discrepancy measure) . c. Consider again the product testing example described in Exercise 3. Consider the hypothesis that the coefficient 7 of the explanatory variate pst.score is 0. Test the hypothesis in the two ways and show that the p-value is identical. [This is always true although a nuisance to prove]

From the fit of the full model b and summary(b), we get the discrepancy measure 16.410 and p-value < 2e-16 for the t test. We can fit the reduced model with 7 = 0 and get the F test from the ANOVA with the R statements c<-lm(sat.score~1+x1+x2+x3+x4+x5+x6) anova(c,b) The F-test in the output has discrepancy measure 269.30 with p-value < 2.2e-16 ***. Note that 269.30 = (16.410) 2 . d. If t ~ tk , show that t 2 has an F distribution. What are the degrees of freedom? G(0,1)2 K12 G(0,1) = 2 = F1,k and also that G(0,1) 2 ~ K1 . Hence tk2 = K k2 Kk Kk

We know that tk =

5. Some theory a. In the construction of the F test, explain why the additional sum of squares is always non-negative. The first step is to fit the full model y = 0 1 + 1 x1 +...+ p x p + r by minimizing || r||2 =|| y 0 1 + 1 x1 +...+ p x p ||2 with respect to 0 , 1 ,..., p . The hypothesis puts some restriction on 0 , 1 ,..., p so when we minimize|| r||2 under this restriction we cannot get

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -14

a smaller value than when there was no constraint. Hence the difference in the two minima must be non-negative. b. Consider the model y = 0 1 + 1 x1 +...+ p x p + r . Show that if we replace the * vector x j by the vector x * = x j x j 1, the model becomes y = 0 1 + 1 x1 +...+ p x * + r j p . That is, the coefficients of the explanatory variates do not change.

Letting x * = x j x j 1 and substituting x j = x * + x j 1 the model becomes j j * * y = 0 1 + 1 ( x1 + x11)+...+ p ( x p + x p 1) + r


* = ( 0 + 1 x1 +...+ p x p )1 + 1 x1 +...+ p x * + r p and setting 0 = 0 + 1 x1 +...+ p x p gives the required result.

c.

Explain why testing the hypothesis 1 = 2 =... = p = 0 will yield identical results for either formulation of the model.

* Since span(1, x1 ,..., x p ) = span(1, x1 ,..., x * ) , when fitting the full model, the residual sum p of squares is the same since the vector of estimated residuals is the same. Under the hypothesis, the reduced models are identical so the residual sum of squares will be the same. Hence the two tests are identical.

d.

In the revised model show that x * 1 for all j j

1t x j = ( xij x j ) = 0 so x * 1. j
i

e.

In testing the hypothesis, show that the additional sum of squares is ( X X* ) * where * = ( 1 ,..., p )t and X* = ( x1* ,..., x * ) . This quantity is often called p the regression sum of squares.
t * t *

Let X = (1 X* ) so that in the second representation of the model we have


y = 1 + X* * + r = 1 X*

b gFGH IJK + r
*

= X + r

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -15

F1 1 Since X = b1 X g we have X X = G HX 1
t t

1t X*
t *

t *

orthogonal to 1. Hence we have ( X t X ) 1

I = F1 1 0 I since the columns of X are J J G X X K H0 X X K F1 / n 0 IJ . Also X y = 1 y + X y so =G H0 (X X ) K


t * t *

t *

t *

= (X X) X y
t 1 t

FG1 / n 0 IJ (1 y + X y) H0 (X X ) K F y IJ =G H ( X X ) X yK
t t * 1 t * * t * 1 * t *

so = y , * = ( X*t X* ) 1 X*t y . Now can write y = 1 + X* * + r or equivalently y y1 = X* * + r where X* * r and so || y y1||2 =|| X* * ||2 +|| r ||2 . The left side is the minimum of the residual sum of squares under the hypothesis * = 0 . Hence the t additional sum of squares is | | X* * ||2 = * X*t X* * .

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -16

Chapter 5 Solutions

1. Consider the assessment data with simple model value = 0 + 1age + 2 size + residual . Use the methods in this chapter to assess the fit of the model and to suggest remedies. Is the prediction of value for a building with size 13.9 and age 30 sensitive to any particular cases? We start by fitting the simple model and looking at various plots of the residuals and the qq plot of the standardized residuals to examine the fit. Note, for this fit, the 95% prediction interval for age=30, size=13.9 is -46.7 to 21.6, an interval so wide that it is useless.

The one common feature of all of the plots is the two large residuals both of which correspond to a large fitted value and relatively small age and size. The qq plot of the standardized residuals is not linear but is highly distorted by the two large standardized residuals. If we delete these two points and repeat the fit we get the following plots. There is no evidence against the fit in any of these plots. With the two points deleted, the prediction interval is -17.0 to 25.6, much narrower but still not useful. The bottom line here is that it is not feasible to use these data to assess a building that is so much larger than any other in the sample.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -17

2. In an experimental Plan, there were three explanatory variates x1 , x2 , x3 that each were assigned two values, here coded as -1 and +1. There are 8 combinations. As well, the investigators looked at the response variate for the so-called center point x1 = 0, x2 = 0, x3 = 0. The data are shown below and can be found in the file ch5Exercise2.txt.

x1
-1 -1 -1 1 1 1 1 0 0 0 0

x2
-1 1 1 -1 -1 1 1 0 0 0 0

x3
1 -1 1 -1 1 -1 1 0 0 0 0

y 11.54 5.45 7.34 17.21 17.87 9.40 11.30 11.57 11.97 11.89 12.15

Suppose we fit a model y = 0 + 1 x1 + 2 x2 + 3 x3 + r . The summary output from R is Call: lm(formula = y ~ x1 + x2 + x3) Residuals: Min 1Q Median 3Q Max -1.2527 -0.1360 0.1465 0.4419 0.7615 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.6181 0.2372 48.985 3.87e-10 *** x1 2.4654 0.3000 8.218 7.68e-05 *** x2 -3.1071 0.3000 -10.357 1.70e-05 *** x3 0.5329 0.3000 1.776 0.119 --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.7649 on 7 degrees of freedom Multiple R-Squared: 0.969, Adjusted R-squared: 0.9558 F-statistic: 73 on 3 and 7 DF, p-value: 1.203e-05

We drop x3 from the model. To assess the fit of the model, consider two formal approaches.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -18

2 a) Add quadratic terms x12 , x1 x 2 , x 2 to the model and then test the hypothesis that the 2 additional terms are unnecessary. You will discover that x12 = x2 so we can only add two terms to the model.

Use the R statements to create the new variates, fit the extended model and then test the hypothesis that the coefficients of the second order terms are 0. x11<-x1*x1;x12<-x1*x2; b<-lm(y~x1+x2) c<-lm(y~x1+x2+x11+x12) anova(b,c) The output is Analysis of Variance Table Model 1: y ~ x1 + x2 Model 2: y ~ x1 + x2 + x11 + x12 Res.Df RSS Df Sum of Sq F Pr(>F) 1 8 5.9410 2 6 3.9852 2 1.9559 1.4724 0.3018 so there is no evidence that the second order terms are necessary which provides support for the linear model. b) Consider an extended model in which the mean of Y is a function ( x1 , x2 ) with no further specification. Show that the residual sum of squares from fitting this model is (yij yi ) 2 where i indexes the unique sets of explanatory variate values and j
i j

indexes the replicated observations within these sets. For the given data, use the additional residual sum of squares to test the hypothesis that the extended model is necessary. This is called a pure residual test of fit. If we model the mean to be different for every set of values of x1 , x2 , then the least squares estimate of ( x1 , x2 ) is the average of the response variate values at x1 , x2 . Hence the residual sum of squares is (yij yi ) 2 . In our example, if there is a single value of
i j

y at x1 , x2 , then ( x1 , x2 ) = y and the estimated residual is 0. There are repeated measurements only at the center point (0,0) where ( x1 , x2 ) = y = 11.895 and the residual sum of squares is 0.1763 with 3 degrees of freedom. From the output of the above ANOVA, for the model y = 0 + 1 x1 + 2 x2 + r , the residual sum of squares is 5.9410 with 8 degrees of freedom. The additional residual sum of squares is 5.7647 with 5 degrees of freedom and the F statistic (5 and 3 degrees of freedom) is

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -19

5.7647 / 5 = 19.6 .The corresponding p-value is 0.017 [R code pf(19,6,5,3)] so there is 0.1763 / 3 some evidence against the fit of the linear model.
3. Consider the data described in Chapter 3 in which a marketing firm wanted to compare two sales promotions against a control. The response variate is the weekly sales and there are four explanatory variates, two of which index the promotion used. a) After fitting the full model, is there any evidence of lack of fit? We fit the model response = 0 1 + 1 x1 + 2 x 2 + 3 pst. sales + 4 comp. sales + r and look at the following 6 plots.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -20

the plot of the fitted values versus the estimated residuals shows no unusual patterns. There is one very large fitted value corresponding to case 11 The qq plot is quite straight providing no evidence against the normality assumption The plots of the estimated residuals vs pst.sales and comp.sales each show a point far to the right again corresponding to case 11. Otherwise there are no apparent patterns. The plot of the leverages hii shows one point with extreme leverage again case #11. The plot of the studentized residuals shows that cases 5,15 and 21 have large studentized residuals 2.31, -2.43 and 2.70 respectively.

There are no obvious transformations or additions to the model.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -21

b) Suppose the primary question is to compare the two promotions adjusting for past and competitors sales. Are there any cases that have a large influence on the conclusion about this comparison? With the original fit, we can test the hypothesis that 1 = 2 using ANOVA by refitting a model with x = x1 + x2 . The output of the anova function is given below. Analysis of Variance Table Model 1: response ~ x + pst.sales + comp.sales Model 2: response ~ x1 + x2 + pst.sales + comp.sales Res.Df RSS Df Sum of Sq F Pr(>F) 1 26 13419.1 2 25 10640.1 1 2779.1 6.5297 0.01707 * --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Deleting the cases 5,11,15,and 21 in turn gives the p-vale for the test of the hypothesis 1 = 2 as shown in the following table.
Case Deleted 5 11 15 21 p-value 0.014 0.025 0.003 0.026

Deleting the cases one-at-a-time has little effect on the p-value and hence on the conclusion that there is a difference in the two promotions. 4. We give the basic mathematics behind the arithmetic that we use for the calculations when deleting a single case. The key step is to find an expression for the inverse of t X 1 X 1 where X1 is the matrix X with the first row u1t omitted. a) Suppose u and v are two n 1 column vectors and A = I + vu t . Find the constant a so that ( I + vu t ) 1 = I + avu t [This is known as a rank one update] We have I = A 1 A = ( I + avu t )( I + vu t ) = I + avu t + vu t + avu t vu t = I + ( a + 1 + au t v)vu t and so a is the solution to a + 1 + aut v = 0 or a =
1 1 + ut v

b) If C = B + uu t where B is invertible, find an expression for C 1 .

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -22

We write C = B( I + B 1uu t ) = B( I + vu t ) where v = B1u . Hence we have C 1 = ( I + vu t ) 1 B 1 vu t ) B 1 t 1+ u v B 1uu t = (I ) B 1 1 + u t B 1u = (I c) Suppose we consider dropping the first case when fitting the model y = X + r . Show t t that X t X = X 1 X 1 + u1u1t and hence find an expression for ( X 1 X 1 ) 1 . We can write X =

F u I where u GH X JK
t 1

t 1

gives the values of the explanatory variates for the first

t case. Hence we have X t X = (u1 X1 )

Fu I = u u + X GH X JK
t 1

t 1

t 1

X1 . Hence we have

t t X 1 X 1 = X t X u1 u1t and ( X1 X1 ) 1 = ( I +

( X t X ) 1 u1u1t )( X t X ) 1 1 u1t ( X t X ) 1 u1

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -23

Chapter 6 Solutions

1. Suppose the columns of X are orthogonal. Show that the estimate of j , the coefficient of x j , is not dependent on which are columns of X are included in the model. Suppose we have a model that includes x j and any other columns of X . We can write the model as y = U + r . Note that the columns of U are also orthogonal so that U tU is diagonal and the diagonal element corresponding to j is x tj x j . Hence the diagonal element of (U tU ) 1 corresponding to j is 1 / x tj x j and since (U tU ) 1 is also diagonal, we x tj y have j = t independent of all the other explanatory variates xjxj 2. Show that c p = p + 1 for the full model that includes all p explanatory variates. By definition if there are k explanatory variates in the model (plus a constant term), then

2 explanatory variates, we get (n - p - 1) 2 cp = + 2( p + 1) n = p + 1 as required. 2

cp =

estimated residual sum of squares

+ 2( k + 1) n . If we fit the full model with p

3. The file ch6Exercise3.txt contains a response variate y and 10 explanatory variates x1 ,..., x10 for 100 cases. These data were created artificially for practice. The model used to generate the data was Y = 3 x1 + 0.3 x2 2 x4 + x7 x9 + R, R ~ G(0,2). Note that the columns of X are not orthogonal. a) Fit a model using forward selection. At each step, use a p-value of 0.05 to decide to proceed. We start with all one variate models and pick the one with the highest R 2 value if any are significant at the 0.05 level. The highest value corresponds to x7 with R2 = 0.5084 Now we build all two variate models that include x7 and select the next variate to have the highest R 2 value if significant. Only models with x1 , x6 , x8 , x10 have coefficients significantly different from 0 and the highest R2 = 0.5947 corresponds to x8 . Now we build all three variate models that include x7 , x8 and pick the one with the highest R 2 as long as the coefficient is significantly different from 0. x1 , x2 , x3 , x9 have significant coefficients and x1 has the highest R2 = 0.6446 .

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -24

For four variate models including x1 , x 7 , x8 , x 4 , x5 , x6 , x9 and have significant coefficients and x4 has the highest R2 = 0.768 . For five variate models including x1 , x 4 , x7 , x8 , only x10 has a significant coefficient with R2 = 0.781 For six variate models including x1 , x 4 , x7 , x8 , x10 , no other variate has coefficient that is significantly different from 0 with p-value less than 0.05. So we end with the model that includes x1 , x 4 , x7 , x8 , x10 with R2 = 0.781. b) Fit a model using backwards selection using a p-value of 0.05 to decide to proceed at each step. For backwards selection we start with a ten variate model and delete the one that is least significant if we can find one. With ten variates, all but two are not significant. We delete x 5 . With the remaining nine variates, we delete x2 With the remaining eight variates, we delete x6 With the remaining seven variates, we delete x1 With the remaining six variates x3 , x 4 , x7 , x8 , x9 , x10 , all variates have coefficients with pvalue less than 0.05 and we stop. We have R2 = 0.782 c) Use leaps to investigate all possible models. Pick a reasonable model. The output from the R code is (Intercept) x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 1 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 2 1 1 0 0 1 0 0 0 0 0 0 2 1 0 0 0 0 0 0 1 1 0 0 3 1 1 0 0 1 0 0 1 0 0 0 3 1 1 0 1 1 0 0 0 0 0 0 4 1 1 0 0 1 0 0 1 0 1 0 4 1 1 0 0 1 0 1 1 0 0 0 5 1 1 0 0 1 0 0 1 1 0 1 5 1 1 0 0 1 0 0 1 0 1 1 6 1 1 0 0 1 0 1 1 1 0 1 6 1 0 0 1 1 0 0 1 1 1 1 7 1 1 0 1 1 0 0 1 1 1 1 7 1 1 1 0 1 0 1 1 1 0 1 8 1 1 0 1 1 0 1 1 1 1 1 8 1 1 1 0 1 0 1 1 1 1 1 cp adjr2 109.515179 0.5033519 155.105201 0.3931790 30.024557 0.6971923 75.415232 0.5863704 11.582227 0.7444682 21.585931 0.7197896 6.130472 0.7603550 6.247352 0.7600636 3.633685 0.7691349 5.181391 0.7652356 4.642803 0.7691758 5.036584 0.7681730 5.474527 0.7696742 5.990974 0.7683448 7.168717 0.7679391 7.302265 0.7675915

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -25

The best choice is the five variate model with x1 , x 4 , x7 , x8 , x10 and with many other good candidates. d) How do the results of the three strategies compare in this case. Here the forward selection and best subsets methods got us to the same model. The backwards selection got us to a six variate model which has good cp and adjusted R 2 values. None of the methods reproduced the model used to generate the data this is not surprising because of the correlations among the columns of the X matrix..
Chapter 8 1. Consider the sampling protocols defined in Example 1. a) Show that the inclusion probability for each unit in the frame is 1/100 for every protocol. For each protocol, the model is uniform. That is, the chance of any possible sample is equal. To find the inclusion probability, we need only count the number of samples that contain a particular unit. Once the unit is in the sample, we count the ways of selecting the remaining 99 units. 9999 9999 99 1 = SRS: there are ways to select the other units so pi = 10000 99 100

FG H

IJ K

Systematic sampling: there is only one way to select the sample so pi = 1 / 100

F 999IJ FG1000IJ ways to select the other units so Stratified sampling: there are G H 9 K H 10 K FG 999IJ FG1000IJ H 9 K H 10 K = 1 p = FG1000IJ 100 H 10 K F 999IJ ways to choose the remaining clusters so Cluster sampling: there are G H9K FG 999IJ H 9K = 1 p = FG1000IJ 100 H 10 K
9

FG IJ H K FG IJ H 100 K

10

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -26

Two stage sampling: there are

FG 9IJ ways to select the second primary unit. Then the other H1 K FG 9IJ FG 999IJ FG1000IJ F 999IJ FG1000IJ ways so p = H1 K H 49 K H 50 K = 1 99 secondary units can be selected in G H 49 K H 50 K FG10IJ FG1000IJ 100 H 2 K H 50 K
i 2

b)

On a final examination, a student once defined simple random sampling as follows: simple random sampling is a method of selecting units from a population so that every unit has the same chance of selection. Is this a correct answer?

No because there are many sampling protocols that satisfy this definition as shown in a) yi i s c) Show that the estimator corresponding to the sample average = is n unbiased for for each of the protocols. Let Ii =

~ =

y I
i U

R1 if unit i is in the sample i = 1,..., N so that E( I ) = p . Then we can write S0 otherwise T


i i

i i

~ and E( ) =

y E( I ) y p
i i i i U

i U

n
i

= since pi = n / N for all five protocols.

2. Consider the estimate = a)

~ and the corresponding estimator . n 1 ~ For SRS, show that 2 is an unbiased estimator for 2 . [Hint: Use the fact that ( yi y ) 2 = yi2 ny 2 ].
i s
i s i s i U

(y

y)2

~ ~ Using the hint, we can write (n 1) 2 = yi2 Ii n 2 where E( Ii ) = n / N and ~ ~ ~ E( 2 ) = Var( ) + E( )2 = (1 f )

n 2 ~ 2 ) = 1 ( y 2 n n[(1 n ) 2 ]) E( i n 1 i U N N n = = 1 n n ( [ yi2 N 2 ] (1 ) 2 ) n 1 N i U N 1 n n ( ( N 1) 2 (1 ) 2 ) n 1 N N
~ Is unbiased for ?

+ 2 . Combining the results we have

=2 b)

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -27

No using the result assignment 2. 3. To estimate the total number of male song sparrows in a 10 km by 10 km square (http://www.birdsontario.org/atlas/atlasmain.html ) for a breeding bird atlas, a simple random sample of 50 one hectare plots (a hectare is 100m by 100m) is selected. Using a GPS system, your intrepid instructor visits each of the selected plots (after dawn but before 9:00 am between May 24 and July 6) and counts the number of singing male song sparrows detected in a 10 minute period. The data are summarized below. # of sparrows # of plots a) 0 28 1 13 2 5 3 3 4 1

Find a 95% confidence interval for the total number of male song sparrows in the square. The data can be written as y1 ,..., y50 where 28 of the yi are 0 and so on. Hence the sample mean and standard deviation are = 0.72, = 1.011 and a 95% confidence interval for 50 1.011 or 0.72 0.28 . A the average number of sparrows per plot is 0.72 1.96 1 10000 50 95% confidence interval for the total number of male sparrows in the square ( = 10000 ) is 7200 2800 b) Suppose that I wanted to estimate the total number of male song sparrows to within 1000 with 95% confidence. How many additional plots are needed? n 1/ 2 . Using In general, the length of the confidence interval for is 2 1.96(1 ) n 10000 = 1.011 and interval length 0.2, we can solve for n to find n = 378. Hence we need about 328 more plots to achieve the desired precision.

4. Suppose we want to estimate a population average so that the relative precision is specified. That is, we want to find the sample size required (SRS) so that the length of the confidence interval 2l divided by the sample average is pre-determined. a) For a given confidence level and required precision p%, find a formula for the required sample size. 1 1 . We want to find the In general, the length of a confidence interval for is 2c n N 1 1 / = p / 100 . Solving for n we have the ugly formula sample size n so that 2c n N 1 n= 1 p 2 +( ) N 200c b) What knowledge of the population attributes do we need to make this formula usable? Exercise Solution -28

Stat 371 R.J. MacKay, University of Waterloo 2009

We need an estimate of the so-called coefficient of variation / . 5. One cheap but (poor) way to check the quality of a batch of items is called acceptance sampling. Suppose that there are N = 1000 items in a shipment and you cannot tolerate more than 1% defective (your first mistake why should you tolerate any defective items from your supplier). You decide to select and inspect a sample of 20 items and accept the shipment if you find 0 defectives. If you find 1 or more defective items, you inspect the complete shipment. a) How would you select the sample? It would be nice to use SRS but it is likely too expensive unless the items are already numbered and it is easy to locate an idea with a specified label. Usually haphazard (for small items) or systematic sampling is used in this context. b) Calculate the probability p( ) that you accept the shipment as a function of , the percentage of defective items in the shipment. Since we are sampling such a small fraction of the shipment, we can approximate the number of defective items in the sample by a binomial random variable with n = 20 and the probability of a defective item . Then we have P(accept shipment) = (1- ) 20 c) Graph p( ) for 0 10% Using R, we have the following graph.

d) Given the results in c), you decide to increase the sample size so that there is only a 5% chance of accepting a shipment with 1% defective. What sample size do you recommend?

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -29

Suppose the sample size is n . Assuming that we can use the binomial approximation, we have P(accept shipment) = (1- ) n . We want to find n so that (1 0.01) n = 0.05 so n = 298 . This is so large that the binomial approximation may breakdown and, on a practical basis, is completely unreasonable. I recommend you tell your supplier to ensure that there are no defective items in the shipments. Sampling inspection is not useful here.
Chapter 9

Find the quadratic expansion of f ( x, y) = y / x about the point ( ( x ), ( y)) to 1. ~ ~ ~ estimate the bias in the estimator = ( y) / ( x ) . Note that the general form of the expansion is
f ( x , y ) f ( x 0 , y0 ) + f ( x 0 , y0 ) f ( x 0 , y0 ) 2 f ( x 0 , y0 ) ( x x 0 ) 2 2 f ( x 0 , y0 ) 2 f ( x 0 , y0 ) ( y y0 ) 2 ( x x0 ) + ( y y0 ) + + ( x x 0 )( y y0 ) + 2 x y x 2 xy y 2 2

This quadratic function has the same value, first and second derivatives at the point ( x0 , y0 ) as does f ( x, y) . You can easily check this statement by differentiating the right side of the expression. To use the expansion, we have f ( y) f 1 2 f 2 ( y) 2 f 1 2 f = = , , 2 = = , , 2 =0 ( x )2 y ( x ) x ( x )3 xy ( x )2 y x so we can write ~ ~ ( y) ~ 1 ~ 2 ( y) [ ( x ) ( x )]2 1 ~ ~ [ ( x ) ( x )] + [ ( y) ( y)] + [ ( x ) ( x )][ ( y) ( y)] 2 3 2 ( x ) ( x) ( x) ( x) 2 and

( y) 1 ~ ~ ~ ~ E ( ) + Var ( ( x )) Cov( ( x ), ( y )) 3 ( x) ( x )2
= + 1 ~ ~ ~ [ Var ( ( x )) Cov( ( x ), ( y)] 2 ( x)

2 ~ ( x )) = (1 f ) ( x ) The approximate bias is given by the second term. We know Var( n ~ ( x ), ( y)) = (1 f ) Cov( x, y) where Cov( x, y) ~ and with a bit of effort, we can show Cov( n 1 is the population covariance. The key point is to notice that the bias has a factor and n will be small if the sample size is large.

2.

In order to count the number of small items in a large container, a shipping company selects a sample of 25 items and weighs them. They then weigh the whole shipment (excluding the container). Assume that there is small error in weighing and act as if SRS is used - it is not, the sampling is haphazard. Let the weight of the ith item in the population be yi and the total known weight be

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -30

a)

Show that an estimate of the population size is N =

Note that = N ( y) so we can construct an estimate of N using our knowledge of estimating ( y) . The sample average and population average should be close, so we have yi i s and hence the estimate N = . N 25 yi / 25
i s

y
i s

/ 25

b)

Find the (approximate) mean and standard deviation of the corresponding ~ estimator N .

Consider expanding the function f ( y) = 1 / y . The linear approximation about ( y) is 1 1 1 1 1 ~ 1 ( ( y) ( y)) and + ( y ( y)) and hence we have ~ + 2 y ( y) ( y) ( y) ( y) ( y ) 2 1 1 1 1 ~ ~ ~ E( ~ ) , Var ( ~ ) Var ( ( y)) . Since N = / ( y) , we have 4 ( y) ( y) ( y) ( y) (1 f ) N 2 ( y)2 N2 ~ ~ ~ Var ( ( y)) = E( N ) N , Var ( N ) . n ( y) 2 ( y)2 In the example, the sample average weight is 75.45 g with sample standard deviation 0.163 g and the total weight is 154.2 kg. Find a 95% confidence interval for the total number of items in the container. (1 f ) ( y) ~ To find the confidence interval we have N ~ G( N , N ) , approximately. Note n ( y) that the mean and standard deviation both depend on the unknown N which is different ~ N 1 f ( y) from the usual situation. Instead we work with ~ G(1, ) and hence we N n ( y) have ~ N / N 1 | 1.96) = 0.95 Pr(| (1 f )1/ 2 ( y) n ( y) ~ Re-arranging the inequality and substituting N , ( y), ( y) for N , ( y), ( y) , we get the confidence interval N N ( , ) 1/ 2 1.96(1 f ) ( y) 1.96(1 f )1/ 2 ( y) 1+ 1 n n ( y) ( y) In the example, we have N = 2044 and the 95% confidence interval is (2035, 2053). c)

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -31

3.

Briefly describe when you would use the ratio or regression estimate instead of the sample average to estimate the population average. We need an explanatory variate with known population average that can be measured on each unit in the sample. If the response variate is approximately proportional to the explanatory variate, then the ratio estimate is more precise than the sample average. If the response variate is approximately linear in the explanatory variate, then the regression estimate is more precise. Many bird species have specialized habitat. We can exploit this knowledge when we are trying to estimate population totals or density. For example, wood thrush are a forest dwelling bird that live in the hardwood forests of eastern North America. Suppose we wanted to estimate the number of wood thrush pairs nesting within the region of Waterloo, an area of highly fragmented forest patches. Using aerial photography, we know that there are 1783 such patches (minimum size 3 ha) with an average size 13.4 ha. A simple random sample of 50 woodlots is selected and the number of nesting pairs yi is counted in each woodlot by counting the number of singing males. The area xi of each sampled woodlot is also recorded. The data are available in the file exercise8.xls. Find 95% confidence intervals for the total number of thrushes based on the a) sample average y ratio estimate b) c) regression estimate If you do not want to do the calculations, write out what summaries you need to get the three confidence intervals. We need the following summaries of the data:

4.

( x ) = 11.72, ( y) = 1.62, = ( x ) = 11.72 ( y) = 1.34,

( y) = 0.138, = ( y) = 1.62, = 0.228 ( x)

(y x )
i i i s

n 1

= 0.399,

[y ( x
i i s

( x ))]2 = 0.142

n 1

a) The sample average is 1.62 with associated estimated standard deviation 1 f ( y)2 = 0.187 so an approximate 95% confidence interval for the average n number of thrushes per woodlot is 1.62 0.37 and the interval for the total number of thrushes is 1783(1.62 0.37) = 2888 653 b) The ratio estimate is ( x ) = 1.85 with associated estimated standard deviation 1 f i s = 0.088 so an approximate 95% confidence interval for the n n 1 population average is 1.85 0.17 and for the population total is 1783(185 0.17) = 3299 308 .
i i

(y x )

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -32

c) The regression estimate is ( y) + ( ( x ) ( x )) = 2.00 with associated estimated


1 f i s = 0.053 so an approximate 95% n n 1 confidence interval for the population average is 2.00 0.10 and for the population total is 1783(2.00 0.10) = 3566 184
i i

standard deviation

[y ( x

( x ))]2

The scatterplot on the next page shows the fitted regression line (solid), the fitted line through the origin (dotted) and ( x ) = 13.4 .

Number of thrushes versus woodlot area


6 5 4 number 3 2 1 0 0 -1 5 10 15 area 20 25 30 35

( x ) = 13.4

Chapter 10

1. In many surveys, there is interest in estimating strata averages or differences in strata averages.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -33

~ a) In general, for SRS, write down the distribution for the estimators h and ~ ~ h k . Assuming relatively large sample sizes within the strata we have approximately

~ h ~ G( h ,(1 fh )1/ 2

h
nh

~ ~ h k ~ G( h k ,

(1 fh ) 2 (1 fk ) 2 h k ) + nh nk

b) In the well survey, find a 95% confidence interval for the proportion of wells in farms with animals that are contaminated The estimate of the proportion contaminated is 1 =.172 with associated estimated 1 f 1 (1 1 ) = 0.030 so the 95% confidence interval is standard deviation n 0.172 0.058 c) In the well survey, find a 95% confidence interval for the average Na difference between the two types of farm wells. For farms with animals, we have 1 = 237.3 with associated estimated standard deviation
2 (1 f1 ) 1 = 3.275 . n1 For farms without animals, we have 2 = 245.6 with associated estimated standard

deviation

2 (1 f2 ) 2 = 3.614 and hence we have 2 1 = 8.30 with associated n2

estimated standard deviation 3.2752 + 3.614 2 = 4.877 . Hence a 95% confidence interval for 2 1 is 8.30 9.60 . There is no evidence of a difference in average Na levels between the two groups of farms. 2. Suppose that the purpose of the survey is to estimate a population proportion . If there are H strata, a) Write down the stratified estimate of and the variance of the corresponding estimator. Since = W1 1 +...+WH H we have strat = W1 1 +...+ WH H and
2 ~ ~ ~ Var ( strat ) = W12 Var ( 1 )+...+ WH Var ( H )

(1 f1 ) 2 (1 f H ) 1 (1 1 )+...+ WH H (1 H ) n1 nH (ignoring the factors nh / (nh 1) ) ~ b) What is the variance of strat for proportional allocation? = W12

If nh = Wh n , we have

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -34

(1 n1 / N1 ) 2 (1 nH / N H ) ~ Var ( strat ) = W12 1 (1 1 )+...+ WH H (1 H ) W1n WH n 1 f [W1 1 (1 1 )+...+ WH H (1 H )] n where f = 1 n / N . =

c) How should the strata be formed so that the stratified sampling protocol is superior to SRS? We want to form the strata so that [W1 1 (1 1 )+...+WH H (1 H )] < (1 ) . We do this by making h close to 0 or 1 for each stratum. In words, we decrease the variation in the strata by making the response more consistent. 3. Suppose the well survey was to be re-done with the same overall sample size 500. How would you recommend allocating the sample to the strata if a) Estimating the average Na level was the primary goal. For optimal allocation, we have nh Wh h . If we assume that the standard deviations do not change markedly, we use the estimates from the current survey to allocate the sample. We have
stratum 1 2 3 Weight St Dev 0.177 41.45 0.097 37.62 0.726 51.23

so the optimal sample sizes are A: 76, 38 and 386. more weight is given to stratum three because it is larger and has higher estimated standard deviation. b) Estimating the proportion of contaminated wells was the primary goal. For optimal allocation, we have nh Wh h (1 h ) and we use the current estimates to get
startum Weight sd 1 0.177 0.37738 2 0.097 0.317811 3 0.726 0.338491

So the optimal sample sizes are B:97, 45 and 358. ~ c) For each case, compare the predicted standard deviations of strat and ~ strat to what occurred in the current survey. We have

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -35

(1 f1 ) 2 ~ 2 (1 f H ) Var ( strat ) = W12 1 +...+ WH 2 H n1 nH (1 f1 ) 2 (1 f H ) ~ Var ( strat ) = W12 1 (1 1 )+...+ WH H (1 H ) n1 nH Using the current estimates and the two new allocations, we get the estimated standard deviations
allocation current

~ strat strat

A 2.11 0.015

B 2.13 0.015

2.42 0.016

The estimator of the proportion is much less sensitive to changes in the allocation. . ~ 4. Consider the difference of the variances of strat under proportional and optimal allocation for a sample of size n. Ignore the fpc. 1 a) Show that this difference can be written as ( h )2 Wh where n h = h Wh is the weighted average standard deviation over the H
h

strata. For optimal allocation, ignoring the fpc, we have 1 Vopt = (W1 1 +...+ WH H ) 2 and, for proportional allocation, we have n 1 2 Vprop = (W1 1 +...+ WH 2 ) . The weights can considered a probability distribution on the H n 1 1 integers 1,,H so we have ( h )2 Wh = [ 2 Wh ( h Wh ) 2 ] as required. h n h n h h b) When will the gain be large with optimal allocation relative to proportional allocation? The gain will be largest when the standard deviations vary widely. 5. In an informal sample of math students at UW, 100 people were asked their opinion (on a 5 point scale) about the core courses and their value. One particular statement was (with scores): All mathematics students are required to take Stat 231? strongly agree 1 agree 2 neutral 3 disagree- 4 strongly disagree - 5 The sample results, broken down by year are shown below. Estimate the average score for all math students and find an approximate 95% confidence interval for the population average note that SRS was not used here so were are making assumptions about the estimators that may be unwarranted. There are about 3300 students in the faculty.

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -36

Year

Sample size

1 2 3 4

39 23 26 12

Population weight 0.31 0.24 0.23 0.22

Average score

2.8 3.5 3.2 3.1

Standard deviation 1.22 1.09 1.03 0.87

We can estimate the average score as if we had stratified the sampling beforehand.

. . post = 0.31(2.8) + 0.24(3.5) + 0.23(3.2) + 0.22(31) = 3126 . The approximate estimated ~ variance of post is
1 fh 2 2 Wh h = 0.107 nh h The approximate 95% confidence interval is 313 0.21. .

Stat 371 R.J. MacKay, University of Waterloo 2009

Exercise Solution -37

In Stat 371, we deal with applications and theory of the linear model Y = X + R where X = (1 x1 ... x p ) is a n ( p + 1) matrix with columns giving the values of the explanatory variates and R is a vector of random variables with independent components Ri ~ N (0, 2 ) . We represent the corresponding data model by y = X + r where y is the vector of observed values of the response variate. 1. For the model described above: $ a) (4 marks) Derive the least squares estimate of , i.e. show that = ( X t X ) 1 X t y . Be sure to explain the principles underlying your derivation.
n

Statistics 371 Midterm Solution

The least squares criterion is to minimize

ri
i =1

=r t r = ( y X )t ( y X ) with respect to

.
y r
X

From the picture, the minimum value corresponds to the orthogonal projection of y onto $ the column space of X so that y X is perpendicular to 1, x1 ..., x p , the columns of X or $ $ $ equivalently X t ( y X ) = 0 . Solving we have X t y = X t X so = ( X t X ) 1 X t y as required. b)
% (3 marks) Show that the estimator r corresponding to the estimated residuals is 2 MVN (0, ( I H )) where H = X ( X t X ) 1 X t .

$ $ Since r = y X = y X ( X t X ) 1 Xy = ( I H ) y , we have ~ = ( I H )Y = ( I H )( X + R) = ( I H ) R and hence ~ is multivariate normal with mean r r vector and variance covariance matrix E(~) = E(( I H ) R) = ( I H )0 = 0 r Var ( ~) = Var (( I H ) R) = ( I H ) 2 I (( I H )t = 2 ( I H ) r

since ( I H )( I H )t = ( I H ) and I H is a symmetric projection matrix.

c)

(3 marks) Using the result in b), explain the notion of a unit in the sample that has high leverage. We say that unit i has high leverage if the ith diagonal element of H, hii , is close to 1. ~ $ Using the result in b), we have ri ~ N (0,(1 hii ) 2 ) and ifhii 1, we know that ri is close to 0 . Since hii depends only on X, the fitted plane passes close to yi regardless of its value so deleting case i may have a large effect on the fitted plane.

2. Suppose we are interested in understanding the relationship between a response variate y and a specified explanatory variate x_1. In an investigation, y, x_1 and a second explanatory variate x_2 are measured on a sample of 50 units from the study population.

a) (2 marks) Consider the two models Model 1: Y = β_0 1 + β_1 x_1 + R and Model 2: Y = α_0 1 + α_1 x_1 + α_2 x_2 + R. What is the difference in interpretation between β_1 and α_1?

β_1 represents the change in the expected response for a unit change in x_1. α_1 represents the change in the expected response for a unit change in x_1 with x_2 held fixed.
From R, the summary output from fitting the two models is:

Call: lm(formula = y ~ x1)
Residuals:
     Min       1Q   Median       3Q      Max
-2.81201 -1.02882 -0.02064  0.85583  3.26556
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.3255     0.3653   83.01  < 2e-16 ***
x1            0.7700     0.1256    6.13 1.59e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.416 on 48 degrees of freedom
Multiple R-Squared: 0.4391, Adjusted R-squared: 0.4274
F-statistic: 37.58 on 1 and 48 DF, p-value: 1.587e-07

Call: lm(formula = y ~ x1 + x2)
Residuals:
    Min      1Q  Median      3Q     Max
-2.7724 -0.9820 -0.1072  0.7907  3.2069
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.53223    0.44051  69.311  < 2e-16 ***
x1           0.81995    0.13915   5.892 3.91e-07 ***
x2          -0.04125    0.04878  -0.846    0.402
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.421 on 47 degrees of freedom
Multiple R-Squared: 0.4475, Adjusted R-squared: 0.424
F-statistic: 19.03 on 2 and 47 DF, p-value: 8.807e-07

b) (2 marks) The value of R^2 is a bit larger for model 2. Does this mean that this model better fits the data?

No. We know that R^2 can never decrease as we add terms to a model, whether or not the corresponding explanatory variate has an effect on the response. R^2 measures the proportion of variation in the response variate explained by the explanatory variates.

c) (3 marks) Using model 2, find a 95% confidence interval for α_1.

From the R output we have α̂_1 = 0.820 with standard error 0.139 and 47 degrees of freedom. From the t-table we have P(|t_47| ≤ 2.01) ≈ 0.95, so the confidence interval is 0.820 ± 2.01 × 0.139 (estimate ± c × standard error), or 0.820 ± 0.280.
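The same interval can be obtained in R; a sketch, assuming the fit of model 2 is stored as model2:

model2 <- lm(y ~ x1 + x2)             # hypothetical fit of model 2
qt(0.975, df = 47)                    # c is about 2.01 with 47 degrees of freedom
confint(model2, "x1", level = 0.95)   # estimate +/- c * standard error for x1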

d) (1 mark) The F statistic in the summary output for model 2 is F = 19.03. What does this signify?

Since the F ratio is so large (p-value 8.807e-07), there is strong evidence against the hypothesis that all of the coefficients other than the intercept are 0. That is, there is strong evidence that one or more of the explanatory variates explains variation in the response variate.

e) (3 marks) The output for model 2 shows that there is no evidence that α_2 differs from 0, using a t-test. Show how we can use the analysis of variance to test the same hypothesis.

To use ANOVA, we need to find two estimates of σ^2. From fitting the full model, the estimate of σ^2 that does not depend on any hypothesis about α_2 is 1.421^2 = 2.019 with 47 degrees of freedom. When fitting the model assuming α_2 = 0, the estimate of σ^2 is 1.416^2 = 2.005 with 48 degrees of freedom. Hence the change in the residual sum of squares is 48 × 1.416^2 − 47 × 1.421^2 = 1.338 with 1 degree of freedom, so the F-ratio is (1.338/1)/2.019 = 0.663 and there is no evidence that α_2 differs from 0.
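As an aside (not part of the solution), the same comparison is produced directly by anova() in R; a sketch assuming the two fits are named model1 and model2:

model1 <- lm(y ~ x1)        # reduced model, assumes the coefficient of x2 is 0
model2 <- lm(y ~ x1 + x2)   # full model
anova(model1, model2)       # F = (change in RSS / 1) / (residual mean square of full model)

For a single added term this F statistic equals the square of the t value for x2 (here (−0.846)^2 ≈ 0.716); the hand calculation above gives 0.663 only because the residual standard errors were rounded.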

f) (2 marks) A plot of the estimated residuals versus the fitted values from Model 2 is shown below. Based on the plot, what action would you recommend? Why?

Note: the plot has been corrected in the solution. No action is required, because there are no apparent patterns or outliers in the plot. Some answers suggested a funnel effect, which would indicate a non-constant standard deviation and suggest transforming the response variate (using the logarithm, for example).

Statistics 371 Sample Midterm Solution


In Stat 371, we deal with applications and theory of the linear model Y = Xβ + R where X = (1 x_1 ... x_p) is an n × (p+1) matrix with columns giving the values of the explanatory variates and R is a vector of random variables with independent components R_i ~ G(0, σ). We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate.

1. (4 marks) Give two distinct, different uses of this model in business contexts.

prediction: predict the market value of a building using the selling price (response variate) and various explanatory variates (size, age, ...) from sales of similar buildings

estimate parameters: estimate the volatility of a share price relative to an index using past closing prices

look for outliers: identify extreme salaries (response variate) after adjusting for explanatory variates such as experience, age, educational qualifications, ...

2. For the model described above:

a) (1 mark) What is the criterion used to produce the least squares estimates of the parameters β?

We minimize ||r||^2 = ||y − Xβ||^2, or equivalently we choose β̂ so that r̂ = y − Xβ̂ is perpendicular to span(1, x_1, ..., x_p).

b) (5 marks) We know that the least squares estimate of β is β̂ = (X^t X)^{-1} X^t y and the corresponding estimator is β̃ ~ N(β, σ^2 (X^t X)^{-1}). Suppose we want to predict the response variate for a unit with values of the explanatory variates u^t = (1, u_1, ..., u_p). Derive a 95% prediction interval. Be sure to explain the derivation.

We know that β̃ ~ N(β, σ^2 (X^t X)^{-1}) and hence u^t β̃ ~ N(u^t β, σ^2 u^t (X^t X)^{-1} u). We are predicting Y where Y ~ N(u^t β, σ^2), independent of β̃. Hence we have

Y − u^t β̃ ~ N(0, σ^2 (1 + u^t (X^t X)^{-1} u)).

Standardizing and replacing σ by σ̃, we have

(Y − u^t β̃) / (σ̃ √(1 + u^t (X^t X)^{-1} u)) ~ t_{n−(p+1)}

We use this random variable as the basis for our interval. Choosing c so that Pr(|t_{n−(p+1)}| ≤ c) = 0.95, we have

Pr(−c ≤ (Y − u^t β̃) / (σ̃ √(1 + u^t (X^t X)^{-1} u)) ≤ c) = 0.95

Cross-multiplying and re-arranging, we get the probability statement

Pr(u^t β̃ − c σ̃ √(1 + u^t (X^t X)^{-1} u) ≤ Y ≤ u^t β̃ + c σ̃ √(1 + u^t (X^t X)^{-1} u)) = 0.95

We get the 95% prediction interval by replacing the estimators by the corresponding estimates:

(u^t β̂ − c σ̂ √(1 + u^t (X^t X)^{-1} u),  u^t β̂ + c σ̂ √(1 + u^t (X^t X)^{-1} u))

Note: many marks were lost because of confusion among parameters, estimates and estimators.
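In practice the interval is computed by predict() in R; a minimal sketch with a hypothetical fitted model and new unit:

fit <- lm(y ~ x1 + x2)                   # hypothetical fitted model
new <- data.frame(x1 = 1.0, x2 = -0.5)   # the new unit, u^t = (1, 1.0, -0.5)
predict(fit, newdata = new, interval = "prediction", level = 0.95)
# interval = "confidence" would instead give an interval for u^t beta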

3. In a compensation study of chief executive officer salaries in one state, data were collected from 91 rural school districts in a given year. The purpose of the investigation was to determine if the salaries were relatively equitable, or if some CEOs were highly under- or over-paid relative to the others, after adjusting for qualifications. The variates measured were:

experience: number of years in the current or similar job
size: number of students in the district
education level: BA only, MA or PhD
cost of living (col): relative cost of living in the district
salary: annual salary of CEO

Note that education level is captured by two explanatory variates ma = 0, 1 and phd = 0, 1 where 1 indicates the presence of the degree. If phd = 1, then the CEO has the equivalent of both degrees, so ma is also set to 1. The R output from fitting a linear model to the data is given in the box below.

Call: lm(formula = salary ~ experience + size + ma + phd + col)
Residuals:
    Min      1Q  Median      3Q     Max
-2969.7  -944.2  -135.4  1059.6  3299.6
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 90717.6700  4093.0329  22.164  < 2e-16 ***
experience    171.1682    36.4134   4.701 9.91e-06 ***
size            3.1024     0.4598   6.747 1.74e-09 ***
ma           5228.7820   334.4497  15.634  < 2e-16 ***
phd          4910.7543   474.3909  10.352  < 2e-16 ***
col         -2917.0956  4152.1441  -0.703    0.484
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1402 on 85 degrees of freedom
Multiple R-Squared: 0.889, Adjusted R-squared: 0.8825
F-statistic: 136.2 on 5 and 85 DF, p-value: < 2.2e-16

a) (1 mark) Carefully interpret the coefficient corresponding to the explanatory variate phd.

Since E(Y) = β_0 + β_1 experience + β_2 size + β_3 ma + β_4 phd + β_5 col, β_4 represents the average change in salary if a CEO gets a PhD, all other explanatory variates held fixed.

b) (1 mark) Suppose we add a product term phd*experience with coefficient β_14 to the model. Carefully interpret this parameter.

With the new model we have

E(Y) = β_0 + β_1 experience + β_2 size + β_3 ma + β_4 phd + β_5 col + β_14 experience*phd

If phd = 0, E(Y) = β_0 + β_1 experience + β_2 size + β_3 ma + β_5 col, and if phd = 1,

E(Y) = β_0 + β_1 experience + β_2 size + β_3 ma + β_4 + β_5 col + β_14 experience
     = β_0 + (β_1 + β_14) experience + β_2 size + β_3 ma + β_4 + β_5 col

Hence β_14 represents the change in the rate at which experience affects average salary for a CEO with a PhD versus one without a PhD.

Note: This was meant to be difficult and it proved to be so. Good thing it was only one mark!

c) (2 marks) The Pr(>|t|) for the variate col is 0.484. What does this tell us?

The p-value for the hypothesis β_5 = 0 is large, so there is no evidence against this hypothesis. That is, there is no evidence that col affects salary, all other explanatory variates being held fixed.
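As an aside (not part of the original solution), a sketch of how such a product term might be added in R; the data frame name ceo is hypothetical:

# Hypothetical data frame 'ceo' with columns salary, experience, size, ma, phd, col
fit0 <- lm(salary ~ experience + size + ma + phd + col, data = ceo)
fit1 <- update(fit0, . ~ . + experience:phd)   # adds the product term
summary(fit1)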

d) (4 marks) To check the contribution of size and experience to the model, a new model with terms e2 and s2, the squares of experience and size, was fit. Part of the R summary output is shown below. Is there any evidence that these quadratic terms are necessary?

Residual standard error: 1377 on 83 degrees of freedom
Multiple R-Squared: 0.8955, Adjusted R-squared: 0.8867
F-statistic: 101.6 on 7 and 83 DF, p-value: < 2.2e-16

The estimate of σ under the full model (including the squared terms) is 1377, so the residual sum of squares is 83 × 1377^2 = 157,378,707. The estimate of σ under the restricted model (without the squared terms) is 1402, so the residual sum of squares is 85 × 1402^2 = 167,076,340. The change in the residual sum of squares is 167,076,340 − 157,378,707 = 9,697,633 with 85 − 83 = 2 degrees of freedom, so the mean square is 9,697,633 / 2 = 4,848,816. To test the hypothesis that the coefficients of the squared terms are simultaneously 0, the discrepancy measure is 4,848,816 / 1377^2 = 2.557 and the p-value satisfies

0.05 < Pr(F_{2,83} ≥ 2.557) < 0.10.

There is weak evidence against the hypothesis that the coefficients of the squared terms are 0 and hence weak evidence that they need to be included in the model. Note that the full model has 83 degrees of freedom for estimating σ here.
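The p-value can be pinned down exactly in R:

F.obs <- 4848816 / 1377^2                          # discrepancy measure, about 2.557
pf(F.obs, df1 = 2, df2 = 83, lower.tail = FALSE)   # p-value, about 0.08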

e) (2 marks) A quantile-quantile (qq) plot of the standardized residuals is shown below. Explain how to calculate the coordinates of the point in the lower left corner of the plot.

We divide the G(0,1) distribution into 91 bins, each with probability 1/91. The x-coordinate is the centre of the first bin, q_1, where Pr(Z ≤ q_1) = 1/182. The y-coordinate is the smallest standardized residual in the set of 91.
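A sketch of the coordinate calculation in R (the y-coordinates assume the standardized residuals are available from a fit named fit):

n <- 91
qnorm(0.5 / n)                  # x-coordinate q_1: Pr(Z <= q_1) = 1/182, about -2.54
x <- qnorm(((1:n) - 0.5) / n)   # x-coordinates of all 91 plotted points
# y-coordinates: sort(rstandard(fit)), the ordered standardized residuals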

f) (1 mark) What does the qq plot tell us in this case?

Since the points fall close to a straight line, we can be confident that the assumption of gaussian residuals is reasonable.

g) (2 marks) How can we detect cases with an outlier in the explanatory variates?

For each case, we look at the leverage h_ii, the corresponding diagonal element of the hat matrix H = X (X^t X)^{-1} X^t. If a leverage is close to 1, or relatively large, then the corresponding values of the explanatory variates form an outlier and are possibly influential in the fit of the model.

h) (2 marks) A plot of the studentized residuals versus the case number is shown below. Assuming that the fit of the model is adequate, use the plot to provide a conclusion to the investigation.

The purpose of the investigation was to identify outliers in the response variate, the CEO salary, after accounting for the explanatory variates. Looking at the plot of the studentized residuals, we see no very large values (i.e. > 2.5), so it appears that the salaries are equitable.

Final Examination Spring 2004

Part I (35 marks)

In the first part of the course, we looked at applications and theory of the linear model Y = Xβ + R where X = (1 x_1 ... x_p) is an n × (p+1) matrix with columns giving the values of the explanatory variates for the n units in the sample and R is a vector of random variables with independent components R_i ~ G(0, σ). We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate.

1. (5 marks) From first principles, show that the least squares estimate of β is β̂ = (X^t X)^{-1} X^t y.

2. (4 marks) Show that the mean and variance-covariance matrix of the corresponding estimator β̃ are β and σ^2 (X^t X)^{-1} respectively.

3. (3 marks) The estimator corresponding to the vector of estimated residuals is r̃ = Y − Xβ̃. Find the distribution of the ith component r̃_i.

You work in the marketing division of a large corporation that owns pizza franchises. Your company is investigating a special promotion with the goal of increasing sales. There are three versions of the promotion plus a control in which there is no change to current practice. The company assigns each of the versions at random to 20 franchises and measures the average weekly sales (over a four week period). The average weekly sales before the promotion is also recorded for each of the 80 franchises in the sample. For each franchise, the data are coded as follows:

Name                                                          Symbol
average weekly sales (in $1000) during the promotion period   y
promotion 1                                                   x1 = 1 if promotion 1 is used, x1 = 0 otherwise
promotion 2                                                   x2 = 1 if promotion 2 is used, x2 = 0 otherwise
promotion 3                                                   x3 = 1 if promotion 3 is used, x3 = 0 otherwise
promotion 4 (control)                                         none
past average sales (in $1000)                                 x4

Consider the model (in vector notation)


Y = β_0 1 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + R,   R ~ N(0, σ^2 I)   (1)
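As an illustration (not part of the exam), a sketch of how the indicator coding and the fit of model (1) might be set up in R; the variable names promo and past are hypothetical:

# promo: numeric code 1-4 for the version assigned (4 = control); past: past average sales
x1 <- as.numeric(promo == 1)   # indicator for promotion 1
x2 <- as.numeric(promo == 2)   # indicator for promotion 2
x3 <- as.numeric(promo == 3)   # indicator for promotion 3; control has all three = 0
fit <- lm(y ~ x1 + x2 + x3 + past)   # past plays the role of x4
summary(fit)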

4. (2 marks) Carefully interpret the coefficient β_1.

5. (2 marks) To maintain symmetry, your boss, an engineer gone wrong, suggests adding an extra term β_c x_c to the model (1), where x_c = 1 for the franchises with the control and x_c = 0 otherwise. Is this a good idea? Explain.

You fit the model (1) using R with the following summary output.

lm(formula = y ~ x1 + x2 + x3 + x4)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.18868    0.39377  -3.019  0.00347
x1           0.18265    0.44514   0.410  0.68274
x2           2.07901    0.44554   4.666 1.31e-05
x3           1.30198    0.44505   2.925  0.00455
x4           0.90509    0.01399  64.709  < 2e-16

Residual standard error: 1.407 on 75 degrees of freedom
Multiple R-Squared: 0.9828, Adjusted R-squared: 0.9819
F-statistic: 1072 on 4 and 75 DF, p-value: < 2.2e-16

6. (1 mark) What does Multiple R-Squared: 0.9828 tell you?

7. (2 marks) What does F-statistic: 1072 on 4 and 75 DF, p-value: < 2.2e-16 tell you?

8. (3 marks) Find a 95% confidence interval for β_2. What does this interval tell you?

9. (3 marks) Explain in symbols and words (no numerical calculations needed) how you could formally assess if there was a difference between promotion 2 and promotion 1.

10. (4 marks) To test the hypothesis that there is no difference among the three promotions, you fit the model Y = β_0 1 + β(x_1 + x_2 + x_3) + β_4 x_4 + R, R ~ N(0, σ^2 I) with summary output (in part):

Residual standard error: 1.549 on 77 degrees of freedom
Multiple R-Squared: 0.9786, Adjusted R-squared: 0.9781
F-statistic: 1763 on 2 and 77 DF, p-value: < 2.2e-16

Is there any evidence of a difference among the three versions?

11. (4 marks) To assess the fit of the original model (1), the following plots were prepared. Briefly describe what each plot tells you about the fit.

Plot 1 (Estimated Residuals vs Fitted Values):
Plot 2 (Normal Q-Q Plot of standardized residuals):
Plot 3 (Leverage vs Case Number):
Plot 4 (Studentized Residual vs Case Number):

12. (2 marks) If the primary purpose of the study was to look for differences in the versions (as in Question 10), how would you proceed with the information from the above plots?

Part 2 (35 marks)

In the second part of the course, we deal with the theory, applications and some extensions of simple random sampling (SRS) to learn about population averages. The basic estimate of a population average μ is the sample average

μ̂ = Σ_{i∈s} y_i / n

where y_i is the value of the response variate for the ith unit in the sample s and n is the sample size. The corresponding estimator can be written

μ̃ = Σ_{i∈U} y_i I_i / n

where U is the population (frame) with size N and I_i = 1 if unit i is in the sample and I_i = 0 otherwise. We can show that

E(μ̃) = μ,   Var(μ̃) = (1 − n/N) σ^2(y)/n,   where σ^2(y) = Σ_{i∈U} (y_i − μ)^2 / (N − 1).
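These moment results are easy to check by simulation; a sketch in R with an artificial population:

set.seed(371)
N <- 1000; n <- 50
y <- rexp(N)                          # artificial population of N values
mu <- mean(y)
sigma2 <- sum((y - mu)^2) / (N - 1)   # population variance with divisor N - 1

est <- replicate(10000, mean(sample(y, n)))   # repeated SRS (without replacement)
c(mean(est), mu)                              # E(mu~) is close to mu
c(var(est), (1 - n/N) * sigma2 / n)           # Var(mu~) matches the formula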

1. (3 marks) A key step in proving that E(μ̃) = μ for SRS was to determine Pr(I_i = 1). Find this probability and explain your reasoning.

2. (4 marks) A key step in the derivation of Var(μ̃) for SRS was to determine the covariance of I_i and I_j. Find this covariance.

3. (3 marks) Suppose that y_i is a binary variate with values 0 and 1. Show that σ(y) is essentially determined by π, the population proportion of units with y = 1.

4. (5 marks) Suppose we have a population with frame U = U_1 ∪ ... ∪ U_H of size N, where the U_h, h = 1, ..., H are mutually exclusive strata of size N_h. We plan to sample n_h units from stratum h using SRS independently for each stratum. The cost per unit of sampling from stratum h is c_h, and the total sampling cost must be limited to C. How should we best allocate the sample to the various strata if the goal is to estimate the population average?

In the early 1980s, the Federal Government of Canada established a grant program to help homeowners re-insulate their homes to reduce energy consumption. Many homes used UFFI, a foam insulation that could be pumped into cavities as a liquid. The foam then solidified without reducing its volume. Unfortunately, some homeowners developed allergy symptoms that were attributed to formaldehyde (CH2O), a gas that could have been given off by UFFI. There was then pressure on the Government to help homeowners remove the UFFI, a very expensive proposition. To assess the magnitude of the problem, a survey was commissioned with the basic purposes to assess

the average level of CH2O in homes with UFFI
the proportion of individuals in these homes with allergy symptoms

and to compare these attributes to those in homes without UFFI. Since the Government had awarded grants, they had a frame of 124,345 homes in which UFFI had been installed and another frame of 230,981 homes that had been re-insulated without UFFI. They decided to select a simple random sample of 500 homes from each frame and then

measure the concentration of CH2O in the air in each home in the sample
administer a questionnaire to the homeowner to collect information about allergy symptoms and other demographics.

5. (2 marks) Explain how you could implement simple random sampling in this case.

6. (3 marks) An early press release about the allergy problems noted that the survey "will look at a random sample of people living in homes with UFFI". Is this statement technically correct? Explain.

7. (3 marks) A chemist involved in the planning of the survey noted that the frame of homes with UFFI was about half the size of that for homes without UFFI. He suggested that the survey would be improved if the sample sizes were proportional to the frame sizes. Is this correct? Explain.

A summary of part of the data collected is
Attribute                          UFFI homes   non-UFFI homes
CH2O average concentration (ppb)   57.8         47.6
CH2O standard deviation (ppb)      12.4         9.7

8. (3 marks) Find a 95% confidence interval for the population average CH2O concentration in the UFFI homes.

9. (3 marks) Find a 95% confidence interval for the difference in population average CH2O concentration between the UFFI and non-UFFI homes.

10. (2 marks) For the UFFI home sample, the 95% confidence interval for π, the proportion of homes in which one or more persons experienced allergy symptoms, was 0.08 ± 0.023. How large a sample would have been required to reduce the length of this interval by half?

11. (4 marks) Briefly discuss two different strategies that could have been used to increase the efficiency of the survey.

Strategy 1:
Strategy 2:
