Chapter 1: Introduction

The big picture for statistics


[Figure: Population → Take Sample → Sample → Inference]

Suppose we want to estimate the mean GPA of all UNL students. In order to do this formally, we would define the following quantities:

1) Population: All objects of interest. Here, all students at UNL.

2) Random variable: A characteristic of the population. Let Y be a random variable denoting GPA.

3) Parameter: Numerical summary measure of a population characteristic.

Let μ denote the population mean.

4) Probability distribution for the population: Provides a function to show how the values of the random variable are distributed. The PDF is f(y) and the CDF is F(y). Notation: We will often simply use F to denote the CDF, and this will be a general way to give the distribution.

5) Simple random sample: Suppose a sample of size n is taken from the population in a representative manner, and every object has an equal probability of being selected.

6) Sample: Subset of the population. Random variables Y1, Y2, …, Yn with observed values y1, y2, …, yn.

7) Statistic: Numerical summary measure used to describe a sample characteristic. Let T be the sample mean. Remember that T is a random variable and has an observed value of t. Note that T is an estimator of μ and t is an estimate of μ.

8) Based upon the statistic(s) in the sample, we will make inferences about the parameter in the population with a certain level of accuracy. Where does this level of accuracy come from? Sampling distributions of statistics.

9) What is the true sampling distribution? Suppose the population has a size of N and a sample of size n is taken from it. There are C(N, n) = N!/[n!(N − n)!] different samples, each with an associated probability of being obtained. The corresponding fixed number of values for T can then be found, along with their probabilities of being observed (a tiny enumeration sketch is given below). Suppose instead the population has an infinite size. There are then an infinite number of samples of size n, and the distribution of T cannot be truly characterized unless something is said about the population probability distribution.
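To make this concrete, here is a minimal sketch (the tiny population below is hypothetical, chosen only so every sample can be listed) of the exact sampling distribution of the sample mean T:

#Hypothetical tiny population of N = 5 GPAs; take all samples of size n = 2
pop<-c(2.0, 2.5, 3.0, 3.5, 4.0)
samples<-combn(x = pop, m = 2)    #All choose(5, 2) = 10 possible samples
t.all<-colMeans(samples)          #Value of T for each possible sample
table(t.all)/choose(5, 2)         #Exact sampling distribution of T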


Here are three ways to APPROXIMATE the sampling distribution of T.

a) Assume a probability distribution for the population

Let Y1, Y2, …, Yn be iid with distribution F. This will lead to the derivation of a distribution for the statistic T; call this G. Problem: How often can one be absolutely sure that Y1, Y2, …, Yn are iid with distribution F? Thus, this method usually ends up being an approximation.

b) Find the asymptotic distribution for the statistic

For example, √n(T − μ) →d X, where X has a particular distribution (→d denotes convergence in distribution, equivalent to →L). From page 235 of Casella and Berger (2002), Xn →d X if limn→∞ Fn(x) = F(x). If the Central Limit Theorem applies, √n(T − μ) →d X, where X ~ N(0, σ²). More simply, one may write this as √n(T − μ) →d N(0, σ²), using a little abuse of notation.

Problems: How often does one take a sample of size ∞? What happens at a fixed sample size of n? How do we obtain σ²?

There may be times when we want the distribution of some function of T, say h(T). We can use a delta-method approximation for h(T) based on a first-order Taylor series approximation. How good is this approximation? A small sketch of the idea follows.
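Here is a minimal sketch of the delta-method idea for h(T) = log(T); the population values match the simulated GPA example later in the chapter, while the seed and the simulation check are my additions:

#Delta method: Var[h(T)] is approximately [h'(mu)]^2 * Var[T]
mu<-2.505; sigma<-0.5                #Values from the GPA example
var.T<-sigma^2/20                    #Variance of the sample mean, n = 20
delta.var<-(1/mu)^2 * var.T          #h'(mu) = 1/mu when h = log
set.seed(77)                         #Hypothetical seed
t.sim<-replicate(n = 10000, expr = mean(rnorm(n = 20, mean = mu, sd = sigma)))
c(delta = delta.var, simulated = var(log(t.sim)))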

c) Use the bootstrap! To be discussed shortly.

Once you have the sampling distribution, what can you do with it? Estimate bias and variance; form confidence intervals and perform hypothesis tests.

A short introduction to the bootstrap

There are a number of introductory papers on the bootstrap. Here are a few in the statistics literature:

Boos, D. D. (2003). Introduction to the bootstrap world. Statistical Science 18, 168-174.

Johnson, R. W. (2001). An introduction to the bootstrap. Teaching Statistics 23, 49-54.

Duckworth, W. M. and Stephenson, W. R. (2003). Resampling methods: Not just for statisticians anymore. Proceedings of the American Statistical Association, Section on Teaching Statistics in the Health Sciences, 1280-1285.

In the previous discussion, notice how knowing F itself helps to find the sampling distribution of T. Suppose you knew F, but you did not want to formally derive a distribution for T. How could you use Monte Carlo simulation to approximate the distribution of T, call it G? A small sketch is given below. Generally, one does not know F exactly, but perhaps we could estimate it. Say the estimate of F is F̂. When one wants to specify the variable of interest y, we can use F(y) or F̂(y). Could we do the same type of Monte Carlo simulation as discussed above, with F̂ replacing F? YES!
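Here is a sketch of the known-F case; the normal population below matches the simulated GPA example later in the chapter, and the seed is a hypothetical choice:

#Monte Carlo approximation of G when F is known, here N(2.505, 0.5^2)
set.seed(101)                        #Hypothetical seed
t.sim<-replicate(n = 1000, expr = mean(rnorm(n = 20, mean = 2.505, sd = 0.5)))
hist(t.sim)                          #Approximates G, the distribution of T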


What should we use as F̂? Use the empirical distribution function (EDF):

F̂(y) = proportion of observations in the sample that fall less than or equal to y.

More specifically, for a sample of y1, …, yn,

F̂(y) = #{yj ≤ y}/n = (1/n) Σj=1..n I(yj ≤ y),

where I(·) is an indicator function with I(A) = 1 if A is true and I(A) = 0 if A is false. To use a term from a STAT 218-like class, we are finding a cumulative relative frequency distribution for our data here. Notice how a weight of 1/n is assigned to each observation.

Suppose a simple random sample obtains the following observed values: (y1, y2, y3, y4, …, y20) = (1.54, 1.80, 2.07, …, 3.27), where I happened to order these values.
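Here is a small sketch computing F̂ directly from this definition, using the 20 observed (rounded) GPAs from the example later in the chapter:

#EDF from its definition; y holds the sorted sample from the upcoming example
y<-c(1.537, 1.804, 2.068, 2.375, 2.426, 2.454, 2.486, 2.488, 2.499, 2.535,
     2.556, 2.592, 2.600, 2.647, 2.676, 2.934, 2.956, 3.127, 3.202, 3.266)
Fhat<-function(y0) mean(y <= y0)   #(1/n) * sum over j of I(y_j <= y0)
Fhat(1.537)                        #0.05, since exactly one observation is <= 1.537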

The plotted EDF is a step function:

[Figure: "EDF for College GPA" — step-function plot of F̂ against y]
Notice that F̂(1.54) = 0.05 and F̂⁻¹(0.05) = 1.54. The upcoming example will examine these data in more detail.

Is this F̂(y) a good estimator of F(y)?

√n(F̂(y) − F(y)) →d N[0, F(y)(1 − F(y))] for each fixed y


Note that the strong law of large numbers implies that F̂(y) → F(y) wp1 for each fixed y (wp1, with probability 1, is the same as almost sure convergence).

Glivenko-Cantelli Theorem: P(supy |F̂(y) − F(y)| → 0) = 1 (see p. 23 of Ferguson (1996)).

How do we sample from F̂ in order to obtain a Monte Carlo simulation estimate of the sampling distribution of T (again, this distribution is denoted by G)? Remember that each of y1, y2, …, yn has a weight of 1/n assigned to it. Using this, we can perform the sampling in the following way:

1) Generate samples of size n from F̂; call these y*r1, y*r2, …, y*rn for r = 1, …, R. Note that this simply involves sampling WITH REPLACEMENT from the observed y1, y2, …, yn. Each {y*r1, y*r2, …, y*rn} is called a resample. Also, be careful with the subscripts: the value of y*r2 is most likely not y2.

2) Calculate t*r for each resample.

3) Form the EDF of t*1, t*2, … to obtain Ĝ. This is the bootstrap estimate of G!


4) Use Ĝ to perform statistical inference!

What is the distribution of Ĝ? We need to know G so that one can make inferences about μ through using T. Notice that we obtain only one t from our sample. The bootstrap, though, allows us to estimate G through using Ĝ. A compact sketch of steps 1)-4) is given below; the example that follows works through them in detail.
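This sketch assumes y holds the observed sample (as in the upcoming example); the seed is a hypothetical choice:

#Steps 1)-3): resample with replacement, compute each t*, form the EDF of the t*'s
set.seed(27)                                #Hypothetical seed
t.star<-replicate(n = 1000, expr = mean(sample(x = y, replace = TRUE)))
plot.ecdf(t.star)                           #G-hat, the bootstrap estimate of G
quantile(t.star, probs = c(0.025, 0.975))   #Step 4): e.g., a percentile interval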
Example: GPA data set (bootgpa_example.R)

Suppose 20 UNL students were randomly selected from the population of all UNL students, and a college GPA was obtained from each student. You will notice some of these GPAs are quite low. I simply used a N(2.505, 0.25) distribution to simulate them. The purpose here is to perform statistical inference for the population mean college GPA.

> ##########################################################
> # Obtain data

> mu<-2.505
> sigma<-0.5
> set.seed(1025)
> y<-rnorm(n = 20, mean = mu, sd = sigma)
> max(y)
[1] 3.265626
> min(y)
[1] 1.536693

> ##########################################################
> # Initial examination of sample

> y<-round(y,3)
> t<-mean(y)
> cat("My sample is", sort(y), "\n which produces an observed statistic of", t, "\n")
My sample is 1.537 1.804 2.068 2.375 2.426 2.454 2.486 2.488 2.499 2.535 2.556 2.592 2.6 2.647 2.676 2.934 2.956 3.127 3.202 3.266
 which produces an observed statistic of 2.5614

> #EDF
> par(lend = "square") #Option for how the ends of the lines look - try the default of par(lend = "round") to see the difference
> plot.ecdf(x = y, verticals = TRUE, do.p = FALSE, main = "EDF for College GPA", lwd = 2, panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"), ylab = expression(hat(F)))

[Figure: "EDF for College GPA" — the step-function plot produced by the code above]

> ##########################################################
> # Resample

> set.seed(4518)
> y.star<-sample(x = y, replace = TRUE)
> t.star<-mean(y.star)
> cat("My resample is", sort(y.star), "\n which produces an observed statistic of", t.star, "\n")
My resample is 1.537 1.804 2.068 2.375 2.426 2.426 2.426 2.454 2.486 2.535 2.6 2.6 2.647 2.647 2.934 2.956 3.127
3.202 3.266 3.266
 which produces an observed statistic of 2.5891

> table(y)
y
1.537 1.804 2.068 2.375 2.426 2.454 2.486 2.488 2.499 2.535 2.556 2.592   2.6
    1     1     1     1     1     1     1     1     1     1     1     1     1
2.647 2.676 2.934 2.956 3.127 3.202 3.266
    1     1     1     1     1     1     1

> table(y.star)
y.star
1.537 1.804 2.068 2.375 2.426 2.454 2.486 2.535   2.6 2.647 2.934 2.956 3.127
    1     1     1     1     3     1     1     1     2     2     1     1     1
3.202 3.266
    1     2

There are a very large number of resamples (Chapter 2 gives the exact number). Instead of finding all of them, a large set of these resamples is usually found. This set is large enough to be very representative of all possible resamples.
> R<-1000
> set.seed(4518)
> save.resample<-matrix(data = NA, nrow = R, ncol = length(y))
> #There are more efficient ways to do this than a for loop
> for (i in 1:R) {
    y.star<-sample(x = y, replace = TRUE)
    save.resample[i,]<-y.star
  }

> #Example resamples
> table(save.resample)/R
save.resample
1.537 1.804 2.068 2.375 2.426 2.454 2.486 2.488 2.499 2.535 2.556 2.592   2.6
0.993 1.037 0.964 1.000 0.994 1.006 1.007 0.994 0.998 0.969 1.029 1.007 1.015
2.647 2.676 2.934 2.956 3.127 3.202 3.266
1.008 1.017 1.018 1.033 0.989 0.961 0.961

> t.star<-apply(X = save.resample, MARGIN = 1, FUN = mean)
> summary(t.star)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.262   2.498   2.560   2.560   2.626   2.814
> sd(t.star) #Estimated by bootstrap
[1] 0.0914199
> sd(y)/sqrt(length(y)) #Usual estimate
[1] 0.09584745
> sigma/sqrt(length(y)) #Actual
[1] 0.1118034

> #Compare to CLT approximation for T
> plot.ecdf(x = t.star, verticals = TRUE, do.p = FALSE, main = "Boot. estimate of G", lwd = 2, xlab = "t*", panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"), ylab = "EDF or CLT est. CDF")
> curve(expr = pnorm(x, mean = t, sd = sd(y)/sqrt(length(y))), col = "red", add = TRUE)
> legend(locator(1), legend = c("T* EDF", "CLT app. for T"), col = c("black", "red"), bty = "n", lwd = c(2,1), cex = 0.75)

[Figure: "Boot. estimate of G" — EDF of the t* values overlaid with CLT approximations for T (using both the estimated and the actual standard deviation)]
> t.star.quantiles<-quantile(x = t.star, probs = seq(from = 0.05, to = 0.95, by = 0.05))
> CLT.quantiles<-round(qnorm(p = seq(from = 0.05, to = 0.95, by = 0.05), mean = t, sd = sd(y)/sqrt(length(y))), 2)
> data.frame(t.star.quantiles, CLT.quantiles)
    t.star.quantiles CLT.quantiles
5%          2.399090          2.24
10%         2.442975          2.30
15%         2.467770          2.34
20%         2.482260          2.37
25%         2.498350          2.40
30%         2.510620          2.42
35%         2.522193          2.44
40%         2.536010          2.46
45%         2.546337          2.48
50%         2.559625          2.50
55%         2.575125          2.53
60%         2.585450          2.55
65%         2.596188          2.57
70%         2.609710          2.59
75%         2.625662          2.61
80%         2.640410          2.64
85%         2.655390          2.67
90%         2.675655          2.71
95%         2.704673          2.77

> hist(x = t.star, main = "Histogram for t*", freq = FALSE, ylim = c(0,3))
> curve(expr = dnorm(x, mean = t, sd = sd(y)/sqrt(length(y))), col = 2, add = TRUE)
[Figure: "Histogram for t*" — histogram of the t* values with the CLT normal density curve overlaid]


Compare the CLT approximation to the bootstrap approximation for the sampling distribution of T. Be careful with the scales presented in the plots when forming your conclusions. Notice the bootstrap estimate is similar to the usual N(x̄, s²/n) approximation we would use without the bootstrap. Given they are so close, why would one use the bootstrap here?


Quick history of the bootstrap

Bradley Efron from Stanford (http://www-stat.stanford.edu/people/faculty/efron/index.html) is usually credited with the idea, although others had proposed parts of it before. Some initial places where the bootstrap was discussed by Efron include:

1977 Rietz Lecture at the IMS meetings

1979 Annals of Statistics and SIAM Review papers

Efron missed JSM 2007 for an important reason:


Bradley Efron receives National Medal of Science from President Bush

In presenting the medal to Efron on July 27, President Bush cited Efron's "momentous" intellectual achievements, mentioning in particular Efron's "bootstrap re-sampling technique." (From the ASA website)

Where did the name come from? It is a reference to a book called The Adventures of Baron Munchausen by Rudolf Erich Raspe. There was a 1989 movie based on the book as well. Here are some links regarding the book:

http://authorama.com/book/adventures-of-baron-munchausen.html
http://homepage.ntlworld.com/forgottenfutures/munch/munch.htm

In the book, Baron Munchausen was at the bottom of a lake. In order to get out, he pulled himself up by his bootstraps. John Tukey thought that using the sample to generate more data is like the trick by the Baron. At first, there were skeptics regarding whether this was a valid statistical procedure.

Excerpt from Efron (Annals of Statistics, 1979):

[Excerpt image not reproduced here]
What is the current state of research in this area? Much of the main research on the bootstrap was done in the 1980s and 1990s. Most of the research on the bootstrap itself is now done in the more mathematical statistics journals, like the Annals of Statistics. The use of the bootstrap to solve statistical problems is still very active in the statistical literature, and now in the non-statistical literature!

What are we going to do in this course? We will NOT do every section of the Bootstrap Methods and their Application (BMA) textbook:
o Chapter 2: Basics of the bootstrap
o Chapter 3: The bootstrap for specific data structures or problems
o Chapter 4: Hypothesis testing
o Chapter 5: Confidence intervals
o Chapters 6-7: Bootstrap with regression models
o Chapters 8-10: We most likely will not be able to do these chapters
o Chapter 11: R/S-Plus coding

Using R

For an introduction to R, please see my R Introduction notes on the schedule web page. These are the notes that I go through at the start of my UNL STAT 875 course. Next are additional notes about R that are going to be very important to us in this bootstrap course.

Using R - Packages

R consists of a number of packages that are organized by particular statistical methods. A list of packages is available at http://cran.r-project.org/web/packages/. Some packages are automatically installed with R, while others need to be downloaded from the R website in order to be installed. One can download a package inside of R instead of having to use a web browser: select PACKAGES > INSTALL PACKAGES from the main menu bar of R to see a list of mirror locations that contain the packages. Once you choose a location (such as USA (IA), which is at Iowa State University), a list of packages will be made available and you can choose one to install.

We will be using the boot package a lot for this class. This package contains a number of functions used to implement the bootstrap and already comes with an initial R installation. In addition to its help page inside of R, the website http://cran.r-project.org/web/packages/boot/index.html gives information for it. I have found the R Help PDF file there to be useful! A. J. Canty wrote these functions originally for BMA and S-Plus. The functions in the package have been updated for R, but you may notice some small differences with their implementation in the Practicals and Chapter 11 of BMA.

Using R - Use the R help!

One way to access the main R help is through selecting HELP > HTML HELP on the R menu bar. A web browser will appear with a number of different options of where to go for help. I usually select PACKAGES and go to the help of a function within a particular package.

There are a number of R manuals online at http://cran.r-project.org/manuals.html. Some of these are also available through HELP > MANUALS.

The R listserv is an excellent source for help. Often, other people will have had the same problem that you are looking for help on. A search engine for the listserv is at http://finzi.psych.upenn.edu/search.html. This will allow you to easily search through the listserv posts and even the R documentation. I have found this to be VERY useful. In order to subscribe (and post) to the listserv, you can go to http://www.r-project.org/mail.html. This is an active listserv, so I recommend subscribing to it in digest mode. Please remember to use the search engine first before posting a message to the listserv! Another useful interface to the R listserv is http://news.gmane.org/gmane.comp.lang.r.general.

While not part of R help, there are many blogs that I have found to be very useful. For example, the Revolutions blog (http://blog.revolution-computing.com/) is one that I read.

Using R - Writing your own functions

In order to use many of the functions in the boot package, you will need to write a function to calculate a statistic. Here are some reminders about functions in R (a small demo function follows this list):

1) The last line of code in a function contains what is returned to the user.

2) In the help for a function, "..." in the function( ) line means additional options can be passed into it other than those listed.

3) Object names in the function( ) line with an "= ___" after them have default values that need to be specified only when the user wants to change them.

4) If you type the function name alone in the R console, you will get the code for the function back.
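Here is a small demo function (hypothetical, not from the boot package) illustrating these reminders:

#"scale" has a default value; "..." passes extra options on to round()
std.values<-function(x, scale = 1, ...) {
  z<-scale*(x - mean(x))/sd(x)
  round(z, ...)  #The last line is what the function returns to the user
}
std.values(x = c(1, 3, 5), digits = 2)  #"digits" is passed through "..." to round()
std.values  #Typing the name alone prints the function's code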
Example: GPA data set (gpa_example.R, gpa.xls)

The purpose of this example is for you to examine a function outside of the boot package that some of you may be familiar with. The function is lm(), and it is used to fit regression models. Continuing the code from the last GPA data set example, we can fit a simple linear regression model to the data, using high school GPA to predict college GPA.

> mod.fit<-lm(formula = College ~ HS, data = gpa)
> summary(mod.fit)

Call:
lm(formula = College ~ HS, data = gpa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.42294 -0.25711 -0.04094  0.27536  0.40334

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.70758    0.19941   3.548  0.00230 **
HS           0.69966    0.07319   9.559 1.78e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.297 on 18 degrees of freedom
Multiple R-Squared: 0.8354,     Adjusted R-squared: 0.8263
F-statistic: 91.38 on 1 and 18 DF,  p-value: 1.779e-08

> names(mod.fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> class(mod.fit)
[1] "lm"

Below is part of the help for the lm() function inside of the stats package.

[Screenshot of the lm() help page not reproduced here]

Next is the actual function (pasted from Tinn-R using RTF):


function (formula, data, subset, weights, na.action, method = "qr",model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...) { ret.x <- x ret.y <- y cl <- match.call() mf <- match.call(expand.dots = FALSE) m <- match(c("formula", "data", "subset", "weights", "na.action","offset"), names(mf), 0) mf <- mf[c(1, m)] mf$drop.unused.levels <- TRUE mf[[1]] <- as.name("model.frame") mf <- eval(mf, parent.frame()) if (method == "model.frame") return(mf) else if (method != "qr") warning(gettextf("method = '%s' is not supported. Using 'qr'", method), domain = NA) mt <- attr(mf, "terms") y <- model.response(mf, "numeric") w <- as.vector(model.weights(mf)) if (!is.null(w) && !is.numeric(w)) stop("'weights' must be a numeric vector") offset <- as.vector(model.offset(mf)) if (!is.null(offset)) { if (length(offset) == 1) offset <- rep(offset, NROW(y)) else if (length(offset) != NROW(y)) stop(gettextf("number of offsets is %d, should equal %d (number of observations)", length(offset), NROW(y)), domain = NA) } if (is.empty.model(mt)) { x <- NULL z <- list(coefficients = if (is.matrix(y)) matrix(, 0, 3) else numeric(0), residuals = y, fitted.values = 0 *y, weights = w, rank = 0, df.residual = if
(is.matrix(y)) nrow(y) else length(y)) if (!is.null(offset)) z$fitted.values <- offset

} else { x <- model.matrix(mt, mf, contrasts) z <- if (is.null(w)) lm.fit(x, y, offset = offset, singular.ok = singular.ok,...) else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,...) } class(z) <- c(if (is.matrix(y)) "mlm", "lm") z$na.action <- attr(mf, "na.action") z$offset <- offset z$contrasts <- attr(x, "contrasts") z$xlevels <- .getXlevels(mt, mf) z$call <- cl z$terms <- mt if (model) z$model <- mf if (ret.x) z$x <- x if (ret.y) z$y <- y if (!qr) z$qr <- NULL z }


Next is my own function that can be used to calculate a trimmed mean (pasted from Tinn-R using RTF).
#Function for the statistic
#  First element is the original data name.
#  Second element represents the indices of the data. For example, the
#    indices will be 1:20 for the observed data. The reason for this second
#    element will be discussed in Chapter 2.
#  Third element just shows how additional options can be passed into the
#    function. Note that trim = 0 is the default for the mean() function and
#    is not needed here.
calc.t<-function(data, i, trim = 0) {
  data2<-data[i]
  mean(data2, trim = trim)
}

Implementation:
> #Try it
> calc.t(data = y, i = 1:20, trim = 0)
[1] 2.505
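As a preview (the usage below is an assumption on my part; the details are deferred to Chapter 2), the boot() function in the boot package expects a statistic function with exactly this (data, indices) form:

> library(boot)
> set.seed(8912)  #Hypothetical seed
> boot.res<-boot(data = y, statistic = calc.t, R = 1000, trim = 0)
> boot.res  #Prints bootstrap estimates of bias and standard error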

Question: The lm() function code is complicated. How could one determine what every line of the code does? One suggestion is sketched below.
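A minimal sketch (my suggestion, not the notes' own answer) using base R's debugging tools:

> debug(lm)  #Step through lm() one line at a time in the browser
> mod.fit<-lm(formula = College ~ HS, data = gpa)
> undebug(lm)  #Turn the line-by-line stepping back off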
Using R - Class of an object

Every object has a class. For example, in the GPA example,

> class(mod.fit)
[1] "lm"

You will often see statements similar to

  class(object)<-"my.class.name"
  object
}

at the end of a function. An object called object in the function itself is returned to the user, and it has a class called my.class.name. The word my.class.name is just a name that I decided to use. Notice that the implementation of the lm() function resulted in a class name of lm for mod.fit. Again, it could have been another name, but the programmer chose to call it the same name as the function.

Why are classes important to know about? You will often hear of a function that has been made generic. For example, the plot() function is a generic function. Generic functions can be used multiple ways depending upon the class of the object. If one implements plot(mod.fit), R will first search for a function called plot.lm() to invoke with the object mod.fit that has class lm. The function plot.lm() and other functions like it are called method functions. If R cannot find a function called plot.lm(), it tries to implement plot.default() with mod.fit. Other commonly used examples of generic functions are summary(), residuals(), and predict().

For example, there is a summary.lm() function. Because R works in this way, it is often referred to as an object-oriented language. Depending on the class of the object, a function may result in a different type of outcome. This may be confusing the first time you are introduced to it, but this behavior is meant to make R easier to use: one can manipulate objects using a consistent and familiar suite of generic functions. For example, one can summarize most types of model fits (for linear models, generalized linear models, and so on) by simply implementing summary(mod.fit) for an object mod.fit resulting from a model-fitting function. A small sketch of writing your own class and method function is given below.

Why are classes important to know about in this course? When you are trying to find help on a function, don't look only at the help for plot(); instead, look at plot.lm() for help!
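Here is a minimal sketch (the class name gpa.fit is hypothetical) showing how a generic function finds a method function through an object's class:

#Constructor: returns a list with class "gpa.fit"
make.gpa.fit<-function(gpas) {
  object<-list(gpas = gpas, mean.gpa = mean(gpas))
  class(object)<-"gpa.fit"
  object
}
#Method function: summary() dispatches here for objects of class "gpa.fit"
summary.gpa.fit<-function(object, ...) {
  cat("Mean GPA of", length(object$gpas), "students:", object$mean.gpa, "\n")
}
fit<-make.gpa.fit(gpas = c(2.5, 3.1, 3.8))
summary(fit)  #R finds summary.gpa.fit() because class(fit) is "gpa.fit"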
