Handout01 Regression

STAT W3315, Fall 2014: Scatterplots and Regression
Example 1: Ozone
These data consist of maximum daily Ozone readings, and maximum daily Temperature, for
330 days in 1976 at a location in California not far from Los Angeles. (We are only interested
in the first two variables.)
Reading the data into R

The data are on CourseWorks in a file called ozone.txt. Download the file to your Data
directory and read it into R as follows (you will have to update the path to
reflect the location of your data folder).
> filename <- "~/Documents/Regression/Data/ozone.txt"
> ozone.data <- read.table(file=filename, header=T)
> names(ozone.data) # variable names
[1] "Ozone"       "Temperature" "InversionHt" "Pressure"
[6] "Height"      "Humidity"    "WindSpeed"   "Visibility"
There are actually eight measurements taken each day, but we are only interested in Ozone and
Temperature.
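To see what header=T does in read.table, here is a minimal sketch that does not need the course file. The miniature file below is made up for illustration (its values are not the ozone data), and the name toy is an arbitrary choice.

```r
# Made-up miniature file: two columns, three rows (NOT the course data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Ozone Temperature",
             "3 40",
             "5 45",
             "12 66"), tmp)

# header = TRUE tells read.table that the first line holds variable names.
toy <- read.table(file = tmp, header = TRUE)
names(toy)   # variable names taken from the header line
str(toy)     # both columns are read as integer
unlink(tmp)  # remove the temporary file
```

Writing TRUE in full rather than T is a little safer in your own scripts, since T is an ordinary variable that can be reassigned.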
Summary statistics
> summary(ozone.data[, 1:2])
     Ozone        Temperature
 Min.   : 1.00   Min.   :25.00
 1st Qu.: 5.00   1st Qu.:51.00
 Median :10.00   Median :62.00
 Mean   :11.78   Mean   :61.75
 3rd Qu.:17.00   3rd Qu.:72.00
 Max.   :38.00   Max.   :93.00
In this command ozone.data is the name of the data frame. Rows and columns of the data
are selected with square brackets so, for example, ozone.data[c(1,3,4), 1:2] would select
rows 1, 3, and 4 and columns 1 and 2 of ozone.data. The command above selects all the rows
of ozone.data and columns 1 and 2. The function summary is then applied to this subset of
the data.
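The square-bracket rules can be tried on a small data frame. This is only a sketch: the data frame toy and its values are invented for illustration and are not part of the ozone file.

```r
# Hypothetical toy data frame (values invented for illustration).
toy <- data.frame(Ozone = c(3, 5, 12, 9, 20),
                  Temperature = c(40, 45, 66, 58, 80),
                  Humidity = c(28, 37, 51, 69, 19))

toy[c(1, 3, 4), 1:2]              # rows 1, 3, 4 and columns 1, 2
toy[, 1:2]                        # empty row index: all rows, columns 1 and 2
toy[, c("Ozone", "Temperature")]  # the same columns selected by name
summary(toy[, 1:2])               # summary applied to the subset
```

Selecting columns by name is usually more robust than selecting by position, because it keeps working if the column order changes.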
Marginal distribution of ozone

If we think of the observed Ozone measurements as a random sample of independent draws
from a larger (theoretically infinite) population of daily ozone levels, the distribution of that
population is called the marginal distribution of Ozone. We can then estimate characteristics
of the population distribution by their corresponding sample values; for example, we estimate
the mean daily ozone level by the sample mean (about 12), which is somewhat larger than the
sample median of 10. The sample distribution seems to have a long right tail, since the distance
from median to third quartile is greater than the distance from the first quartile to the median;
also, the maximum value is quite a bit farther from the third quartile than the minimum is
from the first quartile. We can see this better from a histogram.
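The skewness argument is just arithmetic on the five-number summary, and it can be written out as follows. The values are the ones reported by summary for Ozone; the variable names are arbitrary.

```r
# Five-number summary of Ozone as reported by summary().
q <- c(min = 1, q1 = 5, median = 10, q3 = 17, max = 38)

lower_half <- q["median"] - q["q1"]   # 10 - 5  = 5
upper_half <- q["q3"] - q["median"]   # 17 - 10 = 7
lower_tail <- q["q1"] - q["min"]      # 5 - 1   = 4
upper_tail <- q["max"] - q["q3"]      # 38 - 17 = 21

# Both comparisons point to a longer right tail.
unname(upper_half > lower_half)
unname(upper_tail > lower_tail)
```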
> hist(ozone.data$Ozone, freq=F, right=F, xlab="Ozone",
+      main="Histogram of Ozone")
[Figure: Histogram of Ozone. x-axis: Ozone; y-axis: Density.]
I used the options freq=F to produce a density histogram (total area of all blocks equal to 1)
and right=F to make the histogram bins left-closed and right-open (thus the 5s are in the 5-10
bin, the 10s are in the 10-15 bin, etc.). We can smooth out the rough edges of a histogram
(producing a better estimate of the population distribution) with the density function.
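The effect of right=F is easiest to see on a few values that sit exactly on bin boundaries. The sketch below uses made-up values and hand-picked breaks (neither comes from the ozone data) and asks hist for the counts without drawing anything.

```r
# Toy values sitting exactly on bin boundaries (not the ozone data).
x <- c(5, 10, 10, 14, 15)
breaks <- seq(0, 20, by = 5)

# plot = FALSE returns the histogram object without drawing it.
h_left  <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)
h_right <- hist(x, breaks = breaks, right = TRUE,  plot = FALSE)

h_left$counts    # bins [0,5) [5,10) [10,15) [15,20]
h_right$counts   # bins [0,5] (5,10] (10,15] (15,20]

# On the density scale the block areas add up to 1.
sum(h_left$density * diff(breaks))
```

With right = FALSE the value 5 is counted in the 5-10 bin; with right = TRUE it is counted in the 0-5 bin.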
> plot(density(ozone.data$Ozone), main="Density of Ozone")
> rug(jitter(ozone.data$Ozone))
[Figure: Density of Ozone (kernel density estimate with rug). y-axis: Density; N = 330, Bandwidth = 2.261.]
The statement density(ozone.data$Ozone) computes the density estimate but does not plot
it. The outer plot function takes the output of density as its input and knows how to plot it.
The rug command adds tick marks for the individual observations to an existing plot. Since
the data have so many ties, rug is applied to jitter(ozone.data$Ozone), which adds random
noise to the points before plotting them.
Density estimates also have tuning parameters that we won't discuss in this course. One
unhappy characteristic of the density estimate is that it gives positive probability to impossible
values, namely Ozone readings less than zero.
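The leakage below zero can be measured roughly. The sketch below uses a synthetic positive, right-skewed sample as a stand-in for Ozone (an assumption made so the example runs without the course file) and approximates the area under the estimate to the left of zero.

```r
set.seed(1)  # reproducible synthetic data; NOT the ozone readings
# Hypothetical positive, right-skewed sample standing in for Ozone.
x <- 1 + round(rexp(330, rate = 1/10))

d <- density(x)          # Gaussian kernel, default bandwidth
range(d$x)               # the evaluation grid extends below zero

# Approximate the area under the estimate to the left of zero
# with a Riemann sum over the (equally spaced) grid.
dx <- diff(d$x)[1]
mass_below_zero <- sum(d$y[d$x < 0]) * dx
mass_below_zero          # positive, although every observation is >= 1
```

The same calculation applied to density(ozone.data$Ozone) would show how much probability the estimate places on negative readings for the real data.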
Let's construct a density estimate for the distribution of log(Ozone); we will use base 2
logarithms.
> ozone.data$log.Ozone <- log2(ozone.data$Ozone) # base-2 log, stored as a new column
> plot(density(ozone.data$log.Ozone), main="Base-2 log of Ozone")
> rug(jitter(ozone.data$log.Ozone))
[Figure: Base-2 log of Ozone (kernel density estimate with rug). y-axis: Density; N = 330, Bandwidth = 0.3047.]
Conditional distributions
Regression is the study of conditional distributions, or how the distribution of a response like
Ozone changes as the value of a predictor, such as Temperature, changes. Letting Y denote a
generic response and X a generic predictor, we seek to study the conditional distributions of
(Y |X = x) as x varies. The standard visualization tool is the 2D scatterplot, which for the
conditional distribution of Ozone given Temperature is given by
> plot(Ozone ~ Temperature, data=ozone.data)
> lines(lowess(ozone.data$Ozone ~ ozone.data$Temperature))
[Figure: scatterplot of Ozone (y-axis) versus Temperature (x-axis) with the lowess smooth overlaid.]
The smoother lowess fits a nonparametric curve via local averaging; we will use this tool
frequently in this course, and may study the algorithm in greater detail later on.
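To get a feel for what lowess returns, here is a small sketch on synthetic data. The data-generating formula is an arbitrary assumption chosen to give a curved mean; it is not a model for the ozone data.

```r
set.seed(2)  # synthetic data for illustration only
x <- runif(200, min = 25, max = 95)            # hypothetical predictor
y <- 0.005 * (x - 25)^2 + rnorm(200, sd = 3)   # curved mean plus noise

fit <- lowess(x, y)   # default smoother span f = 2/3
str(fit)              # a list with components x (sorted) and y (smoothed)

plot(x, y)
lines(fit)            # overlay the smooth on the scatterplot
```

The argument f controls the fraction of points used in each local fit; larger values give a smoother curve.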
What can we say about the conditional distribution of Ozone given Temperature? Here are
some general questions one might ask about the conditional distributions (Y |X = x).
1. Is E(Y |X = x) constant as x varies?
2. Does E(Y |X = x) change linearly as x increases?
3. Is var(Y |X = x) constant as x varies?
4. If not, how does the conditional variance depend on x?
5. How can we be sure our visual conclusions are not simply due to chance variation?
We can get a similar picture by looking at adjacent boxplots.
> plot(Ozone~cut(Temperature,8), data=ozone.data)
[Figure: adjacent boxplots of Ozone by cut(Temperature, 8). Interval labels shown on the x-axis: (24.9,33.4], (42,50.5], (59,67.5], (76,84.6].]
The cut function converted the continuous variable Temperature into a categorical variable
with 8 levels. According to the documentation for cut, which you can obtain by typing
help(cut), when the number of intervals is given as a single number the range of the data is
divided into that many pieces of equal length, with the outer limits moved slightly outward so
that the extreme values are included. The equal spacing between the boxplots would create a
misleading impression if the categories were not of equal width.
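The binning behind the boxplots also gives a crude numerical look at questions 1 through 4: within each bin, the sample mean and variance of the response stand in for E(Y|X = x) and var(Y|X = x). The sketch below uses synthetic stand-in data (the variables temp and ozone and the formula generating them are assumptions, not the course data).

```r
set.seed(3)  # synthetic stand-in data, not the ozone file
temp  <- round(runif(300, min = 25, max = 93))   # hypothetical predictor
ozone <- pmax(1, round(0.004 * (temp - 20)^2 + rnorm(300, sd = 3)))

bins <- cut(temp, 8)   # factor with 8 interval-valued levels
levels(bins)           # the interval labels
table(bins)            # number of observations per interval

# Sample analogues of E(Y | X in bin) and var(Y | X in bin).
tapply(ozone, bins, mean)
tapply(ozone, bins, var)
```

Applied to ozone.data, the same two tapply calls would give the bin-by-bin means and variances that the boxplots display graphically.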
Example 2: Haystacks
This example is from Applied Regression Including Computing and Graphics, by R. Dennis
Cook and Sanford Weisberg.
Farmers in the Great Plains during the 1920s sold hay by the stack, requiring
estimation of stack volume to ensure a fair price. Estimating the volume of a
haystack was not a trivial task and could require much give-and-take between the
buyer and the seller to reach a mutually agreeable price.
A study was conducted in Nebraska during 1927 and 1928 to see if a simple method
could be developed to estimate the volume of round haystacks. It was reasoned that
farmers could easily use a rope to characterize the size of a round haystack with two
measurements: The circumference around the base of a haystack and the over,
the distance from the ground on one side of a haystack to the ground on the other
side. The haystack study involved measuring the volume, circumference, and over
on 120 round haystacks.
The issue confronting the investigators in the haystack study was how to use the
data, as well as any available prior information, in the development of a simple
formula expressing the volume of a haystack as a function of its circumference and
over to a useful approximation.
The volume of a hemisphere, being half that of the corresponding sphere, can be written as a
function of the circumference C, as
$$\text{volume} = \frac{C^3}{12\pi^2} \qquad (1)$$
which suggests one possible way to estimate haystack volume: Ignore Over and use (1).
The data are available on CourseWorks and can be read into R as follows (you will have
to update the path to reflect the location to which you download the data).
> filename <- "~/Documents/Regression/Data/haystacks.txt"

> Data <- read.table(file=filename, header=T)
> names(Data) # lists variable names in data set
[1] "Vol"  "C"    "Over"
Does model (1) provide a good approximation to the volume of a haystack?
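Before comparing with the data, it may help to confirm the algebra behind (1). A hemisphere of radius r has volume (2/3) pi r^3 and base circumference C = 2 pi r; substituting r = C/(2 pi) gives C^3/(12 pi^2). The sketch below checks this numerically for a few arbitrary radii (the radii are hypothetical, not haystack measurements).

```r
r <- c(1, 5, 10)                    # hypothetical radii, arbitrary units
C <- 2 * pi * r                     # circumference of the base

vol_direct  <- (2 / 3) * pi * r^3   # hemisphere volume from the radius
vol_formula <- C^3 / (12 * pi^2)    # formula (1), from the circumference

all.equal(vol_direct, vol_formula)  # the two agree up to rounding error
```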
> Data$Ymodel <- (Data$C^3)/(12*pi*pi)
> pairs(Vol ~ Ymodel+C+Over, data=Data)
[Figure: scatterplot matrix of Vol, Ymodel, C, and Over.]
What do you think?
