Example 1: Ozone
These data consist of the maximum daily Ozone reading and the maximum daily Temperature for
330 days in 1976 at a location in California not far from Los Angeles. There are actually eight
measurements taken each day, but we are only interested in the first two variables, Ozone and
Temperature.
Summary statistics
> summary(ozone.data[, 1:2])
     Ozone        Temperature
 Min.   : 1.00   Min.   :25.00
 1st Qu.: 5.00   1st Qu.:51.00
 Median :10.00   Median :62.00
 Mean   :11.78   Mean   :61.75
 3rd Qu.:17.00   3rd Qu.:72.00
 Max.   :38.00   Max.   :93.00
In this command ozone.data is the name of the data frame. Rows and columns of the data
are selected with square brackets so, for example, ozone.data[c(1,3,4), 1:2] would select
rows 1, 3, and 4 and columns 1 and 2 of ozone.data. The command above selects all the rows
of ozone.data and columns 1 and 2. The function summary is then applied to this subset of
the data.
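The bracket indexing described above can be tried on a small made-up data frame. This is only a sketch: the data frame `toy` and its values are hypothetical stand-ins, not the ozone data.

```r
# Hypothetical toy data frame (made-up values, not the real ozone data)
toy <- data.frame(Ozone       = c(3, 5, 5, 6, 4),
                  Temperature = c(40, 45, 54, 35, 45),
                  Other       = c(10, 20, 30, 40, 50))

# Rows 1, 3, and 4; columns 1 and 2
sub <- toy[c(1, 3, 4), 1:2]
print(sub)

# An empty row index selects all rows; columns 1 and 2 only
all_rows <- toy[, 1:2]
print(summary(all_rows))
```

Leaving the row position empty, as in `toy[, 1:2]`, is what selects every row.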
[Figure: Histogram of Ozone; horizontal axis Ozone, vertical axis Density]
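The call that produced this histogram is not shown in the text. A minimal sketch of a density histogram with left-closed bins, assuming simulated integer readings in place of ozone.data$Ozone (the vector `x` and the break points are assumptions for illustration), might look like this:

```r
set.seed(1)
# Simulated stand-in for the Ozone readings (not the real data)
x <- sample(1:38, 330, replace = TRUE)

# freq = FALSE: bar heights are densities, so the total area is 1
# right = FALSE: bins are left-closed, right-open, e.g. [5, 10)
h <- hist(x, breaks = seq(0, 40, by = 5), freq = FALSE, right = FALSE,
          main = "Histogram of simulated readings", xlab = "x")

# Total area of the bars
sum(h$density * diff(h$breaks))

# The 5s are counted in the second bin, [5, 10)
h$counts[2] == sum(x >= 5 & x < 10)
```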
I used the options freq=F to produce a density histogram (total area of all blocks equal to 1)
and right=F to make the histogram bins left-closed and right-open (thus the 5s are in the 5-10
bin, the 10s are in the 10-15 bin, etc.). We can smooth out the rough edges of a histogram
(producing a better estimate of the population distribution) with the density function.
> plot(density(ozone.data$Ozone), main="Density of Ozone")
> rug(jitter(ozone.data$Ozone))
[Figure: Density of Ozone; vertical axis Density]
The statement density(ozone.data$Ozone) computes the density estimate but does not plot
it. The outer plot function takes the output from density as its input and knows how to plot it.
The rug command adds to an existing plot. Since the data have so many ties, rug is applied
to jitter(ozone.data$Ozone), which adds random noise to the points before plotting them.
Density estimates also have tuning parameters that we won't discuss in this course. One
unhappy characteristic of the density estimate is that it gives positive probability to impossible
values, namely Ozone readings less than zero.
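The last point can be checked numerically. The sketch below uses simulated, strictly positive values rather than the ozone data. The sample `x` and the Riemann-sum approximation of the area are assumptions made for illustration.

```r
set.seed(1)
# Simulated strictly positive sample (stand-in for Ozone readings)
x <- rexp(200, rate = 0.1)

# density() returns an object holding grid points ($x) and heights ($y)
d <- density(x)

# The evaluation grid extends below zero although every x is positive
range(d$x)

# Approximate the estimated probability below zero by a Riemann sum
step <- d$x[2] - d$x[1]
mass_below_zero <- sum(d$y[d$x < 0]) * step
mass_below_zero
```

The object `d` is what plot receives when we write plot(density(...)).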
Let's construct a density estimate for the distribution of log(Ozone); we will use base 2
logarithms.
> plot(density(ozone.data$log.Ozone), main="Base-2 log of Ozone")
> rug(jitter(ozone.data$log.Ozone))
[Figure: Base-2 log of Ozone; vertical axis Density]
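The commands above use a column log.Ozone whose construction is not shown. One plausible way to create such a column, sketched here on a made-up data frame (`ozone.sketch` and its values are hypothetical), is with log2:

```r
# Hypothetical stand-in data frame (made-up values)
ozone.sketch <- data.frame(Ozone = c(1, 2, 4, 8, 16, 38))

# Base-2 logarithm stored as a new column
ozone.sketch$log.Ozone <- log2(ozone.sketch$Ozone)
print(ozone.sketch)

# Same plotting pattern as in the text, applied to the new column
plot(density(ozone.sketch$log.Ozone), main = "Base-2 log of Ozone")
rug(jitter(ozone.sketch$log.Ozone))
```

On the base-2 scale, a one-unit increase corresponds to a doubling of Ozone.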
Conditional distributions
Regression is the study of conditional distributions, or how the distribution of a response like
Ozone changes as the value of a predictor, such as Temperature, changes. Letting Y denote a
generic response and X a generic predictor, we seek to study the conditional distributions of
(Y |X = x) as x varies. The standard visualization tool is the 2D scatterplot, which for the
conditional distribution of Ozone given Temperature is given by
[Figure: scatterplot of Ozone (vertical axis) against Temperature (horizontal axis)]
The smoother lowess fits a nonparametric curve via local averaging; we will use this tool
frequently in this course, and may study the algorithm in greater detail later on.
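The plotting commands for the scatterplot are not shown in the text. A minimal sketch of a scatterplot with a lowess curve added, assuming simulated data in place of ozone.data (the data frame `sim` and the way it is generated are assumptions for illustration), could be:

```r
set.seed(1)
# Simulated stand-in data (not the real ozone data)
Temperature <- round(runif(330, 25, 93))
Ozone <- pmax(1, round(exp(0.03 * Temperature + rnorm(330, sd = 0.4))))
sim <- data.frame(Ozone = Ozone, Temperature = Temperature)

# Scatterplot of the response against the predictor
plot(Ozone ~ Temperature, data = sim)

# lowess() returns a list with components x (sorted) and y (smoothed values)
fit <- lowess(sim$Temperature, sim$Ozone)
lines(fit, lwd = 2)
```

The smoothed values in `fit$y` serve as a rough estimate of E(Y | X = x) along the range of x.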
What can we say about the conditional distribution of Ozone given Temperature? Here are
some general questions one might ask about the conditional distributions (Y |X = x).
1. Is E(Y |X = x) constant as x varies?
2. Does E(Y |X = x) change linearly as x increases?
3. Is var(Y |X = x) constant as x varies?
4. If not, how does the conditional variance depend on x?
5. How can we be sure our visual conclusions are not simply due to chance variation?
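Questions 1 through 4 can be explored crudely by grouping x into intervals and computing the mean and variance of Y within each group. The sketch below does this on simulated data; the data frame `sim` is an assumption for illustration, not the ozone data, and the binning is only a rough device.

```r
set.seed(1)
# Simulated stand-in data (not the real ozone data)
Temperature <- round(runif(330, 25, 93))
Ozone <- pmax(1, round(exp(0.03 * Temperature + rnorm(330, sd = 0.4))))
sim <- data.frame(Ozone = Ozone, Temperature = Temperature)

# Group the predictor into 8 intervals
bins <- cut(sim$Temperature, 8)

# Mean and variance of the response within each interval
cond_mean <- tapply(sim$Ozone, bins, mean)
cond_var  <- tapply(sim$Ozone, bins, var)
print(round(cbind(mean = cond_mean, var = cond_var), 2))
```

If the means in successive intervals drift, E(Y | X = x) is not constant; if the variances drift as well, var(Y | X = x) is not constant either.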
We can get a similar picture by looking at adjacent boxplots.
> plot(Ozone~cut(Temperature,8), data=ozone.data)
[Figure: boxplots of Ozone for each of the 8 levels of cut(Temperature, 8); visible level labels include (24.9,33.4], (42,50.5], (59,67.5], (76,84.6]]
The cut function converted the continuous variable Temperature into a categorical variable
with 8 levels. When the number of intervals is given as a single number, as here, cut divides
the range of the data into that many intervals of essentially equal width (see help(cut)). The
interval labels on the plot are each about 8.5 units wide, so the equal spacing between the
boxplots does not create a misleading impression.
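One way to check whether cut produces categories of equal width is to read the interval endpoints back from the level labels. The sketch below uses simulated temperatures spanning 25 to 93; the vector `x` is an assumption for illustration, and the endpoints are parsed from labels that cut rounds to 3 significant digits, so widths are only approximate.

```r
set.seed(1)
# Simulated stand-in for Temperature, with the observed extremes 25 and 93
x <- c(25, 93, runif(328, 25, 93))

f <- cut(x, 8)
lev <- levels(f)
print(lev)

# Parse lower and upper endpoints from labels such as "(24.9,33.4]"
lower <- as.numeric(sub("^\\((.*),.*$", "\\1", lev))
upper <- as.numeric(sub("^.*,(.*)\\]$", "\\1", lev))
widths <- upper - lower
print(widths)
```

The widths differ only by the rounding in the labels, which supports reading the adjacent boxplots as equally spaced on the Temperature scale.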
Example 2: Haystacks
This example is from Applied Regression Including Computing and Graphics, by R. Dennis
Cook and Sanford Weisberg.
Farmers in the Great Plains during the 1920s sold hay by the stack, requiring
estimation of stack volume to ensure a fair price. Estimating the volume of a
haystack was not a trivial task and could require much give-and-take between the
buyer and the seller to reach a mutually agreeable price.
A study was conducted in Nebraska during 1927 and 1928 to see if a simple method
could be developed to estimate the volume of round haystacks. It was reasoned that
farmers could easily use a rope to characterize the size of a round haystack with two
measurements: The circumference around the base of a haystack and the over,
the distance from the ground on one side of a haystack to the ground on the other
side. The haystack study involved measuring the volume, circumference, and over
on 120 round haystacks.
The issue confronting the investigators in the haystack study was how to use the
data, as well as any available prior information, in the development of a simple
formula expressing the volume of a haystack as a function of its circumference and
over to a useful approximation.
The volume of a hemisphere, being half that of the corresponding sphere, can be written as a
function of the circumference C, as

$$\text{volume} = \frac{C^3}{12\pi^2} \qquad (1)$$

which suggests one possible way to estimate haystack volume: Ignore Over and use (1).
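Formula (1) follows from writing the radius as r = C/(2π) and substituting into the hemisphere volume (2/3)πr^3, which gives (2/3)π C^3/(8π^3) = C^3/(12π^2). A small sketch of (1) as an R function (the name `hemisphere_volume` is hypothetical, chosen here for illustration):

```r
# Volume of a hemisphere as a function of its base circumference C,
# following formula (1): C^3 / (12 * pi^2)
hemisphere_volume <- function(C) {
  C^3 / (12 * pi^2)
}

# Check against the radius form: C = 2 * pi means r = 1,
# so the volume should equal (2/3) * pi
hemisphere_volume(2 * pi)
(2 / 3) * pi
```

The function is vectorized, so it could be applied to a whole column of circumferences at once.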
The data are available on CourseWorks and can be read into R as follows (you will have
to update the path to reflect the location to which you download the data).
[Figure: scatterplot matrix; panel labels include Vol, Ymodel, and Over]