STAT W3315, Fall 2014: Scatterplots and Regression

Example 1: Ozone
These data consist of maximum daily Ozone readings, and maximum daily Temperature, for
330 days in 1976 at a location in California not far from Los Angeles. (We are only interested
in the first two variables.)

Reading the data into R


The data are on CourseWorks in a file called ozone.txt. Download the file to your Data
directory and read it into R as follows (you will have to update the path to
reflect the location of your data folder).
> filename <- "~/Documents/Regression/Data/ozone.txt"
> ozone.data <- read.table(file=filename, header=T)
> names(ozone.data) # variable names
[1] "Ozone"       "Temperature" "InversionHt" "Pressure"    "Visibility"
[6] "Height"      "Humidity"    "WindSpeed"

There are actually eight measurements taken each day, but we are only interested in Ozone and
Temperature.
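
Since read.table needs a file on disk, here is a small self-contained sketch of the same reading step that can be run without ozone.txt. The file contents and the name toy are made up for illustration; only read.table and its header argument come from the commands above.

```r
# Write a tiny whitespace-delimited file with a header row (made-up values).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Ozone Temperature",
             "3 40",
             "5 45",
             "12 70"), tmp)

# header = TRUE tells read.table that the first line holds the column names.
toy <- read.table(file = tmp, header = TRUE)

names(toy)   # column names taken from the header line
dim(toy)     # number of rows and columns
```

Spelling out TRUE rather than T is slightly safer, because T is an ordinary variable that can be reassigned.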

Summary statistics
> summary(ozone.data[, 1:2])
     Ozone        Temperature
 Min.   : 1.00   Min.   :25.00
 1st Qu.: 5.00   1st Qu.:51.00
 Median :10.00   Median :62.00
 Mean   :11.78   Mean   :61.75
 3rd Qu.:17.00   3rd Qu.:72.00
 Max.   :38.00   Max.   :93.00
In this command ozone.data is the name of the data frame. Rows and columns of the data
are selected with square brackets so, for example, ozone.data[c(1,3,4), 1:2] would select
rows 1, 3, and 4 and columns 1 and 2 of ozone.data. The command above selects all the rows
of ozone.data and columns 1 and 2. The function summary is then applied to this subset of
the data.
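
The bracket notation can be tried on a small made-up data frame. This is only a sketch: toy and its values are invented and stand in for ozone.data.

```r
# A small made-up data frame standing in for ozone.data.
toy <- data.frame(Ozone       = c(3, 5, 12, 20),
                  Temperature = c(40, 45, 70, 85),
                  Humidity    = c(20, 30, 60, 70))

toy[c(1, 3, 4), 1:2]               # rows 1, 3, 4 and columns 1, 2
toy[, 1:2]                         # empty row index: all rows, columns 1 and 2
toy[, c("Ozone", "Temperature")]   # the same columns selected by name

summary(toy[, 1:2])                # summary applied to the two-column subset
```

Selecting columns by name is often more robust than selecting by position, since it does not depend on the order of the columns in the file.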

Marginal distribution of ozone


If we think of the observed Ozone measurements as a random sample of independent draws
from a larger (theoretically infinite) population of daily ozone levels, the distribution of that
population is called the marginal distribution of Ozone. We can then estimate characteristics
of the population distribution by their corresponding sample values; for example, we estimate
the mean daily ozone level by the sample mean (about 12), which is somewhat larger than the
sample median of 10. The sample distribution seems to have a long right tail, since the distance
from the median to the third quartile (17 - 10 = 7) is greater than the distance from the first
quartile to the median (10 - 5 = 5); also, the maximum is much farther from the third quartile
(38 - 17 = 21) than the minimum is from the first quartile (5 - 1 = 4). We can see this better
from a histogram.
> hist(ozone.data$Ozone, freq=F, right=F, xlab="Ozone",
+      main="Histogram of Ozone")

[Figure: Histogram of Ozone. x-axis: Ozone (ticks 10 to 40); y-axis: Density (0.00 to 0.06).]

I used the options freq=F to produce a density histogram (total area of all blocks equal to 1)
and right=F to make the histogram bins left-closed and right-open (thus the 5s are in the 5–10
bin, the 10s are in the 10–15 bin, etc.). We can smooth out the rough edges of a histogram
(producing a better estimate of the population distribution) with the density function.
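
The effect of the two histogram options can be checked on a few made-up values. This is a sketch only: the vector x and the break points are invented, and plot = FALSE is used so that hist returns its counts without drawing anything.

```r
x <- c(5, 10, 10, 14)          # made-up values, not the ozone data
breaks <- c(0, 5, 10, 15)

# plot = FALSE returns the histogram object without drawing it.
h_left  <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)
h_right <- hist(x, breaks = breaks, right = TRUE,  plot = FALSE)

h_left$counts    # bins [0,5), [5,10), [10,15]: the 5 falls in the 5-10 bin
h_right$counts   # bins [0,5], (5,10], (10,15]: the 5 falls in the 0-5 bin

# On the density scale the bar areas (height times width) sum to 1.
sum(h_left$density * diff(h_left$breaks))
```

Note that the outermost bin is closed on both sides by default, which is why 15 itself would still be counted when right = FALSE.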
> plot(density(ozone.data$Ozone), main="Density of Ozone")
> rug(jitter(ozone.data$Ozone))

[Figure: Density of Ozone. Kernel density estimate with rug; x-axis ticks 10 to 40; y-axis: Density (0.00 to 0.06). N = 330, Bandwidth = 2.261.]

The statement density(ozone.data$Ozone) computes the density estimate but does not plot
it. The outer plot function takes the output of density as its input and knows how to draw it.
The rug command adds tick marks for the individual observations to an existing plot. Since the
data have so many ties, rug is applied to jitter(ozone.data$Ozone), which adds a small amount
of random noise to the values before plotting them.
Density estimates also have tuning parameters that we won't discuss in this course. One
unhappy characteristic of the density estimate is that it gives positive probability to impossible
values, namely Ozone readings less than zero.
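
To see what density returns, and how much probability ends up below zero, here is a rough sketch on made-up positive values (not the ozone data). The name approx_mass_below_zero is invented for illustration, and the mass is approximated by a simple Riemann sum over the evaluation grid.

```r
x <- c(1, 2, 3, 5, 5, 8, 10, 10, 17, 38)   # made-up positive values

d <- density(x)      # computes the estimate; nothing is drawn yet
class(d)             # "density", which is why plot(d) knows what to do
d$bw                 # the bandwidth chosen by default

# By default the evaluation grid extends beyond the range of the data,
# so the estimate is positive at some negative values.
min(d$x)
approx_mass_below_zero <- sum(d$y[d$x < 0]) * diff(d$x[1:2])
approx_mass_below_zero
```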
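Let's construct a density estimate for the distribution of log(Ozone); we will use base 2
logarithms. The variable log.Ozone is not among the columns read from the file, so we create
it first.
> ozone.data$log.Ozone <- log2(ozone.data$Ozone)
> plot(density(ozone.data$log.Ozone), main="Base-2 log of Ozone")
> rug(jitter(ozone.data$log.Ozone))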

[Figure: Base-2 log of Ozone. Kernel density estimate with rug; y-axis: Density (0.00 to 0.30). N = 330, Bandwidth = 0.3047.]

Conditional distributions
Regression is the study of conditional distributions, or how the distribution of a response like
Ozone changes as the value of a predictor, such as Temperature, changes. Letting Y denote a
generic response and X a generic predictor, we seek to study the conditional distributions of
(Y |X = x) as x varies. The standard visualization tool is the 2D scatterplot, which for the
conditional distribution of Ozone given Temperature is given by

> plot(Ozone ~ Temperature, data=ozone.data)
> lines(lowess(ozone.data$Ozone ~ ozone.data$Temperature))

[Figure: scatterplot of Ozone (y-axis, 0 to 30) against Temperature (x-axis, 30 to 90) with the lowess curve added.]

The smoother lowess fits a nonparametric curve via local averaging; we will use this tool
frequently in this course, and may study the algorithm in greater detail later on.
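
A small sketch of what lowess returns, using simulated data rather than the ozone data. The curve shape, noise level and variable names are assumptions made for illustration; the argument f is the smoother span of lowess, with default 2/3.

```r
set.seed(1)                         # simulated data, not the ozone data
x <- runif(200, 30, 90)
y <- 0.005 * (x - 30)^2 + rnorm(200, sd = 2)

# lowess(x, y) returns a list with components x (sorted) and y (fitted).
fit_default <- lowess(x, y)             # smoother span f = 2/3 (default)
fit_wiggly  <- lowess(x, y, f = 0.2)    # smaller span: more local, rougher

plot(x, y)
lines(fit_default)                  # lines() accepts the list directly
lines(fit_wiggly, lty = 2)
```

A smaller span averages over fewer neighbouring points, so the curve follows the data more closely at the cost of more wiggle.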
What can we say about the conditional distribution of Ozone given Temperature? Here are
some general questions one might ask about the conditional distributions (Y |X = x).
1. Is E(Y |X = x) constant as x varies?
2. Does E(Y |X = x) change linearly as x increases?
3. Is var(Y |X = x) constant as x varies?
4. If not, how does the conditional variance depend on x?
5. How can we be sure our visual conclusions are not simply due to chance variation?
We can get a similar picture by looking at adjacent boxplots.
> plot(Ozone~cut(Temperature,8), data=ozone.data)
[Figure: side-by-side boxplots of Ozone (y-axis, 0 to 30) for each level of cut(Temperature, 8); labelled levels include (24.9,33.4], (42,50.5], (59,67.5], (76,84.6].]

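The cut function converted the continuous variable Temperature into a categorical variable
with 8 levels. According to the documentation for cut, which you can obtain by typing help(cut),
when breaks is given as a single number the range of the data is divided into that many pieces
of equal length, with the outer limits moved slightly outward so that the extreme values are
included. The categories are therefore of (essentially) equal width, consistent with the interval
labels in the plot, so the equal spacing between the boxplots does not create a misleading
impression.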
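
Questions 1 to 4 above can be given a rough numerical form by computing the sample mean and variance of the response within each temperature interval. The sketch below uses simulated data (not the ozone data), built so that both the mean and the spread grow with temperature; all names and constants are assumptions for illustration.

```r
set.seed(2)                               # simulated data, not the ozone data
temp  <- runif(330, 25, 93)
ozone <- pmax(1, 0.3 * (temp - 25) + rnorm(330, sd = 0.08 * temp))

bins <- cut(temp, 8)                      # 8 intervals covering the range of temp
table(bins)                               # number of observations per interval

# Sample versions of E(Y | X in bin) and var(Y | X in bin).
bin_means <- tapply(ozone, bins, mean)
bin_vars  <- tapply(ozone, bins, var)
round(cbind(mean = bin_means, variance = bin_vars), 2)
```

Means that change from bin to bin suggest that E(Y | X = x) is not constant, and variances that change suggest that var(Y | X = x) is not constant; whether such differences exceed chance variation is question 5.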

Example 2: Haystacks
This example is from Applied Regression Including Computing and Graphics, by R. Dennis
Cook and Sanford Weisberg.
Farmers in the Great Plains during the 1920s sold hay by the stack, requiring
estimation of stack volume to ensure a fair price. Estimating the volume of a
haystack was not a trivial task and could require much give-and-take between the
buyer and the seller to reach a mutually agreeable price.
A study was conducted in Nebraska during 1927 and 1928 to see if a simple method
could be developed to estimate the volume of round haystacks. It was reasoned that
farmers could easily use a rope to characterize the size of a round haystack with two
measurements: The circumference around the base of a haystack and the over,
the distance from the ground on one side of a haystack to the ground on the other
side. The haystack study involved measuring the volume, circumference, and over
on 120 round haystacks.
The issue confronting the investigators in the haystack study was how to use the
data, as well as any available prior information, in the development of a simple
formula expressing the volume of a haystack as a function of its circumference and
over to a useful approximation.

The volume of a hemisphere of radius r is (2/3)πr^3, half that of the corresponding sphere.
Since the circumference is C = 2πr, the volume can be written as a function of C as

$$\text{volume} = \frac{2}{3}\pi\left(\frac{C}{2\pi}\right)^3 = \frac{C^3}{12\pi^2} \qquad (1)$$

which suggests one possible way to estimate haystack volume: ignore Over and use (1).
The data are available on CourseWorks and can be read into R as follows (you will have
to update the path to reflect the location to which you download the data).

> filename <- "~/Documents/Regression/Data/haystacks.txt"
> Data <- read.table(file=filename, header=T)
> names(Data) # lists variable names in data set
[1] "Vol"  "C"    "Over"
Does model (1) provide a good approximation to the volume of a haystack?
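
Before judging the fit, formula (1) itself can be checked numerically against the usual radius-based expression for the volume of a hemisphere. The helper name hemisphere_volume and the radius of 10 are hypothetical, chosen only for this sketch.

```r
# Hypothetical helper, named here for illustration only.
hemisphere_volume <- function(C) C^3 / (12 * pi^2)

r <- 10                         # an arbitrary radius
C <- 2 * pi * r                 # corresponding circumference
hemisphere_volume(C)            # volume from formula (1)
(2 / 3) * pi * r^3              # half the volume of a sphere of radius r

all.equal(hemisphere_volume(C), (2 / 3) * pi * r^3)
```

The command below applies the same expression to the observed circumferences.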
> Data$Ymodel <- (Data$C^3)/(12*pi*pi)
> pairs(Vol ~ Ymodel+C+Over, data=Data)

[Figure: scatterplot matrix of Vol, Ymodel, C, and Over.]

What do you think?