
HW 6

1. Firesale Revisited
a)
realestate=read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Real.Estate.txt")
with(realestate,plot(pch=19,jitter(Price,5)~jitter(Living.Area,5)))
lm.realestate=lm(Price~Living.Area,data=realestate)
abline(lm.realestate)

b)
with(realestate,lm(Price~Living.Area))
##
## Call:
## lm(formula = Price ~ Living.Area)
##
## Coefficients:
## (Intercept)  Living.Area
##       13439          113

Each additional square foot of living area adds about $113.1 to the predicted price.
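As a quick sanity check on that number, the slope can be pulled straight out of the fitted model (a minimal sketch, assuming lm.realestate from above is still in the workspace):

slope=coef(lm.realestate)["Living.Area"]   # estimated dollars per additional square foot
100*slope                                  # predicted price difference for 100 extra square feet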


c)
library(mosaic)
## Loading required package: car
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
##
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'mosaic'
##
## The following objects are masked from 'package:dplyr':
##
##     do, tally
##
## The following object is masked from 'package:car':
##
##     logit
##
## The following objects are masked from 'package:stats':
##
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test, sd,
##     t.test, var
##
## The following objects are masked from 'package:base':
##
##     max, mean, min, prod, range, sample, sum

models=do(1000)*lm(Price~Living.Area,data=resample(realestate))
## Loading required package: parallel
quantile(models$Living.Area,c(0.025,0.975))
## 2.5% 97.5%
## 106.2 119.9

The bootstrap 95% confidence interval for the slope runs from about 106.2 to 119.9 dollars per square foot.
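To see where that interval comes from, the bootstrap distribution of the slope can be plotted directly (a sketch in base R, assuming the models data frame from the chunk above):

hist(models$Living.Area,main="Bootstrap distribution of the slope",
     xlab="Dollars per square foot")
abline(v=quantile(models$Living.Area,c(0.025,0.975)),lty=2)   # 95% cutoffs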


d)
confint(lm.realestate)
##                2.5 %  97.5 %
## (Intercept)  3647.7 23231.1
## Living.Area   107.9   118.4

The theoretical 95% confidence interval is slightly narrower than the bootstrap interval I found in part c).
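To make the comparison concrete, the widths of the two intervals can be computed side by side (a small sketch; the bootstrap width will vary slightly from run to run):

diff(confint(lm.realestate)["Living.Area",])        # width of the theoretical interval
diff(quantile(models$Living.Area,c(0.025,0.975)))   # width of the bootstrap interval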
e,f)
plot(lm.realestate$residuals)

scatter.smooth(residuals(lm.realestate)~predict(lm.realestate))

No: the residuals show no clear pattern against the fitted values, and most of them cluster
fairly close to 0.
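A normal Q-Q plot of the residuals is one more check in the same spirit (a sketch, assuming lm.realestate from above):

qqnorm(residuals(lm.realestate))   # residuals should fall roughly on a straight line
qqline(residuals(lm.realestate))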
2. Your Own Regression
cancer=read.delim("http://www.statsci.org/data/general/hanford.txt")
with(cancer,hist(Exposure))

with(cancer,hist(Mortality))

with(cancer,plot(Exposure,Mortality))

a) This dataset examines the relationship between an index of exposure to the Hanford
nuclear reactor and the cancer mortality rate. Neither variable shows an obvious pattern in
its histogram, and the scatterplot suggests a positive, roughly linear association between
exposure and mortality.
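The strength of that linear association can be quantified with the correlation coefficient (a quick sketch on the same data):

with(cancer,cor(Exposure,Mortality))   # correlation between exposure index and mortality rate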
b)
with(cancer,plot(Exposure,Mortality))
lm.cancer=lm(Exposure~Mortality,data=cancer)
abline(lm.cancer)

with(cancer,lm(Exposure~Mortality))
##
## Call:
## lm(formula = Exposure ~ Mortality)
##
## Coefficients:
## (Intercept)    Mortality
##     -10.008        0.093

c)
library(mosaic)
models=do(10000)*lm(Exposure~Mortality,data=resample(cancer))
quantile(models$Mortality,c(0.025,0.975))
##    2.5%   97.5%
## 0.05479 0.12574

Although the slope is quite small (0.093), the 95% bootstrap confidence interval of roughly
0.055 to 0.126 does not include zero, so the positive association between exposure and
mortality is statistically detectable. Putting the numbers into the context of mortality rates,
I would still describe the relationship as fairly strong: counties with a higher exposure
index also have noticeably higher cancer mortality.
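The theoretical interval tells the same story (a sketch, assuming lm.cancer from part b): an interval for the slope that excludes zero points to a real association.

confint(lm.cancer)   # theoretical 95% confidence intervals for the coefficients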
d) The sample is quite small, so it is not sufficient to conclude that there is a strong
correlation between the index of exposure and the mortality rate.
3. Sales by Region
This model is deceiving. At first glance there appears to be a strong negative correlation
between sales and region. However, it is only by chance that the U.S. is assigned the number 1
and the Rest of the World the number 6; if the sales manager assigned different numbers to the
regions, the negative correlation could disappear or even reverse (see the sketch below). More
fundamentally, regression is only appropriate when both variables are quantitative, and region
is a categorical variable.
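To illustrate how arbitrary the numeric coding is, here is a small sketch with made-up sales figures (the region codes and sales numbers below are hypothetical, not taken from the problem): relabeling the same regions flips the sign of the fitted slope.

sales=c(120,95,80,70,60,40)        # hypothetical sales figures for six regions
code1=1:6                          # U.S. = 1, ..., Rest of the World = 6
code2=6:1                          # the same regions numbered in the opposite order
coef(lm(sales~code1))["code1"]     # negative slope
coef(lm(sales~code2))["code2"]     # positive slope for the identical regions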
