Practical Regression and Anova using R
Julian J. Faraway
July 2002
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{RSS}{Total\ SS}
• \hat{\sigma}^2 = \hat{\varepsilon}^T \hat{\varepsilon}/(n-p) = RSS/(n-p)
• \hat{\sigma} is the estimate of \sigma.
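The two formulas above can be checked directly in R. A minimal sketch on synthetic data (all variable names here are illustrative, not from the course examples):

```r
## Verify that R^2 and sigma-hat from summary() match the formulas above.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
g <- lm(y ~ x)

rss <- sum(residuals(g)^2)           # RSS = sum (y_i - yhat_i)^2
tss <- sum((y - mean(y))^2)          # total SS = sum (y_i - ybar)^2
n <- 20; p <- 2                      # p = number of coefficients

r2     <- 1 - rss / tss              # R^2 = 1 - RSS/TSS
sigma2 <- rss / (n - p)              # sigma-hat^2 = RSS/(n-p)

all.equal(r2, summary(g)$r.squared)        # agrees with summary()
all.equal(sqrt(sigma2), summary(g)$sigma)  # agrees with summary()
```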
(Figure: geometric view of the F-test, showing the residual for the small model, the residual for the large model, and the difference between the two models.)
% The period . stands for the other variables in the data frame.
% The argument of I() is evaluated rather than interpreted as part of the model formula.
http://my.mofile.com/tangyc8866 R 语言与统计分析 – 华东师范大学 (2008 年 9 月 )
• 2) H0: \beta_{ddpi} = 1
Null model:
y = \beta_0 + \beta_{pop15}\,pop15 + \beta_{pop75}\,pop75 + \beta_{dpi}\,dpi + ddpi + \varepsilon
% The fixed term is called an offset
> gr <- lm(sr ~ pop15 + pop75 + dpi + offset(ddpi), savings)
> anova(gr, g)
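The savings example above needs the course's data set, so here is a self-contained sketch of the same offset idea on synthetic data (the variables x, z, and the hypothesized coefficient of 1 are illustrative assumptions):

```r
## Testing H0: beta_z = 1 by fixing that coefficient with offset()
## and comparing the reduced model to the full model with anova().
set.seed(2)
n <- 50
x <- rnorm(n); z <- rnorm(n)
y <- 1 + 2 * x + 1 * z + rnorm(n)    # true coefficient of z is 1

g  <- lm(y ~ x + z)                  # full model: beta_z estimated
gr <- lm(y ~ x + offset(z))          # reduced model: beta_z fixed at 1
anova(gr, g)                         # F-test of H0: beta_z = 1
```

The reduced model necessarily has at least as large an RSS; the F-statistic measures whether the increase is more than chance.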
Two other ways, as usual:
t-statistic: t = (\hat{\beta} - c)/se(\hat{\beta}), where c is the hypothesized value.
Confidence-region form: (\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) / (p\,\hat{\sigma}^2) \sim F_{p,\,n-p}
Confidence interval for one i
• General form
estimate +/- critical value x s.e. of estimate
• For i
\hat{\beta}_i \pm t_{n-p}^{(\alpha/2)}\,\hat{\sigma}\sqrt{(X^T X)^{-1}_{ii}}
• It’s better to consider the joint confidence intervals when plausible, especially when the estimates of the \beta_i's are heavily correlated.
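The interval formula above can be checked against R's confint(). A minimal sketch on synthetic data:

```r
## Compute the t-based confidence interval by hand and compare to confint().
set.seed(3)
x <- rnorm(30); y <- 1 + 2 * x + rnorm(30)
g <- lm(y ~ x)

n <- 30; p <- 2
bhat  <- coef(g)[2]
se    <- sqrt(diag(vcov(g)))[2]     # sigma-hat * sqrt((X'X)^{-1}_{ii})
tcrit <- qt(0.975, n - p)           # critical value t_{n-p}^{(alpha/2)}

c(bhat - tcrit * se, bhat + tcrit * se)  # estimate +/- critical value * s.e.
confint(g)[2, ]                          # should agree
```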
Conclusion (see Figures 3.6 and 3.7): the widening of the CI does not reflect the possibility that the structure of the model itself may change as we move into new territory.
Suppose the true predictor \xi is observed with error as x = \xi + \delta, where cov(\varepsilon, \delta) = 0. Then

E(\hat{\beta}_1) \approx \beta_1 \frac{\sigma_\xi^2}{\sigma_\xi^2 + \sigma_\delta^2} = \beta_1 \frac{1}{1 + \sigma_\delta^2/\sigma_\xi^2}

Thus the estimate of \beta_1 will be biased (toward zero)!
If the variability in the errors of observation of X is small relative to the range of X, we need not be concerned. If not, it is a serious problem and other methods should be considered!
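This attenuation effect is easy to see by simulation. A sketch with assumed variances \sigma_\xi^2 = 1 and \sigma_\delta^2 = 1 (chosen for illustration), so the slope should shrink by a factor of about 1/2:

```r
## Regressing y on a noisily observed predictor shrinks the slope toward 0.
set.seed(4)
n  <- 10000
xi <- rnorm(n, sd = 1)                # true predictor, sigma_xi = 1
y  <- 1 + 2 * xi + rnorm(n, sd = 0.5)
x  <- xi + rnorm(n, sd = 1)           # observed with error, sigma_delta = 1

coef(lm(y ~ xi))[2]   # close to the true slope, 2
coef(lm(y ~ x))[2]    # close to 2 * 1/(1 + 1) = 1, the attenuated slope
```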
\sigma^2 known:
Lack of fit if \frac{(n-p)\hat{\sigma}^2}{\sigma^2} > \chi^{2\,(1-\alpha)}_{n-p}
Example – strongx
Model 1 (without a quadratic term): lack of fit (p-value = 0.0048)
Model 2 (with a quadratic term): good fit (p-value = 0.85363)
Degrees of freedom for pure error:
df_{pe} = \sum_{distinct\ x} (\#replicates - 1)
The lack-of-fit ANOVA table has columns df, SS, MS, F.
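The pure-error lack-of-fit test can be carried out by comparing the working model with the saturated model factor(x). A sketch on synthetic replicated data (the quadratic true relation is an assumption chosen so the straight line fails):

```r
## Lack-of-fit test using replicated x values: factor(x) fits a separate
## mean at each distinct x, leaving only pure error in its residuals.
set.seed(5)
x <- rep(1:5, each = 4)              # 5 distinct x's, 4 replicates each
y <- x^2 + rnorm(20)                 # true relation is quadratic

g  <- lm(y ~ x)                      # straight-line model
ga <- lm(y ~ factor(x))              # saturated model: pure error only
anova(g, ga)                         # significant F => lack of fit
```

Here df_{pe} = 5 × (4 − 1) = 15, matching the formula above.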
• Reason: the regression sd = 3.06, while the pure-error sd = sqrt(11.8/6) = 1.4.
(Are the replicates genuine? A low pure-error sd may be caused by some correlation in the measurements, or by an unmeasured third variable.)
• Another model other than a straight line (though not obvious!)
> gp <- lm(loss ~ Fe+I(Fe^2)+I(Fe^3)+I(Fe^4)+I(Fe^5)+I(Fe^6),
+ corrosion)
> summary(gp)$r.squared
[1] 0.99653
• R2=99.7%
h_i = H_{ii} -- leverages
var(\hat{\varepsilon}_i) = \sigma^2(1 - h_i): a large leverage h_i will “force” the fit to be close to y_i.
Some facts: \sum_i h_i = p, h_i \ge 1/n
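These facts about the leverages are easy to verify with hatvalues(). A minimal sketch on synthetic data:

```r
## Check sum(h_i) = p and h_i >= 1/n, and apply the 2p/n rule of thumb.
set.seed(6)
x1 <- rnorm(25); x2 <- rnorm(25)
y  <- 1 + x1 + x2 + rnorm(25)
g  <- lm(y ~ x1 + x2)

h <- hatvalues(g)          # h_i = H_ii
sum(h)                     # equals p = 3 (number of coefficients)
min(h) >= 1/25             # TRUE when the model includes an intercept
which(h > 2 * 3 / 25)      # points flagged by the h_i > 2p/n rule
```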
“Rule of thumb”: leverages of more than 2p/n should be looked at more closely. Large values of h_i are due to extreme values in X. (h_i corresponds to a Mahalanobis distance defined by X.) Also notice that var(\hat{y}_i) = \sigma^2 h_i.
\hat{y}_{(i)} = x_i^T \hat{\beta}_{(i)}
If \hat{y}_{(i)} - y_i is large then point i is an outlier.
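The leave-one-out idea can be tried directly, and R's rstudent() gives the corresponding jackknife (studentized) residuals. A sketch with an artificially planted outlier (the shift of +8 at point 10 is an illustrative assumption):

```r
## Leave-one-out prediction for a suspected outlier, and the
## jackknife residuals from rstudent().
set.seed(7)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
y[10] <- y[10] + 8                       # plant an outlier at point 10

g   <- lm(y ~ x)
g10 <- lm(y ~ x, subset = -10)           # refit without point 10
yhat10 <- predict(g10, data.frame(x = x[10]))
y[10] - yhat10                           # large gap => outlier
which.max(abs(rstudent(g)))              # jackknife residuals agree: point 10
```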
D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^T (X^T X)(\hat{\beta} - \hat{\beta}_{(i)})}{p\,\hat{\sigma}^2} = \frac{1}{p}\, r_i^2\, \frac{h_i}{1 - h_i}
• The first term, r_i^2, is the residual effect and the second is the leverage. The combination leads to the influence.
• An index plot of Di can be used to identify influential points.
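The second form of D_i, built from the standardized residuals and leverages, can be checked against R's cooks.distance(). A minimal sketch on synthetic data:

```r
## Cook's distance from the r_i^2 * h_i/(1-h_i) / p formula,
## compared with the built-in cooks.distance().
set.seed(8)
x <- rnorm(20); y <- 1 + 2 * x + rnorm(20)
g <- lm(y ~ x)

p  <- 2
h  <- hatvalues(g)
r  <- rstandard(g)                       # standardized residuals r_i
Di <- (r^2 / p) * (h / (1 - h))          # D_i as in the formula above

all.equal(Di, cooks.distance(g))         # the two computations agree
plot(Di, type = "h")                     # index plot of D_i
```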
Example – savings (page 79)
• Full data
• Excluding one point at a time is easy using lm.influence()
PLS:
• In contrast, PLS finds linear combinations of the predictors that best explain the response. It is most effective when there are large numbers of variables to be considered.
• If successful, the variability of prediction is substantially reduced.
• On the other hand, PLS is virtually useless for explanation purposes.
Possible choices of criterion:
• R^2 = 1 - RSS/TSS
• Adjusted R^2: R_a^2 = 1 - \frac{RSS/(n-p)}{TSS/(n-1)} = 1 - (1 - R^2)\frac{n-1}{n-p}
• 1 - \hat{\sigma}^2_{model} / \hat{\sigma}^2_{null}
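The adjusted R^2 identity above can be verified against summary(). A minimal sketch on synthetic data (x2 is deliberately irrelevant, so adjusted R^2 penalizes its inclusion):

```r
## Check both forms of adjusted R^2 against summary()'s value.
set.seed(9)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 1 + x1 + rnorm(30)                 # x2 plays no role
g  <- lm(y ~ x1 + x2)

n <- 30; p <- 3
rss <- sum(residuals(g)^2)
tss <- sum((y - mean(y))^2)
r2  <- 1 - rss / tss
r2a <- 1 - (rss / (n - p)) / (tss / (n - 1))

all.equal(r2a, 1 - (1 - r2) * (n - 1) / (n - p))  # the identity holds
all.equal(r2a, summary(g)$adj.r.squared)          # matches summary()
```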
Comparison of mathematical
programs for data analysis
(Edition 4.4)
by Stefan Steinhaus
(stefan@steinhaus-net.de)
http://www.scientificweb.com/ncrunch/