
Bailey Wang

Professor Chen
STA108
1.
chol: Quantitative
stab.glu: Quantitative
hdl: Quantitative
ratio: Quantitative
glyhb: Quantitative
location: Qualitative
age: Quantitative
gender: Qualitative
height: Quantitative
weight: Quantitative
frame: Qualitative
bp.1s: Quantitative
bp.1d: Quantitative
waist: Quantitative
hip: Quantitative
time.ppn: Quantitative

Graph 1:
The histogram appears to be
Graph 2:

The histogram appears to be


Graph 3:

The histogram appears to be


Graph 4:
The histogram appears to be
Graph 5:

The histogram appears to be


Graph 6:
The histogram appears to be
Graph 7:

The histogram appears to be


Graph 8:

The histogram appears to be


Graph 9:
The histogram appears to be
Graph 10:

The histogram appears to be


Graph 11:

The histogram appears to be


Graph 12:
The histogram appears to be

Graph 13:

The histogram appears to be


Graph 14:
The chart
Graph 15:

The chart
Graph 16:

The chart

Scatterplot Matrix Graph 17:

It appears that the majority of the variables are uncorrelated. Height and weight seem to be the
only two variables with a linear relationship.

Refer to Code 1.

2
Diagnostic Plots Graph 18:

The Residuals vs Fitted plot shows the residuals clustered around a central area, with a slight
spread. The Scale-Location plot shows the points gathered within a certain area, and overall they
appear to have equal variance. The Normal Q-Q plot shows that the right tail is very heavy,
while the left tail has a slight dip. In the Residuals vs Leverage plot, there appears to be one outlier.
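
The visual checks above can be backed by a few numbers. The following is a minimal sketch, assuming R's built-in mtcars data as a stand-in for the diabetes data; the model mpg ~ wt + hp and the cutoffs 2p/n and 4/n are illustrative rules of thumb, not part of the assignment.

```r
# Stand-in model on built-in data; mtcars is NOT the diabetes data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))        # number of regression coefficients

h <- hatvalues(fit)           # leverages h_ii
r <- rstudent(fit)            # studentized deleted residuals
d <- cooks.distance(fit)      # Cook's distance

# Assumed rules of thumb: leverage > 2p/n, Cook's distance > 4/n
high_leverage <- names(h)[h > 2 * p / n]
influential   <- names(d)[d > 4 / n]

# Normality test to accompany the Normal Q-Q plot
sw <- shapiro.test(residuals(fit))

print(high_leverage)
print(influential)
print(sw$p.value)
```

Points flagged by both rules are the natural candidates for the outlier seen in the Residuals vs Leverage plot.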

3
Boxcox for the model1 Graph 19:
Boxcox for model 2 Graph 20:

Graph 21:
The Box-Cox plot called for only a single transformation (1/Y); afterwards, the model was
extremely close to ideal. In the Residuals vs Fitted and Scale-Location plots, the points moved to
the right, but nothing changed substantially, and the plots show results similar to those before
the transformation. The Normal Q-Q plot now shows that both tails are heavy, and the Residuals
vs Leverage plot still has an outlier.
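
The transformation suggested by a Box-Cox plot can also be read off numerically. A minimal sketch, assuming the built-in mtcars data as a stand-in (the model mpg ~ wt + hp is illustrative only): it picks the lambda with the highest profile log-likelihood and an approximate 95% interval. An interval containing lambda = -1 would support the reciprocal transformation 1/Y.

```r
library(MASS)  # boxcox() lives in MASS, as in Code 2

# Stand-in model on built-in data; mtcars is NOT the diabetes data set.
bc <- boxcox(mpg ~ wt + hp, data = mtcars,
             lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# lambda that maximizes the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1)/2 of the maximum
keep      <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[keep])

print(lambda_hat)
print(lambda_ci)
```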

Refer to Code 2.

4
Refer to Code 3.

5
Model 3 has 17 regression coefficients: an intercept plus 16 slopes, since the three-level factor
frame contributes two dummy variables. The model has the form:
Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15
The MSE of model3 is .001383855.
Refer to Code 4.
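
The coefficient count depends on how R expands a factor into dummy columns. A small sketch on hypothetical toy data (the values in toy are made up purely for illustration) shows that a three-level factor such as frame adds two columns to the design matrix:

```r
# Toy data (hypothetical), only to show how a factor expands into dummies
toy <- data.frame(
  y     = c(1.2, 0.8, 1.5, 1.1, 0.9, 1.4),
  x     = c(10, 12, 9, 11, 13, 8),
  frame = factor(c("small", "medium", "large", "small", "medium", "large"))
)

X <- model.matrix(y ~ x + frame, data = toy)

print(colnames(X))  # "(Intercept)" "x" "framemedium" "framesmall"
print(ncol(X))      # 4
```

The first level in alphabetical order ("large") serves as the baseline, which is why only framemedium and framesmall appear.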

6
     X51        sse       R^2     R^2_a        Cp       BIC       AIC
none   0 0.18964436 0.0000000 0.0000000 61.264603 -146.7862 -149.2970
1      0 0.14541160 0.2332406 0.2246253 28.216801 -171.4650 -166.4433
2      0 0.13222175 0.3027910 0.2869454 19.765827 -178.1180 -170.5854
3      1 0.11881816 0.3734686 0.3518640 11.145499 -185.8447 -175.8012
4      1 0.11052338 0.4172072 0.3901005  6.573136 -190.4301 -177.8758
5      1 0.10762551 0.4324877 0.3991046  6.277012 -190.8479 -175.7827
6      1 0.10510676 0.4457691 0.4061812  6.281282 -191.0029 -173.4268
7      1 0.10190499 0.4626522 0.4173337  5.744357 -191.8180 -171.7311
8      1 0.10078825 0.4685407 0.4166911  6.859514 -190.8207 -168.2230
9      1 0.09870371 0.4795326 0.4217029  7.207829 -190.7226 -165.6140
10     1 0.09719085 0.4875099 0.4234487  8.009116 -190.1282 -162.5087
11     1 0.09537923 0.4970627 0.4270334  8.573675 -189.8404 -159.7101
12     1 0.09460624 0.5011387 0.4243908  9.961198 -188.5809 -155.9397
13     1 0.09385925 0.5050776 0.4215193 11.369320 -187.3023 -152.1502
14     1 0.09343915 0.5072928 0.4165309 13.036456 -185.7105 -148.0476
15     1 0.09339316 0.5075353 0.4090423 15.000014 -183.7553 -143.5815
16     1 0.09339314 0.5075354 0.4010565 17.000000 -181.7553 -139.0707

Model 5 (the subset with p = 5 parameters, row 4 of the table) is selected by the Cp criterion,
with Cp = 6.57. Its Cp is close to its number of parameters p, and a Cp value near p indicates
a model with little bias.
Refer to Code 4.
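
The Cp values in the table follow the formula Cp = SSE_p/MSE_full - (n - 2p), the same form used for c1 in Code 4. A minimal sketch, assuming the built-in mtcars data as a stand-in (the two models are illustrative only):

```r
# Stand-in data; mtcars is NOT the diabetes data set.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
sub  <- lm(mpg ~ wt + hp, data = mtcars)

n        <- nrow(mtcars)
mse_full <- summary(full)$sigma^2    # MSE of the full model

cp <- function(model) {
  sse <- sum(residuals(model)^2)
  p   <- length(coef(model))         # parameters, intercept included
  sse / mse_full - (n - 2 * p)
}

cp_full <- cp(full)   # equals p of the full model by construction
cp_sub  <- cp(sub)

print(c(full = cp_full, sub = cp_sub))
```

For the full model, Cp always equals p, which matches the last row of the table (Cp = 17 with 17 parameters). The criterion is therefore only informative for the smaller subsets.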

7
Model4 has 136 regression coefficients.
The MSE of model4 is .001216314.
With this many coefficients, the model could be overfitted.
Refer to Code 5.
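
The count of 136 can be checked by hand. The sketch below assumes the predictors are the 15 listed in part 1: 12 quantitative variables, location and gender with one dummy each, and frame with two dummies.

```r
# Column counts assumed from the variable list in part 1
one_col   <- 12 + 2   # predictors that enter with a single column
frame_col <- 2        # dummy columns for the three-level factor frame

intercept    <- 1
main_effects <- one_col + frame_col                        # 16
# Interactions: pairs of single-column predictors, plus each of those
# predictors crossed with both frame dummies. The two frame dummies
# never interact with each other.
interactions <- choose(one_col, 2) + one_col * frame_col   # 91 + 28

total <- intercept + main_effects + interactions
print(total)   # 136
```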

8
Model.fs1 has the terms stab.glu + age + waist + ratio + stab.glu:ratio + age:ratio. This model’s
AIC value is -273.8571.
Refer to Code 5.

9
Model.fs2 has the terms chol + stab.glu + hdl + ratio + age + height +
weight + bp.1s + bp.1d + waist + hip + time.ppn + X7 + stab.glu:X7 +
hdl:ratio + age:bp.1d + weight:bp.1s + age:hip + hip:time.ppn +
height:X7 + stab.glu:bp.1s + stab.glu:time.ppn + stab.glu:waist +
age:waist + chol:time.ppn + hdl:weight + bp.1d:waist + weight:hip.
This model’s AIC value is -289.803.
Refer to Code 5.

10
BIC.fs1 = -248.1812
BIC.fs2 = -222.4038
AIC and BIC choose different models. By AIC, Model.fs2 is better (-289.803 versus -273.8571);
by BIC, Model.fs1 is better (-248.1812 versus -222.4038), because BIC penalizes the many extra
terms of Model.fs2 more heavily. The two models are related, since all four main effects of
model4.2 also appear in model4.1.
Refer to Code 5.
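
AIC and BIC can disagree because they differ only in the penalty per parameter: 2 for AIC and log(n) for BIC. A minimal sketch, assuming the built-in mtcars data as a stand-in (the two models are illustrative only) and the same formulas as aic.fs1 and bic.fs1 in Code 5:

```r
# Stand-in data; mtcars is NOT the diabetes data set.
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt * hp * qsec, data = mtcars)

n <- nrow(mtcars)

crit <- function(model) {
  sse <- sum(residuals(model)^2)
  p   <- length(coef(model))
  c(p   = p,
    aic = n * log(sse) + 2 * p,        # same form as aic.fs1 in Code 5
    bic = n * log(sse) + log(n) * p)   # same form as bic.fs1 in Code 5
}

res <- rbind(small = crit(small), big = crit(big))
print(res)

# The gap between the criteria grows with the number of parameters
print(res[, "bic"] - res[, "aic"])     # (log(n) - 2) * p
```

Whenever log(n) exceeds 2, a model with many terms is penalized more under BIC than under AIC, so BIC tends to prefer the smaller model.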

11
PRESS
Model 3.1: .4949556
Model 3.2: .5055473
Model 3.3: .4923776
Model 4.1: .218265
Model 4.2: .2267079
By PRESS, the best model is model 4.1, since it has the smallest PRESS value (.218265).
Refer to Code 6.
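
Code 6 computes PRESS with the leverage shortcut, where each deleted residual is e_i/(1 - h_ii). A minimal sketch, assuming the built-in mtcars data as a stand-in (the model is illustrative only), checks that the shortcut agrees with explicitly refitting the model n times:

```r
# Stand-in data; mtcars is NOT the diabetes data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut used in Code 6: deleted residual = e_i / (1 - h_ii)
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Explicit leave-one-out refits, for comparison
loo_err <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
press_loo <- sum(loo_err^2)

print(c(shortcut = press_shortcut, loo = press_loo))
```

PRESS is always at least as large as the SSE of the same model, so it is best compared across models rather than against the SSE.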
12
MSPR
Model 3.1: .002704676
Model 3.2: .002762554
Model 3.3: .002690588
Model 4.1: .001192705
Model 4.2: .001238841
By MSPR, the best model is model 4.1, since it has the smallest MSPR value (.001192705).
Refer to Code 6.
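
In Code 6 the MSPR values are obtained as PRESS/183. MSPR is more commonly computed on the held-out validation half (data.v in Code 3). A minimal sketch of that version, assuming the built-in mtcars data as a stand-in and an illustrative model:

```r
set.seed(10)
# Stand-in data; mtcars is NOT the diabetes data set.
idx   <- sample(seq_len(nrow(mtcars)), size = nrow(mtcars) / 2)
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = valid)

# Mean squared prediction error on the validation half
mspr <- mean((valid$mpg - pred)^2)

# Training-based quantities to compare against
press_n   <- sum((residuals(fit) / (1 - hatvalues(fit)))^2) / nrow(train)
mse_train <- summary(fit)$sigma^2

print(c(MSPR = mspr, PRESS_over_n = press_n, MSE = mse_train))
```

An MSPR close to PRESS/n and to the training MSE suggests that the model predicts new observations about as well as it fits the training data.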
13

Call:
lm(formula = glyhb ~ ., data = data)

Coefficients:
     (Intercept)            chol        stab.glu             hdl           ratio  locationLouisa
      -2.2998047       0.0050363       0.0275617      -0.0034033       0.0851633      -0.2337873
             age      gendermale          height          weight     framemedium      framesmall
       0.0134974      -0.0850168       0.0227918      -0.0041222       0.2799754       0.2281141
           bp.1s           bp.1d           waist             hip        time.ppn            X101
       0.0027597      -0.0014471       0.0295509       0.0157799       0.0005652              NA
            X102             X71             X51
              NA              NA              NA

The best model is model 4.1. PRESS and MSPR both single out model 4.1 and model 4.2 as the
two best models, and model 4.1 has the smaller value on each (PRESS .218265 versus .2267079;
MSPR .001192705 versus .001238841). Therefore, model5 uses model 4.1.
Refer to Code 6.
setwd("~/Desktop/RStudio")
data=read.table("diabetes.txt",header=T)

n=length(data$chol)

Code 1:
Y=data$glyhb
X1=data$chol
X2=data$stab.glu
X3=data$hdl
X4=data$ratio
X5=data$location
X6=data$age
X7=data$gender
X8=data$height
X9=data$weight
X10=data$frame
X11=data$bp.1s
X12=data$bp.1d
X13=data$waist
X14=data$hip
X15=data$time.ppn

#Histograms

lm(glyhb~.,data = data)

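##Columns are referenced through data$ because data is never attached
hist(X1)
hist(data$stab.glu)
hist(data$hdl)
hist(data$ratio)
hist(data$glyhb)
hist(data$age)
hist(data$height)
hist(data$weight)
hist(data$bp.1s)
hist(data$bp.1d)
hist(data$waist)
hist(data$hip)
hist(data$time.ppn)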

##Dummy coding for the qualitative variables
X10vector=data$X10=as.factor(ifelse(data$frame =="small", 0, ifelse(data$frame =="medium", 1, 2)))
X7vector=data$X7=as.factor(ifelse(data$gender == "male", 0,1) )
X5vector=data$X5=as.factor(ifelse(data$location == "Buckingham", 0,1) )
View(data)

str(data)
qual=c(5,6,8,11)
head(data[,qual])
head(data[,-qual])
num=sapply(data[,-qual],is.numeric) ##cor() only accepts numeric columns
cor(data[,-qual][,num])
View(cor(cbind(data[,-qual][,num],Ystar=1/Y)))
Ynew=1/Y
Ystar=1/Y
data1=data[,-qual]
data2=cbind(data1,Ystar)

##data1=(cbind(Y,X1,X2,X3,X5,X5vector,X6,X7vector,X8,X9,X10vector,X11,X12,X13,X14,X15))

##Pie charts
genderpie=subset(data, select = c(gender))
gender1=genderpie$gender
gender.pie=table(gender1)
pie(gender.pie, main="Gender")

locationpie=subset(data, select = c(location))


location1=locationpie$location
location.pie=table(location1)
pie(location.pie, main= "Location")

framepie=subset(data, select = c(frame))


frame1=framepie$frame
frame.pie=table(frame1)
pie(frame.pie, main= "Frame")

View(data)

d=data1
pairs(Y~X1+X2+X3+X4+X6+X9+X11+X12+X13+X14+X15,
data=d, main="Scatterplot Matrix For Diabetes Data", cex.main=.8)

Code 2:
model1=lm(Y~.,data=data1)
plot(model1)

library("MASS")
boxcox(Y~.,data=data1)

head(d)
d=d[,names(d)!="Y"]
model2=lm(1/Y~.,data=d)
plot(model2)

boxcox((1/Y)~.,data=data1)

Code 3:
set.seed(10)

N=nrow(d)
index=sample(1:n,size=n/2, replace=FALSE)
data.t=data2[index,]
data.v=data2[-index,]

Code 4:
model3=lm((Ystar)~.,data=data.t)
summary(model3)
length(model3$coefficients)
summary(model3)$sigma^2

library("leaps")
best=regsubsets(Ystar~.,data=data.t,nbest=1,nvmax=16)
sum_sub=summary(best)
plot(best)
sum_sub
n=nrow(data.t)
p.m=as.integer(as.numeric(rownames(sum_sub$which))+1) ##p.m=1:16+1 ##p.m=2:17
sse=sum_sub$rss
sse

MSE=summary(model3)$sigma^2

aic=n*log(sse)+2*p.m #-n*log(n)
bic=n*log(sse)+log(n)*p.m
##bic comes before aic so the columns match the "BIC","AIC" names assigned below
res_sub=cbind(sum_sub$which,sse,sum_sub$rsq,sum_sub$adjr2,sum_sub$cp,bic,aic)
fit0=lm(Ystar~1,data=data.t)
sse1=sum(fit0$residuals^2)
p=1
c1=sse1/MSE-(n-2*p)
aic1=n*log(sse1)+2*p
bic1=n*log(sse1)+log(n)*p
none=c(1,rep(0,16),sse1,0,0,c1,bic1,aic1)
res_sub=rbind(none,res_sub)
colnames(res_sub)=c(colnames(sum_sub$which),"sse","R^2","R^2_a","Cp","BIC","AIC")
res_sub

frametype=model.matrix(~data$X10)

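##Use the columns of data.t: the global vectors X2, X4, ... hold all rows, not only the training rows
##X2=stab.glu, X4=ratio, X6=age, X13=waist, X15=time.ppn; frametype[,3] is the X10==2 dummy
model3.1=lm(Ystar~stab.glu+ratio+age+waist, data=data.t)
model3.2=lm(Ystar~stab.glu+age+waist, data=data.t)
model3.3=lm(Ystar~stab.glu+ratio+age+I(X10==2)+waist+time.ppn, data=data.t)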

Code 5:
model4=lm(Ystar~.^2,data=data.t)
length(model4$coefficients)
summary(model4)$sigma^2

fs1=stepAIC(fit0,scope=list(upper=model4,lower=~1),direction="both", k=2)
sse.fs1=sum(fs1$residuals^2)
p.fs1=length(fs1$coefficients)

fs2=stepAIC(model3, scope=list(upper=model4, lower=~1), direction="both",k=2)


sse.fs2=sum(fs2$residuals^2)
p.fs2=length(fs2$coefficients)

aic.fs1=n*log(sse.fs1)+2*p.fs1
aic.fs2=n*log(sse.fs2)+2*p.fs2
bic.fs1=n*log(sse.fs1)+log(n)*p.fs1
bic.fs2=n*log(sse.fs2)+log(n)*p.fs2

model4.1=fs2
model4.2=fs1

Code 6:
##influence(model)$hat returns the leverages h_ii (diagonal of the hat matrix)

press.mod3.1=sum(model3.1$residuals^2/(1-influence(model3.1)$hat)^2)
sse.mod3.1=sum(model3.1$residuals^2)
(press.mod3.1)/183

press.mod3.2=sum(model3.2$residuals^2/(1-influence(model3.2)$hat)^2)
sse.mod3.2=sum(model3.2$residuals^2)
(press.mod3.2)/183

press.mod3.3=sum(model3.3$residuals^2/(1-influence(model3.3)$hat)^2)
sse.mod3.3=sum(model3.3$residuals^2)
(press.mod3.3)/183

press.mod4.1=sum(model4.1$residuals^2/(1-influence(model4.1)$hat)^2)
sse.mod4.1=sum(model4.1$residuals^2)
(press.mod4.1)/183

press.mod4.2=sum(model4.2$residuals^2/(1-influence(model4.2)$hat)^2)
sse.mod4.2=sum(model4.2$residuals^2)
(press.mod4.2)/183

model5=model4.1

model5=lm(glyhb~.,data=data) ##glyhb is the response; Y~. would also use glyhb as a predictor
summary(model5)
anova(model5)
