
IE 561

STATISTICAL METHODS AND DATA ANALYSIS


ASSIGNMENT:
Homework 6
Instructor:
Dr. Cheng-Hung Hu
Department of Industrial Engineering and Management
Work elaborated by:
Belen Vargas
ID: 1015459
Carlos Manuel Santos
ID: 1015458
Due date: 2012/11/06

QUESTION 1
A)
Histogram of X1, X2, X3, X4 (with fitted normal curves)

[Figure: four histogram panels of frequency versus value, one per predictor.
 X1: Mean 103.4, StDev 20.30, N = 25
 X2: Mean 106.7, StDev 17.29, N = 25
 X3: Mean 100.8, StDev 8.855, N = 25
 X4: Mean 94.68, StDev 10.68, N = 25]

Boxplot of X1, X2, X3, X4

[Figure: four boxplot panels, one per predictor.]

The distributions of X2 and X3 are slightly skewed, there could be a potential outlier in X1, and X1 seems to follow a normal distribution.

B)

Matrix Plot of Yi, X1, X2, X3, X4

[Figure: scatter plot matrix of Yi against X1-X4 and of each pair of predictors.]
Correlations: Yi, X1, X2, X3, X4

         Yi      X1      X2      X3
X1    0.514
      0.009

X2    0.497   0.102
      0.011   0.627

X3    0.897   0.181   0.519
      0.000   0.387   0.008

X4    0.869   0.327   0.397   0.782
      0.000   0.111   0.050   0.000

Cell Contents: Pearson correlation
               P-Value

Based on the correlation values and the analysis of the scatter plot matrix, Y is highly linearly correlated with all four predictor variables. Based on the highlighted correlation values and the corresponding scatter plots, X2 and X3, and X3 and X4, are highly correlated, which indicates a possible multicollinearity problem.
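The Pearson correlations and the tests behind their p-values can be reproduced outside Minitab. The sketch below (plain Python; the helper names are ours, not Minitab's) computes the coefficient and the t statistic used for the test of H0: rho = 0, t = r * sqrt(n-2) / sqrt(1-r^2), with n - 2 degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_stat(r, n):
    """t statistic for testing H0: rho = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

Applying `t_stat` to, say, r = 0.514 with n = 25 and comparing against the t(23) distribution reproduces the tabled p-value of 0.009.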

C) The regression equation is

Yi = -124.382 + 0.29573 X1 + 0.04829 X2 + 1.3060 X3 + 0.5198 X4

Predictor       Coef  SE Coef       T      P    VIF
Constant    -124.382    9.941  -12.51  0.000
X1           0.29573  0.04397    6.73  0.000  1.138
X2           0.04829  0.05662    0.85  0.404  1.370
X3            1.3060   0.1641    7.96  0.000  3.017
X4            0.5198   0.1319    3.94  0.001  2.835

It seems as if X2 should be excluded from the model due to its p-value of 0.404.
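Minitab's VIF column can be checked by hand: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing X_j on the remaining predictors. A minimal numpy sketch (function names are ours; this is an illustration, not Minitab's implementation):

```python
import numpy as np

def ols_coefs(A, y):
    """Least-squares coefficients of y on the columns of A."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

def vif(X):
    """VIF of each column of X (predictor columns only, no intercept):
    VIF_j = 1 / (1 - R^2_j), R^2_j from regressing X_j on the rest."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        yj = X[:, j]
        fitted = others @ ols_coefs(others, yj)
        ss_res = np.sum((yj - fitted) ** 2)
        ss_tot = np.sum((yj - yj.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return out

# Two orthogonal predictors: no multicollinearity, so both VIFs equal 1.
demo = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
vifs = vif(demo)
```

Running `vif` on the homework's X matrix would reproduce the 1.138, 1.370, 3.017, 2.835 column above.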


D) USING FITTED VALUES (Yhat)

Best Subsets Regression: Yi versus X1, X2, X3, X4

Response is Yi

                                               X X X X
Vars   R-Sq  R-Sq(adj)  Mallows Cp        S    1 2 3 4
   1   80.5       79.6        84.2   8.7676        X
   1   75.6       74.5       110.6   9.8039          X
   1   26.5       23.3       375.3   17.014    X
   1   24.7       21.4       384.8   17.217      X
   2   93.3       92.7        17.1   5.2512    X   X
   2   87.7       86.6        47.2   7.1073        X X
   2   81.5       79.8        80.6   8.7193    X     X
   2   80.6       78.8        85.5   8.9336      X X
   3   96.2       95.6         3.7   4.0720    X   X X
   3   93.4       92.5        18.5   5.3306    X X X
   3   87.9       86.2        48.2   7.2237      X X X
   3   84.5       82.3        66.3   8.1653    X X   X
   4   96.3       95.5         5.0   4.0986    X X X X

The four best subset regression models according to the R²a,p criterion are (X1,X3,X4), (X1,X2,X3,X4), (X1,X3), and (X1,X2,X3).
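Best-subsets regression is simply an exhaustive enumeration of predictor subsets ranked by a criterion. A small numpy sketch of the R²a,p (adjusted R²) version, assuming the predictors sit in the columns of an array (names and data here are illustrative, not the homework's):

```python
import numpy as np
from itertools import combinations

def adj_r2(X, y, cols):
    """Adjusted R-squared of the OLS fit of y on the selected columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    p = len(cols) + 1                      # parameters incl. intercept
    return 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

def best_subsets(X, y):
    """Rank every non-empty predictor subset by adjusted R² (best first)."""
    k = X.shape[1]
    subsets = [c for r in range(1, k + 1) for c in combinations(range(k), r)]
    return sorted(subsets, key=lambda c: adj_r2(X, y, c), reverse=True)

# Synthetic demo: y depends on columns 0 and 2, column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=30)
ranking = best_subsets(X, y)
```

With four predictors this enumerates all 15 subsets, matching the 13 rows Minitab prints (it shows only the top models per size).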

E)

We would choose the stepwise method because it is an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables (which is our case) and no underlying theory on which to base the model selection. At each stage in the process, after a new variable is added, a test is made to check whether some variables can be deleted without appreciably increasing the residual sum of squares. The procedure terminates when the selection measure is (locally) maximized, or when the available improvement falls below some critical value.
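The procedure described above can be sketched as a greedy forward search. For simplicity, the sketch below uses a fixed partial-F threshold in place of Minitab's alpha-to-enter, and omits the removal step, so it illustrates the idea rather than reproducing Minitab's exact algorithm:

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares for the OLS fit on the selected columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ b
    return r @ r

def forward_stepwise(X, y, f_enter=4.0):
    """Greedily add the predictor with the largest partial F statistic,
    as long as it exceeds f_enter (a fixed stand-in for alpha-to-enter)."""
    k = X.shape[1]
    n = len(y)
    chosen, remaining = [], list(range(k))
    while remaining:
        base = sse(X, y, chosen)
        best_f, best_j = -1.0, None
        for j in remaining:
            new = sse(X, y, chosen + [j])
            df = n - len(chosen) - 2        # error df after adding j
            f = (base - new) / (new / df)
            if f > best_f:
                best_f, best_j = f, j
        if best_f < f_enter:
            break
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

# Synthetic demo: y depends strongly on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 3.0 * X[:, 0] + X[:, 2] + rng.normal(scale=0.1, size=40)
selected = forward_stepwise(X, y)
```

A full stepwise routine would also re-test the variables already in the model against an alpha-to-remove after each addition.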

F)

Stepwise Regression: Yi versus X1, X2, X3, X4

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is Yi on 4 predictors, with N = 25

Step              1        2        3
Constant     -106.1   -127.6   -124.2

X3             1.97     1.82     1.36
T-Value        9.74    14.81     8.94
P-Value       0.000    0.000    0.000

X1                      0.348    0.296
T-Value                  6.49     6.78
P-Value                 0.000    0.000

X4                               0.52
T-Value                          3.95
P-Value                         0.001

S              8.77     5.25     4.07
R-Sq          80.47    93.30    96.15
R-Sq(adj)     79.62    92.69    95.60
Mallows Cp     84.2     17.1      3.7

The stepwise model selection method suggests that the model with (X1,X3,X4) as predictor variables is the best subset model, which is consistent with the conclusion we drew earlier with the R²a,p criterion in part D.
G)
Both the Stepwise method and the R²a,p criterion indicate that the model with (X1,X3,X4) as predictor variables is the best subset regression model. The main comparison between the two subset-selection methods is that the R²a,p criterion shows how each subset fits the data, while Stepwise directly returns a single suggested subset; so with a larger number of predictors we would choose the method that computes an answer faster, which in this case is the Stepwise method.

QUESTION 2
A)
The regression equation is
LNY = -2.04 - 0.712 LNX1 + 0.747 LN(140-X2) + 0.757 LNX3
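Fitting this transformed model is an ordinary least-squares fit on LNX1, LN(140-X2), and LNX3. The sketch below uses synthetic data generated from arbitrarily chosen coefficients (not the homework's data) purely to show the mechanics of building the transformed design matrix:

```python
import numpy as np

# Hypothetical data standing in for the homework's X1, X2, X3;
# ranges chosen so that 140 - X2 stays positive.
rng = np.random.default_rng(1)
X1 = rng.uniform(60, 120, size=33)
X2 = rng.uniform(80, 130, size=33)
X3 = rng.uniform(0.5, 3.0, size=33)

# Generate LNY from the same functional form with made-up coefficients.
lnY = -2.0 - 0.7 * np.log(X1) + 0.75 * np.log(140 - X2) + 0.76 * np.log(X3)

# Design matrix of the transformed predictors, then OLS.
A = np.column_stack([np.ones(33), np.log(X1), np.log(140 - X2), np.log(X3)])
coef, *_ = np.linalg.lstsq(A, lnY, rcond=None)
```

Because the demo response is noise-free, `coef` recovers the generating coefficients exactly; on real data it would return the estimated ones.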

B)
Normal Probability Plot (response is LNY)

[Figure: normal probability plot of the residuals; percent versus residual, with residuals ranging from about -0.4 to 0.4.]

Scatterplot of RESI1 vs FITS1, LNX3, LN(140-X2), LNX1

[Figure: residuals RESI1 plotted against the fitted values FITS1 and against each transformed predictor.]

C)

Predictor         Coef  SE Coef      T      P    VIF
Constant        -2.043    1.019  -2.00  0.054
LNX1          -0.71195  0.09203  -7.74  0.000  1.339
LN(140-X2)      0.7474   0.1570   4.76  0.000  1.330
LNX3            0.7574   0.1592   4.76  0.000  1.016

There are no indications that serious multicollinearity exists because the average of the VIF values is not considerably greater than one.
D)

Subject         T         Subject         T
  1      -0.02397           18       0.955272
  2       0.00340           19       0.955632
  3      -0.21765           20       1.648977
  4       0.27942           21      -1.68083
  5      -0.15891           22      -0.60600
  6      -0.16114           23       0.425732
  7       0.649973          24       0.229971
  8      -0.58074           25      -0.73581
  9      -0.04679           26      -1.43759
 10      -0.51129           27      -0.33157
 11       1.127888          28       2.214574
 12      -0.86439           29      -2.51995
 13      -0.08746           30      -0.76873
 14      -0.11518           31      -0.97484
 15       1.070476          32       1.98281
 16      -1.43825           33       0.829027
 17       0.734645

We would like to test whether subject 29, which has the largest absolute value, is an outlier. We shall use the Bonferroni simultaneous test procedure with α = 0.10:
t(1 - α/(2n); n - p - 1) = t(0.9985; 28) = 3.24993
We can observe that the absolute value for subject 29 (2.51995) is not larger than the critical t-value given by the Bonferroni procedure; therefore we do not have statistical evidence that the model has any outliers.
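The test above reduces to comparing the largest |t_i| against the Bonferroni critical value. A short check in plain Python, using the studentized deleted residuals from part D and the critical value quoted in the text:

```python
# Studentized deleted residuals for subjects 1-33, from part D.
deleted_resids = [
    -0.02397, 0.0034, -0.21765, 0.27942, -0.15891, -0.16114, 0.649973,
    -0.58074, -0.04679, -0.51129, 1.127888, -0.86439, -0.08746, -0.11518,
    1.070476, -1.43825, 0.734645, 0.955272, 0.955632, 1.648977, -1.68083,
    -0.606, 0.425732, 0.229971, -0.73581, -1.43759, -0.33157, 2.214574,
    -2.51995, -0.76873, -0.97484, 1.98281, 0.829027,
]

# Bonferroni critical value t(1 - alpha/(2n); n - p - 1) quoted in the
# text for alpha = 0.10, n = 33, 28 error degrees of freedom.
t_crit = 3.24993

# Index of the largest residual in absolute value (subject = index + 1).
worst = max(range(len(deleted_resids)), key=lambda i: abs(deleted_resids[i]))
is_outlier = abs(deleted_resids[worst]) > t_crit
```

Here `worst` points at subject 29 and `is_outlier` is False, matching the conclusion above.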
