BAUDM Assignment Predicting Boston Housing Prices

By:
Suraj Dhende (14PGP015)
Saranya Panicker (14PGP108)
Bharadwaj Sista (14PGP118)
Ashis Tripathy (14PGP121)
Yadvendra Yadav (14PGP123)
Ans a:
The data should be partitioned into training and validation sets because we need two
sets of data: one to build the regression model and the other to test it.
The model fitted on the training set describes the relationship between the dependent and
the independent variables, whereas running that model on the validation set measures the
accuracy of the model that was built on the training data.
The validation data is used to validate or test the model. In this process, the model
(built using the training data set) is used to make predictions on the validation data,
that is, data that were not used to fit the model. In this way we get an unbiased estimate of
how well the model performs. We compute measures of error which reflect the
prediction accuracy.
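The partitioning step can be sketched as follows. This is only an illustration: the row identifiers, the 60% fraction written as an exact cut, and the seed are assumptions, and the SPSS "Approximately 60% of the cases" filter used in this assignment draws its sample differently.

```python
import random

def partition(records, train_fraction=0.6, seed=1):
    """Randomly split records into (training, validation) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical row identifiers standing in for the housing tracts.
rows = list(range(500))
training, validation = partition(rows)
print(len(training), len(validation))
```

The model is then fitted on `training` only, and its error measures are computed on `validation`.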
Ans b: The equation is:
MEDV = -34.867 - 0.218*CRIM + 3.824*CHAS + 9.20*RM

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .747(a)   .557       .553                6.161
R is reported for Approximately 60% of the cases (SAMPLE) = 1 (Selected).
a. Predictors: (Constant), RM, CHAS, CRIM

ANOVA(a,b)
Model          Sum of Squares   df    Mean Square   F         Sig.
1 Regression   14051.438        3     4683.813      123.388   .000(c)
  Residual     11160.292        294   37.960
  Total        25211.730        297
a. Dependent Variable: MEDV
b. Selecting only cases for which Approximately 60% of the cases (SAMPLE) = 1
c. Predictors: (Constant), RM, CHAS, CRIM
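The figures in the ANOVA output are linked by simple arithmetic, which the short check below reproduces. It is a sketch using only the sums of squares and degrees of freedom reported above; the regression df of 3 is taken as the number of predictors (RM, CHAS, CRIM), which also equals the total df of 297 minus the residual df of 294.

```python
ss_regression = 14051.438
ss_residual = 11160.292
df_regression = 3      # three predictors: RM, CHAS, CRIM (297 - 294)
df_residual = 294

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares
f_statistic = ms_regression / ms_residual

# R square is the share of the total sum of squares explained by the regression
r_square = ss_regression / (ss_regression + ss_residual)

print(round(ms_regression, 3), round(ms_residual, 3))
print(round(f_statistic, 3), round(r_square, 3))
```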
Ans c:
The predicted median house price is MEDV = -7.288 (MEDV is measured in $1000s).
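The prediction comes from plugging values into the equation of Ans b. The sketch below shows the mechanics; the input values are not stated in this write-up, so CRIM = 0.1, CHAS = 0 and RM = 3 are an assumption, chosen because they give a result close to the figure reported here.

```python
def predict_medv(crim, chas, rm):
    """Evaluate the fitted equation from Ans b; MEDV is in $1000s."""
    return -34.867 - 0.218 * crim + 3.824 * chas + 9.20 * rm

# Assumed inputs (not given in the text): CRIM = 0.1, CHAS = 0, RM = 3.
predicted = predict_medv(crim=0.1, chas=0, rm=3)
print(round(predicted, 3))
```

With these assumed inputs the result is about -7.29, which is in line with the value given in Ans c.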
Ans d:
i. Several variables measure the level of development and industrialization, and these
are likely to be positively correlated. The correlation matrix shows that INDUS, NOX and
TAX are highly correlated. This is because areas that have a high proportion of non-retail
businesses tend to have higher taxes and more pollution.
INDUS indicates the proportion of non-retail business while NOX indicates the nitric oxide
concentration.
ii. The highly correlated pairs of variables are as follows:
1) NOX and INDUS: Correlation coefficient = 0.764
2) TAX and INDUS: Correlation coefficient = 0.688
3) AGE and NOX: Correlation coefficient = 0.724
4) DIS and NOX: Correlation coefficient = -0.765
5) DIS and AGE: Correlation coefficient = -0.745
6) TAX and RAD: Correlation coefficient = 0.891
The variables INDUS, TAX and NOX all capture the same thing, namely development and
urbanization. They are therefore redundant with one another and need not all be kept when
searching for the best-fit model.
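Screening for highly correlated pairs, as done in part ii, can be sketched in a few lines. The data below are made-up toy values, not the Boston housing data, and the cut-off of 0.65 is an assumption (the write-up does not state the threshold that was used).

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def high_corr_pairs(columns, threshold=0.65):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Toy values for illustration only; NOT taken from the assignment data.
toy = {
    "INDUS": [2.0, 5.0, 8.0, 12.0, 18.0, 20.0],
    "NOX": [0.40, 0.45, 0.52, 0.55, 0.62, 0.70],
    "DIS": [9.0, 7.5, 6.0, 4.0, 2.5, 1.5],
    "CHAS": [0, 1, 0, 0, 1, 0],
}
flagged = high_corr_pairs(toy)
for pair in flagged:
    print(pair)
```

Pairs flagged this way are candidates for dropping one of the two variables, which is the reasoning applied to INDUS, TAX and NOX.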
iii.
Model 1: We have chosen to keep NOX
Variables Entered/Removed(a,b)
Model   Variables Entered                               Variables Removed   Method
1       LSTAT, B, PTRATIO, CRIM, ZN, RM, NOX, DIS(c)    .                   Enter
b. Models are based only on cases for which Approximately 60% of the cases (SAMPLE) = 1
c. All requested variables entered.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .846(a)   .716       .709                4.973
Change Statistics
R Square Change   F Change   df2   Sig. F Change
.716              91.286     289   .000
R is reported for Approximately 60% of the cases (SAMPLE) = 1 (Selected).
a. Predictors: (Constant), LSTAT, B, PTRATIO, CRIM, ZN, RM, NOX, DIS
Model 2: Keeping INDUS
Variables Entered/Removed
Model   Variables Entered                               Variables Removed   Method
1       INDUS, B, PTRATIO, CRIM, RM, ZN, LSTAT, DIS     .                   Enter

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .837(a)   .701       .693                5.104
Change Statistics
R Square Change   F Change   df2   Sig. F Change
.701              84.833     289   .000
R is reported for Approximately 60% of the cases (SAMPLE) = 1 (Selected).
a. Predictors: (Constant), INDUS, B, PTRATIO, CRIM, RM, ZN, LSTAT, DIS
Model 3: Keeping TAX
Variables Entered/Removed
Model   Variables Entered                               Variables Removed   Method
1       TAX, RM, B, ZN, PTRATIO, CRIM, DIS, LSTAT       .                   Enter

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .837(a)   .700       .692                5.114
Change Statistics
R Square Change   F Change
.700              84.361
R is reported for Approximately 60% of the cases (SAMPLE) = 1 (Selected).
a. Predictors: (Constant), TAX, RM, B, ZN, PTRATIO, CRIM, DIS, LSTAT
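The adjusted R square values of Models 1 to 3 can be cross-checked from the reported R square values with the usual formula, adjusted R square = 1 - (1 - R square)(n - 1)/(n - p - 1). The sketch below assumes n = 298 training cases (the total df of 297 in the Ans b ANOVA table, plus one) and p = 8 predictors per model; since the inputs are rounded to three decimals, agreement is only expected to about the third decimal.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Adjusted R square for a model with an intercept."""
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

N_CASES = 298       # assumption: total df of 297 in the training sample, plus one
N_PREDICTORS = 8    # each of Models 1 to 3 enters eight variables

# (reported R square, reported adjusted R square) for each model
models = {
    "Model 1 (NOX)": (0.716, 0.709),
    "Model 2 (INDUS)": (0.701, 0.693),
    "Model 3 (TAX)": (0.700, 0.692),
}
recomputed = {}
for name, (r2, reported) in models.items():
    recomputed[name] = adjusted_r_square(r2, N_CASES, N_PREDICTORS)
    print(name, round(recomputed[name], 3), "reported:", reported)
```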
We find that the adjusted R square value is highest for the following model:
Model Summary(b,c)
Model   R (Selected)   R (Unselected)   R Square      Adjusted R Square   Std. Error of the Estimate
1       .909(a)        .913             .8257330890   .821                3.899
Selected: Approximately 60% of the cases (SAMPLE) = 1. Unselected: Approximately 60% of the cases (SAMPLE) ~= 1.
Change Statistics: df2 = 289, Sig. F Change = .000
a. Predictors: (Constant), CAT. MEDV, CHAS, CRIM, B, DIS, PTRATIO, RM, LSTAT
b. Unless noted otherwise, statistics are based only on cases for which Approximately 60% of the
cases (SAMPLE) = 1.
c. Dependent Variable: MEDV
Coefficients(a,b)
Model          B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.   Tolerance   VIF
1 (Constant)   21.8196835416        4.650                              4.693    .000
  CRIM         -.0916326951         .029         -.088                 -3.113   .002   .747        1.339
  CHAS         2.5602912363         .963         .068                  2.658    .008   .921        1.086
  RM           1.5674889438         .534         .112                  2.938    .004   .418        2.390
  DIS          -.3863479004         .133         -.086                 -2.905   .004   .684        1.462
  PTRATIO      -.3669881155         .124         -.087                 -2.967   .003   .695        1.438
  B            .0072449467          .003         .073                  2.653    .008   .790        1.266
  LSTAT        -.4361544592         .049         -.346                 -8.820   .000   .391        2.559
  CAT. MEDV    12.6070216351        .825         .512                  15.274   .000   .536        1.865
Tolerance and VIF are the Collinearity Statistics.
a. Dependent Variable: MEDV
b. Selecting only cases for which Approximately 60% of the cases (SAMPLE) = 1
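The two collinearity columns of the coefficients output are related by VIF = 1 / Tolerance, which gives a quick consistency check. The sketch below uses the Tolerance and VIF values from that output; the pairing of each value with its variable follows the row order of the output and should be read as an assumption where the extracted layout is unclear.

```python
# (Tolerance, reported VIF) per predictor, as read from the coefficients output
collinearity = {
    "CRIM": (0.747, 1.339),
    "CHAS": (0.921, 1.086),
    "RM": (0.418, 2.390),
    "DIS": (0.684, 1.462),
    "PTRATIO": (0.695, 1.438),
    "B": (0.790, 1.266),
    "LSTAT": (0.391, 2.559),
    "CAT. MEDV": (0.536, 1.865),
}

# VIF is the reciprocal of Tolerance
vif = {name: 1 / tolerance for name, (tolerance, _) in collinearity.items()}
for name, (tolerance, reported) in collinearity.items():
    print(f"{name:10s} 1/Tolerance = {vif[name]:.3f}  reported VIF = {reported}")
```

Small differences in the third decimal come from the Tolerance values being rounded.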
Final model: The variables are CRIM, CHAS, RM, DIS, PTRATIO, B, LSTAT and CAT. MEDV
FOR TRAINING DATA SET:

Z1 = Sum of (MEDV actual - MEDV mean)^2 = 25211.73023
Z2 = Sum of (MEDV actual - MEDV calculated)^2 = 4393.570348
R square = 1 - (Z2/Z1) = 0.825733089
FOR TEST DATA SET:

Z1 = Sum of (MEDV actual - MEDV mean)^2 = 17486.93188
Z2 = Sum of (MEDV actual - MEDV calculated)^2 = 2916.163235
R square = 1 - (Z2/Z1) = 0.833237571
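The R square calculation for both data sets can be reproduced from the Z1 and Z2 totals given above. This is a minimal sketch; the function names are illustrative, and `sums_of_squares` only shows how Z1 and Z2 would be obtained from lists of actual and calculated MEDV values.

```python
def sums_of_squares(actual, calculated):
    """Return (Z1, Z2): total and error sums of squares."""
    mean_actual = sum(actual) / len(actual)
    z1 = sum((a - mean_actual) ** 2 for a in actual)
    z2 = sum((a - c) ** 2 for a, c in zip(actual, calculated))
    return z1, z2

def r_square(z1, z2):
    """R square = 1 - (Z2 / Z1)."""
    return 1 - z2 / z1

# Totals reported in the text for the training and test data sets
training_r2 = r_square(25211.73023, 4393.570348)
test_r2 = r_square(17486.93188, 2916.163235)
print(round(training_r2, 6), round(test_r2, 6))
```

The two values are close to each other, so the model performs about as well on the test data as on the training data.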
