
BAUDM Assignment

Predicting Boston Housing Prices

By:
Suraj Dhende (14PGP015)
Saranya Panicker (14PGP108)
Bharadwaj Sista (14PGP118)
Ashis Tripathy (14PGP121)
Yadvendra Yadav (14PGP123)

Ans a:

The data should be partitioned into training and validation sets because we need two
sets of data: one to build the regression model and the other to test it.
The regression model, fitted on the training set, describes the relationship between the
dependent and the independent variables, whereas running the model on the validation set
determines the accuracy of the model that has been built on the training data set.
In this process, the model (built using the training data set) is used to make predictions
for the validation data, that is, data that were not used to fit the model. In this way we
get an unbiased estimate of how well the model performs. We then compute measures of error,
which reflect the prediction accuracy.
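
The partitioning idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the seed and the shuffle-and-cut approach are hypothetical and are not the SPSS sampling procedure used for the output in this report. RMSE stands in for the "measures of error" mentioned above.

```python
import random
from math import sqrt


def partition(rows, train_fraction=0.6, seed=12345):
    """Shuffle a copy of rows and cut it into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root-mean-squared prediction error over paired values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Ten placeholder case IDs, not real housing records.
case_ids = list(range(1, 11))
train_ids, valid_ids = partition(case_ids)
```

The model would be fitted on the training cases only, and rmse would then be evaluated on the validation cases, which the model has never seen.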
Ans b: The equation is:

MEDV = -34.867 - 0.218*CRIM + 3.824*CHAS + 9.20*RM

Model Summary
Model   R (SAMPLE = 1, Selected)   R Square   Adjusted R Square   Std. Error of the Estimate
1       .747a                      .557       .553                6.161
a. Predictors: (Constant), RM, CHAS, CRIM

ANOVAa,b
Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    14051.438        3     4683.813      123.388   .000c
Residual      11160.292        294   37.960
Total         25211.730        297
a. Dependent Variable: MEDV
b. Selecting only cases for which Approximately 60% of the cases (SAMPLE) = 1
c. Predictors: (Constant), RM, CHAS, CRIM
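
As a cross-check, the summary statistics can be recomputed from the ANOVA sums of squares. The sketch below is ours, not part of the SPSS output; the function name is hypothetical, and the regression df of 3 is inferred from the total and residual df (297 - 294).

```python
def regression_fit_stats(ss_regression, ss_residual, df_regression, df_residual):
    """Recompute R square, adjusted R square and F from ANOVA quantities."""
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual  # equals n - 1
    r2 = ss_regression / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * df_total / df_residual
    f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
    return {"r2": r2, "adj_r2": adj_r2, "f": f_stat}


# Sums of squares and df taken from the ANOVA table for the three-predictor model.
stats = regression_fit_stats(14051.438, 11160.292, 3, 294)
```

Rounded to three decimals, the results agree with the R Square, Adjusted R Square and F values reported by SPSS.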

Ans c:
The predicted median house price (MEDV, in $1000s) is -7.288.
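
The prediction is obtained by substituting the tract's values into the fitted equation. A minimal sketch follows; the function name is hypothetical, and the inputs used (CRIM = 0.1, CHAS = 0, RM = 3) are an assumption on our part, since the tract values are not restated in this answer. With these inputs the equation gives about -7.29, which is close to the value reported above.

```python
# Coefficients of the fitted equation MEDV = -34.867 - 0.218*CRIM + 3.824*CHAS + 9.20*RM
COEFFICIENTS = {"const": -34.867, "CRIM": -0.218, "CHAS": 3.824, "RM": 9.20}


def predict_medv(crim, chas, rm):
    """Evaluate the three-predictor regression equation for one tract."""
    return (
        COEFFICIENTS["const"]
        + COEFFICIENTS["CRIM"] * crim
        + COEFFICIENTS["CHAS"] * chas
        + COEFFICIENTS["RM"] * rm
    )


# Assumed inputs (not given in the text): low crime, not on the river, 3 rooms.
example = predict_medv(crim=0.1, chas=0, rm=3)
```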
Ans d:
i. Certain variables measure the level of development and industrialization, and these
variables are likely to be positively correlated. The correlations show that INDUS, NOX
and TAX are highly correlated. This is because areas that have a high proportion of
non-retail businesses tend to have higher taxes and more pollution. INDUS indicates the
proportion of non-retail business, while NOX indicates the nitric oxide concentration.
ii. The highly correlated variables are as follows:
1) NOX and INDUS: Correlation coefficient = 0.764
2) TAX and INDUS: Correlation coefficient = 0.688
3) AGE and NOX: Correlation coefficient = 0.724
4) DIS and NOX: Correlation coefficient = -0.765
5) DIS and AGE: Correlation coefficient = -0.745
6) TAX and RAD: Correlation coefficient = 0.891
The variables INDUS, TAX and NOX all denote the same thing, namely development and
urbanization, so we can drop the redundant ones to find the best-fit model.
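
The screening for highly correlated pairs can be sketched as follows. This is an illustration only: the helper names and the 0.65 cut-off are our assumptions, and the toy columns are made-up numbers, not rows from the housing data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_corr_pairs(columns, threshold=0.65):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:  # absolute value, so negative pairs are caught too
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Toy values for illustration only; these are NOT housing data.
toy = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises almost linearly with X1
    "X3": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to X1 and X2
}
flagged = high_corr_pairs(toy)
```

Using the absolute value matters here because two of the pairs listed above (DIS with NOX, DIS with AGE) are strongly negatively correlated.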
iii.
Model 1: We have chosen to keep NOX

Variables Entered/Removeda,b
Model   Variables Entered                             Variables Removed   Method
1       LSTAT, B, PTRATIO, CRIM, ZN, RM, NOX, DISc    .                   Enter
a. Dependent Variable: MEDV
b. Models are based only on cases for which Approximately 60% of the cases (SAMPLE) = 1
c. All requested variables entered.

Model Summary
Model   R (SAMPLE = 1, Selected)   R Square   Adjusted R Square   Std. Error of the Estimate
1       .846a                      .716       .709                4.973

Change Statistics
R Square Change   F Change   df1   df2   Sig. F Change
.716              91.286     8     289   .000
a. Predictors: (Constant), LSTAT, B, PTRATIO, CRIM, ZN, RM, NOX, DIS

Model 2: Keeping INDUS

Variables Entered/Removeda,b
Model   Variables Entered                              Variables Removed   Method
1       INDUS, B, PTRATIO, CRIM, RM, ZN, LSTAT, DISc   .                   Enter
a. Dependent Variable: MEDV
b. Models are based only on cases for which Approximately 60% of the cases (SAMPLE) = 1
c. All requested variables entered.

Model Summary
Model   R (SAMPLE = 1, Selected)   R Square   Adjusted R Square   Std. Error of the Estimate
1       .837a                      .701       .693                5.104

Change Statistics
R Square Change   F Change   df1   df2   Sig. F Change
.701              84.833     8     289   .000
a. Predictors: (Constant), INDUS, B, PTRATIO, CRIM, RM, ZN, LSTAT, DIS

Model 3: Keeping TAX

Variables Entered/Removeda,b
Model   Variables Entered                            Variables Removed   Method
1       TAX, RM, B, ZN, PTRATIO, CRIM, DIS, LSTATc   .                   Enter
a. Dependent Variable: MEDV
b. Models are based only on cases for which Approximately 60% of the cases (SAMPLE) = 1
c. All requested variables entered.

Model Summary
Model   R (SAMPLE = 1, Selected)   R Square   Adjusted R Square   Std. Error of the Estimate
1       .837a                      .700       .692                5.114

Change Statistics
R Square Change   F Change   df1   df2   Sig. F Change
.700              84.361     8     289   .000
a. Predictors: (Constant), TAX, RM, B, ZN, PTRATIO, CRIM, DIS, LSTAT

We find that the adjusted R square value is highest for the following model:

Model Summaryb,c
Model   R (SAMPLE = 1, Selected)   R (SAMPLE ~= 1, Unselected)   R Square      Adjusted R Square   Std. Error of the Estimate
1       .909a                      .913                          .8257330890   .821                3.899
a. Predictors: (Constant), CAT. MEDV, CHAS, CRIM, B, DIS, PTRATIO, RM, LSTAT
b. Unless noted otherwise, statistics are based only on cases for which Approximately 60% of the
cases (SAMPLE) = 1.
c. Dependent Variable: MEDV

Coefficientsa,b
             Unstandardized Coefficients   Standardized                     Collinearity Statistics
Model 1      B                Std. Error   Beta           t        Sig.    Tolerance   VIF
(Constant)   21.8196835416    4.650                       4.693    .000
CRIM         -.0916326951     .029         -.088          -3.113   .002    .747        1.339
CHAS         2.5602912363     .963         .068           2.658    .008    .921        1.086
RM           1.5674889438     .534         .112           2.938    .004    .418        2.390
DIS          -.3863479004     .133         -.086          -2.905   .004    .684        1.462
PTRATIO      -.3669881155     .124         -.087          -2.967   .003    .695        1.438
B            .0072449467      .003         .073           2.653    .008    .790        1.266
LSTAT        -.4361544592     .049         -.346          -8.820   .000    .391        2.559
CAT. MEDV    12.6070216351    .825         .512           15.274   .000    .536        1.865
a. Dependent Variable: MEDV
b. Selecting only cases for which Approximately 60% of the cases (SAMPLE) = 1
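
The two collinearity columns carry the same information, since VIF is the reciprocal of Tolerance. The short check below is our own sketch (the names are hypothetical) using the values from the Coefficients table; small gaps are expected because SPSS prints Tolerance to three decimals only.

```python
# (Tolerance, reported VIF) pairs copied from the Coefficients table.
collinearity = {
    "CRIM": (0.747, 1.339),
    "CHAS": (0.921, 1.086),
    "RM": (0.418, 2.390),
    "DIS": (0.684, 1.462),
    "PTRATIO": (0.695, 1.438),
    "B": (0.790, 1.266),
    "LSTAT": (0.391, 2.559),
    "CAT. MEDV": (0.536, 1.865),
}


def vif_from_tolerance(tolerance):
    """Variance inflation factor is the reciprocal of tolerance."""
    return 1.0 / tolerance


# Largest disagreement between the recomputed and the reported VIF.
max_gap = max(
    abs(vif_from_tolerance(tol) - vif) for tol, vif in collinearity.values()
)
```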

Final model: The variables are CRIM, CHAS, RM, DIS, PTRATIO, B, LSTAT and CAT. MEDV.

FOR TRAINING DATA SET:

Z1 = Σ (MEDV actual - MEDV mean)^2 = 25211.73023
Z2 = Σ (MEDV actual - MEDV calculated)^2 = 4393.570348
R square = 1 - (Z2/Z1) = 0.825733089

FOR TEST DATA SET:

Z1 = Σ (MEDV actual - MEDV mean)^2 = 17486.93188
Z2 = Σ (MEDV actual - MEDV calculated)^2 = 2916.163235
R square = 1 - (Z2/Z1) = 0.833237571
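
The R square calculation above can be written as a small sketch. The function names are ours and hypothetical; Z1 and Z2 are the sums reported above, and sums_of_squares shows how they would be obtained from paired actual and calculated MEDV values.

```python
def sums_of_squares(actual, calculated):
    """Return (Z1, Z2): total and residual sums of squares."""
    mean_actual = sum(actual) / len(actual)
    z1 = sum((a - mean_actual) ** 2 for a in actual)
    z2 = sum((a - c) ** 2 for a, c in zip(actual, calculated))
    return z1, z2


def r_squared(z1, z2):
    """R square = 1 - (Z2 / Z1)."""
    return 1.0 - z2 / z1


# Z1 and Z2 as reported for the training and test data sets.
train_r2 = r_squared(25211.73023, 4393.570348)
test_r2 = r_squared(17486.93188, 2916.163235)
```

The test R square is slightly higher than the training R square, so the final model does not lose accuracy on data it was not fitted to.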
