
Realty Mogul: Real Estate Price Prediction with Regression and Classification

Hujia Yu, Jiafu Wu, [hujiay, jiafuwu]@stanford.edu

Motivation

The Ames Assessor's Office released information on houses sold from 2006 to 2010. Housing prices are an important reflection of the economy, and house price ranges are of great interest to both buyers and sellers. In this project, sale prices are predicted from a variety of features of residential houses, both as a continuous response variable and as a multinary response variable, with classes determined by the following price ranges:

[0, 100K), [100K, 150K), [150K, 200K), [200K, 250K), [250K, 300K), [300K, 350K), [350K, inf)

Data and Features

Dataset: residential houses in Ames, Iowa, sold in 2006-2010
- 79 house features
- 1460 houses with sold prices

Preprocessing (sketched in code below):
- Turn categorical data into separate indicator variables.
- Fill in null values as a 0 indicator value.
- Randomly split the 1460 examples into training and testing sets.
- Set aside the sold prices of the testing examples as ground truth.
- Log-transform the sale price so it has an approximately normal distribution for the regression analysis.

Final dataset:
- 288 house features
- 1000 training examples
- 460 testing examples

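A minimal sketch of this preprocessing, assuming pandas and scikit-learn; the file name and the "SalePrice" column label are placeholders, not taken from the poster:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("ames_housing.csv")  # hypothetical file name

# Separate the target and log-transform it for the regression analysis.
# "SalePrice" is an assumed column label.
y = np.log(df.pop("SalePrice"))

# One-hot encode categorical columns into indicator variables,
# then fill any remaining nulls with 0.
X = pd.get_dummies(df).fillna(0)

# Random split: 1000 training and 460 testing examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1000, test_size=460, random_state=0)
```
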
Models

Classification (training sketch below):
- Naive Bayes (Gaussian/Multinomial)
- Multinomial Logistic Regression
- SVM Classification (linear/Gaussian kernel)
- Random Forest Classification: constructs a multitude of decision trees at training time and outputs the majority class decision at test time.

Dimensionality reduction:
- Principal Component Analysis (PCA)

Regression:
- Ridge Regression
- Lasso Regression
- SVM Regression: similar to SVM Classification
- Random Forest Regression: similar to Random Forest Classification

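A minimal sketch of fitting the classification suite and measuring test error rate, assuming scikit-learn and the variables from the preprocessing sketch; binning the prices back into the seven ranges is our reconstruction of the setup:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Bin the (un-logged) sale prices into the seven ranges from Motivation.
bins = [0, 100_000, 150_000, 200_000, 250_000, 300_000, 350_000]
c_train = np.digitize(np.exp(y_train), bins)
c_test = np.digitize(np.exp(y_test), bins)

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Multinomial Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC (linear kernel)": SVC(kernel="linear"),
    "SVC (Gaussian kernel)": SVC(kernel="rbf"),
    "Random Forest Classification": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X_train, c_train)
    error = 1.0 - model.score(X_test, c_test)  # score() is accuracy
    print(f"{name}: error rate {error:.4f}")
```
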
Results

| Classification Model | Error Rate | Classification Model w/ PCA | Error Rate | Regression Model | RMSE |
|---|---|---|---|---|---|
| Gaussian Naive Bayes | 0.7913 | PCA + Gaussian Naive Bayes | 0.5022 | Linear Regression | 0.5501 |
| Multinomial Naive Bayes | 0.4891 | - | - | Lasso | 0.4954 |
| Multinomial Logistic Regression | 0.500 | Multinomial Logistic Regression | 0.4413 | Ridge | 0.5448 |
| SVC (linear kernel) | 0.3260 | SVC (linear kernel) | 0.3087 | SVR (linear kernel) | 0.5522 |
| SVC (Gaussian kernel) | 0.5891 | SVC (Gaussian kernel) | 0.5891 | SVR (Gaussian kernel) | 0.5016 |
| Random Forest Classification | 0.3348 | Random Forest Classification | 0.4326 | Random Forest Regression | 0.5394 |

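For reference, a short sketch of how the two reported metrics are conventionally computed, assuming scikit-learn; RMSE is measured on the log-transformed sale price used throughout the regression analysis:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

def error_rate(c_true, c_pred):
    # Fraction of misclassified price-range labels.
    return 1.0 - accuracy_score(c_true, c_pred)

def rmse(y_true_log, y_pred_log):
    # Root mean squared error on log(sale price).
    return np.sqrt(mean_squared_error(y_true_log, y_pred_log))
```
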
Discussion

Classification: We treated Gaussian Naive Bayes as the baseline, and it performed poorly, with a 0.79 error rate. The best models for this classification problem were SVC with a linear kernel and random forest. One possible cause of the error is that there are too many features (288), which leads to overfitting. We used PCA for dimensionality reduction, and it indeed improved the performance of the models (pipeline sketched below).

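A minimal sketch of the PCA-then-classify pipeline, assuming scikit-learn and the variables from the sketches above; the number of retained components is an assumption, since the poster does not state it:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

clf = make_pipeline(
    StandardScaler(),      # PCA is sensitive to feature scale
    PCA(n_components=50),  # assumed component count; reduces 288 features
    SVC(kernel="linear"),  # the best-performing classifier
)
clf.fit(X_train, c_train)
print("error rate:", 1.0 - clf.score(X_test, c_test))
```
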
Regression: We treated linear regression with all covariates as the baseline, and it produced an RMSE of 0.5501. Overall, most of the regression models gave better results than this baseline, except SVR with a linear kernel, which is not innately suited to a linear data set like this one. Linear regression with the Lasso penalty turned out to perform best, owing to its feature-reduction effect (sketched below). According to our model, the year the house was built had the greatest statistical significance in predicting its sale price.

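A minimal sketch of the Lasso fit and of reading off the most influential features, assuming scikit-learn; using LassoCV to choose the penalty by cross-validation is our assumption, since the poster does not say how it was tuned:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

# Fit Lasso on log prices; the L1 penalty zeroes out weak features.
lasso = LassoCV(cv=5).fit(X_train, y_train)
pred = lasso.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))

# Rank features by absolute coefficient; per the discussion above,
# the year-built feature should rank near the top.
order = np.argsort(-np.abs(lasso.coef_))
for i in order[:5]:
    print(X_train.columns[i], lasso.coef_[i])
```
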
Future

The number of covariates in our dataset is abundant, but feature selection helped constrain the complexity of our models in this setting. With an error rate of around 0.3087, our SVC model with a linear kernel could be used to predict price ranges for future houses in Ames, Iowa.
