This document summarizes models for predicting housing prices using data on over 1,400 homes sold between 2006 and 2010 in Ames, Iowa. Classification models were used to predict price ranges; the best performers were support vector classification (SVC) with a linear kernel and random forest, with error rates of 30.87% (with PCA) and 33.48% respectively. Regression models were also tested to predict continuous sale prices, with Lasso regression performing best at a root mean square error of 0.4954 on log-transformed prices. Dimensionality reduction via principal component analysis improved the performance of some models. Overall, the models generated better predictions than simple linear regression baselines.
Introduction

The Ames Assessor's Office released information on houses sold from 2006 to 2010. Housing prices are an important reflection of the economy, and house price ranges are of great interest to both buyers and sellers. In this project, sale prices are predicted from a variety of features of residential houses, both as a continuous response variable and as a multiclass response, with classes determined by the following price ranges:

[0, 100K), [100K, 150K), [150K, 200K), [200K, 250K), [250K, 300K), [300K, 350K), [350K, inf)

Models

Classification:
- Naive Bayes (Gaussian/Multinomial)
- Multinomial Logistic Regression
- SVM Classification (linear/Gaussian kernel)
- Random Forest Classification: constructs a multitude of decision trees at training time and outputs the class decision at test time

Dimensionality Reduction:
- Principal Component Analysis (PCA)

Regression:
- Ridge Regression
- Lasso Regression
- SVM Regression: similar to SVM Classification
- Random Forest Regression: similar to Random Forest Classification

Data and Features

Dataset: residential houses in Ames, Iowa sold in 2006-2010
- 79 house features
- 1,460 houses with sold prices

Results

Classification: We treated Gaussian Naive Bayes as the baseline, and it performed poorly, with a 0.79 error rate. The best models for this classification problem were SVC with a linear kernel and random forest. One possible cause of the error is that there are too many features (288), which leads to overfitting. We used PCA for dimensionality reduction, and it indeed improved the performance of several models.

Regression: We treated linear regression with all covariates as the baseline, which gave an RMSE of 0.5501. Overall, most of the regression models gave better results than the baseline, except SVR with a linear kernel, which is not innately suitable for a dataset like this. Linear regression with the Lasso penalty turned out to perform the best, due to its built-in feature selection.
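The PCA step described above (projecting a wide design matrix onto a few leading components) can be sketched with plain NumPy via the SVD. The synthetic data, function name, and component count below are illustrative assumptions, not the project's actual pipeline:

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project X onto its top-k principal components.

    Returns the scores (projected data), the components (rows of Vt),
    and the variance explained by each kept component.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                       # n x k projection
    explained = (S[:k] ** 2) / (X.shape[0] - 1)  # per-component variance
    return scores, Vt[:k], explained

# Illustrative data: 200 points that really live on a 2-D plane inside 10-D
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing

scores, components, explained = pca_fit_transform(X, k=2)

# With rank-2 data, two components reconstruct the centered X almost exactly
Xc = X - X.mean(axis=0)
recon_err = np.linalg.norm(Xc - scores @ components) / np.linalg.norm(Xc)
```

Keeping only the leading components discards low-variance directions, which is one way to tame a 288-column dummy-coded design matrix before fitting a classifier.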
Data preprocessing:
- Turn categorical data into separate indicator (dummy) variables.
- Fill in null values as a 0 indicator value.
- Randomly split the 1,460 examples into training and testing sets; set aside the sold prices of the testing examples as ground truth.
- Log-transform the sale price so that it has an approximately normal distribution for regression.

Final dataset: 288 house features, 1,000 training examples, 460 testing examples.

Classification Model               Error Rate
Gaussian Naive Bayes               0.7913
Multinomial Naive Bayes            0.4891
Multinomial Logistic Regression    0.500
SVC (linear kernel)                0.3260
SVC (Gaussian kernel)              0.5891
Random Forest Classification       0.3348

Classification Model w/ PCA        Error Rate
PCA + Gaussian Naive Bayes         0.5022
PCA + Multinomial Naive Bayes      -
PCA + Multinomial Logistic Regr.   0.4413
PCA + SVC (linear kernel)          0.3087
PCA + SVC (Gaussian kernel)        0.5891
PCA + Random Forest Classification 0.4326

Regression Model                   RMSE
Linear Regression                  0.5501
Lasso Regression                   0.4954
Ridge Regression                   0.5448
SVR (linear kernel)                0.5522
SVR (Gaussian kernel)              0.5016
Random Forest Regression           0.5394

Discussion: According to our models, the year a house was built turned out to have the greatest statistical significance for predicting its sale price. The number of covariates in our dataset is abundant, but feature selection helped constrain the complexity of our models in this setting.

Future work: With an error rate of around 0.3087, our SVC model with a linear kernel could be used for price-range predictions for future houses in Ames, Iowa.
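Lasso's edge over plain least squares in these results comes from the L1 penalty driving weak coefficients exactly to zero. A minimal coordinate-descent sketch (toy data and made-up parameter values, not the Ames design matrix) shows that behavior:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; the source of Lasso's sparsity."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by cycling over coordinates."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # residual with feature j's current contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w

# Toy problem: only 2 of 6 features actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ w_true + 0.01 * rng.normal(size=120)

w_hat = lasso_coordinate_descent(X, y, lam=5.0)
# w_hat recovers the two active coefficients and zeros out the rest
```

In the project's setting, this is what lets Lasso cope with 288 dummy-coded features: coefficients for uninformative columns are set exactly to zero, giving the feature-reduction effect credited above.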