This project predicts sale prices for residential homes from information such as the physical structure, location, and condition of each home, in order to guide the estimation of individual house prices and to support studies of housing-market trends.
The goal is to predict the final sale price of each house using 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa; the data were downloaded from the Kaggle website.
Data Files:
- data/train.csv: the training set
- data/test.csv: the test set
- data/data_description.txt: full description of each variable
My analysis includes the following steps (Python code in the .ipynb files):
- Imputation on the missing values: Data Cleaning and Imputation.ipynb
- Data transformation and cleaning: Data Cleaning and Imputation.ipynb
- Prediction using multiple models: Prediction.ipynb, Prediction_PCA.ipynb, XGBoost.ipynb, XGBoost_PCA.ipynb
- Ensemble learning on the model results: Model_Ensemble.ipynb
About 30 predictor variables contain missing values.
Figure 1. Percentage Missingness of Predictors with Missing Values

Steps:
- Predictors with more than 40% missingness were eliminated from the analysis;
- Remaining missing values were imputed:
  - Quantitative variables: the better of lasso and ridge regression (regularization parameter tuned)
  - Categorical variables: logistic regression (regularization parameter tuned)
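The two imputation steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: the column names, the toy data, and the use of `RidgeCV` alone (rather than comparing lasso and ridge) are assumptions for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Toy frame standing in for the Ames data (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotFrontage": rng.normal(70, 10, 100),
    "LotArea": rng.normal(9000, 1500, 100),
    "PoolQC": [np.nan] * 95 + ["Gd"] * 5,   # >40% missing -> dropped
})
df.loc[df.sample(frac=0.1, random_state=0).index, "LotFrontage"] = np.nan

# Step 1: drop predictors with more than 40% missingness.
keep = df.columns[df.isna().mean() <= 0.40]
df = df[keep]

# Step 2: impute a quantitative column from the complete columns using a
# ridge regression whose penalty is tuned by cross-validation.
obs = df["LotFrontage"].notna()
X_obs, X_mis = df.loc[obs, ["LotArea"]], df.loc[~obs, ["LotArea"]]
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_obs, df.loc[obs, "LotFrontage"])
df.loc[~obs, "LotFrontage"] = model.predict(X_mis)
```

A categorical column would be filled the same way with `LogisticRegressionCV` in place of `RidgeCV`.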
Response Variable:
Since the sale price is highly right-skewed, a log transformation was applied to the response.
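As a quick sketch of the response transformation (the exact variant, `log` vs. `log1p`, is an assumption here), predictions made on the log scale are inverted before scoring:

```python
import numpy as np

prices = np.array([125000.0, 180000.0, 755000.0])  # illustrative sale prices
log_prices = np.log1p(prices)       # log(1 + y) compresses the right tail
recovered = np.expm1(log_prices)    # invert after predicting on the log scale
```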
Quantitative Predictors:
- log-transformed if highly right-skewed;
- if a predictor contains many zero values, a binary indicator (zero vs. non-zero) was added.
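The predictor transformations above might look like the sketch below. The column name, the toy data, and the skewness threshold of 0.75 are assumptions; the notebooks may use different cutoffs.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Illustrative predictor with many zeros (e.g. wood-deck square footage).
df = pd.DataFrame({"WoodDeckSF": [0, 0, 120, 0, 200, 340, 0, 80]})

col = "WoodDeckSF"
# Flag zero vs. non-zero before transforming.
df[col + "_has"] = (df[col] > 0).astype(int)
# Log-transform only if the predictor is highly right-skewed
# (the 0.75 threshold is an assumption, not from the source).
if skew(df[col]) > 0.75:
    df[col] = np.log1p(df[col])
```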
Multiple regression models were applied to predict the log-transformed sale price from the predictors:
- Lasso regression
- Ridge regression
- Random forest regressor
- XGBoost
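The model comparison above can be sketched with a cross-validated loop. The synthetic data, hyperparameters, and model subset are placeholders, not the notebook settings; `xgboost.XGBRegressor` is omitted only to keep the sketch dependency-free.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered Ames features / log-price target.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # xgboost.XGBRegressor() would slot in here the same way.
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: CV RMSE = {rmse:.2f}")
```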
Since the predictor variables are highly correlated, transforming the predictors with principal component analysis (PCA) was also examined. However, the PCA-transformed data did not yield better predictions.
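The PCA step can be sketched as below; the synthetic correlated features and the 95% explained-variance cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
# Build highly correlated predictors, mimicking the housing features.
X = np.hstack([base, base + rng.normal(scale=0.05, size=(100, 3))])

# Standardize, then rotate onto orthogonal principal components,
# keeping enough components to explain 95% of the variance.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_pca = pipe.fit_transform(X)
```

The correlated pairs collapse into far fewer uncorrelated components, which the downstream regressions then consume instead of the raw features.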
The predictions of each model on both the training and test sets were saved from the previous step:
- Training set: cross-validation (out-of-fold) predictions
- Test set: predictions from models fit on the full training set
These predictions were then fed into second-level models to generate better predictions on the test set. Models include:
- Lasso regression
- Ridge regression
- Random forest regressor
- XGBoost
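The stacking procedure above can be sketched as follows. The synthetic data, the base-model subset, and ridge as the second-level blender are illustrative assumptions, not the notebook's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the log-price regression problem.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [Lasso(alpha=0.1), Ridge(alpha=1.0),
               RandomForestRegressor(n_estimators=50, random_state=0)]

# Level-1 features: out-of-fold predictions on the training set,
# plain predictions (models refit on all training data) on the test set.
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models])
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base_models])

# Level-2 model blends the base predictions.
blender = Ridge(alpha=1.0).fit(Z_tr, y_tr)
final_pred = blender.predict(Z_te)
```

Using out-of-fold predictions for the training-set features keeps the second-level model from seeing its inputs' own training labels, which would otherwise leak information.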
Root Mean Squared Logarithmic Error (RMSLE) was used to evaluate each model.
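For reference, RMSLE can be computed as below. The `log1p` variant is an assumption; when the model already predicts log-price, the metric reduces to plain RMSE on the log scale.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error on raw (untransformed) prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```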
| Basic Model | RMSLE |
|---|---|
| Lasso | 0.1326 |
| Ridge | 0.1366 |
| Random Forest | 0.1462 |
| XGBoost | 0.1188 |

| Ensemble (Second-Level Model) | RMSLE |
|---|---|
| Lasso | 0.1194 |
| Ridge | 0.1195 |
| Random Forest | 0.1219 |
| XGBoost | 0.1180 |
- Lasso regression and XGBoost generally perform best on this dataset;
- Ensemble learning improves prediction accuracy by integrating the individual models.