This project predicts sale prices for residential homes from information such as the physical structure, location, and condition of each home, in order to guide the estimation of individual house prices and to support studies of housing-market trends.
The goal is to predict the final sale price of each house using 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa; the data were downloaded from the Kaggle website.
Data Files:
- data/train.csv: the training set
- data/test.csv: the test set
- data/data_description.txt: full description of each variable
My analysis includes the following steps (Python code in the .ipynb files):
- Imputation on the missing values: Data Cleaning and Imputation.ipynb
- Data transformation and cleaning: Data Cleaning and Imputation.ipynb
- Prediction using multiple models: Prediction.ipynb, Prediction_PCA.ipynb, XGBoost.ipynb, XGBoost_PCA.ipynb
- Ensemble learning on the model results: Model_Ensemble.ipynb
About 30 predictor variables contain missing values.
Figure 1. Percentage Missingness of Predictors with Missing Values

Steps:
- Predictors with more than 40% missingness were eliminated from the analysis;
- Remaining missing values were imputed:
  - Quantitative variables: the better of lasso and ridge regression (regularization parameter tuned)
  - Categorical variables: logistic regression (regularization parameter tuned)
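The two imputation steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: the column names, the toy data, and the use of `RidgeCV` alone (rather than comparing lasso and ridge) are assumptions for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Toy frame standing in for the Ames data (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotFrontage": rng.normal(70, 10, 100),
    "LotArea": rng.normal(9000, 1500, 100),
    "PoolQC": [np.nan] * 95 + ["Gd"] * 5,   # >40% missing -> dropped
})
df.loc[df.sample(frac=0.1, random_state=0).index, "LotFrontage"] = np.nan

# Step 1: drop predictors with more than 40% missingness.
keep = df.columns[df.isna().mean() <= 0.40]
df = df[keep]

# Step 2: impute a quantitative column from the complete columns using a
# ridge regression whose penalty is tuned by cross-validation.
obs = df["LotFrontage"].notna()
X_obs, X_mis = df.loc[obs, ["LotArea"]], df.loc[~obs, ["LotArea"]]
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_obs, df.loc[obs, "LotFrontage"])
df.loc[~obs, "LotFrontage"] = model.predict(X_mis)
```

A categorical column would be filled the same way with `LogisticRegressionCV` in place of `RidgeCV`.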
Response Variable:
Since the sale price is highly right-skewed, a log transformation was applied to the response.
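As a quick sketch of the response transformation (the exact variant, `log` vs. `log1p`, is an assumption here), predictions made on the log scale are inverted before scoring:

```python
import numpy as np

prices = np.array([125000.0, 180000.0, 755000.0])  # illustrative sale prices
log_prices = np.log1p(prices)       # log(1 + y) compresses the right tail
recovered = np.expm1(log_prices)    # invert after predicting on the log scale
```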
Quantitative Predictors:
- log-transformed if highly right-skewed;
- if a predictor contains many zero values, a binary indicator (zero vs. non-zero) was added.
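The predictor transformations above might look like the sketch below. The column name, the toy data, and the skewness threshold of 0.75 are assumptions; the notebooks may use different cutoffs.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Illustrative predictor with many zeros (e.g. wood-deck square footage).
df = pd.DataFrame({"WoodDeckSF": [0, 0, 120, 0, 200, 340, 0, 80]})

col = "WoodDeckSF"
# Flag zero vs. non-zero before transforming.
df[col + "_has"] = (df[col] > 0).astype(int)
# Log-transform only if the predictor is highly right-skewed
# (the 0.75 threshold is an assumption, not from the source).
if skew(df[col]) > 0.75:
    df[col] = np.log1p(df[col])
```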
Multiple regression models were applied to predict the log-transformed sale price from the predictors:
- Lasso regression
- Ridge regression
- Random forest regressor
- XGBoost
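The model comparison above can be sketched with a cross-validated loop. The synthetic data, hyperparameters, and model subset are placeholders, not the notebook settings; `xgboost.XGBRegressor` is omitted only to keep the sketch dependency-free.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered Ames features / log-price target.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # xgboost.XGBRegressor() would slot in here the same way.
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: CV RMSE = {rmse:.2f}")
```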
Since the predictor variables are highly correlated, transforming the predictors with principal component analysis (PCA) was also examined. However, the PCA-transformed data did not yield better predictions.
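The PCA step can be sketched as below; the synthetic correlated features and the 95% explained-variance cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
# Build highly correlated predictors, mimicking the housing features.
X = np.hstack([base, base + rng.normal(scale=0.05, size=(100, 3))])

# Standardize, then rotate onto orthogonal principal components,
# keeping enough components to explain 95% of the variance.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_pca = pipe.fit_transform(X)
```

The correlated pairs collapse into far fewer uncorrelated components, which the downstream regressions then consume instead of the raw features.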
The predictions of each model on both the training and test sets were saved from the previous step:
- Training set: cross-validation (out-of-fold) predictions
- Test set: predictions from models fit on the full training set
These predictions were then fed into second-level models to generate better predictions on the test set. Models include:
- Lasso regression
- Ridge regression
- Random forest regressor
- XGBoost
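The stacking procedure above can be sketched as follows. The synthetic data, the base-model subset, and ridge as the second-level blender are illustrative assumptions, not the notebook's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the log-price regression problem.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [Lasso(alpha=0.1), Ridge(alpha=1.0),
               RandomForestRegressor(n_estimators=50, random_state=0)]

# Level-1 features: out-of-fold predictions on the training set,
# plain predictions (models refit on all training data) on the test set.
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models])
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base_models])

# Level-2 model blends the base predictions.
blender = Ridge(alpha=1.0).fit(Z_tr, y_tr)
final_pred = blender.predict(Z_te)
```

Using out-of-fold predictions for the training-set features keeps the second-level model from seeing its inputs' own training labels, which would otherwise leak information.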
Root Mean Squared Logarithmic Error (RMSLE) was used to evaluate each model.
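For reference, RMSLE can be computed as below. The `log1p` variant is an assumption; when the model already predicts log-price, the metric reduces to plain RMSE on the log scale.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error on raw (untransformed) prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```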
| Basic Model | RMSLE |
|---|---|
| Lasso | 0.1326 |
| Ridge | 0.1366 |
| Random Forest | 0.1462 |
| XGBoost | 0.1188 |

| Ensemble (Second-Level Model) | RMSLE |
|---|---|
| Lasso | 0.1194 |
| Ridge | 0.1195 |
| Random Forest | 0.1219 |
| XGBoost | 0.1180 |
- Lasso regression and XGBoost generally perform best on this dataset;
- Ensemble learning improves prediction accuracy by integrating the individual models.