Zillow-Kaggle

This repo tackles the first round of Zillow's Home Value Prediction Competition, which challenged competitors to predict the log error between the Zestimate and the actual sale price of each house. Submissions were evaluated on the Mean Absolute Error (MAE) between the predicted and actual log errors. The competition ran on Kaggle from May 2017 to October 2017, and the final private leaderboard was revealed after the evaluation period ended in January 2018.

The stacked model in this repo would have ranked in the top 100 and qualified for the private second round, which involved building a home valuation algorithm from the ground up.

Author: Junjie Dong ([email protected])

Files

  • explanatory.ipynb performs basic exploratory data analysis. The analysis is by no means comprehensive, since most of the ad-hoc analysis was done while extracting features and building models.
  • feature_extraction.ipynb cleans up the raw data, unifies data types and the representation of missing values, and extracts various kinds of features (e.g. interaction features, region-based aggregate features). The features are saved to HDF5 files that the modeling notebooks can load directly; see the sketch after this list.
  • model_lgb.ipynb builds LightGBM models on top of the extracted features.
  • model_catboost.ipynb builds CatBoost models (both single model and ensemble model) on top of the extracted features.
  • stack.py performs simple stacking by taking a linear combination of the predictions from LightGBM and CatBoost.
  • src/data_proc.py includes several helper methods for data cleaning and feature extraction.
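
As a minimal sketch of that HDF5 hand-off between notebooks (the file name, key, and columns below are illustrative, not necessarily the ones used in feature_extraction.ipynb; writing HDF5 with pandas requires the optional PyTables dependency):

```python
import pandas as pd

# In feature_extraction.ipynb: persist the engineered feature matrix.
# "features.h5" and the key "train" are placeholder names.
features = pd.DataFrame({"bathroomcnt": [2.0, 3.5], "region_avg_tax": [5400.0, 7100.0]})
features.to_hdf("features.h5", key="train", mode="w")

# In a modeling notebook: load the same features without re-running extraction.
train_features = pd.read_hdf("features.h5", key="train")
print(train_features.shape)
```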

Models

Both the LightGBM and CatBoost models were tuned primarily with offline cross validation, with no public leaderboard probing or overfitting. The weight for the final linear stack was chosen based on public leaderboard scores. Outliers in the offline training and validation sets were handled carefully so that offline validation is as reliable as possible.
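
One common way to keep offline validation aligned with the leaderboard metric is to filter extreme log errors before cross validation. The sketch below illustrates that idea; the ±0.4 cutoff and the LightGBM parameters are assumptions for illustration, not the repo's actual values:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def offline_cv_mae(X: pd.DataFrame, y: pd.Series, n_splits: int = 5) -> float:
    # Drop extreme log errors so fold scores track the MAE leaderboard metric;
    # the +/-0.4 threshold is illustrative only.
    mask = y.between(-0.4, 0.4)
    X, y = X[mask].reset_index(drop=True), y[mask].reset_index(drop=True)

    scores = []
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(X):
        model = lgb.LGBMRegressor(objective="regression_l1", n_estimators=500)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        scores.append(mean_absolute_error(y.iloc[valid_idx], model.predict(X.iloc[valid_idx])))
    return float(np.mean(scores))
```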

The following table outlines the chosen models' performance on the hidden private test set.

| Model | Private Leaderboard Score | Private Leaderboard Ranking | Percentile (Top) |
| --- | --- | --- | --- |
| LightGBM (single) | 0.0752026 | 332 / 3779 | 8.8% |
| CatBoost (single) | 0.0751456 | 250 / 3779 | 6.6% |
| CatBoost (ensemble x8) | 0.0750750 | 147 / 3779 | 3.9% |
| Stack (LightGBM + CatBoost) | 0.0750213 | 95 / 3779 | 2.5% |
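
The final stack in stack.py is a linear combination of the two base models' predictions. A minimal sketch of that blend (the weight below is illustrative; the repo's weight was chosen from public leaderboard scores):

```python
import numpy as np

def blend(pred_lgb: np.ndarray, pred_cat: np.ndarray, w: float = 0.4) -> np.ndarray:
    """Convex combination of the LightGBM and CatBoost predictions."""
    return w * pred_lgb + (1.0 - w) * pred_cat

# Example: blend(np.array([0.01, 0.03]), np.array([0.02, 0.02]))
```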

For this dataset, CatBoost delivers very strong performance with its default hyperparameters and required almost no tuning. LightGBM, by comparison, needed much more hyperparameter tuning to reach a similar level.
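
For reference, training CatBoost near its defaults is as simple as the sketch below. The synthetic data and column names are placeholders, not the repo's actual features; only the loss function is set, to match the competition's MAE metric:

```python
import pandas as pd
from catboost import CatBoostRegressor

# Tiny synthetic frame just to make the sketch runnable; real features
# come from the HDF5 files produced by feature_extraction.ipynb.
X = pd.DataFrame({
    "bathroomcnt": [2.0, 3.0, 1.0, 2.5],
    "regionidzip": ["96987", "96987", "97319", "97319"],  # categorical
})
y = pd.Series([0.02, -0.01, 0.05, 0.00])  # log error targets

# MAE matches the competition metric; everything else stays at defaults.
model = CatBoostRegressor(loss_function="MAE", verbose=False)
model.fit(X, y, cat_features=["regionidzip"])
print(model.predict(X))
```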

It also turns out that ensembling and stacking were essential to climbing the leaderboard in this competition. There is likely still room for improvement just from tuning the LightGBM hyperparameters and switching to a better stacking method (e.g. training a meta-model on the base models' predictions on a hold-out set).
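
That hold-out meta-model idea could look like the sketch below. This is a possible improvement, not the method implemented in stack.py, and the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_meta(pred_lgb_hold: np.ndarray, pred_cat_hold: np.ndarray,
             y_hold: np.ndarray) -> LinearRegression:
    """Fit a linear meta-model on base-model predictions for a hold-out set."""
    meta_X = np.column_stack([pred_lgb_hold, pred_cat_hold])
    return LinearRegression().fit(meta_X, y_hold)

# At inference time, stack the test-set predictions the same way:
# meta = fit_meta(p_lgb_hold, p_cat_hold, y_hold)
# final = meta.predict(np.column_stack([p_lgb_test, p_cat_test]))
```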
