K-number: 21185545
The goal is to predict the continuous target variable outcome from the remaining columns of a tabular dataset. The evaluation metric is out-of-sample R2 on a held-out test set with hidden true outcomes. The pipeline focuses on consistent preprocessing, direct model comparison under a single validation protocol, and selecting the final model by generalization performance.
The training set has 10,000 rows and 31 columns, with one target, three categorical features, and the remainder numeric. There are no missing values, so no imputation is required. The target outcome has mean about -4.98 and standard deviation about 12.72.
The categorical features are cut, color, and clarity, with 5, 7, and 8 levels respectively. Baseline sklearn models use one-hot encoding with unknown handling at inference. The final model uses CatBoost native categorical handling.
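A minimal sketch of the one-hot baseline preprocessing, using a tiny synthetic frame in place of the real training data (column names follow the report; the specific levels are illustrative). `handle_unknown="ignore"` is what provides the unknown handling at inference: a category unseen at fit time encodes to an all-zero row instead of raising.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small synthetic stand-in for CW1_train.csv.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Ideal"],
    "color": ["E", "F", "G", "E"],
    "depth": [61.5, 62.0, 63.1, 60.9],
})

# One-hot encode the categoricals, pass numeric columns through unchanged.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["cut", "color"])],
    remainder="passthrough",
)
X = pre.fit_transform(df)
print(X.shape)  # 3 cut levels + 3 color levels + 1 numeric column = (4, 7)
```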
Simple correlation checks show the strongest linear association with outcome is from depth (about -0.41), followed by b3, b1, a1, a4, and table. This aligns with the observed high-signal feature subset.
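The correlation screen can be sketched as below, on synthetic data with only two of the report's feature names (depth, b3) for brevity: compute the Pearson correlation of every numeric column with the target and rank by magnitude.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: depth carries a strong negative signal, b3 a weaker one.
rng = np.random.default_rng(123)
n = 1000
depth = rng.normal(62, 1.5, n)
b3 = rng.normal(0, 1, n)
outcome = -3.5 * depth + 2.0 * b3 + rng.normal(0, 5, n)
df = pd.DataFrame({"depth": depth, "b3": b3, "outcome": outcome})

# Pearson correlation of each feature with the target, strongest first.
corrs = df.corr()["outcome"].drop("outcome")
print(corrs.abs().sort_values(ascending=False))
```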
Models were compared using a consistent 80/20 train-validation split with seed 123.
| Model | Validation R2 |
|---|---|
| Linear regression (one-hot) | 0.289 |
| Random forest (one-hot) | 0.454 |
| Histogram gradient boosting (one-hot) | 0.446 |
| Gradient boosting regressor (one-hot) | 0.459 |
| CatBoost regressor (native categoricals) | 0.475 |
CatBoost was selected because it handled mixed tabular features most effectively and remained stable under re-evaluation. A 3-fold cross-validation check gave mean R2 around 0.476 (std about 0.017).
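A stability check like the 3-fold cross-validation above can be sketched with `cross_val_score`; a sklearn gradient booster on synthetic data stands in for the actual CatBoost model here, and `scoring="r2"` matches the evaluation metric.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the real training set.
rng = np.random.default_rng(123)
X = rng.normal(size=(600, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 0.0]) + rng.normal(0, 1.0, 600)

# 3-fold CV with a fixed seed; report mean and spread of out-of-fold R2.
cv = KFold(n_splits=3, shuffle=True, random_state=123)
scores = cross_val_score(GradientBoostingRegressor(random_state=123),
                         X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```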
Training uses a two-level split strategy. A fixed 80/20 split is used for external validation, and inside the 80% training portion a 10% internal validation set is used for early stopping.
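The two-level split can be sketched with two nested `train_test_split` calls on dummy data of the training set's size (the seed and fractions follow the report; the internal split's random state is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the training set's row count.
rng = np.random.default_rng(123)
X = rng.normal(size=(10_000, 3))
y = rng.normal(size=10_000)

# Level 1: fixed 80/20 split for external validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=123)
# Level 2: carve a 10% internal validation set out of the 80% for early stopping.
X_fit, X_es, y_fit, y_es = train_test_split(X_tr, y_tr, test_size=0.1, random_state=123)
print(len(X_fit), len(X_es), len(X_val))  # 7200 800 2000
```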
Selected CatBoost hyperparameters:
depth=8
learning_rate=0.02
l2_leaf_reg=30.428077305821905
bootstrap_type='Bernoulli'
subsample=0.65
rsm=0.85
min_data_in_leaf=12
random_strength=1
boosting_type='Ordered'
leaf_estimation_iterations=1
Early stopping selected best_iteration=913. Final 80/20 validation R2 is approximately 0.4752.
Feature importance from CatBoost ranks depth highest, followed by b3, b1, a1, a3, a4, and table.
The main training and prediction script is:
CW1_eval_script.py
The script reads CW1_train.csv and CW1_test.csv, trains the CatBoost model, and writes predictions. Core dependencies are pandas, numpy, scikit-learn, and catboost.
Run:
python CW1_eval_script.py
The generated submission file in this repository is:
CW1_submission_21185545.csv