K-number: 21185545
The goal is to predict the continuous target variable outcome from the remaining columns of a tabular dataset. The evaluation metric is out-of-sample R2 on a held-out test set with hidden true outcomes. The pipeline focuses on consistent preprocessing, direct model comparison under a single validation protocol, and selecting the final model by generalization performance.
The training set has 10,000 rows and 31 columns, with one target, three categorical features, and the remainder numeric. There are no missing values, so no imputation is required. The target outcome has mean about -4.98 and standard deviation about 12.72.
The categorical features are cut, color, and clarity, with 5, 7, and 8 levels respectively. Baseline sklearn models use one-hot encoding with unknown handling at inference. The final model uses CatBoost native categorical handling.
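A minimal sketch of the one-hot baseline preprocessing, using a tiny synthetic frame in place of the real training data (column names follow the report; the specific levels are illustrative). `handle_unknown="ignore"` is what provides the unknown handling at inference: a category unseen at fit time encodes to an all-zero row instead of raising.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small synthetic stand-in for CW1_train.csv.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Ideal"],
    "color": ["E", "F", "G", "E"],
    "depth": [61.5, 62.0, 63.1, 60.9],
})

# One-hot encode the categoricals, pass numeric columns through unchanged.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["cut", "color"])],
    remainder="passthrough",
)
X = pre.fit_transform(df)
print(X.shape)  # 3 cut levels + 3 color levels + 1 numeric column = (4, 7)
```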
Simple correlation checks show the strongest linear association with outcome is from depth (about -0.41), followed by b3, b1, a1, a4, and table. This aligns with the observed high-signal feature subset.
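The correlation screen can be sketched as below, on synthetic data with only two of the report's feature names (depth, b3) for brevity: compute the Pearson correlation of every numeric column with the target and rank by magnitude.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: depth carries a strong negative signal, b3 a weaker one.
rng = np.random.default_rng(123)
n = 1000
depth = rng.normal(62, 1.5, n)
b3 = rng.normal(0, 1, n)
outcome = -3.5 * depth + 2.0 * b3 + rng.normal(0, 5, n)
df = pd.DataFrame({"depth": depth, "b3": b3, "outcome": outcome})

# Pearson correlation of each feature with the target, strongest first.
corrs = df.corr()["outcome"].drop("outcome")
print(corrs.abs().sort_values(ascending=False))
```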
Models were compared using a consistent 80/20 train-validation split with seed 123.
| Model | Validation R2 |
|---|---|
| Linear regression (one-hot) | 0.289 |
| Random forest (one-hot) | 0.454 |
| Histogram gradient boosting (one-hot) | 0.446 |
| Gradient boosting regressor (one-hot) | 0.459 |
| CatBoost regressor (native categoricals) | 0.475 |
CatBoost was selected because it handled mixed tabular features most effectively and remained stable under re-evaluation. A 3-fold cross-validation check gave mean R2 around 0.476 (std about 0.017).
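A stability check like the 3-fold cross-validation above can be sketched with `cross_val_score`; a sklearn gradient booster on synthetic data stands in for the actual CatBoost model here, and `scoring="r2"` matches the evaluation metric.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the real training set.
rng = np.random.default_rng(123)
X = rng.normal(size=(600, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 0.0]) + rng.normal(0, 1.0, 600)

# 3-fold CV with a fixed seed; report mean and spread of out-of-fold R2.
cv = KFold(n_splits=3, shuffle=True, random_state=123)
scores = cross_val_score(GradientBoostingRegressor(random_state=123),
                         X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```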
Training uses a two-level split strategy. A fixed 80/20 split is used for external validation, and inside the 80% training portion a 10% internal validation set is used for early stopping.
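The two-level split can be sketched with two nested `train_test_split` calls on dummy data of the training set's size (the seed and fractions follow the report; the internal split's random state is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the training set's row count.
rng = np.random.default_rng(123)
X = rng.normal(size=(10_000, 3))
y = rng.normal(size=10_000)

# Level 1: fixed 80/20 split for external validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=123)
# Level 2: carve a 10% internal validation set out of the 80% for early stopping.
X_fit, X_es, y_fit, y_es = train_test_split(X_tr, y_tr, test_size=0.1, random_state=123)
print(len(X_fit), len(X_es), len(X_val))  # 7200 800 2000
```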
Selected CatBoost hyperparameters:
depth=8
learning_rate=0.02
l2_leaf_reg=30.428077305821905
bootstrap_type='Bernoulli'
subsample=0.65
rsm=0.85
min_data_in_leaf=12
random_strength=1
boosting_type='Ordered'
leaf_estimation_iterations=1
Early stopping selected best_iteration=913. Final 80/20 validation R2 is approximately 0.4752.
Feature importance from CatBoost ranks depth highest, followed by b3, b1, a1, a3, a4, and table.
The main training and prediction script is:
CW1_eval_script.py
The script reads CW1_train.csv and CW1_test.csv, trains the CatBoost model, and writes predictions. Core dependencies are pandas, numpy, scikit-learn, and catboost.
Run:
python CW1_eval_script.py
The generated submission file in this repository is:
CW1_submission_21185545.csv