
MLCW1: Regression Model for outcome

K-number: 21185545

Task and Overview

The goal is to predict the continuous variable outcome from the remaining columns of a tabular dataset. The evaluation metric is out-of-sample R2 on a held-out test set with hidden true outcomes. The pipeline focuses on consistent preprocessing, direct model comparison under one validation protocol, and selecting the final model based on generalization performance.

Exploratory Data Analysis and Preprocessing

The training set has 10,000 rows and 31 columns, with one target, three categorical features, and the remainder numeric. There are no missing values, so no imputation is required. The target outcome has mean about -4.98 and standard deviation about 12.72.

The categorical features are cut, color, and clarity, with 5, 7, and 8 levels respectively. Baseline sklearn models use one-hot encoding with unknown handling at inference. The final model uses CatBoost native categorical handling.
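The baseline encoding can be sketched with scikit-learn's `OneHotEncoder`; the toy frame and its levels below are illustrative stand-ins, not the real CW1 data, and `handle_unknown="ignore"` is the assumed mechanism for "unknown handling at inference":

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the categorical columns of CW1_train.csv.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "F", "G"],
})

# handle_unknown="ignore" maps categories unseen at fit time to an
# all-zero row in that feature's block instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(df).toarray()          # 3 rows, 3 + 3 = 6 columns

# "Fair" was not seen during fit, so the cut block encodes to zeros.
X_new = enc.transform(
    pd.DataFrame({"cut": ["Fair"], "color": ["E"]})
).toarray()
```

This keeps test-time rows with unseen category levels usable instead of failing the whole prediction pass.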

Simple correlation checks show that the strongest linear association with outcome comes from depth (about -0.41), followed by b3, b1, a1, a4, and table. The same features dominate the CatBoost feature importances reported below.
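The correlation screen amounts to ranking Pearson correlations with the target by absolute value. A minimal sketch on synthetic data (the negative depth–outcome relationship is simulated to mirror the reported ~-0.41; the real columns are not reproduced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
n = 1000

# Synthetic stand-in: 'depth' is built to correlate negatively with
# 'outcome'; 'noise' is an unrelated numeric column.
depth = rng.normal(size=n)
outcome = -0.5 * depth + rng.normal(size=n)
df = pd.DataFrame({
    "depth": depth,
    "noise": rng.normal(size=n),
    "outcome": outcome,
})

# Pearson correlation of every numeric feature with the target,
# ranked by absolute strength.
corrs = df.corr(numeric_only=True)["outcome"].drop("outcome")
ranked = corrs.abs().sort_values(ascending=False)
```

On the real training set this ranking surfaces depth, b3, b1, a1, a4, and table at the top.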

Model Selection

Models were compared using a consistent 80/20 train-validation split with seed 123.

| Model | Validation R2 |
| --- | --- |
| Linear regression (one-hot) | 0.289 |
| Random forest (one-hot) | 0.454 |
| Histogram gradient boosting (one-hot) | 0.446 |
| Gradient boosting regressor (one-hot) | 0.459 |
| CatBoost regressor (native categoricals) | 0.475 |

CatBoost was selected because it handled mixed tabular features most effectively and remained stable under re-evaluation. A 3-fold cross-validation check gave mean R2 around 0.476 (std about 0.017).
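The 3-fold stability check follows a standard `cross_val_score` pattern. A sketch of that protocol, using synthetic data and a scikit-learn gradient booster as a stand-in for CatBoost (assumed setup, not the coursework script itself):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the CW1 training set.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=123)

# 3 folds, shuffled with a fixed seed so the check is repeatable.
cv = KFold(n_splits=3, shuffle=True, random_state=123)
scores = cross_val_score(GradientBoostingRegressor(random_state=123),
                         X, y, cv=cv, scoring="r2")
mean_r2, std_r2 = scores.mean(), scores.std()
```

Reporting the mean alongside the standard deviation, as the README does (0.476 ± 0.017), is what distinguishes a stable model from a lucky split.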

Model Training and Evaluation

Training uses a two-level split strategy. A fixed 80/20 split is used for external validation, and inside the 80% training portion a 10% internal validation set is used for early stopping.
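The two-level split can be expressed as two chained `train_test_split` calls; the seed 123 on the outer split is stated above, while reusing it for the inner split is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays sized like the 10,000-row training set.
X = np.arange(10_000).reshape(-1, 1)
y = np.zeros(10_000)

# Level 1: fixed 80/20 external validation split.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=123)

# Level 2: carve a 10% internal validation set out of the 80%
# training portion; used only for early stopping.
X_fit, X_es, y_fit, y_es = train_test_split(
    X_train, y_train, test_size=0.10, random_state=123)
```

The external validation set is never shown to the model during fitting or early stopping, so its R2 remains an honest generalization estimate.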

Selected CatBoost hyperparameters:

  • depth=8
  • learning_rate=0.02
  • l2_leaf_reg=30.428077305821905
  • bootstrap_type='Bernoulli'
  • subsample=0.65
  • rsm=0.85
  • min_data_in_leaf=12
  • random_strength=1
  • boosting_type='Ordered'
  • leaf_estimation_iterations=1

Early stopping selected best_iteration=913. Final 80/20 validation R2 is approximately 0.4752.
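These settings are all real CatBoost options and could be wired up as a single keyword dictionary. A sketch of that wiring (the `fit` call is described in a comment rather than executed, so the snippet carries no CatBoost dependency):

```python
# Hyperparameters from the list above, as they would be passed to
# catboost.CatBoostRegressor(**params). Fitting would additionally
# supply cat_features for cut/color/clarity, eval_set=(X_es, y_es),
# and an early-stopping patience to recover best_iteration.
params = {
    "depth": 8,
    "learning_rate": 0.02,
    "l2_leaf_reg": 30.428077305821905,
    "bootstrap_type": "Bernoulli",
    "subsample": 0.65,
    "rsm": 0.85,
    "min_data_in_leaf": 12,
    "random_strength": 1,
    "boosting_type": "Ordered",
    "leaf_estimation_iterations": 1,
}
```

Note that `subsample` is only valid here because `bootstrap_type` is `'Bernoulli'`, and `'Ordered'` boosting is CatBoost's scheme for reducing target-leakage bias on smaller datasets.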

Feature importance from CatBoost ranks depth highest, followed by b3, b1, a1, a3, a4, and table.

Reproducibility

The main training and prediction script is:

  • CW1_eval_script.py

The script reads CW1_train.csv and CW1_test.csv, trains the CatBoost model, and writes predictions. Core dependencies are pandas, numpy, scikit-learn, and catboost.

Run:

python CW1_eval_script.py

The generated submission file in this repository is:

  • CW1_submission_21185545.csv
