# 5. Implementations of Machine Learning models

**Prerequisite:** Files with the selected features, named `patient_genes_[variant].csv`, must be generated as described in Features.

The notebooks in this folder contain the Machine Learning models based on the following algorithms:

- Logistic Regression
- Support Vector Machines (SVM)
- Random Forest

The models are trained using the various `patient_genes` files located in the `/Data` folder. Each model notebook allows easy switching between different feature sets at the top of the notebook:

```python
# Model[Algorithm].ipynb
# To view all available feature set variants, use:
# FeatureVariant.print_info()

# Or refer to the FeatureVariant class in ../DataHelpers.ipynb
GENE_FILE_VARIANT = FeatureVariant.LITERATURE
```

**Note:** The feature variants are mapped to the generated `patient_genes_[variant].csv` files. As noted above, ensure that the corresponding `patient_genes` files have been generated before training or evaluating a model.
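
The real `FeatureVariant` class is defined in `../DataHelpers.ipynb`; the following is only a rough sketch of the variant-to-file mapping it provides (the member names and helper method are illustrative assumptions, not the actual implementation):

```python
from enum import Enum

class FeatureVariant(Enum):
    """Illustrative stand-in for the FeatureVariant class in ../DataHelpers.ipynb."""
    LITERATURE = "literature"
    RESEARCH = "research"

    def gene_file(self) -> str:
        # Maps a variant to its generated feature file in the /Data folder,
        # following the patient_genes_[variant].csv naming convention.
        return f"patient_genes_{self.value}.csv"

print(FeatureVariant.LITERATURE.gene_file())  # patient_genes_literature.csv
```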

General data helper functions are declared in `DataHelpers.ipynb`. These include, but are not limited to:

- Splitting the dataset into 80% training data and 20% testing data
- K-fold cross-validation
- Exporting various evaluation metrics

All splits are stratified to account for the imbalance present in the dataset.
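
The stratified splitting and k-fold setup can be sketched with their scikit-learn equivalents (the toy data and exact settings below are assumptions, not the real helper code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy dataset with the kind of imbalance described above: few TNBC positives.
X = np.arange(40).reshape(-1, 1)
y = np.array([1] * 8 + [0] * 32)

# 80/20 split, stratified so both partitions keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold CV preserves the class ratio within every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold contains at least one positive sample.
    assert y[val_idx].sum() >= 1
```

Without stratification, a random split of such an imbalanced dataset can easily end up with no positive samples at all in a test partition or fold.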

The notebooks for the respective models train the type of model they are named after (as listed above) and use the helper functions to perform the actual data splitting, five-fold cross-validation, and preliminary evaluation.

A summary of the cross-validation results is displayed at the end of each notebook, and all necessary data is exported to the following files, where `[variant]` refers to the type of model trained:

- `model_output_[variant].csv` — the output for each test case selected from the dataset: the actual TNBC value, the predicted value, and the predicted probability
- `model_metrics_[variant].csv` — the accuracy, recall, precision, F1-score, ROC AUC, and the numbers of true positives, true negatives, false positives, and false negatives, for both the initial test data (first row) and the five folds used in cross-validation
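
A minimal sketch of how such an export could look with scikit-learn and pandas (the file names follow the convention above, but the toy predictions and exact column names are assumptions, not the notebooks' actual code):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical predictions for a handful of test cases.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

# model_output_[variant].csv: actual value, predicted value, probability.
pd.DataFrame({"actual": y_true, "predicted": y_pred,
              "probability": y_prob}).to_csv("model_output_example.csv",
                                             index=False)

# model_metrics_[variant].csv: one row of evaluation metrics.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "tp": tp, "tn": tn, "fp": fp, "fn": fn,
}
pd.DataFrame([metrics]).to_csv("model_metrics_example.csv", index=False)
```

In the real notebooks this is repeated once for the initial test split and once per cross-validation fold, producing the multi-row metrics file described above.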

**Next step:** Additional evaluations and visualizations based on the generated data, including combinations across different models, are produced by the notebook in the Evaluation folder, which sits next to this Model folder.

## Key findings

When using the literature-based features derived from the key findings of the previous step, the three trained models (logistic regression, random forest, and SVM) perform very similarly. Overall performance is moderate, or at best slightly better than moderate, which is likely caused by the limited size and relative imbalance of the dataset, which contains only a small number of positive TNBC samples.

Interestingly, perfect recall is seen on the initial split. However, it disappears in cross-validation and when the initial random state is changed. This appears to be an artifact of the random state and the limited number of positive samples, rather than of model quality or training issues.

While the models perform similarly overall, logistic regression is slightly weaker than the others, particularly in precision and F1-score.

## After applying SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a method to address class imbalance. In our case, the majority class consisted of cases without TNBC, while cases with TNBC formed the minority class, which had significantly fewer examples. Instead of simply duplicating examples of the minority class, which can lead to overfitting, SMOTE generates synthetic data.

For the most part, all models show equal or better accuracy and a higher mean CV accuracy when SMOTE is applied, the exception being Logistic Regression with the Literature feature set.

Comparing model performance, SVM and Random Forest consistently outperform Logistic Regression, especially after SMOTE. Since Logistic Regression lags slightly behind, this may suggest that the relationship between features and labels is non-linear to some extent, which SVM and Random Forest handle better.

Because scores improved overall after applying SMOTE, this supports the idea that class imbalance was a limiting factor and that SMOTE helps the models learn more generalizable patterns, especially Random Forest and SVM.

## Breakdown by model

### Logistic Regression

| Feature Set | SMOTE | Accuracy | Mean CV Accuracy |
|---|---|---|---|
| Literature | No | 0.95 | 0.9447 |
| Literature | Yes | 0.94 ⬇️ | 0.9443 ⬇️ |
| Research | No | 0.95 | 0.9386 |
| Research | Yes | 0.96 ⬆️ | 0.9420 ⬆️ |

Findings:

- SMOTE does not help with the Literature feature set.
- On the other hand, SMOTE does help with the Research feature set.
- Overall, SMOTE at worst causes a small drop in scores on the Literature feature set and at best provides a modest benefit on the Research feature set.

### SVM

| Feature Set | SMOTE | Accuracy | Mean CV Accuracy |
|---|---|---|---|
| Literature | No | 0.94 | 0.9488 |
| Literature | Yes | 0.95 ⬆️ | 0.9519 ⬆️ |
| Research | No | 0.94 | 0.9427 |
| Research | Yes | 0.97 ⬆️ | 0.9664 ⬆️ |

Findings:

- As with Random Forest below, both models leverage the improved class balance well.
- This shows in a clear boost in accuracy and mean CV accuracy with SMOTE, especially on the Research feature set.

### Random Forest

| Feature Set | SMOTE | Accuracy | Mean CV Accuracy |
|---|---|---|---|
| Literature | No | 0.93 | 0.9355 |
| Literature | Yes | 0.96 ⬆️ | 0.9681 ⬆️ |
| Research | No | 0.95 | 0.9325 |
| Research | Yes | 0.97 ⬆️ | 0.9727 ⬆️ |

Findings:

- SMOTE clearly helps Random Forest across both feature sets.
- The Research feature set with SMOTE yields the best cross-validation score of all models (0.9727).
- This suggests that Random Forest is robust and makes very good use of SMOTE-generated synthetic data.

---

## Training on all feature sets (mean CV accuracy)

| Model | SMOTE | Boruta | Extra Tree | Automated | ANOVA | RFE | LASSO | Statistical |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | No | 0.9447 | 0.9396 | 0.9386 | 0.9355 | 0.9478 | 0.9550 | 0.9263 |
| Logistic Regression | Yes | 0.9582 | 0.9588 | 0.9780 | 0.9524 | 0.9704 | 0.9623 | 0.9182 |
| SVM | No | 0.9457 | 0.9478 | 0.9232 | 0.9416 | 0.9488 | 0.9509 | 0.9406 |
| SVM | Yes | 0.9727 | 0.9675 | 0.9832 | 0.9664 | 0.9768 | 0.9739 | 0.9217 |
| Random Forest | No | 0.9457 | 0.9457 | 0.8946 | 0.9427 | 0.9488 | 0.9468 | 0.9457 |
| Random Forest | Yes | 0.9733 | 0.9727 | 0.9797 | 0.9704 | 0.9768 | 0.9733 | 0.9635 |

## Training on all models (final selection of feature sets, mean CV accuracy)

| Model | SMOTE | Literature | Research | Statistical | LASSO | Boruta | Automated | RFE |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | No | 0.9447 | 0.9386 | 0.9263 | 0.9550 | 0.9447 | 0.9386 | 0.9478 |
| Logistic Regression | Yes | 0.9443 | 0.9420 | 0.9182 | 0.9623 | 0.9582 | 0.9780 | 0.9704 |
| SVM | No | 0.9488 | 0.9427 | 0.9406 | 0.9509 | 0.9457 | 0.9232 | 0.9488 |
| SVM | Yes | 0.9519 | 0.9664 | 0.9217 | 0.9739 | 0.9727 | 0.9832 | 0.9768 |
| Random Forest | No | 0.9355 | 0.9325 | 0.9457 | 0.9468 | 0.9457 | 0.8946 | 0.9488 |
| Random Forest | Yes | 0.9681 | 0.9727 | 0.9635 | 0.9733 | 0.9733 | 0.9797 | 0.9768 |

## Findings

Across all models and feature selection methods (except for Statistical with Logistic Regression), SMOTE improves the mean cross-validation accuracy. We can conclude that class imbalance was indeed affecting model performance and that SMOTE is a good strategy for addressing it.

Based on the mean CV scores above, almost all feature sets benefit, often greatly, from SMOTE. The following can be concluded for each feature set:

- Literature: overall performance across models is consistent. Does not benefit from SMOTE for Logistic Regression.
- Research: average performance compared with the other feature sets; it sits in the middle. Benefits from SMOTE.
- Statistical: weakest across feature sets. -> Drop or refine?
- LASSO: performs very well with SVM and Random Forest.
- Boruta: performs well with all models, with the added benefit of interpretability.
- Automated (PCA): overall best. Benefits greatly from SMOTE.
- RFE: consistent across models, also with the benefit of interpretability.

Below is an overview of the top 3 performing feature sets for each model:

| Model | 1st | 2nd | 3rd |
|---|---|---|---|
| Logistic Regression | Automated: 0.9780 | RFE: 0.9704 | LASSO: 0.9623 |
| SVM | Automated: 0.9832 | RFE: 0.9768 | LASSO: 0.9739 |
| Random Forest | Automated: 0.9797 | RFE: 0.9768 | Boruta/LASSO: 0.9733 |

Model evaluation and validation will be further expanded in the next step.

---

## Use in presentation

### Logistic Regression (without SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9447 (3rd) |
| Research | 0.9386 (4th) |
| Statistical | 0.9263 (5th) |
| LASSO | 0.9550 (1st) |
| Boruta | 0.9447 (3rd) |
| Automated | 0.9386 (4th) |
| RFE | 0.9478 (2nd) |

### Logistic Regression (with SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9443 (5th) |
| Research | 0.9420 (6th) |
| Statistical | 0.9182 (7th) |
| LASSO | 0.9623 (3rd) |
| Boruta | 0.9582 (4th) |
| Automated | 0.9780 (1st) |
| RFE | 0.9704 (2nd) |

### SVM (without SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9488 (2nd) |
| Research | 0.9427 (4th) |
| Statistical | 0.9406 (5th) |
| LASSO | 0.9509 (1st) |
| Boruta | 0.9457 (3rd) |
| Automated | 0.9232 (6th) |
| RFE | 0.9488 (2nd) |

### SVM (with SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9519 (6th) |
| Research | 0.9664 (5th) |
| Statistical | 0.9217 (7th) |
| LASSO | 0.9739 (3rd) |
| Boruta | 0.9727 (4th) |
| Automated | 0.9832 (1st) |
| RFE | 0.9768 (2nd) |

### Random Forest (without SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9355 (4th) |
| Research | 0.9325 (5th) |
| Statistical | 0.9457 (3rd) |
| LASSO | 0.9468 (2nd) |
| Boruta | 0.9457 (3rd) |
| Automated | 0.8946 (6th) |
| RFE | 0.9488 (1st) |

### Random Forest (with SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9681 (5th) |
| Research | 0.9727 (4th) |
| Statistical | 0.9635 (6th) |
| LASSO | 0.9733 (3rd) |
| Boruta | 0.9733 (3rd) |
| Automated | 0.9797 (1st) |
| RFE | 0.9768 (2nd) |