# 5. Implementations of Machine Learning models

**Prerequisite:** Files with the selected features, named `patient_genes_[variant].csv`, must be generated as described in Features.

The notebooks in this folder contain the Machine Learning models based on the following algorithms:

- Logistic Regression
- Support Vector Machines (SVM)
- Random Forest

The models are trained using the various `patient_genes` files located in the `/Data` folder. Each model notebook allows easy switching between different feature sets at the top of the notebook:

```python
# Model[Algorithm].ipynb
# To view all available feature set variants, use:
# FeatureVariant.print_info()

# Or refer to the FeatureVariant class in ../DataHelpers.ipynb
GENE_FILE_VARIANT = FeatureVariant.LITERATURE
```

**Note:** The feature variants are mapped to the generated `patient_genes_[variant].csv` files. As noted above, ensure that the corresponding `patient_genes` files have been generated before training or evaluating a model.
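
The real `FeatureVariant` class is defined in `../DataHelpers.ipynb`; the following is only a rough sketch of the variant-to-file mapping it provides (the member names and helper method are illustrative assumptions, not the actual implementation):

```python
from enum import Enum

class FeatureVariant(Enum):
    """Illustrative stand-in for the FeatureVariant class in ../DataHelpers.ipynb."""
    LITERATURE = "literature"
    RESEARCH = "research"

    def gene_file(self) -> str:
        # Maps a variant to its generated feature file in the /Data folder,
        # following the patient_genes_[variant].csv naming convention.
        return f"patient_genes_{self.value}.csv"

print(FeatureVariant.LITERATURE.gene_file())  # patient_genes_literature.csv
```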

General data helper functions are declared in `DataHelpers.ipynb`. These include, but are not limited to:

- Splitting the dataset into 80% training data and 20% testing data
- K-fold cross-validation
- Exporting various evaluation metrics

All splits are stratified to account for the imbalance present in the dataset.
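
The stratified splitting and k-fold setup can be sketched with their scikit-learn equivalents (the toy data and exact settings below are assumptions, not the real helper code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy dataset with the kind of imbalance described above: few TNBC positives.
X = np.arange(40).reshape(-1, 1)
y = np.array([1] * 8 + [0] * 32)

# 80/20 split, stratified so both partitions keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold CV preserves the class ratio within every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold contains at least one positive sample.
    assert y[val_idx].sum() >= 1
```

Without stratification, a random split of such an imbalanced dataset can easily end up with no positive samples at all in a test partition or fold.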

The notebooks for the respective models train the type of model they are named after (as listed above) and use the helper functions to perform the actual data splitting, five-fold cross-validation, and preliminary evaluation.

A summary of the cross-validation results is displayed at the end of each notebook, and all necessary data is exported to the following files, where `[variant]` refers to the type of model trained:

- `model_output_[variant].csv` — the output for each test case selected from the dataset: the actual TNBC value, the predicted value, and the predicted probability
- `model_metrics_[variant].csv` — the accuracy, recall, precision, F1-score, ROC AUC, and the numbers of true positives, true negatives, false positives, and false negatives, for both the initial test data (first row) and the five folds used in cross-validation
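
A minimal sketch of how such an export could look with scikit-learn and pandas (the file names follow the convention above, but the toy predictions and exact column names are assumptions, not the notebooks' actual code):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical predictions for a handful of test cases.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

# model_output_[variant].csv: actual value, predicted value, probability.
pd.DataFrame({"actual": y_true, "predicted": y_pred,
              "probability": y_prob}).to_csv("model_output_example.csv",
                                             index=False)

# model_metrics_[variant].csv: one row of evaluation metrics.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "tp": tp, "tn": tn, "fp": fp, "fn": fn,
}
pd.DataFrame([metrics]).to_csv("model_metrics_example.csv", index=False)
```

In the real notebooks this is repeated once for the initial test split and once per cross-validation fold, producing the multi-row metrics file described above.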

**Next step:** Additional evaluations and visualizations based on the generated data, including combinations across different models, are produced by the notebook in the Evaluation folder, which sits next to this Model folder.

## Key findings

When using the literature-based features derived from the key findings of the previous step, the three trained models (logistic regression, random forest, and SVM) perform very similarly. Overall performance is moderate, or at best slightly better than moderate, which is likely caused by the limited size and relative imbalance of the dataset, which contains only a small number of positive TNBC samples.

Interestingly, perfect recall is seen on the initial split. However, it disappears in cross-validation and when the initial random state is changed. This appears to be an artifact of the random state and the limited number of positive samples, rather than of model quality or training issues.

While the models perform similarly overall, logistic regression is slightly weaker than the others, particularly in precision and F1-score.

## After applying SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a method to address class imbalance. In our case, the majority class consisted of cases without TNBC, while cases with TNBC formed the minority class, which had significantly fewer examples. Instead of simply duplicating examples of the minority class, which can lead to overfitting, SMOTE generates synthetic data.

For the most part, all models show equal or better accuracy and a higher mean CV accuracy when SMOTE is applied, the exception being Logistic Regression with the Literature feature set.

Comparing model performance, SVM and Random Forest consistently outperform Logistic Regression, especially after SMOTE. Since Logistic Regression lags slightly behind, this may suggest that the relationship between features and labels is non-linear to some extent, which SVM and Random Forest handle better.

Because scores improved overall after applying SMOTE, this supports the idea that class imbalance was a limiting factor and that SMOTE helps the models learn more generalizable patterns, especially Random Forest and SVM.

## Breakdown by model

### Logistic Regression

| Feature Set | SMOTE | Accuracy | Mean CV Accuracy |
|---|---|---|---|
| Literature | No | 0.95 | 0.9447 |
| Literature | Yes | 0.94 ⬇️ | 0.9443 ⬇️ |
| Research | No | 0.95 | 0.9386 |
| Research | Yes | 0.96 ⬆️ | 0.9420 ⬆️ |

Findings:

- SMOTE does not help with the Literature feature set.
- On the other hand, SMOTE does help with the Research feature set.
- Overall, SMOTE at worst causes a small drop in scores on the Literature feature set and at best provides a modest benefit on the Research feature set.

### SVM

| Feature Set | SMOTE | Accuracy | Mean CV Accuracy |
|---|---|---|---|
| Literature | No | 0.94 | 0.9488 |
| Literature | Yes | 0.95 ⬆️ | 0.9519 ⬆️ |
| Research | No | 0.94 | 0.9427 |
| Research | Yes | 0.97 ⬆️ | 0.9664 ⬆️ |

Findings:

- As with Random Forest below, both models leverage the improved class balance well.
- This shows in a clear boost in accuracy and mean CV accuracy with SMOTE, especially on the Research feature set.

### Random Forest

| Feature Set | SMOTE | Accuracy | Mean CV Accuracy |
|---|---|---|---|
| Literature | No | 0.93 | 0.9355 |
| Literature | Yes | 0.96 ⬆️ | 0.9681 ⬆️ |
| Research | No | 0.95 | 0.9325 |
| Research | Yes | 0.97 ⬆️ | 0.9727 ⬆️ |

Findings:

- SMOTE clearly helps Random Forest across both feature sets.
- The Research feature set with SMOTE yields the best cross-validation score of all models (0.9727).
- This suggests that Random Forest is robust and makes very good use of SMOTE-generated synthetic data.

---

## Training on all feature sets (mean CV accuracy)

| Model | SMOTE | Boruta | Extra Tree | Automated | ANOVA | RFE | LASSO | Statistical |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | No | 0.9447 | 0.9396 | 0.9386 | 0.9355 | 0.9478 | 0.9550 | 0.9263 |
| Logistic Regression | Yes | 0.9582 | 0.9588 | 0.9780 | 0.9524 | 0.9704 | 0.9623 | 0.9182 |
| SVM | No | 0.9457 | 0.9478 | 0.9232 | 0.9416 | 0.9488 | 0.9509 | 0.9406 |
| SVM | Yes | 0.9727 | 0.9675 | 0.9832 | 0.9664 | 0.9768 | 0.9739 | 0.9217 |
| Random Forest | No | 0.9457 | 0.9457 | 0.8946 | 0.9427 | 0.9488 | 0.9468 | 0.9457 |
| Random Forest | Yes | 0.9733 | 0.9727 | 0.9797 | 0.9704 | 0.9768 | 0.9733 | 0.9635 |

## Training on all models (final selection of feature sets, mean CV accuracy)

| Model | SMOTE | Literature | Research | Statistical | LASSO | Boruta | Automated | RFE |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | No | 0.9447 | 0.9386 | 0.9263 | 0.9550 | 0.9447 | 0.9386 | 0.9478 |
| Logistic Regression | Yes | 0.9443 | 0.9420 | 0.9182 | 0.9623 | 0.9582 | 0.9780 | 0.9704 |
| SVM | No | 0.9488 | 0.9427 | 0.9406 | 0.9509 | 0.9457 | 0.9232 | 0.9488 |
| SVM | Yes | 0.9519 | 0.9664 | 0.9217 | 0.9739 | 0.9727 | 0.9832 | 0.9768 |
| Random Forest | No | 0.9355 | 0.9325 | 0.9457 | 0.9468 | 0.9457 | 0.8946 | 0.9488 |
| Random Forest | Yes | 0.9681 | 0.9727 | 0.9635 | 0.9733 | 0.9733 | 0.9797 | 0.9768 |

## Findings

Across all models and feature selection methods (except for Statistical with Logistic Regression), SMOTE improves the mean cross-validation accuracy. We can conclude that class imbalance was indeed affecting model performance and that SMOTE is a good strategy for addressing it.

Based on the mean CV scores above, almost all feature sets benefit, often greatly, from SMOTE. The following can be concluded for each feature set:

- Literature: overall performance across models is consistent. Does not benefit from SMOTE for Logistic Regression.
- Research: average performance compared with the other feature sets; it sits in the middle. Benefits from SMOTE.
- Statistical: weakest across feature sets. -> Drop or refine?
- LASSO: performs very well with SVM and Random Forest.
- Boruta: performs well with all models, with the added benefit of interpretability.
- Automated (PCA): overall best. Benefits greatly from SMOTE.
- RFE: consistent across models, also with the benefit of interpretability.

Below is an overview of the top 3 performing feature sets for each model:

| Model | 1st | 2nd | 3rd |
|---|---|---|---|
| Logistic Regression | Automated: 0.9780 | RFE: 0.9704 | LASSO: 0.9623 |
| SVM | Automated: 0.9832 | RFE: 0.9768 | LASSO: 0.9739 |
| Random Forest | Automated: 0.9797 | RFE: 0.9768 | Boruta/LASSO: 0.9733 |

Model evaluation and validation will be further expanded in the next step.

---

## Use in presentation

### Logistic Regression (without SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9447 (3rd) |
| Research | 0.9386 (4th) |
| Statistical | 0.9263 (5th) |
| LASSO | 0.9550 (1st) |
| Boruta | 0.9447 (3rd) |
| Automated | 0.9386 (4th) |
| RFE | 0.9478 (2nd) |

### Logistic Regression (with SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9443 (5th) |
| Research | 0.9420 (6th) |
| Statistical | 0.9182 (7th) |
| LASSO | 0.9623 (3rd) |
| Boruta | 0.9582 (4th) |
| Automated | 0.9780 (1st) |
| RFE | 0.9704 (2nd) |

### SVM (without SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9488 (2nd) |
| Research | 0.9427 (4th) |
| Statistical | 0.9406 (5th) |
| LASSO | 0.9509 (1st) |
| Boruta | 0.9457 (3rd) |
| Automated | 0.9232 (6th) |
| RFE | 0.9488 (2nd) |

### SVM (with SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9519 (6th) |
| Research | 0.9664 (5th) |
| Statistical | 0.9217 (7th) |
| LASSO | 0.9739 (3rd) |
| Boruta | 0.9727 (4th) |
| Automated | 0.9832 (1st) |
| RFE | 0.9768 (2nd) |

### Random Forest (without SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9355 (4th) |
| Research | 0.9325 (5th) |
| Statistical | 0.9457 (3rd) |
| LASSO | 0.9468 (2nd) |
| Boruta | 0.9457 (3rd) |
| Automated | 0.8946 (6th) |
| RFE | 0.9488 (1st) |

### Random Forest (with SMOTE)

| Feature Set | Mean CV Accuracy |
|---|---|
| Literature | 0.9681 (5th) |
| Research | 0.9727 (4th) |
| Statistical | 0.9635 (6th) |
| LASSO | 0.9733 (3rd) |
| Boruta | 0.9733 (3rd) |
| Automated | 0.9797 (1st) |
| RFE | 0.9768 (2nd) |