4. Feature Selection: selecting genes using different methods

Prerequisite: Preprocessed data must be available in geneDataPreProcessed.csv as described in Preprocessing.

The Jupyter Notebooks in this folder extract the selected features from the RNA Seq files and combine them with the labeled clinical data using the following process:

Load the preprocessed datafile as a Pandas dataframe, keeping only the case ID, TNBC label, and linked RNA Seq filename;
Determine the features to be selected (based on literature, statistical analysis or automated dimensionality reduction, depending on notebook);
Extract the selected features from the RNA Seq files and add these to the labeled clinical dataframe;
Add the matched files to the clinical dataframe, dropping cases where no RNA Seq file is available;
Save the resulting dataframe to a new file patient_genes_[variant].csv in the Data folder (this file is not included in the repository).
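
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the column names case_id, tnbc, rna_seq_file and gene_name, the tab-separated file layout, and the helper functions are all assumptions made for the example. Only the stranded_first column name and the patient_genes_[variant].csv naming come from this README.

```python
import os
import tempfile

import pandas as pd


def extract_features(rna_path, genes, value_column="stranded_first"):
    """Return the expression value of each selected gene from one RNA Seq file."""
    # Assumed layout: tab-separated, one row per gene, with a gene_name column.
    rna = pd.read_csv(rna_path, sep="\t")
    rna = rna.drop_duplicates(subset="gene_name").set_index("gene_name")
    # reindex keeps the requested gene order; genes absent from the file become NaN.
    return rna[value_column].reindex(genes)


def build_feature_table(clinical, rna_dir, genes):
    """Add one column per selected gene, dropping cases without an RNA Seq file."""
    rows = []
    for _, case in clinical.iterrows():
        filename = case["rna_seq_file"]  # hypothetical column name
        if pd.isna(filename):
            continue
        path = os.path.join(rna_dir, filename)
        if not os.path.exists(path):
            continue
        features = extract_features(path, genes)
        rows.append({"case_id": case["case_id"], "tnbc": case["tnbc"], **features.to_dict()})
    return pd.DataFrame(rows, columns=["case_id", "tnbc", *genes])


# Tiny made-up example: one case with an RNA Seq file, one without.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"gene_name": ["ESR1", "GATA3", "TP53"], "stranded_first": [5, 12, 7]}
    ).to_csv(os.path.join(tmp, "a.tsv"), sep="\t", index=False)

    clinical = pd.DataFrame(
        {"case_id": ["C1", "C2"], "tnbc": [True, False], "rna_seq_file": ["a.tsv", None]}
    )
    features_df = build_feature_table(clinical, tmp, ["ESR1", "GATA3"])
    # In the notebooks the output would go to the Data folder instead.
    features_df.to_csv(os.path.join(tmp, "patient_genes_demo.csv"), index=False)
```

In this sketch a gene that is missing from an RNA Seq file ends up as NaN rather than raising an error, which makes it easy to spot genes that are not in the dataset.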

The generated file with the selected features can be loaded with the following code:

import os
import pandas as pd

# replace [variant] with the variant to use as input (literature, statistical, automated)
dataPath = '../Data'
df = pd.read_csv(os.path.join(dataPath, 'patient_genes_[variant].csv'))

Next step: Jupyter Notebooks that train different models based on the selected features can be found in the Model folder that lives next to this Features folder.

Key findings

The RNA Seq files contain gene expression data for 60,000 genes, each of which is reported with several different expression values. Based on [WHAT EXACTLY] we have learned that stranded_first is the most appropriate value to use.

Based on literature, a list of 19 genes was initially selected: TBC1D9, GATA3, SLC16A6, ESR1, INPP4B, SLC44A4, ANXA9, AGR2, MCCC2, TSPAN1, STBD1, MLPH, CACNA2D2, RARA, STARD3, PPP1R14C, LDHB, MFGE8, PSAT1 (SFRS13B is not in the dataset).
Using statistical analysis, a different list was found.
With automated methods of feature selection, specifically PCA, 768 principal components were found to account for 95% of the variance.
In addition to these, use of the raw data (i.e. 60,000 features) was also attempted.
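
To make the 95% variance criterion for PCA concrete, the sketch below counts how many principal components are needed to reach a given share of the total variance. It uses plain NumPy on synthetic data and is only an illustration: the function name and the example data are hypothetical, and the notebooks may well use a library implementation instead.

```python
import numpy as np


def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches the threshold."""
    X_centered = X - X.mean(axis=0)
    # The squared singular values of the centered data are proportional
    # to the variance captured along each principal axis.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # Index of the first component at which the cumulative ratio reaches the threshold.
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))


# Synthetic data: 200 samples, 5 features with very different spreads.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.1, 0.1])
n_components = components_for_variance(X_demo, threshold=0.95)
```

Applied to the real expression matrix (cases as rows, genes as columns), the same calculation gives the number of components to keep for the chosen threshold.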

It seems that using the raw feature set is unreliable, showing unstable recall and precision, likely due to overfitting (the so-called "curse of dimensionality"). Using literature-based features gives a much more stable result and better performance. PCA gives more variable results than the literature-based features, though not quite as variable as the raw data. In addition, using principal components rather than the genes directly makes the results significantly harder to explain.

This suggests that the selection of genes based on literature is the best way to move forward.
