2. Data Loading: Classification and Filtering of clinical data

Prerequisite: Data files (Clinical and RNA sequencing) must have been downloaded as described in Data.

The Jupyter Notebook in this folder will classify and filter the raw clinical data with the following process:

Load the clinical data file as a Pandas dataframe;
Determine TNBC status from the columns er_status_by_ihc, pr_status_by_ihc, and her2_status_by_ihc, dropping indeterminable cases;
Match RNA Seq files to Case IDs using the Metadata file;
Add the matched files to the clinical dataframe, dropping cases where no RNA Seq file is available;
Save the resulting dataframe to a new file clinical.csv in the Data folder (this file is not included in the repository).
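
The classification step can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: it assumes the three status columns contain the strings 'Positive' and 'Negative' (as described under Key findings), and the function name classify_tnbc, the output column tnbc, and the toy rows are hypothetical.

```python
import pandas as pd

# Column names taken from the process description above.
RECEPTOR_COLUMNS = ["er_status_by_ihc", "pr_status_by_ihc", "her2_status_by_ihc"]


def classify_tnbc(df: pd.DataFrame) -> pd.DataFrame:
    """Add a boolean 'tnbc' column (hypothetical name) and drop indeterminable cases."""
    status = df[RECEPTOR_COLUMNS]
    all_negative = (status == "Negative").all(axis=1)  # triple negative
    any_positive = (status == "Positive").any(axis=1)  # definitely not triple negative

    result = df.copy()
    result["tnbc"] = pd.NA  # anything left as NA is indeterminable
    result.loc[all_negative, "tnbc"] = True
    result.loc[any_positive, "tnbc"] = False
    return result.dropna(subset=["tnbc"]).astype({"tnbc": bool})


# Toy data for illustration only; not taken from the clinical file.
toy = pd.DataFrame({
    "case_id": ["A", "B", "C", "D"],
    "er_status_by_ihc": ["Negative", "Positive", "Negative", "Negative"],
    "pr_status_by_ihc": ["Negative", "Negative", "Indeterminate", "Negative"],
    "her2_status_by_ihc": ["Negative", "Negative", "Negative", None],
})
classified = classify_tnbc(toy)
print(classified[["case_id", "tnbc"]])
```

Cases C and D are dropped in this toy example because one receptor status is neither 'Positive' nor 'Negative' and no other column is 'Positive'.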

The resulting preprocessed data file can be loaded with the following code:

import os
import pandas as pd
dataPath = '../Data'
df = pd.read_csv(os.path.join(dataPath, 'clinical.csv'))

Next step: A Jupyter Notebook that scales and standardises the gene expression data can be found in the Preprocessing folder, next to this DataLoading folder.

Key findings

From the available clinical data, 116 cases can be classified as triple negative (TNBC), as all three relevant columns are 'Negative'. A further 863 cases can be classified as not triple negative (meaning at least one 'Positive' value in the relevant columns). This leaves 118 indeterminable cases.

RNA sequencing files are missing for 3 cases (1 TNBC, 1 non-TNBC, 1 indeterminable), leaving a total of 977 classified cases (115 TNBC and 862 non-TNBC) with usable data.
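
The matching step that reveals the missing files could look roughly like the sketch below. It assumes the Metadata file has already been flattened into a table with one row per RNA Seq file; the column names case_id and file_name, the function attach_rna_files, and the toy rows are hypothetical, and keeping only the first file per case is an assumption rather than the notebook's documented behaviour.

```python
import pandas as pd


def attach_rna_files(clinical: pd.DataFrame, metadata: pd.DataFrame):
    """Left-join file names onto clinical rows and drop cases without a file.

    Returns the matched dataframe and the number of cases that had no file.
    """
    # Assumption: keep the first listed file when a case has several.
    files = metadata[["case_id", "file_name"]].drop_duplicates(subset="case_id")
    merged = clinical.merge(files, on="case_id", how="left")
    has_file = merged["file_name"].notna()
    return merged[has_file].reset_index(drop=True), int((~has_file).sum())


# Toy data for illustration only.
clinical = pd.DataFrame({"case_id": ["A", "B", "C"], "tnbc": [True, False, False]})
metadata = pd.DataFrame({"case_id": ["A", "C"], "file_name": ["a.tsv", "c.tsv"]})

matched, n_missing = attach_rna_files(clinical, metadata)
print(matched)
print("cases without an RNA Seq file:", n_missing)
```

A left join is used here so that cases without a file stay visible long enough to be counted before they are dropped.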
