GitHub - CharlieSpackman/SC-GAN: Implementing GANs in the sc-RNA-Seq pipeline

Repository contents:

DataPreprocessing
ModelCreation
ModelEvaluation
.gitignore
README.txt
requirements_R.txt
requirements_python.txt

#------------------
# A. Archive Contents
#------------------

__init__.py - empty file that makes the directory a Python package so that the WGANGP class can be imported
cell_types.csv - annotations (labels) for the training data provided
classification_metrics.py - computes prediction performance metrics based on output from scPred
dimensionality_reduction_evaluation.py - computes dimensionality reduction metrics and reduces the dataset
GSE114725_data_processing.py - pre-processes the filtered imputed values
GSE114725_filter_data.py - removes outlier samples and draws a sample of 10000 items from raw_imputed.csv
requirements_python.txt - list of Python modules used
requirements_R.txt - list of R packages used
scPred.R - trains scPred models on the GAN reduced and baseline data and outputs the cell type predictions
WGANGP.py - main class for training and evaluating the GAN

#------------------
# B. Instructions
#------------------

To run the code, first create a directory with the structure specified in C. Directory Structure.
Once the directory has been created, complete the following steps:

1. Download the raw data (imputed_corrected.csv) from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114725.
2. Filter the raw imputed data for the tumour cells by running GSE114725_filter_data.py.
3. Pre-process the data using GSE114725_data_processing.py.
4. Specify model parameters (or leave the default parameters) in WGANGP.py and run the file to train the model.
All relevant evaluation metrics, images and checkpoints will be located within the models/model_name folder.
The model_name directory is automatically created when running the file.
5. Once the GAN training is complete, update the file names in dimensionality_reduction_evaluation.py and run the file to produce the reduced GAN data and metrics.
6. Update the file names in scPred.R and run the file to train the classification models.
Once completed, predictions will be saved in models/model_name/metrics.
7. Update the file names in classification_metrics.py and run the file to evaluate model performance. Metrics will be saved in models/model_name/metrics.
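
The run order of steps 2 to 7 can be sketched as a small driver script. This is only an illustration: the script names come from section A, but the PIPELINE list, the build_command and run_pipeline helpers, the use of Rscript to run scPred.R, and the assumption that every script is launched from the current directory are hypothetical and not part of the repository. Because steps 5 to 7 require file names to be edited by hand first, the sketch defaults to a dry run and accepts a start index so that the run can be resumed after each manual edit.

```python
import subprocess
import sys

# Script names are taken from section A. The order follows steps 2-7.
# Assumption: all scripts are run from the current working directory.
PIPELINE = [
    ("python", "GSE114725_filter_data.py"),                # step 2
    ("python", "GSE114725_data_processing.py"),            # step 3
    ("python", "WGANGP.py"),                               # step 4
    ("python", "dimensionality_reduction_evaluation.py"),  # step 5
    ("R", "scPred.R"),                                     # step 6
    ("python", "classification_metrics.py"),               # step 7
]


def build_command(kind, script):
    """Return the argument list used to launch one step."""
    if kind == "python":
        return [sys.executable, script]
    # Assumption: R scripts are launched with the Rscript command-line tool.
    return ["Rscript", script]


def run_pipeline(steps=PIPELINE, start=0, dry_run=True):
    """Run the steps in order, beginning at index `start`.

    With dry_run=True the commands are only printed. Otherwise each
    command is executed and the first failure raises CalledProcessError.
    """
    commands = [build_command(kind, script) for kind, script in steps[start:]]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

For example, run_pipeline(start=3, dry_run=False) would resume at dimensionality_reduction_evaluation.py once its file names have been updated.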

After completing the above steps, the model will have been created and evaluated.
The directory models/model_name will contain the following directories:

images - evaluation and training plots
metrics - evaluation metrics for dimensionality reduction and cell classification
data - losses and the Discriminator-reduced data
epochs - checkpoints containing model weights at specific epochs
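
A quick way to confirm that a run produced the expected output layout is to check for these four directories. The following is a minimal sketch: the directory names are taken from the list above, while the missing_output_dirs helper and the example path are hypothetical.

```python
from pathlib import Path

# Sub-directories that models/model_name should contain after a full run.
EXPECTED_SUBDIRS = ("images", "metrics", "data", "epochs")


def missing_output_dirs(model_dir):
    """Return the expected sub-directories that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_SUBDIRS if not (model_dir / name).is_dir()]


if __name__ == "__main__":
    # "models/model_name" is a placeholder; substitute the actual model name.
    missing = missing_output_dirs("models/model_name")
    if missing:
        print("Missing output directories:", ", ".join(missing))
    else:
        print("All output directories are present.")
```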

#------------------
# C. Directory Structure
#------------------

The following directory structure should be created and populated with the required files before running the code.
All files except imputed_corrected.csv can be found in the source code folder.
The imputed_corrected.csv file can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114725.

#------------------
# D. Requirements
#------------------

The modules and packages used in this project are listed in requirements_python.txt and requirements_R.txt.
Users should install these requirements before running the code.
Conda was used to install the Python requirements.
RStudio was used to install the R requirements.
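
Before running the Python scripts, it can help to check which of the listed Python requirements are already installed. The sketch below is an assumption-laden illustration: it assumes requirements_python.txt holds one entry per line in a form such as name, name==version or name=version=build, which is not confirmed by this README, and the requirement_names and missing_distributions helpers are hypothetical. It looks up installed distributions by name with the standard library importlib.metadata module (Python 3.8 or later).

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract package names from requirement lines.

    Assumption: each non-comment line starts with the package name,
    optionally followed by a version specifier (==, =, >=, ...).
    """
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = re.match(r"[A-Za-z0-9_.\-]+", line)
        if match:
            names.append(match.group(0))
    return names


def missing_distributions(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    with open("requirements_python.txt") as handle:
        names = requirement_names(handle)
    print("Not installed:", missing_distributions(names))
```

A name reported as missing may still be importable under a different distribution name, so the output is a starting point for inspection rather than a definitive check.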