Systematic perturbation of cells followed by comprehensive measurements of molecular and phenotypic responses provides informative data resources for constructing computational models of cell biology. Models that generalize well beyond training data can be used to identify combinatorial perturbations of potential therapeutic interest. Major challenges for machine learning on large biological datasets are to find global optima in a complex multi-dimensional space and mechanistically interpret the solutions. To address these challenges, we introduce a hybrid approach that combines explicit mathematical models of cell dynamics with a machine learning framework, implemented in TensorFlow. We tested the modeling framework on a perturbation-response dataset of a melanoma cell line after drug treatments. The models can be efficiently trained to describe cellular behavior accurately. Even though completely data-driven and independent of prior knowledge, the resulting de novo network models recapitulate some known interactions. The approach is readily applicable to various kinetic models of cell biology.
This repository contains the CellBox scripts developed in the Sander lab for the paper published in Cell Systems and also available on bioRxiv.
Yuan, B., Shen, C., Luna, A., Korkut, A., Marks, D., Ingraham, J., Sander, C. CellBox: Interpretable Machine Learning for Perturbation Biology with Application to the Design of Cancer Combination Therapy. Cell Systems, 2020.
Maintained by Bo Yuan, Judy Shen, and Augustin Luna.
If you want to discuss the usage or to report a bug, please use the 'Issues' function here on GitHub.
If you find CellBox useful for your research, please consider citing the corresponding publication.
For more information, please find our contact information here.
Easily try CellBox online with Binder
- Go to: https://mybinder.org/v2/gh/sanderlab/CellBox/9d13f3354f8b14bd896de6c8aa5db0b97c65ad12
- From the New dropdown, click Terminal
- Run the following command for a short example of the model training process:
python scripts/main.py -config=configs/Example.random_partition.json
Alternatively, run the same command from the project folder.
Before installing CellBox, it is good practice to create a Python virtual environment. With conda, the following command creates an environment named cellbox with Python 3.8.0:
conda create -n cellbox python==3.8.0
Activate the environment with:
conda activate cellbox
To install CellBox to a particular folder, type the following:
git clone https://github.com/sanderlab/CellBox.git <folder_name>
cd <folder_name>/cellbox
pip install .
To install CellBox from a particular branch, specify the branch with the '@' notation, as in the following command:
pip install git+https://github.com/sanderlab/CellBox.git@cell_systems_final#egg=cellbox\&subdirectory=cellbox
Alternatively, clone the repository and, in the cellbox folder, run:
python3.6 setup.py install
Only Python 3.6 is supported. Anaconda or pipenv is recommended for creating the Python environment.
Now you can test whether the installation was successful:
import cellbox
cellbox.VERSION
These data files are used for generating the results from the official CellBox paper. Replace these files with your own data.
- node_index.csv: names of each protein/phenotypic node.
- expr_index.txt: information on each perturbation condition. This is one of the original data files we downloaded from the paper and is only used here as a reference for the condition names. In other words, the 2nd and 3rd columns are not used in CellBox.
- loo_label.csv: a deprecated CSV file that stores the actual indexing of perturbation targets, used in the original paper. There are 89 rows corresponding to 89 drug combinations. On each row, two numbers give the indices, among the 12 drugs, of the drugs in that combination. The number 0 denotes no drug, so rows containing a 0 correspond to single drugs.
- expr.csv: protein expression data from RPPA for the protein nodes and phenotypic node values. Each row is a condition and each column is a node.
- pert.csv: perturbation strength and target of all perturbation conditions. Used as input for the differential equations.
- expr_subset.npz and pert_subset.npz: a subset of expr.csv and pert.csv (clarification needed).
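As a minimal sketch of how the two-number rows described for loo_label.csv could be decoded, the snippet below parses a few made-up rows and separates single-drug rows from combinations. It assumes comma-separated rows without a header and drug indices running from 1 to 12; neither detail is confirmed above, and parse_loo_labels is a hypothetical helper, not part of CellBox.

```python
import csv
import io


def parse_loo_labels(lines):
    """Turn rows of two drug indices into tuples of the drugs actually used.

    Assumption: 0 means "no drug", so a row such as "3,0" is a single drug.
    """
    conditions = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        first, second = (int(float(value)) for value in row[:2])
        conditions.append(tuple(d for d in (first, second) if d != 0))
    return conditions


# Made-up rows for illustration only; the real file has 89 rows.
example_rows = io.StringIO("1,0\n3,7\n0,12\n")
conditions = parse_loo_labels(example_rows)
singles = [c for c in conditions if len(c) == 1]
combos = [c for c in conditions if len(c) == 2]
print(singles, combos)
```

To try this on the real file, pass an open file handle instead of the io.StringIO object, after checking whether the file has a header row.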
- CellBox is defined in model.py
- A dataset.factory() function for random partition, leave-one-out, and single-to-combo tasks
- A multiple-substage training process for finding the optimal hyperparameters, defined in train.py
- Make sure to specify the experiment_id and experiment_type
  - experiment_id: name of the experiment, used to generate the results folders
  - experiment_type: currently available tasks are {"random partition", "leave one out (w/o single)", "leave one out (w/ single)", "full data", "single to combo"}
- Different training stages can be specified using stages and sub_stages in the config file
- Other default configurations are defined in config.py
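The two required fields can be checked before a long training run is started. The following is a sketch only: check_config is a hypothetical helper that is not part of CellBox, the experiment name "Example_RP" is made up, and the task names are copied from the list above.

```python
import json

# Task names copied from the list of available experiment types.
AVAILABLE_EXPERIMENT_TYPES = {
    "random partition",
    "leave one out (w/o single)",
    "leave one out (w/ single)",
    "full data",
    "single to combo",
}


def check_config(config):
    """Raise if a required field is missing or the task name is unknown."""
    for key in ("experiment_id", "experiment_type"):
        if key not in config:
            raise KeyError(f"config is missing required field: {key}")
    if config["experiment_type"] not in AVAILABLE_EXPERIMENT_TYPES:
        raise ValueError(f"unknown experiment_type: {config['experiment_type']!r}")
    return config


# Hypothetical config text; a real config file also carries stages, sub_stages, etc.
config_text = '{"experiment_id": "Example_RP", "experiment_type": "random partition"}'
config = check_config(json.loads(config_text))
print(config["experiment_id"], "->", config["experiment_type"])
```

For a file on disk, json.load on an open handle to the config file would replace json.loads here.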
The experiment type configuration file is specified with --experiment_config_path or -config:
python scripts/main.py -config=configs/Example.random_partition.json
Note: always run the script from the root folder.
A random seed can also be assigned using the argument --working_index or -i:
python scripts/main.py -config=configs/Example.random_partition.json -i=1234
When training with leave-one-out validation, make sure to specify the index of the drug to leave out from training with --drug_index or -drug.
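To run the same configuration with several random seeds, the command line can be assembled programmatically. This sketch only builds and prints the commands. build_command is a hypothetical helper, and the -drug=<index> form is an assumption made by analogy with -i=1234, since no -drug example is shown above.

```python
def build_command(config_path, seed=None, drug_index=None):
    """Assemble the argument list for scripts/main.py using the flags above."""
    command = ["python", "scripts/main.py", f"-config={config_path}"]
    if seed is not None:
        command.append(f"-i={seed}")
    if drug_index is not None:
        # Assumed syntax, mirroring the -i=<value> form.
        command.append(f"-drug={drug_index}")
    return command


commands = [
    build_command("configs/Example.random_partition.json", seed=seed)
    for seed in (1, 2, 3)
]
for command in commands:
    print(" ".join(command))
    # To launch training from the repository root, one could instead call
    # subprocess.run(command, check=True) after importing subprocess.
```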
- You should see an experiment folder generated under /results, named using the date and experiment_id.
- Under the experiment folder, you will see different models run with different random seeds.
- Under each model folder, you will have:
  - record_eval.csv: log file with loss changes and time used.
  - random_pos.csv: how the data was split (only for random partitions).
  - best.W, best.alpha, best.eps: model parameter snapshots for each training stage.
  - best.test_hat: prediction on the test set, using the best model for each stage.
  - .ckpt files are the final models in TensorFlow-compatible format.
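The folder layout described above (results, then experiment folder, then one folder per model) can be walked with pathlib to collect every model folder that has a log file. This is a sketch under assumptions: find_model_dirs is a hypothetical helper, and the folder names in the demonstration are made up, since the exact naming scheme is not given here.

```python
import tempfile
from pathlib import Path


def find_model_dirs(results_root):
    """Return folders two levels below results_root that hold record_eval.csv."""
    return sorted(p.parent for p in Path(results_root).glob("*/*/record_eval.csv"))


# Build a throwaway folder tree for illustration; all folder names are made up.
with tempfile.TemporaryDirectory() as tmp:
    for seed_folder in ("seed_000", "seed_001"):
        model_dir = Path(tmp) / "Example_RP_20200101" / seed_folder
        model_dir.mkdir(parents=True)
        (model_dir / "record_eval.csv").write_text("")
    found = [p.name for p in find_model_dirs(tmp)]

print(found)
```

Pointing find_model_dirs at the real results folder would list the model folders, from which record_eval.csv can then be loaded with a CSV reader of choice.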