A foundation model of transcription across human cell types.
This repository contains the official implementation of the model described in our paper: https://www.nature.com/articles/s41586-024-08391-z.
- There was a bug in the data preparation pipeline for the PBMC10x data, leading to random performance on the ATAC prediction task. The reason is that peaks were expected to be sorted as `chr1,chr2,chr3` while the count matrix was not sorted accordingly. This has been fixed and we have updated `predict_atac.ipynb` and `prepare_pbmc.ipynb`. Sorry for the inconvenience. `finetune_pbmc.ipynb` and `pretrain_pbmc.ipynb` are pending an update; we will post a note here once they are updated.
- As a sanity check against this kind of processing bug when working with your own data, we recommend running `predict_atac.ipynb` to train a motif -> ATAC model from scratch. If the data has been properly processed and has decent depth (e.g. > 3M), the performance should rapidly (< 10 epochs) reach ~0.7 Pearson when trained on one cell type with chr10 and chr11 held out.
- `export_config` and `load_config_from_yaml` helper functions have been added to `get_model.config.config` so you can export your customized config as a YAML file and load it back; see the sketch below.
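As a minimal sketch of the round trip (the exact signatures are an assumption here, and `cfg` stands for a config object you have already built; check the docstrings in `get_model.config.config`):

```python
# Hypothetical usage -- argument names/order may differ from the actual helpers.
from get_model.config.config import export_config, load_config_from_yaml

# Export a customized config to YAML so it can be versioned or shared.
export_config(cfg, "my_finetune_config.yaml")

# Load it back later to reproduce the same setup.
cfg = load_config_from_yaml("my_finetune_config.yaml")
```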
- Data processing
- Finetune & Interpretation
- Motif -> ATAC prediction (just for demo, optional)
- Continued pretraining (just for demo, optional)
Note that the `Motif -> ATAC prediction` tutorial has been tested on a MacBook Pro M4 Pro with MPS acceleration. Training and validation iteration speed is close to an RTX 3090; however, some ops used in the metric calculation (Pearson/Spearman/R^2) are not MPS-accelerated, which makes the overall speed somewhat slower.
- Preprocessed tutorial data is available at https://zenodo.org/records/14614947.
- Pretrain data can be found in `s3://2023-get-xf2217/get_demo/pretrain_human_bingren_shendure_apr2023/` (although it is in a deprecated format, which should be loaded with `get_model.dataset.zarr_dataset.RegionDataset` rather than the new `get_model.dataset.zarr_dataset.RegionMotifDataset`; the information they store is the same, we just switched to `zarr` to be future-proof).
- Inference results and checkpoints used in the demo can be found in `s3://2023-get-xf2217/get_demo/`.
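If you have the AWS CLI installed, downloading from the S3 bucket might look like the following (this assumes the bucket allows anonymous reads; otherwise drop `--no-sign-request` and use your own credentials):

```bash
# List what is available in the demo bucket.
aws s3 ls --no-sign-request s3://2023-get-xf2217/get_demo/

# Pull the pretrain data into a local folder.
aws s3 sync --no-sign-request \
  s3://2023-get-xf2217/get_demo/pretrain_human_bingren_shendure_apr2023/ \
  ./pretrain_human_bingren_shendure_apr2023/
```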
If you just need the model and analysis package, you can install it with pip. However, note that the R package `pcalg` is required for the causal analysis and will not be available unless you install it manually.
pip install git+https://github.com/GET-Foundation/get_model.git@master
You can use conda/mamba for environment setup. The `env.yml` file will install the following packages:
- get_model: main model package
- wget: in case you don't have it
- gawk: GNU awk, in case you don't have it
- bedtools
- htslib
- r-pcalg: for causal discovery of motif-motif interaction
- scanpy: for single-cell analysis (optional, required only for the tutorial)
- snapatac2: for scATAC-seq analysis (optional, required only for the tutorial)
If you don't want all of them, you can install just the get_model package with pip.
Note that if you have problems creating the conda/mamba environment, temporarily edit your `~/.condarc` to remove `channel_priority: strict`.
mamba env create -f env.yml
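If you would rather not edit `~/.condarc` by hand, one way to relax the setting (assuming a standard conda/mamba installation) is:

```bash
# Relax strict channel priority, then retry the environment creation.
conda config --set channel_priority flexible
mamba env create -f env.yml
```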
If you are on macOS with Apple Silicon, you can try running the following instead:
mamba env create -f env_osx.yml
# install brew if you haven't
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# install R with brew
brew install r
# install pcalg with bioconductor
R -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org"); BiocManager::install("pcalg")'
# test pcalg loads within R
R -e 'library(pcalg); cat("pcalg loaded successfully\n")'
Alternatively, a Docker image is provided for running the code.
docker pull fuxialexander/get_model:latest
The following starts a bash shell in the container by default:
docker run --entrypoint /bin/bash -it -v /home/xf2217:/home/xf2217 fuxialexander/get_model
You can also start a Jupyter notebook server in the container and access it from your host machine on port 8888:
docker run --entrypoint /opt/conda/bin/jupyter -it -p 8888:8888 -v /home/xf2217:/home/xf2217 fuxialexander/get_model notebook --allow-root --ip 0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.password='' # add password if you want
You can then access the Jupyter notebook server at http://localhost:8888. In VSCode, open a Jupyter notebook, select the kernel to use an existing Jupyter server, and enter http://localhost:8888 as the server URL.
You can also directly access Python in the container with the following command:
docker run -it -v /home/xf2217:/home/xf2217 fuxialexander/get_model /opt/conda/bin/python /some/script/to/run.py
For Singularity, you can pull the Docker image and convert it to a Singularity image:
# module load singularity if needed
singularity pull get.sif docker://fuxialexander/get_model:latest
# start a jupyter notebook server
singularity exec --nv get.sif env JUPYTER_CONFIG_DIR=/tmp/.jupyter /opt/conda/bin/jupyter notebook --allow-root --ip 0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.password=''
# or directly access python
singularity exec --nv get.sif /opt/conda/bin/python
Then test whether CUDA is available and whether the packages are installed correctly:
import torch
torch.cuda.is_available()  # should return True on a GPU machine with --nv
import get_model
import gcell
If you are using VSCode or Cursor as your code editor, you can open a tunnel from inside the Singularity/Docker container:
singularity exec --nv get.sif /bin/bash
# then
code tunnel
# or
cursor tunnel
This enables you to use your local Cursor.app or VSCode.app, with all the Copilot/Jupyter/Debugger features, to access the environment inside the container. You can even access it from your browser.
GET uses a transformer-based architecture with several key components:
- Region Embedding
- Transformer Encoder
- Task-specific heads (Expression, ATAC, etc.)
In the future, nucleotide-level modeling and more modalities (e.g. Hi-C, ChIP-seq) will be incorporated. All model variants will be constructed in a modular and composable way; a minimal sketch of this composition is shown below. For more details, check out the Schematic or Model Architecture.
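To make the modular/composable idea concrete, here is an illustrative sketch of how the three components fit together; all class names, feature dimensions, and head definitions below are hypothetical and do not match the actual `get_model` implementation:

```python
# Illustrative sketch only -- names and dimensions are hypothetical, not the real get_model code.
import torch
import torch.nn as nn

class TinyGETSketch(nn.Module):
    """Region embedding -> transformer encoder -> task-specific heads."""

    def __init__(self, n_motif_features=283, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # Region embedding: project per-region motif features into model space.
        self.region_embed = nn.Linear(n_motif_features, d_model)
        # Transformer encoder over the sequence of regions.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Task-specific heads (swappable per task), one prediction per region.
        self.heads = nn.ModuleDict({
            "expression": nn.Linear(d_model, 2),  # e.g. one value per strand
            "atac": nn.Linear(d_model, 1),
        })

    def forward(self, region_motif):  # (batch, n_regions, n_motif_features)
        x = self.encoder(self.region_embed(region_motif))
        return {name: head(x) for name, head in self.heads.items()}

# Quick shape check with random input.
out = TinyGETSketch()(torch.rand(2, 100, 283))
print({k: v.shape for k, v in out.items()})
```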
We use Hydra for configuration management and command line interface. Hydra provides a flexible way to configure and run experiments by:
- Managing hierarchical configurations through YAML files
- Enabling command line overrides of config values
- Supporting multiple configuration groups
- Allowing dynamic composition of configurations
See the example debug scripts in `get_model/debug/` for how to write a command-line training script.
To run a basic training job in command line:
python get_model/debug/debug_run_region.py --config-name finetune_tutorial stage=fit
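Because the CLI is Hydra-based, you can also override nested config values directly on the command line. The dotted keys below are placeholders, not guaranteed to exist; check the YAML files under `get_model/config/` for the actual key names:

```bash
# Hypothetical overrides -- substitute keys that exist in your chosen config.
python get_model/debug/debug_run_region.py --config-name finetune_tutorial \
  stage=fit \
  training.epochs=10 \
  machine.batch_size=8
```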
GET uses Hydra for configuration management. Key configuration files:
- Base config: `get_model/config/config.py`
- Model configs: `get_model/config/model/*.yaml`
- Dataset configs: `get_model/config/dataset/*.yaml`
See Configuration Guide for more details.
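If you prefer to inspect a composed config from Python or a notebook instead of the CLI, Hydra's compose API can load it. The relative `config_path` and config name below are assumptions; adjust them to your checkout:

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# config_path is resolved relative to the current working directory in a notebook.
with initialize(version_base=None, config_path="get_model/config"):
    cfg = compose(config_name="finetune_tutorial", overrides=["stage=fit"])

print(OmegaConf.to_yaml(cfg))
```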
We use `hatch` to manage the development environment:
hatch env create
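Once the environment is created, you can drop into it (assuming a default hatch setup) with:

```bash
hatch shell
```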
This project is licensed under the CC BY-NC 4.0 License. For commercial use, please contact us.
If you use GET in your research, please cite our paper:
A foundation model of transcription across human cell types. Nature (2024). https://doi.org/10.1038/s41586-024-08391-z
For questions or support, please open an issue or contact [email protected].