Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures
- System requirement: Ubuntu 20.04/Ubuntu 22.04
- Tested GPUs: A100
Create conda environment:
conda create -n physics_condition python=3.9
conda activate physics_condition
pip install -r requirements.txt
Before testing or training your cases, please ensure that the working directory is set to physics_condition.
- Please follow the instructions in the README file of the dynamicPDB repository to download the dataset. Ensure you have sufficient disk space available, as one protein may require over 80GB of storage.
- From the dynamicPDB repository root, run:
- Process the trajectories:
python src/toolbox/processing_dynamicPDB/prep_dynamicPDB.py --dynamic_dir [dynamicPDB dataset root] --outdir [DIR] --simulation_suffix [simulation suffix]
- Process the physics information (we extract the physics information of Cα):
python src/toolbox/processing_dynamicPDB/atom_select.py --dynamic_dir [dynamicPDB dataset root]
This will preprocess the dynamicPDB trajectories into .npz files. The physics properties (velocity and force) will be packed as .pkl files.
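To sanity-check the preprocessing output, the following minimal sketch lists what a generated file contains. It assumes nothing about the array names or the pickle schema, which this README does not specify; `list_npz_arrays` and `describe_pickle` are hypothetical helper names, not part of the repository. It relies only on the fact that an .npz file is a zip archive with one `.npy` member per array.

```python
import pickle
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz archive without loading them."""
    with zipfile.ZipFile(path) as archive:
        # NumPy stores each array as a "<name>.npy" member of the zip archive.
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )


def describe_pickle(path):
    """Load a pickle and report its top-level type (and its keys, if it is a dict)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)  # only unpickle files you trust
    keys = sorted(map(str, obj)) if isinstance(obj, dict) else None
    return type(obj).__name__, keys
```

Unpickling requires that the libraries used to create the objects (for example NumPy) are importable in the current environment.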
- Download the OmegaFold weights and install the modified OmegaFold repository.
wget https://helixon.s3.amazonaws.com/release1.pt
git clone https://github.com/bjing2016/OmegaFold
pip install --no-deps -e OmegaFold
- Run OmegaFold to make the embeddings:
python src/toolbox/processing_dynamicPDB/extract_embedding.py --reference_only --out_dir_root=./dataset/embeddings --lm_weights_path [OmegaFold weight] --data_csv_path [data csv path] --simulation_suffix [simulation params]
The data csv is organized as:

| name | seqres | seq_len |
|---|---|---|
| 1ab1_A | TTCCPSIVA... | 415 |
| ... | | |

- These datasets should be organized as follows:
./dynamicPDB/
|-- 1ab1_A_npt1000000.0_ts0.001
|   |-- 1ab1_A_npt_sim_data
|   |   |-- 1ab1_A_npt_sim_0.dat
|   |   `-- ...
|   |-- 1ab1_A_dcd
|   |   |-- 1ab1_A_dcd_0.dcd
|   |   `-- ...
|   |-- 1ab1_A_T
|   |   |-- 1ab1_A_T_0.pkl
|   |   `-- ...
|   |-- 1ab1_A_F
|   |   |-- 1ab1_A_F_0.pkl
|   |   `-- ...
|   |-- 1ab1_A_V
|   |   |-- 1ab1_A_V_0.pkl
|   |   `-- ...
|   |-- 1ab1_A.pdb
|   |-- 1ab1_A_minimized.pdb
|   |-- 1ab1_A_nvt_equi.dat
|   |-- 1ab1_A_npt_equi.dat
|   |-- 1ab1_A_T.dcd
|   |-- 1ab1_A_T.pkl
|   |-- 1ab1_A_F.pkl
|   |-- 1ab1_A_V.pkl
|   `-- 1ab1_A_state_npt1000000.0.xml
|-- 1uoy_A_npt1000000.0_ts0.001
|   |-- ...
|   `-- ...
`-- ...
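Before training, it can help to verify that a protein directory matches the layout above. The sketch below checks for a subset of the entries shown in the tree; `missing_entries` is a hypothetical helper, not part of the repository, and the list of expected names is copied from the example layout rather than from any official specification.

```python
from pathlib import Path


def missing_entries(protein_dir, name):
    """Return the expected files/directories that are absent from protein_dir."""
    expected = [
        # Per-chunk directories from the example layout.
        f"{name}_npt_sim_data", f"{name}_dcd",
        f"{name}_T", f"{name}_F", f"{name}_V",
        # Top-level structure, trajectory, and physics files.
        f"{name}.pdb", f"{name}_minimized.pdb", f"{name}_T.dcd",
        f"{name}_T.pkl", f"{name}_F.pkl", f"{name}_V.pkl",
    ]
    root = Path(protein_dir)
    return [entry for entry in expected if not (root / entry).exists()]
```

For example, `missing_entries("./dynamicPDB/1ab1_A_npt1000000.0_ts0.001", "1ab1_A")` returns an empty list when every listed entry is present.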
- Optionally, you could consolidate all the information into the relevant .csv files and apply filtering based on specific conditions, or you could directly use the provided train.csv and test.csv files for training and inference in examples/atlas_visual_se3_filter.csv.
python src/toolbox/processing_atlas/merge_csv.py --csv atlas.csv --atlas_dir ./dataset/atlas/ --save_path merged.csv --processed_npz ./dataset/processed_npz --embeddings ./dataset/embeddings --simulation_suffix [simulation params]
The merged .csv file will be formed as:

| name | seqres | seq_len | dynamic_npz | embed_path | pdb_path | vel_path | force_path |
|---|---|---|---|---|---|---|---|
| 1ab1_A | TTCCPSIVA... | 46 | .npz | .npz | .npz | .pkl | .pkl |
| ... | | | | | | | |
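As one illustration of filtering on a specific condition, the sketch below keeps only rows whose `seq_len` falls within a given range. It assumes the merged file is comma-delimited with a header row containing `seq_len`; the function name `filter_by_length` and the default bounds are illustrative choices, not values taken from the repository.

```python
import csv


def filter_by_length(in_csv, out_csv, min_len=0, max_len=256):
    """Copy rows whose seq_len lies in [min_len, max_len]; return the number kept."""
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # Reuse the input header so every column is preserved unchanged.
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            if min_len <= int(row["seq_len"]) <= max_len:
                writer.writerow(row)
                kept += 1
    return kept
```

The filtered file can then be referenced from the configuration in place of the full merged file.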
Follow Data Preparation to get the data ready, then update the data .csv path in the configuration YAML files or change it in the training scripts. Start training with the following command:
cd applications/physics_condition
bash scripts/run_train.sh
Note: Ensure that the number of devices listed in CUDA_VISIBLE_DEVICES, nproc_per_node, experiment.num_gpus, and experiment.batch_size are all set to the same value.
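The consistency requirement in the note can be checked before launching a run. The sketch below is a hypothetical helper (not part of the training scripts) and assumes the note means that the count of device IDs in CUDA_VISIBLE_DEVICES must equal the other three settings; the values are passed in explicitly rather than read from the repository's configuration files.

```python
import os


def check_gpu_settings(nproc_per_node, num_gpus, batch_size, visible=None):
    """Return a description of each setting that differs from the visible device count."""
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    # Count the non-empty, comma-separated device IDs, e.g. "0,1,2,3" -> 4.
    n_devices = len([d for d in visible.split(",") if d.strip()])
    settings = {
        "nproc_per_node": nproc_per_node,
        "experiment.num_gpus": num_gpus,
        "experiment.batch_size": batch_size,
    }
    return [
        f"{key}={value} (expected {n_devices})"
        for key, value in settings.items()
        if value != n_devices
    ]
```

An empty return value means all four settings agree.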
To get the evaluation metrics, run the following command:
cd applications/physics_condition
bash scripts/run_eval.sh
We present the 3D structures predicted by our method and by SE(3)-Trans.
[Figure: predicted 3D structures shown in three columns: SE(3) Trans | Ours | Ground Truth]
If you find our work useful for your research, please consider citing the paper:
@misc{liu2024dynamicpdbnewdataset,
title={Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures},
author={Ce Liu and Jun Wang and Zhiqiang Cai and Yingxu Wang and Huizhen Kuang and Kaihui Cheng and Liwei Zhang and Qingkun Su and Yining Tang and Fenglei Cao and Limei Han and Siyu Zhu and Yuan Qi},
year={2024},
eprint={2408.12413},
archivePrefix={arXiv},
primaryClass={q-bio.BM},
}
We would like to thank the contributors to the openfold, AlphaFlow, EigenFold, and SE3-Diffusion repositories for their open research and exploration. If we have missed any open-source projects or related articles, we will add the corresponding acknowledgement promptly.