
Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures

Ce Liu1*  Jun Wang1*  Zhiqiang Cai1*  Yingxu Wang1,3  Huizhen Kuang2  Kaihui Cheng2  Liwei Zhang1
Qingkun Su1  Yining Tang2  Fenglei Cao1  Limei Han2  Siyu Zhu2†  Yuan Qi2†
1Shanghai Academy of AI for Science  2Fudan University  3MBZUAI
*Equal Contribution  †Corresponding Author


🔧️ Framework

framework

⚙️ Installation

  • System requirement: Ubuntu 20.04/Ubuntu 22.04
  • Tested GPUs: A100

Create conda environment:

  conda create -n physics_condition python=3.9
  conda activate physics_condition
  pip install -r requirements.txt

🗝️️ Usage

Before testing or training on your own cases, make sure the working directory is set to physics_condition, then follow these steps:

  1. Data Preparation.
  2. Training.
  3. Inference.

📥 Data Preparation

Downloading datasets

  • Please follow the instructions in the README file of the dynamicPDB repository to download the dataset. Ensure you have sufficient disk space available, as one protein may require over 80GB of storage.

  • From the dynamicPDB repository root, run:

    • Process the trajectories:
      python src/toolbox/processing_dynamicPDB/prep_dynamicPDB.py --dynamic_dir [dynamicPDB dataset root] --outdir [DIR] --simulation_suffix [simulation suffix]
      
    • Extract the physics information:
      python src/toolbox/processing_dynamicPDB/atom_select.py --dynamic_dir [dynamicPDB dataset root]
      

    These commands preprocess the dynamicPDB trajectories into .npz files and pack the physics properties (velocity and force) into .pkl files.
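After preprocessing, it can help to confirm that the outputs are readable before moving on. The sketch below is an illustration, not part of the repository: the helper names `npz_summary` and `pkl_summary` are hypothetical, and the README does not document the array names or the pickle layout, so the code only reports what it finds. It relies on the fact that an .npz file is a zip archive of .npy members and reads just each member's header.

```python
import ast
import pickle
import struct
import zipfile


def npz_summary(path):
    """Map each array name in an .npz archive to its (shape, dtype string)."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as handle:
                if handle.read(6) != b"\x93NUMPY":
                    raise ValueError(f"{member} is not a .npy file")
                major, _minor = handle.read(2)
                # Format 1.x stores the header length in 2 bytes, later versions in 4.
                if major == 1:
                    (header_len,) = struct.unpack("<H", handle.read(2))
                else:
                    (header_len,) = struct.unpack("<I", handle.read(4))
                # The header is a Python dict literal with descr, fortran_order, shape.
                header = ast.literal_eval(handle.read(header_len).decode("latin1").strip())
            summary[member[: -len(".npy")]] = (header["shape"], header["descr"])
    return summary


def pkl_summary(path):
    """Return the top-level type name and, when available, its shape or length."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)  # only unpickle files you produced yourself
    size = getattr(obj, "shape", None)
    if size is None and hasattr(obj, "__len__"):
        size = len(obj)
    return type(obj).__name__, size
```

Calling `npz_summary` on one of the files written to the `--outdir` directory, and `pkl_summary` on one of the velocity or force .pkl files, gives a quick view of what was stored without loading the full trajectory into memory.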

Extract Sequence Embeddings

  • Download the OmegaFold weights and install the modified OmegaFold repository.

    wget https://helixon.s3.amazonaws.com/release1.pt
    git clone https://github.com/bjing2016/OmegaFold
    pip install --no-deps -e OmegaFold
    
  • Run OmegaFold to make the embeddings:

    python src/toolbox/processing_dynamicPDB/extract_embedding.py --reference_only --out_dir_root=./dataset/embeddings --lm_weights_path [OmegaFold weight] --data_csv_path [data csv path]  --simulation_suffix [simulation params]
    

    The data csv is organized as:

    name seqres seq_len
    1ab1_A TTCCPSIVA... 46
    ...
  • These datasets should be organized as follows:

    ./dynamicPDB/
    |-- 1ab1_A_npt1000000.0_ts0.001
    |   |-- 1ab1_A_npt_sim_data
    |   |   |-- 1ab1_A_npt_sim_0.dat
    |   |   `-- ...
    |   |-- 1ab1_A_dcd
    |   |   |-- 1ab1_A_dcd_0.dcd
    |   |   `-- ...
    |   |-- 1ab1_A_T
    |   |   |-- 1ab1_A_T_0.pkl
    |   |   `-- ...
    |   |-- 1ab1_A_F
    |   |   |-- 1ab1_A_F_0.pkl
    |   |   `-- ...
    |   |-- 1ab1_A_V
    |   |   |-- 1ab1_A_V_0.pkl
    |   |   `-- ...
    |   |-- 1ab1_A.pdb
    |   |-- 1ab1_A_minimized.pdb
    |   |-- 1ab1_A_nvt_equi.dat
    |   |-- 1ab1_A_npt_equi.dat
    |   |-- 1ab1_A_T.dcd
    |   |-- 1ab1_A_T.pkl
    |   |-- 1ab1_A_F.pkl
    |   |-- 1ab1_A_V.pkl
    |   `-- 1ab1_A_state_npt1000000.0.xml
    |-- 1uoy_A_npt1000000.0_ts0.001
    |   |-- ...
    |   `-- ...
    `-- ...
    
    • Optionally, consolidate all of this information into a single .csv file and filter it by specific conditions with the command below. Alternatively, use the provided train.csv and test.csv files directly for training and inference (see examples/atlas_visual_se3_filter.csv).
    python src/toolbox/processing_atlas/merge_csv.py  --csv atlas.csv  --atlas_dir ./dataset/atlas/ --save_path merged.csv --processed_npz ./dataset/processed_npz --embeddings ./dataset/embeddings --simulation_suffix [simulation params]
    

    The merged .csv file has the following format:

    name seqres seq_len dynamic_npz embed_path pdb_path vel_path force_path
    1ab1_A TTCCPSIVA... 46 .npz .npz .npz .pkl .pkl
    ...
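Before training, a quick consistency pass over the merged .csv can catch broken rows early. The sketch below is a minimal illustration under stated assumptions: the function name `check_merged_csv` is hypothetical, the file is assumed to be comma-separated, and `seq_len` is assumed to be stored as a plain integer. The column names are the ones listed in the merged format above.

```python
import csv
from pathlib import Path

# Path-valued columns taken from the merged .csv header described above.
PATH_COLUMNS = ("dynamic_npz", "embed_path", "pdb_path", "vel_path", "force_path")


def check_merged_csv(csv_path, check_files=True):
    """Return a list of human-readable problems found in the merged CSV."""
    problems = []
    with open(csv_path, newline="") as handle:
        # Row 1 is the header, so data rows start at 2.
        for row_number, row in enumerate(csv.DictReader(handle), start=2):
            name = row.get("name") or f"row {row_number}"
            seqres = row.get("seqres") or ""
            seq_len = row.get("seq_len") or ""
            if not seq_len.isdigit() or int(seq_len) != len(seqres):
                problems.append(
                    f"{name}: seq_len={seq_len!r} but seqres has {len(seqres)} residues"
                )
            if check_files:
                for column in PATH_COLUMNS:
                    value = row.get(column) or ""
                    if not value or not Path(value).is_file():
                        problems.append(f"{name}: {column} missing or not a file: {value!r}")
    return problems
```

An empty list from `check_merged_csv("merged.csv")` means every row passed both checks. Pass `check_files=False` to check only the sequence lengths, for example when the data lives on another machine.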

🔥 Training

Follow Data Preparation to get the data ready, then update the data .csv path in the configuration YAML files or in the training scripts. Start training with the following commands:

cd applications/physics_condition
bash scripts/run_train.sh

Note: Ensure that the number of devices listed in CUDA_VISIBLE_DEVICES, nproc_per_node, experiment.num_gpus, and experiment.batch_size are all set to the same value.
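The constraint in this note can be checked before launching a run. The sketch below is illustrative only: `visible_gpu_count` and `check_launch_settings` are hypothetical helpers rather than repository functions, and the three numeric arguments are assumed to be copied by hand from the training script and YAML configuration.

```python
import os


def visible_gpu_count(env=None):
    """Count the device entries listed in CUDA_VISIBLE_DEVICES."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([item for item in value.split(",") if item.strip()])


def check_launch_settings(nproc_per_node, num_gpus, batch_size, env=None):
    """Raise ValueError unless all four settings agree; return the shared value."""
    settings = {
        "CUDA_VISIBLE_DEVICES count": visible_gpu_count(env),
        "nproc_per_node": nproc_per_node,
        "experiment.num_gpus": num_gpus,
        "experiment.batch_size": batch_size,
    }
    if len(set(settings.values())) != 1:
        raise ValueError(f"Mismatched launch settings: {settings}")
    return nproc_per_node
```

For example, with `CUDA_VISIBLE_DEVICES=0,1,2,3` exported, `check_launch_settings(4, 4, 4)` returns 4, while any differing value raises an error that lists all four settings.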

🎮 Evaluation

To compute the evaluation metrics, run the following commands:

cd applications/physics_condition
bash scripts/run_eval.sh

📸 Showcase

We present the 3D structures predicted by our method and by SE(3)-Trans.

SE(3) Trans Ours Ground Truth

📝 Citation

If you find our work useful for your research, please consider citing the paper:

@misc{liu2024dynamicpdbnewdataset,
      title={Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures},
      author={Ce Liu and Jun Wang and Zhiqiang Cai and Yingxu Wang and Huizhen Kuang and Kaihui Cheng and Liwei Zhang and Qingkun Su and Yining Tang and Fenglei Cao and Limei Han and Siyu Zhu and Yuan Qi},
      year={2024},
      eprint={2408.12413},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM},
}

🤗 Acknowledgements

We would like to thank the contributors to the openfold, AlphaFlow, EigenFold, and SE3-Diffusion repositories for their open research and exploration. If we have missed any open-source projects or related articles, please let us know and we will add the acknowledgement promptly.