Use `predict.py` to first predict retention order indices and then map these to retention times using anchor compounds.
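Conceptually, the second step interpolates between the anchors. The sketch below is purely illustrative: the ROI values are invented, and `predict.py` may use a different, more robust mapping procedure.

```python
# Illustrative only: map predicted retention order indices (ROIs) to retention
# times by interpolating between anchor compounds. The ROI values are made up;
# predict.py may use a different mapping procedure.
import numpy as np

anchor_roi = np.array([0.12, 0.35, 0.61, 0.88])    # hypothetical predicted ROIs of the anchors
anchor_rt = np.array([1.3, 6.9, 11.005, 30.285])   # measured RTs of the anchors (min)

def roi_to_rt(roi):
    order = np.argsort(anchor_roi)                 # interpolation needs anchors sorted by ROI
    return np.interp(roi, anchor_roi[order], anchor_rt[order])

print(roi_to_rt(np.array([0.2, 0.7])))             # RTs for two hypothetical ROIs
```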
Two input files are needed:
- Structures with retention times for anchor compounds in TSV format (compounds without a retention time are the ones to be predicted):

  ```
  smiles    rt
  C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)OC4C(C(C(C(O4)CO)O)O)O
  C1=CC(=CC=C1C=CC(=O)OC(C(C(=O)O)O)C(=O)O)O
  CCC(C)CN
  C1=C(OC(=C1)C=O)CO
  C1=CC=C(C(=C1)C(=O)O)N
  COC1=C(C=CC(=C1)C=CC=O)O
  CC(C)CCN    6.9
  C1=CC=C2C(=C1)C(=CN2)CC(=O)C(=O)O    30.285
  C(CC(=O)N)C(C(=O)O)N    1.3
  C1=CC=C(C(=C1)C(=O)CC(C(=O)O)N)N    11.005
  ```
- Information on the chromatographic setup in YAML format, similar to what is used in RepoRT:

  ```yaml
  column:
    name: Waters ACQUITY UPLC HSS T3
    t0: 0.735
  eluent:
    A:
      pH: 3
  ```

  The column should have HSM and Tanaka parameters available. If provided, twice `t0` is used as the void threshold; anchors below this threshold will not be considered during mapping.
Consider standardizing compound structures beforehand, e.g., using `standardizeUtils`.
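If `standardizeUtils` is not at hand, a rough equivalent can be sketched with RDKit's standardization utilities (only a sketch; `standardizeUtils` may apply different or additional rules):

```python
# Minimal sketch of SMILES standardization with RDKit; standardizeUtils may
# apply different or additional rules.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize_smiles(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    mol = rdMolStandardize.Cleanup(mol)           # sanitize, normalize, reionize
    mol = rdMolStandardize.FragmentParent(mol)    # keep the largest organic fragment
    return Chem.MolToSmiles(mol)

print(standardize_smiles("CCC(C)CN.Cl"))          # counter-ion is stripped
```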
To access HSM and Tanaka parameters, a copy of RepoRT needs to be available.
Use `predict.py` like this:

```bash
python predict.py --model models/twostep_everything_predready.pt --repo_root_folder <path to RepoRT> \
    --input_compounds test/test_input.tsv --input_metadata test/test_metadata.yaml \
    --out test/test_output.tsv
```
This should take about 10 seconds on a normal laptop without a GPU.
With Docker:

```bash
docker run -v $(pwd)/test:/app/test -v <path to RepoRT>:/RepoRT -it --rm ghcr.io/boecker-lab/twosteprt:latest \
    python predict.py --model models/twostep_everything_predready.pt --repo_root_folder /RepoRT \
    --input_compounds test/test_input.tsv --input_metadata test/test_metadata.yaml
```
An output with predicted retention times will be generated (full example output: `test/test_output.tsv`):

```
  smiles  rt_pred
0 C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)OC4C(C(C(C(O4)CO)O)O)O 26.192242
1 C1=CC(=CC=C1C=CC(=O)OC(C(C(=O)O)O)C(=O)O)O 11.60789
2 CCC(C)CN 19.052101
3 C1=C(OC(=C1)C=O)CO 20.10303
4 C1=CC=C(C(=C1)C(=O)O)N 16.751137
5 COC1=C(C=CC(=C1)C=CC=O)O 24.9707
```
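The output is a plain TSV and can be loaded with, e.g., pandas (treating the first column as a row index, as in the example above):

```python
# Load the prediction output shown above.
import pandas as pd

preds = pd.read_csv("test/test_output.tsv", sep="\t", index_col=0)
print(preds.sort_values("rt_pred"))   # compounds ordered by predicted retention time
```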
The following dependencies are required:
- python=3.12
- chemprop=1.6.1
- pulp
- pytorch
- rdkit
- statsmodels
- tensorboard
- tqdm
- yaml
- numpy<2
A conda/mamba environment is provided (typical install time: 5 minutes):

```bash
mamba env create -n twosteprt -f env.yaml
mamba activate twosteprt
```
For GPU support, the `pytorch-cuda` package has to be added with the appropriate version, e.g., `pytorch-cuda=11.8`. See `env_cuda.yaml`.
A Dockerfile and a container image are provided as well. A GPU (or any other special hardware) is not required.
The code should work on any operating system and was tested under Arch Linux with the following package versions:
- python=3.12.8
- chemprop=1.6.1
- pulp=2.8.0
- pytorch=2.5.1
- rdkit=2024.09.3
- statsmodels=0.14.4
- tensorboard=2.18.0
- tqdm=4.67.1
- yaml=0.2.5
- numpy=1.26.4
A new model can be trained on RepoRT datasets with `train.py`, for example:

```bash
python train.py --input <IDs of RepoRT datasets> --epsilon 10s \
    --run_name twosteproi --save_data \
    --batch_size 512 --epochs 10 --sysinfo --columns_use_hsm --columns_use_tanaka --use_ph \
    --repo_root_folder <path to RepoRT> --clean_data \
    --encoder_size 512 --sizes 256 64 --sizes_sys 256 256 \
    --pair_step 1 --pair_stop None --sample --sampling_count 500_000 --no_group_weights \
    --mpn_no_residual_connections_encoder --no_standardize
```
Add `--gpu` to enable training on a GPU.
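Whether a CUDA device is actually visible to PyTorch can be checked beforehand:

```python
# Quick sanity check before adding --gpu.
import torch

print(torch.cuda.is_available())   # True if a usable CUDA device is found
print(torch.cuda.device_count())   # number of visible CUDA devices
```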
Model training creates three files:
- The model itself, `twosteproi.pt` (with the option `--ep_save`, files for every epoch are created: `twosteproi_ep1.pt` etc.)
- Processed training data, `twosteproi_data.pkl`
- A JSON file detailing the training configuration, `twosteproi_config.json`
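The configuration file is plain JSON and can be inspected directly (assuming, as is usual for such files, that the top level is a JSON object):

```python
# Inspect the training configuration written alongside the model.
import json

with open("twosteproi_config.json") as f:
    config = json.load(f)
print(sorted(config))   # list the recorded training settings
```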
To make the trained model ready for prediction, use the `repackage_model.py` script:

```bash
python repackage_model.py twosteproi.pt twosteproi_predready.pt
```
This combines all information required for prediction into one file.
A model trained on 171 manually curated reversed-phase datasets from RepoRT (version 94f43c1b) is provided in the `models` subdirectory (`models/twostep_everything_predready.pt`).
A trained model can be evaluated on RepoRT datasets with `evaluate.py`:

```bash
python evaluate.py --model <path to trained model> --test_sets <IDs of RepoRT datasets> \
    --repo_root_folder <path to RepoRT> --epsilon 10s
```
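What exactly `evaluate.py` reports is not shown here. One common metric for retention-order models, pairwise order accuracy with an epsilon tolerance, can be sketched as follows; this is an assumption about the kind of metric, not necessarily what `evaluate.py` computes:

```python
# Conceptual sketch of pairwise retention-order accuracy with a tolerance
# epsilon (same unit as the RTs); not necessarily the metric used by evaluate.py.
import itertools

def pairwise_accuracy(rt_true, rt_pred, epsilon=10 / 60):  # e.g., 10 s for RTs in minutes
    correct, total = 0, 0
    for (t_i, p_i), (t_j, p_j) in itertools.combinations(zip(rt_true, rt_pred), 2):
        if abs(t_i - t_j) < epsilon:       # pairs eluting closer than epsilon are ignored
            continue
        total += 1
        correct += ((t_i < t_j) == (p_i < p_j))
    return correct / total if total else float("nan")

print(pairwise_accuracy([1.3, 6.9, 11.0, 30.3], [2.0, 6.0, 12.5, 28.0]))
```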