🦠 Model Request: CReM - chemically reasonable mutations framework for structure generation #505

DhanshreeA · 2022-12-14T19:07:17Z

Model Name

CReM fragment based structure generation

Model Description

The framework is an open source implementation of fragment based generative approaches for exploring chemical space while ensuring chemical validity. This framework utilizes a database of known compounds to come up with interchangable fragments based on the context radius of an input molecule to generate new molecules.

Slug

crem-structure-generation

Publication

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00431-w

Code

Code for the framework: https://github.com/DrrDom/crem
Documentation for the package: https://crem.readthedocs.io/en/latest/
Databases for fragment replacements: http://www.qsar4u.com/pages/crem.php (Has three databases in total with SC scores 2, and 2.5 and with SA score of 2)

License

BSD 3-clause License
https://github.com/DrrDom/crem/blob/master/LICENSE.txt

DhanshreeA · 2022-12-14T19:08:06Z

@miquelduranfrigola please take a look.

DhanshreeA · 2022-12-15T06:57:46Z

The SA (synthetic accessibility) score is on a scale of 1 to 10 (lower is better) while the SC (synthetic complexity) score is on a scale of 1 to 5 (again, lower is better). I am unsure how SCscore 2 compares to SAscore 2.

GemmaTuron · 2022-12-15T07:44:30Z

Hi @DhanshreeA

SC score uses the number of reactions needed to synthesize a new molecule as a measure of complexity, whereas SA score also takes into account the presence of unusual structural features. They are related (usually, a molecule with high SC score will also have high SA score and viceversa)

Do you think it would be useful to incorporate the models for SA and SC independently as well? SA score is already in the Hub (eos9ei3) but SC score I think is not there: https://pubs.acs.org/doi/10.1021/acs.jcim.7b00622

DhanshreeA · 2022-12-15T07:55:59Z

@GemmaTuron I see. Thanks for that explanation, it makes a lot of sense. Incorporating SAscore 2 and SCscore 2 independently is going to be very straightforward since the author has provided different fragment replacement databases for each of them. Only difference is that SA2 is 3GB whiile SC2 is 721MB but I understand that git LFS handles that.

GemmaTuron · 2022-12-15T08:05:44Z

Hi @DhanshreeA !

I am unsure about what SA2 is and what is the difference between the SA Score from the original publication, which is already in the Hub? Unless there is a siginificant improvement we are good with the one we have

DhanshreeA · 2022-12-15T08:42:15Z

Hi @GemmaTuron, I'll quote from their documentation:

replacements02_sa2.db.gz - database created from ChEMBL v22 structures which contain only organic atoms (C,N,O,S,P,F,Cl,Br,I) and have maximum synthetic accessibility score (SAScore) 2
replacements02_sc2.db.gz - database created from ChEMBL v22 structures which contain only organic atoms (C,N,O,S,P,F,Cl,Br,I) and have maximum synthetic complexity score (SCScore) 2

The model we have in the hub computes SA score for a given molecule input, while this framework uses fragments of this score for generating new molecules. In fact the author only uses the SA score as defined in the original publication https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-1-8

DhanshreeA · 2022-12-15T08:43:33Z

A few more details about this package relevant to incorporating it within the hub. The library API provides three functions for generating mutations in a given input molecule:

grow_mol which replaces hydrogen atoms with fragments from the database.
Input:
mol: RDKit Mol object
Returns:
Generator over SMILES of new molecules
link_mols which replaces hydrogen atoms with fragments from the database.
Input:
mol1: RDKit Mol object
mol2: RDKit Mol object
Returns:
Generator over SMILES of new molecules
mutate_mol which replaces hydrogen atoms with fragments from the database.
Input:
mol: RDKit Mol object
Returns:
Generator over SMILES of new molecules

Other configurable parameters that can affect the speed of execution include, from the documentation and as addressed in the original repo :

max_replacements – maximum number of replacements to make. If the number of replacements available in DB is greater than the specified value the specified number of randomly chosen replacements will be applied. Default: None.
min_freq – minimum occurrence of fragments in DB for replacement. Default: 0.
ncores – number of cores. Default: 1.

DrrDom · 2022-12-15T10:25:33Z

Nice initiative! Short comments:

I would not recommend to use SC as a measure of synthetic accessibility.
SC2 and SA2 mean that the source database was filtered based on these thresholds to choose more synthetically feasible molecules, after that molecules were fragmented to get corresponding fragment databases.

I can share a SA2.5 database for ChEMBL, but it will be even larger (10-20Gb).
To reduce the size of a database it is possible to remove some tables with context radius which you do not expect to use, for example a table with radius 5. Or remove fragments with the number of atoms greater than a chosen threshold, e.g. fragments with size more than 12 atom can be removed. Now, there are fragments with up to 20 atoms. These are SQLite databases and this is easy to do by yourself.

If you will have further questions regarding CReM do not hesitate to contact me.

miquelduranfrigola · 2022-12-15T11:58:48Z

Thanks @DrrDom this is fantastic feedback. Greatly appreciated. @DhanshreeA, for the deployed version in the Ersilia Model Hub, let's use a set of reasonable parameters and relatively small database size, just to avoid frustration from users. But let's make sure we encourage users to visit the actual CReM repo, which is well maintained and easy to install. We can reflect this nicely in the README file. What do you all think?

miquelduranfrigola · 2022-12-15T12:01:32Z

@DhanshreeA I've made a few edits to your initial comment, in particular to mention more explicitly CReM. Let's hear @GemmaTuron 's feedback and then it is good to go!

DrrDom · 2022-12-15T14:19:11Z

It is difficult to recommend certain settings because I do not know your use case.
From a general practical perspective we usually use databases restricted by SA of source molecules (SA2 or SA2.5). Context radius is set to 3 or 4. This is quite conservative generation resulting in a limited number of output molecule, but with higher chances of their synthetic accessibility. The fragment size depends on the task but rarely larger than 12 (to cover fused bicycles with small substituents). To avoid an unreasonably large number of generated structures one may set max_replacement to a desired value. This will guarantee to be the upper bound of the number of generated compounds and this makes generation time more predictable.
We have a web service for simple use cases - https://crem.imtm.cz. It may help to better understand time constrains.

miquelduranfrigola · 2022-12-15T16:22:05Z

/approve

github-actions · 2022-12-15T16:22:33Z

New Model Repository Created! 🎉

@DhanshreeA ersilia model respository has been successfully created and is available at:

🔗 ersilia-os/eos4q1a

Next Steps ⭐

Now that your new model respository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

🍴 Get started by creating a fork of your new model repository - docs
👯 Clone your forked repository - docs
✏️ Make edits to your new forked model repository - docs - Edits might include:
- Updating the README.md file to accurately describe your model
- Add source code for your model
- Adding documentation for your model
🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

DhanshreeA · 2022-12-19T11:29:31Z

Associated PR: ersilia-os/eos4q1a#1

miquelduranfrigola · 2022-12-20T06:31:52Z

Hi @DhanshreeA I have added the model to the AirTable and have merged the PR

DhanshreeA · 2022-12-20T08:25:12Z

Note: For sake of simplicity we are not including the link_mols endpoint at this time.

Further action items:

Incorporate the grow_molecule API and concatenate the results, ensuring that there are no duplicates
Report model fetch and run times (with eml_canonical.csv)
Shape of the output csv, we're specifically looking for any variability in number of molecules returned for different inputs.

Update (Dec 21):

Number of molecules generated differs across input molecules. For ease of use, we will cap the number of generated molecules to some fixed value based on the distribution of this number over a sample of test inputs. For example:
abacavir - Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1 generates 10 output molecules
abiraterone - C[C@]12CC[C@H]3C@@H[C@@h]1CC=C2c1cccnc1 generates 3 output molecules
acetazolamide CC(=O)Nc1nnc(S(N)(=O)=O)s1 generates 758 output molecules
acetic acid CC(=O)O generates 448 output molecules
acetylcysteine CC(=O)NC@@HC(=O)O generates 4752 output molecules

DhanshreeA · 2022-12-22T07:30:02Z

Updates:

Minor bug in refactoring of the eos4q1a codebase that caused errors during fetch. PR that addresses this issue: Refactored checkpoints directory to databases directory eos4q1a#4
Incorporated the grow_mol endpoint and added error handling in this PR Add grow_mol mode from crem into run_generate api eos4q1a#5

DhanshreeA · 2022-12-22T07:32:56Z

Further action items:
Since the number of output molecules for different inputs can differ so wildly, we are capping total output molecules at 100. However to sample these output molecules for diversity, we will generate 100 clusters and return a molecule closest to the cluster centroid from each cluster.

DhanshreeA · 2022-12-22T07:34:34Z

@miquelduranfrigola / @GemmaTuron could one of you merge ersilia-os/eos4q1a#4?

DhanshreeA · 2022-12-27T06:25:56Z

Updates:

Output molecules capped to hundred by performing K Means clustering and then returning molecules that were the closest to the cluster centroids based on euclidean distance. ersilia-os/eos4q1a@6a44885.
I tested this approach on the eml_canonical dataset (which has 442 unique input smiles) with different choices of hyper parameters. With a batch size of 50, the end-to-end generation took upwards of 5 hours. With default batch size of 1024, the end-to-end generation took upwards of 3 hours. To further reduce this time, I am not capping the generated molecules to be sampled at 2000 through random selection. This in addition to default batch size of 1024 has lead to significant performance improvement on this dataset, as end-to-end generation now takes approximately 1.5 hours.

Associated PR: ersilia-os/eos4q1a#5 @miquelduranfrigola @GemmaTuron

Results on eml_canonical:
output.csv

DhanshreeA · 2022-12-27T06:30:05Z

To understand fingerprint generation and vectorization into Numpy arrays for the downstream task of KMeans clustering I took inspiration from https://github.com/PatWalters/kmeans/blob/master/kmeans.py (particularly for using the inbuilt (rdkit.DataStructs.ConverToNumpyArray) and https://github.com/ersilia-os/lazy-qsar/blob/main/lazyqsar/descriptors/descriptors.py#L328

DhanshreeA · 2022-12-27T06:34:09Z

In understanding RDKit documentation I came across a similar functionality that RDKit offers in terms of picking a subset of diverse molecules https://www.rdkit.org/docs/GettingStartedInPython.html#picking-diverse-molecules-using-fingerprints. I have not tried using so I am unsure if it is better optimized for the task at hand than using regular machine learning libraries like scikit learn, which I have used right now.

DhanshreeA · 2022-12-27T06:40:54Z

Updates with integration testing with ersilia CLI:

Model is fetched and served without issues.
The generate API within the model is constantly running into an error:

Traceback (most recent call last):
  File "/home/dee/miniconda3/envs/ersilia-env/bin/ersilia", line 33, in <module>
    sys.exit(load_entry_point('ersilia', 'console_scripts', 'ersilia')())
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/bentoml/cli/click_utils.py", line 138, in wrapper
    return func(*args, **kwargs)
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/bentoml/cli/click_utils.py", line 115, in wrapper
    return_value = func(*args, **kwargs)
  File "/home/dee/miniconda3/envs/ersilia-env/lib/python3.8/site-packages/bentoml/cli/click_utils.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/cli/commands/api.py", line 36, in api
    result = mdl.api(
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/core/model.py", line 334, in api
    return self.api_task(
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/core/model.py", line 349, in api_task
    for r in result:
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/core/model.py", line 178, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/serve/api.py", line 329, in post
    self.output_adapter.adapt(
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 283, in adapt
    df = self._to_dataframe(result)
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 227, in _to_dataframe
    dtypes = [self.__pure_dtype(k) for k in output_keys]
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 227, in <listcomp>
    dtypes = [self.__pure_dtype(k) for k in output_keys]
  File "/home/dee/learn_xyz/outreachy/ersilia-project/ersilia/ersilia/io/output.py", line 156, in __pure_dtype
    t = self._schema[k]["type"]
KeyError: 'outcome'

DhanshreeA · 2022-12-27T07:01:07Z

Generating an example file as suggested here and using that I still run into the same error.

miquelduranfrigola · 2022-12-27T07:53:07Z

In understanding RDKit documentation I came across a similar functionality that RDKit offers in terms of picking a subset of diverse molecules https://www.rdkit.org/docs/GettingStartedInPython.html#picking-diverse-molecules-using-fingerprints. I have not tried using so I am unsure if it is better optimized for the task at hand than using regular machine learning libraries like scikit learn, which I have used right now.

Thanks @DhanshreeA. I think both options should be fine. If the scikit-learn option is already in place, then let's use that one.

miquelduranfrigola · 2022-12-27T07:55:53Z

As for the error above, as discussed, let's make sure that the output file has a fixed number of columns (100). Comma-separated is fine. We may or may not include a header.

DhanshreeA · 2022-12-27T12:54:29Z

Thanks @miquelduranfrigola I implemented this fix and the API works with the CLI now. Here is an output CSV file generated with five inputs.
eos4q1a_output_2.csv

DhanshreeA · 2022-12-27T12:59:38Z

This PR is now good to be merged: ersilia-os/eos4q1a#5
@GemmaTuron @miquelduranfrigola
After it is merged, the model is ready to be tested by anyone in the team.

miquelduranfrigola · 2022-12-27T19:48:44Z

Hi @DhanshreeA! I've tried the model in this GitHub workflow

It failed, unfortunately. Aparently, it is unable to find crem.

Did you try to fetch the model from your computer and check what is inside the eos4q1a conda environment?

paulinebanye · 2022-12-28T15:19:19Z

Hi @DhanshreeA,

Update on test

I tested the model multiple times but it failed to fetch on both the CLI and colad. Error returned on both is EmptyOutputError

link to colab
CLI log 1
CLI log 2

Test Environment

Conda 4.12.0
Python 3.7.13
Windows 10 WSL
Ubuntu 20.04

DhanshreeA · 2023-01-03T06:30:32Z

Hi @DhanshreeA! I've tried the model in this GitHub workflow

It failed, unfortunately. Aparently, it is unable to find crem.

Did you try to fetch the model from your computer and check what is inside the eos4q1a conda environment?

Hi @miquelduranfrigola this is fixed by pushing the crem replacement database onto the LFS server. I re ran the build action in the GitHub workflow. and can confirm that the model is fetched successfully. It was an issue with Git LFS being unable to find the crem replacement database on the LFS server.

DhanshreeA · 2023-01-03T06:31:46Z

Hi @DhanshreeA,

Update on test

I tested the model multiple times but it failed to fetch on both the CLI and colad. Error returned on both is EmptyOutputError

link to colab CLI log 1 CLI log 2

Test Environment
* Conda 4.12.0

* Python 3.7.13

* Windows 10 WSL

* Ubuntu 20.04

Thank you for your help @pauline-banye. The model works in Colab now.

DhanshreeA · 2023-01-03T07:10:09Z

One final TODO (bonus):

Provide Colab notebook in model repo.

miquelduranfrigola · 2023-01-03T07:29:27Z

Fantastic work @DhanshreeA ! I've marked the model as Ready in the AirTable. Closing issue!

GemmaTuron · 2023-01-05T11:33:41Z

Before closing this issue, let's make sure the model is tested.
I am assigning this to @Femme-js , @carcablop and @pauline-banye !
Please comment once you can confirm it works on your system and in google colab (using the newest template we edited)

carcablop · 2023-01-06T23:07:28Z

Hi @DhanshreeA
The model fetch successfully in CLI and Google Colab.
Fetch in google colab:
Fetch in CLI:
log_fetch_eos4q1a.txt

To predict it takes a long time, I tried it with a single molecule.

GemmaTuron · 2023-01-12T12:06:49Z

We are doing the testing in ersilia-os/eos4q1a#7
I'll close this issue as completed

DhanshreeA added the new-model New model requested label Dec 14, 2022

DhanshreeA self-assigned this Dec 14, 2022

ersilia-os deleted a comment from GemmaTuron Dec 15, 2022

DhanshreeA mentioned this issue Dec 20, 2022

Further action items: ersilia-os/eos4q1a#2

Closed

3 tasks

DhanshreeA mentioned this issue Dec 27, 2022

🐛 Bug: eos5qfo - Predict Error: products #371

Closed

miquelduranfrigola closed this as completed Jan 3, 2023

GemmaTuron reopened this Jan 5, 2023

GemmaTuron assigned carcablop, Femme-js and paulinebanye and unassigned DhanshreeA Jan 5, 2023

GemmaTuron closed this as completed Jan 12, 2023

This issue was closed.

🦠 Model Request: CReM - chemically reasonable mutations framework for structure generation #505

🦠 Model Request: CReM - chemically reasonable mutations framework for structure generation #505

Comments

DhanshreeA commented Dec 14, 2022 • edited by miquelduranfrigola Loading

Model Name

Model Description

Slug

Tags

Publication

Code

License

DhanshreeA commented Dec 14, 2022

DhanshreeA commented Dec 15, 2022

GemmaTuron commented Dec 15, 2022

DhanshreeA commented Dec 15, 2022 • edited Loading

GemmaTuron commented Dec 15, 2022

DhanshreeA commented Dec 15, 2022 • edited Loading

DhanshreeA commented Dec 15, 2022 • edited Loading

DrrDom commented Dec 15, 2022

miquelduranfrigola commented Dec 15, 2022

miquelduranfrigola commented Dec 15, 2022

DrrDom commented Dec 15, 2022

miquelduranfrigola commented Dec 15, 2022

github-actions bot commented Dec 15, 2022

New Model Repository Created! 🎉

Next Steps ⭐

Additional Resources 📚

DhanshreeA commented Dec 19, 2022

miquelduranfrigola commented Dec 20, 2022

DhanshreeA commented Dec 20, 2022 • edited Loading

DhanshreeA commented Dec 22, 2022

DhanshreeA commented Dec 22, 2022 • edited Loading

DhanshreeA commented Dec 22, 2022

DhanshreeA commented Dec 27, 2022 • edited Loading

DhanshreeA commented Dec 27, 2022

DhanshreeA commented Dec 27, 2022

DhanshreeA commented Dec 27, 2022

DhanshreeA commented Dec 27, 2022

miquelduranfrigola commented Dec 27, 2022

miquelduranfrigola commented Dec 27, 2022

DhanshreeA commented Dec 27, 2022 • edited Loading

DhanshreeA commented Dec 27, 2022

miquelduranfrigola commented Dec 27, 2022 • edited Loading

paulinebanye commented Dec 28, 2022 • edited Loading

Update on test

Test Environment

DhanshreeA commented Jan 3, 2023

DhanshreeA commented Jan 3, 2023

Update on test

Test Environment

DhanshreeA commented Jan 3, 2023

miquelduranfrigola commented Jan 3, 2023

GemmaTuron commented Jan 5, 2023

carcablop commented Jan 6, 2023 • edited Loading

GemmaTuron commented Jan 12, 2023

DhanshreeA commented Dec 14, 2022 •

edited by miquelduranfrigola

Loading

DhanshreeA commented Dec 15, 2022 •

edited

Loading

DhanshreeA commented Dec 15, 2022 •

edited

Loading

DhanshreeA commented Dec 15, 2022 •

edited

Loading

DhanshreeA commented Dec 20, 2022 •

edited

Loading

DhanshreeA commented Dec 22, 2022 •

edited

Loading

DhanshreeA commented Dec 27, 2022 •

edited

Loading

DhanshreeA commented Dec 27, 2022 •

edited

Loading

miquelduranfrigola commented Dec 27, 2022 •

edited

Loading

paulinebanye commented Dec 28, 2022 •

edited

Loading

carcablop commented Jan 6, 2023 •

edited

Loading