Parameters relevant for speed in mutate_mol #4
Comments
Hi! It depends on what you mean by "speed up".
|
Thanks! Feel free to close this issue. But on a related note, what work do you think needs to be done to speed it up more? Would it be possible to leverage GPUs? Feel free to email me; the team I am working with would love to help! |
I have to profile the code to recall all the details and pitfalls. I'll post the results here to enable a more objective discussion. What I remember is that I implemented a brute-force replacement procedure in which multiple replacements can lead to the same compound. Therefore, duplicates are generated and filtered out internally. This is definitely a waste of time, but I do not remember the proportion of duplicates and I do not know how to avoid this easily. For example, take a compound X-CH2-CH3 and make the replacements CH3>>NH2 and CH2-CH3>>CH2-NH2. Both give X-CH2-NH2. I'm not sure that a GPU will help much, because the code uses RDKit extensively, RDKit has no GPU support (rdkit/rdkit#3537), and separating out GPU-friendly code seems too difficult. From time to time I have thought about a substantial refactoring and a change of the architecture and algorithms, but this would require a lot of effort and may result in a new implementation :) |
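The X-CH2-CH3 example above can be sketched with plain strings. This is only a toy model under stated assumptions: a "molecule" is a string, a replacement swaps a terminal fragment, and `apply_replacement` is a hypothetical helper, not a CReM or RDKit function.

```python
from typing import Optional


def apply_replacement(mol: str, old: str, new: str) -> Optional[str]:
    """Swap a terminal fragment `old` for `new` (toy string model, not real chemistry)."""
    if mol.endswith(old):
        return mol[: len(mol) - len(old)] + new
    return None


mol = "X-CH2-CH3"
# Two different replacements taken from the comment above.
replacements = [("CH3", "NH2"), ("CH2-CH3", "CH2-NH2")]

# Brute force: apply every replacement, then filter duplicates afterwards.
products = [apply_replacement(mol, old, new) for old, new in replacements]
unique_products = set(products)

print(products)
print(unique_products)
```

Both replacements are applied in full before the set collapses them into one product, which is the wasted work the comment describes.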
I am posting the test results and some conclusions here. I do not currently consider the speed-up a priority because we use relatively slow scoring functions, so structure generation is not the limiting step. But for fast scoring this can already be an issue, and if it can be improved at low cost we can do that. The share of duplicated generated structures is up to 65% per molecule, which is much more than I expected. The major function consuming 90% of runtime is
Conversion to SMILES takes most of the time. I expect that RemoveHs is also expensive because it creates a new Mol object. This step could be replaced with InChI to identify duplicates (it could be faster, and there would be no need to handle explicit hydrogens). But SMILES are the main output of all functions, so their generation is inevitable, unless we change all functions to return Mol objects instead of SMILES and transfer the responsibility for managing explicit hydrogens to the user. Changing the interface does not seem like a good idea, but if it speeds things up greatly we can do that. Using SMILES as the main output was probably not a great idea.
Removing the sanitization step is questionable, because again it transfers to the user the responsibility for managing the generated Mol objects and solving issues with them. Concluding remarks:
Alternative "solution". Any suggestions/ideas are welcome. |
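One way to picture the trade-off discussed above is a deduplication step whose identity key is pluggable, so the key (canonical SMILES, InChI, or something else) can be swapped and timed separately. The sketch below is a toy under stated assumptions: `unique_by` and `toy_canonical_key` are hypothetical names, and the key only handles linear strings of single-letter atoms. With RDKit, the key could be something like `Chem.MolToInchiKey`, but whether that is actually faster has not been measured here.

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def unique_by(items: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Lazily yield the first item seen for each distinct key."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


def toy_canonical_key(linear: str) -> str:
    """Toy stand-in for a canonical identifier: a linear chain read
    forwards or backwards is treated as the same molecule."""
    return min(linear, linear[::-1])


# "CCO" and "OCC" describe the same chain, as do "CCN" and "NCC".
candidates = ["CCO", "OCC", "CCN", "NCC", "CCO"]
survivors = list(unique_by(candidates, key=toy_canonical_key))
print(survivors)
```

Because `unique_by` is a generator, any expensive conversion to the final output format can be applied to the survivors only.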
While investigating the issue with duplicates, I found and fixed a bug that generated duplicated fragments before their mutation. All fragments with 1 or 2 attachment points were duplicated, so twice as many replacements were performed, and they resulted in the same molecules. This fix gave a 2x speed-up. Now the share of duplicates among generated molecules is from 0 to 32% on test samples. |
Thanks for working on this! |
Yes, this should speed things up by 30-40%, but about 20-25% (on average) of the molecules in the output will be duplicates. If you are OK with them, you may remove the generation of SMILES and change the downstream code accordingly. I'm still thinking about how to improve the situation, but I do not see a way to avoid substantial changes. The worst thing is that it is difficult to estimate whether these changes will speed up the whole generation or not. |
Would it be faster to use fingerprints to find duplicates instead of SMILES strings? |
I do not know; you may try to measure the performance. But fingerprints have collisions, so some molecules will be falsely identified as duplicates and erroneously discarded. The number of such false duplicates depends on the fingerprint type and the structures. I have never explored this issue. |
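The collision concern can be illustrated without any chemistry toolkit. In this sketch, `tiny_fingerprint` is a hypothetical, deliberately lossy key (a CRC32 folded to 4 bits), not a real chemical fingerprint, and the SMILES strings themselves stand in for an exact identity key.

```python
import zlib


def tiny_fingerprint(smiles: str, n_bits: int = 4) -> int:
    """Hypothetical lossy key: CRC32 reduced to n_bits bits."""
    return zlib.crc32(smiles.encode()) % (1 << n_bits)


def dedupe(items, key):
    """Keep the first item seen for each distinct key."""
    seen, kept = set(), []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            kept.append(item)
    return kept


# 20 distinct linear alkanes written as SMILES: "C", "CC", "CCC", ...
molecules = ["C" * n for n in range(1, 21)]

exact = dedupe(molecules, key=lambda s: s)       # exact key: nothing is lost
lossy = dedupe(molecules, key=tiny_fingerprint)  # lossy key: collisions drop molecules

falsely_discarded = len(exact) - len(lossy)
print(len(exact), len(lossy), falsely_discarded)
```

With only 16 possible key values and 20 distinct molecules, at least 4 are discarded by construction. Real fingerprints have far more bits, so the rate would be much lower, but it would still depend on the fingerprint type and the structures.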
Hi @DrrDom, I am interested in incorporating CReM within the Ersilia model hub and based on this thread I am curious to know if configuring the |
Hi @DhanshreeA, yes, the speed will also depend on these parameters. I expect the percentage of duplicates to be similar across setups, so the more compounds you generate, the more duplicates have to be discarded. Thus, if your parameters increase the number of generated compounds, the time spent on detecting duplicates will rise proportionally (on average). An example for GROW: min_atoms=1 and max_atoms=8 may generate 1000 molecules (and 200 duplicates will be discarded internally); min_atoms=9 and max_atoms=9 may generate 500 molecules (and 100 discarded duplicates); min_atoms=12 and max_atoms=12 may generate 2000 molecules (and 400 discarded duplicates). Everything depends on how many fragment replacements can be found in the database. The final choice of parameters depends on the practical use case. |
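The arithmetic in the GROW example above amounts to a fixed duplicate rate applied to the number of generated molecules. A small sketch, assuming the 20% rate implied by the illustrative figures in the comment (`discarded_duplicates` is a hypothetical helper, and the counts are the comment's examples, not measurements):

```python
def discarded_duplicates(generated: int, duplicate_rate: float = 0.2) -> int:
    """Estimate discarded duplicates for a given number of generated molecules."""
    return round(generated * duplicate_rate)


# (parameter setup, generated molecules) as given in the comment above.
setups = [
    ("min_atoms=1, max_atoms=8", 1000),
    ("min_atoms=9, max_atoms=9", 500),
    ("min_atoms=12, max_atoms=12", 2000),
]

estimates = {label: discarded_duplicates(n) for label, n in setups}
for label, discarded in estimates.items():
    print(f"{label}: about {discarded} duplicates discarded")
```

The point is only that deduplication cost scales with the number of generated molecules, not with the atom-count limits themselves.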
Hi! Thanks for maintaining this repo! I was wondering which parameters were relevant for speeding up `mutate_mol`? I've found better speeds by increasing `n_cores`, but if I wanted to speed it up even more, would decreasing `radius` help too?