Poor scaling when creating an interchange with multiple unique components vs. individual components #1156
Also adding some additional context and results shared between the Matta and Shirts groups to establish this issue |
Thanks for the detailed report - this has been relatively easy for me to sink my teeth into. So far I'm able to identify, somewhat surprisingly, that preparing the bond and electrostatics handlers is causing almost all of the slowdown. I'm seeing roughly the same split across a number of system sizes: ~49% for bonds, ~49% for electrostatics, with the remainder going to other steps.
More to come later |
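For reference, a minimal sketch of the sort of profiling that can surface this split. This is a hedged illustration: the force field name and the toy ethanol topology below are placeholders, not the actual benchmark system.

```python
import cProfile
import pstats

from openff.interchange import Interchange
from openff.toolkit import ForceField, Molecule, Topology

# Placeholder system: 100 copies of ethanol stand in for a real polymer melt.
force_field = ForceField("openff-2.1.0.offxml")
topology = Topology.from_molecules([Molecule.from_smiles("CCO")] * 100)

profiler = cProfile.Profile()
profiler.enable()
Interchange.from_smirnoff(force_field, topology)
profiler.disable()

# Sort by cumulative time to see which handlers dominate.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```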
Hah, is it somehow trying to look at all possible bonds and electrostatics in the system??? |
I'll follow up with a more detailed report later, but I can already share with some confidence that the bottleneck is the toolkit's handling of isomorphism with large molecules. There's plenty of chatter about how the toolkit could handle polymers (including biopolymers) better; one good issue that has stalled for a long time is openforcefield/openff-toolkit#1734.

Molecule de-duplication is essential for handling repeated copies of chemically identical molecules (i.e. solvent), so skipping that step is not an option. It's possible to conceive of a band-aid in which different code paths are chosen based on some properties of the system, but I would prefer not to encode hacks like that.

The results of this isomorphism check are cached, which would explain why almost all of the runtime is taken up by creating the bonds collection. (It's the first collection to be built, before any other valence or non-bonded terms, and only after the positions, box, and topology are processed.) I'm not likely to double-check runtimes on the order of tens of thousands of seconds, but the super-linear scaling with system size that @hannaomi reported matches my intuition. |
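To illustrate the shape of the de-duplication problem described above, here is a minimal sketch (not the toolkit's actual implementation; it only assumes the public `Molecule.is_isomorphic_with` API):

```python
from openff.toolkit import Molecule

def deduplicate(molecules: list[Molecule]) -> list[Molecule]:
    """Collect unique molecules by graph isomorphism (illustrative only)."""
    unique: list[Molecule] = []
    for molecule in molecules:
        # Every new molecule is compared against each unique molecule seen so
        # far, so N chemically distinct chains cost O(N^2) isomorphism checks,
        # and each check is itself expensive for large graphs.
        if not any(molecule.is_isomorphic_with(seen) for seen in unique):
            unique.append(molecule)
    return unique
```

For a solvent box (many copies, few unique graphs) this pays for itself immediately; for a polydisperse melt (many unique, large graphs) nearly every check fails and the quadratic term dominates.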
So, just so I understand: it seems like Hannah is saying that if she takes the same 50 molecules, runs them through independently, and sums the time, it takes far less time than running the 50 molecules in the same system, which means it's doing something where it's looking at atoms in different molecules rather than handling them independently. Do I understand correctly? |
I think that's the claim |
Interchange benchmarking for polydisperse polymer topologies - an extension of recent discussion on the Infrastructure Slack.
For topologies with multiple large unique components (e.g. polydisperse polymer topologies), parameterization creates a bottleneck in system prep.
Calling Interchange.from_smirnoff() on the entire topology (all unique components included) takes significantly longer and scales poorly compared to the summed runtime of calling the same command on a topology containing each individual component (the latter scales linearly). I do not have enough understanding of the inner workings of these commands to explain why there is such a difference between the two.
Attached is a script to recreate the discrepancy between the single combined topology and the individual topologies: benchmarking_single_vs_multiple_GH.py.zip.
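For readers who would rather not unpack the zip, here is a stripped-down sketch of the comparison. The PEG-like SMILES below are stand-ins for the real polydisperse chains; the attached script is the authoritative version.

```python
import time

from openff.interchange import Interchange
from openff.toolkit import ForceField, Molecule, Topology

force_field = ForceField("openff-2.1.0.offxml")  # example force field

# Stand-in "polydisperse" set: PEG-like chains of different lengths, so every
# component is chemically unique.
chains = [Molecule.from_smiles("OCC" + "OCC" * n + "O") for n in (5, 10, 15)]

def time_from_smirnoff(topology: Topology) -> float:
    start = time.perf_counter()
    Interchange.from_smirnoff(force_field, topology)
    return time.perf_counter() - start

# One call on the combined topology...
combined = time_from_smirnoff(Topology.from_molecules(chains))

# ...vs. the sum of one call per individual component.
individual = sum(
    time_from_smirnoff(Topology.from_molecules([chain])) for chain in chains
)

print(f"combined: {combined:.2f} s, sum of individual: {individual:.2f} s")
```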
To create the polydisperse systems, the script uses SwiftPol, a package developed by our research group to generate representative polymer systems. It is designed with OpenFF interoperability in mind and should be installable from GitHub, but it hasn't been extensively tested, so please let me know if you run into any bugs.
Interchange version = 0.4.1
Toolkit version = 0.16.6
Data generated using 1x NVIDIA A100 40GB GPU