
Poor scaling when creating an interchange with multiple unique components vs. individual components #1156

Open
hannaomi opened this issue Jan 30, 2025 · 6 comments
Labels
upstream Blocked by upstream changes

Comments

@hannaomi
Copy link

hannaomi commented Jan 30, 2025

Interchange benchmarking for polydisperse polymer topologies, extending a recent discussion on the Infrastructure Slack.

For topologies with multiple large unique components (e.g. polydisperse polymer topologies), parameterization creates a bottleneck in system prep.
Calling Interchange.from_smirnoff() on the entire topology (all unique components included) takes significantly longer and scales poorly compared to the summed runtime of calling the same method on a topology containing each component individually (the latter scales linearly). I don't understand the inner workings of these commands well enough to see why there is such a difference between the two.
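One toy model of why the combined call could scale superlinearly (purely illustrative; the function names are mine, not Interchange's): if molecule de-duplication compares every molecule against every previously seen unique molecule, a topology of n mutually distinct components pays on the order of n² isomorphism checks, while parameterizing each component in its own topology pays none.

```python
# Hypothetical cost model, NOT Interchange's actual code: count pairwise
# isomorphism comparisons performed during molecule de-duplication.

def pairwise_checks_combined(n_unique: int) -> int:
    """Comparisons when all n mutually distinct molecules share one topology."""
    # Each new molecule is compared against every unique molecule seen so far.
    return n_unique * (n_unique - 1) // 2

def pairwise_checks_individual(n_unique: int) -> int:
    """Comparisons when each molecule is parameterized in its own topology."""
    # A single-molecule topology has nothing to compare against.
    return 0

for n in (8, 16, 32):
    print(n, pairwise_checks_combined(n), pairwise_checks_individual(n))
```

Under this model, doubling the number of unique components roughly quadruples the de-duplication work in the combined case, which is consistent with the superlinear scaling reported here.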

Attached is a script to recreate the discrepancy between the single and individual topologies: benchmarking_single_vs_multiple_GH.py.zip.

To create the polydisperse systems, the script uses SwiftPol, a package developed by our research group to generate representative polymer systems. It is designed with OpenFF interoperability in mind and should be installable from GitHub, but it hasn't been extensively tested, so please let me know if you run into any bugs.

Interchange version = 0.4.1
Toolkit version = 0.16.6
Data generated using 1xnvidia A100 40GB GPU

@hannaomi
Author

Adding some additional context and results, shared between the Matta and Shirts groups, to establish this issue:

KCL_CUB_Multicomponent_interchange_GH.pptx.zip

@mattwthompson
Member

Thanks for the detailed report - this has been relatively easy for me to sink my teeth into. So far I can identify, somewhat surprisingly, that preparing the bond and electrostatics handlers causes almost all of the slowdown. I'm seeing roughly "~49% for bonds, ~49% for electrostatics, the remainder going to other steps" across a number of system sizes:

```
on num chains 8
building system ...
System built!, size = 8
writing to file
building 'complete' case
topology re-creation time 0.20
positions setting time 0.00
box processing time 0.00
		Function _bonds took 372.6278 seconds
		Function _constraints took 0.0156 seconds
		Function _angles took 0.1741 seconds
		Function _propers took 0.6292 seconds
		Function _impropers took 0.0575 seconds
		Function _vdw took 0.0482 seconds
		Function _electrostatics took 370.9286 seconds
		Function _plugins took 0.0000 seconds
		Function _virtual_sites took 0.0000 seconds
		Function _gbsa took 0.0000 seconds
	from_smirnoff time_elapsed=728.25
```

More to come later

@mrshirts
Contributor

Hah, is it somehow trying to look at all possible bonds and electrostatics in the system???

@mattwthompson
Member

I'll follow up with a more detailed report later, but I can already share with some confidence that the bottleneck is the toolkit's handling of isomorphism with large molecules. There's plenty of chatter about how the toolkit could handle polymers (including biopolymers) better; one good issue that has stalled for a long time is openforcefield/openff-toolkit#1734

Molecule de-duplication is essential for handling repeated copies of chemically identical molecules (e.g. solvent), so skipping that step is not an option. It's possible to conceive of a band-aid in which different pathways are chosen based on some properties of the system, but I would prefer not to encode hacks like that.

The results of this isomorphism check are cached, which would explain why almost all of the runtime is taken up by creating the bonds collection. (It's the first collection to be built, before any other valence terms or other non-bonded terms, and only after the positions, box, and topology are processed.) I'm not likely to double-check the runtime on the order of tens of thousands of seconds, but the super-linear scaling with system size that @hannaomi reported matches my intuition.
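The caching behavior described above can be sketched with a memoized comparison function (hypothetical illustration, not the toolkit's implementation): the expensive graph comparison runs once per molecule pair, later lookups are cache hits, and so whichever collection is built first absorbs nearly all of the cost.

```python
# Hypothetical sketch of caching pairwise isomorphism results. Molecules are
# modeled as frozensets of (element, degree) tuples purely for illustration;
# the real check is a graph-isomorphism test and is far more expensive.
from functools import lru_cache

@lru_cache(maxsize=None)
def are_isomorphic(mol_a: frozenset, mol_b: frozenset) -> bool:
    # Stand-in for the expensive comparison; the result is memoized per pair.
    return mol_a == mol_b

water = frozenset({("O", 2), ("H", 1)})
methane = frozenset({("C", 4), ("H", 1)})

are_isomorphic(water, methane)   # expensive comparison, result cached
are_isomorphic(water, methane)   # cache hit: no second comparison
print(are_isomorphic.cache_info().hits)  # → 1
```

This matches the timing pattern in the log: the first collection built pays for the comparisons, and subsequent handlers that need the same answers reuse the cache.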

@mattwthompson mattwthompson added the upstream Blocked by upstream changes label Jan 31, 2025
@mrshirts
Contributor

So, just so I understand: it seems like Hannah is saying that if she takes the same 50 molecules, runs them through independently, and sums the time, it takes far less time than running the 50 molecules in the same system. That would mean it's doing something where it looks at atoms across different molecules rather than handling each molecule independently. Do I understand correctly?

@mattwthompson
Member

I think that's the claim
