
Poor scaling when creating an interchange with multiple unique components vs. individual components #1156

Open
hannaomi opened this issue Jan 30, 2025 · 6 comments
Labels
upstream Blocked by upstream changes

Comments

@hannaomi
Copy link

hannaomi commented Jan 30, 2025

Interchange benchmarking for polydisperse polymer topologies, extending a recent discussion on the Infrastructure Slack.

For topologies with multiple large unique components (e.g. polydisperse polymer topologies), parameterization creates a bottleneck in system prep.
Calling Interchange.from_smirnoff() on the entire topology (all unique components included) takes significantly longer and scales poorly compared to the summed runtime of calling the same method on a topology containing each component individually (the latter scales linearly). I don't understand the inner workings of these commands well enough to see why there is such a difference between the two.
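One toy model of why the combined call could scale superlinearly (purely illustrative; the function names are mine, not Interchange's): if molecule de-duplication compares every molecule against every previously seen unique molecule, a topology of n mutually distinct components pays on the order of n² isomorphism checks, while parameterizing each component in its own topology pays none.

```python
# Hypothetical cost model, NOT Interchange's actual code: count pairwise
# isomorphism comparisons performed during molecule de-duplication.

def pairwise_checks_combined(n_unique: int) -> int:
    """Comparisons when all n mutually distinct molecules share one topology."""
    # Each new molecule is compared against every unique molecule seen so far.
    return n_unique * (n_unique - 1) // 2

def pairwise_checks_individual(n_unique: int) -> int:
    """Comparisons when each molecule is parameterized in its own topology."""
    # A single-molecule topology has nothing to compare against.
    return 0

for n in (8, 16, 32):
    print(n, pairwise_checks_combined(n), pairwise_checks_individual(n))
```

Under this model, doubling the number of unique components roughly quadruples the de-duplication work in the combined case, which is consistent with the superlinear scaling reported here.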

Attached is a script to recreate the discrepancy between the single and individual topologies: benchmarking_single_vs_multiple_GH.py.zip.

To create the polydisperse systems, the script uses SwiftPol, a package developed by our research group to generate representative polymer systems. It is designed with OpenFF interoperability in mind and should be installable from GitHub, but it hasn't been extensively tested, so please let me know if you run into any bugs.

Interchange version = 0.4.1
Toolkit version = 0.16.6
Data generated using 1xnvidia A100 40GB GPU

@hannaomi
Author

Adding some additional context and results, shared between the Matta and Shirts groups, to establish this issue:

KCL_CUB_Multicomponent_interchange_GH.pptx.zip

@mattwthompson
Member

Thanks for the detailed report - this has been relatively easy for me to sink my teeth into. So far I can identify, somewhat surprisingly, that preparing the bond and electrostatics handlers causes almost all of the slowdown. I'm seeing roughly "~49% for bonds, ~49% for electrostatics, the remainder going to other steps" across a number of system sizes:

```
on num chains 8
building system ...
System built!, size = 8
writing to file
building 'complete' case
topology re-creation time 0.20
positions setting time 0.00
box processing time 0.00
		Function _bonds took 372.6278 seconds
		Function _constraints took 0.0156 seconds
		Function _angles took 0.1741 seconds
		Function _propers took 0.6292 seconds
		Function _impropers took 0.0575 seconds
		Function _vdw took 0.0482 seconds
		Function _electrostatics took 370.9286 seconds
		Function _plugins took 0.0000 seconds
		Function _virtual_sites took 0.0000 seconds
		Function _gbsa took 0.0000 seconds
	from_smirnoff time_elapsed=728.25
```

More to come later

@mrshirts
Contributor

Hah, is it somehow trying to look at all possible bonds and electrostatics in the system???

@mattwthompson
Member

I'll follow up with a more detailed report later, but I can already share with some confidence that the bottleneck is the toolkit's handling of isomorphism with large molecules. There's plenty of chatter about how the toolkit could handle polymers (including biopolymers) better; one good issue that has stalled for a long time is openforcefield/openff-toolkit#1734

Molecule de-duplication is essential for handling repeated copies of chemically identical molecules (e.g. solvent), so skipping that step is not an option. It's possible to conceive of a band-aid in which different pathways are chosen based on some properties of the system, but I would prefer not to encode hacks like that.

The results of this isomorphism check are cached, which would explain why almost all of the runtime is taken up by creating the bonds collection. (It's the first collection to be built, before any other valence terms or other non-bonded terms, and only after the positions, box, and topology are processed.) I'm not likely to double-check the runtime on the order of tens of thousands of seconds, but the super-linear scaling with system size that @hannaomi reported matches my intuition.
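The caching behavior described above can be sketched with a memoized comparison function (hypothetical illustration, not the toolkit's implementation): the expensive graph comparison runs once per molecule pair, later lookups are cache hits, and so whichever collection is built first absorbs nearly all of the cost.

```python
# Hypothetical sketch of caching pairwise isomorphism results. Molecules are
# modeled as frozensets of (element, degree) tuples purely for illustration;
# the real check is a graph-isomorphism test and is far more expensive.
from functools import lru_cache

@lru_cache(maxsize=None)
def are_isomorphic(mol_a: frozenset, mol_b: frozenset) -> bool:
    # Stand-in for the expensive comparison; the result is memoized per pair.
    return mol_a == mol_b

water = frozenset({("O", 2), ("H", 1)})
methane = frozenset({("C", 4), ("H", 1)})

are_isomorphic(water, methane)   # expensive comparison, result cached
are_isomorphic(water, methane)   # cache hit: no second comparison
print(are_isomorphic.cache_info().hits)  # → 1
```

This matches the timing pattern in the log: the first collection built pays for the comparisons, and subsequent handlers that need the same answers reuse the cache.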

@mattwthompson mattwthompson added the upstream Blocked by upstream changes label Jan 31, 2025
@mrshirts
Contributor

So, just so I understand: it seems like Hannah is saying that if she takes the same 50 molecules, runs them through independently, and sums the time, it takes far less time than running the 50 molecules in the same system. That would mean it's doing something where it looks at atoms across different molecules rather than handling each molecule independently. Do I understand correctly?

@mattwthompson
Member

I think that's the claim
