HTEX Protocol update [DRAFT] #3586

yadudoc · 2024-08-15T20:12:33Z

Problem statement

Currently, the HighThroughputExecutor does not have a clearly defined protocol for communicating tasks and results internally.
This causes several issues:

1. Logic that packages results is duplicated all over the htex interchange. Here are some examples:
   - Interchange reporting a version mismatch error: link
   - Interchange reporting ManagerLost exception: link
   - Interchange repeating heartbeat: link
   - Reporting a drained manager: link
   - Interchange pickling task dict before shipping to the manager: link
2. There is a lot of cruft that needs cleanup.
3. GlobusCompute is interested in shipping task objects that HTEX can understand. Currently, HTEX is given a wrapper function that unpacks GC's task object and executes it. I'm not 100% sold on this, but I would like some discussion on what this would look like.
4. Recently, @matthewc2003 had trouble adding the resource_specification to the task package so that the interchange can use it for better scheduling decisions.
5. Lack of metadata limited where MPIExecutor could make decisions based on resource_specification. Ideally, some rework here could make it easier to update the MPIExecutor to use resource_specification info on the interchange rather than leaving this logic to the worker.
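
To make the points about duplicated packaging and missing metadata concrete, here is a minimal sketch of what a single, typed message envelope could look like. It is not the current HTEX wire format: `TaskMessage`, `ResultMessage`, `pack`, `unpack`, and `error_result` are hypothetical names invented for illustration, and pickle is used only because the interchange already relies on it.

```python
import pickle
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class TaskMessage:
    # Hypothetical envelope: the serialized task stays opaque in `buffer`,
    # while scheduling metadata travels beside it in plain form.
    task_id: int
    buffer: bytes
    resource_specification: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ResultMessage:
    # Hypothetical envelope: exactly one of `result` / `exception` is expected.
    task_id: int
    result: Optional[bytes] = None
    exception: Optional[bytes] = None


Message = Union[TaskMessage, ResultMessage]
_KINDS = {cls.__name__: cls for cls in (TaskMessage, ResultMessage)}


def pack(msg: Message) -> bytes:
    """Single place where any message is turned into bytes."""
    return pickle.dumps({"type": type(msg).__name__, **asdict(msg)})


def unpack(raw: bytes) -> Message:
    """Inverse of pack(): rebuild the typed message from bytes."""
    fields = pickle.loads(raw)
    kind = fields.pop("type")
    return _KINDS[kind](**fields)


def error_result(task_id: int, exc: BaseException) -> bytes:
    """One helper for every error the interchange reports on a task's behalf."""
    return pack(ResultMessage(task_id=task_id, exception=pickle.dumps(exc)))
```

With something along these lines, the version mismatch, ManagerLost, and drained-manager paths could all call `error_result(...)` instead of building their own dicts, and the interchange could read `resource_specification` without deserializing the task buffer.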

Describe the solution you'd like

We take a look at messagepack and see if we can use that; alternatively, we design a new protocol.
We then switch htex over to the protocol we decide on.

Describe alternatives you've considered

We could do nothing, but our current model makes it harder for new work to be done on HTEX.


benclifford · 2024-08-15T20:41:13Z

Describe alternatives you've considered

I'd really like to see incremental changes made to the existing protocol to address these issues before a massive protocol redesign: right now I see the inability of our current Parsl contributors to adapt the current protocol to address these issues as some evidence that those same contributors are not in a position to design a new protocol that addresses them.

I'm also very aware of the allure of building a first-iteration Grand Solution over the grunt work of actually fixing small problems by sustained boring effort. We've had interchange protocol issues open for some time, and it's clear no one has the motivation to fix them - probably then those same people do not have time to fix the issues that inevitably will arise with a Grand Solution rewrite.

Moving down one protocol layer, I've repeatedly heard Globus Compute people talk about messagepack, but without describing concrete benefits over our existing framing protocols (json and pickle). This issue is a pretty good place to start fleshing those out more concretely: pickle and json both are not particularly nice protocols here but they are both well supported and have not caused a lot of problems at this layer.
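
As a starting point for that comparison, here is a small sketch that round-trips one message through both existing framings. The message layout (`type`, `task_id`, `buffer`) is an assumption made up for illustration, not the actual interchange format. The one concrete difference it shows is that json has no bytes type, so a binary payload needs an extra encoding step such as base64, while pickle carries bytes directly but is Python-specific.

```python
import base64
import json
import pickle

# Hypothetical message layout, for illustration only.
message = {"type": "task", "task_id": 7, "buffer": b"\x00\x01payload"}


def encode_pickle(msg: dict) -> bytes:
    # pickle handles bytes (and arbitrary Python objects) natively.
    return pickle.dumps(msg)


def decode_pickle(raw: bytes) -> dict:
    return pickle.loads(raw)


def encode_json(msg: dict) -> bytes:
    # json cannot represent bytes, so the payload is base64-encoded first.
    wire = dict(msg)
    wire["buffer"] = base64.b64encode(msg["buffer"]).decode("ascii")
    return json.dumps(wire).encode("utf-8")


def decode_json(raw: bytes) -> dict:
    wire = json.loads(raw.decode("utf-8"))
    wire["buffer"] = base64.b64decode(wire["buffer"])
    return wire
```

MessagePack defines a native binary type, so it would not need the base64 step; whether that alone justifies a new dependency is the kind of concrete trade-off this issue could spell out.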

benclifford · 2024-08-16T12:04:44Z

one way to move forward with this is to document the current protocols to the level that you want the final protocol to also be documented: without any further work, that in itself is valuable for onboarding new people. then write the corresponding description of the desired target protocol. then build a series of supportable, reviewable and justifiable steps to get from one to the other, and proceed on that path.

benclifford · 2024-08-28T09:32:41Z

In talking about other things with @yadudoc, I get the sense that point 3 above (or something similar to it) has a background context of Globus Compute wanting to provide a much richer cross-version compatibility story for Globus Compute users, with something along the lines of up to four different GC installations contributing towards task execution (roughly, in parsl terms: 1. the preparation of a function for remote execution, e.g. serialization of a function object; 2. the serialization of arguments and other preparation of a task invocation; 3. task dispatch around the interchange, or around the endpoint in Globus Compute terms; 4. execution on a worker).

I said to @yadudoc in that discussion that I think that story needs to be much better fleshed out on the Globus Compute side of things before placing requirements on Parsl-level protocols.

yadudoc added the enhancement label Aug 15, 2024

yadudoc assigned yadudoc, benclifford, rjmello and khk-globus Aug 15, 2024

benclifford added the executor:htex label Aug 16, 2024

benclifford removed their assignment Aug 28, 2024
