HTEX Protocol update [DRAFT] #3586

Open
yadudoc opened this issue Aug 15, 2024 · 3 comments

Comments

@yadudoc
Member

yadudoc commented Aug 15, 2024

Problem statement

Currently, the HighThroughputExecutor (HTEX) has no clearly defined protocol for communicating tasks and results internally.
This causes several problems:

  1. Logic that packages results is duplicated throughout the htex interchange. Some examples:

    • Interchange reporting a version mismatch error: link

    • Interchange reporting ManagerLost exception: link

    • Interchange repeating heartbeat: link

    • Reporting a drained manager: link

    • Interchange pickling task dict before shipping to the manager: link

  2. There is a lot of cruft that needs cleanup

  3. GlobusCompute is interested in shipping task objects that HTEX can understand. Currently, HTEX is given a wrapper function that unpacks GC's task object and executes it. I'm not 100% sold on this, but I would like some discussion on what this would look like.

  4. Recently, @matthewc2003 had trouble adding the resource_specification to the task package so that the interchange can use it for better scheduling decisions.

  5. Lack of metadata has limited where MPIExecutor can make decisions based on resource_specification. Ideally, some rework here would make it easier to update the MPIExecutor to use resource_specification info on the interchange rather than leaving this logic to the worker.
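
To make points 1, 4 and 5 concrete, here is a minimal sketch of a task envelope that carries resource_specification as metadata, plus a single helper that packages results. The names `TaskMessage`, `pack_result` and `unpack_result`, and the field layout, are hypothetical and only for illustration. This is not HTEX's current wire format, and pickle is used only because it is one of the framings already in use.

```python
import pickle
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class TaskMessage:
    """Hypothetical task envelope; not Parsl's actual wire format."""
    task_id: int
    buffer: bytes  # opaque, already-serialized function and arguments
    # Metadata the interchange could read without unpacking the buffer.
    resource_specification: Dict[str, Any] = field(default_factory=dict)


def pack_result(task_id: int, *, result: Any = None,
                exception: Optional[BaseException] = None) -> bytes:
    """Build a result package in one place (hypothetical layout)."""
    message: Dict[str, Any] = {"type": "result", "task_id": task_id}
    if exception is not None:
        message["exception"] = pickle.dumps(exception)
    else:
        message["result"] = pickle.dumps(result)
    return pickle.dumps(message)


def unpack_result(raw: bytes) -> Dict[str, Any]:
    """Inverse of pack_result for the outer envelope only."""
    return pickle.loads(raw)


# The same helper would cover each error path listed under point 1.
task = TaskMessage(task_id=7, buffer=b"opaque-payload",
                   resource_specification={"cores": 2})
lost = pack_result(task.task_id, exception=RuntimeError("manager lost"))
```

With one helper like this, a version mismatch, a lost manager and an ordinary result would all pass through the same packaging code instead of each call site building its own dict.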

Describe the solution you'd like

  1. We take a look at messagepack and see if we can use that. Alternatively, we design a new protocol.
  2. Switch htex over to use the protocol we decide on.

Describe alternatives you've considered

We could do nothing, but our current model makes it harder for new work to be done on HTEX.

@benclifford
Collaborator

benclifford commented Aug 15, 2024

Describe alternatives you've considered

I'd really like to see incremental changes made to the existing protocol to address these issues before a massive protocol redesign: right now, I see the inability of our current Parsl contributors to adapt the current protocol to address these issues as some evidence that those same contributors are not in a position to design a new protocol that addresses them.

I'm also very aware of the allure of building a first-iteration Grand Solution over the grunt work of actually fixing small problems by sustained boring effort. We've had interchange protocol issues open for some time, and it's clear no one has the motivation to fix them - probably then those same people do not have time to fix the issues that inevitably will arise with a Grand Solution rewrite.

Moving down one protocol layer, I've repeatedly heard Globus Compute people talk about messagepack, but without describing concrete benefits over our existing framing protocols (json and pickle). This issue is a pretty good place to start fleshing those out more concretely: neither pickle nor json is a particularly nice protocol here, but both are well supported and neither has caused a lot of problems at this layer.
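
As a starting point for that comparison, here is a small stdlib-only sketch of how the two existing framings behave on the same message. The message contents (`type`, `manager_id`, `payload`) are made up for illustration and are not taken from the interchange code.

```python
import base64
import json
import pickle

# Hypothetical control message with a binary payload.
message = {"type": "heartbeat", "manager_id": "manager-0", "payload": b"\x00\x01"}

# pickle round-trips arbitrary Python objects, including bytes, but it is
# Python-only and must never be loaded from an untrusted peer.
pickled = pickle.dumps(message)
pickle_roundtrip_ok = pickle.loads(pickled) == message

# json is language-neutral, but it cannot encode bytes directly.
try:
    json.dumps(message)
    json_accepts_bytes = True
except TypeError:
    json_accepts_bytes = False

# One common workaround: base64-encode binary fields before framing.
json_safe = dict(message, payload=base64.b64encode(message["payload"]).decode("ascii"))
framed = json.dumps(json_safe).encode("utf-8")

decoded = json.loads(framed)
decoded["payload"] = base64.b64decode(decoded["payload"])
```

Any case for messagepack would presumably need to be argued against these two behaviours: native binary fields without the base64 step, and a format that does not tie both ends to Python.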

@benclifford
Collaborator

One way to move forward with this is to document the current protocols to the level that you want the final protocol to also be documented: without any further work, that in itself is valuable for onboarding new people. Then make the corresponding desired target protocol description. Then build a series of supportable, reviewable and justifiable steps to get from one to the other, and proceed on that path.

@benclifford
Collaborator

In talking about other things with @yadudoc, I get the sense that point 3 above (or something similar to it) has a background context of Globus Compute wanting to provide a much richer cross-version compatibility story for Globus Compute users, with something along the lines of up to four different GC installations contributing towards task execution. Roughly, in parsl terms: (1) the preparation of a function for remote execution (e.g. serialization of a function object); (2) the serialization of arguments and other preparation of a task invocation; (3) task dispatch around the interchange (around the endpoint in Globus Compute terms); (4) execution on a worker.

I said to @yadudoc in that discussion that I think that story needs to be much better fleshed out on the Globus Compute side of things before placing requirements on Parsl-level protocols.

@benclifford benclifford removed their assignment Aug 28, 2024

4 participants