WeeklyTelcon_20220517
Jeff Squyres edited this page May 17, 2022 · 2 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres (Cisco)
- Austen Lauria (IBM)
- Brian Barrett (AWS)
- David Bernholdt (ORNL)
- Edgar Gabriel (AMD)
- Howard Pritchard (LANL)
- Joseph Schuchart (UTK)
- Josh Fisher (Cornelis Networks)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Geoffrey Paulsen (IBM)
- Brendan Cunningham (Cornelis Networks)
- Hessam Mirsadeghi (UCX/NVIDIA)
- Tommy Janjusic (NVIDIA)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Sam Gutierrez (LLNL)
- Xin Zhao (NVIDIA)
- 4.1.4
- Merged in all the pending PRs
- MTT looks good from overnight
- Probably release this Thursday
- UCC collectives and a bunch of little bug fixes
- Josh H asks: is there predictability on when the nightly tarballs are available?
- Probably not.
- We probably could stand to update the nightly tarball infra to actually run as a Jenkins job (so that it runs on the build hosts -- not www.open-mpi.org). Need to find some time to do that...
- Last night's MTT picked up an old tarball, so it still shows some old failures.
- RC7 went out late Fri
- Have some compile failures; those should all be fixed.
- A few more fixes in flight.
- We talked yesterday about setting the minimum PRTE version to 2.0.2+fixes (i.e., latest PRTE public release + fixes). This may or may not be useful.
- But recall that PRTE 2.1 had a large command line refactor.
- Timeline for PRTE 2.1 isn't until end of summer (estimate).
- PRTE 2.1 is what we have been targeting for OMPI 5 for a while. It would be... weird to try to make PRTE 2.0 be for OMPI 5 (e.g., have to back-port all the command line refactor stuff, and other things).
- There are 7 critical 2.1.x issues on PRTE.
- Q: Isn't this a stop-ship for OMPI 5?
- Yes. :-(
- We should help with at least some of the PRTE 2.1 issues -- some of them are OMPI-related.
- We/OMPI need a public PRTE 2.1.x release so that packagers can have an OMPI package + PRTE package. The embedded PRTE is "not enough" for packagers.
- Sidenote: PRTE 2.0.x (including the publicly-released PRTE 2.0.2) would be weird for OMPI users, because it doesn't have all the CLI fixes/updates.
- Need OMPI community help here (for PRTE 2.1.x): https://github.com/openpmix/prrte/issues?q=is%3Aissue+is%3Aopen+label%3A%22Target+2.1%22
- Options:
- Figure out how to be happy with PRTE 2.0.2
- Wait for PRTE 2.1.0
- OPTIONALLY: Pour resources into PRTE 2.1.0 (which could make it release faster)
- Feels like the only reasonable path forward is to wait for 2.1.0, and we should all contribute resources as much as possible because we want it as fast as possible.
- Resource availability is slim right now :-(
- We should also set the minimum PRTE version to 2.1.x.
- We also talked about setting a minimum/floor version of PMIx for OMPI 5
- If OMPI 5 supports PMIx 3, we lose (at least):
- Debugger support
- show_help aggregation
- sessions
- ULFM/fault tolerance
- This is quite undesirable.
- Does anyone have a need for
- Last PMIx 4.0.x release was Dec 2020.
- We should probably target PMIx 4.1.x.
- PMIx 4.1.2 was released Feb 2022. --> Does not include show_help aggregation.
- Joseph brings up https://github.com/open-mpi/ompi/pull/10349 -- need to make sure this doesn't fall off the table.
- Howard is going to look at the sessions aspect of this.
- Howard also cites https://github.com/open-mpi/ompi/pull/10350
- Old issue that has re-surfaced: Intercomm communicators (when using more than 1 node) are hanging on main/v5.0.x. Josh Hursey thinks it might involve PMIx_Connect.
- These two issues seem to be dups of the same core issue:
- No one has looked into this. Josh Hursey hopes to triage it this week.
- Howard points out that RHC looked at this a while ago, and wrote up a suggestion: https://github.com/open-mpi/ompi/issues/10110.
- Corresponding PRTE issue: https://github.com/openpmix/prrte/issues/964
- This will also depend on which PMIx version we're using.
- There may also be a PRTE dependency...? Unknown. Need to triage.
- This is a regression for OMPI.
- New: Lisandro has hit segv with partitioned sending.
- New: ULFM issue: https://github.com/open-mpi/ompi/issues/10398
- Merging in small fixes.
- No plan for update
- Did not get to discuss this. See notes from last meeting.
- Did not get to discuss this. See notes from last meeting.
- Did not get to discuss this. See notes from last meeting.