WeeklyTelcon_20230131

Open MPI Weekly Telecon ---

Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

Geoffrey Paulsen (IBM)
Jeff Squyres (Cisco)
Brendan Cunningham (Cornelis Networks)
David Bernholdt
Edgar Gabriel (AMD)
Howard Pritchard (LANL)
Joseph Schuchart (UTK)
Josh Fisher (Cornelis Networks)
Josh Hursey (IBM)
Luke Robison (Amazon)
Thomas Naughton (ORNL)
Todd Kordenbrock (Sandia)
William Zhang (AWS)

New Items

Reminder: When there are issues with the various companies' CI, please post in the #general Slack channel.
New - Issue #11347 Versioning is wrong in v5.0.x
- We agreed v4.0.x -> v4.1.x -> v5.0.x should be ABI compatible.
  - Compile an MPI application with v4.0.x, then rm -rf the Open MPI install, then install v5.0.0 into the same location, and it should just work.
  - Did we figure out the Fortran ABI break?
    - From memory: yes, we did break the Fortran ABI.
    - Broke ABI in a very narrow case: when you compile Fortran with 8-byte ints and C with 4-byte ints.
    - Two other things that may or may not break ABI.
    - Did some stuff with intents and asyncs, and went from named interfaces to unnamed.
      - Unsure if this affects ABI.
  - ABI mostly just cares about C and mpif.h.
  - Fortran library has different .so versioning.
- Blocker for next v5.0.0rc - get library versioning correct.
- When we talk about ABI - Fortran will be nuanced.
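
To make the library versioning discussion concrete, here is a minimal sketch of how a libtool `current:revision:age` triple maps to shared library names on Linux. It assumes the standard libtool rule (soname major = current - age). The triples in the usage note are illustrative only, not the values from any Open MPI branch, and the helper names are hypothetical.

```python
def linux_so_names(lib, current, revision, age):
    """Map a libtool current:revision:age triple to Linux library names.

    Returns (soname, real_file_name). Assumes the usual libtool rule on
    Linux: major = current - age, file suffix = major.age.revision.
    """
    if age > current:
        raise ValueError("age must not exceed current")
    major = current - age
    soname = f"{lib}.so.{major}"
    real_name = f"{soname}.{age}.{revision}"
    return soname, real_name


def same_soname(old_triple, new_triple):
    """True if both triples produce the same soname major.

    The dynamic loader looks up the soname recorded at link time, so an
    application linked against the old library can only load the new one
    when the majors match.
    """
    old_current, _, old_age = old_triple
    new_current, _, new_age = new_triple
    return (old_current - old_age) == (new_current - new_age)
```

For example, `linux_so_names("libmpi", 60, 3, 20)` gives `("libmpi.so.40", "libmpi.so.40.20.3")`. A later release at `70:0:30` keeps the `.so.40` soname, so an already-linked application can still load it, whereas `80:0:0` would produce `.so.80` and the application would fail to find its library.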

v4.1.x

Made a minor change that needs another RC. Trying to get the RC built.

v5.0.x

The RC from last week got pushed to this week.
- Still waiting on https://github.com/open-mpi/ompi/issues/11354
- Maybe enable the dso option?
  - Accelerator initially picks CUDA and then disqualifies it, but at teardown it tries to tear down CUDA.
    - The reason is that CUDA now uses delayed startup, so it will still be enabled.
    - May need another variable to track whether CUDA was initialized.
  - Should also be on main (a comment says otherwise).
- Howard said after the call that this isn't a blocker for rc10
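
The "another variable" idea above can be sketched roughly as follows. This is not the Open MPI accelerator framework code; the class and method names are hypothetical and only illustrate keeping "selected" and "initialized" as separate pieces of state, so teardown runs only when the delayed startup actually happened.

```python
class LazyAccelerator:
    """Toy model of a component with delayed startup (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.selected = False     # picked during component selection
        self.initialized = False  # set only when delayed startup really ran
        self.events = []          # records what actually executed

    def select(self):
        self.selected = True

    def disqualify(self):
        # Disqualifying does not touch `initialized`: startup may never have run.
        self.selected = False

    def ensure_init(self):
        # Delayed startup: the first real use triggers initialization.
        if not self.initialized:
            self.events.append("init")
            self.initialized = True

    def finalize(self):
        # Tear down only what was actually initialized.
        if not self.initialized:
            return False
        self.events.append("teardown")
        self.initialized = False
        return True
```

With this split, a component that is selected and then disqualified without ever being used returns `False` from `finalize()` and records no teardown, while one that went through `ensure_init()` is torn down exactly once.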
Waiting on PMIx and PRRTE submodule update.
- Ralph pestered us to please merge it; it was just merged on main.
- Merged, will make rc10
Need documentation for v5.0.0
Manpages need an audit before release.
- Double check --prefix behavior
- Not the same behavior as v4.1.x
What is status of HAN?
- Joseph pushed a bunch of data, but not on the call. Go read this.
- Joseph had some more experiments. HAN collective component with shared memory PR, we were pretty good compared to tuned and another
  - Comparing HAN with the shared memory component.
  - How many ppr? Between 2ppr and 64ppr
- Better numbers, would be good to document this.
  - In OSU there's always a barrier before the operation. If the barrier and the operation match up well, you get lower latency.
  - We'd talked about supplying some docs about how HAN is great, and why we're enabling it for v5.0.0 by default.
    - Like to include instructions on how to reproduce as well for users.
    - document in ECP -
  - Our current resolution is to enable it as is, and fix current regressions in future releases.
  - What else is needed to enable it by default?
    - Just need to flip a switch.
    - The module that Joseph has for shared memory for HAN at the moment would need some work to add additional collectives.
    - And it relies on xpmem to be available.
    - So for now just enable HAN for collectives we have, and later enable for other collectives.
    - George would like to reuse what tuned does without reimplementing everything; a shared memory component is a better choice, but is more work.
    - If we don't enable HAN by default now, it's v5.1 (best case) before it's enabled.
      - The trade offs lean toward turning it on and fixing whatever problems might be there.
    - There is a PR for tuned (increases default segment size), and changes algorithms in tuned for shared memory.
    - Need to start moving forward, rather than doing more analysis.
