
WeeklyTelcon_20210330

Geoffrey Paulsen edited this page Apr 7, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

not there today (I keep this for easy cut-n-paste for future notes)

  • Austen Lauria (IBM)

  • Brendan Cunningham (Cornelis Networks)

  • Brian Barrett (AWS)

  • Edgar Gabriel (UH)

  • Geoffrey Paulsen (IBM)

  • Harumi Kuno (HPE)

  • Hessam Mirsadeghi (UCX/nVidia)

  • Howard Pritchard (LANL)

  • Jeff Squyres (Cisco)

  • Josh Hursey (IBM)

  • Michael Heinz (Cornelis Networks)

  • Naughton III, Thomas (ORNL)

  • Raghu Raja (AWS)

  • Ralph Castain (Intel)

  • Todd Kordenbrock (Sandia)

  • Tomislav Janjusic

  • William Zhang (AWS)

  • Marisa Roman (Cornelis Networks)

  • Matthew Dosanjh (Sandia)

  • Akshay Venkatesh (NVIDIA)

  • Artem Polyakov (nVidia/Mellanox)

  • Aurelien Bouteiller (UTK)

  • Brandon Yates (Intel)

  • Charles Shereda (LLNL)

  • Christoph Niethammer (HLRS)

  • David Bernhold (ORNL)

  • Erik Zeiske

  • Geoffroy Vallee (ARM)

  • George Bosilca (UTK)

  • Joseph Schuchart

  • Joshua Ladd (nVidia/Mellanox)

  • Mark Allen (IBM)

  • Matias Cabral (Intel)

  • Nathan Hjelm (Google)

  • Noah Evans (Sandia)

  • Scott Breyer (Sandia?)

  • Shintaro Iwasaki

  • Xin Zhao (nVidia/Mellanox)


New Items

  • master /
  • AWS can install CUDA and prevent build breakage.
  • And ask nVidia/Mellanox to add this to their MTT.

v4.1

  • MPICH datatype stuff that came in from IBM

Reformatting master

  • Sessions branch is pretty big, and needs to come back.

    • So plan was to wait for rest of formatting until sessions is rebased, and then format master.
    • Howard is having a few more issues on sessions, so he is okay with us reformatting now.
  • Reformatted

  • Doing formatting on master and v5.0.x seems reasonable

  • But reformatting v4.0.x and v4.1.x seems too risky.

  • clang-format instructions are in the format file.

  • He also ran clang-tidy, and we don't have directions for that yet.

  • Requires clang-format v10 or later (the clang-format version is independent of the clang compiler version)

    • Nathan will try to make it compatible with older v8
    • Geoff ping Nathan to request the v5.0.x version of opal PR.
  • clang-format is separate from the compiler toolchain

  • Will we require developers to use this?

    • Not requiring a github build to require it.
    • Will have a CI test that will check it.
    • Not in a path where every CI will have to have it installed.
  • Do we want to hold off on MORE before v5.0.0 ships? (or 6 months after?)

  • Should be rerun as a non-cherry-pick. Might be easy to lose

    • But the two branches are close.
  • Run it on master, try to PR to v5.0.x, and

  • Nathan can only run certain sections of the code-base with the systems he has.

    • Strongly encourage everyone test their sections.
    • PSM2 - doesn't even build in our CI, so someone should build/test this.
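The notes above say the formatting check needs clang-format v10 or later and that a CI test will enforce it. As a rough sketch of how such a check might gate on the tool version, the helper below pulls the major version out of the text printed by `clang-format --version`. The function names are hypothetical and not part of Open MPI, and the assumed output shape (the word "version" followed by a dotted number) is an assumption about typical builds, not something stated in these notes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI API): extract the major version
 * from the text printed by `clang-format --version`, for example
 * "clang-format version 10.0.0". Returns -1 if no version is found. */
static int clang_format_major(const char *version_output)
{
    static const char marker[] = "version ";
    const char *p;
    char *end = NULL;
    long major;

    assert(version_output != NULL);

    /* Distribution builds may prefix the string, so search for the marker. */
    p = strstr(version_output, marker);
    if (p == NULL) {
        return -1;
    }
    p += sizeof(marker) - 1;

    major = strtol(p, &end, 10);
    if (end == p || major < 0) {
        return -1; /* no digits followed the marker */
    }
    return (int) major;
}

/* Nonzero when the reported version meets the v10 minimum from the notes. */
static int clang_format_is_new_enough(const char *version_output)
{
    return clang_format_major(version_output) >= 10;
}
```

A CI script could capture the `--version` output, pass it to `clang_format_is_new_enough`, and skip or fail the formatting check when it returns 0.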

PR 8551 - New coding style enforced via clang-format

  • Needs a squash, missing signed off commit.
    • Austen will ping Nathan.
    • want in v5.0.x also
  • Merged to v5.0 3/29 - DONE with opal

Autoconf 2.70

  • This is working just fine at the moment, except for ROMIO.
    • ROMIO is throwing tons of warnings. But okay.
    • Would need to fix it upstream.
  • PMIx/PRRTE is updated.
  • Perhaps, for now, configure the 3rd-party packages with a flag that silences the obsolescence warnings.
  • Does someone want to ping Rob about it?
    • Jeff will

Testing

  • Intercomm Merge tests are timing out.
    • MTT master on HLS timeouts

32bit? Do we want to continue to support this?

  • Require a C11 compiler to support 32bit platforms.

  • Debian is the only Linux distro that supports 32bit.

    • Can be done in PMIx or PRRTE if desirable.
    • 32bit atomics stay, because we still support 32bit datatypes.
    • 32bit only architectures are removed.
  • Failure in prrte on v5.0.x, will be resolved in tonight's.

  • https://github.com/open-mpi/ompi/issues/8566

  • Using an actual 32bit gcc - Compile fail

  • Nathan thinks he might be able to write a compare-and-swap

  • v5.0 - good time to drop 32bit.

    • Jeff will send note to packaging, and see if they will care.
    • Debian is okay, they will just use MPICH
    • OSC/RDMA assumed everything was 64bit, but once we changed
  • On 32bit, if we could use C11 atomics with locks, it might be allowed.

    • So perhaps this would be a path.
    • Is C11 available on older 32bit systems?
    • gcc 6.0+ it should work fine.
  • Nobody has a strong opinion.

    • Pride issue, but it's also time and money
    • Right now the only thing breaking it is Nathan's 1sided code.
    • Let's ask Nathan what he thinks, and whether he has time to fix it.
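The discussion above mentions writing a compare-and-swap and possibly relying on C11 atomics (with locks) on 32bit platforms. As a minimal sketch of that idea, and not the actual Open MPI or OSC/RDMA code, a 64-bit compare-and-swap can be written with `<stdatomic.h>`. The helper names are hypothetical. On 32bit targets the compiler may implement the 64-bit operation with an internal lock, and with gcc it may need linking against libatomic (`-latomic`). `ATOMIC_LLONG_LOCK_FREE` is used here only as an approximate indicator for 64-bit integers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an Open MPI API): 64-bit compare-and-swap
 * built on C11 atomics. If *addr equals *expected, *addr is set to
 * desired and true is returned. Otherwise *expected is overwritten with
 * the current value of *addr and false is returned. */
static bool cas_u64(_Atomic uint64_t *addr, uint64_t *expected,
                    uint64_t desired)
{
    assert(addr != NULL && expected != NULL);
    return atomic_compare_exchange_strong(addr, expected, desired);
}

/* Compile-time hint from <stdatomic.h> for long long atomics:
 * 0 = never lock-free, 1 = sometimes lock-free, 2 = always lock-free. */
static int llong_lock_free_level(void)
{
    return ATOMIC_LLONG_LOCK_FREE;
}
```

A caller would typically retry in a loop: load the current value into `expected`, compute the new value, and call `cas_u64` until it returns true. A result of 0 from `llong_lock_free_level` suggests the locked fallback path would be in use.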

4.0.x

  • Shoot for a next RC of v4.0.6 on March 31st
  • blocking on UCX issues (see New topics above)
    • George, will get to it soon.
  • Too many Open Issues (50)
    • Geoff and Howard will go over v4.0.x issues, and try to close or address many of them.
      • May need to label some as wont_fix, and then close
      • Closed a number of issues.
  • Check status of ROMIO from MPICH vs in v4.1 vs v4.0.x

v4.1.x

  • Same boat, waiting for George's datatype fix.
  • A new v4.1 RC was built last week
  • Most of ROMIO fixes have gone into MPICH
    • 8371 - might be close
  • Intercomm Merge issue
    • may have gone away after PRRTE update on master
    • Investigating
  • blocking on UCX issues (see New topics above)
    • George will get to it soon.

Open-MPI v5.0

  • PMIx and PRRTE are close to a release candidate.
  • Is there a list of PRRTE issues that still need to be added?
    • No, just the ones in the issue.
    • Ralph thinks they're in PRRTE, but perhaps OMPI submodule not updated.
    • Raghu will check.
  • Regression is OSC/UCX that breaks Dynamic Windows.
    • Reported a year ago, but no update since.
    • Issue 6987
    • If UCX is going to be broken for this long, may
      • Couldn't get RDMA backend.
  • What do we do with the mpirun Manpage?
    • Didn't want OMPI requiring Sphinx, but if PRRTE and PMIx in same tar
  • Ralph almost has singleton comm spawn working
    • Single node without the mpirun process
  • Static MCA components default still on track for v5.0.x

Video Presentation BOF

  • ECP Community days ( March 30-April 1st )
    • Need SLIDES by close of business FRIDAY (not Saturday)
    • Each day 90 minute time slots.
    • Tuesday March 30th from 1-2:30pm (US Eastern)
      • LIVE
      • Invited some people to speak. They will be our main community speakers.
      • Anyone on OMPI community can send slides to Jeff and George
      • Due Friday March 26th
    • PMIx Wed 31st 11 - 12:30 (US Eastern)
    • Need to ensure no more MPIR, SLURM PMI1/2,

Longer Term discussions

Doc update

  • PR 8329 - convert README, HACKING, and possibly Manpages to reStructuredText.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent this is for v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work
  • No update - 3/16
    • Could be independent of PMIx and PRRTE.
    • PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.

ROMIO Long Term (12/8)

  • OLD
  • What do we want to do about ROMIO in general.
    • OMPIO is the default everywhere.
    • Giles is saying the changes we made are integration changes.
      • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to make clear that we shouldn't make changes to it?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
  • Putting new tests there
  • ULFM have some tests added there.
  • Need folks to add to MTT
  • Should have some new Sessions tests