WeeklyTelcon_20200630

Geoffrey Paulsen edited this page Jul 7, 2020 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Barrett, Brian (AWS)
  • Brendan Cunningham (Intel)
  • Christoph Niethammer (HL
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Matthew Dosanjh (Sandia)
  • Nathan Hjelm (Google)
  • Naughton III, Thomas (ORNL)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • David Bernhold (ORNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Josh Hursey (IBM)
  • Joshua Ladd (nVidia/Mellanox)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Intel)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • William Zhang (AWS)
  • Xin Zhao (nVidia/Mellanox)
  • mohan (AWS)

Release Branches

Review v4.1.x Milestones v4.1.0

  • Schedule: Want to release mid-July
    • RC1 probably can't happen by the end of this week; a lot of big PRs are outstanding.
  • Release Engineers: Brian (AWS) Jeff Squyres (Cisco)
  • We've come to consensus for a v4.1.0 release
    • Brendan - BTL
    • Need include/exclude selection, worried about consistent selection.
    • A lot of PRs outstanding, but can't merge until:
      • Patch for OFI stuff messed up v4.1.x branch.
      • Howard has a fix PR, Jeff is looking at it.
    • Howard changed new OFI BTL parameters to be consistent with MTL
    • Not breaking ABI or backwards compatibility.
    • v4.1.x branch, branched from v4.0.4 tag.
    • NOT touching runtime!!!
    • Not going to be pulling in a new PMIx version.
  • All MTT is online on v4.1.x branch
  • Not compiling under SLURM EFA test. (OFI BTL issue)
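
To illustrate the include/exclude selection concern noted above, here is a minimal sketch of how an include list or an exclude list could filter a set of provider names. It is not Open MPI's actual code: the function name `select_providers`, its arguments, and the rule that the two lists are mutually exclusive are assumptions made for this example only.

```python
def select_providers(available, include=None, exclude=None):
    """Filter provider names with a comma-separated include OR exclude list.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    # Assumption for this sketch: the two lists cannot be combined.
    if include and exclude:
        raise ValueError("include and exclude lists are mutually exclusive")
    if include:
        wanted = [name.strip() for name in include.split(",") if name.strip()]
        # Follow the order of the include list so the user controls priority.
        return [name for name in wanted if name in available]
    if exclude:
        banned = {name.strip() for name in exclude.split(",") if name.strip()}
        # Keep the original order of the available providers.
        return [name for name in available if name not in banned]
    # No list given: every available provider is eligible.
    return list(available)
```

Consistent selection here means that two components given the same lists end up with the same providers, which is why the sketch keeps one deterministic ordering rule per list type.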

Review v4.0.x Milestones v4.0.4

  • v4.0.4 Released
  • v4.0.5 - No schedule yet.
    • Two potential drivers for a quick v4.0.5 turn-around.
    • OSC RDMA Bug - May drive a v4.0.5 release.
    • Program Aborts on detach.
  • OSC pt2pt we have on v4.0.x
  • Fragmented Puts, the counting is not correct for a particular user request
    • Non-contiguous rPuts.
    • Also needed in a v4.0.5
  • How urgent is ROMIO fix?
    • Good to have in v4.0.5, but hard to make testcase to hit.
  • usNic failing almost all multi-node tests on v4.0.x
    • Jeff started to look at this last week, but didn't get through it.
    • v4.0.x WAS working; master is also seen failing.
    • ACTION - check back next week.
  • iWarp support Issue 7861.
    • How are we supposed to run iWarp in Open-MPI v4.0.x?
    • How much do we care about iWarp?
    • At a minimum need to update FAQ.

Review v5.0.0 Milestones v5.0.0

  • Need to put OSC pt2pt

    • OSC RDMA requires a single BTL that can contact every single process.
      • This didn't use to be the case. (Comment in the code)
  • We can't use the OSC pt2pt.

    • It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
    • This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
    • Either product A or B is broken and we need to fix it.
  • RDMA one-sided should fall back to "my atomics" because TCP will never have RDMA atomics.

    • The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
  • Jeff will close the PR, and

  • Jeff will Nathan will fetching, get, compare and swap.

  • Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.

  • Does UCX support iWarp?

    • Does libFabric support iWarp via verbs provider?
    • https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
    • Brian thinks that libFabric
    • OFI can support iWarp, just need to specify the provider in the include list.
    • The person asking is a partner, not a customer.
  • PMIx

    • Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
    • Sessions needs something from PMIx v4
    • ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
    • PPN scaling issue - simple algorithmic issue in this function
      • PMIx talked about it. Artem might know someone who might be interested in working on it.
      • Algorithm behind one of the interfaces doesn't scale well.
      • Not a regression. Above ~ 4K nodes, becomes quadratic.
  • PRRTE

    • Nothing's happening there.
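
As a rough illustration of the atomics fallback discussed above (doing one-sided atomics in software when the transport, such as TCP, has no RDMA atomics), the sketch below emulates fetch-and-add and compare-and-swap with a lock. This only shows the general technique, not the BTL base design: the class name `EmulatedAtomicCell` and its methods are hypothetical, and a real implementation would operate on remote memory rather than on a local Python object.

```python
import threading


class EmulatedAtomicCell:
    """Software emulation of atomic operations on a single integer.

    Hypothetical class for illustration; a lock stands in for the
    hardware atomicity that the transport does not provide.
    """

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, operand):
        # Return the old value and store old + operand as one indivisible step.
        with self._lock:
            old = self._value
            self._value = old + operand
            return old

    def compare_swap(self, expected, desired):
        # Store `desired` only if the current value equals `expected`.
        # The previous value is returned either way, as CAS usually does.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old

    def load(self):
        with self._lock:
            return self._value
```

Every operation goes through the same lock, so concurrent callers always observe a consistent value; the trade-off is that all updates are serialized.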

master

  • Austen went through master
  • UCX is failing in certain test cases, SEGV
    • Austen will open an issue.
  • PRRTE is hitting an assert in some cases.
    • Austen will open an issue.
  • Remaining CISCO failures look like connectivity issues.
    • Jeff hasn't gotten to look deeper yet.
    • Looks like USNIC is either not being picked or disqualifying itself internic.
  • CLANG - added float16
    • Need to add a special compiler flag for software emulation of float16.
    • We will not magically add that flag.

Face to face

  • Many companies are not allowing face-to-face travel until 2021 due to COVID-19.
    • Instead, let's do a series of virtual face-to-face meetings?
  • Yes, this summer, to discuss v5.0.
    • Maybe we can do it by topic?
    • Maybe not 4 or 8 hour things.
  • Different topics on different days.
  • Do a Doodle poll of least-worst days in late July/August.
  • Start a list of topics.

Super Computing Birds-of-a-feather

  • George and Jeff will help plan and bring it to the community.
  • May not have a Super Computing conference at ALL this year.
  • Many other projects are doing a virtual state-of-the-union type meeting to try to cover what they'd usually do in a Birds-of-a-feather meeting.
  • If this works pretty well, we could do this a couple of times a year.
  • Not constrained to Super Computing

Infrastructure

  • Scale testing: PRs have to opt into it.

Review Master Pull Requests

CI status


Dependencies

PMIx Update

ORTE/PRRTE

MTT

