WeeklyTelcon_20200630
Geoffrey Paulsen edited this page Jul 7, 2020
·
1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- Barrett, Brian (AWS)
- Brendan Cunningham (Intel)
- Christoph Niethammer (HL
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Matthew Dosanjh (Sandia)
- Nathan Hjelm (Google)
- Naughton III, Thomas (ORNL)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Josh Hursey (IBM)
- Joshua Ladd (nVidia/Mellanox)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Intel)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- William Zhang (AWS)
- Xin Zhao (nVidia/Mellanox)
- mohan (AWS)
Blockers All Open Blockers
Review v4.1.x Milestones v4.1.0
- Schedule: Want to release mid-July
- RC1 probably can't happen by the end of this week; a lot of big PRs are outstanding.
- Release Engineers: Brian (AWS) Jeff Squyres (Cisco)
- We've come to consensus for a v4.1.0 release
- Brendan - BTL
- Need include/exclude selection, worried about consistent selection.
- A lot of PRs outstanding, but can't merge until
- Patch for OFI stuff messed up v4.1.x branch.
- Howard has a fix PR, Jeff is looking at.
- Howard changed new OFI BTL parameters to be consistent with MTL
- Not breaking ABI or backwards compatibility.
- v4.1.x branch, branched from v4.0.4 tag.
- NOT touching runtime!!!
- Not going to be pulling in a new PMIx version.
- All MTT is online on v4.1.x branch
- Not compiling under SLURM EFA test. (OFI BTL issue)
Review v4.0.x Milestones v4.0.4
- v4.0.4 Released
- v4.0.5 - No schedule yet.
- Two potential drivers for a quick v4.0.5 turn-around.
- OSC RDMA Bug - May drive a v4.0.5 release.
- Program Aborts on detach.
- OSC pt2pt we have on v4.0.x
- Fragmented Puts, the counting is not correct for a particular user request
- Non-contiguous rPuts.
- Also needed in a v4.0.5
- How urgent is ROMIO fix?
- Good to have in v4.0.5, but hard to make testcase to hit.
- usNic failing almost all multi-node tests on v4.0.x
- Jeff started to look at this, but didn't get to dig into it last week.
- v4.0.x WAS working, and seeing Master failing.
- ACTION - check back next week.
- iWarp support Issue 7861.
- How are we supposed to run iWarp in Open MPI v4.0.x?
- How much do we care about iWarp?
- At a minimum need to update FAQ.
Review v5.0.0 Milestones v5.0.0
-
Need to put OSC pt2pt
- OSC RDMA requires a single BTL that can contact every single process.
- This didn't use to be the case. (Comment in the code)
-
We can't use the OSC pt2pt.
- It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
- This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
- Either product A or B is broken and we need to fix it.
-
One-sided RDMA should fall back to "my atomics" because TCP will never have RDMA atomics.
- The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
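
To illustrate the fallback idea above, here is a minimal sketch of software-emulated one-sided atomics: the side that owns the memory applies fetch-and-add and compare-and-swap under a lock, so a transport without hardware RDMA atomics (such as TCP) can still offer atomic semantics. This is not Open MPI's BTL base code; the `emu_*` names and the single-lock design are assumptions made purely for illustration.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical owner-side context: one lock serializes all emulated atomics. */
typedef struct {
    pthread_mutex_t lock;
} emu_atomic_ctx_t;

/* Initialize the context before any emulated atomic is issued. */
void emu_atomic_init(emu_atomic_ctx_t *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
}

/* Emulated fetch-and-add: adds operand to *target and returns the old value. */
int64_t emu_fetch_add(emu_atomic_ctx_t *ctx, int64_t *target, int64_t operand)
{
    pthread_mutex_lock(&ctx->lock);
    int64_t old = *target;
    *target = old + operand;
    pthread_mutex_unlock(&ctx->lock);
    return old;
}

/* Emulated compare-and-swap: writes desired only if *target equals expected.
 * Always returns the value observed before the operation. */
int64_t emu_compare_swap(emu_atomic_ctx_t *ctx, int64_t *target,
                         int64_t expected, int64_t desired)
{
    pthread_mutex_lock(&ctx->lock);
    int64_t old = *target;
    if (old == expected) {
        *target = desired;
    }
    pthread_mutex_unlock(&ctx->lock);
    return old;
}
```

In a real transport the requester would send the operation as an active message and the owner would run one of these handlers and reply with the returned value; the sketch shows only the owner-side step.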
-
Jeff will close the PR, and
-
Jeff will Nathan will fetching, get, compare and swap.
-
Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.
-
Does UCX support iWarp?
- Does libFabric support iWarp via verbs provider?
- https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
- Brian thinks that libFabric
- OFI can support iWarp, just need to specify the provider in the include list.
- The person who's asking is a partner, not a customer.
-
PMIX
- Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
- Sessions needs something from PMIx v4
- ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
- PPN scaling issue - simple algorithmic issue in this function
- PMIX talked about it. Artem might know someone who might be interested in working on it.
- Algorithm behind one of the interfaces doesn't scale well.
- Not a regression. Above ~ 4K nodes, becomes quadratic.
-
PRRTE
- Nothing's happening there.
- Austen went through master
- UCX is failing in certain test cases, SEGV
- Austen will open an issue.
- PRRTE is hitting an assert in some cases.
- Austen will Open Issue
- Remaining CISCO failures look like connectivity issues.
- Jeff hasn't got to look deeper to see
- Looks like USNIC is either not being picked or disqualifying itself internic.
- CLANG - added float16
- Need to add a special compiler flag for software emulation of float16.
- Not magically add that flag.
- Many companies are not allowing face-to-face travel until 2021 due to COVID-19.
- Instead, let's do a series of virtual face-to-face meetings?
- Yes this summer to discuss for v5.0
- Maybe we can do it by topic?
- Maybe not 4 or 8 hour things.
- Different topics on different days.
- Do a Doodle poll of least-worst days in late July/August.
- Start a list of topics.
- George and Jeff will help plan and come to community.
- May not have Super Computing conference at ALL this year.
- Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
- If this works pretty well, we could do this a couple of times a year.
- Not constrained to Super Computing
- scale-testing, PRs have to opt-into it.