WeeklyTelcon_20190108
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Todd Kordenbrock
- Edgar Gabriel
- Howard Pritchard
- Josh Hursey
- Joshua Ladd
- Ralph Castain
- Xin Zhao
- Aravind Gopalakrishnan (Intel)
- Brian Barrett
- Nathan Hjelm
- Dan Topa (LANL)
- Thomas Naughton
- Matias Cabral
- Akshay Venkatesh (nVidia)
- David Bernholdt
- Geoffroy Vallee
- Matthew Dosanjh
- Arm (UTK)
- George
- Peter Gottesman (Cisco)
- mohan
- Summary of PMIx re-architecting for v5.0
- Lots of TCP wire-up discussion
- Session work is complete (Nathan and Howard worked on it; see the sketch after this list).
  - Or check the archives for the MPI Sessions working group.
  - Works with MPI_Init.
  - Involved a lot of cleanup for setup and shutdown.
  - Can keep it as a prototype, or put it in without the headers.
  - For MPI_Init/MPI_Finalize-only apps, it's fully backward compatible:
    - Initializes a "default" Session.
  - Asking about adding this to master in mid-January.
  - Part of the cleanup is to make shutdown the reverse of setup.
  - Cleanup sounds good. Well contained. A set of patches.
  - Calling it "instances" inside of MPI, but we'll rename it if/when MPI standardizes sessions.
  - Summary: the cleanup patches - let's do them and review them.
    - The sessions work itself needs a closer look.
    - We can discuss sessions bindings in the future.
  - Session init is all local, so timing should still be good.
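For context, a minimal sketch of what a sessions-style program looks like, using the names later standardized in MPI-4.0 (the prototype discussed above called these "instances" internally, so the exact spellings here are an assumption):

```c
#include <mpi.h>

int main(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    /* No MPI_Init: the session is initialized locally, which is why
     * timing should still be good (no global synchronization). */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Resolve the built-in "world" process set into a group... */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);

    /* ...and create a communicator directly from that group. */
    MPI_Comm_create_from_group(group, "example.unique.tag", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}
```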
- Opal PR 6136 - Nathan did some OPAL cleanup prep for Sessions.
  - It's failing on ARM, and on PPC64LE with XL: missing a symbol.
  - Next: bring in the ompi cleanup.
  - Then create communicators from groups.
  - And need to bring in the PMIx v4.x updates.
  - Can use the embedded PMIx for the sessions stuff; have been so far.
  - Have to use prterun.
  - Is this going to require PMIx 4.x even for non-session MPI apps?
    - No, because "fake sessions" can use PMIx 3.x functionality.
    - It checks whether the group-create interface is available, and falls back if not (see the sketch after this list).
    - prterun / the prte server can support different PMIx versions at build time.
  - New MPI session based
  - Do we want to update mpirun inside of prte to know about sessions before prterun?
    - No; introduce sessions in prterun first, and follow with mpirun.
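A rough sketch of the fallback idea. PMIx_Group_construct is the PMIx 4.x group interface; probing for it with dlsym is just one way such a check could be written, and the helper names here are hypothetical:

```c
#define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdbool.h>

/* Does the PMIx we are running against provide the 4.x group-create
 * interface?  (Probing the symbol at runtime is an assumption; a real
 * implementation might use compile-time version checks instead.) */
static bool have_pmix_group_support(void)
{
    return NULL != dlsym(RTLD_DEFAULT, "PMIx_Group_construct");
}

/* Hypothetical call site: pick the sessions path or the fallback. */
void setup_session_group(void)
{
    if (have_pmix_group_support()) {
        /* PMIx 4.x path: construct a real PMIx group. */
    } else {
        /* "fake sessions" path: emulate with PMIx 3.x functionality. */
    }
}
```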
- New work: we need to contextify OPAL.
  - So that we can have just one libopal, and different systems can use the same one.
  - PRTE uses the MCA system but doesn't rename everything in PRTE; PMIx did rename.
  - One possibility could be to split it / rename it, and move on.
    - Two problems: atomics, and the MCA system itself.
  - The other possibility is to contextify, so that various projects can pass in a context and they behave nicely together.
  - Contextify: would need to look at all global vars and variable systems; the context would have to be evaluated. Would need to come up with a system where clients could register variables, etc. (see the sketch after this list).
    - The first step would be to determine what needs to be contextified versus shared.
    - Then you could have multiple instances that play together in the same process.
  - Bundling is also somewhat of an issue. We assume most users grab everything, but some other users deconstruct this.
  - Building libopal as a separate project, you end up with a configure script per project (unavoidable) and longer configure times.
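To make the "contextify" option concrete, a minimal sketch of what it could look like; every name below is hypothetical:

```c
/* Hypothetical sketch of a contextified OPAL: all formerly-global state
 * (MCA variable registry, framework lists, atomics bookkeeping) hangs
 * off an explicit context that each embedding project (OMPI, PRTE,
 * PMIx) creates and passes around. */
typedef struct opal_context {
    void *mca_registry;   /* per-context MCA variable registry */
    void *frameworks;     /* per-context list of open frameworks */
    int   refcount;       /* so multiple instances can coexist */
} opal_context_t;

/* Each project gets its own context... */
int opal_context_init(opal_context_t **ctx);
int opal_context_finalize(opal_context_t *ctx);

/* ...and every formerly-global API takes the context explicitly,
 * including variable registration by clients: */
int opal_context_var_register(opal_context_t *ctx,
                              const char *project,
                              const char *framework,
                              const char *component,
                              const char *name,
                              void *storage);
```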
- Really two problems:
  - The OPAL problem
  - The deconstruction problem
- The compatibility matrix becomes more complicated to define.
  - PRTE doesn't use much of OPAL. Perhaps the best answer is to sever it, rename it, and let it diverge; hand-merge changes to MCA and atomics into PRTE.
  - A bit of a maintenance headache, but perhaps easiest...
- Now we're at the point where we're shipping:
  - an MPI library, with a portability layer,
  - a runtime, with a portability layer,
  - an OPAL with a portability layer. Doesn't sound like an awesome story.
- At one point, we were going to have a configure script per project; no one liked it because configure times were slow, but maybe that's okay today.
- git submodules might help here also.
- At Amazon, they have a project where, whenever a commit gets pushed to a submodule, Jenkins creates and publishes a PR to the master project.
- Don't try to have it both ways: one big configure script when we build OMPI, and yet still an OPAL-only configure script.
- What about versioning information?
  - Either a solution that doesn't work for static builds, or a solution that will require versioning of OPAL; trending towards the latter.
- Need to come up with a solution, because we're getting to the time where we should do something. But we need direction.
  - Whatever solution we come up with might work for ALL of our embedded projects. If we're going to do something, it's nice to keep it consistent.
- Submodules are not too bad; people are using them more.
  - We could catch problems with CI, if we allow PRs only.
- Need to discuss how our branches track submodule releases (their master, or their versions).
Review All Open Blockers
- Schedule: posted a v2.1.6 rc1 (Nov 28th - no problems since then, but delayed for the holidays and a good round of MTT).
  - Driver: assembly and locking fixes, vader and PMIx, etc.
    - We think the atomic fixes didn't matter for PMIx in 2.1.x.
  - Should release by the end of the week, assuming good MTT nightly runs.
  - Uses the OLD release process, so not hindered by the AWS / Jenkins issue (see v4.0.x).
Review v3.0.x Milestones v3.0.3
- Scheduled v3.0.4 for May 2019.
  - May be able to pull this date IN.
- Will merge in PRs this afternoon.
Review v3.1.x Milestones v3.1.0
- Scheduled v3.1.4 for April 2019.
  - May be able to pull this date IN.
- Will merge in PRs this afternoon.
- Brian will reply to the question on GitHub.
Review v4.0.x Milestones v4.0.1
- Schedule: need a quick turnaround for a v4.0.1.
- v4.0.0 - a few major issues:
  - https://github.com/open-mpi/ompi/pull/6249 - Looks good, but possibly a wrong copyright; Howard will talk to Nathan about it.
    - Want the PR, but Howard is not sure about the copyright.
  - mpi.h is correct, but the library is not building the removed and deprecated functions because they're missing in the Makefile.
    - Fix is in https://github.com/open-mpi/ompi/pull/6120
  - Issue 6149 - tests are fine, but needs PR 6120.
    - Jeff re-reviewed this morning; a few more minor non-technical nits. Will get this in today.
  - Two issues hit via Spack packaging:
    - Root cause may be: make -j creates TOO many threads of parallel execution on some OSes.
    - Max filename length restriction on Fortran header files.
    - PR 6121 on master - should resolve on v4.0.x??
    - Manpage generation is Perl; Jeff runs it on a Mac (maybe some other magic); needs rman.
- Discussed pulling PR 6110 into v4.0.1.
  - Bug: some OSHMEM APIs were missed in v4.0.0.
  - Jeff pulled up slides showing that we can ADD APIs in minor versions:
    - Executables built against the old library must be able to run with the newer one.
    - We need to verify that the patch doesn't break anything for executables built against the old library.
    - Because this PR is just adding functions, it should be okay (see the sketch after this list).
  - Mellanox volunteered to test: build executables against the old OMPI and run them with the newer OMPI.
  - If that test passes, everyone is okay with pulling this in.
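A tiny illustration of the compatibility rule being applied here (hypothetical symbols, not the actual OSHMEM APIs): a minor release may add exported functions, and since nothing an old executable references changes, old binaries keep running.

```c
/* Hypothetical library ABI across a minor release.  v4.0.0 shipped: */
int fake_put(int pe, const void *buf);

/* v4.0.1 only ADDS symbols; every symbol an old executable resolves is
 * untouched, so binaries built against v4.0.0 still run: */
int fake_put_signal(int pe, const void *buf, int sig);
```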
- UCX priority PR - expecting a PR from master.
- Matias Cabral: local procs with the OFI MTL - the PR is okay on master; will be coming back to v4.0.x (PR 6106).
- Two rankfile mapper issues were reported on the mailing list. Howard will file issues.
- Need to create v4.0.x issues for https://www.mail-archive.com/[email protected]/msg32847.html
- @siegmargross
- Issue 6242 -
- Issue 6228 - Open MPI v4.0.2 would like PMIx 3.1.0 (still unreleased)
- PR 6191 - Aravind - asked Brian and Howard to take a look.
- OPAL issue - one version embedded in Open MPI, and another in PRTE.
  - How do we manage that overlapping code?
  - Similar to libevent and hwloc (prte, pmix, and ompi).
  - Already affecting us: if you want an external PMIx, you have to use an external libevent and hwloc.
  - We have a decision to make in the near future about libopal. It's used by other packages; we need to figure out a way out of this.
  - Brian is writing a doc on an approach.
  - Some discussion.
- A Libtool issue came up before or during Supercomputing.
  - Went around with many options - ultimately, we will need to version all .so's.
  - Need to explicitly version on each release branch going forward.
  - WON'T make OPALs on various release branches compatible with each other.
- Amazon AWS / Jenkins is still crippled.
  - Jenkins broke the EC2 plugin; there is a fix, but it has not been released.
  - https://wiki.jenkins.io/display/JENKINS/Amazon+EC2+Plugin
  - Brian tried to build the EC2 plugin from scratch, but wasn't able to make it work.
  - Scope of how this affects Open MPI projects:
    - The release build process is broken.
    - Only about 10% of CI tests are running right now.
  - Status: we're currently stuck waiting on this EC2 fix.
- Ralph worked a lot on the PMIx tools interface, and on documenting it for the standard.
  - Ralph should have 3 new chapters of the PMIx v4 standard document in a few weeks: PMIx group, PMIx tools, and PMIx fabric.
  - Ralph will send email to the PMIx announce list.
- Will release PMIx v3.1.0 in the next week or two for Open MPI v4.0.x.
- IBM test configure should have caught that.
- Cisco has a one-sided info check that failed a hundred times.
- Cisco install failure looks like a legit compile failure (IPv6, on master).
- We have a new open-mpi Slack channel for Open MPI developers.
  - Not for users, just developers...
  - Email Jeff if you're interested in being added.
Review Master Pull Requests
- Didn't discuss today.
Review Master MTT testing
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
- Amazon,
- Cisco, ORNL, UTK, NVIDIA