WeeklyTelcon_20170613
Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Edgar Gabriel
- Artem Polyakov
- Jeff Squyres (Cisco)
- Howard Pritchard
- Josh Hursey
- Joshua Ladd
- Mohan
- Todd Kordenbrock
- David Bernholdt
- Nathan Hjelm
- Ralph
- Brian Barrett (Amazon)
- George
- Ryan Grant
Review All Open Blockers
- released June 1st.
- No driver for a v2.0.4 at this time.
- v2.1.1 went out in May
- No Driver for v2.1.2 at this time.
Review Milestones v3.0
- PMIx - Looking good. Need Josh Hursey to confirm the threading issue is resolved.
- Ralph finished a branch that Josh can build (just now), Josh will kick off build and test today.
- Cisco tests had lots of issues (everything hung, some issue with slurm on cluster).
- Yesterday switched it over to rsh instead. Prefix may not be enabled by default, and LD_LIBRARY_PATH is not propagated by default.
- libquadmath and libimf issues.
- He preloads so many things on the launching node before launching that rsh is problematic.
- Cisco is using a somewhat older SLURM, 14.03 (just after 2.6, following the numbering scheme change).
- We were planning to do a v3.0 RC last week, but that didn't happen.
- We'll wait on an RC for two things:
- Cisco MTT is a bit concerning. It does an srun under the covers and just hangs on all of master, 3.0, and 2.0.
- Sporadic hangs (may be fixed now, but it's sporadic) and failures on ppc64le chips.
- sbatch and mpirun.
- Both run fine manually.
- Similar issue at UT - crank up the logging of SLURMD or run in foreground.
- Triple-check that it's not picking up daemons on the node.
- Amazon is running under SLURM 16.? - though currently direct launching.
- LLNL "gadget" is running okay.
- Ralph's PR: sounds like a few more days.
- PMIx 2.0
- OMPI 2.0 has a bunch of event notification in orte and opal
- OMPI 3.0, because PMIx now has event notification.
- Debugger attach now flows through PMIx 2.0.
- If Open MPI wants to support the PMIx 1.x series (via external) in OMPI 3.0, someone will have to write notification support. Issue 3660.
- It WILL compile with external PMIx 1.x, and will run many things, but without event notification support, you can't attach debuggers.
- OMPI 3.0 (default configure) doesn't work with SLURM 17.x
- Default out of the box results in launch failures.
- Don't know what it would take to fix this.
- Direct Launch is all handled by SLURM, so PMIx isn't involved.
- Brian will follow up with Ralph about running with earlier SLURM, but not SLURM 17.x. Might be a different issue.
- Looks like we'll be a little late on v3.0.
- Let's push back branching of the next release branch to July 13th for the face-to-face.
- v3.0 RC, will we hit Friday?
- Ralph just has an issue with master PR 3696.
- Failing in a munmap with an invalid pointer. Brian will try tonight.
Review Master Pull Requests
Review Master MTT testing
- Face2Face Meeting-2017-07
- Date: July 11-13 (9am Tuesday - noon on Thursday).
- Cisco has booked space in Chicago.
- Cisco has reserved some space right next to O'Hare (can get shuttle to hotel).
- We have met there before.
- Jeff will come in Monday evening.
- Amazon - bringing much more testing online, and CI processes.
- v3.0.0 Release work
- Improved Jenkins infrastructure. Hopefully some changes yesterday (in Jenkins setup at Amazon) will make it run a little faster.
- Travis is now officially deactivated. No longer using Travis.
- Amazon
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu