WeeklyTelcon_20160209
Jeff Squyres edited this page Nov 18, 2016
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres
- Geoff Paulsen
- Brad Benton
- Edgar Gabriel
- Howard Pritchard
- Joshua Ladd
- Nathan Hjelm
- Nysal Jan
- Ralph
- Ryan Grant
- Sylvain Jeaugey
- Todd Kordenbrock
- Yohann Burette
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3 - Targeting April, unless there is a need.
- Nathan will look at 0 byte send issue.
- dev list of SLURM issues already fixed in 1.10.2
- verbs usNIC not built by default - wait for review by Howard.
- Fortran 08 - Jeff will take a look today.
- SLES 12 - was a race condition: fork/exec happened before SIGCHLD detection was set up. Fixed.
- Long running jobs (Linpack) are still having SIGCHLD issues.
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Issue 1215 https://github.com/open-mpi/ompi/pull/1335: grpcomm errors
  - Ralph is unable to replicate; he didn't see it on Trinity or elsewhere at scale. He found where the problem is, but is still trying to figure out why the solution isn't working. Currently in the Ralph-and-Jeff-are-iterating phase.
- https://github.com/open-mpi/ompi/issues/1252: bad perf caused by openib
  - Only fails if openib finds valid procs. It shows up as soon as you call ibv_poll_cq on the 2nd socket. Still roughly 3 ms for openib intra-node.
  - Specific to Mellanox MOFED 3.0 verbs?
  - Mellanox has seen a far-socket penalty on Sandy Bridge, but the latency hit should be around 100 ns, not 7 ms!
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Mellanox would like new entrypoints in hcoll into 2.0 - Merged.
- Issue with add-procs on big-endian machines. We now want the minimum change needed to get 2.0 out.
- Easiest solution: for 32-bit and big-endian, don't turn on dynamic add-procs. Nathan will put in the change for 2.0.
- RFC to set the add_procs_cutoff to 32. PR1340
  - Just drop it down to 0!
- Async Modex - at scale it helps for sparse connectivity, hurts for full connectivity.
  - Which direction for the default? Right now it is full modex (longer launch time for people who may not need it).
  - Ralph thinks for 2.0 we should leave it where it is (optional) and figure out when to turn it on.
  - Concern that people won't know it's available. Put it in NEWS?
- --host vs. --hostfile behavior PR1344
  - Jeff would like it to be consistent with how oversubscription works, but with no -np it runs 1 proc.
  - Two issues: how many slots, and how many processes.
  - Change behavior so that if the user doesn't specify -np but DOES specify --host, we'll get 1 slot (and one process).
  - Keep hostfile behavior the same as today.
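
The proposed --host rule can be sketched as a small decision function. This is an illustration only, not Open MPI code: `host_defaults` and its arguments are hypothetical names, and it assumes that each host named in --host contributes exactly one slot, which is one reading of "1 slot (and one process)". Hostfile handling is left out because it stays as it is today.

```python
def host_defaults(hosts, np=None):
    """Return (slots, nprocs) for a launch that uses --host.

    hosts: host names given to --host (hypothetical, already parsed).
    np:    value of -np, or None when the user did not pass it.
    """
    # ASSUMPTION: every name listed in --host contributes exactly one slot.
    slots = len(hosts)
    # Without -np, run one process per slot; with -np, honor the request
    # (which may oversubscribe the available slots).
    nprocs = slots if np is None else np
    return slots, nprocs


if __name__ == "__main__":
    print(host_defaults(["node1"]))        # (1, 1)
    print(host_defaults(["node1"], np=4))  # (1, 4)
```

Keeping the slot count and the process count as separate return values mirrors the "two issues" raised in the discussion: the first decides what counts as oversubscription, the second decides how many processes actually start.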
- A lot of issues are usNIC related. Jeff will look at them.
- NVIDIA failures look dynamics related. Sylvain is fixing something about the way it launches.
- Nathan will look at all one-sided failures.
- tcp btl might have an issue: getting a warning about trying to lock a resource that is already locked.
- LANL
- Houston
- HLRS
- IBM
- LANL, Houston, HLRS, IBM
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel