
WeeklyTelcon_20160606

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Arm Painyasakdikul
  • Nathan
  • Edgar Gabriel
  • Ralph
  • Todd Kordenbrock
  • Geoff Paulsen
  • Howard Pritchard

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • Does anyone care about NAG on 1.10? - consensus is no.
  • Nathan: There is a threading issue on 1.10 in OB1 pending progress. There is a leak in RDMA, and a hang in RDMACM.
  • master PR 1758 - with BTL Vader (purposely allocating 4000 fragments) and Isend, it would go ahead and progress; this is a bug we've had for a long, long time. The frag list will leak and keep growing without bound.
    • Vader was the first to hit it because every other BTL's free list max is -1.
    • Need to hit really hard with lots of isends.
  • The door is closing on RHEL 7.3, so we want to get this into 1.10.3.
    • Red Hat / Ubuntu / SLES generally just pick up the latest for their release.
  • Let it go into master tonight, and see how it goes tonight before deciding risk for 1.10.3
    • Once it's into master, generate the PR for 1.10 to see how bad the backport is.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    • With the OB1 fixes, things are looking good. As of now, we haven't had a test that's causing things to blow up.
    • Hitting a lot of subsystems in Open MPI harder than we used to.
    • Threaded tests that we
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • Nysal - File open and a couple of others are not multi-thread safe, because we do a lazy open of the framework.
      • Should put a big warning in the README that file operations are not thread safe.
      • No workaround since there are no locks around those calls.
    • PR 1199 - are we expecting more commits?
      • Yes, a couple more; George and Nathan are going back and forth. It has gone into master and fixed the hang on master.
      • OB1 failures.
      • Nvidia issues.
      • Nathan cherry-picked up some warning cleanup code also.
    • Feeling better and better about this, but still more
    • PR 1218 MPOOL - rcache. Timeout. RDMACM test.
      • platform file revert on 1.10 around RDMACM.
      • iWARP people aren't complaining, and this is their only connection method.
      • Howard is not okay with merging yet; he wants a better explanation of why it's hanging.
      • Was on 2.x before Request fallout happened.
      • On master, it is hanging in Finalize waiting for disconnect. 1758.
    • RDMACM has been low priority for a long time (not default), so possibly not a blocker for v2.0, could go into 2.0.1
      • How could it be off for iWARP? It's low priority, but still works.
      • Should ping Chelsio and ask if they have an issue.
    • With these issues, we don't want to be pushing 1180 into v2.x.
      • Todd asked to put it into v2.0.0.
      • Howard put in the comment not to put it into v2.0.0, but we'd agreed earlier to let Todd put it in, and it's isolated and low risk. Jeff deleted Howard's comment since Howard has changed his mind and is okay with v2.0.0.
    • 1174 - asked Nathan to put changes to error out if not power of two.
      • Needs one more commit to clean up warnings about dropping the volatile keyword.
    • Looking at SCIF IP interface - BTL TCP IP exclude list.
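
For reference, the kind of check requested for 1174 (error out when a value is not a power of two) can be sketched as below. This is only an illustration, not the code from that PR: the minutes do not say which value is being validated, and the names `is_power_of_two`, `check_power_of_two`, and the `SKETCH_*` error codes are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch; not Open MPI's actual constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_BAD_PARAM = -1 };

/* A nonzero value is a power of two exactly when it has a single bit set,
 * so clearing its lowest set bit with n & (n - 1) leaves zero. */
static bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Reject a value that is not a power of two instead of silently
 * rounding it up or down. */
static int check_power_of_two(size_t value)
{
    return is_power_of_two(value) ? SKETCH_SUCCESS : SKETCH_ERR_BAD_PARAM;
}
```

Under these assumptions, a caller would pass the configured value to `check_power_of_two` and abort setup on anything other than `SKETCH_SUCCESS`; for example, 4096 is accepted and 4000 is rejected.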

Review Master MTT testing (https://mtt.open-mpi.org/)

  • MTT is continuing to improve, but there are lots of failures on master.
  • All Wait Any's failed on Master - Request Any.
  • Cisco will add a series with TCP Async enabled.

Update from MPI Forum

  • Lots of decisions on MPI_Comm_Info in working group.
    • Not all issues are resolved, but many of the main ones are.
    • Once this is done we can revisit MPI_Comm_Info implementation approach we began over a year ago.

MTT Dev status:

Status Updates:


Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
