
WeeklyTelcon_20160614

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Arm Patinyasakdikul
  • Edgar Gabriel
  • Howard
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
    • Appears to be ready to go, except for the PSM signal issue, which we'll discuss under a new agenda item.
    • Dynamic Comm Spawn has a problem when disconnecting. Maybe you need both Free and disconnect.
      • On disconnect, the child tries to send a signal to the parent and gets an unreachable error.

Review 2.0.x

NEW Agenda for this week:

  • Timing of v1.10.3 vs v2.0.0 releases

    • coordination of NEWS bullets
  • PSM/PSM2 signal hijacking: fix for v1.10.x and v2.0.0

    • Jeff filed a PR that fixes the SEGV. Make sure the wording is good (it mentions the vendor).
      • The PR looks in the environment for either the PSM or the PSM2 variable. If the env var is NOT set, the PR sets it to disabled.
        • The default is to not produce PSM backtrace files unless the user asks for them via the env var.
      • In JNI onload they dlopen libmpi, so do it for this.
      • Open MPI has always had its own backtrace handler, and we never understood why the signal handler was failing.
    • For debugging, the PSM and PSM2 libraries call getenv(var) and register signal traps for various handlers.
      • PSM2 correctly chains the signal handlers and puts the old handlers back when it is done.
        • The only reason protection is needed here is for the HFI PSM2 library. A typo was discovered there: at finalize it was resetting the signal handler to a random point in memory.
        • Intel is pushing fixes back, aiming for the latest Fedora 25 (small window) so they eventually get picked up by RHEL 7.3?
  • Next developer’s meeting - Will create a doodle for time.

    • Cisco Chicago by Ohare
    • IBM DFW
    • Cisco San Jose
  • Begin planning for 3.0 branch - There is a LOT of change on master. Several pages of changes for 2.1.

    • Perhaps branch for 3.0 in the August timeframe, since master and 2.x have diverged a lot?
      • Well, maybe we don't need to.
    • No known ABI break for 3.0, so we won't fork for 3.0 yet.
    • We need to figure out a procedure for getting code changes in; dual check-ins to master and 2.x will be painful.
      • Should we look into possibly bringing much of the 2.x branch up to date with master? That is a lot of risk!
  • MTT development - A lot of development

    • Do we need an MTT telecon for a while (biweekly?) Ralph will set up a doodle.
  • non-member access to ompi-tests -

    • Non-members are asking for the tests so they can run testing. Seems great, but we haven't yet given access to non-members.
    • In this case they are working on the contributor's agreement.
  • Open MPI 2.0 testing is down to about a 0.2% error rate on Jeff's runs.

  • Group Proc Count errors

    • It'd be nice to group MTTs that are not common, sparse groups going into 2.1.
  • Would be nice to have a Known_Failure file of some sort.

  • Also would be nice to "group" certain tests (like via a tag) such that when all of them fail, it's easier to know "That's all of the MPI_Group_create tests".

Review Master MTT testing (https://mtt.open-mpi.org/)

  • Still a lot of failures on master.
  • Cray failures may be a cluster issue; Howard needs to look at them.

MTT Dev status:

Status Updates:


Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
