
WeeklyTelcon_20170627

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres (Cisco)
  • Howard Pritchard
  • Josh Hursey
  • Todd Kordenbrock
  • David Bernholdt (ORNL)
  • Nathan Hjelm
  • Ralph
  • Brian Barrett (Amazon)
  • Artem

Agenda

2.0.3

  • Targeting October for the next 2.0.x release.

Review v2.x

  • Targeting mid-August for the next 2.1.x release.
  • No RC last week.
  • Going to merge in last couple of Ralph's PMIx changes.
  • Josh Hursey needs to review PR 3754.
  • Will create an Open MPI v3.0.0 RC1 today.
  • Focus of RC1 testing will be around ORTE launching.
    • Some orteds are still occasionally getting killed.
    • Some complaints about the changes to how processes are killed.
  • Larger picture schedule for v3.0?
    • Like to get feedback on RC1.
    • Haven't had a lot of testing on v3.0 branch now.
  • There are a bunch of MPI layer PRs (some are review required)
    • two PRs
    • ROMIO PR (requires REVIEW)
    • RDMA PR (requires REVIEW)
    • Any special features for NEWS? Only responses from Mellanox.
  • MTT: Cisco turned off; Leave Session Attached is busted.
  • IBM added some MPI dependencies in OPAL layer, but no CI caught it.
    • Running autogen.pl with --no-ompi (and some other flag) would catch some abstraction layer violations like this.
  • The branch for the next release will be created at the end of the Face to Face in July.
  • Expectation is for folks to test the RC.
    • Down the road we should make a release tarball each night, and have MTT test THAT nightly.
    • Very different in how they're built, until they call 'make dist'.
  • Some corruption in Cray PMIX component on Master, about a week ago.
  • Monitoring components - replaces ptraces stuff. Some segfaults in this.
    • Don't think they're supposed to be on by default. Possibly bug in GLUE.

  • Mellanox was having some MTT testing issue, Artem will look at it.
    • Mellanox might be seeing it because of deprecated build status stuff.
  • Some issues with tests that run successfully, but then hang at the end of output and die due to timeout.
  • Right now, PR testing builds exactly what the person submitted in the PR.
    • But could build AFTER a merge of the PR and test THAT.
    • IBM has seen internally that this method caught a failure before it was merged to the branch.
    • Amazon likes this approach also.

MTT Dev status:

  • Intel is pushing content somewhat regularly, but unclear how much longer.
    • Not seeing much benefit.
  • Howard - Trying to use it and trying to work on the viewer.

Exceptional topics

  • Face2Face Meeting-2017-07
    • Date: July 11-13 (9am Tuesday - noon on Thursday).
    • Cisco has booked space in Chicago.
    • Jeff will see about setting up a Web-Ex for those who are interested.
      • Please email him if you are interested in attending via Web-Ex.
    • No Fees at this face to face.
  • From mailing list (From SuSE) - Reproducibility of the build.
    • They want to be able to binary-compare any two builds to see whether they are the same, but can't because of the embedded date.
    • Lots of pros / cons to having date in build.
    • Put it in ompi_info - build host, build date, Manpages (stamped at make dist).
    • Maybe add some DATE environment variable to force the date, post v3.0.
  • dlopen LOCAL is painful - Issue 3705
    • Each MCA library should be linked against the libraries it actually depends on.
    • We used to link the components against the libraries, but then we stopped.
      • Jeff Recalls: But then we stopped because we'd link MPI components against both MPI and ORTE.
      • Jeff Recalls: But if you do an upgrade, then you're screwed...
      • Brian Recalls: OSX namespacing issue...
      • Need to do some archaeology.
      • Ralph remembers there was SOME reason we don't do this linkage.
    • Not for v3.0; a topic for Face to Face discussion.
    • Maybe add a configure option to do this.
  • For v4.0 do we want to keep hwloc internal, or just use external?
    • Compromise would be to change the precedence to prefer external over internal for all of our libs?
      • Then in a future release, remove internals (or some at least) completely?
    • RHEL5 doesn't have hwloc.
    • Fixed something that now allows Open MPI to use older hwloc 1.3, 1.4, 1.5 or something, but still not v1.0.
  • What to do about libevent? - look at all of them at face to face.
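
As a rough illustration of the "DATE env" idea from the build reproducibility item above, the sketch below shows one way a build-stamping script could take its date from the environment. It assumes the SOURCE_DATE_EPOCH variable from the reproducible-builds convention; the function name build_date is hypothetical, and nothing here reflects how Open MPI's build system actually stamps dates.

```python
import os
import time


def build_date(environ=None):
    """Return the build date as an ISO-8601 UTC string.

    Assumption: if SOURCE_DATE_EPOCH is set (seconds since the Unix epoch),
    it overrides the wall clock so repeated builds can stamp the same date.
    """
    env = os.environ if environ is None else environ
    raw = env.get("SOURCE_DATE_EPOCH")
    # Fall back to the current time when no override is provided.
    epoch = int(raw) if raw else int(time.time())
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(epoch))


if __name__ == "__main__":
    print(build_date())
```

With the variable set to the same value, two separate builds would stamp the same date, which removes one source of differences in a binary comparison.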

Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017
