Jeff Squyres edited this page Aug 19, 2016 · 72 revisions

August 2016 Open MPI Developer's Meeting

Logistics:

  • Start: 9am, Tue Aug 16, 2016
  • Finish: 1pm, Thu Aug 18, 2016
  • Location: IBM facility, Dallas, TX
  • Attendance fee: $50/person, see registration link below

Attendees

Please both register at EventBrite ($50/person) and add your name to the wiki list below if you are coming to the meeting:

  1. Jeff Squyres, Cisco
  2. Howard Pritchard, LANL
  3. Geoffrey Paulsen, IBM
  4. Ralph Castain, Intel
  5. George Bosilca, UTK (17 and 18)
  6. Josh Hursey, IBM
  7. Edgar Gabriel, UHouston
  8. Takahiro Kawashima, Fujitsu
  9. Shinji Sumimoto, Fujitsu
  10. Brian Barrett, Amazon Web Services
  11. Nathan Hjelm, LANL
  12. Sameh Sharkawi, IBM (17 and 18)
  13. Mark Allen, IBM
  14. Josh Ladd, Mellanox (17)
  15. Perry Schmidt (missed registration)
  16. Sameh Sharkawi (missed registration)

Topics Still To Discuss

  • MTT updates / future direction

  • Thread-safety design

    • Need some good multi-threaded performance tests (per Nathan and Artem discussion)
      • Do we need to write them ourselves?
    • Review/define the path forward
  • Fujitsu status

    • Memory consumption evaluation
    • MTT status
    • PMIx status
  • Revive btl/openib memalign hooks?

  • Request completion callback and thread safety

  • Discuss appropriate default settings for openib BTL

    • Email thread on performance conflicts between RMA/openib and SM/Vader
  • Ralph offers to give presentation on "Flash Provisioning of Clusters", if folks are interested

  • Performance Regression tracking

  • What to do about MPI_Info PR from IBM / MPI Forum gyrations about MPI_Info?

Topics Already Discussed

NOTE: Some notes are included below, but a much more detailed writeup can be found in the meeting minutes.

  • Status of v2.0.1 release

    • Lots of PRs still...
    • From the meeting:
      • Closing in on v2.0.1. Most PRs are in. Release next Tuesday (Aug 23, 2016) if possible
  • After v2.1.0 release, should we merge from master to the v2.x branch?

    • Only if there are no backwards compatibility issues (!)
    • This would allow us to close the divergence/gap from master to v2.x, but keep life in the v2.x series (which is attractive to some organizations)
    • Alternatively, we might want to fork and create a new 3.x branch.
    • From the meeting:
      • Long discussion. There seem to be two questions:
        1. What to call the release after v2.1.x: v2.2.x or v3.x (i.e., whether there are backwards compatibility issues or not)
        2. Whether to merge master into the v2.x branch or fork into a new branch (regardless of whether the next release is v2.2.x or v3.x)
      • The consensus seems to be that (although no one has systematically analyzed it, so we don't know for sure) there are both:
        1. A huge amount of code drift from master to v2.x such that a merge may generate tons of conflicts
        2. A bunch of backwards-incompatible changes (e.g., MCA vars and CLI params)
      • Meaning: we think the next release should be v3.x and it should be a fork from master
  • Migration to new cloud services update for website, database, etc.

    • DONE:
      • DNS:
        • All 6 domains transferred to Jeff's GoDaddy account
      • Web site:
      • Mailing lists:
        • Migrate mailman lists to NMC
        • Freeze old mailing list archives, add to ompi-www git
        • Add old mailing list archives to mail-archive.com
        • Set up new mails to archive to mail-archive.com
      • Email
        • Set up 2 legacy email addresses: rhc@ and jjhursey@
      • Infrastructure
        • Nightly snapshot tarballs being created on RHC's machine and SCPed to www.open-mpi.org
      • GitHub push notification emails (i.e., "gitdub")
        • Converted Ruby gitdub to PHP
        • Works for all repos... except ompi-www (due to memory constraints)
          • Might well just disable git commit emails for ompi-www
      • Contribution agreements
    • Still to-do:
      • Web site:
        • Probably going to shut down the mirroring program.
        • Possibly host the tarballs at Amazon S3 and put CloudFront in front of them
      • Spin up an Amazon EC2 instance (thank you Amazon!) for:
        • Hosting Open MPI community Jenkins master
        • Hosting Open MPI community MTT database and web server
      • Revamp / consolidate: ompi master:contrib/ -- there are currently 3 subdirs that should really be disambiguated and have their overlap removed. Perhaps name subdirs by the DNS name where they reside / operate?
        • infrastructure
        • build server
        • nightly
      • Spend time documenting where everything is / how it is set up
      • Fix OMPI timeline page: https://www.open-mpi.org/software/ompi/versions/timeline.php
      • Possible umbrella non-profit organization
      • Update Open MPI contrib agreements
        • Created a new contributions@lists. email address, will update agreements
  • MCA support as a separate package?

    • Now that we have multiple projects (PMIx) and others using MCA plugins, does it make sense to create a separate repo/package for MCA itself? Integrating MCA into these projects was modestly painful (e.g., identifying what other infrastructure - such as argv.h/c - needs to be included) - perhaps a more packaged solution will make it simpler.
    • Need to "tag" the component libraries with their project name as library confusion is becoming more prevalent as OMPI begins to utilize MCA-based packages such as PMIx
    • From the meeting:
      • The need for this has gone down quite a bit: PMIx copied and renamed, Warewulf is going to go Python.
      • But it seems worthwhile to take the next few steps in spreading the project name throughout the MCA system:
        • Put the project name in the component filename: mca_PROJECT_FRAMEWORK_COMPONENT.la
        • Add some duplicate-checking code in the MCA var base: if someone sets a value for FRAMEWORK_COMPONENT_VAR, and there's more than one of those (i.e., the same framework/component/var in two different projects, and the project name was not specified), then we need to error and let a human figure it out.
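
The two proposals above (project name in the component filename, and duplicate checking when a var is set without a project name) could look roughly like the sketch below. This is only an illustration of the logic, not the MCA var base code: `resolve_var`, `component_filename`, `AmbiguousVarError`, and the tuple-based registry are all hypothetical names, and it assumes that a FRAMEWORK_COMPONENT_VAR string splits cleanly on its first two underscores.

```python
# Hypothetical sketch of the duplicate check; not Open MPI's actual MCA var code.


class AmbiguousVarError(Exception):
    """Raised when an unqualified var name matches more than one project."""


def component_filename(project, framework, component):
    """Build the proposed mca_PROJECT_FRAMEWORK_COMPONENT.la filename."""
    return f"mca_{project}_{framework}_{component}.la"


def resolve_var(registry, name, project=None):
    """Find the single registered var that `name` refers to.

    registry: iterable of (project, framework, component, var) tuples.
    name:     "FRAMEWORK_COMPONENT_VAR"; assumed to split on the first two "_".
    project:  optional project name used to disambiguate.
    """
    framework, component, var = name.split("_", 2)
    matches = sorted(
        entry
        for entry in registry
        if entry[1:] == (framework, component, var)
        and (project is None or entry[0] == project)
    )
    if not matches:
        raise KeyError(name)
    if len(matches) > 1:
        # Same framework/component/var in two projects and no project given:
        # error out and let a human figure it out.
        projects = ", ".join(entry[0] for entry in matches)
        raise AmbiguousVarError(f"{name} exists in multiple projects: {projects}")
    return matches[0]
```

With a made-up registry holding the same framework/component/var under two projects, `resolve_var(registry, name)` raises `AmbiguousVarError`, while `resolve_var(registry, name, project="pmix")` returns the single matching entry.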
  • Plans for folding ompi-release GitHub repo back into ompi GitHub repo

  • (Possibly) Remove atomics from OBJ_RETAIN/OBJ_RELEASE in the THREAD_SINGLE case.

  • Continue --net mpirun CLI option discussion from Feb 2016 meeting

    • Originally an IBM proposal.
    • Tied to issues of "I just want to use network X" user intent, without needing to educate users on the complexities of PML, MTL, BTL, COLL, ...etc.
    • We didn't come to any firm conclusions in February.
    • From the meeting:
      • There was a long discussion about this in the meeting; see the meeting minutes for more detail.
  • MPI_Reduce_local - move into coll framework.

    • From the meeting:
      • It isn't in the coll framework already simply because it isn't a collective.
      • But IBM would like to have multiple backends to MPI_REDUCE_LOCAL
      • The OMPI Way to do this is with a framework / component
      • Seems like overkill to have a new framework just for this one MPI function
      • So it seems ok to add it to the coll framework
  • Should we be using Slack.com as a community?

    • From the meeting:
      • Seems like no one cares
  • Symbol versioning

    • Per request from Debian: https://github.com/open-mpi/ompi/pull/1955
    • There are 3 issues:
      1. Symbol visibility. Per his PR, it looks like we're leaking a lot of symbols that do not need to be public.
      2. .so version numbering for MPI and OSHMEM libraries. We have made promises about this; we just need to honor those promises.
      3. Symbol versioning.
        • Symbol versioning can fix the visibility problem, but that's really a side effect. We should just fix the visibility issue with proper use of DECLSPEC.
        • Per the MPI spec, we don't need to version the MPI API calls (because the MPI Forum will not break APIs)
        • Will it help to version things like MPI_Comm? (e.g., if we grow the size of the communicator struct)
    • From the meeting:
      • Jeff will reply to the Debian guy: we don't see a reason for symbol versioning
      • Jeff will pull in Orion (RHEL packaging guy) and ask him, too
    • Fallout from this discussion: should we return to a single library so that the ORTE and OPAL symbols will not be exposed to users?
      • We have flip-flopped on 1 vs. 3 libraries multiple times.
      • From Feb 2016 meeting notes:
        • Beginning of the project: there was just libmpi. Later, it was split into projects, and then the project libraries. Later, the build was unified back into libmpi again.
        • In Dec 2012 (here's the commit), we split the build back into 3 libraries. The commit message cites discussion at the Dec 2012 Open MPI dev meeting -- but there are unfortunately no clues in the wiki notes as to the rationale for doing this. Was it just because we developers like having 3 smaller libraries? Or is there some deeper technical issue? Neither Ralph nor Jeff remembers. 😦
        • Rationale for bringing this up again: when upstream projects are trying to link against portions of our project, and then also support apps that link against all of it, we run into conflicts (e.g., the ORTE being used by the upstream project may be different than the one being used by the OMPI installation). Slurping it all up into one library -- and making only the MPI API be visible -- would resolve the problem. ...but we cannot recall if there are undesirable side-effects.
        • No issue with recombining libraries, need to look closer at whether ORTE/OPAL symbols still need to be public
    • From the meeting:
      • Doesn't seem like there's much of a need for this any more
      • Also: it doesn't solve the "can hide the ORTE/OPAL symbols" problem, because those need to be exposed for plugins.
  • PMIx roadmap discussions

  • Plans for v2.1.0 release

    • Need community to contribute what they want in v2.1.0
    • Want to release by end of 2016 at the latest
    • From the meeting:
      • Lots of discussion
      • See the meeting minutes for more details
  • Annual git committer audit

  • Present information about IBM Spectrum MPI, processes, etc.

    • May have PRs ready to discuss requested changes, but schedule is tight in July / August for us.
  • How to help alleviate "drowning in CI data" syndrome?

    • One example: https://github.com/open-mpi/ompi/pull/1801
    • One suggestion: should we actively market for testers in the community to help wrangle this stuff?
    • If Jenkins detects an error, can we get Jenkins to retry the tests without the PR changes, and then compare the results to see if the PR itself is introducing a new error?
    • How do we stabilize Jenkins to alleviate all these false positives?
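
The "retry without the PR changes, then compare" idea above boils down to a set comparison between two test runs. A minimal sketch, assuming each run can be reduced to a list of failing test names (the function name and result keys are hypothetical, and nothing here talks to Jenkins itself):

```python
# Hypothetical sketch: compare failures from a PR build against a baseline
# build of the same target branch without the PR changes.


def classify_failures(pr_failures, baseline_failures):
    """Split the PR build's failures into new vs. pre-existing ones."""
    pr = set(pr_failures)
    base = set(baseline_failures)
    return {
        # Fail only with the PR applied: likely caused by the PR.
        "introduced_by_pr": sorted(pr - base),
        # Fail with and without the PR: not the PR's fault.
        "preexisting": sorted(pr & base),
        # Fail only without the PR: possibly fixed by the PR.
        "fixed_by_pr": sorted(base - pr),
    }
```

A single baseline rerun will still mislabel intermittent failures, so this helps with attribution but does not by itself address the false-positive problem.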

Presentation Material
