WeeklyTelcon_20180918

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Geoff Paulsen
  • Howard Pritchard
  • Edgar Gabriel
  • Brian
  • Matias Cabral
  • Ralph Castain
  • Akvenkatesh (NVIDIA)
  • Todd Kordenbrock
  • Xin Zhao
  • Aravind Gopalakrishnan (Intel)

Not there today (I keep this list for easy cut-n-paste for future notes)

  • Matthew Dosanjh
  • Thomas Naughton
  • Nathan Hjelm
  • Josh Hursey
  • Geoffroy Vallee
  • Dan Topa (LANL)
  • Joshua Ladd
  • David Bernholdt
  • George
  • Peter Gottesman (Cisco)

Agenda/New Business

  • Ralph proposed moving mailman to a new hosting site.

    • mailmanhost.com - $3/list for up to 4K members.
    • Dotlist is the company behind them.
    • We have about 2,600 subscribers.
    • May have had some more issues today with the current provider.
    • No action until the face-to-face.
  • Silent Wrong Issue(s)

    • Vader fence issue (originally Issue 4937).
    • Released v2.1.x with this.
    • Other things for v3.1.x:
      • Put out an RC for v3.1.x.
    • ACTION: Did this get fixed for v4.0.x?
    • ACTION: Did this go to all release branches?
  • Nathan is requesting comments on:

    • C11 integration into master: PR 5445.
    • Got good comments from George and others.
    • Eliminate all of our hand-rolled atomics in favor of C11 atomics.
      • Will need to support non-C11 compilers until 2020 due to RHEL.
    • Nathan agreed to clear out the old stuff now, and will rebase.
  • GitHub suggestion on email filtering

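A minimal sketch of the C11-atomics direction discussed above (PR 5445): replacing hand-rolled assembly atomics with the portable <stdatomic.h> operations. The function name below is illustrative only, not a real Open MPI symbol.

```c
#include <stdatomic.h>

/* Illustrative only: shows the C11 primitives that would replace
 * Open MPI's hand-rolled assembly atomics. */
static int c11_atomic_demo(void)
{
    atomic_int counter = 0;

    /* fetch-and-add, the kind of operation currently implemented in asm */
    atomic_fetch_add(&counter, 5);              /* counter == 5 */

    /* compare-and-swap: succeeds because counter matches `expected` */
    int expected = 5;
    atomic_compare_exchange_strong(&counter, &expected, 7);

    return atomic_load(&counter);               /* 7 */
}
```

Compilers without C11 support (e.g., the system compiler on older RHEL, per the note above) would still need a legacy code path, which is why the old implementation cannot be deleted outright until 2020.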
Minutes

Review v2.1.6 (not going to do this in the immediate future)

  • Vader bug, part 2 (don't think it's in v2.0.x) [hang or crash]
  • Driving a new release because it's a regression.
  • Dec 1st.

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • Will do an RC at the end of this week for the "I/O configury stuff".
  • v3.0.3 - targeting Oct 1st (will start RCs when v2.1 wraps up).
    • Not important enough to do in parallel with v4.0.x
  • Issue with external PMIx v3.0 hanging. Fixed on master; Ralph backported the fix to OMPI v3.0.x and v3.1.x. Already fixed in OMPI v4.0.
  • A fairly extensive bug-fix list is building.
  • Few more PRs
  • UCX shaming.
  • Issue: MPI connect/accept is broken except within the same mpirun.
    • Ignoring for v3.0.3
  • Issue: CUDA Direct RDMA blocks on msgs larger than RDMA message length.
    • The fix is to use mca_cuda_memcpy_async_send_recv, which is not the default.
    • Would like to change this default to false.
    • Want a PR from someone who can test this, for v3.0.3 and future releases.
    • Lower priority because moving to UCX.
  • Issue 1763: Tune BCAST Data Corruption.
    • Looks like George worked around it (2 years ago), but that doesn't fix the root cause.
  • Probably want PMIx v2.1.4 - fix that came in from ARM last week.
    • Is there a timeline for v2.1.4 release?
    • Can release soon. Then Open MPI will pick it up.
  • There are a bunch of Issues "targeting v3.0" not in v3.0.3 release.
    • Many have been merged in, but are waiting to be merged everywhere.

Review v3.1.x Milestones v3.1.0

  • Schedule: Dec 1st
  • Issue 5083 - UCX segfault - Geoff (IBM) will grab UCX from an upstream release and verify Issue 5083 (a UCX issue, not an OMPI issue).
    • Open PR: PMIx v2.1.4 upgrade.
    • PR 4986 - if no updates in 7 days, Brian will close the PR.
    • Issue 5540 - issue with overlapping datatypes.
      • George is working on it.

v4.0.0

  • Schedule: release: End of Sept.

    • Date for first RC - Sept 11 (today)
  • PR 5692 - making sure Jenkins will test without showing up red again.

    • Once finished Merge.
  • PR 5665 - improved instruction cache - Pasha and Howard will review.

  • Update to PMIx v3.1.2, but it's not released yet.

    • It's ready to go, and Ralph can generate release.
    • Ralph will generate release today.
  • Jeff has a question about v4.0.0

    • SCIF BTL still exists - Howard will remove it before RC2.
    • George will push a patch to fix SCIF, and then Howard will remove it.
  • PR 5622 - Ralph will update with the released PMIx.

  • Open MPI v4.0.0 RC2 on Thursday.

  • It is challenging to pass configure options down to PMIx.

    • Considering a mechanism to pass configure flags down to PMIx's configure.
    • Consider something similar to --with-romio-flags, which passes flags down to ROMIO.
    • Caution: the escaping can be very painful.
  • Another issue: it's hard to see how PMIx was configured.

    • PMIx has a pmix_info tool - we should build/package that.
    • Would like for v4.0.0
  • Howard has NEWS update

  • PR 5650:

    • PR 5650 was much simpler. The final solution will come through PMIx, so we don't need the larger solution.
    • PR 5650 has been removed for earlier versions and will be satisfied by PMIx in the future, so this PR is ONLY needed for v4.0.0.
    • Why 2 commits? Ralph got a bit confused about what Matias wanted.
    • Matias and Ralph have agreed to back out the 2nd commit and create a new PR with it; both will get into RC2.
  • Geoff and Howard came up with a list of commits on master not PRed to v4.0.x, and will send the list to devel-core and to people directly.

    • Howard and Geoff will meet at 3:30 MT.
  • Issue: 5470

    • builtin atomics seem to fail on ppc64le / ARM?
    • Nathan will look at Issue 5470.
    • We may disable atomics by default on everything but Intel systems for v4.0.
    • In master, but the configury is broken, so it's not being used.
  • Issue 5713 - PR 5714 fixes some of it, but not all of it.

  • Issue: 5375 in vader.

    • may be new blocker for v4.0.0
  • PR 5716 - want to pull it in.

  • Added several labels with prefix 'state_' or 'severity_'

    • This helps us remember the state of the issue.
    • Does require people to keep the labels updated.
    • Would be nice to have a wiki page describing intent of these.

PMIx

  • Will release new v3.1.2 and v2.1.4 this week for Open MPI releases.

  • No action: Open MPI v5.x Future of Launch

    • Geoffroy Vallee sent out document with summary to core-devel.
      Everyone please read and reply.
    • ORTE/PRTE
      • We had a working group meeting to discuss launching under Open MPI v5.0
      • Summary: throw away ORTE, make calls directly to PMIx, and then use PRTE, with mpirun as a wrapper around PRTE.
    • Split this into two steps:
      1. Make PMIx a first class citizen - and call PMIx API directly.
        • When we added the OPAL PMIx layer, we added infrastructure on top of it. We're talking about flipping that around: internally, Open MPI would make PMIx calls directly, and other components could translate those PMIx calls to PMI-1, PMI-2, or whatever else.
        • The PMIx community has been operating as a "standard" for over a year now.
        • The PMIx standard document is in progress.
        • Just doing this much should bring ORTE much more in line with PRTE, and greatly reduce the bug-fixing burden between the two.
      2. Packaging / Launcher.
        • PRTE is far ahead of ORTE because it's painful to port changes back.
        • Many don't want to have to download something different to launch.
      3. Will need to ponder and come to consensus at face to face.

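A hypothetical sketch of step 1 above: Open MPI internals call a PMIx-shaped interface directly, and a pluggable component translates those calls for runtimes that only speak PMI-1 or PMI-2. Every name below is invented for illustration; none are real OMPI or PMIx symbols.

```c
#include <stdio.h>

/* Hypothetical PMIx-shaped interface that the MPI layer would call directly. */
typedef struct {
    int (*init)(void);   /* would map to PMIx_Init, or a PMI-1 shim */
    int (*fence)(void);  /* would map to PMIx_Fence, or PMI_Barrier */
} launch_ops_t;

/* A translation component for a runtime that only provides PMI-1. */
static int pmi1_init(void)  { puts("PMI-1 init");    return 0; }
static int pmi1_fence(void) { puts("PMI-1 barrier"); return 0; }
static const launch_ops_t pmi1_component = { pmi1_init, pmi1_fence };

/* MPI internals speak only the PMIx-shaped API; the component does the rest. */
static int runtime_bootstrap(const launch_ops_t *ops)
{
    return ops->init() || ops->fence();
}
```

The point of the inversion is that the translation lives in one replaceable component, instead of PMIx being wrapped by OPAL infrastructure as it is today.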
New topics

  • MTT License discussion - MTT needs to be de-GPL-ified.

    • The main desire is that the Python code live in a repo with no GPL code (i.e., no Perl code).
    • Current status:
      • Need to make progress sooner rather than later.
      • Ralph will move the existing MTT to a new mtt-legacy repo,
        • then rip the Perl out of the MTT repo.
      • Cisco spins up a different Slurm job for each MPI build, with a single ini file. Doing it this way depends on many Perl funclets.
      • If changed to have a different ini for each "stream", it should work okay with Python. This didn't happen before Peter left.
    • Resolution - just back up the current MTT to mtt-legacy,
      • and then rip the Perl out of the main MTT repo.
  • MTT performance database?

    • No status for a while.
    • MTT does report this, but no one looks.
    • Howard suggests many different performance dashboards.
      • InfluxDB with Jenkins, which can be queried.
      • Still need an up-to-date viewer.

Review Master Pull Requests

  • didn't discuss today.
  • Next Face to Face
    • When? Week of Oct 16-18th
    • Where? San Jose - Cisco
    • Need Agenda items added to the face to face.
      • Issue with devel-core / mailman.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to WeeklyTelcon-2018
