
WeeklyTelcon_20160719


Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Brian
  • Edgar Gabriel
  • Howard
  • Josh Hursey
  • Nathan Hjelm
  • Ralph
  • Ryan Grant
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones
  • A couple of things sitting against 1.10.4

Review 2.0.x

  • Wiki
  • 2.0.1 PRs that are reviewed and approved
    • v2.0.1 PRs are open. Need to get PRs reviewed!
  • Blocker Issues
  • Milestones
  • We released last Tuesday. Now taking in PRs.
  • A lot of 2.0.1 PRs have not been reviewed yet, so please review them.
  • Howard and Jeff are merging in the low-risk ones.
  • NVIDIA failures with the OFED install (false failures).
  • Cisco failures - still some failures here, having to do with sparse groups; addressed by one of the PRs we haven't pulled in yet.
  • IBM failures seem to have to do with spawn and intercommunicator connect (see the connect/accept sketch after this list).
    • Might call MPI_Comm_connect / MPI_Comm_accept - when we create the key, we use PMIx to communicate between the leaders.
    • PMIx needs to support the exchange.
    • Aborts, but then hangs - a PMIx error code is coming up.
  • Cray - all failures associated with spawn, but Cray PMI doesn't support it.
  • App launch with master doesn't work.
  • MPI_Info keys are weird - OMPI_NUM_APPS? What is that?
  • Mellanox will host the Eventbrite registration (vendor processing fees). Thank you, Mellanox.
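For reference, a minimal sketch of the connect/accept path mentioned in the IBM item above (an illustration, not code from the meeting): the port string returned by MPI_Open_port is the "key" being discussed, and getting it from the accepting side to the connecting side is where the PMIx exchange between job leaders comes in. Passing the port string on the command line here is purely illustrative.

```c
/* Sketch of MPI_Comm_connect / MPI_Comm_accept. Run one copy as
 * "server", copy the printed port string, and pass it as the
 * argument to a second copy. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && 0 == strcmp(argv[1], "server")) {
        MPI_Open_port(MPI_INFO_NULL, port);   /* create the "key" */
        printf("port: %s\n", port);           /* out-of-band handoff */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Close_port(port);
    } else if (argc > 1) {
        strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        MPI_Finalize();
        return 1;
    }

    MPI_Comm_disconnect(&inter);  /* collective over the intercomm */
    MPI_Finalize();
    return 0;
}
```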

Review Master MTT testing (https://mtt.open-mpi.org/)

MTT Dev status:

New Items:

  • Gentle reminder that lots of 2.0.1 PRs haven't been reviewed yet.

  • Merging GitHub master and ompi_release is taking a backseat to the migration.

  • Migration ongoing, nothing's moved yet, just testing:

    • Mailman lists - sanity check the list of lists that we are migrating and not migrating.
      • If the community is good with the list of lists, then give everyone a heads-up that it's moving.
      • New aliases will be @lists.open-mpi.org.
    • Transfer MTT to Ralph's machine to address the Postgres issue before transitioning.
      • The MTT code is somewhat Postgres-specific, but HostGator supports MySQL, not Postgres.
      • So the code needs to be ported from Postgres to MySQL.
      • Intel is temporarily hosting the MTT server until we can migrate to MySQL.
      • Meeting with the MTT developers to estimate the time: a few months of realistic effort.
      • Mostly an API issue, though some Postgres-specific tables will need to change; the overall database structure won't have to change.
    • Moving the main website is mostly a solved issue; want to do the mailing list move first.
    • PDFs for 3rd-party agreements: Ralph talked to HostGator; they have a file-sharing option, but it increases the price dramatically.
      • If only one or two people need access and have permission on HostGator, perhaps this is acceptable.

New Discussion:

  • Mellanox Jenkins - some Jenkins testing was failing in MPI_Init; not sure if it's due to the new Mellanox seed.
    • Will look into it. The server was rebooted and they are doing some maintenance; perhaps this is causing the issues.
    • Jeff tagged Artem in a PR in the last few hours.
  • Possible to put a :bot-mellanox-retest: command on Mellanox Jenkins?
    • Artem will try.
  • Howard pointed out yesterday: Jeff did a bot-retest of old 2.0.1 PRs because he thought they'd be run serially, but the Mellanox config says it will run 10 in parallel.
  • Artem - discuss benchmarks.

Artem's new benchmarking:

  • Tests blocking versus non-blocking MPI_Send/Recv:
    • Could run 16 processes per node, and pair processes on two nodes to send back and forth.
    • Or could run 1 process per node with 16 threads, and do the same thing.
    • You would expect these to perform similarly, but in reality they are very different (16 threads is much worse).
    • This is one of the questions to discuss (see the benchmark sketch after this list).
  • Not talking about oversubscribing.
  • Is each thread pair using a separate communicator or the same one?
    • Can do both, no difference.
  • Are you preposting the buffers? If not, and everything is using different tags, then the receive list gets quite large (OB1).
  • Each thread uses a different tag, but different messages use the same tag.
  • Unclear if Yalla has a way to distinguish based on communicators (just tags?), so possibly no gain in parallelism.
    • Artem was thinking about this.
  • Artem can include a link to the sources.
    • Would like the community to work with Artem on them.
    • Can do fine binding (if you have 16 procs per node).
      • We do allow you to bind a process to 4 cores: bind to core, map by core:PE=4.
      • Artem has his benchmark do fine binding.
    • Doing non-overlapping fine binding based on an MPI exchange.
    • Can run 4 multi-threaded processes; OSU can only run 1 proc/node.
  • Can now reproduce OSU results with 1 process per node.
  • Can reproduce ARM's results with OSU, with each thread doing separate Send/Recvs.
  • Artem will send out a link to his personal public GitHub repo for others to try, provide PRs against, etc.
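A minimal sketch of the threaded configuration (a reconstruction under stated assumptions, not Artem's actual benchmark - see his repo link when it goes out): two ranks, 16 threads per rank, each thread pair ping-ponging with blocking MPI_Send/MPI_Recv on its own tag, which requires MPI_THREAD_MULTIPLE. The 16-procs-per-node variant is the same inner loop with one thread per process and 16 ranks per node. The mpirun flags in the comment are illustrative.

```c
/* Multi-threaded ping-pong sketch. Run with exactly 2 ranks, e.g.
 * (illustrative): mpirun -np 2 --map-by node ./mt_pingpong */
#include <pthread.h>
#include <stdio.h>
#include <mpi.h>

#define NTHREADS 16
#define ITERS    1000
#define MSGSIZE  4096

static int rank;

static void *pingpong(void *arg)
{
    int tag = (int)(long)arg;   /* one tag per thread pair */
    int peer = 1 - rank;        /* exactly 2 ranks assumed */
    char buf[MSGSIZE];

    for (int i = 0; i < ITERS; i++) {
        if (0 == rank) {
            MPI_Send(buf, MSGSIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGSIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t th[NTHREADS];
    double t0, t1;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "need MPI_THREAD_MULTIPLE\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, pingpong, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    t1 = MPI_Wtime();

    if (0 == rank)
        printf("%d threads: %f s\n", NTHREADS, t1 - t0);

    MPI_Finalize();
    return 0;
}
```

Note the per-thread tag: per the discussion above, whether receives are preposted and how the OB1 matching lists grow under many concurrent tags is one candidate explanation for the threaded slowdown.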

Status Updates:

  1. Mellanox
    • Artem sent out message rate email.
  2. Sandia
  3. Intel

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM
  3. Cisco, ORNL, UTK, NVIDIA

Back to WeeklyTelcon-2016
