
WeeklyTelcon_20220920


Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Brendan Cunningham (Cornelis Networks)
  • Christoph Niethammer (HLRS)
  • David Bernholdt (ORNL)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Fisher (Cornelis Networks)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tommy Janjusic (nVidia)
  • William Zhang (AWS)

Not there today (kept here for easy cut-and-paste into future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia)
  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • Edgar Gabriel (UoH)
  • Erik Zeiske
  • George Bosilca (UTK)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Jan (Sandia - ULT support in Open MPI)
  • Jingyin Tang
  • Josh Hursey (IBM)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Sam Gutierrez (LLNL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia)

v4.1.x

  • Multiple weeks spent on the CVE from NVIDIA.
  • v4.1.5
    • Schedule: targeting ~6 months out (targeting October).
    • Nothing driving the schedule yet.
  • 10583 - Potential CVE from a 4-year-old issue in libevent, but we might not need to do anything.
    • Update: one company reported that their scanner didn't flag anything.
    • Waiting on confirmation that the patches to remove the dead code were enough.

v5.0.x

  • An RC this week.

  • Discuss MCA https://github.com/open-mpi/ompi/pull/10793

    • When you pass an MCA parameter to prterun, it has to figure out which MCA system the parameter is destined for.
      • If you want to be sure, use the explicit prefixed options: --omca, --prtemca, --pmixmca (see the sketch below).
      • Jeff and Brian came up with a solution that they're working on.
  • Is this related to the submodule?

    • Unrelated to --mca; we share a lot of replicated M4 code between OMPI, PMIx, and PRRTE.

      • The copies have diverged in radical and subtle ways.
    • Last week, another submodule pointer was added to OMPI.

    • Took a handful of M4 macros and combined them in there.

    • More consolidation there over time.

    • For the most part this is behind the scenes, but developers will need to run git submodule init.

    • The purpose is that it'll contain just the M4 files.

    • --mca is how we've always set MCA parameters in Open MPI.

      • Could PRRTE just "do the right thing" for --mca?
      • Agreed: --mca is an Open MPI specific option.
      • When PRRTE and PMIx split off, they prefixed their options.
      • They don't have ownership over --mca.
      • At the end of the day, our docs can't change because…
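
A minimal sketch of the --mca routing issue above. The prefixed option names are the ones mentioned in these notes (--omca / --prtemca / --pmixmca); the parameter names are illustrative only:

```
# Ambiguous: mpirun/prterun has to guess which MCA system
# (Open MPI, PRRTE, or PMIx) should receive this parameter.
mpirun --mca btl tcp,self -n 4 ./app

# Unambiguous: the prefixed options route the parameter explicitly.
mpirun --omca btl tcp,self -n 4 ./app             # Open MPI parameter
mpirun --prtemca ras_base_verbose 10 -n 4 ./app   # PRRTE parameter (illustrative)
mpirun --pmixmca gds hash -n 4 ./app              # PMIx parameter (illustrative)
```
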
  • 10779 - OPAL "core" library for internal usage

    • NEED to see if it made its way to v5.
    • The approach is to separate out pieces of OPAL into "core" and "top" layers.
    • All internal things, not exposed to the user.
    • Brian and George worked on it, and then Josh picked it up and PRed 10779.
    • Still in Draft because he wants to resolve any high-level issues.
    • As far as code layout, we could move some things around, but if we do this too much, we're worried about dropping history...
      • We'd have hundreds or thousands of…
  • Discuss mca_base_env_list https://github.com/open-mpi/ompi/pull/10788

  • Discuss Remaining PRRTE CLI issues (https://github.com/open-mpi/ompi/issues/10698)

    • -N: document an error if it conflicts with --map-by.
    • --show-progress: used to display the little "..." progress indicator on the terminal; now it doesn't do anything.
      • DOE may set this by default in MCA parameters (makes some users feel happy).
    • --display-topo: generally we've tried to be backwards compatible.
    • -v: version.
    • -V: verbose.
    • -s|--preload-binary: functionally it works, but it gets messed up when combined with -n.
    • rankfile: NOT deprecating.
    • --mca is Open MPI's framework.
    • No --gprtemca. Created by PRRTE, but do we continue to support --gpmixmca?
    • --test-suicide and the others are all PRRTE daemon options, not exposed to users.
      • Passed to the PRRTE launcher.
  • Posted Open MPI issue #10698 with about 13 issues that will need…

  • No longer trust the verbiage here, based on Ralph's comment.

    • Not recognized by mpirun, but cited in --help.
    • Some of these aren't possible??? And mpirun -> prterun is a one-shot thing.
  • Should mpirun be able to talk to an existing DVM???

    • Or is it always a one-shot thing?
    • If we have it talk to an existing DVM…
    • Use prte to start up the prted daemons, and then prun against that.
    • If you're using the MPI front-end and want to interact with a DVM, how should we tell users to do that?
      • What should they do?
      • Go through mpirun, or go through prun (with the ompi personality?)
    • Thomas can look and see if you can get everything you need that way.
    • There were some common things that were difficult when switching between the two.
    • Was there an option for this in v4.1?
      • Yes, but perhaps it wasn't working well.
      • Are there legacy command line options that we should support or alias?
  • Are we dropping DVM support for v5?

    • How did this work in v4?
    • Howard thought you fired up an orte-something, and that would provide a command line.
    • Couldn't do all of this with mpirun; it was a two-stage process (see the sketch after this item).
    • Had to start the DVM manually, and you got back a URI.
      • But thought that if you used the schizo personality and gave it the URI, it would do all of the right things.
    • Could add support if the user fired up the DVM using PRTE and got a URI back.
      • We don't have an ompi-dvm executable in v5, so this is already a deviation.
    • What do we do?
      1. Support the same CLI options (and executables, etc.) as documented for v4.x.
      2. Don't support it at all in v5, and if you want to do DVM things…
      3. Maybe something in the middle.
    • Does anyone care about DVM?
    • Can we run the ompi schizo / personality with vanilla prun?
      • Some people on the call DO care about DVM.
    • Early days of Sessions needed a DVM to run (no longer needed in main/v5).
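
A rough sketch of the two-stage DVM workflow described above, assuming PRRTE's prte and prun tools; the exact option spellings vary by release and are assumptions here:

```
# Stage 1: start a persistent DVM; prte launches the prted daemons
# across the allocation (option names are illustrative assumptions).
prte --daemonize --report-uri dvm.uri

# Stage 2: submit many short jobs to the already-running DVM with
# prun, instead of paying one-shot mpirun startup cost each time.
prun -n 4 ./mpi_app
prun -n 1 ./serial_task
```
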
  • Usually if customers are interested in doing this, they're willing to do a bit more work.

    • But if we want to get v5.0.0 out in the near future, it'd be more likely if we…
    • Thomas gets a lot of use out of mini-tasks; some are MPI parallel.
      • This is where the DVM is useful, because it can handle slamming lots of serial and parallel jobs through in a short time.
      • If they can do this via prun with the ompi schizo personality, the path doesn't matter.
      • Thomas will investigate the proper options.
    • Could add a CLI interface in a future version to have mpirun not call prterun.
      • Don't want to rush this.
  • Schedule:

    • PMIx and PRRTE changes coming at the end of August.
      • PMIx v3.2 released.
      • Try to have bugfixes PRed by the end of August, to give time to iterate and get them merged.
    • Still using the Critical v5.0.x Issues project board (https://github.com/open-mpi/ompi/projects/3) as of yesterday.
  • Docs

    • mpirun --help is OUT OF DATE.
      • Have to do this relatively quickly, before PRRTE releases.
      • Austen, Geoff, and Tomi will be…
      • The REASON for this is that the mpirun command line lives in PRRTE.
  • mpirun manpage needs to be re-written.

    • Docs are online and can be updated asynchronously.
    • Jeff posted a PR to document runpath vs. rpath (see the sketch below).
      • Our configure checks some linker flags, but there may be defaults in the linker or the system that really govern what happens.
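
A minimal sketch of the runpath-vs-rpath distinction being documented, using standard GNU ld flags (not Open MPI specific):

```
# DT_RPATH: searched BEFORE LD_LIBRARY_PATH; users cannot override it.
cc app.c -L/opt/openmpi/lib -lmpi \
   -Wl,--disable-new-dtags -Wl,-rpath,/opt/openmpi/lib

# DT_RUNPATH: searched AFTER LD_LIBRARY_PATH; users can override it.
cc app.c -L/opt/openmpi/lib -lmpi \
   -Wl,--enable-new-dtags -Wl,-rpath,/opt/openmpi/lib

# Which tag you get without these flags depends on linker/distro
# defaults; that is the "default in the linker or system" caveat above.
readelf -d app | grep -E 'RPATH|RUNPATH'
```
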
  • Symbol Pollution - Need an issue posted.

    • OPAL_DECLSPEC - Do we have docs on this? (See the sketch below.)
      • No. The intent is: where do you want a symbol to be available?
        • If outside of your library, then use OPAL_DECLSPEC (like Windows __declspec).
        • It means "I want you to export this symbol."
    • Need to clean up as much as possible.
    • From the Open MPI community's perspective, our ABI is just the MPI_ symbols.
    • Still unfortunate. We need to clean up as much as possible.
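
For reference, a simplified sketch of how a DECLSPEC-style export macro works; this is the general technique, not the exact OPAL_DECLSPEC definition:

```c
/* Simplified visibility/export macro (illustrative, not Open MPI's
 * exact definition). When the library is built with
 * -fvisibility=hidden, only symbols marked "default" are exported. */
#if defined(_WIN32)
#  define MY_DECLSPEC __declspec(dllexport)
#elif defined(__GNUC__)
#  define MY_DECLSPEC __attribute__((visibility("default")))
#else
#  define MY_DECLSPEC
#endif

MY_DECLSPEC int usable_outside_the_library(void);  /* exported */
int internal_helper(void);                         /* stays hidden */
```
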

Main branch

  • Case of Qthreads, where they need a recursive lock (see the sketch below).
    • A configury problem was fixed.
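
For context, a recursive lock lets the same thread re-acquire a lock it already holds. A minimal pthreads illustration of the concept (unrelated to Qthreads' actual implementation):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;

static void inner(void)
{
    /* Re-entry by the owning thread: fine for a recursive mutex,
     * deadlock for a normal one. */
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);
    inner();                    /* same thread locks again */
    pthread_mutex_unlock(&lock);
    puts("recursive lock ok");
    return 0;
}
```
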

Accelerator framework

  • Just working on getting it ready for OMPI.
    • Converting structures to OPAL objects.
    • Also adding libcuda linking (instead of dlopen).
  • William will test Jeff's PR [10763?] this week.
  • In Jeff's docs roll-up:
    • Called out accelerator and show-load-errors.
    • Not sure what distros will want to do, since some of these accelerator libraries are not open source.
  • Packagers building Open MPI:
    • Example: say only 20% of your nodes have accelerators, and the accelerator libraries are installed only on those nodes.
    • This is the problem behind why everything today is dlopened...
      • You get scary warnings about failing to open components on some nodes.
      • If you build accelerator components by default, they'll be part of libmpi.so.
    • But if you know your software/hardware is heterogeneous (accelerators on only 20% of nodes):
      • Build the accelerator components as .so (DSO) components.
      • Can still run, but we don't want the scary warnings.
    • Packagers should build the accelerator components as DSOs (see the sketch below).
      • Put the DSOs in a sub-package of Open MPI, and only that sub-package depends on the accelerator libraries.
      • You WON'T get the scary message, since the DSOs are installed only on nodes that have those libraries.
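
A sketch of that packaging approach using Open MPI's existing --enable-mca-dso configure option; the component list and package names are illustrative:

```
# Build the accelerator-related components as standalone DSOs instead
# of linking them into libmpi.so (component list is illustrative).
./configure --prefix=/opt/openmpi --enable-mca-dso=accelerator,btl-smcuda
make -j install

# Package layout sketch:
#   openmpi        -> libmpi.so and friends (no accelerator library deps)
#   openmpi-accel  -> lib/openmpi/mca_accelerator_*.so
#                     (only this sub-package depends on the CUDA/ROCm libs)
```
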

Atomics PRs

  • Switching to builtin atomics (see the sketch below).
    • 10613 - Preferred PR. GCC / Clang should have the builtins.
    • Next step would be to refactor the atomics for post-v5.0.
    • Waiting on Brian's review and CI fixes.
  • Joseph will post some additional info in the ticket.
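
For reference, a minimal illustration of the GCC/Clang __atomic builtins that the PR moves toward (a generic example, not Open MPI's actual wrappers):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t counter = 0;

    /* Atomic increment: the compiler emits the right instructions
     * per platform, replacing hand-written assembly atomics. */
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);

    /* Compare-and-swap: succeeds only if counter still holds 'expected'. */
    int64_t expected = 1;
    __atomic_compare_exchange_n(&counter, &expected, 2,
                                0 /* strong */,
                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);

    printf("counter = %lld\n",
           (long long)__atomic_load_n(&counter, __ATOMIC_ACQUIRE));
    return 0;
}
```
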

MTT

Administrative tasks

Face-to-face
