Background information
We are running a cluster with OmniPath communication, using OpenMPI as built by EasyBuild. Upgrading to the latest EasyBuild toolchains means upgrading to OpenMPI 5.x, but then cm/psm2 fails with the error Error obtaining unique transport key from PMIX (OMPI_MCA_orte_precondition_transports not present in the environment). Switching to the ucx PML works, but at a performance cost.
This problem is likely related to #12886, but we already have the SLURM version that @bedroge reports solved the problem for them.
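For testing the failing path explicitly, the PML and MTL can also be forced on the command line. This is only a generic sketch (the binary name is a placeholder, not our benchmark); forcing cm/psm2 this way should reproduce the same error shown under "Details of the problem":

$ mpirun --mca pml cm --mca mtl psm2 -np 2 ./a.out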
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
We mainly see the problem with OpenMPI version 5.0.8 and PMIx version 5.0.8 (openpmix) [EasyBuild toolchain gompi/2025b], where it occurs for all jobs, whether started with mpirun, with srun, or as singletons.
We see it to a lesser degree with OpenMPI version 5.0.7 and PMIx 5.0.6 [EasyBuild toolchain gompi/2025a], where jobs started with mpirun work, but jobs started with srun or as singletons fail.
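For reference, the three launch modes mentioned above look roughly as follows; the exact srun flags (e.g. --mpi=pmix) depend on the local Slurm configuration, so treat this as an illustration rather than our precise invocation:

$ mpirun -np 2 ./mpiBench_2025b          # Open MPI launcher
$ srun --mpi=pmix -n 2 ./mpiBench_2025b  # Slurm direct launch via PMIx
$ ./mpiBench_2025b                       # singleton, no launcher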
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
With the EasyBuild build system, i.e. from source tarballs.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
- Operating system/version: RockyLinux 8
- Computer hardware: Dell R650 (69 icelake cores), Dell R640 (40 skylake cores), Huawei XH620 (24 broadwell cores) - same issue on them all
- Network type: OmniPath
Details of the problem
All MPI programs crash at startup. By default, OpenMPI selects the cm PML with the psm2 MTL.
$ mpirun -np 2 ./mpiBench_2025b
--------------------------------------------------------------------------
Error obtaining unique transport key from PMIX (OMPI_MCA_orte_precondition_transports not present in
the environment).
Local host: a018
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[a018:00000] *** An error occurred in MPI_Init
[a018:00000] *** reported by process [4230545409,1]
[a018:00000] *** on a NULL communicator
[a018:00000] *** Unknown error
[a018:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a018:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 0 with PID 0 on node a018 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
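To double-check which PML/MTL the runtime actually selects, the component verbosity can be raised; this is a standard Open MPI debugging sketch, not output from our runs:

$ mpirun --mca pml_base_verbose 100 --mca mtl_base_verbose 100 -np 2 ./mpiBench_2025b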
Note that setting OMPI_MCA_pml=ucx solves the problem, but with OpenMPI 5.0.7 (where we can run with mpirun but not with srun or as singletons) we measure a 25% performance degradation doing that.
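For completeness, the workaround can be applied either through the environment or directly on the mpirun command line, e.g.:

$ export OMPI_MCA_pml=ucx
$ mpirun -np 2 ./mpiBench_2025b
# or, equivalently:
$ mpirun --mca pml ucx -np 2 ./mpiBench_2025b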