
mtl/ofi: call to fi_domain fails on Crusher/Frontier #12038

Closed
devreal opened this issue Oct 31, 2023 · 8 comments

@devreal (Contributor) commented Oct 31, 2023

Background information

I am trying to run Open MPI 5.0 on Crusher/Frontier but I get the following error during MPI_Init:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: crusher051
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:999
  Error: Function not implemented (38)
--------------------------------------------------------------------------

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 5.0 from the release tarball.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

On Crusher I run configure:

../configure --disable-debug --with-slurm --prefix=$HOME/opt-crusher/openmpi-5.0 --with-ofi=/opt/cray/libfabric/1.15.2.0/ --with-xpmem=/opt/cray/xpmem/default --with-rocm=/opt/rocm-5.3.0 --with-libevent=internal CC=gcc CXX=g++

Please describe the system on which you are running

libfabric version: 1.15.2.0 (default module)


Details of the problem

Running an OSU benchmark built against this installation on Crusher:

> mpirun -np 2 $HOME/src/osu-micro-benchmarks-7.1-1/build_crusher_ompi5/c/mpi/collective/blocking/osu_reduce
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: crusher001
  Location: ../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:999
  Error: Function not implemented (38)
--------------------------------------------------------------------------

If I run with FI_LOG_LEVEL=Debug I get a couple of lines like this:

libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Supported: FI_ADDR_CXI_COMPAT
libfabric:15824:1698769409::cxi:core:ofi_check_info():1101<info> Requested: FI_ADDR_CXI
libfabric:15824:1698769409::cxi:core:ofi_check_info():1099<info> address format not supported

and

libfabric:15824:1698769409::cxi:fabric:cxip_gen_auth_key_ss_env_get_vni():1232<info> crusher056: SLINGSHOT_VNIS not found
libfabric:15824:1698769409::cxi:domain:cxip_domain():1238<warn> crusher056: cxip_gen_auth_key failed: -38:Function not implemented
libfabric:15824:1698769409::core:core:fi_fabric_():1374<info> Opened fabric: cxi

and

libfabric:15824:1698769409:ofi_rxm:core:core:fi_getinfo_():1176<info> fi_getinfo: provider cxi returned -61 (No data available)
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1084<info> Unsupported capabilities
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_COLLECTIVE, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxm:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1084<info> Unsupported capabilities
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1085<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:15824:1698769409::ofi_rxd:core:ofi_check_info():1085<info> Requested: FI_RMA, FI_ATOMIC, FI_HMEM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():681<info> unsupported endpoint type
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Supported: FI_EP_DGRAM
libfabric:15824:1698769409::udp:core:ofi_check_ep_type():682<info> Requested: FI_EP_RDM
libfabric:15824:1698769409::core:core:fi_getinfo_():1176<info> fi_getinfo: provider udp returned -61 (No data available)

I'm not sure whether that helps or whether it's the right thing to look for. I can post the full log if necessary.

Is there any way to get OMPI 5.0 working with this libfabric?

@jsquyres added this to the v5.0.1 milestone Oct 31, 2023
@hppritcha (Member) commented Nov 1, 2023

Are you doing this yourself, or are you using the install done by @naughtont3?
You are probably not setting the right PRTE MCA parameters if you are using mpirun.

@devreal (Contributor, Author) commented Nov 1, 2023

I am building OMPI myself using the Cray-provided libfabric module. I do not see any pre-installed Open MPI modules on either machine that I could look at for MCA params. What parameters should I set, or where can I look for them?

@hppritcha (Member):

Here's what I set in my shell on Crusher:

PRTE_MCA_ras_slurm_use_entire_allocation=1
PRTE_MCA_ras_base_launch_orted_on_hn=1
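
For example (a minimal sketch, assuming a bash-like shell; the application path below is illustrative), these can be exported before invoking mpirun:

export PRTE_MCA_ras_slurm_use_entire_allocation=1
export PRTE_MCA_ras_base_launch_orted_on_hn=1
mpirun -np 2 ./osu_reduce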

@hppritcha (Member):

Note that for some reason Slurm on Crusher doesn't support PMIx, so there is no srun direct-launch support with Open MPI 5 or main on that system.
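
One way to check this (assuming a standard Slurm installation) is srun --mpi=list, which lists the MPI plugin types Slurm was built with; pmix only appears there if Slurm has PMIx support:

> srun --mpi=list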

@devreal (Contributor, Author) commented Nov 1, 2023

Thanks @hppritcha, setting those two variables did the trick for me 👍 Is there a way to detect this automatically, so that other users (and future me) don't have to bother setting them?

@hppritcha (Member):

I was going to create an MCA params file, perhaps via the platform-file approach, so that the install would ship the PRTE MCA params file (I forget the exact name) with these params set. At a minimum, I guess this is a docs issue; we'll treat it as one for now.
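
As an illustration only (the file name and location are assumptions here, e.g. something like $PREFIX/etc/prte-mca-params.conf for a PRRTE install), such a file would carry the same settings without the PRTE_MCA_ prefix:

# hypothetical PRTE MCA params file shipped with the install
ras_slurm_use_entire_allocation = 1
ras_base_launch_orted_on_hn = 1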

@hppritcha self-assigned this Nov 1, 2023
@hppritcha (Member):

#12150 is relevant to this discussion.

hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 13, 2023
and also make a statement about the OFI BTL more accurate.

Related to open-mpi#12038

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 14, 2023
and also make a statement about the OFI BTL more accurate.

Related to open-mpi#12038

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 14, 2023
and also make a statement about the OFI BTL more accurate.

Related to open-mpi#12038

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 2718732)
@hppritcha (Member):

Closed via #12162 and #12163.
