MPI_Comm_dup error on init on Frontier #1102

Open
pgrete opened this issue Jun 11, 2024 · 1 comment
Labels
bug (Something isn't working), build configuration

Comments

pgrete (Collaborator) commented Jun 11, 2024

New day, new issues.
I just tried the latest AMD software stack on Frontier:

module load cpe/23.12
module load PrgEnv-amd
module load amd/5.7.1
module load craype-accel-amd-gfx90a cmake cray-hdf5-parallel cray-python ninja
export MPICH_GPU_SUPPORT_ENABLED=1

and this results in non-functional code (e.g., the advection example):

Assertion failed in file ../src/mpid/common/cray/cray_gpu_ops.c at line 188: mpi_errno == MPI_SUCCESS
/opt/cray/pe/lib64/libmpi_amd.so.12(MPL_backtrace_show+0x26) [0x7fffebab367b]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x22bf374) [0x7fffeb4d9374]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x2725368) [0x7fffeb93f368]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x2168420) [0x7fffeb382420]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x1fa237c) [0x7fffeb1bc37c]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x1fa028c) [0x7fffeb1ba28c]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x6d4cf1) [0x7fffe98eecf1]
/opt/cray/pe/lib64/libmpi_amd.so.12(PMPI_Comm_dup+0x174) [0x7fffe98eef34]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(darshan_core_initialize+0xa8) [0x7fffebbd3f68]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(MPI_Init+0x7d) [0x7fffebbd3d0d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x335280a]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x3050e40]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fffe89f924d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x2f4ce6a]
MPICH ERROR [Rank 0] [job id 2015481.11] [Tue Jun 11 08:41:29 2024] [frontier00491] - Abort(1): Internal error

srun: error: frontier00491: task 0: Exited with exit code 1
srun: Terminating StepId=2015481.11
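
Reading the trace bottom-up, the failing PMPI_Comm_dup is not called directly by the advection example: the frame right below it is darshan_core_initialize in libdarshan.so.0, reached through that library's MPI_Init wrapper. The following is a minimal Python sketch for extracting that call order from such a backtrace. The helper names `parse_frames` and `caller_of` are hypothetical and introduced only for illustration; the frames are a subset copied from the output above. It only shows who called whom and does not establish the root cause of the assertion.

```python
import os
import re

# Subset of frames copied verbatim from the backtrace above, in the same order
# (innermost frame first).
TRACE = """\
/opt/cray/pe/lib64/libmpi_amd.so.12(MPL_backtrace_show+0x26) [0x7fffebab367b]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x6d4cf1) [0x7fffe98eecf1]
/opt/cray/pe/lib64/libmpi_amd.so.12(PMPI_Comm_dup+0x174) [0x7fffe98eef34]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(darshan_core_initialize+0xa8) [0x7fffebbd3f68]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(MPI_Init+0x7d) [0x7fffebbd3d0d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x335280a]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fffe89f924d]
"""

# Matches lines of the form: /path/to/object(symbol+0xoff) [0xaddr]
# The symbol and the offset are both optional, as in the trace above.
FRAME_RE = re.compile(
    r"^(?P<obj>[^\s(]+)\((?P<sym>[^)+]*)(?:\+0x[0-9a-f]+)?\)\s+\[0x[0-9a-f]+\]$"
)


def parse_frames(text):
    """Return a list of (object basename, symbol or None), innermost first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            obj = os.path.basename(match.group("obj"))
            frames.append((obj, match.group("sym") or None))
    return frames


def caller_of(frames, symbol):
    """Return the frame directly below `symbol` in the trace, i.e. its caller."""
    for i, (_, sym) in enumerate(frames[:-1]):
        if sym == symbol:
            return frames[i + 1]
    return None


if __name__ == "__main__":
    frames = parse_frames(TRACE)
    print(caller_of(frames, "PMPI_Comm_dup"))
    print(caller_of(frames, "darshan_core_initialize"))
```

Frames without a symbol name (for example the advection-example entries) are kept with `None` as the symbol so that the "next frame" is always the immediate caller rather than the next named one.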

pgrete added the bug and build configuration labels Jun 11, 2024

pgrete (Collaborator, Author) commented Jun 11, 2024

The same issue occurs with PrgEnv-cray.
