Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar zxf openmpi-5.0.3.tar.gz
ln -s openmpi-5.0.3 openmpi
# 4) Install openmpi (from source code)
mkdir -p /home/lab/bin
cd ${DESTINATION_PATH}/openmpi
./configure --prefix=/home/lab/bin/openmpi
make -j $(nproc) all
make install
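Not part of the original report, but an assumption worth stating for the remote case: a prefix-built Open MPI like this one has to be reachable in the environment on every node, or remote daemons and spawned processes may fail to start. A minimal sketch of the setup this implies (paths taken from the configure line above):

```shell
# Hypothetical per-node environment setup; assumes the same
# install prefix /home/lab/bin/openmpi exists on all nodes.
export MPI_PREFIX=/home/lab/bin/openmpi
export PATH="$MPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_PREFIX/lib:${LD_LIBRARY_PATH:-}"
```

These exports would typically go in a shell startup file that is sourced for non-interactive SSH sessions as well.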
Output of ompi_info:
+ ompi_info
Package: Open MPI root@buildkitsandbox Distribution
Open MPI: 5.0.3
Open MPI repo revision: v5.0.3
Open MPI release date: Apr 08, 2024
MPI API: 3.1.0
Ident string: 5.0.3
Prefix: /home/lab/bin/openmpi
Configured architecture: x86_64-pc-linux-gnu
Configured by: root
Configured on: Fri May 31 08:42:58 UTC 2024
Configure host: buildkitsandbox
Configure command line: '--prefix=/home/lab/bin/openmpi'
Built by:
Built on: Fri May 31 08:51:40 UTC 2024
Built host: buildkitsandbox
C bindings: yes
Fort mpif.h: no
Fort use mpi: no
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /bin/gcc
C compiler family name: GNU
C compiler version: 11.4.0
C++ compiler: g++
C++ compiler absolute: /bin/g++
Fort compiler: none
Fort compiler abs: none
Fort ignore TKR: no
Fort 08 assumed shape: no
Fort optional args: no
Fort INTERFACE: no
Fort ISO_FORTRAN_ENV: no
Fort STORAGE_SIZE: no
Fort BIND(C) (all): no
Fort ISO_C_BINDING: no
Fort SUBROUTINE BIND(C): no
Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
Fort ABSTRACT: no
Fort ASYNCHRONOUS: no
Fort PROCEDURE: no
Fort USE...ONLY: no
Fort C_FUNLOC: no
Fort f08 using wrappers: no
Fort MPI_SIZEOF: no
C profiling: yes
Fort mpif.h profiling: no
Fort use mpi profiling: no
Fort use mpi_f08 prof: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI extensions: affinity, cuda, ftmpi, rocm
Fault Tolerance support: yes
FT MPI support: yes
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.3)
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.3)
MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.3)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.3)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.3)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
v5.0.3)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.3)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.3)
MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.3)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.3)
MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
v5.0.3)
MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.3)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
v5.0.3)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.3)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.3)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
v5.0.3)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.3)
MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.3)
MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.3)
MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
v5.0.3)
MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.3)
MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.3)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.3)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.3)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
v5.0.3)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
v5.0.3)
Please describe the system on which you are running
I have a master process that spawns a worker; when the worker dies (its failure is simulated with SIGKILL), the master spawns a new one, so the application is fault tolerant. Running locally on a single node this works perfectly, but when it is executed on another node remotely it misbehaves and does not run correctly: merely adding --host node2 makes it fail. The spawn itself is done locally.
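The master/worker pattern described above might look roughly like the following sketch. This is hypothetical, not the actual test.c from the report; WORKER_CMD and the loop structure are illustrative, and only standard MPI calls that match the logged output (MPI_Comm_spawn, MPI_Bcast, MPI_ERRORS_RETURN) are assumed.

```c
/* Hypothetical sketch of the respawn loop described above;
 * not the report's actual test.c. Build with: mpicc -g -o test test.c */
#include <mpi.h>
#include <stdio.h>

#define WORKER_CMD "./test"   /* illustrative: worker re-executes this binary */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* Return error codes instead of aborting, so failures are observable. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    for (;;) {
        MPI_Comm child;
        int errcode;
        int ret = MPI_Comm_spawn(WORKER_CMD, MPI_ARGV_NULL, 1,
                                 MPI_INFO_NULL, 0, MPI_COMM_SELF,
                                 &child, &errcode);
        if (ret != MPI_SUCCESS)
            break;                       /* the spawn itself failed */
        MPI_Comm_set_errhandler(child, MPI_ERRORS_RETURN);

        int token = 42;
        /* Broadcast over the intercommunicator; with ULFM this returns
         * MPI_ERR_PROC_FAILED once the worker is killed. */
        while (MPI_Bcast(&token, 1, MPI_INT, MPI_ROOT, child) == MPI_SUCCESS)
            ;                            /* keep talking to the worker */

        MPI_Comm_free(&child);           /* worker died: loop and respawn */
    }
    MPI_Finalize();
    return 0;
}
```

In the failing remote run below, the second MPI_Comm_spawn in such a loop is what returns MPI_ERR_UNKNOWN.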
+ mpicc -g -o test test.c
+ mpiexec -n 1 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9eI'm the child. 0 0 2e7630b38c9e
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8596 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9eI'm the child. 0 0 2e7630b38c9e
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8598 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
I'm the child. 0 0 2e7630b38c9eMPI_Comm_spawn ret 0: MPI_SUCCESS: no errorsMPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errorsI'm the parent. 0 0 2e7630b38c9e
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8600 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9eI'm the child. 0 0 2e7630b38c9e
...................
Bad remote execution:
+ mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
Warning: Permanently added 'c1de8f727368' (ED25519) to the list of known hosts.
I'm the child. 0 0 c1de8f727368MPI_Comm_spawn ret 0: MPI_SUCCESS: no errorsMPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errorsI'm the parent. 0 0 c1de8f727368
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 3861 on node c1de8f727368 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 14: MPI_ERR_UNKNOWN: unknown error
MPI_Comm_spawn errcodes[0] 14: MPI_ERR_UNKNOWN: unknown error
I'm the parent. 14 -50 c1de8f727368Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicatorParent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
With debug, good run:
+ mpiexec -n 1 --with-ft ulfm --verbose --debug-daemons --mca btl_base_verbose 100 --mca mpi_ft_verbose 100 --map-by node:OVERSUBSCRIBE ./test
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[2e7630b38c9e:08989] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08989] mca: base: components_register: found loaded component self
[2e7630b38c9e:08989] mca: base: components_register: component self register function successful
[2e7630b38c9e:08989] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08989] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08989] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08989] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08989] mca: base: components_open: opening btl components
[2e7630b38c9e:08989] mca: base: components_open: found loaded component self
[2e7630b38c9e:08989] mca: base: components_open: component self open function successful
[2e7630b38c9e:08989] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08989] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08989] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08989] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08989] [[33963,1],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08989] select: initializing btl component self
[2e7630b38c9e:08989] select: init of component self returned success
[2e7630b38c9e:08989] select: initializing btl component sm
[2e7630b38c9e:08989] select: init of component sm returned failure
[2e7630b38c9e:08989] mca: base: close: component sm closed
[2e7630b38c9e:08989] mca: base: close: unloading component sm
[2e7630b38c9e:08989] select: initializing btl component tcp
[2e7630b38c9e:08989] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08989] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08989] btl: tcp: Using interface: sppp
[2e7630b38c9e:08989] btl:tcp: 0x5650082292e0: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08989] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08989] btl:tcp: Successfully bound to AF_INET port 1024
[2e7630b38c9e:08989] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[2e7630b38c9e:08989] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08989] select: init of component tcp returned success
[2e7630b38c9e:08989] mca: bml: Using self btl for send to [[33963,1],0] on node 2e7630b38c9e
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[2e7630b38c9e:08991] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08991] mca: base: components_register: found loaded component self
[2e7630b38c9e:08991] mca: base: components_register: component self register function successful
[2e7630b38c9e:08991] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08991] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08991] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08991] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08991] mca: base: components_open: opening btl components
[2e7630b38c9e:08991] mca: base: components_open: found loaded component self
[2e7630b38c9e:08991] mca: base: components_open: component self open function successful
[2e7630b38c9e:08991] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08991] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08991] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08991] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08991] [[33963,2],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08991] select: initializing btl component self
[2e7630b38c9e:08991] select: init of component self returned success
[2e7630b38c9e:08991] select: initializing btl component sm
[2e7630b38c9e:08991] select: init of component sm returned failure
[2e7630b38c9e:08991] mca: base: close: component sm closed
[2e7630b38c9e:08991] mca: base: close: unloading component sm
[2e7630b38c9e:08991] select: initializing btl component tcp
[2e7630b38c9e:08991] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08991] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08991] btl: tcp: Using interface: sppp
[2e7630b38c9e:08991] btl:tcp: 0x5605f1ac9560: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08991] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08991] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08991] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08991] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08991] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08991] select: init of component tcp returned success
[2e7630b38c9e:08991] mca: bml: Using self btl for send to [[33963,2],0] on node 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9eMPI_Comm_spawn ret 0: MPI_SUCCESS: no errorsMPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errorsI'm the parent. 0 0 2e7630b38c9e
rank 0
[2e7630b38c9e:08991] mca: bml: Using tcp btl for send to [[33963,1],0] on node unknown
[2e7630b38c9e:08991] btl: tcp: attempting to connect() to [[33963,1],0] address 172.24.0.2 on port 1024
[2e7630b38c9e:08991] btl:tcp: would block, so allowing background progress
[2e7630b38c9e:08991] btl:tcp: connect() to 172.24.0.2:1024 completed (complete_connect), sending connect ACK
[2e7630b38c9e:08989] btl:tcp: now connected to 172.24.0.2, process [[33963,2],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[2e7630b38c9e:08989] [[33963,1],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0]:state_dvm.c(620) updating exit status to 137
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8991 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f29fc7b681b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f29fcb0aef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7f29fc8140e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7f29fc8121a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7f29fc50b3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f29fc50bb07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f29fc782b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f29fc782be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7f29fccc6820]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7f29fcbe3c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7f29fcb36c8d]
[11] ./test(+0x15c3)[0x5650074635c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f29fc887d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f29fc887e40]
[14] ./test(+0x1245)[0x565007463245]
[2e7630b38c9e:08989] [[33963,1],0] ompi_request_is_failed: Request 0x565008264200 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source -1)
[2e7630b38c9e:08989] Recv_request_cancel: cancel granted for request 0x565008264200 because it has not matched
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
[2e7630b38c9e:08993] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08993] mca: base: components_register: found loaded component self
[2e7630b38c9e:08993] mca: base: components_register: component self register function successful
[2e7630b38c9e:08993] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08993] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08993] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08993] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08993] mca: base: components_open: opening btl components
[2e7630b38c9e:08993] mca: base: components_open: found loaded component self
[2e7630b38c9e:08993] mca: base: components_open: component self open function successful
[2e7630b38c9e:08993] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08993] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08993] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08993] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08993] [[33963,3],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08993] select: initializing btl component self
[2e7630b38c9e:08993] select: init of component self returned success
[2e7630b38c9e:08993] select: initializing btl component sm
[2e7630b38c9e:08993] select: init of component sm returned failure
[2e7630b38c9e:08993] mca: base: close: component sm closed
[2e7630b38c9e:08993] mca: base: close: unloading component sm
[2e7630b38c9e:08993] select: initializing btl component tcp
[2e7630b38c9e:08993] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08993] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08993] btl: tcp: Using interface: sppp
[2e7630b38c9e:08993] btl:tcp: 0x555e7b27d520: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08993] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08993] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08993] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08993] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08993] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08993] select: init of component tcp returned success
[2e7630b38c9e:08993] [[33963,3],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fe3e0c1081b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fe3e0f64ef7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fe3e0f651da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fe3e09652b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fe3e0965b07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fe3e0bdcb2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fe3e0bdcbe5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fe3e0f83c58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fe3e0f842f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fe3e0f76a7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fe3e0fac432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x555e7a6b7361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe3e0ce1d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe3e0ce1e40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x555e7a6b7245]
[2e7630b38c9e:08993] mca: bml: Using self btl for send to [[33963,3],0] on node 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9eMPI_Comm_spawn ret 0: MPI_SUCCESS: no errorsMPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errorsI'm the parent. 0 0 2e7630b38c9e
rank 0
[2e7630b38c9e:08993] mca: bml: Using tcp btl for send to [[33963,1],0] on node unknown
[2e7630b38c9e:08993] btl: tcp: attempting to connect() to [[33963,1],0] address 172.24.0.2 on port 1024
[2e7630b38c9e:08993] btl:tcp: would block, so allowing background progress
[2e7630b38c9e:08993] btl:tcp: connect() to 172.24.0.2:1024 completed (complete_connect), sending connect ACK
[2e7630b38c9e:08989] btl:tcp: now connected to 172.24.0.2, process [[33963,3],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[2e7630b38c9e:08989] [[33963,1],0] ompi: Process [[33963,3],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f29fc7b681b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f29fcb0aef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7f29fc8140e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7f29fc8121a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7f29fc50b3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f29fc50bb07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f29fc782b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f29fc782be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7f29fccc6820]
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8993 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7f29fcbe3c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7f29fcb36c8d]
[11] ./test(+0x15c3)[0x5650074635c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f29fc887d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f29fc887e40]
[14] ./test(+0x1245)[0x565007463245]
[2e7630b38c9e:08989] [[33963,1],0] ompi_request_is_failed: Request 0x565008264200 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source -1)
[2e7630b38c9e:08989] Recv_request_cancel: cancel granted for request 0x565008264200 because it has not matched
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
[2e7630b38c9e:08995] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08995] mca: base: components_register: found loaded component self
[2e7630b38c9e:08995] mca: base: components_register: component self register function successful
[2e7630b38c9e:08995] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08995] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08995] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08995] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08995] mca: base: components_open: opening btl components
[2e7630b38c9e:08995] mca: base: components_open: found loaded component self
[2e7630b38c9e:08995] mca: base: components_open: component self open function successful
[2e7630b38c9e:08995] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08995] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08995] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08995] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08995] [[33963,4],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08995] select: initializing btl component self
[2e7630b38c9e:08995] select: init of component self returned success
[2e7630b38c9e:08995] select: initializing btl component sm
[2e7630b38c9e:08995] select: init of component sm returned failure
[2e7630b38c9e:08995] mca: base: close: component sm closed
[2e7630b38c9e:08995] mca: base: close: unloading component sm
[2e7630b38c9e:08995] select: initializing btl component tcp
[2e7630b38c9e:08995] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08995] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08995] btl: tcp: Using interface: sppp
[2e7630b38c9e:08995] btl:tcp: 0x564f26846520: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08995] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08995] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08995] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08995] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08995] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08995] select: init of component tcp returned success
[2e7630b38c9e:08995] [[33963,4],0] ompi: Process [[33963,3],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fdb24e1981b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fdb2516def7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fdb2516e1da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fdb24b6e2b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fdb24b6eb07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fdb24de5b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fdb24de5be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fdb2518cc58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fdb2518d2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fdb2517fa7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fdb251b5432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x564f258b5361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdb24eead90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdb24eeae40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x564f258b5245]
[2e7630b38c9e:08995] [[33963,4],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fdb24e1981b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fdb2516def7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fdb2516e1da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fdb24b6e2b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fdb24b6eb07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fdb24de5b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fdb24de5be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fdb2518cc58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fdb2518d2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fdb2517fa7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fdb251b5432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x564f258b5361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdb24eead90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdb24eeae40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x564f258b5245]
[2e7630b38c9e:08995] mca: bml: Using self btl for send to [[33963,4],0] on node 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9erank 0I'm the child. 0 0 2e7630b38c9e
........
Bad remote execution:
+ mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --verbose --debug-daemons --mca btl_base_verbose 100 --mca mpi_ft_verbose 100 --map-by node:OVERSUBSCRIBE ./test
Daemon was launched on c1de8f727368 - beginning to initialize
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03925] mca: base: components_register: registering framework btl components
[c1de8f727368:03925] mca: base: components_register: found loaded component self
[c1de8f727368:03925] mca: base: components_register: component self register function successful
[c1de8f727368:03925] mca: base: components_register: found loaded component sm
[c1de8f727368:03925] mca: base: components_register: component sm register function successful
[c1de8f727368:03925] mca: base: components_register: found loaded component tcp
[c1de8f727368:03925] mca: base: components_register: component tcp register function successful
[c1de8f727368:03925] mca: base: components_open: opening btl components
[c1de8f727368:03925] mca: base: components_open: found loaded component self
[c1de8f727368:03925] mca: base: components_open: component self open functionsuccessful
[c1de8f727368:03925] mca: base: components_open: found loaded component sm
[c1de8f727368:03925] mca: base: components_open: component sm open functionsuccessful
[c1de8f727368:03925] mca: base: components_open: found loaded component tcp
[c1de8f727368:03925] mca: base: components_open: component tcp open functionsuccessful
[c1de8f727368:03925] [[61898,1],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03925] select: initializing btl component self
[c1de8f727368:03925] select: init of component self returned success
[c1de8f727368:03925] select: initializing btl component sm
[c1de8f727368:03925] select: init of component sm returned failure
[c1de8f727368:03925] mca: base: close: component sm closed
[c1de8f727368:03925] mca: base: close: unloading component sm
[c1de8f727368:03925] select: initializing btl component tcp
[c1de8f727368:03925] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03925] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03925] btl: tcp: Using interface: sppp
[c1de8f727368:03925] btl:tcp: 0x55e3ea152000: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03925] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03925] btl:tcp: Successfully bound to AF_INET port 1024
[c1de8f727368:03925] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[c1de8f727368:03925] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03925] select: init of component tcp returned success
[c1de8f727368:03925] mca: bml: Using self btl for send to [[61898,1],0] on node c1de8f727368
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03927] mca: base: components_register: registering framework btl components
[c1de8f727368:03927] mca: base: components_register: found loaded component self
[c1de8f727368:03927] mca: base: components_register: component self register function successful
[c1de8f727368:03927] mca: base: components_register: found loaded component sm
[c1de8f727368:03927] mca: base: components_register: component sm register function successful
[c1de8f727368:03927] mca: base: components_register: found loaded component tcp
[c1de8f727368:03927] mca: base: components_register: component tcp register function successful
[c1de8f727368:03927] mca: base: components_open: opening btl components
[c1de8f727368:03927] mca: base: components_open: found loaded component self
[c1de8f727368:03927] mca: base: components_open: component self open function successful
[c1de8f727368:03927] mca: base: components_open: found loaded component sm
[c1de8f727368:03927] mca: base: components_open: component sm open function successful
[c1de8f727368:03927] mca: base: components_open: found loaded component tcp
[c1de8f727368:03927] mca: base: components_open: component tcp open function successful
[c1de8f727368:03927] [[61898,2],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03927] select: initializing btl component self
[c1de8f727368:03927] select: init of component self returned success
[c1de8f727368:03927] select: initializing btl component sm
[c1de8f727368:03927] select: init of component sm returned failure
[c1de8f727368:03927] mca: base: close: component sm closed
[c1de8f727368:03927] mca: base: close: unloading component sm
[c1de8f727368:03927] select: initializing btl component tcp
[c1de8f727368:03927] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03927] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03927] btl: tcp: Using interface: sppp
[c1de8f727368:03927] btl:tcp: 0x55cfffe47330: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03927] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03927] btl:tcp: Attempting to bind to AF_INET port 1025
[c1de8f727368:03927] btl:tcp: Successfully bound to AF_INET port 1025
[c1de8f727368:03927] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[c1de8f727368:03927] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03927] select: init of component tcp returned success
[c1de8f727368:03927] mca: bml: Using self btl for send to [[61898,2],0] on node c1de8f727368
I'm the child. 0 0 c1de8f727368MPI_Comm_spawn ret 0: MPI_SUCCESS: no errorsMPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errorsI'm the parent. 0 0 c1de8f727368
rank 0
[c1de8f727368:03927] mca: bml: Using tcp btl for send to [[61898,1],0] on node unknown
[c1de8f727368:03927] btl: tcp: attempting to connect() to [[61898,1],0] address 172.24.0.4 on port 1024
[c1de8f727368:03927] btl:tcp: would block, so allowing background progress
[c1de8f727368:03927] btl:tcp: connect() to 172.24.0.4:1024 completed (complete_connect), sending connect ACK
[c1de8f727368:03925] btl:tcp: now connected to 172.24.0.4, process [[61898,2],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[c1de8f727368:03925] [[61898,1],0] ompi: Process [[61898,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0]:state_dvm.c(620) updating exit status to 137
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 3927 on node c1de8f727368 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fc3f773881b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fc3f7a8cef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7fc3f77960e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7fc3f77941a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7fc3f748d3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fc3f748db07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fc3f7704b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fc3f7704be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7fc3f7c48820]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7fc3f7b65c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7fc3f7ab8c8d]
[11] ./test(+0x15c3)[0x55e3e83e95c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fc3f7809d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc3f7809e40]
[14] ./test(+0x1245)[0x55e3e83e9245]
[c1de8f727368:03925] [[61898,1],0] ompi_request_is_failed: Request 0x55e3ea18cf80 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source -1)
[c1de8f727368:03925] Recv_request_cancel: cancel granted for request 0x55e3ea18cf80 because it has not matched
[c1de8f727368:03925] Rank 00000: DONE WITH FINALIZE
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
MPI_Comm_spawn ret 14: MPI_ERR_UNKNOWN: unknown error
MPI_Comm_spawn errcodes[0] 14: MPI_ERR_UNKNOWN: unknown error
I'm the parent. 14 -50 c1de8f727368rank 0
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03907] PRTE ERROR: Not found in file prted/pmix/pmix_server_dyn.c at line 75
[c1de8f727368:03929] mca: base: components_register: registering framework btl components
[c1de8f727368:03929] mca: base: components_register: found loaded component self
[c1de8f727368:03929] mca: base: components_register: component self register function successful
[c1de8f727368:03929] mca: base: components_register: found loaded component sm
[c1de8f727368:03929] mca: base: components_register: component sm register function successful
[c1de8f727368:03929] mca: base: components_register: found loaded component tcp
[c1de8f727368:03929] mca: base: components_register: component tcp register function successful
[c1de8f727368:03929] mca: base: components_open: opening btl components
[c1de8f727368:03929] mca: base: components_open: found loaded component self
[c1de8f727368:03929] mca: base: components_open: component self open function successful
[c1de8f727368:03929] mca: base: components_open: found loaded component sm
[c1de8f727368:03929] mca: base: components_open: component sm open function successful
[c1de8f727368:03929] mca: base: components_open: found loaded component tcp
[c1de8f727368:03929] mca: base: components_open: component tcp open function successful
[c1de8f727368:03929] [[61898,3],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03925] mca: base: close: component self closed
[c1de8f727368:03925] mca: base: close: unloading component self
[c1de8f727368:03925] mca: base: close: component tcp closed
[c1de8f727368:03925] mca: base: close: unloading component tcp
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[c1de8f727368:03929] select: initializing btl component self
[c1de8f727368:03929] select: init of component self returned success
[c1de8f727368:03929] select: initializing btl component sm
[c1de8f727368:03929] select: init of component sm returned failure
[c1de8f727368:03929] mca: base: close: component sm closed
[c1de8f727368:03929] mca: base: close: unloading component sm
[c1de8f727368:03929] select: initializing btl component tcp
[c1de8f727368:03929] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03929] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03929] btl: tcp: Using interface: sppp
[c1de8f727368:03929] btl:tcp: 0x5591b49ac2f0: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03929] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03929] btl:tcp: Successfully bound to AF_INET port 1024
[c1de8f727368:03929] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[c1de8f727368:03929] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03929] select: init of component tcp returned success
[c1de8f727368:03929] [[61898,3],0] ompi: Process [[61898,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f25fca5b81b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f25fcdafef7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7f25fcdb01da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7f25fc7b02b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f25fc7b0b07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f25fca27b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f25fca27be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7f25fcdcec58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7f25fcdcf2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7f25fcdc1a7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f25fcdf7432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x5591b41f3361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f25fcb2cd90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f25fcb2ce40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x5591b41f3245]
[c1de8f727368:03929] mca: bml: Using self btl for send to [[61898,3],0] on node c1de8f727368
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received exit cmd
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: exit cmd, 1 routes still exist
[c1de8f727368:03907] PRTE ERROR: Not found in file prted/pmix/pmix_server_dyn.c at line 75
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received exit cmd
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: all routes and children gone - exiting
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Open MPI v5.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
output of ompi_info
Please describe the system on which you are running
Details of the problem
I have a master process that spawns a worker; when the worker dies (simulated by sending it SIGKILL), the master spawns a replacement, so the setup is fault tolerant. Running everything locally on a single node works perfectly, but as soon as the worker is placed on a remote node it misbehaves and does not execute correctly: simply adding --host node2 makes it fail, even though the spawn itself is still issued locally.
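Since the actual test program is in the collapsed "Code:" section below, here is a minimal hypothetical sketch of the pattern described (parent spawns a worker, detects its death via a failed collective, respawns). The self-spawning layout, the retry count, and the Bcast handshake are illustrative assumptions, not the reporter's code:

```c
/* Hypothetical sketch of the respawn loop described above (NOT the
 * reporter's test program). Build with mpicc, run with e.g.:
 *   mpiexec -n 1 --with-ft ulfm ./master        (works locally)
 *   mpiexec -n 1 --host node2 --with-ft ulfm ./master   (the failing case)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL) {
        /* Worker side: take part in one Bcast, then exit
         * (or be killed externally with SIGKILL). */
        int v = 0;
        MPI_Bcast(&v, 1, MPI_INT, 0, parent);
        MPI_Finalize();
        return 0;
    }

    /* Master side: spawn a worker; if the Bcast reports a process
     * failure (MPI_ERR_PROC_FAILED under ULFM), spawn a replacement. */
    for (int attempt = 0; attempt < 3; attempt++) {
        MPI_Comm inter;
        int errcode;
        int rc = MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                                0, MPI_COMM_SELF, &inter, &errcode);
        if (rc != MPI_SUCCESS)
            break;
        /* Errors must be returned, not aborted on, for respawn to work. */
        MPI_Comm_set_errhandler(inter, MPI_ERRORS_RETURN);

        int v = attempt;
        rc = MPI_Bcast(&v, 1, MPI_INT, MPI_ROOT, inter);
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        printf("Parent Bcast ret %d: %s\n", rc, msg);
        if (rc == MPI_SUCCESS)
            break;               /* worker survived, we are done */
        MPI_Comm_free(&inter);   /* worker died: drop the comm, respawn */
    }

    MPI_Finalize();
    return 0;
}
```

The logs above are consistent with this shape: the first spawn and Bcast succeed, and the failure surfaces only on the respawn path once the first worker on the remote node has been killed.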
Code:
Hostname:
Good execution
Bad remote execution:
With debug
With debug good run:
Bad remote execution: