Fault tolerance error when re-spawning a process with mpiexec on a remote node #12599

Open
dariomnz opened this issue Jun 4, 2024 · 0 comments
dariomnz commented Jun 4, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • ompi_info --version
    Open MPI v5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar zxf openmpi-5.0.3.tar.gz
ln   -s openmpi-5.0.3  openmpi

# 4) Install openmpi (from source code)
mkdir -p /home/lab/bin
cd       ${DESTINATION_PATH}/openmpi
./configure --prefix=/home/lab/bin/openmpi
make -j $(nproc) all
make install
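
Not shown in the report, but assumed for the runs below: a minimal sketch of the environment that makes mpicc, mpiexec and the remotely launched prted resolve to this prefix on every node (the variable names are the standard ones for a source build; whether they were set exactly like this here is an assumption):

# Hypothetical environment setup, assuming the /home/lab/bin/openmpi prefix above
export PATH=/home/lab/bin/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/lab/bin/openmpi/lib:$LD_LIBRARY_PATH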
Output of ompi_info:
+ ompi_info
                 Package: Open MPI root@buildkitsandbox Distribution
                Open MPI: 5.0.3
  Open MPI repo revision: v5.0.3
   Open MPI release date: Apr 08, 2024
                 MPI API: 3.1.0
            Ident string: 5.0.3
                  Prefix: /home/lab/bin/openmpi
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: root
           Configured on: Fri May 31 08:42:58 UTC 2024
          Configure host: buildkitsandbox
  Configure command line: '--prefix=/home/lab/bin/openmpi'
                Built by: 
                Built on: Fri May 31 08:51:40 UTC 2024
              Built host: buildkitsandbox
              C bindings: yes
             Fort mpif.h: no
            Fort use mpi: no
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 11.4.0
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: none
       Fort compiler abs: none
         Fort ignore TKR: no
   Fort 08 assumed shape: no
      Fort optional args: no
          Fort INTERFACE: no
    Fort ISO_FORTRAN_ENV: no
       Fort STORAGE_SIZE: no
      Fort BIND(C) (all): no
      Fort ISO_C_BINDING: no
 Fort SUBROUTINE BIND(C): no
       Fort TYPE,BIND(C): no
 Fort T,BIND(C,name="a"): no
            Fort PRIVATE: no
           Fort ABSTRACT: no
       Fort ASYNCHRONOUS: no
          Fort PROCEDURE: no
         Fort USE...ONLY: no
           Fort C_FUNLOC: no
 Fort f08 using wrappers: no
         Fort MPI_SIZEOF: no
             C profiling: yes
   Fort mpif.h profiling: no
  Fort use mpi profiling: no
   Fort use mpi_f08 prof: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.3)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.3)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.3)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.3)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.3)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.3)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.3)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.3)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.3)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.3)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.0.3)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.3)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.3)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.3)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.3)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.0.3)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.0.3)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.3)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.3)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.0.3)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.3)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.3)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.3)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.0.3)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.3)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.3)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.3)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.0.3)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.4 LTS (Docker container)
  • Computer hardware: irrelevant
  • Network type: irrelevant

Details of the problem

I have a master process that spawns a worker; when the worker dies (the failure is simulated with a SIGKILL), the master spawns a replacement, so the application is fault tolerant. Running locally on a single node this works perfectly, but when the job is launched remotely on another node it misbehaves and does not recover correctly. Simply adding --host node2 makes it fail, even though the spawn itself is still done locally on that node.

Code:

#include "mpi.h"
#include "mpi-ext.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>
#include <signal.h>


int main( int argc, char *  argv[] )
{
    MPI_Comm parentcomm, intercomm;
    int cast_buf,rank,ret,eclass,len;
    int errcodes[1];
    char estring[MPI_MAX_ERROR_STRING];

    char serv_name[HOST_NAME_MAX];
    gethostname(serv_name, HOST_NAME_MAX);

    MPI_Init( &argc, &argv );
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_get_parent( &parentcomm );
    if (parentcomm == MPI_COMM_NULL)
    {
        do
        {
            ret = MPI_Comm_spawn( "/work/xpn/test/integrity/mpi_connect_accept/test" , MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes );
            MPI_Comm_set_errhandler(intercomm, MPI_ERRORS_RETURN);
            
            MPI_Error_class(ret, &eclass);
            MPI_Error_string(ret, estring, &len);
            printf("MPI_Comm_spawn ret %d: %s\n", eclass, estring);
            MPI_Error_class(errcodes[0], &eclass);
            MPI_Error_string(errcodes[0], estring, &len);
            printf("MPI_Comm_spawn errcodes[0] %d: %s\n", eclass, estring);

            printf("I'm the parent. %d %d %s\n",ret,errcodes[0],serv_name);

            ret = MPI_Bcast(&cast_buf,1,MPI_INT, 0,intercomm);
            MPI_Error_class(ret, &eclass);
            MPI_Error_string(ret, estring, &len);
            printf("Parent Bcast Error ret %d: %s\n", eclass, estring);

            ret = MPI_Bcast(&cast_buf,1,MPI_INT, 0,intercomm);
            MPI_Error_class(ret, &eclass);
            MPI_Error_string(ret, estring, &len);
            printf("Parent Bcast Error ret %d: %s\n", eclass, estring);
            if(eclass != MPIX_ERR_PROC_FAILED)
                break;
            
        } while (1);
    }else{
        printf("I'm the child. %d %d %s\n",ret,errcodes[0],serv_name);
        sleep(1);
        ret = MPI_Bcast(&cast_buf,1,MPI_INT, MPI_ROOT,parentcomm);
        MPI_Error_class(ret, &eclass);
        MPI_Error_string(ret, estring, &len);
        printf("Child Bcast Error ret %d: %s\n", eclass, estring);
        raise(SIGKILL);
        ret = MPI_Bcast(&cast_buf,1,MPI_INT, MPI_ROOT,parentcomm);
        MPI_Error_class(ret, &eclass);
        MPI_Error_string(ret, estring, &len);
        printf("Child Bcast Error ret %d: %s\n", eclass, estring);
    }
    fflush(stdout);
    MPI_Finalize();
    return 0;
}
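
As a side note, here is a hypothetical helper (not part of the reproducer above, and not claimed to change the behavior reported here) showing one way the parent loop could drop the dead intercommunicator before the next MPI_Comm_spawn; MPIX_Comm_revoke() is one of the ULFM extensions declared in mpi-ext.h in this FT-enabled build:

#include "mpi.h"
#include "mpi-ext.h"

/* Hypothetical helper, not in the original reproducer: release an
 * intercommunicator whose remote process has been reported as failed,
 * so the next iteration starts from a clean handle. */
static void drop_failed_intercomm(MPI_Comm *intercomm)
{
    if (*intercomm == MPI_COMM_NULL)
        return;
    MPIX_Comm_revoke(*intercomm); /* ULFM extension: interrupt anything still pending on it */
    MPI_Comm_free(intercomm);     /* assumption: ULFM lets the free complete despite the dead peer */
}

In the loop above it would be called right after the MPIX_ERR_PROC_FAILED check, before looping back to MPI_Comm_spawn.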

Hostname:

+ hostname
2e7630b38c9e

Good local execution:

+ mpicc -g -o test test.c
+ mpiexec -n 1 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8596 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8598 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
I'm the child. 0 0 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8600 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
...................

Bad remote execution:

+ mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
Warning: Permanently added 'c1de8f727368' (ED25519) to the list of known hosts.
I'm the child. 0 0 c1de8f727368
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 c1de8f727368
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 3861 on node c1de8f727368 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 14: MPI_ERR_UNKNOWN: unknown error
MPI_Comm_spawn errcodes[0] 14: MPI_ERR_UNKNOWN: unknown error
I'm the parent. 14 -50 c1de8f727368
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator

Good local execution with debug:

+ mpiexec -n 1 --with-ft ulfm --verbose --debug-daemons --mca btl_base_verbose 100 --mca mpi_ft_verbose 100 --map-by node:OVERSUBSCRIBE ./test
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[2e7630b38c9e:08989] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08989] mca: base: components_register: found loaded component self
[2e7630b38c9e:08989] mca: base: components_register: component self register function successful
[2e7630b38c9e:08989] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08989] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08989] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08989] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08989] mca: base: components_open: opening btl components
[2e7630b38c9e:08989] mca: base: components_open: found loaded component self
[2e7630b38c9e:08989] mca: base: components_open: component self open function successful
[2e7630b38c9e:08989] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08989] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08989] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08989] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08989] [[33963,1],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08989] select: initializing btl component self
[2e7630b38c9e:08989] select: init of component self returned success
[2e7630b38c9e:08989] select: initializing btl component sm
[2e7630b38c9e:08989] select: init of component sm returned failure
[2e7630b38c9e:08989] mca: base: close: component sm closed
[2e7630b38c9e:08989] mca: base: close: unloading component sm
[2e7630b38c9e:08989] select: initializing btl component tcp
[2e7630b38c9e:08989] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08989] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08989] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08989] btl:tcp: 0x5650082292e0: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08989] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08989] btl:tcp: Successfully bound to AF_INET port 1024
[2e7630b38c9e:08989] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[2e7630b38c9e:08989] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08989] select: init of component tcp returned success
[2e7630b38c9e:08989] mca: bml: Using self btl for send to [[33963,1],0] on node 2e7630b38c9e
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[2e7630b38c9e:08991] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08991] mca: base: components_register: found loaded component self
[2e7630b38c9e:08991] mca: base: components_register: component self register function successful
[2e7630b38c9e:08991] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08991] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08991] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08991] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08991] mca: base: components_open: opening btl components
[2e7630b38c9e:08991] mca: base: components_open: found loaded component self
[2e7630b38c9e:08991] mca: base: components_open: component self open function successful
[2e7630b38c9e:08991] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08991] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08991] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08991] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08991] [[33963,2],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08991] select: initializing btl component self
[2e7630b38c9e:08991] select: init of component self returned success
[2e7630b38c9e:08991] select: initializing btl component sm
[2e7630b38c9e:08991] select: init of component sm returned failure
[2e7630b38c9e:08991] mca: base: close: component sm closed
[2e7630b38c9e:08991] mca: base: close: unloading component sm
[2e7630b38c9e:08991] select: initializing btl component tcp
[2e7630b38c9e:08991] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08991] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08991] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08991] btl:tcp: 0x5605f1ac9560: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08991] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08991] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08991] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08991] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08991] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08991] select: init of component tcp returned success
[2e7630b38c9e:08991] mca: bml: Using self btl for send to [[33963,2],0] on node 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
rank 0
[2e7630b38c9e:08991] mca: bml: Using tcp btl for send to [[33963,1],0] on node unknown
[2e7630b38c9e:08991] btl: tcp: attempting to connect() to [[33963,1],0] address 172.24.0.2 on port 1024
[2e7630b38c9e:08991] btl:tcp: would block, so allowing background progress
[2e7630b38c9e:08991] btl:tcp: connect() to 172.24.0.2:1024 completed (complete_connect), sending connect ACK
[2e7630b38c9e:08989] btl:tcp: now connected to 172.24.0.2, process [[33963,2],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[2e7630b38c9e:08989] [[33963,1],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0]:state_dvm.c(620) updating exit status to 137
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8991 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f29fc7b681b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f29fcb0aef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7f29fc8140e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7f29fc8121a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7f29fc50b3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f29fc50bb07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f29fc782b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f29fc782be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7f29fccc6820]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7f29fcbe3c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7f29fcb36c8d]
[11] ./test(+0x15c3)[0x5650074635c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f29fc887d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f29fc887e40]
[14] ./test(+0x1245)[0x565007463245]
[2e7630b38c9e:08989] [[33963,1],0] ompi_request_is_failed: Request 0x565008264200 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source  -1)
[2e7630b38c9e:08989] Recv_request_cancel: cancel granted for request 0x565008264200 because it has not matched
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
[2e7630b38c9e:08993] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08993] mca: base: components_register: found loaded component self
[2e7630b38c9e:08993] mca: base: components_register: component self register function successful
[2e7630b38c9e:08993] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08993] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08993] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08993] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08993] mca: base: components_open: opening btl components
[2e7630b38c9e:08993] mca: base: components_open: found loaded component self
[2e7630b38c9e:08993] mca: base: components_open: component self open function successful
[2e7630b38c9e:08993] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08993] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08993] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08993] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08993] [[33963,3],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08993] select: initializing btl component self
[2e7630b38c9e:08993] select: init of component self returned success
[2e7630b38c9e:08993] select: initializing btl component sm
[2e7630b38c9e:08993] select: init of component sm returned failure
[2e7630b38c9e:08993] mca: base: close: component sm closed
[2e7630b38c9e:08993] mca: base: close: unloading component sm
[2e7630b38c9e:08993] select: initializing btl component tcp
[2e7630b38c9e:08993] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08993] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08993] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08993] btl:tcp: 0x555e7b27d520: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08993] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08993] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08993] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08993] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08993] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08993] select: init of component tcp returned success
[2e7630b38c9e:08993] [[33963,3],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fe3e0c1081b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fe3e0f64ef7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fe3e0f651da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fe3e09652b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fe3e0965b07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fe3e0bdcb2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fe3e0bdcbe5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fe3e0f83c58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fe3e0f842f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fe3e0f76a7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fe3e0fac432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x555e7a6b7361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe3e0ce1d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe3e0ce1e40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x555e7a6b7245]
[2e7630b38c9e:08993] mca: bml: Using self btl for send to [[33963,3],0] on node 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
rank 0
[2e7630b38c9e:08993] mca: bml: Using tcp btl for send to [[33963,1],0] on node unknown
[2e7630b38c9e:08993] btl: tcp: attempting to connect() to [[33963,1],0] address 172.24.0.2 on port 1024
[2e7630b38c9e:08993] btl:tcp: would block, so allowing background progress
[2e7630b38c9e:08993] btl:tcp: connect() to 172.24.0.2:1024 completed (complete_connect), sending connect ACK
[2e7630b38c9e:08989] btl:tcp: now connected to 172.24.0.2, process [[33963,3],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[2e7630b38c9e:08989] [[33963,1],0] ompi: Process [[33963,3],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f29fc7b681b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f29fcb0aef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7f29fc8140e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7f29fc8121a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7f29fc50b3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f29fc50bb07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f29fc782b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f29fc782be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7f29fccc6820]
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8993 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7f29fcbe3c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7f29fcb36c8d]
[11] ./test(+0x15c3)[0x5650074635c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f29fc887d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f29fc887e40]
[14] ./test(+0x1245)[0x565007463245]
[2e7630b38c9e:08989] [[33963,1],0] ompi_request_is_failed: Request 0x565008264200 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source  -1)
[2e7630b38c9e:08989] Recv_request_cancel: cancel granted for request 0x565008264200 because it has not matched
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
[2e7630b38c9e:08995] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08995] mca: base: components_register: found loaded component self
[2e7630b38c9e:08995] mca: base: components_register: component self register function successful
[2e7630b38c9e:08995] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08995] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08995] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08995] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08995] mca: base: components_open: opening btl components
[2e7630b38c9e:08995] mca: base: components_open: found loaded component self
[2e7630b38c9e:08995] mca: base: components_open: component self open function successful
[2e7630b38c9e:08995] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08995] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08995] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08995] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08995] [[33963,4],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08995] select: initializing btl component self
[2e7630b38c9e:08995] select: init of component self returned success
[2e7630b38c9e:08995] select: initializing btl component sm
[2e7630b38c9e:08995] select: init of component sm returned failure
[2e7630b38c9e:08995] mca: base: close: component sm closed
[2e7630b38c9e:08995] mca: base: close: unloading component sm
[2e7630b38c9e:08995] select: initializing btl component tcp
[2e7630b38c9e:08995] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08995] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08995] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08995] btl:tcp: 0x564f26846520: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08995] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08995] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08995] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08995] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08995] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08995] select: init of component tcp returned success
[2e7630b38c9e:08995] [[33963,4],0] ompi: Process [[33963,3],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fdb24e1981b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fdb2516def7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fdb2516e1da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fdb24b6e2b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fdb24b6eb07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fdb24de5b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fdb24de5be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fdb2518cc58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fdb2518d2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fdb2517fa7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fdb251b5432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x564f258b5361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdb24eead90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdb24eeae40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x564f258b5245]
[2e7630b38c9e:08995] [[33963,4],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fdb24e1981b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fdb2516def7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fdb2516e1da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fdb24b6e2b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fdb24b6eb07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fdb24de5b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fdb24de5be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fdb2518cc58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fdb2518d2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fdb2517fa7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fdb251b5432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x564f258b5361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdb24eead90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdb24eeae40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x564f258b5245]
[2e7630b38c9e:08995] mca: bml: Using self btl for send to [[33963,4],0] on node 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
rank 0
I'm the child. 0 0 2e7630b38c9e
........

Bad remote execution with debug:

+ mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --verbose --debug-daemons --mca btl_base_verbose 100 --mca mpi_ft_verbose 100 --map-by node:OVERSUBSCRIBE ./test
Daemon was launched on c1de8f727368 - beginning to initialize
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03925] mca: base: components_register: registering framework btl components
[c1de8f727368:03925] mca: base: components_register: found loaded component self
[c1de8f727368:03925] mca: base: components_register: component self register function successful
[c1de8f727368:03925] mca: base: components_register: found loaded component sm
[c1de8f727368:03925] mca: base: components_register: component sm register function successful
[c1de8f727368:03925] mca: base: components_register: found loaded component tcp
[c1de8f727368:03925] mca: base: components_register: component tcp register function successful
[c1de8f727368:03925] mca: base: components_open: opening btl components
[c1de8f727368:03925] mca: base: components_open: found loaded component self
[c1de8f727368:03925] mca: base: components_open: component self open function successful
[c1de8f727368:03925] mca: base: components_open: found loaded component sm
[c1de8f727368:03925] mca: base: components_open: component sm open function successful
[c1de8f727368:03925] mca: base: components_open: found loaded component tcp
[c1de8f727368:03925] mca: base: components_open: component tcp open function successful
[c1de8f727368:03925] [[61898,1],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03925] select: initializing btl component self
[c1de8f727368:03925] select: init of component self returned success
[c1de8f727368:03925] select: initializing btl component sm
[c1de8f727368:03925] select: init of component sm returned failure
[c1de8f727368:03925] mca: base: close: component sm closed
[c1de8f727368:03925] mca: base: close: unloading component sm
[c1de8f727368:03925] select: initializing btl component tcp
[c1de8f727368:03925] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03925] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03925] btl: tcp: Using interface: sppp 
[c1de8f727368:03925] btl:tcp: 0x55e3ea152000: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03925] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03925] btl:tcp: Successfully bound to AF_INET port 1024
[c1de8f727368:03925] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[c1de8f727368:03925] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03925] select: init of component tcp returned success
[c1de8f727368:03925] mca: bml: Using self btl for send to [[61898,1],0] on node c1de8f727368
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03927] mca: base: components_register: registering framework btl components
[c1de8f727368:03927] mca: base: components_register: found loaded component self
[c1de8f727368:03927] mca: base: components_register: component self register function successful
[c1de8f727368:03927] mca: base: components_register: found loaded component sm
[c1de8f727368:03927] mca: base: components_register: component sm register function successful
[c1de8f727368:03927] mca: base: components_register: found loaded component tcp
[c1de8f727368:03927] mca: base: components_register: component tcp register function successful
[c1de8f727368:03927] mca: base: components_open: opening btl components
[c1de8f727368:03927] mca: base: components_open: found loaded component self
[c1de8f727368:03927] mca: base: components_open: component self open function successful
[c1de8f727368:03927] mca: base: components_open: found loaded component sm
[c1de8f727368:03927] mca: base: components_open: component sm open function successful
[c1de8f727368:03927] mca: base: components_open: found loaded component tcp
[c1de8f727368:03927] mca: base: components_open: component tcp open function successful
[c1de8f727368:03927] [[61898,2],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03927] select: initializing btl component self
[c1de8f727368:03927] select: init of component self returned success
[c1de8f727368:03927] select: initializing btl component sm
[c1de8f727368:03927] select: init of component sm returned failure
[c1de8f727368:03927] mca: base: close: component sm closed
[c1de8f727368:03927] mca: base: close: unloading component sm
[c1de8f727368:03927] select: initializing btl component tcp
[c1de8f727368:03927] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03927] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03927] btl: tcp: Using interface: sppp 
[c1de8f727368:03927] btl:tcp: 0x55cfffe47330: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03927] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03927] btl:tcp: Attempting to bind to AF_INET port 1025
[c1de8f727368:03927] btl:tcp: Successfully bound to AF_INET port 1025
[c1de8f727368:03927] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[c1de8f727368:03927] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03927] select: init of component tcp returned success
[c1de8f727368:03927] mca: bml: Using self btl for send to [[61898,2],0] on node c1de8f727368
I'm the child. 0 0 c1de8f727368
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 c1de8f727368
rank 0
[c1de8f727368:03927] mca: bml: Using tcp btl for send to [[61898,1],0] on node unknown
[c1de8f727368:03927] btl: tcp: attempting to connect() to [[61898,1],0] address 172.24.0.4 on port 1024
[c1de8f727368:03927] btl:tcp: would block, so allowing background progress
[c1de8f727368:03927] btl:tcp: connect() to 172.24.0.4:1024 completed (complete_connect), sending connect ACK
[c1de8f727368:03925] btl:tcp: now connected to 172.24.0.4, process [[61898,2],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[c1de8f727368:03925] [[61898,1],0] ompi: Process [[61898,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0]:state_dvm.c(620) updating exit status to 137
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 3927 on node c1de8f727368 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fc3f773881b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fc3f7a8cef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7fc3f77960e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7fc3f77941a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7fc3f748d3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fc3f748db07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fc3f7704b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fc3f7704be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7fc3f7c48820]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7fc3f7b65c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7fc3f7ab8c8d]
[11] ./test(+0x15c3)[0x55e3e83e95c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fc3f7809d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc3f7809e40]
[14] ./test(+0x1245)[0x55e3e83e9245]
[c1de8f727368:03925] [[61898,1],0] ompi_request_is_failed: Request 0x55e3ea18cf80 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source  -1)
[c1de8f727368:03925] Recv_request_cancel: cancel granted for request 0x55e3ea18cf80 because it has not matched
[c1de8f727368:03925] Rank 00000: DONE WITH FINALIZE
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
MPI_Comm_spawn ret 14: MPI_ERR_UNKNOWN: unknown error
MPI_Comm_spawn errcodes[0] 14: MPI_ERR_UNKNOWN: unknown error
I'm the parent. 14 -50 c1de8f727368
rank 0
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03907] PRTE ERROR: Not found in file prted/pmix/pmix_server_dyn.c at line 75
[c1de8f727368:03929] mca: base: components_register: registering framework btl components
[c1de8f727368:03929] mca: base: components_register: found loaded component self
[c1de8f727368:03929] mca: base: components_register: component self register function successful
[c1de8f727368:03929] mca: base: components_register: found loaded component sm
[c1de8f727368:03929] mca: base: components_register: component sm register function successful
[c1de8f727368:03929] mca: base: components_register: found loaded component tcp
[c1de8f727368:03929] mca: base: components_register: component tcp register function successful
[c1de8f727368:03929] mca: base: components_open: opening btl components
[c1de8f727368:03929] mca: base: components_open: found loaded component self
[c1de8f727368:03929] mca: base: components_open: component self open function successful
[c1de8f727368:03929] mca: base: components_open: found loaded component sm
[c1de8f727368:03929] mca: base: components_open: component sm open function successful
[c1de8f727368:03929] mca: base: components_open: found loaded component tcp
[c1de8f727368:03929] mca: base: components_open: component tcp open function successful
[c1de8f727368:03929] [[61898,3],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03925] mca: base: close: component self closed
[c1de8f727368:03925] mca: base: close: unloading component self
[c1de8f727368:03925] mca: base: close: component tcp closed
[c1de8f727368:03925] mca: base: close: unloading component tcp
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[c1de8f727368:03929] select: initializing btl component self
[c1de8f727368:03929] select: init of component self returned success
[c1de8f727368:03929] select: initializing btl component sm
[c1de8f727368:03929] select: init of component sm returned failure
[c1de8f727368:03929] mca: base: close: component sm closed
[c1de8f727368:03929] mca: base: close: unloading component sm
[c1de8f727368:03929] select: initializing btl component tcp
[c1de8f727368:03929] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03929] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03929] btl: tcp: Using interface: sppp 
[c1de8f727368:03929] btl:tcp: 0x5591b49ac2f0: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03929] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03929] btl:tcp: Successfully bound to AF_INET port 1024
[c1de8f727368:03929] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[c1de8f727368:03929] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03929] select: init of component tcp returned success
[c1de8f727368:03929] [[61898,3],0] ompi: Process [[61898,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f25fca5b81b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f25fcdafef7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7f25fcdb01da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7f25fc7b02b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f25fc7b0b07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f25fca27b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f25fca27be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7f25fcdcec58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7f25fcdcf2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7f25fcdc1a7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f25fcdf7432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x5591b41f3361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f25fcb2cd90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f25fcb2ce40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x5591b41f3245]
[c1de8f727368:03929] mca: bml: Using self btl for send to [[61898,3],0] on node c1de8f727368
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received exit cmd
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: exit cmd, 1 routes still exist
[c1de8f727368:03907] PRTE ERROR: Not found in file prted/pmix/pmix_server_dyn.c at line 75
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received exit cmd
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: all routes and children gone - exiting