MPI_Init fails with some openMPI variants #4

Open
carlos-pinto-coelho-microsoft opened this issue Mar 2, 2017 · 7 comments

Comments

@carlos-pinto-coelho-microsoft

Getting the following error:
[phlrr4019:30856] mca: base: component_find: unable to open /usr/local/openmpi-1.10.3-cuda-8.0/lib/openmpi/mca_shmem_sysv: /usr/local/openmpi-1.10.3-cuda-8.0/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[phlrr4019:30856] mca: base: component_find: unable to open /usr/local/openmpi-1.10.3-cuda-8.0/lib/openmpi/mca_shmem_posix: /usr/local/openmpi-1.10.3-cuda-8.0/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[phlrr4019:30856] mca: base: component_find: unable to open /usr/local/openmpi-1.10.3-cuda-8.0/lib/openmpi/mca_shmem_mmap: /usr/local/openmpi-1.10.3-cuda-8.0/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)

It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS


It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[phlrr4019:30856] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

@carlos-pinto-coelho-microsoft
Author

carlos-pinto-coelho-microsoft commented Mar 2, 2017

I was searching around and found

https://svn.open-mpi.org/trac/ompi/wiki/Linkers

To debug the problem I reduced this to a simple plugin, foo, that just calls MPI_Init inside and that I build with
/usr/local/mpi/bin/mpic++ -std=c++11 -shared foo.cpp -o foo.so -fPIC -I $TF_INC -O2
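
A minimal sketch of such a foo.cpp (the exact contents are assumed here; it only needs to call MPI_Init):

// foo.cpp -- hypothetical minimal reproducer: a shared library whose only
// job is to call MPI_Init when invoked from the host program.
#include <mpi.h>

extern "C" void foo_init() {
    // MPI_Init(NULL, NULL) is valid per the MPI standard; this is the call
    // that dies with "opal_shmem_base_select failed" in the log above.
    MPI_Init(nullptr, nullptr);
}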

Presumably this is similar to what mpi4py does but mpi4py has some dlopen “stuff” in https://bitbucket.org/mpi4py/mpi4py/src/eaf4f475857ec2330ef4781289328d1d44068460/src/dynload.c?at=master&fileviewer=file-view-default

Borrowing this
static void PyMPI_OPENMPI_dlopen_libmpi(void)
{
  /* requires <dlfcn.h>; pre-load libmpi with global symbol visibility */
  void *handle = 0;
  int mode = RTLD_NOW | RTLD_GLOBAL;
  /* GNU/Linux and others */
#ifdef RTLD_NOLOAD
  mode |= RTLD_NOLOAD;
#endif
  if (!handle) handle = dlopen("libmpi.so.20", mode);
  if (!handle) handle = dlopen("libmpi.so.12", mode);
  if (!handle) handle = dlopen("libmpi.so.1", mode);
  if (!handle) handle = dlopen("libmpi.so.0", mode);
  if (!handle) handle = dlopen("libmpi.so", mode);
}

static int PyMPI_OPENMPI_MPI_Init(int *argc, char ***argv)
{
  PyMPI_OPENMPI_dlopen_libmpi();
  return MPI_Init(argc, argv);
}
#undef MPI_Init
#define MPI_Init PyMPI_OPENMPI_MPI_Init

static int PyMPI_OPENMPI_MPI_Init_thread(int *argc, char ***argv,
                                         int required, int *provided)
{
  PyMPI_OPENMPI_dlopen_libmpi();
  return MPI_Init_thread(argc, argv, required, provided);
}
#undef MPI_Init_thread
#define MPI_Init_thread PyMPI_OPENMPI_MPI_Init_thread

from mpi4py "fixes" the issue for my build, but perhaps it would be useful to make the two libraries coexist a bit better.

@qianglan

@carlos-pinto-coelho-microsoft, hi, I also ran into this problem, but I don't fully understand your solution. First, the reason the problem happens is that the MPI_Init function is not called in allreduce-test.py, am I right? So you created a new file, foo.cpp, which does the MPI_Init. After you compile foo.cpp into a library, how do you call foo.so from allreduce-test.cpp?

@qianglan

It seems the reason is that I installed two versions of MPI and didn't change some environment variables like LD_LIBRARY_PATH; it also seems that I need to pass --disable-dlopen during the configure stage of OpenMPI.

@carlos-pinto-coelho-microsoft
Author

carlos-pinto-coelho-microsoft commented Mar 22, 2017

@qianglan the MPI_Init happens in the background thread, but it fails for some openmpi builds. The code I added above was taken from mpi4py and "fixes" the issue so I don't have to mess around with my openmpi build, which I don't even control on our cluster.

The other issue that I had with the code is that the way it calls MPI_Init in the background thread prevents this from working with other libraries that also call MPI_Init such as mpi4py. Locally, I ended up modifying the code to do the MPI_Init explicitly (or not at all for the case when I was using this and mpi4py in the same program).

@plegresl

The typical way to write a library that uses MPI is to first call MPI_Initialized(), to check if MPI_Init() has already been called, and skip the MPI_Init() call within that library if it has already been called somewhere else.
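
Something like this, as a minimal sketch (the function name is only illustrative, not from any particular library):

#include <mpi.h>

// Initialize MPI only if nobody else (e.g. mpi4py) has done it already.
static void ensure_mpi_initialized(int *argc, char ***argv) {
    int initialized = 0;
    MPI_Initialized(&initialized);
    if (!initialized) {
        MPI_Init(argc, argv);
    }
}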

@qianglan

OK, so the problem happened because MPI_Init was executed twice. Thanks.

@qianglan

A simple solution is to add the following to the allreduce-test.py file:

import ctypes
ctypes.CDLL("libmpi.so", mode=ctypes.RTLD_GLOBAL)
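
If I understand correctly, this works because loading libmpi.so with RTLD_GLOBAL makes the opal_*/orte_* symbols globally visible, so the MCA components that Open MPI dlopen()s later (mca_shmem_sysv and friends in the log above) can resolve them.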
