
Using docker image on cluster gives error #3292

Open
razvangamanut opened this issue Aug 16, 2024 · 3 comments
Assignees
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) P: Won't fix No one will work on this in the near future. See comments for details S: Normal Handle this with default priority T: External bug Not an issue that can be solved here. (May need documentation, though)

Comments


razvangamanut commented Aug 16, 2024

Describe the bug
I ran a Python script on the university HPC using the NEST 3.8 Docker image with OpenMPI and Singularity. When I submitted it via Slurm, it gave errors.

To Reproduce
Steps to reproduce the behavior:

  1. (Minimal) reproducing example

The main commands in the slurm file were:

```shell
module load openmpi.gcc/4.0.3
module load singularity

srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
# or
mpirun -n 8 singularity run ./nest.sif python3 simulation.py
```
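For context, these launch commands would typically sit in a Slurm batch script along the following lines. This is a sketch only: the `#SBATCH` directives (job name, task count, walltime) are illustrative placeholders, not taken from the report; only the module names, image, and script names come from the commands above.

```shell
#!/bin/bash
#SBATCH --job-name=nest-sim   # hypothetical job name
#SBATCH --ntasks=8            # matches the mpirun -n 8 example above
#SBATCH --time=01:00:00       # placeholder walltime

module load openmpi.gcc/4.0.3
module load singularity

# Launch one container instance per MPI rank. PMIx must be supported
# by both the host launcher and the Open MPI inside the container.
srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
```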

Screenshots
[when using srun --mpi=pmix] : PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169

[when using mpirun -n 8]:

```
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS


It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[28962,1],0]
Exit code: 1
```

Desktop/Environment:

  • NEST-Version: NEST 3.8 docker image
@gtrensch gtrensch added T: Bug Wrong statements in the code or documentation S: Normal Handle this with default priority and removed T: Bug Wrong statements in the code or documentation S: Normal Handle this with default priority labels Aug 23, 2024
@gtrensch gtrensch added S: Normal Handle this with default priority I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) T: External bug Not an issue that can be solved here. (May need documentation, though) P: Won't fix No one will work on this in the near future. See comments for details labels Aug 23, 2024
@gtrensch gtrensch self-assigned this Aug 23, 2024
@terhorstd
Contributor

My apologies, closed accidentally.

@terhorstd terhorstd reopened this Aug 23, 2024
@gtrensch
Contributor

@razvangamanut, the problem is most likely caused by an incompatibility between the MPI library inside the container and the one on the cluster. We are afraid that for your HPC system, NEST needs to be built from source so that it links against the system-specific MPI libraries. Currently, we do not know how to properly handle external, site-specific MPI setups, especially on HPC systems. You may also contact the administrators of your HPC system to ask whether they know of a solution; we would be very interested in such expertise as well.
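For reference, a from-source build against the site's MPI usually follows the pattern below. This is a sketch under assumptions: the module name is taken from the report, the install prefix and `-j 8` are placeholders, and `-Dwith-mpi=ON` and `nest_vars.sh` are the standard NEST CMake switch and environment script.

```shell
# Sketch: build NEST against the cluster's own Open MPI.
# Paths and module names are site-specific placeholders.
module load openmpi.gcc/4.0.3

git clone https://github.com/nest/nest-simulator.git
mkdir nest-build && cd nest-build
cmake -DCMAKE_INSTALL_PREFIX=$HOME/nest-install \
      -Dwith-mpi=ON \
      ../nest-simulator
make -j 8 && make install

# Make PyNEST importable in job scripts.
source $HOME/nest-install/bin/nest_vars.sh
```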

@hamannju

Hello, I had a similar issue with NEST/OpenMPI compatibility while getting some old code running for a PhD student. We worked on the Imperial College London cluster, where you can use conda environments from your user profile, so we built NEST 2.20.2 into a clean conda environment and that worked well with MPI.

This is the repo with the code: https://github.com/hamannju/anaconda-nest

It was a little messy, but if users can bring their own conda environment, it runs on an HPC cluster.
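A minimal sketch of that approach, assuming the `nest-simulator` package from conda-forge (the environment name and Python pin are placeholders, and version availability may vary by platform):

```shell
# Create a self-contained conda environment with NEST.
# Version pin follows the comment above; adjust as needed.
conda create -n nest-2.20 -c conda-forge python=3.8 nest-simulator=2.20.2
conda activate nest-2.20

# Smoke test: confirm PyNEST imports inside the environment.
python -c "import nest"
```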

Projects
Status: In progress
Development

No branches or pull requests

4 participants