DVM environment variable? #12597

Closed
100405907 opened this issue Jun 4, 2024 · 5 comments

Comments

@100405907

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI v5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar zxf openmpi-5.0.3.tar.gz
ln -s openmpi-5.0.3 openmpi

cd openmpi
./configure --prefix=/home/lab/bin/openmpi --with-slurm=/opt/slurm
make -j $(nproc) all
make install

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04.5 LTS
  • Computer hardware: irrelevant
  • Network type: irrelevant

Details of the problem

I have a Spark application that uses Open MPI. In my application, the server runs locally and the clients run in the Spark executors. The problem is that the clients cannot find the server. I need to pass the server URI to Spark, as is done with the --dvm option. Is there an environment variable that achieves this?

Thanks in advance.

@rhc54
Contributor

rhc54 commented Jun 4, 2024

Just to be clear: you have a DVM running that you started with just prte, and you have a set of procs that are being started via some separate method (and so they will appear as singletons) - correct? So the problem is to tell the singletons how to connect to the DVM?

The singletons will automatically look for a system server, so all you should need to do is start the DVM with the "system server" flag: prte --system-server.

I see you configured --with-slurm for some reason. If you are using Slurm to start the procs, that could be a problem, as the procs may automatically connect to the Slurm daemon and not the DVM. Your only sure bet would be to add an option to your program that circumvents that behavior, but I'd have to think about it for a while and probably experiment a bit.

@100405907
Author

I'm starting my DVM with prte as follows:

prte --hostfile $HOME/gsotodos/conf/machines_mpi --report-uri $PRTEFILE --no-ready-msg &
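Because prte is launched in the background with &, a script that starts spark-submit right afterwards could run before prte has written its URI file. This is only one possible cause worth ruling out, not a diagnosis. A minimal shell sketch for that check, assuming the file named by --report-uri holds the URI on its first line; the function wait_for_uri and its timeout argument are hypothetical helpers, not part of Open MPI or PRRTE:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Open MPI/PRRTE): wait until the file
# written by `prte --report-uri FILE` is non-empty, then print its first line.
# Assumption: the URI is written on a single line.
wait_for_uri() {
  local file=$1
  local timeout=${2:-30}   # seconds to wait before giving up
  local waited=0
  while [ ! -s "$file" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $file" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  head -n 1 "$file"
}
```

For example, run DVM_URI=$(wait_for_uri "$PRTEFILE" 60) || exit 1 right after the prte line and only then call spark-submit. If the URI appears and the clients still fail to connect, this race is not the cause.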

I configured --with-slurm for other purposes, but since my application runs on Spark, I start my jobs with spark-submit. I tried starting my server with the --system-server flag, but the clients still cannot connect to the server.

Is there any way to specify this configuration through environment variables, or to launch my clients without mpiexec?

@rhc54
Contributor

rhc54 commented Jun 5, 2024

When you say "it cannot connect", what are you seeing that tells you this? I just tested it and the clients connect just fine. The issue may be in what data they expected to be able to access.


github-actions bot commented Jul 1, 2024

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions bot added the Stale label Jul 1, 2024

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions bot closed this as not planned Jul 16, 2024