Address already in use error #22

Open
clarkipeng opened this issue Dec 5, 2024 · 1 comment
Comments

clarkipeng commented Dec 5, 2024

Hello,

I am trying to run on 3 GPUs on a server I access over SSH, but I keep getting the error below whenever I use multiple GPUs. The single-GPU case works perfectly.

This is the main error message:
RuntimeError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use

My command looks something like this:

CUDA_VISIBLE_DEVICES=3,4,5 python3 ./demos/cli.py --model_dir <model_dir> --t5_model_path <t5_model_path> --num_frames 61 --use_xdit --ulysses_degree 3 --ring_degree 1

I have tried setting MASTER_PORT to other values such as 29501, but the error message still refers to port 29500. I have also checked port 29500 with lsof and grep and found no process using it. Does anyone know how to fix this?
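As a cross-check on the lsof result, the port can also be probed directly from Python by trying to bind to it. The following is a minimal sketch, not part of the original report; the helper name `port_is_free` and the choice of `127.0.0.1` as the host are assumptions made for illustration.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration; it only reports the state of the
    port at the moment of the call.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already holds the port.
            return False
        return True


if __name__ == "__main__":
    # 29500 is the port named in the error message above.
    for candidate in (29500, 29501):
        state = "free" if port_is_free(candidate) else "in use"
        print(f"port {candidate}: {state}")
```

If this reports the port as free before launch and the error still appears, one possible explanation is that two processes within the same launch are both trying to listen on that port, rather than an unrelated process holding it.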

Here is the full error message if you want to take a look:
error.txt

Thanks in advance!

@feifeibear
Contributor

This can be solved with appropriate torchrun usage.

MASTER_PORT: Explicitly set MASTER_PORT to a different value (e.g., 29501 or any other unused port) in your command.
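A rough sketch of that suggestion: pick an unused port, then pass it both through the MASTER_PORT environment variable and through torchrun's `--master_port` option. The helper names below are hypothetical, and whether `./demos/cli.py` can be launched under torchrun, or reads MASTER_PORT at all, is an assumption that this thread does not confirm.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def build_launch_env(port, base_env=None):
    """Copy the environment and set MASTER_ADDR / MASTER_PORT in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["MASTER_ADDR"] = "127.0.0.1"
    env["MASTER_PORT"] = str(port)
    return env


def build_torchrun_command(port, nproc, script, script_args):
    """Build a torchrun argument list; nothing is executed here."""
    return [
        "torchrun",
        "--nproc_per_node", str(nproc),
        "--master_port", str(port),
        script,
        *script_args,
    ]


if __name__ == "__main__":
    port = find_free_port()
    env = build_launch_env(port)
    # Script path and flags are copied from the command in the question.
    cmd = build_torchrun_command(
        port, 3, "./demos/cli.py",
        ["--use_xdit", "--ulysses_degree", "3", "--ring_degree", "1"],
    )
    print("MASTER_PORT=" + env["MASTER_PORT"], " ".join(cmd))
```

The port is released before the launcher starts, so another process could in principle claim it in between; for a single-user session this is usually acceptable. The command list and environment can be handed to `subprocess.run(cmd, env=env)` once the remaining arguments (model paths, frame count) are filled in.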
