Address already in use error #22

clarkipeng · 2024-12-05T08:08:44Z

Hello,

I was trying to run with 3 GPUs on an SSH server, but I keep getting this error when I try to use multiple GPUs. The single-GPU case works perfectly.

This is the main error message:
RuntimeError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use

My command looks something like this:

CUDA_VISIBLE_DEVICES=3,4,5 python3 ./demos/cli.py --model_dir <model_dir> --t5_model_path <t5_model_path> --num_frames 61 --use_xdit --ulysses_degree 3 --ring_degree 1

I have tried changing the MASTER_PORT to other numbers like 29501, but the same error message pops up about 29500. I have also used lsof and grep on the port 29500, but found no other processes. Does anyone know how to fix this?

Here is the full error message if you want to take a look:
error.txt

Thanks in advance!


feifeibear · 2024-12-10T09:21:57Z

This can be solved with appropriate torchrun usage.

MASTER_PORT: Explicitly set MASTER_PORT to a different value (e.g., 29501 or any other unused port) in your command.
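
For picking "any other unused port", a minimal sketch in Python is below. It checks whether 29500 can be bound and otherwise asks the OS for a free port. The helper names `port_is_free` and `find_free_port` are hypothetical (they are not part of this repo or of PyTorch), and it is an assumption that the launcher you use actually reads MASTER_PORT from the environment. The report above suggests the demo script may not, in which case the port has to be passed through the launcher instead.

```python
import os
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else already holds the port.
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # Prefer the default port; fall back to an OS-assigned one.
    port = 29500 if port_is_free(29500) else find_free_port()
    # Assumption: the process that starts the distributed job reads MASTER_PORT.
    os.environ["MASTER_PORT"] = str(port)
    print(f"MASTER_PORT={port}")
```

Note that another process could grab the port between this check and the actual launch, so treat the result as a good guess and not a guarantee.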
