Using the command line for parallel inference returns errors #102
Make sure the file exists at the following path: /HunyuanVideo/sample_video_parallel.py
I ran the latest scripts, but the error still persists.
The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means it was forcefully killed, most likely because it ran out of memory or exceeded a resource limit. What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.
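For context on why the exit code reads -9: Python's process layer reports a child killed by a signal as the negative signal number, and signal 9 is SIGKILL, which the Linux OOM killer sends. A minimal sketch (not part of the repo, Linux-only) that reproduces the mapping:

```python
import signal
import subprocess
import sys

# Start a child that would sleep for a minute, then kill it with SIGKILL
# to reproduce the "exitcode: -9" torchrun reports for OOM-killed workers.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
returncode = child.wait()

# A negative return code means "terminated by signal abs(returncode)".
print(returncode)                     # -9 on Linux
assert returncode == -signal.SIGKILL  # SIGKILL == 9
```

So `-9` in a torchrun failure record points at an external kill (usually the kernel OOM killer), not at an exception inside the script.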
8 * NVIDIA L20 (48GB)
I use 2 * A800 GPUs (80GB VRAM each) to run the model; I don't use 8 NVIDIA GPUs.
I set 50GB of memory to run the model. Do you think that is enough? How much does it need? Thanks!
I guess that's enough, but I am not sure; you can give it a try. An 80GB A100 is definitely enough.
I encountered the same problem. I have 8 Tesla T4s; here is my output:
I am afraid that your error is different: your problem is that /data/HunyuanVideo/sample_video_parallel.py does not exist.
```
(myenv) root@hunyuanvideo:/HunyuanVideo# torchrun --nproc_per_node=4 sample_video_parallel.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "A cat walks on the grass, realistic style." --flow-reverse --seed 42 --ulysses_degree 1 --ring_degree 4 --save-path ./results
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779]
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] *****************************************
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] *****************************************
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideoWeb/sample_video_parallel.py': [Errno 2] No such file or directory
E1210 17:54:08.883000 140392803211072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 1418) of binary: /opt/conda/envs/myenv/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
sample_video_parallel.py FAILED
Failures:
[1]:
  time       : 2024-12-10_17:54:08
  host       : hunyuanvideo
  rank       : 1 (local_rank: 1)
  exitcode   : 2 (pid: 1419)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-12-10_17:54:08
  host       : hunyuanvideo
  rank       : 2 (local_rank: 2)
  exitcode   : 2 (pid: 1420)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time       : 2024-12-10_17:54:08
  host       : hunyuanvideo
  rank       : 3 (local_rank: 3)
  exitcode   : 2 (pid: 1421)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2024-12-10_17:54:08
  host       : hunyuanvideo
  rank       : 0 (local_rank: 0)
  exitcode   : 2 (pid: 1418)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```