Using the command line for parallel inference returns errors #102

Closed
sunshinewhy opened this issue Dec 10, 2024 · 9 comments

Comments

@sunshinewhy

(myenv) root@hunyuanvideo:/HunyuanVideo# torchrun --nproc_per_node=4 sample_video_parallel.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "A cat walks on the grass, realistic style." --flow-reverse --seed 42 --ulysses_degree 1 --ring_degree 4 --save-path ./results

W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779]
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] *****************************************
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] *****************************************
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideoWeb/sample_video_parallel.py': [Errno 2] No such file or directory
E1210 17:54:08.883000 140392803211072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 1418) of binary: /opt/conda/envs/myenv/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/myenv/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video_parallel.py FAILED

Failures:
[1]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1419)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 1420)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 3 (local_rank: 3)
exitcode : 2 (pid: 1421)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 1418)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@feifeibear
Contributor

Make sure the file exists at the following path:

/HunyuanVideo/sample_video_parallel.py
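
A quick way to check, assuming the repository was cloned to /HunyuanVideo (the path torchrun is resolving against):

ls -l /HunyuanVideo/sample_video_parallel.py   # confirm the script is actually there
cd /HunyuanVideo && git pull                   # if it is missing, update the checkout; the parallel sample script may only exist in newer versions of the repo (assumption)

Note that torchrun resolves a relative script name against the current working directory, so the command must be launched from the directory that contains the script.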

@sunshinewhy
Author

I ran the latest scripts, but the error still exists.
(myenv) root@hunyuanvideoweb-54f4564c5-2q9xq:/MultiModel/HunyuanVideoWeb# torchrun --nproc_per_node=2 sample_video.py \
    --video-size 832 624 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 1 \
    --ring-degree 2 \
    --save-path ./results
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779]
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779] *****************************************
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779] *****************************************
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[832, 624], video_length=129, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=1, ring_degree=2)
2024-12-11 14:03:26.830 | INFO | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-11 14:03:26 [parallel_state.py:179] world_size=2 rank=0 local_rank=-1 distributed_init_method=env:// backend=nccl
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[832, 624], video_length=129, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=1, ring_degree=2)
2024-12-11 14:03:26.833 | INFO | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-11 14:03:26 [parallel_state.py:179] world_size=2 rank=1 local_rank=-1 distributed_init_method=env:// backend=nccl
2024-12-11 14:03:26.848 | INFO | hyvideo.inference:from_pretrained:188 - Building model...
2024-12-11 14:03:26.848 | INFO | hyvideo.inference:from_pretrained:188 - Building model...
2024-12-11 14:03:30.197 | INFO | hyvideo.inference:load_state_dict:337 - Loading torch model ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt...
/MultiModel/HunyuanVideo/hyvideo/inference.py:338: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
2024-12-11 14:03:30.278 | INFO | hyvideo.inference:load_state_dict:337 - Loading torch model ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt...
/MultiModel/HunyuanVideo/hyvideo/inference.py:338: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
W1211 14:03:47.142000 139699883337536 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1149 closing signal SIGTERM
E1211 14:03:48.708000 139699883337536 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 1148) of binary: /opt/conda/envs/myenv/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/myenv/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-11_14:03:47
host : hunyuanvideo
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1148)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1148

@feifeibear
Contributor

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:

What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.
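
If the limit being hit is GPU memory, it may also be worth trying CPU offload. The Namespace dump above shows use_cpu_offload=False, which suggests (though I have not verified) that the sampling script accepts a --use-cpu-offload flag, e.g.:

# --use-cpu-offload is assumed from the use_cpu_offload field in the Namespace dump above
torchrun --nproc_per_node=2 sample_video.py \
    --video-size 832 624 --video-length 129 --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse --seed 42 --ulysses-degree 1 --ring-degree 2 \
    --use-cpu-offload --save-path ./results

Keep in mind that offloading shifts memory pressure onto host RAM, so it will not help if the SIGKILL is coming from the host OOM killer.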

@jovijovi

jovijovi commented Dec 11, 2024

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:

What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.

8 * NVIDIA L20 (48GB)
Is that enough to run this model? Thanks!

@sunshinewhy
Author

sunshinewhy commented Dec 11, 2024

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:
What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.

8 * NVIDIA L20 (48GB). Is that enough to run this model? Thanks!

I use 2 * A800 GPUs with 80GB of VRAM each to run the model; I am not using 8 NVIDIA GPUs.

@sunshinewhy
Author

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:

What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.

I allocated 50GB of memory to run the model; do you think that is enough? How much does it need? Thanks!

@feifeibear
Contributor

I guess that's enough, but I am not sure. You can give it a try. An A100 with 80GB is definitely enough.
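
If the process still gets killed, it may help to watch host RAM and GPU memory while the checkpoint loads; an exit code of -9 often means the kernel OOM killer terminated the process because host RAM ran out, rather than GPU VRAM. For example, in separate terminals:

watch -n 1 free -h        # host RAM usage
watch -n 1 nvidia-smi     # GPU memory usage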

@Meleuo

Meleuo commented Dec 20, 2024

I encountered the same problem. I have 8 Tesla T4s; here is my output.
The environment uses the hunyuanvideo/hunyuanvideo:cuda_12 image.

W1220 12:53:28.841000 140313275897664 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1220 12:53:28.841000 140313275897664 torch/distributed/run.py:779] *****************************************
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
E1220 12:53:28.955000 140313275897664 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 9011) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sample_video_parallel.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 9012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 9013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 9014)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 9015)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 9016)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 9017)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 9018)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(base) root@localhost:/data/HunyuanVideo# nvidia-smi 
Fri Dec 20 12:59:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:1A:00.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:1D:00.0 Off |                    0 |
| N/A   30C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:1F:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       Off | 00000000:20:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla T4                       Off | 00000000:21:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla T4                       Off | 00000000:22:00.0 Off |                    0 |
| N/A   30C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla T4                       Off | 00000000:23:00.0 Off |                    0 |
| N/A   31C    P8              10W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla T4                       Off | 00000000:24:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@feifeibear
Contributor

/data/HunyuanVideo/sample_video_parallel.py

I am afraid your error is different: the problem is that /data/HunyuanVideo/sample_video_parallel.py does not exist.
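
To confirm, you can list which sample scripts your checkout actually contains and update it if the parallel one is missing (assuming the repository root is /data/HunyuanVideo):

cd /data/HunyuanVideo
ls sample_video*.py    # check whether sample_video_parallel.py is present
git pull               # update the checkout if it is not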
