Using the command line for parallel inference returns errors #102

Closed
sunshinewhy opened this issue Dec 10, 2024 · 9 comments

Comments

@sunshinewhy

(myenv) root@hunyuanvideo:/HunyuanVideo# torchrun --nproc_per_node=4 sample_video_parallel.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "A cat walks on the grass, realistic style." --flow-reverse --seed 42 --ulysses_degree 1 --ring_degree 4 --save-path ./results

W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779]
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] *****************************************
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1210 17:54:08.771000 140392803211072 torch/distributed/run.py:779] *****************************************
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/envs/myenv/bin/python: can't open file '/HunyuanVideoWeb/sample_video_parallel.py': [Errno 2] No such file or directory
E1210 17:54:08.883000 140392803211072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 1418) of binary: /opt/conda/envs/myenv/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/myenv/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video_parallel.py FAILED

Failures:
[1]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1419)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 1420)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 3 (local_rank: 3)
exitcode : 2 (pid: 1421)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-10_17:54:08
host : hunyuanvideo
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 1418)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@feifeibear
Contributor

Make sure the file exists at the following path:

/HunyuanVideo/sample_video_parallel.py
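
A quick way to check, assuming the repository was cloned to /HunyuanVideo (the path torchrun is resolving against):

ls -l /HunyuanVideo/sample_video_parallel.py   # confirm the script is actually there
cd /HunyuanVideo && git pull                   # if it is missing, update the checkout; the parallel sample script may only exist in newer versions of the repo (assumption)

Note that torchrun resolves a relative script name against the current working directory, so the command must be launched from the directory that contains the script.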

@sunshinewhy
Author

I ran the latest scripts, but the error still exists.
(myenv) root@hunyuanvideoweb-54f4564c5-2q9xq:/MultiModel/HunyuanVideoWeb# torchrun --nproc_per_node=2 sample_video.py \
    --video-size 832 624 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 1 \
    --ring-degree 2 \
    --save-path ./results
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779]
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779] *****************************************
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1211 14:03:24.180000 139699883337536 torch/distributed/run.py:779] *****************************************
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[832, 624], video_length=129, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=1, ring_degree=2)
2024-12-11 14:03:26.830 | INFO | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-11 14:03:26 [parallel_state.py:179] world_size=2 rank=0 local_rank=-1 distributed_init_method=env:// backend=nccl
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[832, 624], video_length=129, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=1, ring_degree=2)
2024-12-11 14:03:26.833 | INFO | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-11 14:03:26 [parallel_state.py:179] world_size=2 rank=1 local_rank=-1 distributed_init_method=env:// backend=nccl
2024-12-11 14:03:26.848 | INFO | hyvideo.inference:from_pretrained:188 - Building model...
2024-12-11 14:03:26.848 | INFO | hyvideo.inference:from_pretrained:188 - Building model...
2024-12-11 14:03:30.197 | INFO | hyvideo.inference:load_state_dict:337 - Loading torch model ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt...
/MultiModel/HunyuanVideo/hyvideo/inference.py:338: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
2024-12-11 14:03:30.278 | INFO | hyvideo.inference:load_state_dict:337 - Loading torch model ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt...
/MultiModel/HunyuanVideo/hyvideo/inference.py:338: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
W1211 14:03:47.142000 139699883337536 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1149 closing signal SIGTERM
E1211 14:03:48.708000 139699883337536 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 1148) of binary: /opt/conda/envs/myenv/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/myenv/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-11_14:03:47
host : hunyuanvideo
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1148)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1148

@feifeibear
Contributor

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:

What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.
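
If the limit being hit is GPU memory, it may also be worth trying CPU offload. The Namespace dump above shows use_cpu_offload=False, which suggests (though I have not verified) that the sampling script accepts a --use-cpu-offload flag, e.g.:

# --use-cpu-offload is assumed from the use_cpu_offload field in the Namespace dump above
torchrun --nproc_per_node=2 sample_video.py \
    --video-size 832 624 --video-length 129 --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse --seed 42 --ulysses-degree 1 --ring-degree 2 \
    --use-cpu-offload --save-path ./results

Keep in mind that offloading shifts memory pressure onto host RAM, so it will not help if the SIGKILL is coming from the host OOM killer.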

@jovijovi

jovijovi commented Dec 11, 2024

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:

What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.

8 * NVIDIA L20 (48GB)
Is that enough to run this model? Thanks!

@sunshinewhy
Author

sunshinewhy commented Dec 11, 2024

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:
What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.

8 * NVIDIA L20 (48GB). Is that enough to run this model? Thanks!

I use 2 * A800 GPUs with 80GB of VRAM each to run the model; I am not using 8 NVIDIA GPUs.

@sunshinewhy
Author

The error message indicates that the process was terminated with a SIGKILL signal (exit code -9), which typically means the process was forcefully killed, likely due to running out of memory or exceeding some resource limits. Here are some steps you can take to troubleshoot and resolve the issue:

What type of GPU are you using? An H100? Make sure you have 80GB of VRAM.

I allocated 50GB of memory to run the model; do you think that is enough? How much does it need? Thanks!

@feifeibear
Contributor

I guess that's enough, but I am not sure. You can give it a try. An A100 with 80GB is definitely enough.
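
If the process still gets killed, it may help to watch host RAM and GPU memory while the checkpoint loads; an exit code of -9 often means the kernel OOM killer terminated the process because host RAM ran out, rather than GPU VRAM. For example, in separate terminals:

watch -n 1 free -h        # host RAM usage
watch -n 1 nvidia-smi     # GPU memory usage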

@Meleuo

Meleuo commented Dec 20, 2024

I encountered the same problem. I have 8 Tesla T4s; here is my output.
The environment uses the hunyuanvideo/hunyuanvideo:cuda_12 image.

W1220 12:53:28.841000 140313275897664 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1220 12:53:28.841000 140313275897664 torch/distributed/run.py:779] *****************************************
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
/opt/conda/bin/python: can't open file '/data/HunyuanVideo/sample_video_parallel.py': [Errno 2] No such file or directory
E1220 12:53:28.955000 140313275897664 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 9011) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sample_video_parallel.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 9012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 9013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 9014)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 9015)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 9016)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 9017)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 9018)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-20_12:53:28
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(base) root@localhost:/data/HunyuanVideo# nvidia-smi 
Fri Dec 20 12:59:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:1A:00.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:1D:00.0 Off |                    0 |
| N/A   30C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:1F:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       Off | 00000000:20:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla T4                       Off | 00000000:21:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla T4                       Off | 00000000:22:00.0 Off |                    0 |
| N/A   30C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla T4                       Off | 00000000:23:00.0 Off |                    0 |
| N/A   31C    P8              10W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla T4                       Off | 00000000:24:00.0 Off |                    0 |
| N/A   29C    P8               9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@feifeibear
Contributor

/data/HunyuanVideo/sample_video_parallel.py

I am afraid your error is different: the problem is that /data/HunyuanVideo/sample_video_parallel.py does not exist.
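
To confirm, you can list which sample scripts your checkout actually contains and update it if the parallel one is missing (assuming the repository root is /data/HunyuanVideo):

cd /data/HunyuanVideo
ls sample_video*.py    # check whether sample_video_parallel.py is present
git pull               # update the checkout if it is not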
