An error occurred during training using Protenix Docker 1.0.0.4 #321

@gs-ren

Dear Protenix Development Team, I ran training on the official dataset using the Docker image ai4s-share-public-cn-beijing.cr.volces.com/release/protenix:1.0.0.4, and this error is reproduced on every run.

My training environment is Driver Version: 560.35.03, CUDA Version: 12.6, and the GPUs are H100.

The training command is:

torchrun --nproc_per_node 8 \
        --master_port=29506 \
        ./runner/train.py \
        --run_name protenix_train \
        --model_name "protenix_base_default_v1.0.0" \
        --seed 42 \
        --base_dir ./output/ \
        --dtype bf16 \
        --project protenix \
        --use_wandb false \
        --diffusion_batch_size 16 \
        --iters_to_accumulate 2 \
        --eval_interval 400 \
        --log_interval 50 \
        --checkpoint_interval 2000 \
        --ema_decay 0.999 \
        --train_crop_size 768 \
        --max_steps 100000 \
        --warmup_steps 1000 \
        --lr 0.0005 \
        --model.N_cycle 4 \
        --sample_diffusion.N_step 20 \
        --triangle_attention "cuequivariance" \
        --triangle_multiplicative "cuequivariance" \
        --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
        --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925 \
        --data.posebusters_0925.base_info.max_n_token 768 \
        --loss.weight.smooth_lddt 0.0 \
        --loss.weight.alpha_bond 1.0 

The error output is shown in the log below. How should I troubleshoot the cause of this error? Thank you!
@zhangyuxuann

2026-06-03 02:45:41,755 [/app/code/./runner/train.py:468] INFO root: Step 3599, eval posebusters_0925: {'posebusters_0925/ema0.999_resolved_loss.avg': np.float64(0.23424069203791165), 'posebusters_0925/ema0.9
99_weighted_pde_loss.avg': np.float64(0.00033150770981419124), 'posebusters_0925/ema0.999_pde_loss.avg': np.float64(3.315077184495472), 'posebusters_0925/ema0.999_loss.avg': np.float64(517.6821294239589), 'po
sebusters_0925/ema0.999_weighted_smooth_lddt_loss.avg': np.float64(0.0), 'posebusters_0925/ema0.999_pae_loss.avg': np.float64(4.15888345128014), 'posebusters_0925/ema0.999_weighted_resolved_loss.avg': np.floa
t64(2.342406854343911e-05), 'posebusters_0925/ema0.999_lddt/complex/gpde.rank1.avg': np.float64(0.3719116247835613), 'posebusters_0925/ema0.999_lddt/complex/mean.avg': np.float64(0.36325206678538097), 'posebu
sters_0925/ema0.999_smooth_lddt_loss.avg': np.float64(0.6211075533004035), 'posebusters_0925/ema0.999_lddt/complex/plddt.rank1.avg': np.float64(0.3640599228086926), 'posebusters_0925/ema0.999_lddt/complex/wor
st.avg': np.float64(0.34535366679940904), 'posebusters_0925/ema0.999_weighted_bond_loss.avg': np.float64(2.3117294575486866), 'posebusters_0925/ema0.999_plddt_loss.avg': np.float64(3.565737787882487), 'posebu
sters_0925/ema0.999_weighted_mse_loss.avg': np.float64(515.3696882702055), 'posebusters_0925/ema0.999_lddt/complex/median.avg': np.float64(0.3634037167543457), 'posebusters_0925/ema0.999_weighted_pae_loss.avg
': np.float64(0.0), 'posebusters_0925/ema0.999_lddt/complex/best.avg': np.float64(0.38063443906250455), 'posebusters_0925/ema0.999_lddt/complex/random.avg': np.float64(0.36405437333243235), 'posebusters_0925/
ema0.999_lddt/complex/ranking_score.rank1.avg': np.float64(0.3638552948832512), 'posebusters_0925/ema0.999_mse_loss.avg': np.float64(128.84242206755138), 'posebusters_0925/ema0.999_weighted_plddt_loss.avg': n
p.float64(0.0003565737704602292), 'posebusters_0925/ema0.999_bond_loss.avg': np.float64(0.5779323643871717)}
[step 3599: 799/800] : 100%|________________________________________________________________________________________________________________________________________________| 799/800 [3:55:17<00:17, 17.67s/it]
  0%|          | 0/800 [00:00<?, ?it/s]
2026-06-03 02:48:34,678 [/app/code/protenix/utils/permutation/chain_permutation/heuristic.py:298] WARNING protenix.utils.permutation.chain_permutation.heuristic: The label_full_dict contains 22 asym chains.
No atom permutation is needed. Return the identity permutation.    | 30/800 [08:13<3:31:42, 16.50s/it]
No atom permutation is needed. Return the identity permutation.
2026-06-03 03:13:08,572 [/app/code/./runner/train.py:468] INFO root: Step 3649 train metrics: {'train/resolved_loss.avg': np.float64(0.22796588468944504), 'train/mse_loss.avg': np.float64(0.26421638609841464),
 'train/weighted_bond_loss.avg': np.float64(0.08525830793194472), 'train/distogram_loss.avg': np.float64(1.1500440306030213), 'train/smooth_lddt_loss.avg': np.float64(0.4224701591581106), 'train/pae_loss.avg'
: np.float64(4.158883157646291), 'train/weighted_plddt_loss.avg': np.float64(0.0003673041741392847), 'train/weighted_pae_loss.avg': np.float64(0.0), 'train/weighted_pde_loss.avg': np.float64(0.000336800128578
5085), 'train/pde_loss.avg': np.float64(3.3680013678521514), 'train/weighted_smooth_lddt_loss.avg': np.float64(0.0), 'train/weighted_distogram_loss.avg': np.float64(0.0345013201967231), 'train/plddt_loss.avg'
: np.float64(3.673041832907381), 'train/loss.avg': np.float64(1.1773075496405363), 'train/weighted_mse_loss.avg': np.float64(1.0568655443936585), 'train/weighted_resolved_loss.avg': np.float64(2.2796587913660
08e-05), 'train/bond_loss.avg': np.float64(0.02131457698298618)}
2026-06-03 03:13:08,573 [/app/code/./runner/train.py:468] INFO root: Step 3649, learning rate: [0.0005]
No atom permutation is needed. Return the identity permutation.    | 181/800 [50:13<2:49:33, 16.44s/it]
No atom permutation is needed. Return the identity permutation.
2026-06-03 03:41:46,907 [/app/code/./runner/train.py:468] INFO root: Step 3699 train metrics: {'train/resolved_loss.avg': np.float64(0.2336538679937416), 'train/mse_loss.avg': np.float64(0.2683049121219665), '
train/weighted_bond_loss.avg': np.float64(0.10591329271905124), 'train/distogram_loss.avg': np.float64(1.156950723528862), 'train/smooth_lddt_loss.avg': np.float64(0.42555761106312273), 'train/pae_loss.avg':
np.float64(4.158883193705944), 'train/weighted_plddt_loss.avg': np.float64(0.000365002429889307), 'train/weighted_pae_loss.avg': np.float64(0.0), 'train/weighted_pde_loss.avg': np.float64(0.000338773453764356
1), 'train/pde_loss.avg': np.float64(3.3877346246166433), 'train/weighted_smooth_lddt_loss.avg': np.float64(0.0), 'train/weighted_distogram_loss.avg': np.float64(0.034708520870190114), 'train/plddt_loss.avg':
 np.float64(3.6500243902840515), 'train/loss.avg': np.float64(1.2145249742269515), 'train/weighted_mse_loss.avg': np.float64(1.073219648487866), 'train/weighted_resolved_loss.avg': np.float64(2.33653862206675
2e-05), 'train/bond_loss.avg': np.float64(0.02647832317976281)}
2026-06-03 03:41:46,907 [/app/code/./runner/train.py:468] INFO root: Step 3699, learning rate: [0.0005]
2026-06-03 03:49:59,026 [/app/code/protenix/utils/permutation/chain_permutation/heuristic.py:298] WARNING protenix.utils.permutation.chain_permutation.heuristic: The label_full_dict contains 22 asym chains.
2026-06-03 04:08:58,730 [/app/code/./runner/train.py:468] INFO root: Step 3749 train metrics: {'train/resolved_loss.avg': np.float64(0.24150056367218098), 'train/mse_loss.avg': np.float64(0.2547380602406338),
'train/weighted_bond_loss.avg': np.float64(0.08227694909088314), 'train/distogram_loss.avg': np.float64(1.1512480208277702), 'train/smooth_lddt_loss.avg': np.float64(0.4227650727331638), 'train/pae_loss.avg':
 np.float64(4.158883198007328), 'train/weighted_plddt_loss.avg': np.float64(0.0003672528546690016), 'train/weighted_pae_loss.avg': np.float64(0.0), 'train/weighted_pde_loss.avg': np.float64(0.0003353938006438
503), 'train/pde_loss.avg': np.float64(3.353938087840796), 'train/weighted_smooth_lddt_loss.avg': np.float64(0.0), 'train/weighted_distogram_loss.avg': np.float64(0.03453743982885499), 'train/plddt_loss.avg':
 np.float64(3.6725286402075414), 'train/loss.avg': np.float64(1.1364507242292166), 'train/weighted_mse_loss.avg': np.float64(1.0189522409625351), 'train/weighted_resolved_loss.avg': np.float64(2.4150055764193
882e-05), 'train/bond_loss.avg': np.float64(0.020569237272720784)}
2026-06-03 04:08:58,730 [/app/code/./runner/train.py:468] INFO root: Step 3749, learning rate: [0.0005]
[rank1]:[E603 04:19:16.432031875 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=427698, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 6
00025 milliseconds before timing out.
[rank1]:[E603 04:19:16.432467075 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 427698 PG status: last enqueued work: 427698, last complet
ed work: 427697
[rank1]:[E603 04:19:16.432547201 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_
SIZE to a non-zero value.
[rank1]:[E603 04:19:16.432676742 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank2]:[E603 04:19:16.499438391 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 2] Observed flight recorder dump signal from another rank via TCPStore.
[rank2]:[E603 04:19:16.499823711 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 2] Received a dump signal due to a collective timeout from  rank 1 and we will try our best to dump the debug in
fo. Last enqueued NCCL work: 427698, last completed NCCL work: 427697.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same f
or all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL
), etc.
[rank2]:[E603 04:19:16.500439151 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E603 04:19:16.570252694 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the
debug info. Last enqueued NCCL work: 427698, last completed NCCL work: 427697.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is no
t same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e
.g. NCCL), etc.
[rank1]:[E603 04:19:16.570719092 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank5]:[E603 04:19:16.760095770 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 5] Observed flight recorder dump signal from another rank via TCPStore.
[rank5]:[E603 04:19:16.760338325 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 5] Received a dump signal due to a collective timeout from  rank 1 and we will try our best to dump the debug in
fo. Last enqueued NCCL work: 427698, last completed NCCL work: 427698.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same f
or all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL
), etc.
[rank5]:[E603 04:19:16.760715425 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank7]:[E603 04:19:16.986300822 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 7] Observed flight recorder dump signal from another rank via TCPStore.
[rank7]:[E603 04:19:16.986616429 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 7] Received a dump signal due to a collective timeout from  rank 1 and we will try our best to dump the debug in
fo. Last enqueued NCCL work: 427698, last completed NCCL work: 427697.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same f
or all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL
), etc.
[rank7]:[E603 04:19:16.987430262 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 7] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank3]:[E603 04:19:16.119432928 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 3] Observed flight recorder dump signal from another rank via TCPStore.
[rank3]:[E603 04:19:16.119807892 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 3] Received a dump signal due to a collective timeout from  rank 1 and we will try our best to dump the debug in
fo. Last enqueued NCCL work: 427698, last completed NCCL work: 427698.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same f
or all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL
), etc.
[rank3]:[E603 04:19:16.120527487 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E603 04:19:16.122366554 ProcessGroupNCCL.cpp:684] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrup
ted/incomplete data.
[rank1]:[E603 04:19:16.122415683 ProcessGroupNCCL.cpp:698] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E603 04:19:16.127064843 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeou
t: WorkNCCL(SeqNum=427698, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fb53fb785e8 in /root/miniconda3/lib/python3.11/site-packages/torch/
lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x7fb4ea5cba6d in /root/miniconda3/lib/python3.11/site-packages/torch/lib
/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0xc80 (0x7fb4ea5cd7f0 in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb4ea5ceefd in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fb4da3efbf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7fb540e48ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fb540ed9a74 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0603 04:19:17.128000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 106 closing signal SIGTERM
W0603 04:19:17.133000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 108 closing signal SIGTERM
W0603 04:19:17.134000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 109 closing signal SIGTERM
[rank4]:[E603 04:19:17.315831399 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 4] Observed flight recorder dump signal from another rank via TCPStore.
W0603 04:19:17.135000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 110 closing signal SIGTERM
W0603 04:19:17.137000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 111 closing signal SIGTERM
W0603 04:19:17.138000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 112 closing signal SIGTERM
W0603 04:19:17.139000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 113 closing signal SIGTERM
E0603 04:19:17.163000 104 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 1 (pid: 107) of binary: /root/miniconda3/bin/python3.11
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
./runner/train.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-06-03_04:19:17
  host      : d69c7cf49a03
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 107)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 107
====================================================
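
The watchdog output above says the stack trace of the failed collective was not found because FlightRecorder is disabled, and suggests setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value. A minimal sketch of environment settings that could be exported before rerunning the same torchrun command is shown here. It assumes the PyTorch and NCCL builds in the image honor these variables; the buffer size of 2000 is an illustrative value, not one taken from this report.

```shell
# Debug settings for the next run. All values are illustrative assumptions.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000   # non-zero enables FlightRecorder, as the log suggests
export TORCH_NCCL_DUMP_ON_TIMEOUT=1        # ask PyTorch to dump recorded collectives on a watchdog timeout
export NCCL_DEBUG=INFO                     # NCCL's own logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL      # extra torch.distributed checks; slows training

# Print the variables torchrun will inherit, then relaunch the training command.
env | grep -E '^(TORCH_NCCL_|NCCL_DEBUG|TORCH_DISTRIBUTED_DEBUG)' | sort
```

The dump-signal lines above already show ranks 3 and 5 reporting last completed NCCL work 427698 while ranks 1, 2 and 7 report 427697, so a per-rank trace may help show which rank never reached the 1-element ALLREDUCE.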
