
QLORA + FSDP distributed fine-tuning failed at the end during model saving stage #31675

Closed
Neo9061 opened this issue Jun 27, 2024 · 6 comments

Neo9061 commented Jun 27, 2024

System Info

Following Philipp Schmid's blog post to run FSDP + QLoRA fine-tuning on SageMaker.

Training completes, but the job fails at the very last step, model saving (this line), after loading the base model and merging it with the adapter have already finished.

The error is the following.

Saving the newly created merged model to /opt/ml/model
--
[E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800900 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800889 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800903 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800906 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800889 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::runtime_errorstd::runtime_error'
'  what():    what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800889 milliseconds before timing out.[Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'  what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800900 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800900 milliseconds before timing out.

I checked the memory usage and it does not appear to be an OOM. I wonder whether the model merging and saving step took so long that the process group timed out?
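For context, this is roughly how I understand the 30-minute default could be raised (just a sketch, not taken from the blog post's script; the right entry point depends on how the training script sets up its process group):

```python
# Sketch only: two common ways to raise the default 1800 s NCCL timeout so a
# long merge/save on rank 0 does not trip the watchdog on the other ranks.
from datetime import timedelta

# Option 1: when the script builds its own Accelerator.
from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)

# Option 2: when the script goes through transformers.TrainingArguments,
# ddp_timeout (in seconds) is forwarded to init_process_group.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/opt/ml/model",  # path taken from the log above
    ddp_timeout=7200,            # 2 hours instead of the default 1800 s
)
```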

Who can help?

@ArthurZucker @philschmid @muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See the description above.

Expected behavior

An error-free training run, including the final model merge and save.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ruian1 commented Jul 29, 2024

I just hit the same error; it happened when saving weights at the end of an epoch. Any suggestions?

ArthurZucker (Collaborator) commented

Hey! Could you give a bit more detail about the transformers and accelerate versions you are using? Running transformers-cli env should output that.

ruian1 commented Aug 12, 2024

@ArthurZucker thanks for taking a look! I was trying to fine-tune Idefics2, and I tried every transformers version that supports Idefics2; none of them worked, even though the training loss looks fine. I found the discussion here, but both (1) starting fresh training runs and (2) setting the timeout to a much larger value failed (for (2) I let it run for ~12 hours and it still did not reach epoch 2).
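For reference, the workaround I am planning to try next (a rough sketch with hypothetical paths, not something from the official example): keep only the adapter save inside the distributed job, then do the merge in a separate single-process step after the job exits, so no rank is left waiting in a collective while the merge runs.

```python
# Sketch of a single-process merge step, run after the distributed training
# job has exited. Paths are hypothetical placeholders.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "/opt/ml/model/adapter"  # where the training job saved the LoRA adapter
output_dir = "/opt/ml/model/merged"    # where the merged full model should land

# Load base model + adapter on a single process, fold the adapter weights in,
# and save the merged model with no collective communication involved.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
merged = model.merge_and_unload()
merged.save_pretrained(output_dir, safe_serialization=True)

AutoTokenizer.from_pretrained(adapter_dir).save_pretrained(output_dir)
```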

- `transformers` version: 4.44.0.dev0
- Platform: Linux-5.10.0-31-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.24.2
- Safetensors version: 0.4.3
- Accelerate version: 0.33.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100-SXM4-80GB

ArthurZucker (Collaborator) commented

I think this was fixed in the 4.44.1 patch!
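A quick way to double-check that the installed build actually includes the patch:

```python
# Minimal sanity check that transformers >= 4.44.1 is installed.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.44.1"), (
    f"found {transformers.__version__}, please upgrade"
)
```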


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
