Distributed Training Template #6
Conversation
Left some comments on some of the file changes. I'm going to add a few additional comments outlining the errors I ran into. The example does not seem to run. Were you able to run it successfully?
Debug report

It seems like something might not be working. This is the command I ran to test the example:

uv run python -m llm.finetune_distributed.launch compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-5,1.5e-5 --multirun

I got the expected output from Hydra:

[2025-10-23 15:10:30,158][HYDRA] Submitit 'slurm' sweep output dir : /scratch/shawnc/vec_jobs/20251023-151029
[2025-10-23 15:10:30,161][HYDRA] #0 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-05
[2025-10-23 15:10:30,166][HYDRA] #1 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1.5e-05

Logs:

I got the expected log directory structure. First, the Hydra log (llm_finetune_distributed.log) was receiving logs from every rank, so a lot of the logs were duplicated (e.g. "Starting Training Loop"). Like with the MLP DDP example, you should log from rank 0 and print from the other ranks (see the sketch after the logs below). In the submitit logs I found some errors. On run 0, rank 0 gave the following error:

srun: error: kn110: task 1: Bus error (core dumped)

The other rank seemed to freeze; it never finished loading the shards. The whole job hung after that and I had to force-cancel it after about 30 minutes. For run 1 I got a different error:

# <job-id>_1_0_log.out
kn010:3006221:3006221 [0] NCCL INFO Bootstrap : Using ib0:10.0.1.10<0>
kn010:3006221:3006221 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
kn010:3006221:3006221 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.20.5+cuda12.4
kn010:3006221:3006715 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ib0:10.0.1.10<0>
kn010:3006221:3006715 [0] NCCL INFO Using non-device net plugin version 0
kn010:3006221:3006715 [0] NCCL INFO Using network IB
kn010:3006221:3006715 [0] NCCL INFO comm 0xbaa53f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xc34d9e4097b25c6c - Init START
kn010:3006221:3006715 [0] NCCL INFO Setting affinity for GPU 0 to 021f
kn010:3006221:3006715 [0] NCCL INFO comm 0xbaa53f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
kn010:3006221:3006715 [0] NCCL INFO Channel 00/02 : 0 1
kn010:3006221:3006715 [0] NCCL INFO Channel 01/02 : 0 1
kn010:3006221:3006715 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
kn010:3006221:3006715 [0] NCCL INFO P2P Chunksize set to 131072
kn010:3006221:3006715 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
kn010:3006221:3006715 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
kn010:3006221:3006715 [0] NCCL INFO Connected all rings
kn010:3006221:3006715 [0] NCCL INFO Connected all trees
kn010:3006221:3006715 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
kn010:3006221:3006715 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
kn010:3006221:3006715 [0] NCCL INFO comm 0xbaa53f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xc34d9e4097b25c6c - Init COMPLETE
submitit INFO (2025-10-23 15:34:59,608) - Job completed successfully
[2025-10-23 15:34:59,608][submitit][INFO] - Job completed successfully
submitit INFO (2025-10-23 15:34:59,624) - Exiting after successful completion
[2025-10-23 15:34:59,624][submitit][INFO] - Exiting after successful completion
kn010:3006221:3006720 [0] NCCL INFO [Service thread] Connection closed by localRank 0
kn010:3006221:3006802 [0] NCCL INFO comm 0xbaa53f0 rank 0 nranks 2 cudaDev 0 busId 17000 - Abort COMPLETE

# <job-id>_1_0_log.err
Fetching 2 files: 100%|██████████| 2/2 [01:01<00:00, 30.72s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 8.81it/s]
Map: 100%|██████████| 36718/36718 [00:01<00:00, 19053.97 examples/s]
Map: 100%|██████████| 3760/3760 [00:00<00:00, 10836.27 examples/s]
Map: 100%|██████████| 36718/36718 [00:02<00:00, 12446.24 examples/s]
Map: 100%|██████████| 3760/3760 [00:00<00:00, 11895.54 examples/s]
/scratch/shawnc/Code/vec-playbook/templates/src/llm/finetune_distributed/train.py:251: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = Trainer(
Using auto half precision backend
libibverbs: Warning: couldn't load driver 'libvmw_pvrdma-rdmav34.so': libvmw_pvrdma-rdmav34.so: cannot open shared object file: No such file or directory
/scratch/shawnc/Code/vec-playbook/.venv/lib/python3.12/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in GPTNeoXForCausalLM because mixed precision turned on in FSDP. Affects: gpt_neox.embed_in.weight, gpt_neox.final_layer_norm.weight, gpt_neox.final_layer_norm.bias, embed_out.weight.
warnings.warn(
/scratch/shawnc/Code/vec-playbook/.venv/lib/python3.12/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in GPTNeoXLayer because mixed precision turned on in FSDP. Affects: input_layernorm.weight, input_layernorm.bias, post_attention_layernorm.weight, post_attention_layernorm.bias, attention.query_key_value.weight, attention.query_key_value.bias, attention.dense.weight, attention.dense.bias, mlp.dense_h_to_4h.weight, mlp.dense_h_to_4h.bias, mlp.dense_4h_to_h.weight, mlp.dense_4h_to_h.bias.
warnings.warn(
/scratch/shawnc/Code/vec-playbook/.venv/lib/python3.12/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
***** Running training *****
Num examples = 4,685
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 4
Total optimization steps = 585
Number of trainable parameters = 3,428,651,008
0%| | 0/585 [00:11<?, ?it/s]
[rank0]:[W1023 15:35:00.954027223 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Summary

So run 0 definitely crashed, but it's unclear to me whether run 1 ran successfully or not. Although it seemed like at least one of the ranks performed some training, there were no checkpoints. Furthermore, I never saw any of the training-complete logs, metric logs, or evaluation logs that I would have expected based on reading the code. We should make sure this example works properly on both clusters before merging.
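As a concrete version of the rank-0 logging suggestion above, here is a minimal sketch (the helper and logger names are assumptions, not the template's actual code) of logging from rank 0 and printing from the other ranks:

```python
# Minimal sketch of rank-0-only logging in a torch.distributed job.
# Assumes the process group is already initialized by the launcher;
# the logger name below is illustrative, not the template's actual logger.
import logging
import torch.distributed as dist

logger = logging.getLogger("llm.finetune_distributed.train")


def log_rank0(msg: str) -> None:
    """Log via the logger on rank 0; plain-print on every other rank."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        logger.info(msg)
    else:
        print(f"[rank {rank}] {msg}")


log_rank0("Starting Training Loop")  # only rank 0 writes to the Hydra log file
```

With something like this, the Hydra log file gets one copy of each message from rank 0, while the other ranks' messages still show up in their submitit stdout logs.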
(force-pushed from 4b8f63d to 32b9318)
Just tested the example using 4 L40S GPUs; there still seem to be some issues. Similar to what I documented before, I still get the core-dumped error on run 0 when trying to train with an LR of 1e-5. Run 1 seems to have run successfully, but there are still a lot of concerning warnings in the submitit logs. Even though one of the two runs is working, there are a few things I think could be cleaned up.

Suggested fixes:
Also, run 1 finished after 30 steps. The config seems to set the number of training epochs to 1, but the job finished after 0.1 epochs. Can you explain what's going on here? Is this intentional, or did it crash?
OK, I ran the same command:

uv run python -m llm.finetune_distributed.launch compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-5,1.5e-5 --multirun

Everything seems to be working now. Things are training as expected and I'm seeing all the logs that you've cleaned up. I did notice, however, that perhaps the trainer configuration is not being passed properly to the trainer? The output I get from Hydra is:

[2025-10-29 12:52:12,512][HYDRA] #0 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-05
[2025-10-29 12:52:12,517][HYDRA] #1 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1.5e-05

So I can see the learning-rate override in the sweep, but in the training logs I get:

[2025-10-29 12:53:08,382][llm.finetune_distributed.train][INFO] - Step 1: {'loss': 2.9613, 'grad_norm': 3.9578468799591064, 'learning_rate': 0.0, 'epoch': 0.013651877133105802}
{'loss': 2.9613, 'grad_norm': 3.9578468799591064, 'learning_rate': 0.0, 'epoch': 0.01}
[2025-10-29 12:53:11,184][llm.finetune_distributed.train][INFO] - Step 2: {'loss': 2.9176, 'grad_norm': 4.3210296630859375, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.027303754266211604}

It seems like the initial LR is 0? And then, oddly, the LR increases throughout training? Is this standard for finetuning? I thought the LR is usually decayed.
Thanks for double-checking! The LR trace you're seeing is just HF's warmup + cosine schedule doing its job: the LR ramps linearly from 0 over the warmup steps and only then follows the cosine decay, so a first logged value of 0 and early values that increase are expected.
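For reference, here is a small standalone sketch of that behaviour (the warmup and total step counts are assumptions for illustration, not the template's actual config):

```python
# Sketch of HF's warmup + cosine schedule: the LR starts at 0, ramps up during
# warmup, then decays along a cosine curve. Step counts here are illustrative.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=60, num_training_steps=585
)

for step in range(1, 6):
    optimizer.step()      # the first step runs with LR 0, like the Step 1 log
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])  # climbs toward 1e-5 during warmup
```

With warmup like this, the first logged learning_rate is 0 and the early values are tiny and increasing, which matches the Step 1 and Step 2 lines above; once warmup ends, the cosine portion takes over and the LR decays.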
Thanks for putting up with all my questions and requests for comments and docs here and there. LGTM!
Adds a compute config (bon_echo/a40_4x.yaml) for running jobs on 4xA40 GPU nodes, including resource and partition settings.