Distributed Training Template #6
Conversation
Left some comments on some of the file changes. I'm going to add a few additional comments outlining the errors I ran into. The example does not seem to run. Were you able to run it successfully?
Debug report

It seems like something might not be working. This is the command I ran to test the example:

uv run python -m llm.finetune_distributed.launch compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-5,1.5e-5 --multirun

I got the expected output from Hydra:

[2025-10-23 15:10:30,158][HYDRA] Submitit 'slurm' sweep output dir : /scratch/shawnc/vec_jobs/20251023-151029
[2025-10-23 15:10:30,161][HYDRA] #0 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-05
[2025-10-23 15:10:30,166][HYDRA] #1 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1.5e-05

Logs:

I got the expected log directory structure. First, the Hydra log (llm_finetune_distributed.log) was receiving logs from every rank, so a lot of the logs were duplicated (e.g. "Starting Training Loop"). Like with the MLP DDP example, you should log from rank 0 and print from the other ranks (see the sketch after the logs below). In the submitit logs I found some errors. On run 0, rank 0 gave the following error:

srun: error: kn110: task 1: Bus error (core dumped)

The other rank seemed to freeze; it never finished loading the shards. The whole job hung after that and I had to force-cancel it after about 30 minutes. For run 1 I got a different error:

# <job-id>_1_0_log.out
kn010:3006221:3006221 [0] NCCL INFO Bootstrap : Using ib0:10.0.1.10<0>
kn010:3006221:3006221 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
kn010:3006221:3006221 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.20.5+cuda12.4
kn010:3006221:3006715 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ib0:10.0.1.10<0>
kn010:3006221:3006715 [0] NCCL INFO Using non-device net plugin version 0
kn010:3006221:3006715 [0] NCCL INFO Using network IB
kn010:3006221:3006715 [0] NCCL INFO comm 0xbaa53f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xc34d9e4097b25c6c - Init START
kn010:3006221:3006715 [0] NCCL INFO Setting affinity for GPU 0 to 021f
kn010:3006221:3006715 [0] NCCL INFO comm 0xbaa53f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
kn010:3006221:3006715 [0] NCCL INFO Channel 00/02 : 0 1
kn010:3006221:3006715 [0] NCCL INFO Channel 01/02 : 0 1
kn010:3006221:3006715 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
kn010:3006221:3006715 [0] NCCL INFO P2P Chunksize set to 131072
kn010:3006221:3006715 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
kn010:3006221:3006715 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
kn010:3006221:3006715 [0] NCCL INFO Connected all rings
kn010:3006221:3006715 [0] NCCL INFO Connected all trees
kn010:3006221:3006715 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
kn010:3006221:3006715 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
kn010:3006221:3006715 [0] NCCL INFO comm 0xbaa53f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xc34d9e4097b25c6c - Init COMPLETE
submitit INFO (2025-10-23 15:34:59,608) - Job completed successfully
[2025-10-23 15:34:59,608][submitit][INFO] - Job completed successfully
submitit INFO (2025-10-23 15:34:59,624) - Exiting after successful completion
[2025-10-23 15:34:59,624][submitit][INFO] - Exiting after successful completion
kn010:3006221:3006720 [0] NCCL INFO [Service thread] Connection closed by localRank 0
kn010:3006221:3006802 [0] NCCL INFO comm 0xbaa53f0 rank 0 nranks 2 cudaDev 0 busId 17000 - Abort COMPLETE

# <job-id>_1_0_log.err
Fetching 2 files: 100%|██████████| 2/2 [01:01<00:00, 30.72s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 8.81it/s]
Map: 100%|██████████| 36718/36718 [00:01<00:00, 19053.97 examples/s]
Map: 100%|██████████| 3760/3760 [00:00<00:00, 10836.27 examples/s]
Map: 100%|██████████| 36718/36718 [00:02<00:00, 12446.24 examples/s]
Map: 100%|██████████| 3760/3760 [00:00<00:00, 11895.54 examples/s]
/scratch/shawnc/Code/vec-playbook/templates/src/llm/finetune_distributed/train.py:251: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = Trainer(
Using auto half precision backend
libibverbs: Warning: couldn't load driver 'libvmw_pvrdma-rdmav34.so': libvmw_pvrdma-rdmav34.so: cannot open shared object file: No such file or directory
/scratch/shawnc/Code/vec-playbook/.venv/lib/python3.12/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in GPTNeoXForCausalLM because mixed precision turned on in FSDP. Affects: gpt_neox.embed_in.weight, gpt_neox.final_layer_norm.weight, gpt_neox.final_layer_norm.bias, embed_out.weight.
warnings.warn(
/scratch/shawnc/Code/vec-playbook/.venv/lib/python3.12/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in GPTNeoXLayer because mixed precision turned on in FSDP. Affects: input_layernorm.weight, input_layernorm.bias, post_attention_layernorm.weight, post_attention_layernorm.bias, attention.query_key_value.weight, attention.query_key_value.bias, attention.dense.weight, attention.dense.bias, mlp.dense_h_to_4h.weight, mlp.dense_h_to_4h.bias, mlp.dense_4h_to_h.weight, mlp.dense_4h_to_h.bias.
warnings.warn(
/scratch/shawnc/Code/vec-playbook/.venv/lib/python3.12/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
***** Running training *****
Num examples = 4,685
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 4
Total optimization steps = 585
Number of trainable parameters = 3,428,651,008
0%| | 0/585 [00:11<?, ?it/s]
[rank0]:[W1023 15:35:00.954027223 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Summary

So run 0 definitely crashed, but it's unclear to me whether run 1 ran successfully or not. Although it seemed like at least one of the ranks performed some training, there were no checkpoints. Furthermore, I never saw any of the training-complete logs, metric logs, or evaluation logs that I would have expected based on reading the code. We should make sure this example works properly on both clusters before merging.
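As a concrete version of the rank-0 logging suggestion above, here is a minimal sketch (the helper and logger names are assumptions, not the template's actual code) of logging from rank 0 and printing from the other ranks:

```python
# Minimal sketch of rank-0-only logging in a torch.distributed job.
# Assumes the process group is already initialized by the launcher;
# the logger name below is illustrative, not the template's actual logger.
import logging
import torch.distributed as dist

logger = logging.getLogger("llm.finetune_distributed.train")


def log_rank0(msg: str) -> None:
    """Log via the logger on rank 0; plain-print on every other rank."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        logger.info(msg)
    else:
        print(f"[rank {rank}] {msg}")


log_rank0("Starting Training Loop")  # only rank 0 writes to the Hydra log file
```

With something like this, the Hydra log file gets one copy of each message from rank 0, while the other ranks' messages still show up in their submitit stdout logs.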
(force-pushed from 4b8f63d to 32b9318)
Just tested the example using 4 L40S GPUs; there still seem to be some issues. Similar to what I documented before, I still get the core-dumped error on run 0 when trying to train with an LR of 1e-5. Run 1 seems to have run successfully, but there are still a lot of concerning warnings in the submitit logs. Even though one of the two runs is working, there are a few things I think could be cleaned up.

Suggested fixes:
Also, run 1 finished after 30 steps. The config seems to set the number of training epochs to 1, but the job finished after 0.1 epochs. Can you explain what's going on here? Is this intentional, or did it crash?
OK, I ran the same command:

uv run python -m llm.finetune_distributed.launch compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-5,1.5e-5 --multirun

Everything seems to be working now. Things are training as expected and I'm seeing all the logs that you've cleaned up. I did notice, however, that perhaps the trainer configuration is not being passed properly to the trainer? The output I get from Hydra is:

[2025-10-29 12:52:12,512][HYDRA] #0 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1e-05
[2025-10-29 12:52:12,517][HYDRA] #1 : compute=killarney/l40s_2x requeue=off trainer.train.learning_rate=1.5e-05

So I can see the learning-rate override in the sweep, but in the training logs I get:

[2025-10-29 12:53:08,382][llm.finetune_distributed.train][INFO] - Step 1: {'loss': 2.9613, 'grad_norm': 3.9578468799591064, 'learning_rate': 0.0, 'epoch': 0.013651877133105802}
{'loss': 2.9613, 'grad_norm': 3.9578468799591064, 'learning_rate': 0.0, 'epoch': 0.01}
[2025-10-29 12:53:11,184][llm.finetune_distributed.train][INFO] - Step 2: {'loss': 2.9176, 'grad_norm': 4.3210296630859375, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.027303754266211604}

It seems like the initial LR is 0? And then, oddly, the LR increases throughout training? Is this standard for finetuning? I thought the LR is usually decayed.
Thanks for double-checking! The LR trace you're seeing is just HF's warmup + cosine schedule doing its job: the LR ramps linearly from 0 over the warmup steps and only then follows the cosine decay, so a first logged value of 0 and early values that increase are expected.
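For reference, here is a small standalone sketch of that behaviour (the warmup and total step counts are assumptions for illustration, not the template's actual config):

```python
# Sketch of HF's warmup + cosine schedule: the LR starts at 0, ramps up during
# warmup, then decays along a cosine curve. Step counts here are illustrative.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=60, num_training_steps=585
)

for step in range(1, 6):
    optimizer.step()      # the first step runs with LR 0, like the Step 1 log
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])  # climbs toward 1e-5 during warmup
```

With warmup like this, the first logged learning_rate is 0 and the early values are tiny and increasing, which matches the Step 1 and Step 2 lines above; once warmup ends, the cosine portion takes over and the LR decays.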
Thanks for putting up with all my questions and requests for comments and docs here and there. LGTM!
Adds a compute config (bon_echo/a40_4x.yaml) for running jobs on 4xA40 GPU nodes, including resource and partition settings.