Commit ca0ed7c

removed distributed training detail from llm readme.
1 parent 0a79a93 commit ca0ed7c

1 file changed (+1 −19 lines)


templates/src/llm/README.md

Lines changed: 1 addition & 19 deletions
````diff
@@ -3,22 +3,4 @@
 This directory includes templates for language-model workloads:
 
 - [text_classification](text_classification/): fine-tunes a small LLM on AG News via Hugging Face Trainer.
-- [finetune_distributed](finetune_distributed/): distributed finetuning example adapted from VectorLM (https://github.com/VectorInstitute/vectorlm).
-
-## Finetune Distributed (DDP/FSDP)
-
-Run the distributed template:
-```bash
-uv run python -m llm.finetune_distributed.launch \
-  compute=bon_echo/a40_4x \
-  +trainer.dist.mode=fsdp --multirun
-```
-You can choose **DDP** or **FSDP** mode by setting the `+trainer.dist.mode` argument (`ddp` or `fsdp`).
-
-A few points to clarify for this template:
-- **`launch.py`** is the Hydra entrypoint; it merges config layers and hands the resolved config to Submitit.
-- **`distributed_launcher.py`** is a Submitit helper; it shells out to `torch.distributed.run` so that torchrun controls per-rank workers without re-entering Hydra (the same pattern used in VectorLM).
-- **`train.py`** is the torchrun worker; it loads the saved config, builds tokenizer, dataloaders, model, and optimizer, and then delegates to the Trainer.
-- **`trainer_core.py`** is a minimal trainer (adapted from VectorLM’s `trainer.py`); it handles gradient accumulation, checkpointing, optional evaluation, and works with either DDP or FSDP.
-
-Hydra and Submitit resolve and submit jobs once. Torchrun (DDP/FSDP) needs to own process creation per GPU. Launching `torch.distributed.run` in a subprocess is the standard Hydra + Submitit approach: it avoids nested Hydra invocations, keeps the Hydra run directory stable for requeues and checkpoints, and makes local debugging under `torchrun` straightforward.
+- [finetune_distributed](finetune_distributed/): distributed finetuning template using DDP and FSDP.
````
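For context on the removed section: the launcher it describes shells out to `torch.distributed.run` from the Submitit job so that torchrun owns per-rank process creation. A rough sketch of that pattern follows; the module paths, worker `--config` flag, and GPU count are assumptions for illustration, not the template's actual `distributed_launcher.py`:

```python
# Hypothetical sketch of the "shell out to torch.distributed.run" pattern the
# removed README section describes; names and flags below are assumptions.
import os
import subprocess
import sys


def launch_torchrun(saved_config: str, nproc_per_node: int = 4) -> None:
    """Spawn torchrun as a subprocess so it owns per-rank workers (no nested Hydra)."""
    cmd = [
        sys.executable, "-m", "torch.distributed.run",  # the module the torchrun CLI wraps
        f"--nproc_per_node={nproc_per_node}",
        "--standalone",                                  # single-node rendezvous
        "-m", "llm.finetune_distributed.train",          # torchrun worker module (assumed path)
        f"--config={saved_config}",                      # hypothetical flag: worker reloads the resolved config
    ]
    # Inherit the Submitit/Slurm environment so every rank sees the allocated GPUs.
    subprocess.run(cmd, check=True, env=os.environ.copy())
```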

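The `+trainer.dist.mode` switch mentioned above would typically come down to choosing the parallel wrapper at model-construction time. A minimal sketch, assuming a plain `mode` string and a `local_rank` supplied by torchrun (not the template's actual code):

```python
# Hypothetical sketch of selecting DDP vs. FSDP from a "ddp"/"fsdp" mode string.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model(model: torch.nn.Module, mode: str, local_rank: int) -> torch.nn.Module:
    model = model.to(local_rank)
    if mode == "ddp":
        return DDP(model, device_ids=[local_rank])   # replicate model, all-reduce gradients
    if mode == "fsdp":
        return FSDP(model, device_id=local_rank)     # shard parameters, gradients, optimizer state
    raise ValueError(f"unknown dist mode: {mode!r}")
```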
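Likewise, the gradient accumulation the removed notes attribute to `trainer_core.py` reduces to scaling the loss and stepping the optimizer every N micro-batches. The loop below is a generic illustration with placeholder names, assuming a Hugging Face-style model whose forward pass returns an object with a `.loss` attribute; it is not the trainer itself:

```python
# Hypothetical gradient-accumulation loop; model, dataloader, and optimizer are
# placeholders, and the forward pass is assumed to return an output with .loss.
def train_epoch(model, dataloader, optimizer, accumulation_steps: int, device: str = "cuda") -> None:
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss / accumulation_steps  # scale so gradients average over the window
        loss.backward()                                   # DDP/FSDP wrappers hook into backward here
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```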