12 changes: 6 additions & 6 deletions templates/README.md
@@ -26,7 +26,7 @@ cd path/to/vec-playbook
uv sync # Automatically installs dependencies in vec-playbook/.venv
```

Finally, ensure your working directory (by default your cluster scratch space) exists and that you have access to the resources you're requesting on the cluster.

### UV Tip for Killarney

@@ -45,7 +45,7 @@ templates/
```

Each template directory contains a `launch.py`, a `train.py`, and a `config.yaml`.
The `configs/` directory defines Slurm presets and shared Hydra + Submitit settings.

The launch script contains the `hydra.main` decorator, which points Hydra to the template's local `config.yaml`. This `config.yaml` imports the `_global` config from the `configs/` directory, which in turn imports other preset configs.

@@ -85,12 +85,12 @@ All launchers follow the same pattern: use `uv run python -m <template_pkg>.launch`
uv run python -m <template_pkg>.launch \
compute=<cluster>/<preset> \
requeue=<on|off> \
<config.overrides> \
<new-keys> \
--multirun
```

- `<template_pkg>`: The module path to the template launch script (e.g., `mlp.single`)
- `compute=<cluster>/<preset>`: chooses the Slurm resources defined under `templates/configs/compute/` (or a custom preset you add).
- `requeue=<on|off>`: toggles the Submitit requeue flag described in the checkpointing section.
- Additional config overrides use `key=value` syntax; nested keys follow the YAML structure (e.g., `compute.mem_gb=32`).
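
For example, a hypothetical sweep over two learning rates with the distributed finetuning template (substituting a real preset from `templates/configs/compute/` for `<cluster>/<preset>`) could look like: `uv run python -m llm.finetune_distributed.launch compute=<cluster>/<preset> requeue=on trainer.train.learning_rate=1e-5,3e-5 --multirun`.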
@@ -233,7 +233,7 @@ vec_jobs/<timestamp>/
│ └── hydra_resolved.yaml # The hydra settings that were used for this run (with all placeholder values resolved)
...
└── <hydra-run-id>/
...
└── ...
```
@@ -243,4 +243,4 @@
- `multirun.yaml` and `hydra.yaml` will contain placeholder values (e.g., `${oc.select:compute.mem_gb}`). These placeholders are filled in with values from other parts of the config or from configs included in the defaults. See the Hydra documentation for more detail.
- When doing a hyperparameter sweep, a run is performed for each unique combination of hyperparameters. Each run is submitted as a separate Slurm job with a unique Slurm ID.
- All runs are submitted as separate jobs using the Slurm `--array` feature, so there is a base Slurm job ID shared by all runs. The job ID Slurm actually uses for each run is a combination of the base Slurm job ID and the Hydra run ID (e.g., `1186868_1`). For multirun jobs you might end up with log files like `1186868_1_0`; the meaning of the final integer is unclear, as it does not necessarily line up with the Hydra run ID, and it is most likely a process ID.
- The Hydra logs are a good place to start when inspecting the output of your job. If information is missing, or if an error occurs, the Submitit logs are the source of truth and should contain everything; exceptions are sometimes not captured in the Hydra logs.
7 changes: 4 additions & 3 deletions templates/src/llm/README.md
@@ -1,5 +1,6 @@
# LLM Training Templates

This directory includes templates for language-model workloads:

- [text_classification](text_classification/): fine-tunes a small LLM on AG News via Hugging Face Trainer.
- [finetune_distributed](finetune_distributed/): distributed finetuning template using DDP and FSDP.
50 changes: 50 additions & 0 deletions templates/src/llm/finetune_distributed/README.md
@@ -0,0 +1,50 @@
# LLM Distributed Fine-tuning Template

This template fine-tunes Hugging Face models with the **HF Trainer** and scales via **DDP** or **FSDP**.

## How the code works

In `train.py`, we use Submitit’s helper:

```python
from submitit.helpers import TorchDistributedEnvironment
TorchDistributedEnvironment().export() # sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT
```

Then the HF `Trainer` (via `TrainingArguments`) initializes distributed training; you can also explicitly call `torch.distributed.init_process_group(backend="nccl", init_method="env://")` if you need lower-level control. The helper provides the same environment variables you would otherwise set by hand so that PyTorch’s `env://` init works. This pattern is used in Submitit’s own distributed examples and in downstream guides.
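
As a minimal sketch of that lower-level path (an illustration only, assuming a CUDA-equipped Slurm allocation; the template itself lets the HF `Trainer` handle initialization):

```python
import torch.distributed as dist
from submitit.helpers import TorchDistributedEnvironment

# Populate RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
# from the Slurm task context that Submitit provides.
TorchDistributedEnvironment().export()

# Standard env:// initialization; the HF Trainer performs the equivalent
# internally when it detects these environment variables.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
```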

## Distributed environment: tasks, ranks, and GPUs (with Submitit on Slurm)

### Tasks-per-node and GPUs-per-node
- One process per GPU is the common pattern. Concretely, set `hydra.launcher.tasks_per_node = compute.gpus_per_node`. This makes Slurm/Submitit spawn exactly one task per GPU; each task becomes a rank in the job.

### What Submitit exports
- `submitit.helpers.TorchDistributedEnvironment().export()` populates:
- `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` (and related fields) so that `init_method="env://"` works out of the box.

### Binding each task to one GPU
- Slurm’s GRES plugin sets `CUDA_VISIBLE_DEVICES` for each task so the task “sees” only its assigned GPU(s). You can additionally enforce a 1:1 mapping with:
```yaml
hydra.launcher.setup:
- "export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID"
```
This ensures rank-local GPU selection is unambiguous (task 0 -> GPU 0, task 1 -> GPU 1 on that node).

### Quick glossary
- **WORLD_SIZE**: total number of processes across all nodes.
- **RANK**: global process id `[0 .. WORLD_SIZE-1]`.
- **LOCAL_RANK**: process id per node `[0 .. tasks_per_node-1]`.
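
To make this concrete, here is a small sketch (not part of the template) of how a per-task process could read these variables and pick its GPU under the `CUDA_VISIBLE_DEVICES=$SLURM_LOCALID` setup above:

```python
import os

import torch

# One set of these is exported per task (i.e., per GPU) by Submitit/Slurm.
rank = int(os.environ["RANK"])              # global rank in [0 .. WORLD_SIZE-1]
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

# Because CUDA_VISIBLE_DEVICES is set to $SLURM_LOCALID, each task sees exactly
# one GPU, which is device 0 from its point of view. Without that export, you
# would call torch.cuda.set_device(local_rank) instead.
torch.cuda.set_device(0)
print(f"rank {rank}/{world_size}, local_rank {local_rank}, visible GPUs: {torch.cuda.device_count()}")
```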

## Why not torchrun here?

`torchrun` is valid for distributed launches (including on Slurm), but this template uses **Hydra’s Submitit launcher** to keep **sweeps, config composition, logging, and requeue** inside Hydra, and to avoid maintaining separate bash wrappers. Submitit handles **job submission and per-task rank context**; we still initialize PyTorch distributed via the standard env-var pathway.

If you prefer `torchrun`, you can adapt the script and configs—but you’ll then manage the Slurm submission layer (or wrap `torchrun` inside an `sbatch` yourself) and wire up Hydra sweeps accordingly.

## References

- PyTorch distributed environment variables: https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization
- Slurm GRES guide: https://slurm.schedmd.com/gres.html
- Hugging Face FSDP / Trainer documentation: https://huggingface.co/docs/transformers/fsdp
1 change: 1 addition & 0 deletions templates/src/llm/finetune_distributed/__init__.py
@@ -0,0 +1 @@
"""LLM training template: Fine-tuning using distributed training."""
66 changes: 66 additions & 0 deletions templates/src/llm/finetune_distributed/config.yaml
@@ -0,0 +1,66 @@
defaults:
  - _global
  - _self_

hydra:
  job:
    name: llm_finetune_distributed
  searchpath:
    - pkg://configs
  launcher:
    tasks_per_node: ${compute.gpus_per_node}
    setup:
      - 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID'

paths:
  out_dir: null

trainer:
  seed: 42
  model:
    name: "EleutherAI/pythia-6.9b"
    revision: null
    trust_remote_code: true
    torch_dtype: "float16"
  data:
    dataset_name: "wikitext"
    dataset_config_name: "wikitext-2-raw-v1"
    text_column: "text"
    train_split: "train"
    eval_split: "validation"
    max_length: 512
    load_kwargs:
      streaming: false
  train:
    num_train_epochs: 1
    per_device_train_batch_size: 1
    per_device_eval_batch_size: 1
    gradient_accumulation_steps: 4
    learning_rate: 1.5e-5
    weight_decay: 0.01
    warmup_steps: 200
    logging_steps: 1
    logging_first_step: true
    eval_steps: 10
    save_steps: 10
    eval_strategy: "steps"
    save_strategy: "steps"
    save_total_limit: 2
    lr_scheduler_type: "cosine"
    max_grad_norm: 1.0
    optim: "adamw_torch"
  dist:
    mode: "fsdp"
    fp16: true
    bf16: false
    fsdp: ["full_shard", "auto_wrap"]
    fsdp_config:
      use_orig_params: true
      activation_checkpointing: false
      limit_all_gathers: true
      forward_prefetch: true
      sync_module_states: true
      fsdp_auto_wrap_policy: "SIZE_BASED_WRAP"
      fsdp_min_num_params: 1000000
  logging:
    report_to: []
42 changes: 42 additions & 0 deletions templates/src/llm/finetune_distributed/launch.py
@@ -0,0 +1,42 @@
"""Launch script for checkpointable distributed finetuning with Hydra + Submitit."""

import logging
import os

import hydra
from omegaconf import DictConfig, OmegaConf

from .train import FinetuneDistributedTrainer


logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig):
    """Hydra entrypoint that updates paths, saves config, and launches training."""
    # Turn off struct mode so that we can modify the DictConfig
    OmegaConf.set_struct(cfg, False)

    # Add output directory for the current run
    hydra_config = hydra.core.hydra_config.HydraConfig.get()
    cfg.paths.out_dir = str(os.path.join(hydra_config.runtime.output_dir, "outputs"))
    logger.info(f"Setting paths.out_dir to: {cfg.paths.out_dir}")

    # Save a resolved version of the hydra config
    save_path = os.path.join(
        hydra_config.runtime.output_dir,
        hydra_config.output_subdir,
        "hydra_resolved.yaml",
    )
    logger.info(f"Resolving hydra config for this run and saving to: {save_path}")
    OmegaConf.set_readonly(hydra_config, False)
    OmegaConf.resolve(hydra_config)
    OmegaConf.save(hydra_config, save_path)

    trainer = FinetuneDistributedTrainer()
    return trainer(cfg)


if __name__ == "__main__":
    main()