12 changes: 6 additions & 6 deletions templates/README.md
@@ -26,7 +26,7 @@ cd path/to/vec-playbook
uv sync # Automatically installs dependencies in vec-playbook/.venv
```

Finally, ensure your working directory (by default your cluster scratch space) exists and that you have access to the resources you're requesting on the cluster.

### UV Tip for Killarney

@@ -45,7 +45,7 @@ templates/
```

Each template directory contains a `launch.py`, a `train.py`, and a `config.yaml`.
The `configs/` directory defines Slurm presets and shared Hydra + Submitit settings.

The launch script contains the `@hydra.main` decorator, which points Hydra to the template's local `config.yaml`. This `config.yaml` imports the `_global` config from the `configs/` directory, which in turn imports other preset configs.

@@ -85,12 +85,12 @@ All launchers follow the same pattern: use `uv run python -m <template_pkg>.launch`
uv run python -m <template_pkg>.launch \
compute=<cluster>/<preset> \
requeue=<on|off> \
<config.overridess> \
<config.overrides> \
<new-keys> \
--multirun
```

- `<template_pkg>`: the module path to the template launch script (e.g., `mlp.single`)
- `compute=<cluster>/<preset>`: chooses the Slurm resources defined under `templates/configs/compute/` (or a custom preset you add).
- `requeue=<on|off>`: toggles the Submitit requeue flag described in the checkpointing section.
- Additional config overrides use `key=value` syntax; nested keys follow the YAML structure (e.g., `compute.mem_gb=32`).
@@ -233,7 +233,7 @@ vec_jobs/<timestamp>/
│ └── hydra_resolved.yaml # The hydra settings that were used for this run (with all placeholder values resolved)
...
└── <hydra-run-id>/
...
└── ...
```
@@ -243,4 +243,4 @@
- `multirun.yaml` and `hydra.yaml` will contain placeholder values (e.g., `${oc.select:compute.mem_gb}`). These placeholders are resolved with values from other parts of the config or from other configs included in the defaults. See the Hydra documentation for more detail.
- When doing a hyperparameter sweep, one run is performed for each unique combination of hyperparameters, and each run is submitted as a separate Slurm job with a unique Slurm ID.
- All runs are submitted as separate jobs using the Slurm `--array` feature, so there is a base Slurm job ID shared by all runs. The job ID Slurm actually uses for each run combines the base job ID and the Hydra run ID (e.g., `1186868_1`). For multirun jobs you might end up with log files like `1186868_1_0`; the trailing integer does not necessarily line up with the Hydra run ID and is most likely a process index.
- The Hydra logs are a good place to start when checking the output of your job. If information is missing, or if an error occurs, the Submitit logs are the source of truth and should contain everything; exceptions are sometimes not captured in the Hydra logs.
2 changes: 1 addition & 1 deletion templates/configs/compute/bon_echo/a100_1x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 1
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 16
mem_gb: 80
mem_gb: 80 # values are binary GiB as expected by Slurm
work_root: /scratch/ssd004/scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/bon_echo/a100_4x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 4
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 16
mem_gb: 320
mem_gb: 320 # values are binary GiB as expected by Slurm
work_root: /scratch/ssd004/scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/bon_echo/a40_1x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 1
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 8
mem_gb: 40
mem_gb: 40 # values are binary GiB as expected by Slurm
work_root: /scratch/ssd004/scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/bon_echo/a40_2x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 2
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 8
mem_gb: 80
mem_gb: 80 # values are binary GiB as expected by Slurm
work_root: /scratch/ssd004/scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/bon_echo/a40_4x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 4
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 8
mem_gb: 160
mem_gb: 160 # values are binary GiB as expected by Slurm
work_root: /scratch/ssd004/scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/bon_echo/cpu_1x.yaml
@@ -3,7 +3,7 @@ nodes: 1
gpus_per_node: 0
gres: null
cpus_per_task: 2
mem_gb: 8
mem_gb: 8 # values are binary GiB as expected by Slurm
work_root: /scratch/ssd004/scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/h100_1x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 1
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 6
mem_gb: 240
mem_gb: 240 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/h100_2x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 2
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 6
mem_gb: 480
mem_gb: 480 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/h100_4x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 4
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 6
mem_gb: 960
mem_gb: 960 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/h100_8x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 8
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 6
mem_gb: 1920
mem_gb: 1920 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/l40s_1x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 1
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 16
mem_gb: 120
mem_gb: 120 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/l40s_2x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 2
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 16
mem_gb: 240
mem_gb: 240 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
2 changes: 1 addition & 1 deletion templates/configs/compute/killarney/l40s_4x.yaml
@@ -5,7 +5,7 @@ gpus_per_node: 4
gres: gpu:${.gpu_type}:${.gpus_per_node}
tasks_per_node: ${.gpus_per_node}
cpus_per_task: 16
mem_gb: 480
mem_gb: 480 # values are binary GiB as expected by Slurm
work_root: /scratch/${oc.env:USER}
timeout_min: 60
slurm:
7 changes: 4 additions & 3 deletions templates/src/llm/README.md
@@ -1,5 +1,6 @@
### LLM training templates
# LLM Training Templates

This directory includes templates for LLM training tasks:
This directory includes templates for language-model workloads:

- [text_classification](text_classification/): Fine-tunes a small Transformer on AG News using Hugging Face Trainer.
- [text_classification](text_classification/): fine-tunes a small LLM on AG News via Hugging Face Trainer.
- [finetune_distributed](finetune_distributed/): distributed finetuning template using DDP and FSDP.
209 changes: 209 additions & 0 deletions templates/src/llm/finetune_distributed/README.md
@@ -0,0 +1,209 @@
# LLM Distributed Fine-tuning Template

This template fine-tunes Hugging Face models with the **HF Trainer** and scales via **DDP** (DistributedDataParallel) or **FSDP** (Fully Sharded Data Parallel), both PyTorch approaches to **distributed training**.

## Overview: DDP vs FSDP

**DDP** runs a complete copy of the model on each GPU and synchronizes gradients across devices during training. It offers high performance and is suitable for models that comfortably fit within GPU memory.
**FSDP**, by contrast, shards model parameters, gradients, and optimizer states across GPUs to reduce memory consumption, allowing the training of much larger models.

For more information, see the official PyTorch tutorials:
- [DistributedDataParallel (DDP) Tutorial](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html)
- [Fully Sharded Data Parallel (FSDP) Tutorial](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html)

## Memory Breakdown and Scaling

Training memory consists of several components:

1. **Model Parameters**: The weights of the neural network
- Size: `num_params × bytes_per_param` (2 bytes for fp16/bf16, 4 bytes for fp32)

2. **Gradients**: One gradient value per parameter for backpropagation
- Size: Same as model parameters
- Typically stored in the same dtype as the model parameters (fp16/bf16)

3. **Optimizer States**: Adam optimizer maintains running averages
- Size: ~2× model parameters (first moment + second moment estimates)
- Both are usually stored in **fp32** for numerical stability
- **Optional:** Many mixed-precision setups also keep an **fp32 master weights** copy
- Total optimizer footprint ≈ **2×–3× model size (in fp32)** depending on implementation

4. **Activations**: Intermediate outputs stored during the forward pass for backpropagation
- Size: `batch_size × seq_length × hidden_dim × num_layers × constant`
- Scales linearly with batch size and sequence length
- Can be reduced substantially with `activation_checkpointing: true` (recomputes instead of storing; savings vary by layer, typically 30–60%)

---

### DDP vs FSDP Summary

| Method | Description | Pros | Cons |
|--------|--------------|------|------|
| **DDP** | Replicates the **entire model** on each GPU | Simple, efficient for smaller models | High memory use |
| **FSDP** | Shards parameters, gradients, and optimizer states across GPUs | Allows **larger models** | More communication overhead |

---

### With FSDP `full_shard`

- Parameters, gradients, and optimizer states are **sharded (divided)** across all GPUs.
- Activations are **not sharded**; they are fully **replicated** on each GPU.
- During training, FSDP **temporarily all-gathers parameters** per layer, so brief **memory spikes** above steady-state usage are expected.

---

### Formula

```text
Memory per GPU ≈ (Parameters + Gradients + Optimizer States) / num_GPUs + Activations

≈ (params × bytes_per_param × factor) / num_GPUs + (batch × seq × hidden × layers × constant)
```

Where:
- `factor = 6` → fp16/bf16 (params + grads in fp16 + m,v in fp32)
- `factor = 8` → fp16/bf16 with fp32 master weights (common in AdamW)

---

### Example

**Pythia-1B on 2× NVIDIA L40 GPUs with batch=16, seq=512:**

```text
Params: 1.0B × 2 bytes = 2.0 GB (≈ 1.86 GiB) total

Model + Gradients + Optimizer States (with master weights):
(1.0B × 2 bytes × 8) / 2 GPUs ≈ 7.45 GiB per GPU

Activations (not sharded):
≈ 1.0 GiB per GPU (depends on hidden_dim × layers)

Total steady-state:
≈ 8.45 GiB per GPU

Transient all-gather overhead:
+10–20% (≈ 9.3 – 10.2 GiB peak)
```
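
Below is a small back-of-the-envelope estimator implementing the formula and example above. It is a sketch: the function name is made up for illustration, and the activation constant is an assumed fudge factor chosen so the Pythia-1B numbers above come out to roughly 1 GiB of activations; real activation memory depends heavily on the architecture and implementation.

```python
GIB = 1024 ** 3


def estimate_fsdp_memory_gib(
    num_params: float,              # total parameter count, e.g. 1.0e9
    num_gpus: int,                  # GPUs sharing the sharded states
    batch_size: int,
    seq_length: int,
    hidden_dim: int,
    num_layers: int,
    bytes_per_param: int = 2,       # 2 for fp16/bf16, 4 for fp32
    factor: int = 8,                # 6 = params + grads + Adam moments, 8 = + fp32 master weights
    activation_bytes: int = 2,      # activations kept in fp16/bf16
    activation_constant: int = 2,   # assumed constant; varies a lot by model and framework
) -> float:
    """Rough steady-state memory per GPU for FSDP full_shard (excludes all-gather spikes)."""
    # Parameters, gradients, and optimizer states are sharded across GPUs
    sharded_states = num_params * bytes_per_param * factor / num_gpus
    # Activations are replicated on every GPU, not sharded
    activations = (
        batch_size * seq_length * hidden_dim * num_layers
        * activation_constant * activation_bytes
    )
    return (sharded_states + activations) / GIB


# Pythia-1B-style example from above: 2 GPUs, batch 16, seq 512, hidden 2048, 16 layers
print(f"{estimate_fsdp_memory_gib(1.0e9, 2, 16, 512, 2048, 16):.2f} GiB per GPU")  # ~8.45
```
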

---

### To Reduce Memory

- Decrease `per_device_train_batch_size` → reduces activations
- Enable `activation_checkpointing: true` → recomputes instead of storing
- Reduce `max_length` → reduces activations linearly
- Increase number of GPUs → shards model and optimizer states further
- Use **bf16** or **fp16** mixed precision → halves parameter and gradient memory
- Use **FSDP offload to CPU/NVMe** for optimizer states if GPU memory is constrained

---

## Distributed Environment Setup (Submitit + Slurm)

In `train.py`, we use Submitit’s helper:

```python
from submitit.helpers import TorchDistributedEnvironment
TorchDistributedEnvironment().export() # sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT
```

Then the HF `Trainer` (via `TrainingArguments`) initializes distributed training; you can also explicitly call:

```python
torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

if you need lower-level control. The helper provides the same environment variables you would otherwise set by hand so that PyTorch’s `env://` init works.
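
Putting the two snippets together, a minimal setup helper could look like the following. This is a sketch rather than the template's exact code, and the function name `setup_distributed` is an assumption for illustration.

```python
import torch.distributed as dist
from submitit.helpers import TorchDistributedEnvironment


def setup_distributed() -> tuple[int, int]:
    """Export Slurm-derived rank info and initialize the default process group."""
    # Populates RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT for this task
    TorchDistributedEnvironment().export()
    # env:// reads the variables exported above
    dist.init_process_group(backend="nccl", init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```

If the HF `Trainer` already initializes the process group for you, calling this explicitly is unnecessary.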

---

### Tasks-per-node and GPUs-per-node

One process per GPU is the common pattern for distributed training. Concretely:

```yaml
hydra.launcher.tasks_per_node = compute.gpus_per_node
```

This makes Slurm/Submitit spawn exactly one task per GPU, where each task becomes a rank in the distributed job.

---

### What Submitit Exports

`submitit.helpers.TorchDistributedEnvironment().export()` populates:

- `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`
(and related fields) so that `init_method="env://"` works out of the box.

---

### Binding Each Task to One GPU

Slurm's GRES plugin sets `CUDA_VISIBLE_DEVICES` for each task so the task "sees" only its assigned GPU(s). You can additionally enforce explicit GPU isolation with:

```yaml
hydra.launcher.setup:
- "export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID"
```

This ensures each task is bound to a single GPU (for example, Task 0 → GPU 0, Task 1 → GPU 1), and within each process, that GPU always appears as `cuda:0`.
This isolation prevents tasks from accessing other GPUs and keeps device handling consistent across nodes.
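
As a quick sanity check (a throwaway snippet, not part of the template), each task should report exactly one visible GPU:

```python
import os

import torch

# With the binding above, every task sees a single GPU, which appears as cuda:0
print(
    f"SLURM_LOCALID={os.environ.get('SLURM_LOCALID')} "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
    f"visible GPUs={torch.cuda.device_count()}"  # expected: 1
)
```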

---

## Distributed Training Modes and Configurations

### FSDP (Fully Sharded Data Parallel)

The default configuration uses FSDP for memory-efficient training of large models:

```yaml
trainer.dist:
mode: "fsdp"
fsdp: ["full_shard", "auto_wrap"]
fsdp_config:
use_orig_params: true
activation_checkpointing: false
limit_all_gathers: true
forward_prefetch: true
sync_module_states: true
fsdp_auto_wrap_policy: "SIZE_BASED_WRAP"
fsdp_min_num_params: 1000000
```

**Key settings:**
- `full_shard`: Shards model parameters, gradients, and optimizer states across all GPUs
- `auto_wrap`: Automatically wraps model layers for sharding
- `SIZE_BASED_WRAP`: Wraps modules that exceed `fsdp_min_num_params`
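
As a rough sketch of how these settings reach the Hugging Face `Trainer` (the exact plumbing in this template's `train.py` may differ), the values above map onto the `fsdp` and `fsdp_config` fields of `TrainingArguments` roughly like this; `output_dir` is a hypothetical placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder; the template derives its own paths
    fsdp=["full_shard", "auto_wrap"],
    fsdp_config={
        "use_orig_params": True,
        "activation_checkpointing": False,
        "limit_all_gathers": True,
        "forward_prefetch": True,
        "sync_module_states": True,
        "fsdp_auto_wrap_policy": "SIZE_BASED_WRAP",
        "fsdp_min_num_params": 1_000_000,
    },
)
```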

---

### To Use DDP Instead

Set `mode: "ddp"` (or remove the `dist.mode` setting).
DDP replicates the full model on each GPU and is simpler but uses more memory.

---

## Performance and Tuning Tips

- Enable `activation_checkpointing: true` to reduce memory (at ~20–30% slower training)
- Tune `fsdp_min_num_params`: a lower threshold wraps more (smaller) modules for finer-grained sharding, while a higher threshold reduces wrapping and communication overhead

---

## Why not torchrun here?

`torchrun` is valid for distributed launches (including on Slurm), but this template uses **Hydra’s Submitit launcher** to keep **sweeps, config composition, logging, and requeue** inside Hydra, and to avoid maintaining separate bash wrappers. Submitit handles **job submission and per-task rank context**; we still initialize PyTorch distributed via the standard env-var pathway.

If you prefer `torchrun`, you can adapt the script and configs, but you'll then manage the Slurm submission layer yourself (or wrap `torchrun` in an sbatch script) and wire up Hydra sweeps accordingly.

---

## References

- [PyTorch distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
- [Slurm GRES guide](https://slurm.schedmd.com/gres.html)
- [Hugging Face FSDP / Trainer documentation](https://huggingface.co/docs/transformers/fsdp)
1 change: 1 addition & 0 deletions templates/src/llm/finetune_distributed/__init__.py
@@ -0,0 +1 @@
"""LLM training template: Fine-tuning using distributed training."""