
Commit 02a590b

Merge pull request #6 from VectorInstitute/fsdp_template
Distributed Training Template
2 parents e82c6da + ce18a4b commit 02a590b


20 files changed: +673 -22 lines


templates/README.md

Lines changed: 6 additions & 6 deletions
@@ -26,7 +26,7 @@ cd path/to/vec-playbook
 uv sync # Automatically installs dependencies in vec-playbook/.venv
 ```
 
-Finally, ensure you're working directory (by default your cluster scratch space) exists and that you have access to the resources you're requesting on the cluster.
+Finally, ensure you're working directory (by default your cluster scratch space) exists and that you have access to the resources you're requesting on the cluster.
 
 ### UV Tip for Killarney
 
@@ -45,7 +45,7 @@ templates/
 ```
 
 Each template directory contains a `launch.py`, a `train.py`, and a `config.yaml`.
-The `configs/` directory defines Slurm presets and shared Hydra + Submitit settings.
+The `configs/` directory defines Slurm presets and shared Hydra + Submitit settings.
 
 The launch script contains the `hydra.main` decorator which points hydra to the templates local `config.yaml`. This `config.yaml` imports the `_global` config from the `configs/` directory, which in turn imports other preset configs.
 
@@ -85,12 +85,12 @@ All launchers follow the same pattern: use `uv run python -m <templatee>.launch`
 uv run python -m <template_pkg>.launch \
 compute=<cluster>/<preset> \
 requeue=<on|off> \
-<config.overridess> \
+<config.overrides> \
 <new-keys> \
 --multirun
 ```
 
-- `<template_pkg>`: The module path to the template launch script (eg. `mlp.single`)
+- `<template_pkg>`: The module path to the template launch script (eg. `mlp.single`)
 - `compute=<cluster>/<preset>`: chooses the Slurm resources defined under `templates/configs/compute/` (or a custom preset you add).
 - `requeue=<on|off>`: toggles the Submitit requeue flag described in the checkpointing section.
 - Additional config overrides use `key=value` syntax; nested keys follow the YAML structure (e.g., `compute.mem_gb=32`).
@@ -233,7 +233,7 @@ vec_jobs/<timestamp>/
 │ └── hydra_resolved.yaml # The hydra settings that were used for this run (with all placeholder values resolved)
 
 ...
-└── <hydra-run-id>/
+└── <hydra-run-id>/
 ...
 └── ...
 ```
@@ -243,4 +243,4 @@ vec_jobs/<timestamp>/
 - `multirun.yaml` and `hydra.yaml` will contain placeholder values (eg. `${oc.select:compute.mem_gb}`). These are used to fill in the values with values from other parts of the config or other configs included in the defaults. See hydra documentation for more detail.
 - When doing a hyperparameter sweep, a run is performed for each unique combination of hyperparameters. Each run is run as a separate slurm job with a unique slurm ID.
 - All the runs are submitted as separate jobs using the slurm `--array` feature. Therefore there is a base slurm job id shared by all runs. The slurm-job-id actually used by slurm for each run is a combination of the base slurm job ID and the hydra run ID (eg. `1186868_1`). For multirun jobs you might end up with log files like: `1186868_1_0`. Not sure what the second integer is as it doesn't necessarily line up with the hydra run id. Most likely a process ID.
-- The hydra logs are a good place to start to see the output of your job. If information is missing, or if an error occurs, the submitit logs are the source of truth and should contain everything. Sometimes exceptions are not captured in the hydra logs.
+- The hydra logs are a good place to start to see the output of your job. If information is missing, or if an error occurs, the submitit logs are the source of truth and should contain everything. Sometimes exceptions are not captured in the hydra logs.
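To make the launch pattern documented above concrete, here is a hedged example of a multirun submission. It reuses `mlp.single` (the example template package named in the README) and the `bon_echo/a40_1x` preset under `templates/configs/compute/`; the swept key `model.lr` is hypothetical and should be replaced with a key that actually exists in the chosen template's `config.yaml`.

# model.lr is a hypothetical key used only for illustration; pick a real key from the template's config.yaml
uv run python -m mlp.single.launch \
  compute=bon_echo/a40_1x \
  requeue=on \
  compute.mem_gb=32 \
  model.lr=0.01,0.001 \
  --multirun

With `--multirun`, each value in the comma-separated list becomes its own Hydra run, and each run is submitted as a separate Slurm array task as described in the notes above.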

templates/configs/compute/bon_echo/a100_1x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 1
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 16
-mem_gb: 80
+mem_gb: 80 # values are binary GiB as expected by Slurm
 work_root: /scratch/ssd004/scratch/${oc.env:USER}
 timeout_min: 60
 slurm:
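As a quick sketch of how this preset interacts with the launcher overrides described in the README, any of the fields above can be overridden per launch without editing the YAML (`<template_pkg>` is the usual placeholder for a template module such as `mlp.single`):

uv run python -m <template_pkg>.launch \
  compute=bon_echo/a100_1x \
  compute.mem_gb=64 \
  compute.timeout_min=120

This keeps the A100 preset but requests 64 GiB of memory and a 120-minute time limit for that run only.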

templates/configs/compute/bon_echo/a100_4x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 4
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 16
-mem_gb: 320
+mem_gb: 320 # values are binary GiB as expected by Slurm
 work_root: /scratch/ssd004/scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/bon_echo/a40_1x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 1
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 8
-mem_gb: 40
+mem_gb: 40 # values are binary GiB as expected by Slurm
 work_root: /scratch/ssd004/scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/bon_echo/a40_2x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 2
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 8
-mem_gb: 80
+mem_gb: 80 # values are binary GiB as expected by Slurm
 work_root: /scratch/ssd004/scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/bon_echo/a40_4x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 4
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 8
-mem_gb: 160
+mem_gb: 160 # values are binary GiB as expected by Slurm
 work_root: /scratch/ssd004/scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/bon_echo/cpu_1x.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ nodes: 1
 gpus_per_node: 0
 gres: null
 cpus_per_task: 2
-mem_gb: 8
+mem_gb: 8 # values are binary GiB as expected by Slurm
 work_root: /scratch/ssd004/scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/killarney/h100_1x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 1
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 6
-mem_gb: 240
+mem_gb: 240 # values are binary GiB as expected by Slurm
 work_root: /scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/killarney/h100_2x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 2
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 6
-mem_gb: 480
+mem_gb: 480 # values are binary GiB as expected by Slurm
 work_root: /scratch/${oc.env:USER}
 timeout_min: 60
 slurm:

templates/configs/compute/killarney/h100_4x.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ gpus_per_node: 4
 gres: gpu:${.gpu_type}:${.gpus_per_node}
 tasks_per_node: ${.gpus_per_node}
 cpus_per_task: 6
-mem_gb: 960
+mem_gb: 960 # values are binary GiB as expected by Slurm
 work_root: /scratch/${oc.env:USER}
 timeout_min: 60
 slurm:
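For orientation, the resources described by this preset map roughly onto the following sbatch-style request. This is a hedged sketch only: the templates submit through Hydra/Submitit rather than a hand-written batch script, and the `gpu_type` value is defined elsewhere in the preset (assumed here to be `h100`).

# Approximate Slurm request implied by killarney/h100_4x (assumes gpu_type=h100)
#SBATCH --gres=gpu:h100:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --mem=960G
#SBATCH --time=60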
