
Conversation

ansjindal
Contributor

@ansjindal commented Jul 23, 2025

  1. Update SkyPilot to reuse existing PVC volume mounts:
     • Set volume_mounts to map the pre-created 'nemo-workspace' PVC to Kubernetes volume mounts.
     • Set volumes to define the corresponding volumes in the pod.
  2. SkypilotExecutor did not support the launcher API (supports_launcher_transform() returned False), which prevented it from using torchrun for multi-node distributed training.

This PR enables launcher API support in SkypilotExecutor.

Changes Made

  1. Enable launcher transform support: supports_launcher_transform() now returns True.
  2. Auto-detect multi-node training: automatically set launcher="torchrun" when num_nodes > 1 (see the sketch below).
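
For illustration, here is a minimal, self-contained sketch of what these two changes amount to on the executor. The class body below is simplified for clarity and is not the actual diff; the real SkypilotExecutor lives in nemo_run and has many more fields.

class SkypilotExecutor:  # simplified sketch, not the real nemo_run class
    def __init__(self, num_nodes: int = 1, launcher=None):
        self.num_nodes = num_nodes
        self.launcher = launcher
        # Change 2: auto-detect multi-node training and default to torchrun
        if self.num_nodes > 1 and self.launcher is None:
            self.launcher = "torchrun"

    def supports_launcher_transform(self) -> bool:
        # Change 1: advertise launcher API support so launcher transforms are applied
        return True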

Older PR got closed by mistake: #296

Example:

from nemo.collections import llm
import nemo_run as run
from nemo_run.core.execution.launcher import Torchrun  # not used directly; torchrun is enabled automatically for multi-node runs
from nemo import lightning as nl

def nemo_skypilot_executor(nodes: int, devices: int, container_image: str):
    return run.SkypilotExecutor(
        gpus="GH200-480GB",
        cpus=32,
        memory=360,
        gpus_per_node=devices,
        num_nodes=nodes,
        container_image=container_image,
        infra="k8s/kubernetes-admin@kubernetes",
        # torchrun launcher will be automatically enabled for multi-node (num_nodes > 1)
        file_mounts={},
        env_vars={
            "HF_TOKEN": "hf_...",
            # Basic NCCL configuration for multi-node
            "NCCL_DEBUG": "INFO",
            "TORCH_DISTRIBUTED_DEBUG": "INFO",
            "CUDA_LAUNCH_BLOCKING": "1",
        },
        cluster_name="cluster",
        volumes={
            # reuse the pre-created 'nemo-workspace' PVC as a volume in the pod
            "nemo-workspace": "nemo-workspace"
        },
        volume_mounts=[
            {
                # mount the PVC at /data inside the container
                "path": "/data",
                "volume_name": "nemo-workspace",
                "size": "50Gi",
                "type": "k8s-pvc"
            }
        ],
        setup="""
    conda deactivate
    python3 -m pip install "datasets>=4.0.0"
    nvidia-smi
    """,
    )


if __name__ == "__main__":
    nodes = 2  # Test multi-node with launcher API
    gpus_per_node = 1
    
    recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
        model_name="meta-llama/Llama-3.2-3B",
        dir="/data/llama3.2_3b_launcher", 
        name="llama3_lora_launcher",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme="lora",
        max_steps=100,
    )

    recipe.peft.target_modules = ["linear_qkv", "linear_proj", "linear_fc1", "*_proj"]
    recipe.peft.dim = 16
    recipe.peft.alpha = 32
    if nodes == 1:
        recipe.trainer.strategy = "auto"  # Let Lightning choose the best strategy
    else:
        recipe.trainer.strategy = run.Config(
            nl.FSDP2Strategy, data_parallel_size=nodes * gpus_per_node, tensor_parallel_size=1
        )

    executor = nemo_skypilot_executor(nodes=nodes, devices=gpus_per_node, container_image="nvcr.io/nvidia/nemo:25.04")

    with run.Experiment("k8s-nemo-launcher-test", executor=executor, log_level="DEBUG") as exp:
        id1 = exp.add(recipe, tail_logs=True, name="recipe")
        exp.run(detach=False, tail_logs=True, sequential=True) 
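
For reference, the volumes / volume_mounts settings above are meant to reuse the pre-created nemo-workspace PVC and mount it at /data inside each pod. Roughly, they correspond to a pod spec fragment like the one below, written as a Python dict purely for illustration (the exact rendering is handled by SkyPilot, and the container name is made up):

pod_spec_fragment = {
    "volumes": [
        # reference the existing PersistentVolumeClaim instead of provisioning new storage
        {"name": "nemo-workspace", "persistentVolumeClaim": {"claimName": "nemo-workspace"}},
    ],
    "containers": [
        {
            "name": "nemo-run-task",  # illustrative name only
            "volumeMounts": [{"name": "nemo-workspace", "mountPath": "/data"}],
        },
    ],
}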

Some logs:

(worker1, rank=1, pid=2178, ip=172.31.0.72) Generating train split: 100%|██████████| 87599/87599 [00:00<00:00, 1405955.08 examples/s]
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 1055817.89 examples/s]
(head, rank=0, pid=3234) ----------------------------------------------------------------------------------------------------
(head, rank=0, pid=3234) distributed_backend=nccl
(head, rank=0, pid=3234) All distributed processes registered. Starting with 2 processes
(head, rank=0, pid=3234) ----------------------------------------------------------------------------------------------------
(head, rank=0, pid=3234) 
(head, rank=0, pid=3234) 
(worker1, rank=1, pid=2178, ip=172.31.0.72) 
(head, rank=0, pid=3234) Map:   0%|          | 0/87599 [00:00<?, ? examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:   0%|          | 0/87599 [00:00<?, ? examples/s]
(head, rank=0, pid=3234) Map:   0%|          | 186/87599 [00:00<00:47, 1829.40 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:   0%|          | 181/87599 [00:00<00:48, 1784.73 examples/s]
(head, rank=0, pid=3234) Map:   0%|          | 380/87599 [00:00<00:46, 1884.76 examples/s]
...
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  97%|█████████▋| 84916/87599 [00:44<00:01, 1971.79 examples/s]
(head, rank=0, pid=3234) Map:  99%|█████████▉| 86901/87599 [00:44<00:00, 1963.64 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  97%|█████████▋| 85206/87599 [00:44<00:01, 1860.88 examples/s]
(head, rank=0, pid=3234) Map:  99%|█████████▉| 87108/87599 [00:44<00:00, 1846.75 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  97%|█████████▋| 85397/87599 [00:44<00:01, 1871.20 examples/s]
(head, rank=0, pid=3234) Map: 100%|█████████▉| 87321/87599 [00:44<00:00, 1918.05 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 85588/87599 [00:44<00:01, 1879.02 examples/s]
Map: 100%|██████████| 87599/87599 [00:44<00:00, 1948.98 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 85794/87599 [00:44<00:00, 1926.31 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 86000/87599 [00:44<00:00, 1858.73 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 86212/87599 [00:45<00:00, 1928.64 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▊| 86486/87599 [00:45<00:00, 1888.67 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▉| 86699/87599 [00:45<00:00, 1950.21 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▉| 86901/87599 [00:45<00:00, 1966.92 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▉| 87107/87599 [00:45<00:00, 1846.34 examples/s]


(worker1, rank=1, pid=2178, ip=172.31.0.72) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(head, rank=0, pid=3234) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(head, rank=0, pid=3234) 
(head, rank=0, pid=3234)   | Name  | Type                 | Params | Mode 
(head, rank=0, pid=3234) -------------------------------------------------------
(head, rank=0, pid=3234) 0 | model | FSDPLlamaForCausalLM | 1.2 B  | train
(head, rank=0, pid=3234) -------------------------------------------------------
(head, rank=0, pid=3234) 11.3 M    Trainable params
(head, rank=0, pid=3234) 1.2 B     Non-trainable params
(head, rank=0, pid=3234) 1.2 B     Total params
(head, rank=0, pid=3234) 4,988.346 Total estimated model params size (MB)
(head, rank=0, pid=3234) 551       Modules in train mode
(head, rank=0, pid=3234) 0         Modules in eval mode
(head, rank=0, pid=3234) 
Epoch 0:   0%|          | 0/21900 [00:00<?, ?it/s] [NeMo I 2025-07-22 23:59:37 nemo_logging:393] Setting up optimizers
(head, rank=0, pid=3234) 
(head, rank=0, pid=3234) Epoch 0:   0%|          | 1/21900 [00:17<107:25:40,  0.06it/s]

if cloud == "k8s":
    # VolumeConfig region and zone required even though they are marked as optional
    # validation fails otherwise
    config["cloud"] = "kubernetes"
Contributor Author


This value needs to be "kubernetes", based on the list of providers supported by SkyPilot.
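
A minimal sketch of the normalization being discussed; the alias mapping here is illustrative, and the authoritative list of provider names comes from SkyPilot itself:

# map the executor's "k8s" shorthand onto the provider name SkyPilot expects
_CLOUD_ALIASES = {"k8s": "kubernetes"}

def normalize_cloud(cloud: str) -> str:
    return _CLOUD_ALIASES.get(cloud.lower(), cloud)

assert normalize_cloud("k8s") == "kubernetes"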

@hemildesai
Contributor

Can you rebase on the latest main? SkyPilot was just updated to 0.10.0 there.

Contributor

@hemildesai left a comment


Could you also add some basic unit tests for the codecov CI check to pass? Thanks a lot for your contribution.
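
A minimal pytest-style sketch of the kind of tests being asked for; the import path, constructor arguments, and the assumption that the torchrun default is applied at construction time are guesses based on the example above, not the final test code:

from nemo_run.core.execution.skypilot import SkypilotExecutor  # assumed module path

def test_supports_launcher_transform():
    executor = SkypilotExecutor(num_nodes=2, gpus_per_node=1)
    assert executor.supports_launcher_transform()

def test_multi_node_defaults_to_torchrun():
    # assumes the auto-detection runs when the executor is constructed
    executor = SkypilotExecutor(num_nodes=2, gpus_per_node=1)
    assert executor.launcher is not None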

Comment on lines 359 to 367
launcher = self.launcher
# Dynamic rendezvous has an error in Skypilot Kubernetes currently
if (
    launcher
    and isinstance(launcher, (Torchrun, FaultTolerance))
    and self.cloud == "kubernetes"
):
    launcher.rdzv_backend = "static"
    launcher.rdzv_port = 49500
Contributor


You don't need this part anymore; it's fixed in the latest version of SkyPilot.

@ansjindal
Contributor Author

Is there anything more needed on this PR?
