
Conversation

ansjindal
Contributor

@ansjindal commented Jul 23, 2025

  1. Update SkyPilot to reuse existing PVC volume mounts:
     • Set volume_mounts to map the pre-created 'nemo-workspace' PVC to Kubernetes volume mounts.
     • Set volumes to define the corresponding volumes in the pod.
  2. SkypilotExecutor did not support the launcher API (supports_launcher_transform() returned False), which prevented it from using torchrun for multi-node distributed training.

This PR enables launcher API support in SkypilotExecutor.

Changes Made

  1. Enable launcher transform support: supports_launcher_transform() now returns True.
  2. Auto-detect multi-node training: automatically set launcher="torchrun" when num_nodes > 1 (see the sketch below).
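
For illustration, here is a minimal, self-contained sketch of what these two changes amount to on the executor. The class body below is simplified for clarity and is not the actual diff; the real SkypilotExecutor lives in nemo_run and has many more fields.

class SkypilotExecutor:  # simplified sketch, not the real nemo_run class
    def __init__(self, num_nodes: int = 1, launcher=None):
        self.num_nodes = num_nodes
        self.launcher = launcher
        # Change 2: auto-detect multi-node training and default to torchrun
        if self.num_nodes > 1 and self.launcher is None:
            self.launcher = "torchrun"

    def supports_launcher_transform(self) -> bool:
        # Change 1: advertise launcher API support so launcher transforms are applied
        return True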

Older PR got closed by mistake: #296

Example:

from nemo.collections import llm
import nemo_run as run
from nemo_run.core.execution.launcher import Torchrun  # not used directly; torchrun is enabled automatically for multi-node runs
from nemo import lightning as nl

def nemo_skypilot_executor(nodes: int, devices: int, container_image: str):
    return run.SkypilotExecutor(
        gpus="GH200-480GB",
        cpus=32,
        memory=360,
        gpus_per_node=devices,
        num_nodes=nodes,
        container_image=container_image,
        infra="k8s/kubernetes-admin@kubernetes",
        # torchrun launcher will be automatically enabled for multi-node (num_nodes > 1)
        file_mounts={},
        env_vars={
            "HF_TOKEN": "hf_...",
            # Basic NCCL configuration for multi-node
            "NCCL_DEBUG": "INFO",
            "TORCH_DISTRIBUTED_DEBUG": "INFO",
            "CUDA_LAUNCH_BLOCKING": "1",
        },
        cluster_name="cluster",
        volumes={
            # reuse the pre-created 'nemo-workspace' PVC as a volume in the pod
            "nemo-workspace": "nemo-workspace"
        },
        volume_mounts=[
            {
                # mount the PVC at /data inside the container
                "path": "/data",
                "volume_name": "nemo-workspace",
                "size": "50Gi",
                "type": "k8s-pvc"
            }
        ],
        setup="""
    conda deactivate
    python3 -m pip install "datasets>=4.0.0"
    nvidia-smi
    """,
    )


if __name__ == "__main__":
    nodes = 2  # Test multi-node with launcher API
    gpus_per_node = 1
    
    recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
        model_name="meta-llama/Llama-3.2-3B",
        dir="/data/llama3.2_3b_launcher", 
        name="llama3_lora_launcher",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme="lora",
        max_steps=100,
    )

    recipe.peft.target_modules = ["linear_qkv", "linear_proj", "linear_fc1", "*_proj"]
    recipe.peft.dim = 16
    recipe.peft.alpha = 32
    if nodes == 1:
        recipe.trainer.strategy = "auto"  # Let Lightning choose the best strategy
    else:
        recipe.trainer.strategy = run.Config(
            nl.FSDP2Strategy, data_parallel_size=nodes * gpus_per_node, tensor_parallel_size=1
        )

    executor = nemo_skypilot_executor(nodes=nodes, devices=gpus_per_node, container_image="nvcr.io/nvidia/nemo:25.04")

    with run.Experiment("k8s-nemo-launcher-test", executor=executor, log_level="DEBUG") as exp:
        id1 = exp.add(recipe, tail_logs=True, name="recipe")
        exp.run(detach=False, tail_logs=True, sequential=True) 
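
For reference, the volumes / volume_mounts settings above are meant to reuse the pre-created nemo-workspace PVC and mount it at /data inside each pod. Roughly, they correspond to a pod spec fragment like the one below, written as a Python dict purely for illustration (the exact rendering is handled by SkyPilot, and the container name is made up):

pod_spec_fragment = {
    "volumes": [
        # reference the existing PersistentVolumeClaim instead of provisioning new storage
        {"name": "nemo-workspace", "persistentVolumeClaim": {"claimName": "nemo-workspace"}},
    ],
    "containers": [
        {
            "name": "nemo-run-task",  # illustrative name only
            "volumeMounts": [{"name": "nemo-workspace", "mountPath": "/data"}],
        },
    ],
}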

Some logs:

(worker1, rank=1, pid=2178, ip=172.31.0.72) Generating train split: 100%|██████████| 87599/87599 [00:00<00:00, 1405955.08 examples/s]
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 1055817.89 examples/s]
(head, rank=0, pid=3234) ----------------------------------------------------------------------------------------------------
(head, rank=0, pid=3234) distributed_backend=nccl
(head, rank=0, pid=3234) All distributed processes registered. Starting with 2 processes
(head, rank=0, pid=3234) ----------------------------------------------------------------------------------------------------
(head, rank=0, pid=3234) 
(head, rank=0, pid=3234) 
(worker1, rank=1, pid=2178, ip=172.31.0.72) 
(head, rank=0, pid=3234) Map:   0%|          | 0/87599 [00:00<?, ? examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:   0%|          | 0/87599 [00:00<?, ? examples/s]
(head, rank=0, pid=3234) Map:   0%|          | 186/87599 [00:00<00:47, 1829.40 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:   0%|          | 181/87599 [00:00<00:48, 1784.73 examples/s]
(head, rank=0, pid=3234) Map:   0%|          | 380/87599 [00:00<00:46, 1884.76 examples/s]
...
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  97%|█████████▋| 84916/87599 [00:44<00:01, 1971.79 examples/s]
(head, rank=0, pid=3234) Map:  99%|█████████▉| 86901/87599 [00:44<00:00, 1963.64 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  97%|█████████▋| 85206/87599 [00:44<00:01, 1860.88 examples/s]
(head, rank=0, pid=3234) Map:  99%|█████████▉| 87108/87599 [00:44<00:00, 1846.75 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  97%|█████████▋| 85397/87599 [00:44<00:01, 1871.20 examples/s]
(head, rank=0, pid=3234) Map: 100%|█████████▉| 87321/87599 [00:44<00:00, 1918.05 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 85588/87599 [00:44<00:01, 1879.02 examples/s]
Map: 100%|██████████| 87599/87599 [00:44<00:00, 1948.98 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 85794/87599 [00:44<00:00, 1926.31 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 86000/87599 [00:44<00:00, 1858.73 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  98%|█████████▊| 86212/87599 [00:45<00:00, 1928.64 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▊| 86486/87599 [00:45<00:00, 1888.67 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▉| 86699/87599 [00:45<00:00, 1950.21 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▉| 86901/87599 [00:45<00:00, 1966.92 examples/s]
(worker1, rank=1, pid=2178, ip=172.31.0.72) Map:  99%|█████████▉| 87107/87599 [00:45<00:00, 1846.34 examples/s]


(worker1, rank=1, pid=2178, ip=172.31.0.72) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(head, rank=0, pid=3234) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(head, rank=0, pid=3234) 
(head, rank=0, pid=3234)   | Name  | Type                 | Params | Mode 
(head, rank=0, pid=3234) -------------------------------------------------------
(head, rank=0, pid=3234) 0 | model | FSDPLlamaForCausalLM | 1.2 B  | train
(head, rank=0, pid=3234) -------------------------------------------------------
(head, rank=0, pid=3234) 11.3 M    Trainable params
(head, rank=0, pid=3234) 1.2 B     Non-trainable params
(head, rank=0, pid=3234) 1.2 B     Total params
(head, rank=0, pid=3234) 4,988.346 Total estimated model params size (MB)
(head, rank=0, pid=3234) 551       Modules in train mode
(head, rank=0, pid=3234) 0         Modules in eval mode
(head, rank=0, pid=3234) 
Epoch 0:   0%|          | 0/21900 [00:00<?, ?it/s] [NeMo I 2025-07-22 23:59:37 nemo_logging:393] Setting up optimizers
(head, rank=0, pid=3234) 
(head, rank=0, pid=3234) Epoch 0:   0%|          | 1/21900 [00:17<107:25:40,  0.06it/s]

if cloud == "k8s":
    # VolumeConfig region and zone required even though they are marked as optional
    # validation fails otherwise
    config["cloud"] = "kubernetes"
Contributor Author


This value needs to be "kubernetes", based on the list of providers supported by SkyPilot.
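
A minimal sketch of the normalization being discussed; the alias mapping here is illustrative, and the authoritative list of provider names comes from SkyPilot itself:

# map the executor's "k8s" shorthand onto the provider name SkyPilot expects
_CLOUD_ALIASES = {"k8s": "kubernetes"}

def normalize_cloud(cloud: str) -> str:
    return _CLOUD_ALIASES.get(cloud.lower(), cloud)

assert normalize_cloud("k8s") == "kubernetes"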

@hemildesai
Contributor

Can you rebase on the latest main? SkyPilot was just updated to 0.10.0 there.

Contributor

@hemildesai left a comment


Could you also add some basic unit tests for the codecov CI check to pass? Thanks a lot for your contribution.
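
A minimal pytest-style sketch of the kind of tests being asked for; the import path, constructor arguments, and the assumption that the torchrun default is applied at construction time are guesses based on the example above, not the final test code:

from nemo_run.core.execution.skypilot import SkypilotExecutor  # assumed module path

def test_supports_launcher_transform():
    executor = SkypilotExecutor(num_nodes=2, gpus_per_node=1)
    assert executor.supports_launcher_transform()

def test_multi_node_defaults_to_torchrun():
    # assumes the auto-detection runs when the executor is constructed
    executor = SkypilotExecutor(num_nodes=2, gpus_per_node=1)
    assert executor.launcher is not None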

Comment on lines 359 to 367
launcher = self.launcher
# Dynamic rendezvous has an error in Skypilot Kubernetes currently
if (
    launcher
    and isinstance(launcher, (Torchrun, FaultTolerance))
    and self.cloud == "kubernetes"
):
    launcher.rdzv_backend = "static"
    launcher.rdzv_port = 49500
Contributor


You don't need this part anymore; it's fixed in the latest version of SkyPilot.

@ansjindal
Contributor Author

Is there anything more needed on this PR?
