pytorch · joecummings · Jan 27, 2025
diff --git a/docs/source/api_ref_training.rst b/docs/source/api_ref_training.rst
@@ -53,6 +53,7 @@ Utilities for enabling and working with distributed training.
     init_distributed
     is_distributed
     gather_cpu_state_dict
+    get_distributed_backend
 
 .. _ac_label:
 

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -149,6 +149,7 @@ torchtune tutorials.
    tutorials/e2e_flow
    tutorials/llama_kd_tutorial
    tutorials/memory_optimizations
+   tutorials/multinode
 
 .. toctree::
    :glob:

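The tutorial added in `docs/source/tutorials/multinode.rst` estimates the memory needed to train a 70B parameter model in bfloat16. A minimal sketch of that arithmetic follows, assuming 2 bytes per bf16 value, two AdamW state tensors per parameter stored in bf16, and 1 GB = 1e9 bytes. The function name `estimate_memory_gb` is made up for illustration, and activations are left out because the tutorial gives no figure for them.

```python
# Back-of-the-envelope memory estimate for full finetuning in bf16.
# Assumptions (illustrative, not taken from torchtune): 2 bytes per bf16
# value, AdamW keeps two state tensors per parameter, 1 GB = 1e9 bytes.

BYTES_PER_BF16 = 2
ADAMW_STATES_PER_PARAM = 2  # first and second moment estimates


def estimate_memory_gb(num_params: float) -> dict:
    """Return weight and optimizer-state memory in GB (activations excluded)."""
    weights = num_params * BYTES_PER_BF16 / 1e9
    optim_state = num_params * ADAMW_STATES_PER_PARAM * BYTES_PER_BF16 / 1e9
    return {
        "weights_gb": weights,
        "optim_state_gb": optim_state,
        "subtotal_gb": weights + optim_state,
    }


estimate = estimate_memory_gb(70e9)
for name, value in estimate.items():
    print(f"{name}: {value:.0f}")
```

Under these assumptions the weights and optimizer state alone exceed the 80 GB of a single GPU several times over, before any activations are counted.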
diff --git a/docs/source/tutorials/multinode.rst b/docs/source/tutorials/multinode.rst
@@ -0,0 +1,101 @@
+.. _multinode_tutorial:
+
+=====================
+Multi-node finetuning
+=====================
+
+Congratulations! After years of being "GPU poor", you've worked hard, saved your hard-earned Bitcoin, and
+now have access to a proper multi-node cluster. You're part of the so-called "GPU middle class". In many ways,
+your worries of yesteryear are gone: memory-efficient training? Who cares! But in many other ways, your problems
+are just starting, because multi-node is a whole new beast. Come with me as I take you through your new life, complete with
+a big backyard, a new car, and of course, a nice rack of H100s.
+
+.. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` You will learn:
+
+      * How to set up the torchtune package on a SLURM cluster
+      * How to fine-tune a Llama3.3 70B model with full parameter updates (not LoRA)
+      * What common errors to look out for
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+
+      * Be familiar with distributed training in torchtune
+      * Already know basic SLURM commands
+
+
+Advantages of multi-node training
+---------------------------------
+
+If you're reading this tutorial, you likely don't need a refresher on the advantages of having
+MORE compute, but let's go over it again so you can appreciate how lucky you are. Let's consider a simplified calculation
+of how much memory is required to train a 70B parameter model in bfloat16.
+
+.. code-block:: text
+
+    Weights                            140 GB
+    + Optim state (AdamW)              280 GB
+    + Activations (bsz=8,seq_len=2048) XX
+    ------------------------------------------
+                                       420 GB + XX
+
+Right now a typical data center GPU has 80 GB of VRAM, so this definitely can't fit on a single GPU, and even multiple GPUs won't be up to the task.
+We have a ton of memory optimizations in torchtune that allow you to fit larger models with fewer resources.
+
+Why might you want to use multi-node then?
+
+* Larger models (like Llama 405B, DeepSeek, etc.)
+* Potentially faster training via larger batch sizes and no activation checkpointing
+* Potentially more accurate training with full parameter updates, non-approximate optimizers, etc.
+
+.. note::
+
+    **Low inter-node bandwidth & FSDP**
+
+    We utilize <FSDP> to distribute models over multiple devices. To distribute training, FSDP runs an all-gather operation
+    for each forward pass and an all-gather plus a reduce-scatter operation for each backward pass. These operations (usually)
+    block training from continuing until they complete, so with a slow inter-node connection, training speed may be reduced.
+
+Training Llama3.3 70B on 2 nodes
+--------------------------------
+
+With that background out of the way, let's get training! We'll be using a common cluster workload manager called SLURM, and for this tutorial we assume you have a decent working knowledge of it.
+First, we need to install torchtune on your cluster. Although this is pretty much as straightforward as the <link> normal install instructions,
+it's recommended that you install into a virtual environment that is accessible from all nodes in your cluster, for example on a shared filesystem.
+
+Next, using the same idea as above, we need to download the Llama3.3 70B model to the shared filesystem. (You'll need to make sure you have the correct
+credentials as noted before.)
+
+.. code-block:: bash
+
+    $ tune download meta-llama/Llama-3.3-70B-Instruct --ignore-patterns "original/consolidated*" --output-dir SHARED_FS/Llama-3.3-70B-Instruct
+
+Now that we have a downloaded model, we can launch training. Although you can *technically* launch the multinode bash script from the tune CLI,
+it's recommended that you copy the file to your machine.
+
+.. code-block:: bash
+
+    $ tune cp full_finetune_multinode .
+
+And let's open it up to see what's inside:
+
+.. literalinclude:: ../../../recipes/full_finetune_multinode.slurm
+
+What are the high-level parts?
+
+* Uses ``full_finetune_distributed`` to launch training
+* Lets you specify the number of nodes, tasks, CPUs available, etc.
+* Includes several cluster-specific environment variables that you should review
+
+We just need to point to our checkpoint and output dir and get training!
+
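+.. note::
+
+    You may need to explicitly set your network interface (``NCCL_SOCKET_IFNAME`` / ``GLOO_SOCKET_IFNAME`` in the script above),
+    which you can find with ``ifconfig``.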
+
+Once we've trained, we can follow the instructions [here] in order to upload our beautiful new model to the Hugging Face Hub.
+
+Future development
+------------------
+
+* 2D parallelism
+* Longer context (ring attention, etc.)
+
+What else do you want?
+
diff --git a/recipes/configs/llama3_3/70B_full_multinode.yaml b/recipes/configs/llama3_3/70B_full_multinode.yaml
@@ -0,0 +1,104 @@
+# Config for multi-node full finetuning in full_finetune_distributed.py
+# using a Llama3.3 70B Instruct model
+#
+# This config assumes that you've run the following command before launching:
+#   tune download meta-llama/Llama-3.3-70B-Instruct --ignore-patterns "original/consolidated*" --output-dir SHARED_CLUSTER_FS
+#
+# To launch on 2 nodes w/ 8 devices on a SLURM cluster, run the following command:
+#   sbatch full_finetune_multinode.slurm
+#
+# This config has only been tested on 2 nodes w/ 8 H100s each.
+
+output_dir: /tmp/torchtune/llama3_3_70B/full
+
+# Tokenizer
+tokenizer:
+  _component_: torchtune.models.llama3.llama3_tokenizer
+  path: /tmp/Llama-3.3-70B-Instruct/original/tokenizer.model
+  max_seq_len: 1024
+
+# Dataset
+dataset:
+  _component_: torchtune.datasets.alpaca_dataset
+  packed: True  # True increases speed
+seed: null
+shuffle: True
+
+# Model Arguments
+model:
+  _component_: torchtune.models.llama3_3.llama3_3_70b
+
+checkpointer:
+  _component_: torchtune.training.FullModelHFCheckpointer
+  checkpoint_dir: /tmp/Llama-3.3-70B-Instruct/
+  checkpoint_files:
+    filename_format: model-{}-of-{}.safetensors
+    max_filename: "00030"
+  recipe_checkpoint: null
+  output_dir: ${output_dir}
+  model_type: LLAMA3
+resume_from_checkpoint: False
+
+# Fine-tuning arguments
+batch_size: 4
+epochs: 1
+
+optimizer:
+  _component_: torch.optim.AdamW
+  lr: 2e-5
+  # Note: highly recommended to use fused=True optimizer flag
+  # with CPU offload for faster optimizer step.
+  fused: True
+
+loss:
+  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
+max_steps_per_epoch: null
+gradient_accumulation_steps: 1  # Use to increase effective batch size
+
+
+# Training env
+device: cuda
+
+# Memory management
+enable_activation_checkpointing: True  # True reduces memory
+enable_activation_offloading: False  # True reduces memory
+custom_sharded_layers: ['tok_embeddings', 'output']  # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.
+fsdp_cpu_offload: False
+clip_grad_norm: null
+compile: True  # torch.compile the model + loss, True increases speed + decreases memory
+optimizer_in_bwd: False  # True saves memory. Requires gradient_accumulation_steps=1
+
+# Reduced precision
+dtype: bf16
+
+# Logging
+metric_logger:
+  _component_: torchtune.training.metric_logging.DiskLogger
+  log_dir: ${output_dir}/logs
+log_every_n_steps: 1
+log_peak_memory_stats: True
+
+# Profiler (disabled)
+profiler:
+  _component_: torchtune.training.setup_torch_profiler
+  enabled: False
+
+  #Output directory of trace artifacts
+  output_dir: ${output_dir}/profiling_outputs
+
+  #`torch.profiler.ProfilerActivity` types to trace
+  cpu: True
+  cuda: True
+
+  #trace options passed to `torch.profiler.profile`
+  profile_memory: False
+  with_stack: False
+  record_shapes: True
+  with_flops: False
+
+  # `torch.profiler.schedule` options:
+  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
+  wait_steps: 5
+  warmup_steps: 3
+  active_steps: 2
+  num_cycles: 1
diff --git a/recipes/full_finetune_distributed.py b/recipes/full_finetune_distributed.py
@@ -118,20 +118,21 @@ class FullFinetuneRecipeDistributed(FTRecipeInterface):
     """
 
     def __init__(self, cfg: DictConfig) -> None:
-        self._device = utils.get_device(device=cfg.device)
+        device_type = cfg.device
+        self._device = utils.get_device(device=device_type)
         self._dtype = training.get_dtype(cfg.dtype, device=self._device)
 
         if self._dtype == torch.float16:
             raise ValueError(
                 "full fp16 training is not supported with this recipe. Please use bf16 or fp32 instead."
             )
 
-        # logging attributes
+        # Logging attributes
         self._output_dir = cfg.output_dir
         self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
         self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)
 
-        if self._log_peak_memory_stats and self._device.type != "cuda":
+        if self._log_peak_memory_stats and device_type != "cuda":
             log.info(
                 "log_peak_memory_stats was set to True, however, training does not use cuda. Setting log_peak_memory_stats=False."
             )
@@ -147,6 +148,10 @@ def __init__(self, cfg: DictConfig) -> None:
         self._optimizer_in_bwd = cfg.get("optimizer_in_bwd", False)
         self._clip_grad_norm = cfg.get("clip_grad_norm", None)
         self._checkpoint_client = CheckpointClient(cfg)
+        self.fsdp_cpu_offload = cfg.get("fsdp_cpu_offload", False)
+        self.distributed_backend = training.get_distributed_backend(
+            device_type, enable_cpu_offload=self.fsdp_cpu_offload
+        )
 
         # Optimizer in backward is not compatible with gradient accumulation or gradient clipping
         if self._optimizer_in_bwd:
@@ -169,7 +174,7 @@ def __init__(self, cfg: DictConfig) -> None:
             "enable_activation_offloading", False
         )
         if self._enable_activation_offloading:
-            if self._device.type != "cuda":
+            if device_type != "cuda":
                 raise RuntimeError(
                     "enable_activation_offloading should only be True when training on CUDA"
                 )
@@ -240,9 +245,16 @@ def setup(self, cfg: DictConfig) -> None:
         Setup the recipe. This includes training state (if resume_from_checkpoint is True),
         model, tokenizer, loss, optimizer, lr scheduler, sampler, and dataloader.
         """
+        # Set up the backend for distributed training (NCCL, GLOO, etc.)
+        init_process_group(self.distributed_backend)
+
+        if self.fsdp_cpu_offload:
+            # Utilize all available CPU cores for intra-op parallelism. This provides ~2x
+            # speed up when benchmarking fused AdamW on CPU
+            training.set_torch_num_threads()
+
         if self._is_rank_zero:
             self._metric_logger = config.instantiate(cfg.metric_logger)
-
             # log config with parameter override
             self._metric_logger.log_config(cfg)
 
@@ -255,7 +267,7 @@ def setup(self, cfg: DictConfig) -> None:
             enable_activation_checkpointing=self._enable_activation_checkpointing,
             enable_activation_offloading=self._enable_activation_offloading,
             custom_sharded_layers=cfg.get("custom_sharded_layers", None),
-            fsdp_cpu_offload=cfg.get("fsdp_cpu_offload", False),
+            fsdp_cpu_offload=self.fsdp_cpu_offload,
             reshard_after_forward=cfg.get("fsdp_reshard_after_forward", True),
             model_state_dict=checkpoint_dict[training.MODEL_KEY],
             ac_mode=cfg.get("ac_mode", None),
@@ -890,19 +902,7 @@ def recipe_main(cfg: DictConfig) -> None:
         - Parameters specified in config (see available configs through ``tune ls``)
         - Overwritten by arguments from the command-line
     """
-    if not training.is_distributed():
-        raise RuntimeError(
-            "Distributed finetune recipe should be run via a distributed launcher."
-            "If using tune CLI, please specify --nnodes 1 and --nproc_per_node [num_gpus]"
-        )
-    init_process_group("cuda:nccl,cpu:gloo")
-    if cfg.get("fsdp_cpu_offload", False):
-        # Utilize all available CPU cores for intra-op parallelism. This provides ~2x
-        # speed up when benchmarking fused AdamW on CPU
-        training.set_torch_num_threads()
-
     config.log_config(recipe_name="FullFinetuneRecipeDistributed", cfg=cfg)
-
     recipe = FullFinetuneRecipeDistributed(cfg=cfg)
     recipe.setup(cfg=cfg)
     recipe.train()

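The recipe change above replaces the hard-coded `"cuda:nccl,cpu:gloo"` string with a call to `training.get_distributed_backend(device_type, enable_cpu_offload=...)`. The helper's body is not part of this diff, so the following is only a hypothetical sketch of how such a selection could work, assuming NCCL is the default for CUDA and Gloo for CPU. The name `pick_backend` and the lookup table are illustrative and are not torchtune's implementation.

```python
# Hypothetical sketch of a backend-selection helper. This is NOT the
# torchtune implementation of get_distributed_backend.

# Assumed defaults: NCCL for CUDA devices, Gloo for CPU.
DEFAULT_BACKENDS = {"cuda": "nccl", "cpu": "gloo"}


def pick_backend(device_type: str, enable_cpu_offload: bool = False) -> str:
    """Return a backend string suitable for init_process_group."""
    if device_type not in DEFAULT_BACKENDS:
        raise ValueError(f"No default backend known for device type {device_type!r}")
    backend = DEFAULT_BACKENDS[device_type]
    if enable_cpu_offload and device_type != "cpu":
        # With CPU offload, collectives may also run on CPU tensors, so pair
        # the device backend with Gloo using the "device:backend" format
        # that the recipe previously hard-coded.
        return f"{device_type}:{backend},cpu:gloo"
    return backend


print(pick_backend("cuda"))
print(pick_backend("cuda", enable_cpu_offload=True))
```

With CPU offload enabled on CUDA, this sketch reproduces the string the recipe used before this change, which keeps the previous behavior for that configuration.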
diff --git a/recipes/full_finetune_multinode.slurm b/recipes/full_finetune_multinode.slurm
@@ -0,0 +1,44 @@
+#!/bin/bash
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+# ---------- SBATCH commands ---------- #
+#SBATCH --job-name=torchtune-multi-node
+#SBATCH --ntasks=2
+#SBATCH --nodes=2
+#SBATCH --gpus-per-task=8
+#SBATCH --cpus-per-task=96
+#SBATCH --partition=train
+
+# ---------- Set env variables ---------- #
+# Grab the IP for head node:
+nodes=( $( scontrol show hostnames "$SLURM_JOB_NODELIST" ) )
+# The first hostname in the allocation acts as the rendezvous head node
+head_node=${nodes[0]}
+head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
+echo Node IP: $head_node_ip
+
+# You might need to explicitly set the network interface for distributed backends:
+# export NCCL_SOCKET_IFNAME=...
+# export GLOO_SOCKET_IFNAME=...
+
+export TORCH_DIST_INIT_BARRIER=1
+export LOGLEVEL=INFO
+
+# ---------- Launch training ---------- #
+# You probably want to load in a virtual env w/ conda...
+# module load conda
+# conda activate torchtune
+# ...or venv
+# source torchtune/bin/activate
+
+SHARED_FS=/mnt/slurm # <-- Replace w/ your filesystem
+CHECKPOINT_DIR="$SHARED_FS/Llama-3.3-70B-Instruct"
+OUTPUT_DIR="$SHARED_FS/Llama3.3-70B-fft-output"
+
+# Adjust sbatch --ntasks and sbatch --nodes above and --nnodes below to your specific node count
+srun tune run --nnodes 2 --nproc_per_node 8 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" \
+    full_finetune_distributed --config llama3_3/70B_full_multinode checkpoint_dir=$CHECKPOINT_DIR output_dir=$OUTPUT_DIR
diff --git a/recipes/lora_finetune_distributed_multi_dataset.py b/recipes/lora_finetune_distributed_multi_dataset.py
@@ -138,7 +138,7 @@ def __init__(self, cfg: DictConfig) -> None:
                 "full fp16 training is not supported with this recipe. Please use bf16 or fp32 instead."
             )
 
-        _, rank = training.get_world_size_and_rank()
+        _, rank = utils.get_world_size_and_rank()
 
         self._is_rank_zero = rank == 0
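The SLURM script launches with `--nnodes 2 --nproc_per_node 8`, and `70B_full_multinode.yaml` sets `batch_size: 4`, `gradient_accumulation_steps: 1`, and `max_seq_len: 1024` with packing enabled. A small sketch of the resulting effective batch size, assuming `batch_size` is per device and every packed sample is filled to `max_seq_len` (both are assumptions, and the function name is illustrative):

```python
# Effective batch size for a multi-node run.
# Assumption: batch_size in the config is per device, so the global batch
# is multiplied by the total number of processes (one per GPU).


def effective_batch_size(
    nnodes: int, nproc_per_node: int, batch_size: int, grad_accum_steps: int
) -> int:
    """Number of samples contributing to each optimizer step."""
    world_size = nnodes * nproc_per_node
    return world_size * batch_size * grad_accum_steps


# Values taken from the SLURM script and YAML config in this diff.
samples_per_step = effective_batch_size(
    nnodes=2, nproc_per_node=8, batch_size=4, grad_accum_steps=1
)
# Upper bound on tokens per step if every packed sample reaches max_seq_len.
max_tokens_per_step = samples_per_step * 1024

print(samples_per_step, max_tokens_per_step)
```

When changing the node count, the learning rate and `gradient_accumulation_steps` may need revisiting, since the effective batch size scales with the number of processes.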