# YAML Configuration Reference: SFT and On-Policy Distillation

This document provides a comprehensive reference for configuring Supervised Fine-Tuning (SFT) and On-Policy Distillation in NeMo RL. It documents all configuration sections, keys, and available dataset names.

## Table of Contents

- [Configuration Inheritance](#configuration-inheritance)
- [SFT Configuration](#sft-configuration)
  - [Top-Level Structure](#sft-top-level-structure)
  - [Dataset Configuration](#dataset-configuration)
  - [Available Datasets](#available-datasets)
- [On-Policy Distillation Configuration](#on-policy-distillation-configuration)
  - [Top-Level Structure](#distillation-top-level-structure)
  - [Distillation-Specific Settings](#distillation-specific-settings)
- [Shared Configuration Sections](#shared-configuration-sections)
  - [Policy Configuration](#policy-configuration)
  - [Training Backend Configuration](#training-backend-configuration)
  - [Generation Configuration](#generation-configuration)
  - [Logger Configuration](#logger-configuration)
  - [Checkpointing Configuration](#checkpointing-configuration)
  - [Cluster Configuration](#cluster-configuration)
- [Multi-Node Configuration Examples](#multi-node-configuration-examples)

---

## Configuration Inheritance

NeMo RL uses a YAML-based configuration system with support for inheritance via the `defaults` key:

```yaml
# Single inheritance
defaults: parent.yaml

# Multiple inheritance (later configs override earlier ones)
defaults: [base.yaml, override.yaml]

# Variable interpolation
data:
  max_input_seq_length: ${policy.max_total_sequence_length}
```

**Key features:**
- Configs support nested inheritance (parents can have their own defaults)
- Later values override earlier ones in multiple inheritance
- Use `${path.to.value}` for variable interpolation
- Use `${mul:${val1}, ${val2}}` for multiplication in config

---

## SFT Configuration

### SFT Top-Level Structure

```yaml
# Main configuration structure (nemo_rl/algorithms/sft.py:77-83)
sft:            # SFTConfig - training algorithm settings
  ...
policy:         # PolicyConfig - model and training settings
  ...
data:           # DataConfig - dataset and data processing settings
  ...
logger:         # LoggerConfig - logging configuration
  ...
checkpointing:  # CheckpointingConfig - checkpoint management
  ...
cluster:        # ClusterConfig - compute cluster settings
  ...
```

### SFT Algorithm Settings

**Location:** `nemo_rl/algorithms/sft.py:66-74`

```yaml
sft:
  # Training duration (training stops at min of max_num_steps or max_num_epochs * dataset_length)
  max_num_epochs: 1           # Maximum number of epochs to train
  max_num_steps: 60           # Maximum number of training steps

  # Validation settings
  val_period: 10              # Run validation every N training steps
  val_batches: 8              # Number of batches to use for validation
  val_global_batch_size: 32   # Global batch size for validation
  val_micro_batch_size: 1     # Micro batch size for validation (per GPU)
  val_at_start: true          # Whether to run validation before training starts

  # Random seed for reproducibility
  seed: 42                    # Random seed
```

**Key notes:**
- Training will stop at `min(max_num_steps, max_num_epochs * len(train_dataloader))`
- Validation runs every `val_period` steps and logs validation loss
- `val_at_start: true` is useful for debugging and comparing with pretrained checkpoints

---

## Dataset Configuration

**Location:** `nemo_rl/data/__init__.py:21-43`

### DataConfig Schema

```yaml
data:
  # REQUIRED fields
  dataset_name: str             # Dataset name (see Available Datasets below)
  max_input_seq_length: int     # Maximum input sequence length (typically set to ${policy.max_total_sequence_length})
  shuffle: bool                 # Whether to shuffle training data

  # Dataset selection and splits
  val_dataset_name: str         # Validation dataset name (can differ from training dataset)
  split: str                    # HuggingFace dataset split (e.g., "train_1M", "train", "test")

  # Custom dataset paths (for ResponseDataset, BinaryPreferenceDataset, etc.)
  train_data_path: str          # Path to training data (local path or HuggingFace dataset name)
  val_data_path: str            # Path to validation data
  val_data_paths: dict[str, str]  # Multiple validation datasets: {"name1": "path1", "name2": "path2"}

  # Tokenization settings
  add_bos: bool                 # Add beginning-of-sequence token (default: true)
  add_eos: bool                 # Add end-of-sequence token (default: true)
  add_generation_prompt: bool   # Add generation prompt to chat template (default: false)
  add_system_prompt: bool       # Include system prompt in messages

  # Custom dataset keys (for ResponseDataset)
  input_key: str                # Key for input/question in custom datasets (default: "input")
  output_key: str               # Key for output/answer in custom datasets (default: "output")

  # Prompt customization
  prompt_file: str              # Path to custom prompt template file
  system_prompt_file: str       # Path to system prompt file

  # Dataset loading
  download_dir: str             # Directory to download HuggingFace datasets
  num_workers: int              # Number of DataLoader workers (default: 1)
  seed: int                     # Random seed for dataset shuffling

  # OpenAI format specific (for tool calling datasets)
  chat_key: str                 # Key for messages in OpenAI format data (default: "messages")
  system_key: str               # Key for system message (optional)
  system_prompt: str            # Default system prompt if not in data (optional)
  tool_key: str                 # Key for tools in the data (default: "tools")
  use_preserving_dataset: bool  # Use PreservingDataset to preserve heterogeneous schemas (default: false)
```

### Available Datasets

Datasets are loaded via mapping functions in `nemo_rl/data/datasets/`.
Here are all available dataset names:

#### Response Datasets (for SFT and RL)

**Mapping location:** `nemo_rl/data/datasets/response_datasets/__init__.py:36-149`

| Dataset Name | Class | Description | Source |
|-------------|-------|-------------|--------|
| `open_assistant` | OasstDataset | OpenAssistant conversational dataset | HuggingFace: `OpenAssistant/oasst2` |
| `squad` | SquadDataset | Stanford Question Answering Dataset | HuggingFace: `rajpurkar/squad` |
| `openmathinstruct2` | OpenMathInstruct2Dataset | Math instruction dataset | HuggingFace: `nvidia/OpenMathInstruct-2` |
| `OpenMathInstruct-2` | OpenMathInstruct2Dataset | Same as above (for RL training) | HuggingFace: `nvidia/OpenMathInstruct-2` |
| `DeepScaler` | DeepScalerDataset | Math reasoning dataset | HuggingFace: `ScalerLab/DeepScaleR-1M` |
| `DAPOMath17K` | DAPOMath17KDataset | DAPO Math dataset | HuggingFace: `ScalerLab/dapo-math-17k` |
| `clevr-cogent` | ClevrCogentDataset | Visual reasoning dataset | HuggingFace: `UCLA-AGI/CLEVR-COGENT` |
| `refcoco` | RefCOCODataset | Referring expression dataset | HuggingFace (via `nemo_rl.data.datasets.response_datasets.refcoco`) |
| `geometry3k` | Geometry3KDataset | Geometry problem dataset | HuggingFace: `Luckyjhg/Geo3K` |
| `tulu3_sft_mixture` | Tulu3SFTMixtureDataset | Tulu 3 SFT mixture | HuggingFace: `allenai/tulu-3-sft-mixture` |
| `HelpSteer3` | HelpSteer3Dataset | Helpfulness dataset | HuggingFace: `nvidia/HelpSteer3` |
| `openai_format` | OpenAIFormatDataset | OpenAI conversation format with tool calling support | Custom JSONL files |
| `ResponseDataset` | ResponseDataset | Generic loader for custom JSONL or HuggingFace datasets | Custom files or HuggingFace |

**Usage examples:**

```yaml
# Using a built-in dataset
data:
  dataset_name: "squad"
  max_input_seq_length: 1024
  shuffle: true

# Using OpenMathInstruct-2 with custom prompt and split
data:
  dataset_name: "openmathinstruct2"
  prompt_file: "examples/prompts/math.txt"
  split: "train_1M"
  output_key: "generated_solution"
  add_generation_prompt: true
  shuffle: true

# Using a custom dataset via ResponseDataset
data:
  dataset_name: "ResponseDataset"
  train_data_path: "/path/to/train.jsonl"   # or "hf_org/hf_dataset_name"
  val_data_path: "/path/to/val.jsonl"
  input_key: "question"
  output_key: "answer"
  shuffle: true

# Using OpenAI format with tool calling
data:
  dataset_name: "openai_format"
  train_data_path: "/path/to/train.jsonl"
  val_data_path: "/path/to/val.jsonl"
  chat_key: "messages"
  tool_key: "tools"
  use_preserving_dataset: true   # IMPORTANT for heterogeneous tool schemas
  shuffle: true
```

#### Preference Datasets (for DPO)

**Mapping location:** `nemo_rl/data/datasets/preference_datasets/__init__.py:26-79`

| Dataset Name | Class | Description |
|-------------|-------|-------------|
| `HelpSteer3` | HelpSteer3Dataset | Preference dataset from HelpSteer |
| `Tulu3Preference` | Tulu3PreferenceDataset | Tulu 3 preference dataset |
| `BinaryPreferenceDataset` | BinaryPreferenceDataset | Generic binary preference loader |
| `PreferenceDataset` | PreferenceDataset | Generic preference loader |

#### Evaluation Datasets

**Mapping location:** `nemo_rl/data/datasets/eval_datasets/__init__.py:23-98`

| Dataset Name | Class | Description |
|-------------|-------|-------------|
| `mmlu*` | MMLUDataset | MMLU benchmark (any name starting with "mmlu" except "mmlu_pro") |
| `mmlu_pro` | MMLUProDataset | MMLU-Pro benchmark |
| `aime2024` | AIMEDataset | AIME 2024 math competition |
| `gpqa` / `gpqa_diamond` | GPQADataset | Graduate-level science Q&A |
| `math` | MathDataset | MATH benchmark (test set) |
| `math500` | MathDataset | MATH-500 subset |
| *(any local file path)* | LocalMathDataset | Any path to a local evaluation file |

---

## On-Policy Distillation Configuration

### Distillation Top-Level Structure

**Location:** `nemo_rl/algorithms/distillation.py:110-121`

```yaml
# Main configuration structure
policy:         # PolicyConfig - student model configuration
  ...
teacher:        # PolicyConfig - teacher model configuration
  ...
loss_fn:        # DistillationLossConfig - loss function settings
  ...
distillation:   # DistillationConfig - distillation algorithm settings
  ...
data:           # DataConfig - dataset configuration
  ...
env:            # Environment configuration (e.g., math environment)
  ...
logger:         # LoggerConfig - logging configuration
  ...
checkpointing:  # CheckpointingConfig - checkpoint management
  ...
cluster:        # ClusterConfig - compute cluster settings
  ...
```

### Distillation-Specific Settings

**Location:** `nemo_rl/algorithms/distillation.py:73-85`

```yaml
distillation:
  # Rollout settings
  num_prompts_per_step: 128       # Number of prompts to sample per training step
  num_generations_per_prompt: 1   # Number of generations per prompt (default: 1)
  max_rollout_turns: 1            # Maximum number of conversation turns (1 for single-turn tasks like math)

  # Training duration
  max_num_steps: 1000             # Maximum number of training steps
  max_num_epochs: 10              # Maximum number of epochs

  # Validation settings
  val_batch_size: 64              # Batch size for validation
  val_period: 20                  # Run validation every N steps
  val_at_start: false             # Whether to run validation before training
  max_val_samples: 512            # Maximum number of validation samples to evaluate

  # Distillation settings
  topk_logits_k: 64               # Number of top-k logits to use from teacher

  # Random seed
  seed: 42                        # Random seed for reproducibility
```

### Loss Function Configuration

**Location:** `nemo_rl/algorithms/loss_functions.py` (DistillationLossConfig)

```yaml
loss_fn:
  kl_type: "mixed"          # KL divergence type: "forward", "reverse", or "mixed"
  mixed_kl_weight: 0.5      # Weight for forward KL when kl_type="mixed" (0.0 = full reverse, 1.0 = full forward)
  zero_outside_topk: false  # Zero out teacher logits outside top-k when calculating forward KL
```

**KL divergence types:**
- `"forward"`: KL(teacher || student) - mass-covering: pushes the student to spread probability over everything the teacher assigns mass to
- `"reverse"`: KL(student || teacher) - mode-seeking: pushes the student to concentrate on the teacher's high-probability modes, typically yielding a more confident student
- `"mixed"`: Weighted combination of forward and reverse KL

### Teacher Model Configuration

The `teacher` section uses the same structure as `policy` (see [Policy Configuration](#policy-configuration)), but typically:
- Uses a larger model (e.g., `Qwen3-32B` teaching `Qwen3-4B`)
- May have different parallelism settings (higher TP/PP for larger models)
- Shares the same tokenizer vocabulary as the student

**Important:** Student and teacher models **must have identical vocabularies**. The system validates this at startup.
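Putting this together, a minimal student/teacher pairing might look like the sketch below. The model names and parallelism values mirror the multi-node distillation recipe at the end of this document; treat them as illustrative rather than required.

```yaml
# Hedged sketch: a 32B teacher distilling into a 4B student (values illustrative).
policy:                           # student
  model_name: "Qwen/Qwen3-4B-Base"
  dtensor_cfg:
    tensor_parallel_size: 2       # small model, modest TP

teacher:
  model_name: "Qwen/Qwen3-32B"    # same model family, hence identical vocabulary
  dtensor_cfg:
    tensor_parallel_size: 8       # larger model, higher TP
```

A fuller `teacher` section, including the generation settings needed for on-policy rollouts: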
```yaml
teacher:
  model_name: "Qwen/Qwen3-32B"
  tokenizer:
    name: ${..model_name}
  # Usually requires higher parallelism
  dtensor_cfg:
    tensor_parallel_size: 8
    context_parallel_size: 1
  # Same sequence settings as student
  max_total_sequence_length: 8192
  precision: "bfloat16"
  # Teacher also needs generation config for on-policy rollouts
  generation:
    backend: "vllm"
    max_new_tokens: ${..max_total_sequence_length}
    temperature: 1.0
    ...
```

### Environment Configuration

```yaml
env:
  math:              # Math environment for math-based distillation
    num_workers: 8   # Number of parallel workers for environment evaluation
```

---

## Shared Configuration Sections

These sections are common to both SFT and distillation.

### Policy Configuration

**Location:** `nemo_rl/models/policy/__init__.py`

```yaml
policy:
  # Model settings
  model_name: "meta-llama/Llama-3.2-1B"   # HuggingFace model name
  precision: "bfloat16"                   # Precision: "float32", "bfloat16", "float16"
  max_total_sequence_length: 1024         # Maximum sequence length (input + output)

  # Tokenizer settings
  tokenizer:
    name: ${policy.model_name}      # Tokenizer name (usually same as model)
    chat_template: "default"        # Chat template: "default", null, or custom Jinja string
    chat_template_kwargs: null      # Additional kwargs for chat template (e.g., enable_thinking: true)

  # Batch sizes
  train_global_batch_size: 32   # Global batch size across all GPUs
  train_micro_batch_size: 1     # Micro batch size per GPU (gradient accumulation = global / (micro * num_gpus))

  # For distillation/RL only:
  generation_batch_size: 64     # Batch size for generation
  logprob_batch_size: 1         # Batch size for logprob calculation
  logprob_chunk_size: null      # Chunk size for logprob calculation (null = no chunking)

  # Optimization
  max_grad_norm: 1.0                     # Gradient clipping norm
  make_sequence_length_divisible_by: 1   # Pad sequences to be divisible by this (useful for tensor parallel)

  offload_optimizer_for_logprob: false   # Offload optimizer during logprob calculation to save memory

  # Optimizer (for DTensor backend)
  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 5.0e-6
      weight_decay: 0.1
      betas: [0.9, 0.98]
      eps: 1e-5
      foreach: False   # Must be False for DTensor
      fused: False     # Must be False for DTensor

  # Learning rate scheduler (for DTensor backend)
  scheduler:
    - name: "torch.optim.lr_scheduler.LinearLR"
      kwargs:
        start_factor: 0.1   # Warmup from lr * 0.1
        end_factor: 1.0     # Warmup to lr * 1.0
        total_iters: 10     # Warmup over 10 steps
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000   # Constant LR after warmup
    - milestones: [10]             # Switch from warmup to constant at step 10

  # Training backend configuration (see next sections)
  dtensor_cfg: ...    # DTensor configuration
  megatron_cfg: ...   # Megatron configuration

  # Optimization features (see next sections)
  dynamic_batching: ...   # Dynamic batching configuration
  sequence_packing: ...   # Sequence packing configuration

  # Generation settings (for distillation/RL only)
  generation: ...   # Generation configuration
```

**Chat template options:**
1. **Default tokenizer template**: `chat_template: "default"` or omit the key
2. **Passthrough (no template)**: `chat_template: null` - for pre-formatted datasets
3. **Custom Jinja template**: Provide an inline Jinja string or a path to a `.jinja` file

```yaml
# Example custom template
tokenizer:
  chat_template: "{% for message in messages %}{%- if message['role'] == 'system' %}{{'Context: ' + message['content'].strip()}}{%- elif message['role'] == 'user' %}{{' Question: ' + message['content'].strip() + ' Answer:'}}{%- elif message['role'] == 'assistant' %}{{' ' + message['content'].strip()}}{%- endif %}{% endfor %}"
```

---

## Training Backend Configuration

NeMo RL supports two training backends: **DTensor** (PyTorch-native) and **Megatron** (NVIDIA's high-performance framework).

**Backend selection:** Set `enabled: true` for one backend and `enabled: false` for the other.

### DTensor Configuration

**Location:** `nemo_rl/models/policy/__init__.py:37-52`

DTensor v2 is the default PyTorch-native distributed training backend with FSDP2, TP, CP, and SP support.

```yaml
policy:
  dtensor_cfg:
    enabled: true   # Enable DTensor backend
    _v2: true       # Use DTensor v2 (recommended)

    # Environment variables (optional)
    env_vars: {}    # Custom environment variables: {"KEY": "value"}

    # Memory optimization
    cpu_offload: false   # Offload parameters to CPU (slower but saves GPU memory)

    # Parallelism strategies
    tensor_parallel_size: 1      # Tensor parallelism degree (split weights across GPUs)
    context_parallel_size: 1     # Context parallelism degree (split sequence across GPUs)
    sequence_parallel: false     # Enable sequence parallelism (requires TP > 1)
    custom_parallel_plan: null   # Path to custom parallel plan (advanced)

    # Activation checkpointing
    activation_checkpointing: false   # Enable activation checkpointing (saves memory, slower)

    # Advanced settings
    clear_cache_every_n_steps: null   # Clear CUDA cache every N steps (null = disabled)

    # LoRA configuration (optional)
    lora_cfg:
      enabled: false             # Enable LoRA fine-tuning
      target_modules: []         # Module names to apply LoRA (empty = all linear if match_all_linear=true)
      exclude_modules: []        # Module names to exclude from LoRA
      match_all_linear: true     # Apply LoRA to all linear layers
      dim: 8                     # LoRA rank (r): lower = fewer params, less capacity
      alpha: 32                  # LoRA scaling factor (effective lr multiplier = alpha/dim)
      dropout: 0.0               # Dropout probability for LoRA layers
      dropout_position: "post"   # Dropout position: "pre" or "post"
      lora_A_init: "xavier"      # LoRA A matrix initialization: "xavier" or "uniform"
      use_triton: true           # Use Triton-optimized kernels (disable for TP > 1)
```

**Parallelism guidelines:**
- **Tensor Parallel (TP)**: Splits model weights across GPUs. Use for large models that don't fit on a single GPU. Requires high-bandwidth inter-GPU communication (NVLink).
- **Context Parallel (CP)**: Splits the sequence dimension across GPUs. Use for very long sequences. Requires Ring Attention.
- **Sequence Parallel (SP)**: Splits activations along the sequence dimension. Requires TP > 1. **Incompatible with sequence packing** for some models.
- **FSDP2**: Automatically enabled for data parallelism (shards optimizer states and gradients).

**LoRA notes:**
- LoRA is supported with DTensor v2 and Megatron backends
- DTensor v1 does not support LoRA
- Set `use_triton: false` when `tensor_parallel_size > 1` due to Automodel limitations

### Megatron Configuration

**Location:** `nemo_rl/models/policy/__init__.py:115-150`

Megatron is NVIDIA's high-performance training framework supporting 6D parallelism (TP/PP/CP/SP/EP/FSDP).
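Switching backends amounts to flipping the two `enabled` flags. A minimal sketch (parallelism values are illustrative):

```yaml
# Hedged sketch: select Megatron instead of DTensor (values illustrative).
policy:
  dtensor_cfg:
    enabled: false                  # turn DTensor off
  megatron_cfg:
    enabled: true                   # turn Megatron on
    tensor_model_parallel_size: 2   # e.g., TP=2 on an 8-GPU node leaves DP=4
```

The full schema: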
```yaml
policy:
  megatron_cfg:
    enabled: true   # Enable Megatron backend

    # Environment variables (optional)
    env_vars: {}    # Custom environment variables

    # Memory management
    empty_unused_memory_level: 1   # GPU memory cleanup level (0=none, 1=after training, 2=aggressive)

    # Parallelism strategies
    tensor_model_parallel_size: 1     # Tensor parallelism degree
    pipeline_model_parallel_size: 1   # Pipeline parallelism degree (split layers across GPUs)
    expert_tensor_parallel_size: 1    # Expert tensor parallelism (for MoE models)
    expert_model_parallel_size: 1     # Expert model parallelism (for MoE models)
    context_parallel_size: 1          # Context parallelism degree
    sequence_parallel: false          # Enable sequence parallelism (requires TP > 1)

    # Pipeline parallel layer distribution (optional)
    num_layers_in_first_pipeline_stage: null   # Number of layers in first PP stage
    num_layers_in_last_pipeline_stage: null    # Number of layers in last PP stage

    # Precision
    pipeline_dtype: ${policy.precision}   # Data type for pipeline parallel communication

    # Activation checkpointing
    activation_checkpointing: false   # Enable activation checkpointing

    # MoE-specific settings
    freeze_moe_router: false                     # Freeze MoE router during training
    moe_router_dtype: null                       # Router precision (null, "fp32", "fp64")
    moe_router_load_balancing_type: "aux_loss"   # Load balancing: "aux_loss", "seq_aux_loss", "none"
    moe_router_bias_update_rate: 1e-3            # Router bias update rate
    moe_permute_fusion: false                    # Enable MoE permute fusion
    moe_per_layer_logging: false                 # Log MoE metrics per layer

    # Performance optimizations
    apply_rope_fusion: true        # RoPE fusion (~20% speedup with sequence packing)
    bias_activation_fusion: true   # Bias+activation fusion (~25% speedup with packing + RoPE fusion)
    defer_fp32_logits: false       # Defer logit casting to fp32 (required if using logprob_chunk_size)

    # Checkpointing
    force_overwrite_initial_ckpt: false   # Force overwrite of initial Megatron checkpoint

    # Optimizer settings
    optimizer:
      optimizer: "adam"   # Optimizer type: "adam" or "sgd"
      lr: 5.0e-6          # Learning rate
      min_lr: 4.9999e-6   # Minimum learning rate (for schedulers)
      weight_decay: 0.1   # Weight decay

      # Precision
      bf16: false               # Use bf16 for optimizer states
      fp16: false               # Use fp16 for optimizer states
      params_dtype: "float32"   # Parameter dtype

      # Adam settings
      adam_beta1: 0.9
      adam_beta2: 0.98
      adam_eps: 1e-5

      # SGD settings
      sgd_momentum: 0.9

      # Distributed optimizer
      use_distributed_optimizer: true       # Distribute optimizer states across GPUs
      use_precision_aware_optimizer: true   # Use precision-aware optimizer

      # Gradient clipping
      clip_grad: ${policy.max_grad_norm}

      # CPU offload
      optimizer_cpu_offload: false      # Offload optimizer to CPU
      optimizer_offload_fraction: 0.0   # Fraction of parameters to offload (0.0-1.0)

    # Learning rate scheduler
    scheduler:
      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      weight_decay_incr_style: "constant"   # Weight decay schedule: "constant" or "linear"
      lr_decay_style: "constant"            # LR schedule: "constant", "linear", "cosine"
      lr_decay_iters: 1000                  # Number of iterations for LR decay
      lr_warmup_iters: 50                   # Number of warmup iterations
      lr_warmup_init: 4.9999e-6             # Initial LR for warmup

    # Distributed data parallel settings
    distributed_data_parallel_config:
      grad_reduce_in_fp32: false   # Reduce gradients in fp32
      overlap_grad_reduce: true    # Overlap gradient reduction with backward pass
      overlap_param_gather: true   # Overlap parameter gathering
      use_custom_fsdp: false       # Use custom FSDP implementation
      data_parallel_sharding_strategy: "optim_grads_params"   # FSDP sharding strategy
```

**Parallelism guidelines:**
- **Tensor Parallel (TP)**: Same as DTensor. Typical values: 1, 2, 4, 8
- **Pipeline Parallel (PP)**: Splits model layers across GPUs. Use for very large models. Creates pipeline bubbles (lower efficiency).
- **Context Parallel (CP)**: Same as DTensor. Use for long sequences.
- **Expert Parallel (EP)**: Splits MoE experts across GPUs (for MoE models only)
- **Data Parallel (DP)**: Automatically enabled via FSDP

---

### Sequence Packing Configuration

**Location:** `nemo_rl/models/policy/__init__.py:55-64`

Sequence packing concatenates multiple short sequences into longer sequences to improve GPU utilization.

```yaml
policy:
  sequence_packing:
    enabled: false          # Enable sequence packing
    train_mb_tokens: 1024   # Target tokens per micro-batch (usually max_seq_len * micro_batch_size)
    logprob_mb_tokens: 1024 # Target tokens per logprob micro-batch (distillation/RL only)
    algorithm: "modified_first_fit_decreasing"   # Packing algorithm
    sequence_length_round: 64   # Round sequence lengths to multiples of this value
```

**Notes:**
- Sequence packing can provide **significant speedups** (2-5x) for datasets with variable-length sequences
- **Incompatible with DTensor sequence parallelism (SP)** for some models (see [issue #1178](https://github.com/NVIDIA-NeMo/RL/issues/1178))
- For Megatron, enable `apply_rope_fusion: true` and `bias_activation_fusion: true` for an additional 20-50% speedup

### Dynamic Batching Configuration

**Location:** Same as sequence packing

Dynamic batching adjusts the batch size based on sequence length to maintain a consistent token count per batch.

```yaml
policy:
  dynamic_batching:
    enabled: false            # Enable dynamic batching
    train_mb_tokens: 1024     # Target tokens per micro-batch
    logprob_mb_tokens: 1024   # Target tokens per logprob micro-batch (distillation/RL only)
    sequence_length_round: 64 # Round sequence lengths to multiples of this value
```

**Note:** Use either `dynamic_batching` OR `sequence_packing`, not both.

---

## Generation Configuration

**Location:** `nemo_rl/models/generation/interfaces.py:118-131`

Generation configuration is only required for distillation and RL algorithms (not SFT).
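For orientation before the full schema, a minimal colocated vLLM setup might look like the following sketch (values illustrative; every key used here is documented below):

```yaml
# Hedged sketch: minimal colocated vLLM generation config (values illustrative).
policy:
  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.6   # leave headroom for the colocated trainer
    colocated:
      enabled: true                 # share the training GPUs
```

The full set of keys: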
```yaml
policy:
  generation:
    backend: "vllm"   # Generation backend: "vllm" or "megatron"

    # Generation parameters
    max_new_tokens: 8192   # Maximum new tokens to generate (usually ${..max_total_sequence_length})
    temperature: 1.0       # Sampling temperature (higher = more random)
    top_p: 1.0             # Nucleus sampling top-p (1.0 = disabled)
    top_k: null            # Top-k sampling (null = disabled)

    # Stopping criteria
    stop_token_ids: null   # List of token IDs to stop generation
    stop_strings: null     # List of strings to stop generation

    # vLLM-specific configuration
    vllm_cfg:
      # Parallelism
      tensor_parallel_size: 1     # vLLM tensor parallelism
      pipeline_parallel_size: 1   # vLLM pipeline parallelism
      expert_parallel_size: 1     # vLLM expert parallelism (for MoE)

      # Memory settings
      gpu_memory_utilization: 0.6   # Fraction of GPU memory to use (0.0-1.0)
      max_model_len: 8192           # Maximum sequence length
      kv_cache_dtype: "auto"        # KV cache precision: "auto", "fp8", "fp8_e4m3"

      # Precision
      precision: ${...precision}   # Model precision

      # Performance
      enforce_eager: false          # Disable CUDA graphs (slower but more compatible)
      use_deep_gemm: false          # Use DeepGEMM optimization
      num_last_layers_in_bf16: 0    # Number of last layers in bf16 (FP8 mixed precision)
      num_first_layers_in_bf16: 0   # Number of first layers in bf16 (FP8 mixed precision)

      # Advanced
      async_engine: false                  # Use async vLLM engine
      distributed_executor_backend: null   # Distributed executor: null, "ray", "mp"

    # Colocation settings (whether generation shares training GPUs)
    colocated:
      enabled: true           # true = share training GPUs, false = dedicated generation resources
      resources:              # Only used when enabled=false
        gpus_per_node: null   # GPUs per node for generation (when cluster.num_nodes==1)
        num_nodes: null       # Nodes for generation
```

**Colocation modes:**
- `colocated.enabled: true`: Generation shares training GPUs (memory-efficient, but training pauses during generation)
- `colocated.enabled: false`: Dedicated generation resources (faster RL, but requires more GPUs)

**Backend selection:**
- **vLLM**: Fast, memory-efficient inference. Recommended for most cases. Requires weight conversion from training format.
- **Megatron**: Native Megatron inference. No weight conversion needed. Use for very large models or FP8 training.
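For example, a hedged sketch of a non-colocated setup that reserves dedicated GPUs for vLLM (resource numbers illustrative):

```yaml
# Hedged sketch: dedicated generation resources (values illustrative).
policy:
  generation:
    backend: "vllm"
    colocated:
      enabled: false        # generation gets its own GPUs
      resources:
        gpus_per_node: 8    # per-node GPUs reserved for generation
        num_nodes: 1        # number of dedicated generation nodes
```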
---

## Logger Configuration

**Location:** `nemo_rl/utils/logger.py:77-89`

```yaml
logger:
  log_dir: "logs"   # Base directory for all logs

  # Logger backends (enable/disable)
  wandb_enabled: true         # Enable Weights & Biases logging
  tensorboard_enabled: true   # Enable TensorBoard logging
  mlflow_enabled: false       # Enable MLflow logging
  swanlab_enabled: false      # Enable SwanLab logging

  # GPU monitoring
  monitor_gpus: true   # Monitor and log GPU usage metrics

  # Sampling (how many validation samples to print)
  num_val_samples_to_print: 5   # Number of validation samples to print to console

  # Weights & Biases configuration
  wandb:
    project: "sft-dev"                     # W&B project name
    name: "sft-dev-${data.dataset_name}"   # W&B run name (supports interpolation)

  # SwanLab configuration
  swanlab:
    project: "sft-dev"
    name: "sft-dev-${data.dataset_name}"

  # TensorBoard configuration
  tensorboard:
    log_dir: "tb_logs-sft-dev-${data.dataset_name}"   # TensorBoard log directory

  # MLflow configuration
  mlflow:
    experiment_name: "sft-dev"                 # MLflow experiment name
    run_name: "sft-dev-${data.dataset_name}"   # MLflow run name
    tracking_uri: null                         # MLflow tracking URI (null = local)
    artifact_location: null                    # Artifact storage location

  # GPU monitoring settings
  gpu_monitoring:
    collection_interval: 10   # Collect GPU metrics every N seconds
    flush_interval: 10        # Flush GPU metrics to loggers every N seconds
```

**Notes:**
- Multiple loggers can be enabled simultaneously
- W&B requires `wandb login` before running
- Variable interpolation works in all string fields (e.g., `${data.dataset_name}`)

---

## Checkpointing Configuration

**Location:** `nemo_rl/utils/checkpoint.py:36-67`

```yaml
checkpointing:
  enabled: true                   # Enable checkpointing
  checkpoint_dir: "results/sft"   # Directory to save checkpoints

  # Checkpoint selection
  metric_name: "val:val_loss"   # Metric to track for best checkpoint (format: "val:" or "train:")
  higher_is_better: false       # Whether higher metric values are better

  # Checkpoint retention
  keep_top_k: 3                     # Number of best checkpoints to keep (null = keep all)
  save_period: 10                   # Save checkpoint every N steps
  checkpoint_must_save_by: null     # Force save by this step (null = disabled)

  # Model saving format (for DTensor v2 / Megatron)
  model_save_format: "safetensors"   # Format: "safetensors" or "torch_save" (null for DTensor v1)
  save_consolidated: false           # Save HuggingFace-compatible consolidated checkpoint
  model_cache_dir: ""                # Model cache directory
  model_repo_id: ""                  # HuggingFace repository ID

  # PEFT support
  is_peft: false      # Whether model uses PEFT (LoRA, etc.)
  peft_config: null   # PEFT configuration
```

**Checkpoint structure:**
```
checkpoint_dir/
  step_0/
    training_info.json    # Training state (epoch, step, metrics)
    config.yaml           # Full config used for this run
    train_dataloader.pt   # DataLoader state
    policy/
      weights/            # Model weights
      optimizer/          # Optimizer state
  step_10/
    ...
  step_20/
    ...
```

**Metric name format:**
- `"val:"` for validation metrics (e.g., `"val:val_loss"`, `"val:accuracy"`)
- `"train:"` for training metrics (e.g., `"train:loss"`)

---

## Cluster Configuration

**Location:** `nemo_rl/distributed/virtual_cluster.py:33-35`

```yaml
cluster:
  gpus_per_node: 8   # Number of GPUs per node
  num_nodes: 1       # Number of nodes
```

**Multi-node setup:**
- Requires Ray cluster setup (see [cluster documentation](cluster.md))
- Total GPUs = `gpus_per_node * num_nodes`
- Data parallel degree = Total GPUs / (TP * PP * CP * EP)

---

## Multi-Node Configuration Examples

### Multi-Node SFT with Megatron (2 nodes, 16 GPUs)

**File:** `examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml`

```yaml
defaults: ../../sft.yaml   # Inherit from base SFT config

sft:
  max_num_steps: 80

policy:
  model_name: Qwen/Qwen2.5-Math-7B
  train_global_batch_size: 512
  max_total_sequence_length: 16384

  # Disable DTensor, enable Megatron
  dtensor_cfg:
    enabled: false

  megatron_cfg:
    enabled: true
    tensor_model_parallel_size: 4   # TP=4: split model across 4 GPUs
    context_parallel_size: 2        # CP=2: split sequence across 2 GPUs
    sequence_parallel: true         # Enable sequence parallelism

    # MoE router settings (no effect on dense models such as Qwen2.5-Math)
    freeze_moe_router: true
    moe_router_dtype: fp64
    moe_router_bias_update_rate: 0.0
    moe_permute_fusion: true

    optimizer:
      lr: 1.0e-06
      bf16: true
      adam_beta2: 0.999
      use_distributed_optimizer: false

  # Enable sequence packing for efficiency
  sequence_packing:
    enabled: true

  make_sequence_length_divisible_by: 32

data:
  dataset_name: openmathinstruct2
  prompt_file: examples/prompts/math.txt
  split: train_1M
  add_generation_prompt: true
  output_key: generated_solution
  num_workers: 8

cluster:
  gpus_per_node: 8   # 8 GPUs per node
  num_nodes: 2       # 2 nodes = 16 GPUs total
```

**Parallelism breakdown:**
- Total GPUs: 16
- TP: 4, CP: 2, TP*CP: 8
- Data parallel degree: 16 / 8 = 2
- Each DP rank processes: global_batch_size / DP = 512 / 2 = 256 samples

### Multi-Node Distillation (2 nodes, 16 GPUs)

**File:** `examples/configs/recipes/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-seqpack.v1.yaml`

```yaml
defaults: ../../distillation_math.yaml   # Inherit from base distillation config

distillation:
  num_prompts_per_step: 64
  max_num_steps: 20
  val_batch_size: 256
  val_period: 10

loss_fn:
  kl_type: reverse   # Use reverse KL, i.e., KL(student || teacher)

# Student model configuration
policy:
  model_name: Qwen/Qwen3-4B-Base
  dtensor_cfg:
    tensor_parallel_size: 2   # TP=2 for student
    context_parallel_size: 1

  dynamic_batching:
    enabled: false

  sequence_packing:
    enabled: true   # Enable sequence packing

  make_sequence_length_divisible_by: 2

# Teacher model configuration
teacher:
  model_name: Qwen/Qwen3-32B
  dtensor_cfg:
    tensor_parallel_size: 8   # TP=8 for larger teacher
    context_parallel_size: 1

  dynamic_batching:
    enabled: false

  sequence_packing:
    enabled: true   # Enable sequence packing

  make_sequence_length_divisible_by: 2

cluster:
  gpus_per_node: 8
  num_nodes: 2   # 2 nodes = 16 GPUs total
```

**Parallelism breakdown:**
- Student: TP=2, DP=8 (16 GPUs / 2)
- Teacher: TP=8, DP=2 (16 GPUs / 8)
- Student and teacher run in different Ray placement groups

---

## Additional Resources

- **Guides:**
  - [SFT Guide](guides/sft.md)
  - [Distillation Guide](guides/distillation.md) (if available)
  - [GRPO Guide](guides/grpo.md)
  - [DPO Guide](guides/dpo.md)

- **Design Docs:**
  - [Training Backends](design-docs/training-backends.md)
  - [Generation Backends](design-docs/generation.md)
  - [Sequence Packing](design-docs/sequence-packing-and-dynamic-batching.md)
  - [Chat Datasets](design-docs/chat-datasets.md)

- **Examples:**
  - `examples/configs/` - Configuration files
  - `examples/run_sft.py` - SFT training script
  - `examples/run_distillation.py` - Distillation training script

---

## Quick Reference: Config File Locations

| Config Type | Code Location | Example Config |
|------------|---------------|----------------|
| SFTConfig | `nemo_rl/algorithms/sft.py:66-74` | `examples/configs/sft.yaml` |
| DistillationConfig | `nemo_rl/algorithms/distillation.py:73-85` | `examples/configs/distillation_math.yaml` |
| DataConfig | `nemo_rl/data/__init__.py:21-43` | (embedded in algorithm configs) |
| PolicyConfig | `nemo_rl/models/policy/__init__.py` | (embedded in algorithm configs) |
| DTensorConfig | `nemo_rl/models/policy/__init__.py:37-52` | `examples/configs/sft.yaml` |
| MegatronConfig | `nemo_rl/models/policy/__init__.py:115-150` | `examples/configs/recipes/llm/sft-*-megatron.yaml` |
| GenerationConfig | `nemo_rl/models/generation/interfaces.py:118-131` | `examples/configs/distillation_math.yaml` |
| VllmConfig | `nemo_rl/models/generation/vllm/config.py:41-43` | `examples/configs/distillation_math.yaml` |
| LoggerConfig | `nemo_rl/utils/logger.py:77-89` | (all example configs) |
| CheckpointingConfig | `nemo_rl/utils/checkpoint.py:36-67` | (all example configs) |
| ClusterConfig | `nemo_rl/distributed/virtual_cluster.py:33-35` | (all example configs) |

---

## Dataset Mapping Code Locations

| Dataset Type | Mapping Function Location |
|-------------|---------------------------|
| Response Datasets (SFT/RL) | `nemo_rl/data/datasets/response_datasets/__init__.py:36-149` |
| Preference Datasets (DPO) | `nemo_rl/data/datasets/preference_datasets/__init__.py:26-79` |
| Eval Datasets | `nemo_rl/data/datasets/eval_datasets/__init__.py:23-98` |