# YAML Configuration Reference: SFT and On-Policy Distillation

This document provides a comprehensive reference for configuring Supervised Fine-Tuning (SFT) and On-Policy Distillation in NeMo RL. It documents all configuration sections, keys, and available dataset names.

## Table of Contents

- [Configuration Inheritance](#configuration-inheritance)
- [SFT Configuration](#sft-configuration)
  - [Top-Level Structure](#sft-top-level-structure)
  - [Dataset Configuration](#dataset-configuration)
  - [Available Datasets](#available-datasets)
- [On-Policy Distillation Configuration](#on-policy-distillation-configuration)
  - [Top-Level Structure](#distillation-top-level-structure)
  - [Distillation-Specific Settings](#distillation-specific-settings)
- [Shared Configuration Sections](#shared-configuration-sections)
  - [Policy Configuration](#policy-configuration)
  - [Training Backend Configuration](#training-backend-configuration)
  - [Generation Configuration](#generation-configuration)
  - [Logger Configuration](#logger-configuration)
  - [Checkpointing Configuration](#checkpointing-configuration)
  - [Cluster Configuration](#cluster-configuration)
- [Multi-Node Configuration Examples](#multi-node-configuration-examples)

---

## Configuration Inheritance

NeMo RL uses a YAML-based configuration system with support for inheritance via the `defaults` key:

```yaml
# Single inheritance
defaults: parent.yaml

# Multiple inheritance (later configs override earlier ones)
defaults: [base.yaml, override.yaml]

# Variable interpolation
data:
  max_input_seq_length: ${policy.max_total_sequence_length}
```

**Key features:**
- Configs support nested inheritance (parents can have their own defaults)
- Later values override earlier ones in multiple inheritance
- Use `${path.to.value}` for variable interpolation
- Use `${mul:${val1}, ${val2}}` for multiplication in config

---

## SFT Configuration

### SFT Top-Level Structure

```yaml
# Main configuration structure (nemo_rl/algorithms/sft.py:77-83)
sft:            # SFTConfig - training algorithm settings
  ...
policy:         # PolicyConfig - model and training settings
  ...
data:           # DataConfig - dataset and data processing settings
  ...
logger:         # LoggerConfig - logging configuration
  ...
checkpointing:  # CheckpointingConfig - checkpoint management
  ...
cluster:        # ClusterConfig - compute cluster settings
  ...
```

### SFT Algorithm Settings

**Location:** `nemo_rl/algorithms/sft.py:66-74`

```yaml
sft:
  # Training duration (training stops at min of max_num_steps or max_num_epochs * dataset_length)
  max_num_epochs: 1           # Maximum number of epochs to train
  max_num_steps: 60           # Maximum number of training steps

  # Validation settings
  val_period: 10              # Run validation every N training steps
  val_batches: 8              # Number of batches to use for validation
  val_global_batch_size: 32   # Global batch size for validation
  val_micro_batch_size: 1     # Micro batch size for validation (per GPU)
  val_at_start: true          # Whether to run validation before training starts

  # Random seed for reproducibility
  seed: 42                    # Random seed
```

**Key notes:**
- Training will stop at `min(max_num_steps, max_num_epochs * len(train_dataloader))`
- Validation runs every `val_period` steps and logs validation loss
- `val_at_start: true` is useful for debugging and comparing with pretrained checkpoints

---

## Dataset Configuration

**Location:** `nemo_rl/data/__init__.py:21-43`

### DataConfig Schema

```yaml
data:
  # REQUIRED fields
  dataset_name: str             # Dataset name (see Available Datasets below)
  max_input_seq_length: int     # Maximum input sequence length (typically set to ${policy.max_total_sequence_length})
  shuffle: bool                 # Whether to shuffle training data

  # Dataset selection and splits
  val_dataset_name: str         # Validation dataset name (can differ from training dataset)
  split: str                    # HuggingFace dataset split (e.g., "train_1M", "train", "test")

  # Custom dataset paths (for ResponseDataset, BinaryPreferenceDataset, etc.)
  train_data_path: str          # Path to training data (local path or HuggingFace dataset name)
  val_data_path: str            # Path to validation data
  val_data_paths: dict[str, str]  # Multiple validation datasets: {"name1": "path1", "name2": "path2"}

  # Tokenization settings
  add_bos: bool                 # Add beginning-of-sequence token (default: true)
  add_eos: bool                 # Add end-of-sequence token (default: true)
  add_generation_prompt: bool   # Add generation prompt to chat template (default: false)
  add_system_prompt: bool       # Include system prompt in messages

  # Custom dataset keys (for ResponseDataset)
  input_key: str                # Key for input/question in custom datasets (default: "input")
  output_key: str               # Key for output/answer in custom datasets (default: "output")

  # Prompt customization
  prompt_file: str              # Path to custom prompt template file
  system_prompt_file: str       # Path to system prompt file

  # Dataset loading
  download_dir: str             # Directory to download HuggingFace datasets
  num_workers: int              # Number of DataLoader workers (default: 1)
  seed: int                     # Random seed for dataset shuffling

  # OpenAI format specific (for tool calling datasets)
  chat_key: str                 # Key for messages in OpenAI format data (default: "messages")
  system_key: str               # Key for system message (optional)
  system_prompt: str            # Default system prompt if not in data (optional)
  tool_key: str                 # Key for tools in the data (default: "tools")
  use_preserving_dataset: bool  # Use PreservingDataset to preserve heterogeneous schemas (default: false)
```

### Available Datasets

Datasets are loaded via mapping functions in `nemo_rl/data/datasets/`.
Here are all available dataset names:

#### Response Datasets (for SFT and RL)

**Mapping location:** `nemo_rl/data/datasets/response_datasets/__init__.py:36-149`

| Dataset Name | Class | Description | Source |
|-------------|-------|-------------|--------|
| `open_assistant` | OasstDataset | OpenAssistant conversational dataset | HuggingFace: `OpenAssistant/oasst2` |
| `squad` | SquadDataset | Stanford Question Answering Dataset | HuggingFace: `rajpurkar/squad` |
| `openmathinstruct2` | OpenMathInstruct2Dataset | Math instruction dataset | HuggingFace: `nvidia/OpenMathInstruct-2` |
| `OpenMathInstruct-2` | OpenMathInstruct2Dataset | Same as above (for RL training) | HuggingFace: `nvidia/OpenMathInstruct-2` |
| `DeepScaler` | DeepScalerDataset | Math reasoning dataset | HuggingFace: `ScalerLab/DeepScaleR-1M` |
| `DAPOMath17K` | DAPOMath17KDataset | DAPO Math dataset | HuggingFace: `ScalerLab/dapo-math-17k` |
| `clevr-cogent` | ClevrCogentDataset | Visual reasoning dataset | HuggingFace: `UCLA-AGI/CLEVR-COGENT` |
| `refcoco` | RefCOCODataset | Referring expression dataset | HuggingFace (via `nemo_rl.data.datasets.response_datasets.refcoco`) |
| `geometry3k` | Geometry3KDataset | Geometry problem dataset | HuggingFace: `Luckyjhg/Geo3K` |
| `tulu3_sft_mixture` | Tulu3SFTMixtureDataset | Tulu 3 SFT mixture | HuggingFace: `allenai/tulu-3-sft-mixture` |
| `HelpSteer3` | HelpSteer3Dataset | Helpfulness dataset | HuggingFace: `nvidia/HelpSteer3` |
| `openai_format` | OpenAIFormatDataset | OpenAI conversation format with tool calling support | Custom JSONL files |
| `ResponseDataset` | ResponseDataset | Generic loader for custom JSONL or HuggingFace datasets | Custom files or HuggingFace |

**Usage examples:**

```yaml
# Using a built-in dataset
data:
  dataset_name: "squad"
  max_input_seq_length: 1024
  shuffle: true

# Using OpenMathInstruct-2 with custom prompt and split
data:
  dataset_name: "openmathinstruct2"
  prompt_file: "examples/prompts/math.txt"
  split: "train_1M"
  output_key: "generated_solution"
  add_generation_prompt: true
  shuffle: true

# Using a custom dataset via ResponseDataset
data:
  dataset_name: "ResponseDataset"
  train_data_path: "/path/to/train.jsonl"   # or "hf_org/hf_dataset_name"
  val_data_path: "/path/to/val.jsonl"
  input_key: "question"
  output_key: "answer"
  shuffle: true

# Using OpenAI format with tool calling
data:
  dataset_name: "openai_format"
  train_data_path: "/path/to/train.jsonl"
  val_data_path: "/path/to/val.jsonl"
  chat_key: "messages"
  tool_key: "tools"
  use_preserving_dataset: true   # IMPORTANT for heterogeneous tool schemas
  shuffle: true
```

#### Preference Datasets (for DPO)

**Mapping location:** `nemo_rl/data/datasets/preference_datasets/__init__.py:26-79`

| Dataset Name | Class | Description |
|-------------|-------|-------------|
| `HelpSteer3` | HelpSteer3Dataset | Preference dataset from HelpSteer |
| `Tulu3Preference` | Tulu3PreferenceDataset | Tulu 3 preference dataset |
| `BinaryPreferenceDataset` | BinaryPreferenceDataset | Generic binary preference loader |
| `PreferenceDataset` | PreferenceDataset | Generic preference loader |

#### Evaluation Datasets

**Mapping location:** `nemo_rl/data/datasets/eval_datasets/__init__.py:23-98`

| Dataset Name | Class | Description |
|-------------|-------|-------------|
| `mmlu*` | MMLUDataset | MMLU benchmark (any name starting with "mmlu" except "mmlu_pro") |
| `mmlu_pro` | MMLUProDataset | MMLU-Pro benchmark |
| `aime2024` | AIMEDataset | AIME 2024 math competition |
| `gpqa` / `gpqa_diamond` | GPQADataset | Graduate-level science Q&A |
| `math` | MathDataset | MATH benchmark (test set) |
| `math500` | MathDataset | MATH-500 subset |
| *(any local file path)* | LocalMathDataset | Any path to a local evaluation file |

---

## On-Policy Distillation Configuration

### Distillation Top-Level Structure

**Location:** `nemo_rl/algorithms/distillation.py:110-121`

```yaml
# Main configuration structure
policy:         # PolicyConfig - student model configuration
  ...
teacher:        # PolicyConfig - teacher model configuration
  ...
loss_fn:        # DistillationLossConfig - loss function settings
  ...
distillation:   # DistillationConfig - distillation algorithm settings
  ...
data:           # DataConfig - dataset configuration
  ...
env:            # Environment configuration (e.g., math environment)
  ...
logger:         # LoggerConfig - logging configuration
  ...
checkpointing:  # CheckpointingConfig - checkpoint management
  ...
cluster:        # ClusterConfig - compute cluster settings
  ...
```

### Distillation-Specific Settings

**Location:** `nemo_rl/algorithms/distillation.py:73-85`

```yaml
distillation:
  # Rollout settings
  num_prompts_per_step: 128       # Number of prompts to sample per training step
  num_generations_per_prompt: 1   # Number of generations per prompt (default: 1)
  max_rollout_turns: 1            # Maximum number of conversation turns (1 for single-turn tasks like math)

  # Training duration
  max_num_steps: 1000             # Maximum number of training steps
  max_num_epochs: 10              # Maximum number of epochs

  # Validation settings
  val_batch_size: 64              # Batch size for validation
  val_period: 20                  # Run validation every N steps
  val_at_start: false             # Whether to run validation before training
  max_val_samples: 512            # Maximum number of validation samples to evaluate

  # Distillation settings
  topk_logits_k: 64               # Number of top-k logits to use from teacher

  # Random seed
  seed: 42                        # Random seed for reproducibility
```

### Loss Function Configuration

**Location:** `nemo_rl/algorithms/loss_functions.py` (DistillationLossConfig)

```yaml
loss_fn:
  kl_type: "mixed"          # KL divergence type: "forward", "reverse", or "mixed"
  mixed_kl_weight: 0.5      # Weight for forward KL when kl_type="mixed" (0.0 = full reverse, 1.0 = full forward)
  zero_outside_topk: false  # Zero out teacher logits outside top-k when calculating forward KL
```

**KL divergence types:**
- `"forward"`: KL(teacher || student) - mass-covering: pushes the student to spread probability over everything the teacher assigns mass to
- `"reverse"`: KL(student || teacher) - mode-seeking: pushes the student to concentrate on the teacher's high-probability modes, typically yielding a more confident student
- `"mixed"`: Weighted combination of forward and reverse KL

### Teacher Model Configuration

The `teacher` section uses the same structure as `policy` (see [Policy Configuration](#policy-configuration)), but typically:
- Uses a larger model (e.g., `Qwen3-32B` teaching `Qwen3-4B`)
- May have different parallelism settings (higher TP/PP for larger models)
- Shares the same tokenizer vocabulary as the student

**Important:** Student and teacher models **must have identical vocabularies**. The system validates this at startup.
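Putting this together, a minimal student/teacher pairing might look like the sketch below. The model names and parallelism values mirror the multi-node distillation recipe at the end of this document; treat them as illustrative rather than required.

```yaml
# Hedged sketch: a 32B teacher distilling into a 4B student (values illustrative).
policy:                           # student
  model_name: "Qwen/Qwen3-4B-Base"
  dtensor_cfg:
    tensor_parallel_size: 2       # small model, modest TP

teacher:
  model_name: "Qwen/Qwen3-32B"    # same model family, hence identical vocabulary
  dtensor_cfg:
    tensor_parallel_size: 8       # larger model, higher TP
```

A fuller `teacher` section, including the generation settings needed for on-policy rollouts: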
```yaml
teacher:
  model_name: "Qwen/Qwen3-32B"
  tokenizer:
    name: ${..model_name}
  # Usually requires higher parallelism
  dtensor_cfg:
    tensor_parallel_size: 8
    context_parallel_size: 1
  # Same sequence settings as student
  max_total_sequence_length: 8192
  precision: "bfloat16"
  # Teacher also needs generation config for on-policy rollouts
  generation:
    backend: "vllm"
    max_new_tokens: ${..max_total_sequence_length}
    temperature: 1.0
    ...
```

### Environment Configuration

```yaml
env:
  math:              # Math environment for math-based distillation
    num_workers: 8   # Number of parallel workers for environment evaluation
```

---

## Shared Configuration Sections

These sections are common to both SFT and distillation.

### Policy Configuration

**Location:** `nemo_rl/models/policy/__init__.py`

```yaml
policy:
  # Model settings
  model_name: "meta-llama/Llama-3.2-1B"   # HuggingFace model name
  precision: "bfloat16"                   # Precision: "float32", "bfloat16", "float16"
  max_total_sequence_length: 1024         # Maximum sequence length (input + output)

  # Tokenizer settings
  tokenizer:
    name: ${policy.model_name}      # Tokenizer name (usually same as model)
    chat_template: "default"        # Chat template: "default", null, or custom Jinja string
    chat_template_kwargs: null      # Additional kwargs for chat template (e.g., enable_thinking: true)

  # Batch sizes
  train_global_batch_size: 32   # Global batch size across all GPUs
  train_micro_batch_size: 1     # Micro batch size per GPU (gradient accumulation = global / (micro * num_gpus))

  # For distillation/RL only:
  generation_batch_size: 64     # Batch size for generation
  logprob_batch_size: 1         # Batch size for logprob calculation
  logprob_chunk_size: null      # Chunk size for logprob calculation (null = no chunking)

  # Optimization
  max_grad_norm: 1.0                     # Gradient clipping norm
  make_sequence_length_divisible_by: 1   # Pad sequences to be divisible by this (useful for tensor parallel)

  offload_optimizer_for_logprob: false   # Offload optimizer during logprob calculation to save memory

  # Optimizer (for DTensor backend)
  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 5.0e-6
      weight_decay: 0.1
      betas: [0.9, 0.98]
      eps: 1e-5
      foreach: False   # Must be False for DTensor
      fused: False     # Must be False for DTensor

  # Learning rate scheduler (for DTensor backend)
  scheduler:
    - name: "torch.optim.lr_scheduler.LinearLR"
      kwargs:
        start_factor: 0.1   # Warmup from lr * 0.1
        end_factor: 1.0     # Warmup to lr * 1.0
        total_iters: 10     # Warmup over 10 steps
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000   # Constant LR after warmup
    - milestones: [10]             # Switch from warmup to constant at step 10

  # Training backend configuration (see next sections)
  dtensor_cfg: ...    # DTensor configuration
  megatron_cfg: ...   # Megatron configuration

  # Optimization features (see next sections)
  dynamic_batching: ...   # Dynamic batching configuration
  sequence_packing: ...   # Sequence packing configuration

  # Generation settings (for distillation/RL only)
  generation: ...   # Generation configuration
```

**Chat template options:**
1. **Default tokenizer template**: `chat_template: "default"` or omit the key
2. **Passthrough (no template)**: `chat_template: null` - for pre-formatted datasets
3. **Custom Jinja template**: Provide an inline Jinja string or a path to a `.jinja` file

```yaml
# Example custom template
tokenizer:
  chat_template: "{% for message in messages %}{%- if message['role'] == 'system' %}{{'Context: ' + message['content'].strip()}}{%- elif message['role'] == 'user' %}{{' Question: ' + message['content'].strip() + ' Answer:'}}{%- elif message['role'] == 'assistant' %}{{' ' + message['content'].strip()}}{%- endif %}{% endfor %}"
```

---

## Training Backend Configuration

NeMo RL supports two training backends: **DTensor** (PyTorch-native) and **Megatron** (NVIDIA's high-performance framework).

**Backend selection:** Set `enabled: true` for one backend and `enabled: false` for the other.

### DTensor Configuration

**Location:** `nemo_rl/models/policy/__init__.py:37-52`

DTensor v2 is the default PyTorch-native distributed training backend with FSDP2, TP, CP, and SP support.

```yaml
policy:
  dtensor_cfg:
    enabled: true   # Enable DTensor backend
    _v2: true       # Use DTensor v2 (recommended)

    # Environment variables (optional)
    env_vars: {}    # Custom environment variables: {"KEY": "value"}

    # Memory optimization
    cpu_offload: false   # Offload parameters to CPU (slower but saves GPU memory)

    # Parallelism strategies
    tensor_parallel_size: 1      # Tensor parallelism degree (split weights across GPUs)
    context_parallel_size: 1     # Context parallelism degree (split sequence across GPUs)
    sequence_parallel: false     # Enable sequence parallelism (requires TP > 1)
    custom_parallel_plan: null   # Path to custom parallel plan (advanced)

    # Activation checkpointing
    activation_checkpointing: false   # Enable activation checkpointing (saves memory, slower)

    # Advanced settings
    clear_cache_every_n_steps: null   # Clear CUDA cache every N steps (null = disabled)

    # LoRA configuration (optional)
    lora_cfg:
      enabled: false             # Enable LoRA fine-tuning
      target_modules: []         # Module names to apply LoRA (empty = all linear if match_all_linear=true)
      exclude_modules: []        # Module names to exclude from LoRA
      match_all_linear: true     # Apply LoRA to all linear layers
      dim: 8                     # LoRA rank (r): lower = fewer params, less capacity
      alpha: 32                  # LoRA scaling factor (effective lr multiplier = alpha/dim)
      dropout: 0.0               # Dropout probability for LoRA layers
      dropout_position: "post"   # Dropout position: "pre" or "post"
      lora_A_init: "xavier"      # LoRA A matrix initialization: "xavier" or "uniform"
      use_triton: true           # Use Triton-optimized kernels (disable for TP > 1)
```

**Parallelism guidelines:**
- **Tensor Parallel (TP)**: Splits model weights across GPUs. Use for large models that don't fit on a single GPU. Requires high-bandwidth inter-GPU communication (NVLink).
- **Context Parallel (CP)**: Splits the sequence dimension across GPUs. Use for very long sequences. Requires Ring Attention.
- **Sequence Parallel (SP)**: Splits activations along the sequence dimension. Requires TP > 1. **Incompatible with sequence packing** for some models.
- **FSDP2**: Automatically enabled for data parallelism (shards optimizer states and gradients).

**LoRA notes:**
- LoRA is supported with DTensor v2 and Megatron backends
- DTensor v1 does not support LoRA
- Set `use_triton: false` when `tensor_parallel_size > 1` due to Automodel limitations

### Megatron Configuration

**Location:** `nemo_rl/models/policy/__init__.py:115-150`

Megatron is NVIDIA's high-performance training framework supporting 6D parallelism (TP/PP/CP/SP/EP/FSDP).
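Switching backends amounts to flipping the two `enabled` flags. A minimal sketch (parallelism values are illustrative):

```yaml
# Hedged sketch: select Megatron instead of DTensor (values illustrative).
policy:
  dtensor_cfg:
    enabled: false                  # turn DTensor off
  megatron_cfg:
    enabled: true                   # turn Megatron on
    tensor_model_parallel_size: 2   # e.g., TP=2 on an 8-GPU node leaves DP=4
```

The full schema: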
```yaml
policy:
  megatron_cfg:
    enabled: true   # Enable Megatron backend

    # Environment variables (optional)
    env_vars: {}    # Custom environment variables

    # Memory management
    empty_unused_memory_level: 1   # GPU memory cleanup level (0=none, 1=after training, 2=aggressive)

    # Parallelism strategies
    tensor_model_parallel_size: 1     # Tensor parallelism degree
    pipeline_model_parallel_size: 1   # Pipeline parallelism degree (split layers across GPUs)
    expert_tensor_parallel_size: 1    # Expert tensor parallelism (for MoE models)
    expert_model_parallel_size: 1     # Expert model parallelism (for MoE models)
    context_parallel_size: 1          # Context parallelism degree
    sequence_parallel: false          # Enable sequence parallelism (requires TP > 1)

    # Pipeline parallel layer distribution (optional)
    num_layers_in_first_pipeline_stage: null   # Number of layers in first PP stage
    num_layers_in_last_pipeline_stage: null    # Number of layers in last PP stage

    # Precision
    pipeline_dtype: ${policy.precision}   # Data type for pipeline parallel communication

    # Activation checkpointing
    activation_checkpointing: false   # Enable activation checkpointing

    # MoE-specific settings
    freeze_moe_router: false                     # Freeze MoE router during training
    moe_router_dtype: null                       # Router precision (null, "fp32", "fp64")
    moe_router_load_balancing_type: "aux_loss"   # Load balancing: "aux_loss", "seq_aux_loss", "none"
    moe_router_bias_update_rate: 1e-3            # Router bias update rate
    moe_permute_fusion: false                    # Enable MoE permute fusion
    moe_per_layer_logging: false                 # Log MoE metrics per layer

    # Performance optimizations
    apply_rope_fusion: true        # RoPE fusion (~20% speedup with sequence packing)
    bias_activation_fusion: true   # Bias+activation fusion (~25% speedup with packing + RoPE fusion)
    defer_fp32_logits: false       # Defer logit casting to fp32 (required if using logprob_chunk_size)

    # Checkpointing
    force_overwrite_initial_ckpt: false   # Force overwrite of initial Megatron checkpoint

    # Optimizer settings
    optimizer:
      optimizer: "adam"   # Optimizer type: "adam" or "sgd"
      lr: 5.0e-6          # Learning rate
      min_lr: 4.9999e-6   # Minimum learning rate (for schedulers)
      weight_decay: 0.1   # Weight decay

      # Precision
      bf16: false               # Use bf16 for optimizer states
      fp16: false               # Use fp16 for optimizer states
      params_dtype: "float32"   # Parameter dtype

      # Adam settings
      adam_beta1: 0.9
      adam_beta2: 0.98
      adam_eps: 1e-5

      # SGD settings
      sgd_momentum: 0.9

      # Distributed optimizer
      use_distributed_optimizer: true       # Distribute optimizer states across GPUs
      use_precision_aware_optimizer: true   # Use precision-aware optimizer

      # Gradient clipping
      clip_grad: ${policy.max_grad_norm}

      # CPU offload
      optimizer_cpu_offload: false      # Offload optimizer to CPU
      optimizer_offload_fraction: 0.0   # Fraction of parameters to offload (0.0-1.0)

    # Learning rate scheduler
    scheduler:
      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      weight_decay_incr_style: "constant"   # Weight decay schedule: "constant" or "linear"
      lr_decay_style: "constant"            # LR schedule: "constant", "linear", "cosine"
      lr_decay_iters: 1000                  # Number of iterations for LR decay
      lr_warmup_iters: 50                   # Number of warmup iterations
      lr_warmup_init: 4.9999e-6             # Initial LR for warmup

    # Distributed data parallel settings
    distributed_data_parallel_config:
      grad_reduce_in_fp32: false   # Reduce gradients in fp32
      overlap_grad_reduce: true    # Overlap gradient reduction with backward pass
      overlap_param_gather: true   # Overlap parameter gathering
      use_custom_fsdp: false       # Use custom FSDP implementation
      data_parallel_sharding_strategy: "optim_grads_params"   # FSDP sharding strategy
```

**Parallelism guidelines:**
- **Tensor Parallel (TP)**: Same as DTensor. Typical values: 1, 2, 4, 8
- **Pipeline Parallel (PP)**: Splits model layers across GPUs. Use for very large models. Creates pipeline bubbles (lower efficiency).
- **Context Parallel (CP)**: Same as DTensor. Use for long sequences.
- **Expert Parallel (EP)**: Splits MoE experts across GPUs (for MoE models only)
- **Data Parallel (DP)**: Automatically enabled via FSDP

---

### Sequence Packing Configuration

**Location:** `nemo_rl/models/policy/__init__.py:55-64`

Sequence packing concatenates multiple short sequences into longer sequences to improve GPU utilization.

```yaml
policy:
  sequence_packing:
    enabled: false          # Enable sequence packing
    train_mb_tokens: 1024   # Target tokens per micro-batch (usually max_seq_len * micro_batch_size)
    logprob_mb_tokens: 1024 # Target tokens per logprob micro-batch (distillation/RL only)
    algorithm: "modified_first_fit_decreasing"   # Packing algorithm
    sequence_length_round: 64   # Round sequence lengths to multiples of this value
```

**Notes:**
- Sequence packing can provide **significant speedups** (2-5x) for datasets with variable-length sequences
- **Incompatible with DTensor sequence parallelism (SP)** for some models (see [issue #1178](https://github.com/NVIDIA-NeMo/RL/issues/1178))
- For Megatron, enable `apply_rope_fusion: true` and `bias_activation_fusion: true` for an additional 20-50% speedup

### Dynamic Batching Configuration

**Location:** Same as sequence packing

Dynamic batching adjusts the batch size based on sequence length to maintain a consistent token count per batch.

```yaml
policy:
  dynamic_batching:
    enabled: false            # Enable dynamic batching
    train_mb_tokens: 1024     # Target tokens per micro-batch
    logprob_mb_tokens: 1024   # Target tokens per logprob micro-batch (distillation/RL only)
    sequence_length_round: 64 # Round sequence lengths to multiples of this value
```

**Note:** Use either `dynamic_batching` OR `sequence_packing`, not both.

---

## Generation Configuration

**Location:** `nemo_rl/models/generation/interfaces.py:118-131`

Generation configuration is only required for distillation and RL algorithms (not SFT).
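For orientation before the full schema, a minimal colocated vLLM setup might look like the following sketch (values illustrative; every key used here is documented below):

```yaml
# Hedged sketch: minimal colocated vLLM generation config (values illustrative).
policy:
  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.6   # leave headroom for the colocated trainer
    colocated:
      enabled: true                 # share the training GPUs
```

The full set of keys: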
```yaml
policy:
  generation:
    backend: "vllm"   # Generation backend: "vllm" or "megatron"

    # Generation parameters
    max_new_tokens: 8192   # Maximum new tokens to generate (usually ${..max_total_sequence_length})
    temperature: 1.0       # Sampling temperature (higher = more random)
    top_p: 1.0             # Nucleus sampling top-p (1.0 = disabled)
    top_k: null            # Top-k sampling (null = disabled)

    # Stopping criteria
    stop_token_ids: null   # List of token IDs to stop generation
    stop_strings: null     # List of strings to stop generation

    # vLLM-specific configuration
    vllm_cfg:
      # Parallelism
      tensor_parallel_size: 1     # vLLM tensor parallelism
      pipeline_parallel_size: 1   # vLLM pipeline parallelism
      expert_parallel_size: 1     # vLLM expert parallelism (for MoE)

      # Memory settings
      gpu_memory_utilization: 0.6   # Fraction of GPU memory to use (0.0-1.0)
      max_model_len: 8192           # Maximum sequence length
      kv_cache_dtype: "auto"        # KV cache precision: "auto", "fp8", "fp8_e4m3"

      # Precision
      precision: ${...precision}   # Model precision

      # Performance
      enforce_eager: false          # Disable CUDA graphs (slower but more compatible)
      use_deep_gemm: false          # Use DeepGEMM optimization
      num_last_layers_in_bf16: 0    # Number of last layers in bf16 (FP8 mixed precision)
      num_first_layers_in_bf16: 0   # Number of first layers in bf16 (FP8 mixed precision)

      # Advanced
      async_engine: false                  # Use async vLLM engine
      distributed_executor_backend: null   # Distributed executor: null, "ray", "mp"

    # Colocation settings (whether generation shares training GPUs)
    colocated:
      enabled: true           # true = share training GPUs, false = dedicated generation resources
      resources:              # Only used when enabled=false
        gpus_per_node: null   # GPUs per node for generation (when cluster.num_nodes==1)
        num_nodes: null       # Nodes for generation
```

**Colocation modes:**
- `colocated.enabled: true`: Generation shares training GPUs (memory-efficient, but training pauses during generation)
- `colocated.enabled: false`: Dedicated generation resources (faster RL, but requires more GPUs)

**Backend selection:**
- **vLLM**: Fast, memory-efficient inference. Recommended for most cases. Requires weight conversion from training format.
- **Megatron**: Native Megatron inference. No weight conversion needed. Use for very large models or FP8 training.
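For example, a hedged sketch of a non-colocated setup that reserves dedicated GPUs for vLLM (resource numbers illustrative):

```yaml
# Hedged sketch: dedicated generation resources (values illustrative).
policy:
  generation:
    backend: "vllm"
    colocated:
      enabled: false        # generation gets its own GPUs
      resources:
        gpus_per_node: 8    # per-node GPUs reserved for generation
        num_nodes: 1        # number of dedicated generation nodes
```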
---

## Logger Configuration

**Location:** `nemo_rl/utils/logger.py:77-89`

```yaml
logger:
  log_dir: "logs"   # Base directory for all logs

  # Logger backends (enable/disable)
  wandb_enabled: true         # Enable Weights & Biases logging
  tensorboard_enabled: true   # Enable TensorBoard logging
  mlflow_enabled: false       # Enable MLflow logging
  swanlab_enabled: false      # Enable SwanLab logging

  # GPU monitoring
  monitor_gpus: true   # Monitor and log GPU usage metrics

  # Sampling (how many validation samples to print)
  num_val_samples_to_print: 5   # Number of validation samples to print to console

  # Weights & Biases configuration
  wandb:
    project: "sft-dev"                     # W&B project name
    name: "sft-dev-${data.dataset_name}"   # W&B run name (supports interpolation)

  # SwanLab configuration
  swanlab:
    project: "sft-dev"
    name: "sft-dev-${data.dataset_name}"

  # TensorBoard configuration
  tensorboard:
    log_dir: "tb_logs-sft-dev-${data.dataset_name}"   # TensorBoard log directory

  # MLflow configuration
  mlflow:
    experiment_name: "sft-dev"                 # MLflow experiment name
    run_name: "sft-dev-${data.dataset_name}"   # MLflow run name
    tracking_uri: null                         # MLflow tracking URI (null = local)
    artifact_location: null                    # Artifact storage location

  # GPU monitoring settings
  gpu_monitoring:
    collection_interval: 10   # Collect GPU metrics every N seconds
    flush_interval: 10        # Flush GPU metrics to loggers every N seconds
```

**Notes:**
- Multiple loggers can be enabled simultaneously
- W&B requires `wandb login` before running
- Variable interpolation works in all string fields (e.g., `${data.dataset_name}`)

---

## Checkpointing Configuration

**Location:** `nemo_rl/utils/checkpoint.py:36-67`

```yaml
checkpointing:
  enabled: true                   # Enable checkpointing
  checkpoint_dir: "results/sft"   # Directory to save checkpoints

  # Checkpoint selection
  metric_name: "val:val_loss"   # Metric to track for best checkpoint (format: "val:" or "train:")
  higher_is_better: false       # Whether higher metric values are better

  # Checkpoint retention
  keep_top_k: 3                     # Number of best checkpoints to keep (null = keep all)
  save_period: 10                   # Save checkpoint every N steps
  checkpoint_must_save_by: null     # Force save by this step (null = disabled)

  # Model saving format (for DTensor v2 / Megatron)
  model_save_format: "safetensors"   # Format: "safetensors" or "torch_save" (null for DTensor v1)
  save_consolidated: false           # Save HuggingFace-compatible consolidated checkpoint
  model_cache_dir: ""                # Model cache directory
  model_repo_id: ""                  # HuggingFace repository ID

  # PEFT support
  is_peft: false      # Whether model uses PEFT (LoRA, etc.)
  peft_config: null   # PEFT configuration
```

**Checkpoint structure:**
```
checkpoint_dir/
  step_0/
    training_info.json    # Training state (epoch, step, metrics)
    config.yaml           # Full config used for this run
    train_dataloader.pt   # DataLoader state
    policy/
      weights/            # Model weights
      optimizer/          # Optimizer state
  step_10/
    ...
  step_20/
    ...
```

**Metric name format:**
- `"val:"` for validation metrics (e.g., `"val:val_loss"`, `"val:accuracy"`)
- `"train:"` for training metrics (e.g., `"train:loss"`)

---

## Cluster Configuration

**Location:** `nemo_rl/distributed/virtual_cluster.py:33-35`

```yaml
cluster:
  gpus_per_node: 8   # Number of GPUs per node
  num_nodes: 1       # Number of nodes
```

**Multi-node setup:**
- Requires Ray cluster setup (see [cluster documentation](cluster.md))
- Total GPUs = `gpus_per_node * num_nodes`
- Data parallel degree = Total GPUs / (TP * PP * CP * EP)

---

## Multi-Node Configuration Examples

### Multi-Node SFT with Megatron (2 nodes, 16 GPUs)

**File:** `examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml`

```yaml
defaults: ../../sft.yaml   # Inherit from base SFT config

sft:
  max_num_steps: 80

policy:
  model_name: Qwen/Qwen2.5-Math-7B
  train_global_batch_size: 512
  max_total_sequence_length: 16384

  # Disable DTensor, enable Megatron
  dtensor_cfg:
    enabled: false

  megatron_cfg:
    enabled: true
    tensor_model_parallel_size: 4   # TP=4: split model across 4 GPUs
    context_parallel_size: 2        # CP=2: split sequence across 2 GPUs
    sequence_parallel: true         # Enable sequence parallelism

    # MoE router settings (no effect on dense models such as Qwen2.5-Math)
    freeze_moe_router: true
    moe_router_dtype: fp64
    moe_router_bias_update_rate: 0.0
    moe_permute_fusion: true

    optimizer:
      lr: 1.0e-06
      bf16: true
      adam_beta2: 0.999
      use_distributed_optimizer: false

  # Enable sequence packing for efficiency
  sequence_packing:
    enabled: true

  make_sequence_length_divisible_by: 32

data:
  dataset_name: openmathinstruct2
  prompt_file: examples/prompts/math.txt
  split: train_1M
  add_generation_prompt: true
  output_key: generated_solution
  num_workers: 8

cluster:
  gpus_per_node: 8   # 8 GPUs per node
  num_nodes: 2       # 2 nodes = 16 GPUs total
```

**Parallelism breakdown:**
- Total GPUs: 16
- TP: 4, CP: 2, TP*CP: 8
- Data parallel degree: 16 / 8 = 2
- Each DP rank processes: global_batch_size / DP = 512 / 2 = 256 samples

### Multi-Node Distillation (2 nodes, 16 GPUs)

**File:** `examples/configs/recipes/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-seqpack.v1.yaml`

```yaml
defaults: ../../distillation_math.yaml   # Inherit from base distillation config

distillation:
  num_prompts_per_step: 64
  max_num_steps: 20
  val_batch_size: 256
  val_period: 10

loss_fn:
  kl_type: reverse   # Use reverse KL, i.e., KL(student || teacher)

# Student model configuration
policy:
  model_name: Qwen/Qwen3-4B-Base
  dtensor_cfg:
    tensor_parallel_size: 2   # TP=2 for student
    context_parallel_size: 1

  dynamic_batching:
    enabled: false

  sequence_packing:
    enabled: true   # Enable sequence packing

  make_sequence_length_divisible_by: 2

# Teacher model configuration
teacher:
  model_name: Qwen/Qwen3-32B
  dtensor_cfg:
    tensor_parallel_size: 8   # TP=8 for larger teacher
    context_parallel_size: 1

  dynamic_batching:
    enabled: false

  sequence_packing:
    enabled: true   # Enable sequence packing

  make_sequence_length_divisible_by: 2

cluster:
  gpus_per_node: 8
  num_nodes: 2   # 2 nodes = 16 GPUs total
```

**Parallelism breakdown:**
- Student: TP=2, DP=8 (16 GPUs / 2)
- Teacher: TP=8, DP=2 (16 GPUs / 8)
- Student and teacher run in different Ray placement groups

---

## Additional Resources

- **Guides:**
  - [SFT Guide](guides/sft.md)
  - [Distillation Guide](guides/distillation.md) (if available)
  - [GRPO Guide](guides/grpo.md)
  - [DPO Guide](guides/dpo.md)

- **Design Docs:**
  - [Training Backends](design-docs/training-backends.md)
  - [Generation Backends](design-docs/generation.md)
  - [Sequence Packing](design-docs/sequence-packing-and-dynamic-batching.md)
  - [Chat Datasets](design-docs/chat-datasets.md)

- **Examples:**
  - `examples/configs/` - Configuration files
  - `examples/run_sft.py` - SFT training script
  - `examples/run_distillation.py` - Distillation training script

---

## Quick Reference: Config File Locations

| Config Type | Code Location | Example Config |
|------------|---------------|----------------|
| SFTConfig | `nemo_rl/algorithms/sft.py:66-74` | `examples/configs/sft.yaml` |
| DistillationConfig | `nemo_rl/algorithms/distillation.py:73-85` | `examples/configs/distillation_math.yaml` |
| DataConfig | `nemo_rl/data/__init__.py:21-43` | (embedded in algorithm configs) |
| PolicyConfig | `nemo_rl/models/policy/__init__.py` | (embedded in algorithm configs) |
| DTensorConfig | `nemo_rl/models/policy/__init__.py:37-52` | `examples/configs/sft.yaml` |
| MegatronConfig | `nemo_rl/models/policy/__init__.py:115-150` | `examples/configs/recipes/llm/sft-*-megatron.yaml` |
| GenerationConfig | `nemo_rl/models/generation/interfaces.py:118-131` | `examples/configs/distillation_math.yaml` |
| VllmConfig | `nemo_rl/models/generation/vllm/config.py:41-43` | `examples/configs/distillation_math.yaml` |
| LoggerConfig | `nemo_rl/utils/logger.py:77-89` | (all example configs) |
| CheckpointingConfig | `nemo_rl/utils/checkpoint.py:36-67` | (all example configs) |
| ClusterConfig | `nemo_rl/distributed/virtual_cluster.py:33-35` | (all example configs) |

---

## Dataset Mapping Code Locations

| Dataset Type | Mapping Function Location |
|-------------|---------------------------|
| Response Datasets (SFT/RL) | `nemo_rl/data/datasets/response_datasets/__init__.py:36-149` |
| Preference Datasets (DPO) | `nemo_rl/data/datasets/preference_datasets/__init__.py:26-79` |
| Eval Datasets | `nemo_rl/data/datasets/eval_datasets/__init__.py:23-98` |