
@PrinsYin PrinsYin commented Nov 30, 2025

What does this PR do?

Add an SGLang rollout backend, including initialization, generation, and refit.
RFC: https://icn9gp2qrfay.feishu.cn/wiki/DfC1wU1UkiRJyGklg1ActrQpnEb

Current Status

Colocated mode for SGLang + FSDP is fully functional:
1. Initialized Ray workers and allocated GPUs to SGLang servers according to the provided configs.
2. Implemented DP sharding, parameter handling, generation workflow, and result collection.
3. Added support for updating weights from tensors (FSDP).
4. Provided environment setup and example usage for running the full pipeline end to end.
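The DP sharding and result collection in item 2 can be sketched roughly as follows. This is an illustrative toy, not the PR's actual code; the helper names are hypothetical:

```python
# Sketch: split a batch into contiguous shards (one per SGLang server),
# then re-concatenate per-server results in shard order.

def shard_across_servers(prompts: list[str], num_servers: int) -> list[list[str]]:
    """Split a batch into contiguous shards, one per server."""
    shard_size = (len(prompts) + num_servers - 1) // num_servers  # ceil division
    return [prompts[i * shard_size:(i + 1) * shard_size] for i in range(num_servers)]

def gather_results(shards: list[list[str]]) -> list[str]:
    """Re-concatenate per-server results, preserving original order."""
    return [item for shard in shards for item in shard]

shards = shard_across_servers([f"p{i}" for i in range(10)], num_servers=4)
assert [len(s) for s in shards] == [3, 3, 3, 1]
assert gather_results(shards) == [f"p{i}" for i in range(10)]
```

The last shard may be smaller (or empty) when the batch size is not divisible by the server count, which is why the result collection must preserve shard order rather than assume equal sizes.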

Remaining Work / TODOs

  1. Refine SGLang rollout configurations; the current version is a minimal working prototype.
  2. Implement disaggregated mode; only colocated mode is supported today. Disagg will require distributed weight updates and a redesigned worker resource-allocation strategy.
  3. Add support for updating weights from tensors in Megatron and from fully distributed states.
  4. Implement sleep/awake lifecycle control (expected to be straightforward).
  5. Potentially add a more robust router: the current DP-based dispatcher may overload servers under heavy traffic, and the semaphore-based throttling needs further testing and tuning.
  6. Extend generation features (async, multi-turn, etc.) and expand related functionality.
  7. Check and add monitoring metrics.
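The semaphore-based throttling mentioned in item 5 is a standard `asyncio.Semaphore` pattern; a minimal sketch (not the PR's code — `send_request` stands in for the HTTP call to an SGLang server):

```python
import asyncio

async def send_request(i: int) -> int:
    # Stand-in for an HTTP request to an SGLang server.
    await asyncio.sleep(0)
    return i

async def generate_all(n: int, max_inflight: int) -> list[int]:
    sem = asyncio.Semaphore(max_inflight)  # caps concurrent in-flight requests

    async def wrapped(i: int) -> int:
        async with sem:  # blocks when max_inflight requests are outstanding
            return await send_request(i)

    # gather() preserves submission order in its result list
    return await asyncio.gather(*(wrapped(i) for i in range(n)))

results = asyncio.run(generate_all(8, max_inflight=2))
assert results == list(range(8))
```

The open tuning question is choosing `max_inflight` per server so that heavy traffic does not overload any one DP replica.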

Results compared to vLLM baseline (Qwen2.5-1.5B)
(image: comparison plot against the vLLM baseline)

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • New Features

    • Added SGLang as a supported generation backend for model inference.
    • Introduced HTTP-based weight streaming for distributed inference workflows.
  • Documentation

    • Updated generation interface documentation to include SGLang backend configuration and usage guidance.
    • Added new GRPO training configuration example with SGLang backend.
  • Configuration

    • Extended generation configuration with top_k and model_name fields.
    • Added SGLang optional dependency group to project dependencies.


@PrinsYin PrinsYin changed the title from "Sglang server" to "feat: SGLang server [WIP]" Nov 30, 2025
@PrinsYin PrinsYin changed the title from "feat: SGLang server [WIP]" to "feat: add SGLang rollout backend [WIP]" Nov 30, 2025
@PrinsYin PrinsYin changed the title from "feat: add SGLang rollout backend [WIP]" to "feat: add SGLang rollout backend, part1 [WIP]" Nov 30, 2025
@PrinsYin PrinsYin force-pushed the sglang_server branch 2 times, most recently from cb2f593 to 1bb6f25 on December 1, 2025 16:43
@guyueh1 guyueh1 self-requested a review December 2, 2025 17:15
@PrinsYin PrinsYin force-pushed the sglang_server branch 2 times, most recently from f50351d to 0b91f05 on December 2, 2025 20:08

@terrykong terrykong left a comment


awesome work @PrinsYin

just an FYI of a PR that's in flight https://github.com/NVIDIA-NeMo/RL/pull/1567/files

left some comments

@PrinsYin PrinsYin force-pushed the sglang_server branch 2 times, most recently from dc4cb9e to 9564a03 on December 4, 2025 19:24
@PrinsYin PrinsYin marked this pull request as ready for review December 4, 2025 20:53
@PrinsYin PrinsYin requested review from a team as code owners December 4, 2025 20:53
Signed-off-by: Ryan <[email protected]>
Signed-off-by: Zhuoran Yin <[email protected]>
PrinsYin and others added 17 commits December 4, 2025 20:54
sglang: add 1B example
- Convert SGLangConfig from regular class to TypedDict inheriting GenerationConfig
- Align structure with VllmConfig pattern for consistency
- Mark all fields as NotRequired for backward compatibility
- Add sglang_kwargs field for additional ServerArgs parameters
- Add type casting in grpo.py for type safety

This maintains backward compatibility while aligning with the existing
generation config structure pattern.

Signed-off-by: Zhuoran Yin <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Zhuoran Yin <[email protected]>
@PrinsYin PrinsYin requested a review from a team as a code owner December 4, 2025 20:54
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 4, 2025

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

This PR introduces SGLang as a new distributed generation backend alongside vLLM. It includes SGLang-specific configuration, worker implementation, HTTP-based weight streaming for colocated inference, integration into GRPO training, generalized metrics handling, and comprehensive setup/documentation changes.

Changes

Cohort / File(s) / Summary
Documentation & Configuration
docs/design-docs/generation.md, examples/configs/grpo_math_1B_sglang.yaml, run.sh
Updated generation interface docs to include Megatron and extended GenerationConfig fields (top_k, model_name). Added comprehensive GRPO+SGLang YAML example with distributed training, dynamic batching, and SGLang-specific generation settings. Added setup script for GRPO math example.
SGLang Backend Implementation
nemo_rl/models/generation/sglang/__init__.py, nemo_rl/models/generation/sglang/config.py, nemo_rl/models/generation/sglang/sglang_generation.py, nemo_rl/models/generation/sglang/sglang_worker.py, nemo_rl/models/generation/sglang/utils.py
New SGLang backend package with configuration (SglangSpecificArgs, SGLangConfig), distributed generation orchestration (SGLangGeneration), Ray-based worker with HTTP API integration (SGLangGenerationWorker), and async loop utility (AsyncLoopThread).
GRPO Algorithm Integration
nemo_rl/algorithms/grpo.py, nemo_rl/algorithms/utils.py
Added SGLang initialization path (init_sglang), generic initialize_generation_with_policy helper for parallel/sequential setup, replaced vLLM-specific metrics with generalized generation_logger_metrics throughout sync and async training loops, updated weight-update logic for HTTP streaming.
Generation Interface Extensions
nemo_rl/models/generation/interfaces.py, nemo_rl/models/generation/vllm/vllm_generation.py
Added optional clear_logger_metrics and get_logger_metrics methods to GenerationInterface. Implemented concrete versions in VllmGeneration wrapping existing vLLM metrics functions.
Policy Weight Streaming
nemo_rl/models/policy/interfaces.py, nemo_rl/models/policy/lm_policy.py, nemo_rl/models/policy/dtensor_policy_worker_v2.py, nemo_rl/models/policy/utils.py
Added stream_weights_via_http to ColocatablePolicyInterface and Policy class. Implemented HTTP streaming in DTensorPolicyWorkerV2. Added stream_weights_via_http_impl with IPC gathering and distributed rank coordination for SGLang server updates.
Distributed Setup
nemo_rl/distributed/virtual_cluster.py, nemo_rl/distributed/ray_actor_environment_registry.py
Added AUTOMODEL_SGLANG and SGLANG executables to PY_EXECUTABLES. Extended ACTOR_ENVIRONMENT_REGISTRY with SGLangGenerationWorker and additional environment mappings. Updated DTensorPolicyWorkerV2 to use AUTOMODEL_SGLANG for HTTP streaming support.
Dependencies
pyproject.toml
Added new sglang optional dependency group with 17 packages (sglang>=0.4.1, pybase64, requests, uvloop, torchao, xgrammar, etc.).
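The AsyncLoopThread utility listed above (a background thread hosting an asyncio event loop so synchronous worker code can submit coroutines) typically looks something like the following sketch. This is an assumption about the shape of the utility, not the actual code in nemo_rl/models/generation/sglang/utils.py:

```python
import asyncio
import threading

class AsyncLoopThread:
    """Run an asyncio event loop on a dedicated daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        """Submit a coroutine from synchronous code and block for its result."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def stop(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()

async def double(x: int) -> int:
    return x * 2

loop_thread = AsyncLoopThread()
assert loop_thread.run(double(21)) == 42
loop_thread.stop()
```

This pattern lets a Ray worker with a synchronous API fan out many concurrent HTTP generation requests without blocking the event loop or spawning one loop per call.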

Sequence Diagram(s)

sequenceDiagram
    participant GRPO as GRPO Training Loop
    participant GenInit as Initialize Generation<br/>(with policy)
    participant Policy as Policy Worker
    participant SGLangGen as SGLang Generation<br/>Worker
    participant SGLang as SGLang Server
    
    GRPO->>GenInit: Call initialize_generation_with_policy<br/>(colocated_inference=true)
    
    alt Parallel Initialization (colocated)
        par Init Policy
            GenInit->>Policy: Initialize policy
            Policy-->>GenInit: policy ready
        and Init Generation
            GenInit->>SGLangGen: Initialize generation
            SGLangGen->>SGLang: Launch server process
            SGLang-->>SGLangGen: Server running
            SGLangGen-->>GenInit: generation ready
        end
    end
    
    GenInit-->>GRPO: (policy, generation, timing_metrics)
    
    GRPO->>SGLangGen: Generate samples (batched data)
    SGLangGen->>SGLangGen: Shard data across DP axis
    SGLangGen->>SGLang: POST /generate (HTTP request)
    SGLang->>SGLang: Execute generation per sample
    SGLang-->>SGLangGen: Batch results (token ids, logprobs)
    SGLangGen->>SGLangGen: Aggregate & pad results
    SGLangGen-->>GRPO: BatchedDataDict output
    
    GRPO->>Policy: Call refit_policy_generation()
    Policy->>GRPO: Generate samples with updated policy
    
    Note over GRPO,Policy: If colocated and SGLang:
    GRPO->>Policy: stream_weights_via_http(sglang_urls)
    Policy->>Policy: Convert DTensor params to local tensors
    Policy->>SGLang: POST /update_weights_from_tensor (HTTP)
    SGLang->>SGLang: Update model weights
    SGLang-->>Policy: ACK
    Policy-->>GRPO: Weights streamed & synced
    
    GRPO->>SGLangGen: Get metrics: generation_logger_metrics
    SGLangGen-->>GRPO: {inflight_batch_sizes, pending_samples}
    GRPO->>GRPO: Log metrics to WandB
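The `POST /update_weights_from_tensor` step in the diagram implies serializing named tensors into an HTTP body. A minimal sketch of the payload-construction side (field names and encoding are illustrative assumptions; the PR streams CUDA IPC handles rather than raw bytes, and no network call is made here):

```python
import base64
import json
import struct

def serialize_tensor(name: str, values: list[float]) -> dict:
    """Pack a named float32 buffer into a JSON-safe payload entry (illustrative)."""
    raw = struct.pack(f"{len(values)}f", *values)
    return {
        "name": name,
        "dtype": "float32",
        "shape": [len(values)],
        "data": base64.b64encode(raw).decode("ascii"),
    }

payload = {"tensors": [serialize_tensor("lm_head.weight", [1.0, 2.0, 3.0])]}
body = json.dumps(payload)  # this body would be POSTed to each server's update endpoint

# Round-trip check: the receiver can recover the exact float buffer.
decoded = struct.unpack("3f", base64.b64decode(json.loads(body)["tensors"][0]["data"]))
assert decoded == (1.0, 2.0, 3.0)
```

For colocated GPUs, passing IPC handles instead of serialized bytes avoids copying weights through host memory, which is why the actual implementation gathers IPC handlers before the POST.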

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • SGLang worker & generation orchestration (sglang_generation.py, sglang_worker.py): Complex Ray-based distributed generation with GPU bundle allocation, data sharding, and HTTP communication; careful verification of resource management and error handling needed
  • Weight streaming implementation (nemo_rl/models/policy/utils.py): HTTP-based tensor transmission with IPC gathering and distributed rank coordination; validate correctness of gather groups and serialization logic
  • GRPO integration refactor (nemo_rl/algorithms/grpo.py): Significant refactoring of initialization logic with new generic helper, metric renaming throughout, and dual vLLM/SGLang paths; ensure backward compatibility and correctness of both paths
  • Metrics generalization (nemo_rl/algorithms/grpo.py, nemo_rl/algorithms/utils.py): Systematic replacement of vLLM-specific metrics with generic generation metrics; verify all metric collection, clearing, and logging flows work correctly
  • Actor environment registry updates (ray_actor_environment_registry.py): Multiple new mappings added; verify correct executable assignment for each new actor type

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • yuki-97
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning Major feature (SGLang backend) affecting convergence and performance lacks documented test results, numerical metrics, or performance comparison despite being [WIP] with critical review issues. Document numerical test results comparing SGLang vs vLLM baselines, include performance metrics (throughput/latency/memory) with configuration details, provide convergence analysis, and address critical review comments before completion.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main objective of the PR: adding SGLang as a new rollout/generation backend. The '[WIP]' marker indicates work-in-progress status, and 'part1' signals this is a multi-part feature.
Docstring Coverage ✅ Passed Docstring coverage is 80.88% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/distributed/virtual_cluster.py (1)

43-63: Add --directory {git_root} to new SGLang executables for consistency

The new AUTOMODEL_SGLANG and SGLANG entries omit --directory {git_root}, while all other uv run entries include it. This inconsistency means these executables will behave differently depending on the caller's current working directory, potentially failing to locate pyproject.toml.

Suggested fix:

     # Use NeMo-RL direct dependencies, nemo-automodel, and SGLang.
-    AUTOMODEL_SGLANG = "uv run --locked --extra automodel --extra sglang"
+    AUTOMODEL_SGLANG = (
+        f"uv run --locked --extra automodel --extra sglang --directory {git_root}"
+    )

     # Use NeMo-RL direct dependencies and SGLang.
-    SGLANG = "uv run --locked --extra sglang"
+    SGLANG = f"uv run --locked --extra sglang --directory {git_root}"
🧹 Nitpick comments (17)
docs/design-docs/generation.md (1)

11-21: Clarify model_name optionality to match GenerationConfig

In code, GenerationConfig.model_name is NotRequired[str] and is sometimes filled by helpers like configure_generation_config, but here it is documented as a required str. To avoid confusion, consider clarifying that:

  • Users typically set model_name, but
  • Some flows may populate it automatically and treat it as optional in the TypedDict.

E.g. change the snippet comment to something like “Name or path of the model (may be populated by helpers).”

nemo_rl/models/generation/interfaces.py (1)

261-278: Optional metrics hooks are fine; consider silencing B027

Adding clear_logger_metrics / get_logger_metrics as optional hooks with no-op defaults is a reasonable design for GenerationInterface, and it matches the vLLM implementation.

Ruff’s B027 warning (empty method in an abstract base class) is effectively a false positive here. Two options:

  • Keep them non-abstract (recommended) and silence B027, e.g.:
def clear_logger_metrics(self) -> None:  # noqa: B027
    """Clear logger metrics for performance reporting."""
    # Optional hook; default is a no-op.
    return None
  • Or add minimal behavior (e.g., a return None) plus a clarifying comment, which also addresses the “empty” concern.

I’d avoid making them abstract since that would force every backend to implement metrics even when not needed.

nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)

1762-1805: HTTP weight streaming integration looks consistent with existing IPC path

The new stream_weights_via_http correctly handles cpu_offload, derives the current GPU UUID, converts DTensors to full tensors with the target dtype, and delegates to the shared HTTP implementation; the sorted state_dict iteration should give deterministic ordering across ranks. You might optionally factor the DTensor-to-local generator into a shared helper used by both HTTP and ZMQ paths to avoid duplication.
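A shared helper of the kind suggested here can be as small as a generator parameterized by the transport-agnostic conversion. This is a sketch only; `convert_fn` stands in for the DTensor-to-full-tensor and dtype-cast logic, and the helper name is hypothetical:

```python
from typing import Callable, Iterator

def iter_local_tensors(
    state_dict: dict[str, object],
    convert_fn: Callable[[object], object],
) -> Iterator[tuple[str, object]]:
    """Yield (name, local_tensor) pairs in sorted order for deterministic streaming."""
    for name in sorted(state_dict):  # sorted => same order on every rank
        yield name, convert_fn(state_dict[name])

# Both the HTTP and ZMQ paths could consume the same generator:
fake_state = {"b.weight": 2, "a.weight": 1}
converted = list(iter_local_tensors(fake_state, convert_fn=lambda t: t * 10))
assert converted == [("a.weight", 10), ("b.weight", 20)]
```

Factoring the iteration and conversion out this way keeps the ordering guarantee in one place, so the two transports cannot drift apart on determinism.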

nemo_rl/models/policy/utils.py (1)

15-27: HTTP weight streaming flow is coherent; consider a few small cleanups

The end-to-end HTTP streaming path (GPU UUID → server selection → IPC handler gather → POST to /update_weights_from_tensor) looks structurally correct and matches the intended SGLang contract. A few low-priority refinements you might consider:

  • Remove or use unused parameters/locals: rank and sglang_url_to_gpu_uuids in _setup_ipc_gather_group, gather_group in _gather_ipc_handlers, and shape/dtype in _send_tensor_to_sglang, to reduce cognitive load.
  • In stream_weights_via_http_impl, if _setup_ipc_gather_group returns (None, None, None) (e.g., unexpected dist state or UUID mismatch), the function quietly becomes a no-op; raising a clear error in that case would make misconfiguration easier to diagnose.
  • In _send_tensor_to_sglang, replace the bare except: around response.text with except Exception: (and optionally log the exception) to avoid masking non-HTTP runtime issues while still enriching the error message.
  • Optionally rename the unused loop index idx in for idx, (name, tensor) in enumerate(tensor_list): to _ to match intent and silence linters.

Also applies to: 498-743

nemo_rl/models/generation/sglang/config.py (1)

20-97: SGLang config types look reasonable; align used keys with docs/YAML

The SglangSpecificArgs/SGLangConfig TypedDicts cleanly expose SGLang’s ServerArgs-style fields without hard-coding defaults, which matches the config guidelines. For the subset of keys you actually expect users to set via NeMo RL configs, ensure their purpose, valid values, and recommended defaults are documented and reflected in exemplar YAMLs (e.g., the new grpo_math_1B_sglang.yaml), so this large surface stays discoverable.

Based on learnings, config TypedDict additions should be documented and mirrored in examples.

nemo_rl/models/generation/sglang/sglang_generation.py (6)

48-54: Remove unused workers_per_node parameter.

The workers_per_node parameter is declared but never used in the constructor. Consider removing it to avoid confusion, or document why it's reserved for future use.

     def __init__(
         self,
         cluster: RayVirtualCluster,
         config: SGLangConfig,
         name_prefix: str = "sglang_policy",
-        workers_per_node: Optional[Union[int, list[int]]] = None,
     ):

84-88: Use proper logging instead of print for warnings.

Direct print statements make it harder to control log levels in production. Consider using Python's logging module or the project's logger.

+import logging
+
+logger = logging.getLogger(__name__)
+
         if total_gpus % gpus_per_server != 0:
-            print(
+            logger.warning(
                 f"[WARNING] Total GPUs ({total_gpus}) is not divisible by GPUs per server ({gpus_per_server}). "
                 f"Will use {num_servers} servers, leaving {total_gpus % gpus_per_server} GPUs unused."
             )

319-328: Add strict=True to zip() for safer iteration.

Without strict=True, mismatched list lengths would silently truncate data. This could mask bugs where urls and uuids_list have different sizes.

         # Create mapping
         url_to_uuids = {}
-        for url, uuids in zip(urls, uuids_list):
+        for url, uuids in zip(urls, uuids_list, strict=True):
             if url is not None and uuids is not None:
                 url_to_uuids[url] = uuids

330-336: Stub methods should return consistent types.

prepare_for_generation and finish_generation are annotated -> bool by the interface but return None via an implicit pass. Either add an explicit return True or update the docstrings to explain the current stub behavior.

     def prepare_for_generation(self, *args: Any, **kwargs: Any) -> bool:
         """Wake workers up for colocated inference."""
-        pass
+        return True

     def finish_generation(self, *args: Any, **kwargs: Any) -> bool:
         """Sleep workers and reset prefix cache."""
-        pass
+        return True

338-345: Narrow the exception type in shutdown.

Catching a broad Exception can hide unexpected errors. Consider catching more specific exceptions or at least logging the exception type for debugging.

     def shutdown(self) -> bool:
         """Shut down all SGLang workers and clean up resources."""
         try:
             # Use the worker group's shutdown method with the worker's cleanup method
             return self.worker_group.shutdown(cleanup_method="shutdown")
-        except Exception as e:
-            print(f"Error during SGLang policy shutdown: {e}")
+        except (RuntimeError, ray.exceptions.RayError) as e:
+            print(f"Error during SGLang policy shutdown: {type(e).__name__}: {e}")
             return False

365-380: Consider restructuring to use else block for success path.

The static analysis correctly identifies that the success print and return could be in an else block for cleaner control flow.

         try:
             futures = self.worker_group.run_all_workers_single_data(
                 "invalidate_kv_cache",
                 run_rank_0_only_axes=["tensor_parallel"],
             )
             results = ray.get(futures)
             results = [r for r in results if r is not None]
             success = all(result for result in results) if results else True
-            if success:
-                print("[sglang refit] All SGLang server caches flushed successfully", flush=True)
-            else:
-                print("[sglang refit] WARNING - Some SGLang server caches failed to flush", flush=True)
-            return success
         except Exception as e:
             print(f"[sglang refit] Error flushing SGLang caches: {e}", flush=True)
             return False
+        else:
+            if success:
+                print("[sglang refit] All SGLang server caches flushed successfully", flush=True)
+            else:
+                print("[sglang refit] WARNING - Some SGLang server caches failed to flush", flush=True)
+            return success
nemo_rl/models/generation/sglang/sglang_worker.py (5)

116-122: Unused fraction_of_gpus parameter.

The fraction_of_gpus parameter is passed from configure_worker but never used in __init__. Either use it or remove it from both places.

     def __init__(
         self,
         config: SGLangConfig,
         bundle_indices: Optional[list[int]] = None,
-        fraction_of_gpus: float = 1.0,
         seed: Optional[int] = None,
     ):

Also update configure_worker to not pass this:

-            init_kwargs["fraction_of_gpus"] = num_gpus

322-331: Unused stop_strings parameter.

The stop_strings parameter is declared but not used in _build_sampling_params. The actual stop string handling happens per-sample in _generate_single_sample. Remove this parameter to avoid confusion.

     def _build_sampling_params(
         self,
         *,
         greedy: bool,
-        stop_strings,
         max_new_tokens: Optional[int] = None,
         input_len: Optional[int] = None,
         context_length: Optional[int] = None,
         sample_index: Optional[int] = None,
     ) -> dict[str, Any]:
         """Build sampling parameters dictionary for SGLang API.
         
         Args:
             greedy: Whether to use greedy decoding (temperature=0.0)
-            stop_strings: Merged stop strings (not used here, handled per sample)
             max_new_tokens: Override max_new_tokens from config if provided

463-467: Remove redundant exception handler.

This exception handler catches and immediately re-raises without any additional handling. It adds no value and reduces code clarity.

         async def wrap(idx, coro):
             async with semaphore:
-                try:
-                    result = await coro
-                    return idx, result
-                except Exception as e:
-                    raise
+                result = await coro
+                return idx, result

590-593: Remove redundant exception handler.

Same issue as lines 463-467 - catching and re-raising without any handling.

         # Execute all requests concurrently using the dedicated event loop thread
-        try:
-            all_results = self.async_loop_thread.run(self._generate_async(tasks))
-        except Exception as e:
-            raise
+        all_results = self.async_loop_thread.run(self._generate_async(tasks))

606-606: Unused loop variable new_logprobs.

In the first pass calculating max_length, new_logprobs is unpacked but not used. Use _ to indicate it's intentionally ignored.

         # First pass: calculate max_length
-        for i, (new_tokens, new_logprobs) in enumerate(all_results):
+        for i, (new_tokens, _) in enumerate(all_results):
             input_len = input_lengths[i].item()
             generation_length = len(new_tokens)
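The two-pass aggregate-and-pad step this comment refers to can be illustrated generically (the pad id and list shapes here are placeholders, not the worker's actual values):

```python
PAD_ID = 0  # placeholder pad token id

def pad_results(all_results: list[list[int]]) -> list[list[int]]:
    """First pass finds the max generation length; second pass right-pads each row."""
    max_length = max((len(tokens) for tokens in all_results), default=0)
    return [tokens + [PAD_ID] * (max_length - len(tokens)) for tokens in all_results]

padded = pad_results([[5, 6], [7], [8, 9, 10]])
assert padded == [[5, 6, 0], [7, 0, 0], [8, 9, 10]]
```

Only the lengths matter in the first pass, which is why the unused `new_logprobs` unpacking flagged above can safely become `_`.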
nemo_rl/algorithms/grpo.py (1)

493-553: Remove unused generation_name parameter.

The generation_name parameter is passed but never used in the function body. Either remove it or use it in logging/output.

 def initialize_generation_with_policy(
     init_generation_fn,
-    generation_name: str,
     init_time_key: str,
     colocated_inference: bool,
     worker_init_timing_metrics: dict,
 ):
     """
     Generic function to initialize a generation engine (vLLM or SGLang) along with policy.
     
     Args:
         init_generation_fn: Function that initializes the generation engine (init_vllm or init_sglang)
-        generation_name: Name of the generation engine ("vLLM" or "SGLang")
         init_time_key: Key name for storing initialization time in metrics ("vllm_init_time_s" or "sglang_init_time_s")

And update the call sites:

         policy_generation, policy = initialize_generation_with_policy(
             init_generation_fn=init_vllm,
-            generation_name="vLLM",
             init_time_key="vllm_init_time_s",
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e73bfd and 843a06a.

📒 Files selected for processing (19)
  • docs/design-docs/generation.md (5 hunks)
  • examples/configs/grpo_math_1B_sglang.yaml (1 hunks)
  • nemo_rl/algorithms/grpo.py (14 hunks)
  • nemo_rl/algorithms/utils.py (1 hunks)
  • nemo_rl/distributed/ray_actor_environment_registry.py (2 hunks)
  • nemo_rl/distributed/virtual_cluster.py (2 hunks)
  • nemo_rl/models/generation/interfaces.py (1 hunks)
  • nemo_rl/models/generation/sglang/__init__.py (1 hunks)
  • nemo_rl/models/generation/sglang/config.py (1 hunks)
  • nemo_rl/models/generation/sglang/sglang_generation.py (1 hunks)
  • nemo_rl/models/generation/sglang/sglang_worker.py (1 hunks)
  • nemo_rl/models/generation/sglang/utils.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_generation.py (1 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (1 hunks)
  • nemo_rl/models/policy/interfaces.py (1 hunks)
  • nemo_rl/models/policy/lm_policy.py (1 hunks)
  • nemo_rl/models/policy/utils.py (2 hunks)
  • pyproject.toml (1 hunks)
  • run.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/models/generation/vllm/vllm_generation.py
  • nemo_rl/models/generation/sglang/__init__.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/generation/sglang/utils.py
  • nemo_rl/models/generation/interfaces.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/models/generation/sglang/sglang_worker.py
  • nemo_rl/models/generation/sglang/config.py
  • nemo_rl/distributed/virtual_cluster.py
  • nemo_rl/models/policy/interfaces.py
  • nemo_rl/models/generation/sglang/sglang_generation.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/models/generation/vllm/vllm_generation.py
  • nemo_rl/models/generation/sglang/__init__.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/generation/sglang/utils.py
  • nemo_rl/models/generation/interfaces.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/models/generation/sglang/sglang_worker.py
  • nemo_rl/models/generation/sglang/config.py
  • nemo_rl/distributed/virtual_cluster.py
  • nemo_rl/models/policy/interfaces.py
  • nemo_rl/models/generation/sglang/sglang_generation.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • nemo_rl/models/generation/vllm/vllm_generation.py
  • nemo_rl/models/generation/sglang/__init__.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/generation/sglang/utils.py
  • nemo_rl/models/generation/interfaces.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/models/generation/sglang/sglang_worker.py
  • run.sh
  • nemo_rl/models/generation/sglang/config.py
  • nemo_rl/distributed/virtual_cluster.py
  • nemo_rl/models/policy/interfaces.py
  • examples/configs/grpo_math_1B_sglang.yaml
  • nemo_rl/models/generation/sglang/sglang_generation.py
  • docs/design-docs/generation.md
  • nemo_rl/models/policy/utils.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • pyproject.toml
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • nemo_rl/models/generation/vllm/vllm_generation.py
  • nemo_rl/models/generation/sglang/__init__.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/generation/sglang/utils.py
  • nemo_rl/models/generation/interfaces.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/models/generation/sglang/sglang_worker.py
  • run.sh
  • nemo_rl/models/generation/sglang/config.py
  • nemo_rl/distributed/virtual_cluster.py
  • nemo_rl/models/policy/interfaces.py
  • nemo_rl/models/generation/sglang/sglang_generation.py
  • nemo_rl/models/policy/utils.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • run.sh
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section

Files:

  • docs/design-docs/generation.md
🧠 Learnings (6)
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.sh : Use uv run instead of python to execute scripts

Applied to files:

  • run.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • run.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml

Applied to files:

  • nemo_rl/models/generation/sglang/config.py
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/configs/grpo_math_1B_sglang.yaml
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/grpo_math_1B_sglang.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • nemo_rl/algorithms/grpo.py
🧬 Code graph analysis (10)
nemo_rl/models/generation/sglang/__init__.py (3)
nemo_rl/models/generation/sglang/config.py (1)
  • SGLangConfig (93-96)
nemo_rl/models/generation/sglang/sglang_generation.py (1)
  • SGLangGeneration (47-380)
nemo_rl/models/generation/sglang/sglang_worker.py (1)
  • SGLangGenerationWorker (50-734)
nemo_rl/models/policy/lm_policy.py (3)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • stream_weights_via_http (1764-1805)
nemo_rl/models/policy/interfaces.py (1)
  • stream_weights_via_http (189-199)
nemo_rl/distributed/worker_groups.py (1)
  • run_all_workers_single_data (755-799)
nemo_rl/models/generation/sglang/utils.py (1)
nemo_rl/models/generation/sglang/sglang_worker.py (1)
  • shutdown (665-715)
nemo_rl/models/generation/interfaces.py (1)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
  • clear_logger_metrics (879-881)
  • get_logger_metrics (883-885)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (3)
nemo_rl/utils/nsys.py (1)
  • wrap_with_nvtx_name (82-94)
nemo_rl/models/policy/interfaces.py (1)
  • stream_weights_via_http (189-199)
nemo_rl/models/policy/utils.py (1)
  • stream_weights_via_http_impl (498-618)
nemo_rl/models/generation/sglang/sglang_worker.py (5)
nemo_rl/distributed/batched_data_dict.py (1)
  • BatchedDataDict (75-860)
nemo_rl/distributed/virtual_cluster.py (3)
  • _get_node_ip_local (73-77)
  • _get_free_port_local (80-88)
  • shutdown (483-502)
nemo_rl/models/generation/interfaces.py (3)
  • GenerationDatumSpec (134-165)
  • GenerationOutputSpec (168-212)
  • verify_right_padding (23-99)
nemo_rl/models/generation/sglang/config.py (1)
  • SGLangConfig (93-96)
nemo_rl/models/generation/sglang/utils.py (3)
  • AsyncLoopThread (19-62)
  • run (41-54)
  • shutdown (56-62)
nemo_rl/models/generation/sglang/config.py (2)
nemo_rl/models/generation/interfaces.py (1)
  • GenerationConfig (118-131)
nemo_rl/distributed/worker_groups.py (1)
  • dp_size (627-629)
nemo_rl/models/generation/sglang/sglang_generation.py (3)
nemo_rl/distributed/batched_data_dict.py (2)
  • BatchedDataDict (75-860)
  • from_batches (102-172)
nemo_rl/distributed/named_sharding.py (4)
  • NamedSharding (19-222)
  • layout (99-101)
  • names (84-86)
  • get_axis_size (209-211)
nemo_rl/models/policy/interfaces.py (2)
  • init_collective (166-169)
  • prepare_refit_info (180-181)
nemo_rl/algorithms/utils.py (1)
tests/check_metrics.py (1)
  • max (30-32)
nemo_rl/distributed/ray_actor_environment_registry.py (1)
nemo_rl/distributed/virtual_cluster.py (1)
  • PY_EXECUTABLES (43-65)
🪛 Ruff (0.14.7)
nemo_rl/models/generation/sglang/utils.py

33-33: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/models/generation/interfaces.py

261-267: GenerationInterface.clear_logger_metrics is an empty method in an abstract base class, but has no abstract decorator

(B027)

nemo_rl/models/generation/sglang/sglang_worker.py

120-120: Unused method argument: fraction_of_gpus

(ARG002)


187-187: Possible binding to all interfaces

(S104)


254-254: Do not catch blind exception: Exception

(BLE001)


326-326: Unused method argument: stop_strings

(ARG002)


465-465: Consider moving this statement to an else block

(TRY300)


466-467: Remove exception handler; error is immediately re-raised

(TRY203)


466-466: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


504-504: Create your own exception

(TRY002)


504-504: Avoid specifying long messages outside the exception class

(TRY003)


556-556: Avoid specifying long messages outside the exception class

(TRY003)


592-593: Remove exception handler; error is immediately re-raised

(TRY203)


592-592: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


606-606: Loop control variable new_logprobs not used within loop body

(B007)


675-675: Do not catch blind exception: Exception

(BLE001)


691-691: Do not catch blind exception: Exception

(BLE001)


709-709: Consider moving this statement to an else block

(TRY300)


711-711: Do not catch blind exception: Exception

(BLE001)


732-732: Probable use of requests call without timeout

(S113)

nemo_rl/models/generation/sglang/sglang_generation.py

53-53: Unused method argument: workers_per_node

(ARG002)


70-72: Avoid specifying long messages outside the exception class

(TRY003)


79-82: Avoid specifying long messages outside the exception class

(TRY003)


170-170: Avoid specifying long messages outside the exception class

(TRY003)


205-209: Avoid specifying long messages outside the exception class

(TRY003)


214-214: Unused method argument: ip

(ARG002)


214-214: Unused method argument: port

(ARG002)


214-214: Unused method argument: world_size

(ARG002)


214-214: Unused method argument: train_world_size

(ARG002)


265-267: Avoid specifying long messages outside the exception class

(TRY003)


287-287: Avoid specifying long messages outside the exception class

(TRY003)


307-307: Avoid specifying long messages outside the exception class

(TRY003)


324-324: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


343-343: Do not catch blind exception: Exception

(BLE001)


377-377: Consider moving this statement to an else block

(TRY300)


378-378: Do not catch blind exception: Exception

(BLE001)

nemo_rl/models/policy/utils.py

527-527: f-string without any placeholders

Remove extraneous f prefix

(F541)


537-540: Avoid specifying long messages outside the exception class

(TRY003)


570-570: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


622-622: Unused function argument: rank

(ARG001)


625-625: Unused function argument: sglang_url_to_gpu_uuids

(ARG001)


639-639: Local variable my_rank is assigned to but never used

Remove assignment to unused variable my_rank

(F841)


660-660: Unused function argument: gather_group

(ARG001)


700-700: Unused function argument: shape

(ARG001)


701-701: Unused function argument: dtype

(ARG001)


736-736: Do not use bare except

(E722)


736-737: try-except-pass detected, consider logging the exception

(S110)


741-743: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/algorithms/grpo.py

495-495: Unused function argument: generation_name

(ARG001)


1009-1009: Local variable flush_success is assigned to but never used

Remove assignment to unused variable flush_success

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (16)
nemo_rl/models/generation/vllm/vllm_generation.py (1)

879-885: vLLM logger metrics integration looks good

The new clear_logger_metrics / get_logger_metrics wrappers cleanly expose vLLM-specific metrics through the generic GenerationInterface hooks and match the optional-implementation pattern.

No changes needed here.

nemo_rl/models/policy/lm_policy.py (1)

754-777: Align HTTP streaming API with IPC streaming (return futures)

This wrapper is consistent with stream_weights_via_ipc_zmq: it fans out to all workers and returns a list[ray.ObjectRef] so callers can ray.get externally.

However, ColocatablePolicyInterface.stream_weights_via_http is currently typed to return None, which is inconsistent with this implementation and with the IPC streaming API.

Update ColocatablePolicyInterface.stream_weights_via_http to return list[ray.ObjectRef] (same as stream_weights_via_ipc_zmq). No change to this function's body is needed once the interface is updated.
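The fan-out-then-return-futures shape under discussion can be sketched with concurrent.futures standing in for Ray ObjectRefs; all names below are illustrative, not the repo's actual API:

```python
from concurrent.futures import Future, ThreadPoolExecutor

NUM_WORKERS = 4
pool = ThreadPoolExecutor(max_workers=NUM_WORKERS)

def worker_stream_weights(url_to_uuids: dict) -> bool:
    # stand-in for one worker pushing weights to its SGLang server over HTTP
    return bool(url_to_uuids)

def stream_weights_via_http(url_to_uuids: dict) -> list[Future]:
    # Fan out and return immediately without blocking; this mirrors the
    # policy method returning list[ray.ObjectRef] for the caller to await.
    return [pool.submit(worker_stream_weights, url_to_uuids) for _ in range(NUM_WORKERS)]

futures = stream_weights_via_http({"http://host:30000": ["GPU-uuid-0"]})
results = [f.result() for f in futures]  # caller blocks here, like ray.get(futures)
```

Typing the interface this way keeps the HTTP and IPC paths symmetric: both dispatch, both hand the caller the futures, and the caller decides when to synchronize.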

nemo_rl/models/policy/interfaces.py (1)

189-199: Verify return type consistency for stream_weights_via_http

The interface currently declares stream_weights_via_http to return None, but the review suggests the concrete implementation in Policy.stream_weights_via_http returns list[ray.ObjectRef], analogous to stream_weights_via_ipc_zmq. If this is accurate, the interface return type should be updated to match the actual implementation and align the HTTP and IPC streaming APIs.

Suggested adjustment (pending verification of concrete implementation):

-    def stream_weights_via_http(
-        self, sglang_url_to_gpu_uuids: dict[str, list[str]]
-    ) -> None:
+    def stream_weights_via_http(
+        self, sglang_url_to_gpu_uuids: dict[str, list[str]]
+    ) -> list[ray.ObjectRef]:
         """Stream model weights to SGLang servers via HTTP API.
         
         Args:
             sglang_url_to_gpu_uuids: Dict mapping SGLang server URL to list of GPU UUIDs it uses
         """
         raise NotImplementedError(
             "stream_weights_via_http is not implemented for this policy worker"
         )
examples/configs/grpo_math_1B_sglang.yaml (1)

209-236: Update misleading comment in SGLang config

The data.max_input_seq_length comment at lines 237-243 references vllm.max_model_len, which is specific to the vLLM backend. In this SGLang-focused configuration, the comment should reference sglang_cfg.context_length or use backend-agnostic language to avoid confusing readers about which backend-specific limits apply.

run.sh (1)

1-19: Add NVIDIA header and use uv run instead of python

This script should follow the repo shell guidelines (if it is a non-test script):

  • Add the NVIDIA copyright header at the top.
  • Replace the bare python invocation with uv run (you already use uv for venv/pip).

A minimal fix:

-#!/bin/bash
-set -e
+#!/bin/bash
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -e
@@
-echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
-
-
-python examples/run_grpo_math.py --config "$CONFIG_FILE"
+echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
+
+uv run examples/run_grpo_math.py --config "$CONFIG_FILE"
nemo_rl/models/generation/sglang/__init__.py (1)

1-22: Public SGLang exports are wired correctly

The header, imports, and __all__ consistently expose SGLangConfig, SGLangGeneration, and SGLangGenerationWorker; no changes needed.

pyproject.toml (1)

81-100: SGLang optional dependency group is well-formed

The new sglang extra cleanly groups SGLang and its companion packages alongside existing vllm/mcore extras; no code changes required here.

nemo_rl/algorithms/utils.py (1)

524-564: Generation logger metrics refactor is sound

Switching to metrics.get("generation_logger_metrics", {}) under the existing vLLM-logger flag and asserting the expected keys/types keeps behavior backward compatible while making the metrics source backend-agnostic; this block looks good.

nemo_rl/distributed/ray_actor_environment_registry.py (1)

23-25: Actor-to-environment mappings for SGLang and DTensorPolicyWorkerV2 look correct

The new SGLANG_EXECUTABLE constant and registry entries cleanly route SGLangGenerationWorker to the SGLang extra, and DTensorPolicyWorkerV2 to AUTOMODEL_SGLANG, which matches the added HTTP streaming and SGLang backend requirements; the additional SYSTEM/PENGUIN mappings are also consistent with their dependencies.

Also applies to: 33-39

nemo_rl/models/generation/sglang/sglang_generation.py (1)

213-221: Stub implementation is acceptable but needs documentation clarification.

The init_collective method returns an empty list while accepting multiple parameters. The docstring mentions a TODO but doesn't explain when this will be implemented. This is acceptable as an interface stub, but consider adding a note about current behavior.

nemo_rl/models/generation/sglang/sglang_worker.py (1)

185-188: Binding to all interfaces (0.0.0.0) is intentional for distributed setup.

This allows the SGLang server to accept connections from other nodes in the cluster. While the static analysis flags this as a security concern, it appears necessary for the distributed inference architecture.

Verify that the network environment where this runs has appropriate firewall rules to limit access to these ports.

nemo_rl/algorithms/grpo.py (5)

65-65: SGLang imports added correctly.

The import follows the existing pattern for VllmConfig/VllmGeneration.


486-491: SGLang initialization follows vLLM pattern correctly.

The init_sglang function mirrors the init_vllm pattern with proper timing and finish_generation() call.


609-627: SGLang backend initialization looks correct.

The SGLang path:

  1. Casts config to SGLangConfig
  2. Sets model_path from policy config if not already set
  3. Uses the common initialize_generation_with_policy helper
  4. Prints confirmation message

This follows the pattern established for vLLM and addresses the past review comment about avoiding code duplication.


1221-1223: Generation logger metrics handling is properly generalized.

The code now uses generic get_logger_metrics() and clear_logger_metrics() instead of vLLM-specific methods, supporting both vLLM and SGLang backends as discussed in past review comments.

Also applies to: 1273-1276


2104-2106: Async GRPO correctly handles generation logger metrics.

The async training path properly:

  1. Clears metrics at training start (line 2104-2106)
  2. Collects metrics before refit (line 2353-2356)
  3. Clears metrics after refit (line 2371-2373)

This ensures metrics are collected per-cycle and reset appropriately.

Also applies to: 2346-2356, 2371-2373

Comment on lines +1006 to +1025
if isinstance(policy_generation, SGLangGeneration):
    sglang_url_to_gpu_uuids = policy_generation.get_sglang_url_to_gpu_uuids()
    # Stream weights via HTTP
    flush_success = policy_generation.invalidate_kv_cache()
    futures_train = policy.stream_weights_via_http(
        sglang_url_to_gpu_uuids=sglang_url_to_gpu_uuids,
    )
    # Wait for all workers to complete
    ray.get(futures_train)
    update_success = True
else:
    # Original ZMQ IPC path for vLLM
    futures_train = policy.stream_weights_via_ipc_zmq(
        buffer_size_bytes=buffer_size_bytes
    )
    futures_inference = policy_generation.update_weights_via_ipc_zmq()
    # wait for all futures to complete
    ray.get(futures_train)
    results = ray.get(futures_inference)
    update_success = all(result for result in results if result is not None)

⚠️ Potential issue | 🟡 Minor

Handle flush_success result from invalidate_kv_cache.

The flush_success variable is assigned but never used. If the cache flush fails, the weight update proceeds anyway, which could lead to stale cache issues. Consider at least logging a warning.

             if isinstance(policy_generation, SGLangGeneration):
                 sglang_url_to_gpu_uuids = policy_generation.get_sglang_url_to_gpu_uuids()
                 # Stream weights via HTTP
-                flush_success = policy_generation.invalidate_kv_cache()                
+                flush_success = policy_generation.invalidate_kv_cache()
+                if not flush_success:
+                    print(
+                        "[WARNING] KV cache invalidation failed before weight update. "
+                        "Proceeding with weight update, but cached results may be stale."
+                    )
                 futures_train = policy.stream_weights_via_http(
                     sglang_url_to_gpu_uuids=sglang_url_to_gpu_uuids,
                 )
🧰 Tools
🪛 Ruff (0.14.7)

1009-1009: Local variable flush_success is assigned to but never used

Remove assignment to unused variable flush_success

(F841)

🤖 Prompt for AI Agents
In nemo_rl/algorithms/grpo.py around lines 1006 to 1025, the variable
flush_success returned from policy_generation.invalidate_kv_cache() is assigned
but ignored; check its boolean result and handle failures: if flush_success is
False, log a warning (use the module/process logger available in this scope or
policy_generation.logger) and set update_success = False (and skip or
short-circuit the subsequent stream_weights_via_http call) so weight updates do
not proceed with a stale cache; otherwise proceed as before and set
update_success = True after successful ray.get.

Comment on lines +482 to +507
    def _launch_server_process(self, server_args: ServerArgs) -> multiprocessing.Process:
        """Launch the SGLang server process and wait for it to be ready."""
        p = multiprocessing.Process(target=launch_server, args=(server_args,))
        p.start()

        # Wait for server to be ready by checking health endpoint
        # Use the base_url we stored earlier
        headers = {
            "Content-Type": "application/json; charset=utf-8",
        }

        with requests.Session() as session:
            while True:
                try:
                    response = session.get(f"{self.base_url}/health_generate", headers=headers)
                    if response.status_code == 200:
                        print(f"[SGLang Server] Rank {self.global_rank} Server is ready at {self.base_url}")
                        break
                except requests.RequestException:
                    pass

                if not p.is_alive():
                    raise Exception(f"[SGLang Server] Rank {self.global_rank} Server process terminated unexpectedly.")

                time.sleep(2)
        return p

⚠️ Potential issue | 🟠 Major

Server startup loop lacks timeout - potential infinite hang.

_launch_server_process polls the health endpoint indefinitely. If the server fails to start properly but the process remains alive, this will hang forever. Add a maximum wait time.

     def _launch_server_process(self, server_args: ServerArgs) -> multiprocessing.Process:
         """Launch the SGLang server process and wait for it to be ready."""
         p = multiprocessing.Process(target=launch_server, args=(server_args,))
         p.start()

         # Wait for server to be ready by checking health endpoint
-        # Use the base_url we stored earlier
         headers = {
             "Content-Type": "application/json; charset=utf-8",
         }

+        max_wait_time = 300  # 5 minutes timeout
+        start_time = time.time()
         with requests.Session() as session:
             while True:
+                if time.time() - start_time > max_wait_time:
+                    kill_process_tree(p.pid)
+                    raise TimeoutError(
+                        f"[SGLang Server] Rank {self.global_rank} Server failed to start within {max_wait_time}s"
+                    )
                 try:
                     response = session.get(f"{self.base_url}/health_generate", headers=headers)
                     if response.status_code == 200:
                         print(f"[SGLang Server] Rank {self.global_rank} Server is ready at {self.base_url}")
                         break
                 except requests.RequestException:
                     pass

                 if not p.is_alive():
-                    raise Exception(f"[SGLang Server] Rank {self.global_rank} Server process terminated unexpectedly.")
+                    raise RuntimeError(f"[SGLang Server] Rank {self.global_rank} Server process terminated unexpectedly.")

                 time.sleep(2)
         return p

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.7)

504-504: Create your own exception

(TRY002)


504-504: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In nemo_rl/models/generation/sglang/sglang_worker.py around lines 482 to 507,
the server startup loop polls the health endpoint forever which can hang if the
process stays alive but never becomes ready; add a maximum wait timeout (e.g.,
default 60 seconds or configurable via ServerArgs) by capturing a start time
before the loop, checking elapsed time on each iteration, and if exceeded: stop
polling, terminate or kill the child process, and raise a clear exception that
includes elapsed time, base_url and process status; ensure the timeout value is
configurable and documented in ServerArgs so callers can adjust it.
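The bounded readiness poll both suggestions describe can be sketched as a standalone helper; `probe` and `alive` are injected stand-ins (hypothetical, not the worker's real methods) for the HTTP health check and `Process.is_alive()`:

```python
import time

def wait_until_ready(probe, alive, timeout_s: float = 300.0, interval_s: float = 2.0) -> None:
    """Poll until probe() succeeds, raising if the process dies or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():          # e.g. GET {base_url}/health_generate returned 200
            return
        if not alive():      # e.g. not server_process.is_alive()
            raise RuntimeError("server process terminated unexpectedly")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server not ready within {timeout_s}s")
        time.sleep(interval_s)

# Simulated server that becomes healthy on the third poll
attempts = {"n": 0}
def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

wait_until_ready(fake_probe, alive=lambda: True, timeout_s=1.0, interval_s=0.01)
```

Keeping the deadline check separate from the liveness check preserves the original fast-fail on a dead process while still bounding the "alive but never ready" case.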

Comment on lines +681 to +693
        try:
            if hasattr(self, "session") and self.session is not None:
                try:
                    async def close_session():
                        await self.session.close()
                        if self.connector is not None:
                            await self.connector.close()

                    self.async_loop_thread.run(close_session())
                    print(f"[SGLang Worker] Rank {self.global_rank} aiohttp session closed.")
                except Exception as e:
                    print(f"[SGLang Worker] Rank {self.global_rank} Error closing aiohttp session: {e}")


⚠️ Potential issue | 🟠 Major

Session cleanup may fail after async_loop_thread shutdown.

At line 689, self.async_loop_thread.run(close_session()) is called, but async_loop_thread.shutdown() was already called at line 673. This could cause errors since the event loop thread is already stopped.

     def shutdown(self) -> bool:
         """Shutdown the SGLang server process and cleanup async resources."""
-        if hasattr(self, "async_loop_thread"):
-            try:
-                self.async_loop_thread.shutdown()
-                print(f"[SGLang Worker] Rank {self.global_rank} Async loop thread shut down.")
-            except Exception as e:
-                print(f"[SGLang Worker] Rank {self.global_rank} Error shutting down async loop thread: {e}")
-        
         if not self.is_model_owner:
+            if hasattr(self, "async_loop_thread"):
+                try:
+                    self.async_loop_thread.shutdown()
+                except Exception as e:
+                    print(f"[SGLang Worker] Rank {self.global_rank} Error shutting down async loop thread: {e}")
             return True
         
         try:
             if hasattr(self, "session") and self.session is not None:
                 try:
                     async def close_session():
                         await self.session.close()
                         if self.connector is not None:
                             await self.connector.close()
                     
                     self.async_loop_thread.run(close_session())
                     print(f"[SGLang Worker] Rank {self.global_rank} aiohttp session closed.")
                 except Exception as e:
                     print(f"[SGLang Worker] Rank {self.global_rank} Error closing aiohttp session: {e}")
             
+            # Shutdown async loop thread after session cleanup
+            if hasattr(self, "async_loop_thread"):
+                try:
+                    self.async_loop_thread.shutdown()
+                    print(f"[SGLang Worker] Rank {self.global_rank} Async loop thread shut down.")
+                except Exception as e:
+                    print(f"[SGLang Worker] Rank {self.global_rank} Error shutting down async loop thread: {e}")
+            
             if not hasattr(self, "server_process") or self.server_process is None:
                 return True
🧰 Tools
🪛 Ruff (0.14.7)

691-691: Do not catch blind exception: Exception

(BLE001)

Comment on lines +728 to +733
url = f"{self.base_url}/{endpoint}"
headers = {
"Content-Type": "application/json; charset=utf-8",
}
response = requests.post(url, json=payload or {}, headers=headers)
response.raise_for_status()

⚠️ Potential issue | 🟡 Minor

Add timeout to requests.post to prevent indefinite hangs.

The _make_request method makes HTTP requests without a timeout, which could hang indefinitely if the server becomes unresponsive.

     def _make_request(self, endpoint: str, payload: Optional[dict] = None):
         """Make a POST request to the specified endpoint with the given payload."""
         url = f"{self.base_url}/{endpoint}"
         headers = {
             "Content-Type": "application/json; charset=utf-8",
         }
-        response = requests.post(url, json=payload or {}, headers=headers)
+        response = requests.post(url, json=payload or {}, headers=headers, timeout=60)
         response.raise_for_status()
         return response.json()
🧰 Tools
🪛 Ruff (0.14.7)

732-732: Probable use of requests call without timeout

(S113)

Comment on lines +56 to +62
    def shutdown(self):
        """Shutdown the event loop and wait for the thread to finish."""
        if self.loop.is_running():
            self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join(timeout=2.0)
        if self.loop.is_running():
            self.loop.close()

⚠️ Potential issue | 🟠 Major

Fix event-loop shutdown logic to avoid leaking the loop

shutdown() only calls loop.close() when loop.is_running() after the join, which means in the normal case the loop is never closed and its resources are leaked; and in the rare case it’s still running, close() would be unsafe. Consider:

 def shutdown(self):
     """Shutdown the event loop and wait for the thread to finish."""
-    if self.loop.is_running():
-        self.loop.call_soon_threadsafe(self.loop.stop)
-    self._thread.join(timeout=2.0)
-    if self.loop.is_running():
-        self.loop.close()
+    if self.loop.is_running():
+        self.loop.call_soon_threadsafe(self.loop.stop)
+    self._thread.join(timeout=2.0)
+    if not self.loop.is_closed():
+        self.loop.close()
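A minimal version of the corrected lifecycle (stop the loop from its own thread, join, then close once it is no longer running) can be sketched as a generic helper; this is an illustrative stand-in, not the repo's AsyncLoopThread:

```python
import asyncio
import threading

class AsyncLoopThread:
    """Run an asyncio event loop on a background thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        """Submit a coroutine to the loop thread and block for its result."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    def shutdown(self):
        if self.loop.is_running():
            self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join(timeout=2.0)
        if not self.loop.is_closed():
            self.loop.close()  # always release the loop's resources

async def double(x):
    await asyncio.sleep(0)
    return 2 * x

t = AsyncLoopThread()
result = t.run(double(21))  # → 42
t.shutdown()
```

Closing only when the loop is no longer running (instead of only when it still is) is the key inversion: the normal path now frees the loop, and the hung-thread path avoids an unsafe close().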


Labels

community-request, documentation (Improvements or additions to documentation)


4 participants