Support SGLang Inference Engine#533
Open
pan-x-c wants to merge 16 commits intoagentscope-ai:mainfrom
Open
Conversation
Collaborator
Author
|
/unittest-module-trainer |
1 similar comment
Collaborator
Author
|
/unittest-module-trainer |
Collaborator
Author
|
/unittest-module-trainer |
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚 |
Collaborator
Author
|
/unittest-module-trainer |
Collaborator
Author
|
/unittest-pattern-TestTrainerLoRA |
Collaborator
Author
|
/unittest-pattern-ColocateModeTest |
Summary
Tests
Github Test Reporter by CTRF 💚 |
Summary
Tests
Github Test Reporter by CTRF 💚 |
Collaborator
Author
|
/unittest-diff |
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚 |
There was a problem hiding this comment.
Pull request overview
Adds first-class support for running Explorer rollouts on the SGLang inference engine, while extending the weight-sync API to carry an explicit SyncMethod (NCCL vs checkpoint) and enhancing perf reporting.
Changes:
- Introduce an SGLang rollout model (embedded HTTP server + client) and wiring to create SGLang-based Explorer engines.
- Extend model weight-sync APIs to include
SyncMethod(+ timeout) and propagate through Explorer/proxy/tests. - Improve trainer/inference stability & observability (dtype alignment for FSDP2 LoRA, checkpoint state_dict handling, perf throughput metrics, remove problematic distributed barriers).
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| trinity/trainer/verl/verl_trainer.py | Add trainer backend initialization log. |
| trinity/trainer/verl/megatron_workers.py | Remove incorrect per-rank barrier usage during weight sync. |
| trinity/trainer/verl/fsdp_workers.py | Improve FSDP2/LoRA dtype handling; adjust weight sync group initialization/logging. |
| trinity/trainer/verl/fsdp_checkpoint_manager.py | Make checkpoint upload robust to tensor wrappers (e.g., full_tensor()). |
| trinity/perf/stage_perf.py | Include global token throughput metrics in perf payload timing. |
| trinity/perf/report_viewer.py | Display additional throughput metrics in UI and tweak layout. |
| trinity/perf/report_metrics.py | New helper to compute global token throughput metrics from step metrics. |
| trinity/manager/synchronizer.py | Track latest checkpoint model path and expose via async getter. |
| trinity/explorer/proxy/service.py | Pass explicit SyncMethod when syncing model weights in service loop. |
| trinity/explorer/explorer.py | Propagate sync_method and timeout into model sync calls. |
| trinity/common/workflows/workflow.py | Change async workflow logging level to INFO for chat start/response details. |
| trinity/common/models/vllm_worker.py | Remove barriers and simplify process group init/update sequencing. |
| trinity/common/models/vllm_model.py | Update sync/init_process_group signatures; add sync logging. |
| trinity/common/models/tinker_model.py | Update sync signature to accept SyncMethod and timeout. |
| trinity/common/models/sglang_patch/api_patch.py | Add embedded SGLang HTTP server bootstrap/cleanup utilities. |
| trinity/common/models/sglang_patch/init.py | Export SGLang embedded server helper. |
| trinity/common/models/sglang_model.py | New SGLang rollout model implementation + sync via NCCL or checkpoint. |
| trinity/common/models/model.py | Extend sync API to include SyncMethod; add deterministic port selection via base_port + engine_id. |
| trinity/common/models/external_model.py | Update sync signature to accept SyncMethod and timeout. |
| trinity/common/models/init.py | Add SGLang engine selection and new Ray actor factories; avoid mutating shared config via deepcopy. |
| trinity/common/config.py | Add enable_multimodal, base_port, and engine_id to inference config. |
| trinity/buffer/reader/queue_reader.py | Add StopAsyncIteration handling for Ray queue reads (sync/async). |
| tests/trainer/trainer_test.py | Update strategy matrix to include fsdp2 and propagate strategy into trainer config. |
| tests/template/config.yaml | Switch template trainer strategy default to fsdp2. |
| tests/manager/synchronizer_test.py | Update mocks/assertions for new sync signature and pipeline usage. |
| tests/explorer/scheduler_test.py | Update dummy model sync signature for new parameters. |
| tests/common/config_test.py | Add tests for deterministic port selection behavior. |
| scripts/docker/Dockerfile.uv | Install protobuf compiler dependency. |
| pyproject.toml | Add optional sglang extra; bump transformers minimum version. |
| perf/scripts/explorer/perf_workflow.py | Add perf workflow that measures OpenAI API-call token throughput. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+37
to
+41
| except Exception as e: | ||
| if "StopAsyncIteration" in traceback.format_exc(): | ||
| raise StopIteration() from e | ||
| else: | ||
| raise e |
Comment on lines
+46
to
+54
| try: | ||
| exp_bytes = await self.queue.get_batch.remote( | ||
| batch_size, timeout=self.timeout, **kwargs | ||
| ) | ||
| except Exception as e: | ||
| if "StopAsyncIteration" in traceback.format_exc(): | ||
| raise StopAsyncIteration() from e | ||
| else: | ||
| raise e |
| response.eid.run = i + self.run_id_base | ||
|
|
||
| self.logger.debug( | ||
| self.logger.info( |
Comment on lines
+494
to
+506
| elif method == SyncMethod.CHECKPOINT: | ||
| model_path = await self.synchronizer.get_latest_model_path.remote() | ||
| if model_path is not None: | ||
| await self.api_client.update_weights_from_disk( | ||
| model_path=model_path, | ||
| weight_version=str(model_version), | ||
| timeout=timeout, | ||
| ) | ||
| else: | ||
| raise ValueError(f"Unsupported sync method for SGLang: {method}") | ||
| self.logger.info(f"Synchronized model to version {model_version} using method {method}.") | ||
| self.model_version = model_version | ||
| return model_version |
| usage_completion_tokens += float(completion_tokens) | ||
|
|
||
| self.logger.info("Received response: %s", responses.choices[0].message) | ||
| exps = self.model.extract_experience_from_history() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
As the title says
Checklist
Please check the following items before code is ready to be reviewed.