[rollout, sglang] feat: support blockwise fp8 rollout #4415
base: main
Conversation
Code Review
This pull request introduces FP8 rollout support for the sglang backend, which is a significant feature enhancement. The changes include adding a new utility file for FP8 quantization, updating the sglang server and rollout worker to handle FP8 configurations, and updating the documentation accordingly. The implementation appears solid. My review focuses on improving maintainability by addressing code duplication and removing unreachable code.
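For context on the mechanics behind the reviewed hunks, here is a minimal, illustrative sketch of what a blockwise FP8 quantizer like the PR's scaled_fp8_blockwise (called in the diff below) typically computes: per-block e4m3 values plus per-block inverse scales. The function name and exact shapes here are assumptions; the PR's actual implementation may differ.

    import torch
    import torch.nn.functional as F

    def blockwise_fp8_sketch(w: torch.Tensor, weight_block_size=(128, 128)):
        """Quantize a 2D weight to float8_e4m3fn with one scale per block.

        Dequantization multiplies each block by its scale_inv entry.
        Illustrative sketch only.
        """
        bm, bn = weight_block_size
        m, n = w.shape
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        # Pad so the matrix tiles evenly into (bm, bn) blocks.
        w_pad = F.pad(w.float(), (0, (-n) % bn, 0, (-m) % bm))
        mp, np_ = w_pad.shape
        blocks = w_pad.view(mp // bm, bm, np_ // bn, bn)
        # One scale per block, chosen so the block's max |w| maps to fp8_max.
        amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
        scale = fp8_max / amax
        w_fp8 = (blocks * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        w_fp8 = w_fp8.view(mp, np_)[:m, :n]
        scale_inv = (1.0 / scale).squeeze(-1).squeeze(1)  # (m_blocks, n_blocks)
        return w_fp8, scale_inv

Dequantization multiplies each 128x128 block by its scale_inv entry, which is why the diff below stores the scales under k + "_scale_inv".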
    if weight_block_size is not None:
        if torch.distributed.get_rank() == 0:
            logger.debug(f"  Quantizing to FP8 blockwise: {k}")
        param_lp, param_scale = scaled_fp8_blockwise(
            v.to(dtype),
            weight_block_size=weight_block_size,
        )
        param_scale = param_scale.squeeze(-1)
        weights_quantized.append([k, param_lp])
        weights_quantized.append([k + "_scale_inv", param_scale])
    else:
        raise ValueError(
            "Only blockwise quantization is supported. Please set weight_block_size in quant_config"
        )
This else block is unreachable. weight_block_size is checked for None on line 152 before the loop begins, and an exception is raised if it is None. Consequently, the condition weight_block_size is not None on line 163 will always evaluate to true inside the loop, rendering the else branch dead code. Removing the conditional wrapper and the unreachable else block will improve code clarity and maintainability.
    if torch.distributed.get_rank() == 0:
        logger.debug(f"  Quantizing to FP8 blockwise: {k}")
    param_lp, param_scale = scaled_fp8_blockwise(
        v.to(dtype),
        weight_block_size=weight_block_size,
    )
    param_scale = param_scale.squeeze(-1)
    weights_quantized.append([k, param_lp])
    weights_quantized.append([k + "_scale_inv", param_scale])

    assert sglang.__version__ >= "0.5.5", "sglang>=0.5.5 is required for FP8 quantization"
    FP8_BLOCK_QUANT_KWARGS = {
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "quant_method": "fp8",
        "weight_block_size": [128, 128],
    }
    fp8_block_quant_kwargs = dict(FP8_BLOCK_QUANT_KWARGS)
The FP8 quantization configuration logic, including the version check and FP8_BLOCK_QUANT_KWARGS dictionary, is duplicated in verl/workers/rollout/sglang_rollout/sglang_rollout.py. To improve maintainability and prevent future inconsistencies, this logic should be centralized. Consider moving FP8_BLOCK_QUANT_KWARGS to verl/utils/sglang/sglang_fp8_utils.py as a constant and creating a helper function there to encapsulate the version check and config creation.
    assert sglang.__version__ >= "0.5.5", "sglang>=0.5.5 is required for FP8 quantization"
    FP8_BLOCK_QUANT_KWARGS = {
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "quant_method": "fp8",
        "weight_block_size": [128, 128],
    }
    fp8_block_quant_kwargs = dict(FP8_BLOCK_QUANT_KWARGS)
The FP8 quantization configuration logic, including the version check and FP8_BLOCK_QUANT_KWARGS dictionary, is duplicated in verl/workers/rollout/sglang_rollout/async_sglang_server.py. To improve maintainability and prevent future inconsistencies, this logic should be centralized. Consider moving FP8_BLOCK_QUANT_KWARGS to verl/utils/sglang/sglang_fp8_utils.py as a constant and creating a helper function there to encapsulate the version check and config creation.
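A minimal sketch of the centralized helper these two comments suggest, assuming it lives in verl/utils/sglang/sglang_fp8_utils.py (the function name and placement are illustrative, not part of the PR):

    import sglang

    # Shared FP8 blockwise quantization config (single source of truth).
    FP8_BLOCK_QUANT_KWARGS = {
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "quant_method": "fp8",
        "weight_block_size": [128, 128],
    }

    def get_fp8_block_quant_kwargs() -> dict:
        """Version-check sglang and return a fresh copy of the FP8 block-quant config."""
        assert sglang.__version__ >= "0.5.5", "sglang>=0.5.5 is required for FP8 quantization"
        return dict(FP8_BLOCK_QUANT_KWARGS)

Both call sites would then reduce to fp8_block_quant_kwargs = get_fp8_block_quant_kwargs().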
What does this PR do?
This PR introduces FP8 rollout with the sglang inference backend in verl.
Experiments and Outcomes
Qwen3-8B-Base Dense Model
Configuration
Accuracy
With TIS (truncated importance sampling), FP8 rollout accuracy aligns with BF16.

Performance


purple: BF16, red: FP8 rollout
Results and observations:
Checklist Before Starting
Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI):
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`; if the PR involves multiple modules, separate them with commas, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
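The PR description leaves this section as a template placeholder. Below is a hypothetical sketch of exercising sglang's online FP8 quantization directly, which is what the rollout worker enables under the hood; the exact verl config knob for turning on FP8 rollout is not shown in this excerpt, so this stands in for it.

    # Hypothetical sketch: sglang offline engine with online FP8 quantization.
    import sglang as sgl

    llm = sgl.Engine(
        model_path="Qwen/Qwen3-8B-Base",  # model used in the experiments above
        quantization="fp8",               # online FP8 quantization in sglang
    )
    outputs = llm.generate(
        ["The capital of France is"],
        {"temperature": 0.0, "max_new_tokens": 16},
    )
    print(outputs[0]["text"])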
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review; otherwise, the reviewer may deprioritize this PR.
- Run pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)