[rollout] test: bucketed transfer utils #5309

Open

pengwu22 wants to merge 2 commits into verl-project:main from pengwu22:pw/test-wt-utils

Conversation

@pengwu22
Collaborator

What does this PR do?

  • Abstract the current vLLM weight-update helper out into a dedicated module with clear interfaces, and add tests

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

  • Extra unit tests covering the shm and IPC transfer paths

Checklist Before Submitting

Important

Please check all the following items before requesting a review; otherwise, the reviewer might deprioritize this PR.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the weight transfer logic into a new, dedicated module, bucketed_weight_transfer.py, improving code organization and testability. However, a critical security vulnerability has been identified in the new bucketed weight transfer mechanism: insecure deserialization that enables arbitrary code execution. Using ZMQ's recv_pyobj (which is pickle-based) over a predictable IPC socket path in /tmp/ allows any user on the same host to achieve code execution. This must be addressed by using a secure serialization format and avoiding the transmission of executable callables. Additionally, two high-severity robustness issues were found: the shared-memory creation path bypasses an existing error-handling helper, and the CUDA IPC reconstruction relies on a fragile hardcoded index.
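
A minimal sketch, assuming pyzmq, of the direction this points in: exchange plain-data metadata with recv_json instead of pickled objects with recv_pyobj, so nothing executable can be deserialized from the socket. The socket path and message fields below are illustrative, not the PR's.

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.bind("ipc:///tmp/example_weight_transfer.sock")  # hypothetical path

    # Plain data only (names, dtypes, shapes, offsets); no callables can ride along.
    msg = sock.recv_json()
    # e.g. msg == {"name": "...", "dtype": "bfloat16", "shape": [151936, 4096], "offset": 0}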

Comment on lines +45 to +51

    func, args = handle
    list_args = list(args)
    if device_id is not None:
        # the key is to change device id to the current device id
        # in case two processes have different CUDA_VISIBLE_DEVICES
        list_args[6] = device_id
    buffer = func(*list_args)

Contributor

high

This function relies on a hardcoded index 6 to modify the device ID for CUDA IPC tensor reconstruction. This is highly fragile as it depends on the internal data structure returned by torch.multiprocessing.reductions.reduce_tensor, which is not a public API and can change without notice in future PyTorch versions. A change in PyTorch could lead to silent failures or hard-to-debug errors. To make this slightly more robust, an assertion should be added to validate the structure of the arguments before modification.

Suggested change

-     func, args = handle
-     list_args = list(args)
-     if device_id is not None:
-         # the key is to change device id to the current device id
-         # in case two processes have different CUDA_VISIBLE_DEVICES
-         list_args[6] = device_id
-     buffer = func(*list_args)
+     func, args = handle
+     list_args = list(args)
+     if device_id is not None:
+         # The tuple from reduce_tensor is (rebuild_fn, (tensor_cls, tensor_size, storage_handle, storage_offset, requires_grad, device_id, ipc_handle))
+         # We are modifying the device_id at index 6.
+         assert len(list_args) >= 7, "Unexpected arguments structure for CUDA IPC tensor reconstruction."
+         # the key is to change device id to the current device id
+         # in case two processes have different CUDA_VISIBLE_DEVICES
+         list_args[6] = device_id
+     buffer = func(*list_args)
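
A minimal sketch, not the PR's code, of the reduce_tensor round trip this comment refers to; reduce_tensor is a PyTorch internal, so the layout of args (including the position of the device id that the PR patches) is an assumption that may change between versions.

    import torch
    from torch.multiprocessing.reductions import reduce_tensor

    # Producer process: turn a CUDA tensor into a small, picklable handle
    # instead of copying the underlying data.
    tensor = torch.arange(8, device="cuda")
    rebuild_fn, args = reduce_tensor(tensor)

    # Consumer process (same host, different process): after receiving
    # (rebuild_fn, args), it reconstructs a tensor backed by the same GPU memory:
    #
    #     rebuilt = rebuild_fn(*args)
    #
    # This call is where the device id inside `args` must be rewritten first
    # when the two processes see different CUDA_VISIBLE_DEVICES mappings.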

Collaborator Author

as original, no change


# Create unique name for shared memory
shm_name = f"verl_weights_{uuid.uuid4().hex}"
shm = shared_memory.SharedMemory(name=shm_name, create=True, size=self.bucket_size)

Contributor

high

The create_shared_memory helper function, which correctly handles attaching to existing shared memory segments in case of a FileExistsError, is defined in this file but not used here. Instead, shared_memory.SharedMemory(create=True) is called directly. This could lead to unhandled FileExistsError exceptions if a shared memory segment from a previous crashed run was not cleaned up. To improve robustness and reuse the existing helper, please use create_shared_memory.

Suggested change

-     shm = shared_memory.SharedMemory(name=shm_name, create=True, size=self.bucket_size)
+     shm = create_shared_memory(self.bucket_size, shm_name)
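
A hedged sketch of the behavior this comment attributes to create_shared_memory; the PR's actual helper may differ in signature and details.

    from multiprocessing import shared_memory

    def create_shared_memory(size: int, name: str) -> shared_memory.SharedMemory:
        # Try to create a fresh segment; if one with the same name survived a
        # previous crashed run, attach to it instead of raising FileExistsError.
        try:
            return shared_memory.SharedMemory(name=name, create=True, size=size)
        except FileExistsError:
            return shared_memory.SharedMemory(name=name, create=False)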

Collaborator Author

as original. this part not changed

@wuxibin89
Collaborator

Hold this PR until #5029 is merged.
