[rollout] test: bucketed transfer utils #5309
pengwu22 wants to merge 2 commits into verl-project:main
Conversation
Code Review
This pull request refactors the weight transfer logic into a new, dedicated module bucketed_weight_transfer.py, improving code organization and testability. However, a critical security vulnerability has been identified in the new bucketed weight transfer mechanism: insecure deserialization that enables arbitrary code execution. The use of ZMQ's recv_pyobj (which deserializes with pickle) over a predictable IPC socket path in /tmp/ allows any user on the same host to achieve code execution. This must be addressed by switching to a secure serialization format and avoiding the transmission of executable callables. Additionally, two high-severity robustness issues were found: one concerning the reuse of a helper function for shared memory creation to improve error handling, and another regarding the fragility of relying on a hardcoded index for CUDA IPC.
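For illustration, a minimal sketch of the safer pattern the review is asking for (the socket path, message shape, and all names here are hypothetical, not from this PR): a freshly created private socket directory instead of a predictable path in /tmp/, and pyzmq's JSON helpers instead of the pickle-based send_pyobj/recv_pyobj:

```python
import os
import tempfile

import zmq

# Hypothetical setup, not the PR's actual code: bind the IPC socket inside a
# freshly created private directory (mkdtemp creates it with mode 0700), so
# other users on the same host cannot connect to or squat on the endpoint.
sock_dir = tempfile.mkdtemp(prefix="verl_ipc_")
endpoint = f"ipc://{os.path.join(sock_dir, 'weights.sock')}"

ctx = zmq.Context.instance()
server = ctx.socket(zmq.REP)
server.bind(endpoint)

client = ctx.socket(zmq.REQ)
client.connect(endpoint)

# send_json/recv_json carry plain data only; unlike recv_pyobj (pickle),
# deserializing a JSON message cannot execute attacker-supplied code, and
# no callables are ever transmitted over the socket.
client.send_json({"bucket_id": 0, "num_bytes": 1024})
print(server.recv_json())
```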
```python
func, args = handle
list_args = list(args)
if device_id is not None:
    # the key is to change device id to the current device id
    # in case two processes have different CUDA_VISIBLE_DEVICES
    list_args[6] = device_id
buffer = func(*list_args)
```
This function relies on a hardcoded index 6 to modify the device ID for CUDA IPC tensor reconstruction. This is highly fragile as it depends on the internal data structure returned by torch.multiprocessing.reductions.reduce_tensor, which is not a public API and can change without notice in future PyTorch versions. A change in PyTorch could lead to silent failures or hard-to-debug errors. To make this slightly more robust, an assertion should be added to validate the structure of the arguments before modification.
Suggested change:

```diff
 func, args = handle
 list_args = list(args)
 if device_id is not None:
+    # The tuple from reduce_tensor is (rebuild_fn, (tensor_cls, tensor_size, storage_handle, storage_offset, requires_grad, device_id, ipc_handle))
+    # We are modifying the device_id at index 6.
+    assert len(list_args) >= 7, "Unexpected arguments structure for CUDA IPC tensor reconstruction."
     # the key is to change device id to the current device id
     # in case two processes have different CUDA_VISIBLE_DEVICES
     list_args[6] = device_id
 buffer = func(*list_args)
```
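As a side note, the fragility is easy to see in a hedged sketch (not part of this PR; requires a CUDA device): the handle is just a `(rebuild_fn, args)` pair whose `args` layout is a private PyTorch detail, so it can be inspected at runtime but not relied upon across releases:

```python
import torch
from torch.multiprocessing.reductions import reduce_tensor

# Illustrative only: reduce_tensor returns a picklable (rebuild_fn, args)
# pair. The position of the device entry inside `args` is an internal
# detail that may shift between PyTorch versions, hence the assertion
# suggested above before indexing into it.
t = torch.zeros(4, device="cuda")
rebuild_fn, args = reduce_tensor(t)
print(rebuild_fn.__name__, len(args))  # arg count is version-dependent
```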
Kept as original, no change.
```python
# Create unique name for shared memory
shm_name = f"verl_weights_{uuid.uuid4().hex}"
shm = shared_memory.SharedMemory(name=shm_name, create=True, size=self.bucket_size)
```
The create_shared_memory helper function, which correctly handles attaching to existing shared memory segments in case of a FileExistsError, is defined in this file but not used here. Instead, shared_memory.SharedMemory(create=True) is called directly. This could lead to unhandled FileExistsError exceptions if a shared memory segment from a previous crashed run was not cleaned up. To improve robustness and reuse the existing helper, please use create_shared_memory.
Suggested change:

```diff
-shm = shared_memory.SharedMemory(name=shm_name, create=True, size=self.bucket_size)
+shm = create_shared_memory(self.bucket_size, shm_name)
```
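For reference, a helper with the behavior the review describes might look like the sketch below (the real implementation lives in this file; the signature is only inferred from the suggestion above):

```python
from multiprocessing import shared_memory

def create_shared_memory(size: int, name: str) -> shared_memory.SharedMemory:
    """Create a shared memory segment, attaching if it already exists."""
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # A segment with this name survived a previous (crashed) run;
        # attach to it instead of failing.
        return shared_memory.SharedMemory(name=name, create=False)
```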
Kept as original; this part is not changed.
Hold this PR until #5029 is merged.
What does this PR do?
Test
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)