[Feature][AscendP2P] Add H2H host staging for Ascend P2P HCCL #259
matthewygf wants to merge 1 commit into
Code Review
This pull request introduces a host-staging mechanism for P2P transfers in the Ascend backend, utilizing a bounded pinned host arena instead of registering the entire CPU KV pool. It also implements lease tracking and expiration checks for one-sided reads, and adds error handling to mark failed P2P loads so that vLLM can fall back to local recomputation. There are no review comments, so I have no feedback to provide.
Fixes #217 in P2P setup.
Problem
Currently, when using the HCCL channel, we have to register the entire CPU buffer for zero-copy RDMA transfer.
However, on Ascend hardware, this means registering the CPU buffer with the RoCE hardware, which can lead to a Device OS OOM.
This PR
Instead of registering the entire CPU buffer for zero-copy RDMA transfer, this PR adds a staging area for H2H memobjs, so that the remote one-sided pull can read directly from the staging area.
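To illustrate the idea, here is a minimal sketch of a bounded staging arena with lease expiry. It is not the code in this PR: `StagingArena`, `Lease`, `stage`, `read`, `release`, and `reap_expired` are hypothetical names, a plain `bytearray` stands in for pinned host memory, and `read` only simulates the remote one-sided pull. No HCCL or RDMA registration call is shown.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    slot: int          # index of the staging slot holding the data
    nbytes: int        # number of valid bytes in the slot
    expires_at: float  # clock value after which the lease is invalid


class StagingArena:
    """Hypothetical bounded arena; a bytearray stands in for pinned host memory."""

    def __init__(self, num_slots, slot_bytes, lease_seconds, clock=time.monotonic):
        self.slot_bytes = slot_bytes
        self.lease_seconds = lease_seconds
        self.clock = clock
        # Only this fixed-size region would need registering, not the whole CPU pool.
        self.buffer = bytearray(num_slots * slot_bytes)
        self.free_slots = list(range(num_slots))
        self.leases = {}  # lease_id -> Lease
        self.next_id = 0

    def stage(self, payload):
        """Copy payload into a free slot; return a lease id, or None if the arena is full."""
        if len(payload) > self.slot_bytes:
            raise ValueError("payload larger than a staging slot")
        self.reap_expired()
        if not self.free_slots:
            return None
        slot = self.free_slots.pop()
        start = slot * self.slot_bytes
        self.buffer[start:start + len(payload)] = payload
        lease_id = self.next_id
        self.next_id += 1
        expires_at = self.clock() + self.lease_seconds
        self.leases[lease_id] = Lease(slot, len(payload), expires_at)
        return lease_id

    def read(self, lease_id):
        """Simulate the remote pull; reject unknown or expired leases."""
        lease = self.leases.get(lease_id)
        if lease is None or self.clock() >= lease.expires_at:
            raise KeyError("lease missing or expired")
        start = lease.slot * self.slot_bytes
        return bytes(self.buffer[start:start + lease.nbytes])

    def release(self, lease_id):
        """Return the slot to the free list; unknown ids are ignored."""
        lease = self.leases.pop(lease_id, None)
        if lease is not None:
            self.free_slots.append(lease.slot)

    def reap_expired(self):
        """Free every slot whose lease has expired."""
        now = self.clock()
        expired = [i for i, lease in self.leases.items() if now >= lease.expires_at]
        for lease_id in expired:
            self.release(lease_id)
```

In this sketch the registered footprint is fixed at `num_slots * slot_bytes`, and a reader that arrives after its lease has expired gets an error instead of reading a slot that may have been reused.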
Furthermore, when a P2P pull fails, the system marks the corresponding blocks for local recomputation, preventing silent failures and improving reliability.
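The failure handling can be pictured with a small sketch. This is an assumption-laden illustration, not the PR's implementation: `load_blocks` and `pull_block` are hypothetical names, and the real integration with vLLM is not shown.

```python
def load_blocks(block_ids, pull_block):
    """Try to pull each block; collect the ids that failed instead of hiding the error.

    `pull_block` is a hypothetical callable standing in for the P2P pull.
    Returns (loaded, failed): `loaded` maps block id to data, and `failed`
    is the set of block ids the caller should recompute locally.
    """
    loaded = {}
    failed = set()
    for block_id in block_ids:
        try:
            loaded[block_id] = pull_block(block_id)
        except Exception:
            # Record the failure explicitly so the block is recomputed, not silently lost.
            failed.add(block_id)
    return loaded, failed
```

The point of returning `failed` explicitly is that the caller can tell a block that was never loaded from one that loaded successfully, and can schedule recomputation only for the former.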