Fix the usage of meta information #23
Conversation
BTW, the remote token number (stored in

It is used in combine
While reviewing the code, I noticed that the kernels copy data from CUDA buffers to NVSHMEM buffers before sending. For example, in dispatch here. This effectively disables the zero copy that is possible, since NVSHMEM can send from both NVSHMEM and plain CUDA buffers. The copy is currently required for two reasons:
In the proposed variant, no piggy-backing is required, so if it is acceptable performance-wise to send the token and its scales separately, no copy is needed. I want to play with it to see if it gives any improvement.
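
For illustration, a minimal sketch of the two send paths is below. It assumes, as the comment above states, that the source of a put can be a plain CUDA allocation while only the destination has to live in symmetric (NVSHMEM) memory. All names (`x`, `rdma_send_buf`, `rdma_recv_buf`, `hidden_bytes`, `dst_slot`, `dst_pe`) are hypothetical placeholders, not the actual identifiers used in the kernels.

```cuda
#include <cstdint>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical per-token send executed by one warp; all names are placeholders.
__device__ void send_token_staged(const uint8_t* x,          // plain CUDA buffer holding tokens
                                  uint8_t* rdma_send_buf,    // symmetric staging buffer (local)
                                  uint8_t* rdma_recv_buf,    // symmetric receive buffer (remote)
                                  int token_idx, int dst_slot, int dst_pe,
                                  size_t hidden_bytes) {
    const int lane = threadIdx.x % 32;

    // Current pattern: copy the token into the symmetric staging buffer first ...
    uint8_t* staged = rdma_send_buf + (size_t)token_idx * hidden_bytes;
    for (size_t i = lane; i < hidden_bytes; i += 32)
        staged[i] = x[(size_t)token_idx * hidden_bytes + i];
    __syncwarp();

    // ... then issue the put from the staging buffer.
    nvshmemx_putmem_nbi_warp(rdma_recv_buf + (size_t)dst_slot * hidden_bytes,
                             staged, hidden_bytes, dst_pe);
}

// Proposed zero-copy pattern: put straight from the plain CUDA buffer,
// skipping the staging copy (assumes a non-symmetric put source is allowed).
__device__ void send_token_zero_copy(const uint8_t* x,
                                     uint8_t* rdma_recv_buf,
                                     int token_idx, int dst_slot, int dst_pe,
                                     size_t hidden_bytes) {
    nvshmemx_putmem_nbi_warp(rdma_recv_buf + (size_t)dst_slot * hidden_bytes,
                             x + (size_t)token_idx * hidden_bytes,
                             hidden_bytes, dst_pe);
}
```

If the scales are no longer piggy-backed with the token payload, they could simply go out as a second, smaller put to their own slot in the receive buffer.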


Hello,
While reviewing the code, I spotted a potential misuse of the meta-arrays generated during the dispatch phase.
Please find my reasoning below.
Here, instead of getting sequential indices of the tokens sent to this specific expert, the global token numbers are obtained.
This results in tokens being written in a non-contiguous fashion (instead of being compacted at the beginning of the corresponding buffer segment).
Because the target buffer is always large enough to accommodate all tokens, this does not cause a buffer overflow.
And due to the absence (so far?) of data verification, the issue has gone unnoticed.
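
To make the distinction concrete, here is a small sketch of the two indexing schemes, using hypothetical names (`recv_x`, `expert_counters`, `src_token_idx`, `capacity`) rather than the actual meta-arrays produced by the dispatch kernel:

```cuda
// Hypothetical write of one dispatched token into an expert's receive segment,
// executed by one warp per token. recv_x is laid out as
// [num_experts][capacity][hidden]; expert_counters holds one counter per expert.

// Intended behaviour: a sequential per-expert index compacts the tokens
// at the beginning of the expert's segment.
__device__ void write_token_compacted(float* recv_x, int* expert_counters,
                                      const float* token, int expert_id,
                                      size_t capacity, size_t hidden) {
    const int lane = threadIdx.x % 32;
    int slot = 0;
    if (lane == 0)
        slot = atomicAdd(&expert_counters[expert_id], 1);  // 0, 1, 2, ... per expert
    slot = __shfl_sync(0xffffffff, slot, 0);               // broadcast the slot to the warp
    float* dst = recv_x + (expert_id * capacity + slot) * hidden;
    for (size_t i = lane; i < hidden; i += 32)
        dst[i] = token[i];
}

// Behaviour described above: the global (source) token number is used as the
// slot, so tokens land at scattered offsets within the segment. The segment is
// sized for the worst case, so nothing overflows, but the data is no longer
// compacted at the start of the segment.
__device__ void write_token_scattered(float* recv_x, const float* token,
                                      int expert_id, int src_token_idx,
                                      size_t capacity, size_t hidden) {
    const int lane = threadIdx.x % 32;
    float* dst = recv_x + (expert_id * capacity + src_token_idx) * hidden;
    for (size_t i = lane; i < hidden; i += 32)
        dst[i] = token[i];
}
```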