Hello flashinfer team,

Support for CUDA Graph compatibility with variable-length inputs has been updated, but one issue remains:
When running chunked prefill or speculative decoding with a variable number of tokens under a fixed batch size, key_states and value_states must be appended to the paged KV cache tensor. However, the current AppendPagedKVCacheKernel does not fully support variable-length inputs.
To run append_paged_kv_cache within a CUDA Graph, I tried the following workaround:
• Among the arguments of append_paged_kv_cache, key_states, value_states, batch_indices, and positions depend on the number of tokens.
• I filled unused token slots in key_states and value_states with zeros, as the model's input size remains fixed.
• I filled batch_indices and positions with arbitrary values for those unused slots.
However, for this approach to work correctly, the kernel should refer to kv_indptr and avoid processing any indices beyond the last valid token. The current kernel implementation lacks this check.
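To make the workaround concrete, here is a rough Python sketch of how such a fixed-shape call could be captured and replayed. The argument order of flashinfer.append_paged_kv_cache, the single-tensor NHD cache layout, and all sizes (MAX_TOKENS, heads, pages, batch) are illustrative assumptions, not guaranteed to match the library exactly:

```python
import torch
import flashinfer

# All sizes below are illustrative assumptions for the sketch, not library constants.
MAX_TOKENS = 1024            # fixed token capacity baked into the captured graph
NUM_KV_HEADS, HEAD_DIM = 8, 128
MAX_PAGES, PAGE_SIZE, BATCH_SIZE = 4096, 16, 8
device, dtype = "cuda", torch.float16

# Static buffers: the graph always reads/writes these same tensors on every replay.
key_states   = torch.zeros(MAX_TOKENS, NUM_KV_HEADS, HEAD_DIM, dtype=dtype, device=device)
value_states = torch.zeros(MAX_TOKENS, NUM_KV_HEADS, HEAD_DIM, dtype=dtype, device=device)
batch_indices = torch.zeros(MAX_TOKENS, dtype=torch.int32, device=device)  # padding slots hold arbitrary values
positions     = torch.zeros(MAX_TOKENS, dtype=torch.int32, device=device)
kv_cache      = torch.zeros(MAX_PAGES, 2, PAGE_SIZE, NUM_KV_HEADS, HEAD_DIM, dtype=dtype, device=device)
kv_indices       = torch.zeros(MAX_PAGES, dtype=torch.int32, device=device)
kv_indptr        = torch.zeros(BATCH_SIZE + 1, dtype=torch.int32, device=device)
kv_last_page_len = torch.ones(BATCH_SIZE, dtype=torch.int32, device=device)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # All MAX_TOKENS slots are handed to the kernel; only a prefix of them is real.
    # (A warm-up call before capture is omitted here for brevity.)
    flashinfer.append_paged_kv_cache(
        key_states, value_states, batch_indices, positions,
        kv_cache, kv_indices, kv_indptr, kv_last_page_len,
    )

# Each step: copy that step's real tokens into the front of the static buffers, leave the
# padded tail zero-filled / arbitrary, update kv_indptr and friends in place, then replay.
graph.replay()
```

The point is that the launch shape is fixed at MAX_TOKENS, so on any replay the kernel itself has to decide, from kv_indptr, which of those slots actually hold tokens.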
Therefore, I propose the following modification to the kernel:
Original Code:
Proposed Fix:
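The actual before/after snippets from the issue are not reproduced above. Purely to illustrate the kind of guard being proposed (hypothetical kernel and parameter names, not the FlashInfer source), it could look roughly like this, assuming the valid-token bound is derived on device from kv_indptr as described:

```cuda
#include <cstdint>

// Hypothetical illustration only -- not the actual FlashInfer AppendPagedKVCacheKernel.
// Idea: derive the number of valid tokens for this replay from device-resident metadata
// (the issue suggests kv_indptr) and early-exit for padded slots, so a fixed-size launch
// captured in a CUDA Graph never writes padding into the cache.
template <typename DType, typename IdType>
__global__ void AppendPagedKVCacheKernelSketch(const DType* append_key,
                                               const DType* append_value,
                                               const IdType* batch_indices,
                                               const IdType* positions,
                                               const IdType* valid_token_count  // hypothetical: derived from kv_indptr
                                               /* ..., cache pointers and layout parameters ... */) {
  uint32_t token_idx = blockIdx.x;  // launched over the padded (fixed) token capacity
  if (token_idx >= static_cast<uint32_t>(*valid_token_count)) {
    return;  // this slot is padding in the current replay; do not touch the cache
  }
  // ... unchanged logic: copy key/value for token_idx into its page using
  //     batch_indices[token_idx] and positions[token_idx] ...
}
```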
If there is a better way to handle variable-length inputs with append_paged_kv_cache using CUDA Graphs, please let me know! I'm open to alternative suggestions.