Summary
Add an optional radix prefix cache backend to PyPTO Serving. The backend should reuse existing paged KV cache pages for requests that share long token prefixes, while keeping the current hash prefix cache as the default path.
Area
Batching and scheduling
Motivation / Use Case
Many serving workloads contain repeated long prefixes, such as system prompts, shared context, retrieved documents, or multi-turn prompt templates. The current prefix cache path is hash-block based. A radix tree is a more natural structure for longest-prefix matching and can reduce duplicated prefill work when requests share a large prefix but diverge near the end.
The goal is to support this workflow:
request A: shared long prefix + suffix A
request B: shared long prefix + suffix B
After request A computes and stores its prefix KV pages, request B should reuse the matching prefix pages and only prefill the unmatched suffix.
Proposed API / Behavior
Add a CLI option:
--prefix-cache-backend {hash,radix}
Default behavior should remain unchanged:
Expected behavior:
hash continues to use the existing prefix cache path.
radix enables a page-aligned radix prefix cache.
- The radix key should include model namespace information, for example
(model_id, token_prefix), to avoid cross-model page reuse.
- The radix cache should map page-aligned token prefixes to physical KV page IDs.
- On a prefix hit, the scheduler should attach matched KV pages to the request and set
num_computed_tokens to the matched prefix length.
- The scheduler should only schedule prefill for the unmatched suffix.
- After prefill/decode progresses, completed page-aligned prefixes should be inserted into the radix cache.
- Request finish, abort, and preemption should release request references and radix locks correctly.
- Existing page attention kernels and paged KV storage should remain unchanged.
Flow / Request Lifecycle
The radix backend request lifecycle should be:
1. Request enters serving
The OpenAI-compatible API receives a request. The engine tokenizes the
prompt and creates a Request carrying model_id, prompt_token_ids,
max_new_tokens, and sampling parameters.
2. Scheduler receives a waiting request
The scheduler pops a new request from the waiting queue.
If prefix_cache_backend == "hash", it follows the existing hash prefix
cache path.
If prefix_cache_backend == "radix", it enters radix prefix lookup.
3. Radix prefix lookup
The scheduler looks up radix cache by:
(model_id, prompt_token_ids[:prompt_len - 1])
The lookup should stop at prompt_len - 1 to avoid a full-prompt hit with
no token left for prefill/first-logits computation.
4. Attach matched KV pages
If radix hits a page-aligned prefix:
- Set request.cached_block_ids to the matched physical page IDs.
- Set request.num_computed_tokens to matched_tokens.
- Retain matched pages for the active request.
- Lock the matched radix node/path to prevent eviction while the request
is running.
5. Allocate pages for unmatched suffix
The scheduler allocates new KV pages only for the unmatched suffix tokens.
The worker still receives the normal paged KV layout:
block_ids = cached pages + newly allocated pages
6. Worker runs prefill/decode
The worker uses the existing Qwen3 prefill/decode path and page attention
kernels. Radix should not change the kernel interface or the underlying
paged KV storage format.
7. Insert completed prefixes
After worker execution, the scheduler updates request.num_computed_tokens.
Completed page-aligned prefixes are inserted into the radix cache:
(model_id, completed token prefix) -> physical page IDs
8. Request finishes or is interrupted
On finish, abort, or preemption:
- Release active request page references.
- Release radix node locks.
- Keep radix cache-owned page references for future reuse.
- Eviction may reclaim only radix leaf pages without active request locks.
Simplified flow:
HTTP request
|
v
Tokenize prompt
|
v
Scheduler waiting queue
|
+---------------- prefix_cache_backend == hash ----------------+
| |
v |
Existing hash prefix cache path |
|
+---------------- prefix_cache_backend == radix ---------------+
|
v
Radix lookup by (model_id, token_prefix)
|
+-------------+-------------+
| |
v v
miss hit
| |
v v
allocate all needed pages attach matched cached pages
| |
+-------------+-------------+
|
v
allocate pages for unmatched suffix
|
v
run existing prefill/decode kernels
|
v
insert completed page-aligned prefix
|
v
finish / abort / preempt cleanup
Alternatives Considered
Keep only the existing hash-block prefix cache. This is simpler but less expressive for longest-prefix reuse across prompts with shared long prefixes.
Implement radix attention directly in lower-level kernels. This would be more invasive and would require changing the kernel/runtime interface. The proposed design keeps radix in the serving scheduler and KV cache management layer, reusing the existing paged KV cache and page attention kernels.
Additional Context
A possible implementation plan:
- Add a backend selector in the serving CLI and config path.
- Add a
RadixPrefixCache data structure for compressed radix-tree prefix matching.
- Store only page-aligned prefixes so the cache can safely reuse complete KV pages.
- Extend
KvCacheManager with explicit request/cache page retain and release APIs.
- In the scheduler, add a radix branch parallel to the current hash prefix-cache branch.
- Track radix node locks while a request is using matched prefix pages.
- Insert completed page-aligned prefixes after worker execution.
- Add unit tests for tree insert/match/split, scheduler hit/miss behavior, default hash behavior, cleanup, and preemption.
- Add an HTTP-level validation script that sends cold/hit/miss requests and checks expected radix debug counters.
Expected validation criteria:
- Default startup without
--prefix-cache-backend radix keeps the existing hash path.
- Explicit
--prefix-cache-backend radix enables radix lookup and insertion.
- A second request with the same page-aligned prefix should reuse cached pages.
- A request with a different prefix should miss and run normal prefill.
- Request completion, abort, and preemption should not leak KV page references.
- The implementation should work with Qwen3-14B serving using the existing paged attention path.
Summary
Add an optional radix prefix cache backend to PyPTO Serving. The backend should reuse existing paged KV cache pages for requests that share long token prefixes, while keeping the current hash prefix cache as the default path.
Area
Batching and scheduling
Motivation / Use Case
Many serving workloads contain repeated long prefixes, such as system prompts, shared context, retrieved documents, or multi-turn prompt templates. The current prefix cache path is hash-block based. A radix tree is a more natural structure for longest-prefix matching and can reduce duplicated prefill work when requests share a large prefix but diverge near the end.
The goal is to support this workflow:
After request A computes and stores its prefix KV pages, request B should reuse the matching prefix pages and only prefill the unmatched suffix.
Proposed API / Behavior
Add a CLI option:
--prefix-cache-backend {hash,radix}Default behavior should remain unchanged:
Expected behavior:
hashcontinues to use the existing prefix cache path.radixenables a page-aligned radix prefix cache.(model_id, token_prefix), to avoid cross-model page reuse.num_computed_tokensto the matched prefix length.Flow / Request Lifecycle
The radix backend request lifecycle should be:
Simplified flow:
Alternatives Considered
Keep only the existing hash-block prefix cache. This is simpler but less expressive for longest-prefix reuse across prompts with shared long prefixes.
Implement radix attention directly in lower-level kernels. This would be more invasive and would require changing the kernel/runtime interface. The proposed design keeps radix in the serving scheduler and KV cache management layer, reusing the existing paged KV cache and page attention kernels.
Additional Context
A possible implementation plan:
RadixPrefixCachedata structure for compressed radix-tree prefix matching.KvCacheManagerwith explicit request/cache page retain and release APIs.Expected validation criteria:
--prefix-cache-backend radixkeeps the existing hash path.--prefix-cache-backend radixenables radix lookup and insertion.