Skip to content

[Feature] Add radix prefix cache backend for paged KV reuse #38

Description

@zmnobug

Summary

Add an optional radix prefix cache backend to PyPTO Serving. The backend should reuse existing paged KV cache pages for requests that share long token prefixes, while keeping the current hash prefix cache as the default path.

Area

Batching and scheduling

Motivation / Use Case

Many serving workloads contain repeated long prefixes, such as system prompts, shared context, retrieved documents, or multi-turn prompt templates. The current prefix cache path is hash-block based. A radix tree is a more natural structure for longest-prefix matching and can reduce duplicated prefill work when requests share a large prefix but diverge near the end.

The goal is to support this workflow:

request A: shared long prefix + suffix A
request B: shared long prefix + suffix B

After request A computes and stores its prefix KV pages, request B should reuse the matching prefix pages and only prefill the unmatched suffix.

Proposed API / Behavior

Add a CLI option:

--prefix-cache-backend {hash,radix}

Default behavior should remain unchanged:

default: hash

Expected behavior:

  • hash continues to use the existing prefix cache path.
  • radix enables a page-aligned radix prefix cache.
  • The radix key should include model namespace information, for example (model_id, token_prefix), to avoid cross-model page reuse.
  • The radix cache should map page-aligned token prefixes to physical KV page IDs.
  • On a prefix hit, the scheduler should attach matched KV pages to the request and set num_computed_tokens to the matched prefix length.
  • The scheduler should only schedule prefill for the unmatched suffix.
  • After prefill/decode progresses, completed page-aligned prefixes should be inserted into the radix cache.
  • Request finish, abort, and preemption should release request references and radix locks correctly.
  • Existing page attention kernels and paged KV storage should remain unchanged.

Flow / Request Lifecycle

The radix backend request lifecycle should be:

1. Request enters serving
   The OpenAI-compatible API receives a request. The engine tokenizes the
   prompt and creates a Request carrying model_id, prompt_token_ids,
   max_new_tokens, and sampling parameters.

2. Scheduler receives a waiting request
   The scheduler pops a new request from the waiting queue.
   If prefix_cache_backend == "hash", it follows the existing hash prefix
   cache path.
   If prefix_cache_backend == "radix", it enters radix prefix lookup.

3. Radix prefix lookup
   The scheduler looks up radix cache by:
     (model_id, prompt_token_ids[:prompt_len - 1])
   The lookup should stop at prompt_len - 1 to avoid a full-prompt hit with
   no token left for prefill/first-logits computation.

4. Attach matched KV pages
   If radix hits a page-aligned prefix:
     - Set request.cached_block_ids to the matched physical page IDs.
     - Set request.num_computed_tokens to matched_tokens.
     - Retain matched pages for the active request.
     - Lock the matched radix node/path to prevent eviction while the request
       is running.

5. Allocate pages for unmatched suffix
   The scheduler allocates new KV pages only for the unmatched suffix tokens.
   The worker still receives the normal paged KV layout:
     block_ids = cached pages + newly allocated pages

6. Worker runs prefill/decode
   The worker uses the existing Qwen3 prefill/decode path and page attention
   kernels. Radix should not change the kernel interface or the underlying
   paged KV storage format.

7. Insert completed prefixes
   After worker execution, the scheduler updates request.num_computed_tokens.
   Completed page-aligned prefixes are inserted into the radix cache:
     (model_id, completed token prefix) -> physical page IDs

8. Request finishes or is interrupted
   On finish, abort, or preemption:
     - Release active request page references.
     - Release radix node locks.
     - Keep radix cache-owned page references for future reuse.
     - Eviction may reclaim only radix leaf pages without active request locks.

Simplified flow:

HTTP request
    |
    v
Tokenize prompt
    |
    v
Scheduler waiting queue
    |
    +---------------- prefix_cache_backend == hash ----------------+
    |                                                             |
    v                                                             |
Existing hash prefix cache path                                  |
                                                                  |
    +---------------- prefix_cache_backend == radix ---------------+
                                  |
                                  v
                    Radix lookup by (model_id, token_prefix)
                                  |
                    +-------------+-------------+
                    |                           |
                    v                           v
                 miss                         hit
                    |                           |
                    v                           v
          allocate all needed pages     attach matched cached pages
                    |                           |
                    +-------------+-------------+
                                  |
                                  v
                    allocate pages for unmatched suffix
                                  |
                                  v
                    run existing prefill/decode kernels
                                  |
                                  v
                    insert completed page-aligned prefix
                                  |
                                  v
                    finish / abort / preempt cleanup

Alternatives Considered

Keep only the existing hash-block prefix cache. This is simpler but less expressive for longest-prefix reuse across prompts with shared long prefixes.

Implement radix attention directly in lower-level kernels. This would be more invasive and would require changing the kernel/runtime interface. The proposed design keeps radix in the serving scheduler and KV cache management layer, reusing the existing paged KV cache and page attention kernels.

Additional Context

A possible implementation plan:

  1. Add a backend selector in the serving CLI and config path.
  2. Add a RadixPrefixCache data structure for compressed radix-tree prefix matching.
  3. Store only page-aligned prefixes so the cache can safely reuse complete KV pages.
  4. Extend KvCacheManager with explicit request/cache page retain and release APIs.
  5. In the scheduler, add a radix branch parallel to the current hash prefix-cache branch.
  6. Track radix node locks while a request is using matched prefix pages.
  7. Insert completed page-aligned prefixes after worker execution.
  8. Add unit tests for tree insert/match/split, scheduler hit/miss behavior, default hash behavior, cleanup, and preemption.
  9. Add an HTTP-level validation script that sends cold/hit/miss requests and checks expected radix debug counters.

Expected validation criteria:

  • Default startup without --prefix-cache-backend radix keeps the existing hash path.
  • Explicit --prefix-cache-backend radix enables radix lookup and insertion.
  • A second request with the same page-aligned prefix should reuse cached pages.
  • A request with a different prefix should miss and run normal prefill.
  • Request completion, abort, and preemption should not leak KV page references.
  • The implementation should work with Qwen3-14B serving using the existing paged attention path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions