Add hole-mode CacheBlend extension for non-contiguous cache reuse#254
Add hole-mode CacheBlend extension for non-contiguous cache reuse#254scestola-h wants to merge 1 commit into
Conversation
Extends LMCache-Ascend's CacheBlend integration to support reuse of cached KV across non-contiguous prefixes. Standard CacheBlend stops at the first cache miss and recomputes everything after it. Hole-mode computes fresh KV for the missing segment(s) only and continues to reuse cached segments that follow. Implementation lives under lmcache_ascend/ alongside the existing contiguous-prefix path: a hole-aware connector and adapter (LMCacheConnectorV1ImplHole), a segment-aware blender (LMCBlenderHole), a hole-aware NPU buffer connector, hole-aware lookup client and server, and supporting types and helpers. The standard CacheBlend path is preserved unchanged; the hole path is selected via connector module choice. Includes a benchmark entrypoint, a test, and a maintainer overview in docs/hole-feature-overview.md describing the request flow and file map for code review.
There was a problem hiding this comment.
Code Review
This pull request introduces the 'Hole' feature to standard CacheBlend, enabling non-contiguous prefix reuse to handle cached prefixes with gaps or 'holes'. It adds comprehensive documentation, a new benchmark suite, and integrates hole-aware connectors, blenders, and segment helpers. While the implementation is solid, several critical issues must be addressed to prevent runtime failures. These include an UnboundLocalError in cache_engine.py when keys are empty, a NameError in blender.py due to an undefined total_len variable, and a potential IndexError in the benchmark script. Additionally, the trace_utils patching logic should be improved to ensure compatibility with upstream lmcache, and a bare except clause in utils.py should be replaced to comply with PEP 8.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| yield None | ||
|
|
||
| next(mem_obj_consumer) | ||
|
|
||
| for mem_obj in to_count_down: | ||
| if mem_obj.is_pinned: | ||
| mem_obj.unpin() |
There was a problem hiding this comment.
If keys is empty, the else block on line 535 is executed. However, mem_obj_consumer and to_count_down are only defined inside the if keys: block (lines 515 and 518). When keys is empty, executing lines 541 and 543 will raise an UnboundLocalError at runtime, crashing the engine on any cache miss. These finalization steps should be moved inside the if keys: block.
| yield None | |
| next(mem_obj_consumer) | |
| for mem_obj in to_count_down: | |
| if mem_obj.is_pinned: | |
| mem_obj.unpin() | |
| yield None | |
| if keys: | |
| next(mem_obj_consumer) | |
| for mem_obj in to_count_down: | |
| if mem_obj.is_pinned: | |
| mem_obj.unpin() |
| self._log_recomputed_tokens(layer_id, absolute_top_indices, total_len) | ||
| trace_flow( | ||
| "blender.nohole", | ||
| "process_qkv", | ||
| layer_id=layer_id, | ||
| req_id=kwargs.get("req_id", self._trace_req_id), | ||
| num_falses=num_falses, | ||
| eligible_tokens=total_len, | ||
| topk_num=topk_num, | ||
| top_indices=top_indices.tolist(), | ||
| absolute_top_indices=absolute_top_indices.tolist(), |
There was a problem hiding this comment.
The variable total_len is referenced on lines 189 and 196, but it is never defined within the process_qkv function of blender.py. This will raise a NameError at runtime when the standard (nohole) CacheBlend path is executed. It should be replaced with diff_k.shape[0] or k[num_falses:].shape[0] which represents the number of eligible tokens.
total_len = diff_k.shape[0]
self._log_recomputed_tokens(layer_id, absolute_top_indices, total_len)
trace_flow(
"blender.nohole",
"process_qkv",
layer_id=layer_id,
req_id=kwargs.get("req_id", self._trace_req_id),
num_falses=num_falses,
eligible_tokens=total_len,
topk_num=topk_num,
top_indices=top_indices.tolist(),
absolute_top_indices=absolute_top_indices.tolist(),| def _patch_trace_utils_compat(): | ||
| try: | ||
| # Third Party | ||
| import lmcache.v1.trace_utils # noqa: F401 | ||
| except ModuleNotFoundError as exc: | ||
| if exc.name != "lmcache.v1.trace_utils": | ||
| raise | ||
|
|
||
| # First Party | ||
| from lmcache_ascend.v1 import trace_utils_compat | ||
|
|
||
| sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat | ||
|
|
There was a problem hiding this comment.
The _patch_trace_utils_compat function only replaces sys.modules['lmcache.v1.trace_utils'] if ModuleNotFoundError is raised. However, if an upstream version of lmcache is installed, the import succeeds, but it will lack the newly introduced tracing symbols (like trace_layer_enabled, trace_probe_positions, etc.). This will cause an ImportError when other modules attempt to import these missing symbols. The patch should dynamically inject these fallback stubs into the existing lmcache.v1.trace_utils module if it is successfully imported to ensure compatibility.
| def _patch_trace_utils_compat(): | |
| try: | |
| # Third Party | |
| import lmcache.v1.trace_utils # noqa: F401 | |
| except ModuleNotFoundError as exc: | |
| if exc.name != "lmcache.v1.trace_utils": | |
| raise | |
| # First Party | |
| from lmcache_ascend.v1 import trace_utils_compat | |
| sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat | |
| def _patch_trace_utils_compat(): | |
| try: | |
| # Third Party | |
| import lmcache.v1.trace_utils as tu | |
| for attr in ["trace_layer_enabled", "trace_probe_positions", "trace_compare_prefix_len", "summarize_kv_tensor_stats", "summarize_prefix_kv_tensor_stats", "emit_request_timer"]: | |
| if not hasattr(tu, attr): | |
| setattr(tu, attr, lambda *args, **kwargs: False if "enabled" in attr else None) | |
| except ModuleNotFoundError as exc: | |
| if exc.name != "lmcache.v1.trace_utils": | |
| raise | |
| # First Party | |
| from lmcache_ascend.v1 import trace_utils_compat | |
| sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat |
References
- Prioritize consistency with upstream implementation logic when extending external libraries to ensure compatibility, even if the logic appears unconventional.
| output_str = output.outputs[0].text | ||
| answers = meta['answers'] | ||
|
|
||
| e2e_time = extract_request_e2e_s(output) | ||
| ttft = extract_request_ttft_s(output) | ||
| tbt = None | ||
| if e2e_time is not None and ttft is not None: | ||
| generated_tokens = len(output.outputs[0].token_ids) | ||
| tbt = (e2e_time - ttft) / max(1, generated_tokens - 1) |
There was a problem hiding this comment.
Accessing output.outputs[0] directly on lines 1243 and 1250 assumes that output.outputs is never empty. If a request fails or returns an empty output list, this will raise an IndexError. It is safer to defensively check if output.outputs is non-empty before accessing its elements, similar to the check performed on line 1333.
| output_str = output.outputs[0].text | |
| answers = meta['answers'] | |
| e2e_time = extract_request_e2e_s(output) | |
| ttft = extract_request_ttft_s(output) | |
| tbt = None | |
| if e2e_time is not None and ttft is not None: | |
| generated_tokens = len(output.outputs[0].token_ids) | |
| tbt = (e2e_time - ttft) / max(1, generated_tokens - 1) | |
| output_str = output.outputs[0].text if output.outputs else "" | |
| answers = meta['answers'] | |
| e2e_time = extract_request_e2e_s(output) | |
| ttft = extract_request_ttft_s(output) | |
| tbt = None | |
| if e2e_time is not None and ttft is not None and output.outputs: | |
| generated_tokens = len(output.outputs[0].token_ids) | |
| tbt = (e2e_time - ttft) / max(1, generated_tokens - 1) |
| def get_mem(npu_id, chip_id=0): | ||
| try: | ||
| return parse_npu_process_info(npu_id,chip_id)['memory_mb'] | ||
| except: | ||
| return -1 |
There was a problem hiding this comment.
Using a bare except: clause is a violation of PEP 8 style guidelines. It catches system-exiting exceptions like SystemExit and KeyboardInterrupt, making it harder to interrupt the program and potentially masking unrelated bugs. It should be replaced with except Exception:.
| def get_mem(npu_id, chip_id=0): | |
| try: | |
| return parse_npu_process_info(npu_id,chip_id)['memory_mb'] | |
| except: | |
| return -1 | |
| def get_mem(npu_id, chip_id=0): | |
| try: | |
| return parse_npu_process_info(npu_id,chip_id)['memory_mb'] | |
| except Exception: | |
| return -1 |
References
- PEP 8: Programming Recommendations - 'When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause.' (link)
|
@larksudo review plz |
Hole-mode CacheBlend: Analysis and ReproductionFollowing up on the request for precision, coverage, and speed data on this PR. All numbers below are from runs on Ascend 910B4 NPU hardware using the benchmark harness included in this PR ( If this data is insufficient for review, we're happy to run additional experiments. For example, broader parameter sweeps, additional recompute ratios or hole configurations, other datasets, or additional models. Let us know what would be most useful. 1. What the feature does (recap)Standard LMCache CacheBlend reuses a contiguous cached KV prefix and stops at the first cache miss. Hole-mode extends this to reuse cached KV across non-contiguous prefixes: when a prompt has one or more "holes" (missing cached document positions) in an otherwise cached prefix, hole-mode computes fresh KV for the holes and continues to reuse cached segments that follow. 2. Definitions2.1 Dataset and prompt structureMusique: multi-hop QA dataset. Each query is a question paired with a set of native retrieved documents. The model is asked to answer the question given the documents. The benchmark serialises this as: 2.2 Configuration parameters
Concrete walkthrough: with
2.3 MetricsEach benchmark run does two phases:
Metrics reported per pass:
Note on pass 2 2.4 Modes
3. F1 (precision) analysis3.1 Short context, r=0.5 — hole positions and multi-hole
Observations:
3.2 Recompute-ratio sensitivity (r=0.3)Same short-context regime, tighter recompute ratio.
Observation: at r=0.3, baseline f1 drops to 0.3228 (from 0.3657 at r=0.5), the expected quality/speed tradeoff. Hole runs remain within ~0.012 of the baseline. 3.3 Long context (inflate=10, 20, 30) at r=0.15The most stress-testing regime. Inflate synthesizes long contexts by adding extra documents; r=0.15 is a tight-recompute setting. As in section 3.1, "baseline" here means LMCache CacheBlend with
Observations:
3.4 Reference points (full vLLM, LMCache-only)Quality ceiling (full vLLM, no cache reuse) and cache-reuse-without-blending baseline.
Observations:
4. Hole path activation (coverage evidence)Two dedicated validation runs (
4.1 Per-request path distributionEach request that reaches the LMCache connector is routed to one of three paths depending on its cache state:
4.2 Coverage statistics on hole-path requestsFor requests that took the hole path:
These show that when hole-mode engages, it covers up to ~19K tokens of the reusable prefix, with the hole itself accounting for at most ~1.7K tokens of fresh recompute. 4.3 Quality of validation runs
val_A reproduces row23's quality (f1 p2 = 0.2200) and cache behavior (hit p2 = 0.9992); latency differs by ~0.03s, within typical run-to-run variance. val_B provides a matched multi-hole datapoint at the same context/r configuration. 5. Reproduction5.1 Environment
5.2 Model and dataset
5.3 Benchmark entrypointIncluded in this PR: 5.4 Row-by-row commandsEach row is reproduced by: Where
Notes:
|
| **File Path**: `vllm-ascend/vllm-ascend/worker/worker.py` | ||
|
|
||
| - In `vllm-ascend/vllm-ascend/worker/worker_v1.py`, comment out `ensure_kv_transfer_initialized(vllm_config)` in function `def _init_worker_distributed_environment`. | ||
| - In `vllm-ascend/vllm-ascend/worker/worker.py`, comment out `ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)` in function `def initialize_from_config`. |
There was a problem hiding this comment.
Nice work. I see the changes on the latest main branch.
For completeness, should we list which vllm-ascend versions are supported.
Summary
Adds hole-mode CacheBlend to LMCache-Ascend: extends the existing CacheBlend path to support reuse of cached KV across non-contiguous prefixes (with "holes"). Standard CacheBlend stops at the first cache miss and recomputes everything after it. Hole-mode computes fresh KV for the missing segment(s) only and continues to reuse cached segments that follow.
Where to start reading
The maintainer overview for this PR is at
docs/hole-feature-overview.md. It covers the architecture, request flow vs standard CacheBlend, file map, and a recommended code-review reading order.What's included
LMCacheConnectorV1ImplHole)LMCBlenderHole)VLLMBufferLayerwiseNPUHoleConnector)lmcache_ascend/v1/benchmark/cacheblend/benchmark-hole.py) with config (benchmark/cacheblend/config/config-hole.yaml)tests/v1/blend/test_blend.py)docs/hole-feature-overview.md)Design
The standard CacheBlend path is preserved unchanged. Hole-mode is selected via connector module choice. See
docs/hole-feature-overview.mdfor design rationale and where each piece of the request flow lives.Validation
Single-hole and multi-hole runs on the Musique multi-hop QA dataset with Qwen3-8B. On a matched comparison (inflate=20, until=150, recompute_ratio=0.15), standard CacheBlend reaches f1=0.2230; single-hole mode reaches f1=0.2200. Concrete numbers and configuration details in section 6 of the overview doc.