Add hole-mode CacheBlend extension for non-contiguous cache reuse by scestola-h · Pull Request #254 · LMCache/LMCache-Ascend

scestola-h · 2026-06-16T15:31:53Z

Summary

Adds hole-mode CacheBlend to LMCache-Ascend: extends the existing CacheBlend path to support reuse of cached KV across non-contiguous prefixes (with "holes"). Standard CacheBlend stops at the first cache miss and recomputes everything after it. Hole-mode computes fresh KV for the missing segment(s) only and continues to reuse cached segments that follow.

Where to start reading

The maintainer overview for this PR is at docs/hole-feature-overview.md. It covers the architecture, request flow vs standard CacheBlend, file map, and a recommended code-review reading order.

What's included

Hole-aware connector and adapter (LMCacheConnectorV1ImplHole)
Segment-aware blender (LMCBlenderHole)
Hole-aware NPU buffer connector (VLLMBufferLayerwiseNPUHoleConnector)
Hole-aware lookup client and server
Supporting types and segment utilities under lmcache_ascend/v1/
Benchmark entrypoint (benchmark/cacheblend/benchmark-hole.py) with config (benchmark/cacheblend/config/config-hole.yaml)
Test (tests/v1/blend/test_blend.py)
Maintainer overview (docs/hole-feature-overview.md)

Design

The standard CacheBlend path is preserved unchanged. Hole-mode is selected via connector module choice. See docs/hole-feature-overview.md for design rationale and where each piece of the request flow lives.

Validation

Validated with single-hole and multi-hole runs on the Musique multi-hop QA dataset with Qwen3-8B. On a matched comparison (inflate=20, until=150, recompute_ratio=0.15), standard CacheBlend reaches f1=0.2230 and single-hole mode reaches f1=0.2200. Concrete numbers and configuration details are in section 6 of the overview doc.

Extends LMCache-Ascend's CacheBlend integration to support reuse of cached KV across non-contiguous prefixes. Standard CacheBlend stops at the first cache miss and recomputes everything after it. Hole-mode computes fresh KV for the missing segment(s) only and continues to reuse cached segments that follow. Implementation lives under lmcache_ascend/ alongside the existing contiguous-prefix path: a hole-aware connector and adapter (LMCacheConnectorV1ImplHole), a segment-aware blender (LMCBlenderHole), a hole-aware NPU buffer connector, hole-aware lookup client and server, and supporting types and helpers. The standard CacheBlend path is preserved unchanged; the hole path is selected via connector module choice. Includes a benchmark entrypoint, a test, and a maintainer overview in docs/hole-feature-overview.md describing the request flow and file map for code review.

gemini-code-assist

Code Review

This pull request introduces the 'Hole' feature to standard CacheBlend, enabling non-contiguous prefix reuse to handle cached prefixes with gaps or 'holes'. It adds comprehensive documentation, a new benchmark suite, and integrates hole-aware connectors, blenders, and segment helpers. While the implementation is solid, several critical issues must be addressed to prevent runtime failures. These include an UnboundLocalError in cache_engine.py when keys are empty, a NameError in blender.py due to an undefined total_len variable, and a potential IndexError in the benchmark script. Additionally, the trace_utils patching logic should be improved to ensure compatibility with upstream lmcache, and a bare except clause in utils.py should be replaced to comply with PEP 8.

gemini-code-assist · 2026-06-16T15:34:53Z

+        yield None
+
+        next(mem_obj_consumer)
+
+        for mem_obj in to_count_down:
+            if mem_obj.is_pinned:
+                mem_obj.unpin()


If keys is empty, the else block on line 535 is executed. However, mem_obj_consumer and to_count_down are only defined inside the if keys: block (lines 515 and 518). When keys is empty, executing lines 541 and 543 will raise an UnboundLocalError at runtime, crashing the engine on any cache miss. These finalization steps should be moved inside the if keys: block.

Suggested change

-        yield None
-
-        next(mem_obj_consumer)
-
-        for mem_obj in to_count_down:
-            if mem_obj.is_pinned:
-                mem_obj.unpin()
+        yield None
+
+        if keys:
+            next(mem_obj_consumer)
+
+            for mem_obj in to_count_down:
+                if mem_obj.is_pinned:
+                    mem_obj.unpin()
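
The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch, not the code from cache_engine.py: the two generator functions and their local names are hypothetical stand-ins that only mimic the shape of the hunk (names bound inside `if keys:` and used after the final `yield`).

```python
def finalize_unguarded(keys):
    """Mimics the reported shape: names are bound only inside `if keys:`."""
    if keys:
        consumer = iter(keys)        # stand-in for mem_obj_consumer
        to_count_down = list(keys)   # stand-in for to_count_down
        yield "hit"
    else:
        yield "miss"
    yield None
    next(consumer)                   # UnboundLocalError when keys is empty
    for _ in to_count_down:
        pass


def finalize_guarded(keys):
    """Same flow, with the finalization steps guarded by `if keys:`."""
    if keys:
        consumer = iter(keys)
        to_count_down = list(keys)
        yield "hit"
    else:
        yield "miss"
    yield None
    if keys:
        next(consumer)
        for _ in to_count_down:
            pass
```

Draining `finalize_unguarded([])` raises once the generator resumes past the last `yield`, while `finalize_guarded([])` completes normally.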

gemini-code-assist · 2026-06-16T15:34:53Z

+            self._log_recomputed_tokens(layer_id, absolute_top_indices, total_len)
+            trace_flow(
+                "blender.nohole",
+                "process_qkv",
+                layer_id=layer_id,
+                req_id=kwargs.get("req_id", self._trace_req_id),
+                num_falses=num_falses,
+                eligible_tokens=total_len,
+                topk_num=topk_num,
+                top_indices=top_indices.tolist(),
+                absolute_top_indices=absolute_top_indices.tolist(),


The variable total_len is referenced on lines 189 and 196, but it is never defined within the process_qkv function of blender.py. This will raise a NameError at runtime when the standard (nohole) CacheBlend path is executed. It should be replaced with diff_k.shape[0] or k[num_falses:].shape[0] which represents the number of eligible tokens.

            total_len = diff_k.shape[0]
            self._log_recomputed_tokens(layer_id, absolute_top_indices, total_len)
            trace_flow(
                "blender.nohole",
                "process_qkv",
                layer_id=layer_id,
                req_id=kwargs.get("req_id", self._trace_req_id),
                num_falses=num_falses,
                eligible_tokens=total_len,
                topk_num=topk_num,
                top_indices=top_indices.tolist(),
                absolute_top_indices=absolute_top_indices.tolist(),

gemini-code-assist · 2026-06-16T15:34:53Z

+def _patch_trace_utils_compat():
+    try:
+        # Third Party
+        import lmcache.v1.trace_utils  # noqa: F401
+    except ModuleNotFoundError as exc:
+        if exc.name != "lmcache.v1.trace_utils":
+            raise
+
+        # First Party
+        from lmcache_ascend.v1 import trace_utils_compat
+
+        sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat
+


The _patch_trace_utils_compat function only replaces sys.modules['lmcache.v1.trace_utils'] if ModuleNotFoundError is raised. However, if an upstream version of lmcache is installed, the import succeeds, but it will lack the newly introduced tracing symbols (like trace_layer_enabled, trace_probe_positions, etc.). This will cause an ImportError when other modules attempt to import these missing symbols. The patch should dynamically inject these fallback stubs into the existing lmcache.v1.trace_utils module if it is successfully imported to ensure compatibility.

Suggested change

-def _patch_trace_utils_compat():
-    try:
-        # Third Party
-        import lmcache.v1.trace_utils  # noqa: F401
-    except ModuleNotFoundError as exc:
-        if exc.name != "lmcache.v1.trace_utils":
-            raise
-
-        # First Party
-        from lmcache_ascend.v1 import trace_utils_compat
-
-        sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat
+def _patch_trace_utils_compat():
+    try:
+        # Third Party
+        import lmcache.v1.trace_utils as tu
+        for attr in ["trace_layer_enabled", "trace_probe_positions", "trace_compare_prefix_len", "summarize_kv_tensor_stats", "summarize_prefix_kv_tensor_stats", "emit_request_timer"]:
+            if not hasattr(tu, attr):
+                setattr(tu, attr, lambda *args, **kwargs: False if "enabled" in attr else None)
+    except ModuleNotFoundError as exc:
+        if exc.name != "lmcache.v1.trace_utils":
+            raise
+
+        # First Party
+        from lmcache_ascend.v1 import trace_utils_compat
+
+        sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat

References

Prioritize consistency with upstream implementation logic when extending external libraries to ensure compatibility, even if the logic appears unconventional.
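
One detail worth checking in the suggested change: a `lambda` defined inside the loop looks up `attr` when it is called, not when it is created, so after the loop every stub would see the last name in the list. The sketch below shows one way to bind the name per stub. It is an illustration under assumptions: `_make_stub`, `inject_missing` and the `fake_trace_utils` module are hypothetical, and a throwaway `types.ModuleType` stands in for an installed `lmcache.v1.trace_utils`.

```python
import types

FALLBACK_NAMES = [
    "trace_layer_enabled",
    "trace_probe_positions",
    "emit_request_timer",
]


def _make_stub(name):
    # The result is computed once per name, so each stub keeps its own value.
    result = False if "enabled" in name else None

    def stub(*args, **kwargs):
        return result

    stub.__name__ = name
    return stub


def inject_missing(module, names=FALLBACK_NAMES):
    """Add a no-op stub for each name the module lacks; return the names added."""
    added = []
    for name in names:
        if not hasattr(module, name):
            setattr(module, name, _make_stub(name))
            added.append(name)
    return added


# Stand-in for an upstream module that predates the new tracing symbols.
fake_upstream = types.ModuleType("fake_trace_utils")
fake_upstream.emit_request_timer = lambda *args, **kwargs: "upstream"
added = inject_missing(fake_upstream)
```

Symbols the module already provides are left untouched, so an upstream implementation always wins over a stub.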

gemini-code-assist · 2026-06-16T15:34:53Z

+        output_str = output.outputs[0].text
+        answers = meta['answers']
+
+        e2e_time = extract_request_e2e_s(output)
+        ttft = extract_request_ttft_s(output)
+        tbt = None
+        if e2e_time is not None and ttft is not None:
+            generated_tokens = len(output.outputs[0].token_ids)
+            tbt = (e2e_time - ttft) / max(1, generated_tokens - 1)


Accessing output.outputs[0] directly on lines 1243 and 1250 assumes that output.outputs is never empty. If a request fails or returns an empty output list, this will raise an IndexError. It is safer to defensively check if output.outputs is non-empty before accessing its elements, similar to the check performed on line 1333.

Suggested change

-        output_str = output.outputs[0].text
-        answers = meta['answers']
-
-        e2e_time = extract_request_e2e_s(output)
-        ttft = extract_request_ttft_s(output)
-        tbt = None
-        if e2e_time is not None and ttft is not None:
-            generated_tokens = len(output.outputs[0].token_ids)
-            tbt = (e2e_time - ttft) / max(1, generated_tokens - 1)
+        output_str = output.outputs[0].text if output.outputs else ""
+        answers = meta['answers']
+
+        e2e_time = extract_request_e2e_s(output)
+        ttft = extract_request_ttft_s(output)
+        tbt = None
+        if e2e_time is not None and ttft is not None and output.outputs:
+            generated_tokens = len(output.outputs[0].token_ids)
+            tbt = (e2e_time - ttft) / max(1, generated_tokens - 1)
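
The same guard can be factored into small helpers so that the empty-output case is handled in one place. This is a sketch only: `Completion`, `first_text` and `time_between_tokens` are hypothetical names, and `Completion` is a stand-in for one entry of `output.outputs`, not the vLLM class.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Completion:
    """Hypothetical stand-in for one entry of output.outputs."""
    text: str
    token_ids: List[int] = field(default_factory=list)


def first_text(outputs: List[Completion]) -> str:
    """Return the first completion's text, or an empty string if none exist."""
    return outputs[0].text if outputs else ""


def time_between_tokens(
    e2e_s: Optional[float], ttft_s: Optional[float], outputs: List[Completion]
) -> Optional[float]:
    """Average decode time per token after the first; None if it cannot be computed."""
    if e2e_s is None or ttft_s is None or not outputs:
        return None
    generated = len(outputs[0].token_ids)
    return (e2e_s - ttft_s) / max(1, generated - 1)
```

With six generated tokens, `time_between_tokens(1.0, 0.5, ...)` divides the 0.5 s of decode time over five token gaps.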

gemini-code-assist · 2026-06-16T15:34:53Z

+def get_mem(npu_id, chip_id=0):
+    try:
+        return parse_npu_process_info(npu_id,chip_id)['memory_mb']
+    except:
+        return -1


Using a bare except: clause is a violation of PEP 8 style guidelines. It catches system-exiting exceptions like SystemExit and KeyboardInterrupt, making it harder to interrupt the program and potentially masking unrelated bugs. It should be replaced with except Exception:.

Suggested change

-def get_mem(npu_id, chip_id=0):
-    try:
-        return parse_npu_process_info(npu_id,chip_id)['memory_mb']
-    except:
-        return -1
+def get_mem(npu_id, chip_id=0):
+    try:
+        return parse_npu_process_info(npu_id,chip_id)['memory_mb']
+    except Exception:
+        return -1

References

PEP 8: Programming Recommendations - 'When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause.'
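
The practical difference between the two clauses can be seen with a small, self-contained sketch. `read_or_default` and `interrupted` are hypothetical helpers for illustration; they are not part of utils.py.

```python
def read_or_default(read, default=-1):
    """Call `read()`; fall back to `default` on ordinary errors only."""
    try:
        return read()
    except Exception:
        # KeyboardInterrupt and SystemExit derive from BaseException,
        # so they are not swallowed here and still propagate.
        return default


def interrupted():
    """Simulates the user pressing Ctrl+C during the read."""
    raise KeyboardInterrupt
```

A missing key falls back to `-1`, while an interrupt still reaches the caller.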

chloroethylene · 2026-06-26T07:24:19Z

@larksudo review plz

scestola-h · 2026-07-01T13:20:21Z

Hole-mode CacheBlend: Analysis and Reproduction

Following up on the request for precision, coverage, and speed data on this PR. All numbers below are from runs on Ascend 910B4 NPU hardware using the benchmark harness included in this PR (benchmark/cacheblend/benchmark-hole.py). The model used across every run below is Qwen3-8B, and the dataset is Musique.

If this data is insufficient for review, we're happy to run additional experiments, for example broader parameter sweeps, additional recompute ratios or hole configurations, other datasets, or additional models. Let us know what would be most useful.

1. What the feature does (recap)

Standard LMCache CacheBlend reuses a contiguous cached KV prefix and stops at the first cache miss. Hole-mode extends this to reuse cached KV across non-contiguous prefixes: when a prompt has one or more "holes" (missing cached document positions) in an otherwise cached prefix, hole-mode computes fresh KV for the holes and continues to reuse cached segments that follow.

2. Definitions

2.1 Dataset and prompt structure

Musique: multi-hop QA dataset. Each query is a question paired with a set of native retrieved documents. The model is asked to answer the question given the documents. The benchmark serialises this as:

<system> # # <doc_1> # # <doc_2> # # ... # # <question>
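
As a rough illustration of the template above, the prompt can be assembled by joining the parts with the separator. This is a sketch under assumptions: `serialise_prompt` is a hypothetical helper, and the separator is taken to be the literal string `" # # "` as it appears in the template; the benchmark itself may build the prompt differently (for example at the token level).

```python
SEPARATOR = " # # "  # assumed from the template shown above


def serialise_prompt(system, docs, question, sep=SEPARATOR):
    """Join system prompt, documents and question into one prompt string."""
    return sep.join([system, *docs, question])


prompt = serialise_prompt("<system>", ["<doc_1>", "<doc_2>"], "<question>")
```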

2.2 Configuration parameters

inflate (N): appends N additional documents to each query's native document set. These extra documents come from a shared pool of non-native documents. inflate=1 keeps native docs unchanged; inflate=10 adds 10 extra documents; inflate=30 adds 30. This tests long-context behavior without changing dataset semantics.
hole_id: index into the query's native retrieved documents, selecting which one becomes the "hole document."
hole_pos: position in the final inflated document sequence where the hole document is placed. Its KV cache is not populated during phase 1, which forces hole-mode (or, if hole-mode is disabled, forces a legacy full-recompute) at inference time.
hole_ids, hole_positions: multi-hole variant. E.g., hole_ids=[2,5] hole_positions=[3,6] means native doc 2 is placed at inflated position 3 and native doc 5 is placed at inflated position 6, both with unpopulated KV.

Concrete walkthrough: with hole_id=2, hole_pos=3, inflate=10, the benchmark takes the query's native retrieved docs, appends 10 extra pool docs, and constructs the final inflated list. Native doc 2 is placed at position 3 in that list. That single position's KV is not populated during phase 1. At inference, the model must either activate hole-mode (recompute KV for position 3 while reusing the rest) or fall back to the legacy full-recompute path.

hole_mode: hole enables the hole-mode path; no-hole runs baseline CacheBlend.
recompute_ratio (r): standard CacheBlend parameter. Fraction of reusable tokens recomputed at the check layer. Lower r = faster inference but less quality-corrective recompute. Default in shipped config: 0.5. Tighter regimes tested here: 0.3, 0.2, 0.15.
until (N): number of queries drawn from Musique to run in each benchmark invocation. All numbers below use until=150.
check layer: single layer at which CacheBlend performs its diff/topK check to decide which cached-KV positions need refresh. Configurable via blend_check_layers; the shipped benchmark config uses layer 1.
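
To make recompute_ratio concrete, here is a sketch of the kind of selection it controls at the check layer. It assumes, for illustration only, that deviation is measured as the squared difference between freshly computed and cached keys per token, and that the token count is truncated after multiplying by r. `select_recompute_indices` is a hypothetical helper written in plain Python, not the blender's implementation.

```python
def select_recompute_indices(new_k, cached_k, recompute_ratio):
    """Return the token indices with the largest squared K deviation."""
    # One deviation score per token (assumed metric: sum of squared differences).
    diffs = [
        sum((a - b) ** 2 for a, b in zip(new_row, old_row))
        for new_row, old_row in zip(new_k, cached_k)
    ]
    # Assumed rounding: truncate r * eligible_tokens to an integer.
    topk_num = int(len(diffs) * recompute_ratio)
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:topk_num])


new_k = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
cached_k = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.5, 0.4]]
picked = select_recompute_indices(new_k, cached_k, recompute_ratio=0.5)
```

With four eligible tokens and r=0.5, the two tokens whose keys moved the most are selected; a lower r selects fewer tokens, which is the speed/quality tradeoff described above.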

2.3 Metrics

Each benchmark run does two phases:

Phase 1: batched prefill of documents into the LMCache cache.
Phase 2: query processing. Two passes over the query set:
- Pass 1 (phase2_pass1): first full traversal; fills any missing hole chunks during processing.
- Pass 2 (phase2_pass2): second full traversal; measures steady-state reuse behavior with warmed cache.

Metrics reported per pass:

mean_f1: mean F1 over the answer set, using Musique's evaluation. Higher is better.
cache_hit_rate: fraction of prompt tokens served from cached KV, computed as exported_hit_tokens / exported_prompt_tokens.
latency_avg_e2e_s: average end-to-end latency per request in seconds.

Note on pass 2 cache_hit_rate: pass 2 is not 100% by design. The benchmark deliberately modifies the query suffix between passes (appends ? per pass) to prevent trivial query-side cache reuse. The reported ~99.7–99.9% figure reflects reuse of the retrieved-document prefix, which is what CacheBlend actually targets.
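
The two quality and reuse metrics can be sketched as follows. `cache_hit_rate` follows the ratio stated above. `token_f1` and `mean_f1` are a simplified, hypothetical approximation of answer F1: they assume lowercase whitespace tokenisation and scoring against the best-matching reference, and omit whatever normalisation Musique's own evaluation applies.

```python
from collections import Counter


def cache_hit_rate(exported_hit_tokens, exported_prompt_tokens):
    """Fraction of prompt tokens served from cached KV."""
    if exported_prompt_tokens == 0:
        return 0.0
    return exported_hit_tokens / exported_prompt_tokens


def token_f1(prediction, reference):
    """Token-overlap F1 between one prediction and one reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, answer_sets):
    """Score each prediction against its best reference, then average."""
    scores = [
        max(token_f1(pred, answer) for answer in answers)
        for pred, answers in zip(predictions, answer_sets)
    ]
    return sum(scores) / len(scores)
```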

2.4 Modes

method=cacheblend: LMCache-backed inference with LMCACHE_ENABLE_BLENDING=True. This is the CacheBlend path: batched prefill + blend-based reuse.
method=reuse: LMCache-backed inference with LMCACHE_ENABLE_BLENDING=False. Same prefill infrastructure but no blending, pure cache reuse without CacheBlend's recompute step.
method=full: plain vLLM with no LMCache. Serves as a quality ceiling reference (no reuse artifacts).

3. F1 (precision) analysis

3.1 Short context, r=0.5 — hole positions and multi-hole

inflate=1 (no extra documents added). "Baseline" here means LMCache CacheBlend with hole_mode=no-hole, i.e., the standard contiguous-prefix CacheBlend path that this PR extends.

Configuration	hole_mode	Hole config	r	f1 p1	f1 p2	hit p1	hit p2	e2e p1 (s)	e2e p2 (s)
row1 (baseline)	no-hole	—	0.5	0.3523	0.3657	0.6038	0.9962	0.9467	0.7721
row2 (hole @ pos 3)	hole	hole_id=2, hole_pos=3	0.5	0.3388	0.3499	0.9261	0.9979	0.9313	0.8293
row3 (hole @ pos 6)	hole	hole_id=2, hole_pos=6	0.5	0.3289	0.3369	0.9261	0.9979	0.9314	0.7726
row4 (hole @ pos 8)	hole	hole_id=2, hole_pos=8	0.5	0.3369	0.3372	0.9261	0.9979	0.9851	0.7900
row5 (two holes)	hole	hole_ids=[2,5], hole_positions=[4,7]	0.5	0.3448	0.3422	0.8582	0.9991	1.0187	0.7778

Observations:

Baseline f1 p2: 0.3657. Single-hole variants: 0.3369–0.3499. Multi-hole: 0.3422.
On pass 2, all hole configurations land within 0.03 of the baseline (largest gap 0.0288, row3); multi-hole is within 0.024.

3.2 Recompute-ratio sensitivity (r=0.3)

Same short-context regime, tighter recompute ratio.

Configuration	hole_mode	r	f1 p1	f1 p2	hit p1	hit p2	e2e p1 (s)	e2e p2 (s)
row6 (baseline)	no-hole	0.3	0.3397	0.3228	0.6038	0.9962	0.9368	0.7152
row7 (hole @ pos 6)	hole	0.3	0.3039	0.3109	0.9261	0.9979	0.8992	0.8222

Observation: at r=0.3, baseline f1 p2 drops to 0.3228 (from 0.3657 at r=0.5), the expected quality/speed tradeoff. The hole run (row7) stays within ~0.012 of the baseline on pass 2 (0.3109 vs 0.3228); on pass 1 the gap is larger, ~0.036 (0.3039 vs 0.3397).

3.3 Long context (inflate=10, 20, 30) at r=0.15

The most demanding regime tested. Inflate synthesizes long contexts by adding extra documents; r=0.15 is a tight-recompute setting. As in section 3.1, "baseline" here means LMCache CacheBlend with hole_mode=no-hole.

Configuration	hole_mode	inflate	r	f1 p1	f1 p2	hit p1	hit p2	e2e p1 (s)	e2e p2 (s)
row16 (baseline)	no-hole	10	0.15	0.2587	0.2378	0.6359	0.9972	1.2258	0.8712
row17 (hole @ pos 4)	hole	10	0.15	0.2227	0.2222	0.9587	0.9988	1.0103	0.8457
row18 (baseline)	no-hole	20	0.15	0.2263	0.2230	0.6405	0.9975	1.7630	1.1434
row19 (hole @ pos 6)	hole	20	0.15	0.2167	0.2148	0.9717	0.9992	1.4411	1.1566
row23 (hole @ pos 4)	hole	20	0.15	0.2230	0.2200	0.9717	0.9992	1.3748	1.1385
row21 (baseline)	no-hole	30	0.15	0.2023	0.1916	0.6422	0.9977	2.4660	1.5244
row22 (hole @ pos 4)	hole	30	0.15	0.1758	0.1753	0.9784	0.9996	1.8335	1.5139

Observations:

Absolute f1 decreases as more context is inflated (harder task); this holds for baseline and hole alike.
The baseline-vs-hole f1 gap stays below 0.04 across the entire long-context regime: the largest gap is ~0.036 (pass 1, inflate=10), and pass 2 gaps are at most ~0.016.
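
The gaps can be recomputed directly from the table in this section. The snippet below is a small arithmetic check; the f1 values are copied from the rows above, and the dictionary layout is just one convenient way to pair each hole row with the baseline at the same inflate setting.

```python
# inflate -> (f1 p1, f1 p2) for the no-hole baseline rows (row16, row18, row21)
BASELINE = {10: (0.2587, 0.2378), 20: (0.2263, 0.2230), 30: (0.2023, 0.1916)}

# (row, inflate) -> (f1 p1, f1 p2) for the hole rows
HOLE = {
    ("row17", 10): (0.2227, 0.2222),
    ("row19", 20): (0.2167, 0.2148),
    ("row23", 20): (0.2230, 0.2200),
    ("row22", 30): (0.1758, 0.1753),
}

# Baseline minus hole, per pass, rounded to the table's precision.
gaps = {
    row: tuple(round(b - h, 4) for b, h in zip(BASELINE[inflate], f1))
    for (row, inflate), f1 in HOLE.items()
}
max_gap = max(gap for pair in gaps.values() for gap in pair)
max_gap_pass2 = max(pair[1] for pair in gaps.values())
```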

3.4 Reference points (full vLLM, LMCache-only)

Quality ceiling (full vLLM, no cache reuse) and cache-reuse-without-blending baseline.

Configuration	method	inflate	r	f1 p1	f1 p2	hit p1	hit p2	e2e p1 (s)	e2e p2 (s)
row8	full	1	0.5	0.3760	0.3760	0.0000	0.0000	0.8600	0.8557
row9	reuse (no CacheBlend)	1	0.5	0.3760	0.3807	0.0009	0.9982	0.9772	0.5922
row10	full	10	0.5	0.3783	0.3783	0.0000	0.0000	1.4588	1.4627
row13	full	20	0.5	0.3252	0.3252	0.0000	0.0000	2.3049	2.3106
row20	full	30	0.5	0.3046	0.3046	0.0000	0.0000	3.3118	3.3167

Observations:

Full vLLM at inflate=1: f1=0.3760, latency 0.8557s. Baseline CacheBlend (row1) reaches f1=0.3657, latency 0.7721s, trading ~0.01 f1 for ~10% speedup.
Reuse-without-blending (row9) at f1=0.3807 slightly exceeds full vLLM (0.3760) at inflate=1, with much lower latency. The small f1 difference is within measurement noise for a 150-query set. This mode uses CacheBlend's cache-reuse infrastructure without the blend/recompute step.
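
For reference, the speedup figures quoted here are relative latency reductions on pass 2. A quick check using the values from the tables (`speedup` is a hypothetical helper name):

```python
def speedup(reference_s, candidate_s):
    """Fractional latency reduction of candidate relative to reference."""
    return (reference_s - candidate_s) / reference_s


full_vs_blend = speedup(0.8557, 0.7721)  # full vLLM (row8) vs CacheBlend (row1)
full_vs_reuse = speedup(0.8557, 0.5922)  # full vLLM (row8) vs reuse (row9)
f1_cost = 0.3760 - 0.3657                # f1 given up by row1 relative to row8
```

This gives roughly 9.8% for CacheBlend and roughly 31% for reuse-without-blending, at an f1 cost of about 0.01 for CacheBlend.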

4. Hole path activation (coverage evidence)

Two dedicated validation runs (val_A_single_hole and val_B_two_holes) were configured to match a single-hole and a multi-hole scenario respectively. Summing the path counts in section 4.1, val_A processes ~21.8K connector-level requests and val_B ~19.5K through Phase 2.

val_A_single_hole configuration: hole_mode=hole hole_id=2 hole_pos=4 inflate=20 recompute_ratio=0.15 until=150. Same configuration as row23.
val_B_two_holes configuration: hole_mode=hole hole_ids=[2,5] hole_positions=[3,6] inflate=20 recompute_ratio=0.15 until=150.

4.1 Per-request path distribution

Each request that reaches the LMCache connector is routed to one of three paths depending on its cache state:

Path	val_A (single hole)	val_B (two holes)	What it means
Hole path (log: `Hole mode=hole`)	108	130	Hole-mode logic activated; fresh KV computed for hole positions, cached KV reused elsewhere.
Legacy path (log: `Hole mode=legacy`)	1148	1035	The hole-mode adapter routed the request through the legacy contiguous-prefix path.
Pure-hit path (log: `load_mode=pure_hit`)	20559	18325	Full LMCache reuse without needing hole-mode; all documents were cached.
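
The share of each path follows from the counts in the table. The snippet below is a small arithmetic sketch over those counts; `path_shares` is a hypothetical helper and the short path keys are labels chosen here, not log fields.

```python
PATH_COUNTS = {
    "val_A": {"hole": 108, "legacy": 1148, "pure_hit": 20559},
    "val_B": {"hole": 130, "legacy": 1035, "pure_hit": 18325},
}


def path_shares(counts):
    """Return the total request count and each path's fraction of it."""
    total = sum(counts.values())
    return total, {path: n / total for path, n in counts.items()}


totals = {run: path_shares(counts)[0] for run, counts in PATH_COUNTS.items()}
shares = {run: path_shares(counts)[1] for run, counts in PATH_COUNTS.items()}
```

On these counts the hole path accounts for about 0.5% of requests in val_A and about 0.7% in val_B, with the pure-hit path above 94% in both runs.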

4.2 Coverage statistics on hole-path requests

For requests that took the hole path:

Metric	val_A	val_B
Covered tokens (range)	0 – 18,995	0 – 19,002
Prefix-miss tokens (range)	0 – 879	0 – 1,697

These show that when hole-mode engages, it covers up to ~19K tokens of the reusable prefix, with the hole itself accounting for at most ~1.7K tokens of fresh recompute.

4.3 Quality of validation runs

Configuration	f1 p1	f1 p2	hit p2	e2e p2 (s)
val_A (single hole)	0.2230	0.2200	0.9992	1.1089
val_B (two holes)	0.2022	0.2022	0.9996	1.1281

val_A reproduces row23's quality (f1 p2 = 0.2200) and cache behavior (hit p2 = 0.9992); latency differs by ~0.03s, within typical run-to-run variance. val_B provides a matched multi-hole datapoint at the same context/r configuration.

5. Reproduction

5.1 Environment

Hardware: Ascend 910B4 NPU (single card, NPU 3 used for all matrix runs)
npu-smi driver version: 25.6.rc1.b099
Feature branch: this PR at commit 58b0133
Container base image: quay.io/ascend/vllm-ascend:v0.18.0
Python package versions:
- vllm 0.18.0+empty
- vllm-ascend 0.18.0
- lmcache 0.4.4
- torch 2.9.0+cpu
- torch_npu 2.9.0.post1+gitee7ba04

5.2 Model and dataset

Model: Qwen3-8B
Dataset: Musique multi-hop QA

5.3 Benchmark entrypoint

Included in this PR:

benchmark/cacheblend/benchmark-hole.py
benchmark/cacheblend/config/config-hole.yaml

5.4 Row-by-row commands

Each row is reproduced by:

python3 benchmark/cacheblend/benchmark-hole.py <override>

Where <override> is:

Row	Override
row1 (baseline, inflate=1)	`npu=3 method=cacheblend hole_mode=no-hole inflate=1 until=150`
row2 (hole @ pos 3)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=3 inflate=1 until=150`
row3 (hole @ pos 6)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=6 inflate=1 until=150`
row4 (hole @ pos 8)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=8 inflate=1 until=150`
row5 (two holes)	`npu=3 method=cacheblend hole_mode=hole hole_ids=[2,5] hole_positions=[4,7] inflate=1 until=150`
row6 (baseline, r=0.3)	`npu=3 method=cacheblend hole_mode=no-hole inflate=1 until=150 recompute_ratio=0.3`
row7 (hole, r=0.3)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=6 inflate=1 until=150 recompute_ratio=0.3`
row8 (full vLLM)	`npu=3 method=full hole_mode=no-hole inflate=1 until=150`
row9 (reuse)	`npu=3 method=reuse hole_mode=no-hole inflate=1 until=150`
row10 (full, inflate=10)	`npu=3 method=full hole_mode=no-hole inflate=10 until=150`
row13 (full, inflate=20)	`npu=3 method=full hole_mode=no-hole inflate=20 until=150`
row16 (baseline, inflate=10, r=0.15)	`npu=3 method=cacheblend hole_mode=no-hole inflate=10 recompute_ratio=0.15 until=150`
row17 (hole, inflate=10, r=0.15)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=10 recompute_ratio=0.15 until=150`
row18 (baseline, inflate=20, r=0.15)	`npu=3 method=cacheblend hole_mode=no-hole inflate=20 recompute_ratio=0.15 until=150`
row19 (hole, inflate=20, r=0.15)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=6 inflate=20 recompute_ratio=0.15 until=150`
row20 (full, inflate=30)	`npu=3 method=full hole_mode=no-hole inflate=30 until=150`
row21 (baseline, inflate=30, r=0.15)	`npu=3 method=cacheblend hole_mode=no-hole inflate=30 recompute_ratio=0.15 until=150`
row22 (hole, inflate=30, r=0.15)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=30 recompute_ratio=0.15 until=150`
row23 (hole, inflate=20, r=0.15)	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=20 recompute_ratio=0.15 until=150`
val_A_single_hole	`npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=20 recompute_ratio=0.15 until=150`
val_B_two_holes	`npu=3 method=cacheblend hole_mode=hole hole_ids=[2,5] hole_positions=[3,6] inflate=20 recompute_ratio=0.15 until=150`

Notes:

All runs use until=150 (150 queries drawn from Musique).
hole_mode=no-hole selects the standard contiguous-prefix CacheBlend path.
hole_mode=hole selects the hole-mode path introduced in this PR.
Baseline recompute_ratio (from shipped config) is 0.5.
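
For scripting several rows, the overrides can be rendered from a dictionary. This is a sketch only: `build_command` is a hypothetical helper that builds the argument list shown in the table and prints it without launching the benchmark.

```python
import shlex

ENTRYPOINT = "benchmark/cacheblend/benchmark-hole.py"


def build_command(overrides):
    """Render key=value overrides into an argv list for the entrypoint."""
    args = [f"{key}={value}" for key, value in overrides.items()]
    return ["python3", ENTRYPOINT, *args]


row23 = build_command(
    {
        "npu": 3,
        "method": "cacheblend",
        "hole_mode": "hole",
        "hole_id": 2,
        "hole_pos": 4,
        "inflate": 20,
        "recompute_ratio": 0.15,
        "until": 150,
    }
)
print(shlex.join(row23))
```

Passing the list form to a process launcher avoids shell interpretation of values such as `hole_ids=[2,5]`, where the brackets would otherwise be treated as glob characters.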

larksudo · 2026-07-02T07:39:00Z

+**File Path**: `vllm-ascend/vllm-ascend/worker/worker.py`

- In `vllm-ascend/vllm-ascend/worker/worker_v1.py`, comment out `ensure_kv_transfer_initialized(vllm_config)` in function `def _init_worker_distributed_environment`.
+- In `vllm-ascend/vllm-ascend/worker/worker.py`, comment out `ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)` in function `def initialize_from_config`.


Nice work. I see the changes on the latest main branch.
For completeness, should we list which vllm-ascend versions are supported?

scestola-h requested review from chloroethylene and matthewygf as code owners June 16, 2026 15:31

gemini-code-assist Bot reviewed Jun 16, 2026

chloroethylene requested a review from larksudo June 26, 2026 07:22

larksudo reviewed Jul 2, 2026

scestola-h wants to merge 1 commit into LMCache:main from scestola-h:pr/hole-mode-cacheblend
