Add hole-mode CacheBlend extension for non-contiguous cache reuse #254

Open
scestola-h wants to merge 1 commit into LMCache:main from scestola-h:pr/hole-mode-cacheblend

@scestola-h

Summary

Adds hole-mode CacheBlend to LMCache-Ascend: extends the existing CacheBlend path to support reuse of cached KV across non-contiguous prefixes (with "holes"). Standard CacheBlend stops at the first cache miss and recomputes everything after it. Hole-mode computes fresh KV for the missing segment(s) only and continues to reuse cached segments that follow.

Where to start reading

The maintainer overview for this PR is at docs/hole-feature-overview.md. It covers the architecture, request flow vs standard CacheBlend, file map, and a recommended code-review reading order.

What's included

  • Hole-aware connector and adapter (LMCacheConnectorV1ImplHole)
  • Segment-aware blender (LMCBlenderHole)
  • Hole-aware NPU buffer connector (VLLMBufferLayerwiseNPUHoleConnector)
  • Hole-aware lookup client and server
  • Supporting types and segment utilities under lmcache_ascend/v1/
  • Benchmark entrypoint (benchmark/cacheblend/benchmark-hole.py) with config (benchmark/cacheblend/config/config-hole.yaml)
  • Test (tests/v1/blend/test_blend.py)
  • Maintainer overview (docs/hole-feature-overview.md)

Design

The standard CacheBlend path is preserved unchanged. Hole-mode is selected via connector module choice. See docs/hole-feature-overview.md for design rationale and where each piece of the request flow lives.

Validation

Validated with single-hole and multi-hole runs on the Musique multi-hop QA dataset with Qwen3-8B. On a matched comparison (inflate=20, until=150, recompute_ratio=0.15), standard CacheBlend reaches f1=0.2230 and single-hole mode reaches f1=0.2200. Concrete numbers and configuration details are in section 6 of the overview doc.

Extends LMCache-Ascend's CacheBlend integration to support reuse of cached KV across non-contiguous prefixes. Standard CacheBlend stops at the first cache miss and recomputes everything after it. Hole-mode computes fresh KV for the missing segment(s) only and continues to reuse cached segments that follow.

Implementation lives under lmcache_ascend/ alongside the existing contiguous-prefix path: a hole-aware connector and adapter (LMCacheConnectorV1ImplHole), a segment-aware blender (LMCBlenderHole), a hole-aware NPU buffer connector, hole-aware lookup client and server, and supporting types and helpers. The standard CacheBlend path is preserved unchanged; the hole path is selected via connector module choice.

Includes a benchmark entrypoint, a test, and a maintainer overview in docs/hole-feature-overview.md describing the request flow and file map for code review.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the 'Hole' feature to standard CacheBlend, enabling non-contiguous prefix reuse to handle cached prefixes with gaps or 'holes'. It adds comprehensive documentation, a new benchmark suite, and integrates hole-aware connectors, blenders, and segment helpers. While the implementation is solid, several critical issues must be addressed to prevent runtime failures. These include an UnboundLocalError in cache_engine.py when keys are empty, a NameError in blender.py due to an undefined total_len variable, and a potential IndexError in the benchmark script. Additionally, the trace_utils patching logic should be improved to ensure compatibility with upstream lmcache, and a bare except clause in utils.py should be replaced to comply with PEP 8.

Comment on lines +539 to +545

    yield None

    next(mem_obj_consumer)

    for mem_obj in to_count_down:
        if mem_obj.is_pinned:
            mem_obj.unpin()

Contributor

critical

If keys is empty, the else block on line 535 is executed. However, mem_obj_consumer and to_count_down are only defined inside the if keys: block (lines 515 and 518). When keys is empty, executing lines 541 and 543 will raise an UnboundLocalError at runtime, crashing the engine on any cache miss. These finalization steps should be moved inside the if keys: block.

Suggested change

    - yield None
    - next(mem_obj_consumer)
    - for mem_obj in to_count_down:
    -     if mem_obj.is_pinned:
    -         mem_obj.unpin()
    + yield None
    + if keys:
    +     next(mem_obj_consumer)
    +     for mem_obj in to_count_down:
    +         if mem_obj.is_pinned:
    +             mem_obj.unpin()
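
The failure mode described here can be reproduced in isolation. The sketch below is a toy stand-in, not the code in cache_engine.py: the generator names and the `fails_on_empty` helper are hypothetical, and only the shape (names bound inside `if keys:`, cleanup after the branch) mirrors the reviewed lines.

```python
def finalize_unguarded(keys):
    """Hypothetical stand-in: cleanup names are bound only when keys is non-empty."""
    if keys:
        to_count_down = list(keys)
        for key in keys:
            yield key
    else:
        yield None
    for _obj in to_count_down:  # UnboundLocalError when keys is empty
        pass


def finalize_guarded(keys):
    """Same shape, with the cleanup guarded as in the suggested change."""
    if keys:
        to_count_down = list(keys)
        for key in keys:
            yield key
    else:
        yield None
    if keys:  # the cleanup names exist only on this branch
        for _obj in to_count_down:
            pass


def fails_on_empty(gen_fn):
    """Return True if draining gen_fn([]) raises UnboundLocalError."""
    try:
        list(gen_fn([]))
    except UnboundLocalError:
        return True
    return False
```

With an empty key list the unguarded version fails only after the `yield None`, which is why the problem would surface on cache misses rather than at import time.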

Comment on lines +189 to +199

    self._log_recomputed_tokens(layer_id, absolute_top_indices, total_len)
    trace_flow(
        "blender.nohole",
        "process_qkv",
        layer_id=layer_id,
        req_id=kwargs.get("req_id", self._trace_req_id),
        num_falses=num_falses,
        eligible_tokens=total_len,
        topk_num=topk_num,
        top_indices=top_indices.tolist(),
        absolute_top_indices=absolute_top_indices.tolist(),

Contributor

critical

The variable total_len is referenced on lines 189 and 196, but it is never defined within the process_qkv function of blender.py. This will raise a NameError at runtime when the standard (nohole) CacheBlend path is executed. It should be replaced with diff_k.shape[0] or k[num_falses:].shape[0] which represents the number of eligible tokens.

            total_len = diff_k.shape[0]
            self._log_recomputed_tokens(layer_id, absolute_top_indices, total_len)
            trace_flow(
                "blender.nohole",
                "process_qkv",
                layer_id=layer_id,
                req_id=kwargs.get("req_id", self._trace_req_id),
                num_falses=num_falses,
                eligible_tokens=total_len,
                topk_num=topk_num,
                top_indices=top_indices.tolist(),
                absolute_top_indices=absolute_top_indices.tolist(),

Comment on lines +818 to +830

    def _patch_trace_utils_compat():
        try:
            # Third Party
            import lmcache.v1.trace_utils  # noqa: F401
        except ModuleNotFoundError as exc:
            if exc.name != "lmcache.v1.trace_utils":
                raise

            # First Party
            from lmcache_ascend.v1 import trace_utils_compat

            sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat

Contributor

high

The _patch_trace_utils_compat function only replaces sys.modules['lmcache.v1.trace_utils'] if ModuleNotFoundError is raised. However, if an upstream version of lmcache is installed, the import succeeds, but it will lack the newly introduced tracing symbols (like trace_layer_enabled, trace_probe_positions, etc.). This will cause an ImportError when other modules attempt to import these missing symbols. The patch should dynamically inject these fallback stubs into the existing lmcache.v1.trace_utils module if it is successfully imported to ensure compatibility.

Suggested change

    - def _patch_trace_utils_compat():
    -     try:
    -         # Third Party
    -         import lmcache.v1.trace_utils  # noqa: F401
    -     except ModuleNotFoundError as exc:
    -         if exc.name != "lmcache.v1.trace_utils":
    -             raise
    -         # First Party
    -         from lmcache_ascend.v1 import trace_utils_compat
    -         sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat
    + def _patch_trace_utils_compat():
    +     try:
    +         # Third Party
    +         import lmcache.v1.trace_utils as tu
    +         for attr in ["trace_layer_enabled", "trace_probe_positions", "trace_compare_prefix_len", "summarize_kv_tensor_stats", "summarize_prefix_kv_tensor_stats", "emit_request_timer"]:
    +             if not hasattr(tu, attr):
    +                 # Bind the fallback per attribute; a lambda reading `attr` directly
    +                 # would see only the loop's final value when called.
    +                 fallback = False if "enabled" in attr else None
    +                 setattr(tu, attr, lambda *args, _fallback=fallback, **kwargs: _fallback)
    +     except ModuleNotFoundError as exc:
    +         if exc.name != "lmcache.v1.trace_utils":
    +             raise
    +         # First Party
    +         from lmcache_ascend.v1 import trace_utils_compat
    +         sys.modules["lmcache.v1.trace_utils"] = trace_utils_compat
References
  1. Prioritize consistency with upstream implementation logic when extending external libraries to ensure compatibility, even if the logic appears unconventional.
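
A detail worth checking in any loop-built stub of this kind: a lambda that reads the loop variable directly sees its final value at call time, so a per-attribute return value has to be bound when the stub is created. The sketch below illustrates this on a stand-in module object, not on the real lmcache.v1.trace_utils; `inject_stubs` and `fake_trace_utils` are hypothetical names, and the attribute names are taken from the suggestion above.

```python
import types

STUB_NAMES = ["trace_layer_enabled", "trace_probe_positions", "emit_request_timer"]


def inject_stubs(module, names):
    """Add no-op fallbacks for missing names; existing attributes are left alone."""
    for name in names:
        if hasattr(module, name):
            continue
        # Bind the return value now via a default argument. A bare closure over
        # `name` would evaluate against the last name in the list when called.
        default = False if "enabled" in name else None
        setattr(module, name, lambda *args, _default=default, **kwargs: _default)


# Stand-in for an upstream module that already ships one of the symbols.
fake_trace_utils = types.ModuleType("fake_trace_utils")
fake_trace_utils.trace_probe_positions = lambda: [1, 2]
inject_stubs(fake_trace_utils, STUB_NAMES)
```

Here `trace_layer_enabled` returns `False` rather than `None`, and the pre-existing `trace_probe_positions` is not overwritten.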

Comment on lines +1243 to +1251

    output_str = output.outputs[0].text
    answers = meta['answers']

    e2e_time = extract_request_e2e_s(output)
    ttft = extract_request_ttft_s(output)
    tbt = None
    if e2e_time is not None and ttft is not None:
        generated_tokens = len(output.outputs[0].token_ids)
        tbt = (e2e_time - ttft) / max(1, generated_tokens - 1)

Contributor

medium

Accessing output.outputs[0] directly on lines 1243 and 1250 assumes that output.outputs is never empty. If a request fails or returns an empty output list, this will raise an IndexError. It is safer to defensively check if output.outputs is non-empty before accessing its elements, similar to the check performed on line 1333.

Suggested change

    - output_str = output.outputs[0].text
    - answers = meta['answers']
    - e2e_time = extract_request_e2e_s(output)
    - ttft = extract_request_ttft_s(output)
    - tbt = None
    - if e2e_time is not None and ttft is not None:
    -     generated_tokens = len(output.outputs[0].token_ids)
    -     tbt = (e2e_time - ttft) / max(1, generated_tokens - 1)
    + output_str = output.outputs[0].text if output.outputs else ""
    + answers = meta['answers']
    + e2e_time = extract_request_e2e_s(output)
    + ttft = extract_request_ttft_s(output)
    + tbt = None
    + if e2e_time is not None and ttft is not None and output.outputs:
    +     generated_tokens = len(output.outputs[0].token_ids)
    +     tbt = (e2e_time - ttft) / max(1, generated_tokens - 1)
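
The time-between-tokens formula and the guard can be factored into one helper. This is a minimal sketch, assuming the token IDs are passed in directly instead of a vLLM output object; `time_between_tokens` is a hypothetical name, not a function in the benchmark script.

```python
from typing import Optional, Sequence


def time_between_tokens(
    e2e_s: Optional[float],
    ttft_s: Optional[float],
    token_ids: Optional[Sequence[int]],
) -> Optional[float]:
    """Mean decode-step latency; None when timing or output is unavailable."""
    if e2e_s is None or ttft_s is None or not token_ids:
        return None
    # The first token is covered by TTFT, so n tokens leave n - 1 decode gaps.
    # max(1, ...) avoids dividing by zero for a single-token output.
    return (e2e_s - ttft_s) / max(1, len(token_ids) - 1)
```

Returning `None` for an empty output keeps failed requests out of the latency averages instead of raising.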

Comment thread lmcache_ascend/utils.py
Comment on lines +55 to +59

    def get_mem(npu_id, chip_id=0):
        try:
            return parse_npu_process_info(npu_id,chip_id)['memory_mb']
        except:
            return -1

Contributor

medium

Using a bare except: clause is a violation of PEP 8 style guidelines. It catches system-exiting exceptions like SystemExit and KeyboardInterrupt, making it harder to interrupt the program and potentially masking unrelated bugs. It should be replaced with except Exception:.

Suggested change

    - def get_mem(npu_id, chip_id=0):
    -     try:
    -         return parse_npu_process_info(npu_id,chip_id)['memory_mb']
    -     except:
    -         return -1
    + def get_mem(npu_id, chip_id=0):
    +     try:
    +         return parse_npu_process_info(npu_id,chip_id)['memory_mb']
    +     except Exception:
    +         return -1
References
  1. PEP 8: Programming Recommendations - 'When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause.' (link)
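
The practical difference is easy to demonstrate. In the sketch below, `read_info` is a hypothetical injectable stand-in for `parse_npu_process_info`, so the behaviour can be exercised without NPU tooling; the reader functions are illustrative only.

```python
def get_mem_sketch(read_info, npu_id, chip_id=0):
    """Return memory in MB, or -1 if the lookup fails with an ordinary error."""
    try:
        return read_info(npu_id, chip_id)["memory_mb"]
    except Exception:  # KeyboardInterrupt and SystemExit still propagate
        return -1


def broken_reader(npu_id, chip_id):
    raise KeyError("memory_mb")


def interrupted_reader(npu_id, chip_id):
    raise KeyboardInterrupt


def interrupt_propagates():
    """Return True if a Ctrl-C inside the reader escapes get_mem_sketch."""
    try:
        get_mem_sketch(interrupted_reader, 0)
    except KeyboardInterrupt:
        return True
    return False
```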

@chloroethylene chloroethylene requested a review from larksudo June 26, 2026 07:22
@chloroethylene

Collaborator

@larksudo review plz

@scestola-h

Author

Hole-mode CacheBlend: Analysis and Reproduction

Following up on the request for precision, coverage, and speed data on this PR. All numbers below are from runs on Ascend 910B4 NPU hardware using the benchmark harness included in this PR (benchmark/cacheblend/benchmark-hole.py). The model used across every run below is Qwen3-8B, and the dataset is Musique.

If this data is insufficient for review, we're happy to run additional experiments, for example broader parameter sweeps, additional recompute ratios or hole configurations, other datasets, or additional models. Let us know what would be most useful.

1. What the feature does (recap)

Standard LMCache CacheBlend reuses a contiguous cached KV prefix and stops at the first cache miss. Hole-mode extends this to reuse cached KV across non-contiguous prefixes: when a prompt has one or more "holes" (missing cached document positions) in an otherwise cached prefix, hole-mode computes fresh KV for the holes and continues to reuse cached segments that follow.
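
As a rough illustration of the difference in reuse, the sketch below counts reusable tokens under both policies from a per-segment hit mask. It is a simplified model, not the connector logic: `reusable_tokens` is a hypothetical helper, and it ignores blending and the recompute ratio.

```python
from typing import List, Tuple


def reusable_tokens(segments: List[Tuple[int, bool]]) -> Tuple[int, int]:
    """segments: (token_count, is_cached) pairs in prompt order.

    Returns (contiguous_prefix_reuse, hole_aware_reuse) in tokens.
    """
    contiguous = 0
    for n_tokens, cached in segments:
        if not cached:
            break  # contiguous-prefix reuse stops at the first miss
        contiguous += n_tokens
    # Hole-aware reuse keeps every cached segment, before or after a miss.
    hole_aware = sum(n for n, cached in segments if cached)
    return contiguous, hole_aware


# One uncached 50-token document between two cached segments.
example = reusable_tokens([(100, True), (50, False), (200, True)])
```

In this example the contiguous policy reuses 100 tokens while the hole-aware policy reuses 300, with only the 50-token hole computed fresh.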

2. Definitions

2.1 Dataset and prompt structure

Musique: multi-hop QA dataset. Each query is a question paired with a set of native retrieved documents. The model is asked to answer the question given the documents. The benchmark serialises this as:

<system> # # <doc_1> # # <doc_2> # # ... # # <question>
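
A minimal sketch of that serialisation, assuming the separator is the literal string ` # # ` between every part as in the layout above; `build_prompt` and `SEPARATOR` are hypothetical names, and the benchmark's actual whitespace handling may differ.

```python
SEPARATOR = " # # "  # assumed from the layout shown above


def build_prompt(system: str, docs: list[str], question: str) -> str:
    """Join system prompt, documents, and question with the blend separator."""
    return SEPARATOR.join([system, *docs, question])


prompt = build_prompt("S", ["d1", "d2"], "Q?")
```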

2.2 Configuration parameters

  • inflate (N): controls how many additional documents are appended to each query's native document set. The extra documents come from a shared pool of non-native documents. inflate=1 keeps the native docs unchanged (no extras); inflate=10 adds 10 extra documents; inflate=30 adds 30. This tests long-context behavior without changing dataset semantics.
  • hole_id: index into the query's native retrieved documents, selecting which one becomes the "hole document."
  • hole_pos: position in the final inflated document sequence where the hole document is placed. Its KV cache is not populated during phase 1, which forces hole-mode (or, if hole-mode is disabled, forces a legacy full-recompute) at inference time.
  • hole_ids, hole_positions: multi-hole variant. E.g., hole_ids=[2,5] hole_positions=[3,6] means native doc 2 is placed at inflated position 3 and native doc 5 is placed at inflated position 6, both with unpopulated KV.

Concrete walkthrough: with hole_id=2, hole_pos=3, inflate=10, the benchmark takes the query's native retrieved docs, appends 10 extra pool docs, and constructs the final inflated list. Native doc 2 is placed at position 3 in that list. That single position's KV is not populated during phase 1. At inference, the model must either activate hole-mode (recompute KV for position 3 while reusing the rest) or fall back to the legacy full-recompute path.

  • hole_mode: hole enables the hole-mode path; no-hole runs baseline CacheBlend.
  • recompute_ratio (r): standard CacheBlend parameter. Fraction of reusable tokens recomputed at the check layer. Lower r = faster inference but less quality-corrective recompute. Default in shipped config: 0.5. Tighter regimes tested here: 0.3, 0.2, 0.15.
  • until (N): number of queries drawn from Musique to run in each benchmark invocation. All numbers below use until=150.
  • check layer: single layer at which CacheBlend performs its diff/topK check to decide which cached-KV positions need refresh. Configurable via blend_check_layers; the shipped benchmark config uses layer 1.
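
The placement described in the walkthrough can be sketched as follows. Several details here are assumptions, not taken from the benchmark script: indices are treated as 0-based, the hole document is moved out of its native slot, and the extra documents are simply the first entries of the pool. `build_inflated_docs` is a hypothetical name.

```python
def build_inflated_docs(native, pool, n_extra, hole_ids, hole_positions):
    """Return (final_docs, cached_mask) with hole docs placed at hole_positions."""
    hole_set = set(hole_ids)
    hole_docs = [native[i] for i in hole_ids]
    # Everything except the hole docs, followed by the extra pool docs.
    final = [d for i, d in enumerate(native) if i not in hole_set] + pool[:n_extra]
    # Insert in ascending position order so each doc lands at its final index.
    for pos, doc in sorted(zip(hole_positions, hole_docs)):
        final.insert(pos, doc)
    # Hole docs are the ones whose KV is not populated during phase 1.
    cached_mask = [doc not in hole_docs for doc in final]
    return final, cached_mask


docs, mask = build_inflated_docs(
    native=["n0", "n1", "n2", "n3", "n4", "n5"],
    pool=["p0", "p1", "p2", "p3"],
    n_extra=2,
    hole_ids=[2, 5],
    hole_positions=[3, 6],
)
```

With these inputs the two hole documents end up at positions 3 and 6, and the mask marks exactly those two positions as uncached.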

2.3 Metrics

Each benchmark run does two phases:

  • Phase 1: batched prefill of documents into the LMCache cache.
  • Phase 2: query processing. Two passes over the query set:
    • Pass 1 (phase2_pass1): first full traversal; fills any missing hole chunks during processing.
    • Pass 2 (phase2_pass2): second full traversal; measures steady-state reuse behavior with warmed cache.

Metrics reported per pass:

  • mean_f1: mean F1 over the answer set, using Musique's evaluation. Higher is better.
  • cache_hit_rate: fraction of prompt tokens served from cached KV, computed as exported_hit_tokens / exported_prompt_tokens.
  • latency_avg_e2e_s: average end-to-end latency per request in seconds.

Note on pass 2 cache_hit_rate: pass 2 is not 100% by design. The benchmark deliberately modifies the query suffix between passes (appends ? per pass) to prevent trivial query-side cache reuse. The reported ~99.6–99.96% figure (hit p2 between 0.9962 and 0.9996 across runs) reflects reuse of the retrieved-document prefix, which is what CacheBlend actually targets.
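
For reference, the hit-rate definition amounts to the following. The zero-token guard is an assumption added for the sketch, and the token counts in the example are illustrative, not measured values.

```python
def cache_hit_rate(exported_hit_tokens: int, exported_prompt_tokens: int) -> float:
    """Fraction of prompt tokens served from cached KV."""
    if exported_prompt_tokens <= 0:
        return 0.0  # assumed convention for an empty prompt
    return exported_hit_tokens / exported_prompt_tokens


# Illustrative only: a 10,000-token prompt whose modified query suffix
# (21 tokens here) cannot be served from cache.
rate = cache_hit_rate(9_979, 10_000)
```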

2.4 Modes

  • method=cacheblend: LMCache-backed inference with LMCACHE_ENABLE_BLENDING=True. This is the CacheBlend path: batched prefill + blend-based reuse.
  • method=reuse: LMCache-backed inference with LMCACHE_ENABLE_BLENDING=False. Same prefill infrastructure but no blending, pure cache reuse without CacheBlend's recompute step.
  • method=full: plain vLLM with no LMCache. Serves as a quality ceiling reference (no reuse artifacts).

3. F1 (precision) analysis

3.1 Short context, r=0.5 — hole positions and multi-hole

inflate=1 (no extra documents added). "Baseline" here means LMCache CacheBlend with hole_mode=no-hole, i.e., the standard contiguous-prefix CacheBlend path that this PR extends.

| Configuration | hole_mode | Hole config | r | f1 p1 | f1 p2 | hit p1 | hit p2 | e2e p1 (s) | e2e p2 (s) |
|---|---|---|---|---|---|---|---|---|---|
| row1 (baseline) | no-hole | - | 0.5 | 0.3523 | 0.3657 | 0.6038 | 0.9962 | 0.9467 | 0.7721 |
| row2 (hole @ pos 3) | hole | hole_id=2, hole_pos=3 | 0.5 | 0.3388 | 0.3499 | 0.9261 | 0.9979 | 0.9313 | 0.8293 |
| row3 (hole @ pos 6) | hole | hole_id=2, hole_pos=6 | 0.5 | 0.3289 | 0.3369 | 0.9261 | 0.9979 | 0.9314 | 0.7726 |
| row4 (hole @ pos 8) | hole | hole_id=2, hole_pos=8 | 0.5 | 0.3369 | 0.3372 | 0.9261 | 0.9979 | 0.9851 | 0.7900 |
| row5 (two holes) | hole | hole_ids=[2,5], hole_positions=[4,7] | 0.5 | 0.3448 | 0.3422 | 0.8582 | 0.9991 | 1.0187 | 0.7778 |

Observations:

  • Baseline f1 p2: 0.3657. Single-hole variants: 0.3369–0.3499. Multi-hole: 0.3422.
  • All hole configurations land within 0.03 of the baseline; multi-hole within 0.024.
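
The gaps quoted above can be checked from the pass 2 f1 values of rows 1 to 5; a short sketch with the numbers copied from the table:

```python
baseline_f1_p2 = 0.3657  # row1
hole_f1_p2 = {
    "row2": 0.3499,
    "row3": 0.3369,
    "row4": 0.3372,
    "row5": 0.3422,  # two holes
}

# Absolute f1 gap of each hole configuration against the baseline.
gaps = {row: round(baseline_f1_p2 - f1, 4) for row, f1 in hole_f1_p2.items()}
largest_gap = max(gaps.values())
```

The largest single-hole gap is row3 at roughly 0.029, and the two-hole run sits at roughly 0.0235.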

3.2 Recompute-ratio sensitivity (r=0.3)

Same short-context regime, tighter recompute ratio.

| Configuration | hole_mode | r | f1 p1 | f1 p2 | hit p1 | hit p2 | e2e p1 (s) | e2e p2 (s) |
|---|---|---|---|---|---|---|---|---|
| row6 (baseline) | no-hole | 0.3 | 0.3397 | 0.3228 | 0.6038 | 0.9962 | 0.9368 | 0.7152 |
| row7 (hole @ pos 6) | hole | 0.3 | 0.3039 | 0.3109 | 0.9261 | 0.9979 | 0.8992 | 0.8222 |

Observation: at r=0.3, baseline f1 p2 drops to 0.3228 (from 0.3657 at r=0.5), the expected quality/speed tradeoff. The hole run (row7) stays within ~0.012 of the baseline on pass 2 (0.3109 vs 0.3228); the pass 1 gap is larger, at ~0.036 (0.3039 vs 0.3397).

3.3 Long context (inflate=10, 20, 30) at r=0.15

The most stress-testing regime. Inflate synthesizes long contexts by adding extra documents; r=0.15 is a tight-recompute setting. As in section 3.1, "baseline" here means LMCache CacheBlend with hole_mode=no-hole.

| Configuration | hole_mode | inflate | r | f1 p1 | f1 p2 | hit p1 | hit p2 | e2e p1 (s) | e2e p2 (s) |
|---|---|---|---|---|---|---|---|---|---|
| row16 (baseline) | no-hole | 10 | 0.15 | 0.2587 | 0.2378 | 0.6359 | 0.9972 | 1.2258 | 0.8712 |
| row17 (hole @ pos 4) | hole | 10 | 0.15 | 0.2227 | 0.2222 | 0.9587 | 0.9988 | 1.0103 | 0.8457 |
| row18 (baseline) | no-hole | 20 | 0.15 | 0.2263 | 0.2230 | 0.6405 | 0.9975 | 1.7630 | 1.1434 |
| row19 (hole @ pos 6) | hole | 20 | 0.15 | 0.2167 | 0.2148 | 0.9717 | 0.9992 | 1.4411 | 1.1566 |
| row23 (hole @ pos 4) | hole | 20 | 0.15 | 0.2230 | 0.2200 | 0.9717 | 0.9992 | 1.3748 | 1.1385 |
| row21 (baseline) | no-hole | 30 | 0.15 | 0.2023 | 0.1916 | 0.6422 | 0.9977 | 2.4660 | 1.5244 |
| row22 (hole @ pos 4) | hole | 30 | 0.15 | 0.1758 | 0.1753 | 0.9784 | 0.9996 | 1.8335 | 1.5139 |

Observations:

  • Absolute f1 decreases as more context is inflated (harder task); this happens for baseline and hole alike.
  • The baseline-vs-hole f1 gap stays within ~0.04 across the entire long-context regime (largest gap: ~0.036 on pass 1 at inflate=10; pass 2 gaps are all below ~0.017).

3.4 Reference points (full vLLM, LMCache-only)

Quality ceiling (full vLLM, no cache reuse) and cache-reuse-without-blending baseline.

| Configuration | method | inflate | r | f1 p1 | f1 p2 | hit p1 | hit p2 | e2e p1 (s) | e2e p2 (s) |
|---|---|---|---|---|---|---|---|---|---|
| row8 | full | 1 | 0.5 | 0.3760 | 0.3760 | 0.0000 | 0.0000 | 0.8600 | 0.8557 |
| row9 | reuse (no CacheBlend) | 1 | 0.5 | 0.3760 | 0.3807 | 0.0009 | 0.9982 | 0.9772 | 0.5922 |
| row10 | full | 10 | 0.5 | 0.3783 | 0.3783 | 0.0000 | 0.0000 | 1.4588 | 1.4627 |
| row13 | full | 20 | 0.5 | 0.3252 | 0.3252 | 0.0000 | 0.0000 | 2.3049 | 2.3106 |
| row20 | full | 30 | 0.5 | 0.3046 | 0.3046 | 0.0000 | 0.0000 | 3.3118 | 3.3167 |

Observations:

  • Full vLLM at inflate=1 (row8, pass 2): f1=0.3760, latency 0.8557s. Baseline CacheBlend (row1, pass 2) reaches f1=0.3657, latency 0.7721s, trading ~0.01 f1 for ~10% lower latency.
  • Reuse-without-blending (row9) at f1=0.3807 slightly exceeds full vLLM (0.3760) at inflate=1, with much lower latency. The small f1 difference is within measurement noise for a 150-query set. This mode uses CacheBlend's cache-reuse infrastructure without the blend/recompute step.

4. Hole path activation (coverage evidence)

Two dedicated validation runs (val_A_single_hole and val_B_two_holes) were configured to match a single-hole and a multi-hole scenario respectively. val_A routes ~21.8K requests and val_B ~19.5K requests through Phase 2 (the totals of the per-path counts in 4.1).

  • val_A_single_hole configuration: hole_mode=hole hole_id=2 hole_pos=4 inflate=20 recompute_ratio=0.15 until=150. Same configuration as row23.
  • val_B_two_holes configuration: hole_mode=hole hole_ids=[2,5] hole_positions=[3,6] inflate=20 recompute_ratio=0.15 until=150.

4.1 Per-request path distribution

Each request that reaches the LMCache connector is routed to one of three paths depending on its cache state:

| Path | val_A (single hole) | val_B (two holes) | What it means |
|---|---|---|---|
| Hole path (log: Hole mode=hole) | 108 | 130 | Hole-mode logic activated; fresh KV computed for hole positions, cached KV reused elsewhere. |
| Legacy path (log: Hole mode=legacy) | 1148 | 1035 | The hole-mode adapter routed the request through the legacy contiguous-prefix path. |
| Pure-hit path (log: load_mode=pure_hit) | 20559 | 18325 | Full LMCache reuse without needing hole-mode; all documents were cached. |
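
To put the counts in proportion, the per-path shares can be derived directly from the table; a small sketch with the values copied from above (`path_shares` is a hypothetical helper):

```python
path_counts = {
    "val_A": {"hole": 108, "legacy": 1148, "pure_hit": 20559},
    "val_B": {"hole": 130, "legacy": 1035, "pure_hit": 18325},
}


def path_shares(counts):
    """Return each path's fraction of the run's total routed requests."""
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


totals = {run: sum(counts.values()) for run, counts in path_counts.items()}
shares = {run: path_shares(counts) for run, counts in path_counts.items()}
```

The hole path accounts for under 1% of routed requests in both runs, and the pure-hit path for roughly 94%.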

4.2 Coverage statistics on hole-path requests

For requests that took the hole path:

| Metric | val_A | val_B |
|---|---|---|
| Covered tokens (range) | 0 – 18,995 | 0 – 19,002 |
| Prefix-miss tokens (range) | 0 – 879 | 0 – 1,697 |

These show that when hole-mode engages, it covers up to ~19K tokens of the reusable prefix, with the hole itself accounting for at most ~1.7K tokens of fresh recompute.

4.3 Quality of validation runs

| Configuration | f1 p1 | f1 p2 | hit p2 | e2e p2 (s) |
|---|---|---|---|---|
| val_A (single hole) | 0.2230 | 0.2200 | 0.9992 | 1.1089 |
| val_B (two holes) | 0.2022 | 0.2022 | 0.9996 | 1.1281 |

val_A reproduces row23's quality (f1 p2 = 0.2200) and cache behavior (hit p2 = 0.9992); latency differs by ~0.03s, within typical run-to-run variance. val_B provides a matched multi-hole datapoint at the same context/r configuration.

5. Reproduction

5.1 Environment

  • Hardware: Ascend 910B4 NPU (single card, NPU 3 used for all matrix runs)
  • npu-smi driver version: 25.6.rc1.b099
  • Feature branch: this PR at commit 58b0133
  • Container base image: quay.io/ascend/vllm-ascend:v0.18.0
  • Python package versions:
    • vllm 0.18.0+empty
    • vllm-ascend 0.18.0
    • lmcache 0.4.4
    • torch 2.9.0+cpu
    • torch_npu 2.9.0.post1+gitee7ba04

5.2 Model and dataset

  • Model: Qwen3-8B
  • Dataset: Musique multi-hop QA

5.3 Benchmark entrypoint

Included in this PR:

benchmark/cacheblend/benchmark-hole.py
benchmark/cacheblend/config/config-hole.yaml

5.4 Row-by-row commands

Each row is reproduced by:

python3 benchmark/cacheblend/benchmark-hole.py <override>

Where <override> is:

| Row | Override |
|---|---|
| row1 (baseline, inflate=1) | npu=3 method=cacheblend hole_mode=no-hole inflate=1 until=150 |
| row2 (hole @ pos 3) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=3 inflate=1 until=150 |
| row3 (hole @ pos 6) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=6 inflate=1 until=150 |
| row4 (hole @ pos 8) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=8 inflate=1 until=150 |
| row5 (two holes) | npu=3 method=cacheblend hole_mode=hole hole_ids=[2,5] hole_positions=[4,7] inflate=1 until=150 |
| row6 (baseline, r=0.3) | npu=3 method=cacheblend hole_mode=no-hole inflate=1 until=150 recompute_ratio=0.3 |
| row7 (hole, r=0.3) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=6 inflate=1 until=150 recompute_ratio=0.3 |
| row8 (full vLLM) | npu=3 method=full hole_mode=no-hole inflate=1 until=150 |
| row9 (reuse) | npu=3 method=reuse hole_mode=no-hole inflate=1 until=150 |
| row10 (full, inflate=10) | npu=3 method=full hole_mode=no-hole inflate=10 until=150 |
| row13 (full, inflate=20) | npu=3 method=full hole_mode=no-hole inflate=20 until=150 |
| row16 (baseline, inflate=10, r=0.15) | npu=3 method=cacheblend hole_mode=no-hole inflate=10 recompute_ratio=0.15 until=150 |
| row17 (hole, inflate=10, r=0.15) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=10 recompute_ratio=0.15 until=150 |
| row18 (baseline, inflate=20, r=0.15) | npu=3 method=cacheblend hole_mode=no-hole inflate=20 recompute_ratio=0.15 until=150 |
| row19 (hole, inflate=20, r=0.15) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=6 inflate=20 recompute_ratio=0.15 until=150 |
| row20 (full, inflate=30) | npu=3 method=full hole_mode=no-hole inflate=30 until=150 |
| row21 (baseline, inflate=30, r=0.15) | npu=3 method=cacheblend hole_mode=no-hole inflate=30 recompute_ratio=0.15 until=150 |
| row22 (hole, inflate=30, r=0.15) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=30 recompute_ratio=0.15 until=150 |
| row23 (hole, inflate=20, r=0.15) | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=20 recompute_ratio=0.15 until=150 |
| val_A_single_hole | npu=3 method=cacheblend hole_mode=hole hole_id=2 hole_pos=4 inflate=20 recompute_ratio=0.15 until=150 |
| val_B_two_holes | npu=3 method=cacheblend hole_mode=hole hole_ids=[2,5] hole_positions=[3,6] inflate=20 recompute_ratio=0.15 until=150 |

Notes:

  • All runs use until=150 (150 queries drawn from Musique).
  • hole_mode=no-hole selects the standard contiguous-prefix CacheBlend path.
  • hole_mode=hole selects the hole-mode path introduced in this PR.
  • The default recompute_ratio (from the shipped config) is 0.5; rows that do not pass recompute_ratio use this value.
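
For scripting several rows, the override strings can be generated from a mapping. This is a sketch only: `build_command` is a hypothetical helper, and the `key=value` and `[a,b]` list formats are inferred from the overrides listed above, not from the benchmark's argument parser.

```python
import shlex


def build_command(overrides: dict) -> list[str]:
    """Build the argv for one benchmark row from key/value overrides."""
    args = ["python3", "benchmark/cacheblend/benchmark-hole.py"]
    for key, value in overrides.items():
        if isinstance(value, (list, tuple)):
            # Lists are written without spaces, e.g. hole_ids=[2,5].
            value = "[" + ",".join(str(v) for v in value) + "]"
        args.append(f"{key}={value}")
    return args


val_b = build_command({
    "npu": 3,
    "method": "cacheblend",
    "hole_mode": "hole",
    "hole_ids": [2, 5],
    "hole_positions": [3, 6],
    "inflate": 20,
    "recompute_ratio": 0.15,
    "until": 150,
})
# shlex.join quotes the bracketed arguments so a shell does not glob them.
printable = shlex.join(val_b)
```

Passing the list form to `subprocess.run` avoids shell quoting entirely; the joined string is only for logging or copy-paste.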

**File Path**: `vllm-ascend/vllm-ascend/worker/worker.py`

- In `vllm-ascend/vllm-ascend/worker/worker_v1.py`, comment out `ensure_kv_transfer_initialized(vllm_config)` in function `def _init_worker_distributed_environment`.
- In `vllm-ascend/vllm-ascend/worker/worker.py`, comment out `ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)` in function `def initialize_from_config`.

Collaborator

Nice work. I see the changes on the latest main branch.
For completeness, should we list which vllm-ascend versions are supported?
