
[Core][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4)#246

Open
marcobarlo wants to merge 15 commits into LMCache:main from marcobarlo:dsv4_support

Conversation

@marcobarlo

@marcobarlo marcobarlo commented May 28, 2026

Contributor

[Core] Support for hybrid KV cache groups with mixed type/compression (e.g., Deepseek v4 in vllm-ascend)

Summary

DeepSeek V4 on vLLM-Ascend (config) exposes 11 scheduler KV-cache groups per request (distinct block_ids, block_size, compress_ratio) and heterogeneous per-layer tensors (multi-plane tuples, shared int8 blobs, mixed dtypes).

Upstream LMCache#3171 added DSv4 on the MP path only. vLLM-Ascend uses the in-process connector; upstream RequestTracker keeps block_ids[0] and one slot_mapping, so DSv4 is wrong or fails asserts.

| Feature | Required? | Role |
| --- | --- | --- |
| 1. In-process multi-group support | Yes | All groups tracked; per-group slots; NPU grouping/dispatch; DSv4 layouts |
| 2. Multi-plane copy bundling | No (perf) | Fewer NPU launches; default LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=1 |

Background

| Layer type | kv_caches[name] | Scheduler groups |
| --- | --- | --- |
| SWA / MTP | bf16 view | 1 |
| Compress-4 | 7 views, shared int8 blob | up to 7 |
| Compress-128 | 4-tuple mixed dtypes | up to 4 |

DS4RandomQuarterLayers: 11 KVCacheGroupSpecs; new_request.block_ids is a tuple of 11 lists. Upstream MP (#3171) runs one CUDA launch per homogeneous group — it does not fuse multiple planes of one layer.


Feature 1 — In-process multi-group support (required)

Without this: groups 1..10 block tables are dropped, compressed slot_mapping lengths fail asserts, and KVLayerGroupsManager mis-groups blob / mixed-dtype layers. Exploded mode (LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=0) is the minimal correct fallback (MP-style, one transfer per homogeneous NPU group).

Workflow vs upstream

| Step | Upstream in-process | This PR |
| --- | --- | --- |
| New request | allocated_block_ids = group 0 | allocated_block_ids_by_group (all 11) |
| ReqMeta | One slot_mapping | slot_mappings_by_group[g]; primary_kv_group_idx |
| wait_for_save / start_load_kv | One slot tensor → engine | Per-group CPU/NPU maps + slot_mappings_npu_by_group |
| register_kv_caches | Flat homogeneous layers | build_flat_kv_caches + layout_hints |
| Layer groups | (kv_size, nh, hs, dtype) | build_kv_layer_groups + KVCacheFormat; scheduler-slot split |
| Store/load kernels | One pointer table | _multi_group_kv_transfer; scheduler_slot_group maps NPU index → scheduler group |
| MemoryObj | Single hidden / chunk | _patch_metadata_get_shapes per NPU group |

MP stages block IDs on the server; here the NPU connector builds per-group pointers and slot slices.
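
As an illustration of the "New request" row, here is a minimal sketch of per-group block tracking. The class name `MultiGroupTrackerSketch` and its `update` method are hypothetical and are not the PR's `RequestTracker`; the sketch only assumes that vLLM hands over one block-id list per scheduler group.

```python
from dataclasses import dataclass, field


@dataclass
class MultiGroupTrackerSketch:
    """Hypothetical stand-in for a per-request tracker (not the PR's class)."""

    num_groups: int
    allocated_block_ids_by_group: list[list[int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with one empty block list per scheduler KV-cache group.
        if not self.allocated_block_ids_by_group:
            self.allocated_block_ids_by_group = [[] for _ in range(self.num_groups)]

    def update(self, new_block_ids: tuple[list[int], ...]) -> None:
        # Assumption: one block-id list arrives per scheduler group.
        if len(new_block_ids) != self.num_groups:
            raise ValueError(
                f"expected {self.num_groups} groups, got {len(new_block_ids)}"
            )
        for g, ids in enumerate(new_block_ids):
            self.allocated_block_ids_by_group[g].extend(ids)


tracker = MultiGroupTrackerSketch(num_groups=3)
tracker.update(([0, 1], [7], [4, 5, 6]))
tracker.update(([2], [], [8]))
# Keeping only group 0 (the upstream in-process behaviour) would lose
# the block tables of groups 1 and 2.
group0_only = tracker.allocated_block_ids_by_group[0]
```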

End-to-end path

register_kv_caches:   (vllm_v1_adapter) build_flat_kv_caches → layout_hints
                      → ensure_kv_layer_groups (per NPU group MemoryObj sizing)
build_connector_meta: (multi_group_vllm_adapter) RequestTracker + ReqMeta
wait_for_save:        (vllm_v1_adapter) slot_mappings_by_group → NPU → engine.store
                      → connector: multi_layer_kv_transfer[_multi_plane] per NPU group
start_load_kv:        symmetric retrieve (load_stream)

Chunk-hash / mask sizing uses the primary group’s slot_mapping; kernels use all groups.

Main changes

  • lmcache_ascend_connector_v1.py: entry (LMCacheAscendConnectorV1Dynamic, SupportsHMA) → LMCacheAscendConnectorV1Impl
  • multi_group_vllm_adapter.py: LMCacheConnectorV1ImplMultiGroup, RequestTracker / ReqMeta, slot builders, build_connector_meta
  • vllm_v1_adapter.py: register, wait_for_save, start_load_kv, request_finished*
  • multi_spec_flatten.py, kv_format.py, kv_layer_groups.py, npu_connectors.py, skip_state_groups.py
  • mem_kernels.cpp / utils.cpp: DSA-C8 sizing on homogeneous kernel path
  • lmcache_ascend/__init__.py: optional _patch_vllm_v1_adapter() shim into upstream lmcache.integration.vllm.vllm_v1_adapter on import lmcache_ascend (not the DSv4 connector entry path)

Concepts

  • Plane — one KV component of a layer (sub-tensor: own buffer or shared-allocation view). Bundled layers have several planes; each may map to a different scheduler group.
  • Flat entry — one slot in LMCache’s per-layer list. Bundling on: one tuple per multi-spec layer; off: one entry per plane.
  • Same kernel — iff registered and (a) planes of the same flat entry, or (b) different flat entries with the same shape key (kv_size, hidden, block_size, dtype_key, num_tensors) and scheduler group.
  • Scheduler KV-cache group — one vLLM KVCacheGroupSpec; slot_mapping[g] length ceil(T / compress_ratio[g]).
  • NPU layer group — flat indices sharing shape key and scheduler group after split → one kernel per chunk.
  • scheduler_slot_group — NPU group i uses slot_mappings_by_group[g] where g is that group’s scheduler index (not [i]).
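
The relationship between NPU layer groups and scheduler_slot_group can be sketched as follows. This is an illustrative model only: `FlatEntry` and `group_flat_entries` are hypothetical names, the shape-key values are made up, and only case (b) of the "same kernel" rule (different flat entries sharing shape key and scheduler group) is covered.

```python
from collections import defaultdict
from typing import NamedTuple


class FlatEntry(NamedTuple):
    """Hypothetical description of one flat KV entry."""

    layer_idx: int
    shape_key: tuple  # (kv_size, hidden, block_size, dtype_key, num_tensors)
    sched_group: int  # scheduler KV-cache group index g


def group_flat_entries(entries):
    """Bucket flat entries by (shape key, scheduler group).

    Returns (npu_groups, scheduler_slot_group): npu_groups[i] lists the flat
    indices of NPU group i, scheduler_slot_group[i] is its scheduler group g.
    """
    buckets = defaultdict(list)
    for e in entries:
        buckets[(e.shape_key, e.sched_group)].append(e.layer_idx)
    npu_groups, scheduler_slot_group = [], []
    for (_, g), idxs in buckets.items():  # dicts preserve insertion order
        npu_groups.append(idxs)
        scheduler_slot_group.append(g)
    return npu_groups, scheduler_slot_group


entries = [
    FlatEntry(0, (2, 512, 128, "bf16", 1), sched_group=1),
    FlatEntry(1, (2, 512, 128, "bf16", 1), sched_group=2),  # same shape, other group
    FlatEntry(2, (2, 512, 128, "bf16", 1), sched_group=1),
    FlatEntry(3, (1, 64, 128, "int8", 4), sched_group=8),
]
npu_groups, scheduler_slot_group = group_flat_entries(entries)

slot_mappings_by_group = {1: "slots-sg1", 2: "slots-sg2", 8: "slots-sg8"}
# NPU group i reads slot_mappings_by_group[scheduler_slot_group[i]], not [i].
picked = [slot_mappings_by_group[g] for g in scheduler_slot_group]
```

Entries 0 and 1 share a shape key but land in different NPU groups because their scheduler groups differ, which is the "scheduler-slot split" step.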

DSV4 mapping (bundled, state skipped):

| Scheduler group | Role | NPU group | Bundled? |
| --- | --- | --- | --- |
| sg0, sg1, sg3 | CR4 attn + SWA + indexer (+ scale) | g0 (4 planes) | yes: one flat entry per CR4 layer |
| sg1 @ layer 21 | Standalone SWA | g1 | no |
| sg2 | SWA on even layers ≥22 | g2 | no |
| sg8 | Compress-128 (SW=128) | g3 | no |
| sg4–7, sg10 | State specs | skipped at registration | |
| sg9 | CR128 SWA companion | slot mapping only; not in MemoryObj | |

Example (T=512): len(slot_mapping[g]) = ceil(512 / CR[g]); primary group = argmax_g(len(block_ids[g]) × block_size[g]).
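
The two formulas in the example can be checked with a short sketch. The function names and the sample compress ratios, block tables and block sizes below are illustrative assumptions, not the actual DSv4 values.

```python
def slot_mapping_lengths(num_tokens, compress_ratios):
    """len(slot_mapping[g]) = ceil(T / CR[g]), using integer ceiling division."""
    return [-(-num_tokens // cr) for cr in compress_ratios]


def primary_group(block_ids, block_sizes):
    """argmax over g of len(block_ids[g]) * block_size[g] (first maximum wins)."""
    coverage = [len(ids) * bs for ids, bs in zip(block_ids, block_sizes)]
    return max(range(len(coverage)), key=coverage.__getitem__)


# Illustrative ratios only: uncompressed, compress-4, compress-128.
lengths = slot_mapping_lengths(512, [1, 4, 128])

# Illustrative block tables: group 0 covers 4 * 128 = 512 token slots.
primary = primary_group(([0, 1, 2, 3], [9], [5]), [128, 128, 128])
```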

Architectural choices

Skip state groups (skip_state_groups.py, default on): DSV4 state groups (C4AttnKVState, C4IndexerKVState, C128AttnScoreState, …) hold recurrent/score buffers not needed for prefix reuse.

  • LMCACHE_ASCEND_SKIP_STATE_GROUPS=1 — filter before layer-group planning.
  • LMCACHE_ASCEND_SKIP_STATE_SPEC_ALLOWLIST — optional; default skips six *State* specs (sg4–7, sg10).
  • Bundled: remove skipped planes from tuples; drop empty flat entries.
  • Exploded: drop flat entries for skipped scheduler groups.
  • Indexer KV (C4Indexer, sg3) is not skipped.
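
A minimal sketch of the two filtering modes, assuming each plane is represented as a hypothetical (plane_name, scheduler_group) pair; the plane names are invented and the PR's real data structures are not shown here.

```python
def filter_bundled(flat_entries, skipped_groups):
    """Bundled mode: each flat entry is a list of (plane_name, sched_group)."""
    kept = []
    for planes in flat_entries:
        remaining = [p for p in planes if p[1] not in skipped_groups]
        if remaining:  # drop flat entries that became empty
            kept.append(remaining)
    return kept


def filter_exploded(flat_entries, skipped_groups):
    """Exploded mode: each flat entry is a single (plane_name, sched_group)."""
    return [p for p in flat_entries if p[1] not in skipped_groups]


SKIPPED = {4, 5, 6, 7, 10}  # state scheduler groups from the mapping above

bundled = [
    [("attn", 0), ("swa", 1), ("indexer", 3), ("attn_state", 4)],
    [("score_state", 10)],  # only state planes: the whole entry is dropped
]
exploded = [("attn", 0), ("attn_state", 4), ("indexer", 3)]

bundled_kept = filter_bundled(bundled, SKIPPED)
exploded_kept = filter_exploded(exploded, SKIPPED)
```

Note that the indexer plane (sg3) survives in both modes, matching the last bullet.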

Sliding-window store (sg8, SW=128): only the last sliding_window tokens per LMCache chunk are persisted. Storing the full logical chunk over-allocated MemoryObj rows (~5.2× before fix).

  1. Store mask (multi_group_vllm_adapter.py) — slot_mapping[g] = -1 for tokens outside chunk_end - sliding_window; kernels skip -1.
  2. Physical chunk size (kv_layer_groups.py, npu_connectors.py, _patch_metadata_get_shapes) — physical_chunk_size = sliding_window // compress_ratio (SW=128, CR=128 → 1 row for sg8).
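
The two steps above can be sketched in plain Python. This is a simplified model under stated assumptions: the slot list is indexed per token (the real slot mapping of a compressed group is shorter), and `mask_outside_window` is a hypothetical helper, not the PR's implementation.

```python
def mask_outside_window(slot_mapping, chunk_size, sliding_window):
    """Set slots to -1 unless the token lies in the last `sliding_window`
    tokens of its chunk; kernels are assumed to skip -1 entries."""
    masked = list(slot_mapping)
    for i in range(len(masked)):
        chunk_start = (i // chunk_size) * chunk_size
        chunk_end = min(chunk_start + chunk_size, len(masked))
        if i < chunk_end - sliding_window:
            masked[i] = -1
    return masked


def physical_chunk_size(sliding_window, compress_ratio):
    """Rows actually persisted per chunk: SW // CR."""
    return sliding_window // compress_ratio


# Two chunks of 4 tokens, window of 2: only the tail of each chunk is stored.
masked = mask_outside_window(list(range(100, 108)), chunk_size=4, sliding_window=2)
rows_sg8 = physical_chunk_size(128, 128)
```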

Multi-group mode requires discard_partial_chunks=True globally.


Feature 2 — Multi-plane copy bundling (optional)

Feature 1 alone is sufficient for correctness via exploded mode: each plane is a flat layer; each homogeneous NPU group gets multi_layer_kv_transfer — same idea as MP’s multiple CUDA launches. Bundling is a performance choice on NPU (launch overhead on Atlas A2), not something upstream MP implements.

| Aspect | Upstream in-process | Upstream MP (#3171) | This PR (bundled) |
| --- | --- | --- | --- |
| Multi-spec layer | Not supported | N registrations, N launches | One tuple (MULTI_PLANE_KV / DSA_C8_KV) |
| Heterogeneous block_size | N/A | Separate kernels | multi_layer_kv_transfer_multi_plane |
| Chunk layout | Standard | Per-group slots | Plane-major uint8 rows |
| Slot map | Single tensor | Server-built | Concat + multi_plane_slot_slice_bounds |

Files: multi_layer_kv_transfer_multi_plane, KVCacheFormat.MULTI_PLANE_KV / DSA_C8_KV, _invoke_multi_plane_kv_transfer. Launch count: 4 NPU groups vs exploded O(planes × layers).
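
The "Concat + multi_plane_slot_slice_bounds" entry is not spelled out in this description. One plausible reading, sketched here purely as an assumption, is that per-plane slot lists are concatenated and prefix offsets record where each plane's slice starts and ends. `concat_slot_maps` is a hypothetical name.

```python
from itertools import accumulate


def concat_slot_maps(per_plane_slots):
    """Concatenate per-plane slot lists and return (flat, bounds).

    bounds[p] and bounds[p + 1] delimit plane p inside `flat`.
    """
    flat = [s for plane in per_plane_slots for s in plane]
    bounds = [0, *accumulate(len(plane) for plane in per_plane_slots)]
    return flat, bounds


# Three planes with different slot counts (e.g. different compress ratios).
flat, bounds = concat_slot_maps([[10, 11, 12, 13], [5], [7, 8]])
plane_1 = flat[bounds[1]:bounds[2]]
```

A single launch can then receive one slot tensor plus the bounds, instead of one launch per plane.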


Breakage without this PR:

  • RequestTracker / ReqMeta — single group; wrong retrieve assert.
  • KVLayerGroupsManager + gpu_connector/utils.py — no blob / multi-plane / per-plane block_size.
  • VLLMPagedMemGPUConnectorV2 — flat pointer table; mixed tuple lengths break.
  • LMCacheMetadata.get_shapes — cannot size per-group multi-plane row bytes.
  • multi_layer_kv_transfer — scalar block_size insufficient for bundled planes.
  • HMA request_finished — needs request_finished_all_groups for tuple block_ids.

Flags and tests

| Flag | Effect |
| --- | --- |
| LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=0 | Exploded planes; feature 2 off |
| LMCACHE_ASCEND_FLATTEN_MULTI_SPEC=0 | Legacy single-group; DSv4 unsupported |
| LMCACHE_ASCEND_SKIP_STATE_GROUPS=1 (default) | Skip state scheduler groups |

Tests: test_ds4_kvcache_roundtrip.py, test_multi_group_vllm_adapter.py, test_npu_connector_multi_group_load.py, test_mem_kernels.py, test_kv_layer_groups_npu.py, test_per_group_memory_allocation.py. Single-group models unchanged.


Testing

Configuration:

Official vllm-ascend configuration for DeepseekV4 and in addition:

    --no-disable-hybrid-kv-cache-manager \
    --kv-transfer-config '{"kv_connector":"LMCacheAscendConnectorV1Dynamic","kv_role":"kv_both", "kv_connector_module_path":"lmcache_ascend.integration.vllm.lmcache_ascend_connector_v1"}'\
    --no-enable-prefix-caching \

LMCache configuration (tag v0.4.5)

export LMCACHE_TRACK_USAGE=false
export LMCACHE_MAX_LOCAL_CPU_SIZE="70"
export LMCACHE_LOCAL_CPU="True"
export LMCACHE_LOG_LEVEL=INFO
export LMCACHE_USE_LAYERWISE="False"
export LMCACHE_NUMA_MODE="auto"
export LMCACHE_CHUNK_SIZE=1024
export LMCACHE_EXTRA_CONFIG='{"save_only_first_rank": false}'

The multi-group connector defaults to discarding partial chunks.

Results GSM8K

GSM8K results, first run (no LMCache KV cache hit):
[image: full_GSM8K_FIRST_RUN]

GSM8K results, second run (~30% of requests with a KV cache hit from LMCache):
[image: full_GSM8K_SECOND_RUN]

Results vllm bench

vllm bench serve \
    --host "${HOST}" \
    --port "${PORT}" \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model "${MODEL_NAME}" \
    --tokenizer "${MODEL_PATH}" \
    --tokenizer-mode deepseek_v4 \
    --dataset-name prefix_repetition \
    --num-prompts 100 \
    --request-rate 0.5 \
    --prefix-repetition-num-prefixes 30 \
    --prefix-repetition-prefix-len 30000 \
    --prefix-repetition-suffix-len 10000 \
    --prefix-repetition-output-len 1000 \
    --ignore-eos \
    --temperature 0

Results without LMCache (vLLM prefix caching disabled):
[image: bench_baseline]

Results with LMCache (vLLM prefix caching disabled):
[image: bench_lmc]

EDIT 1 - Upstream HMA (cec6092, #3419)

#3171 added DSv4 on MP CUDA; cec6092 generalises to HMA (LMCacheGroupView, group_layers_by_identity, Gemma-3 CI). Same problem class; this PR targets the in-process NPU path.

| Concern | Upstream cec6092 | This PR |
| --- | --- | --- |
| Block IDs / slots | expand_block_ids_to_views | allocated_block_ids_by_group, slot_mappings_by_group[g] |
| Grouping | group_layers_by_identity (CUDA) | build_kv_layer_groups + KVCacheFormat + scheduler split |
| Physical chunk | physical_chunk_size | + CR + sliding_window_size_by_group |
| Runtime | MP + CUDA | In-process NPU |
| Layouts / bundling | Homogeneous CUDA tensors | Tuple/blob/multi-plane; multi_layer_kv_transfer_multi_plane |
| State / SW | All groups; SW out of scope | Skip *State*; tail store + SW // CR sizing (~5.2× alloc fix) |

Net-new: in-process connector stack (LMCacheAscendConnectorV1Dynamic → LMCacheAscendConnectorV1Impl → LMCacheConnectorV1ImplMultiGroup), Ascend format detection, multi-plane kernel, state skip, SW-aware allocation.

Integration: land on LMCache-Ascend for DSv4; on rebase ≥ cec6092 keep in-process adapter + Ascend grouping + skip/SW/bundling (upstream HMA does not replace them). Port engine-neutral SW helpers upstream when DSv4 MP is needed. Cherry-picking cec6092 alone does not fix vLLM-Ascend.

EDIT 2 -- Latest vLLM-Ascend DeepseekV4

vLLM-Ascend now has a newer release for DeepseekV4. It likely uses a different KV cache structure than the v0.18 release tested for this PR, so additional tests are needed.

EDIT 3 -- Latest vLLM-Ascend DeepseekV4 is now supported (unstable)

The main differences between the two versions are as follows:
offload_v20_vs_v18_diff.md

Part of the multi-group/multi-plane bundling (feature 2) in this PR may not be needed to support v0.20, so it could be stripped out along with the new kernel. I suggest merging this PR into a branch of LMCache-Ascend to support DSv4 on v0.18, and then merging only the needed parts into the main branch to continue with upstream from v0.20 on. The multi-plane path and multi-plane kernel would only be useful for DSA C8, which DSv4 does not use anyway (it is used in DSv3.2 and GLM5 with enable_sparse_c8=True).

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces multi-group KV cache support for LMCache-Ascend (DSv4 / HMA), including a new multi-plane transfer kernel, multi-spec flattening/bundling logic, and updates to the NPU connector and vLLM v1 adapter. The code review highlights several critical issues: a missing mapping for signed int8 tensors in get_dtype_from_torch, a lack of host-side validation for num_planes and block_sizes (which could lead to out-of-bounds access or division-by-zero), and performance bottlenecks from synchronous stream synchronization and redundant .pin_memory() calls on small slot mapping tensors. Additionally, the use of strict=True in zip() introduces Python 3.10+ compatibility issues, and a commented-out version check in tests/bootstrap.py disables dependency syncing.

Comment thread csrc/utils.cpp Outdated
Comment thread csrc/mem_kernels.cpp Outdated
Comment thread csrc/mem_kernels.cpp Outdated
Comment thread csrc/mem_kernels.cpp Outdated
Comment thread lmcache_ascend/integration/vllm/vllm_v1_adapter.py Outdated
Comment thread lmcache_ascend/integration/vllm/vllm_v1_adapter.py Outdated
Comment thread lmcache_ascend/integration/vllm/multi_group_vllm_adapter.py Outdated
Comment thread lmcache_ascend/integration/vllm/multi_group_vllm_adapter.py Outdated
Comment thread lmcache_ascend/integration/vllm/multi_group_vllm_adapter.py Outdated
Comment thread tests/bootstrap.py Outdated
@marcobarlo marcobarlo force-pushed the dsv4_support branch 3 times, most recently from 9caec65 to 9f1780b Compare May 29, 2026 08:17
@marcobarlo

Contributor Author

/gemini review

@marcobarlo

Contributor Author

@copilot review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces multi-group and multi-plane KV cache transfer support for LMCache-Ascend, adding new NPU kernels and multi-group adapter layers to handle heterogeneous block sizes and multi-plane layouts. The review feedback identifies several critical issues, including a potential runtime crash due to a strict alignment assertion, a performance bottleneck from synchronous device-to-host transfers in a loop, incorrect block ID calculations from hardcoded block sizes in failure recording, and a risk of retrieving incorrect memory addresses from tensor views using untyped_storage().data_ptr().

Comment thread lmcache_ascend/v1/kv_layer_groups.py Outdated
Comment thread csrc/mem_kernels.cpp Outdated
Comment thread lmcache_ascend/integration/vllm/multi_group_vllm_adapter.py Outdated
Comment thread lmcache_ascend/integration/vllm/multi_group_vllm_adapter.py
Comment thread lmcache_ascend/integration/vllm/vllm_v1_adapter.py
Comment thread lmcache_ascend/v1/npu_connector/npu_connectors.py Outdated
Comment thread csrc/mem_kernels.cpp Outdated
@marcobarlo marcobarlo changed the title deepseek v4 support [Feat][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4) Jun 9, 2026
@marcobarlo marcobarlo changed the title [Feat][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4) [Core][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4) Jun 10, 2026
@marcobarlo marcobarlo force-pushed the dsv4_support branch 3 times, most recently from 889d5ab to cc22178 Compare June 12, 2026 17:58
@marcobarlo marcobarlo marked this pull request as ready for review June 12, 2026 18:00
Comment thread lmcache_ascend/v1/memory_management.py Outdated
class LMCacheAscendConnectorV1Dynamic(LMCacheConnectorV1Dynamic):
def __init__(self, vllm_config: "VllmConfig", role: KVConnectorRole) -> None:
super().__init__(vllm_config=vllm_config, role=role)
class LMCacheAscendConnectorV1Dynamic(LMCacheConnectorV1Dynamic, SupportsHMA):

Collaborator

We will need to add 'SupportsHMA' to the entry connector from vllm-ascend.
Two options:

  1. make a PR in vllm-ascend for the in-process connector to have SupportsHMA
  2. create a connector in LMCache-Ascend and patch it in when lmcache-ascend is imported.

wdyt @marcobarlo @chloroethylene
It's probably better to go with 1.

Collaborator

as discussed, let's go with 2) since in process mode is going to be deprecated.

Comment thread docs/deployment.md
Comment on lines +129 to +168
## for vLLM-MindSpore

### Docker

1. Clone LMCache-Ascend Repo
Our repo contains a kvcache ops submodule for ease of maintenance, therefore we recommend cloning the repo with submodules.

```bash
cd /workspace
git clone --recurse-submodules https://github.com/LMCache/LMCache-Ascend.git
```

2. Build Docker Image
```bash
cd /workspace/LMCache-Ascend
docker build -f docker/mindspore/Dockerfile.a2.openEuler -t lmcache-ascend:v0.4.4-mindspore2.7.1.post1-openeuler .
```

3. Start Container
Once that is built, run it with the following cmd
```bash
docker run -itd \
--shm-size 200g --privileged \
--net=host \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /var/log/npu/:/var/log/npu \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
-v /lib/modules:/lib/modules:ro \
-v /usr/src/kernels:/usr/src/kernels:ro \
-v /mnt/storage1/data:/data \
-v /home:/home \
--name lmcache-ascend-ms \
--entrypoint /bin/bash \
lmcache-ascend:v0.4.4-mindspore2.7.1.post1-openeuler
```

Collaborator

As discussed, change this back to v0.3.12 for actual tested ver.

Comment thread docs/deployment.md Outdated
--trust-remote-code \
--disable-log-requests \
--block-size 128 \
--kv-transfer-config '{"kv_connector":"LMCacheAscendConnectorV1Dynamic","kv_role":"kv_both", "kv_connector_module_path":"lmcache_ascend.integration.vllm.lmcache_ascend_connector_v1"}'

Collaborator

change this to use the in process connector in vllm-ascend

Comment thread lmcache_ascend/v1/slot_mapping_utils.py Outdated
Comment on lines +146 to +148
ratios = tuple(int(r) for r in compress_ratios)
if not ratios:
    ratios = (1,) * len(slot_mappings_by_group)

Collaborator

unused ?

Contributor Author

Leftover from refactoring. compress_ratios function parameter is also unused. To be removed

Comment on lines +25 to +32
def flatten_multi_spec_enabled() -> bool:
    val = os.environ.get("LMCACHE_ASCEND_FLATTEN_MULTI_SPEC", "1")
    return val != "0"


def bundle_multi_spec_enabled() -> bool:
    val = os.environ.get("LMCACHE_ASCEND_BUNDLE_MULTI_SPEC", "1")
    return val != "0"

Collaborator

we should probably change these into config definitions, as per __init__.py

like:

lmcache.v1.config._CONFIG_DEFINITIONS["ascend_bundle_multi_spec"] = {
    "type": bool,
    "default": True,
    "env_converter": _to_bool,
    "description": (
        "Whether LMCache-Ascend keeps multi-spec KV planes bundled for "
        "multi-plane NPU transfer. If False, multi-spec planes are exploded "
        "into synthetic .subN layers for legacy/fallback handling."
    ),
}

Comment on lines +62 to +63
if os.environ.get("LMCACHE_ASCEND_SKIP_STATE_GROUPS", "1") != "1":
    return None

Collaborator

Similarly, it would be best if we could make these configurable via the config definitions in __init__.py

Comment on lines +253 to +267
if self.use_layerwise:
    if idx == last_idx:
        sync = True
    else:
        sync = False
    # NOTE(Jiayi): Perform blending before layerwise prefix caching
    if self.enable_blending:
        self.blender.blend(
            tokens[:lmcache_cached_tokens],
            token_mask[:lmcache_cached_tokens],
            kvcaches=kvcaches,
            slot_mapping=slot_mapping_npu[:lmcache_cached_tokens],
            vllm_cached_tokens=request.load_spec.vllm_cached_tokens,
        )
    else:

Collaborator

Not sure whether we should enable blending here tbh for kvcache groups > 1

Contributor Author

That code path is there because it was copy-pasted from super().start_load_kv() and extended for multi-group, but it is actually untested, and we probably do not want to test it as SP is going to be deprecated. Flagging for deletion.

Comment on lines +1064 to +1065
indices = group.layer_indices
rep = kv_caches[indices[0]]

Collaborator

Is it always valid to use indices[0]? It might be worth adding a small comment on this assumption

Contributor Author

Flagging to add the following comment:

            # ``build_kv_layer_groups`` buckets layers by layout identity
            # (kv_size, hidden, block_size, dtype, tensor count). Every member
            # of ``indices`` shares that key and ``group.shape_desc``, so format
            # detection and ``_derive_group_params`` on ``indices[0]`` apply to
            # the whole group; per-layer device pointers are collected below.

Does it clarify?

n = len(pb)
if kv_format == KVCacheFormat.DSA_C8_KV and n != 4:
    raise ValueError(f"DSA-C8 expects 4 plane byte widths, got {n}")
if not bs_list:

Collaborator

nit: might be len(bs_list) == 0 or planes is not None ?

Contributor Author

Preferably len(bs_list) == 0, as planes is not None is already the gating condition for building the data structure above.

Comment on lines +307 to +312
if is_310p():
    block_size = int(k_cache.shape[-2])
    page_buffer_size = int(k_cache.shape[0]) * block_size
else:
    block_size = int(k_cache.shape[1])
    page_buffer_size = int(k_cache.shape[0]) * block_size

Collaborator

nit: trivial, maybe worth taking the page_buffer_size out of the if branches
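
For reference, a minimal sketch of the hoisted form. It is written against a plain shape tuple and takes the 310P check as a boolean argument so it runs without torch or NPU hardware; `derive_paging` is a hypothetical name, not a function in the PR.

```python
def derive_paging(k_cache_shape, is_310p):
    """Return (block_size, page_buffer_size) from a KV cache shape tuple."""
    # Per the snippet under review: block size is at dim -2 on 310P, else dim 1.
    block_size = int(k_cache_shape[-2] if is_310p else k_cache_shape[1])
    # Same expression in both branches, so it is computed once after them.
    page_buffer_size = int(k_cache_shape[0]) * block_size
    return block_size, page_buffer_size


default_layout = derive_paging((10, 128, 8, 64), is_310p=False)
layout_310p = derive_paging((10, 8, 128, 64), is_310p=True)
```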

Comment on lines +324 to +327
if layer_name and layer_name in sched_groups:
    sched_per_plane = list(sched_groups[layer_name])
    if len(sched_per_plane) == 1:
        sched_per_plane = sched_per_plane * 2

Collaborator

is it the kv lora and rope ?

@marcobarlo marcobarlo Jun 25, 2026

Contributor Author

No. The MLA_KV format is detected generically any time the KV cache is a tuple of 2 tensors with different shapes (k_cache.shape != v_cache.shape).
In the standard MLA path this translates to lora and rope, but in that case we use the standard copy path and do not enter the multi-group path. Since this snippet is in _derive_group_params, we are in the heterogeneous/multi-group path: each plane is generically a buffer of bytes with its own geometry. As an example, in v0.20 the indexer is a tuple of an int8 K tensor with last dim 128 and an FP16 scale with last dim 1 (copying the indexer might not be strictly necessary tho...).

Probably a cleaner version of the snippet would be:

n_planes = len(entry)
if len(sched_per_plane) == 1 and n_planes > 1:
    sched_per_plane = sched_per_plane * n_planes

but at the moment there is a little difference because to detect MLA n_planes must be 2.
@matthewygf wdyt?

Collaborator

The cleaner version looks a lot better tbh !
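
A self-contained sketch of the cleaner version discussed in this thread, wrapped in a hypothetical helper `broadcast_sched_groups` so its behaviour can be checked in isolation. It does not address the MLA detection caveat (n_planes must be 2) mentioned above.

```python
def broadcast_sched_groups(sched_per_plane, n_planes):
    """Repeat a single scheduler group index across all planes of an entry."""
    sched = list(sched_per_plane)
    if len(sched) == 1 and n_planes > 1:
        sched = sched * n_planes
    return sched


four_planes = broadcast_sched_groups([3], 4)   # one group shared by 4 planes
already_split = broadcast_sched_groups([0, 1], 2)  # per-plane groups untouched
single_plane = broadcast_sched_groups([5], 1)
```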

Comment on lines +789 to +791
from lmcache_ascend.v1.npu_connector.utils import (
    permute_kv_caches_to_contiguous,
)

Collaborator

does this actually have to be lazily imported...

Contributor Author

No need, we anyway need to import at least once. Probably leftover from refactoring. Flag to move above.

Comment on lines +880 to +885
# Third Party
from lmcache.v1.gpu_connector.utils import normalize_kv_and_discover_format
from lmcache.v1.kv_layer_groups import KVLayerGroupsManager
from lmcache.utils import EngineType

from lmcache_ascend.v1.kv_layer_groups import build_kv_layer_groups

Collaborator

These also don't need to be lazily imported, I think

Contributor Author

Likewise. Flag to move them as module-level imports.

Comment on lines +955 to +956
while len(self._mp_launch_bufs) <= npu_group_idx:
    self._mp_launch_bufs.append(None)

Collaborator

nit: it's better to switch to a compact form, maybe like

missing = npu_group_idx + 1 - len(self._mp_launch_bufs)
if missing > 0:
    self._mp_launch_bufs.extend([None] * missing)
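
A quick standalone check that the compact form matches the original while-loop, using plain lists in place of self._mp_launch_bufs. Both helper names are hypothetical.

```python
def pad_with_loop(bufs, idx):
    """Original form: append None until index `idx` exists."""
    bufs = list(bufs)
    while len(bufs) <= idx:
        bufs.append(None)
    return bufs


def pad_compact(bufs, idx):
    """Suggested form: extend once by the number of missing slots."""
    bufs = list(bufs)
    missing = idx + 1 - len(bufs)
    if missing > 0:
        bufs.extend([None] * missing)
    return bufs


grown = pad_compact([], 2)
unchanged = pad_compact(["a", "b", "c"], 1)
```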

Comment on lines +993 to +995
cached = mp_launch_meta.get(key)
if cached is None:
    return

Collaborator

This is very unclear. Does it mean build_mp_launch_meta has already determined there is no work for this key?

Contributor Author

This line is something of a safety net: all NPU groups are iterated for each MemoryObj. However, some groups might have no work to do (e.g., all tokens stripped out or compressed, or other borderline cases). In these cases build_mp_launch_meta inserts no key in the meta, and at invocation time we skip the kernel call. Probably worth adding a comment

            # Batch precompute omits keys with no valid slots (has_work=False);
            # same as mp_launch_meta is None path — skip the kernel for this chunk/group.

and add a log level debug in such case (see next comment).

Comment on lines +1011 to +1012
if not has_work:
    return

Collaborator

probably worth a logger ?

Contributor Author

See comment above

Comment on lines +1085 to +1090
cpu_ptrs = torch.empty(len(ptrs), dtype=torch.int64, device="cpu")
cpu_ptrs.numpy()[:] = ptrs
gpu_ptrs = torch.empty(
    len(ptrs), dtype=torch.int64, device=self.kvcaches_device
)
gpu_ptrs.copy_(cpu_ptrs)

Collaborator

worth making these async ? were there any perf hits ?

Contributor Author

It is a one-off cold path on the first copy. Probably not worth the hassle

Comment on lines +1366 to +1370
SimpleNamespace(
    nb=int(k_cache.shape[0]),
    bs=int(k_cache.shape[1]),
    block_stride_elems=0,
),

Collaborator

nit: while it is okay, not a big fan of simplenamespace, maybe even refactor the function to explicitly take nb, bs, block_stride is better

Comment on lines +1588 to +1594
if (
    filtered is None
    or prefixes is None
    or slot_mappings_by_group is None
    or not starts
):
    return

Collaborator

should we raise an error here? it is not clear to me when these conditions would hold......

Contributor Author

No. This is just a shortcut for normal non-multi-group transfers that use the normal non-multi-plane copy kernel to skip this as not needed. Furthermore, this function is just an optimization. In case the meta is needed by the multi-plane kernel, it is created just in time.

Comment on lines +2105 to +2106
from lmcache_ascend.v1.kv_layer_groups import _lmc_chunk_hidden_bytes

Collaborator

let's move this to the module level import as well.

Comment thread lmcache_ascend/v1/kv_format.py Outdated
Comment on lines +20 to +25
def _plane_block_size(tensor: torch.Tensor) -> int:
    # block_size lives at dim-1 for both 3-D (nb, bs, hidden) and
    # 4-D (nb, bs, nh, hs) layouts; ndim < 3 has no paging geometry.
    if tensor.ndim >= 3:
        return int(tensor.shape[1])
    raise ValueError(f"Unexpected KV plane ndim={tensor.ndim}")

Collaborator

nit: it would be better if we could pass in a tensor format hint somewhere (e.g. the vLLM hints). wdyt @marcobarlo?

Contributor Author

The concept of a cache plane lives only in LMCache; vLLM only exposes block_sizes_by_group (one entry per vLLM scheduler group), which may or may not match the LMCache NPU groups. For this reason, passing such a hint through is non-trivial. The most straightforward way is to keep it like this and make it more robust (e.g., is_310p formats and so on).
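For reference, one possible shape for an optional format hint, sketched under assumptions: the PlaneLayout enum, its member names, and the shape-tuple argument are hypothetical (the diff takes a torch.Tensor and has no hint). Without a hint it falls back to the rule in the diff, so existing callers would behave the same.

```python
from enum import Enum
from typing import Optional, Sequence


class PlaneLayout(Enum):
    """Hypothetical layout hints; the names are illustrative only."""

    NB_BS_HIDDEN = 3  # 3-D plane: (nb, bs, hidden)
    NB_BS_NH_HS = 4   # 4-D plane: (nb, bs, nh, hs)


def plane_block_size(shape: Sequence[int], layout: Optional[PlaneLayout] = None) -> int:
    """Return the block size of a KV plane given its shape.

    Takes a plain shape tuple (e.g. tuple(tensor.shape)) so the sketch
    stays dependency-free.
    """
    if layout is not None:
        # With a hint, check that the shape matches the declared layout.
        if len(shape) != layout.value:
            raise ValueError(
                f"layout {layout.name} expects ndim={layout.value}, got {len(shape)}"
            )
        return int(shape[1])  # block_size is dim 1 in both hinted layouts
    # No hint: same rule as the diff (dim 1 whenever ndim >= 3).
    if len(shape) >= 3:
        return int(shape[1])
    raise ValueError(f"Unexpected KV plane ndim={len(shape)}")


bs_unhinted = plane_block_size((10, 128, 512))
bs_hinted = plane_block_size((10, 64, 8, 64), PlaneLayout.NB_BS_NH_HS)
```

The hint only adds validation here; where it would come from is exactly the open question raised in the reply above, since vLLM's scheduler groups need not match LMCache's planes.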

Comment thread lmcache_ascend/v1/kv_layer_groups.py Outdated
Multiple dtype views alias one allocation; only the primary view (largest byte
coverage) carries the canonical paging geometry.
"""
del vllm_block_size

Collaborator

What is this del vllm_block_size for?

@marcobarlo marcobarlo Jun 25, 2026

Contributor Author

Leftover from a refactor. Flagged for removal, both here and in the function signature.

Comment thread lmcache_ascend/v1/kv_layer_groups.py Outdated
*,
is_310p: bool = False,
vllm_block_size: Optional[int] = None,
) -> tuple[int, int, int, Any, int]:

Collaborator

nit: the tuple[int, int, int, Any, int] return type is really hard to understand :/ It is probably better to use a named type.

Contributor Author

Agreed. Flagged to use a named tuple.
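A minimal sketch of what the named type could look like, using typing.NamedTuple. The diff only shows the element types (int, int, int, Any, int), so every field name below, the class name PlaneGeometry, and the sample values are assumptions made for illustration.

```python
from typing import Any, NamedTuple


class PlaneGeometry(NamedTuple):
    """Hypothetical names for the five positions of the returned tuple."""

    num_blocks: int
    block_size: int
    hidden_dim: int
    dtype: Any
    elem_bytes: int


def inspect_plane() -> PlaneGeometry:
    # Made-up values; a real implementation would read them from the tensor.
    return PlaneGeometry(
        num_blocks=8, block_size=128, hidden_dim=512, dtype="int8", elem_bytes=1
    )


geom = inspect_plane()

# A NamedTuple is still a tuple, so existing positional unpacking keeps working.
nb, bs, hidden, dtype, nbytes = geom
```

Since the result remains a plain tuple at runtime, call sites that unpack positionally would not need to change, while new code can read geom.block_size instead of indexing.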

Comment thread lmcache_ascend/v1/kv_layer_groups.py Outdated
Comment on lines +295 to +298
def _get_first_layer_index(key):
    return groups_dict[key][0]

sorted_keys = sorted(groups_dict.keys(), key=_get_first_layer_index)

Collaborator

maybe:

# Sort groups by first layer index to maintain deterministic flat-kv order.
sorted_keys = sorted(groups_dict, key=lambda key: groups_dict[key][0])
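As a quick check that the suggested one-liner is equivalent to the helper-based version, here is a small sketch. The contents of groups_dict are made up (hypothetical group keys mapping to layer indices); only the two sorting expressions come from the thread.

```python
# Hypothetical grouping: group key -> layer indices belonging to that group.
groups_dict = {
    ("int8", 4): [3, 7, 11],
    ("bf16", 1): [0, 1, 2],
    ("bf16", 128): [5, 9],
}


def _get_first_layer_index(key):
    return groups_dict[key][0]


# Version from the diff: named helper as the sort key.
sorted_named = sorted(groups_dict.keys(), key=_get_first_layer_index)

# Suggested version: iterate the dict directly and use a lambda.
sorted_lambda = sorted(groups_dict, key=lambda key: groups_dict[key][0])
```

Both produce the keys ordered by their first layer index, since iterating a dict yields its keys and sorted is stable in either form.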

Comment on lines +162 to +177
__aicore__ inline void CopyPagedToUb(AscendC::LocalTensor<uint8_t> &dst,
    const int64_t dstByteOff, AscendC::GlobalTensor<uint8_t> &src,
    const int64_t srcByteOff, const uint32_t copyBytes) const {
    const AscendC::DataCopyExtParams params{1u, copyBytes, 0u, 0u, 0u};
    const AscendC::DataCopyPadExtParams<uint8_t> pad{false, 0u, 0u, 0u};
    AscendC::DataCopyPad(dst[dstByteOff], src[srcByteOff], params, pad);
}

// Copy one contiguous byte span from UB at srcByteOff into Paged GM via DataCopyPad.
// No branches; used as the MTE3 consumer step on the load path inside the depth-2 pipeline.
__aicore__ inline void CopyUbToPaged(AscendC::GlobalTensor<uint8_t> &dst,
    const int64_t dstByteOff, AscendC::LocalTensor<uint8_t> &src,
    const int64_t srcByteOff, const uint32_t copyBytes) const {
    const AscendC::DataCopyExtParams params{1u, copyBytes, 0u, 0u, 0u};
    AscendC::DataCopyPad(dst[dstByteOff], src[srcByteOff], params);
}

Collaborator

As discussed, these two can probably be merged, since the only difference seems to be the DataCopyPadExtParams argument. wdyt?

Contributor Author

We could, but it would probably hurt readability, because we would need a direction flag as a parameter instead of a self-explanatory function name. Probably the best option is to remove the two helpers altogether and inline DataCopyExtParams -> DataCopyPad directly at the call sites.

// Copy one paged-block part on the blockwise (hd<32) path using the depth-2 queue when bulk fits UB.
// Branch bulk: store EnQue Paged->UB then flush prev to Lmc, or load EnQue Lmc->UB then flush prev to
// Paged; branch !bulk: drain pending then copyBlockSetValue (no UB pipeline for this part).
__aicore__ inline void copyBlock(

Collaborator

The copyBlock function works, but readability suffers because Page2L and L2Page are merged into a single function and the if branches exit via return.

Perhaps it is better to split it into two helpers, like
copyBlockPageToLmc and copyBlockLmcToPage?

Contributor Author

What about blockwiseCopy? Should we separate that as well? Wdyt? Probably best to separate only copyBlock.

Collaborator

Yeah agreed. BlockWiseCopy seems fine to me tbh.


// Store one UB window: per block-part, pipeline Paged->UB (EnQue) with UB->Lmc (DeQue prev).
// Branch bulk: EnQue after CopyPagedToUb; branch invalid lead: drain pending then SetValue only.
__aicore__ inline void _page2LTransfer(__gm__ uint8_t *cacheTensor, __gm__ uint8_t *slotmappings,

Collaborator

nit: I think the inner function code of _page2L and L2Page could be merged, maybe via a templated class. This could be a separate PR as a later optimization.

Contributor Author

At the moment it mirrors the structure of multi_layer_mem_kernels_v2.cpp. It is probably best to make a common decision for both kernels so they stay easy to follow.

Comment thread README.md

Collaborator

We should probably remove the reference to the Dynamic Connector once support lands in the in-process one.

@marcobarlo

Contributor Author

Rebased on main and tested with gsm8k for vLLM v0.20.
