[Core][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4) by marcobarlo · Pull Request #246 · LMCache/LMCache-Ascend

marcobarlo · 2026-05-28T22:17:25Z

[Core] Support for hybrid KV cache groups with mixed type/compression (e.g., Deepseek v4 in vllm-ascend)

Summary

DeepSeek V4 on vLLM-Ascend (config) exposes 11 scheduler KV-cache groups per request (distinct block_ids, block_size, compress_ratio) and heterogeneous per-layer tensors (multi-plane tuples, shared int8 blobs, mixed dtypes).

Upstream LMCache#3171 added DSv4 on the MP path only. vLLM-Ascend uses the in-process connector; upstream RequestTracker keeps only block_ids[0] and a single slot_mapping, so DSv4 either produces wrong results or fails asserts.

Feature	Required?	Role
1. In-process multi-group support	Yes	All groups tracked; per-group slots; NPU grouping/dispatch; DSv4 layouts
2. Multi-plane copy bundling	No (perf)	Fewer NPU launches; default `LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=1`

Background

Layer type	`kv_caches[name]`	Scheduler groups
SWA / MTP	1× `bf16` view	1
Compress-4	7 views, shared `int8` blob	up to 7
Compress-128	4-tuple mixed dtypes	up to 4

DS4RandomQuarterLayers: 11 KVCacheGroupSpecs; new_request.block_ids is a tuple of 11 lists. Upstream MP (#3171) runs one CUDA launch per homogeneous group — it does not fuse multiple planes of one layer.

Feature 1 — In-process multi-group support (required)

Without this: groups 1..10 block tables are dropped, compressed slot_mapping lengths fail asserts, and KVLayerGroupsManager mis-groups blob / mixed-dtype layers. Exploded mode (LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=0) is the minimal correct fallback (MP-style, one transfer per homogeneous NPU group).

Workflow vs upstream

Step	Upstream in-process	This PR
New request	`allocated_block_ids` = group 0	`allocated_block_ids_by_group` (all 11)
ReqMeta	One `slot_mapping`	`slot_mappings_by_group[g]`; `primary_kv_group_idx`
`wait_for_save` / `start_load_kv`	One slot tensor → engine	Per-group CPU/NPU maps + `slot_mappings_npu_by_group`
`register_kv_caches`	Flat homogeneous layers	`build_flat_kv_caches` + `layout_hints`
Layer groups	`(kv_size, nh, hs, dtype)`	`build_kv_layer_groups` + `KVCacheFormat`; scheduler-slot split
Store/load kernels	One pointer table	`_multi_group_kv_transfer`; `scheduler_slot_group` maps NPU index → scheduler group
MemoryObj	Single `hidden` / chunk	`_patch_metadata_get_shapes` per NPU group

MP stages block IDs on the server; here the NPU connector builds per-group pointers and slot slices.
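The difference between tracking group 0 only and tracking every group can be sketched as follows. This is a minimal illustration, not the PR's code: `GroupAwareTracker` and its `update` method are hypothetical names; only the field name `allocated_block_ids_by_group` comes from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class GroupAwareTracker:
    """Hypothetical stand-in for a request tracker; not the PR's class."""

    # One list of block IDs per scheduler KV-cache group.
    allocated_block_ids_by_group: list[list[int]] = field(default_factory=list)

    def update(self, new_block_ids: tuple[list[int], ...]) -> None:
        # Grow the per-group table on first use, then append group by group.
        while len(self.allocated_block_ids_by_group) < len(new_block_ids):
            self.allocated_block_ids_by_group.append([])
        for g, ids in enumerate(new_block_ids):
            self.allocated_block_ids_by_group[g].extend(ids)


tracker = GroupAwareTracker()
tracker.update(([0, 1], [7], [3, 4, 5]))  # new request: one list per group
tracker.update(([2], [], [6]))            # later step: some groups grow, some do not
```

A tracker that stored only `new_block_ids[0]` would lose the second and third lists here, which is the failure mode described for groups 1..10.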

End-to-end path

register_kv_caches:   (vllm_v1_adapter) build_flat_kv_caches → layout_hints
                      → ensure_kv_layer_groups (per NPU group MemoryObj sizing)
build_connector_meta: (multi_group_vllm_adapter) RequestTracker + ReqMeta
wait_for_save:        (vllm_v1_adapter) slot_mappings_by_group → NPU → engine.store
                      → connector: multi_layer_kv_transfer[_multi_plane] per NPU group
start_load_kv:        symmetric retrieve (load_stream)

Chunk-hash / mask sizing uses the primary group’s slot_mapping; kernels use all groups.

Main changes

lmcache_ascend_connector_v1.py — entry (LMCacheAscendConnectorV1Dynamic, SupportsHMA) → LMCacheAscendConnectorV1Impl
multi_group_vllm_adapter.py — LMCacheConnectorV1ImplMultiGroup, RequestTracker / ReqMeta, slot builders, build_connector_meta
vllm_v1_adapter.py — register, wait_for_save, start_load_kv, request_finished*
multi_spec_flatten.py, kv_format.py, kv_layer_groups.py, npu_connectors.py, skip_state_groups.py
mem_kernels.cpp / utils.cpp — DSA-C8 sizing on homogeneous kernel path
lmcache_ascend/__init__.py — optional _patch_vllm_v1_adapter() shim into upstream lmcache.integration.vllm.vllm_v1_adapter on import lmcache_ascend (not the DSv4 connector entry path)

Concepts

Plane — one KV component of a layer (sub-tensor: own buffer or shared-allocation view). Bundled layers have several planes; each may map to a different scheduler group.
Flat entry — one slot in LMCache’s per-layer list. Bundling on: one tuple per multi-spec layer; off: one entry per plane.
Same kernel — iff registered and (a) planes of the same flat entry, or (b) different flat entries with the same shape key (kv_size, hidden, block_size, dtype_key, num_tensors) and scheduler group.
Scheduler KV-cache group — one vLLM KVCacheGroupSpec; slot_mapping[g] length ceil(T / compress_ratio[g]).
NPU layer group — flat indices sharing shape key and scheduler group after split → one kernel per chunk.
scheduler_slot_group — NPU group i uses slot_mappings_by_group[g] where g is that group’s scheduler index (not [i]).
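The `scheduler_slot_group` indirection in the last item can be illustrated with plain Python containers. The values below are made up for illustration, and `slots_for_npu_group` is a hypothetical helper; the sketch only shows that the NPU group index and the scheduler group index are different index spaces.

```python
# Hypothetical data: slot mappings keyed by scheduler group index g.
slot_mappings_by_group = {
    0: [10, 11, 12, 13],
    1: [40, 41],
    3: [90],
}

# Position i is the NPU layer group; the value is its scheduler group index g.
scheduler_slot_group = [3, 0]


def slots_for_npu_group(i: int) -> list[int]:
    """Return the slot mapping an NPU group should use (lookup by g, not by i)."""
    g = scheduler_slot_group[i]
    return slot_mappings_by_group[g]
```

Indexing `slot_mappings_by_group` directly with the NPU group index would return the wrong group's slots for NPU group 0 in this example.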

DSV4 mapping (bundled, state skipped):

Scheduler group	Role	NPU group	Bundled?
sg0, sg1, sg3	CR4 attn + SWA + indexer (+ scale)	g0 (4 planes)	yes — one flat entry per CR4 layer
sg1 @ layer 21	Standalone SWA	g1	no
sg2	SWA on even layers ≥22	g2	no
sg8	Compress-128 (SW=128)	g3	no
sg4–7, sg10	State specs	—	skipped at registration
sg9	CR128 SWA companion	—	slot mapping only; not in MemoryObj

Example (T=512): len(slot_mapping[g]) = ceil(512 / CR[g]); primary group = argmax_g(len(block_ids[g]) × block_size[g]).
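The two formulas in the example can be written out as a small sketch. The helper names are hypothetical and the compress ratios, block IDs, and block sizes are illustrative values, not the actual DSv4 configuration.

```python
import math


def slot_mapping_lengths(num_tokens: int, compress_ratios: list[int]) -> list[int]:
    """len(slot_mapping[g]) = ceil(T / CR[g]) for every scheduler group g."""
    return [math.ceil(num_tokens / cr) for cr in compress_ratios]


def primary_group(block_ids: tuple[list[int], ...], block_sizes: list[int]) -> int:
    """argmax over g of len(block_ids[g]) * block_size[g]; first group wins ties."""
    coverage = [len(ids) * bs for ids, bs in zip(block_ids, block_sizes)]
    return max(range(len(coverage)), key=coverage.__getitem__)


lengths = slot_mapping_lengths(512, [1, 4, 128])
primary = primary_group(([0, 1], [0, 1, 2, 3], [0]), [128, 128, 128])
```

With T=512 the three illustrative groups get 512, 128, and 4 slots, and the second group (index 1) covers the most tokens, so it would be chosen as primary.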

Architectural choices

Skip state groups (skip_state_groups.py, default on): DSV4 state groups (C4AttnKVState, C4IndexerKVState, C128AttnScoreState, …) hold recurrent/score buffers not needed for prefix reuse.

LMCACHE_ASCEND_SKIP_STATE_GROUPS=1 — filter before layer-group planning.
LMCACHE_ASCEND_SKIP_STATE_SPEC_ALLOWLIST — optional; default skips six *State* specs (sg4–7, sg10).
Bundled: remove skipped planes from tuples; drop empty flat entries.
Exploded: drop flat entries for skipped scheduler groups.
Indexer KV (C4Indexer, sg3) is not skipped.
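The bundled and exploded filtering rules above can be sketched on toy data. This is an assumption-laden illustration: planes are modelled as `(name, scheduler_group)` pairs, and `filter_bundled` / `filter_exploded` are hypothetical helpers, not functions from `skip_state_groups.py`.

```python
# Scheduler groups skipped by default according to the description (sg4-7, sg10).
SKIPPED_GROUPS = {4, 5, 6, 7, 10}


def filter_bundled(flat_entries):
    """Bundled mode: each flat entry is a tuple of (plane_name, scheduler_group)."""
    kept = []
    for entry in flat_entries:
        planes = tuple(p for p in entry if p[1] not in SKIPPED_GROUPS)
        if planes:  # drop flat entries that became empty
            kept.append(planes)
    return kept


def filter_exploded(flat_entries):
    """Exploded mode: each flat entry is a single (plane_name, scheduler_group)."""
    return [p for p in flat_entries if p[1] not in SKIPPED_GROUPS]
```

Note that a plane in sg3 (the indexer) survives both filters, matching the statement that indexer KV is not skipped.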

Sliding-window store (sg8, SW=128): only the last sliding_window tokens per LMCache chunk are persisted. Storing the full logical chunk over-allocated MemoryObj rows (~5.2× before fix).

Store mask (multi_group_vllm_adapter.py) — slot_mapping[g] = -1 for tokens outside chunk_end - sliding_window; kernels skip -1.
Physical chunk size (kv_layer_groups.py, npu_connectors.py, _patch_metadata_get_shapes) — physical_chunk_size = sliding_window // compress_ratio (SW=128, CR=128 → 1 row for sg8).
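A minimal sketch of the store mask and the physical chunk sizing follows. It assumes full chunks (consistent with discarding partial chunks) and a slot mapping expressed in the same units as `chunk_size`; both helpers are hypothetical and use plain lists instead of tensors.

```python
def mask_sliding_window(slot_mapping, chunk_size, sliding_window):
    """Set slots to -1 for tokens before chunk_end - sliding_window in each chunk."""
    masked = list(slot_mapping)
    for chunk_start in range(0, len(masked), chunk_size):
        chunk_end = min(chunk_start + chunk_size, len(masked))
        keep_from = max(chunk_start, chunk_end - sliding_window)
        for t in range(chunk_start, keep_from):
            masked[t] = -1  # kernels skip -1 entries
    return masked


def physical_chunk_size(sliding_window, compress_ratio):
    """Rows actually stored per chunk for a sliding-window group."""
    return sliding_window // compress_ratio
```

For example, with `chunk_size=4` and `sliding_window=2`, only the last two slots of each chunk stay valid; with SW=128 and CR=128 the physical chunk size is one row, as stated for sg8.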

Multi-group mode requires discard_partial_chunks=True globally.

Feature 2 — Multi-plane copy bundling (optional)

Feature 1 alone is sufficient for correctness via exploded mode: each plane is a flat layer; each homogeneous NPU group gets multi_layer_kv_transfer — same idea as MP’s multiple CUDA launches. Bundling is a performance choice on NPU (launch overhead on Atlas A2), not something upstream MP implements.

Aspect	Upstream in-process	Upstream MP (#3171)	This PR (bundled)
Multi-spec layer	Not supported	N registrations, N launches	One tuple (`MULTI_PLANE_KV` / `DSA_C8_KV`)
Heterogeneous `block_size`	N/A	Separate kernels	`multi_layer_kv_transfer_multi_plane`
Chunk layout	Standard	Per-group slots	Plane-major `uint8` rows
Slot map	Single tensor	Server-built	Concat + `multi_plane_slot_slice_bounds`

Files: multi_layer_kv_transfer_multi_plane, KVCacheFormat.MULTI_PLANE_KV / DSA_C8_KV, _invoke_multi_plane_kv_transfer. Launch count: 4 NPU groups vs exploded O(planes × layers).

Breakage without this PR:

RequestTracker / ReqMeta — single group; wrong retrieve assert.
KVLayerGroupsManager + gpu_connector/utils.py — no blob / multi-plane / per-plane block_size.
VLLMPagedMemGPUConnectorV2 — flat pointer table; mixed tuple lengths break.
LMCacheMetadata.get_shapes — cannot size per-group multi-plane row bytes.
multi_layer_kv_transfer — scalar block_size insufficient for bundled planes.
HMA request_finished — needs request_finished_all_groups for tuple block_ids.

Flags and tests

Flag	Effect
`LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=0`	Exploded planes; feature 2 off
`LMCACHE_ASCEND_FLATTEN_MULTI_SPEC=0`	Legacy single-group; DSv4 unsupported
`LMCACHE_ASCEND_SKIP_STATE_GROUPS=1` (default)	Skip state scheduler groups

Tests: test_ds4_kvcache_roundtrip.py, test_multi_group_vllm_adapter.py, test_npu_connector_multi_group_load.py, test_mem_kernels.py, test_kv_layer_groups_npu.py, test_per_group_memory_allocation.py. Single-group models unchanged.

Testing

Configuration:

Official vllm-ascend configuration for DeepseekV4, with the following additional flags:

    --no-disable-hybrid-kv-cache-manager \
    --kv-transfer-config '{"kv_connector":"LMCacheAscendConnectorV1Dynamic","kv_role":"kv_both", "kv_connector_module_path":"lmcache_ascend.integration.vllm.lmcache_ascend_connector_v1"}' \
    --no-enable-prefix-caching

LMCache configuration (tag v0.4.5)

export LMCACHE_TRACK_USAGE=false
export LMCACHE_MAX_LOCAL_CPU_SIZE="70"
export LMCACHE_LOCAL_CPU="True"
export LMCACHE_LOG_LEVEL=INFO
export LMCACHE_USE_LAYERWISE="False"
export LMCACHE_NUMA_MODE="auto"
export LMCACHE_CHUNK_SIZE=1024
export LMCACHE_EXTRA_CONFIG='{"save_only_first_rank": false}'

The multi-group connector defaults to discarding partial chunks.

Results GSM8K

GSM8K results, first run (no LMCache KV cache hit):

GSM8K results, second run (~30% of requests with a KV cache hit from LMCache):

Results vllm bench

vllm bench serve \
    --host "${HOST}" \
    --port "${PORT}" \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model "${MODEL_NAME}" \
    --tokenizer "${MODEL_PATH}" \
    --tokenizer-mode deepseek_v4 \
    --dataset-name prefix_repetition \
    --num-prompts 100 \
    --request-rate 0.5 \
    --prefix-repetition-num-prefixes 30 \
    --prefix-repetition-prefix-len 30000 \
    --prefix-repetition-suffix-len 10000 \
    --prefix-repetition-output-len 1000 \
    --ignore-eos \
    --temperature 0

Results no LMCache (no prefix matching vLLM):

Results LMCache (no prefix matching vLLM):

EDIT 1 - Upstream HMA (cec6092, #3419)

#3171 added DSv4 on MP CUDA; cec6092 generalises to HMA (LMCacheGroupView, group_layers_by_identity, Gemma-3 CI). Same problem class; this PR targets the in-process NPU path.

Concern	Upstream cec6092	This PR
Block IDs / slots	`expand_block_ids_to_views`	`allocated_block_ids_by_group`, `slot_mappings_by_group[g]`
Grouping	`group_layers_by_identity` (CUDA)	`build_kv_layer_groups` + `KVCacheFormat` + scheduler split
Physical chunk	`physical_chunk_size` + CR	+ `sliding_window_size_by_group`
Runtime	MP + CUDA	In-process NPU
Layouts / bundling	Homogeneous CUDA tensors	Tuple/blob/multi-plane; `multi_layer_kv_transfer_multi_plane`
State / SW	All groups; SW out of scope	Skip `State`; tail store + `SW // CR` sizing (~5.2× alloc fix)

Net-new: in-process connector stack (LMCacheAscendConnectorV1Dynamic → LMCacheAscendConnectorV1Impl → LMCacheConnectorV1ImplMultiGroup), Ascend format detection, multi-plane kernel, state skip, SW-aware allocation.

Integration: land on LMCache-Ascend for DSv4; on rebase ≥ cec6092 keep in-process adapter + Ascend grouping + skip/SW/bundling (upstream HMA does not replace them). Port engine-neutral SW helpers upstream when DSv4 MP is needed. Cherry-picking cec6092 alone does not fix vLLM-Ascend.

EDIT 2 -- Latest vLLM-Ascend DeepseekV4

vLLM-Ascend now has a newer version for DeepseekV4. It likely has a different KV cache structure than the v0.18 version tested for this PR. Additional tests are needed.

EDIT 3 -- Latest vLLM-Ascend DeepseekV4 is now supported (unstable)

The main differences between the two versions are as follows.
offload_v20_vs_v18_diff.md

Part of the multi-group/multi-plane bundling (feature 2) of this PR may not be needed to support v0.20, and that part can be stripped out (along with the new kernel). I suggest merging this PR into a branch of LMCache-Ascend to support DSv4 on v0.18, and then merging only the needed parts into the main branch to continue with upstream from v0.20 on. The multi-plane path and multi-plane kernel would only be useful to support DSA C8, which is not used in DSv4 anyway (it is used in DSv3.2 and GLM5 with enable_sparse_c8=True).

gemini-code-assist

Code Review

This pull request introduces multi-group KV cache support for LMCache-Ascend (DSv4 / HMA), including a new multi-plane transfer kernel, multi-spec flattening/bundling logic, and updates to the NPU connector and vLLM v1 adapter. The code review highlights several critical issues: a missing mapping for signed int8 tensors in get_dtype_from_torch, a lack of host-side validation for num_planes and block_sizes (which could lead to out-of-bounds access or division-by-zero), and performance bottlenecks from synchronous stream synchronization and redundant .pin_memory() calls on small slot mapping tensors. Additionally, the use of strict=True in zip() introduces Python 3.10+ compatibility issues, and a commented-out version check in tests/bootstrap.py disables dependency syncing.

marcobarlo · 2026-05-29T08:25:27Z

/gemini review

marcobarlo · 2026-05-29T08:26:10Z

@copilot review

gemini-code-assist

Code Review

This pull request introduces multi-group and multi-plane KV cache transfer support for LMCache-Ascend, adding new NPU kernels and multi-group adapter layers to handle heterogeneous block sizes and multi-plane layouts. The review feedback identifies several critical issues, including a potential runtime crash due to a strict alignment assertion, a performance bottleneck from synchronous device-to-host transfers in a loop, incorrect block ID calculations from hardcoded block sizes in failure recording, and a risk of retrieving incorrect memory addresses from tensor views using untyped_storage().data_ptr().

matthewygf · 2026-06-16T16:40:05Z

-class LMCacheAscendConnectorV1Dynamic(LMCacheConnectorV1Dynamic):
-    def __init__(self, vllm_config: "VllmConfig", role: KVConnectorRole) -> None:
-        super().__init__(vllm_config=vllm_config, role=role)
+class LMCacheAscendConnectorV1Dynamic(LMCacheConnectorV1Dynamic, SupportsHMA):


We will need to add 'SupportsHMA' to the entry connector from vllm-ascend.
Two options:

1. Make a PR in vllm-ascend so that the in-process connector has SupportsHMA.

2. Create a connector in LMCache-Ascend and patch it in during the import of lmcache-ascend.

wdyt @marcobarlo @chloroethylene
It's probably better to go with 1.

as discussed, let's go with 2) since in process mode is going to be deprecated.

matthewygf · 2026-06-17T09:17:09Z

+## for vLLM-MindSpore
+
+### Docker
+
+1. Clone LMCache-Ascend Repo
+Our repo contains a kvcache ops submodule for ease of maintenance, therefore we recommend cloning the repo with submodules.
+
+```bash
+cd /workspace
+git clone --recurse-submodules https://github.com/LMCache/LMCache-Ascend.git
+```
+
+2. Build Docker Image
+```bash
+cd /workspace/LMCache-Ascend
+docker build -f docker/mindspore/Dockerfile.a2.openEuler -t lmcache-ascend:v0.4.4-mindspore2.7.1.post1-openeuler .
+```
+
+3. Start Container
+Once that is built, run it with the following cmd
+```bash
+docker run -itd \
+    --shm-size 200g --privileged \
+    --net=host \
+    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
+    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
+    --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /var/log/npu/:/var/log/npu \
+    -v /usr/local/dcmi:/usr/local/dcmi \
+    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
+    -v /lib/modules:/lib/modules:ro \
+    -v /usr/src/kernels:/usr/src/kernels:ro \
+    -v /mnt/storage1/data:/data \
+    -v /home:/home \
+    --name lmcache-ascend-ms \
+    --entrypoint /bin/bash \
+    lmcache-ascend:v0.4.4-mindspore2.7.1.post1-openeuler


As discussed, change this back to v0.3.12 for actual tested ver.

matthewygf · 2026-06-17T09:19:07Z

+    --trust-remote-code \
+    --disable-log-requests \
+    --block-size 128 \
+    --kv-transfer-config '{"kv_connector":"LMCacheAscendConnectorV1Dynamic","kv_role":"kv_both", "kv_connector_module_path":"lmcache_ascend.integration.vllm.lmcache_ascend_connector_v1"}'


change this to use the in process connector in vllm-ascend

matthewygf · 2026-06-17T11:37:33Z

+    ratios = tuple(int(r) for r in compress_ratios)
+    if not ratios:
+        ratios = (1,) * len(slot_mappings_by_group)


Leftover from refactoring. compress_ratios function parameter is also unused. To be removed

matthewygf · 2026-06-17T13:53:23Z

+def flatten_multi_spec_enabled() -> bool:
+    val = os.environ.get("LMCACHE_ASCEND_FLATTEN_MULTI_SPEC", "1")
+    return val != "0"
+
+
+def bundle_multi_spec_enabled() -> bool:
+    val = os.environ.get("LMCACHE_ASCEND_BUNDLE_MULTI_SPEC", "1")
+    return val != "0"


We should probably move these into the config definitions, as done in __init__.py.

like:

    lmcache.v1.config._CONFIG_DEFINITIONS["ascend_bundle_multi_spec"] = {
        "type": bool,
        "default": True,
        "env_converter": _to_bool,
        "description": (
            "Whether LMCache-Ascend keeps multi-spec KV planes bundled for "
            "multi-plane NPU transfer. If False, multi-spec planes are exploded "
            "into synthetic .subN layers for legacy/fallback handling."
        ),
    }

matthewygf · 2026-06-17T16:18:14Z

+    if os.environ.get("LMCACHE_ASCEND_SKIP_STATE_GROUPS", "1") != "1":
+        return None


Similarly, it would be best to make these configurable via the config definitions in __init__.py.

matthewygf · 2026-06-18T16:23:18Z

+            if self.use_layerwise:
+                if idx == last_idx:
+                    sync = True
+                else:
+                    sync = False
+                # NOTE(Jiayi): Perform blending before layerwise prefix caching
+                if self.enable_blending:
+                    self.blender.blend(
+                        tokens[:lmcache_cached_tokens],
+                        token_mask[:lmcache_cached_tokens],
+                        kvcaches=kvcaches,
+                        slot_mapping=slot_mapping_npu[:lmcache_cached_tokens],
+                        vllm_cached_tokens=request.load_spec.vllm_cached_tokens,
+                    )
+                else:


Not sure whether we should enable blending here tbh for kvcache groups > 1

That code path is there since it was copy-pasted from super.start_load_kv() and extended for multi-group, but it is actually untested - and probably we do not want to test it as SP is going to be deprecated. Flagging for deletion.

matthewygf · 2026-06-19T09:24:37Z

+            indices = group.layer_indices
+            rep = kv_caches[indices[0]]


Is it always valid to use indices[0] here? It might be worth adding a short comment about this assumption.

Flagging to add the following comment:

    # ``build_kv_layer_groups`` buckets layers by layout identity
    # (kv_size, hidden, block_size, dtype, tensor count). Every member
    # of ``indices`` shares that key and ``group.shape_desc``, so format
    # detection and ``_derive_group_params`` on ``indices[0]`` apply to
    # the whole group; per-layer device pointers are collected below.

Does it clarify?

matthewygf · 2026-06-19T09:56:26Z

+    n = len(pb)
+    if kv_format == KVCacheFormat.DSA_C8_KV and n != 4:
+        raise ValueError(f"DSA-C8 expects 4 plane byte widths, got {n}")
+    if not bs_list:


nit: might be len(bs_list) == 0 or planes is not None?

Preferably len(bs_list) == 0, as planes is not None is already the gating condition for building the data structure above.

matthewygf · 2026-06-19T10:45:57Z

+            if is_310p():
+                block_size = int(k_cache.shape[-2])
+                page_buffer_size = int(k_cache.shape[0]) * block_size
+            else:
+                block_size = int(k_cache.shape[1])
+                page_buffer_size = int(k_cache.shape[0]) * block_size


nit: trivial, maybe worth taking the page_buffer_size out of the if branches
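The suggested refactor could look like the sketch below. `paging_geometry` is a hypothetical helper that takes a shape tuple and a flag instead of the tensor and the `is_310p()` call from the diff; only the axis choice for `block_size` differs between the two layouts.

```python
def paging_geometry(k_cache_shape, is_310p):
    """Return (block_size, page_buffer_size) for a paged KV tensor shape."""
    # Only the block_size axis depends on the layout.
    block_size = int(k_cache_shape[-2] if is_310p else k_cache_shape[1])
    # Shared by both branches, so computed once after the branch.
    page_buffer_size = int(k_cache_shape[0]) * block_size
    return block_size, page_buffer_size
```

The shapes used to exercise this sketch are illustrative only and do not claim to be the real 310P layout.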

matthewygf · 2026-06-19T12:53:25Z

+        if layer_name and layer_name in sched_groups:
+            sched_per_plane = list(sched_groups[layer_name])
+            if len(sched_per_plane) == 1:
+                sched_per_plane = sched_per_plane * 2


is it the kv lora and rope ?

No. MLA_KV format is generically detected any time the KV cache is a tuple with 2 tensors and the two tensors have different shape (k_cache.shape != v_cache.shape).
In the standard MLA path, this translates to lora and rope, but in that case we use the standard copy path and do not enter the multi-group path. Since this snippet is in _derive_group_params, we are in the heterogeneous/multi-group path: each plane is generically a buffer of bytes with its own geometry. As an example, in v0.20 the indexer is a tuple of int8 K with last dim 128 and FP16 scale with last dim 1 (copying the indexer might not be strictly necessary, though).

Probably a cleaner version of the snippet would be:

    n_planes = len(entry)
    if len(sched_per_plane) == 1 and n_planes > 1:
        sched_per_plane = sched_per_plane * n_planes

but at the moment there is a little difference because to detect MLA n_planes must be 2.
@matthewygf wdyt?

The cleaner version looks a lot better tbh !
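The cleaner version discussed in this thread can be wrapped in a standalone function to check its behaviour for different plane counts. `broadcast_sched_groups` is a hypothetical name; the body follows the snippet proposed above.

```python
def broadcast_sched_groups(sched_per_plane, entry):
    """Repeat a single scheduler group index once per plane of a flat entry."""
    n_planes = len(entry)
    if len(sched_per_plane) == 1 and n_planes > 1:
        sched_per_plane = sched_per_plane * n_planes
    return sched_per_plane
```

Unlike the hard-coded `* 2`, this handles entries with any number of planes, and it leaves per-plane lists that are already complete untouched.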

matthewygf · 2026-06-19T13:06:29Z

+            from lmcache_ascend.v1.npu_connector.utils import (
+                permute_kv_caches_to_contiguous,
+            )


Does this have to be lazily imported?

No need; we need to import it at least once anyway. Probably a leftover from refactoring. Flagging to move it to the top of the module.

matthewygf · 2026-06-19T13:16:26Z

+        # Third Party
+        from lmcache.v1.gpu_connector.utils import normalize_kv_and_discover_format
+        from lmcache.v1.kv_layer_groups import KVLayerGroupsManager
+        from lmcache.utils import EngineType
+
+        from lmcache_ascend.v1.kv_layer_groups import build_kv_layer_groups


These also don't need to be lazily imported, I think.

Likewise. Flag to move them as module-level imports.

matthewygf · 2026-06-19T13:51:13Z

+        while len(self._mp_launch_bufs) <= npu_group_idx:
+            self._mp_launch_bufs.append(None)


nit: it's better to switch to a compact form, maybe like

    missing = npu_group_idx + 1 - len(self._mp_launch_bufs)
    if missing > 0:
        self._mp_launch_bufs.extend([None] * missing)
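That the compact form behaves like the original loop can be checked with a small sketch. Both functions are hypothetical wrappers that work on a copy of a plain list rather than on `self._mp_launch_bufs`.

```python
def grow_with_loop(bufs, idx):
    """Original form: append None until index idx exists."""
    bufs = list(bufs)
    while len(bufs) <= idx:
        bufs.append(None)
    return bufs


def grow_with_extend(bufs, idx):
    """Compact form: compute the shortfall once and extend."""
    bufs = list(bufs)
    missing = idx + 1 - len(bufs)
    if missing > 0:
        bufs.extend([None] * missing)
    return bufs
```

Both leave the list unchanged when `idx` is already a valid index.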

matthewygf · 2026-06-19T14:19:04Z

+            cached = mp_launch_meta.get(key)
+            if cached is None:
+                return


This is very unclear. Does it mean that build_mp_launch_meta already determined there is no work for this key?

This line is a safety net: all NPU groups are iterated for each memObj, but some groups might have no work to do (e.g., all tokens stripped out or compressed, or other borderline cases). In these cases build_mp_launch_meta inserts no key in meta, and at invocation time we skip the kernel call. Probably worth adding a comment

    # Batch precompute omits keys with no valid slots (has_work=False);
    # same as the mp_launch_meta is None path: skip the kernel for this chunk/group.

and add a log level debug in such case (see next comment).
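A minimal sketch of the skip-with-debug-log pattern discussed here, assuming the launch metadata is a plain dict and the kernel call is passed in as a callable. `invoke_if_work` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("mp_launch_sketch")


def invoke_if_work(mp_launch_meta, key, launch):
    """Call launch(meta) only when precomputed metadata exists for key."""
    cached = mp_launch_meta.get(key)
    if cached is None:
        # The precompute step omitted this key: no valid slots for this chunk/group.
        logger.debug("No valid slots for %s; skipping kernel launch", key)
        return False
    launch(cached)
    return True
```

Returning a boolean is a choice made for this sketch so the skip is observable in tests; the code in the diff simply returns.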

matthewygf · 2026-06-19T14:19:20Z

+            if not has_work:
+                return


probably worth a logger ?

See comment above

matthewygf · 2026-06-19T14:32:15Z

+            cpu_ptrs = torch.empty(len(ptrs), dtype=torch.int64, device="cpu")
+            cpu_ptrs.numpy()[:] = ptrs
+            gpu_ptrs = torch.empty(
+                len(ptrs), dtype=torch.int64, device=self.kvcaches_device
+            )
+            gpu_ptrs.copy_(cpu_ptrs)


worth making these async ? were there any perf hits ?

It is a one-off cold path on the first copy. Probably not worth the hassle.

matthewygf · 2026-06-19T14:50:16Z

+            SimpleNamespace(
+                nb=int(k_cache.shape[0]),
+                bs=int(k_cache.shape[1]),
+                block_stride_elems=0,
+            ),


nit: while it is okay, I'm not a big fan of SimpleNamespace; it may be better to refactor the function to take nb, bs, and block_stride explicitly.

matthewygf · 2026-06-19T14:57:17Z

+        if (
+            filtered is None
+            or prefixes is None
+            or slot_mappings_by_group is None
+            or not starts
+        ):
+            return


Should we raise an error here? It's not clear to me when these conditions are expected to be true.

No. This is just a shortcut: normal non-multi-group transfers use the normal non-multi-plane copy kernel and skip this because it is not needed. Furthermore, this function is just an optimization. If the meta is needed by the multi-plane kernel, it is created just in time.

matthewygf · 2026-06-25T02:05:21Z

+            from lmcache_ascend.v1.kv_layer_groups import _lmc_chunk_hidden_bytes
+


let's move this to the module level import as well.

matthewygf · 2026-06-25T03:24:00Z

+def _plane_block_size(tensor: torch.Tensor) -> int:
+    # block_size lives at dim-1 for both 3-D (nb, bs, hidden) and
+    # 4-D (nb, bs, nh, hs) layouts; ndim < 3 has no paging geometry.
+    if tensor.ndim >= 3:
+        return int(tensor.shape[1])
+    raise ValueError(f"Unexpected KV plane ndim={tensor.ndim}")


nit: if we can somehow pass in a tensor format hint somewhere, this will be better , i.e. the vLLM hints etc wdyt @marcobarlo

The concept of cache plane lives only in LMCache, vLLM only exposes block_sizes_by_group (vLLM scheduler group), which may or may not differ from LMCache NPU groups. For this reason, it is non-trivial to pass through such a hint. The most straightforward way is to keep it like this and make it more robust (e.g., is_310p formats and so on).

matthewygf · 2026-06-25T04:44:28Z

+    Multiple dtype views alias one allocation; only the primary view (largest byte
+    coverage) carries the canonical paging geometry.
+    """
+    del vllm_block_size


what is this del vllm_block_size ??

Leftover from refactor. Flag to remove and remove from function signature.

matthewygf · 2026-06-25T06:00:09Z

+    *,
+    is_310p: bool = False,
+    vllm_block_size: Optional[int] = None,
+) -> tuple[int, int, int, Any, int]:


nit: the tuple[int, int, int, Any, int] is really hard to understand :/ , probably better to have a named type

Agreed. Flag to use named tuple.
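One way the named type could look is sketched below. The field names are an assumption borrowed from the shape key listed under Concepts (kv_size, hidden, block_size, dtype_key, num_tensors); the function's actual return fields may differ, and `GroupShapeKey` is a hypothetical name.

```python
from typing import Any, NamedTuple


class GroupShapeKey(NamedTuple):
    """Hypothetical named replacement for tuple[int, int, int, Any, int]."""

    kv_size: int
    hidden: int
    block_size: int
    dtype_key: Any
    num_tensors: int


key = GroupShapeKey(kv_size=2, hidden=512, block_size=128, dtype_key="bf16", num_tensors=1)
kv_size, hidden, block_size, dtype_key, num_tensors = key  # positional unpacking still works
```

Because a `NamedTuple` is still a tuple, existing call sites that unpack or compare the result positionally would keep working.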

matthewygf · 2026-06-25T06:15:29Z

+    def _get_first_layer_index(key):
+        return groups_dict[key][0]
+
+    sorted_keys = sorted(groups_dict.keys(), key=_get_first_layer_index)


maybe:

    # Sort groups by first layer index to maintain deterministic flat-kv order.
    sorted_keys = sorted(groups_dict, key=lambda key: groups_dict[key][0])
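The lambda form can be tried on toy data. The keys and layer indices below are made up for illustration; only the `sorted(...)` expression comes from the suggestion.

```python
# Hypothetical grouping: shape key -> list of flat layer indices in that group.
groups_dict = {
    ("int8", 4): [5, 9],
    ("bf16", 1): [0, 3],
    ("bf16", 2): [2],
}

# Sort groups by first layer index to maintain deterministic flat-kv order.
sorted_keys = sorted(groups_dict, key=lambda key: groups_dict[key][0])
```

Iterating a dict yields its keys, so `sorted(groups_dict, ...)` is equivalent to `sorted(groups_dict.keys(), ...)`.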

matthewygf · 2026-06-25T08:27:45Z

+    __aicore__ inline void CopyPagedToUb(AscendC::LocalTensor<uint8_t> &dst,
+        const int64_t dstByteOff, AscendC::GlobalTensor<uint8_t> &src,
+        const int64_t srcByteOff, const uint32_t copyBytes) const {
+        const AscendC::DataCopyExtParams params{1u, copyBytes, 0u, 0u, 0u};
+        const AscendC::DataCopyPadExtParams<uint8_t> pad{false, 0u, 0u, 0u};
+        AscendC::DataCopyPad(dst[dstByteOff], src[srcByteOff], params, pad);
+    }
+
+    // Copy one contiguous byte span from UB at srcByteOff into Paged GM via DataCopyPad.
+    // No branches; used as the MTE3 consumer step on the load path inside the depth-2 pipeline.
+    __aicore__ inline void CopyUbToPaged(AscendC::GlobalTensor<uint8_t> &dst,
+        const int64_t dstByteOff, AscendC::LocalTensor<uint8_t> &src,
+        const int64_t srcByteOff, const uint32_t copyBytes) const {
+        const AscendC::DataCopyExtParams params{1u, copyBytes, 0u, 0u, 0u};
+        AscendC::DataCopyPad(dst[dstByteOff], src[srcByteOff], params);
+    }


as discussed, these two probably can be merged... since the diff seems to be the PadExtParams. wdyt

We could, but it would probably hurt readability because we would need a direction flag as a parameter instead of a self-explanatory function name. Probably the best option is to remove the two helpers altogether and inline DataCopyExtParams -> DataCopyPad directly.

matthewygf · 2026-06-25T12:22:21Z

+    // Copy one paged-block part on the blockwise (hd<32) path using the depth-2 queue when bulk fits UB.
+    // Branch bulk: store EnQue Paged->UB then flush prev to Lmc, or load EnQue Lmc->UB then flush prev to
+    // Paged; branch !bulk: drain pending then copyBlockSetValue (no UB pipeline for this part).
+    __aicore__ inline void copyBlock(


The copyBlock function works, but readability suffers because Page2L and L2Page are merged into a single function and the if branches exit via return.

Perhaps it is better to separate it into two helpers like
copyBlockPageToLmc and copyBlockLmcToPage?

What about blockwiseCopy? Should we separate that as well? Wdyt? Probably best to separate only copyBlock.

Yeah agreed. BlockWiseCopy seems fine to me tbh.

matthewygf · 2026-06-25T12:35:02Z

+
+    // Store one UB window: per block-part, pipeline Paged->UB (EnQue) with UB->Lmc (DeQue prev).
+    // Branch bulk: EnQue after CopyPagedToUb; branch invalid lead: drain pending then SetValue only.
+    __aicore__ inline void _page2LTransfer(__gm__ uint8_t *cacheTensor, __gm__ uint8_t *slotmappings,


nit: the inner function code could be merged I think between _page2L and L2Page, maybe via a templated class. could be a separate PR for later opt

At the moment it mirrors the structure of multi_layer_mem_kernels_v2.cpp. Probably best to take a common decision for both kernels so they are easier to follow.

matthewygf · 2026-06-25T12:37:45Z

We should probably remove the reference to the Dynamic Connector once support lands in the in-process one.


marcobarlo · 2026-06-25T16:15:00Z

Rebased on main and tested with gsm8k for vLLM v0.20.

gemini-code-assist Bot reviewed May 28, 2026

marcobarlo force-pushed the dsv4_support branch 3 times, most recently from 9caec65 to 9f1780b Compare May 29, 2026 08:17

gemini-code-assist Bot reviewed May 29, 2026

matthewygf reviewed May 29, 2026

marcobarlo force-pushed the dsv4_support branch from b8f8f36 to cfb2ea8 Compare June 5, 2026 12:55

marcobarlo changed the title ~~deepseek v4 support~~ [Feat][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4) Jun 9, 2026

marcobarlo force-pushed the dsv4_support branch from a7d67b0 to 531c8c1 Compare June 9, 2026 16:19

marcobarlo changed the title ~~[Feat][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4)~~ [Core][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4) Jun 10, 2026

marcobarlo force-pushed the dsv4_support branch 3 times, most recently from 889d5ab to cc22178 Compare June 12, 2026 17:58

marcobarlo marked this pull request as ready for review June 12, 2026 18:00

marcobarlo requested a review from chloroethylene as a code owner June 12, 2026 18:00

marcobarlo force-pushed the dsv4_support branch from cc22178 to ec04fbb Compare June 13, 2026 00:54

matthewygf requested changes Jun 17, 2026

matthewygf reviewed Jun 18, 2026

matthewygf reviewed Jun 19, 2026

matthewygf requested changes Jun 19, 2026

matthewygf reviewed Jun 25, 2026

matthewygf requested changes Jun 25, 2026

marcobarlo and others added 12 commits June 25, 2026 23:26

deepseek v4 support

c028741

Co-authored-by: shengyan-chen <syan0o0@outlook.com>

addressed AI reviewer comments, added tests to cover

337ad29

Fix sliding window support and added skip state Spec

f3ae861

Performance improvement through caching tensors

09faffa

Fix memory obj overallocation

a414411

refactor to reduce patch lines

caefbaf

1. update doc 2. remove unnecessary patch 3. torch.cuda -> torch.npu(use torch_dev dynamic load)

1376e0f

Refactoring to reduce patch size and test fixing

13586ad

fix DSAC8 to single kernel

fd9f03a

Refactor

1ab66bd

Refactor

ab0d164

dsv4 v020 support

f36539e

marcobarlo force-pushed the dsv4_support branch from ec04fbb to f36539e Compare June 25, 2026 16:13

marcobarlo added 2 commits June 26, 2026 22:45

turn dynamic connector into in process connector

9a9f6fd

address comments and linitng

166b9b5

matthewygf mentioned this pull request Jun 27, 2026

[RFC] DeepSeekV4 Support #228

Open

fix for disk backend

5f68ed1
