[Core][SP] Support for multiple and hybrid KV cache groups with mixed type/compression (e.g. Deepseek v4)#246
marcobarlo wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces multi-group KV cache support for LMCache-Ascend (DSv4 / HMA), including a new multi-plane transfer kernel, multi-spec flattening/bundling logic, and updates to the NPU connector and vLLM v1 adapter. The code review highlights several critical issues: a missing mapping for signed int8 tensors in get_dtype_from_torch, a lack of host-side validation for num_planes and block_sizes (which could lead to out-of-bounds access or division-by-zero), and performance bottlenecks from synchronous stream synchronization and redundant .pin_memory() calls on small slot mapping tensors. Additionally, the use of strict=True in zip() introduces Python 3.10+ compatibility issues, and a commented-out version check in tests/bootstrap.py disables dependency syncing.
9caec65 to
9f1780b
Compare
|
/gemini review |
|
@copilot review |
There was a problem hiding this comment.
Code Review
This pull request introduces multi-group and multi-plane KV cache transfer support for LMCache-Ascend, adding new NPU kernels and multi-group adapter layers to handle heterogeneous block sizes and multi-plane layouts. The review feedback identifies several critical issues, including a potential runtime crash due to a strict alignment assertion, a performance bottleneck from synchronous device-to-host transfers in a loop, incorrect block ID calculations from hardcoded block sizes in failure recording, and a risk of retrieving incorrect memory addresses from tensor views using untyped_storage().data_ptr().
889d5ab to
cc22178
Compare
| class LMCacheAscendConnectorV1Dynamic(LMCacheConnectorV1Dynamic): | ||
| def __init__(self, vllm_config: "VllmConfig", role: KVConnectorRole) -> None: | ||
| super().__init__(vllm_config=vllm_config, role=role) | ||
| class LMCacheAscendConnectorV1Dynamic(LMCacheConnectorV1Dynamic, SupportsHMA): |
There was a problem hiding this comment.
We will need to add 'SupportsHMA' to the entry connector from vllm-ascend.
Two options:
- make a PR in vllm-ascend for the in-process connector to have SupportsHMA
- create a connector in LMCache-Ascend and patch it in during import lmcache_ascend.
wdyt @marcobarlo @chloroethylene
It's probably better to go with 1.
There was a problem hiding this comment.
As discussed, let's go with 2), since in-process mode is going to be deprecated.
| ## for vLLM-MindSpore | ||
|
|
||
| ### Docker | ||
|
|
||
| 1. Clone LMCache-Ascend Repo | ||
| Our repo contains a kvcache ops submodule for ease of maintenance, therefore we recommend cloning the repo with submodules. | ||
|
|
||
| ```bash | ||
| cd /workspace | ||
| git clone --recurse-submodules https://github.com/LMCache/LMCache-Ascend.git | ||
| ``` | ||
|
|
||
| 2. Build Docker Image | ||
| ```bash | ||
| cd /workspace/LMCache-Ascend | ||
| docker build -f docker/mindspore/Dockerfile.a2.openEuler -t lmcache-ascend:v0.4.4-mindspore2.7.1.post1-openeuler . | ||
| ``` | ||
|
|
||
| 3. Start Container | ||
| Once that is built, run it with the following cmd | ||
| ```bash | ||
| docker run -itd \ | ||
| --shm-size 200g --privileged \ | ||
| --net=host \ | ||
| --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \ | ||
| --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \ | ||
| --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \ | ||
| -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ | ||
| -v /var/log/npu/:/var/log/npu \ | ||
| -v /usr/local/dcmi:/usr/local/dcmi \ | ||
| -v /etc/ascend_install.info:/etc/ascend_install.info \ | ||
| -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ | ||
| -v /sys/fs/cgroup:/sys/fs/cgroup:ro \ | ||
| -v /lib/modules:/lib/modules:ro \ | ||
| -v /usr/src/kernels:/usr/src/kernels:ro \ | ||
| -v /mnt/storage1/data:/data \ | ||
| -v /home:/home \ | ||
| --name lmcache-ascend-ms \ | ||
| --entrypoint /bin/bash \ | ||
| lmcache-ascend:v0.4.4-mindspore2.7.1.post1-openeuler |
There was a problem hiding this comment.
As discussed, change this back to v0.3.12, the version that was actually tested.
| --trust-remote-code \ | ||
| --disable-log-requests \ | ||
| --block-size 128 \ | ||
| --kv-transfer-config '{"kv_connector":"LMCacheAscendConnectorV1Dynamic","kv_role":"kv_both", "kv_connector_module_path":"lmcache_ascend.integration.vllm.lmcache_ascend_connector_v1"}' |
There was a problem hiding this comment.
change this to use the in process connector in vllm-ascend
| ratios = tuple(int(r) for r in compress_ratios) | ||
| if not ratios: | ||
| ratios = (1,) * len(slot_mappings_by_group) |
There was a problem hiding this comment.
Leftover from refactoring. The compress_ratios function parameter is also unused. To be removed.
| def flatten_multi_spec_enabled() -> bool: | ||
| val = os.environ.get("LMCACHE_ASCEND_FLATTEN_MULTI_SPEC", "1") | ||
| return val != "0" | ||
|
|
||
|
|
||
| def bundle_multi_spec_enabled() -> bool: | ||
| val = os.environ.get("LMCACHE_ASCEND_BUNDLE_MULTI_SPEC", "1") | ||
| return val != "0" |
There was a problem hiding this comment.
We should probably move these into the config definitions, as done in __init__.py, like:
lmcache.v1.config._CONFIG_DEFINITIONS["ascend_bundle_multi_spec"] = {
    "type": bool,
    "default": True,
    "env_converter": _to_bool,
    "description": (
        "Whether LMCache-Ascend keeps multi-spec KV planes bundled for "
        "multi-plane NPU transfer. If False, multi-spec planes are exploded "
        "into synthetic .subN layers for legacy/fallback handling."
    ),
}
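A minimal sketch of how such a definition could replace the ad hoc os.environ reads. It uses a stand-in dictionary rather than the real lmcache.v1.config._CONFIG_DEFINITIONS; the helper names to_bool and resolve, and the LMCACHE_ + upper-cased key naming for the environment variable, are assumptions made for illustration only.

```python
import os


def to_bool(value) -> bool:
    """Hypothetical converter: treat common truthy strings as True."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


# Stand-in registry for illustration; NOT the real LMCache config object.
CONFIG_DEFINITIONS = {
    "ascend_bundle_multi_spec": {
        "type": bool,
        "default": True,
        "env_converter": to_bool,
    },
    "ascend_flatten_multi_spec": {
        "type": bool,
        "default": True,
        "env_converter": to_bool,
    },
}


def resolve(name: str, environ=None):
    """Return the configured value: env override if present, else the default."""
    environ = os.environ if environ is None else environ
    spec = CONFIG_DEFINITIONS[name]
    raw = environ.get("LMCACHE_" + name.upper())  # assumed naming convention
    if raw is None:
        return spec["default"]
    return spec["env_converter"](raw)
```

Note that this converter is stricter than the current `val != "0"` check: a value such as "false" resolves to False here, whereas the existing helpers treat anything other than "0" as enabled.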
| if os.environ.get("LMCACHE_ASCEND_SKIP_STATE_GROUPS", "1") != "1": | ||
| return None |
There was a problem hiding this comment.
Similarly, it would be best to make these configurable in the config definitions in __init__.py.
| if self.use_layerwise: | ||
| if idx == last_idx: | ||
| sync = True | ||
| else: | ||
| sync = False | ||
| # NOTE(Jiayi): Perform blending before layerwise prefix caching | ||
| if self.enable_blending: | ||
| self.blender.blend( | ||
| tokens[:lmcache_cached_tokens], | ||
| token_mask[:lmcache_cached_tokens], | ||
| kvcaches=kvcaches, | ||
| slot_mapping=slot_mapping_npu[:lmcache_cached_tokens], | ||
| vllm_cached_tokens=request.load_spec.vllm_cached_tokens, | ||
| ) | ||
| else: |
There was a problem hiding this comment.
Not sure whether we should enable blending here tbh for kvcache groups > 1
There was a problem hiding this comment.
That code path is there since it was copy-pasted from super.start_load_kv() and extended for multi-group, but it is actually untested - and probably we do not want to test it as SP is going to be deprecated. Flagging for deletion.
| indices = group.layer_indices | ||
| rep = kv_caches[indices[0]] |
There was a problem hiding this comment.
Is it always valid to use indices[0] here? It might be worth adding a short comment stating this assumption.
There was a problem hiding this comment.
Flagging to add the following comment:
# ``build_kv_layer_groups`` buckets layers by layout identity
# (kv_size, hidden, block_size, dtype, tensor count). Every member
# of ``indices`` shares that key and ``group.shape_desc``, so format
# detection and ``_derive_group_params`` on ``indices[0]`` apply to
# the whole group; per-layer device pointers are collected below.
Does it clarify?
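To make the assumption concrete, here is a small sketch of bucketing layers by layout identity. LayoutKey and bucket_layers are hypothetical names, not the actual build_kv_layer_groups implementation; the point is only that every index in a bucket shares the key, so inspecting the first member is enough.

```python
from collections import defaultdict
from typing import NamedTuple


class LayoutKey(NamedTuple):
    """Hypothetical layout identity; fields follow the comment proposed above."""
    kv_size: int
    hidden: int
    block_size: int
    dtype: str
    num_tensors: int


def bucket_layers(layout_keys):
    """Group layer indices by identical layout key, ordered by first layer index."""
    buckets = defaultdict(list)
    for idx, key in enumerate(layout_keys):
        buckets[key].append(idx)
    # Sorting by the first member keeps the group order deterministic.
    return sorted(buckets.values(), key=lambda indices: indices[0])


# Illustrative values only.
keys = [
    LayoutKey(2, 512, 128, "bf16", 2),
    LayoutKey(1, 576, 128, "int8", 1),
    LayoutKey(2, 512, 128, "bf16", 2),
]
groups = bucket_layers(keys)  # layers 0 and 2 share a layout; layer 1 is alone
```

Because membership is defined by key equality, keys[indices[0]] describes every layer in the group, which is the property the representative lookup relies on.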
| n = len(pb) | ||
| if kv_format == KVCacheFormat.DSA_C8_KV and n != 4: | ||
| raise ValueError(f"DSA-C8 expects 4 plane byte widths, got {n}") | ||
| if not bs_list: |
There was a problem hiding this comment.
nit: maybe len(bs_list) == 0, or planes is not None?
There was a problem hiding this comment.
Preferably len(bs_list) == 0, since planes is not None is already the gating condition for building the data structure above.
| if is_310p(): | ||
| block_size = int(k_cache.shape[-2]) | ||
| page_buffer_size = int(k_cache.shape[0]) * block_size | ||
| else: | ||
| block_size = int(k_cache.shape[1]) | ||
| page_buffer_size = int(k_cache.shape[0]) * block_size |
There was a problem hiding this comment.
nit: trivial, maybe worth taking the page_buffer_size out of the if branches
| if layer_name and layer_name in sched_groups: | ||
| sched_per_plane = list(sched_groups[layer_name]) | ||
| if len(sched_per_plane) == 1: | ||
| sched_per_plane = sched_per_plane * 2 |
There was a problem hiding this comment.
is it the kv lora and rope ?
There was a problem hiding this comment.
No. The MLA_KV format is detected generically any time the KV cache is a tuple of 2 tensors with different shapes (k_cache.shape != v_cache.shape).
In the standard MLA path this translates to lora and rope, but in that case we use the standard copy path and do not enter the multi-group path. Since this snippet is in _derive_group_params, we are in the heterogeneous/multi-group path: each plane is generically a buffer of bytes with its own geometry. As an example, in v0.20 the indexer is a tuple of an int8 K tensor with last dim 128 and an FP16 scale with last dim 1 (copying the indexer might not be strictly necessary, though).
A cleaner version of the snippet would probably be:
n_planes = len(entry)
if len(sched_per_plane) == 1 and n_planes > 1:
    sched_per_plane = sched_per_plane * n_planes
but at the moment there is a small difference, because n_planes must be 2 for MLA to be detected.
@matthewygf wdyt?
There was a problem hiding this comment.
The cleaner version looks a lot better tbh !
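A self-contained sketch of the cleaner version, wrapped in a helper so the behaviour is easy to check. The function name broadcast_per_plane and the final length check are additions for illustration, not part of the PR.

```python
def broadcast_per_plane(sched_per_plane, n_planes):
    """Repeat a single scheduler block size across all planes of an entry.

    If the scheduler reports one value but the entry has several planes,
    the value is broadcast; otherwise the list is returned unchanged.
    """
    values = list(sched_per_plane)
    if len(values) == 1 and n_planes > 1:
        values = values * n_planes
    # Extra guard (an assumption, not in the PR): fail loudly on a mismatch.
    if len(values) != n_planes:
        raise ValueError(
            f"expected {n_planes} per-plane block sizes, got {len(values)}"
        )
    return values
```

For example, broadcast_per_plane([128], 2) gives one block size per plane, while a list that already has one entry per plane passes through untouched.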
| from lmcache_ascend.v1.npu_connector.utils import ( | ||
| permute_kv_caches_to_contiguous, | ||
| ) |
There was a problem hiding this comment.
Does this actually have to be lazily imported?
There was a problem hiding this comment.
No need, we anyway need to import at least once. Probably leftover from refactoring. Flag to move above.
| # Third Party | ||
| from lmcache.v1.gpu_connector.utils import normalize_kv_and_discover_format | ||
| from lmcache.v1.kv_layer_groups import KVLayerGroupsManager | ||
| from lmcache.utils import EngineType | ||
|
|
||
| from lmcache_ascend.v1.kv_layer_groups import build_kv_layer_groups |
There was a problem hiding this comment.
These also don't need to be lazily imported, I think.
There was a problem hiding this comment.
Likewise. Flag to move them as module-level imports.
| while len(self._mp_launch_bufs) <= npu_group_idx: | ||
| self._mp_launch_bufs.append(None) |
There was a problem hiding this comment.
nit: it's better to switch to a compact form, maybe like
missing = npu_group_idx + 1 - len(self._mp_launch_bufs)
if missing > 0:
    self._mp_launch_bufs.extend([None] * missing)
| cached = mp_launch_meta.get(key) | ||
| if cached is None: | ||
| return |
There was a problem hiding this comment.
This is very unclear. Does it mean that build_mp_launch_meta already determined there is no work for this key?
There was a problem hiding this comment.
This line is something of a safety net: all NPU groups are iterated for each memObj, but some groups might have no work to do (e.g., all tokens stripped out or compressed, or other borderline cases). In these cases build_mp_launch_meta inserts no key in meta, and at invocation time we skip the kernel call. Probably worth adding a comment
# Batch precompute omits keys with no valid slots (has_work=False);
# same as mp_launch_meta is None path: skip the kernel for this chunk/group.
and a debug-level log in such cases (see next comment).
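A minimal sketch of the skip-with-debug-log pattern described above. launch_if_work, the logger name, and the launch callback are placeholders for illustration; the real call site invokes the multi-plane kernel instead.

```python
import logging

logger = logging.getLogger("mp_launch_sketch")  # hypothetical logger name


def launch_if_work(mp_launch_meta, key, launch):
    """Invoke `launch` only when precomputed metadata exists for `key`.

    Returns True if the launch happened, False if the group was skipped.
    """
    cached = mp_launch_meta.get(key)
    if cached is None:
        # Precompute omitted this key because the chunk/group has no valid slots.
        logger.debug("No multi-plane work for key %s; skipping kernel launch", key)
        return False
    launch(cached)
    return True
```

Returning a boolean is a design choice of this sketch; it makes the skipped case observable in tests without inspecting log output.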
| if not has_work: | ||
| return |
There was a problem hiding this comment.
probably worth a logger ?
There was a problem hiding this comment.
See comment above
| cpu_ptrs = torch.empty(len(ptrs), dtype=torch.int64, device="cpu") | ||
| cpu_ptrs.numpy()[:] = ptrs | ||
| gpu_ptrs = torch.empty( | ||
| len(ptrs), dtype=torch.int64, device=self.kvcaches_device | ||
| ) | ||
| gpu_ptrs.copy_(cpu_ptrs) |
There was a problem hiding this comment.
worth making these async ? were there any perf hits ?
There was a problem hiding this comment.
It is a one-off cold path on the first copy. Probably not worth the hassle.
| SimpleNamespace( | ||
| nb=int(k_cache.shape[0]), | ||
| bs=int(k_cache.shape[1]), | ||
| block_stride_elems=0, | ||
| ), |
There was a problem hiding this comment.
nit: while it is okay, I'm not a big fan of SimpleNamespace; it may be better to refactor the function to take nb, bs and block_stride explicitly.
| if ( | ||
| filtered is None | ||
| or prefixes is None | ||
| or slot_mappings_by_group is None | ||
| or not starts | ||
| ): | ||
| return |
There was a problem hiding this comment.
Should we raise an error here? It's not clear to me when these conditions are expected to hold.
There was a problem hiding this comment.
No. This is just a shortcut for normal non-multi-group transfers that use the normal non-multi-plane copy kernel to skip this as not needed. Furthermore, this function is just an optimization. In case the meta is needed by the multi-plane kernel, it is created just in time.
| from lmcache_ascend.v1.kv_layer_groups import _lmc_chunk_hidden_bytes | ||
|
|
There was a problem hiding this comment.
let's move this to the module level import as well.
| def _plane_block_size(tensor: torch.Tensor) -> int: | ||
| # block_size lives at dim-1 for both 3-D (nb, bs, hidden) and | ||
| # 4-D (nb, bs, nh, hs) layouts; ndim < 3 has no paging geometry. | ||
| if tensor.ndim >= 3: | ||
| return int(tensor.shape[1]) | ||
| raise ValueError(f"Unexpected KV plane ndim={tensor.ndim}") |
There was a problem hiding this comment.
nit: if we can somehow pass in a tensor format hint somewhere, this will be better , i.e. the vLLM hints etc wdyt @marcobarlo
There was a problem hiding this comment.
The concept of cache plane lives only in LMCache, vLLM only exposes block_sizes_by_group (vLLM scheduler group), which may or may not differ from LMCache NPU groups. For this reason, it is non-trivial to pass through such a hint. The most straightforward way is to keep it like this and make it more robust (e.g., is_310p formats and so on).
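One possible direction for the "more robust" variant, sketched on plain shape tuples so it runs without torch. It assumes the 310P layout keeps block_size at dim -2, as in the is_310p() branch shown in an earlier snippet of this review; whether that holds for every plane format is untested here.

```python
def plane_block_size(shape, is_310p=False):
    """Return the paging block size for a KV plane with the given shape.

    Assumptions (not confirmed for all formats):
    - default layouts are (nb, bs, hidden) or (nb, bs, nh, hs): block size at dim 1;
    - 310P layouts keep the block size at dim -2.
    """
    if len(shape) < 3:
        raise ValueError(f"Unexpected KV plane ndim={len(shape)}")
    return int(shape[-2]) if is_310p else int(shape[1])
```

With a real tensor the call would be plane_block_size(tuple(tensor.shape), is_310p=...), so the hint comes from the platform check rather than from vLLM.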
| Multiple dtype views alias one allocation; only the primary view (largest byte | ||
| coverage) carries the canonical paging geometry. | ||
| """ | ||
| del vllm_block_size |
There was a problem hiding this comment.
what is this del vllm_block_size ??
There was a problem hiding this comment.
Leftover from refactor. Flag to remove and remove from function signature.
| *, | ||
| is_310p: bool = False, | ||
| vllm_block_size: Optional[int] = None, | ||
| ) -> tuple[int, int, int, Any, int]: |
There was a problem hiding this comment.
nit: the tuple[int, int, int, Any, int] is really hard to understand :/ , probably better to have a named type
There was a problem hiding this comment.
Agreed. Flag to use named tuple.
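A sketch of what the named return type could look like. The field names below are guesses made for illustration; the PR does not state what each position of tuple[int, int, int, Any, int] means.

```python
from typing import Any, NamedTuple


class GroupParams(NamedTuple):
    """Hypothetical named replacement for tuple[int, int, int, Any, int]."""
    num_blocks: int
    block_size: int
    hidden_bytes: int
    dtype: Any
    num_planes: int


def derive_params_example() -> GroupParams:
    # Illustrative values only.
    return GroupParams(
        num_blocks=1024,
        block_size=128,
        hidden_bytes=1152,
        dtype="bf16",
        num_planes=2,
    )


params = derive_params_example()
nb, bs, *_ = params  # still unpacks like a plain tuple
```

A NamedTuple keeps positional unpacking working at existing call sites while letting new code read params.block_size instead of an index.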
| def _get_first_layer_index(key): | ||
| return groups_dict[key][0] | ||
|
|
||
| sorted_keys = sorted(groups_dict.keys(), key=_get_first_layer_index) |
There was a problem hiding this comment.
maybe:
# Sort groups by first layer index to maintain deterministic flat-kv order.
sorted_keys = sorted(groups_dict, key=lambda key: groups_dict[key][0])
| __aicore__ inline void CopyPagedToUb(AscendC::LocalTensor<uint8_t> &dst, | ||
| const int64_t dstByteOff, AscendC::GlobalTensor<uint8_t> &src, | ||
| const int64_t srcByteOff, const uint32_t copyBytes) const { | ||
| const AscendC::DataCopyExtParams params{1u, copyBytes, 0u, 0u, 0u}; | ||
| const AscendC::DataCopyPadExtParams<uint8_t> pad{false, 0u, 0u, 0u}; | ||
| AscendC::DataCopyPad(dst[dstByteOff], src[srcByteOff], params, pad); | ||
| } | ||
|
|
||
| // Copy one contiguous byte span from UB at srcByteOff into Paged GM via DataCopyPad. | ||
| // No branches; used as the MTE3 consumer step on the load path inside the depth-2 pipeline. | ||
| __aicore__ inline void CopyUbToPaged(AscendC::GlobalTensor<uint8_t> &dst, | ||
| const int64_t dstByteOff, AscendC::LocalTensor<uint8_t> &src, | ||
| const int64_t srcByteOff, const uint32_t copyBytes) const { | ||
| const AscendC::DataCopyExtParams params{1u, copyBytes, 0u, 0u, 0u}; | ||
| AscendC::DataCopyPad(dst[dstByteOff], src[srcByteOff], params); | ||
| } |
There was a problem hiding this comment.
as discussed, these two probably can be merged... since the diff seems to be the PadExtParams. wdyt
There was a problem hiding this comment.
We could, but it would probably hurt readability, because we would need a direction flag as a parameter instead of a self-explaining function name. Probably best to simply remove the two helpers altogether and inline DataCopyExtParams -> DataCopyPad directly.
| // Copy one paged-block part on the blockwise (hd<32) path using the depth-2 queue when bulk fits UB. | ||
| // Branch bulk: store EnQue Paged->UB then flush prev to Lmc, or load EnQue Lmc->UB then flush prev to | ||
| // Paged; branch !bulk: drain pending then copyBlockSetValue (no UB pipeline for this part). | ||
| __aicore__ inline void copyBlock( |
There was a problem hiding this comment.
The copyBlock function works, but readability would improve if it were split: Page2L and L2Page are currently merged into a single function and the if branches exit via return.
Perhaps it is better to separate into two helpers like
copyBlockPageToLmc and copyBlockLmcToPage ?
There was a problem hiding this comment.
What about blockwiseCopy? Should we separate that as well? Wdyt? Probably best to separate only copyBlock.
There was a problem hiding this comment.
Yeah agreed. BlockWiseCopy seems fine to me tbh.
|
|
||
| // Store one UB window: per block-part, pipeline Paged->UB (EnQue) with UB->Lmc (DeQue prev). | ||
| // Branch bulk: EnQue after CopyPagedToUb; branch invalid lead: drain pending then SetValue only. | ||
| __aicore__ inline void _page2LTransfer(__gm__ uint8_t *cacheTensor, __gm__ uint8_t *slotmappings, |
There was a problem hiding this comment.
nit: the inner function code could be merged I think between _page2L and L2Page, maybe via a templated class. could be a separate PR for later opt
There was a problem hiding this comment.
At the moment it mirrors the structure of multi_layer_mem_kernels_v2.cpp. Probably best to take a common decision for both kernels so they are easier to follow.
There was a problem hiding this comment.
Probably should remove ref to the Dynamic Connector once we made support in the in process one.
Co-authored-by: shengyan-chen <syan0o0@outlook.com>
use torch_dev dynamic load)
|
Rebased on main and tested with gsm8k for vLLM v0.20. |
[Core] Support for hybrid KV cache groups with mixed type/compression (e.g., Deepseek v4 in vllm-ascend)
Summary
DeepSeek V4 on vLLM-Ascend (config) exposes 11 scheduler KV-cache groups per request (distinct block_ids, block_size, compress_ratio) and heterogeneous per-layer tensors (multi-plane tuples, shared int8 blobs, mixed dtypes).
Upstream LMCache#3171 added DSv4 on the MP path only. vLLM-Ascend uses the in-process connector; upstream RequestTracker keeps block_ids[0] and one slot_mapping, so DSv4 is wrong or fails asserts.
LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=1
Background
kv_caches[name] bf16 view int8 blob
DS4RandomQuarterLayers: 11 KVCacheGroupSpecs; new_request.block_ids is a tuple of 11 lists. Upstream MP (#3171) runs one CUDA launch per homogeneous group; it does not fuse multiple planes of one layer.
Feature 1: In-process multi-group support (required)
Without this: groups 1..10 block tables are dropped, compressed slot_mapping lengths fail asserts, and KVLayerGroupsManager mis-groups blob / mixed-dtype layers. Exploded mode (LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=0) is the minimal correct fallback (MP-style, one transfer per homogeneous NPU group).
Workflow vs upstream
allocated_block_ids = group 0
allocated_block_ids_by_group (all 11)
slot_mapping
slot_mappings_by_group[g]; primary_kv_group_idx
wait_for_save / start_load_kv
slot_mappings_npu_by_group
register_kv_caches
build_flat_kv_caches + layout_hints (kv_size, nh, hs, dtype)
build_kv_layer_groups + KVCacheFormat; scheduler-slot split
_multi_group_kv_transfer; scheduler_slot_group maps NPU index → scheduler group
hidden / chunk
_patch_metadata_get_shapes per NPU group
MP stages block IDs on the server; here the NPU connector builds per-group pointers and slot slices.
End-to-end path
Chunk-hash / mask sizing uses the primary group's slot_mapping; kernels use all groups.
Main changes
- lmcache_ascend_connector_v1.py: entry (LMCacheAscendConnectorV1Dynamic, SupportsHMA) → LMCacheAscendConnectorV1Impl
- multi_group_vllm_adapter.py: LMCacheConnectorV1ImplMultiGroup, RequestTracker / ReqMeta, slot builders, build_connector_meta
- vllm_v1_adapter.py: register, wait_for_save, start_load_kv, request_finished*
- multi_spec_flatten.py, kv_format.py, kv_layer_groups.py, npu_connectors.py, skip_state_groups.py
- mem_kernels.cpp / utils.cpp: DSA-C8 sizing on homogeneous kernel path
- lmcache_ascend/__init__.py: optional _patch_vllm_v1_adapter() shim into upstream lmcache.integration.vllm.vllm_v1_adapter on import lmcache_ascend (not the DSv4 connector entry path)
Concepts
(kv_size, hidden, block_size, dtype_key, num_tensors) and scheduler group.
KVCacheGroupSpec; slot_mapping[g] length ceil(T / compress_ratio[g]).
scheduler_slot_group: NPU group i uses slot_mappings_by_group[g] where g is that group's scheduler index (not [i]).
DSV4 mapping (bundled, state skipped):
Example (T=512):
len(slot_mapping[g]) = ceil(512 / CR[g]); primary group = argmax_g(len(block_ids[g]) × block_size[g]).
Architectural choices
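The sizing rule from the T=512 example just before this heading can be sketched as follows. The helper names and the sample compress ratios (1, 4, 128) and block tables are illustrative assumptions, not the actual DSv4 group configuration.

```python
import math


def slot_mapping_lengths(num_tokens, compress_ratios):
    """len(slot_mapping[g]) = ceil(T / CR[g]) for each scheduler group g."""
    return [math.ceil(num_tokens / cr) for cr in compress_ratios]


def primary_group(block_ids_by_group, block_sizes):
    """Index of the group covering the most tokens: argmax_g(len(block_ids[g]) * block_size[g]).

    On ties, the lowest group index wins (behaviour of max over a range).
    """
    coverage = [len(ids) * bs for ids, bs in zip(block_ids_by_group, block_sizes)]
    return max(range(len(coverage)), key=coverage.__getitem__)


# Illustrative inputs only.
lengths = slot_mapping_lengths(512, [1, 4, 128])
primary = primary_group([[0, 1, 2, 3], [7], [9]], [128, 128, 128])
```

With these inputs the uncompressed group needs 512 slots, the CR=4 group 128, and the CR=128 group 4, and group 0 is selected as primary because its block table covers the most tokens.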
Skip state groups (skip_state_groups.py, default on): DSV4 state groups (C4AttnKVState, C4IndexerKVState, C128AttnScoreState, …) hold recurrent/score buffers not needed for prefix reuse.
- LMCACHE_ASCEND_SKIP_STATE_GROUPS=1: filter before layer-group planning.
- LMCACHE_ASCEND_SKIP_STATE_SPEC_ALLOWLIST: optional; default skips six *State* specs (sg4–7, sg10).
- C4Indexer, sg3) is not skipped.
Sliding-window store (sg8, SW=128): only the last sliding_window tokens per LMCache chunk are persisted. Storing the full logical chunk over-allocated MemoryObj rows (~5.2× before fix).
- multi_group_vllm_adapter.py): slot_mapping[g] = -1 for tokens outside chunk_end - sliding_window; kernels skip -1.
- kv_layer_groups.py, npu_connectors.py, _patch_metadata_get_shapes): physical_chunk_size = sliding_window // compress_ratio (SW=128, CR=128 → 1 row for sg8).
Multi-group mode requires discard_partial_chunks=True globally.
Feature 2: Multi-plane copy bundling (optional)
Feature 1 alone is sufficient for correctness via exploded mode: each plane is a flat layer; each homogeneous NPU group gets multi_layer_kv_transfer, the same idea as MP's multiple CUDA launches. Bundling is a performance choice on NPU (launch overhead on Atlas A2), not something upstream MP implements.
MULTI_PLANE_KV / DSA_C8_KV)
block_size
multi_layer_kv_transfer_multi_plane
uint8 rows
multi_plane_slot_slice_bounds
Files: multi_layer_kv_transfer_multi_plane, KVCacheFormat.MULTI_PLANE_KV / DSA_C8_KV, _invoke_multi_plane_kv_transfer. Launch count: 4 NPU groups vs exploded O(planes × layers).
Breakage without this PR:
- RequestTracker / ReqMeta: single group; wrong retrieve assert.
- KVLayerGroupsManager + gpu_connector/utils.py: no blob / multi-plane / per-plane block_size.
- VLLMPagedMemGPUConnectorV2: flat pointer table; mixed tuple lengths break.
- LMCacheMetadata.get_shapes: cannot size per-group multi-plane row bytes.
- multi_layer_kv_transfer: scalar block_size insufficient for bundled planes.
- request_finished: needs request_finished_all_groups for tuple block_ids.
Flags and tests
LMCACHE_ASCEND_BUNDLE_MULTI_SPEC=0
LMCACHE_ASCEND_FLATTEN_MULTI_SPEC=0
LMCACHE_ASCEND_SKIP_STATE_GROUPS=1 (default)
Tests: test_ds4_kvcache_roundtrip.py, test_multi_group_vllm_adapter.py, test_npu_connector_multi_group_load.py, test_mem_kernels.py, test_kv_layer_groups_npu.py, test_per_group_memory_allocation.py. Single-group models unchanged.
Testing
Configuration:
Official vllm-ascend configuration for DeepseekV4 and in addition:
LMCache configuration (tag v0.4.5)
The multi-group connector defaults to discarding partial chunks.
Results GSM8K
GSM8k results: first run, no LMCache KV cache hit,

GSM8k results: second run, ~30% of requests with a KV cache hit from LMCache.
Results vllm bench
Results no LMCache (no prefix matching vLLM):

Results LMCache (no prefix matching vLLM):

EDIT 1 - Upstream HMA (cec6092, #3419)
#3171 added DSv4 on MP CUDA; cec6092 generalises to HMA (LMCacheGroupView, group_layers_by_identity, Gemma-3 CI). Same problem class; this PR targets the in-process NPU path.
expand_block_ids_to_views
allocated_block_ids_by_group, slot_mappings_by_group[g]
group_layers_by_identity (CUDA)
build_kv_layer_groups + KVCacheFormat + scheduler split
physical_chunk_size + CR
sliding_window_size_by_group
multi_layer_kv_transfer_multi_plane
*State*; tail store + SW // CR sizing (~5.2× alloc fix)
Net-new: in-process connector stack (LMCacheAscendConnectorV1Dynamic → LMCacheAscendConnectorV1Impl → LMCacheConnectorV1ImplMultiGroup), Ascend format detection, multi-plane kernel, state skip, SW-aware allocation.
Integration: land on LMCache-Ascend for DSv4; on rebase ≥ cec6092 keep in-process adapter + Ascend grouping + skip/SW/bundling (upstream HMA does not replace them). Port engine-neutral SW helpers upstream when DSv4 MP is needed. Cherry-picking cec6092 alone does not fix vLLM-Ascend.
EDIT 2 -- Latest vLLM-Ascend DeepseekV4
vLLM-Ascend now has the latest version for DeepseekV4. This likely has a different KV cache structure than the v0.18 tested for this PR. Additional tests are needed.
EDIT 3 -- Latest vLLM-Ascend DeepseekV4 is now supported (unstable)
The main differences between the two versions are as follows.
offload_v20_vs_v18_diff.md
Part of the multi-group/multi-plane bundling (feature 2) of this PR may not be needed to support v0.20, and part of it can be stripped out (along with the new kernel). I suggest merging this PR into a branch of LMCache-Ascend to support DSv4 in v0.18, and then merging only the needed parts into the main branch to continue with upstream from v0.20 on. The multi-plane path and multi-plane kernel would only be useful to support DSA C8, which is in any case not used in DSv4 (it is used in DSv3.2 and GLM5 with enable_sparse_c8=True).