
[Bug] Incorrect token generation after prefix cache hit (UCMBlendConnector, possible KV cache misalignment) #867

@coder-yangshuai

Description


Your current environment

basic:
  time: 2026-03-25 06:46:30
  hostname: spark-2652
  os: Linux-6.17.0-1008-nvidia-aarch64-with-glibc2.39
  python: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
gpu:
  type: NVIDIA
  detail:
    - NVIDIA GB10, [N/A]
  driver: 580.126.09
torch:
  version: 2.9.0a0+50eac811a6.nv25.09
  cuda: 13.0
  cuda_available: True
  gpu_count: 1
  gpu_name: NVIDIA GB10
vllm:
  version: 0.11.1rc2.dev170+g9fce7bee7.d20251024
lmcache: Not installed
ucm:
  installed: True
cuda_env:
  CUDA_VISIBLE_DEVICES:
  CUDA_HOME: /usr/local/cuda
  LD_LIBRARY_PATH: /usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

🐛 Describe the bug

When using vLLM with UCMBlendConnector, a prefix cache hit leads to incorrect token generation: the first request produces a coherent answer, while an identical second request, served from the cache, produces degenerate output.

Reproduction:

  • Send the same request twice.

    First request:

    curl http://localhost:7800/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "/mnt/models/Qwen3-32B-AWQ/",
      "messages": [
        {
          "role": "system",
          "content": "你是一个简洁、专业的中文助手"
        },
        {
          "role": "user",
          "content": "你好Passage 1: 人工智能是计算机科学的一个分支。Passage 2: 机器学习是人工智能的核心技术。Passage 3: 深度学习是机器学习的一个分支。\n\nQuestion: 什么是人工智能?\nAnswer:"
        }
      ],
      "max_tokens": 50,
      "temperature": 0
    }'

    Response (correct):

    {"id":"chatcmpl-d36e43b2c441411787caf70ded97591c","object":"chat.completion","created":1774412774,"model":"/mnt/models/Qwen3-32B-AWQ/","choices":[{"index":0,"message":{"role":"assistant","content":"\n好的,用户问的是“什么是人工智能?”,我需要根据提供的三个段落来回答。首先,Passage 1说人工智能是计算机科学的一个分支。Passage 2提到机器学习是人工智能的核心技术,而Pass","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":121,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
    Second request (identical payload):

    curl http://localhost:7800/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "/mnt/models/Qwen3-32B-AWQ/",
      "messages": [
        {
          "role": "system",
          "content": "你是一个简洁、专业的中文助手"
        },
        {
          "role": "user",
          "content": "你好Passage 1: 人工智能是计算机科学的一个分支。Passage 2: 机器学习是人工智能的核心技术。Passage 3: 深度学习是机器学习的一个分支。\n\nQuestion: 什么是人工智能?\nAnswer:"
        }
      ],
      "max_tokens": 50,
      "temperature": 0
    }'

    Response (incorrect, degenerate output):

    {"id":"chatcmpl-9a4d8385c2694241a35a29043d9e90d8","object":"chat.completion","created":1774412785,"model":"/mnt/models/Qwen3-32B-AWQ/","choices":[{"index":0,"message":{"role":"assistant","content":"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":121,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
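The two curl calls above can be scripted so the regression is checked mechanically rather than by eyeballing JSON. The sketch below uses only the endpoint, model path, and payload from this report; the `looks_degenerate` helper is a hypothetical heuristic (not part of vLLM or UCM) that flags outputs collapsing into one repeated token:

```python
import json
import urllib.request
from collections import Counter

# Endpoint and model path as given in the report.
URL = "http://localhost:7800/v1/chat/completions"
PAYLOAD = {
    "model": "/mnt/models/Qwen3-32B-AWQ/",
    "messages": [
        {"role": "system", "content": "你是一个简洁、专业的中文助手"},
        {"role": "user", "content": "你好Passage 1: 人工智能是计算机科学的一个分支。"
                                    "Passage 2: 机器学习是人工智能的核心技术。"
                                    "Passage 3: 深度学习是机器学习的一个分支。"
                                    "\n\nQuestion: 什么是人工智能?\nAnswer:"},
    ],
    "max_tokens": 50,
    "temperature": 0,
}

def chat(payload: dict) -> str:
    """POST the request and return the assistant message content."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def looks_degenerate(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when one token dominates the output (e.g. 'a a a a ...')."""
    words = text.split()
    if not words:
        return True
    _, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) >= threshold

if __name__ == "__main__":
    first = chat(PAYLOAD)   # cold request: populates the prefix cache
    second = chat(PAYLOAD)  # warm request: should match at temperature 0
    print("identical:", first == second)
    print("second degenerate:", looks_degenerate(second))
```

With temperature 0 the script should print `identical: True`; on the buggy setup the second response trips the degeneracy heuristic instead.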

Relevant logs:
26-03-25 04:26:25.087499][UC][I] request_id: chatcmpl-9a4d8385c2694241a35a29043d9e90d8, total_blocks_num: 2, req_stage: BlendStage.BUILD_PREFIX_CACHE, first chunk prefix hit: 2, chunks cache total hit: 0, [870,870][blend_connector.py:376,get_num_new_matched_tokens]
(APIServer pid=806) INFO: 127.0.0.1:51206 - "POST /v1/chat/completions HTTP/1.1" 200 OK

The log confirms the prefix cache is being hit (first chunk prefix hit: 2), yet the generated output is invalid.
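The log's total_blocks_num: 2 for a 71-token prompt (per the usage field above) already bounds the connector's chunk size. A quick sanity check, assuming the usual ceil-division convention for chunk counts (the bound is an inference from this report's numbers, not from UCM documentation):

```python
import math

prompt_tokens = 71  # usage.prompt_tokens from the responses above
total_blocks = 2    # total_blocks_num from the UCM log line

# Chunk sizes consistent with ceil(prompt_tokens / size) == total_blocks.
candidates = [s for s in range(1, 257)
              if math.ceil(prompt_tokens / s) == total_blocks]
print(candidates[0], candidates[-1])  # smallest and largest consistent sizes
```

Any chunk size between 36 and 70 tokens is consistent with the log, which matters when checking whether the second chunk's KV entries are stitched in at the right offset.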

Possible causes:

  • KV cache offset misalignment after prefix reuse
  • Incorrect token positions when entering the decode stage
  • Prefix cache stitching issue in BlendConnector
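To make the first hypothesis concrete, here is a toy sketch (not the UCM or vLLM code; the block size of 16 and the function name are invented) of how over-counting matched blocks by one shifts every remaining position id and silently skips real prompt tokens:

```python
BLOCK_SIZE = 16  # hypothetical; KV cache block sizes are configurable

def positions_after_prefix_hit(prompt_len: int, matched_blocks: int) -> list[int]:
    """Position ids for the tokens still to be prefilled after a prefix hit."""
    num_cached = matched_blocks * BLOCK_SIZE
    return list(range(num_cached, prompt_len))

prompt_len = 71  # prompt_tokens from the report's usage field

correct = positions_after_prefix_hit(prompt_len, matched_blocks=2)
skewed = positions_after_prefix_hit(prompt_len, matched_blocks=3)  # one block over-matched

# Over-counting by one block drops 16 real prompt tokens from prefill and
# makes decode attend to cached KV entries at the wrong positions.
print(correct[0], skewed[0])  # 32 48
```

A mismatch of this kind would explain why the model decodes from a misaligned context and emits degenerate repetitions, and why the bug only appears on cache hits.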

Additional observations:

  • Disabling prefix cache resolves the issue
  • Issue only occurs when cache is hit
  • Mixed Chinese + English prompt may affect tokenization

Expected behavior:
Prefix cache hits must not affect output correctness. With temperature 0, repeated identical requests should produce identical outputs.
