[Bug] Incorrect token generation after prefix cache hit (UCMBlendConnector, possible KV cache misalignment) #867

### Your current environment

basic:
  time:
    2026-03-25 06:46:30
  hostname:
    spark-2652
  os:
    Linux-6.17.0-1008-nvidia-aarch64-with-glibc2.39
  python:
    3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
gpu:
  type:
    NVIDIA
  detail:
    - NVIDIA GB10, [N/A]
  driver:
    580.126.09
torch:
  version:
    2.9.0a0+50eac811a6.nv25.09
  cuda:
    13.0
  cuda_available:
    True
  gpu_count:
    1
  gpu_name:
    NVIDIA GB10
vllm:
  version:
    0.11.1rc2.dev170+g9fce7bee7.d20251024
lmcache:
  Not installed
ucm:
  installed:
    True
cuda_env:
  CUDA_VISIBLE_DEVICES:
    
  CUDA_HOME:
    /usr/local/cuda
  LD_LIBRARY_PATH:
    /usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64


### 🐛 Describe the bug

When using vLLM with UCMBlendConnector, prefix cache hits lead to incorrect token generation.

Reproduction: send the same request twice.

First request and response:
curl http://localhost:7800/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "/mnt/models/Qwen3-32B-AWQ/",
    "messages": [
      {
        "role": "system",
        "content": "你是一个简洁、专业的中文助手"
      },
      {
        "role": "user",
        "content": "你好Passage 1: 人工智能是计算机科学的一个分支。Passage 2: 机器学习是人工智能的核心技术。Passage 3: 深度学习是机器学习的一个分支。\n\nQuestion: 什么是人工智能？\nAnswer:"
      }
    ],
    "max_tokens": 50,
    "temperature": 0
  }'
{"id":"chatcmpl-d36e43b2c441411787caf70ded97591c","object":"chat.completion","created":1774412774,"model":"/mnt/models/Qwen3-32B-AWQ/","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的，用户问的是“什么是人工智能？”，我需要根据提供的三个段落来回答。首先，Passage 1说人工智能是计算机科学的一个分支。Passage 2提到机器学习是人工智能的核心技术，而Pass","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":121,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Second request (identical body) and response:
curl http://localhost:7800/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "/mnt/models/Qwen3-32B-AWQ/",
    "messages": [
      {
        "role": "system",
        "content": "你是一个简洁、专业的中文助手"
      },
      {
        "role": "user",
        "content": "你好Passage 1: 人工智能是计算机科学的一个分支。Passage 2: 机器学习是人工智能的核心技术。Passage 3: 深度学习是机器学习的一个分支。\n\nQuestion: 什么是人工智能？\nAnswer:"
      }
    ],
    "max_tokens": 50,
    "temperature": 0
  }'
{"id":"chatcmpl-9a4d8385c2694241a35a29043d9e90d8","object":"chat.completion","created":1774412785,"model":"/mnt/models/Qwen3-32B-AWQ/","choices":[{"index":0,"message":{"role":"assistant","content":"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":121,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

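For anyone who prefers a script to pasting curl twice, the reproduction can be sketched in Python using only the standard library. The endpoint, model path, and request body are copied from the curl commands in this report; the function names (`build_payload`, `send`, `run_repro`) and the timeout value are illustrative choices, not part of vLLM or UCM.

```python
import json
import urllib.request

# Endpoint taken from the curl commands in this report.
URL = "http://localhost:7800/v1/chat/completions"


def build_payload() -> dict:
    """Return the request body used in both curl calls."""
    return {
        "model": "/mnt/models/Qwen3-32B-AWQ/",
        "messages": [
            {"role": "system", "content": "你是一个简洁、专业的中文助手"},
            {
                "role": "user",
                "content": (
                    "你好Passage 1: 人工智能是计算机科学的一个分支。"
                    "Passage 2: 机器学习是人工智能的核心技术。"
                    "Passage 3: 深度学习是机器学习的一个分支。"
                    "\n\nQuestion: 什么是人工智能？\nAnswer:"
                ),
            },
        ],
        "max_tokens": 50,
        "temperature": 0,
    }


def send(payload: dict, url: str = URL, timeout: float = 120.0) -> str:
    """POST the payload and return the assistant message content."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def run_repro(url: str = URL) -> tuple[str, str]:
    """Send the identical request twice; the second call should hit the cache."""
    payload = build_payload()
    first = send(payload, url)
    second = send(payload, url)
    return first, second
```

Calling `run_repro()` against a running server returns both completions, so they can be printed side by side or compared directly.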

Relevant logs:
26-03-25 04:26:25.087499][UC][I] request_id: chatcmpl-9a4d8385c2694241a35a29043d9e90d8, total_blocks_num: 2, req_stage: BlendStage.BUILD_PREFIX_CACHE, first chunk prefix hit: 2, chunks cache total hit: 0,  [870,870][blend_connector.py:376,get_num_new_matched_tokens]
(APIServer pid=806) INFO:     127.0.0.1:51206 - "POST /v1/chat/completions HTTP/1.1" 200 OK

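The log line for the second request reports "first chunk prefix hit: 2" with "total_blocks_num: 2" in stage BlendStage.BUILD_PREFIX_CACHE, so the prefix cache is being used. The output generated after that hit is invalid: it degenerates into a run of repeated "a" tokens.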
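When scanning a longer log for affected requests, it can help to pull the hit counters out of each connector line. The following is a minimal sketch whose regular expression is written against the single log line quoted above; the field layout of other UCM log lines is an assumption, and `parse_blend_log` is a hypothetical helper name.

```python
import re

# Pattern derived from the one log line quoted in this report (assumed stable).
LOG_PATTERN = re.compile(
    r"request_id: (?P<request_id>[\w-]+), "
    r"total_blocks_num: (?P<total_blocks>\d+), "
    r"req_stage: (?P<stage>[\w.]+), "
    r"first chunk prefix hit: (?P<prefix_hit>\d+), "
    r"chunks cache total hit: (?P<chunk_hit>\d+)"
)


def parse_blend_log(line: str):
    """Return the cache-hit fields of a connector log line, or None if absent."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric counters to integers for easier comparison.
    for key in ("total_blocks", "prefix_hit", "chunk_hit"):
        fields[key] = int(fields[key])
    return fields
```

A request whose `prefix_hit` equals `total_blocks` is one where every full block was served from the prefix cache, which is the situation in which the bad output was observed.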

Possible causes:
- KV cache offset misalignment after prefix reuse
- Incorrect token position when entering the decode stage
- Prefix cache stitching issue in BlendConnector
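To make the first two hypotheses concrete, here is the arithmetic a correct block-aligned reuse would be expected to follow. This is a sketch, not UCM or vLLM code. The block size is not stated in this report: any value from 24 to 35 yields two full blocks for a 71-token prompt, and 32 is used below purely as an illustrative assumption. `plan_after_prefix_hit` is a hypothetical name.

```python
def plan_after_prefix_hit(prompt_tokens: int, hit_blocks: int, block_size: int):
    """Describe which positions are cached and which still need prefill.

    Returns (cached_tokens, positions_to_prefill, first_decode_position).
    """
    cached_tokens = hit_blocks * block_size
    if cached_tokens > prompt_tokens:
        raise ValueError("cache hit covers more tokens than the prompt contains")
    # Tokens past the last cached block must be prefilled at their ORIGINAL
    # positions; restarting them at position 0 would be a misalignment.
    positions_to_prefill = list(range(cached_tokens, prompt_tokens))
    # The first generated token sits right after the full prompt.
    first_decode_position = prompt_tokens
    return cached_tokens, positions_to_prefill, first_decode_position


# Values from the report (71 prompt tokens, 2 blocks hit); block_size=32 is assumed.
cached, to_prefill, decode_pos = plan_after_prefix_hit(71, 2, 32)
```

Under that assumption, 64 tokens come from the cache, positions 64 through 70 are prefilled, and decoding starts at position 71. If the connector loaded the cached KV into different slots, or the model ran the remaining tokens at other positions, attention would read mismatched keys and values, which is one plausible route to degenerate output.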

Additional observations:
- Disabling the prefix cache resolves the issue.
- The issue only occurs when the cache is hit.
- The prompt mixes Chinese and English, which may affect tokenization, although both requests report the same prompt_tokens count (71).

Expected behavior:
Prefix cache hits should not affect output correctness. With temperature 0, repeated identical requests should produce the same output, or at worst outputs that differ only slightly.
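This expectation could be turned into a small regression check over two parsed response bodies. The sketch below assumes the response shape shown in this report (`choices[0].message.content`); the helper names and the repeated-token heuristic are illustrative and not part of any existing test suite.

```python
def content_of(response: dict) -> str:
    """Extract the assistant text from a chat completion response body."""
    return response["choices"][0]["message"]["content"]


def is_degenerate(text: str, min_tokens: int = 10) -> bool:
    """True when the text is a single whitespace-separated token repeated."""
    tokens = text.split()
    return len(tokens) >= min_tokens and len(set(tokens)) == 1


def check_pair(first: dict, second: dict) -> list:
    """Return a list of problems found when comparing two identical requests."""
    problems = []
    first_text, second_text = content_of(first), content_of(second)
    if is_degenerate(second_text):
        problems.append("second response is a single repeated token")
    if first_text != second_text:
        problems.append("responses differ despite temperature 0")
    return problems
```

An empty list means the cached run matched the uncached one. For the two responses pasted in this report, both problems would be flagged.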
