Your current environment
basic:
time:
2026-03-25 06:46:30
hostname:
spark-2652
os:
Linux-6.17.0-1008-nvidia-aarch64-with-glibc2.39
python:
3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
gpu:
type:
NVIDIA
detail:
- NVIDIA GB10, [N/A]
driver:
580.126.09
torch:
version:
2.9.0a0+50eac811a6.nv25.09
cuda:
13.0
cuda_available:
True
gpu_count:
1
gpu_name:
NVIDIA GB10
vllm:
version:
0.11.1rc2.dev170+g9fce7bee7.d20251024
lmcache:
Not installed
ucm:
installed:
True
cuda_env:
CUDA_VISIBLE_DEVICES:
CUDA_HOME:
/usr/local/cuda
LD_LIBRARY_PATH:
/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
🐛 Describe the bug
When using vLLM with UCMBlendConnector, prefix cache hits lead to incorrect token generation.
Reproduction:
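- Send the same request twice with temperature 0. The first response is coherent; the second is a run of repeated "a" tokens.
First request and response: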
curl http://localhost:7800/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/mnt/models/Qwen3-32B-AWQ/",
"messages": [
{
"role": "system",
"content": "你是一个简洁、专业的中文助手"
},
{
"role": "user",
"content": "你好Passage 1: 人工智能是计算机科学的一个分支。Passage 2: 机器学习是人工智能的核心技术。Passage 3: 深度学习是机器学习的一个分支。\n\nQuestion: 什么是人工智能?\nAnswer:"
}
],
"max_tokens": 50,
"temperature": 0
}'
{"id":"chatcmpl-d36e43b2c441411787caf70ded97591c","object":"chat.completion","created":1774412774,"model":"/mnt/models/Qwen3-32B-AWQ/","choices":[{"index":0,"message":{"role":"assistant","content":"\n好的,用户问的是“什么是人工智能?”,我需要根据提供的三个段落来回答。首先,Passage 1说人工智能是计算机科学的一个分支。Passage 2提到机器学习是人工智能的核心技术,而Pass","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":121,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Second request and response:
curl http://localhost:7800/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/mnt/models/Qwen3-32B-AWQ/",
"messages": [
{
"role": "system",
"content": "你是一个简洁、专业的中文助手"
},
{
"role": "user",
"content": "你好Passage 1: 人工智能是计算机科学的一个分支。Passage 2: 机器学习是人工智能的核心技术。Passage 3: 深度学习是机器学习的一个分支。\n\nQuestion: 什么是人工智能?\nAnswer:"
}
],
"max_tokens": 50,
"temperature": 0
}'
{"id":"chatcmpl-9a4d8385c2694241a35a29043d9e90d8","object":"chat.completion","created":1774412785,"model":"/mnt/models/Qwen3-32B-AWQ/","choices":[{"index":0,"message":{"role":"assistant","content":"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":71,"total_tokens":121,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Relevant logs:
26-03-25 04:26:25.087499][UC][I] request_id: chatcmpl-9a4d8385c2694241a35a29043d9e90d8, total_blocks_num: 2, req_stage: BlendStage.BUILD_PREFIX_CACHE, first chunk prefix hit: 2, chunks cache total hit: 0, [870,870][blend_connector.py:376,get_num_new_matched_tokens]
(APIServer pid=806) INFO: 127.0.0.1:51206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
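The log line belongs to the second request (same request_id as the second response) and reports "first chunk prefix hit: 2" out of "total_blocks_num: 2". This suggests that the prefix cache is being used for the whole prompt, but the output generated from it is invalid.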
Possible causes:
- KV cache offset misalignment after prefix reuse
- Incorrect token position when entering the decode stage
- Prefix cache stitching issue in BlendConnector
Additional observations:
- Disabling prefix cache resolves the issue
- Issue only occurs when cache is hit
- Mixed Chinese + English prompt may affect tokenization
Expected behavior:
Prefix cache hits should not affect output correctness. With temperature 0, repeated identical requests should produce the same (or at least a similarly coherent) output whether or not the prefix cache is hit.
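For anyone trying to confirm this, the two curl calls above can be scripted roughly as follows. This is only a sketch: the URL, model path, and payload are copied from the commands above, while the helper names (`build_payload`, `send`, `extract_content`, `looks_degenerate`, `compare_twice`) and the repetition heuristic are hypothetical and not part of vLLM or UCM.

```python
import json
import urllib.request

# Endpoint and model path are taken from the curl commands in this report.
URL = "http://localhost:7800/v1/chat/completions"
MODEL = "/mnt/models/Qwen3-32B-AWQ/"


def build_payload() -> dict:
    """Return the exact request body used in the reproduction."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "你是一个简洁、专业的中文助手"},
            {
                "role": "user",
                "content": (
                    "你好Passage 1: 人工智能是计算机科学的一个分支。"
                    "Passage 2: 机器学习是人工智能的核心技术。"
                    "Passage 3: 深度学习是机器学习的一个分支。"
                    "\n\nQuestion: 什么是人工智能?\nAnswer:"
                ),
            },
        ],
        "max_tokens": 50,
        "temperature": 0,
    }


def send(payload: dict, url: str = URL) -> dict:
    """POST the payload to the chat completions endpoint and parse the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def extract_content(response: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def looks_degenerate(text: str, max_unique_ratio: float = 0.1) -> bool:
    """Hypothetical heuristic: many whitespace-separated tokens, almost all identical."""
    tokens = text.split()
    if len(tokens) < 10:
        return False
    return len(set(tokens)) / len(tokens) <= max_unique_ratio


def compare_twice(send_fn=send) -> dict:
    """Send the same payload twice and report whether the outputs diverge."""
    payload = build_payload()
    first = extract_content(send_fn(payload))
    second = extract_content(send_fn(payload))
    return {
        "identical": first == second,
        "second_degenerate": looks_degenerate(second),
        "first": first,
        "second": second,
    }
```

Calling `compare_twice()` against a freshly started server should show the behaviour described here if the bug is present: `identical` is false and `second_degenerate` is true. Running it again with prefix caching disabled gives a baseline for comparison.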