[Usage]: Agentic Workload Tool Calling with Qwen3.6-27B

### Your current environment

Hardware/Software: 
- Dual AMD Radeon PRO W6800X Duo (128 GB VRAM, 4 GPUs)
- Ubuntu Server 24.04 LTS
- ROCm 7.2.3
- Python 3.12
- PyTorch 2.10
- vLLM 0.22.0 (KVarN fork)
- Hermes Agent v0.15.1 (2026.5.29) (Separate Ubuntu VM)
- Qwen/Qwen3.6-27B (Hugging Face)

Initial Feedback:
Changing from fp16 KV cache to kvarn_k4v2_g128 changed GPU KV cache size from: 861,434 tokens to: 2,754,939 tokens.
That's just amazing. Great Job!

Problem:
Asking the hermes agent to perform a task that requires tool calling leads to the tool calling being inside the agent reply, not triggering the actual tool.
The agent then stops and no further action is taken.
In vLLM logs, continued activity is visible through generation throughput.
Several minutes later, vLLM activity stops. The agent produces nothing.

<details>
Curtsy of https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix

<summary>1st vLLM Launch Command Attempted:</summary>

```text
source "~/venvs/kvarn-rocm-0.22/bin/activate"
VLLM_TARGET_DEVICE=rocm \
HSA_OVERRIDE_GFX_VERSION=10.3.0 \
HIP_FORCE_DEV_KERNARG=1 \
ROCR_VISIBLE_DEVICES=0,1,3,4 \
TORCH_BLAS_PREFER_HIPBLASLT=0 \
OMP_NUM_THREADS=8 \
TOKENIZERS_PARALLELISM=false \
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
VLLM_USE_TRITON_AWQ=1 \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3.6-27B \
  --served-model-name vLLM \
  --dtype float16 \
  --attention-backend KVARN \
  --kv-cache-dtype kvarn_k4v2_g128 \
  --block-size 128 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 42 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --override-generation-config '{"max_new_tokens": 8192}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --chat-template ~/vllm-templates/qwen3.6-enhanced.jinja \
  --default-chat-template-kwargs '{"preserve_thinking": true}' \
  --generation-config vllm \
  --max-cudagraph-capture-size 64 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 64 \
  --language-model-only \
  --limit-mm-per-prompt.image 0 \
  --limit-mm-per-prompt.video 0 \
  --host 0.0.0.0 \
  --port 8000
```
</details>

<details>
<summary>2nd vLLM Launch Command:</summary>

```text
source "~/venvs/kvarn-rocm-0.22/bin/activate"
VLLM_TARGET_DEVICE=rocm \
HSA_OVERRIDE_GFX_VERSION=10.3.0 \
HIP_FORCE_DEV_KERNARG=1 \
ROCR_VISIBLE_DEVICES=0,1,3,4 \
TORCH_BLAS_PREFER_HIPBLASLT=0 \
OMP_NUM_THREADS=8 \
TOKENIZERS_PARALLELISM=false \
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
VLLM_USE_TRITON_AWQ=1 \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3.6-27B \
  --served-model-name vLLM \
  --dtype float16 \
  --attention-backend KVARN \
  --kv-cache-dtype kvarn_k4v2_g128 \
  --block-size 128 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 42 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --override-generation-config '{"max_new_tokens": 8192}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"preserve_thinking": true}' \
  --generation-config vllm \
  --max-cudagraph-capture-size 64 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 \
  --language-model-only \
  --limit-mm-per-prompt.image 0 \
  --limit-mm-per-prompt.video 0 \
  --host 0.0.0.0 \
  --port 8000
```
</details>

<details>
<summary>CLI Chat</summary>

```text
██╗  ██╗███████╗██████╗ ███╗   ███╗███████╗███████╗       █████╗  ██████╗ ███████╗███╗   ██╗████████╗
██║  ██║██╔════╝██╔══██╗████╗ ████║██╔════╝██╔════╝      ██╔══██╗██╔════╝ ██╔════╝████╗  ██║╚══██╔══╝
███████║█████╗  ██████╔╝██╔████╔██║█████╗  ███████╗█████╗███████║██║  ███╗█████╗  ██╔██╗ ██║   ██║
██╔══██║██╔══╝  ██╔══██╗██║╚██╔╝██║██╔══╝  ╚════██║╚════╝██╔══██║██║   ██║██╔══╝  ██║╚██╗██║   ██║
██║  ██║███████╗██║  ██║██║ ╚═╝ ██║███████╗███████║      ██║  ██║╚██████╔╝███████╗██║ ╚████║   ██║
╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝      ╚═╝  ╚═╝ ╚═════╝ ╚══════╝╚═╝  ╚═══╝   ╚═╝

╭────────────────────────────────── Hermes Agent v0.15.1 (2026.5.29) · upstream 6110aed9 ──────────────────────────────────╮
│                                   Available Tools                                                                        │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡀⠀⣀⣀⠀⢀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   browser: browser_back, browser_click, ...                                              │
│  ⠀⠀⠀⠀⠀⠀⢀⣠⣴⣾⣿⣿⣇⠸⣿⣿⠇⣸⣿⣿⣷⣦⣄⡀⠀⠀⠀⠀⠀⠀   browser-cdp: browser_cdp, browser_dialog                                               │
│  ⠀⢀⣠⣴⣶⠿⠋⣩⡿⣿⡿⠻⣿⡇⢠⡄⢸⣿⠟⢿⣿⢿⣍⠙⠿⣶⣦⣄⡀⠀   clarify: clarify                                                                       │
│  ⠀⠀⠉⠉⠁⠶⠟⠋⠀⠉⠀⢀⣈⣁⡈⢁⣈⣁⡀⠀⠉⠀⠙⠻⠶⠈⠉⠉⠀⠀   code_execution: execute_code                                                           │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⣿⡿⠛⢁⡈⠛⢿⣿⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   computer_use: computer_use                                                             │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠿⣿⣦⣤⣈⠁⢠⣴⣿⠿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   cronjob: cronjob                                                                       │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠻⢿⣿⣦⡉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   delegation: delegate_task                                                              │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢷⣦⣈⠛⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   discord: discord                                                                       │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣴⠦⠈⠙⠿⣦⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   (and 21 more toolsets...)                                                              │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣤⡈⠁⢤⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                                                                                          │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠷⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   Available Skills                                                                       │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠑⢶⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   autonomous-ai-agents: claude-code, codex, hermes-agent, kanban-codex-...               │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠁⢰⡆⠈⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   creative: architecture-diagram, ascii-art, ascii-video, b...                           │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⠈⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   data-science: jupyter-live-kernel                                                      │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   devops: enterprise-hardware-sourcing, imessage-bluebubb...                             │
│                                   email: himalaya                                                                        │
│       vLLM · Nous Research        gaming: minecraft-modpack-server, pokemon-player                                       │
│           /home/faisal            general: browser-setup, dogfood, yuanbao                                               │
│  Session: 20260610_185917_132f6f  github: codebase-inspection, github-auth, github-code-r...                             │
│                                   mcp: native-mcp                                                                        │
│                                   media: gif-search, heartmula, songsee, spotify, youtub...                              │
│                                   mlops: audiocraft-audio-generation, dspy, evaluating-l...                              │
│                                   note-taking: obsidian                                                                  │
│                                   productivity: airtable, google-workspace, linear, maps, nano-...                       │
│                                   red-teaming: godmode                                                                   │
│                                   research: arxiv, blogwatcher, ecosystem-research, llm-lan...                           │
│                                   smart-home: openhue                                                                    │
│                                   social-media: xurl                                                                     │
│                                   software-development: debugging-hermes-state, debugging-hermes-tui-co...               │
│                                                                                                                          │
│                                   Profile: sami                                                                          │
│                                   29 tools · 94 skills · /help for commands                                              │
│                                   ⚠ 1284 commits behind — run hermes update to update                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Welcome to Hermes Agent! Type your message or /help for commands.
✦ Tip: The TUI renders LaTeX inline — $E=mc^2$ becomes Unicode math instead of raw TeX.


────────────────────────────────────────
● Look up KVarN by Huawei CSL. Share what information you can about it, please
Initializing agent...

────────────────────────────────────────

┌─ Reasoning ──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
The user is asking me to look up information about "KVarN by Huawei CSL." This seems
 to be a search/research task. Let me use the search tool to find information about
 this.

CSL likely refers to China Software Laboratory, which is the research lab of Huawei
.

Let me search for this.
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

╭─ ⚕ Hermes ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
    <arg_key>query</arg_key>
    <arg_value>KVarN Huawei CSL China Software Laboratory</arg_value>
    
    <arg_key>query</arg_key>
    <arg_value>KVarN Huawei CSL China Software Laboratory</arg_value>
    <arg_key>query</arg_key>
    <arg_value>KVarN by Huawei CSL ontological reasoning</arg_value>
    "KVarN by Huawei CSL ontological reasoning"
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
⚠ Auxiliary title generation failed: Request timed out.
 ⚕ vLLM │ 17.4K/65.5K │ [███░░░░░░░] 27% │ 46m │ ⏲ 15s 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sami ❯ 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Resume this session with:
  hermes --resume 20260610_185917_132f6f -p sami

Session:        20260610_185917_132f6f
Duration:       46m 28s
Messages:       2 (1 user, 0 tool calls)

```
</details>


### How would you like to use vllm

Expectation is that tools would be called normally, similar to when not using KVarN.

### Before submitting a new issue...

- [ ] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Usage]: Agentic Workload Tool Calling with Qwen3.6-27B #17

Your current environment

How would you like to use vllm

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Usage]: Agentic Workload Tool Calling with Qwen3.6-27B #17

Description

Your current environment

How would you like to use vllm

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions