Skip to content

[Usage]: Agentic Workload Tool Calling with Qwen3.6-27B #17

Description

@FaisalBiyari

Your current environment

Hardware/Software:

  • Dual AMD Radeon PRO W6800X Duo (128 GB VRAM, 4 GPUs)
  • Ubuntu Server 24.04 LTS
  • ROCm 7.2.3
  • Python 3.12
  • PyTorch 2.10
  • vLLM 0.22.0 (KVarN fork)
  • Hermes Agent v0.15.1 (2026.5.29) (Separate Ubuntu VM)
  • Qwen/Qwen3.6-27B (Hugging Face)

Initial Feedback:
Changing from fp16 KV cache to kvarn_k4v2_g128 changed GPU KV cache size from: 861,434 tokens to: 2,754,939 tokens.
That's just amazing. Great Job!

Problem:
Asking the hermes agent to perform a task that requires tool calling leads to the tool calling being inside the agent reply, not triggering the actual tool.
The agent then stops and no further action is taken.
In vLLM logs, continued activity is visible through generation throughput.
Several minutes later, vLLM activity stops. The agent produces nothing.

Curtsy of https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix 1st vLLM Launch Command Attempted:
source "~/venvs/kvarn-rocm-0.22/bin/activate"
VLLM_TARGET_DEVICE=rocm \
HSA_OVERRIDE_GFX_VERSION=10.3.0 \
HIP_FORCE_DEV_KERNARG=1 \
ROCR_VISIBLE_DEVICES=0,1,3,4 \
TORCH_BLAS_PREFER_HIPBLASLT=0 \
OMP_NUM_THREADS=8 \
TOKENIZERS_PARALLELISM=false \
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
VLLM_USE_TRITON_AWQ=1 \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3.6-27B \
  --served-model-name vLLM \
  --dtype float16 \
  --attention-backend KVARN \
  --kv-cache-dtype kvarn_k4v2_g128 \
  --block-size 128 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 42 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --override-generation-config '{"max_new_tokens": 8192}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --chat-template ~/vllm-templates/qwen3.6-enhanced.jinja \
  --default-chat-template-kwargs '{"preserve_thinking": true}' \
  --generation-config vllm \
  --max-cudagraph-capture-size 64 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 64 \
  --language-model-only \
  --limit-mm-per-prompt.image 0 \
  --limit-mm-per-prompt.video 0 \
  --host 0.0.0.0 \
  --port 8000
2nd vLLM Launch Command:
source "~/venvs/kvarn-rocm-0.22/bin/activate"
VLLM_TARGET_DEVICE=rocm \
HSA_OVERRIDE_GFX_VERSION=10.3.0 \
HIP_FORCE_DEV_KERNARG=1 \
ROCR_VISIBLE_DEVICES=0,1,3,4 \
TORCH_BLAS_PREFER_HIPBLASLT=0 \
OMP_NUM_THREADS=8 \
TOKENIZERS_PARALLELISM=false \
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
VLLM_USE_TRITON_AWQ=1 \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3.6-27B \
  --served-model-name vLLM \
  --dtype float16 \
  --attention-backend KVARN \
  --kv-cache-dtype kvarn_k4v2_g128 \
  --block-size 128 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 42 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --override-generation-config '{"max_new_tokens": 8192}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"preserve_thinking": true}' \
  --generation-config vllm \
  --max-cudagraph-capture-size 64 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 \
  --language-model-only \
  --limit-mm-per-prompt.image 0 \
  --limit-mm-per-prompt.video 0 \
  --host 0.0.0.0 \
  --port 8000
CLI Chat
██╗  ██╗███████╗██████╗ ███╗   ███╗███████╗███████╗       █████╗  ██████╗ ███████╗███╗   ██╗████████╗
██║  ██║██╔════╝██╔══██╗████╗ ████║██╔════╝██╔════╝      ██╔══██╗██╔════╝ ██╔════╝████╗  ██║╚══██╔══╝
███████║█████╗  ██████╔╝██╔████╔██║█████╗  ███████╗█████╗███████║██║  ███╗█████╗  ██╔██╗ ██║   ██║
██╔══██║██╔══╝  ██╔══██╗██║╚██╔╝██║██╔══╝  ╚════██║╚════╝██╔══██║██║   ██║██╔══╝  ██║╚██╗██║   ██║
██║  ██║███████╗██║  ██║██║ ╚═╝ ██║███████╗███████║      ██║  ██║╚██████╔╝███████╗██║ ╚████║   ██║
╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝      ╚═╝  ╚═╝ ╚═════╝ ╚══════╝╚═╝  ╚═══╝   ╚═╝

╭────────────────────────────────── Hermes Agent v0.15.1 (2026.5.29) · upstream 6110aed9 ──────────────────────────────────╮
│                                   Available Tools                                                                        │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡀⠀⣀⣀⠀⢀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   browser: browser_back, browser_click, ...                                              │
│  ⠀⠀⠀⠀⠀⠀⢀⣠⣴⣾⣿⣿⣇⠸⣿⣿⠇⣸⣿⣿⣷⣦⣄⡀⠀⠀⠀⠀⠀⠀   browser-cdp: browser_cdp, browser_dialog                                               │
│  ⠀⢀⣠⣴⣶⠿⠋⣩⡿⣿⡿⠻⣿⡇⢠⡄⢸⣿⠟⢿⣿⢿⣍⠙⠿⣶⣦⣄⡀⠀   clarify: clarify                                                                       │
│  ⠀⠀⠉⠉⠁⠶⠟⠋⠀⠉⠀⢀⣈⣁⡈⢁⣈⣁⡀⠀⠉⠀⠙⠻⠶⠈⠉⠉⠀⠀   code_execution: execute_code                                                           │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⣿⡿⠛⢁⡈⠛⢿⣿⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   computer_use: computer_use                                                             │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠿⣿⣦⣤⣈⠁⢠⣴⣿⠿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   cronjob: cronjob                                                                       │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠻⢿⣿⣦⡉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   delegation: delegate_task                                                              │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢷⣦⣈⠛⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   discord: discord                                                                       │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣴⠦⠈⠙⠿⣦⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   (and 21 more toolsets...)                                                              │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣤⡈⠁⢤⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                                                                                          │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠷⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   Available Skills                                                                       │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠑⢶⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   autonomous-ai-agents: claude-code, codex, hermes-agent, kanban-codex-...               │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠁⢰⡆⠈⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   creative: architecture-diagram, ascii-art, ascii-video, b...                           │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⠈⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   data-science: jupyter-live-kernel                                                      │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   devops: enterprise-hardware-sourcing, imessage-bluebubb...                             │
│                                   email: himalaya                                                                        │
│       vLLM · Nous Research        gaming: minecraft-modpack-server, pokemon-player                                       │
│           /home/faisal            general: browser-setup, dogfood, yuanbao                                               │
│  Session: 20260610_185917_132f6f  github: codebase-inspection, github-auth, github-code-r...                             │
│                                   mcp: native-mcp                                                                        │
│                                   media: gif-search, heartmula, songsee, spotify, youtub...                              │
│                                   mlops: audiocraft-audio-generation, dspy, evaluating-l...                              │
│                                   note-taking: obsidian                                                                  │
│                                   productivity: airtable, google-workspace, linear, maps, nano-...                       │
│                                   red-teaming: godmode                                                                   │
│                                   research: arxiv, blogwatcher, ecosystem-research, llm-lan...                           │
│                                   smart-home: openhue                                                                    │
│                                   social-media: xurl                                                                     │
│                                   software-development: debugging-hermes-state, debugging-hermes-tui-co...               │
│                                                                                                                          │
│                                   Profile: sami                                                                          │
│                                   29 tools · 94 skills · /help for commands                                              │
│                                   ⚠ 1284 commits behind — run hermes update to update                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Welcome to Hermes Agent! Type your message or /help for commands.
✦ Tip: The TUI renders LaTeX inline — $E=mc^2$ becomes Unicode math instead of raw TeX.


────────────────────────────────────────
● Look up KVarN by Huawei CSL. Share what information you can about it, please
Initializing agent...

────────────────────────────────────────

┌─ Reasoning ──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
The user is asking me to look up information about "KVarN by Huawei CSL." This seems
 to be a search/research task. Let me use the search tool to find information about
 this.

CSL likely refers to China Software Laboratory, which is the research lab of Huawei
.

Let me search for this.
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

╭─ ⚕ Hermes ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
    <arg_key>query</arg_key>
    <arg_value>KVarN Huawei CSL China Software Laboratory</arg_value>
    
    <arg_key>query</arg_key>
    <arg_value>KVarN Huawei CSL China Software Laboratory</arg_value>
    <arg_key>query</arg_key>
    <arg_value>KVarN by Huawei CSL ontological reasoning</arg_value>
    "KVarN by Huawei CSL ontological reasoning"
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
⚠ Auxiliary title generation failed: Request timed out.
 ⚕ vLLM │ 17.4K/65.5K │ [███░░░░░░░] 27% │ 46m │ ⏲ 15s 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sami ❯ 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Resume this session with:
  hermes --resume 20260610_185917_132f6f -p sami

Session:        20260610_185917_132f6f
Duration:       46m 28s
Messages:       2 (1 user, 0 tool calls)

How would you like to use vllm

Expectation is that tools would be called normally, similar to when not using KVarN.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions