This example demonstrates running Dynamo across multiple nodes with KV-aware routing to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
For more information about the core concepts, see the KV Cache Routing Architecture and SGLang Backend documentation referenced later in this guide.
The multi-node setup consists of:
- 1 Frontend: Receives HTTP requests and uses KV routing to distribute them
- 2 Model Replicas: Each with dedicated prefill and decode workers
- Smart KV-Aware Routing: Intelligently routes requests based on KV cache locality across all workers
---
title: Multi-Node Architecture with Full KV Routing (SGLang)
---
flowchart TD
Client["Users/Clients<br/>(HTTP)"] --> Frontend["Frontend<br/>KV-Aware Router<br/>(Any Node)"]
Frontend --> Router{KV Routing<br/>Decision}
Router --> Prefill1["Prefill Worker 1"]
Router --> Prefill2["Prefill Worker 2"]
Prefill1 -->|NIXL Transfer| Decode1
Prefill2 -->|NIXL Transfer| Decode2
Prefill1 -.->|KV Events| Frontend
Prefill2 -.->|KV Events| Frontend
Decode1 --> |Response| Frontend
Decode2 --> |Response| Frontend
Frontend --> Client
subgraph Node1["Node 1"]
Decode1
Prefill1
end
subgraph Node2["Node 2"]
Decode2
Prefill2
end
KV-aware routing optimizes LLM inference by directing requests to workers that already have relevant data cached. Instead of random or round-robin distribution, the router:
- Tracks cached data: Monitors which token sequences are cached on each worker
- Maximizes cache reuse: Routes requests to workers with the best cache overlap, reducing redundant computation
- Balances load: Considers both cache efficiency and worker utilization when making routing decisions
This is particularly beneficial for:
- Shared system prompts: Cached across workers and reused efficiently
- Multi-turn conversations: Full conversation history benefits from caching
- Similar queries: Common prefixes are computed once and reused
- Batch processing: Related requests can be routed to workers with shared context
For detailed technical information about how KV routing works, see the KV Cache Routing Architecture documentation.
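To make the routing decision concrete, here is a minimal sketch of overlap-based worker selection. It is an illustration of the technique, not Dynamo's actual implementation; the block hashing scheme, cache tracking, and load signal are simplified stand-ins:

# Simplified sketch of KV-aware worker selection (illustrative only).
# Each worker advertises the block hashes it has cached; the router scores
# workers by prefix overlap with the request, discounted by current load.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching --page-size 16

def block_hashes(tokens):
    """Hash each aligned block of the token prefix (stand-in scheme)."""
    return [
        hashlib.sha256(str(tokens[:end]).encode()).hexdigest()
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)
    ]

def select_worker(tokens, workers, overlap_weight=1.0):
    """workers maps id -> {"cached": set of block hashes, "load": float}."""
    best_id, best_score = None, float("-inf")
    for worker_id, state in workers.items():
        overlap = 0
        for h in block_hashes(tokens):  # prefix cache: stop at the first miss
            if h not in state["cached"]:
                break
            overlap += 1
        score = overlap_weight * overlap - state["load"]
        if score > best_score:
            best_id, best_score = worker_id, score
    return best_id

workers = {
    "worker-1": {"cached": set(block_hashes(list(range(32)))), "load": 0.2},
    "worker-2": {"cached": set(), "load": 0.1},
}
print(select_worker(list(range(48)), workers))  # -> worker-1 (better overlap)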
Ensure etcd and NATS are running on a node accessible by all workers:
# On the infrastructure node (can be Node 1 or a dedicated node)
docker compose -f deploy/docker-compose.yml up -d
Note the IP address of this node - you'll need it for worker configuration.
Install Dynamo with SGLang support:
pip install ai-dynamo[sglang]
For more information about the SGLang backend and its integration with Dynamo, see the SGLang Backend Documentation.
Ensure the following ports and network resources are available between nodes:
- 2379: etcd client port
- 4222: NATS client port
- 8000: Frontend HTTP port (only needed on frontend node)
- High-speed interconnect: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
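If you want to verify reachability before launching anything, here is a small sketch that probes these ports with plain sockets. It assumes the INFRA_NODE_IP variable exported in the configuration step below; check the frontend port against the frontend node rather than the infrastructure node:

# Probe the infrastructure ports listed above from a worker node
import os
import socket

infra_ip = os.environ.get("INFRA_NODE_IP", "127.0.0.1")
ports = {2379: "etcd", 4222: "NATS", 8000: "frontend HTTP"}

for port, name in ports.items():
    try:
        with socket.create_connection((infra_ip, port), timeout=3):
            print(f"{name} ({port}): reachable")
    except OSError as exc:
        print(f"{name} ({port}): NOT reachable ({exc})")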
This example assumes:
- Node 1: At least 2 GPUs (for Replica 1's decode and prefill workers)
- Node 2: At least 2 GPUs (for Replica 2's decode and prefill workers)
- Frontend Node: Can be on Node 1, Node 2, or a separate node (no GPU required)
Note
You can run this example with minimal modifications on a single node with at least 4 GPUs.
In step 3, set CUDA_VISIBLE_DEVICES=2 for the prefill worker and
CUDA_VISIBLE_DEVICES=3 for the decode worker.
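Before launching workers, you can sanity-check how many GPUs each node exposes. A minimal sketch using PyTorch, which is installed as part of the SGLang stack:

# Check that this node has enough visible GPUs for one replica's two workers
import torch

available = torch.cuda.device_count()
required = 2  # one prefill worker + one decode worker per replica
print(f"Visible GPUs: {available}")
if available < required:
    raise SystemExit(f"Need at least {required} GPUs on this node, found {available}")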
On all nodes, set the etcd and NATS endpoints:
# Replace with your infrastructure node's IP
# To find your IP address, run the following on your infrastructure node:
# hostname -I | awk '{print $1}'
export INFRA_NODE_IP=<INFRA_NODE_IP>
export ETCD_ENDPOINTS=http://${INFRA_NODE_IP}:2379
export NATS_SERVER=nats://${INFRA_NODE_IP}:4222
export DYN_LOG=debug # Enable debug logging to see routing decisions
Open a terminal on Node 1 and launch both workers:
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
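# Launch decode worker in foreground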
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl

> [!INFO]
> - CUDA_VISIBLE_DEVICES: Controls which GPU each worker uses (0 and 1 for different GPUs)
> - --page-size 16: Sets the KV cache block size - must be identical across all workers
> - --disaggregation-mode: Separates prefill (prompt processing) from decode (token generation)
> - --disaggregation-transfer-backend nixl: Enables high-speed GPU-to-GPU transfers
> - --skip-tokenizer-init: Avoids duplicate tokenizer loading since the frontend handles tokenization
Open a terminal on Node 2 and launch both workers:
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
# Launch decode worker in foreground
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl
Open a terminal on any node and launch the frontend:
# On any node (no GPU required)
python -m dynamo.frontend \
--http-port 8000 \
--router-mode kv
Take note of the frontend IP address:
# On the same node you launched dynamo.frontend
hostname -I | awk '{print $1}'
The frontend will:
- Discover all available decode workers via etcd
- Enable KV-aware routing for intelligent request distribution
- Monitor worker health and adjust routing accordingly
For more details about frontend configuration options, see the Frontend Component Documentation.
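If you want to see the same registrations the frontend discovers, you can read them straight from etcd. Here is a sketch using the python-etcd3 client (an assumed dependency, installed with pip install etcd3); the /dynamo/workers/ prefix matches the etcdctl check in the troubleshooting section:

# List worker registrations under /dynamo/workers/ in etcd
import os
import etcd3

client = etcd3.client(host=os.environ["INFRA_NODE_IP"], port=2379)
for value, metadata in client.get_prefix("/dynamo/workers/"):
    print(metadata.key.decode(), "->", value[:80].decode(errors="replace"))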
Install the OpenAI Python client library:
pip install openai
Paste in the Dynamo Frontend IP from step 4 (or use localhost if on the same node):
export DYN_FRONTEND_IP=<PASTE_FRONTEND_IP_HERE>
Send a request to see it routed to one of the replicas:
from openai import OpenAI
import os
frontend_ip = os.environ.get("DYN_FRONTEND_IP")
if not frontend_ip:
    raise RuntimeError("DYN_FRONTEND_IP is not set")
client = OpenAI(
base_url=f"http://{frontend_ip}:8000/v1",
api_key="dummy" # Not used by Dynamo, but required by OpenAI client
)
response = client.chat.completions.create(
model="Qwen/Qwen3-0.6B",
messages=[
{"role": "user", "content": "What is the capital of France?"}
],
stream=False,
max_tokens=50
)
print(response.choices[0].message.content)
Create a conversation to observe how KV routing naturally benefits multi-turn interactions:
from openai import OpenAI
import os
frontend_ip = os.environ.get("DYN_FRONTEND_IP")
if not frontend_ip:
    raise RuntimeError("DYN_FRONTEND_IP is not set")
client = OpenAI(
base_url=f"http://{frontend_ip}:8000/v1",
api_key="dummy" # Not used by Dynamo, but required by OpenAI client
)
# First turn - establishes context
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Alice. Please remember it."}
]
response1 = client.chat.completions.create(
model="Qwen/Qwen3-0.6B",
messages=messages,
stream=False,
max_tokens=50
)
print("First response:", response1.choices[0].message.content)
# Add the assistant's response to conversation history
messages.append({"role": "assistant", "content": response1.choices[0].message.content})
# Second turn - includes the full conversation history
# KV routing will likely route this to the same worker due to shared token prefix
messages.append({"role": "user", "content": "What is my name?"})
response2 = client.chat.completions.create(
model="Qwen/Qwen3-0.6B",
messages=messages,
stream=False,
max_tokens=50
)
print("Second response:", response2.choices[0].message.content)Send multiple new conversations to see them distributed across replicas:
import asyncio
from openai import AsyncOpenAI
import os
frontend_ip = os.environ.get("DYN_FRONTEND_IP")
if not frontend_ip:
    raise RuntimeError("DYN_FRONTEND_IP is not set")
async def send_request(client, i):
"""Send a single request and return the response"""
try:
response = await client.chat.completions.create(
model="Qwen/Qwen3-0.6B",
messages=[
{"role": "user", "content": f"Count to {i}"}
],
stream=False,
max_tokens=20
)
return f"Request {i}: {response.choices[0].message.content}"
except Exception as e:
return f"Request {i} failed: {e}"
async def load_test():
"""Send 10 requests in parallel to test load distribution"""
client = AsyncOpenAI(
base_url=f"http://{frontend_ip}:8000/v1",
api_key="dummy"
)
# Send 10 requests in parallel
tasks = [send_request(client, i) for i in range(1, 11)]
results = await asyncio.gather(*tasks)
for result in results:
print(result)
# Run the load test
if __name__ == "__main__":
    asyncio.run(load_test())
With DYN_LOG=debug, you can observe KV routing decisions in the logs:
[DEBUG] KV overlap scores: {worker-1: 15 blocks, worker-2: 8 blocks}
[DEBUG] Selected worker-1 (best overlap: 15 blocks)
[DEBUG] Cache hit rate: 75% for this request
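You can also observe the effect of cache reuse from the client side: repeating the same long prompt should shorten time-to-first-token when the request lands on a worker with a warm cache. A rough, illustrative check (exact timings will vary with hardware and load):

# Rough client-side check: time-to-first-token should drop on the repeat
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url=f"http://{os.environ['DYN_FRONTEND_IP']}:8000/v1",
    api_key="dummy",
)
prompt = "Summarize this policy: " + "all employees must badge in. " * 100

for attempt in (1, 2):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=10,
    )
    next(iter(stream))  # block until the first streamed chunk arrives
    print(f"Attempt {attempt}: first token after {time.perf_counter() - start:.3f}s")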
While this example demonstrates KV-aware routing for optimal cache utilization, Dynamo also supports simpler routing strategies:
- KV-Aware (recommended): Routes based on cache overlap across all workers
- Round-Robin: Distributes requests evenly across workers in sequence
- Random: Randomly selects workers for each request
# Example: Use round-robin routing instead of KV routing
python -m dynamo.frontend \
--http-port 8000 \
--router-mode round-robin
However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
Verify all workers are properly registered:
etcdctl --endpoints=$ETCD_ENDPOINTS get --prefix /dynamo/workers/
With DYN_LOG=debug, the frontend logs show routing decisions:
[DEBUG] KV overlap scores: {prefill-worker-1: 15 blocks, prefill-worker-2: 8 blocks}
[DEBUG] Selected prefill-worker-1 (best overlap: 15 blocks)
[DEBUG] KV overlap scores: {decode-worker-1: 12 blocks, decode-worker-2: 18 blocks}
[DEBUG] Selected decode-worker-2 (best overlap: 18 blocks)
[DEBUG] Worker decode-worker-1 unhealthy, rerouting -> decode-worker-2
- Check worker health status:
  curl http://${DYN_FRONTEND_IP}:8000/health
- Verify etcd connectivity from all nodes:
  etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
- Check NATS connectivity:
  nats --server=$NATS_SERVER server check connection
- Ensure GPUs can communicate across nodes
- Check InfiniBand/RoCE configuration if using high-speed interconnect
- Verify CUDA IPC is enabled for optimal performance
- Confirm the frontend is started with --router-mode kv
- Check that all workers are properly registered in etcd
- Verify workers are publishing KV events
- Check logs for overlap scores - if all zeros, cache tracking may not be working
- Ensure NATS is functioning for KV event distribution
For production deployments, you can fine-tune KV routing behavior:
# --kv-overlap-score-weight: weight for cache overlap scoring
# --router-temperature: temperature for probabilistic routing (0 = deterministic)
python -m dynamo.frontend \
--http-port 8000 \
--router-mode kv \
--kv-overlap-score-weight 1.0 \
--router-temperature 0.0
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the KV Cache Routing documentation.
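To see what --router-temperature means conceptually, the sketch below turns worker scores into a sampling distribution: at temperature 0 the best-scoring worker always wins, while higher temperatures spread traffic more evenly. This illustrates the general technique, not Dynamo's internal code:

# Softmax-with-temperature worker sampling (illustrative only)
import math
import random

def pick_worker(scores, temperature):
    """scores: worker id -> routing score; temperature 0 => deterministic argmax."""
    if temperature == 0:
        return max(scores, key=scores.get)
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores), weights=weights, k=1)[0]

scores = {"worker-1": 15, "worker-2": 8}
print(pick_worker(scores, temperature=0))    # always worker-1
print(pick_worker(scores, temperature=5.0))  # usually worker-1, sometimes worker-2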
Stop all components in reverse order:
- Stop the frontend: press Ctrl+C in the frontend terminal.
- Stop workers on each node:
  - On Node 1: press Ctrl+C in the terminal (this stops the decode worker)
  - On Node 2: press Ctrl+C in the terminal (this stops the decode worker)
  - To stop the background prefill workers, use one of these methods:
    # Method 1: Kill background jobs in the same terminal
    jobs      # See background jobs
    kill %1   # Kill the background prefill worker
    # Method 2: Close the terminal entirely (sends SIGHUP to background processes)
    exit
    # Method 3: Kill by process name (from any terminal)
    pkill -f "dynamo.sglang.*prefill"
- Stop infrastructure services:
  docker compose -f deploy/docker-compose.yml down
- Scale Up: Add more replicas by repeating Steps 2-3 on additional nodes
- High Availability: Run multiple frontend instances with a load balancer
- Monitoring: Deploy Prometheus and Grafana for production monitoring
- Optimization: Tune worker configurations based on workload patterns
- Cache Analysis: Use SGLang's built-in cache statistics to optimize your workloads