Commit 1908f7b

Merge pull request #52 from thushan/feature/better-stats
feat: better stats
2 parents e8a1368 + 982cdc6 commit 1908f7b

50 files changed (+2306 -453 lines)

config/profiles/lmstudio.yaml

Lines changed: 19 additions & 1 deletion
@@ -133,4 +133,22 @@ resources:
   # No need for load time buffer - models are preloaded
   timeout_scaling:
     base_timeout_seconds: 180 # 3 minutes
-    load_time_buffer: false
+    load_time_buffer: false
+
+# Metrics extraction for LM Studio responses
+metrics:
+  extraction:
+    enabled: true
+    source: response_body
+    format: json
+    # LM Studio uses OpenAI-compatible format
+    paths:
+      model: "$.model"
+      finish_reason: "$.choices[0].finish_reason" # String value (e.g., "stop", "length")
+      input_tokens: "$.usage.prompt_tokens"
+      output_tokens: "$.usage.completion_tokens"
+      total_tokens: "$.usage.total_tokens"
+    calculations:
+      # Derive IsComplete from finish_reason presence (LM Studio doesn't have a separate 'done' field)
+      is_complete: 'len(finish_reason) > 0'
+      # LM Studio doesn't provide timing data, so we can't calculate tokens/sec
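LM Studio returns the standard OpenAI-style `usage` block plus `choices[0].finish_reason`, and the profile derives completion state from the presence of `finish_reason`. Below is a minimal Go sketch of that same derivation against a sample response body; the sample values are invented for illustration and this is not the project's extractor code.

    // Sketch: parse an OpenAI-compatible LM Studio response and derive
    // is_complete the way the profile's calculation does (finish_reason non-empty).
    // Sample values are invented for illustration.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type chatResponse struct {
        Model   string `json:"model"`
        Choices []struct {
            FinishReason string `json:"finish_reason"`
        } `json:"choices"`
        Usage struct {
            PromptTokens     int `json:"prompt_tokens"`
            CompletionTokens int `json:"completion_tokens"`
            TotalTokens      int `json:"total_tokens"`
        } `json:"usage"`
    }

    func main() {
        body := []byte(`{"model":"example-model","choices":[{"finish_reason":"stop"}],"usage":{"prompt_tokens":42,"completion_tokens":128,"total_tokens":170}}`)

        var r chatResponse
        if err := json.Unmarshal(body, &r); err != nil {
            panic(err)
        }

        // Mirrors the profile calculation: is_complete: 'len(finish_reason) > 0'
        isComplete := len(r.Choices) > 0 && len(r.Choices[0].FinishReason) > 0

        fmt.Println(r.Model, r.Usage.PromptTokens, r.Usage.CompletionTokens, isComplete)
    }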

config/profiles/ollama.yaml

Lines changed: 30 additions & 1 deletion
@@ -171,4 +171,33 @@ resources:
   # Dynamic timeout scaling
   timeout_scaling:
     base_timeout_seconds: 30
-    load_time_buffer: true # adds estimated_load_time_ms to timeout
+    load_time_buffer: true # adds estimated_load_time_ms to timeout
+
+# Metrics extraction from Ollama responses
+metrics:
+  extraction:
+    enabled: true
+    source: "response_body"
+    format: "json"
+
+    # JSONPath expressions for extracting values from Ollama response
+    paths:
+      model: "$.model"
+      is_complete: "$.done" # Ollama provides 'done' as a boolean directly
+      finish_reason: "$.finish_reason" # Optional: Ollama may include this in some versions
+      # Token counts
+      input_tokens: "$.prompt_eval_count"
+      output_tokens: "$.eval_count"
+      # Timing data (in nanoseconds from Ollama)
+      total_duration_ns: "$.total_duration"
+      load_duration_ns: "$.load_duration"
+      prompt_duration_ns: "$.prompt_eval_duration"
+      eval_duration_ns: "$.eval_duration"
+
+    # Simple calculations to convert to useful metrics
+    calculations:
+      # Safe division: multiply first for precision, then divide with guard against zero
+      tokens_per_second: "eval_duration_ns > 0 ? (output_tokens * 1000000000.0) / eval_duration_ns : 0"
+      ttft_ms: "prompt_duration_ns / 1000000"
+      total_ms: "total_duration_ns / 1000000"
+      model_load_ms: "load_duration_ns / 1000000"
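Because Ollama reports durations in nanoseconds, the calculations above convert to milliseconds and guard the tokens-per-second division against a zero eval duration. A small worked sketch in Go with invented sample values (not the project's implementation):

    // Sketch of the timing maths in the calculations above, with made-up sample values.
    package main

    import "fmt"

    func main() {
        evalCount := 128                        // output tokens ($.eval_count)
        evalDurationNs := int64(2_500_000_000)  // 2.5 s generating ($.eval_duration)
        promptDurationNs := int64(350_000_000)  // 0.35 s prompt eval ($.prompt_eval_duration)
        totalDurationNs := int64(3_100_000_000) // 3.1 s end to end ($.total_duration)

        // tokens_per_second with the zero guard: 128 * 1e9 / 2.5e9 = 51.2
        tokensPerSecond := 0.0
        if evalDurationNs > 0 {
            tokensPerSecond = float64(evalCount) * 1e9 / float64(evalDurationNs)
        }

        // Nanoseconds to milliseconds for ttft_ms and total_ms
        fmt.Printf("tokens/s=%.1f ttft_ms=%d total_ms=%d\n",
            tokensPerSecond,
            promptDurationNs/1_000_000,
            totalDurationNs/1_000_000)
    }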

config/profiles/openai.yaml

Lines changed: 23 additions & 1 deletion
@@ -100,4 +100,26 @@ resources:
   # No load time buffer needed for cloud services
   timeout_scaling:
     base_timeout_seconds: 120 # 2 minutes
-    load_time_buffer: false
+    load_time_buffer: false
+
+# Metrics extraction for OpenAI-compatible responses
+metrics:
+  extraction:
+    enabled: true
+    source: response_body
+    format: json
+    # OpenAI standard format
+    paths:
+      model: "$.model"
+      finish_reason: "$.choices[0].finish_reason" # String value (e.g., "stop", "length", "function_call")
+      input_tokens: "$.usage.prompt_tokens"
+      output_tokens: "$.usage.completion_tokens"
+      total_tokens: "$.usage.total_tokens"
+      # Some providers include additional metrics
+      ttft_ms: "$.metrics.time_to_first_token"
+      total_ms: "$.metrics.total_time"
+    calculations:
+      # Derive IsComplete from finish_reason presence (OpenAI doesn't have a separate 'done' field)
+      is_complete: 'len(finish_reason) > 0'
+      # Safe division: multiply first for precision, then divide with guard against zero
+      tokens_per_second: "total_ms > 0 ? (output_tokens * 1000.0) / total_ms : 0"

config/profiles/vllm.yaml

Lines changed: 23 additions & 1 deletion
@@ -221,4 +221,26 @@ features:
   # Continuous batching
   continuous_batching:
     enabled: true
-    description: "Dynamic batching for optimal GPU utilisation"
+    description: "Dynamic batching for optimal GPU utilisation"
+
+# Metrics extraction for vLLM responses
+metrics:
+  extraction:
+    enabled: true
+    source: response_body
+    format: json
+    # vLLM uses OpenAI-compatible format for chat/completions endpoints
+    paths:
+      model: "$.model"
+      finish_reason: "$.choices[0].finish_reason" # String value (e.g., "stop", "length")
+      input_tokens: "$.usage.prompt_tokens"
+      output_tokens: "$.usage.completion_tokens"
+      total_tokens: "$.usage.total_tokens"
+      # vLLM may include additional performance metrics
+      ttft_ms: "$.metrics.time_to_first_token_ms"
+      generation_time_ms: "$.metrics.generation_time_ms"
+    calculations:
+      # Derive IsComplete from finish_reason presence (vLLM doesn't have a separate 'done' field)
+      is_complete: 'len(finish_reason) > 0'
+      # Safe division: multiply first for precision, then divide with guard against zero
+      tokens_per_second: "generation_time_ms > 0 ? (output_tokens * 1000.0) / generation_time_ms : 0"

docs/content/api-reference/overview.md

Lines changed: 17 additions & 0 deletions
@@ -88,6 +88,23 @@ All responses include:
 | `X-Olla-Model` | Model used (if applicable) |
 | `X-Olla-Backend-Type` | Provider type (ollama/lmstudio/openai/vllm) |
 | `X-Olla-Response-Time` | Total processing time |
+| `X-Olla-Routing-Strategy` | Routing strategy used (when model routing is active) |
+| `X-Olla-Routing-Decision` | Routing decision made (routed/fallback/rejected) |
+| `X-Olla-Routing-Reason` | Human-readable reason for routing decision |
+
+### Provider Metrics (Debug Logs)
+
+When available, provider-specific performance metrics are extracted from responses and included in debug logs:
+
+| Metric | Description | Providers |
+|--------|-------------|-----------|
+| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
+| `provider_prompt_tokens` | Tokens in prompt (count) | All |
+| `provider_completion_tokens` | Tokens generated (count) | All |
+| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
+| `provider_model` | Actual model used | All |
+
+See [Provider Metrics](../concepts/provider-metrics.md) for detailed information.
 
 ## Error Responses
 
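The X-Olla-* values documented above are ordinary HTTP response headers, so any client can read them alongside the proxied response. A hedged Go sketch follows; the endpoint URL and request payload are placeholders, not taken from the docs.

    // Sketch: call an Olla-proxied endpoint and print the X-Olla-* headers listed above.
    // The URL and payload are placeholders for illustration; substitute your own deployment.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func main() {
        payload := bytes.NewBufferString(`{"model":"example-model","messages":[{"role":"user","content":"hi"}]}`)
        resp, err := http.Post("http://localhost:8080/your-olla-endpoint", "application/json", payload)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        for _, h := range []string{
            "X-Olla-Model",
            "X-Olla-Backend-Type",
            "X-Olla-Response-Time",
            "X-Olla-Routing-Strategy",
            "X-Olla-Routing-Decision",
            "X-Olla-Routing-Reason",
        } {
            fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
        }
    }
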
docs/content/concepts/overview.md

Lines changed: 10 additions & 0 deletions
@@ -72,6 +72,16 @@ Provider-specific configuration templates:
 
 The profile system ensures compatibility with various LLM providers.
 
+### [Provider Metrics](provider-metrics.md)
+Real-time performance metrics extraction:
+
+- Automatic extraction from provider responses
+- Token usage and generation speed tracking
+- Processing latency measurements
+- Best-effort extraction with zero performance impact
+
+Provider metrics give insights into model performance and resource usage.
+
 ## How Components Work Together
 
 ```mermaid
