23 changes: 23 additions & 0 deletions eval/Cargo.toml
@@ -0,0 +1,23 @@
[workspace]

[package]
name = "iii-eval"
version = "0.1.0"
edition = "2021"
publish = false

[[bin]]
name = "iii-eval"
path = "src/main.rs"

[dependencies]
iii-sdk = { version = "0.10.0", features = ["otel"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros", "sync", "signal"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
serde_yaml = "0.9"
anyhow = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }
clap = { version = "4", features = ["derive"] }
chrono = { version = "0.4", features = ["serde"] }
64 changes: 64 additions & 0 deletions eval/README.md
@@ -0,0 +1,64 @@
# iii-eval

Every observability platform shows you dashboards. None of them score your function fleet's health as a single number, detect drift against a known-good baseline, or run inside the same engine your functions run on. iii-eval does. It ingests OTel spans, computes latency percentiles, scores system health, and tells you when something drifts — all as iii functions that any other worker can call.

**Plug and play:** Build with `cargo build --release`, then run `./target/release/iii-eval --url ws://your-engine:49134`. It connects, registers 7 functions, and starts ingesting telemetry. No config required — defaults work out of the box. Any connected worker (or the console chat bar) can call `eval::metrics`, `eval::score`, or `eval::analyze_traces` immediately.

## Functions

| Function ID | Description |
|---|---|
| `eval::ingest` | Append a span to state, keyed by function ID |
| `eval::metrics` | Compute percentiles, success rate, and throughput for a function |
| `eval::score` | Weighted health score (0-100) across all tracked functions |
| `eval::drift` | Compare current metrics against saved baselines across 5 dimensions |
| `eval::baseline` | Snapshot current metrics as the drift reference point |
| `eval::report` | Combined metrics + drift + score report for all functions |

## iii Primitives Used

- **State** -- span storage, baselines, function index
- **PubSub** -- subscribes to `telemetry.spans` topic for automatic ingestion
- **Cron** -- periodic drift detection
- **HTTP** -- all functions exposed as REST endpoints

## Prerequisites

- Rust 1.75+
- Running iii engine on `ws://127.0.0.1:49134`

## Build

```bash
cargo build --release
```

## Usage

```bash
./target/release/iii-eval --url ws://127.0.0.1:49134 --config ./config.yaml
```

```text
Options:
--config <PATH> Path to config.yaml [default: ./config.yaml]
--url <URL> WebSocket URL of the iii engine [default: ws://127.0.0.1:49134]
--manifest Output module manifest as JSON and exit
-h, --help Print help
```
Comment on lines +42 to +48

⚠️ Potential issue | 🟡 Minor

Add a language to the fenced options block.

At Line 40, the code fence has no language tag (MD040).

Proposed fix: change the opening fence from `` ``` `` to `` ```text ``, leaving the block contents and closing fence unchanged.

## Configuration

```yaml
retention_hours: 24 # how long to keep spans (reserved)
drift_threshold: 0.15 # 15% change triggers drift alert
cron_drift_check: "0 */10 * * * *" # every 10 minutes
max_spans_per_function: 1000 # ring buffer size per function
baseline_window_minutes: 60 # reserved for windowed baseline
```

## Tests

```bash
cargo test
```
183 changes: 183 additions & 0 deletions eval/SPEC.md
@@ -0,0 +1,183 @@
# iii-eval worker

OTel-native evaluation worker for iii-engine. Consumes function execution telemetry, computes latency percentiles and success rates, scores system health, and detects metric drift against saved baselines. Designed to sit behind any worker that emits span data via the `telemetry.spans` PubSub topic.

## Why This Exists

Every observability platform (Datadog, Grafana, Honeycomb) shows you dashboards. None of them score your function fleet's health as a single number, detect drift against a known-good baseline, or run inside the same engine your functions run on.

The gap: **a self-contained evaluation loop that lives where your functions live** — no external infra, no separate deploy, no dashboards to check. Just a worker that ingests spans, computes metrics, and tells you when something drifts.

## Architecture

```text
Your Workers → OTel spans → PubSub topic "telemetry.spans" → eval::ingest
                                   ↓
                      eval:spans:{fn_id} (state)
                                   ↓
    eval::metrics / eval::score / eval::drift / eval::report
```

The worker subscribes to `telemetry.spans` via a PubSub trigger. Every span ingested is stored in state keyed by function ID. Metrics, scoring, drift detection, and reporting read from that state on demand. A cron trigger runs drift detection periodically.

## State Scopes

```text
eval:spans:{function_id}     — array of span objects (capped at max_spans_per_function)
eval:baselines:{function_id} — baseline metrics snapshot for drift comparison
eval:index:function_list     — list of all tracked function IDs
```
Comment on lines +25 to +29

⚠️ Potential issue | 🟡 Minor

State scope key mismatch with implementation.

The spec says eval:function_index but the code (ingest.rs:10-11) uses:

  • SCOPE_INDEX = "eval:index"
  • INDEX_KEY = "function_list"

Update the spec to match the implementation.

Suggested fix:

```diff
 eval:spans:{function_id} — array of span objects (capped at max_spans_per_function)
 eval:baselines:{function_id} — baseline metrics snapshot for drift comparison
-eval:function_index — list of all tracked function IDs
+eval:index:function_list — list of all tracked function IDs
```


## Functions (6)

### eval::ingest

```text
Input: {
  function_id: string,     (required)
  duration_ms: integer,    (required)
  success: boolean,        (required)
  error?: string,
  input_hash?: string,
  output_hash?: string,
  timestamp?: string,      (ISO 8601, defaults to now)
  trace_id?: string,
  worker_id?: string
}
Output: {ingested: true, function_id, total_spans}
```

Appends a span to `eval:spans:{function_id}`. Trims the list to `max_spans_per_function` (oldest evicted first). Maintains the function index (state scope `eval:index`, key `function_list`) for discovery by other functions.
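The capped append can be sketched as a simple trim after each push (illustrative only: `append_capped` is a hypothetical helper and real spans are JSON objects, modeled here as bare `u64` durations):

```rust
// Append a new span record, then evict the oldest entries beyond the cap.
fn append_capped(spans: &mut Vec<u64>, span: u64, cap: usize) {
    spans.push(span);
    if spans.len() > cap {
        let excess = spans.len() - cap;
        spans.drain(..excess); // oldest evicted first
    }
}

fn main() {
    let mut spans = vec![1, 2, 3];
    for s in 4..=6 {
        append_capped(&mut spans, s, 4);
    }
    println!("{spans:?}"); // only the four most recent spans remain
}
```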

### eval::metrics

```text
Input: {function_id: string}
Output: {
  function_id, p50_ms, p95_ms, p99_ms,
  success_rate, total_invocations, avg_duration_ms,
  error_count, throughput_per_min
}
```

Reads spans from state, sorts durations, computes percentiles via index-based lookup. Throughput calculated from timestamp range of stored spans.
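An index-based percentile lookup over sorted durations can be sketched as below (nearest-rank style; the worker's exact rounding rule is an assumption, not taken from the code):

```rust
// Pick the duration at the index closest to p percent through the sorted list.
fn percentile(sorted_ms: &[u64], p: f64) -> u64 {
    assert!(!sorted_ms.is_empty());
    let idx = ((p / 100.0) * (sorted_ms.len() - 1) as f64).round() as usize;
    sorted_ms[idx]
}

fn main() {
    let mut durations = vec![120, 45, 80, 30, 200, 60, 90, 50, 70, 40];
    durations.sort_unstable(); // percentiles require sorted input
    println!(
        "p50={} p95={} p99={}",
        percentile(&durations, 50.0),
        percentile(&durations, 95.0),
        percentile(&durations, 99.0)
    );
}
```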

### eval::score

```text
Input: {}
Output: {
  overall_score: 0-100,
  issues: [{function_id, issue, value}],
  suggestions: [string],
  functions_evaluated: integer,
  timestamp: string
}
```

Iterates all tracked functions, computes metrics for each, and produces a weighted health score. Penalties applied for:
- Success rate below 95% (up to -200 points proportional to gap)
- P99 latency above 5000ms (up to -30 points)

Score is the average across all functions, clamped to 0-100.
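A per-function penalty following these rules can be sketched as below. The penalty caps (-200, -30) come from the spec; the exact proportionality factors are assumptions, not the worker's real weights:

```rust
// Score one function: start at 100, subtract capped penalties.
fn function_score(success_rate: f64, p99_ms: f64) -> f64 {
    let mut score = 100.0;
    if success_rate < 0.95 {
        // up to -200 points, proportional to the gap below 95%
        score -= (200.0 * (0.95 - success_rate) / 0.95).min(200.0);
    }
    if p99_ms > 5000.0 {
        // up to -30 points for p99 latency above 5000 ms
        score -= ((p99_ms - 5000.0) / 5000.0 * 30.0).min(30.0);
    }
    score
}

fn main() {
    // (success_rate, p99_ms) for two tracked functions
    let fleet = [(0.99, 120.0), (0.90, 6200.0)];
    let overall: f64 = fleet.iter().map(|&(s, p)| function_score(s, p)).sum::<f64>()
        / fleet.len() as f64;
    println!("overall_score = {:.1}", overall.clamp(0.0, 100.0));
}
```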

### eval::drift

```text
Input: {function_id?: string}   (omit to check all functions)
Output: {
  results: [{
    function_id, drifted: boolean,
    dimension?, baseline_value?, current_value?, delta_pct?
  }],
  threshold: number,
  timestamp: string
}
```

Compares current metrics against saved baselines across 5 dimensions (p50, p95, p99, success_rate, avg_duration). A dimension drifts when `|current - baseline| / baseline > drift_threshold`. If no baseline exists, returns `reason: "no_baseline"`.
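The per-dimension drift predicate is a one-liner (a zero baseline is treated as "no drift" here to avoid division by zero; that guard is an assumption, not spec):

```rust
// Relative change of current vs. baseline exceeds the threshold → drifted.
fn dimension_drifted(baseline: f64, current: f64, threshold: f64) -> bool {
    if baseline == 0.0 {
        return false; // relative change undefined for a zero baseline
    }
    ((current - baseline) / baseline).abs() > threshold
}

fn main() {
    // p95 moved from 100 ms to 120 ms: a 20% change against a 15% threshold.
    println!("{}", dimension_drifted(100.0, 120.0, 0.15)); // prints "true"
}
```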

### eval::baseline

```text
Input: {function_id: string}
Output: {saved: true, function_id, baseline: {...}}
```

Snapshots current metrics for a function and stores them at `eval:baselines:{function_id}`. Used as the reference point for drift detection. Call this after a known-good deploy.

### eval::report

```text
Input: {}
Output: {
  functions: [{function_id, metrics, has_baseline, drift}],
  score: {overall_score, issues, suggestions, ...},
  total_functions: integer,
  timestamp: string
}
```

Combines metrics + drift + score into a single comprehensive report across all tracked functions.

## Triggers (2)

```text
Cron (1):
  expression from config (default "0 */10 * * * *") → eval::drift
  Runs periodic drift detection across all functions.

PubSub (1):
  topic "telemetry.spans" → eval::ingest
  Subscribes to OTel span data emitted by the engine or other workers.
```

## Config (config.yaml)

```yaml
retention_hours: 24 # how long to keep spans (not yet enforced, reserved)
drift_threshold: 0.15 # 15% change triggers drift alert
cron_drift_check: "0 */10 * * * *" # every 10 minutes
max_spans_per_function: 1000 # ring buffer size per function
baseline_window_minutes: 60 # reserved for windowed baseline
```

## Integration with Other Workers

- **Any worker with OTel**: Publish spans to `telemetry.spans` topic. The eval worker picks them up automatically.
- **llm-router / llm-budget**: Ingest routing decisions and budget checks as spans to track decision latency and budget enforcement accuracy.
- **sensor**: Feed sensor readings as spans to detect telemetry pipeline degradation.
- **image-resize**: Track resize latency and error rates across different image formats.

## Example Flow

```bash
# 1. Ingest some span data
curl -X POST localhost:3111/api/eval/ingest -d '{
  "function_id": "image_resize::resize",
  "duration_ms": 45,
  "success": true,
  "trace_id": "abc123"
}'

# 2. Check metrics
curl -X POST localhost:3111/api/eval/metrics -d '{
  "function_id": "image_resize::resize"
}'
# → {"p50_ms": 42, "p95_ms": 120, "p99_ms": 180, "success_rate": 0.98, ...}

# 3. Save baseline after verified good deploy
curl -X POST localhost:3111/api/eval/baseline -d '{
  "function_id": "image_resize::resize"
}'

# 4. Later, check for drift
curl -X POST localhost:3111/api/eval/drift -d '{
  "function_id": "image_resize::resize"
}'
# → {"results": [{"function_id": "image_resize::resize", "drifted": false}]}

# 5. Get full system report
curl -X POST localhost:3111/api/eval/report -d '{}'
# → {"overall_score": 94, "functions": [...], ...}
```
6 changes: 6 additions & 0 deletions eval/build.rs
@@ -0,0 +1,6 @@
fn main() {
    println!(
        "cargo:rustc-env=TARGET={}",
        std::env::var("TARGET").unwrap()
    );
}
5 changes: 5 additions & 0 deletions eval/config.yaml
@@ -0,0 +1,5 @@
retention_hours: 24
drift_threshold: 0.15
cron_drift_check: "0 */10 * * * *"
max_spans_per_function: 1000
baseline_window_minutes: 60
95 changes: 95 additions & 0 deletions eval/src/config.rs
@@ -0,0 +1,95 @@
use anyhow::Result;
use serde::Deserialize;

#[derive(Deserialize, Debug, Clone)]
pub struct EvalConfig {
    #[serde(default = "default_retention_hours")]
    pub retention_hours: u64,
    #[serde(default = "default_drift_threshold")]
    pub drift_threshold: f64,
    #[serde(default = "default_cron_drift_check")]
    pub cron_drift_check: String,
    #[serde(default = "default_max_spans_per_function")]
    pub max_spans_per_function: usize,
    #[allow(dead_code)]
    #[serde(default = "default_baseline_window_minutes")]
    pub baseline_window_minutes: u64,
}
Comment on lines +4 to +17

⚠️ Potential issue | 🟠 Major

Reject unknown config keys to avoid silent misconfiguration.

A typo in YAML keys will currently be ignored and defaults will be applied silently. Add deny_unknown_fields on EvalConfig so invalid config fails fast.

Proposed fix:

```diff
-#[derive(Deserialize, Debug, Clone)]
+#[derive(Deserialize, Debug, Clone)]
+#[serde(deny_unknown_fields)]
 pub struct EvalConfig {
```

fn default_retention_hours() -> u64 {
    24
}

fn default_drift_threshold() -> f64 {
    0.15
}

fn default_cron_drift_check() -> String {
    "0 */10 * * * *".to_string()
}

fn default_max_spans_per_function() -> usize {
    1000
}

fn default_baseline_window_minutes() -> u64 {
    60
}

impl Default for EvalConfig {
    fn default() -> Self {
        EvalConfig {
            retention_hours: default_retention_hours(),
            drift_threshold: default_drift_threshold(),
            cron_drift_check: default_cron_drift_check(),
            max_spans_per_function: default_max_spans_per_function(),
            baseline_window_minutes: default_baseline_window_minutes(),
        }
    }
}

pub fn load_config(path: &str) -> Result<EvalConfig> {
    let contents = std::fs::read_to_string(path)?;
    let config: EvalConfig = serde_yaml::from_str(&contents)?;
    Ok(config)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_config_defaults() {
        let config: EvalConfig = serde_yaml::from_str("{}").unwrap();
        assert_eq!(config.retention_hours, 24);
        assert!((config.drift_threshold - 0.15).abs() < f64::EPSILON);
        assert_eq!(config.cron_drift_check, "0 */10 * * * *");
        assert_eq!(config.max_spans_per_function, 1000);
        assert_eq!(config.baseline_window_minutes, 60);
    }

    #[test]
    fn test_config_custom() {
        let yaml = r#"
retention_hours: 48
drift_threshold: 0.25
cron_drift_check: "0 */5 * * * *"
max_spans_per_function: 500
baseline_window_minutes: 120
"#;
        let config: EvalConfig = serde_yaml::from_str(yaml).unwrap();
        assert_eq!(config.retention_hours, 48);
        assert!((config.drift_threshold - 0.25).abs() < f64::EPSILON);
        assert_eq!(config.cron_drift_check, "0 */5 * * * *");
        assert_eq!(config.max_spans_per_function, 500);
        assert_eq!(config.baseline_window_minutes, 120);
    }

    #[test]
    fn test_eval_config_default() {
        let config = EvalConfig::default();
        assert_eq!(config.retention_hours, 24);
        assert!((config.drift_threshold - 0.15).abs() < f64::EPSILON);
        assert_eq!(config.max_spans_per_function, 1000);
    }
}