23 changes: 23 additions & 0 deletions eval/Cargo.toml
@@ -0,0 +1,23 @@
[workspace]

[package]
name = "iii-eval"
version = "0.1.0"
edition = "2021"
publish = false

[[bin]]
name = "iii-eval"
path = "src/main.rs"

[dependencies]
iii-sdk = { version = "0.10.0", features = ["otel"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros", "sync", "signal"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
serde_yaml = "0.9"
anyhow = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }
clap = { version = "4", features = ["derive"] }
chrono = { version = "0.4", features = ["serde"] }
64 changes: 64 additions & 0 deletions eval/README.md
@@ -0,0 +1,64 @@
# iii-eval

Every observability platform shows you dashboards. None of them score your function fleet's health as a single number, detect drift against a known-good baseline, or run inside the same engine your functions run on. iii-eval does. It ingests OTel spans, computes latency percentiles, scores system health, and tells you when something drifts — all as iii functions that any other worker can call.

**Plug and play:** Build with `cargo build --release`, then run `./target/release/iii-eval --url ws://your-engine:49134`. It connects, registers 7 functions, and starts ingesting telemetry. No config required — defaults work out of the box. Any connected worker (or the console chat bar) can call `eval::metrics`, `eval::score`, or `eval::analyze_traces` immediately.

## Functions

| Function ID | Description |
|---|---|
| `eval::ingest` | Append a span to state, keyed by function ID |
| `eval::metrics` | Compute percentiles, success rate, and throughput for a function |
| `eval::score` | Weighted health score (0-100) across all tracked functions |
| `eval::drift` | Compare current metrics against saved baselines across 5 dimensions |
| `eval::baseline` | Snapshot current metrics as the drift reference point |
| `eval::report` | Combined metrics + drift + score report for all functions |

## iii Primitives Used

- **State** -- span storage, baselines, function index
- **PubSub** -- subscribes to `telemetry.spans` topic for automatic ingestion
- **Cron** -- periodic drift detection
- **HTTP** -- all functions exposed as REST endpoints

## Prerequisites

- Rust 1.75+
- Running iii engine on `ws://127.0.0.1:49134`

## Build

```bash
cargo build --release
```

## Usage

```bash
./target/release/iii-eval --url ws://127.0.0.1:49134 --config ./config.yaml
```

```text
Options:
--config <PATH> Path to config.yaml [default: ./config.yaml]
--url <URL> WebSocket URL of the iii engine [default: ws://127.0.0.1:49134]
--manifest Output module manifest as JSON and exit
-h, --help Print help
```
Comment on lines +42 to +48

⚠️ Potential issue | 🟡 Minor

Add a language to the fenced options block.

At Line 40, the code fence has no language tag (MD040).

Proposed fix: change the opening fence from `` ``` `` to `` ```text ``, leaving the block contents and closing fence unchanged.

## Configuration

```yaml
retention_hours: 24 # how long to keep spans (reserved)
drift_threshold: 0.15 # 15% change triggers drift alert
cron_drift_check: "0 */10 * * * *" # every 10 minutes
max_spans_per_function: 1000 # ring buffer size per function
baseline_window_minutes: 60 # reserved for windowed baseline
```

## Tests

```bash
cargo test
```
183 changes: 183 additions & 0 deletions eval/SPEC.md
@@ -0,0 +1,183 @@
# iii-eval worker

OTel-native evaluation worker for iii-engine. Consumes function execution telemetry, computes latency percentiles and success rates, scores system health, and detects metric drift against saved baselines. Designed to sit behind any worker that emits span data via the `telemetry.spans` PubSub topic.

## Why This Exists

Every observability platform (Datadog, Grafana, Honeycomb) shows you dashboards. None of them score your function fleet's health as a single number, detect drift against a known-good baseline, or run inside the same engine your functions run on.

The gap: **a self-contained evaluation loop that lives where your functions live** — no external infra, no separate deploy, no dashboards to check. Just a worker that ingests spans, computes metrics, and tells you when something drifts.

## Architecture

```text
Your Workers → OTel spans → PubSub topic "telemetry.spans" → eval::ingest
                                   ↓
                      eval:spans:{fn_id} (state)
                                   ↓
    eval::metrics / eval::score / eval::drift / eval::report
```

The worker subscribes to `telemetry.spans` via a PubSub trigger. Every span ingested is stored in state keyed by function ID. Metrics, scoring, drift detection, and reporting read from that state on demand. A cron trigger runs drift detection periodically.

## State Scopes

```text
eval:spans:{function_id}     — array of span objects (capped at max_spans_per_function)
eval:baselines:{function_id} — baseline metrics snapshot for drift comparison
eval:index:function_list     — list of all tracked function IDs
```
Comment on lines +25 to +29

⚠️ Potential issue | 🟡 Minor

State scope key mismatch with implementation.

The spec says eval:function_index but the code (ingest.rs:10-11) uses:

  • SCOPE_INDEX = "eval:index"
  • INDEX_KEY = "function_list"

Update the spec to match the implementation.

Suggested fix:

```diff
 eval:spans:{function_id} — array of span objects (capped at max_spans_per_function)
 eval:baselines:{function_id} — baseline metrics snapshot for drift comparison
-eval:function_index — list of all tracked function IDs
+eval:index:function_list — list of all tracked function IDs
```


## Functions (6)

### eval::ingest

```text
Input: {
  function_id: string,     (required)
  duration_ms: integer,    (required)
  success: boolean,        (required)
  error?: string,
  input_hash?: string,
  output_hash?: string,
  timestamp?: string,      (ISO 8601, defaults to now)
  trace_id?: string,
  worker_id?: string
}
Output: {ingested: true, function_id, total_spans}
```

Appends a span to `eval:spans:{function_id}`. Trims the list to `max_spans_per_function` (oldest evicted first). Maintains the function index (state scope `eval:index`, key `function_list`) for discovery by other functions.
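The capped append can be sketched as a simple trim after each push (illustrative only: `append_capped` is a hypothetical helper and real spans are JSON objects, modeled here as bare `u64` durations):

```rust
// Append a new span record, then evict the oldest entries beyond the cap.
fn append_capped(spans: &mut Vec<u64>, span: u64, cap: usize) {
    spans.push(span);
    if spans.len() > cap {
        let excess = spans.len() - cap;
        spans.drain(..excess); // oldest evicted first
    }
}

fn main() {
    let mut spans = vec![1, 2, 3];
    for s in 4..=6 {
        append_capped(&mut spans, s, 4);
    }
    println!("{spans:?}"); // only the four most recent spans remain
}
```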

### eval::metrics

```text
Input: {function_id: string}
Output: {
  function_id, p50_ms, p95_ms, p99_ms,
  success_rate, total_invocations, avg_duration_ms,
  error_count, throughput_per_min
}
```

Reads spans from state, sorts durations, computes percentiles via index-based lookup. Throughput calculated from timestamp range of stored spans.
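An index-based percentile lookup over sorted durations can be sketched as below (nearest-rank style; the worker's exact rounding rule is an assumption, not taken from the code):

```rust
// Pick the duration at the index closest to p percent through the sorted list.
fn percentile(sorted_ms: &[u64], p: f64) -> u64 {
    assert!(!sorted_ms.is_empty());
    let idx = ((p / 100.0) * (sorted_ms.len() - 1) as f64).round() as usize;
    sorted_ms[idx]
}

fn main() {
    let mut durations = vec![120, 45, 80, 30, 200, 60, 90, 50, 70, 40];
    durations.sort_unstable(); // percentiles require sorted input
    println!(
        "p50={} p95={} p99={}",
        percentile(&durations, 50.0),
        percentile(&durations, 95.0),
        percentile(&durations, 99.0)
    );
}
```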

### eval::score

```text
Input: {}
Output: {
  overall_score: 0-100,
  issues: [{function_id, issue, value}],
  suggestions: [string],
  functions_evaluated: integer,
  timestamp: string
}
```

Iterates all tracked functions, computes metrics for each, and produces a weighted health score. Penalties applied for:
- Success rate below 95% (up to -200 points proportional to gap)
- P99 latency above 5000ms (up to -30 points)

Score is the average across all functions, clamped to 0-100.
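A per-function penalty following these rules can be sketched as below. The penalty caps (-200, -30) come from the spec; the exact proportionality factors are assumptions, not the worker's real weights:

```rust
// Score one function: start at 100, subtract capped penalties.
fn function_score(success_rate: f64, p99_ms: f64) -> f64 {
    let mut score = 100.0;
    if success_rate < 0.95 {
        // up to -200 points, proportional to the gap below 95%
        score -= (200.0 * (0.95 - success_rate) / 0.95).min(200.0);
    }
    if p99_ms > 5000.0 {
        // up to -30 points for p99 latency above 5000 ms
        score -= ((p99_ms - 5000.0) / 5000.0 * 30.0).min(30.0);
    }
    score
}

fn main() {
    // (success_rate, p99_ms) for two tracked functions
    let fleet = [(0.99, 120.0), (0.90, 6200.0)];
    let overall: f64 = fleet.iter().map(|&(s, p)| function_score(s, p)).sum::<f64>()
        / fleet.len() as f64;
    println!("overall_score = {:.1}", overall.clamp(0.0, 100.0));
}
```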

### eval::drift

```text
Input: {function_id?: string}   (omit to check all functions)
Output: {
  results: [{
    function_id, drifted: boolean,
    dimension?, baseline_value?, current_value?, delta_pct?
  }],
  threshold: number,
  timestamp: string
}
```

Compares current metrics against saved baselines across 5 dimensions (p50, p95, p99, success_rate, avg_duration). A dimension drifts when `|current - baseline| / baseline > drift_threshold`. If no baseline exists, returns `reason: "no_baseline"`.
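The per-dimension drift predicate is a one-liner (a zero baseline is treated as "no drift" here to avoid division by zero; that guard is an assumption, not spec):

```rust
// Relative change of current vs. baseline exceeds the threshold → drifted.
fn dimension_drifted(baseline: f64, current: f64, threshold: f64) -> bool {
    if baseline == 0.0 {
        return false; // relative change undefined for a zero baseline
    }
    ((current - baseline) / baseline).abs() > threshold
}

fn main() {
    // p95 moved from 100 ms to 120 ms: a 20% change against a 15% threshold.
    println!("{}", dimension_drifted(100.0, 120.0, 0.15)); // prints "true"
}
```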

### eval::baseline

```text
Input: {function_id: string}
Output: {saved: true, function_id, baseline: {...}}
```

Snapshots current metrics for a function and stores them at `eval:baselines:{function_id}`. Used as the reference point for drift detection. Call this after a known-good deploy.

### eval::report

```text
Input: {}
Output: {
  functions: [{function_id, metrics, has_baseline, drift}],
  score: {overall_score, issues, suggestions, ...},
  total_functions: integer,
  timestamp: string
}
```

Combines metrics + drift + score into a single comprehensive report across all tracked functions.

## Triggers (2)

```text
Cron (1):
  expression from config (default "0 */10 * * * *") → eval::drift
  Runs periodic drift detection across all functions.

PubSub (1):
  topic "telemetry.spans" → eval::ingest
  Subscribes to OTel span data emitted by the engine or other workers.
```

## Config (config.yaml)

```yaml
retention_hours: 24 # how long to keep spans (not yet enforced, reserved)
drift_threshold: 0.15 # 15% change triggers drift alert
cron_drift_check: "0 */10 * * * *" # every 10 minutes
max_spans_per_function: 1000 # ring buffer size per function
baseline_window_minutes: 60 # reserved for windowed baseline
```

## Integration with Other Workers

- **Any worker with OTel**: Publish spans to `telemetry.spans` topic. The eval worker picks them up automatically.
- **llm-router / llm-budget**: Ingest routing decisions and budget checks as spans to track decision latency and budget enforcement accuracy.
- **sensor**: Feed sensor readings as spans to detect telemetry pipeline degradation.
- **image-resize**: Track resize latency and error rates across different image formats.

## Example Flow

```bash
# 1. Ingest some span data
curl -X POST localhost:3111/api/eval/ingest -d '{
  "function_id": "image_resize::resize",
  "duration_ms": 45,
  "success": true,
  "trace_id": "abc123"
}'

# 2. Check metrics
curl -X POST localhost:3111/api/eval/metrics -d '{
  "function_id": "image_resize::resize"
}'
# → {"p50_ms": 42, "p95_ms": 120, "p99_ms": 180, "success_rate": 0.98, ...}

# 3. Save baseline after verified good deploy
curl -X POST localhost:3111/api/eval/baseline -d '{
  "function_id": "image_resize::resize"
}'

# 4. Later, check for drift
curl -X POST localhost:3111/api/eval/drift -d '{
  "function_id": "image_resize::resize"
}'
# → {"results": [{"function_id": "image_resize::resize", "drifted": false}]}

# 5. Get full system report
curl -X POST localhost:3111/api/eval/report -d '{}'
# → {"overall_score": 94, "functions": [...], ...}
```
6 changes: 6 additions & 0 deletions eval/build.rs
@@ -0,0 +1,6 @@
fn main() {
    println!(
        "cargo:rustc-env=TARGET={}",
        std::env::var("TARGET").unwrap()
    );
}
5 changes: 5 additions & 0 deletions eval/config.yaml
@@ -0,0 +1,5 @@
retention_hours: 24
drift_threshold: 0.15
cron_drift_check: "0 */10 * * * *"
max_spans_per_function: 1000
baseline_window_minutes: 60
95 changes: 95 additions & 0 deletions eval/src/config.rs
@@ -0,0 +1,95 @@
use anyhow::Result;
use serde::Deserialize;

#[derive(Deserialize, Debug, Clone)]
pub struct EvalConfig {
    #[serde(default = "default_retention_hours")]
    pub retention_hours: u64,
    #[serde(default = "default_drift_threshold")]
    pub drift_threshold: f64,
    #[serde(default = "default_cron_drift_check")]
    pub cron_drift_check: String,
    #[serde(default = "default_max_spans_per_function")]
    pub max_spans_per_function: usize,
    #[allow(dead_code)]
    #[serde(default = "default_baseline_window_minutes")]
    pub baseline_window_minutes: u64,
}
Comment on lines +4 to +17

⚠️ Potential issue | 🟠 Major

Reject unknown config keys to avoid silent misconfiguration.

A typo in YAML keys will currently be ignored and defaults will be applied silently. Add deny_unknown_fields on EvalConfig so invalid config fails fast.

Proposed fix:

```diff
-#[derive(Deserialize, Debug, Clone)]
+#[derive(Deserialize, Debug, Clone)]
+#[serde(deny_unknown_fields)]
 pub struct EvalConfig {
```

fn default_retention_hours() -> u64 {
    24
}

fn default_drift_threshold() -> f64 {
    0.15
}

fn default_cron_drift_check() -> String {
    "0 */10 * * * *".to_string()
}

fn default_max_spans_per_function() -> usize {
    1000
}

fn default_baseline_window_minutes() -> u64 {
    60
}

impl Default for EvalConfig {
    fn default() -> Self {
        EvalConfig {
            retention_hours: default_retention_hours(),
            drift_threshold: default_drift_threshold(),
            cron_drift_check: default_cron_drift_check(),
            max_spans_per_function: default_max_spans_per_function(),
            baseline_window_minutes: default_baseline_window_minutes(),
        }
    }
}

pub fn load_config(path: &str) -> Result<EvalConfig> {
    let contents = std::fs::read_to_string(path)?;
    let config: EvalConfig = serde_yaml::from_str(&contents)?;
    Ok(config)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_config_defaults() {
        let config: EvalConfig = serde_yaml::from_str("{}").unwrap();
        assert_eq!(config.retention_hours, 24);
        assert!((config.drift_threshold - 0.15).abs() < f64::EPSILON);
        assert_eq!(config.cron_drift_check, "0 */10 * * * *");
        assert_eq!(config.max_spans_per_function, 1000);
        assert_eq!(config.baseline_window_minutes, 60);
    }

    #[test]
    fn test_config_custom() {
        let yaml = r#"
retention_hours: 48
drift_threshold: 0.25
cron_drift_check: "0 */5 * * * *"
max_spans_per_function: 500
baseline_window_minutes: 120
"#;
        let config: EvalConfig = serde_yaml::from_str(yaml).unwrap();
        assert_eq!(config.retention_hours, 48);
        assert!((config.drift_threshold - 0.25).abs() < f64::EPSILON);
        assert_eq!(config.cron_drift_check, "0 */5 * * * *");
        assert_eq!(config.max_spans_per_function, 500);
        assert_eq!(config.baseline_window_minutes, 120);
    }

    #[test]
    fn test_eval_config_default() {
        let config = EvalConfig::default();
        assert_eq!(config.retention_hours, 24);
        assert!((config.drift_threshold - 0.15).abs() < f64::EPSILON);
        assert_eq!(config.max_spans_per_function, 1000);
    }
}