Conversation

Bosun CI signal: Bosun-created PR currently has failing checks.
Pull request overview
Adds an agent evaluation/benchmarking framework to the repo, exposing it via a new bosun eval CLI entry point and covering core flows with new Vitest suites.
Changes:
- Introduces `bench/eval-framework.mjs` with benchmark/task models, metric helpers, run persistence, comparison, regression detection, and an eval CLI dispatcher.
- Wires `bosun eval`/`--eval` routing into `cli.mjs` and documents the commands in top-level help.
- Adds tests for the framework behaviors and CLI routing; ensures the new framework file is included in the packaged `files` allowlist.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/eval-framework.test.mjs | New test coverage for importing benchmarks, running evaluations, persisting results, summaries, comparisons, and CLI flows. |
| tests/cli-eval-routing.test.mjs | Ensures eval routing happens before generic --help handling and help text includes eval docs. |
| package.json | Adds bench/eval-framework.mjs to the published files allowlist. |
| cli.mjs | Adds eval command/flag routing and help text entries. |
| bench/eval-framework.mjs | Implements the evaluation & benchmarking framework and eval CLI subcommands. |
```js
const baselineTaskValues = ensureArray(baseline?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
const candidateTaskValues = ensureArray(candidate?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
metricDeltas[key] = {
  baseline: baseValue,
  candidate: candidateValue,
  delta,
  significance: computeSignificance(baselineTaskValues, candidateTaskValues),
```
compareEvaluationRuns() computes significance using baseline.summary.perTask values for every key, but perTask entries don't include metrics like p95Tokens, p95TimeMs, or totalCostUsd (they'll be undefined → coerced to 0). That yields misleading significance results (often comparing distributions of zeros). Consider either limiting significance to keys present in perTask summaries, or computing per-task/per-repeat distributions from run.results for each metric.
Suggested change:

```diff
-const baselineTaskValues = ensureArray(baseline?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
-const candidateTaskValues = ensureArray(candidate?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
-metricDeltas[key] = {
-  baseline: baseValue,
-  candidate: candidateValue,
-  delta,
-  significance: computeSignificance(baselineTaskValues, candidateTaskValues),
+const baselinePerTask = ensureArray(baseline?.summary?.perTask);
+const candidatePerTask = ensureArray(candidate?.summary?.perTask);
+// Only compute per-task significance when at least one run has this metric at the per-task level.
+const hasPerTaskMetric =
+  baselinePerTask.some((entry) => entry && entry[key] != null) ||
+  candidatePerTask.some((entry) => entry && entry[key] != null);
+let significance = null;
+if (hasPerTaskMetric) {
+  const baselineTaskValues = baselinePerTask
+    .map((entry) => toNumber(entry?.[key]))
+    .filter((value) => Number.isFinite(value));
+  const candidateTaskValues = candidatePerTask
+    .map((entry) => toNumber(entry?.[key]))
+    .filter((value) => Number.isFinite(value));
+  if (baselineTaskValues.length > 0 && candidateTaskValues.length > 0) {
+    significance = computeSignificance(baselineTaskValues, candidateTaskValues);
+  }
+}
+metricDeltas[key] = {
+  baseline: baseValue,
+  candidate: candidateValue,
+  delta,
+  significance,
```
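The comment's second option, deriving distributions directly from `run.results` instead of the per-task summaries, could look roughly like this sketch. The `run.results` and per-result `outcome` field names here are assumptions based on the snippets elsewhere in this review, not confirmed framework API:

```javascript
// Sketch: collect a per-result distribution by applying an accessor to each
// entry of run.results, so significance can be computed on raw values
// (tokens, duration) even when perTask summaries lack the metric.
function metricDistribution(run, accessor) {
  const results = Array.isArray(run?.results) ? run.results : [];
  return results
    .map((result) => Number(accessor(result)))
    .filter((value) => Number.isFinite(value));
}

// Example accessors; outcome field names follow the snippets in this review.
const tokenValues = (run) => metricDistribution(run, (r) => r?.outcome?.totalTokens);
const timeValues = (run) => metricDistribution(run, (r) => r?.outcome?.durationMs);
```

Non-finite entries are dropped rather than coerced to 0, which avoids the comparing-distributions-of-zeros problem described above.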
```js
entry.repeats += 1;
if (result.metrics?.TaskSuccess === "pass") entry.passCount += 1;
entry.tokenValues.push(toNumber(result.outcome?.totalTokens, toNumber(result.outcome?.tokensInput) + toNumber(result.outcome?.tokensOutput)));
entry.timeValues.push(toNumber(result.outcome?.durationMs));
```
buildTaskSummaryEntries() uses result.outcome.durationMs for per-task avgTimeMs, while summarizeRun() uses the TimeToComplete metric if present. If callers override TimeToComplete (as your tests do), run-level avgTimeMs and per-task avgTimeMs can diverge unexpectedly. Consider deriving per-task time values from result.metrics.TimeToComplete (with the same fallback) to keep summaries consistent.
Suggested change:

```diff
-entry.timeValues.push(toNumber(result.outcome?.durationMs));
+const timeToComplete = result.metrics?.TimeToComplete ?? result.outcome?.durationMs;
+entry.timeValues.push(toNumber(timeToComplete));
```
```js
  this.runner = options.runner || defaultRunner();
  this.parallelism = Math.max(1, toNumber(options.parallelism, 1));
  this.storageAdapter = options.storageAdapter || null;
}
```
Evaluator accepts a parallelism option and records it on the run, but evaluate() executes all strategy/task/repeat iterations sequentially and never uses this.parallelism. Either implement concurrency (e.g., a bounded work queue) or drop/avoid exposing the option to prevent confusion about expected runtime behavior.
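If concurrency is implemented, a dependency-free bounded work queue is enough; a minimal sketch (`mapWithConcurrency` is a hypothetical helper name, not part of the framework):

```javascript
// Sketch: run `fn` over `items` with at most `limit` invocations in flight,
// preserving result order. Each worker pulls the next unclaimed index; since
// JS is single-threaded, the `next++` read-increment before the await is safe.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

`evaluate()` could then feed its strategy/task/repeat iterations through this with `this.parallelism` as the limit, which would make the recorded option match actual runtime behavior.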
```js
const commandStartIndex = evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex;
const evalArgs = args.slice(commandStartIndex + 1);
const { runEvalCli } = await import("./bench/eval-framework.mjs");
const { exitCode } = await runEvalCli(evalArgs);
process.exit(exitCode);
```
In eval routing, commandStartIndex always prefers the eval subcommand index over --eval when both are present, even if --eval appears earlier in args; using the earliest of the two would make slicing behavior deterministic. Also, process.exit(exitCode) will exit 0 if exitCode is undefined—using exitCode ?? 0 would align with other routing blocks (e.g. tui) and be safer if runEvalCli ever returns without an explicit code.
Suggested change:

```diff
-const commandStartIndex = evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex;
-const evalArgs = args.slice(commandStartIndex + 1);
-const { runEvalCli } = await import("./bench/eval-framework.mjs");
-const { exitCode } = await runEvalCli(evalArgs);
-process.exit(exitCode);
+const commandStartIndex =
+  evalFlagIndex >= 0 && evalCommandIndex >= 0
+    ? Math.min(evalFlagIndex, evalCommandIndex)
+    : (evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex);
+const evalArgs = args.slice(commandStartIndex + 1);
+const { runEvalCli } = await import("./bench/eval-framework.mjs");
+const { exitCode } = await runEvalCli(evalArgs);
+process.exit(exitCode ?? 0);
```
```js
const dir = resolve(resultsDir);
if (!existsSync(dir)) return [];
return readdirSync(dir)
  .filter((name) => name.endsWith(".json"))
```
listStoredEvaluationRuns() currently returns every *.json file in the results directory. This will incorrectly treat non-run JSON files (e.g. a benchmark.json stored alongside results) as evaluation runs, which breaks bosun eval history and also contradicts the new test that writes benchmark.json into the same dir. Filter by the run filename convention (e.g. eval-*.json) and/or validate the parsed shape (has runId/summary/results) before returning entries.
Suggested change:

```diff
-  .filter((name) => name.endsWith(".json"))
+  .filter((name) => name.startsWith("eval-") && name.endsWith(".json"))
```
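The shape validation mentioned in the comment could be a small predicate applied after `JSON.parse`; a sketch, where the `runId`/`summary`/`results` field names follow the comment above rather than a confirmed schema:

```javascript
// Sketch: accept only parsed JSON objects that look like evaluation runs,
// so stray files such as benchmark.json are skipped even when they pass
// the filename filter. Field names follow the review comment above.
function looksLikeEvaluationRun(parsed) {
  return (
    parsed !== null &&
    typeof parsed === "object" &&
    !Array.isArray(parsed) &&
    typeof parsed.runId === "string" &&
    parsed.summary !== null &&
    typeof parsed.summary === "object" &&
    Array.isArray(parsed.results)
  );
}
```

Combining the filename filter with this predicate guards both `bosun eval history` and the new test that writes `benchmark.json` into the same directory.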
Bosun PR classification: Bosun-created.
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
Task-ID: bc219832-c3a1-4baa-8191-cedcbc9e96cb

Automated PR for task bc219832-c3a1-4baa-8191-cedcbc9e96cb

---

Bosun-Origin: created