
feat(eval): Agent Evaluation & Benchmarking Framework — data-driven agent optimization #451

Open
jaeko44 wants to merge 5 commits into main from task/bc219832c3a1-feat-eval-agent-evaluation-benchmarking-framewor

Conversation


@jaeko44 jaeko44 commented Mar 26, 2026

Task-ID: bc219832-c3a1-4baa-8191-cedcbc9e96cb

Automated PR for task bc219832-c3a1-4baa-8191-cedcbc9e96cb

---

Bosun-Origin: created

@jaeko44 jaeko44 added the bosun-attached Bosun PR attachment marker label Mar 26, 2026
Copilot AI review requested due to automatic review settings March 26, 2026 03:46
@github-actions github-actions bot added the bosun-pr-public PR observed by Bosun but not trusted for high-risk automation label Mar 26, 2026
@github-actions

github-actions bot commented Mar 26, 2026

Bosun CI signal: Bosun-created PR currently has failing checks.

@jaeko44 jaeko44 added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 26, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds an agent evaluation/benchmarking framework to the repo, exposing it via a new bosun eval CLI entry point and covering core flows with new Vitest suites.

Changes:

  • Introduces bench/eval-framework.mjs with benchmark/task models, metric helpers, run persistence, comparison, regression detection, and an eval CLI dispatcher.
  • Wires bosun eval / --eval routing into cli.mjs and documents the commands in top-level help.
  • Adds tests for the framework behaviors and CLI routing; ensures the new framework file is included in the packaged files allowlist.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

  • tests/eval-framework.test.mjs: New test coverage for importing benchmarks, running evaluations, persisting results, summaries, comparisons, and CLI flows.
  • tests/cli-eval-routing.test.mjs: Ensures eval routing happens before generic --help handling and help text includes eval docs.
  • package.json: Adds bench/eval-framework.mjs to the published files allowlist.
  • cli.mjs: Adds eval command/flag routing and help text entries.
  • bench/eval-framework.mjs: Implements the evaluation & benchmarking framework and eval CLI subcommands.


Comment on lines +430 to +436
const baselineTaskValues = ensureArray(baseline?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
const candidateTaskValues = ensureArray(candidate?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
metricDeltas[key] = {
  baseline: baseValue,
  candidate: candidateValue,
  delta,
  significance: computeSignificance(baselineTaskValues, candidateTaskValues),

Copilot AI Mar 26, 2026


compareEvaluationRuns() computes significance using baseline.summary.perTask values for every key, but perTask entries don't include metrics like p95Tokens, p95TimeMs, or totalCostUsd (they'll be undefined → coerced to 0). That yields misleading significance results (often comparing distributions of zeros). Consider either limiting significance to keys present in perTask summaries, or computing per-task/per-repeat distributions from run.results for each metric.

Suggested change

  const baselineTaskValues = ensureArray(baseline?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
  const candidateTaskValues = ensureArray(candidate?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
  metricDeltas[key] = {
    baseline: baseValue,
    candidate: candidateValue,
    delta,
    significance: computeSignificance(baselineTaskValues, candidateTaskValues),

becomes:

  const baselinePerTask = ensureArray(baseline?.summary?.perTask);
  const candidatePerTask = ensureArray(candidate?.summary?.perTask);
  // Only compute per-task significance when at least one run has this metric at the per-task level.
  const hasPerTaskMetric =
    baselinePerTask.some((entry) => entry && entry[key] != null) ||
    candidatePerTask.some((entry) => entry && entry[key] != null);
  let significance = null;
  if (hasPerTaskMetric) {
    const baselineTaskValues = baselinePerTask
      .map((entry) => toNumber(entry?.[key]))
      .filter((value) => Number.isFinite(value));
    const candidateTaskValues = candidatePerTask
      .map((entry) => toNumber(entry?.[key]))
      .filter((value) => Number.isFinite(value));
    if (baselineTaskValues.length > 0 && candidateTaskValues.length > 0) {
      significance = computeSignificance(baselineTaskValues, candidateTaskValues);
    }
  }
  metricDeltas[key] = {
    baseline: baseValue,
    candidate: candidateValue,
    delta,
    significance,

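The thread leans on computeSignificance() without showing it. As a hedged illustration only (the PR's actual implementation is not visible here), a per-task significance helper could compare the two value distributions with Welch's t-statistic and a coarse |t| > 2 threshold; the function name matches the thread, but the body and return shape are assumptions:

```javascript
// Illustrative sketch, not the PR's implementation: Welch's t-statistic
// over two per-task metric distributions, with a rule-of-thumb cutoff.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function variance(values) {
  const m = mean(values);
  // Sample variance (n - 1 denominator).
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0) / (values.length - 1);
}

function computeSignificance(baselineValues, candidateValues) {
  // Need at least two samples per side to estimate variance.
  if (baselineValues.length < 2 || candidateValues.length < 2) return null;
  const se = Math.sqrt(
    variance(baselineValues) / baselineValues.length +
      variance(candidateValues) / candidateValues.length,
  );
  if (se === 0) return { tStatistic: 0, significant: false };
  const t = (mean(candidateValues) - mean(baselineValues)) / se;
  // |t| > 2 is a coarse stand-in for a proper p-value at ~95% confidence.
  return { tStatistic: t, significant: Math.abs(t) > 2 };
}
```

With a helper of this shape, feeding it the all-undefined-coerced-to-zero distributions the comment warns about would produce se === 0 and a misleading "not significant" result, which is exactly why gating on hasPerTaskMetric matters.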
entry.repeats += 1;
if (result.metrics?.TaskSuccess === "pass") entry.passCount += 1;
entry.tokenValues.push(toNumber(result.outcome?.totalTokens, toNumber(result.outcome?.tokensInput) + toNumber(result.outcome?.tokensOutput)));
entry.timeValues.push(toNumber(result.outcome?.durationMs));

Copilot AI Mar 26, 2026


buildTaskSummaryEntries() uses result.outcome.durationMs for per-task avgTimeMs, while summarizeRun() uses the TimeToComplete metric if present. If callers override TimeToComplete (as your tests do), run-level avgTimeMs and per-task avgTimeMs can diverge unexpectedly. Consider deriving per-task time values from result.metrics.TimeToComplete (with the same fallback) to keep summaries consistent.

Suggested change

  entry.timeValues.push(toNumber(result.outcome?.durationMs));

becomes:

  const timeToComplete = result.metrics?.TimeToComplete ?? result.outcome?.durationMs;
  entry.timeValues.push(toNumber(timeToComplete));

Comment on lines +337 to +340
this.runner = options.runner || defaultRunner();
this.parallelism = Math.max(1, toNumber(options.parallelism, 1));
this.storageAdapter = options.storageAdapter || null;
}

Copilot AI Mar 26, 2026


Evaluator accepts a parallelism option and records it on the run, but evaluate() executes all strategy/task/repeat iterations sequentially and never uses this.parallelism. Either implement concurrency (e.g., a bounded work queue) or drop/avoid exposing the option to prevent confusion about expected runtime behavior.
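If the concurrency route is taken, a minimal bounded work queue is enough, assuming evaluate() can flatten its strategy/task/repeat iterations into a job list first. runWithParallelism and runJob are illustrative names, not the PR's API:

```javascript
// Sketch of a bounded work queue: at most `parallelism` jobs in flight,
// results returned in the original job order.
async function runWithParallelism(jobs, parallelism, runJob) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    // The event loop is single-threaded, so `next++` between awaits is safe.
    while (next < jobs.length) {
      const index = next++;
      results[index] = await runJob(jobs[index], index);
    }
  }
  const workerCount = Math.max(1, Math.min(parallelism, jobs.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Writing results by index (rather than pushing) keeps the run's result ordering deterministic regardless of which worker finishes first, which matters if summaries are position-sensitive.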

Comment on lines +1454 to +1458
const commandStartIndex = evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex;
const evalArgs = args.slice(commandStartIndex + 1);
const { runEvalCli } = await import("./bench/eval-framework.mjs");
const { exitCode } = await runEvalCli(evalArgs);
process.exit(exitCode);

Copilot AI Mar 26, 2026


In eval routing, commandStartIndex always prefers the eval subcommand index over --eval when both are present, even if --eval appears earlier in args; using the earliest of the two would make slicing behavior deterministic. Also, process.exit(exitCode) will exit 0 if exitCode is undefined—using exitCode ?? 0 would align with other routing blocks (e.g. tui) and be safer if runEvalCli ever returns without an explicit code.

Suggested change

  const commandStartIndex = evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex;
  const evalArgs = args.slice(commandStartIndex + 1);
  const { runEvalCli } = await import("./bench/eval-framework.mjs");
  const { exitCode } = await runEvalCli(evalArgs);
  process.exit(exitCode);

becomes:

  const commandStartIndex =
    evalFlagIndex >= 0 && evalCommandIndex >= 0
      ? Math.min(evalFlagIndex, evalCommandIndex)
      : (evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex);
  const evalArgs = args.slice(commandStartIndex + 1);
  const { runEvalCli } = await import("./bench/eval-framework.mjs");
  const { exitCode } = await runEvalCli(evalArgs);
  process.exit(exitCode ?? 0);

const dir = resolve(resultsDir);
if (!existsSync(dir)) return [];
return readdirSync(dir)
.filter((name) => name.endsWith(".json"))

Copilot AI Mar 26, 2026


listStoredEvaluationRuns() currently returns every *.json file in the results directory. This will incorrectly treat non-run JSON files (e.g. a benchmark.json stored alongside results) as evaluation runs, which breaks bosun eval history and also contradicts the new test that writes benchmark.json into the same dir. Filter by the run filename convention (e.g. eval-*.json) and/or validate the parsed shape (has runId/summary/results) before returning entries.

Suggested change

  .filter((name) => name.endsWith(".json"))

becomes:

  .filter((name) => name.startsWith("eval-") && name.endsWith(".json"))

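Complementing the filename filter, the comment's other suggestion (validating the parsed shape) could look like this sketch, assuming run files carry runId, summary, and results as described; looksLikeEvaluationRun is a hypothetical helper name, not code from the PR:

```javascript
// Hypothetical shape check for stored evaluation-run JSON, so stray files
// like benchmark.json in the same directory are not listed as runs.
function looksLikeEvaluationRun(parsed) {
  return Boolean(
    parsed &&
      typeof parsed === "object" &&
      typeof parsed.runId === "string" &&
      parsed.summary &&
      typeof parsed.summary === "object" &&
      Array.isArray(parsed.results),
  );
}
```

Pairing the cheap filename filter with this check keeps bosun eval history robust even if a future writer drops differently named run files into the results directory.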
@jaeko44 jaeko44 added bosun-pr-bosun-created PR created by Bosun and eligible for Bosun automation and removed bosun-pr-public PR observed by Bosun but not trusted for high-risk automation labels Mar 26, 2026
@github-actions

Bosun PR classification: Bosun-created.
This PR is tracked by Bosun attachment automation.

  • PR class label: bosun-pr-bosun-created
  • Attach label: bosun-attached (yes, because Bosun-created PRs remain attached regardless of human PR attach mode)
  • Attach mode policy: all
  • Trusted author: no
  • Bosun-created label present: yes
  • Automation scope: Eligible for Bosun repair and merge automation (Bosun-created PR).
  • CI failure signal label: bosun-needs-fix
  • CI failure marker:
  • Trigger: pull_request_target / synchronize

jaeko44 added 3 commits March 26, 2026 23:14
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
@jaeko44 jaeko44 added bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair and removed bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair labels Mar 26, 2026
@jaeko44 jaeko44 added bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair and removed bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair labels Mar 27, 2026

Labels

  • bosun-attached: Bosun PR attachment marker
  • bosun-needs-fix: Attached PR with failing CI that Bosun should pick up for repair
  • bosun-pr-bosun-created: PR created by Bosun and eligible for Bosun automation


2 participants