Conversation

Bosun CI signal: Bosun-created PR currently has failing checks.
Pull request overview
Adds an agent evaluation/benchmarking framework to the repo, exposing it via a new bosun eval CLI entry point and covering core flows with new Vitest suites.
Changes:
- Introduces `bench/eval-framework.mjs` with benchmark/task models, metric helpers, run persistence, comparison, regression detection, and an eval CLI dispatcher.
- Wires `bosun eval`/`--eval` routing into `cli.mjs` and documents the commands in top-level help.
- Adds tests for the framework behaviors and CLI routing; ensures the new framework file is included in the packaged `files` allowlist.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/eval-framework.test.mjs | New test coverage for importing benchmarks, running evaluations, persisting results, summaries, comparisons, and CLI flows. |
| tests/cli-eval-routing.test.mjs | Ensures eval routing happens before generic --help handling and help text includes eval docs. |
| package.json | Adds bench/eval-framework.mjs to the published files allowlist. |
| cli.mjs | Adds eval command/flag routing and help text entries. |
| bench/eval-framework.mjs | Implements the evaluation & benchmarking framework and eval CLI subcommands. |
```js
const baselineTaskValues = ensureArray(baseline?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
const candidateTaskValues = ensureArray(candidate?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
metricDeltas[key] = {
  baseline: baseValue,
  candidate: candidateValue,
  delta,
  significance: computeSignificance(baselineTaskValues, candidateTaskValues),
```
compareEvaluationRuns() computes significance using baseline.summary.perTask values for every key, but perTask entries don't include metrics like p95Tokens, p95TimeMs, or totalCostUsd (they'll be undefined → coerced to 0). That yields misleading significance results (often comparing distributions of zeros). Consider either limiting significance to keys present in perTask summaries, or computing per-task/per-repeat distributions from run.results for each metric.
Suggested change:

```diff
-const baselineTaskValues = ensureArray(baseline?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
-const candidateTaskValues = ensureArray(candidate?.summary?.perTask).map((entry) => toNumber(entry?.[key]));
-metricDeltas[key] = {
-  baseline: baseValue,
-  candidate: candidateValue,
-  delta,
-  significance: computeSignificance(baselineTaskValues, candidateTaskValues),
+const baselinePerTask = ensureArray(baseline?.summary?.perTask);
+const candidatePerTask = ensureArray(candidate?.summary?.perTask);
+// Only compute per-task significance when at least one run has this metric at the per-task level.
+const hasPerTaskMetric =
+  baselinePerTask.some((entry) => entry && entry[key] != null) ||
+  candidatePerTask.some((entry) => entry && entry[key] != null);
+let significance = null;
+if (hasPerTaskMetric) {
+  const baselineTaskValues = baselinePerTask
+    .map((entry) => toNumber(entry?.[key]))
+    .filter((value) => Number.isFinite(value));
+  const candidateTaskValues = candidatePerTask
+    .map((entry) => toNumber(entry?.[key]))
+    .filter((value) => Number.isFinite(value));
+  if (baselineTaskValues.length > 0 && candidateTaskValues.length > 0) {
+    significance = computeSignificance(baselineTaskValues, candidateTaskValues);
+  }
+}
+metricDeltas[key] = {
+  baseline: baseValue,
+  candidate: candidateValue,
+  delta,
+  significance,
```
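The comment's second option, deriving distributions directly from `run.results` instead of the per-task summaries, could look roughly like this sketch. The `run.results` and per-result `outcome` field names here are assumptions based on the snippets elsewhere in this review, not confirmed framework API:

```javascript
// Sketch: collect a per-result distribution by applying an accessor to each
// entry of run.results, so significance can be computed on raw values
// (tokens, duration) even when perTask summaries lack the metric.
function metricDistribution(run, accessor) {
  const results = Array.isArray(run?.results) ? run.results : [];
  return results
    .map((result) => Number(accessor(result)))
    .filter((value) => Number.isFinite(value));
}

// Example accessors; outcome field names follow the snippets in this review.
const tokenValues = (run) => metricDistribution(run, (r) => r?.outcome?.totalTokens);
const timeValues = (run) => metricDistribution(run, (r) => r?.outcome?.durationMs);
```

Non-finite entries are dropped rather than coerced to 0, which avoids the comparing-distributions-of-zeros problem described above.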
```js
entry.repeats += 1;
if (result.metrics?.TaskSuccess === "pass") entry.passCount += 1;
entry.tokenValues.push(toNumber(result.outcome?.totalTokens, toNumber(result.outcome?.tokensInput) + toNumber(result.outcome?.tokensOutput)));
entry.timeValues.push(toNumber(result.outcome?.durationMs));
```
buildTaskSummaryEntries() uses result.outcome.durationMs for per-task avgTimeMs, while summarizeRun() uses the TimeToComplete metric if present. If callers override TimeToComplete (as your tests do), run-level avgTimeMs and per-task avgTimeMs can diverge unexpectedly. Consider deriving per-task time values from result.metrics.TimeToComplete (with the same fallback) to keep summaries consistent.
Suggested change:

```diff
-entry.timeValues.push(toNumber(result.outcome?.durationMs));
+const timeToComplete = result.metrics?.TimeToComplete ?? result.outcome?.durationMs;
+entry.timeValues.push(toNumber(timeToComplete));
```
```js
  this.runner = options.runner || defaultRunner();
  this.parallelism = Math.max(1, toNumber(options.parallelism, 1));
  this.storageAdapter = options.storageAdapter || null;
}
```
Evaluator accepts a parallelism option and records it on the run, but evaluate() executes all strategy/task/repeat iterations sequentially and never uses this.parallelism. Either implement concurrency (e.g., a bounded work queue) or drop/avoid exposing the option to prevent confusion about expected runtime behavior.
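If concurrency is implemented, a dependency-free bounded work queue is enough; a minimal sketch (`mapWithConcurrency` is a hypothetical helper name, not part of the framework):

```javascript
// Sketch: run `fn` over `items` with at most `limit` invocations in flight,
// preserving result order. Each worker pulls the next unclaimed index; since
// JS is single-threaded, the `next++` read-increment before the await is safe.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

`evaluate()` could then feed its strategy/task/repeat iterations through this with `this.parallelism` as the limit, which would make the recorded option match actual runtime behavior.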
```js
const commandStartIndex = evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex;
const evalArgs = args.slice(commandStartIndex + 1);
const { runEvalCli } = await import("./bench/eval-framework.mjs");
const { exitCode } = await runEvalCli(evalArgs);
process.exit(exitCode);
```
In eval routing, commandStartIndex always prefers the eval subcommand index over --eval when both are present, even if --eval appears earlier in args; using the earliest of the two would make slicing behavior deterministic. Also, process.exit(exitCode) will exit 0 if exitCode is undefined—using exitCode ?? 0 would align with other routing blocks (e.g. tui) and be safer if runEvalCli ever returns without an explicit code.
Suggested change:

```diff
-const commandStartIndex = evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex;
-const evalArgs = args.slice(commandStartIndex + 1);
-const { runEvalCli } = await import("./bench/eval-framework.mjs");
-const { exitCode } = await runEvalCli(evalArgs);
-process.exit(exitCode);
+const commandStartIndex =
+  evalFlagIndex >= 0 && evalCommandIndex >= 0
+    ? Math.min(evalFlagIndex, evalCommandIndex)
+    : (evalCommandIndex >= 0 ? evalCommandIndex : evalFlagIndex);
+const evalArgs = args.slice(commandStartIndex + 1);
+const { runEvalCli } = await import("./bench/eval-framework.mjs");
+const { exitCode } = await runEvalCli(evalArgs);
+process.exit(exitCode ?? 0);
```
```js
const dir = resolve(resultsDir);
if (!existsSync(dir)) return [];
return readdirSync(dir)
  .filter((name) => name.endsWith(".json"))
```
listStoredEvaluationRuns() currently returns every *.json file in the results directory. This will incorrectly treat non-run JSON files (e.g. a benchmark.json stored alongside results) as evaluation runs, which breaks bosun eval history and also contradicts the new test that writes benchmark.json into the same dir. Filter by the run filename convention (e.g. eval-*.json) and/or validate the parsed shape (has runId/summary/results) before returning entries.
Suggested change:

```diff
-  .filter((name) => name.endsWith(".json"))
+  .filter((name) => name.startsWith("eval-") && name.endsWith(".json"))
```
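The shape validation mentioned in the comment could be a small predicate applied after `JSON.parse`; a sketch, where the `runId`/`summary`/`results` field names follow the comment above rather than a confirmed schema:

```javascript
// Sketch: accept only parsed JSON objects that look like evaluation runs,
// so stray files such as benchmark.json are skipped even when they pass
// the filename filter. Field names follow the review comment above.
function looksLikeEvaluationRun(parsed) {
  return (
    parsed !== null &&
    typeof parsed === "object" &&
    !Array.isArray(parsed) &&
    typeof parsed.runId === "string" &&
    parsed.summary !== null &&
    typeof parsed.summary === "object" &&
    Array.isArray(parsed.results)
  );
}
```

Combining the filename filter with this predicate guards both `bosun eval history` and the new test that writes `benchmark.json` into the same directory.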
Bosun PR classification: Bosun-created.
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
Task-ID: bc219832-c3a1-4baa-8191-cedcbc9e96cb

Automated PR for task bc219832-c3a1-4baa-8191-cedcbc9e96cb

---

Bosun-Origin: created