
feat(benchmark): add benchmark command with pipeline metrics and PR analysis#14

Merged
echobt merged 3 commits into main from feat/benchmark-pipeline-metrics
Feb 17, 2026

Conversation

Contributor

echobt commented Feb 17, 2026

Summary

Add a new benchmark CLI command that evaluates the SWE pipeline against a batch of PRs, collecting detailed metrics on filtering, difficulty distribution, quality, and throughput. Results are persisted as JSON and documented in the README.

Changes

  • New benchmark subcommand (src/cli/commands.rs): runs the pipeline on a configurable number of PRs (default 100) and outputs structured results including:
    • Total PRs processed, filtered count, and filter rate
    • Difficulty breakdown (easy / medium / hard)
    • Quality classification of accepted PRs
    • Throughput metrics (requests/second, average processing time)
  • BenchmarkMetrics tracking (src/swe/pipeline.rs, src/swe/orchestrator.rs): instrument the pipeline to capture per-PR timing, filtering decisions, and difficulty classification during benchmark runs
  • Benchmark output artifacts (benchmark-output/, benchmark_results.json, benchmark_output.json): sample benchmark results across 8 repositories covering Go, Java, Python, and Rust projects
  • README.md: add comprehensive benchmark results section documenting filtering rates, difficulty distribution, quality metrics, and performance characteristics
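The metrics summary described above could look roughly like the following dependency-free sketch. The struct and method names are illustrative assumptions, not the actual definitions in src/swe/pipeline.rs, and the sample values are made up for demonstration (the real code would likely serialize via serde rather than hand-built JSON):

```rust
// Hypothetical sketch of BenchmarkMetrics; the actual struct in
// src/swe/pipeline.rs may differ. JSON is rendered by hand here
// only to keep the example free of external crates.
struct BenchmarkMetrics {
    total_prs: usize,
    filtered: usize,
    easy: usize,
    medium: usize,
    hard: usize,
    total_secs: f64, // cumulative processing time across all PRs
}

impl BenchmarkMetrics {
    fn filter_rate(&self) -> f64 {
        if self.total_prs == 0 { 0.0 } else { self.filtered as f64 / self.total_prs as f64 }
    }

    fn avg_processing_secs(&self) -> f64 {
        if self.total_prs == 0 { 0.0 } else { self.total_secs / self.total_prs as f64 }
    }

    fn to_json(&self) -> String {
        format!(
            "{{\"total_prs\":{},\"filtered\":{},\"filter_rate\":{:.2},\"easy\":{},\"medium\":{},\"hard\":{},\"avg_processing_secs\":{:.1}}}",
            self.total_prs, self.filtered, self.filter_rate(),
            self.easy, self.medium, self.hard, self.avg_processing_secs()
        )
    }
}

fn main() {
    // Illustrative values only; not actual benchmark output.
    let m = BenchmarkMetrics {
        total_prs: 100, filtered: 89, easy: 2, medium: 9, hard: 0, total_secs: 17140.0,
    };
    println!("{}", m.to_json());
}
```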

Notes

  • No breaking changes to existing CLI commands or pipeline behavior
  • Benchmark mode is opt-in via the new benchmark subcommand
  • All 300 existing tests continue to pass

…tion

Add a new `swe benchmark` CLI subcommand that runs N candidate PRs through
the full mining pipeline and outputs detailed metrics as JSON. The benchmark
exercises the complete flow: GH Archive ingestion → enrichment → filtering →
LLM classification → patch extraction → Docker-based agentic test generation
→ quality scoring → export.
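The stage sequence above can be sketched as an ordered enum. This is an illustrative model only; the real pipeline in src/swe/pipeline.rs may represent its stages quite differently:

```rust
// Illustrative model of the end-to-end flow the benchmark exercises;
// names are assumptions, not the pipeline's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Stage {
    Ingestion,       // GH Archive events
    Enrichment,
    Filtering,
    Classification,  // LLM difficulty classification
    PatchExtraction,
    TestGeneration,  // Docker-based agentic test generation
    QualityScoring,
    Export,
}

// Stages in the order the benchmark runs them.
const STAGES: [Stage; 8] = [
    Stage::Ingestion,
    Stage::Enrichment,
    Stage::Filtering,
    Stage::Classification,
    Stage::PatchExtraction,
    Stage::TestGeneration,
    Stage::QualityScoring,
    Stage::Export,
];

fn main() {
    for stage in STAGES {
        println!("{:?}", stage);
    }
}
```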

Code changes in src/cli/commands.rs:
- Added SweBenchmarkArgs struct with configurable parameters (count, min-stars,
  languages, model, api-key, cache-db, output directory)
- Added Benchmark variant to SweSubcommand enum
- Implemented run_swe_benchmark_command async handler that validates API keys,
  configures SweOrchestrator, runs the pipeline, and outputs JSON results
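A dependency-free sketch of what SweBenchmarkArgs and its defaults might look like. Only the parameter list and the default count of 100 come from the PR description; the field names, types, and other defaults here are guesses (the real struct presumably derives clap's Parser):

```rust
// Hypothetical reconstruction of SweBenchmarkArgs; apart from the
// parameter list and count defaulting to 100, details are assumptions.
#[derive(Debug, Clone)]
struct SweBenchmarkArgs {
    count: usize,            // number of candidate PRs to process
    min_stars: Option<u32>,  // minimum repository star count
    languages: Vec<String>,  // language filter; empty = all languages
    model: Option<String>,   // LLM model used for classification
    api_key: Option<String>, // falls back to the environment if unset
    cache_db: Option<String>,// path to the cache database
    output_dir: String,      // where benchmark artifacts are written
}

impl Default for SweBenchmarkArgs {
    fn default() -> Self {
        Self {
            count: 100, // the PR's stated default
            min_stars: None,
            languages: Vec::new(),
            model: None,
            api_key: None,
            cache_db: None,
            output_dir: "benchmark-output".to_string(),
        }
    }
}

fn main() {
    let args = SweBenchmarkArgs::default();
    println!("benchmarking {} PRs into {}", args.count, args.output_dir);
}
```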

README.md updated with a comprehensive benchmark results section (written in
English) from a run processing 100 PRs (2026-02-17), including:
- Pipeline funnel (1.75M raw events → 8 accepted tasks, 0.00046% yield)
- Difficulty distribution (81.8% medium, 18.2% easy, 0% hard)
- Quality metrics (avg 0.47, pass rate 72.7%, threshold ≥0.30)
- Throughput/timing (21 PRs extracted/hr, 8 accepted/hr, 171.4s avg per PR)
- Language distribution (Go 37.5%, Java 25%, Python 25%, TypeScript 12.5%)
- Accepted task listing with scores
- Test generation failure analysis
- Usage instructions for running the benchmark
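As a quick sanity check, the yield figure in the funnel above follows directly from the raw counts:

```rust
// Verify the README's yield figure: 8 accepted tasks out of
// roughly 1.75M raw GH Archive events.
fn main() {
    let raw_events = 1_750_000.0_f64;
    let accepted = 8.0_f64;
    let yield_pct = accepted / raw_events * 100.0;
    println!("yield: {:.5}%", yield_pct); // ≈ 0.00046%
}
```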

Benchmark artifacts added:
- benchmark-output/ with 8 accepted task directories, each containing
  workspace.yaml, checks.txt, prompt.md, original_pr.md, and test scripts
- benchmark_output.json and benchmark_results.json with raw pipeline output
- benchmark_clean.log with pipeline execution log
echobt merged commit 4d63a7a into main Feb 17, 2026
9 checks passed
echobt deleted the feat/benchmark-pipeline-metrics branch February 17, 2026 18:32
echobt added a commit that referenced this pull request Apr 8, 2026
…nalysis (#14)

* feat(swe): add BenchmarkMetrics tracking to pipeline and orchestrator

* feat(benchmark): add benchmark command and pipeline metrics documentation


* ci: trigger CI run
