Overhaul the synthetic dataset generator to produce higher-quality benchmarks faster and with better validation:

Quality improvements:
- Add min_description_length filter (30 chars) to reject PRs with empty or very short descriptions, preventing blank/useless benchmark tasks
- Strip repository names, PR numbers, and GitHub URLs from generated prompts via post-processing in prompt_rewriter to avoid leaking project identity into benchmark tasks
- Add test script validation (validate_test_scripts) that checks shell scripts have shebang lines, are non-empty, and that referenced test files actually exist in the submission set, with retry loop support

Observability:
- Add new progress module with ProgressCounters (shared atomics) and ProgressMonitor (background tokio task) that logs pipeline stats (filtered/extracted/scored/accepted) every 30s with ETA percentage
- Wire progress monitor into SweOrchestrator::run lifecycle

Build config:
- Simplify .cargo/config.toml linker to use cc instead of clang+mold for broader compatibility
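The test script checks listed above (shebang present, script non-empty, referenced test files present in the submission set) could look roughly like the sketch below. This is not the PR's code: only the name validate_test_scripts comes from the description. The SubmittedFile type, the error strings, and the rule that a "referenced test file" is any token starting with tests/ are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// One generated file in a submission: a relative path plus its contents.
/// Hypothetical type; the PR does not show the real one.
pub struct SubmittedFile {
    pub path: String,
    pub content: String,
}

/// Returns a list of human-readable problems; an empty list means every
/// shell script passed the checks.
pub fn validate_test_scripts(files: &[SubmittedFile]) -> Vec<String> {
    // Every path in the submission, used to resolve references.
    let known: HashSet<&str> = files.iter().map(|f| f.path.as_str()).collect();
    let mut problems = Vec::new();

    for file in files.iter().filter(|f| f.path.ends_with(".sh")) {
        if file.content.trim().is_empty() {
            problems.push(format!("{}: script is empty", file.path));
            continue;
        }
        if !file.content.starts_with("#!") {
            problems.push(format!("{}: missing shebang line", file.path));
        }
        // Assumption: a referenced test file is any whitespace-separated
        // token that starts with `tests/` (optionally quoted).
        for token in file.content.split_whitespace() {
            let token = token.trim_matches(|c: char| c == '"' || c == '\'');
            if token.starts_with("tests/") && !known.contains(token) {
                problems.push(format!(
                    "{}: references missing file {}",
                    file.path, token
                ));
            }
        }
    }
    problems
}
```

Returning the problems as a list rather than a bool is one way to support a retry loop: the caller can feed the messages back into the next generation attempt and stop once the list is empty.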
…ity-speed-validation
echobt added a commit that referenced this pull request on Apr 8, 2026
…ogress monitoring (#13)
* refactor(model): replace openai/gpt-5.2-codex:nitro with moonshotai/kimi-k2.5:nitro
* feat(swe): improve dataset quality, speed, and validation pipeline
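The progress module in this PR is described as a set of shared atomic counters read periodically by a background task. A minimal sketch of the counter side, using only the standard library, might look like this. The field names follow the stats named in the PR (filtered/extracted/scored/accepted); the target parameter, the new_shared helper, and the log line format are assumptions, and the tokio-based ProgressMonitor itself is not reproduced here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Pipeline stage counters that worker tasks bump without locking.
#[derive(Default)]
pub struct ProgressCounters {
    pub filtered: AtomicU64,
    pub extracted: AtomicU64,
    pub scored: AtomicU64,
    pub accepted: AtomicU64,
}

impl ProgressCounters {
    /// Hypothetical helper: one shared handle that workers and the
    /// monitor can each clone.
    pub fn new_shared() -> Arc<ProgressCounters> {
        Arc::new(ProgressCounters::default())
    }

    /// Formats one log line. `target` (assumed here) is the number of
    /// accepted tasks the run is aiming for; it drives the percentage.
    pub fn snapshot_line(&self, target: u64) -> String {
        let accepted = self.accepted.load(Ordering::Relaxed);
        let pct = if target == 0 {
            100.0
        } else {
            accepted as f64 * 100.0 / target as f64
        };
        format!(
            "filtered={} extracted={} scored={} accepted={}/{} ({:.1}%)",
            self.filtered.load(Ordering::Relaxed),
            self.extracted.load(Ordering::Relaxed),
            self.scored.load(Ordering::Relaxed),
            accepted,
            target,
            pct
        )
    }
}
```

Relaxed ordering is enough in this sketch because the counters are only reported, never used to synchronize other data. A monitor task would call snapshot_line on each tick (every 30s in the PR) and log the result.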
Summary
Overhauls the SWE synthetic dataset generation pipeline to improve quality, speed, and reliability. Switches the default LLM to moonshotai/kimi-k2.5:nitro, adds real-time progress monitoring, introduces test script validation, and cleans up stale test artifacts.
Changes
Pipeline & Performance
- pipeline.rs: streamlined dataset generation flow (~250 lines removed)
- progress.rs: module with async progress monitoring and atomic counters for real-time generation tracking
- filters.rs: module for PR/issue filtering logic extracted from pipeline
- test_generator.rs: validation for generated test scripts (checks for missing files, broken redirects, script coherence)
Model Configuration
- Replaced openai/gpt-5.2-codex:nitro with moonshotai/kimi-k2.5:nitro as default model across OpenRouter provider and CLI defaults
Prompt & Harness Improvements
- prompt_rewriter.rs: strips PR numbers and project identifiers from generated prompts
- harness.rs: improved orchestration hooks
- orchestrator.rs
Test Artifacts & Cleanup
- agent-tests/ directory with outdated benchmark results
- baseagent-echo artifact
Documentation
- benchmark_validation_report.md and validation_summary.json with latest results
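To illustrate the prompt post-processing mentioned under Prompt & Harness Improvements, here is a rough standard-library-only sketch of stripping PR numbers, GitHub URLs, and a repository name from a generated prompt. It is an assumption about how such a pass could work, not the code in prompt_rewriter.rs: the function names, the "the project" placeholder, and the token-based matching are all hypothetical.

```rust
/// True for tokens of the form `#123` (a `#` followed only by digits).
fn is_pr_ref(tok: &str) -> bool {
    match tok.strip_prefix('#') {
        Some(rest) => !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}

/// Drops tokens that look like PR references or GitHub URLs and replaces
/// the repository name with a neutral placeholder.
/// Note: this sketch collapses all whitespace runs to single spaces and
/// drops punctuation attached to a removed token.
pub fn strip_identifiers(prompt: &str, repo_name: &str) -> String {
    prompt
        .split_whitespace()
        .filter_map(|tok| {
            // Ignore surrounding punctuation when classifying the token.
            let core = tok.trim_matches(|c: char| {
                matches!(c, '(' | ')' | ',' | '.' | ';' | ':')
            });
            if core.contains("github.com/") || is_pr_ref(core) {
                None
            } else if repo_name.is_empty() {
                // Guard: replacing "" would insert the placeholder everywhere.
                Some(tok.to_string())
            } else {
                Some(tok.replace(repo_name, "the project"))
            }
        })
        .collect::<Vec<String>>()
        .join(" ")
}
```

For example, strip_identifiers("Fix the parser crash in acme/widgets (#42)", "acme/widgets") yields "Fix the parser crash in the project". A production pass would likely preserve line breaks and handle more URL shapes; this version only shows the idea.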