feat(swe): harden pipeline quality gates and improve harness reliability #10
Merged
Generate and validate a 9-task SWE-bench dataset (3 easy, 3 medium, 3 hard) selected from 23 candidates across 4 generation batches. Implement pipeline improvements to address systemic quality issues discovered during validation.

Pipeline quality improvements (src/swe/):
- test_generator.rs: Increase MAX_VALIDATION_RETRIES from 2 to 3. Reject (instead of accept) empty fail_to_pass, string-matching tests after retries, dual-commit validation failures, and patch-apply failures. Enhance the system prompt with explicit pass_to_pass verification instructions requiring agents to use existing test infrastructure rather than creating new test files.
- filters.rs: Activate added_lines range validation (was ignored via underscore prefix). Add docs-only change detection via the is_docs_only_change() heuristic, which checks file extensions and names against known documentation patterns. Accept a new changed_files parameter for file-level filtering.
- quality.rs: Raise the min_quality_score default from 0.1 to 0.25. Require both the score threshold AND classification.quality_good for a task to pass the gate.
- pipeline.rs: Pass enriched.changed_files to keep_candidate() to enable the new docs-only filter.
- harness.rs: Increase clone depth from 100 to 500. Add --unshallow fallback when the shallow clone misses the target commit. Auto-select the Docker image based on task language. Add docker_write_file() helper and test file copying from meta.test_files JSON into containers.
- docker_sandbox.rs: Increase clone depth from 50 to 500 for consistency.

Dataset and documentation:
- test-run/: Raw generated tasks across easy, easy2, medium, hard batches (23 candidates total) with workspace.yaml, checks.txt, prompt.md, and test files for each task.
- validated-dataset/: 9 curated tasks organized by difficulty with full workspace metadata, test scripts, and parquet shards.
- benchmark_validation_report.md: Detailed analysis of all 23 candidates with quality ratings, rejection reasons, and pipeline recommendations.
- validation_summary.json: Machine-readable validation metrics.
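
The two gating rules described above (the docs-only filter in filters.rs and the combined score/flag gate in quality.rs) could look roughly like the sketch below. This is an illustration under assumptions, not the code from this PR: the extension and file-name lists, the helper is_doc_file, the function passes_quality_gate, and all signatures are hypothetical. Only the names is_docs_only_change, changed_files, min_quality_score, quality_good, and the 0.25 default come from the description.

```rust
use std::path::Path;

// Hypothetical lists: the PR only says "known documentation patterns".
const DOC_EXTENSIONS: &[&str] = &["md", "mdx", "rst", "txt", "adoc"];
const DOC_FILE_NAMES: &[&str] = &["readme", "changelog", "license", "contributing"];

/// Returns true when a single path looks like documentation.
/// (Hypothetical helper, not named in the PR.)
fn is_doc_file(path: &str) -> bool {
    let p = Path::new(path);
    match p.extension().and_then(|e| e.to_str()) {
        // Files with an extension are judged by the extension alone.
        Some(ext) => DOC_EXTENSIONS.iter().any(|d| d.eq_ignore_ascii_case(ext)),
        // Extension-less files such as LICENSE are judged by their name.
        None => p
            .file_name()
            .and_then(|n| n.to_str())
            .map(|n| DOC_FILE_NAMES.iter().any(|d| d.eq_ignore_ascii_case(n)))
            .unwrap_or(false),
    }
}

/// A change is docs-only when it touches at least one file and every
/// touched file looks like documentation. Treating an empty file list as
/// "not docs-only" is an assumption of this sketch.
pub fn is_docs_only_change(changed_files: &[String]) -> bool {
    !changed_files.is_empty() && changed_files.iter().all(|f| is_doc_file(f))
}

/// Quality gate: the score must reach the threshold AND the classifier
/// must have flagged the task as good. Either condition alone is not enough.
pub fn passes_quality_gate(score: f64, quality_good: bool, min_quality_score: f64) -> bool {
    score >= min_quality_score && quality_good
}
```

With the new default threshold of 0.25, a task scored 0.3 but not flagged quality_good would be rejected in this sketch, as would a flagged task scored 0.2.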
echobt added a commit that referenced this pull request on Apr 8, 2026
…es (#10)
Summary
Strengthens the SWE-bench dataset generation pipeline by enforcing stricter quality gates in the test generator, improving harness reliability for shallow clones, and adding smarter PR filtering.
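
As a rough illustration of the shallow-clone reliability fix, the sketch below tries the checkout first and only falls back to git fetch --unshallow when that fails. It is a minimal sketch, not the harness code: the function names checkout_commit and run_git, the closure-based runner, and the error strings are all assumptions made so the control flow can be exercised without a real repository.

```rust
use std::process::Command;

/// Runs `git -C <dir> <args...>` and reports whether it exited successfully.
/// (Hypothetical helper; the real harness may invoke git differently.)
pub fn run_git(dir: &str, args: &[&str]) -> bool {
    Command::new("git")
        .arg("-C")
        .arg(dir)
        .args(args)
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

/// Checks out `commit`, falling back to a full-history fetch when the
/// shallow clone does not contain it. `run` executes one git invocation
/// and returns true on success.
pub fn checkout_commit<F>(commit: &str, mut run: F) -> Result<(), String>
where
    F: FnMut(&[&str]) -> bool,
{
    // Fast path: the commit is already within the shallow history.
    if run(&["checkout", commit]) {
        return Ok(());
    }
    // Fallback: convert the shallow clone into a full one, then retry.
    if !run(&["fetch", "--unshallow"]) {
        return Err(format!("could not unshallow while looking for {commit}"));
    }
    if run(&["checkout", commit]) {
        Ok(())
    } else {
        Err(format!("commit {commit} not found even after unshallow"))
    }
}
```

Against a real working copy this could be called as checkout_commit(sha, |a| run_git(repo_dir, a)), where sha and repo_dir are placeholders. Passing the runner as a closure keeps the retry logic independent of process spawning.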
Changes
Test Generator (test_generator.rs)
- Increase MAX_VALIDATION_RETRIES from 2 to 3 for more robust test generation
- Reject empty fail_to_pass instead of silently accepting it
Harness (harness.rs)
- Add docker_write_file helper to copy test files into containers via stdin pipe
- Add git fetch --unshallow fallback when checkout fails on shallow clones
Filters (filters.rs)
- Activate added_lines filtering (was previously ignored via _added_lines)
- Add changed_files parameter to filter out documentation/config-only PRs
- Add is_docs_only_change heuristic covering common doc extensions and filenames
Quality Scoring (quality.rs)
- Require quality_good flag in addition to score threshold for task acceptance
Docker Sandbox (docker_sandbox.rs)
- Increase clone depth from 50 to 500 for consistency
Pipeline (pipeline.rs)
- Pass changed_files to the sweep filter for docs-only detection
Benchmark Validation