feat(swe): harden pipeline quality gates and improve harness reliability by echobt · Pull Request #10 · CortexLM/swe-forge

echobt · 2026-02-17T14:24:29Z

Summary

Strengthens the SWE-bench dataset generation pipeline by enforcing stricter quality gates in the test generator, improving harness reliability for shallow clones, and adding smarter PR filtering.

Changes

Test Generator (`test_generator.rs`)

- Increase `MAX_VALIDATION_RETRIES` from 2 to 3 for more robust test generation
- Reject submissions with empty `fail_to_pass` instead of silently accepting
- Reject (instead of accept) string-matching tests and dual-commit validation failures after max retries
- Reject tasks when the PR patch cannot be applied to the base commit
- Strengthen system prompt: require `pass_to_pass` commands to use existing test infrastructure and be verified before submission
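
The rejection rules above can be read as one decision function. The following is a minimal sketch, not the code in `test_generator.rs`: `Submission`, `Verdict`, and `evaluate_submission` are hypothetical names, and only `MAX_VALIDATION_RETRIES` and `fail_to_pass` come from this PR. It assumes that patch-apply failures and an empty `fail_to_pass` are rejected immediately, which the description does not state explicitly.

```rust
/// Retry budget named in the PR description (raised from 2 to 3).
const MAX_VALIDATION_RETRIES: u32 = 3;

/// Hypothetical summary of one test-generation attempt.
#[derive(Debug, Clone, Default)]
pub struct Submission {
    /// Commands expected to fail before the patch and pass after it.
    pub fail_to_pass: Vec<String>,
    /// Whether the PR patch applied cleanly to the base commit.
    pub patch_applies: bool,
    /// Whether the generated tests only grep for strings in the source.
    pub uses_string_matching: bool,
    /// Whether validation on both the base and the patched commit succeeded.
    pub dual_commit_ok: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Accept,
    Retry(&'static str),
    Reject(&'static str),
}

/// Decide what to do with a submission after `retries_used` earlier attempts.
pub fn evaluate_submission(s: &Submission, retries_used: u32) -> Verdict {
    // Assumption: these two conditions are never retried.
    if !s.patch_applies {
        return Verdict::Reject("PR patch does not apply to base commit");
    }
    if s.fail_to_pass.is_empty() {
        return Verdict::Reject("empty fail_to_pass");
    }

    let problem = if s.uses_string_matching {
        Some("string-matching tests")
    } else if !s.dual_commit_ok {
        Some("dual-commit validation failed")
    } else {
        None
    };

    match problem {
        None => Verdict::Accept,
        // Once the retry budget is spent, reject rather than accept.
        Some(reason) if retries_used >= MAX_VALIDATION_RETRIES => Verdict::Reject(reason),
        Some(reason) => Verdict::Retry(reason),
    }
}
```

Returning a reason string with `Retry` and `Reject` keeps the outcome auditable, which fits the rejection reasons recorded in the validation report.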

Harness (`harness.rs`)

- Add `docker_write_file` helper to copy test files into containers via stdin pipe
- Auto-select Docker image based on task language when using the default image
- Increase git clone depth from 100 to 500 to reduce shallow-clone misses
- Add `git fetch --unshallow` fallback when checkout fails on shallow clones
- Copy test files from task metadata into the container before running checks
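
Two of the harness changes above are easy to illustrate without Docker or git installed. The sketch below is an assumption-laden outline, not the code in `harness.rs`: `DEFAULT_IMAGE`, `select_image`, `checkout_with_fallback`, and the image tags are all hypothetical, and the command runner is passed in as a closure so the control flow can be exercised in isolation.

```rust
/// Hypothetical default image; the PR does not say what the real default is.
pub const DEFAULT_IMAGE: &str = "ubuntu:24.04";

/// Pick a language-specific image only when the caller kept the default.
/// The language-to-image table is illustrative.
pub fn select_image(configured: &str, language: &str) -> String {
    if configured != DEFAULT_IMAGE {
        // An explicit choice by the user always wins.
        return configured.to_string();
    }
    let image = match language.to_ascii_lowercase().as_str() {
        "python" => "python:3.12-slim",
        "rust" => "rust:1-slim",
        "go" | "golang" => "golang:1.22",
        "javascript" | "typescript" => "node:20-slim",
        _ => DEFAULT_IMAGE,
    };
    image.to_string()
}

/// Try to check out `commit`; if that fails (for example because a
/// shallow clone does not contain it), fetch the full history and retry once.
/// `run` executes one command line and reports success or an error message.
pub fn checkout_with_fallback<F>(commit: &str, mut run: F) -> Result<(), String>
where
    F: FnMut(&[&str]) -> Result<(), String>,
{
    if run(&["git", "checkout", commit]).is_ok() {
        return Ok(());
    }
    // The shallow clone missed the commit: deepen to full history.
    run(&["git", "fetch", "--unshallow"])?;
    run(&["git", "checkout", commit])
}
```

In a real harness the closure would presumably wrap `docker exec` or `std::process::Command`. Keeping it abstract means the fallback order (checkout, unshallow, checkout) can be tested with a fake runner.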

Filters (`filters.rs`)

- Activate `added_lines` filtering (was previously ignored via `_added_lines`)
- Add `changed_files` parameter to filter out documentation/config-only PRs
- Add `is_docs_only_change` heuristic covering common doc extensions and filenames
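
A docs-only heuristic of the kind described above might look like the sketch below. The function names `is_docs_only_change` and `keep_candidate` appear in this PR, but their signatures, the extension and filename lists, and the line-count bounds shown here are assumptions made for illustration.

```rust
use std::path::Path;

/// Illustrative lists; the real heuristic may cover more (or different) entries.
const DOC_EXTENSIONS: &[&str] = &["md", "rst", "txt", "adoc"];
const DOC_FILENAMES: &[&str] = &["license", "changelog", "readme", "contributing", "authors"];

/// True when a single path looks like documentation.
fn is_doc_file(path: &str) -> bool {
    let p = Path::new(path);
    let ext_match = p
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| DOC_EXTENSIONS.iter().any(|d| d.eq_ignore_ascii_case(e)))
        .unwrap_or(false);
    let name_match = p
        .file_stem()
        .and_then(|s| s.to_str())
        .map(|s| DOC_FILENAMES.iter().any(|d| d.eq_ignore_ascii_case(s)))
        .unwrap_or(false);
    ext_match || name_match
}

/// True when every changed file is documentation.
/// Assumption: an empty file list is not treated as docs-only.
pub fn is_docs_only_change(changed_files: &[&str]) -> bool {
    !changed_files.is_empty() && changed_files.iter().all(|f| is_doc_file(f))
}

/// Hypothetical shape of the sweep filter: keep a PR only if its added-line
/// count is inside the configured range and it touches non-doc files.
pub fn keep_candidate(
    added_lines: usize,
    changed_files: &[&str],
    min_added: usize,
    max_added: usize,
) -> bool {
    (min_added..=max_added).contains(&added_lines) && !is_docs_only_change(changed_files)
}
```

Requiring all files to match (rather than any) means a PR that fixes code and also updates the README is still kept.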

Quality Scoring (`quality.rs`)

- Raise minimum quality score threshold from 0.1 to 0.25
- Require `quality_good` flag in addition to score threshold for task acceptance
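
The combined gate can be sketched as a single predicate. `QualityConfig`, `Classification`, and `passes_quality_gate` are hypothetical names; only `min_quality_score`, `quality_good`, and the 0.25 default come from this PR. Whether the threshold comparison is inclusive is not stated, so `>=` here is an assumption.

```rust
/// Hypothetical configuration holder for the quality gate.
pub struct QualityConfig {
    pub min_quality_score: f64,
}

impl Default for QualityConfig {
    fn default() -> Self {
        // Default raised from 0.1 to 0.25 in this PR.
        Self { min_quality_score: 0.25 }
    }
}

/// Hypothetical classifier output for one task.
pub struct Classification {
    pub score: f64,
    pub quality_good: bool,
}

/// A task passes only if BOTH conditions hold: the classifier flagged it as
/// good and its score meets the threshold (assumed inclusive).
pub fn passes_quality_gate(c: &Classification, cfg: &QualityConfig) -> bool {
    c.quality_good && c.score >= cfg.min_quality_score
}
```

Under these assumptions, a task scoring 0.2 would have passed the old 0.1 threshold but is rejected now, and a high score no longer rescues a task whose `quality_good` flag is false.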

Docker Sandbox (`docker_sandbox.rs`)

- Increase clone depth from 50 to 500 for consistency with harness

Pipeline (`pipeline.rs`)

- Pass `changed_files` to the sweep filter for docs-only detection

Benchmark Validation

Add validation report and validated dataset with 9 benchmark tasks (3 easy, 3 medium, 3 hard)

Generate and validate a 9-task SWE-bench dataset (3 easy, 3 medium, 3 hard) selected from 23 candidates across 4 generation batches. Implement pipeline improvements to address systemic quality issues discovered during validation.

Pipeline quality improvements (src/swe/):
- test_generator.rs: Increase MAX_VALIDATION_RETRIES from 2 to 3. Reject (instead of accept) empty fail_to_pass, string-matching tests after retries, dual-commit validation failures, and patch-apply failures. Enhance system prompt with explicit pass_to_pass verification instructions requiring agents to use existing test infrastructure rather than creating new test files.
- filters.rs: Activate added_lines range validation (was ignored via underscore prefix). Add docs-only change detection via is_docs_only_change() heuristic that checks file extensions and names against known documentation patterns. Accept new changed_files parameter for file-level filtering.
- quality.rs: Raise min_quality_score default from 0.1 to 0.25. Require both score threshold AND classification.quality_good for a task to pass the gate.
- pipeline.rs: Pass enriched.changed_files to keep_candidate() to enable the new docs-only filter.
- harness.rs: Increase clone depth from 100 to 500. Add --unshallow fallback when shallow clone misses target commit. Auto-select Docker image based on task language. Add docker_write_file() helper and test file copying from meta.test_files JSON into containers.
- docker_sandbox.rs: Increase clone depth from 50 to 500 for consistency.

Dataset and documentation:
- test-run/: Raw generated tasks across easy, easy2, medium, hard batches (23 candidates total) with workspace.yaml, checks.txt, prompt.md, and test files for each task.
- validated-dataset/: 9 curated tasks organized by difficulty with full workspace metadata, test scripts, and parquet shards.
- benchmark_validation_report.md: Detailed analysis of all 23 candidates with quality ratings, rejection reasons, and pipeline recommendations.
- validation_summary.json: Machine-readable validation metrics.


echobt added 2 commits February 17, 2026 14:23

style: apply cargo fmt after merge

be87c70

echobt merged commit 4852e38 into main Feb 17, 2026
8 checks passed

echobt deleted the feat/harden-pipeline-quality-gates branch February 17, 2026 14:27
