
feat(swe): harden pipeline quality gates and improve harness reliability #10

Merged
echobt merged 2 commits into main from feat/harden-pipeline-quality-gates
Feb 17, 2026

echobt commented Feb 17, 2026

Summary

Strengthens the SWE-bench dataset generation pipeline by enforcing stricter quality gates in the test generator, improving harness reliability for shallow clones, and adding smarter PR filtering.

Changes

Test Generator (test_generator.rs)

  • Increase MAX_VALIDATION_RETRIES from 2 to 3 for more robust test generation
  • Reject submissions with empty fail_to_pass instead of silently accepting
  • Reject (instead of accept) string-matching tests and dual-commit validation failures after max retries
  • Reject tasks when the PR patch cannot be applied to the base commit
  • Strengthen system prompt: require pass_to_pass commands to use existing test infrastructure and be verified before submission
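
The rejection rules above can be sketched as a single decision function. This is a minimal illustration, not the code in test_generator.rs: only `MAX_VALIDATION_RETRIES` and `fail_to_pass` come from this PR, while `Submission`, `Decision`, `decide`, and the remaining field names are hypothetical. It also assumes that an empty `fail_to_pass` and a patch-apply failure are rejected immediately rather than retried, which the description does not state.

```rust
/// Retry budget named in the PR (raised from 2 to 3).
const MAX_VALIDATION_RETRIES: u32 = 3;

/// Hypothetical outcome type for one validation round.
#[derive(Debug, PartialEq)]
enum Decision {
    Accept,
    Retry,
    Reject(&'static str),
}

/// Hypothetical view of a generated test submission.
struct Submission {
    fail_to_pass: Vec<String>,
    uses_string_matching: bool,
    dual_commit_ok: bool,
    patch_applies: bool,
}

/// Decide what to do with a submission after `retries_used` earlier attempts.
fn decide(sub: &Submission, retries_used: u32) -> Decision {
    // Assumption: these two failures cannot be fixed by retrying.
    if !sub.patch_applies {
        return Decision::Reject("PR patch does not apply to base commit");
    }
    if sub.fail_to_pass.is_empty() {
        return Decision::Reject("empty fail_to_pass");
    }
    // String-matching tests and dual-commit failures are retried,
    // then rejected (no longer accepted) once the budget is spent.
    if sub.uses_string_matching || !sub.dual_commit_ok {
        return if retries_used < MAX_VALIDATION_RETRIES {
            Decision::Retry
        } else {
            Decision::Reject("validation failed after max retries")
        };
    }
    Decision::Accept
}
```

The key behavioral change is in the last branch of the retry check: exhausting the retry budget now yields a rejection where the earlier behavior was to accept the task.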

Harness (harness.rs)

  • Add docker_write_file helper to copy test files into containers via stdin pipe
  • Auto-select Docker image based on task language when using the default image
  • Increase git clone depth from 100 to 500 to reduce shallow-clone misses
  • Add git fetch --unshallow fallback when checkout fails on shallow clones
  • Copy test files from task metadata into the container before running checks
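
The clone-depth and fallback behavior can be sketched as follows. This is an illustration under stated assumptions rather than the harness code: `CLONE_DEPTH`, `shallow_clone_args`, and `checkout_with_unshallow_fallback` are hypothetical names, and the git runner is passed in as a closure so the control flow can be exercised without a real repository. The git arguments themselves (`clone --depth`, `checkout`, `fetch --unshallow`) are standard git options.

```rust
/// Clone depth after this PR (previously 100 in the harness).
const CLONE_DEPTH: u32 = 500;

/// Build the argument list for a shallow `git clone`.
fn shallow_clone_args(repo_url: &str, dest: &str) -> Vec<String> {
    vec![
        "clone".to_string(),
        "--depth".to_string(),
        CLONE_DEPTH.to_string(),
        repo_url.to_string(),
        dest.to_string(),
    ]
}

/// Try `git checkout <commit>`; if that fails (for example because the
/// commit lies outside the shallow history), fetch the full history with
/// `git fetch --unshallow` and try the checkout once more.
///
/// `run_git` is a hypothetical runner that executes git with the given
/// arguments inside the cloned repository.
fn checkout_with_unshallow_fallback<F>(commit: &str, mut run_git: F) -> Result<(), String>
where
    F: FnMut(&[&str]) -> Result<(), String>,
{
    if run_git(&["checkout", commit]).is_ok() {
        return Ok(());
    }
    // The fallback is only attempted after a failed checkout, so the
    // common case keeps the cost of a shallow clone.
    run_git(&["fetch", "--unshallow"])?;
    run_git(&["checkout", commit])
}
```

In a real harness the closure would wrap `std::process::Command::new("git")`, or a `docker exec` call, and map a non-zero exit status to `Err`.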

Filters (filters.rs)

  • Activate added_lines filtering (was previously ignored via _added_lines)
  • Add changed_files parameter to filter out documentation/config-only PRs
  • Add is_docs_only_change heuristic covering common doc extensions and filenames
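
A docs-only heuristic of this kind might look like the sketch below. Only the name `is_docs_only_change` and the `changed_files` parameter appear in this PR; the extension and filename lists, the helper `is_doc_file`, and the choice to treat an empty file list as not docs-only are assumptions made for illustration.

```rust
use std::path::Path;

/// Assumed documentation extensions; the PR does not list the real set.
const DOC_EXTENSIONS: &[&str] = &["md", "rst", "txt", "adoc"];
/// Assumed documentation file stems, matched case-insensitively.
const DOC_FILENAMES: &[&str] = &["readme", "license", "changelog", "contributing"];

/// True when a single path looks like documentation.
fn is_doc_file(path: &str) -> bool {
    let p = Path::new(path);
    let ext_matches = p
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| DOC_EXTENSIONS.iter().any(|d| d.eq_ignore_ascii_case(e)))
        .unwrap_or(false);
    let name_matches = p
        .file_stem()
        .and_then(|s| s.to_str())
        .map(|s| DOC_FILENAMES.iter().any(|d| d.eq_ignore_ascii_case(s)))
        .unwrap_or(false);
    ext_matches || name_matches
}

/// True when every changed file looks like documentation.
/// Assumption: an empty list is not treated as docs-only.
fn is_docs_only_change(changed_files: &[String]) -> bool {
    !changed_files.is_empty() && changed_files.iter().all(|f| is_doc_file(f))
}
```

Requiring every file to match keeps the filter conservative: a PR that touches a single source file alongside its README still passes through to the later gates.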

Quality Scoring (quality.rs)

  • Raise minimum quality score threshold from 0.1 to 0.25
  • Require quality_good flag in addition to score threshold for task acceptance
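
The combined gate reduces to a two-part condition, sketched here. The `Classification` struct and `passes_quality_gate` function are hypothetical; the PR names only the `quality_good` flag and the 0.25 threshold. Whether the real comparison is inclusive (`>=`) is also an assumption.

```rust
/// Default minimum score after this PR (previously 0.1).
const MIN_QUALITY_SCORE: f64 = 0.25;

/// Hypothetical classification result for one candidate task.
struct Classification {
    score: f64,
    quality_good: bool,
}

/// A task passes only when the flag is set AND the score meets the threshold.
/// Assumption: the threshold comparison is inclusive.
fn passes_quality_gate(c: &Classification, min_score: f64) -> bool {
    c.quality_good && c.score >= min_score
}
```

Under this rule a high score alone is no longer sufficient: a task scored 0.9 with `quality_good` unset is rejected.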

Docker Sandbox (docker_sandbox.rs)

  • Increase clone depth from 50 to 500 for consistency with harness

Pipeline (pipeline.rs)

  • Pass changed_files to the sweep filter for docs-only detection

Benchmark Validation

  • Add validation report and validated dataset with 9 benchmark tasks (3 easy, 3 medium, 3 hard)

Generate and validate a 9-task SWE-bench dataset (3 easy, 3 medium, 3 hard)
selected from 23 candidates across 4 generation batches. Implement pipeline
improvements to address systemic quality issues discovered during validation.

Pipeline quality improvements (src/swe/):

- test_generator.rs: Increase MAX_VALIDATION_RETRIES from 2 to 3. Reject
  (instead of accept) empty fail_to_pass, string-matching tests after retries,
  dual-commit validation failures, and patch-apply failures. Enhance system
  prompt with explicit pass_to_pass verification instructions requiring agents
  to use existing test infrastructure rather than creating new test files.

- filters.rs: Activate added_lines range validation (was ignored via underscore
  prefix). Add docs-only change detection via is_docs_only_change() heuristic
  that checks file extensions and names against known documentation patterns.
  Accept new changed_files parameter for file-level filtering.

- quality.rs: Raise min_quality_score default from 0.1 to 0.25. Require both
  score threshold AND classification.quality_good for a task to pass the gate.

- pipeline.rs: Pass enriched.changed_files to keep_candidate() to enable the
  new docs-only filter.

- harness.rs: Increase clone depth from 100 to 500. Add --unshallow fallback
  when shallow clone misses target commit. Auto-select Docker image based on
  task language. Add docker_write_file() helper and test file copying from
  meta.test_files JSON into containers.

- docker_sandbox.rs: Increase clone depth from 50 to 500 for consistency.

Dataset and documentation:

- test-run/: Raw generated tasks across easy, easy2, medium, hard batches
  (23 candidates total) with workspace.yaml, checks.txt, prompt.md, and
  test files for each task.

- validated-dataset/: 9 curated tasks organized by difficulty with full
  workspace metadata, test scripts, and parquet shards.

- benchmark_validation_report.md: Detailed analysis of all 23 candidates
  with quality ratings, rejection reasons, and pipeline recommendations.

- validation_summary.json: Machine-readable validation metrics.

echobt merged commit 4852e38 into main Feb 17, 2026
8 checks passed
echobt deleted the feat/harden-pipeline-quality-gates branch February 17, 2026 14:27
echobt added a commit that referenced this pull request Apr 8, 2026
…es (#10)