
feat(swe): improve dataset generation pipeline with validation and progress monitoring #13

Merged
echobt merged 3 commits into main from feat/swe-dataset-quality-speed-validation
Feb 17, 2026

Conversation

@echobt (Contributor) commented Feb 17, 2026

Summary

Overhauls the SWE synthetic dataset generation pipeline to improve quality, speed, and reliability. Switches the default model to moonshotai/kimi-k2.5:nitro (via the OpenRouter provider), adds real-time progress monitoring, introduces test-script validation, and cleans up stale test artifacts.

Changes

Pipeline & Performance

  • Refactor pipeline.rs to streamline dataset generation flow (~250 lines removed)
  • Add progress.rs module with async progress monitoring and atomic counters for real-time generation tracking
  • Add filters.rs module for PR/issue filtering logic extracted from pipeline
  • Add test_generator.rs with validation for generated test scripts (checks for missing files, broken redirects, script coherence)

Model Configuration

  • Replace openai/gpt-5.2-codex:nitro with moonshotai/kimi-k2.5:nitro as default model across OpenRouter provider and CLI defaults

Prompt & Harness Improvements

  • Enhance prompt_rewriter.rs to strip PR numbers and project identifiers from generated prompts
  • Update harness.rs with improved orchestration hooks
  • Add orchestrator-level validation entry point in orchestrator.rs
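The identifier stripping in prompt_rewriter.rs can be sketched as a simple token filter. This is a minimal illustration, not the actual implementation: the function name, the "#NNN" rule, and the URL rule are assumptions based on the description above, and the real rewriter may use richer pattern matching.

```rust
/// Illustrative sketch of stripping PR numbers and GitHub URLs from a
/// generated prompt. Names and rules are assumptions, not the real API.
fn strip_identifiers(prompt: &str) -> String {
    prompt
        .split_whitespace()
        .filter(|tok| {
            // Drop "#123"-style PR references.
            let is_pr_ref = tok.len() > 1
                && tok.starts_with('#')
                && tok[1..].chars().all(|c| c.is_ascii_digit());
            // Drop tokens containing a GitHub URL.
            let is_gh_url = tok.contains("github.com/");
            !(is_pr_ref || is_gh_url)
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Token-level filtering loses original whitespace, which is usually acceptable for prompt text; a production rewriter would likely preserve formatting.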

Test Artifacts & Cleanup

  • Remove stale agent-tests/ directory with outdated benchmark results
  • Remove baseagent-echo artifact
  • Update test-run workspace configurations and fix test script paths
  • Replace Python-based test suites with JavaScript equivalents where appropriate

Documentation

  • Update benchmark_validation_report.md and validation_summary.json with latest results
  • Update AGENTS.md files for cli and llm modules

Overhaul the synthetic dataset generator to produce higher-quality
benchmarks faster and with better validation:

Quality improvements:
- Add min_description_length filter (30 chars) to reject PRs with empty
  or very short descriptions, preventing blank/useless benchmark tasks
- Strip repository names, PR numbers, and GitHub URLs from generated
  prompts via post-processing in prompt_rewriter to avoid leaking
  project identity into benchmark tasks
- Add test script validation (validate_test_scripts) that checks shell
  scripts have shebang lines, are non-empty, and that referenced test
  files actually exist in the submission set — with retry loop support
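The validation rules above (shebang present, non-empty, referenced test files exist in the submission set) can be sketched as a single check function. This is a hedged sketch: the function name, error strings, and the ".js"/".py" heuristic for spotting file references are illustrative assumptions, not the actual validate_test_scripts API, and the retry loop is omitted.

```rust
use std::collections::HashSet;

/// Illustrative validation of one generated test script; the real pipeline's
/// API and error handling may differ.
fn validate_test_script(script: &str, submitted_files: &HashSet<&str>) -> Result<(), String> {
    if script.trim().is_empty() {
        return Err("script is empty".into());
    }
    if !script.starts_with("#!") {
        return Err("missing shebang line".into());
    }
    // Any token that looks like a test file must exist in the submission set.
    for tok in script.split_whitespace() {
        if (tok.ends_with(".js") || tok.ends_with(".py")) && !submitted_files.contains(tok) {
            return Err(format!("referenced file not in submission: {tok}"));
        }
    }
    Ok(())
}
```

A retry loop would call this after each generation attempt and feed the error string back into the next LLM request.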

Observability:
- Add new progress module with ProgressCounters (shared atomics) and
  ProgressMonitor (background tokio task) that logs pipeline stats
  (filtered/extracted/scored/accepted) every 30s with ETA percentage
- Wire progress monitor into SweOrchestrator::run lifecycle
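The shared-atomics design can be sketched as below. Field names are illustrative, and the real module drives periodic logging from a background tokio task; this self-contained version uses plain std threads to show the counter mechanics only.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Illustrative shared counters; the actual ProgressCounters may differ.
#[derive(Default)]
struct ProgressCounters {
    filtered: AtomicU64,
    extracted: AtomicU64,
    scored: AtomicU64,
    accepted: AtomicU64,
}

impl ProgressCounters {
    /// A monitor task would read this snapshot every 30s and log it.
    fn snapshot(&self) -> (u64, u64, u64, u64) {
        (
            self.filtered.load(Ordering::Relaxed),
            self.extracted.load(Ordering::Relaxed),
            self.scored.load(Ordering::Relaxed),
            self.accepted.load(Ordering::Relaxed),
        )
    }
}

fn demo() -> (u64, u64, u64, u64) {
    let counters = Arc::new(ProgressCounters::default());
    // Worker threads bump counters as items pass through the pipeline.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&counters);
            std::thread::spawn(move || {
                c.filtered.fetch_add(1, Ordering::Relaxed);
                c.accepted.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counters.snapshot()
}
```

Relaxed ordering is sufficient here because the counters are independent statistics with no cross-counter invariants to preserve.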

Build config:
- Simplify .cargo/config.toml linker to use cc instead of clang+mold
  for broader compatibility
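The simplified linker setting would look roughly like the fragment below; the exact target triple and any removed mold-specific rustflags are assumptions based on the description above.

```toml
# .cargo/config.toml (sketch): use the system cc driver instead of clang+mold
[target.x86_64-unknown-linux-gnu]
linker = "cc"
```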
@echobt echobt merged commit 990a765 into main Feb 17, 2026
9 checks passed
@echobt echobt deleted the feat/swe-dataset-quality-speed-validation branch February 17, 2026 15:49
echobt added a commit that referenced this pull request Apr 8, 2026
…ogress monitoring (#13)

* refactor(model): replace openai/gpt-5.2-codex:nitro with moonshotai/kimi-k2.5:nitro

* feat(swe): improve dataset quality, speed, and validation pipeline
