
feat(dataset): add validated SWE-bench task dataset with 9 tasks across 3 difficulty levels#9

Merged
echobt merged 3 commits into main from feat/swe-bench-dataset-validation on Feb 17, 2026

Conversation

Contributor

echobt commented Feb 17, 2026

Summary

Adds a generated and validated SWE-bench dataset containing 9 tasks (3 easy, 3 medium, 3 hard) sourced from real GitHub PRs, along with a detailed validation report.

Changes

Dataset tasks (test-run/)

Each task includes:

  • workspace.yaml with full metadata (base_commit, patch, fail_to_pass, pass_to_pass)
  • prompt.md task description
  • checks.txt with test commands
  • original_pr.md PR context
  • Behavioral test scripts (fail_to_pass + pass_to_pass)
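As an illustration, the four metadata fields listed above can be sanity-checked per task. The field names come from this PR; the helper and the example values are hypothetical, not part of swe-forge:

```python
# Hypothetical sanity check for one task's metadata, mirroring the
# workspace.yaml fields named in this PR: base_commit, patch,
# fail_to_pass, pass_to_pass.

REQUIRED_FIELDS = ("base_commit", "patch", "fail_to_pass", "pass_to_pass")

def missing_fields(workspace: dict) -> list[str]:
    """Return the required metadata keys that are absent or empty."""
    return [k for k in REQUIRED_FIELDS if not workspace.get(k)]

# Example resembling the one known gap in this dataset: a CI-only YAML
# change with no fail_to_pass test (cf. dwc-meltingplot-config-9).
task = {
    "base_commit": "abc123",
    "patch": "diff --git a/.github/workflows/ci.yaml ...",
    "fail_to_pass": [],
    "pass_to_pass": ["tests/pass_to_pass.sh"],
}
print(missing_fields(task))  # the empty fail_to_pass list is flagged
```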

Parquet exports

  • Pre-built .parquet files per difficulty level for direct consumption

Validation report (test-run/VALIDATION_REPORT.md)

  • Per-task analysis covering sanity checks, behavioral test quality, and coverage
  • Classification of identified issues (missing pass_to_pass, string matching, incomplete coverage)
  • Recommendations for test generator improvements (prompt refinements, dual-commit validation)

feat(dataset): generate and validate 9 SWE-bench tasks (3 easy, 3 medium, 3 hard)

Generate initial batch of 9 SWE-bench benchmark tasks across three difficulty
levels using swe-forge with openai/gpt-5.2-codex:nitro, and validate their
coherence against SWE-bench quality criteria.

Tasks generated:
- Easy (3): noi-techpark/opendatahub-gtfs-api-3, Meltingplot/dwc-meltingplot-config-9,
  conda-forge/elevenlabs-feedstock-25
- Medium (3): cluesmith/codev-371, opendatahub-io/notebooks-2977,
  babalae/bettergi-scripts-list-2892
- Hard (3): laser-thinhs/lt316-customizer-app-23, Botopia-Tecnology/imagiq-dashboard-145,
  DPorvenir/Sat_Project_Back-28

Each task directory contains workspace.yaml (metadata, patch, commits, test config),
prompt.md (sanitized problem statement), original_pr.md (source PR description),
checks.txt (test commands), and tests/ with fail_to_pass and pass_to_pass scripts
plus language-specific test files (Python, TypeScript, JavaScript).
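A minimal sketch of how a harness might execute the commands in checks.txt. The one-command-per-line format is an assumption, as the commit only says the file holds test commands; swe-forge's real harness is not shown in this PR:

```python
import subprocess

def run_checks(lines):
    """Run each non-empty, non-comment line as a shell command.

    Returns {command: passed}. Purely illustrative of how checks.txt
    contents could be consumed.
    """
    results = {}
    for raw in lines:
        cmd = raw.strip()
        if not cmd or cmd.startswith("#"):
            continue
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results[cmd] = proc.returncode == 0
    return results

# A passing and a failing command stand in for real fail_to_pass /
# pass_to_pass scripts.
print(run_checks(["true", "false", "# comment", ""]))
```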

Parquet files (train.parquet, {difficulty}.parquet, data/shard-0000.parquet) are
generated per difficulty level in SWE-bench compatible format.

Validation results (8/9 fully coherent):
- 9/9 have valid prompts, patches, and workspace metadata
- 8/9 have both fail_to_pass and pass_to_pass tests
- 1 task (dwc-meltingplot-config-9) lacks fail_to_pass due to CI-only YAML change
- All prompts avoid leaking solution details
- Quality scores range from 0.20 (easy) to 0.78 (hard)

VALIDATION_REPORT.md documents detailed per-task analysis, quality distribution,
known issues, and file structure conventions.
echobt merged commit 2d142e5 into main on Feb 17, 2026
7 checks passed
echobt deleted the feat/swe-bench-dataset-validation branch on February 17, 2026 at 10:07
echobt added a commit that referenced this pull request Apr 8, 2026
feat(dataset): add validated SWE-bench task dataset with 9 tasks across 3 difficulty levels (#9)

* feat(dataset): generate and validate 9 SWE-bench tasks (3 easy, 3 medium, 3 hard)

* style: fix CI — apply cargo +nightly fmt formatting

* fix: resolve clippy errors — for_kv_map and overly_complex_bool_expr
