
feat(dataset): add validated SWE-bench task dataset with 9 tasks across 3 difficulty levels#9

Merged
echobt merged 3 commits into main from feat/swe-bench-dataset-validation on Feb 17, 2026

Conversation

Contributor

echobt commented Feb 17, 2026

Summary

Adds a generated and validated SWE-bench dataset containing 9 tasks (3 easy, 3 medium, 3 hard) sourced from real GitHub PRs, along with a detailed validation report.

Changes

Dataset tasks (test-run/)

Each task includes:

  • workspace.yaml with full metadata (base_commit, patch, fail_to_pass, pass_to_pass)
  • prompt.md task description
  • checks.txt with test commands
  • original_pr.md PR context
  • Behavioral test scripts (fail_to_pass + pass_to_pass)
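As an illustration, the four metadata fields listed above can be sanity-checked per task. The field names come from this PR; the helper and the example values are hypothetical, not part of swe-forge:

```python
# Hypothetical sanity check for one task's metadata, mirroring the
# workspace.yaml fields named in this PR: base_commit, patch,
# fail_to_pass, pass_to_pass.

REQUIRED_FIELDS = ("base_commit", "patch", "fail_to_pass", "pass_to_pass")

def missing_fields(workspace: dict) -> list[str]:
    """Return the required metadata keys that are absent or empty."""
    return [k for k in REQUIRED_FIELDS if not workspace.get(k)]

# Example resembling the one known gap in this dataset: a CI-only YAML
# change with no fail_to_pass test (cf. dwc-meltingplot-config-9).
task = {
    "base_commit": "abc123",
    "patch": "diff --git a/.github/workflows/ci.yaml ...",
    "fail_to_pass": [],
    "pass_to_pass": ["tests/pass_to_pass.sh"],
}
print(missing_fields(task))  # the empty fail_to_pass list is flagged
```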

Parquet exports

  • Pre-built .parquet files per difficulty level for direct consumption

Validation report (test-run/VALIDATION_REPORT.md)

  • Per-task analysis covering sanity checks, behavioral test quality, and coverage
  • Classification of identified issues (missing pass_to_pass, string matching, incomplete coverage)
  • Recommendations for test generator improvements (prompt refinements, dual-commit validation)

feat(dataset): generate and validate 9 SWE-bench tasks (3 easy, 3 medium, 3 hard)

Generate initial batch of 9 SWE-bench benchmark tasks across three difficulty
levels using swe-forge with openai/gpt-5.2-codex:nitro, and validate their
coherence against SWE-bench quality criteria.

Tasks generated:
- Easy (3): noi-techpark/opendatahub-gtfs-api-3, Meltingplot/dwc-meltingplot-config-9,
  conda-forge/elevenlabs-feedstock-25
- Medium (3): cluesmith/codev-371, opendatahub-io/notebooks-2977,
  babalae/bettergi-scripts-list-2892
- Hard (3): laser-thinhs/lt316-customizer-app-23, Botopia-Tecnology/imagiq-dashboard-145,
  DPorvenir/Sat_Project_Back-28

Each task directory contains workspace.yaml (metadata, patch, commits, test config),
prompt.md (sanitized problem statement), original_pr.md (source PR description),
checks.txt (test commands), and tests/ with fail_to_pass and pass_to_pass scripts
plus language-specific test files (Python, TypeScript, JavaScript).
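A minimal sketch of how a harness might execute the commands in checks.txt. The one-command-per-line format is an assumption, as the commit only says the file holds test commands; swe-forge's real harness is not shown in this PR:

```python
import subprocess

def run_checks(lines):
    """Run each non-empty, non-comment line as a shell command.

    Returns {command: passed}. Purely illustrative of how checks.txt
    contents could be consumed.
    """
    results = {}
    for raw in lines:
        cmd = raw.strip()
        if not cmd or cmd.startswith("#"):
            continue
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results[cmd] = proc.returncode == 0
    return results

# A passing and a failing command stand in for real fail_to_pass /
# pass_to_pass scripts.
print(run_checks(["true", "false", "# comment", ""]))
```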

Parquet files (train.parquet, {difficulty}.parquet, data/shard-0000.parquet) are
generated per difficulty level in SWE-bench compatible format.

Validation results (8/9 fully coherent):
- 9/9 have valid prompts, patches, and workspace metadata
- 8/9 have both fail_to_pass and pass_to_pass tests
- 1 task (dwc-meltingplot-config-9) lacks fail_to_pass due to CI-only YAML change
- All prompts avoid leaking solution details
- Quality scores range from 0.20 (easy) to 0.78 (hard)

VALIDATION_REPORT.md documents detailed per-task analysis, quality distribution,
known issues, and file structure conventions.
echobt merged commit 2d142e5 into main on Feb 17, 2026
7 checks passed
echobt deleted the feat/swe-bench-dataset-validation branch on February 17, 2026 at 10:07
echobt added a commit that referenced this pull request Apr 8, 2026
feat(dataset): add validated SWE-bench task dataset with 9 tasks across 3 difficulty levels (#9)

* feat(dataset): generate and validate 9 SWE-bench tasks (3 easy, 3 medium, 3 hard)

* style: fix CI — apply cargo +nightly fmt formatting

* fix: resolve clippy errors — for_kv_map and overly_complex_bool_expr
