Conversation
feat(dataset): generate and validate 9 SWE-bench tasks (3 easy, 3 medium, 3 hard)
Generate initial batch of 9 SWE-bench benchmark tasks across three difficulty
levels using swe-forge with openai/gpt-5.2-codex:nitro, and validate their
coherence against SWE-bench quality criteria.
Tasks generated:
- Easy (3): noi-techpark/opendatahub-gtfs-api-3, Meltingplot/dwc-meltingplot-config-9,
conda-forge/elevenlabs-feedstock-25
- Medium (3): cluesmith/codev-371, opendatahub-io/notebooks-2977,
babalae/bettergi-scripts-list-2892
- Hard (3): laser-thinhs/lt316-customizer-app-23, Botopia-Tecnology/imagiq-dashboard-145,
DPorvenir/Sat_Project_Back-28
Each task directory contains workspace.yaml (metadata, patch, commits, test config),
prompt.md (sanitized problem statement), original_pr.md (source PR description),
checks.txt (test commands), and tests/ with fail_to_pass and pass_to_pass scripts
plus language-specific test files (Python, TypeScript, JavaScript).
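The per-task directory layout above can be checked mechanically. Below is a minimal stdlib-only sketch of such a layout check, assuming the file names listed in this PR (workspace.yaml, prompt.md, original_pr.md, checks.txt, and tests/ scripts); the directory it builds is a hypothetical example, not one of the real generated tasks.

```python
# Sketch: verify one generated task directory contains the files the PR
# describes. The required names are taken from the PR text; anything else
# (the example task name, script extensions) is an assumption.
from pathlib import Path
import tempfile

REQUIRED = ["workspace.yaml", "prompt.md", "original_pr.md", "checks.txt"]
REQUIRED_TESTS = ["fail_to_pass", "pass_to_pass"]

def missing_files(task_dir: Path) -> list[str]:
    """Return the required files/scripts that a task directory lacks."""
    missing = [f for f in REQUIRED if not (task_dir / f).is_file()]
    tests = task_dir / "tests"
    missing += [f"tests/{t}" for t in REQUIRED_TESTS
                if not any(tests.glob(f"{t}*"))]
    return missing

# Build a hypothetical, complete task directory and check it.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "example-task-1"
    (task / "tests").mkdir(parents=True)
    for name in REQUIRED:
        (task / name).write_text("placeholder\n")
    (task / "tests" / "fail_to_pass.sh").write_text("exit 1\n")
    (task / "tests" / "pass_to_pass.sh").write_text("exit 0\n")
    print(missing_files(task))  # an empty list means the layout is complete
```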
Parquet files (train.parquet, {difficulty}.parquet, data/shard-0000.parquet) are
generated per difficulty level in SWE-bench compatible format.
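As a rough sketch of how rows might be grouped before writing the per-difficulty parquet shards: the column names below follow the public SWE-bench schema (instance_id, repo, base_commit, patch, problem_statement, FAIL_TO_PASS, PASS_TO_PASS), but the exact columns swe-forge emits are an assumption, and the single row shown is a hypothetical stand-in.

```python
# Sketch: bucket SWE-bench-style rows by difficulty before export.
# Field values in the example row are placeholders, not real task data.
from collections import defaultdict

rows = [
    {
        "instance_id": "noi-techpark__opendatahub-gtfs-api-3",
        "repo": "noi-techpark/opendatahub-gtfs-api",
        "base_commit": "<sha>",
        "patch": "<unified diff>",
        "problem_statement": "<contents of prompt.md>",
        "FAIL_TO_PASS": '["tests/fail_to_pass.sh"]',
        "PASS_TO_PASS": '["tests/pass_to_pass.sh"]',
        "difficulty": "easy",
    },
]

shards = defaultdict(list)
for row in rows:
    shards[row["difficulty"]].append(row)

# One file set per difficulty level, e.g. with pandas:
#   pd.DataFrame(shards["easy"]).to_parquet("easy.parquet")
for difficulty, group in shards.items():
    print(difficulty, len(group))  # → easy 1
```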
Validation results (8/9 fully coherent):
- 9/9 have valid prompts, patches, and workspace metadata
- 8/9 have both fail_to_pass and pass_to_pass tests
- 1 task (dwc-meltingplot-config-9) lacks fail_to_pass tests due to a CI-only YAML change
- All prompts avoid leaking solution details
- Quality scores range from 0.20 (easy) to 0.78 (hard)
VALIDATION_REPORT.md documents detailed per-task analysis, quality distribution,
known issues, and file structure conventions.
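The fail_to_pass / pass_to_pass distinction used in the validation above can be sketched as follows: a fail_to_pass test fails on the base commit and passes after the patch, while a pass_to_pass test passes in both states. The helper and result dicts below are hypothetical illustrations, not part of the actual tooling.

```python
# Sketch: classify tests into SWE-bench categories from pre/post-patch
# results. A test is fail_to_pass if it fails before the patch and passes
# after; pass_to_pass if it passes in both states.
def classify(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Map pre/post-patch pass results to SWE-bench test categories."""
    return {
        "fail_to_pass": sorted(t for t in after if after[t] and not before.get(t, False)),
        "pass_to_pass": sorted(t for t in after if after[t] and before.get(t, False)),
    }

before = {"test_feature": False, "test_regression": True}
after = {"test_feature": True, "test_regression": True}
print(classify(before, after))
# → {'fail_to_pass': ['test_feature'], 'pass_to_pass': ['test_regression']}
```

Under this definition, a change that no test exercises (as with the CI-only YAML change in dwc-meltingplot-config-9) naturally yields an empty fail_to_pass list.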
echobt added a commit that referenced this pull request on Apr 8, 2026:
…ss 3 difficulty levels (#9)
* feat(dataset): generate and validate 9 SWE-bench tasks (3 easy, 3 medium, 3 hard)
* style: fix CI — apply cargo +nightly fmt formatting
* fix: resolve clippy errors — for_kv_map and overly_complex_bool_expr
Summary
Adds a generated and validated SWE-bench dataset containing 9 tasks (3 easy, 3 medium, 3 hard) sourced from real GitHub PRs, along with a detailed validation report.
Changes
- Dataset tasks (test-run/). Each task includes:
  - workspace.yaml with full metadata (base_commit, patch, fail_to_pass, pass_to_pass)
  - prompt.md with the task description
  - checks.txt with test commands
  - original_pr.md with PR context
- Parquet exports: .parquet files per difficulty level for direct consumption
- Validation report (test-run/VALIDATION_REPORT.md)