Commit 728edbd
feat(swe): validate benchmark dataset and harden pipeline quality gates (#10)

Generate and validate a 9-task SWE-bench dataset (3 easy, 3 medium, 3 hard)
selected from 23 candidates across 4 generation batches. Implement pipeline
improvements to address systemic quality issues discovered during validation.

Pipeline quality improvements (src/swe/):
- test_generator.rs: Increase MAX_VALIDATION_RETRIES from 2 to 3. Reject
  (rather than accept) tasks with an empty fail_to_pass, string-matching
  tests that remain after retries, dual-commit validation failures, or
  patch-apply failures. Enhance the system prompt with explicit pass_to_pass
  verification instructions that require agents to use the existing test
  infrastructure rather than creating new test files.
- filters.rs: Activate added_lines range validation (previously ignored via
  an underscore prefix). Add docs-only change detection via the
  is_docs_only_change() heuristic, which checks file extensions and names
  against known documentation patterns. Accept a new changed_files parameter
  for file-level filtering.
- quality.rs: Raise the min_quality_score default from 0.1 to 0.25. Require
  both the score threshold AND classification.quality_good for a task to pass
  the gate.
- pipeline.rs: Pass enriched.changed_files to keep_candidate() to enable the
  new docs-only filter.
- harness.rs: Increase clone depth from 100 to 500. Add an --unshallow
  fallback for when the shallow clone misses the target commit. Auto-select
  the Docker image based on task language. Add a docker_write_file() helper
  and copy test files from the meta.test_files JSON into containers.
- docker_sandbox.rs: Increase clone depth from 50 to 500 for consistency.
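
The filters.rs and quality.rs gates described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the extension and file-name lists, the helper is_doc_file(), the function passes_quality_gate(), the signatures, and the use of `>=` for the threshold are all assumptions; only is_docs_only_change, changed_files, min_quality_score, and quality_good are names taken from the message.

```rust
// Assumed documentation patterns; the lists used in filters.rs are not shown
// in this commit page.
const DOC_EXTENSIONS: &[&str] = &["md", "rst", "txt", "adoc"];
const DOC_FILE_NAMES: &[&str] = &["readme", "changelog", "license", "contributing"];

/// Hypothetical helper: true if a single path looks like documentation,
/// judged by its extension or its base name (case-insensitive).
fn is_doc_file(path: &str) -> bool {
    let file_name = path.rsplit('/').next().unwrap_or(path).to_ascii_lowercase();
    let (stem, ext) = match file_name.rsplit_once('.') {
        Some((stem, ext)) => (stem, Some(ext)),
        None => (file_name.as_str(), None),
    };
    ext.map_or(false, |e| DOC_EXTENSIONS.contains(&e)) || DOC_FILE_NAMES.contains(&stem)
}

/// Sketch of the docs-only heuristic: every changed file must look like
/// documentation. An empty file list is treated as "not docs-only" here,
/// which is an assumption.
fn is_docs_only_change(changed_files: &[String]) -> bool {
    !changed_files.is_empty() && changed_files.iter().all(|p| is_doc_file(p))
}

/// Sketch of the quality gate: the score threshold AND the quality_good
/// classification must both hold.
fn passes_quality_gate(score: f64, quality_good: bool, min_quality_score: f64) -> bool {
    score >= min_quality_score && quality_good
}
```

Under these assumptions, a candidate touching only README.md and docs/guide.rst would be filtered out as docs-only, and a task scoring 0.3 against the 0.25 default would still fail the gate if its classification is not quality_good.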

Dataset and documentation:
- test-run/: Raw generated tasks across easy, easy2, medium, hard batches
  (23 candidates total) with workspace.yaml, checks.txt, prompt.md, and
  test files for each task.
- validated-dataset/: 9 curated tasks organized by difficulty with full
  workspace metadata, test scripts, and parquet shards.
- benchmark_validation_report.md: Detailed analysis of all 23 candidates
  with quality ratings, rejection reasons, and pipeline recommendations.
- validation_summary.json: Machine-readable validation metrics.

1 parent 9f0c2c7 commit 728edbd
File tree
231 files changed, +5727 -13 lines changed

- src/swe
- test-run
  - easy2
    - Integrated-Disease-Monitoring-Kenya/dmi-etl-7
      - tests
    - amistio/.github-1
      - tests
    - cs360s26impact/impact-15
      - tests
        - data
    - merge-demo/mergequeue-bazel-5378
      - tests
    - online-store-2026/books-catalog-frontend-26
      - tests
  - easy
    - batocera-linux/batocera.linux-15418
      - tests
    - bfansports/aws-api-lambda-boilerplate-2
      - tests
    - happier-dev/happier-35
      - tests
    - kartoza/devops-app-88
      - tests
    - merge-demo
      - mergequeue-bazel-5345
        - tests
      - mergequeue-bazel-5404
        - tests
      - mergequeue-wf-44201
        - tests
  - hard
    - TrooHQ/troo-core-30
      - tests
    - eclipse-hawkbit/hawkbit-2923
      - tests
    - ep-eaglepoint-ai/bd_datasets_002-245
      - tests
    - stellatogrp/cvxro-56
      - tests
    - wopr-network/wopr-642
      - tests
  - medium
    - Altinn/altinn-studio-17755
      - tests
    - BibliothecaDAO/eternum-4225
      - tests
    - PostHog/posthog-48030
      - tests
    - alphagov/govuk-brand-guidelines-323
      - tests
    - enatega/food-delivery-multivendor-2052
      - tests
    - hermetoproject/hermeto-1294
      - tests
- validated-dataset
  - easy
    - batocera-linux__batocera.linux-15418
      - tests
    - cs360s26impact__impact-15
      - tests
    - happier-dev__happier-35
      - tests
  - hard
    - TrooHQ__troo-core-30
      - tests
    - ep-eaglepoint-ai__bd_datasets_002-245
      - tests
    - stellatogrp__cvxro-56
      - tests
  - medium
    - Altinn__altinn-studio-17755
      - tests
    - BibliothecaDAO__eternum-4225
      - tests
    - hermetoproject__hermeto-1294
      - tests
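
Comparing the test-run and validated-dataset entries in the tree above, curated tasks appear to be stored under a flattened name in which the `owner/repo-N` path becomes `owner__repo-N`. The sketch below illustrates that mapping only; flatten_task_id() and validated_task_dir() are hypothetical helpers inferred from the directory names, not functions from this commit.

```rust
/// Hypothetical helper: flattens a test-run task id such as "owner/repo-N"
/// into the "owner__repo-N" form seen under validated-dataset/.
/// Only the first '/' is replaced.
fn flatten_task_id(task_id: &str) -> String {
    task_id.replacen('/', "__", 1)
}

/// Hypothetical helper: builds the relative directory of a curated task
/// from its difficulty bucket and its test-run task id.
fn validated_task_dir(difficulty: &str, task_id: &str) -> String {
    format!("validated-dataset/{}/{}", difficulty, flatten_task_id(task_id))
}
```

For example, `validated_task_dir("hard", "TrooHQ/troo-core-30")` yields `validated-dataset/hard/TrooHQ__troo-core-30`, matching the entry listed in the tree.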