Summary
We identified two distinct test-name mismatches in the evaluation pipeline that cause instances to be marked as unresolved even when every required test passes. Both affect `instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan` and likely others.
Bug 1 — Trailing whitespace in parser output
**Root cause:** `run_script.sh` uses a `sed` command to prefix test descriptions with the source filename (e.g., `describe("Foo")` → `describe("test/file.js::Foo")`). In certain test titles the substitution leaves a trailing space. `parser.py` preserves this space in the `name` field it writes to `output.json`.
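The exact `sed` expression is not reproduced here, but the mechanism can be sketched in Python with a hypothetical rewrite (the regex and filename below are illustrative, not the real script):

```python
import re

# Hypothetical sketch of the sed-style rewrite, NOT the real run_script.sh
# command: prefix each describe() title with its source filename. If the
# original title ends with a space inside the quotes, the capture group
# carries that space straight into the rewritten name.
line = 'describe("getSortedSetRange() should work with big arrays (length > 100) ")'
rewritten = re.sub(r'describe\("([^"]*)"\)',
                   r'describe("test/database.js::\1")',
                   line)
print(repr(rewritten))  # the trailing space survives inside the quotes
```

Any prefixing rule that captures the title verbatim will preserve such whitespace; the parser then faithfully copies it into `output.json`.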
**Effect:** The comparison in `swe_bench_pro_eval.py`:

```python
passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
result = (f2p | p2p) <= passed_tests
```
uses exact string equality. A name like:

```
"test/database.js | ... getSortedSetRange() should work with big arrays (length > 100) "
                                                                                      ^
                                                                          trailing space
```

does not match the `fail_to_pass` entry, which has no trailing space, so the test is considered missing.
**Affected test (confirmed):**

```
test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100)
```
**Fix (included in the linked PR):** call `.strip()` on both sides of the comparison.
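A minimal sketch of that normalization, using sample data in place of the real pipeline structures:

```python
# Sample data standing in for the real output.json and dataset fields.
output = {"tests": [
    # parser output with the trailing space described above
    {"name": "getSortedSetRange() should work with big arrays (length > 100) ",
     "status": "PASSED"},
]}
f2p = {"getSortedSetRange() should work with big arrays (length > 100)"}
p2p = set()

# Stripping both sides makes the subset check whitespace-insensitive.
passed_tests = {x["name"].strip() for x in output["tests"] if x["status"] == "PASSED"}
result = {t.strip() for t in (f2p | p2p)} <= passed_tests
print(result)  # True once both sides are stripped
```

Without the `.strip()` calls the same data yields `False`, which is exactly the failure mode described above.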
Bug 2 — Truncated test names in the `fail_to_pass` dataset field
**Root cause:** The `fail_to_pass` column is stored as a serialized Python list. Some test titles contain embedded double-quote characters (e.g., `should accurately build digest list given ACP default "day"`). During dataset generation the closing `"` of the embedded value was lost, producing a truncated entry.
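The generation code is not available here, but the pattern of losses is consistent with a trailing-quote strip. The rule below is an assumption illustrating one plausible mechanism, not the confirmed cause:

```python
# Assumption: a plausible mechanism, not the confirmed generation bug.
# Stripping a trailing double quote removes the closing " only when it is
# the last character of the title.
titles = [
    'should accurately build digest list given ACP default "day"',
    'should accurately build digest list given ACP default "null" (not set)',
]
for t in titles:
    print(repr(t.rstrip('"')))
# The first title loses its closing quote; the second is untouched because
# "(not set)" follows the embedded quote.
```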
**Affected entries in `fail_to_pass` for instance `instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan`:**

| Stored value (truncated) | Correct value |
| --- | --- |
| `...given ACP default "day` | `...given ACP default "day"` |
| `...given ACP default "week` | `...given ACP default "week"` |
| `...given ACP default "off` | `...given ACP default "off"` |
Note that `...given ACP default "null" (not set)` is stored correctly because additional text follows the closing `"`, which prevented truncation.
**Effect:** The truncated names never match any entry in `passed_tests` (the parser correctly outputs the full name with the closing `"`), so the instance always scores `False` regardless of patch quality.
**Fix needed:** Update the dataset on HuggingFace (`ScaleAI/SWE-bench_Pro`) to restore the missing closing `"` in those three entries. This cannot be addressed via a code-only PR, since the source of truth is the dataset.
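Until the dataset is updated, a consumer-side workaround could restore the lost quote. The `repair` helper below is a hypothetical sketch, not part of the evaluation code; it relies on the observation that a truncated name has an odd number of `"` characters:

```python
def repair(name: str) -> str:
    """Restore a closing double quote truncated from the end of a test name.

    Hypothetical helper: an odd count of '"' means the trailing closing
    quote was lost during serialization, so re-append it; balanced names
    pass through unchanged.
    """
    return name + '"' if name.count('"') % 2 == 1 else name

print(repair('...given ACP default "day'))              # closing quote restored
print(repair('...given ACP default "null" (not set)'))  # balanced, unchanged
```

This heuristic only holds if titles never legitimately contain an odd number of quotes, which is why fixing the dataset itself remains the proper resolution.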
Reproduction
```python
import ast
from datasets import load_dataset

ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")
for rec in ds:
    if rec["instance_id"] == "instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan":
        # fail_to_pass is a serialized Python list; literal_eval is a safer
        # alternative to eval for deserializing it
        f2p = ast.literal_eval(rec["fail_to_pass"])
        for t in f2p:
            if "Digest.getSubscribers" in t:
                print(repr(t))
        break
```
Output:

```
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day'
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week'
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "off'
```
Related PR
A PR fixing Bug 1 (whitespace normalization) is open.
Bug 2 requires a dataset update on HuggingFace.