Test name mismatches cause false negatives in eval scoring (whitespace + truncated quotes in fail_to_pass) #76

@pedropnaves

Summary

We identified two distinct test-name mismatches in the evaluation pipeline that cause instances to be marked as unresolved even when every required test passes. Both affect instance instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan and likely others.


Bug 1 — Trailing whitespace in parser output

Root cause: run_script.sh uses a sed command to prefix test descriptions with the source filename (e.g., describe("Foo") becomes describe("test/file.js::Foo")). For certain test titles the substitution leaves a trailing space, and parser.py preserves this space in the name field it writes to output.json.

Effect: The comparison in swe_bench_pro_eval.py:

passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
result = (f2p | p2p) <= passed_tests

uses exact string equality. A name like:

"test/database.js | ... getSortedSetRange() should work with big arrays (length > 100) "
                                                                                       ^
                                                                         trailing space

does not match the fail_to_pass entry which has no trailing space, so the test is considered missing.

Affected test (confirmed):

test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100)

Fix (included in the linked PR): call .strip() on both sides of the comparison.
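
As a rough illustration of that fix, the comparison could be normalized along these lines. This is a minimal sketch, not the code from the linked PR: the helper names normalize and is_resolved are hypothetical, and it assumes output has the {"tests": [{"name", "status"}]} shape implied by the snippet above.

```python
def normalize(names):
    """Return the set of names with leading/trailing whitespace removed."""
    return {n.strip() for n in names}


def is_resolved(output, f2p, p2p):
    """True when every required test name appears among the passed tests.

    Both sides are stripped, so a stray trailing space in either the
    parser output or the dataset field no longer causes a mismatch.
    """
    passed = normalize(
        t["name"] for t in output["tests"] if t["status"] == "PASSED"
    )
    required = normalize(f2p) | normalize(p2p)
    return required <= passed


# Hypothetical data: the parser emitted a trailing space, the dataset did not.
sample_output = {
    "tests": [
        {"name": "should work with big arrays (length > 100) ", "status": "PASSED"},
    ]
}
print(is_resolved(sample_output, {"should work with big arrays (length > 100)"}, set()))
# prints True
```

Stripping both sides rather than only the parser output also covers the reverse case, where the dataset entry carries the stray whitespace.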


Bug 2 — Truncated test names in the fail_to_pass dataset field

Root cause: The fail_to_pass column is stored as a serialized Python list. Some test titles contain embedded double-quote characters (e.g., should accurately build digest list given ACP default "day"). During dataset generation the closing " of the embedded value was lost, producing a truncated entry.

Affected entries in fail_to_pass for instance instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan:

Stored value (truncated)          Correct value
...given ACP default "day         ...given ACP default "day"
...given ACP default "week        ...given ACP default "week"
...given ACP default "off         ...given ACP default "off"

Note that ...given ACP default "null" (not set) is stored correctly because additional text follows the closing ", which prevented truncation.

Effect: The truncated names never match any entry in passed_tests (the parser correctly outputs the full name with the closing "), so the instance always scores False regardless of patch quality.

Fix needed: Update the dataset on HuggingFace (ScaleAI/SWE-bench_Pro) to restore the missing closing " in those three entries. This cannot be addressed via a code-only PR since the source of truth is the dataset.
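
Until the dataset is corrected, entries with this kind of damage can be flagged heuristically by looking for an odd number of double-quote characters. The sketch below is only a heuristic under that assumption: find_unbalanced_quotes is a hypothetical helper, and a legitimate test title containing an odd number of quotes would be reported as a false positive.

```python
def find_unbalanced_quotes(entries):
    """Return the entries that contain an odd number of '"' characters.

    An odd count suggests an opening quote whose closing quote was lost,
    which is the truncation pattern described in this issue.
    """
    return [e for e in entries if e.count('"') % 2 == 1]


# Shortened example names modelled on the affected entries.
entries = [
    'build digest list given ACP default "day',
    'build digest list given ACP default "null" (not set)',
    'should work with big arrays (length > 100)',
]
for suspect in find_unbalanced_quotes(entries):
    print(repr(suspect))
```

Running the same check over every fail_to_pass and pass_to_pass list in the dataset would give a candidate list of other instances to inspect by hand.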


Reproduction

import ast

from datasets import load_dataset

INSTANCE_ID = "instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan"

ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")
for rec in ds:
    if rec["instance_id"] == INSTANCE_ID:
        # fail_to_pass is a serialized Python list; literal_eval parses it without executing code.
        f2p = ast.literal_eval(rec["fail_to_pass"])
        for t in f2p:
            if "Digest.getSubscribers" in t:
                print(repr(t))
        break

Output:

'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day'
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week'
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "off'
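
To tell which of the two bugs a given instance is hitting, the required names can be compared against the passed names and each miss classified. This is a sketch under assumptions: diagnose is a hypothetical helper, not part of the eval scripts, and the prefix test used for "truncated" can also match an unrelated test whose name merely begins with the same text.

```python
def diagnose(required, passed):
    """Classify each required test name that is absent from `passed`.

    Returns {required_name: (kind, closest_passed_name_or_None)} where kind is
    "whitespace" (equal after strip), "truncated" (required name is a prefix
    of a passed name), or "missing" (no near match found).
    """
    by_stripped = {p.strip(): p for p in passed}
    report = {}
    for name in sorted(required):
        if name in passed:
            continue  # exact match, nothing to report
        key = name.strip()
        if key in by_stripped:
            report[name] = ("whitespace", by_stripped[key])
            continue
        longer = sorted(s for s in by_stripped if s.startswith(key))
        if longer:
            report[name] = ("truncated", by_stripped[longer[0]])
        else:
            report[name] = ("missing", None)
    return report


# Shortened, hypothetical names reproducing both mismatch patterns.
passed = {'big arrays (length > 100) ', 'ACP default "day"'}
required = {'big arrays (length > 100)', 'ACP default "day', 'not run at all'}
for name, (kind, match) in diagnose(required, passed).items():
    print(f"{kind:10} {name!r} -> {match!r}")
```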

Related PR

A PR fixing Bug 1 (whitespace normalization) is open.

Bug 2 requires a dataset update on HuggingFace.
