Summary
We identified two distinct test-name mismatches in the evaluation pipeline that cause instances to be marked as unresolved even when every required test passes. Both affect `instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan` and likely others.
Bug 1 — Trailing whitespace in parser output
**Root cause:** `run_script.sh` uses a `sed` command to prefix test descriptions with the source filename (e.g., `describe("Foo")` → `describe("test/file.js::Foo")`). In certain test titles the substitution leaves a trailing space. `parser.py` preserves this space in the `name` field it writes to `output.json`.
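The exact `sed` expression is not reproduced here, but the mechanism can be sketched in Python with a hypothetical rewrite (the regex and filename below are illustrative, not the real script):

```python
import re

# Hypothetical sketch of the sed-style rewrite, NOT the real run_script.sh
# command: prefix each describe() title with its source filename. If the
# original title ends with a space inside the quotes, the capture group
# carries that space straight into the rewritten name.
line = 'describe("getSortedSetRange() should work with big arrays (length > 100) ")'
rewritten = re.sub(r'describe\("([^"]*)"\)',
                   r'describe("test/database.js::\1")',
                   line)
print(repr(rewritten))  # the trailing space survives inside the quotes
```

Any prefixing rule that captures the title verbatim will preserve such whitespace; the parser then faithfully copies it into `output.json`.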
**Effect:** The comparison in `swe_bench_pro_eval.py`:

```python
passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
result = (f2p | p2p) <= passed_tests
```
uses exact string equality. A name like:

```
"test/database.js | ... getSortedSetRange() should work with big arrays (length > 100) "
                                                                                      ^
                                                                          trailing space
```

does not match the `fail_to_pass` entry, which has no trailing space, so the test is considered missing.
**Affected test (confirmed):**

```
test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100)
```
**Fix (included in the linked PR):** call `.strip()` on both sides of the comparison.
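A minimal sketch of that normalization, using sample data in place of the real pipeline structures:

```python
# Sample data standing in for the real output.json and dataset fields.
output = {"tests": [
    # parser output with the trailing space described above
    {"name": "getSortedSetRange() should work with big arrays (length > 100) ",
     "status": "PASSED"},
]}
f2p = {"getSortedSetRange() should work with big arrays (length > 100)"}
p2p = set()

# Stripping both sides makes the subset check whitespace-insensitive.
passed_tests = {x["name"].strip() for x in output["tests"] if x["status"] == "PASSED"}
result = {t.strip() for t in (f2p | p2p)} <= passed_tests
print(result)  # True once both sides are stripped
```

Without the `.strip()` calls the same data yields `False`, which is exactly the failure mode described above.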
Bug 2 — Truncated test names in the `fail_to_pass` dataset field
**Root cause:** The `fail_to_pass` column is stored as a serialized Python list. Some test titles contain embedded double-quote characters (e.g., `should accurately build digest list given ACP default "day"`). During dataset generation the closing `"` of the embedded value was lost, producing a truncated entry.
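The generation code is not available here, but the pattern of losses is consistent with a trailing-quote strip. The rule below is an assumption illustrating one plausible mechanism, not the confirmed cause:

```python
# Assumption: a plausible mechanism, not the confirmed generation bug.
# Stripping a trailing double quote removes the closing " only when it is
# the last character of the title.
titles = [
    'should accurately build digest list given ACP default "day"',
    'should accurately build digest list given ACP default "null" (not set)',
]
for t in titles:
    print(repr(t.rstrip('"')))
# The first title loses its closing quote; the second is untouched because
# "(not set)" follows the embedded quote.
```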
**Affected entries in `fail_to_pass` for instance `instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan`:**

| Stored value (truncated) | Correct value |
| --- | --- |
| `...given ACP default "day` | `...given ACP default "day"` |
| `...given ACP default "week` | `...given ACP default "week"` |
| `...given ACP default "off` | `...given ACP default "off"` |
Note that `...given ACP default "null" (not set)` is stored correctly because additional text follows the closing `"`, which prevented truncation.
**Effect:** The truncated names never match any entry in `passed_tests` (the parser correctly outputs the full name with the closing `"`), so the instance always scores `False` regardless of patch quality.
**Fix needed:** Update the dataset on HuggingFace (`ScaleAI/SWE-bench_Pro`) to restore the missing closing `"` in those three entries. This cannot be addressed via a code-only PR, since the source of truth is the dataset.
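Until the dataset is updated, a consumer-side workaround could restore the lost quote. The `repair` helper below is a hypothetical sketch, not part of the evaluation code; it relies on the observation that a truncated name has an odd number of `"` characters:

```python
def repair(name: str) -> str:
    """Restore a closing double quote truncated from the end of a test name.

    Hypothetical helper: an odd count of '"' means the trailing closing
    quote was lost during serialization, so re-append it; balanced names
    pass through unchanged.
    """
    return name + '"' if name.count('"') % 2 == 1 else name

print(repair('...given ACP default "day'))              # closing quote restored
print(repair('...given ACP default "null" (not set)'))  # balanced, unchanged
```

This heuristic only holds if titles never legitimately contain an odd number of quotes, which is why fixing the dataset itself remains the proper resolution.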
Reproduction
```python
import ast
from datasets import load_dataset

ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")
for rec in ds:
    if rec["instance_id"] == "instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan":
        # fail_to_pass is a serialized Python list; literal_eval is a safer
        # alternative to eval for deserializing it
        f2p = ast.literal_eval(rec["fail_to_pass"])
        for t in f2p:
            if "Digest.getSubscribers" in t:
                print(repr(t))
        break
```
Output:

```
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day'
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week'
'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "off'
```
Related PR
A PR fixing Bug 1 (whitespace normalization) is open.
Bug 2 requires a dataset update on HuggingFace.