Summary
6 internetarchive/openlibrary instances have P2P test names that contain year-dependent parameters generated by @pytest.mark.parametrize. These test names were recorded when the dataset was created in 2025, but shift when the evaluation runs in 2026 or later, causing false P2P failures.
Affected instances
| Instance | F2P | P2P (stale) | P2P (fixed) |
|---|---|---|---|
| openlibrary-1351c59f | 1/1 | 16/18 | 18/18 |
| openlibrary-fdbc0d8f | 1/1 | 37/39 | 39/39 |
| openlibrary-b112069e | 3/3 | 62/64 | 64/64 |
| openlibrary-1894cb48 | 1/1 | 62/64 | 64/64 |
| openlibrary-43f9e7e0 | 1/1 | 60/62 | 62/62 |
| openlibrary-08ac40d0 | 6/7 | 119/121 | 121/121 (2 fewer false failures) |
Root cause
The test test_future_publication_dates_are_deleted in openlibrary/catalog/add_book/tests/test_add_book.py generates parametrized test IDs using the current year:
@pytest.mark.parametrize("date,expected", [
("2000-11-11", True),
(str(current_year), True), # dynamic
(str(current_year + 1), False), # dynamic
("9999-01-01", False),
])
def test_future_publication_dates_are_deleted(date, expected):
...
When the dataset was created (2025):
test_future_publication_dates_are_deleted[2025-True]
test_future_publication_dates_are_deleted[2026-False]
When the evaluation runs (2026):
test_future_publication_dates_are_deleted[2026-True]
test_future_publication_dates_are_deleted[2027-False]
The harness matches test names by exact string comparison, so it cannot find [2025-True], marks the test as NOT RUN, and counts it as a FAIL.
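The mismatch can be reproduced without pytest. The sketch below is illustrative only: `param_ids` is a hypothetical helper that imitates how the IDs shown above are formed (each parameter's string form joined with `-`), and the set difference stands in for the harness's exact-match lookup, which is not reproduced here.

```python
# Illustrative only: imitates the test IDs shown above; not the harness's code.

def param_ids(current_year: int) -> set[str]:
    """Return the test names generated when the suite runs in `current_year`."""
    test = "test_future_publication_dates_are_deleted"
    params = [
        ("2000-11-11", True),
        (str(current_year), True),        # dynamic
        (str(current_year + 1), False),   # dynamic
        ("9999-01-01", False),
    ]
    return {f"{test}[{date}-{expected}]" for date, expected in params}


recorded = param_ids(2025)   # names stored in the dataset
collected = param_ids(2026)  # names reported at evaluation time

# Exact string matching: a recorded name that is absent from the run is NOT RUN.
not_run = sorted(recorded - collected)
print(not_run)
```

Two of the four recorded names have no counterpart in the 2026 run, which matches the two missing P2P tests per instance in the table above.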
Verification
We confirmed this by running the golden patch against the dataset:
| | Stale dataset (2025 names) | Updated dataset (2026 names) |
|---|---|---|
| Golden patch | FAIL (0/5) | PASS (5/5) |
The golden patch itself fails when the dataset has stale year-dependent test names. All tests actually pass — the failure is purely a name mismatch.
Suggested fix
Update the pass_to_pass field in the dataset for the 6 affected instances:
[2025-True] → [2026-True]
[2026-False] → [2027-False]
Or consider excluding year-parameterized tests from P2P lists, since they will go stale every year.
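Both options could be scripted along these lines. This is a minimal sketch, not the dataset's tooling: `refresh_year_ids`, `drop_year_ids`, and the assumption that `pass_to_pass` is available as a plain list of test-name strings are all hypothetical.

```python
import re

# Hypothetical helpers; assumes pass_to_pass is a plain list of test-name strings.
TEST_NAME = "test_future_publication_dates_are_deleted"
# Matches only the year-only parameters, e.g. "[2025-True]", not "[2000-11-11-True]".
YEAR_PARAM = re.compile(r"\[(\d{4})-(True|False)\]$")


def refresh_year_ids(names: list[str], recorded_year: int, eval_year: int) -> list[str]:
    """Option 1: shift the two dynamic year parameters to the evaluation year."""
    offset = eval_year - recorded_year

    def shift(match: re.Match) -> str:
        year = int(match.group(1))
        # Only the "current year" and "current year + 1" parameters are dynamic.
        if year in (recorded_year, recorded_year + 1):
            return f"[{year + offset}-{match.group(2)}]"
        return match.group(0)

    return [YEAR_PARAM.sub(shift, n) if TEST_NAME in n else n for n in names]


def drop_year_ids(names: list[str]) -> list[str]:
    """Option 2: exclude the year-parameterized cases from the list entirely."""
    return [n for n in names if not (TEST_NAME in n and YEAR_PARAM.search(n))]


stale = [
    f"{TEST_NAME}[2000-11-11-True]",
    f"{TEST_NAME}[2025-True]",
    f"{TEST_NAME}[2026-False]",
    f"{TEST_NAME}[9999-01-01-False]",
]
print(refresh_year_ids(stale, recorded_year=2025, eval_year=2026))
print(drop_year_ids(stale))
```

The first option has to be rerun every year; the second trades a little P2P coverage for names that do not expire.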
Impact
5 of the 6 affected instances are currently false negatives: the golden patch produces correct results but is marked as failed due to stale test names in the dataset.