Add successful-hash index to the default cache by jorgee · Pull Request #7194 · nextflow-io/nextflow

jorgee · 2026-06-01T15:11:04Z

Summary

Adds a per-session forward index to Nextflow's default task cache that maps a task's content hash → the final (per-attempt) hash of its first successful execution. On resume, the cache resolves a completed task in a single index lookup instead of walking the iterate-and-bump per-attempt hash sequence — so a task that succeeded only after one or more failed/aborted attempts is recovered without re-executing.

This is an alternative to #6903 for the same underlying problem (#6884). Instead of recovering the starting point of the fragile per-attempt scan (previousTryCount), the successful hash is recorded directly and looked up on resume. The existing iterate-and-bump loop is left untouched as the fallback.

Apart from the #6884 fix, this PR also enables safely removing the previously failed, aborted executions without affecting the successful task cache resume process.

Design

CacheStore SPI — getSuccessfulHash / putSuccessfulHash / deleteSuccessfulHash, implemented by:
- DefaultCacheStore (LevelDB) — namespaced key (1-byte prefix, no collision with entry keys); inherently per-session (DB lives under cache/<uniqueId>). The two key namespaces (entry keys vs prefixed index keys) are documented as a key-space contract on the store.
- CloudCacheStore (nf-cloudcache) — object under the per-session dataPath (basePath/<uniqueId>/index/<hash>), not a shared prefix. Cross-session/global indexing is intentionally out of scope.
CacheDB — getSuccessfulHash + putSuccessfulHashAsync, enqueued on the same writer Agent after the entry write (commit-marker discipline: a reader never follows the index to a not-yet-durable entry). writeTaskEntry0 stores the task content hash in a fourth record slot so removeTaskEntry can clean the index pointer when an entry is deleted.
TaskRun — contentHash, captured once on the first attempt (null-guarded) and preserved across retries via clone()/makeCopy(). Replaces Fix cached task after abort #6903's previousTryCount.
Session — writes the pointer on successful execution (notifyTaskComplete, gated on isCompleted()) and on resume (notifyTaskCached, self-heals a stale pointer and upgrades a scan-path resume to a fast-path one).
TaskProcessor.checkCachedOrLaunchTask — content-hash capture plus a one-line delegation to a private resumeFromSuccessfulHashIndex(task) helper; falls through to the unchanged iterate-and-bump loop on miss/stale/error.

Index cleanup. On entry removal (removeTaskEntry, ref-count → 0) the content hash is recovered from the record and the index pointer is conditionally deleted — only when it targets the deleted entry, so removing a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Old 3-slot entries carry no content hash, so their pointer self-heals on the next resume; a full cache drop() wipes the index wholesale.

Always on, graceful fallback. The fourth record slot is a backward-compatible extension — older versions deserialize the record, use slots 0–2, and ignore the extra slot. Cross-version cache compatibility validated: a run on this branch resumes correctly (tasks cached) on released 26.04.x and 25.10.x, and vice-versa.

Testing

Unit: DefaultCacheStoreTest (round-trip + deleteSuccessfulHash + key non-collision + drop()), CloudCacheStoreTest (incl. per-session isolation + delete), CacheDBTest (delegation + conditional index cleanup: successful-entry pointer removed, failed-entry pointer preserved), SessionTest, TaskProcessorTest, TaskRunTest.
Integration: tests/resume-retried-with-abort.nf — reproduces Cache fail when specific conditions are met #6884 across abort/retry/resume cycles and asserts the retried→succeeded task resumes cached (verified locally: final resume reports cached=1).

Closes #6884.

🤖 Generated with Claude Code

Maintain a per-session forward index mapping a task's content hash to the final (per-attempt) hash of its first successful execution. On resume, the cache resolves a completed task in a single index lookup instead of walking the iterate-and-bump per-attempt hash sequence, so a task that succeeded only after one or more failed/aborted attempts is recovered without re-executing. This is an alternative to the previousTryCount approach for the same problem (issue #6884): rather than recovering the starting point of the fragile scan, the successful hash is recorded directly and looked up on resume. - CacheStore SPI: getHashIndex/putHashIndex, implemented by DefaultCacheStore (namespaced LevelDB key, inherently per-session) and CloudCacheStore (object under the per-session dataPath, not a shared prefix). - CacheDB: getHashIndex + putHashIndexAsync, enqueued after the entry write so a reader never follows the index to a not-yet-durable entry. - TaskRun: contentHash, captured once on the first attempt and preserved across retries via clone()/makeCopy(). - Session: write the pointer on successful execution (notifyTaskComplete) and on resume (notifyTaskCached, self-heal/upgrade). - TaskProcessor: fast-path index lookup; the iterate-and-bump loop is left untouched as the fallback for misses, stale pointers, or pre-existing caches. Always on with graceful fallback; no on-disk format change. Includes unit tests across the cache stores, CacheDB, Session and TaskProcessor, plus an end-to-end resume-after-retry-with-abort integration test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify · 2026-06-01T15:11:14Z

✅ Deploy Preview for nextflow-docs-staging canceled.

Name	Link
🔨 Latest commit	`0d77ac3`
🔍 Latest deploy log	https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a22d67fbeb4a800089751d8

Follow-up to the successful-hash index: - Index cleanup on entry removal. The task content hash is now stored in a fourth slot of the cache entry record (backward-compatible: readers ignore extra slots, old 3-slot entries lack it). When removeTaskEntry deletes an entry, it recovers the content hash and conditionally removes the index pointer -- only when the pointer targets the deleted entry, so deleting a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Adds deleteSuccessfulHash to the CacheStore SPI. - Rename the index methods to make their intent explicit at call sites: get/put/deleteSuccessfulHash (SPI + CacheDB), putSuccessfulHashAsync, resumeFromSuccessfulHashIndex, successfulIndexKey, SUCCESSFUL_HASH_INDEX_PREFIX. - Extract the resume fast-path out of checkCachedOrLaunchTask into a private resumeFromSuccessfulHashIndex(task) helper, leaving the iterate-and-bump loop unchanged, to keep the method small. - Document the DefaultCacheStore LevelDB key-space contract (two disjoint namespaces: KEY_SIZE-byte entry keys vs prefixed index keys) and update the spec accordingly. Cross-version cache compatibility validated manually: a run on this branch resumes correctly (tasks cached) on released 26.04.x and 25.10.x, and vice versa -- the fourth record slot is read/written transparently by older versions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jorgee <jorge.ejarque@seqera.io>

jorgee · 2026-06-05T12:00:58Z

Pushed b7d3bedca — follow-up on top of the initial commit:

Index cleanup on entry removal. The task content hash is now stored in a fourth slot of the cache entry record (backward-compatible: readers ignore extra slots; old 3-slot entries simply lack it). removeTaskEntry recovers it on delete and conditionally removes the index pointer — only when the pointer targets the deleted entry, so deleting a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Adds deleteSuccessfulHash to the CacheStore SPI.
Clearer naming — get/put/deleteSuccessfulHash (SPI + CacheDB), putSuccessfulHashAsync, resumeFromSuccessfulHashIndex, successfulIndexKey, SUCCESSFUL_HASH_INDEX_PREFIX.
Smaller method — the resume fast-path is extracted into a private resumeFromSuccessfulHashIndex(task) helper; the iterate-and-bump loop is left unchanged.
Docs — documented the DefaultCacheStore LevelDB key-space contract (two disjoint namespaces: KEY_SIZE-byte entry keys vs prefixed index keys).

Cross-version cache compatibility — validated manually: a run on this branch resumes correctly (tasks cached) when resumed with released 26.04.x and 25.10.x, and vice-versa. The new fourth record slot is read/written transparently by older versions (they deserialize the record, use slots 0–2, and ignore the extra slot).

- Skip the successful-hash index rewrite in Session.notifyTaskCached when the task was resumed via the index fast-path, where the pointer is already correct. Scan-path resumes still rewrite to self-heal/upgrade. Tracked with a transient TaskRun.resumedFromIndex flag set on the validation clone. - Make the index-pointer cleanup in CacheDB.removeTaskEntry best-effort: a bad/unreadable pointer is logged and swallowed so it can never derail entry removal (it self-heals on the next resume). - Add SessionTest coverage for the notifyTaskCached write path (both branches) and a CacheDBTest asserting entry removal survives a throwing index lookup. - De-flake tests/resume-retried-with-abort.nf: fail the first attempt fast and run the successful attempt long, widening the abort windows in the checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jorgee <jorge.ejarque@seqera.io>

jorgee · 2026-06-05T14:03:33Z

Follow-up: skip redundant index rewrite on fast-path resume + harden cleanup (0d77ac3)

Avoid a redundant index write on fast-path resumes. notifyTaskCached previously rewrote the contentHash → finalHash pointer on every resume. On a fast-path hit the pointer is already correct, so the rewrite is pure overhead (multiplied across every cached task on a large resume). A transient TaskRun.resumedFromIndex flag, set on the validation clone, lets notifyTaskCached skip it; scan-path resumes still rewrite to self-heal and upgrade the pointer.
Best-effort index cleanup on entry removal. removeTaskEntry's pointer cleanup is now wrapped so a bad/unreadable index pointer is logged and swallowed instead of propagating out of entry removal (e.g. nextflow clean). The entry is already deleted, and a stale pointer self-heals on the next resume.
Tests. Added SessionTest coverage for the notifyTaskCached write path (both the rewrite and the skip branch) and a CacheDBTest asserting entry removal survives a throwing index lookup.
De-flaked the integration test. resume-retried-with-abort.nf now fails the first attempt fast and runs the successful attempt long, widening the abort windows so the kill points sit comfortably mid-phase.

jorgee marked this pull request as ready for review June 5, 2026 12:07

jorgee requested review from bentsherman and pditommaso June 5, 2026 12:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add successful-hash index to the default cache#7194

Add successful-hash index to the default cache#7194
jorgee wants to merge 3 commits into
masterfrom
successful-hash-index

jorgee commented Jun 1, 2026 •

edited

Loading

Uh oh!

netlify Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

jorgee commented Jun 5, 2026

Uh oh!

jorgee commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jorgee commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Testing

Uh oh!

netlify Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for nextflow-docs-staging canceled.

Uh oh!

jorgee commented Jun 5, 2026

Uh oh!

jorgee commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jorgee commented Jun 1, 2026 •

edited

Loading

netlify Bot commented Jun 1, 2026 •

edited

Loading