Skip to content

Add successful-hash index to the default cache#7194

Open
jorgee wants to merge 3 commits into
masterfrom
successful-hash-index
Open

Add successful-hash index to the default cache#7194
jorgee wants to merge 3 commits into
masterfrom
successful-hash-index

Conversation

@jorgee

@jorgee jorgee commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a per-session forward index to Nextflow's default task cache that maps a task's content hash → the final (per-attempt) hash of its first successful execution. On resume, the cache resolves a completed task in a single index lookup instead of walking the iterate-and-bump per-attempt hash sequence — so a task that succeeded only after one or more failed/aborted attempts is recovered without re-executing.

This is an alternative to #6903 for the same underlying problem (#6884). Instead of recovering the starting point of the fragile per-attempt scan (previousTryCount), the successful hash is recorded directly and looked up on resume. The existing iterate-and-bump loop is left untouched as the fallback.

Apart from the #6884 fix, this PR also enables safely removing the previously failed, aborted executions without affecting the successful task cache resume process.

Design

  • CacheStore SPIgetSuccessfulHash / putSuccessfulHash / deleteSuccessfulHash, implemented by:
    • DefaultCacheStore (LevelDB) — namespaced key (1-byte prefix, no collision with entry keys); inherently per-session (DB lives under cache/<uniqueId>). The two key namespaces (entry keys vs prefixed index keys) are documented as a key-space contract on the store.
    • CloudCacheStore (nf-cloudcache) — object under the per-session dataPath (basePath/<uniqueId>/index/<hash>), not a shared prefix. Cross-session/global indexing is intentionally out of scope.
  • CacheDBgetSuccessfulHash + putSuccessfulHashAsync, enqueued on the same writer Agent after the entry write (commit-marker discipline: a reader never follows the index to a not-yet-durable entry). writeTaskEntry0 stores the task content hash in a fourth record slot so removeTaskEntry can clean the index pointer when an entry is deleted.
  • TaskRuncontentHash, captured once on the first attempt (null-guarded) and preserved across retries via clone()/makeCopy(). Replaces Fix cached task after abort #6903's previousTryCount.
  • Session — writes the pointer on successful execution (notifyTaskComplete, gated on isCompleted()) and on resume (notifyTaskCached, self-heals a stale pointer and upgrades a scan-path resume to a fast-path one).
  • TaskProcessor.checkCachedOrLaunchTask — content-hash capture plus a one-line delegation to a private resumeFromSuccessfulHashIndex(task) helper; falls through to the unchanged iterate-and-bump loop on miss/stale/error.

Index cleanup. On entry removal (removeTaskEntry, ref-count → 0) the content hash is recovered from the record and the index pointer is conditionally deleted — only when it targets the deleted entry, so removing a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Old 3-slot entries carry no content hash, so their pointer self-heals on the next resume; a full cache drop() wipes the index wholesale.

Always on, graceful fallback. The fourth record slot is a backward-compatible extension — older versions deserialize the record, use slots 0–2, and ignore the extra slot. Cross-version cache compatibility validated: a run on this branch resumes correctly (tasks cached) on released 26.04.x and 25.10.x, and vice-versa.

Testing

  • Unit: DefaultCacheStoreTest (round-trip + deleteSuccessfulHash + key non-collision + drop()), CloudCacheStoreTest (incl. per-session isolation + delete), CacheDBTest (delegation + conditional index cleanup: successful-entry pointer removed, failed-entry pointer preserved), SessionTest, TaskProcessorTest, TaskRunTest.
  • Integration: tests/resume-retried-with-abort.nf — reproduces Cache fail when specific conditions are met #6884 across abort/retry/resume cycles and asserts the retried→succeeded task resumes cached (verified locally: final resume reports cached=1).

Closes #6884.

🤖 Generated with Claude Code

Maintain a per-session forward index mapping a task's content hash to the
final (per-attempt) hash of its first successful execution. On resume, the
cache resolves a completed task in a single index lookup instead of walking
the iterate-and-bump per-attempt hash sequence, so a task that succeeded only
after one or more failed/aborted attempts is recovered without re-executing.

This is an alternative to the previousTryCount approach for the same problem
(issue #6884): rather than recovering the starting point of the fragile scan,
the successful hash is recorded directly and looked up on resume.

- CacheStore SPI: getHashIndex/putHashIndex, implemented by DefaultCacheStore
  (namespaced LevelDB key, inherently per-session) and CloudCacheStore (object
  under the per-session dataPath, not a shared prefix).
- CacheDB: getHashIndex + putHashIndexAsync, enqueued after the entry write so
  a reader never follows the index to a not-yet-durable entry.
- TaskRun: contentHash, captured once on the first attempt and preserved across
  retries via clone()/makeCopy().
- Session: write the pointer on successful execution (notifyTaskComplete) and on
  resume (notifyTaskCached, self-heal/upgrade).
- TaskProcessor: fast-path index lookup; the iterate-and-bump loop is left
  untouched as the fallback for misses, stale pointers, or pre-existing caches.

Always on with graceful fallback; no on-disk format change. Includes unit tests
across the cache stores, CacheDB, Session and TaskProcessor, plus an end-to-end
resume-after-retry-with-abort integration test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@netlify

netlify Bot commented Jun 1, 2026

Copy link
Copy Markdown

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 0d77ac3
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a22d67fbeb4a800089751d8

Follow-up to the successful-hash index:

- Index cleanup on entry removal. The task content hash is now stored in a
  fourth slot of the cache entry record (backward-compatible: readers ignore
  extra slots, old 3-slot entries lack it). When removeTaskEntry deletes an
  entry, it recovers the content hash and conditionally removes the index
  pointer -- only when the pointer targets the deleted entry, so deleting a
  failed attempt that shares the same content hash never clobbers the pointer
  to the successful one. Adds deleteSuccessfulHash to the CacheStore SPI.

- Rename the index methods to make their intent explicit at call sites:
  get/put/deleteSuccessfulHash (SPI + CacheDB), putSuccessfulHashAsync,
  resumeFromSuccessfulHashIndex, successfulIndexKey, SUCCESSFUL_HASH_INDEX_PREFIX.

- Extract the resume fast-path out of checkCachedOrLaunchTask into a private
  resumeFromSuccessfulHashIndex(task) helper, leaving the iterate-and-bump loop
  unchanged, to keep the method small.

- Document the DefaultCacheStore LevelDB key-space contract (two disjoint
  namespaces: KEY_SIZE-byte entry keys vs prefixed index keys) and update the
  spec accordingly.

Cross-version cache compatibility validated manually: a run on this branch
resumes correctly (tasks cached) on released 26.04.x and 25.10.x, and vice
versa -- the fourth record slot is read/written transparently by older versions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee

jorgee commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Pushed b7d3bedca — follow-up on top of the initial commit:

  • Index cleanup on entry removal. The task content hash is now stored in a fourth slot of the cache entry record (backward-compatible: readers ignore extra slots; old 3-slot entries simply lack it). removeTaskEntry recovers it on delete and conditionally removes the index pointer — only when the pointer targets the deleted entry, so deleting a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Adds deleteSuccessfulHash to the CacheStore SPI.
  • Clearer namingget/put/deleteSuccessfulHash (SPI + CacheDB), putSuccessfulHashAsync, resumeFromSuccessfulHashIndex, successfulIndexKey, SUCCESSFUL_HASH_INDEX_PREFIX.
  • Smaller method — the resume fast-path is extracted into a private resumeFromSuccessfulHashIndex(task) helper; the iterate-and-bump loop is left unchanged.
  • Docs — documented the DefaultCacheStore LevelDB key-space contract (two disjoint namespaces: KEY_SIZE-byte entry keys vs prefixed index keys).

Cross-version cache compatibility — validated manually: a run on this branch resumes correctly (tasks cached) when resumed with released 26.04.x and 25.10.x, and vice-versa. The new fourth record slot is read/written transparently by older versions (they deserialize the record, use slots 0–2, and ignore the extra slot).

@jorgee jorgee marked this pull request as ready for review June 5, 2026 12:07
@jorgee jorgee requested review from bentsherman and pditommaso June 5, 2026 12:12
- Skip the successful-hash index rewrite in Session.notifyTaskCached when
  the task was resumed via the index fast-path, where the pointer is already
  correct. Scan-path resumes still rewrite to self-heal/upgrade. Tracked with
  a transient TaskRun.resumedFromIndex flag set on the validation clone.
- Make the index-pointer cleanup in CacheDB.removeTaskEntry best-effort: a
  bad/unreadable pointer is logged and swallowed so it can never derail entry
  removal (it self-heals on the next resume).
- Add SessionTest coverage for the notifyTaskCached write path (both branches)
  and a CacheDBTest asserting entry removal survives a throwing index lookup.
- De-flake tests/resume-retried-with-abort.nf: fail the first attempt fast and
  run the successful attempt long, widening the abort windows in the checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee

jorgee commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: skip redundant index rewrite on fast-path resume + harden cleanup (0d77ac3)

  • Avoid a redundant index write on fast-path resumes. notifyTaskCached previously rewrote the contentHash → finalHash pointer on every resume. On a fast-path hit the pointer is already correct, so the rewrite is pure overhead (multiplied across every cached task on a large resume). A transient TaskRun.resumedFromIndex flag, set on the validation clone, lets notifyTaskCached skip it; scan-path resumes still rewrite to self-heal and upgrade the pointer.
  • Best-effort index cleanup on entry removal. removeTaskEntry's pointer cleanup is now wrapped so a bad/unreadable index pointer is logged and swallowed instead of propagating out of entry removal (e.g. nextflow clean). The entry is already deleted, and a stale pointer self-heals on the next resume.
  • Tests. Added SessionTest coverage for the notifyTaskCached write path (both the rewrite and the skip branch) and a CacheDBTest asserting entry removal survives a throwing index lookup.
  • De-flaked the integration test. resume-retried-with-abort.nf now fails the first attempt fast and runs the successful attempt long, widening the abort windows so the kill points sit comfortably mid-phase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cache fail when specific conditions are met

1 participant