Add successful-hash index to the default cache#7194
Open
jorgee wants to merge 3 commits into
Open
Conversation
Maintain a per-session forward index mapping a task's content hash to the final (per-attempt) hash of its first successful execution. On resume, the cache resolves a completed task in a single index lookup instead of walking the iterate-and-bump per-attempt hash sequence, so a task that succeeded only after one or more failed/aborted attempts is recovered without re-executing. This is an alternative to the previousTryCount approach for the same problem (issue #6884): rather than recovering the starting point of the fragile scan, the successful hash is recorded directly and looked up on resume. - CacheStore SPI: getHashIndex/putHashIndex, implemented by DefaultCacheStore (namespaced LevelDB key, inherently per-session) and CloudCacheStore (object under the per-session dataPath, not a shared prefix). - CacheDB: getHashIndex + putHashIndexAsync, enqueued after the entry write so a reader never follows the index to a not-yet-durable entry. - TaskRun: contentHash, captured once on the first attempt and preserved across retries via clone()/makeCopy(). - Session: write the pointer on successful execution (notifyTaskComplete) and on resume (notifyTaskCached, self-heal/upgrade). - TaskProcessor: fast-path index lookup; the iterate-and-bump loop is left untouched as the fallback for misses, stale pointers, or pre-existing caches. Always on with graceful fallback; no on-disk format change. Includes unit tests across the cache stores, CacheDB, Session and TaskProcessor, plus an end-to-end resume-after-retry-with-abort integration test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jorgee <jorge.ejarque@seqera.io>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Follow-up to the successful-hash index: - Index cleanup on entry removal. The task content hash is now stored in a fourth slot of the cache entry record (backward-compatible: readers ignore extra slots, old 3-slot entries lack it). When removeTaskEntry deletes an entry, it recovers the content hash and conditionally removes the index pointer -- only when the pointer targets the deleted entry, so deleting a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Adds deleteSuccessfulHash to the CacheStore SPI. - Rename the index methods to make their intent explicit at call sites: get/put/deleteSuccessfulHash (SPI + CacheDB), putSuccessfulHashAsync, resumeFromSuccessfulHashIndex, successfulIndexKey, SUCCESSFUL_HASH_INDEX_PREFIX. - Extract the resume fast-path out of checkCachedOrLaunchTask into a private resumeFromSuccessfulHashIndex(task) helper, leaving the iterate-and-bump loop unchanged, to keep the method small. - Document the DefaultCacheStore LevelDB key-space contract (two disjoint namespaces: KEY_SIZE-byte entry keys vs prefixed index keys) and update the spec accordingly. Cross-version cache compatibility validated manually: a run on this branch resumes correctly (tasks cached) on released 26.04.x and 25.10.x, and vice versa -- the fourth record slot is read/written transparently by older versions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Contributor
Author
|
Pushed
Cross-version cache compatibility — validated manually: a run on this branch resumes correctly (tasks cached) when resumed with released 26.04.x and 25.10.x, and vice-versa. The new fourth record slot is read/written transparently by older versions (they deserialize the record, use slots 0–2, and ignore the extra slot). |
- Skip the successful-hash index rewrite in Session.notifyTaskCached when the task was resumed via the index fast-path, where the pointer is already correct. Scan-path resumes still rewrite to self-heal/upgrade. Tracked with a transient TaskRun.resumedFromIndex flag set on the validation clone. - Make the index-pointer cleanup in CacheDB.removeTaskEntry best-effort: a bad/unreadable pointer is logged and swallowed so it can never derail entry removal (it self-heals on the next resume). - Add SessionTest coverage for the notifyTaskCached write path (both branches) and a CacheDBTest asserting entry removal survives a throwing index lookup. - De-flake tests/resume-retried-with-abort.nf: fail the first attempt fast and run the successful attempt long, widening the abort windows in the checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Contributor
Author
|
Follow-up: skip redundant index rewrite on fast-path resume + harden cleanup (
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a per-session forward index to Nextflow's default task cache that maps a task's content hash → the final (per-attempt) hash of its first successful execution. On resume, the cache resolves a completed task in a single index lookup instead of walking the iterate-and-bump per-attempt hash sequence — so a task that succeeded only after one or more failed/aborted attempts is recovered without re-executing.
This is an alternative to #6903 for the same underlying problem (#6884). Instead of recovering the starting point of the fragile per-attempt scan (
previousTryCount), the successful hash is recorded directly and looked up on resume. The existing iterate-and-bump loop is left untouched as the fallback.Apart from the #6884 fix, this PR also enables safely removing the previously failed, aborted executions without affecting the successful task cache resume process.
Design
CacheStoreSPI —getSuccessfulHash/putSuccessfulHash/deleteSuccessfulHash, implemented by:DefaultCacheStore(LevelDB) — namespaced key (1-byte prefix, no collision with entry keys); inherently per-session (DB lives undercache/<uniqueId>). The two key namespaces (entry keys vs prefixed index keys) are documented as a key-space contract on the store.CloudCacheStore(nf-cloudcache) — object under the per-sessiondataPath(basePath/<uniqueId>/index/<hash>), not a shared prefix. Cross-session/global indexing is intentionally out of scope.CacheDB—getSuccessfulHash+putSuccessfulHashAsync, enqueued on the same writerAgentafter the entry write (commit-marker discipline: a reader never follows the index to a not-yet-durable entry).writeTaskEntry0stores the task content hash in a fourth record slot soremoveTaskEntrycan clean the index pointer when an entry is deleted.TaskRun—contentHash, captured once on the first attempt (null-guarded) and preserved across retries viaclone()/makeCopy(). Replaces Fix cached task after abort #6903'spreviousTryCount.Session— writes the pointer on successful execution (notifyTaskComplete, gated onisCompleted()) and on resume (notifyTaskCached, self-heals a stale pointer and upgrades a scan-path resume to a fast-path one).TaskProcessor.checkCachedOrLaunchTask— content-hash capture plus a one-line delegation to a privateresumeFromSuccessfulHashIndex(task)helper; falls through to the unchanged iterate-and-bump loop on miss/stale/error.Index cleanup. On entry removal (
removeTaskEntry, ref-count → 0) the content hash is recovered from the record and the index pointer is conditionally deleted — only when it targets the deleted entry, so removing a failed attempt that shares the same content hash never clobbers the pointer to the successful one. Old 3-slot entries carry no content hash, so their pointer self-heals on the next resume; a full cachedrop()wipes the index wholesale.Always on, graceful fallback. The fourth record slot is a backward-compatible extension — older versions deserialize the record, use slots 0–2, and ignore the extra slot. Cross-version cache compatibility validated: a run on this branch resumes correctly (tasks cached) on released 26.04.x and 25.10.x, and vice-versa.
Testing
DefaultCacheStoreTest(round-trip +deleteSuccessfulHash+ key non-collision +drop()),CloudCacheStoreTest(incl. per-session isolation + delete),CacheDBTest(delegation + conditional index cleanup: successful-entry pointer removed, failed-entry pointer preserved),SessionTest,TaskProcessorTest,TaskRunTest.tests/resume-retried-with-abort.nf— reproduces Cache fail when specific conditions are met #6884 across abort/retry/resume cycles and asserts the retried→succeeded task resumes cached (verified locally: final resume reportscached=1).Closes #6884.
🤖 Generated with Claude Code