
Conversation

@djeebus
Contributor

@djeebus djeebus commented Sep 5, 2025

Note

Adds an NFS-backed cache for template builds behind a feature flag, introduces atomic file writes, updates storage interfaces, and enhances cache metrics/tracing with comprehensive tests.

  • Build/Orchestrator:
    • Conditionally wrap templateStorage with storage.NewCachedProvider during build via useNFSCache, gated by featureflags.BuildingFeatureFlagName and SharedChunkCacheDir.
  • Feature Flags:
    • Add BuildingFeatureFlagName (use-nfs-for-building-templates).
  • Storage:
    • Interfaces: Extract WriteFromFSCtx; update ObjectProvider and SeekableObjectProvider to use it.
    • Atomic writes/locking: Add lock.AtomicFile with OpenFile for atomic, lock-guarded writes; tests included.
    • Cache provider:
      • Async local delete on DeleteObjectsWithPrefix.
      • Add error recording helpers and OpenTelemetry tracing integration.
      • Add cache metrics counters (ops, bytes) with hit/miss and operation tags.
    • CachedObjectProvider:
      • Read from cache if available; otherwise read remote and asynchronously write to cache.
      • On Write/WriteFromFileSystem, write to remote and cache concurrently; use lock.OpenFile for atomic cache writes.
    • CachedSeekableObjectProvider:
      • Improve ReadAt/Size to report cache hit/miss, record errors, and async backfill cache.
      • Implement concurrent chunked caching from filesystem with bounded concurrency; write local size metadata atomically.
    • Utils: Add moveWithoutReplace and cleanup helpers; refine EOF handling.
    • Tests: New and updated tests for atomic file behavior and cache read/write paths.

Written by Cursor Bugbot for commit 83bcfe4.

@djeebus djeebus marked this pull request as ready for review September 5, 2025 22:42
cursor[bot]

This comment was marked as outdated.


@djeebus djeebus added the improvement Improvement for current functionality label Sep 10, 2025
@djeebus djeebus force-pushed the bring-back-cache-writes branch from d141efb to f8f1a6f Compare September 16, 2025 18:14
@dobrac dobrac self-requested a review September 17, 2025 11:26
@djeebus djeebus force-pushed the bring-back-cache-writes branch from 68ece42 to 5130329 Compare September 18, 2025 02:02
dobrac
dobrac previously requested changes Sep 18, 2025
@djeebus djeebus dismissed dobrac’s stale review September 26, 2025 21:22

ready for a new review!

@ValentaTomas ValentaTomas removed their request for review October 7, 2025 05:25
@ValentaTomas
Member

@djeebus Let's split this into a PR preparing for this change (refactor) and a separate PR adding the caching behavior.

@ValentaTomas ValentaTomas self-assigned this Oct 8, 2025
@djeebus djeebus force-pushed the bring-back-cache-writes branch from 2b47e2d to 5e6fc67 Compare October 8, 2025 22:39
@djeebus
Contributor Author

djeebus commented Oct 8, 2025

@djeebus Let's split this into a PR preparing for this change (refactor) and a separate PR adding the caching behavior.

OK, I've removed all the nice-to-haves, the rest will go in other PRs.

@ValentaTomas ValentaTomas requested review from ValentaTomas and removed request for jakubno October 10, 2025 22:31
@ValentaTomas ValentaTomas marked this pull request as draft October 10, 2025 22:52
@djeebus
Contributor Author

djeebus commented Oct 10, 2025

There is currently a mismatch between reading/writing whole files and reading/writing chunked files. We need to do two things:

  1. Line things up so that the template writes the same files that the orchestrator expects to read
  2. Find a way to throw errors if the function calls don't line up across structs

@djeebus
Contributor Author

djeebus commented Oct 16, 2025

This is blocked by #1361; will refactor after that gets merged.

@djeebus djeebus force-pushed the bring-back-cache-writes branch from b258581 to 77192cf Compare November 19, 2025 00:17
@djeebus djeebus marked this pull request as ready for review November 19, 2025 19:33

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 64 to 68
go c.writeFileToCache(
context.WithoutCancel(ctx),
bytes.NewReader(p),
cacheOpWrite,
)

P1 Badge Avoid async caching from caller buffer after Write returns

The new Write implementation spawns a goroutine that streams bytes.NewReader(p) into the cache while immediately returning the result of inner.Write. Because p is provided by the caller, the io.Writer contract allows the caller to reuse or mutate the slice as soon as Write returns; the goroutine may therefore read modified data or race with the caller and cache corrupted bytes even when the remote write succeeded. Copy the data or defer launching the cache write until after the payload is safely owned to prevent serving incorrect cached content.


@ValentaTomas
Member

My one concern here is that because some teams are building tens of thousands of templates, this might start evicting cache entries disproportionately.

templateStorage = storage.NewCachedProvider(path, templateStorage)
}

index := cache.NewHashIndex(bc.CacheScope, builder.buildStorage, templateStorage)

Bug: NFS cache wrapping not applied to storage operations

The templateStorage variable is wrapped with NewCachedProvider at lines 229-231 to enable NFS caching, but the unwrapped builder.templateStorage is then passed to layerExecutor (line 241), baseBuilder (line 251), and postProcessingBuilder (line 293). Additionally, the unwrapped builder.templateStorage is used at line 327 for getRootfsSize. This means the cache wrapping is never actually used for any storage operations, defeating the entire purpose of the caching feature introduced in this PR.


c.writeFileToCache(ctx, input, cacheOpWriteFromFileSystem)
}(context.WithoutCancel(ctx))


Bug: File descriptor leak in WriteFromFileSystem cache operation

In WriteFromFileSystem, a file is opened at line 93 but never closed. The *os.File is passed as an io.Reader to writeFileToCache, which cannot close it since it only accepts the io.Reader interface. The goroutine runs asynchronously and leaves the file handle open, causing file descriptor leaks over time.


@djeebus
Contributor Author

djeebus commented Nov 20, 2025

Okay, I'll work on getting a cache effectiveness dashboard added. We can monitor that to see if cache effectiveness goes down, and disable via flag if it doesn't improve things. Does that work?
