
Conversation

@samantha-earthmover commented Jan 2, 2026

Warning

This entire PR was written by Claude Code. Review accordingly.

Summary

Adds chunk_storage_stats_by_prefix_async() as a memory-efficient alternative to the existing chunk storage stats methods. Instead of fetching and parsing all manifests (which builds massive HashSets of every chunk ID), this just lists objects under the chunks/ prefix and sums their sizes.

The Problem

total_chunks_storage_async() and chunk_storage_stats_async() use memory proportional to the total number of chunks in the repo, because they:

  1. Fetch all manifests from all snapshots
  2. Parse every chunk payload in every manifest
  3. Build giant HashSets to deduplicate chunks (both seen_native_chunks and seen_virtual_chunks)

For a repo with millions of chunks, that means holding millions of chunk IDs in memory at once.
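
For a sense of where the memory goes, the existing path does roughly the following (Python pseudocode for illustration only; the real implementation is Rust, and the helper names here are made up):

```python
# Illustrative pseudocode of the manifest-walking approach described above.
# Helper names (list_snapshots, fetch_manifests, chunk_payloads) are invented;
# the point is that every distinct chunk ID lands in an in-memory set.
async def chunk_stats_via_manifests(repo):
    seen_native_chunks = set()
    seen_virtual_chunks = set()
    native_bytes = 0
    for snapshot in await repo.list_snapshots():
        for manifest in await fetch_manifests(snapshot):
            for chunk in chunk_payloads(manifest):
                if chunk.is_native and chunk.id not in seen_native_chunks:
                    seen_native_chunks.add(chunk.id)    # grows with chunk count
                    native_bytes += chunk.length
                elif chunk.is_virtual:
                    seen_virtual_chunks.add(chunk.key)  # also grows
    return native_bytes
```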

The Solution (Such As It Is)

Just list the storage prefix. Native chunks are already stored deduplicated in the chunks/ directory with their chunk ID as the key, so we can just:

  1. Call storage.list_objects("chunks/")
  2. Sum up the size_bytes from the listing
  3. Return that as native_bytes

Memory usage is now constant regardless of repo size.
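
A minimal sketch of the idea, assuming a storage client whose listing yields objects with a size_bytes field (names are illustrative, not the actual Rust API):

```python
# Sketch only: `storage.list_objects` and `obj.size_bytes` stand in for
# whatever the concrete storage backend exposes.
async def native_bytes_by_prefix(storage) -> int:
    total = 0
    # The listing is consumed as a stream, so only one object's metadata is
    # in memory at a time -- constant memory regardless of repo size.
    async for obj in storage.list_objects("chunks/"):
        total += obj.size_bytes
    return total
```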

Caveats / Why This Might Be Hot Garbage

  1. Uses deprecated methods: The implementation uses storage() and storage_settings() which are marked deprecated. They still work, but this probably isn't the "right" way to access storage in the future.

  2. Only counts native_bytes: Virtual and inline chunks can't be calculated from storage listings, so virtual_bytes and inlined_bytes are always 0. This is probably fine since the old total_chunks_storage_async() only returned native_bytes anyway, but it's less complete than chunk_storage_stats_async().

  3. Untested on huge repos: This has only been tested with the existing small test repos. It should work great on massive repos, but I haven't actually verified it doesn't blow up.

  4. May not handle all storage backends correctly: Different storage implementations might behave differently when listing prefixes. Should be fine, but who knows.

  5. The whole approach feels hacky: Instead of properly optimizing the manifest parsing path (maybe streaming, maybe better data structures), this just goes around it entirely. It works, but it's a bit of a cop-out.

API

```python
# New memory-efficient method
stats = await repo.chunk_storage_stats_by_prefix_async()
print(stats.native_bytes)   # Works!
print(stats.virtual_bytes)  # Always 0
print(stats.inlined_bytes)  # Always 0
```

Testing

Added two new test cases:

  • Basic test with in-memory storage
  • Test against existing filesystem repos to verify it matches the old implementation

Both pass. Compiles cleanly with just the expected deprecation warnings.
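
The second test amounts to something like this (a sketch with hypothetical fixture names, not the actual test code):

```python
# Sketch of the equivalence check: the prefix-based stats should agree with
# the manifest-based stats on native_bytes, and report zero for the rest.
async def test_prefix_stats_match(repo):
    old = await repo.chunk_storage_stats_async()
    new = await repo.chunk_storage_stats_by_prefix_async()
    assert new.native_bytes == old.native_bytes
    assert new.virtual_bytes == 0   # not derivable from a storage listing
    assert new.inlined_bytes == 0   # inline chunks aren't objects under chunks/
```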

Should This Be Merged?

¯\_(ツ)_/¯

It solves the immediate memory problem in a simple way. Whether it's the "right" solution long-term is debatable. Would love feedback from maintainers on whether this approach is acceptable or if you'd prefer something more sophisticated.

🤖 Generated with Claude Code

Adds `chunk_storage_stats_by_prefix_async()` as a memory-efficient alternative
to `chunk_storage_stats_async()` that lists the chunks/ prefix directly instead
of fetching and parsing all manifests.

This dramatically reduces memory usage for large repos by avoiding building
massive HashSets to track every chunk ID across all manifests.

Note: Only calculates native_bytes (virtual_bytes and inlined_bytes are 0).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@TomNicholas (Member) commented:

> Virtual and inline chunks can't be calculated from storage listings, so virtual_bytes and inlined_bytes are always 0.

You can, however, place an upper bound on the storage used by virtual chunks by calling list_prefix on the virtual chunk container URL prefixes. But this estimate could be unboundedly wrong (e.g. if someone authorizes a prefix to a bucket but doesn't create any virtual chunks pointing at the contents of that bucket).
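
For illustration, that upper-bound estimate could look roughly like this (all names are hypothetical; each container points at an external bucket, so you would need a listing client for that bucket rather than the repo's own storage):

```python
# Rough sketch of the upper-bound idea: sum object sizes under each virtual
# chunk container's URL prefix. Over-counts anything in the prefix that no
# virtual chunk actually references, so the bound can be arbitrarily loose.
async def virtual_bytes_upper_bound(containers, make_client) -> int:
    total = 0
    for container in containers:            # virtual chunk containers
        client = make_client(container)     # listing client for that bucket
        async for obj in client.list_objects(container.url_prefix):
            total += obj.size_bytes
    return total
```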
