Add memory-efficient chunk storage stats via prefix listing #1542
Warning
This entire PR was written by Claude Code. Review accordingly.
Summary
Adds chunk_storage_stats_by_prefix_async() as a memory-efficient alternative to the existing chunk storage stats methods. Instead of fetching and parsing all manifests (which builds massive HashSets of every chunk ID), this just lists objects under the chunks/ prefix and sums their sizes.
The Problem
total_chunks_storage_async() and chunk_storage_stats_async() go ballistic on memory usage for large repos because they fetch and parse every manifest and track every chunk they have seen in HashSets (seen_native_chunks and seen_virtual_chunks). For a repo with millions of chunks, this eats ridiculous amounts of memory.
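To make the memory cost concrete, here is a rough sketch of the deduplicate-with-a-HashSet pattern described above. It is not the actual implementation: ChunkRef, its fields, and dedup_sum are hypothetical names invented for illustration, and chunk IDs are modeled as plain strings. The point is only that the set ends up holding one entry per unique chunk, so memory grows with repo size.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a chunk reference found while parsing a manifest.
pub struct ChunkRef {
    pub id: String,
    pub size_bytes: u64,
}

/// Sums the sizes of unique chunks and reports how many IDs the set had to hold.
/// The set keeps one entry per unique chunk, so memory is O(unique chunks).
pub fn dedup_sum<I>(refs: I) -> (u64, usize)
where
    I: IntoIterator<Item = ChunkRef>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut total: u64 = 0;
    for ChunkRef { id, size_bytes } in refs {
        // `insert` returns true only the first time an ID is seen.
        if seen.insert(id) {
            total += size_bytes;
        }
    }
    (total, seen.len())
}
```

With millions of distinct chunk IDs, the returned set size (and the memory behind it) is in the millions as well.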
The Solution (Such As It Is)
Just list the storage prefix. Native chunks are already stored deduplicated in the chunks/ directory with their chunk ID as the key, so we can just:
1. Call storage.list_objects("chunks/")
2. Sum size_bytes from the listing
3. Return the total as native_bytes
Memory usage is now constant regardless of repo size.
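As a minimal sketch of the list-and-sum idea, assuming the listing can be consumed as an iterator of entries that each carry a key and a size: ListedObject, ChunkStorageStats, and stats_from_listing below are hypothetical stand-ins, not the types or signatures used in this PR, and the real listing is async rather than a plain iterator.

```rust
/// Hypothetical stand-in for one entry returned by a storage listing.
#[derive(Debug, Clone)]
pub struct ListedObject {
    pub key: String,
    pub size_bytes: u64,
}

/// Hypothetical result type; only `native_bytes` can be derived from a listing.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ChunkStorageStats {
    pub native_bytes: u64,
    pub virtual_bytes: u64,
    pub inlined_bytes: u64,
}

/// Streams over the listing once and keeps a single running total,
/// so memory use does not depend on how many objects are listed.
pub fn stats_from_listing<I>(listing: I, prefix: &str) -> ChunkStorageStats
where
    I: IntoIterator<Item = ListedObject>,
{
    let native_bytes: u64 = listing
        .into_iter()
        // Defensive filter in case the listing includes keys outside the prefix.
        .filter(|obj| obj.key.starts_with(prefix))
        .map(|obj| obj.size_bytes)
        .sum();

    // Virtual and inline sizes are not visible in a storage listing; they stay 0.
    ChunkStorageStats {
        native_bytes,
        ..Default::default()
    }
}
```

No set of chunk IDs is needed here because each native chunk is stored once under its ID, so every listed key is counted exactly once.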
Caveats / Why This Might Be Hot Garbage
Uses deprecated methods: The implementation uses storage() and storage_settings(), which are marked deprecated. They still work, but this probably isn't the "right" way to access storage in the future.
Only counts native_bytes: Virtual and inline chunks can't be calculated from storage listings, so virtual_bytes and inlined_bytes are always 0. This is probably fine since the old total_chunks_storage_async() only returned native_bytes anyway, but it's less complete than chunk_storage_stats_async().
Untested on huge repos: This has only been tested with the existing small test repos. It should work great on massive repos, but I haven't actually verified it doesn't blow up.
May not handle all storage backends correctly: Different storage implementations might behave differently when listing prefixes. Should be fine, but who knows.
The whole approach feels hacky: Instead of properly optimizing the manifest parsing path (maybe streaming, maybe better data structures), this just goes around it entirely. It works, but it's a bit of a cop-out.
API
Testing
Added two new test cases:
Both pass. Compiles cleanly with just the expected deprecation warnings.
Should This Be Merged?
¯\_(ツ)_/¯
It solves the immediate memory problem in a simple way. Whether it's the "right" solution long-term is debatable. Would love feedback from maintainers on whether this approach is acceptable or if you'd prefer something more sophisticated.
🤖 Generated with Claude Code