
Conversation

@samantha-earthmover commented Jan 2, 2026

Warning

This entire PR was written by Claude Code. Review accordingly.

Summary

Adds chunk_storage_stats_by_prefix_async() as a memory-efficient alternative to the existing chunk storage stats methods. Instead of fetching and parsing all manifests (which builds massive HashSets of every chunk ID), this just lists objects under the chunks/ prefix and sums their sizes.

The Problem

total_chunks_storage_async() and chunk_storage_stats_async() use memory proportional to the total number of chunks in the repo, because they:

  1. Fetch all manifests from all snapshots
  2. Parse every chunk payload in every manifest
  3. Build giant HashSets to deduplicate chunks (both seen_native_chunks and seen_virtual_chunks)

For a repo with millions of chunks, that means holding millions of chunk IDs in memory at once.
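
For a sense of where the memory goes, the existing path does roughly the following (Python pseudocode for illustration only; the real implementation is Rust, and the helper names here are made up):

```python
# Illustrative pseudocode of the manifest-walking approach described above.
# Helper names (list_snapshots, fetch_manifests, chunk_payloads) are invented;
# the point is that every distinct chunk ID lands in an in-memory set.
async def chunk_stats_via_manifests(repo):
    seen_native_chunks = set()
    seen_virtual_chunks = set()
    native_bytes = 0
    for snapshot in await repo.list_snapshots():
        for manifest in await fetch_manifests(snapshot):
            for chunk in chunk_payloads(manifest):
                if chunk.is_native and chunk.id not in seen_native_chunks:
                    seen_native_chunks.add(chunk.id)    # grows with chunk count
                    native_bytes += chunk.length
                elif chunk.is_virtual:
                    seen_virtual_chunks.add(chunk.key)  # also grows
    return native_bytes
```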

The Solution (Such As It Is)

Just list the storage prefix. Native chunks are already stored deduplicated in the chunks/ directory with their chunk ID as the key, so we can just:

  1. Call storage.list_objects("chunks/")
  2. Sum up the size_bytes from the listing
  3. Return that as native_bytes

Memory usage is now constant regardless of repo size.
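
A minimal sketch of the idea, assuming a storage client whose listing yields objects with a size_bytes field (names are illustrative, not the actual Rust API):

```python
# Sketch only: `storage.list_objects` and `obj.size_bytes` stand in for
# whatever the concrete storage backend exposes.
async def native_bytes_by_prefix(storage) -> int:
    total = 0
    # The listing is consumed as a stream, so only one object's metadata is
    # in memory at a time -- constant memory regardless of repo size.
    async for obj in storage.list_objects("chunks/"):
        total += obj.size_bytes
    return total
```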

Caveats / Why This Might Be Hot Garbage

  1. Uses deprecated methods: The implementation uses storage() and storage_settings() which are marked deprecated. They still work, but this probably isn't the "right" way to access storage in the future.

  2. Only counts native_bytes: Virtual and inline chunks can't be calculated from storage listings, so virtual_bytes and inlined_bytes are always 0. This is probably fine since the old total_chunks_storage_async() only returned native_bytes anyway, but it's less complete than chunk_storage_stats_async().

  3. Untested on huge repos: This has only been tested with the existing small test repos. It should work great on massive repos, but I haven't actually verified it doesn't blow up.

  4. May not handle all storage backends correctly: Different storage implementations might behave differently when listing prefixes. Should be fine, but who knows.

  5. The whole approach feels hacky: Instead of properly optimizing the manifest parsing path (maybe streaming, maybe better data structures), this just goes around it entirely. It works, but it's a bit of a cop-out.

API

```python
# New memory-efficient method
stats = await repo.chunk_storage_stats_by_prefix_async()
print(stats.native_bytes)   # Works!
print(stats.virtual_bytes)  # Always 0
print(stats.inlined_bytes)  # Always 0
```

Testing

Added two new test cases:

  • Basic test with in-memory storage
  • Test against existing filesystem repos to verify it matches the old implementation

Both pass. Compiles cleanly with just the expected deprecation warnings.
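
The second test amounts to something like this (a sketch with hypothetical fixture names, not the actual test code):

```python
# Sketch of the equivalence check: the prefix-based stats should agree with
# the manifest-based stats on native_bytes, and report zero for the rest.
async def test_prefix_stats_match(repo):
    old = await repo.chunk_storage_stats_async()
    new = await repo.chunk_storage_stats_by_prefix_async()
    assert new.native_bytes == old.native_bytes
    assert new.virtual_bytes == 0   # not derivable from a storage listing
    assert new.inlined_bytes == 0   # inline chunks aren't objects under chunks/
```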

Should This Be Merged?

¯\_(ツ)_/¯

It solves the immediate memory problem in a simple way. Whether it's the "right" solution long-term is debatable. Would love feedback from maintainers on whether this approach is acceptable or if you'd prefer something more sophisticated.

🤖 Generated with Claude Code

Adds `chunk_storage_stats_by_prefix_async()` as a memory-efficient alternative
to `chunk_storage_stats_async()` that lists the chunks/ prefix directly instead
of fetching and parsing all manifests.

This dramatically reduces memory usage for large repos by avoiding building
massive HashSets to track every chunk ID across all manifests.

Note: Only calculates native_bytes (virtual_bytes and inlined_bytes are 0).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@TomNicholas (Member) commented:

> Virtual and inline chunks can't be calculated from storage listings, so virtual_bytes and inlined_bytes are always 0.

You can, however, place an upper bound on the storage used by virtual chunks by calling list_prefix on the virtual chunk container URL prefixes. But this estimate could be unboundedly wrong (e.g. if someone authorizes a prefix to a bucket but doesn't create any virtual chunks pointing at the contents of that bucket).
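
For illustration, that upper-bound estimate could look roughly like this (all names are hypothetical; each container points at an external bucket, so you would need a listing client for that bucket rather than the repo's own storage):

```python
# Rough sketch of the upper-bound idea: sum object sizes under each virtual
# chunk container's URL prefix. Over-counts anything in the prefix that no
# virtual chunk actually references, so the bound can be arbitrarily loose.
async def virtual_bytes_upper_bound(containers, make_client) -> int:
    total = 0
    for container in containers:            # virtual chunk containers
        client = make_client(container)     # listing client for that bucket
        async for obj in client.list_objects(container.url_prefix):
            total += obj.size_bytes
    return total
```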
