Skip to content

Conversation

@TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Dec 12, 2025

Closes #1478.

Compiles (at least the core rust crate does), but I haven't tested it yet.

EDIT: Ready

Comment on lines 122 to 129
checksum.clone(),
virtual_ref.length,
&mut virtual_bytes,
)?;
} else {
// No checksum: deduplicate by (location, offset, length)
let virtual_identifier = (
virtual_ref.location.clone(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I think it's possible to remove these clones (by passing a reference and then only actually cloning if needed), but it comes at the cost of an additional if. Maybe worth it though to minimize memory usage?

Copy link
Member Author

@TomNicholas TomNicholas Dec 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a todo for this. It might be worthwhile to run it on a really big virtual repo before merging just to see what the memory usage is like.

@TomNicholas
Copy link
Member Author

I added some tests, including of virtual chunk deduplication.

@TomNicholas TomNicholas requested a review from paraseba December 13, 2025 14:56
})
}

pub(crate) fn total_chunks_storage(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't bother trying to do a deprecation cycle in the rust crate.

DeprecationWarning,
)

stats = await self._repository.chunk_storage_stats(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to call the _async version here?

}

pub(crate) fn total_chunks_storage(
pub(crate) fn chunk_storage_stats(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are missing to mirror this change in the .pyi file.

@TomNicholas TomNicholas enabled auto-merge (squash) December 20, 2025 05:42
@TomNicholas TomNicholas merged commit 7843d66 into main Dec 29, 2025
17 checks passed
@TomNicholas TomNicholas deleted the virtual-storage-stats branch December 29, 2025 18:16
TomNicholas added a commit that referenced this pull request Dec 29, 2025
…#1483)

* define rust struct

* expose struct in python

* add python docstrings

* extend repo_chunks_storage calculation to include virtual and native chunks (double-counting virtual chunks that don't have checksums)

* fully deduplicate virtual chunks when counting by using (url, offset, length) tuple if checksum not available

* factor out helper function

* compile icechunk-python

* rewire existing python method to use new method

* add async method

* update outdated comment

* add python deprecation warnings

* restrict to pub(crate)

* correct bug in calculation of non-virtual bytes

* only use (location, offset, length) as an identifier fo de-duplicating virtual chunks

* add todo comment about cloning

* use the inner pattern to wrap the rust-only struct inside the python-exposed one

* correct name of python class

* expose __add__ method in python

* cargo formatting for python crate too

* rust tests

* correct python return type

* actually correct python return type

* call async method in python

* update .pyi file with new methods

* python linting

* explicitly specify stacklevel for warning

* expect warnings in tests
TomNicholas added a commit that referenced this pull request Jan 2, 2026
* Extend storage stats calculation to include virtual and inline chunks (#1483)

* define rust struct

* expose struct in python

* add python docstrings

* extend repo_chunks_storage calculation to include virtual and native chunks (double-counting virtual chunks that don't have checksums)

* fully deduplicate virtual chunks when counting by using (url, offset, length) tuple if checksum not available

* factor out helper function

* compile icechunk-python

* rewire existing python method to use new method

* add async method

* update outdated comment

* add python deprecation warnings

* restrict to pub(crate)

* correct bug in calculation of non-virtual bytes

* only use (location, offset, length) as an identifier fo de-duplicating virtual chunks

* add todo comment about cloning

* use the inner pattern to wrap the rust-only struct inside the python-exposed one

* correct name of python class

* expose __add__ method in python

* cargo formatting for python crate too

* rust tests

* correct python return type

* actually correct python return type

* call async method in python

* update .pyi file with new methods

* python linting

* explicitly specify stacklevel for warning

* expect warnings in tests

* resolve merge conflicts properly

* more merge fixes

* more merge fixes

* update paths for bundled test repo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a way to compute total size of all virtual chunks

3 participants