Chunked writing of h5py.Dataset and zarr.Array #1624
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   87.06%   84.63%   -2.44%
==========================================
  Files          40       40
  Lines        6101     6116      +15
==========================================
- Hits         5312     5176     -136
- Misses        789      940     +151
```
```python
entry_chunk_size = 100 * 1024 * 1024 // itemsize
# Number of rows that works out to
n_rows = max(entry_chunk_size // shape[0], 1000)
```
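For context, here is a minimal sketch (not this PR's actual implementation) of how the heuristic above could drive a blockwise copy. `write_in_chunks` and its signature are hypothetical, and `src` is assumed to be any array-like with `.shape`, `.dtype`, and basic slicing, such as an `h5py.Dataset`:

```python
import h5py

def write_in_chunks(dest_group: h5py.Group, name: str, src) -> h5py.Dataset:
    # Hypothetical helper, not part of anndata's API.
    shape, itemsize = src.shape, src.dtype.itemsize
    # Number of entries that fit in roughly 100 MiB, as in the diff above.
    entry_chunk_size = 100 * 1024 * 1024 // itemsize
    # Rows per write, with the arbitrary floor of 1000 discussed below.
    n_rows = max(entry_chunk_size // shape[0], 1000)
    dset = dest_group.create_dataset(name, shape=shape, dtype=src.dtype)
    for start in range(0, shape[0], n_rows):
        stop = min(start + n_rows, shape[0])
        dset[start:stop] = src[start:stop]  # one bounded-size write per block
    return dset
```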
should any of this be configurable?
As things stand, the value currently depends on both the shape and an arbitrary cutoff in `max`. So given the current implementation, we could make 2-3 things configurable, which seems like overkill. Perhaps just `n_rows` should be a setting, with 1000 as the default?
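A minimal sketch of that proposal, assuming a plain module-level default rather than any particular anndata settings mechanism (all names here are hypothetical):

```python
# Hypothetical setting; 1000 matches the current floor in the diff.
MIN_ROWS_PER_CHUNK: int = 1000

def rows_per_chunk(
    shape: tuple[int, ...],
    itemsize: int,
    min_rows: int = MIN_ROWS_PER_CHUNK,
) -> int:
    # Same heuristic as the diff above, with only the floor exposed.
    entry_chunk_size = 100 * 1024 * 1024 // itemsize
    return max(entry_chunk_size // shape[0], min_rows)
```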
We already have

anndata/src/anndata/_io/h5ad.py, line 176 in df213f6:

```python
chunk_size: int = 6000,  # TODO, probably make this 2d chunks
```

Also anndata/src/anndata/_io/zarr.py, line 30 in df213f6:

```python
chunks: tuple[int, ...] | None = None,
```
Let’s not create multiple different/incompatible conventions/features under the same name.
So this is an argument for not calling it `chunk_size`? I wasn't proposing literally calling it `n_rows`, just that that variable be the setting, as opposed to `entry_chunk_size` or the `max` value.
It’s an argument for keeping our terminology consistent when we get around to making this configurable. But we can also not do that for now.
This PR fixes #1623 by writing backed dense arrays in chunks.
Very open to feedback on the logic by which the chunking pattern for writes is selected. Maybe we should prioritize the chunking of the destination array over the chunking of the source array?
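For what it's worth, a sketch of the destination-first alternative, assuming a 2-D `zarr.Array` destination and sizing row blocks to its native chunking so that no destination chunk is written more than once (`copy_by_dest_chunks` is hypothetical):

```python
import zarr

def copy_by_dest_chunks(src, dest: zarr.Array) -> None:
    # Hypothetical sketch: align each write to the destination's
    # row chunking to avoid read-modify-write of partial chunks.
    rows_per_block = dest.chunks[0]
    for start in range(0, src.shape[0], rows_per_block):
        stop = min(start + rows_per_block, src.shape[0])
        dest[start:stop] = src[start:stop]
```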
cc: @ebezzi
Some proof it works: