
Chunked writing of h5py.Dataset and zarr.Array #1624

Open · wants to merge 15 commits into base: main
Conversation

@ivirshup (Member) commented Aug 28, 2024

This PR fixes #1623 by writing backed dense arrays in chunks.

Very open to feedback on how the chunking pattern for writes is selected. Maybe we should prioritize the chunking of the destination array over the chunking of the source array?

cc: @ebezzi

Some proof it works:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 945.67 MiB, increment: 0.56 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1047.00 MiB, increment: 101.12 MiB

%memit write_elem(f, "X3", f["X"], dataset_kwargs={"compression":"gzip"})
# peak memory: 1068.03 MiB, increment: 6.14 MiB
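
For reference, a minimal sketch of the chunked-copy idea behind this change (not the PR's exact implementation; the helper name copy_dense_in_chunks and the fixed 1,000-row block size are illustrative):

import h5py

def copy_dense_in_chunks(src: h5py.Dataset, dest_group: h5py.Group, name: str, n_rows: int = 1000, **dataset_kwargs) -> h5py.Dataset:
    # Allocate the destination up front, then copy one block of rows at a time,
    # so only n_rows * n_cols entries are ever held in memory.
    dest = dest_group.create_dataset(name, shape=src.shape, dtype=src.dtype, **dataset_kwargs)
    for start in range(0, src.shape[0], n_rows):
        stop = min(start + n_rows, src.shape[0])
        dest[start:stop] = src[start:stop]
    return dest

# e.g. copy_dense_in_chunks(f["X"], f, "X4", compression="gzip")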


codecov bot commented Aug 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.63%. Comparing base (df213f6) to head (c6afa80).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   87.06%   84.63%   -2.44%     
==========================================
  Files          40       40              
  Lines        6101     6116      +15     
==========================================
- Hits         5312     5176     -136     
- Misses        789      940     +151     
Files with missing lines               | Coverage Δ
src/anndata/_io/specs/methods.py       | 88.54% <100.00%> (-0.06%) ⬇️

... and 7 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.10.10 milestone Aug 29, 2024
@ilan-gold ilan-gold modified the milestones: 0.10.10, 0.11.1, 0.11.2 Nov 7, 2024
@ilan-gold ilan-gold requested review from flying-sheep and removed request for flying-sheep December 8, 2024 13:47
src/anndata/_io/specs/methods.py, comment on lines +410 to +412:
entry_chunk_size = 100 * 1024 * 1024 // itemsize
# Number of rows that works out to
n_rows = max(entry_chunk_size // shape[0], 1000)
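
For a concrete sense of scale, an illustrative calculation assuming float64 data and the square 10,000 × 10,000 array from the description above:

itemsize = 8                                      # float64
entry_chunk_size = 100 * 1024 * 1024 // itemsize  # 13_107_200 entries, i.e. 100 MiB worth
n_rows = max(entry_chunk_size // 10_000, 1000)    # 1310 rows, so roughly 100 MiB written per block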
Member:
should any of this be configurable?

@ilan-gold (Contributor) commented Dec 10, 2024:
As things stand, the value depends on both the shape and an arbitrary cutoff in max(), so given the current implementation we could make two or three things configurable, which seems like overkill. Perhaps just n_rows should be a setting, with 1000 as the default?

@flying-sheep (Member) commented Dec 10, 2024:
We already have

chunk_size: int = 6000,  # TODO, probably make this 2d chunks

documented as “Used only when loading sparse dataset that is stored as dense.”

Also

chunks: tuple[int, ...] | None = None,

Let’s not create multiple different/incompatible conventions or features under the same name.

Contributor:
So this is an argument for not calling it chunk_size? I wasn't proposing literally calling it n_rows, just that that variable be the setting, as opposed to entry_chunk_size or the max value.

@flying-sheep (Member) commented Dec 12, 2024:
It’s an argument for keeping our terminology consistent when we get around to making this configurable. But we can also not do that for now.
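
As an aside, the destination's on-disk chunk shape can already be controlled today by forwarding h5py's own chunks argument through dataset_kwargs, which is distinct from the chunk_size / chunks parameters quoted above (a sketch; the (100, 100) chunk shape is illustrative):

import h5py
import numpy as np
from anndata.experimental import write_elem

with h5py.File("tmp.h5", "a") as f:
    # "chunks" is passed through to h5py's create_dataset
    write_elem(f, "X_chunked", np.ones((1_000, 1_000)), dataset_kwargs={"chunks": (100, 100)})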

Development

Successfully merging this pull request may close these issues.

Writing a h5py.Dataset loads the whole thing into memory