
Chunked writing of h5py.Dataset and zarr.Array #1624

Open · wants to merge 15 commits into base: main
Conversation

@ivirshup (Member) commented Aug 28, 2024

This PR fixes #1623 by writing backed dense arrays in chunks.

Very open to feedback on how the chunking pattern for writes is selected. Maybe we should prioritize the chunking of the destination array over the chunking of the source array?

cc: @ebezzi

Some proof it works:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 945.67 MiB, increment: 0.56 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1047.00 MiB, increment: 101.12 MiB

%memit write_elem(f, "X3", f["X"], dataset_kwargs={"compression":"gzip"})
# peak memory: 1068.03 MiB, increment: 6.14 MiB
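
For reference, a minimal sketch of the chunked-copy idea behind this change (not the PR's exact implementation; the helper name copy_dense_in_chunks and the fixed 1,000-row block size are illustrative):

import h5py

def copy_dense_in_chunks(src: h5py.Dataset, dest_group: h5py.Group, name: str, n_rows: int = 1000, **dataset_kwargs) -> h5py.Dataset:
    # Allocate the destination up front, then copy one block of rows at a time,
    # so only n_rows * n_cols entries are ever held in memory.
    dest = dest_group.create_dataset(name, shape=src.shape, dtype=src.dtype, **dataset_kwargs)
    for start in range(0, src.shape[0], n_rows):
        stop = min(start + n_rows, src.shape[0])
        dest[start:stop] = src[start:stop]
    return dest

# e.g. copy_dense_in_chunks(f["X"], f, "X4", compression="gzip")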


codecov bot commented Aug 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.63%. Comparing base (df213f6) to head (c6afa80).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   87.06%   84.63%   -2.44%     
==========================================
  Files          40       40              
  Lines        6101     6116      +15     
==========================================
- Hits         5312     5176     -136     
- Misses        789      940     +151     
Files with missing lines               | Coverage Δ
src/anndata/_io/specs/methods.py       | 88.54% <100.00%> (-0.06%) ⬇️

... and 7 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.10.10 milestone Aug 29, 2024
@ilan-gold ilan-gold modified the milestones: 0.10.10, 0.11.1, 0.11.2 Nov 7, 2024
@ilan-gold ilan-gold requested review from flying-sheep and removed request for flying-sheep December 8, 2024 13:47
src/anndata/_io/specs/methods.py, comment on lines +410 to +412:
entry_chunk_size = 100 * 1024 * 1024 // itemsize
# Number of rows that works out to
n_rows = max(entry_chunk_size // shape[0], 1000)
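
For a concrete sense of scale, an illustrative calculation assuming float64 data and the square 10,000 × 10,000 array from the description above:

itemsize = 8                                      # float64
entry_chunk_size = 100 * 1024 * 1024 // itemsize  # 13_107_200 entries, i.e. 100 MiB worth
n_rows = max(entry_chunk_size // 10_000, 1000)    # 1310 rows, so roughly 100 MiB written per block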
Member:
should any of this be configurable?

@ilan-gold (Contributor) commented Dec 10, 2024:
As things stand, the value depends on both the shape and an arbitrary cutoff in max(), so given the current implementation we could make two or three things configurable, which seems like overkill. Perhaps just n_rows should be a setting, with 1000 as the default?

@flying-sheep (Member) commented Dec 10, 2024:
We already have

chunk_size: int = 6000,  # TODO, probably make this 2d chunks

documented as “Used only when loading sparse dataset that is stored as dense.”

Also

chunks: tuple[int, ...] | None = None,

Let’s not create multiple different/incompatible conventions or features under the same name.

Contributor:
So this is an argument for not calling it chunk_size? I wasn't proposing literally calling it n_rows, just that that variable be the setting, as opposed to entry_chunk_size or the max value.

@flying-sheep (Member) commented Dec 12, 2024:
It’s an argument for keeping our terminology consistent when we get around to making this configurable. But we can also not do that for now.
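
As an aside, the destination's on-disk chunk shape can already be controlled today by forwarding h5py's own chunks argument through dataset_kwargs, which is distinct from the chunk_size / chunks parameters quoted above (a sketch; the (100, 100) chunk shape is illustrative):

import h5py
import numpy as np
from anndata.experimental import write_elem

with h5py.File("tmp.h5", "a") as f:
    # "chunks" is passed through to h5py's create_dataset
    write_elem(f, "X_chunked", np.ones((1_000, 1_000)), dataset_kwargs={"chunks": (100, 100)})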

Development

Successfully merging this pull request may close these issues.

Writing a h5py.Dataset loads the whole thing into memory