I think there are two broad sets of operations: operations that depend only on the chunk layout (the discussion title) and operations that depend on both the chunk layout and an access pattern (the question at the end of the discussion). Here's an expanded framing based on this layering, where complexity grows with how much information has to be supplied with a question (nothing, a coordinate, a selection, or a batch of selections):
L0–L2 are access-independent (answerable from a static layout object); L3–L5 take an access and need a partitioner. The "Asked by" column lists the consumers that need each answer and how they get it today: public (a public zarr API), private (reaches into […]
Access-independent: a layout object can answer these (pure description of the grid):
Access-parameterized: these require the partitioner (an algorithm run over a selection):
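As a rough illustration of that split, here is a minimal one-dimensional sketch. `Layout1D` and `partition` are hypothetical names, not zarr APIs, and I'm assuming that a single-coordinate lookup sits on the layout side of the line. The layout answers questions from the chunk sizes alone, while the partitioner needs a selection as input.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class Layout1D:
    """Hypothetical static layout: chunk sizes along a single axis."""

    chunk_sizes: tuple[int, ...]

    @property
    def length(self) -> int:
        # Access-independent: needs no input beyond the layout itself.
        return sum(self.chunk_sizes)

    def chunk_bounds(self) -> list[tuple[int, int]]:
        # Access-independent: half-open [start, stop) extent of every chunk.
        stops = list(accumulate(self.chunk_sizes))
        starts = [0] + stops[:-1]
        return list(zip(starts, stops))

    def chunk_of(self, coord: int) -> int:
        # Answerable from the layout plus one coordinate (assumed layout-side).
        for index, (start, stop) in enumerate(self.chunk_bounds()):
            if start <= coord < stop:
                return index
        raise IndexError(coord)


def partition(layout: Layout1D, selection: slice) -> dict[int, int]:
    """Hypothetical partitioner: selected element count per touched chunk."""
    selected = range(*selection.indices(layout.length))
    counts: dict[int, int] = {}
    for index, (start, stop) in enumerate(layout.chunk_bounds()):
        # Intersect the selection with this chunk's extent.
        hits = sum(1 for i in selected if start <= i < stop)
        if hits:
            counts[index] = hits
    return counts


layout = Layout1D(chunk_sizes=(7, 7, 4))
print(layout.chunk_bounds())               # layout-only question
print(layout.chunk_of(9))                  # layout plus a coordinate
print(partition(layout, slice(5, 16, 2)))  # layout plus a selection
```

The brute-force intersection is only there to keep the sketch short; a real partitioner would compute the overlap arithmetically.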
Four things follow from the ladder:
-
Some recent changes (adding rectilinear chunks) and upcoming ones (lazy slicing?) are straining the APIs that tell users how an array is partitioned. I think this is really important information to get right, and we would benefit from thinking through the design together, hence this discussion. For background, we had a related discussion prior to the 3.x release.
Here is a quick summary of our current situation:
- […] `chunks` attribute, and a `shards` attribute. Neither `chunks` nor `shards` is an array metadata field in the v3 spec. We use these fields so that users can do `create_array(chunks=(10,), shards=(20,))` to create a sensibly sharded array without threading the inner chunk shape through the codecs.
- We kept `chunks` for backwards compatibility with zarr-python 2.x.
- We chose `chunks` to denote "smallest readable unit" in this context to ensure that readers consuming zarr arrays (like dask) would pick the right granularity for reading by checking the `chunks` attribute.
- Rectilinear chunking breaks the `chunks` attribute. The introduction of rectilinear chunking means `chunks` is not a plain tuple but potentially something large, as each individual chunk can have a unique shape. Rather than widen the type of this attribute, which might be a breaking change for consumers that expect `tuple[int, ...]`, `array.chunks` raises a `NotImplementedError` when the rectilinear chunk grid is used:
- With rectilinear chunking we got two new array attributes: `read_chunk_sizes` and `write_chunk_sizes`, which you can see in the code snippet above. But by focusing on abstract "read size" and "write size", these two attributes obscure important information about the array, like the actual layout of each chunk. The "read size" and "write size" are instructions to the reader / writer about the granularity of that operation, but an array user might also care about the stored layout of a chunk. For example, these two arrays have similar "read" and "write" sizes, but different physical chunks:

Array A

Array B

In the above examples, array A uses the rectilinear chunk grid, so its stored chunks are sub-arrays with sizes `(7,7,4)`. Array B uses the regular chunk grid, and its stored chunks are sub-arrays with sizes `(7,7,7)`.

From an array indexing POV these two chunk grids behave identically, but they have different chunks, and I think we want to ensure that users can easily distinguish these two cases with methods or attributes on the `Array` class.

And here are some complications:

- […] `zarr.Array` to stored chunks. Lazy slicing will create views of subsets of chunks. So for a lazily sliced array, we need to enumerate the projection of that array's selection onto the underlying chunks, which isn't the same as the size of the underlying chunks. That means for a lazy indexing operation like `subset_2 = array.lazy[::2]`, `subset_2.read_chunk_sizes` isn't well defined as a collection of chunk sizes that sum to `subset_2.shape`, since `subset_2` isn't defined from whole chunks.

With all that said, @maxrjones has a PR that outlines a new `chunk_layout` data structure. I think this will help convey some of the information array consumers need. I'd also like to use this discussion as a venue to enumerate exactly what kind of information we think array consumers need, given the complexity in the array API.

Given some potentially lazy-sliced array `A`, I think users need easy access to the following info:

- […] `A`? How big is each of these chunks? How is each chunk selected to produce `A`?
- […] `A`? (recall that partial chunk writes do a read first) How big is each of these chunks? How is each chunk selected when I write new values?
- […] `A` should I iterate over if I want to read 1 chunk per region? How big is the chunk I have to read, per region?
- […] `A` should I iterate over if I want to write 1 chunk per region? How big is that chunk (this is not necessarily the same size as the region!)?

Some questions for participants:
I'm especially keen to hear from zarr array consumers, @psobolewskiPhD.
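To make the A/B distinction and the lazy-slicing complication concrete, here is a small pure-Python sketch that does not use zarr. It assumes a single axis of length 18 (my choice, picked so that chunk sizes of 7, 7, 4 fit), and all three helper names are hypothetical. It shows the regular and rectilinear grids agreeing on the sizes indexing sees while differing in stored sizes, and it shows what a `[::2]` selection projects onto each chunk.

```python
import math


def regular_stored_sizes(length: int, chunk: int) -> tuple[int, ...]:
    # Regular grid: every stored chunk has the full chunk size, even when
    # the last one extends past the end of the array.
    return (chunk,) * math.ceil(length / chunk)


def valid_sizes(length: int, stored: tuple[int, ...]) -> tuple[int, ...]:
    # Clip each stored chunk to the array bounds: this is what indexing sees.
    sizes, start = [], 0
    for size in stored:
        sizes.append(min(size, length - start))
        start += size
    return tuple(sizes)


def projected_sizes(
    length: int, stored: tuple[int, ...], selection: slice
) -> tuple[int, ...]:
    # Count how many elements of `selection` land in each chunk.
    selected = range(*selection.indices(length))
    counts, start = [], 0
    for size in stored:
        stop = min(start + size, length)
        counts.append(sum(1 for i in selected if start <= i < stop))
        start += size
    return tuple(counts)


length = 18                                # assumed axis length
stored_a = (7, 7, 4)                       # array A: rectilinear grid
stored_b = regular_stored_sizes(length, 7) # array B: regular grid

print(stored_a, stored_b)                  # stored sizes differ
print(valid_sizes(length, stored_a), valid_sizes(length, stored_b))
print(projected_sizes(length, stored_a, slice(None, None, 2)))
```

For the `[::2]` case the per-chunk counts describe the selection's footprint in each chunk, not the size of any stored chunk, which is the ambiguity described above for `read_chunk_sizes` on a lazily sliced array.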