Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure parents are created when creating a node #2262

Merged
merged 12 commits into from
Sep 27, 2024

Conversation

TomAugspurger
Copy link
Contributor

This updates our Array and Group creation methods to ensure that parents implicitly defined through a nested path are also created. The main goals are to

  1. Do this without listing
  2. Do the creations concurrently with the node being created by the user.

To accomplish this semi-safely and efficiently, I've required a new setdefault method on the Store class. The idea is to use an atomic "set if this doesn't exist" method. But given that we already have concurrency issues around these Metadata classes, maybe we don't worry about this and do the "if exists" and "set" operations in two operations (we have to do this anyway for RemoteStore)

Closes #2228

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

This updates our Array and Group creation methods to
ensure that parents implicitly defined through a nested path are also
created.

To accomplish this semi-safely and efficiently, we require a new
setdefulat method on the Store class.
@TomAugspurger TomAugspurger marked this pull request as ready for review September 26, 2024 19:59
@@ -98,6 +98,9 @@ async def set(self, value: Buffer, byte_range: ByteRangeRequest | None = None) -
async def delete(self) -> None:
del self.shard_dict[self.chunk_coords]

async def setdefault(self, default: Buffer) -> None:
self.shard_dict.setdefault(self.chunk_coords, default)
Copy link
Contributor Author

@TomAugspurger TomAugspurger Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is actually tested anywhere. I'm not 100% sure, but I think all the Group / Array Metadata creation method will be using StorePath as their ByteSetter.

@@ -603,9 +609,24 @@ async def getitem(
)
return await self._get_selection(indexer, prototype=prototype)

async def _save_metadata(self, metadata: ArrayMetadata) -> None:
async def _save_metadata(self, metadata: ArrayMetadata, ensure_parents: bool = False) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new keyword is to ensure that updates to an existing nodes don't require all the setdefault operations to ensure that parents exist. Anytime we create a brand new node we should call _save_metdata with ensure_parents=True.

@@ -68,7 +69,13 @@ def _put(
f.write(value.as_numpy_array().tobytes())
return None
else:
return path.write_bytes(value.as_numpy_array().tobytes())
view = memoryview(value.as_numpy_array().tobytes())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pathlib.Path.write_bytes doesn't provide control over the mode. So this is just that method inlined, with mode="xb" if we want exclusive create.

@jhamman jhamman requested a review from d-v-b September 26, 2024 22:34
@jhamman
Copy link
Member

jhamman commented Sep 26, 2024

Thanks @TomAugspurger -- I'll plan to review this ASAP.

To accomplish this semi-safely and efficiently, I've required a new setdefault method on the Store class. The idea is to use an atomic "set if this doesn't exist" method. But given that we already have concurrency issues around these Metadata classes, maybe we don't worry about this and do the "if exists" and "set" operations in two operations (we have to do this anyway for RemoteStore)

Noting that S3 recently added support for conditional writes (and other object stores already have this) so conceivable that s3fs/fsspec could add support for atomic writes in the near future. cc @martindurant

src/zarr/abc/store.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

Noting that S3 recently added support for conditional writes (and other object stores already have this) so conceivable that s3fs/fsspec could add support for atomic writes in the near future

Opened fsspec/filesystem_spec#1693 for that. We can pick that API up if / when it's available, but live with the potential for concurrent writers to mess each other up for now.

@@ -138,6 +139,10 @@ async def set(self, key: str, value: Buffer) -> None:
with self.log():
return await self._store.set(key=key, value=value)

async def setdefault(self, key: str, default: Buffer) -> None:
with self.log():
return await self._store.set(key=key, value=default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return await self._store.set(key=key, value=default)
return await self._store.setdefault(key=key, value=default)

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this as is (modulo a few very minor suggestions). I'll defer to @d-v-b for a final review though.

src/zarr/core/array.py Outdated Show resolved Hide resolved
src/zarr/abc/store.py Outdated Show resolved Hide resolved
@d-v-b
Copy link
Contributor

d-v-b commented Sep 27, 2024

Should this be in _save_metadata or in _create_(array / group)

what's the reasoning for creating the intermediate groups as an optional extra behavior in _save_metadata versus putting it slightly higher in the stack (e.g., in the array / group creation routines)? The structure of e.g. _create_v3 is basically "prepare metadata document, save it"; to me it seems like a natural change to expand this functionality to "prepare metadata document, save that one and all the other ones we might need". As I would naively expect the _save_metadata method to be scoped to only save a single metadata document, I would expect a set of documents to be saved higher in the stack, e.g. in array / group creation functions.

Nitpicking the name setdefault; maybe we should give this functionality to Store.set?

I had to read up on the setdefault dict method because I had never used it before. Personally I don't find the name of that method very clear -- generally if something is called setX(value) I will expect it to have semantics roughly equivalent to "set the X to value", but "setdefault" doesn't "set the default to value". Maybe I'm just weird here, and for dicts a slightly unclear name is probably acceptable damage, but I don't think this is the best name for a Store method. I do think the naming challenge is real here, because we are trying to describe what this method doesn't do, but maybe a name like set_no_replace would be better? But this is just personal preference in the end.

One solution to the naming issue is to simply not introduce a new method at all -- what if we made the setdefault functionality part of Store.set, e.g. Store.set(key, value, *, supplant / clobber / replace: bool)? This way anything that wraps Store.set, e.g. _set_many, can support the "create if not exists" functionality.

@TomAugspurger
Copy link
Contributor Author

Should this be in _save_metadata or in create(array / group)

Mmm that does sound a bit cleaner. The reason for putting it in _save_metadata was to do all of the I/O operations concurrently. If we put it in the _create_* method, I think we'd have to wait for each parent creation to finish before we could write the child (recursively back to the root).

I could imagine a refactor that builds up a list of operations and hands that off to a writer. We do something like of like that in _save_metadata today (between to_buffer_dict and the set_or_delete stuff, which makes coroutines).

Nitpicking the name setdefault; maybe we should give this functionality to Store.set?

Since this has the essentially the same behavior of setdefault (aside from the return value), that did feel natural to me. As a bonus, you now know about setdefault :) Some arguments either way:

  1. Putting this behind a keyword in Store.set is totally possible. I guess there's some slight confusion between the interaction of byte_range and whatever new keyword we use
  2. If we go with setdefault (or some other name) as a new method, we could in theory provide a default implementation on Store that works, but isn't safe for concurrent writers.

I don't have a strong preference.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 27, 2024

Some alternative names to setdefault:

  • create (but I worry that create should error if the thing already exists, and I'm leery of putting specific exception types into interfaces, since that's easy for implementations to mess up)
  • create_if_not_exists
  • create_or_ignore
  • setnx (or set with nx) from redis

@jhamman
Copy link
Member

jhamman commented Sep 27, 2024

RE name, how about set_if_not_exists? I'm also okay with setdefault but I agree another name could add some clarity.

One question about the ABC is whether we want to put a default implementation in? Something like:

async def set_if_not_exists(self, key, value) -> None:
    if not (await self.contains(key)):
        await self.set(key, value)

This is obviously not atomic but is as good as we can do for some stores.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 27, 2024

set_if_not_exists works for me.

A default implementation, even if it isn't atomic, is probably OK. This isn't going to be our only source of concurrency issues between Nodes & Stores. I'll leave a comment that implementations with that ability might want to override this.

There's one other consideration: whether we should require a return type to indicate whether the value was set (a new key created equal to value). We don't need it yet since we just need to ensure that some node is present in this PR, so I'm inclined to not add that to the interface yet.

@jhamman jhamman added this to the 3.0.0.beta milestone Sep 27, 2024
@TomAugspurger
Copy link
Contributor Author

This should be good to go, maybe aside from @d-v-b's point about _save_metadata maybe not being the right spot for this.

I do agree that it's a bit strange to shove this in there, but IMO the benefit of creating all the parents concurrently makes it worth it. In the future I think we could refactor this a bit to have a kind of bulk save_objects method, but we'll still need all the create routines to build up the list of their parents and add them to the create_if_not_exists list, and that'll have to happen in either the _create or _save_metadata methods.

@jhamman jhamman merged commit 2edc548 into zarr-developers:v3 Sep 27, 2024
20 checks passed
@d-v-b
Copy link
Contributor

d-v-b commented Sep 28, 2024

I do agree that it's a bit strange to shove this in there, but IMO the benefit of creating all the parents concurrently makes it worth it. In the future I think we could refactor this a bit to have a kind of bulk save_objects method, but we'll still need all the create routines to build up the list of their parents and add them to the create_if_not_exists list, and that'll have to happen in either the _create or _save_metadata methods.

I broadly agree with this analysis. For now it's great that we have this merged as-is, but longer-term I might explore moving the "create all the parents as needed" logic one level up, to the creation routines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v3] zarr 3 fails to create child groups
3 participants