Design doc for Zarr versioning/publishing support via Zarr Manifest Files #1892
Yes. This is one of the main motivations behind this feature: we could "cheaply" provide support for versioning of Zarrs and thereby allow their release. ATM nobody can publish/release a versioned dandiset if there is a Zarr in it. If we have a version-specific manifest for a Zarr, we can get to that manifest/version with no problem, and thus can release the Zarr and access that specific version of it.
@yarikoptic How should manifests for different versions of the same Zarr be organized in S3? Naming the files after the Zarr checksum like in your repository isn't guaranteed to be stable, as a user could create a Zarr, then change one entry, then change that entry back to its original contents, resulting in an S3 object with three versions while there are only two different checksums for the Zarr across its history.
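The create/change/revert scenario can be illustrated with a small sketch. It assumes, purely for illustration, that a Zarr's checksum is a hash over its sorted entries; `toy_checksum` is a hypothetical stand-in and not the Archive's actual Zarr checksum algorithm.

```python
import hashlib


def toy_checksum(entries: dict[str, bytes]) -> str:
    """Hypothetical stand-in for a Zarr checksum: hash of sorted (path, content) pairs."""
    h = hashlib.md5()
    for path in sorted(entries):
        h.update(path.encode())
        h.update(hashlib.md5(entries[path]).digest())
    return h.hexdigest()


# Three successive states of the same Zarr: create, change one entry, change it back.
history = [
    {"0/0": b"original", ".zarray": b"{}"},
    {"0/0": b"modified", ".zarray": b"{}"},
    {"0/0": b"original", ".zarray": b"{}"},
]

checksums = [toy_checksum(state) for state in history]

# Manifests named only after the checksum: the first and third states share a name.
manifest_names = {f"{c}.json" for c in checksums}
```

Here `history` has three states but `manifest_names` has only two members, so the manifest for the third state would overwrite (or add an S3 version to) the manifest written for the first.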
was already generated, the newer file shall replace the older.

Manifest files shall also be generated for all Zarrs already in the Archive
when this feature is first implemented.
Now that version support has been added above, a description should follow here of what is to happen when publishing dandisets that have Zarrs in them.
We need to review/analyze what should now happen for Zarr records or assets so that we capture version (checksum) information for a Zarr whenever it becomes part of a released dandiset. For blobs this is easy, since blobs are not mutable. But since a Zarr could have multiple versions, we would need to make sure that the published asset has a versionId for the Zarr that does not change whenever that asset's Zarr is modified in the draft version.
I think that should be fine, since we would just carry around a manifest with a specific checksum of the content. So, similarly to blobs, we would pretty much have a "content-addressable storage" (of Zarrs). Then, given a Zarr ID + checksum (as stored in the DB, or in asset dumps, which would have the Zarr ID and its checksum), we can get the needed manifest and provide access to that particular version, possibly the same one across different points in the version history.
That is a great start, but now comes the hard part: an analysis of the current API/behavior and a proposal for how to augment it so that we have a clear association between a dandiset asset and the version of the Zarr for that asset, what would happen on publish, and how people (and the web UI) would access published versions of dandisets with the corresponding versions of Zarrs.
Force-pushed from f1b4a62 to 82eb89e.
- Document needed changes to dandidav?
- The bucket for the Archive instance will now be given on the command line (only required if a custom/non-default API URL is given)
- The bucket's region will have to be looked up & stored before starting the webserver
- Zarrs under `/dandisets/` will no longer determine their S3 location via `contentUrl`; instead, they will combine the Archive's bucket & region with the Zarr ID in the asset properties (templated into "zarr/{zarr_id}/")
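A minimal sketch of that last step, assuming the location is expressed as a virtual-hosted-style S3 URL. The function name `zarr_s3_url` and the bucket/region values are hypothetical placeholders; only the `zarr/{zarr_id}/` template comes from the text above, and dandidav itself may build the location differently.

```python
def zarr_s3_url(bucket: str, region: str, zarr_id: str) -> str:
    """Combine bucket, region, and Zarr ID into an S3 prefix URL (illustrative only)."""
    prefix = f"zarr/{zarr_id}/"  # template taken from the design doc text
    return f"https://{bucket}.s3.{region}.amazonaws.com/{prefix}"


# Hypothetical bucket and region; the Zarr ID is the one used as an example in this thread.
url = zarr_s3_url(
    bucket="example-archive-bucket",
    region="us-east-2",
    zarr_id="057f84d5-a88b-490a-bedf-06f3f50e9e62",
)
```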
Note to ourselves: requests to modify a Zarr from a previous non-draft version might be "hard to impossible", since they would pretty much cause a "race condition" between different versions if modified in parallel, or otherwise be very inefficient since they would require large "diff" uploads. It would pretty much boil down to having the Zarr in its mutable form assigned to just a single path in a single dandiset (like now), as it must not then be changed from multiple dandisets/locations. But it could still reside in multiple dandisets, and even be published in that original version!
Alternatives (just thinking out loud):
- Operations on a Zarr would act as "patch" operations on a specific version (manifest): simply provide a new key + versionid + ... on S3 and modify the prior manifest, with `finalize` saving the patched manifest without doing a full sweep of the bucket.
  - cons: a more complex implementation (??? maybe not); a zarr operation must be completed "in full"; the version of the zarr on S3 "as is" might not be a legit zarr to be used directly if ever modified for multiple versions; needs to be thought through better.
  - pros: support for modifying any version of a zarr; (much) more efficient `finalize`, since it would just modify the prior manifest with changes without doing a full sweep of the prefix.
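The "patch" alternative can be sketched as follows. This is only an illustration under stated assumptions: the manifest is simplified to a flat mapping from entry path to a `[versionId, lastModified, size, ETag]` array, and `patch_manifest` plus all sample values are hypothetical.

```python
from typing import Optional

Entry = list  # [versionId, lastModified, size, ETag]


def patch_manifest(
    prior: dict[str, Entry], changes: dict[str, Optional[Entry]]
) -> dict[str, Entry]:
    """Return a new manifest: `prior` with uploads applied and deletions (None) removed."""
    patched = dict(prior)  # the prior manifest is left untouched
    for path, entry in changes.items():
        if entry is None:
            patched.pop(path, None)  # entry deleted in this operation
        else:
            patched[path] = entry  # entry added or replaced with a new S3 version
    return patched


prior = {
    ".zarray": ["v1", "2024-01-01T00:00:00Z", 120, "etag-a"],
    "0/0": ["v1", "2024-01-01T00:00:00Z", 4096, "etag-b"],
    "0/1": ["v1", "2024-01-01T00:00:00Z", 4096, "etag-c"],
}
changes = {
    "0/0": ["v2", "2024-02-01T00:00:00Z", 4100, "etag-d"],  # re-uploaded chunk
    "0/1": None,  # deleted chunk
}
patched = patch_manifest(prior, changes)
```

Only the paths named in `changes` are touched, which is why no listing of the whole prefix is needed; the prior manifest remains valid for any asset still referencing it.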
Another "mind strike" related to the above: we need an association of a Zarr with a dandiset for editing, to ensure ownership/rights to modify. This is somewhat different from blobs, for which we do not allow modifications, so people just upload new ones. Overall, it feels like we need some way to distinguish a "canonical asset of a zarr" (which can still be modified) ... more thinking needed.
> requests to modify a Zarr from a previous non draft version

What are you talking about? A "non draft version" is a published version, and published versions and their contents can't be modified.

> we need association of zarr to a dandiset

Each Zarr is already associated with a Dandiset. You can see a Zarr's Dandiset by requesting `/zarr/{zarr_id}/`.
The idea/hope was that we could "break" the need for association with a particular dandiset. Then the same Zarr could be present in multiple dandisets, so the "draft" versions could diverge in two dandisets, and changes to the same Zarr in one dandiset could "race" with changes in another.
Indeed, the problem should not manifest itself if we keep a Zarr associated with just a single dandiset. But that is "suboptimal", since it would disallow (cheaply) creating dandisets with assets (content) from another, and we have already had such use cases, at least for a "read-only" "mix-in" of Zarrs from other dandisets... So we should see if we could support that through these proposed changes.
doc/design/zarr-manifests.md
Outdated
* Publishing Zarrs: Just ensure that the `zarr_version` in Zarr assets is frozen and that no entries/S3 object versions from the referenced version are ever deleted?

* Does garbage collection of old Zarr versions need to be discussed?
I think it should indeed be touched on. And we should touch on "trailing delete" to be disabled for the `/zarr/` prefix on S3.
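One way to think about the constraint that no S3 object versions referenced by a frozen `zarr_version` are ever deleted is as a set difference. The sketch below is not a proposed implementation: it assumes manifests simplified to a flat mapping from entry path to an array whose first element is the S3 version ID, and `unreferenced_versions` and all sample data are hypothetical.

```python
def unreferenced_versions(
    bucket_versions: set[tuple[str, str]],
    retained_manifests: list[dict[str, list]],
) -> set[tuple[str, str]]:
    """Return (path, versionId) pairs in the bucket that no retained manifest references."""
    referenced = {
        (path, entry[0])  # entry[0] is assumed to be the S3 version ID
        for manifest in retained_manifests
        for path, entry in manifest.items()
    }
    return bucket_versions - referenced


published = {"0/0": ["v1", "2024-01-01T00:00:00Z", 10, "etag-a"]}
draft = {"0/0": ["v3", "2024-03-01T00:00:00Z", 12, "etag-c"]}
in_bucket = {("0/0", "v1"), ("0/0", "v2"), ("0/0", "v3")}

candidates = unreferenced_versions(in_bucket, [published, draft])
```

Under these assumptions, only the intermediate version `v2` is a garbage-collection candidate; `v1` is protected by the published manifest and `v3` by the draft one.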
doc/design/zarr-manifests.md
Outdated
Two "end-points" within that namespace are provided:

- [webdav.dandiarchive.org/zarrs](https://webdav.dandiarchive.org/zarrs) -- all Zarrs across all dandisets, possibly with multiple versions. E.g. see [zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62) which ATM has 3 versions.
- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)).
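The example URLs above suggest how a Zarr ID plus checksum maps to a path under `zarrs/`. The sketch below reproduces that layout; the sharding rule (first three and next three characters of the Zarr ID) is inferred from this single example rather than taken from any specification, and `zarr_version_path` is a hypothetical helper name.

```python
def zarr_version_path(zarr_id: str, checksum: str) -> str:
    """Build the `zarrs/` path for one Zarr version, following the example URL's layout."""
    # Sharding by ID prefix is an assumption inferred from the example above.
    return f"zarrs/{zarr_id[:3]}/{zarr_id[3:6]}/{zarr_id}/{checksum}.zarr/"


path = zarr_version_path(
    "057f84d5-a88b-490a-bedf-06f3f50e9e62",
    "6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836",
)
```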
@yarikoptic Why is this mentioned here? The `/dandisets/` hierarchy gets its information directly from S3; it does not use the Zarr manifest files.
Ah! Thanks for spotting that; good to know, since I assumed that this one also gets it from the manifests! So we need something like
- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)).
- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`: although it provides access to Zarr files, it **does not** use manifest files but gets its listing directly from S3, so it points only to the most recent version (possibly not even finalized yet during upload).

?
But given that "not finalized" aspect, we might (not yet sure) want to change that WebDAV behavior and also use the manifest file here, to ensure access to a legit version of the Zarr (a non-finalized one might be partial, etc.).
Force-pushed from 9d425c9 to 6a1a1d9.
Signed-off-by: Yaroslav Halchenko <[email protected]>
…d; can reuse metadata
Force-pushed from 6a1a1d9 to 1e15c75.
Unresolved issues:

- TODO: add reference to CSV with sample sizes of the manifests https://gist.github.com/jwodder/4e9c6e846639b6d5be2b9ab7f8302166
- The design of the `fields` key and entry arrays (copied from YOH's prior art) is odd. @jwodder does not foresee any circumstances in which `fields` would have anything other than the recommended value (and `fields` values other than the recommended would be tricky to support in `dandidav`) but @yarikoptic insists on an explicit description some way and "future proofing", so we might want to change the manifest file format in one of the following ways:
  - Eliminate the `fields` key and define the elements of entry arrays to always be version ID, last modified, size, and ETag in that order
  - Eliminate the `fields` key and change the entry arrays to objects with `versionId`, `lastModified`, `size`, and `ETag` fields
  - Eliminate the `fields` key and define the elements of entry arrays to always be version ID, last modified, size, and ETag in that order; while also adding `@schema` URL which would point to a versioned jsonschema for the manifest file which would describe those fields.
- Describe Archive behavior when publishing Dandisets with Zarrs (See comments below)
  - Then whenever Zarr is modified, it gets a new `.version` and some new asset would get that `.zarr_version`, possibly published later.
  - … `backups2datalad` when backing up a published Zarr? If done naïvely, the program will have to recreate already-backed up Zarrs as soon as the Zarr IDs "split."
    - … `git note`? in git commit message description like `datalad run`) the zarr checksum for the commit, we can discover/tag the desired "released" commit. Possible gotcha: we missed that moment and have no commit of such kind -- then recreate from closest commit based on date in a branch off the main trunk.
- DANDI API changes necessary to support zarr workflows
  - … `dandidav`'s `/dandisets/` hierarchy to serve Zarr trees that do not match the current state on S3, the API will need to gain an endpoint that acts like `.../assets/paths/` (or Add endpoint for querying a folder or asset path in a Dandiset #1837) but for Zarr contents.
- Changes to metadata model (dandischema)
  - … `contentUrl` for Zarrs? (See comments below; might be not needing model change)
- Address garbage collection of old Zarr versions
  - … `/zarr/` prefix on S3
- Consider caching/containment of manifests within DB itself (as well).
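To make the first two alternatives for the entry format concrete, here is a small sketch that converts between a fixed-order entry array and an object with named fields. The field names and their order come from the list above; the helper names and the sample values are hypothetical.

```python
# Fixed order proposed above: version ID, last modified, size, ETag.
FIELDS = ("versionId", "lastModified", "size", "ETag")


def entry_to_object(entry: list) -> dict:
    """Convert a fixed-order entry array into an object with named fields."""
    if len(entry) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} elements, got {len(entry)}")
    return dict(zip(FIELDS, entry))


def object_to_entry(obj: dict) -> list:
    """Convert an object with named fields back into a fixed-order entry array."""
    return [obj[name] for name in FIELDS]


# Hypothetical sample values, for illustration only.
array_form = ["example-version-id", "2024-01-01T00:00:00Z", 4096, "example-etag"]
object_form = entry_to_object(array_form)
```

The array form is more compact for Zarrs with many entries, while the object form is self-describing; the third alternative keeps the array form and moves the description into a schema referenced by `@schema`.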
CC @yarikoptic @dandi/dandiarchive