Design doc for Zarr versioning/publishing support via Zarr Manifest Files #1892

jwodder · 2024-03-18T19:47:17Z

Unresolved issues:

TODO: add reference to CSV with sample sizes of the manifests https://gist.github.com/jwodder/4e9c6e846639b6d5be2b9ab7f8302166
The design of the `fields` key and entry arrays (copied from YOH's prior art) is odd. @jwodder does not foresee any circumstances in which `fields` would have anything other than the recommended value (and `fields` values other than the recommended one would be tricky to support in dandidav), but @yarikoptic insists on an explicit description of some kind and on "future proofing", so we might want to change the manifest file format in one of the following ways:
- Eliminate the `fields` key and define the elements of entry arrays to always be version ID, last modified, size, and ETag, in that order
  - Con: Manifest files become harder to read for people not sufficiently familiar with the format definition
- Eliminate the `fields` key and change the entry arrays to objects with `versionId`, `lastModified`, `size`, and `ETag` fields
  - Con: Much larger manifest files
- Eliminate the `fields` key and define the elements of entry arrays to always be version ID, last modified, size, and ETag, in that order, while also adding a `@schema` URL pointing to a versioned jsonschema for the manifest file that describes those fields.
Describe Archive behavior when publishing Dandisets with Zarrs (See comments below)
- When a Dandiset is published, presumably the Zarr IDs of the Zarrs in the published version and in the draft version will have to diverge at some point — either at the moment of publication or when each draft Zarr is first changed after publication. When exactly will this happen? Which version gets the "old" Zarr IDs?
  - @yarikoptic: When a Dandiset is published, all assets and their metadata records are "frozen", and we have access to the corresponding Zarr checksums. So the published Dandiset version would keep the same Zarr ID together with the corresponding checksum.
    Whenever the Zarr is then modified, it gets a new .version, and some new asset would get that .zarr_version, possibly to be published later.
  - How can we reduce the burden on backups2datalad when backing up a published Zarr? If done naïvely, the program will have to recreate already-backed up Zarrs as soon as the Zarr IDs "split."
    - @yarikoptic: If we somehow record the Zarr checksum for each commit (in a git note? in the commit message description, like datalad run does?), we can discover/tag the desired "released" commit. Possible gotcha: if we missed that moment and have no such commit, then recreate from the closest commit by date, in a branch off the main trunk.
DANDI API changes necessary to support zarr workflows
- Add Zarr version info to API responses
- In order for dandidav's /dandisets/ hierarchy to serve Zarr trees that do not match the current state on S3, the API will need to gain an endpoint that acts like .../assets/paths/ (or Add endpoint for querying a folder or asset path in a Dandiset #1837) but for Zarr contents.
Changes to metadata model (dandischema)
- Adjustments to contentUrl for Zarrs? (See comments below; might not need a model change)
Address garbage collection of old Zarr versions
- Disable "trailing delete" for /zarr/ prefix on S3
  - A filter would likely be added here: https://github.com/dandi/dandi-infrastructure/blob/b547159f30f201e4f805f6938d247812a3022e38/terraform/modules/dandiset_bucket/main.tf#L323C1-L323C14
  - An alternative could be to not actually allow deleting files from a Zarr, but only to adjust the manifest so that it no longer lists those particular files. So far @yarikoptic thinks that this would make a possible implementation fragile and might not be worth the effort.
Consider caching/containment of manifests within the DB itself (as well).
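
To make the size/readability trade-off between the manifest format options listed above more concrete, here is a minimal Python sketch that expands a positional entry array into an object. The field names and their order come from the options above; the helper name `expand_entry` and the sample values are hypothetical, and the real manifest layout may differ.

```python
import json

# Field order as described in the options above (assumed to be the
# recommended value of the `fields` key).
FIELDS = ["versionId", "lastModified", "size", "ETag"]


def expand_entry(entry, fields=FIELDS):
    """Turn a positional entry array into a dict keyed by field name."""
    if len(entry) != len(fields):
        raise ValueError(f"expected {len(fields)} elements, got {len(entry)}")
    return dict(zip(fields, entry))


# Hypothetical sample entry, not taken from a real manifest.
sample = ["v1", "2024-03-18T19:47:17Z", 1024, "0123456789abcdef"]
as_object = expand_entry(sample)

# The object form repeats every key in every entry, which is where the
# "much larger manifest files" concern comes from.
compact_size = len(json.dumps(sample))
object_size = len(json.dumps(as_object))
```

With a fixed field order, a reader of the compact form needs either the `fields` key or an external schema to know which position means what; the object form trades file size for self-description.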

CC @yarikoptic @dandi/dandiarchive

yarikoptic · 2024-03-18T19:53:36Z

Do we want the Archive to store separate manifest files for each version of each Zarr?

Yes. This is one of the main motivations behind this feature: it lets us "cheaply" support versioning of Zarrs and thereby allow their release. ATM nobody can publish/release a versioned Dandiset if there is a Zarr in it. If we have a manifest for each specific version of a Zarr, we can get to that manifest/version without trouble, and thus can release the Zarr and provide access to that specific version.

doc/design/zarr-manifests.md

jwodder · 2024-03-18T19:58:09Z

@yarikoptic How should manifests for different versions of the same Zarr be organized in S3? Naming the files after the Zarr checksum like in your repository isn't guaranteed to be stable, as a user could create a Zarr, then change one entry, then change that entry back to its original contents, resulting in an S3 object with three versions while there are only two different checksums for the Zarr across its history.

yarikoptic · 2024-03-18T19:58:39Z

doc/design/zarr-manifests.md

+was already generated, the newer file shall replace the older.
+
+Manifest files shall also be generated for all Zarrs already in the Archive
+when this feature is first implemented.


Now that version support has been added above, a description should follow here of what is to happen when publishing Dandisets that contain Zarrs.

We need to review/analyze what should now happen to Zarr records or assets so that we capture version (checksum) information for a Zarr whenever it becomes part of a released Dandiset. For blobs it is easy, since blobs are immutable. But since a Zarr can have multiple versions, we need to make sure that a published asset carries a Zarr versionId that does not change when that asset's Zarr is modified in the draft version.

yarikoptic · 2024-03-18T23:58:57Z

then change that entry back to its original contents, resulting in an S3 object with three versions while there are only two different checksums for the Zarr across its history.

I think that should be fine, since we would just care about a manifest with a specific checksum of the content. So, pretty much as with blobs, we would have "content-addressable storage" (of Zarrs). Then, given a Zarr ID + checksum (as stored in the DB, or in asset dumps that would contain the Zarr ID and its checksum), we can get the needed manifest and provide access to that particular version, possibly the same one at different points in the version history.
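
A minimal sketch of that lookup, assuming manifests are keyed by the pair (Zarr ID, checksum). The `ManifestStore` class, the ID, the checksums, and the `entry_version` field are all hypothetical stand-ins; this is an in-memory illustration, not the Archive's storage layout.

```python
class ManifestStore:
    """In-memory stand-in for manifests addressed by (zarr_id, checksum)."""

    def __init__(self):
        self._manifests = {}

    def put(self, zarr_id, checksum, manifest):
        # A manifest written again for the same checksum replaces the
        # older one (assumption: "the newer file shall replace the older").
        self._manifests[(zarr_id, checksum)] = manifest

    def get(self, zarr_id, checksum):
        return self._manifests[(zarr_id, checksum)]

    def checksums(self, zarr_id):
        return sorted(c for (z, c) in self._manifests if z == zarr_id)


store = ManifestStore()
zarr_id = "zarr-example"  # hypothetical Zarr ID

# One entry is changed and then changed back: three S3 object versions
# of that entry, but only two distinct Zarr checksums.
history = [("checksum-A", "v1"), ("checksum-B", "v2"), ("checksum-A", "v3")]
for checksum, entry_version in history:
    store.put(zarr_id, checksum, {"entry_version": entry_version})
```

In this sketch the third write lands on the same key as the first, so only two manifests exist and the one for `checksum-A` refers to the newest S3 object version of the reverted entry.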

yarikoptic

That is a great start, but now comes the hard part: an analysis of the current API/behavior and a proposal for how to augment it so that we have a clear association between a Dandiset asset and the version of the Zarr for that asset, what would happen on publish, and how people (and the web UI) would access published versions of Dandisets with the corresponding versions of Zarrs.

doc/design/zarr-manifests.md

yarikoptic · 2024-03-22T19:58:49Z

doc/design/zarr-manifests.md

+        - Document needed changes to dandidav?
+            - The bucket for the Archive instance will now be given on the command line (only required if a custom/non-default API URL is given)
+            - The bucket's region will have to be looked up & stored before starting the webserver
+            - Zarrs under `/dandisets/` will no longer determine their S3 location via `contentUrl`; instead, they will combine the Archive's bucket & region with the Zarr ID in the asset properties (templated into "zarr/{zarr_id}/")


Note to ourselves: requests to modify a Zarr from a previous non-draft version might be "hard to impossible", since parallel modifications would pretty much cause a "race condition" between different versions, or would otherwise be very inefficient since they would require large "diff" uploads. It would pretty much boil down to having the Zarr in its mutable form assigned to just a single path in a single Dandiset (like now), as it must not be changed from multiple Dandisets/locations. But it could still reside in multiple Dandisets, and even be published in that original version!

Alternatives (just thinking out loud):

- Operations on a Zarr would work as "patch" operations on a specific version (manifest): simply provide a new key + versionId + ... on S3, and have finalize save a patched copy of the prior manifest without doing a full sweep of the bucket.
  - Cons: a more complex implementation (??? maybe not): a Zarr operation must be completed "in full"; the Zarr on S3 "as is" might not be a legit Zarr usable directly if it was ever modified for multiple versions; this needs to be thought through better.
  - Pros: support for modifying any version of a Zarr; (much) more efficient finalize, since it would just apply the changes to the prior manifest without doing a full sweep of the prefix.
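
As a rough illustration of the "patch" idea, the sketch below derives a new manifest from a prior one by applying only the changed and deleted paths, without listing the whole prefix. The function name `apply_patch`, the mapping layout (path to entry dict), and the sample paths are assumptions made for this example only.

```python
def apply_patch(prior_entries, upserts=None, deletes=()):
    """Return new manifest entries: the prior manifest plus the given changes."""
    patched = dict(prior_entries)  # copy, so the prior manifest stays intact
    for path in deletes:
        patched.pop(path, None)  # ignore paths that are already absent
    patched.update(upserts or {})
    return patched


# Hypothetical prior manifest: path -> entry.
prior = {
    ".zgroup": {"versionId": "v1", "size": 24},
    "0/0": {"versionId": "v1", "size": 100},
    "0/1": {"versionId": "v1", "size": 100},
}

# One chunk re-uploaded (new S3 object version), one chunk deleted.
patched = apply_patch(
    prior,
    upserts={"0/0": {"versionId": "v2", "size": 120}},
    deletes=["0/1"],
)
```

The cost of finalize here is proportional to the number of changes rather than to the size of the Zarr, which is the efficiency argument made above; it says nothing yet about how concurrent patches against the same prior manifest would be reconciled.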

Another "mind strike" related to the above: we need an association of a Zarr with a Dandiset for editing, to ensure ownership/rights to modify. This differs somewhat from blobs, which we do not allow to be modified, so people just upload new ones. Overall it feels like we need some way to distinguish a "canonical asset of a Zarr" (which can still be modified) ... more thinking needed.

@yarikoptic

requests to modify a Zarr from a previous non draft version

What are you talking about? A "non draft version" is a published version, and published versions and their contents can't be modified.

we need association of zarr to a dandiset

Each Zarr is already associated with a Dandiset. You can see a Zarr's Dandiset by requesting /zarr/{zarr_id}/.

The idea/hope was that we can "break" the need for association with a particular Dandiset. Then the same Zarr could be present in multiple Dandisets, and thus the "draft" versions could diverge between two Dandisets, and changes to the same Zarr in one Dandiset could "race" with changes in another Dandiset.

The problem indeed should not manifest itself if we keep each Zarr associated with just a single Dandiset. But that is "suboptimal", since it would prevent (cheaply) creating Dandisets with assets (content) from another one. And we have already had such use cases, at least for "read-only" "mix-in" of Zarrs from other Dandisets... so we should see whether we could support that through these proposed changes.

yarikoptic · 2024-03-22T20:47:17Z

doc/design/zarr-manifests.md

+
+* Publishing Zarrs: Just ensure that the `zarr_version` in Zarr assets is frozen and that no entries/S3 object versions from the referenced version are ever deleted ?
+
+* Does garbage collection of old Zarr versions need to be discussed?


I think it should indeed be touched on, and we should also mention that "trailing delete" is to be disabled for the /zarr/ prefix on S3.
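
One possible way to reason about garbage collection is sketched below: an S3 object version is a candidate for deletion only if no retained manifest references it. The function name, the simplified manifest shape (path to version ID), and the sample data are hypothetical; which manifests count as "retained" is exactly the policy question still to be decided.

```python
def collectable_versions(all_versions, retained_manifests):
    """Return (path, versionId) pairs not referenced by any retained manifest."""
    referenced = set()
    for manifest in retained_manifests:
        # Each manifest is simplified here to a mapping of path -> versionId.
        referenced.update(manifest.items())
    return set(all_versions) - referenced


# Hypothetical object versions present on S3 for one Zarr.
all_versions = {("0/0", "v1"), ("0/0", "v2"), ("0/0", "v3"), ("0/1", "v1")}

published = {"0/0": "v1", "0/1": "v1"}  # frozen by a published Dandiset version
draft = {"0/0": "v3", "0/1": "v1"}      # current draft state

candidates = collectable_versions(all_versions, [published, draft])
```

Here only the intermediate version `("0/0", "v2")` is unreferenced; everything named by the published manifest is kept regardless of later changes to the draft.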

jwodder · 2024-08-13T20:34:21Z

doc/design/zarr-manifests.md

+Two "end-points" within that namespace are provided:
+
+- [webdav.dandiarchive.org/zarrs](https://webdav.dandiarchive.org/zarrs) -- all Zarrs across all dandisets, possibly with multiple versions. E.g. see [zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62) which ATM has 3 versions.
+- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)).


@yarikoptic Why is this mentioned here? The /dandisets/ hierarchy gets its information directly from S3; it does not use the Zarr manifest files.

Ah! Thanks for spotting this; good to know, since I assumed that this one also gets it from the manifests! So we need something like

Suggested change

- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)).

- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`: although this provides access to Zarr files, it **does not** use manifest files but gets its listing directly from S3, so it points only to the most recent version (possibly not even finalized yet, during an upload).

?

But given that "not finalized" aspect, we might (not yet sure) want to change that WebDAV behavior and use the manifest file here as well, to ensure access to a legit version of the Zarr (a non-finalized one might be partial, etc.).


jwodder added the design-doc (Involves creating or discussing a design document) and zarr (Issues with Zarr hosting/processing/etc.) labels Mar 18, 2024

yarikoptic reviewed Mar 18, 2024


yarikoptic reviewed Mar 18, 2024


yarikoptic requested changes Mar 19, 2024


jwodder force-pushed the zarr-manifest-design branch from f1b4a62 to 82eb89e Compare March 22, 2024 13:04

yarikoptic reviewed Mar 22, 2024


yarikoptic self-assigned this Apr 30, 2024

yarikoptic mentioned this pull request Aug 6, 2024

Add webdav.dandiarchive.org to "services" for the main instance and redirect /download/ for zarrs to webdav #1993

Open

yarikoptic changed the title ~~Design doc for generating Zarr Manifest Files~~ Design doc for Zarr versioning/publishing support via Zarr Manifest Files Aug 13, 2024

jwodder commented Aug 13, 2024


yarikoptic force-pushed the zarr-manifest-design branch from 9d425c9 to 6a1a1d9 Compare August 20, 2024 20:39

jwodder and others added 8 commits August 27, 2024 14:38

Design doc for generating Zarr Manifest Files

1baa092

Signed-off-by: Yaroslav Halchenko <[email protected]>

Add versioning

a676d35

Signed-off-by: Yaroslav Halchenko <[email protected]>

Mention S3 API calls for getting object version IDs

d1836b4

Signed-off-by: Yaroslav Halchenko <[email protected]>

Outline

705d029

Signed-off-by: Yaroslav Halchenko <[email protected]>

/zarr/{zarr_id}/

52368df

Signed-off-by: Yaroslav Halchenko <[email protected]>

Extend Zarr design doc with pointers/examples on current implementation

df500f1

Signed-off-by: Yaroslav Halchenko <[email protected]>

Some more rewording and expansion in the design doc

573e72b

Signed-off-by: Yaroslav Halchenko <[email protected]>

More to Zarr versioning design: need some "ZarrVersion" or "Upload" i…

1e15c75

…d; can reuse metadata

yarikoptic force-pushed the zarr-manifest-design branch from 6a1a1d9 to 1e15c75 Compare August 27, 2024 19:18

yarikoptic added 3 commits August 28, 2024 15:46

Aim to remove ZarrArchive.dandiset and related

d2658a4

Some more of outloud thinking etc which was not committed

a60a6a7

moved updated outloud thinking up before "changes needed"

0ea7dc1

