Add upload/asset blob garbage collection design doc #1733

mvandenburgh · 2023-11-03T14:47:20Z

No description provided.

satra · 2023-11-03T14:52:36Z

doc/design/garbage-collection-uploads-asset-blobs.md

+
+## Background
+
+Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no associated with any dandisets) is more complex and is left for a future design document.


i think it would be good to add this type of asset below under the next section. in the dandi model this would be the biggest source of garbage as we mint a new asset each time an asset or blob is modified.

on this note, can we run a quick analysis of the numbers in each category? i.e. does skipping this for a future iteration really address the issue.

> i think it would be good to add this type of asset below under the next section. in the dandi model this would be the biggest source of garbage as we mint a new asset each time an asset or blob is modified.

The intention of this document is to cover only asset blob/upload garbage collection, since these are more straightforward to implement than Asset garbage collection. Most of the complexity there concerns (1) whether there are cases where we want to retain old assets in the DB, and (2) the fundamental incompatibility between the current Asset data model and garbage collection, caused by the Asset.previous pointer.

> on this note, can we run a quick analysis of the numbers in each category? i.e. does skipping this for a future iteration really address the issue.

Yes, I can get these numbers and add them here.

I added these numbers as of today to the design doc (5f1301a).

satra · 2023-11-06T16:15:20Z

i'll add my thoughts here on linked assets. the primary reason for that was to maintain complete history so we could have an audit trail of changes, since a PUT generates a new asset id. the secondary reason was provenance. there was supposed to be a provenance record added on a PUT. i don't know if that was done.

the audit trail is still key to have given that we have not exposed an audit trail to our users and we really should. and that can follow the same principles as the trailing delete. we only keep audit trails up to some specified number of days.

i suspect there are lots of assets that don't belong to any dandiset and are also not part of any chain to an asset that belongs to a dandiset. can we break up assets into those categories: asset chains that don't belong to any dandiset? and asset chains that belong to some? @mvandenburgh - would it be possible to get these numbers (total numbers of assets in these two categories)?

yarikoptic · 2023-11-29T15:53:03Z

@mvandenburgh ping on above

waxlamp · 2023-11-30T01:39:43Z

Mike and I discussed this design doc and implementation today a bit. Some of my thoughts are below, inline with your comments.

> i'll add my thoughts here on linked assets. the primary reason for that was to maintain complete history so we could have an audit trail of changes, since a PUT generates a new asset id. the secondary reason was provenance. there was supposed to be a provenance record added on a PUT. i don't know if that was done.
>
> the audit trail is still key to have given that we have not exposed an audit trail to our users and we really should. and that can follow the same principles as the trailing delete. we only keep audit trails up to some specified number of days.

In order to properly address the audit trail and provenance use cases, we'll need to collect whatever information is present in the "previous" chains, put in place a way to collect the appropriate information into the future, and finally "cut" the previous link out of the Asset model in order to enable a more straightforward and comprehensive garbage collection scheme.

I've scheduled a meeting with the four of us (Satra, Yarik, Mike, Roni) in order to get a handle on what is really needed for "audit trails" and "provenance" (I put these in quotes because the terms encompass rather large concepts, and I want to make sure I understand what we specifically need to do in DANDI). Once we have requirements, we can perform the catch-up work on the asset chains and finally clear the way for real GC to happen.

> i suspect there are lots of assets that don't belong to any dandiset and are also not part of any chain to an asset that belongs to a dandiset. can we break up assets into those categories: asset chains that don't belong to any dandiset? and asset chains that belong to some? @mvandenburgh - would it be possible to get these numbers (total numbers of assets in these two categories)?

I believe @mvandenburgh can indeed run these numbers. It will be helpful in understanding just how many chains we need to mine for that audit and provenance history.
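As a rough illustration of the split being asked for, here is a minimal sketch in plain Python. It is not the dandi-archive schema or an actual query: the `Asset` dataclass, its `in_dandiset` flag, and the `split_assets` helper are hypothetical names made up for this example. Only the idea of a `previous` pointer comes from the discussion above. The sketch treats an asset as "kept" if it belongs to a dandiset or can be reached from such an asset by following `previous` links, and counts everything else as orphaned.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class Asset:
    """Hypothetical stand-in for an asset row; not the real Django model."""
    asset_id: str
    previous: Optional[str] = None  # id of the asset this one replaced, if any
    in_dandiset: bool = False       # assumed flag: referenced by some dandiset


def split_assets(assets: Dict[str, Asset]) -> Tuple[Set[str], Set[str]]:
    """Return (reachable, orphaned) asset ids.

    reachable: assets in a dandiset, plus every ancestor found by walking
               `previous` pointers back from them.
    orphaned:  all remaining assets.
    """
    reachable: Set[str] = set()
    for asset in assets.values():
        if not asset.in_dandiset:
            continue
        current = asset.asset_id
        # Stop at the start of the chain, at a dangling pointer, or when we
        # join a chain that has already been walked.
        while current in assets and current not in reachable:
            reachable.add(current)
            current = assets[current].previous
    orphaned = set(assets) - reachable
    return reachable, orphaned
```

Calling `split_assets` on a dictionary of assets keyed by id gives the two sets, and their sizes correspond to the two totals requested. On the real database the same walk would presumably need to be expressed as a query rather than done in memory.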

But overall, these issues really are affecting asset GC, and because there are some substantial (but, I think, manageable) issues blocking us there, I'd really like the simpler slice of the GC picture covered by this design to be independently deployable, even if capturing the design for assets seems just a stone's throw away. The worst that will happen is that we run the asset blob GC now to see how much can be cleaned up, and then need to run it again after an initial round of asset GC.

waxlamp

This design doc looks great. I would like to ratify and merge this, while handling the slightly complex issue tree surrounding Asset garbage collection ("chains", what to do about audit/provenance, and how to reliably delete dead assets while maintaining audit/provenance information) separately.

@satra, @yarikoptic if you are ok with that, please leave an approval as well.

satra

i'm approving this and wanted to comment why i felt adding those numbers was important. this started with garbage collection, but the amount of garbage collected by this is less than 1%.

waxlamp · 2023-12-01T19:58:56Z

> i'm approving this and wanted to comment why i felt adding those numbers was important. this started with garbage collection, but the amount of garbage collected by this is less than 1%.

I sympathize with your reaction here, but the tradeoff is that this was the easiest phase of GC to design and develop, which of course correlates with its lower level of "cleaning power". I guess Mike and I are following the snowball strategy here, knocking out the simplest phases in order, to try to achieve faster incremental results. We'll only see the full power after everything is in.

Thanks for the approval.

dandibot · 2023-12-14T16:06:23Z

🚀 PR was released in v0.3.67 🚀

Add upload/asset blob GC design doc

b3d1fca

mvandenburgh requested review from yarikoptic, satra and waxlamp November 3, 2023 14:47

satra reviewed Nov 3, 2023

yarikoptic reviewed Nov 3, 2023

yarikoptic reviewed Nov 3, 2023

mvandenburgh and others added 2 commits November 3, 2023 15:16

Fix typo

85b2c96

Co-authored-by: Yaroslav Halchenko <[email protected]>

Clarify zarr garbage collection

11942e1

Add a note that this design only applies to regular assets, and not zarrs.

yarikoptic reviewed Nov 3, 2023

mvandenburgh added 4 commits November 6, 2023 10:11

Clarify that objects need to be cleared from both S3 and DB

cfff55f

Use relative links for other design docs

bd1dee2

Add current orphaned data count

5f1301a

Add note about additional cause of orphaned asset blobs

23c55ad

mvandenburgh force-pushed the gc-asset-blobs-uploads-doc branch from eef6976 to 23c55ad Compare November 6, 2023 15:39

mvandenburgh requested review from yarikoptic and satra November 6, 2023 15:40

waxlamp approved these changes Nov 30, 2023

satra approved these changes Dec 1, 2023

mvandenburgh merged commit 8841600 into master Dec 1, 2023
10 checks passed

mvandenburgh deleted the gc-asset-blobs-uploads-doc branch December 1, 2023 21:54

dandibot added the released This issue/pull request has been released. label Dec 14, 2023
