Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upload/asset blob garbage collection design doc #1733

Merged
merged 7 commits into from
Dec 1, 2023

Conversation

mvandenburgh
Copy link
Member

No description provided.


## Background

Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no associated with any dandisets) is more complex and is left for a future design document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be good to add this type of asset below under the next section. in the dandi model this would be the biggest source of garbage as we mint a new asset each time an asset or blob is modified.

on this note, can we run a quick analysis of the numbers in each category? i.e. does skipping this for a future iteration really address the issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be good to add this type of asset below under the next section. in the dandi model this would be the biggest source of garbage as we mint a new asset each time an asset or blob is modified.

The intention of this document is to only cover asset blob/upload garbage collection, since these are more straightforward to implement than Asset garbage collection (with most of the complexity there surrounding (1) are there cases where we want to retain old assets in the DB, and (2) the fundamental incompatibility the current Asset data model has with garbage collection due to the Asset.previous pointer).

on this note, can we run a quick analysis of the numbers in each category? i.e. does skipping this for a future iteration really address the issue.

Yes, I can get these numbers and add them here.

Copy link
Member Author

@mvandenburgh mvandenburgh Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these numbers as of today to the design doc (5f1301a).

mvandenburgh and others added 2 commits November 3, 2023 15:16
Co-authored-by: Yaroslav Halchenko <[email protected]>
Add a note that this design only applies to regular assets,
and not zarrs.
@satra
Copy link
Member

satra commented Nov 6, 2023

i'll add my thoughts here on linked assets. the primary reason for that was to maintain complete history so we could have an audit trail of changes. since a PUT generates a new asset id. the secondary reason was provenance. there was supposed to be a a provenance record added on a PUT. i don't know if that was done.

the audit trail is still key to have given that we have not exposed an audit trail to our users and we really should. and that can follow the same principles as the trailing delete. we only keep audit trails up to some specified number of days.

i suspect there are lots of assets that don't belong to any dandiset and are also not part of any chain to an asset that belongs to a dandiset. can we break up assets into those categories: asset chains that don't belong to any dandiset? and asset chains that belong to some? @mvandenburgh - would it be possible to get these numbers (total numbers of assets in these two categories)?

@yarikoptic
Copy link
Member

@mvandenburgh ping on above

@waxlamp
Copy link
Member

waxlamp commented Nov 30, 2023

Mike and I discussed this design doc and implementation today a bit. Some of my thoughts are below, inline with your comments.

i'll add my thoughts here on linked assets. the primary reason for that was to maintain complete history so we could have an audit trail of changes. since a PUT generates a new asset id. the secondary reason was provenance. there was supposed to be a a provenance record added on a PUT. i don't know if that was done.

the audit trail is still key to have given that we have not exposed an audit trail to our users and we really should. and that can follow the same principles as the trailing delete. we only keep audit trails up to some specified number of days.

In order to properly address the audit trail and provenance use cases, we'll need to collect whatever information is present in the "previous" chains, put in place a way to collect the appropriate information into the future, and finally "cut" the previous link out of the Asset model in order to enable a more straightforward and comprehensive garbage collection scheme.

I've scheduled a meeting with the four of us (Satra, Yarik, Mike, Roni) in order to get a handle on what is really needed for "audit trails" and "provenance" (I put these in quotes because the terms encompass rather large concepts, and I want to make sure I understand what we specifically need to do in DANDI). Once we have requirements, we can perform the catch-up work on the asset chains and finally clear the way for real GC to happen.

i suspect there are lots of assets that don't belong to any dandiset and are also not part of any chain to an asset that belongs to a dandiset. can we break up assets into those categories: asset chains that don't belong to any dandiset? and asset chains that belong to some? @mvandenburgh - would it be possible to get these numbers (total numbers of assets in these two categories)?

I believe @mvandenburgh can indeed run these numbers. It will be helpful in understanding just how many chains we need to mine for that audit and provenance history.

But overall, these issues really are affecting asset GC, and because there are some substantial (but, I think manageable) issues blocking us there, I'd really like to get the simpler slice of the GC picture that is present in this design to be independently deployable, even if it seems like just a stone's throw away to also capture the design for assets. The worst that will happen is we run the asset blob GC now to see how much is cleanupable, and then we will need to run it again after an initial asset GC round occurs.

Copy link
Member

@waxlamp waxlamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design doc looks great. I would like to ratify and merge this, while handling the slightly complex issue tree surrounding Asset garbage collection ("chains", what to do about audit/provenance, and how to reliably delete dead assets while maintaining audit/provenance information) separately.

@satra, @yarikoptic if you are ok with that, please leave an approval as well.

Copy link
Member

@satra satra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm approving this and wanted to comment why i felt adding those numbers were important. this started with garbage collection, but the amount of garbage collected by this is less than 1%.

@waxlamp
Copy link
Member

waxlamp commented Dec 1, 2023

i'm approving this and wanted to comment why i felt adding those numbers were important. this started with garbage collection, but the amount of garbage collected by this is less than 1%.

I sympathize with your reaction here, but the tradeoff is that this was the easiest phase of GC to design and develop, which of course correlates with its lower level of "cleaning power". I guess Mike and I are following the snowball strategy here, knocking out the simplest phases in order, to try to achieve faster incremental results. We'll only see the full power after everything is in.

Thanks for the approval.

@mvandenburgh mvandenburgh merged commit 8841600 into master Dec 1, 2023
10 checks passed
@mvandenburgh mvandenburgh deleted the gc-asset-blobs-uploads-doc branch December 1, 2023 21:54
@dandibot
Copy link
Member

🚀 PR was released in v0.3.67 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
released This issue/pull request has been released.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants