-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add upload/asset blob garbage collection design doc #1733
Conversation
|
||
## Background | ||
|
||
Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no associated with any dandisets) is more complex and is left for a future design document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it would be good to add this type of asset below under the next section. in the dandi model this would be the biggest source of garbage as we mint a new asset each time an asset or blob is modified.
on this note, can we run a quick analysis of the numbers in each category? i.e. does skipping this for a future iteration really address the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it would be good to add this type of asset below under the next section. in the dandi model this would be the biggest source of garbage as we mint a new asset each time an asset or blob is modified.
The intention of this document is to only cover asset blob/upload garbage collection, since these are more straightforward to implement than Asset
garbage collection (with most of the complexity there surrounding (1) are there cases where we want to retain old assets in the DB, and (2) the fundamental incompatibility the current Asset
data model has with garbage collection due to the Asset.previous
pointer).
on this note, can we run a quick analysis of the numbers in each category? i.e. does skipping this for a future iteration really address the issue.
Yes, I can get these numbers and add them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these numbers as of today to the design doc (5f1301a).
Co-authored-by: Yaroslav Halchenko <[email protected]>
Add a note that this design only applies to regular assets, and not zarrs.
eef6976
to
23c55ad
Compare
i'll add my thoughts here on linked assets. the primary reason for that was to maintain complete history so we could have an audit trail of changes. since a the audit trail is still key to have given that we have not exposed an audit trail to our users and we really should. and that can follow the same principles as the trailing delete. we only keep audit trails up to some specified number of days. i suspect there are lots of assets that don't belong to any dandiset and are also not part of any chain to an asset that belongs to a dandiset. can we break up assets into those categories: asset chains that don't belong to any dandiset? and asset chains that belong to some? @mvandenburgh - would it be possible to get these numbers (total numbers of assets in these two categories)? |
@mvandenburgh ping on above |
Mike and I discussed this design doc and implementation today a bit. Some of my thoughts are below, inline with your comments.
In order to properly address the audit trail and provenance use cases, we'll need to collect whatever information is present in the "previous" chains, put in place a way to collect the appropriate information into the future, and finally "cut" the previous link out of the I've scheduled a meeting with the four of us (Satra, Yarik, Mike, Roni) in order to get a handle on what is really needed for "audit trails" and "provenance" (I put these in quotes because the terms encompass rather large concepts, and I want to make sure I understand what we specifically need to do in DANDI). Once we have requirements, we can perform the catch-up work on the asset chains and finally clear the way for real GC to happen.
I believe @mvandenburgh can indeed run these numbers. It will be helpful in understanding just how many chains we need to mine for that audit and provenance history. But overall, these issues really are affecting asset GC, and because there are some substantial (but, I think manageable) issues blocking us there, I'd really like to get the simpler slice of the GC picture that is present in this design to be independently deployable, even if it seems like just a stone's throw away to also capture the design for assets. The worst that will happen is we run the asset blob GC now to see how much is cleanupable, and then we will need to run it again after an initial asset GC round occurs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This design doc looks great. I would like to ratify and merge this, while handling the slightly complex issue tree surrounding Asset
garbage collection ("chains", what to do about audit/provenance, and how to reliably delete dead assets while maintaining audit/provenance information) separately.
@satra, @yarikoptic if you are ok with that, please leave an approval as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm approving this and wanted to comment why i felt adding those numbers were important. this started with garbage collection, but the amount of garbage collected by this is less than 1%.
I sympathize with your reaction here, but the tradeoff is that this was the easiest phase of GC to design and develop, which of course correlates with its lower level of "cleaning power". I guess Mike and I are following the snowball strategy here, knocking out the simplest phases in order, to try to achieve faster incremental results. We'll only see the full power after everything is in. Thanks for the approval. |
🚀 PR was released in |
No description provided.