rsc: Create tables for tracking Blobs #1490
Conversation
Approved. I suspect some of the on delete restricts should most likely be cascade instead.
.to_col(BlobStore::Id)
// All blobs for a given store must be manually
// deleted before the blob store may be deleted.
.on_delete(ForeignKeyAction::Restrict),
What are your thoughts on restrict vs cascade? What about specifically for blob store?
In the comments I gave my core justification. But in general cascade is pretty dangerous.
In this case, if you deleted a BlobStore, it would cascade to the Blobs, which would cascade to the OutputFiles, which would cascade to the Jobs.
To me that really isn't acceptable, especially since we have a clear workflow for deleting all of those things:
- Job eviction deletes a Job which cascades to its OutputFiles
- Blob eviction later sees those blobs with no references, deletes them properly via the backing blob store API, and then deletes the entry in the blob table (see the sketch below)
- We can then manually remove the decommissioned BlobStore
If we make it cascade we will be guaranteeing that blobs get orphaned in the remote store, without really gaining anything in the API.
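To make the ordering concrete, here is a minimal sketch of that eviction step, assuming hypothetical trait and helper names rather than the real rsc code: the backing object goes away first, and the blob row is only removed afterwards.

```rust
// Hedged sketch of the blob-eviction ordering described above.
// The traits and helpers are hypothetical stand-ins, not the rsc API.
struct BlobRow {
    id: i64,
    key: String,
}

trait BackingStore {
    fn delete_object(&self, key: &str) -> Result<(), String>;
}

trait MetadataDb {
    fn unreferenced_blobs(&self) -> Vec<BlobRow>;
    fn delete_blob_row(&self, id: i64) -> Result<(), String>;
}

fn evict_unreferenced_blobs(db: &dyn MetadataDb, store: &dyn BackingStore) -> Result<(), String> {
    for blob in db.unreferenced_blobs() {
        // Delete the backing object first: if this step fails, the DB row
        // survives and the next eviction pass can retry, so nothing is
        // silently orphaned in the remote store.
        store.delete_object(&blob.key)?;
        db.delete_blob_row(blob.id)?;
    }
    Ok(())
}
```

A cascade straight from BlobStore through Blob would skip the backing-store call entirely, which is the orphaning concern above.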
.to_col(Blob::Id)
// A blob may not be deleted if there exists a
// job that depends on it.
.on_delete(ForeignKeyAction::Restrict),
These in particular seem like they might be too strong? Sometimes a blob will go missing from storage, and if that happens we want to delete the blob from the database, but also delete any jobs that depend on it. I think deleting stores is often not needed, so I'm less concerned with that, but deleting blobs seems more important and common.
If a blob goes missing from storage you should just delete the whole job. Then you can either delete the blob manually or let blob eviction clean up the entry.
In fact, I claim you should always delete the job first. That way you stop the bleeding with clients; deleting the blob first makes the bleeding worse for a short amount of time.
I agree you should delete the job, but there may be many jobs you need to delete, not just one.
If this is a case we see happening a lot, we should write code to handle the cleanup (we can use rsc_tool or the HTTP API to delete all jobs that depend on blob x) instead of having the database work against the guarantees we want.
Cascade wouldn't be as bad if we had some way to ensure we clean up the remote stores when a blob is deleted. I'm not sure how we'd do that, but if you have any ideas I'm open to hearing them.
The way I've been doing it in the local shared cache is to scan both ways, because you shouldn't trust either side. One scan checks for non-existent blobs that are referenced in the DB; the other checks for blobs that exist but are orphaned. This appears to be a required step because you can't atomically delete the blob on one side and the row in the database: one must be deleted before the other, so at least one of the scan directions is required. In practice I've found that both are helpful.
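As an illustration only, the two scan directions over plain key sets might look like this (the real scans would stream from the DB and the backing store rather than materialize full sets):

```rust
use std::collections::HashSet;

// Sketch of the two consistency scans described above, over plain key sets.
fn find_out_of_sync(
    db_keys: &HashSet<String>,
    store_keys: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    // Scan 1: keys referenced in the DB whose backing object no longer exists.
    let missing_from_store = db_keys.difference(store_keys).cloned().collect();
    // Scan 2: objects present in the store that no DB row references (orphans).
    let orphaned_in_store = store_keys.difference(db_keys).cloned().collect();
    (missing_from_store, orphaned_in_store)
}
```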
Side note: we have been using on delete cascade; if you want to change that, let's change it everywhere you think it makes sense while we're changing the DB.
I'm fine with it either way. Using cascade only makes the delete easier, it doesn't actually buy us any kind of assurance, so I'm quite happy to keep restrict over cascade. But I think the assurances restrict buys us are weak (though certainly greater than the absence of assurance we get from cascade).
I will warn against thinking that this protects you from the blob store and the database falling out of sync; I don't think it accomplishes that. Distributed systems are hard.
Yeah ok, I think I've landed on liking restrict with the following reasoning:
- It corrects for some mistakes
- We still need at least one direction of out-of-sync checks, if not both, but that's OK (it's not a plus for either side)
- We can always do the cascade deletes manually if we need to
.to_col(Blob::Id)
// A blob may not be deleted if there exists an
// output file that depends on it.
.on_delete(ForeignKeyAction::Restrict),
Ditto; for instance, when we delete a job, we cascade the deletion to the output file.
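For contrast with the Restrict cases above, the job-to-output-file link presumably carries Cascade. A rough sketch, where the constraint name and identifiers are assumptions rather than the PR's actual code:

```rust
use sea_orm_migration::prelude::*;

// Sketch of the Cascade direction referred to above: deleting a Job also
// deletes its OutputFiles. Identifier names are assumptions.
fn output_file_job_fk() -> ForeignKeyCreateStatement {
    ForeignKey::create()
        .name("fk-output_file-job_id")
        .from_tbl(OutputFile::Table)
        .from_col(OutputFile::JobId)
        .to_tbl(Job::Table)
        .to_col(Job::Id)
        // Deleting a Job removes its OutputFile rows; the Blob rows they
        // pointed at stay behind for blob eviction to clean up.
        .on_delete(ForeignKeyAction::Cascade)
        .to_owned()
}
```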
rust/rsc/src/rsc/add_job.rs
Outdated
stdout_id: Set(payload.stdout_id),
stderr_id: Set(payload.stderr_id),
minor nit: Both here and above in the DB, I think "stdfoo_blob_id" or "stdfoo_blob_key" or something like that would be better. Just seeing "stdout_id" doesn't really tell me what's happening here.
rust/rsc/src/rsc/main.rs
Outdated
"stdout_id": blob_id, | ||
"stderr_id": blob_id, |
oh neat upshot!
Note: Due to sqlite3 limitations we had to rewrite migration history to pass our tests. After initial deployment this becomes much more difficult. We are researching whether we can avoid sqlite3 in our test flow.
Creates the tables and relationships required for tracking uploaded blobs. Minimally updates routes and tests so they will pass. Updating the routes to work as expected will come in a future PR.
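For orientation, a blob table plus the Restrict foreign key discussed above might be created in a SeaORM migration roughly like this; the column set, types, and names are assumptions, not the PR's actual schema.

```rust
use sea_orm_migration::prelude::*;

// Hedged sketch of one of the new tables; not the PR's actual schema.
async fn create_blob_table(manager: &SchemaManager<'_>) -> Result<(), DbErr> {
    manager
        .create_table(
            Table::create()
                .table(Blob::Table)
                .if_not_exists()
                .col(ColumnDef::new(Blob::Id).uuid().not_null().primary_key())
                .col(ColumnDef::new(Blob::Key).string().not_null())
                .col(ColumnDef::new(Blob::BlobStoreId).uuid().not_null())
                .foreign_key(
                    ForeignKey::create()
                        .from_tbl(Blob::Table)
                        .from_col(Blob::BlobStoreId)
                        .to_tbl(BlobStore::Table)
                        .to_col(BlobStore::Id)
                        // Matches the Restrict choice settled on in the
                        // conversation above.
                        .on_delete(ForeignKeyAction::Restrict),
                )
                .to_owned(),
        )
        .await
}
```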