
rsc: Create tables for tracking Blobs #1490

Merged: 5 commits into master on Dec 15, 2023

Conversation

Contributor

@V-FEXrt commented Dec 13, 2023

Note: due to sqlite3 limitations, we had to rewrite migration history to pass our tests. After the initial deployment this becomes much more difficult. We are researching whether we can avoid sqlite3 for our test flow.

Creates the tables and relationships required for tracking uploaded blobs. Minimally updates routes and tests so they will pass. Updating the routes to work as expected will come in a future PR.

@V-FEXrt requested a review from JakeSiFive on December 13, 2023 at 21:42
Contributor

@JakeSiFive left a comment

Approved. I suspect some of the on delete restricts should most likely be cascade instead.

.to_col(BlobStore::Id)
// All blobs for a given store must be manually
// deleted before the blob store may be deleted.
.on_delete(ForeignKeyAction::Restrict),
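
For context, the excerpt above sits inside a foreign-key definition along roughly these lines. This is only a sketch in sea-query style: the constraint name and the Blob::StoreId column are assumptions for illustration, not the PR's exact code.

use sea_query::{ForeignKey, ForeignKeyAction, ForeignKeyCreateStatement, Iden};

// Hypothetical identifiers standing in for the migration's real tables.
#[derive(Iden)]
enum Blob {
    Table,
    StoreId,
}

#[derive(Iden)]
enum BlobStore {
    Table,
    Id,
}

// Blob -> BlobStore with ON DELETE RESTRICT: a blob store row cannot be
// deleted while any blob still references it.
fn blob_store_fk() -> ForeignKeyCreateStatement {
    ForeignKey::create()
        .name("fk-blob-blob_store_id")
        .from(Blob::Table, Blob::StoreId)
        .to(BlobStore::Table, BlobStore::Id)
        .on_delete(ForeignKeyAction::Restrict)
        .to_owned()
}
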
Contributor

What are your thoughts on restrict vs cascade? What about specifically for blob store?

Contributor Author

In the comments I gave my core justification. But in general cascade is pretty dangerous.

In this case, if you deleted a BlobStore, it would cascade to the Blobs, which would cascade to the OutputFiles, which would cascade to the Jobs.

To me that really isn't acceptable, especially since we have a clear workflow for deleting all of those things:

  1. Job eviction deletes a Job which cascades to its OutputFiles
  2. Blob eviction later sees those blobs with no references so it triggers proper deletion using the backing blob store API then deletes the entry in the blob table
  3. We can then manually remove the decommissioned BlobStore

If we make it cascade, we will be guaranteeing that blobs get orphaned in the remote store without really gaining anything in the API.
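
A rough sketch of what step 2 of that workflow might look like; the names BlobRecord, BlobBackend, and delete_blob_row are hypothetical stand-ins rather than the real rsc API:

struct BlobRecord {
    id: i64,
    key: String,
}

trait BlobBackend {
    // Deletes the object in the remote store; this must succeed before we
    // drop the database row so a failure never orphans a remote object.
    fn delete(&self, key: &str) -> Result<(), String>;
}

// Step 2 of the eviction workflow: for every blob with no remaining
// references, delete the remote object first, then the database entry.
fn evict_unreferenced_blobs<B: BlobBackend>(
    backend: &B,
    unreferenced: Vec<BlobRecord>,
    delete_blob_row: impl Fn(i64) -> Result<(), String>,
) -> Result<(), String> {
    for blob in unreferenced {
        // If the remote delete fails we keep the row; a later eviction pass
        // retries, so the remote store never ends up with orphaned blobs.
        backend.delete(&blob.key)?;
        delete_blob_row(blob.id)?;
    }
    Ok(())
}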

.to_col(Blob::Id)
// A blob may not be deleted if there exists a
// job that depends on it.
.on_delete(ForeignKeyAction::Restrict),
Contributor

These in particular seem like they might be too strong? Sometimes a blob will go missing from storage, and if that happens we want to delete the blob from the database, but also delete any jobs that depend on it. I think deleting stores is often not needed, so I'm less concerned with that, but deleting blobs seems more important and common.

Contributor Author

If a blob goes missing from storage you should just delete the whole job. Then you can either delete the blob manually or let blob eviction clean up the entry

Contributor Author

In fact, I claim you should always delete the job first. That way you stop the bleeding with clients. Deleting the blob first makes the bleeding worse for a short amount of time

Contributor

I agree you should delete the job, but there may be many jobs you need to delete, not just one.

Contributor Author

If this is a case we see happening a lot, we should write code to handle the cleanup (we can use rsc_tool or the HTTP API to delete all jobs that depend on blob x) instead of having the database work against the guarantees we want.
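
A sketch of the query such a cleanup helper could issue, in sea-query style to match the migration; the Job and OutputFile identifiers (and the choice of Postgres builder) are assumptions, and jobs referencing the blob through stdout/stderr would need an analogous clause:

use sea_query::{Expr, Iden, PostgresQueryBuilder, Query};

// Hypothetical identifiers for the tables involved in the cleanup.
#[derive(Iden)]
enum Job {
    Table,
    Id,
}

#[derive(Iden)]
enum OutputFile {
    Table,
    JobId,
    BlobId,
}

// Builds a DELETE that removes every job owning an output file that points
// at the given blob.
fn delete_jobs_depending_on(blob_id: i64) -> String {
    Query::delete()
        .from_table(Job::Table)
        .and_where(
            Expr::col(Job::Id).in_subquery(
                Query::select()
                    .column(OutputFile::JobId)
                    .from(OutputFile::Table)
                    .and_where(Expr::col(OutputFile::BlobId).eq(blob_id))
                    .to_owned(),
            ),
        )
        .to_string(PostgresQueryBuilder)
}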

Contributor Author

@V-FEXrt Dec 15, 2023

Cascade wouldn't be as bad if we had some way to ensure we clean up the remote stores when a blob is deleted. I'm not sure how we'd do that, but if you have any ideas I'm open to hearing them.

Contributor

The way I've been doing it in the local shared cache is that you scan both ways, because you shouldn't trust either side. One scan checks for non-existent blobs that are referenced in the DB; the other checks for blobs that exist but are orphaned. This appears to be a required step because you can't atomically delete the blob on one side and the blob in the database. One must be deleted before the other, so at least one of the scan directions is required. In practice I've found that both are helpful.
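
An illustrative sketch of that two-direction scan, assuming we can list blob keys from both the database and the backing store; the function and its inputs are stand-ins, not the shared cache's actual API:

use std::collections::HashSet;

// Compares the blob keys recorded in the database against the objects
// actually present in the backing store, in both directions.
fn consistency_scan(
    db_keys: &HashSet<String>,
    store_keys: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    // Direction 1: referenced in the DB but missing from the store; these
    // rows (and any jobs depending on them) need to be cleaned up.
    let missing_from_store: Vec<String> =
        db_keys.difference(store_keys).cloned().collect();

    // Direction 2: present in the store but unknown to the DB; these
    // orphans can be deleted from the backing store.
    let orphaned_in_store: Vec<String> =
        store_keys.difference(db_keys).cloned().collect();

    (missing_from_store, orphaned_in_store)
}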

Contributor

Side note: we have been using on delete cascade; if you want to change that, let's change it everywhere you think it makes sense while we're changing the DB.

I'm fine with it either way. Using cascade only makes the delete easier; it doesn't actually buy us any kind of assurance, so I'm quite happy to keep restrict over cascade. But I think the assurances restrict buys us are weak (though certainly greater than the absence of assurance we get from cascade).

I will warn against thinking that this protects you from the blob store and the database falling out of sync; I do not think it accomplishes that. Distributed systems are hard.

Contributor

Yeah, OK, I think I've landed on liking restrict with the following reasoning:

  1. It corrects for some mistakes
  2. We still need at least one direction of out-of-sync checks, if not both, but that's OK (it's not a plus for either side)
  3. We can always do the cascade deletes manually if we need to

.to_col(Blob::Id)
// A blob may not be deleted if there exists an
// output file that depends on it.
.on_delete(ForeignKeyAction::Restrict),
Contributor

Ditto; for instance, when we delete a job, we cascade the deletion to the output file.

Comment on lines 32 to 33
stdout_id: Set(payload.stdout_id),
stderr_id: Set(payload.stderr_id),
Contributor

Minor nit: both here and above in the DB, I think "stdfoo_blob_id" or "stdfoo_blob_key" or something like that would be better. Just seeing "stdout_id" doesn't really tell me what's happening here.

Comment on lines 406 to 407
"stdout_id": blob_id,
"stderr_id": blob_id,
Contributor

oh neat upshot!

@V-FEXrt merged commit 8501a51 into master on Dec 15, 2023
12 checks passed
@V-FEXrt deleted the rsc-create-blob-entities branch on December 15, 2023 at 20:54