
Migration of storage layer to delta-rs #307

Merged: 21 commits, Mar 14, 2023
Conversation

@gruuya (Contributor) commented Mar 3, 2023

This is the initial effort to switch from our custom MVCC/Lakehouse protocol to the Delta Lake protocol.

A couple of general points:

  • The full range of DDL & DML statements is supported, including DELETE & UPDATE
  • VACUUM is not implemented yet for new tables
  • Likewise, GC of dropped new tables is not implemented yet, though the needed foundation is in place (the dropped_table table and related methods)
  • No performance testing has been done yet (the object upload code is likely sub-optimal, since it doesn't do multipart uploads)
  • Cleanup of legacy code isn't complete yet: I only replaced what was in the way of getting things working with delta-rs. In theory the entire writing logic for Seafowl tables can be erased now. The reading logic for Seafowl tables should still remain (there are tests for that too), as the plan is to keep it available throughout the transitional Seafowl versions for easier data migration.
  • I've also bumped to DataFusion 19.0.0/arrow post-33.0.0 in order to pick up "Enable casting of string to timestamp with microsecond resolution" (apache/arrow-rs#3752) and "Unique delta object store url" (delta-io/delta-rs#1212).
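To make the multipart-upload note above concrete, here is a minimal std-only sketch of the chunking step an eventual multipart upload path would need; `split_into_parts` and `PART_SIZE` are hypothetical names for illustration, not Seafowl's or the object_store crate's actual API.

```rust
// Hypothetical sketch of multipart chunking: split a serialized object
// into fixed-size parts before handing them to the object store.
// S3's minimum part size is 5 MiB (all parts but the last).
const PART_SIZE: usize = 5 * 1024 * 1024;

/// Split a payload into multipart-sized chunks; the last chunk may be smaller.
fn split_into_parts(payload: &[u8]) -> Vec<&[u8]> {
    payload.chunks(PART_SIZE).collect()
}

fn main() {
    // A 12 MiB payload yields three parts: 5 + 5 + 2 MiB.
    let payload = vec![0u8; 12 * 1024 * 1024];
    let parts = split_into_parts(&payload);
    assert_eq!(parts.len(), 3);
    assert_eq!(parts[0].len(), PART_SIZE);
    assert_eq!(parts[2].len(), 2 * 1024 * 1024);
    println!("{} parts", parts.len());
}
```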

As for the implementation details:

  • I've opted to keep a single root object store around (the inner in internal_object_store) and then "scope" it for each DeltaTable accordingly. This means wrapping it in some other object stores so that / points to the table root URI (see below). The other alternative delta-rs offers is providing store_options, out of which it then builds the store for a given URI; I didn't like this mostly because the memory store implementation then doesn't work (it gets re-created from scratch for each create/write/table builder instantiation).
  • The table root is set to <inner_store_root>/<table_uuid>/ (that is where delta-rs puts all the table-related files, as per the protocol). For instance s3://my_bucket/e62c26e2-b6e8-414a-a37a-66dd77430b5d/ or file:///data/dir/e62c26e2-b6e8-414a-a37a-66dd77430b5d/. The table_uuid is stored in the table catalog table and generated randomly for each new table.
  • This approach means that dropping a schema/table can be done lazily (e.g. during subsequent VACUUMs, in chunked batches), because the table URI is unique to a given table and independent of the db/schema/table name. Likewise, renaming a table is only a catalog operation.
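The "scoping" described above can be sketched as a thin wrapper that prefixes every path with the table's UUID, so that / resolves to <inner_store_root>/<table_uuid>/. The `SimpleStore` trait, `MemoryStore`, and `ScopedStore` below are simplified stand-ins for illustration, not the actual object_store or delta-rs types.

```rust
use std::collections::HashMap;

// Simplified stand-in for the object store trait; the point is only to
// show how one root store can be "scoped" per Delta table.
trait SimpleStore {
    fn put(&mut self, path: &str, data: Vec<u8>);
    fn get(&self, path: &str) -> Option<&Vec<u8>>;
}

/// In-memory root store shared by all tables (the "inner" store).
#[derive(Default)]
struct MemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl SimpleStore for MemoryStore {
    fn put(&mut self, path: &str, data: Vec<u8>) {
        self.objects.insert(path.to_string(), data);
    }
    fn get(&self, path: &str) -> Option<&Vec<u8>> {
        self.objects.get(path)
    }
}

/// Scoped view: every path is prefixed with the table's UUID, so "/"
/// points at the table root, independent of db/schema/table names.
struct ScopedStore<'a, S: SimpleStore> {
    inner: &'a mut S,
    table_uuid: String,
}

impl<'a, S: SimpleStore> ScopedStore<'a, S> {
    fn full_path(&self, path: &str) -> String {
        format!("{}/{}", self.table_uuid, path.trim_start_matches('/'))
    }
    fn put(&mut self, path: &str, data: Vec<u8>) {
        let full = self.full_path(path);
        self.inner.put(&full, data);
    }
}

fn main() {
    let mut root = MemoryStore::default();
    let mut scoped = ScopedStore {
        inner: &mut root,
        table_uuid: "e62c26e2-b6e8-414a-a37a-66dd77430b5d".to_string(),
    };
    // A write to the table-relative path lands under the UUID prefix.
    scoped.put("_delta_log/00000000000000000000.json", b"{}".to_vec());
    assert!(root.objects.contains_key(
        "e62c26e2-b6e8-414a-a37a-66dd77430b5d/_delta_log/00000000000000000000.json"
    ));
    println!("scoped write ok");
}
```

Because the prefix is a random UUID, renaming the table touches only the catalog, and deleting its objects can happen lazily later, which matches the bullets above.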

@gruuya gruuya requested a review from mildbyte March 3, 2023 10:59
@mildbyte (Contributor) left a comment


This looks like it's on the right track, great work!

Re: the /[dbname]/[schemaname]/[tablename] vs /[table_uuid] path prefix scheme, I'm inclined towards the latter. It looks like it'd help us avoid a lot of complexity and edge cases with managing object store prefixes and paths. Re: potential downsides:

  • We'll have to have the catalog DB anyway to store UDF definitions and future metadata (unless we somehow manage to keep everything in the delta lake as flat files and avoid the catalog DB altogether, which could be an interesting thought experiment)
  • The delta spec doesn't seem to concern itself with how the tables are named/arranged (it only specifies what happens inside the table directory), so I don't think many other tools would rely on a specific way the table directory is named/accessed vs. just operating on the contents of that directory
  • Depending on the object store implementation, there might be some argument against having a giant root directory with all the tables vs a hierarchy of db/schema/table (performance reasons, fewer items in a single directory)
  • Similarly, we might be able to use some object store specific authorization code / IAMs to implement multitenancy -- this might mean sharding just by db-id/table-id? Not sure.

How do information schema queries work under this proposed model? It looks like we use the catalog DB to satisfy them instead of crawling the object store for all tables and their schemas?

Review threads (resolved): tests/statements/ddl.rs, src/provider.rs
@gruuya (Contributor, Author) commented Mar 3, 2023

How do information schema queries work under this proposed model? It looks like we use the catalog DB to satisfy them instead of crawling the object store for all tables and their schemas?

Yes, that is correct, we pre-load the schema information from our catalog.
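The catalog-backed lookup described here can be sketched as follows; `CatalogColumn` and `columns_for_table` are illustrative stand-ins, not Seafowl's actual catalog types, but they show how an information_schema query can be answered from pre-loaded catalog rows without touching the object store.

```rust
// Hypothetical catalog rows, standing in for the catalog DB contents
// that Seafowl pre-loads at query time.
struct CatalogColumn {
    table_name: String,
    column_name: String,
    data_type: String,
}

/// Answer an information_schema-style column lookup purely from catalog
/// rows, without listing or crawling the object store.
fn columns_for_table<'a>(
    catalog: &'a [CatalogColumn],
    table: &str,
) -> Vec<(&'a str, &'a str)> {
    catalog
        .iter()
        .filter(|c| c.table_name == table)
        .map(|c| (c.column_name.as_str(), c.data_type.as_str()))
        .collect()
}

fn main() {
    let catalog = vec![
        CatalogColumn { table_name: "t1".into(), column_name: "id".into(), data_type: "BIGINT".into() },
        CatalogColumn { table_name: "t1".into(), column_name: "val".into(), data_type: "TEXT".into() },
        CatalogColumn { table_name: "t2".into(), column_name: "ts".into(), data_type: "TIMESTAMP".into() },
    ];
    let cols = columns_for_table(&catalog, "t1");
    assert_eq!(cols, vec![("id", "BIGINT"), ("val", "TEXT")]);
    println!("{cols:?}");
}
```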

@gruuya gruuya marked this pull request as ready for review March 13, 2023 11:55
@mildbyte (Contributor) left a comment


This is amazing and should let us close #175 + delete a lot of our homegrown table version tracking code! Great work 👍

Review threads:
tests/data/seafowl-legacy-data/seafowl.sqlite-wal (outdated, resolved)
src/system_tables.rs (outdated, resolved)
src/repository/default.rs (outdated, resolved)
src/catalog.rs (outdated)
// Build a delta table but don't load it yet; we'll do that only for tables that are
// actually referenced in a statement, via the async `table` method of the schema provider.

// TODO: if the table has no columns, the result set will be empty, so we use the default UUID (all zeros).
Contributor:

Are there any implications? do we use this UUID to apply some DDL modifications to 0-column tables later down the line?

Contributor Author:

Yes, there are: we use this UUID to scope our root object store for a particular table by adding it to the end of the path (in other words, prefixing all table paths with it, relative to the root store URI).

I'll think about a way to better handle the no-column situation.

Contributor Author:

I realized that this comment is misleading: there's no way the vec in build_table can be empty, since before we pass it down we group the original collection_columns_vec by table name (.group_by(|col| &col.table_name)), which means there was at least one element in the vector to begin with for the grouping to produce it (likewise for build_legacy_table).

Moreover, in the case of new tables, we only really need the UUID besides the name, which we can also fetch with the group-by call, since it's unique per table. In other words, the rest of the AllDatabaseColumnsResult fields don't play a role in instantiating new tables, and I aim to simplify that loading logic after dropping the legacy tables completely.

On the other hand, this means that tables without columns wouldn't be loaded into the schema provider's map at all. However, this problem is averted, since delta-rs doesn't allow column-less table creation (and we don't yet support ALTER TABLE DROP COLUMN). Therefore, for now I'll just add a test to lock that in so it doesn't change underneath us.
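The grouping argument above can be demonstrated with a std-only stand-in for the itertools `.group_by(|col| &col.table_name)` call (itertools groups consecutive equal keys; the map below groups globally, which is equivalent when rows arrive sorted by table name). `ColumnRow` and `group_by_table` are illustrative names, not Seafowl's actual types.

```rust
use std::collections::BTreeMap;

// Illustrative stand-in for AllDatabaseColumnsResult rows; only the
// fields relevant to the argument are kept.
struct ColumnRow {
    table_name: String,
    table_uuid: String, // unique per table, so any row in a group yields it
}

/// Group rows by table name. A group can only exist because at least
/// one row produced it, so every group is non-empty by construction
/// and the UUID can always be read off the first row.
fn group_by_table(rows: &[ColumnRow]) -> BTreeMap<String, Vec<&ColumnRow>> {
    let mut groups: BTreeMap<String, Vec<&ColumnRow>> = BTreeMap::new();
    for row in rows {
        groups.entry(row.table_name.clone()).or_default().push(row);
    }
    groups
}

fn main() {
    let rows = vec![
        ColumnRow { table_name: "t1".into(), table_uuid: "uuid-1".into() },
        ColumnRow { table_name: "t1".into(), table_uuid: "uuid-1".into() },
        ColumnRow { table_name: "t2".into(), table_uuid: "uuid-2".into() },
    ];
    for (table, group) in group_by_table(&rows) {
        assert!(!group.is_empty());
        println!("{table} -> {}", group[0].table_uuid);
    }
}
```

The flip side, as noted above, is that a table contributing zero rows simply never appears in the output map, which is why column-less tables would be skipped by the schema provider.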

src/context.rs (six review threads, resolved; some outdated)
@gruuya gruuya merged commit 4e110c4 into main Mar 14, 2023
@gruuya gruuya deleted the storage-delta-lake-migration branch March 14, 2023 14:51