QoL: Destructive schema sync after manual column dropping #2909
base: devel
Conversation
dlt/common/destination/client.py
Outdated
"An incremental field is being removed from schema." | ||
"You should unset the" | ||
" incremental with `incremental=dlt.sources.incremental.EMPTY`" | ||
) |
I just realized the incremental setting is also saved outside of the schema, so when the user:
- manually drops a column with an incremental setting from the destination, and
- syncs the schema destructively,

they will also need to unset the incremental with `apply_hints(incremental=dlt.sources.incremental.EMPTY)`.
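For example, a minimal sketch of that reset (resource, column, and data are made up for illustration; `apply_hints` and `dlt.sources.incremental.EMPTY` are the documented dlt way to clear an incremental hint):

```python
import dlt


@dlt.resource(primary_key="id")
def my_table():
    # hypothetical rows; the former incremental cursor column was dropped
    # manually in the destination
    yield {"id": 1, "value": "a"}


resource = my_table()
# Unset the incremental that referenced the dropped cursor column
resource.apply_hints(incremental=dlt.sources.incremental.EMPTY)
```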
dlt/destinations/impl/dummy/dummy.py
Outdated
def _get_actual_columns(self, table_name: str) -> List[str]:
    schema_columns = self.schema.get_table_columns(table_name)
    return list(schema_columns.keys())
We just get the columns from the schema because this is the dummy destination.
else:
    schema_columns = self.schema.get_table_columns(table_name)
    return list(schema_columns.keys())
It's impossible to get the actual columns from filesystem files without a table format unless we read the entire files. It's also unlikely that a user manually deletes specific columns from filesystem files, I think 👀. Therefore, we should raise NotImplementedError instead of doing basically nothing.
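A minimal sketch of what that suggestion could look like in the filesystem client's `_get_actual_columns` (message wording is illustrative, not the actual implementation):

```python
from typing import List


def _get_actual_columns(self, table_name: str) -> List[str]:
    # Plain filesystem files carry no table-format metadata, so reflecting the
    # real columns would require reading entire files; refuse instead of
    # silently returning the dlt schema columns.
    raise NotImplementedError(
        f"Cannot reflect actual columns of '{table_name}': filesystem tables"
        " without a table format do not support schema reflection"
    )
```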
This is a very good idea but we need to approach it in a more systematic way:
- (almost) all of our destinations have `def get_storage_table(self, table_name: str) -> Tuple[str, TTableSchemaColumns]:` and/or `def get_storage_tables(self, table_names: Iterable[str]) -> Iterable[Tuple[str, TTableSchemaColumns]]:` implemented. This reflects the storage to get the table schema out of it; you can use it to compare against the pipeline schema.
- let's formalize it: add a mixin class like `WithTableReflection`, in the same manner `WithStateSync` is done. `get_storage_tables` is the more general method, so you can add only this one to the mixin (see the sketch after this list).
- now add this mixin to all `JobClientBase` implementations for which you want to support the new schema sync.

When the above is done we are able to actually compute the schema diff.

Top level interface:
- we have `sync_schema`, which does a regular schema migration (adds missing columns and tables in the destination). We need another method which is the reverse: it deletes columns and tables from the schema that are not present in the destination and then does the schema sync above.
- the method above should have a dry run mode, where we do not change the pipeline schema and we do not sync it.
- it should make sure that `destination_client()` implements `WithTableReflection` before continuing.
- it should allow selecting the tables to be affected.

When this is done we can think about extending the cli, i.e. a `dlt pipeline <name> schema` command.
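A minimal sketch of the suggested mixin, built from the signatures quoted above. The class name follows the reviewer's `WithTableReflection` suggestion, and the import of `TTableSchemaColumns` from `dlt.common.schema.typing` is an assumption about where that type lives; this is not the final dlt API:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Tuple

from dlt.common.schema.typing import TTableSchemaColumns


class WithTableReflection(ABC):
    """Hypothetical capability mixin: the job client can reflect physical table schemas."""

    @abstractmethod
    def get_storage_tables(
        self, table_names: Iterable[str]
    ) -> Iterable[Tuple[str, TTableSchemaColumns]]:
        """Yield (table_name, columns) pairs reflected from destination storage.

        A table that does not exist in the destination would yield an empty columns dict.
        """
        ...
```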
"""Get actual column names from files in storage for regular (non-delta/iceberg) tables | ||
or column names from schema""" | ||
|
||
if self.is_open_table("iceberg", table_name): |
this should be merged into `get_storage_tables`, which is already present in the filesystem destination. It will be really helpful.
Note: both operate on Arrow schemas, so look at how `get_storage_table` works in lance... you do not need to implement a TypeMapper, just look at the lancedb implementation.
dlt/common/destination/client.py
Outdated
)
return expected_update

def update_stored_schema_destructively(
this is overall a good idea to add to the JobClient, but we should be more selective. Look at my top level review. We need: 1. a mixin class, 2. it needs to be optional.
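A rough sketch of the "optional" part, assuming the hypothetical `WithTableReflection` mixin from the earlier comment is importable and using dlt's `pipeline.destination_client()` accessor; the guard and names are illustrative, not the final API:

```python
import dlt

pipeline = dlt.pipeline("my_pipeline", destination="duckdb", dataset_name="my_data")

# Only proceed when the destination's job client actually supports table
# reflection (WithTableReflection is the hypothetical mixin sketched above).
with pipeline.destination_client() as client:
    if not isinstance(client, WithTableReflection):
        raise NotImplementedError(
            "This destination does not support table reflection;"
            " destructive schema sync is not available."
        )
    table_names = [t["name"] for t in pipeline.default_schema.data_tables()]
    storage_tables = dict(client.get_storage_tables(table_names))
```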
some changes needed
logger.warning(
    f"Table '{table_name}' does not use a table format and does not support"
    " true schema reflection. Returning column schemas from the dlt"
    " schema, which may be stale if the underlying files were manually"
    " modified."
)
yield (table_name, self.schema.get_table_columns(table_name))
Just realized that for parquet files we can also just use pyarrow and read the actual metadata 👀, but I still don't think people drop columns in parquet files...
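For reference, a small sketch of that idea: pyarrow can read the column list from a Parquet file's footer without loading any row data (the file path is made up for the example):

```python
import pyarrow.parquet as pq

# Read only the Parquet footer metadata to see which columns are actually
# present in the file on storage.
schema = pq.read_schema("local_data/my_table/part-0.parquet")
actual_columns = set(schema.names)
```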
Description
This PR adds a new pipeline function that syncs the dlt schema with the destination (not vice versa) by removing a column from the schema if that column has been manually deleted in the destination.
Related PRs:
#2754
Further:
This should be extended to table drop syncs as well.
Note:
This essentially solves the problem where the user manually drops objects in the destination and the dlt pipeline breaks; the sketch below illustrates the scenario.
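A sketch of that failure scenario, assuming a duckdb destination; the pipeline, table, and column names are made up, and the destructive sync call itself is omitted because its final name is part of this PR:

```python
import dlt

pipeline = dlt.pipeline("drop_demo", destination="duckdb", dataset_name="demo_data")
pipeline.run([{"id": 1, "obsolete_col": "x"}], table_name="my_table")

# Manual intervention this PR is about: a column is dropped directly in the
# destination, behind dlt's back.
with pipeline.sql_client() as client:
    qualified = client.make_qualified_table_name("my_table")
    client.execute_sql(f"ALTER TABLE {qualified} DROP COLUMN obsolete_col")

# The pipeline schema still contains "obsolete_col", so subsequent runs can
# fail or silently re-create the column; the destructive sync proposed here
# would instead remove the column from the dlt schema.
```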