
Conversation

@anuunchin (Contributor) commented on Jul 21, 2025

Description

This PR adds a new pipeline function that syncs the dlt schema with the destination (not vice versa) by removing a column from the schema if that column has been manually deleted in the destination.

The motivation: rather than offering a CLI command that drops columns (which would require separate drop_columns implementations per dialect and thus add maintenance overhead), we delegate the dropping itself to the user and instead let them sync the dlt schema in those scenarios.

Related PRs:

#2754

Further:

This should be extended to table drop syncs as well.

Note:

This essentially solves the problem where the user manually drops objects in the destination and the dlt pipeline breaks.


@anuunchin force-pushed the feat/1153-drop-column-sync branch 2 times, most recently from c280794 to c37c422 (July 21, 2025 07:56)
"An incremental field is being removed from schema."
"You should unset the"
" incremental with `incremental=dlt.sources.incremental.EMPTY`"
)
@anuunchin (Contributor, Author) commented on Jul 21, 2025:

I just realized the incremental setting is also saved outside of the schema, so when the user

  • manually drops a column with an incremental setting from the destination
  • syncs the schema destructively

they will also need to unset the incremental with apply_hints(incremental=dlt.sources.incremental.EMPTY).
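A minimal sketch of that unset, assuming a resource with a cursor column (the resource and column names are illustrative):

```python
# Hypothetical sketch: resetting the incremental hint after the cursor column
# was dropped in the destination and the schema was synced destructively.
import dlt

@dlt.resource
def events():
    yield {"id": 1, "updated_at": "2025-01-01T00:00:00Z"}

# The resource originally tracked "updated_at" incrementally.
events.apply_hints(incremental=dlt.sources.incremental("updated_at"))

# EMPTY unsets the incremental so dlt stops expecting the dropped column.
events.apply_hints(incremental=dlt.sources.incremental.EMPTY)
```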

```python
def _get_actual_columns(self, table_name: str) -> List[str]:
    schema_columns = self.schema.get_table_columns(table_name)
    return list(schema_columns.keys())
```

@anuunchin (Contributor, Author) commented:

We just get the columns from the schema here because this is the dummy destination.

```python
else:
    schema_columns = self.schema.get_table_columns(table_name)
    return list(schema_columns.keys())
```

@anuunchin (Contributor, Author) commented on Jul 21, 2025:

It's impossible to get the actual columns from filesystem files without a table format unless we read the entire files. Also, it's unlikely that the user manually deletes specific columns from filesystem files, I think 👀. Therefore, we should raise NotImplementedError instead of doing basically nothing.
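A minimal sketch of that suggestion, reusing the _get_actual_columns helper from the fragment above (the class stub and message text are assumptions):

```python
from typing import List

class FilesystemClient:  # stub standing in for the real filesystem job client
    # Hypothetical sketch: plain filesystem files cannot be reflected cheaply,
    # so refuse instead of silently returning possibly stale schema columns.
    def _get_actual_columns(self, table_name: str) -> List[str]:
        raise NotImplementedError(
            f"Cannot determine actual columns of '{table_name}': files without"
            " a table format would have to be read in full."
        )
```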

@anuunchin force-pushed the feat/1153-drop-column-sync branch from c37c422 to 598ca5b (July 21, 2025 08:54)
@anuunchin self-assigned this on Jul 21, 2025
@anuunchin requested a review from rudolfix on July 21, 2025 09:58
@rudolfix (Collaborator) left a comment:

This is a very good idea, but we need to approach it in a more systematic way:

  1. (Almost) all of our destinations have

```python
def get_storage_table(self, table_name: str) -> Tuple[str, TTableSchemaColumns]:
```

and/or

```python
def get_storage_tables(
    self, table_names: Iterable[str]
) -> Iterable[Tuple[str, TTableSchemaColumns]]:
```

implemented. This will reflect the storage to get the table schema out of it; you can use it to compare with the pipeline schema.

  2. Let's formalize it: add a mixin class like WithTableReflection, in the same manner WithStateSync is done (see the sketch below). get_storage_tables is the more general method, so you can add only this one to the mixin.
  3. Now add this mixin to all JobClientBase implementations for which you want to support our new schema sync.
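A minimal sketch of what such a mixin could look like, following the WithStateSync pattern (the class body is an assumption; the get_storage_tables signature is copied from the existing clients):

```python
# Hypothetical sketch of the proposed WithTableReflection mixin.
from abc import ABC, abstractmethod
from typing import Iterable, Tuple

from dlt.common.schema.typing import TTableSchemaColumns


class WithTableReflection(ABC):
    """Implemented by job clients that can reflect table schemas from storage."""

    @abstractmethod
    def get_storage_tables(
        self, table_names: Iterable[str]
    ) -> Iterable[Tuple[str, TTableSchemaColumns]]:
        """Yield (table_name, columns) pairs as they actually exist in the destination."""
        ...
```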

When the above is done, we are able to actually compute the schema diff, for example as sketched below.
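For illustration, the per-table diff could then be as simple as (the helper name is hypothetical):

```python
from dlt.common.schema.typing import TTableSchemaColumns

# Hypothetical helper: columns present in the pipeline schema but gone from
# storage are the candidates for removal in the destructive sync.
def columns_to_drop(
    pipeline_columns: TTableSchemaColumns, storage_columns: TTableSchemaColumns
) -> list[str]:
    return [name for name in pipeline_columns if name not in storage_columns]
```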

Top-level interface:

  1. We have sync_schema, which does a regular schema migration (adds missing columns and tables in the destination).
  2. We need another method which is the reverse: it deletes columns and tables from the schema that are not present in the destination, and then does the schema sync above.
  3. The method above should have a dry-run mode, where we neither change the pipeline schema nor sync it.
  4. It should make sure destination_client() implements WithTableReflection before continuing (see the guard sketch after this comment).
  5. It should allow selecting the tables to be affected.

When this is done, we can think about extending the CLI, i.e. a dlt pipeline <name> schema command.
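A sketch of the guard from point 4, assuming the WithTableReflection mixin sketched above (the surrounding call site is hypothetical; pipeline is an existing dlt.Pipeline):

```python
# Hypothetical guard at the start of the new destructive sync method.
with pipeline.destination_client() as client:
    if not isinstance(client, WithTableReflection):
        raise NotImplementedError(
            "This destination does not implement table reflection, so a"
            " destructive schema sync cannot be performed."
        )
```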

"""Get actual column names from files in storage for regular (non-delta/iceberg) tables
or column names from schema"""

if self.is_open_table("iceberg", table_name):
@rudolfix (Collaborator) commented:

This should be merged into get_storage_tables, which is already present in the filesystem destination. It will be really helpful.

Note: both are operating on Arrow schemas, so look at how get_storage_table works in lancedb... you do not need to implement a TypeMapper, just look at the lancedb implementation.

```python
        )
        return expected_update

    def update_stored_schema_destructively(
```
@rudolfix (Collaborator) commented:
This is overall a good idea to add to JobClient, but we should be more selective. Look at my top-level review: we need (1) the mixin class and (2) it needs to be optional.

@anuunchin force-pushed the feat/1153-drop-column-sync branch 3 times, most recently from c256d51 to d4aadb2 (July 28, 2025 07:53)
@anuunchin requested a review from rudolfix on July 28, 2025 09:07
@anuunchin force-pushed the feat/1153-drop-column-sync branch from d4aadb2 to ab715ff (July 28, 2025 09:28)
@anuunchin force-pushed the feat/1153-drop-column-sync branch from ab715ff to 17651e2 (August 5, 2025 07:08)
@rudolfix (Collaborator) left a comment:

Some changes needed.

@anuunchin force-pushed the feat/1153-drop-column-sync branch 3 times, most recently from 354680c to 85448dc (September 2, 2025 07:43)
@anuunchin force-pushed the feat/1153-drop-column-sync branch from 3e0e5ca to b1c3f3f (September 5, 2025 07:22)
@anuunchin force-pushed the feat/1153-drop-column-sync branch from b1c3f3f to d2ad80e (September 5, 2025 11:11)
Comment on lines +511 to +518

```python
logger.warning(
    f"Table '{table_name}' does not use a table format and does not support"
    " true schema reflection. Returning column schemas from the dlt"
    " schema, which may be stale if the underlying files were manually"
    " modified."
)
yield (table_name, self.schema.get_table_columns(table_name))
```

@anuunchin (Contributor, Author) commented:

Just realized that for Parquet files we could also just use pyarrow and read the actual metadata 👀, but I still don't think people drop columns from Parquet files...
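For reference, a minimal sketch of that approach (the file path is illustrative): pyarrow reads a Parquet file's schema from the footer without loading any row data.

```python
# Hedged sketch: get the actual column names stored in a Parquet file.
import pyarrow.parquet as pq

schema = pq.read_schema("my_table/1695378921.1.parquet")  # illustrative path
actual_columns = schema.names
```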

@anuunchin requested a review from rudolfix on September 9, 2025 11:23
@anuunchin mentioned this pull request on Sep 16, 2025