Support retrieving clustering information of Delta Lake tables. #27052
Conversation
Reviewer's Guide

This PR adds support for retrieving clustering information from Delta Lake tables behind a new feature flag. It introduces a session/catalog property to toggle visibility, extends the metadata layer and table handles to carry an optional list of clustered columns, implements logic to parse clustering info from the transaction log (via a new utility and an Operation enum), and updates tests and documentation accordingly.

Sequence diagram for retrieving clustered columns from a Delta Lake table:

```mermaid
sequenceDiagram
    participant Session
    participant DeltaLakeMetadata
    participant TransactionLogAccess
    participant TableSnapshot
    participant ClusteringMetadataUtil
    Session->>DeltaLakeMetadata: getTableHandle(session, ...)
    DeltaLakeMetadata->>TransactionLogAccess: getClusteredColumns(fileSystem, tableSnapshot)
    TransactionLogAccess->>TableSnapshot: getCachedClusteredColumns()
    alt Not cached
        TransactionLogAccess->>ClusteringMetadataUtil: getLatestClusteredColumns(fileSystem, tableSnapshot)
        ClusteringMetadataUtil-->>TransactionLogAccess: clusteredColumns
        TransactionLogAccess->>TableSnapshot: setCachedClusteredColumns(clusteredColumns)
    end
    TransactionLogAccess-->>DeltaLakeMetadata: clusteredColumns
    DeltaLakeMetadata-->>Session: LocatedTableHandle(clusteredColumns)
```
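The cache-or-compute flow in the diagram can be sketched roughly as follows. `SnapshotSketch` and its method signature are simplified stand-ins for the PR's `TableSnapshot`/`TransactionLogAccess`, not the actual Trino classes:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Simplified stand-in for TableSnapshot: the clustered columns are parsed
// from the transaction log at most once, then served from the cache.
class SnapshotSketch
{
    // null means "not loaded yet"; Optional.empty() means "table is not clustered"
    private Optional<List<String>> cachedClusteredColumns;

    Optional<List<String>> getClusteredColumns(Supplier<Optional<List<String>>> loader)
    {
        if (cachedClusteredColumns == null) {
            // Cache miss: parse the transaction log (ClusteringMetadataUtil in the PR)
            cachedClusteredColumns = loader.get();
        }
        return cachedClusteredColumns;
    }
}
```

The distinction between "not loaded yet" and "loaded, but not clustered" is why the cached field is nullable rather than initialized to `Optional.empty()`.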
ER diagram for Delta Lake table properties with clustering info:

```mermaid
erDiagram
    DELTA_LAKE_TABLE_PROPERTIES {
        string location
        list partitioned_by
        list clustered_by
        long checkpoint_interval
        string change_data_feed_enabled
        string column_mapping_mode
    }
    DELTA_LAKE_TABLE_HANDLE {
        string location
        object metadata_entry
        object protocol_entry
        list clustered_columns
    }
    DELTA_LAKE_TABLE_PROPERTIES ||--|| DELTA_LAKE_TABLE_HANDLE : "table handle for properties"
```
Class diagram for Delta Lake clustering metadata support:

```mermaid
classDiagram
    class DeltaLakeConfig {
        - boolean showClusteredColumns
        + boolean isShowClusteredColumns()
        + DeltaLakeConfig setShowClusteredColumns(boolean)
    }
    class DeltaLakeSessionProperties {
        + static boolean ifShowClusteredColumns(ConnectorSession)
    }
    class DeltaLakeTableProperties {
        + static final String CLUSTER_BY_PROPERTY
        + static List~String~ getClusteredBy(Map~String, Object~)
    }
    class DeltaLakeTableHandle {
        - Optional~List~String~~ clusteredColumns
        + Optional~List~String~~ getClusteredColumns()
    }
    class TableSnapshot {
        - Optional~List~String~~ cachedClusteredColumns
        + Optional~List~String~~ getCachedClusteredColumns()
        + void setCachedClusteredColumns(Optional~List~String~~)
    }
    class TransactionLogAccess {
        + Optional~List~String~~ getClusteredColumns(TrinoFileSystem, TableSnapshot)
    }
    class ClusteringMetadataUtil {
        + static Optional~List~String~~ getLatestClusteredColumns(TrinoFileSystem, TableSnapshot)
    }
    class Operation {
        <<enum>>
        + static Operation fromString(String)
    }
    DeltaLakeConfig --> DeltaLakeSessionProperties
    DeltaLakeSessionProperties --> DeltaLakeTableProperties
    DeltaLakeTableProperties --> DeltaLakeTableHandle
    DeltaLakeTableHandle --> TableSnapshot
    TableSnapshot --> TransactionLogAccess
    TransactionLogAccess --> ClusteringMetadataUtil
    ClusteringMetadataUtil --> Operation
```
Hey there - I've reviewed your changes and they look great!
## Individual Comments
### Comment 1
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java:94` </location>
<code_context>
+ REPLACE_TABLE_KEYWORD, CLUSTERING_PARAMETER_KEY,
+ CLUSTER_BY, NEW_CLUSTERING_PARAMETER_KEY);
+
+ private static final ThreadLocal<Map<String, String>> OLD_TO_NEW_RENAMED_COLUMNS = ThreadLocal.withInitial(HashMap::new);
+
+ private ClusteringMetadataUtil()
</code_context>
<issue_to_address>
**issue (bug_risk):** ThreadLocal usage for OLD_TO_NEW_RENAMED_COLUMNS may leak memory if not cleared in all code paths.
If an exception occurs before ThreadLocal removal, it may not be cleared. Use a try-finally block to guarantee cleanup and prevent memory leaks.
</issue_to_address>
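The fix the comment asks for can be sketched as below. `RenameTracking` and `resolveWithCleanup` are hypothetical simplifications of the PR's `ClusteringMetadataUtil`, shown only to illustrate the try-finally pattern:

```java
import java.util.HashMap;
import java.util.Map;

// Whatever happens while the per-thread rename map is in use, remove()
// runs in the finally block, so the ThreadLocal never outlives the call.
class RenameTracking
{
    private static final ThreadLocal<Map<String, String>> OLD_TO_NEW_RENAMED_COLUMNS =
            ThreadLocal.withInitial(HashMap::new);

    static String resolveWithCleanup(String column, Map<String, String> renames)
    {
        try {
            OLD_TO_NEW_RENAMED_COLUMNS.get().putAll(renames);
            return OLD_TO_NEW_RENAMED_COLUMNS.get().getOrDefault(column, column);
        }
        finally {
            OLD_TO_NEW_RENAMED_COLUMNS.remove(); // guaranteed cleanup, even on exceptions
        }
    }
}
```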
### Comment 2
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java:249` </location>
<code_context>
+ }
+
+ @VisibleForTesting
+ static void recordRenamedColumns(CommitInfoEntry commitInfoEntry)
+ {
+ String oldName = commitInfoEntry.operationParameters().get(RENAMED_OLD_COLUMN_KEY);
</code_context>
<issue_to_address>
**suggestion:** The logic for updating OLD_TO_NEW_RENAMED_COLUMNS may not handle multiple renames correctly.
The current approach may lose information if a column is renamed multiple times. Please consider tracking all previous names to ensure the mapping remains accurate.
</issue_to_address>
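One way to keep the mapping accurate across chained renames (a renamed to b, then b to c) is to always key the map by the original name and rewrite existing values that point at the old name. This is a sketch of that idea, not the PR's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Maps each ORIGINAL column name to its CURRENT name, surviving chains.
class RenameChain
{
    private final Map<String, String> originalToCurrent = new HashMap<>();

    void recordRename(String oldName, String newName)
    {
        // If an earlier rename already produced oldName, extend that chain
        // instead of starting a new entry.
        for (Map.Entry<String, String> entry : originalToCurrent.entrySet()) {
            if (entry.getValue().equals(oldName)) {
                entry.setValue(newName);
                return;
            }
        }
        originalToCurrent.put(oldName, newName);
    }

    String currentName(String originalName)
    {
        return originalToCurrent.getOrDefault(originalName, originalName);
    }
}
```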
### Comment 3
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/Operation.java:96` </location>
<code_context>
+ CREATE_TABLE_KEYWORD.getOperationName().toLowerCase(), CREATE_TABLE_KEYWORD,
+ REPLACE_TABLE_KEYWORD.getOperationName().toLowerCase(), REPLACE_TABLE_KEYWORD);
+
+ public static Operation fromString(String operationName)
+ {
+ Operation operation = LOWERCASE_NAME_TO_OPERATION.get(operationName.toLowerCase());
</code_context>
<issue_to_address>
**suggestion:** fromString may return UNKNOW_OPERATION for valid but differently-cased or formatted operation names.
Currently, only exact matches are supported, so inputs with extra whitespace or formatting may not be recognized. Consider trimming whitespace or using regex to improve matching robustness.
</issue_to_address>
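The normalization the comment suggests (trimming and collapsing whitespace before the case-insensitive lookup) could look like this. The enum constants and names here are illustrative, not the PR's actual `Operation` enum:

```java
import java.util.Locale;
import java.util.Map;

// A more forgiving lookup: trim, collapse inner whitespace runs, and
// lower-case the input before matching against the known operations.
enum OperationSketch
{
    CREATE_TABLE, REPLACE_TABLE, UNKNOWN;

    private static final Map<String, OperationSketch> LOOKUP = Map.of(
            "create table", CREATE_TABLE,
            "replace table", REPLACE_TABLE);

    static OperationSketch fromString(String operationName)
    {
        String normalized = operationName.trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ENGLISH);
        return LOOKUP.getOrDefault(normalized, UNKNOWN);
    }
}
```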
### Comment 4
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeBasic.java:2735-2736` </location>
<code_context>
assertQuery("SELECT * FROM " + sourceTable, sourceTableValues);
}
+ @Test
+ void testShowCreateTableWithClusteredInfo()
+ {
+ Session session = Session.builder(getSession())
</code_context>
<issue_to_address>
**suggestion (testing):** Test for clustered columns in SHOW CREATE TABLE covers both enabled and disabled states.
Please also add a test case for the default configuration (without setting the session property) to confirm expected default behavior.
</issue_to_address>
Is this PR preparatory work for future performance improvements, with no immediate benefit?
Hi @ebyhr |
Could you please share the final solution (= how to improve performance eventually)?
Sure @ebyhr
In simple terms, compared with partitions, Liquid Clustering provides:
Compared with Parquet column statistics, which only store per-file min/max values without any global organization, Liquid Clustering provides a higher-level layout strategy. It ensures that values of clustering keys are physically localized across files, making Parquet's per-file statistics far more effective for data skipping and reducing the number of files scanned. Once Trino integrates with this metadata, it will be able to perform more accurate file pruning and data skipping, enabling more flexible data organization and clustering, and thereby significantly improving the performance of queries filtering on clustering keys.
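The data-skipping benefit described above can be illustrated with a toy min/max pruning check. With a clustered layout, the value ranges of different files barely overlap, so most files fail the range test and can be skipped without being read. This is purely illustrative, not connector code:

```java
import java.util.List;

// Per-file min/max statistics for a single clustering key column.
record FileStats(String path, long min, long max) {}

class PruningSketch
{
    // Keep only the files whose [min, max] range can contain the predicate value.
    static List<FileStats> prune(List<FileStats> files, long value)
    {
        return files.stream()
                .filter(f -> f.min() <= value && value <= f.max())
                .toList();
    }
}
```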
Thanks for explaining the details. I already know about liquid clustering. I wanted to know the actual follow-up plan (especially the read part) for the Delta Lake connector.
The connector already uses stats from the transaction logs. Are you planning to read different metadata in the future? If so, could you elaborate on that?
@yangshangqing95 Would you mind sharing the plan for supporting the write path of liquid clustering? I suspect the read path won't change much, since Trino already supports pruning with statistics, so reading clustering information doesn't seem to provide additional benefit, in my opinion.
Haha, please ignore my long-winded message above.
Coming back to this PR itself: it's mostly about test data and test code, and I hope to cover enough cases. My idea is to break down the larger system into smaller modules or features, which makes it easier to review and identify issues during testing. Open to any discussions or suggestions.
Description
Support retrieving clustering information from Delta Lake tables, controlled by session and configuration settings.
By default, retrieving clustering information is disabled (false).
This serves as a foundation for future integration with the Delta Lake Liquid Clustering feature.
The work to fully support Delta Lake’s Liquid Clustering capability is already planned and in progress.
Additional context and related issues
About Delta Lake Liquid Clustering: https://delta.io/blog/liquid-clustering/
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
Summary by Sourcery
Enable optional retrieval of clustering information for Delta Lake tables and expose it as a new table property to support future Liquid Clustering integration.