Conversation

@yangshangqing95
Member

@yangshangqing95 yangshangqing95 commented Oct 22, 2025

Description

Support retrieving clustering information from Delta Lake tables, controlled by a session property and a catalog configuration property.
Retrieving clustering information is disabled by default.
This serves as a foundation for future integration with the Delta Lake Liquid Clustering feature.
The work to fully support Delta Lake’s Liquid Clustering capability is already planned and in progress.

Additional context and related issues

About Delta Lake Liquid Clustering: https://delta.io/blog/liquid-clustering/

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Delta Lake
* Add support for retrieving clustering information from Delta Lake tables.

Summary by Sourcery

Enable optional retrieval of clustering information for Delta Lake tables and expose it as a new table property to support future Liquid Clustering integration.

New Features:

  • Introduce delta.show-clustered-columns session and catalog property to control retrieval of clustered columns
  • Add clustered_by table property and include it in SHOW CREATE TABLE when enabled
  • Implement clustering metadata extraction via ClusteringMetadataUtil reading Delta Lake transaction logs

Enhancements:

  • Propagate clustering columns through DeltaLakeTableHandle, metadata, and properties handling
  • Cache clustered columns in TableSnapshot and integrate with TransactionLogAccess

Documentation:

  • Document the new delta.show-clustered-columns property in connector documentation

Tests:

  • Add unit and integration tests for clustering metadata utilities, configuration mapping, session properties, and SHOW CREATE TABLE behavior

@cla-bot cla-bot bot added the cla-signed label Oct 22, 2025
@sourcery-ai

sourcery-ai bot commented Oct 22, 2025

Reviewer's Guide

This PR adds support for retrieving clustering information from Delta Lake tables behind a new feature flag. It introduces a session/catalog property to toggle visibility, extends the metadata layer and table handles to carry an optional List of clustered columns, implements logic to parse clustering info from the transaction log (via a new utility and an Operation enum), and updates tests and documentation accordingly.

Sequence diagram for retrieving clustered columns from Delta Lake table

sequenceDiagram
    participant Session
    participant DeltaLakeMetadata
    participant TransactionLogAccess
    participant TableSnapshot
    participant ClusteringMetadataUtil

    Session->>DeltaLakeMetadata: getTableHandle(session, ...)
    DeltaLakeMetadata->>TransactionLogAccess: getClusteredColumns(fileSystem, tableSnapshot)
    TransactionLogAccess->>TableSnapshot: getCachedClusteredColumns()
    alt Not cached
        TransactionLogAccess->>ClusteringMetadataUtil: getLatestClusteredColumns(fileSystem, tableSnapshot)
        ClusteringMetadataUtil-->>TransactionLogAccess: clusteredColumns
        TransactionLogAccess->>TableSnapshot: setCachedClusteredColumns(clusteredColumns)
    end
    TransactionLogAccess-->>DeltaLakeMetadata: clusteredColumns
    DeltaLakeMetadata-->>Session: LocatedTableHandle(clusteredColumns)
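The cache-on-first-read flow in the diagram can be sketched in plain Java. Class and method names mirror the diagram, but these are simplified stand-ins, not the connector's real classes:

```java
import java.util.List;
import java.util.Optional;

public class ClusteredColumnsCacheSketch
{
    public static class TableSnapshot
    {
        private Optional<List<String>> cachedClusteredColumns = Optional.empty();

        public Optional<List<String>> getCachedClusteredColumns()
        {
            return cachedClusteredColumns;
        }

        public void setCachedClusteredColumns(Optional<List<String>> columns)
        {
            cachedClusteredColumns = columns;
        }
    }

    public static int logReads; // counts simulated transaction-log walks

    // Stand-in for ClusteringMetadataUtil.getLatestClusteredColumns
    public static Optional<List<String>> readFromTransactionLog()
    {
        logReads++;
        return Optional.of(List.of("c1", "c2"));
    }

    // Stand-in for TransactionLogAccess.getClusteredColumns. Note one
    // simplification: an empty Optional doubles as "not yet computed", so a
    // genuinely unclustered table would be re-read on every call here.
    public static Optional<List<String>> getClusteredColumns(TableSnapshot snapshot)
    {
        if (snapshot.getCachedClusteredColumns().isEmpty()) {
            snapshot.setCachedClusteredColumns(readFromTransactionLog());
        }
        return snapshot.getCachedClusteredColumns();
    }

    public static void main(String[] args)
    {
        TableSnapshot snapshot = new TableSnapshot();
        getClusteredColumns(snapshot);
        getClusteredColumns(snapshot); // second call is served from the cache
        System.out.println(logReads); // 1
    }
}
```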

ER diagram for Delta Lake table properties with clustering info

erDiagram
    DELTA_LAKE_TABLE_PROPERTIES {
        string location
        list partitioned_by
        list clustered_by
        long checkpoint_interval
        string change_data_feed_enabled
        string column_mapping_mode
    }
    DELTA_LAKE_TABLE_HANDLE {
        string location
        object metadata_entry
        object protocol_entry
        list clustered_columns
    }
    DELTA_LAKE_TABLE_PROPERTIES ||--|| DELTA_LAKE_TABLE_HANDLE : "table handle for properties"

Class diagram for Delta Lake clustering metadata support

classDiagram
    class DeltaLakeConfig {
        - boolean showClusteredColumns
        + boolean isShowClusteredColumns()
        + DeltaLakeConfig setShowClusteredColumns(boolean)
    }
    class DeltaLakeSessionProperties {
        + static boolean ifShowClusteredColumns(ConnectorSession)
    }
    class DeltaLakeTableProperties {
        + static final String CLUSTER_BY_PROPERTY
        + static List<String> getClusteredBy(Map<String, Object>)
    }
    class DeltaLakeTableHandle {
        - Optional<List<String>> clusteredColumns
        + Optional<List<String>> getClusteredColumns()
    }
    class TableSnapshot {
        - Optional<List<String>> cachedClusteredColumns
        + Optional<List<String>> getCachedClusteredColumns()
        + void setCachedClusteredColumns(Optional<List<String>>)
    }
    class TransactionLogAccess {
        + Optional<List<String>> getClusteredColumns(TrinoFileSystem, TableSnapshot)
    }
    class ClusteringMetadataUtil {
        + static Optional<List<String>> getLatestClusteredColumns(TrinoFileSystem, TableSnapshot)
    }
    class Operation {
        <<enum>>
        + static Operation fromString(String)
    }

    DeltaLakeConfig --> DeltaLakeSessionProperties
    DeltaLakeSessionProperties --> DeltaLakeTableProperties
    DeltaLakeTableProperties --> DeltaLakeTableHandle
    DeltaLakeTableHandle --> TableSnapshot
    TableSnapshot --> TransactionLogAccess
    TransactionLogAccess --> ClusteringMetadataUtil
    ClusteringMetadataUtil --> Operation

File-Level Changes

Change Details Files
Feature flag for showing clustered columns
  • Add showClusteredColumns field with @Config setter in DeltaLakeConfig
  • Introduce SHOW_CLUSTERED_COLUMNS session property and ifShowClusteredColumns helper
  • Update TestDeltaLakeConfig to cover default and explicit mappings
  • Document delta.show-clustered-columns in sphinx connector docs
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeConfig.java
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSessionProperties.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeConfig.java
docs/src/main/sphinx/connector/delta-lake.md
Extend metadata and table handle to carry clusteredColumns
  • Retrieve clusteredColumns in DeltaLakeMetadata.getTableHandle when enabled and protocol supports clustering
  • Include clusteredColumns property in getTableMetadata output
  • Add clusteredColumns field, JSON annotations, constructors, equals/hashCode in DeltaLakeTableHandle
  • Define CLUSTER_BY_PROPERTY and getClusteredBy helper in DeltaLakeTableProperties
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeTableHandle.java
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeTableProperties.java
Implement transaction log parsing for clustering info
  • Extend TableSnapshot to cache clustered columns
  • Add TransactionLogAccess.getClusteredColumns method
  • Create ClusteringMetadataUtil to walk commitInfo entries and extract cluster columns
  • Introduce Operation enum to map Delta operations to clustering keys
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogAccess.java
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/Operation.java
Update and add tests to validate clustering info exposure
  • Register CLUSTERED_TABLES and add SHOW CREATE TABLE assertions in TestDeltaLakeBasic
  • Update existing connector tests to include the new clusteredColumns parameter
  • Add TestClusteringMetadataUtil and OperationTest for clustering utility coverage
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeBasic.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeFileBasedTableStatisticsProvider.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeNodeLocalDynamicSplitPruning.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestingDeltaLakeUtils.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeMetadata.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeSplitManager.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestTransactionLogAccess.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/clustering/TestClusteringMetadataUtil.java
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/clustering/OperationTest.java
Expose temporal time-travel parameter
  • Add public static getTemporalTimeTravelLinearSearchMaxSize method in DeltaLakeMetadata
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java
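The feature flag in the first change group follows the usual Trino config-class shape. A minimal stand-in (the real DeltaLakeConfig setter additionally carries airlift's @Config("delta.show-clustered-columns") annotation, omitted here so the sketch stays self-contained):

```java
public class ShowClusteredColumnsConfig
{
    private boolean showClusteredColumns; // disabled by default, matching the PR

    public boolean isShowClusteredColumns()
    {
        return showClusteredColumns;
    }

    public ShowClusteredColumnsConfig setShowClusteredColumns(boolean showClusteredColumns)
    {
        this.showClusteredColumns = showClusteredColumns;
        return this; // fluent setter, as Trino config classes conventionally return
    }
}
```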

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.


@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Oct 22, 2025

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java:94` </location>
<code_context>
+            REPLACE_TABLE_KEYWORD, CLUSTERING_PARAMETER_KEY,
+            CLUSTER_BY, NEW_CLUSTERING_PARAMETER_KEY);
+
+    private static final ThreadLocal<Map<String, String>> OLD_TO_NEW_RENAMED_COLUMNS = ThreadLocal.withInitial(HashMap::new);
+
+    private ClusteringMetadataUtil()
</code_context>

<issue_to_address>
**issue (bug_risk):** ThreadLocal usage for OLD_TO_NEW_RENAMED_COLUMNS may leak memory if not cleared in all code paths.

If an exception occurs before ThreadLocal removal, it may not be cleared. Use a try-finally block to guarantee cleanup and prevent memory leaks.
</issue_to_address>
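A minimal sketch of the suggested try/finally fix. The commit-log walk is stubbed out as a Runnable, and the method shape only loosely follows the excerpt above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public final class ThreadLocalCleanupSketch
{
    private static final ThreadLocal<Map<String, String>> OLD_TO_NEW_RENAMED_COLUMNS =
            ThreadLocal.withInitial(HashMap::new);

    // The real method walks transaction-log entries and may throw mid-walk;
    // here the walk is a Runnable and column extraction is elided.
    public static Optional<List<String>> getLatestClusteredColumns(Runnable walkCommitLog)
    {
        try {
            walkCommitLog.run(); // may populate OLD_TO_NEW_RENAMED_COLUMNS, or throw
            return Optional.empty();
        }
        finally {
            // runs on success and failure alike, so the thread never retains the map
            OLD_TO_NEW_RENAMED_COLUMNS.remove();
        }
    }

    public static boolean threadLocalIsEmpty()
    {
        boolean empty = OLD_TO_NEW_RENAMED_COLUMNS.get().isEmpty();
        OLD_TO_NEW_RENAMED_COLUMNS.remove(); // don't leak the probe's own map
        return empty;
    }

    public static void main(String[] args)
    {
        try {
            getLatestClusteredColumns(() -> {
                OLD_TO_NEW_RENAMED_COLUMNS.get().put("old", "new");
                throw new RuntimeException("simulated failure mid-walk");
            });
        }
        catch (RuntimeException expected) {
            // cleanup already ran in the finally block
        }
        System.out.println(threadLocalIsEmpty()); // true
    }
}
```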

### Comment 2
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java:249` </location>
<code_context>
+    }
+
+    @VisibleForTesting
+    static void recordRenamedColumns(CommitInfoEntry commitInfoEntry)
+    {
+        String oldName = commitInfoEntry.operationParameters().get(RENAMED_OLD_COLUMN_KEY);
</code_context>

<issue_to_address>
**suggestion:** The logic for updating OLD_TO_NEW_RENAMED_COLUMNS may not handle multiple renames correctly.

The current approach may lose information if a column is renamed multiple times. Please consider tracking all previous names to ensure the mapping remains accurate.
</issue_to_address>
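One way to address this, sketched below with hypothetical names: whenever a rename old -> new is recorded, re-point any earlier original name that currently resolves to old, so chained renames (a -> b -> c) collapse to the latest name:

```java
import java.util.HashMap;
import java.util.Map;

public final class RenameChainSketch
{
    private final Map<String, String> oldToNew = new HashMap<>();

    public void recordRename(String oldName, String newName)
    {
        // Collapse chains: if some earlier original name currently resolves
        // to oldName, update it to resolve to newName instead.
        for (Map.Entry<String, String> entry : oldToNew.entrySet()) {
            if (entry.getValue().equals(oldName)) {
                entry.setValue(newName);
            }
        }
        oldToNew.put(oldName, newName);
    }

    public String currentName(String name)
    {
        return oldToNew.getOrDefault(name, name);
    }

    public static void main(String[] args)
    {
        RenameChainSketch renames = new RenameChainSketch();
        renames.recordRename("a", "b");
        renames.recordRename("b", "c");
        System.out.println(renames.currentName("a")); // c
        System.out.println(renames.currentName("b")); // c
    }
}
```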

### Comment 3
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/Operation.java:96` </location>
<code_context>
+            CREATE_TABLE_KEYWORD.getOperationName().toLowerCase(), CREATE_TABLE_KEYWORD,
+            REPLACE_TABLE_KEYWORD.getOperationName().toLowerCase(), REPLACE_TABLE_KEYWORD);
+
+    public static Operation fromString(String operationName)
+    {
+        Operation operation = LOWERCASE_NAME_TO_OPERATION.get(operationName.toLowerCase());
</code_context>

<issue_to_address>
**suggestion:** fromString may return UNKNOW_OPERATION for valid but differently-cased or formatted operation names.

Currently, only exact matches are supported, so inputs with extra whitespace or formatting may not be recognized. Consider trimming whitespace or using regex to improve matching robustness.
</issue_to_address>
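A sketch of the suggested normalization: trim, collapse internal whitespace, and case-fold with a fixed locale before the lookup. The enum constants and UNKNOWN fallback below are illustrative, not the connector's actual Operation enum:

```java
import java.util.Locale;
import java.util.Map;

public enum OperationLookupSketch
{
    CREATE_TABLE("CREATE TABLE"),
    REPLACE_TABLE("REPLACE TABLE"),
    CLUSTER_BY("CLUSTER BY"),
    UNKNOWN("");

    private final String operationName;

    OperationLookupSketch(String operationName)
    {
        this.operationName = operationName;
    }

    private static final Map<String, OperationLookupSketch> LOOKUP = Map.of(
            CREATE_TABLE.operationName.toLowerCase(Locale.ROOT), CREATE_TABLE,
            REPLACE_TABLE.operationName.toLowerCase(Locale.ROOT), REPLACE_TABLE,
            CLUSTER_BY.operationName.toLowerCase(Locale.ROOT), CLUSTER_BY);

    public static OperationLookupSketch fromString(String operationName)
    {
        // Normalize before lookup so padded or oddly spaced inputs still match.
        String normalized = operationName.trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
        return LOOKUP.getOrDefault(normalized, UNKNOWN);
    }

    public static void main(String[] args)
    {
        System.out.println(fromString("  create   table ")); // CREATE_TABLE
        System.out.println(fromString("DROP TABLE"));        // UNKNOWN
    }
}
```

Using Locale.ROOT also sidesteps locale-sensitive case folding (the classic Turkish dotless-i problem), which a bare toLowerCase() is exposed to.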

### Comment 4
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeBasic.java:2735-2736` </location>
<code_context>
         assertQuery("SELECT * FROM " + sourceTable, sourceTableValues);
     }

+    @Test
+    void testShowCreateTableWithClusteredInfo()
+    {
+        Session session = Session.builder(getSession())
</code_context>

<issue_to_address>
**suggestion (testing):** Test for clustered columns in SHOW CREATE TABLE covers both enabled and disabled states.

Please also add a test case for the default configuration (without setting the session property) to confirm expected default behavior.
</issue_to_address>



@yangshangqing95 yangshangqing95 force-pushed the retriev-deltalake-clustering-info branch 2 times, most recently from ef44562 to 9163402 Compare October 22, 2025 18:25
@yangshangqing95 yangshangqing95 force-pushed the retriev-deltalake-clustering-info branch from 9163402 to 660241a Compare October 22, 2025 19:22
@ebyhr
Copy link
Member

ebyhr commented Oct 22, 2025

Is this PR preparatory work for future performance improvements, with no immediate benefit?

@yangshangqing95
Copy link
Member Author

Is this PR preparatory work for future performance improvements, with no immediate benefit?

Hi @ebyhr
From a performance perspective, this PR doesn’t introduce any immediate improvements. But functionally, it allows users to view clustering column information in the SHOW CREATE TABLE output for Delta tables.

@ebyhr
Copy link
Member

ebyhr commented Oct 23, 2025

Could you please share the final solution (= how to improve performance eventually)?

@yangshangqing95
Copy link
Member Author

Sure @ebyhr
The long-term goal is to leverage Delta Lake’s Liquid Clustering to improve query performance once full support is added. Here, full support means enabling Trino to read and write tables with Liquid Clustering, perform optimize operations that reorganize clustered data, and so on.

Liquid Clustering is a flexible data layout mechanism that organizes data based on clustering keys, instead of traditional directory-based partitions. Unlike static partitions (e.g., /country=US/), it stores clustering information in metadata rather than the filesystem, allowing the data layout to evolve dynamically as new data is written. Also, Liquid Clustering is the best practice officially recommended by Delta Lake. You can find more information here: https://delta.io/blog/liquid-clustering/

In simple terms, compared with partitions, Liquid Clustering provides:

  • Flexible organization – data is grouped by clustering key ranges instead of fixed partition directories.

  • Dynamic adjustment – clustering keys can be updated at any time; new files automatically adhere to the updated clustering layout. Existing data can also be reorganized through maintenance operations such as OPTIMIZE, giving the table structure much greater evolution flexibility than static partitions.

  • Lower metadata overhead – no need to maintain thousands of partition directories.

  • Better query pruning – queries filtering on clustering keys can skip large data ranges even when those fields aren’t partitions. In addition, Liquid Clustering can use nested fields (e.g., person.age) as clustering keys. Delta Lake tracks the full field path and its value range in metadata, so even struct subfields can benefit from clustering-based pruning.

Compared with Parquet column statistics, which only store per-file min/max values without any global organization, Liquid Clustering provides a higher-level layout strategy. It ensures that values of clustering keys are physically localized across files, making Parquet’s per-file statistics far more effective for data skipping and reducing the number of files scanned.
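The interaction with per-file min/max stats can be illustrated with a toy pruning function; the file names and ranges below are invented for the example:

```java
import java.util.List;

public final class MinMaxPruningSketch
{
    public record FileStats(String path, long min, long max) {}

    // Keep only files whose [min, max] range can contain the predicate value.
    public static List<FileStats> prune(List<FileStats> files, long value)
    {
        return files.stream()
                .filter(f -> f.min() <= value && value <= f.max())
                .toList();
    }

    public static void main(String[] args)
    {
        // Clustered layout: each file covers a narrow, mostly disjoint key range.
        List<FileStats> clustered = List.of(
                new FileStats("f0", 0, 99),
                new FileStats("f1", 100, 199),
                new FileStats("f2", 200, 299));
        // Unclustered layout: every file spans nearly the whole key range,
        // so min/max stats cannot rule anything out.
        List<FileStats> unclustered = List.of(
                new FileStats("g0", 0, 290),
                new FileStats("g1", 5, 299),
                new FileStats("g2", 1, 295));

        System.out.println(prune(clustered, 150).size());   // 1
        System.out.println(prune(unclustered, 150).size()); // 3
    }
}
```

The statistics are identical in kind in both layouts; clustering is what makes them selective.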

Once Trino integrates with this metadata, it will be able to perform more accurate file pruning and data skipping, enabling more flexible data organization and clustering, thereby significantly improving the performance of queries filtering on clustering keys.

@ebyhr
Copy link
Member

ebyhr commented Oct 23, 2025

Thanks for explaining the details. I already know about liquid clustering. I wanted to know the actual follow-up plan (especially read part) for the Delta Lake connector.

Once Trino integrates with this metadata, it will be able to perform more accurate file pruning and data skipping, enabling more flexible data organization and clustering, thereby significantly improving the performance of queries filtering on clustering keys.

The connector already uses stats from the transaction logs. Are you planning to read different metadata in the future? If so, could you elaborate on that?

@chenjian2664
Copy link
Contributor

@yangshangqing95 Would you mind sharing the plan for supporting the write path of liquid clustering? I suspect the read path won’t change much, since Trino already supports pruning with statistics, so reading clustering information doesn’t seem to provide additional benefit in my opinion

@yangshangqing95
Copy link
Member Author

Haha, please ignore my long-winded message above.
Hi @ebyhr @chenjian2664

  1. Regarding the read performance improvements: in my actual use cases, many Delta Lake tables choose a clustering field that is a scalar nested inside a struct, where the root field itself is a large struct. Delta Lake can record subfield statistics in its stats, but Trino currently cannot push down subfield predicates during filtering, so we are unable to leverage those stats for effective pruning — this is the problem we’re currently facing. Once we can obtain the clustering column information, implementing pushdown for such clustered fields shouldn’t be difficult, and I can follow up on that later.

  2. Regarding where stats are stored, yes — according to the Delta Lake protocol, they are still recorded in the AddFile entry, and that hasn’t changed.

  3. Regarding write-side support for Liquid Clustering, some key points are:

    1. Support collecting statistics for specified columns and fields.
    2. Implement the ability to compute clustering values for fields (e.g., values based on Hilbert curves).
    3. Before writing files, introduce clustering operators to handle these tasks (extraction, computation, classification, sorting, etc.). If clustering is enabled, these operators should be applied. Optionally, intermediate results can be spilled to disk — fortunately, we already have this capability and can reuse it.
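Point 2 above can be illustrated with the simpler Z-order (Morton) bit interleaving, a stand-in for the Hilbert-curve computation mentioned; Hilbert curves preserve locality better but are considerably more involved to implement:

```java
public final class ZOrderSketch
{
    // Interleave the low 16 bits of x and y into a single 32-bit Morton code:
    // bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    public static long interleave(int x, int y)
    {
        long code = 0;
        for (int i = 0; i < 16; i++) {
            code |= (long) ((x >> i) & 1) << (2 * i);
            code |= (long) ((y >> i) & 1) << (2 * i + 1);
        }
        return code;
    }

    public static void main(String[] args)
    {
        // Rows close in (x, y) get close Morton codes, so sorting by the code
        // before writing co-locates them in the same files.
        System.out.println(interleave(0, 0)); // 0
        System.out.println(interleave(1, 0)); // 1
        System.out.println(interleave(0, 1)); // 2
        System.out.println(interleave(1, 1)); // 3
    }
}
```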

Coming back to this PR itself — it’s mostly test data and test code, and I hope it covers enough cases. My idea is to break the larger system down into smaller modules or features, which makes it easier to review and to identify issues during testing.

Open to any discussions or suggestions.

