@yangshangqing95 yangshangqing95 commented Oct 25, 2025

Description

For cloned Delta Lake tables (either deep or shallow clones), the checkpoint version may start at 0.
The previous validation in the CheckpointMetadataEntry constructor required the version to be positive,
which caused the following exception:

Query xxxx failed: Unable to parse transaction log entry: {"checkpointMetadata":{"version":0, ....}}

Root cause is:

com.fasterxml.jackson.databind.exc.ValueInstantiationException: 
Cannot construct instance of `io.trino.plugin.deltalake.transactionlog.CheckpointMetadataEntry`,
problem: version is not positive: 0
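
The relaxed check described above can be illustrated with a minimal, self-contained sketch. This is not the Trino source: the class name `CheckpointVersionSketch`, the nested `Entry` record, and the error message text are all hypothetical stand-ins, and plain Java is used in place of whatever precondition helper the real class relies on. It only shows the boundary behavior the fix is after: version 0 is accepted, negative versions are still rejected.

```java
import java.util.Map;
import java.util.Optional;

public class CheckpointVersionSketch
{
    // Hypothetical stand-in for CheckpointMetadataEntry; not the actual Trino class.
    record Entry(long version, Optional<Map<String, String>> tags)
    {
        Entry
        {
            // Relaxed check: zero is valid, only negative versions are rejected.
            // The message text here is an assumption, not the real one.
            if (version < 0) {
                throw new IllegalArgumentException("version is negative: " + version);
            }
            // Defensive immutable copy of the tags, when present.
            tags = tags.map(Map::copyOf);
        }
    }

    public static void main(String[] args)
    {
        // A cloned table may carry checkpoint version 0; this now constructs fine.
        System.out.println(new Entry(0, Optional.empty()).version());

        try {
            new Entry(-1, Optional.empty());
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the earlier `version > 0` style of check, the first construction would have failed in the same way the JSON deserialization did for cloned tables.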

Additional context and related issues

Fixes #27097

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Fix failure when reading deep or shallow cloned Delta Lake tables with checkpoint version 0.

Summary by Sourcery

Allow checkpoint metadata version zero for Delta Lake clones and add comprehensive tests for CheckpointMetadataEntry.

Bug Fixes:

  • Relax the version validation in CheckpointMetadataEntry to accept zero instead of requiring positive values

Enhancements:

  • Change the version check from > 0 to >= 0 to support deep and shallow cloned Delta Lake tables

Tests:

  • Add TestCheckpointMetadataEntry to verify JSON serialization/deserialization and validation for versions 0, positive, and negative

@cla-bot cla-bot bot added the cla-signed label Oct 25, 2025
sourcery-ai bot commented Oct 25, 2025

Reviewer's Guide

This PR adjusts the checkpoint metadata validation to accept version 0 on Delta Lake table clones and supplements it with comprehensive unit tests covering valid, zero, and negative version scenarios as well as JSON serialization.

Class diagram for updated CheckpointMetadataEntry validation

classDiagram
    class CheckpointMetadataEntry {
        +long version
        +Optional<Map<String, String>> tags
        +CheckpointMetadataEntry(long version, Optional<Map<String, String>> tags)
    }
    CheckpointMetadataEntry : version >= 0 validation
    CheckpointMetadataEntry : tags are copied as ImmutableMap

Class diagram for new TestCheckpointMetadataEntry unit tests

classDiagram
    class TestCheckpointMetadataEntry {
        +testValidVersion()
        +testZeroVersion()
        +testNegativeVersionThrows()
        +testJsonSerialization()
    }
    TestCheckpointMetadataEntry --> CheckpointMetadataEntry

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Relax version validation in CheckpointMetadataEntry to accept version 0 | Change precondition from version > 0 to version >= 0; update validation error message to reflect only negative versions | plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/CheckpointMetadataEntry.java |
| Add unit tests covering boundary cases for checkpoint metadata | Test deserialization with version 0 alongside existing positive cases; verify failure on negative version and missing tags; test JSON serialization output formatting | plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestCheckpointMetadataEntry.java |

@github-actions github-actions bot added the delta-lake Delta Lake connector label Oct 25, 2025
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestCheckpointMetadataEntry.java:55-64` </location>
<code_context>
+    }
+
+    @Test
+    void testInvalidCheckpointMetadataEntry()
+    {
+        @Language("JSON")
+        String jsonWithNegativeVersion = "{\"version\":-1,\"tags\":{\"sidecarNumActions\":\"1\",\"sidecarSizeInBytes\":\"20965\",\"numOfAddFiles\":\"1\",\"sidecarFileSchema\":\"\"}}";
+        assertThatThrownBy(() -> codec.fromJson(jsonWithNegativeVersion))
+                .isInstanceOf(IllegalArgumentException.class)
+                .hasMessageContaining("Invalid JSON string for");
+
+        @Language("JSON")
+        String jsonWithoutTags = "{\"version\":-1}";
+        assertThatThrownBy(() -> codec.fromJson(jsonWithoutTags))
+                .isInstanceOf(IllegalArgumentException.class)
</code_context>

<issue_to_address>
**suggestion (testing):** Missing test for valid CheckpointMetadataEntry with absent 'tags' field.

Please add a test for deserializing a valid CheckpointMetadataEntry with a non-negative version and no 'tags' field to confirm correct handling of this case.
</issue_to_address>

### Comment 2
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestCheckpointMetadataEntry.java:70-89` </location>
<code_context>
+    }
+
+    @Test
+    void testCheckpointMetadataEntryToJson()
+    {
+        assertThat(codec.toJson(new CheckpointMetadataEntry(
+                100,
+                Optional.of(ImmutableMap.of(
+                        "sidecarNumActions", "1",
+                        "sidecarSizeInBytes", "20965",
+                        "numOfAddFiles", "1",
+                        "sidecarFileSchema", "")))))
+                .isEqualTo("{\n" +
+                        "  \"version\" : 100,\n" +
+                        "  \"tags\" : {\n" +
+                        "    \"sidecarNumActions\" : \"1\",\n" +
+                        "    \"sidecarSizeInBytes\" : \"20965\",\n" +
+                        "    \"numOfAddFiles\" : \"1\",\n" +
+                        "    \"sidecarFileSchema\" : \"\"\n" +
+                        "  }\n" +
+                        "}");
+    }
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding a test for serialization with version 0 and absent 'tags'.

Please add a test case for serializing a CheckpointMetadataEntry with version 0 and no 'tags', and verify the resulting JSON structure.

```suggestion
    @Test
    void testCheckpointMetadataEntryToJson()
    {
        assertThat(codec.toJson(new CheckpointMetadataEntry(
                100,
                Optional.of(ImmutableMap.of(
                        "sidecarNumActions", "1",
                        "sidecarSizeInBytes", "20965",
                        "numOfAddFiles", "1",
                        "sidecarFileSchema", "")))))
                .isEqualTo("{\n" +
                        "  \"version\" : 100,\n" +
                        "  \"tags\" : {\n" +
                        "    \"sidecarNumActions\" : \"1\",\n" +
                        "    \"sidecarSizeInBytes\" : \"20965\",\n" +
                        "    \"numOfAddFiles\" : \"1\",\n" +
                        "    \"sidecarFileSchema\" : \"\"\n" +
                        "  }\n" +
                        "}");
    }

    @Test
    void testCheckpointMetadataEntryToJsonWithVersionZeroAndNoTags()
    {
        assertThat(codec.toJson(new CheckpointMetadataEntry(
                0,
                Optional.empty())))
                .isEqualTo("{\n" +
                        "  \"version\" : 0\n" +
                        "}");
    }
```
</issue_to_address>


@@ -0,0 +1,90 @@
/*
Member

@ebyhr ebyhr Oct 25, 2025

We prefer query based tests to unit tests in this repository.
Please update existing integration tests or product tests instead.

Member Author

Sure, integration tests updated

Member

ebyhr commented Oct 25, 2025

Fix failure when reading deep or shallow cloned Delta Lake tables.

https://trino.io/development/process#pull-request-and-commit-guidelines

  • Do not end the subject line with a period.

@yangshangqing95 yangshangqing95 changed the title Fix failure when reading deep or shallow cloned Delta Lake tables. Fix failure when reading deep or shallow cloned Delta Lake tables Oct 25, 2025
@yangshangqing95 yangshangqing95 force-pushed the fix-read-cloned-delta-lake-table-error branch from 14dd843 to d1593a9 Compare October 26, 2025 00:31
@yangshangqing95
Member Author

Fix failure when reading deep or shallow cloned Delta Lake tables.

https://trino.io/development/process#pull-request-and-commit-guidelines

  • Do not end the subject line with a period.

Fixed

throws Exception
{
testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v1/clone_source");
testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/cloned_table");
Contributor
Typo: should this be databricks154/clone_checkpoint_version_zero/checkpoint_v1/cloned_table?

Member Author

Yes thanks! fixed

{
testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v1/clone_source");
testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/cloned_table");
testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/clone_source");
Contributor

I think we could remove the source table case; testing the cloned table is enough.

Contributor

I just tested locally, and the v1 test case does not seem to exercise this logic.

Member Author

Yeah, that makes sense: the v1 checkpoint file is in Parquet format, so it won't go through this JSON deserialization logic. I was just testing v1 incidentally to see if there were any issues. If it's not needed, I can remove it.

Member Author

I think we may just keep it, there’s no harm in it.

Contributor

There is no need to add the v1 case here, since CheckpointMetadata is only allowed in the v2 spec: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoint-metadata

Member Author

sure, updated

@yangshangqing95 yangshangqing95 force-pushed the fix-read-cloned-delta-lake-table-error branch 2 times, most recently from 199baac to 3be6244 Compare October 28, 2025 15:49
@chenjian2664
Contributor

Fix failure when reading deep or shallow cloned Delta Lake tables.

https://trino.io/development/process#pull-request-and-commit-guidelines

  • Do not end the subject line with a period.

reminder

Contributor

@chenjian2664 chenjian2664 left a comment

Looks good

copyDirectoryContents(new File(Resources.getResource(resourceName).toURI()).toPath(), tableLocation);
assertUpdate("CALL system.register_table(CURRENT_SCHEMA, '%s', '%s')".formatted(tableName, tableLocation.toUri()));

assertThat(query("SELECT * FROM " + tableName + " ORDER BY id")).matches("VALUES " +
Contributor

nit: move .matches to new line

Member Author

done

testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/cloned_table");
}

private void testClonedTableWithCheckpointVersionZero(String resourceName)
Contributor

The method now seems redundant; we could remove it and put the code directly under testClonedTableWithCheckpointVersionZero().

Member Author

done

@yangshangqing95 yangshangqing95 force-pushed the fix-read-cloned-delta-lake-table-error branch from 3be6244 to 78997c3 Compare October 29, 2025 14:25
Data generated using Databricks 15.4:

```sql
CREATE TABLE cloned_table DEEP CLONE source_table;
```
Contributor

Can we also add a case for shallow clone?

Labels

cla-signed delta-lake Delta Lake connector

Development

Successfully merging this pull request may close these issues.

Failure when reading deep or shallow cloned Delta Lake tables

3 participants