- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.4k
Fix failure when reading deep or shallow cloned Delta Lake tables #27098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix failure when reading deep or shallow cloned Delta Lake tables #27098
Conversation
| Reviewer's GuideThis PR adjusts the checkpoint metadata validation to accept version 0 on Delta Lake table clones and supplements it with comprehensive unit tests covering valid, zero, and negative version scenarios as well as JSON serialization. Class diagram for updated CheckpointMetadataEntry validationclassDiagram
    class CheckpointMetadataEntry {
        +long version
        +Optional<Map<String, String>> tags
        +CheckpointMetadataEntry(long version, Optional<Map<String, String>> tags)
    }
    CheckpointMetadataEntry : version >= 0 validation
    CheckpointMetadataEntry : tags are copied as ImmutableMap
Class diagram for new TestCheckpointMetadataEntry unit testsclassDiagram
    class TestCheckpointMetadataEntry {
        +testValidVersion()
        +testZeroVersion()
        +testNegativeVersionThrows()
        +testJsonSerialization()
    }
    TestCheckpointMetadataEntry --> CheckpointMetadataEntry
File-Level Changes
 Possibly linked issues
 Tips and commandsInteracting with Sourcery
 Customizing Your ExperienceAccess your dashboard to: 
 Getting Help
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestCheckpointMetadataEntry.java:55-64` </location>
<code_context>
+    }
+
+    @Test
+    void testInvalidCheckpointMetadataEntry()
+    {
+        @Language("JSON")
+        String jsonWithNegativeVersion = "{\"version\":-1,\"tags\":{\"sidecarNumActions\":\"1\",\"sidecarSizeInBytes\":\"20965\",\"numOfAddFiles\":\"1\",\"sidecarFileSchema\":\"\"}}";
+        assertThatThrownBy(() -> codec.fromJson(jsonWithNegativeVersion))
+                .isInstanceOf(IllegalArgumentException.class)
+                .hasMessageContaining("Invalid JSON string for");
+
+        @Language("JSON")
+        String jsonWithoutTags = "{\"version\":-1}";
+        assertThatThrownBy(() -> codec.fromJson(jsonWithoutTags))
+                .isInstanceOf(IllegalArgumentException.class)
</code_context>
<issue_to_address>
**suggestion (testing):** Missing test for valid CheckpointMetadataEntry with absent 'tags' field.
Please add a test for deserializing a valid CheckpointMetadataEntry with a non-negative version and no 'tags' field to confirm correct handling of this case.
</issue_to_address>
### Comment 2
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestCheckpointMetadataEntry.java:70-89` </location>
<code_context>
+    }
+
+    @Test
+    void testCheckpointMetadataEntryToJson()
+    {
+        assertThat(codec.toJson(new CheckpointMetadataEntry(
+                100,
+                Optional.of(ImmutableMap.of(
+                        "sidecarNumActions", "1",
+                        "sidecarSizeInBytes", "20965",
+                        "numOfAddFiles", "1",
+                        "sidecarFileSchema", "")))))
+                .isEqualTo("{\n" +
+                        "  \"version\" : 100,\n" +
+                        "  \"tags\" : {\n" +
+                        "    \"sidecarNumActions\" : \"1\",\n" +
+                        "    \"sidecarSizeInBytes\" : \"20965\",\n" +
+                        "    \"numOfAddFiles\" : \"1\",\n" +
+                        "    \"sidecarFileSchema\" : \"\"\n" +
+                        "  }\n" +
+                        "}");
+    }
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a test for serialization with version 0 and absent 'tags'.
Please add a test case for serializing a CheckpointMetadataEntry with version 0 and no 'tags', and verify the resulting JSON structure.
```suggestion
    @Test
    void testCheckpointMetadataEntryToJson()
    {
        assertThat(codec.toJson(new CheckpointMetadataEntry(
                100,
                Optional.of(ImmutableMap.of(
                        "sidecarNumActions", "1",
                        "sidecarSizeInBytes", "20965",
                        "numOfAddFiles", "1",
                        "sidecarFileSchema", "")))))
                .isEqualTo("{\n" +
                        "  \"version\" : 100,\n" +
                        "  \"tags\" : {\n" +
                        "    \"sidecarNumActions\" : \"1\",\n" +
                        "    \"sidecarSizeInBytes\" : \"20965\",\n" +
                        "    \"numOfAddFiles\" : \"1\",\n" +
                        "    \"sidecarFileSchema\" : \"\"\n" +
                        "  }\n" +
                        "}");
    }
    @Test
    void testCheckpointMetadataEntryToJsonWithVersionZeroAndNoTags()
    {
        assertThat(codec.toJson(new CheckpointMetadataEntry(
                0,
                Optional.empty())))
                .isEqualTo("{\n" +
                        "  \"version\" : 0\n" +
                        "}");
    }
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
        
          
                ...lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestCheckpointMetadataEntry.java
          
            Show resolved
            Hide resolved
        
      | @@ -0,0 +1,90 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer query based tests to unit tests in this repository.
Please update existing integration tests or product tests instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, integration tests updated
| 
 https://trino.io/development/process#pull-request-and-commit-guidelines 
 | 
14dd843    to
    d1593a9      
    Compare
  
    | 
 Fixed | 
| throws Exception | ||
| { | ||
| testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v1/clone_source"); | ||
| testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/cloned_table"); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo, should be databricks154/clone_checkpoint_version_zero/checkpoint_v1/cloned_table ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thanks! fixed
| { | ||
| testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v1/clone_source"); | ||
| testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/cloned_table"); | ||
| testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/clone_source"); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could remove source table, test cloned source is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just test on my local, seems v1 test case not exercise the logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that make sense, as the v1 checkpoint file is in Parquet format, so it won’t go through this JSON deserialization logic. I was just testing v1 incidentally to see if there were any issues. If it’s not needed, I can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may just keep it, there’s no harm in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no necessary to add v1 case here, since CheckpointMetadata only allowed in v2 spec https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoint-metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, updated
199baac    to
    3be6244      
    Compare
  
    | 
 reminder | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
| copyDirectoryContents(new File(Resources.getResource(resourceName).toURI()).toPath(), tableLocation); | ||
| assertUpdate("CALL system.register_table(CURRENT_SCHEMA, '%s', '%s')".formatted(tableName, tableLocation.toUri())); | ||
|  | ||
| assertThat(query("SELECT * FROM " + tableName + " ORDER BY id")).matches("VALUES " + | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move .matches to new line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| testClonedTableWithCheckpointVersionZero("databricks154/clone_checkpoint_version_zero/checkpoint_v2/cloned_table"); | ||
| } | ||
|  | ||
| private void testClonedTableWithCheckpointVersionZero(String resourceName) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now seems the the method is redundant, we could remove it, put code under testClonedTableWithCheckpointVersionZero()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
3be6244    to
    78997c3      
    Compare
  
    | Data generated using Databricks 15.4: | ||
|  | ||
| ```sql | ||
| CREATE TABLE cloned_table DEEP CLONE source_table; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add a case for shallow clone?
Description
For cloned Delta Lake tables (either deep or shallow clones), the checkpoint version may start at
0.The previous validation in the
CheckpointMetadataEntryconstructor required the version to be positive,which caused the following exception:
Root cause is:
Additional context and related issues
Fixes #27097
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
Summary by Sourcery
Allow checkpoint metadata version zero for Delta Lake clones and add comprehensive tests for CheckpointMetadataEntry.
Bug Fixes:
Enhancements:
Tests: