Support read inCommitTimestamp in Delta Lake history table #25056

Open. Wants to merge 1 commit into base: master.

chenjian2664
Contributor

@chenjian2664 chenjian2664 commented Feb 18, 2025

Release notes

## Delta Lake
* Support reading the [in_commit_timestamp](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) value in the `history` system table. ({issue}`25056`)

@cla-bot cla-bot bot added the cla-signed label Feb 18, 2025
@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Feb 18, 2025
Member

@ebyhr ebyhr left a comment

Could you add a test with a static resource? I guess the connector can't read such tables because the inCommitTimestamps reader feature is unsupported.

@chenjian2664
Contributor Author

chenjian2664 commented Feb 18, 2025

Just tested: Trino seems able to read such a table. It looks like there are no constraints on the reader side?

@ebyhr
Member

ebyhr commented Feb 18, 2025

You are right. It is a writer-only feature: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps

@chenjian2664 chenjian2664 force-pushed the delta_in_commit branch 2 times, most recently from 25cd260 to 8bce054 Compare February 24, 2025 09:29
@chenjian2664 chenjian2664 force-pushed the delta_in_commit branch 4 times, most recently from 109411e to c568122 Compare February 25, 2025 08:27
@chenjian2664
Contributor Author

@raunaqmorarka Would you like to have a look again?

commitInfoEntries.forEach(commitInfoEntry -> {
    pagesBuilder.beginRow();

    pagesBuilder.appendBigint(commitInfoEntry.version());
    pagesBuilder.appendTimestampTzMillis(commitInfoEntry.timestamp(), timeZoneKey);
    commitInfoEntry.inCommitTimestamp().ifPresentOrElse(
            // use `inCommitTimestamp` if table In-Commit timestamps enabled, otherwise use file modification timestamp
Member

@ebyhr ebyhr Feb 26, 2025

Do we really fall back to "file modification timestamp"? Same for delta-lake.md.

@@ -438,6 +438,7 @@ private DeltaLakeTransactionLogEntry buildCommitInfoEntry(ConnectorSession sessi

CommitInfoEntry result = new CommitInfoEntry(
        commitInfo.getLong("version"),
        OptionalLong.empty(),
Member

@ebyhr ebyhr Feb 26, 2025

I believe code readers can't understand why this logic returns OptionalLong.empty(). I would recommend supporting the field in this PR.

Contributor Author

@chenjian2664 chenjian2664 Feb 26, 2025

Do you know how to write the commitInfo in a checkpoint file?
Currently the whole method buildCommitInfoEntry is not covered by tests, or am I missing something?
I see that the logic below is always hit in buildCommitInfoEntry in the current tests.

        if (block.isNull(pagePosition)) {
            return null;
        }

Comment on lines +2373 to +2377
// The first two versions commitInfo doesn't contain `inCommitTimestamp`, the value is read from `timestamp` in commitInfo
// The last two versions commitInfo contain `inCommitTimestamp`, the value is read from it.
assertQuery("SELECT date_diff('millisecond', TIMESTAMP '1970-01-01 00:00:00 UTC', timestamp) FROM \"%s$history\"".formatted(tableName), "VALUES 1739859668531L, 1739859684775L, 1739859743394L, 1739859755480L");
}
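
To make the asserted values easier to read, here is a minimal Java sketch of what the `date_diff('millisecond', TIMESTAMP '1970-01-01 00:00:00 UTC', timestamp)` expression computes: the number of milliseconds between the Unix epoch and the history row's timestamp. The class and method names are hypothetical and not part of the PR.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, for illustration only; not part of the PR.
public final class HistoryTimestampMillis {
    private HistoryTimestampMillis() {}

    // Java analogue of date_diff('millisecond', <epoch>, timestamp):
    // milliseconds elapsed between 1970-01-01T00:00:00Z and the given instant.
    public static long millisSinceEpoch(Instant timestamp) {
        return ChronoUnit.MILLIS.between(Instant.EPOCH, timestamp);
    }

    public static void main(String[] args) {
        // First value asserted in the test above.
        Instant first = Instant.ofEpochMilli(1739859668531L);
        // Prints the instant in ISO-8601 form followed by the round-tripped millis.
        System.out.println(first + " -> " + millisSinceEpoch(first));
    }
}
```

The first asserted value, 1739859668531, corresponds to 2025-02-18T06:21:08.531Z, which matches the date the PR was opened.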

Contributor Author

@ebyhr this shows the "fallback" to `timestamp` in the first two versions.

Member

@ebyhr ebyhr Feb 26, 2025

@chenjian2664 I'm not asking about falling back to the timestamp field. I'm asking whether this PR falls back to the "file modification timestamp", as your code comment says. Does this assertion fail if we change the file modification time of the static resource? I believe the answer is no.

Contributor Author

@chenjian2664 chenjian2664 Feb 26, 2025

Oh, I think we mean different things by "file modification timestamp". What I meant by "file modification timestamp" is the timestamp field, while I guess you are thinking of the creation/modification time of the metadata/log entry file?
If so, I will update the comment to "... read the timestamp", since the current wording looks misleading.

I took the term "file modification timestamp" from https://github.com/delta-io/delta/blob/master/PROTOCOL.md#recommendations-for-readers-of-tables-with-in-commit-timestamps, which says for readers:

 readers can use the following rules:

1. For commits with version >= delta.inCommitTimestampEnablementVersion, readers should use the inCommitTimestamp field of the commitInfo action.
2. For commits with version < delta.inCommitTimestampEnablementVersion, readers should use the file modification timestamp.
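
The two quoted rules can be sketched in Java as follows. This is only an illustration of the protocol recommendation as quoted, not the code in this PR (which, per the discussion above, reads the `timestamp` field of commitInfo when `inCommitTimestamp` is absent). The class name, method name, and parameters are hypothetical, and the sketch assumes the caller already knows the value of delta.inCommitTimestampEnablementVersion.

```java
import java.util.OptionalLong;

// Hypothetical helper, for illustration only; not Trino or Delta Lake API.
public final class CommitTimestampResolver {
    private CommitTimestampResolver() {}

    /**
     * Picks the commit timestamp following the two reader rules quoted above.
     *
     * @param version commit version being read
     * @param enablementVersion assumed value of delta.inCommitTimestampEnablementVersion
     * @param inCommitTimestamp inCommitTimestamp from the commitInfo action, if present
     * @param fileModificationMillis modification time of the commit file, in epoch millis
     */
    public static long resolve(
            long version,
            long enablementVersion,
            OptionalLong inCommitTimestamp,
            long fileModificationMillis)
    {
        if (version >= enablementVersion) {
            // Rule 1: commits at or after the enablement version use inCommitTimestamp.
            return inCommitTimestamp.orElseThrow(() -> new IllegalStateException(
                    "inCommitTimestamp missing for version " + version));
        }
        // Rule 2: older commits use the file modification timestamp.
        return fileModificationMillis;
    }
}
```

For example, with an enablement version of 2, `resolve(3, 2, OptionalLong.of(100L), 50L)` returns 100, while `resolve(1, 2, OptionalLong.empty(), 50L)` returns 50.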

In addition, adjusted the `timestamp` field to return
UTC timezone instead of the user session timezone
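
As a rough illustration of what changing the zone means, the Java sketch below renders one of the epoch-millisecond values from the test in UTC and in another zone. The instant is identical in both cases; only the displayed wall-clock time differs. The class name and the choice of `Asia/Tokyo` as a stand-in session time zone are assumptions made for this example.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical demo, for illustration only; not part of the PR.
public final class TimestampZoneDemo {
    private TimestampZoneDemo() {}

    // Attaches a zone to an epoch-millisecond value without changing the instant.
    public static ZonedDateTime inZone(long epochMillis, ZoneId zone) {
        return Instant.ofEpochMilli(epochMillis).atZone(zone);
    }

    public static void main(String[] args) {
        long millis = 1739859668531L;
        ZonedDateTime utc = inZone(millis, ZoneOffset.UTC);
        // "Asia/Tokyo" stands in for an arbitrary user session time zone.
        ZonedDateTime session = inZone(millis, ZoneId.of("Asia/Tokyo"));
        System.out.println(utc);
        System.out.println(session);
        // Both values denote the same point in time.
        System.out.println(utc.toInstant().equals(session.toInstant()));
    }
}
```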

3 participants