[FLINK-38574][checkpoint] Avoid reusing re-uploaded sst files when checkpoint notification is delayed #27157
@rkhachatryan would you mind taking a look?
Thanks for the fix @Zakelly
Can you verify my understanding of the problem?
1. checkpoint 1 uses file 00001.SST uploaded as xxx.sst
2. checkpoint 2 uses the same file 00001.SST but re-uploads it as yyy.sst because CP 1 wasn't yet confirmed
3. TM gets a confirmation of checkpoint 1
4. JM completes checkpoint 2 and subsumes checkpoint 1, removing xxx.sst
5. checkpoint 3 tries to re-use file 00001.SST uploaded as xxx.sst in checkpoint 1, but it was deleted in (4) by JM
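The sequence above can be replayed in a small toy model. This is only a sketch for illustration: the class `LateNotificationScenario` and its methods are hypothetical and are not Flink APIs, and the "JM" side is reduced to a set of remote file names that survive subsumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the race; every name here is hypothetical, not a Flink API. */
class LateNotificationScenario {
    /** Remote files the JM still keeps (stand-in for the shared state registry). */
    final Set<String> remoteFiles = new HashSet<>();
    /** checkpoint id -> (local sst name -> remote name used in that checkpoint). */
    final Map<Long, Map<String, String>> uploaded = new HashMap<>();

    void upload(long checkpointId, String local, String remote) {
        uploaded.computeIfAbsent(checkpointId, k -> new HashMap<>()).put(local, remote);
        remoteFiles.add(remote);
    }

    /** JM completes the checkpoint and subsumes older ones: unreferenced files go away. */
    void completeAndSubsume(long checkpointId) {
        remoteFiles.retainAll(uploaded.get(checkpointId).values());
    }

    /** TM side: the remote file a new checkpoint would reuse, based on the confirmed one. */
    String reuseFrom(long confirmedCheckpointId, String local) {
        return uploaded.get(confirmedCheckpointId).get(local);
    }

    /** Replays steps 1-5 and reports whether checkpoint 3 would reference a deleted file. */
    static boolean danglingReference() {
        LateNotificationScenario s = new LateNotificationScenario();
        s.upload(1, "00001.SST", "xxx.sst"); // step 1
        s.upload(2, "00001.SST", "yyy.sst"); // step 2: CP 1 unconfirmed, so re-upload
        long confirmedOnTm = 1;              // step 3: late confirmation of CP 1
        s.completeAndSubsume(2);             // step 4: xxx.sst is removed
        String reused = s.reuseFrom(confirmedOnTm, "00001.SST"); // step 5
        return !s.remoteFiles.contains(reused);
    }
}
```

In this model `danglingReference()` reports that the file chosen for reuse in step 5 is no longer present, which is the failure mode being asked about.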
If that's correct, could you add it to the PR / commit (NIT)?
...end-forst/src/main/java/org/apache/flink/state/forst/snapshot/ForStSnapshotStrategyBase.java
Yes, that's correct. I've added those to the PR description and to the javadocs of
Thanks for updating the PR, LGTM
instanceof PlaceholderStreamStateHandle)) {
    // If it's not a placeholder handle, it means the sst file has been
    // re-uploaded in the following checkpoint.
    prunedSstFiles.remove(handleAndLocalPath.getLocalPath());
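For readers without the full diff, the idea behind this hunk can be sketched in isolation. The following is an assumption-laden illustration, not the PR's code: `Handle` is a hypothetical stand-in whose `placeholder` flag mimics a `PlaceholderStreamStateHandle` check, and `prune` is a made-up helper name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the pruning idea; types and method names are hypothetical. */
class PruneReuploadedSketch {
    /** Stand-in for a state handle plus its local path. */
    static final class Handle {
        final String localPath;
        final boolean placeholder; // true: the file was reused, not uploaded again

        Handle(String localPath, boolean placeholder) {
            this.localPath = localPath;
            this.placeholder = placeholder;
        }
    }

    /**
     * Starts from the sst files of the confirmed checkpoint and drops every file
     * that a later checkpoint uploaded again (its handle is not a placeholder).
     */
    static Set<String> prune(Set<String> confirmedSstFiles, List<List<Handle>> laterCheckpoints) {
        Set<String> prunedSstFiles = new HashSet<>(confirmedSstFiles);
        for (List<Handle> handles : laterCheckpoints) {
            for (Handle handle : handles) {
                if (!handle.placeholder) {
                    // Re-uploaded in a following checkpoint: the old remote copy may be gone.
                    prunedSstFiles.remove(handle.localPath);
                }
            }
        }
        return prunedSstFiles;
    }
}
```

With this shape, a file that was re-uploaded after the confirmed checkpoint is no longer offered for reuse, so the next checkpoint uploads it again instead of pointing at a remote copy that may already be deleted.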
NIT: log the number of removed files if it's > 0 (on DEBUG level)?
I added a log of the number of removed files at TRACE level, since the outer snapshotMetaData also logs at TRACE level.
Force-pushed: 74f24ee to 223eedd
Ah, CI failed, but I don't think that's relevant. Force-pushing to re-trigger...
What is the purpose of the change
During incremental checkpoints, RocksDB/ForSt uploads checkpoint files based on the previously uploaded SST files. However, which SST files count as previously uploaded is determined by checkpoint notifications. If a notification arrives late, the backend may reuse SST files that have already been subsumed by the shared state registry. See FLINK-38574 for one example. This PR fixes this.
Following is the issue:
Brief change log
Verifying this change
- testCheckpointIsIncrementalWithLateNotification for both RocksDB and ForSt.
- EventTimeWindowCheckpointingITCase.testSlidingTimeWindow and testPreAggregatedTumblingTimeWindow will fail occasionally without this fix.
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): no
Documentation