Description
Search before asking
- I searched in the issues and found nothing similar.
Paimon version
I found this bug in version 1.9.0, and it still exists on the master branch.
Compute Engine
Flink 1.16, Spark 3.3.1
Minimal reproduce step
If the baseManifestList or deltaManifestList associated with a tag is deleted in advance, data files are mistakenly deleted during tag cleaning. This can corrupt data, especially when those data files are still referenced by the earliest snapshot.
Step 1: Delete the baseManifestList or deltaManifestList associated with the tag. The precondition is that the tag expiration time is greater than the snapshot expiration time.
Step 2: Run the tag expiration program.
Step 3: Query the current snapshot or the earliest snapshot. The query fails with a FileNotFoundException for an ORC file.
What doesn't meet your expectations?
This issue causes data file loss and makes Paimon unavailable.
Anything else?
When a tag expires, the left neighbor tag and the nearest right neighbor tag are collected into the skipping set to prevent data files from being mistakenly deleted. If the baseManifestList of the nearest right neighbor tag does not exist, the relevant data files are deleted by accident. So I suggest that the skipping set collect the earliest snapshot in addition to the left neighbor tag and the nearest right neighbor tag.
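
To illustrate the idea, here is a minimal, self-contained sketch. It is a simplified model, not Paimon's actual code: `TagSkipSetSketch`, `Ref`, `buildSkipSet`, and `filesToDelete` are hypothetical names, the file names are made up, and a missing manifest list is modeled as an empty `Optional`. It assumes a neighbor whose manifest list cannot be read contributes nothing to the skipping set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of the skipping set used during tag expiration.
// All names here are hypothetical and do not mirror Paimon's classes.
class TagSkipSetSketch {

    // A tag or snapshot. Its data files are known only if its
    // manifest lists are still readable.
    static final class Ref {
        final String name;
        final Optional<Set<String>> dataFiles;

        Ref(String name, Optional<Set<String>> dataFiles) {
            this.name = name;
            this.dataFiles = dataFiles;
        }
    }

    static Ref readable(String name, String... files) {
        Set<String> set = new TreeSet<>(Arrays.asList(files));
        return new Ref(name, Optional.of(set));
    }

    // Models a tag whose baseManifestList or deltaManifestList was deleted.
    static Ref unreadable(String name) {
        return new Ref(name, Optional.empty());
    }

    // Union of the data files of every reference that can still be read.
    static Set<String> buildSkipSet(Ref... refs) {
        Set<String> skip = new TreeSet<>();
        for (Ref ref : refs) {
            ref.dataFiles.ifPresent(skip::addAll);
        }
        return skip;
    }

    // Data files of the expired tag that are not protected by the skipping set.
    static Set<String> filesToDelete(Ref expired, Set<String> skip) {
        Set<String> toDelete =
                new TreeSet<>(expired.dataFiles.orElse(Collections.<String>emptySet()));
        toDelete.removeAll(skip);
        return toDelete;
    }

    public static void main(String[] args) {
        Ref expired = readable("expired-tag", "data-1.orc", "data-2.orc");
        Ref left = readable("left-tag", "data-1.orc");
        Ref right = unreadable("right-tag"); // manifest list already deleted
        Ref earliest = readable("earliest-snapshot", "data-2.orc", "data-3.orc");

        // Neighbor tags only: data-2.orc is unprotected and gets deleted.
        System.out.println(filesToDelete(expired, buildSkipSet(left, right)));

        // Neighbor tags plus earliest snapshot: nothing is deleted.
        System.out.println(filesToDelete(expired, buildSkipSet(left, right, earliest)));
    }
}
```

In this model, the first call reports `data-2.orc` as deletable even though the earliest snapshot still references it, which matches the FileNotFoundException described above. Adding the earliest snapshot to the skipping set keeps that file.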
Are you willing to submit a PR?
- I'm willing to submit a PR!