-
Notifications
You must be signed in to change notification settings - Fork 405
Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
main (development)
Please describe the bug 🐞
When a fetchLog
request needs to read from remote logs and happens to encounter a corrupt index file, the TabletServer is currently unable to handle this exception and the client will receive an UnknownServerException
. In our production environment, the follower keeps receiving the following error, causing replica synchronization to stall.
2025-09-17 19:36:00,476 WARN org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread [] - Error in response for fetch log request org.apache.fluss.rpc.messages.FetchLogRequest@160f9b0b
java.util.concurrent.ExecutionException: org.apache.fluss.exception.UnknownServerException: org.apache.fluss.server.exception.CorruptIndexException: Corrupt time index found, time index file (/mnt/fluss/data/0/remote-log-index-cache/60976417_41a99e37-f1a0-46ab-8745-b623ac4c2d37.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1758092497499
at org.apache.fluss.server.log.TimeIndex.lambda$sanityCheck$0(TimeIndex.java:105)
at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:32)
at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:50)
at org.apache.fluss.server.log.TimeIndex.sanityCheck(TimeIndex.java:94)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$createCacheEntry$7(RemoteLogIndexCache.java:270)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.loadIndexFile(RemoteLogIndexCache.java:496)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.createCacheEntry(RemoteLogIndexCache.java:254)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$2(RemoteLogIndexCache.java:221)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2406)
at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1916)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2404)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$3(RemoteLogIndexCache.java:219)
at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:42)
at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:55)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.getIndexEntry(RemoteLogIndexCache.java:208)
at org.apache.fluss.server.log.remote.RemoteLogIndexCach
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096) ~[?:?]
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.processFetchLogRequest(ReplicaFetcherThread.java:227) ~[fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
at java.util.Optional.ifPresent(Optional.java:178) [?:?]
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.maybeFetch(ReplicaFetcherThread.java:153) [fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.doWork(ReplicaFetcherThread.java:124) [fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
at org.apache.fluss.utils.concurrent.ShutdownableThread.run(ShutdownableThread.java:96) [fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
Caused by: org.apache.fluss.exception.UnknownServerException: org.apache.fluss.server.exception.CorruptIndexException: Corrupt time index found, time index file (/mnt/fluss/data/0/remote-log-index-cache/60976417_41a99e37-f1a0-46ab-8745-b623ac4c2d37.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1758092497499
at org.apache.fluss.server.log.TimeIndex.lambda$sanityCheck$0(TimeIndex.java:105)
at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:32)
at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:50)
at org.apache.fluss.server.log.TimeIndex.sanityCheck(TimeIndex.java:94)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$createCacheEntry$7(RemoteLogIndexCache.java:270)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.loadIndexFile(RemoteLogIndexCache.java:496)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.createCacheEntry(RemoteLogIndexCache.java:254)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$2(RemoteLogIndexCache.java:221)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2406)
at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1916)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2404)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$3(RemoteLogIndexCache.java:219)
at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:42)
at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:55)
at org.apache.fluss.server.log.remote.RemoteLogIndexCache.getIndexEntry(RemoteLogIndexCache.java:208)
at org.apache.fluss.server.log.remote.RemoteLogIndexCach
This issue will affect both clients and followers; further discussion is needed on how to fix client-side reading. This issue aims to first address the problem on the follower side.
Solution
The reason for the UnknownServerException
is that a CorruptIndexException
was thrown while retrieving firstStartPos
due to a corrupt index file.
public class ReplicaManager {
...
private @Nullable RemoteLogFetchInfo fetchLogFromRemote(Replica replica, long fetchOffset) {
List<RemoteLogSegment> remoteLogSegmentList =
remoteLogManager.relevantRemoteLogSegments(replica.getTableBucket(), fetchOffset);
if (!remoteLogSegmentList.isEmpty()) {
// will throw CorruptIndexException if index file corrupt
int firstStartPos =
remoteLogManager.lookupPositionForOffset(
remoteLogSegmentList.get(0), fetchOffset);
...
} else {
return null;
}
}
For the follower, firstStartPos is actually not needed, so we can simply skip this step for followers. This not only avoids the issue of the follower continuously receiving errors when encountering a corrupt index file, but also serves as a small performance optimization.
Are you willing to submit a PR?
- I'm willing to submit a PR!