Skip to content

ReplicaFetcherThread keeps throwing UnknownServerException because of corrupt index file #1716

@LiebingYu

Description

@LiebingYu

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

main (development)

Please describe the bug 🐞

When a fetchLog request needs to read from remote logs and happens to encounter a corrupt index file, the TabletServer is currently unable to handle this exception and the client will receive an UnknownServerException. In our production environment, the follower keeps receiving the following error, causing replica synchronization to stall.

2025-09-17 19:36:00,476 WARN  org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread [] - Error in response for fetch log request org.apache.fluss.rpc.messages.FetchLogRequest@160f9b0b
java.util.concurrent.ExecutionException: org.apache.fluss.exception.UnknownServerException: org.apache.fluss.server.exception.CorruptIndexException: Corrupt time index found, time index file (/mnt/fluss/data/0/remote-log-index-cache/60976417_41a99e37-f1a0-46ab-8745-b623ac4c2d37.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1758092497499
	at org.apache.fluss.server.log.TimeIndex.lambda$sanityCheck$0(TimeIndex.java:105)
	at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:32)
	at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:50)
	at org.apache.fluss.server.log.TimeIndex.sanityCheck(TimeIndex.java:94)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$createCacheEntry$7(RemoteLogIndexCache.java:270)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.loadIndexFile(RemoteLogIndexCache.java:496)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.createCacheEntry(RemoteLogIndexCache.java:254)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$2(RemoteLogIndexCache.java:221)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2406)
	at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1916)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2404)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
	at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$3(RemoteLogIndexCache.java:219)
	at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:42)
	at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:55)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.getIndexEntry(RemoteLogIndexCache.java:208)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCach
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096) ~[?:?]
	at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.processFetchLogRequest(ReplicaFetcherThread.java:227) ~[fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
	at java.util.Optional.ifPresent(Optional.java:178) [?:?]
	at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.maybeFetch(ReplicaFetcherThread.java:153) [fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
	at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.doWork(ReplicaFetcherThread.java:124) [fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
	at org.apache.fluss.utils.concurrent.ShutdownableThread.run(ShutdownableThread.java:96) [fluss-server-0.8-SNAPSHOT.jar:0.8-SNAPSHOT]
Caused by: org.apache.fluss.exception.UnknownServerException: org.apache.fluss.server.exception.CorruptIndexException: Corrupt time index found, time index file (/mnt/fluss/data/0/remote-log-index-cache/60976417_41a99e37-f1a0-46ab-8745-b623ac4c2d37.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1758092497499
	at org.apache.fluss.server.log.TimeIndex.lambda$sanityCheck$0(TimeIndex.java:105)
	at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:32)
	at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:50)
	at org.apache.fluss.server.log.TimeIndex.sanityCheck(TimeIndex.java:94)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$createCacheEntry$7(RemoteLogIndexCache.java:270)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.loadIndexFile(RemoteLogIndexCache.java:496)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.createCacheEntry(RemoteLogIndexCache.java:254)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$2(RemoteLogIndexCache.java:221)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2406)
	at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1916)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2404)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
	at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.lambda$getIndexEntry$3(RemoteLogIndexCache.java:219)
	at org.apache.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:42)
	at org.apache.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:55)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCache.getIndexEntry(RemoteLogIndexCache.java:208)
	at org.apache.fluss.server.log.remote.RemoteLogIndexCach

This issue will affect both clients and followers; further discussion is needed on how to fix client-side reading. This issue aims to first address the problem on the follower side.

Solution

The reason for the UnknownServerException is that a CorruptIndexException was thrown while retrieving firstStartPos due to a corrupt index file.

public class ReplicaManager {
    ...

    private @Nullable RemoteLogFetchInfo fetchLogFromRemote(Replica replica, long fetchOffset) {
        List<RemoteLogSegment> remoteLogSegmentList =
                remoteLogManager.relevantRemoteLogSegments(replica.getTableBucket(), fetchOffset);
        if (!remoteLogSegmentList.isEmpty()) {
            // will throw CorruptIndexException if index file corrupt
            int firstStartPos =
                    remoteLogManager.lookupPositionForOffset(
                            remoteLogSegmentList.get(0), fetchOffset);
                    ...
        } else {
            return null;
        }
    }

For the follower, firstStartPos is actually not needed, so we can simply skip this step for followers. This not only avoids the issue of the follower continuously receiving errors when encountering a corrupt index file, but also serves as a small performance optimization.

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions