For a 3 replica table, we hit a case where:

1. Two replicas started the segment commit. One server succeeded and uploaded the segment to deep store, and the segment moved to the ONLINE state.
[2025-01-02 06:04:22.684927] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-29:270] HOLDING:Picking winner time=6011 size=2
[2025-01-02 06:04:22.684947] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-29:270] HOLDING:Committer notified winner instance=Server_62 offset=353103174
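For reference, a minimal sketch of the hold/commit protocol these logs reflect, using illustrative names rather than Pinot's actual SegmentCompletionFSM API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the hold/commit protocol in the logs above, NOT
// Pinot's actual SegmentCompletionFSM. Replicas report their consumed
// offset; the controller holds them until it can pick a winner (here, the
// replica with the largest offset), then tells the winner to COMMIT.
public class CompletionFsmSketch {
  enum Response { HOLD, COMMIT }

  private final Map<String, Long> reportedOffsets = new HashMap<>();
  private final int numReplicas;
  private String winner;

  CompletionFsmSketch(int numReplicas) {
    this.numReplicas = numReplicas;
  }

  synchronized Response segmentConsumed(String instance, long offset) {
    reportedOffsets.put(instance, offset);
    // The real FSM can also pick a winner on a timeout, which is how only
    // 2 of 3 replicas participated here (the "size=2" log line above).
    if (winner == null && reportedOffsets.size() == numReplicas) {
      winner = Collections.max(reportedOffsets.entrySet(),
          Map.Entry.comparingByValue()).getKey();
    }
    return instance.equals(winner) ? Response.COMMIT : Response.HOLD;
  }
}
```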
2. The 3rd replica server (Server_95) had recently been replaced with a new server and was still starting up. It started its segment commit after the previous one was done, and the controller allowed it to proceed with the deep store upload, overwriting the previous segment copy.
3. The controller then failed when trying to update the ideal state, since the segment was no longer in the CONSUMING state. This repeated several times until the server gave up and downloaded the copy from deep store.
[2025-01-02 06:15:33.342940] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-79:320] Processing segmentCommitStart(Server_95, 353106085)
[2025-01-02 06:15:33.343018] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-79:320] Processing segmentCommit(Server_95, 353106085)
[2025-01-02 06:15:33.343038] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-79:320] HOLDING:HOLD for instance=Server_95 offset=353106085
[2025-01-02 06:15:36.362908] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-127:368] Processing segmentConsumed(Server_95, 353106085)
[2025-01-02 06:15:36.362927] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-127:368] HOLDING:Picking winner time=103851 size=1
[2025-01-02 06:15:36.362946] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-127:368] HOLDING:Committer notified winner instance=Server_95 offset=353106085
[2025-01-02 06:15:36.362962] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-127:368] HOLDING:COMMIT for instance=Server_95 offset=353106085 buldTimeSec=126
[2025-01-02 06:15:36.366488] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-48:289] Processing segmentCommitStart(Server_95, 353106085)
[2025-01-02 06:15:36.366507] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-48:289] COMMITTER_NOTIFIED:Uploading for instance=Server_95 offset=353106085
[2025-01-02 06:15:52.366909] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-59:300] Processing segmentCommitEnd(Server_95, 353106085)
[2025-01-02 06:15:52.366952] INFO [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-59:300] Committing segment table1__327753__3444__20250102T0204Z at offset 353106085 winner Server_95
[2025-01-02 06:16:19.121540] ERROR [SegmentCompletionFSM_table1__327753__3444__20250102T0204Z] [grizzly-http-server-59:300] Caught exception while committing segment metadata for segment: table1__327753__3444__20250102T0204Z
[2025-01-02 06:16:19.121585] java.lang.IllegalStateException: Failed to find instance in CONSUMING state in IdealState for segment: table1__327753__3444__20250102T0204Z
[2025-01-02 06:16:19.121602] at org.apache.pinot.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:604) ~[pinot-all-1.2.0-2024-11-20-d3d547f256-SNAPSHOT-jar-with-dependencies.jar:1.2.0-2024-11-20-d3d547f256-SNAPSHOT-d3d547f2564eae4922c1ecb0280d6960f74d6902]
[2025-01-02 06:16:19.121626] at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.commitSegmentMetadataInternal(PinotLLCRealtimeSegmentManager.java:523) ~[pinot-all-1.2.0-2024-11-20-d3d547f256-SNAPSHOT-jar-with-dependencies.jar:1.2.0-2024-11-20-d3d547f256-SNAPSHOT-d3d547f2564eae4922c1ecb0280d6960f74d6902]
...
commit + upload + fail repeats multiple times
...
[2025-01-02 06:26:03.847443] WARN [RealtimeSegmentDataManager_table1__327753__3444__20250102T0204Z] [HelixTaskExecutor-message_handle_thread_32:28] Current offset 353106085 ahead of the offset in zk 353103174. Downloading to replace
[2025-01-02 06:26:06.923869] INFO [table1_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_32:249] Downloaded and replaced CONSUMING segment: table1__327753__3444__20250102T0204Z
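The exception comes from a precondition in PinotLLCRealtimeSegmentManager.commitSegmentMetadataInternal. Roughly, it amounts to the following check; only the Preconditions call and message are taken from the stack trace, the surrounding names are illustrative:

```java
import com.google.common.base.Preconditions;
import java.util.Map;

// Approximation of the guard from the stack trace above, not the exact
// Pinot code: committing segment metadata requires that some instance still
// maps the segment to CONSUMING in the Helix IdealState. The first committer
// had already moved the segment to ONLINE, so Server_95's late commit trips
// this check on every retry. Note that the deep store upload has already
// happened by the time this check runs.
public class ConsumingStateCheckSketch {
  static void validateConsumingInstanceExists(String segmentName,
      Map<String, String> instanceStateMap) {  // instance -> Helix state
    Preconditions.checkState(instanceStateMap.containsValue("CONSUMING"),
        "Failed to find instance in CONSUMING state in IdealState for segment: %s",
        segmentName);
  }
}
```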
What's concerning here is that the offsets are different (353103174 vs 353106085). It seems like the controller should not have allowed the 3rd replica to proceed with the upload and commit.
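One possible guard (hypothetical, not something that exists in Pinot today as far as we can tell) would be for the controller to consult the segment's ZK metadata before letting a replica start the upload:

```java
// Hypothetical guard, not existing Pinot code. Before letting a replica
// start the deep store upload, consult the segment's ZK metadata and reject
// the commit if the segment is already committed.
public class CommitGuardSketch {

  // Hypothetical minimal view of the segment's ZK metadata.
  interface SegmentZkMetadataView {
    boolean isDone();     // true once the segment has been committed
    long getEndOffset();  // committed end offset, or -1 while consuming
  }

  static boolean allowCommitUpload(SegmentZkMetadataView zk,
      long committerOffset) {
    if (zk.isDone()) {
      // Already committed by another replica (at offset 353103174 in this
      // incident); a late committer at 353106085 must not overwrite it.
      return false;
    }
    long committedOffset = zk.getEndOffset();
    return committedOffset < 0 || committedOffset == committerOffset;
  }
}
```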
We're using the `realtime.segment.serverUploadToDeepStore` feature to allow servers to upload segments directly as part of segment commit, but the issue doesn't seem to be specific to that feature.
This separately caused a CRC mismatch failure for upsert compaction: the Minion compacted the first version of the segment and updated the metadata, but the 3rd replica had overwritten the S3 copy.
[2025-01-09 00:39:58.795983] ERROR [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-33:17] Caught exception while executing task: Task_UpsertCompactionTask_0fbf766c-942e-4654-a938-7362bf6433de_1736382915514_1
[2025-01-09 00:39:58.796017] java.lang.IllegalStateException: Crc mismatched between ZK and deepstore copy of segment: table1__327753__3444__20250102T0204Z. Expected crc from ZK: 2394021143, crc from deepstore: 1876071145
[2025-01-09 00:39:58.796047] at org.apache.pinot.plugin.minion.tasks.upsertcompaction.UpsertCompactionTaskExecutor.convert(UpsertCompactionTaskExecutor.java:70)
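The failing check in UpsertCompactionTaskExecutor.convert amounts to a CRC comparison along these lines (an approximation; the message text matches the error above):

```java
import com.google.common.base.Preconditions;

// Approximation of the consistency check that fails above, not the exact
// Pinot code: before compacting, the Minion task compares the CRC recorded
// in the segment's ZK metadata with the CRC of the copy in deep store.
// Server_95's overwrite left deep store (crc 1876071145) out of sync with
// the ZK metadata (crc 2394021143), so the task aborts.
public class CrcCheckSketch {
  static void checkCrcConsistency(String segmentName, long crcFromZk,
      long crcFromDeepStore) {
    Preconditions.checkState(crcFromZk == crcFromDeepStore,
        "Crc mismatched between ZK and deepstore copy of segment: %s. "
            + "Expected crc from ZK: %s, crc from deepstore: %s",
        segmentName, crcFromZk, crcFromDeepStore);
  }
}
```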