[FLINK-38408][checkpoint] Complete the checkpoint CompletableFuture after updating statistics to ensure semantic correctness and prevent test failure #27050

1996fanrui · 2025-09-26T15:27:00Z

What is the purpose of the change

MapStateNullValueCheckpointingITCase failed with "No checkpoint was created yet".

Root Cause Analysis

Problem Location

Log analysis revealed that the checkpoint had actually completed successfully:

07:19:37,522 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job b809cf46d67c23697786fd514565c737 (4464 bytes, checkpointDuration=45 ms, finalizationTime=4 ms)

However, the test code could not find the completed checkpoint when calling CommonTestUtils.getLatestCompletedCheckpointPath().

Root Cause

The problem occurs in the execution order of the CheckpointCoordinator.completePendingCheckpoint() method:

flink/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

Line 1389 in 39a4628

reportCompletedCheckpoint(completedCheckpoint);

pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);
reportCompletedCheckpoint(completedCheckpoint);

Checkpoint Coordinator mechanism:

A: pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint) completes the completion future first.
B: reportCompletedCheckpoint(completedCheckpoint) updates checkpoint statistics.

Test code timeline:

C: Detect future completion
D: Call getLatestCompletedCheckpointPath() immediately

Usually the execution sequence is A -> B -> C -> D, which works well.

The bug occurs when the execution sequence is A -> C -> D -> B: the test queries the latest completed checkpoint before the statistics have been updated.

Reproduction Method

Inserting Thread.sleep(100) between complete() and reportCompletedCheckpoint() in the completePendingCheckpoint() method reproduces this issue 100% of the time.
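The interleaving can also be forced without a sleep. The following is a minimal sketch, not Flink code: `CompletionOrderRace`, `Stats`, and the `"chk-1"` path are hypothetical stand-ins for the coordinator, the checkpoint statistics, and a checkpoint path. A latch holds step B back until the observer has finished step D, so the A -> C -> D -> B sequence happens on every run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the race; all names here are hypothetical, not Flink classes. */
public class CompletionOrderRace {

    /** Stands in for the checkpoint statistics that the test reads in step D. */
    static final class Stats {
        private final AtomicReference<String> latestPath = new AtomicReference<>();

        void report(String path) {
            latestPath.set(path);
        }

        String latestCompletedPath() {
            return latestPath.get();
        }
    }

    /**
     * Runs the old order (A, then B) with the interleaving A -> C -> D -> B
     * forced by a latch, and returns what the observer saw in step D.
     */
    static String observeWithOldOrder() throws Exception {
        Stats stats = new Stats();
        CompletableFuture<String> completionFuture = new CompletableFuture<>();
        CountDownLatch observerDone = new CountDownLatch(1);
        ExecutorService coordinator = Executors.newSingleThreadExecutor();
        try {
            Future<?> coordinatorTask = coordinator.submit(() -> {
                completionFuture.complete("chk-1"); // A: notify waiters first
                observerDone.await();               // hold B back until D has run
                stats.report("chk-1");              // B: update statistics
                return null;
            });
            completionFuture.get(10, TimeUnit.SECONDS); // C: detect completion
            String seen = stats.latestCompletedPath();  // D: read statistics
            observerDone.countDown();
            coordinatorTask.get(10, TimeUnit.SECONDS);
            return seen;
        } finally {
            coordinator.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints "old order, step D saw: null"
        System.out.println("old order, step D saw: " + observeWithOldOrder());
    }
}
```

The observer gets `null` even though the future has completed, which corresponds to the "No checkpoint was created yet" symptom in this toy model.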

Brief change log: Adjust the execution order in CheckpointCoordinator

[FLINK-38408][checkpoint] Complete the checkpoint CompletableFuture after updating statistics to ensure semantic correctness and prevent test failure

Changes:

// Update statistics first 
reportCompletedCheckpoint(completedCheckpoint);
// Complete the future later
pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);

Benefits:

Eliminates the race between completing the future and updating the checkpoint statistics
Ensures semantic correctness: waiting parties are notified only when the checkpoint is fully processed
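As a rough illustration of the reordered sequence, here is a minimal sketch under stated assumptions: `ReportThenComplete` and the `"chk-N"` paths are hypothetical, and an `AtomicReference` stands in for the checkpoint statistics. Because the statistics are written before the future is completed, every waiter that sees the future complete also sees the statistics.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the fixed order; names are hypothetical, not Flink classes. */
public class ReportThenComplete {

    /** Returns how many rounds observed stale statistics after the future completed. */
    static int countMissedObservations(int rounds) throws Exception {
        ExecutorService coordinator = Executors.newSingleThreadExecutor();
        int missed = 0;
        try {
            for (int i = 0; i < rounds; i++) {
                AtomicReference<String> latestPath = new AtomicReference<>();
                CompletableFuture<String> completionFuture = new CompletableFuture<>();
                String path = "chk-" + i;
                coordinator.execute(() -> {
                    latestPath.set(path);            // B: update statistics first
                    completionFuture.complete(path); // A: then notify waiters
                });
                String completed = completionFuture.get(10, TimeUnit.SECONDS); // C
                if (!completed.equals(latestPath.get())) {                      // D
                    missed++;
                }
            }
        } finally {
            coordinator.shutdownNow();
        }
        return missed;
    }

    public static void main(String[] args) throws Exception {
        // prints "missed observations: 0"
        System.out.println("missed observations: " + countMissedObservations(1_000));
    }
}
```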

Verifying this change

Added testCompletionFutureCompletesAfterReporting

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

flinkbot · 2025-09-26T15:32:30Z

CI report:

b0e8240 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

Izeren

Thank you for the change, @1996fanrui. Overall LGTM; my main concern is potential test flakiness. PTAL.

Izeren · 2025-10-05T18:14:27Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

                lastSubsumed = null;
            }

-            pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);


I have concerns that a change like this could have impacts such as:

A deadlock or race condition if reportCompletedCheckpoint triggers any handler that also waits on the checkpoint future before its completion (unlikely in general, and it should be caught by existing tests).

Checkpoint completion will be slightly delayed, but reporting is a quick operation, so this doesn't seem critical.

If reporting throws an exception, the checkpoint will be completed exceptionally. Could we confirm that this behaviour matches the previous one?
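The third concern can be illustrated with a toy comparison. This is a sketch only: `ReportFailureOrdering` and both methods are hypothetical, and the `catch` block that calls `completeExceptionally` is an assumed error handler; whether the real completePendingCheckpoint() has equivalent handling is exactly what the question above asks to confirm.

```java
import java.util.concurrent.CompletableFuture;

/** Toy comparison of the two orders when reporting throws; names are hypothetical. */
public class ReportFailureOrdering {

    /** Old order: complete the future first, then report. */
    static CompletableFuture<String> completeThenReport(Runnable report, String checkpoint) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            future.complete(checkpoint);
            report.run();
        } catch (RuntimeException e) {
            // No effect: the future has already completed normally, so this returns false.
            future.completeExceptionally(e);
        }
        return future;
    }

    /** New order: report first, then complete the future. */
    static CompletableFuture<String> reportThenComplete(Runnable report, String checkpoint) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            report.run();
            future.complete(checkpoint);
        } catch (RuntimeException e) {
            // Assumed handler: without it the future would never complete at all.
            future.completeExceptionally(e);
        }
        return future;
    }

    public static void main(String[] args) {
        Runnable failingReport = () -> {
            throw new IllegalStateException("stats unavailable");
        };
        // prints "false": waiters saw a normal completion despite the failed report
        System.out.println(completeThenReport(failingReport, "chk-1").isCompletedExceptionally());
        // prints "true": waiters see the failure
        System.out.println(reportThenComplete(failingReport, "chk-1").isCompletedExceptionally());
    }
}
```

In this model the reordering changes what waiters observe when reporting fails, so the surrounding error handling is worth checking in both versions.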

Izeren · 2025-10-05T18:16:31Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+                            }
+                        });
+
+        assertThat(tracker.getReportStartedFuture().get(20, TimeUnit.SECONDS))


This is likely to end up being a flaky test. A test in CI can freeze for 15 minutes or more, so a 20-second timeout may not be sufficient in general.
I suggest waiting indefinitely, or using a timeout of at least a few hours.

Izeren · 2025-10-05T18:26:07Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+                .as("reportCompletedCheckpoint should be started soon when checkpoint is acked.")
+                .isNull();
+
+        for (int i = 0; i < 30; i++) {


As above, I am not sure you can tell whether the expected change did not occur because the thread was blocked or because the corresponding thread was simply inactive. It would be better to wait indefinitely here.
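One way to avoid both short timeouts and polling loops is to let the tracker itself establish the ordering. The sketch below is an assumption about how such a test could be shaped, not the test in this PR: `BlockedReportCheck` and `BlockingTracker` are hypothetical, and only the idea of a "report started" future and a "report blocking" future is taken from the diff. Once the started future completes, the reporting thread is known to be parked inside the report call, so a single `isDone()` check is enough and every wait can be unbounded.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy test harness; names are hypothetical, not the classes used in the PR. */
public class BlockedReportCheck {

    /** Signals when reporting starts and blocks until the test releases it. */
    static final class BlockingTracker {
        final CompletableFuture<Void> reportStarted = new CompletableFuture<>();
        final CompletableFuture<Void> reportBlocking = new CompletableFuture<>();

        void reportCompletedCheckpoint() {
            reportStarted.complete(null);
            reportBlocking.join(); // parks the reporting thread until released
        }
    }

    /** Returns true if the completion future was still pending while reporting was blocked. */
    static boolean futurePendingWhileReportBlocked() {
        BlockingTracker tracker = new BlockingTracker();
        CompletableFuture<String> completionFuture = new CompletableFuture<>();
        ExecutorService coordinator = Executors.newSingleThreadExecutor();
        try {
            coordinator.execute(() -> {
                tracker.reportCompletedCheckpoint(); // report first
                completionFuture.complete("chk-1");  // complete afterwards
            });
            tracker.reportStarted.join();            // wait without a timeout
            boolean pending = !completionFuture.isDone();
            tracker.reportBlocking.complete(null);   // release the reporting thread
            completionFuture.join();                 // wait without a timeout
            return pending;
        } finally {
            coordinator.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // prints "true"
        System.out.println(futurePendingWhileReportBlocked());
    }
}
```

With the old order, the future would already be done by the time the started signal fires, so the same check would return false without any sleeping. A hang would then surface through the build-level timeout rather than a per-call one.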

Izeren · 2025-10-05T18:26:57Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+
+        tracker.getReportBlockingFuture().complete(null);
+
+        CompletedCheckpoint result = checkpointFuture.get(5, TimeUnit.SECONDS);


Izeren · 2025-10-05T18:27:02Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

+                .as("Checkpoint future should complete after reportCompletedCheckpoint finishes")
+                .isNotNull();
+
+        ackTask.get(5, TimeUnit.SECONDS);


1996fanrui force-pushed the 38408/no-checkpoint branch from dc2613d to 9afe37c Compare September 29, 2025 16:05

1996fanrui marked this pull request as ready for review September 29, 2025 18:54

1996fanrui marked this pull request as draft September 30, 2025 09:07

1996fanrui force-pushed the 38408/no-checkpoint branch from 9afe37c to 736fe94 Compare October 2, 2025 09:00

[FLINK-38408][checkpoint] Complete the checkpoint CompletableFuture after updating statistics to ensure semantic correctness and prevent test failure

b0e8240

1996fanrui force-pushed the 38408/no-checkpoint branch from 736fe94 to b0e8240 Compare October 2, 2025 09:06

1996fanrui marked this pull request as ready for review October 2, 2025 09:13

Izeren reviewed Oct 5, 2025

github-actions bot added the community-reviewed PR has been reviewed by the community. label Oct 7, 2025

3 participants

