
Conversation

cjen1-msft
Contributor

We have CheckQuorum to ensure that a leader steps down when it is no longer a good leader; this acts as a liveness probe of the system at large.

Our CheckQuorum condition is that the leader has a committing quorum in any configuration.
Our commit condition is that the leader has a committing quorum in every configuration.
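
To make the distinction concrete, here is a minimal sketch of the two conditions, using hypothetical Configuration and live-node types rather than the actual raft.h structures:

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical types for illustration; the real raft.h structures differ.
using NodeId = std::size_t;
struct Configuration
{
  std::set<NodeId> nodes;
};

// A configuration has a committing quorum if a majority of its nodes are live.
bool has_quorum(const Configuration& config, const std::set<NodeId>& live)
{
  std::size_t live_count = 0;
  for (const auto& node : config.nodes)
  {
    if (live.count(node) > 0)
    {
      live_count++;
    }
  }
  return live_count >= config.nodes.size() / 2 + 1;
}

// Old CheckQuorum condition: a committing quorum in ANY active configuration.
bool quorum_in_any(
  const std::vector<Configuration>& configs, const std::set<NodeId>& live)
{
  return std::any_of(configs.begin(), configs.end(), [&](const auto& c) {
    return has_quorum(c, live);
  });
}

// Commit condition (and the fixed CheckQuorum): a quorum in EVERY configuration.
bool quorum_in_every(
  const std::vector<Configuration>& configs, const std::set<NodeId>& live)
{
  return std::all_of(configs.begin(), configs.end(), [&](const auto& c) {
    return has_quorum(c, live);
  });
}

In the scenario described below, the partitioned leader sees only itself as live: that is a quorum of the pending single-node configuration but not of the original three-node one, so the any-of check keeps it as leader while the every-of check makes it step down.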

This PR includes a scenario (check_quorum_2) which demonstrates the difference between these two conditions.

Suppose we have a 3-node cluster with n0 initially the leader, then retire both n1 and n2 but do not replicate this change to them.
Then we partition n0.

Under the current CheckQuorum condition n0 still counts as a good leader, yet it is unable to commit, while the remainder of the cluster elects a replacement leader and continues to function.

The change to raft.h makes the condition explicit and fixes this behaviour.

Note

The other direction of this bug would be more severe (going from n0 to n0,n1,n2, check_quorum_3), with n0 being inaccessible but staying alive due to CheckQuorum.
However, the backups are unable to elect themselves without a vote from n0, so this isn't a problem in practice.

backup_ack_timeout_count++;
auto search = all_other_nodes.find(node.first);
if (
  search == all_other_nodes.end() ||
Member

This is always adding one to the live nodes in the config; is that right? In an atomic reconfiguration situation, the next configuration may not include the next primary, so including it in the count there does not seem obviously correct.

Contributor Author

As discussed: all_other_nodes contains every node, in any configuration, other than the primary. So if a node in a configuration is not in all_other_nodes, it must be the primary, and hence is live.

I've added a comment to this effect.
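
To make that counting argument concrete, here is a minimal sketch of a per-configuration live-node count, using hypothetical names (other_node_is_live standing in for whatever per-backup liveness bookkeeping raft.h actually keeps); it is not the real implementation:

#include <cstddef>
#include <map>
#include <set>

using NodeId = std::size_t; // hypothetical, as in the sketch above

// all_other_nodes-style map: every node in any configuration except the
// primary, mapped here to a hypothetical "is live" flag.
std::size_t count_live_in_config(
  const std::set<NodeId>& config_nodes,
  const std::map<NodeId, bool>& other_node_is_live)
{
  std::size_t live = 0;
  for (const auto& node : config_nodes)
  {
    auto search = other_node_is_live.find(node);
    if (search == other_node_is_live.end())
    {
      // Not tracked as an "other" node, so this must be the primary itself,
      // which is trivially live: this is the "always adding one" above.
      live++;
    }
    else if (search->second)
    {
      live++;
    }
  }
  return live;
}

In this sketch, if the primary is not a member of a given configuration it simply never appears in config_nodes, so nothing is added for it there, which matches the reply above.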

Comment on lines 11 to 17
swap_nodes,2,out,1,2

disconnect_node,0

periodic_one,0,100
periodic_one,1,100
dispatch_all
Member

Add comments and asserts if possible. We retire nodes 1 and 2, but 0 fails to replicate this update to the other nodes.

Contributor Author

Updated this, let me know if you think there are enough. More are easy enough to add :)

@cjen1-msft cjen1-msft marked this pull request as ready for review October 20, 2025 14:28
@cjen1-msft cjen1-msft requested a review from a team as a code owner October 20, 2025 14:28
@Copilot Copilot AI review requested due to automatic review settings October 20, 2025 14:28
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR addresses a bug in the CheckQuorum implementation where the leader's quorum check condition was inconsistent with the commit condition. The change ensures that a leader only remains active if it has a committing quorum in every active configuration, not just any configuration. This prevents scenarios where a leader believes it has quorum but is actually unable to commit entries.

Key Changes:

  • Modified CheckQuorum logic in raft.h to require quorum in all configurations instead of any configuration
  • Added test scenarios demonstrating the old incorrect behavior and validating the fix
  • Introduced new test assertions (assert_config, assert_missing_config) to verify configuration state in tests

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

src/consensus/aft/raft.h: Changed CheckQuorum from OR logic (quorum in any config) to AND logic (quorum in every config) using std::all_of
src/consensus/aft/test/driver.h: Added assert_config and assert_missing_config helper methods for validating node configurations in tests
src/consensus/aft/test/driver.cpp: Registered the new assertion commands in the test driver command dispatcher
tests/raft_scenarios/check_quorum_0_012: Test scenario demonstrating the transition from a single node to a three-node cluster with CheckQuorum behavior
tests/raft_scenarios/check_quorum_012_0: Test scenario demonstrating retirement of nodes while partitioned, verifying correct leader stepdown

@cjen1-msft
Contributor Author

Getting trace validation to work required reducing the constraints on IsBecomeFollower, for the following reason:

Actions from raft.h trace:

  • receive append_entries_request (no change)
  • become follower (update term, follower, rollback uncommittable)
  • <Don't emit any event> Roll back to match incoming append entries
  • execute_append_entries_sync (apply changes)
  • send_append_entries

Required actions from Traceccfraft.tla's point of view:

  • IsReceiveAppendEntriesRequest (UpdateTerm & ExecuteAppendEntries => update term, follower, rollback uncommittable, apply changes)
  • IsBecomeFollower (no change)
  • IsExecuteAppendEntries (no change)
  • IsSendAppendEntriesResponse (no change)

So if IsBecomeFollower asserts that the spec state matches the raft.h state and that it is unchanged, then IsReceiveAppendEntriesRequest's UpdateTerm will conflict with this.

Hence, to fix this specific issue, we should not roll back the log or increment the term until the message is sent.
We could also argue that we shouldn't constrain the membershipState, as it may not be stable yet.
