
Conversation

spraza
Collaborator

@spraza spraza commented Sep 26, 2025

Description

During recovery's initializing_transaction_servers phase, the CC sends init messages to recruits and waits as follows:

    auto reply = transformErrors(
        throwErrorOr(workers[nextRouter].logRouter.getReplyUnlessFailedFor(
            req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY)),
        cluster_recovery_failed());

This is a common pattern in the recovery recruitment and initialization code. Only if there is a network issue between the CC and the worker, and the failure monitor marks the endpoint as failed, does the TLOG_TIMEOUT (400 ms) apply. However, there are cases where the network is healthy but the worker has some other issue (a logical bug, being busy, being stuck, etc.) that prevents it from responding. In such cases, the CC waits indefinitely for the worker. This leads to a long or stuck recovery and to cluster unavailability, since recruitment happens before the accepting_commits phase.

The fix is simply to add a sufficiently high timeout from the CC's perspective: if the worker does not respond within X seconds, the CC gives up and restarts recovery.
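
A minimal sketch of the wrapped call, assuming the timeout knob is the CC_RECOVERY_INIT_REQ_TIMEOUT name that appears in the review discussion below (the exact knob name in the final code may differ):

    auto reply = transformErrors(
        throwErrorOr(timeoutError(
            workers[nextRouter].logRouter.getReplyUnlessFailedFor(
                req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY),
            SERVER_KNOBS->CC_RECOVERY_INIT_REQ_TIMEOUT)), // CC-side cap on how long to wait for the init reply
        cluster_recovery_failed());

transformErrors() then converts the timed_out error, like any other error here, into cluster_recovery_failed(), which restarts recovery.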

This PR is scoped to LogRouter initialization. I'll send a follow-up PR to cover other roles.

Testing

To exercise the timeout behavior, I tried various approaches to mimic the case where the network is healthy but the LogRouter has an issue that prevents it from responding. The approach used here is to have the LogRouter simply drop the request. It may not be a realistic condition, but the goal was to assume that some issue can happen and that, from the CC's perspective, there should therefore be a timeout threshold after which it gives up; dropping the request is just a means to that end to verify that the timeout works. With this failure mode and without the fix, 100K does not pass; in fact, we get 10 failures around the ~1K mark.
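
A sketch of the simulation-only drop helper; only the first few lines of skipInitRspInSim() appear in the review diff further down, so the rest of the body and the call site here are assumptions rather than the exact code in this PR:

    bool skipInitRspInSim() {
        // In simulation only, occasionally (via BUGGIFY) decide to drop the init request.
        const bool skip = g_network->isSimulated() && BUGGIFY_WITH_PROB(0.01 /* 1% */);
        if (skip) {
            TraceEvent("SkipInitRspInSimTrue");
        }
        return skip;
    }

    // Hypothetical call site in the LogRouter's init-request handling: return without
    // replying, so the CC-side timeout is the only thing that can unblock recovery.
    // if (skipInitRspInSim()) {
    //     return Void();
    // }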

And then 100K with the fix passes:

20250926-030021-praza-timeout-behavior-step-c09fd7600e93c39e compressed=True data_size=40514802 duration=5468155 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:45 sanity=False started=100000 stopped=20250926-040006 submitted=20250926-030021 timeout=5400 username=praza-timeout-behavior-step3-finalfix-v2-91aa98b4b1f492a1a921750ad0420eaa94282540

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@spraza spraza marked this pull request as draft September 26, 2025 06:20
@spraza spraza marked this pull request as ready for review September 26, 2025 06:39
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 0:24:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 1b24b43
  • Duration 0:40:19
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 0:26:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 1b24b43
  • Duration 0:49:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza spraza marked this pull request as draft September 26, 2025 07:15
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 96fb6b7
  • Duration 0:48:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 1b24b43
  • Duration 1:06:23
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 1:12:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 1:12:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 1:16:32
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:03:44
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:04:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:12:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

    auto reply = transformErrors(
        throwErrorOr(workers[nextRouter].logRouter.getReplyUnlessFailedFor(
            req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY)),
        throwErrorOr(timeoutError(
Contributor

@dlambrig dlambrig Sep 26, 2025

We should emit a log message when we time out:

        TraceEvent("LogRouterInitTimeout")
            .detail("RequestIndex", i)
            .detail("Worker", workers[nextRouter].id())
            .detail("RecoveryCount", recoveryCount);

If this is added in more places in a future PR, maybe it's worth modifying timeoutError to accept a lambda, something like:

timeoutError(
    getReplyUnlessFailedFor(...),
    SERVER_KNOBS->CC_RECOVERY_INIT_REQ_TIMEOUT,
    [=]() {
        TraceEvent("LogRouterInitTimeout")
            .detail("RequestIndex", i)
            .detail("Worker", workers[nextRouter].id())
            .detail("RecoveryCount", recoveryCount);
        return cluster_recovery_failed();
    }
)

Collaborator Author

Yes, I was also thinking something like that, e.g. adding optional context to timeoutError. I was planning to do this in a follow-up PR. But maybe this can be implemented in a different way if we go with Jingyu's idea here: #12396 (review).

    bool skipInitRspInSim() {
        const bool skip = g_network->isSimulated() && BUGGIFY_WITH_PROB(0.01 /* 1% */);
        if (skip) {
            TraceEvent("SkipInitRspInSimTrue");
Contributor

include interf.id() in log?

Contributor

@jzhou77 jzhou77 left a comment

I am thinking instead of fixing each individual recruitment, maybe we can add a monitoring actor, which starts with the whole recruitment step and is cancelled if recruitment finishes within a threshold (e.g., 120s). If not, the actor throws an error to trigger a recovery.

Note we don't want to generate too many generations. I.e., we will need to be careful to cancel it before writing to coordinator states, or before recovery transaction.
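
A minimal sketch of what such a monitoring actor could look like; the name, the choose-based structure, and the 120 s value are illustrative, not code from this PR:

    ACTOR Future<Void> recruitmentWatchdog(Future<Void> recruitmentDone, double limitSeconds) {
        choose {
            when(wait(recruitmentDone)) {
                // Recruitment finished within the threshold; nothing to do.
                return Void();
            }
            when(wait(delay(limitSeconds))) {
                // Recruitment is stuck; surface it and force a new recovery.
                TraceEvent(SevWarnAlways, "RecruitmentInitTimeout").detail("LimitSeconds", limitSeconds);
                throw cluster_recovery_failed();
            }
        }
    }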

@spraza spraza closed this Sep 26, 2025
@spraza spraza reopened this Sep 26, 2025
@spraza
Collaborator Author

spraza commented Sep 26, 2025

I switched to draft mode because 100K on the latest commit (which only added some extra comments) showed some failures in the joshua summary. But I don't see them when I open the joshua logs, so it could be an infrastructure issue. I will rerun 100K.

@spraza
Collaborator Author

spraza commented Sep 26, 2025

I am thinking instead of fixing each individual recruitment, maybe we can add a monitoring actor, which starts with the whole recruitment step and is cancelled if recruitment finishes within a threshold (e.g., 120s). If not, the actor throws an error to trigger a recovery.

Note we don't want to generate too many generations. I.e., we will need to be careful to cancel it before writing to coordinator states, or before recovery transaction.

I like the simplicity and elegance of the idea (a global timeout monitor for the entire initializing_transaction_servers phase). I'm tempted by it, but thinking out loud about some practical tradeoffs:

  1. How to pick this global timeout. If we pick it too high, e.g. 10 minutes, we are accepting 10 minutes of unavailability. If we pick it too low (10 seconds), we may not give recovery enough breathing room, and it goes into a repeating cyclic state. I am not sure old generations will build up, because we're limiting this to initializing_transaction_servers, which is before recovery_transaction and writing_coordinated_state. Say we pick 2 minutes; it's hard to say that won't run into the cyclic issue (i.e. recovery would have completed in 2 minutes + delta, but we restarted too early, and this keeps repeating). Inherently the challenge is that we're trying to pick a threshold for the entire initializing_transaction_servers phase, which is harder than picking one for a single init RPC (which we know doesn't do much). And the tradeoff is unavailability, so we can't go too high either.

  2. I was planning to add fine-grained observability in a follow-up PR so that whenever we time out, there is attached context explaining the reason for the timeout (e.g. LogRouter timed out, TLog timed out, etc.). But thinking about this more, maybe it is less of a concern, because I do have completion trace events (LogRouter completed, TLog completed); using those and the source code, we should be able to derive what was timing out. It's a few more hops in an investigation, but it shouldn't slow us down too much.

(2) does not seem concerning after thinking it through. (1) we can discuss offline. Overall, this approach is tempting, so hopefully we can make it work.

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 0:23:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 96fb6b7
  • Duration 0:40:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 96fb6b7
  • Duration 0:48:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 0:54:06
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:00:57
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 96fb6b7
  • Duration 1:06:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:12:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@spraza
Collaborator Author

spraza commented Sep 26, 2025


Discussed with @jzhou77. Overall, I think the idea is good because it's simple and also a "catch all", so we won't miss any places where we forget to add timeouts, especially as people add new code to recovery. The concern I mentioned in (1) is valid, i.e. cyclic recovery. To address it, one idea we discussed is to have a base timeout and then, if we restart recovery, make the next timeout higher by some factor (exponential backoff or some other scheme). We already seem to use this pattern for the failure monitor timeout via MASTER_FAILURE_SLOPE_DURING_RECOVERY. Flow transport also uses this pattern when retrying unresponsive peers.

@dlambrig, @sbodagala - I'll try an implementation of this approach, but meanwhile let me know if you have any concerns.
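
For reference, a rough sketch of the escalating-threshold idea; the function name, base, cap, and doubling factor are all hypothetical:

    #include <algorithm>
    #include <cmath>

    // Grow the recruitment timeout with each consecutive recovery restart, capped so
    // the unavailability window cannot grow without bound.
    double recruitmentTimeoutForAttempt(int consecutiveRestarts) {
        const double base = 120.0;       // illustrative base threshold (2 minutes)
        const double maxTimeout = 900.0; // illustrative upper cap (15 minutes)
        return std::min(maxTimeout, base * std::pow(2.0, std::min(consecutiveRestarts, 3)));
    }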

@spraza spraza changed the title Add timeouts to recovery log router recruitment Add timeouts to recovery log router initialization Sep 26, 2025