
Conversation

spraza
Collaborator

@spraza spraza commented Sep 26, 2025

Description

During recovery's initializing_transaction_servers phase, the CC sends init messages to recruits and waits as follows:

    auto reply = transformErrors(
        throwErrorOr(workers[nextRouter].logRouter.getReplyUnlessFailedFor(
            req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY)),
        cluster_recovery_failed());

This is a common pattern in the recovery recruitment and initialization code. Only if there is a network issue between the CC and the worker, and the failure monitor marks the endpoint as failed, does the TLOG_TIMEOUT (400 ms) apply. However, there are cases where the network is healthy but the worker has some other issue (a logical bug, being busy, being stuck, etc.) that prevents it from responding. In such cases, the CC waits indefinitely for the worker. This leads to a long or stuck recovery and to cluster unavailability, since recruitment happens before the accepting_commits phase.

The fix is simply to add a sufficiently high timeout from the CC's perspective: if the worker does not respond within X seconds, the CC gives up and restarts recovery.
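
A minimal sketch of the wrapped call, assuming the timeout knob is the CC_RECOVERY_INIT_REQ_TIMEOUT name that appears in the review discussion below (the exact knob name in the final code may differ):

    auto reply = transformErrors(
        throwErrorOr(timeoutError(
            workers[nextRouter].logRouter.getReplyUnlessFailedFor(
                req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY),
            SERVER_KNOBS->CC_RECOVERY_INIT_REQ_TIMEOUT)), // CC-side cap on how long to wait for the init reply
        cluster_recovery_failed());

transformErrors() then converts the timed_out error, like any other error here, into cluster_recovery_failed(), which restarts recovery.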

This PR is scoped to LogRouter initialization. I'll send a follow-up PR to cover other roles.

Testing

To exercise the timeout behavior, I tried various approaches to mimic the case where the network is healthy but the LogRouter has an issue that prevents it from responding. The approach used here is to have the LogRouter simply drop the request. It may not be a realistic condition, but the goal was to assume that some issue can happen and that, from the CC's perspective, there should therefore be a timeout threshold after which it gives up; dropping the request is just a means to that end to verify that the timeout works. With this failure mode and without the fix, 100K does not pass; in fact, we get 10 failures around the ~1K mark.
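
A sketch of the simulation-only drop helper; only the first few lines of skipInitRspInSim() appear in the review diff further down, so the rest of the body and the call site here are assumptions rather than the exact code in this PR:

    bool skipInitRspInSim() {
        // In simulation only, occasionally (via BUGGIFY) decide to drop the init request.
        const bool skip = g_network->isSimulated() && BUGGIFY_WITH_PROB(0.01 /* 1% */);
        if (skip) {
            TraceEvent("SkipInitRspInSimTrue");
        }
        return skip;
    }

    // Hypothetical call site in the LogRouter's init-request handling: return without
    // replying, so the CC-side timeout is the only thing that can unblock recovery.
    // if (skipInitRspInSim()) {
    //     return Void();
    // }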

And then 100K with the fix passes:

20250926-030021-praza-timeout-behavior-step-c09fd7600e93c39e compressed=True data_size=40514802 duration=5468155 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:45 sanity=False started=100000 stopped=20250926-040006 submitted=20250926-030021 timeout=5400 username=praza-timeout-behavior-step3-finalfix-v2-91aa98b4b1f492a1a921750ad0420eaa94282540

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@spraza spraza marked this pull request as draft September 26, 2025 06:20
@spraza spraza marked this pull request as ready for review September 26, 2025 06:39
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 0:24:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 1b24b43
  • Duration 0:40:19
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 0:26:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 1b24b43
  • Duration 0:49:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza spraza marked this pull request as draft September 26, 2025 07:15
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 96fb6b7
  • Duration 0:48:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 1b24b43
  • Duration 1:06:23
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 1:12:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 1:12:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 1b24b43
  • Duration 1:16:32
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:03:44
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:04:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:12:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

    auto reply = transformErrors(
        throwErrorOr(workers[nextRouter].logRouter.getReplyUnlessFailedFor(
            req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY)),
        throwErrorOr(timeoutError(
Contributor

@dlambrig dlambrig Sep 26, 2025

We should emit a log message when we time out:

        TraceEvent("LogRouterInitTimeout")
            .detail("RequestIndex", i)
            .detail("Worker", workers[nextRouter].id())
            .detail("RecoveryCount", recoveryCount);

If this is added in more places in a future PR, maybe it's worth modifying timeoutError to accept a lambda, something like:

timeoutError(
    getReplyUnlessFailedFor(...),
    SERVER_KNOBS->CC_RECOVERY_INIT_REQ_TIMEOUT,
    [=]() {
        TraceEvent("LogRouterInitTimeout")
            .detail("RequestIndex", i)
            .detail("Worker", workers[nextRouter].id())
            .detail("RecoveryCount", recoveryCount);
        return cluster_recovery_failed();
    }
)

Collaborator Author

Yes, I was also thinking something like that, e.g. adding optional context to timeoutError. I was planning to do this in a follow-up PR. But maybe this can be implemented in a different way if we go with Jingyu's idea here: #12396 (review).

    bool skipInitRspInSim() {
        const bool skip = g_network->isSimulated() && BUGGIFY_WITH_PROB(0.01 /* 1% */);
        if (skip) {
            TraceEvent("SkipInitRspInSimTrue");
Contributor

include interf.id() in log?

Contributor

@jzhou77 jzhou77 left a comment

I am thinking instead of fixing each individual recruitment, maybe we can add a monitoring actor, which starts with the whole recruitment step and is cancelled if recruitment finishes within a threshold (e.g., 120s). If not, the actor throws an error to trigger a recovery.

Note we don't want to generate too many generations. I.e., we will need to be careful to cancel it before writing to coordinator states, or before recovery transaction.
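
A minimal sketch of what such a monitoring actor could look like; the name, the choose-based structure, and the 120 s value are illustrative, not code from this PR:

    ACTOR Future<Void> recruitmentWatchdog(Future<Void> recruitmentDone, double limitSeconds) {
        choose {
            when(wait(recruitmentDone)) {
                // Recruitment finished within the threshold; nothing to do.
                return Void();
            }
            when(wait(delay(limitSeconds))) {
                // Recruitment is stuck; surface it and force a new recovery.
                TraceEvent(SevWarnAlways, "RecruitmentInitTimeout").detail("LimitSeconds", limitSeconds);
                throw cluster_recovery_failed();
            }
        }
    }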

@spraza spraza closed this Sep 26, 2025
@spraza spraza reopened this Sep 26, 2025
@spraza
Collaborator Author

spraza commented Sep 26, 2025

I switched to draft mode because 100K on the latest commit (which only added some extra comments) showed some failures in the joshua summary. But I don't see them when I open the joshua logs, so it could be an infrastructure issue. I will rerun 100K.

@spraza
Collaborator Author

spraza commented Sep 26, 2025

I am thinking instead of fixing each individual recruitment, maybe we can add a monitoring actor, which starts with the whole recruitment step and is cancelled if recruitment finishes within a threshold (e.g., 120s). If not, the actor throws an error to trigger a recovery.

Note we don't want to generate too many generations. I.e., we will need to be careful to cancel it before writing to coordinator states, or before recovery transaction.

I like the simplicity and elegance of the idea (a global timeout monitor for the entire initializing_transaction_servers phase). I'm tempted by it, but thinking out loud about some practical tradeoffs:

  1. How to pick this global timeout. If we pick it too high, e.g. 10 minutes, we are accepting 10 minutes of unavailability. If we pick it too low (10 seconds), we may not give recovery enough breathing room, and it goes into a repeating cyclic state. I am not sure old generations will build up, because we're limiting this to initializing_transaction_servers, which is before recovery_transaction and writing_coordinated_state. Say we pick 2 minutes; it's hard to say that won't run into the cyclic issue (i.e. recovery would have completed in 2 minutes + delta, but we restarted too early, and this keeps repeating). Inherently the challenge is that we're trying to pick a threshold for the entire initializing_transaction_servers phase, which is harder than picking one for a single init RPC (which we know doesn't do much). And the tradeoff is unavailability, so we can't go too high either.

  2. I was planning to add fine-grained observability in a follow-up PR so that whenever we time out, there is attached context explaining the reason for the timeout (e.g. LogRouter timed out, TLog timed out, etc.). But thinking about this more, maybe it is less of a concern, because I do have completion trace events (LogRouter completed, TLog completed); using those and the source code, we should be able to derive what was timing out. It's a few more hops in an investigation, but it shouldn't slow us down too much.

(2) does not seem concerning after thinking it through. (1) we can discuss offline. Overall, this approach is tempting, so hopefully we can make it work.

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 0:23:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 96fb6b7
  • Duration 0:40:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 96fb6b7
  • Duration 0:48:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 0:54:06
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:00:57
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 96fb6b7
  • Duration 1:06:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 96fb6b7
  • Duration 1:12:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@spraza
Collaborator Author

spraza commented Sep 26, 2025


Discussed with @jzhou77. Overall, I think the idea is good because it's simple and also a "catch all", so we won't miss any places where we forget to add timeouts, especially as people add new code to recovery. The concern I mentioned in (1) is valid, i.e. cyclic recovery. To address it, one idea we discussed is to have a base timeout and then, if we restart recovery, make the next timeout higher by some factor (exponential backoff or some other scheme). We already seem to use this pattern for the failure monitor timeout via MASTER_FAILURE_SLOPE_DURING_RECOVERY. Flow transport also uses this pattern when retrying unresponsive peers.

@dlambrig, @sbodagala - I'll try an implementation of this approach, but meanwhile let me know if you have any concerns.
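
For reference, a rough sketch of the escalating-threshold idea; the function name, base, cap, and doubling factor are all hypothetical:

    #include <algorithm>
    #include <cmath>

    // Grow the recruitment timeout with each consecutive recovery restart, capped so
    // the unavailability window cannot grow without bound.
    double recruitmentTimeoutForAttempt(int consecutiveRestarts) {
        const double base = 120.0;       // illustrative base threshold (2 minutes)
        const double maxTimeout = 900.0; // illustrative upper cap (15 minutes)
        return std::min(maxTimeout, base * std::pow(2.0, std::min(consecutiveRestarts, 3)));
    }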

@spraza spraza changed the title Add timeouts to recovery log router recruitment Add timeouts to recovery log router initialization Sep 26, 2025