Add timeouts to recovery log router initialization #12396
Conversation
auto reply = transformErrors(
    throwErrorOr(workers[nextRouter].logRouter.getReplyUnlessFailedFor(
        req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY)),
    throwErrorOr(timeoutError(
We should log a trace event when we time out, something like:
TraceEvent("LogRouterInitTimeout")
    .detail("RequestIndex", i)
    .detail("Worker", workers[nextRouter].id())
    .detail("RecoveryCount", recoveryCount);
If this is added in more places in a future PR, maybe it's worth modifying timeoutError to accept a lambda, something like:
timeoutError(
    getReplyUnlessFailedFor(...),
    SERVER_KNOBS->CC_RECOVERY_INIT_REQ_TIMEOUT,
    [=]() {
        TraceEvent("LogRouterInitTimeout")
            .detail("RequestIndex", i)
            .detail("Worker", workers[nextRouter].id())
            .detail("RecoveryCount", recoveryCount);
        return cluster_recovery_failed();
    }
)
Yes, I was also thinking something like that, e.g. adding optional context to timeoutError. I was planning to do this in a follow-up PR. But maybe this can be implemented in a different way if we go with Jingyu's idea here: #12396 (review).
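One possible shape for that helper, mirroring the structure of the existing timeoutError; this is only a sketch, and timeoutErrorWithContext is a hypothetical name, not an existing API:

ACTOR template <class T, class OnTimeout>
Future<T> timeoutErrorWithContext(Future<T> what, double time, OnTimeout onTimeout) {
    Future<Void> end = delay(time);
    choose {
        when(T t = wait(what)) {
            return t;
        }
        when(wait(end)) {
            // The callback logs whatever call-site context it captured and returns the Error to throw.
            throw onTimeout();
        }
    }
}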
bool skipInitRspInSim() {
    const bool skip = g_network->isSimulated() && BUGGIFY_WITH_PROB(0.01 /* 1% */);
    if (skip) {
        TraceEvent("SkipInitRspInSimTrue");
    }
    return skip;
}
Include interf.id() in the log?
I am thinking that instead of fixing each individual recruitment, maybe we can add a monitoring actor that starts with the whole recruitment step and is cancelled if recruitment finishes within a threshold (e.g., 120s). If not, the actor throws an error to trigger a recovery.
Note that we don't want to generate too many generations, i.e., we will need to be careful to cancel it before writing to coordinator states, or before the recovery transaction.
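A minimal sketch of such a watchdog actor (illustrative only, not code from this PR; the actor name and its parameters are assumptions):

ACTOR Future<Void> recruitmentWatchdog(double threshold, int64_t recoveryCount) {
    // The caller cancels this actor once recruitment finishes within the threshold.
    wait(delay(threshold));
    TraceEvent(SevWarnAlways, "RecruitmentTimedOut").detail("RecoveryCount", recoveryCount);
    // Throwing here propagates to the recovery actor and triggers a new recovery.
    throw cluster_recovery_failed();
}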
I switched to draft mode because a 100K run on the latest commit (which only adds some extra comments) showed some failures in the joshua summary. But I don't see them when I open the joshua logs, so it could be an infrastructure issue. I will rerun 100K.
I like the simplicity and elegance of the idea (a global timeout monitor for the entire initializing_transaction_servers phase). I'm tempted by it, but thinking out loud about some practical tradeoffs:
(2) does not seem concerning after thinking it through. (1) we can discuss offline. Overall, this approach is tempting, so hopefully we can make it work.
Discussed with @jzhou77. Overall, I think the idea is good because it's simple and also a "catch all", so we won't miss any places where we forget to add timeouts, especially when people add new code to recovery. The concern I mentioned in (1) is valid, i.e., cyclic recovery. To address that concern, one idea we discussed is to have a base timeout, and then if we restart recovery, the next timeout is higher by some factor (exponential backoff or some other scheme). We already seem to use this pattern for the failure monitor timeout. @dlambrig, @sbodagala - I'll try an implementation of this approach, but meanwhile let me know if you have any concerns.
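A minimal sketch of the escalating-timeout idea (an assumption about one possible shape, not code from this PR; the function name and the base/factor/cap values are hypothetical):

#include <algorithm>
#include <cmath>

// Timeout for the recruitment/initialization step, grown on each recovery restart
// so that repeated timeouts do not cause an endless cycle of recoveries.
double recoveryInitTimeout(int recoveryRestarts) {
    const double base = 120.0;   // starting timeout in seconds
    const double factor = 2.0;   // growth per restart
    const double cap = 1800.0;   // upper bound
    return std::min(cap, base * std::pow(factor, recoveryRestarts));
}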
Description
During recovery's initializing_transaction_servers phase, the CC sends init messages to recruits and waits as follows:
(see foundationdb/fdbserver/TagPartitionedLogSystem.actor.cpp, lines 2801 to 2804 at commit 269013d)
This is a common pattern in the recovery recruitment and initialization code. The TLOG_TIMEOUT (400ms) only applies if there's a network issue between the CC and the worker and the failure monitor marks the endpoint as failed. However, there can be cases where the network is healthy but the worker has some other issue (logical bug, busy, stuck, etc.) and therefore cannot respond. In such cases, the CC waits indefinitely for the worker. This leads to a long/stuck recovery and cluster unavailability, since recruitment happens before the accepting_commits phase.
The fix is simply to have a high enough timeout from the CC's perspective, i.e., if the worker does not respond within X seconds, the CC gives up and restarts recovery.
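A hedged reconstruction of what the wrapped call could look like, based on the diff excerpt above; the knob name CC_RECOVERY_INIT_REQ_TIMEOUT comes from the review discussion, and the exact arrangement is an assumption rather than the PR's final code:

auto reply = transformErrors(
    // timeoutError throws timed_out if the worker never replies; transformErrors then
    // converts any error into cluster_recovery_failed so the CC restarts recovery.
    throwErrorOr(timeoutError(
        workers[nextRouter].logRouter.getReplyUnlessFailedFor(
            req, SERVER_KNOBS->TLOG_TIMEOUT, SERVER_KNOBS->MASTER_FAILURE_SLOPE_DURING_RECOVERY),
        SERVER_KNOBS->CC_RECOVERY_INIT_REQ_TIMEOUT)),
    cluster_recovery_failed());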
This PR is scoped to LogRouter initialization. I'll send a follow-up PR to cover other roles.
Testing
To learn about the timeout behavior, I tried various approaches to mimic the case where the network is healthy but the LogRouter has an issue and cannot respond. The approach I am using here is that the LogRouter simply drops the request. It may not be a realistic condition, but the goal was to assume that some issue can happen and that, from the CC's perspective, there should be a timeout threshold after which it gives up. So this is just a means to an end in order to test that the timeout works. With this failure mode, 100K does not pass; in fact, we get 10 failures at around the 1K mark.
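A minimal sketch of how the simulated drop could be wired into the LogRouter's init-request handling (illustrative; the handler shape and reply payload shown here are assumptions):

// In the worker's handler for the LogRouter initialization request:
if (skipInitRspInSim()) {
    // Intentionally do not reply, so the CC exercises its timeout path in simulation.
} else {
    req.reply.send(recruited); // normal path: reply with the recruited interface
}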
And then 100K with the fix passes:
20250926-030021-praza-timeout-behavior-step-c09fd7600e93c39e compressed=True data_size=40514802 duration=5468155 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:45 sanity=False started=100000 stopped=20250926-040006 submitted=20250926-030021 timeout=5400 username=praza-timeout-behavior-step3-finalfix-v2-91aa98b4b1f492a1a921750ad0420eaa94282540