[v24.2.x] Fix stepping down on timeout #24708

Open: wants to merge 7 commits into base: v24.2.x

Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Jan 7, 2025

Backport of PR #24590
Fixes: #24668

The `raft::reply_result::follower_busy` result indicates that the follower
was unable to process the heartbeat fast enough to generate a response.
Renaming the result from `timeout` makes it less confusing for the
reader and differentiates the error code from an RPC timeout.

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 6a1e34b)
Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 95a29db)
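
To illustrate the distinction the rename draws, here is a minimal follower-side sketch. It is not the Redpanda handler: the `follower` class, its pending-request limit, and `complete_one` are hypothetical names invented for this example, and only the `follower_busy` result name comes from the commit message. The point is that the follower itself produces the reply, which is different from the leader's RPC timing out with no reply at all.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model; these are not Redpanda's actual types.
namespace follower_sketch {

enum class reply_result { success, failure, follower_busy };

struct heartbeat_reply {
    reply_result result;
};

// A follower that fails fast once too many heartbeats are still pending.
// The pending-count limit is an assumption made for this illustration.
class follower {
public:
    explicit follower(std::size_t max_pending)
      : _max_pending(max_pending) {
        assert(max_pending > 0);
    }

    // Replies immediately: the node is online, it is just too busy to
    // process the request, so it says so instead of staying silent.
    heartbeat_reply handle_heartbeat() {
        if (_pending >= _max_pending) {
            return {reply_result::follower_busy};
        }
        ++_pending;
        return {reply_result::success};
    }

    // Marks one previously accepted heartbeat as fully processed.
    void complete_one() {
        if (_pending > 0) {
            --_pending;
        }
    }

private:
    std::size_t _max_pending;
    std::size_t _pending{0};
};

} // namespace follower_sketch
```

In this sketch a leader that receives `follower_busy` knows the remote node answered, whereas an RPC timeout carries no such information.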
@mmaslankaprv mmaslankaprv added this to the v24.2.x-next milestone Jan 7, 2025
@mmaslankaprv mmaslankaprv added the kind/backport PRs targeting a stable branch label Jan 7, 2025
@mmaslankaprv mmaslankaprv linked an issue Jan 7, 2025 that may be closed by this pull request
Wired the Raft RPC service handler into the Raft fixture to make the tests
more accurate and to cover the service code.

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 5f69d9b)
Propagating the timeout to the node sending the RPC request is crucial for
accurate testing of the Raft implementation.

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 7d33bb5)
@mmaslankaprv mmaslankaprv force-pushed the manual-backport-24590-v24.2.x-643 branch from 6a30603 to 3259d02 Compare January 7, 2025 15:02
@mmaslankaprv mmaslankaprv marked this pull request as ready for review January 7, 2025 15:13
@vbotbuildovich
Collaborator

Retry command for Build#60349

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":true}

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 7, 2025

CI test results

test results on build#60349

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| distributed_kv_stm_tests_rpunit.distributed_kv_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-44b3-92bc-601f600f4fe7 | FAIL | 0/2 |
| distributed_kv_stm_tests_rpunit.distributed_kv_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-4c19-949e-47ae1823c063 | FAIL | 0/2 |
| gtest_archival_rpunit.gtest_archival_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-44b3-92bc-601f600f4fe7 | FAIL | 0/2 |
| gtest_archival_rpunit.gtest_archival_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-4c19-949e-47ae1823c063 | FAIL | 0/2 |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-44b3-92bc-601f600f4fe7 | FAIL | 0/2 |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-4c19-949e-47ae1823c063 | FAIL | 0/2 |
| id_allocator_stm_test_rpunit.id_allocator_stm_test_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-44b3-92bc-601f600f4fe7 | FAIL | 0/2 |
| id_allocator_stm_test_rpunit.id_allocator_stm_test_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-4c19-949e-47ae1823c063 | FAIL | 0/2 |
| rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=start_offset | ducktape | https://buildkite.com/redpanda/redpanda/builds/60349#0194419d-b830-4b8b-a662-0ebc1e076cbd | FLAKY | 5/6 |
| rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/60349#01944199-9127-4a22-8ab6-fefca6176271 | FAIL | 0/1 |
| tm_stm_tests_rpunit.tm_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-44b3-92bc-601f600f4fe7 | FAIL | 0/2 |
| tm_stm_tests_rpunit.tm_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60349#01944155-5086-4c19-949e-47ae1823c063 | FAIL | 0/2 |

test results on build#60501

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60501#01944b32-7230-45de-8a59-5a2c3b7b0a62 | FAIL | 0/2 |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60501#01944b32-7231-46dc-af2d-6bd3fa1ad9ba | FLAKY | 1/2 |

@mmaslankaprv mmaslankaprv force-pushed the manual-backport-24590-v24.2.x-643 branch from 3259d02 to 60cf6da Compare January 9, 2025 12:18
Added a wrapper around `storage::log` that allows us to inject storage
layer failures in Raft fixture tests.

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit f04995a)
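
As a rough illustration of the wrapper idea described above, the sketch below decorates a log with an optional failure hook. Everything in it is an assumption made for the example: the `log` interface, `memory_log`, `failure_injectable_log`, and `set_append_failure` are hypothetical names and do not reflect the real `storage::log` API or the wrapper added in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model; the real storage::log interface differs.
namespace wrapper_sketch {

struct log {
    virtual ~log() = default;
    // Appends a batch and returns its offset.
    virtual std::size_t append(const std::string& batch) = 0;
};

// Trivial in-memory log used as the wrapped implementation.
struct memory_log final : log {
    std::size_t append(const std::string& batch) override {
        _batches.push_back(batch);
        return _batches.size() - 1;
    }
    std::vector<std::string> _batches;
};

// Forwards to the wrapped log unless the test-provided hook asks to fail.
class failure_injectable_log final : public log {
public:
    using injector = std::function<bool(const std::string&)>;

    explicit failure_injectable_log(std::unique_ptr<log> underlying)
      : _underlying(std::move(underlying)) {
        assert(_underlying != nullptr);
    }

    // Installs (or, with an empty function, clears) the failure hook.
    void set_append_failure(injector should_fail) {
        _should_fail = std::move(should_fail);
    }

    std::size_t append(const std::string& batch) override {
        if (_should_fail && _should_fail(batch)) {
            throw std::runtime_error("injected append failure");
        }
        return _underlying->append(batch);
    }

private:
    std::unique_ptr<log> _underlying;
    injector _should_fail;
};

} // namespace wrapper_sketch
```

A test can then hand the wrapper to the code under test, toggle the hook at the moment it wants the storage layer to fail, and clear it again to resume normal operation.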
When a follower is busy, it may fail fast when processing full heartbeat
requests sent by the leader. In this case the follower RPC handler sets
the `follower_busy` result in `heartbeat_reply`. The leader should still
treat the follower replica as online, because the node hosting the replica
must be online to reply with the `follower_busy` error.

This prevents overly eager leader step-downs when follower replicas
are slow.

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 8b57b42)
Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 67e7c6e)
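
The leader-side rule from the commit message above can be sketched as follows. This is a simplified model and not the Redpanda code: `follower_state`, `on_heartbeat`, and `should_step_down` are hypothetical names, a missing reply is modeled as `std::nullopt`, and the majority check is an assumed stand-in for whatever liveness accounting the real leader performs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified model; these are not Redpanda's actual types.
namespace leader_sketch {

using clock_type = std::chrono::steady_clock;

enum class reply_result { success, failure, follower_busy };

struct follower_state {
    clock_type::time_point last_reply{};
};

// Records the outcome of one heartbeat. Any reply, including
// follower_busy, proves the hosting node is online. Only a missing
// reply (std::nullopt, i.e. an RPC timeout) leaves the timestamp as is.
inline void on_heartbeat(
  follower_state& f,
  std::optional<reply_result> reply,
  clock_type::time_point now) {
    if (reply.has_value()) {
        f.last_reply = now;
    }
}

// The leader steps down when it cannot count a majority of replicas
// (itself included) that replied within `timeout`.
inline bool should_step_down(
  const std::vector<follower_state>& followers,
  clock_type::time_point now,
  clock_type::duration timeout) {
    assert(timeout > clock_type::duration::zero());
    std::size_t online = 1; // the leader itself
    for (const auto& f : followers) {
        if (now - f.last_reply <= timeout) {
            ++online;
        }
    }
    const std::size_t majority = (followers.size() + 1) / 2 + 1;
    return online < majority;
}

} // namespace leader_sketch
```

Under these assumptions, followers that keep answering `follower_busy` never push the leader toward stepping down, while followers that stop answering eventually do.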
@mmaslankaprv mmaslankaprv force-pushed the manual-backport-24590-v24.2.x-643 branch from 60cf6da to 683d30a Compare January 9, 2025 13:11
Labels
area/redpanda kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v24.2.x] Fix stepping down on timeout
2 participants