Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stepping down on timeout #24590

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Dec 17, 2024

When follower is busy it may fail fast processing full heartbeat
requests sent by the leader. In this case a follower RPC handler sets
the follower_busy result in heartbeat_reply. Leader should still treat
a follower replica as online in this case. The replica hosting node must
be online to reply with the follower_busy error.

This way we prevent to eager leader step downs when follower replicas
are slow.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • stable leadership under load

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#59862

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_fast_node_addition
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 17, 2024

CI test results

test results on build#59862
test_id test_kind job_url test_status passed
coordinator_rpunit.coordinator_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05c-4ce5-a326-067904a6399d FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
distributed_kv_stm_tests_rpunit.distributed_kv_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
gtest_archival_rpunit.gtest_archival_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
id_allocator_stm_test_rpunit.id_allocator_stm_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
partition_properties_stm_test_rpunit.partition_properties_stm_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d5a4-9a48-4d45-8031-616226f6505a FLAKY 4/6
rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be FAIL 0/1
tm_stm_tests_rpunit.tm_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
test results on build#59902
test_id test_kind job_url test_status passed
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b161-4c06-ada3-c542ebe6df9a FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b162-4156-9369-57ef78449e35 FAIL 0/2
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59902#0193d913-3ccd-4237-a5c3-aa10b1fd3682 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59902#0193d925-6325-4c4e-a39e-b2f4af0d69c2 FLAKY 4/6
test results on build#59984
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FAIL 0/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FLAKY 1/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestCompactedTopic.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FLAKY 1/6

@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from c321e29 to e203f89 Compare December 18, 2024 08:02
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#59902

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from e203f89 to 592fc64 Compare December 20, 2024 07:13
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#59984

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

src/v/raft/tests/raft_fixture.cc Outdated Show resolved Hide resolved
src/v/raft/tests/raft_fixture.cc Outdated Show resolved Hide resolved
@@ -489,6 +490,8 @@ ss::future<> raft_node_instance::stop() {
vlog(_logger.debug, "stopping protocol");
co_await _buffered_protocol->stop();
co_await _protocol->stop();
// release f_log pointer before stopping raft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we do it before? It should not matter, as it's a shared pointer, so the underlying object will only die when _raft is gone.
Same question for stopping log before deleting raft, but it's out of scope of this PR.

src/v/raft/consensus.cc Show resolved Hide resolved
The `raft::reply_result::follower_busy` is indicating that the follower
was unable to process the heartbeat fast enough to generate a response.
Renaming the reply from `timeout` will make it less confusing for the
reader and differentiate the error code from an RPC timeout.

Signed-off-by: Michał Maślanka <[email protected]>
Wired raft RPC service handler into Raft fixture to make the tests more
accurate and cover the service code with tests.

Signed-off-by: Michał Maślanka <[email protected]>
Propagating timeout to the node sending RPC request is crucial for
accurate testing of Raft implementation.

Signed-off-by: Michał Maślanka <[email protected]>
Added a wrapper around the `storage::log` allowing us to inject storage
layer failures in Raft fixture tests.

Signed-off-by: Michał Maślanka <[email protected]>
When follower is busy it may fail fast processing full heartbeat
requests sent by the leader. In this case a follower RPC handler sets
the `follower_busy` result in heartbeat_reply. Leader should still treat
a follower replica as online in this case. The replica hosting node must
be online to reply with the `follower_busy` error.

This way we prevent to eager leader step downs when follower replicas
are slow.
Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from 592fc64 to 67e7c6e Compare December 23, 2024 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants