Lazin
Contributor

@Lazin Lazin commented Sep 24, 2025

Replicate spillover command with the fence to avoid races.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@Lazin Lazin requested review from Copilot and oleiman and removed request for Copilot September 24, 2025 18:36
oleiman previously approved these changes Sep 24, 2025
Member

@oleiman oleiman left a comment

looks reasonable. is there a link to an issue we can add or a high level description of the situation that motivated the change? or why we didn't do it before, or whatever.

@Copilot Copilot AI review requested due to automatic review settings September 25, 2025 12:32
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the archival system to use a fenced spillover command approach to prevent race conditions. The change centralizes fence creation logic and adds fencing support specifically to the spillover operation.

  • Introduces a new emit_rw_fence() method to centralize fence creation logic
  • Refactors multiple methods to use the centralized fence creation instead of duplicated code
  • Adds proper fencing to the spillover command with fence reset between iterations
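
To illustrate what the fence in the bullets above guards against, here is a minimal, self-contained sketch. It assumes the fence behaves like an optimistic-concurrency check: the caller records the state machine's applied offset, and a fenced command is rejected if that offset has moved by the time the command is applied. The names toy_stm, emit_fence, and apply are hypothetical and chosen for illustration; they are not the actual ntp_archiver or archival STM API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a replicated state machine (not Redpanda's API).
struct toy_stm {
    std::int64_t applied_offset = 0; // offset of the last applied command
    std::vector<std::string> log;    // names of the commands applied so far

    // Snapshot the current offset. A command built from this snapshot is
    // only valid while no other command has been applied in between.
    std::int64_t emit_fence() const { return applied_offset; }

    // Apply a command. If a fence is supplied and the state has moved since
    // the fence was taken, reject the command and leave the state unchanged.
    bool apply(const std::string& cmd, std::optional<std::int64_t> fence) {
        assert(!cmd.empty());
        if (fence.has_value() && *fence != applied_offset) {
            return false; // stale fence: the command was built from old state
        }
        log.push_back(cmd);
        ++applied_offset;
        return true;
    }
};
```

In this sketch every successful apply advances the offset, so a caller that replicates commands in a loop would need to take a fresh fence on each iteration, which is one way to read the "fence reset between iterations" point above.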

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/v/cluster/archival/ntp_archiver_service.h | Declares new emit_rw_fence() method for centralized fence creation |
| src/v/cluster/archival/ntp_archiver_service.cc | Implements emit_rw_fence() and refactors multiple methods to use it, adds fencing to spillover operations |

@Lazin Lazin requested a review from oleiman September 25, 2025 14:35
@Lazin
Contributor Author

Lazin commented Sep 25, 2025

> looks reasonable. is there a link to an issue we can add or a high level description of the situation that motivated the change? or why we didn't do it before, or whatever.

I think we ignored it before because it's replicated in a loop. The issue is that on CI it could also attempt to upload/replicate a spillover manifest that is slightly off in some cases.

@vbotbuildovich
Collaborator

vbotbuildovich commented Sep 25, 2025

CI test results

test results on build#72948

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": false, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7dfc-45a0-854a-84d474bb1dde | FLAKY | 19/21 | upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/72948#0199810c-27e9-4126-ae01-455ee8374f3b | FLAKY | 13/21 | upstream reliability is '97.03389830508475'. current run reliability is '61.904761904761905'. drift is 35.12914 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| ClusterRateQuotaTest | test_client_group_consume_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df1-4ad7-bd6b-030a62faba91 | FLAKY | 14/21 | upstream reliability is '81.31229235880399'. current run reliability is '66.66666666666666'. drift is 14.64563 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_produce_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df2-41cb-8bd7-b7db1febc310 | FLAKY | 17/21 | upstream reliability is '83.00518134715026'. current run reliability is '80.95238095238095'. drift is 2.0528 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_produce_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df8-4ecf-8db5-ea3f420f4e18 | FLAKY | 12/21 | upstream reliability is '81.82527301092044'. current run reliability is '57.14285714285714'. drift is 24.68242 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism_applies_to_next_request | null | integration | https://buildkite.com/redpanda/redpanda/builds/72948#0199810c-27f1-43e1-a0a5-7f70d2ebb2be | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request |
| PartitionBalancerTest | test_rack_awareness | null | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df3-405b-bb92-75d65aa92d3b | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionBalancerTest&test_method=test_rack_awareness |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df2-41cb-8bd7-b7db1febc310 | FLAKY | 20/21 | upstream reliability is '99.51923076923077'. current run reliability is '95.23809523809523'. drift is 4.28114 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations |
| RecoveryModeTest | test_rolling_restart | null | integration | https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df1-4ad7-bd6b-030a62faba91 | FLAKY | 17/21 | upstream reliability is '94.36860068259386'. current run reliability is '80.95238095238095'. drift is 13.41622 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RecoveryModeTest&test_method=test_rolling_restart |

test results on build#73057

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MasterTestSuite | test_replica_pair_frequency | | unit | https://buildkite.com/redpanda/redpanda/builds/73057#0199862d-8f59-415f-a4bb-7e63eda8a9fb | FAIL | 0/1 | | |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": false, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/73057#01998688-a51a-4120-99be-8f68479e23e8 | FLAKY | 19/21 | upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/73057#0199865d-61a2-49ed-9e94-8cab3ec781c8 | FLAKY | 9/21 | upstream reliability is '88.51351351351352'. current run reliability is '42.857142857142854'. drift is 45.65637 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| ControllerLogLimitMirrorMakerTests | test_mirror_maker_with_limits | null | integration | https://buildkite.com/redpanda/redpanda/builds/73057#01998688-a520-4034-89e1-ed87b1cfc150 | FLAKY | 20/21 | upstream reliability is '99.21259842519686'. current run reliability is '95.23809523809523'. drift is 3.9745 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits |

_rtclog.warn,
"Failed to replicate spillover command: {}",
error.message());
break;
Member

given the change in control flow here, i think a bit more exposition in the commit message would be helpful. i.e. the fact that we break out of the loop on replication failure, whether that's usually attributable to the fence, why it's ok to bail at this point, what happens next, etc.

Contributor Author

done

Also, extract fence initialization into a method in the ntp_archiver to
avoid code duplication.

There is a change in the control flow of the 'apply_spillover' method.
Previously, the spillover loop did not stop on a replication error, so
the same error was repeated. The loop uses the manifest to create a
spillover manifest and replicates the command through the archival STM.
The replicate method waits until the command is applied and propagates
the error back to the loop. On error, the loop logged the error and
continued. Since the state of the manifest didn't change, the loop
would produce the same manifest and the same command, causing a new
failure.

This commit breaks out of the loop if the spillover command can't be
applied. This guarantees forward progress.

Signed-off-by: Evgeny Lazin <[email protected]>
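
The control-flow change described in the commit message can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual apply_spillover code: run_spillover_loop, replicate_result, and the break_on_error switch are hypothetical names, the replicate callback stands in for "build a spillover manifest from the current state and replicate the fenced command", and the loop is bounded by max_iterations only so that the old behaviour terminates in the example.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical outcome of one replication attempt.
struct replicate_result {
    bool ok;
    std::string message;
};

// Runs up to max_iterations spillover rounds and returns how many
// replication attempts were made.
//   break_on_error == false models the old behaviour: log and continue.
//   break_on_error == true  models the new behaviour: log and bail out.
inline int run_spillover_loop(
  int max_iterations,
  const std::function<replicate_result(int)>& replicate,
  bool break_on_error) {
    assert(max_iterations >= 0);
    int attempts = 0;
    for (int i = 0; i < max_iterations; ++i) {
        ++attempts;
        const replicate_result res = replicate(i);
        if (!res.ok) {
            // The real code logs "Failed to replicate spillover command"
            // here. If the state did not change, the next iteration would
            // build the same command and fail the same way.
            if (break_on_error) {
                break;
            }
        }
    }
    return attempts;
}
```

With a callback that always fails, the old behaviour repeats the failing attempt on every iteration, while the new behaviour stops after the first failure and leaves the retry to a later invocation that starts from fresh state.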
@Lazin Lazin requested a review from oleiman September 26, 2025 13:18
Member

@oleiman oleiman left a comment

lgtm

@Lazin Lazin merged commit 4d18c2e into redpanda-data:dev Sep 26, 2025
17 checks passed
@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27714-v25.1.x-416 remotes/upstream/v25.1.x
git cherry-pick -x 35dc6d6630

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27714-v25.2.x-908 remotes/upstream/v25.2.x
git cherry-pick -x 35dc6d6630

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27714-v24.3.x-25 remotes/upstream/v24.3.x
git cherry-pick -x 35dc6d6630

Workflow run logs.
