Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Giving state machine implementers a way to opt out from being forced to support snapshotting at arbitrary offset. #22961

Merged
merged 1 commit into from
Oct 8, 2024

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Aug 20, 2024

To support fast partition movement in Redpanda the
raft::state_machine_base interface requires state machine implementer
to provide support for taking snapshot at arbitrary, already applied
offset. This requirement is easy to achieve for all existing state
machines as they do not require raft snapshots at all.

In future we may leverage the Raft protocol snapshot as they provide a
nice way to maintain the bounded log size without compromising
consistency.

This PR introduces changes to state_machine_manager and
state_machine_base interfaces allowing state machine implementer to
opt out from the requirement to provide snapshot at offset capability.
Without this feature the partition that the STM is build at will not be
fast movable but the take_snapshot implementation will be very simple
as it will require simply serializing the current in memory state of the
state machine.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@mmaslankaprv mmaslankaprv marked this pull request as draft August 20, 2024 13:20
@mmaslankaprv
Copy link
Member Author

/dt

@mmaslankaprv
Copy link
Member Author

/dt

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 27, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/53602#019193b3-0ea3-41ce-8ec9-9a413ad3ab3f:

"rptest.tests.availability_test.AvailabilityTests.test_availability_when_one_node_failed"

new failures in https://buildkite.com/redpanda/redpanda/builds/55625#01924c29-032d-4627-8e5c-a7d90a1ac7b3:

"rptest.tests.timely_shutdown_test.ShutdownTest.test_timely_shutdown_with_failures"

@mmaslankaprv
Copy link
Member Author

/dt

@mmaslankaprv
Copy link
Member Author

/dt

@mmaslankaprv
Copy link
Member Author

/dt

@mmaslankaprv
Copy link
Member Author

/dt

@mmaslankaprv mmaslankaprv marked this pull request as ready for review September 27, 2024 14:10
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor comments.

src/v/raft/state_machine_manager.cc Outdated Show resolved Hide resolved
src/v/raft/state_machine_manager.cc Outdated Show resolved Hide resolved
src/v/raft/state_machine_manager.cc Show resolved Hide resolved
@mmaslankaprv mmaslankaprv force-pushed the stm-snapshot-opt-out branch 4 times, most recently from ace577b to 122e87f Compare October 2, 2024 06:25
@mmaslankaprv mmaslankaprv requested a review from bharathv October 2, 2024 13:51
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

A note on commit history: looks like every commit has the same commit message and I think the last two commits should be squashed, otherwise compilation is broken in the middle commit?

Comment on lines +136 to +143
if (offset != last_applied_offset()) {
throw std::logic_error(fmt::format(
"State machine that do not support taking snapshot at arbitrary "
"offset can to take snapshot at requested offset: {}, current "
"last applied offset: {}",
offset,
last_applied_offset()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_apply_units are held by this point, right? Reason I'm asking is because of scheduling points if the units are not held last_applied() may change from L136 and L144. But AFAICT thats not a concern given the units.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly the units are held in the state_machine_manager

To support fast partition movement in Redpanda the
`raft::state_machine_base` interface requires state machine implementer
to provide support for taking snapshot at arbitrary, already applied
offset. This requirement is easy to achieve for all existing state
machines as they do not require raft snapshots at all.

In future we may leverage the Raft protocol snapshot as they provide a
nice way to maintain the bounded log size without compromising
consistency.

This PR introduces changes to `state_machine_manager` and
`state_machine_base` interfaces allowing state machine implementer to
opt out from the requirement to provide snapshot at offset capability.
Without this feature the partition that the STM is build at will not be
fast movable but the `take_snapshot` implementation will be very simple
as it will require simply serializing the current in memory state of the
state machine.

Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv requested a review from bharathv October 4, 2024 07:52
@mmaslankaprv
Copy link
Member Author

/ci-repeat 1

@mmaslankaprv mmaslankaprv merged commit dc252d7 into redpanda-data:dev Oct 8, 2024
17 checks passed
@mmaslankaprv mmaslankaprv deleted the stm-snapshot-opt-out branch October 8, 2024 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants