
[CORE-8160] storage: add chunked compaction routine #24423

Open
wants to merge 15 commits into base: dev

Conversation

WillemKauf
Contributor

@WillemKauf WillemKauf commented Dec 3, 2024

This PR deals with the case in which zero segments were indexed during a round of sliding window compaction. This can happen for segments with a large number of unique keys, given the memory constraint imposed on our key-offset hash map by storage_compaction_key_map_memory (128 MiB by default).

Historically, this has not come about often, and it may also be naturally alleviated by deduplicating or partially indexing the problem segment during future rounds of compaction (provided there is a steady ingress rate to the partition, and that keys in the problem segment are present in newer segments in the log). Nevertheless, added here is a routine that can handle this corner case when it arises.

Instead of throwing and logging an error when zero segments are indexed, we will now fall back to a "chunked" compaction routine.

This implementation uses some of the current abstractions from the compaction utilities to perform several rounds (chunks) of sliding window compaction with a partially indexed map created from the un-indexed segment by reading it in a linear fashion.

This implementation is sub-optimal for a number of reasons. Primarily, segment indexes are read and rewritten each time a round of chunked compaction is performed, and these intermediate states are then used for the next round of chunked compaction.

In the future, there may be a more optimal way to perform these steps using less IO by holding more information in memory before flushing the final results to disk, instead of flushing every intermediate stage. However, this case in which chunked compaction is required has seemed to be infrequent enough that merely having the implementation is valuable.
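As a rough standalone sketch of this flow (illustrative names only, not Redpanda's actual API; an entry-count cap stands in for the storage_compaction_key_map_memory byte budget), each "chunk" indexes as many keys as fit under the budget, deduplicates with that partial map, then resumes from the first offset that did not fit:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of the chunked compaction driver loop described above.
struct record {
    long offset;
    std::string key;
};

// Index keys starting at start_offset, stopping once the map is "full".
// Returns the last offset indexed this round, or -1 if nothing was indexed.
long build_partial_map(
  const std::vector<record>& segment,
  long start_offset,
  size_t max_entries,
  std::map<std::string, long>& map) {
    long last_indexed = -1;
    for (const auto& r : segment) {
        if (r.offset < start_offset) {
            continue;
        }
        if (map.size() >= max_entries && map.count(r.key) == 0) {
            break; // budget exhausted: resume from this offset next round
        }
        map[r.key] = r.offset; // retain the latest offset seen per key
        last_indexed = r.offset;
    }
    return last_indexed;
}

// Driver: repeat partial indexing + deduplication until the segment is
// fully indexed. Returns the number of chunked rounds performed.
int chunked_compaction_rounds(
  const std::vector<record>& segment, size_t max_entries) {
    int rounds = 0;
    long next_start = 0;
    while (true) {
        std::map<std::string, long> map;
        long last = build_partial_map(segment, next_start, max_entries, map);
        if (last < 0) {
            break; // nothing indexed: segment exhausted (or empty)
        }
        ++rounds;
        // ...a real implementation would deduplicate the log with `map` here,
        // rewriting segment data and indexes before the next round...
        next_start = last + 1;
        if (last >= segment.back().offset) {
            break; // fully indexed
        }
    }
    return rounds;
}
```

With five unique keys and a two-entry budget, for example, three rounds are needed; duplicate-heavy segments need fewer rounds, since re-seen keys only update existing map entries.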

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • Adds a chunked compaction routine to local storage, which is used as a fallback in the case that we fail to index a single segment during sliding window compaction.

@WillemKauf WillemKauf requested a review from andrwng December 3, 2024 21:35
@WillemKauf WillemKauf force-pushed the storage_chunked_compaction branch from 1d4f546 to 2ca689c Compare December 3, 2024 21:41
@dotnwat
Member

dotnwat commented Dec 4, 2024

In the case that zero segments were indexed for a round of sliding window compaction, we will now fall back to a chunked compaction routine.

Can you explain what "chunked compaction" is? When would sliding window fail to index segments, and why do we care?

@WillemKauf
Contributor Author

Can you explain what "chunked compaction" is? When would sliding window fail to index segments, and why do we care?

Added more detail to cover letter to address these points.

For functors that may return a `ss::stop_iteration`.
Contributor

@andrwng andrwng left a comment


Pretty much looks good! No major complaints about structure, just some naming suggestions. Nice work!

Also could probably use some ducktape testing, though IIRC you mentioned a separate PR for stress testing compaction

Review comments on:
  • src/v/storage/compaction_reducers.cc (outdated)
  • src/v/storage/compaction_reducers.h (outdated)
  • src/v/storage/segment_deduplication_utils.h (outdated, two threads)
  • src/v/storage/tests/compaction_e2e_test.cc
@WillemKauf
Contributor Author

WillemKauf commented Dec 10, 2024

Also could probably use some ducktape testing, though IIRC you mentioned a separate PR for stress testing compaction

That PR is merged; I'm going to parameterize it in order to test chunked compaction and assert on some added metrics.

Will have updates to this tomorrow soon (TM).

co_await map.reset();
auto read_holder = co_await seg->read_lock();
auto start_offset_inclusive = model::next_offset(last_indexed_offset);
auto rdr = internal::create_segment_full_reader(
Contributor Author


Recreating this full segment reader for each round of chunked compaction is a bummer.

Not sure if we have any abstractions to get around this: log_reader::reset_config() gave me some hope that the segment's lease/lock could be reused, but it doesn't seem to allow us to reset with a start_offset lower than what has been currently read.

For context, we have to do this because in the chunked_compaction_reducer, once we fail to index an offset for a record in a batch, we break out of the loop and will have to re-read that batch in the next round using that offset as the start, inclusively.

This reducer builds a `key_offset_map` for a segment starting from an offset
`start_offset_inclusive`, until the memory limit of the map is reached
or the segment is fully indexed. `end_of_stream()` returns a boolean value
indicating whether the segment was fully indexed or not, and the map can be
probed for its `max_offset()` in order to start the next round of indexing
(using `start_offset_inclusive = model::next_offset(max_offset())`).

It is intended to be used in the case that a segment was not able to be fully
indexed during a round of sliding window compaction, falling back instead
to a "chunked" compaction routine.

This reducer will be used during the "chunked" compaction procedure by
starting from the base offset in the segment, and performing as many rounds
of map building and de-duplication as is required until `end_of_stream()` returns
`true`.
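A minimal standalone sketch of this contract (the types and method names below are illustrative stand-ins, not the real Redpanda classes; an entry-count cap models the memory limit):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for a key-offset map with a capacity limit.
struct simple_key_offset_map {
    std::map<std::string, long> kv;
    size_t capacity = 0; // hypothetical stand-in for the byte budget
    long max_indexed = -1;

    // Returns false when the map cannot accept a new key (i.e. it is full).
    bool put(const std::string& key, long offset) {
        if (kv.size() >= capacity && kv.count(key) == 0) {
            return false;
        }
        kv[key] = offset;
        max_indexed = std::max(max_indexed, offset);
        return true;
    }
    long max_offset() const { return max_indexed; }
};

struct batch {
    long offset;
    std::string key;
};

// Sketch of the map-building reducer contract: index records starting at
// start_offset_inclusive until the map refuses an insert. end_of_stream()
// then reports whether the segment was fully indexed; if not, the next
// round starts at map.max_offset() + 1.
class map_building_reducer {
public:
    map_building_reducer(simple_key_offset_map& m, long start_offset_inclusive)
      : map_(m)
      , start_(start_offset_inclusive) {}

    // Analogue of returning ss::stop_iteration: false means "stop consuming".
    bool operator()(const batch& b) {
        if (b.offset < start_) {
            return true; // skip the already-indexed prefix
        }
        if (!map_.put(b.key, b.offset)) {
            hit_limit_ = true;
            return false;
        }
        return true;
    }

    bool end_of_stream() const { return !hit_limit_; } // true => fully indexed

private:
    simple_key_offset_map& map_;
    long start_;
    bool hit_limit_ = false;
};
```

Driving this over a three-key segment with a two-entry cap stops after indexing offset 1, so end_of_stream() is false and the next round would start at offset 2.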
Abstracts away a section of code into a new function.

This function will be re-used in the current implementation
of chunked sliding window compaction.
And better define it for `simple_key_offset_map`.
Optionally provide a starting offset from which the reader's
`min_offset` value is assigned (otherwise, the `base_offset()` of
the `segment` is used).
Uses the `map_building_reducer` to perform a linear read of a `segment`
and index its keys and offsets, starting from a provided offset.
In the case that zero segments were able to be indexed for a round of sliding
window compaction, chunked compaction must be performed.

This implementation uses some of the current abstractions from the compaction
utilities to perform several rounds of sliding window compaction with a
partially indexed map created from the un-indexed segment in a linear fashion.

This implementation is sub-optimal for a number of reasons: namely,
that segment indexes are read and rewritten each time a round of chunked
compaction is performed. These intermediate states are then used for the
next round of chunked compaction.

In the future, there may be a more optimal way to perform these steps
using less IO by holding more information in memory before flushing
the final results to disk, instead of flushing every intermediate stage.
GTest `ASSERT_*` macros cannot be used in non-`void` returning
functions.

Add `RPTEST_EXPECT_EQ` to provide flexibility in testing for non-`void`
functions.
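The restriction can be illustrated with a hypothetical standalone macro (this is not the actual GTest or RPTEST implementation): GTest's ASSERT_* macros expand to a bare `return;` on failure, which is ill-formed in a function that returns a value, whereas an EXPECT-style macro records the failure and lets the enclosing function return normally:

```cpp
#include <cassert>
#include <iostream>

// Hypothetical illustration only; not the real RPTEST_EXPECT_EQ.
static int failures = 0;

// EXPECT-style: record the failure, do not `return;` out of the caller.
#define RPTEST_EXPECT_EQ(a, b)                                                 \
    do {                                                                       \
        if ((a) != (b)) {                                                      \
            ++failures;                                                        \
            std::cerr << "expectation failed: " #a " != " #b "\n";             \
        }                                                                      \
    } while (0)

// Legal in a non-void function, since the macro never emits a bare `return;`.
int make_value(int x) {
    RPTEST_EXPECT_EQ(x % 2, 0);
    return x * 2;
}
```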
To move away from hardcoded boost asserts and provide
compatibility in a GTest environment.
This would previously overshoot the `size_bytes` provided to it
by filling with `elements_per_fragment()` at least once.

At the lower limit, when `required_entries` is less than `elements_per_fragment()`,
we should be taking the minimum of the two values and pushing back that
number of objects to the `entries` container.
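A standalone sketch of the fixed reservation loop (the constants and function below are made up for illustration, not the real key_offset_map code): each batch is capped at the number of entries still required, rather than always pushing a full fragment's worth.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr size_t entry_size = 64;              // hypothetical bytes per entry
constexpr size_t elements_per_fragment = 1024; // hypothetical fragment size

// Reserve entries in fragment-sized batches without overshooting the
// requirement derived from size_bytes.
size_t entries_reserved(size_t size_bytes) {
    const size_t required_entries = size_bytes / entry_size;
    size_t reserved = 0;
    while (reserved < required_entries) {
        // take the minimum of what is left and one fragment's worth
        reserved += std::min(required_entries - reserved, elements_per_fragment);
    }
    return reserved;
}
```

With these made-up constants, a request for 3000 entries' worth of bytes reserves exactly 3000 entries (1024 + 1024 + 952) instead of rounding the last batch up to a full fragment.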
@WillemKauf WillemKauf force-pushed the storage_chunked_compaction branch from 2ca689c to 4d14fd5 Compare December 11, 2024 20:58
@WillemKauf WillemKauf requested a review from a team as a code owner December 11, 2024 20:58
@WillemKauf WillemKauf removed the request for review from a team December 11, 2024 20:59
@WillemKauf WillemKauf force-pushed the storage_chunked_compaction branch from 4d14fd5 to fc57dff Compare December 11, 2024 21:02
@WillemKauf
Copy link
Contributor Author

WillemKauf commented Dec 11, 2024

Force push to:

  • Rebase to upstream/dev
  • Add a new condition to reset the compaction sliding window offset in chunked_sliding_window_compaction- very important.
  • Add cluster config setting storage_compaction_key_map_memory_override_for_tests.
  • Parameterize log_compaction_test.py in order to test chunked compaction.
  • Address code review comments by adding documentation and renaming some objects/functions.
  • Fix logic in key_offset_map::initialize().
  • Add chunked_compaction_runs metric to storage::probe.

@WillemKauf WillemKauf requested a review from andrwng December 11, 2024 21:04
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Dec 11, 2024
In order to test the chunked compaction routine, parameterize the existing
compaction test suite with `storage_compaction_key_map_memory_kb`.

By limiting this value, we can force compaction to go down the chunked compaction
path, and verify the log using the existing utilities after compaction settles.

Some added asserts are used to verify chunked compaction is taken or not taken
as a code path, depending on the memory constraints specified.
@WillemKauf
Contributor Author

/ci-repeat 5
release
skip-units
skip-redpanda-build
dt-repeat=100
tests/rptest/tests/log_compaction_test.py


@vbotbuildovich
Collaborator

Retry command for Build#59673

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/log_compaction_test.py::LogCompactionTest.compaction_stress_test@{"cleanup_policy":"compact,delete","key_set_cardinality":100,"storage_compaction_key_map_memory_kb":131072}

@WillemKauf
Contributor Author

WillemKauf commented Dec 12, 2024

https://ci-artifacts.dev.vectorized.cloud/redpanda/59673/0193bc0b-0365-4203-802b-c969372ea7ac/vbuild/ducktape/results/final/report.html

RuntimeError: KgoVerifierProducer-0-139757941539360 possible idempotency bug: ProduceStatus<103424 102400 1024 1 0 0 0 41112 8055.5/12348.5/20587>

time="2024-12-12T19:30:07Z" level=warning msg="Produced at unexpected offset 3508 (expected 2493) on partition 0"

Possibly bad interaction between partition movement and KgoVerifierProducer?

Seemingly unrelated to compaction changes.

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 12, 2024

CI test results

test results on build#59673
  • rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=100.storage_compaction_key_map_memory_kb=131072 (ducktape): FLAKY, passed 99/100
    https://buildkite.com/redpanda/redpanda/builds/59673#0193bc0b-0365-4203-802b-c969372ea7ac

test results on build#59782

  • rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS (ducktape): FAIL, passed 0/6
    https://buildkite.com/redpanda/redpanda/builds/59782#0193caa2-8bfd-4e36-a47d-8909582cb230
  • rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 (ducktape): FLAKY, passed 3/6
    https://buildkite.com/redpanda/redpanda/builds/59782#0193caa2-8bfe-4aad-ab48-5c49355e8883

@WillemKauf WillemKauf force-pushed the storage_chunked_compaction branch from fc57dff to fe0991e Compare December 15, 2024 12:44
@WillemKauf
Contributor Author

Force push to:

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 15, 2024

Retry command for Build#59782

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

5 participants