
ct: Add L0 read path #23629

Merged
merged 10 commits into from
Oct 16, 2024

Conversation

Lazin
Contributor

@Lazin Lazin commented Oct 3, 2024

The PR implements the reader, which consumes dl_placeholder batches from the partition and materializes them by downloading the data from cloud storage or fetching it from the cache.

The PR adds an abstract base class for the cloud_storage::cache class. The interface is located in the cloud_io package.

WIP, not all code pushed yet.
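The read path described above can be sketched with plain containers standing in for the disk cache and cloud storage. All types and names below are illustrative stand-ins, not the real cloud_io/cloud_storage APIs: a placeholder names a byte range inside an L0 object, and materializing it means fetching the object (cache first, then cloud) and slicing out the referenced range.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for a dl_placeholder batch: it
// references a byte range inside an L0 object in cloud storage.
struct dl_placeholder {
    std::string object_id; // L0 object that holds the data
    std::size_t offset;    // byte offset of this batch inside the object
    std::size_t size;      // byte size of this batch
};

struct read_path {
    std::map<std::string, std::string> cloud; // object_id -> bytes
    std::map<std::string, std::string> cache; // local disk cache

    // Returns the materialized bytes, or nullopt if the object is missing.
    std::optional<std::string> materialize(const dl_placeholder& p) {
        auto hit = cache.find(p.object_id);
        if (hit == cache.end()) {
            auto obj = cloud.find(p.object_id);
            if (obj == cloud.end()) {
                return std::nullopt; // download failed
            }
            // Populate the cache so later placeholders hit it.
            hit = cache.emplace(p.object_id, obj->second).first;
        }
        return hit->second.substr(p.offset, p.size);
    }
};
```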

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@Lazin Lazin marked this pull request as draft October 3, 2024 21:58
@Lazin Lazin force-pushed the feature/l0-read-path branch 3 times, most recently from 3bf6bd0 to 9dcb29c on October 4, 2024 23:00
@Lazin Lazin requested review from dotnwat and nvartolomei October 4, 2024 23:01
@Lazin Lazin force-pushed the feature/l0-read-path branch from 9dcb29c to 8c4e90c on October 4, 2024 23:07
@Lazin Lazin changed the title from "[DRAFT] ct: Add L0 read path" to "ct: Add L0 read path" Oct 4, 2024
@Lazin Lazin force-pushed the feature/l0-read-path branch 4 times, most recently from 4de71b8 to 6ea3a72 on October 8, 2024 17:09
@Lazin Lazin marked this pull request as ready for review October 9, 2024 13:25
@Lazin Lazin force-pushed the feature/l0-read-path branch 2 times, most recently from eec897f to 34167b4 on October 9, 2024 16:17
@Lazin
Contributor Author

Lazin commented Oct 9, 2024

Rebased with dev

@Lazin Lazin force-pushed the feature/l0-read-path branch from 34167b4 to 2b9a299 on October 9, 2024 18:20
Member

@dotnwat dotnwat left a comment

Yeah, this looks good to me. A couple of minor nits, but it's ready to merge IMO.

Before that though, can you please do one of the following:

  1. submit the changes to cloud_io/* and cloud_storage/* as a separate PR so that storage-team-wide changes are more accessible for broader review, or
  2. bring all those changes into cloud_topics/* so that cloud_topics owns the changes for now?

Comment on lines 57 to 58
dl_placeholder = 35, // placeholder batch type used by cloud topics
dl_overlay = 36, // overlay batch used by dl_stm and cloud topics
Member

should we have separate batch types, or just a single cloud_topics_meta batch type?

Contributor Author

I'll rename this. We should have two batch types. dl_placeholder for placeholders and dl_stm_command batch for all STM commands including dl_overlay.
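The two batch types agreed on here could look roughly like this. The enumerator value 35 comes from the diff above; the dl_stm_command value and the exact shape are illustrative, not the final code:

```cpp
#include <cstdint>

// Hypothetical sketch of the proposed split: one batch type for
// placeholders in the partition log, one for all dl_stm commands
// (including dl_overlay).
enum class record_batch_type : int8_t {
    dl_placeholder = 35, // placeholder batch consumed by the read path
    dl_stm_command = 36, // all STM commands, including dl_overlay
};
```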

Comment on lines +66 to +70
ss::future<result<bool>> materialize(
cloud_storage_clients::bucket_name bucket,
cloud_io::remote_api<Clock>* api,
cloud_io::basic_cache_service_api<Clock>* cache,
basic_retry_chain_node<Clock>* rtc);
Member

seems like this should be a free function factory. That would also allow you to avoid the Clock template parameter on the extent? Even make_raft_data_batch could be a free function? It's much easier to write tests on functional bits than on mutators like this.

Contributor Author

In this case materialize is called conditionally. Several placeholders may share the same L0 object, so if the L0 object is already materialized we don't call materialize for it again.

Contributor

You do call it conditionally, but from the reader, not from the extent, right? Why is this method on the extent?

Member

In this case materialize is called conditionally. Several placeholders may share the same L0 object, so if the L0 object is already materialized we don't call materialize for it again.

Yeah, I still think the extent should be a dumb object and there should be free functions to construct it. I can work on that if you don't want to.
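A minimal sketch of the "dumb extent plus free function" shape being suggested, with a shared map so placeholders that point at the same L0 object trigger only one fetch. Names and signatures here are hypothetical, not the PR's actual API:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// The extent is a plain data holder: no materialization logic on it.
struct extent {
    std::string object_id;
    std::shared_ptr<std::string> l0_bytes; // shared, materialized L0 object
};

// Free function: materialize every extent, fetching each distinct
// L0 object exactly once and sharing the bytes between extents.
inline void materialize_extents(
  std::vector<extent>& extents,
  const std::function<std::string(const std::string&)>& fetch) {
    std::map<std::string, std::shared_ptr<std::string>> hydrated;
    for (auto& e : extents) {
        auto [it, inserted] = hydrated.try_emplace(e.object_id);
        if (inserted) { // first extent referencing this object
            it->second = std::make_shared<std::string>(fetch(e.object_id));
        }
        e.l0_bytes = it->second; // later extents reuse the same copy
    }
}
```

Keeping the extent dumb means the conditional fetch logic lives in one testable free function instead of a method with hidden state.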


#include <seastar/core/lowres_clock.hh>

#pragma once
Member

pragma should go up above includes?
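For reference, the layout the comment asks for is the conventional one, with the include guard first:

```cpp
#pragma once

#include <seastar/core/lowres_clock.hh>
```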


ss::circular_buffer<model::record_batch> slice;
for (auto& e : extents.value()) {
slice.push_back(e.make_raft_data_batch());
Member

it isn't obvious why we are converting placeholder batches to raft batches. i'm guessing that it has something to do with the rest of the system expecting raft data batches? are there parts of the raft batch that can't be filled in completely, or is it identical, etc...?

Contributor Author

The placeholder batch doesn't have the data, so we bring in the data and replace the placeholder batch with a raft_data batch. We need to make it a raft_data batch because the Kafka layer will filter out anything which isn't a raft_data batch. We could introduce another batch type (e.g. materialized_placeholder) and teach the Kafka layer to use it, but IMO that would only add complexity.

Member

We need to make this a raft-data batch because the Kafka layer will filter out anything which isn't a raft-data batch

this is what i was looking for. thanks
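The filtering behavior being described can be illustrated with toy types. batch_type, record_batch, filter_fetch, and make_raft_data_batch below are illustrative stand-ins, not the real model types: a fetch path that keeps only raft_data batches would silently drop dl_placeholder batches, so the reader has to rewrite them first.

```cpp
#include <string>
#include <utility>
#include <vector>

enum class batch_type { raft_data, dl_placeholder };

struct record_batch {
    batch_type type;
    std::string payload;
};

// Kafka-layer style filter: only raft_data batches survive.
inline std::vector<record_batch>
filter_fetch(const std::vector<record_batch>& in) {
    std::vector<record_batch> out;
    for (const auto& b : in) {
        if (b.type == batch_type::raft_data) {
            out.push_back(b);
        }
    }
    return out;
}

// Reader-side materialization: placeholder -> raft_data with real bytes.
inline record_batch make_raft_data_batch(std::string materialized_data) {
    return {batch_type::raft_data, std::move(materialized_data)};
}
```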

@Lazin
Contributor Author

Lazin commented Oct 11, 2024

Yeah, this looks good to me. A couple of minor nits, but it's ready to merge IMO.

Before that though, can you please do one of the following:

  1. submit the changes to cloud_io/* and cloud_storage/* as a separate PR so that storage-team-wide changes are more accessible for broader review, or
  2. bring all those changes into cloud_topics/* so that cloud_topics owns the changes for now?

created #23748

@Lazin Lazin force-pushed the feature/l0-read-path branch 2 times, most recently from 3733a37 to d7d5263 on October 11, 2024 13:03
@vbotbuildovich
Collaborator

vbotbuildovich commented Oct 11, 2024

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56353#01927c04-92c7-4c9a-82b7-461c05b702a1:

"rptest.tests.partition_force_reconfiguration_test.NodeWiseRecoveryTest.test_node_wise_recovery.dead_node_count=2"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56578#019292f7-3531-4b61-b16d-b2421234126f:

"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default.wipe_cache=False"

@vbotbuildovich
Collaborator

vbotbuildovich commented Oct 11, 2024

Retry command for Build#56353

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_node_wise_recovery@{"dead_node_count":2}
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage@{"cleanup_policy":"delete"}
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves@{"cleanup_policy":"delete"}

Contributor

@nvartolomei nvartolomei left a comment

Still need to review tests.

src/v/model/model.cc
@@ -0,0 +1,17 @@
### Missing bits

* In the write path we should populate batch cache with `raft_data` batches
Contributor

This will be tricky given that we will have the data on the leader but not on the replicas. 🤔

Contributor Author

Yes. I think we need to focus on optimizing the read path for parallelism to have good performance even if the data is not cached. But having it cached if the retrieval is happening shortly after the data was produced on the same broker is also important.


namespace experimental::cloud_topics {

// Extent represents dl_placeholder with the data
Contributor

Can you move this comment to the class? I missed it and was about to write a comment asking for details but found it here.


Contributor Author

what class? basic_placeholder_extent?

}

if (
!status.has_value()
Contributor

In which case do we expect to get here with status.has_value() == false? I don't understand this condition.


Contributor Author

Will clean up in a followup. This check is not needed because if status.has_value() == false we will return a timeout error from the previous if statement (the only way status could be nullopt is when the loop timed out while the cache element is in progress, in which case rp.is_allowed will be false).



namespace experimental::cloud_topics {

model::record_batch_reader make_placeholder_extent_reader(
Contributor

A comment please about the inputs to this function.


Contributor Author

will add a followup

struct errc_converter;

template<>
struct errc_converter<cloud_io::download_result, errc> {
Contributor

@nvartolomei nvartolomei Oct 14, 2024

Don't like this mapping (to be clear: I'm not asking you to change it). Don't have better ideas but maybe you get some ideas by reading this tip https://www.boost.org/doc/libs/develop/libs/outcome/doc/html/tutorial/essential/conventions.html?

Contributor Author

this is not a public api, the whole thing lives inside .cc to begin with

Member

Don't like this mapping

Yeah, I tend to agree. @nvartolomei can you say a bit more about what you don't like?

this is not a public api, the whole thing lives inside .cc to begin with

I get what you are saying: it is contained, but I'm not sure it is relevant to Nicolae's feedback about the pattern.

Contributor Author

In this case an error code enum from one module is mapped to the error code enum of another. The linked page says this is a bad pattern and that it's better to use std::error_code on API boundaries so you don't have to convert. The reasoning here is that cloud_io::download_result is not a std::error_code. Also, this is an exhaustive check: sometimes an error code based solely on an enum class is better, because you can use a switch statement and rely on the compiler to verify that the error handling is exhaustive. This utility function is not exposed outside, and the functions in the .h return result types based on std::error_code.

Contributor Author

Let's say I have a function ss::future<result<iobuf>> download() in the header. The result stores a std::error_code which may contain different error categories. The module also has its own error code type. As a user of the module I'd still expect all public functions and methods to return error codes from the module itself, right? In that case I can still write an exhaustive error check.

Contributor Author

Also, Iceberg uses a different approach with boost::outcome that returns an error code which is not a std::error_code. As Andrew explained to me the other day, this was done to avoid boilerplate and to rely on switch statements during error handling.

Member

thanks @Lazin this makes sense!
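The exhaustive-switch style discussed in this thread can be sketched as follows. The enumerators below are illustrative stand-ins for cloud_io::download_result and the module's errc, not the real definitions. With -Wswitch (or -Werror=switch) the compiler flags any enumerator the switch forgets to handle, which is the exhaustiveness guarantee mentioned above:

```cpp
#include <stdexcept>

// Illustrative stand-ins for the two error enums being mapped.
enum class download_result { success, notfound, timedout, failed };
enum class errc { success, download_not_found, timeout, download_failure };

// Exhaustive mapping: no default case, so -Wswitch warns if a new
// download_result enumerator is added and not handled here.
errc map_download_result(download_result r) {
    switch (r) {
    case download_result::success:
        return errc::success;
    case download_result::notfound:
        return errc::download_not_found;
    case download_result::timedout:
        return errc::timeout;
    case download_result::failed:
        return errc::download_failure;
    }
    throw std::logic_error("unhandled download_result");
}
```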

///
/// The type of the error code should be known
template<class T, class E>
result<T> result_convert(result<T>&& res) {
Contributor

cmd+f doesn't find any matches for result_convert; is it used?

Contributor Author

Looks like I ended up using error-converter directly. Will clean this up in a followup.


template<class Clock>
ss::future<result<iobuf>>
basic_placeholder_extent<Clock>::materialize_from_cloud_storage(
Contributor

@Lazin what if we make basic_placeholder_extent closer to a dumb object and instead have some sort of materialization service which does the materialization? Or just put these methods into the reader. Why do we choose to attach so much functionality to the extent class?

Contributor Author

This is not the final version of this code, so don't overthink it. The goal of the placeholder extent is to be used in one place in the read path for a month or two, until we roll out the proper read path with reader reuse/caching and centralized control over materialized resources (similar to what we have now in TS). The goal of this code is to unblock further development and be correct.

Member

The goal of this code is to unblock further development and be correct.

Yes, but keep things as simple as possible. I think @nvartolomei is right that it would be better to have a plain object; it would be simpler.

Member

@nvartolomei after we merge this, i think you should feel free to come back and do some clean up / refactoring work for stuff like this. i'm planning to do this as well as a forcing function to get into the code in more detail as well.

Contributor Author

I'll make the change and we will see if it's getting much simpler.

Lazin added 10 commits October 15, 2024 18:18
Signed-off-by: Evgeny Lazin <[email protected]>
The extent represents the placeholder and the data that the placeholder
references. The extent can be used to materialize the data and to
restore the original 'raft_data' batch used to create the extent.

Signed-off-by: Evgeny Lazin <[email protected]>
The fixture is supposed to be used to test reader and extent components.
It can be used to generate the test data and set up mocks (cache and
cloud storage API).

Signed-off-by: Evgeny Lazin <[email protected]>
The test validates materialization of the extent in both the cached
and uncached cases.

Signed-off-by: Evgeny Lazin <[email protected]>
The reader consumes placeholder batches from the underlying reader
(storage::log_segment_reader) and transforms them into raft_data
batches. It does this by creating placeholder_extent instance for every
dl_placeholder batch and materializing this extent using cloud storage
or disk cache.

Signed-off-by: Evgeny Lazin <[email protected]>
Signed-off-by: Evgeny Lazin <[email protected]>
@Lazin Lazin force-pushed the feature/l0-read-path branch from d7d5263 to 1ba9187 on October 15, 2024 22:18
@Lazin Lazin requested review from nvartolomei and dotnwat October 15, 2024 22:21
@dotnwat
Member

dotnwat commented Oct 16, 2024

/ci-repeat 1









return p.size_bytes;
}

inline ss::future<result<ss::circular_buffer<placeholder_extent>>>
Member

this does not need to be inline

retry_chain_node* rtc) {
absl::node_hash_map<uuid_t, ss::lw_shared_ptr<hydrated_L0_object>> hydrated;
ss::circular_buffer<placeholder_extent> extents;
for (auto&& p : placeholders) {
Member

why is this auto&&?

Contributor Author

it's moved in the loop body

Member

@Lazin

it's moved in the loop body

but auto&& isn't an r-value reference, right? it's a forwarding reference, but there isn't any type deduction happening here?
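To make the auto&& point concrete: type deduction does happen, but for a range-for over an lvalue container the forwarding reference collapses to an lvalue reference, so auto&& behaves exactly like auto& here, and the move only happens because of the explicit std::move. A standalone sketch (drain is a hypothetical helper, not code from the PR):

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Move all strings out of src. auto&& in the loop deduces std::string&
// for this container; the move comes from std::move, not from auto&&.
inline std::vector<std::string> drain(std::vector<std::string>& src) {
    std::vector<std::string> out;
    out.reserve(src.size());
    for (auto&& s : src) {
        static_assert(
          std::is_lvalue_reference_v<decltype(s)>,
          "for an lvalue container, auto&& collapses to std::string&");
        out.push_back(std::move(s));
    }
    return out;
}
```

auto&& earns its keep with proxy iterators (e.g. std::vector<bool>) whose dereference yields a prvalue that plain auto& cannot bind to; for an ordinary container, auto& would behave identically.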




@vbotbuildovich
Collaborator

the below tests from https://buildkite.com/redpanda/redpanda/builds/56578#019292b3-5176-4495-b57c-78246c95457a have failed and will be retried

gtest_raft_rpunit

@vbotbuildovich
Collaborator

Retry command for Build#56578

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_config_test.py::ClusterConfigLegacyDefaultTest.test_legacy_default@{"wipe_cache":false}

@Lazin Lazin merged commit 1754e47 into redpanda-data:dev Oct 16, 2024
17 checks passed
@Lazin Lazin mentioned this pull request Oct 16, 2024