Add `record_multiplexer` microbenchmarks #24155

ballard26 · 2024-11-18T06:51:53Z

This PR adds benchmarks for record_multiplexer, along with the following pieces needed to support them:

A cmake-build-compatible random Protobuf message generator.
Protobuf support for record_generator.
A serde parquet writer that writes to a null ostream.

Backports Required

Release Notes

none

ballard26 · 2024-11-22T06:02:09Z

src/v/datalake/tests/record_generator.cc

+    co_return std::nullopt;
+}
+
+iobuf encode_protobuf_message_index(const std::vector<int32_t>& message_index) {


Is there an existing serializer for this message index format anywhere in our codebase? There is a deserializer, get_proto_offsets, in src/v/datalake/schema_registry.h. Happy to move this serializer to a more general location if there is any use for it outside of the record generator.

Closest thing is

redpanda/src/v/datalake/tests/record_schema_resolver_test.cc

Lines 63 to 72 in 514f1df

iobuf encode_pb_offsets(const std::vector<int32_t>& offsets) {
    auto cnt_bytes = vint::to_bytes(offsets.size());
    iobuf buf;
    buf.append(cnt_bytes.data(), cnt_bytes.size());
    for (auto o : offsets) {
        auto bytes = vint::to_bytes(o);
        buf.append(bytes.data(), bytes.size());
    }
    return buf;
}

I don't have strong feelings about code placement; I think leaving it in the record generator seems reasonable.
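For readers unfamiliar with the message index format discussed in this thread, here is a minimal, self-contained sketch of a count-prefixed list of varints with a matching decoder. It assumes zigzag base-128 varints; whether vint::to_bytes produces exactly this byte layout is an assumption, and the names encode_message_index and decode_message_index are illustrative, not Redpanda APIs. It uses std::vector<uint8_t> in place of iobuf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Zigzag-map a signed value so that small magnitudes encode to few bytes.
inline uint64_t zigzag(int64_t v) {
    return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

// Inverse of zigzag().
inline int64_t unzigzag(uint64_t u) {
    return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1);
}

// Append v as a base-128 varint: 7 payload bits per byte, high bit set
// on every byte except the last.
inline void append_varint(std::vector<uint8_t>& out, int64_t v) {
    uint64_t u = zigzag(v);
    while (u >= 0x80) {
        out.push_back(static_cast<uint8_t>((u & 0x7f) | 0x80));
        u >>= 7;
    }
    out.push_back(static_cast<uint8_t>(u));
}

// Read one varint starting at pos and advance pos past it.
inline int64_t read_varint(const std::vector<uint8_t>& in, size_t& pos) {
    uint64_t u = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size() && shift < 64);
        const uint8_t b = in[pos++];
        u |= static_cast<uint64_t>(b & 0x7f) << shift;
        if ((b & 0x80) == 0) {
            break;
        }
        shift += 7;
    }
    return unzigzag(u);
}

// Illustrative serializer: element count first, then each index.
std::vector<uint8_t> encode_message_index(const std::vector<int32_t>& idx) {
    std::vector<uint8_t> out;
    append_varint(out, static_cast<int64_t>(idx.size()));
    for (int32_t i : idx) {
        append_varint(out, i);
    }
    return out;
}

// Illustrative deserializer: the inverse of encode_message_index().
std::vector<int32_t> decode_message_index(const std::vector<uint8_t>& in) {
    size_t pos = 0;
    const int64_t count = read_varint(in, pos);
    std::vector<int32_t> idx;
    for (int64_t i = 0; i < count; ++i) {
        idx.push_back(static_cast<int32_t>(read_varint(in, pos)));
    }
    return idx;
}
```

Under these assumptions, the index list {0} used by the benchmark encodes to the two bytes 0x02 0x00 (a count of one, then the value zero), and decoding the encoded bytes returns the original list.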

vbotbuildovich · 2024-11-22T09:23:55Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58542#019352d8-d1a4-477d-92c3-9735aa6242dc

src/v/serde/protobuf/tests/data_generator.cc

StephanDollberg · 2024-12-10T09:33:53Z

@andrwng @mmaslankaprv can we get this reviewed, please? It would be good to get this in ASAP so that perf can be tracked.

andrwng

Generally looks good! Nice work

src/v/datalake/tests/record_multiplexer_bench.cc

andrwng · 2024-12-11T01:07:39Z

src/v/datalake/tests/record_multiplexer_bench.cc

+    // (12 payload + 3 header) bytes per field
+    static std::string proto_schema = generate_linear_proto(40);
+
+    static thread_local chunked_vector<model::record_batch> batch_data


Curious, what's the significance of this being static thread_local?

There is none, I had copied this code from an earlier iteration of the microbenchmark I had written that ran on multiple shards and it made more sense there. Will remove.

Thanks!

Still curious, what's the significance of the remaining static?

Seastar will re-run the microbenchmark multiple times to try to account for variance in the results. static just ensures that each run will be with the same dataset. Another, potentially better, option would be to fix the random seed.

The other option would be to have each run use a unique dataset. Some of the dataset characteristics, like schema and string length, would be constant, but the string values would be unique per run. There's an argument for that being the better way to go. I just started with this as the default because it makes runs quicker, since a new dataset doesn't have to be generated each time.

Got it, thanks for explaining.

Yeah I guess with a single data set I would expect the variance between runs to be pretty low, given there isn't a ton of non-determinism here with IO and such. Curious if you've observed that? If so, yea maybe longer term it'd be interesting to see the effects of different data sets with similarly sized data.

I don't think that needs to block this PR though

I went ahead and tested things with random datasets on each run, and variance was very low (<10ns). So I removed the static specifiers on the batch generation.
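The three dataset strategies weighed in this thread can be sketched in plain C++ as follows. This is a hedged illustration only: generate_dataset is a hypothetical stand-in for the benchmark's batch generation, the record count and string length are arbitrary, and nothing here uses Seastar or Redpanda types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for batch generation: `records` random lowercase
// strings of length `len`.
std::vector<std::string>
generate_dataset(std::mt19937& rng, size_t records, size_t len) {
    std::vector<std::string> out;
    out.reserve(records);
    for (size_t r = 0; r < records; ++r) {
        std::string s(len, 'a');
        for (char& c : s) {
            c = static_cast<char>('a' + rng() % 26);
        }
        out.push_back(std::move(s));
    }
    return out;
}

// Option 1: a function-local static. The dataset is generated on the first
// call and reused by every later run, so runs share data and skip setup.
const std::vector<std::string>& static_dataset() {
    static const std::vector<std::string> data = [] {
        std::mt19937 rng{std::random_device{}()};
        return generate_dataset(rng, 100, 12);
    }();
    return data;
}

// Option 2: regenerate on every run from a fixed seed. Values are identical
// across runs, but each run pays the generation cost.
std::vector<std::string> seeded_dataset(uint32_t seed = 42) {
    std::mt19937 rng{seed};
    return generate_dataset(rng, 100, 12);
}

// Option 3: a fresh dataset per run. Shape (count, length) is constant,
// values differ between runs.
std::vector<std::string> fresh_dataset() {
    std::mt19937 rng{std::random_device{}()};
    return generate_dataset(rng, 100, 12);
}
```

Option 3 corresponds to what the PR ended up with after the static specifiers were removed; in all three cases the generation happens outside the timed region, so only setup time differs.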

src/v/datalake/tests/record_multiplexer_bench.cc

andrwng · 2024-12-11T01:26:58Z

src/v/datalake/tests/record_multiplexer_bench.cc

+    static thread_local chunked_vector<model::record_batch> batch_data
+      = co_await generate_protobuf_batches(
+        records_per_batch,
+        batches,
+        "proto_schema",
+        proto_schema,
+        {0},
+        gen_config);
+
+    auto reader = model::make_fragmented_memory_record_batch_reader(
+      share_batches(batch_data));
+
+    auto consumer = counting_consumer{.mux = create_mux()};
+
+    perf_tests::start_measuring_time();
+    auto res = co_await reader.consume(std::move(consumer), model::no_timeout);
+    perf_tests::stop_measuring_time();
+
+    co_return res.total_bytes;


nit: I'm wondering if we should wrap this in some measure_proto(std::string schema) and use it everywhere (same for avro)

It'd be possible, I originally avoided it so that folks could look at a given microbenchmark and see right away what was being measured. However, it is rather repetitive. Happy to factor it out if you think it would be better that way.

I don't read microbenchmarks particularly frequently, but it feels like it'd be nice to encapsulate those details (e.g. so it's more obvious to readers that each one is actually measuring in the same way, if that's the intent). Though maybe that's expected of a microbench? Ultimately I'll leave the call up to you.

Encapsulation it is then, don't have a strong opinion on this myself.
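As a rough illustration of the encapsulation agreed on here, the sketch below factors the shared "set up untimed, time only the consume step, return bytes" shape into one helper. It uses std::chrono rather than Seastar's perf_tests timers, and the names measurement, measure, and measure_strings are hypothetical; the helper actually added in the PR may look different.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

struct measurement {
    size_t total_bytes;
    std::chrono::nanoseconds elapsed;
};

// Hypothetical shared helper: `setup` builds the input outside the timed
// region, and only `consume` is measured. Every benchmark that goes through
// this function is therefore timed the same way.
template<typename Setup, typename Consume>
measurement measure(Setup&& setup, Consume&& consume) {
    auto data = setup();
    const auto start = std::chrono::steady_clock::now();
    const size_t bytes = consume(data);
    const auto stop = std::chrono::steady_clock::now();
    return {
      bytes,
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}

// A per-case wrapper then only states what differs: the input shape.
measurement measure_strings(size_t records, size_t len) {
    return measure(
      [=] { return std::vector<std::string>(records, std::string(len, 'x')); },
      [](const std::vector<std::string>& d) {
          size_t n = 0;
          for (const auto& s : d) {
              n += s.size();
          }
          return n;
      });
}
```

The trade-off raised in the thread still applies: a reader has to open the helper to see exactly what is timed, but every case is guaranteed to be measured identically.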

vbotbuildovich · 2024-12-17T02:35:03Z

CI test results

test results on build#59830

test_id	test_kind	job_url	test_status	passed
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/59830#0193d1f9-5d9d-4e88-a76f-43f4eab17c6d	FLAKY	3/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/59830#0193d20c-7200-4604-811b-a052fc979b26	FLAKY	4/6

test results on build#59927

test_id	test_kind	job_url	test_status	passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/59927#0193db55-7ff2-4279-a59d-beaf6097a1a7	FAIL	0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/59927#0193db55-7ff6-4b68-b598-b960eb9ba007	FLAKY	2/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/59927#0193db6e-d02e-428f-a309-60e98bac61e1	FLAKY	2/6

andrwng

Remaining feedback is pretty cosmetic, + one question. Otherwise LGTM

This data writer is for tests. It doesn't write to any files; however, it does go through the process of converting the ostream to parquet.

vbotbuildovich · 2024-12-18T22:04:49Z

Retry command for Build#59927

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

ballard26 · 2024-12-18T23:42:34Z

The CI failure in tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention is unrelated to this PR.

vbotbuildovich · 2024-12-19T00:00:13Z

/backport v24.3.x

vbotbuildovich · 2024-12-19T00:01:31Z

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24155-v24.3.x-778 remotes/upstream/v24.3.x
git cherry-pick -x fe9df92377 2e24c6f203 a0bd72b003 c77ce4bc4a b187703a06 1c29911e06

Workflow run logs.

github-actions bot added the area/redpanda label Nov 18, 2024

ballard26 changed the title ~~[WIP] Add microbenchmarks to datalake~~ [WIP] Add record_multiplexer microbenchmarks Nov 18, 2024

ballard26 force-pushed the iceberg-microbench-1 branch 2 times, most recently from 3712bfa to da6f4e6 Compare November 22, 2024 05:44

github-actions bot added area/build area/wasm WASM Data Transforms labels Nov 22, 2024

ballard26 force-pushed the iceberg-microbench-1 branch from da6f4e6 to 2332bcb Compare November 22, 2024 05:44

ballard26 changed the title ~~[WIP] Add record_multiplexer microbenchmarks~~ Add record_multiplexer microbenchmarks Nov 22, 2024

ballard26 requested review from mmaslankaprv, andrwng and nvartolomei November 22, 2024 05:50

ballard26 marked this pull request as ready for review November 22, 2024 05:51

ballard26 force-pushed the iceberg-microbench-1 branch 2 times, most recently from a77d575 to 5f7e229 Compare November 22, 2024 05:53

ballard26 commented Nov 22, 2024

ballard26 force-pushed the iceberg-microbench-1 branch 4 times, most recently from 1a07259 to d3cc7ab Compare November 22, 2024 06:12

mmaslankaprv reviewed Nov 22, 2024

src/v/serde/protobuf/tests/data_generator.cc

ballard26 force-pushed the iceberg-microbench-1 branch 2 times, most recently from fa6789a to a151b5e Compare November 23, 2024 22:10

ballard26 requested a review from mmaslankaprv November 25, 2024 16:50

andrwng reviewed Dec 11, 2024

ballard26 force-pushed the iceberg-microbench-1 branch from a151b5e to 929b102 Compare December 16, 2024 22:57

ballard26 requested a review from andrwng December 16, 2024 22:57

andrwng previously approved these changes Dec 17, 2024

ballard26 dismissed andrwng’s stale review via 048f3e9 December 18, 2024 00:42

ballard26 force-pushed the iceberg-microbench-1 branch from 929b102 to 048f3e9 Compare December 18, 2024 00:42

ballard26 requested a review from andrwng December 18, 2024 00:46

andrwng previously approved these changes Dec 18, 2024

ballard26 dismissed andrwng’s stale review via a446eb8 December 18, 2024 01:03

ballard26 force-pushed the iceberg-microbench-1 branch from 048f3e9 to a446eb8 Compare December 18, 2024 01:03

andrwng previously approved these changes Dec 18, 2024

ballard26 added 6 commits December 18, 2024 14:01

treewide: refactor avro data_generator

fe9df92

serde/protobuf: add protobuf data_generator

2e24c6f

datalake: add protobuf support to record_generator

a0bd72b

utils: move null_output_stream to utils

c77ce4b

datalake: add test_serde_parquet_data_writer

b187703

This data writer is for tests. It doesn't write to any files; however, it does go through the process of converting the ostream to parquet.

datalake: add record_multiplexer microbenchmark

1c29911

ballard26 dismissed andrwng’s stale review via 1c29911 December 18, 2024 19:09

ballard26 force-pushed the iceberg-microbench-1 branch from a446eb8 to 1c29911 Compare December 18, 2024 19:09

ballard26 requested a review from andrwng December 18, 2024 23:41

andrwng approved these changes Dec 18, 2024

piyushredpanda merged commit 27905b2 into redpanda-data:dev Dec 18, 2024
14 of 17 checks passed

vbotbuildovich mentioned this pull request Dec 19, 2024

[v24.3.x] Add record_multiplexer microbenchmarks #24611

Open
