datalake: helper code to get schemas for records #23308

jcipar · 2024-09-12T21:09:43Z

This adds some helper methods to get the schema for a record, and modifies the record multiplexer to use them. The translators for other types are still in progress, so we don't yet do anything with this information.

Backports Required

Release Notes

none

dotnwat · 2024-09-13T05:31:17Z

src/v/datalake/schema_registry.cc

+ */
+#include "model/record.h"
+
+std::optional<uint32_t> get_value_schema_id(const model::record& record) {


where is the format of the record defined? is this format defined by the schema registry? if so, perhaps this should be a helper that lives in schema registry?

dotnwat · 2024-09-13T05:34:18Z

src/v/datalake/tests/CMakeLists.txt

+    v::application
+    v::features
+    v::gtest_main
+    v::kafka_test_utils
+    v::datalake
+    v::model_test_utils


i don't think you need all these dependencies

dotnwat · 2024-09-13T05:34:44Z

src/v/datalake/schema_registry.h

+#pragma once
+
+#include "bytes/iobuf_parser.h"
+#include "model/record.h"


forward declare model::record

dotnwat · 2024-09-13T05:34:50Z

src/v/datalake/schema_registry.h

+ */
+#pragma once
+
+#include "bytes/iobuf_parser.h"


unused import

dotnwat · 2024-09-13T05:36:33Z

src/v/wasm/tests/wasm_fixture.h

@@ -74,3 +75,6 @@ class WasmTestFixture : public ::testing::Test {
    model::transform_metadata _meta;
    std::vector<ss::sstring> _log_lines;
 };
+
+// For using fake_schema_registry outside of WASM fixtures.
+std::unique_ptr<wasm::schema_registry> make_fake_schema_registry();


hmm, maybe the fake schema registry can live with schema registry? cc @rockwotj

I had suggested to @jcipar to move it to src/v/schema_registry. I'm not sure how we want to structure the HTTP service vs the internal functionality. Maybe we have src/v/schema_registry/http. IDK we should involve the enterprise team in this discussion too.

cc @BenPope hi!

dotnwat · 2024-09-13T05:37:22Z

src/v/datalake/schema_registry.h

@@ -11,10 +11,23 @@

 #include "bytes/iobuf_parser.h"
 #include "model/record.h"
+#include "pandaproxy/schema_registry/types.h"
+#include "wasm/schema_registry.h"


forward declare schema registry

dotnwat · 2024-09-13T05:37:39Z

src/v/datalake/schema_registry.h

@@ -11,10 +11,23 @@

 #include "bytes/iobuf_parser.h"
 #include "model/record.h"
+#include "pandaproxy/schema_registry/types.h"
+#include "wasm/schema_registry.h"

 #include <optional>


dotnwat · 2024-09-13T05:39:05Z

src/v/datalake/record_multiplexer.cc

-    batch.for_each_record([&batch, this](model::record&& record) {
-        iobuf key = record.release_key();
-        iobuf val = record.release_value();
-        // *1000: Redpanda timestamps are milliseconds. Iceberg uses
-        // microseconds.
-        int64_t timestamp = (batch.header().first_timestamp.value()
-                             + record.timestamp_delta())
-                            * 1000;
-        int64_t offset = static_cast<int64_t>(batch.base_offset())
-                         + record.offset_delta();
-        int64_t estimated_size = key.size_bytes() + val.size_bytes() + 16;
-
-        // Translate the record
-        auto& translator = get_translator();
-        iceberg::struct_value data = std::visit(
-          [&key, &val, timestamp, offset](schemaless_translator& tr) {
-              return tr.translate_event(
-                std::move(key), std::move(val), timestamp, offset);
-          },
-          translator);
-
-        // Send it to the writer
-        auto& writer = get_writer();
-        writer.add_data_struct(std::move(data), estimated_size);
-    });
+    co_await batch.for_each_record_async(
+      [&batch, this](model::record&& record) -> ss::future<> {


separate commit explaining change to async?

dotnwat · 2024-09-13T05:40:25Z

src/v/datalake/record_multiplexer.cc

+          int64_t estimated_size = key.size_bytes() + val.size_bytes() + 16;
+
+          // Translate the record
+          auto& translator = co_await get_translator(record);


coroutine lambdas aren't allowed

BenPope · 2024-09-17T06:34:38Z

src/v/datalake/schema_registry.h

+using get_schema_result = std::variant<
+  pandaproxy::schema_registry::canonical_schema_definition,
+  get_schema_error>;


Something like the following is more idiomatic:

    namespace datalake {

    enum class error_code { success = 0, no_schema_id, invalid_schema_id };

    std::error_code make_error_code(error_code e) noexcept;

    template<typename T>
    using result = result<T, error_info>;

    } // namespace datalake

    namespace std {
    template<>
    struct is_error_code_enum<datalake::error_code> : true_type {};
    } // namespace std
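
For readers unfamiliar with the idiom, here is a minimal, self-contained sketch of how an enum is hooked into std::error_code. The namespace datalake_sketch, the category class, and the message strings are made up for illustration; only the enum value names come from the suggestion above, and this is not the code that was merged.

```cpp
#include <cassert>
#include <string>
#include <system_error>
#include <type_traits>

namespace datalake_sketch {

// Hypothetical error enum; the value names mirror the review suggestion.
enum class error_code { success = 0, no_schema_id, invalid_schema_id };

// The category gives the enum values a name and human-readable messages.
class error_category_impl final : public std::error_category {
public:
    const char* name() const noexcept override { return "datalake_sketch"; }

    std::string message(int ev) const override {
        switch (static_cast<error_code>(ev)) {
        case error_code::success:
            return "success";
        case error_code::no_schema_id:
            return "no schema id";
        case error_code::invalid_schema_id:
            return "invalid schema id";
        }
        return "unknown";
    }
};

inline const std::error_category& error_category() noexcept {
    static const error_category_impl instance;
    return instance;
}

// Found by argument-dependent lookup when an enum value is converted to
// std::error_code.
inline std::error_code make_error_code(error_code e) noexcept {
    return {static_cast<int>(e), error_category()};
}

} // namespace datalake_sketch

namespace std {
// Opts the enum in to implicit conversion to std::error_code.
template<>
struct is_error_code_enum<datalake_sketch::error_code> : true_type {};
} // namespace std
```

With this in place, `std::error_code ec = datalake_sketch::error_code::no_schema_id;` compiles, and callers can compare `ec` against enum values directly.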

BenPope · 2024-09-17T06:39:35Z

src/v/datalake/schema_registry.cc

+        return std::nullopt;
+    }
+    auto id = parser.consume_type<uint32_t>();
+    return id;


This might not be sufficient for protobuf, which uses zig-zag encoded offsets to get the correct message if the Subject Name Strategy isn't TopicName.

It may be worth extracting and reusing some of the code in the existing validation

dotnwat · 2024-09-30T21:39:59Z

src/v/datalake/schema_registry.h

+ */
+#pragma once
+
+#include "bytes/iobuf_parser.h"


unused import

dotnwat · 2024-09-30T21:40:35Z

src/v/datalake/schema_registry.h

+// Extract the schema id from a record's value. This simply extracts the id. It
+// does not do any validation. Returns std::nullopt if the record does not have
+// a schema id.
+get_schema_id_result get_value_schema_id(const iobuf& record);


forward declare iobuf

dotnwat · 2024-09-30T21:45:31Z

src/v/datalake/schema_registry.cc

+    }
+    iobuf_const_parser parser(buf);
+    auto magic = parser.consume_type<uint8_t>();
+    if (magic != 0) {


curious if this is a value we control?

This is the schema registry format leading "magic byte".

Some named constants would help in reading the code here

thanks. was mostly just curious--zero would not be my choice for a magic value, but i guess we can't control everything!
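
As a rough illustration of the "named constants" suggestion, the sketch below parses the leading magic byte and the 4-byte big-endian schema id from a plain byte vector. The constant names, the function name parse_schema_id, and the use of std::vector instead of iobuf are assumptions made to keep the example standalone; it is not the code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Leading "magic byte" of the schema registry wire format.
constexpr std::uint8_t schema_registry_magic_byte = 0;
// Size of the big-endian schema id that follows the magic byte.
constexpr std::size_t schema_id_size = sizeof(std::uint32_t);

// Returns the schema id, or std::nullopt if the value is too short or does
// not start with the magic byte. (Hypothetical signature for illustration.)
inline std::optional<std::uint32_t>
parse_schema_id(const std::vector<std::uint8_t>& value) {
    if (value.size() < 1 + schema_id_size) {
        return std::nullopt;
    }
    if (value[0] != schema_registry_magic_byte) {
        return std::nullopt;
    }
    // Assemble the id most-significant byte first (big-endian).
    std::uint32_t id = 0;
    for (std::size_t i = 0; i < schema_id_size; ++i) {
        id = (id << 8) | value[1 + i];
    }
    return id;
}
```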

dotnwat · 2024-09-30T21:55:17Z

src/v/datalake/schema_registry.cc

+    if (static_cast<size_t>(offset_count) > parser.bytes_left()) {
+        return get_schema_error::not_enough_bytes;
+    }
+    offsets.resize(offset_count);
+    for (auto& o : offsets) {
+        std::tie(o, bytes_read) = parser.read_varlong();


doesn't this provide inconsistent semantics for short-read scenarios? for example, even if offset_count < parser.bytes_left(), you could still run out of bytes when parsing, since some logical offsets may take more than 1 byte from the parser. in that case, you could get back either not_enough_bytes or an exception thrown out of the parser for the same error scenario?

This is a bit inconsistent.

There are a few different reasons we might stop parsing a varint:

- Found a byte with the top bit cleared.
- Read the maximum number of bytes for the given type.
- Got to the end of the buffer.

Our varint parser doesn't distinguish between those cases. E.g. detail::var_decoder in vint.h returns true (meaning "stop parsing") upon finding a byte with the top bit cleared, or when "shift > limit", i.e. the number of bits read is greater than the size.

deserialize in that same module loops through a range until it runs out of bytes or var_decoder returns true, and then returns the number of bytes read.

As far as I can tell, currently the only way we could return 0 bytes read is if we call it at the end of the buffer. Otherwise it always reads at least one byte. Also, we currently don't detect improperly formatted varints. So maybe I should just return not_enough_bytes in both cases.

In the future we could detect improperly-formatted varints and return 0 bytes read from the parser in those cases. In that case it would definitely make sense to separate these cases.

thanks! i took a closer look at our varint decoder. looks like we have some improvements to it we can make in the future!
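
To make the three stop conditions discussed in this thread concrete, here is a standalone sketch of an unsigned base-128 varint decoder that reports each case separately. It is not the decoder in vint.h; the names varint_status, varint_result, and decode_varint are hypothetical, and it works on a std::vector rather than an iobuf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class varint_status { ok, not_enough_bytes, bad_varint };

struct varint_result {
    varint_status status;
    std::uint64_t value;    // meaningful only when status == ok
    std::size_t bytes_read; // bytes consumed from the input
};

// Decodes an unsigned base-128 varint starting at `pos`.
inline varint_result
decode_varint(const std::vector<std::uint8_t>& buf, std::size_t pos = 0) {
    constexpr std::size_t max_bytes = 10; // ceil(64 / 7)
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < max_bytes; ++i) {
        if (pos + i >= buf.size()) {
            // Got to the end of the buffer before a terminating byte.
            return {varint_status::not_enough_bytes, 0, i};
        }
        const std::uint8_t byte = buf[pos + i];
        value |= static_cast<std::uint64_t>(byte & 0x7f) << (7 * i);
        if ((byte & 0x80) == 0) {
            // Top bit cleared: this is the last byte of the varint.
            return {varint_status::ok, value, i + 1};
        }
    }
    // Read the maximum number of bytes without finding a terminator.
    return {varint_status::bad_varint, 0, max_bytes};
}
```

A caller that wants the behaviour proposed above (treat both failures as not_enough_bytes) can simply map bad_varint onto it, and split them apart later without changing the decoder.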

dotnwat · 2024-09-30T21:56:58Z

src/v/datalake/schema_registry.cc

+            return get_schema_error::bad_varint;
+        }
+    }
+    if (offsets.empty()) {


do you also need to validate that the number of offsets read is the same as offset_count?

I'm not sure how they wouldn't be. The loop that reads the offsets will execute that many times, and if it fails to read an offset it returns an error.

Invalid offsets aside, if offset_count == 0 then we should return {0}. Here's the same method in franz-go:

https://github.com/twmb/franz-go/blob/b77dd13e2bfaee7f5181df27b40ee4a4f6a73b09/pkg/sr/serde.go#L482-L507

We do that here, but in a very roundabout way by resizing then checking the result of that method.

> I'm not sure how they wouldn't be

i'm asking if it is safe to trust the size value you are decoding from the wire format as the prefix on the data.

There is a test for this case, and generally our iobuf parser throws on short input

rockwotj · 2024-10-01T15:32:19Z

src/v/datalake/schema_registry.h

+get_schema_id_result get_value_schema_id(const iobuf& record);
+get_proto_offsets_result get_proto_offsets(const iobuf& record);


You're going to need to know how much of the value to chop off - so I think the success case needs to either return the remaining data (via share) or how many bytes were read.
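
One possible shape for such a result, sketched with std::string_view standing in for iobuf::share: the success case carries both the schema id and a non-owning view of the bytes after the prefix. The struct and function names here are hypothetical and only illustrate the design choice, not the API that was merged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Success result: the parsed id plus the remaining, un-copied payload.
struct schema_prefix {
    std::uint32_t schema_id;
    std::string_view payload;
};

// Splits "magic byte + 4-byte big-endian id" off the front of `value`.
inline std::optional<schema_prefix> split_schema_prefix(std::string_view value) {
    constexpr std::size_t prefix_size = 5; // 1 magic byte + 4 id bytes
    if (value.size() < prefix_size || value[0] != '\0') {
        return std::nullopt;
    }
    std::uint32_t id = 0;
    for (std::size_t i = 1; i < prefix_size; ++i) {
        id = (id << 8) | static_cast<unsigned char>(value[i]);
    }
    // The view now starts at the first byte of the actual record payload.
    value.remove_prefix(prefix_size);
    return schema_prefix{id, value};
}
```

Returning the remaining view (rather than a byte count) means the caller cannot accidentally slice at the wrong offset; returning a count instead would work equally well if the caller already owns the buffer.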

rockwotj · 2024-10-01T15:36:52Z

src/v/datalake/schema_registry.cc

+            return get_schema_error::bad_varint;
+        }
+    }
+    if (offsets.empty()) {


Invalid offsets aside, if offset_count == 0 then we should return {0}. Here's the same method in franz-go:

https://github.com/twmb/franz-go/blob/b77dd13e2bfaee7f5181df27b40ee4a4f6a73b09/pkg/sr/serde.go#L482-L507

We do that here, but in a very roundabout way by resizing then checking the result of that method.

rockwotj · 2024-10-01T18:28:29Z

src/v/datalake/schema_registry.h

+// does not do any validation. Returns std::nullopt if the record does not have
+// a schema id.


The comment is out of date about the return type.

Can you talk about the mutability of the input buffer since it's taken by mutable reference?

There is still a reference to std::nullopt which is incorrect

rockwotj · 2024-10-01T18:35:51Z

src/v/datalake/tests/schema_registry_test.cc

+
+    uint8_t proto_msg_count = 9;
+    std::array<uint8_t, 16> encoded;
+    size_t encoded_size = vint::serialize(proto_msg_count, encoded.data());


nit: vint::to_bytes is probably simpler and will make this a bit more readable.

rockwotj · 2024-10-01T18:37:37Z

src/v/datalake/tests/schema_registry_test.cc

+    EXPECT_EQ(offsets.size(), proto_msg_count);
+    for (int32_t o = 0; o < offsets.size(); o++) {
+        EXPECT_EQ(o, offsets[o]);
+    }


This can be replaced with:

EXPECT_THAT(offsets, ElementsAre(0, 1, 2, 3, 4, 5, 6, 7, 8))

Ref: http://google.github.io/googletest/reference/matchers.html#container-matchers

rockwotj · 2024-10-01T18:38:10Z

src/v/datalake/tests/schema_registry_test.cc

+    // that the message is the first one defined in the schema and return {0}.
+    iobuf value;
+
+    uint8_t magic = 0; // Invalid magic


wrong comment

rockwotj · 2024-10-01T18:38:16Z

src/v/datalake/tests/schema_registry_test.cc

+TEST(DatalakeSchemaRegistry, GetProtoOffsetsOk) {
+    iobuf value;
+
+    uint8_t magic = 0; // Invalid magic


wrong comment

whoops. Copy-and-paste error :-(

rockwotj · 2024-10-01T18:39:09Z

src/v/datalake/tests/schema_registry_test.cc

+
+    uint8_t magic = 0;
+    int32_t schema_id = 12;
+    int32_t schema_id_encoded = htobe32(schema_id);


TBH I find the name htobe32 to be quite horrible (not your fault). Seastar has nice helpers for this: ss::cpu_to_be

rockwotj · 2024-10-01T18:39:43Z

src/v/datalake/tests/schema_registry_test.cc

+    int32_t schema_id_encoded = htobe32(schema_id);
+    std::string payload = "Hello world";
+    value.append(&magic, 1);
+    value.append(reinterpret_cast<uint8_t*>(&schema_id_encoded), 4);


some of these tests could be more readable with a helper method to construct an iobuf with something like:

redpanda/src/v/serde/thrift/tests/compact_test.cc

Lines 31 to 41 in 52b4423

    void buf_append(iobuf& b, uint8_t byte) { b.append(&byte, 1); }
    void buf_append(iobuf& b, const bytes& byte) {
        b.append(byte.data(), byte.size());
    }

    template<typename... Args>
    iobuf buf_from(const Args&... args) {
        iobuf b;
        (buf_append(b, args), ...);
        return b;
    }

rockwotj · 2024-10-01T18:39:53Z

src/v/datalake/tests/schema_registry_test.cc

+TEST(DatalakeSchemaRegistry, GetProtoOffsetsNotEnoughData) {
+    iobuf value;
+
+    uint8_t magic = 0; // Invalid magic


wrong comment

rockwotj

Looks good - just two small things

rockwotj · 2024-10-01T20:03:07Z

src/v/datalake/schema_registry.h

+// does not do any validation. Returns std::nullopt if the record does not have
+// a schema id.


There is still a reference to std::nullopt which is incorrect

rockwotj · 2024-10-01T20:04:27Z

src/v/datalake/tests/schema_registry_test.cc

+    // value.append(&magic, 1);
+    // value.append(reinterpret_cast<uint8_t*>(&schema_id_encoded), 4);


vbotbuildovich · 2024-10-01T22:34:39Z

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#019249fa-05d7-4bed-865a-46d4432acc78:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#019249fa-05d9-4f38-9899-b4215ed935d4:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924a14-5ba5-4fc8-81e3-2d4b61fdf340:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924a14-5ba3-4283-9d11-94fb65d19c8e:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924a5c-4106-4e85-935b-a541fcf58333:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924a5c-4193-4a3c-a8a3-d8d8e0796adf:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924adb-fabe-4aae-825a-d9caf5f1440c:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924adb-fca6-4da1-9877-f67c222b725f:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924adb-f8f6-4399-b02c-0ed9e689a654:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924adb-f45a-457d-b114-d3fe446d51bb:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924d20-098d-405a-b911-37860f623ce7:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924d20-0930-47b0-a267-0b7849c92664:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924d20-08b4-4557-9310-71b2d1fad66b:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924d20-07f0-413b-9d88-569e3ffd68a9:

"rptest.tests.debug_bundle_test.DebugBundleTest.test_post_debug_bundle.ignore_none=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/55595#01924de6-0a86-4b07-9ce1-46463f696ec8:

"rptest.tests.rpk_generate_test.RpkGenerateTest.test_generate_grafana"

vbotbuildovich · 2024-10-01T22:49:15Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#019249fa-05d5-4d94-9e8f-771153657a5d

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#019249fa-05d7-4bed-865a-46d4432acc78

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#01924a14-5ba3-4283-9d11-94fb65d19c8e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#01924de6-0a86-4b07-9ce1-46463f696ec8

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#01924de6-0b07-4421-a705-e7d361338bb1

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#01924de6-0a0d-46af-9175-0d7904890e56

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55595#01924de6-0976-49c0-b000-acc4aa8bce00

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55673#01924efa-cb1c-4bc0-b0e3-e5feaa0fb032

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55673#01924ee7-486c-40fc-9fc5-cb3ff9606e24

BenPope

No objections

BenPope · 2024-10-02T13:15:16Z

src/v/datalake/schema_registry.h

+#include "pandaproxy/schema_registry/types.h"
+
+#include <system_error>
+#include <type_traits>


Suggested change

-#include "pandaproxy/schema_registry/types.h"
-
-#include <system_error>
-#include <type_traits>
+#include "base/outcome.h"
+#include "pandaproxy/schema_registry/types.h"
+
+#include <system_error>
+#include <vector>

This adds a function to parse the schema id from the value of a record.

Protobuf records have an additional prefix after the schema id. Since a protobuf schema can contain multiple message types, and those messages may be nested, they contain a list of message ids to be traversed to find the actual message schema for this record. These are encoded as a varint representing the length followed by a list of varints.
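
A minimal sketch of parsing that protobuf prefix, assuming the count and each offset are zig-zag encoded varints (as mentioned in review) and that a count of zero is shorthand for {0}. It operates on a std::vector instead of an iobuf, and the function names are hypothetical; it is meant to illustrate the layout rather than reproduce the merged implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Reads one zig-zag encoded varint from `buf` at `pos`, advancing `pos`.
// Returns std::nullopt on a short read or an over-long varint.
inline std::optional<std::int64_t>
read_zigzag_varint(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    std::uint64_t raw = 0;
    for (unsigned shift = 0; shift < 64; shift += 7) {
        if (pos >= buf.size()) {
            return std::nullopt;
        }
        const std::uint8_t byte = buf[pos++];
        raw |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) {
            // Undo zig-zag: 0, 1, 2, 3, ... maps to 0, -1, 1, -2, ...
            return static_cast<std::int64_t>(raw >> 1)
                   ^ -static_cast<std::int64_t>(raw & 1);
        }
    }
    return std::nullopt;
}

// Parses the message-offset list that follows the schema id. `pos` should
// point just past the 5-byte magic + schema id header.
inline std::optional<std::vector<std::int32_t>>
read_proto_offsets(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    const auto count = read_zigzag_varint(buf, pos);
    if (!count || *count < 0) {
        return std::nullopt;
    }
    if (*count == 0) {
        // An empty list means "the first message defined in the schema".
        return std::vector<std::int32_t>{0};
    }
    // Every offset needs at least one byte, so reject counts the remaining
    // input cannot possibly satisfy before allocating anything.
    if (static_cast<std::uint64_t>(*count) > buf.size() - pos) {
        return std::nullopt;
    }
    std::vector<std::int32_t> offsets;
    offsets.reserve(static_cast<std::size_t>(*count));
    for (std::int64_t i = 0; i < *count; ++i) {
        const auto offset = read_zigzag_varint(buf, pos);
        if (!offset) {
            return std::nullopt;
        }
        offsets.push_back(static_cast<std::int32_t>(*offset));
    }
    return offsets;
}
```

After a successful call, `pos` marks where the protobuf message body starts, which is what a translator would need in order to strip the prefix.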

github-actions bot added area/redpanda area/wasm WASM Data Transforms labels Sep 12, 2024

dotnwat reviewed Sep 13, 2024

BenPope reviewed Sep 17, 2024

jcipar force-pushed the jcipar/datalake-schema-registry branch 3 times, most recently from 4df5a62 to 65940dd Compare September 30, 2024 21:37

jcipar marked this pull request as ready for review September 30, 2024 21:37

jcipar requested review from dotnwat, BenPope and rockwotj September 30, 2024 21:37

dotnwat reviewed Sep 30, 2024

jcipar force-pushed the jcipar/datalake-schema-registry branch 2 times, most recently from d348baf to d86f6a5 Compare October 1, 2024 15:28

rockwotj reviewed Oct 1, 2024

jcipar force-pushed the jcipar/datalake-schema-registry branch from d86f6a5 to 5a3b749 Compare October 1, 2024 17:39

rockwotj reviewed Oct 1, 2024

jcipar force-pushed the jcipar/datalake-schema-registry branch 3 times, most recently from 05d5ee0 to 0fdf809 Compare October 1, 2024 19:50

rockwotj reviewed Oct 1, 2024

jcipar force-pushed the jcipar/datalake-schema-registry branch from 0fdf809 to c6e3d5d Compare October 1, 2024 20:17

rockwotj approved these changes Oct 1, 2024

BenPope reviewed Oct 2, 2024

jcipar force-pushed the jcipar/datalake-schema-registry branch from c6e3d5d to c21f5e7 Compare October 2, 2024 19:09

rockwotj approved these changes Oct 2, 2024

jcipar added 2 commits October 2, 2024 19:29

datalake: add function to parse schema prefix

0afdb12

This adds a function to parse the schema id from the value of a record.

jcipar force-pushed the jcipar/datalake-schema-registry branch from c21f5e7 to 0f557dd Compare October 2, 2024 23:29

jcipar merged commit 7ae3406 into redpanda-data:dev Oct 3, 2024
17 checks passed
