iceberg: Single-value JSON serde #24812

oleiman · 2025-01-15T05:18:40Z

Needed for processing default values for schema fields.

As specified in https://iceberg.apache.org/spec/#json-single-value-serialization

Backports Required

Release Notes

none

oleiman · 2025-01-16T06:36:21Z

/ci-repeat 1
skip-redpanda-build
skip-units

oleiman · 2025-01-16T15:11:15Z

/ci-repeat 1
skip-redpanda-build
skip-units

oleiman · 2025-01-16T16:47:49Z

src/v/iceberg/values_json.cc

+        return binary_value{hex_str_to_iobuf(str)};
+    }
+    value operator()(const decimal_type& t) {
+        // TODO(oren): need to support negative scale? see datatypes.h


See table in https://iceberg.apache.org/spec/#json-single-value-serialization

Looks like we need to support scientific notation?

https://github.com/apache/iceberg/blob/b128bba57f613f23ed773f0fa6330c1d2bbf8a39/core/src/test/java/org/apache/iceberg/TestSingleValueParser.java#L53-L55
https://github.com/apache/iceberg/blob/b128bba57f613f23ed773f0fa6330c1d2bbf8a39/core/src/test/java/org/apache/iceberg/TestSingleValueParser.java#L143

For rust:

https://github.com/apache/iceberg-rust/blob/ae04c8a790b0d949d1f5303f91a6a6c2c40a9f9e/crates/iceberg/src/spec/values.rs#L1037

https://github.com/paupino/rust-decimal/blob/46fb4c3c517bc0c27cd534b65f9e8b57c24ba18e/tests/decimal_tests.rs#L3272

Yeah, we need to support scientific notation, but the point here is that up-scaling is signified by a negative scale, whereas decimal_value currently bakes in an unsigned scale. We should change that, but I left it out initially to keep this PR fully isolated.

I think it's fine for a follow-up, since we don't currently support it at the iceberg::datatypes level.

https://redpandadata.atlassian.net/browse/CORE-8835

sounds good
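
To illustrate the negative-scale point from this thread: in BigDecimal-style semantics a decimal denotes unscaled * 10^(-scale), so a negative scale scales the unscaled value up. The sketch below is only an illustration. `signed_scale_decimal` and `expand` are hypothetical names, not part of this PR, and `std::int64_t` stands in for the `absl::int128` that `decimal_value` holds so the example needs only the standard library.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for decimal_value, but with a *signed* scale.
// int64_t replaces absl::int128 to keep the sketch standalone.
struct signed_scale_decimal {
    std::int64_t unscaled;
    std::int32_t scale;
};

// Expand a value with scale <= 0 into the plain integer it denotes,
// i.e. unscaled * 10^(-scale). Positive scales are out of scope here.
inline std::int64_t expand(const signed_scale_decimal& d) {
    if (d.scale > 0) {
        throw std::invalid_argument("positive scale is out of scope here");
    }
    constexpr auto max = std::numeric_limits<std::int64_t>::max();
    constexpr auto min = std::numeric_limits<std::int64_t>::min();
    // Widen before negating so the most negative scale cannot overflow.
    const std::int64_t steps = -static_cast<std::int64_t>(d.scale);
    std::int64_t v = d.unscaled;
    for (std::int64_t i = 0; i < steps; ++i) {
        if (v > max / 10 || v < min / 10) {
            throw std::overflow_error("value does not fit in int64_t");
        }
        v *= 10;
    }
    return v;
}
```

With these definitions, an unscaled value of 12 with scale -3 expands to 12000, which is the shape a scientific-notation input such as 1.2E+4 would take with a signed scale.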

oleiman · 2025-01-16T16:49:10Z

src/v/iceberg/values_json.cc

+        // NOTE(oren): so this means a real timestamp offset is not supported or
+        // what? I guess this is just for downstream code that expects to find
+        // UTC?


cruft comment, but of interest from the spec:

Timestamp values with time zone represent a point in time: values are stored as UTC and do not retain a source time zone (2017-11-16 17:10:34 PST is stored/retrieved as 2017-11-17 01:10:34 UTC and these values are considered identical).

vbotbuildovich · 2025-01-16T17:37:56Z

CI test results

test results on build#60845

test_id	test_kind	job_url	test_status	passed
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60845#01946fac-951c-44c6-a347-b05184313836	FLAKY	4/6
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60845#01946fac-9524-4685-b3c7-1459b8c6b8d8	FLAKY	3/6

test results on build#60881

test_id	test_kind	job_url	test_status	passed
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60881#01947233-bedc-4c00-beaa-4f31201e69c7	FLAKY	1/2

test results on build#60916

test_id	test_kind	job_url	test_status	passed
rptest.tests.audit_log_test.AuditLogTestsAppLifecycle.test_app_lifecycle	ducktape	https://buildkite.com/redpanda/redpanda/builds/60916#0194758b-204d-45c9-94f2-18a68e3da101	FLAKY	1/2
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60916#0194758b-204d-45c9-94f2-18a68e3da101	FLAKY	1/2

rockwotj · 2025-01-16T17:38:31Z

src/v/iceberg/values_json.cc

+        std::string int_frac;
+        int_frac.reserve(int_part.size() + frac_part.size());
+        std::ranges::copy(int_part, std::back_inserter(int_frac));
+        std::ranges::copy(frac_part, std::back_inserter(int_frac));
+
+        decimal_value result{};
+        if (!absl::SimpleAtoi(int_frac, &result.val)) {
+            throw std::invalid_argument("Failed to parse int128");
+        }


Generally when parsing decimals I find it easier to do something like: SimpleAtoi the int part, multiply by 10^scale, then SimpleAtoi the fractional part and add it to the previous result.

Here's the algorithm I used in Connect: https://github.com/redpanda-data/connect/blob/9d94381914a014c23cb2dd410193e5b7727d4bbd/internal/impl/snowflake/streaming/int128/decimal.go#L146

That one also rounds up, which is standard for IEEE floating-point numbers when they are parsed and we have to lose precision.

I also had to support scientific notation; do we need to do that here?

Lastly, I think this algorithm assumes that the fractional part is padded to the scale?

> rounds up

Looks like the Java implementation uses BigDecimal, which I believe just gives up if there are too many digits after the point. I'm sort of inclined to truncate the excess and call it a day... wdyt?

Yeah, that's the default, but it's configurable. I am fine with truncating. You specified a bad default if it's more precise.

> configurable

What I mean to say is that the Apache code allows it to error :)

> you specified a bad default if it's more precise

ya. basically UB
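
A minimal sketch of the approach discussed in this thread: parse the integer part, multiply by 10^scale, then add the fractional part, padding short fractions to the scale and truncating excess digits. This is not the code that landed in the PR. `parse_decimal` and its helpers are hypothetical names, `std::int64_t` stands in for `absl::int128`, and scientific notation is deliberately not handled.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string_view>

namespace sketch {

constexpr auto i64_max = std::numeric_limits<std::int64_t>::max();

inline bool all_digits(std::string_view s) {
    return std::all_of(
      s.begin(), s.end(), [](char c) { return c >= '0' && c <= '9'; });
}

inline std::int64_t mul10(std::int64_t v) {
    if (v > i64_max / 10) {
        throw std::out_of_range("decimal does not fit in int64_t");
    }
    return v * 10;
}

// Parse a run of digits (already validated); empty input counts as 0.
inline std::int64_t parse_digits(std::string_view s) {
    std::int64_t out = 0;
    for (char c : s) {
        const int d = c - '0';
        if (out > (i64_max - d) / 10) {
            throw std::out_of_range("decimal does not fit in int64_t");
        }
        out = out * 10 + d;
    }
    return out;
}

} // namespace sketch

// Returns the unscaled value of `s` for the given non-negative scale.
inline std::int64_t parse_decimal(std::string_view s, unsigned scale) {
    const bool negative = !s.empty() && s.front() == '-';
    if (negative) {
        s.remove_prefix(1);
    }
    const auto dot = s.find('.');
    std::string_view int_part = s.substr(0, dot);
    std::string_view frac_part = dot == std::string_view::npos
                                   ? std::string_view{}
                                   : s.substr(dot + 1);
    if (int_part.empty() && frac_part.empty()) {
        throw std::invalid_argument("no digits");
    }
    if (!sketch::all_digits(int_part) || !sketch::all_digits(frac_part)) {
        throw std::invalid_argument("expected digits");
    }
    // Truncate (do not round) fractional digits beyond the scale.
    if (frac_part.size() > scale) {
        frac_part = frac_part.substr(0, scale);
    }
    std::int64_t result = sketch::parse_digits(int_part);
    for (unsigned i = 0; i < scale; ++i) {
        result = sketch::mul10(result);
    }
    std::int64_t frac = sketch::parse_digits(frac_part);
    // Pad a short fractional part up to the scale: "1.5" at scale 2 is 150.
    for (auto i = frac_part.size(); i < scale; ++i) {
        frac = sketch::mul10(frac);
    }
    if (result > sketch::i64_max - frac) {
        throw std::out_of_range("decimal does not fit in int64_t");
    }
    result += frac;
    return negative ? -result : result;
}
```

Truncation is used here because that is the option the thread settled on. The Connect implementation linked above rounds instead, so the two would differ on inputs with more fractional digits than the scale.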

rockwotj · 2025-01-16T17:42:53Z

src/v/iceberg/values_json.cc

+        return binary_value{hex_str_to_iobuf(str)};
+    }
+    value operator()(const decimal_type& t) {
+        // TODO(oren): need to support negative scale? see datatypes.h


Looks like we need to support scientific notation?

rockwotj · 2025-01-16T17:44:27Z

src/v/iceberg/tests/values_json_test.cc

+    "decimal_value",
+    decimal_value{absl::int128{std::numeric_limits<long>::max()}},
+    decimal_type{21, 0},
+    fmt::format(R"("009223372036854775807.")"),


I assure you some of these things are going to drop the . when scale is 0.

We need a lot more test cases here :)
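
For the serialization direction, here is a small sketch of one way to render an unscaled value that omits the `.` when the scale is 0. It is an assumption about what an interoperable form could look like, not the PR's implementation. `format_decimal` is a hypothetical name, `std::int64_t` stands in for `absl::int128`, and unlike the test case above it does not zero-pad the integer part to the type's precision.

```cpp
#include <cstdint>
#include <string>

// Render `unscaled` with `scale` fractional digits.
// No decimal point is emitted when scale == 0.
inline std::string format_decimal(std::int64_t unscaled, unsigned scale) {
    const bool negative = unscaled < 0;
    // Take the magnitude in unsigned arithmetic so INT64_MIN is safe.
    const std::uint64_t mag = negative
                                ? 0 - static_cast<std::uint64_t>(unscaled)
                                : static_cast<std::uint64_t>(unscaled);
    std::string digits = std::to_string(mag);
    // Ensure at least one digit remains in front of the point.
    if (digits.size() <= scale) {
        digits = std::string(scale - digits.size() + 1, '0') + digits;
    }
    if (scale > 0) {
        digits.insert(digits.size() - scale, 1, '.');
    }
    if (negative) {
        digits = "-" + digits;
    }
    return digits;
}
```

A parser that is meant to round-trip with other implementations would then need to accept both `"42"` and `"42."` for a scale of 0.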

nvartolomei · 2025-01-16T21:53:21Z

src/v/iceberg/values_json.cc

+}
+
+iobuf hex_str_to_iobuf(std::string_view str) {
+    if (str.size() & 0x01ul) {


Very odd to see this instead of "the usual" % 2 check. But maybe it is just me thinking in math rather than bits. The generated code is the same.

ha, yeah idk where that habit comes from
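
For reference, a standalone sketch of hex decoding with the `% 2` form of the odd-length check. `hex_to_bytes` and `hex_nibble` are hypothetical names, and `std::vector<std::uint8_t>` stands in for `iobuf`, so this is not the PR's `hex_str_to_iobuf`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>
#include <vector>

inline int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

inline std::vector<std::uint8_t> hex_to_bytes(std::string_view str) {
    // Equivalent to (str.size() & 0x01ul): two hex digits encode one byte,
    // so an odd-length string cannot be valid.
    if (str.size() % 2 != 0) {
        throw std::invalid_argument("hex string has odd length");
    }
    std::vector<std::uint8_t> out;
    out.reserve(str.size() / 2);
    for (std::size_t i = 0; i < str.size(); i += 2) {
        const int byte = (hex_nibble(str[i]) << 4) | hex_nibble(str[i + 1]);
        out.push_back(static_cast<std::uint8_t>(byte));
    }
    return out;
}
```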

oleiman · 2025-01-17T01:20:47Z

Force push: reworked decimal_value serde and addressed other assorted CR comments. Added even more tests.

The decimal implementation is closer, I think, but still needs scientific notation in a follow-up. There may be more edge cases, but test coverage is much better now.

rockwotj

LGTM, just a couple of suggestions where there might be absl functions that can simplify/cleanup the code

Needed for processing default values for schema fields. As specified in https://iceberg.apache.org/spec/#json-single-value-serialization Signed-off-by: Oren Leiman <[email protected]>

andrwng · 2025-01-17T20:01:18Z

src/v/iceberg/tests/values_json_test.cc

+struct DecimalRoundTripTest
+  : ::testing::Test
+  , testing::WithParamInterface<decimal_parsing_test_case> {};
+
+static const std::vector<decimal_parsing_test_case> decimal_cases{
+  decimal_parsing_test_case{
+    "simple",


Not blocking, but given how non-trivial decimal parsing is, it would be great to test variants of 0 (0.0, 0., 0000., .00000, -0.0, etc). Wouldn't be surprised if that's a common default

ya, great point. i'll bolt that onto the followup ticket

which is here fyi: https://redpandadata.atlassian.net/browse/CORE-8835

oleiman self-assigned this Jan 15, 2025

github-actions bot added area/build area/redpanda labels Jan 15, 2025

oleiman force-pushed the dlib/core-8781/json-single-value-serde branch from 16afd29 to e4000d7 Compare January 15, 2025 05:38

oleiman marked this pull request as ready for review January 15, 2025 05:38

oleiman force-pushed the dlib/core-8781/json-single-value-serde branch 3 times, most recently from 76e7e54 to 6da9c19 Compare January 16, 2025 02:50

oleiman requested review from andrwng and rockwotj January 16, 2025 16:42

oleiman commented Jan 16, 2025

rockwotj reviewed Jan 16, 2025

andrwng reviewed Jan 16, 2025

nvartolomei reviewed Jan 16, 2025

nvartolomei reviewed Jan 16, 2025

oleiman force-pushed the dlib/core-8781/json-single-value-serde branch from 6da9c19 to a3cda56 Compare January 17, 2025 01:19

oleiman requested review from andrwng, rockwotj and nvartolomei January 17, 2025 02:57

rockwotj reviewed Jan 17, 2025

iceberg: Single-value JSON serde

e13f753

oleiman force-pushed the dlib/core-8781/json-single-value-serde branch from a3cda56 to e13f753 Compare January 17, 2025 17:15

oleiman requested a review from rockwotj January 17, 2025 17:59

rockwotj approved these changes Jan 17, 2025

andrwng approved these changes Jan 17, 2025

oleiman enabled auto-merge January 17, 2025 20:07

oleiman merged commit 2e246e9 into redpanda-data:dev Jan 17, 2025
17 checks passed
