iceberg: Single-value JSON serde #24812

Merged

oleiman
Member

@oleiman oleiman commented Jan 15, 2025

Needed for processing default values for schema fields.

As specified in https://iceberg.apache.org/spec/#json-single-value-serialization

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@oleiman oleiman self-assigned this Jan 15, 2025
@oleiman oleiman force-pushed the dlib/core-8781/json-single-value-serde branch from 16afd29 to e4000d7 Compare January 15, 2025 05:38
@oleiman oleiman marked this pull request as ready for review January 15, 2025 05:38
@oleiman oleiman force-pushed the dlib/core-8781/json-single-value-serde branch 3 times, most recently from 76e7e54 to 6da9c19 Compare January 16, 2025 02:50
@oleiman
Member Author

oleiman commented Jan 16, 2025

/ci-repeat 1
skip-redpanda-build
skip-units

@oleiman
Member Author

oleiman commented Jan 16, 2025

/ci-repeat 1
skip-redpanda-build
skip-units

@oleiman oleiman requested review from andrwng and rockwotj January 16, 2025 16:42
    return binary_value{hex_str_to_iobuf(str)};
}
value operator()(const decimal_type& t) {
    // TODO(oren): need to support negative scale? see datatypes.h
Contributor

Looks like we need to support scientific notation?

Member Author

@oleiman oleiman Jan 16, 2025

Yeah we need to support scientific notation, but the point here is that up-scaling is signified by a negative value for scale, whereas unsigned int is baked into decimal_value currently. We should change that, but I left it off initially to keep this PR fully isolated.
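
To make the scale point concrete: a decimal with unscaled value `u` and scale `s` stands for `u * 10^(-s)`, so a negative scale multiplies the unscaled value up. As I read the linked spec, such values are written in scientific notation with the exponent equal to the negated scale (for example `"2E+20"`). The sketch below is an illustration only: `signed_decimal` and `to_sci_string` are hypothetical names, not part of this PR, and it uses `std::int64_t` where the real code uses `absl::int128`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a decimal whose scale is allowed to be negative.
struct signed_decimal {
    std::int64_t unscaled; // the digits of the number
    std::int32_t scale;    // value == unscaled * 10^(-scale)
};

// Renders an up-scaled (negative-scale) decimal in scientific notation,
// with the exponent equal to the negated scale.
std::string to_sci_string(const signed_decimal& d) {
    assert(d.scale < 0 && "this sketch only covers negative scales");
    const std::int64_t exponent = -static_cast<std::int64_t>(d.scale);
    return std::to_string(d.unscaled) + "E+" + std::to_string(exponent);
}
```

With an unsigned scale field there is no way to store the `-20` in `signed_decimal{2, -20}`, which is the limitation described above.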

Member Author

i think it's fine for a follow up since we don't currently support it at the iceberg::datatypes level

https://redpandadata.atlassian.net/browse/CORE-8835

Contributor

sounds good

Comment on lines 442 to 444
// NOTE(oren): so this means a real timestamp offset is not supported or
// what? I guess this is just for downstream code that expects to find
// UTC?
Member Author

cruft comment, but of interest from the spec:

> Timestamp values with time zone represent a point in time: values are stored as UTC and do not retain a source time zone (2017-11-16 17:10:34 PST is stored/retrieved as 2017-11-17 01:10:34 UTC and these values are considered identical).
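
The spec's example can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the stored value is microseconds since the Unix epoch; `days_from_civil` (Howard Hinnant's well-known date algorithm) and `to_utc_micros` are illustrative helpers, not functions from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Days since 1970-01-01 for a proleptic Gregorian date.
constexpr std::int64_t days_from_civil(std::int64_t y, unsigned m, unsigned d) {
    y -= m <= 2;
    const std::int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<std::int64_t>(doe) - 719468;
}
static_assert(days_from_civil(1970, 1, 1) == 0, "epoch must map to day 0");

// Microseconds since the epoch for a wall-clock time at a whole-hour UTC
// offset. The offset is applied once and then discarded.
inline std::int64_t to_utc_micros(
  std::int64_t y, unsigned m, unsigned d, int hh, int mm, int ss,
  int utc_offset_hours) {
    assert(utc_offset_hours >= -12 && utc_offset_hours <= 14);
    const std::int64_t local_secs
      = days_from_civil(y, m, d) * 86400 + hh * 3600 + mm * 60 + ss;
    return (local_secs - std::int64_t{utc_offset_hours} * 3600) * 1000000;
}
```

Calling `to_utc_micros(2017, 11, 16, 17, 10, 34, -8)` (PST) and `to_utc_micros(2017, 11, 17, 1, 10, 34, 0)` (UTC) gives the same number, which is why the two timestamps are considered identical once stored.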

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 16, 2025

CI test results

test results on build#60845

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60845#01946fac-951c-44c6-a347-b05184313836 | FLAKY | 4/6 |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60845#01946fac-9524-4685-b3c7-1459b8c6b8d8 | FLAKY | 3/6 |

test results on build#60881

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60881#01947233-bedc-4c00-beaa-4f31201e69c7 | FLAKY | 1/2 |

test results on build#60916

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.tests.audit_log_test.AuditLogTestsAppLifecycle.test_app_lifecycle | ducktape | https://buildkite.com/redpanda/redpanda/builds/60916#0194758b-204d-45c9-94f2-18a68e3da101 | FLAKY | 1/2 |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60916#0194758b-204d-45c9-94f2-18a68e3da101 | FLAKY | 1/2 |

Comment on lines 272 to 277
std::string int_frac;
int_frac.reserve(int_part.size() + frac_part.size());
std::ranges::copy(int_part, std::back_inserter(int_frac));
std::ranges::copy(frac_part, std::back_inserter(int_frac));

decimal_value result{};
if (!absl::SimpleAtoi(int_frac, &result.val)) {
    throw std::invalid_argument("Failed to parse int128");
}
Contributor

Generally when parsing decimals I find it easier to do something like: SimpleAtoi the int part, then multiply by scale, SimpleAtoi the fractional part and add it to the previous.

Here's the algorithm I used in Connect: https://github.com/redpanda-data/connect/blob/9d94381914a014c23cb2dd410193e5b7727d4bbd/internal/impl/snowflake/streaming/int128/decimal.go#L146

That one also rounds up which is standard for IEEE floating point numbers when they are parsed and we have to lose precision.

I also had to support scientific notation, do we need to do that here?

Lastly, I think this algorithm assumes that the fractional part is padded to the scale?
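
The approach suggested here (parse the integer part, multiply by the scale factor, then add the parsed fractional part) can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR or from Connect: `parse_digits` and `parse_decimal` are hypothetical names, `std::int64_t` stands in for `absl::int128`, there is no overflow checking or scientific notation, and excess fractional digits are truncated rather than rounded. A short fractional part is padded to the scale, which addresses the padding concern raised above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>

// Checks that every character is a digit and accumulates the first `keep`
// of them; an empty run yields 0. No overflow checking in this sketch.
std::int64_t parse_digits(std::string_view s, std::size_t keep) {
    std::int64_t out = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] < '0' || s[i] > '9') {
            throw std::invalid_argument("expected a decimal digit");
        }
        if (i < keep) {
            out = out * 10 + (s[i] - '0');
        }
    }
    return out;
}

// Returns the unscaled value, i.e. the decimal multiplied by 10^scale.
std::int64_t parse_decimal(std::string_view str, std::uint32_t scale) {
    assert(scale <= 18 && "int64_t cannot hold more than 18 fractional digits");
    const bool negative = !str.empty() && str.front() == '-';
    if (negative) {
        str.remove_prefix(1);
    }
    if (str.empty()) {
        throw std::invalid_argument("empty decimal string");
    }
    const std::size_t dot = str.find('.');
    const std::string_view int_part = str.substr(0, dot);
    const std::string_view frac_part = dot == std::string_view::npos
                                         ? std::string_view{}
                                         : str.substr(dot + 1);

    // Integer part, shifted left by `scale` decimal places.
    std::int64_t result = parse_digits(int_part, int_part.size());
    for (std::uint32_t i = 0; i < scale; ++i) {
        result *= 10;
    }
    // Fractional part: digits beyond `scale` are dropped (truncation),
    // and a short fraction is padded with zeros up to `scale`.
    std::int64_t frac = parse_digits(frac_part, scale);
    for (std::size_t i = frac_part.size(); i < scale; ++i) {
        frac *= 10;
    }
    result += frac;
    return negative ? -result : result;
}
```

For example, `parse_decimal("12.3", 2)` pads the fraction and yields 1230, while `parse_decimal("12.349", 2)` truncates and yields 1234. Inputs such as `"7"` and `"7."` both parse to 7 at scale 0.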

Member Author

@oleiman oleiman Jan 17, 2025

> rounds up

Looks like the Java implementation uses BigDecimal, which I believe just gives up if there are too many digits after the point. I'm sort of inclined to truncate the excess and call it a day...wdyt?

Contributor

Yeah, that's the default, but it's configurable. I am fine with truncating. You specified a bad default if it's more precise

Member Author

> configurable

what I mean to say is that the apache code allows it to error :)

> you specified a bad default if it's more precise

ya. basically UB

"decimal_value",
decimal_value{absl::int128{std::numeric_limits<long>::max()}},
decimal_type{21, 0},
fmt::format(R"("009223372036854775807.")"),
Contributor

I assure you some of these things are going to drop the . when scale is 0.

Contributor

We need a lot more test cases here :)
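
To illustrate why the trailing `.` cannot be relied on: a writer that places the point purely by scale has nowhere to put one when the scale is 0. The sketch below is a hypothetical formatter, not the serializer in this PR; `format_decimal` is an assumed name and `std::int64_t` stands in for `absl::int128`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Formats unscaled / 10^scale with exactly `scale` digits after the point.
std::string format_decimal(std::int64_t unscaled, std::uint32_t scale) {
    const bool negative = unscaled < 0;
    // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
    const std::uint64_t mag = negative
                                ? 0 - static_cast<std::uint64_t>(unscaled)
                                : static_cast<std::uint64_t>(unscaled);
    std::string digits = std::to_string(mag);
    assert(!digits.empty());
    // Left-pad with zeros so at least one digit precedes the point.
    if (digits.size() <= scale) {
        digits = std::string(scale + 1 - digits.size(), '0') + digits;
    }
    // With scale 0 no point is inserted at all.
    if (scale > 0) {
        digits.insert(digits.size() - scale, ".");
    }
    return (negative ? "-" : "") + digits;
}
```

Here `format_decimal(9223372036854775807LL, 0)` produces `9223372036854775807` with no trailing point, whereas the test case above expects `009223372036854775807.`, so a parser that accepts both spellings is the safer choice.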

}

iobuf hex_str_to_iobuf(std::string_view str) {
    if (str.size() & 0x01ul) {
Contributor

Very odd to see this instead of "the usual" % 2 check. But maybe it is just me thinking in math rather than bits. The generated code is the same.

Member Author

ha, yeah idk where that habit comes from
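
For context, the odd-length check guards a hex decoder that consumes two characters per byte. A minimal sketch of that shape, assuming a plain `std::vector<std::uint8_t>` output instead of `iobuf`; `hex_nibble` and `hex_to_bytes` are illustrative names, not the functions in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>
#include <vector>

// Maps one hex digit to its 4-bit value, or throws.
std::uint8_t hex_nibble(char c) {
    if (c >= '0' && c <= '9') return static_cast<std::uint8_t>(c - '0');
    if (c >= 'a' && c <= 'f') return static_cast<std::uint8_t>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return static_cast<std::uint8_t>(c - 'A' + 10);
    throw std::invalid_argument("not a hex digit");
}

std::vector<std::uint8_t> hex_to_bytes(std::string_view str) {
    // Lowest bit set means odd length; same result as str.size() % 2 != 0.
    if (str.size() & 0x01ul) {
        throw std::invalid_argument("hex string has odd length");
    }
    std::vector<std::uint8_t> out;
    out.reserve(str.size() / 2);
    for (std::size_t i = 0; i < str.size(); i += 2) {
        // High nibble first, then low nibble.
        out.push_back(static_cast<std::uint8_t>(
          (hex_nibble(str[i]) << 4) | hex_nibble(str[i + 1])));
    }
    assert(out.size() == str.size() / 2);
    return out;
}
```

`hex_to_bytes("00ff")` yields the two bytes `0x00` and `0xff`, and a three-character input such as `"abc"` is rejected before any decoding happens.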

@oleiman oleiman force-pushed the dlib/core-8781/json-single-value-serde branch from 6da9c19 to a3cda56 Compare January 17, 2025 01:19
@oleiman
Member Author

oleiman commented Jan 17, 2025

force push reworked decimal_value serde and other assorted CR comments. Even more tests.

Decimal implementation is closer I think but still needs sci notation in a follow up. Possibly more edge cases extant, but test coverage is much better now.

Contributor

@rockwotj rockwotj left a comment

LGTM, just a couple of suggestions where there might be absl functions that can simplify/cleanup the code

Needed for processing default values for schema fields.

As specified in https://iceberg.apache.org/spec/#json-single-value-serialization

Signed-off-by: Oren Leiman <[email protected]>
@oleiman oleiman force-pushed the dlib/core-8781/json-single-value-serde branch from a3cda56 to e13f753 Compare January 17, 2025 17:15
@oleiman oleiman requested a review from rockwotj January 17, 2025 17:59
Comment on lines +735 to +741
struct DecimalRoundTripTest
  : ::testing::Test
  , testing::WithParamInterface<decimal_parsing_test_case> {};

static const std::vector<decimal_parsing_test_case> decimal_cases{
  decimal_parsing_test_case{
    "simple",
Contributor

Not blocking, but given how non-trivial decimal parsing is, it would be great to test variants of 0 (0.0, 0., 0000., .00000, -0.0, etc). Wouldn't be surprised if that's a common default

Member Author

ya, great point. i'll bolt that onto the followup ticket

@oleiman oleiman enabled auto-merge January 17, 2025 20:07
@oleiman oleiman merged commit 2e246e9 into redpanda-data:dev Jan 17, 2025
17 checks passed