iceberg: Single-value JSON serde #24812
Conversation
force-pushed from 16afd29 to e4000d7
force-pushed from 76e7e54 to 6da9c19
/ci-repeat 1
    return binary_value{hex_str_to_iobuf(str)};
}
value operator()(const decimal_type& t) {
    // TODO(oren): need to support negative scale? see datatypes.h
Looks like we need to support scientific notation?
https://github.com/apache/iceberg/blob/b128bba57f613f23ed773f0fa6330c1d2bbf8a39/core/src/test/java/org/apache/iceberg/TestSingleValueParser.java#L53-L55
https://github.com/apache/iceberg/blob/b128bba57f613f23ed773f0fa6330c1d2bbf8a39/core/src/test/java/org/apache/iceberg/TestSingleValueParser.java#L143
Yeah we need to support scientific notation, but the point here is that up-scaling is signified by a negative value for scale, whereas unsigned int is baked into decimal_value currently. We should change that, but I left it off initially to keep this PR fully isolated.
i think it's fine for a follow up since we don't currently support it at the iceberg::datatypes level
sounds good
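The follow-up discussed above would need to handle scientific notation (per the Iceberg test cases linked earlier, e.g. `"3.14E+2"`). One way to approach it is to normalize the exponent away before handing the digits to the existing parser. The sketch below is a hypothetical, self-contained illustration of that normalization step only (`normalize_scientific` is not a name from the PR):

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: rewrite a decimal string in scientific notation
// into plain positional form by shifting the decimal point by the
// exponent, e.g. "3.14E+2" -> "314", "1.2e-1" -> "0.12".
std::string normalize_scientific(std::string_view str) {
    auto e = str.find_first_of("eE");
    if (e == std::string_view::npos) {
        return std::string{str}; // no exponent, nothing to do
    }
    std::string mantissa{str.substr(0, e)};
    int exp = std::stoi(std::string{str.substr(e + 1)});
    bool negative = !mantissa.empty() && mantissa.front() == '-';
    if (negative) {
        mantissa.erase(0, 1);
    }
    // Strip the point, remembering how many digits precede it.
    std::string digits = mantissa;
    int point;
    auto dot = digits.find('.');
    if (dot == std::string::npos) {
        point = static_cast<int>(digits.size());
    } else {
        digits.erase(dot, 1);
        point = static_cast<int>(dot);
    }
    point += exp; // shift the point by the exponent
    std::string out;
    if (point <= 0) {
        out = "0." + std::string(static_cast<size_t>(-point), '0') + digits;
    } else if (point >= static_cast<int>(digits.size())) {
        out = digits + std::string(static_cast<size_t>(point) - digits.size(), '0');
    } else {
        out = digits.substr(0, static_cast<size_t>(point)) + "."
              + digits.substr(static_cast<size_t>(point));
    }
    return (negative ? "-" : "") + out;
}
```

After normalization, the result can be split on `.` exactly as the non-scientific path already does.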
src/v/iceberg/values_json.cc
Outdated
// NOTE(oren): so this means a real timestamp offset is not supported or
// what? I guess this is just for downstream code that expects to find
// UTC?
cruft comment, but of interest from the spec:
Timestamp values with time zone represent a point in time: values are stored as UTC and do not retain a source time zone (2017-11-16 17:10:34 PST is stored/retrieved as 2017-11-17 01:10:34 UTC and these values are considered identical).
CI test results
test results on build#60845
test results on build#60881
test results on build#60916
src/v/iceberg/values_json.cc
Outdated
std::string int_frac;
int_frac.reserve(int_part.size() + frac_part.size());
std::ranges::copy(int_part, std::back_inserter(int_frac));
std::ranges::copy(frac_part, std::back_inserter(int_frac));

decimal_value result{};
if (!absl::SimpleAtoi(int_frac, &result.val)) {
    throw std::invalid_argument("Failed to parse int128");
}
Generally when parsing decimals I find it easier to do something like: SimpleAtoi the int part, multiply by 10^scale, then SimpleAtoi the fractional part and add it to the previous.
Here's the algorithm I used in Connect: https://github.com/redpanda-data/connect/blob/9d94381914a014c23cb2dd410193e5b7727d4bbd/internal/impl/snowflake/streaming/int128/decimal.go#L146
That one also rounds up, which is standard for IEEE floating point numbers when they are parsed and we have to lose precision.
I also had to support scientific notation, do we need to do that here?
Lastly, I think this algorithm assumes that the fractional part is padded to the scale?
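The suggested part-by-part approach might look like the sketch below. This is a simplified stand-in, not the PR's code: it uses `int64_t` and `std::stoll` where the PR works with `absl::int128` and `absl::SimpleAtoi`, the name `parse_decimal` is hypothetical, and excess fractional digits are truncated (one of the options discussed in this thread) rather than rounded:

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical sketch: parse the integer and fractional parts separately,
// scale the integer part by 10^scale, then add the fractional digits.
// Assumes a non-negative scale and input without scientific notation.
int64_t parse_decimal(std::string_view str, int scale) {
    auto dot = str.find('.');
    std::string int_part{str.substr(0, dot)};
    std::string frac_part = dot == std::string_view::npos
                              ? std::string{}
                              : std::string{str.substr(dot + 1)};
    // Truncate excess fractional digits; pad with zeros so the
    // fractional part always has exactly `scale` digits.
    if (frac_part.size() > static_cast<size_t>(scale)) {
        frac_part.resize(static_cast<size_t>(scale));
    } else {
        frac_part.append(static_cast<size_t>(scale) - frac_part.size(), '0');
    }
    bool negative = !int_part.empty() && int_part.front() == '-';
    int64_t result = int_part.empty() ? 0 : std::stoll(int_part);
    for (int i = 0; i < scale; ++i) {
        result *= 10; // multiply the int part by 10^scale
    }
    int64_t frac = frac_part.empty() ? 0 : std::stoll(frac_part);
    result += negative ? -frac : frac;
    return result;
}
```

Handling the sign explicitly matters for inputs like `"-0.5"`, where the integer part alone parses to zero and would otherwise lose the minus sign.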
> rounds up
Looks like the Java implementation uses BigDecimal, which I believe just gives up if there are too many digits after the point. I'm sort of inclined to truncate the excess and call it a day...wdyt?
Yeah that's the default, but it's configurable. I am fine with truncating. You specified a bad default if it's more precise
> configurable
what I mean to say is that the apache code allows it to error :)
you specified a bad default if it's more precise
ya. basically UB
"decimal_value",
decimal_value{absl::int128{std::numeric_limits<long>::max()}},
decimal_type{21, 0},
fmt::format(R"("009223372036854775807.")"),
I assure you some of these things are going to drop the `.` when scale is 0
We need a lot more test cases here :)
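The serialization side of the concern above can be sketched as follows. This is a hypothetical stand-in (`decimal_to_string` is not a name from the PR, and it uses `int64_t` where the PR stores `absl::int128`): at scale 0 it emits a plain integer with no trailing `.`:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: render an unscaled decimal value at a given scale.
// At scale 0 the output is a bare integer (no trailing '.'); otherwise
// the digits are zero-padded so at least one digit precedes the point.
std::string decimal_to_string(int64_t unscaled, int scale) {
    bool negative = unscaled < 0;
    std::string digits = std::to_string(negative ? -unscaled : unscaled);
    if (scale == 0) {
        return (negative ? "-" : "") + digits; // no '.' at scale 0
    }
    if (digits.size() <= static_cast<size_t>(scale)) {
        // Pad so there is at least one digit before the point.
        digits.insert(0, static_cast<size_t>(scale) + 1 - digits.size(), '0');
    }
    digits.insert(digits.size() - static_cast<size_t>(scale), 1, '.');
    return (negative ? "-" : "") + digits;
}
```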
}

iobuf hex_str_to_iobuf(std::string_view str) {
    if (str.size() & 0x01ul) {
Very odd to see this instead of "the usual" `% 2` check. But maybe it is just me thinking in math rather than bits. The generated code is the same.
ha, yeah idk where that habit comes from
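For illustration, the two parity checks under discussion are interchangeable, as a self-contained sketch shows. This is a plain stand-in for the PR's `hex_str_to_iobuf` (`iobuf` replaced with `std::vector<uint8_t>`, and the helper name is hypothetical):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for hex_str_to_iobuf: decode a hex string into
// bytes, rejecting odd-length input. `str.size() % 2` and
// `str.size() & 0x01ul` compile to the same test.
std::vector<uint8_t> hex_str_to_bytes(std::string_view str) {
    if (str.size() % 2 != 0) { // equivalently: str.size() & 0x01ul
        throw std::invalid_argument("hex string must have even length");
    }
    std::vector<uint8_t> out;
    out.reserve(str.size() / 2);
    for (size_t i = 0; i < str.size(); i += 2) {
        // Parse each two-character group as one base-16 byte.
        out.push_back(static_cast<uint8_t>(
          std::stoul(std::string{str.substr(i, 2)}, nullptr, 16)));
    }
    return out;
}
```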
force-pushed from 6da9c19 to a3cda56
Force push: reworked decimal_value serde and other assorted CR comments. Even more tests. Decimal implementation is closer, I think, but still needs sci notation in a follow up. Possibly more edge cases extant, but test coverage is much better now.
LGTM, just a couple of suggestions where there might be absl functions that can simplify/clean up the code
Needed for processing default values for schema fields.

As specified in https://iceberg.apache.org/spec/#json-single-value-serialization

Signed-off-by: Oren Leiman <[email protected]>
force-pushed from a3cda56 to e13f753
struct DecimalRoundTripTest
  : ::testing::Test
  , testing::WithParamInterface<decimal_parsing_test_case> {};

static const std::vector<decimal_parsing_test_case> decimal_cases{
  decimal_parsing_test_case{
    "simple",
Not blocking, but given how non-trivial decimal parsing is, it would be great to test variants of 0 (0.0, 0., 0000., .00000, -0.0, etc). Wouldn't be surprised if that's a common default
ya, great point. i'll bolt that onto the followup ticket
which is here fyi: https://redpandadata.atlassian.net/browse/CORE-8835
Needed for processing default values for schema fields.
As specified in https://iceberg.apache.org/spec/#json-single-value-serialization
Backports Required
Release Notes