-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-44248: [FORMAT] Add TimestampWithOffset canonical extension type #48002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-44248: [FORMAT] Add TimestampWithOffset canonical extension type #48002
Conversation
14fd59a to
fe8056f
Compare
|
Hey friends, I pushed 2 new patches, as discussed in the Arrow dev community meeting:
Next, I need to change the Go and Rust implementation drafts to accept the new supported encodings. |
|
FWIW, forgot to mention this before in arrow-cpp we have an internal OffsetZone that represents offset in minutes. We define it per array, not per row of course. arrow/cpp/src/arrow/util/date_internal.h Lines 29 to 33 in cd23a76
|
Co-authored-by: David Li <[email protected]>
d40c0bb to
b0d9be3
Compare
6ab0deb to
d4d50b3
Compare
|
|
||
| * Extension type parameters: | ||
|
|
||
| * ``time_unit``: the time-unit of each of the stored UTC timestamps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an extension type parameter, it's a parameter of the storage type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved: 043b07c
| If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. | ||
|
|
||
| If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes consumption of these data more complicated. Why not mandate that struct fields are always non-null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And return an error on instantiation?
I think we can allow nulls on the timestamp field and on instantiation put it on the struct array also.
Would you be fine with both points above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think many operations will care only about the timestamp field and having the validity buffer there helps them simply delegate to functions on the Timestamp type. At the same time, many function only look at the top-level validity buffer (i.e. the validity of the struct). So during instantiation it would make sense to share the validity buffer if there is one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And return an error on instantiation?
Which instantiation are we talking about exactly? The ExtensionArray subclass in Arrow C++?
I think we can allow nulls on the
timestampfield and on instantiation put it on the struct array also.Would you be fine with both points above?
We can, but that places a burden on the consumer for the sake of making things easier for the producer. I'm not sure that's a good tradeoff to make.
I think many operations will care only about the timestamp field and having the validity buffer there helps them simply delegate to functions on the
Timestamptype. At the same time, many function only look at the top-level validity buffer (i.e. the validity of the struct). So during instantiation it would make sense to share the validity buffer if there is one.
I admit that can make sense. Let's hope for other opinions on the topic. @paleolimbot @alamb @tustvold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And return an error on instantiation?
Which instantiation are we talking about exactly? The ExtensionArray subclass in Arrow C++?
In Arrow implementations for any language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the validity buffer of the timestamp field for both the struct and the field itself is good for both producers and consumers. A producer doesn't have to set default values for the nulls. And consumers don't have to look at both validity buffers for anything they do. Sharing the buffer (if it exists at all) is just a ref-count bump on the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not mandate that struct fields are always non-null
+1. I had the same worry when we decided to make all of GeoArrow's native memory layouts have non-nullable children...the language in GeoArrow is that children must not contain nulls. When producing, all the implementations explicitly mark fields as non-nullable. When consuming, they check for a zero null count. Specifying anything about the specific arrangement of buffers I think is out of scope for an extension type definition (if we want to do this, somebody has to make sure it can be done in places like DuckDB, JavaScript, and Julia that might not offer the same level of control over buffer creation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if that's the intent, the spec needs to be clear that all 3 validity buffers have to be the identical.
But that would be frankly a bizarre variation from normal Struct policies and I don't see any reason to constrain implementations in such a way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. We are actually back to the very original proposal by @serramatutu that demanded non-nullable fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest change goes back to the original spec: 043b07c
|
|
- Drop JSON encoding recommendation - Inner fields must be non-nullble - Time unit is not a type parameter
|
|
||
| .. note:: | ||
|
|
||
| It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to say anything about what types of dictionary keys are allowed? Like is it ok to use Int8/Int16/Int32 keys -- while Int32 probably doesn't make sense as it would make the column larger, it might be necessary in some cases to encode the full range of Int16 values 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of brevity of the spec, I advised @serramatutu to not have recommendations here. I would assume anyone deciding to dictionary-encode an int16 array is considering the gains (or not) that they would have from that encoding. We are allowing but not recommending complicated encoding of an already compact 16-bit-valued array.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @serramatutu -- the basic idea looks good to me
THIS IS A DRAFT. It's being used as reference for the [DISCUSS] thread in the mailing list.
Rationale for this change
Closes #44248
Arrow has no built-in canonical way of representing the
TIMESTAMP WITH TIME ZONESQL type, which is present across multiple different database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A newarrow.timestamp_with_offsetextension type would introduce a standard canonical way of representing that information.Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558
What changes are included in this PR?
Proposal and documentation for
arrow.timestamp_with_offsetcanonical extension type.Are these changes tested?
N/A
Are there any user-facing changes?
Yes, this is an extension to the arrow format.