Add support for Union types in RowConverter
#8839
base: main
9a62f3c to 2cd0253
2cd0253 to 5559011
let mut child_rows = Vec::with_capacity(converters.len());
for (type_id, converter) in converters.iter().enumerate() {
    let child_array = union_array.child(type_id as i8);
Here type_id is the index of the converter. It looks strange but it might be OK.
Could you use the items in type_ids instead?
I think it makes sense because the type_id is the index of the child field types
Maybe we can document better that converters is indexed by type_id 🤔
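To make the indexing question concrete, here is a minimal standard-library sketch (not code from this PR). It assumes type ids may be arbitrary i8 values rather than exactly 0..n; the ids 5 and 7 in the usage note are made up, and converter_index_by_type_id is a hypothetical helper name.

```rust
use std::collections::HashMap;

/// Hypothetical helper: maps each union type id to the position of its
/// converter, so lookups do not rely on type ids being 0, 1, 2, ...
fn converter_index_by_type_id(type_ids: &[i8]) -> HashMap<i8, usize> {
    type_ids
        .iter()
        .copied()
        .enumerate()
        // `pos` is the converter's position; `type_id` is the union's id.
        .map(|(pos, type_id)| (type_id, pos))
        .collect()
}
```

With type ids [5, 7], looking up 5 yields position 0 and 7 yields position 1, whereas iterating with enumerate() and casting the index to i8 would ask for children 0 and 1. If the converters are always built so that position equals type id, the enumerate() form is equivalent, which is the invariant the comment above suggests documenting.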
let len = rows.len();

let DataType::Union(union_fields, mode) = &field.data_type else {
    unreachable!()
Suggested change:
- unreachable!()
+ unreachable!("Expected a Union but got: {}", &field.data_type)
I'm not sure it's worth writing a descriptive message when the branch is specifically designed for a case that will never be hit.
alamb left a comment
Thanks @friendlymatthew -- this is looking good. I left some comments, and @martin-g's comments are good to review too.
}

#[test]
fn test_sparse_union() {
Can you please also add tests here for union arrays that have nulls? Specifically, for a union array that has a null buffer.
See: 3ab1e75
arrow-row/src/lib.rs
Outdated
for (idx, row) in rows.iter_mut().enumerate() {
    // skip the null sentinel
    let mut cursor = 1;
I think you need to look at the null byte to recover nulls 🤔
Hm, I thought union arrays don't support physical nulls, i.e. UnionArray::nulls() always returns None: https://docs.rs/arrow-array/57.0.0/src/arrow_array/array/union_array.rs.html#777
Nulls in union arrays are represented by nulls in the child arrays, not by a physical null buffer on the union itself. For example, consider an element at index 1 with type id 0 pointing to a null in the Int32 child array. The union element itself is then not null; it is a valid union element that happens to point to a null value.
Now that I think about it, I wonder if we can just eagerly encode 0x01 as the null sentinel byte 🤔
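A small standard-library model of that argument, written as a sketch rather than as arrow-rs code: SparseUnion and is_logically_null are hypothetical names, and the two children (an Int32-like child and a string child) are chosen only for illustration. The union carries no validity buffer of its own; a slot counts as null only when the child selected by its type id is null at that index.

```rust
/// Hypothetical stand-in for a sparse union with two children.
/// Note there is no validity buffer on the union itself.
struct SparseUnion {
    type_ids: Vec<i8>,
    ints: Vec<Option<i32>>,          // child with type id 0
    strs: Vec<Option<&'static str>>, // child with type id 1
}

impl SparseUnion {
    /// A slot is logically null when the child selected by its type id
    /// holds a null at the same index (sparse layout: same index in every child).
    fn is_logically_null(&self, idx: usize) -> bool {
        match self.type_ids[idx] {
            0 => self.ints[idx].is_none(),
            1 => self.strs[idx].is_none(),
            other => panic!("unknown type id {other}"),
        }
    }
}
```

In this model the element at index 1 with type id 0 and a null in the integer child reports as logically null, even though the union has nothing of its own to mark it so.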
"Hm, I thought union arrays don't support physical nulls."
I agree. https://arrow.apache.org/docs/format/Columnar.html#union-layout says:
"Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union."
If the union itself doesn't have a null mask, is there any reason to include a sentinel byte at all?
I agree. Done in 95a0172
e365f92 to e960120
Which issue does this PR close?
Union data types for row format #8828
Rationale for this change
This PR implements row format conversion for Union types (both sparse and dense modes) in the row kernel. Union types can now be encoded into the row format for sorting and comparison operations.
Each row is encoded as a null sentinel byte, followed by the type id byte, and then the encoded child row data. During decoding, rows are grouped by their type id and routed to the appropriate child converter.
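The layout described here can be sketched with the standard library alone. This is an illustration of the byte order stated in the description, not the PR's implementation: encode_union_row and group_by_type_id are hypothetical names, the child row is treated as already-encoded opaque bytes, and the sentinel value 0x01 is taken from the review discussion (which also considers dropping the sentinel byte entirely).

```rust
use std::collections::BTreeMap;

/// Sentinel byte written first in this sketch (value assumed from the review thread).
const SENTINEL: u8 = 0x01;

/// Hypothetical encoder: [sentinel][type id][already-encoded child row bytes].
fn encode_union_row(type_id: i8, child_row: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + child_row.len());
    out.push(SENTINEL);
    out.push(type_id as u8);
    out.extend_from_slice(child_row);
    out
}

/// Hypothetical decode step: group rows by type id, keeping each row's
/// original index so the union can be reassembled in order afterwards.
fn group_by_type_id(rows: &[Vec<u8>]) -> BTreeMap<i8, Vec<(usize, &[u8])>> {
    let mut groups: BTreeMap<i8, Vec<(usize, &[u8])>> = BTreeMap::new();
    for (idx, row) in rows.iter().enumerate() {
        let type_id = row[1] as i8; // byte 0 is the sentinel, byte 1 the type id
        groups.entry(type_id).or_default().push((idx, &row[2..]));
    }
    groups
}
```

Each group's byte slices would then be handed to the child converter for that type id. Keeping the original row index alongside each slice is one way to scatter the decoded child values back into their positions.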