@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR implements row format conversion for Union types (both sparse and dense modes) in the row kernel. Union types can now be encoded into the row format for sorting and comparison operations.

It handles both sparse and dense union modes by encoding each row as a null sentinel byte, followed by the type id byte, and then the encoded child row data. During decoding, rows are grouped by their type id and routed to the appropriate child converter.
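
As a rough illustration of the layout described above, here is a minimal std-only sketch. It is not the PR's code: `VALID_SENTINEL`, `encode_union_row`, and `group_by_type_id` are hypothetical names, the sentinel value is an assumption, and the child row bytes are treated as opaque.

```rust
use std::collections::BTreeMap;

/// Hypothetical sentinel value marking a valid (non-null) row.
const VALID_SENTINEL: u8 = 0x01;

/// Encode one union row as: sentinel byte, type id byte, child row bytes.
fn encode_union_row(type_id: i8, child_row: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + child_row.len());
    out.push(VALID_SENTINEL);
    out.push(type_id as u8);
    out.extend_from_slice(child_row);
    out
}

/// Group encoded rows by type id, keeping each row's original position so
/// that every group could be handed to the matching child converter and the
/// decoded values put back in order afterwards.
fn group_by_type_id(rows: &[Vec<u8>]) -> BTreeMap<i8, Vec<(usize, Vec<u8>)>> {
    let mut groups: BTreeMap<i8, Vec<(usize, Vec<u8>)>> = BTreeMap::new();
    for (idx, row) in rows.iter().enumerate() {
        // Byte 0 is the sentinel, byte 1 is the type id, the rest is the child row.
        let type_id = row[1] as i8;
        groups.entry(type_id).or_default().push((idx, row[2..].to_vec()));
    }
    groups
}
```

Keeping the original row index alongside each child row is what would let a decoder reassemble the union in its original order after the per-child conversions.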

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/union-row-converter branch 2 times, most recently from 9a62f3c to 2cd0253 Compare November 13, 2025 22:03
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/union-row-converter branch from 2cd0253 to 5559011 Compare November 13, 2025 22:09

let mut child_rows = Vec::with_capacity(converters.len());
for (type_id, converter) in converters.iter().enumerate() {
    let child_array = union_array.child(type_id as i8);
Member

Here type_id is the index of the converter. It looks strange but it might be OK.
Could you use the items in type_ids instead?

Contributor

I think it makes sense because the type_id is the index of the child field types

Maybe we can document better that converters is indexed by type_id 🤔
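
To make the indexing question concrete, here is a small sketch of one alternative to relying on position: an explicit lookup from type id to converter slot. It assumes the declared type ids may not be the contiguous range 0..n. `build_slot_table` and `slot_for` are hypothetical helpers for illustration, not functions from arrow-rs or this PR.

```rust
/// Build a 128-entry table mapping each non-negative type id to the position
/// of its converter. `type_ids` is a hypothetical list of the declared ids,
/// in the same order as the converters.
fn build_slot_table(type_ids: &[i8]) -> [Option<usize>; 128] {
    let mut table = [None; 128];
    for (slot, &type_id) in type_ids.iter().enumerate() {
        assert!(type_id >= 0, "type ids are expected to be non-negative");
        table[type_id as usize] = Some(slot);
    }
    table
}

/// Look up the converter slot for a type id, or None if it was never declared.
fn slot_for(table: &[Option<usize>; 128], type_id: i8) -> Option<usize> {
    if type_id < 0 {
        return None;
    }
    table[type_id as usize]
}
```

When the type ids do happen to be 0..n in order, this table is the identity mapping, which is the case the enumerate-based loop above relies on.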

let len = rows.len();

let DataType::Union(union_fields, mode) = &field.data_type else {
    unreachable!()
Member

Suggested change
- unreachable!()
+ unreachable!("Expected a Union but got: {}", &field.data_type)

Contributor Author

I'm not sure it's worth writing a descriptive message when the branch is specifically designed for a case that will never be hit.
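
For reference, the trade-off under discussion can be reproduced with a toy type. The sketch below uses a hypothetical two-variant `DataType` enum (not Arrow's) and formats with `{:?}` instead of `{}` because the toy enum only derives `Debug`. The message costs nothing on the expected path and only appears if the invariant is ever broken.

```rust
/// Toy stand-in for a data type enum; not Arrow's `DataType`.
#[derive(Debug)]
enum DataType {
    Int32,
    Union(Vec<i8>),
}

/// Return the type ids of a union, panicking with a descriptive message if
/// the caller's invariant (the input is always a Union) is ever violated.
fn union_type_ids(data_type: &DataType) -> &[i8] {
    let DataType::Union(type_ids) = data_type else {
        unreachable!("Expected a Union but got: {:?}", data_type)
    };
    type_ids
}
```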

Contributor

@alamb alamb left a comment

Thanks @friendlymatthew -- this is looking good. I left some comments, and @martin-g's comments are good to review too.


}

#[test]
fn test_sparse_union() {
Contributor

Can you please also add tests here for union arrays that have nulls? Specifically for a union array that has a null buffer.

Contributor Author

See: 3ab1e75


for (idx, row) in rows.iter_mut().enumerate() {
    // skip the null sentinel
    let mut cursor = 1;
Contributor

I think you need to look at the null byte to recover nulls 🤔

Contributor Author

Hm, I thought union arrays don't support physical nulls, i.e. UnionArray::nulls() always returns None: https://docs.rs/arrow-array/57.0.0/src/arrow_array/array/union_array.rs.html#777

Nulls in union arrays are represented by nulls in the child arrays, not by a physical null buffer on the union itself. For example, consider an element at index 1 with type id 0 pointing to a null in the Int32 child array. The union element itself is not null; it's a valid union element that happens to point to a null value.

Now that I think about it, I wonder if we can just eagerly encode 0x01 as the null sentinel byte 🤔
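
The distinction above can be sketched with a toy sparse union. `ToyUnion` is a hypothetical stand-in, not Arrow's `UnionArray`; it only models the idea that the union carries no validity bitmap and that nullness is read from whichever child the type id selects.

```rust
/// Toy sparse union with two children, standing in for a real union array.
/// In sparse mode every child has the same length as the union itself.
struct ToyUnion {
    type_ids: Vec<i8>,
    ints: Vec<Option<i32>>,    // child for type id 0
    strs: Vec<Option<String>>, // child for type id 1
}

impl ToyUnion {
    /// The union itself has no validity bitmap, so there is nothing to return.
    fn nulls(&self) -> Option<Vec<bool>> {
        None
    }

    /// A slot is logically null when the child slot it selects is null.
    fn is_logically_null(&self, idx: usize) -> bool {
        match self.type_ids[idx] {
            0 => self.ints[idx].is_none(),
            1 => self.strs[idx].is_none(),
            other => panic!("unknown type id {other}"),
        }
    }
}
```

With type ids `[0, 0, 1]` and a null at index 1 of the Int32 child, `nulls()` is still `None` while `is_logically_null(1)` is true, which matches the example in the comment above.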

Contributor

> Hm, I thought union arrays don't support physical nulls.

I agree, https://arrow.apache.org/docs/format/Columnar.html#union-layout says:

> Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.

If the union itself doesn't have a null mask, is there any reason to include a sentinel byte at all?

Contributor Author

I agree. Done in 95a0172

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/union-row-converter branch from e365f92 to e960120 Compare November 21, 2025 15:54
Development

Successfully merging this pull request may close these issues.

Support Union data types for row format
