Casting support for RunEndEncoded arrays #8589
Raised this PR to get Richard Baah's excellent work over the line! cc @albertlockett @brancz @alamb @Rich-T-kid
I think we're getting close to the finish line with these changes!
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes
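The upcast-only rule in the commit message can be illustrated with a small standalone sketch. `RunEndType`, `can_cast_run_ends`, and `narrow_run_ends` are hypothetical names invented for this illustration, not arrow-rs APIs, and treating a same-width cast as allowed is an assumption the commit message does not state.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the three run-end index types
/// (this is not the arrow-rs `DataType` enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RunEndType {
    Int16,
    Int32,
    Int64,
}

impl RunEndType {
    /// Bit width of the index type.
    fn bits(self) -> u8 {
        match self {
            RunEndType::Int16 => 16,
            RunEndType::Int32 => 32,
            RunEndType::Int64 => 64,
        }
    }
}

/// Allow a run-end cast only when the target is at least as wide as the source.
/// Allowing the same-width case is an assumption of this sketch.
pub fn can_cast_run_ends(from: RunEndType, to: RunEndType) -> bool {
    to.bits() >= from.bits()
}

/// Shows why narrowing is rejected: it fails as soon as a single run end
/// does not fit in the narrower type.
pub fn narrow_run_ends(run_ends: &[i32]) -> Option<Vec<i16>> {
    run_ends.iter().map(|&end| i16::try_from(end).ok()).collect()
}
```

A run end of 40,000 is valid as an Int32 but cannot be represented as an Int16, which is the kind of invalid sequence a downcast could produce.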
Is there some way we can avoid the quadratic codegen with code paths parameterized on both run end type and value type? Perhaps it'd be possible to identify where the transitions are, e.g. by using the comparison kernels to compare the array with a slice of itself offset by one, and then use this to construct the indexes and a filter to construct the values array? Have we done any empirical quantification of the impact this has on code bloat / compile times?
Edit: https://docs.rs/arrow-ord/latest/arrow_ord/partition/fn.partition.html is the function I'm thinking of.
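The idea of locating transitions by comparing the array with itself offset by one can be sketched on plain slices. This is a standard-library stand-in for illustration only: `partition_ranges` is a hypothetical helper, not the `arrow_ord::partition::partition` kernel linked above, and it handles a single column of `PartialEq` values.

```rust
use std::ops::Range;

/// Split `values` into maximal ranges of consecutive equal elements.
/// Hypothetical helper for illustration; not an arrow-rs function.
pub fn partition_ranges<T: PartialEq>(values: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    if values.is_empty() {
        return ranges;
    }
    let mut start = 0;
    // Pair every element with its successor: `values` against `values[1..]`.
    for (i, (prev, next)) in values.iter().zip(&values[1..]).enumerate() {
        if prev != next {
            // A transition between index i and i + 1 closes the current run.
            ranges.push(start..i + 1);
            start = i + 1;
        }
    }
    // The final run always extends to the end of the input.
    ranges.push(start..values.len());
    ranges
}
```

Only the comparison depends on the value type here. The run-end integer type would come into play afterwards, when the range ends are converted, so the two types never need to be combined in one generic code path.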
I have not! Happy to do that though. Any pointers to how you'd like me to do that, from previous PRs for example? Or does a basic comparison of compile time and binary size on main and this branch suffice?
Just this, quadratic codegen is typically severe enough to be easily measurable.
The compile time increased by 2 seconds.
The size of
|
Yeah... That's quite bad for a single kernel, especially given the relatively niche usage of RunEndEncodedArrays. I hope you can understand that we need to be careful to keep this under control. What did you think of my suggestion about using the partition kernel to compute the run ends? It might actually be faster and would largely eliminate the additional codegen. It would mean making arrow-cast depend on arrow-ord, which is a bit meh, but perhaps unavoidable. It could possibly be a feature flag. 🤔
I understand! I haven't had time to look into your suggestion, but I will. Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?
Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example #3616 #4705 #4701 to name a few. Ultimately it's a compromise: there isn't a way to avoid this bloat and support dictionaries, so we pay the tax. With run-end encoded arrays the tax isn't necessary, so it is better we don't pay it.
Thanks for the context!
We're talking about the pack_runs macro, right? I realize it's nice as a macro, but it also seems fine to just write out by hand.
The fact that it is a macro is not the issue here; the problem is code generation with quadratic complexity, parameterized on both run end type and value type.
Got it. The way I see it, there are two paths. Either:
@vegarsti can you give the arrow-ord partitioning a try so we understand whether this would be a workable path?
Yeah, I will give the arrow-ord partitioning a try. Some time this week! It seems like a good approach, thanks @tustvold! As for the feature flag, to me that seems a bit complicated: either the REE type should be supported or not, imo? Also, unless I'm missing something, whether to put this in a feature flag would apply to the whole REE epic #3520, so that should (eventually) be raised there. It would be great to have some guidelines for the arrow-rs project with regard to the tradeoff between features and size/compile times. I'm guessing opinions might vary a bit between maintainers as well. Guidelines might make it easier to come to alignment in such discussions. In any case, maybe we get around this issue with the arrow-ord approach 🙏🏻
Okay I have the scaffolding but tests fail. Stay tuned 👀
I've implemented the partition approach now, see b8c0754. Regardless of compile time and size, this is so much cleaner than the previous approach (+46 -257), what a great idea @tustvold. Now the size is 7512216, up from 7316832, so the increase is about 2.7%. The compile time was cargo build --release 575.50s user 23.11s system 906% cpu 1:06.00 total. What do you think @tustvold and @brancz?
This is not unexpected, as it is now looping in arrow-ord which itself isn't the lightest of crates. However, most use-cases will already include it as part of the build tree, so the net change for them will be negligible.
match to_type {
    DataType::RunEndEncoded(_, _) => {
        // Check if from_type supports equality (can be REE-encoded)
        match from_type {
With the new approach I think we should also support dictionary arrays
You're right! Made that change and added a test for it in 82c384b.
Took a quick look, looks good to me 👍
Left some small suggestions
arrow-cast/src/cast/run_array.rs
}

// Partition the array to identify runs of consecutive equal values
let partitions = partition(&[array.clone()])?;
I did wonder if this should be cast_array, but I think this could cause inconsistency with can_cast_run_end_encoded and whilst casts can be lossy, they should be deterministic.
Hm, good point. Later we're doing take on cast_array, so it does seem correct to use cast_array here, indeed. Tests pass with that, as well.
Did this change in a16d555.
let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
    values_indexes.iter().map(|&idx| idx as u32),
);
let values_array = take(&cast_array, &indices, None)?;
It occurs to me that internally Partitions is just a BooleanBuffer that'd be ideal to feed to the filter kernel. Perhaps we should expose that notion 🤔
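The mask-plus-filter idea in this comment can be sketched with a plain `Vec<bool>`. Both functions are hypothetical stand-ins written for illustration: they are not the arrow-rs `Partitions` type or its filter kernel, and the real internal layout of `Partitions` is only what the comment above describes.

```rust
/// Mark the first element of every run of consecutive equal values.
/// Hypothetical helper; a Vec<bool> stands in for a boolean buffer.
pub fn run_start_mask<T: PartialEq>(values: &[T]) -> Vec<bool> {
    (0..values.len())
        .map(|i| i == 0 || values[i] != values[i - 1])
        .collect()
}

/// Keep only the elements whose mask entry is true, which yields
/// one value per run when used with `run_start_mask`.
pub fn filter_by_mask<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

Filtering by the mask selects the run values directly, without first materializing an index array for a take.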
arrow-cast/src/cast/run_array.rs
for partition in partitions.ranges() {
    values_indexes.push(array_idx);
    array_idx += partition.end - partition.start;
    run_ends.push(array_idx);
Won't array_idx just be partition.end presuming the ranges are contiguous?
That's right! Good catch! We still need to add to values_indexes, off by one relative to run_ends. What do you think about something like this? This is correct as well, passes tests. Feels like it could be cleaner somehow, though.
// Add the first value index
values_indexes.push(0);
for (i, partition) in partitions.ranges().iter().enumerate() {
run_ends.push(partition.end);
// Add the next value index if we're not at the last partition
if i < partitions.ranges().len() - 1 {
values_indexes.push(partition.end);
}
}
This also works
let mut last_partition_end = 0;
for partition in partitions.ranges() {
values_indexes.push(last_partition_end);
run_ends.push(partition.end);
last_partition_end = partition.end;
}
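As a standalone sketch, the loop above can be wrapped in a function over plain `std::ops::Range` values. `runs_from_ranges` is a hypothetical name, and the sketch assumes, as the review comment does, that the ranges are contiguous and start at 0.

```rust
use std::ops::Range;

/// Derive run ends and the index of each run's first element from
/// contiguous partition ranges. Hypothetical helper for illustration.
pub fn runs_from_ranges(ranges: &[Range<usize>]) -> (Vec<usize>, Vec<usize>) {
    let mut run_ends = Vec::with_capacity(ranges.len());
    let mut values_indexes = Vec::with_capacity(ranges.len());
    let mut last_partition_end = 0;
    for partition in ranges {
        // Each run starts where the previous one ended.
        values_indexes.push(last_partition_end);
        // The run end is the exclusive end of the partition.
        run_ends.push(partition.end);
        last_partition_end = partition.end;
    }
    (run_ends, values_indexes)
}
```

For the ranges 0..2, 2..3, 3..6 this gives run ends [2, 3, 6] and value indexes [0, 2, 3], so each value index trails its run end by one partition.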
Went with the latter, in e086d4c.
arrow-cast/src/cast/mod.rs
) => Ok(new_null_array(to_type, array.len())),
(RunEndEncoded(index_type, _), _) => {
    let mut cast_options = cast_options.clone();
    cast_options.safe = false;
Why this?
The description in the original PR #7713 has the reasoning under "Run-End Encoded Array Casting: Tradeoffs and Implementation". I found that section a bit wordy, but this line you commented on definitely needs a comment.
Taking a closer look at this, I actually think that's wrong. I see we don't do this anywhere else in arrow. I think that code and comment might have been AI generated as well 🤔
Removed in bdcaa4b
Thanks for the review @tustvold! Will address today.
@tustvold I've addressed your comments now, let me know what you think. Thanks for the helpful and quick review!
Which issue does this PR close?
RunArray (Run Length Encoding (RLE) / Run End Encoding (REE) support) #3520, but no specific issue for casting.
Rationale for this change
This PR implements casting support for RunEndEncoded arrays in Apache Arrow.
What changes are included in this PR?
Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function
run_end_encoded_cast(): Casts values within existing RunEndEncoded arrays to different types
cast_to_run_end_encoded(): Converts regular arrays to RunEndEncoded format with run-end encoding
can_cast_types(): Supports RunEndEncoded compatibility rules. Downcasting is not allowed.
Are these changes tested?
Yes!
Are there any user-facing changes?
No breaking changes, just new functionality