Casting support for RunEndEncoded arrays #8589

vegarsti · 2025-10-11T05:36:43Z

Which issue does this PR close?

Contributes towards the RunEndEncoded (REE) epic, [Epic] Implement RunArray (Run Length Encoding (RLE) / Run End Encoding (REE) support) #3520, but there is no specific issue for casting.
Replaces PRs Implemented casting for RunEnd Encoding #7713 and [Draft] Implemented casting for RunEnd Encoding (pt2) #8384.

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow.

Narrowing conversions of run-end indices (e.g., Int64 → Int16) fail immediately if any value exceeds the target type's bounds.
Widening conversions (e.g., Int16 → Int32 → Int64) are allowed, as they are lossless.
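
To make the rule above concrete, here is a minimal plain-Rust sketch of checked narrowing and infallible widening of run ends. It does not use the arrow crates; `narrow_run_ends` and `widen_run_ends` are hypothetical names chosen for illustration, not functions from this PR.

```rust
use std::convert::TryFrom;

/// Narrow run ends from i64 to i16, failing on the first value that does not fit.
/// (Hypothetical helper; it only models the overflow rule described above.)
fn narrow_run_ends(run_ends: &[i64]) -> Result<Vec<i16>, String> {
    run_ends
        .iter()
        .map(|&end| {
            i16::try_from(end).map_err(|_| format!("run end {end} does not fit in Int16"))
        })
        .collect()
}

/// Widen run ends from i16 to i64. Every i16 fits in an i64, so this cannot fail.
fn widen_run_ends(run_ends: &[i16]) -> Vec<i64> {
    run_ends.iter().map(|&end| i64::from(end)).collect()
}
```

Collecting into a `Result` stops at the first out-of-range run end, which matches the "fail immediately" wording.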

What changes are included in this PR?

Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function

run_end_encoded_cast(): Casts values within existing RunEndEncoded arrays to different types
cast_to_run_end_encoded(): Converts regular arrays to RunEndEncoded format with run-end encoding
Updated can_cast_types() to support RunEndEncoded compatibility rules. Downcasting is not allowed.
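
For readers new to the layout, the following plain-Rust sketch models what a run-end encoded array stores: each run end is the exclusive logical end of its run, and a logical index is resolved by binary search. The `RunEndEncoded` struct here is a hypothetical toy model, not the arrow-rs `RunArray` type.

```rust
/// A toy model of a run-end encoded array: `run_ends[i]` is the exclusive
/// logical end of run `i`, and `values[i]` is the value repeated in that run.
struct RunEndEncoded<T> {
    run_ends: Vec<i32>,
    values: Vec<T>,
}

impl<T: Clone> RunEndEncoded<T> {
    /// The logical length is the last run end (0 if there are no runs).
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Resolve a logical index by binary-searching the run ends.
    fn value_at(&self, logical_index: usize) -> Option<&T> {
        if logical_index >= self.len() {
            return None;
        }
        // Index of the first run whose end is strictly greater than the logical index.
        let physical = self
            .run_ends
            .partition_point(|&end| (end as usize) <= logical_index);
        self.values.get(physical)
    }

    /// Expand back into a plain vector of logical values.
    fn decode(&self) -> Vec<T> {
        (0..self.len())
            .filter_map(|i| self.value_at(i).cloned())
            .collect()
    }
}
```

With `run_ends = [2, 3, 6]` and `values = ["a", "b", "c"]`, the logical array is `a a b c c c`.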

Are these changes tested?

Yes!

Are there any user-facing changes?

No breaking changes, just new functionality

vegarsti · 2025-10-11T06:26:59Z

Raised this PR to get Richard Baah's excellent work over the line! cc @albertlockett @brancz @alamb @Rich-T-kid

brancz

I think we're getting close to the finish line with these changes!

arrow-cast/src/cast/run_array.rs

tustvold · 2025-10-12T22:20:03Z

Is there some way we can avoid the quadratic codegen with code paths parameterized on both run end type and value type? Perhaps it'd be possible to identify where the transitions are, perhaps using the comparison kernels and comparing the array with a slice offset by one, and then use this to construct the indexes and a filter to construct the values array?

Have we done any empirical quantification into the impact this has on code bloat / compile times?

Edit: https://docs.rs/arrow-ord/latest/arrow_ord/partition/fn.partition.html is the function I'm thinking of.
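
A rough plain-Rust sketch of the idea in this comment: compare the array with itself shifted by one to find transitions, then derive run ends from the transition positions and the values from a filter mask. This only models the technique on slices; `encode_by_transitions` is a hypothetical name, and the real kernels operate on Arrow arrays.

```rust
/// Returns `(run_ends, values)` for `input`, found by comparing each element
/// with its successor. (Hypothetical helper, slices only.)
fn encode_by_transitions<T: PartialEq + Clone>(input: &[T]) -> (Vec<usize>, Vec<T>) {
    if input.is_empty() {
        return (Vec::new(), Vec::new());
    }
    // transitions[i] is true when input[i] != input[i + 1].
    let transitions: Vec<bool> = input[..input.len() - 1]
        .iter()
        .zip(&input[1..])
        .map(|(a, b)| a != b)
        .collect();

    // A run ends right after each transition, and at the end of the input.
    let mut run_ends: Vec<usize> = transitions
        .iter()
        .enumerate()
        .filter_map(|(i, &t)| if t { Some(i + 1) } else { None })
        .collect();
    run_ends.push(input.len());

    // Filter mask: keep the first element and every element that follows a transition.
    let mut keep = vec![true];
    keep.extend(transitions.iter().copied());
    let values = input
        .iter()
        .zip(keep.iter())
        .filter_map(|(v, &k)| if k { Some(v.clone()) } else { None })
        .collect();

    (run_ends, values)
}
```

Only the comparison step depends on the value type; the run-end construction works purely on positions.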

vegarsti · 2025-10-13T06:31:01Z

> Have we done any empirical quantification into the impact this has on code bloat / compile times?

I have not! Happy to do that though. Any pointers to how you'd like me to do that, from previous PRs for example? Or does a basic comparison of compile time and binary size on main and this branch suffice?

tustvold · 2025-10-13T08:46:27Z

> Or does a basic comparison of compile time and binary size on main and this branch suffice?

Just this, quadratic codegen is typically severe enough to be easily measurable.

vegarsti · 2025-10-13T10:18:53Z

Wall-clock compile time differed by about 2 seconds (1:08.66 on main vs. 1:06.33 on the branch).

         cargo build --release
main     569.35s user 23.69s system 863% cpu 1:08.66 total
branch   567.33s user 23.96s system 891% cpu 1:06.33 total

The size of libarrow_cast.rlib increased by about 279 kB (3.82%).

         libarrow_cast.rlib size
main     7,316,832
branch   7,596,568
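
For anyone who wants to re-check the figures, the percentage follows directly from the two sizes in the table. A small sketch, assuming the sizes are in bytes; `size_delta` and `percent_increase` are hypothetical helper names.

```rust
/// Absolute growth in bytes between two builds.
fn size_delta(before: u64, after: u64) -> u64 {
    after - before
}

/// Relative growth, in percent of the original size.
fn percent_increase(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}
```

Calling these with `7_316_832` and `7_596_568` gives a delta of 279,736 and roughly 3.82%.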

tustvold · 2025-10-13T10:21:47Z

Yeah... That's quite bad for a single kernel, especially given the relatively niche usage of RunEndEncodedArrays, I hope you can understand that we need to be careful to keep this under control.

What did you think of my suggestion about using the partition kernel to compute the run ends? It might actually be faster and would largely eliminate the additional codegen.

It would mean making arrow-cast depend on arrow-ord, which is a bit meh, but perhaps unavoidable. It could possibly be a feature flag. 🤔

vegarsti · 2025-10-13T11:14:08Z

I understand! I haven't had time to look into your suggestion, but I will.

Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

tustvold · 2025-10-13T11:23:03Z

> Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example #3616 #4705 #4701 to name a few. Ultimately it's a compromise, there isn't a way to avoid this bloat and support dictionaries so we pay the tax, with run-end encoded arrays the tax isn't necessary and so it is better we don't pay it.

vegarsti · 2025-10-13T11:31:16Z

> > Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?
>
> Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example #3616 #4705 #4701 to name a few. Ultimately it's a compromise, there isn't a way to avoid this bloat and support dictionaries so we pay the tax, with run-end encoded arrays the tax isn't necessary and so it is better we don't pay it.

Thanks for the context!

brancz · 2025-10-13T19:41:00Z

We're talking about the pack_runs macro right? I realize it's nice as a macro, but it also seems fine to just write out by hand.

tustvold · 2025-10-13T19:47:45Z

The fact that it is a macro is not the issue here. The problem is code generation with complexity <Number of Index Types> * <Number of Value Types>, which results in slow compilation and binary bloat. Whether this is achieved with macros, generics, or copy-paste doesn't change this 😄
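
The counting argument can be illustrated in plain Rust. In this sketch (hypothetical functions, not code from the PR), `pack_runs_both` is generic over both the run-end type and the value type, so the compiler emits one copy per pair that is used. Splitting the work into a value-only step and a run-end-only step needs one copy per type instead.

```rust
use std::convert::TryFrom;

/// Generic over BOTH run-end type `R` and value type `V`:
/// one monomorphized copy per (R, V) pair in use.
fn pack_runs_both<R: TryFrom<usize>, V: PartialEq>(values: &[V]) -> Option<Vec<R>> {
    let mut run_ends = Vec::new();
    for i in 1..=values.len() {
        if i == values.len() || values[i] != values[i - 1] {
            run_ends.push(R::try_from(i).ok()?);
        }
    }
    Some(run_ends)
}

/// Generic over the value type only: one copy per value type.
fn run_boundaries<V: PartialEq>(values: &[V]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for i in 1..=values.len() {
        if i == values.len() || values[i] != values[i - 1] {
            boundaries.push(i);
        }
    }
    boundaries
}

/// Generic over the run-end type only: one copy per run-end type.
fn to_run_ends<R: TryFrom<usize>>(boundaries: &[usize]) -> Option<Vec<R>> {
    boundaries.iter().map(|&b| R::try_from(b).ok()).collect()
}
```

With three run-end types and M value types, the first shape instantiates up to 3 × M functions, while the split shape instantiates 3 + M.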

brancz · 2025-10-14T08:50:14Z

Got it. The way I see it, there are two paths. Either:

1. We manage to get it to work with arrow_ord's partition and avoid the extra generated code altogether (whether by macro or not).
2. We do it via macro but put REE casting behind a feature flag, so users who want this can opt into it. It feels a little strange to hide features that are very reasonable and available for other types behind a feature flag, but I don't feel strongly one way or another, as we'd just enable it and move on.

@vegarsti can you give the arrow-ord partitioning a try so we understand whether this would be a workable path?

vegarsti · 2025-10-14T10:04:52Z

Yeah, I will give the arrow-ord partitioning a try. Some time this week! It seems like a good approach, thanks @tustvold!

As for the feature flag, to me that seems a bit complicated - either the REE type should be supported or not, imo? Also, unless I'm missing something, whether to put this in a feature flag would apply to the whole REE epic #3520, so that should (eventually) be raised there.

It would be great to have some guidelines for the arrow-rs project on the tradeoff between features and size/compile times. I'm guessing opinions might vary a bit between maintainers as well. Guidelines might make it easier to reach alignment in such discussions.

In any case, maybe we get around this issue with the arrow-ord approach 🙏🏻

vegarsti · 2025-10-15T13:23:10Z

Okay I have the scaffolding but tests fail. Stay tuned 👀

vegarsti · 2025-10-15T21:34:16Z

I've implemented the partition approach now, see b8c0754. Regardless of compile time and size, this is so much cleaner than the previous approach (+46 -257), what a great idea @tustvold. Now the size is 7512216, up from 7316832, so the increase is 2.6%. The compile time was cargo build --release 575.50s user 23.11s system 906% cpu 1:06.00 total. What do you think @tustvold and @brancz?

arrow-cast/src/cast/run_array.rs

tustvold · 2025-10-15T21:40:47Z

> Now the size is 7512216, up from 7316832, so the increase is 2.6%. The compile time was cargo build --release 575.50s user 23.11s system 906% cpu 1:06.00 total. What do you think @tustvold and @brancz?

This is not unexpected, as it is now looping in arrow-ord which itself isn't the lightest of crates. However, most use-cases will already include it as part of the build tree, so the net change for them will be negligible.

tustvold · 2025-10-15T21:43:12Z

arrow-cast/src/cast/run_array.rs

+    match to_type {
+        DataType::RunEndEncoded(_, _) => {
+            // Check if from_type supports equality (can be REE-encoded)
+            match from_type {


With the new approach I think we should also support dictionary arrays

You're right! Made that change and added a test for it in 82c384b.

tustvold

Took a quick look, looks good to me 👍

Left some small suggestions

tustvold · 2025-10-15T21:46:36Z

arrow-cast/src/cast/run_array.rs

+    }
+
+    // Partition the array to identify runs of consecutive equal values
+    let partitions = partition(&[array.clone()])?;


I did wonder if this should be cast_array, but I think this could cause inconsistency with can_cast_run_end_encoded and whilst casts can be lossy, they should be deterministic.

Hm, good point. Later we're doing take on cast_array, so it does seem correct to use cast_array here, indeed. Tests pass with that, as well.

Did this change in a16d555.
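
One way to see the difference between partitioning the source array and partitioning the cast array is a lossy cast that merges neighbouring values. The sketch below uses Rust's `as` truncation from f64 to i32 purely as a stand-in for a lossy cast; it is not a claim about how arrow-cast rounds, and both function names are hypothetical.

```rust
/// Count runs of consecutive equal values.
fn count_runs<T: PartialEq>(values: &[T]) -> usize {
    if values.is_empty() {
        return 0;
    }
    1 + values.windows(2).filter(|w| w[0] != w[1]).count()
}

/// Stand-in for a lossy cast: truncate f64 to i32 with `as`.
fn lossy_cast(values: &[f64]) -> Vec<i32> {
    values.iter().map(|&v| v as i32).collect()
}
```

For `[1.2, 1.7, 2.0, 2.9]` the source has four runs, but the cast values `[1, 1, 2, 2]` have only two. Partitioning before the cast would therefore emit adjacent runs with equal values, while partitioning the cast array yields the minimal encoding.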

tustvold · 2025-10-15T21:48:44Z

arrow-cast/src/cast/run_array.rs

+    let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
+        values_indexes.iter().map(|&idx| idx as u32),
+    );
+    let values_array = take(&cast_array, &indices, None)?;


It occurs to me that internally Partitions is just a BooleanBuffer that'd be ideal to feed to the filter kernel. Perhaps we should expose that notion 🤔

tustvold · 2025-10-15T21:49:30Z

arrow-cast/src/cast/run_array.rs

+    for partition in partitions.ranges() {
+        values_indexes.push(array_idx);
+        array_idx += partition.end - partition.start;
+        run_ends.push(array_idx);


Won't array_idx just be partition.end presuming the ranges are contiguous?

That's right! Good catch! We still need values_indexes to lag run_ends by one partition, though. What do you think about something like this? It is also correct and passes tests, but it feels like it could be cleaner.

    // Add the first value index
    values_indexes.push(0);
    for (i, partition) in partitions.ranges().iter().enumerate() {
        run_ends.push(partition.end);
        // Add the next value index if we're not at the last partition
        if i < partitions.ranges().len() - 1 {
            values_indexes.push(partition.end);
        }
    }

This also works:

    let mut last_partition_end = 0;
    for partition in partitions.ranges() {
        values_indexes.push(last_partition_end);
        run_ends.push(partition.end);
        last_partition_end = partition.end;
    }

Went with the latter in e086d4c.
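
The loop discussed in this thread can be tried out without the arrow crates. In the sketch below, `partition_ranges` is a hypothetical slice-based stand-in for the contiguous ranges that arrow-ord's partition kernel returns, and `runs_from_ranges` mirrors the second variant above.

```rust
use std::ops::Range;

/// Hypothetical stand-in for partition ranges: contiguous half-open ranges
/// covering runs of consecutive equal values.
fn partition_ranges<T: PartialEq>(values: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=values.len() {
        if i == values.len() || values[i] != values[i - 1] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}

/// Derive, per run, the index of its first value and its exclusive run end.
fn runs_from_ranges(ranges: &[Range<usize>]) -> (Vec<usize>, Vec<usize>) {
    let mut values_indexes = Vec::with_capacity(ranges.len());
    let mut run_ends = Vec::with_capacity(ranges.len());
    let mut last_partition_end = 0;
    for partition in ranges {
        values_indexes.push(last_partition_end);
        run_ends.push(partition.end);
        last_partition_end = partition.end;
    }
    (values_indexes, run_ends)
}
```

As long as the ranges are contiguous and start at 0, `last_partition_end` always equals `partition.start`, so pushing `partition.start` would be an equivalent formulation.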

tustvold · 2025-10-15T21:50:14Z

arrow-cast/src/cast/mod.rs

        ) => Ok(new_null_array(to_type, array.len())),
+        (RunEndEncoded(index_type, _), _) => {
+            let mut cast_options = cast_options.clone();
+            cast_options.safe = false;


The description in the original PR #7713 has the reasoning under "Run-End Encoded Array Casting: Tradeoffs and Implementation". I found that section a bit wordy, but this line you commented on definitely needs a comment.

Taking a closer look at this, I actually think that's wrong. I see we don't do this anywhere else in arrow. I think that code and comment might have been AI generated as well 🤔

Removed in bdcaa4b
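
For context on what the removed line toggled: in arrow-rs, `CastOptions::safe` selects whether a failed element cast produces a null (`true`) or an error (`false`). The plain-Rust model below only illustrates that distinction for an Int64 to Int16 cast; `cast_i64_to_i16` is a hypothetical function, not the arrow-cast implementation.

```rust
use std::convert::TryFrom;

/// Model of safe vs. non-safe casting: with `safe = true`, out-of-range values
/// become `None` (null); with `safe = false`, the whole cast returns an error.
fn cast_i64_to_i16(values: &[i64], safe: bool) -> Result<Vec<Option<i16>>, String> {
    values
        .iter()
        .map(|&v| match i16::try_from(v) {
            Ok(n) => Ok(Some(n)),
            Err(_) if safe => Ok(None),
            Err(_) => Err(format!("cannot cast {v} to Int16")),
        })
        .collect()
}
```

Overriding the caller's choice would silently change which of these two behaviours they get, which is one reason to leave the option untouched.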

vegarsti · 2025-10-16T06:56:53Z

Thanks for the review @tustvold! Will address today.

vegarsti · 2025-10-18T07:18:28Z

@tustvold I've addressed your comments now, let me know what you think. Thanks for the helpful and quick review!

github-actions bot added the arrow Changes to the arrow crate label Oct 11, 2025

This was referenced Oct 11, 2025

[Draft] Implemented casting for RunEnd Encoding (pt2) #8384

Draft

Convert RunEndEncoded to Parquet #8069

Draft

vegarsti changed the title ~~Casting to/from RunEndEncoded arrays~~ Casting support for RunEndEncoded arrays Oct 11, 2025

vegarsti force-pushed the cast-run-end-encoded-arrays branch 2 times, most recently from 87e543d to f9ae6f9 Compare October 11, 2025 06:25

brancz reviewed Oct 11, 2025

rich-t-kid-datadog and others added 10 commits October 12, 2025 07:49

Implemented casting for RunEnd Encoding

ca051e2

Implemented casting for RunEnd Encoding

60c52b4

feat: Add Run-End Encoded array casting with overflow protection

0a6d865

Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.

feat: Add Run-End Encoded array casting with overflow protection

8b434d4

Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes

Use type specific zero-copy comparisons in cast_to_run_end_encoded

77cda81

Move tests in mod run_end_encoded_tests into mod tests

b666a97

panic if REE in cast_to_run_end_encoded

6eafcea

Use unreachable macro

3c2e837

Simplify some assertions

d1e5120

Extract populate_run_ends_and_values, which casts then iterates to identify runs

2358010

vegarsti force-pushed the cast-run-end-encoded-arrays branch from 88c0d8a to 2358010 Compare October 12, 2025 05:49

vegarsti added 2 commits October 12, 2025 08:39

Add missing Float16 and Decimal types to can_cast_to_run_end_encoded

7ed2872

Use a macro for packing runs

692f6ea

Use partition from arrow-ord to find runs

b8c0754

vegarsti commented Oct 15, 2025

tustvold reviewed Oct 15, 2025

vegarsti added 4 commits October 18, 2025 08:49

Remove cast_options.safe = false

bdcaa4b

Simplify variables in partition loop

e086d4c

Partition on cast_array, not array

a16d555

Support casting from dictionary types and add test for that

82c384b
