Conversation

@corasaurus-hex
Contributor

Which issue does this PR close?

Rationale for this change

Currently DataFusion can only read Arrow files if they're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.

What changes are included in this PR?

To accomplish the above, this PR splits the Arrow datasource into two separate implementations (ArrowStream* and ArrowFile*) with a facade on top to differentiate between the formats at query planning time.
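For illustration, here's roughly how I expect it to be used once this lands. This is just a sketch: the file path is made up, and `register_arrow` / `ArrowReadOptions` are DataFusion's existing Arrow entry points that the new facade sits behind.

```rust
use datafusion::error::Result;
use datafusion::execution::options::ArrowReadOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // With this PR, the registered path may point at data written in either
    // the Arrow IPC file format or the IPC stream format (path is hypothetical).
    ctx.register_arrow("events", "data/events.arrows", ArrowReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM events").await?.show().await?;
    Ok(())
}
```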

Are these changes tested?

Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.

Are there any user-facing changes?

Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?

@github-actions github-actions bot added the core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and datasource (Changes to the datasource crate) labels Nov 3, 2025
// correct offset which is a lot of duplicate I/O. We're opting to avoid
// that entirely by only acting on a single partition and reading sequentially.
Ok(None)
}
Contributor Author

this is perhaps the weightiest decision in this PR. if we want to repartition a file in the ipc stream format then we need to read from the beginning of the file for each partition, or figure out another way to create the ad-hoc equivalent of the ipc file format footer so we can minimize duplicate reads (likely by reading the entire file all the way through once and then caching the result in memory for the execution plan to use for each partition)
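to make that alternative concrete, here's a rough sketch (not something this PR implements; every name is hypothetical) of the in-memory "footer equivalent" a single up-front scan could build, and how it would be handed out to partitions:

```rust
use std::ops::Range;

/// Hypothetical index built by one up-front scan of an IPC stream file,
/// playing the role the footer plays in the IPC file format: the byte
/// ranges of every dictionary and record-batch message, in stream order.
struct StreamBatchIndex {
    dictionary_ranges: Vec<Range<u64>>,
    batch_ranges: Vec<Range<u64>>,
}

impl StreamBatchIndex {
    /// Split the record batches into `target_partitions` contiguous groups so
    /// each partition can seek straight to its own byte ranges. Dictionaries
    /// would still need to be replayed by every partition (or decoded once
    /// and cached in memory for the execution plan to share).
    fn partition(&self, target_partitions: usize) -> Vec<Vec<Range<u64>>> {
        let per_partition = self
            .batch_ranges
            .len()
            .div_ceil(target_partitions.max(1))
            .max(1);
        self.batch_ranges
            .chunks(per_partition)
            .map(|chunk| chunk.to_vec())
            .collect()
    }
}
```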

@jdcasale jdcasale Nov 5, 2025

I'd argue that while this problem is worth solving, doing so is tangential to this change.
I'd like to see this solved, but I see no reason why we couldn't do so in a follow-on.

Probably worth documenting the practical consequences of leaving it in this state though -- correct me if I'm wrong here, but I think this means that we end up hydrating the entire file into memory for certain operations, right? That's probably not a good long-term state.

nvm, after rereading I misunderstood this, it only affects IO

Contributor Author

I can't imagine this would mean I need to read the entire file into memory and keep it there? In my previous message I meant we would need to read all the record batch and dictionary locations and keep them in memory in much the same way that the arrow file format footer does. So it would mean a single pass through to record all of that and then multiple threads can seek to different parts of the file and process it.

That's my understanding of the effect of this, that it means we can't parallelize queries against this file format.

If you believe that the resulting behavior would be pathological to the extreme then we should absolutely document that. Thoughts on how we can reliably test that it is? Or who might be aware of the implications of this? And where to document it?

Contributor

I think partitioning is doable, but it's better done as a follow-up if anyone has a real use case.

In order to repartition, this function has to scan once, record the dictionary and batch positions, then split the work evenly across parallel partitioned workers -- that scan can run at around full disk bandwidth (roughly 5 GB/s on recent MacBooks).
Regarding decoding the batches from an Arrow IPC Stream file into in-memory Arrow RecordBatches: if dictionary encoding and heavyweight compression like zstd are applied, the bandwidth can be way lower (several hundred MB/s).
So it's still worth a whole scan up front to make the overall processing faster with partitioning, though I don't know if it's a common requirement to query large IPC Stream files.
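As a rough back-of-the-envelope illustration using those figures: for a 10 GB stream file, the extra indexing pass costs about 2 s (10 GB at ~5 GB/s), while decoding at ~300 MB/s takes roughly 33 s on a single partition but only ~4 s spread across 8 partitions, so the up-front scan pays for itself whenever the query is decode-bound.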

@jdcasale jdcasale left a comment

I think this is basically right. Couple of nits, one question.


@github-actions github-actions bot removed the documentation (Improvements or additions to documentation) label Nov 9, 2025
@corasaurus-hex
Contributor Author

And another update: I think the other PR had the right of it. We maintain backwards compatibility and keep users from having to know about the naming messiness. I've merged that PR back into this one and pulled in Adrian's changes from main.

@jdcasale jdcasale left a comment

I don't have merge permissions, but I have reviewed; all my questions have been addressed — lgtm

Contributor

@2010YOUY01 2010YOUY01 left a comment

Thank you for the amazing work -- this looks good to me, left several suggestions.

However, I'm not deeply familiar with the related code, so I'd prefer that others approve and merge it.

/// Does not hold anything special, since [`FileScanConfig`] is sufficient for arrow
/// `FileSource` for Arrow IPC file format. Supports range-based parallel reading.
#[derive(Clone)]
pub struct ArrowSource {
Contributor

There seem to be several public API changes here, like ArrowSource and ArrowOpener; it would be great to include them in the upgrade guide https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md
Though I'm not sure if we can make it into 51.0.0; if not, this should go under the 52.0.0 section

Contributor Author

From the sounds of it, it's not going to make it into 51. Do I just make a section for 52?

Contributor

Yes, that would be great.

@2010YOUY01 2010YOUY01 added the api change (Changes the API exposed to users of the crate) label Nov 10, 2025
@2010YOUY01
Contributor

I do find one thing annoying, but I don't know if it impacts this PR. We are calling these ArrowFileSource and ArrowStreamSource, but both of them are file readers, right? It's just that one is stored in a random-access layout and one is stored as a stream. When I see the name ArrowStreamSource I would intuitively think that means some kind of Arrow stream. Especially if I see the two of those next to each other, my intuition would be that one is a streaming source and one is a file source. I know you're reusing the terminology from the Arrow spec, so again I may be overthinking this.

Perhaps ArrowIPCStreamSource if that's the formal term 🤔

}

fn with_batch_size(&self, _batch_size: usize) -> Arc<dyn FileSource> {
Arc::new(Self { ..self.clone() })
Member

Suggested change
Arc::new(Self { ..self.clone() })
Arc::new(self.clone())

Member

Since the Source impl does not support batch_size, nothing is changed and it could just return Self here.
Same for ArrowFileOpener

Contributor Author

That would be a change in behavior, though, since it's supposed to return an entirely new clone of the data? And this code is the same as what was here before and also what is given as an example in MockSource in the main crate.
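For what it's worth, a tiny standalone illustration (the struct is made up and unrelated to the PR code) of why the two spellings produce the same result when no fields are overridden:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Demo {
    batch_size: Option<usize>,
}

fn main() {
    let s = Demo { batch_size: None };
    // Functional-update syntax with no overridden fields is just a more
    // verbose spelling of a plain clone; both yield equal, independent values.
    let a = Demo { ..s.clone() };
    let b = s.clone();
    assert_eq!(a, b);
}
```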

}

fn with_projection(&self, _config: &FileScanConfig) -> Arc<dyn FileSource> {
Arc::new(Self { ..self.clone() })
Member

Suggested change
Arc::new(Self { ..self.clone() })
Arc::new(self.clone())


..Default::default()
};
let bytes = store
.get_opts(object_location, get_opts)
Member

Does it return an error here if the file is less than 6 bytes?
According to https://docs.rs/object_store/latest/object_store/struct.GetOptions.html#structfield.range it returns https://docs.rs/object_store/latest/object_store/enum.Error.html#variant.NotModified
If this error is indeed returned, then the check below, bytes.len() >= 6, is not really needed. It actually confuses the maintainer that it is possible for bytes to be less than 6 bytes.

Contributor Author

My reading of this is that it can be less than 6 bytes: https://github.com/apache/arrow-rs-object-store/blob/main/src/util.rs#L196-L199
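To spell out why I'm keeping the guard, here's a minimal standalone sketch (the helper name is made up, and it assumes the 6-byte prefix is being compared against the Arrow file-format magic ARROW1): a clamped range read can legitimately hand back fewer than 6 bytes for tiny objects, so the length check has to stay.

```rust
/// Hypothetical helper: decide whether a prefix fetched from object storage
/// looks like the Arrow IPC *file* format rather than the stream format.
fn looks_like_ipc_file(prefix: &[u8]) -> bool {
    // The range request is clamped to the object's length rather than
    // erroring, so `prefix` can be shorter than 6 bytes for tiny objects;
    // guard before comparing against the file-format magic.
    prefix.len() >= 6 && prefix[..6] == *b"ARROW1"
}

fn main() {
    assert!(looks_like_ipc_file(b"ARROW1\x00\x00"));
    assert!(!looks_like_ipc_file(b"ARRO")); // a clamped read of a 4-byte object
}
```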

@corasaurus-hex
Contributor Author

@2010YOUY01 I'm fine with any name, but technically all the arrow sources in here would be ArrowIPC*, both file and stream. Since these are internal to the crate right now, can we decide on a name in another PR or in another venue? Or is there a mechanism for making decisions around something like this? This would be the third name change, including my own change, in this PR, and I'd rather get consensus somehow so I can make one final change (if needed).

Member

@timsaucer timsaucer left a comment

Thank you for the contribution!

@timsaucer timsaucer added this pull request to the merge queue Nov 10, 2025
Merged via the queue into apache:main with commit 900ee65 Nov 10, 2025
28 checks passed
Contributor

@alamb alamb left a comment

Thanks @corasaurus-hex, this is a great contribution

//!
//! # Naming Note
//!
//! The naming in this module can be confusing:
Contributor

👍

@alamb
Contributor

alamb commented Nov 10, 2025

Thanks also to everyone for the review

@corasaurus-hex
Contributor Author

Yeah, thanks everyone for your reviews! It's much better than it was at the start!!

@2010YOUY01
Contributor

@2010YOUY01 I'm fine with any name, but technically all the arrow sources in here would be ArrowIPC*, both file and stream. Since these are internal to the crate right now, can we decide on a name in another PR or in another venue? Or is there a mechanism for making decisions around something like this? This would be the third name change, including my own change, in this PR, and I'd rather get consensus somehow so I can make one final change (if needed).

There are no formal naming rules, but my approach is to choose names that make their purpose as clear as possible. Using Stream here might be ambiguous due to other system concepts, so I’d prefer using a more explicit name like IPCStream.

This is just a minor point, though — we can proceed with either option.



Development

Successfully merging this pull request may close these issues.

Add support for registering files in the Arrow IPC stream format as tables using register_arrow or similar
