Support Arrow IPC Stream Files #18457
// correct offset which is a lot of duplicate I/O. We're opting to avoid
// that entirely by only acting on a single partition and reading sequentially.
Ok(None)
}
this is perhaps the weightiest decision in this PR. if we want to repartition a file in the ipc stream format then we need to read from the beginning of the file for each partition, or figure out another way to create the ad-hoc equivalent of the ipc file format footer so we can minimize duplicate reads (likely by reading the entire file all the way through once and then caching the result in memory for the execution plan to use for each partition)
I'd argue that while this problem is worth solving, doing so is tangential to this change.
I'd like to see this solved, but I see no reason why we couldn't solve this in a follow-on.
Probably worth documenting the practical consequences of leaving it in this state though -- correct me if I'm wrong here, but I think this means that we end up hydrating the entire file into memory for certain operations, right? That's probably not a good long-term state.
nvm, after rereading I misunderstood this, it only affects IO
I can't imagine this would mean I need to read the entire file into memory and keep it there? In my previous message I meant we would need to read all the record batch and dictionary locations and keep them in memory in much the same way that the arrow file format footer does. So it would mean a single pass through to record all of that and then multiple threads can seek to different parts of the file and process it.
That's my understanding of the effect of this, that it means we can't parallelize queries against this file format.
If you believe that the resulting behavior would be pathological to the extreme then we should absolutely document that. Thoughts on how we can reliably test that it is? Or who might be aware of the implications of this? And where to document it?
I think partitioning is doable, but it would be better done afterwards, if anyone has a real use case.
In order to repartition, this function has to scan the file once, record the dictionary and batch positions, then split the work evenly among parallel partitioned workers. This scan can be done at around full disk bandwidth (5 GB/s on recent MacBooks).
Regarding decoding the batches from an Arrow IPC Stream file into in-memory Arrow RecordBatches: if dictionary encoding and some heavyweight compression like zstd is applied, the bandwidth can be much lower (several hundred MB/s).
So it's still worth a whole scan up front to make the overall processing faster with partitioning, though I don't know if it's a common requirement to query large IPC Stream files.
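The "split the work evenly" step described above can be sketched as follows. This is only an illustration under stated assumptions: `BatchSpan` and `split_evenly` are hypothetical names, not part of DataFusion or arrow-rs, and the spans are assumed to have already been collected by the up-front index scan. The sketch also ignores dictionary batches, which each worker would still need to load before decoding its range.

```rust
use std::ops::Range;

/// Byte span of one record batch message, as recorded by the index scan.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BatchSpan {
    pub offset: u64,
    pub len: u64,
}

/// Split `spans` (in file order) into at most `n` contiguous groups of
/// roughly equal total byte size. Returns index ranges into `spans`.
pub fn split_evenly(spans: &[BatchSpan], n: usize) -> Vec<Range<usize>> {
    let mut groups: Vec<Range<usize>> = Vec::new();
    if spans.is_empty() || n == 0 {
        return groups;
    }
    let total: u64 = spans.iter().map(|s| s.len).sum();
    let parts = n as u64;
    // Ceiling division; at least 1 so zero-length spans cannot stall the loop.
    let target = ((total + parts - 1) / parts).max(1);

    let mut start = 0usize;
    let mut acc = 0u64;
    for (i, span) in spans.iter().enumerate() {
        acc += span.len;
        // The final group absorbs whatever is left, so never close it early.
        let is_last_group = groups.len() + 1 == n;
        if acc >= target && !is_last_group {
            groups.push(start..i + 1);
            start = i + 1;
            acc = 0;
        }
    }
    if start < spans.len() {
        groups.push(start..spans.len());
    }
    groups
}
```

Each returned range is a run of consecutive batches that one worker could seek to and decode. Groups are kept contiguous so that every worker still reads sequentially within its own part of the file.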
jdcasale
left a comment
I think this is basically right. Couple of nits, one question.
…o cs--register-arrow-ipc-stream-format-files
And another update: I think the other PR had the right of it. We maintain backwards compatibility and keep users from having to know about the naming messiness. I've merged that PR back into this one and pulled in Adrian's changes from main.
jdcasale
left a comment
I don’t have merge permissions, but have reviewed, all my questions have been addressed — lgtm
2010YOUY01
left a comment
Thank you for the amazing work -- this looks good to me, left several suggestions.
However, I'm not deeply familiar with the related code, so I'd prefer that others approve and merge it.
/// Does not hold anything special, since [`FileScanConfig`] is sufficient for arrow
/// `FileSource` for Arrow IPC file format. Supports range-based parallel reading.
#[derive(Clone)]
pub struct ArrowSource {
There seem to be several public API changes here, such as ArrowSource and ArrowOpener. It would be great to include them in the upgrade guide: https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md
Though I'm not sure if we can make it in 51.0.0, if not this should be under 52.0.0 section
From the sounds of it it's not going to make it into 51. Do I just make a section for 52?
Yes, that would be great.
Perhaps |
}

fn with_batch_size(&self, _batch_size: usize) -> Arc<dyn FileSource> {
    Arc::new(Self { ..self.clone() })
Suggested change:
- Arc::new(Self { ..self.clone() })
+ Arc::new(self.clone())
Since the Source impl does not support batch_size, nothing is changed and it could just return Self here.
The same applies to ArrowFileOpener.
That would be a change in behavior, though, since it's supposed to return an entirely new clone of the data? And this code is the same as what was here before and also what is given as an example in MockSource in the main crate.
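For reference, the two spellings in the suggested change can be compared in isolation. This is a minimal sketch using a hypothetical `DemoSource` struct rather than the real `ArrowSource`. In plain Rust, struct-update syntax with no explicit fields takes every field from the base expression, so both forms produce an equal value in a fresh `Arc`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a file source; not the DataFusion type.
#[derive(Clone, Debug, PartialEq)]
pub struct DemoSource {
    pub name: String,
}

impl DemoSource {
    /// Form used in the PR: struct-update syntax over a clone.
    pub fn with_struct_update(&self) -> Arc<DemoSource> {
        Arc::new(Self { ..self.clone() })
    }

    /// Form from the suggestion: wrap the clone directly.
    pub fn with_plain_clone(&self) -> Arc<DemoSource> {
        Arc::new(self.clone())
    }
}
```

Both methods allocate a new `Arc` around a cloned value, so neither hands back the original instance.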
}

fn with_projection(&self, _config: &FileScanConfig) -> Arc<dyn FileSource> {
    Arc::new(Self { ..self.clone() })
Suggested change:
- Arc::new(Self { ..self.clone() })
+ Arc::new(self.clone())
..Default::default()
};
let bytes = store
    .get_opts(object_location, get_opts)
Does it return an error here if the file is less than 6 bytes?
According to https://docs.rs/object_store/latest/object_store/struct.GetOptions.html#structfield.range it returns https://docs.rs/object_store/latest/object_store/enum.Error.html#variant.NotModified
If this error is indeed returned then the bytes.len() >= 6 check below is not really needed. It actually confuses the maintainer into thinking that bytes could be shorter than 6 bytes.
My reading of this is that it can be less than 6 bytes: https://github.com/apache/arrow-rs-object-store/blob/main/src/util.rs#L196-L199
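A short-read-safe version of this kind of prefix check might look like the sketch below. It assumes the 6 bytes are compared against the `ARROW1` magic that opens a file in the Arrow IPC file format; the thread does not show that part of the code. `IpcKind` and `sniff_ipc_kind` are hypothetical names, not the PR's API.

```rust
/// Magic bytes at the start of a file in the Arrow IPC file format.
const ARROW_FILE_MAGIC: &[u8; 6] = b"ARROW1";

/// Hypothetical classification result, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum IpcKind {
    /// The prefix carries the file-format magic.
    File,
    /// Anything else, including inputs shorter than the magic.
    StreamOrUnknown,
}

/// Classify a file from the first bytes returned by a ranged read.
pub fn sniff_ipc_kind(prefix: &[u8]) -> IpcKind {
    // `starts_with` returns false when `prefix` is shorter than the magic,
    // so a short read is handled without a separate length check.
    if prefix.starts_with(ARROW_FILE_MAGIC) {
        IpcKind::File
    } else {
        IpcKind::StreamOrUnknown
    }
}
```

Written this way, the result is the same whether the store truncates the range for a short object or returns the full 6 bytes.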
@2010YOUY01 I'm fine with any name, but technically all the arrow sources in here would be
timsaucer
left a comment
Thank you for the contribution!
alamb
left a comment
Thanks @corasaurus-hex this is a great contribution
//!
//! # Naming Note
//!
//! The naming in this module can be confusing:
👍
Thanks also to everyone for the review
Yeah, thanks everyone for your reviews! It's much better than it was at the start!!
There are no formal naming rules, but my approach is to choose names that make their purpose as clear as possible. Using This is just a minor point, though — we can proceed with either option. |
Which issue does this PR close?
register_arrow or similar #16688.
Rationale for this change
Currently DataFusion can only read Arrow files if they're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.
What changes are included in this PR?
To accomplish the above, this PR splits the Arrow datasource into two separate implementations (ArrowStream* and ArrowFile*) with a facade on top to differentiate between the formats at query planning time.
Are these changes tested?
Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.
Are there any user-facing changes?
Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?