[SPARK-54337][PS] Add support for PyCapsule to Pyspark #53391
base: master
Conversation
Signed-off-by: Devin Petersohn <[email protected]> Co-authored-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
allisonwang-db
left a comment
Signed-off-by: Devin Petersohn <[email protected]>
Okay, I think the title is a bit misleading - the PR is not about Also I believe

Before digging into the code details too much, I think we need to discuss the general direction. If we are going to implement either protocol, we probably want to do it right. The current implementation basically ignores a lot of the required methods and implements only the minimum needed to make the

Then about arrow. Do we want to rely on undocumented arrow classes for spark to work? Both If I understand the protocol correctly, you don't require arrow at all to implement

Anyway, this is a very interesting feature, but also a big commitment. Users will expect this to work properly if we do this, so I think we need to give it more thought before going forward.
Thanks for the comment @gaogaotiantian. From my understanding, I am okay removing

What do you think about removing
So I'm not that familiar with the standard itself - I don't see

From the documentation on Arrow, the Arrow PyCapsule Interface is still experimental. I'm not sure what the policy is for features at such a stage.

For the implementation itself,
This is a requirement of Pyspark |
Agreed, I will remove |
Yeah, I don't quite understand why that's the case. Like I said, that might be my lack of understanding of arrow or pyspark. Someone else who is more familiar with the code might be able to see whether this is necessary. It seems like we copied/transformed the data a few times.
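For context on the protocols discussed in this thread: an object participates in the Arrow PyCapsule stream protocol by exposing an __arrow_c_stream__ method that returns a PyCapsule named "arrow_array_stream". The sketch below is illustrative only; the class name and the delegation to pyarrow are assumptions, not the PR's implementation, and the protocol itself does not mandate a pyarrow dependency (delegating to pyarrow is simply the easiest way to produce the capsule).

import pyarrow as pa

# Illustrative sketch only (not the PR's code): exposing the Arrow PyCapsule
# stream protocol by delegating to a pyarrow RecordBatchReader.
class StreamingFrame:
    def __init__(self, schema: pa.Schema, batches):
        self._schema = schema
        self._batches = batches

    def __arrow_c_stream__(self, requested_schema=None):
        # Returns a PyCapsule named "arrow_array_stream", as the protocol requires.
        reader = pa.RecordBatchReader.from_batches(self._schema, self._batches)
        return reader.__arrow_c_stream__(requested_schema)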
Signed-off-by: Devin Petersohn <[email protected]>
…fer. Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
python/pyspark/interchange.py
Outdated
byte_df = df.mapInArrow(batch_to_bytes_iter, binary_schema)
# A row is actually a batch of data in Arrow IPC format. Fetch the batches one by one.
for row in byte_df.toLocalIterator():
    with pyarrow.ipc.open_stream(row.arrow_ipc_bytes) as reader:
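batch_to_bytes_iter and binary_schema are not shown in this excerpt; a hypothetical sketch of what such a helper could look like (the names and the one-row-per-batch layout are assumptions, not the PR's code) is:

import pyarrow
from pyspark.sql.types import BinaryType, StructField, StructType

# Hypothetical helper: serialize each incoming Arrow batch to IPC stream bytes
# and emit it as a single binary cell so the driver can reassemble the stream.
binary_schema = StructType([StructField("arrow_ipc_bytes", BinaryType())])

def batch_to_bytes_iter(batches):
    for batch in batches:
        sink = pyarrow.BufferOutputStream()
        with pyarrow.ipc.new_stream(sink, batch.schema) as writer:
            writer.write_batch(batch)
        yield pyarrow.RecordBatch.from_arrays(
            [pyarrow.array([sink.getvalue().to_pybytes()], type=pyarrow.binary())],
            names=["arrow_ipc_bytes"],
        )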
In the future, I feel we can optimize away this mapInArrow UDF execution:
1. use some helper methods like Dataset.toArrowBatchRdd for the classic dataframe;
2. use some internal methods inside spark.client.to_table for the connect dataframe.
This can be done in follow-up PRs.
I agree, and we should make sure that we leverage the streaming nature of the PyCapsule protocol in the design.
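One way to preserve that streaming behavior on the driver side (a sketch under assumptions, not the PR's code; byte_df and arrow_schema are placeholders) is to wrap the per-row IPC payloads in a lazy generator backing a RecordBatchReader:

import pyarrow

def ipc_rows_to_reader(byte_df, arrow_schema):
    # Hedged sketch: batches are pulled from Spark only as the consumer reads them,
    # so nothing is materialized on the driver up front.
    def gen():
        for row in byte_df.toLocalIterator():
            with pyarrow.ipc.open_stream(row.arrow_ipc_bytes) as reader:
                for batch in reader:
                    yield batch
    return pyarrow.RecordBatchReader.from_batches(arrow_schema, gen())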
Co-authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
What changes were proposed in this pull request?
Add support for the PyCapsule and __dataframe__ protocols for interchange between Spark and other Python libraries. Here is a demo of what this enables with Polars and DuckDB:

Polars will now be able to consume a full PySpark dataframe (or pyspark.pandas), and DuckDB can consume a stream built from the PySpark dataframe. Importantly, the line stream = pa.RecordBatchReader.from_stream(psdf) does not trigger any computation; it simply creates a stream object that is incrementally consumed by DuckDB when the fetchall call is executed.
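The demo itself is not reproduced here; a hedged sketch of the usage described above, assuming psdf is a PySpark (or pyspark.pandas) DataFrame that exposes both protocols after this PR, might look like:

import duckdb
import polars as pl
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
psdf = spark.range(10)  # assumed to expose the new protocols after this PR

# Polars consumes the full frame via the __dataframe__ interchange protocol.
pl_df = pl.from_dataframe(psdf)

# DuckDB consumes a lazily produced Arrow stream: building the reader triggers
# no computation; batches are pulled only when fetchall() runs.
stream = pa.RecordBatchReader.from_stream(psdf)
rows = duckdb.sql("SELECT * FROM stream").fetchall()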
Currently, PySpark (and, to a lesser degree, PySpark pandas) does not integrate well with the broader Python ecosystem. The best practice today is to go through pandas with toPandas, but that materializes all of the data on the driver at once. This new API and protocol allow data to stream, one Arrow batch at a time, enabling libraries like DuckDB and Polars to consume the data as a stream.
Does this PR introduce any user-facing change?
Yes, new user-level API.
How was this patch tested?
Locally
Was this patch authored or co-authored using generative AI tooling?
No