[SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers #53387
What changes were proposed in this pull request?
Optimize the Arrow serializers (`ArrowStreamPandasSerializer` and `GroupPandasUDFSerializer`) by avoiding unnecessary `pa.Table` creation when processing single `RecordBatch` instances.

The optimization replaces `pa.Table.from_batches([batch]).itercolumns()` with direct column access via `batch.column(i)` for single batches. This eliminates the temporary Table and iterator created for each batch, reducing function call overhead and GC pressure.

Changes:

- `ArrowStreamPandasSerializer.load_stream()`: direct column access instead of creating a Table wrapper
- `GroupPandasUDFSerializer.load_stream()`: direct column access for each batch

Code example:
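A minimal sketch of the before/after pattern (the helper name `_columns_to_pandas` is hypothetical; the real serializers also apply additional Arrow-to-pandas conversion options):

```python
import pyarrow as pa

def _columns_to_pandas(batch: pa.RecordBatch):
    # Before: each RecordBatch was wrapped in a pa.Table just to iterate columns:
    #   for column in pa.Table.from_batches([batch]).itercolumns():
    #       yield column.to_pandas()
    # After: access each column directly on the RecordBatch (zero-copy),
    # with no temporary Table or column iterator objects.
    for i in range(batch.num_columns):
        yield batch.column(i).to_pandas()
```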
Why are the changes needed?
Several serializers in `pyspark.sql.pandas.serializers` unnecessarily create `pa.Table` objects when processing single `RecordBatch` instances. When converting Arrow RecordBatches to pandas Series, the code creates a `pa.Table` wrapper for each batch just to iterate over its columns, which introduces extra object allocations, function call overhead, and GC pressure. For a workload processing 1000 batches with 10 columns each, this change avoids creating 2000 temporary objects (1000 Table objects plus 1000 iterators). `RecordBatch.column(i)` directly returns a reference to the column array (zero-copy), reducing function call overhead.
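To illustrate the difference in plain pyarrow (a standalone sketch, not the serializer code itself): the old path yields `ChunkedArray` columns from the wrapping Table, while the new path returns `Array` references straight off the batch; both convert to pandas the same way.

```python
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["x", "y"],
)

# Old path: wrap the single batch in a Table just to iterate its columns.
# This allocates a Table plus an iterator and yields ChunkedArray columns.
old_cols = list(pa.Table.from_batches([batch]).itercolumns())

# New path: index the batch directly. batch.column(i) is a zero-copy
# reference to the underlying Arrow Array; no wrapper objects are created.
new_cols = [batch.column(i) for i in range(batch.num_columns)]

# Either column type converts to the same pandas Series.
assert old_cols[0].to_pandas().equals(new_cols[0].to_pandas())
```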
Does this PR introduce any user-facing change?
No. This is a performance optimization that maintains backward compatibility: the serialization behavior remains the same; only the internal implementation is optimized.
How was this patch tested?
Existing tests pass without modification.
Was this patch authored or co-authored using generative AI tooling?
No.