@Yicong-Huang Yicong-Huang commented Dec 8, 2025

What changes were proposed in this pull request?

Optimize Arrow serializers (ArrowStreamPandasSerializer and GroupPandasUDFSerializer) by avoiding unnecessary pa.Table creation when processing single RecordBatch instances.

The optimization replaces pa.Table.from_batches([batch]).itercolumns() with direct column access using batch.column(i) for single batches. This eliminates unnecessary Table and iterator object creation, reducing function call overhead and GC pressure.

Changes:

  • ArrowStreamPandasSerializer.load_stream(): Direct column access instead of creating Table wrapper
  • GroupPandasUDFSerializer.load_stream(): Direct column access for each batch

Code example:

# Before (ArrowStreamPandasSerializer.load_stream)
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(c, i)
        for i, c in enumerate(pa.Table.from_batches([batch]).itercolumns())
    ]

# After
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(batch.column(i), i)
        for i in range(batch.num_columns)
    ]

Why are the changes needed?

Several serializers in pyspark.sql.pandas.serializers unnecessarily create pa.Table objects when processing single RecordBatch instances. When converting Arrow RecordBatches to pandas Series, the code creates a pa.Table wrapper for each batch just to iterate over columns, which introduces:

  • Unnecessary object creation (Table objects and iterators)
  • Extra function call overhead
  • Increased GC pressure

For a workload processing 1000 batches with 10 columns each, this avoids creating 2000 temporary objects (1000 Table objects + 1000 iterators). RecordBatch.column(i) returns a zero-copy reference to the underlying column array, so no intermediate wrapper objects are allocated at all.

Does this PR introduce any user-facing change?

No. This is a performance optimization that maintains backward compatibility. The serialization behavior remains the same, only the internal implementation is optimized.

How was this patch tested?

Existing tests pass without modification.

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang marked this pull request as draft December 8, 2025 21:21
@Yicong-Huang Yicong-Huang changed the title [SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers [WIP][SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers Dec 8, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers [SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers Dec 10, 2025
@Yicong-Huang Yicong-Huang marked this pull request as ready for review December 10, 2025 00:01
@HyukjinKwon
Member

Merged to master.
