@Yicong-Huang Yicong-Huang commented Dec 8, 2025

What changes were proposed in this pull request?

Optimize Arrow serializers (ArrowStreamPandasSerializer and GroupPandasUDFSerializer) by avoiding unnecessary pa.Table creation when processing single RecordBatch instances.

The optimization replaces pa.Table.from_batches([batch]).itercolumns() with direct column access using batch.column(i) for single batches. This eliminates unnecessary Table and iterator object creation, reducing function call overhead and GC pressure.

Changes:

  • ArrowStreamPandasSerializer.load_stream(): Direct column access instead of creating Table wrapper
  • GroupPandasUDFSerializer.load_stream(): Direct column access for each batch

Code example:

# Before (ArrowStreamPandasSerializer.load_stream)
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(c, i)
        for i, c in enumerate(pa.Table.from_batches([batch]).itercolumns())
    ]

# After
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(batch.column(i), i)
        for i in range(batch.num_columns)
    ]

Why are the changes needed?

Several serializers in pyspark.sql.pandas.serializers unnecessarily create pa.Table objects when processing single RecordBatch instances. When converting Arrow RecordBatches to pandas Series, the code creates a pa.Table wrapper for each batch just to iterate over columns, which introduces:

  • Unnecessary object creation (Table objects and iterators)
  • Extra function call overhead
  • Increased GC pressure

For a workload processing 1000 batches with 10 columns each, this avoids creating 2000 temporary objects (1000 Table objects + 1000 iterators). RecordBatch.column(i) returns a zero-copy reference to the underlying column array, so no intermediate wrapper objects are allocated at all.

Does this PR introduce any user-facing change?

No. This is a performance optimization that maintains backward compatibility. The serialization behavior remains the same, only the internal implementation is optimized.

How was this patch tested?

Existing tests pass without modification.

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang marked this pull request as draft December 8, 2025 21:21
@Yicong-Huang Yicong-Huang changed the title [SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers [WIP][SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers Dec 8, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers [SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers Dec 10, 2025
@Yicong-Huang Yicong-Huang marked this pull request as ready for review December 10, 2025 00:01
@HyukjinKwon
Member

Merged to master.
