Commit 19a1da9
[SPARK-54639][PYTHON] Avoid unnecessary Table creation in Arrow serializers
### What changes were proposed in this pull request?
Optimize Arrow serializers (`ArrowStreamPandasSerializer` and `GroupPandasUDFSerializer`) by avoiding unnecessary `pa.Table` creation when processing single `RecordBatch` instances.
The optimization replaces `pa.Table.from_batches([batch]).itercolumns()` with direct column access using `batch.column(i)` for single batches. This eliminates unnecessary Table and iterator object creation, reducing function call overhead and GC pressure.
**Changes:**
- `ArrowStreamPandasSerializer.load_stream()`: Direct column access instead of creating Table wrapper
- `GroupPandasUDFSerializer.load_stream()`: Direct column access for each batch
**Code example:**
```python
# Before (ArrowStreamPandasSerializer.load_stream)
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(c, i)
        for i, c in enumerate(pa.Table.from_batches([batch]).itercolumns())
    ]

# After
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(batch.column(i), i)
        for i in range(batch.num_columns)
    ]
```
### Why are the changes needed?
Several serializers in `pyspark.sql.pandas.serializers` unnecessarily create `pa.Table` objects when processing single `RecordBatch` instances. When converting Arrow RecordBatches to pandas Series, the code creates a `pa.Table` wrapper for each batch just to iterate over columns, which introduces:
- Unnecessary object creation (Table objects and iterators)
- Extra function call overhead
- Increased GC pressure
For a workload processing 1000 batches with 10 columns each, this avoids creating 2000 temporary objects (1000 Table objects + 1000 column iterators). `RecordBatch.column(i)` directly returns a reference to the underlying column array (zero-copy), reducing per-batch function call overhead.
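For illustration only (not from the PR), a minimal micro-benchmark sketch comparing the two access paths; the batch shape is an assumption mirroring the description, and absolute timings will vary by machine and PyArrow version:

```python
import timeit

import pyarrow as pa

# Hypothetical workload: batches with 10 int64 columns each.
batch = pa.RecordBatch.from_pydict(
    {f"c{i}": list(range(1_000)) for i in range(10)}
)
batches = [batch] * 1_000

def via_table():
    # Old path: wrap each batch in a Table just to iterate its columns.
    for b in batches:
        _ = [c for c in pa.Table.from_batches([b]).itercolumns()]

def via_batch():
    # New path: index columns on the RecordBatch directly (zero-copy).
    for b in batches:
        _ = [b.column(i) for i in range(b.num_columns)]

print("Table wrapper :", timeit.timeit(via_table, number=10))
print("Direct access :", timeit.timeit(via_batch, number=10))
```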
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization that maintains backward compatibility. The serialization behavior remains the same, only the internal implementation is optimized.
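As a sanity check (illustrative, not part of the PR): a Table built from a single batch stores each column as a one-chunk `ChunkedArray` wrapping the same Arrow array that `RecordBatch.column(i)` returns, so both paths feed identical data to `arrow_to_pandas`:

```python
import pyarrow as pa

# Illustrative batch; column names and values are arbitrary.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

table = pa.Table.from_batches([batch])
for i, chunked in enumerate(table.itercolumns()):
    # A Table built from one batch has exactly one chunk per column,
    # and that chunk holds the same data as batch.column(i).
    assert chunked.num_chunks == 1
    assert chunked.chunk(0).equals(batch.column(i))
```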
### How was this patch tested?
Existing tests pass without modification.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #53387 from Yicong-Huang/SPARK-54639/feat/optimize-arrow-serializers.
Authored-by: Yicong-Huang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
1 file changed, +2 -5 lines changed