[SPARK-53638][SS][PYTHON] Limit the byte size of arrow batch for TWS to avoid OOM #52391
base: master
Conversation
@zeruibao
The code change looks good to me. Thanks! Let's hear what PySpark folks think about additionally limiting the size of Pandas DataFrame creation on top of limiting the size of the Arrow RecordBatch.
cc. @HyukjinKwon @zhengruifeng PTAL. Thanks in advance!
...ala/org/apache/spark/sql/execution/python/streaming/ApplyInPandasWithStatePythonRunner.scala
for batch_key, group_rows in groupby(row_stream(), key=lambda x: x[0]):
    df = pd.DataFrame([row for _, row in group_rows])
    yield (batch_key, df)
rows = []
@HyukjinKwon @zhengruifeng
What do you think about the code? We limit the size of the Arrow RecordBatch in the task thread when sending to the Python worker, and @zeruibao added this to re-align the size for the Pandas DataFrame. Did we do this in other UDFs? Is it beneficial, or probably over-thinking?
Did we do this in other UDFs?
I don't think so.
Is it beneficial or probably over-thinking?
I remember @HyukjinKwon discussed it before; it should be beneficial if the size is properly estimated.
# Short circuit batch size calculation if the batch size is
# unlimited as computing batch size is computationally expensive.
if self.arrow_max_bytes_per_batch != 2**31 - 1:
    accumulate_size += sum(sys.getsizeof(x) for x in row)
TBH, I am not sure about the way we compute the batch size on the Python side.
Hmm, is sys.getsizeof accurate for nested type columns?
Btw, the rows are not arrow batches anymore here, right? Using the arrow_max_bytes_per_batch config might be misleading.
Yeah, I agree sys.getsizeof is not accurate for nested type columns. Let me try another way.
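For what it's worth, sys.getsizeof only counts the outer container object, so nested values are largely invisible to it; a quick illustration:

import sys

row = ("key", [float(i) for i in range(10_000)])  # a row with a nested list column
print(sys.getsizeof(row))                    # size of the tuple object only
print(sys.getsizeof(row[1]))                 # list header + pointer array, not the floats
print(sum(sys.getsizeof(x) for x in row))    # still omits the 10,000 float objects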
I got a better idea: compute the average row size for each Arrow RecordBatch. When accumulating rows into a Pandas DataFrame, we can then use this average row size as an estimate for each DataFrame row, avoiding the need to measure each individual row. Then we can still use the arrow_max_bytes_per_batch config, and it's more accurate than sys.getsizeof.
cc: @HeartSaVioR
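A minimal sketch of that idea, assuming pyarrow is available and using hypothetical helper names rather than the PR's actual code:

import pyarrow as pa

def rows_with_estimated_size(batch: pa.RecordBatch):
    # Yield (row, estimated_bytes), charging each row the batch's average row size
    # instead of calling sys.getsizeof on individual Python objects.
    if batch.num_rows == 0:
        return
    # nbytes / num_rows is cheap and works the same way for nested types.
    avg_row_size = batch.nbytes / batch.num_rows
    for row in batch.to_pylist():
        yield row, avg_row_size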
Hi @HyukjinKwon, I have done a benchmark showing that there is no regression introduced by this PR. Is this PR good to merge now?
Looks fine, but I would like @HeartSaVioR and @zhengruifeng to sign off as well.
    arrow_max_bytes_per_batch,
    int_to_decimal_coercion_enabled=int_to_decimal_coercion_enabled,
)
elif eval_type == PythonEvalType.SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF:
Doesn't SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF need it too?
SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF follows a separate execution path. Each Arrow batch contains rows with the same key, which allows us to directly convert the batch into a Pandas DataFrame and yield it. Since the Arrow batch size is already subject to a byte-size limit, the resulting Pandas DataFrame also inherently respects this constraint.
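For illustration, a sketch of that separate path under those assumptions (hypothetical function name, not the PR's exact code):

def init_state_dataframes(batches):
    # 'batches' yields pyarrow.RecordBatch objects, each holding rows for a
    # single grouping key and already capped in byte size on the JVM side.
    for batch in batches:
        # Direct conversion is safe because the batch size is bounded upstream.
        yield batch.to_pandas()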
rows.append(row)
# Short circuit batch size calculation if the batch size is
# unlimited as computing batch size is computationally expensive.
if self.arrow_max_bytes_per_batch != 2**31 - 1:
I saw that you applied the batch size limit on the Scala side; why do we need to duplicate the limit here?
Oh, the serializer groups the rows by key here.
yes!
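Roughly, the Python-side grouping with the extra limit could look like this sketch (hypothetical names; assumes each row comes with a size estimate):

import pandas as pd

def grouped_dataframes(sized_rows, max_bytes):
    # sized_rows yields (key, row, estimated_bytes); emit one DataFrame per key,
    # flushing early when the accumulated estimate reaches max_bytes.
    current_key, rows, accumulated = None, [], 0
    for key, row, estimated in sized_rows:
        if rows and (key != current_key or accumulated >= max_bytes):
            yield current_key, pd.DataFrame(rows)
            rows, accumulated = [], 0
        current_key = key
        rows.append(row)
        accumulated += estimated
    if rows:
        yield current_key, pd.DataFrame(rows)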
Actually, if we were able to break large data for a grouping key into multiple Arrow RecordBatches, each Arrow RecordBatch could be converted into a Pandas DataFrame and they could be connected with an iterator/generator, which would not require materializing multiple Arrow RecordBatches at once. That's how we do it in applyInPandasWithState.
I can think of a couple of cases where this logic would help:
- Very little data is bound to a single grouping key, but it is split into two Arrow RecordBatches. In this case we will be able to provide one Pandas DataFrame instead of two.
- Converting an Arrow RecordBatch to a Pandas DataFrame could result in quite different memory usage. I don't have evidence and it's just a speculative justification for the change: the size of the data in the Arrow RecordBatch for a grouping key could differ from the size after conversion to a Pandas DataFrame.
If the size of an Arrow RecordBatch were very similar to its Pandas DataFrame conversion, maybe taking the applyInPandasWithState route would still be OK. That said, I'm fine with the argument if this logic is simpler than applyInPandasWithState, assuming sys.getsizeof is accurate.
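For comparison, the applyInPandasWithState-style chaining described above might look like this sketch (hypothetical names, assuming each incoming Arrow batch is tagged with its grouping key):

from itertools import groupby

def dataframes_per_key(keyed_batches):
    # keyed_batches yields (key, pyarrow.RecordBatch); a key's data may span
    # several batches. Each key is paired with a lazy generator of DataFrames,
    # so only one batch's DataFrame needs to be materialized at a time.
    for key, group in groupby(keyed_batches, key=lambda kv: kv[0]):
        # The inner generator must be consumed before advancing the outer loop,
        # which is the usual contract for groupby-based serializers.
        yield key, (batch.to_pandas() for _, batch in group)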
Since we already leverage the generator-based approach to optimize for high-cardinality scenarios, switching to applyInPandasWithState would not be a trivial change. I would prefer to retain the current implementation.
Hey @viirya, could you sign off on this PR if it looks good to you?
What changes were proposed in this pull request?
Limit the byte size of Arrow batch for TWS to avoid OOM.
Why are the changes needed?
On the Python worker side, when using the Pandas execution path, Arrow batches must be converted into Pandas DataFrames in memory. If an Arrow batch is too large, this conversion can lead to OOM errors in the Python worker. To mitigate this risk, we need to enforce a limit on the byte size of each Arrow batch. Similarly, processing the Pandas DataFrame inside handleInputRows also occurs entirely in memory, so applying a size limit to the DataFrame itself further helps prevent OOM issues.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
Was this patch authored or co-authored using generative AI tooling?
No