[SPARK-53638][SS][PYTHON] Limit the byte size of arrow batch for TWS to avoid OOM #52391
@@ -2577,11 +2577,17 @@ def read_udfs(pickleSer, infile, eval_type):
         )
         arrow_max_records_per_batch = int(arrow_max_records_per_batch)

+        arrow_max_bytes_per_batch = runner_conf.get(
+            "spark.sql.execution.arrow.maxBytesPerBatch", 2**31 - 1
+        )
+        arrow_max_bytes_per_batch = int(arrow_max_bytes_per_batch)
+
         ser = TransformWithStateInPandasSerializer(
             timezone,
             safecheck,
             _assign_cols_by_name,
             arrow_max_records_per_batch,
+            arrow_max_bytes_per_batch,
             int_to_decimal_coercion_enabled=int_to_decimal_coercion_enabled,
         )
     elif eval_type == PythonEvalType.SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF:
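The hunk above reads `spark.sql.execution.arrow.maxBytesPerBatch` (defaulting to 2**31 - 1 bytes, i.e., effectively unlimited) and threads it into the serializer alongside the existing record-count cap. As a rough illustration of what a dual record/byte cap means, here is a hedged, self-contained sketch; `slice_by_limits` and its average-bytes-per-row estimate are hypothetical and are not the PR's actual serializer logic:

```python
# Hypothetical sketch (not this PR's serializer code): slice an Arrow table
# so each emitted batch respects both a record-count cap and a byte-size cap.
import pyarrow as pa

def slice_by_limits(table: pa.Table, max_records: int, max_bytes: int):
    """Yield slices of `table` that stay under both limits (estimated)."""
    # Estimate average bytes per row from the table's total buffer size.
    avg_row_bytes = max(1, table.nbytes // max(1, table.num_rows))
    rows_by_bytes = max(1, max_bytes // avg_row_bytes)
    step = max(1, min(max_records, rows_by_bytes))
    for offset in range(0, table.num_rows, step):
        # slice() clamps at the end of the table, so no bounds check needed.
        yield table.slice(offset, step)

table = pa.table({"key": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})
for batch in slice_by_limits(table, max_records=2, max_bytes=1 << 20):
    print(batch.num_rows)
```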
@@ -2590,11 +2596,17 @@ def read_udfs(pickleSer, infile, eval_type):
         )
         arrow_max_records_per_batch = int(arrow_max_records_per_batch)

+        arrow_max_bytes_per_batch = runner_conf.get(
+            "spark.sql.execution.arrow.maxBytesPerBatch", 2**31 - 1
+        )
+        arrow_max_bytes_per_batch = int(arrow_max_bytes_per_batch)
+
         ser = TransformWithStateInPandasInitStateSerializer(
             timezone,
             safecheck,
             _assign_cols_by_name,
             arrow_max_records_per_batch,
+            arrow_max_bytes_per_batch,
             int_to_decimal_coercion_enabled=int_to_decimal_coercion_enabled,
         )
     elif eval_type == PythonEvalType.SQL_TRANSFORM_WITH_STATE_PYTHON_ROW_UDF:
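For context, lowering this cap from the client side might look like the following sketch; whether the conf is intended to be tuned by users (rather than set internally by the runner) is an assumption here, and the 64 MiB value is only an example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap each Arrow batch sent to the Python worker at roughly 64 MiB instead of
# the 2**31 - 1 byte default shown in the diffs above.
spark.conf.set("spark.sql.execution.arrow.maxBytesPerBatch", str(64 * 1024 * 1024))
```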
@HyukjinKwon @zhengruifeng
What do you think about the code? We limit the size of the Arrow RecordBatch in the task thread when sending it to the Python worker, and @zeruibao added this to re-align the size for the Pandas DataFrame. Did we do this in other UDFs? Is it beneficial, or is it probably over-thinking?
I don't think so. I remember @HyukjinKwon discussed it before; it should be beneficial if the size is properly estimated.
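Since the thread hinges on whether the batch size can be "properly estimated", here is one hedged way to measure the Arrow footprint of a pandas DataFrame before deciding whether to split it; `estimate_arrow_bytes` is a hypothetical helper for illustration, not part of this PR:

```python
import pandas as pd
import pyarrow as pa

def estimate_arrow_bytes(df: pd.DataFrame) -> int:
    # Converting to an Arrow table exposes concrete buffer sizes, which
    # approximates what would cross the JVM <-> Python worker boundary.
    return pa.Table.from_pandas(df).nbytes

df = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
print(estimate_arrow_bytes(df))  # prints the estimated size in bytes
```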