Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Make chunk combination threshold configurable for improved per… #51200

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 14 additions & 1 deletion python/ray/data/_internal/arrow_ops/transform_pyarrow.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,9 +23,22 @@
# pyarrow.Table.slice is slow when the table has many chunks
# so we combine chunks into a single one to make slice faster
# with the cost of an extra copy.
#
# The decision to combine chunks is based on a threshold for the number
# of chunks, set by `MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS`. To make
# this more flexible, we have made this threshold configurable via the
# `RAY_DATA_MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS` environment variable.
#
# This configurability is important because the size of each chunk can vary
# greatly depending on the dataset and the operations performed previously.
# A fixed threshold might not be optimal for all scenarios, as in some cases,
# a smaller number of large chunks could behave differently from a larger
# number of smaller chunks. By making this threshold tunable, users have
# the ability to optimize for their specific case, adjusting based on their
# chunk sizes and available memory.
# See https://github.com/ray-project/ray/issues/31108 for more details.
# TODO(jjyao): remove this once https://github.com/apache/arrow/issues/35126 is resolved
MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS = 10
MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS = int(os.getenv('RAY_DATA_MIN_NUM_CHUNKS_TO_TRIGGER_COMBINE_CHUNKS', 10))


logger = logging.getLogger(__name__)
Expand Down