Conversation

@alexeykudinkin
Contributor

Description

Changes

  • Added AsList aggregation, which collects the values of a given column into a single list element
  • Added AsListVectorized
  • Added modulo op to Expr

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner January 7, 2026 07:05
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Jan 7, 2026
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) January 7, 2026 07:06
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new AsList aggregation function for Ray Data, allowing users to collect all values within a group into a single list, along with an end-to-end test for its functionality. It also adds support for the modulo (%) operator to Ray Data expressions, including its Operation enum, expression evaluation logic for both Python's operator and PyArrow compute functions, and a test case for with_column arithmetic operations. A TODO comment was added to the AggregateFnV2 docstring regarding generic types.

Review comments point out that the PyArrow implementation of the modulo operator is incorrect, missing a pc.floor() call, and that the new test for this operator only covers the pandas execution path. Additionally, a change to the rows_same utility function is identified as a breaking change due to altering its return behavior from a boolean to raising an AssertionError. The AsList aggregation is noted to have potential memory issues for large groups, suggesting a warning be added to its docstring, and the new __mod__ method in expressions.py is missing type hints.

I am having trouble creating individual review comments. Click here to see my feedback.

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py (132-134)

critical

The implementation of the modulo operator for PyArrow is incorrect. It's missing a pc.floor() call. The current implementation left - (left / right) * right will produce incorrect results for integer modulo when left / right is not an integer, because pc.divide performs floating-point division. For example, 1 % 2 would evaluate to 0 instead of 1.

The correct formula for modulo is a - n * floor(a/n). The implementation should be updated to include pc.floor.

Note that the new test for this operator in test_with_column.py only exercises the pandas code path. Adding a pyarrow batch_format to the test parameterization would have caught this bug.

    Operation.MOD: lambda left, right: pc.subtract(
        left, pc.multiply(right, pc.floor(pc.divide(left, right)))
    ),
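The formula a - n * floor(a/n) quoted above can be checked in plain Python. This is only a sketch of the arithmetic, not the PyArrow code path: the helper names are hypothetical, and how pc.divide treats integer inputs is not verified here.

```python
import math


def floored_mod(a, n):
    """Modulo as a - n * floor(a / n); the result takes the sign of n."""
    return a - n * math.floor(a / n)


def truncated_mod(a, n):
    """Remainder using truncation toward zero; the result takes the sign of a."""
    return a - n * math.trunc(a / n)


def unfloored(a, n):
    """The expression without floor, a - (a / n) * n, using true division."""
    return a - (a / n) * n


print(floored_mod(1, 2))     # 1, same as 1 % 2
print(unfloored(1, 2))       # 0.0, the remainder is lost
print(floored_mod(-7, 3))    # 2, same as -7 % 3 in Python
print(truncated_mod(-7, 3))  # -1, what a truncating division would give
```

The last two lines show why negative operands are worth adding to the test parameterization: floor and truncation only disagree when the operands have different signs. Note that a / n goes through floating point here, so the sketch is only reliable for values a float represents exactly.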

python/ray/data/_internal/util.py (1719-1727)

high

The change to rows_same removes the try...except block, which alters the function's contract. Previously, it returned True or False. Now, it returns True on success and raises an AssertionError on failure. This is a breaking change that could affect other tests that rely on the boolean return value.

While pd.testing.assert_frame_equal provides better error messages, changing the behavior of a shared utility function can have unintended consequences.

Please either revert this change to maintain the original contract of rows_same, or ensure all call sites of this function across the codebase are updated to handle the AssertionError. Given that not all call sites may be visible in this PR, reverting seems safer.
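One way to keep both behaviours is to split the check into an assertion-style helper and a boolean wrapper. The sketch below uses plain lists of dicts rather than DataFrames, and both function names are hypothetical; it only illustrates the two contracts, not the actual rows_same code.

```python
def assert_rows_equal(actual, expected):
    """Assertion-style check: raises AssertionError with a message on mismatch."""

    def sort_key(row):
        # Sort rows by their (column, value) pairs so row order does not matter.
        return sorted(row.items())

    if sorted(actual, key=sort_key) != sorted(expected, key=sort_key):
        raise AssertionError(f"rows differ: {actual!r} != {expected!r}")


def rows_same_bool(actual, expected):
    """Boolean-style wrapper: keeps the original True/False contract."""
    try:
        assert_rows_equal(actual, expected)
    except AssertionError:
        return False
    return True


print(rows_same_bool([{"a": 1}, {"a": 2}], [{"a": 2}, {"a": 1}]))  # True
print(rows_same_bool([{"a": 1}], [{"a": 3}]))                      # False
```

Call sites written as `assert rows_same(...)` keep working with the wrapper, while new tests can call the assertion-style helper directly to get the richer error message.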

python/ray/data/aggregate.py (382-383)

medium

The AsList aggregation collects all values for a group into a list in memory. For groups with a very large number of elements, this can lead to high memory usage and potentially Out-Of-Memory (OOM) errors on worker nodes.

It would be beneficial to add a note to the docstring to warn users about this potential memory issue, advising them to be cautious when using this aggregation on columns with large group sizes.
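To make the memory concern concrete, here is a plain-Python sketch of what collecting values per group amounts to. It is not Ray Data's implementation, and as_list_by_group is a hypothetical helper; the point is only that each group's list holds every value of that group at once.

```python
from collections import defaultdict


def as_list_by_group(rows, key, on):
    """Collect the `on` value of every row into one list per `key` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[on])
    return dict(groups)


rows = [{"id": i, "group_key": i % 3} for i in range(9)]
lists = as_list_by_group(rows, key="group_key", on="id")
print(lists)  # {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5, 8]}

# Each list grows linearly with its group: nothing is reduced or discarded.
print({k: len(v) for k, v in lists.items()})  # {0: 3, 1: 3, 2: 3}
```

Unlike a sum or a count, the accumulated state is as large as the group itself, which is why a skewed key with one very large group is the case to warn about.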

python/ray/data/expressions.py (331-333)

medium

The new __mod__ method is missing type hints for the other parameter and the return value. For consistency with other binary operators in this class (like __add__ and __sub__), please add the type hints.

    def __mod__(self, other: Any) -> "Expr":
        """Modulo operator (%)."""
        return self._bin(other, Operation.MOD)
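For readers unfamiliar with the pattern, the toy classes below show how a typed __mod__ that delegates to a _bin helper builds an expression tree. Everything here (Col, Lit, BinaryExpr, and this stripped-down Expr and Operation) is a hypothetical stand-in written for illustration, not the classes in expressions.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Operation(Enum):
    ADD = "add"
    MOD = "mod"


class Expr:
    def _bin(self, other: Any, op: Operation) -> "Expr":
        # Wrap plain Python values so both operands are expressions.
        if not isinstance(other, Expr):
            other = Lit(other)
        return BinaryExpr(op, self, other)

    def __add__(self, other: Any) -> "Expr":
        return self._bin(other, Operation.ADD)

    def __mod__(self, other: Any) -> "Expr":
        return self._bin(other, Operation.MOD)


@dataclass(frozen=True)
class Col(Expr):
    name: str


@dataclass(frozen=True)
class Lit(Expr):
    value: Any


@dataclass(frozen=True)
class BinaryExpr(Expr):
    op: Operation
    left: Expr
    right: Expr


expr = Col("id") % 2
print(expr.op)     # Operation.MOD
print(expr.right)  # Lit(value=2)
```

With the annotations in place, a type checker can see that `Col("id") % 2` is itself an Expr and can be chained with the other operators.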

@github-actions github-actions bot disabled auto-merge January 7, 2026 07:13
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Jan 7, 2026
# Schema: {'id': int64, 'group_key': int64}
# Listing all elements per group:
result = ds.groupby("group_key").aggregate(AsList(on="id")).take_all()
Member

Let's add sort for testing?

Suggested change
- result = ds.groupby("group_key").aggregate(AsList(on="id")).take_all()
+ result = ds.groupby("group_key").aggregate(AsList(on="id")).sort("group_key").take_all()

Contributor Author

That's not sufficient -- we need to order the list too, which makes this code example really clumsy (hence why I'm skipping testing it).

Member

Enabling preserve_order should solve this.

Contributor Author

@alexeykudinkin alexeykudinkin Jan 7, 2026

preserve_order won't guarantee the order of elements within each list either, b/c that depends on the order in which the shards arrive at HashShuffleAggregator.
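If the example does get tested later, one option is to normalize the result before comparing: sort the rows by group key and sort each list. The sketch below works on plain dicts as returned by take_all(); the helper name and the "ids" column name are assumptions for illustration, since the actual output column name is not shown in this thread.

```python
def normalize(rows, key, list_col):
    """Sort rows by `key` and sort the values inside each `list_col` list."""
    return sorted(
        ({**row, list_col: sorted(row[list_col])} for row in rows),
        key=lambda row: row[key],
    )


# Two results that differ only in row order and in element order per list.
run_a = [
    {"group_key": 1, "ids": [4, 1, 7]},
    {"group_key": 0, "ids": [0, 6, 3]},
]
run_b = [
    {"group_key": 0, "ids": [3, 0, 6]},
    {"group_key": 1, "ids": [7, 4, 1]},
]

print(normalize(run_a, "group_key", "ids") == normalize(run_b, "group_key", "ids"))  # True
```

This keeps the assertion independent of shard arrival order, at the cost of the extra normalization step that makes the docstring example clumsy.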

Signed-off-by: Alexey Kudinkin <[email protected]>

# Conflicts:
#	python/ray/data/tests/test_with_column.py

@alexeykudinkin alexeykudinkin enabled auto-merge (squash) January 9, 2026 03:52
Member

@owenowenisme owenowenisme left a comment

LGTM

@alexeykudinkin alexeykudinkin merged commit 38257b7 into master Jan 9, 2026
7 checks passed
@alexeykudinkin alexeykudinkin deleted the ak/agg-lst-add branch January 9, 2026 11:56
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
## Description

Changes
---
- Added `AsList` aggregation allowing to aggregate given column values
into a single element as a list
 - Added `AsListVectorized`
 - Added modulo op to `Expr`

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: jasonwrwang <[email protected]>

Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants