feat(expr-ir): Implement Acero `order_by`, `hashjoin` for `over` + `DataFrame.filter` #3173
base: oh-nodes
Conversation
```python
# NOTE: See (https://github.com/microsoft/pyright/issues/10673#issuecomment-3033789021)
# The issue is `T` possibly being `Iterable`
# Ignoring here still leaks the issue to the caller, where you need to annotate the base case
@overload
def flatten_hash_safe(iterable: Iterable[OneOrIterable[str]], /) -> Iterator[str]: ...
```
It's an improvement over the previous version, but far from ideal.
It still doesn't resolve this case, and I'm not entirely sure why yet:
narwhals/narwhals/_plan/compliant/column.py, lines 49 to 60 in f77bb4c:
```python
@classmethod
def align(
    cls, *exprs: OneOrIterable[SupportsBroadcast[SeriesT, LengthT]]
) -> Iterator[SeriesT]:
    exprs = tuple[SupportsBroadcast[SeriesT, LengthT], ...](flatten_hash_safe(exprs))
    length = cls._length_required(exprs)
    if length is None:
        for e in exprs:
            yield e.to_series()
    else:
        for e in exprs:
            yield e.broadcast(length)
```
```python
def sort_by(
    by: OneOrIterable[str],
    *more_by: str,
    descending: OneOrIterable[bool] = False,
    nulls_last: bool = False,
) -> Decl:
    return SortMultipleOptions.parse(
        descending=descending, nulls_last=nulls_last
    ).to_arrow_acero(tuple(flatten_hash_safe((by, more_by))))
```
As of feat(expr-ir): Impl `acero.sort_by`, I still need to make use of this in a plan.
A good candidate might be in either/both of:

`over(order_by=...)`:
narwhals/narwhals/_plan/arrow/expr.py, lines 328 to 350 in f77bb4c:
```python
def over_ordered(
    self, node: ir.OrderedWindowExpr, frame: Frame, name: str
) -> Self | Scalar:
    if node.partition_by:
        msg = f"Need to implement `group_by`, `join` for:\n{node!r}"
        raise NotImplementedError(msg)
    # NOTE: Converting `over(order_by=..., options=...)` into the right shape for `DataFrame.sort`
    sort_by = tuple(NamedIR.from_ir(e) for e in node.order_by)
    options = node.sort_options.to_multiple(len(node.order_by))
    idx_name = temp.column_name(frame)
    sorted_context = frame.with_row_index(idx_name).sort(sort_by, options)
    evaluated = node.expr.dispatch(self, sorted_context.drop([idx_name]), name)
    if isinstance(evaluated, ArrowScalar):
        # NOTE: We're already sorted, defer broadcasting to the outer context
        # Wouldn't be suitable for partitions, but will be fine here
        # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/2ae42458cae91f4473e01270919815fcd7cb9667
        # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/b8066c4c57d4b0b6c38d58a0f5de05eefc2cae70
        return self._with_native(evaluated.native, name)
    indices = pc.sort_indices(sorted_context.get_column(idx_name).native)
    height = len(sorted_context)
    result = evaluated.broadcast(height).native.take(indices)
    return self._with_native(result, name)
```
`is_{first,last}_distinct`:
narwhals/narwhals/_arrow/series.py, lines 719 to 747 in 715be22:
```python
def is_first_distinct(self) -> Self:
    import numpy as np  # ignore-banned-import

    row_number = pa.array(np.arange(len(self)))
    col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name])
    first_distinct_index = (
        pa.Table.from_arrays([self.native], names=[self.name])
        .append_column(col_token, row_number)
        .group_by(self.name)
        .aggregate([(col_token, "min")])
        .column(f"{col_token}_min")
    )
    return self._with_native(pc.is_in(row_number, first_distinct_index))

def is_last_distinct(self) -> Self:
    import numpy as np  # ignore-banned-import

    row_number = pa.array(np.arange(len(self)))
    col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name])
    last_distinct_index = (
        pa.Table.from_arrays([self.native], names=[self.name])
        .append_column(col_token, row_number)
        .group_by(self.name)
        .aggregate([(col_token, "max")])
        .column(f"{col_token}_max")
    )
    return self._with_native(pc.is_in(row_number, last_distinct_index))
```
Mostly following what is on `main` (so far).
Both are available at all levels, plus `to_series` is implemented in terms of `get_columns`.
`is_{first,last}_distinct` are among a few that fit that case.
`order_by`/`sort_by` pair
narwhals/_plan/arrow/acero.py (outdated):
```python
def join(
    left: pa.Table,
    right: pa.Table,
    how: JoinTypeSubset,
    left_on: OneOrIterable[str],
    right_on: OneOrIterable[str],
    suffix: str = "_right",
    *,
    coalesce_keys: bool = True,
) -> Decl:
    """Heavily based on [`pyarrow.acero._perform_join`].

    [`pyarrow.acero._perform_join`]: https://github.com/apache/arrow/blob/f7320c9a40082639f9e0cf8b3075286e3fc6c0b9/python/pyarrow/acero.py#L82-L260
    """
```
TODO: Investigate using non-`table_source` nodes
- AFAICT, `hashjoin` should be able to accept things like `project`
- Defining the handling for `cross` joins using `acero` directly seems very achievable
narwhals/narwhals/_arrow/dataframe.py, lines 400 to 422 in f4787d3:
if how == "cross": plx = self.__narwhals_namespace__() key_token = generate_temporary_column_name( n_bytes=8, columns=[*self.columns, *other.columns] ) return self._with_native( self.with_columns( plx.lit(0, None).alias(key_token).broadcast(ExprKind.LITERAL) ) .native.join( other.with_columns( plx.lit(0, None).alias(key_token).broadcast(ExprKind.LITERAL) ).native, keys=key_token, right_keys=key_token, join_type="inner", right_suffix=suffix, ) .drop([key_token]) ) - There does need to be a new layer for tracking
Schema
changes- Which is needed for
with_columns
also - Generally, the responsibility for a future
LogicalPlan
- Which is needed for
- Starting to build up the join test suite
  - At some point, `"cross"` support will be needed

Everything else requires another feature to be implemented:
- `DataFrame.filter` for semi, anti
- `DataFrame.collect_schema` for suffix
- `how="cross"` is just being deferred currently (#3173 (comment))
50 lines! Even after all this refactoring 😔
tests/plan/join_test.py (outdated):
```python
# NOTE: Maybe merge `semi`, `anti` into the same test which just inverts the predicate?
@XFAIL_DATAFRAME_FILTER
@pytest.mark.parametrize(
    ("on", "predicate", "expected"),
    [
        ("a", (nwp.col("b") > 5), {"a": [2], "b": [6], "zor ro": [9]}),
        (["b"], (nwp.col("b") < 5), {"a": [1, 3], "b": [4, 4], "zor ro": [7, 8]}),
        (["a", "b"], (nwp.col("b") < 5), {"a": [1, 3], "b": [4, 4], "zor ro": [7, 8]}),
    ],
)
def test_join_semi(
    on: On, predicate: nwp.Expr, expected: Data
) -> None:  # pragma: no cover
    data = {"a": [1, 3, 2], "b": [4, 4, 6], "zor ro": [7.0, 8.0, 9.0]}
    df = dataframe(data)
    other = df.filter(predicate)  # type: ignore[attr-defined]
```
It's interesting I got this far without `DataFrame.filter` 😂
I do have an `acero` version, which operates over `pc.Expression`, though:
narwhals/narwhals/_plan/arrow/acero.py, lines 172 to 174 in b0c2a4d:
```python
def filter(*predicates: Expr, **constraints: IntoExpr) -> Decl:
    expr = _parse_all_horizontal(predicates, constraints)
    return Decl("filter", options=pac.FilterNodeOptions(expr))
```
So the missing link between those two is approximately this, but with a fallback to an eager path like `main`:
narwhals/narwhals/_arrow/dataframe.py, lines 521 to 529 in 8ac061c:
```python
def filter(self, predicate: ArrowExpr | list[bool | None]) -> Self:
    if isinstance(predicate, list):
        mask_native: Mask | ChunkedArrayAny = predicate
    else:
        # `[0]` is safe as the predicate's expression only returns a single column
        mask_native = self._evaluate_into_exprs(predicate)[0].native
    return self._with_native(
        self.native.filter(mask_native), validate_column_names=False
    )
```
- Ideally these would be `str | Selector`, or `Expr` containing only selections
  - But that isn't possible with the current typing
- They *can* accept more
  - But it increases the complexity quite a lot for eager
Need similar logic for `DataFrame.filter`
Pretty sure on `main` that ignoring constraints is a bug
Quite handy that I did `Expr.filter` and `When` first 😄
narwhals/_plan/dataframe.py (outdated):
```python
if len(predicates) == 1 and not constraints:
    first = predicates[0]
    if is_list_of(first, bool):
        series = self._series.from_iterable(
            first,
            dtype=self.version.dtypes.Boolean(),
            backend=self.implementation,
        )
    elif is_series(first):
        series = first
    else:
        return super().filter(first)
    return self._with_compliant(self._compliant.filter(series._compliant))
non_mask = cast("tuple[OneOrIterable[IntoExprColumn],...]", predicates)
return super().filter(*non_mask, **constraints)
```
If anyone sees this: don't copy this logic to solve #3182, I haven't handled it here either.

Update: Should be clearer now with these failing tests (test: Add `test_filter_mask_mixed`); the exact text is allowed to change.

Some basic cases to consider for #3182. If we decide against supporting them, then all can be converted into a `pytest.raises`.
Really don't want this being part of the `ArrowDataFrame` constructor. Viewing `join` as an edge case, whereas things like `select`, `with_columns` already handle duplicates during `prepare_projections`.
```python
def filter(self, predicate: NamedIR) -> Self:
    mask: pc.Expression | ChunkedArrayAny
    if not fn.is_series(predicate):
        resolved = Expr.from_named_ir(predicate, self)
        if isinstance(resolved, Expr):
            mask = resolved.broadcast(len(self)).native
        else:
            mask = acero.lit(resolved.native)
    else:
        mask = predicate.native
    return self._with_native(self.native.filter(mask))
```
This is very nice now 😄
Tracking:
- `DataFrame.filter` silently ignores `**constraints` when using `list[bool]` #3182

Related issues
- `Expr` IR #2572
- `group_by`, utilize `pyarrow.acero` #3143

Description
Note

I've used the name `sort_by` for our wrapper of `order_by`. The node corresponds to `pa.Table.sort_by`, whereas the name `order_by` doesn't appear anywhere else in `pyarrow`.
Building out more `acero` parts to be able to support `.over(*partition_by)`.