@MarcoGorelli (Member) commented Sep 25, 2025

I had a bit of time while travelling recently, so I tried rewriting the internals (yes, again).

The general idea is that narwhals.Expr stores a list of expressions in ._nodes, rather than passing around a bunch of opaque lambdas

What this enables

Expressions are pretty-printable

For example:

In [2]: nw.col('a').abs().rank()
Out[2]: col(a).abs().rank(method=average, descending=False)
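To illustrate the idea (this is not the code in this PR), here is a minimal sketch of an expression that records nodes instead of closures and prints itself from them. `Node` and `SketchExpr` are hypothetical names, and the real node structure in Narwhals may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Node:
    """One step of an expression: a function name plus its arguments (hypothetical)."""

    name: str
    args: tuple[Any, ...] = ()
    kwargs: dict[str, Any] = field(default_factory=dict)

    def __repr__(self) -> str:
        parts = [str(a) for a in self.args]
        parts += [f"{k}={v}" for k, v in self.kwargs.items()]
        return f"{self.name}({', '.join(parts)})"


class SketchExpr:
    """Toy stand-in for an expression that stores nodes rather than lambdas."""

    def __init__(self, nodes: tuple[Node, ...]) -> None:
        self._nodes = nodes

    def _with(self, name: str, *args: Any, **kwargs: Any) -> SketchExpr:
        # Every method call appends one node and returns a new expression.
        return SketchExpr((*self._nodes, Node(name, args, kwargs)))

    def abs(self) -> SketchExpr:
        return self._with("abs")

    def rank(self, method: str = "average", *, descending: bool = False) -> SketchExpr:
        return self._with("rank", method=method, descending=descending)

    def __repr__(self) -> str:
        return ".".join(repr(node) for node in self._nodes)


def col(name: str) -> SketchExpr:
    return SketchExpr((Node("col", (name,)),))


print(col("a").abs().rank())
# col(a).abs().rank(method=average, descending=False)
```

Because the nodes are plain data, the repr needs no knowledge of any backend.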

This is still quite basic, but we could make it more complex by introducing line breaks if it gets too long. Seems like the kind of thing @camriddell might be interested in?

We can do simple expression rewrites

Currently, (nw.col('a').mean() + 1).over('b') isn't supported:

  • for pandas-like, it's a non-elementary operation in over
  • for sql, (mean(a) + 1) over (partition by b) isn't valid syntax, it should be mean(a) over (partition by b) + 1

With this PR, however, it is!

When inserting an over node, we push it down before any elementwise operations (such as +, .abs(), sum_horizontal, ...) and apply it to all expressions. There are some more details in the expanded "how it works" docs.

So now, expressions like (nw.col('a').mean() + 1).over('b') can be supported for all backends

This rewrite is extremely simple and cheap: it's just a matter of inserting a node at some position i rather than at the end of a list. In general, query optimisation is out of scope for Narwhals. But, given that this enables more of Polars' flexibility for other backends, I think this can be in scope.
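A rough sketch of that insertion, assuming a flat list of nodes and a hand-written set of elementwise operation names (both are assumptions made for illustration; the PR's actual bookkeeping is not shown here, and expression-valued operands are left out):

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical classification of operations, for this sketch only.
ELEMENTWISE = {"abs", "__add__", "__sub__", "__mul__"}


@dataclass(frozen=True)
class Node:
    name: str
    arg: object = None

    def __repr__(self) -> str:
        return f"{self.name}({'' if self.arg is None else self.arg})"


def insert_over(nodes: list[Node], partition_by: str) -> list[Node]:
    """Return a new node list with `over` placed before trailing elementwise nodes."""
    i = len(nodes)
    # Walk backwards past every trailing elementwise operation.
    while i > 0 and nodes[i - 1].name in ELEMENTWISE:
        i -= 1
    # Insert at position i rather than at the end of the list.
    return [*nodes[:i], Node("over", partition_by), *nodes[i:]]


# Nodes for (col('a').mean() + 1).over('b')
nodes = [Node("col", "a"), Node("mean"), Node("__add__", 1)]
rewritten = insert_over(nodes, "b")
print(".".join(map(repr, rewritten)))
# col(a).mean().over(b).__add__(1)
```

The rewritten order corresponds to `mean(a) over (partition by b) + 1`, which is the form SQL accepts.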

Per-group broadcasting

Previously, nw.col('a') - nw.col('a').mean() would be fine, but (nw.col('a') - nw.col('a').mean()).over('b') would raise for sql-like backends. Now, it works fine across all backends! Really useful for feature engineering.
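For readers unfamiliar with the pattern, here is what such an expression is expected to compute, spelled out in plain Python rather than through Narwhals. The data is made up for illustration: each value of `a` has the mean of its own `b` group subtracted from it.

```python
from collections import defaultdict

# Made-up rows; 'b' is the partition column, 'a' the value column.
rows = [
    {"b": "x", "a": 1.0},
    {"b": "x", "a": 3.0},
    {"b": "y", "a": 10.0},
    {"b": "y", "a": 20.0},
]

# Collect the values of 'a' per group of 'b'.
groups: dict[str, list[float]] = defaultdict(list)
for row in rows:
    groups[row["b"]].append(row["a"])

# One mean per group, then broadcast it back to every row of that group.
group_means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
demeaned = [row["a"] - group_means[row["b"]] for row in rows]

print(demeaned)  # [-1.0, 1.0, -5.0, 5.0]
```

The aggregation yields one value per group, and the subtraction is elementwise, which is why the mean has to be broadcast within each group.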

Simplified internals

  • We can completely get rid of depth, function_name, scalar_kwargs
  • Replace CompliantWhen / CompliantThen and their complicated interaction with just CompliantNamespace.when_then

In fact, this goes as far as reducing package size by almost 1%. Not that that was the objective with this work, but it's nice to see that it doesn't make the package bigger

What this may open the doors to

  • serialisation / deserialisation of expressions. e.g. nw.Expr.from_json(expr.to_json())
  • chained window functions, like nw.col('a').shift(7).rolling_mean(7).over('store', order_by='date')
  • non-elementary group-by aggregations for pandas/pyarrow, like df.group_by('a').agg((nw.col('b')-nw.col('c')).mean())
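As a hint of why the first item becomes feasible, here is a hypothetical sketch of round-tripping a node list through JSON. The dictionary layout and the `to_json` / `from_json` helpers are assumptions for illustration only; nothing like this is implemented in this PR.

```python
import json
from typing import Any

# Hypothetical plain-data encoding of col('a').abs().rank().
nodes: list[dict[str, Any]] = [
    {"name": "col", "args": ["a"], "kwargs": {}},
    {"name": "abs", "args": [], "kwargs": {}},
    {"name": "rank", "args": [], "kwargs": {"method": "average", "descending": False}},
]


def to_json(nodes: list[dict[str, Any]]) -> str:
    # Nodes hold only names and simple values, so json can encode them directly.
    return json.dumps(nodes)


def from_json(payload: str) -> list[dict[str, Any]]:
    return json.loads(payload)


restored = from_json(to_json(nodes))
print(restored == nodes)  # True
```

Opaque lambdas cannot be serialised this way, whereas a list of named nodes with simple arguments can.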

@MarcoGorelli MarcoGorelli marked this pull request as ready for review September 27, 2025 13:38
@MarcoGorelli MarcoGorelli changed the title from "WIP feat: Make expressions printable, rewrite internals (travelling pr 🌴 )" to "feat: Make expressions printable, rewrite internals (travelling pr 🌴 )" Sep 27, 2025
@MarcoGorelli MarcoGorelli changed the title from "feat: Make expressions printable, rewrite internals (travelling pr 🌴 )" to "feat: Support over expressions more freely, make expressions printable, rewrite internals (travelling pr 🌴 )" Oct 1, 2025
@FBruzzesi (Member) left a comment

I only read the writeup/update in the docs for now. I will get to the code by the end of this week 🙏

- We're performing an aggregation.
- The name of the function is `'std'`. This will be looked up in the compliant object.
- It takes keyword arguments `ddof=1`.
- We'll look at the others later.

Do we need this point? Above we state: "Let's start with the third one: [...] This tells us a few things:"

or columns (e.g. `col('foo')`). Finally `allow_multi_output` tells us whether multi-outuput expressions
(more on this in the next section) are allowed to appear in `exprs`.

Node that the expression in `exprs` also has its own nodes:

Too many nodes 😂

Suggested change
- Node that the expression in `exprs` also has its own nodes:
+ Note that the expression in `exprs` also has its own nodes:

Comment on lines 490 to 494
In general, query optimisation is out-of-scope for Narwhals. We consider this
expression rewrite acceptable because:

- It's simple.
- It allows us to evaluate operations which otherwise wouldn't be allowed for certain backends.

Can we wrap this into a big fat mkdocs material admonition?
