-
Notifications
You must be signed in to change notification settings - Fork 166
feat: Support over
expressions more freely, make expressions printable, rewrite internals (travelling pr 🌴 )
#3152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4feae2f
to
b4141af
Compare
over
expressions more freely, make expressions printable, rewrite internals (travelling pr 🌴 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only read the writeup/update in the docs for now. I will get to the code by the end of this week 🙏
docs/how_it_works.md
Outdated
- We're performing an aggregation. | ||
- The name of the function is `'std'`. This will be looked up in the compliant object. | ||
- It takes keyword arguments `ddof=1`. | ||
- We'll look at the others later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this point? Above we state: "Let's start with the third one: [...] This tells us a few things:"
docs/how_it_works.md
Outdated
or columns (e.g. `col('foo')`). Finally `allow_multi_output` tells us whether multi-outuput expressions | ||
(more on this in the next section) are allowed to appear in `exprs`. | ||
|
||
Node that the expression in `exprs` also has its own nodes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many nodes 😂
Node that the expression in `exprs` also has its own nodes: | |
Note that the expression in `exprs` also has its own nodes: |
docs/how_it_works.md
Outdated
In general, query optimisation is out-of-scope for Narwhals. We consider this | ||
expression rewrite acceptable because: | ||
|
||
- It's simple. | ||
- It allows us to evaluate operations which otherwise wouldn't be allowed for certain backends. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we wrap this into a big fat mkdocs material admonition?
I've been on holiday I had a bit of time while travelling recently, so I tried rewriting the internals (yes, again)
The general idea is that
narwhals.Expr
stores a list of expressions in._nodes
, rather than passing around a bunch of opaque lambdasWhat this enables
Expressions are pretty-printable
For example:
This is still quite basic, but we could make it more complex by introducing line breaks if it gets too long. Seems like the kind of thing @camriddell might be interested in?
We can do simple expression rewrites
Currently,
(nw.col('a').mean() + 1).over('b')
isn't supported:over
(mean(a) + 1) over (partition by b)
isn't valid syntax, it should bemean(a) over (partition by b) + 1
With this PR, however, it is!
What we do here is, when inserting an
over
node, we push it down before any elementwise operations (such as+
,.abs()
,sum_horizontal
, ...) and apply it to all expressions. There's some more details in the expansion to "how it works"So now, expressions like
(nw.col('a').mean() + 1).over('b')
can be supported for all backendsThis rewrite is extremely simple and cheap, it's just a matter of inserting a node at some position
i
rather than at the end of a list. In general, query optimisation is out of scope for Narwhals. But, given that this enables more of Polars' flexibility for other backends, I think this can be in scope.Per-group broadcasting
Previously,
nw.col('a') - nw.col('a').mean()
would be fine, but(nw.col('a') - nw.col('a').mean()).over('b')
would raise for sql-like backends. Now, it works fine across all backends! Really useful for feature engineeringSimplified internals
depth
,function_name
,scalar_kwargs
CompliantWhen
/CompliantThen
and their complicated interaction with justCompliantNamespace.when_then
In fact, this goes as far as reducing package size by almost 1%. Not that that was the objective with this work, but it's nice to see that it doesn't make the package bigger
What this may open the doors to
nw.Expr.from_json(expr.to_json())
nw.col('a').shift(7).rolling_mean(7).over('store', order_by='date')
df.group_by('a').agg((nw.col('b')-nw.col('c')).mean())