@@ -76,8 +76,9 @@ pn = PandasLikeNamespace(
     implementation=Implementation.PANDAS,
     version=Version.MAIN,
 )
-print(nw.col("a")._to_compliant_expr(pn))
+print(nw.col("a")(pn))
 ```
+
 The result from the last line above is the same as we'd get from `pn.col('a')`, and it's
 a `narwhals._pandas_like.expr.PandasLikeExpr` object, which we'll call `PandasLikeExpr` for
 short.
@@ -177,7 +178,7 @@ The way you access the Narwhals-compliant wrapper depends on the object:

 - `narwhals.DataFrame` and `narwhals.LazyFrame`: use the `._compliant_frame` attribute.
 - `narwhals.Series`: use the `._compliant_series` attribute.
-- `narwhals.Expr`: call the `._to_compliant_expr` method, and pass to it the Narwhals-compliant namespace associated with
+- `narwhals.Expr`: call the `.__call__` method, and pass to it the Narwhals-compliant namespace associated with
   the given backend.

 🛑 BUT WAIT! What's a Narwhals-compliant namespace?
@@ -212,9 +213,10 @@ pn = PandasLikeNamespace(
     implementation=Implementation.PANDAS,
     version=Version.MAIN,
 )
-expr = (nw.col("a") + 1)._to_compliant_expr(pn)
+expr = (nw.col("a") + 1)(pn)
 print(expr)
 ```
+
 If we then extract a Narwhals-compliant dataframe from `df` by
 calling `._compliant_frame`, we get a `PandasLikeDataFrame` - and that's an object which we can pass `expr` to!

@@ -228,6 +230,7 @@ We can then view the underlying pandas Dataframe which was produced by calling `
 ```python exec="1" result="python" session="pandas_api_mapping" source="above"
 print(result._native_frame)
 ```
+
 which is the same as we'd have obtained by just using the Narwhals API directly:

 ```python exec="1" result="python" session="pandas_api_mapping" source="above"
@@ -238,49 +241,42 @@ print(nw.to_native(df.select(nw.col("a") + 1)))

 Group-by is probably one of Polars' most significant innovations (on the syntax side) with respect
 to pandas. We can write something like
+
 ```python
 df: pl.DataFrame
 df.group_by("a").agg((pl.col("c") > pl.col("b").mean()).max())
 ```
+
 To do this in pandas, we need to either use `GroupBy.apply` (sloooow), or do some crazy manual
 optimisations to get it to work.

 In Narwhals, here's what we do:

 - if somebody uses a simple group-by aggregation (e.g. `df.group_by('a').agg(nw.col('b').mean())`),
   then on the pandas side we translate it to
-  ```python
-  df: pd.DataFrame
-  df.groupby("a").agg({"b": ["mean"]})
-  ```
+
+    ```python
+    df: pd.DataFrame
+    df.groupby("a").agg({"b": ["mean"]})
+    ```
+
 - if somebody passes a complex group-by aggregation, then we use `apply` and raise a `UserWarning`, warning
   users of the performance penalty and advising them to refactor their code so that the aggregation they perform
   ends up being a simple one.

-In order to tell whether an aggregation is simple, Narwhals uses the private `_depth` attribute of `PandasLikeExpr`:
-
-```python exec="1" result="python" session="pandas_impl" source="above"
-print(pn.col("a").mean())
-print((pn.col("a") + 1).mean())
-```
-
-For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
-which (efficient) elementary operation this corresponds to in pandas.
-
 ## Expression Metadata

-Let's try printing out a few expressions to the console to see what they show us:
+Let's try printing out some compliant expressions' metadata to see what it shows us:

-```python exec="1" result="python" session="metadata" source="above"
+```python exec="1" result="python" session="pandas_impl" source="above"
 import narwhals as nw

-print(nw.col("a"))
-print(nw.col("a").mean())
-print(nw.col("a").mean().over("b"))
+print(nw.col("a")(pn)._metadata)
+print(nw.col("a").mean()(pn)._metadata)
+print(nw.col("a").mean().over("b")(pn)._metadata)
 ```

-Note how they tell us something about their metadata. This section is all about
-making sense of what that all means, what the rules are, and what it enables.
+This section is all about making sense of what that all means, what the rules are, and what it enables.

 Here's a brief description of each piece of metadata:

@@ -293,8 +289,6 @@ Here's a brief description of each piece of metadata:
     - `ExpansionKind.MULTI_UNNAMED`: Produces multiple outputs whose names depend
       on the input dataframe. For example, `nw.nth(0, 1)` or `nw.selectors.numeric()`.

-- `last_node`: Kind of the last operation in the expression. See
-  `narwhals._expression_parsing.ExprKind` for the various options.
 - `has_windows`: Whether the expression already contains an `over(...)` statement.
 - `n_orderable_ops`: How many order-dependent operations the expression contains.

@@ -311,6 +305,7 @@ Here's a brief description of each piece of metadata:
 - `is_scalar_like`: Whether the output of the expression is always length-1.
 - `is_literal`: Whether the expression doesn't depend on any column but instead
   only on literal values, like `nw.lit(1)`.
+- `nodes`: List of operations which this expression applies when evaluated.

 #### Chaining

@@ -377,3 +372,67 @@ Narwhals triggers a broadcast in these situations:

 Each backend is then responsible for doing its own broadcasting, as defined in each
 `CompliantExpr.broadcast` method.
+
+### Elementwise push-down
+
+SQL is picky about `over` operations. For example:
+
+- `sum(a) over (partition by b)` is valid.
+- `sum(abs(a)) over (partition by b)` is valid.
+- `abs(sum(a)) over (partition by b)` is not valid.
+
+In Polars, however, all three of the following are valid:
+
+- `pl.col('a').sum().over('b')`
+- `pl.col('a').abs().sum().over('b')`
+- `pl.col('a').sum().abs().over('b')`
+
+How can we retain Polars' level of flexibility when translating to SQL engines?
+
+The answer is: by rewriting expressions. Specifically, we push down `over` nodes past elementwise ones.
+To see this, let's try printing the Narwhals equivalent of the last expression above (the one that SQL rejects):
+
+```python exec="1" result="python" session="pushdown" source="above"
+import narwhals as nw
+
+print(nw.col("a").sum().abs().over("b"))
+```
+
+Note how Narwhals automatically inserted the `over` operation _before_ the `abs` one. In other words, instead
+of doing
+
+- `sum` -> `abs` -> `over`
+
+it did
+
+- `sum` -> `over` -> `abs`
+
+thus allowing the expression to be valid for SQL engines!
+
+This is what we refer to as "pushing down `over` nodes". The idea is:
+
+- Elementwise operations operate row-by-row and don't depend on the rows around them.
+- An `over` node partitions or orders a computation.
+- Therefore, an elementwise operation followed by an `over` operation is the same
+  as doing the `over` operation followed by that same elementwise operation!
+
+Note that the pushdown also applies to any arguments to the elementwise operation.
+For example, if we have
+
+```python
+(nw.col("a").sum() + nw.col("b").sum()).over("c")
+```
+
+then `+` is an elementwise operation and so can be swapped with `over`. We just need
+to take care to apply the `over` operation to all the arguments of `+`, so that we
+end up with
+
+```python
+nw.col("a").sum().over("c") + nw.col("b").sum().over("c")
+```
+
+In general, query optimisation is out-of-scope for Narwhals. We consider this
+expression rewrite acceptable because:
+
+- It's simple.
+- It allows us to evaluate operations which otherwise wouldn't be allowed for certain backends.
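
Note on the "Elementwise push-down" section added in the last hunk: the rewrite it describes can be sketched on a toy expression tree. This is only an illustration under stated assumptions. The classes `Col`, `Agg`, `Elementwise`, `Over` and the helpers `push_down_over` and `render` are hypothetical names made up for this sketch; they are not Narwhals APIs, and the diff above does not show how Narwhals implements the rewrite internally.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


# Hypothetical toy node types (not part of Narwhals).
@dataclass(frozen=True)
class Col:
    """Reference to a column, e.g. `a`."""

    name: str


@dataclass(frozen=True)
class Agg:
    """Aggregation such as `sum(arg)`."""

    func: str
    arg: Node


@dataclass(frozen=True)
class Elementwise:
    """Row-by-row operation such as `abs(x)` or `x + y`."""

    func: str
    args: tuple[Node, ...]


@dataclass(frozen=True)
class Over:
    """Window: evaluate `arg` within each partition."""

    arg: Node
    partition_by: tuple[str, ...]


Node = Union[Col, Agg, Elementwise, Over]

INFIX = {"+", "-", "*", "/"}


def push_down_over(node: Node) -> Node:
    """If `Over` wraps an elementwise op, apply `Over` to each argument instead."""
    if isinstance(node, Over) and isinstance(node.arg, Elementwise):
        inner = node.arg
        new_args = tuple(
            push_down_over(Over(arg, node.partition_by)) for arg in inner.args
        )
        return Elementwise(inner.func, new_args)
    return node


def render(node: Node) -> str:
    """Render a node as a SQL-like string, for display only."""
    if isinstance(node, Col):
        return node.name
    if isinstance(node, Agg):
        return f"{node.func}({render(node.arg)})"
    if isinstance(node, Elementwise):
        if node.func in INFIX and len(node.args) == 2:
            left, right = node.args
            return f"{render(left)} {node.func} {render(right)}"
        return f"{node.func}({', '.join(render(a) for a in node.args)})"
    return f"{render(node.arg)} over (partition by {', '.join(node.partition_by)})"


# sum -> abs -> over, the form that SQL rejects
expr = Over(Elementwise("abs", (Agg("sum", Col("a")),)), ("b",))
print(render(expr))  # abs(sum(a)) over (partition by b)
print(render(push_down_over(expr)))  # abs(sum(a) over (partition by b))

# (sum(a) + sum(b)).over(c): `over` is applied to every argument of `+`
total = Over(
    Elementwise("+", (Agg("sum", Col("a")), Agg("sum", Col("b")))), ("c",)
)
print(render(push_down_over(total)))
# sum(a) over (partition by c) + sum(b) over (partition by c)
```

The sketch only handles an `Over` that directly wraps an elementwise node, which is enough to reproduce the two examples from the section. Literal arguments and ordering clauses are deliberately left out.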