---
title: "Intro 2 Polars"
execute:
  eval: true
  warning: true
  error: true
  keep-ipynb: true
  cache: true
jupyter: python3
pdf-engine: lualatex
# theme: pandoc
html:
  code-tools: true
  fold-code: false
author: Jonathan D. Rosenblatt
date: 03-12-2024
toc: true
number-sections: true
number-depth: 3
embed-resources: true
---
# Background {#sec-background}
## Ritchie Vink, Rust, Apache Arrow and Covid
[Here](https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/) is the story, by the creator of Polars.
## Who Can Benefit from Polars?
- Researcher (DS, Analyst, Statistician, etc):
- Working on their local machine.
- Working on a cloud machine (SageMaker, EC2).
- Production system:
- Running on a dedicated server.
- Running on "serverless" (e.g. AWS Lambda, Google Cloud Functions).
## The DataFrame Landscape
Initially there was R's `data.frame`. R has evolved, and it now offers `tibble`s and `data.table`s. Python had only `Pandas` for years. Then the Python ecosystem exploded, and now we have:
- [Pandas](https://Pandas.pydata.org/): The original Python dataframe module. Built by Wes McKinney, on top of numpy.
- [Polars](https://www.pola.rs/): A new dataframe module, built by Ritchie Vink, on top of Rust and Apache Arrow.
- [DuckDB](https://duckdb.org/): An in-process (embedded) analytical SQL engine with a Python API.
- [ClickHouse chDB](https://github.com/chdb-io/chdb): An in-process (embedded) version of the ClickHouse engine for Python.
- [Datafusion](https://arrow.apache.org/datafusion/): An extensible, Rust-based query engine from the Apache Arrow project.
- [Databend](https://github.com/datafuselabs/databend): An open-source, cloud-native analytical data warehouse written in Rust.
- [PyArrow](https://arrow.apache.org/docs/python/index.html): The Python bindings for Apache Arrow, providing the columnar memory format, readers, and compute functions.
- [Daft](https://www.getdaft.io/): A distributed dataframe library built for "Complex Data" (data that doesn't usually fit in a SQL table such as images, videos, documents etc).
- [Fugue](https://fugue-tutorials.readthedocs.io/): A dataframe library that allows you to write SQL-like code, and execute it on different backends (e.g. Spark, Dask, Pandas, Polars, etc).
- [pySpark](https://spark.apache.org/docs/latest/api/python/index.html): The Python API for Spark. Spark is a distributed computing engine, with support for distributing data over multiple processes running Pandas (or numpy, Polars, etc).
- [CUDF](https://github.com/rapidsai/cudf): A **GPU accelerated** dataframe library, built on top of Apache Arrow.
- [datatable](https://datatable.readthedocs.io/en/latest/): An attempt to recreate R's [data.table](https://github.com/Rdatatable/data.table) API and (crazy) speed in Python.
- [Dask](https://www.dask.org/): A distributed computing engine for Python, with support for distributing data over multiple processes running Pandas (or numpy, Polars, etc).
- [Vaex](https://vaex.io/): A high performance Python library for lazy Out-of-Core DataFrames (similar to dask, but with a different API).
- [Modin](https://github.com/modin-project/modin): A drop-in distributed replacement for Pandas, built on top of [Ray](https://www.ray.io/).
For more details see [here](https://pola-rs.github.io/Polars-book/user-guide/misc/alternatives/), [here](https://www.getdaft.io/projects/docs/en/latest/dataframe_comparison.html), and [here](https://www.linkedin.com/posts/mimounedjouallah_duckdb-pyarrow-clickhouse-activity-7172389181011161088-q1lk?utm_source=share&utm_medium=member_desktop).
# Motivation to Use Polars {#sec-motivation}
Each of the following, alone(!), is amazing.
1. Out of the box **parallelism**.
2. Lazy Evaluation: With query planning and **query optimization**.
3. Streaming engine: Can **stream data from disk** to memory for out-of-memory processing.
4. A complete set of **native dtypes**; including missing and strings.
5. An intuitive and **consistent API**; inspired by PySpark.
## Setting Up the Environment
At this point you may want to create and activate a [venv](https://realpython.com/python-virtual-environments-a-primer/) for this project.
```{python}
#| echo: false
# %pip install --upgrade pip
# %pip install --upgrade polars
# %pip install --upgrade pyarrow
# %pip install --upgrade Pandas
# %pip install --upgrade plotly
# %pip freeze > requirements.txt
```
```{python}
#| label: setup-env
# %pip install -r requirements.txt
```
```{python}
#| label: Polars-version
%pip show Polars # check your Polars version
```
```{python}
#| label: Pandas-version
%pip show Pandas # check your Pandas version
```
```{python}
#| label: preliminaries
import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs
import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline
import matplotlib.pyplot as plt
from datetime import datetime
# The following two lines are only required to view plotly when rendering from VS Code.
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"
# set the working directory to this file's directory
os.chdir(os.path.dirname(os.path.abspath(__file__)))
```
Which Polars version and dependencies are installed?
```{python}
#| label: show-versions
pl.show_versions()
```
How many cores are available for parallelism?
```{python}
#| label: show-cores
pl.thread_pool_size()
```
## Memory Footprint
### Memory Footprint of Storage
Comparing Polars to Pandas - the memory footprint of a series of strings.
Polars.
```{python}
#| label: Polars-memory-footprint
letters = pl.Series(list(string.ascii_letters))
n = int(10e6)
letter1 = letters.sample(n, with_replacement=True)
letter1.estimated_size(unit='gb')
```
Pandas (before Pandas 2.0.0).
```{python}
#| label: Pandas-memory-footprint
# Pandas with numpy backend
letter1_Pandas = pd.Series(list(string.ascii_letters)).sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=False)
letter1_Pandas.memory_usage(deep=True, index=True) / 1e9
```
Pandas after Pandas 2.0, with the Pyarrow backend (Apr 2023).
```{python}
#| label: Pandas-memory-footprint-with-Arrow
letter1_Pandas = pd.Series(list(string.ascii_letters), dtype="string[pyarrow]").sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=True)
letter1_Pandas.memory_usage(deep=True, index=True) / 1e9
```
## Lazy Frames and Query Planning {#sec-query-planning}
Consider a filter operation that follows a sort operation. Ideally, the filter would precede the sort, but we did not write it that way... We now demonstrate that Polars' query planner will reorder the operations for you. En passant, we will see that Polars is more efficient even without the query planner.
Polars' eager evaluation in the **wrong** order. Sort then filter.
```{python}
#| label: polars-eager-wrong-order
%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))
```
Polars' Eager evaluation in the **right** order. Filter then sort.
```{python}
#| label: polars-eager-right-order
%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()
```
Now prepare a Polars LazyFrame required for query optimization.
```{python}
#| label: polars-make-lazy
letter1_lazy = letter1.alias('letters').to_frame().lazy()
```
Polars' Lazy evaluation in the **wrong** order; **without** query planning
```{python}
#| label: polars-lazy-wrong-order-no-optimization
%timeit -n 2 -r 2 letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect(no_optimization=True)
```
Polars' Lazy evaluation in the **wrong** order; **with** query planning
```{python}
#| label: polars-lazy-wrong-order-optimization
%timeit -n 2 -r 2 letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()
```
Things to note:
1. Lazy evaluation was enabled when `.lazy()` converted the Polars DataFrame into a Polars LazyFrame.
2. The query planner worked: the lazy evaluation in the wrong order timed about the same as an eager evaluation in the right order, even when accounting for the overhead of converting the frame from eager to lazy.
Here is the actual query plan of each.
The non-optimized version:
```{python}
#| label: fig-polars-lazy-wrong-order-no-optimization-plan
letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=False)
```
```{python}
#| label: fig-polars-lazy-wrong-order-optimization-plan-2
letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=True)
```
Now compare to Pandas...
Pandas' eager evaluation in the **wrong** order.
```{python}
#| label: pandas-eager-wrong-order
%timeit -n1 -r1 letter1_Pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]
```
Pandas eager evaluation in the **right** order: Filter then sort.
```{python}
#| label: pandas-eager-right-order
%timeit -n1 -r1 letter1_Pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()
```
Pandas without lambda syntax.
```{python}
#| label: pandas-eager-right-order-no-lambda
%timeit -n 2 -r 2 letter1_Pandas.loc[letter1_Pandas.isin(['a','b','c'])].sort_values()
```
Things to note:
1. Query planning works!
2. Pandas has improved dramatically since the pre-2.0.0 days.
3. Lambda functions are always slow (in both Pandas and Polars).
For a full list of the operations that are optimized by Polars' query planner see [here](https://docs.pola.rs/user-guide/lazy/optimizations/).
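Besides `show_graph()`, the optimized plan can also be printed as text with `LazyFrame.explain()`. The following is a minimal sketch; the `lazy_example` name is mine and not used elsewhere in this demo.
```{python}
#| label: polars-lazy-explain-sketch
# Minimal sketch: print the optimized query plan as text.
lazy_example = letter1.alias('letters').to_frame().lazy()
print(
    lazy_example
    .sort(by='letters')
    .filter(pl.col('letters').is_in(['a', 'b', 'c']))
    .explain()  # returns the (optimized) plan as a string
)
```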
## Optimized for Within-Column Operations
Polars seamlessly parallelizes over columns (and within a column, when possible). As the number of columns in the data grows, we would expect a fixed runtime until all cores are used, and linear scaling thereafter. The following code demonstrates this idea, using a simple within-column sum.
```{python}
#| label: import-mlx
# Mac users with Apple silicon (M1 or M2) may also want to benchmark Apple's MLX:
# %pip install mlx
import mlx.core as mx
```
```{python}
#| label: make-data-for-benchmark
# Make an array of floats.
A_numpy = np.random.randn(int(1e6), 10)
A_Polars = pl.DataFrame(A_numpy)
A_Pandas_numpy = pd.DataFrame(A_numpy)
A_Pandas_arrow = pd.DataFrame(A_numpy, dtype="float32[pyarrow]")
# A_arrow = pa.Table.from_pandas(A_Pandas_numpy) # no sum method
A_mlx = mx.array(A_numpy)
```
Candidates currently omitted:
1. JAX
2. PyTorch
3. TensorFlow
4. ...?
### Summing Over Columns
```{python}
%timeit -n 4 -r 2 A_numpy.sum(axis=0)
```
```{python}
A_numpy.sum(axis=0).shape
```
```{python}
%timeit -n 4 -r 2 A_Polars.sum()
```
```{python}
A_Polars.sum().shape
```
```{python}
%timeit -n 4 -r 2 A_mlx.sum(axis=0)
```
```{python}
A_mlx.sum(axis=0).shape
```
### 50 Shades of Pandas
Pandas with numpy backend
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=0)
```
```{python}
A_Pandas_numpy.sum(axis=0).shape
```
Pandas with arrow backend
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=0)
```
```{python}
A_Pandas_arrow.sum(axis=0).shape
```
Pandas with numpy backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=0)
```
```{python}
A_Pandas_numpy.values.sum(axis=0).shape
```
Pandas with arrow backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=0)
```
```{python}
type(A_Pandas_arrow.values)
```
```{python}
A_Pandas_arrow.values.sum(axis=0).shape
```
Pandas to mlx
```{python}
%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=0)
```
```{python}
mx.array(A_Pandas_numpy.values).sum(axis=0).shape
```
### Summing Over Rows
```{python}
%timeit -n 4 -r 2 A_numpy.sum(axis=1)
```
```{python}
A_numpy.sum(axis=1).shape
```
```{python}
%timeit -n 4 -r 2 A_Polars.sum_horizontal()
```
```{python}
A_Polars.sum_horizontal().shape
```
```{python}
%timeit -n 4 -r 2 A_mlx.sum(axis=1)
```
```{python}
A_mlx.sum(axis=1).shape
```
### 50 Shades of Pandas
Pandas with numpy backend
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=1)
```
Pandas with arrow backend
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=1)
```
Pandas with numpy backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=1)
```
Pandas with arrow backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=1)
```
Pandas to mlx
```{python}
%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=1)
```
## Speed Of Import
Polars' `pl.read_*` functions are considerably faster than Pandas'. This is due to parallelism and better type "guessing".
We benchmark by generating synthetic data, saving it to disk, and re-importing it.
### CSV Format
```{python}
n_rows = int(1e5)
n_cols = 10
data_Polars = pl.DataFrame(np.random.randn(n_rows,n_cols))
# make the data folder if it does not exist
if not os.path.exists('data'):
    os.makedirs('data')
data_Polars.write_csv('data/data.csv', include_header = False)
f"{os.path.getsize('data/data.csv')/1e6:.2f} MB on disk"
```
Import with Pandas.
```{python}
%timeit -n2 -r2 data_Pandas = pd.read_csv('data/data.csv', header = None)
```
Import with Polars.
```{python}
%timeit -n2 -r2 data_Polars = pl.read_csv('data/data.csv', has_header = False)
```
What is the ratio of times on your machine? How many cores do you have?
### Parquet Format
```{python}
data_Polars.write_parquet('data/data.parquet')
f"{os.path.getsize('data/data.parquet')/1e7:.2f} MB on disk"
```
```{python}
%timeit -n2 -r2 data_Pandas = pd.read_parquet('data/data.parquet')
```
```{python}
%timeit -n2 -r2 data_Polars = pl.read_parquet('data/data.parquet')
```
### Feather (Apache IPC) Format
```{python}
data_Polars.write_ipc('data/data.feather')
f"{os.path.getsize('data/data.feather')/1e7:.2f} MB on disk"
```
```{python}
%timeit -n2 -r2 data_Polars = pl.read_ipc('data/data.feather')
```
```{python}
%timeit -n2 -r2 data_Pandas = pd.read_feather('data/data.feather')
```
### Pickle Format
```{python}
import pickle
pickle.dump(data_Polars, open('data/data.pickle', 'wb'))
f"{os.path.getsize('data/data.pickle')/1e7:.2f} MB on disk"
```
```{python}
%timeit -n2 -r2 data_Polars = pickle.load(open('data/data.pickle', 'rb'))
```
### Summarizing Import
Things to note:
1. The difference in speed between Pandas and Polars is quite large.
2. When dealing with CSVs, the function `pl.read_csv` reads in parallel, and has better type-guessing heuristics.
3. The difference in speed between CSV and the binary formats is also large, with feather < parquet < csv (in read time).
4. Feather is the fastest to read, but larger on disk; it is thus good for short-term storage, while Parquet is better for long-term storage.
5. The fact that pickle isn't the fastest surprised me; but then again, it is not optimized for tabular data.
## Speed Of Join
Because Pandas is built on numpy, people see it as both an in-memory database and a matrix/array library. With Polars, it is quite clear that it is an in-memory database, and not an array-processing library (despite having a `.dot()` method for inner products). As such, you cannot multiply two Polars dataframes, but you can certainly join them efficiently.
Make some data:
```{python}
def make_data(n_rows, n_cols):
    data = np.concatenate(
        (
            np.arange(n_rows)[:,np.newaxis], # index
            np.random.randn(n_rows,n_cols),  # values
        ),
        axis=1)
    return data
n_rows = int(1e7)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)
data_left.shape
```
### Polars Join
```{python}
data_left_Polars = pl.DataFrame(data_left)
data_right_Polars = pl.DataFrame(data_right)
%timeit -n2 -r2 Polars_joined = data_left_Polars.join(data_right_Polars, on = 'column_0', how = 'left')
```
### Pandas Join
```{python}
data_left_Pandas = pd.DataFrame(data_left)
data_right_Pandas = pd.DataFrame(data_right)
%timeit -n2 -r2 Pandas_joined = data_left_Pandas.merge(data_right_Pandas, on = 0, how = 'inner')
```
## The NYC Taxi Dataset {#sec-nyc_taxi}
Empirical demonstration: Load the celebrated NYC taxi dataset, filter some rides and get the mean `tip_amount` by `passenger_count`.
```{python}
path = 'data/NYC' # Data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
file_names = os.listdir(path)
```
### Pandas
`df.query()` syntax.
```{python}
%%time
taxi_Pandas = pd.read_parquet(path)
taxi_Pandas.shape
query = '''
passenger_count > 0 and
passenger_count < 5 and
trip_distance >= 0 and
trip_distance <= 10 and
fare_amount >= 0 and
fare_amount <= 100 and
tip_amount >= 0 and
tip_amount <= 20 and
total_amount >= 0 and
total_amount <= 100
'''.replace('\n', ' ')
taxi_Pandas.query(query).groupby('passenger_count').agg({'tip_amount':'mean'})
```
Well, the `df.loc[]` syntax is usually faster than the `query` syntax:
```{python}
%%time
taxi_Pandas = pd.read_parquet(path)
ind = (
    taxi_Pandas['passenger_count'].between(1,4)
    & taxi_Pandas['trip_distance'].between(0,10)
    & taxi_Pandas['fare_amount'].between(0,100)
    & taxi_Pandas['tip_amount'].between(0,20)
    & taxi_Pandas['total_amount'].between(0,100)
)
(
    taxi_Pandas[ind]
    .groupby('passenger_count')
    .agg({'tip_amount':'mean'})
)
```
### Polars Lazy In Memory {#sec-polars-lazy-in-memory}
```{python}
%%time
import pyarrow.dataset as ds
q = (
    # pl.scan_parquet("data/NYC/*.parquet") # will not work, because the parquet was created with Int32, and not Int64.
    # Use the PyArrow reader for robustness
    pl.scan_pyarrow_dataset(
        ds.dataset("data/NYC", format="parquet") # Using PyArrow's Parquet reader
    )
    .filter(
        pl.col('passenger_count').is_between(1,4),
        pl.col('trip_distance').is_between(0,10),
        pl.col('fare_amount').is_between(0,100),
        pl.col('tip_amount').is_between(0,20),
        pl.col('total_amount').is_between(0,100)
    )
    .group_by(
        pl.col('passenger_count')
    )
    .agg(
        pl.col('tip_amount').mean().name.suffix('_mean')
    )
)
q.collect()
```
```{python}
q.show_graph(optimized=True) # Graphviz has to be installed
```
Things to note:
1. I did not use the native `pl.scan_parquet()`, even though it is the recommended reader. For your purposes, you will almost always use the native readers. It is convenient to remember, however, that you can fall back to the PyArrow importers if the native importers fail.
2. I only have 2 parquet files. When I run the same code with more files, despite my 16GB of RAM, **Pandas will crash my python kernel**.
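For reference, here is a minimal sketch of the same query with the native reader. It is an illustration only: as noted above, it may fail on these particular files because of the Int32/Int64 schema mismatch.
```{python}
#| label: polars-scan-parquet-sketch
# Minimal sketch, assuming the Parquet files in data/NYC share a compatible schema.
q_native = (
    pl.scan_parquet("data/NYC/*.parquet")  # native, parallel Parquet scanner
    .filter(
        pl.col('passenger_count').is_between(1, 4),
        pl.col('tip_amount').is_between(0, 20),
    )
    .group_by('passenger_count')
    .agg(pl.col('tip_amount').mean().name.suffix('_mean'))
)
# q_native.collect()  # uncomment if the schemas are compatible
```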
### Polars Lazy From Disk
::: callout-important
The following shows how to use the Polars streaming engine. This is arguably the biggest difference from Pandas and other in-memory dataframe libraries.
:::
```{python}
#| label: Polars-lazy-from-disk
q.collect(streaming=True)
```
Enough with motivation.
Let's learn something!
# Preliminaries {#sec-preliminaries}
## Object Classes
- **Polars Series**: Like a Pandas series. An in-memory array of data, with a name, and a dtype.
- **Polars DataFrame**: A collection of Polars Series. This is the Polars equivalent of a Pandas DataFrame. It is eager, and does not allow query planning.
- **Polars Expr**: A Polars series that is not yet computed, and that will be computed when needed. A Polars Expression can be thought of as:
1. A Lazy Series: A series that is not yet computed, and that will be computed when needed.
2. A function: That maps a Polars expression to another Polars expression, and can thus be chained.
- **Polars LazyFrame**: A collection of Polars Expressions. This is the Polars equivalent of a Spark DataFrame. It is lazy, thus allows query planning.
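A minimal sketch, with throwaway names, showing one object of each class:
```{python}
#| label: object-classes-sketch
s_demo = pl.Series("x", [1, 2, 3])            # Series: named, typed, in-memory data
df_demo = s_demo.to_frame()                   # DataFrame: a collection of Series (eager)
expr_demo = pl.col("x").mul(2).alias("x2")    # Expr: a recipe; nothing is computed yet
lf_demo = df_demo.lazy()                      # LazyFrame: a query plan, run on .collect()
type(s_demo), type(df_demo), type(expr_demo), type(lf_demo)
```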
::: callout-warning
Not all methods are implemented for all classes. In particular, not all `pl.DataFrame` methods are implemented for `pl.LazyFrame`, and vice versa. The same goes for `pl.Series` and `pl.Expr`.
This is not because the developers are lazy, but because the API is still being developed, and there are fundamental differences between the classes.
Think about it:
1. Why do we not see a `df.height` attribute for a `pl.LazyFrame`?
2. Why do we not see a `df.sample()` method for a `pl.LazyFrame`?
:::
## Evaluation Engines
Polars has (seemingly) 2 evaluation engines:
1. **Eager**: This is the default. It is the same as Pandas. When you call an expression, it is immediately evaluated.
2. **Lazy**: This is the same as Spark. When you call an expression, it is added to a chain of expressions that makes up a query plan. The query plan is optimized and evaluated when you call `.collect()`.
Why "seemingly" 2? Because each engine has its own subtleties. For instance, the behavior of the lazy engine may depend on streaming vs. non-streaming evaluation, and on the means of loading the data.
1. **Streaming or not?**: Streaming is a special case of lazy evaluation, used when you want to process data that does not fit in RAM. You can then call `.collect(streaming=True)` to process the dataset in chunks.
2. **Native loaders or not?**: Reading multiple parquet files using Polars' native readers may behave slightly differently from reading the same files as a PyArrow dataset (always prefer the native readers, when possible).
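A minimal sketch of the three modes of evaluation, using a tiny throwaway frame:
```{python}
#| label: evaluation-engines-sketch
demo = pl.DataFrame({"g": [1, 1, 2], "v": [1.0, 2.0, 3.0]})

# 1. Eager: evaluated immediately, like Pandas.
eager_out = demo.group_by("g").agg(pl.col("v").mean())

# 2. Lazy: build a plan, optimize it, and evaluate on .collect().
lazy_out = demo.lazy().group_by("g").agg(pl.col("v").mean()).collect()

# 3. Lazy + streaming: the same plan, evaluated in chunks (for larger-than-RAM data).
streamed_out = demo.lazy().group_by("g").agg(pl.col("v").mean()).collect(streaming=True)

# All three give the same result (sorting first, since group_by does not preserve order).
eager_out.sort("g").to_dicts() == lazy_out.sort("g").to_dicts() == streamed_out.sort("g").to_dicts()
```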
## Polars dtypes
Polars has its own dtypes, called with `pl.<dtype>`; e.g. `pl.Int32`. A comprehensive list may be found [here](https://docs.pola.rs/py-Polars/html/reference/datatypes.html).
Here are the most common. Note that, unlike Pandas (<2.0.0), **all are native Polars dtypes**, and do not fall back to Python objects.
- Floats:
- `pl.Float64`: As the name suggests.
- Integers:
- `pl.Int64`: As the name suggests.
- Booleans:
- `pl.Boolean`: As the name suggests.
- Strings:
- `pl.Utf8`: The only string encoding supported by Polars.
- `pl.String`: Recently introduced as an alias to `pl.Utf8`.
- `pl.Categorical`: A string that is encoded as an integer.
- `pl.Enum`: Short for "enumerate". A categorical with a fixed set of values.
- Temporal:
- `pl.Date`: Date, without time.
- `pl.Datetime`: Date, with time.
- `pl.Time`: Time, without date.
- `pl.Duration`: Time difference.
- Nulls:
- `pl.Null`: Polars equivalent of Python's `None`.
- `np.nan`: The numpy "not a number" value. Essentially a float, and not a null.
- Nested:
- `pl.List`: A list of items.
- `pl.Array`: A fixed length list.
- `pl.Struct`: Think of it as a dict within a frame.
Things to note:
- Polars has no `pl.Int` dtype, nor `pl.Float`. You must specify the number of bits.
- Polars also supports `np.nan`(!), which is different from its `pl.Null` dtype. `np.nan` is a **float**, while `Null` marks a missing value (like Python's `None`).
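A minimal sketch (throwaway names) of a few dtypes, and of the `Null` vs `np.nan` distinction:
```{python}
#| label: dtypes-null-vs-nan-sketch
ints   = pl.Series("i", [1, 2, None], dtype=pl.Int32)          # None becomes a Null of dtype Int32
floats = pl.Series("f", [1.0, float("nan"), None])              # NaN is a float value; None is a Null
cats   = pl.Series("c", ["a", "b", "a"], dtype=pl.Categorical)  # strings encoded as integers
ints.null_count(), floats.null_count(), int(floats.is_nan().sum()), cats.dtype
```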
## The Polars API
- You will fall in love with it!
- Much more similar to [PySpark](https://blog.det.life/pyspark-or-polars-what-should-you-use-breakdown-of-similarities-and-differences-b261a825b9d6) than to Pandas. The Pandas API is simply not amenable to lazy evaluation. If you are familiar with PySpark, you should feel at home pretty fast.
### Some Design Principles {#sec-api-principles}
Here are some principles that will help you understand the API:
1. All columns are created equal. There are **no index** columns.
1. Operations on the columns of a dataframe are always part of a **context**. Contexts include:
- `pl.DataFrame.select()`: This is the most common context. Just like a SQL SELECT, it is used to select and transform columns.
- `pl.DataFrame.with_columns()`: Transform columns but return all columns in the frame; not just the transformed ones.
- `pl.DataFrame.group_by().agg()`: The `.agg()` context works like a `.select()` context, but it is used to apply operations on sub-groups of rows.
- `pl.DataFrame.filter()`: This is used to filter rows using expressions that evaluate to Booleans.
- `pl.SQLContext().execute()`: This is used if you prefer to use SQL syntax, instead of the Polars API.
1. Nothing happens "in-place".
1. Two-word methods are always lower-case and separated by underscores. E.g.: `.is_in()` instead of `.isin()`; `.is_null()` instead of `.isnull()`; `.group_by()` instead of `.groupby()` (since version 0.19).
1. Look for `pl.Expr()` methods so you can chain operations. E.g. `pl.col('a').add(pl.col('b'))` is better than `pl.col('a') + pl.col('b')`; the former can be further chained. And there is always `.pipe()`.
1. Polars was designed for operations within **columns**, not within rows. Operations within rows are nevertheless possible (see the sketch after this list) via:
- Polars functions with a `_horizontal()` suffix. Examples: `pl.sum_horizontal()`, `pl.mean_horizontal()`, `pl.rolling_sum_horizontal()`.
- Combining columns into a single column with nested dtype. Examples: `pl.list()`, `pl.array()`, `pl.struct()`.
1. Always **remember the class** you are operating on. Series, Expressions, DataFrames, and LazyFrames, have similar but-not-identical methods.
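To illustrate the row-wise options mentioned in the list above, here is a minimal sketch on a throwaway frame:
```{python}
#| label: row-wise-ops-sketch
row_demo = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
row_demo.with_columns(
    pl.sum_horizontal("a", "b").alias("a_plus_b"),   # row-wise sum across columns
    pl.struct("a", "b").alias("packed"),             # pack each row into a single struct value
)
```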
### Some Examples of the API
Here is an example to give you a little taste of what the API feels like.
```{python}
#| label: Polars-api
# Make some data
polars_frame = pl.DataFrame(make_data(100,4))
polars_frame.limit(5)
```
::: callout-note
What is the difference between `.head()` and `limit()`? For eager frames? For lazy frames?
:::
Can you parse the following in your head?
```{python}
(
    polars_frame
    .rename({'column_0':'group'})
    .with_columns(
        pl.col('group').cast(pl.Int32),
        pl.col('column_1').ge(0).alias('non-negative'),
    )
    .group_by('non-negative')
    .agg(
        pl.col('group').is_between(1,4).sum().alias('one-to-four'),
        pl.col('^column_[0-9]$').mean().name.suffix('_mean'),
    )
)
```
Ask yourself:
- What is `polars_frame`? Is it an eager or a lazy Polars dataframe?
- Why is `column_1_mean` indeed negative in the `non-negative=false` group?
- What is a Polars expression?
- What is a Polars series?
- How did I create the columns `column_1_mean`...`column_4_mean`?
- How would you have written this in Pandas?
```{python}
#| label: Polars-api-second-example
(
    polars_frame
    .rename({'column_0':'group'})
    .select(
        pl.col('group').mod(2),
        pl.mean_horizontal(
            pl.col('^column_[0-9]$')
        )
        .name.suffix('_mean')
    )
    .filter(
        pl.col('group').eq(1),
        pl.col('column_1_mean').gt(0)
    )
)
```
Try parsing the following in your head:
```{python}
polars_frame_2 = (
    pl.DataFrame(make_data(100,1))
    .select(
        pl.col('*').name.suffix('_second')
    )
)
(
    polars_frame
    .join(
        polars_frame_2,
        left_on = 'column_0',
        right_on = 'column_0_second',
        how = 'left',
        validate='1:1'
    )
)
```
## Getting Help
Before we dive in, you should be aware of the following references for further help:
1. A [github page](https://github.com/pola-rs/Polars). It is particularly important to subscribe to [release updates](https://github.com/pola-rs/Polars/releases).
2. A [user guide](https://pola-rs.github.io/Polars-book/user-guide/index.html).
3. A very active community on [Discord](https://discord.gg/4UfP5cfBE7).
4. The [API reference](https://pola-rs.github.io/Polars/py-Polars/html/reference/index.html).
5. A Stack-Overflow [tag](https://stackoverflow.com/questions/tagged/python-Polars).
6. Cheat-sheet for [Pandas users](https://www.rhosignal.com/posts/Polars-Pandas-cheatsheet/).
**Warning**: Be careful with AI assistants such as GitHub Copilot, TabNine, etc. Polars is still very new, and they may give you Pandas completions instead of Polars ones.
# Polars Series {#sec-series}
A Polars series looks and feels a lot like a Pandas series.
Getting used to Polars Series will thus give you bad intuitions when you move on to Polars Expressions.
Construct a series
```{python}
#| label: make-a-series
s = pl.Series("a", [1, 2, 3])
s
```
Make a Pandas series for later comparisons.
```{python}
#| label: make-a-Pandas-series
s_Pandas = pd.Series([1, 2, 3], name = "a")
s_Pandas
```
Notice that even the printing in notebooks is different.
Now verify the type
```{python}
#| label: check-series-type
type(s)
```
```{python}
#| label: check-Pandas-series-type
type(s_Pandas)
```
```{python}
#| label: check-series-dtype
s.dtype
```
```{python}
#| label: check-Pandas-series-dtype
s_Pandas.dtype
```
Renaming a series; this will be very useful when operating on dataframe columns.
```{python}
#| label: rename-series
s.alias("b")
```
Constructing a series of floats, for later use.
```{python}
#| label: make-a-float-series
f = pl.Series("a", [1., 2., 3.])
f
```
```{python}
#| label: check-float-series-dtype
f.dtype
```
## Export To Other Python Objects
The current section deals with exports to other Python objects, **in memory**. See @sec-disk-export for exporting to disk.
Export to Polars DataFrame.
```{python}
#| label: series-to-Polars-dataframe
s.to_frame()
```
Export to Python list.
```{python}
#| label: series-to-list
s.to_list()
```
Export to Numpy array.
```{python}