---
title: "Intro 2 Polars"
execute:
  eval: true
  warning: true
  error: true
  keep-ipynb: true
  cache: true
jupyter: python3
pdf-engine: lualatex
# theme: pandoc
html:
  code-tools: true
  fold-code: false
author: Jonathan D. Rosenblatt
date: 03-12-2024
toc: true
number-sections: true
number-depth: 3
embed-resources: true
---
# Background {#sec-background}
## Ritchie Vink, Rust, Apache Arrow and Covid
[Here](https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/) is the story, by the creator of Polars.
## Who Can Benefit from Polars?
- Researcher (DS, Analyst, Statistician, etc):
- Working on their local machine.
- Working on a cloud machine (SageMaker, EC2).
- Production system:
- Running on a dedicated server.
- Running on "serverless" (e.g. AWS Lambda, Google Cloud Functions).
## The DataFrame Landscape
Initially there was R's `data.frame`. R has evolved, and it now offers `tibble`s and `data.table`s. Python had only `Pandas` for years. Then the Python ecosystem exploded, and now we have:
- [Pandas](https://Pandas.pydata.org/): The original Python dataframe module. Built by Wes McKinney, on top of numpy.
- [Polars](https://www.pola.rs/): A new dataframe module, built by Ritchie Vink, on top of Rust and Apache Arrow.
- [DuckDB](https://duckdb.org/): An in-process (embedded) analytical SQL engine with a Python API.
- [ClickHouse chDB](https://github.com/chdb-io/chdb): An in-process (embedded) version of the ClickHouse engine for Python.
- [Datafusion](https://arrow.apache.org/datafusion/): An extensible, Rust-based query engine from the Apache Arrow project.
- [Databend](https://github.com/datafuselabs/databend): An open-source, cloud-native analytical data warehouse written in Rust.
- [PyArrow](https://arrow.apache.org/docs/python/index.html): The Python bindings for Apache Arrow, providing the columnar memory format, readers, and compute functions.
- [Daft](https://www.getdaft.io/): A distributed dataframe library built for "Complex Data" (data that doesn't usually fit in a SQL table such as images, videos, documents etc).
- [Fugue](https://fugue-tutorials.readthedocs.io/): A dataframe library that allows you to write SQL-like code, and execute it on different backends (e.g. Spark, Dask, Pandas, Polars, etc).
- [pySpark](https://spark.apache.org/docs/latest/api/python/index.html): The Python API for Spark. Spark is a distributed computing engine, with support for distributing data over multiple processes running Pandas (or numpy, Polars, etc).
- [CUDF](https://github.com/rapidsai/cudf): A **GPU accelerated** dataframe library, built on top of Apache Arrow.
- [datatable](https://datatable.readthedocs.io/en/latest/): An attempt to recreate R's [data.table](https://github.com/Rdatatable/data.table) API and (crazy) speed in Python.
- [Dask](https://www.dask.org/): A distributed computing engine for Python, with support for distributing data over multiple processes running Pandas (or numpy, Polars, etc).
- [Vaex](https://vaex.io/): A high performance Python library for lazy Out-of-Core DataFrames (similar to dask, but with a different API).
- [Modin](https://github.com/modin-project/modin): A drop-in distributed replacement for Pandas, built on top of [Ray](https://www.ray.io/).
For more details see [here](https://pola-rs.github.io/Polars-book/user-guide/misc/alternatives/), [here](https://www.getdaft.io/projects/docs/en/latest/dataframe_comparison.html), and [here](https://www.linkedin.com/posts/mimounedjouallah_duckdb-pyarrow-clickhouse-activity-7172389181011161088-q1lk?utm_source=share&utm_medium=member_desktop).
# Motivation to Use Polars {#sec-motivation}
Each of the following, alone(!), is amazing.
1. Out of the box **parallelism**.
2. Lazy Evaluation: With query planning and **query optimization**.
3. Streaming engine: Can **stream data from disk** to memory for out-of-memory processing.
4. A complete set of **native dtypes**; including missing and strings.
5. An intuitive and **consistent API**; inspired by PySpark.
## Setting Up the Environment
At this point you may want to create and activate a [venv](https://realpython.com/python-virtual-environments-a-primer/) for this project.
```{python}
#| echo: false
# %pip install --upgrade pip
# %pip install --upgrade polars
# %pip install --upgrade pyarrow
# %pip install --upgrade Pandas
# %pip install --upgrade plotly
# %pip freeze > requirements.txt
```
```{python}
#| label: setup-env
# %pip install -r requirements.txt
```
```{python}
#| label: Polars-version
%pip show Polars # check your Polars version
```
```{python}
#| label: Pandas-version
%pip show Pandas # check your Pandas version
```
```{python}
#| label: preliminaries
import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs
import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline
import matplotlib.pyplot as plt
from datetime import datetime
# The following two lines are only required to view plotly when rendering from VS Code.
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"
# set the working directory to this file's directory
os.chdir(os.path.dirname(os.path.abspath(__file__)))
```
Which Polars version and dependencies are installed?
```{python}
#| label: show-versions
pl.show_versions()
```
How many cores are available for parallelism?
```{python}
#| label: show-cores
pl.thread_pool_size()
```
## Memory Footprint
### Memory Footprint of Storage
Comparing Polars to Pandas - the memory footprint of a series of strings.
Polars.
```{python}
#| label: Polars-memory-footprint
letters = pl.Series(list(string.ascii_letters))
n = int(10e6)
letter1 = letters.sample(n, with_replacement=True)
letter1.estimated_size(unit='gb')
```
Pandas (before Pandas 2.0.0).
```{python}
#| label: Pandas-memory-footprint
# Pandas with numpy backend
letter1_Pandas = pd.Series(list(string.ascii_letters)).sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=False)
letter1_Pandas.memory_usage(deep=True, index=True) / 1e9
```
Pandas after Pandas 2.0, with the Pyarrow backend (Apr 2023).
```{python}
#| label: Pandas-memory-footprint-with-Arrow
letter1_Pandas = pd.Series(list(string.ascii_letters), dtype="string[pyarrow]").sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=True)
letter1_Pandas.memory_usage(deep=True, index=True) / 1e9
```
## Lazy Frames and Query Planning {#sec-query-planning}
Consider a filter operation that follows a sort operation. Ideally, the filter would precede the sort, but we did not write it that way... We now demonstrate that Polars' query planner will reorder the operations for you. En passant, we will see that Polars is more efficient even without the query planner.
Polars' eager evaluation in the **wrong** order. Sort then filter.
```{python}
#| label: polars-eager-wrong-order
%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))
```
Polars' Eager evaluation in the **right** order. Filter then sort.
```{python}
#| label: polars-eager-right-order
%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()
```
Now prepare a Polars LazyFrame required for query optimization.
```{python}
#| label: polars-make-lazy
letter1_lazy = letter1.alias('letters').to_frame().lazy()
```
Polars' Lazy evaluation in the **wrong** order; **without** query planning
```{python}
#| label: polars-lazy-wrong-order-no-optimization
%timeit -n 2 -r 2 letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect(no_optimization=True)
```
Polars' Lazy evaluation in the **wrong** order; **with** query planning
```{python}
#| label: polars-lazy-wrong-order-optimization
%timeit -n 2 -r 2 letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()
```
Things to note:
1. Lazy evaluation was enabled when `.lazy()` converted the Polars DataFrame into a Polars LazyFrame.
2. The query planner worked: the lazy evaluation in the wrong order timed about the same as an eager evaluation in the right order, even when accounting for the overhead of converting the frame from eager to lazy.
Here is the actual query plan of each.
The non-optimized version:
```{python}
#| label: fig-polars-lazy-wrong-order-no-optimization-plan
letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=False)
```
```{python}
#| label: fig-polars-lazy-wrong-order-optimization-plan-2
letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=True)
```
Now compare to Pandas...
Pandas' eager evaluation in the **wrong** order.
```{python}
#| label: pandas-eager-wrong-order
%timeit -n1 -r1 letter1_Pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]
```
Pandas eager evaluation in the **right** order: Filter then sort.
```{python}
#| label: pandas-eager-right-order
%timeit -n1 -r1 letter1_Pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()
```
Pandas without lambda syntax.
```{python}
#| label: pandas-eager-right-order-no-lambda
%timeit -n 2 -r 2 letter1_Pandas.loc[letter1_Pandas.isin(['a','b','c'])].sort_values()
```
Things to note:
1. Query planning works!
2. Pandas has improved dramatically since the pre-2.0.0 days.
3. Lambda functions are always slow (in both Pandas and Polars).
For a full list of the operations that are optimized by Polars' query planner see [here](https://docs.pola.rs/user-guide/lazy/optimizations/).
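Besides `show_graph()`, the optimized plan can also be printed as text with `LazyFrame.explain()`. The following is a minimal sketch; the `lazy_example` name is mine and not used elsewhere in this demo.
```{python}
#| label: polars-lazy-explain-sketch
# Minimal sketch: print the optimized query plan as text.
lazy_example = letter1.alias('letters').to_frame().lazy()
print(
    lazy_example
    .sort(by='letters')
    .filter(pl.col('letters').is_in(['a', 'b', 'c']))
    .explain()  # returns the (optimized) plan as a string
)
```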
## Optimized for Within-Column Operations
Polars seamlessly parallelizes over columns (and within a column, when possible). As the number of columns in the data grows, we would expect a fixed runtime until all cores are used, and linear scaling thereafter. The following code demonstrates this idea, using a simple within-column sum.
```{python}
#| label: import-mlx
# Mac users with Apple silicon (M1 or M2) may also want to benchmark Apple's MLX:
# %pip install mlx
import mlx.core as mx
```
```{python}
#| label: make-data-for-benchmark
# Make an array of floats.
A_numpy = np.random.randn(int(1e6), 10)
A_Polars = pl.DataFrame(A_numpy)
A_Pandas_numpy = pd.DataFrame(A_numpy)
A_Pandas_arrow = pd.DataFrame(A_numpy, dtype="float32[pyarrow]")
# A_arrow = pa.Table.from_pandas(A_Pandas_numpy) # no sum method
A_mlx = mx.array(A_numpy)
```
Candidates currently omitted:
1. JAX
2. PyTorch
3. TensorFlow
4. ...?
### Summing Over Columns
```{python}
%timeit -n 4 -r 2 A_numpy.sum(axis=0)
```
```{python}
A_numpy.sum(axis=0).shape
```
```{python}
%timeit -n 4 -r 2 A_Polars.sum()
```
```{python}
A_Polars.sum().shape
```
```{python}
%timeit -n 4 -r 2 A_mlx.sum(axis=0)
```
```{python}
A_mlx.sum(axis=0).shape
```
### 50 Shades of Pandas
Pandas with numpy backend
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=0)
```
```{python}
A_Pandas_numpy.sum(axis=0).shape
```
Pandas with arrow backend
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=0)
```
```{python}
A_Pandas_arrow.sum(axis=0).shape
```
Pandas with numpy backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=0)
```
```{python}
A_Pandas_numpy.values.sum(axis=0).shape
```
Pandas with arrow backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=0)
```
```{python}
type(A_Pandas_arrow.values)
```
```{python}
A_Pandas_arrow.values.sum(axis=0).shape
```
Pandas to mlx
```{python}
%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=0)
```
```{python}
mx.array(A_Pandas_numpy.values).sum(axis=0).shape
```
### Summing Over Rows
```{python}
%timeit -n 4 -r 2 A_numpy.sum(axis=1)
```
```{python}
A_numpy.sum(axis=1).shape
```
```{python}
%timeit -n 4 -r 2 A_Polars.sum_horizontal()
```
```{python}
A_Polars.sum_horizontal().shape
```
```{python}
%timeit -n 4 -r 2 A_mlx.sum(axis=1)
```
```{python}
A_mlx.sum(axis=1).shape
```
### 50 Shades of Pandas
Pandas with numpy backend
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=1)
```
Pandas with arrow backend
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=1)
```
Pandas with numpy backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=1)
```
Pandas with arrow backend, converted to numpy
```{python}
%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=1)
```
Pandas to mlx
```{python}
%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=1)
```
## Speed Of Import
Polars' `pl.read_*` functions are considerably faster than Pandas'. This is due to parallelism and better type "guessing".
We benchmark by generating synthetic data, saving it to disk, and re-importing it.
### CSV Format
```{python}
n_rows = int(1e5)
n_cols = 10
data_Polars = pl.DataFrame(np.random.randn(n_rows,n_cols))
# make the data folder if it does not exist
if not os.path.exists('data'):
    os.makedirs('data')
data_Polars.write_csv('data/data.csv', include_header = False)
f"{os.path.getsize('data/data.csv')/1e6:.2f} MB on disk"
```
Import with Pandas.
```{python}
%timeit -n2 -r2 data_Pandas = pd.read_csv('data/data.csv', header = None)
```
Import with Polars.
```{python}
%timeit -n2 -r2 data_Polars = pl.read_csv('data/data.csv', has_header = False)
```
What is the ratio of times on your machine? How many cores do you have?
### Parquet Format
```{python}
data_Polars.write_parquet('data/data.parquet')
f"{os.path.getsize('data/data.parquet')/1e7:.2f} MB on disk"
```
```{python}
%timeit -n2 -r2 data_Pandas = pd.read_parquet('data/data.parquet')
```
```{python}
%timeit -n2 -r2 data_Polars = pl.read_parquet('data/data.parquet')
```
### Feather (Apache IPC) Format
```{python}
data_Polars.write_ipc('data/data.feather')
f"{os.path.getsize('data/data.feather')/1e7:.2f} MB on disk"
```
```{python}
%timeit -n2 -r2 data_Polars = pl.read_ipc('data/data.feather')
```
```{python}
%timeit -n2 -r2 data_Pandas = pd.read_feather('data/data.feather')
```
### Pickle Format
```{python}
import pickle
pickle.dump(data_Polars, open('data/data.pickle', 'wb'))
f"{os.path.getsize('data/data.pickle')/1e7:.2f} MB on disk"
```
```{python}
%timeit -n2 -r2 data_Polars = pickle.load(open('data/data.pickle', 'rb'))
```
### Summarizing Import
Things to note:
1. The difference in speed between Pandas and Polars is quite large.
2. When dealing with CSVs, the function `pl.read_csv` reads in parallel, and has better type-guessing heuristics.
3. The difference in speed between CSV and the binary formats is also large, with feather < parquet < csv (in read time).
4. Feather is the fastest to read, but larger on disk; it is thus good for short-term storage, while Parquet is better for long-term storage.
5. The fact that pickle isn't the fastest surprised me; but then again, it is not optimized for tabular data.
## Speed Of Join
Because Pandas is built on numpy, people see it as both an in-memory database and a matrix/array library. With Polars, it is quite clear that it is an in-memory database, and not an array-processing library (despite having a `.dot()` method for inner products). As such, you cannot multiply two Polars dataframes, but you can certainly join them efficiently.
Make some data:
```{python}
def make_data(n_rows, n_cols):
    data = np.concatenate(
        (
            np.arange(n_rows)[:,np.newaxis], # index
            np.random.randn(n_rows,n_cols),  # values
        ),
        axis=1)
    return data
n_rows = int(1e7)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)
data_left.shape
```
### Polars Join
```{python}
data_left_Polars = pl.DataFrame(data_left)
data_right_Polars = pl.DataFrame(data_right)
%timeit -n2 -r2 Polars_joined = data_left_Polars.join(data_right_Polars, on = 'column_0', how = 'left')
```
### Pandas Join
```{python}
data_left_Pandas = pd.DataFrame(data_left)
data_right_Pandas = pd.DataFrame(data_right)
%timeit -n2 -r2 Pandas_joined = data_left_Pandas.merge(data_right_Pandas, on = 0, how = 'inner')
```
## The NYC Taxi Dataset {#sec-nyc_taxi}
Empirical demonstration: Load the celebrated NYC taxi dataset, filter some rides and get the mean `tip_amount` by `passenger_count`.
```{python}
path = 'data/NYC' # Data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
file_names = os.listdir(path)
```
### Pandas
`df.query()` syntax.
```{python}
%%time
taxi_Pandas = pd.read_parquet(path)
taxi_Pandas.shape
query = '''
passenger_count > 0 and
passenger_count < 5 and
trip_distance >= 0 and
trip_distance <= 10 and
fare_amount >= 0 and
fare_amount <= 100 and
tip_amount >= 0 and
tip_amount <= 20 and
total_amount >= 0 and
total_amount <= 100
'''.replace('\n', ' ')
taxi_Pandas.query(query).groupby('passenger_count').agg({'tip_amount':'mean'})
```
Well, the `df.loc[]` syntax is usually faster than the `query` syntax:
```{python}
%%time
taxi_Pandas = pd.read_parquet(path)
ind = (
    taxi_Pandas['passenger_count'].between(1,4)
    & taxi_Pandas['trip_distance'].between(0,10)
    & taxi_Pandas['fare_amount'].between(0,100)
    & taxi_Pandas['tip_amount'].between(0,20)
    & taxi_Pandas['total_amount'].between(0,100)
)
(
    taxi_Pandas[ind]
    .groupby('passenger_count')
    .agg({'tip_amount':'mean'})
)
```
### Polars Lazy In Memory {#sec-polars-lazy-in-memory}
```{python}
%%time
import pyarrow.dataset as ds
q = (
    # pl.scan_parquet("data/NYC/*.parquet") # will not work, because the parquet was created with Int32, and not Int64.
    # Use the PyArrow reader for robustness
    pl.scan_pyarrow_dataset(
        ds.dataset("data/NYC", format="parquet") # Using PyArrow's Parquet reader
    )
    .filter(
        pl.col('passenger_count').is_between(1,4),
        pl.col('trip_distance').is_between(0,10),
        pl.col('fare_amount').is_between(0,100),
        pl.col('tip_amount').is_between(0,20),
        pl.col('total_amount').is_between(0,100)
    )
    .group_by(
        pl.col('passenger_count')
    )
    .agg(
        pl.col('tip_amount').mean().name.suffix('_mean')
    )
)
q.collect()
```
```{python}
q.show_graph(optimized=True) # Graphviz has to be installed
```
Things to note:
1. I did not use the native `pl.scan_parquet()`, even though it is the recommended reader. For your purposes, you will almost always use the native readers. It is convenient to remember, however, that you can fall back to the PyArrow importers if the native importers fail.
2. I only have 2 parquet files. When I run the same code with more files, despite my 16GB of RAM, **Pandas will crash my python kernel**.
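For reference, here is a minimal sketch of the same query with the native reader. It is an illustration only: as noted above, it may fail on these particular files because of the Int32/Int64 schema mismatch.
```{python}
#| label: polars-scan-parquet-sketch
# Minimal sketch, assuming the Parquet files in data/NYC share a compatible schema.
q_native = (
    pl.scan_parquet("data/NYC/*.parquet")  # native, parallel Parquet scanner
    .filter(
        pl.col('passenger_count').is_between(1, 4),
        pl.col('tip_amount').is_between(0, 20),
    )
    .group_by('passenger_count')
    .agg(pl.col('tip_amount').mean().name.suffix('_mean'))
)
# q_native.collect()  # uncomment if the schemas are compatible
```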
### Polars Lazy From Disk
::: callout-important
The following shows how to use the Polars streaming engine. This is arguably the biggest difference from Pandas and other in-memory dataframe libraries.
:::
```{python}
#| label: Polars-lazy-from-disk
q.collect(streaming=True)
```
Enough with motivation.
Let's learn something!
# Preliminaries {#sec-preliminaries}
## Object Classes
- **Polars Series**: Like a Pandas series. An in-memory array of data, with a name, and a dtype.
- **Polars DataFrame**: A collection of Polars Series. This is the Polars equivalent of a Pandas DataFrame. It is eager, and does not allow query planning.
- **Polars Expr**: A Polars series that is not yet computed, and that will be computed when needed. A Polars Expression can be thought of as:
1. A Lazy Series: A series that is not yet computed, and that will be computed when needed.
2. A function: That maps a Polars expression to another Polars expression, and can thus be chained.
- **Polars LazyFrame**: A collection of Polars Expressions. This is the Polars equivalent of a Spark DataFrame. It is lazy, thus allows query planning.
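A minimal sketch, with throwaway names, showing one object of each class:
```{python}
#| label: object-classes-sketch
s_demo = pl.Series("x", [1, 2, 3])            # Series: named, typed, in-memory data
df_demo = s_demo.to_frame()                   # DataFrame: a collection of Series (eager)
expr_demo = pl.col("x").mul(2).alias("x2")    # Expr: a recipe; nothing is computed yet
lf_demo = df_demo.lazy()                      # LazyFrame: a query plan, run on .collect()
type(s_demo), type(df_demo), type(expr_demo), type(lf_demo)
```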
::: callout-warning
Not all methods are implemented for all classes. In particular, not all `pl.DataFrame` methods are implemented for `pl.LazyFrame`, and vice versa. The same goes for `pl.Series` and `pl.Expr`.
This is not because the developers are lazy, but because the API is still being developed, and there are fundamental differences between the classes.
Think about it:
1. Why do we not see a `df.height` attribute for a `pl.LazyFrame`?
2. Why do we not see a `df.sample()` method for a `pl.LazyFrame`?
:::
## Evaluation Engines
Polars has (seemingly) 2 evaluation engines:
1. **Eager**: This is the default. It is the same as Pandas. When you call an expression, it is immediately evaluated.
2. **Lazy**: This is the same as Spark. When you call an expression, it is added to a chain of expressions that makes up a query plan. The query plan is optimized and evaluated when you call `.collect()`.
Why "seemingly" 2? Because each engine has its own subtleties. For instance, the behavior of the lazy engine may depend on streaming vs. non-streaming evaluation, and on the means of loading the data.
1. **Streaming or not?**: Streaming is a special case of lazy evaluation, used when you want to process data that does not fit in RAM. You can then call `.collect(streaming=True)` to process the dataset in chunks.
2. **Native loaders or not?**: Reading multiple parquet files using Polars' native readers may behave slightly differently from reading the same files as a PyArrow dataset (always prefer the native readers, when possible).
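A minimal sketch of the three modes of evaluation, using a tiny throwaway frame:
```{python}
#| label: evaluation-engines-sketch
demo = pl.DataFrame({"g": [1, 1, 2], "v": [1.0, 2.0, 3.0]})

# 1. Eager: evaluated immediately, like Pandas.
eager_out = demo.group_by("g").agg(pl.col("v").mean())

# 2. Lazy: build a plan, optimize it, and evaluate on .collect().
lazy_out = demo.lazy().group_by("g").agg(pl.col("v").mean()).collect()

# 3. Lazy + streaming: the same plan, evaluated in chunks (for larger-than-RAM data).
streamed_out = demo.lazy().group_by("g").agg(pl.col("v").mean()).collect(streaming=True)

# All three give the same result (sorting first, since group_by does not preserve order).
eager_out.sort("g").to_dicts() == lazy_out.sort("g").to_dicts() == streamed_out.sort("g").to_dicts()
```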
## Polars dtypes
Polars has its own dtypes, called with `pl.<dtype>`; e.g. `pl.Int32`. A comprehensive list may be found [here](https://docs.pola.rs/py-Polars/html/reference/datatypes.html).
Here are the most common. Note that, unlike Pandas (<2.0.0), **all are native Polars dtypes**, and do not fall back to Python objects.
- Floats:
- `pl.Float64`: As the name suggests.
- Integers:
- `pl.Int64`: As the name suggests.
- Booleans:
- `pl.Boolean`: As the name suggests.
- Strings:
- `pl.Utf8`: The only string encoding supported by Polars.
- `pl.String`: Recently introduced as an alias to `pl.Utf8`.
- `pl.Categorical`: A string that is encoded as an integer.
- `pl.Enum`: Short for "enumerate". A categorical with a fixed set of values.
- Temporal:
- `pl.Date`: Date, without time.
- `pl.Datetime`: Date, with time.
- `pl.Time`: Time, without date.
- `pl.Duration`: Time difference.
- Nulls:
- `pl.Null`: Polars equivalent of Python's `None`.
- `np.nan`: The numpy "not a number" value. Essentially a float, and not a null.
- Nested:
- `pl.List`: A list of items.
- `pl.Array`: A fixed length list.
- `pl.Struct`: Think of it as a dict within a frame.
Things to note:
- Polars has no `pl.Int` dtype, nor `pl.Float`. You must specify the number of bits.
- Polars also supports `np.nan`(!), which is different from its `pl.Null` dtype. `np.nan` is a **float**, while `Null` marks a missing value (like Python's `None`).
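A minimal sketch (throwaway names) of a few dtypes, and of the `Null` vs `np.nan` distinction:
```{python}
#| label: dtypes-null-vs-nan-sketch
ints   = pl.Series("i", [1, 2, None], dtype=pl.Int32)          # None becomes a Null of dtype Int32
floats = pl.Series("f", [1.0, float("nan"), None])              # NaN is a float value; None is a Null
cats   = pl.Series("c", ["a", "b", "a"], dtype=pl.Categorical)  # strings encoded as integers
ints.null_count(), floats.null_count(), int(floats.is_nan().sum()), cats.dtype
```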
## The Polars API
- You will fall in love with it!
- Much more similar to [PySpark](https://blog.det.life/pyspark-or-polars-what-should-you-use-breakdown-of-similarities-and-differences-b261a825b9d6) than to Pandas. The Pandas API is simply not amenable to lazy evaluation. If you are familiar with PySpark, you should feel at home pretty fast.
### Some Design Principles {#sec-api-principles}
Here are some principles that will help you understand the API:
1. All columns are created equal. There are **no index** columns.
1. Operations on the columns of a dataframe are always part of a **context**. Contexts include:
- `pl.DataFrame.select()`: This is the most common context. Just like a SQL SELECT, it is used to select and transform columns.
- `pl.DataFrame.with_columns()`: Transform columns but return all columns in the frame; not just the transformed ones.
- `pl.DataFrame.group_by().agg()`: The `.agg()` context works like a `.select()` context, but it is used to apply operations on sub-groups of rows.
- `pl.DataFrame.filter()`: This is used to filter rows using expressions that evaluate to Booleans.
- `pl.SQLContext().execute()`: This is used if you prefer to use SQL syntax, instead of the Polars API.
1. Nothing happens "in-place".
1. Two-word methods are always lower-case and separated by underscores. E.g.: `.is_in()` instead of `.isin()`; `.is_null()` instead of `.isnull()`; `.group_by()` instead of `.groupby()` (since version 0.19).
1. Look for `pl.Expr()` methods so you can chain operations. E.g. `pl.col('a').add(pl.col('b'))` is better than `pl.col('a') + pl.col('b')`; the former can be further chained. And there is always `.pipe()`.
1. Polars was designed for operations within **columns**, not within rows. Operations within rows are nevertheless possible (see the sketch after this list) via:
- Polars functions with a `_horizontal()` suffix. Examples: `pl.sum_horizontal()`, `pl.mean_horizontal()`, `pl.rolling_sum_horizontal()`.
- Combining columns into a single column with nested dtype. Examples: `pl.list()`, `pl.array()`, `pl.struct()`.
1. Always **remember the class** you are operating on. Series, Expressions, DataFrames, and LazyFrames, have similar but-not-identical methods.
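To illustrate the row-wise options mentioned in the list above, here is a minimal sketch on a throwaway frame:
```{python}
#| label: row-wise-ops-sketch
row_demo = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
row_demo.with_columns(
    pl.sum_horizontal("a", "b").alias("a_plus_b"),   # row-wise sum across columns
    pl.struct("a", "b").alias("packed"),             # pack each row into a single struct value
)
```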
### Some Examples of the API
Here is an example to give you a little taste of what the API feels like.
```{python}
#| label: Polars-api
# Make some data
polars_frame = pl.DataFrame(make_data(100,4))
polars_frame.limit(5)
```
::: callout-note
What is the difference between `.head()` and `limit()`? For eager frames? For lazy frames?
:::
Can you parse the following in your head?
```{python}
(
    polars_frame
    .rename({'column_0':'group'})
    .with_columns(
        pl.col('group').cast(pl.Int32),
        pl.col('column_1').ge(0).alias('non-negative'),
    )
    .group_by('non-negative')
    .agg(
        pl.col('group').is_between(1,4).sum().alias('one-to-four'),
        pl.col('^column_[0-9]$').mean().name.suffix('_mean'),
    )
)
```
Ask yourself:
- What is `polars_frame`? Is it an eager or a lazy Polars dataframe?
- Why is `column_1_mean` indeed negative in the `non-negative=false` group?
- What is a Polars expression?
- What is a Polars series?
- How did I create the columns `column_1_mean`...`column_4_mean`?
- How would you have written this in Pandas?
```{python}
#| label: Polars-api-second-example
(
    polars_frame
    .rename({'column_0':'group'})
    .select(
        pl.col('group').mod(2),
        pl.mean_horizontal(
            pl.col('^column_[0-9]$')
        )
        .name.suffix('_mean')
    )
    .filter(
        pl.col('group').eq(1),
        pl.col('column_1_mean').gt(0)
    )
)
```
Try parsing the following in your head:
```{python}
polars_frame_2 = (
    pl.DataFrame(make_data(100,1))
    .select(
        pl.col('*').name.suffix('_second')
    )
)
(
    polars_frame
    .join(
        polars_frame_2,
        left_on = 'column_0',
        right_on = 'column_0_second',
        how = 'left',
        validate='1:1'
    )
)
```
## Getting Help
Before we dive in, you should be aware of the following references for further help:
1. A [github page](https://github.com/pola-rs/Polars). It is particularly important to subscribe to [release updates](https://github.com/pola-rs/Polars/releases).
2. A [user guide](https://pola-rs.github.io/Polars-book/user-guide/index.html).
3. A very active community on [Discord](https://discord.gg/4UfP5cfBE7).
4. The [API reference](https://pola-rs.github.io/Polars/py-Polars/html/reference/index.html).
5. A Stack-Overflow [tag](https://stackoverflow.com/questions/tagged/python-Polars).
6. Cheat-sheet for [Pandas users](https://www.rhosignal.com/posts/Polars-Pandas-cheatsheet/).
**Warning**: Be careful with AI assistants such as GitHub Copilot, TabNine, etc. Polars is still very new, and they may give you Pandas completions instead of Polars ones.
# Polars Series {#sec-series}
A Polars series looks and feels a lot like a Pandas series.
Getting used to Polars Series will thus give you bad intuitions when you move on to Polars Expressions.
Construct a series
```{python}
#| label: make-a-series
s = pl.Series("a", [1, 2, 3])
s
```
Make a Pandas series for later comparisons.
```{python}
#| label: make-a-Pandas-series
s_Pandas = pd.Series([1, 2, 3], name = "a")
s_Pandas
```
Notice that even the printing in notebooks is different.
Now verify the type
```{python}
#| label: check-series-type
type(s)
```
```{python}
#| label: check-Pandas-series-type
type(s_Pandas)
```
```{python}
#| label: check-series-dtype
s.dtype
```
```{python}
#| label: check-Pandas-series-dtype
s_Pandas.dtype
```
Renaming a series; this will be very useful when operating on dataframe columns.
```{python}
#| label: rename-series
s.alias("b")
```
Constructing a series of floats, for later use.
```{python}
#| label: make-a-float-series
f = pl.Series("a", [1., 2., 3.])
f
```
```{python}
#| label: check-float-series-dtype
f.dtype
```
## Export To Other Python Objects
The current section deals with exports to other Python objects, **in memory**. See @sec-disk-export for exporting to disk.
Export to Polars DataFrame.
```{python}
#| label: series-to-Polars-dataframe
s.to_frame()
```
Export to Python list.
```{python}
#| label: series-to-list
s.to_list()
```
Export to Numpy array.
```{python}