I'm opening this issue to get some community insight/help on this. As I understand it, starting with a Dask bag is the most natural first step.
I understand a bag as a partitioned data structure on which we can apply functional combinators like `map`, `filter`, `fold`, etc. Given polars' memory layout, it seems most natural for each partition to hold columnar memory, or batches of columnar memory. Row-oriented partitions would be very inefficient.
This might be a context switch, since Spark RDDs seem to operate on rows, and pandas is also somewhat row-oriented: its block manager stores blocks as C-ordered numpy arrays.
Another interesting direction would be to partition the lazy polars queries themselves and hold no data at all, although the dask graph already does something similar.
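For illustration, the "partitions hold queries, not data" idea could be sketched with plain closures. Everything here (`make_lazy_partition`, `collect`) is hypothetical naming, not polars or dask API; the point is only that no partition materializes data until collection:

```python
# Hypothetical sketch: each partition carries a query (a thunk),
# not data, mirroring the idea of partitioning lazy polars queries.

def make_lazy_partition(source_rows, transform):
    # returns a zero-argument closure; nothing is computed here
    return lambda: transform(source_rows)

partitions = [
    make_lazy_partition(range(0, 5), lambda rows: [r * r for r in rows]),
    make_lazy_partition(range(5, 10), lambda rows: [r * r for r in rows]),
]

def collect(parts):
    # only now does each partition execute its query and produce data
    return [p() for p in parts]

results = collect(partitions)
print(results[0])  # [0, 1, 4, 9, 16]
```

In a real integration, each thunk would presumably wrap a polars `LazyFrame` over a slice of the source, and the scheduler would decide when and where to collect it.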