Dask bag #1

@ritchie46

Description

I'm opening this issue to gather some community insight and help. As I understand it, starting with a Dask bag is the most natural first step.

I understand a bag as a partitioned data structure on which we can apply functional combinators like map, filter, fold, etc. Given polars' memory layout, it seems most natural for the partitions to hold columnar memory, or batches of columnar memory. Rows are very, very inefficient. A rough sketch of what that could look like is below.
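
To make this concrete, here is a minimal sketch of the idea; the `ColumnarBag` name and its API are hypothetical, purely to illustrate partition-wise combinators where each partition is a whole columnar batch (a polars DataFrame) rather than a set of rows:

```python
from typing import Callable, List

import polars as pl


class ColumnarBag:
    """Hypothetical bag: a list of columnar partitions with functional combinators."""

    def __init__(self, partitions: List[pl.DataFrame]):
        self.partitions = partitions

    def map(self, fn: Callable[[pl.DataFrame], pl.DataFrame]) -> "ColumnarBag":
        # Apply fn to every partition; fn operates on columnar batches, not rows.
        return ColumnarBag([fn(p) for p in self.partitions])

    def filter(self, predicate: pl.Expr) -> "ColumnarBag":
        # Push the predicate into each partition as a vectorized filter.
        return ColumnarBag([p.filter(predicate) for p in self.partitions])

    def collect(self) -> pl.DataFrame:
        # Concatenate all partitions back into a single DataFrame.
        return pl.concat(self.partitions)


bag = ColumnarBag([
    pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}),
    pl.DataFrame({"a": [5, 6], "b": [7.0, 8.0]}),
])
result = (
    bag.filter(pl.col("a") > 1)
    .map(lambda df: df.with_columns(pl.col("b") * 2))
    .collect()
)
```

In a real Dask integration the list of partitions would of course be tasks in the Dask graph rather than in-memory DataFrames, but the combinators would stay partition-wise.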

This might be a context switch, as Spark RDDs seem to work on rows, and pandas is also somewhat row-oriented (C-ordered numpy arrays plus its block manager).

It might also be interesting to partition the lazy polars queries and hold no data at all, although that is something the Dask graph already does.
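
A rough sketch of that "no data at all" idea, assuming the data lives in per-partition parquet files (the paths here are hypothetical): each partition is just a lazy polars query plan, and nothing is materialized until collection.

```python
import polars as pl

# Hypothetical per-partition files; scan_parquet builds a lazy plan, reads nothing.
paths = ["data/part-0.parquet", "data/part-1.parquet"]
lazy_partitions = [pl.scan_parquet(p) for p in paths]

# The same query plan is composed for every partition without touching any data.
plans = [lf.filter(pl.col("a") > 1).select(["a", "b"]) for lf in lazy_partitions]

# Execution only happens here, one partition at a time.
result = pl.concat([plan.collect() for plan in plans])
```

The partitioned structure then only holds query plans, which is close to what a Dask graph of delayed tasks would represent anyway.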
