serde_arrow Roadmap

This is the Roadmap for the serde_arrow project, with notes on implementation details.

v0.1.0 [0/4]

In v0.1.0 we aim to achieve full support for Arrow’s Encapsulated Message Format. Here the Roadmap is linear. Initially, we will be working on primitives of the format and will gradually work our way up to consumable output and input.

HOLD Primitive Logical Types [5/17]

An Arrow Array can store values of a number of types, called logical types. A clear list can be found here. Before we work with Arrays we need to be able to serialize Erlang native data types into (and deserialize from) their Arrow supported counterparts. We don’t need to implement all the types at once - things like timestamps can wait! We will however need to be able to represent Ints and Floats immediately.

These need to be immediately supported for development work:

  • [X] Boolean
  • [X] Int 8/16/32/64
  • [X] UInt 8/16/32/64
  • [X] Float 32/64

These can have support gradually added:

  • [X] Float16
  • [ ] Decimal128
  • [ ] Decimal256
  • [ ] Date 32/64
  • [ ] Time 32/64
  • [ ] Timestamp
  • [ ] Duration
  • [ ] Interval
  • [ ] Fixed Sized Binary
  • [ ] Binary
  • [ ] Large Binary
  • [ ] UTF 8
  • [ ] Large UTF 8
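As a rough illustration of what serializing a primitive means at the byte level, here is a minimal sketch. It is written in Python purely for illustration and is not the serde_arrow implementation; the `FORMATS` table and the `encode`/`decode` helpers are hypothetical names. It assumes little-endian byte order, which is Arrow's default.

```python
import struct

# Hypothetical table: logical type name -> struct format code.
# "<" selects little-endian with standard sizes and no padding.
FORMATS = {
    "int8": "<b", "int16": "<h", "int32": "<i", "int64": "<q",
    "uint8": "<B", "uint16": "<H", "uint32": "<I", "uint64": "<Q",
    "float16": "<e", "float32": "<f", "float64": "<d",
}

def encode(type_name, value):
    """Serialize one native value into its fixed-width byte representation."""
    return struct.pack(FORMATS[type_name], value)

def decode(type_name, data):
    """Deserialize the fixed-width bytes back into a native value."""
    return struct.unpack(FORMATS[type_name], data)[0]
```

For example, `encode("uint16", 258)` yields the two bytes `02 01`, and `decode` reverses it.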

Physical Memory Layout aids [3/3]

Apart from the actual data, the memory layout includes things like a null count, a validity bitmap, and offsets. We need to have ways to represent and work with these before we can work with the physical layouts themselves.
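To make these aids concrete, the following is a small sketch of a validity bitmap and null count, again in Python for illustration only. The function names are hypothetical. It assumes Arrow's least-significant-bit numbering (slot `i` lives in byte `i // 8`, bit `i % 8`, and a set bit means "valid") and leaves out the buffer padding that the format recommends.

```python
def validity_bitmap(values):
    """Build a validity bitmap: bit i is 1 when values[i] is not null."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot, rounded up
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def null_count(values):
    """Count the null slots."""
    return sum(value is None for value in values)

def is_valid(bitmap, i):
    """Check whether slot i is marked valid in the bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))
```

With `[1, None, 3, 4]`, bits 0, 2 and 3 are set, so the bitmap is the single byte `0x0d` and the null count is 1.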

HOLD Physical Memory Layout [4/9]

These are the actual Memory Layouts we need to support. Not all of them need to be supported immediately. The Layouts from Primitive Layout to Struct Layout (inclusive) need immediate support for development. The rest can have support added gradually.

Fixed-Size Primitive Layout

Next we need to support the Fixed-Size Primitive Layout, the simplest (other than the Null Layout) of the Physical Memory Layouts (which technically are nested logical types).
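A minimal sketch of this layout, assuming an Int32 array by default: the array is a validity bitmap plus one contiguous buffer of fixed-width values. The `primitive_array` helper and the dictionary it returns are hypothetical, and null slots are written as zero here only because the value stored in a null slot is not significant.

```python
import struct

def primitive_array(values, fmt="<i"):
    """Lay out a list of numbers (or None) as validity bitmap + data buffer."""
    validity = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, value in enumerate(values):
        if value is None:
            data += struct.pack(fmt, 0)       # placeholder bytes for a null slot
        else:
            validity[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data += struct.pack(fmt, value)
    return {
        "length": len(values),
        "null_count": values.count(None),
        "validity": bytes(validity),
        "data": bytes(data),
    }
```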

Variable Binary Layout

We now need to support the Variable Binary Layout. This may require supporting the Binary dtype.
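As a sketch of the idea, a variable binary array can be held as an offsets list with one more entry than there are slots, plus a single concatenated data buffer. Slot `i` spans `data[offsets[i]:offsets[i + 1]]`. The `variable_binary_array` name is hypothetical, and the offsets are kept as a plain Python list rather than an encoded Int32 buffer.

```python
def variable_binary_array(values):
    """Lay out a list of bytes (or None) as validity bitmap + offsets + data."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)
            data += value
        # A null slot adds no bytes, so its offset repeats the previous one.
        offsets.append(len(data))
    return {"validity": bytes(validity), "offsets": offsets, "data": bytes(data)}
```

For `[b"joe", None, b"mark"]` this gives offsets `[0, 3, 3, 7]` and data `b"joemark"`.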

Fixed-Size List Layout

We need to support the Fixed-Size List Layout, which is the simpler of the two List Layouts. Since this layout supports nesting, we also need to add support for representing and validating the internal list length.
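The length validation mentioned above could look roughly like the sketch below. The `fixed_size_list` helper is hypothetical, and the child array is kept as a plain Python list for brevity. Note that a null list still occupies `list_size` slots in the child, so the child length is always `list_size * length`.

```python
def fixed_size_list(lists, list_size):
    """Flatten equally sized lists into one child array with a validity bitmap."""
    validity = bytearray((len(lists) + 7) // 8)
    child = []
    for i, item in enumerate(lists):
        if item is None:
            child.extend([None] * list_size)  # null list still takes up its slots
            continue
        if len(item) != list_size:
            raise ValueError(f"slot {i}: expected {list_size} items, got {len(item)}")
        validity[i // 8] |= 1 << (i % 8)
        child.extend(item)
    return {"list_size": list_size, "validity": bytes(validity), "child": child}
```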

Variable-Size List Layout

We need to support the Variable-size List Layout. We will also need to add support for representing offsets as the list size is variable.
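The role of the offsets can be sketched as follows: the lists are flattened into one child array, and list `i` is recovered as `child[offsets[i]:offsets[i + 1]]`. Both helper names are hypothetical, and the child is again a plain Python list rather than an encoded Arrow array.

```python
def variable_size_list(lists):
    """Flatten lists of varying length into a child array plus offsets."""
    validity = bytearray((len(lists) + 7) // 8)
    offsets = [0]
    child = []
    for i, item in enumerate(lists):
        if item is not None:
            validity[i // 8] |= 1 << (i % 8)
            child.extend(item)
        offsets.append(len(child))  # null and empty lists repeat the offset
    return {"validity": bytes(validity), "offsets": offsets, "child": child}

def get_list(array, i):
    """Recover list i, or None if the slot is null."""
    if not array["validity"][i // 8] & (1 << (i % 8)):
        return None
    return array["child"][array["offsets"][i]:array["offsets"][i + 1]]
```

The validity bitmap is what distinguishes a null list from an empty one, since both produce a repeated offset.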

Struct Layout

We need to support the Struct Layout. Because each field of a struct can have a different type, we will also need to add support for representing and validating multiple child types within a single Layout.
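One way to picture this is a struct array as one child array per field, plus a validity bitmap for the struct slots themselves. In the sketch below, `struct_array` is a hypothetical helper, and Python classes stand in for Arrow field types purely to show where per-field validation would happen.

```python
def struct_array(rows, fields):
    """Split row dicts into one child column per field, checking field types.

    `fields` maps each field name to the Python type its values must have.
    """
    validity = bytearray((len(rows) + 7) // 8)
    children = {name: [] for name in fields}
    for i, row in enumerate(rows):
        if row is None:
            for name in fields:
                children[name].append(None)  # children keep the same length
            continue
        validity[i // 8] |= 1 << (i % 8)
        for name, expected in fields.items():
            value = row.get(name)
            if value is not None and not isinstance(value, expected):
                raise TypeError(f"slot {i}, field {name!r}: expected {expected.__name__}")
            children[name].append(value)
    return {"length": len(rows), "validity": bytes(validity), "children": children}
```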

Dense Union Layout

We need to support the Dense Union Layout.

Sparse Union Layout

We need to support the Sparse Union Layout.

Dictionary-encoded Layout

We need to support the Dictionary-encoded Layout.

Run-End Encoded Layout

We need to support the Run-End Encoded Layout.

Encapsulated Message Format [0/4]

We now have all the prerequisites to support the Encapsulated Message Format. In the actual format itself, we need to add support for the message body, which can be one of Schema, RecordBatch and DictionaryBatch.

Implementation details are yet to be decided due to Arrow’s dependency on Flatbuffers. The details for all these Message Body types and the actual format’s implementation will be added later.
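The outer framing of a message does not depend on how the Flatbuffers metadata is produced, so it can be sketched already. The sketch below follows our reading of the Arrow columnar format specification: a 32-bit continuation indicator `0xFFFFFFFF`, a little-endian 32-bit metadata size that includes padding, the metadata bytes padded to an 8-byte boundary, then the body. The `encapsulate` helper is hypothetical and treats the metadata as opaque bytes.

```python
import struct

CONTINUATION = 0xFFFFFFFF  # marks a valid message

def encapsulate(metadata, body=b""):
    """Frame opaque metadata bytes and a body as one encapsulated message."""
    if len(body) % 8 != 0:
        raise ValueError("body length must be a multiple of 8 bytes")
    # The 8-byte prefix is already aligned, so pad the metadata itself to 8.
    padding = -len(metadata) % 8
    padded = metadata + b"\x00" * padding
    prefix = struct.pack("<Ii", CONTINUATION, len(padded))
    return prefix + padded + body
```

For example, framing the three bytes `b"abc"` with no body gives a 16-byte message whose size field reads 8.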

Schema

RecordBatch

DictionaryBatch

Actual Encapsulated Message Format itself

v0.2.0 [0/1]

Write Roadmap

Write a Roadmap for v0.2.0. By v0.2.0, we need to support Arrow IPC and Arrow Flight.