The following is a Roadmap (with notes of implementation details) of the
serde_arrow
project.
In v0.1.0
we aim to achieve full support for Arrow’s Encapsulated Message
Format. Here the Roadmap is linear. Initially, we will be working on
primitives of the format and will gradually work our way up to consumable output
and input.
An Arrow Array can store values of a number of types, called logical types. A clear list can be found here. Before we work with Arrays we need to be able to serialize Erlang native data types into (and deserialize from) their Arrow supported counterparts. We don’t need implement all the types at once - things like timestamps can wait! We will however need to able represent Ints and Floats immediately.
These need to be immediately supported for development work:
- [X] Boolean
- [X] Int 8/16/32/64
- [X] UInt 8/16/32/64
- [X] Float 32/64
These can have support gradually added:
- [X] Float16
- [ ] Decimal128
- [ ] Decimal256
- [ ] Date 32/64
- [ ] Time 32/64
- [ ] Timestamp
- [ ] Duration
- [ ] Interval
- [ ] Fixed Sized Binary
- [ ] Binary
- [ ] Large Binary
- [ ] UTF 8
- [ ] Large UTF 8
Apart from the actual data, the memory layout includes things like a null count, a validity bitmaps, and offsets. We need to have ways to represent and work with these before we can work with physical layouts themselves
- [X] Null count
- [X] Validity bitmaps
- [X] Array length
These are the actual Memory Layouts we need to support. Not all of them need to be supported immediately. The Layouts from Primitive Layout to Struct Layout (inclusive) need immediate support for development. The rest can have support added gradually.
Next we need to support the Fixed-Size Primitive Layout, the simplest (other than the Null Layout) of the Physical Memory Layouts (which technically are nested logical types).
We now need to support the Variable Binary Layout. This may require supporting the Binary dtype.
We need to support the Fixed-Size List Layout which is the simpler List Layout. Since this is layout supports nesting, we need to also add support for representing and validating the internal list length.
We need to support the Variable-size List Layout. We will also need to add support for representing offsets as the list size is variable.
We need to support the Struct Layout. We will also need to add support for representing and validating multiple types for Layout in order to represent the different fields with varying types.
We need to support the Dense Union Layout.
We need to support the Sparse Union Layout.
We need to support the Dictionary-encoded Layout.
We need to support the Run-End Encoded Layout.
We now have all the prerequisites to support the Encapsulated Message Format. In the actual format itself, we need to add support for the message body, which can be one of Schema, RecordBatch and DictionaryBatch.
Implementation details are yet to decided due to Arrow’s dependency on Flatbuffers. The details for all these Message Body types and the actual format’s implementation will be added later.
Write a Roadmap for v0.2.0
. By v0.2.0
, we need to support Arrow IPC and
Apache Flight.