Skip to content

Commit

Permalink
Merge pull request #1 from Swoorup/sj-migrate-to-arrow
Browse files Browse the repository at this point in the history
Ported arrow2-convert to use arrow-rs library.
  • Loading branch information
Swoorup authored Mar 28, 2024
2 parents 420434a + 1e4c016 commit cc32c2f
Show file tree
Hide file tree
Showing 47 changed files with 2,108 additions and 1,824 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,9 @@ jobs:
# We keep these separate since sometimes the derive fails when
# independently built.
run: |
cargo build -p arrow2_convert_derive
cargo build -p arrow2_convert
cargo build -p arrow2_convert_example_simple
cargo build -p arrow_convert_derive
cargo build -p arrow_convert
cargo build -p arrow_convert_example_simple
clippy:
name: Clippy
Expand Down
5 changes: 3 additions & 2 deletions Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
[workspace]
resolver = "2"
members = [
"arrow2_convert",
"arrow2_convert_derive",
"arrow_convert",
"arrow_convert_derive",
"examples/simple"
]
47 changes: 24 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,20 +1,20 @@
# arrow2_convert
# arrow_convert

Provides an API on top of [`arrow2`](https://github.com/jorgecarleitao/arrow2) to convert between rust types and Arrow.
Provides an API on top of [`arrow-rs`](https://github.com/apache/arrow-rs) to convert between rust types and Arrow. This repository was ported from the directly converted from [`arrow2-convert`](https://github.com/DataEngineeringLabs/arrow2-convert) library for use with `arrow-rs`.

The Arrow ecosystem provides many ways to convert between Arrow and other popular formats across several languages. This project aims to serve the need for rust-centric data pipelines to easily convert to/from Arrow with strong typing and arbitrary nesting.

## Example

The example below performs a round trip conversion of a struct with a single field.

Please see the [complex_example.rs](https://github.com/DataEngineeringLabs/arrow2-convert/blob/main/arrow2_convert/tests/complex_example.rs) for usage of the full functionality.
Please see the [complex_example.rs](https://github.com/Swoorup/arrow-convert/blob/main/arrow_convert/tests/complex_example.rs) for usage of the full functionality.

```rust
/// Simple example

use arrow2::array::Array;
use arrow2_convert::{deserialize::TryIntoCollection, serialize::TryIntoArrow, ArrowField, ArrowSerialize, ArrowDeserialize};
use arrow::array::{Array, ArrayRef};
use arrow_convert::{deserialize::TryIntoCollection, serialize::TryIntoArrow, ArrowField, ArrowSerialize, ArrowDeserialize};

#[derive(Debug, Clone, PartialEq, ArrowField, ArrowSerialize, ArrowDeserialize)]
pub struct Foo {
Expand All @@ -30,11 +30,11 @@ fn main() {
];

// serialize to an arrow array. try_into_arrow() is enabled by the TryIntoArrow trait
let arrow_array: Box<dyn Array> = original_array.try_into_arrow().unwrap();
let arrow_array: ArrayRef = original_array.try_into_arrow().unwrap();

// which can be cast to an Arrow StructArray and be used for all kinds of IPC, FFI, etc.
// supported by `arrow2`
let struct_array= arrow_array.as_any().downcast_ref::<arrow2::array::StructArray>().unwrap();
// supported by `arrow`
let struct_array= arrow_array.as_any().downcast_ref::<arrow::array::StructArray>().unwrap();
assert_eq!(struct_array.len(), 3);

// deserialize back to our original vector via TryIntoCollection trait.
Expand All @@ -49,9 +49,9 @@ Types that implement the `ArrowField`, `ArrowSerialize` and `ArrowDeserialize` t

The `ArrowField`, `ArrowSerialize` and `ArrowDeserialize` derive macros can be used to generate implementations of these traits for structs and enums. Custom implementations can also be defined for any type that needs to convert to/from Arrow by manually implementing the traits.

For serializing to arrow, `TryIntoArrow::try_into_arrow` can be used to serialize any iterable into an `arrow2::Array` or a `arrow2::Chunk`. `arrow2::Array` represents the in-memory Arrow layout. `arrow2::Chunk` represents a column group and can be used with `arrow2` API for other functionality such converting to parquet and arrow flight RPC.
For serializing to arrow, `TryIntoArrow::try_into_arrow` can be used to serialize any iterable into an `arrow::Array` or a `arrow::Chunk`. `arrow::Array` represents the in-memory Arrow layout. `arrow::Chunk` represents a column group and can be used with `arrow` API for other functionality such converting to parquet and arrow flight RPC.

For deserializing from arrow, the `TryIntoCollection::try_into_collection` can be used to deserialize from an `arrow2::Array` representation into any container that implements `FromIterator`.
For deserializing from arrow, the `TryIntoCollection::try_into_collection` can be used to deserialize from an `arrow::Array` representation into any container that implements `FromIterator`.

### Default implementations

Expand All @@ -66,7 +66,7 @@ Default implementations of the above traits are provided for the following:
- [`chrono::NaiveDate`], [`chrono::NaiveDateTime`]
- Option<T> if T implements `ArrowField`
- Vec<T> if T implements `ArrowField`
- Large Arrow types [`LargeBinary`], [`LargeString`], [`LargeList`] are supported via the `type` attribute. Please see the [complex_example.rs](./arrow2_convert/tests/complex_example.rs) for usage.
- Large Arrow types [`LargeBinary`], [`LargeString`], [`LargeList`] are supported via the `type` attribute. Please see the [complex_example.rs](./arrow_convert/tests/complex_example.rs) for usage.
- Fixed size types [`FixedSizeBinary`], [`FixedSizeList`] are supported via the `FixedSizeVec` type override.
- Note: nesting of [`FixedSizeList`] is not supported.

Expand All @@ -83,8 +83,8 @@ i128 represents a decimal number and requires the precision and scale to be spec
For example to use `i128` as a field in a struct:

```rust
use arrow2_convert::field::I128;
use arrow2_convert::ArrowField;
use arrow_convert::field::I128;
use arrow_convert::ArrowField;

#[derive(Debug, ArrowField)]
struct S {
Expand All @@ -96,16 +96,17 @@ struct S {
A `vec<i128>` can be converted. to/from arrow by using the `arrow_serialize_to_mutable_array` and `arrow_array_deserialize_iterator_as_type` methods.

```rust
use arrow2::array::{Array, MutableArray};
use arrow2_convert::serialize::arrow_serialize_to_mutable_array;
use arrow2_convert::deserialize::arrow_array_deserialize_iterator_as_type;
use arrow2_convert::field::I128;
use arrow::array::{Array, ArrayBuilder, ArrayRef};
use arrow_convert::serialize::arrow_serialize_to_mutable_array;
use arrow_convert::deserialize::arrow_array_deserialize_iterator_as_type;
use arrow_convert::field::I128;
use std::borrow::Borrow;
use std::sync::Arc;

fn convert_i128() {
let original_array = vec![1 as i128, 2, 3];
let b: Box<dyn Array> = arrow_serialize_to_mutable_array::<_, I128<32,32>, _>(
&original_array).unwrap().as_box();
let b: ArrayRef = Arc::new(arrow_serialize_to_mutable_array::<_, I128<32,32>, _>(
&original_array).unwrap().finish());
let round_trip: Vec<i128> = arrow_array_deserialize_iterator_as_type::<_, I128<32,32>>(
b.borrow()).unwrap().collect();
assert_eq!(original_array, round_trip);
Expand All @@ -123,25 +124,25 @@ Since the Arrow format only supports one level of validity, nested option types
This is not an exhaustive list. Please open an issue if you need a feature.
## Memory

Pass-thru conversions perform a single memory copy. Deserialization performs a copy from arrow2 to the destination. Serialization performs a copy from the source to arrow2. In-place deserialization is theoretically possible but currently not supported.
Pass-thru conversions perform a single memory copy. Deserialization performs a copy from arrow to the destination. Serialization performs a copy from the source to arrow. In-place deserialization is theoretically possible but currently not supported.

## Internals

### Similarities with Serde

The design is inspired by serde. The `ArrowSerialize` and `ArrowDeserialize` are analogs of serde's `Serialize` and `Deserialize` respectively.

However unlike serde's traits provide an exhaustive and flexible mapping to the serde data model, arrow2_convert's traits provide a much more narrower mapping to arrow2's data structures.
However unlike serde's traits provide an exhaustive and flexible mapping to the serde data model, arrow_convert's traits provide a much more narrower mapping to arrow's data structures.

Specifically, the `ArrowSerialize` trait provides the logic to serialize a type to the corresponding `arrow2::array::MutableArray`. The `ArrowDeserialize` trait deserializes a type from the corresponding `arrow2::array::ArrowArray`.
Specifically, the `ArrowSerialize` trait provides the logic to serialize a type to the corresponding `arrow::array::ArrayBuilder`. The `ArrowDeserialize` trait deserializes a type from the corresponding `arrow::array::ArrowArray`.

### Workarounds

Features such as partial implementation specialization and generic associated types (currently only available in nightly builds) can greatly simplify the underlying implementation.

For example custom types need to explicitly enable Vec<T> serialization via the `arrow_enable_vec_for_type` macro on the primitive type. This is needed since Vec<u8> is a special type in Arrow, but without implementation specialization there's no way to special-case it.

Availability of generaic associated types would simplify the implementation for large and fixed types, since a generic MutableArray can be defined. Ideally for code reusability, we wouldn’t have to reimplement `ArrowSerialize` and `ArrowDeserialize` for large and fixed size types since the primitive types are the same. However, this requires the trait functions to take a generic bounded mutable array as an argument instead of a single array type. This requires the `ArrowSerialize` and `ArrowDeserialize` implementations to be able to specify the bounds as part of the associated type, which is not possible without generic associated types.
Availability of generaic associated types would simplify the implementation for large and fixed types, since a generic ArrayBuilder can be defined. Ideally for code reusability, we wouldn’t have to reimplement `ArrowSerialize` and `ArrowDeserialize` for large and fixed size types since the primitive types are the same. However, this requires the trait functions to take a generic bounded mutable array as an argument instead of a single array type. This requires the `ArrowSerialize` and `ArrowDeserialize` implementations to be able to specify the bounds as part of the associated type, which is not possible without generic associated types.

As a result, we’re forced to sacrifice code reusability and introduce a little bit of complexity by providing separate `ArrowSerialize` and `ArrowDeserialize` implementations for large and fixed size types via placeholder structures. This also requires introducing the `Type` associated type to `ArrowField` so that the arrow type can be overriden via a macro field attribute without affecting the actual type.

Expand Down
34 changes: 0 additions & 34 deletions arrow2_convert/Cargo.toml

This file was deleted.

Loading

0 comments on commit cc32c2f

Please sign in to comment.