Arrow read support (read directly as pyarrow.Table or polars.DataFrame)
#2776
New Feature: PyArrow and Polars Output Formats
ArcticDB now supports returning data in Apache Arrow-based formats.
Overview
In addition to the existing `PANDAS` output format (which remains the default), you can now specify `PYARROW` or `POLARS` output formats for read operations. Both formats use Apache Arrow's columnar memory layout, the same layout that Arrow-native libraries such as `polars` work with.
Usage
Output format can be configured at three levels:
1. Arctic instance level (default for all libraries):
2. Library level (default for all reads from this library):
3. Per read operation (most granular control):
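The three levels above suggest a simple precedence: a per-read setting overrides the library default, which in turn overrides the Arctic instance default, with `PANDAS` as the fallback. The sketch below illustrates that resolution order only. It assumes this precedence from the wording above, it is not ArcticDB's implementation, and both the stand-in `OutputFormat` enum and the `resolve_output_format` helper are hypothetical names defined here so the example runs without ArcticDB installed.

```python
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    # Stand-in for the OutputFormat options named in this PR,
    # redefined locally so the sketch has no ArcticDB dependency.
    PANDAS = "pandas"
    PYARROW = "pyarrow"
    POLARS = "polars"


def resolve_output_format(
    per_read: Optional[OutputFormat] = None,
    library_default: Optional[OutputFormat] = None,
    instance_default: Optional[OutputFormat] = None,
) -> OutputFormat:
    """Hypothetical helper: return the most specific setting provided.

    Assumed order: per-read, then library, then Arctic instance,
    falling back to PANDAS when nothing is configured.
    """
    for candidate in (per_read, library_default, instance_default):
        if candidate is not None:
            return candidate
    return OutputFormat.PANDAS


# Instance default is PYARROW, the library overrides it with POLARS,
# and a single read asks for PANDAS explicitly.
instance_only = resolve_output_format(instance_default=OutputFormat.PYARROW)
library_wins = resolve_output_format(
    library_default=OutputFormat.POLARS,
    instance_default=OutputFormat.PYARROW,
)
read_wins = resolve_output_format(
    per_read=OutputFormat.PANDAS,
    library_default=OutputFormat.POLARS,
    instance_default=OutputFormat.PYARROW,
)
```

Using `None` to mean "not set at this level" keeps an explicit `PANDAS` request distinguishable from the absence of a setting.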
Output Format Options
- `OutputFormat.PANDAS` (default): Returns `pandas.DataFrame` backed by numpy arrays
- `OutputFormat.PYARROW`: Returns `pyarrow.Table` objects
- `OutputFormat.POLARS`: Returns `polars.DataFrame` objects
String Format Customization
When using Arrow-based output formats, you can customize how string columns are encoded using `ArrowOutputStringFormat`:
- `LARGE_STRING` (default): PyArrow: `pa.large_string()`, Polars: `pl.String`
- `SMALL_STRING`: PyArrow: `pa.string()`, Polars: `pl.String`
- `CATEGORICAL` / `DICTIONARY_ENCODED`: PyArrow: `pa.dictionary(pa.int32(), pa.large_string())`, Polars: `pl.Categorical`
Example: Using String Formats
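To make the `CATEGORICAL` / `DICTIONARY_ENCODED` option more concrete, here is a minimal, stdlib-only sketch of what dictionary encoding does to a string column: each distinct string is stored once, and the column itself becomes a list of integer indices into that dictionary (the `pa.int32()` part of the Arrow type above). This illustrates the general technique, not ArcticDB's or Arrow's actual implementation, and `dictionary_encode` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[int], List[str]]:
    """Hypothetical helper: split a string column into indices and a dictionary."""
    dictionary: List[str] = []       # distinct strings, in first-seen order
    positions: Dict[str, int] = {}   # string -> its index in `dictionary`
    indices: List[int] = []          # one integer per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


indices, dictionary = dictionary_encode(["buy", "sell", "buy", "buy"])
# indices is [0, 1, 0, 0] and dictionary is ["buy", "sell"]

# Decoding looks each index up in the dictionary and recovers the original column.
decoded = [dictionary[i] for i in indices]
```

This kind of encoding tends to pay off when a column has few distinct values relative to its length; for mostly unique strings, the plain string formats avoid the extra indirection.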
Supported Operations
All read operations support the new output formats:
- `read()`
- `read_batch()`
- `read_batch_and_join()`
- `head()`
- `tail()`
- Lazy reads (`lazy=True`)
Notes
- The default output format remains `PANDAS` for backward compatibility
- `PYARROW` and `POLARS` formats share the same underlying Arrow memory layout