Arrow read support (read directly as `pyarrow.Table` or `polars.DataFrame`) #2776

IvoDD · 2025-11-21T15:38:59Z

New Feature: PyArrow and Polars Output Formats

ArcticDB now supports returning data in Apache Arrow-based formats.

Overview

In addition to the existing PANDAS output format (which remains the default), you can now specify PYARROW or POLARS output formats for read operations. Both formats use Apache Arrow's columnar memory layout, enabling:

Better performance: especially for dataframes with many string columns
Integration with modern systems: native interoperability with Arrow-based data processing tools such as Polars

Usage

Output format can be configured at three levels:

1. Arctic instance level (default for all libraries):

import arcticdb as adb
from arcticdb import OutputFormat

# Set default for all libraries from this Arctic instance
ac = adb.Arctic(uri, output_format=OutputFormat.PYARROW)

2. Library level (default for all reads from this library):

# Set at library creation
lib = ac.create_library("my_lib", output_format=OutputFormat.POLARS)

# Or when getting an existing library
lib = ac.get_library("my_lib", output_format=OutputFormat.PYARROW)

3. Per read operation (most granular control):

# Override on a per-read basis
result = lib.read("symbol", output_format=OutputFormat.PYARROW)
table = result.data  # Returns pyarrow.Table

result = lib.read("symbol", output_format=OutputFormat.POLARS)
df = result.data  # Returns polars.DataFrame
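
The three levels above imply a most-specific-wins rule. The following is a minimal sketch of that rule, not ArcticDB's actual implementation. The `OutputFormat` enum is a local stand-in so the snippet runs without ArcticDB, and `resolve_output_format` is a hypothetical helper. It assumes a per-read setting overrides the library default, which overrides the Arctic-level default, with PANDAS as the final fallback.

```python
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    """Local stand-in for arcticdb.OutputFormat (illustration only)."""
    PANDAS = "PANDAS"
    PYARROW = "PYARROW"
    POLARS = "POLARS"


def resolve_output_format(
    arctic_level: Optional[OutputFormat] = None,
    library_level: Optional[OutputFormat] = None,
    read_level: Optional[OutputFormat] = None,
) -> OutputFormat:
    """Hypothetical helper: return the most specific setting that was given."""
    # Check from most specific (read) to least specific (Arctic instance).
    for candidate in (read_level, library_level, arctic_level):
        if candidate is not None:
            return candidate
    # Nothing configured anywhere: PANDAS remains the default.
    return OutputFormat.PANDAS


# Arctic default is PYARROW, the library overrides it with POLARS,
# and a single read asks for PANDAS.
effective = resolve_output_format(
    arctic_level=OutputFormat.PYARROW,
    library_level=OutputFormat.POLARS,
    read_level=OutputFormat.PANDAS,
)
```

With these inputs `effective` is `OutputFormat.PANDAS`; dropping the `read_level` argument would yield `OutputFormat.POLARS`.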

Output Format Options

OutputFormat.PANDAS (default): Returns pandas.DataFrame backed by NumPy arrays
OutputFormat.PYARROW: Returns pyarrow.Table objects
OutputFormat.POLARS: Returns polars.DataFrame objects

String Format Customization

When using Arrow-based output formats, you can customize how string columns are encoded using ArrowOutputStringFormat:

LARGE_STRING (default):

64-bit variable-size encoding
Best for general-purpose use and large string data
PyArrow: pa.large_string(), Polars: pl.String

SMALL_STRING:

32-bit variable-size encoding
More memory efficient for smaller string data (each offset takes 4 bytes instead of 8)
PyArrow: pa.string(), Polars: pl.String

CATEGORICAL / DICTIONARY_ENCODED:

Dictionary-encoded with int32 indices
Best for low cardinality columns (few unique values repeated many times)
Deduplicates strings for significant memory savings
PyArrow: pa.dictionary(pa.int32(), pa.large_string()), Polars: pl.Categorical
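
To make the memory trade-off between the three encodings concrete, here is a rough back-of-the-envelope sketch in plain Python. It follows the Arrow columnar layout (an offsets buffer with one more entry than there are values, plus a contiguous UTF-8 data buffer) and ignores validity bitmaps and buffer padding. The helper names are hypothetical and the numbers are estimates, not measurements of ArcticDB.

```python
def variable_size_bytes(values, offset_width):
    """Approximate bytes for an Arrow variable-size string array."""
    data_bytes = sum(len(v.encode("utf-8")) for v in values)
    offsets_bytes = (len(values) + 1) * offset_width  # n + 1 offsets
    return offsets_bytes + data_bytes


def large_string_bytes(values):
    """64-bit offsets, as in pa.large_string()."""
    return variable_size_bytes(values, 8)


def small_string_bytes(values):
    """32-bit offsets, as in pa.string()."""
    return variable_size_bytes(values, 4)


def dictionary_bytes(values):
    """int32 indices plus a large_string dictionary of the unique values."""
    unique = list(dict.fromkeys(values))  # preserves first-seen order
    return len(values) * 4 + variable_size_bytes(unique, 8)


# A low cardinality column: 3 distinct values repeated 1000 times each.
countries = ["United Kingdom", "United States", "France"] * 1000

sizes = {
    "LARGE_STRING": large_string_bytes(countries),  # 57008
    "SMALL_STRING": small_string_bytes(countries),  # 45004
    "CATEGORICAL": dictionary_bytes(countries),     # 12065
}
```

For a column where nearly every value is unique, the dictionary holds almost all of the strings and the int32 indices become pure overhead, which is why the plain string encodings suit high cardinality data better.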

Example: Using String Formats

from arcticdb import ArrowOutputStringFormat

# Set default string format at Arctic level
ac = adb.Arctic(
    uri,
    output_format=OutputFormat.PYARROW,
    arrow_string_format_default=ArrowOutputStringFormat.CATEGORICAL
)

# Or per-column during read
result = lib.read(
    "symbol",
    output_format=OutputFormat.PYARROW,
    arrow_string_format_per_column={
        "country": ArrowOutputStringFormat.CATEGORICAL,  # Low cardinality
        "description": ArrowOutputStringFormat.LARGE_STRING  # High cardinality
    }
)
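
One way to build the per-column mapping is to pick a format from the observed cardinality of a data sample. The sketch below is illustrative only: `ArrowOutputStringFormat` is a local stand-in enum so the snippet runs without ArcticDB, `suggest_string_formats` is a hypothetical helper, and the 10% unique-ratio threshold is an arbitrary assumption rather than a recommendation from ArcticDB.

```python
from enum import Enum


class ArrowOutputStringFormat(Enum):
    """Local stand-in for arcticdb.ArrowOutputStringFormat (illustration only)."""
    LARGE_STRING = "LARGE_STRING"
    SMALL_STRING = "SMALL_STRING"
    CATEGORICAL = "CATEGORICAL"


def suggest_string_formats(sample, max_unique_ratio=0.1):
    """Hypothetical helper: map column name -> suggested string format.

    `sample` maps column names to lists of string values. Columns whose
    share of distinct values is at most `max_unique_ratio` are treated as
    low cardinality and get CATEGORICAL; the rest get LARGE_STRING.
    """
    formats = {}
    for column, values in sample.items():
        if not values:
            continue  # nothing to measure, leave the default in place
        unique_ratio = len(set(values)) / len(values)
        if unique_ratio <= max_unique_ratio:
            formats[column] = ArrowOutputStringFormat.CATEGORICAL
        else:
            formats[column] = ArrowOutputStringFormat.LARGE_STRING
    return formats


sample = {
    "country": ["GB", "US"] * 50,                        # 2 distinct of 100
    "description": [f"row {i}" for i in range(100)],     # all distinct
}
per_column = suggest_string_formats(sample)
```

Here `per_column` maps `"country"` to CATEGORICAL and `"description"` to LARGE_STRING, the same shape as the `arrow_string_format_per_column` argument in the example above.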

Supported Operations

All read operations support the new output formats:

read()
read_batch()
read_batch_and_join()
head()
tail()
LazyDataFrame operations with lazy=True

Notes

The default output format remains PANDAS for backward compatibility
Both PYARROW and POLARS formats share the same underlying Arrow memory layout
For more details on Arrow's physical layouts, see: https://arrow.apache.org/docs/format/Columnar.html

Detailed release notes to be written in PR description

IvoDD requested review from alexowens90 and poodlewars as code owners November 21, 2025 15:39

IvoDD added the minor Feature change, should increase minor version label Nov 21, 2025

IvoDD changed the title ~~Arrow read support (allow reading directly as pyarrow.Table or polars.DataFrame)~~ Arrow read support (read directly as pyarrow.Table or polars.DataFrame) Nov 21, 2025

IvoDD force-pushed the polars-output-format branch from 44b11a2 to 3c821ed Compare November 24, 2025 08:56

Polars output format without docs

a5299e0

IvoDD force-pushed the polars-output-format branch from 3c821ed to a5299e0 Compare November 24, 2025 09:25

IvoDD force-pushed the official-arrow-read-support branch 2 times, most recently from 2749078 to 022c5bb Compare November 24, 2025 09:35

IvoDD force-pushed the polars-output-format branch from ba7bff2 to d654d02 Compare November 24, 2025 13:57

Base automatically changed from polars-output-format to master November 24, 2025 15:56

Arrow output format for read operations

901bc3b

Detailed release notes to be written in PR description

IvoDD force-pushed the official-arrow-read-support branch from 022c5bb to 901bc3b Compare November 25, 2025 16:12
