@IvoDD IvoDD commented Nov 21, 2025

New Feature: PyArrow and Polars Output Formats

ArcticDB now supports returning data in Apache Arrow-based formats.

Overview

In addition to the existing PANDAS output format (which remains the default), you can now specify PYARROW or POLARS output formats for read operations. Both formats use Apache Arrow's columnar memory layout, enabling:

  • Better performance: especially for dataframes with many string columns
  • Integration with modern systems: native integration with Arrow-based data processing tools such as Polars

Usage

The output format can be configured at three levels, from broadest to most specific:

1. Arctic instance level (default for all libraries):

import arcticdb as adb
from arcticdb import OutputFormat

# Set default for all libraries from this Arctic instance
ac = adb.Arctic(uri, output_format=OutputFormat.PYARROW)

2. Library level (default for all reads from this library):

# Set at library creation
lib = ac.create_library("my_lib", output_format=OutputFormat.POLARS)

# Or when getting an existing library
lib = ac.get_library("my_lib", output_format=OutputFormat.PYARROW)

3. Per read operation (most granular control):

# Override on a per-read basis
result = lib.read("symbol", output_format=OutputFormat.PYARROW)
table = result.data  # Returns pyarrow.Table

result = lib.read("symbol", output_format=OutputFormat.POLARS)
df = result.data  # Returns polars.DataFrame
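
The three levels above read as a chain of defaults in which the most specific setting wins. The following is a minimal sketch of that reading, not ArcticDB's implementation: `resolve_output_format` is a hypothetical helper, and the `OutputFormat` enum here is a local stand-in so the snippet runs without ArcticDB installed.

```python
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    """Local stand-in for arcticdb.OutputFormat (assumption, for illustration only)."""
    PANDAS = "pandas"
    PYARROW = "pyarrow"
    POLARS = "polars"


def resolve_output_format(
    read_level: Optional[OutputFormat] = None,
    library_level: Optional[OutputFormat] = None,
    arctic_level: Optional[OutputFormat] = None,
) -> OutputFormat:
    """Hypothetical helper: return the most specific setting that was provided."""
    # Check from most specific (per read) to least specific (Arctic instance).
    for candidate in (read_level, library_level, arctic_level):
        if candidate is not None:
            return candidate
    # Nothing configured anywhere: PANDAS remains the default.
    return OutputFormat.PANDAS


# A per-read override takes priority over the library-level default.
chosen = resolve_output_format(
    read_level=OutputFormat.POLARS,
    library_level=OutputFormat.PYARROW,
)
print(chosen.name)  # POLARS
```

Under this reading, a library obtained from an Arctic instance configured with PYARROW would return pyarrow.Table objects unless the library or the individual read specifies otherwise.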

Output Format Options

  • OutputFormat.PANDAS (default): Returns pandas.DataFrame backed by NumPy arrays
  • OutputFormat.PYARROW: Returns pyarrow.Table objects
  • OutputFormat.POLARS: Returns polars.DataFrame objects

String Format Customization

When using Arrow-based output formats, you can customize how string columns are encoded using ArrowOutputStringFormat:

LARGE_STRING (default):

  • 64-bit variable-size encoding
  • Best for general-purpose use and large string data
  • PyArrow: pa.large_string(), Polars: pl.String

SMALL_STRING:

  • 32-bit variable-size encoding
  • More memory efficient for smaller string data
  • PyArrow: pa.string(), Polars: pl.String

CATEGORICAL / DICTIONARY_ENCODED:

  • Dictionary-encoded with int32 indices
  • Best for low cardinality columns (few unique values repeated many times)
  • Deduplicates strings for significant memory savings
  • PyArrow: pa.dictionary(pa.int32(), pa.large_string()), Polars: pl.Categorical
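
To give a feel for the trade-offs between the three encodings, here is a rough back-of-the-envelope estimate of buffer sizes. It assumes the standard Arrow layouts behind the PyArrow types listed above (an offsets buffer with one more entry than there are values, plus UTF-8 character data; int32 indices plus a large_string dictionary for the categorical case) and ignores validity bitmaps and padding. `estimate_string_column_bytes` is a hypothetical helper, not part of ArcticDB or PyArrow, and Polars may lay out pl.String differently internally.

```python
def estimate_string_column_bytes(values):
    """Approximate Arrow buffer sizes in bytes for a list of non-null strings."""
    n = len(values)
    data = sum(len(v.encode("utf-8")) for v in values)
    unique = set(values)
    unique_data = sum(len(v.encode("utf-8")) for v in unique)
    return {
        # 64-bit offsets (n + 1 of them) plus all character data
        "LARGE_STRING": 8 * (n + 1) + data,
        # 32-bit offsets (n + 1 of them) plus all character data
        "SMALL_STRING": 4 * (n + 1) + data,
        # int32 index per row, plus a large_string dictionary of unique values
        "CATEGORICAL": 4 * n + 8 * (len(unique) + 1) + unique_data,
    }


# Low-cardinality example: three distinct values repeated many times.
countries = ["United States", "United Kingdom", "Germany"] * 1000
for name, size in estimate_string_column_bytes(countries).items():
    print(f"{name}: {size} bytes")
```

For a column like this, the dictionary-encoded estimate is several times smaller than either variable-size encoding. For a column of mostly unique strings the dictionary would hold nearly every value, so the categorical form would save nothing.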

Example: Using String Formats

import arcticdb as adb
from arcticdb import ArrowOutputStringFormat, OutputFormat

# Set default string format at Arctic level
ac = adb.Arctic(
    uri,
    output_format=OutputFormat.PYARROW,
    arrow_string_format_default=ArrowOutputStringFormat.CATEGORICAL
)

# Or per-column during read
result = lib.read(
    "symbol",
    output_format=OutputFormat.PYARROW,
    arrow_string_format_per_column={
        "country": ArrowOutputStringFormat.CATEGORICAL,  # Low cardinality
        "description": ArrowOutputStringFormat.LARGE_STRING  # High cardinality
    }
)
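
When a symbol has many string columns, the per-column mapping can be derived from a sample of the data rather than written by hand. The sketch below is one possible heuristic, not something ArcticDB provides: `suggest_string_formats` and the 10% uniqueness threshold are assumptions for illustration, and the function returns format names as plain strings so it runs without ArcticDB installed.

```python
def suggest_string_formats(sample, max_unique_ratio=0.1):
    """Map each column name to "CATEGORICAL" or "LARGE_STRING".

    `sample` is a dict of column name -> list of string values.
    A column is treated as low cardinality when the share of distinct
    values is at most `max_unique_ratio` (an arbitrary threshold).
    """
    suggestions = {}
    for name, values in sample.items():
        if not values:
            # No data to judge by: fall back to the general-purpose format.
            suggestions[name] = "LARGE_STRING"
            continue
        unique_ratio = len(set(values)) / len(values)
        if unique_ratio <= max_unique_ratio:
            suggestions[name] = "CATEGORICAL"
        else:
            suggestions[name] = "LARGE_STRING"
    return suggestions


sample = {
    "country": ["US", "GB"] * 50,                      # 2 distinct values in 100 rows
    "description": [f"row {i}" for i in range(100)],  # every value distinct
}
print(suggest_string_formats(sample))
```

The resulting names could then be turned into enum members, for example with `getattr(ArrowOutputStringFormat, name)`, and passed as `arrow_string_format_per_column`.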

Supported Operations

All read operations support the new output formats:

  • read()
  • read_batch()
  • read_batch_and_join()
  • head()
  • tail()
  • LazyDataFrame operations with lazy=True

Notes

@IvoDD IvoDD added the minor Feature change, should increase minor version label Nov 21, 2025
@IvoDD IvoDD changed the title Arrow read support (allow reading directly as pyarrow.Table or polars.DataFrame) Arrow read support (read directly as pyarrow.Table or polars.DataFrame) Nov 21, 2025
@IvoDD IvoDD force-pushed the polars-output-format branch from 44b11a2 to 3c821ed Compare November 24, 2025 08:56
@IvoDD IvoDD force-pushed the polars-output-format branch from 3c821ed to a5299e0 Compare November 24, 2025 09:25
@IvoDD IvoDD force-pushed the official-arrow-read-support branch 2 times, most recently from 2749078 to 022c5bb Compare November 24, 2025 09:35
@IvoDD IvoDD force-pushed the polars-output-format branch from ba7bff2 to d654d02 Compare November 24, 2025 13:57
Base automatically changed from polars-output-format to master November 24, 2025 15:56
Detailed release notes to be written in PR description
@IvoDD IvoDD force-pushed the official-arrow-read-support branch from 022c5bb to 901bc3b Compare November 25, 2025 16:12