Collaborator

@IvoDD IvoDD commented Nov 20, 2025

Reference Issues/PRs

Monday ref: 18362773160

What does this implement or fix?

The Polars output format is just a thin wrapper around the Arrow output format: we create the polars DataFrame zero-copy from the pyarrow table.

Also improves some docs.

Adds just a few extra tests for polars because:

  • Extensive arrow testing covers most arrow related logic
  • Parametrizing many tests to work with polars is difficult because polars.DataFrame does not have any concept of pandas metadata.

Any other comments?

I decided to go through pyarrow even though we could avoid the pyarrow dependency by using a PyCapsule, because:

  • This would require rewriting our arrow denormalization (which currently relies on pyarrow APIs)
  • This would require extra testing coverage of the polars output format, and it is harder to parametrize our existing tests because polars doesn't have a concept of pandas metadata.

Also needed to clean up some space for conda build. This is done in the conda workflow. A successful run with workflow from this branch can be seen here.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@IvoDD IvoDD force-pushed the polars-output-format branch 2 times, most recently from a91a646 to c5d0d19 Compare November 20, 2025 14:35
@IvoDD IvoDD force-pushed the batch-configurable-strings branch from ccd25c4 to acb1887 Compare November 20, 2025 15:19
@IvoDD IvoDD force-pushed the polars-output-format branch 2 times, most recently from f4fbf82 to d2c32dc Compare November 20, 2025 15:40
Base automatically changed from batch-configurable-strings to master November 21, 2025 08:08
@IvoDD IvoDD force-pushed the polars-output-format branch from d2c32dc to 44b11a2 Compare November 21, 2025 08:12
@IvoDD IvoDD added the patch (Small change, should increase patch version) and no-release-notes (This PR shouldn't be added to release notes) labels Nov 21, 2025
return InternalOutputFormat.PANDAS
- elif output_format.lower() == OutputFormat.EXPERIMENTAL_ARROW.lower():
+ elif output_format.lower() in [OutputFormat.EXPERIMENTAL_ARROW.lower(), OutputFormat.EXPERIMENTAL_POLARS.lower()]:
if not _PYARROW_AVAILABLE:
Collaborator

I don't think any dependency resolver would let us get into a position where pyarrow is available and polars isn't, but we may as well check just in case.
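A minimal sketch of what such a check could look like, assuming string-valued format constants as in the diff above. `resolve_internal_format`, `_is_installed`, and the error messages are hypothetical names for illustration, not ArcticDB's actual code; the classes are stand-ins for the real ones.

```python
import importlib.util
from enum import Enum


class OutputFormat:
    # Hypothetical stand-in for the Python-level format constants.
    PANDAS = "PANDAS"
    EXPERIMENTAL_ARROW = "EXPERIMENTAL_ARROW"
    EXPERIMENTAL_POLARS = "EXPERIMENTAL_POLARS"


class InternalOutputFormat(Enum):
    # Hypothetical stand-in for the format the native layer understands.
    PANDAS = 0
    ARROW = 1


def _is_installed(module_name):
    # True if the top-level module can be found, without importing it.
    return importlib.util.find_spec(module_name) is not None


def resolve_internal_format(output_format, is_installed=_is_installed):
    fmt = output_format.lower()
    if fmt == OutputFormat.PANDAS.lower():
        return InternalOutputFormat.PANDAS
    if fmt in (OutputFormat.EXPERIMENTAL_ARROW.lower(), OutputFormat.EXPERIMENTAL_POLARS.lower()):
        # Both formats are built on the Arrow result, so pyarrow is needed either way.
        if not is_installed("pyarrow"):
            raise ModuleNotFoundError("pyarrow is required for the Arrow and Polars output formats")
        # The extra check suggested in the review: polars itself must be importable.
        if fmt == OutputFormat.EXPERIMENTAL_POLARS.lower() and not is_installed("polars"):
            raise ModuleNotFoundError("polars is required for the Polars output format")
        return InternalOutputFormat.ARROW
    raise ValueError(f"Unknown output format: {output_format!r}")
```

Passing `is_installed` as a parameter is only a convenience so the unlikely "pyarrow present, polars missing" case can be exercised in a test without changing the environment.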

return self.version_store.get_column_stats_info_version(symbol, version_query).to_map()

- def _batch_read_keys(self, atom_keys, read_options):
+ def _batch_read_keys(self, atom_keys, read_options, output_format):
Collaborator

@phoebusm has removed this function entirely as part of the recursive normalizers performance improvement PR, so whoever merges second will get a merge conflict.

query_builder = copy.deepcopy(query_builder)
read_queries = self._get_read_queries(len(symbols), date_ranges, row_ranges, columns, query_builder)
- batch_read_options = self._get_batch_read_options(
+ batch_read_options, output_format = self._get_batch_read_options(
Collaborator

Couldn't we expose the output_format as a read-only property of the batch_read_options C++ object?

Collaborator Author

@IvoDD IvoDD Nov 24, 2025

It is exposed, but both ReadOptions and BatchReadOptions hold only the internal C++ InternalOutputFormat, which in both cases is just ARROW.

Here I need to differentiate between the Python-level OutputFormat values, which are different for PYARROW and POLARS.
I think I'd like to keep it this way because the C++ layer doesn't need to know whether it's pyarrow or polars. Both are only Python sugar on top of the Arrow C structures.
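The split described here can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: `get_batch_read_options`, `wrap_result`, the dictionary of native options, and the `to_polars` callable are all hypothetical names; in practice the converter could be something like `polars.from_arrow`.

```python
from enum import Enum


class InternalOutputFormat(Enum):
    # Hypothetical mirror of what the native layer sees.
    PANDAS = "PANDAS"
    ARROW = "ARROW"


# Two Python-level formats collapse onto the same internal format.
PYTHON_TO_INTERNAL = {
    "PANDAS": InternalOutputFormat.PANDAS,
    "EXPERIMENTAL_ARROW": InternalOutputFormat.ARROW,
    "EXPERIMENTAL_POLARS": InternalOutputFormat.ARROW,
}


def get_batch_read_options(output_format):
    """Return (options for the native layer, Python-level format)."""
    python_format = output_format.upper()
    native_options = {"output_format": PYTHON_TO_INTERNAL[python_format]}
    return native_options, python_format


def wrap_result(arrow_table, python_format, to_polars):
    """Apply the Python-only step after the native read has produced Arrow data."""
    if python_format == "EXPERIMENTAL_POLARS":
        return to_polars(arrow_table)
    return arrow_table
```

Because the native options carry only ARROW, the Python-level value has to travel next to them, which is why the helper in the diff returns a pair.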

@IvoDD IvoDD force-pushed the polars-output-format branch from 44b11a2 to 3c821ed Compare November 24, 2025 08:56
@IvoDD IvoDD force-pushed the polars-output-format branch from 3c821ed to a5299e0 Compare November 24, 2025 09:25
@IvoDD IvoDD force-pushed the polars-output-format branch from c0d37b1 to ba7bff2 Compare November 24, 2025 13:49
@IvoDD IvoDD force-pushed the polars-output-format branch from ba7bff2 to d654d02 Compare November 24, 2025 13:57
@IvoDD IvoDD merged commit 8e6763a into master Nov 24, 2025
185 of 186 checks passed
@IvoDD IvoDD deleted the polars-output-format branch November 24, 2025 15:56
Labels

no-release-notes (This PR shouldn't be added to release notes), patch (Small change, should increase patch version)

4 participants