-
Notifications
You must be signed in to change notification settings - Fork 153
[18362773160] Introduce EXPERIMENTAL_POLARS output format
#2769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a91a646 to
c5d0d19
Compare
ccd25c4 to
acb1887
Compare
f4fbf82 to
d2c32dc
Compare
d2c32dc to
44b11a2
Compare
| return InternalOutputFormat.PANDAS | ||
| elif output_format.lower() == OutputFormat.EXPERIMENTAL_ARROW.lower(): | ||
| elif output_format.lower() in [OutputFormat.EXPERIMENTAL_ARROW.lower(), OutputFormat.EXPERIMENTAL_POLARS.lower()]: | ||
| if not _PYARROW_AVAILABLE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any dependency resolver would let us get into a position where pyarrow is available and polars isn't, but we may as well check just in case
| return self.version_store.get_column_stats_info_version(symbol, version_query).to_map() | ||
|
|
||
| def _batch_read_keys(self, atom_keys, read_options): | ||
| def _batch_read_keys(self, atom_keys, read_options, output_format): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phoebusm has removed this function entirely as part if the recursive normalizers performance improvement PR, so whoever is merging second will get a merge conflict
| query_builder = copy.deepcopy(query_builder) | ||
| read_queries = self._get_read_queries(len(symbols), date_ranges, row_ranges, columns, query_builder) | ||
| batch_read_options = self._get_batch_read_options( | ||
| batch_read_options, output_format = self._get_batch_read_options( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we expose the output_format as a read-only property of the batch_read_options C++ object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is exposed but both the ReadOptions and BatchReadOptions have only the internal c++ InternalOutputFormat which in both cases is just ARROW.
Here I need to differenciate between the python level OutputFormat which is different for PYARROW and POLARS.
I think I'd like to keep it this way because C++ layer doesn't need to know whether it's pyarrow or polars. Both are only python suger on top of the arrow c structures.
44b11a2 to
3c821ed
Compare
3c821ed to
a5299e0
Compare
c0d37b1 to
ba7bff2
Compare
ba7bff2 to
d654d02
Compare
Reference Issues/PRs
Monday ref: 18362773160
What does this implement or fix?
Polars output format is just a thin wrapper around the arrow output format. We create the polars dataframe zero-copy from the pyarrow table.
Also improves some docs.
Adds just a few extra tests for polars because:
polars.DataFramedoes not have any concept of pandas metadata.Any other comments?
I decided to go through
pyarroweven though we could avoid the pyarrow dependency by using a PyCapsule, because:Also needed to clean up some space for conda build. This is done in the conda workflow. A successful run with workflow from this branch can be seen here.
Checklist
Checklist for code changes...