Use Polars to read and write rather than Pandas #56

Open
wants to merge 13 commits into
base: main


d-j-hatton

Using the polars CSV reader and writer instead of the pandas equivalents can give significant speedups. An added benefit is that everything can stay in polars, which allows further speedups in later dataframe manipulation. To maintain the current behaviour, polars dataframes are converted to pandas by default before starfile.read returns. Passing the additional keyword argument polars=True returns a polars.DataFrame instead. starfile.write accepts data blocks that are either pandas or polars dataframes.
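
The default-conversion behaviour described above could look roughly like the sketch below. It is an illustration, not the code in this PR: `finalize_block` is a hypothetical helper name, and the only library call assumed is `polars.DataFrame.to_pandas()`. The function is duck-typed, so it works on any object with a `to_pandas()` method.

```python
def finalize_block(df, polars: bool = False):
    """Return a data block in the form the caller asked for.

    Hypothetical helper: `df` is assumed to be a polars.DataFrame
    (or anything else exposing a `to_pandas()` method).
    """
    if polars:
        # Caller opted in with polars=True: hand back the block untouched.
        return df
    # Default: convert to pandas so existing callers see no change.
    return df.to_pandas()
```

With this shape, existing code that calls the reader without the new keyword keeps receiving pandas dataframes, and only callers that pass polars=True see the new type.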

Because polars only accepts a single-character separator, arbitrary whitespace has to be parsed by starfile itself when the star file is read. Some modifications have therefore been made to the line-by-line parsing to keep it efficient.
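
One way to reduce arbitrary whitespace to a single-character separator is sketched below. This is an assumption about the general approach, not the parsing code in this PR: `normalize_whitespace` and `normalize_rows` are hypothetical names, the tab separator is an arbitrary choice, and the sketch ignores quoted values that contain spaces, which real star files can have.

```python
def normalize_whitespace(line: str, sep: str = "\t") -> str:
    """Collapse each run of whitespace into one separator and trim both ends."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading and trailing whitespace, including the newline.
    return sep.join(line.split())


def normalize_rows(lines, sep: str = "\t") -> str:
    """Build one text buffer that a single-separator CSV reader could parse.

    Blank lines are skipped. Quoted fields are NOT handled in this sketch.
    """
    rows = (normalize_whitespace(line, sep) for line in lines)
    return "\n".join(row for row in rows if row)
```

The resulting buffer has exactly one separator between fields, so it could then be passed to a CSV reader configured with that separator.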

Attached are some very rough read and write benchmarks, run on an M1 MacBook Pro with particle star files of different sizes. For reads, the bars are split into the time spent in read_csv and the time spent on everything else. "Polars to pandas" and "Pure polars" refer to the code as modified in this PR; "Pure pandas" is the existing implementation.

[Figure: read-time-comp]
[Figure: write-time-comp]

Notes:

  • This adds polars as a dependency
  • I couldn't get the pre-commit hooks to pass because of issues in existing code. Some of them did run and changed formatting, which I can try to change back if you want.

@alisterburt
Member

What an awesome PR to wake up to! Will take a proper look once I'm up and running ☺️ thanks @d-j-hatton !

@jojoelfe
Collaborator

Hey @d-j-hatton, are you still interested in this? I missed this, but I think it's a great idea. I've always been interested in polars and, from looking at the documentation, maybe there is even a way to make the quotation code a bit saner.

I've thought a bit about whether we should be cautious about adding new dependencies, since this is a helper package used in quite a few other projects and every new dependency increases the risk of collisions. Is there an easy way to make this an optional dependency?

@d-j-hatton
Author

Yes, it will be possible to make it an optional dependency by adding an optional dependency group, so that you can pip install starfile[polars], and by using a pattern like

try:
    import polars
except ImportError:
    polars = None

to check for the presence of the package in the code and default to pandas if it is not present. I can make some changes for that and fix the conflicts that have cropped up.
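
A minimal sketch of how the `polars = None` sentinel from that pattern might then be used. The helper names `select_io_backend` and `require_polars` are hypothetical, and raising an error when a polars dataframe is explicitly requested without polars installed is an assumed design choice, not something settled in this thread.

```python
try:
    import polars  # optional; assumed installable via `pip install starfile[polars]`
except ImportError:
    polars = None


def select_io_backend(polars_module=polars) -> str:
    """Name the library used for CSV parsing: polars when present, else pandas."""
    return "pandas" if polars_module is None else "polars"


def require_polars(polars_module=polars) -> None:
    """Fail early with an actionable message if polars output was requested.

    Assumed behaviour: silently falling back to pandas here would hand the
    caller a different dataframe type than the one they asked for.
    """
    if polars_module is None:
        raise ImportError(
            "polars is not installed; try `pip install starfile[polars]`"
        )
```

Passing the module in as a parameter keeps both helpers easy to test with and without polars installed; in normal use the default argument picks up whatever the import at the top found.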

@jojoelfe
Collaborator

jojoelfe commented Aug 8, 2024

Thank you, that sounds great! Also happy to help, but won't touch anything unless you ask me to.

3 participants