[Python] Feather chunksize doesn't round-trip #45422

Open
alippai opened this issue Feb 4, 2025 · 2 comments

Comments

@alippai
Contributor

alippai commented Feb 4, 2025

Describe the usage question you have. Please include as many useful details as possible.

I tried this with pyarrow 19:

import pyarrow.feather as pf
t = ...
pf.write_feather(t, 'test.feather', chunksize=1024*1024)
len(pf.read_table('test.feather').to_batches()[0]) # 65536 rows
pf.write_feather(t, 'test2.feather', chunksize=256*1024)
len(pf.read_table('test2.feather').to_batches()[0]) # 65536 rows

I expected the two files to differ (at least in compressed size), but they are byte-for-byte identical. As a consequence, the requested batch size is lost when reading the data back.

Am I correct in assuming that the file should consist of chunksize-long buffers for each column (per record batch), and that these buffers are independently compressed using lz4 or zstd?

Component(s)

Python, C++, Format

@alippai
Contributor Author

alippai commented Feb 4, 2025

Is this the equivalent?

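import pyarrow as pa
import pyarrow.feather as pf

BATCH_SIZE = 1024*1024

# Merge existing chunks into one (to_batches() only splits chunks, it does not merge them)
if len(t.to_batches()) > 1:
    t = t.combine_chunks()
with pa.OSFile('test3.feather', 'wb') as sink:
    with pa.ipc.new_file(sink, t.schema, options=pa.ipc.IpcWriteOptions(compression='lz4')) as writer:
        # Write one record batch of at most BATCH_SIZE rows at a time
        for batch in t.to_batches(BATCH_SIZE):
            writer.write(batch)
len(pf.read_table('test3.feather').to_batches()[0]) # 1024*1024 rows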

@kou kou changed the title Feather chunksize doesn't round-trip [Python] Feather chunksize doesn't round-trip Feb 5, 2025
@raulcd
Member

raulcd commented Feb 6, 2025

Hi @alippai! Thanks for raising the issue. You might get a quicker response on the user mailing list. I would recommend sending an email to the user discussion list: [email protected] (first subscribe by sending an email to [email protected] if you are not already subscribed).
