Rust file-like objects and accept bytes/bytearray/numpy #45

Merged
milesgranger merged 30 commits into master from rust-file
Mar 18, 2021

milesgranger
Owner

@milesgranger milesgranger commented Mar 10, 2021

I think it would be cool to have file-like objects from the Rust side. This could allow side-stepping certain allocations to/from Python by allowing Rust to hold all the bytes.

Most notably, compressing one file into another should be quite performant, e.g.:

from cramjam import File, snappy
input_file = File("potentially-very-large-file.csv")
output_file = File("compressed-file.csv.snappy")
snappy.compress_into(input_file, output_file) 

and decompressing right into a buffer on the rust side:

from cramjam import File, Buffer, snappy
input = File("some-file.csv.snappy")
decompressed = snappy.decompress(input)
decompressed.seek(0)
decompressed.read()  # decompressed bytes
  • Update de/compress_into to accept numpy array or these native types
  • Update documentation
  • Fix Rust tests
  • Update Python variant tests
  • Add some benchmarks for these types: not worth it, more convenience than drastically better performance.
  • May need to switch lz4 implementation for one that supports Read/Write traits
  • Probably make some common FileLike trait following IOBase interface: #[pymethods] not supported for trait impls.
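
For readers without cramjam at hand, the file-to-file idea described above can be approximated in pure Python. This is only a sketch under stated assumptions: it uses the standard library's zlib instead of snappy, and `compress_into` here is a hypothetical helper written for illustration, not cramjam's function. It streams fixed-size chunks so the whole input never has to sit in memory at once.

```python
import io
import zlib


def compress_into(src, dst, chunk_size=64 * 1024):
    """Stream-compress file-like `src` into file-like `dst`.

    Hypothetical helper (zlib, not snappy); returns the number of bytes written.
    """
    compressor = zlib.compressobj()
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # compress() may return b"" while it buffers input internally
        written += dst.write(compressor.compress(chunk))
    # flush() emits whatever the compressor is still holding
    written += dst.write(compressor.flush())
    return written


# In-memory stand-ins for the input and output files
source = io.BytesIO(b"a,b,c\n1,2,3\n" * 10_000)
target = io.BytesIO()
n_bytes = compress_into(source, target)
```

The same function would work with objects returned by `open(path, "rb")` and `open(path, "wb")`, since it only relies on `read` and `write`.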

@martindurant

Fascinating!
The Files follow the full python API, so you can pass them to any python function that expects to work on file objects?

I wonder if it's possible to derive from IOBase, maybe add the class at runtime, since some functions will check isinstance(obj, IOBase).
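
One way the "add the class at runtime" idea might look: `io.IOBase` is an abstract base class, so any class can be registered as a virtual subclass of it. The sketch below uses a hypothetical pure-Python stand-in (`RustBackedFile`) rather than an actual extension type; whether this fits cramjam's classes is an assumption, not something confirmed in this thread.

```python
import io


class RustBackedFile:
    """Hypothetical stand-in for a file-like class defined outside Python."""

    def __init__(self, data=b""):
        self._data = data
        self._pos = 0

    def read(self, size=-1):
        end = len(self._data) if size < 0 else self._pos + size
        chunk = self._data[self._pos:end]
        self._pos += len(chunk)
        return chunk


obj = RustBackedFile(b"hello")
before = isinstance(obj, io.IOBase)   # not yet registered

# Register as a virtual subclass; this affects isinstance/issubclass only
io.IOBase.register(RustBackedFile)
after = isinstance(obj, io.IOBase)
```

Registration does not add any methods, so callers that go on to use `seek`, `close`, and so on would still need those to be implemented on the class itself.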

@milesgranger
Owner Author

That was a fleeting thought, make it fully compatible; slightly larger scope than what cramjam would need, but so close to being compatible it might as well be. 🤷‍♂️

Looking at IOBase, I'd guess most of it can be added. close/closed is a bit awkward, since a file is normally closed in Rust when it reaches the end of its lifetime (in this case, the GIL lifetime). However, one could shim it by making inner: Option<File> and dropping it on close(), so I guess that'd be okay. That leaves isatty() and fileno(), which aren't immediately clear to me.

Making it a subclass of IOBase is also not obvious, but I'm confident it's workable.
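
The `inner: Option<File>` idea above can be illustrated with a Python analogue, where `None` plays the role of Rust's `None` and closing means dropping the handle. This is a sketch only; `ClosableFile` is a hypothetical name and the real implementation would live on the Rust side.

```python
import io
from typing import BinaryIO, Optional


class ClosableFile:
    """Hypothetical wrapper: the handle is optional and is dropped on close()."""

    def __init__(self, inner: BinaryIO):
        self._inner: Optional[BinaryIO] = inner

    @property
    def closed(self) -> bool:
        return self._inner is None

    def _handle(self) -> BinaryIO:
        # Mirror the error Python's own file objects raise after close()
        if self._inner is None:
            raise ValueError("I/O operation on closed file.")
        return self._inner

    def read(self, size: int = -1) -> bytes:
        return self._handle().read(size)

    def close(self) -> None:
        # Idempotent: a second close() is a no-op
        if self._inner is not None:
            self._inner.close()
            self._inner = None


f = ClosableFile(io.BytesIO(b"abc"))
first = f.read()
f.close()
f.close()
```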

@martindurant

I'm not suggesting you need to implement the whole of IOBase. Most applications will only call read (and read_into!), but seek/tell is nice. Adding it as a subclass would be nice, but only if it's easy. After all, one could always wrap the thing in a Python BufferedIOBase if needed (which would handle close and other methods, at the cost of some data copying).
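
A minimal sketch of the wrapping approach mentioned above, assuming the native object exposes at least a `readinto`-style method. `MinimalReader` and `RawAdapter` are hypothetical names; the adapter subclasses `io.RawIOBase` so that `io.BufferedReader` can supply `readlines`, `close`, and the rest, at the cost of copying through its internal buffer.

```python
import io


class MinimalReader:
    """Hypothetical object exposing only readinto(), like a bare native type."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def readinto(self, buf) -> int:
        n = min(len(buf), len(self._data) - self._pos)
        buf[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n


class RawAdapter(io.RawIOBase):
    """Adapts any object with readinto() to the RawIOBase interface."""

    def __init__(self, inner):
        super().__init__()
        self._inner = inner

    def readable(self) -> bool:
        return True

    def readinto(self, buf) -> int:
        return self._inner.readinto(buf)


wrapped = io.BufferedReader(RawAdapter(MinimalReader(b"line one\nline two\n")))
lines = wrapped.readlines()
wrapped.close()
```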

@milesgranger
Owner Author

milesgranger commented Mar 17, 2021

@martindurant
I'm going to punt on adding benchmarks for the buffer-specific types. They are slightly faster in most cases than 'normal' de/compressing into bytes/pybytes/numpy but, interestingly enough, for the snappy raw format they basically match what python-snappy does:
[image: benchmark results]
Specifically, on your test data "oh what a beautiful day...", it beats it by just a bit.

This was run using an equivalent of this benchmark routine:

from cramjam import snappy, Buffer

data = b"oh what a beautiful morning, oh what a beautiful day!!" * 1000000

compressed = Buffer()
decompressed = Buffer()

snappy.compress_raw_into(data, compressed)
compressed.seek(0)
snappy.decompress_raw_into(compressed, decompressed)

Hopefully that sort of workflow works for you, should you choose to use these types.
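
The `seek(0)` between compressing and decompressing in the routine above matters because writing leaves the position at the end of the buffer. The standard library's `io.BytesIO` behaves the same way, so the effect can be shown without cramjam; this sketch assumes zlib as a stand-in codec and says nothing about cramjam's internals.

```python
import io
import zlib

data = b"oh what a beautiful morning, oh what a beautiful day!!" * 1000

compressed = io.BytesIO()
compressed.write(zlib.compress(data))

# The position now sits at the end, so a read returns nothing
at_end = compressed.read()

# Rewind before handing the buffer to a reader
compressed.seek(0)
restored = zlib.decompress(compressed.read())
```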

I think the other main benefit of the PR will be that any variant can take any combination of bytes/pybytes/numpy/cramjam.Buffer/cramjam.File

i.e.:

cramjam.snappy.compress_raw(<<numpy array>>) # gives back numpy array in return (returning the type it got)
cramjam.snappy.compress_raw_into(<<numpy array>>, <<cramjam.Buffer>>)  # mixing input/output types are also ok

Mixing types is also okay even when the output is bytes, so long as it is long enough.

>>> from cramjam import snappy
>>> out = b'000000000000000000'
>>> snappy.compress_raw_into(b'bytes', out)
7
>>> out
b'\x05\x10bytes00000000000'
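
To see why the call above reports 7 bytes, the output can be decoded by hand. The sketch below follows the published Snappy raw format (a varint with the uncompressed length, then a tag byte whose low two bits of 00 mark a literal and whose upper six bits hold the length minus one). It handles only this single short-literal case, and the helper names are hypothetical.

```python
def read_varint(buf, pos=0):
    """Decode a little-endian base-128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_single_literal(buf):
    """Decode a raw snappy stream consisting of one short literal element."""
    uncompressed_len, pos = read_varint(buf)
    tag = buf[pos]
    pos += 1
    if tag & 0b11 != 0:
        raise ValueError("element is not a literal")
    if (tag >> 2) >= 60:
        raise ValueError("long literals are not handled in this sketch")
    literal_len = (tag >> 2) + 1
    literal = bytes(buf[pos:pos + literal_len])
    return uncompressed_len, literal, pos + literal_len


out = b'\x05\x10bytes00000000000'
uncompressed_len, literal, used = decode_single_literal(out)
```

Here `0x05` is the uncompressed length, `0x10` is the literal tag for five bytes, and the five payload bytes follow, which accounts for the 7 bytes written; the remaining zeros are the untouched tail of the output.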

Anyway, I'm going to add some docs and tighten a few bolts and nuts, but it's basically done. Feel free to make any suggestions in the meantime. Thanks.

@milesgranger milesgranger marked this pull request as ready for review March 17, 2021 21:07
@milesgranger milesgranger changed the title from "Native Rust file-like objects (File and Buffer)" to "Rust file-like objects and accept bytes/bytearray/numpy" Mar 18, 2021
@milesgranger milesgranger merged commit 16b78f5 into master Mar 18, 2021
@milesgranger milesgranger deleted the rust-file branch March 18, 2021 21:00