
Read the input file in parallel #189

Open
Shnatsel opened this issue Jan 6, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@Shnatsel
Contributor

Shnatsel commented Jan 6, 2023

What can be improved or is missing?

The OpenEXR format allows the file to be read in parallel. That alone is considerably faster than a single-threaded read, see #183 (comment)

It would also let us initialize buffers in parallel (until #186 becomes feasible) and perform pixel format conversions in worker threads more naturally (see #182); AFAIK doing that with the current architecture would require a memcpy().

Implementation Approach

Memory-mapping is tempting, but is going to result in lots of very spooky UB the moment someone modifies the file we're reading. So we'll probably have to open a file descriptor per thread and seek it.

@Shnatsel Shnatsel added the enhancement New feature or request label Jan 6, 2023
@johannesvollmer
Owner

Would it be faster to not open multiple file handles, given that no memory mapping is used? And what would that even look like? I thought a File cannot be shared across threads.

@johannesvollmer
Owner

johannesvollmer commented Jan 7, 2023

Right now, memory mapping is supported by the library implicitly by allowing the user to pass any Read trait. Would memory mapping also benefit from parallel reading?

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2023

Would it be faster to not open multiple file handles,

I don't think so. If you have a single file descriptor for everything, that means you cannot issue parallel reads. If nothing else, there is only one offset value for a given file descriptor, so you cannot issue reads from several places at once. Beyond that, a file descriptor is not safe to access in parallel, you'd need a mutex which defeats the point - see https://stackoverflow.com/a/823525

We'll just need to open several file descriptors (and Files) pointing at the same underlying file in the filesystem, and then independently seek and read from each one. open() is cheap; it doesn't matter whether we do it once or 32 times.

@johannesvollmer
Owner

I thought so too. I just thought it sounded like there was an alternative to opening multiple handles. Let's try that then :)

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2023

Right now, memory mapping is supported by the library implicitly by allowing the user to pass any Read trait. Would memory mapping also benefit from parallel reading?

Yes, but it would be horrifically unsafe. Memory-mapped data changes in memory whenever the file is changed. That means that if you create an &str from a memory-mapped &[u8], the string is checked to be valid UTF-8 on creation, but the file may change later and turn into invalid UTF-8, violating the safety invariant. In fact, it's illegal to even have a &[u8] pointing to something that may change - the compiler assumes that &[u8] is immutable and generates code based on that. If that's violated, the very fabric of the compiler's vision of reality breaks down and anything is possible.

@johannesvollmer
Owner

johannesvollmer commented Jan 7, 2023

Okay, but if you use the mapped memory only through the interface of the Read trait, then this can't happen, right? Because in that case, data is always copied and never referenced? Or does it mean that the Rust compiler simply can't handle memory mapping?

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2023

Rust just doesn't allow creating a &[u8] over a memory map, which is required for the Read trait.

You can take a raw pointer to the memory map and ptr::read_volatile into a buffer. However, mmap is only useful when you want to avoid copying the data - in all other cases you can just read from a file descriptor into a buffer, and the performance will be exactly the same.

So to actually take advantage of mmap, you need to either adapt basically all your code to use read_volatile on raw pointers instead of creating slices, or find some platform-specific way to make sure nothing else modifies your file while it's memory-mapped.

@johannesvollmer
Owner

thanks for that explanation :) good to know

@etemesi254

I'm wondering how this would work. From my perusal of the docs, it seems the crate wraps a BufReader from which it gets its bytes. BufReader doesn't implement Clone and the same one can't be sent to multiple threads, and the docs explicitly mention that creating multiple BufReaders over the same underlying reader can cause data loss.

@Shnatsel
Contributor Author

Shnatsel commented Jan 8, 2023

When reading from a file, you'll need to open multiple File handles (each backed by its own file descriptor) and wrap each one in its own BufReader.

For reading from an in-memory buffer, you can just send the &[u8] references to different threads - under the hood that's just pointers.

AFAIK there's no way to implement this for a generic buffered reader. There is, for example, no reasonable way to read a stream from the network in parallel, let alone seek the network stream. So if that's a use case you'd like to support, reading needs to be single-threaded in this case.

@johannesvollmer
Owner

Metadata should be read with buffered readers, since a lot of tiny read requests are made. The pixel data, which will be read in parallel, will not benefit from any buffering.

@johannesvollmer
Owner

johannesvollmer commented Jan 8, 2023

The cloning of the byte source will have to be a custom abstraction. The File type has a try_clone function, but there is no trait with that function. We'll have to add our own simple trait for cloning a byte source, which we implement for File and for other readers.

Alternatively, we abstract over the ability to create a new reader from scratch, by remembering the path of the file and calling File::open repeatedly.

Those individual readers can then be buffered where it makes sense.

@etemesi254

And what about data passed as a slice? Won't that lose the parallelism?

Ideally what you want is a trait, call it ParallelReader, implemented for File, buffered files, and slices (or anything that derefs to a slice), with a method that allows wholesale cloning.

For a file, that would create a new file handle (we can't use try_clone, because a seek on one clone affects all the other instances); for data behind a cursor, it would create a new cursor over the same referenced data and return it.

The trait should also provide independent seeking, i.e. each clone should be seekable without affecting the others.

I'm not sure if this is possible; I've never actually done it, I'm just describing what I think might work.

@johannesvollmer
Owner

Yes! I'm pretty sure that calling File::try_clone will create a new handle with an independent seek position. But manually creating new ones will be easier anyway, I think.
