no flush() on PipedGzipWriter #8

Open
pohutukawa opened this issue May 8, 2018 · 6 comments
@pohutukawa

It looks like the file-like object that xopen() returns for gzip-compressed files lacks a flush() method:

AttributeError: 'PipedGzipWriter' object has no attribute 'flush'

This would be very helpful to have, and should be easy to implement, since the file-like object returned by gzip.open supports flush().
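For reference, a minimal sketch of the standard-library behaviour referred to above, using only the gzip module. The helper name `flushed_size` and the file name are made up for illustration:

```python
import gzip
import os
import tempfile


def flushed_size(payload: bytes) -> int:
    """Write payload via gzip.open, flush, and return the on-disk size
    while the file is still open for writing."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.gz")
        with gzip.open(path, "wb") as fh:
            fh.write(payload)
            # GzipFile.flush() pushes the compressed data produced so far
            # through to the underlying file object.
            fh.flush()
            return os.path.getsize(path)


if __name__ == "__main__":
    print(flushed_size(b"hello\n" * 100))
```

After the flush, the gzip header and the data compressed so far are on disk even though the file has not been closed yet.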

marcelm added a commit that referenced this issue May 8, 2018
@marcelm
Collaborator

marcelm commented May 8, 2018

It’s actually not that easy. xopen doesn’t use gzip.open if it can avoid it, but runs an external gzip or pigz process to get better speed. I would need a way to tell the running gzip/pigz to flush its currently processed block(s) to disk.

The only method I can think of would be to close the output file (which flushes everything implicitly) and then re-open it in append mode. This would work because gzip files may be concatenated, but the problem is that the resulting file would differ from the one produced if flush() had never been called.
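The close-and-reopen idea can be sketched with the standard-library gzip module alone. This is only an illustration of the concatenation property, not xopen code; `write_with_reopen` and `demo` are hypothetical names:

```python
import gzip
import os
import tempfile
from typing import Iterable, Tuple


def write_with_reopen(path: str, chunks: Iterable[bytes]) -> None:
    """Write every chunk as its own gzip member by closing the file
    and re-opening it in append mode."""
    mode = "wb"
    for chunk in chunks:
        with gzip.open(path, mode) as fh:
            fh.write(chunk)
        mode = "ab"  # later chunks are appended as new members


def demo() -> Tuple[bool, bool]:
    chunks = [b"first block\n", b"second block\n"]
    with tempfile.TemporaryDirectory() as tmp:
        reopened = os.path.join(tmp, "reopened.gz")
        single = os.path.join(tmp, "single.gz")
        write_with_reopen(reopened, chunks)
        with gzip.open(single, "wb") as fh:
            fh.write(b"".join(chunks))
        # Reading a multi-member file yields the concatenated content.
        with gzip.open(reopened, "rb") as fh:
            same_content = fh.read() == b"".join(chunks)
        # The raw bytes differ: the reopened file carries two gzip headers.
        with open(reopened, "rb") as a, open(single, "rb") as b:
            same_bytes = a.read() == b.read()
    return same_content, same_bytes


if __name__ == "__main__":
    print(demo())
```

The decompressed content is identical in both cases, while the compressed files themselves are not, which is the drawback described above.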

At the moment, I’m inclined to leave it as it is. I’d accept a PR, but the method would then probably need to be called reopen() instead of flush(). I’m open to discussion, though.

@pohutukawa
Author

Thanks for the context. I wasn't aware of the details behind this and (blindly) assumed that Python's core gzip module was at work in the background, with xopen merely abstracting its usage.
It's clear the files won't be identical: compression is involved, and small increments don't compress nearly as well as larger batches. In my past experience, a decompress/compress cycle on such files has yielded a significant improvement.
For my current use case I need the option to flush, but since file I/O is not the bottleneck, I don't need the speed of the pigz process, so I'll go with the vanilla gzip module for now (the benefit over uncompressed output is still large enough to go this route).
I agree that under these circumstances reopen() is probably the better name, with the documentation noting that it can act as a flush() workaround. As I have no immediate need for it and only limited time available, I won't be able to provide a PR, though. Sorry ... :-(
BTW, many thanks for providing this module in the first place. I've learned to like and use it more and more.

@marcelm
Collaborator

marcelm commented May 9, 2018

Thanks a lot for taking the time to write a detailed reply! I really appreciate learning which use cases exist.

I think one other option is to let users specify that xopen should not use a pipe to an external program. This would be good to have anyway. The code already exists (as a fallback when gzip isn’t available); it would just need to be exposed. Let’s leave this issue open until someone finds the time to implement it.

@rhpvorderman
Collaborator

I think one other option is to allow to specify that xopen should not use a pipe to an external program. I think this would be good to have anyway.

As far as I know we have solved this issue. Using threads=0 will always open in the main thread with open, gzip.open, lzma.open etc.
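A hedged usage sketch of the `threads=0` route: it assumes the third-party xopen package is installed and falls back to plain gzip.open otherwise, so that the snippet runs either way. The wrapper `open_in_process` and the file name are made up for this example:

```python
import gzip
import os
import tempfile

try:
    from xopen import xopen  # third-party package; assumed to be installed
except ImportError:
    xopen = None  # fall back to the standard library below


def open_in_process(path: str, mode: str = "wb"):
    """Open a .gz file without spawning an external compressor."""
    if xopen is not None:
        # threads=0 asks xopen to compress in the main thread.
        return xopen(path, mode=mode, threads=0)
    return gzip.open(path, mode)


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.gz")
        with open_in_process(path) as fh:
            fh.write(b"line 1\n")
            fh.flush()  # available because no pipe to gzip/pigz is involved
            fh.write(b"line 2\n")
        with gzip.open(path, "rb") as fh:
            return fh.read()


if __name__ == "__main__":
    print(demo())
```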

Furthermore, the threaded option included in #131 also supports the flush method, so eventually it will become available for gzip threads as well.

@rhpvorderman
Collaborator

Currently flush does nothing:

def flush(self) -> None:

For writing, what can be done is to send EOF to the program, which then terminates the compression block. The file can then be opened again in append mode and writing can resume. This only works for formats that support concatenated blocks, though, such as gzip. I think xz, zst and bzip2 also support those, but that would need more investigation.
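A rough sketch of that EOF-then-append approach with an external compressor process. This is not xopen's implementation: `ReopeningGzipWriter`, `compressor_command` and `demo` are hypothetical names, and the sketch uses the `gzip` program when it is on PATH and `python -m gzip` as a stand-in otherwise:

```python
import gzip
import os
import shutil
import subprocess
import sys
import tempfile


def compressor_command():
    """Pick a command that compresses stdin to stdout."""
    gzip_path = shutil.which("gzip")
    if gzip_path:
        return [gzip_path, "-c"]
    # Stand-in when no gzip binary exists: Python's own gzip CLI.
    return [sys.executable, "-m", "gzip"]


class ReopeningGzipWriter:
    """Hypothetical writer that pipes data to an external compressor."""

    def __init__(self, path):
        self.path = path
        self._start("wb")

    def _start(self, mode):
        self._outfile = open(self.path, mode)
        self._process = subprocess.Popen(
            compressor_command(), stdin=subprocess.PIPE, stdout=self._outfile
        )

    def _finish(self):
        self._process.stdin.close()  # EOF: compressor ends the current member
        returncode = self._process.wait()
        self._outfile.close()
        if returncode != 0:
            raise OSError(f"compressor exited with status {returncode}")

    def write(self, data):
        self._process.stdin.write(data)

    def reopen(self):
        """Finish the current member, then continue in append mode."""
        self._finish()
        self._start("ab")

    def close(self):
        self._finish()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "piped.gz")
        writer = ReopeningGzipWriter(path)
        writer.write(b"before reopen\n")
        writer.reopen()
        size_after_reopen = os.path.getsize(path)  # first member is on disk
        writer.write(b"after reopen\n")
        writer.close()
        with gzip.open(path, "rb") as fh:
            return size_after_reopen, fh.read()


if __name__ == "__main__":
    print(demo())
```

Each call to reopen() starts a new gzip member, so frequent calls cost compression ratio, as noted earlier in the thread.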

@rhpvorderman
Collaborator

xz, zst and bzip2 also support multiple compressed members in one file.
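A quick check of this with the standard library, as a sketch. It covers gzip, bzip2 and xz only; zst is left out because this example sticks to modules that ship with every current Python version. The helper name `roundtrip_concatenated` is made up:

```python
import bz2
import gzip
import lzma


def roundtrip_concatenated(module, parts):
    """Compress each part separately, join the streams, decompress once."""
    blob = b"".join(module.compress(part) for part in parts)
    return module.decompress(blob)


PARTS = [b"alpha\n", b"beta\n"]
RESULTS = {
    module.__name__: roundtrip_concatenated(module, PARTS)
    for module in (gzip, bz2, lzma)
}

if __name__ == "__main__":
    for name, data in RESULTS.items():
        print(name, data)
```

In each case the one-shot decompress function returns the concatenation of all members.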

3 participants