Feature request - Is there potential to support bgzip? #92
Comments
There’s a draft pull request for zstd support in #91, please also see the discussion over there. Regarding bgzip support, I think compression support is fine. Regarding decompression (which I realize you didn’t ask for), I think the support we already have for gzip is sufficient (we have tests that ensure we can decompress concatenated gzip files). I would argue that exposing the seeking capabilities of bgzip wouldn’t be in scope for xopen because the idea is to provide an interface that lets one abstract away the used compression algorithm.
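The point about concatenated gzip files can be checked with the standard library alone. This is a minimal sketch that uses only Python's `gzip` module, not xopen's own test suite; the payload strings are made up for illustration.

```python
import gzip

# Two independent gzip members, concatenated byte for byte.
part_a = gzip.compress(b"first member\n")
part_b = gzip.compress(b"second member\n")
concatenated = part_a + part_b

# gzip.decompress reads each member in turn and joins the payloads.
restored = gzip.decompress(concatenated)
print(restored.decode(), end="")
```

A BGZF file is a special case of this: a series of small gzip members, which is why a plain gzip reader can decompress it sequentially without knowing about the index.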
If you know beforehand that you do want BGZIP, there is also no need for xopen's dynamic choosing of compression algorithm. So you might be better off using As regards to:
Pysam has a
For the specific use case where I want this right now, Pysam doesn't seem to have the interface for what we want to do: write bgzip output to create a .vcf.bgz file and its associated tabix index, because downstream tools expect bgzipped VCFs with indexes.
I'm going to do some quick performance testing to see how using threads inside the main Python process compares with opening another process and piping uncompressed data to it, as xopen does. My gut says that xopen's approach is faster: in the first approach, the process will either be copying data in from Python or compressing it in the pysam thread pool, but not both at the same time because of the GIL, while xopen allows Python to send data via a pipe without blocking unless the pipe fills up. I'm taking the zstd support discussion to that open PR.
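The "separate process plus pipe" approach described above can be sketched with `subprocess` alone. This is not xopen's implementation; `piped_writer` is a hypothetical helper, and it assumes the given command reads uncompressed data on stdin and writes compressed data to stdout.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def piped_writer(path, command):
    """Yield a binary stream whose bytes are compressed by `command`.

    `command` is an argv list for a program that reads stdin and writes
    compressed data to stdout, e.g. ["gzip", "-c"] or ["bgzip", "-c"].
    """
    with open(path, "wb") as outfile:
        # The child process writes straight into the output file.
        process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=outfile)
        try:
            yield process.stdin
        finally:
            # Closing stdin signals end of input; then wait for the child.
            process.stdin.close()
            returncode = process.wait()
        if returncode != 0:
            raise OSError(f"{command[0]} exited with status {returncode}")
```

A call such as `with piped_writer("calls.vcf.gz", ["bgzip", "-c"]) as f: f.write(data)` would then only block in Python when the pipe buffer is full, while the compression itself runs in the other process.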
Have you tried using PipedCompressionWriter with
That looks like it would, I'll validate this. Would the xopen project be willing to make that support native, so that if you open a .bgz for writing, or a .gz and specify a bgzf or bgz format it will call bgzip?
For Also the whole problem xopen tries to solve is that you don't know what output the user expects: stdout? gzip? uncompressed? Xopen infers that for you. If you already know what format you want (indexed bgzip in your case), then it is much less overhead to simply let the program output that format, rather than rely on xopen. @marcelm What do you think?
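To make the "infers that for you" idea concrete, here is a minimal sketch of choosing a compressor from the file name. The table and the `infer_compressor` helper are hypothetical and only illustrate the principle; xopen's real dispatch logic may differ, and mapping `.bgz` to bgzip is exactly the behaviour being requested in this issue, not something that exists.

```python
from pathlib import Path

# Hypothetical suffix-to-command table, for illustration only.
COMPRESSORS = {
    ".gz": ["gzip", "-c"],
    ".bgz": ["bgzip", "-c"],
    ".zst": ["zstd", "-c"],
}


def infer_compressor(filename):
    """Return the compressor argv for `filename`, or None for plain output."""
    return COMPRESSORS.get(Path(filename).suffix.lower())


print(infer_compressor("calls.vcf.bgz"))
print(infer_compressor("calls.vcf"))
```

If the caller already knows the output must be indexed bgzip, this lookup adds nothing, which is the argument made above for calling bgzip directly in that case.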
@pettyalex Is it correct that you want an indexed bgz file, that is, that you would want to run For me, the question is whether we can add support for bgzip without adding another parameter to the The easiest case is that we just never create an index (run Always generating an index is the other option, but I wonder whether we wouldn’t get requests to enable setting the index name ( I’m not too worried about bgzip not being available in the standard library. If bgzip isn’t installed, we just print an error and it’s the responsibility of whoever uses xopen to ensure that bgzip is in their list of dependencies. Whatever we end up doing, we should at least document the
Yes, absolutely. For our workflows, we need indexes everywhere that we'd be using bgzip, as bgzip itself performs worse both speed and compression-wise vs other options. The index is its only advantage. The default index name would also be appropriate in every situation we'd want to use this with.
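For the VCF workflow mentioned earlier in the thread, one way to get a compressed file plus an index with the default name is to compress with `bgzip -c` and then run `tabix -p vcf` on the result. The sketch below only builds the command lines; the helper names are hypothetical, and it assumes the htslib `bgzip` and `tabix` binaries are on `PATH`. Note that this produces a tabix `.tbi` index, which may not be the same index the maintainers are discussing above.

```python
def bgzip_argv(threads=1):
    """Build an argv for bgzip that compresses stdin to stdout."""
    argv = ["bgzip", "-c"]
    if threads > 1:
        # -@ sets the number of compression threads.
        argv += ["-@", str(threads)]
    return argv


def tabix_argv(path, preset="vcf"):
    """Build an argv that indexes an existing BGZF-compressed file."""
    return ["tabix", "-p", preset, path]


def default_index_path(path):
    """tabix writes its index next to the data file with a .tbi suffix."""
    return path + ".tbi"


print(bgzip_argv(threads=4))
print(tabix_argv("calls.vcf.bgz"))
print(default_index_path("calls.vcf.bgz"))
```

Because the index step runs on the finished file, it cannot be folded into the pipe itself; it has to happen after the writer is closed.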
Do you use with the threading option? BGZF can be used without an index. bcl2fastq uses it to compress FASTQ files, where multiple chunks can be compressed at the same time in different threads.
This is a clever solution to get the decompression / compression into a separate process, very helpful little tool. This pattern would work well for a couple additional compression tools that my group uses. I'm going to take a look at adding support for these formats, which should be pretty straightforward.
BGZF, or "blocked gzip", is a format that's used pretty widely in bioinformatics. It's basically a lot of gzipped files concatenated together, with some extra info in the headers and an index in a separate file saying where to seek. It's decompressible by normal gzip, so we actually see bgzf files as .gz more often than .bgz. It'd be really great to be able to compress bgz files with xopen as well. The blocked gzip reference implementation is distributed with htslib as a binary called bgzip, and is available both from conda and from most Linux distros' native packages (tabix on Ubuntu, for example).

Also, it'd be great to see this support zstd as well, which is just an excellent general purpose compression tool that I expect to rapidly grow in usage in the next few years.
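To illustrate the "gzipped files concatenated together, with some extra info in the headers" description, here is a small sketch that builds BGZF-style blocks by hand and reads them back with the ordinary `gzip` module. It follows my reading of the BGZF layout in the SAM specification (a `BC` extra subfield holding the block size minus one); `bgzf_block` is a hypothetical helper, not htslib's implementation, and it writes no index.

```python
import gzip
import struct
import zlib


def bgzf_block(data: bytes) -> bytes:
    """Wrap `data` in a single BGZF-style gzip member."""
    if len(data) > 0xFF00:
        raise ValueError("payload too large for one block in this sketch")
    # Raw deflate stream (negative wbits means no zlib header or trailer).
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    deflated = compressor.compress(data) + compressor.flush()
    # 12-byte gzip header + 6-byte extra subfield + payload + 8-byte trailer.
    total_size = 12 + 6 + len(deflated) + 8
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN=6
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Subfield "BC" with a 2-byte value: total block size minus one.
    extra = struct.pack("<2BHH", ord("B"), ord("C"), 2, total_size - 1)
    # CRC32 and length of the uncompressed data, as in any gzip member.
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + deflated + trailer


payload_a = b"chr1\t100\n" * 10
payload_b = b"chr2\t200\n" * 10
# Two data blocks followed by an empty block, which BGZF uses as an EOF marker.
stream = bgzf_block(payload_a) + bgzf_block(payload_b) + bgzf_block(b"")

restored = gzip.decompress(stream)
print(restored == payload_a + payload_b)
```

The block size stored in each header is what lets a BGZF-aware reader jump from block to block, while a plain gzip reader simply ignores the extra field.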
Edit: To be clear, both of these tools are already usable from Python, there's a bgzip implementation here, and zstd has excellent Python bindings available, but getting the compression into another process like xopen does makes for much better performance.