Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using MultiGzDecoder for file with garbage after gzip data #396

Open
jinschoi opened this issue Feb 11, 2024 · 4 comments
Open

Using MultiGzDecoder for file with garbage after gzip data #396

jinschoi opened this issue Feb 11, 2024 · 4 comments

Comments

@jinschoi
Copy link

I have a gzip json file that I did not create that I am using flate2 and serde_json to parse and transform. When I run my code over the unzipped file, everything is file. When I run it on the gzipped file, it throws an unexpected end of file error. I am trying to figure out what is going on.

My working assumption is that the file, which is a multi-member gzip file, has some extra garbage after the end of the last member; and indeed, when I use gzcat to uncompress it, it does say "trailing garbage ignored". The section about multi-member files in the introduction says "If a file contains contains non-gzip data after the gzip data, MultiGzDecoder will emit an error after decoding the gzip data. This behavior matches the gzip, gunzip, and zcat command line tools."

I would like some way of decoding such a file without an error being returned, and just having any trailing garbage be ignored. What would be the simplest way of doing this?

@Byron
Copy link
Member

Byron commented Feb 11, 2024

Thanks for bringing this up! Could you provide a testfile along with a test-program that shows the current behaviour?
My assumption is that it's not actually possible to decode the last member of such a file as it emits an error on the last read, but it would be great to validate. Maybe it will also decode the last member successfully, but fail on the next read as it can't decode garbage, which would mean all zipped data could be read with the current API, and it's really a question on how to indicate a 'garbage tail' occurred.

With such example, it would be possible to setup a test-case and eventually find a solution.

@jinschoi
Copy link
Author

jinschoi commented Feb 11, 2024

Sure. Here is a test file I created with the following commands:

echo "Hello, world" | gzip > hello.gz
echo x >> hello.gz

hello.gz

Running gzcat on the file:

% gzcat hello.gz
Hello, world
gzcat: hello.gz: trailing garbage ignored

A short program that demonstrates the issue:

use flate2::read::MultiGzDecoder;
use std::fs::File;
use std::io::{BufReader, Read, Write};

fn main() {
    let file = File::open("hello.gz").unwrap();
    let mut buf_reader = BufReader::new(MultiGzDecoder::new(file));
    let mut buf = [0_u8; 100];
    loop {
        match buf_reader.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => {
                println!("read {} bytes:", n);
                std::io::stdout().write_all(&buf[0..n]).unwrap();
                std::io::stdout().flush().unwrap();
            }
            Err(e) => {
                println!("{:?}", e);
                break;
            }
        }
    }
}

And the output:

read 13 bytes:
Hello, world
Kind(UnexpectedEof)

@Byron Byron removed the question label Feb 11, 2024
@Byron
Copy link
Member

Byron commented Feb 11, 2024

Thanks a lot for your help!

Thoughts

It looks like the question here is of UnexpectedEOF could also be another kind of error which would indicate more clearly that indeed, there was nothing to read, or that the stream it tried to read (x here) isn't actually a deflated stream.

Another guess is that if the trailing bytes are longer, the error message will be a different, maybe more specific one. After all, the magic signature seems to be 2 bytes long, so we'd run out of data before reading this even.

Next Steps

I think it would be worth setting up a couple of test cases with varying length of trailing garbage to get an idea of how it can be detected on application level right now.

From there, we might be able to figure out how to optionally adjust or enhance the API in a backward compatible manner to make detecting this situation easier.

@jongiddy
Copy link
Contributor

jongiddy commented Sep 1, 2024

To do this MultiGzDecoder needs to detect whether the subsequent data is another gzip member or not. How does it do this? It could check the first two bytes at the end of the current member. If they are the 2-byte gzip signature, then it can take the code path that expects another gzip member, otherwise it can take the code path that treats it as trailing garbage.

However, if the trailing garbage can be modelled as random data, there is a 1 in 65536 chance that random garbage will be incorrectly interpreted as a valid gzip member and will return an error. Is that OK? Depends on the caller's requirements.

We could check more data. The next byte is always 8, so that improves the chances, but still not guaranteed (1 in 16,777,216). Is that OK? Depends on the caller's requirements.

Ultimately the best guarantee is to read the entire gzip member and only if it succeeds, return decompressed data. This requires arbitrary amounts of buffering.

What is the correct balance of correctness vs buffering to make a good choice? Depends on the caller's requirements.

Rather than have flate2 make a decision here, or add configuration to allow multiple options, this is a case where the caller should decide, by using GzDecoder and handling the validation of the trailing data itself.

Note that, when I say that the caller should decide, that does not rule out it using another third-party crate that wraps GzDecoder and makes a specific decision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants