
Decompress without Stream class (passing data pointer and data size continuously) #44

Open
hideakitai opened this issue Aug 3, 2021 · 5 comments


@hideakitai

Hi, thank you for the great library!

I would like to use this library with BLE. The BLE classes do not inherit from Stream, so I can't feed received data directly into this library. As things stand, I think I would need to save the received BLE packets to an fs::File first and then pass that file to the library.

If there were an interface to the library that could accept the data pointer and data size continuously (passing uint8_t* data, size_t size, like the esp_ota_write function), I think this library could be applied more universally to data sources that don't inherit from the Stream class. Is there already such a way?

I looked at the library's code. If the "read data from the Stream class and pass it to the decompressor" logic could be split into a "read data from the Stream class" part and a "pass data to the decompressor" part, the library would become more versatile and would cover use cases like the one above (a hypothetical sketch of such an interface is shown below). If you have any ideas or preferences on how to implement this, I would be happy to help :) Thanks again for the great library!
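
(Illustration only: this is a hypothetical shape such a push-style interface could take, not an existing ESP32-Targz API. GzStreamDecoder, begin(), feed() and end() are made-up names.)

#include <stdint.h>
#include <stddef.h>

// Hypothetical push-style decompressor: the caller feeds raw gz bytes as they
// arrive (e.g. from a BLE write callback) and decompressed output is delivered
// through a user callback, so no Stream or fs::File is involved.
class GzStreamDecoder {
  public:
    // called once before the first chunk; onOutput receives decompressed data
    bool begin( void (*onOutput)(const uint8_t* data, size_t size) );
    // called for every received chunk; returns false on decompression error
    bool feed( const uint8_t* data, size_t size );
    // called after the last chunk; flushes remaining output and checks the CRC
    bool end();
};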

@tobozo
Owner

tobozo commented Aug 10, 2021

hey @hideakitai thanks for your very interesting feedback,

If you have a code implementation in mind I'll be happy to review it and learn from it, but I have no idea how to achieve that without inheriting from the Stream class.

However I understand your project uses BLE to receive compressed data, and I have a few questions:

  • Can you clarify if this is about decompressing gz, tar.gz or tar (or all of them) ?
  • Is there a maximum file size to your BLE transfer ?
  • Does your ESP32 have psram ?
  • Are you focused on speed or reliability ? Spoiler: you can't have both

@hideakitai
Author

hideakitai commented Aug 11, 2021

Thank you for the reply :)

Though it's just an idea, I think it would be great if we could pass the data directly instead of going through the Stream class (Stream::read()). For example, in GzUnpacker, if there were an interface that accepted the data instead of GzUnpacker::gzReadByte(), maybe GzUnpacker could be used without the Stream class (e.g. fs::File)...? Sorry, I haven't been able to read through the whole library yet...

// gz filesystem helper
uint8_t GzUnpacker::gzReadByte( fs::File &gzFile, const int32_t addr, fs::SeekMode mode )
{
  gzFile.seek( addr, mode );
  return gzFile.read();
}

// 1) check if a file has valid gzip headers
// 2) calculate space needed for decompression
// 3) check if enough space is available on device
bool GzUnpacker::gzReadHeader( fs::File &gzFile )
{
  tarGzIO.output_size = 0;
  tarGzIO.gz_size = gzFile.size();
  bool ret = false;
  if ((gzReadByte(gzFile, 0) == 0x1f) && (gzReadByte(gzFile, 1) == 0x8b)) {
    // GZIP signature matched. Find real size as encoded at the end
    tarGzIO.output_size  = gzReadByte(gzFile, tarGzIO.gz_size - 4);
    tarGzIO.output_size += gzReadByte(gzFile, tarGzIO.gz_size - 3)<<8;
    tarGzIO.output_size += gzReadByte(gzFile, tarGzIO.gz_size - 2)<<16;
    tarGzIO.output_size += gzReadByte(gzFile, tarGzIO.gz_size - 1)<<24;
    log_i("[GZ INFO] valid gzip file detected! gz size: %d bytes, expanded size:%d bytes", tarGzIO.gz_size, tarGzIO.output_size);
    // ... (rest of the function snipped)
  }
  return ret;
}

Can you clarify if this is about decompressing gz, tar.gz or tar (or all of them) ?

All of them, if possible! (I will use gz and tar.gz)

Is there a maximum file size to your BLE transfer ?

About 10-15MB... I will use external 32MB SPI flash memory.

Does your ESP32 have psram ?

In this project, NO. (ESP32-WROOM-32D w/ Flash 4MB)

Are you focused on speed or reliability ? Spoiler: you can't have both

Reliability. Because BLE is slow, I don't want to send it again :)

@tobozo
Owner

tobozo commented Aug 11, 2021

gz

gzReadByte is used by one decompression method of GzUnpacker to access both the beginning and the end of the file.
This method needs the entire file to be accessible, which is why it's bound to fs::File.

So unless your BLE service exposes seek and readBytes characteristics (or you provide a local gzReadHeader implementation), this method can't be applied on the fly.

The other decompression method is gzStreamExpander; it works without knowing the uncompressed size and doesn't need to seek, so it's probably a better candidate, although it still depends on Stream inheritance.

tar.gz

A tar.gz is a gzipped tar file, so it can be handled:

  1. by streaming the uncompressed gz data straight into the tar expander (no intermediate file), which still requires the full .tar.gz file to exist on some filesystem
  2. by uncompressing the gz data to an intermediate .tar file, then untarring that to the filesystem

Solution 1) tarGzExpanderNoTempFile is faster but needs a lot of heap (32KB), which you probably don't have if you're running a BLE server and client.
Solution 2) tarGzExpander is slower and requires more than twice the necessary space, but works on low-end devices such as the ESP8266.

Thoughts

If you're limited in destination space, gzStreamExpander is the best fit, as it only uses readBytes to access one byte at a time (unzip crc method), which leaves you the opportunity to implement your own CustomStream::readBytes method accessing a buffer shared with the BLE transfer logic.

However, read and seek responsibilities depend on the relationship between the BLE peers during the file transfer. For example, if notifications are used to send file chunks, there is no way for the receiver to tell the notifier to "wait until the last chunk is uncompressed" while properly maintaining the buffer.

On the other hand, if the file chunks are sent as query responses, then it should also be possible to query for the size, the cursor position and a specific file data chunk, and no buffer is really needed.

Although both situations are out of scope for ESP32-Targz, they can still be solved by writing a class that inherits from Stream, along the lines of the sketch below.
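
For what it's worth, here is a minimal sketch of such a class, assuming the Arduino core's Stream API and a BLE write callback that pushes data into it; BufferStream and feed() are made-up names, not part of ESP32-Targz:

#include <Arduino.h>

// Ring buffer exposed as a Stream: the BLE characteristic's onWrite callback
// calls feed() with each received chunk, while the unpacker consumes bytes
// through the Stream interface. readBytes() is inherited from Stream and is
// built on read()/available(), so no extra work is needed for it.
class BufferStream : public Stream {
  public:
    BufferStream( size_t capacity ) : _cap( capacity ) { _buf = (uint8_t*)malloc( capacity ); }
    ~BufferStream() { free( _buf ); }

    // producer side (e.g. BLE onWrite callback): returns how many bytes were accepted
    size_t feed( const uint8_t* data, size_t len ) {
      size_t n = 0;
      while( n < len && _size < _cap ) {
        _buf[ (_head + _size) % _cap ] = data[n++];
        _size++;
      }
      return n; // the caller must retry whatever wasn't accepted
    }

    // consumer side: the Stream interface used by the unpacker
    int available() override { return (int)_size; }
    int read() override {
      if( _size == 0 ) return -1;
      uint8_t b = _buf[_head];
      _head = (_head + 1) % _cap;
      _size--;
      return b;
    }
    int peek() override { return _size ? _buf[_head] : -1; }
    size_t write( uint8_t ) override { return 0; } // read-only stream
    void flush() { }

  private:
    uint8_t* _buf = nullptr;
    size_t _cap, _head = 0, _size = 0;
};

The BLE write callback then calls feed() with each received chunk, and a pointer to the BufferStream instance is passed wherever the library expects a Stream*. Keep in mind that Stream::readBytes() gives up after its timeout, so the reader task should not be allowed to outrun the BLE transfer.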

@ChaoticNeutralCzech

(In reply to tobozo's last comment)

Hello,
I have a pretty basic ESP32 web-scraping client with

#include <WiFi.h>
#include <WiFiClientSecure.h>
WiFiClientSecure client;

I then make my HTTPS request for a gzipped website (asking for Accept-Encoding: gzip in the headers), wait for a reply and read the response headers, from which I store Content-Length (the value varies, about 9000 bytes). (I could omit "Accept-Encoding" from my request to get the uncompressed website, but the length would be about 80 kB, five times what the ESP32 can decrypt via HTTPS, and the "juicy bits" are about 20 kB in.) So I need to un-gzip the response, which is read one byte at a time using client.read. I cannot use SPIFFS or another file in flash because it would need to be overwritten every minute, so I would love a gzStreamExpander-like function that accepts bytes from the client, which is not a Stream*, and I don't really know how I could "convert" it into one (if you find a resource on this, please send it my way!); I could store it into a 10 kB byte array first, though. Also, I cannot have it output to a file for the aforementioned reason, but an 80 kB output array would be OK. If I'm short on RAM, perhaps I'd need to decompress only half of it into a shorter 40 kB array.

I think this may well be possible because uzlib.h's code defines struct TINF_DATA that includes const unsigned char *source and unsigned char *dest, which I guess are pointers to byte arrays that are (parts of?) the input and output "files". I don't know how to access those directly, though. I would also prefer to go through your library rather than interface with uzlib directly because I don't know how to make that work on the ESP32 (what do I #include?).

Thank you,
ChaoticNeutralCzech

@tobozo
Owner

tobozo commented Jun 13, 2023

@ChaoticNeutralCzech GzUnpacker without a stream is just raw uzlib.

Here's a basic example of uzlib_uncompress_chksum() (i.e. uzlib without the dictionary); it consumes one gz byte per iteration. You need to implement your own byteReaderFunction that returns the result of HTTPClient->read(), and your own destReaderFunction that returns the byte value at a given offset in the decompressed output.

Be aware that accessing SPIFFS in RW mode while receiving an HTTP response is known to produce bugs (crashes, data corruption). It is more stable to store the decompressed data in PSRAM or on an SD card, then copy the blob to the filesystem after the decompression is complete and the HTTP connection is closed.

uzlib setup

TINF_DATA decompress_nodict = {
  .source = nullptr,
  .readSourceByte = yourByteReaderFunction, // reads one byte from the gzipped input
  .readDestByte = yourDestReaderFunction,   // reads any byte back from the decompressed output
  .destSize = 1,
  .readSourceErrors = 0,
};

uzlib_init();

int res = uzlib_gzip_parse_header(&decompress_nodict);
if (res != TINF_OK) {
  printf("[ERROR] uzlib_gzip_parse_header failed (response code %d)\n", res);
  return;
}

uzlib_uncompress_init(&decompress_nodict, NULL, 0);

uzlib loop

const size_t output_buffer_size = 4096;    // at least 1KB, at most 4KB
uint8_t output_buffer[output_buffer_size]; // decompressed output chunk
size_t total_size = 0;
size_t output_position = 0;
res = TINF_OK; // 'res' was declared in the setup snippet above

do {
    decompress_nodict.dest = &output_buffer[output_position];
    res = uzlib_uncompress_chksum(&decompress_nodict);
    if (res != TINF_OK) break; // uncompress done or aborted, no need to go further
    output_position++;
    if (output_position == output_buffer_size) {  // when destination buffer is filled, write/stream it
      total_size += output_buffer_size;
      printf("[INFO] Buffer full, now writing %u bytes (total=%u)\n", (unsigned)output_buffer_size, (unsigned)total_size);
      write_buffer( output_buffer, output_buffer_size ); // user-provided sink (file, SD, PSRAM, RAM array...)
      output_position = 0;
    }
} while ( res == TINF_OK );

if ( res != TINF_DONE && decompress_nodict.readSourceErrors > 0 ) {
    printf("Decompression failed with %d errors\n", decompress_nodict.readSourceErrors);
    return;
}

// consume remaining bytes in the buffer, if any
if( output_position != 0 ) write_buffer( output_buffer, output_position );
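
For @ChaoticNeutralCzech's "decompress to RAM" case, write_buffer() could simply append to a static array instead of writing to a file. This is only a sketch: the function name and signature come from the loop above, and the 80KB destination array is an assumption based on the expected expanded size.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint8_t decompressed[80 * 1024]; // assumed destination buffer (~80 kB expanded page)
static size_t  decompressed_len = 0;

// user-provided sink called by the uzlib loop above: append a decompressed chunk to RAM
void write_buffer( uint8_t* buf, size_t len )
{
  if( decompressed_len + len > sizeof(decompressed) ) {
    len = sizeof(decompressed) - decompressed_len; // keep whatever still fits
  }
  memcpy( &decompressed[decompressed_len], buf, len );
  decompressed_len += len;
}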
