Improve inflate performance for repetitive data #19113
Conversation
…ow for overlapping regions.
cc @ianic for review
Thanks! Looks great. I never tested with that kind of data.
Right, I can also measure this now. I must have hallucinated an extra digit in my tests yesterday; it's 0.1 s for me, not 1.0 s. I looked at the previous implementation, and honestly I have no clue why it is faster.
So this should be reverted?
No, this is a valuable improvement. When I refer to v1, I mean the implementation we had until a few weeks ago. This commit improves the current stdlib implementation. Although we are still behind v1 for this particular case (a huge empty file), in many other cases we are now better than v1: on average 1.3 times faster for decompression. I did a few benchmarks for these empty files and have some findings, but I still need some time to summarize them.
For this data (1G empty): the empty file is compressed as a series of back references of length 258 (the maximum length) and distance 1. Each one tells the decompressor to go 1 byte back and copy 258 bytes from that position to the current position. In a 1G file there are 4_161_790 such back references. The whole purpose of the loop is to ensure non-overlapping buffers so we can use `@memcpy`. Any ideas why, in this particular case, `copyForwards` ends up faster?
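To make the overlap concrete, here is a minimal sketch, assuming a hypothetical flat buffer `hist` and write position `pos` (not the stdlib's CircularBuffer), of what such a back reference means:

```zig
// Sequential semantics of a back reference: every output byte equals the byte
// `distance` positions behind it (assumes pos >= distance). With distance == 1
// the source and destination overlap almost completely, so a single @memcpy
// over the whole range is not allowed and the copy degenerates to repeating
// one byte 258 times.
fn writeMatchByteByByte(hist: []u8, pos: usize, length: usize, distance: usize) void {
    var i: usize = 0;
    while (i < length) : (i += 1) {
        hist[pos + i] = hist[pos + i - distance];
    }
}
```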
What memcpy implementation is ending up being used?
These numbers are from arm Linux (a Linux VM on an M1 Mac), but I got similar results on x86 Linux.
For example, the memcpy implementation could be from glibc, or it could be Zig's compiler-rt memcpy, which has an open PR for optimizing it: #18912
It is …
Interesting. I tried this but it made no performance difference for me. Did you maybe change something else as well?
I also tried to apply this patch to memcpy, and it does give a small improvement, from 179 ms to 165 ms. Still far from v1's performance though.
No, but I must correct myself: after looking once more, I'm finding this difference only on arm, not on x86. Here are benchmarks from arm Linux (Benchmark 1 uses copyForwards).
Benchmark 3, which has v1 performance, is using this code:

```zig
if (from_end < buffer_len and to_end < buffer_len) {
    while (to < to_end) {
        const dst = self.buffer[to..to_end];
        const src = self.buffer[from..to];
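        // src is the already-written part of the match (it grows as `to` advances)
        // and dst is what remains to be filled; the two slices meet at `to`, so each
        // copy below moves min(src.len, dst.len) bytes between non-overlapping regions.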
        to += brk: {
            if (src.len > dst.len) {
                std.mem.copyForwards(u8, dst, src[0..dst.len]);
                break :brk dst.len;
            }
            std.mem.copyForwards(u8, dst[0..src.len], src);
            break :brk src.len;
        };
    }
    return;
}
```

but it also ruins performance in other tests, so I'm not considering including it in our implementation. This case (the empty file) is interesting, but I don't think it is highly relevant. You also mentioned "some uncompressed images with a small palette"; do you have any recommendation for where to download a relevant dataset?
I just tried this and it's still slower than v1 for me on x86, but certainly faster than before. Not sure what's going on there.
I can just send you some data: data_sets.tar.gz. The first one is the file that I meant by "some uncompressed images with a small palette": it's ~100 pixel art animation frames (192×108) stored using an 8-bit palette inside a custom format from a self-written pixel art editor. The second one is a screenshot of VSCode I just took, converted to a bitmap with GIMP. It should be fairly good for triggering this issue because the background is mostly uniform gray in sections longer than 258 bytes.
Reminder that binary layout changes nondeterministically every time you compile, and can have drastic impacts on performance. Related talk.
I made a project to ease benchmarking and experimenting with various improvements in flate.
Thanks, that's pretty useful. I tried one more approach: instead of changing the CircularBuffer, I decided to try some changes further up the call stack in inflate.zig.

```zig
fn dynamicBlock(self: *Self) !bool {
    // Hot path loop!
    var last_distance: u16 = 0;
    var last_length: u16 = undefined;
    while (!self.hist.full()) {
        try self.bits.fill(15); // optimization so other bit reads can be buffered (avoiding one `if` in hot path)
        const sym = try self.decodeSymbol(&self.lit_dec);
        switch (sym.kind) {
            .literal => {
                self.hist.write(sym.symbol);
                last_distance = 0;
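                // A literal interrupts the repeating run, so the distance-extension
                // trick in the .match branch below stays disabled until a new match
                // sets last_distance again.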
            },
            .match => { // Decode match backreference <length, distance>
                try self.bits.fill(5 + 15 + 13); // so we can use buffered reads
                const length = try self.decodeLength(sym.symbol);
                const dsm = try self.decodeSymbol(&self.dst_dec);
                var distance = try self.decodeDistance(dsm.symbol);
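                // If this match overlaps its source (distance < length) and repeats the
                // previous distance, the history already ends in a run with period
                // `distance`, so the distance can be extended by a multiple of itself to
                // point at equivalent bytes further back, shrinking or removing the overlap.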
                if (distance < length and distance == last_distance) {
                    distance += last_length - last_length % distance;
                } else {
                    last_length = length;
                    last_distance = distance;
                }
                try self.hist.writeMatch(length, distance);
            },
            .end_of_block => return true,
        }
    }
    return false;
}
```

At 60 ms, this is a lot faster than v1 on the 1 GB file and somewhat faster on the images. However, on the other tests it does cause a small (<5%) slowdown. This is pretty small and may come from binary layout changes (like Andrew said), but I added more code, so it kind of makes sense that it would be slower. Do you have any ideas for how to avoid this slowdown?
Uf, what an interesting trick. It reminds me of lazy match evaluation in compression: there, when the compressor finds a match, it first tries the next character, and if that yields a better match it drops the previous one, writes a literal for that previous character, and uses the better match. I got a similar performance increase, comparable to the decrease you are experiencing here, by removing branching in this hot loop. It seems that it is pretty hard to optimize for all cases. I think that, for now, we should treat decompressing a source-code tar (the ziglang case) as the most important workload. The reasoning is that the primary tar usage is to support the package manager.
I noticed that the new inflate implementation was slower when it comes to repeating data, like for example a bunch of zeroes.
This is relevant for things like images and voxel data, where we often see long runs of repeating data.
To improve this, I used memcpys of iteratively doubling size for such repeating data.
This is a fairly common approach and was also present in the old inflate implementation.
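As a rough illustration of that approach, here is a minimal sketch, assuming a hypothetical helper `expandRepeat` over a flat buffer (not the actual CircularBuffer code), of expanding a match whose distance is smaller than its length using copies of doubling size:

```zig
const std = @import("std");

// Expand a <length, distance> match at `pos`, where the distance may be smaller
// than the length. The bytes already produced become the source of the next,
// larger copy, so every @memcpy is between non-overlapping slices and the chunk
// size roughly doubles each iteration (distance, 2*distance, 4*distance, ...).
fn expandRepeat(buf: []u8, pos: usize, length: usize, distance: usize) void {
    std.debug.assert(distance > 0 and distance <= pos);
    var copied: usize = 0;
    while (copied < length) {
        // buf[pos - distance .. pos + copied] already holds valid, periodic data,
        // so up to `distance + copied` bytes can be copied in one non-overlapping step.
        const chunk = @min(distance + copied, length - copied);
        @memcpy(buf[pos + copied ..][0..chunk], buf[pos - distance ..][0..chunk]);
        copied += chunk;
    }
}

test "expandRepeat repeats a short run" {
    var buf = [_]u8{ 'a', 'b', 'c' } ++ [_]u8{0} ** 8;
    expandRepeat(&buf, 3, 8, 3);
    try std.testing.expectEqualSlices(u8, "abcabcabcab", &buf);
}
```

A real decompressor additionally has to handle wrap-around of the circular history buffer, which is what the `from_end`/`to_end` bounds check in the snippet quoted earlier guards against.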