
Improve inflate performance for repetitive data #19113

Merged: 1 commit into ziglang:master on Feb 28, 2024

Conversation

IntegratedQuantum (Contributor)

I noticed that the new inflate implementation was slower on repetitive data, for example a long run of zeroes.
This is relevant for things like images and voxel data, where we often see long runs of repeating bytes.

To improve this, I use @memcpy calls of iteratively doubling size for such repeating data (see the sketch below).
This is a fairly common approach and was also present in the old inflate implementation.
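
For illustration, here is a minimal sketch of the doubling-copy idea; writeMatchDoubling, its signature, and the test are hypothetical and assume the pattern bytes already sit directly before the write position (this is not the PR's actual code):

const std = @import("std");

// Expand one LZ77 back-reference <length, distance> at `pos` in `buf`,
// assuming buf[pos - distance .. pos] is already written and
// pos + length <= buf.len. When distance < length the source overlaps the
// destination, so we copy chunks that double in size: after each chunk the
// freshly written bytes become valid (non-overlapping) source for the next,
// larger @memcpy.
fn writeMatchDoubling(buf: []u8, pos: usize, length: usize, distance: usize) void {
    var to: usize = pos;
    var remaining: usize = length;
    var chunk: usize = distance; // bytes of the pattern available behind `to`
    while (chunk < remaining) {
        @memcpy(buf[to..][0..chunk], buf[to - chunk ..][0..chunk]);
        to += chunk;
        remaining -= chunk;
        chunk *= 2;
    }
    // Final non-overlapping copy of whatever is left (at most `chunk` bytes).
    @memcpy(buf[to..][0..remaining], buf[to - chunk ..][0..remaining]);
}

test "doubling copy expands a distance-1 run" {
    var buf = [_]u8{0xaa} ** 16;
    buf[0] = 0;
    writeMatchDoubling(&buf, 1, 15, 1);
    for (buf) |b| try std.testing.expectEqual(@as(u8, 0), b);
}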

Benchmark

Code
const std = @import("std");

pub fn deflate(allocator: std.mem.Allocator, data: []const u8) ![]u8 {
	var result = std.ArrayList(u8).init(allocator);
	var comp = try std.compress.flate.compressor(result.writer(), .{});
	_ = try comp.write(data);
	try comp.finish();
	return try result.toOwnedSlice();
}

pub fn inflateTo(buf: []u8, data: []const u8) !usize {
	var streamIn = std.io.fixedBufferStream(data);
	var decomp = std.compress.flate.decompressor(streamIn.reader());
	var streamOut = std.io.fixedBufferStream(buf);
	try decomp.decompress(streamOut.writer());
	return streamOut.getWritten().len;
}

const file: []const u8 = @embedFile("data.bin");
var in: [file.len]u8 = undefined;
var out: [file.len]u8 = undefined;

pub fn main() !void {
	@memcpy(&in, file);
	const t1 = std.time.nanoTimestamp();
	const compr = try deflate(std.heap.page_allocator, &in);
	defer std.heap.page_allocator.free(compr);
	const t2 = std.time.nanoTimestamp();
	_ = try inflateTo(&out, compr);
	const t3 = std.time.nanoTimestamp();
	std.log.err("{} {} {}", .{t2 - t1, t3 - t2, compr.len});
	for(&in, &out) |a, b| {
		std.debug.assert(a == b);
	}
}

data | master | this PR | speedup
1 GB filled with zeroes | 1.5 s (1.0 s before the deflate rewrite) | 0.17 s | 8.8×
some uncompressed images with a small palette | 1.05 ms | 0.46 ms | 2.3×
zig/src folder put into an uncompressed tar archive | 273 ms | 269 ms | 1.0×

@andrewrk (Member)

cc @ianic for review

@ianic (Contributor) commented Feb 28, 2024

Thanks! Looks great.

I never tested with that kind of data.
I'll do some more benchmarks, because I still find the previous v1 implementation 1.6 times faster.
I got these timings (in nanoseconds) by running the code above:

v1:     132141572
this:   215850413
std:   1519825334

andrewrk enabled auto-merge (rebase) on February 28, 2024 01:34
andrewrk merged commit 6e07888 into ziglang:master on Feb 28, 2024 (10 checks passed)
@IntegratedQuantum (Contributor, Author)

> I still find the previous v1 implementation 1.6 times faster.

Right, I can also measure this now. I must have hallucinated an extra digit in my tests yesterday, it's 0.1 s for me, not 1.0 s.

I looked at the previous implementation, and honestly I have no clue why it is faster.

@rofrol (Contributor) commented Feb 28, 2024

So this should be reverted?

@ianic (Contributor) commented Feb 28, 2024

No, this is a valuable improvement. When I refer to v1, I mean the implementation we had until a few weeks ago. This commit improves the current stdlib implementation, although we are still behind v1 for this particular case (a huge empty file). In many other cases we are now better than v1, on average 1.3 times faster for decompression.

I did a few benchmarks for these empty files and have some findings, but I still need some time to summarize them.

@ianic (Contributor) commented Feb 28, 2024

For this data (1 GB of zeroes), if I replace @memcpy with std.mem.copyForwards (as used in v1), I get a better result: 144 ms instead of 215 ms.

The empty file is compressed as a series of back-references of length 258 (the maximum length) and distance 1. That tells the decompressor to go 1 byte back and copy 258 bytes from that position to the current position. In the 1 GB file there are 4_161_790 such back-references.
Each such back-reference runs the loop 8 times, for cur_len of 1, 2, 4, 8, 16, 32, 64, 128, plus one copy outside the loop for the remaining 3 bytes. And that repeats 4 million times. That is essentially all the work in this case.

The whole purpose of the loop is to ensure non-overlapping buffers so we can use @memcpy. But then replacing @memcpy with copyForwards gains performance. The loop can be replaced with a single copyForwards, but that ruins performance, to 1500 ms.

Any ideas why, in this particular case, copyForwards outperforms @memcpy?
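
For reference, a stripped-down sketch of the "single copy" alternative being discussed; the helper name and test are hypothetical, not the stdlib code. A plain forward byte loop (essentially what copyForwards does element by element) handles the overlap naturally, because with distance 1 every write reads the byte the previous iteration just produced:

const std = @import("std");

// Sketch only: expand a back-reference in one forward pass. Overlap between
// source and destination is fine here because bytes are produced strictly
// left to right, which is exactly the repetition the back-reference encodes.
fn writeMatchByteLoop(buf: []u8, pos: usize, length: usize, distance: usize) void {
    var i: usize = 0;
    while (i < length) : (i += 1) {
        buf[pos + i] = buf[pos + i - distance];
    }
}

test "byte loop expands a distance-1 run" {
    var buf = [_]u8{0} ** 8;
    buf[0] = 'x';
    writeMatchByteLoop(&buf, 1, 7, 1);
    for (buf) |b| try std.testing.expectEqual(@as(u8, 'x'), b);
}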

@andrewrk (Member)

What memcpy implementation is ending up being used?

@ianic (Contributor) commented Feb 28, 2024

These numbers are from ARM Linux (a Linux VM on an M1 Mac), but I got similar results on x86 Linux.

@andrewrk (Member) commented Feb 28, 2024

For example, the memcpy implementation could come from glibc (-lc, probably), from musl (-lc -target native-native-musl), or from zig's compiler-rt (when not linking libc).

zig's compiler-rt memcpy has an open PR for optimizing it: #18912

@ianic (Contributor) commented Feb 28, 2024

It is lib/compiler-rt/memcpy.zig

@IntegratedQuantum (Contributor, Author)

> For this data (1 GB of zeroes), if I replace @memcpy with std.mem.copyForwards (as used in v1), I get a better result: 144 ms instead of 215 ms.

Interesting. I tried this but it made no performance difference for me. Did you maybe change something else as well?
Curiously though, using std.mem.copyBackwards does improve performance from 179 ms to 145 ms on my computer (x86_64 Linux).

> zig's compiler-rt memcpy has an open PR for optimizing it: #18912

I also tried applying that patch to memcpy, and it does give a small improvement, from 179 ms to 165 ms. Still far from v1's performance, though.

@ianic (Contributor) commented Feb 29, 2024

> Did you maybe change something else as well?

No, but I must correct myself: after looking once more, I'm seeing this difference only on ARM, not on x86.

Here are the benchmarks from ARM Linux:

Benchmark 1 - copyForwards
Benchmark 2 - @memcpy
Benchmark 3 - using the same approach as v1

Benchmark 1: zig-out-copy-forwards/bin/flate_bench_std -i 4
  Time (mean ± σ):     150.8 ms ±   0.6 ms    [User: 150.5 ms, System: 0.3 ms]
  Range (min … max):   149.7 ms … 151.7 ms    19 runs

Benchmark 2: zig-out-std/bin/flate_bench_std -i 4
  Time (mean ± σ):     232.7 ms ±   0.9 ms    [User: 232.1 ms, System: 0.3 ms]
  Range (min … max):   231.9 ms … 235.2 ms    12 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (235.2 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark 3: zig-out/bin/flate_bench_std -i 4
  Time (mean ± σ):     128.1 ms ±   1.1 ms    [User: 127.8 ms, System: 0.2 ms]
  Range (min … max):   127.1 ms … 131.6 ms    23 runs

Summary
  zig-out/bin/flate_bench_std -i 4 ran
    1.18 ± 0.01 times faster than zig-out-copy-forwards/bin/flate_bench_std -i 4
    1.82 ± 0.02 times faster than zig-out-std/bin/flate_bench_std -i 4

Benchmark 3, which has v1 performance, uses this code:

        if (from_end < buffer_len and to_end < buffer_len) {
            while (to < to_end) {
                const dst = self.buffer[to..to_end];
                const src = self.buffer[from..to];

                to += brk: {
                    if (src.len > dst.len) {
                        std.mem.copyForwards(u8, dst, src[0..dst.len]);
                        break :brk dst.len;
                    }
                    std.mem.copyForwards(u8, dst[0..src.len], src);
                    break :brk src.len;
                };
            }
            return;
        }

but it also ruins performance in other tests, so I'm not considering including it in our implementation.

This case (the empty file) is interesting, but I don't think it is highly relevant. You also mentioned "some uncompressed images with a small palette"; do you have any recommendations for where to download a relevant dataset?

@IntegratedQuantum (Contributor, Author)

> Benchmark 3, which has v1 performance, uses this code:

Just tried this: it's still slower than v1 for me on x86, but certainly faster than before. Not sure what's going on there.

> This case (the empty file) is interesting, but I don't think it is highly relevant. You also mentioned "some uncompressed images with a small palette"; do you have any recommendations for where to download a relevant dataset?

I can just send you some data: data_sets.tar.gz

The first one is the file I meant by "some uncompressed images with a small palette". It's ~100 pixel-art animation frames (192×108) stored with an 8-bit palette in a custom format from a self-written pixel-art editor.
It varies between 428 µs with this PR and 410 µs with your benchmark 3 (an insignificant difference).

The second one is a screenshot of VSCode I just took, converted to a bitmap with GIMP. It should be fairly good at triggering this issue because the background is mostly uniform gray in runs longer than 258 bytes.
It varies between 5.7 ms with this PR and 5.2 ms with your benchmark 3 (a 10% difference).

@andrewrk (Member)

Reminder that binary layout changes nondeterministically every time you compile, and can have drastic impacts on performance. Related talk.

@ianic (Contributor) commented Feb 29, 2024

I made a project to ease benchmarking and experimenting with various improvements in flate.
I put a few 'relevant' test cases there, including your images and the 1 GB empty file.

@IntegratedQuantum (Contributor, Author)

> I made a project to ease benchmarking and experimenting with various improvements in flate.

Thanks, that's pretty useful.

I tried one more approach: instead of changing the CircularBuffer, I made some changes further up the call stack in inflate.zig.
Essentially I keep track of the last match, and if consecutive matches connect seamlessly (have the same distance), then I change the distance so that the CircularBuffer only needs to perform a single memcpy:

        fn dynamicBlock(self: *Self) !bool {
            // Hot path loop!
            var last_distance: u16 = 0;
            var last_length: u16 = undefined;
            while (!self.hist.full()) {
                try self.bits.fill(15); // optimization so other bit reads can be buffered (avoiding one `if` in hot path)
                const sym = try self.decodeSymbol(&self.lit_dec);

                switch (sym.kind) {
                    .literal => {
                        self.hist.write(sym.symbol);
                        last_distance = 0;
                    },
                    .match => { // Decode match backreference <length, distance>
                        try self.bits.fill(5 + 15 + 13); // so we can use buffered reads
                        const length = try self.decodeLength(sym.symbol);
                        const dsm = try self.decodeSymbol(&self.dst_dec);
                        var distance = try self.decodeDistance(dsm.symbol);
                        if (distance < length and distance == last_distance) {
                            // Same period as the previous match: push the distance
                            // back by a whole number of periods to shrink or remove
                            // the overlap, allowing larger copies in writeMatch.
                            distance += last_length - last_length % distance;
                        } else {
                            last_length = length;
                            last_distance = distance;
                        }
                        try self.hist.writeMatch(length, distance);
                    },
                    .end_of_block => return true,
                }
            }
            return false;
        }

At 60 ms, this is a lot faster than v1 on the 1 GB file and somewhat faster on the images. However, on the other tests it causes a small (<5%) slowdown. That is pretty small and may come from binary layout changes (as Andrew said), but I also added more code, so it kind of makes sense that it would be slower. Do you have any ideas how to avoid this slowdown?
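
To make the arithmetic concrete for the 1 GB zero case: the previous match was 258 bytes at distance 1, so the rewrite turns the next distance into 1 + (258 - 258 % 1) = 259, which is at least the match length of 258, so the copy source no longer overlaps the destination and writeMatch can (barring buffer wrap-around) use a single copy. A standalone check of just that formula (a hypothetical test, not part of the PR):

const std = @import("std");

test "distance rewrite makes the zero-run copy non-overlapping" {
    const last_distance: u16 = 1; // previous match: 258 bytes at distance 1
    const last_length: u16 = 258;
    const length: u16 = 258; // current match, same distance
    var distance: u16 = 1;
    if (distance < length and distance == last_distance) {
        // Same formula as in the snippet above: add a whole number of periods.
        distance += last_length - last_length % distance;
    }
    try std.testing.expectEqual(@as(u16, 259), distance);
    // distance >= length, so the source range no longer overlaps the destination.
    try std.testing.expect(distance >= length);
}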

@ianic (Contributor) commented Mar 1, 2024

Oof, what an interesting trick.

This reminds me of lazy match evaluation in compression: there, when the compressor finds a match, it first tries the next character, and if that results in a better match it drops the previous one, writes a literal for that previous character, and uses the better match.

I got a performance increase of about the same size as the decrease you are seeing here by removing branching in this hot loop. Those self.bits.fill calls are there to eliminate the need to check whether the bit buffer has enough bits, removing one if inside self.bits.read.

It seems it is pretty hard to optimize for all cases. I think that, for now, we should treat decompressing tar archives (the zig/src case) as the most important workload. The reasoning is that the primary use of tar is to support the package manager.
