
Improve inflate performance for repetitive data #19113

Merged: 1 commit into ziglang:master on Feb 28, 2024

Conversation

IntegratedQuantum (Contributor)

I noticed that the new inflate implementation was slower on repetitive data, for example a long run of zeroes.
This is relevant for things like images and voxel data, where we often see long runs of repeating bytes.

To improve this, I use @memcpy calls of iteratively doubling size for such repeating data (see the sketch below).
This is a fairly common approach and was also present in the old inflate implementation.
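
For illustration, here is a minimal sketch of the doubling-copy idea; writeMatchDoubling, its signature, and the test are hypothetical and assume the pattern bytes already sit directly before the write position (this is not the PR's actual code):

const std = @import("std");

// Expand one LZ77 back-reference <length, distance> at `pos` in `buf`,
// assuming buf[pos - distance .. pos] is already written and
// pos + length <= buf.len. When distance < length the source overlaps the
// destination, so we copy chunks that double in size: after each chunk the
// freshly written bytes become valid (non-overlapping) source for the next,
// larger @memcpy.
fn writeMatchDoubling(buf: []u8, pos: usize, length: usize, distance: usize) void {
    var to: usize = pos;
    var remaining: usize = length;
    var chunk: usize = distance; // bytes of the pattern available behind `to`
    while (chunk < remaining) {
        @memcpy(buf[to..][0..chunk], buf[to - chunk ..][0..chunk]);
        to += chunk;
        remaining -= chunk;
        chunk *= 2;
    }
    // Final non-overlapping copy of whatever is left (at most `chunk` bytes).
    @memcpy(buf[to..][0..remaining], buf[to - chunk ..][0..remaining]);
}

test "doubling copy expands a distance-1 run" {
    var buf = [_]u8{0xaa} ** 16;
    buf[0] = 0;
    writeMatchDoubling(&buf, 1, 15, 1);
    for (buf) |b| try std.testing.expectEqual(@as(u8, 0), b);
}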

Benchmark

Code
const std = @import("std");

pub fn deflate(allocator: std.mem.Allocator, data: []const u8) ![]u8 {
	var result = std.ArrayList(u8).init(allocator);
	var comp = try std.compress.flate.compressor(result.writer(), .{});
	_ = try comp.write(data);
	try comp.finish();
	return try result.toOwnedSlice();
}

pub fn inflateTo(buf: []u8, data: []const u8) !usize {
	var streamIn = std.io.fixedBufferStream(data);
	var decomp = std.compress.flate.decompressor(streamIn.reader());
	var streamOut = std.io.fixedBufferStream(buf);
	try decomp.decompress(streamOut.writer());
	return streamOut.getWritten().len;
}

const file: []const u8 = @embedFile("data.bin");
var in: [file.len]u8 = undefined;
var out: [file.len]u8 = undefined;

pub fn main() !void {
	@memcpy(&in, file);
	const t1 = std.time.nanoTimestamp();
	const compr = try deflate(std.heap.page_allocator, &in);
	defer std.heap.page_allocator.free(compr);
	const t2 = std.time.nanoTimestamp();
	_ = try inflateTo(&out, compr);
	const t3 = std.time.nanoTimestamp();
	std.log.err("{} {} {}", .{t2 - t1, t3 - t2, compr.len});
	for(&in, &out) |a, b| {
		std.debug.assert(a == b);
	}
}

data | master | this PR | speedup
1 GB filled with zeroes | 1.5 s (1.0 s before the deflate rewrite) | 0.17 s | 8.8×
some uncompressed images with a small palette | 1.05 ms | 0.46 ms | 2.3×
zig/src folder put into an uncompressed tar archive | 273 ms | 269 ms | 1.0×

@andrewrk (Member)

cc @ianic for review

@ianic (Contributor) commented Feb 28, 2024

Thanks! Looks great.

I never tested with that kind of data.
I'll do some more benchmarks, because I still find the previous v1 implementation 1.6 times faster.
I got these timings (in nanoseconds) by running the code above:

v1:     132141572
this:   215850413
std:   1519825334

andrewrk enabled auto-merge (rebase) on February 28, 2024 01:34
andrewrk merged commit 6e07888 into ziglang:master on Feb 28, 2024 (10 checks passed)
@IntegratedQuantum (Contributor, Author)

> I still find the previous v1 implementation 1.6 times faster.

Right, I can also measure this now. I must have hallucinated an extra digit in my tests yesterday, it's 0.1 s for me, not 1.0 s.

I looked at the previous implementation, and honestly I have no clue why it is faster.

@rofrol (Contributor) commented Feb 28, 2024

So this should be reverted?

@ianic (Contributor) commented Feb 28, 2024

No, this is a valuable improvement. When I refer to v1, I mean the implementation we had until a few weeks ago. This commit improves the current stdlib implementation, although we are still behind v1 for this particular case (a huge empty file). In many other cases we are now better than v1, on average 1.3 times faster for decompression.

I did a few benchmarks for these empty files and have some findings, but I still need some time to summarize them.

@ianic (Contributor) commented Feb 28, 2024

For this data (1 GB of zeroes), if I replace @memcpy with std.mem.copyForwards (as used in v1), I get a better result: 144 ms instead of 215 ms.

The empty file is compressed as a series of back-references of length 258 (the maximum length) and distance 1. That tells the decompressor to go 1 byte back and copy 258 bytes from that position to the current position. In the 1 GB file there are 4_161_790 such back-references.
Each such back-reference runs the loop 8 times, for cur_len of 1, 2, 4, 8, 16, 32, 64, 128, plus one copy outside the loop for the remaining 3 bytes. And that repeats 4 million times. That is essentially all the work in this case.

The whole purpose of the loop is to ensure non-overlapping buffers so we can use @memcpy. But then replacing @memcpy with copyForwards gains performance. The loop can be replaced with a single copyForwards, but that ruins performance, to 1500 ms.

Any ideas why, in this particular case, copyForwards outperforms @memcpy?
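
For reference, a stripped-down sketch of the "single copy" alternative being discussed; the helper name and test are hypothetical, not the stdlib code. A plain forward byte loop (essentially what copyForwards does element by element) handles the overlap naturally, because with distance 1 every write reads the byte the previous iteration just produced:

const std = @import("std");

// Sketch only: expand a back-reference in one forward pass. Overlap between
// source and destination is fine here because bytes are produced strictly
// left to right, which is exactly the repetition the back-reference encodes.
fn writeMatchByteLoop(buf: []u8, pos: usize, length: usize, distance: usize) void {
    var i: usize = 0;
    while (i < length) : (i += 1) {
        buf[pos + i] = buf[pos + i - distance];
    }
}

test "byte loop expands a distance-1 run" {
    var buf = [_]u8{0} ** 8;
    buf[0] = 'x';
    writeMatchByteLoop(&buf, 1, 7, 1);
    for (buf) |b| try std.testing.expectEqual(@as(u8, 'x'), b);
}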

@andrewrk (Member)

What memcpy implementation is ending up being used?

@ianic (Contributor) commented Feb 28, 2024

These numbers are from ARM Linux (a Linux VM on an M1 Mac), but I got similar results on x86 Linux.

@andrewrk (Member) commented Feb 28, 2024

For example, the memcpy implementation could come from glibc (-lc, probably), from musl (-lc -target native-native-musl), or from zig's compiler-rt (when not linking libc).

zig's compiler-rt memcpy has an open PR for optimizing it: #18912

@ianic (Contributor) commented Feb 28, 2024

It is lib/compiler-rt/memcpy.zig

@IntegratedQuantum (Contributor, Author)

> For this data (1 GB of zeroes), if I replace @memcpy with std.mem.copyForwards (as used in v1), I get a better result: 144 ms instead of 215 ms.

Interesting. I tried this but it made no performance difference for me. Did you maybe change something else as well?
Curiously though, using std.mem.copyBackwards does improve performance from 179 ms to 145 ms on my computer (x86_64 Linux).

> zig's compiler-rt memcpy has an open PR for optimizing it: #18912

I also tried applying that patch to memcpy, and it does give a small improvement, from 179 ms to 165 ms. Still far from v1's performance, though.

@ianic (Contributor) commented Feb 29, 2024

> Did you maybe change something else as well?

No, but I must correct myself: after looking once more, I'm seeing this difference only on ARM, not on x86.

Here are the benchmarks from ARM Linux:

Benchmark 1 - copyForwards
Benchmark 2 - @memcpy
Benchmark 3 - using the same approach as v1

Benchmark 1: zig-out-copy-forwards/bin/flate_bench_std -i 4
  Time (mean ± σ):     150.8 ms ±   0.6 ms    [User: 150.5 ms, System: 0.3 ms]
  Range (min … max):   149.7 ms … 151.7 ms    19 runs

Benchmark 2: zig-out-std/bin/flate_bench_std -i 4
  Time (mean ± σ):     232.7 ms ±   0.9 ms    [User: 232.1 ms, System: 0.3 ms]
  Range (min … max):   231.9 ms … 235.2 ms    12 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (235.2 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark 3: zig-out/bin/flate_bench_std -i 4
  Time (mean ± σ):     128.1 ms ±   1.1 ms    [User: 127.8 ms, System: 0.2 ms]
  Range (min … max):   127.1 ms … 131.6 ms    23 runs

Summary
  zig-out/bin/flate_bench_std -i 4 ran
    1.18 ± 0.01 times faster than zig-out-copy-forwards/bin/flate_bench_std -i 4
    1.82 ± 0.02 times faster than zig-out-std/bin/flate_bench_std -i 4

Benchmark 3, which has v1 performance, uses this code:

        if (from_end < buffer_len and to_end < buffer_len) {
            while (to < to_end) {
                const dst = self.buffer[to..to_end];
                const src = self.buffer[from..to];

                to += brk: {
                    if (src.len > dst.len) {
                        std.mem.copyForwards(u8, dst, src[0..dst.len]);
                        break :brk dst.len;
                    }
                    std.mem.copyForwards(u8, dst[0..src.len], src);
                    break :brk src.len;
                };
            }
            return;
        }

but it also ruins performance in other tests, so I'm not considering including it in our implementation.

This case (the empty file) is interesting, but I don't think it is highly relevant. You also mentioned "some uncompressed images with a small palette"; do you have any recommendations for where to download a relevant dataset?

@IntegratedQuantum (Contributor, Author)

> Benchmark 3, which has v1 performance, uses this code:

Just tried this: it's still slower than v1 for me on x86, but certainly faster than before. Not sure what's going on there.

> This case (the empty file) is interesting, but I don't think it is highly relevant. You also mentioned "some uncompressed images with a small palette"; do you have any recommendations for where to download a relevant dataset?

I can just send you some data: data_sets.tar.gz

The first one is the file I meant by "some uncompressed images with a small palette". It's ~100 pixel-art animation frames (192×108) stored with an 8-bit palette in a custom format from a self-written pixel-art editor.
It varies between 428 µs with this PR and 410 µs with your benchmark 3 (an insignificant difference).

The second one is a screenshot of VSCode I just took, converted to a bitmap with GIMP. It should be fairly good at triggering this issue because the background is mostly uniform gray in runs longer than 258 bytes.
It varies between 5.7 ms with this PR and 5.2 ms with your benchmark 3 (a 10% difference).

@andrewrk (Member)

Reminder that binary layout changes nondeterministically every time you compile, and can have drastic impacts on performance. Related talk.

@ianic (Contributor) commented Feb 29, 2024

I made a project to ease benchmarking and experimenting with various improvements in flate.
I put a few 'relevant' test cases there, including your images and the 1 GB empty file.

@IntegratedQuantum (Contributor, Author)

> I made a project to ease benchmarking and experimenting with various improvements in flate.

Thanks, that's pretty useful.

I tried one more approach: instead of changing the CircularBuffer, I made some changes further up the call stack in inflate.zig.
Essentially I keep track of the last match, and if consecutive matches connect seamlessly (have the same distance), then I change the distance so that the CircularBuffer only needs to perform a single memcpy:

        fn dynamicBlock(self: *Self) !bool {
            // Hot path loop!
            var last_distance: u16 = 0;
            var last_length: u16 = undefined;
            while (!self.hist.full()) {
                try self.bits.fill(15); // optimization so other bit reads can be buffered (avoiding one `if` in hot path)
                const sym = try self.decodeSymbol(&self.lit_dec);

                switch (sym.kind) {
                    .literal => {
                        self.hist.write(sym.symbol);
                        last_distance = 0;
                    },
                    .match => { // Decode match backreference <length, distance>
                        try self.bits.fill(5 + 15 + 13); // so we can use buffered reads
                        const length = try self.decodeLength(sym.symbol);
                        const dsm = try self.decodeSymbol(&self.dst_dec);
                        var distance = try self.decodeDistance(dsm.symbol);
                        if (distance < length and distance == last_distance) {
                            // Same period as the previous match: push the distance
                            // back by a whole number of periods to shrink or remove
                            // the overlap, allowing larger copies in writeMatch.
                            distance += last_length - last_length % distance;
                        } else {
                            last_length = length;
                            last_distance = distance;
                        }
                        try self.hist.writeMatch(length, distance);
                    },
                    .end_of_block => return true,
                }
            }
            return false;
        }

At 60 ms, this is a lot faster than v1 on the 1 GB file and somewhat faster on the images. However, on the other tests it causes a small (<5%) slowdown. That is pretty small and may come from binary layout changes (as Andrew said), but I also added more code, so it kind of makes sense that it would be slower. Do you have any ideas how to avoid this slowdown?
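
To make the arithmetic concrete for the 1 GB zero case: the previous match was 258 bytes at distance 1, so the rewrite turns the next distance into 1 + (258 - 258 % 1) = 259, which is at least the match length of 258, so the copy source no longer overlaps the destination and writeMatch can (barring buffer wrap-around) use a single copy. A standalone check of just that formula (a hypothetical test, not part of the PR):

const std = @import("std");

test "distance rewrite makes the zero-run copy non-overlapping" {
    const last_distance: u16 = 1; // previous match: 258 bytes at distance 1
    const last_length: u16 = 258;
    const length: u16 = 258; // current match, same distance
    var distance: u16 = 1;
    if (distance < length and distance == last_distance) {
        // Same formula as in the snippet above: add a whole number of periods.
        distance += last_length - last_length % distance;
    }
    try std.testing.expectEqual(@as(u16, 259), distance);
    // distance >= length, so the source range no longer overlaps the destination.
    try std.testing.expect(distance >= length);
}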

@ianic (Contributor) commented Mar 1, 2024

Oof, what an interesting trick.

This reminds me of lazy match evaluation in compression: there, when the compressor finds a match, it first tries the next character, and if that results in a better match it drops the previous one, writes a literal for that previous character, and uses the better match.

I got a performance increase of about the same size as the decrease you are seeing here by removing branching in this hot loop. Those self.bits.fill calls are there to eliminate the need to check whether the bit buffer has enough bits, removing one if inside self.bits.read.

It seems it is pretty hard to optimize for all cases. I think that, for now, we should treat decompressing tar archives (the zig/src case) as the most important workload. The reasoning is that the primary use of tar is to support the package manager.
