Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup of ZopfliFindLongestMatch -- gives ~8% overall performance boost #110

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jthlim
Copy link
Contributor

@jthlim jthlim commented Apr 30, 2016

Check jthlim@b5a1ea2 for a detailed discussion of each change next to the associated code.

Please help verify that all of the comments/statements are valid from your perspective too!

Note: the 8% improvement is compounded on top of the optimizations discussed before (having asserts off, integer cost comparison, -ffast-math, link time optimization, etc.)

Note: Testing compressing hundreds of files has verified so far that these changes to not affect the compression.

@jthlim
Copy link
Contributor Author

jthlim commented Apr 30, 2016

Thanks for the feedback! would be really interesting to see GCC's generated code before and after for that function for ARM. I can't think why there'd be a regression.

Do you see similar results on x86/x64?

I assume this testing is with your modified fork rather than the verbatim zopfli codebase?

@jthlim
Copy link
Contributor Author

jthlim commented May 1, 2016

I've reverted the order of comparisons change due to the regression that you've noticed.

re: size/space trade off.. A lot of those variables end up in registers in clang, so there's no extra memory tradeoff at all. I guess that'll be platform specific too, but I expect to see that in most cases.

And perhaps gcc is just better at removing the unnecessary masking when appropriate than clang is.

@lorents17
Copy link

lorents17 commented May 1, 2016

enwik8 --i15

Zopfli (original) Zopfli (jthlim)
34 967 036 byte 34 967 390 byte
6m19.762s 6m1.886s

ECT

Compressor File Size Time
ECT -2 35 357 940 byte 0m24.922s
ECT -3 35 018 152 byte 0m30.350s
ECT -4 34 967 607 byte 0m37.274s
ECT -5 34 963 304 byte 0m58.680s
ECT -6 34 942 395 byte 1m19.929s
ECT -7 34 939 105 byte 2m1.326s
ECT -8 34 938 139 byte 5m59.643s

@jthlim
Copy link
Contributor Author

jthlim commented May 2, 2016

Thanks @lorents17 for the info! What compiler/platform were you testing with?

I see that you've measured a ~5% speedup.. but the difference in file size is interesting to me.

Did your original zopfli results include the changes from 37f6da6 -- specifically katajainen.c?

Otherwise I need to re-examing the changes to see why there could be a change.

Thanks!

@lorents17
Copy link

lorents17 commented May 2, 2016

OS - Windows 10
compiler - GCC 5.3

I used original zopfli with the latest changes from Apr 25, 2016

Interests me more as the ECT project could squeeze so quickly?

@jthlim
Copy link
Contributor Author

jthlim commented May 2, 2016

I've tried before and after my changes on OSX/Apple LLVM v 7.3 and I'm getting a consistent 34967390 bytes in each case. I'll try and setup a system with gcc later to test further.

as to the performance:
A quick look at https://github.com/fhanau/Efficient-Compression-Tool/blob/master/src/zopfli/zopfli_gzip.cpp#L259 seems to imply that there are modifications there to add multithreading support (this fork of zopfli is single threaded). At a guess, it looks like he's basing it on https://github.com/MrKrzYch00/zopfli

@fhanau
Copy link
Contributor

fhanau commented May 2, 2016

ECT (which I am the author of) is much faster than zopfli at similar compression, even without multithreading. Multithreading is off by default as it causes compression losses and needs more processing power per byte.
It achives its speeed improvements by rewriting large parts and replacing slow algorithms(like the current hash-chain based ZopfliFindLongestMatch) with faster ones(like the Binary tree based new one).
The only part of ECT that is based on MrKrzYch00/zopfli is the ZIP support.

@jthlim
Copy link
Contributor Author

jthlim commented May 2, 2016

@fhanau Thanks for the info and clarifications! Are there any downsides to importing your binary tree based code into the mainline?

@fhanau
Copy link
Contributor

fhanau commented May 2, 2016

Unfortunately, yes:

  • The binary tree match finder has a very low cost for finding matches as doing so is almost the same as updating the hash. However, updating the hash is very expensive which is why BTMF is only faster when using optimal parsing. I use a match finder based on LZ4 when lazy parsing is used.
  • BTMF is not compatible w/ the current matches cache.
  • As BTMF needs much time to update the hash and this is done at the start of each block after the first block to be able to search at posititions in the previous blocks, BTMF needs much time before compression even begins. This can be solved with copying the match finder struct of the previous block.

But you're right, the changes from ECT should be backported.

@MrKrzYch00
Copy link
Contributor

MrKrzYch00 commented May 13, 2016

Most of changes don't make any difference on ARMv7 when GCC is used, there is no way for me to efficiently run x86 tests as I run Fedora in VirtualBox (though it now supports AVX instructions set and stuff but the CPU is not "as free" as on Odroid U3). Of course ARM has a lot of registers so that may be the reason. Only the loop part change did a very little speed up, any variable type or order changing (here as well as with previous changes) did not make any difference or were actually slower. However, the differences are so small they could as well fall into a margin of error difference. As far as I'm concerned, GCC 4.8 is faster than GCC 5.3 which is faster than Clang 3.6. Or so it did score on ARMv7, would need to recompare it for original Zopfli again to confirm that Clang is still slower.

Even with my MrKrzYch00@7aa51e1 I don't see any difference in speed. :)

ATM the only thing I'm concerned about in case of Zopfli is why do I get:

*** Error in `zopfli/zopfli': double free or corruption (!prev): 0xb4550038 ***
Aborted

which only occurs on my Odroid U3 (both of them) when freeing heap (randomly) on long runs and (not 100% sure) when lazy matching is used... [can't figure it out so far, with all the debugging included] pointer structure members on not properly aligned structures. That was also the reason I invented restore points to resume compression and by this could confirm that the code should be is right and the Odroid U3 (outdated) kernel or its libs are causing this (or some alignment or small out-of-bounds going on somewhere)... GCC switches need to be altered (either 32-bit aligning or enable unaligned access) which I'm currently testing out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants