
Commit

Rewrote most of the smart length-limiting algorithm explanation
romigrou committed May 26, 2024
1 parent 97a507e commit f31dec7
Showing 1 changed file, index.html, with 83 additions and 51 deletions.
@@ -389,13 +389,13 @@
Still, it would be nice if we could have a strong guarantee that code lengths won't exceed a certain
maximum. Luckily, this is possible and there are several ways to achieve this.

The first and most optimal one is called the Package-Merge algorithm. It builds a prefix-code tree of
increasingly larger depth with each iteration. You only need to stop it at some desired iteration to
obtain length-limited prefix codes. If you don't stop it, it converges to the same prefix codes as
Huffman's algorithm would, only much more slowly. Package-Merge runs in _O(n×D)_ where _D_
is the maximum allowed depth, whereas, as we've seen, Huffman can be implemented in _O(n)_.

Package-Merge is beyond the scope of this article, if you wish to know more please refer to
[Wikipedia](https://en.wikipedia.org/wiki/Package-merge_algorithm) or, for a more understandable
version, look at [Stephan Brumme's implementation](https://create.stephan-brumme.com/length-limited-prefix-codes/).

@@ -412,7 +412,7 @@
of lengths yield proper prefix codes. This is where the [Kraft-McMillan inequality](https://en.wikipedia.org/wiki/Kraft%E2%80%93McMillan_inequality)
enters the game. It states that the following condition must be true for proper prefix codes to exist:

$$K = \sum_{s \in symbols} 2^{-length(s)} \le 1$$
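
For example, four symbols with code lengths 1, 2, 3 and 3 exactly saturate the inequality:

$$2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1$$

whereas lengths of 1, 2, 2 and 2 would give _K = 1.25 > 1_, so no prefix code with those
lengths can exist.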

!!! Warning
One of the consequences of the Kraft-McMillan inequality is that there is a lower limit to the
@@ -442,21 +442,21 @@
for several other length-limiting methods.

!!! Tip
When the Kraft-McMillan sum is updated, we must do the following:<br>
_K = K - 2<sup>-oldLength</sup> + 2<sup>-newLength</sup>_<br>
<br>
For lengthening a code, this simplifies to: _K = K - 2<sup>-newLength</sup>_<br>
Which in code becomes: `krafSum -= (kraftOne >> newLength);`<br>
<br>
For shortening a code, this simplifies to: _K = K + 2<sup>-oldLength</sup>_<br>
Which in code becomes: `krafSum += (kraftOne >> oldLength);`
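
To make the tip above concrete, here is a minimal sketch of the fixed-point bookkeeping
(the function name, the clamping being done beforehand and the round-robin choice of which
codes to adjust are assumptions, not the article's actual code):

~~~~~~~~~~~~~~~~~c++
#include <cstdint>
#include <vector>

// Restore K = 1 after code lengths have been clamped to maxLength.
// kraftOne is the fixed-point representation of 1.0 (i.e. 2^maxLength units).
// maxLength is assumed large enough for the alphabet (2^maxLength >= symbol count).
void fixKraftSum(std::vector<uint8_t>& lengths, uint8_t maxLength)
{
    const uint32_t kraftOne = 1u << maxLength;
    uint32_t krafSum = 0;
    for (uint8_t len : lengths)
        krafSum += kraftOne >> len;                    // K += 2^-length

    // K > 1: the lengths do not form a valid prefix code yet, lengthen codes
    for (size_t i = 0; krafSum > kraftOne; i = (i + 1) % lengths.size())
        if (lengths[i] < maxLength)
        {
            lengths[i]++;
            krafSum -= kraftOne >> lengths[i];         // K -= 2^-newLength
        }

    // K < 1: some coding space is wasted, shorten codes where possible
    for (size_t i = 0; krafSum < kraftOne; i = (i + 1) % lengths.size())
        while (lengths[i] > 1 && krafSum + (kraftOne >> lengths[i]) <= kraftOne)
        {
            krafSum += kraftOne >> lengths[i];         // K += 2^-oldLength
            lengths[i]--;
        }
}
~~~~~~~~~~~~~~~~~

Which codes get lengthened or shortened is exactly what the smarter method below is about;
the round-robin choice here is only meant to keep the sketch short.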

Smarter Length-Limiting
-----------------------

### The Principle
Render to Caesar what is Caesar's: I first read about this method on [Charles Bloom's blog](http://cbloomrants.blogspot.com/2018/04/engel-coding-and-length-limited-huffman.html).
Charles Bloom, in turn, attributes it to [Joern Engel](https://github.com/JoernEngel/joernblog/blob/master/engel_coding.md).

The starting point is the same as the previous method: compute Huffman lengths and clamp them.
We now again need to fix the Kraft-McMillan inequality. However, we're going to select which
@@ -465,22 +465,55 @@
we end up above or below that target. All that matters is getting closer to it.
2. We use a cost function to decide which code to modify.

### The Cost Function

Let's call _L_ the total length of the encoded message. _L_ is the resource we want to trade and
_K_ is the currency used to trade. Therefore, we can define the cost (or price) as follows:

$$ cost = \frac{\Delta K}{\Delta L}$$

- When **_K > 1_**, we have too little _L_ and we must buy some more. Of course, we want to buy at
the **lowest cost**.
- When **_K < 1_**, we have too much _L_ and we can sell some of it. This time, we want to sell at
the **highest cost**.

!!! Warning
_ΔK_ and _ΔL_ always have opposite signs; therefore, **_ΔK / ΔL_ is always negative**.

### Implementation Details

To make computations easier (you'll understand why shortly), we'll make two modifications to the
above cost function: use its inverse and take the absolute value. Conveniently, this changes nothing
about how we must use the cost function (when to minimize or maximize it).

$$ cost = \left| \frac{\Delta L}{\Delta K} \right|$$

As seen in the simple length-limiting algorithm, |_ΔK_| can be computed as |_ΔK_| = _2<sup>-L<sub>i</sub></sup>_
where:
- _L<sub>i</sub> = newLength_ when lengthening a code (i.e. _K > 1_)
- _L<sub>i</sub> = oldLength_ when shortening a code (i.e. _K < 1_)

And, if _W<sub>i</sub>_ is a symbol's weight, |_ΔL_| is simply _W<sub>i</sub>_. Therefore, the cost function can be computed as
_W<sub>i</sub> / 2<sup>-L<sub>i</sub></sup>_, which simplifies to _W<sub>i</sub> × 2<sup>L<sub>i</sub></sup>_,
which in turn yields the following remarkably cheap code:

~~~~~~~~~~~~~~~~~c++
cost = Wi << Li;
~~~~~~~~~~~~~~~~~

!!! Tip
_ΔK_ is the same for all the symbols of a given length. Therefore, we can speed up our search
by considering only one symbol per length: the one with the smallest weight when _K > 1_ and
the one with the biggest weight when _K < 1_.

!!! Warning
In case this method does not converge, we can always use the previous simple
length-limiting method as a fallback to finish the job.
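
Putting the above pieces together, here is a rough sketch of the whole smart length-limiting
pass (the structure, names and the brute-force candidate scan are assumptions; the per-length
shortcut from the tip above is omitted to keep the code short):

~~~~~~~~~~~~~~~~~c++
#include <cstddef>
#include <cstdint>
#include <vector>

struct Symbol { uint32_t weight; uint8_t length; };

// Rough sketch: lengths are assumed to have been clamped to maxLength already,
// and maxLength is assumed large enough for the alphabet (2^maxLength >= symbol count).
void smartLengthLimit(std::vector<Symbol>& symbols, uint8_t maxLength)
{
    const uint64_t kraftOne = 1ull << maxLength;            // fixed-point 1.0
    uint64_t kraftSum = 0;
    for (const Symbol& s : symbols)
        kraftSum += kraftOne >> s.length;

    while (kraftSum != kraftOne)
    {
        const bool lengthen = (kraftSum > kraftOne);        // K > 1: buy L,  K < 1: sell L
        ptrdiff_t best     = -1;
        uint64_t  bestCost = 0;
        for (size_t i = 0; i < symbols.size(); i++)
        {
            const Symbol& s = symbols[i];
            if (lengthen)
            {
                if (s.length >= maxLength)
                    continue;
                const uint64_t cost = uint64_t(s.weight) << (s.length + 1); // Wi * 2^newLength
                if (best < 0 || cost < bestCost)                            // buy at the lowest cost
                    { best = ptrdiff_t(i); bestCost = cost; }
            }
            else
            {
                if (s.length <= 1 || kraftSum + (kraftOne >> s.length) > kraftOne)
                    continue;                                               // cannot shorten or would overshoot
                const uint64_t cost = uint64_t(s.weight) << s.length;       // Wi * 2^oldLength
                if (best < 0 || cost > bestCost)                            // sell at the highest cost
                    { best = ptrdiff_t(i); bestCost = cost; }
            }
        }
        if (best < 0)
            break;   // no usable candidate: fall back to the simple method (see warning above)

        Symbol& s = symbols[size_t(best)];
        if (lengthen) { s.length++; kraftSum -= kraftOne >> s.length; }     // K -= 2^-newLength
        else          { kraftSum += kraftOne >> s.length; s.length--; }     // K += 2^-oldLength
    }
}
~~~~~~~~~~~~~~~~~

The guard against overshooting when shortening is one possible policy; the description above
only requires getting closer to the target at each step.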

### Efficiency

I found this method to be very efficient: it's really fast and usually exactly as good as Package-Merge.
Only in a few cases did Package-Merge prove better, and by a very, very thin margin. Here's the
compression ratio for the `dickens` file from the [Silesia corpus](https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia):

Limit | 9 | 10 | 11 | 12 | 13 | 14 | 15
@@ -498,7 +531,7 @@

We have already seen a table-based decoding technique, but we still had to iterate on that table to
find a code's length. Now that we have the guarantee that codes cannot exceed a given length, we
can devise a much faster way to decode our prefix codes: a flat lookup table. Let _L_ be the length
limit: we then always read _L_ bits and use them to look in a table of <i>2<sup>L</sup></i> entries
that tells us directly what is the corresponding symbol and what is its length.
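
To make this concrete, here is a minimal sketch of how such a table can be built (the entry
layout, the names and the assumption that codes are read most-significant bit first are mine,
not the article's):

~~~~~~~~~~~~~~~~~c++
#include <cstdint>
#include <vector>

struct TableEntry { uint8_t symbol; uint8_t length; };       // 2 bytes per entry

// A code of length 'len' is followed by L-len "don't care" bits when we read L bits,
// so it owns 2^(L-len) consecutive entries of the table.
std::vector<TableEntry> buildDecodingTable(const std::vector<uint8_t>&  lengths, // one per symbol
                                           const std::vector<uint32_t>& codes,   // canonical codes
                                           unsigned L)                           // length limit
{
    std::vector<TableEntry> table(size_t(1) << L);
    for (size_t sym = 0; sym < lengths.size(); sym++)
    {
        const unsigned len = lengths[sym];
        if (len == 0)                                        // symbol absent from the message
            continue;
        const uint32_t first = codes[sym] << (L - len);      // code left-aligned to L bits
        for (uint32_t i = 0; i < (1u << (L - len)); i++)
            table[first + i] = TableEntry{uint8_t(sym), uint8_t(len)};
    }
    return table;
}

// Decoding then boils down to:
//     TableEntry e = table[peekBits(L)];   // read L bits and do a single lookup
//     consumeBits(e.length);               // but only consume the code's actual length
//     output(e.symbol);
~~~~~~~~~~~~~~~~~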

@@ -574,8 +607,8 @@
-----------------

The size of the decoding table is the main criterion for deciding which length limit to enforce.
On one hand, short limits impact the compression ratio; on the other hand, long limits impact
performance.

Here are graphs of how the compression ratio of two files from the Silesia corpus evolves as the
length limit changes (in both cases the smart limiting algorithm was used):
@@ -592,30 +625,31 @@

### JPEG and Deflate

JPEG has a [16-bit limit](https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=151) and Deflate
(i.e. zip) has a 15-bit one (see [RFC 1951](https://www.ietf.org/rfc/rfc1951.txt), section 3.2.7).
Those two limits are really good in terms of compression ratio; however, they would result in fairly
large decoding tables (128 KiB and 64 KiB respectively, assuming **2 bytes per entry**).

These were perfectly sound decisions at the time those formats were designed (1991 and 1989
respectively): length-limiting algorithms were not as good as they are nowadays (Package-Merge dates from
1990) and CPU cache, although not as crucial as it is now, was too small to hold an entire flat
decoding table (1989's [Intel i486](https://en.wikipedia.org/wiki/I486) only had 8 KiB of cache
for both data and instructions). So the strategy used was based on [multiple smaller tables](https://github.com/madler/zlib/blob/develop/inftrees.c).

However, those length limits aren't ideal for recent processors, which we can usually assume to have
at least 32 KiB of L1 data cache per core (although there are [exceptions](https://developer.arm.com/documentation/ddi0500/e/level-1-memory-system/about-the-l1-memory-system)).
Those caches are now large enough to hold flat decoding tables but still too small for such length limits.

### Zstandard

As for Zstandard, a much more recent format, its length limit is [11 bits](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#huffman-tree-description).
The corresponding table weighs only 8 KiB (assuming **4 bytes per entry** this time, we'll see why
[shortly](#MultiSymbolDecoding)): this fits in the L1 cache of even low-end processors while leaving
room for other data.

11 bits is a rather strict length limit but it does not degrade compression so much that it hurts.
It's worth noting that Zstandard only uses prefix coding for one type of data: literals; other data
is compressed using ANS. Literals are usually not greatly compressible, so a higher length limit
wouldn't make a big difference anyway.

!!! Info
@@ -724,25 +758,23 @@
- The order of the data we'll read depends on the processor's endianness (but it's usually cheap
to fix endianness)

When we read 64 bits of data, at least 57 of them are fresh ones (up to 7 bits might be old ones we
had to re-read because of byte granularity). Therefore, whatever the state of the bit buffer, we
can guarantee it will contain at least 57 bits after each refill. Thus, if our length limit is
11 bits, we are certain we can decode at least 5 symbols before having to refill.

We've just made the read dependency 5 times cheaper!
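
Here is a rough sketch of such a byte-aligned 64-bit refill (the names and the little-endian,
least-significant-bit-first convention are assumptions, and the stream is assumed to be padded
so that reading 8 bytes near its end is safe):

~~~~~~~~~~~~~~~~~c++
#include <cstddef>
#include <cstdint>
#include <cstring>

struct BitReader
{
    const uint8_t* base;        // start of the (byte-aligned) compressed stream
    size_t         bitPos = 0;  // absolute position of the next unread bit
    uint64_t       buffer = 0;  // bits ready to be consumed, in the low bits
    unsigned       avail  = 0;  // how many of them are valid

    void refill()
    {
        uint64_t chunk;
        std::memcpy(&chunk, base + (bitPos >> 3), 8);  // one byte-aligned 64-bit read
        buffer = chunk >> (bitPos & 7);                // drop the at most 7 already-consumed bits
        avail  = 64 - unsigned(bitPos & 7);            // hence at least 57 valid bits
    }
    uint64_t peek(unsigned n) const { return buffer & ((1ull << n) - 1); }
    void     consume(unsigned n)    { bitPos += n; buffer >>= n; avail -= n; }
};
~~~~~~~~~~~~~~~~~

With an 11-bit length limit, one `refill()` therefore allows five table lookups (five
`peek`/`consume` pairs) before the next refill is needed.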

Multiple Streams
----------------

However, we still have a dependency between consecutive symbols: we don't know what are the bits of
symbol _N_ until we know how many bits symbol _N-1_ consumes. We cannot get rid of this dependency
as it is at the heart of prefix codes, but we can hide it.

Say we have two bitstreams: we can encode odd-index symbols in the first bitstream and even-index
symbols in the second bitstream. By doing so, symbol _N_ no longer depends on symbol _N-1_'s length
being known but on symbol _N-2_'s. If we have enough streams we can end up hiding latencies totally.
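
As an illustration (a sketch, not the article's or Zstandard's actual code), here is what a
4-stream decoding loop looks like; `Stream::decodeOne()` stands for the refill + table-lookup
machinery described above:

~~~~~~~~~~~~~~~~~c++
#include <cstddef>
#include <cstdint>
#include <vector>

struct Stream
{
    // ... bit buffer, pointer into this stream's bytes, etc. ...
    uint8_t decodeOne()                 // placeholder for: refill if needed, peek L bits,
    {                                   // flat-table lookup, consume the code's length
        return 0;
    }
};

// Symbols were interleaved round-robin at encode time: symbol i went to stream i % 4.
// The four calls in the loop body depend only on their own stream, so the CPU can
// overlap their latencies instead of waiting for each symbol's length in turn.
std::vector<uint8_t> decode4Streams(Stream s[4], size_t symbolCount)
{
    std::vector<uint8_t> out(symbolCount);
    size_t i = 0;
    for (; i + 4 <= symbolCount; i += 4)
    {
        out[i + 0] = s[0].decodeOne();
        out[i + 1] = s[1].decodeOne();
        out[i + 2] = s[2].decodeOne();
        out[i + 3] = s[3].decodeOne();
    }
    for (; i < symbolCount; i++)        // leftover symbols when the count isn't a multiple of 4
        out[i] = s[i % 4].decodeOne();
    return out;
}
~~~~~~~~~~~~~~~~~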

Zstandard uses [up to 4 bitstreams](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#literals_section_header)
whereas [Oodle uses 6](https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/).
