
Commit

Rewrote most of the smart length-limiting algorithm explanation
romigrou committed May 26, 2024
1 parent 97a507e commit f31dec7
Showing 1 changed file, index.html, with 83 additions and 51 deletions.
@@ -389,13 +389,13 @@
Still, it would be nice if we could have a strong guarantee that code lengths won't exceed a certain
maximum. Luckily, this is possible and there are several ways to achieve this.

The first and most optimal one is called the Package-Merge algorithm. It builds a prefix-code tree of
increasingly larger depth with each iteration. You only need to stop it at some desired iteration to
obtain length-limited prefix codes. If you don't stop it, it converges to the same prefix codes as
Huffman's algorithm would, only much more slowly. Package-Merge runs in _O(n×D)_ where _D_
is the maximum allowed depth, whereas, as we've seen, Huffman can be implemented in _O(n)_.

Package-Merge is beyond the scope of this article, if you wish to know more please refer to
[Wikipedia](https://en.wikipedia.org/wiki/Package-merge_algorithm) or, for a more understandable
version, look at [Stephan Brumme's implementation](https://create.stephan-brumme.com/length-limited-prefix-codes/).

@@ -412,7 +412,7 @@
of lengths yield proper prefix codes. This is where the [Kraft-McMillan inequality](https://en.wikipedia.org/wiki/Kraft%E2%80%93McMillan_inequality)
enters the game. It states that the following condition must be true for proper prefix codes to exist:

$$K = \sum_{s \in symbols} 2^{-length(s)} \le 1$$
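
For example, four symbols with code lengths 1, 2, 3 and 3 exactly saturate the inequality:

$$2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1$$

whereas lengths of 1, 2, 2 and 2 would give _K = 1.25 > 1_, so no prefix code with those
lengths can exist.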

!!! Warning
One of the consequences of the Kraft-McMillan inequality is that there is a lower limit to the
@@ -442,21 +442,21 @@
for several other length-limiting methods.

!!! Tip
When the Kraft-McMillan sum is updated, we must do the following:<br>
_K = K - 2<sup>-oldLength</sup> + 2<sup>-newLength</sup>_<br>
<br>
For lengthening a code, this simplifies to: _K = K - 2<sup>-newLength</sup>_<br>
Which in code becomes: `krafSum -= (kraftOne >> newLength);`<br>
<br>
For shortening a code, this simplifies to: _K = K + 2<sup>-oldLength</sup>_<br>
Which in code becomes: `krafSum += (kraftOne >> oldLength);`
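
To make the tip above concrete, here is a minimal sketch of the fixed-point bookkeeping
(the function name, the clamping being done beforehand and the round-robin choice of which
codes to adjust are assumptions, not the article's actual code):

~~~~~~~~~~~~~~~~~c++
#include <cstdint>
#include <vector>

// Restore K = 1 after code lengths have been clamped to maxLength.
// kraftOne is the fixed-point representation of 1.0 (i.e. 2^maxLength units).
// maxLength is assumed large enough for the alphabet (2^maxLength >= symbol count).
void fixKraftSum(std::vector<uint8_t>& lengths, uint8_t maxLength)
{
    const uint32_t kraftOne = 1u << maxLength;
    uint32_t krafSum = 0;
    for (uint8_t len : lengths)
        krafSum += kraftOne >> len;                    // K += 2^-length

    // K > 1: the lengths do not form a valid prefix code yet, lengthen codes
    for (size_t i = 0; krafSum > kraftOne; i = (i + 1) % lengths.size())
        if (lengths[i] < maxLength)
        {
            lengths[i]++;
            krafSum -= kraftOne >> lengths[i];         // K -= 2^-newLength
        }

    // K < 1: some coding space is wasted, shorten codes where possible
    for (size_t i = 0; krafSum < kraftOne; i = (i + 1) % lengths.size())
        while (lengths[i] > 1 && krafSum + (kraftOne >> lengths[i]) <= kraftOne)
        {
            krafSum += kraftOne >> lengths[i];         // K += 2^-oldLength
            lengths[i]--;
        }
}
~~~~~~~~~~~~~~~~~

Which codes get lengthened or shortened is exactly what the smarter method below is about;
the round-robin choice here is only meant to keep the sketch short.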

Smarter Length-Limiting
-----------------------

### The Principle
Render to Caesar what is Caesar's: I first read about this method on [Charles Bloom's blog](http://cbloomrants.blogspot.com/2018/04/engel-coding-and-length-limited-huffman.html).
Charles Bloom, in turn, attributes it to [Joern Engel](https://github.com/JoernEngel/joernblog/blob/master/engel_coding.md).

The starting point is the same as the previous method: compute Huffman lengths and clamp them.
We now again need to fix the Kraft-McMillan inequality. However, we're going to select which
@@ -465,22 +465,55 @@
we end up above or below that target. All that matters is getting closer to it.
2. We use a cost function to decide which code to modify.

### The Cost Function

Let's call _L_ the total length of the encoded message. _L_ is the resource we want to trade and
_K_ is the currency used to trade. Therefore, we can define the cost (or price) as follows:

$$ cost = \frac{\Delta K}{\Delta L}$$

- When **_K > 1_**, we have too little _L_ and we must buy some more. Of course, we want to buy at
the **lowest cost**.
- When **_K < 1_**, we have too much _L_ and we can sell some of it. This time, we want to sell at
the **highest cost**.

!!! Warning
_ΔK_ and _ΔL_ always have opposite signs; therefore, **_ΔK / ΔL_ is always negative**.

### Implementation Details

To make computations easier (you'll understand why shortly), we'll make two modifications to the
above cost function: use its inverse and take the absolute value. Conveniently, this changes nothing
about how we must use the cost function (when to minimize or maximize it).

$$ cost = \left| \frac{\Delta L}{\Delta K} \right|$$

As seen in the simple length-limiting algorithm, |_ΔK_| can be computed as |_ΔK_| = _2<sup>-L<sub>i</sub></sup>_
where:
- _L<sub>i</sub> = newLength_ when lengthening a code (i.e. _K > 1_)
- _L<sub>i</sub> = oldLength_ when shortening a code (i.e. _K < 1_)

And, if _W<sub>i</sub>_ is a symbol's weight, |_ΔL_| is simply _W<sub>i</sub>_. Therefore, the cost function can be computed as
_W<sub>i</sub> / 2<sup>-L<sub>i</sub></sup>_, which simplifies to _W<sub>i</sub> × 2<sup>L<sub>i</sub></sup>_,
which in turn yields the following remarkably cheap code:

~~~~~~~~~~~~~~~~~c++
cost = Wi << Li;
~~~~~~~~~~~~~~~~~

!!! Tip
_ΔK_ is the same for all the symbols of a given length. Therefore, we can speed up our search
by considering only one symbol per length: the one with the smallest weight when _K > 1_ and
the one with the biggest weight when _K < 1_.

!!! Warning
In case this method does not converge, we can always use the previous simple
length-limiting method as a fallback to finish the job.
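
Putting the above pieces together, here is a rough sketch of the whole smart length-limiting
pass (the structure, names and the brute-force candidate scan are assumptions; the per-length
shortcut from the tip above is omitted to keep the code short):

~~~~~~~~~~~~~~~~~c++
#include <cstddef>
#include <cstdint>
#include <vector>

struct Symbol { uint32_t weight; uint8_t length; };

// Rough sketch: lengths are assumed to have been clamped to maxLength already,
// and maxLength is assumed large enough for the alphabet (2^maxLength >= symbol count).
void smartLengthLimit(std::vector<Symbol>& symbols, uint8_t maxLength)
{
    const uint64_t kraftOne = 1ull << maxLength;            // fixed-point 1.0
    uint64_t kraftSum = 0;
    for (const Symbol& s : symbols)
        kraftSum += kraftOne >> s.length;

    while (kraftSum != kraftOne)
    {
        const bool lengthen = (kraftSum > kraftOne);        // K > 1: buy L,  K < 1: sell L
        ptrdiff_t best     = -1;
        uint64_t  bestCost = 0;
        for (size_t i = 0; i < symbols.size(); i++)
        {
            const Symbol& s = symbols[i];
            if (lengthen)
            {
                if (s.length >= maxLength)
                    continue;
                const uint64_t cost = uint64_t(s.weight) << (s.length + 1); // Wi * 2^newLength
                if (best < 0 || cost < bestCost)                            // buy at the lowest cost
                    { best = ptrdiff_t(i); bestCost = cost; }
            }
            else
            {
                if (s.length <= 1 || kraftSum + (kraftOne >> s.length) > kraftOne)
                    continue;                                               // cannot shorten or would overshoot
                const uint64_t cost = uint64_t(s.weight) << s.length;       // Wi * 2^oldLength
                if (best < 0 || cost > bestCost)                            // sell at the highest cost
                    { best = ptrdiff_t(i); bestCost = cost; }
            }
        }
        if (best < 0)
            break;   // no usable candidate: fall back to the simple method (see warning above)

        Symbol& s = symbols[size_t(best)];
        if (lengthen) { s.length++; kraftSum -= kraftOne >> s.length; }     // K -= 2^-newLength
        else          { kraftSum += kraftOne >> s.length; s.length--; }     // K += 2^-oldLength
    }
}
~~~~~~~~~~~~~~~~~

The guard against overshooting when shortening is one possible policy; the description above
only requires getting closer to the target at each step.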

### Efficiency

I found this method to be very efficient: it's really fast and usually exactly as good as Package-Merge.
Only in a few cases did Package-Merge prove better, and by a very, very thin margin. Here's the
compression ratio for the `dickens` file from the [Silesia corpus](https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia):

Limit | 9 | 10 | 11 | 12 | 13 | 14 | 15
@@ -498,7 +531,7 @@

We have already seen a table-based decoding technique, but we still had to iterate on that table to
find a code's length. Now that we have the guarantee that codes cannot exceed a given length, we
can devise a much faster way to decode our prefix codes: a flat lookup table. Let _L_ be the length
limit: we then always read _L_ bits and use them to look in a table of <i>2<sup>L</sup></i> entries
that tells us directly what is the corresponding symbol and what is its length.
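
To make this concrete, here is a minimal sketch of how such a table can be built (the entry
layout, the names and the assumption that codes are read most-significant bit first are mine,
not the article's):

~~~~~~~~~~~~~~~~~c++
#include <cstdint>
#include <vector>

struct TableEntry { uint8_t symbol; uint8_t length; };       // 2 bytes per entry

// A code of length 'len' is followed by L-len "don't care" bits when we read L bits,
// so it owns 2^(L-len) consecutive entries of the table.
std::vector<TableEntry> buildDecodingTable(const std::vector<uint8_t>&  lengths, // one per symbol
                                           const std::vector<uint32_t>& codes,   // canonical codes
                                           unsigned L)                           // length limit
{
    std::vector<TableEntry> table(size_t(1) << L);
    for (size_t sym = 0; sym < lengths.size(); sym++)
    {
        const unsigned len = lengths[sym];
        if (len == 0)                                        // symbol absent from the message
            continue;
        const uint32_t first = codes[sym] << (L - len);      // code left-aligned to L bits
        for (uint32_t i = 0; i < (1u << (L - len)); i++)
            table[first + i] = TableEntry{uint8_t(sym), uint8_t(len)};
    }
    return table;
}

// Decoding then boils down to:
//     TableEntry e = table[peekBits(L)];   // read L bits and do a single lookup
//     consumeBits(e.length);               // but only consume the code's actual length
//     output(e.symbol);
~~~~~~~~~~~~~~~~~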

@@ -574,8 +607,8 @@
-----------------

The size of the decoding table is the main criterion for deciding which length limit to enforce.
On one hand, short limits impact the compression ratio; on the other hand, long limits impact
performance.

Here are graphs of how the compression ratio of two files from the Silesia corpus evolves as the
length limit changes (in both cases the smart limiting algorithm was used):
@@ -592,30 +625,31 @@

### JPEG and Deflate

JPEG has a [16-bit limit](https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=151) and Deflate
(i.e. zip) has a 15-bit one (see [RFC 1951](https://www.ietf.org/rfc/rfc1951.txt), section 3.2.7).
Those two limits are really good in terms of compression ratio; however, they would result in fairly
large decoding tables (128 KiB and 64 KiB respectively, assuming **2 bytes per entry**).

These were perfectly sound decisions at the time those formats were designed (1991 and 1989
respectively): length-limiting algorithms were not as good as they are nowadays (Package-Merge dates from
1990) and CPU cache, although not as crucial as it is now, was too small to hold an entire flat
decoding table (1989's [Intel i486](https://en.wikipedia.org/wiki/I486) only had 8 KiB of cache
for both data and instructions). So the strategy used was based on [multiple smaller tables](https://github.com/madler/zlib/blob/develop/inftrees.c).

However, those length limits aren't ideal for recent processors, which we can usually assume to have
at least 32 KiB of L1 data cache per core (although there are [exceptions](https://developer.arm.com/documentation/ddi0500/e/level-1-memory-system/about-the-l1-memory-system)).
Those caches are now large enough to hold flat decoding tables but still too small for such length limits.

### Zstandard

As for Zstandard, a much more recent format, its length limit is [11 bits](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#huffman-tree-description).
The corresponding table weighs only 8 KiB (assuming **4 bytes per entry** this time, we'll see why
[shortly](#MultiSymbolDecoding)): this fits in the L1 cache of even low-end processors while leaving
room for other data.

11 bits is a rather strict length limit but it does not degrade compression so much that it hurts.
It's worth noting that Zstandard only uses prefix coding for one type of data: literals; other data
is compressed using ANS. Literals are usually not greatly compressible, so a higher length limit
wouldn't make a big difference anyway.

!!! Info
@@ -724,25 +758,23 @@
- The order of the data we'll read depends on the processor's endianness (but it's usually cheap
to fix endianness)

When we read 64 bits of data, at least 57 of them are fresh ones (up to 7 bits might be old ones we
had to re-read because of byte granularity). Therefore, whatever the state of the bit buffer, we
can guarantee it will contain at least 57 bits after each refill. Thus, if our length limit is
11 bits, we are certain we can decode at least 5 symbols before having to refill.

We've just made the read dependency 5 times cheaper!
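
Here is a rough sketch of such a byte-aligned 64-bit refill (the names and the little-endian,
least-significant-bit-first convention are assumptions, and the stream is assumed to be padded
so that reading 8 bytes near its end is safe):

~~~~~~~~~~~~~~~~~c++
#include <cstddef>
#include <cstdint>
#include <cstring>

struct BitReader
{
    const uint8_t* base;        // start of the (byte-aligned) compressed stream
    size_t         bitPos = 0;  // absolute position of the next unread bit
    uint64_t       buffer = 0;  // bits ready to be consumed, in the low bits
    unsigned       avail  = 0;  // how many of them are valid

    void refill()
    {
        uint64_t chunk;
        std::memcpy(&chunk, base + (bitPos >> 3), 8);  // one byte-aligned 64-bit read
        buffer = chunk >> (bitPos & 7);                // drop the at most 7 already-consumed bits
        avail  = 64 - unsigned(bitPos & 7);            // hence at least 57 valid bits
    }
    uint64_t peek(unsigned n) const { return buffer & ((1ull << n) - 1); }
    void     consume(unsigned n)    { bitPos += n; buffer >>= n; avail -= n; }
};
~~~~~~~~~~~~~~~~~

With an 11-bit length limit, one `refill()` therefore allows five table lookups (five
`peek`/`consume` pairs) before the next refill is needed.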

Multiple Streams
----------------

However, we still have a dependency between consecutive symbols: we don't know what are the bits of
symbol _N_ until we know how many bits symbol _N-1_ consumes. We cannot get rid of this dependency
as it is at the heart of prefix codes, but we can hide it.

Say we have two bitstreams: we can encode odd-index symbols in the first bitstream and even-index
symbols in the second bitstream. By doing so, symbol _N_ no longer depends on symbol _N-1_'s length
being known but on symbol _N-2_'s. If we have enough streams we can end up hiding latencies totally.
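
As an illustration (a sketch, not the article's or Zstandard's actual code), here is what a
4-stream decoding loop looks like; `Stream::decodeOne()` stands for the refill + table-lookup
machinery described above:

~~~~~~~~~~~~~~~~~c++
#include <cstddef>
#include <cstdint>
#include <vector>

struct Stream
{
    // ... bit buffer, pointer into this stream's bytes, etc. ...
    uint8_t decodeOne()                 // placeholder for: refill if needed, peek L bits,
    {                                   // flat-table lookup, consume the code's length
        return 0;
    }
};

// Symbols were interleaved round-robin at encode time: symbol i went to stream i % 4.
// The four calls in the loop body depend only on their own stream, so the CPU can
// overlap their latencies instead of waiting for each symbol's length in turn.
std::vector<uint8_t> decode4Streams(Stream s[4], size_t symbolCount)
{
    std::vector<uint8_t> out(symbolCount);
    size_t i = 0;
    for (; i + 4 <= symbolCount; i += 4)
    {
        out[i + 0] = s[0].decodeOne();
        out[i + 1] = s[1].decodeOne();
        out[i + 2] = s[2].decodeOne();
        out[i + 3] = s[3].decodeOne();
    }
    for (; i < symbolCount; i++)        // leftover symbols when the count isn't a multiple of 4
        out[i] = s[i % 4].decodeOne();
    return out;
}
~~~~~~~~~~~~~~~~~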

Zstandard uses [up to 4 bitstreams](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#literals_section_header)
whereas [Oodle uses 6](https://fgiesen.wordpress.com/2023/10/29/entropy-decoding-in-oodle-data-x86-64-6-stream-huffman-decoders/).
