Assertion failure: assert next_code_length >= code_length when code_length overflows #440
Comments
So, we have too many numbers in the table of numbers? |
It tries to assign a 21-bit length code for 1 and 2 symbols. We'll need to define the fallback algorithm for this kind of case. (IMO, this doesn't happen in the usual wild case, so just falling back to assigning the same length code to all items should be fine.) |
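A flat-code fallback like the one suggested above could be sketched as follows (a hypothetical illustration, not anything from the spec; `flat_code_lengths` is an invented name):

```python
import math

def flat_code_lengths(symbols):
    """Fallback: give every symbol the same code length,
    ceil(log2(n)) bits (minimum 1), ignoring frequencies.
    Equal lengths always satisfy the Kraft inequality, so a
    valid prefix code exists for these lengths."""
    n = len(symbols)
    bits = math.ceil(math.log2(n)) if n > 1 else 1
    return {s: bits for s in symbols}

# With 3 symbols, every code is 2 bits long:
print(flat_code_lengths(["a", "b", "c"]))  # {'a': 2, 'b': 2, 'c': 2}
```

This is deliberately suboptimal: it trades compression for the guarantee that no code length can ever overflow the cap.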
That algorithm is a non-normative illustration of assigning codes; don't take it as spec. We should add a note about the code length issue.
We can and should leave code-length-limiting algorithms as an implementation detail of the encoder, because there are slow, accurate algorithms and fast, approximate algorithms, and depending on the circumstances either could be preferable. We don't want to bake one into the spec, and the decoder doesn't care.
If you need one to crib from, you can dig up papers on length-limited Huffman codes for the accurate algorithm, or zstd has a fast approximation (probably many other Huffman compressors do too). Facebook does something like zstd in our Hack encoder.
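For illustration, a fast approximate length-limiting pass might look like this. It is a sketch in the general clamp-then-repair spirit of heuristics like zstd's, not zstd's actual code; `limit_lengths` and its interface are invented for this example:

```python
def limit_lengths(lengths, max_len):
    """Clamp Huffman code lengths to max_len, then repair the
    Kraft inequality (sum of 2**-len must be <= 1) by lengthening
    the longest not-yet-maxed codes. Fast heuristic, not optimal."""
    lens = {s: min(l, max_len) for s, l in lengths.items()}
    # Work in units of 2**-max_len so everything stays an integer.
    budget = 1 << max_len
    kraft = sum(1 << (max_len - l) for l in lens.values())
    while kraft > budget:
        # Lengthening the longest eligible code sacrifices the
        # least Kraft weight per extra bit spent.
        s = max((s for s in lens if lens[s] < max_len),
                key=lambda s: lens[s])
        kraft -= 1 << (max_len - lens[s] - 1)  # weight is halved
        lens[s] += 1
    return lens
```

The accurate alternative (package-merge) finds the truly optimal length-limited code but is slower; as noted above, either choice can be left to the encoder.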
|
Ah, good to know. We're using it as spec all over SpiderMonkey for the time being :) |
Ok. A correct encoder should not output lengths > 20 bits, and SpiderMonkey should reject files with 21+ bit length codes.
BTW, 20 bits was an arbitrary choice and may be something to revisit. Some people may need a million strings... I hope not, but I would not be surprised if there's a file out there that big.
FWIW, Facebook does have files where the *optimal* code has lengths > 20 bits, and hence we use that heuristic to limit code lengths.
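A standard way to see how an optimal code can blow past 20 bits is Fibonacci-distributed frequencies, which make the optimal Huffman tree degenerate into a chain. The sketch below (invented helper names, not the project's encoder) needs only 24 symbols to reach a 23-bit code:

```python
import heapq
import itertools

def huffman_lengths(freqs):
    """Optimal (depth-unlimited) Huffman code lengths via a min-heap."""
    tie = itertools.count()  # tie-breaker so heap tuples never compare dicts
    heap = [(f, next(tie), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        heapq.heappush(heap, (f1 + f2, next(tie),
                              {s: l + 1 for s, l in {**a, **b}.items()}))
    return heap[0][2]

# Fibonacci frequencies are the classic worst case: the tree is a chain,
# so the rarest symbol's code length is (number of symbols - 1).
fib = [1, 1]
while len(fib) < 24:
    fib.append(fib[-1] + fib[-2])
lengths = huffman_lengths({i: f for i, f in enumerate(fib)})
print(max(lengths.values()))  # 23: already past a 20-bit cap
```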
|
Oh, got it: 20 is part of the spec, but the algorithm isn't. So, given that there are real-world files where 20 bits isn't sufficient for an optimal code, do we want to keep 20 as a hard limit? Or do we expect to become suboptimal for huge files (and possibly issue a warning telling developers that their file is simply too large)? |
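If 20 bits does stay a hard limit, the decoder side reduces to validating the declared lengths table. A minimal sketch, assuming lengths arrive as a simple list (names are invented for illustration):

```python
MAX_CODE_LENGTH = 20  # the spec's current (arbitrary) cap

def validate_lengths(lengths):
    """Reject a file whose Huffman table declares any code longer than
    MAX_CODE_LENGTH, or whose lengths cannot form a prefix code."""
    if any(l < 1 or l > MAX_CODE_LENGTH for l in lengths):
        raise ValueError("code length outside 1..%d" % MAX_CODE_LENGTH)
    # Kraft inequality: a prefix code with these lengths must exist.
    if sum(1 << (MAX_CODE_LENGTH - l) for l in lengths) > (1 << MAX_CODE_LENGTH):
        raise ValueError("lengths violate the Kraft inequality")
```

This is all the decoder ever needs to care about, which is the point made above: the length-limiting strategy itself can stay an encoder detail.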
Steps to reproduce:
(the result is a 1.1 MB JS file)
Actual result:
crash at binjs-ref/fbssdc/model.py, line 472, at commit 5cdddbc