feat(common/models): Trie file-size + loading optimizations 💾 #11088
base: change/common/models/templates/trie-results-through-traversal
User Test Results: Test specification and instructions. ERROR: user tests have not yet been defined.
// Offsetting by even just 0x0020 avoids control-code chars + avoids VS Code not liking the encoding.
const ENCODED_NUM_BASE = 0x0000;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;
A notable differentiation from the pseudo-spec established in #10336. It does restrict the data ranges slightly, but it makes a big, positive difference in the encoding and how IDEs interpret the resulting encoded Trie data when written to files.
Note: the unit tests established later currently do not adjust for alternate ENCODED_NUM_BASE values. This shouldn't be too tricky to establish for reasonable value selections, though.
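To make the range trade-off in this comment concrete, here is a minimal sketch of a single-code-unit number encoding with an offset base. Only the two constant names come from the diff; `encodeSingleChar` and `decodeSingleChar` are hypothetical helpers, and the base is set to 0x0020 (rather than the 0x0000 in the diff) purely for illustration.

```typescript
// Assumption: 0x0020 is used here for illustration; the diff above sets 0x0000.
const ENCODED_NUM_BASE = 0x0020;
// Number of distinct values that still fit in one UTF-16 code unit: 65536 - 32.
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;

// Hypothetical helper: encodes a non-negative integer as exactly one UTF-16 code unit.
function encodeSingleChar(value: number): string {
  if (!Number.isInteger(value) || value < 0 || value >= SINGLE_CHAR_RANGE) {
    throw new RangeError(`value ${value} is outside [0, ${SINGLE_CHAR_RANGE})`);
  }
  return String.fromCharCode(value + ENCODED_NUM_BASE);
}

// Hypothetical helper: reads the code unit at `index` and removes the offset.
function decodeSingleChar(text: string, index: number = 0): number {
  return text.charCodeAt(index) - ENCODED_NUM_BASE;
}

console.log(SINGLE_CHAR_RANGE);                         // 65504
console.log(encodeSingleChar(1));                       // "!"
console.log(decodeSingleChar(encodeSingleChar(1234)));  // 1234
```

With this base, the smallest values map to printable ASCII instead of control codes, at the cost of 32 fewer representable values per code unit.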
}`;

const compression = {
  // Achieves FAR better compression than JSON.stringify, which \u-escapes most chars.
Unless using ENCODED_NUM_BASE=0x0020 or similar. JSON.stringify likes to \u-escape control characters, which leads to notable string bloat when the control characters are utilized, since they tend to represent values that appear with high frequency in the encoding.
> JSON.stringify(String.fromCharCode(1))
'"\\u0001"'
With ENCODED_NUM_BASE=0 ...
- note that most leaf nodes will have low, single-digit .entries counts, all of which would be represented by control codes and thus subject to \u-escaping. The word lengths will usually be notably less than 32 chars and thus would also be subject to the same effects... leading to most encoded entries having length less than 32 chars.
- also, most near-leaf internal nodes will have but a few legal values leading to child nodes, once again using control codes for their representation.
JSON.stringify use is much prettier and more straightforward with ENCODED_NUM_BASE=0x0020 or above, as this bypasses the control-code range with one exception: 0x007f (DEL).
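The size effect described in this comment can be checked with a short sketch. The sample counts and the `encodeCounts` helper are made up for illustration and are not part of the PR; only the standard `JSON.stringify` and `String.fromCharCode` behavior is relied on.

```typescript
// Hypothetical helper: maps each small value to one code unit, shifted by `base`.
function encodeCounts(values: number[], base: number): string {
  return values.map((v) => String.fromCharCode(v + base)).join("");
}

// Made-up sample of the low counts that leaf nodes tend to carry.
const counts = [1, 2, 3, 1, 2];

// Base 0x0000: every value is a control code, so each becomes a 6-char \u00XX escape.
const jsonBase0 = JSON.stringify(encodeCounts(counts, 0x0000));
// Base 0x0020: values become printable ASCII. Only `"` (0x22) and `\` (0x5c)
// still need a 2-char escape.
const jsonBase20 = JSON.stringify(encodeCounts(counts, 0x0020));

console.log(jsonBase0.length);  // 32
console.log(jsonBase20.length); // 9

// DEL (0x7f) is a control code, but JSON.stringify emits it unescaped.
console.log(JSON.stringify(String.fromCharCode(0x7f)).length); // 3
```

Both forms round-trip through `JSON.parse`; the difference is only in the size and readability of the serialized text.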
…ssion Aims for a UCS-2 encoded string and does not shy away from unpaired surrogates in the encoding.
… code When a leaf node exists at the same Trie location as an internal node, it should be a child of that internal node using SENTINEL_CODE_UNIT (\ufdd0). The fixture was using null/undefined instead!
… for compressed-Trie code
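As a rough illustration of the commit note about SENTINEL_CODE_UNIT, the sketch below shows a fixture where a word ends at the same Trie location that other words pass through. The node shapes (`Leaf`, `InternalNode`, `Entry`) and the `wordEndingAt` helper are assumptions made for this example and may not match the actual model types; only the sentinel value \ufdd0 and the `.entries` / `values` field names come from the discussion above.

```typescript
const SENTINEL_CODE_UNIT = "\ufdd0";

// Assumed node shapes, for illustration only.
interface Entry { key: string; content: string; weight: number; }
interface Leaf { type: "leaf"; weight: number; entries: Entry[]; }
interface InternalNode {
  type: "internal";
  weight: number;
  values: string[];
  children: { [codeUnit: string]: TrieNode };
}
type TrieNode = Leaf | InternalNode;

// Node for the prefix "an": the word "an" ends here, and "and" continues below it.
// The leaf for "an" hangs off the sentinel key rather than off null/undefined.
const anNode: InternalNode = {
  type: "internal",
  weight: 5,
  values: [SENTINEL_CODE_UNIT, "d"],
  children: {
    [SENTINEL_CODE_UNIT]: {
      type: "leaf", weight: 5,
      entries: [{ key: "an", content: "an", weight: 5 }]
    },
    d: {
      type: "leaf", weight: 3,
      entries: [{ key: "and", content: "and", weight: 3 }]
    }
  }
};

// Hypothetical helper: returns the leaf for words ending exactly at `node`, if any.
function wordEndingAt(node: TrieNode): Leaf | undefined {
  if (node.type === "leaf") return node;
  const child = node.children[SENTINEL_CODE_UNIT];
  return child !== undefined && child.type === "leaf" ? child : undefined;
}
```

Keying the leaf by a real code unit keeps every child reachable through `values`, which a null or undefined key would not.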
b40b7c7 to 467efd4
(This PR has been brought to you early by ✈️ boredom.)
Note: the initial commits are shared with #10973. I have local branches set up, plus related notes, that I can use to clarify the divergence in the future.
Addresses #10336.
So far, this PR establishes methods useful to "compress" / encode our lexical model Tries into a notably more compact format and to "decompress" / decode them from that format, one piece at a time.
It does not currently:
So, there's obviously more work that would be needed, but it's a solid start in the right direction.