Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(common/models): Trie file-size + loading optimizations 💾 #11088

Draft
wants to merge 8 commits into
base: change/common/models/templates/trie-results-through-traversal
Choose a base branch
from

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented Mar 27, 2024

(This PR has brought to you early by ✈️ boredom.)

Note: the initial commits are common with #10973. I have local branches setup + related notes I can use to clarify the divergence in the future.

Addresses #10336.

So far, this PR establishes methods useful to "compress" / encode our lexical model Tries into a notably more compact format and to "decompress" / decode them from that format, one piece at a time.

It does not currently:

  • link these methods into our existing Trie definitions
  • use them at run-time
  • use them when compiling lexical models

So, there's obviously more work that would be needed, but it's a solid start in the right direction.

@jahorton jahorton added this to the 18.0 milestone Mar 27, 2024
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Mar 27, 2024
@keymanapp-test-bot
Copy link

keymanapp-test-bot bot commented Mar 27, 2024

User Test Results

Test specification and instructions

ERROR: user tests have not yet been defined

Test Artifacts

Comment on lines +6 to +8
// Offsetting by even just 0x0020 avoids control-code chars + avoids VS Code not liking the encoding.
const ENCODED_NUM_BASE = 0x0000;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A notable differentiation from the pseudo-spec established in #10336. It does restrict the data ranges slightly, but it makes a big, positive difference in the encoding and how IDEs interpret the resulting encoded Trie data when written to files.

Note: the unit tests established later currently do not adjust for alternate ENCODED_NUM_BASE values. This shouldn't be too tricky to establish for reasonable value selections, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fixed version of the fixture, utilizing #11074. My issues using this fixture during development of this PR were the cause of #11073's discovery.

}`;

const compression = {
// Achieves FAR better compression than JSON.stringify, which \u-escapes most chars.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless using ENCODED_NUM_BASE=0x0020 or similar. JSON.stringify likes to \u-escape control characters, which leads to notable string-bloat when the control characters are utilized - they tend to represent values that appear with high frequency in the encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> JSON.stringify(String.fromCharCode(1))
'"\\u0001"'

With ENCODED_NUM_BASE=0...

  • note that most leaf notes will have low, single-digit .entries counts, all of which would be represented by control codes and thus subject to \u-escaping. The word lengths will usually be notably less than 32 chars and thus would also subject to the same effects... leading to most encoded entries having length less than 32 chars.

  • also, most near-leaf internal nodes will have but a few legal values leading to child nodes, once again using control codes for their representation.

JSON.stringify use is much prettier and straightforward with ENCODED_NUM_BASE=0x0020 or above, as this bypasses the control-code range with one exception: 0x007f (DEL).

@jahorton jahorton changed the base branch from master to change/common/models/templates/trie-results-through-traversal July 2, 2024 04:25
@jahorton jahorton force-pushed the feat/common/models/templates/trie-compression-start branch from b40b7c7 to 467efd4 Compare July 2, 2024 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

1 participant