feat(common/models): Trie file-size + loading optimizations 💾 #11088

jahorton · 2024-03-27T06:41:04Z

(This PR has been brought to you early by ✈️ boredom.)

Note: the initial commits are common with #10973. I have local branches set up, plus related notes I can use to clarify the divergence in the future.

Addresses #10336.

So far, this PR establishes methods useful to "compress" / encode our lexical model Tries into a notably more compact format and to "decompress" / decode them from that format, one piece at a time.

It does not currently:

- link these methods into our existing Trie definitions
- use them at run-time
- use them when compiling lexical models

So, there's obviously more work that would be needed, but it's a solid start in the right direction.

keymanapp-test-bot · 2024-03-27T06:41:09Z

User Test Results

Test specification and instructions

ERROR: user tests have not yet been defined

Test Artifacts

Developer
- Keyman Developer - build pending
- Compiler Regression Tests - build pending
- kmcomp.zip - build pending
iOS
- Keyman for iOS (simulator image) - build failure
- FirstVoices Keyboards for iOS (simulator image) - build failure
Keyboards
- Test Keyboards - build pending
Windows

jahorton · 2024-03-27T06:43:46Z

common/models/templates/src/tries/compression.ts

+// Offsetting by even just 0x0020 avoids control-code chars + avoids VS Code not liking the encoding.
+const ENCODED_NUM_BASE = 0x0000;
+const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;


A notable differentiation from the pseudo-spec established in #10336. It does restrict the data ranges slightly, but it makes a big, positive difference in the encoding and how IDEs interpret the resulting encoded Trie data when written to files.

Note: the unit tests established later currently do not adjust for alternate ENCODED_NUM_BASE values. This shouldn't be too tricky to establish for reasonable value selections, though.
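
For readers following along, here is a minimal sketch of how an offset like this could restrict the single-char range. It is not the code from this PR: the function names are hypothetical, and the 0x0020 base is chosen purely for illustration (the diff above currently sets it to 0x0000).

```typescript
// Hypothetical sketch; NOT the implementation in compression.ts.
// The base value here is illustrative only.
const ENCODED_NUM_BASE = 0x0020;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;

// Encodes a small non-negative integer as one UTF-16 code unit,
// shifted up by ENCODED_NUM_BASE so it avoids the control-code range.
function encodeSingleCharNumber(value: number): string {
  if (!Number.isInteger(value) || value < 0 || value >= SINGLE_CHAR_RANGE) {
    throw new RangeError(`value ${value} does not fit in a single code unit`);
  }
  return String.fromCharCode(value + ENCODED_NUM_BASE);
}

// Reverses encodeSingleCharNumber for the code unit at the given index.
function decodeSingleCharNumber(encoded: string, index: number = 0): number {
  return encoded.charCodeAt(index) - ENCODED_NUM_BASE;
}
```

With a base of 0x0020, the largest value that still fits in one code unit drops from 65535 to 65503, which is the "slight" range restriction mentioned above.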

jahorton · 2024-03-27T06:45:13Z

common/models/templates/test/fixtures/tries/english-1000.json

A fixed version of the fixture, utilizing #11074. My issues using this fixture during development of this PR were the cause of #11073's discovery.

jahorton · 2024-03-27T06:48:20Z

common/models/templates/test/trie-compression.js

+    }`;
+
+    const compression = {
+      // Achieves FAR better compression than JSON.stringify, which \u-escapes most chars.


Unless using ENCODED_NUM_BASE=0x0020 or similar. JSON.stringify likes to \u-escape control characters, which leads to notable string-bloat when the control characters are utilized - they tend to represent values that appear with high frequency in the encoding.

> JSON.stringify(String.fromCharCode(1))
'"\\u0001"'

With ENCODED_NUM_BASE=0:

- Most leaf nodes will have low, single-digit .entries counts, all of which would be represented by control codes and thus subject to \u-escaping. Word lengths will usually be notably less than 32 chars, so most encoded entries will also have lengths below 32 and be subject to the same effect.
- Most near-leaf internal nodes will have only a few legal values leading to child nodes, once again using control codes for their representation.

Using JSON.stringify is much prettier and more straightforward with ENCODED_NUM_BASE=0x0020 or above, as this bypasses the control-code range with one exception: 0x007f (DEL).
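
To make the string-bloat concrete, here is a rough sketch that compares the JSON.stringify output length for the small counts 0 through 9 under both offsets. The helper name `jsonLength` is hypothetical and exists only for this illustration; it is not part of the PR.

```typescript
// Illustrative only: measures JSON.stringify output size for small encoded
// values under two candidate offsets. The helper name is hypothetical.
function jsonLength(values: number[], base: number): number {
  const encoded = values.map((v) => String.fromCharCode(v + base)).join("");
  return JSON.stringify(encoded).length;
}

// Single-digit counts, as are typical for leaf-node .entries.
const smallCounts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];

const withoutOffset = jsonLength(smallCounts, 0x0000); // all control codes
const withOffset = jsonLength(smallCounts, 0x0020);    // printable ASCII

console.log(withoutOffset, withOffset); // prints: 54 13
```

Even with the offset, `"` (0x22) and `\` (0x5c) still get escaped, but only as two-char sequences rather than six-char \u escapes.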

jahorton added this to the 18.0 milestone Mar 27, 2024

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Mar 27, 2024

github-actions bot added common/ common/models/ common/web/ common/models/types/ common/models/templates/ feat labels Mar 27, 2024

jahorton mentioned this pull request Apr 9, 2024

feat(web): wordbreaker data table optimization #10692

Draft

jahorton added 8 commits July 2, 2024 11:24

feat(common/models/templates): initial trie node compression/decompre…

ef5a6f2

…ssion Aims for a UCS-2 encoded string and does not shy away from unpaired surrogates in the encoding.

fix(common/models/templates): early test-fixture did not use sentinel…

84a36bc

… code When a leaf node exists at the same Trie location as an internal node, it should be a child of that internal node using SENTINEL_CODE_UNIT (\ufdd0). The fixture was using null/undefined instead!
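
A rough sketch of the convention this commit message describes, assuming heavily simplified node shapes. The interfaces and the `wordsEndingHere` helper are hypothetical and are not the real Trie types; only SENTINEL_CODE_UNIT and its value come from the message above.

```typescript
// Hypothetical, simplified node shapes for illustration only.
const SENTINEL_CODE_UNIT = "\ufdd0";

interface LeafNode {
  type: "leaf";
  entries: string[];
}

interface InternalNode {
  type: "internal";
  values: string[];
  children: Record<string, TrieNode>;
}

type TrieNode = LeafNode | InternalNode;

// "an" is both a complete word and a prefix of "and", so its leaf hangs
// off the internal node under the sentinel key, not under null/undefined.
const anNode: InternalNode = {
  type: "internal",
  values: [SENTINEL_CODE_UNIT, "d"],
  children: {
    [SENTINEL_CODE_UNIT]: { type: "leaf", entries: ["an"] },
    d: { type: "leaf", entries: ["and"] },
  },
};

// Returns the entries that terminate exactly at this internal node.
function wordsEndingHere(node: InternalNode): string[] {
  const child = node.children[SENTINEL_CODE_UNIT];
  return child !== undefined && child.type === "leaf" ? child.entries : [];
}
```
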

feat(common/models/templates): basic full-trie compression test

af2b7ac

fix(common/models/templates): model-compiler error correction for com…

be66dff

…pressor

chore(common/models/templates): doc on probable error location

56fd030

change(common/models/templates): encoded-num range-offset experiment

c0febba

chore(common/models/templates): reverts offset but leaves its infrast…

94900c5

…ructure

docs(common/models/templates): minor notes on noted editor complaints…

467efd4

… for compressed-Trie code

jahorton changed the base branch from master to change/common/models/templates/trie-results-through-traversal July 2, 2024 04:25

jahorton force-pushed the feat/common/models/templates/trie-compression-start branch from b40b7c7 to 467efd4 Compare July 2, 2024 04:25

github-actions bot added common/models/types/ common/web/ and removed common/web/ common/models/types/ labels Jul 2, 2024
