Record: Packed N-gram + Dirichlet CTW — val_bpb 0.0235 (1xB200) #1114
minh-stakc wants to merge 1 commit into openai:main
Conversation
…1xB200) 11L model + packed training n-gram tables (orders 2–13) + hierarchical Dirichlet CTW mixing. Pre-quantization sliding-window eval with an online cache.
Community Review — Record: Packed N-gram + Dirichlet CTW — val_bpb 0.0235 (1xB200)

BPB: 0.0235 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA): the n-gram lookup key at line 1161 is constructed by XOR-ing the target token into the hash. This matches Issue #1017, condition 1.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488).

The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10–1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=109429 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
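To make the review's distinction concrete, here is a hedged sketch contrasting the flagged target-in-key lookup with the context-only legal path it recommends. All names, the hash function, and the table layout are hypothetical illustrations, not code from the PR:

```python
MASK = (1 << 64) - 1

def hash_ctx(tokens):
    """FNV-1a-style rolling hash over a context window (illustrative only)."""
    h = 14695981039346656037
    for t in tokens:
        h = ((h ^ t) * 1099511628211) & MASK
    return h

# FLAGGED pattern: the target token is XOR-ed into the lookup key, so the
# table can only be hit when the true next token is already known -- the
# cache stores the answer under a key that contains the answer.
def lookup_flagged(table, context, target):
    key = hash_ctx(context) ^ target
    return table.get(key, 0)

# Legal path (per the review): key on the context alone, and read back a
# full-vocabulary count row that is renormalized into a distribution.
def lookup_context_only(table, context, vocab_size):
    row = table.get(hash_ctx(context))          # counts over all next tokens
    if row is None:
        return [1.0 / vocab_size] * vocab_size  # uniform fallback
    total = sum(row)
    return [c / total for c in row]
```

The context-only variant returns one probability row per context, so no information about the held-out target leaks into the key.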
Summary
val_bpb = 0.0235 (seed 42, 1xB200)
Architecture
Key Technique
Pre-compute n-gram hash tables from the training data during the training phase and store them in the artifact.
At eval time, combine the packed training statistics with an online cache via hierarchical Dirichlet CTW mixing (Willems et al. 1995).
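As a rough illustration of the table-building step, here is a hedged sketch that counts per-order n-grams over a training token stream. Plain nested dicts stand in for the PR's packed hash tables, and all names are hypothetical:

```python
from collections import defaultdict

def build_ngram_tables(tokens, min_order=2, max_order=13):
    """Count (context -> next-token) occurrences for each order.

    Orders 2-13 follow the PR description; the dict layout (no hashing,
    no packing) is a simplification for illustration.
    """
    tables = {n: defaultdict(lambda: defaultdict(int))
              for n in range(min_order, max_order + 1)}
    for i, target in enumerate(tokens):
        for n in range(min_order, max_order + 1):
            if i >= n - 1:
                ctx = tuple(tokens[i - n + 1:i])  # (n-1)-token context
                tables[n][ctx][target] += 1
    return tables
```

In a real artifact these counts would be hashed and packed into flat arrays; the counting logic is the same.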
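The eval-time combination can be sketched as a hierarchical Dirichlet backoff in the spirit of CTW (Willems et al. 1995): each order's counts are smoothed toward the next-shorter order's prediction, recursing down to a uniform prior. The `alpha` pseudo-count and the count layout are assumptions, not taken from the PR:

```python
def mix_orders(counts_by_order, context, vocab_size, alpha=1.0):
    """Hierarchical Dirichlet mixing across n-gram orders (sketch).

    counts_by_order maps order n -> {context_tuple: {token: count}}.
    """
    # Order-0 prior: uniform over the vocabulary.
    p = [1.0 / vocab_size] * vocab_size
    # Walk from short to long contexts, refining the prior at each level.
    for n in sorted(counts_by_order):
        ctx = context[-(n - 1):] if n > 1 else ()
        row = counts_by_order[n].get(ctx, {})
        total = sum(row.values())
        # Dirichlet smoothing: counts plus alpha times the lower-order prediction.
        p = [(row.get(t, 0) + alpha * p[t]) / (total + alpha)
             for t in range(vocab_size)]
    return p
```

An online cache would simply contribute a second count source per context before smoothing; an unseen context at some order falls through to the shorter-context prediction unchanged (up to the `alpha` renormalization).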
Compliance Notes
Credits