
Record: Packed N-gram + Dirichlet CTW — val_bpb 0.0235 (1xB200)#1114

Open
minh-stakc wants to merge 1 commit into openai:main from minh-stakc:submission/packed-ngram-dirichlet-0.0235

Conversation

@minh-stakc

Summary

val_bpb = 0.0235 (seed 42, 1xB200)

| Seed | val_bpb | Pre-quant BPB | Artifact        |
|------|---------|---------------|-----------------|
| 42   | 0.02352 | 1.3704        | 6,458,133 bytes |

Architecture

  • 11L 512d GQA 8/4, MLP 3.0x LeakyReLU(0.5)², XSA-4, BigramHash, VRL
  • Packed N-gram tables (order 2-13) from training data, 32K buckets, int32, zstd-compressed
  • Hierarchical Dirichlet CTW mixing with per-order concentrations [50,50,20,10,6,4,3,2.5]
  • Online N-gram cache (orders 2-9, 4M buckets) updated score-first
  • EMA(0.997), Muon optimizer, int5/int6 quantization, 3% magnitude pruning
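The packed-table idea in the bullets above can be sketched as follows. This is a minimal toy, not the PR's implementation: it assumes a simple multiplicative rolling hash, uses plain dicts instead of flat int32 arrays, and substitutes stdlib zlib for the zstd compression named in the summary (the 32K bucket count and order range 2-13 are from the summary).

```python
import zlib

NUM_BUCKETS = 32 * 1024          # 32K buckets per order (from the PR summary)
ORDERS = range(2, 14)            # n-gram orders 2..13

def ctx_hash(ctx, mask=NUM_BUCKETS - 1):
    # Simple multiplicative rolling hash of the context tokens.
    # Illustrative only; the real code likely uses a different hash.
    h = 0
    for tok in ctx:
        h = (h * 1000003 + tok) & 0xFFFFFFFF
    return h & mask

def build_packed_tables(tokens):
    """For each order k, count how often each hashed k-token context is
    followed by each next token. Returned as dicts here; the real artifact
    packs these into int32 arrays before compression."""
    tables = {}
    for k in ORDERS:
        counts = {}
        for t in range(k, len(tokens)):
            key = (ctx_hash(tokens[t - k:t]), tokens[t])
            counts[key] = counts.get(key, 0) + 1
        tables[k] = counts
    # Serialize + compress (zlib stand-in for the PR's zstd).
    blob = zlib.compress(repr(tables).encode())
    return tables, blob

toks = [1, 2, 3, 1, 2, 3, 1, 2, 4] * 10
tables, blob = build_packed_tables(toks)
```

Note that at build time the next token is stored as part of the count entry, which is fine; the compliance question below is about whether the *lookup key at scoring time* depends on the target.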

Key Technique

Pre-compute n-gram hash tables from the training data during the training phase and store them in the artifact.
At eval time, combine the packed training statistics with the online cache via hierarchical Dirichlet CTW mixing (Willems et al. 1995).
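A minimal sketch of hierarchical Dirichlet backoff mixing in the spirit described here, assuming toy per-order count dictionaries in place of the packed tables. The concentration values are the ones listed in the summary; everything else (function names, the recursion base case) is illustrative.

```python
# Concentrations for successive orders, as listed in the PR summary.
ALPHAS = [50, 50, 20, 10, 6, 4, 3, 2.5]

def dirichlet_mix(counts_by_order, vocab_size):
    """counts_by_order[i] maps token -> count for the current context at
    the i-th order (shortest context first). Each level smooths its counts
    toward the lower-order predictive distribution with concentration
    alpha, so sparse high-order contexts fall back gracefully."""
    # Base case: uniform over the vocabulary.
    p = [1.0 / vocab_size] * vocab_size
    for counts, alpha in zip(counts_by_order, ALPHAS):
        total = sum(counts.values())
        p = [
            (counts.get(x, 0) + alpha * p[x]) / (total + alpha)
            for x in range(vocab_size)
        ]
    return p
```

With all counts empty, each level leaves the distribution unchanged, so the prediction degrades to uniform; a heavy count at the highest order dominates the mix.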

Compliance Notes

  • Training: 600s on 1xB200 (within limit)
  • Artifact: 6.4 MB (within 16 MB limit)
  • Score-first: each window scored THEN cache updated
  • Eval: 65,808s on 1xB200 eager mode — needs 8xH100 with torch.compile for <600s compliance
  • Single seed (42) — additional seeds pending
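The score-first discipline claimed in the compliance notes can be illustrated with a toy counting cache. This sketch only shows the ordering constraint (score with the pre-window state, update afterwards), not the actual scoring function:

```python
def eval_score_first(tokens):
    """Each position is scored using only cache state accumulated from
    strictly earlier positions; the cache update happens after scoring."""
    cache = {}
    scores = []
    for t in tokens:
        scores.append(cache.get(t, 0))   # score BEFORE updating
        cache[t] = cache.get(t, 0) + 1   # update AFTER scoring
    return scores
```

For example, the first occurrence of any token is always scored against a count of zero, since its own occurrence has not yet been written into the cache.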

Credits


11L model + packed training n-gram tables (order 2-13) + hierarchical
Dirichlet CTW mixing. Pre-quant sliding window eval with online cache.
@MatoTeziTanka

Community Review — Record: Packed N-gram + Dirichlet CTW — val_bpb 0.0235 (1xB200)

BPB: 0.0235 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA a6050caeb821, file records/track_10min_16mb/2026-03-29_PackedNgram_DirichletCTW/train_gpt.py):

The n-gram lookup key at line 1161 is constructed by XOR-ing the target token into the hash:

```
# line 1161
full_key = <hash> ^ (tgt_np * ng_primes[...]) & mask
```

This matches the full_key = ((ctx_hash ^ (target * primes[k])) & mask) construction that @valerio-oai ruled disallowed on PR #779 (comment 4145781641, 2026-03-27). Per the mechanism explanation, hashing the target token into the lookup key only reweights the correct token — in the hash-collision limit this drives P(correct) → 1 regardless of the data, which inflates the reported BPB without producing real compression.

Per Issue #1017 condition 1, p_t may depend only on the artifact and x_1...x_{t-1}. Because the lookup key at line 1161 is a function of the target token, the count read at scoring position t depends on x_t itself — which is the core violation the #779 ruling targets.
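A toy reproduction of the mechanism the ruling describes, assuming the XOR construction quoted above (the MASK and PRIME values here are made up for illustration, not taken from the PR's code). Because the target is mixed into the key, every candidate token reads a different bucket, and only the bucket written for the token that actually followed this context in training holds a count, so the true target is selectively recovered at eval time:

```python
MASK = 0xFFFF
PRIME = 2654435761   # illustrative multiplier, not the PR's ng_primes

def target_in_key(ctx_hash, target):
    # Disallowed pattern: the lookup key is a function of the target x_t.
    return (ctx_hash ^ (target * PRIME)) & MASK

table = {}
ctx = 12345
true_next = 7
table[target_in_key(ctx, true_next)] = 100   # written during training

# At eval time, scanning all candidate tokens singles out the training
# target: only its bucket is populated, regardless of what the data says.
scores = {tok: table.get(target_in_key(ctx, tok), 0) for tok in range(16)}
```

In the hash-collision limit this pushes P(correct) toward 1 without the model compressing anything, which is why the reported BPB is inflated.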

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=109429 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.
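For contrast, a minimal sketch of the context-only legal path described above: hash only the context tokens into the key, read one full next-token count row for that bucket, and reweight the whole vocabulary from it. All names, sizes, and the add-alpha smoothing here are hypothetical; the point is only that p_t never sees x_t.

```python
VOCAB = 16
MASK = 0xFFFF

def ctx_only_key(ctx_hash):
    # Key depends only on x_1..x_{t-1}; no target token involved.
    return ctx_hash & MASK

rows = {}   # bucket -> full per-token count row

def update(ctx_hash, next_tok):
    row = rows.setdefault(ctx_only_key(ctx_hash), [0] * VOCAB)
    row[next_tok] += 1

def predict(ctx_hash, alpha=1.0):
    # One row read per context; reweight the entire vocabulary from it,
    # with add-alpha smoothing so unseen contexts yield a proper uniform.
    row = rows.get(ctx_only_key(ctx_hash), [0] * VOCAB)
    total = sum(row)
    return [(c + alpha / VOCAB) / (total + alpha) for c in row]
```

The row is indexed by candidate token *after* the lookup, so the count read at position t is the same regardless of which token actually occurs there.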


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
