
Record Submission: 0.9258 BPB — Kitchen Sink (7-gram + XSA6 + BigramHash4K + Cosine TTT)#776

Open
agalimova wants to merge 2 commits into openai:main from agalimova:submission/kitchen-sink-0.9258

Conversation

@agalimova

Summary

Changes from PR #741

| Parameter | Default | Ours |
| --- | --- | --- |
| XSA_LAST_N | 4 | 6 |
| BIGRAM_VOCAB_SIZE | 2048 | 4096 |
| NGRAM_ORDER | 5 | 7 |
| NGRAM_ALPHA_HIGH | 0.40 | 0.50 |

Test plan

  • 2 seeds on 8xH100 SXM (torch 2.9+cu126, FA3)
  • Eval time under 10-min budget (~520s)
  • All runs under 16MB artifact limit
  • 3rd seed running (will update)

🤖 Generated with Claude Code

agalimova and others added 2 commits March 25, 2026 09:45
Built on PR openai#700 with hyperparameter improvements found via
autoresearch-multi combinatorial search:
- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045)
Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ash4K)

Built on PR openai#741 with hyperparameter improvements found via
autoresearch-multi combinatorial search:
- XSA_LAST_N=6, BIGRAM_VOCAB_SIZE=4096, NGRAM_ORDER=7, NGRAM_ALPHA_HIGH=0.50

2-seed mean: 0.9258 (seeds 1337=0.9249, 42=0.9266)
Eval time: ~520s (under 10-min budget)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Kitchen Sink (7-gram + XSA6 + BigramHash4K + Cosine TTT)

BPB: 0.9258 (2 of 3 seeds) | Track: 10min/16MB | Compliance: FLAG

What this does (per the README + records train_gpt.py): Builds on PR #741. Adds XSA_LAST_N=6, BIGRAM_VOCAB_SIZE=4096, NGRAM_ORDER=7, NGRAM_ALPHA_HIGH=0.50, plus a Cosine-LR TTT pass and an n-gram cache eval. The records-folder train_gpt.py (the file the mods evaluate) implements a multi-order hashed n-gram cache that mixes its probability into the LM's per-token NLL during the sliding-window eval.
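For intuition, here is a minimal sketch (toy values and hypothetical variable names; not the PR's code) of what linearly mixing a cache probability into the LM's per-token probability does to the reported bits:

```python
import numpy as np

# Toy stand-ins: the real script streams a sliding window over validation
# tokens and reads the mixing weight from NGRAM_ALPHA_HIGH.
rng = np.random.default_rng(0)
lm_prob = rng.uniform(0.01, 0.9, size=8)    # LM probability of each target token
cache_prob = rng.uniform(0.0, 1.0, size=8)  # cache "probability" for the same targets
alpha = 0.50                                # mixing weight

mixed = (1.0 - alpha) * lm_prob + alpha * cache_prob
bits_per_token = float((-np.log2(np.clip(mixed, 1e-9, None))).mean())
# BPB further scales by tokens per byte; any cache term keyed on the target
# token therefore flows directly into the reported number.
```

Because the mix is applied inside the eval loop, the cache does not need to be a valid distribution in order to lower the measured NLL, which is the crux of the compliance question below.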

What I found in the code (records/track_10min_16mb/2026-03-25_KitchenSink_7gram_CosineTTT_0.9258/train_gpt.py):

  • Lines 1505-1508 (and again 1523-1526) — the n-gram "full key" mixes the target token into the hash:

    ```python
    ch = np.zeros(len(jv), dtype=np.uint64)
    for k in range(cw):
        ch ^= val_np[jv-(cw-k)].astype(np.uint64) * nprimes[k%5]
    ck = (ch & nmask).astype(np.int64)
    tn = val_np[jv].astype(np.uint64)              # tn = target token at position jv
    fk = ((ch ^ (tn * nprimes[cw%5])) & nmask).astype(np.int64)
    ```

    This is the same full_key = ((ctx_hash ^ (target * primes[k])) & mask) shape that @valerio-oai ruled disallowed in PR #779 (comment 4145781641, 2026-03-27): "disallowed due to the use of hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens."

  • Line 1512 — the cache "probability" is png = np.clip(np.minimum(fc, cc) / np.maximum(cc, 1.0), 0, 1), where fc = full_t[fk] and cc = ctx_t[ck]. This is min(count(ctx,tgt), count(ctx)) / count(ctx), not a renormalized distribution over vocab — matching valerio's "do not renormalize correctly" point. There is no sum_v P(v|ctx) = 1 step anywhere in the order loop.

  • Line 1517 — that unnormalized bp value is then linearly mixed into the model probability via smp[ni] = (1.0 - av)*smp[ni] + av*bp[ni], and the BPB at line 1538 is computed from the mixed smp. So the tn-keyed lookup directly affects the reported BPB.

  • Submission file pair. The PR includes both train_gpt.py (1544 lines, contains the n-gram cache above) and submission_train_gpt.py (1978 lines, contains a different LogisticContextMixer implementation that only goes up to trigram via TRI_HASH=65536, without target-in-key). The README does not state which of the two files is to be evaluated. The records-folder train_gpt.py is the canonical evaluation target per the records layout used elsewhere in the repo.

  • README/code mismatch. The README and PR title say "7-gram", and submission.json lists NGRAM_ORDER=7 as the headline change. submission_train_gpt.py does not define NGRAM_ORDER at all and only implements unigram/bigram/trigram in LogisticContextMixer (line 38, K=5 experts of which only 3 are n-gram orders). The NGRAM_ORDER env var is read only by train_gpt.py (line 1461), inside the disallowed n-gram cache block above. So the headline 7-gram change is the same code path that triggers the Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779 ruling.
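To make the renormalization concern concrete, here is a toy reconstruction of the min(count(ctx,tgt), count(ctx)) / count(ctx) scheme (hypothetical primes, a deliberately tiny 256-bucket table, random tokens; not the PR's exact code). With exact counts the ratios would sum to 1 over the vocabulary, since sum_v count(ctx, v) = count(ctx); with hashed buckets, aliasing inflates the mass:

```python
import numpy as np

P = [0x9E3779B97F4A7C15, 0xBF58476D1CE4E5B9, 0x94D049BB133111EB,
     0xD6E8FEB86659FD93, 0xFF51AFD7ED558CCD]   # hypothetical mixing primes
MASK64 = (1 << 64) - 1
NMASK = 255                     # tiny 256-bucket table so collisions are visible
VOCAB, CW = 100, 3              # vocab size and context width (hypothetical)
tokens = np.random.default_rng(1).integers(0, VOCAB, size=5000).tolist()

ctx_t = [0] * 256               # count(ctx) buckets
full_t = [0] * 256              # count(ctx, target) buckets

def ctx_hash(j):
    ch = 0
    for k in range(CW):
        ch ^= (tokens[j - (CW - k)] * P[k % 5]) & MASK64
    return ch

for j in range(CW, len(tokens)):
    ch = ctx_hash(j)
    ctx_t[ch & NMASK] += 1
    # the flagged shape: the *target* token is mixed into the lookup key
    full_t[(ch ^ ((tokens[j] * P[CW % 5]) & MASK64)) & NMASK] += 1

# "P(v|ctx)" as the script computes it, summed over the whole vocabulary
# for one context:
ch = ctx_hash(CW)
cc = max(ctx_t[ch & NMASK], 1.0)
total = sum(min(full_t[(ch ^ ((v * P[CW % 5]) & MASK64)) & NMASK], cc) / cc
            for v in range(VOCAB))
# total lands far above 1.0: aliased buckets mean the per-target ratios are
# not a probability distribution over the vocabulary.
```

The full-key line reproduces the flagged shape: the lookup key is a function of the target token, so recovering a full distribution would require one probe per vocabulary item, which the script never performs.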

Smoke test (CT2038 proteus-engine, 2026-04-11): submission_train_gpt.py parses, Hyperparameters and GPT resolve, code size 97,234 bytes. CPU forward not run (script requires CUDA); no syntax/import issues.

Questions/flags:

  • Could you confirm which of the two scripts mods should evaluate — train_gpt.py or submission_train_gpt.py? They implement the n-gram path very differently.
  • The fk = ((ch ^ (tn * nprimes[cw%5])) & nmask) shape at lines 1508/1526 looks structurally identical to the pattern @valerio-oai disallowed in PR Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779. Is the intent that fc/cc represents P(target|context)? If so, what guarantees that sum_v P(v|context) = 1 over the vocabulary, given that full_t is keyed by (ctx ⊕ target*prime) and only one bucket per (ctx,target) is touched?
  • Roughly 10 sibling PRs in this n-gram-cache cluster have already been closed under the Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779 ruling. Is there a renormalization or score-first transformation here that distinguishes this PR from those? If yes, please point to the lines.
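For reference, the score-first transformation the last question asks about would, in sketch form, look like this (hypothetical; no such step exists in either script): score every vocabulary candidate first, then divide by the sum:

```python
import numpy as np

# Hypothetical unnormalized cache scores over a 5-token vocabulary
# for one context.
scores = np.array([0.0, 2.0, 1.0, 0.0, 3.0])
probs = scores / max(scores.sum(), 1e-9)   # sum_v P(v|ctx) == 1 by construction
```

Note that this requires touching one bucket per candidate token; the target-in-key lookup instead reads only the single bucket for (ctx, target), which is why a renormalization step cannot simply be bolted on without restructuring the cache.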

Verdict: COMPLIANCE FLAG — n-gram-target-in-key family pattern present in records/.../train_gpt.py lines 1505-1517.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:


Reviewed by @MatoTeziTanka / The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): submission_train_gpt.py imports OK, Hyperparameters/GPT resolve, code size 97,234 bytes; no GPU forward attempted. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA b5bfc9aa533a0a86be1881bbc7bc747bf77848da.
