
Record Submission: 0.9258 BPB — Kitchen Sink (7-gram + XSA6 + BigramHash4K + Cosine TTT)#776

Open
agalimova wants to merge 2 commits into openai:main from agalimova:submission/kitchen-sink-0.9258

Conversation

@agalimova

Summary

Changes from PR #741

| Parameter | Default | Ours |
| --- | --- | --- |
| XSA_LAST_N | 4 | 6 |
| BIGRAM_VOCAB_SIZE | 2048 | 4096 |
| NGRAM_ORDER | 5 | 7 |
| NGRAM_ALPHA_HIGH | 0.40 | 0.50 |

Test plan

  • 2 seeds on 8xH100 SXM (torch 2.9+cu126, FA3)
  • Eval time under 10-min budget (~520s)
  • All runs under 16MB artifact limit
  • 3rd seed running (will update)

🤖 Generated with Claude Code

agalimova and others added 2 commits March 25, 2026 09:45
Built on PR openai#700 with hyperparameter improvements found via
autoresearch-multi combinatorial search:
- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045)
Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ash4K)

Built on PR openai#741 with hyperparameter improvements found via
autoresearch-multi combinatorial search:
- XSA_LAST_N=6, BIGRAM_VOCAB_SIZE=4096, NGRAM_ORDER=7, NGRAM_ALPHA_HIGH=0.50

2-seed mean: 0.9258 (seeds 1337=0.9249, 42=0.9266)
Eval time: ~520s (under 10-min budget)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Kitchen Sink (7-gram + XSA6 + BigramHash4K + Cosine TTT)

BPB: 0.9258 (2 of 3 seeds) | Track: 10min/16MB | Compliance: FLAG

What this does (per the README + records train_gpt.py): Builds on PR #741. Adds XSA_LAST_N=6, BIGRAM_VOCAB_SIZE=4096, NGRAM_ORDER=7, NGRAM_ALPHA_HIGH=0.50, plus a Cosine-LR TTT pass and an n-gram cache eval. The records-folder train_gpt.py (the file the mods evaluate) implements a multi-order hashed n-gram cache that mixes its probability into the LM's per-token NLL during the sliding-window eval.
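For intuition, here is a minimal sketch (toy values and hypothetical variable names; not the PR's code) of what linearly mixing a cache probability into the LM's per-token probability does to the reported bits:

```python
import numpy as np

# Toy stand-ins: the real script streams a sliding window over validation
# tokens and reads the mixing weight from NGRAM_ALPHA_HIGH.
rng = np.random.default_rng(0)
lm_prob = rng.uniform(0.01, 0.9, size=8)    # LM probability of each target token
cache_prob = rng.uniform(0.0, 1.0, size=8)  # cache "probability" for the same targets
alpha = 0.50                                # mixing weight

mixed = (1.0 - alpha) * lm_prob + alpha * cache_prob
bits_per_token = float((-np.log2(np.clip(mixed, 1e-9, None))).mean())
# BPB further scales by tokens per byte; any cache term keyed on the target
# token therefore flows directly into the reported number.
```

Because the mix is applied inside the eval loop, the cache does not need to be a valid distribution in order to lower the measured NLL, which is the crux of the compliance question below.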

What I found in the code (records/track_10min_16mb/2026-03-25_KitchenSink_7gram_CosineTTT_0.9258/train_gpt.py):

  • Lines 1505-1508 (and again 1523-1526) — the n-gram "full key" mixes the target token into the hash:

    ```python
    ch = np.zeros(len(jv), dtype=np.uint64)
    for k in range(cw):
        ch ^= val_np[jv-(cw-k)].astype(np.uint64) * nprimes[k%5]
    ck = (ch & nmask).astype(np.int64)
    tn = val_np[jv].astype(np.uint64)              # tn = target token at position jv
    fk = ((ch ^ (tn * nprimes[cw%5])) & nmask).astype(np.int64)
    ```

    This is the same full_key = ((ctx_hash ^ (target * primes[k])) & mask) shape that @valerio-oai ruled disallowed in PR #779 (comment 4145781641, 2026-03-27): "disallowed due to the use of hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens."

  • Line 1512 — the cache "probability" is png = np.clip(np.minimum(fc, cc) / np.maximum(cc, 1.0), 0, 1), where fc = full_t[fk] and cc = ctx_t[ck]. This is min(count(ctx,tgt), count(ctx)) / count(ctx), not a renormalized distribution over vocab — matching valerio's "do not renormalize correctly" point. There is no sum_v P(v|ctx) = 1 step anywhere in the order loop.

  • Line 1517 — that unnormalized bp value is then linearly mixed into the model probability via smp[ni] = (1.0 - av)*smp[ni] + av*bp[ni], and the BPB at line 1538 is computed from the mixed smp. So the tn-keyed lookup directly affects the reported BPB.

  • Submission file pair. The PR includes both train_gpt.py (1544 lines, contains the n-gram cache above) and submission_train_gpt.py (1978 lines, contains a different LogisticContextMixer implementation that only goes up to trigram via TRI_HASH=65536, without target-in-key). The README does not state which of the two files is to be evaluated. The records-folder train_gpt.py is the canonical evaluation target per the records layout used elsewhere in the repo.

  • README/code mismatch. The README and PR title say "7-gram", and submission.json lists NGRAM_ORDER=7 as the headline change. submission_train_gpt.py does not define NGRAM_ORDER at all and only implements unigram/bigram/trigram in LogisticContextMixer (line 38, K=5 experts of which only 3 are n-gram orders). The NGRAM_ORDER env var is read only by train_gpt.py (line 1461), inside the disallowed n-gram cache block above. So the headline 7-gram change is the same code path that triggers the Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779 ruling.
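To make the renormalization concern concrete, here is a toy reconstruction of the min(count(ctx,tgt), count(ctx)) / count(ctx) scheme (hypothetical primes, a deliberately tiny 256-bucket table, random tokens; not the PR's exact code). With exact counts the ratios would sum to 1 over the vocabulary, since sum_v count(ctx, v) = count(ctx); with hashed buckets, aliasing inflates the mass:

```python
import numpy as np

P = [0x9E3779B97F4A7C15, 0xBF58476D1CE4E5B9, 0x94D049BB133111EB,
     0xD6E8FEB86659FD93, 0xFF51AFD7ED558CCD]   # hypothetical mixing primes
MASK64 = (1 << 64) - 1
NMASK = 255                     # tiny 256-bucket table so collisions are visible
VOCAB, CW = 100, 3              # vocab size and context width (hypothetical)
tokens = np.random.default_rng(1).integers(0, VOCAB, size=5000).tolist()

ctx_t = [0] * 256               # count(ctx) buckets
full_t = [0] * 256              # count(ctx, target) buckets

def ctx_hash(j):
    ch = 0
    for k in range(CW):
        ch ^= (tokens[j - (CW - k)] * P[k % 5]) & MASK64
    return ch

for j in range(CW, len(tokens)):
    ch = ctx_hash(j)
    ctx_t[ch & NMASK] += 1
    # the flagged shape: the *target* token is mixed into the lookup key
    full_t[(ch ^ ((tokens[j] * P[CW % 5]) & MASK64)) & NMASK] += 1

# "P(v|ctx)" as the script computes it, summed over the whole vocabulary
# for one context:
ch = ctx_hash(CW)
cc = max(ctx_t[ch & NMASK], 1.0)
total = sum(min(full_t[(ch ^ ((v * P[CW % 5]) & MASK64)) & NMASK], cc) / cc
            for v in range(VOCAB))
# total lands far above 1.0: aliased buckets mean the per-target ratios are
# not a probability distribution over the vocabulary.
```

The full-key line reproduces the flagged shape: the lookup key is a function of the target token, so recovering a full distribution would require one probe per vocabulary item, which the script never performs.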

Smoke test (CT2038 proteus-engine, 2026-04-11): submission_train_gpt.py parses, Hyperparameters and GPT resolve, code size 97,234 bytes. CPU forward not run (script requires CUDA); no syntax/import issues.

Questions/flags:

  • Could you confirm which of the two scripts mods should evaluate — train_gpt.py or submission_train_gpt.py? They implement the n-gram path very differently.
  • The fk = ((ch ^ (tn * nprimes[cw%5])) & nmask) shape at lines 1508/1526 looks structurally identical to the pattern @valerio-oai disallowed in PR Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779. Is the intent that fc/cc represents P(target|context)? If so, what guarantees that sum_v P(v|context) = 1 over the vocabulary, given that full_t is keyed by (ctx ⊕ target*prime) and only one bucket per (ctx,target) is touched?
  • Roughly 10 sibling PRs in this n-gram-cache cluster have already been closed under the Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779 ruling. Is there a renormalization or score-first transformation here that distinguishes this PR from those? If yes, please point to the lines.
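For reference, the score-first transformation the last question asks about would, in sketch form, look like this (hypothetical; no such step exists in either script): score every vocabulary candidate first, then divide by the sum:

```python
import numpy as np

# Hypothetical unnormalized cache scores over a 5-token vocabulary
# for one context.
scores = np.array([0.0, 2.0, 1.0, 0.0, 3.0])
probs = scores / max(scores.sum(), 1e-9)   # sum_v P(v|ctx) == 1 by construction
```

Note that this requires touching one bucket per candidate token; the target-in-key lookup instead reads only the single bucket for (ctx, target), which is why a renormalization step cannot simply be bolted on without restructuring the cache.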

Verdict: COMPLIANCE FLAG — n-gram-target-in-key family pattern present in records/.../train_gpt.py lines 1505-1517.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:


Reviewed by @MatoTeziTanka / The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): submission_train_gpt.py imports OK, Hyperparameters/GPT resolve, code size 97,234 bytes; no GPU forward attempted. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA b5bfc9aa533a0a86be1881bbc7bc747bf77848da.
