Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean) by resouer · Pull Request #1460 · openai/parameter-golf

resouer · 2026-04-08T01:52:27Z

Summary

3-seed mean val_bpb: 1.08269 (std 0.00060) | ~15.99 MB | 8xH100 SXM | ~450s eval

Merged SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 2.79670 nats. Delta: -0.914 nats. Clears the 0.005-nat threshold.

Results (3-seed)

Seed	BPP	val_loss (nats)	Artifact
1337	1.08218	2.79537	15,982,929
42	1.08252	2.79626	15,988,459
2025	1.08337	2.79846	15,989,420
Mean	1.08269	2.79670

Changes from Merged SOTA (PR #1019)

1. Eval-Time Hash Embedding (Novel)

A zero-initialized nn.Embedding(16384, 512) is created at evaluation time and trained exclusively through the score-first TTT loop. At each position, a bigram hash h = (prev_token * 2039 + curr_token) % 16384 looks up a residual vector added to tok_emb(x) before RMSNorm. The hash embedding learns document-local bigram patterns without modifying any pre-trained model weights.

Nearest PR: #1413 (@kevclark) — legal score-first TTT with full-model weight updates. Different: We add an ephemeral hash embedding instantiated from zeros at eval start, adapting via the same TTT loop. This is a new adaptation target — the model retunes a separate bigram-keyed memory alongside its existing weights. No existing PR creates and trains a new embedding module from scratch at eval time. Measured delta: -0.0004 BPP (ablation: 1.08307 mean without hash, 1.08269 mean with).

2. Score-First TTT

SGD momentum 0.9, LR=0.005, 3 epochs per 32K-token chunk, cosine decay, freeze=0. Same mechanism as PR #549 and PR #1413. Measured delta: -0.002 BPP.

3. SP8192 Architecture Stack

11L/512d/8H/4KV, parallel residuals (L7+), depth recurrence (L4-5 loop 2x), skip gates, QK-Gain 4.0, XSA-11, Full Hessian GPTQ int6 + byte-shuffle + brotli, coprime-stride loader. Code packed with lzma+base85.

Compliance

Per Issue #1017 (Track B):

Condition 1: Hash key uses input token identities from x_batch = chunk[:-1], prefix-only.
Condition 2: Hash residual added before RMSNorm + full-vocab softmax. Full distribution.
Condition 3: Each chunk scored under no_grad() before any TTT update. Score-before-update.
Condition 4: Single left-to-right pass. No rescoring.
Precedent: LoRA-TTT PRs Record: XSA + LoRA TTT (val_bpb=1.1070) #1254, Record: varlen+fused mlp+ttt (bpb=1.1093) #1354 also create trainable params at eval time.

No SLOT, no pre-quant TTT, no n-gram caches, no ETLB.

Reproduction

pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
torchrun --standalone --nproc_per_node=8 train_gpt.py

No env vars needed.

Credits

Base: PR #549 (@abaybektursun), PR #1019 (@abaybektursun). TTT: PR #549, PR #1413 (@kevclark). Parallel residuals + depth recurrence: PR #1204 (@msisovic). SP8192 + SDClip: PR #1394 (@clarkkev). Coprime loader: PR #726, PR #1060. Eval-time hash embedding: original.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ai#1460)

Base: train_gpt_sota_10.py (clean, 11L XSA-all, parallel L5+, recur 3,4,5) Additions from top PRs: - Legal Score-First TTT (PR openai#549 recipe: +~0.0025 BPB) chunk=32768, SGD lr=0.002 global cosine decay, 3 epochs, all blocks unfrozen - N-gram Tilt (PR openai#1437): exp(0.5) boost on bigram-predicted next token - Eval-Time Hash Embedding (PR openai#1460): zero-init embed[(p*2039+c)%16384] adapts via TTT optimizer at 10x model LR Other tuning vs sota_10: - warmdown_iters: 4200 -> 5500 (better final convergence) - gptq_ar_seqs: 32 -> 64 (PR openai#1019: 64 is optimal) - ttt defaults: lr=0.002, chunk_size=32768 (PR openai#549 ablation)

kevclark · 2026-04-08T09:52:03Z

@resouer I think you meant to tag @clarkkev not me 😄

Novel mechanism: per-window ephemeral hash table (128 buckets × 512d) indexed by prefix bigram hash, co-optimized alongside SLOT's global delta in the same AdamW loop. Adds position-specific hidden corrections on top of SLOT's window-global delta. No PR combines a hashed hidden residual with causal SLOT. Nearest: openai#1333 (global delta only), openai#1460 (input-space hash, no SLOT). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Record: SP8192 + TTT + Hash Embedding — val_bpb 1.08269 (3-seed mean)

44eb132

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 8, 2026

sota_16: TTT LR 0.001→0.005 + cosine decay per chunk (matches PR open…

2f915f0

…ai#1460)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean)#1460

Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean)#1460
resouer wants to merge 1 commit intoopenai:mainfrom
resouer:submission/sp8192-ttt-hash-embedding

resouer commented Apr 8, 2026

Uh oh!

kevclark commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

resouer commented Apr 8, 2026

Summary

Results (3-seed)

Changes from Merged SOTA (PR #1019)

1. Eval-Time Hash Embedding (Novel)

2. Score-First TTT

3. SP8192 Architecture Stack

Compliance

Reproduction

Credits

Uh oh!

kevclark commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants