Skip to content

Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean)#1460

Open
resouer wants to merge 1 commit intoopenai:mainfrom
resouer:submission/sp8192-ttt-hash-embedding
Open

Record: SP8192 + TTT + Eval-Time Hash Embedding — val_bpb 1.08269 (3-seed mean)#1460
resouer wants to merge 1 commit intoopenai:mainfrom
resouer:submission/sp8192-ttt-hash-embedding

Conversation

@resouer
Copy link
Copy Markdown

@resouer resouer commented Apr 8, 2026

Summary

3-seed mean val_bpb: 1.08269 (std 0.00060) | ~15.99 MB | 8xH100 SXM | ~450s eval

Merged SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 2.79670 nats. Delta: -0.914 nats. Clears the 0.005-nat threshold.

Results (3-seed)

Seed BPP val_loss (nats) Artifact
1337 1.08218 2.79537 15,982,929
42 1.08252 2.79626 15,988,459
2025 1.08337 2.79846 15,989,420
Mean 1.08269 2.79670

Changes from Merged SOTA (PR #1019)

1. Eval-Time Hash Embedding (Novel)

A zero-initialized nn.Embedding(16384, 512) is created at evaluation time and trained exclusively through the score-first TTT loop. At each position, a bigram hash h = (prev_token * 2039 + curr_token) % 16384 looks up a residual vector added to tok_emb(x) before RMSNorm. The hash embedding learns document-local bigram patterns without modifying any pre-trained model weights.

Nearest PR: #1413 (@kevclark) — legal score-first TTT with full-model weight updates. Different: We add an ephemeral hash embedding instantiated from zeros at eval start, adapting via the same TTT loop. This is a new adaptation target — the model retunes a separate bigram-keyed memory alongside its existing weights. No existing PR creates and trains a new embedding module from scratch at eval time. Measured delta: -0.0004 BPP (ablation: 1.08307 mean without hash, 1.08269 mean with).

2. Score-First TTT

SGD momentum 0.9, LR=0.005, 3 epochs per 32K-token chunk, cosine decay, freeze=0. Same mechanism as PR #549 and PR #1413. Measured delta: -0.002 BPP.

3. SP8192 Architecture Stack

11L/512d/8H/4KV, parallel residuals (L7+), depth recurrence (L4-5 loop 2x), skip gates, QK-Gain 4.0, XSA-11, Full Hessian GPTQ int6 + byte-shuffle + brotli, coprime-stride loader. Code packed with lzma+base85.

Compliance

Per Issue #1017 (Track B):

  • Condition 1: Hash key uses input token identities from x_batch = chunk[:-1], prefix-only.
  • Condition 2: Hash residual added before RMSNorm + full-vocab softmax. Full distribution.
  • Condition 3: Each chunk scored under no_grad() before any TTT update. Score-before-update.
  • Condition 4: Single left-to-right pass. No rescoring.
  • Precedent: LoRA-TTT PRs Record: XSA + LoRA TTT (val_bpb=1.1070) #1254, Record: varlen+fused mlp+ttt (bpb=1.1093) #1354 also create trainable params at eval time.

No SLOT, no pre-quant TTT, no n-gram caches, no ETLB.

Reproduction

pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
torchrun --standalone --nproc_per_node=8 train_gpt.py

No env vars needed.

Credits

Base: PR #549 (@abaybektursun), PR #1019 (@abaybektursun). TTT: PR #549, PR #1413 (@kevclark). Parallel residuals + depth recurrence: PR #1204 (@msisovic). SP8192 + SDClip: PR #1394 (@clarkkev). Coprime loader: PR #726, PR #1060. Eval-time hash embedding: original.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 8, 2026
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 8, 2026
Base: train_gpt_sota_10.py (clean, 11L XSA-all, parallel L5+, recur 3,4,5)

Additions from top PRs:
- Legal Score-First TTT (PR openai#549 recipe: +~0.0025 BPB)
  chunk=32768, SGD lr=0.002 global cosine decay, 3 epochs, all blocks unfrozen
- N-gram Tilt (PR openai#1437): exp(0.5) boost on bigram-predicted next token
- Eval-Time Hash Embedding (PR openai#1460): zero-init embed[(p*2039+c)%16384]
  adapts via TTT optimizer at 10x model LR

Other tuning vs sota_10:
- warmdown_iters: 4200 -> 5500 (better final convergence)
- gptq_ar_seqs: 32 -> 64 (PR openai#1019: 64 is optimal)
- ttt defaults: lr=0.002, chunk_size=32768 (PR openai#549 ablation)
@kevclark
Copy link
Copy Markdown

kevclark commented Apr 8, 2026

@resouer I think you meant to tag @clarkkev not me 😄

resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel mechanism: per-window ephemeral hash table (128 buckets × 512d)
indexed by prefix bigram hash, co-optimized alongside SLOT's global
delta in the same AdamW loop. Adds position-specific hidden corrections
on top of SLOT's window-global delta.

No PR combines a hashed hidden residual with causal SLOT.
Nearest: openai#1333 (global delta only), openai#1460 (input-space hash, no SLOT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants