Learned Routing + Two-Pass N-gram Rescoring + Extended Orders (2-12)#860
pappanick wants to merge 3 commits into openai:main
Conversation
…ders

Combines PR openai#834's learned multi-expert routing head with PR openai#846's two-pass cold-cache rescoring.

Key changes:
- Extended n-gram orders from 2-7 to 2-12 with 8M-bucket hash tables
- Two-pass eval: rescore first 15 chunks with the full cache after pass 1
- Per-chunk loss tracking for precise pass-1/pass-2 delta computation
- Configurable via env vars: NGRAM_MAX_ORDER, NGRAM_BUCKETS, TWO_PASS_ENABLED, TWO_PASS_RESCORE_CHUNKS

Based on the PR openai#834 (AnirudhRahul) + PR openai#846 (himanshudongre) stack.
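The extended-order hash tables described above can be sketched as follows. This is an illustrative pure-Python stand-in, not the actual train_gpt.py code; the hash-mixing constants and the dict-based table layout are assumptions, while `NGRAM_MAX_ORDER` and `NGRAM_BUCKETS` mirror the PR's env vars:

```python
# Hedged sketch: hashed n-gram count tables for orders 2..12.
NGRAM_MAX_ORDER = 12
NGRAM_BUCKETS = 8_388_608  # 8M buckets per order, as in the PR

def ctx_bucket(context, order):
    """Hash the last (order-1) tokens of `context` into a bucket index."""
    h = order  # per-order salt so tables don't collide identically
    for tok in context[-(order - 1):]:
        h = (h * 1000003 ^ tok) & 0xFFFFFFFFFFFF
    return h % NGRAM_BUCKETS

# One (bucket -> token -> count) table per order 2..12.
tables = {k: {} for k in range(2, NGRAM_MAX_ORDER + 1)}

def update(context, target):
    """Causal cache update: record `target` under each order's context hash."""
    for k in range(2, NGRAM_MAX_ORDER + 1):
        if len(context) >= k - 1:
            counts = tables[k].setdefault(ctx_bucket(context, k), {})
            counts[target] = counts.get(target, 0) + 1

def predict(context, k):
    """Most frequent continuation seen for this order-k context, if any."""
    counts = tables[k].get(ctx_bucket(context, k), {})
    return max(counts, key=counts.get) if counts else None
```

Only the trailing `order - 1` tokens enter the hash, so higher orders condition on longer contexts while sharing the same fixed-size bucket budget.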
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- lambda_v * x0 shortcut from the initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate and lambda_v to the control-tensor patterns for proper quantization handling
- All smoke tests pass on CPU
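The two residual-stream tweaks above can be sketched per head. This is a minimal pure-Python stand-in for the PyTorch code: the names `attn_gate` and `lambda_v` come from the PR, but the exact arithmetic is an assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_output(attn_out, attn_gate):
    """Per-head learned gate (GATED_ATTENTION=1): each head's output is
    scaled by sigmoid of its learned gate before heads are recombined."""
    return [sigmoid(g) * h for g, h in zip(attn_gate, attn_out)]

def value_with_residual(v, x0, lambda_v):
    """Value-residual shortcut (VALUE_RESIDUAL=1): mix the initial
    embedding x0 back into the value path, weighted by the learned
    scalar lambda_v."""
    return [vi + lambda_v * x0i for vi, x0i in zip(v, x0)]
```

With `lambda_v = 0` the shortcut is a no-op, which is why both features can default on without changing behavior at initialization if the scalars start at neutral values.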
…eader

Major additions:
- Depth recurrence: layers 4 and 5 repeated, giving 13 virtual layers from 11 physical
  - Repeat blocks share heavy CastedLinear weights but own their scalar params
  - untie_recurrence() deep-copies the shared weights before TTT for independent specialization
  - Only ~1% param overhead during training
- TTT defaults changed to match the PR openai#733 winning recipe:
  - SGD optimizer (was AdamW): simpler, less memory
  - lr=0.002 (was 0.0005): higher for SGD
  - Unfreeze all 11 blocks (was 2): more params for adaptation
  - All repeat_blocks params unfrozen for TTT

Configurable via: RECUR_LAYERS="4,5" TTT_OPTIMIZER=sgd TTT_LR=0.002

All smoke tests pass on CPU (syntax, recurrence, weight sharing, untie).
Community Review — Learned Routing + Two-Pass N-gram Rescoring + Extended Orders (2-12)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

---

## Analysis

### N-gram hash family bug check (ILLEGAL: target XOR'd into ctx_key)

CLEAN.

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk is scored before any test-time update.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). The TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
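The "score-first-per-chunk" discipline the review checks for can be sketched as the following loop. This is an illustrative sketch, not the actual train_gpt.py eval loop; `score` and `ttt_update` are hypothetical helpers standing in for the real forward pass and optimizer step:

```python
def eval_with_ttt(chunks, score, ttt_update):
    """Legal score-first-per-chunk TTT: every chunk is scored with the
    current weights BEFORE any test-time update on that chunk, so no
    chunk's reported loss benefits from having trained on itself."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # 1) score under current weights
        ttt_update(chunk)            # 2) only then adapt on this chunk
    return sum(losses) / len(losses)
```

Reversing the two steps inside the loop (updating before scoring) is the illegal pattern such audits are meant to catch.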
Summary
Combines techniques from PRs #834, #846, #733, and #693 into a single submission with 9 innovations.
Techniques
Learned routing head: Linear(512, 12)

Architecture
PR #834/414 stack: 11 physical layers (13 virtual via depth recurrence), 512d, 8H, 8KV, LeakyReLU(0.5)^2, U-Net skips, SmearGate, BigramHash(6144), Partial RoPE (16/64), XSA all layers, VE128 on layers 9-10, EMA+SWA, GPTQ int5 + zstd-22.
Key Innovation: Depth Recurrence without Bank Refactor
Instead of PR #733's parameter bank approach, we use shared module references: repeat blocks share CastedLinear weights from physical blocks but own independent scalar params (attn_scale, mlp_scale, attn_gate, lambda_v). Before TTT, untie_recurrence() deep-copies the heavy weights so repeat layers can specialize independently. ~1% param overhead during training, full independence during TTT.

Two-Pass Rescoring
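A minimal sketch of the shared-reference scheme and untie_recurrence(). This is a pure-Python stand-in under simplifying assumptions: the real code shares CastedLinear modules inside nn.Module blocks, and repeat blocks are interleaved into the layer stack rather than appended at the end:

```python
import copy

class Block:
    """Stand-in for a transformer block: heavy shared weights plus
    block-owned scalar params (attn_scale, mlp_scale, ...)."""
    def __init__(self, weights, attn_scale=1.0, mlp_scale=1.0):
        self.weights = weights        # heavy CastedLinear stand-in
        self.attn_scale = attn_scale  # scalars are always independent
        self.mlp_scale = mlp_scale

def make_virtual_stack(physical, recur_layers):
    """Build the virtual stack: each repeat block aliases the heavy
    weights of its physical block but owns fresh scalar params
    (simplified: repeats appended at the end here)."""
    stack = list(physical)
    for i in recur_layers:
        stack.append(Block(physical[i].weights))  # shared reference
    return stack

def untie_recurrence(stack, physical):
    """Before TTT: deep-copy the shared heavy weights so repeat
    layers can specialize independently during test-time training."""
    shared = {id(b.weights) for b in physical}
    for b in stack[len(physical):]:
        if id(b.weights) in shared:
            b.weights = copy.deepcopy(b.weights)
```

During training the aliasing means one gradient-carrying weight tensor serves both the physical and repeat layer (hence the ~1% overhead, just the extra scalars); after untying, each virtual layer holds its own copy.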
Pass 1: Standard sequential chunk eval with causal n-gram cache building.
Pass 2: Rescore first 15 chunks with the full cache (no updates). Early chunks improve dramatically since their n-gram experts now have full context. Per-chunk loss tracking enables precise delta computation.
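The two passes above can be sketched as follows. An illustrative sketch, not the PR's actual eval code; `score` and `update_cache` are hypothetical helpers, and `rescore_chunks` mirrors TWO_PASS_RESCORE_CHUNKS:

```python
def two_pass_eval(chunks, score, update_cache, rescore_chunks=15):
    """Two-pass eval (TWO_PASS_ENABLED=1): pass 1 scores chunks
    sequentially while the n-gram cache grows causally; pass 2
    rescores the first `rescore_chunks` chunks against the full,
    now-frozen cache. Per-chunk losses from both passes are kept
    so the pass-1/pass-2 delta can be computed exactly."""
    pass1 = []
    for chunk in chunks:
        pass1.append(score(chunk))  # cold/partial cache
        update_cache(chunk)         # causal cache growth
    pass2 = list(pass1)
    for i, chunk in enumerate(chunks[:rescore_chunks]):
        pass2[i] = score(chunk)     # full cache, no further updates
    deltas = [a - b for a, b in zip(pass1, pass2)]
    return pass2, deltas
```

The early chunks are exactly the ones scored under the coldest cache in pass 1, which is why rescoring only the first 15 captures most of the gain.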
Status
Run Command
```shell
RECUR_LAYERS="4,5" GATED_ATTENTION=1 VALUE_RESIDUAL=1 \
TWO_PASS_ENABLED=1 TWO_PASS_RESCORE_CHUNKS=15 \
NGRAM_MAX_ORDER=12 NGRAM_BUCKETS=8388608 \
TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_FREEZE_BLOCKS=11 \
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=6144 XSA_LAST_N=11 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits