Record: Asymmetric Two-Lane Parallel Routing + Tap-In V6 + Legal TTT (1.073938) #1518
abaybektursun wants to merge 1 commit into openai:main
Conversation
…nking HIGH priority

Key findings from daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6 MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (−0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131 → 0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
Community Review — Record: Wider Loop + Per-Pass Embeddings + Muon 0.98 + Tap-In V6 + Legal TTT (1.077741)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Summary: PR #1518 ("WiderEmb_TapInV6_TTT") implements legal score-first TTT (the PR #549/1413 pattern) combined with a Tap-In V6 n-gram logit-boosting heuristic at eval time. No illegal patterns were found.

## TTT Eval Structure (train_gpt_readable.py lines 1401–1584)

The function
The W3 control line already showed that additive pass embeddings can carry real signal, and this is a much lighter integration than a full Tap-In port. This patch adds a tiny zero-init loop-pass embedding at the loop start on top of the current best W2 candidate, so we can measure whether the openai#1518 family still adds value after the pass-conditioned attention modulation is already in place.

Constraint: need another high-upside lane that is simpler and less review-heavy than Tap-In while the node is available now
Rejected: keep waiting on a full Tap-In integration (too much implementation surface for the next immediate experiment)
Confidence: medium
Scope-risk: narrow
Directive: if this lane also fails to improve on r9, the next frontier move should probably be a more complete eval-time retrieval import, not another small loop-side tweak
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU score, bytes, and runtime on the integrated lane
…D=0.12 + Trimmed GPTQ + Wider Loop + Per-Pass Embeddings + Muon 0.98 + Legal TTT

Increases TAPIN_V6_CROSS_W from 0.06 to 0.12. This is the weight on the Tap-In V6 cross-window n-gram rule at eval time; doubling it pushes the cross-window hint harder.

3-seed mean V6+TTT BPB:
- s2025: 1.073313 (−0.000133 vs cross_w=0.06)
- s1234: 1.073801 (−0.000175)
- s42: 1.074701 (+0.000078, noise)
- Mean: 1.073938 (std 0.000704)

2 of 3 seeds improved meaningfully. Cumulative improvement over the original symmetric-init revision (1.075262): −0.001324 BPB. All 3 seeds under 16 MB (~102 KB headroom each).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Asymmetric Two-Lane Parallel Routing + Tap-In V6 cross_w=0.12 + MUON_WD=0.12 + Trimmed GPTQ + Wider Loop + Per-Pass Embeddings + Muon 0.98 + Legal TTT

Results
3-seed mean +V6+TTT: 1.073938 sliding-window BPB. All seeds comfortably under 16 MB.
Against the current leaderboard #1 (1.0810, PR #1493, bigbag — SP8192 + 3-Layer Recurrence + Parallel Residuals + Legal TTT): −0.007062 nats cleared, comfortably above the challenge's 0.005-nat SOTA improvement bar.

Budget (recommended seed 2025)
Counted against the 16 MB budget: the serialized model (`.int6.ptz`) and the `train_gpt.py` source, counted at runtime via `len(Path(__file__).read_text().encode("utf-8"))`.

(The committed `train_gpt.py` in this folder is a 33,409-byte LZMA stub that carries the 80,422-byte source code as compressed data. The stub extracts its payload to a temp directory at run time, and `Path(__file__)` then points to the extracted, uncompressed source, which is what `code_bytes` counts against the 16 MB budget in the training log.)

Key Techniques
Asymmetric lane initialization
The two-lane routing matrix is initialized asymmetrically instead of as all-ones. Attention output starts strongly routed to lane 0 ($1.3$) and weakly to lane 1 ($0.7$); MLP output starts strongly routed to lane 1 ($1.3$) and weakly to lane 0 ($0.7$). Zero additional parameters — this is a pure initialization change to the existing $11 \times 2 \times 2$ `parallel_post_lambdas` tensor.

Why symmetry breaking helps
With all-ones init, both lanes are identical functions of the input at step 0. The optimization landscape has a continuous symmetry between them: any solution is equivalent to a permuted solution where lanes are swapped. Training has no preferred direction in this symmetry group and the two lanes remain near-identical throughout, collapsing the effective dimension of the routing matrix.
Asymmetric init breaks the symmetry at step 0 and the two lanes specialize: lane 0 becomes the "attn-heavy" path, lane 1 the "mlp-heavy" path. The learned routing matrix at the end of training still prefers the asymmetric pattern (we verified by inspecting `parallel_post_lambdas` in the trained checkpoint), confirming the optimization lands in a non-symmetric solution that the symmetric init was failing to reach in 10 minutes. Given the null cost (no new parameters, no new code paths, no extra training time), this is free.

The $1.3 / 0.7$ split was chosen to keep the initial lane outputs at roughly the same magnitude as the all-ones baseline while forcing specialization. We did not sweep this — a tighter split ($1.2 / 0.8$ or $1.1 / 0.9$) may work equally well. The asymmetric init is conceptually orthogonal to the routing mechanism itself and stacks cleanly.
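As a concrete sketch, the init described above amounts to something like the following. The tensor name follows the text; the `[layer, sublayer, lane]` axis order is our assumption, not necessarily the submission's exact layout.

```python
import torch

# Hypothetical sketch of the asymmetric lane init.
# Axis order [layer, sublayer, lane] is an assumption (sublayer 0 = attention, 1 = MLP).
NUM_LAYERS = 11
lam = torch.ones(NUM_LAYERS, 2, 2)

lam[:, 0, 0] = 1.3   # attention writes strongly to lane 0
lam[:, 0, 1] = 0.7   # ...and weakly to lane 1
lam[:, 1, 0] = 0.7   # MLP writes weakly to lane 0
lam[:, 1, 1] = 1.3   # ...and strongly to lane 1

parallel_post_lambdas = torch.nn.Parameter(lam)

# Each sublayer's total write weight is still 1.3 + 0.7 = 2.0, matching the
# all-ones baseline's magnitude, but the lane symmetry is broken at step 0.
```

Swapping the 1.3/0.7 pattern between sublayers is what makes lane 0 "attn-heavy" and lane 1 "mlp-heavy" from the first step.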
Two-lane parallel residual routing
The last three decoder layers (8, 9, 10) maintain two parallel residual streams instead of one. At each layer, attention reads from lane 0, MLP reads from lane 1, and each writes back to both lanes via a learned $2 \times 2$ routing matrix. The final output reads lane 1. Total cost: 66 new scalar parameters ($11 \times 2 \times 2$ post-lambdas + $11 \times 2$ resid-lambdas). This is the single largest mechanism in this submission.
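A minimal sketch of one two-lane block under assumed shapes (the real implementation also has resid-lambdas and norms, which are omitted here):

```python
import torch

# Sketch of one two-lane parallel block: attention reads lane 0, MLP reads
# lane 1, each writes back to both lanes through a learned 2x2 routing matrix.
# The final model output reads lane 1.
def parallel_block(h0, h1, attn, mlp, post_lam):
    y_a = attn(h0)                                   # attention reads lane 0
    y_m = mlp(h1)                                    # MLP reads lane 1
    new_h0 = h0 + post_lam[0, 0] * y_a + post_lam[1, 0] * y_m
    new_h1 = h1 + post_lam[0, 1] * y_a + post_lam[1, 1] * y_m
    return new_h0, new_h1

# Toy usage: identity sublayers, all-ones routing.
post_lam = torch.ones(2, 2)
h0 = h1 = torch.ones(4)
h0, h1 = parallel_block(h0, h1, lambda x: x, lambda x: x, post_lam)
```

With identity sublayers and all-ones routing, both lanes evolve identically (here both end at 3.0 everywhere), which is exactly the lane symmetry the asymmetric init is designed to break.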
How we got here: controls-only TTT, a theory that predicted the wrong thing, and the gradient-flow reading that survived
Controls-only TTT already gets 70%
Before any architecture change, we ran a diagnostic: unfreeze only the 9,700 control parameters (`skip_weights`, `resid_mix`, `attn_scale`, `mlp_scale` — scalar tensors that scale and mix things but compute no new features) and run the same TTT loop.

Seventy percent of TTT's gain comes from 0.03% of the parameters. Most of what TTT does at test time is re-weight the flow of existing features, not learn new ones.
A theory that fit until it didn't
From that reading we wrote down a coupling-flexibility theory. If TTT is mostly routing adjustments on a single residual stream, then a more-trained model has layers more tightly co-adapted, so a TTT perturbation at one layer's routing breaks downstream layers that were expecting the old distribution. On that reading, raw quality and TTT extraction are opposed on a single-stream architecture.
This fit every negative result on our books. Earlier loop activation (more shared training) improved raw BPB but cut TTT extraction 3×. Extra trained scalar gates on every block improved raw BPB and cut V6+TTT. Lower weight decay helped fp32 fit and blew up the quantization gap. All consistent with the theory.
The theory predicted the fix. Two parallel lanes with a learned routing matrix would let TTT move mass between them without breaking either, because routing between lanes is zero-sum and total compute is preserved. We implemented it expecting more TTT surface.
What we actually got
Raw SW moved by 0.004, a very large number on a stack where deltas live around 0.0001. TTT extraction almost halved. The win was real; the prediction was wrong about where it came from.
Why the routing cannot be the thing
A single parallel block, dropping norms for clarity, is

$$h_0' = h_0 + \alpha_0\, y_a + \beta_0\, y_m, \qquad h_1' = h_1 + \alpha_1\, y_a + \beta_1\, y_m$$

where $h_0, h_1$ are the two lanes, $y_a, y_m$ are the attention and MLP sublayer outputs, $\alpha_i, \beta_i$ are the routing scalars, and the final output reads $h_1$. Take the loss gradient with respect to the two attention-routing scalars:

$$\frac{\partial \mathcal{L}}{\partial \alpha_0} = \left\langle \frac{\partial \mathcal{L}}{\partial h_0'},\, y_a \right\rangle, \qquad \frac{\partial \mathcal{L}}{\partial \alpha_1} = \left\langle \frac{\partial \mathcal{L}}{\partial h_1'},\, y_a \right\rangle$$

At initialization both lanes equal the layer input, the routing is symmetric ($\alpha_0 = \alpha_1 = 1$), and downstream treats the two lanes the same way, so $\partial \mathcal{L} / \partial h_0' = \partial \mathcal{L} / \partial h_1'$. The two partials are equal. SGD updates $\alpha_0$ and $\alpha_1$ by the same amount, same sign, same magnitude. The difference $\alpha_0 - \alpha_1$ has no driving term.
This is a $\mathbb{Z}_2$ gauge symmetry in the parametrization. The shift $(\alpha_0, \alpha_1) \to (\alpha_0 + \varepsilon, \alpha_1 - \varepsilon)$ is a null direction of the loss at initialization, and the gradient descent trajectory stays on the symmetric submanifold. Roughly half of the 66 routing scalars are gauge. TTT cannot adapt along a coordinate the gradient does not see, so the routing-as-TTT-surface reading was empty.
This is the standard failure mode of parallel ensembles without explicit symmetry breaking. We had assumed the architecture provided it; the math says it does not.
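The gauge argument can be checked numerically on a toy model: with identical lanes and a downstream that treats them symmetrically, the two routing gradients come out exactly equal, so $\alpha_0 - \alpha_1$ never moves. A self-contained sketch (shapes and the downstream function are arbitrary stand-ins):

```python
import torch

# At a symmetric init, the gradients w.r.t. the two attention-routing
# scalars are equal: the lane-difference direction has no driving term.
torch.manual_seed(0)
x = torch.randn(8)        # layer input (both lanes equal it at init)
y_a = torch.randn(8)      # stand-in attention sublayer output

alpha = torch.ones(2, requires_grad=True)   # (alpha0, alpha1), symmetric init
h0 = x + alpha[0] * y_a
h1 = x + alpha[1] * y_a
# Downstream that treats the lanes symmetrically:
loss = ((h0 + h1) ** 2).mean()
loss.backward()

# Same amount, same sign, same magnitude -> alpha0 - alpha1 stays zero.
assert torch.allclose(alpha.grad[0], alpha.grad[1])
```

Breaking the init symmetry (e.g. `alpha = torch.tensor([1.3, 0.7], requires_grad=True)`) makes the two gradients differ, which is the point of the asymmetric initialization above.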
The reading that survived the math
If the routing scalars are half gauge, the win has to come from something that would also show up with the routing frozen at initialization. So look at the gradient path, not the parameter count.
In a single-stream transformer, the gradient from the loss to $y_a$ at layer $\ell$ walks backward through every later layer:

$$\frac{\partial \mathcal{L}}{\partial y_a^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial o} \cdot \prod_{k = \ell + 1}^{L} J^{(k)}$$

where $o$ is the model output and $J^{(k)}$ is layer $k$'s Jacobian.
Each Jacobian attenuates and rotates. Layer 8's attention gets a signal already shaped by layers 9 and 10, the final norm, the head, and the softcap. Layer 8 ends up learning features conditional on what 9 and 10 happen to be doing.
In the two-lane architecture every sublayer writes directly into $h_1$ through its routing coefficient, so the gradient has an order-1 term independent of depth:

$$\frac{\partial \mathcal{L}}{\partial y_a^{(\ell)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial o} \cdot \alpha_1^{(\ell)}}_{\text{direct}} + \underbrace{\text{cascading terms}}_{\text{order}\ \geq\ 2}$$
The direct term is a single scalar multiply away from the loss, regardless of whether the block is at layer 8, 9, or 10. It gets there on the first backprop step and keeps getting there. This is DenseNet connectivity applied to the last few decoder layers of a language model.
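A toy numeric illustration of the order-1 direct path (the depth and attenuation factor are arbitrary stand-ins, not measured values):

```python
import torch

# In a deep chain of contracting layers, an early sublayer's gradient decays
# geometrically; a direct scalar write into the output lane keeps an order-1
# term regardless of depth.
torch.manual_seed(0)
depth, scale = 10, 0.5     # each later "layer" attenuates the signal by `scale`

# Single stream: gradient walks through every later layer.
y = torch.ones(4, requires_grad=True)   # stand-in sublayer output y_a
chained = y
for _ in range(depth):
    chained = scale * chained
chained.sum().backward()
chain_grad = y.grad.clone()             # ~ scale**depth per element

# Two-lane: y also writes directly into the output through alpha1.
y2 = torch.ones(4, requires_grad=True)
alpha1 = 1.0
out = alpha1 * y2 + scale ** depth * y2  # direct term + attenuated cascade
out.sum().backward()
direct_grad = y2.grad.clone()            # ~ alpha1, order-1

assert chain_grad.abs().max() < 1e-2     # 0.5**10 ~ 1e-3: heavily attenuated
assert direct_grad.abs().min() > 1.0     # order-1 direct term survives
```

The contrast is the whole argument: the direct path delivers a usable gradient to layer 8 on the first backprop step, whereas the chained path delivers a signal already shrunk by every downstream Jacobian.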
Why the effect is so large at 10 minutes
The DenseNet shortcut's benefit at convergence is modest. At 588 seconds on 8×H100 we run about 4,873 SGD steps, which is severely undertrained by any classical standard, and the decoder — deepest in the gradient chain, furthest from the loss — is the region whose weights are least converged when wallclock hits. A 24-hour run would narrow the gap between the two architectures. Our run does not.
That is why almost the entire improvement lives in pre-quant BPB ($-0.00308$ of the $-0.00396$ raw SW delta). It is a training-time effect. The post-GPTQ weights are better because the pre-GPTQ weights are better, because each decoder block saw cleaner gradients per step.
One line
Every sublayer between a weight and the loss is a tax on that weight's training signal. At 10 minutes we cannot afford taxes we don't have to pay. Parallel routing is the cheapest exemption we have found: 66 scalars, no new matrices, no kernel issues, and an order-1 gradient path from every late-decoder sublayer to the loss.
The routing-as-adaptation reading is the one we started with. The routing-as-gradient-shortcut reading is the one that survives the math and the numbers. We were looking for adaptability and we found training efficiency hiding inside the same architecture.
Wider depth recurrence + per-pass loop embeddings
`LOOP_START=3`, `LOOP_END=5`, `NUM_LOOPS=2` — three passes through three distinct loop blocks instead of four passes through two: 9 loop block executions, 17 virtual layers total. Three zero-init learned vectors (`nn.Embedding(3, 512)`), one added to the residual at the start of each pass.

Wider loop + per-pass embeddings — mechanistic analysis
Depth recurrence reuses block weights across passes, creating virtual depth without new parameters. The cost: quantization error amplifies through reuse by $A(k) = (1 - \rho^k) / (1 - \rho)$, which at our contraction ratio $\rho \approx 0.63$ is $1.96\times$ for three passes or $1.67\times$ for two.
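A quick check of how the amplification factor behaves: one pass means no amplification, each extra pass amplifies more, and the factor saturates at $1/(1-\rho)$ as $k$ grows.

```python
# Quantization-error amplification through weight reuse, per the formula above:
# A(k) = (1 - rho**k) / (1 - rho) for k passes at contraction ratio rho.
def amplification(rho: float, k: int) -> float:
    return (1.0 - rho ** k) / (1.0 - rho)

# Sanity checks at the stated contraction ratio rho ~ 0.63.
assert amplification(0.63, 1) == 1.0                          # one pass: no amplification
assert amplification(0.63, 3) > amplification(0.63, 2) > 1.0  # grows with passes
assert amplification(0.63, 50) < 1.0 / (1.0 - 0.63) + 1e-9    # saturates at 1/(1-rho)
```

Exact values depend on the precise $\rho$; the point is that going from two passes to three pays a bounded, sub-linear amplification cost.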
Wider loop. Blocks $(3, 4, 5)$ looped three times instead of $(4, 5)$ looped four times. Same 17 virtual layers, three distinct parameter sets instead of two. Gives $-0.0007$ BPB at identical pre-quant — the improvement is entirely post-quantization.

Per-pass embeddings. Three zero-init learned vectors ($e_i \in \mathbb{R}^{512}$, 1,536 parameters total) added to the residual before each pass. Combined with the wider topology: $-0.00124$ BPB on a 5-seed mean at $p < 0.003$. On the narrow topology: only $-0.0005$. The mechanism is topology-dependent.
Where the gain lives. The embeddings barely improve fp32 modeling. Nearly all of the gain comes from a collapsed quantization gap ($0.0131 \to 0.0114$ ). The weights become more quantization-friendly, not more expressive.
We traced this through per-matrix statistics, then per-head decomposition, then direct intervention. The weight-distribution signature localizes to two attention heads (K head 2, V head 1) in the loop blocks. Injecting a bias directly at those heads recovers about half of the gain (the modeling part) but none of the compression part. The per-head signature is downstream of the mechanism, not its cause.
The embedding mechanism has two separable effects: a modeling effect (K specialization in the newly-added block 3, reproducible by a 192-parameter direct bias) and a compression effect (quant-gap collapse, not reproducible by any targeted head-level intervention we tested). The full residual-stream embedding constrains K from over-specializing and trades that headroom for compression-friendliness. Direct bias takes the unconstrained modeling win but misses the compression side. The mechanism requires the embedding to propagate through shared RMSNorm into both attention and MLP simultaneously; neither pathway alone reproduces it.
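A minimal sketch of the per-pass embedding mechanism under assumed dimensions (zero-init makes it a no-op at step 0; names are illustrative, not the submission's exact code):

```python
import torch
import torch.nn as nn

# One zero-init learned vector per pass, added to the residual at the start
# of each pass through the shared loop blocks (blocks 3..5 in the text).
D_MODEL, NUM_PASSES = 512, 3
pass_emb = nn.Embedding(NUM_PASSES, D_MODEL)
nn.init.zeros_(pass_emb.weight)  # zero-init: identical to baseline at step 0

def run_loop(x, loop_blocks):
    for p in range(NUM_PASSES):
        x = x + pass_emb.weight[p]   # pass-conditioned offset into the residual
        for block in loop_blocks:    # same block weights reused every pass
            x = block(x)
    return x
```

Because the embedding enters the residual stream, it propagates through the shared RMSNorm into both attention and MLP on every pass, which is the property the head-level bias intervention above could not reproduce.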
Tap-In V6 with TAPIN_V6_CROSS_W=0.12

Cross-window n-gram + cross-window lost-length rule, applied at eval time by a C++ matcher (~135 s on 8×H100). `TAPIN_V6_CROSS_W` is the weight on the cross-window hint signal, set to `0.12` (double the upstream default of `0.06`) to push the Tap-In nudge harder.

Tap-In — what it is and why we call it that
Why "Tap-In"? In golf, the tap-in is the tiny final stroke that rolls the ball the last inch into the hole after the big drive has done all the work. The model does the big swing; Tap-In is the small eval-time nudge that finishes the putt.
Intuitively: Tap-In is a document-local scribe. As the model predicts each token, the scribe scans backward through the same document for the exact phrase the model just generated and whispers what came after it last time. If the model is already considering that token, the scribe nudges its probability up a tiny bit. If the phrase fell out of the model's 2048-token attention window (think: a proper name introduced 3000 tokens ago), the scribe is the only one who can recover it. Wrong whispers cost almost nothing because the nudge is small; right whispers — especially for forgotten long-range repetitions — cut several nats off the loss at that single position. It fires hundreds of thousands of times across the eval; each win is small but they stack into $-0.001$ BPB on top of the model.
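The real matcher is C++ with separate within-window and cross-window rules; the core "scribe" idea can be sketched in a few lines of pure Python (a brute-force toy, not the shipped algorithm):

```python
# Toy document-local scribe: find the longest exact earlier occurrence of the
# suffix ending just before position t, and return the token that followed it
# last time. The real matcher would add this as a small logit/probability nudge.
def tapin_hint(ids, t, max_n=8):
    for n in range(min(max_n, t), 0, -1):       # longest n-gram first
        suffix = ids[t - n:t]
        for p in range(t - n - 1, -1, -1):       # scan backward through the doc
            if ids[p:p + n] == suffix:
                return ids[p + n]                # "what came after it last time"
    return None                                  # no repeat found: no nudge

doc = [5, 7, 9, 2, 5, 7, 9]
# At t=7 the suffix [5, 7, 9] occurred earlier, followed by token 2.
assert tapin_hint(doc, 7) == 2
```

A production matcher replaces the quadratic scan with the linked-list `head / tail / fwd[tok]` index described in the Legality section, but the prediction it produces is the same kind of "last continuation of the longest repeated suffix" hint.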
Legal Score-First TTT
`TTT_LR=0.005`, `TTT_FREEZE_BLOCKS=0`, stacked on top of V6 in the SCORE phase. All 35.9M parameters trainable. Score is accumulated under `torch.no_grad()` before any optimizer step runs, so every scored token was predicted by a model that had not yet seen it.

Muon momentum 0.98
Lower Nesterov momentum reduces the effective gradient memory from 100 steps to 50 steps, better matched to the short training run. A 4-point sweep (0.95, 0.97, 0.98, 0.99) identifies 0.98 as the sweet spot at $-0.00108$ BPB (3-seed mean) — about $-0.0006$ from better pre-quant convergence and $-0.0005$ from a reduced quantization gap.
Muon weight decay 0.12 + MATRIX_CLIP_SIGMAS=13.1

Higher WD shrinks weight magnitudes, so the quantization gap closes to $0.0095$. Pre-quant BPB is essentially unchanged from lower-WD alternatives; the gain comes from the tightened quant gap. The slightly higher `MATRIX_CLIP_SIGMAS=13.1` absorbs the byte overhead of the routing scalars while keeping all seeds under 16 MB.

HESSIAN_CLIP_LAMBDA=0

An upstream code default of $0.175$ was a known-failed feature left in by an earlier PR. Pinning it to 0 gives $-0.0006$ BPB and an about 40 KB smaller model.
GPTQ_RESERVE_SECONDS=4 + GPTQ_CALIBRATION_BATCHES=16

The training loop stops `GPTQ_RESERVE_SECONDS` before the 600 s wall-clock cap so GPTQ Hessian collection has room to run. Upstream defaults are $12$ seconds of reserve and $64$ calibration batches, which collect Hessians in ~$13$ seconds. Research ("Hessians already converged well before 64 batches") suggests the calibration budget is over-provisioned; cutting it to 16 batches quarters the collection time to ~$3.5$ seconds with no meaningful quality loss. With the faster GPTQ, we can safely drop the reserve to 4 seconds, reclaiming ~$17$ seconds of wall clock (from the combined 12 → 4 reserve cut and the 13 → 3.5 collection cut) for warmdown training. All three seeds stay under 16 MB.

What gets evaluated
The competition harness runs
`torchrun --nproc_per_node=8 train_gpt.py`. This single file is the entire scored submission — it decompresses, trains, quantizes, and evaluates end-to-end.

The `human_readable/` directory contains the identical unminified source code for reviewer convenience; it is not used at runtime. The LZMA stub in `train_gpt.py` carries its own compressed copy of the same source files.

Methodology — single pass, no double evaluation
The headline number is from a single causal left-to-right pass through the val set with Tap-In V6 + Legal TTT applied during scoring. There is no double pass, no second-pass rescoring, no information leak between runs.
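The leak-free scoring order can be sketched as follows (assumed shapes and names; the shipped code additionally aligns chunks to 32768-token boundaries):

```python
import torch

# Score-first TTT: each chunk is scored under torch.no_grad() BEFORE the
# optimizer ever steps on it, so every scored token is predicted by a model
# that has not yet trained on that token.
def score_first_ttt(model, optimizer, chunks, loss_fn):
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        with torch.no_grad():                       # 1) score first...
            loss = loss_fn(model(inputs), targets)
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
        loss = loss_fn(model(inputs), targets)      # 2) ...then adapt on the same chunk
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return total_loss / total_tokens                # reported score
```

With a single chunk, the returned score is exactly the pre-adaptation loss; with many chunks, chunk `c+1` is scored by a model that has only trained on chunks `<= c`, which is the strict-prefix property the Legality section argues for.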
- `train_seed{42,1234,2025}.log` — training logs per seed
- `eval_v6_ttt_s{42,1234,2025}.log` — each loads the saved `.int6.ptz` and runs V6 + Legal TTT; these are the headline numbers per seed

Each eval is a fresh load of the same saved int6 model — no state carried between runs, no information leak from any earlier run into a later one. The leaderboard-scored number is the +V6+TTT column of the per-seed table, produced by a single pass.
Legality
Every gain comes from the strict prefix. Score-first TTT accumulates `loss_sum` under `torch.no_grad()` before `optimizer.step()` runs, and the chunk math is airtight: chunk `c`'s training targets max out at global position `(c+1)*32768`, while chunk `c+1`'s scored targets start at `>= (c+1)*32768 + 1`. Strict inequality; no token is ever predicted by a model that has already been trained on it.

The Tap-In V6 C++ matcher is byte-identical to the previously-audited reference. Within-window matches require `p+1 < t`, so `cont = ids[p+1]` is strict prefix. Cross-window's `lost_len_at_t` upper bound resolves to `(ws + t) - window_size + 1 < ws + t + 1`. The linked-list `head / tail / fwd[tok]` update happens after the score block, not before. There is no `is_bnd_[tok]` or `has_ls_[tok]` target-dependent gating anywhere in the matcher — the Category 15 attack surface is structurally absent, not disabled.

The probability mixing sums to 1 by construction. Eval is one left-to-right sliding pass with non-overlapping 64-token scored ranges, so no position is ever rescored. GPTQ Hessians are collected from `train_loader.next_batch()` with zero val-data exposure during training. The model is deserialized from `.int6.ptz` before TTT touches anything, so this is eval-time adaptation, not pre-quant TTT.

The two-lane parallel routing is training-time architecture only. Its 66 scalars are trained on the same `train_loader` as everything else, with no eval-time tuning.

What did not work
Every row is a controlled single-seed experiment against the same baseline (s42, pre–parallel-routing stack). The column is Δ to V6+TTT BPB — positive means worse. Kept here for the record; none of these are in the submission.
- `MUON_WD=0.05`
- `ENABLE_LOOPING_AT=0.35`
- `LOOP_START=3 LOOP_END=6`
- `PARALLEL_RESIDUAL_START=6`
- `attn_temp + block_gate + skip_routing`
- `block_bias + skip_routing`

The LoRA TTT experiments (1, 2) are the largest regressions because they corrupt the model's forward pass at eval time. The trained-gate experiments (8, 9) are the most diagnostic: they directly motivated the coupling-flexibility theory that eventually led us to implement parallel routing for the wrong reason.
Files
- `train_gpt.py` — the scored artifact. Self-contained LZMA stub that decompresses, builds the CUTLASS kernel, trains the model, then runs V6 + TTT eval. Contains minified versions of all source files below.
- `human_readable/` — the identical unminified source code, provided for reviewer convenience and ease of review. Not used at runtime; the stub carries its own compressed copy.
  - `train_gpt.py` — model (including two-lane routing), training loop, GPTQ, serialization, eval functions
  - `tapin_cpp.py` — C++ Tap-In matcher (single-file `load_inline`)
  - `_runner.py` — end-to-end orchestrator: train → monkey-patch MLP → install V6 → TTT eval
  - `cutlass_evt_fusion/` — fused MLP backward kernel

Reproduce — end-to-end on a fresh 8×H100 box
0. Hardware
1. Python + PyTorch + FA3
2. CUTLASS headers (one-time, system-wide)
3. Download the SP8192 dataset
The dataset and tokenizer are pre-built on HuggingFace under the parameter-golf data repo. Place them so the structure is:
Then `export DATA_DIR=~/data/`.

4. Run (train + V6 + TTT eval, end-to-end)

```
SEED=2025 DATA_DIR=$DATA_DIR \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

`train_gpt.py` is self-contained — it decompresses the code, builds the CUTLASS kernel, trains the model (~10 min), then automatically runs V6 + TTT eval (~7 min). No separate eval step needed. All tuned env vars (`LOOP_START=3`, `MUON_MOMENTUM=0.98`, `MUON_WD=0.12`, `MATRIX_CLIP_SIGMAS=13.1`, `PARALLEL_RESIDUAL_START=8`, `HESSIAN_CLIP_LAMBDA=0`, `GPTQ_RESERVE_SECONDS=4`, `GPTQ_CALIBRATION_BATCHES=16`, `TAPIN_V6_CROSS_W=0.12`) are set by the stub itself via `os.environ.update(...)` before `exec`'ing the runner.

5. Expected output
For `SEED=2025`: the headline number is `val_bpb: 1.073313`. To reproduce the 3-seed mean of 1.073938, run with `SEED=42` and `SEED=1234` and average.

Troubleshooting
- `val_bpb` mismatch: `torch.compile` may have been stripped — verify `eval_val_sliding_ttt` has `logits_fn = torch.compile(model.forward_logits, dynamic=False, fullgraph=True)`
- `val_bpb` mismatch: the `TAPIN_CPP=1 TAPIN_V4_ENABLED=1 TAPIN_V6_CROSS=1` env vars are set automatically by the stub; check `human_readable/_runner.py` if running manually, and check that `HESSIAN_CLIP_LAMBDA=0` is set
- `RuntimeError: Ninja is required`: `pip install ninja`
- `RuntimeError: operator cutlass_evt::gemm_mul does not exist`: install the CUTLASS headers under `/opt/cutlass` (step 2)
- `Inference tensors cannot be saved for backward` (during TTT): scoring must run under `torch.no_grad()`, NOT `torch.inference_mode()` (this is correct in the shipped code)