
Record: TMA Megakernel + Improved Parallel Residuals + Tap-In min_match=1 — val_bpb 1.07636 (3-seed mean)#1555

Open

andrewbaggio1 wants to merge 2 commits into openai:main from andrewbaggio1:record/megakernel-improved-paresid-tapin-mm1

Conversation

@andrewbaggio1

Summary

val_bpb = 1.07636 (3-seed mean, std 0.0006) | ~15.97 MB | 8xH100 SXM

| Seed | Sliding BPB | TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|---------|-----------------|------------------|
| 42   | 1.07856     | 1.07703 | 2.78208         | 15,961,726       |
| 1337 | 1.07727     | 1.07586 | 2.77907         | 15,964,616       |
| 2024 | 1.07833     | 1.07619 | 2.77990         | 15,970,213       |
| Mean | 1.07805     | 1.07636 | 2.78035         |                  |

Merged SOTA (PR #1493): 2.78932 nats. Delta: -0.00897 nats, clearing the 0.005-nat threshold by ~80%.

Novel Contributions

  1. TMA Megakernel — Triton Hopper TMA fused MLP forward kernel. +10.5% throughput, ~200 extra steps in 600s. Claims megakernel bounty.
  2. Tap-In min_match=1 (Unigram Matching) — First submission to lower Tap-In match threshold to 1 token. Fires at 21% of positions (vs 1.7% at min_match=3). Derived from local CPU loss analysis showing the model is uncertain on repeating tokens.
  3. Improved Parallel Residuals — Ported from @msisovic's PR #1529 (Record: Improved Parallel Residuals): 1.0758 BPB / 2.7789 nats, a -0.0020 BPB / -0.0052 nats improvement over PR #1523.
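The Tap-In matching rule in point 2 can be sketched as a minimal, hypothetical model of the mechanism. The PR describes a linked-list implementation; the suffix-dictionary and function names here are illustrative only, but they show why min_match=1 fires so much more often than min_match=3:

```python
# Hypothetical sketch of Tap-In suffix matching with a configurable
# min_match; data structures and names are illustrative, not the PR's code.
def tapin_predictions(tokens, min_match):
    """For each position i, return the token that followed the most recent
    earlier occurrence of the length-min_match suffix ending at i (None if
    no match). With min_match=1 this fires whenever the current token has
    appeared before, hence the jump in firing rate from ~1.7% to ~21%."""
    last_seen = {}  # suffix tuple -> index of the token that followed it
    preds = []
    for i in range(len(tokens)):
        suffix = tuple(tokens[max(0, i - min_match + 1): i + 1])
        hit = last_seen.get(suffix) if len(suffix) == min_match else None
        preds.append(tokens[hit] if hit is not None else None)
        last_seen[suffix] = i + 1  # update only after scoring (strict prefix)
    return preds

print(tapin_predictions([5, 7, 5, 9, 5, 7], min_match=1))
# fires at every repeated token; min_match=3 never fires on this sequence
```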

Compliance (Track B — Issue #1017)

  • Causal dependence — sliding window + strict prefix Tap-In + prefix-only hash embedding
  • Full normalized distribution — Tap-In mixing sums to 1 by construction
  • Score before update — TTT chunks scored under no_grad before SGD; Tap-In linked list updated after scoring
  • Single left-to-right pass — each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram caches
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
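The "sums to 1 by construction" property in the compliance list follows from the Tap-In mixing being a convex combination of two normalized distributions. A minimal sketch (the mixing weight here is illustrative, not the PR's learned value):

```python
# Convex mixing of two normalized distributions: if both inputs sum to 1
# and 0 <= lam <= 1, the output sums to (1 - lam) * 1 + lam * 1 = 1.
def mix(p_model, p_tapin, lam):
    assert 0.0 <= lam <= 1.0
    return [(1.0 - lam) * m + lam * t for m, t in zip(p_model, p_tapin)]

# Model distribution mixed with a one-hot Tap-In prediction.
p = mix([0.7, 0.2, 0.1], [0.0, 1.0, 0.0], lam=0.25)
assert abs(sum(p) - 1.0) < 1e-9  # normalized by construction
```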

Test plan

  • 3-seed validation (seeds 42, 1337, 2024)
  • All artifacts under 16MB
  • Train under 600s on all seeds (~587s)
  • Eval under 600s on all seeds (~385-457s)

Credits

@msisovic (improved parallel residuals #1529), @abaybektursun (Tap-In V4/V6 #1518/#1420, TTT #549), @clarkkev (SP8192 + SDClip #1394), @EthanYangTW (parameter banking #1523), @dexhunter (legal TTT #1413), @resouer (eval hash embedding #1460), @bigbag (QK-Gain tuning #1493)

🤖 Generated with Claude Code

andrewbaggio1 and others added 2 commits April 8, 2026 13:29
8 Gated DeltaNet layers + 2 softmax attention layers. GDN is mathematically
equivalent to E2E TTT-Linear with MSE loss. First competitive GDN hybrid
in the 10-min budget. Targets bounty items: E2E TTT + State-space models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ch=1 — val_bpb 1.07636 (3-seed mean)

3-seed mean 1.07636 BPB (std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493.
Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires 21% of positions),
improved parallel residuals from openai#1529, parameter banking from openai#1523.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 12, 2026
The current W2 frontier point is already close to the public best clean-ish line, so the highest-upside architectural import is the improved parallel residual writeback from the openai#1529/openai#1555 family. This patch ports the learned cross-lane lambda mixing into the existing split-lane decoder while keeping the pass-conditioned attention modulation and score-first doc-independent TTT stack intact.
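A hedged sketch of the parallel-residual shape being ported: both branches read the same input and write back through learned per-lane lambdas. The branch functions and lambda values below are placeholders; the real block operates on tensors with learned mixing weights:

```python
# Parallel residuals: attention and MLP both read the same pre-update
# input x, and their outputs are mixed back with per-lane weights, rather
# than the sequential x -> x + attn(x) -> + mlp(...) composition.
def parallel_residual(x, attn, mlp, lam_attn=1.0, lam_mlp=1.0):
    a, m = attn(x), mlp(x)  # branches computed from the same input
    return [xi + lam_attn * ai + lam_mlp * mi for xi, ai, mi in zip(x, a, m)]

y = parallel_residual([1.0, 2.0],
                      attn=lambda v: [0.1 * vi for vi in v],
                      mlp=lambda v: [0.01 * vi for vi in v])
```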

Constraint: Single-node budget means the next experiment needs real upside, not another tiny hyperparameter nudge
Rejected: Tap-In min_match=1 import first | Higher upside on paper, but much riskier on bytes, runtime, and review surface than improved parallel residuals
Confidence: medium
Scope-risk: moderate
Directive: If this lane regresses, treat improved parallel residuals as non-additive with the current W2 modulation stack rather than trying to rescue it with more tuning
Tested: python3 -m py_compile train_gpt.py; lsp diagnostics reported no file-level errors
Not-tested: GPU score, bytes, and runtime on the integrated lane
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 12, 2026
…1.01710

Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09).
Seven PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493).
New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9).
Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 12, 2026
The current W2 frontier point already has a strong score/runtime tradeoff, so the next high-upside import should add eval-time capacity without bringing in extra source files or a broad new review surface. This patch ports the eval-hash embedding path from the openai#1555 family: a zero-init hash embedding attached only during evaluation/TTT, hashed on the previous/current token pair, and trained with a higher LR multiplier during the score-first LoRA TTT loop.
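A minimal sketch of the eval-hash embedding idea described above, assuming a bucketed table keyed on the (previous, current) token pair. The bucket count, hash constants, and vector width are illustrative, not the PR's values:

```python
# Zero-init hash embedding: contributes nothing at the start of eval,
# then gets trained (with a higher LR multiplier) inside the TTT loop.
NUM_BUCKETS, DIM = 65536, 8

def pair_bucket(prev_tok, cur_tok):
    # Deterministic multiplicative hash of the token pair (illustrative).
    h = ((prev_tok * 1000003) ^ cur_tok) * 2654435761 % (1 << 32)
    return h % NUM_BUCKETS

# Zero-initialized table: the lookup adds nothing until TTT updates it.
table = [[0.0] * DIM for _ in range(NUM_BUCKETS)]
b = pair_bucket(17, 42)
assert 0 <= b < NUM_BUCKETS and table[b] == [0.0] * DIM
```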

Constraint: Single-node iteration favors compact eval-time additions over large architecture or C++ retrieval ports
Rejected: Tap-In import first | Higher upside on paper, but much riskier on code size, review surface, and implementation complexity
Confidence: medium
Scope-risk: moderate
Directive: If this lane improves score but blows runtime, tune the hash buckets or LR multiplier before combining it with any other eval-time mechanism
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU score, bytes, and eval runtime with eval-hash enabled