Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.028308 (3-seed cold-cache mean) #1545
Conversation
First non-transformer architecture in the 10-min record track. Replaces the transformer backbone with Gated DeltaNet (delta-rule linear recurrence) + Sliding Window Attention: [GDN×5] → [SWA] → [GDN×5] → [SWA_shared].

- 3-seed cold-cache mean: 1.028308 BPB (seeds 42/1337/2024, fresh pods)
- Beats merged SOTA (PR openai#1493, 1.0810) by 5.27 centiBPB
- No TTT; 33.86M params; SP1024; int6 GPTQ + zstd-22
- All artifacts 14.48–14.70 MB
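The layer pattern above can be sketched as a simple schedule. This is an illustrative reconstruction only; the function and the layer-type strings (`build_layer_schedule`, `"gdn"`, `"swa"`, `"swa_shared"`) are hypothetical names, not the submission's code:

```python
def build_layer_schedule() -> list:
    """Sketch of the hybrid stack: [GDN×5] -> [SWA] -> [GDN×5] -> [SWA_shared]."""
    schedule = []
    schedule += ["gdn"] * 5      # Gated DeltaNet (delta-rule linear recurrence)
    schedule += ["swa"]          # Sliding Window Attention
    schedule += ["gdn"] * 5
    schedule += ["swa_shared"]   # SWA block sharing weights with the first SWA
    return schedule

print(build_layer_schedule())
# ['gdn', 'gdn', 'gdn', 'gdn', 'gdn', 'swa',
#  'gdn', 'gdn', 'gdn', 'gdn', 'gdn', 'swa_shared']
```

Twelve blocks in total, with attention only at two positions and the second attention block reusing the first one's weights.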
BPB metric bug: double-counted leading-space bytes

The `build_sentencepiece_luts` function in `train_gpt.py` has a `+ 1` in `base_bytes` that double-counts the space byte represented by `▁`.

Submission (line ~275):

```python
if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1  # ← space byte baked in
```

Compare with the reference implementation in the repo's `train_gpt.py` (lines ~196-199):

```python
if piece.startswith("▁"):
    has_leading_space_np[token_id] = True
    piece = piece[1:]
    base_bytes_np[token_id] = len(piece.encode("utf-8"))  # no +1
```

The eval function then adds the space byte again via the `has_leading_space_lut` conditional (same in both):

```python
tb = base_bytes_lut[tgt]
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev])  # +1 again
```

So for a token like `▁hello` (representing `" hello"`, 6 bytes): the reference correctly counts 5 + 1 = 6, but this submission counts 6 + 1 = 7. Since ~65% of tokens carry a leading space, this inflates the total byte count by ~14%, which deflates the reported BPB by the same factor (BPB = bits / bytes). The claimed 1.028 BPB would be roughly 1.18 when computed with the reference byte counting.
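The double count can be reproduced in isolation. A minimal sketch with simplified stand-ins for the LUT and eval logic (`base_bytes_buggy`, `base_bytes_reference`, and `eval_bytes` are illustrative names, not the actual code):

```python
def base_bytes_buggy(piece: str) -> int:
    # Submission: strips the ▁ marker but bakes the space byte back in here.
    if piece.startswith("\u2581"):
        return len(piece[1:].encode("utf-8")) + 1
    return len(piece.encode("utf-8"))

def base_bytes_reference(piece: str) -> int:
    # Reference: the space byte is NOT baked in; the eval loop adds it.
    if piece.startswith("\u2581"):
        return len(piece[1:].encode("utf-8"))
    return len(piece.encode("utf-8"))

def eval_bytes(piece: str, base_bytes_fn, prev_is_boundary: bool = False) -> int:
    # Eval-time logic (identical in both versions): +1 for the leading
    # space unless the previous token is a boundary token.
    tb = base_bytes_fn(piece)
    if piece.startswith("\u2581") and not prev_is_boundary:
        tb += 1
    return tb

piece = "\u2581hello"                           # represents " hello", 6 bytes
print(eval_bytes(piece, base_bytes_reference))  # 6  (5 + 1, correct)
print(eval_bytes(piece, base_bytes_buggy))      # 7  (6 + 1, double-counted)
```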
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: claimed ~1.028, actual score ~1.18
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
You're right — thank you for the careful read. The bug is confirmed. Closing this PR.
Moves GDN-Hybrid to track_non_record_16mb with the corrected BPB calculation. Fixes the double-count bug in `build_sentencepiece_luts` (the leading-space +1 was counted in `base_bytes` and again in the eval loop). Corrected 3-artifact mean: 1.209735 BPB (stride=512 rescore of the saved artifacts). Refs PR openai#1545.
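As a sanity check on the magnitude: since BPB = bits / bytes, a byte count inflated by a factor f deflates the reported BPB by that same f, so multiplying the reported number back by f estimates the true score. The ~14% figure is the inflation estimated in the bug report; the exact corrected value came from the full rescore of the saved artifacts.

```python
reported_bpb = 1.028308   # claimed score (computed over an inflated byte count)
inflation = 1.14          # ~14% byte-count inflation per the bug report

estimated_true_bpb = reported_bpb * inflation
print(round(estimated_true_bpb, 3))  # ≈ 1.172, same ballpark as the
                                     # rescored 1.209735
```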
Summary
val_bpb = 1.028308 (3-seed cold-cache mean) | 14.48–14.70 MB | 8×H100 SXM
First non-transformer architecture in the 10-min record track. Beats merged SOTA (PR #1493, 1.0810) by 5.27 centiBPB. No TTT. Fixed predictor.
Architecture
GDN-Hybrid replaces the transformer backbone with Gated DeltaNet (delta-rule linear recurrence) + Sliding Window Attention:
`fla.layers.GatedDeltaNet`, head_dim=64, use_short_conv=True

3-Seed Results (cold-cache, fresh pods)
All three seeds run on separate fresh pods. Cold-start signature confirmed (step 1 at ~105s, Triton JIT overhead). t-stat = 56.7 vs SOTA threshold, p ≪ 0.01.
Reproduction
Expected on a cold pod: ~1857–1858 steps, quantized BPB ~1.027–1.030. See README for full details.
Compliance
Fixed predictor (Track A). No TTT, no RLS, no SLOT, no n-gram mixer at eval time. GPTQ calibration uses model-generated synthetic sequences only. Sliding-window eval is strictly causal, single-pass, normalized softmax distribution.
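A minimal sketch of what "strictly causal" means for the sliding-window mask: position i may attend to position j only when j <= i (no future positions) and i - j < window (within the sliding window). The function name and the window size are illustrative, not the submission's values.

```python
def swa_mask(seq_len: int, window: int) -> list:
    """Boolean mask: mask[i][j] is True iff position i may attend to j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

# Row i lists which positions i may attend to (window of 3 over 5 positions):
for row in swa_mask(5, 3):
    print([int(x) for x in row])
# [1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [0, 1, 1, 1, 0]
# [0, 0, 1, 1, 1]
```

The lower-triangular band makes the causality constraint easy to audit: no row ever has a True entry to the right of the diagonal.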
Supplemental Evidence
One warm-cache seed=1337 run (same config, pre-compiled Triton kernels) reached 1.015890 BPB (2247 steps). This is not part of the submitted 3-seed claim. The official claim is based solely on the cold-cache runs above.