
Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.028308 (3-seed cold-cache mean) #1545

Closed
Abhishek8108 wants to merge 1 commit into openai:main from Abhishek8108:submission/gdn-hybrid-delta-rule-1055

Conversation

@Abhishek8108

Summary

val_bpb = 1.028308 (3-seed cold-cache mean) | 14.48–14.70 MB | 8×H100 SXM

First non-transformer architecture in the 10-min record track. Beats merged SOTA (PR #1493, 1.0810) by 5.27 centiBPB. No TTT. Fixed predictor.

Architecture

GDN-Hybrid replaces the transformer backbone with Gated DeltaNet (delta-rule linear recurrence) + Sliding Window Attention:

```
[GDN×5] → [SWA] → [GDN×5] → [SWA_shared]
```

- 33,862,953 params, 512-dim, SP1024 tokenizer
- GDN: `fla.layers.GatedDeltaNet`, head_dim=64, use_short_conv=True
- SWA: window=512, 8 heads / 4 KV heads, weight-shared across both layers
- QK-Gain 5.0, BigramHash(3072×112) + trigram embeddings, logit softcap 30.0
- Quantization: full-Hessian int6 GPTQ + zstd-22

3-Seed Results (cold-cache, fresh pods)

| Seed | Steps | EMA BPB | Quantized BPB | Artifact (bytes) |
|------|-------|---------|---------------|------------------|
| 42   | 1857  | 1.017970 | 1.027163 | 15,188,240 |
| 1337 | 1858  | 1.018624 | 1.027614 | 15,417,768 |
| 2024 | 1858  | 1.020559 | 1.030148 | 15,314,099 |
| **Mean** | | 1.019051 | 1.028308 | |
| **Std**  | | 0.001356 | 0.001610 | |

All three seeds ran on separate fresh pods. The cold-start signature is confirmed (step 1 at ~105 s, Triton JIT overhead). t-stat = 56.7 against the merged-SOTA value, p ≪ 0.01.
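The summary statistics, the quoted t-statistic, and the artifact-size range can all be re-derived from the table with a few lines of stdlib Python (the merged-SOTA value 1.0810 is taken from the text above):

```python
import math

quantized = [1.027163, 1.027614, 1.030148]        # per-seed quantized BPB (table above)
artifacts = [15_188_240, 15_417_768, 15_314_099]  # artifact sizes in bytes
sota = 1.0810                                     # merged SOTA (PR #1493)

n = len(quantized)
mean = sum(quantized) / n
std = math.sqrt(sum((x - mean) ** 2 for x in quantized) / (n - 1))  # sample std
t_stat = (sota - mean) / (std / math.sqrt(n))     # one-sample t against SOTA
margin_centibpb = (sota - mean) * 100
mb = [b / 2**20 for b in artifacts]               # bytes → MiB

print(round(mean, 6), round(t_stat, 1), round(margin_centibpb, 2))
# 1.028308 56.7 5.27
print(round(min(mb), 2), round(max(mb), 2))
# 14.48 14.7  (matches the quoted 14.48–14.70 MB range)
```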

Reproduction

```shell
pip install flash-linear-attention zstandard sentencepiece
python3 data/cached_challenge_fineweb.py --variant sp1024

SEED=42 ARCH_MODE=D MAX_WALLCLOCK_SECONDS=590 ITERATIONS=9999 \
  TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=786432 \
  QK_GAIN_INIT=5.0 GPTQ_ENABLED=1 VAL_LOSS_EVERY=9999 \
  torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-11_GDN_Hybrid_DeltaRule/train_gpt.py
```

Expected on a cold pod: ~1857–1858 steps, quantized BPB ~1.027–1.030. See README for full details.

Compliance

Fixed predictor (Track A). No TTT, no RLS, no SLOT, no n-gram mixer at eval time. GPTQ calibration uses model-generated synthetic sequences only. Sliding-window eval is strictly causal, single-pass, normalized softmax distribution.

Supplemental Evidence

One warm-cache seed=1337 run (same config, pre-compiled Triton kernels) reached 1.015890 BPB (2247 steps). This is not part of the submitted 3-seed claim. The official claim is based solely on the cold-cache runs above.

@Abhishek8108 force-pushed the submission/gdn-hybrid-delta-rule-1055 branch 2 times, most recently from d7b69ad to 296f03a on April 11, 2026 at 16:55
…cold-cache mean)

First non-transformer architecture in the 10-min record track. Replaces the
transformer backbone with Gated DeltaNet (delta-rule linear recurrence) +
Sliding Window Attention: [GDN×5] → [SWA] → [GDN×5] → [SWA_shared].

3-seed cold-cache mean: 1.028308 BPB (seeds 42/1337/2024, fresh pods).
Beats merged SOTA (PR openai#1493, 1.0810) by 5.27 centiBPB. No TTT.
33.86M params, SP1024, int6 GPTQ + zstd-22. All artifacts 14.48–14.70 MB.
@Abhishek8108 force-pushed the submission/gdn-hybrid-delta-rule-1055 branch from 296f03a to 410bf8a on April 11, 2026 at 17:01
@SPThole

SPThole commented Apr 11, 2026

BPB metric bug: double-counted leading-space bytes

The build_sentencepiece_luts function in train_gpt.py has a + 1 in base_bytes that double-counts the space byte represented by ▁:

Submission (line ~275):

```python
if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1   # ← space byte baked in
```

Compare with the reference implementation in the repo's train_gpt.py:

Reference (lines ~196–199):

```python
if piece.startswith("▁"):
    has_leading_space_np[token_id] = True
    piece = piece[1:]
base_bytes_np[token_id] = len(piece.encode("utf-8"))      # no +1
```

The eval function then adds the space byte again via the has_leading_space_lut conditional (same in both):

```python
tb = base_bytes_lut[tgt]
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev])   # +1 again
```

So for a token like "▁hello" (representing " hello", 6 bytes): the reference correctly counts 5 + 1 = 6, but this submission counts 6 + 1 = 7. Since ~65% of tokens carry a leading space, this inflates the total byte count by ~14%, which deflates the reported BPB by the same factor (BPB = bits / bytes). The claimed 1.028 BPB would be approximately ~1.18 when computed with the reference byte counting.
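The arithmetic above can be checked standalone; `token_bytes` below is a hypothetical helper that mirrors the two code paths (the LUT build plus the eval-time leading-space conditional), not code taken from either implementation:

```python
def token_bytes(piece: str, buggy: bool) -> int:
    """Total bytes charged for one mid-sentence token (prev token not a boundary)."""
    if piece.startswith("\u2581"):            # SentencePiece leading-space marker ▁
        base = len(piece[1:].encode("utf-8"))
        if buggy:
            base += 1                         # submission's extra +1 in base_bytes
        return base + 1                       # eval loop adds the space byte again
    return len(piece.encode("utf-8"))

print(token_bytes("\u2581hello", buggy=False))  # 6 — " hello" really is 6 bytes
print(token_bytes("\u2581hello", buggy=True))   # 7 — one byte too many
```

With roughly 65% of tokens over-counted by one byte each, the byte denominator inflates by ~14%, so the true BPB is approximately 1.028 × 1.14 ≈ 1.17–1.18, consistent with the ~1.18 figure above.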

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 11, 2026
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: real score ~1.028 claim is ~1.18 actual
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
@Abhishek8108
Author

Abhishek8108 commented Apr 11, 2026

You're right — thank you for the careful read.

The bug is confirmed: base_bytes[i] = len(piece[1:].encode("utf-8")) + 1 double-counts the space byte that the eval already adds via has_leading_space_lut. With ~65% of tokens carrying a leading space, the byte denominator is inflated ~14%, deflating the reported BPB by the same factor. The corrected BPB is ~1.18, not 1.028.

Closing this PR.

Abhishek8108 added a commit to Abhishek8108/parameter-golf that referenced this pull request Apr 11, 2026
…09735

Moves GDN-Hybrid to track_non_record_16mb with corrected BPB calculation.
Fixes double-count bug in build_sentencepiece_luts (leading-space +1 was
counted in base_bytes and again in the eval loop). Corrected 3-artifact
mean: 1.209735 BPB (stride=512 rescore of saved artifacts). Refs PR openai#1545.
