Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.028308 (3-seed cold-cache mean) #1545
Conversation
First non-transformer architecture in the 10-min record track. Replaces the transformer backbone with Gated DeltaNet (delta-rule linear recurrence) + Sliding Window Attention: [GDN×5] → [SWA] → [GDN×5] → [SWA_shared].

- 3-seed cold-cache mean: 1.028308 BPB (seeds 42/1337/2024, fresh pods)
- Beats merged SOTA (PR openai#1493, 1.0810) by 5.27 centiBPB
- No TTT; 33.86M params; SP1024; int6 GPTQ + zstd-22
- All artifacts 14.48–14.70 MB
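The layer pattern above can be sketched as a simple schedule. This is an illustrative reconstruction only; the function and the layer-type strings (`build_layer_schedule`, `"gdn"`, `"swa"`, `"swa_shared"`) are hypothetical names, not the submission's code:

```python
def build_layer_schedule() -> list:
    """Sketch of the hybrid stack: [GDN×5] -> [SWA] -> [GDN×5] -> [SWA_shared]."""
    schedule = []
    schedule += ["gdn"] * 5      # Gated DeltaNet (delta-rule linear recurrence)
    schedule += ["swa"]          # Sliding Window Attention
    schedule += ["gdn"] * 5
    schedule += ["swa_shared"]   # SWA block sharing weights with the first SWA
    return schedule

print(build_layer_schedule())
# ['gdn', 'gdn', 'gdn', 'gdn', 'gdn', 'swa',
#  'gdn', 'gdn', 'gdn', 'gdn', 'gdn', 'swa_shared']
```

Twelve blocks in total, with attention only at two positions and the second attention block reusing the first one's weights.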
BPB metric bug: double-counted leading-space bytes

The `build_sentencepiece_luts` function in `train_gpt.py` has a `+ 1` in `base_bytes` that double-counts the space byte represented by `▁`.

Submission (line ~275):

```python
if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1  # ← space byte baked in
```

Compare with the reference implementation in the repo's `train_gpt.py` (lines ~196-199):

```python
if piece.startswith("▁"):
    has_leading_space_np[token_id] = True
    piece = piece[1:]
    base_bytes_np[token_id] = len(piece.encode("utf-8"))  # no +1
```

The eval function then adds the space byte again via the `has_leading_space_lut` conditional (same in both):

```python
tb = base_bytes_lut[tgt]
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev])  # +1 again
```

So for a token like `▁hello` (representing `" hello"`, 6 bytes): the reference correctly counts 5 + 1 = 6, but this submission counts 6 + 1 = 7. Since ~65% of tokens carry a leading space, this inflates the total byte count by ~14%, which deflates the reported BPB by the same factor (BPB = bits / bytes). The claimed 1.028 BPB would be roughly 1.18 when computed with the reference byte counting.
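The double count can be reproduced in isolation. A minimal sketch with simplified stand-ins for the LUT and eval logic (`base_bytes_buggy`, `base_bytes_reference`, and `eval_bytes` are illustrative names, not the actual code):

```python
def base_bytes_buggy(piece: str) -> int:
    # Submission: strips the ▁ marker but bakes the space byte back in here.
    if piece.startswith("\u2581"):
        return len(piece[1:].encode("utf-8")) + 1
    return len(piece.encode("utf-8"))

def base_bytes_reference(piece: str) -> int:
    # Reference: the space byte is NOT baked in; the eval loop adds it.
    if piece.startswith("\u2581"):
        return len(piece[1:].encode("utf-8"))
    return len(piece.encode("utf-8"))

def eval_bytes(piece: str, base_bytes_fn, prev_is_boundary: bool = False) -> int:
    # Eval-time logic (identical in both versions): +1 for the leading
    # space unless the previous token is a boundary token.
    tb = base_bytes_fn(piece)
    if piece.startswith("\u2581") and not prev_is_boundary:
        tb += 1
    return tb

piece = "\u2581hello"                           # represents " hello", 6 bytes
print(eval_bytes(piece, base_bytes_reference))  # 6  (5 + 1, correct)
print(eval_bytes(piece, base_bytes_buggy))      # 7  (6 + 1, double-counted)
```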
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: claimed ~1.028, actual score ~1.18
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
You're right — thank you for the careful read. The bug is confirmed. Closing this PR.
Moves GDN-Hybrid to track_non_record_16mb with the corrected BPB calculation. Fixes the double-count bug in `build_sentencepiece_luts` (the leading-space +1 was counted in `base_bytes` and again in the eval loop). Corrected 3-artifact mean: 1.209735 BPB (stride=512 rescore of the saved artifacts). Refs PR openai#1545.
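As a sanity check on the magnitude: since BPB = bits / bytes, a byte count inflated by a factor f deflates the reported BPB by that same f, so multiplying the reported number back by f estimates the true score. The ~14% figure is the inflation estimated in the bug report; the exact corrected value came from the full rescore of the saved artifacts.

```python
reported_bpb = 1.028308   # claimed score (computed over an inflated byte count)
inflation = 1.14          # ~14% byte-count inflation per the bug report

estimated_true_bpb = reported_bpb * inflation
print(round(estimated_true_bpb, 3))  # ≈ 1.172, same ballpark as the
                                     # rescored 1.209735
```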
Summary
val_bpb = 1.028308 (3-seed cold-cache mean) | 14.48–14.70 MB | 8×H100 SXM
First non-transformer architecture in the 10-min record track. Beats merged SOTA (PR #1493, 1.0810) by 5.27 centiBPB. No TTT. Fixed predictor.
Architecture
GDN-Hybrid replaces the transformer backbone with Gated DeltaNet (delta-rule linear recurrence) + Sliding Window Attention:
`fla.layers.GatedDeltaNet`, head_dim=64, use_short_conv=True

3-Seed Results (cold-cache, fresh pods)
All three seeds run on separate fresh pods. Cold-start signature confirmed (step 1 at ~105s, Triton JIT overhead). t-stat = 56.7 vs SOTA threshold, p ≪ 0.01.
Reproduction
Expected on a cold pod: ~1857–1858 steps, quantized BPB ~1.027–1.030. See README for full details.
Compliance
Fixed predictor (Track A). No TTT, no RLS, no SLOT, no n-gram mixer at eval time. GPTQ calibration uses model-generated synthetic sequences only. Sliding-window eval is strictly causal, single-pass, normalized softmax distribution.
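A minimal sketch of what "strictly causal" means for the sliding-window mask: position i may attend to position j only when j <= i (no future positions) and i - j < window (within the sliding window). The function name and the window size are illustrative, not the submission's values.

```python
def swa_mask(seq_len: int, window: int) -> list:
    """Boolean mask: mask[i][j] is True iff position i may attend to j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

# Row i lists which positions i may attend to (window of 3 over 5 positions):
for row in swa_mask(5, 3):
    print([int(x) for x in row])
# [1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [0, 1, 1, 1, 0]
# [0, 0, 1, 1, 1]
```

The lower-triangular band makes the causality constraint easy to audit: no row ever has a True entry to the right of the diagonal.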
Supplemental Evidence
One warm-cache seed=1337 run (same config, pre-compiled Triton kernels) reached 1.015890 BPB (2247 steps). This is not part of the submitted 3-seed claim. The official claim is based solely on the cold-cache runs above.