Skip to content

Non-record: GDN Hybrid (E2E TTT / State-Space Model) — val_bpb 1.14502#1479

Open
andrewbaggio1 wants to merge 1 commit intoopenai:mainfrom
andrewbaggio1:nonrecord/gdn-hybrid-e2e-ttt
Open

Non-record: GDN Hybrid (E2E TTT / State-Space Model) — val_bpb 1.14502#1479
andrewbaggio1 wants to merge 1 commit intoopenai:mainfrom
andrewbaggio1:nonrecord/gdn-hybrid-e2e-ttt

Conversation

@andrewbaggio1
Copy link
Copy Markdown

Non-record: GDN Hybrid — Gated DeltaNet as E2E TTT / State-Space Model

val_bpb: 1.14502 (seed 1234, 8xH100, 600s)

Summary

Replaces 8 of 10 attention layers with Gated DeltaNet (Yang et al., ICLR 2025). GDN is mathematically equivalent to E2E TTT-Linear with MSE loss — the delta rule update S_t = α·S·(I - β·k·kᵀ) + β·v·kᵀ is exactly one step of SGD on L = 0.5·‖S·k - v‖², trained end-to-end.

Targets bounty items: E2E TTT + State-space models.

Architecture

  • 8 GDN layers + 2 softmax attention (positions 4, 8)
  • dim=512, 8 heads, MLP 3x, SP8192, GPTQ int6/int8, SDClip, EMA
  • FLA v0.4.2 GatedDeltaNet with chunk-parallel Triton kernels
  • 37.4M params, 13.83 MB artifact

Results

Not competitive with softmax attention at 10-min budget: 4.91M tok/s (GDN) vs 6.93M tok/s (attention), yielding 3673 vs 4624 steps. The 20% training deficit is not compensated by GDN's per-step learning advantage at this scale. However, training is stable, GPTQ works cleanly, and PR #1370 showed 1.003 BPB is achievable with unlimited compute.

Credits

Builds on @clarkkev's #1394, FLA library by @sustcsonglin, and PureGDN work by @Christopher-Lee-McClendon (#1370).

8 Gated DeltaNet layers + 2 softmax attention layers. GDN is mathematically
equivalent to E2E TTT-Linear with MSE loss. First competitive GDN hybrid
in the 10-min budget. Targets bounty items: E2E TTT + State-space models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant