Record: GDN-Hybrid + Sliding Window Attention + compressed-code warmdown1000 - val_bpb 1.01671 (3-seed mean) #1576

Open

joshkmartinez wants to merge 7 commits into openai:main from joshkmartinez:gdn-hybrid-warmdown

Conversation

@joshkmartinez

Summary

val_bpb = 1.01671233 (3-seed mean, std 0.00134386)
Artifact size: 15.71–15.90 MB across seeds

Improves the GDN-Hybrid fixed-predictor line with a warmdown1000 schedule and compressed-code packaging, without eval-time adaptation.

| Seed | Steps | EMA BPB  | val_bpb    | XSA BPB  | Artifact bytes |
|------|-------|----------|------------|----------|----------------|
| 42   | 2227  | 1.007164 | 1.016200   | 1.021202 | 15,733,879     |
| 1337 | 2242  | 1.007164 | 1.015700   | 1.020105 | 15,903,365     |
| 2024 | 2227  | 1.009032 | 1.018237   | 1.024111 | 15,713,422     |
| Mean |       | 1.007787 | 1.01671233 | 1.021806 | 15,783,555.33  |
| Std  |       |          | 0.00134386 |          |                |

Architecture / Technique Stack

  1. SP1024 tokenizer
  2. GDN-Hybrid backbone: [GDN×5] → SWA → [GDN×5] → SWA_shared
  3. Fixed-predictor evaluation path (no TTT / no SLOT / no eval-time adaptation)
  4. MuonEq-R + AdamW training mix
  5. EMA = 0.997
  6. warmdown = 1000
  7. GPTQ int6 + zstd-22 packaging
  8. Compressed-code packaging for train_gpt.py / architectures.py / configs.py to recover artifact headroom
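The warmdown item above can be sketched as a learning-rate multiplier. The linear decay-to-zero shape is an assumption for illustration; the PR only states `warmdown = 1000`, not the exact schedule:

```python
# Hypothetical warmdown-1000 schedule: constant LR, then a linear decay
# to zero over the final `warmdown` steps. The shape is assumed; only
# warmdown = 1000 is stated in the PR.
def lr_multiplier(step: int, total_steps: int, warmdown: int = 1000) -> float:
    if step < total_steps - warmdown:
        return 1.0  # constant phase
    remaining = total_steps - step
    return max(remaining / warmdown, 0.0)  # linear warmdown to zero
```

For example, with `total_steps = 2227` (seed 42's run), the multiplier stays at 1.0 until step 1227 and reaches 0.5 at step 1727.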

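The compressed-code packaging item can be illustrated with a minimal LZMA self-extractor (the PR's review thread mentions the training code is wrapped this way). The stub layout below is a sketch, not the PR's actual packaging scheme:

```python
# Hypothetical sketch of compressed-code packaging: wrap a source file
# in an LZMA self-extracting stub so the stored artifact is smaller.
# The stub layout is illustrative, not the PR's actual scheme.
import base64
import lzma

def make_self_extractor(source_text: str) -> str:
    # Compress the source and embed it as a base85 literal in a tiny
    # loader that decompresses and exec()s it at import/run time.
    payload = base64.b85encode(lzma.compress(source_text.encode())).decode()
    return (
        "import base64, lzma\n"
        f"_p = {payload!r}\n"
        "exec(lzma.decompress(base64.b85decode(_p)).decode())\n"
    )
```

Running the returned stub with `exec` reproduces the behavior of the original source; for highly repetitive training code the compressed payload plus stub can be much smaller than the original file, which is the "artifact headroom" being recovered.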
Compliance

  • Fixed-predictor / Track A style submission
  • No TTT
  • No SLOT
  • No RLS
  • No eval-time adaptation
  • All three artifacts under 16,000,000 bytes
  • Training run stays within the 10-minute 8xH100 submission budget

Notes

XSA telemetry is reported for completeness, but the submitted score is the fixed-model quantized_bpb result above.

Credits

@bigbag

bigbag commented Apr 13, 2026

BPB metric bug: space bytes double-counted (inherited from closed parent PR #1545)

The decompressed train_gpt.py in this PR contains the same build_sentencepiece_luts bug that @SPThole identified in PR #1545, and which @Abhishek8108 acknowledged when closing that PR ("The corrected BPB is ~1.18, not 1.028").

Bugged code in this PR (decompressed from the LZMA self-extractor):

# build_sentencepiece_luts, around line 217
if piece.startswith("▁"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1   # +1 adds the space byte

Then the eval loop adds the same space byte again:

tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)

Reference implementation (train_gpt.py at repo root, lines 186–189):

piece = sp.id_to_piece(token_id)
if piece.startswith("▁"):
    has_leading_space_np[token_id] = True
    piece = piece[1:]                       # strip ▁
base_bytes_np[token_id] = len(piece.encode("utf-8"))   # NO +1 here

The reference counts the space byte exactly once (in the eval loop, conditioned on ~is_boundary_token_lut[prev]). The bugged version counts it in both places for every ▁-prefixed token, inflating the byte denominator and deflating the reported BPB.
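For intuition, here is a hypothetical mini-repro of the double count. The function names and the simplified `i > 0` boundary check are illustrative stand-ins for the real `has_leading_space` / `is_boundary_token` LUTs:

```python
# Mini-repro (illustrative): the buggy LUT bakes the space byte into
# base_bytes, then the eval loop adds it again for every ▁-prefixed
# token, inflating the byte denominator of BPB.
def byte_counts(pieces, buggy):
    base_bytes, has_space = [], []
    for piece in pieces:
        leading = piece.startswith("\u2581")  # "▁" marks a leading space
        stripped = piece[1:] if leading else piece
        n = len(stripped.encode("utf-8"))
        if buggy and leading:
            n += 1  # BUG: space byte baked into the base count
        base_bytes.append(n)
        has_space.append(leading)
    return base_bytes, has_space

def total_bytes(pieces, buggy):
    base_bytes, has_space = byte_counts(pieces, buggy)
    total = 0
    for i in range(len(pieces)):
        total += base_bytes[i]
        if has_space[i] and i > 0:  # eval loop adds the space byte once
            total += 1
    return total

pieces = ["\u2581the", "\u2581cat", "sat"]
print(total_bytes(pieces, buggy=False))  # 10 bytes (correct)
print(total_bytes(pieces, buggy=True))   # 12 bytes (space counted twice)
```

Since BPB divides total loss bits by total bytes, an inflated denominator mechanically deflates the reported score.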

Running the parent PR's corrected LUT on the same checkpoint lands in the ~1.16–1.18 range (per @Abhishek8108's own correction on #1545), not 1.01671.

The bug was missed here because the training code is wrapped in an LZMA self-extractor, which hides it from standard review. I suggest the maintainers decompress and re-score before this result shifts the leaderboard.
