
Record: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999 #827

Open
Programmerryoki wants to merge 14 commits into openai:main from Programmerryoki:main

Conversation

@Programmerryoki

LeakyReLU² + XSA4 + LN Scale + Partial RoPE

val_bpb: 1.3999 | ~13.5 MB | 8×H100 SXM

Results (8×H100 80GB SXM, PyTorch 2.8.0+cu128)

Seed   step_avg   steps   val_bpb   Artifact (bytes)
1337   65.0 ms    9,231   1.3999    13,502,602

Architecture

Component Setting
Layers 10 (512d, 8H, 4KV GQA)
MLP 2× with LeakyReLU(0.5)²
BigramHash 1536
SmearGate Enabled
U-Net Skips Enabled
XSA Last 4 layers
Partial RoPE 16/64 dims
LN Scale 1/√(layer+1)
Weight avg EMA(0.997)
Quantization Int6 QAT + GPTQ-lite clip search + zstd-22

Key Techniques

LeakyReLU(0.5)²

One-line activation change: F.leaky_relu(x, negative_slope=0.5).square() replaces relu(x).square(). The negative slope preserves gradient flow for negative inputs, eliminating dead neurons while keeping the relu² inductive bias.
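A minimal sketch of the activation (only the function name `leaky_relu_squared` is invented here; the underlying call matches the one-liner above):

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    """LeakyReLU followed by squaring, replacing relu(x)**2.

    Negative inputs keep a scaled pre-square value (slope 0.5), so neurons
    cannot permanently die, while positive inputs still follow the x**2
    curve of the original relu-squared activation.
    """
    return F.leaky_relu(x, negative_slope=negative_slope).square()

x = torch.tensor([-2.0, 0.0, 3.0])
print(leaky_relu_squared(x))  # tensor([1., 0., 9.])
```

Note the squaring makes the activation non-monotonic on the negative side: large negative inputs map to large positive outputs, scaled down by the slope squared (0.25).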

GPTQ-lite Clip Search

Post-training quantization improvement: tests 5 clip percentiles per weight row (0.9999, 0.99995, 0.99999, 0.999995, 1.0) and selects the one minimizing per-row reconstruction MSE. Zero training cost.
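A hedged sketch of the clip search, assuming symmetric per-row integer quantization (the PR's exact scale and rounding details may differ):

```python
import torch

def clip_search_quantize_row(w: torch.Tensor, bits: int = 6,
                             percentiles=(0.9999, 0.99995, 0.99999, 0.999995, 1.0)):
    """Try several clip percentiles for one weight row and keep the one whose
    quantized reconstruction minimizes MSE against the original row.

    This is a sketch of the 'GPTQ-lite' idea: purely post-training, zero
    training cost, one pass over a handful of candidate clip values.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    absw = w.abs()
    best_mse, best_rec = float("inf"), w
    for p in percentiles:
        clip = absw.max() if p >= 1.0 else torch.quantile(absw, p)
        scale = (clip / qmax).clamp_min(1e-12)
        q = (w / scale).round().clamp(-qmax, qmax)  # symmetric int quantization
        rec = q * scale
        mse = (rec - w).pow(2).mean().item()
        if mse < best_mse:
            best_mse, best_rec = mse, rec
    return best_rec, best_mse
```

Clipping below the max sacrifices a few outlier weights to spend the quantization grid on the bulk of the distribution, which is why a sub-1.0 percentile often wins on MSE.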

Exclusive Self-Attention (XSA)

Applied to the last 4 layers. Subtracts the self-value from the attention output, forcing each token to attend more to context tokens than to itself.
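One plausible form of the subtraction, sketched below, removes each token's self-attention contribution p_ii·v_i from the output; the PR's exact rule is not shown here, so treat this as an assumption:

```python
import torch

def exclusive_self_attention(q, k, v):
    """Causal attention whose output excludes each token's own value.

    After the softmax, position i's self-contribution probs[i, i] * v[i]
    is subtracted, leaving only the context tokens' values in the output.
    Shapes: q, k, v are (..., T, d).
    """
    d = q.shape[-1]
    T = q.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / d**0.5
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(mask, float("-inf"))
    probs = scores.softmax(-1)
    out = probs @ v
    # remove the diagonal (self) contribution: p_ii * v_i
    self_p = probs.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)
    return out - self_p * v
```

A consequence worth noting: the first token, which can only attend to itself under the causal mask, gets an all-zero output from these layers.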

LN Scale + Partial RoPE

  • LN Scale: RMSNorm outputs scaled by 1/√(layer+1) in deeper layers for training stability
  • Partial RoPE: Only first 16/64 head dims get rotary encoding; remaining dims learn position-invariant patterns
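Both tricks can be sketched in a few lines (the rotation split convention and the cos/sin shapes below are assumptions; only the 16/64 split and the 1/√(layer+1) factor come from the description above):

```python
import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rope_dims: int = 16) -> torch.Tensor:
    """Rotary position encoding on only the first `rope_dims` of the head dim.

    The remaining dims pass through unrotated, so they can learn
    position-invariant patterns. Expects cos/sin of shape (T, rope_dims // 2).
    """
    x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rope.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

def ln_scale(layer_idx: int) -> float:
    """Depth-dependent damping applied to RMSNorm outputs: 1/sqrt(layer+1)."""
    return 1.0 / (layer_idx + 1) ** 0.5
```

With a 64-dim head, only dims 0-15 get rotated; and e.g. `ln_scale(3)` returns 0.5, so layer 4's normalized output is halved relative to layer 1's.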

Run Command

export BIGRAMHASH_BUCKETS=1536 NUM_LAYERS=10 MLP_MULT=2 \
    ROPE_PARTIAL_DIMS=16 LN_SCALE=1 XSA_LAYERS=4 \
    WARMDOWN_ITERS=3500 SEED=1337

torchrun --standalone --nproc_per_node=8 train_gpt.py

Notes

First submission. Reduced to 10 layers and MLP 2× to fit within 16MB artifact constraint. Future work: parameter banking to fit 11L/3× MLP, sliding window evaluation, and test-time training.

…ubmissions

- Layers 9→11, MLP mult 2→3, warmdown 1200→3500
- Int6 QAT with straight-through estimator (STE)
- BigramHash embedding (4096 buckets, hash-based bigram features)
- SmearGate (learnable gate blending hidden state with causal running average)
- EMA of weights (decay=0.997) used for validation and export
- zstd level 22 compression with zlib fallback
- U-Net skip connections now configurable via UNET_SKIPS env var
- All features configurable: BIGRAMHASH_BUCKETS, EMA_DECAY, SMEARGATE, UNET_SKIPS, INT6_QAT
…depth recurrence, MoE

Improvements over baseline:
- LeakyReLU(0.5)^2 activation (replaces relu^2, -0.003 BPB)
- GPTQ-lite clip search (5 percentiles per row, min MSE)
- Exclusive Self-Attention (XSA) on last N layers (env: XSA_LAYERS)
- LN Scale 1/sqrt(L+1) dampening (env: LN_SCALE)
- Partial RoPE (env: ROPE_PARTIAL_DIMS)
- Depth recurrence with gated weight sharing (env: RECURRENCE_REPEATS)
- Tiny MoE MLP with top-1 routing (env: MOE_NUM_EXPERTS)
- Windows compatibility: sys.platform guard for SDPA, torch.compile guard
- Periodic checkpointing (env: CHECKPOINT_EVERY)
- VAL_MAX_TOKENS for faster local validation

All new features behind env vars with safe defaults.
Implements the breakthrough eval-time technique from PR openai#809 (0.295 BPB):
- BackoffNgramMixer: order-2 to order-9 N-gram cache
- Entropy-adaptive alpha blending (model + N-gram predictions)
- Sequential eval building cache from scored tokens (legal/backward-looking)
- Configurable via NGRAM_EVAL=1 and NGRAM_MAX_ORDER=9 env vars
- GPT.forward() now supports _return_logits mode for N-gram blending

Enable with: export NGRAM_EVAL=1 NGRAM_MAX_ORDER=9
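The entropy-adaptive alpha blending above can be sketched as follows (the alpha schedule here is an assumption for illustration; the actual formula in the referenced PR may differ):

```python
import math
import torch

def blend_with_ngram(model_logits: torch.Tensor, ngram_probs: torch.Tensor,
                     alpha_max: float = 0.5) -> torch.Tensor:
    """Blend model and n-gram predictions, weighting the n-gram cache more
    when it is confident (low entropy) and ignoring it when it is near
    uniform (high entropy).
    """
    p_model = model_logits.softmax(-1)
    eps = 1e-9
    entropy = -(ngram_probs * (ngram_probs + eps).log()).sum(-1, keepdim=True)
    max_entropy = math.log(ngram_probs.shape[-1])  # uniform distribution
    alpha = alpha_max * (1.0 - entropy / max_entropy).clamp(0.0, 1.0)
    return (1.0 - alpha) * p_model + alpha * ngram_probs
```

When the n-gram cache has no signal (uniform probabilities), alpha collapses to ~0 and the model's prediction passes through unchanged.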
- EVAL_STRIDE=64 for sliding window evaluation (-0.02 BPB free)
- TTT_ENABLED=1 for legal score-first TTT (-0.025 to -0.083 BPB)
- forward_logits() method for eval without loss computation
- 9 TTT hyperparameters (LR, epochs, chunk size, freeze, momentum, etc.)
- All features off by default, backward compatible
- Tested locally: TTT gives -0.083 BPB on 500-step model
Phase files (all features env-var controlled):
- phase0_muoneqr: MuonEq-R optimizer (row-normalized Muon)
- phase0_misc: brotli compression, sqrt warmdown, cudagraph fix
- phase1_sp4096: SP4096 vocab, MLP4x, BIGRAM_DIM, ENCODER_LAYERS
- phase2_depthrecur: depth recurrence with untied RepeatMLP
- phase2_parallel_twolane: two-lane parallel residuals
- phase3_prequant_ttt: pre-quant AdamW TTT pipeline
- phase3_discriminative_ttt: discriminative TTT per-block LR (ULMFiT)
- phase4_ve128: VE128 value embeddings
- phase4_swa: tight SWA every 50 steps in warmdown
- phase4_polar_ns: Polar Express 4-step NS
- phase5_full_gptq: full Hessian GPTQ + Cholesky + actorder
- phase6_causal_slot: causal SLOT (context-only positions)

Scripts:
- h100-next-pod-setup.sh: new pod setup (benchmark + deps + data download)
- h100-sp4096-ablations.sh: SP4096 ablation suite (Phase A-C)
All features independently toggleable via env vars:
- MUONEQ_R=1 (default ON): MuonEq-R row-normalized optimizer
- WARMDOWN_SCHEDULE=sqrt: sqrt warmdown schedule
- RECUR_LAYERS: depth recurrence with untied RepeatMLP
- PARALLEL_RESID=1: two-lane parallel residuals with cross-lane routing
- TTT_PREQUANT=1: pre-quant AdamW TTT pipeline
- TTT_DISCRIMINATIVE=1: discriminative per-block LR TTT (ULMFiT-style)
- VE_ENABLED=1: VE128 value embeddings on layers 9-10
- SWA_ENABLED=1: tight SWA every 50 steps in warmdown
- POLAR_EXPRESS=1: Polar Express 4-step NS coefficients
- GPTQ_FULL_HESSIAN=1: full Hessian GPTQ + Cholesky + AR self-gen
- CAUSAL_SLOT_ENABLED=1: causal SLOT (context-only positions)
- Brotli+byte-shuffle compression (auto-detected)
train_gpt_full_stack.py fixes:
- Add MUONEQ_R env var (was missing — critical bug, MuonEq-R wasn't wired up)
- Add row normalization in Muon step: g / norm(dim=-1).clamp_min(1e-7)
- Wire muoneq_r into optimizer param group
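The row normalization named in the commit, as a standalone helper (`keepdim=True` is added so the division broadcasts per row; the helper name is invented here):

```python
import torch

def row_normalize(g: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """MuonEq-R style row normalization: divide each row of the update by
    its L2 norm, clamped away from zero for numerical safety."""
    return g / g.norm(dim=-1, keepdim=True).clamp_min(eps)
```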

h100-sp4096-ablations.sh fixes:
- Fix all env var names (MUONEQ_R not MUONEQR, BIGRAMHASH_DIM not BIGRAM_DIM,
  QK_GAIN_INIT not QK_GAIN, PARALLEL_START_LAYER=4 not PARALLEL_RESID=1,
  WARMDOWN_SCHEDULE already default so removed, TTT_ENABLED=0 explicit)
- Sequential A0→A1→A2/A3→A4→B0→B1→C0→C1
- Proper wait_for_result with tmux window checks
- Results summary at end

scripts/h100-8x-final-submission.sh (new):
- 3-seed run (seeds 4, 30, 2026) on 8xH100
- Full best config with all winners
- Proper 3-seed mean/std calculation
reduce-overhead uses CUDAGraphs which conflict with the rotary cache
(mutable tensor outputs) when warmup runs before the main training loop.
p0_misc had the same crash in phase tests. Switching to default compile
mode avoids CUDAGraph capture entirely - still gets torch.compile speedup
without the graph capture overhead.

Affected: all full_stack runs were failing at warmup_step 1.
Previous version used tmux new-window per run which caused parallel
launches. Now runs inline sequentially with torchrun directly.
One GPU at a time, guaranteed.
@MatoTeziTanka

Community Review — Record: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999

BPB: 1.3999 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA 922065f7c365, file train_gpt.py):

The TTT path at line 316 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
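The score-first-per-chunk pattern described above can be sketched as follows (a minimal illustration; `adapt_step` and the loss-returning model interface are stand-ins, not the PR's actual code):

```python
import torch

def score_first_ttt(model, chunks, adapt_step):
    """Legal TTT ordering: chunk i is scored under torch.no_grad() BEFORE
    any adaptation on it, so its score reflects weights trained only on
    chunks 0..i-1. The is_last_chunk guard skips the adaptation pass after
    the final chunk, since there is nothing left to score.
    """
    total_loss, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        model.eval()
        with torch.no_grad():
            loss = model(chunk)           # score first, under current weights
        total_loss += loss.item() * chunk.numel()
        total_tokens += chunk.numel()
        is_last_chunk = (i == len(chunks) - 1)
        if not is_last_chunk:
            model.train()
            adapt_step(model, chunk)      # adapt only AFTER scoring
    return total_loss / total_tokens
```

The key invariant is that no token's score ever depends on weights that have seen that token, which is what distinguishes this from the illegal multi-epoch fine-tune patterns mentioned above.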

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=71606 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

