Record: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999 (#827)
Programmerryoki wants to merge 14 commits into openai:main
Conversation
…ubmissions
- Layers 9→11, MLP mult 2→3, warmdown 1200→3500
- Int6 QAT with straight-through estimator (STE)
- BigramHash embedding (4096 buckets, hash-based bigram features)
- SmearGate (learnable gate blending hidden state with causal running average)
- EMA of weights (decay=0.997) used for validation and export
- zstd level 22 compression with zlib fallback
- U-Net skip connections now configurable via UNET_SKIPS env var
- All features configurable: BIGRAMHASH_BUCKETS, EMA_DECAY, SMEARGATE, UNET_SKIPS, INT6_QAT
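The SmearGate item above is the least self-explanatory; here is a minimal sketch of the idea, blending each hidden state with the causal running average of the sequence so far through a learnable per-channel gate. The class name, gate initialization, and blend form are my assumptions, not the PR's exact code:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Sketch: blend each position's hidden state with the causal running
    average of all positions so far, via a learnable per-channel gate."""
    def __init__(self, dim: int):
        super().__init__()
        # start strongly biased toward the identity path (sigmoid(-4) ~ 0.018)
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        t = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        smear = x.cumsum(dim=1) / t          # causal running mean over time
        g = torch.sigmoid(self.gate)         # per-channel blend weight in (0, 1)
        return (1.0 - g) * x + g * smear
```

At position 0 the running average equals the hidden state itself, so the first token always passes through unchanged regardless of the gate.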
…depth recurrence, MoE

Improvements over baseline:
- LeakyReLU(0.5)^2 activation (replaces relu^2, -0.003 BPB)
- GPTQ-lite clip search (5 percentiles per row, min MSE)
- Exclusive Self-Attention (XSA) on last N layers (env: XSA_LAYERS)
- LN Scale 1/sqrt(L+1) dampening (env: LN_SCALE)
- Partial RoPE (env: ROPE_PARTIAL_DIMS)
- Depth recurrence with gated weight sharing (env: RECURRENCE_REPEATS)
- Tiny MoE MLP with top-1 routing (env: MOE_NUM_EXPERTS)
- Windows compatibility: sys.platform guard for SDPA, torch.compile guard
- Periodic checkpointing (env: CHECKPOINT_EVERY)
- VAL_MAX_TOKENS for faster local validation

All new features behind env vars with safe defaults.
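The activation swap at the top of the list is literally a one-line change; a minimal sketch of the gradient behavior it buys (the tensor values below are illustrative only):

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x):
    # LeakyReLU(0.5)^2: equals relu(x)^2 for x > 0, but (0.5*x)^2 for x < 0,
    # so negative pre-activations keep a nonzero gradient (no dead neurons)
    return F.leaky_relu(x, negative_slope=0.5).square()

x = torch.tensor([-2.0, 1.0], requires_grad=True)
leaky_relu_sq(x).sum().backward()
# d/dx (0.5x)^2 at x = -2 is 0.5 * (-2) = -1.0; plain relu^2 would give 0 there
```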
Implements the breakthrough eval-time technique from PR openai#809 (0.295 BPB):
- BackoffNgramMixer: order-2 to order-9 N-gram cache
- Entropy-adaptive alpha blending (model + N-gram predictions)
- Sequential eval building cache from scored tokens (legal/backward-looking)
- Configurable via NGRAM_EVAL=1 and NGRAM_MAX_ORDER=9 env vars
- GPT.forward() now supports _return_logits mode for N-gram blending

Enable with: export NGRAM_EVAL=1 NGRAM_MAX_ORDER=9
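A toy sketch of the backoff cache and the entropy-adaptive blend weight (the data structures and the alpha schedule here are assumptions, not PR openai#809's exact code):

```python
import math
from collections import defaultdict

class BackoffNgramCache:
    """Sketch of an order-2..max_order backoff cache built only from
    already-scored tokens (backward-looking, so eval stays legal)."""
    def __init__(self, max_order: int = 9):
        self.max_order = max_order
        # counts[n-2][context_tuple][next_token] -> count, for n-grams of size n
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order - 1)]

    def update(self, tokens):
        for n in range(2, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                self.counts[n - 2][ctx][nxt] += 1

    def predict(self, context):
        # back off from the longest matching context to the shortest
        for n in range(self.max_order, 1, -1):
            c = self.counts[n - 2].get(tuple(context[-(n - 1):]))
            if c:
                total = sum(c.values())
                return {tok: cnt / total for tok, cnt in c.items()}
        return None

def entropy_adaptive_alpha(model_probs, vocab_size):
    # blend weight for the cache: trust the n-grams more when the model
    # is uncertain (assumed functional form; the PR's schedule may differ)
    H = -sum(p * math.log(p) for p in model_probs if p > 0)
    return min(1.0, H / math.log(vocab_size))
```

A uniform model distribution yields alpha = 1 (lean fully on the cache); a one-hot distribution yields alpha = 0 (trust the model).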
- EVAL_STRIDE=64 for sliding-window evaluation (-0.02 BPB free)
- TTT_ENABLED=1 for legal score-first TTT (-0.025 to -0.083 BPB)
- forward_logits() method for eval without loss computation
- 9 TTT hyperparameters (LR, epochs, chunk size, freeze, momentum, etc.)
- All features off by default, backward compatible
- Tested locally: TTT gives -0.083 BPB on a 500-step model
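The score-first TTT loop can be sketched as follows. Every chunk is scored with the current weights before the model updates on it, so no token is ever scored by weights that have already seen it. `model.loss(x, y)` is an assumed API returning mean cross-entropy; the optimizer choice and hyperparameters are placeholders:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, tokens, chunk_size=512, lr=1e-3):
    """Sketch of legal score-first-per-chunk test-time training."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total, n = 0.0, 0
    for i in range(0, tokens.numel() - 1, chunk_size):
        y = tokens[i + 1 : i + 1 + chunk_size]
        x = tokens[i : i + y.numel()]
        with torch.no_grad():                     # 1) score first (counts toward eval)
            total += model.loss(x, y).item() * y.numel()
            n += y.numel()
        opt.zero_grad()                           # 2) only then adapt on that chunk
        model.loss(x, y).backward()
        opt.step()
    return total / n                              # mean loss in nats per token
```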
Phase files (all features env-var controlled):
- phase0_muoneqr: MuonEq-R optimizer (row-normalized Muon)
- phase0_misc: brotli compression, sqrt warmdown, cudagraph fix
- phase1_sp4096: SP4096 vocab, MLP4x, BIGRAM_DIM, ENCODER_LAYERS
- phase2_depthrecur: depth recurrence with untied RepeatMLP
- phase2_parallel_twolane: two-lane parallel residuals
- phase3_prequant_ttt: pre-quant AdamW TTT pipeline
- phase3_discriminative_ttt: discriminative TTT per-block LR (ULMFiT)
- phase4_ve128: VE128 value embeddings
- phase4_swa: tight SWA every 50 steps in warmdown
- phase4_polar_ns: Polar Express 4-step NS
- phase5_full_gptq: full Hessian GPTQ + Cholesky + actorder
- phase6_causal_slot: causal SLOT (context-only positions)

Scripts:
- h100-next-pod-setup.sh: new pod setup (benchmark + deps + data download)
- h100-sp4096-ablations.sh: SP4096 ablation suite (Phase A-C)
All features independently toggleable via env vars:
- MUONEQ_R=1 (default ON): MuonEq-R row-normalized optimizer
- WARMDOWN_SCHEDULE=sqrt: sqrt warmdown schedule
- RECUR_LAYERS: depth recurrence with untied RepeatMLP
- PARALLEL_RESID=1: two-lane parallel residuals with cross-lane routing
- TTT_PREQUANT=1: pre-quant AdamW TTT pipeline
- TTT_DISCRIMINATIVE=1: discriminative per-block LR TTT (ULMFiT-style)
- VE_ENABLED=1: VE128 value embeddings on layers 9-10
- SWA_ENABLED=1: tight SWA every 50 steps in warmdown
- POLAR_EXPRESS=1: Polar Express 4-step NS coefficients
- GPTQ_FULL_HESSIAN=1: full Hessian GPTQ + Cholesky + AR self-gen
- CAUSAL_SLOT_ENABLED=1: causal SLOT (context-only positions)
- Brotli+byte-shuffle compression (auto-detected)
train_gpt_full_stack.py fixes:
- Add MUONEQ_R env var (was missing — critical bug, MuonEq-R wasn't wired up)
- Add row normalization in Muon step: g / norm(dim=-1).clamp_min(1e-7)
- Wire muoneq_r into optimizer param group

h100-sp4096-ablations.sh fixes:
- Fix all env var names (MUONEQ_R not MUONEQR, BIGRAMHASH_DIM not BIGRAM_DIM, QK_GAIN_INIT not QK_GAIN, PARALLEL_START_LAYER=4 not PARALLEL_RESID=1; WARMDOWN_SCHEDULE already default so removed; TTT_ENABLED=0 explicit)
- Sequential A0→A1→A2/A3→A4→B0→B1→C0→C1
- Proper wait_for_result with tmux window checks
- Results summary at end

scripts/h100-8x-final-submission.sh (new):
- 3-seed run (seeds 4, 30, 2026) on 8×H100
- Full best config with all winners
- Proper 3-seed mean/std calculation
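The row-normalization fix quoted above can be sketched as a standalone function (keepdim=True is added here so the division broadcasts over rows; the commit's one-liner presumably handles shapes equivalently):

```python
import torch

def row_normalize(g: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """MuonEq-R style row normalization (sketch): rescale each row of the
    gradient/update matrix to unit L2 norm, clamped to avoid divide-by-zero."""
    return g / g.norm(dim=-1, keepdim=True).clamp_min(eps)
```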
reduce-overhead mode uses CUDAGraphs, which conflict with the rotary cache (mutable tensor outputs) when warmup runs before the main training loop. p0_misc hit the same crash in phase tests. Switching to the default compile mode avoids CUDAGraph capture entirely while still getting the torch.compile speedup, without the graph-capture overhead. Affected: all full_stack runs were failing at warmup_step 1.
The previous version used tmux new-window per run, which caused parallel launches. Runs now execute inline and sequentially with torchrun directly: one GPU at a time, guaranteed.
Community Review — Record: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999

BPB: 1.3999 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA …): The TTT path at line 316 implements the score-first-per-chunk pattern: each chunk is scored under … Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk …

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=71606 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually. Classification via deterministic AST-based …

Reviewed by @MatoTeziTanka — The Agora.
LeakyReLU² + XSA4 + LN Scale + Partial RoPE
val_bpb: 1.3999 | ~13.5 MB | 8×H100 SXM
Results (8×H100 80GB SXM, PyTorch 2.8.0+cu128)
Architecture
Key Techniques
LeakyReLU(0.5)²
One-line activation change:
F.leaky_relu(x, negative_slope=0.5).square() replacing relu(x).square(). Preserves negative gradient flow and eliminates dead neurons while maintaining the relu² inductive bias.

GPTQ-lite Clip Search
Post-training quantization improvement: tests 5 clip percentiles per weight row (0.9999, 0.99995, 0.99999, 0.999995, 1.0) and selects the one minimizing per-row reconstruction MSE. Zero training cost.
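A sketch of the per-row search described above, under an assumed symmetric int6 fake-quantizer (the bit width matches the Int6 QAT mentioned elsewhere in this PR, but the rounding scheme here is my assumption):

```python
import torch

def clip_search_row(w_row: torch.Tensor, bits: int = 6,
                    percentiles=(0.9999, 0.99995, 0.99999, 0.999995, 1.0)):
    """GPTQ-lite clip search (sketch): fake-quantize one weight row at each
    candidate clip percentile, keep the clip minimizing reconstruction MSE."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    best_err, best_deq = None, None
    absw = w_row.abs()
    for p in percentiles:
        clip = torch.quantile(absw, p)               # candidate clip threshold
        scale = (clip / qmax).clamp_min(1e-12)
        q = (w_row.clamp(-clip, clip) / scale).round().clamp(-qmax, qmax)
        err = ((q * scale - w_row) ** 2).mean()      # per-row reconstruction MSE
        if best_err is None or err < best_err:
            best_err, best_deq = err, q * scale
    return best_deq, best_err
```

Since the search runs post-training over a handful of candidates per row, it adds zero training cost, only a small export-time pass.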
Exclusive Self-Attention (XSA)
Applied to last 4 layers. Subtracts self-value from attention output, forcing each token to attend more to context tokens rather than itself.
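A minimal single-head sketch of the subtraction, assuming the self term is removed after the causal softmax (the PR's exact renormalization may differ):

```python
import torch

def xsa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Exclusive Self-Attention (sketch): causal softmax attention, then
    subtract each token's attention-weighted self value from its output."""
    T, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), 1)
    A = scores.masked_fill(causal, float("-inf")).softmax(-1)
    self_w = A.diagonal(dim1=-2, dim2=-1)          # weight each token puts on itself
    return A @ v - self_w.unsqueeze(-1) * v        # drop the self contribution
```

Note that under this sketch the first token's output is exactly zero, since it can only attend to itself.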
LN Scale + Partial RoPE
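Per the commit list, LN Scale applies a 1/sqrt(L+1) dampening per layer and Partial RoPE rotates only the first ROPE_PARTIAL_DIMS head dimensions, leaving the rest position-agnostic. A sketch of partial RoPE (the base frequency and half/half dimension pairing are assumptions):

```python
import torch

def partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0):
    """Sketch of Partial RoPE: rotary embedding on the first rot_dims
    channels (mirroring ROPE_PARTIAL_DIMS), identity on the remainder."""
    T = x.size(-2)
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2, rest = x[..., :half], x[..., half:rot_dims], x[..., rot_dims:]
    rot = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rot, rest], dim=-1)            # unrotated tail passes through
```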
Run Command
export BIGRAMHASH_BUCKETS=1536 NUM_LAYERS=10 MLP_MULT=2 \
  ROPE_PARTIAL_DIMS=16 LN_SCALE=1 XSA_LAYERS=4 \
  WARMDOWN_ITERS=3500 SEED=1337
torchrun --standalone --nproc_per_node=8 train_gpt.py

Notes
First submission. Reduced to 10 layers and MLP 2× to fit within the 16MB artifact constraint. Future work: parameter banking to fit 11L/3× MLP, sliding-window evaluation, and test-time training.