
Record: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999 #827

Open
Programmerryoki wants to merge 14 commits into openai:main from Programmerryoki:main

Conversation

@Programmerryoki

LeakyReLU² + XSA4 + LN Scale + Partial RoPE

val_bpb: 1.3999 | ~13.5 MB | 8×H100 SXM

Results (8×H100 80GB SXM, PyTorch 2.8.0+cu128)

Seed   step_avg   steps   val_bpb   Artifact (bytes)
1337   65.0 ms    9,231   1.3999    13,502,602

Architecture

Component Setting
Layers 10 (512d, 8H, 4KV GQA)
MLP 2× with LeakyReLU(0.5)²
BigramHash 1536
SmearGate Enabled
U-Net Skips Enabled
XSA Last 4 layers
Partial RoPE 16/64 dims
LN Scale 1/√(layer+1)
Weight avg EMA(0.997)
Quantization Int6 QAT + GPTQ-lite clip search + zstd-22

Key Techniques

LeakyReLU(0.5)²

One-line activation change: F.leaky_relu(x, negative_slope=0.5).square() replaces relu(x).square(). The negative slope preserves gradient flow for negative inputs, eliminating dead neurons while keeping the relu² inductive bias.
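A minimal sketch of the activation (only the function name `leaky_relu_squared` is invented here; the underlying call matches the one-liner above):

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    """LeakyReLU followed by squaring, replacing relu(x)**2.

    Negative inputs keep a scaled pre-square value (slope 0.5), so neurons
    cannot permanently die, while positive inputs still follow the x**2
    curve of the original relu-squared activation.
    """
    return F.leaky_relu(x, negative_slope=negative_slope).square()

x = torch.tensor([-2.0, 0.0, 3.0])
print(leaky_relu_squared(x))  # tensor([1., 0., 9.])
```

Note the squaring makes the activation non-monotonic on the negative side: large negative inputs map to large positive outputs, scaled down by the slope squared (0.25).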

GPTQ-lite Clip Search

Post-training quantization improvement: tests 5 clip percentiles per weight row (0.9999, 0.99995, 0.99999, 0.999995, 1.0) and selects the one minimizing per-row reconstruction MSE. Zero training cost.
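A hedged sketch of the clip search, assuming symmetric per-row integer quantization (the PR's exact scale and rounding details may differ):

```python
import torch

def clip_search_quantize_row(w: torch.Tensor, bits: int = 6,
                             percentiles=(0.9999, 0.99995, 0.99999, 0.999995, 1.0)):
    """Try several clip percentiles for one weight row and keep the one whose
    quantized reconstruction minimizes MSE against the original row.

    This is a sketch of the 'GPTQ-lite' idea: purely post-training, zero
    training cost, one pass over a handful of candidate clip values.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    absw = w.abs()
    best_mse, best_rec = float("inf"), w
    for p in percentiles:
        clip = absw.max() if p >= 1.0 else torch.quantile(absw, p)
        scale = (clip / qmax).clamp_min(1e-12)
        q = (w / scale).round().clamp(-qmax, qmax)  # symmetric int quantization
        rec = q * scale
        mse = (rec - w).pow(2).mean().item()
        if mse < best_mse:
            best_mse, best_rec = mse, rec
    return best_rec, best_mse
```

Clipping below the max sacrifices a few outlier weights to spend the quantization grid on the bulk of the distribution, which is why a sub-1.0 percentile often wins on MSE.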

Exclusive Self-Attention (XSA)

Applied to the last 4 layers. Subtracts the self-value from the attention output, forcing each token to attend more to context tokens than to itself.
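One plausible form of the subtraction, sketched below, removes each token's self-attention contribution p_ii·v_i from the output; the PR's exact rule is not shown here, so treat this as an assumption:

```python
import torch

def exclusive_self_attention(q, k, v):
    """Causal attention whose output excludes each token's own value.

    After the softmax, position i's self-contribution probs[i, i] * v[i]
    is subtracted, leaving only the context tokens' values in the output.
    Shapes: q, k, v are (..., T, d).
    """
    d = q.shape[-1]
    T = q.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / d**0.5
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(mask, float("-inf"))
    probs = scores.softmax(-1)
    out = probs @ v
    # remove the diagonal (self) contribution: p_ii * v_i
    self_p = probs.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)
    return out - self_p * v
```

A consequence worth noting: the first token, which can only attend to itself under the causal mask, gets an all-zero output from these layers.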

LN Scale + Partial RoPE

  • LN Scale: RMSNorm outputs scaled by 1/√(layer+1) in deeper layers for training stability
  • Partial RoPE: Only first 16/64 head dims get rotary encoding; remaining dims learn position-invariant patterns
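Both tricks can be sketched in a few lines (the rotation split convention and the cos/sin shapes below are assumptions; only the 16/64 split and the 1/√(layer+1) factor come from the description above):

```python
import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rope_dims: int = 16) -> torch.Tensor:
    """Rotary position encoding on only the first `rope_dims` of the head dim.

    The remaining dims pass through unrotated, so they can learn
    position-invariant patterns. Expects cos/sin of shape (T, rope_dims // 2).
    """
    x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rope.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

def ln_scale(layer_idx: int) -> float:
    """Depth-dependent damping applied to RMSNorm outputs: 1/sqrt(layer+1)."""
    return 1.0 / (layer_idx + 1) ** 0.5
```

With a 64-dim head, only dims 0-15 get rotated; and e.g. `ln_scale(3)` returns 0.5, so layer 4's normalized output is halved relative to layer 1's.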

Run Command

export BIGRAMHASH_BUCKETS=1536 NUM_LAYERS=10 MLP_MULT=2 \
    ROPE_PARTIAL_DIMS=16 LN_SCALE=1 XSA_LAYERS=4 \
    WARMDOWN_ITERS=3500 SEED=1337

torchrun --standalone --nproc_per_node=8 train_gpt.py

Notes

First submission. Reduced to 10 layers and MLP 2× to fit within 16MB artifact constraint. Future work: parameter banking to fit 11L/3× MLP, sliding window evaluation, and test-time training.

…ubmissions

- Layers 9→11, MLP mult 2→3, warmdown 1200→3500
- Int6 QAT with straight-through estimator (STE)
- BigramHash embedding (4096 buckets, hash-based bigram features)
- SmearGate (learnable gate blending hidden state with causal running average)
- EMA of weights (decay=0.997) used for validation and export
- zstd level 22 compression with zlib fallback
- U-Net skip connections now configurable via UNET_SKIPS env var
- All features configurable: BIGRAMHASH_BUCKETS, EMA_DECAY, SMEARGATE, UNET_SKIPS, INT6_QAT
…depth recurrence, MoE

Improvements over baseline:
- LeakyReLU(0.5)^2 activation (replaces relu^2, -0.003 BPB)
- GPTQ-lite clip search (5 percentiles per row, min MSE)
- Exclusive Self-Attention (XSA) on last N layers (env: XSA_LAYERS)
- LN Scale 1/sqrt(L+1) dampening (env: LN_SCALE)
- Partial RoPE (env: ROPE_PARTIAL_DIMS)
- Depth recurrence with gated weight sharing (env: RECURRENCE_REPEATS)
- Tiny MoE MLP with top-1 routing (env: MOE_NUM_EXPERTS)
- Windows compatibility: sys.platform guard for SDPA, torch.compile guard
- Periodic checkpointing (env: CHECKPOINT_EVERY)
- VAL_MAX_TOKENS for faster local validation

All new features behind env vars with safe defaults.
Implements the breakthrough eval-time technique from PR openai#809 (0.295 BPB):
- BackoffNgramMixer: order-2 to order-9 N-gram cache
- Entropy-adaptive alpha blending (model + N-gram predictions)
- Sequential eval building cache from scored tokens (legal/backward-looking)
- Configurable via NGRAM_EVAL=1 and NGRAM_MAX_ORDER=9 env vars
- GPT.forward() now supports _return_logits mode for N-gram blending

Enable with: export NGRAM_EVAL=1 NGRAM_MAX_ORDER=9
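The entropy-adaptive alpha blending above can be sketched as follows (the alpha schedule here is an assumption for illustration; the actual formula in the referenced PR may differ):

```python
import math
import torch

def blend_with_ngram(model_logits: torch.Tensor, ngram_probs: torch.Tensor,
                     alpha_max: float = 0.5) -> torch.Tensor:
    """Blend model and n-gram predictions, weighting the n-gram cache more
    when it is confident (low entropy) and ignoring it when it is near
    uniform (high entropy).
    """
    p_model = model_logits.softmax(-1)
    eps = 1e-9
    entropy = -(ngram_probs * (ngram_probs + eps).log()).sum(-1, keepdim=True)
    max_entropy = math.log(ngram_probs.shape[-1])  # uniform distribution
    alpha = alpha_max * (1.0 - entropy / max_entropy).clamp(0.0, 1.0)
    return (1.0 - alpha) * p_model + alpha * ngram_probs
```

When the n-gram cache has no signal (uniform probabilities), alpha collapses to ~0 and the model's prediction passes through unchanged.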
- EVAL_STRIDE=64 for sliding window evaluation (-0.02 BPB free)
- TTT_ENABLED=1 for legal score-first TTT (-0.025 to -0.083 BPB)
- forward_logits() method for eval without loss computation
- 9 TTT hyperparameters (LR, epochs, chunk size, freeze, momentum, etc.)
- All features off by default, backward compatible
- Tested locally: TTT gives -0.083 BPB on 500-step model
Phase files (all features env-var controlled):
- phase0_muoneqr: MuonEq-R optimizer (row-normalized Muon)
- phase0_misc: brotli compression, sqrt warmdown, cudagraph fix
- phase1_sp4096: SP4096 vocab, MLP4x, BIGRAM_DIM, ENCODER_LAYERS
- phase2_depthrecur: depth recurrence with untied RepeatMLP
- phase2_parallel_twolane: two-lane parallel residuals
- phase3_prequant_ttt: pre-quant AdamW TTT pipeline
- phase3_discriminative_ttt: discriminative TTT per-block LR (ULMFiT)
- phase4_ve128: VE128 value embeddings
- phase4_swa: tight SWA every 50 steps in warmdown
- phase4_polar_ns: Polar Express 4-step NS
- phase5_full_gptq: full Hessian GPTQ + Cholesky + actorder
- phase6_causal_slot: causal SLOT (context-only positions)

Scripts:
- h100-next-pod-setup.sh: new pod setup (benchmark + deps + data download)
- h100-sp4096-ablations.sh: SP4096 ablation suite (Phase A-C)
All features independently toggleable via env vars:
- MUONEQ_R=1 (default ON): MuonEq-R row-normalized optimizer
- WARMDOWN_SCHEDULE=sqrt: sqrt warmdown schedule
- RECUR_LAYERS: depth recurrence with untied RepeatMLP
- PARALLEL_RESID=1: two-lane parallel residuals with cross-lane routing
- TTT_PREQUANT=1: pre-quant AdamW TTT pipeline
- TTT_DISCRIMINATIVE=1: discriminative per-block LR TTT (ULMFiT-style)
- VE_ENABLED=1: VE128 value embeddings on layers 9-10
- SWA_ENABLED=1: tight SWA every 50 steps in warmdown
- POLAR_EXPRESS=1: Polar Express 4-step NS coefficients
- GPTQ_FULL_HESSIAN=1: full Hessian GPTQ + Cholesky + AR self-gen
- CAUSAL_SLOT_ENABLED=1: causal SLOT (context-only positions)
- Brotli+byte-shuffle compression (auto-detected)
train_gpt_full_stack.py fixes:
- Add MUONEQ_R env var (was missing — critical bug, MuonEq-R wasn't wired up)
- Add row normalization in Muon step: g / norm(dim=-1).clamp_min(1e-7)
- Wire muoneq_r into optimizer param group
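The row normalization named in the commit, as a standalone helper (`keepdim=True` is added so the division broadcasts per row; the helper name is invented here):

```python
import torch

def row_normalize(g: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """MuonEq-R style row normalization: divide each row of the update by
    its L2 norm, clamped away from zero for numerical safety."""
    return g / g.norm(dim=-1, keepdim=True).clamp_min(eps)
```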

h100-sp4096-ablations.sh fixes:
- Fix all env var names (MUONEQ_R not MUONEQR, BIGRAMHASH_DIM not BIGRAM_DIM,
  QK_GAIN_INIT not QK_GAIN, PARALLEL_START_LAYER=4 not PARALLEL_RESID=1,
  WARMDOWN_SCHEDULE already default so removed, TTT_ENABLED=0 explicit)
- Sequential A0→A1→A2/A3→A4→B0→B1→C0→C1
- Proper wait_for_result with tmux window checks
- Results summary at end

scripts/h100-8x-final-submission.sh (new):
- 3-seed run (seeds 4, 30, 2026) on 8xH100
- Full best config with all winners
- Proper 3-seed mean/std calculation
reduce-overhead uses CUDAGraphs which conflict with the rotary cache
(mutable tensor outputs) when warmup runs before the main training loop.
p0_misc had the same crash in phase tests. Switching to default compile
mode avoids CUDAGraph capture entirely - still gets torch.compile speedup
without the graph capture overhead.

Affected: all full_stack runs were failing at warmup_step 1.
Previous version used tmux new-window per run which caused parallel
launches. Now runs inline sequentially with torchrun directly.
One GPU at a time, guaranteed.
@MatoTeziTanka

Community Review — Record: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999

BPB: 1.3999 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA 922065f7c365, file train_gpt.py):

The TTT path at line 316 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
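The score-first-per-chunk pattern described above can be sketched as follows (a minimal illustration; `adapt_step` and the loss-returning model interface are stand-ins, not the PR's actual code):

```python
import torch

def score_first_ttt(model, chunks, adapt_step):
    """Legal TTT ordering: chunk i is scored under torch.no_grad() BEFORE
    any adaptation on it, so its score reflects weights trained only on
    chunks 0..i-1. The is_last_chunk guard skips the adaptation pass after
    the final chunk, since there is nothing left to score.
    """
    total_loss, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        model.eval()
        with torch.no_grad():
            loss = model(chunk)           # score first, under current weights
        total_loss += loss.item() * chunk.numel()
        total_tokens += chunk.numel()
        is_last_chunk = (i == len(chunks) - 1)
        if not is_last_chunk:
            model.train()
            adapt_step(model, chunk)      # adapt only AFTER scoring
    return total_loss / total_tokens
```

The key invariant is that no token's score ever depends on weights that have seen that token, which is what distinguishes this from the illegal multi-epoch fine-tune patterns mentioned above.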

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=71606 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

