Record Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64 #700
RoyiRa wants to merge 1 commit into openai:main
Conversation
Built on PR openai#700 with hyperparameter improvements found via autoresearch-multi combinatorial search:
- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045). Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 30e7835 to 57d1d2c
1. Rewrite ttt_adapt() to score-first pattern (Issue openai#677 compliant):
   - Process val data in sequential chunks (TTT_CHUNK_TOKENS=131072)
   - Phase 1: score chunk under inference_mode (forward only)
   - Phase 2: train on scored tokens with AdamW (K epochs)
   - Each token is scored BEFORE the model trains on it
2. Switch TTT optimizer from SGD to AdamW (lr=0.0001, wd=0.0)
   - PR openai#700 showed AdamW >> SGD for TTT
   - Default 4 epochs, freeze first 2 blocks
3. Fix DDP find_unused_parameters → static_graph=True
   - Same 3x slowdown fix as the submission directory
4. TTT defaults: disabled by default (TTT_ENABLED=0)
   - Enable with TTT_ENABLED=1 for the combined TTT + n-gram eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
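The score-first ordering in item 1 can be sketched in miniature. This is an illustration of the chunked scoring discipline only: `ToyModel`, `score`, and `train_on` are hypothetical stand-ins, not the PR's real API, and the chunk size is shrunk for readability.

```python
# Minimal sketch of the score-first TTT pattern: within each chunk, every
# token is scored before the model receives any gradient updates from it.
TTT_CHUNK_TOKENS = 4  # tiny for illustration; the PR uses 131072


def ttt_adapt(model, tokens, epochs=4):
    scores = []
    for start in range(0, len(tokens), TTT_CHUNK_TOKENS):
        chunk = tokens[start:start + TTT_CHUNK_TOKENS]
        # Phase 1: score the chunk (forward only, no parameter updates)
        scores.extend(model.score(chunk))
        # Phase 2: only now train on the already-scored tokens, K epochs
        for _ in range(epochs):
            model.train_on(chunk)
    return scores


class ToyModel:
    """Stand-in model that records what it has been trained on."""
    def __init__(self):
        self.trained_on = []

    def score(self, chunk):
        # Record, per token, whether it had already been trained on
        return [(t, t in self.trained_on) for t in chunk]

    def train_on(self, chunk):
        self.trained_on.extend(chunk)


m = ToyModel()
out = ttt_adapt(m, list(range(10)))
assert all(not seen for _, seen in out)  # scored strictly before training
```

The assertion at the end is the compliance invariant: no token contributes a gradient before its own score is recorded.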
Four major additions to the Kuda Architecture:
1. Hedge Mixer (5-expert, eval-time): Multiplicative Weights Update mixing neural + unigram + bigram + trigram + entropy experts. Based on online learning theory (Freund & Schapire 1997); the same principle behind the PAQ/CMIX world-best compressors. Expected -0.065 BPB (validated in PR openai#700).
2. CROWN-Q warmdown penalty: lambda * mean(w^2 * delta^2 / 12) pushes weights into flat minima that survive quantization. delta^2/12 is the uniform quantization noise variance; w^2 is a diagonal Fisher proxy. Applied during warmdown only. From PR openai#693.
3. RoPE NTK fix: Propagate train_seq_len to all blocks' Rotary modules. Prevents the positional-encoding mismatch between train (2048) and eval. From PR openai#714, which produced the tightest seed variance in the competition.
4. TTT infrastructure: Score-first eval with SGD adaptation on scored tokens. FiLM-only TTT planned for Kuda recurrence mode.

All features verified locally: forward/backward, CROWN-Q penalty, 5-expert Hedge mixing, Hedge weight updates, RoPE propagation. Script now 1,559 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deep analysis of feature dependency chains in both winning approaches. SOTA is speed-first, PR openai#700 is eval-first. Every feature enables the next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
Research-backed fixes for all four blockers:
1. Quant gap (0.071 → 0.005): Late QAT with STE on bank slices, EMA via named_parameters (not state_dict), full GPTQ with Hessian
2. Eval speed (101 min → 10 min): SOTA's sliding-window TTT pattern, batch 32 windows, distribute across 8 GPUs, cosine LR decay
3. Artifact (16.9 MB → 16 MB): 3% magnitude pruning (PR openai#700 pattern)
4. EMA/DDP: Use named_parameters() on the unwrapped base_model

All implementations sourced from actual SOTA code (pg-sota-train.md). Priority: EMA fix → Late QAT → Pruning → Sliding TTT → Full GPTQ.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
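The "EMA via named_parameters, not state_dict" fix in item 4 comes down to keying the shadow copies by parameter name on the unwrapped base model, so DDP's `module.` key prefix can never desynchronize them. A pure-Python sketch of that idea, with plain floats standing in for tensors (the real code would iterate `base_model.named_parameters()`):

```python
# EMA shadow keyed by parameter name; values here are floats for illustration.
def ema_update(shadow, named_params, decay=0.999):
    """Update exponential-moving-average shadows in place and return them."""
    for name, value in named_params:
        if name not in shadow:
            shadow[name] = value  # initialize on first sight
        else:
            shadow[name] = decay * shadow[name] + (1 - decay) * value
    return shadow


shadow = {}
ema_update(shadow, [("w", 1.0)], decay=0.9)   # init: shadow["w"] == 1.0
ema_update(shadow, [("w", 0.0)], decay=0.9)   # 0.9 * 1.0 + 0.1 * 0.0
assert abs(shadow["w"] - 0.9) < 1e-9
```

Because the keys are the names yielded by the unwrapped model, the same dict works whether or not the live model is wrapped in DDP.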
Maps every top entry through BPB = L + Q + T + M:
- openai#700 solved M (mixer) but has the worst L (training)
- openai#609 solved Q (quant) but has zero T and M (no eval pipeline)
- openai#549 solved L (training) but has zero M (no mixer)
- Nobody has optimized all four terms simultaneously
- Theoretical optimal = 1.052 (combine the best of each)
- Our Track B path to 1.025 via recurrence + FiLM-only TTT + Mixer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
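The additive decomposition can be made concrete with a toy ledger. The four values below are hypothetical placeholders chosen only to show the bookkeeping, not measured numbers from any PR:

```python
# Toy illustration of the BPB = L + Q + T + M ledger used in the analysis.
def total_bpb(L, Q, T, M):
    # L: base training loss, Q: quantization gap (positive),
    # T: test-time-training gain, M: mixer gain (both negative when helping)
    return L + Q + T + M


# Placeholder values, not real measurements:
assert abs(total_bpb(1.10, 0.02, -0.03, -0.04) - 1.05) < 1e-9
```

The point of the ledger is that improving any single term in isolation leaves the others as the binding constraint, which is why no single top entry reaches the combined optimum.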
Reframe Corrections:
- T+M are combined (-0.020), not separate. PR openai#700 gets -0.073 (3.6x better)
- Our Q gap (0.066) is larger than the openai#549-openai#700 total gap — Q is THE bottleneck
- Added a "Best Known" column comparing against the best per-term, not just the merged SOTA

New insights added:
- Kaplan width scaling, hidden ≥ 512 threshold, Goldilocks depth
- MoE viability at small scale (inactive experts compress well)
- Vocab expansion opportunity (mechanical BPB reduction)
- Compression reframe: a BPB competition is a compression competition, with 20 years of literature to draw on
- Strategic evolution: feature bloat → simplify → Q bottleneck → compression-first approach
- Theoretical optimal 1.052 = combine the best of openai#549 + openai#609 + openai#700 (nobody has done this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
Community Review — Record Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64
Compliance: HOLD — scored-region SLOT pending Issue #1336

PR #700 — HedgeMixer + CROWN-Q + stride=64 (1.0541 BPB)
Author: RoyiRa | Head SHA: 57d1d2c

Check 1: N-gram Family Bug (CLOSE trigger)
Result: CLEAN. The trigram update in … The target token (…
Check 2: Pre-Quant TTT (CLOSE trigger)
Result: CLEAN. The sole TTT function is …
Verdict: HOLD — the TTT trains on the same token stream that produces the reported BPB. While the per-chunk score-first discipline is maintained, the model receives gradient updates from the evaluation distribution before scoring later chunks. This is the scored-region SLOT pattern pending Issue #1336.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: HOLD pending the Issue #1336 ruling on scored-region SLOT.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
Record: 5-expert Hedge Mixer + CROWN-Q + stride=64 (val_bpb=1.0541)
val_bpb: 1.0541 (3-seed mean) | ~15.7 MB | 8xH100 SXM
Results (8xH100 80GB SXM)
Contributions
1. CROWN-Q Training Penalty (training-time)
Added a quantization-aware penalty during warmdown that penalizes weights sensitive to quantization error:

L_CROWN-Q = lambda * mean(w^2 * delta^2 / 12)

where delta = row_max / clip_range is the per-row quantization step size and delta^2/12 is the variance of uniform quantization noise. This encourages weights to be quantization-friendly, reducing post-quantization degradation. CROWN_Q_LAMBDA=0.01.

Effect: Slightly better compression (artifact ~200 KB smaller) and more robust quantization.
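The penalty above can be written directly from its formula. A minimal NumPy sketch, assuming a symmetric int8-style clip value of 127 for `clip_range` (the PR's exact per-row scheme may differ):

```python
import numpy as np


def crown_q_penalty(w, clip_range=127.0, lam=0.01):
    """lam * mean(w^2 * delta^2 / 12) over a 2-D weight matrix."""
    row_max = np.abs(w).max(axis=1, keepdims=True)  # per-row scale
    delta = row_max / clip_range                    # per-row quantization step
    # delta^2 / 12 is the variance of uniform quantization noise;
    # w^2 acts as a diagonal Fisher proxy for sensitivity to that noise.
    return lam * np.mean(w ** 2 * delta ** 2 / 12.0)


w = np.ones((2, 4))                 # every row has row_max = 1
penalty = crown_q_penalty(w)        # = 0.01 * (1/127)^2 / 12
assert abs(penalty - 0.01 / (127.0 ** 2 * 12)) < 1e-18
```

Because `delta` scales with `row_max`, the penalty pushes down both large weights and large per-row dynamic range, which is what makes the resulting minima survive quantization.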
2. Eval stride 32 -> 64 (eval-time)
Changed the sliding-window stride from 32 to 64 during evaluation. Experiments showed identical BPB at 2x faster scoring, freeing ~100 s of eval budget for more TTT epochs.
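The 2x speedup is just window counting: the number of sliding windows over N tokens scales as roughly N/stride. A quick sketch, assuming a 2048-token window (the training sequence length; the eval window size is an assumption here):

```python
# Number of sliding windows needed to score n_tokens: the first window covers
# `window` tokens, each subsequent window scores `stride` new tokens.
def num_windows(n_tokens, window=2048, stride=64):
    if n_tokens <= window:
        return 1
    return 1 + (n_tokens - window + stride - 1) // stride  # ceil division


w32 = num_windows(1_000_000, stride=32)
w64 = num_windows(1_000_000, stride=64)
assert w32 == 31187 and w64 == 15594  # doubling the stride halves the work
```

Since BPB was measured to be unchanged at stride 64, the halved window count converts directly into freed eval time.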
3. TTT Epochs 3 -> 4 (eval-time)
Increased test-time training from 3 to 4 epochs per chunk, using the time freed by stride=64. Each additional epoch adapts the model further to the scored data. 8 epochs was also tested, but it overfits (1.0735 vs 1.0473 for 4 epochs).
Combined Effect
Architecture
Reproduction
Compliance
- Each token is scored under inference_mode() before any training on it

Credits