XSA-All 11L + LeakyReLU(0.75)² + Aggressive Legal TTT → 1.1219 BPB (#1092)
teddyoweh wants to merge 3 commits into openai:main from …
Conversation
Excellent combination of tweaks that synergize with more aggressive TTT. I'm surprised that the 15x learning rate was better, nice finding!
Community Review — XSA-All 11L + LeakyReLU(0.75)² + Aggressive Legal TTT → 1.1219 BPB
BPB: 1.1219 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)
What I found in the code (head SHA …): the TTT path at line 1133 implements the score-first-per-chunk pattern: each chunk is scored under torch.inference_mode() before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=94098 B, SMOKE_TEST_PASS.
Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.
Auto-classification caveat: this review was drafted by the deterministic AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.
Reviewed by @MatoTeziTanka — The Agora.
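The score-first-per-chunk ordering the review checks for can be sketched as below. This is a minimal illustration, not the PR's actual code: the function name, the chunk format, and the toy loss handling are mine; only the ordering (score under `torch.inference_mode()`, then update) and the SGD/momentum/lr settings come from the PR description.

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr=0.03):
    """Sketch of score-first-per-chunk TTT: every chunk is scored
    BEFORE the adapter trains on it, so no token's score ever depends
    on weights that have already seen that token."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    total_loss, n = 0.0, 0
    for x, y in chunks:
        # 1) Score the chunk with frozen weights (stateless forward).
        with torch.inference_mode():
            total_loss += loss_fn(model(x), y).item()
            n += 1
        # 2) Only AFTER scoring, adapt on that same chunk.
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return total_loss / n
```

The legality hinges entirely on step 1 running before step 2 within each chunk; an update-then-score ordering would leak the chunk into its own score.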
Results
val_bpb: 1.1219 | Artifact: 15,916,230 bytes (15.92 MB) | 8×H100 SXM
What's New
Three independently validated improvements on top of the PR #414 + PR #399 stack:
1. XSA on All 11 Layers (XSA_LAST_N=11)
Extending eXtended Self-Attention from the last 4 layers to all 11 yields -0.0007 BPB. The richer attention outweighs the ~4% slower step time (93.97ms vs ~90ms).
2. LeakyReLU(0.75)²
Higher negative slope than the current SOTA (0.75 vs 0.5). From PR #977's ablation, 0.75 is strictly better than 0.5 for the int6 stack. Preserves more gradient flow through the MLP.
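One plausible reading of "LeakyReLU(0.75)²" is sketched below; the module name is mine, and whether the square is applied sign-destroying (as here) or sign-preserving is an assumption this PR's text doesn't settle. Only the 0.75 negative slope comes from the PR.

```python
import torch

class LeakyReLUSquared(torch.nn.Module):
    """leaky_relu(x, 0.75) ** 2: the high negative slope keeps a
    nonzero gradient for negative pre-activations, and the square
    sharpens the positive branch (x -> x**2 for x > 0)."""
    def __init__(self, negative_slope: float = 0.75):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.nn.functional.leaky_relu(x, self.negative_slope)
        return y * y
```

Note that for negative inputs the gradient is 2 * 0.75² * x, so unlike ReLU² the negative half-plane still trains, which matches the "preserves more gradient flow" rationale above.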
3. Aggressive Legal TTT (lr=0.03)
Score-first TTT using PR #461's legal framework with a 15× higher learning rate (0.03 vs 0.002). Delivers -0.0033 BPB improvement (vs -0.0025 in SOTA). All blocks unfrozen, SGD with momentum 0.9, 3 epochs per chunk, cosine LR decay.
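The cosine LR decay named above can be written as a small closed-form schedule. This is a generic sketch, not the PR's code: the function name and the lr_min=0 floor are my assumptions; the 0.03 peak comes from the PR, and total_steps would be the optimizer steps spanned by the 3 epochs over a chunk.

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 0.03, lr_min: float = 0.0) -> float:
    """Cosine decay from lr_max at step 0 down to lr_min at
    total_steps, clamped so later steps stay at lr_min."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

At step 0 this gives the full 0.03, the halfway point gives 0.015, and the final step gives lr_min, so the aggressive rate is front-loaded onto each chunk.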
torch.inference_mode() guarantees scoring is stateless: weights are only updated AFTER the chunk is scored.
FA3 Fallback
Script includes automatic fallback from Flash Attention 3 to PyTorch SDPA:
Our run used SDPA (93.97ms/step → 6,173 steps). With FA3 (~84ms/step → ~7,100 steps), expected BPB would be in the 1.119x range.
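An automatic FA3-to-SDPA fallback of the kind described typically looks like the sketch below. This is an illustration under assumptions: the `flash_attn` import name matches the public flash-attn package's `flash_attn_func`, but the wrapper, its layout handling, and the `HAS_FA3` flag are mine, not the script's actual code.

```python
import torch
import torch.nn.functional as F

try:
    # Flash Attention kernels, if the flash-attn package is installed.
    from flash_attn import flash_attn_func
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v, causal=True):
    """Dispatch to flash-attn when available, else PyTorch SDPA.
    q, k, v: (batch, heads, seq, head_dim) as SDPA expects;
    flash_attn_func wants (batch, seq, heads, head_dim), hence
    the transposes around that call."""
    if HAS_FA3:
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```

Because both paths compute the same attention, the fallback only costs step time (~84ms vs 93.97ms here), not correctness, which is why the BPB estimate for FA3 is an extrapolation from extra steps rather than a different model.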
Timing
Run Command
Credits