Commit 8daf3f1
PR openai#462 base (1.0672 SOTA) + novel 25-epoch quant-sensitivity TTT
Base: PR openai#462's SwiGLU + XSA4 + U-Net architecture (1.0672 BPB)
Novel additions (untried combination):
1. 25 TTT epochs (up from 10); the loss was still dropping at epoch 10
2. Per-layer TTT LR by quantization sensitivity:
   - MLP output projections: 3x LR (highest quantization damage)
- MLP input projections: 0.5x LR
- Everything else: 1x LR
3. optimize_ddp fix for DDP under PyTorch 2.4
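The per-layer LR scheme in item 2 can be sketched as optimizer param groups keyed on module names. The name substrings below ("c_proj" for MLP output projections, "c_fc" for MLP input projections) are assumptions for illustration; the repo's actual module naming may differ.

```python
# Hypothetical sketch of quant-sensitivity-scaled TTT learning rates.
# Name patterns ("mlp", "c_proj", "c_fc") are assumed, not taken from the repo.
def build_ttt_param_groups(named_params, base_lr):
    """Split (name, param) pairs into optimizer param groups:
    3x LR for MLP output projections (highest quantization damage),
    0.5x LR for MLP input projections, 1x LR for everything else."""
    out_proj, in_proj, rest = [], [], []
    for name, param in named_params:
        if "mlp" in name and "c_proj" in name:
            out_proj.append(param)   # MLP output projection: 3x LR
        elif "mlp" in name and "c_fc" in name:
            in_proj.append(param)    # MLP input projection: 0.5x LR
        else:
            rest.append(param)       # attention, norms, embeddings: 1x LR
    return [
        {"params": out_proj, "lr": 3.0 * base_lr},
        {"params": in_proj, "lr": 0.5 * base_lr},
        {"params": rest, "lr": 1.0 * base_lr},
    ]

# Demo with placeholder parameter names; in practice the groups would be
# passed to a PyTorch optimizer, e.g.
# torch.optim.AdamW(build_ttt_param_groups(model.named_parameters(), 1e-4))
groups = build_ttt_param_groups(
    [("blocks.0.mlp.c_fc.weight", "w_in"),
     ("blocks.0.mlp.c_proj.weight", "w_out"),
     ("blocks.0.attn.qkv.weight", "w_attn")],
    base_lr=1e-4,
)
```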
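Item 3 likely refers to Dynamo's DDP graph-splitting optimizer, which can interact badly with compiled models on some PyTorch versions. A common workaround is the config flag below; whether this commit uses exactly this knob is an assumption.

```python
# Assumed workaround: disable Dynamo's bucket-wise DDP graph splitting
# so torch.compile and DistributedDataParallel cooperate under PyTorch 2.4.
import torch._dynamo

torch._dynamo.config.optimize_ddp = False
```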
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Parent: 869ddf8
1 file changed: +525, -717 lines