Commit a0e0841
Add AdamW TTT option — PR #462 shows 5x better TTT gain vs SGD
PR #462 achieves 1.0672 BPB. Their key finding: switching the TTT
optimizer from SGD to AdamW gives roughly 5x more improvement (0.053 vs
0.011 BPB). AdamW's per-parameter adaptive LR handles the
heterogeneous update needs of attention/MLP/control params
naturally — exactly what we were trying to do manually.
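
To illustrate the per-parameter scaling claim above, here is a minimal pure-Python sketch of one SGD step and one AdamW step on two scalar parameters whose gradients differ by four orders of magnitude. It is not the code from this commit or from PR #462. The function names, the gradient values, and the beta/eps/weight-decay settings are assumptions chosen for illustration; only the two learning rates come from this message.

```python
import math

def sgd_step(params, grads, lr):
    """Plain SGD: each parameter moves by lr times its own gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

def adamw_step(params, grads, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One AdamW step (decoupled weight decay). Returns (params, m, v).

    Hyperparameters other than lr are illustrative assumptions.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        p = p * (1 - lr * weight_decay)            # decay applied to the weight itself
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two hypothetical parameters with very different gradient scales.
params = [0.0, 0.0]
grads = [1e-3, 10.0]

sgd_params = sgd_step(params, grads, lr=0.002)
adamw_params, m, v = adamw_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0],
                                t=1, lr=0.0005)

# SGD step sizes inherit the gradient ratio (about 10,000x apart);
# AdamW normalizes each step, so both are close to lr in magnitude.
sgd_ratio = abs(sgd_params[1] / sgd_params[0])
adamw_ratio = abs(adamw_params[1] / adamw_params[0])
print(f"SGD step ratio:   {sgd_ratio:.1f}")
print(f"AdamW step ratio: {adamw_ratio:.5f}")
```

Under these assumptions a single global SGD learning rate is either too large for the high-gradient parameter or too small for the low-gradient one, which is the tuning problem the message describes handling manually.
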
New defaults (matching PR #462 recipe):
  TTT_OPTIMIZER=adamw (was implicit SGD)
  TTT_LR=0.0005 (was 0.002)
  TTT_EPOCHS=10 (was 3)
  TTT_FREEZE_BLOCKS=0 (was 2)
Fallback to SGD: TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_EPOCHS=3
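
The defaults and the SGD fallback above could be read from the environment along these lines. This is a sketch only: the diff content is not available here, so `TTTConfig` and `load_ttt_config` are hypothetical names, and the assumption that these settings are environment variables is inferred from the `NAME=value` form used in this message.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class TTTConfig:
    optimizer: str
    lr: float
    epochs: int
    freeze_blocks: int

def load_ttt_config(env=None):
    """Build a TTTConfig from an env-like mapping (defaults to os.environ).

    Default values are the new defaults listed in the commit message.
    """
    env = os.environ if env is None else env
    cfg = TTTConfig(
        optimizer=env.get("TTT_OPTIMIZER", "adamw").lower(),
        lr=float(env.get("TTT_LR", "0.0005")),
        epochs=int(env.get("TTT_EPOCHS", "10")),
        freeze_blocks=int(env.get("TTT_FREEZE_BLOCKS", "0")),
    )
    if cfg.optimizer not in ("adamw", "sgd"):
        raise ValueError(f"unsupported TTT_OPTIMIZER: {cfg.optimizer!r}")
    return cfg

# New defaults: nothing set in the environment mapping.
defaults = load_ttt_config({})

# SGD fallback exactly as written in the commit message.
fallback = load_ttt_config(
    {"TTT_OPTIMIZER": "sgd", "TTT_LR": "0.002", "TTT_EPOCHS": "3"}
)
print(defaults)
print(fallback)
```

Note that the fallback line does not set TTT_FREEZE_BLOCKS, so in this sketch it stays at the new default of 0. Reproducing the previous behavior in full would presumably also need TTT_FREEZE_BLOCKS=2.
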
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8b22b10 commit a0e0841
1 file changed, +10 -4 lines changed
(Diff content not preserved. Two hunks: original lines 86-95 become new lines 86-96, and original lines 928-934 become new lines 929-940.)