# Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)

**3-seed mean val_bpb: 1.0672** | Best seed: 1.0658
Verified on 8xH100 80GB, 10-minute wall-clock budget.

## Approach
A novel architecture discovered through GEPA (Gemini-driven Evolutionary Parameter Architecture search), combined with community-proven training techniques. Built over 5 days and 6 waves of experiments, for roughly $250 of total compute on Modal H100s.

### Architecture (discovered by GEPA)
- **SwiGLU FFN** with Star-ReLU activation
- **U-Net skip connections** with learned gating
- **BigramHash embeddings** (8192 buckets, 128 dim)
- **SmearGate** on embeddings
- 11 layers, 512 dim, 8 heads, 8 KV heads, MLP hidden=1792, tied embeddings
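The SwiGLU + Star-ReLU combination can be sketched in a few lines. This is an illustrative NumPy sketch, not the actual `train_gpt.py` implementation: the StarReLU constants below are the MetaFormer paper's defaults (assumed here, not confirmed by this record), and the weight shapes use the 512/1792 dims from the list above.

```python
import numpy as np

def star_relu(x, s=0.8944, b=-0.4472):
    # StarReLU: s * ReLU(x)^2 + b (constants assumed from MetaFormer)
    return s * np.maximum(x, 0.0) ** 2 + b

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU-style FFN with the gate branch passed through StarReLU
    # instead of the usual SiLU (matching the record's description)
    return (star_relu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 512, 1792  # dims from the record
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.02
w_up = rng.standard_normal((d_model, d_hidden)) * 0.02
w_down = rng.standard_normal((d_hidden, d_model)) * 0.02
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 512)
```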

### Training techniques (adopted + tuned)
- **XSA4** (cross-sequence attention on last 4 layers) -- credited to @felipe-parodi (#398)
- **EMA** (decay=0.9985) -- credited to @felipe-parodi (#398), decay tuned by us
- **AdamW TTT** (lr=0.0005, 10 epochs, wd=0.0) -- credited to @sjp611 (#442)
- **Partial RoPE** (16 dims) -- credited to @felipe-parodi (#398)
- **LN Scale** (1/sqrt(layer_idx+1)) -- credited to @felipe-parodi (#398)
- **Late QAT** (threshold 0.15) -- credited to @fbedev (#410)
- Muon optimizer (matrix_lr=0.025, wd=0.04, momentum=0.99)
- Warmdown: 6000 steps
- Int6 quantization + zstd-22 compression
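As a concrete example of one item above: the weight EMA with decay=0.9985 is a one-line update per training step. A minimal sketch (assumed standard form; this record does not show its exact EMA code):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9985):
    # Exponential moving average of weights; decay from the record
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy check: EMA drifts slowly toward the live weights.
# With e_0 = 0 and constant target 1, e_t = 1 - decay**t.
ema = [np.zeros(3)]
live = [np.ones(3)]
for _ in range(1000):
    ema = ema_update(ema, live)
print(float(ema[0][0]))  # ~0.777 after 1000 steps (1 - 0.9985**1000)
```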

## 3-Seed Results

| Seed | val_bpb |
|------|---------|
| 42 | 1.06733191 |
| 123 | 1.06833018 |
| 7 | 1.06579646 |
| **Mean** | **1.06715285** |
| **Std** | **0.00104211** |
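The Mean and Std rows can be reproduced from the three seed values; note the table's Std is the population (ddof=0) standard deviation, not the sample one. A quick check:

```python
import numpy as np

bpb = np.array([1.06733191, 1.06833018, 1.06579646])  # seeds 42, 123, 7
mean = float(bpb.mean())
std = float(bpb.std(ddof=0))  # population std, matching the Std row
print(round(mean, 8), round(std, 8))  # matches the Mean/Std rows above
```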

## Comparison to prior SOTA

| Submission | Mean BPB | Best BPB |
|-----------|----------|----------|
| **Ours** | **1.0672** | **1.0658** |
| @sjp611 (#442) | 1.1027 | 1.0992 |
| @felipe-parodi (#398) | 1.1221 | 1.1213 |
| @thwu1 (#180, merged) | 1.1428 | -- |

## Key finding

AdamW TTT produced a 0.053 bpb improvement on our architecture vs 0.019 on the standard architecture (PR #398). This suggests SwiGLU + U-Net skip connections create a loss landscape that AdamW navigates significantly better than SGD during test-time training.
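To make the mechanism concrete, a single AdamW step (decoupled weight decay, per Loshchilov & Hutter) with the record's TTT settings lr=0.0005, wd=0.0 can be sketched as below. This is an illustrative NumPy toy on a scalar objective, not the submission's actual TTT loop:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.0):
    # One AdamW update; lr and wd match the record's TTT settings
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)  # decoupled wd
    return p, m, v

# Toy "TTT" loop: minimize (p - 1)^2 for 100 steps.
# With a consistent gradient sign, each Adam step moves ~lr, so p ~ 0.05.
p = np.array(0.0)
m = v = np.array(0.0)
for t in range(1, 101):
    g = 2.0 * (p - 1.0)
    p, m, v = adamw_step(p, g, m, v, t)
print(float(p))  # moved toward 1.0 by roughly 100 * lr
```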

## Credits

- **@felipe-parodi** (#398): EMA, TTT, XSA4, Partial RoPE, LN Scale
- **@sjp611** (#442): AdamW TTT
- **@fbedev** (#410): Late QAT
- **@thwu1** (#180): 11-layer architecture direction
- Compute provided by **Modal**

Built by [@JoePro](https://x.com/JoePro) (GitHub: [@JoeProAI](https://github.com/JoeProAI)) with AI agent assistance: [OpenClaw](https://openclaw.ai) (Claude Opus), Codex (GPT-5.4), Claude Sonnet, Gemini 2.5 Pro, and Paperclip agent coordination.

## Run command

```bash
# Default seed
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Specific seed
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in `train_gpt.py`.