Commit 8bd231a

Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)
Architecture discovered via GEPA (Gemini-driven evolutionary search). SwiGLU FFN, Star-ReLU, U-Net skip gates, BigramHash 8192, XSA4. AdamW TTT (lr=0.0005, 10ep) from @sjp611 (openai#442). EMA, RoPE, LN Scale, QAT from @felipe-parodi (openai#398) and @fbedev (openai#410). 3-seed results: 1.06733 / 1.06833 / 1.06580 Mean: 1.06715, Std: 0.00104 Built by @joepro with AI agents via OpenClaw. Compute provided by Modal.
1 parent 0f51451 commit 8bd231a

# Record: SwiGLU + XSA4 + U-Net + AdamW TTT (3-seed mean val_bpb=1.0672)

**3-seed mean val_bpb: 1.0672** | Best seed: 1.0658

Verified on 8xH100 80GB under a 10-minute wall-clock budget.

## Approach

A novel architecture discovered through GEPA (Gemini-driven Evolutionary Parameter Architecture search), combined with community-proven techniques. Built over 5 days and 6 waves of experiments, at ~$250 total compute on Modal H100s.

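The search loop can be sketched abstractly. Everything below is an assumption for illustration: the record does not publish the search code, and `mutate`/`score` stand in for Gemini-proposed config edits and a short training run returning val_bpb.

```python
def evolve(seed_cfg, mutate, score, waves=6, population=4):
    """Keep the lowest-val_bpb candidate across successive waves of mutations.

    mutate(cfg) -> a perturbed config; score(cfg) -> val_bpb (lower is better).
    Both callables are placeholders for the LLM-driven pieces of the search.
    """
    best_score, best_cfg = score(seed_cfg), seed_cfg
    for _ in range(waves):
        for cfg in (mutate(best_cfg) for _ in range(population)):
            s = score(cfg)
            if s < best_score:  # lower bits-per-byte is better
                best_score, best_cfg = s, cfg
    return best_score, best_cfg
```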
### Architecture (discovered by GEPA)

- **SwiGLU FFN** with Star-ReLU activation
- **U-Net skip connections** with learned gating
- **BigramHash embeddings** (8192 buckets, 128 dim)
- **SmearGate** on embeddings
- 11 layers, 512 dim, 8 heads, 8 KV heads, MLP hidden=1792, tied embeddings

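The first bullet can be sketched in PyTorch. This is a minimal illustration, not the repo's implementation: `GatedFFN` and its projection names are hypothetical, the dims match the record (512 model, 1792 hidden), and the Star-ReLU constants are the published MetaFormer defaults.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Star-ReLU: s * relu(x)**2 + b, with learnable scale and bias
    (default constants from the MetaFormer paper)."""
    def __init__(self, scale: float = 0.8944, bias: float = -0.4472):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias

class GatedFFN(nn.Module):
    """SwiGLU-style gated FFN with Star-ReLU replacing the usual SiLU gate."""
    def __init__(self, dim: int = 512, hidden: int = 1792):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
        self.act = StarReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # down-project (activated gate * linear up-projection)
        return self.w_down(self.act(self.w_gate(x)) * self.w_up(x))
```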
### Training techniques (adopted + tuned)

- **XSA4** (cross-sequence attention on the last 4 layers) -- credited to @felipe-parodi (#398)
- **EMA** (decay=0.9985) -- credited to @felipe-parodi (#398); decay tuned by us
- **AdamW TTT** (lr=0.0005, 10 epochs, wd=0.0) -- credited to @sjp611 (#442)
- **Partial RoPE** (16 dims) -- credited to @felipe-parodi (#398)
- **LN Scale** (1/sqrt(layer_idx+1)) -- credited to @felipe-parodi (#398)
- **Late QAT** (threshold 0.15) -- credited to @fbedev (#410)
- Muon optimizer (matrix_lr=0.025, wd=0.04, momentum=0.99)
- Warmdown: 6000 steps
- Int6 quantization + zstd-22 compression

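The AdamW TTT step above can be sketched as a short fine-tune on the evaluation sequence before scoring. Only the optimizer hyperparameters (lr=0.0005, 10 epochs, wd=0.0) come from the record; the function name and the assumption that TTT data arrives as (input, target) batches are illustrative.

```python
import torch

def ttt_adamw(model, batches, loss_fn, lr=5e-4, epochs=10, weight_decay=0.0):
    """Test-time training sketch: briefly fine-tune on the eval data itself
    with AdamW, then return the model in eval mode for scoring."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    model.train()
    for _ in range(epochs):
        for inputs, targets in batches:
            opt.zero_grad()
            loss_fn(model(inputs), targets).backward()
            opt.step()
    model.eval()
    return model
```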
## 3-Seed Results

| Seed | val_bpb |
|------|---------|
| 42 | 1.06733191 |
| 123 | 1.06833018 |
| 7 | 1.06579646 |
| **Mean** | **1.06715285** |
| **Std** | **0.00104211** |

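The summary rows can be reproduced from the per-seed values; note the Std row is a population standard deviation (ddof=0), not a sample one:

```python
import statistics

val_bpb = {42: 1.06733191, 123: 1.06833018, 7: 1.06579646}
vals = list(val_bpb.values())

mean = sum(vals) / len(vals)   # -> 1.06715285
std = statistics.pstdev(vals)  # population std -> 0.00104211 (rounded)
print(f"mean={mean:.8f}  std={std:.8f}")
```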
## Comparison to prior SOTA

| Submission | Mean BPB | Best BPB |
|-----------|----------|----------|
| **Ours** | **1.0672** | **1.0658** |
| @sjp611 (#442) | 1.1027 | 1.0992 |
| @felipe-parodi (#398) | 1.1221 | 1.1213 |
| @thwu1 (#180, merged) | 1.1428 | -- |

## Key finding

AdamW TTT produced a 0.053 bpb improvement on our architecture, versus 0.019 on the standard architecture (PR #398). This suggests SwiGLU + U-Net skip connections create a loss landscape that AdamW navigates significantly better than SGD does during test-time training.

## Credits

- **@felipe-parodi** (#398): EMA, TTT, XSA4, Partial RoPE, LN Scale
- **@sjp611** (#442): AdamW TTT
- **@fbedev** (#410): Late QAT
- **@thwu1** (#180): 11-layer architecture direction
- Compute provided by **Modal**

Built by [@JoePro](https://x.com/JoePro) (GitHub: [@JoeProAI](https://github.com/JoeProAI)) with AI agent assistance: [OpenClaw](https://openclaw.ai) (Claude Opus), Codex (GPT-5.4), Claude Sonnet, Gemini 2.5 Pro, and Paperclip agent coordination.

## Run command

```bash
# Default seed
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Specific seed
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in `train_gpt.py`.
