
Commit 548fe64

Pavel Liashkov and claude committed
Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03
Two changes from PR openai#834: MATRIX_LR=0.03 and TTT_EPOCHS=0. Beats PR openai#834's 0.1663 WITH TTT by removing TTT and using a higher LR.

- Learned mixer head: Linear(512→7) predicts per-token expert weights
- No TTT — zero gradient updates on validation data
- N-gram backoff cache (orders 2-7), single-pass, backward-looking
- 11L, MHA 8/8, MLP 3.5x, 15.59 MB artifact
- 8xH100 SXM, 600s training, 515s eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 226d817 commit 548fe64

File tree

5 files changed: +3433 −483 lines

Lines changed: 48 additions & 0 deletions

# Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03
**val_bpb = 0.1582** (seed 42, additional seeds pending) | **15.59 MB** | 8xH100 SXM | **No TTT**

## Results

| Seed | Steps | ms/step | Sliding BPB | **Mixer BPB** | Artifact (bytes) |
|------|-------|---------|-------------|---------------|------------------|
| 42 | 5,300 | 113 | 1.1396 | **0.1582** | 15,590,944 |
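For context, BPB here is bits per byte of the raw validation text. A minimal sketch of the standard conversion from mean cross-entropy (nats per token) to bits per byte — the function and argument names below are ours, not from the repo:

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean cross-entropy loss in nats/token to bits per byte.

    Total nats over the split are rescaled to bits (divide by ln 2),
    then divided by the number of raw bytes the tokens decode to.
    """
    total_bits = nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes
```

With a byte-level tokenizer (one token per byte), a loss of ln 2 nats/token is exactly 1 BPB.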
## Two Key Changes from PR #834

1. **MATRIX_LR=0.03** (was 0.025) — discovered through systematic screening of 79+ experiments
2. **TTT_EPOCHS=0** — completely removes test-time training. The result is clean and fully legal: no gradient updates on validation data.

Despite removing TTT, our result (0.1582) **beats** PR #834's original (0.1663 with TTT enabled). The higher matrix LR produces a better-trained model that the learned mixing head can leverage more effectively.

## Architecture (from PR #834)

- 11L, 512d, MHA 8/8, MLP 3.5x, LeakyReLU(0.5)²
- **Learned mixer head**: `Linear(512 → 7)` predicts per-token mixing weights for the neural model + n-gram orders 2-7
- **Frozen n-gram oracle**: bigram/trigram/.../7-gram tables precomputed from training data, used as lookup during training
- Mixed int5/int6 quantization + GPTQ + zstd, EMA(0.997), CROWN-Q penalty
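The mixer head above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the PR's code: class and argument names are ours, and we assume each expert supplies a normalized next-token distribution.

```python
import torch
import torch.nn as nn

class MixerHead(nn.Module):
    """Sketch: a linear head over the transformer hidden state that emits
    per-token softmax weights for 7 experts (neural model + n-gram orders 2-7)."""

    def __init__(self, d_model: int = 512, n_experts: int = 7):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)

    def forward(self, hidden: torch.Tensor, expert_probs: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model)
        # expert_probs: (batch, seq, n_experts, vocab), one distribution per expert
        weights = torch.softmax(self.proj(hidden), dim=-1)           # (B, T, E)
        mixed = (weights.unsqueeze(-1) * expert_probs).sum(dim=-2)   # (B, T, V)
        return mixed
```

Because the weights are a softmax over experts and each expert's distribution sums to 1, the mixed output is itself a valid distribution per token.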
## Eval: Learned Multi-Expert Mixing (NO TTT)

- Score-first, backward-looking n-gram cache (orders 2-7)
- Model-predicted mixing weights (not a fixed alpha — learned during training)
- Each token gets its own expert weights based on its transformer hidden state
- **515s eval time** (within the 600s budget, no TTT overhead)
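The score-first, backward-looking cache can be illustrated as follows. This is a simplified sketch under our own assumptions (longest-match backoff, uniform fallback, no smoothing); the function name and signature are ours:

```python
from collections import defaultdict

def ngram_backoff_scores(tokens, max_order=7, vocab_size=256):
    """Score-first, backward-looking n-gram cache (orders 2..max_order).

    Each token is scored using counts accumulated from strictly earlier
    positions only; the cache is updated after scoring, so no current or
    future information leaks into the score.
    """
    # counts[n][context] -> {next_token: count}, with context length n-1
    counts = {n: defaultdict(lambda: defaultdict(int))
              for n in range(2, max_order + 1)}
    probs = []
    for i, tok in enumerate(tokens):
        # 1) Score: back off from the longest context seen so far.
        p = 1.0 / vocab_size  # uniform fallback (a real impl would smooth)
        for n in range(max_order, 1, -1):
            start = i - (n - 1)
            if start < 0:
                continue
            ctx = tuple(tokens[start:i])
            if ctx in counts[n]:
                dist = counts[n][ctx]
                p = dist[tok] / sum(dist.values())
                break
        probs.append(p)
        # 2) Update: only after scoring does the current token enter the cache.
        for n in range(2, max_order + 1):
            start = i - (n - 1)
            if start >= 0:
                counts[n][tuple(tokens[start:i])][tok] += 1
    return probs
```

Single-pass and strictly causal: swapping steps 1 and 2 would let the token score itself, which is exactly what the "score-first" constraint rules out.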

## Reproduction

```bash
MATRIX_LR=0.03 TTT_EPOCHS=0 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Legality

- No TTT (zero gradient updates on validation data)
- N-gram cache is backward-looking (score-first; the cache is updated only after scoring)
- Learned mixing head trained on training data only (frozen oracle)
- Single-pass evaluation

## Based On

- PR #834: Learned Multi-Expert Gate + Frozen Oracle architecture
- Our systematic hyperparameter screening (79+ experiments, which surfaced MATRIX_LR=0.03)
Lines changed: 14 additions & 0 deletions

{
  "track": "10min_16mb",
  "date": "2026-03-26",
  "name": "Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03",
  "author": "bigbag",
  "github": "bigbag",
  "seed_results": {
    "42": {"val_loss": 0.267132, "val_bpb": 0.158210, "artifact_bytes": 15590944}
  },
  "mean_val_loss": 0.267132,
  "mean_val_bpb": 0.158210,
  "code_bytes": 91966,
  "notes": "Additional seeds pending. Based on PR #834 with MATRIX_LR=0.03 and TTT_EPOCHS=0."
}
