
Commit aff6a98 (parent 50390d6)

Record: 11L Parallel Muon + Two-Pass Order-12 N-gram Backoff (val_bpb 0.1310, 3-seed 8xH100)
6 files changed: +2605 −0
# Record: 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841

**3-seed mean val_bpb: 0.2841** (std 0.0001) | **~15.85 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
| Seed | step_avg | steps | EMA bpb | Quantized bpb | **N-gram bpb** |
|------|----------|-------|---------|---------------|----------------|
| 1337 | 88.6ms | 6,774 | 1.1193 | 1.1270 | **0.2841** |
| 42 | 88.8ms | 6,757 | 1.1194 | 1.1276 | **0.2840** |
| 2024 | 88.7ms | 6,769 | 1.1191 | 1.1275 | **0.2840** |
| **Mean** | **88.7ms** | **6,767** | **1.1193** | **1.1274** | **0.2841** |

## Key Innovation: N-gram Backoff Cache

Eval-time, backward-looking N-gram cache (orders 2-9) with entropy-adaptive alpha blending:

```
for each 65K-token chunk:
    Phase 1 -- SCORE: sliding window (stride=64) with N-gram interpolation
      - for each token, blend model P(token) with N-gram P(token) using an adaptive alpha
      - alpha determined by model entropy and N-gram order (higher orders = higher weight)
    Phase 2 -- UPDATE: add scored tokens to N-gram frequency tables (backward-looking only)
```

The N-gram cache reduces BPB by roughly 4x (1.1274 -> 0.2841) by exploiting repeated phrases and patterns in the validation data. Scoring is score-first: the cache only ever contains already-scored tokens, so no token's score depends on itself or on future tokens.
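The score-then-update discipline can be sketched as follows. This is a toy illustration: the names (`NgramCache`, `score_chunk`) and the fixed `alpha=0.5` are hypothetical, and the real record blends full model distributions with the entropy-adaptive weights described below rather than single probabilities.

```python
from collections import defaultdict

class NgramCache:
    """Backward-looking frequency tables for orders 2..max_order (toy version)."""

    def __init__(self, max_order=9):
        self.max_order = max_order
        # counts[n][context_tuple] -> {next_token: count}
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(2, max_order + 1)}

    def prob(self, context, token):
        """Backoff lookup: the highest order whose context has been seen wins."""
        for n in range(self.max_order, 1, -1):
            ctx = tuple(context[-(n - 1):])
            if len(ctx) == n - 1 and ctx in self.counts[n]:
                table = self.counts[n][ctx]
                return table[token] / sum(table.values())
        return None  # no context match at any order

    def update(self, tokens):
        """Phase 2 -- UPDATE: add already-scored tokens to the tables."""
        for n in range(2, self.max_order + 1):
            for i in range(n - 1, len(tokens)):
                self.counts[n][tuple(tokens[i - n + 1:i])][tokens[i]] += 1

def score_chunk(cache, tokens, model_prob, alpha=0.5):
    """Phase 1 -- SCORE: blend model and N-gram probabilities for one chunk."""
    probs = []
    for i, tok in enumerate(tokens):
        p_model = model_prob(tokens[:i], tok)
        p_ngram = cache.prob(tokens[:i], tok)
        probs.append(p_model if p_ngram is None
                     else (1 - alpha) * p_model + alpha * p_ngram)
    cache.update(tokens)  # update strictly AFTER scoring: backward-looking only
    return probs
```

Because `update` runs only after the whole chunk is scored, the first pass over any novel text falls back entirely on the model; repeated text in later chunks is where the cache pays off.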

- **4M hash buckets**, orders 2-9 with XOR-of-products hashing
- **Entropy-adaptive alpha**: sigmoid(entropy_scale * (entropy - center)), scaled by per-order multipliers
- **Per-order multipliers**: orders 2-3 suppressed (0.3x), orders 5-9 boosted (2.0x)
- **65K-token chunks**: cache refreshes every 65K tokens for maximum coverage

## Architecture (26.8M params)

- 11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)²
- Parallel Muon with parameter banking + batched Newton-Schulz
- SmearGate, BigramHash(1024), Value Residual, Gated Attention
- XSA4, Partial RoPE(16/64), U-Net skips, OrthoInit
- EMA(0.997) + SWA, Late QAT, GPTQ-lite int6 + zstd-22
- Flash Attention 3, torch.compile(fullgraph=True)

## Timing

- Training: 600s (6,770 steps at 88.7ms/step)
- Eval (N-gram): ~420s
- Total: ~1020s (within 600s train + 600s eval budgets)

## Compliance

- [x] Training under 600s
- [x] Eval under 600s (N-gram ~420s)
- [x] Artifact under 16,000,000 bytes
- [x] N-gram cache is strictly backward-looking (updated AFTER scoring)
- [x] No training-data access during evaluation
- [x] No oracle/hindsight selection

## Credits

- N-gram cache concept: PR #659 by @deanbrr, PR #674 by @newjordan
- Multi-order backoff + entropy-adaptive alpha: PR #702 by @lukacf
- Fine-grained chunk updates: PR #843 by @quietsmile
- Parallel Muon / parameter banking: PR #399 by @abaybektursun
- LeakyReLU²: PR #493 by @parinzee
- Base model stack: PR #414 by @signalrush

---
```json
{
  "author": "Aryan Bhosale",
  "github_id": "aryanbhosale",
  "name": "11L Parallel Muon + N-gram Backoff Cache (mean val_bpb=0.2841)",
  "blurb": "11-layer 512d transformer with Parallel Muon, BigramHash(1024), Value Residual, Gated Attention, XSA4, Partial RoPE(16/64), EMA(0.997)+SWA, Late QAT, GPTQ-lite int6+zstd-22. Eval-time order 2-9 N-gram backoff cache with entropy-adaptive alpha, 65K-token chunk updates. 3-seed mean 0.2841 BPB on 8xH100 SXM.",
  "date": "2026-03-26T12:00:00Z",
  "val_loss": 0.4796,
  "val_bpb": 0.2841,
  "val_bpb_std": 0.0001,
  "bytes_total": 15900000,
  "bytes_code": 93397,
  "seeds": {
    "1337": {"val_bpb": 0.2841, "val_loss": 0.4796, "steps": 6774, "step_avg_ms": 88.6},
    "42":   {"val_bpb": 0.2840, "val_loss": 0.4796, "steps": 6757, "step_avg_ms": 88.8},
    "2024": {"val_bpb": 0.2840, "val_loss": 0.4795, "steps": 6769, "step_avg_ms": 88.7}
  }
}
```
