# Record: 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841

**3-seed mean val_bpb: 0.2841** (std 0.0001) | **~15.85 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | EMA bpb | Quantized bpb | **N-gram bpb** |
|------|----------|-------|---------|---------------|----------------|
| 1337 | 88.6ms | 6,774 | 1.1193 | 1.1270 | **0.2841** |
| 42 | 88.8ms | 6,757 | 1.1194 | 1.1276 | **0.2840** |
| 2024 | 88.7ms | 6,769 | 1.1191 | 1.1275 | **0.2840** |
| **Mean** | **88.7ms** | **6,767** | **1.1193** | **1.1274** | **0.2841** |

## Key Innovation: N-gram Backoff Cache

An eval-time, backward-looking N-gram cache (orders 2-9) with entropy-adaptive alpha blending:

```
for each 65K-token chunk:
    Phase 1 -- SCORE: sliding window (stride=64) with N-gram interpolation
        - For each token, blend model P(token) with N-gram P(token) using adaptive alpha
        - Alpha determined by model entropy and N-gram order (higher orders = higher weight)
    Phase 2 -- UPDATE: add scored tokens to N-gram frequency tables (backward-looking only)
```

The N-gram cache reduces BPB roughly 4x (1.1274 -> 0.2841) by exploiting repeated phrases and patterns in the validation data. Scoring is strictly score-first: at the moment a token is scored, the cache contains only tokens that were scored earlier in the stream.
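The two-phase loop above can be sketched in a few lines. This is a minimal toy, not the record's implementation: it uses a plain dict instead of the 4M hashed buckets, a single fixed order and a fixed alpha instead of entropy-adaptive multi-order blending, and the function name and constants are illustrative.

```python
import math

def score_then_update(tokens, model_probs, counts, ctx_totals, order=3, alpha=0.5):
    """Phase 1 scores every token against the current (backward-looking) cache;
    Phase 2 then folds the same tokens into the frequency tables, so a token
    never sees counts derived from itself or from later tokens."""
    # Phase 1 -- SCORE: blend model P(token) with cached N-gram P(token)
    bits = 0.0
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])  # up to order-1 previous tokens
        total = ctx_totals.get(ctx, 0)
        p_ngram = counts.get((ctx, tok), 0) / total if total else 0.0
        p = (1.0 - alpha) * model_probs[i] + alpha * p_ngram
        bits -= math.log2(max(p, 1e-12))
    # Phase 2 -- UPDATE: only now do the scored tokens enter the cache
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])
        counts[(ctx, tok)] = counts.get((ctx, tok), 0) + 1
        ctx_totals[ctx] = ctx_totals.get(ctx, 0) + 1
    return bits / len(tokens)  # mean bits per token for this chunk

# Repeated chunks get cheaper once their patterns are cached:
counts, ctx_totals = {}, {}
chunk = [1, 2, 3, 1, 2, 3]
flat_model = [0.25] * len(chunk)  # stand-in for model probabilities
first = score_then_update(chunk, flat_model, counts, ctx_totals)
second = score_then_update(chunk, flat_model, counts, ctx_totals)
```

On the second pass every context is already in the cache, so the blended probability rises and per-token bits drop sharply; this is the same mechanism that pulls the record's 1.1274 bpb down to 0.2841 on repetitive validation text.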

- **4M hash buckets**, orders 2-9 with XOR-of-products hashing
- **Entropy-adaptive alpha**: sigmoid(entropy_scale * (entropy - center)), scaled by per-order multipliers
- **Per-order multipliers**: orders 2-3 suppressed (0.3x), orders 5-9 boosted (2.0x)
- **65K-token chunks**: cache refreshes every 65K tokens for maximum coverage
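The hashing and blending bullets can be sketched as follows. The per-position multiplier constants, `entropy_scale=1.5`, and `center=2.0` are hypothetical placeholders standing in for the record's tuned values; only the 4M bucket count and the 0.3x/2.0x order multipliers come from the description above.

```python
import math

NUM_BUCKETS = 1 << 22  # 4M hash buckets, as in the record
# One odd multiplier per context position (hypothetical constants); an order-9
# N-gram has at most 8 context tokens, so 8 multipliers suffice.
MULTS = [0x9E3779B1, 0x85EBCA77, 0xC2B2AE35, 0x27D4EB2F,
         0x165667B1, 0xFD7046C5, 0xB55A4F09, 0xCC9E2D51]
# Per-order alpha multipliers: orders 2-3 suppressed, orders 5-9 boosted
ORDER_MULT = {2: 0.3, 3: 0.3, 4: 1.0, 5: 2.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0}

def bucket(context):
    """XOR-of-products hash: XOR each token times its per-position multiplier,
    then reduce into the 4M-bucket table."""
    h = 0
    for pos, tok in enumerate(context):
        h ^= (tok + 1) * MULTS[pos]
    return h % NUM_BUCKETS

def alpha(entropy, order, entropy_scale=1.5, center=2.0):
    """Entropy-adaptive blend weight: trust the N-gram cache more when the
    model is uncertain (high entropy) and when the matched order is high."""
    base = 1.0 / (1.0 + math.exp(-entropy_scale * (entropy - center)))
    return min(1.0, base * ORDER_MULT[order])
```

A confident model (low entropy) keeps alpha near zero, so the cache only overrides the model where the model is unsure and, ideally, a long context matched.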

## Architecture (26.8M params)

- 11L, 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)²
- Parallel Muon with parameter banking + batched Newton-Schulz
- SmearGate, BigramHash(1024), Value Residual, Gated Attention
- XSA4, Partial RoPE(16/64), U-Net skips, OrthoInit
- EMA(0.997) + SWA, Late QAT, GPTQ-lite int6 + zstd-22
- Flash Attention 3, torch.compile(fullgraph=True)
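One plausible reading of the "LeakyReLU(0.5)²" activation, by analogy with the squared-ReLU used in similar speedruns: a leaky ReLU with negative slope 0.5, squared elementwise. This scalar sketch is an assumption about what PR #493 does, not its actual code.

```python
def leaky_relu_sq(x, slope=0.5):
    """Assumed activation: leaky_relu(x, slope) squared, applied elementwise
    inside the 3x-expansion MLP (a scalar stand-in for the tensor op)."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that under this reading the negative branch still matters after squaring: an input of -2 yields (0.5 * -2)² = 1 rather than the 0 a plain ReLU² would give.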

## Timing

- Training: 600s (6,770 steps at 88.7ms/step)
- Eval (N-gram): ~420s
- Total: ~1020s (within the separate 600s training and 600s eval budgets)

## Compliance

- [x] Training under 600s
- [x] Eval under 600s (N-gram ~420s)
- [x] Artifact under 16,000,000 bytes
- [x] N-gram cache is strictly backward-looking (updated AFTER scoring)
- [x] No training data access during evaluation
- [x] No oracle/hindsight selection

## Credits

- N-gram cache concept: PR #659 by @deanbrr, PR #674 by @newjordan
- Multi-order backoff + entropy-adaptive alpha: PR #702 by @lukacf
- Fine-grained chunk updates: PR #843 by @quietsmile
- Parallel Muon / parameter banking: PR #399 by @abaybektursun
- LeakyReLU²: PR #493 by @parinzee
- Base model stack: PR #414 by @signalrush