# Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)

**val_bpb: 0.6683** (3-seed mean, std 0.0024) | **<16 MB** | 8xH100 SXM, 600s

## Results (8xH100 80GB SXM)

| Seed | Pre-TTT bpb | Post-TTT bpb | Eval time | Artifact |
|------|-------------|--------------|-----------|----------|
| 1337 | 1.1258 | **0.6663** | 371s | 15.63 MB |
| 42 | 1.1258 | **0.6710** | 371s | 15.78 MB |
| 2024 | 1.1258 | **0.6675** | 372s | 15.48 MB |
| **Mean** | 1.1258 | **0.6683** | 371s | |
| **Std** | | **0.0024** | | |

## Background

We introduced the first n-gram eval cache in this competition (PR #659, val_bpb=1.0920, March 22 2026). That original approach used a 5-gram cache with fixed mixing and an oracle safety gate that the organizers subsequently ruled illegal, because comparing mixed vs. original NLL peeks at the target.

This submission replaces the illegal oracle gate with entropy-adaptive mixing and multi-order backoff, combined with a drift-free TTT configuration.

## Technique

### 1. Multi-order N-gram Backoff (orders 2-7)

Instead of a single fixed n-gram order, we try the highest order first and cascade down on a miss. Each order uses 4M hash buckets to reduce collisions. This dramatically improves coverage: a fixed 7-gram misses whenever its exact 6-token context has not been seen before, but backing off through 6-, 5-, 4-, 3-, and 2-grams catches those cases.

N-gram counts are accumulated from already-scored tokens only and are updated after each chunk is scored.
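
The backoff lookup can be sketched in plain Python. This is a minimal illustration rather than the submission's implementation: the class and method names are hypothetical, and only the 4M-bucket hashing and highest-order-first cascade follow the description above.

```python
from collections import defaultdict

NUM_BUCKETS = 1 << 22  # ~4M hash buckets per order, as described above


class BackoffNgramCache:
    """Hashed n-gram counts for orders 2-7, queried highest order first."""

    def __init__(self, orders=range(7, 1, -1)):
        self.orders = list(orders)  # try 7-gram first, back off down to 2-gram
        # counts[n][bucket] -> {next_token: count}
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}

    def _bucket(self, context):
        return hash(context) % NUM_BUCKETS

    def update(self, tokens):
        """Accumulate counts from already-scored tokens only."""
        for n in self.orders:
            ctx_len = n - 1
            for i in range(ctx_len, len(tokens)):
                b = self._bucket(tuple(tokens[i - ctx_len:i]))
                self.counts[n][b][tokens[i]] += 1

    def predict(self, context):
        """Return (order, next-token counts) from the highest order that hits."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            if len(ctx) < n - 1:
                continue  # context too short for this order; back off
            hit = self.counts[n].get(self._bucket(ctx))
            if hit:
                return n, dict(hit)
        return None, {}
```

Calling `update` only after a chunk has been scored keeps the cache strictly backward-looking; `predict` never touches a target token.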

### 2. Entropy-Adaptive Alpha

```
alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
```

where H is the neural model's entropy over its own output distribution. When the model is uncertain (high entropy), we trust the n-gram statistics more; when it is confident (low entropy), we trust the model. Alpha depends solely on the model's output distribution, never on the true target, so there is no oracle selection.

The mixed probability is always applied:

```
p_mixed = (1 - alpha) * p_neural + alpha * p_ngram
```
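
As a concrete sketch (pure Python with illustrative names; the constants are the ones given above, and entropy is assumed to be measured in bits):

```python
import math


def entropy_bits(p):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)


def adaptive_alpha(H, a_min=0.05, a_range=0.55, k=2.0, h0=4.0):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)); uses model entropy only."""
    return a_min + a_range / (1.0 + math.exp(-k * (H - h0)))


def mix(p_neural, p_ngram):
    """Always-applied mix: p = (1 - alpha) * p_neural + alpha * p_ngram."""
    alpha = adaptive_alpha(entropy_bits(p_neural))
    return [(1.0 - alpha) * pn + alpha * pg
            for pn, pg in zip(p_neural, p_ngram)]
```

Near-zero entropy keeps alpha at its 0.05 floor; a very flat distribution pushes it toward the 0.60 ceiling, and the mix of two valid distributions remains a valid distribution.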

### 3. Drift-Free TTT Configuration

Standard TTT configurations suffer from late-chunk drift: BPB bottoms out around chunk 21, then climbs as cumulative adaptation becomes destructive. We use a conservative configuration that produces monotonic improvement through all 60 chunks:

| Parameter | Setting |
|-----------|---------|
| Unfrozen params | Q projections only (QTTT=1) |
| Mixer eta | 0.02 |
| TTT LR | 0.00003 |
| Chunk size | 1M tokens (60 chunks) |
| Epochs per chunk | 1 |
| Adaptive LR | Disabled |
| Polyak averaging | Disabled |

The most impactful hyperparameters are mixer eta and the TTT learning rate. Reducing eta from 0.1 to 0.02 prevents expert-weight runaway, and reducing the TTT LR from 1e-4 to 3e-5 prevents destructive late-chunk weight updates. Together they eliminate the drift pattern entirely: BPB drops monotonically from 1.15 at chunk 1 to 0.67 at chunk 60, never reversing.
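
The score-first ordering behind this can be sketched as a minimal skeleton (hypothetical function names, not the submission's actual API: `score_fn` stands for the frozen-weights eval pass and `train_fn` for the one-epoch Q-projection update):

```python
# Conservative settings from the table above. MIXER_ETA would be consumed
# by the mixer update (not shown in this skeleton).
MIXER_ETA = 0.02
TTT_LR = 3e-5


def run_ttt(chunks, score_fn, train_fn):
    """Score each chunk with current (frozen) weights BEFORE adapting on it."""
    bpb_log = []
    for chunk in chunks:
        bpb_log.append(score_fn(chunk))  # eval pass only: no weight updates
        train_fn(chunk, lr=TTT_LR)       # one epoch, Q projections only
    return bpb_log
```

Because scoring always precedes training on the same chunk, no token is ever evaluated by weights that have already seen it.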

## Ablation

| Configuration | val_bpb | Delta |
|---------------|---------|-------|
| Base model (no mixer, no TTT) | 1.1363 | baseline |
| TTT only (no mixer) | 1.1369 | +0.0006 |
| Mixer only (no TTT) | 0.6712 | -0.4651 |
| **Full system** | **0.6663** | **-0.4700** |

The ablation is unambiguous: the BackoffNgramMixer is the dominant contribution, accounting for 99% of the total improvement (-0.465 of -0.470 BPB). TTT alone with drift-free settings contributes nothing in isolation; it is in fact 0.0006 BPB worse than the base model. Combined with the mixer, TTT adds a marginal 0.005 BPB through slightly improved base predictions that the entropy-adaptive alpha can exploit.

The practical implication: n-gram backoff with entropy-adaptive mixing is a general technique applicable to any language-model evaluation. It requires no TTT, architectural changes, or retraining. It is a pure eval-time improvement that treats BPB as a compression problem and applies adaptive compression statistics gathered from already-scored tokens.

## Compliance

- **Score-first TTT:** Each chunk is scored under `torch.inference_mode()` before any training on that chunk
- **Backward-looking n-gram:** Counts come from already-scored tokens only, updated after scoring
- **No oracle selection:** Alpha depends on model entropy and never compares mixed vs. original NLL
- **No training data at eval:** Naive int5 per-row quantization only; no Hessian calibration, no training-data access during eval
- **Token count verified:** ratio_scored = 1.000000 (window-start fix applied)
- **No cross-GPU n-gram sync:** Each GPU maintains an independent cache

## Reproduction

```bash
pip install zstandard
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.02 \
QTTT=1 TTT_EPOCHS=1 TTT_FREEZE_BLOCKS=1 TTT_LR=0.00003 \
TTT_CHUNK_TOKENS=1048576 ADAPTIVE_LR=0 USE_POLYAK=0 \
EVAL_STRIDE=64 CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.08 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Architecture

11L, 512d, GQA 8H/4KV, MLP 3x, LeakyReLU(0.5)^2, XSA all 11 layers, Value Residual, Gated Attention, SmearGate, BigramHash(4096), Partial RoPE(16/64), LN Scale, EMA(0.997). Tied embeddings. Muon optimizer. ~5850 steps in 600s.

## Credits

- **PR #700 RoyiRa** - Base architecture, TTT framework, stride=64 eval
- **PR #606 gowtham0992** - int5 + Soft-Round QAT model
- **PR #727 Asukabot0** - Multi-order backoff concept, entropy-adaptive alpha formula
- **PR #461 Christopher-Lee-McClendon** - TTT recipe foundations
- **PR #518 sofiabod** - LeakyReLU(0.5)^2, cosine TTT scheduling
- **Dean Barr (this author)** - Original n-gram eval cache concept (first in competition, PR #659), drift-free TTT discovery, backoff+TTT combination, BackoffNgramMixer implementation