# Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)

**val_bpb: 0.6683** (3-seed mean, std 0.0024) | **<16 MB** | 8xH100 SXM, 600s

## Results (8xH100 80GB SXM)

| Seed | Pre-TTT bpb | Post-TTT bpb | Eval time | Artifact |
|------|-------------|--------------|-----------|----------|
| 1337 | 1.1258 | **0.6663** | 371s | 15.63 MB |
| 42 | 1.1258 | **0.6710** | 371s | 15.78 MB |
| 2024 | 1.1258 | **0.6675** | 372s | 15.48 MB |
| **Mean** | 1.1258 | **0.6683** | 371s | |
| **Std** | | **0.0024** | | |

## Background

We introduced the first n-gram eval cache in this competition (PR #659, val_bpb=1.0920, March 22 2026). That original approach used a 5-gram cache with fixed mixing and an oracle safety gate that the organizers subsequently ruled illegal, because comparing mixed vs. original NLL peeks at the target.

This submission replaces the illegal oracle gate with entropy-adaptive mixing and multi-order backoff, combined with a drift-free TTT configuration.

## Technique

### 1. Multi-order N-gram Backoff (orders 2-7)

Instead of a single fixed n-gram order, we try the highest order first and cascade down on a miss. Each order uses 4M hash buckets to reduce collisions. This dramatically improves coverage: a fixed 7-gram misses whenever its exact 6-token context has not been seen before, but backing off through 6-, 5-, 4-, 3-, and 2-grams catches those cases.

N-gram counts are accumulated from already-scored tokens only and are updated after each chunk is scored.
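
The backoff lookup can be sketched in plain Python. This is a minimal illustration rather than the submission's implementation: the class and method names are hypothetical, and only the 4M-bucket hashing and highest-order-first cascade follow the description above.

```python
from collections import defaultdict

NUM_BUCKETS = 1 << 22  # ~4M hash buckets per order, as described above


class BackoffNgramCache:
    """Hashed n-gram counts for orders 2-7, queried highest order first."""

    def __init__(self, orders=range(7, 1, -1)):
        self.orders = list(orders)  # try 7-gram first, back off down to 2-gram
        # counts[n][bucket] -> {next_token: count}
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}

    def _bucket(self, context):
        return hash(context) % NUM_BUCKETS

    def update(self, tokens):
        """Accumulate counts from already-scored tokens only."""
        for n in self.orders:
            ctx_len = n - 1
            for i in range(ctx_len, len(tokens)):
                b = self._bucket(tuple(tokens[i - ctx_len:i]))
                self.counts[n][b][tokens[i]] += 1

    def predict(self, context):
        """Return (order, next-token counts) from the highest order that hits."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            if len(ctx) < n - 1:
                continue  # context too short for this order; back off
            hit = self.counts[n].get(self._bucket(ctx))
            if hit:
                return n, dict(hit)
        return None, {}
```

Calling `update` only after a chunk has been scored keeps the cache strictly backward-looking; `predict` never touches a target token.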

### 2. Entropy-Adaptive Alpha

```
alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
```

where H is the neural model's entropy over its own output distribution. When the model is uncertain (high entropy), we trust the n-gram statistics more; when it is confident (low entropy), we trust the model. Alpha depends solely on the model's output distribution, never on the true target, so there is no oracle selection.

The mixed probability is always applied:

```
p_mixed = (1 - alpha) * p_neural + alpha * p_ngram
```
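
As a concrete sketch (pure Python with illustrative names; the constants are the ones given above, and entropy is assumed to be measured in bits):

```python
import math


def entropy_bits(p):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)


def adaptive_alpha(H, a_min=0.05, a_range=0.55, k=2.0, h0=4.0):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)); uses model entropy only."""
    return a_min + a_range / (1.0 + math.exp(-k * (H - h0)))


def mix(p_neural, p_ngram):
    """Always-applied mix: p = (1 - alpha) * p_neural + alpha * p_ngram."""
    alpha = adaptive_alpha(entropy_bits(p_neural))
    return [(1.0 - alpha) * pn + alpha * pg
            for pn, pg in zip(p_neural, p_ngram)]
```

Near-zero entropy keeps alpha at its 0.05 floor; a very flat distribution pushes it toward the 0.60 ceiling, and the mix of two valid distributions remains a valid distribution.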

### 3. Drift-Free TTT Configuration

Standard TTT configurations suffer from late-chunk drift: BPB bottoms out around chunk 21, then climbs as cumulative adaptation becomes destructive. We use a conservative configuration that produces monotonic improvement through all 60 chunks:

| Parameter | Setting |
|-----------|---------|
| Unfrozen params | Q projections only (QTTT=1) |
| Mixer eta | 0.02 |
| TTT LR | 0.00003 |
| Chunk size | 1M tokens (60 chunks) |
| Epochs per chunk | 1 |
| Adaptive LR | Disabled |
| Polyak averaging | Disabled |

The most impactful hyperparameters are mixer eta and the TTT learning rate. Reducing eta from 0.1 to 0.02 prevents expert-weight runaway, and reducing the TTT LR from 1e-4 to 3e-5 prevents destructive late-chunk weight updates. Together they eliminate the drift pattern entirely: BPB drops monotonically from 1.15 at chunk 1 to 0.67 at chunk 60, never reversing.
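
The score-first ordering behind this can be sketched as a minimal skeleton (hypothetical function names, not the submission's actual API: `score_fn` stands for the frozen-weights eval pass and `train_fn` for the one-epoch Q-projection update):

```python
# Conservative settings from the table above. MIXER_ETA would be consumed
# by the mixer update (not shown in this skeleton).
MIXER_ETA = 0.02
TTT_LR = 3e-5


def run_ttt(chunks, score_fn, train_fn):
    """Score each chunk with current (frozen) weights BEFORE adapting on it."""
    bpb_log = []
    for chunk in chunks:
        bpb_log.append(score_fn(chunk))  # eval pass only: no weight updates
        train_fn(chunk, lr=TTT_LR)       # one epoch, Q projections only
    return bpb_log
```

Because scoring always precedes training on the same chunk, no token is ever evaluated by weights that have already seen it.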

## Ablation

| Configuration | val_bpb | Delta |
|---------------|---------|-------|
| Base model (no mixer, no TTT) | 1.1363 | baseline |
| TTT only (no mixer) | 1.1369 | +0.0006 |
| Mixer only (no TTT) | 0.6712 | -0.4651 |
| **Full system** | **0.6663** | **-0.4700** |

The ablation is unambiguous: the BackoffNgramMixer is the dominant contribution, accounting for 99% of the total improvement (-0.465 of -0.470 BPB). TTT alone with drift-free settings contributes nothing in isolation; it is in fact 0.0006 BPB worse than the base model. Combined with the mixer, TTT adds a marginal 0.005 BPB through slightly improved base predictions that the entropy-adaptive alpha can exploit.

The practical implication: n-gram backoff with entropy-adaptive mixing is a general technique applicable to any language-model evaluation. It requires no TTT, architectural changes, or retraining. It is a pure eval-time improvement that treats BPB as a compression problem and applies adaptive compression statistics gathered from already-scored tokens.

## Compliance

- **Score-first TTT:** Each chunk is scored under `torch.inference_mode()` before any training on that chunk
- **Backward-looking n-gram:** Counts come from already-scored tokens only, updated after scoring
- **No oracle selection:** Alpha depends on model entropy and never compares mixed vs. original NLL
- **No training data at eval:** Naive int5 per-row quantization only; no Hessian calibration, no training-data access during eval
- **Token count verified:** ratio_scored = 1.000000 (window-start fix applied)
- **No cross-GPU n-gram sync:** Each GPU maintains an independent cache

## Reproduction

```bash
pip install zstandard
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.02 \
QTTT=1 TTT_EPOCHS=1 TTT_FREEZE_BLOCKS=1 TTT_LR=0.00003 \
TTT_CHUNK_TOKENS=1048576 ADAPTIVE_LR=0 USE_POLYAK=0 \
EVAL_STRIDE=64 CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.08 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Architecture

11L, 512d, GQA 8H/4KV, MLP 3x, LeakyReLU(0.5)^2, XSA all 11 layers, Value Residual, Gated Attention, SmearGate, BigramHash(4096), Partial RoPE(16/64), LN Scale, EMA(0.997). Tied embeddings. Muon optimizer. ~5850 steps in 600s.

## Credits

- **PR #700 RoyiRa** - Base architecture, TTT framework, stride=64 eval
- **PR #606 gowtham0992** - int5 + Soft-Round QAT model
- **PR #727 Asukabot0** - Multi-order backoff concept, entropy-adaptive alpha formula
- **PR #461 Christopher-Lee-McClendon** - TTT recipe foundations
- **PR #518 sofiabod** - LeakyReLU(0.5)^2, cosine TTT scheduling
- **Dean Barr (this author)** - Original n-gram eval cache concept (first in competition, PR #659), drift-free TTT discovery, backoff+TTT combination, BackoffNgramMixer implementation