Commit bd5e1b9

Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)
1 parent 630bb5e commit bd5e1b9

File tree: 10 files changed, +2525 -0 lines

Lines changed: 101 additions & 0 deletions
# Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)

**val_bpb: 0.6683** (3-seed mean, std 0.0024) | **<16 MB** | 8xH100 SXM, 600s

## Results (8xH100 80GB SXM)

| Seed | Pre-TTT bpb | Post-TTT bpb | Eval time | Artifact |
|------|-------------|--------------|-----------|----------|
| 1337 | 1.1258 | **0.6663** | 371s | 15.63 MB |
| 42 | 1.1258 | **0.6710** | 371s | 15.78 MB |
| 2024 | 1.1258 | **0.6675** | 372s | 15.48 MB |
| **Mean** | 1.1258 | **0.6683** | 371s | |
| **Std** | | **0.0024** | | |
## Background

We introduced the first n-gram eval cache in this competition (PR #659, val_bpb=1.0920, March 22 2026). That original approach used a 5-gram cache with fixed mixing and an oracle safety gate that was subsequently ruled illegal by the organizers (comparing mixed vs. original NLL peeks at the target).

This submission replaces the illegal oracle gate with entropy-adaptive mixing and multi-order backoff, combined with a drift-free TTT configuration.
## Technique

### 1. Multi-order N-gram Backoff (orders 2-7)

Instead of committing to a single fixed n-gram order, we try the highest order first and cascade down on a miss. Each order uses 4M hash buckets to reduce collisions. This dramatically improves coverage: a fixed 7-gram misses whenever its exact 6-token context has not been seen before, but backing off through the 6-, 5-, 4-, 3-, and 2-gram tables catches those cases.

N-gram counts are accumulated from already-scored tokens only, and are updated after scoring each chunk.
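As a concrete sketch of this cascade (class and method names here are hypothetical, not the submission's actual code), a backward-looking mixer with per-order hashed count tables might look like:

```python
from collections import defaultdict

NUM_BUCKETS = 4_000_000  # 4M hash buckets per order, as described above


class BackoffNgramMixer:
    """Highest-order-first n-gram lookup (orders 7 down to 2) over hashed contexts.

    Counts come only from tokens that have already been scored, so the
    predictor is strictly backward-looking. Names are illustrative.
    """

    def __init__(self, orders=range(7, 1, -1), num_buckets=NUM_BUCKETS):
        self.orders = list(orders)  # try the 7-gram table first, back off to 2-gram
        self.num_buckets = num_buckets
        # counts[n][bucket] maps next-token -> count for a hashed (n-1)-token context
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _bucket(self, context):
        return hash(context) % self.num_buckets

    def update(self, tokens):
        """Accumulate counts from already-scored tokens (called after scoring a chunk)."""
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                ctx = tuple(tokens[i - n + 1 : i])
                self.counts[n][self._bucket(ctx)][tokens[i]] += 1

    def predict(self, context):
        """Return a next-token distribution from the highest order whose context was seen."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            if len(ctx) < n - 1:
                continue  # context too short for this order
            table = self.counts[n].get(self._bucket(ctx))  # .get avoids creating entries
            if table:
                total = sum(table.values())
                return {tok: c / total for tok, c in table.items()}
        return None  # full miss at every order: use the neural distribution alone
```

After `update([1, 2, 3, 4, 5, 6, 7, 8])`, `predict([2, 3, 4, 5, 6, 7])` hits the 7-gram table directly, while a query like `predict([0, 0, 3])` falls all the way back to the 2-gram table.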
### 2. Entropy-Adaptive Alpha

```
alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
```

where H is the neural model's own entropy over its output distribution. When the model is uncertain (high entropy), we trust the n-gram statistics more; when it is confident (low entropy), we trust the model. Alpha depends solely on the model's output distribution, never on the true target, so there is no oracle selection.

The mixed probability is always applied:

```
p_mixed = (1 - alpha) * p_neural + alpha * p_ngram
```
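A minimal sketch of the two formulas above (function names are hypothetical; we assume H is measured in bits, which is consistent with a bits-per-byte metric, though the source does not state the base explicitly):

```python
import math


def entropy_bits(p):
    """Shannon entropy of a distribution, in bits (base-2 assumed)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_alpha(p_neural):
    """alpha in (0.05, 0.60): rises toward 0.60 as the model's own entropy grows."""
    H = entropy_bits(p_neural)
    return 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))


def mix(p_neural, p_ngram):
    """p_mixed = (1 - alpha) * p_neural + alpha * p_ngram, always applied."""
    a = adaptive_alpha(p_neural)
    return [(1 - a) * pn + a * pg for pn, pg in zip(p_neural, p_ngram)]
```

A near-uniform distribution over a 1024-token vocabulary (H = 10 bits) pushes alpha close to its 0.60 ceiling, while a sharply peaked distribution keeps it near the 0.05 floor; mixing two valid distributions always yields a valid distribution since alpha is a convex weight.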
### 3. Drift-Free TTT Configuration

Standard TTT configurations suffer from late-chunk drift: BPB bottoms out around chunk 21 and then climbs as cumulative adaptation becomes destructive. We use a conservative configuration that produces monotonic improvement through all 60 chunks:

| Parameter | Setting |
|-----------|---------|
| Unfrozen params | Q projections only (QTTT=1) |
| Mixer eta | 0.02 |
| TTT LR | 0.00003 |
| Chunk size | 1M tokens (60 chunks) |
| Epochs per chunk | 1 |
| Adaptive LR | Disabled |
| Polyak averaging | Disabled |

The most impactful hyperparameters are the mixer eta and the TTT learning rate. Reducing eta from 0.1 to 0.02 prevents expert-weight runaway; reducing the TTT LR from 1e-4 to 3e-5 prevents destructive late-chunk weight updates. Together these eliminate the drift pattern entirely: BPB drops monotonically from 1.15 at chunk 1 to 0.67 at chunk 60, never reversing.
## Ablation

| Configuration | val_bpb | Delta |
|---------------|---------|-------|
| Base model (no mixer, no TTT) | 1.1363 | baseline |
| TTT only (no mixer) | 1.1369 | +0.001 |
| Mixer only (no TTT) | 0.6712 | -0.465 |
| **Full system** | **0.6663** | **-0.470** |

The ablation is unambiguous: the BackoffNgramMixer is the dominant innovation, contributing 99% of the total improvement (-0.465 of -0.470 BPB). TTT alone, even with the drift-free settings, contributes essentially nothing in isolation (it is marginally worse than baseline). Combined with the mixer, TTT adds a further 0.005 BPB through slightly improved base predictions that the entropy-adaptive alpha can exploit.

The practical implication: n-gram backoff with entropy-adaptive mixing is a general technique applicable to any language-model evaluation. It requires no TTT, no architectural changes, and no retraining. It is a pure eval-time improvement that treats BPB as a compression problem and applies adaptive compression statistics gathered from already-scored tokens.
## Compliance

- **Score-first TTT:** Each chunk is scored under `torch.inference_mode()` before any training on that chunk
- **Backward-looking n-gram:** Counts come from already-scored tokens only, updated after scoring
- **No oracle selection:** Alpha depends on model entropy and never compares mixed vs. original NLL
- **No training data at eval:** Naive int5 per-row quantization only; no Hessian calibration, no training-data access during eval
- **Token count verified:** ratio_scored = 1.000000 (window-start fix applied)
- **No cross-GPU n-gram sync:** Each GPU maintains an independent cache
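The score-first discipline in the first bullet can be sketched as a plain loop (helper names are hypothetical; the real run wraps the scoring step in `torch.inference_mode()`):

```python
def evaluate_with_ttt(chunks, score_fn, train_fn):
    """Score-first TTT: every chunk is scored BEFORE the model may train on it,
    so no weight update ever influences the score of a token it has seen.

    score_fn(chunk) -> total bits for the chunk (inference only, no gradients)
    train_fn(chunk) -> adapts the model; called strictly after scoring
    """
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score_fn(chunk)   # score under inference mode first
        total_tokens += len(chunk)
        train_fn(chunk)                 # only then adapt on the same chunk
    return total_bits / total_tokens    # average bits per token over all chunks
```

The ordering is the entire point: swapping the two calls inside the loop would let each chunk's own tokens improve their own score, which is exactly what the compliance rule forbids.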
## Reproduction

```bash
pip install zstandard
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.02 \
QTTT=1 TTT_EPOCHS=1 TTT_FREEZE_BLOCKS=1 TTT_LR=0.00003 \
TTT_CHUNK_TOKENS=1048576 ADAPTIVE_LR=0 USE_POLYAK=0 \
EVAL_STRIDE=64 CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.08 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
## Architecture

11L, 512d, GQA 8H/4KV, MLP 3x, LeakyReLU(0.5)^2, XSA all 11 layers, Value Residual, Gated Attention, SmearGate, BigramHash(4096), Partial RoPE(16/64), LN Scale, EMA(0.997). Tied embeddings. Muon optimizer. ~5850 steps in 600s.

## Credits

- **PR #700 RoyiRa** - Base architecture, TTT framework, stride=64 eval
- **PR #606 gowtham0992** - int5 + Soft-Round QAT model
- **PR #727 Asukabot0** - Multi-order backoff concept, entropy-adaptive alpha formula
- **PR #461 Christopher-Lee-McClendon** - TTT recipe foundations
- **PR #518 sofiabod** - LeakyReLU(0.5)^2, cosine TTT scheduling
- **Dean Barr (this author)** - Original n-gram eval cache concept (first in competition, PR #659), drift-free TTT discovery, backoff+TTT combination, BackoffNgramMixer implementation
Lines changed: 91 additions & 0 deletions
W0325 20:54:05.028000 92587 torch/distributed/run.py:803]
W0325 20:54:05.028000 92587 torch/distributed/run.py:803] *****************************************
W0325 20:54:05.028000 92587 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 20:54:05.028000 92587 torch/distributed/run.py:803] *****************************************
logs/ablation_none.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
mixed_precision: 68 int5 layers, 0 int6 layers (last 0 blocks)
model_params:33317980
XSA:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ws:8 gqa:8/8
lr:embed=0.035 matrix=0.025 scalar=0.025 batch:786432 wall:600s seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9285 val_bpb:4.1034 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9305 train_time:152ms step_avg:151.83ms
step:2/20000 train_loss:8.6412 train_time:242ms step_avg:121.04ms
step:3/20000 train_loss:7.7277 train_time:338ms step_avg:112.76ms
step:4/20000 train_loss:7.2811 train_time:433ms step_avg:108.35ms
step:5/20000 train_loss:7.0674 train_time:529ms step_avg:105.74ms
step:6/20000 train_loss:6.9651 train_time:624ms step_avg:104.02ms
step:7/20000 train_loss:6.8518 train_time:719ms step_avg:102.73ms
step:8/20000 train_loss:6.7086 train_time:815ms step_avg:101.84ms
step:9/20000 train_loss:6.3644 train_time:910ms step_avg:101.12ms
step:10/20000 train_loss:6.0326 train_time:1006ms step_avg:100.59ms
step:500/20000 train_loss:2.3655 train_time:49029ms step_avg:98.06ms
step:1000/20000 train_loss:2.2398 train_time:98479ms step_avg:98.48ms
step:1500/20000 train_loss:2.1832 train_time:147906ms step_avg:98.60ms
step:2000/20000 train_loss:2.0275 train_time:197310ms step_avg:98.65ms
step:2500/20000 train_loss:2.1308 train_time:246687ms step_avg:98.67ms
step:3000/20000 train_loss:2.1126 train_time:296033ms step_avg:98.68ms
step:3500/20000 train_loss:2.1149 train_time:345402ms step_avg:98.69ms
step:4000/20000 train_loss:1.9052 train_time:394733ms step_avg:98.68ms
step:4000/20000 val_loss:1.9969 val_bpb:1.1827 train_time:394738ms step_avg:98.68ms
late_qat:enabled step:4149 scale:0.4998
step:4500/20000 train_loss:2.0510 train_time:445058ms step_avg:98.90ms
step:5000/20000 train_loss:2.0252 train_time:495691ms step_avg:99.14ms
swa:start step:5200
step:5500/20000 train_loss:1.9352 train_time:546734ms step_avg:99.41ms
step:5847/20000 val_loss:1.9037 val_bpb:1.1275 train_time:582085ms step_avg:99.55ms
stopping_early: wallclock_cap train_time:582085ms step:5847/20000
peak memory allocated: 26197 MiB reserved: 26810 MiB
ema:applying EMA weights (skipping diagnostic evals)
Serialized model: 130432585 bytes
Code size: 87336 bytes
pruning:8.0% magnitude pruning applied
Serialized model int6+zstd: 15215668 bytes
Total submission size int6+zstd: 15303004 bytes
ttt: pre-compiling forward+backward kernels...
ttt: pre-compile done
final_int6_sliding_window val_loss:1.9177 val_bpb:1.1358 stride:64 eval_time:85508ms
final_int6_sliding_window_exact val_loss:1.91770544 val_bpb:1.13577318
TTT: epochs=0 lr=0.0005 freeze_first=2 chunk=1048576 opt=adamw
TTT temperature: 0.98
PPM alpha: 0.85, Byte-weighted TTT: True
Adaptive LR enabled: max_mult=3.0
ttt:start chunks=60 chunk_tokens=1048576 windows=969057 stride=64 lr=0.0005 epochs=0 opt=adamw freeze_first=2
ttt:params unfrozen=5780500 frozen=27537480
Polyak averaging enabled: decay=0.998
ttt_chunk [1/60] bpb=1.147257 time=3.9s
ttt_chunk [2/60] bpb=1.136523 time=7.9s
ttt_chunk [3/60] bpb=1.126607 time=11.9s
ttt_chunk [4/60] bpb=1.140779 time=15.9s
ttt_chunk [5/60] bpb=1.131236 time=19.8s
ttt_chunk [11/60] bpb=1.138805 time=43.7s
ttt_chunk [21/60] bpb=1.137149 time=83.5s
ttt_chunk [31/60] bpb=1.134506 time=123.2s
ttt_chunk [41/60] bpb=1.133697 time=163.0s
ttt_chunk [51/60] bpb=1.135162 time=202.7s
ttt_chunk [60/60] bpb=1.136469 time=235.0s
ttt:done val_loss=1.918669 val_bpb=1.136344 elapsed=235.4s
final_int6_ttt val_loss:1.9187 val_bpb:1.1363 stride:64 eval_time:235850ms
final_int6_ttt_exact val_loss:1.91866902 val_bpb:1.13634386
Lines changed: 90 additions & 0 deletions
W0325 21:29:47.419000 94247 torch/distributed/run.py:803]
W0325 21:29:47.419000 94247 torch/distributed/run.py:803] *****************************************
W0325 21:29:47.419000 94247 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 21:29:47.419000 94247 torch/distributed/run.py:803] *****************************************
logs/ablation_mixer_only.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
mixed_precision: 68 int5 layers, 0 int6 layers (last 0 blocks)
model_params:33317980
XSA:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ws:8 gqa:8/8
lr:embed=0.035 matrix=0.025 scalar=0.025 batch:786432 wall:600s seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9285 val_bpb:4.1034 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9305 train_time:148ms step_avg:148.05ms
step:2/20000 train_loss:8.6412 train_time:240ms step_avg:119.95ms
step:3/20000 train_loss:7.7277 train_time:335ms step_avg:111.70ms
step:4/20000 train_loss:7.2812 train_time:430ms step_avg:107.48ms
step:5/20000 train_loss:7.0674 train_time:526ms step_avg:105.22ms
step:6/20000 train_loss:6.9651 train_time:621ms step_avg:103.58ms
step:7/20000 train_loss:6.8516 train_time:717ms step_avg:102.41ms
step:8/20000 train_loss:6.7085 train_time:812ms step_avg:101.49ms
step:9/20000 train_loss:6.3645 train_time:908ms step_avg:100.90ms
step:10/20000 train_loss:6.0316 train_time:1004ms step_avg:100.40ms
step:500/20000 train_loss:2.3640 train_time:49103ms step_avg:98.21ms
step:1000/20000 train_loss:2.2419 train_time:98583ms step_avg:98.58ms
step:1500/20000 train_loss:2.1825 train_time:148035ms step_avg:98.69ms
step:2000/20000 train_loss:2.0286 train_time:197499ms step_avg:98.75ms
step:2500/20000 train_loss:2.1314 train_time:246889ms step_avg:98.76ms
step:3000/20000 train_loss:2.1099 train_time:296242ms step_avg:98.75ms
step:3500/20000 train_loss:2.1185 train_time:345600ms step_avg:98.74ms
step:4000/20000 train_loss:1.9067 train_time:394960ms step_avg:98.74ms
step:4000/20000 val_loss:1.9972 val_bpb:1.1829 train_time:394965ms step_avg:98.74ms
late_qat:enabled step:4145 scale:0.4999
step:4500/20000 train_loss:2.0517 train_time:445351ms step_avg:98.97ms
step:5000/20000 train_loss:2.0263 train_time:496100ms step_avg:99.22ms
swa:start step:5200
step:5500/20000 train_loss:1.9330 train_time:547119ms step_avg:99.48ms
step:5842/20000 val_loss:1.9040 val_bpb:1.1276 train_time:582076ms step_avg:99.64ms
stopping_early: wallclock_cap train_time:582076ms step:5842/20000
peak memory allocated: 26197 MiB reserved: 26810 MiB
ema:applying EMA weights (skipping diagnostic evals)
Serialized model: 130432585 bytes
Code size: 87336 bytes
pruning:8.0% magnitude pruning applied
Serialized model int6+zstd: 15623097 bytes
Total submission size int6+zstd: 15710433 bytes
ttt: pre-compiling forward+backward kernels...
ttt: pre-compile done
final_int6_sliding_window val_loss:1.9219 val_bpb:1.1383 stride:64 eval_time:86138ms
final_int6_sliding_window_exact val_loss:1.92191264 val_bpb:1.13826492
TTT: epochs=0 lr=0.0005 freeze_first=2 chunk=1048576 opt=adamw
TTT temperature: 0.98
PPM alpha: 0.85, Byte-weighted TTT: True
Logistic context mixer enabled: eta=0.02
ttt:start chunks=60 chunk_tokens=1048576 windows=969057 stride=64 lr=0.0005 epochs=0 opt=adamw freeze_first=2
ttt:params unfrozen=5780500 frozen=27537480
ttt_chunk [1/60] bpb=1.150549 time=5.2s
ttt_chunk [2/60] bpb=1.135406 time=11.3s
ttt_chunk [3/60] bpb=1.105955 time=17.4s
ttt_chunk [4/60] bpb=1.093665 time=23.5s
ttt_chunk [5/60] bpb=1.059819 time=29.6s
ttt_chunk [11/60] bpb=0.926140 time=66.1s
ttt_chunk [21/60] bpb=0.795571 time=126.3s
ttt_chunk [31/60] bpb=0.737438 time=186.1s
ttt_chunk [41/60] bpb=0.702686 time=245.9s
ttt_chunk [51/60] bpb=0.683270 time=305.6s
ttt_chunk [60/60] bpb=0.670476 time=354.4s
ttt:done val_loss=1.133219 val_bpb=0.671156 elapsed=355.1s
final_int6_ttt val_loss:1.1332 val_bpb:0.6712 stride:64 eval_time:355659ms
final_int6_ttt_exact val_loss:1.13321916 val_bpb:0.67115622
