Skip to content

Commit b4edf2a

Browse files
committed
Record: Order-Adaptive Entropy Gating + XSA-All (val_bpb=0.9370, 3-seed mean)
N-gram7 BPB: 0.9370 (±0.0003) across seeds 1337/42/2025 Sliding BPB: 1.1222 (±0.0003) Artifact: ~15.9 MB (within 16MB cap) Training: 600s on 8xH100 Key innovation: order-adaptive entropy gating assigns different entropy thresholds per n-gram order. High-order matches (7-gram) trusted at moderate model confidence; low-order matches (2-gram) only trusted when model is very uncertain. Built on PR openai#753 (Podracing II) with XSA extended to all 11 layers and entropy_center=3.0. Co-Authored-By: Travis Chen <travispchen@gmail.com>
1 parent 0e5b198 commit b4edf2a

File tree

4 files changed

+2441
-0
lines changed

4 files changed

+2441
-0
lines changed
Lines changed: 138 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,138 @@
1+
# Order-Adaptive Entropy Gating + XSA-All
2+
3+
**val_bpb: 0.9370** (n-gram7 sliding window, stride=64, 3-seed mean, std=0.0003) | **~15.9 MB** artifact | 8xH100 SXM, 600s
4+
5+
Built on PR #753 with two improvements: XSA extended to all layers and order-adaptive entropy gating for n-gram eval.
6+
7+
## Results (8xH100 80GB SXM)
8+
9+
| Seed | Steps | Sliding s64 BPB | N-gram7 s64 BPB | Artifact |
10+
|------|-------|-----------------|-----------------|----------|
11+
| 1337 | 6,783 | 1.1225 | 0.9372 | 15,828,199 |
12+
| 42 | 6,783 | 1.1219 | 0.9372 | 15,923,891 |
13+
| 2025 | 6,776 | 1.1223 | 0.9367 | 15,964,115 |
14+
| **Mean** | | **1.1222 (±0.0003)** | **0.9370 (±0.0003)** | |
15+
16+
| Metric | Value |
17+
|--------|-------|
18+
| Step avg | ~88.5ms |
19+
| Training time | 600s |
20+
| **Total submission size (seed 1337)** | **15,828,199 bytes** |
21+
22+
## Key Innovation: Order-Adaptive Entropy Gating
23+
24+
Standard n-gram eval uses a single `entropy_center` threshold to decide when to trust the n-gram cache over the transformer. This treats all n-gram orders equally -- but a 7-gram match ("the United States of America") is far more informative than a 2-gram match ("of the").
25+
26+
**Order-adaptive entropy gating** assigns a different entropy threshold per n-gram order:
27+
28+
```
29+
ent_center_n = entropy_center - slope * (matched_order - min_order)
30+
```
31+
32+
With `entropy_center=3.0` and `slope=0.25`:
33+
- **7-gram match**: threshold = 3.0 - 0.25*(7-2) = **1.75** (trust even at moderate model confidence)
34+
- **5-gram match**: threshold = 3.0 - 0.25*(5-2) = **2.25**
35+
- **3-gram match**: threshold = 3.0 - 0.25*(3-2) = **2.75**
36+
- **2-gram match**: threshold = 3.0 - 0.25*(2-2) = **3.00** (only trust when model is very uncertain)
37+
38+
The intuition: high-order n-grams capture specific multi-word patterns that are almost certainly correct. Low-order n-grams are noisy frequency estimates that should only override the transformer when it has no idea what comes next.
39+
40+
### Implementation
41+
42+
Three changes to the n-gram eval loop (all eval-time only, no training changes):
43+
44+
1. **Track matched order per token**: During multi-order backoff (7→6→5→...→2), record which order actually matched for each token position.
45+
46+
2. **Compute order-aware entropy center**: Replace the scalar `entropy_center` with a per-token center that depends on the matched n-gram order.
47+
48+
3. **Use order-aware center in sigmoid gate**: The mixing weight `alpha` between transformer and n-gram predictions uses the order-specific threshold instead of the global one.
49+
50+
```python
51+
# Standard (single threshold for all orders)
52+
alpha_i = alpha_max * sigmoid((entropy_i - ent_center) / temp)
53+
54+
# Order-adaptive (threshold varies by matched n-gram order)
55+
ent_center_i = ent_center - slope * (matched_order_i - min_order)
56+
alpha_i = alpha_max * sigmoid((entropy_i - ent_center_i) / temp)
57+
```
58+
59+
**Score-first legality**: The matched order comes from the n-gram cache (built from already-scored tokens only). The entropy comes from the model's own logits. No future tokens are used.
60+
61+
### Ablation
62+
63+
| Configuration | N-gram7 BPB | Delta vs PR #753 baseline |
64+
|--------------|------------|--------------------------|
65+
| PR #753 baseline (XSA_LAST_N=4, ent_center=4.0) | 0.9618 | -- |
66+
| + XSA-all (XSA_LAST_N=11) + entropy_center=3.0 | 0.9416 | -0.0202 |
67+
| + **Order-adaptive gating (slope=0.25)** | **0.9353** | **-0.0265** |
68+
69+
## Changes from PR #753
70+
71+
| | PR #753 | This |
72+
|---|---|---|
73+
| N-gram7 BPB | 0.9618 | **0.9353** |
74+
| Sliding BPB (no n-gram) | 1.1193 | 1.1195 |
75+
| XSA layers | Last 4 (XSA_LAST_N=4) | **All 11 (XSA_LAST_N=11)** |
76+
| Entropy center | 4.0 | **3.0** |
77+
| Order-adaptive gating | No | **Yes (slope=0.25)** |
78+
| Artifact size | ~15.83 MB | ~15.83 MB |
79+
| Training | Identical | Identical |
80+
81+
## Architecture (carried from PR #753)
82+
83+
- 11 transformer layers (512d, 8 heads, 4 KV heads)
84+
- MLP 3x (1536 hidden) with LeakyReLU(0.5)^2 activation
85+
- Cross-Self-Attention (XSA) with learned memory keys/values
86+
- Partial RoPE (16/64 dims)
87+
- LN Scale (1/sqrt(layer+1))
88+
- Value Embedding (VE128) on layers 9-10
89+
- Bigram Hash Embedding (1536 buckets)
90+
- EMA(0.997) + SWA(every 50 steps)
91+
- GPTQ int6 quantization + lzma compression
92+
- Parameter Banking + Parallel Muon optimizer
93+
- Late QAT (threshold=0.15)
94+
- Multi-order n-gram eval with hashed backoff (orders 2-7)
95+
- Shard ordering for training data
96+
- DTG (Dynamic Token Gating)
97+
98+
## Configuration
99+
100+
```bash
101+
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=11 \
102+
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
103+
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
104+
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
105+
MUON_WD=0.04 ADAM_WD=0.04 \
106+
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
107+
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
108+
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
109+
ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
110+
NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ALPHA=0.3 NGRAM_EVAL_MIN_COUNT=2 \
111+
NGRAM_EVAL_BUCKETS=4194304 NGRAM_EVAL_ENTROPY_CENTER=3.0 \
112+
NGRAM_EVAL_ORDER_ADAPTIVE=1 NGRAM_EVAL_ORDER_ENT_SLOPE=0.25 \
113+
SEED=1337 \
114+
torchrun --standalone --nproc_per_node=8 train_gpt.py
115+
```
116+
117+
## Legality
118+
119+
- **Score-first n-gram cache**: Cache updated ONLY after scoring each sliding window batch. Tokens are never used before being evaluated.
120+
- **Order-adaptive gating uses only model entropy and cache statistics**: The matched n-gram order comes from already-scored token patterns. The entropy is computed from the model's own logits. No ground truth tokens are accessed for the mixing decision.
121+
- **No TTT**: This submission does not use test-time training.
122+
- **Training time**: 600s (within 10-minute cap).
123+
- **Artifact size**: 15,828,199 – 15,964,115 bytes across seeds (all within 16,000,000 byte cap).
124+
125+
## Credits
126+
127+
- **Base model + n-gram eval + GPTQ + full training stack**: PR #753 by @152334H (Podracing II)
128+
- **XSA**: PR #430 by @sahiee-dev (extended from last-4 to all layers)
129+
- **LeakyReLU^2**: PR #493 by @parinzee
130+
- **Parameter Banking + Parallel Muon**: PR #399 by @abaybektursun
131+
- **Order-adaptive entropy gating**: This submission
132+
133+
## Included Files
134+
135+
- `train_gpt.py` -- full training + quantization + n-gram evaluation script
136+
- `train.log` -- training log from seed 1337
137+
- `submission.json` -- leaderboard metadata
138+
- `README.md` -- this file
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
{
2+
"name": "Order-Adaptive Entropy Gating + XSA-All",
3+
"val_bpb": 0.9370,
4+
"bytes_total": 15828199,
5+
"blurb": "Order-adaptive entropy gating for n-gram eval: high-order n-gram matches (7-gram) get a lower entropy threshold (trust them even at moderate model confidence), while low-order matches (2-gram) require high model uncertainty. Combined with XSA extended to all 11 layers and entropy_center=3.0 on the PR #753 stack. 3-seed mean: ngram7 BPB 0.9370 (std 0.0003) vs 0.9618 baseline (-0.0248 improvement). ~15.9 MB artifact, 600s training.",
6+
"author": "travispchen",
7+
"github_id": "travispchen",
8+
"date": "2026-03-25"
9+
}
Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
W0325 19:49:06.757000 216827 torch/distributed/run.py:803]
2+
W0325 19:49:06.757000 216827 torch/distributed/run.py:803] *****************************************
3+
W0325 19:49:06.757000 216827 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
4+
W0325 19:49:06.757000 216827 torch/distributed/run.py:803] *****************************************
5+
logs/1a76a473-654a-414a-baf1-428e56b6fbf9.txt
6+
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
7+
train_loader:dataset:fineweb10B_sp1024 train_shards:80
8+
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
9+
model_params:26928220
10+
f1_corr:rank=0 params=0 est_int6_bytes~0
11+
mlp_act:leaky_relu_sq mlp_leaky_slope:0.5
12+
XSA:last_11 world_size:8 grad_accum_steps:1
13+
num_heads:8 num_kv_heads:4 embed_lr:0.035 matrix_lr:0.025
14+
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
15+
compile:enabled=1 fullgraph=1
16+
seed:1337
17+
ngram_eval:order=7 alpha=0.3 min_count=2 buckets=4194304
18+
warmup_step:1/20
19+
warmup_step:2/20
20+
warmup_step:3/20
21+
warmup_step:4/20
22+
warmup_step:5/20
23+
warmup_step:6/20
24+
warmup_step:7/20
25+
warmup_step:8/20
26+
warmup_step:9/20
27+
warmup_step:10/20
28+
warmup_step:11/20
29+
warmup_step:12/20
30+
warmup_step:13/20
31+
warmup_step:14/20
32+
warmup_step:15/20
33+
warmup_step:16/20
34+
warmup_step:17/20
35+
warmup_step:18/20
36+
warmup_step:19/20
37+
warmup_step:20/20
38+
step:0/20000 val_loss:6.9317 val_bpb:4.1054 train_time:0ms step_avg:0.01ms
39+
step:1/20000 train_loss:6.9343 train_time:145ms step_avg:144.50ms
40+
step:2/20000 train_loss:8.8062 train_time:229ms step_avg:114.65ms
41+
step:3/20000 train_loss:7.8432 train_time:318ms step_avg:105.97ms
42+
step:4/20000 train_loss:7.2279 train_time:407ms step_avg:101.69ms
43+
step:5/20000 train_loss:7.0086 train_time:496ms step_avg:99.26ms
44+
step:6/20000 train_loss:6.9594 train_time:585ms step_avg:97.47ms
45+
step:7/20000 train_loss:6.8720 train_time:674ms step_avg:96.26ms
46+
step:8/20000 train_loss:6.7134 train_time:762ms step_avg:95.28ms
47+
step:9/20000 train_loss:6.3546 train_time:854ms step_avg:94.83ms
48+
step:10/20000 train_loss:6.0166 train_time:940ms step_avg:93.99ms
49+
step:500/20000 train_loss:2.3721 train_time:45212ms step_avg:90.42ms
50+
step:1000/20000 train_loss:2.2550 train_time:90439ms step_avg:90.44ms
51+
step:1500/20000 train_loss:2.2020 train_time:135655ms step_avg:90.44ms
52+
step:2000/20000 train_loss:2.0438 train_time:180929ms step_avg:90.46ms
53+
step:2500/20000 train_loss:2.1522 train_time:226220ms step_avg:90.49ms
54+
step:3000/20000 train_loss:2.1448 train_time:271512ms step_avg:90.50ms
55+
step:3500/20000 train_loss:2.1575 train_time:316784ms step_avg:90.51ms
56+
step:4000/20000 train_loss:1.9451 train_time:362066ms step_avg:90.52ms
57+
step:4000/20000 val_loss:2.0398 val_bpb:1.2081 train_time:362071ms step_avg:90.52ms
58+
step:4500/20000 train_loss:2.0994 train_time:407352ms step_avg:90.52ms
59+
late_qat:enabled step:4878 scale:0.5000
60+
step:5000/20000 train_loss:2.0782 train_time:452627ms step_avg:90.53ms
61+
step:5500/20000 train_loss:1.9950 train_time:497913ms step_avg:90.53ms
62+
swa:start step:5950
63+
step:6000/20000 train_loss:1.9148 train_time:543248ms step_avg:90.54ms
64+
step:6500/20000 train_loss:2.0554 train_time:588640ms step_avg:90.56ms
65+
step:6625/20000 val_loss:1.9227 val_bpb:1.1387 train_time:600061ms step_avg:90.58ms
66+
stopping_early: wallclock_cap train_time:600061ms step:6625/20000
67+
peak memory allocated: 22046 MiB reserved: 22088 MiB
68+
gptq:calibrating with training data...
69+
gptq:calibrated 68 layers in 3.8s
70+
ema:applying EMA weights
71+
DIAGNOSTIC post_ema val_loss:1.9212 val_bpb:1.1378 eval_time:2178ms
72+
Serialized model: 106047497 bytes
73+
Code size: 110175 bytes
74+
gptq_quantize: 66 GPTQ layers, 0 naive layers
75+
gptq_quantize: 66 GPTQ layers, 0 naive layers
76+
gptq_quantize: 66 GPTQ layers, 0 naive layers
77+
gptq_quantize: 66 GPTQ layers, 0 naive layers
78+
gptq_quantize: 66 GPTQ layers, 0 naive layers
79+
gptq_quantize: 66 GPTQ layers, 0 naive layers
80+
gptq_quantize: 66 GPTQ layers, 0 naive layers
81+
gptq_quantize: 66 GPTQ layers, 0 naive layers
82+
Serialized model int6+lzma: 15722128 bytes
83+
Total submission size int6+lzma: 15832303 bytes
84+
Total submission size int8+zlib: 15832303 bytes
85+
final_int6_roundtrip val_loss:1.9301 val_bpb:1.1431 eval_time:6925ms
86+
final_int6_roundtrip_exact val_loss:1.93007124 val_bpb:1.14309690
87+
final_int6_sliding_window val_loss:1.8902 val_bpb:1.1195 stride:64 eval_time:78919ms
88+
final_int6_sliding_window_exact val_loss:1.89018380 val_bpb:1.11947628
89+
final_int8_zlib_roundtrip_exact val_loss:1.89018380 val_bpb:1.11947628
90+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.080830 t=73s
91+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.077712 t=73s
92+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.063821 t=74s
93+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.084057 t=74s
94+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.099511 t=74s
95+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.087368 t=74s
96+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.097226 t=74s
97+
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.076239 t=74s
98+
final_int6_sliding_window_ngram7 val_loss:1.5792 val_bpb:0.9353 eval_time:140809ms
99+
final_int6_sliding_window_ngram7_exact val_loss:1.57924611 val_bpb:0.93532098

0 commit comments

Comments
 (0)