2 changes: 1 addition & 1 deletion .gitignore

```diff
@@ -8,4 +8,4 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/PLAN.md
```
88 changes: 88 additions & 0 deletions records/track_10min_16mb/2026-03-28_12L_INT4_bQAT_VE/README.md
# 12L INT4 bQAT + EMA Fix + Value Embeddings

**val_bpb: 1.1574** (seed 2, full TTT) | **16.41 MB** | 8×H100 SXM

## Results (8×H100 80GB SXM)

| Seed | step_avg | steps | Pre-quant val/bpb | Post-quant bpb | Post-TTT bpb | Artifact (bytes) |
|------|----------|-------|-------------------|----------------|--------------|------------------|
| 1 | ~137ms | ~4380 | 1.1754 | 1.1643 | 1.1588 | 16,290,425 |
| 2 | ~148ms | 4058 | 1.1736 | 1.1624 | **1.1574** | 16,408,223 |
| 3 | ~149ms | 4034 | — | 1.1661 | 1.1580 | 16,441,654 |

## Architecture

| Component | Setting |
|-----------|---------|
| Layers | 12 (512d, 8H, 4KV) |
| MLP | 3× with LeakyReLU(0.5)² |
| BigramHash | 10240 buckets, INT4 bQAT |
| XSA | Last 4 layers |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| Skip | U-Net skip connections |
| resid_mix | Learnable x/x₀ blend |
| Weight avg | EMA(0.997) with QAT-activation reset |
| Quantization | INT4 MLP + INT4 bigram + INT6 attn + zstd |
| QAT trigger | Wallclock fraction (65% of budget) |
| Value Embeddings | ve_dim=128, layers 10-11 |
| TTT | Legal score-first, lr=0.002, 3 epochs |
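The wallclock-fraction QAT trigger in the table can be sketched in a few lines. This is an illustration only; the function name and the `budget_seconds` parameter are assumptions, not the repo's actual API (which reads `LATE_QAT_FRAC` in `train_gpt.py`):

```python
def qat_should_start(elapsed_seconds: float, budget_seconds: float,
                     frac: float = 0.65) -> bool:
    """Switch from full-precision training to QAT once `frac` of the
    wallclock budget has elapsed (65% here, matching LATE_QAT_FRAC=0.65).
    Driven by elapsed time, not step count, so slow steps still leave
    enough of the budget for the QAT phase."""
    return elapsed_seconds >= frac * budget_seconds
```

Triggering on wallclock rather than step count keeps the QAT phase a fixed share of the 10-minute budget even when step time drifts between runs.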

## Key Addition: Value Embeddings

Value embeddings reinject token identity into the V vectors at specific attention layers. At layers 10 and 11, a shared embedding `ve_shared` (vocab×128) is looked up per token and projected to kv_dim (256), then added to the raw V output before attention:

```python
v_raw = c_v(x)                                     # raw V projection: (batch, seq, kv_dim)
v_embed = ve_shared.proj(ve_shared.embed(tokens))  # shared lookup + up-projection: (batch, seq, kv_dim)
v = v_raw + v_embed                                # token identity reinjected into V
```

The shared embedding allows all VE layers to benefit from a single (vocab, 128) table without per-layer weight cost.
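The math above can be sketched framework-free with numpy (a minimal illustration, not the repo's PyTorch module; the names `ve_table`, `ve_proj`, and `value_embedding` are hypothetical, and the shapes follow the ve_dim=128 / kv_dim=256 settings in the table):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, ve_dim, kv_dim = 256, 128, 256            # toy vocab; ve_dim/kv_dim as in this run

ve_table = rng.standard_normal((vocab, ve_dim))  # the single shared (vocab, ve_dim) table
ve_proj = rng.standard_normal((ve_dim, kv_dim))  # up-projection from ve_dim to kv_dim

def value_embedding(tokens: np.ndarray) -> np.ndarray:
    """tokens: (batch, seq) int ids -> (batch, seq, kv_dim) tensor
    that gets added onto the raw V output at each VE layer."""
    return ve_table[tokens] @ ve_proj

tokens = rng.integers(0, vocab, size=(2, 16))
v_embed = value_embedding(tokens)                # same contribution reused at layers 10 and 11
```

Because every VE layer indexes the same `ve_table`, adding a second VE layer costs only the per-layer projection, not another (vocab, 128) table.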

**Effect:** At step 2000, VE improved val bpb by ~0.014 over the baseline (1.2344 vs 1.2481). Despite ~640 fewer total steps (~4380 vs 5021 for v4_h100 seed 1), the higher per-step quality yielded a new best ttt_bpb.

## Comparison with Previous Best

| Run | steps | Pre-quant | Post-quant | TTT bpb |
|-----|-------|-----------|------------|---------|
| v4_h100 seed 1 | 5021 | 1.1683 | 1.1703 | ~1.165 |
| v4_h100 seed 3 | — | 1.1729 | 1.2002 | 1.1594 |
| v7_ve seed 1 | ~4380 | 1.1754 | 1.1643 | 1.1588 |
| **v7_ve seed 2** | — | 1.1736 | **1.1624** | **1.1574** |

Note: v7_ve's post_quant is better than its pre-quant checkpoint because the model continued improving during QAT after the last val checkpoint.

## Size Budget (seed 2)

| Component | Bytes |
|-----------|-------|
| Total artifact | 16,408,223 |

Budget: 16,777,216 bytes (16 MB), leaving a margin of **368,993 bytes (≈360 KB)**

## Run Command

```bash
# Best result (seed 2):
SEED=2 bash run.sh v7_ve

# Or manually:
SEED=2 VALUE_EMBED_LAYERS=2 VALUE_EMBED_DIM=128 \
LATE_QAT_FRAC=0.65 VAL_LOSS_EVERY=1000 \
NUM_LAYERS=12 MLP_QUANT_BITS=4 XSA_LAST_N=4 EMA_ENABLED=1 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT_THRESHOLD=0.9 TTT_ENABLED=1 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- LeakyReLU² activation: PR #493, PR #518
- XSA (Cross-layer Shared Attention): PR #414
- EMA weight averaging: PR #374
- Legal TTT recipe: PR #461
- INT5/INT6 QAT with STE: PR #317, PR #374
- BigramHash embedding: PR #320
- U-Net skip connections: PR #363
- resid_mix: prior work in this repo
- Value embeddings: PR #549
9 changes: 9 additions & 0 deletions (new file)
```json
{
  "name": "12L INT4 bQAT + EMA Fix + Value Embeddings",
  "val_bpb": 1.1574,
  "bytes_total": 16408223,
  "blurb": "12-layer INT4 MLP+bigram QAT + U-Net skips + XSA(last 4) + EMA(0.997) with QAT-reset fix + LeakyReLU² + RoPE16 + LN Scale + resid_mix + Legal TTT + Value Embeddings (ve_dim=128, last 2 layers). Value embeddings reinject token identity into V at layers 10-11. Best result: 8×H100 seed 2, ttt_bpb=1.15738; post-quant bpb 1.1624 improves on the pre-quant checkpoint.",
  "author": "Harsh Soni",
  "github_id": "SoHarshh",
  "date": "2026-03-28"
}
```