2 changes: 1 addition & 1 deletion .gitignore

```diff
@@ -8,4 +8,4 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/PLAN.md
```
88 changes: 88 additions & 0 deletions records/track_10min_16mb/2026-03-28_12L_INT4_bQAT_VE/README.md
# 12L INT4 bQAT + EMA Fix + Value Embeddings

**val_bpb: 1.1574** (seed 2, full TTT) | **16.41 MB** | 8×H100 SXM

## Results (8×H100 80GB SXM)

| Seed | step_avg | steps | Pre-quant val/bpb | Post-quant bpb | Post-TTT bpb | Artifact (bytes) |
|------|----------|-------|-------------------|----------------|--------------|------------------|
| 1 | ~137ms | ~4380 | 1.1754 | 1.1643 | 1.1588 | 16,290,425 |
| 2 | ~148ms | 4058 | 1.1736 | 1.1624 | **1.1574** | 16,408,223 |
| 3 | ~149ms | 4034 | — | 1.1661 | 1.1580 | 16,441,654 |

## Architecture

| Component | Setting |
|-----------|---------|
| Layers | 12 (512d, 8H, 4KV) |
| MLP | 3× with LeakyReLU(0.5)² |
| BigramHash | 10240 buckets, INT4 bQAT |
| XSA | Last 4 layers |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| Skip | U-Net skip connections |
| resid_mix | Learnable x/x₀ blend |
| Weight avg | EMA(0.997) with QAT-activation reset |
| Quantization | INT4 MLP + INT4 bigram + INT6 attn + zstd |
| QAT trigger | Wallclock fraction (65% of budget) |
| Value Embeddings | ve_dim=128, layers 10-11 |
| TTT | Legal score-first, lr=0.002, 3 epochs |
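The wallclock-fraction QAT trigger in the table can be sketched in a few lines. This is an illustration only; the function name and the `budget_seconds` parameter are assumptions, not the repo's actual API (which reads `LATE_QAT_FRAC` in `train_gpt.py`):

```python
def qat_should_start(elapsed_seconds: float, budget_seconds: float,
                     frac: float = 0.65) -> bool:
    """Switch from full-precision training to QAT once `frac` of the
    wallclock budget has elapsed (65% here, matching LATE_QAT_FRAC=0.65).
    Driven by elapsed time, not step count, so slow steps still leave
    enough of the budget for the QAT phase."""
    return elapsed_seconds >= frac * budget_seconds
```

Triggering on wallclock rather than step count keeps the QAT phase a fixed share of the 10-minute budget even when step time drifts between runs.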

## Key Addition: Value Embeddings

Value embeddings reinject token identity into the V vectors at specific attention layers. At layers 10 and 11, a shared embedding `ve_shared` (vocab×128) is looked up per token and projected to kv_dim (256), then added to the raw V output before attention:

```python
v_raw = c_v(x)                                     # raw V projection: (batch, seq, kv_dim)
v_embed = ve_shared.proj(ve_shared.embed(tokens))  # shared lookup + up-projection: (batch, seq, kv_dim)
v = v_raw + v_embed                                # token identity reinjected into V
```

The shared embedding allows all VE layers to benefit from a single (vocab, 128) table without per-layer weight cost.
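The math above can be sketched framework-free with numpy (a minimal illustration, not the repo's PyTorch module; the names `ve_table`, `ve_proj`, and `value_embedding` are hypothetical, and the shapes follow the ve_dim=128 / kv_dim=256 settings in the table):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, ve_dim, kv_dim = 256, 128, 256            # toy vocab; ve_dim/kv_dim as in this run

ve_table = rng.standard_normal((vocab, ve_dim))  # the single shared (vocab, ve_dim) table
ve_proj = rng.standard_normal((ve_dim, kv_dim))  # up-projection from ve_dim to kv_dim

def value_embedding(tokens: np.ndarray) -> np.ndarray:
    """tokens: (batch, seq) int ids -> (batch, seq, kv_dim) tensor
    that gets added onto the raw V output at each VE layer."""
    return ve_table[tokens] @ ve_proj

tokens = rng.integers(0, vocab, size=(2, 16))
v_embed = value_embedding(tokens)                # same contribution reused at layers 10 and 11
```

Because every VE layer indexes the same `ve_table`, adding a second VE layer costs only the per-layer projection, not another (vocab, 128) table.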

**Effect:** At step 2000, VE improved val bpb by ~0.014 over the baseline (1.2344 vs 1.2481). Despite ~640 fewer total steps (~4380 vs 5021 for v4_h100 seed 1), the higher per-step quality yielded a new best ttt_bpb.

## Comparison with Previous Best

| Run | steps | Pre-quant | Post-quant | TTT bpb |
|-----|-------|-----------|------------|---------|
| v4_h100 seed 1 | 5021 | 1.1683 | 1.1703 | ~1.165 |
| v4_h100 seed 3 | — | 1.1729 | 1.2002 | 1.1594 |
| v7_ve seed 1 | ~4380 | 1.1754 | 1.1643 | 1.1588 |
| **v7_ve seed 2** | — | 1.1736 | **1.1624** | **1.1574** |

Note: v7_ve's post_quant is better than its pre-quant checkpoint because the model continued improving during QAT after the last val checkpoint.

## Size Budget (seed 2)

| Component | Bytes |
|-----------|-------|
| Total artifact | 16,408,223 |

Budget: 16,777,216 bytes (16 MB), leaving a margin of **368,993 bytes (≈360 KB)**

## Run Command

```bash
# Best result (seed 2):
SEED=2 bash run.sh v7_ve

# Or manually:
SEED=2 VALUE_EMBED_LAYERS=2 VALUE_EMBED_DIM=128 \
LATE_QAT_FRAC=0.65 VAL_LOSS_EVERY=1000 \
NUM_LAYERS=12 MLP_QUANT_BITS=4 XSA_LAST_N=4 EMA_ENABLED=1 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT_THRESHOLD=0.9 TTT_ENABLED=1 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- LeakyReLU² activation: PR #493, PR #518
- XSA (Cross-layer Shared Attention): PR #414
- EMA weight averaging: PR #374
- Legal TTT recipe: PR #461
- INT5/INT6 QAT with STE: PR #317, PR #374
- BigramHash embedding: PR #320
- U-Net skip connections: PR #363
- resid_mix: prior work in this repo
- Value embeddings: PR #549
9 changes: 9 additions & 0 deletions (new file)
```json
{
  "name": "12L INT4 bQAT + EMA Fix + Value Embeddings",
  "val_bpb": 1.1574,
  "bytes_total": 16408223,
  "blurb": "12-layer INT4 MLP+bigram QAT + U-Net skips + XSA(last 4) + EMA(0.997) with QAT-reset fix + LeakyReLU² + RoPE16 + LN Scale + resid_mix + Legal TTT + Value Embeddings (ve_dim=128, last 2 layers). Value embeddings reinject token identity into V at layers 10-11. Best result: 8×H100 seed 2, ttt_bpb=1.15738; post-quant bpb 1.1624 improves on the pre-quant checkpoint.",
  "author": "Harsh Soni",
  "github_id": "SoHarshh",
  "date": "2026-03-28"
}
```