# Approach E: SLOT + QK-Gain + Int5 GPTQ + Score-First TTT

## Summary

Combines SLOT (Sample-specific Language Model Optimization at Test-time) with a raised QK-Gain initialization on top of the Approach B base (33.6M params, Int5 GPTQ, score-first TTT).

**val_bpb: TBD** | 8xH100 SXM | 600s train + eval budget

## Key Changes vs Approach B

### 1. SLOT: Per-Batch Delta Vector Optimization (eval-time)

At eval time, for each sliding-window batch, we optimize a single additive delta vector (in R^512) that is applied to the frozen hidden states just before the logit projection.

- **Delta shape**: `[1, 1, 512]` -- broadcasts across batch and sequence
- **Optimizer**: AdamW (lr=0.005, weight_decay=1e-8, eps=1e-5)
- **Steps**: 8 per batch
- **Score-first compliant**: hidden states computed under `torch.no_grad()`, delta adapts through `compute_logits()` only, model weights never modified
- **No cross-batch leakage**: delta re-initialized to zeros for each new batch

The model forward is split into `forward_hidden()` (frozen, no grad) and `compute_logits()` (carries grad for delta optimization).

Reference: Hu et al., arXiv:2505.12392v2. Proven in PR #1172 (~0.010 BPB improvement) and PR #1209 (~0.010 BPB improvement).
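
The split forward and per-batch delta loop described above can be sketched as follows. `ToyLM` is a hypothetical stand-in for the real model; only the delta shape `[1, 1, dim]`, the AdamW settings (lr=0.005, weight_decay=1e-8, eps=1e-5), the 8 steps, and the per-batch zero re-initialization come from this PR — which tokens the delta is optimized against is an assumption of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Hypothetical stand-in with the same forward split as the real model."""
    def __init__(self, vocab=64, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.trunk = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward_hidden(self, ids):
        # Frozen part: run once per batch under torch.no_grad().
        h, _ = self.trunk(self.embed(ids))
        return h

    def compute_logits(self, h):
        # Grad flows to the delta only through this projection.
        return self.head(h)

def slot_score_batch(model, ids, targets, steps=8, lr=5e-3):
    with torch.no_grad():
        hidden = model.forward_hidden(ids)          # model weights untouched
    # Delta re-initialized to zeros for every batch: no cross-batch leakage.
    delta = torch.zeros(1, 1, hidden.size(-1), requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr, weight_decay=1e-8, eps=1e-5)
    for _ in range(steps):
        logits = model.compute_logits(hidden + delta)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                           # final scoring pass
        return model.compute_logits(hidden + delta)
```

Because the delta has shape `[1, 1, dim]`, it broadcasts across both batch and sequence, so only `dim` scalars are optimized per window.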

### 2. QK-Gain Initialization Raised to 4.0

Per-head learnable scalar on queries after QK-norm, initialized at 4.0 (up from 1.5). This sharpens attention patterns and has been shown to improve convergence in recent SOTA submissions (PR #1209).
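
A minimal sketch of where the gain sits, assuming a standard attention block: the `QKGainAttention` class and its internals are illustrative, and only the per-head learnable scalar, its placement on the queries after QK-norm, and the 4.0 init come from this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class QKGainAttention(nn.Module):
    def __init__(self, dim=512, heads=8, gain_init=4.0):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One learnable scalar per head; the raised init sharpens attention.
        self.qk_gain = nn.Parameter(torch.full((heads, 1, 1), gain_init))

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.heads, self.hd).transpose(1, 2)
        k = k.view(B, T, self.heads, self.hd).transpose(1, 2)
        v = v.view(B, T, self.heads, self.hd).transpose(1, 2)
        q = rms_norm(q) * self.qk_gain      # gain applied after QK-norm
        k = rms_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, D))
```

Since QK-norm fixes the scale of queries and keys, the gain effectively becomes a learnable per-head attention temperature; initializing it high makes the softmax more peaked from step one.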

## Eval Pipeline

| Stage | Expected BPB Impact | Time | Legality |
|-------|-------------------|------|----------|
| Sliding window (stride=64) | baseline | ~30s | Standard eval |
| Score-first TTT (3ep, 131K chunks) | ~-0.003 | ~120s | Score chunk, then train on it |
| SLOT (8 AdamW steps, delta vector) | ~-0.010 | ~90s | Per-batch delta reset, no cross-batch leakage |
| **Total eval** | | **~240s** | **Within 600s budget** |
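
The score-first ordering of the TTT stage can be sketched as below: each chunk is scored with the current weights before the model trains on it, so no token is ever scored by weights that have already seen it. The toy model and chunk handling are illustrative, and the frozen-first-blocks detail (`TTT_FREEZE_BLOCKS=2`) is omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def score_first_ttt(model, chunks, epochs=3, lr=1e-4):
    """Score each chunk BEFORE training on it (freeze-blocks omitted)."""
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    total_nll, total_tok = 0.0, 0
    for ids, targets in chunks:
        with torch.no_grad():                       # 1) score current chunk
            logits = model(ids)
            nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                  reduction="sum")
        total_nll += nll.item()
        total_tok += targets.numel()
        for _ in range(epochs):                     # 2) then adapt on it
            loss = F.cross_entropy(model(ids).flatten(0, 1),
                                   targets.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    # nats -> bits; this equals BPB when tokens are raw bytes.
    return total_nll / total_tok / math.log(2)
```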

## Architecture

- 11 layers, 512 dim, 8 heads, 8 KV heads, MLP mult 3.5
- BigramHash(6144x128), XSA-all, VE(128) at layers 9,10
- RoPE partial (16 dims), LN scale, U-Net skip connections
- SmearGate, tied embeddings
- Int5 GPTQ quantization with 10% magnitude pruning
- Late QAT (threshold=0.5)
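
A round-to-nearest sketch of the Int5-plus-pruning storage scheme: zero the smallest-magnitude 10% of a weight matrix, then quantize per output channel. The real pipeline uses GPTQ's error-compensated rounding, which this baseline does not show, and the symmetric −15..15 grid is an assumption.

```python
import torch

def int5_prune_quantize(w, prune_pct=0.10):
    """Magnitude-prune then symmetric per-channel Int5 (round-to-nearest)."""
    flat = w.abs().flatten()
    k = int(prune_pct * flat.numel())
    if k > 0:
        thresh = flat.kthvalue(k).values            # 10th-percentile magnitude
        w = torch.where(w.abs() <= thresh, torch.zeros_like(w), w)
    # Per-output-channel scale mapping the channel max onto the Int5 grid.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 15.0
    q = torch.clamp(torch.round(w / scale), -15, 15)  # 31 levels fit in 5 bits
    return q.to(torch.int8), scale                    # packed to 5 bits later

# Dequantize for inference: w_hat = q.float() * scale
```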

## Run Command

```bash
NCCL_IB_DISABLE=1 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Or use `run.sh` which sets all environment variables.

## Rule Compliance

- [x] SLOT is score-first: hidden states frozen, delta adapts only through logit projection
- [x] No re-scoring of already-scored tokens
- [x] GPTQ calibration within training budget (590s train + ~5s calibration < 600s)
- [x] Artifact < 16MB (Int5 GPTQ + zstd compression)
- [x] All assertions present (artifact size, eval time budget)
- [x] inference_temp = 1.0
- [x] No n-gram cache, no multi-pass, no min(NLL)
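
The budget assertions in the checklist can be expressed as a small guard; the function name and call sites here are illustrative, while the 16MB and 600s limits come from the rules above.

```python
def check_budgets(artifact_bytes, eval_seconds,
                  max_artifact=16 * 2**20, max_eval=600.0):
    """Hard-fail if the packed artifact or eval wallclock breaks budget."""
    assert artifact_bytes < max_artifact, "artifact exceeds 16MB cap"
    assert eval_seconds < max_eval, "eval exceeds 600s budget"
    return True
```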
---

### `records/track_10min_16mb/2026-04-01_ApproachE_SLOT_QKGain/run.sh`

```bash
#!/bin/bash
# Approach E: SLOT + QK-Gain(4.0) + Int5 GPTQ + score-first TTT
# Environment: 8xH100 SXM, PyTorch 2.9+, flash-attn
set -euo pipefail

export NCCL_IB_DISABLE=1

# Model config (same as Approach B base)
export NUM_LAYERS=11
export NUM_KV_HEADS=8
export MODEL_DIM=512
export NUM_HEADS=8
export MLP_MULT=3.5
export BIGRAM_VOCAB_SIZE=6144
export BIGRAM_DIM=128
export XSA_LAST_N=11
export ROPE_DIMS=16
export LN_SCALE=1
export VE_ENABLED=1
export VE_DIM=128
export VE_LAYERS="9,10"
export TRAIN_SEQ_LEN=2048
export EVAL_SEQ_LEN=2048

# QK-Gain: raised from 1.5 to 4.0
export QK_GAIN_INIT=4.0

# Optimizer config
export MUON_WD=0.04
export ADAM_WD=0.04
export MATRIX_LR=0.025
export SCALAR_LR=0.025
export TIED_EMBED_LR=0.035
export MUON_MOMENTUM=0.99
export MUON_MOMENTUM_WARMUP_START=0.92
export MUON_MOMENTUM_WARMUP_STEPS=1500
export WARMDOWN_ITERS=3500
export MAX_WALLCLOCK_SECONDS=590
export EVAL_STRIDE=64

# Late QAT
export LATE_QAT_THRESHOLD=0.5

# Pruning
export PRUNE_PCT=0.10

# TTT config (score-first)
export TTT_EPOCHS=3
export TTT_LR=0.0001
export TTT_FREEZE_BLOCKS=2
export TTT_CHUNK_TOKENS=131072
export TTT_OPTIMIZER=adamw

# SLOT config (per-batch delta optimization)
export SLOT_ENABLED=1
export SLOT_STEPS=8
export SLOT_LR=0.005

torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee /workspace/run.log
```
---

### Submission metadata

```json
{
  "name": "Approach E: SLOT + QK-Gain + Int5 GPTQ + Score-First TTT",
  "val_bpb": 0.0,
  "bytes_total": 0,
  "bytes_code": 0,
  "blurb": "SLOT eval-time delta optimization (lr=0.005, 8 AdamW steps per batch) + QK-Gain init=4.0 + Int5 GPTQ quantization + score-first TTT (3 epochs, AdamW) + sliding window eval (stride=64). Built on Approach B (33.6M params, 11L, XSA-all, VE, BigramHash).",
  "author": "elninja",
  "github_id": "elninja",
  "date": "2026-04-01"
}
```