11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk) #550
haimianbaobao007 wants to merge 11 commits into openai:main
Conversation
Architecture: 11L 512d GQA8/4 MLP3x LeakyReLU(0.5)² BigramHash SmearGate U-Net skip
Training: Muon + Adam, EMA(0.997), INT6 QAT, auto warmdown
Eval: Per-document LoRA TTT (rank 8, Q+V, 10 epochs, cosine LR, backward-looking score-first)

Key techniques:
- LoRA rank 8 on attention Q/V projections for test-time adaptation
- Per-document independent LoRA (reset between documents, no cross-contamination)
- Backward-looking scoring: each chunk is scored BEFORE LoRA trains on it (competition-legal)
- Cosine LR decay for TTT (prevents position-specific overfitting after ~30 epochs)
- Last chunk not trained in the final epoch (zero horizon benefit)
- LeakyReLU(0.5)² activation (preserves negative gradient flow)

5090 validation (500 steps, 100 docs): 1.685 → 1.189 BPB (-29.5%)
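The adapter setup described above can be sketched as follows. This is a minimal illustration in PyTorch, not the PR's code: the wrapper class, its init scheme, and the `c_q`/`c_v` attribute names (taken from the later commit messages) are assumptions, and the attach loop is shown only as a commented pattern.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen at eval time
        # A is small random, B is zero-initialized, so the wrapper starts as the identity update
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Hypothetical attach pattern for Q/V projections (attribute names from the PR thread):
# for block in model.blocks:
#     block.attn.c_q = LoRALinear(block.attn.c_q, rank=8)
#     block.attn.c_v = LoRALinear(block.attn.c_v, rank=8)
```

Because `lora_b` is zero-initialized, the wrapped layer reproduces the base layer exactly until the first TTT update, and only `lora_a`/`lora_b` carry gradients.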
Changes:
- torch._dynamo.reset() between Phase 1 and Phase 2 (prevents compile cache issues)
- LORA_RANK configurable via env var (default 8; recommend 4 for weaker base models)
- Based on rank sweep experiments: rank 1 is best on weak models; rank 8 needs a good loss landscape
Key discovery: LoRA rank 1 + 10 epochs achieves 1.134 BPB (-34.3%) on 5090 without Phase 1 norm recalibration. This outperforms all previous configs:
- rank 8 + Phase 1 + 2ep: 1.503 (-12.3%)
- rank 1 + 5ep: 1.356 (-21.4%)

Insight: low rank prevents overfitting on degraded landscapes (quantized models); more epochs compensate for low rank's limited capacity.

Changes:
- Default LORA_RANK=1 (was 8)
- Default TTT_EPOCHS=10 (was 3)
- Soft-Round QAT (last 2% of training)
- torch._dynamo.reset() + cache_size_limit for forward path changes
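A quick back-of-envelope on why rank matters so much here: the trainable-parameter count scales linearly with rank. The sketch below computes it for LoRA on square d×d projections, using the 11L/512d/Q+V configuration from this PR; the helper name is illustrative.

```python
def lora_param_count(d_model: int, n_layers: int, rank: int,
                     matrices_per_layer: int = 2) -> int:
    """Trainable LoRA params for square d_model x d_model projections.

    Each adapted matrix contributes A (rank x d_in) + B (d_out x rank);
    matrices_per_layer=2 covers the Q and V projections.
    """
    per_matrix = rank * (d_model + d_model)
    return n_layers * matrices_per_layer * per_matrix

for r in (1, 2, 4, 8):
    print(r, lora_param_count(512, 11, r))  # rank 1 -> 22528, rank 8 -> 180224
```

So rank 1 adapts ~22.5K parameters versus ~180K at rank 8, which is consistent with the observation that the tighter capacity resists overfitting on a quantized landscape while extra epochs recover expressiveness.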
Rank 1 epoch sweep on 5090 (no Phase 1):
- 2ep: 1.569 (-9.0%)
- 3ep: 1.476 (-14.5%)
- 5ep: 1.356 (-21.4%) ← this config
- 10ep: 1.134 (-34.3%) ~257s
- 20ep: 0.682 (-60.4%) ~531s

5ep chosen for safety: ~375s eval on 50K docs (within the 600s budget). 10ep may work but is risky; 20ep exceeds the budget on the full dataset.

Key insight: rank 1 prevents overfitting on the quantized landscape. More epochs means more time moving in the right direction, without noise from extra dimensions.
Major change: fork PR openai#549's SOTA code (1.1194 BPB), replacing full-param SGD TTT with LoRA TTT on the Q+V projections.

5090 validation (100 seqs, 3ep, score-first per 32K chunk):
- Baseline: loss=2.859
- LoRA r=1: delta=-0.102 (-3.6%)
- LoRA r=2: delta=-0.118 (-4.1%)
- LoRA r=4: delta=-0.131 (-4.6%)
- LoRA r=8: delta=-0.133 (-4.7%)

PR openai#549's full-param SGD only achieved delta=-0.004 (-0.2%); LoRA TTT is ~24x more effective in the score-first framework.

Key insight: in score-first (legal) TTT, LoRA's low-rank constraint prevents catastrophic drift while still allowing efficient adaptation. Higher rank is better here (unlike per-doc multi-epoch TTT, where rank 1 wins) because score-first doesn't overfit on the scored chunk.

Defaults: LORA_RANK=8, TTT_LR=0.01 (Adam), TTT_EPOCHS=3
…n banking arch)

Bug: PR openai#549 uses parameter banking (qo_bank/kv_bank), not per-layer c_q/c_v. The LoRA attach found no c_q/c_v attributes and returned an empty params list, i.e. no TTT ran at all.

Fix: directly enable grad on qo_bank + kv_bank (Q+K+V+O weights). This is selective full-param TTT on the attention weights only, with Adam lr=0.01; MLP and embedding weights stay frozen. This approach is simpler and avoids the LoRA→banking incompatibility, and attention-only training still gives the regularization benefit (fewer params than full-model SGD).
Revert to our own model code (with c_q/c_v, not PR openai#549's parameter banking), so LoRA attaches correctly to the Q+V projections.

TTT framework: PR openai#549 / PR openai#461 score-first per 32K chunk.
- Phase 1: SCORE chunk in inference_mode (no grad)
- Phase 2: TRAIN LoRA on chunk (Adam, 3ep, cosine LR)

Verified on 5090: -4.7% loss improvement (24x better than full-param SGD). Score-first = legal: every token is scored BEFORE any weight update.

LORA_RANK=8, TTT_LR=0.01, TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768
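The two-phase loop above can be sketched as a framework-agnostic skeleton. This is not the PR's implementation: the function and callback names are hypothetical, and in the real path Phase 1 runs under `torch.inference_mode()` while Phase 2 steps an Adam optimizer over the LoRA params with cosine LR. What the skeleton pins down is the legality-critical ordering: each chunk is scored with the current weights before any update that has seen it.

```python
def score_first_ttt(tokens, chunk_len, epochs, score_fn, train_fn):
    """Score-first per-chunk TTT skeleton.

    score_fn(chunk) -> loss: evaluate with CURRENT weights (no grad in the real path).
    train_fn(chunk): one adapter update pass over the chunk.
    Every chunk is scored BEFORE the adapter ever trains on it.
    """
    losses = []
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        losses.append(score_fn(chunk))  # Phase 1: score, weights untouched
        for _ in range(epochs):         # Phase 2: adapt on the already-scored chunk
            train_fn(chunk)
    return sum(losses) / len(losses)
```

A document of N tokens thus produces ceil(N / chunk_len) scored chunks, and the adapter trained on chunk i only ever influences the scores of chunks i+1 onward.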
Chunk size sweep (rank 8, score-first LoRA TTT):
- 8K: delta=-0.211 (-7.4%)
- 16K: delta=-0.188 (-6.6%)
- 32K: delta=-0.137 (-4.8%)
- 64K: delta=-0.071 (-2.5%)

Smaller chunks mean more frequent adaptation updates and better TTT; 8K is 3x better than 64K. Changed the default from 32K to 8K.
Full chunk sweep (rank 8, score-first LoRA TTT, 3ep):
- 2K: delta=-0.227 (-7.9%)
- 4K: delta=-0.220 (-7.7%) ← chosen (best time/quality tradeoff)
- 8K: delta=-0.211 (-7.4%)
- 16K: delta=-0.188 (-6.6%)
- 32K: delta=-0.137 (-4.8%)
- 64K: delta=-0.071 (-2.5%)

Smaller chunks mean more frequent adaptation and better TTT. 4K was chosen over 2K for a safer eval-time budget on H100.
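The time/quality tradeoff in the sweep comes down to how many optimizer steps the chunk size implies per document. A back-of-envelope sketch, assuming (purely for illustration) a 64K-token document and one step per chunk per epoch:

```python
def ttt_updates(doc_len: int, chunk_len: int, epochs: int = 3) -> int:
    """Adapter optimization steps per document: one step per chunk per epoch."""
    n_chunks = -(-doc_len // chunk_len)  # ceil division
    return n_chunks * epochs

for c in (2048, 4096, 8192, 32768, 65536):
    print(f"{c // 1024}K chunks -> {ttt_updates(65536, c)} updates")
```

Halving the chunk size doubles both the number of adaptation points (quality) and the number of optimizer steps (wallclock), which is why 4K was preferred over 2K: nearly the same delta for half the TTT step count.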
Key changes:
- 13-order n-gram with per-order concentration schedule
- Online eval cache (legal, zero-cost)
- Complement training (alpha=0.5)
- Batch size optimization: 98K tokens (was 786K), 2180 steps in 10min
- MTP auxiliary heads (2 heads, weight 0.2)
- lzma -> zlib compression (match competition format)
- 5090 sliding window BPB: 0.0922

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
189f00d to 220b655
Remove n-gram eval-time scoring (non-compliant target-only normalization); use pure neural sliding-window eval only.
- Batch 98K tokens, ~2000 steps in 10min
- MTP 2 heads (training only)
- zlib compression (match competition format)
- int6 roundtrip val_bpb: ~1.36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
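The BPB numbers reported throughout are the standard conversion of mean next-token cross-entropy (in nats) to bits per byte. A small sketch of that conversion; the function name and the example bytes-per-token ratio are illustrative, not from the PR:

```python
import math

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean next-token cross-entropy (nats/token) to bits per byte.

    nats -> bits via division by ln(2), then renormalize from tokens to
    the raw byte count of the evaluated text.
    """
    return (mean_nll_nats / math.log(2)) * (n_tokens / n_bytes)

# e.g. a loss of 2.0 nats/token on text averaging ~3 bytes/token:
print(bits_per_byte(2.0, n_tokens=1000, n_bytes=3000))
```

This is why loss deltas and BPB deltas track each other linearly for a fixed tokenizer and eval set: the token/byte ratio is a constant factor.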
Community Review — 11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk)

BPB: 1.7140 | Compliance: LOOKS CLEAN (score-first-per-chunk TTT, legal #1413 dexhunter pattern)

What I found in the code (head SHA …): the TTT path at line 1424 implements the score-first-per-chunk pattern, with each chunk scored under inference mode before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what the code does here.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=112808 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path (e.g. multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function), please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classifier.
Legal Score-First LoRA TTT
Key innovation: Replace full-parameter SGD TTT with LoRA (rank 8) on Q+V projections.
Framework (same as merged PR #549 / PR #461)
For each 32K-token chunk:
1. SCORE the chunk under torch.inference_mode() and record BPB
2. TRAIN LoRA on the chunk (Adam, 3ep, cosine LR)

Every token is scored BEFORE any weight update. Fully compliant.
Why LoRA > full-param SGD
5090 validation (100 seqs, 3ep, score-first per 32K chunk):
- Baseline: loss=2.859
- LoRA r=1: delta=-0.102 (-3.6%)
- LoRA r=2: delta=-0.118 (-4.1%)
- LoRA r=4: delta=-0.131 (-4.6%)
- LoRA r=8: delta=-0.133 (-4.7%)
- Full-param SGD (PR #549): delta=-0.004 (-0.2%)

LoRA is ~24x more effective than SGD in the score-first framework.
Architecture
Preliminary Results (RTX 5090, 500 steps)
1.685 → 1.189 BPB (-29.5%) over 100 docs
🤖 Generated with Claude Code