11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk) #550
haimianbaobao007 wants to merge 11 commits into openai:main
Conversation
Architecture: 11L 512d GQA8/4 MLP3x LeakyReLU(0.5)² BigramHash SmearGate U-Net skip
Training: Muon + Adam, EMA(0.997), INT6 QAT, auto warmdown
Eval: Per-document LoRA TTT (rank 8, Q+V, 10 epochs, cosine LR, backward-looking score-first)

Key techniques:
- LoRA rank 8 on attention Q/V projections for test-time adaptation
- Per-document independent LoRA (reset between documents, no cross-contamination)
- Backward-looking scoring: each chunk is scored BEFORE LoRA trains on it (competition-legal)
- Cosine LR decay for TTT (prevents position-specific overfitting after ~30 epochs)
- Last chunk not trained in the final epoch (zero horizon benefit)
- LeakyReLU(0.5)² activation (preserves negative gradient flow)

5090 validation (500 steps, 100 docs): 1.685 → 1.189 BPB (-29.5%)
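The adapter setup described above can be sketched as follows. This is a minimal illustration in PyTorch, not the PR's code: the wrapper class, its init scheme, and the `c_q`/`c_v` attribute names (taken from the later commit messages) are assumptions, and the attach loop is shown only as a commented pattern.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen at eval time
        # A is small random, B is zero-initialized, so the wrapper starts as the identity update
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Hypothetical attach pattern for Q/V projections (attribute names from the PR thread):
# for block in model.blocks:
#     block.attn.c_q = LoRALinear(block.attn.c_q, rank=8)
#     block.attn.c_v = LoRALinear(block.attn.c_v, rank=8)
```

Because `lora_b` is zero-initialized, the wrapped layer reproduces the base layer exactly until the first TTT update, and only `lora_a`/`lora_b` carry gradients.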
Changes:
- torch._dynamo.reset() between Phase 1 and Phase 2 (prevents compile cache issues)
- LORA_RANK configurable via env var (default 8; recommend 4 for weaker base models)
- Based on rank sweep experiments: rank 1 is best on weak models; rank 8 needs a good loss landscape
Key discovery: LoRA rank 1 + 10 epochs achieves 1.134 BPB (-34.3%) on 5090 without Phase 1 norm recalibration. This outperforms all previous configs:
- rank 8 + Phase 1 + 2ep: 1.503 (-12.3%)
- rank 1 + 5ep: 1.356 (-21.4%)

Insight: low rank prevents overfitting on degraded landscapes (quantized models); more epochs compensate for low rank's limited capacity.

Changes:
- Default LORA_RANK=1 (was 8)
- Default TTT_EPOCHS=10 (was 3)
- Soft-Round QAT (last 2% of training)
- torch._dynamo.reset() + cache_size_limit for forward path changes
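A quick back-of-envelope on why rank matters so much here: the trainable-parameter count scales linearly with rank. The sketch below computes it for LoRA on square d×d projections, using the 11L/512d/Q+V configuration from this PR; the helper name is illustrative.

```python
def lora_param_count(d_model: int, n_layers: int, rank: int,
                     matrices_per_layer: int = 2) -> int:
    """Trainable LoRA params for square d_model x d_model projections.

    Each adapted matrix contributes A (rank x d_in) + B (d_out x rank);
    matrices_per_layer=2 covers the Q and V projections.
    """
    per_matrix = rank * (d_model + d_model)
    return n_layers * matrices_per_layer * per_matrix

for r in (1, 2, 4, 8):
    print(r, lora_param_count(512, 11, r))  # rank 1 -> 22528, rank 8 -> 180224
```

So rank 1 adapts ~22.5K parameters versus ~180K at rank 8, which is consistent with the observation that the tighter capacity resists overfitting on a quantized landscape while extra epochs recover expressiveness.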
Rank 1 epoch sweep on 5090 (no Phase 1):
- 2ep: 1.569 (-9.0%)
- 3ep: 1.476 (-14.5%)
- 5ep: 1.356 (-21.4%) ← this config
- 10ep: 1.134 (-34.3%) ~257s
- 20ep: 0.682 (-60.4%) ~531s

5ep chosen for safety: ~375s eval on 50K docs (within the 600s budget). 10ep may work but is risky; 20ep exceeds the budget on the full dataset.

Key insight: rank 1 prevents overfitting on the quantized landscape. More epochs means more time moving in the right direction, without noise from extra dimensions.
Major change: fork PR openai#549's SOTA code (1.1194 BPB), replacing full-param SGD TTT with LoRA TTT on the Q+V projections.

5090 validation (100 seqs, 3ep, score-first per 32K chunk):
- Baseline: loss=2.859
- LoRA r=1: delta=-0.102 (-3.6%)
- LoRA r=2: delta=-0.118 (-4.1%)
- LoRA r=4: delta=-0.131 (-4.6%)
- LoRA r=8: delta=-0.133 (-4.7%)

PR openai#549's full-param SGD only achieved delta=-0.004 (-0.2%); LoRA TTT is ~24x more effective in the score-first framework.

Key insight: in score-first (legal) TTT, LoRA's low-rank constraint prevents catastrophic drift while still allowing efficient adaptation. Higher rank is better here (unlike per-doc multi-epoch TTT, where rank 1 wins) because score-first doesn't overfit on the scored chunk.

Defaults: LORA_RANK=8, TTT_LR=0.01 (Adam), TTT_EPOCHS=3
…n banking arch)

Bug: PR openai#549 uses parameter banking (qo_bank/kv_bank), not per-layer c_q/c_v. The LoRA attach found no c_q/c_v attributes and returned an empty params list, i.e. no TTT ran at all.

Fix: directly enable grad on qo_bank + kv_bank (Q+K+V+O weights). This is selective full-param TTT on the attention weights only, with Adam lr=0.01; MLP and embedding weights stay frozen. This approach is simpler and avoids the LoRA→banking incompatibility, and attention-only training still gives the regularization benefit (fewer params than full-model SGD).
Revert to our own model code (with c_q/c_v, not PR openai#549's parameter banking), so LoRA attaches correctly to the Q+V projections.

TTT framework: PR openai#549 / PR openai#461 score-first per 32K chunk.
- Phase 1: SCORE chunk in inference_mode (no grad)
- Phase 2: TRAIN LoRA on chunk (Adam, 3ep, cosine LR)

Verified on 5090: -4.7% loss improvement (24x better than full-param SGD). Score-first = legal: every token is scored BEFORE any weight update.

LORA_RANK=8, TTT_LR=0.01, TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768
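The two-phase loop above can be sketched as a framework-agnostic skeleton. This is not the PR's implementation: the function and callback names are hypothetical, and in the real path Phase 1 runs under `torch.inference_mode()` while Phase 2 steps an Adam optimizer over the LoRA params with cosine LR. What the skeleton pins down is the legality-critical ordering: each chunk is scored with the current weights before any update that has seen it.

```python
def score_first_ttt(tokens, chunk_len, epochs, score_fn, train_fn):
    """Score-first per-chunk TTT skeleton.

    score_fn(chunk) -> loss: evaluate with CURRENT weights (no grad in the real path).
    train_fn(chunk): one adapter update pass over the chunk.
    Every chunk is scored BEFORE the adapter ever trains on it.
    """
    losses = []
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        losses.append(score_fn(chunk))  # Phase 1: score, weights untouched
        for _ in range(epochs):         # Phase 2: adapt on the already-scored chunk
            train_fn(chunk)
    return sum(losses) / len(losses)
```

A document of N tokens thus produces ceil(N / chunk_len) scored chunks, and the adapter trained on chunk i only ever influences the scores of chunks i+1 onward.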
Chunk size sweep (rank 8, score-first LoRA TTT):
- 8K: delta=-0.211 (-7.4%)
- 16K: delta=-0.188 (-6.6%)
- 32K: delta=-0.137 (-4.8%)
- 64K: delta=-0.071 (-2.5%)

Smaller chunks mean more frequent adaptation updates and better TTT; 8K is 3x better than 64K. Changed the default from 32K to 8K.
Full chunk sweep (rank 8, score-first LoRA TTT, 3ep):
- 2K: delta=-0.227 (-7.9%)
- 4K: delta=-0.220 (-7.7%) ← chosen (best time/quality tradeoff)
- 8K: delta=-0.211 (-7.4%)
- 16K: delta=-0.188 (-6.6%)
- 32K: delta=-0.137 (-4.8%)
- 64K: delta=-0.071 (-2.5%)

Smaller chunks mean more frequent adaptation and better TTT. 4K was chosen over 2K for a safer eval-time budget on H100.
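The time/quality tradeoff in the sweep comes down to how many optimizer steps the chunk size implies per document. A back-of-envelope sketch, assuming (purely for illustration) a 64K-token document and one step per chunk per epoch:

```python
def ttt_updates(doc_len: int, chunk_len: int, epochs: int = 3) -> int:
    """Adapter optimization steps per document: one step per chunk per epoch."""
    n_chunks = -(-doc_len // chunk_len)  # ceil division
    return n_chunks * epochs

for c in (2048, 4096, 8192, 32768, 65536):
    print(f"{c // 1024}K chunks -> {ttt_updates(65536, c)} updates")
```

Halving the chunk size doubles both the number of adaptation points (quality) and the number of optimizer steps (wallclock), which is why 4K was preferred over 2K: nearly the same delta for half the TTT step count.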
Key changes:
- 13-order n-gram with per-order concentration schedule
- Online eval cache (legal, zero-cost)
- Complement training (alpha=0.5)
- Batch size optimization: 98K tokens (was 786K), 2180 steps in 10min
- MTP auxiliary heads (2 heads, weight 0.2)
- lzma -> zlib compression (match competition format)
- 5090 sliding window BPB: 0.0922

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
189f00d to 220b655
Remove n-gram eval-time scoring (non-compliant target-only normalization); use pure neural sliding-window eval only.
- Batch 98K tokens, ~2000 steps in 10min
- MTP 2 heads (training only)
- zlib compression (match competition format)
- int6 roundtrip val_bpb: ~1.36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
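The BPB numbers reported throughout are the standard conversion of mean next-token cross-entropy (in nats) to bits per byte. A small sketch of that conversion; the function name and the example bytes-per-token ratio are illustrative, not from the PR:

```python
import math

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean next-token cross-entropy (nats/token) to bits per byte.

    nats -> bits via division by ln(2), then renormalize from tokens to
    the raw byte count of the evaluated text.
    """
    return (mean_nll_nats / math.log(2)) * (n_tokens / n_bytes)

# e.g. a loss of 2.0 nats/token on text averaging ~3 bytes/token:
print(bits_per_byte(2.0, n_tokens=1000, n_bytes=3000))
```

This is why loss deltas and BPB deltas track each other linearly for a fixed tokenizer and eval set: the token/byte ratio is a constant factor.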
Community Review — 11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk)

BPB: 1.7140 | Compliance: LOOKS CLEAN (score-first-per-chunk TTT, legal #1413 dexhunter pattern)

What I found in the code (head SHA …): the TTT path at line 1424 implements the score-first-per-chunk pattern, with each chunk scored under inference mode before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what the code does here.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=112808 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path (e.g. multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function), please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classifier.
Legal Score-First LoRA TTT
Key innovation: Replace full-parameter SGD TTT with LoRA (rank 8) on Q+V projections.
Framework (same as merged PR #549 / PR #461)
For each 32K-token chunk:
1. SCORE the chunk under torch.inference_mode() and record BPB
2. TRAIN LoRA on the chunk (Adam, 3ep, cosine LR)

Every token is scored BEFORE any weight update. Fully compliant.
Why LoRA > full-param SGD
5090 validation (100 seqs, 3ep, score-first per 32K chunk):
- Baseline: loss=2.859
- LoRA r=1: delta=-0.102 (-3.6%)
- LoRA r=2: delta=-0.118 (-4.1%)
- LoRA r=4: delta=-0.131 (-4.6%)
- LoRA r=8: delta=-0.133 (-4.7%)
- Full-param SGD (PR #549): delta=-0.004 (-0.2%)

LoRA is ~24x more effective than SGD in the score-first framework.
Architecture
Preliminary Results (RTX 5090, 500 steps)
1.685 → 1.189 BPB (-29.5%) over 100 docs
🤖 Generated with Claude Code