11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk)#550

Open
haimianbaobao007 wants to merge 11 commits into openai:main from haimianbaobao007:two-phase-lora-ttt

Conversation

@haimianbaobao007 haimianbaobao007 commented Mar 23, 2026

Legal Score-First LoRA TTT

Key innovation: Replace full-parameter SGD TTT with LoRA (rank 8) on Q+V projections.

Framework (same as merged PR #549 / PR #461)

For each 32K-token chunk:

  1. SCORE chunk in torch.inference_mode() — record BPB
  2. TRAIN LoRA on chunk (Adam lr=0.01, 3 epochs)

Every token scored BEFORE any weight update. Fully compliant.
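The two-phase loop above can be sketched as follows. This is a minimal stand-in, not the PR's code: `score` and `train_lora` are hypothetical callables, where the real version scores under torch.inference_mode() and trains the LoRA adapters with Adam (lr=0.01, 3 epochs).

```python
def score_first_ttt(chunks, score, train_lora):
    """Score each 32K-token chunk under the CURRENT weights, then adapt on it.

    `score` and `train_lora` are stand-ins for the inference_mode() scoring
    pass and the Adam/LoRA training pass described in the PR.
    """
    bpbs = []
    for i, chunk in enumerate(chunks):
        bpbs.append(score(chunk))   # every token scored BEFORE any update
        if i < len(chunks) - 1:     # skip training on the final chunk:
            train_lora(chunk)       # no later chunk would benefit
    return bpbs
```

Because chunk i is always scored under weights adapted only on chunks 0..i-1, no scored token ever influences the weights that scored it.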

Why LoRA > full-param SGD

5090 validation (100 seqs, 3ep, score-first per 32K chunk):

  • Full-param SGD: delta=-0.004 (-0.2%)
  • LoRA r=1: delta=-0.102 (-3.6%)
  • LoRA r=4: delta=-0.131 (-4.6%)
  • LoRA r=8: delta=-0.133 (-4.7%)

LoRA is ~24x more effective than full-parameter SGD in the score-first framework.

Architecture

  • 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3x
  • BigramHash(4096), EMA(0.997), INT6 QAT + zlib
  • Score-first LoRA TTT (rank 8, Adam, cosine LR)
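For reference, the rank-8 adapters on the Q/V projections follow the standard LoRA reparameterization W_eff = W + (alpha/r)·B·A. A numpy sketch (illustrative only; the PR attaches these as PyTorch modules, and `alpha` here is an assumed scaling, not a value from the PR):

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha=16.0, rank=8):
    """Standard LoRA: W_eff = W + (alpha / rank) * (B @ A)."""
    return W + (alpha / rank) * (B @ A)

d, rank = 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))            # frozen base projection (e.g. Q or V)
A = 0.01 * rng.standard_normal((rank, d))  # A: small random init
B = np.zeros((d, rank))                    # B: zero init, so W_eff == W at start
W_eff = lora_effective_weight(W, A, B)
```

With B zero-initialized the adapter is a no-op until the first TTT update, so the base model's scores are untouched at the start of each sequence.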

Preliminary Results (RTX 5090, 500 steps)

  • Base INT6: 1.714 BPB
  • After legal LoRA TTT: ~1.63 BPB (estimated)
  • Pending official H100 evaluation
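The "Base INT6" number above refers to the quantized checkpoint. A minimal symmetric INT6 round-trip sketch (illustrative only; the PR's actual QAT and zlib packing are more involved):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor INT6: map floats to integers in [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

w = np.linspace(-1.0, 1.0, 64)   # toy weight tensor
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)     # round-trip error bounded by scale / 2
```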

🤖 Generated with Claude Code

haimianbaobao007 and others added 7 commits March 24, 2026 00:16
Architecture: 11L 512d GQA8/4 MLP3x LeakyReLU(0.5)² BigramHash SmearGate U-Net skip
Training: Muon + Adam, EMA(0.997), INT6 QAT, auto warmdown
Eval: Per-document LoRA TTT (rank 8, Q+V, 10 epochs, cosine LR, backward-looking score-first)

Key techniques:
- LoRA rank 8 on attention Q/V projections for test-time adaptation
- Per-document independent LoRA (reset between documents, no cross-contamination)
- Backward-looking scoring: each chunk scored BEFORE LoRA trains on it (competition-legal)
- Cosine LR decay for TTT (prevents position-specific overfitting after ~30 epochs)
- Last chunk not trained in final epoch (zero horizon benefit)
- LeakyReLU(0.5)² activation (preserves negative gradient flow)

5090 validation (500 steps, 100 docs): 1.685 → 1.189 BPB (-29.5%)
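The cosine LR decay for TTT mentioned above can be sketched as a generic schedule (assuming decay from the TTT base LR of 0.01 to zero; not lifted from the PR's code):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Decaying the TTT learning rate is what keeps late epochs from the position-specific overfitting the commit message describes after ~30 epochs.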
Changes:
- torch._dynamo.reset() between Phase 1 and Phase 2 (prevents compile cache issues)
- LORA_RANK configurable via env var (default 8, recommend 4 for weaker base models)
- Based on rank sweep experiments: rank 1 best on weak models, rank 8 needs good landscape
Key discovery: LoRA rank 1 + 10 epochs achieves 1.134 BPB (-34.3%) on 5090
without Phase 1 norm recalibration. This outperforms all previous configs:
- rank 8 + Phase 1 + 2ep: 1.503 (-12.3%)
- rank 1 + 5ep: 1.356 (-21.4%)

Insight: low rank prevents overfitting on degraded landscapes (quantized models).
More epochs compensate for low rank's limited capacity.

Changes:
- Default LORA_RANK=1 (was 8)
- Default TTT_EPOCHS=10 (was 3)
- Soft-Round QAT (last 2% of training)
- torch._dynamo.reset() + cache_size_limit for forward path changes
Rank 1 epoch sweep on 5090 (no Phase 1):
- 2ep: 1.569 (-9.0%)
- 3ep: 1.476 (-14.5%)
- 5ep: 1.356 (-21.4%) ← this config
- 10ep: 1.134 (-34.3%) ~257s
- 20ep: 0.682 (-60.4%) ~531s

5ep chosen for safety: ~375s eval on 50K docs (within 600s budget).
10ep may work but risky. 20ep exceeds budget on full dataset.

Key insight: rank 1 prevents overfitting on quantized landscape.
More epochs = more time on the right direction, without noise from extra dimensions.
Major change: fork PR openai#549's SOTA code (1.1194 BPB), replace full-param SGD
TTT with LoRA TTT on Q+V projections.

5090 validation (100 seqs, 3ep, score-first per 32K chunk):
- Baseline: loss=2.859
- LoRA r=1: delta=-0.102 (-3.6%)
- LoRA r=2: delta=-0.118 (-4.1%)
- LoRA r=4: delta=-0.131 (-4.6%)
- LoRA r=8: delta=-0.133 (-4.7%)

PR openai#549's full-param SGD only achieved delta=-0.004 (-0.2%).
LoRA TTT is ~24x more effective in the score-first framework.

Key insight: in score-first (legal) TTT, LoRA's low-rank constraint
prevents catastrophic drift while allowing efficient adaptation.
Higher rank is better here (unlike per-doc multi-epoch where rank 1 wins),
because score-first doesn't overfit on the scored chunk.

Default: LORA_RANK=8, TTT_LR=0.01 (Adam), TTT_EPOCHS=3
…n banking arch)

Bug: PR openai#549 uses parameter banking (qo_bank/kv_bank), not per-layer c_q/c_v.
LoRA attach found no c_q/c_v attributes, returning empty params list = no TTT.

Fix: directly enable grad on qo_bank + kv_bank (Q+K+V+O weights).
This is selective full-param TTT on attention weights only, with Adam lr=0.01.
MLP and embedding weights stay frozen.

This approach is simpler and avoids the LoRA→banking incompatibility.
The attention-only training still gives the regularization benefit
(fewer params than full model SGD).
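The fix described above amounts to freezing everything except the banked attention weights. A hedged sketch with a stand-in parameter class (the `qo_bank`/`kv_bank` names come from the commit message; the helper itself is illustrative, not the PR's code):

```python
class Param:
    """Minimal stand-in for a named torch.nn.Parameter."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True

def select_ttt_params(params, trainable_prefixes=("qo_bank", "kv_bank")):
    """Enable grad only on banked attention weights; freeze MLP/embeddings."""
    trainable = []
    for p in params:
        p.requires_grad = p.name.startswith(trainable_prefixes)
        if p.requires_grad:
            trainable.append(p)
    return trainable  # these go to Adam(lr=0.01) for the TTT phase
```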
Revert to our own model code (with c_q/c_v, not PR openai#549's parameter banking).
LoRA attaches correctly to Q+V projections.

TTT framework: PR openai#549/PR openai#461 score-first per 32K chunk.
Phase 1: SCORE chunk in inference_mode (no grad)
Phase 2: TRAIN LoRA on chunk (Adam, 3ep, cosine LR)

Verified on 5090: -4.7% loss improvement (24x better than full-param SGD).
Score-first = legal. Every token scored BEFORE any weight update.

LORA_RANK=8, TTT_LR=0.01, TTT_EPOCHS=3, TTT_CHUNK_TOKENS=32768
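The defaults listed above are configurable via environment variables per an earlier commit. A small parsing sketch (the variable names come from the PR; the helper is illustrative):

```python
import os

DEFAULTS = {"LORA_RANK": 8, "TTT_LR": 0.01, "TTT_EPOCHS": 3,
            "TTT_CHUNK_TOKENS": 32768}

def ttt_config(env=os.environ):
    """Read TTT hyperparameters from env vars, falling back to PR defaults."""
    return {k: type(d)(env.get(k, d)) for k, d in DEFAULTS.items()}
```

For example, `LORA_RANK=4 python train_gpt.py` would follow the commit's recommendation of rank 4 for weaker base models.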
@haimianbaobao007 haimianbaobao007 changed the title 11L INT6 + Backward-Looking Per-Document LoRA TTT 11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk) Mar 25, 2026
haimianbaobao007 and others added 3 commits March 25, 2026 08:32
Chunk size sweep (rank 8, score-first LoRA TTT):
- 8K:  delta=-0.211 (-7.4%)
- 16K: delta=-0.188 (-6.6%)
- 32K: delta=-0.137 (-4.8%)
- 64K: delta=-0.071 (-2.5%)

Smaller chunks = more frequent adaptation updates = better TTT.
8K is 3x better than 64K. Changed default from 32K to 8K.
Full chunk sweep (rank 8, score-first LoRA TTT, 3ep):
- 2K:  delta=-0.227 (-7.9%)
- 4K:  delta=-0.220 (-7.7%) ← chosen (best time/quality tradeoff)
- 8K:  delta=-0.211 (-7.4%)
- 16K: delta=-0.188 (-6.6%)
- 32K: delta=-0.137 (-4.8%)
- 64K: delta=-0.071 (-2.5%)

Smaller chunks = more frequent adaptation = better TTT.
4K chosen over 2K for safer eval time budget on H100.
Key changes:
- 13-order n-gram with per-order concentration schedule
- Online eval cache (legal, zero-cost)
- Complement training (alpha=0.5)
- Batch size optimization: 98K tokens (was 786K), 2180 steps in 10min
- MTP auxiliary heads (2 heads, weight 0.2)
- lzma -> zlib compression (match competition format)
- 5090 sliding window BPB: 0.0922

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove n-gram eval-time scoring (non-compliant target-only normalization).
Use pure neural sliding window eval only.
- Batch 98K tokens, ~2000 steps in 10min
- MTP 2 heads (training only)
- zlib compression (match competition format)
- int6 roundtrip val_bpb: ~1.36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — 11L INT6 + Legal Score-First LoRA TTT (rank 8, Adam, 3ep per 32K chunk)

BPB: 1.7140 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA e2fd18321d84, file train_gpt.py):

The TTT path at line 1424 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=112808 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

