Record: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955 #1318
renqianluo wants to merge 1 commit into openai:main from …
…Clip5 + GPTQ DAMP=0.005 — val_bpb 1.00955 (3-seed mean)
…lip=5, warm-start

Port L-BFGS SLOT from PR openai#1318 into our causal SLOT framework:
- Delta in logit space [1,1,vocab_size=1024] instead of hidden space [1,1,512]
- L-BFGS optimizer (strong_wolfe, max_iter=25, history=20) replaces AdamW
- Focal loss: optimize on last 128 tokens intersected with causal context
- Warm-start: carry delta from previous batch
- Delta clamp ±5 for stability
- All config HARDCODED (env vars not forwarded to GPU)
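The logit-space delta described in the list above can be illustrated with a minimal sketch. This is not the PR's implementation: it uses plain projected gradient descent as a stand-in for L-BFGS so that it runs with the standard library only, and the names `window_loss_and_grad`, `fit_delta`, `focus`, `lr`, and `clamp` are hypothetical. It also does not model which tokens are legally visible under the causal constraint; it only shows a single shared vector added to every position's logits, a loss restricted to the last `focus` tokens, a clamp to ±5, and a warm start from a previous delta.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def window_loss_and_grad(logits, targets, delta, focus):
    """Mean cross-entropy over the last `focus` positions of a window,
    with the same `delta` added to every position's logits.
    Returns (loss, gradient of the loss with respect to delta)."""
    start = max(0, len(targets) - focus)
    n = len(targets) - start
    loss = 0.0
    grad = [0.0] * len(delta)
    for t in range(start, len(targets)):
        probs = softmax([l + d for l, d in zip(logits[t], delta)])
        loss -= math.log(probs[targets[t]])
        for v, p in enumerate(probs):
            # d(CE)/d(logit_v) = p_v - 1[v == target]
            grad[v] += p - (1.0 if v == targets[t] else 0.0)
    return loss / n, [g / n for g in grad]

def fit_delta(logits, targets, delta=None, focus=128, steps=25, lr=0.5, clamp=5.0):
    """Fit one delta vector per window. ASSUMPTION: gradient descent
    replaces the L-BFGS optimizer named in the PR."""
    vocab = len(logits[0])
    # Warm start: reuse the previous window's delta when one is given.
    delta = list(delta) if delta is not None else [0.0] * vocab
    for _ in range(steps):
        _, grad = window_loss_and_grad(logits, targets, delta, focus)
        # Gradient step followed by a clamp to [-clamp, +clamp].
        delta = [min(clamp, max(-clamp, d - lr * g)) for d, g in zip(delta, grad)]
    return delta
```

To warm-start the next window in this sketch, pass the returned vector back in, for example `fit_delta(next_logits, next_targets, delta=delta)`.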
I think this PR would be much easier to evaluate if it added a short explicit compliance section against the current README / …

Right now, the two places that seem ambiguous are:

Under …

If the answer to those is yes, I think it would really help reviewers if the PR body said so explicitly, for example in a small …

Concretely, I think the most useful clarifications would be:

…

Not trying to nitpick the result here; I think the current writeup just leaves the legality story underspecified relative to the current README / …
Community Review — Record: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955

BPB: 1.00955 | Compliance: LOOKS CLEAN, score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA …)

The TTT path at line 1230 implements the score-first-per-chunk pattern: each chunk is scored under …

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here: chunk …

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 1.53s, dim=512, layers=11, vocab=1024, code=149224 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path (e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function), please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based …
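The score-first-per-chunk rule the review refers to (each token is scored before the adapter updates on it) can be shown with a toy example. The sketch below is an illustration only, not the code under review: a Laplace-smoothed unigram count table stands in for the adapted model, and `score_first_ttt` is a hypothetical name. What matters is the ordering inside the loop: the whole chunk is scored with the state frozen, and only afterwards does the state see that chunk.

```python
import math

def score_first_ttt(tokens, vocab_size, chunk_size):
    """Return the total negative log-likelihood (nats) of `tokens` under a
    unigram model that adapts chunk by chunk, scoring each chunk first."""
    # ASSUMPTION: add-one counts stand in for the model/adapter weights.
    counts = [1.0] * vocab_size
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total = sum(counts)
        # Step 1: score every token in the chunk with the state frozen.
        for tok in chunk:
            total_nll -= math.log(counts[tok] / total)
        # Step 2: only now update the state on the chunk just scored.
        for tok in chunk:
            counts[tok] += 1.0
    return total_nll
```

Swapping steps 1 and 2 would let the model train on tokens before scoring them, which is the pattern the cited issues rule out.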
Result
val_bpb: 1.00955 (3-seed mean) | ~15.71 MB | 8×H100 SXM | ~568s eval
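For readers unfamiliar with the metric, bits per byte is conventionally the summed token-level negative log-likelihood converted from nats to bits and divided by the number of bytes in the evaluated text. The sketch below assumes that standard definition; the function name `bits_per_byte` is hypothetical, and the PR's evaluation script may differ in details such as how bytes are counted.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats into bits per byte.
    ASSUMPTION: `total_bytes` is the UTF-8 byte length of the scored text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the denominator is bytes rather than tokens, values computed this way stay comparable across tokenizers with different vocabulary sizes.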
Key Changes vs Leaderboard SOTA (1.11437)
d ∈ R^{1024} added to logits for each sliding window via L-BFGS (max_iter=25, history=20, strong-Wolfe, warm-start). Uses focal loss on the last 128 tokens per window. Delta clamped to ±5 for stability.
Technique Stack
Reproduction