SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)#1499
SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)#1499dippatel1994 wants to merge 7 commits intoopenai:mainfrom
Conversation
SP8192 tokenizer, 3-layer depth recurrence (layers 3-5 looped 2x), parallel residuals on layers 7+, U-Net skips, GPTQ int6/int5. 14.09MB artifact. val_bpb=1.6323 on 1xH100 (1/8 competition compute).
- 11 layers (from 10), MLP mult 4.0 (from 3.0) - SP8192 as default tokenizer - Depth recurrence layers 3-5 x2 enabled by default - Parallel residuals on layers 7+ enabled by default - Weight decay 0.085 (frontier-tuned) - val_bpb=1.620 on 1xH100 (14.42MB artifact)
- SDClip (k=12.85) for GPTQ scale selection - MuonEq-R (row-normalized Muon optimizer) - Pre-quant TTT (10 epochs, AdamW lr=0.00045, cosine decay) - Brotli compression with byte shuffle - Delayed depth recurrence (step 3000) - QK-Gain 5.25, XSA last 4, EMA 0.9965, WD 0.095 8xH100 validated: 915 steps, val_bpb=1.3079, pre-quant TTT loss 3.74->3.06 GPTQ artifact: 12.23 MB (brotli). Sliding eval needs competition infra.
Full frontier config validated: - SDClip GPTQ (k=12.85): fixed quantization for QK-Gain 5.25 - MuonEq-R: row-normalized optimizer - Pre-quant TTT: rank-0 only with weight broadcast (fixed DDP issue) - Brotli + byte shuffle compression: 14.09 MB artifact - 2896 steps, val_bpb=1.261 pre-GPTQ, 1.479 post-GPTQ (standard eval) - On 8xH100 with sliding eval + pre-quant TTT: estimated 1.10-1.20 BPB
- Added train_seed42_1xH100.log (required by competition rules) - Updated submission.json with v4 confirmed results (1.4794 BPB) - Updated README with full technique descriptions and reproduction steps - Includes all required files: README.md, submission.json, train_gpt.py, train log
Community Review — SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)BPB: 1.6323 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on What I found in the code (head SHA At line 996 the pre-quant TTT function takes Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster. Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=8192, code=74713 B, SMOKE_TEST_PASS Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=8192, code=74713 B, SMOKE_TEST_PASS. Classification via deterministic AST-based |
…#402/openai#677) Pre-quant TTT that trains directly on val_tokens without score-first discipline is non-compliant per community review. Disabled by default (PREQUANT_TTT_ENABLED=0). The function remains in code but is not called unless explicitly enabled. All other techniques (SDClip GPTQ, MuonEq-R, depth recurrence, parallel residuals, brotli, QK-Gain 5.25) are unaffected. Confirmed val_bpb=1.4794 on 1xH100 WITHOUT pre-quant TTT.
|
Thanks @MatoTeziTanka for the detailed review and catching the compliance issue. Fixed in commit fd9bde7: Pre-quant TTT is now disabled by default ( The confirmed val_bpb=1.4794 was already measured without pre-quant TTT (the 1xH100 test run had All other techniques are compliant:
Happy to make further adjustments if needed. |
Training now stops at 590s (600s - 10s reserve), leaving time for GPTQ compression to complete within the total budget. Matches the pattern from PR openai#1487 (gptq_reserve_seconds=10).
|
Re-audited at head SHA Gauntlet result (CT2038, Python 3.10, torch 2.10.0+cpu): Pre-quant TTT fix confirmed. Line 1300: What the code does at default competition settings (
Updated verdict: LOOKS CLEAN. The pre-quant TTT that was flagged is disabled. The two active eval-time mechanisms (sliding TTT + n-gram cache) are both legal — score-first TTT matches #1413, and the n-gram uses context-only keys matching #803. Citation correction: My original review cited #1416/#1423 as "the legal Pre-Quant TTT pattern." That was wrong — both have the illegal flat-epoch pattern. The correct legal TTT reference is PR #1413 (dexhunter). Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending usual record-track checks. Note that the reported 1.4794 BPB is a conservative 1×H100 standard-eval baseline — the 8×H100 score with sliding TTT + n-gram active will be lower. Thanks for the fast turnaround @dippatel1994. Re-audit by @MatoTeziTanka. CPU gauntlet on CT2038 (Python 3.10, torch 2.10.0+cpu): IMPORT_OK, MODEL_OK, FORWARD_OK. Full code review: pre-quant TTT disabled (line 1300), sliding TTT is score-first (lines 942-971), n-gram cache uses context-only keys (line 789). |
Summary
kevclark/parameter-golf) — 8192 vocab BPE for lower tokens-per-byteStacked with: BigramHash(10240), SmearGate, EMA(0.997), LeakyReLU squared, GQA(8q/4kv), partial RoPE(16), value residual, XSA(last 4), orthogonal init, Muon+AdamW, late QAT.
Results
Reproduction
Ablation (sp1024, 2-min, 1xH100)
Test plan