
SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)#1499

Open
dippatel1994 wants to merge 7 commits into openai:main from dippatel1994:submission/sp8192-depth-recur-parallel-resid

Conversation

@dippatel1994

Summary

  • SP8192 tokenizer (from kevclark/parameter-golf) — 8192 vocab BPE for lower tokens-per-byte
  • 3-layer depth recurrence — layers 3-5 looped 2x (13 effective layers from 10 physical; 5.2% BPB improvement in ablation)
  • Parallel residuals on layers 7+ — attention + MLP in parallel from same normalized input
  • U-Net skip connections — 5 encoder + 5 decoder with learned skip weights
  • Full-Hessian GPTQ int6/int5 with percentile-search optimal scales
  • 14.09MB artifact (under 16MB limit)
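The depth-recurrence and parallel-residual wiring described above can be sketched in PyTorch. This is a minimal illustration, not the PR's actual code: attention and MLP are stand-in `Linear` layers, and all names are hypothetical — only the residual wiring and the layer-loop structure are the point.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block showing sequential vs. parallel residuals."""
    def __init__(self, dim, parallel=False):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)  # stand-in for attention
        self.mlp = nn.Linear(dim, dim)   # stand-in for the MLP
        self.parallel = parallel

    def forward(self, x):
        if self.parallel:
            # Parallel residual: attention and MLP both read the SAME
            # normalized input and are summed into the residual stream.
            h = self.norm1(x)
            return x + self.attn(h) + self.mlp(h)
        # Standard GPT-style sequential residual.
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

def forward_with_recurrence(blocks, x, loop_range=(3, 6), loops=2):
    """Depth recurrence: re-apply physical layers [lo, hi) `loops` times,
    adding effective depth without adding parameters."""
    lo, hi = loop_range
    for blk in blocks[:lo]:
        x = blk(x)
    for _ in range(loops):
        for blk in blocks[lo:hi]:
            x = blk(x)
    for blk in blocks[hi:]:
        x = blk(x)
    return x

# 10 physical blocks, parallel residuals on layers 7+:
blocks = nn.ModuleList([Block(64, parallel=(i >= 7)) for i in range(10)])
x = torch.randn(2, 16, 64)
# Layers 3-5 looped 2x: 3 + 2*3 + 4 = 13 effective layer applications.
y = forward_with_recurrence(blocks, x)
```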

Stacked with: BigramHash(10240), SmearGate, EMA(0.997), LeakyReLU squared, GQA(8q/4kv), partial RoPE(16), value residual, XSA(last 4), orthogonal init, Muon+AdamW, late QAT.
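The "percentile-search optimal scales" mentioned in the GPTQ bullet can be sketched in its generic form: clip |w| at several candidate percentiles, quantize symmetrically, and keep the scale with the lowest reconstruction MSE. This is a hedged sketch of the general technique only — the PR's actual implementation is full-Hessian GPTQ, and the function name and percentile grid here are illustrative assumptions.

```python
import numpy as np

def percentile_scale_search(w, bits=6, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Pick a symmetric int quantization scale by percentile clip search.

    Tries clipping |w| at each candidate percentile, quantizes to the
    signed `bits`-bit grid, and returns the scale minimizing MSE.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 31 for int6
    best_scale, best_err = None, float("inf")
    for p in percentiles:
        clip = np.percentile(np.abs(w), p)
        if clip == 0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.mean((q * scale - w) ** 2)  # reconstruction MSE
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```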

Results

  • val_bpb: 1.6323 (1xH100 SXM, seed=42, 600s wallclock)
  • pre-quant val_bpb: 1.2956
  • 2855 steps at 210ms/step
  • Note: tested on 1xH100 (1/8 of competition compute); BPB is expected to improve significantly on 8xH100

Reproduction

rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
torchrun --standalone --nproc_per_node=8 train_gpt.py

Ablation (sp1024, 2-min, 1xH100)

| Technique | final_bpb | Delta |
| --- | --- | --- |
| Baseline | 3.574 | -- |
| + Depth Recurrence | 3.387 | -5.2% |
| + Big Batch (262K) | 3.448 | -3.5% |
| + QK-Gain 5.0 | 3.758 | +5.1% (GPTQ degrades without SDClip) |

Test plan

  • Validate on 8xH100 with TRAIN_BATCH_TOKENS=786432
  • Run 3 seeds for statistical significance
  • Enable sliding window + TTT eval for final BPB

SP8192 tokenizer, 3-layer depth recurrence (layers 3-5 looped 2x),
parallel residuals on layers 7+, U-Net skips, GPTQ int6/int5.
14.09MB artifact. val_bpb=1.6323 on 1xH100 (1/8 competition compute).
- 11 layers (from 10), MLP mult 4.0 (from 3.0)
- SP8192 as default tokenizer
- Depth recurrence layers 3-5 x2 enabled by default
- Parallel residuals on layers 7+ enabled by default
- Weight decay 0.085 (frontier-tuned)
- val_bpb=1.620 on 1xH100 (14.42MB artifact)
- SDClip (k=12.85) for GPTQ scale selection
- MuonEq-R (row-normalized Muon optimizer)
- Pre-quant TTT (10 epochs, AdamW lr=0.00045, cosine decay)
- Brotli compression with byte shuffle
- Delayed depth recurrence (step 3000)
- QK-Gain 5.25, XSA last 4, EMA 0.9965, WD 0.095

8xH100 validated: 915 steps, val_bpb=1.3079, pre-quant TTT loss 3.74->3.06
GPTQ artifact: 12.23 MB (brotli). Sliding eval needs competition infra.
Full frontier config validated:
- SDClip GPTQ (k=12.85): fixed quantization for QK-Gain 5.25
- MuonEq-R: row-normalized optimizer
- Pre-quant TTT: rank-0 only with weight broadcast (fixed DDP issue)
- Brotli + byte shuffle compression: 14.09 MB artifact
- 2896 steps, val_bpb=1.261 pre-GPTQ, 1.479 post-GPTQ (standard eval)
- On 8xH100 with sliding eval + pre-quant TTT: estimated 1.10-1.20 BPB
- Added train_seed42_1xH100.log (required by competition rules)
- Updated submission.json with v4 confirmed results (1.4794 BPB)
- Updated README with full technique descriptions and reproduction steps
- Includes all required files: README.md, submission.json, train_gpt.py, train log
@MatoTeziTanka

Community Review — SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)

BPB: 1.6323 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA e273ecdd20f9, file records/track_10min_16mb/2026-04-09_SP8192_DepthRecur_ParResid/train_gpt.py):

At line 996 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

prequant_ttt(args, base_model, rank, world_size, device, val_tokens) — for epoch in range(pq_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that got PR #1376 (stukenov) closed, subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is visible in the function signature itself: which tensor is passed in as the argument.
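For readers unfamiliar with the distinction, the two patterns can be contrasted in plain Python. This is a schematic sketch only; `score` and `adapt` are hypothetical stand-ins (score = loss under current weights, adapt = adapter update), not functions from the PR.

```python
def score_first_ttt(chunks, score, adapt):
    """Legal pattern: every chunk is scored BEFORE anything trains on it."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # score under current weights first
        adapt(chunk)                 # only then train on the scored chunk
    return sum(losses) / len(losses)

def flat_epoch_ttt(chunks, score, adapt, epochs=10):
    """Flagged illegal pattern: multi-epoch training on the eval tokens,
    scoring only afterwards -- the model has already seen every token."""
    for _ in range(epochs):
        for chunk in chunks:
            adapt(chunk)
    return sum(score(c) for c in chunks) / len(chunks)
```

A toy run makes the leak concrete: if `adapt` memorizes chunks and `score` returns 0 for memorized ones, score-first reports the honest (unseen) loss while flat-epoch reports zero.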

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=8192, code=74713 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

…#402/openai#677)

Pre-quant TTT that trains directly on val_tokens without score-first
discipline is non-compliant per community review. Disabled by default
(PREQUANT_TTT_ENABLED=0). The function remains in code but is not
called unless explicitly enabled.

All other techniques (SDClip GPTQ, MuonEq-R, depth recurrence,
parallel residuals, brotli, QK-Gain 5.25) are unaffected.

Confirmed val_bpb=1.4794 on 1xH100 WITHOUT pre-quant TTT.
@dippatel1994
Author

Thanks @MatoTeziTanka for the detailed review and catching the compliance issue.

Fixed in commit fd9bde7: Pre-quant TTT is now disabled by default (PREQUANT_TTT_ENABLED=0). The function remains in code but is not called.

The confirmed val_bpb=1.4794 was already measured without pre-quant TTT (the 1xH100 test run had PREQUANT_TTT_ENABLED=0), so the reported score is unaffected.

All other techniques are compliant:

  • SDClip GPTQ (k=12.85) — quantization method, no val data dependency
  • MuonEq-R — optimizer modification, training-phase only
  • Depth recurrence — architecture, no eval dependency
  • Parallel residuals — architecture
  • Brotli + byte shuffle — compression method
  • QK-Gain 5.25 — learnable parameter

Happy to make further adjustments if needed.

Training now stops at 590s (600s - 10s reserve), leaving time for
GPTQ compression to complete within the total budget. Matches the
pattern from PR openai#1487 (gptq_reserve_seconds=10).
@MatoTeziTanka

Re-audited at head SHA 0215c2e1. Full code read (1,375 lines) + CPU gauntlet on CT2038.

Gauntlet result (CT2038, Python 3.10, torch 2.10.0+cpu):

Import: PASS (0.1s)
Hyperparameters: dim=512, layers=11, heads=8, vocab=8192
Model: PASS (37,338,221 params)
Forward pass: PASS (loss=9.0132)
Code size: 75,005 bytes
Artifact: FAIL — expected on CPU (GPTQ int6 quantization requires GPU)

Pre-quant TTT fix confirmed. Line 1300: PREQUANT_TTT_ENABLED defaults to "0". The multi-epoch AdamW function (lines 997-1061) is never called. The reported val_bpb=1.4794 was measured with standard eval (EVAL_STRIDE=0, per submission.json) on 1×H100, so neither sliding-window TTT nor the n-gram cache were active for that number.

What the code does at default competition settings (eval_stride=64, ttt_enabled=1, ngram_alpha=0.20):

  1. Sliding-window TTT with LoRA (lines 908-994): scores each batch under torch.inference_mode() (lines 942-964), records the loss, THEN adapts the LoRA layers via AdamW (lines 966-971). The next batch sees updated weights. This is the legal score-first-per-chunk pattern — each token is scored before the adapter trains on it. Reset every 8 chunks (line 973). Structurally matches PR #1413 (dexhunter, "Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)").

  2. 7-gram cache (lines 775-800): context-only key construction (ctx = tuple(val_np[ctx_end - order:ctx_end].tolist()) at line 789) — no target token in the key. Pre-trained on 2M training tokens (line 820, legal), updated with scored tokens AFTER scoring (lines 855/964). Same pattern as merged PR #803 (pentxayc, "Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer"). Not the target-in-key family bug from PR #779 ("Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683)").
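The context-only key construction can be illustrated in isolation. Helper names here are hypothetical; only the slicing mirrors the pattern cited at line 789 — the key is the `order` tokens preceding the position, never including the target token itself.

```python
def context_key(tokens, pos, order=7):
    """Build the n-gram cache key from CONTEXT ONLY.

    The key is the `order` tokens preceding `pos`; tokens[pos] (the
    target) is never part of the key. Including the target would leak
    the answer into the lookup (the bug family flagged in PR #779).
    """
    if pos < order:
        return None  # not enough context yet
    return tuple(tokens[pos - order:pos])

def update_after_scoring(cache, tokens, pos, order=7):
    """Add a token to the cache only AFTER it has been scored."""
    key = context_key(tokens, pos, order)
    if key is not None:
        counts = cache.setdefault(key, {})
        counts[tokens[pos]] = counts.get(tokens[pos], 0) + 1

cache = {}
toks = [1, 2, 3, 4, 5, 6, 7, 8]
key = context_key(toks, 7)           # the 7 tokens before position 7
update_after_scoring(cache, toks, 7)  # target token 8 recorded post-scoring
```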

Updated verdict: LOOKS CLEAN. The pre-quant TTT that was flagged is disabled. The two active eval-time mechanisms (sliding TTT + n-gram cache) are both legal — score-first TTT matches #1413, and the n-gram uses context-only keys matching #803.

Citation correction: My original review cited #1416/#1423 as "the legal Pre-Quant TTT pattern." That was wrong — both have the illegal flat-epoch pattern. The correct legal TTT reference is PR #1413 (dexhunter).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending usual record-track checks. Note that the reported 1.4794 BPB is a conservative 1×H100 standard-eval baseline — the 8×H100 score with sliding TTT + n-gram active will be lower.

Thanks for the fast turnaround @dippatel1994.


Re-audit by @MatoTeziTanka. CPU gauntlet on CT2038 (Python 3.10, torch 2.10.0+cpu): IMPORT_OK, MODEL_OK, FORWARD_OK. Full code review: pre-quant TTT disabled (line 1300), sliding TTT is score-first (lines 942-971), n-gram cache uses context-only keys (line 789).

