
Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean) #1435

Open
AbhayAnandUCSD wants to merge 10 commits into openai:main from AbhayAnandUCSD:exp4-recurrence

Conversation

@AbhayAnandUCSD

Summary

  • Depth recurrence: layers 4,5 repeat once (13 virtual from 11 physical), activated at step 3000
  • BigramHash(1536, dim 112) with SmearGate added on top of recurrence base
  • EMA decay 0.9965, skip gates, parallel residuals (layers 7+), MuonEq-R
  • SP1024 tokenizer (SP4096 was unavailable in public manifest)
  • GPTQ int6 + Brotli compression, ~14.6 MB artifact
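The depth-recurrence schedule in the first bullet (layers 4 and 5 each repeated once, 13 virtual layers from 11 physical) can be sketched as follows. This is an illustrative reconstruction, not the PR's actual `train_gpt.py` code; the function name and signature are hypothetical.

```python
def build_layer_schedule(n_physical=11, recur_layers=(4, 5), repeats=1):
    """Return the virtual-layer execution order for depth recurrence.

    Each layer index in recur_layers is executed (1 + repeats) times,
    so 11 physical layers with layers 4 and 5 repeated once yield
    13 virtual forward passes.
    """
    schedule = []
    for i in range(n_physical):
        schedule.append(i)
        if i in recur_layers:
            schedule.extend([i] * repeats)
    return schedule

schedule = build_layer_schedule()
# 13 virtual layers from 11 physical; layers 4 and 5 appear twice
```

In the PR's config the recurrence is activated only from step 3000 onward, so early training would use the plain 11-layer schedule.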

Results (3 seeds, 8xH100 SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Artifact |
|------|---------------|-------------------|----------|
| 1337 | 1.1104 | 1.0989 | 14,597,964 B |
| 42   | 1.1089 | 1.0973 | 14,564,857 B |
| 2024 | 1.1097 | 1.0977 | 14,561,630 B |
| Mean | 1.1097 | 1.0980 (std 0.0008) | — |

Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0167 BPB.

BigramHash vs Vanilla Comparison

| Variant | Sliding BPB (s1337) | Artifact |
|---------|---------------------|----------|
| Vanilla (PR #1421 base) | 1.0999 | 14,327,531 B |
| + BigramHash | 1.0989 | 14,597,964 B |

BigramHash adds a ~0.001 BPB improvement at a ~270 KB artifact cost.
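The BigramHash(1536, dim 112) component amounts to hashing each adjacent token pair into one of 1536 buckets and looking up a learned 112-dimensional embedding. A minimal sketch, with hash constants and function names that are assumptions rather than the PR's actual implementation:

```python
def bigram_bucket(prev_tok, tok, n_buckets=1536):
    # Illustrative multiplicative hash of the (previous, current) token pair.
    # The PR's real mixing constants may differ.
    return (prev_tok * 1000003 + tok * 8191) % n_buckets

def bigram_features(tokens, table, n_buckets=1536):
    """Look up one embedding per adjacent token pair.

    table: list of n_buckets embedding vectors (dim 112 in the PR).
    Returns len(tokens) - 1 feature vectors, one per bigram.
    """
    return [table[bigram_bucket(p, t, n_buckets)]
            for p, t in zip(tokens, tokens[1:])]
```

In the PR these features are additionally modulated by a SmearGate before being added to the residual stream; that gating is omitted here for brevity.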

Attribution

- Adopt PR openai#1421's proven depth-recurrence script (1.0925 BPB) as the base, with optional BigramHash enhancement. Target ~1.09 BPB to beat the merged SOTA (1.1147).
- 3-seed mean 1.0980 BPB (std 0.0008), beating merged SOTA (1.1147) by 0.0167. Depth recurrence on layers 4, 5 (13 virtual from 11 physical), BigramHash(1536, 112), EMA 0.9965, GPTQ int6 + Brotli. ~14.6 MB artifact.
- 11-task plan for re-running exp4 BigramHash + depth recurrence with the SP4096 tokenizer, including on-pod retokenization, 3-seed training, and separate PR creation.
eamon831 added a commit to eamon831/parameter-golf that referenced this pull request Apr 7, 2026
…text

- Logged 4 experiments: smoke test, JEPA 1xH100, baseline 1xH100, JEPA 8xH100 (interrupted)
- Updated open PRs: SP8192 stack now at 1.078 BPB (PR openai#1437)
- Revised depth recurrence from dead-end to viable (PR openai#1394, openai#1435)
- Updated strategy: Phase 1 = JEPA on PR openai#1019, Phase 2 = rebase on SP8192
- Updated blockers: grant submitted, all pods terminated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…remove non-submission files

- Add Reproduction section with torchrun command to README
- Add GPTQ calibration note (AR self-generated, not validation data)
- Fix submission.json: precise val_bpb/val_loss, correct track format
- Remove step_stop (ambiguous across seeds)
- Remove docs/superpowers/ and experiments/ (not part of submission)
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
EMA_DECAY envvar (default=0.997, sota_32 uses 0.9965):
- PR openai#1435 shows EMA=0.9965 beats 0.997 by +0.017 BPB (1.0980 vs 1.1147)
- args.ema_decay_param wired to replace hardcoded 0.997

RECUR_LAYERS=4,5 at step 3000 (PR openai#1435):
- 13 virtual layers from 11 physical (vs 3,4,5 = 14 virtual)
- PR openai#1435 config: activate at step 3000

SLOT code present but DISABLED (SLOT_ENABLED=0 by default):
- eval_val_slot(), forward_hidden(), compute_logits() added to train_gpt_sota_28.py
- SLOT is retroactive 2-pass: optimizes delta on same tokens it scores = not causal
- All SLOT PRs (openai#1313, openai#1488) remain unmerged

Expected: ~1.095-1.10 BPB (WD=0.04 + EMA=0.9965 + RECUR PR#1435 config)
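The EMA_DECAY change described above is a standard exponential moving average of model parameters; with decay 0.9965, each update moves the shadow weights 0.35% toward the live weights. A minimal sketch (names hypothetical, not the repo's actual `args.ema_decay_param` wiring):

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step over a dict of scalar parameters (in place).

    ema[k] <- decay * ema[k] + (1 - decay) * params[k]
    A smaller decay (0.997 -> 0.9965) tracks the live weights faster.
    """
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```

In practice this runs over model tensors after each optimizer step, and the EMA copy (not the live weights) is what gets evaluated and quantized.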
@MatoTeziTanka

Community Review — Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean)

BPB: 1.0980 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 96b28dc75dcd, file records/track_10min_16mb/2026-04-06_DepthRecurrence_BigramHash_EMA0.9965/train_gpt.py):

The TTT path at line 1571 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
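The score-first-per-chunk shape described above can be reduced to a small loop: every chunk is scored under the weights adapted only on earlier chunks, and the final chunk triggers no adaptation. This is a structural sketch of the pattern, not the PR's line-1571 code; `score` and `adapt` stand in for the no-grad eval pass and the SGD step.

```python
def ttt_eval(chunks, score, adapt):
    """Score-first-per-chunk test-time training loop.

    Chunk ci is always scored BEFORE any adaptation on it, so its score
    reflects weights updated only on chunks 0..ci-1. The is_last_chunk
    guard means the final chunk gets no adaptation pass at all.
    """
    losses = []
    for ci, chunk in enumerate(chunks):
        losses.append(score(chunk))      # scored under current (causal) weights
        if ci < len(chunks) - 1:         # is_last_chunk guard
            adapt(chunk)                 # update on already-scored tokens only
    return losses
```

The illegal variants the review rules out would invert this order: adapting on a chunk (or on the whole validation set, multi-epoch) before scoring it.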

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 3.18s, dim=512, layers=11, vocab=1024, code=86033 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 3.18s, dim=512, layers=11, vocab=1024, code=86033 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
