
Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean) #1435

Open
AbhayAnandUCSD wants to merge 10 commits into openai:main from AbhayAnandUCSD:exp4-recurrence

Conversation

@AbhayAnandUCSD

Summary

  • Depth recurrence: layers 4,5 repeat once (13 virtual from 11 physical), activated at step 3000
  • BigramHash(1536, dim 112) with SmearGate added on top of recurrence base
  • EMA decay 0.9965, skip gates, parallel residuals (layers 7+), MuonEq-R
  • SP1024 tokenizer (SP4096 was unavailable in public manifest)
  • GPTQ int6 + Brotli compression, ~14.6 MB artifact
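The depth-recurrence schedule in the first bullet (layers 4 and 5 each repeated once, 13 virtual layers from 11 physical) can be sketched as follows. This is an illustrative reconstruction, not the PR's actual `train_gpt.py` code; the function name and signature are hypothetical.

```python
def build_layer_schedule(n_physical=11, recur_layers=(4, 5), repeats=1):
    """Return the virtual-layer execution order for depth recurrence.

    Each layer index in recur_layers is executed (1 + repeats) times,
    so 11 physical layers with layers 4 and 5 repeated once yield
    13 virtual forward passes.
    """
    schedule = []
    for i in range(n_physical):
        schedule.append(i)
        if i in recur_layers:
            schedule.extend([i] * repeats)
    return schedule

schedule = build_layer_schedule()
# 13 virtual layers from 11 physical; layers 4 and 5 appear twice
```

In the PR's config the recurrence is activated only from step 3000 onward, so early training would use the plain 11-layer schedule.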

Results (3 seeds, 8xH100 SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Artifact |
|------|---------------|-------------------|----------|
| 1337 | 1.1104 | 1.0989 | 14,597,964 B |
| 42   | 1.1089 | 1.0973 | 14,564,857 B |
| 2024 | 1.1097 | 1.0977 | 14,561,630 B |
| Mean | 1.1097 | 1.0980 (std 0.0008) | — |

Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0167 BPB.

BigramHash vs Vanilla Comparison

| Variant | Sliding BPB (s1337) | Artifact |
|---------|---------------------|----------|
| Vanilla (PR #1421 base) | 1.0999 | 14,327,531 B |
| + BigramHash | 1.0989 | 14,597,964 B |

BigramHash adds a ~0.001 BPB improvement at a ~270 KB artifact cost.
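The BigramHash(1536, dim 112) component amounts to hashing each adjacent token pair into one of 1536 buckets and looking up a learned 112-dimensional embedding. A minimal sketch, with hash constants and function names that are assumptions rather than the PR's actual implementation:

```python
def bigram_bucket(prev_tok, tok, n_buckets=1536):
    # Illustrative multiplicative hash of the (previous, current) token pair.
    # The PR's real mixing constants may differ.
    return (prev_tok * 1000003 + tok * 8191) % n_buckets

def bigram_features(tokens, table, n_buckets=1536):
    """Look up one embedding per adjacent token pair.

    table: list of n_buckets embedding vectors (dim 112 in the PR).
    Returns len(tokens) - 1 feature vectors, one per bigram.
    """
    return [table[bigram_bucket(p, t, n_buckets)]
            for p, t in zip(tokens, tokens[1:])]
```

In the PR these features are additionally modulated by a SmearGate before being added to the residual stream; that gating is omitted here for brevity.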

Attribution

- Adopt PR openai#1421's proven depth-recurrence script (1.0925 BPB) as the base, with optional BigramHash enhancement. Target ~1.09 BPB to beat the merged SOTA (1.1147).
- 3-seed mean 1.0980 BPB (std 0.0008), beating merged SOTA (1.1147) by 0.0167. Depth recurrence on layers 4, 5 (13 virtual from 11 physical), BigramHash(1536, 112), EMA 0.9965, GPTQ int6 + Brotli. ~14.6 MB artifact.
- 11-task plan for re-running exp4 BigramHash + depth recurrence with the SP4096 tokenizer, including on-pod retokenization, 3-seed training, and separate PR creation.
eamon831 added a commit to eamon831/parameter-golf that referenced this pull request Apr 7, 2026
…text

- Logged 4 experiments: smoke test, JEPA 1xH100, baseline 1xH100, JEPA 8xH100 (interrupted)
- Updated open PRs: SP8192 stack now at 1.078 BPB (PR openai#1437)
- Revised depth recurrence from dead-end to viable (PR openai#1394, openai#1435)
- Updated strategy: Phase 1 = JEPA on PR openai#1019, Phase 2 = rebase on SP8192
- Updated blockers: grant submitted, all pods terminated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…remove non-submission files

- Add Reproduction section with torchrun command to README
- Add GPTQ calibration note (AR self-generated, not validation data)
- Fix submission.json: precise val_bpb/val_loss, correct track format
- Remove step_stop (ambiguous across seeds)
- Remove docs/superpowers/ and experiments/ (not part of submission)
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
EMA_DECAY envvar (default=0.997, sota_32 uses 0.9965):
- PR openai#1435 shows EMA=0.9965 beats 0.997 by +0.017 BPB (1.0980 vs 1.1147)
- args.ema_decay_param wired to replace hardcoded 0.997

RECUR_LAYERS=4,5 at step 3000 (PR openai#1435):
- 13 virtual layers from 11 physical (vs 3,4,5 = 14 virtual)
- PR openai#1435 config: activate at step 3000

SLOT code present but DISABLED (SLOT_ENABLED=0 by default):
- eval_val_slot(), forward_hidden(), compute_logits() added to train_gpt_sota_28.py
- SLOT is retroactive 2-pass: optimizes delta on same tokens it scores = not causal
- All SLOT PRs (openai#1313, openai#1488) remain unmerged

Expected: ~1.095-1.10 BPB (WD=0.04 + EMA=0.9965 + RECUR PR#1435 config)
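The EMA_DECAY change described above is a standard exponential moving average of model parameters; with decay 0.9965, each update moves the shadow weights 0.35% toward the live weights. A minimal sketch (names hypothetical, not the repo's actual `args.ema_decay_param` wiring):

```python
def ema_update(ema, params, decay=0.9965):
    """One EMA step over a dict of scalar parameters (in place).

    ema[k] <- decay * ema[k] + (1 - decay) * params[k]
    A smaller decay (0.997 -> 0.9965) tracks the live weights faster.
    """
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```

In practice this runs over model tensors after each optimizer step, and the EMA copy (not the live weights) is what gets evaluated and quantized.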
@MatoTeziTanka

Community Review — Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean)

BPB: 1.0980 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 96b28dc75dcd, file records/track_10min_16mb/2026-04-06_DepthRecurrence_BigramHash_EMA0.9965/train_gpt.py):

The TTT path at line 1571 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
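The score-first-per-chunk shape described above can be reduced to a small loop: every chunk is scored under the weights adapted only on earlier chunks, and the final chunk triggers no adaptation. This is a structural sketch of the pattern, not the PR's line-1571 code; `score` and `adapt` stand in for the no-grad eval pass and the SGD step.

```python
def ttt_eval(chunks, score, adapt):
    """Score-first-per-chunk test-time training loop.

    Chunk ci is always scored BEFORE any adaptation on it, so its score
    reflects weights updated only on chunks 0..ci-1. The is_last_chunk
    guard means the final chunk gets no adaptation pass at all.
    """
    losses = []
    for ci, chunk in enumerate(chunks):
        losses.append(score(chunk))      # scored under current (causal) weights
        if ci < len(chunks) - 1:         # is_last_chunk guard
            adapt(chunk)                 # update on already-scored tokens only
    return losses
```

The illegal variants the review rules out would invert this order: adapting on a chunk (or on the whole validation set, multi-epoch) before scoring it.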

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 3.18s, dim=512, layers=11, vocab=1024, code=86033 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 3.18s, dim=512, layers=11, vocab=1024, code=86033 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
