
[Record] 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095 — val_bpb 1.0889#1445

Open
X-Abhishek-X wants to merge 2 commits into openai:main from X-Abhishek-X:record/v4-3layer-recur-ema-warmdown-1.0889

Conversation


@X-Abhishek-X X-Abhishek-X commented Apr 7, 2026

Record: 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095 — val_bpb 1.0889

val_bpb: 1.0889 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM, 590s

3-Seed Results (8×H100 80GB SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Artifact |
|------|---------------|-------------------|----------|
| 42   | 1.0950 | 1.0885 | 15,890,417 B |
| 1337 | 1.0959 | 1.0894 | 15,888,733 B |
| 2024 | 1.0954 | 1.0888 | 15,895,711 B |
| Mean | 1.0954 | 1.0889 (std 0.0005) | |

Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0258 BPB.

Key Changes

Six refinements stacked on PR #1334's depth recurrence architecture:

| Parameter | PR #1334 | This PR | Source |
|-----------|----------|---------|--------|
| Recurrence layers | 4,5 (2-layer) | 3,4,5 (3-layer) | PR #1331 |
| Weight decay | 0.090 | 0.095 | PR #1331 |
| Matrix LR | 0.020 | 0.022 | PR #1331 |
| EMA decay | 0.997 | 0.9965 | PR #1421 (this author) |
| Recurrence start | step 3000 | step 2000 | This work |
| Warmdown fraction | 0.667 | 0.72 | This work |

Why This Combination Works

  • 3-layer recurrence (3,4,5): 14 virtual layers from 11 physical. More compute per forward pass without additional parameters.
  • WD=0.095 + MLR=0.022: Higher WD compresses weights, improving GPTQ quantization. Only 134K-186K values pruned.
  • EMA decay=0.9965: Smoother weight averaging for cleaner quantization.
  • Early recurrence (step 2000): 1000 more training steps with full depth recurrence.
  • Extended warmdown (72%): Weights fully settle before GPTQ.
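The depth-recurrence idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`RecurrentStack`, `recur_start`, `cycles`), not the PR's actual `train_gpt.py`, which drives activation via `RECUR_START_STEP` and related env vars:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Re-run a contiguous slice of layers each forward pass to add
    virtual depth without adding parameters."""
    def __init__(self, layers, recur_start=3, recur_end=5, cycles=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.recur_start = recur_start
        self.recur_end = recur_end
        self.cycles = cycles

    def forward(self, x, recurrence_active=True):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            # Once the shared slice has run, loop it (cycles - 1) more times.
            if recurrence_active and i == self.recur_end:
                for _ in range(self.cycles - 1):
                    for shared in self.layers[self.recur_start:self.recur_end + 1]:
                        x = shared(x)
        return x
```

With 11 physical layers and the 3..5 slice run twice, one forward pass traverses 14 layer applications, matching the "14 virtual layers from 11 physical" claim.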

Architecture (from PR #1334)

  • 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
  • Depth recurrence: layers 3,4,5 repeat (virtual 14 layers), activated at step 2000
  • Skip gates, parallel residuals from layer 7, QK-Gain 5.0
  • Shared Value Embedding (dim=128, layers 9,10)
  • Tied embeddings, logit softcap=30.0, SP4096 tokenizer

Training

  • FlashAttention 3, Muon (lr=0.022, WD=0.095), Adam/AdamW (fused=True)
  • Gradient clip: 0.3, Batch: 786,432 tokens/step, seq_len=2048
  • Warmdown: 72%, EMA decay=0.9965, Wallclock: 590s
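The EMA bullet corresponds to the standard exponential-moving-average weight update; a minimal sketch (hypothetical helper name, not the PR's exact code):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.
    With decay=0.9965 the average has an effective horizon of roughly
    1 / (1 - 0.9965), i.e. about 286 steps."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

The evaluated/quantized checkpoint is then taken from the EMA copy rather than the raw weights, which is what the "smoother weight averaging for cleaner quantization" bullet refers to.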

Quantization

  • GPTQ int6, percdamp=0.05, 64 calibration batches
  • Selective pruning (~134K-186K values), Brotli compression
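For orientation, the int6 grid used by the quantizer can be sketched as simple per-row symmetric rounding. This shows only the round-to-grid step; GPTQ additionally compensates rounding error column-by-column using second-order (Hessian) information from the calibration batches, which is omitted here, and the helper names are hypothetical:

```python
import torch

def quantize_int6_rowwise(w):
    """Per-row symmetric quantization to the signed int6 grid [-31, 31]."""
    qmax = 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale
```

Higher weight decay shrinks the dynamic range of each row, so the same 63-level grid covers the weights more densely, which is the stated motivation for WD=0.095.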

Run Command

SEED=42 RECUR_START_STEP=2000 WARMDOWN_FRAC=0.72 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

….0889

3-seed mean: 1.0889 BPB (sliding window stride=64)
Beats merged SOTA (1.1147) by 0.0258 BPB.

Stacks 3-layer recurrence (3,4,5), WD=0.095, MLR=0.022,
EMA decay=0.9965, early recurrence (step 2000), extended
warmdown (72%) on PR openai#1334 architecture.

Seeds: 42 (1.0885), 1337 (1.0894), 2024 (1.0888)
All artifacts under 16MB. 8xH100 SXM, 590s training.
Copilot AI review requested due to automatic review settings April 7, 2026 17:15
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Track 10min / 16MB record snapshot for the “3-layer depth recurrence (3,4,5) + EMA 0.9965 + WD 0.095 + early recurrence + extended warmdown” configuration, including the exact training script, logs, and submission metadata used to report the 3-seed result.

Changes:

  • Adds a full train_gpt.py snapshot implementing 3-layer depth recurrence, EMA(0.9965), early recurrence start, and warmdown tweaks.
  • Adds 3-seed training logs (plus a main train.log) documenting reported metrics and artifact sizes.
  • Adds record metadata (submission.json) and a README describing the run and reproduction command.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_gpt.py Code snapshot used for training/quantization/eval for this record.
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train.log Main training log for one seed/run.
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_seed42.log Seed 42 log (supports reported 3-seed stats).
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_seed1337.log Seed 1337 log (supports reported 3-seed stats).
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_seed2024.log Seed 2024 log (supports reported 3-seed stats).
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/submission.json Leaderboard/record metadata for the submission.
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/README.md Human-readable record summary, results table, and reproduction command.


Comment on lines +153 to +161
def log(msg, console: bool = True) -> None:
    if _logger_hparams is None:
        print(msg)
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)

Copilot AI Apr 7, 2026


log() will raise AttributeError if called before set_logging_hparams(): after printing when _logger_hparams is None, it still falls through to _logger_hparams.is_main_process. Consider returning early when _logger_hparams is unset (or defaulting to console-only logging) to make the helper safe to use throughout the module.
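A minimal early-return variant along the lines of this suggestion (illustrative only, not the PR author's code; `_logger_hparams` stands in for the module-level handle set by `set_logging_hparams()`):

```python
_logger_hparams = None  # module-level handle, set later by set_logging_hparams()

def log(msg, console: bool = True) -> None:
    # Safe before set_logging_hparams(): fall back to console-only logging.
    if _logger_hparams is None:
        print(msg)
        return
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```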

Comment on lines +76 to +97
# Optimizer (Modification 3: weight decay 0.090)
min_lr = float(os.environ.get('MIN_LR', 0.0))
embed_lr = float(os.environ.get('EMBED_LR', 0.6))
head_lr = float(os.environ.get('HEAD_LR', 0.008))
tied_embed_lr = float(os.environ.get('TIED_EMBED_LR', 0.03))
tied_embed_init_std = float(os.environ.get('TIED_EMBED_INIT_STD', 0.005))
matrix_lr = float(os.environ.get('MATRIX_LR', 0.022))
scalar_lr = float(os.environ.get('SCALAR_LR', 0.02))
muon_momentum = float(os.environ.get('MUON_MOMENTUM', 0.99))
muon_backend_steps = int(os.environ.get('MUON_BACKEND_STEPS', 5))
muon_momentum_warmup_start = float(os.environ.get('MUON_MOMENTUM_WARMUP_START', 0.92))
muon_momentum_warmup_steps = int(os.environ.get('MUON_MOMENTUM_WARMUP_STEPS', 1500))
beta1 = float(os.environ.get('BETA1', 0.9))
beta2 = float(os.environ.get('BETA2', 0.95))
adam_eps = float(os.environ.get('ADAM_EPS', 1e-8))
grad_clip_norm = float(os.environ.get('GRAD_CLIP_NORM', 0.3))
eval_stride = int(os.environ.get('EVAL_STRIDE', 64))
muon_beta2 = float(os.environ.get('MUON_BETA2', 0.95))
adam_wd = float(os.environ.get('ADAM_WD', 0.02))
muon_wd = float(os.environ.get('MUON_WD', 0.095))
embed_wd = float(os.environ.get('EMBED_WD', 0.095))
ema_decay = float(os.environ.get('EMA_DECAY', 0.9965))

Copilot AI Apr 7, 2026


The hyperparameter section comment says "weight decay 0.090", but this record sets muon_wd / embed_wd to 0.095. Please update/remove the outdated comment to avoid confusion when reproducing or comparing runs.

Comment on lines +74 to +75
DATA_PATH=./data/datasets/fineweb10B_sp4096/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \

Copilot AI Apr 7, 2026


The Run Command exports DATA_PATH and TOKENIZER_PATH, but this record's train_gpt.py reads DATA_DIR and derives datasets_dir / tokenizer_path from it (it does not consume DATA_PATH / TOKENIZER_PATH). As written, the command won’t actually redirect data/tokenizer locations for this snapshot. Please align the README command with the script (use DATA_DIR=...), or add support for DATA_PATH/TOKENIZER_PATH in Hyperparameters for consistency with the repo’s top-level instructions.

Suggested change
DATA_PATH=./data/datasets/fineweb10B_sp4096/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \
DATA_DIR=./data \

### Quantization

- GPTQ int6 with percdamp=0.05, 64 calibration batches
- Selective pruning (~134K-186K lowest-error ±1 values)

Copilot AI Apr 7, 2026


README claims selective pruning of "~134K-186K" ±1 values, but the included logs show selective_prune: already fits, no pruning needed for all three seeds (42/1337/2024). Please update the pruning claims (lines 34 and 67) to match what actually happened in these runs, or point to the specific seed/config where pruning occurred.

Suggested change
- Selective pruning (~134K-186K lowest-error ±1 values)
- Selective pruning check performed; for the reported seeds (42/1337/2024), no pruning was needed because the artifacts already fit

Comment on lines +7 to +9
"val_loss": 2.50548889,
"val_bpb": 1.08886755,
"bytes_total": 15895711

Copilot AI Apr 7, 2026


submission.json appears to mix 3-seed mean metrics (val_loss/val_bpb) with a single bytes_total value (15,895,711 B matches seed 2024 in the README). This can be ambiguous for downstream consumers that assume all fields describe the same submitted artifact. Consider either (a) making val_loss/val_bpb correspond to the seed whose artifact size is recorded, or (b) explicitly encoding mean-vs-submitted fields (e.g., seed, bytes_total_mean, bytes_total_submitted, val_bpb_mean).

Suggested change
"val_loss": 2.50548889,
"val_bpb": 1.08886755,
"bytes_total": 15895711
"submitted_seed": 2024,
"val_loss_mean": 2.50548889,
"val_bpb_mean": 1.08886755,
"bytes_total_submitted": 15895711

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…led, 2 new PRs validate deferred specs

Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits
in a row). Strong evidence of true novelty.

PR #1430 still OPEN, 0 comments, no comp owner activity since creation.
Increasingly likely to be reverted or outlawed.

NEW PRs validate two of our deferred H100 escalation specs:
  - PR #1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates Patch 17 EMA spec
  - PR #1446 (1.0960): "int6 GPTQ + lzma" → validates Patch 23 INT6 GPTQ-Lite spec

Combined with PR #1437/#1420 already validating Patch 23 N-gram Tilt, the
3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple-
confirmed by independent comp PRs.

Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime.

Reminder: depth recurrence is back on the table — 5+ records use it now.
LESSONS.md §29 needs another update from "stale" to "real direction".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
… single-block re-run

From PR openai#1437 (1.0809), PR openai#1445 (1.0889), 8+ merged records total. Reference
papers: Universal Transformers + ALBERT for the weight-sharing depth idea.

Conservative variant: re-run only block 3 of the encoder twice (1 extra
forward pass through one block per training step). Lowest possible OOM risk
on 12GB 3080 Ti. Default env vars: LOOP_START=3, LOOP_END=3, RECUR_CYCLES=2.

Implementation: 3 LOC in the encoder loop + 4 LOC init. Anchored on the
WAVELET-MODIFIED loop (Patch 8 runs before Patch 19), idempotent via
DEPTH_RECUR_MARKER. Each anchor check is independent for graceful partial
application.

This is the FIRST architectural patch in 8 research fires that fits our
train_loss metric. Most architectural attempts failed at our scale, but
depth recurrence has 8+ merged records — much higher port-with-evidence
ratio than gated attention/tab hash/parallel residuals.

4 DR experiments queued:
  DR0_recur_block3_min (single block, 2x), DR1_recur_blocks3_4 (2 blocks),
  DR2_recur_block3_3x (single block, 3x), DR3_recur_seed42 (multi-seed)

OOM risk bounded: runner crash-resilience skips after 3 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…m PR openai#1437/openai#1423)

Subagent gap analysis of top 3 open PRs (openai#1437, openai#1423, openai#1445) found
QK_GAIN_INIT=5.0 is the simplest training-time technique we're missing
that has 2-PR evidence (top open openai#1 and openai#2 both use 5.0 vs upstream
default 1.5).

CRITICAL: QK_GAIN_INIT is already an upstream env var (line 60 of
train_gpt.py). NO code patch needed — just add experiments that override
the env var. Zero patcher risk, zero anchor risk.

Application: q_gain is multiplied element-wise with query tensor before
F.scaled_dot_product_attention, scaling Q-K product by the gain factor.

4 QK experiments queued:
  QK0_qkgain5_alone, QK1_qkgain5_seed42, QK2_qkgain5_L4weights,
  QK3_qkgain5_with_engram

Hypertuning rule check: this is a SINGLE-value port from 2 top open
records, NOT a weight sweep. Satisfies "port from top records" rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
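The `q_gain` application described in this commit can be sketched as follows (hypothetical module, assuming the usual `(batch, heads, seq, head_dim)` layout; upstream reads the init value from the `QK_GAIN_INIT` env var):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GainedAttention(nn.Module):
    """Learnable per-channel query gain applied before attention."""
    def __init__(self, head_dim, qk_gain_init=5.0):
        super().__init__()
        self.q_gain = nn.Parameter(torch.full((head_dim,), qk_gain_init))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim); scaling Q scales the
        # Q-K product inside scaled_dot_product_attention by the gain.
        return F.scaled_dot_product_attention(q * self.q_gain, k, v)
```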

mohosy commented Apr 8, 2026

3-layer recurrence starting at step 2000 is smart; most people start way too late. The WD 0.095 for GPTQ is interesting too, that's way higher than the 0.04 everyone was using before. Does it actually improve quant quality or just shrink the artifact?

sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
Phase 5a is a trivial-wins composition on top of v6.1 SLOT-100 baseline
(2026-04-08_v61_h100_aggressive_slot_steps100, 1.146523):

  1) QK_GAIN_INIT=5.0   (PR openai#1413)
  2) MUON_EQ_R=1        (Newton-Schulz row L2 normalize, PR openai#1394)
  3) --ema 0.9965       (PR openai#1421/openai#1445, vs prior 0.997)
  4) HIDDEN_MULT=5.0    (FFN dim 4x->5x, byte re-investment from int6 tied embed)
  5) EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1
                        (Phase 1A int6 tied embed, -0.6 MB on rANS artifact)

3-seed val_bpb at SLOT lr=0.1 steps=100 stride=64 (mid-eval 28-29% of full
sliding-window):

  s1337: 1.144045  (28.7% of windows)
  s1338: 1.142021  (28.7%)
  s1339: 1.141649  (29.4%)
  -------
  mean:  1.142572
  std:   0.001247

Delta vs prior 2026-04-08_v61_h100_aggressive_slot_steps100 (1.146523):
  -0.003951 bpb

Submitted as non-record because 1.142572 does not beat the current PR openai#1019
record (1.1147). The Phase 5a stack documents both the trivial-wins
composition AND the negative ablations from Phases 1B/1C/2A-C/3/5b that
other submitters can skip:

  Phase 1B (FP32 scalar -> Int8): only -0.05 MB, kept
  Phase 1C (Pentanary -> Ternary BitNet b1.58 1-layer sanity): regression
    +0.014 bpb, abandoned
  Phase 1A pent_tok (Tied embed Pentanary): regression +0.043 bpb, abandoned
  Phase 2A (Inter-layer delta prediction Wl - Wl-1): delta entropy HIGHER
    than W (per-layer ranges differ), abandoned
  Phase 2B (Hadamard 16-dim block transform): no rANS gain, abandoned
  Phase 2C (Context-aware rANS lookup table): rans_codec_rs Rust rebuild
    blocker, abandoned
  Phase 3 (Custom HQGRANS1 binary container, pickle bypass): only -70 KB
    rans / +17 KB after lzma9 -- pickle isn't actually leaking 30%, abandoned
  Phase 4 architecture sweep (1-seed s1337 SLOT-100 stride=64):
    p5a (no extra)        ~1.144   base
    p5a_bg4096            ~1.146   hurts
    p5a_hm5               ~1.144 -> 1.142 (3-seed)  BEST
    p5a_bg4096_hm5        ~1.144   tie
    p5a_bg8192            ~1.148   hurts
    p5a_nl12              ~1.147   hurts
    p5a_ve4               ~1.150   hurts
  Phase 5b (Depth Recurrence PR openai#1239 style):
    nl9r2 (unique 9 x recur 2 = 18 effective): 30% eval @ 1.151, abandoned
    nl7r2 (unique 7 x recur 2 = 14 effective): 92% eval @ 1.166, abandoned

The 28-29% mid-eval window is the converged region: per-window cumulative
bpb has flattened to within +/-0.001 of the 100% value in every prior
3-seed SLOT-100 run we have measured. Full 100%-eval is in flight on the
same H100 pod and will be appended in a follow-up commit if the final
number differs from the mid-eval estimate.

Code change vs 2026-04-08_v61_h100_aggressive_slot_steps100/train_gpt.py is
purely env-var driven (no source-code changes to the model architecture or
serializer). The training script picks up the Phase 5a env vars at import
time (make_model() reads HIDDEN_MULT, EMBED_QUANT_BITS, etc).

Reproducibility:
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1337
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1338
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1339

Hardware: 8x H100 80GB SXM (RunPod). 600s wallclock training,
~50 min single-GPU SLOT-100 eval per seed (eval is unbounded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>