
[Record] 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095 — val_bpb 1.0889#1445

Open
X-Abhishek-X wants to merge 2 commits into openai:main from X-Abhishek-X:record/v4-3layer-recur-ema-warmdown-1.0889

Conversation


@X-Abhishek-X X-Abhishek-X commented Apr 7, 2026

Record: 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095 — val_bpb 1.0889

val_bpb: 1.0889 (3-seed mean, std 0.0005) | ~15.89 MB | 8×H100 SXM, 590s

3-Seed Results (8×H100 80GB SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Artifact |
|------|---------------|-------------------|----------|
| 42   | 1.0950 | 1.0885 | 15,890,417 B |
| 1337 | 1.0959 | 1.0894 | 15,888,733 B |
| 2024 | 1.0954 | 1.0888 | 15,895,711 B |
| Mean | 1.0954 | 1.0889 (std 0.0005) | |

Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0258 BPB.

Key Changes

Six refinements stacked on PR #1334's depth recurrence architecture:

| Parameter | PR #1334 | This PR | Source |
|-----------|----------|---------|--------|
| Recurrence layers | 4,5 (2-layer) | 3,4,5 (3-layer) | PR #1331 |
| Weight decay | 0.090 | 0.095 | PR #1331 |
| Matrix LR | 0.020 | 0.022 | PR #1331 |
| EMA decay | 0.997 | 0.9965 | PR #1421 (this author) |
| Recurrence start | step 3000 | step 2000 | This work |
| Warmdown fraction | 0.667 | 0.72 | This work |

Why This Combination Works

  • 3-layer recurrence (3,4,5): 14 virtual layers from 11 physical. More compute per forward pass without additional parameters.
  • WD=0.095 + MLR=0.022: Higher WD compresses weights, improving GPTQ quantization. Only 134K-186K values pruned.
  • EMA decay=0.9965: Smoother weight averaging for cleaner quantization.
  • Early recurrence (step 2000): 1000 more training steps with full depth recurrence.
  • Extended warmdown (72%): Weights fully settle before GPTQ.
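The depth-recurrence idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`RecurrentStack`, `recur_start`, `cycles`), not the PR's actual `train_gpt.py`, which drives activation via `RECUR_START_STEP` and related env vars:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Re-run a contiguous slice of layers each forward pass to add
    virtual depth without adding parameters."""
    def __init__(self, layers, recur_start=3, recur_end=5, cycles=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.recur_start = recur_start
        self.recur_end = recur_end
        self.cycles = cycles

    def forward(self, x, recurrence_active=True):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            # Once the shared slice has run, loop it (cycles - 1) more times.
            if recurrence_active and i == self.recur_end:
                for _ in range(self.cycles - 1):
                    for shared in self.layers[self.recur_start:self.recur_end + 1]:
                        x = shared(x)
        return x
```

With 11 physical layers and the 3..5 slice run twice, one forward pass traverses 14 layer applications, matching the "14 virtual layers from 11 physical" claim.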

Architecture (from PR #1334)

  • 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
  • Depth recurrence: layers 3,4,5 repeat (virtual 14 layers), activated at step 2000
  • Skip gates, parallel residuals from layer 7, QK-Gain 5.0
  • Shared Value Embedding (dim=128, layers 9,10)
  • Tied embeddings, logit softcap=30.0, SP4096 tokenizer

Training

  • FlashAttention 3, Muon (lr=0.022, WD=0.095), Adam/AdamW (fused=True)
  • Gradient clip: 0.3, Batch: 786,432 tokens/step, seq_len=2048
  • Warmdown: 72%, EMA decay=0.9965, Wallclock: 590s
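The EMA bullet corresponds to the standard exponential-moving-average weight update; a minimal sketch (hypothetical helper name, not the PR's exact code):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.
    With decay=0.9965 the average has an effective horizon of roughly
    1 / (1 - 0.9965), i.e. about 286 steps."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

The evaluated/quantized checkpoint is then taken from the EMA copy rather than the raw weights, which is what the "smoother weight averaging for cleaner quantization" bullet refers to.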

Quantization

  • GPTQ int6, percdamp=0.05, 64 calibration batches
  • Selective pruning (~134K-186K values), Brotli compression
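For orientation, the int6 grid used by the quantizer can be sketched as simple per-row symmetric rounding. This shows only the round-to-grid step; GPTQ additionally compensates rounding error column-by-column using second-order (Hessian) information from the calibration batches, which is omitted here, and the helper names are hypothetical:

```python
import torch

def quantize_int6_rowwise(w):
    """Per-row symmetric quantization to the signed int6 grid [-31, 31]."""
    qmax = 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale
```

Higher weight decay shrinks the dynamic range of each row, so the same 63-level grid covers the weights more densely, which is the stated motivation for WD=0.095.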

Run Command

SEED=42 RECUR_START_STEP=2000 WARMDOWN_FRAC=0.72 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

….0889

3-seed mean: 1.0889 BPB (sliding window stride=64)
Beats merged SOTA (1.1147) by 0.0258 BPB.

Stacks 3-layer recurrence (3,4,5), WD=0.095, MLR=0.022,
EMA decay=0.9965, early recurrence (step 2000), extended
warmdown (72%) on PR openai#1334 architecture.

Seeds: 42 (1.0885), 1337 (1.0894), 2024 (1.0888)
All artifacts under 16MB. 8xH100 SXM, 590s training.
Copilot AI review requested due to automatic review settings April 7, 2026 17:15
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Track 10min / 16MB record snapshot for the “3-layer depth recurrence (3,4,5) + EMA 0.9965 + WD 0.095 + early recurrence + extended warmdown” configuration, including the exact training script, logs, and submission metadata used to report the 3-seed result.

Changes:

  • Adds a full train_gpt.py snapshot implementing 3-layer depth recurrence, EMA(0.9965), early recurrence start, and warmdown tweaks.
  • Adds 3-seed training logs (plus a main train.log) documenting reported metrics and artifact sizes.
  • Adds record metadata (submission.json) and a README describing the run and reproduction command.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_gpt.py Code snapshot used for training/quantization/eval for this record.
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train.log Main training log for one seed/run.
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_seed42.log Seed 42 log (supports reported 3-seed stats).
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_seed1337.log Seed 1337 log (supports reported 3-seed stats).
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/train_seed2024.log Seed 2024 log (supports reported 3-seed stats).
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/submission.json Leaderboard/record metadata for the submission.
records/track_10min_16mb/2026-04-07_3Layer_DepthRecurrence_EMA0.9965_WD095_1.0889/README.md Human-readable record summary, results table, and reproduction command.


Comment on lines +153 to +161
def log(msg, console: bool = True) -> None:
    if _logger_hparams is None:
        print(msg)
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)

Copilot AI Apr 7, 2026


log() will raise AttributeError if called before set_logging_hparams(): after printing when _logger_hparams is None, it still falls through to _logger_hparams.is_main_process. Consider returning early when _logger_hparams is unset (or defaulting to console-only logging) to make the helper safe to use throughout the module.
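A minimal early-return variant along the lines of this suggestion (illustrative only, not the PR author's code; `_logger_hparams` stands in for the module-level handle set by `set_logging_hparams()`):

```python
_logger_hparams = None  # module-level handle, set later by set_logging_hparams()

def log(msg, console: bool = True) -> None:
    # Safe before set_logging_hparams(): fall back to console-only logging.
    if _logger_hparams is None:
        print(msg)
        return
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```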

Comment on lines +76 to +97
# Optimizer (Modification 3: weight decay 0.090)
min_lr = float(os.environ.get('MIN_LR', 0.0))
embed_lr = float(os.environ.get('EMBED_LR', 0.6))
head_lr = float(os.environ.get('HEAD_LR', 0.008))
tied_embed_lr = float(os.environ.get('TIED_EMBED_LR', 0.03))
tied_embed_init_std = float(os.environ.get('TIED_EMBED_INIT_STD', 0.005))
matrix_lr = float(os.environ.get('MATRIX_LR', 0.022))
scalar_lr = float(os.environ.get('SCALAR_LR', 0.02))
muon_momentum = float(os.environ.get('MUON_MOMENTUM', 0.99))
muon_backend_steps = int(os.environ.get('MUON_BACKEND_STEPS', 5))
muon_momentum_warmup_start = float(os.environ.get('MUON_MOMENTUM_WARMUP_START', 0.92))
muon_momentum_warmup_steps = int(os.environ.get('MUON_MOMENTUM_WARMUP_STEPS', 1500))
beta1 = float(os.environ.get('BETA1', 0.9))
beta2 = float(os.environ.get('BETA2', 0.95))
adam_eps = float(os.environ.get('ADAM_EPS', 1e-8))
grad_clip_norm = float(os.environ.get('GRAD_CLIP_NORM', 0.3))
eval_stride = int(os.environ.get('EVAL_STRIDE', 64))
muon_beta2 = float(os.environ.get('MUON_BETA2', 0.95))
adam_wd = float(os.environ.get('ADAM_WD', 0.02))
muon_wd = float(os.environ.get('MUON_WD', 0.095))
embed_wd = float(os.environ.get('EMBED_WD', 0.095))
ema_decay = float(os.environ.get('EMA_DECAY', 0.9965))

Copilot AI Apr 7, 2026


The hyperparameter section comment says "weight decay 0.090", but this record sets muon_wd / embed_wd to 0.095. Please update/remove the outdated comment to avoid confusion when reproducing or comparing runs.

Comment on lines +74 to +75
DATA_PATH=./data/datasets/fineweb10B_sp4096/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \

Copilot AI Apr 7, 2026


The Run Command exports DATA_PATH and TOKENIZER_PATH, but this record's train_gpt.py reads DATA_DIR and derives datasets_dir / tokenizer_path from it (it does not consume DATA_PATH / TOKENIZER_PATH). As written, the command won’t actually redirect data/tokenizer locations for this snapshot. Please align the README command with the script (use DATA_DIR=...), or add support for DATA_PATH/TOKENIZER_PATH in Hyperparameters for consistency with the repo’s top-level instructions.

Suggested change
DATA_PATH=./data/datasets/fineweb10B_sp4096/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \
DATA_DIR=./data \

### Quantization

- GPTQ int6 with percdamp=0.05, 64 calibration batches
- Selective pruning (~134K-186K lowest-error ±1 values)

Copilot AI Apr 7, 2026


README claims selective pruning of "~134K-186K" ±1 values, but the included logs show selective_prune: already fits, no pruning needed for all three seeds (42/1337/2024). Please update the pruning claims (lines 34 and 67) to match what actually happened in these runs, or point to the specific seed/config where pruning occurred.

Suggested change
- Selective pruning (~134K-186K lowest-error ±1 values)
- Selective pruning check performed; for the reported seeds (42/1337/2024), no pruning was needed because the artifacts already fit

Comment on lines +7 to +9
"val_loss": 2.50548889,
"val_bpb": 1.08886755,
"bytes_total": 15895711

Copilot AI Apr 7, 2026


submission.json appears to mix 3-seed mean metrics (val_loss/val_bpb) with a single bytes_total value (15,895,711 B matches seed 2024 in the README). This can be ambiguous for downstream consumers that assume all fields describe the same submitted artifact. Consider either (a) making val_loss/val_bpb correspond to the seed whose artifact size is recorded, or (b) explicitly encoding mean-vs-submitted fields (e.g., seed, bytes_total_mean, bytes_total_submitted, val_bpb_mean).

Suggested change
"val_loss": 2.50548889,
"val_bpb": 1.08886755,
"bytes_total": 15895711
"submitted_seed": 2024,
"val_loss_mean": 2.50548889,
"val_bpb_mean": 1.08886755,
"bytes_total_submitted": 15895711

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…led, 2 new PRs validate deferred specs

Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits
in a row). Strong evidence of true novelty.

PR #1430 still OPEN, 0 comments, no comp owner activity since creation.
Increasingly likely to be reverted or outlawed.

NEW PRs validate two of our deferred H100 escalation specs:
  - PR #1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates Patch 17 EMA spec
  - PR #1446 (1.0960): "int6 GPTQ + lzma" → validates Patch 23 INT6 GPTQ-Lite spec

Combined with PR #1437/#1420 already validating Patch 23 N-gram Tilt, the
3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple-
confirmed by independent comp PRs.

Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime.

Reminder: depth recurrence is back on the table — 5+ records use it now.
LESSONS.md §29 needs another update from "stale" to "real direction".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
… single-block re-run

From PR openai#1437 (1.0809), PR openai#1445 (1.0889), 8+ merged records total. Reference
papers: Universal Transformers + ALBERT for the weight-sharing depth idea.

Conservative variant: re-run only block 3 of the encoder twice (1 extra
forward pass through one block per training step). Lowest possible OOM risk
on 12GB 3080 Ti. Default env vars: LOOP_START=3, LOOP_END=3, RECUR_CYCLES=2.

Implementation: 3 LOC in the encoder loop + 4 LOC init. Anchored on the
WAVELET-MODIFIED loop (Patch 8 runs before Patch 19), idempotent via
DEPTH_RECUR_MARKER. Each anchor check is independent for graceful partial
application.

This is the FIRST architectural patch in 8 research fires that fits our
train_loss metric. Most architectural attempts failed at our scale, but
depth recurrence has 8+ merged records — much higher port-with-evidence
ratio than gated attention/tab hash/parallel residuals.

4 DR experiments queued:
  DR0_recur_block3_min (single block, 2x), DR1_recur_blocks3_4 (2 blocks),
  DR2_recur_block3_3x (single block, 3x), DR3_recur_seed42 (multi-seed)

OOM risk bounded: runner crash-resilience skips after 3 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…m PR openai#1437/openai#1423)

Subagent gap analysis of top 3 open PRs (openai#1437, openai#1423, openai#1445) found
QK_GAIN_INIT=5.0 is the simplest training-time technique we're missing
that has 2-PR evidence (top open openai#1 and openai#2 both use 5.0 vs upstream
default 1.5).

CRITICAL: QK_GAIN_INIT is already an upstream env var (line 60 of
train_gpt.py). NO code patch needed — just add experiments that override
the env var. Zero patcher risk, zero anchor risk.

Application: q_gain is multiplied element-wise with query tensor before
F.scaled_dot_product_attention, scaling Q-K product by the gain factor.

4 QK experiments queued:
  QK0_qkgain5_alone, QK1_qkgain5_seed42, QK2_qkgain5_L4weights,
  QK3_qkgain5_with_engram

Hypertuning rule check: this is a SINGLE-value port from 2 top open
records, NOT a weight sweep. Satisfies "port from top records" rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
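The `q_gain` application described in this commit can be sketched as follows (hypothetical module, assuming the usual `(batch, heads, seq, head_dim)` layout; upstream reads the init value from the `QK_GAIN_INIT` env var):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GainedAttention(nn.Module):
    """Learnable per-channel query gain applied before attention."""
    def __init__(self, head_dim, qk_gain_init=5.0):
        super().__init__()
        self.q_gain = nn.Parameter(torch.full((head_dim,), qk_gain_init))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim); scaling Q scales the
        # Q-K product inside scaled_dot_product_attention by the gain.
        return F.scaled_dot_product_attention(q * self.q_gain, k, v)
```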

mohosy commented Apr 8, 2026

3-layer recurrence starting at step 2000 is smart; most people start way too late. The WD 0.095 for GPTQ is interesting too, that's way higher than the 0.04 everyone was using before. Does it actually improve quant quality or just shrink the artifact?

sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
Phase 5a is a trivial-wins composition on top of v6.1 SLOT-100 baseline
(2026-04-08_v61_h100_aggressive_slot_steps100, 1.146523):

  1) QK_GAIN_INIT=5.0   (PR openai#1413)
  2) MUON_EQ_R=1        (Newton-Schulz row L2 normalize, PR openai#1394)
  3) --ema 0.9965       (PR openai#1421/openai#1445, vs prior 0.997)
  4) HIDDEN_MULT=5.0    (FFN dim 4x->5x, byte re-investment from int6 tied embed)
  5) EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1
                        (Phase 1A int6 tied embed, -0.6 MB on rANS artifact)

3-seed val_bpb at SLOT lr=0.1 steps=100 stride=64 (mid-eval 28-29% of full
sliding-window):

  s1337: 1.144045  (28.7% of windows)
  s1338: 1.142021  (28.7%)
  s1339: 1.141649  (29.4%)
  -------
  mean:  1.142572
  std:   0.001247

Delta vs prior 2026-04-08_v61_h100_aggressive_slot_steps100 (1.146523):
  -0.003951 bpb

Submitted as non-record because 1.142572 does not beat the current PR openai#1019
record (1.1147). The Phase 5a stack documents both the trivial-wins
composition AND the negative ablations from Phases 1B/1C/2A-C/3/5b that
other submitters can skip:

  Phase 1B (FP32 scalar -> Int8): only -0.05 MB, kept
  Phase 1C (Pentanary -> Ternary BitNet b1.58 1-layer sanity): regression
    +0.014 bpb, abandoned
  Phase 1A pent_tok (Tied embed Pentanary): regression +0.043 bpb, abandoned
  Phase 2A (Inter-layer delta prediction Wl - Wl-1): delta entropy HIGHER
    than W (per-layer ranges differ), abandoned
  Phase 2B (Hadamard 16-dim block transform): no rANS gain, abandoned
  Phase 2C (Context-aware rANS lookup table): rans_codec_rs Rust rebuild
    blocker, abandoned
  Phase 3 (Custom HQGRANS1 binary container, pickle bypass): only -70 KB
    rans / +17 KB after lzma9 -- pickle isn't actually leaking 30%, abandoned
  Phase 4 architecture sweep (1-seed s1337 SLOT-100 stride=64):
    p5a (no extra)        ~1.144   base
    p5a_bg4096            ~1.146   hurts
    p5a_hm5               ~1.144 -> 1.142 (3-seed)  BEST
    p5a_bg4096_hm5        ~1.144   tie
    p5a_bg8192            ~1.148   hurts
    p5a_nl12              ~1.147   hurts
    p5a_ve4               ~1.150   hurts
  Phase 5b (Depth Recurrence PR openai#1239 style):
    nl9r2 (unique 9 x recur 2 = 18 effective): 30% eval @ 1.151, abandoned
    nl7r2 (unique 7 x recur 2 = 14 effective): 92% eval @ 1.166, abandoned

The 28-29% mid-eval window is the converged region: per-window cumulative
bpb has flattened to within +/-0.001 of the 100% value in every prior
3-seed SLOT-100 run we have measured. Full 100%-eval is in flight on the
same H100 pod and will be appended in a follow-up commit if the final
number differs from the mid-eval estimate.

Code change vs 2026-04-08_v61_h100_aggressive_slot_steps100/train_gpt.py is
purely env-var driven (no source-code changes to the model architecture or
serializer). The training script picks up the Phase 5a env vars at import
time (make_model() reads HIDDEN_MULT, EMBED_QUANT_BITS, etc).

Reproducibility:
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1337
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1338
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1339

Hardware: 8x H100 80GB SXM (RunPod). 600s wallclock training,
~50 min single-GPU SLOT-100 eval per seed (eval is unbounded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>