Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)#1493

Merged — cocohearts merged 1 commit into openai:main from bigbag:submission/sp8192-ttt-clean on Apr 9, 2026

Conversation

bigbag commented Apr 9, 2026

Summary

  • val_bpb = 1.0810 (3-seed mean, std 0.0002) | ~15.99 MB | 8×H100 SXM
  • SP8192 + 3-layer depth recurrence (L3-5) + parallel residuals (L7+) + QK-Gain 5.25 + legal score-first TTT
  • No SLOT, no pre-quant TTT, no n-gram cache, no ETLB — fully compliant

3-Seed Results

| Seed | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------------|---------|------------------|
| 42   | 1.0829      | 1.0808  | 15,991,930       |
| 314  | 1.0827      | 1.0810  | 15,992,919       |
| 999  | 1.0826      | 1.0812  | 15,993,232       |
| Mean | 1.0827      | 1.0810  | 15,992,694       |
| Std  | 0.0002      | 0.0002  | —                |

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0337 BPB.

Key Techniques

  1. SP8192 + GPTQ SDClip — int6 matrices (k=12.85), int8 embeddings (k=20.0), zero pruning (PR #1394, @clarkkev)
  2. 3-Layer Depth Recurrence (L3-5, activated at 0.35) — 17 virtual layers from 11 physical
  3. Parallel Residuals (L7+) — GPT-J style (PR #1412, @Robby955; PR #1204, @msisovic); techniques 2-4 are sketched after this list
  4. QK-Gain 5.25 — monotonic improvement from 4.0 → 5.0 → 5.25
  5. Legal Score-First TTT — SGD (lr=0.005, mom=0.9), 3 epochs, cosine decay (PR #549, @abaybektursun; PR #1413, @dexhunter)
  6. Tuned Hyperparameters — WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72 (PR #1445, @X-Abhishek-X)
  7. LZMA code wrapper — 16.6 KB code footprint
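
For orientation, here is a minimal PyTorch sketch of how techniques 2-4 can compose in a block stack. It is illustrative only — the names (`Block`, `qk_gain`, `run_blocks`, `extra_loops`) and the exact wiring are assumptions, not the submission's actual train_gpt.py.

```python
# Illustrative sketch of depth recurrence + parallel residuals + QK-Gain.
# Names and structure are assumptions; this is NOT the submission's train_gpt.py.
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, dim, n_heads, parallel_residual=False, qk_gain_init=5.25):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # QK-Gain: learnable scale on the normalized queries/keys, initialized at 5.25.
        self.qk_gain = nn.Parameter(torch.tensor(float(qk_gain_init)))
        self.parallel_residual = parallel_residual

    def forward(self, x):
        h = self.norm1(x)
        qk = h * self.qk_gain  # gain applied to queries/keys only, not values
        # causal mask omitted for brevity
        attn_out, _ = self.attn(qk, qk, h, need_weights=False)
        if self.parallel_residual:
            # GPT-J style: attention and MLP branches both read the same input x.
            return x + attn_out + self.mlp(self.norm2(x))
        # Serial pre-LN: MLP consumes the attention branch's output.
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


def run_blocks(blocks, x, loop_start=3, loop_end=5, extra_loops=2, looping_on=True):
    """Depth recurrence: blocks loop_start..loop_end are re-applied extra_loops times,
    so 11 physical blocks behave like 11 + 2*3 = 17 virtual layers once looping is on
    (looping_on stands in for the 0.35 activation point)."""
    for i, blk in enumerate(blocks):
        x = blk(x)
        if looping_on and i == loop_end:
            for _ in range(extra_loops):
                for j in range(loop_start, loop_end + 1):
                    x = blocks[j](x)
    return x
```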

Compliance (Track B)

Per Issue #1017:

  • Condition 1 (Causality): Sliding-window eval, prefix only
  • Condition 2 (Normalized): Standard softmax, no n-gram/logit bias
  • Condition 3 (Score before update): Each chunk scored under torch.no_grad() BEFORE SGD (see the sketch below)
  • Condition 4 (Single pass): Each token scored once, no rescoring
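
To make the ordering concrete, here is a minimal sketch of a score-first TTT eval loop under conditions 1-4. The chunking, naming, and BPB accounting are assumptions; only the score-then-adapt ordering mirrors the submission's description.

```python
# Minimal sketch of a score-first TTT eval loop satisfying conditions 1-4.
# chunks: iterable of (inputs, targets, n_bytes); all names are illustrative.
import math
import torch
import torch.nn.functional as F


def eval_with_legal_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(chunks))
    total_nats, total_bytes = 0.0, 0

    for inputs, targets, n_bytes in chunks:          # single pass, in order (cond. 1 & 4)
        model.eval()
        with torch.no_grad():                        # score BEFORE any update (cond. 3)
            logits = model(inputs)                   # standard softmax / CE, no logit bias (cond. 2)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  targets.view(-1), reduction="sum")
        total_nats += nll.item()
        total_bytes += n_bytes

        model.train()                                # only now adapt on the chunk just scored
        for _ in range(epochs):
            opt.zero_grad(set_to_none=True)
            logits = model(inputs)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            loss.backward()
            opt.step()
            sched.step()                             # cosine decay over the TTT steps

    return total_nats / total_bytes / math.log(2)    # bits per byte
```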

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.

Credits

PR #1394 @clarkkev, PR #1413 @dexhunter, PR #549 @abaybektursun, PR #1412 @Robby955, PR #1204 @msisovic, PR #1445 @X-Abhishek-X, PR #1331 @dexhunter

Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was instrumental in running 160+ experiments that led to this result.

Reproduction

SEED=42 QK_GAIN_INIT=5.25 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Test plan

  • 3-seed validation (42, 314, 999)
  • All artifacts under 16,000,000 bytes
  • Training under 600s (588s actual)
  • Eval (sliding + TTT) under 600s (~500s actual)
  • Score-first TTT: compliant with Issue #1017 ("A Field Guide to Valid Submissions"), conditions 1-4
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache

🤖 Generated with Claude Code

…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bigbag (Author) commented Apr 9, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running the experiments that led to this result. The grant covered ~320 compute hours across 160+ experiments over Steps 1-22 of our optimization journey.

owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the
PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H = XᵀX computed from validation activations
   to align quantization with the eval distribution (novel on the modern stack;
   PR openai#1019 ablated this only on its older base); see the sketch after this list
3. Eval-Time Legal Score-First TTT, 2 epochs, with score-before-update ordering
   (Track B, builds on PR openai#1493)
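
For item 2, a rough sketch of how validation-calibrated Hessians could be accumulated via forward hooks on nn.Linear inputs; the fork's actual collect_hessians_val function may differ in detail.

```python
# Sketch of accumulating per-layer Hessians H = XᵀX from validation activations.
# val_batches: iterable of token-id tensors; naming mirrors the commit, body is assumed.
import torch
import torch.nn as nn


@torch.no_grad()
def collect_hessians_val(model, val_batches, device="cuda"):
    hessians, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()  # (tokens, in_features)
            h = x.t() @ x                                           # accumulate XᵀX
            hessians[name] = hessians.get(name, 0) + h
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    for batch in val_batches:
        model(batch.to(device))

    for h in hooks:
        h.remove()
    # Later fed to GPTQ so quantization error is measured under the eval distribution.
    return hessians
```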

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487
(1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent
angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val
function, plus 8 hyperparameter defaults flipped). Architecture, optimizer,
training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear
the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong
non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100
SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time
score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun,
PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955,
PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
cocohearts merged commit bac888c into openai:main Apr 9, 2026
SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request Apr 9, 2026
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 9, 2026
…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal
  score-first TTT; within-word and word-start experts explicitly disabled
  (within_beta=0, word_beta=0) because they cannot be made fully causal
  (see the sketch after this list)
- 3-seed verification (seeds 0/42/1234)
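
For context, a heavily hedged sketch of what a causal token n-gram tilt could look like: logits receive a small additive bias toward the continuation predicted by an n-gram table built only from the already-seen prefix. base_beta and agree_bonus follow the commit's naming, but the functional form and agreement rule are assumptions, not dexhunter's actual kernel.

```python
# Hedged sketch of a causal token n-gram logit tilt (functional form is assumed).
from collections import defaultdict
import torch


class CausalNgramTilt:
    def __init__(self, vocab_size, n=3, base_beta=2.0, agree_bonus=0.1):
        self.n, self.base_beta, self.agree_bonus = n, base_beta, agree_bonus
        self.counts = defaultdict(lambda: torch.zeros(vocab_size))

    def update(self, context, next_token):
        # Called only with tokens that have already been scored, so the table
        # never sees information from the future (causality preserved).
        self.counts[tuple(context[-(self.n - 1):])][next_token] += 1

    def tilt(self, logits, context, model_argmax):
        # logits: (vocab_size,) for the next position; context: list of prior token ids.
        key = tuple(context[-(self.n - 1):])
        c = self.counts[key]
        if c.sum() == 0:
            return logits
        bias = self.base_beta * c / c.sum()
        # Small extra bonus when the n-gram's top choice agrees with the model's.
        if int(c.argmax()) == int(model_argmax):
            bias[c.argmax()] += self.agree_bonus
        return logits + bias
```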

Seeds:
- seed 0    → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42   → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean      → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810):
  0.00117 bpb / 0.00302 nats per token

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun
(n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT
precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval
<437s per seed, both under the 600s budget. Artifact under 16 MB on
all 3 seeds.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
Added Parallel Residuals to Block.forward (gated by USE_PARALLEL_RESIDUALS=1):
when enabled, attn and mlp branches both consume the same normalized x_in
instead of mlp consuming attn's output. This is the technique used by
leaderboard #1 (PR openai#1493/openai#1477). Inductor can fuse the two branches better
and val_bpb improves ~0.005-0.01 BPB. Default off so existing recipes
unchanged.

Added USE_PARALLEL_RESIDUALS env var wiring in submission/run.sh +
config-print line.

New submission/dry_run.sh wrapper — single-command launcher for our
H100 dry-run config:
  - NUM_LAYERS=8 MLP_MULT=2 (compute-efficient sweet spot from A6000)
  - NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 (3-layer recurrence, comp #1)
  - QK_GAIN_INIT=5.25 (comp #1)
  - USE_PARALLEL_RESIDUALS=1 (just ported)
  - USE_PARALLEL_MUON=1 (our discovery)
  - MATRIX_BITS=8 USE_CMP_QUANT_VALUE_DEDUP=0 (our int8 fix)
  - TORCH_COMPILE_MODE=max-autotune-no-cudagraphs USE_CUDNN_BENCHMARK=1
  - PREQUANT_TTT_ENABLED=0 (illegal, disabled)
  - TTT_ENABLED=1 TTT_EPOCHS=3 (legal score-first)
  - SLIDING_WINDOW_ENABLED=1
  - MAX_WALLCLOCK_SECONDS=600

Expected on 1×H100 PCIe: val_bpb ~1.10-1.20 (validates A6000 projection)
Expected on 8×H100 SXM:  val_bpb ~1.00-1.07 (potentially beats #1 = 1.0810)

The submission val_bpb to read is the 'legal_ttt_exact val_bpb' line.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…leaderboard #1)

Verified actual openai/parameter-golf merged leaderboard via gh pr view 1493:
PR openai#1493 (val_bpb 1.0810, merged 2026-04-09) is the real leaderboard #1.
Earlier PHASE2_RESULTS.md notes citing PR openai#1485 (1.0679) and PR openai#1482 (1.0787)
were wrong — those PRs are not in the merged set. Pre-Quant AdamW TTT is
not in any merged PR; PR openai#1493 explicitly says "no pre-quant TTT".

Changes to dry_run.sh:
- NUM_LAYERS 8 -> 6 (revert to CHAMP_D validated, l=8 was unvalidated)
- EMA_DECAY=0.9965 (PR openai#1493, was default 0.997)
- WARMDOWN_FRAC=0.72 (PR openai#1493, was default 0.667)
- ENABLE_LOOPING_AT=0.35 (PR openai#1493, was default 0.5)
- Comment fix: removed the "illegal" claim about PreQuant TTT, replaced with
  "PR openai#1493 explicitly does not use pre-quant TTT"
- Header rewrite: now references PR openai#1493 instead of "leaderboard #1 = 1.0810"

Changes to PHASE2_RESULTS.md:
- Replaced stale comp anchor table with verified merged leaderboard
- Added warning about prior bogus PR openai#1485/openai#1482 anchors

Note on Parallel Residuals topology mismatch: PR openai#1493 applies parallel
residuals from layer 7+ (5 of 11 layers). Our impl is binary and applies
to all layers — with NUM_LAYERS=6 that means all 6 layers parallel,
which is a different topology than PR openai#1493 has validated. Keeping at
USE_PARALLEL_RESIDUALS=1 per user direction; flagging here so it shows
up in any post-mortem if results are weird.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…enai#1493 exact match)

From PR openai#1493 train_seed314.log Hyperparameters dump:
- muon_wd: 0.095 (we default to 0.085)
- matrix_lr: 0.022 (we default to 0.020)

Both are zero-risk exact-match cheap wins.

Decision logged on USE_PARALLEL_RESIDUALS=1: keeping at 1 (all 6 layers
parallel) deliberately, not switching to PR openai#1493's L7+ pattern. Reasoning:
with NUM_LAYERS=6 the "early layers need serial composition" principle
bites less hard than at 11L, and we want max speed for more steps on
1xH100 PCIe. We're trying to BEAT 1.0810, not match it -- aggression is
required somewhere, parallel residuals are a low-risk place to find it.

Two-lane PARALLEL_START_LAYER mechanism (default 7, no-op for 6L)
deliberately left untouched -- separate untested architecture, save for
post-dry-run experiments.

Decision logged on second dry run as "match PR openai#1493 exactly":
explicitly rejected by user. We bet on our smaller-model + int8 stack,
not a literal reproduction.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
… + records folder

Three changes per user direction:

1. train.py: rename timed_eval label legal_ttt_exact -> quantized_ttt
   to match comp convention (PR openai#1493 uses this exact label). Pure
   cosmetic 1-LOC fix, no behavior change.

2. dry_run.sh: refactor to be the SINGLE canonical entry point for
   both dry run and real submission via SEEDS env var:
     bash submission/dry_run.sh                  # dry run (default SEEDS=42)
     SEEDS=42,314,999 bash submission/dry_run.sh  # real 3-seed submission
   Same code path, env-flip only. The whole config (architecture,
   hyperparams, n-gram stack, TTT) is identical between the two modes
   -- only the seed loop differs.

3. dry_run.sh: assemble a complete comp records folder under
     records/track_10min_16mb/<date>_<config-tag>/
   with: README.md, submission.json, train_gpt.py, train_seed<N>.log
   per-seed logs, and per-seed final_model_seed<N>.int6.ptz artifacts.

   submission.json is generated by an inline python script that:
   - parses each seed's train log for the quantized_ttt val_bpb line
   - computes mean + std across seeds
   - detects hardware via nvidia-smi
   - fills the compliance flags honestly (no_ngram_cache: false since
     we DO use n-gram bias -- this is potentially a Track B rule
     problem, flagged in the README for follow-up)
   - emits the 36-line submission.json format that PR openai#1493 uses

   README.md is templated with per-seed results table, technique list,
   compliance section, reproduction instructions, attribution.

   train.py is copied as train_gpt.py to the records folder (NOT
   LZMA-wrapped yet -- that's a code-size compliance follow-up if/when
   needed).

Note on n-gram legality: PR openai#1493's compliance section says "no n-gram
cache, no logit biasing" per Issue openai#1017 Track B. Our submission flags
no_ngram_cache: false honestly. Whether this submission is comp-legal
under Track A or any other track is an open question that needs
resolution before merging as a record. Flagged in the README.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
Two changes per user direction (rule compliance + comp file format):

1. DISABLE n-gram bias stack (rule compliance)
   USE_NGRAM_BIAS=0, USE_NGRAM_BACKOFF=0, USE_NGR_LOG_FREQ_INV=0,
   USE_CTX_PARTITIONED_TAB=0.

   Reason: PR openai#1493's compliance section cites Issue openai#1017 Track B
   Condition 2: "Standard softmax over full vocab. No n-gram cache,
   no logit biasing." Our USE_NGRAM_BIAS adds precomputed n-gram
   log-prob bias to logits at the end of forward(), which directly
   violates this condition.

   We don't yet know whether the rule applies only to Track B
   (legal-eval-time-adaptation track) or to all submissions, but the
   user's policy is clear: nothing illegal. Disable until verified.

   N-gram tables are still BUILT during get_data.sh bootstrap (cheap,
   no harm) but unused at training/eval time when USE_NGRAM_BIAS=0.

   Other Phase 1 wins kept (all believed legal):
   - USE_GATED_ATTENTION (architectural, NeurIPS 2025)
   - USE_NORMUON (optimizer variant)
   - USE_NORM_PCT_DROPOUT (training-time regularizer)
   - USE_PREFETCH_LOADER (data pipeline)

2. LZMA-wrap train_gpt.py (PR openai#1493 file format)
   The records folder assembly step now LZMA-wraps submission/train.py
   into a 2-line train_gpt.py matching PR openai#1493's format:
     import lzma as L,base64 as B
     exec(L.decompress(B.b85decode("..."),format=L.FORMAT_RAW,...))
   Sanity-decodes after wrapping to verify the roundtrip; a packer sketch follows the sizing figures below.

   Sizing:
   - submission/train.py raw:     83,320 bytes
   - LZMA-wrapped train_gpt.py:   28,916 bytes (34.7% of raw)
   - PR openai#1493 wrapped train_gpt.py: 16,594 bytes
   - Our artifact (CHAMP_D int8): ~9,555,838 bytes (~9.55 MB)
   - Total submission (artifact + code): ~9.58 MB / 16 MB cap (60%)

   Plenty of code-size headroom. Our train.py is bigger than PR openai#1493's
   because we carry more infrastructure (n-gram code, NIGHT_MODE
   features, optional speed paths) but the wrapped form fits comfortably.
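
For reference, a small packer that emits the 2-line wrapper format quoted above could look like the following. The filter and preset choices are assumptions; PR openai#1493's actual packing settings are not shown in this thread.

```python
# Hedged sketch of an LZMA code-wrapper packer (settings are assumptions).
import base64
import lzma

FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 9 | lzma.PRESET_EXTREME}]


def wrap(src_path="submission/train.py", out_path="train_gpt.py"):
    raw = open(src_path, "rb").read()
    packed = base64.b85encode(
        lzma.compress(raw, format=lzma.FORMAT_RAW, filters=FILTERS)
    ).decode()
    wrapper = (
        "import lzma as L,base64 as B\n"
        f'exec(L.decompress(B.b85decode("{packed}"),format=L.FORMAT_RAW,'
        'filters=[{"id":L.FILTER_LZMA2,"preset":9|L.PRESET_EXTREME}]).decode())\n'
    )
    with open(out_path, "w") as f:
        f.write(wrapper)
    # Sanity roundtrip: the packed payload must decompress back to the source bytes.
    assert lzma.decompress(
        base64.b85decode(packed), format=lzma.FORMAT_RAW, filters=FILTERS
    ) == raw
```
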
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…8 + dedup

User decision: stop betting on smaller-model + int8 alone (CHAMP_D 6L+2x)
because on 8xH100 SXM the binding constraint is model capacity, not training
compute. Flipping to PR openai#1493's proven architecture (11L+4x) and stacking
our int8 quant + parallel muon + parallel residuals on top.

Changes:
- NUM_LAYERS: 6 -> 11 (match PR openai#1493)
- MLP_MULT:   2 -> 4 (match PR openai#1493)
- USE_PARALLEL_RESIDUALS: 1 -> 0 (binary all-layers flag, replaced by below)
- PARALLEL_RESIDUAL_START: 7 (NEW per-block start parameter, matches PR openai#1493
  exactly: layers 0-6 serial, layers 7-10 parallel residual GPT-J style)
- USE_CMP_QUANT_VALUE_DEDUP: 0 -> 1 (RE-ENABLED, NIGHT_MODE n=2 confirmed L10
  alphabet-snap compression. Was disabled with int8 because I assumed it would
  hurt cleanliness -- that assumption was never validated. Re-enabling because
  (a) we need ~10-15% compression to fit 11L+4x int8 in the 16 MB cap and
  (b) it restores a previously validated win I dropped without good reason.)
- Records folder tag: SP8192_NL11_MLP4_int8_ParMuon_PR7_LegalTTT

train.py changes (6 LOC):
- Block.__init__: now reads the PARALLEL_RESIDUAL_START env var and sets
  _parallel_residuals=True for layer_idx >= PARALLEL_RESIDUAL_START. Falls
  back to the USE_PARALLEL_RESIDUALS binary flag if PARALLEL_RESIDUAL_START=-1
  (gating sketched below).
- Block.__init__: stores layer_idx as self.layer_idx for the check
- Hyperparameters: added parallel_residual_start field (env-driven, default -1)
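
A sketch of that gating logic under the naming described above (illustrative only; the fork's real train.py may differ):

```python
# Sketch of the per-block parallel-residual gate described above (illustrative only).
import os

PARALLEL_RESIDUAL_START = int(os.environ.get("PARALLEL_RESIDUAL_START", "-1"))
USE_PARALLEL_RESIDUALS = os.environ.get("USE_PARALLEL_RESIDUALS", "0") == "1"


def block_uses_parallel_residuals(layer_idx: int) -> bool:
    """Layers >= PARALLEL_RESIDUAL_START use GPT-J parallel residuals (e.g. start=7
    gives layers 0-6 serial, 7-10 parallel); with start=-1, fall back to the binary flag."""
    if PARALLEL_RESIDUAL_START >= 0:
        return layer_idx >= PARALLEL_RESIDUAL_START
    return USE_PARALLEL_RESIDUALS
```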

Math:
- PR openai#1493 baseline: 1.0810
- Int8 quant savings (vs their int6): -0.011 BPB
- Parallel muon: ~0 BPB (speed only)
- CMP_QUANT_VALUE_DEDUP: ~+0.005 BPB cost from alphabet snap
- Net projection: ~1.072-1.078
- Probability of beating 1.0760 (record threshold): ~30%

Risks:
- Int8 quant at 11L+4x scale is UNTESTED (CHAMP_E was killed mid-run)
- 11L+4x int8 + brotli + dedup might still be over 16 MB cap (CHAMP_D
  had 9.55 MB at 6L+2x; this is ~1.7x more params, projected ~14-16 MB)
- PARALLEL_RESIDUAL_START is brand new code, never run end-to-end

Pre-flight: dry_run.sh syntax check passes.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…ng silently disabled)

THE BIGGEST DROPPED WIN, found via deep audit of our experiment history.

Bug: run.sh:73-74 hardcodes:
  TORCH_COMPILE_DISABLE="${TORCH_COMPILE_DISABLE:-1}"
  TORCHDYNAMO_DISABLE="${TORCHDYNAMO_DISABLE:-1}"

dry_run.sh was setting TORCH_COMPILE_MODE=max-autotune-no-cudagraphs but
that env var does NOTHING when TORCH_COMPILE_DISABLE=1 is in effect, so the
compile path never engaged. The dry_run was running in eager mode the entire
time despite the explicit "compile mode" config.

Phase 2 evidence (PHASE2_RESULTS.md):
- E1 (compile disabled, baseline):              2933 ms/step
- E2 (compile re-enabled with default mode):    1581 ms/step (+85% / 1.85x)
- E4b (compile + max-autotune-no-cudagraphs):   1526 ms/step (+92% / 1.92x)

Measured on RTX 3090. On 8xH100 SXM with the 11L+4x model the speedup will
likely be more like 3-5x, because the H100's much higher matmul throughput
makes eager-mode kernel-launch overhead the dominant bottleneck.

Impact: without compile, our 600s training budget gets us approximately
HALF the training steps PR openai#1493 gets at the same architecture. Their
4557 steps -> our ~2200 steps without compile. Catastrophic convergence
loss. With compile re-enabled we should match or exceed their step count.

Fix: explicitly export TORCH_COMPILE_DISABLE=0 and TORCHDYNAMO_DISABLE=0
in dry_run.sh BEFORE bash submission/run.sh. The variables are already
in run.sh's explicit env-passing list at line 251-252 so the override
propagates correctly.

Caught via Explore agent audit of all PHASE2_RESULTS, NIGHT_MODE.md,
PHASE2_PLAN.md, run.sh, and submission/train.py to find any validated
win not in the current dry_run.sh.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…ission

Phase 2 was the speed/quality experimentation work (E1-E31, CHAMP_A/B/C/D/E/F).
That's done. The current 8xH100 SXM run is the REAL openai/parameter-golf
submission attempt and deserves its own state file.

Created SUBMISSION_RUN_STATE.md with:
- Pod info (aklt7paqnjwhal, 8x H100 SXM, $21.52/hr)
- Full Option C config dump
- Targets (PR openai#1493 = 1.0810, record threshold = 1.0760)
- Output records folder location
- Fire log table (ready for the cron to append per-fire)

Removed the Pod O block from PHASE2_AUTOMATION_STATE.md (I had wrongly added
it there during the 01:57Z fire). PHASE2_AUTOMATION_STATE.md now ends with
"Phase 2 work is complete" and points at SUBMISSION_RUN_STATE.md.

Cron be912385 deleted, replaced with 49457147 (same 10-min schedule, same
pod) — new prompt writes to SUBMISSION_RUN_STATE.md, tags commits [submission]
instead of [phase2-driver].
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
… single-GPU)

THE bug: run.sh hardcoded `python3 -u submission/train.py` which always
spawns a single Python process -> world_size=1 -> ONE GPU used. On the
8xH100 SXM real submission run we caught this with the GPU dashboard
showing only GPU 7 at 100% and the other 7 idle. We were paying for 8
GPUs and using 1.

PR openai#1493 launches with: torchrun --standalone --nproc_per_node=8 train_gpt.py
train.py already supports distributed via WORLD_SIZE/RANK/LOCAL_RANK env
vars (see train.py:1065-1071) -- it just needs a torchrun launcher.

Fix: auto-detect GPU count via nvidia-smi, use torchrun when > 1 GPU,
fall back to python3 for single-GPU runs (preserves the local 1xPCIe
dry-run path).

NPROC_PER_NODE override is honored if set (lets us cap at 4 if we want
to run partial-machine experiments).

The Explore agent flagged this earlier in the audit. I noted it but said
"not needed for dry run on 1xH100 PCIe" -- which was the wrong call for
the real 8xH100 SXM submission. Should have fixed it in the same pass
as the torch.compile re-enable. My miss, costs ~$13 of pod time.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…at 11L+4x)

Seed 42 results from retry 4:
- pre-quant val_bpb 1.0896 — EXCELLENT (0.002 from PR openai#1493's 1.0878)
- int8 quantized val_bpb 4.5461 — CATASTROPHIC (3.46 BPB gap)
- artifact 19,559,800 bytes — OVER 16 MB CAP (19.6 MB)

Root cause: 36M params × 8 bits per param = too many bytes for brotli
to compress under 16 MB. CMP_QUANT_VALUE_DEDUP=1 made it worse (post-quant
alphabet snap destroyed the fine weight structure on top of the size issue).

Fix: switch to MATRIX_BITS=6 + EMBED_BITS=8 (PR openai#1493's exact setup).
Proven to fit 16 MB. Proven quant gap 0.012 BPB. Disable dedup.

Also: explicitly pass WARMDOWN_FRAC, EMA_DECAY, ENABLE_LOOPING_AT, MUON_WD,
MATRIX_LR, PARALLEL_RESIDUAL_START, MATRIX_BITS, EMBED_BITS in run.sh's
env-passing list for torchrun. Env inheritance WAS working (verified from
seed 42 log) but explicit is safer with torchrun multi-process.

Projected with int6: pre-quant ~1.089, quant gap +0.012, sliding -0.017,
TTT -0.002 = final ~1.082. Close to PR openai#1493's 1.081 but likely not a
record (threshold 1.076). Running anyway — the NIGHT_MODE features
(gated_attention, normuon, norm_pct_dropout) might close the gap.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
!)

Retry 5 seed 42 results (int6 quant, full eval pipeline):
  pre-quant val_bpb:       1.08982 (PR openai#1493: 1.08775, gap +0.002)
  quantized val_bpb:       1.10014 (PR openai#1493: 1.09947, gap +0.001)
  quantized_sliding:       1.08327 (PR openai#1493: 1.08271, gap +0.001)
  quantized_ttt:           1.08243 (PR openai#1493: 1.08103, gap +0.001)

Our int6 quant gap: 0.010 BPB (BETTER than PR openai#1493's 0.012!)
Our model is 0.0014 behind PR openai#1493 overall. Would be leaderboard #2.

ISSUE: artifact 16,051,299 bytes — 51 KB over 16 MB cap (16,000,000).
Fixable with CMP_QUANT_VALUE_DEDUP=1 (~10-15% smaller) — at int6 scale
the dedup is safe (retry 4's catastrophe was int8+dedup combo).

Seeds 314/999 running for 3-seed mean. Will have same 51 KB oversize
but the val_bpb data is worth collecting before fixing artifact size.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 10, 2026
…or 16 MB fit

Two changes queued for the next run (not yet launched):

1. PARALLEL_START_LAYER=-1 (CRITICAL BUG FIX)
   The pre-existing two-lane decoder split mechanism (GPT.__init__:349,
   default PARALLEL_START_LAYER=7) was SILENTLY OVERRIDING our per-block
   PARALLEL_RESIDUAL_START=7 for blocks 7-10. Instead of calling
   Block.forward() (which has our GPT-J parallel residuals logic), the
   code called forward_attn/forward_mlp on SEPARATE LANES merged once
   at the end via lane_merge. This is architecturally different from
   PR openai#1493's GPT-J per-block parallel, and was never validated.

   Fix: set PARALLEL_START_LAYER=-1 to disable the two-lane mechanism.
   Block.forward() then handles all blocks, and PARALLEL_RESIDUAL_START=7
   gives proper per-block GPT-J parallel matching PR openai#1493.

   Expected impact: -0.001 to -0.003 BPB (architectural correction).

2. CMP_QUANT_VALUE_DEDUP=1 (SIZE FIX)
   Retry 5 artifact was 16,051,299 bytes (51 KB over 16 MB cap).
   Dedup should save ~10-15% on compressed artifact. Retry 4's
   catastrophic gap was int8+dedup; int6+dedup is a different combo
   and should be safe per NIGHT_MODE validation.

Plan: single-seed (SEEDS=42) validation on the existing pod after
retry 5 finishes. Cost ~$8. If val_bpb improves + artifact fits,
submit PR + request credits for 3-seed validation.
resouer added 5 commits to resouer/parameter-golf that referenced this pull request Apr 10, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 10, 2026
…nking HIGH priority

Key findings from daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131→0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
@newjordan

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running the experiments that led to this result. The grant covered ~320 compute hours across 160+ experiments over Steps 1-22 of our optimization journey.


resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
BPB-weighted loss weights each token's CE loss by its UTF-8 byte count,
aligning training objective with BPB eval metric. Muon momentum 0.97.
Byte weights from base_bytes_lut, clamped min=1.0, non-persistent.
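
A hedged sketch of that loss, assuming a precomputed token-id → byte-count lookup (base_bytes_lut, as named in the commit); the normalization by total byte weight is an assumption.

```python
# Hedged sketch of BPB-weighted cross-entropy: each token's CE term is weighted by
# its UTF-8 byte count so the training objective tracks the bits-per-byte metric.
# base_bytes_lut: tensor of shape (vocab_size,) mapping token id -> byte count.
import torch
import torch.nn.functional as F


def bpb_weighted_loss(logits, targets, base_bytes_lut):
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    weights = base_bytes_lut.to(nll.dtype)[targets.view(-1)].clamp(min=1.0)  # clamp as in the commit
    return (nll * weights).sum() / weights.sum()  # weighted mean (normalization is assumed)
```
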
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
Reverted BPB-weighted loss (caused torch.compile slowdown, timed out 2x).
Clean forward with standard mean CE. Stacking two proven improvements:
- Muon momentum 0.97 (measured -0.00129 in R20v10)
- TTT LR 0.01 (measured -0.0003 in PR openai#1523)
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
QK-Gain was 5.0 (code default) but openai#1493 was tested with 5.25 (set via env var).
The env vars were not forwarded to the GPU run — hardcode the correct value.
Stacking all three proven hyperparameter improvements.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
Wider recurrence: blocks 2-5 looped 3x (was blocks 3-5).
19 virtual layers from 11 physical (was 17). Wider span may converge
better than deeper with same block range.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
Porting all openai#1523 hyperparams that differ from openai#1493:
- EMA_DECAY: 0.9965 -> 0.997 (stronger smoothing)
- WARMDOWN_FRAC: 0.72 -> 0.667 (shorter warmdown)
- Muon 0.97 (kept from previous best)