
Non-record: 131 Systematic Experiments — 1.5207 BPB on RTX 4000 Ada #1434

Open
ranausmanai wants to merge 3 commits into openai:main from ranausmanai:systematic-hyperparameter-exploration

Conversation


@ranausmanai ranausmanai commented Apr 7, 2026

Summary

131 systematic experiments across 18 phases on a single RTX 4000 Ada GPU ($0.20/hr, about $5 total compute budget), achieving 1.5207 BPB from a cheap single-GPU proxy setup.

  • GPU: Single RTX 4000 Ada (20GB VRAM) on RunPod
  • Training proxy: 276 steps in 600s wallclock on this GPU (about 2180ms/step)
  • Artifact: about 14MB int8+zlib (under 16MB budget)
  • Budget: about $5 total RunPod credits
  • Full experiment log: experiment_log.md

Best Configuration (RTX 4000 Ada proxy run)

ITERATIONS=400 \
TIDAL_LR=1 \
LOGIT_SOFTCAP=15.0 \
ROPE_BASE=5000 \
PARALLEL_BLOCK=1 \
MLP_ACT=silu2 \
HEAD_DIVERSITY=1e-4 \
EMBED_LR=0.8 \
MATRIX_LR=0.11 \
ENCODER_LAYERS=0 \
NUM_KV_HEADS=2 \
TIE_EMBEDDINGS=0

Important: ITERATIONS=400 is the proxy schedule horizon used on the RTX 4000 Ada. The competition requirement is 10 minutes on 8xH100 SXM, not 400 fixed steps. Anyone validating this idea on 8xH100 should keep the 600s wallclock cap and retune the schedule horizon for that throughput regime.

Key Findings (18 phases, 131 experiments)

What works (ranked by impact):

  1. Parallel blocks (PaLM-style) — faster and better quality
  2. Untied embeddings — biggest late-stage win
  3. Pure decoder (ENCODER_LAYERS=0) — monotonically better with fewer encoder layers
  4. GQA with 2 KV heads — faster and better quality than the 4-KV baseline in this regime
  5. SiLU² activation — best with parallel blocks
  6. Logit softcap 15 — better than the default 30 in this stack
  7. Matrix LR 0.11 — better than the default 0.04 / earlier 0.10
  8. Tidal LR — golden-ratio warmup / decay schedule
  9. RoPE base 5000 — better than 10000 here

Dead branches:

  • All gradient tricks (Canopy, Mycorrhizal, Thermal Vent, Predator-Prey)
  • All regularization tested here (Z-loss, token dropout, embed mixup, label smoothing)
  • Weight decay schedules
  • 3x MLP width on this GPU budget
  • 10 layers in the 600s proxy regime
  • MQA with 1 KV head
  • Breathing / multi-cycle LR variants

Progression

Baseline (cosine):      1.6117
+ Tidal LR:             1.5906
+ SC20 + RoPE 5k:       1.5744
+ Parallel blocks:      1.5600
+ SiLU²:                1.5527
+ Asymmetric 1/8:       1.5377
+ SC15 + MatLR 0.10:    1.5354
+ MatLR 0.11:           1.5341
+ GQA 2KV:              1.5329
+ Untied embeddings:    1.5211
+ Pure decoder:         1.5207

How to Reproduce

bash run_best.sh

This reproduces the single-GPU RTX 4000 Ada proxy run reported here. The training script is train_gpt_focal_fixed.py, with features controlled by environment variables.

Files

  • train_gpt_focal_fixed.py: Training script with all features (env var controlled)
  • experiment_log.md: Complete 131-experiment log with per-phase analysis
  • run_best.sh: One-command reproduction of the RTX 4000 Ada proxy run
  • wave5_arch.sh to wave12_aggressive.sh: Batch scripts used for the later search waves

Previous PRs

Test plan

  • Run bash run_best.sh on a comparable CUDA GPU to verify the RTX 4000 Ada proxy result
  • Verify artifact size stays under 16MB
  • Review experiment_log.md for the full 131-experiment breakdown

…(-0.091 vs baseline)

Comprehensive hyperparameter exploration on a single RTX 4000 Ada GPU ($0.20/hr).
Best config combines: Tidal LR, parallel blocks, SiLU², pure decoder (0 encoder layers),
GQA with 2 KV heads, untied embeddings, logit softcap 15, RoPE base 5000, matrix LR 0.11.

Key findings from 18 experimental phases:
- Parallel blocks (PaLM-style): biggest win, -0.052 BPB AND faster
- Untied embeddings: -0.012, biggest late-stage improvement
- Pure decoder (0 encoder layers): monotonically better with fewer encoders
- GQA 2 KV heads: faster (276 vs 265 steps in 600s) AND better quality
- Dead branches: all gradient tricks, WD schedule, 3x MLP (too slow), 10 layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ranausmanai changed the title from "Non-record: 120+ Systematic Experiments — 1.5207 BPB on RTX 4000 Ada" to "Non-record: 131 Systematic Experiments — 1.5207 BPB on RTX 4000 Ada" on Apr 7, 2026
ranausmanai and others added 2 commits April 7, 2026 12:13
@MatoTeziTanka

Thanks for tagging me, and for taking the #1380 feedback constructively — 131 documented runs with broken-run admissions and a public log is exactly the kind of work this competition needs more of.

Code spot-check (verified at SHA cbf2d97):

  • L667-674: parallel block is faithful PaLM-style — single attn_norm, both attn and MLP consume the same normed tensor, residuals sum at the end. Clean.
  • L820-821: the focal-loss self.training guard from #1380 ("Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected)") is preserved. With the default FOCAL_GAMMA=0 it's off in run_best.sh anyway.
  • L1084-1094: Tidal LR is warmup_frac = 0.382 linear ramp then cosine decay over 0.618. Your own comment says "Golden ratio asymmetry" — fair branding, but mechanistically it's cosine with an extended linear warmup prefix.
  • L705-706 + L732: ENCODER_LAYERS=0 is a real no-U-net config (drops skip_weights), and untied embeddings actually instantiate a separate CastedLinear lm_head (~26M extra params at vocab × d_model).
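
For readers unfamiliar with the pattern, a PaLM-style parallel block computes attention and MLP from the same normed input and sums both contributions into the residual, rather than chaining them. A minimal sketch with stand-in scalar callables (not the repo's actual module, which operates on tensors):

```python
# Sketch: sequential (classic pre-norm) vs. PaLM-style parallel blocks.
# attn / mlp / norm are toy callables on floats; real code uses tensors.

def sequential_block(x, norm, attn, mlp):
    # Classic pre-norm: attention first, then the MLP sees the
    # attention-updated residual.
    x = x + attn(norm(x))
    x = x + mlp(norm(x))
    return x

def parallel_block(x, norm, attn, mlp):
    # PaLM-style: a single norm, attn and MLP read the SAME normed tensor,
    # and both residual contributions are summed at the end. Because the
    # MLP no longer depends on the attention output, the two matmuls can
    # run concurrently.
    h = norm(x)
    return x + attn(h) + mlp(h)

if __name__ == "__main__":
    norm = lambda v: v / 2.0   # toy "normalization"
    attn = lambda v: 0.1 * v   # toy attention
    mlp = lambda v: 0.2 * v    # toy MLP
    print(sequential_block(4.0, norm, attn, mlp))
    print(parallel_block(4.0, norm, attn, mlp))
```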

The methodology caveat I'd flag, same as #1380:

Your proxy is 276 steps in 600s on RTX 4000 Ada. The competition is ~6800 steps in 600s on 8×H100 SXM. That's 25× more training, and the regime gap is the whole story for any finding that affects the shape of training:

  • Matrix LR 0.11 vs 0.04 default: Higher LR helps when you're nowhere near a minimum. At 6800 steps you're approaching one — same dynamic as the cosine LR finding in #1380. Likely to destabilize at scale.
  • Tidal 38.2% warmup: 38.2% × 276 ≈ 105 warmup steps. 38.2% × 6800 ≈ 2,600 warmup steps before any decay starts — that leaves the model effectively warming up through most of the wallclock budget on real hardware.
  • RoPE base 5000 vs 10000: RoPE base mostly affects long-context behavior. At 1024 seq len with longer training, the default holds.
  • ENCODER_LAYERS=0 (no U-net): The Muon/modded-nanoGPT skip-weight trick specifically needs a lot of steps to train the skip_weights parameters. At 276 steps they haven't learned anything; at 6800 they do real work. Removing them is a baseline-regime artifact.

The findings most likely to survive at scale: parallel blocks (already standard in many SOTA stacks here), untied embeddings (PR #1394 uses these), and tighter logit softcap (cheap to verify).
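
Of those, logit softcapping is the cheapest to sanity-check in isolation. The usual form (as in Gemma-2-style stacks) is cap * tanh(logits / cap); a sketch, assuming that is the variant used here:

```python
import math

def softcap(logit, cap=15.0):
    # Smoothly bounds logits to (-cap, cap); near-identity for |logit| << cap.
    return cap * math.tanh(logit / cap)

# Near zero the map is almost the identity...
print(softcap(1.0))   # ~0.9985 with cap=15
# ...while large logits saturate just below the cap. A tighter cap of 15
# clips harder than the default 30:
print(softcap(100.0))            # ~15.0
print(softcap(100.0, cap=30.0))  # ~29.9
```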

Methodology tightening for the next round:

  1. Multi-seed your best config. A single H6 rerun isn't enough — the BPB noise floor is roughly ±0.001-0.002 and several of your late-stage wins (Matrix LR 0.10→0.11, softcap 13-17 sweep, HEAD_DIVERSITY on/off) are inside that.
  2. Matched-control ablations from H1. The 1.6117 → 1.5207 progression is greedy stacking — each phase tunes on top of the prior best. Try turning ONE thing off from H1 at a time so we can see what's actually load-bearing vs accumulated selection bias across 131 trials on the same val set.
  3. Longer-horizon proxy. Even a smaller model trained for 2000+ steps on a single H100 with the real batch size would tell you which findings invert at scale before you commit them to a SOTA-stack PR.
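
The multi-seed point can be phrased as a simple decision rule. A sketch, with the noise floor taken from the estimate quoted above rather than from the log itself:

```python
import statistics

def survives_noise(deltas, noise_floor=0.002):
    """Crude multi-seed check: does the mean BPB delta across seeds exceed
    both the stated noise floor and twice the seed-to-seed spread?
    A single run has unknown spread, so it can never pass."""
    mean = statistics.mean(deltas)
    spread = statistics.stdev(deltas) if len(deltas) > 1 else float("inf")
    return abs(mean) > noise_floor and abs(mean) > 2 * spread

# A -0.0013 win from one run sits inside the ±0.001-0.002 floor:
print(survives_noise([-0.0013]))                    # False
# The same sign across several seeds with a small spread would clear it:
print(survives_noise([-0.0105, -0.0098, -0.0112]))  # True
```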

On the 8×H100 verification ask: I'm holding my own compute spend until the maintainer situation clarifies (the README hasn't been updated since March 30, and the rule clarifications pending in #677 are still unanswered). Happy to revisit if a longer-horizon proxy on a single H100 confirms any of the architectural pieces — that's a much cheaper way to de-risk the regime gap before anyone burns 8-GPU credits.

The systematic methodology and transparency are real assets — keep that part. You're one of the few people in this thread doing honest negative results.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@MatoTeziTanka

Thanks for pointing me at the full log, @ranausmanai — I owe you a partial correction after reading it end-to-end. My prior comment was right in spirit but too strong on a few specifics.

What I should amend:

  1. Parallel blocks has a cross-schedule control I missed (A7 on Tidal + A8 on cosine, lines 184-185 of experiment_log.md). Both runs show the same direction of improvement on two different LR schedules — that's the closest thing to regime-independence in the whole log, and I should upgrade parallel blocks from "likely transfers" to "validated across schedules". My confidence on this one is now [V], not [I].

  2. The asymmetric split has a proper 4-point dose-response at D4/D5/D6/D7 (lines 280-283): 3/6, 2/7, 1/8, and a 5/4 "more encoder" control showing monotonic improvement (-0.068 / -0.071 / -0.074 / -0.055). That's a real ablation gradient, not a single greedy pick. I should soften "won't transfer" to "directionally interesting, magnitude regime-dependent". The "pure decoder wins" conclusion may still be step-budget-specific because the skip_weights trick specifically needs many more steps to train, but the direction is cleanly established.

  3. GQA 2KV is cross-validated at A17 (Phase 11) and F7 (Phase 16) — same direction across two very different configs. Worth more credit than I gave it.

  4. Noise-floor honesty: I implied you weren't flagging wins inside noise. You explicitly are — lines 292, 312, 332, 360 all have "~0.001-0.002 noise" callouts, and you mark "SC13-17 all within noise", "HD barely matters", "activation swaps are noise". That's real discipline.

  5. Intellectual honesty on failures: Phase 7 labeled "INVALID" (3000-step wallclock-cap LR-schedule corruption bug at lines 113-117 — that's a real trap worth publishing on its own), Phase 3 labeled "ALL BROKEN", and the whole "nature-inspired gradient tricks" class (Canopy/Mycorrhizal/Thermal Vent/Predator-Prey at lines 140-149) published as catastrophic negative results. That kind of self-correction in public is rare.

What I'd still flag:

  1. Multi-seed is genuinely absent — I want to be precise here. The log has 4 same-config confidence reruns (D10, E10, F10, H6), which establish a noise floor ~0.0007-0.002, but zero seed-varying reruns. Same-config reruns tell you "does this deterministic run reproduce?" — they don't tell you "does this survive seed-to-seed variance?" For a 131-trial sweep the seed-variance question is the more important one.

  2. Greedy stacking critique holds — no phase ever turns an earlier winner OFF from the final stacked config to confirm it still contributes. The progression chain 1.6117 → 1.5207 is monotonic by construction.

  3. The regime gap is uncontested by the data — all 131 experiments run at 240-276 steps. Zero cross-step-count tests. My "won't transfer at 6800 steps" critique has no counter-evidence in the log, and shape-dependent findings (Matrix LR 0.11, Tidal 38.2% warmup fraction, RoPE base 5000, ENCODER_LAYERS=0) are still likely to invert at 25× longer horizon.

Negative results worth publishing on their own:

The log contains several publishable-quality negative findings that deserve more visibility than the PR body gives them:

  • All gradient-level "nature" tricks (Canopy, Mycorrhizal, Predator-Prey, Thermal Vent): catastrophic, save everyone the time
  • WD schedule (0→0.04 ramp): catastrophic in every form tested
  • Label smoothing on tuned config: hurts
  • torch.compile(fullgraph=True) incompatible with any dynamic structure (layer dropout, weight sharing, progressive growing) — that's an engineering lesson worth a separate issue, not buried in a phase
  • 3000-step ITERATIONS with wallclock cap silently corrupts LR schedule (LR thinks it's at ~8% progress because it normalizes against iterations, not elapsed) — this is a real trap that other participants will hit
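
That last trap is easy to reproduce on paper: if the schedule normalizes progress by ITERATIONS rather than by elapsed wallclock, a run killed by the 600s cap only ever sees the first sliver of its planned decay. A sketch of the failure mode (hypothetical variable names, not the repo's code):

```python
# Failure mode: the LR schedule normalizes by the planned ITERATIONS,
# but the wallclock cap kills the run long before reaching them.

ITERATIONS = 3000       # planned schedule horizon
steps_before_cap = 250  # roughly what 600s buys on the RTX 4000 Ada

schedule_progress = steps_before_cap / ITERATIONS
print(f"{schedule_progress:.1%}")  # 8.3% -- the schedule "thinks" the run just started

# Consequence: the LR never decays; the model trains at near-peak LR the
# whole time, then stops. Fix: set ITERATIONS to the step count the
# wallclock actually allows, or normalize the schedule by elapsed time
# instead of step index.
```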

Would you be open to pulling Phase 7 out into its own issue or PR? "Heads up: ITERATIONS=3000 MAX_WALLCLOCK_SECONDS=600 silently corrupts LR schedules" is the kind of thing every participant needs to know before they waste a run.

Methodology template worth copying:

Your env-var feature gating pattern (one-line config diff per experiment), the ~10-experiments-per-wave structure, the kill-underperforming-waves-early discipline, and the same-config rerun every ~10 experiments to estimate noise floor — this is a reusable sweep template. It's better documented than most SOTA-claim PRs on this repo.

Correction to the raw count, while I'm here: The PR body claims 131 experiments. Counting the log tables, the actually-executed count is closer to ~110-115 — several "BROKEN" and "Not reached" rows in the tables are placeholders rather than completed runs. Not a big deal, but worth a footnote if you care about the number being exact.

Apologies for the undercount on parallel blocks, asymmetric split dose-response, and GQA cross-validation in my earlier comment. The full log is substantially stronger evidence than the PR summary made it look. Regime-gap concerns still stand; methodology concerns are narrower than I originally said.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
