
Non-record: 131 Systematic Experiments — 1.5207 BPB on RTX 4000 Ada #1434

Open
ranausmanai wants to merge 3 commits into openai:main from ranausmanai:systematic-hyperparameter-exploration

Conversation


@ranausmanai ranausmanai commented Apr 7, 2026

Summary

131 systematic experiments across 18 phases on a single RTX 4000 Ada GPU ($0.20/hr, about $5 total compute budget), achieving 1.5207 BPB from a cheap single-GPU proxy setup.

  • GPU: Single RTX 4000 Ada (20GB VRAM) on RunPod
  • Training proxy: 276 steps in 600s wallclock on this GPU (about 2180ms/step)
  • Artifact: about 14MB int8+zlib (under 16MB budget)
  • Budget: about $5 total RunPod credits
  • Full experiment log: experiment_log.md

Best Configuration (RTX 4000 Ada proxy run)

ITERATIONS=400 \
TIDAL_LR=1 \
LOGIT_SOFTCAP=15.0 \
ROPE_BASE=5000 \
PARALLEL_BLOCK=1 \
MLP_ACT=silu2 \
HEAD_DIVERSITY=1e-4 \
EMBED_LR=0.8 \
MATRIX_LR=0.11 \
ENCODER_LAYERS=0 \
NUM_KV_HEADS=2 \
TIE_EMBEDDINGS=0

Important: ITERATIONS=400 is the proxy schedule horizon used on the RTX 4000 Ada. The competition requirement is 10 minutes on 8xH100 SXM, not 400 fixed steps. Anyone validating this idea on 8xH100 should keep the 600s wallclock cap and retune the schedule horizon for that throughput regime.

Key Findings (18 phases, 131 experiments)

What works (ranked by impact):

  1. Parallel blocks (PaLM-style) — faster and better quality
  2. Untied embeddings — biggest late-stage win
  3. Pure decoder (ENCODER_LAYERS=0) — monotonically better with fewer encoder layers
  4. GQA with 2 KV heads — faster and better quality than the 4-KV baseline in this regime
  5. SiLU² activation — best with parallel blocks
  6. Logit softcap 15 — better than the default 30 in this stack
  7. Matrix LR 0.11 — better than the default 0.04 / earlier 0.10
  8. Tidal LR — golden-ratio warmup / decay schedule
  9. RoPE base 5000 — better than 10000 here

Dead branches:

  • All gradient tricks (Canopy, Mycorrhizal, Thermal Vent, Predator-Prey)
  • All regularization tested here (Z-loss, token dropout, embed mixup, label smoothing)
  • Weight decay schedules
  • 3x MLP width on this GPU budget
  • 10 layers in the 600s proxy regime
  • MQA with 1 KV head
  • Breathing / multi-cycle LR variants

Progression

Baseline (cosine):      1.6117
+ Tidal LR:             1.5906
+ SC20 + RoPE 5k:       1.5744
+ Parallel blocks:      1.5600
+ SiLU²:                1.5527
+ Asymmetric 1/8:       1.5377
+ SC15 + MatLR 0.10:    1.5354
+ MatLR 0.11:           1.5341
+ GQA 2KV:              1.5329
+ Untied embeddings:    1.5211
+ Pure decoder:         1.5207

How to Reproduce

bash run_best.sh

This reproduces the single-GPU RTX 4000 Ada proxy run reported here. The training script is train_gpt_focal_fixed.py, with features controlled by environment variables.

Files

  • train_gpt_focal_fixed.py: Training script with all features (env var controlled)
  • experiment_log.md: Complete 131-experiment log with per-phase analysis
  • run_best.sh: One-command reproduction of the RTX 4000 Ada proxy run
  • wave5_arch.sh to wave12_aggressive.sh: Batch scripts used for the later search waves

Previous PRs

Test plan

  • Run bash run_best.sh on a comparable CUDA GPU to verify the RTX 4000 Ada proxy result
  • Verify artifact size stays under 16MB
  • Review experiment_log.md for the full 131-experiment breakdown

…(-0.091 vs baseline)

Comprehensive hyperparameter exploration on a single RTX 4000 Ada GPU ($0.20/hr).
Best config combines: Tidal LR, parallel blocks, SiLU², pure decoder (0 encoder layers),
GQA with 2 KV heads, untied embeddings, logit softcap 15, RoPE base 5000, matrix LR 0.11.

Key findings from 18 experimental phases:
- Parallel blocks (PaLM-style): biggest win, -0.052 BPB AND faster
- Untied embeddings: -0.012, biggest late-stage improvement
- Pure decoder (0 encoder layers): monotonically better with fewer encoders
- GQA 2 KV heads: faster (276 vs 265 steps in 600s) AND better quality
- Dead branches: all gradient tricks, WD schedule, 3x MLP (too slow), 10 layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ranausmanai changed the title from "Non-record: 120+ Systematic Experiments — 1.5207 BPB on RTX 4000 Ada" to "Non-record: 131 Systematic Experiments — 1.5207 BPB on RTX 4000 Ada" on Apr 7, 2026
ranausmanai and others added 2 commits April 7, 2026 12:13
@MatoTeziTanka

Thanks for tagging me, and for taking the #1380 feedback constructively — 131 documented runs with broken-run admissions and a public log is exactly the kind of work this competition needs more of.

Code spot-check (verified at SHA cbf2d97):

  • L667-674: parallel block is faithful PaLM-style — single attn_norm, both attn and MLP consume the same normed tensor, residuals sum at the end. Clean.
  • L820-821: the focal-loss self.training guard from #1380 ("Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected)") is preserved. With the default FOCAL_GAMMA=0 it's off in run_best.sh anyway.
  • L1084-1094: Tidal LR is warmup_frac = 0.382 linear ramp then cosine decay over 0.618. Your own comment says "Golden ratio asymmetry" — fair branding, but mechanistically it's cosine with an extended linear warmup prefix.
  • L705-706 + L732: ENCODER_LAYERS=0 is a real no-U-net config (drops skip_weights), and untied embeddings actually instantiate a separate CastedLinear lm_head (~26M extra params at vocab × d_model).
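
For readers unfamiliar with the pattern, a PaLM-style parallel block computes attention and MLP from the same normed input and sums both contributions into the residual, rather than chaining them. A minimal sketch with stand-in scalar callables (not the repo's actual module, which operates on tensors):

```python
# Sketch: sequential (classic pre-norm) vs. PaLM-style parallel blocks.
# attn / mlp / norm are toy callables on floats; real code uses tensors.

def sequential_block(x, norm, attn, mlp):
    # Classic pre-norm: attention first, then the MLP sees the
    # attention-updated residual.
    x = x + attn(norm(x))
    x = x + mlp(norm(x))
    return x

def parallel_block(x, norm, attn, mlp):
    # PaLM-style: a single norm, attn and MLP read the SAME normed tensor,
    # and both residual contributions are summed at the end. Because the
    # MLP no longer depends on the attention output, the two matmuls can
    # run concurrently.
    h = norm(x)
    return x + attn(h) + mlp(h)

if __name__ == "__main__":
    norm = lambda v: v / 2.0   # toy "normalization"
    attn = lambda v: 0.1 * v   # toy attention
    mlp = lambda v: 0.2 * v    # toy MLP
    print(sequential_block(4.0, norm, attn, mlp))
    print(parallel_block(4.0, norm, attn, mlp))
```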

The methodology caveat I'd flag, same as #1380:

Your proxy is 276 steps in 600s on RTX 4000 Ada. The competition is ~6800 steps in 600s on 8×H100 SXM. That's 25× more training, and the regime gap is the whole story for any finding that affects the shape of training:

  • Matrix LR 0.11 vs 0.04 default: Higher LR helps when you're nowhere near a minimum. At 6800 steps you're approaching one — same dynamic as the cosine LR finding in #1380. Likely to destabilize at scale.
  • Tidal 38.2% warmup: 38.2% × 276 ≈ 105 warmup steps. 38.2% × 6800 ≈ 2,600 warmup steps before any decay starts — that leaves the model effectively warming up through most of the wallclock budget on real hardware.
  • RoPE base 5000 vs 10000: RoPE base mostly affects long-context behavior. At 1024 seq len with longer training, the default holds.
  • ENCODER_LAYERS=0 (no U-net): The Muon/modded-nanoGPT skip-weight trick specifically needs a lot of steps to train the skip_weights parameters. At 276 steps they haven't learned anything; at 6800 they do real work. Removing them is a baseline-regime artifact.

The findings most likely to survive at scale: parallel blocks (already standard in many SOTA stacks here), untied embeddings (PR #1394 uses these), and tighter logit softcap (cheap to verify).
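
Of those, logit softcapping is the cheapest to sanity-check in isolation. The usual form (as in Gemma-2-style stacks) is cap * tanh(logits / cap); a sketch, assuming that is the variant used here:

```python
import math

def softcap(logit, cap=15.0):
    # Smoothly bounds logits to (-cap, cap); near-identity for |logit| << cap.
    return cap * math.tanh(logit / cap)

# Near zero the map is almost the identity...
print(softcap(1.0))   # ~0.9985 with cap=15
# ...while large logits saturate just below the cap. A tighter cap of 15
# clips harder than the default 30:
print(softcap(100.0))            # ~15.0
print(softcap(100.0, cap=30.0))  # ~29.9
```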

Methodology tightening for the next round:

  1. Multi-seed your best config. A single H6 rerun isn't enough — the BPB noise floor is roughly ±0.001-0.002 and several of your late-stage wins (Matrix LR 0.10→0.11, softcap 13-17 sweep, HEAD_DIVERSITY on/off) are inside that.
  2. Matched-control ablations from H1. The 1.6117 → 1.5207 progression is greedy stacking — each phase tunes on top of the prior best. Try turning ONE thing off from H1 at a time so we can see what's actually load-bearing vs accumulated selection bias across 131 trials on the same val set.
  3. Longer-horizon proxy. Even a smaller model trained for 2000+ steps on a single H100 with the real batch size would tell you which findings invert at scale before you commit them to a SOTA-stack PR.
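
The multi-seed point can be phrased as a simple decision rule. A sketch, with the noise floor taken from the estimate quoted above rather than from the log itself:

```python
import statistics

def survives_noise(deltas, noise_floor=0.002):
    """Crude multi-seed check: does the mean BPB delta across seeds exceed
    both the stated noise floor and twice the seed-to-seed spread?
    A single run has unknown spread, so it can never pass."""
    mean = statistics.mean(deltas)
    spread = statistics.stdev(deltas) if len(deltas) > 1 else float("inf")
    return abs(mean) > noise_floor and abs(mean) > 2 * spread

# A -0.0013 win from one run sits inside the ±0.001-0.002 floor:
print(survives_noise([-0.0013]))                    # False
# The same sign across several seeds with a small spread would clear it:
print(survives_noise([-0.0105, -0.0098, -0.0112]))  # True
```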

On the 8×H100 verification ask: I'm holding my own compute spend until the maintainer situation clarifies (the README hasn't been updated since March 30, and the rule clarifications pending in #677 are still unanswered). Happy to revisit if a longer-horizon proxy on a single H100 confirms any of the architectural pieces — that's a much cheaper way to de-risk the regime gap before anyone burns 8-GPU credits.

The systematic methodology and transparency are real assets — keep that part. You're one of the few people in this thread doing honest negative results.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@MatoTeziTanka

Thanks for pointing me at the full log, @ranausmanai — I owe you a partial correction after reading it end-to-end. My prior comment was right in spirit but too strong on a few specifics.

What I should amend:

  1. Parallel blocks has a cross-schedule control I missed (A7 on Tidal + A8 on cosine, lines 184-185 of experiment_log.md). Both runs show the same direction of improvement on two different LR schedules — that's the closest thing to regime-independence in the whole log, and I should upgrade parallel blocks from "likely transfers" to "validated across schedules". My confidence on this one is now [V], not [I].

  2. The asymmetric split has a proper 4-point dose-response at D4/D5/D6/D7 (lines 280-283): 3/6, 2/7, 1/8, and a 5/4 "more encoder" control showing monotonic improvement (-0.068 / -0.071 / -0.074 / -0.055). That's a real ablation gradient, not a single greedy pick. I should soften "won't transfer" to "directionally interesting, magnitude regime-dependent". The "pure decoder wins" conclusion may still be step-budget-specific because the skip_weights trick specifically needs many more steps to train, but the direction is cleanly established.

  3. GQA 2KV is cross-validated at A17 (Phase 11) and F7 (Phase 16) — same direction across two very different configs. Worth more credit than I gave it.

  4. Noise-floor honesty: I implied you weren't flagging wins inside noise. You explicitly are — lines 292, 312, 332, 360 all have "~0.001-0.002 noise" callouts, and you mark "SC13-17 all within noise", "HD barely matters", "activation swaps are noise". That's real discipline.

  5. Intellectual honesty on failures: Phase 7 labeled "INVALID" (3000-step wallclock-cap LR-schedule corruption bug at lines 113-117 — that's a real trap worth publishing on its own), Phase 3 labeled "ALL BROKEN", and the whole "nature-inspired gradient tricks" class (Canopy/Mycorrhizal/Thermal Vent/Predator-Prey at lines 140-149) published as catastrophic negative results. That kind of self-correction in public is rare.

What I'd still flag:

  1. Multi-seed is genuinely absent — I want to be precise here. The log has 4 same-config confidence reruns (D10, E10, F10, H6), which establish a noise floor ~0.0007-0.002, but zero seed-varying reruns. Same-config reruns tell you "does this deterministic run reproduce?" — they don't tell you "does this survive seed-to-seed variance?" For a 131-trial sweep the seed-variance question is the more important one.

  2. Greedy stacking critique holds — no phase ever turns an earlier winner OFF from the final stacked config to confirm it still contributes. The progression chain 1.6117 → 1.5207 is monotonic by construction.

  3. The regime gap is uncontested by the data — all 131 experiments run at 240-276 steps. Zero cross-step-count tests. My "won't transfer at 6800 steps" critique has no counter-evidence in the log, and shape-dependent findings (Matrix LR 0.11, Tidal 38.2% warmup fraction, RoPE base 5000, ENCODER_LAYERS=0) are still likely to invert at 25× longer horizon.

Negative results worth publishing on their own:

The log contains several publishable-quality negative findings that deserve more visibility than the PR body gives them:

  • All gradient-level "nature" tricks (Canopy, Mycorrhizal, Predator-Prey, Thermal Vent): catastrophic, save everyone the time
  • WD schedule (0→0.04 ramp): catastrophic in every form tested
  • Label smoothing on tuned config: hurts
  • torch.compile(fullgraph=True) incompatible with any dynamic structure (layer dropout, weight sharing, progressive growing) — that's an engineering lesson worth a separate issue, not buried in a phase
  • 3000-step ITERATIONS with wallclock cap silently corrupts LR schedule (LR thinks it's at ~8% progress because it normalizes against iterations, not elapsed) — this is a real trap that other participants will hit
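
That last trap is easy to reproduce on paper: if the schedule normalizes progress by ITERATIONS rather than by elapsed wallclock, a run killed by the 600s cap only ever sees the first sliver of its planned decay. A sketch of the failure mode (hypothetical variable names, not the repo's code):

```python
# Failure mode: the LR schedule normalizes by the planned ITERATIONS,
# but the wallclock cap kills the run long before reaching them.

ITERATIONS = 3000       # planned schedule horizon
steps_before_cap = 250  # roughly what 600s buys on the RTX 4000 Ada

schedule_progress = steps_before_cap / ITERATIONS
print(f"{schedule_progress:.1%}")  # 8.3% -- the schedule "thinks" the run just started

# Consequence: the LR never decays; the model trains at near-peak LR the
# whole time, then stops. Fix: set ITERATIONS to the step count the
# wallclock actually allows, or normalize the schedule by elapsed time
# instead of step index.
```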

Would you be open to pulling Phase 7 out into its own issue or PR? "Heads up: ITERATIONS=3000 MAX_WALLCLOCK_SECONDS=600 silently corrupts LR schedules" is the kind of thing every participant needs to know before they waste a run.

Methodology template worth copying:

Your env-var feature gating pattern (one-line config diff per experiment), the ~10-experiments-per-wave structure, the kill-underperforming-waves-early discipline, and the same-config rerun every ~10 experiments to estimate noise floor — this is a reusable sweep template. It's better documented than most SOTA-claim PRs on this repo.

Correction to the raw count, while I'm here: The PR body claims 131 experiments. Counting the log tables, the actually-executed count is closer to ~110-115 — several "BROKEN" and "Not reached" rows in the tables are placeholders rather than completed runs. Not a big deal, but worth a footnote if you care about the number being exact.

Apologies for the undercount on parallel blocks, asymmetric split dose-response, and GQA cross-validation in my earlier comment. The full log is substantially stronger evidence than the PR summary made it look. Regime-gap concerns still stand; methodology concerns are narrower than I originally said.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
