Non-record: 131 Systematic Experiments — 1.5207 BPB on RTX 4000 Ada #1434
ranausmanai wants to merge 3 commits into openai:main from
Conversation
…(-0.091 vs baseline)

Comprehensive hyperparameter exploration on a single RTX 4000 Ada GPU ($0.20/hr). Best config combines: Tidal LR, parallel blocks, SiLU², pure decoder (0 encoder layers), GQA with 2 KV heads, untied embeddings, logit softcap 15, RoPE base 5000, matrix LR 0.11.

Key findings from 18 experimental phases:
- Parallel blocks (PaLM-style): biggest win, -0.052 BPB AND faster
- Untied embeddings: -0.012, biggest late-stage improvement
- Pure decoder (0 encoder layers): monotonically better with fewer encoder layers
- GQA 2 KV heads: faster (276 vs 265 steps in 600s) AND better quality
- Dead branches: all gradient tricks, WD schedule, 3x MLP (too slow), 10 layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
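For readers unfamiliar with the PaLM-style parallel block named above, here is a minimal sketch of sequential vs. parallel residual wiring; `attn`, `mlp`, and `norm` are stand-in callables, not the PR's actual modules:

```python
def sequential_block(x, attn, mlp, norm):
    # Classic pre-norm transformer block: the MLP sees the attention output.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

def parallel_block(x, attn, mlp, norm):
    # PaLM-style parallel block: attention and MLP both read the same
    # normalized input, and their outputs are summed into the residual.
    # One norm call and no serial dependency between the two sublayers,
    # which is why this variant is also faster per step.
    h = norm(x)
    return x + attn(h) + mlp(h)
```

With toy callables (`attn = lambda v: 2 * v`, `mlp = lambda v: 3 * v`, `norm = lambda v: v`), `parallel_block(1.0, ...)` returns `1 + 2 + 3 = 6.0`, while the sequential version compounds the sublayers.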
Thanks for tagging me, and for taking the #1380 feedback constructively — 131 documented runs with broken-run admissions and a public log are exactly the kind of work this competition needs more of. Code spot-check (verified at SHA
The methodology caveat I'd flag, same as #1380: Your proxy is 276 steps in 600s on RTX 4000 Ada. The competition is ~6800 steps in 600s on 8×H100 SXM. That's 25× more training, and the regime gap is the whole story for any finding that affects the shape of training:
The findings most likely to survive at scale: parallel blocks (already standard in many SOTA stacks here), untied embeddings (PR #1394 uses these), and tighter logit softcap (cheap to verify).

Methodology tightening for the next round:
On the 8×H100 verification ask: I'm holding my own compute spend until the maintainer situation clarifies (the README hasn't been updated since March 30, and the rule clarifications pending in #677 are still unanswered). Happy to revisit if a longer-horizon proxy on a single H100 confirms any of the architectural pieces — that's a much cheaper way to de-risk the regime gap before anyone burns 8-GPU credits.

The systematic methodology and transparency are real assets — keep that part. You're one of the few people in this thread doing honest negative results.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
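Since the softcap is flagged above as cheap to verify, a minimal sketch of logit softcapping (cap value 15 taken from this PR's best config; scalar `math.tanh` used as a stand-in for the tensor op):

```python
import math

def softcap(logit, cap=15.0):
    # Soft clamp: approximately the identity for |logit| much smaller
    # than cap, and asymptotically bounded by (-cap, cap) for large
    # magnitudes, which keeps extreme logits from dominating the loss.
    return cap * math.tanh(logit / cap)
```

For small inputs the function is nearly a no-op (`softcap(1.0)` is about 0.9999), while huge logits saturate near the cap.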
Thanks for pointing me at the full log, @ranausmanai — I owe you a partial correction after reading it end-to-end. My prior comment was right in spirit but too strong on a few specifics. What I should amend:
What I'd still flag:
Negative results worth publishing on their own: The log contains several publishable-quality negative findings that deserve more visibility than the PR body gives them:
Would you be open to pulling Phase 7 out into its own issue or PR?

Methodology template worth copying: Your env-var feature gating pattern (one-line config diff per experiment), the ~10-experiments-per-wave structure, the kill-underperforming-waves-early discipline, and the same-config rerun every ~10 experiments to estimate the noise floor — this is a reusable sweep template. It's better documented than most SOTA-claim PRs on this repo.

Correction to the raw count, while I'm here: The PR body claims 131 experiments. Counting the log tables, the actually-executed count is closer to ~110–115 — several "BROKEN" and "Not reached" rows in the tables are placeholders rather than completed runs. Not a big deal, but worth a footnote if you care about the number being exact.

Apologies for the undercount on parallel blocks, asymmetric-split dose-response, and GQA cross-validation in my earlier comment. The full log is substantially stronger evidence than the PR summary made it look. Regime-gap concerns still stand; methodology concerns are narrower than I originally said.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
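The env-var gating pattern praised in this thread can be sketched in a few lines; the variable names below (`PARALLEL_BLOCK`, `N_KV_HEADS`, `LOGIT_SOFTCAP`) are illustrative, not the PR's actual flags:

```python
import os

def flag(name, default):
    # One experiment = one env-var diff. Unset variables fall back to
    # the baseline config, so each run's full delta from baseline is
    # visible on a single command line.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return type(default)(raw)  # coerce the string to the default's type

config = {
    "parallel_block": flag("PARALLEL_BLOCK", 0),   # 0/1 gate
    "n_kv_heads": flag("N_KV_HEADS", 8),           # GQA: fewer KV heads
    "logit_softcap": flag("LOGIT_SOFTCAP", 0.0),   # 0.0 = disabled
}
```

A run then reads as `N_KV_HEADS=2 python train.py`, and the log entry for each experiment is just the one variable that changed.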
Summary
131 systematic experiments across 18 phases on a single RTX 4000 Ada GPU ($0.20/hr, about $5 total compute budget), achieving 1.5207 BPB from a cheap single-GPU proxy line.
Best Configuration (RTX 4000 Ada proxy run)
Important: `ITERATIONS=400` is the proxy schedule horizon used on the RTX 4000 Ada. The competition requirement is 10 minutes on 8xH100 SXM, not 400 fixed steps. Anyone validating this idea on 8xH100 should keep the 600s wallclock cap and retune the schedule horizon for that throughput regime.

Key Findings (18 phases, 131 experiments)
What works (ranked by impact):
- Pure decoder (`ENCODER_LAYERS=0`) — monotonically better with fewer encoder layers

Dead branches:
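The GQA finding (2 KV heads) boils down to a fixed query-head to KV-head assignment. A minimal sketch of that mapping, assuming 8 query heads (the PR states the KV count, not the query count):

```python
def kv_head_for(q_head, n_q_heads=8, n_kv_heads=2):
    # Grouped-query attention: query heads are split into n_kv_heads
    # contiguous groups, and every head in a group attends against the
    # same shared K/V projection. Fewer KV heads shrink the KV cache
    # and the K/V projection compute, which is where the speed win
    # comes from.
    assert n_q_heads % n_kv_heads == 0
    return q_head // (n_q_heads // n_kv_heads)
```

With 8 query heads and 2 KV heads, query heads 0–3 share KV head 0 and query heads 4–7 share KV head 1.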
Progression
How to Reproduce
This reproduces the single-GPU RTX 4000 Ada proxy run reported here. The training script is `train_gpt_focal_fixed.py`, with features controlled by environment variables.
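A hedged sketch of what a single invocation looks like under this scheme; only `ITERATIONS` and `ENCODER_LAYERS` appear by name in this PR, so treat the command as a hypothetical shape and check `run_best.sh` for the real invocation:

```shell
# Proxy run at the PR's schedule horizon (RTX 4000 Ada regime).
# Variable names beyond ITERATIONS / ENCODER_LAYERS are not confirmed here.
ITERATIONS=400 ENCODER_LAYERS=0 python train_gpt_focal_fixed.py
```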
Files
- `train_gpt_focal_fixed.py`
- `experiment_log.md`
- `run_best.sh`
- `wave5_arch.sh` to `wave12_aggressive.sh`

Previous PRs
Test plan
- `bash run_best.sh` on a comparable CUDA GPU to verify the RTX 4000 Ada proxy result