Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change)#1275
ranausmanai wants to merge 1 commit into openai:main from
Conversation
…PB on 8xH100

One-line change (num_encoder_layers=1) monotonically improves BPB across baseline (-0.016) and SOTA (-0.004) code. 8xH100 run reached 1.1492 pre-quant at step 5666/9000, handicapped by FA2 speed (105 ms vs FA3's 83 ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Looked through this. The systematic sweep of the encoder/decoder ratio is well-structured — the monotonic trend from 5/6 → 3/8 → 2/9 → 1/10 across your table is clean data, and validating on three different hardware platforms (M4, RTX 5090, 8×H100) adds confidence. The natural next question is whether 0/11 (removing the U-Net skip structure entirely) continues the trend or hits a cliff. The one-line-change framing makes it easy for anyone with H100 access to finish the 9000-step run. At 5666 steps and 1.1492 pre-quant, it'd be useful to see where this lands with a full run and quantization. One practical note: the FA3 vs FA2 gap (~26% step-time difference you documented) means whoever picks this up needs tighter wallclock management to stay under 600s. — @MatoTeziTanka | Agora Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Hang on, I'll give it a shot...
Follow-up: tested the 1/10 split on our PROTEUS v1.6 stack (Scylla + Parallel Residuals + Depth Recurrence + TTT). We ran your asymmetric-split hypothesis on a stronger base to see if the monotonic improvement transfers. Single seed (42), 8×H100 SXM, full 9000 steps.

The improvement you found on baseline code (~-0.016 at the 1.54 BPB range) reverses at our operating point (~1.08 BPB range). Possible explanation: our parallel residuals start at layer 7 and need more than 1 encoder layer to build the representations that the skip connections feed into the decoder. With only 1 encoder layer, the skip connection carries a barely-processed embedding — not enough signal for the decoder to leverage.

The artifact also exceeded 16 MB (16.2 vs 15.0 MB) despite fewer skip-weight parameters — likely because the compression ratio changes when the encoder/decoder weight distributions shift.

This doesn't invalidate your finding on the baseline — the monotonic trend in your sweep is clean. It means the optimal split ratio may depend on the architecture stack, not just layer count. Worth testing intermediate ratios (2/9, 3/8) on a parallel-residuals base if you get compute access. Full log available if useful.

— @MatoTeziTanka | Agora Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Thanks for testing this — it's really valuable data. It makes sense that parallel residuals starting at layer 7 need encoder depth for the skip connections, and it confirms that the optimal split is architecture-dependent, not universal. I'm planning to test intermediate ratios like 2/9 and 3/8 on both the baseline and parallel-residual stacks. I'd be curious to see whether 2/9 or 3/8 recovers the improvement on your stack.
…00 Ada

Apply focal loss (Lin et al. 2017) to language model pretraining: replace standard cross-entropy with (1-pt)^gamma * CE to focus on hard-to-predict tokens. Combined with a cosine LR schedule and the asymmetric encoder-decoder split, achieves 1.1567 int8 BPB at 5000 steps on a single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100 SOTA record. 55+ experiments across 13 rounds validate the finding. See PRs openai#1275 and openai#1073 for prior work on the asymmetric split and M4 MacBook experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
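The focal modulation named in that commit message can be sketched in a few lines. This is a minimal per-token illustration, not the PR's actual training code; the function name and toy probabilities are mine:

```python
import math

def focal_loss(pt: float, gamma: float = 2.0) -> float:
    """Focal loss (Lin et al. 2017) for one token, where `pt` is the
    model's predicted probability of the correct class."""
    ce = -math.log(pt)                # standard cross-entropy term
    return (1.0 - pt) ** gamma * ce   # (1-pt)^gamma down-weights easy tokens

# An easy token (pt=0.9) contributes far less than a hard one (pt=0.1),
# so gradient signal concentrates on hard-to-predict tokens.
easy, hard = focal_loss(0.9), focal_loss(0.1)
print(f"easy={easy:.4f} hard={hard:.4f}")
```

With gamma=0 the modulation factor is 1 and the loss reduces to plain cross-entropy, which is why gamma acts as a knob between standard pretraining and hard-token focus.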
Asymmetric 1/10 Encoder-Decoder Split: 1.1492 Pre-Quant BPB on 8xH100
One-line change to the hourglass architecture:
`self.num_encoder_layers = 1` instead of `num_layers // 2`. Every existing submission uses the default 50/50 encoder-decoder split. We found that shifting nearly all layers to the decoder gives a monotonic improvement across every configuration tested.
Results
Baseline code sweep (RTX 5090, 11 layers, 300 steps):
SOTA code (PR #549 stack, RTX 5090): 1/10 split gives -0.004 BPB vs default.
8xH100 SXM full run (SOTA + asymmetric split):
Why This Matters
The symmetric encoder-decoder split is a convention from image U-Nets. In language modeling, the decoder's autoregressive generation is the harder task -- it benefits more from capacity. No one tested this because everyone copied
`num_layers // 2` from the baseline.
The Change
Zero extra parameters. Zero extra compute. One line.
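A minimal sketch of where the change lives, under the assumption of a config object that derives the decoder depth from the encoder depth; the class and attribute names here are illustrative, not necessarily the repo's actual ones:

```python
class HourglassConfig:
    """Toy stand-in for the hourglass model config (names illustrative)."""
    def __init__(self, num_layers: int = 11):
        self.num_layers = num_layers
        # Default inherited from image U-Nets: symmetric 50/50 split.
        #   self.num_encoder_layers = num_layers // 2
        # The one-line change: shift nearly all capacity to the decoder.
        self.num_encoder_layers = 1
        self.num_decoder_layers = num_layers - self.num_encoder_layers

cfg = HourglassConfig()
print(cfg.num_encoder_layers, cfg.num_decoder_layers)  # 1/10 split
```

Because only the split point moves, the total layer count, parameter count, and per-step compute are unchanged, which is what makes the comparison against the default clean.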
Background
See PR #1073 for 27 systematic experiments on M4 MacBook (deep supervision, LR tuning, batch scaling, architecture). The asymmetric split was discovered during that exploration and validated on 3 different hardware platforms (M4, RTX 5090, 8xH100).
GPU Credits Request
Ran out of credits ($16 spent) and lost H100 availability mid-run. With FA3 built from source and a clean 9000-step run, this one-line change would likely produce a top-3 record. Requesting credits to complete the run.