Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change)#1275
ranausmanai wants to merge 1 commit into openai:main from
Conversation
…PB on 8xH100

One-line change (num_encoder_layers=1) monotonically improves BPB across baseline (-0.016) and SOTA (-0.004) code. 8xH100 run reached 1.1492 pre-quant at step 5666/9000, handicapped by FA2 speed (105 ms vs FA3's 83 ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Looked through this. The systematic sweep of the encoder/decoder ratio is well-structured — the monotonic trend from 5/6 → 3/8 → 2/9 → 1/10 across your table is clean data, and validating on three different hardware platforms (M4, RTX 5090, 8×H100) adds confidence. The natural next question is whether 0/11 (removing the U-Net skip structure entirely) continues the trend or hits a cliff. The one-line-change framing makes it easy for anyone with H100 access to finish the 9000-step run. At 5666 steps and 1.1492 pre-quant, it'd be useful to see where this lands with a full run and quantization. One practical note: the FA3 vs FA2 gap (~26% step-time difference you documented) means whoever picks this up needs tighter wallclock management to stay under 600s. — @MatoTeziTanka | Agora Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Hang on, I'll give it a shot...
Follow-up: tested the 1/10 split on our PROTEUS v1.6 stack (Scylla + Parallel Residuals + Depth Recurrence + TTT). We ran your asymmetric-split hypothesis on a stronger base to see if the monotonic improvement transfers. Single seed (42), 8×H100 SXM, full 9000 steps.

The improvement you found on baseline code (~-0.016 at the 1.54 BPB range) reverses at our operating point (~1.08 BPB range). Possible explanation: our parallel residuals start at layer 7 and need more than 1 encoder layer to build the representations that the skip connections feed into the decoder. With only 1 encoder layer, the skip connection carries a barely-processed embedding — not enough signal for the decoder to leverage.

The artifact also exceeded 16 MB (16.2 vs 15.0 MB) despite fewer skip-weight parameters — likely because the compression ratio changes when the encoder/decoder weight distributions shift.

This doesn't invalidate your finding on the baseline — the monotonic trend in your sweep is clean. It means the optimal split ratio may depend on the architecture stack, not just layer count. Worth testing intermediate ratios (2/9, 3/8) on a parallel-residuals base if you get compute access. Full log available if useful.

— @MatoTeziTanka | Agora Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Thanks for testing this — it's really valuable data. It makes sense that parallel residuals starting at layer 7 need encoder depth for the skip connections, and it confirms that the optimal split is architecture-dependent, not universal. I'm planning to test intermediate ratios like 2/9 and 3/8 on both the baseline and parallel-residual stacks. I'd be curious to see whether 2/9 or 3/8 recovers the improvement on your stack.
…00 Ada

Apply focal loss (Lin et al. 2017) to language model pretraining: replace standard cross-entropy with (1-pt)^gamma * CE to focus on hard-to-predict tokens. Combined with a cosine LR schedule and the asymmetric encoder-decoder split, achieves 1.1567 int8 BPB at 5000 steps on a single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100 SOTA record. 55+ experiments across 13 rounds validate the finding. See PRs openai#1275 and openai#1073 for prior work on the asymmetric split and M4 MacBook experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
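The focal modulation named in that commit message can be sketched in a few lines. This is a minimal per-token illustration, not the PR's actual training code; the function name and toy probabilities are mine:

```python
import math

def focal_loss(pt: float, gamma: float = 2.0) -> float:
    """Focal loss (Lin et al. 2017) for one token, where `pt` is the
    model's predicted probability of the correct class."""
    ce = -math.log(pt)                # standard cross-entropy term
    return (1.0 - pt) ** gamma * ce   # (1-pt)^gamma down-weights easy tokens

# An easy token (pt=0.9) contributes far less than a hard one (pt=0.1),
# so gradient signal concentrates on hard-to-predict tokens.
easy, hard = focal_loss(0.9), focal_loss(0.1)
print(f"easy={easy:.4f} hard={hard:.4f}")
```

With gamma=0 the modulation factor is 1 and the loss reduces to plain cross-entropy, which is why gamma acts as a knob between standard pretraining and hard-token focus.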
Asymmetric 1/10 Encoder-Decoder Split: 1.1492 Pre-Quant BPB on 8xH100
One-line change to the hourglass architecture:
`self.num_encoder_layers = 1` instead of `num_layers // 2`. Every existing submission uses the default 50/50 encoder-decoder split. We found that shifting nearly all layers to the decoder gives a monotonic improvement across every configuration tested.
Results
Baseline code sweep (RTX 5090, 11 layers, 300 steps):
SOTA code (PR #549 stack, RTX 5090): 1/10 split gives -0.004 BPB vs default.
8xH100 SXM full run (SOTA + asymmetric split):
Why This Matters
The symmetric encoder-decoder split is a convention from image U-Nets. In language modeling, the decoder's autoregressive generation is the harder task -- it benefits more from capacity. No one tested this because everyone copied
`num_layers // 2` from the baseline.
The Change
Zero extra parameters. Zero extra compute. One line.
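A minimal sketch of where the change lives, under the assumption of a config object that derives the decoder depth from the encoder depth; the class and attribute names here are illustrative, not necessarily the repo's actual ones:

```python
class HourglassConfig:
    """Toy stand-in for the hourglass model config (names illustrative)."""
    def __init__(self, num_layers: int = 11):
        self.num_layers = num_layers
        # Default inherited from image U-Nets: symmetric 50/50 split.
        #   self.num_encoder_layers = num_layers // 2
        # The one-line change: shift nearly all capacity to the decoder.
        self.num_encoder_layers = 1
        self.num_decoder_layers = num_layers - self.num_encoder_layers

cfg = HourglassConfig()
print(cfg.num_encoder_layers, cfg.num_decoder_layers)  # 1/10 split
```

Because only the split point moves, the total layer count, parameter count, and per-step compute are unchanged, which is what makes the comparison against the default clean.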
Background
See PR #1073 for 27 systematic experiments on M4 MacBook (deep supervision, LR tuning, batch scaling, architecture). The asymmetric split was discovered during that exploration and validated on 3 different hardware platforms (M4, RTX 5090, 8xH100).
GPU Credits Request
Ran out of credits ($16 spent) and lost H100 availability mid-run. With FA3 built from source and a clean 9000-step run, this one-line change would likely produce a top-3 record. Requesting credits to complete the run.