
MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G #1241

Open
aiejvn wants to merge 2 commits into openai:main from aiejvn:submission-diffusion-shard-rotation+eos-learning

Conversation

aiejvn commented on Apr 2, 2026

Builds on PR #1106 (MDLM stack). Two additions:

EOS learning: Token 1 (<s>) is used as a document boundary anchor — never masked during diffusion. A dedicated PAD_ID=1025 (separate from MASK_ID=1024) fills post-EOS positions and is excluded from the loss, preventing collision between structural padding and diffusion masking.
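
A minimal sketch of how that separation could look in PyTorch. The token IDs (`MASK_ID=1024`, `PAD_ID=1025`, EOS = token 1) come from the PR text, but the function name, the per-sample mask-rate schedule, and the exact masking logic are illustrative assumptions, not the PR's code:

```python
import torch

MASK_ID = 1024   # diffusion mask token (from the PR)
PAD_ID = 1025    # structural padding, excluded from the loss (from the PR)
EOS_ID = 1       # <s> document boundary anchor, never masked (from the PR)

def mask_for_diffusion(tokens: torch.Tensor, t: torch.Tensor):
    """Forward diffusion: mask each token with probability t,
    skipping EOS anchors and PAD positions.

    tokens: (B, L) int64 token ids
    t:      (B,) per-sample mask rates in (0, 1]
    Returns (noisy_tokens, loss_mask).
    """
    # EOS stays visible as a boundary anchor; PAD never collides with MASK.
    maskable = (tokens != EOS_ID) & (tokens != PAD_ID)
    masked = (torch.rand_like(tokens, dtype=torch.float) < t[:, None]) & maskable
    noisy = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    # Loss is computed only on masked positions, so post-EOS padding
    # contributes nothing to the objective.
    return noisy, masked
```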

Shard rotation: ShardedDataLoader loads N shards at a time and rotates between groups across training, enabling full FineWeb 10B training without loading the entire dataset into RAM. Explicit memory freeing between groups; shards loaded one-at-a-time into a pre-allocated buffer to avoid 2× peak allocation.
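
A minimal sketch of the rotation pattern. The class name `ShardedDataLoader` is from the PR; the file format (raw uint16 token shards of fixed size), constructor arguments, and `load_next_group` are illustrative assumptions:

```python
import numpy as np

class ShardedDataLoader:
    """Group-wise shard rotation sketch (not the PR's exact code).

    Loads `shards_per_group` shards at a time into a single pre-allocated
    buffer, so peak memory is one group rather than 2x (no old+new copies).
    """

    def __init__(self, shard_paths, shards_per_group, tokens_per_shard):
        self.shard_paths = shard_paths
        self.shards_per_group = shards_per_group
        self.tokens_per_shard = tokens_per_shard
        # One buffer, reused for every group: avoids 2x peak allocation.
        self.buffer = np.empty(shards_per_group * tokens_per_shard, dtype=np.uint16)
        self.group_idx = -1

    def load_next_group(self):
        """Rotate to the next shard group, overwriting the buffer in place."""
        n_groups = len(self.shard_paths) // self.shards_per_group
        self.group_idx = (self.group_idx + 1) % n_groups
        start = self.group_idx * self.shards_per_group
        for i, path in enumerate(self.shard_paths[start:start + self.shards_per_group]):
            # Load one shard at a time into its slice of the shared buffer,
            # then drop the temporary so at most one extra shard is live.
            shard = np.fromfile(path, dtype=np.uint16, count=self.tokens_per_shard)
            self.buffer[i * self.tokens_per_shard:(i + 1) * self.tokens_per_shard] = shard
            del shard
```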

Ablation finding: Val BPB is flat across attention head counts {2, 4, 8, 16, 32} at fixed model dim — head count appears invariant for bidirectional diffusion LMs.

Non-record reason: Trained on 1× AWS A10G (1267 min). Requires 8×H100 SXM for wall-clock compliance.

| Model | Val BPB |
| --- | --- |
| This (MDLM v5) | 0.9901 |
| PR #1106 (prior best diffusion) | 1.1465 |
| AR baseline | 1.2244 |

aiejvn changed the title from "Non-record: MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G" to "MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G" on Apr 2, 2026
HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 2, 2026
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
