Non-record: 27 Systematic Experiments on M4 MacBook (Deep Supervision, LR Tuning, Batch Scaling, Architecture) #1073
16 experiments exploring an auxiliary loss at the U-Net encoder-decoder boundary. Key finding: deep supervision (weight 0.03) reduces BPB by 0.05 at small batch sizes, but the effect disappears at large batches. A novel technique not explored by other competitors.
Added 11 new experiments at 128K batch testing LR tuning, grad clipping, architecture changes (10/12 layers, MLP mult 3), warmup/warmdown, and logit softcap. Best result improved from 1.6668 to 1.6414 int8_bpb via LR 0.08 (-0.025 over baseline). Grad clipping also helps (-0.019).
Novel finding: setting num_encoder_layers=1 (vs num_layers//2) monotonically improves BPB. Validated on baseline (-0.016) and SOTA code (-0.004) on RTX 5090. 8xH100 run reached 1.1492 pre-quant BPB at step 5666/9000 before the pod crashed (FA2 speed bottleneck: 105 ms/step vs FA3's 83 ms/step).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply focal loss (Lin et al. 2017) to language model pretraining: replace standard cross-entropy with (1-pt)^gamma * CE to focus on hard-to-predict tokens. Combined with a cosine LR schedule and the asymmetric encoder-decoder split, this achieves 1.1567 int8 BPB at 5000 steps on a single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100 SOTA record. 55+ experiments across 13 rounds validate the finding. See PRs openai#1275 and openai#1073 for prior work on the asymmetric split and the M4 MacBook experiments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
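The focal reweighting described above can be sketched per token as follows; this is an illustration of the (1-pt)^gamma * CE formula from Lin et al. 2017, not the PR's actual implementation, and `focal_ce`/`p_t` are hypothetical names:

```python
import math

def focal_ce(p_t, gamma=2.0):
    """Focal-modulated cross-entropy for a single token.

    p_t is the model's probability of the correct next token; the
    (1 - p_t)^gamma factor down-weights easy (high-probability) tokens
    so the loss concentrates on hard-to-predict ones. gamma=0 recovers
    plain cross-entropy.
    """
    return (1.0 - p_t) ** gamma * (-math.log(p_t))
```

With gamma=2, a confidently predicted token (p_t = 0.9) contributes only 1% of its plain cross-entropy, while a hard token (p_t = 0.1) keeps 81% of it.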
Community Review — Compliance: NEEDS AUTHOR ACTION
What I found: the CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'mlx'. This matches a common pattern for this class of error seen in the 2026-04-11 sweep.
Recommendation: once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.
Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'mlx'.
27 Systematic Experiments on M4 MacBook
30+ hours of compute on an Apple M4 MacBook (16GB unified memory, MLX backend). Explored deep supervision (novel technique), learning rate tuning, batch size scaling, architecture changes, and convergence techniques.
Best M4 result: 1.6414 int8_bpb (LR 0.08, 128K batch, 300 steps)
This research led to the asymmetric encoder-decoder split finding in PR #1275, which reached 1.1492 pre-quant BPB on 8xH100.
Deep Supervision (Novel Technique)
Auxiliary loss at the U-Net encoder-decoder boundary. Zero extra parameters.
Acts as a regularizer -- the benefit scales inversely with batch size.
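A minimal sketch of how such an auxiliary loss could be combined with the main loss, assuming the tied LM head is reused to decode the boundary activations (which is what makes it parameter-free); `softmax_xent`, `final_logits`, and `boundary_logits` are illustrative names, not the repo's API:

```python
import math

def softmax_xent(logits, targets):
    """Mean cross-entropy; logits is a list of score rows, targets a list of class ids."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[t]  # -log softmax(row)[t]
    return total / len(targets)

def deep_supervision_loss(final_logits, boundary_logits, targets, aux_weight=0.03):
    # Main loss on the final decoder logits plus a down-weighted auxiliary
    # loss on logits decoded from the encoder-decoder boundary activations.
    # Reusing the tied LM head for the auxiliary projection adds no parameters.
    return (softmax_xent(final_logits, targets)
            + aux_weight * softmax_xent(boundary_logits, targets))
```

With aux_weight=0 this reduces exactly to the baseline cross-entropy, so the sweep below brackets the baseline by construction.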
LR Tuning (128K batch)
Default LR 0.04 is too conservative for short training runs.
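The warmup/warmdown runs mentioned earlier can be described by a trapezoidal schedule around the best-found peak of 0.08; `lr_at` is a hypothetical helper and the step counts are illustrative for the 300-step runs used here:

```python
def lr_at(step, total_steps=300, peak_lr=0.08, warmup_steps=10, warmdown_frac=0.2):
    """Trapezoidal LR schedule: linear warmup, flat plateau, linear warmdown to zero."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # ramp up
    if step >= warmdown_start:
        return peak_lr * (total_steps - step) / (total_steps - warmdown_start)  # ramp down
    return peak_lr                                        # plateau
```

On a 300-step run, most of the budget is spent at the full 0.08, which is why an overly conservative default (0.04) leaves easy BPB on the table.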
Batch Size Scaling (no plateau through 128K)
Convergence Techniques (64K batch)
EMA, SWA, Partial RoPE, longer sequences -- all hurt at 300 steps. These techniques need 9000+ steps to pay off, consistent with their presence in top leaderboard submissions.
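For reference, the EMA variant tested here maintains a decayed copy of the weights and evaluates with that copy; a minimal sketch with scalar parameters for brevity (the real run would track per-tensor arrays, and the class name is hypothetical):

```python
class EMA:
    """Exponential moving average of model parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # shadow starts as a copy of the initial parameters
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # standard EMA update: shadow <- d * shadow + (1 - d) * live
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```

With decay 0.999 over only 300 steps, 0.999^300 ≈ 0.74 of the initial shadow value still remains, so the averaged weights lag far behind the live weights -- consistent with EMA only helping on 9000+ step runs.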
Deep Supervision Weight Sweep (8K batch)
Clear inverted-U pattern; the optimal weight is 0.03.
Hardware & Setup
Reproduce