Non-record: 27 Systematic Experiments on M4 MacBook (Deep Supervision, LR Tuning, Batch Scaling, Architecture) #1073
16 experiments exploring an auxiliary loss at the U-Net encoder-decoder boundary. Key finding: deep supervision (weight 0.03) reduces BPB by 0.05 at small batch sizes, but the effect disappears at large batches. A novel technique not explored by other competitors.
Added 11 new experiments at 128K batch testing LR tuning, grad clipping, architecture changes (10/12 layers, MLP mult 3), warmup/warmdown, and logit softcap. Best result improved from 1.6668 to 1.6414 int8_bpb via LR 0.08 (-0.025 over baseline). Grad clipping also helps (-0.019).
Novel finding: setting num_encoder_layers=1 (vs num_layers//2) monotonically improves BPB. Validated on baseline (-0.016) and SOTA code (-0.004) on RTX 5090. 8xH100 run reached 1.1492 pre-quant BPB at step 5666/9000 before the pod crashed (FA2 speed bottleneck: 105 ms/step vs FA3's 83 ms/step).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply focal loss (Lin et al. 2017) to language model pretraining: replace standard cross-entropy with (1-pt)^gamma * CE to focus on hard-to-predict tokens. Combined with a cosine LR schedule and the asymmetric encoder-decoder split, this achieves 1.1567 int8 BPB at 5000 steps on a single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100 SOTA record. 55+ experiments across 13 rounds validate the finding. See PRs openai#1275 and openai#1073 for prior work on the asymmetric split and the M4 MacBook experiments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
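The focal reweighting described above can be sketched per token as follows; this is an illustration of the (1-pt)^gamma * CE formula from Lin et al. 2017, not the PR's actual implementation, and `focal_ce`/`p_t` are hypothetical names:

```python
import math

def focal_ce(p_t, gamma=2.0):
    """Focal-modulated cross-entropy for a single token.

    p_t is the model's probability of the correct next token; the
    (1 - p_t)^gamma factor down-weights easy (high-probability) tokens
    so the loss concentrates on hard-to-predict ones. gamma=0 recovers
    plain cross-entropy.
    """
    return (1.0 - p_t) ** gamma * (-math.log(p_t))
```

With gamma=2, a confidently predicted token (p_t = 0.9) contributes only 1% of its plain cross-entropy, while a hard token (p_t = 0.1) keeps 81% of it.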
Community Review — Compliance: NEEDS AUTHOR ACTION
What I found: the CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'mlx'. This matches a common pattern for this class of error seen in the 2026-04-11 sweep.
Recommendation: once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.
Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'mlx'.
27 Systematic Experiments on M4 MacBook
30+ hours of compute on an Apple M4 MacBook (16GB unified memory, MLX backend). Explored deep supervision (novel technique), learning rate tuning, batch size scaling, architecture changes, and convergence techniques.
Best M4 result: 1.6414 int8_bpb (LR 0.08, 128K batch, 300 steps)
This research led to the asymmetric encoder-decoder split finding in PR #1275, which reached 1.1492 pre-quant BPB on 8xH100.
Deep Supervision (Novel Technique)
Auxiliary loss at the U-Net encoder-decoder boundary. Zero extra parameters.
Acts as a regularizer -- the benefit scales inversely with batch size.
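A minimal sketch of how such an auxiliary loss could be combined with the main loss, assuming the tied LM head is reused to decode the boundary activations (which is what makes it parameter-free); `softmax_xent`, `final_logits`, and `boundary_logits` are illustrative names, not the repo's API:

```python
import math

def softmax_xent(logits, targets):
    """Mean cross-entropy; logits is a list of score rows, targets a list of class ids."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[t]  # -log softmax(row)[t]
    return total / len(targets)

def deep_supervision_loss(final_logits, boundary_logits, targets, aux_weight=0.03):
    # Main loss on the final decoder logits plus a down-weighted auxiliary
    # loss on logits decoded from the encoder-decoder boundary activations.
    # Reusing the tied LM head for the auxiliary projection adds no parameters.
    return (softmax_xent(final_logits, targets)
            + aux_weight * softmax_xent(boundary_logits, targets))
```

With aux_weight=0 this reduces exactly to the baseline cross-entropy, so the sweep below brackets the baseline by construction.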
LR Tuning (128K batch)
Default LR 0.04 is too conservative for short training runs.
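The warmup/warmdown runs mentioned earlier can be described by a trapezoidal schedule around the best-found peak of 0.08; `lr_at` is a hypothetical helper and the step counts are illustrative for the 300-step runs used here:

```python
def lr_at(step, total_steps=300, peak_lr=0.08, warmup_steps=10, warmdown_frac=0.2):
    """Trapezoidal LR schedule: linear warmup, flat plateau, linear warmdown to zero."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # ramp up
    if step >= warmdown_start:
        return peak_lr * (total_steps - step) / (total_steps - warmdown_start)  # ramp down
    return peak_lr                                        # plateau
```

On a 300-step run, most of the budget is spent at the full 0.08, which is why an overly conservative default (0.04) leaves easy BPB on the table.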
Batch Size Scaling (no plateau through 128K)
Convergence Techniques (64K batch)
EMA, SWA, Partial RoPE, longer sequences -- all hurt at 300 steps. These techniques need 9000+ steps to pay off, consistent with their presence in top leaderboard submissions.
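For reference, the EMA variant tested here maintains a decayed copy of the weights and evaluates with that copy; a minimal sketch with scalar parameters for brevity (the real run would track per-tensor arrays, and the class name is hypothetical):

```python
class EMA:
    """Exponential moving average of model parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # shadow starts as a copy of the initial parameters
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # standard EMA update: shadow <- d * shadow + (1 - d) * live
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```

With decay 0.999 over only 300 steps, 0.999^300 ≈ 0.74 of the initial shadow value still remains, so the averaged weights lag far behind the live weights -- consistent with EMA only helping on 9000+ step runs.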
Deep Supervision Weight Sweep (8K batch)
Clear inverted-U pattern; the optimal weight is 0.03.
Hardware & Setup
Reproduce