
Non-record: 27 Systematic Experiments on M4 MacBook (Deep Supervision, LR Tuning, Batch Scaling, Architecture)#1073

Open
ranausmanai wants to merge 4 commits into openai:main from ranausmanai:deep-supervision-m4

Conversation


@ranausmanai ranausmanai commented Mar 29, 2026

27 Systematic Experiments on M4 MacBook

30+ hours of compute on an Apple M4 MacBook (16GB unified memory, MLX backend). Explored deep supervision (novel technique), learning rate tuning, batch size scaling, architecture changes, and convergence techniques.

Best M4 result: 1.6414 int8_bpb (LR 0.08, 128K batch, 300 steps)

This research led to the asymmetric encoder-decoder split finding in PR #1275, which reached 1.1492 pre-quant BPB on 8xH100.


Deep Supervision (Novel Technique)

Auxiliary loss at the U-Net encoder-decoder boundary. Zero extra parameters.

| Batch Size | Baseline | +DeepSup(0.03) | Effect |
|------------|----------|----------------|--------|
| 8K | 2.168 | 2.118 | -0.050 (helps) |
| 16K | 2.037 | 2.037 | 0.000 (neutral) |
| 64K | 1.767 | 1.774 | +0.006 (neutral) |

Acts as a regularizer -- the benefit scales inversely with batch size.
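The mechanics of the zero-extra-parameter auxiliary loss can be sketched as below. This is an illustration, not the submission's code: `cross_entropy` and `deep_supervision_loss` are hypothetical helper names, and the key assumption (consistent with "zero extra parameters") is that the auxiliary head reuses the model's existing output projection `w_out`.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Mean token-level cross-entropy over the vocab axis (numerically
    # stabilized by subtracting the per-row max before the log-sum-exp).
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def deep_supervision_loss(final_h, boundary_h, w_out, targets, weight=0.03):
    # Main loss from the final hidden states, plus a weighted auxiliary
    # loss computed from the hidden states at the U-Net encoder-decoder
    # boundary. Reusing w_out for both heads adds no parameters;
    # weight=0.03 was the optimum in the sweep below.
    main = cross_entropy(final_h @ w_out, targets)
    aux = cross_entropy(boundary_h @ w_out, targets)
    return main + weight * aux
```

With `weight=0.0` this reduces exactly to the baseline loss, which is what makes the ablation in the table a clean comparison.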

LR Tuning (128K batch)

| Config | int8_bpb | vs Baseline (1.667) |
|--------|----------|---------------------|
| LR 0.08 | 1.6414 | -0.025 |
| LR 0.06 | 1.6431 | -0.024 |
| Grad clip 1.0 | 1.6473 | -0.019 |
| MLP mult 3 | 1.6596 | -0.007 |
| 10 layers | 1.6613 | -0.006 |

Default LR 0.04 is too conservative for short training runs.

Batch Size Scaling (no plateau through 128K)

| Batch Size | int8_bpb | Marginal Gain |
|------------|----------|---------------|
| 8K | 2.168 | -- |
| 16K | 2.037 | -0.131 |
| 32K | 1.943 | -0.094 |
| 64K | 1.767 | -0.176 |
| 128K | 1.667 | -0.100 |

Convergence Techniques (64K batch)

EMA, SWA, partial RoPE, and longer sequences all hurt at 300 steps. These techniques need 9000+ steps to pay off, consistent with their appearance in top leaderboard submissions.

Deep Supervision Weight Sweep (8K batch)

| Weight | int8_bpb | vs Baseline |
|--------|----------|-------------|
| 0.00 (baseline) | 2.1677 | -- |
| 0.02 | 2.1274 | -0.040 |
| 0.03 | 2.1178 | -0.050 |
| 0.04 | 2.1295 | -0.038 |
| 0.05 | 2.1302 | -0.038 |

Clear inverted-U pattern. Optimal weight is 0.03.


Hardware & Setup

  • Hardware: Apple M4 MacBook, 16GB unified memory
  • Backend: MLX, bfloat16 compute, ~9K tok/s peak throughput
  • Data: 10 training shards (~1B tokens)
  • Training: 300 steps per experiment

Reproduce

```shell
pip install sentencepiece mlx
python data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10

MATRIX_LR=0.08 ITERATIONS=300 TRAIN_BATCH_TOKENS=131072 GRAD_ACCUM_STEPS=16 \
  VAL_BATCH_SIZE=524288 python train_gpt_mlx.py
```

ranausmanai and others added 4 commits March 29, 2026 16:50
16 experiments exploring auxiliary loss at the U-Net encoder-decoder boundary. Key finding: deep supervision (weight=0.03) improves BPB by -0.05 at small batch sizes, but the effect disappears at large batches. Novel technique not explored by other competitors.

Added 11 new experiments at 128K batch testing LR tuning, grad clipping, architecture changes (10/12 layers, MLP mult 3), warmup/warmdown, and logit softcap. Best result improved from 1.6668 to 1.6414 int8_bpb via LR 0.08 (-0.025 over baseline). Grad clipping also helps (-0.019).

Novel finding: setting num_encoder_layers=1 (vs num_layers//2) monotonically improves BPB. Validated on baseline (-0.016) and SOTA code (-0.004) on RTX 5090. 8xH100 run reached 1.1492 pre-quant BPB at step 5666/9000 before the pod crashed (FA2 speed bottleneck: 105ms/step vs FA3's 83ms/step).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
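The asymmetric-split finding above amounts to a one-line config change. A minimal sketch, assuming the stock config derives the encoder depth as `num_layers // 2`; `split_layers` is a hypothetical helper name, not a function from the codebase:

```python
def split_layers(num_layers, asymmetric=True):
    # Stock U-Net config: half the transformer blocks go to the encoder.
    # Asymmetric finding: a single encoder block, with the remaining
    # blocks in the decoder, monotonically improves BPB.
    num_encoder = 1 if asymmetric else num_layers // 2
    return num_encoder, num_layers - num_encoder
```

For a 10-layer model this yields a 1/9 encoder-decoder split instead of the default 5/5.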
@ranausmanai ranausmanai changed the title Non-record: Deep Supervision on M4 MacBook Non-record: Asymmetric 1/10 Encoder-Decoder Split + 30 Experiments (M4, RTX 5090, 8xH100) Apr 3, 2026
@ranausmanai ranausmanai changed the title Non-record: Asymmetric 1/10 Encoder-Decoder Split + 30 Experiments (M4, RTX 5090, 8xH100) Non-record: 27 Systematic Experiments on M4 MacBook (Deep Supervision, LR Tuning, Batch Scaling, Architecture) Apr 3, 2026
ranausmanai added a commit to ranausmanai/parameter-golf that referenced this pull request Apr 5, 2026
…00 Ada

Apply focal loss (Lin et al. 2017) to language model pretraining:
replace standard cross-entropy with (1-pt)^gamma * CE to focus on
hard-to-predict tokens. Combined with cosine LR schedule and asymmetric
encoder-decoder split, achieves 1.1567 int8 BPB at 5000 steps on a
single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100
SOTA record. 55+ experiments across 13 rounds validate the finding.

See PRs openai#1275 and openai#1073 for prior work on asymmetric split and M4
MacBook experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
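The focal-loss substitution in that commit can be sketched as a per-token function of the probability `p_t` the model assigns to the correct token. `focal_loss` is a hypothetical name, and `gamma=2.0` (the default from Lin et al. 2017) is an assumption, since the commit does not state the value used:

```python
import math

def focal_loss(p_t, gamma=2.0):
    # Focal loss: (1 - p_t)**gamma * cross-entropy. gamma=0 recovers
    # plain cross-entropy; gamma > 0 down-weights easy (high-p_t)
    # tokens so training focuses on hard-to-predict ones.
    return (1.0 - p_t) ** gamma * -math.log(p_t)
```

A well-predicted token (`p_t = 0.9`) contributes almost nothing relative to its plain cross-entropy, while a hard token (`p_t = 0.1`) keeps most of its loss.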
@MatoTeziTanka

MatoTeziTanka commented Apr 11, 2026

Community Review — Non-record: 27 Systematic Experiments on M4 MacBook (Deep Supervision, LR Tuning, Batch Scaling, Architecture)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

```
ModuleNotFoundError: No module named 'mlx'
```

This matches a common pattern I've seen for this class of error in the 2026-04-11 sweep.

Recommendation: Could you run `python3 -c "import py_compile; py_compile.compile('train_gpt.py')"` on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka, The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL -- ModuleNotFoundError: No module named 'mlx'. Classification via classify_prs.py AST-based classifier; the full compliance audit is deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.
