Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected)#1380
ranausmanai wants to merge 5 commits into openai:main from
Conversation
…00 Ada Apply focal loss (Lin et al. 2017) to language model pretraining: replace standard cross-entropy with (1-pt)^gamma * CE to focus on hard-to-predict tokens. Combined with cosine LR schedule and asymmetric encoder-decoder split, achieves 1.1567 int8 BPB at 5000 steps on a single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100 SOTA record. 55+ experiments across 13 rounds validate the finding. See PRs openai#1275 and openai#1073 for prior work on asymmetric split and M4 MacBook experiments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cript Expanded README with complete experiment journey across M4 MacBook, RTX 5090, 8xH100, and RTX 4000 Ada. Added ready-to-run reproduction instructions for both single GPU and 8xH100 record runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…OTA code Tested focal loss + cosine LR + asymmetric split on the actual SOTA train_gpt.py (LeakyReLU + XSA + Parallel Muon + EMA). Result: 1.5035 pre-quant val_bpb vs 1.7927 baseline (-0.289). Confirms focal loss transfers to the fully-optimized stack. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Focal loss was applied during eval (not just training), inflating all reported BPB numbers. Fixed with `self.training` check. Corrected multi-seed results show focal loss does not help. Cosine LR (-0.070) and asymmetric split remain valid findings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, gamma annealing, logit penalty, multi-cycle cosine

All modifications are training-only (guarded by `self.training`). Controlled via env vars:

- Z_LOSS: log(Z)^2 regularizer from the PaLM paper
- TOKEN_DROP: synaptic pruning-inspired token dropout
- EMBED_MIXUP: genetic recombination-inspired embedding interpolation
- GAMMA_ANNEAL: decay focal gamma to 0 over training
- LOGIT_PENALTY: L2 penalty on logits for sparse activation
- COSINE_CYCLES: multi-cycle cosine LR schedule

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
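A minimal sketch of the env-var gating pattern this commit describes, using the Z_LOSS regularizer as the example. The coefficient `lam`, the function names, and the `"1"` on/off convention are assumptions for illustration; only the `Z_LOSS` variable name, the log(Z)^2 form, and the training-only guard come from the commit message.

```python
import math
import os

def z_loss(logits, lam=1e-4):
    """PaLM-style log(Z)^2 regularizer for one token's logit vector.

    Z is the softmax normalizer; penalizing log(Z)^2 nudges it toward 1.
    The coefficient lam is an illustrative default, not the PR's value.
    """
    z = sum(math.exp(l) for l in logits)
    return lam * math.log(z) ** 2

def extra_losses(logits, training):
    """Training-only extras, gated the way the commit describes:
    guarded by the module's training flag AND an env var switch."""
    total = 0.0
    if training and os.environ.get("Z_LOSS", "0") == "1":
        total += z_loss(logits)
    return total
```

At eval time (`training=False`) every extra term is skipped, which is exactly the discipline the focal-loss correction below this comment enforces.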
Thanks for the ping, @ranausmanai. Walked through the diff at SHA.

What's solid:

The part I'd gently push on: The -0.070 BPB improvement (1.7233 → 1.6538 at 3000 steps) is real in that regime, but I suspect it's an undertraining artifact rather than a transferable win. The baseline's linear warmdown schedule is wallclock-targeted for 8×H100 / 600s, so on an RTX 4000 Ada at a fixed 3000-step budget, the baseline spends most of its time still warming down when cosine has effectively converged. Cosine spends more area under max-LR early, which wins in the undertrained regime. Near convergence (~1.08 BPB territory where the current SOTA lives), schedule shape typically matters much less — Chinchilla-era ablations suggest <0.005 BPB delta between cosine and linear-warmdown on well-tuned runs.

My suggestion: before framing cosine as a general improvement, try it at longer horizons on a stronger baseline stack (e.g., on top of PR #1019 or PR #1394's code), then compare to the linear-warmdown equivalent at matched step count. If the delta holds at ~1.1 BPB or below, it's a real SOTA-class win. If it shrinks to <0.005 at that level, it's regime-specific and not worth absorbing.

Also — I notice the "asymmetric 1/10 encoder-decoder split" mentioned in the PR body doesn't appear in this commit's diff. Is it elsewhere in the branch, or still pending?

For what it's worth, the way you documented the focal loss retraction will earn more trust in the long run than the original claim would have. Appreciate the honesty.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Summary
Cosine LR schedule replaces linear warmdown and gives -0.070 BPB at 3000 steps, -0.072 at 5000 steps — consistent, multi-seed validated improvement on baseline code. Combined with asymmetric 1/10 encoder-decoder split (PR #1275), total improvement is -0.080 BPB at 5000 steps.
Also investigated focal loss for LM pretraining. Initial results looked extraordinary but contained a critical eval bug (see correction below). With corrected evaluation, focal loss shows no improvement.
Correction: Focal Loss Results Were Wrong
What happened: Our focal loss implementation applied `(1-pt)^gamma` weighting in `GPT.forward()`, which is called during both training AND evaluation. This made eval metrics artificially low — the "improvement" was from down-weighting hard tokens in the loss computation, not from better model quality.

The fix: `if focal_gamma > 0 and self.training:` — focal loss only during training.

Corrected results (3000 steps, multi-seed):
Focal loss does not help when evaluated correctly. Higher gamma actively hurts. We're keeping this in the PR as a cautionary tale — always verify your eval metric.
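The bug and the fix can be sketched in a few lines. This is an illustrative standalone function, not the PR's `GPT.forward()`; `pt` is the model's probability on the correct token, and the `training` flag stands in for `self.training`.

```python
import math

def focal_cross_entropy(probs_true, gamma, training):
    """Mean per-token loss: CE scaled by (1 - pt)^gamma, training only.

    probs_true: probabilities the model assigns to the correct tokens.
    The `training` guard is the fix: evaluation must report plain
    cross-entropy, otherwise easy tokens are down-weighted and the
    reported BPB is artificially deflated.
    """
    losses = []
    for pt in probs_true:
        ce = -math.log(pt)  # standard cross-entropy for this token
        if gamma > 0 and training:
            ce *= (1.0 - pt) ** gamma  # down-weight easy (high-pt) tokens
        losses.append(ce)
    return sum(losses) / len(losses)
```

With `training=False` the focal weighting is a no-op and the eval metric is honest cross-entropy; with the original buggy code path (weighting applied unconditionally), the reported loss is strictly lower whenever `gamma > 0`, which is where the inflated numbers came from.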
What IS Real: Cosine LR Schedule
Replace linear warmdown with cosine annealing. One change in `lr_mul()`:

Multi-seed validated (3000 steps):
Cosine LR scaling (single seed):
Improvement is consistent and not diminishing with training length.
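A sketch of what the `lr_mul()` change looks like. The warmup length and `min_lr_frac` default here are illustrative assumptions, not the PR's tuned values; the substantive change is only the post-warmup shape, from linear warmdown to cosine annealing.

```python
import math

def lr_mul_cosine(step, total_steps, warmup_steps=100, min_lr_frac=0.1):
    """LR multiplier: linear warmup, then cosine decay to min_lr_frac.

    Returns 1.0 at the end of warmup and min_lr_frac at total_steps.
    warmup_steps / min_lr_frac defaults are assumptions for illustration.
    """
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup, unchanged
    # Cosine anneal from 1.0 down to min_lr_frac over the remaining steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr_frac + 0.5 * (1.0 - min_lr_frac) * (1.0 + math.cos(math.pi * t))
```

Relative to a linear warmdown with the same endpoints, cosine holds the multiplier near 1.0 for longer early on, which is consistent with the reviewer's observation that the win may come from more area under max-LR in the undertrained regime.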
What IS Real: Asymmetric 1/10 Split
Stacks with cosine LR: 1.5619 BPB at 5000 steps (vs cosine alone 1.5706, -0.009). See PR #1275 for full details.
Research Journey: 80+ Experiments Across 3 GPUs
Phase 1 — M4 MacBook (PR #1073): 27 experiments. LR tuning, deep supervision, batch scaling, EMA/SWA.
Phase 2 — RTX 5090 + 8xH100 (PR #1275): Asymmetric encoder-decoder split discovery. 8xH100 partial run: 1.1492 pre-quant BPB.
Phase 3 — RTX 4000 Ada (this PR): 55+ experiments across 13 rounds ($2.50 total). Discovered cosine LR (-0.070). Investigated focal loss — looked incredible but was an eval metric artifact (corrected above). Also tested: decoder-skip connections, stochastic depth, label smoothing, gradient noise, WD scheduling, warm restarts, attention temp annealing, lookahead optimizer, LR sweeps, min_lr_frac sweeps. Most didn't help.
What didn't work: focal loss (eval bug), label smoothing, gradient noise, WD scheduling, cosine warm restarts, higher base LR (0.08) with cosine, lower min_lr_frac (0.01, 0.0), decoder-skip connections (vanished at 1000+ steps), gradient clipping with cosine.
Reproduce
Test Plan
🤖 Generated with Claude Code