Skip to content

Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected)#1380

Open
ranausmanai wants to merge 5 commits intoopenai:mainfrom
ranausmanai:focal-loss-lm-pretraining
Open

Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected)#1380
ranausmanai wants to merge 5 commits intoopenai:mainfrom
ranausmanai:focal-loss-lm-pretraining

Conversation

@ranausmanai
Copy link
Copy Markdown

@ranausmanai ranausmanai commented Apr 5, 2026

Summary

Cosine LR schedule replaces linear warmdown and gives -0.070 BPB at 3000 steps, -0.072 at 5000 steps — consistent, multi-seed validated improvement on baseline code. Combined with asymmetric 1/10 encoder-decoder split (PR #1275), total improvement is -0.080 BPB at 5000 steps.

Also investigated focal loss for LM pretraining. Initial results looked extraordinary but contained a critical eval bug (see correction below). With corrected evaluation, focal loss shows no improvement.

Correction: Focal Loss Results Were Wrong

What happened: Our focal loss implementation applied (1-pt)^gamma weighting in GPT.forward(), which is called during both training AND evaluation. This made eval metrics artificially low — the "improvement" was from down-weighting hard tokens in the loss computation, not from better model quality.

The fix: if focal_gamma > 0 and self.training: — focal loss only during training.

Corrected results (3000 steps, multi-seed):

Config Seed 1337 Seed 42 Seed 2025 Mean
Cosine LR only 1.6538 1.6480 1.6687 1.657
Cosine + Focal γ=2 1.6612 1.6560 1.6594 1.659
Cosine + Focal γ=5 1.6858 1.686
Cosine + Focal γ=8 1.7124 1.712

Focal loss does not help when evaluated correctly. Higher gamma actively hurts. We're keeping this in the PR as a cautionary tale — always verify your eval metric.

What IS Real: Cosine LR Schedule

Replace linear warmdown with cosine annealing. One change in lr_mul():

min_lr_frac = 0.1
progress = step / max(args.iterations, 1)
return min_lr_frac + 0.5 * (1.0 - min_lr_frac) * (1.0 + math.cos(math.pi * progress))

Multi-seed validated (3000 steps):

Config Seed 1337 Seed 42 Seed 2025 Mean
Baseline (linear warmdown) 1.7233
Cosine LR 1.6538 1.6480 1.6687 1.657

Cosine LR scaling (single seed):

Steps Baseline Cosine LR Delta
1000 2.0568 1.9334 -0.123
2000 1.8330 1.8050 -0.028
3000 1.7233 1.6538 -0.070
5000 1.6422 1.5706 -0.072

Improvement is consistent and not diminishing with training length.

What IS Real: Asymmetric 1/10 Split

self.num_encoder_layers = 1  # instead of num_layers // 2

Stacks with cosine LR: 1.5619 BPB at 5000 steps (vs cosine alone 1.5706, -0.009). See PR #1275 for full details.

Research Journey: 80+ Experiments Across 3 GPUs

Phase 1 — M4 MacBook (PR #1073): 27 experiments. LR tuning, deep supervision, batch scaling, EMA/SWA.

Phase 2 — RTX 5090 + 8xH100 (PR #1275): Asymmetric encoder-decoder split discovery. 8xH100 partial run: 1.1492 pre-quant BPB.

Phase 3 — RTX 4000 Ada (this PR): 55+ experiments across 13 rounds ($2.50 total). Discovered cosine LR (-0.070). Investigated focal loss — looked incredible but was an eval metric artifact (corrected above). Also tested: decoder-skip connections, stochastic depth, label smoothing, gradient noise, WD scheduling, warm restarts, attention temp annealing, lookahead optimizer, LR sweeps, min_lr_frac sweeps. Most didn't help.

What didn't work: focal loss (eval bug), label smoothing, gradient noise, WD scheduling, cosine warm restarts, higher base LR (0.08) with cosine, lower min_lr_frac (0.01, 0.0), decoder-skip connections (vanished at 1000+ steps), gradient clipping with cosine.

Reproduce

git clone https://github.com/openai/parameter-golf.git && cd parameter-golf
pip install sentencepiece huggingface-hub datasets tiktoken flash-attn
python data/cached_challenge_fineweb.py --variant sp1024

# Cosine LR: change lr_mul() in train_gpt.py (see code above)
# Asymmetric split: self.num_encoder_layers = 1
COSINE_LR=1 python train_gpt.py

Test Plan

🤖 Generated with Claude Code

ranausmanai and others added 3 commits April 5, 2026 17:36
…00 Ada

Apply focal loss (Lin et al. 2017) to language model pretraining:
replace standard cross-entropy with (1-pt)^gamma * CE to focus on
hard-to-predict tokens. Combined with cosine LR schedule and asymmetric
encoder-decoder split, achieves 1.1567 int8 BPB at 5000 steps on a
single RTX 4000 Ada using baseline code — within 0.037 of the 8xH100
SOTA record. 55+ experiments across 13 rounds validate the finding.

See PRs openai#1275 and openai#1073 for prior work on asymmetric split and M4
MacBook experiments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cript

Expanded README with complete experiment journey across M4 MacBook,
RTX 5090, 8xH100, and RTX 4000 Ada. Added ready-to-run reproduction
instructions for both single GPU and 8xH100 record runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…OTA code

Tested focal loss + cosine LR + asymmetric split on the actual SOTA
train_gpt.py (LeakyReLU + XSA + Parallel Muon + EMA). Result: 1.5035
pre-quant val_bpb vs 1.7927 baseline (-0.289). Confirms focal loss
transfers to the fully-optimized stack.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ranausmanai ranausmanai changed the title Non-record: Focal Loss for LM Pretraining — 1.1567 int8 BPB on RTX 4000 Ada (3-line change) Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected) Apr 5, 2026
ranausmanai and others added 2 commits April 5, 2026 22:52
Focal loss was applied during eval (not just training), inflating all
reported BPB numbers. Fixed with `self.training` check. Corrected
multi-seed results show focal loss does not help. Cosine LR (-0.070)
and asymmetric split remain valid findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, gamma annealing, logit penalty, multi-cycle cosine

All modifications are training-only (guarded by self.training). Controlled via env vars:
- Z_LOSS: log(Z)^2 regularizer from PaLM paper
- TOKEN_DROP: synaptic pruning-inspired token dropout
- EMBED_MIXUP: genetic recombination-inspired embedding interpolation
- GAMMA_ANNEAL: decay focal gamma to 0 over training
- LOGIT_PENALTY: L2 penalty on logits for sparse activation
- COSINE_CYCLES: multi-cycle cosine LR schedule

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MatoTeziTanka
Copy link
Copy Markdown

Thanks for the ping, @ranausmanai. Walked through the diff at SHA 1a6be00 carefully.

What's solid:

  • Cosine LR formula (records/.../train_gpt.py:960-968) matches the PR body exactly. With COSINE_CYCLES=1 it reduces to 0.1 + 0.45*(1+cos(π·progress)), endpoints 1.0 → 0.1. Clean drop-in via COSINE_LR=1.
  • Focal loss fix (L738, if focal_gamma > 0 and self.training:) is complete. I traced the single forward path and confirmed no eval leak — model.eval() sets self.training=False and the branch falls through to plain F.cross_entropy. All five new knobs (embed_mixup, token_drop, focal, z_loss, logit_penalty) carry the and self.training guard. Clean.
  • Retracting focal loss honestly with the numbers published is exactly the right call. That's the kind of transparency the competition needs more of.

The part I'd gently push on:

The -0.070 BPB improvement (1.7233 → 1.6538 at 3000 steps) is real in that regime, but I suspect it's an undertraining artifact rather than a transferable win. The baseline's linear warmdown schedule is wallclock-targeted for 8×H100 / 600s, so on an RTX 4000 Ada at a fixed 3000-step budget, the baseline spends most of its time still warming down when cosine has effectively converged. Cosine spends more area under max-LR early, which wins in the undertrained regime. Near convergence (~1.08 BPB territory where the current SOTA lives), schedule shape typically matters much less — Chinchilla-era ablations suggest <0.005 BPB delta between cosine and linear-warmdown on well-tuned runs.

My suggestion: before framing cosine as a general improvement, try it at longer horizons on a stronger baseline stack (e.g., on top of PR #1019 or PR #1394's code), then compare to the linear-warmdown equivalent at matched step count. If the delta holds at ~1.1 BPB or below, it's a real SOTA-class win. If it shrinks to <0.005 at that level, it's regime-specific and not worth absorbing.

Also — I notice the "asymmetric 1/10 encoder-decoder split" mentioned in the PR body doesn't appear in this commit's diff. Is it elsewhere in the branch, or still pending?

For what it's worth, the way you documented the focal loss retraction will earn more trust in the long run than the original claim would have. Appreciate the honesty.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants