
Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460 #1233

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-h

Conversation


@ibarrajo ibarrajo commented Apr 1, 2026

Summary

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | No focal loss |
| Approach H (focal, gamma=2.0) + TTT | 1.1460 | TTT s_0 score |
| Approach H (focal, gamma=2.0) base | 1.1537 | Before TTT |

Delta: +0.028 BPB vs baseline — focal loss hurts at gamma=2.0.

Analysis: Why Focal Loss Hurts

Focal loss at gamma=2.0 over-suppresses gradients from well-predicted tokens. In language modeling (unlike object detection, where focal loss originated), even "easy" tokens carry useful distributional signal. The (1-p)^2 factor reduces their gradient contribution too aggressively, slowing overall learning. A lower gamma (0.5-1.0) or curriculum-style scheduling might work better, but this was not explored.
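To make the suppression concrete, here is a small illustrative sketch (plain Python, hypothetical probability values, not numbers from the run) of how the (1-p)^gamma factor scales the per-token loss:

```python
import math

# Per-token focal loss: the true-token probability p sets both the
# cross-entropy -log(p) and the down-weighting factor (1 - p)**gamma.
for p in (0.9, 0.6, 0.3):
    ce = -math.log(p)
    for gamma in (0.0, 1.0, 2.0):
        factor = (1 - p) ** gamma
        print(f"p={p:.1f} gamma={gamma:.1f} factor={factor:.2f} loss={factor * ce:.4f}")
```

At gamma=2.0 a well-predicted token (p=0.9) keeps only (0.1)^2 = 1% of its cross-entropy term, while a hard token (p=0.3) keeps 49%; this asymmetry is the mechanism blamed above for slower learning.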

Key Changes

  1. Single-line change in `forward()`: `loss = ((1 - (-ce).exp()).pow(gamma) * ce).mean()`
  2. FOCAL_GAMMA env var (default 2.0, set to 0.0 for standard CE)
  3. No architecture, eval, or artifact size changes
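As a reference-style sketch of the one-liner above (numpy here rather than the actual PyTorch forward pass; function name and shapes are illustrative): since p = exp(-ce) for the true token, `(1 - (-ce).exp())` is exactly `1 - p`, and gamma=0.0 recovers standard cross-entropy.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Mean focal loss over a batch of token logits; gamma=0.0 gives plain CE."""
    z = logits - logits.max(axis=-1, keepdims=True)            # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_true = probs[np.arange(len(targets)), targets]           # p of the true token
    ce = -np.log(p_true)                                       # per-token cross-entropy
    return float(((1.0 - p_true) ** gamma * ce).mean())        # focal modulation
```

Because the modulation factor is strictly below 1 whenever p_true > 0, any gamma > 0 can only shrink the per-token loss relative to plain CE.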

Rule Compliance

  • Training <= 600s on 8xH100
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified focal loss implementation matches standard CE when gamma=0
  • Confirmed artifact size unchanged from baseline
  • Full 8xH100 training run completed within time budget

🤖 Generated with Claude Code

…460)

Replaces standard cross-entropy with focal loss (1-p)^2 * CE during training
to down-weight easy tokens and focus gradient on hard tokens. Built on
Approach B (Int5 GPTQ + 33.6M params). Focal loss at gamma=2.0 hurts BPB
by +0.028 vs baseline, suggesting the technique over-suppresses gradients
from well-predicted tokens that still carry useful signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>