
Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460 #1233

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-h

Conversation


@ibarrajo ibarrajo commented Apr 1, 2026

Summary

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | No focal loss |
| Approach H (focal, gamma=2.0) + TTT | 1.1460 | TTT s_0 score |
| Approach H (focal, gamma=2.0) base | 1.1537 | Before TTT |

Delta: +0.028 BPB vs baseline — focal loss hurts at gamma=2.0.

Analysis: Why Focal Loss Hurts

Focal loss at gamma=2.0 over-suppresses gradients from well-predicted tokens. In language modeling (unlike object detection, where focal loss originated), even "easy" tokens carry useful distributional signal. The (1-p)^2 factor reduces their gradient contribution too aggressively, slowing overall learning. A lower gamma (0.5-1.0) or curriculum-style scheduling might work better, but this was not explored.
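To make the suppression concrete, here is a small illustrative sketch (plain Python, hypothetical probability values, not numbers from the run) of how the (1-p)^gamma factor scales the per-token loss:

```python
import math

# Per-token focal loss: the true-token probability p sets both the
# cross-entropy -log(p) and the down-weighting factor (1 - p)**gamma.
for p in (0.9, 0.6, 0.3):
    ce = -math.log(p)
    for gamma in (0.0, 1.0, 2.0):
        factor = (1 - p) ** gamma
        print(f"p={p:.1f} gamma={gamma:.1f} factor={factor:.2f} loss={factor * ce:.4f}")
```

At gamma=2.0 a well-predicted token (p=0.9) keeps only (0.1)^2 = 1% of its cross-entropy term, while a hard token (p=0.3) keeps 49%; this asymmetry is the mechanism blamed above for slower learning.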

Key Changes

  1. Single-line change in `forward()`: `loss = ((1 - (-ce).exp()).pow(gamma) * ce).mean()`
  2. FOCAL_GAMMA env var (default 2.0, set to 0.0 for standard CE)
  3. No architecture, eval, or artifact size changes
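As a reference-style sketch of the one-liner above (numpy here rather than the actual PyTorch forward pass; function name and shapes are illustrative): since p = exp(-ce) for the true token, `(1 - (-ce).exp())` is exactly `1 - p`, and gamma=0.0 recovers standard cross-entropy.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Mean focal loss over a batch of token logits; gamma=0.0 gives plain CE."""
    z = logits - logits.max(axis=-1, keepdims=True)            # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_true = probs[np.arange(len(targets)), targets]           # p of the true token
    ce = -np.log(p_true)                                       # per-token cross-entropy
    return float(((1.0 - p_true) ** gamma * ce).mean())        # focal modulation
```

Because the modulation factor is strictly below 1 whenever p_true > 0, any gamma > 0 can only shrink the per-token loss relative to plain CE.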

Rule Compliance

  • Training <= 600s on 8xH100
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified focal loss implementation matches standard CE when gamma=0
  • Confirmed artifact size unchanged from baseline
  • Full 8xH100 training run completed within time budget

🤖 Generated with Claude Code

…460)

Replaces standard cross-entropy with focal loss (1-p)^2 * CE during training
to down-weight easy tokens and focus gradient on hard tokens. Built on
Approach B (Int5 GPTQ + 33.6M params). Focal loss at gamma=2.0 hurts BPB
by +0.028 vs baseline, suggesting the technique over-suppresses gradients
from well-predicted tokens that still carry useful signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>