records/track_10min_16mb/2026-04-01_ApproachH_FocalLoss/README.md (new file, 35 additions)
# Approach H: Focal Loss

## Summary

Builds on Approach B (Int5 GPTQ + 33.6M params + SWA + XSA + VE) by replacing
standard cross-entropy with focal loss during training.

Focal loss: `loss = (1 - p_correct)^gamma * CE_loss`, where `p_correct` is the
model's probability on the target token and gamma=2.0 (configurable via the
`FOCAL_GAMMA` env var). This down-weights easy tokens the model already predicts
well and focuses the gradient signal on hard tokens.
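As a sketch in plain Python (not the repo's actual `torch` code), with `ce` the per-token cross-entropy in nats so that `p_correct = exp(-ce)`:

```python
import math

def focal_loss_mean(ce_values, gamma=2.0):
    """Mean focal loss over per-token cross-entropy values (in nats).

    Since p_correct = exp(-ce), the weight (1 - p_correct)^gamma is
    near 0 for easy tokens (small ce) and near 1 for hard tokens (large ce).
    """
    weighted = [(1.0 - math.exp(-ce)) ** gamma * ce for ce in ce_values]
    return sum(weighted) / len(weighted)
```

With gamma=0.0 every weight is 1 and the expression reduces to the plain mean cross-entropy.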

Inspired by PR #1180, which achieved 1.0577 BPB using a P2 loss `(1-p)^2` among other
techniques (residual mixing, conv token mixer, wallclock-aware warmdown).

## Key changes from Approach B

1. **Focal loss** (training only): replaces `F.cross_entropy(..., reduction="mean")`
with `((1 - exp(-ce))^gamma * ce).mean()` in the model's `forward()` method.
2. `FOCAL_GAMMA` env var (default 2.0, set to 0.0 for standard CE).

No eval changes. No architecture changes. No artifact size impact.
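A hedged sketch of the training-only switch described above, for a single token's raw logits; the function name and env-var handling are illustrative, not the exact code in `train_gpt.py`:

```python
import math
import os

def focal_ce_from_logits(logits, target, gamma=None):
    """Focal cross-entropy for one token from raw logits.

    gamma defaults to the FOCAL_GAMMA env var (2.0 if unset);
    gamma=0.0 recovers standard cross-entropy.
    """
    if gamma is None:
        gamma = float(os.environ.get("FOCAL_GAMMA", "2.0"))
    # log-sum-exp with max-shift for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    ce = log_z - logits[target]               # -log p_correct
    return (1.0 - math.exp(-ce)) ** gamma * ce
```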

## Configuration

```bash
FOCAL_GAMMA=2.0 # default; 0.0 = standard CE
```

All other hyperparameters unchanged from Approach B defaults.

## Expected outcome

Focal loss should improve BPB by focusing training on hard tokens. The
down-weighting factor `(1-p)^2` is strongest early in training when many tokens
are hard, and naturally relaxes as the model improves.
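To make the relaxation concrete, the gamma=2 weight at a few confidence levels (illustrative numbers, not measured from a run):

```python
# weight = (1 - p_correct)^2 for gamma = 2.0
for p in (0.10, 0.50, 0.90, 0.99):
    print(f"p_correct={p:.2f}  weight={(1 - p) ** 2:.4f}")
# p_correct=0.10  weight=0.8100
# p_correct=0.50  weight=0.2500
# p_correct=0.90  weight=0.0100
# p_correct=0.99  weight=0.0001
```

As the model improves and `p_correct` rises, the gradient weight falls off quadratically, which is the relaxation noted above.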
records/track_10min_16mb/2026-04-01_ApproachH_FocalLoss/run.sh (new file, 12 additions)
#!/bin/bash
# Approach H: Focal Loss training run
# Base: ApproachB (Int5 GPTQ + 33.6M params) + focal loss (gamma=2.0)

set -euo pipefail

export NCCL_IB_DISABLE=1
export RUN_ID=approach_h_focal
export FOCAL_GAMMA=2.0

# All other hyperparams inherit from ApproachB defaults in Hyperparameters class
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee train.log
(JSON metadata file, new, 11 additions; filename not shown in the diff)
{
"author": "Alex Ibarra",
"github_id": "elninja",
"name": "ApproachH_FocalLoss",
"blurb": "Focal loss (gamma=2) on top of Int5 GPTQ + 33.6M params + SWA + XSA + VE. Down-weights easy tokens to focus training on hard tokens.",
"date": "2026-04-01",
"val_loss": null,
"val_bpb": null,
"bytes_total": null,
"bytes_code": null
}