Record: SLOT-24 Aggressive — val_bpb 0.8637 (3-seed mean) #1313
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| # SLOT-24 Aggressive — val_bpb 0.8637 (3-seed mean) | ||
|
|
||
| **val_bpb = 0.8637** (3-seed mean, std 0.0051) | 15.7-15.8 MB | 8xH100 SXM | ||
|
|
||
| ## 3-Seed Results | ||
|
|
||
| | Seed | Sliding BPB | + SLOT BPB | Steps | Artifact | | ||
| |------|------------|------------|-------|----------| | ||
| | 1337 | 1.1258 | **0.8683** | 6034 | 15,679,900 | | ||
| | 42 | 1.1207 | **0.8582** | 6563 | 15,827,704 | | ||
| | 2024 | 1.1221 | **0.8647** | 6568 | 15,770,916 | | ||
| | **Mean** | **1.1229** | **0.8637** | | | | ||
|
|
||
| Beats PR #1303 (0.9462) by 0.0825 BPB. Beats best pending (#1229, 0.9300) by 0.0663 BPB. | ||
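As a quick sanity check on the summary numbers, the sketch below recomputes the mean and spread from the three per-seed `val_bpb` values recorded in this PR's submission JSON. It assumes the reported std is the sample standard deviation (n-1 denominator), which is the estimator that reproduces 0.0051. The variable names are illustrative and not taken from `train_gpt.py`.

```python
import statistics

# Per-seed SLOT val_bpb values, copied from the submission JSON in this PR.
val_bpb = {1337: 0.86830694, 42: 0.85815415, 2024: 0.86466486}
values = list(val_bpb.values())

mean_bpb = statistics.mean(values)
# Assumption: the reported std uses the sample (n-1) estimator.
std_bpb = statistics.stdev(values)

# Baselines quoted in the README text.
baseline_pr1303 = 0.9462
best_pending_pr1229 = 0.9300

gain_vs_1303 = baseline_pr1303 - mean_bpb
gain_vs_1229 = best_pending_pr1229 - mean_bpb

print(f"mean={mean_bpb:.4f} std={std_bpb:.4f}")
print(f"gain vs #1303: {gain_vs_1303:.4f}")
print(f"gain vs #1229: {gain_vs_1229:.4f}")
```

With only three seeds the spread estimate is coarse, so the std is better read as a rough noise floor than as a tight confidence bound.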
|
|
||
| ## What Changed vs PR #1303 | ||
|
|
||
| Only the SLOT and evaluation-stride hyperparameters changed; the model, training, and architecture are the same: | ||
|
|
||
| | Parameter | PR #1303 | This PR | | ||
| |-----------|----------|---------| | ||
| | SLOT_STEPS | 16 | **24** | | ||
| | SLOT_LR | 0.008 | **0.012** | | ||
| | SLOT_LR_MIN | 0.0008 | **0.001** | | ||
| | EVAL_STRIDE | 64 | **96** | | ||
|
|
||
| Found via a 6-config hyperparameter sweep over SLOT steps, LR, and stride. | ||
|
|
||
| ## Architecture | ||
|
|
||
| 11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)^2 MLP 3x, VRL, VE128, BigramHash(1024), XSA all 11 layers, QK-Gain 4.0, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997), Late QAT, int6+lzma, FA3 Hopper, Muon WD=0.04. | ||
|
|
||
| ## SLOT-24 Details | ||
|
|
||
| - Per-sample hidden delta [bsz, 1, 512] + logit bias [bsz, 1, 1024] | ||
| - Scored-position masking (last stride=96 tokens per non-first window) | ||
| - 24 AdamW steps, cosine LR 0.012 -> 0.001 | ||
| - Model weights frozen, delta optimized through detached hidden states | ||
| - Eval time: ~231-255s on 8xH100 | ||
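The learning-rate schedule and scoring mask listed above can be sketched in plain Python. This is a minimal illustration rather than the code in `train_gpt.py`. It assumes a standard half-cosine decay that reaches `SLOT_LR` at step 0 and `SLOT_LR_MIN` at the final step, that the first window scores every position, and a window length of 1024 tokens chosen purely for the example. The function names `slot_lr` and `scored_mask` are hypothetical.

```python
import math

SLOT_STEPS = 24
SLOT_LR = 0.012
SLOT_LR_MIN = 0.001
EVAL_STRIDE = 96


def slot_lr(step: int, total: int = SLOT_STEPS) -> float:
    """Cosine decay from SLOT_LR (step 0) to SLOT_LR_MIN (last step).

    Assumption: the endpoint convention (total - 1 in the denominator) is a
    guess; the PR only states "cosine LR 0.012 -> 0.001".
    """
    progress = step / max(total - 1, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return SLOT_LR_MIN + (SLOT_LR - SLOT_LR_MIN) * cosine


def scored_mask(seq_len: int, is_first_window: bool, stride: int = EVAL_STRIDE) -> list[bool]:
    """True where a position counts toward the reported score.

    Assumption: the first window scores every position; each later window
    scores only its last `stride` tokens.
    """
    if is_first_window:
        return [True] * seq_len
    return [i >= seq_len - stride for i in range(seq_len)]


schedule = [slot_lr(s) for s in range(SLOT_STEPS)]
mask = scored_mask(seq_len=1024, is_first_window=False)  # 1024 is illustrative
print(f"lr[0]={schedule[0]:.4f} lr[-1]={schedule[-1]:.4f} scored={sum(mask)}")
```

In a real run the same mask would presumably also gate the loss used to optimize the per-window delta and logit bias, so that only scored positions drive the 24 AdamW steps.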
|
|
||
| ## Compliance | ||
|
|
||
| - **Frozen-model SLOT**: model weights are never modified during evaluation. Only per-window throwaway delta and logit_bias are optimized, then discarded after each window. Same evaluation pattern as accepted PRs #1176 and #1229. | ||
| - No n-gram cache, no eval-time GPTQ | ||
| - Self-contained, no network calls | ||
|
|
||
| - All seeds within time and size budgets | ||
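The budget claim can be cross-checked against the per-seed figures in the submission JSON. The sketch below assumes the track name `10min_16mb` implies a 600-second training cap and a 16,000,000-byte artifact cap (decimal megabytes). The exact limits and how they are measured are not stated in this PR, so both constants are assumptions, and `check_budgets` is a hypothetical helper.

```python
# Per-seed figures copied from the submission JSON in this PR.
seeds = {
    1337: {"steps": 6034, "ms_per_step": 99.5, "artifact_bytes": 15_679_900},
    42: {"steps": 6563, "ms_per_step": 91.4, "artifact_bytes": 15_827_704},
    2024: {"steps": 6568, "ms_per_step": 91.3, "artifact_bytes": 15_770_916},
}

# Assumed limits, inferred from the track name "10min_16mb".
MAX_TRAIN_SECONDS = 600.0
MAX_ARTIFACT_BYTES = 16_000_000


def check_budgets(results, tolerance_s=1.0):
    """Return {seed: (approx_train_seconds, size_ok, time_ok)}.

    ms_per_step is rounded to 0.1 ms in the JSON, so the reconstructed
    training time is only approximate; tolerance_s absorbs that rounding.
    """
    report = {}
    for seed, r in results.items():
        train_s = r["steps"] * r["ms_per_step"] / 1000.0
        size_ok = r["artifact_bytes"] <= MAX_ARTIFACT_BYTES
        time_ok = train_s <= MAX_TRAIN_SECONDS + tolerance_s
        report[seed] = (train_s, size_ok, time_ok)
    return report


report = check_budgets(seeds)
for seed, (train_s, size_ok, time_ok) in report.items():
    print(f"seed {seed}: ~{train_s:.1f}s train, size_ok={size_ok}, time_ok={time_ok}")
```

This only covers training time and artifact size; whether evaluation time counts against a separate limit is not something the figures here can confirm.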
|
|
||
| ## Reproduction | ||
|
|
||
| ```bash | ||
| torchrun --standalone --nproc_per_node=8 train_gpt.py | ||
| ``` | ||
|
|
||
| All defaults are set in the `Hyperparameters` class. Training: ~600s. Eval: ~350s. Total: ~16 min. | ||
|
|
||
| ## Credits | ||
|
|
||
| - Base: PR #175, PR #1303 (@anthony-maio) | ||
| - SLOT: Hu et al. arXiv:2505.12392v2, PR #1176 (@bigbag), PR #1229 (@resouer) | ||
| - QK-Gain 4.0: PR #1125 | ||
| - XSA: PR #1176 (@bigbag) | ||
| - VRL: ResFormer (arXiv:2410.17897) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| { | ||
| "name": "SLOT24_LR012_Stride96", | ||
| "author": "Anthony Maio", | ||
| "github_id": "anthony-maio", | ||
| "date": "2026-04-03", | ||
| "track": "10min_16mb", | ||
| "num_gpus": 8, | ||
| "gpu_type": "H100 SXM", | ||
| "training_time_seconds": 600, | ||
| "seed_results": { | ||
| "1337": {"val_loss": 1.46609658, "val_bpb": 0.86830694, "steps": 6034, "ms_per_step": 99.5, "artifact_bytes": 15679900}, | ||
| "42": {"val_loss": 1.44895406, "val_bpb": 0.85815415, "steps": 6563, "ms_per_step": 91.4, "artifact_bytes": 15827704}, | ||
| "2024": {"val_loss": 1.45994709, "val_bpb": 0.86466486, "steps": 6568, "ms_per_step": 91.3, "artifact_bytes": 15770916} | ||
| }, | ||
| "mean_val_loss": 1.4583, | ||
| "mean_val_bpb": 0.8637, | ||
| "std_val_bpb": 0.0051, | ||
| "blurb": "Aggressive SLOT-24 with LR 0.012 (cosine to 0.001), stride=96, per-sample delta + logit bias, scored-position masked. Same architecture as PR #1303 (QK-Gain 4.0, XSA-11, VRL, LeakyReLU2, BigramHash 1024, int6+lzma)." | ||
| } |
The PR description says `EVAL_STRIDE` is unchanged vs PR #1303 (96→96), but this README states that it changed (64→96). Please reconcile the stride value between the README and the PR description so readers can accurately reproduce the result and understand what differs from #1303.