
Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean) #1219

Open
Gusanidas wants to merge 1 commit into openai:main from Gusanidas:apr_1

Conversation

@Gusanidas

Based on PR #1105 (abaybektursun) with these changes:

  • Causal n-gram fix (within_hint/word_hint prefix-only)
  • Window attention (size=512) on layers 2,4,6,8,10 via FA3
  • Mixed seq_len training: 5 GPUs at 2048x36 + 3 GPUs at 6144x10
  • Train-data GPTQ calibration (14s vs 220s AR self-gen)
  • Auto eval_seq_len detection from max train seq_len
  • Sliding window eval at seq_len=6144, stride=128
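
A minimal sketch of the sliding-window eval described above (not the PR's actual code), assuming a causal LM that maps (B, T) token ids to (B, T, vocab) logits. The window advances by `stride`, and each window scores only the targets not covered by the previous window, so every token is counted once with up to seq_len-1 tokens of left context. For simplicity this treats one token as one byte; real bpb divides total bits by the byte count of the eval text.

```python
import math
import torch

def sliding_window_bpb(model, tokens, seq_len=6144, stride=128, device="cuda"):
    """Sliding-window eval: advance a seq_len window by `stride`, scoring
    only the targets not already covered by the previous window."""
    n = len(tokens)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, n, stride):
        end = min(begin + seq_len, n)
        window = torch.tensor(tokens[begin:end], device=device).unsqueeze(0)
        with torch.no_grad():
            logits = model(window[:, :-1])                  # (1, T-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tok_nll = -logp.gather(-1, window[:, 1:, None]).squeeze(-1)  # (1, T-1)
        keep = min(end - prev_end, tok_nll.shape[1])        # newly covered targets
        nll_sum += tok_nll[0, -keep:].sum().item()
        n_scored += keep
        prev_end = end
        if end == n:
            break
    # nats -> bits; assumes 1 token == 1 byte (real bpb divides by bytes)
    return nll_sum / n_scored / math.log(2)
```

With stride=128 and seq_len=6144, 98% of each window overlaps the previous one, which is what makes the streaming-KV idea later in this thread pay off.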

3-seed results (sliding window bpb):
seed 1337: 1.1077
seed 42: 1.1083
seed 7: 1.1091
mean: 1.1084 (vs leader 1.1147)

There is plenty of room for further optimization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Thesis: the speed path is the most underutilized section of openai/parameter-golf.
The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties.
Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses
under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in
  plain sight. We're paying 8x kernel-launch overhead because grad_accum was
  inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is
  625K sequential forwards at B=1 stride=64. 97% of each window's context is
  shared with the previous. Streaming KV (StreamingLLM arXiv 2309.17453) gives
  5-15x eval speedup, saves 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025 arXiv
  2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35
  backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved).
  Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.
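
Shot 0a can be sketched in a few lines. This is a generic gradient-accumulation loop (hypothetical names, not the repo's trainer): with `grad_accum_steps=8` each optimizer update pays 8x the forward/backward launch overhead, while on a single GPU with enough memory the micro-batches can be folded into one larger batch and `grad_accum_steps` dropped to 1 with mathematically identical updates (for equal-size micro-batches).

```python
import torch

def train_steps(model, opt, batches, grad_accum_steps=1):
    """One optimizer update per `grad_accum_steps` micro-batches.
    Scaling each loss by 1/grad_accum_steps makes the accumulated
    gradient equal the gradient of the mean loss over the full batch."""
    model.train()
    for i, (x, y) in enumerate(batches):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        (loss / grad_accum_steps).backward()
        if (i + 1) % grad_accum_steps == 0:
            opt.step()
            opt.zero_grad(set_to_none=True)
```

The equivalence is exact when micro-batches have equal size, so dropping 8→1 changes throughput, not the training trajectory.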

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the
  fastest step in the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4
  contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms
  → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity
  (PRs openai#1105, openai#1420). Identity itself looks world-novel.
- Shots 13-14: eval path wins (Triton KV-cache backend, fused softcap+CE
  megakernel). Combined eval speedup ~5x on top of Shot 0b.
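
The core of Shot 10 is that once the per-layer weight matrices are stacked into a contiguous (B, m, n) bank, each Newton-Schulz iteration becomes a handful of batched matmuls instead of B separate launches. A sketch (not PR openai#399's code; the quintic coefficients are the ones used by the Muon optimizer in modded-nanogpt):

```python
import torch

def newton_schulz_banked(G, steps=5, eps=1e-7):
    """Newton-Schulz orthogonalization over a bank of matrices at once.
    G: (B, m, n) gradient bank; each slice is driven toward a nearby
    semi-orthogonal matrix. Stacking turns B independent matmuls per
    iteration into single torch.bmm calls."""
    a, b, c = 3.4445, -4.7750, 2.0315      # Muon's quintic coefficients
    X = G / (G.norm(dim=(1, 2), keepdim=True) + eps)
    transposed = X.shape[1] > X.shape[2]
    if transposed:
        X = X.transpose(1, 2)
    for _ in range(steps):
        A = torch.bmm(X, X.transpose(1, 2))        # (B, m, m), one batched matmul
        A2 = torch.bmm(A, A)
        X = a * X + torch.bmm(b * A + c * A2, X)
    return X.transpose(1, 2) if transposed else X
```

The 15x optimizer-time claim comes from replacing 66 small kernel launches per iteration with one `bmm` over the bank; the math per slice is unchanged.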

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent
  SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels;
  nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of
  our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens
  templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — user's "dial-in" hint
  operationalized. Thompson sampling from {0.5x, 1x, 2x} * base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — user's "CPU while GPU" hint
  operationalized. BG thread pre-computes n-gram hash tensors, 50 LOC.
- Megadream 5: **GPU-resident successive halving** — user's "GPU tests" hint
  operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick
  winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min
  compile cold-start permanently.
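
Megadream 3 in ~30 lines, as a hedged sketch (hypothetical class, not existing code): a Gaussian Thompson-sampling bandit over LR multipliers, where the reward fed back after each microbatch would be the observed loss decrease.

```python
import random

class LRBandit:
    """Thompson-sampling bandit over LR multipliers (Megadream 3 sketch).
    Each arm keeps a running Gaussian posterior over its reward; before
    each microbatch we sample from every posterior and use the arm with
    the highest draw, so exploration decays as evidence accumulates."""
    def __init__(self, multipliers=(0.5, 1.0, 2.0), prior_var=1.0):
        self.multipliers = multipliers
        self.n = [0] * len(multipliers)        # pulls per arm
        self.mean = [0.0] * len(multipliers)   # posterior mean reward
        self.prior_var = prior_var

    def pick(self):
        draws = [random.gauss(m, (self.prior_var / (k + 1)) ** 0.5)
                 for m, k in zip(self.mean, self.n)]
        self.last = max(range(len(draws)), key=draws.__getitem__)
        return self.multipliers[self.last]

    def update(self, reward):
        i = self.last
        self.n[i] += 1
        self.mean[i] += (reward - self.mean[i]) / self.n[i]  # running mean
```

Usage would be: `mult = bandit.pick()`, set `group["lr"] = base_lr * mult` on the optimizer's param groups, run the microbatch, then `bandit.update(prev_loss - loss)`.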

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100.
That's where val_bpb drops BELOW comp records.
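
The 4x claim is straightforward arithmetic, made explicit here:

```python
comp_total_steps, comp_gpus = 20_000, 8   # comp record: 20000 steps on 8xH100
our_steps, our_gpus = 10_000, 1           # projected: 10000 steps on 1xH100

comp_per_gpu = comp_total_steps / comp_gpus          # 2500 steps per GPU
ratio = (our_steps / our_gpus) / comp_per_gpu
print(ratio)                                         # 4.0
```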

Key finding: the eval path, not training, currently holds the biggest speed wins.
Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2
Shots 13-14 save 5-8 min per eval pass, more than any single training-side
patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed),
/tmp/phase2_world_speed_research.md (12 research areas surveyed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>