
Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean) #1219

Open
Gusanidas wants to merge 1 commit into openai:main from Gusanidas:apr_1

Conversation

@Gusanidas

Based on PR #1105 (abaybektursun) with these changes:

  • Causal n-gram fix (within_hint/word_hint prefix-only)
  • Window attention (size=512) on layers 2,4,6,8,10 via FA3
  • Mixed seq_len training: 5 GPUs at 2048x36 + 3 GPUs at 6144x10
  • Train-data GPTQ calibration (14s vs 220s AR self-gen)
  • Auto eval_seq_len detection from max train seq_len
  • Sliding window eval at seq_len=6144, stride=128
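
A minimal sketch of the sliding-window eval described above (not the PR's actual code), assuming a causal LM that maps (B, T) token ids to (B, T, vocab) logits. The window advances by `stride`, and each window scores only the targets not covered by the previous window, so every token is counted once with up to seq_len-1 tokens of left context. For simplicity this treats one token as one byte; real bpb divides total bits by the byte count of the eval text.

```python
import math
import torch

def sliding_window_bpb(model, tokens, seq_len=6144, stride=128, device="cuda"):
    """Sliding-window eval: advance a seq_len window by `stride`, scoring
    only the targets not already covered by the previous window."""
    n = len(tokens)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, n, stride):
        end = min(begin + seq_len, n)
        window = torch.tensor(tokens[begin:end], device=device).unsqueeze(0)
        with torch.no_grad():
            logits = model(window[:, :-1])                  # (1, T-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tok_nll = -logp.gather(-1, window[:, 1:, None]).squeeze(-1)  # (1, T-1)
        keep = min(end - prev_end, tok_nll.shape[1])        # newly covered targets
        nll_sum += tok_nll[0, -keep:].sum().item()
        n_scored += keep
        prev_end = end
        if end == n:
            break
    # nats -> bits; assumes 1 token == 1 byte (real bpb divides by bytes)
    return nll_sum / n_scored / math.log(2)
```

With stride=128 and seq_len=6144, 98% of each window overlaps the previous one, which is what makes the streaming-KV idea later in this thread pay off.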

3-seed results (sliding window bpb):
seed 1337: 1.1077
seed 42: 1.1083
seed 7: 1.1091
mean: 1.1084 (vs leader 1.1147)

There is plenty of room for further optimization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Thesis: the speed path is the most underutilized section of openai/parameter-golf.
The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties.
Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses
under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in
  plain sight. We're paying 8x kernel-launch overhead because grad_accum was
  inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is
  625K sequential forwards at B=1 stride=64. 97% of each window's context is
  shared with the previous. Streaming KV (StreamingLLM arXiv 2309.17453) gives
  5-15x eval speedup, saves 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025 arXiv
  2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35
  backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved).
  Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.
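
Shot 0a can be sketched in a few lines. This is a generic gradient-accumulation loop (hypothetical names, not the repo's trainer): with `grad_accum_steps=8` each optimizer update pays 8x the forward/backward launch overhead, while on a single GPU with enough memory the micro-batches can be folded into one larger batch and `grad_accum_steps` dropped to 1 with mathematically identical updates (for equal-size micro-batches).

```python
import torch

def train_steps(model, opt, batches, grad_accum_steps=1):
    """One optimizer update per `grad_accum_steps` micro-batches.
    Scaling each loss by 1/grad_accum_steps makes the accumulated
    gradient equal the gradient of the mean loss over the full batch."""
    model.train()
    for i, (x, y) in enumerate(batches):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        (loss / grad_accum_steps).backward()
        if (i + 1) % grad_accum_steps == 0:
            opt.step()
            opt.zero_grad(set_to_none=True)
```

The equivalence is exact when micro-batches have equal size, so dropping 8→1 changes throughput, not the training trajectory.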

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the
  fastest step in the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4
  contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms
  → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity
  (PRs openai#1105, openai#1420). Identity itself looks world-novel.
- Shots 13-14: eval path wins (Triton KV-cache backend, fused softcap+CE
  megakernel). Combined eval speedup ~5x on top of Shot 0b.
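
The core of Shot 10 is that once the per-layer weight matrices are stacked into a contiguous (B, m, n) bank, each Newton-Schulz iteration becomes a handful of batched matmuls instead of B separate launches. A sketch (not PR openai#399's code; the quintic coefficients are the ones used by the Muon optimizer in modded-nanogpt):

```python
import torch

def newton_schulz_banked(G, steps=5, eps=1e-7):
    """Newton-Schulz orthogonalization over a bank of matrices at once.
    G: (B, m, n) gradient bank; each slice is driven toward a nearby
    semi-orthogonal matrix. Stacking turns B independent matmuls per
    iteration into single torch.bmm calls."""
    a, b, c = 3.4445, -4.7750, 2.0315      # Muon's quintic coefficients
    X = G / (G.norm(dim=(1, 2), keepdim=True) + eps)
    transposed = X.shape[1] > X.shape[2]
    if transposed:
        X = X.transpose(1, 2)
    for _ in range(steps):
        A = torch.bmm(X, X.transpose(1, 2))        # (B, m, m), one batched matmul
        A2 = torch.bmm(A, A)
        X = a * X + torch.bmm(b * A + c * A2, X)
    return X.transpose(1, 2) if transposed else X
```

The 15x optimizer-time claim comes from replacing 66 small kernel launches per iteration with one `bmm` over the bank; the math per slice is unchanged.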

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent
  SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels;
  nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of
  our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens
  templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — user's "dial-in" hint
  operationalized. Thompson sampling from {0.5x, 1x, 2x} * base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — user's "CPU while GPU" hint
  operationalized. BG thread pre-computes n-gram hash tensors, 50 LOC.
- Megadream 5: **GPU-resident successive halving** — user's "GPU tests" hint
  operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick
  winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min
  compile cold-start permanently.
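
Megadream 3 in ~30 lines, as a hedged sketch (hypothetical class, not existing code): a Gaussian Thompson-sampling bandit over LR multipliers, where the reward fed back after each microbatch would be the observed loss decrease.

```python
import random

class LRBandit:
    """Thompson-sampling bandit over LR multipliers (Megadream 3 sketch).
    Each arm keeps a running Gaussian posterior over its reward; before
    each microbatch we sample from every posterior and use the arm with
    the highest draw, so exploration decays as evidence accumulates."""
    def __init__(self, multipliers=(0.5, 1.0, 2.0), prior_var=1.0):
        self.multipliers = multipliers
        self.n = [0] * len(multipliers)        # pulls per arm
        self.mean = [0.0] * len(multipliers)   # posterior mean reward
        self.prior_var = prior_var

    def pick(self):
        draws = [random.gauss(m, (self.prior_var / (k + 1)) ** 0.5)
                 for m, k in zip(self.mean, self.n)]
        self.last = max(range(len(draws)), key=draws.__getitem__)
        return self.multipliers[self.last]

    def update(self, reward):
        i = self.last
        self.n[i] += 1
        self.mean[i] += (reward - self.mean[i]) / self.n[i]  # running mean
```

Usage would be: `mult = bandit.pick()`, set `group["lr"] = base_lr * mult` on the optimizer's param groups, run the microbatch, then `bandit.update(prev_loss - loss)`.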

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100.
That's where val_bpb drops BELOW comp records.
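
The 4x claim is straightforward arithmetic, made explicit here:

```python
comp_total_steps, comp_gpus = 20_000, 8   # comp record: 20000 steps on 8xH100
our_steps, our_gpus = 10_000, 1           # projected: 10000 steps on 1xH100

comp_per_gpu = comp_total_steps / comp_gpus          # 2500 steps per GPU
ratio = (our_steps / our_gpus) / comp_per_gpu
print(ratio)                                         # 4.0
```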

Key finding: the eval path, not training, currently holds the biggest speed wins.
Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2
Shots 13-14 save 5-8 min per eval pass, more than any single training-side
patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed),
/tmp/phase2_world_speed_research.md (12 research areas surveyed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>