Non-record: No-FA3 stack combination — val_bpb 1.1854 (1-seed, 8xH100) #1442
akaiHuang wants to merge 1 commit into openai:main from
Conversation
Add a non-record submission documenting a stack that runs without Flash Attention 3 (the RunPod default pytorch:2.4.0 image lacks flash_attn_3). 1-seed result: val_bpb 1.1854, beating the OpenAI baseline (1.2244) by 0.039 BPB.

Stack:
- 11L d=512 SP1024
- XSA-all + BigramHash 3072x112 (from PR openai#1019)
- Parallel Muon (from PR openai#399)
- Step-based warmdown=2000/3500 (documents trigger bug)
- Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired, ~100 LOC pipeline)
- Sliding-window eval stride=32, temperature=0.90

No SLOT, no TTT, no validation data accessed during eval. Eval: 322s wall on 8xH100 (within the 600s budget). Single seed only (the record track requires a 3-seed mean).
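A minimal sketch of the step-based warmdown described above, assuming a flat learning rate until step 2000 followed by a linear decay to zero at step 3500. Function and constant names are hypothetical, and the trigger bug the submission documents is not reproduced here:

```python
# Hypothetical sketch (names assumed, not taken from the submission):
# constant LR multiplier until WARMDOWN_START, then linear decay to zero
# at WARMDOWN_END, matching the warmdown=2000/3500 setting.
WARMDOWN_START = 2000
WARMDOWN_END = 3500

def lr_scale(step: int) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < WARMDOWN_START:
        return 1.0
    if step >= WARMDOWN_END:
        return 0.0
    # Linear ramp from 1.0 at WARMDOWN_START down to 0.0 at WARMDOWN_END.
    return (WARMDOWN_END - step) / (WARMDOWN_END - WARMDOWN_START)
```

A schedule like this would typically be wired into the optimizer via something like `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)`.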
Withdrawing — self-audit found several inconsistencies (README/script paths mismatch, eval script import bug, BPB measured on BF16 weights rather than the lzma artifact, undisclosed retrodiction component, missing artifact file). Will fix everything and resubmit cleanly. Sorry for the noise.
Follow-up note (2026-04-12): I self-withdrew this PR after a self-audit found methodological issues, and the stack-combination line of work it explored is no longer being pursued. My focused research effort going forward is #1255 (Non-record: Text Diffusion + Retrodiction + TTT + Depth Recurrence), which uses a unified PyTorch H100 stack (
Summary
Non-record submission documenting a stack combination that runs without Flash Attention 3 (the RunPod default pytorch:2.4.0-py3.11-cuda12.4.1 image lacks flash_attn_3). All current top records require FA3; this submission shows how close one can get on stock PyTorch SDPA.

Stack
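The mixed Q4/Q5/Q6 component of the stack can be illustrated with a generic per-group symmetric round-trip quantizer. This is a sketch only: the group size and the layer-to-bit-width assignment are assumptions, and the submission's ~100 LOC pipeline is not reproduced here.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int, group: int = 32) -> np.ndarray:
    """Round each group of `group` weights to a signed `bits`-bit grid.

    Sketch of per-group symmetric quantization: Q4/Q5/Q6 correspond to
    bits=4/5/6. Returns the dequantized weights, so the round-trip error
    shows what each bit width costs.
    """
    flat = w.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for Q4, 15 for Q5, 31 for Q6
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0               # avoid division by zero for all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)
```

With a mixed scheme, sensitive layers (e.g. embeddings) would get the wider Q6 grid while the bulk of the matmul weights use Q4, trading artifact size against round-trip error.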
Why non-record
What is dropped vs the top stack (and why)
- FA3 — not present in the pytorch:2.4.0 base image. Worth ≈ +1.9 % throughput.
- seq_len=2048

These are intentional trade-offs, not bugs. Documented in the README's "Notes" section.
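For context on the stride-32 sliding-window eval the stack retains, the window indexing can be sketched as follows. SEQ_LEN and the scoring convention are assumptions: each window scores only its last STRIDE tokens, so every scored token sees close to SEQ_LEN tokens of prior context, and BPB is then total NLL in bits divided by the byte count of the eval text.

```python
SEQ_LEN, STRIDE = 1024, 32

def windowed_positions(n_tokens: int):
    """Yield (window_start, score_start, score_end) triples.

    Sliding-window evaluation sketch: the model is run on
    tokens[window_start:score_end], but loss is accumulated only over
    tokens[score_start:score_end] (the last STRIDE positions), so scored
    tokens keep near-maximal left context.
    """
    pos = 0
    while pos < n_tokens:
        window_start = max(0, pos + STRIDE - SEQ_LEN)
        yield window_start, pos, min(pos + STRIDE, n_tokens)
        pos += STRIDE
```

The smaller the stride, the more forward passes per token but the better the average context per scored position — which is why a short stride costs eval wall time.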
Files
- README.md — full submission writeup
- submission.json — leaderboard metadata
- train_gpt.py — training script
- eval.py — evaluation script
- train_seed42.log — training log (BPB curve)
- eval_seed42.log — eval log (final 1.1854)
- pod_environment.txt — nvidia-smi + pip freeze snapshot
- requirements.txt — minimal deps

Reproduction
Single training command + single eval command, both runnable on a fresh pytorch:2.4.0-py3.11-cuda12.4.1 RunPod template. Full instructions in the README.