
Non-record: No-FA3 stack combination — val_bpb 1.1854 (1-seed, 8xH100) #1442

Closed
akaiHuang wants to merge 1 commit into openai:main from akaiHuang:non-record/no-fa3-stack-combo

Conversation

@akaiHuang

Summary

Non-record submission documenting a stack combination that runs without Flash Attention 3 (the default RunPod pytorch:2.4.0-py3.11-cuda12.4.1 image lacks flash_attn_3). All current top records require FA3; this submission shows how close one can get on stock PyTorch SDPA.

  • val_bpb: 1.1854 (1-seed, 8×H100 SXM)
  • Artifact: 13.51 MB (Mixed Q4/Q5/Q6 + lzma)
  • Train: 437 s (within 540 s budget) | Eval: 322 s (within 600 s budget)
  • No SLOT, no TTT, no validation data accessed during eval
  • Improvement vs OpenAI Naive Baseline (1.2244): −0.039 BPB
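Since the whole point is running on stock PyTorch, the attention layer has to degrade gracefully when flash_attn_3 is absent. A minimal sketch of such a fallback follows; the `flash_attn_interface` import name and its return convention are assumptions about the FA3 wheel, not taken from this PR:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: (batch, n_heads, seq, head_dim)
    try:
        # FA3 wheel (name is an assumption); expects (batch, seq, n_heads, head_dim)
        from flash_attn_interface import flash_attn_func
        res = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        out = res[0] if isinstance(res, tuple) else res  # some versions also return LSE
        return out.transpose(1, 2)
    except ImportError:
        # Stock PyTorch SDPA: dispatches to FA2 / memory-efficient kernels when available
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```

On the base image the `except ImportError` branch always runs, which is what the ≈+1.9% throughput gap in the table below is measured against.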

Stack

  • 11L d=512 SP1024
  • XSA on all 11 layers (#478 / #1019)
  • BigramHash 3072 × 112 (#1019)
  • Parallel Muon (#399)
  • Step-based warmdown 2000/3500 (constant LR for ≈57% of steps; this submission also documents a warmdown-trigger bug we hit)
  • Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired per-layer bits, ~100 LOC quant pipeline)
  • LZMA preset 9
  • Sliding-window eval, stride=32, temperature=0.90
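The mixed-precision artifact step can be sketched as below: symmetric quantization at a per-layer bit-width, then a single LZMA pass at preset 9. This is an illustrative reconstruction, not the PR's actual ~100-LOC pipeline; the function names and the per-tensor (rather than per-channel) scaling are assumptions.

```python
import lzma
import numpy as np

def quantize(w: np.ndarray, bits: int):
    # Symmetric per-tensor quantization to `bits` signed integer levels.
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def pack_artifact(layers: dict, bits_for: dict) -> bytes:
    # Quantize each layer at its assigned bit-width (e.g. 4/5/6), then LZMA.
    # A real artifact would also store the scales and layer metadata.
    blobs = []
    for name, w in layers.items():
        q, _scale = quantize(w, bits_for.get(name, 5))
        blobs.append(q.tobytes())
    return lzma.compress(b"".join(blobs), preset=9)
```

Int8 storage plus LZMA is wasteful for 4-bit layers (the real pipeline presumably bit-packs), but the round-trip error bound per weight is still scale/2, which is the property the per-layer bit choice trades against artifact size.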

Why non-record

  1. Single seed only (record-track requires a 3-seed mean for p<0.01).
  2. Does not beat current SOTA (PR #1019, 1.11474 BPB). Roughly rank #16 of 23 among legally merged leaderboard entries.
  3. The submission documents a stack that runs on the stock RunPod PyTorch container without external wheel installs. Useful as a reference point and reproduction floor for anyone hitting the same FA3 install friction.

What is dropped vs the top stack (and why)

  • Flash Attention 3: not in the pytorch:2.4.0 base image; worth ≈ +1.9% throughput.
  • Full Hessian GPTQ + AR self-gen calibration: requires ~500 LOC; mixed Q4/Q5/Q6 is the simpler trade-off.
  • Partial RoPE 16/64: untested in this run.
  • LN scale 1/√(L+1): untested.
  • Tight SWA every 50: untested.
  • Late QAT (LR < 0.15): PR #1248 found this dead-code-eliminated under torch.compile.
  • seq_len=2048: kept seq=1024 to maximize step count without FA3 throughput.

These are intentional trade-offs, not bugs. Documented in the README's "Notes" section.
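For context, the step-based warmdown 2000/3500 kept in the stack corresponds to a schedule of this shape. Only the two step counts come from this PR; the rest is a hedged sketch, including the comment about the trigger bug:

```python
def warmdown_lr(step: int, base_lr: float,
                warmdown_start: int = 2000, total_steps: int = 3500) -> float:
    # Constant LR until `warmdown_start`, then linear decay to 0 at `total_steps`.
    # (The trigger bug noted above: comparing against the wrong step counter can
    # delay or skip this branch entirely, so the decay never fires.)
    if step < warmdown_start:
        return base_lr
    frac = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * max(frac, 0.0)
```

With these defaults the LR is flat for the first 2000 of 3500 steps (≈57%) and then decays linearly over the remaining 1500.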

Files

  • README.md — full submission writeup
  • submission.json — leaderboard metadata
  • train_gpt.py — training script
  • eval.py — evaluation script
  • train_seed42.log — training log (BPB curve)
  • eval_seed42.log — eval log (final 1.1854)
  • pod_environment.txt — nvidia-smi + pip freeze snapshot
  • requirements.txt — minimal deps

Reproduction

Single training command + single eval command, both runnable on a fresh pytorch:2.4.0-py3.11-cuda12.4.1 RunPod template. Full instructions in the README.
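The sliding-window evaluation named in the stack (stride=32, temperature=0.90) can be sketched as follows, assuming a byte-level vocabulary (so bits per token equals bits per byte) and a model that returns next-token logits. Everything beyond the stride and temperature values is an assumption, not the PR's eval.py:

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor,
                       window: int = 1024, stride: int = 32,
                       temperature: float = 0.90) -> float:
    # Score each token once, with up to `window` tokens of left context,
    # advancing `stride` new positions per model call. The very first
    # positions get short context; a full eval would handle that edge.
    T = tokens.size(0)
    total_nll, counted, scored = 0.0, 0, 1  # position 1 is first scorable
    while scored < T:
        end = min(scored + stride, T)        # score positions [scored, end)
        start = max(0, end - window)         # left context for this window
        ctx = tokens[start:end].unsqueeze(0)
        logits = model(ctx[:, :-1]) / temperature  # assumed (1, T-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tok_lp = logp.gather(-1, ctx[:, 1:].unsqueeze(-1)).squeeze(-1)
        n_new = end - scored
        total_nll -= tok_lp[0, -n_new:].sum().item()  # only the fresh positions
        counted += n_new
        scored = end
    return total_nll / counted / math.log(2)  # nats per token -> bits per byte
```

A sanity check: a model emitting uniform logits over a 256-symbol byte vocabulary should score exactly 8.0 BPB regardless of stride, window, or temperature.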

Add a non-record submission documenting a stack that runs without
Flash Attention 3 (the runpod default pytorch:2.4.0 image lacks
flash_attn_3). 1-seed result: val_bpb 1.1854, beating the OpenAI
baseline (1.2244) by -0.039 BPB.

Stack:
- 11L d=512 SP1024
- XSA-all + BigramHash 3072x112 (from PR openai#1019)
- Parallel Muon (from PR openai#399)
- Step-based warmdown=2000/3500 (documents trigger bug)
- Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired, ~100 LOC pipeline)
- Sliding-window eval stride=32, temperature=0.90

No SLOT, no TTT, no validation data accessed during eval.
Eval: 322s wall on 8xH100 (within 600s budget).

Single seed only (record track requires 3-seed mean).
@akaiHuang
Author

Withdrawing — self-audit found several inconsistencies (README/script paths mismatch, eval script import bug, BPB measured on BF16 weights rather than the lzma artifact, undisclosed retrodiction component, missing artifact file). Will fix everything and resubmit cleanly. Sorry for the noise.

@akaiHuang akaiHuang closed this Apr 7, 2026
@akaiHuang
Author

Follow-up note (2026-04-12): I self-withdrew this PR after a self-audit found methodological issues, and the stack-combination line of work it explored is no longer being pursued. My focused research effort going forward is #1255 (Non-record: Text Diffusion + Retrodiction + TTT + Depth Recurrence), which uses a unified PyTorch H100 stack (train_cdm.py), 5-seed multi-seed verification at the true final checkpoint, a matched-compute causal-only control, and an explicit causal-mask leakage test. Linking here so anyone who lands on this PR can find the active discussion in one place.

