
Non-record: No-FA3 stack combination — val_bpb 1.1854 (1-seed, 8xH100) #1442

Closed
akaiHuang wants to merge 1 commit into openai:main from akaiHuang:non-record/no-fa3-stack-combo

Conversation

@akaiHuang

Summary

Non-record submission documenting a stack combination that runs without Flash Attention 3 (the default RunPod pytorch:2.4.0-py3.11-cuda12.4.1 image lacks flash_attn_3). All current top records require FA3; this submission shows how close one can get on stock PyTorch SDPA.

  • val_bpb: 1.1854 (1-seed, 8×H100 SXM)
  • Artifact: 13.51 MB (Mixed Q4/Q5/Q6 + lzma)
  • Train: 437 s (within 540 s budget) | Eval: 322 s (within 600 s budget)
  • No SLOT, no TTT, no validation data accessed during eval
  • Improvement vs OpenAI Naive Baseline (1.2244): −0.039 BPB
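Since the whole point is running on stock PyTorch, the attention layer has to degrade gracefully when flash_attn_3 is absent. A minimal sketch of such a fallback follows; the `flash_attn_interface` import name and its return convention are assumptions about the FA3 wheel, not taken from this PR:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: (batch, n_heads, seq, head_dim)
    try:
        # FA3 wheel (name is an assumption); expects (batch, seq, n_heads, head_dim)
        from flash_attn_interface import flash_attn_func
        res = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        out = res[0] if isinstance(res, tuple) else res  # some versions also return LSE
        return out.transpose(1, 2)
    except ImportError:
        # Stock PyTorch SDPA: dispatches to FA2 / memory-efficient kernels when available
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```

On the base image the `except ImportError` branch always runs, which is what the ≈+1.9% throughput gap in the table below is measured against.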

Stack

  • 11L d=512 SP1024
  • XSA on all 11 layers (#478 / #1019)
  • BigramHash 3072 × 112 (#1019)
  • Parallel Muon (#399)
  • Step-based warmdown 2000/3500 (constant LR for ≈57% of steps; this submission also documents a warmdown-trigger bug we hit)
  • Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired per-layer bits, ~100 LOC quant pipeline)
  • LZMA preset 9
  • Sliding-window eval, stride=32, temperature=0.90
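The mixed-precision artifact step can be sketched as below: symmetric quantization at a per-layer bit-width, then a single LZMA pass at preset 9. This is an illustrative reconstruction, not the PR's actual ~100-LOC pipeline; the function names and the per-tensor (rather than per-channel) scaling are assumptions.

```python
import lzma
import numpy as np

def quantize(w: np.ndarray, bits: int):
    # Symmetric per-tensor quantization to `bits` signed integer levels.
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def pack_artifact(layers: dict, bits_for: dict) -> bytes:
    # Quantize each layer at its assigned bit-width (e.g. 4/5/6), then LZMA.
    # A real artifact would also store the scales and layer metadata.
    blobs = []
    for name, w in layers.items():
        q, _scale = quantize(w, bits_for.get(name, 5))
        blobs.append(q.tobytes())
    return lzma.compress(b"".join(blobs), preset=9)
```

Int8 storage plus LZMA is wasteful for 4-bit layers (the real pipeline presumably bit-packs), but the round-trip error bound per weight is still scale/2, which is the property the per-layer bit choice trades against artifact size.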

Why non-record

  1. Single seed only (record-track requires a 3-seed mean for p<0.01).
  2. Does not beat current SOTA (PR #1019, 1.11474 BPB). Roughly rank #16 of 23 among legally merged leaderboard entries.
  3. The submission documents a stack that runs on the stock RunPod PyTorch container without external wheel installs. Useful as a reference point and reproduction floor for anyone hitting the same FA3 install friction.

What is dropped vs the top stack (and why)

  • Flash Attention 3: not in the pytorch:2.4.0 base image; worth ≈ +1.9% throughput.
  • Full Hessian GPTQ + AR self-gen calibration: requires ~500 LOC; mixed Q4/Q5/Q6 is the simpler trade-off.
  • Partial RoPE 16/64: untested in this run.
  • LN scale 1/√(L+1): untested.
  • Tight SWA every 50: untested.
  • Late QAT (LR < 0.15): PR #1248 found this dead-code-eliminated under torch.compile.
  • seq_len=2048: kept seq=1024 to maximize step count without FA3 throughput.

These are intentional trade-offs, not bugs. Documented in the README's "Notes" section.
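For context, the step-based warmdown 2000/3500 kept in the stack corresponds to a schedule of this shape. Only the two step counts come from this PR; the rest is a hedged sketch, including the comment about the trigger bug:

```python
def warmdown_lr(step: int, base_lr: float,
                warmdown_start: int = 2000, total_steps: int = 3500) -> float:
    # Constant LR until `warmdown_start`, then linear decay to 0 at `total_steps`.
    # (The trigger bug noted above: comparing against the wrong step counter can
    # delay or skip this branch entirely, so the decay never fires.)
    if step < warmdown_start:
        return base_lr
    frac = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * max(frac, 0.0)
```

With these defaults the LR is flat for the first 2000 of 3500 steps (≈57%) and then decays linearly over the remaining 1500.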

Files

  • README.md — full submission writeup
  • submission.json — leaderboard metadata
  • train_gpt.py — training script
  • eval.py — evaluation script
  • train_seed42.log — training log (BPB curve)
  • eval_seed42.log — eval log (final 1.1854)
  • pod_environment.txt — nvidia-smi + pip freeze snapshot
  • requirements.txt — minimal deps

Reproduction

Single training command + single eval command, both runnable on a fresh pytorch:2.4.0-py3.11-cuda12.4.1 RunPod template. Full instructions in the README.
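The sliding-window evaluation named in the stack (stride=32, temperature=0.90) can be sketched as follows, assuming a byte-level vocabulary (so bits per token equals bits per byte) and a model that returns next-token logits. Everything beyond the stride and temperature values is an assumption, not the PR's eval.py:

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor,
                       window: int = 1024, stride: int = 32,
                       temperature: float = 0.90) -> float:
    # Score each token once, with up to `window` tokens of left context,
    # advancing `stride` new positions per model call. The very first
    # positions get short context; a full eval would handle that edge.
    T = tokens.size(0)
    total_nll, counted, scored = 0.0, 0, 1  # position 1 is first scorable
    while scored < T:
        end = min(scored + stride, T)        # score positions [scored, end)
        start = max(0, end - window)         # left context for this window
        ctx = tokens[start:end].unsqueeze(0)
        logits = model(ctx[:, :-1]) / temperature  # assumed (1, T-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tok_lp = logp.gather(-1, ctx[:, 1:].unsqueeze(-1)).squeeze(-1)
        n_new = end - scored
        total_nll -= tok_lp[0, -n_new:].sum().item()  # only the fresh positions
        counted += n_new
        scored = end
    return total_nll / counted / math.log(2)  # nats per token -> bits per byte
```

A sanity check: a model emitting uniform logits over a 256-symbol byte vocabulary should score exactly 8.0 BPB regardless of stride, window, or temperature.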

Add a non-record submission documenting a stack that runs without
Flash Attention 3 (the runpod default pytorch:2.4.0 image lacks
flash_attn_3). 1-seed result: val_bpb 1.1854, beating the OpenAI
baseline (1.2244) by -0.039 BPB.

Stack:
- 11L d=512 SP1024
- XSA-all + BigramHash 3072x112 (from PR openai#1019)
- Parallel Muon (from PR openai#399)
- Step-based warmdown=2000/3500 (documents trigger bug)
- Mixed Q4/Q5/Q6 quantization (Gemma-4 inspired, ~100 LOC pipeline)
- Sliding-window eval stride=32, temperature=0.90

No SLOT, no TTT, no validation data accessed during eval.
Eval: 322s wall on 8xH100 (within 600s budget).

Single seed only (record track requires 3-seed mean).
@akaiHuang
Author

Withdrawing — self-audit found several inconsistencies (README/script paths mismatch, eval script import bug, BPB measured on BF16 weights rather than the lzma artifact, undisclosed retrodiction component, missing artifact file). Will fix everything and resubmit cleanly. Sorry for the noise.

@akaiHuang akaiHuang closed this Apr 7, 2026
@akaiHuang
Author

Follow-up note (2026-04-12): I self-withdrew this PR after a self-audit found methodological issues, and the stack-combination line of work it explored is no longer being pursued. My focused research effort going forward is #1255 (Non-record: Text Diffusion + Retrodiction + TTT + Depth Recurrence), which uses a unified PyTorch H100 stack (train_cdm.py), 5-seed multi-seed verification at the true final checkpoint, a matched-compute causal-only control, and an explicit causal-mask leakage test. Linking here so anyone who lands on this PR can find the active discussion in one place.

