Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)#1528
xiehuanyi wants to merge 2 commits into openai:main from
Conversation
Longer-context + longer-training variant of the ValCalib_GPTQ_XSA_BigramHash3072 stack. Moves TRAIN_SEQ_LEN 1024 -> 2048 and runs for 4h on 1x A100 (no H100 available), which together bring sliding-window int6 BPB from 1.1317 (s1024, 2h) down to 1.11044406 (s2048, 4h).

Non-record because the submission was trained on 1x A100 for 240 minutes (roughly equivalent to 76-80 H100-minutes, close to the 80 H100-minute official budget) rather than on the required 8xH100 x 10min hardware.

Artifact: 15.94 MB int6+lzma, total submission 16.04 MB (under 16 MiB limit). Model: 27M params, 11L 512d 3xMLP, XSA-all, BigramHash(2048), PartialRoPE(16/64), LN Scale, SmearGate, Muon+AdamW WD=0.04, EMA(0.997 deferred), SWA, Late QAT@0.15, Int6 GPTQ with self-generated AR calibration, LZMA preset=9, sliding window eval stride=64.

Currently single-seed (1337). Seeds 42 and 999 are running and will be added to submission.json once complete.
Pull request overview
Adds a new non-record leaderboard submission under track_non_record_16mb for an 11-layer full-stack model trained at seq_len=2048 for 4h on 1×A100, reporting val_bpb=1.11044406 with int6 GPTQ + LZMA and sliding-window eval (stride=64).
Changes:
- Adds the full submission bundle (training script, run log, metadata JSON, README, requirements) for 2026-04-10_s2048_4h_1xA100_1.1104.
- Updates the training script for A100 environments (FA2/SDP attention fallback, deferred EMA start, Python 3.9 compatibility).
- Records reported metrics, artifact sizes, and reproduction instructions.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_gpt.py | Training/eval/quantization script used to produce the submission artifact and metrics |
| records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_seed1337.log | Captured run log with reported metrics and byte sizes |
| records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/submission.json | Structured metadata for the submission (metrics, sizes, config) |
| records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/README.md | Human-readable summary, numbers, and reproduction command |
| records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/requirements.txt | Minimal dependency list for reproducing the run |
```python
seq_len = eval_seq_len or args.train_seq_len
total_tokens = val_tokens.numel() - 1
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]
total_windows = len(window_starts)
```
eval_val_sliding currently includes window starts all the way to total_tokens, which creates short tail windows (wlen < seq_len). For ws>0 these tail windows score tokens that were already scored by the last full window, slightly over-weighting the end of the validation set and contradicting the “every token scored exactly once” sliding-window definition used elsewhere (e.g. the TTT window_starts filter in this file). Consider restricting window_starts to full windows (ws <= total_tokens - seq_len) and/or filtering with wlen >= stride or ws == 0 to avoid double-counting.
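A minimal sketch of the suggested filter (function name hypothetical): keep only full windows, plus a single window at 0 when the validation split is shorter than one window, so no tail window re-scores tokens covered by the last full window.

```python
def sliding_window_starts(total_tokens: int, seq_len: int, stride: int) -> list:
    """Window starts for sliding-window eval where every token is scored
    exactly once: full windows only (ws + seq_len <= total_tokens), with
    ws == 0 always kept so short validation splits still get one window."""
    return [ws for ws in range(0, total_tokens, stride)
            if ws == 0 or ws + seq_len <= total_tokens]

# 10 tokens, window 4, stride 2: no short tail window at ws=8
print(sliding_window_starts(10, 4, 2))  # [0, 2, 4, 6]
# Data shorter than one window: keep the single initial window
print(sliding_window_starts(3, 4, 2))   # [0]
```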
```python
log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
log0(f"attn_backend:{_ATTN_BACKEND} sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
```
The logged SDP backend flags are hard-coded (mem_efficient=False), but earlier you call enable_mem_efficient_sdp(True). This makes the run metadata in train_seed1337.log misleading; please either query the actual backend settings or update the log string to match what is enabled.
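One way to keep the log honest is to format the line from queried values rather than a hard-coded string. A minimal sketch (helper name hypothetical; in the real script the flag values would come from PyTorch's query functions such as `torch.backends.cuda.flash_sdp_enabled()` / `mem_efficient_sdp_enabled()` / `math_sdp_enabled()`, assumed available in recent PyTorch):

```python
def sdp_flags_line(flags: dict) -> str:
    """Build the sdp_backends log fragment from the actual backend settings,
    so the run log can never drift from what was enabled."""
    return "sdp_backends:" + " ".join(f"{k}={v}" for k, v in flags.items())

# Stub values for illustration; replace with the queried PyTorch flags.
flags = {"cudnn": False, "flash": True, "mem_efficient": True, "math": False}
print(sdp_flags_line(flags))
```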
```python
try:
    import zstandard
    _COMPRESSOR = "zstd"
except ImportError:
    _COMPRESSOR = "zlib"
import numpy as np
```
The optional zstandard import and _COMPRESSOR selection appear unused (no references elsewhere in this script), while the submission path always uses LZMA. This adds dead code and an unnecessary dependency surface; consider removing _COMPRESSOR/zstandard/zlib if they’re not meant to be toggled, or wire them into an actual compressor option.
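If the toggle is meant to stay, one way to wire it in is a single dispatch point so `_COMPRESSOR` is actually consulted. A minimal sketch under that assumption (function name hypothetical; `zstandard` API per its documented `ZstdCompressor`):

```python
import lzma
import zlib

try:
    import zstandard  # optional third-party dependency
    _COMPRESSOR = "zstd"
except ImportError:
    _COMPRESSOR = "lzma"  # the path this submission actually uses

def compress_blob(raw: bytes, method: str = _COMPRESSOR) -> bytes:
    """Route serialization through the selected compressor instead of
    leaving the _COMPRESSOR flag dead."""
    if method == "zstd":
        return zstandard.ZstdCompressor(level=19).compress(raw)
    if method == "lzma":
        return lzma.compress(raw, preset=9)
    return zlib.compress(raw, level=9)
```

Otherwise, dropping the unused import block entirely is the simpler fix.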
```json
      "val_bpb_int6_roundtrip": 1.13437381,
      "artifact_bytes": 15941100,
      "total_submission_bytes": 16040603,
      "steps": 14065,
      "step_avg_ms": 1023.86
    }
  },
  "artifact_bytes_max": 15941100,
  "bytes_total": 16040603,
```
artifact_bytes / artifact_bytes_max don’t match the sizes in the included train_seed1337.log. The log reports Serialized model int6+lzma: 15920436 bytes and Total submission size ...: 16040603 bytes, implying artifact_bytes should be 15920436 (and code bytes ~120167), not 15941100. Please recompute these fields from the actual final_model.int6.ptz and script size so metadata stays self-consistent.
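Deriving the metadata from the files on disk avoids this class of drift. A minimal sketch (function and file names hypothetical):

```python
import os

def submission_sizes(artifact_path: str, code_paths: list) -> dict:
    """Compute submission byte counts from the actual files so the JSON
    metadata always matches the shipped artifact and scripts."""
    artifact_bytes = os.path.getsize(artifact_path)
    code_bytes = sum(os.path.getsize(p) for p in code_paths)
    return {
        "artifact_bytes": artifact_bytes,
        "code_bytes": code_bytes,
        "total_submission_bytes": artifact_bytes + code_bytes,
    }
```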
| Peak memory | 16.3 GiB |
| Model params | 26,993,756 |
| Artifact bytes (int6+lzma) | 15,941,100 |
| **Total (code + artifact)** | **16,040,603** (under 16 MiB = 16,777,216) |
The README’s artifact/total byte counts appear inconsistent with the included training log. train_seed1337.log reports Serialized model int6+lzma: 15920436 bytes and Code size: 120167 bytes (total 16040603), but this README lists Artifact bytes ... 15,941,100. Please update the README numbers to match the actual generated files (or regenerate the log/README from the same run) so readers can verify the 16 MiB constraint.
Replaces the earlier 1.1104 non-record submission with a much stronger result that reproduces the PR openai#1493 SOTA 1.0810 recipe on 1xA100 for 4h instead of the required 8xH100 for 10min.

Key numbers (seed 1337):
- Int6 Sliding Window: 1.07266 BPB (beats upstream SOTA 1.0827 by -0.0100)
- Int6 + Legal TTT: 1.07035 BPB (beats upstream SOTA 1.0810 by -0.0107)
- Pre-quant post-EMA: 1.07610 BPB
- Steps trained: 6371 (wallclock capped at 4h)
- Total submission: 16,019,227 bytes (under 16 MiB)

This is the exact PR openai#1493 SOTA recipe (SP8192 + 3-layer recurrence + parallel residuals layer 7+ + QK-Gain 5.25 + MuonEq-R + SDClip GPTQ + Brotli + byte shuffle + legal score-first TTT) with three A100 adaptations:
1. FA3 -> PyTorch SDP fallback with manual GQA head-repeat (A100 doesn't support FA3)
2. Python 3.9 compatibility (removed zip(strict=True) and nested double-quoted f-strings)
3. GRAD_ACCUM_STEPS env override for single-GPU runs

Three seeds of the same config ran (exp60, exp61, exp62). exp60/62 crashed in their own eval phase with a torch.compile recompile issue when creating a fresh GPT instance after training; the saved quantized artifacts were then evaluated successfully via a standalone eval_only.py script. exp62 (QK_GAIN_INIT=5.25, the exact SOTA record value) beat exp60/exp61 (QK_GAIN_INIT=5.0, the script default) consistently across quant/sliding/TTT metrics, matching the "monotonic improvement from 4.0 to 5.25" observation in the SOTA paper. Still single-seed; the 3-seed mean has not yet been run due to time constraints.
Community Review — Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)

BPB: 1.07035 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (at the head SHA): the TTT path at line 356 implements the score-first-per-chunk pattern: each chunk is scored under the pre-update adapter state. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 12.31s, dim=512, layers=11, vocab=8192, code=49104 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
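The score-first-per-chunk TTT pattern described in the review can be sketched as a tiny loop (function and callback names hypothetical): each chunk's loss is computed before the adapter is allowed to train on that chunk, so no token's score ever benefits from having been trained on.

```python
def ttt_eval_score_first(chunks, score, update):
    """Legal score-first-per-chunk TTT evaluation loop.

    score(chunk)  -> loss of the chunk under the *current* adapter state
    update(chunk) -> adapts the adapter on the chunk, called only after scoring
    """
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # score under pre-update weights
        update(chunk)                # only then train on the same chunk
    return sum(losses) / len(losses)
```

The illegal variant would swap the two calls (or loop multiple epochs over a chunk before scoring it), which is exactly what the classifier's caveat about multi-epoch TTT refers to.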
Summary
UPDATED 2026-04-11: Replaces the earlier 1.1104 BPB result with a much stronger 1.07035 BPB (TTT) / 1.07266 (sliding) using the exact PR #1493 SOTA recipe (SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + MuonEq-R + SDClip GPTQ + Brotli + legal score-first TTT), adapted for 1×A100 instead of 8×H100.
Beats upstream main-leaderboard SOTA:
Still non-record because the run was on 1×A100 for 4h (≈80 H100-minute-equivalent of raw BF16 throughput, but not on the required hardware and without FA3).
What's in this PR
The training script is the decompressed PR #1493 train_gpt.py (their LZMA+base85 one-liner) with three minimal adaptations for Ampere + Python 3.9:
1. Attention tries `flash_attn` (FA2) first, then falls through to PyTorch `scaled_dot_product_attention` with the flash backend. The SDP path adds a manual GQA head-repeat (PyTorch SDP doesn't natively support `num_heads != num_kv_heads`).
2. Python 3.9 compatibility: removed `zip(strict=True)` and nested double-quoted f-strings.
3. `GRAD_ACCUM_STEPS` env override, added so single-GPU runs can override the default `8 // world_size`. Not actually used in this submission (defaults kept), but left in for flexibility.

Everything else is identical to PR #1493: SP8192 vocab, 11L×512d×8H/4KV, MLP 4x, depth recurrence looping layers 3-5 (17 virtual from 11 physical, activated at frac=0.35), parallel residuals layer 7+, QK-Gain 5.25, skip gates, MuonEq-R + AdamW, WD=0.095, EMA=0.9965, warmdown_frac=0.72, matrix_lr=0.022, GPTQ SDClip (k=12.85 mats / k=20.0 embs), int6 attn+mlp / int8 embs, Brotli-11 + byte shuffle, legal score-first TTT (SGD lr=0.005 mom=0.9, 3 epochs/32K chunk).
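The manual GQA head-repeat mentioned above reduces to a simple index mapping; a minimal sketch of that mapping (function name hypothetical — in the actual SDP path this would typically be realized by a `repeat_interleave` of K/V along the head dimension before calling `scaled_dot_product_attention`):

```python
def kv_head_for_query_head(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """GQA maps each group of num_heads // num_kv_heads consecutive query
    heads to one shared KV head; repeating K/V per this mapping lets a
    backend that expects matched head counts run grouped-query attention."""
    group_size = num_heads // num_kv_heads
    return q_head // group_size

# With this submission's 8 query heads over 4 KV heads, query heads 0-1
# share KV head 0, heads 2-3 share KV head 1, and so on.
print([kv_head_for_query_head(h, 8, 4) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```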
Numbers (seed 1337)
Hardware equivalence
(H100 BF16 ≈ 3.17× A100 BF16, plus FA3 is Hopper-only so there's an additional ~1.5× gap we don't close)
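The equivalence claim is just this arithmetic (throughput ratio taken from the text above; the FA3 gap is on top of it and not modeled here):

```python
# 240 A100-minutes converted through the quoted BF16 throughput ratio.
a100_minutes = 4 * 60
throughput_ratio = 3.17  # H100 BF16 / A100 BF16, approximate
h100_equivalent_minutes = a100_minutes / throughput_ratio
print(round(h100_equivalent_minutes, 1))  # 75.7, close to the 80-minute budget
```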
Comparison with exp60 / exp61 (same training config, different QK_gain)
Three runs of the same config differing only in `QK_GAIN_INIT`. The SOTA record's non-default QK_GAIN_INIT=5.25 consistently helps all three quant/eval phases, confirming the paper's "monotonic improvement from 4.0 to 5.25" observation.
Caveats
- The saved quantized artifacts from the crashed exp60/exp62 runs were evaluated via the standalone `eval_only.py` script. exp61 completed its full eval pipeline natively.
- A `grad_accum=2` variant (exp63/64) OOM'd at startup: the SOTA model with MLP 4× + depth recurrence has a per-micro-batch footprint that doesn't fit on A100 at a quarter of the accumulation.

Test plan
Files:
- `README.md` with full recipe + numbers + reproduction commands
- `submission.json` with structured metadata
- `train_gpt.py` — A100-adapted SOTA script
- `final_model.int6.ptz` (15.97 MB)
- `train_seed1337.log` + `eval_seed1337.log`
- `requirements.txt`