
Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)#347

Open
FlashyFlash3011 wants to merge 34 commits into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l

Conversation


@FlashyFlash3011 FlashyFlash3011 commented Mar 21, 2026

Submission

Experiment: `records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/`


Strategy: Pure Velocity & TTT Preservation

Initial attempts tried to maximize model capacity (GatedAttention, ValueResidual, BigramHash=2048). Ablations showed these features add ~1.5ms/step overhead and destabilize TTT, costing more in training steps than they gain in quality under the 10min/16MB constraint. The winning strategy strips the model to its leanest form.

Results (8×H100 80GB SXM)

| Seed | step_avg | Steps | Pre-TTT BPB | Post-TTT BPB | TTT Gain | TTT Time | Artifact |
|------|----------|-------|-------------|--------------|----------|----------|----------|
| 1337 | 83.87ms | 7155 | 1.12163921 | 1.11901233 | -0.00262688 | 421.9s | 15.851MB |
| 42 | 83.86ms | 7156 | 1.12228806 | 1.11960558 | -0.00268248 | 423.2s | 15.858MB |
| 2025 | 83.89ms | 7154 | 1.12197720 | 1.11920302 | -0.00277418 | 423.4s | 15.888MB |
| Mean | 83.87ms | 7155 | 1.12196816 | 1.11927364 | -0.00269451 | 422.8s | 15.866MB |

Key Changes

| Change | Why |
|--------|-----|
| `GATED_ATTENTION=0`, `VALUE_RESIDUAL=0` | +1.5ms/step overhead → 130+ lost training steps in 600s |
| `SWA_ENABLED=0` | Was copying hundreds of MB GPU→CPU every 50 steps — EMA is used at the end, not SWA |
| `BANK_QAT_THRESHOLD=0` | Was snapping FP32 TTT weights back to Int6 mid-evaluation, causing catastrophic forgetting |
| `LATE_QAT_THRESHOLD=0.15` | QAT only in final 15% of warmdown — no overhead during main training |
| `TRAIN_SEQ_LEN=2048` | Allows full warmdown (7155 steps vs ~5776 at 4096 ctx) |
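The `LATE_QAT_THRESHOLD=0.15` gating above can be sketched as a step-based predicate. This is a minimal illustration of the idea (fake-quant is only switched on once the final fraction of the warmdown is reached); the function and argument names are hypothetical, not taken from `train_gpt.py`:

```python
LATE_QAT_THRESHOLD = 0.15  # QAT only in the final 15% of warmdown

def qat_active(step: int, warmdown_start: int, total_steps: int) -> bool:
    """Enable fake-quant only late in the warmdown phase (illustrative sketch)."""
    if step < warmdown_start:
        return False  # no QAT overhead during main training
    warmdown_len = total_steps - warmdown_start
    progress = (step - warmdown_start) / warmdown_len  # 0.0 -> 1.0 through warmdown
    return progress >= 1.0 - LATE_QAT_THRESHOLD
```

With ~7155 total steps this keeps the quantization noise out of the loss for the entire main run, while still giving the weights a few hundred steps to adapt to the export precision.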

Features Explored but Disabled

These were implemented and tested but hurt under the 10min/16MB constraint. They remain in the codebase and are expected to help significantly with more budget:

| Feature | Why disabled | Why it helps with more budget |
|---------|--------------|-------------------------------|
| GatedAttention, ValueResidual | +1.5ms/step → 130+ lost steps | Legitimate architectural gains with 30min+ training |
| BigramHash=2048 | Pushed artifact over 16MB | Better subword context modeling |
| QAT from step 1 | Overhead throughout training | Full-run quant adaptation reduces post-quant degradation |
| BANK_QAT_THRESHOLD > 0 | Corrupts TTT weights | Enables aggressive compression of larger models |

Headroom & Scaling Evidence

Submission sits at 15.851–15.888MB across seeds (mean 15.866MB) — ~134KB under the 16MB limit. Attempts to fill headroom (BigramHash=1664, 2048) produced worse BPB and exceeded the size limit. In an uncapped scenario, all disabled levers can be opened simultaneously for significantly better BPB.

Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64x4096 = 256x1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
- warmdown_iters: 1600 -> 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 -> 41832 (proper NTK formula: 10000 x 4^(64/62)
  instead of naive 4x multiplication)
…penai#549)
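The `rope_base: 40000 -> 41832` change follows the NTK-aware scaling rule quoted in the commit message, `base' = base * s^(d / (d - 2))`, rather than the naive `base * s`. A small sketch, assuming `s = 4` is the context-scale factor and `d = 64` the rotary head dimension (both implied by the `4^(64/62)` exponent; the helper name is illustrative):

```python
def ntk_rope_base(base: float, ctx_scale: float, head_dim: int) -> float:
    """NTK-aware RoPE base scaling: base * s^(d / (d - 2))."""
    return base * ctx_scale ** (head_dim / (head_dim - 2))

# ntk_rope_base(10000.0, 4.0, 64) ≈ 4.18e4, close to the 41832 quoted above,
# versus the naive 10000 * 4 = 40000.
```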

- train_seq_len and eval_seq_len raised 2048 -> 4096
- All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate,
  BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA,
  GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT
- Dynamic NTK auto-scales rope_base to ~48550 for 4096 context
- SDPA fallback added for flash_attn_3 unavailability (local testing)
- rocm-smi fallback for nvidia-smi on ROCm hardware
- Update QAT Int4 expected BPB estimate to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0)
  but export was int4 — fixed to /7.0 clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix + adds int4 pack/unpack/quant
  functions and switches export from int6 to int4

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with
  QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus
  _fake_quant_int4_bank() applied to all bank weight slices in the forward
  pass — first time the ~95% of params in qo/kv/mlp banks are QAT-prepared

Also: add zstandard to requirements.txt; add missing README/submission.json
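The int4 nibble-packing that makes 16 layers fit in the 16MB budget can be illustrated with a minimal pack/unpack pair, two signed int4 values per byte, halving storage versus int8. This is a hypothetical sketch, not the PR's actual pack/unpack helpers:

```python
def pack_int4(vals):
    """Pack signed int4 values (-8..7) two per byte, low nibble first."""
    if len(vals) % 2:
        vals = vals + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(vals[::2], vals[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, n):
    """Inverse of pack_int4: recover n signed int4 values."""
    vals = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)  # sign-extend
    return vals[:n]
```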
@FlashyFlash3011 FlashyFlash3011 changed the title LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers Mar 25, 2026
FlashyFlash3011 and others added 18 commits March 25, 2026 18:29
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6) not MiB (1024^2) matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
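The `target_mb` fix above hinges on decimal megabytes (1e6 bytes, as the competition rules count) versus mebibytes (1024² bytes). An illustrative check of the difference, with hypothetical names:

```python
MB = 1_000_000  # decimal MB per competition rules, NOT 1024**2 (MiB)
TARGET_MB = 16

def fits_budget(artifact_bytes: int) -> bool:
    return artifact_bytes <= TARGET_MB * MB

assert fits_budget(15_866_000)            # the 15.866MB mean artifact fits
assert not fits_budget(16 * 1024 * 1024)  # but "16 MiB" = 16,777,216 B does not
```

Budgeting against MiB would over-allocate by roughly 4.9%, which at 15.9MB is the difference between passing and failing the cap.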
@FlashyFlash3011 FlashyFlash3011 deleted the flashyflash3011/long-context-4096-qat-int4-16l branch March 27, 2026 13:06
@FlashyFlash3011 FlashyFlash3011 changed the title LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) Mar 27, 2026
@FlashyFlash3011 FlashyFlash3011 marked this pull request as ready for review March 30, 2026 14:17
@FlashyFlash3011 FlashyFlash3011 marked this pull request as draft March 30, 2026 14:18
@FlashyFlash3011 FlashyFlash3011 marked this pull request as ready for review March 30, 2026 14:28
@FlashyFlash3011 FlashyFlash3011 changed the title GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) Mar 30, 2026
@MatoTeziTanka

Community Review — Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 92e02e056713, file records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/train_gpt.py):

The TTT path at line 1105 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
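The score-first-per-chunk control flow the review describes can be sketched framework-agnostically. Here `score` stands in for the `no_grad` scoring pass and `adapt` for the `base_model.train()` + SGD step; names are illustrative, not the PR's code:

```python
def score_first_ttt(score, adapt, chunks):
    """Legal TTT shape: chunk i is scored under weights adapted only on
    chunks 0..i-1, and the final chunk gets no adaptation pass."""
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score(chunk))   # score BEFORE adapting on this chunk
        if i < len(chunks) - 1:       # is_last_chunk guard
            adapt(chunk)              # adapt on the chunk just scored
    return losses
```

The legality argument is entirely about this ordering: no token's score ever depends on weights that have already seen that token.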

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=93765 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
