
Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)#347

Open
FlashyFlash3011 wants to merge 34 commits into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l

Conversation


@FlashyFlash3011 FlashyFlash3011 commented Mar 21, 2026

Submission

Experiment: `records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/`


Strategy: Pure Velocity & TTT Preservation

Initial attempts tried to maximize model capacity (GatedAttention, ValueResidual, BigramHash=2048). Ablations showed these features add ~1.5ms/step overhead and destabilize TTT, costing more in training steps than they gain in quality under the 10min/16MB constraint. The winning strategy strips the model to its leanest form.

Results (8×H100 80GB SXM)

| Seed | step_avg | Steps | Pre-TTT BPB | Post-TTT BPB | TTT Gain | TTT Time | Artifact |
|------|----------|-------|-------------|--------------|----------|----------|----------|
| 1337 | 83.87ms | 7155 | 1.12163921 | 1.11901233 | -0.00262688 | 421.9s | 15.851MB |
| 42 | 83.86ms | 7156 | 1.12228806 | 1.11960558 | -0.00268248 | 423.2s | 15.858MB |
| 2025 | 83.89ms | 7154 | 1.12197720 | 1.11920302 | -0.00277418 | 423.4s | 15.888MB |
| Mean | 83.87ms | 7155 | 1.12196816 | 1.11927364 | -0.00269451 | 422.8s | 15.866MB |

Key Changes

| Change | Why |
|--------|-----|
| `GATED_ATTENTION=0`, `VALUE_RESIDUAL=0` | +1.5ms/step overhead → 130+ lost training steps in 600s |
| `SWA_ENABLED=0` | Was copying hundreds of MB GPU→CPU every 50 steps — EMA is used at the end, not SWA |
| `BANK_QAT_THRESHOLD=0` | Was snapping FP32 TTT weights back to Int6 mid-evaluation, causing catastrophic forgetting |
| `LATE_QAT_THRESHOLD=0.15` | QAT only in final 15% of warmdown — no overhead during main training |
| `TRAIN_SEQ_LEN=2048` | Allows full warmdown (7155 steps vs ~5776 at 4096 ctx) |
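The `LATE_QAT_THRESHOLD=0.15` gating above can be sketched as a step-based predicate. This is a minimal illustration of the idea (fake-quant is only switched on once the final fraction of the warmdown is reached); the function and argument names are hypothetical, not taken from `train_gpt.py`:

```python
LATE_QAT_THRESHOLD = 0.15  # QAT only in the final 15% of warmdown

def qat_active(step: int, warmdown_start: int, total_steps: int) -> bool:
    """Enable fake-quant only late in the warmdown phase (illustrative sketch)."""
    if step < warmdown_start:
        return False  # no QAT overhead during main training
    warmdown_len = total_steps - warmdown_start
    progress = (step - warmdown_start) / warmdown_len  # 0.0 -> 1.0 through warmdown
    return progress >= 1.0 - LATE_QAT_THRESHOLD
```

With ~7155 total steps this keeps the quantization noise out of the loss for the entire main run, while still giving the weights a few hundred steps to adapt to the export precision.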

Features Explored but Disabled

These were implemented and tested but hurt under the 10min/16MB constraint. They remain in the codebase and are expected to help significantly with more budget:

| Feature | Why disabled | Why it helps with more budget |
|---------|--------------|-------------------------------|
| GatedAttention, ValueResidual | +1.5ms/step → 130+ lost steps | Legitimate architectural gains with 30min+ training |
| BigramHash=2048 | Pushed artifact over 16MB | Better subword context modeling |
| QAT from step 1 | Overhead throughout training | Full-run quant adaptation reduces post-quant degradation |
| BANK_QAT_THRESHOLD > 0 | Corrupts TTT weights | Enables aggressive compression of larger models |

Headroom & Scaling Evidence

Submission sits at 15.851–15.888MB across seeds (mean 15.866MB) — ~134KB under the 16MB limit. Attempts to fill headroom (BigramHash=1664, 2048) produced worse BPB and exceeded the size limit. In an uncapped scenario, all disabled levers can be opened simultaneously for significantly better BPB.

Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64x4096 = 256x1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
- warmdown_iters: 1600 -> 800 (~12% of ~6700 steps vs prior 24%)
- rope_base: 40000 -> 41832 (proper NTK formula: 10000 x 4^(64/62)
  instead of naive 4x multiplication)
…penai#549)
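The `rope_base: 40000 -> 41832` change follows the NTK-aware scaling rule quoted in the commit message, `base' = base * s^(d / (d - 2))`, rather than the naive `base * s`. A small sketch, assuming `s = 4` is the context-scale factor and `d = 64` the rotary head dimension (both implied by the `4^(64/62)` exponent; the helper name is illustrative):

```python
def ntk_rope_base(base: float, ctx_scale: float, head_dim: int) -> float:
    """NTK-aware RoPE base scaling: base * s^(d / (d - 2))."""
    return base * ctx_scale ** (head_dim / (head_dim - 2))

# ntk_rope_base(10000.0, 4.0, 64) ≈ 4.18e4, close to the 41832 quoted above,
# versus the naive 10000 * 4 = 40000.
```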

- train_seq_len and eval_seq_len raised 2048 -> 4096
- All SOTA techniques inherited: 11L, LeakyReLU(0.5)^2, SmearGate,
  BigramHash, XSA-4, Partial RoPE, LN Scale, VE128, EMA+SWA,
  GPTQ-lite, Parallel Muon, OrthoInit, Legal TTT
- Dynamic NTK auto-scales rope_base to ~48550 for 4096 context
- SDPA fallback added for flash_attn_3 unavailability (local testing)
- rocm-smi fallback for nvidia-smi on ROCm hardware
- Update QAT Int4 expected BPB estimate to ~1.13-1.14
Fixes:
- LongContext4096_Int4_16L_FullSOTA: CastedLinear fake-quant was 6-bit (/31.0)
  but export was int4 — fixed to /7.0 clamp(-8,7) to match export precision
- QAT_Int4_16L_FullSOTA: same CastedLinear fix + adds int4 pack/unpack/quant
  functions and switches export from int6 to int4

New scripts:
- 2026-03-25_LongContext4096_Int6_QAT (safe): LongContext4096_FullSOTA with
  QAT_ENABLED=1 by default so 6-bit QAT runs from step 1, late_qat_threshold=0.0
- 2026-03-25_LongContext4096_Int4_BankQAT (risky): same Int4 stack plus
  _fake_quant_int4_bank() applied to all bank weight slices in the forward
  pass — first time the ~95% of params in qo/kv/mlp banks are QAT-prepared

Also: add zstandard to requirements.txt; add missing README/submission.json
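The int4 nibble-packing that makes 16 layers fit in the 16MB budget can be illustrated with a minimal pack/unpack pair, two signed int4 values per byte, halving storage versus int8. This is a hypothetical sketch, not the PR's actual pack/unpack helpers:

```python
def pack_int4(vals):
    """Pack signed int4 values (-8..7) two per byte, low nibble first."""
    if len(vals) % 2:
        vals = vals + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(vals[::2], vals[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, n):
    """Inverse of pack_int4: recover n signed int4 values."""
    vals = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)  # sign-extend
    return vals[:n]
```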
@FlashyFlash3011 FlashyFlash3011 changed the title LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers Mar 25, 2026
FlashyFlash3011 and others added 18 commits March 25, 2026 18:29
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6) not MiB (1024^2) matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
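The `target_mb` fix above hinges on decimal megabytes (1e6 bytes, as the competition rules count) versus mebibytes (1024² bytes). An illustrative check of the difference, with hypothetical names:

```python
MB = 1_000_000  # decimal MB per competition rules, NOT 1024**2 (MiB)
TARGET_MB = 16

def fits_budget(artifact_bytes: int) -> bool:
    return artifact_bytes <= TARGET_MB * MB

assert fits_budget(15_866_000)            # the 15.866MB mean artifact fits
assert not fits_budget(16 * 1024 * 1024)  # but "16 MiB" = 16,777,216 B does not
```

Budgeting against MiB would over-allocate by roughly 4.9%, which at 15.9MB is the difference between passing and failing the cap.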
@FlashyFlash3011 FlashyFlash3011 deleted the flashyflash3011/long-context-4096-qat-int4-16l branch March 27, 2026 13:06
@FlashyFlash3011 FlashyFlash3011 changed the title LongContext 4096 + Full SOTA Stack + QAT Int4/Int6 → 16 Layers GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) Mar 27, 2026
@FlashyFlash3011 FlashyFlash3011 marked this pull request as ready for review March 30, 2026 14:17
@FlashyFlash3011 FlashyFlash3011 marked this pull request as draft March 30, 2026 14:18
@FlashyFlash3011 FlashyFlash3011 marked this pull request as ready for review March 30, 2026 14:28
@FlashyFlash3011 FlashyFlash3011 changed the title GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048) Mar 30, 2026
@MatoTeziTanka

Community Review — Record: GatedAttn + ValueResidual + Full QAT + lzma-9 + BigramHash(2048)

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 92e02e056713, file records/track_10min_16mb/2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT/train_gpt.py):

The TTT path at line 1105 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
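The score-first-per-chunk control flow the review describes can be sketched framework-agnostically. Here `score` stands in for the `no_grad` scoring pass and `adapt` for the `base_model.train()` + SGD step; names are illustrative, not the PR's code:

```python
def score_first_ttt(score, adapt, chunks):
    """Legal TTT shape: chunk i is scored under weights adapted only on
    chunks 0..i-1, and the final chunk gets no adaptation pass."""
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score(chunk))   # score BEFORE adapting on this chunk
        if i < len(chunks) - 1:       # is_last_chunk guard
            adapt(chunk)              # adapt on the chunk just scored
    return losses
```

The legality argument is entirely about this ordering: no token's score ever depends on weights that have already seen that token.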

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=93765 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
