Record: SP1024 + SLOT-24 + QK5.25 + Pre-Quant AdamW TTT — val_bpb 0.8265 (3-seed mean)#1488
Closed
ndokutovich wants to merge 1 commit into openai:main
Conversation
…65 (3-seed mean)

SLOT + pre-quant TTT combo on the openai#1313 base.
- seed 42: 0.82329038
- seed 1337: 0.82916457
- seed 2024: 0.82694986
- mean: 0.82646827 (std 0.0029)
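The 3-seed aggregation above is reproducible with standard sample statistics. A minimal sketch (the per-seed values are taken verbatim from the summary line; the reported std of 0.0029 is the sample standard deviation, truncated rather than rounded):

```python
import statistics

# Per-seed val_bpb from the summary above
runs = {42: 0.82329038, 1337: 0.82916457, 2024: 0.82694986}

mean = statistics.mean(runs.values())
# statistics.stdev is the sample std (ddof=1); it comes out ~0.00297,
# consistent with the reported "std 0.0029"
std = statistics.stdev(runs.values())

print(f"mean: {mean:.8f}, sample std: {std:.5f}")
```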
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request on Apr 9, 2026
EMA_DECAY env var (default 0.997; sota_32 uses 0.9965):
- PR openai#1435 shows EMA=0.9965 beats 0.997 by +0.017 BPB (1.0980 vs 1.1147)
- args.ema_decay_param wired in to replace the hardcoded 0.997

RECUR_LAYERS=4,5 at step 3000 (PR openai#1435):
- 13 virtual layers from 11 physical (vs 3,4,5 = 14 virtual)
- PR openai#1435 config: activate at step 3000

SLOT code present but DISABLED (SLOT_ENABLED=0 by default):
- eval_val_slot(), forward_hidden(), compute_logits() added to train_gpt_sota_28.py
- SLOT is a retroactive 2-pass scheme: it optimizes the delta on the same tokens it scores, so it is not causal
- All SLOT PRs (openai#1313, openai#1488) remain unmerged

Expected: ~1.095-1.10 BPB (WD=0.04 + EMA=0.9965 + RECUR PR openai#1435 config)
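The EMA_DECAY setting controls a standard exponential moving average of the weights. A minimal sketch of why the smaller decay matters (the update rule is the conventional one; the function name and toy values are illustrative, not the repo's code — only the two decay constants come from the commit message above):

```python
# One EMA step per training step: ema <- decay * ema + (1 - decay) * w.
def ema_update(ema_w, w, decay):
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_w, w)]

# Toy target: raw weight fixed at 1.0, EMA starting from 0.0.
# A smaller decay (0.9965) tracks the raw weights faster than 0.997,
# which is the plausible mechanism behind the +0.017 BPB gap cited above.
ema_fast, ema_slow = [0.0], [0.0]
for _ in range(1000):
    ema_fast = ema_update(ema_fast, [1.0], 0.9965)
    ema_slow = ema_update(ema_slow, [1.0], 0.997)

print(ema_fast[0], ema_slow[0])  # the 0.9965 EMA sits closer to 1.0
```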
owizdom added a commit to owizdom/parameter-golf that referenced this pull request on Apr 9, 2026
…ib GPTQ + SLOT-24

Replaces the triple-stack (Pre-Quant TTT + Val-Calib GPTQ + Eval-Time Legal TTT) with a quad-stack that supersedes the legal TTT path with SLOT-24, ported from PR openai#1488 / PR openai#1313.

Four val-data adaptations stacked for the first time:
1. Pre-Quant AdamW TTT — 11 epochs, freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H = X^T X from val activations (Track A)
3. SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant model, 24 cosine-decayed AdamW steps, throwaway parameters
4. (Optional) Eval-Time Legal Score-First TTT — disabled by default; SLOT supersedes it within the eval budget. Set SLOT_ENABLED=0 TTT_ENABLED=1 to fall back.

Code changes vs the previous synthesis commit:
- GPT class: split forward_logits into forward_hidden + compute_logits so SLOT can add the per-window delta to the hidden state without re-running the transformer stack.
- New eval_val_slot function ported from PR openai#1488 (per-window AdamW with cosine LR decay, stride masking, score-after-delta).
- run_evals: wires SLOT on a fresh post-quant model copy, gated by SLOT_ENABLED. Disables legal TTT by default.
- New hyperparameters: SLOT_ENABLED, SLOT_STEPS, SLOT_LR, SLOT_LR_MIN, SLOT_BATCH_SEQS, SLOT_EVAL_STRIDE.

Folder renamed: 2026-04-09_PreQuantTTT11_ValCalibGPTQ_LegalEvalTTT_Synthesis -> 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis

Time budget: ~530s of the 600s eval window used (590s train + 190s pre-quant TTT + 10s val-calib GPTQ + 80s sliding-eval baseline + 250s SLOT-24).

Code: 2322 lines (vs 2039 in the PR openai#1487 base, +283 added). py_compile clean. README rewritten as the user's submission with a compact credits section.
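The forward_logits split described above is the key refactor: SLOT's inner loop only needs to perturb the cached hidden state and re-project, not re-run the expensive frozen stack every step. A toy stand-in sketch (not the repo's code — the real forward_hidden runs the transformer blocks and compute_logits applies the unembedding; here both are trivial placeholders):

```python
# Toy sketch of the forward_hidden / compute_logits split.

def forward_hidden(x):
    # Stand-in for the expensive frozen transformer stack: run ONCE per window.
    return [2.0 * v for v in x]

def compute_logits(hidden, delta=None, logit_bias=0.0):
    # Cheap head: add SLOT's per-window delta to the hidden state, then
    # project and apply the scalar logit bias (identity projection in this toy).
    if delta is not None:
        hidden = [h + d for h, d in zip(hidden, delta)]
    return [h + logit_bias for h in hidden]

h = forward_hidden([1.0, 2.0])       # stack runs once per window
base = compute_logits(h)             # baseline logits
adapted = compute_logits(h, delta=[0.1, -0.1], logit_bias=0.05)  # one SLOT-style re-projection
```

Each of SLOT's 24 AdamW steps then only pays for compute_logits plus the backward pass through it, which is what keeps the 250s SLOT-24 slice inside the eval budget.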
Record: SLOT-24 + Pre-Quant AdamW TTT
val_bpb = 0.8265 (3-seed mean, std 0.0029) | ~15.76 MB | 8xH100 SXM
3-Seed Results
seed 42: 0.82329038 | seed 1337: 0.82916457 | seed 2024: 0.82694986 | mean: 0.82646827 (std 0.0029)
Prior SLOT SOTA (PR #1313): 0.8637. Delta: -0.0372 BPB.
Novel Contribution
First combination of pre-quant AdamW TTT (weight-level adaptation, baked into the artifact) with SLOT (hidden-state optimization, eval-time). The two are complementary: TTT adapts the weights once before quantization and ships in the artifact, while SLOT adapts the hidden state per window at eval time and discards its throwaway parameters.
Changes from PR #1313
Architecture
SP1024, 11L 512dim, GQA 8/4, MLP 3x, XSA-all, VRL, BigramHash, SmearGate, U-Net skip, EMA 0.997, Late QAT, Muon, int6/int8 + LZMA.
SLOT Mechanism
Frozen model -> per-window delta + logit_bias -> 24 AdamW steps -> score -> discard. No state carries across windows.
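The 24 AdamW steps above use a cosine-decayed learning rate (the SLOT_LR / SLOT_LR_MIN hyperparameters named in the ported code). A minimal sketch of that schedule, assuming the standard cosine-annealing form; the endpoint values and function name here are illustrative, not the PR's actual settings:

```python
import math

def slot_lr(step, total_steps=24, lr_max=1e-2, lr_min=1e-4):
    """Cosine decay from lr_max at the first step to lr_min at the last.

    Standard cosine annealing; lr_max/lr_min are placeholder values.
    """
    t = step / (total_steps - 1)  # 0.0 at step 0, 1.0 at the final step
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

schedule = [slot_lr(s) for s in range(24)]
# Starts at lr_max, decays monotonically, ends at lr_min; the per-window
# delta and logit_bias are then dropped before the next window.
```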
Compliance
Credits
PR #1313 @anthony-maio, PR #1423 @aryanbhosale, PR #1482 @aamodbhatt
Checklist
records/track_10min_16mb/