Record: SP1024 + SLOT-24 + QK5.25 + Pre-Quant AdamW TTT — val_bpb 0.8265 (3-seed mean)#1488

Closed
ndokutovich wants to merge 1 commit into openai:main from ndokutovich:s5-slot-submission

Conversation

@ndokutovich

Record: SLOT-24 + Pre-Quant AdamW TTT

val_bpb = 0.8265 (3-seed mean, std 0.0029) | ~15.76 MB | 8xH100 SXM

3-Seed Results

| Seed | SLOT BPB   | Sliding (no SLOT) | Artifact (bytes) |
|------|------------|-------------------|------------------|
| 42   | 0.82329038 | 1.08834264        | 15,764,692       |
| 1337 | 0.82916457 | 1.08844016        | 15,756,236       |
| 2024 | 0.82694986 | 1.08842671        | 15,760,000       |
| Mean | 0.82646827 |                   |                  |

Prior SLOT SOTA (PR #1313): 0.8637. Delta: -0.0372 BPB.

Novel Contribution

First combination of pre-quant AdamW TTT (weight-level adaptation, baked into artifact) with SLOT (hidden-state optimization, eval-time). The two are complementary:

  • TTT improves base sliding: ~1.12 -> 1.088
  • SLOT pushes from better base: 0.8637 -> 0.8265
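The pre-quant TTT half of the combination can be sketched as follows. This is a hedged illustration, not the submission's actual `train_gpt.py` code: `TinyStack`, `prequant_ttt`, and the loop shape are assumptions; only the 10-epoch, lr=0.00045, freeze-1-block settings come from this PR.

```python
import torch
import torch.nn.functional as F

class TinyStack(torch.nn.Module):
    """Toy stand-in for the SP1024 model: embedding, block list, head."""
    def __init__(self, vocab=16, dim=8, n_blocks=2):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.blocks = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(n_blocks))
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, tokens):
        h = self.emb(tokens)
        for block in self.blocks:
            h = torch.tanh(block(h))
        return self.head(h)

def prequant_ttt(model, windows, epochs=10, lr=4.5e-4, freeze_blocks=1):
    """Weight-level AdamW fine-tune run BEFORE quantization, so the
    adapted weights are baked into the shipped artifact (Track A).
    The first `freeze_blocks` blocks stay frozen."""
    for block in list(model.blocks)[:freeze_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for tokens in windows:
            logits = model(tokens)
            # next-token loss over the window
            loss = F.cross_entropy(logits[:-1], tokens[1:])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model  # quantization (int6/int8 + LZMA) would follow here
```

Because the adaptation happens before quantization, it costs nothing at eval time; SLOT then starts from the better ~1.088 sliding base.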

Changes from PR #1313

| Parameter     | PR #1313 | This PR                     |
|---------------|----------|-----------------------------|
| QK_GAIN_INIT  | 4.0      | 5.25                        |
| Pre-quant TTT | None     | 10 ep, lr=0.00045, freeze 1 |
| SLOT BPB      | 0.8637   | 0.8265                      |

Architecture

SP1024, 11L 512dim, GQA 8/4, MLP 3x, XSA-all, VRL, BigramHash, SmearGate, U-Net skip, EMA 0.997, Late QAT, Muon, int6/int8 + LZMA.

SLOT Mechanism

Frozen model -> per-window delta + logit_bias -> 24 AdamW steps -> score -> discard. No state carries across windows.
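The per-window loop above can be sketched as below. Everything here is an illustrative stand-in under stated assumptions: `ToyLM` replaces the frozen post-quant model, `slot_adapt_window` and the learning rate are hypothetical; only the throwaway delta + logit_bias, the 24 AdamW steps with cosine decay, and the discard-after-scoring behavior come from this PR.

```python
import math
import torch
import torch.nn.functional as F

class ToyLM(torch.nn.Module):
    """Tiny stand-in for the frozen model; only the
    forward_hidden / compute_logits interface matters here."""
    def __init__(self, vocab=16, dim=8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward_hidden(self, tokens):
        return self.emb(tokens)        # stand-in for the transformer trunk

    def compute_logits(self, hidden):
        return self.head(hidden)

def slot_adapt_window(model, tokens, dim, vocab, steps=24, lr=1e-2):
    """SLOT on one window: optimize a throwaway hidden-state delta and
    logit bias against the frozen model, score, then discard both."""
    delta = torch.zeros(dim, requires_grad=True)
    logit_bias = torch.zeros(vocab, requires_grad=True)
    opt = torch.optim.AdamW([delta, logit_bias], lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    with torch.no_grad():
        h = model.forward_hidden(tokens)   # trunk runs once per window
    for _ in range(steps):
        logits = model.compute_logits(h + delta) + logit_bias
        loss = F.cross_entropy(logits[:-1], tokens[1:])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    with torch.no_grad():
        nats = F.cross_entropy(
            model.compute_logits(h + delta)[:-1] + logit_bias, tokens[1:])
    # delta and logit_bias go out of scope on return: no cross-window state
    return nats.item() / math.log(2)  # bits per token (byte norm. omitted)
```

Note the delta is optimized on the same tokens it then scores; this retroactive two-pass structure is exactly the causality concern raised against SLOT in later discussion.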

Compliance

  • Training < 600s on 8xH100
  • Pre-quant TTT baked into artifact (Track A)
  • SLOT: frozen weights, throwaway per-window params only
  • No n-gram, no cross-window leakage

Credits

PR #1313 @anthony-maio, PR #1423 @aryanbhosale, PR #1482 @aamodbhatt

Checklist

  • One folder under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py
  • 3 seed logs
  • All artifacts < 16,000,000 bytes
  • Train wallclock < 600s

…65 (3-seed mean)

SLOT + pre-quant TTT combo on openai#1313 base.
  seed 42:   0.82329038
  seed 1337: 0.82916457
  seed 2024: 0.82694986
  mean:      0.82646827 (std 0.0029)
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
EMA_DECAY envvar (default=0.997, sota_32 uses 0.9965):
- PR openai#1435 shows EMA=0.9965 beats 0.997 by +0.017 BPB (1.0980 vs 1.1147)
- args.ema_decay_param wired to replace hardcoded 0.997

RECUR_LAYERS=4,5 at step 3000 (PR openai#1435):
- 13 virtual layers from 11 physical (vs 3,4,5 = 14 virtual)
- PR openai#1435 config: activate at step 3000

SLOT code present but DISABLED (SLOT_ENABLED=0 by default):
- eval_val_slot(), forward_hidden(), compute_logits() added to train_gpt_sota_28.py
- SLOT is retroactive 2-pass: optimizes delta on same tokens it scores = not causal
- All SLOT PRs (openai#1313, openai#1488) remain unmerged

Expected: ~1.095-1.10 BPB (WD=0.04 + EMA=0.9965 + RECUR PR#1435 config)
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…ib GPTQ + SLOT-24

Replaces the triple-stack (Pre-Quant TTT + Val-Calib GPTQ + Eval-Time Legal TTT)
with a quad-stack that supersedes the legal TTT path with SLOT-24, ported from
PR openai#1488 / PR openai#1313.

Four val-data adaptations stacked for the first time:

1. Pre-Quant AdamW TTT — 11 epochs, freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X from val activations (Track A)
3. SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant
   model, 24 cosine-decayed AdamW steps, throwaway parameters
4. (Optional) Eval-Time Legal Score-First TTT — disabled by default; SLOT
   supersedes it within the eval budget. Set SLOT_ENABLED=0 TTT_ENABLED=1
   to fall back.

Code changes vs the previous synthesis commit:

- GPT class: split forward_logits into forward_hidden + compute_logits so
  SLOT can add the per-window delta to the hidden state without re-running
  the transformer stack.
- New eval_val_slot function ported from PR openai#1488 (per-window AdamW with
  cosine LR decay, stride masking, score-after-delta).
- run_evals: wires SLOT on a fresh post-quant model copy, gated by
  SLOT_ENABLED. Disables legal TTT by default.
- New hyperparameters: SLOT_ENABLED, SLOT_STEPS, SLOT_LR, SLOT_LR_MIN,
  SLOT_BATCH_SEQS, SLOT_EVAL_STRIDE.
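The forward split in the first bullet can be sketched as follows. The class body is illustrative; only the method names `forward_hidden`, `compute_logits`, and `forward_logits` come from the commit description.

```python
import torch

class GPTSplit(torch.nn.Module):
    """Split forward_logits into forward_hidden + compute_logits so SLOT
    can add its per-window delta to a cached hidden state and re-project
    on every AdamW step without re-running the transformer trunk."""
    def __init__(self, vocab=16, dim=8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.trunk = torch.nn.Linear(dim, dim)  # stand-in for the full stack
        self.head = torch.nn.Linear(dim, vocab)

    def forward_hidden(self, tokens):
        # expensive part: trunk, run once per window
        return torch.tanh(self.trunk(self.emb(tokens)))

    def compute_logits(self, hidden):
        # cheap part: head projection, run once per SLOT step
        return self.head(hidden)

    def forward_logits(self, tokens):
        # original single-call path, preserved for non-SLOT eval
        return self.compute_logits(self.forward_hidden(tokens))
```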

Folder renamed: 2026-04-09_PreQuantTTT11_ValCalibGPTQ_LegalEvalTTT_Synthesis
              -> 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis

Time budget: ~530s of 600s eval used (590s train + 190s prequant TTT + 10s
val-calib GPTQ + 80s sliding eval baseline + 250s SLOT-24).

Code: 2322 lines (vs 2039 in PR openai#1487 base, +283 added). py_compile clean.
README rewritten as user's submission with compact credits section.
@ndokutovich
Author

Closing as invalid. Same prequant_ttt_adapt_adamw pre-quant pattern as #1485, which violates Condition 3 of #1017. Full technical analysis in #1485. The SLOT-24 component on top is also in contested territory pending the #1336 ruling, so this PR is withdrawn on both counts.

