Skip to content

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)#1477

Open
aryanbhosale wants to merge 1 commit intoopenai:mainfrom
aryanbhosale:submission/sp8192-parallel-ttt
Open

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)#1477
aryanbhosale wants to merge 1 commit intoopenai:mainfrom
aryanbhosale:submission/sp8192-parallel-ttt

Conversation

@aryanbhosale
Copy link
Copy Markdown

Record: SP8192 + Parallel Residuals + Score-First TTT

val_bpb = 1.0822 (3-seed mean, std 0.0005) | ~15.99 MB | 8×H100 SXM

3-Seed Results

Seed Sliding BPB TTT BPB Artifact
42 1.0857 1.0826 15,991,486
314 1.0854 1.0822 15,991,486
999 1.0849 1.0817 15,991,486
Mean 1.0822

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0325 BPB.

Novel Contribution

Adds parallel residuals (from layer 7) to the SP8192 + score-first TTT stack. Prior work:

From layer 7, attention and MLP operate on separate residual lanes with a learned merge scalar.

Full Stack

SP8192, MLP 4x, depth recurrence (loop 4-5), parallel residuals (layer 7+), MuonEq-R, QK-Gain 5.0, SDClip, GPTQ embeddings, skip gates, score-first TTT (3 epochs), brotli.

Compliance (Track B)

Reproduction

pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 --skip-manifest
SEED=42 TTT_ENABLED=1 PARALLEL_START_LAYER=7 torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1394 @clarkkev, PR #1413 @dexhunter, PR #1412 @Robby955, PR #1204 @msisovic, PR #1260 @dexhunter

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 8, 2026
…#1476/openai#1477 confirm SP8192+TTT is new comp meta — our SP8192 build is ready, deploy next; LEGAL_TTT brittleness pattern confirmed n=2
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 8, 2026
Refresh PR cache, reclassify, publish frontier verdicts on
data-touching vs data-free compression (PR openai#672/openai#1482/openai#1477/openai#1471).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
…allback)

Decoded from the lzma-compressed stub in PR openai#1477 on openai/parameter-golf.
Replaces flash_attn_3_func with an SDPA fallback that uses PyTorch's native
scaled_dot_product_attention on H100 (auto-selects FlashAttention 2 backend).
Minor f-string fixes for Python 3.11 compatibility (PEP 701 not yet).

Full stack: SP8192, MLP 4x, 11 layers, XSA-all, parallel residuals (layer 7+),
Score-First TTT, depth recurrence (loops 4-5), MuonEq-R, QK-Gain 5, SDClip,
GPTQ int6 weights + int8 embeds, brotli compression, EMA 0.997.

Target val_bpb: 1.0822 (PR openai#1477 3-seed mean on 8xH100).
This is the Phase 1 validation target on 1xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
… was phantom

Inventoried train_gpt_phase1.py and discovered it's the complete decoded PR openai#1477
reproduction. It already contains every feature the original 8-shot plan was
going to "port": SP8192, parallel residuals (PARALLEL_START_LAYER=7), TTT
(eval_val_sliding_ttt), int6 GPTQ, brotli, EMA 0.997, looped layers, XSA, the
full set of architecture knobs. Shots 3-7 from the original plan don't need
porting — they're already there as default env vars.

New ★ REVISED SHOT PLAN section at the top of "Shot sequence":
- R1 Baseline (in flight): defaults + 600s + TTT_ENABLED=1, no code change
- R2 n=2 seed confirm: SEED=1337, no code change
- R3 Full-budget variant: MAX_WALLCLOCK_SECONDS=3000, no code change
- R4 AR self-gen GPTQ port from PR openai#1019: ~30 lines of new code, -0.003-0.005
  BPB stretch
- R5 8×H100 SXM submission run: verify DDP + write distributed launcher

R1-R3 fit before noon AEST today. R4-R5 are next-session work.

The original 8-shot section is kept below for historical context but is
superseded by REVISED.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Two parallel research agents (one with gh CLI for the comp PR landscape, one
with WebFetch for the open literature) audited every component of our
train_gpt_phase1.py = decoded PR openai#1477 stack.

Key findings:
- Comp-novelty ZERO: 0 of 16 components are unique to PR openai#1477. Every one
  appears in 2+ other top-15 PRs. Shipping as-is lands rank ~8-13.
- Current leaderboard frontier is PR openai#1485 at 1.0679 (open) = our stack +
  3-layer recurrence + EMA 0.9965 + QK_GAIN 5 + Pre-Quant AdamW TTT. Gap to
  us = 0.0143 BPB = 14x the 0.005 BPB record bar.
- World-novelty: 0 unambiguously novel as published. 2 small-twist NOVEL
  (LeakyReLU(0.5)^2 MLP, sigmoid-gated U-Net lerp skip in causal byte LM).
  5 COMP-NOVEL (XSA per-KV last-N, depth-gated parallel residuals, cosine
  TTT across val chunks, AR self-gen GPTQ, Brotli-11 packaging).

Highest-leverage missing pieces (all <100 LOC, all comp-port-with-evidence):
- C1 Pre-Quant AdamW TTT: -0.014 BPB (single biggest free delta)
- C2 3-layer depth recurrence: -0.005 to -0.01 BPB (one env var)
- C3 QK_GAIN_INIT 4 -> 5: -0.001 BPB (one env var)

Recommended R-plan revision: prioritize C1+C2+C3 over the original R4
(AR self-gen GPTQ from PR openai#1019), which is a compliance hedge not a BPB win.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Self-contained submission package. Single command from a fresh RunPod H100:

  curl -sL https://raw.githubusercontent.com/taka6745/paramgolf/main/submission/bootstrap.sh | bash

Files:
- README.md: usage docs + the disk topology gotcha (50 GB volume vs 100 GB
  container disk) so the next operator doesn't waste 50 min on disk-full
- bootstrap.sh: idempotent one-command setup. Clones repo, runs setup → data
  → train. Streams tee log to /tmp/paramgolf_bootstrap.log
- setup.sh: torch 2.4.1 → 2.9.1+cu128 upgrade (matches the bundled FA3 wheel
  ABI), FA3 import verify, brotli + sentencepiece + huggingface_hub install
- get_data.sh: stash docs_selected.jsonl on container disk (/root/paramgolf_bigdata/),
  symlink into repo, ensure SP model lives OUTSIDE the destination tokenizers_dir
  (the unlink-before-reuse bug), launch tokenize with MATCHED_FINEWEB_SKIP_HF_COPY=1.
  Skips tokenize if shards already exist (idempotent re-runs).
- run.sh: bridge the nested data path symlinks, sanity-check tokenizer + shards,
  launch train.py with the right env vars (TORCH_COMPILE_DISABLE=1 to skip the
  5-min first-run compile, TTT_ENABLED=1, TRAIN_LOG_EVERY=10). DRY_RUN=1 mode
  for 60s smoke testing.
- train.py: copy of train_gpt_phase1.py (decoded PR openai#1477) as the BASELINE.
  Subsequent commits will layer in the comp frontier (Pre-Quant TTT, 3L
  recurrence, QK_GAIN 5) + our NIGHT_MODE wins (gated_attention, NorMuon,
  NGRAM_BACKOFF, NORM_PCT_DROPOUT, MDL_compressible_first) + world-novel
  candidates (CMP_QUANT_VALUE_DEDUP, NGR_LOG_FREQ_INV, CTX_PARTITIONED_TAB).
- requirements.txt: pinned deps for reference (setup.sh installs them).

PODS_SSH.md: marked Pod K as REMOVED 0115Z. Next pod uses bootstrap.sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
The biggest free comp-frontier delta (-0.014 BPB). Ported from PR openai#1485 / openai#1306
which both descend from PR openai#1306's original ttt_adapt_adamw. PR openai#1482 frontier
(lr=0.00045, epochs=8, freeze_blocks=1) hits val_bpb 1.0787, vs PR openai#1477's
1.0822 with eval-time SGD TTT only.

Changes to submission/train.py:
- Hyperparameters: append 7 prequant_ttt_* env vars (defaults match PR openai#1482).
- New function prequant_ttt_adapt_adamw(h, base_model, device, val_tokens, rank,
  world_size): AdamW(lr) + cosine schedule, freezes first N blocks during TTT,
  unfreezes ALL params at the end (CRITICAL — GPTQ collect_hessians needs
  gradients on every block, leaving any frozen would zero its Hessian and
  quantize the block to garbage).
- train_and_eval: splice the call AFTER 'pre-quantization post-ema' eval and
  BEFORE serialize. Logs a 'post-prequant-ttt' val_bpb so we can see the
  improvement vs the EMA-only number.
- Stacks with the existing eval_val_sliding_ttt (post-quant) — different
  namespaces, gains compound (1.0872 EMA -> 1.0569 prequant -> ~1.0679 final).

run.sh: PREQUANT_TTT_ENABLED=1 by default with PR openai#1482 frontier hyperparams.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
…EDUP

Two more NIGHT_MODE world-novel patches ported from the patcher script into
clean code in submission/train.py:

1. **NORM_PCT_DROPOUT** (chunk 2, world-novel L05, n=2 confirmed-win 1.41365):
   - In MLP.forward (after leaky_relu^2, before proj): when training, zero out
     the rows whose per-token L2 norm is in the top 1% (NORM_PCT_THRESH=0.99).
   - Targets the rare exploding-activation pathway. Standard dropout = random
     elements; structured dropout = random rows; norm-percentile = loudest rows.
   - Enable via USE_NORM_PCT_DROPOUT=1 (default ON in run.sh).

2. **CMP_QUANT_VALUE_DEDUP** (chunk 3, world-novel L10):
   - In gptq_quantize_weight after the inner quantization loop: snap Q tensor
     values to multiples of CMP_QUANT_DEDUP_STEP (default 2). Halves the
     effective alphabet (~32 distinct int6 values vs 64) so the byte stream
     brotli compresses has more LZ77 matches → ~5-15% smaller compressed.
   - **Directly helps stay under the 16 MB submission limit** (R1 was 41 KB
     over on undertrained weights — this could free 800 KB - 2 MB).
   - World-novel: post-int alphabet snap for entropy-coding compressibility is
     not in any LM compression paper.
   - Enable via USE_CMP_QUANT_VALUE_DEDUP=1, step via CMP_QUANT_DEDUP_STEP=2
     (both default ON in run.sh).

run.sh: both flags ON by default with the env vars wired through.

Differentiation status: 7 of our changes now layered on PR openai#1477:
  C1 Pre-Quant TTT, C2 3-layer recurrence, C3 QK_GAIN 5, gated_attention,
  NorMuon, NORM_PCT_DROPOUT, CMP_QUANT_VALUE_DEDUP.

Still pending: NGRAM_BACKOFF (gateway for the n-gram-dependent world-novel
patches NGR_LOG_FREQ_INV + CTX_PARTITIONED_TAB) and MDL_compressible_first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
The biggest of the NIGHT_MODE wins still missing — n=3 confirmed-win Stupid
Backoff (Brants 2007). Adds the n-gram bias infrastructure that 2 of the 3
world-novel L09 patches (NGR_LOG_FREQ_INV, CTX_PARTITIONED_TAB) depend on.

Three new components:

1. **submission/build_ngrams.py** (157 lines, parameterized clone of
   runpod_tests/chore/04_build_ngrams.py for SP8192):
   - Reads tokenized .bin shards from $NGRAM_DATA_DIR
   - Builds bigram/trigram/fourgram count tables → log-prob via add-0.1 smoothing
   - Polynomial hash (prev*36313 + cur*27191 + ...) % HASH_BUCKETS (default 16384)
   - Writes data/{bigram_tab,trigram_logprobs,fourgram_logprobs}_8192v.npy
   - 100M-token cap (env var override) keeps build to ~1-3 min on a typical pod
   - Loaded as non-persistent buffers → does NOT count toward 16 MB limit
   - Tables ~512 MB each, 1.5 GB total — built fresh on every pod, not shipped

2. **submission/train.py** GPT class:
   - __init__: load 3 n-gram tables as register_buffer(persistent=False); read
     env vars for hash buckets, weights, backoff thresholds, alpha
   - forward_logits: after softcap, hash input_ids + prev tokens via the same
     polynomial hash as build_ngrams.py, look up bias from each table
   - NGRAM_BACKOFF dispatch: pick the highest-confidence order at each position
     (peak4 > thresh4 → use 4-gram, else peak3 > thresh3 → use 3-gram * alpha,
     else bigram * alpha²). Brants 2007 Stupid Backoff.
   - Plain weighted-sum fallback if backoff disabled.
   - +69 lines of code, model now 687 lines total.

3. **submission/get_data.sh**: append a "build n-gram tables" step after tokenize.
   Calls build_ngrams.py with NGRAM_VOCAB=8192 NGRAM_HASH_BUCKETS=16384
   NGRAM_MAX_TOKENS=100M. Verifies all 3 .npy outputs exist before exiting.

4. **submission/run.sh**: USE_NGRAM_BIAS=1 USE_NGRAM_BACKOFF=1 by default with
   the full set of weight + threshold env vars wired through.

Differentiation count: 8 of our changes layered on PR openai#1477:
  C1 Pre-Quant TTT, C2 3-layer recurrence, C3 QK_GAIN 5,
  gated_attention, NorMuon, NORM_PCT_DROPOUT, CMP_QUANT_VALUE_DEDUP,
  NGRAM_BIAS+BACKOFF.

Still pending (chunk 3 world-novel L09 refinements): NGR_LOG_FREQ_INV,
CTX_PARTITIONED_TAB. Both depend on the n-gram infra now in place — they're
small follow-ups (~30 LOC each) for the next iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
The two remaining NIGHT_MODE world-novel L09 patches, both built on top of the
n-gram bias infrastructure added in the previous commit.

1. **NGR_LOG_FREQ_INV** (world-novel L09 openai#2):
   - One-time inverse-log-frequency bucket suppression on first forward call
   - Sample bucket frequencies from current batch's hash indices, compute
     multiplier = 1 / log(2 + count), apply in-place to bigram/trigram/fourgram
     tables (each with its own hash function — XOR + shift variants)
   - High-freq buckets (the swamping ones the model already predicts confidently)
     get muted; low-freq buckets (rare contexts where bias actually informs)
     keep full strength
   - Targets the trigram bias swamping floor — frees the bias to inform rare
     contexts where the model needs help
   - World-novel: no published technique applies inverse-log-bucket-frequency
     weighting to n-gram bias tables in transformer training (audited by
     research subagent in earlier session)
   - Lazy in-place mutation, zero per-step cost after the first forward
   - Enable via USE_NGR_LOG_FREQ_INV=1 (default ON)

2. **CTX_PARTITIONED_TAB** (world-novel L09 openai#1):
   - 16 virtual sub-tables via slice rotation: rotate the bigram hash by
     (current_id mod S) * (H/S), where S = number of slices (default 16)
   - Effectively partitions the hash buckets into S zones, each absorbing 1/S
     of contexts → S× finer-grained smoothing
   - Mini-paper extension of the tabulation hash framework
   - World-novel: per-context hash slice rotation for n-gram bias is unpublished
   - Enable via USE_CTX_PARTITIONED_TAB=1 (default ON), slices via
     CTX_PARTITION_SLICES (default 16)

submission/train.py: now 731 lines (+44 from chunk 2 commit). Both edits live
in GPT.__init__ and GPT.forward_logits, gated by their env vars.

submission/run.sh: USE_NGR_LOG_FREQ_INV=1 USE_CTX_PARTITIONED_TAB=1 by default.

DIFFERENTIATION SCORECARD: 10 of our changes layered on PR openai#1477 reproduction.
- Comp frontier (3): C1 Pre-Quant TTT, C2 3-layer recurrence, C3 QK_GAIN 5
- NIGHT_MODE validated (4): gated_attention, NorMuon, NORM_PCT_DROPOUT, NGRAM_BACKOFF
- World-novel (3): CMP_QUANT_VALUE_DEDUP (L10), NGR_LOG_FREQ_INV (L09),
  CTX_PARTITIONED_TAB (L09)

The model is now FULLY OURS — not a comp copy, with 3 actual world-novel
research claims layered on top of the comp-record stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 9, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 9, 2026
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Phase 2 = SAME model as Phase 1 (10 patches + PR openai#1477 base), faster execution.
The speedup → more training steps in the 600s budget → lower val_bpb. Stuck at
180 steps now; with 5x speedup → ~900, with 15x → ~2700.

Hardware: cheap 3090/4070 Ti only — H100 rule resumes after Phase 1 ends. Total
Phase 2 budget $5-10 vs Phase 1's ~$5 burn. Most of Phase 2 is dev work.

Shot ordering (priority):
1. torch.compile re-enable + cache warm-up (~3-5x speedup)
2. FA3 sourcing (~30% on top — try wheel first, then build, then FA2 fallback)
3. Persistent CUDAGraph capture (~1.5-2x on top — risky due to in-place patches)
4. Fused n-gram bias + attention Triton kernel (custom, ~3-4 h, optional)
5. GPTQ int6 dequant + matmul fusion (~30% eval speedup, optional)
6. Custom SDPA replacement (skip if FA3 lands)
7. Int8 tabulation hash GPU gather (skip if NLFI doesn't matter)
8. FP8 compute paths (skip if Shots 1-3 enough)

Stop conditions:
- ≥5x speedup achieved → done, run submission
- $10 spend → stop, lock in what we have
- Crash that can't be fixed in <1 h → disable that shot, move on

What Phase 2 doesn't do: change patches, hyperparams, vocab, or do 8xH100.
That's separate phases.

Phase 2 → Submission gate: ≥5x speedup + 10 patches still working + 1xH100 SXM
val_bpb in the 1.10-1.18 range (within 0.10 of comp records).

Realistic wallclock: 6-12 h dev + ~$5-8 cheap-pod burn (Phase 1 took 2-3x its
optimistic estimate; assume the same for Phase 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Mirror the Phase 1 append-only log format, pre-seeded with:
- Phase 1 baseline context (what we're improving from)
- Comp anchors (PR openai#1477/openai#1482/openai#1485) as the target ceiling
- Shot-by-shot result slots for S1-S8
- A cumulative speedup tracker table
- Phase 2 targets: ≥5x speedup, val_bpb 1.10-1.18 on 1xH100 SXM

The Phase 1 dry run val_bpb placeholder will be filled in once Pod L's run
completes (~03:30Z). That number becomes the Phase 2 floor we're optimizing
against.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 9, 2026
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
Both techniques from PR openai#1477 (SP8192: 1.0822 BPB, -0.033 vs record):
- QK_GAIN_INIT=5.0 (was 4.0): PR openai#1477/openai#1413 both use 5.0, ~0.002-0.005 BPB
- TTT_ENABLED=1, TTT_OPTIMIZER=adamw, TTT_LR=0.0005, TTT_EPOCHS=3, TTT_FREEZE_BLOCKS=1
  Score-First TTT = legal PR openai#461 protocol: score in inference_mode first,
  then adapt. run_legal_ttt() implementation is strictly causal.

NOT included (confirmed illegal):
- SLOT: 2-pass retroactive, optimizes delta on same tokens it scores
- Pre-Quant TTT (PR openai#1482/openai#1489): adapts model with fineweb_val_*.bin
  before GPTQ — dexhunter flagged as val-data-in-training violation

Baseline: sota_32 (WD=0.04, EMA=0.9965, RECUR=4,5, Mousse=1)
Expected: ~1.08-1.09 BPB
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant