
Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt — val_bpb 1.08091 (5-seed mean, causal-corrected)#1437

Open
dexhunter wants to merge 6 commits into openai:main from dexhunter:record/sp8192-par7-loop35-ngram-1.0779

Conversation


@dexhunter dexhunter commented Apr 7, 2026

Summary

Corrected version of our earlier sp8192 + par7 + loop35 + n-gram tilt submission.

This PR now reports the causal-corrected, token-only n-gram result. The stack:

  1. Parallel residuals on layers 7–10 from PR #1412
  2. 3-layer depth recurrence (LOOP_START=3, LOOP_END=5) on top of PR #1394
  3. Legal score-first TTT
  4. Causal token-only n-gram tilt from the PR #1420 kernel family, with the causality fix applied and the within/word experts disabled (NGRAM_WITHIN_BETA=0, NGRAM_WORD_BETA=0)

The earlier 1.07807 number reported in this PR was produced with a non-causal kernel and is preserved only for transparency in submission.json under seed_results_pre_fix. The corrected result below is the one that should be evaluated.

Corrected Results

| Seed | Post-TTT BPB | val_loss (nats/token) | Artifact (bytes) |
| --- | --- | --- | --- |
| 0 | 1.08035 | 2.79067 | 15,994,644 |
| 42 | 1.08097 | 2.79225 | 15,995,572 |
| 1234 | 1.08127 | 2.79303 | 15,993,531 |
| 1337 | 1.08060 | 2.79131 | 15,988,802 |
| 2025 | 1.08135 | 2.79324 | 15,993,360 |
| 5-seed mean | 1.08091 | 2.79210 | all < 16,000,000 |

5-seed standard deviation: 0.00043 BPB.
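As a sanity check, the 5-seed mean and standard deviation can be recomputed from the per-seed BPB values above (a quick sketch):

```python
import statistics

# Per-seed post-TTT BPB values from the corrected results table.
bpb = [1.08035, 1.08097, 1.08127, 1.08060, 1.08135]

mean = sum(bpb) / len(bpb)
std = statistics.stdev(bpb)  # sample (n-1) standard deviation

print(f"mean={mean:.5f}, std={std:.5f}")  # mean=1.08091, std=0.00043
```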

Record Margin

| Comparator | Score | Δ (nats/token) | 0.005-nat bar |
| --- | --- | --- | --- |
| Merged SOTA PR #1019 | 1.11473 | +0.08736 | clears |
| PR #1394 | 1.08563 | +0.01219 | clears |
| Our PR #1413 | 1.08279 | +0.00486 | essentially tied |

Against the merged baseline and against #1394, this corrected result still clears the README's 0.005-nat record bar.

What Was Fixed

The original ported n-gram kernel read metadata from the current target token at position p and used it to gate within_hint(...) / word_hint(...) before scoring that position. That made the predictive distribution depend on information derived from x_p itself.

The corrected kernel now:

  • uses prefix-only metadata from tokens[p-1] for hint gating,
  • keeps current-token metadata only for the post-score update path,
  • and disables the within / word experts entirely, because under prefix-only gating they were empirically harmful.

This leaves token_hint as the only active n-gram expert.
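The gating change is easiest to see in miniature. Below is a hypothetical Python re-expression of the fix (the real logic lives in fused_expert_kernel.cpp; a plain bigram cache stands in for the n-gram state):

```python
def hints_causal(tokens, is_bnd, is_ws):
    """Emit a hint per position using strict-prefix state only."""
    bigram = {}                      # prev token -> last-seen successor
    hints = [None] * len(tokens)
    for p in range(1, len(tokens)):
        prev = tokens[p - 1]
        # FIXED: gate on metadata of the last *prefix* token. The buggy
        # kernel read is_bnd[tokens[p]] / is_ws[tokens[p]] here, leaking
        # target-token information into the score at position p.
        if not (is_bnd[prev] or is_ws[prev]):
            hints[p] = bigram.get(prev)   # scored before any update
        # Update path: only here may the actual current token be touched.
        bigram[prev] = tokens[p]
    return hints
```

On a toy stream, `hints_causal([1, 2, 1, 2, 1], [False] * 3, [False] * 3)` yields `[None, None, None, 2, 1]`: a hint becomes available only after the bigram has been seen in the strict prefix.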

Compliance

Under the #1017 field guide:

  1. No dependence on x_t or future tokens: the active token hint is derived from strict-prefix state only.
  2. Proper normalized distribution: the tilt is p_tilt(t) = p_model(t) * exp(beta * 1[t==hint]) / Z, with the corresponding full-vocab normalizer.
  3. Score before update: both the n-gram path and the TTT path score first, then update.
  4. Single left-to-right pass: no rescoring, no oracle selection, no multi-pass min trick.

No SLOT, no pre-quant TTT, no eval-time training on current-token losses before scoring.
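Condition 2's closed-form normalizer can be checked numerically; a small sketch of the quoted tilt on a toy 3-token vocabulary (not the PR's kernel):

```python
import math

def tilt(p_model, hint, beta):
    """p_tilt(t) = p_model(t) * exp(beta * 1[t == hint]) / Z,
    with the closed-form full-vocab normalizer Z."""
    Z = 1.0 + p_model[hint] * (math.exp(beta) - 1.0)
    return [p * (math.exp(beta) if t == hint else 1.0) / Z
            for t, p in enumerate(p_model)]

p = [0.5, 0.3, 0.2]          # toy model distribution
q = tilt(p, hint=1, beta=0.5)
assert abs(sum(q) - 1.0) < 1e-12   # properly normalized over the vocab
assert q[1] > p[1]                 # hinted token boosted, rest rescaled down
```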

Verification

  • All 5 seeds were re-run via the shipped mini train_gpt.py wrapper.
  • All 5 artifacts are under the 16 MB cap.
  • The included records folder contains:
    • README.md
    • submission.json
    • train_gpt.py
    • ngram_tilt.py
    • fused_expert_kernel.cpp
    • train_seed0.log
    • train_seed42.log
    • train_seed1234.log
    • train_seed1337.log
    • train_seed2025.log

Credits

  • PR #1394 for the sp8192 base stack
  • PR #1412 for parallel residuals
  • PR #1420 for the n-gram tilt kernel family and public fix discussion
  • PR #1145 for the original normalized causal n-gram idea
  • PR #1413 for our prior sp8192 + QK5 + legal TTT base

…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only
hash construction, full-vocab renormalized one-token tilt, score-before-update
ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed
(extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR
will be updated with the final 5-seed mean once s1337 and s2025 land.
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper.
The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800)
which is well within the std (~0.00046). Margins vs the legal open
chronology are unchanged in direction:

- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit;
s0 and s1234 mini-wrapper re-runs still in progress.
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper.
The mean improves slightly from the prior mixed-source 1.07813 to 1.07807
because s1234 produced a noticeably lower post-TTT BPB under the mini wrapper
(1.07813 mini vs 1.07848 raw, -0.00035 — within float64 reordering noise, but
the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0:    15,992,304 bytes (7,696 byte headroom)
- s42:   15,993,733 bytes (6,267 byte headroom)
- s1234: 15,990,539 bytes (9,461 byte headroom)
- s1337: 15,988,039 bytes (11,961 byte headroom)
- s2025: 15,992,215 bytes (7,785 byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four issue openai#1017 conditions remain verified for the n-gram tilt path.
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
Mined the top 20 open PRs at openai/parameter-golf and found that
PARALLEL RESIDUALS (compute attn + mlp in parallel from the same
pre-norm input) is in 3 of the top 6 recent records:
  PR openai#1437: SP8192 + Parallel Residuals + 3L Recurrence — val_bpb 1.07800
  PR openai#1420: Triple Loop + Parallel Residuals + N-gram Tilt — val_bpb 1.08014
  PR openai#1425: PROTEUS Parallel Residuals + INT5/INT6
We never tried it. Patch 13 adds USE_PARALLEL_RESIDUALS=1 which switches
Block.forward from serial (x = x + attn(x); x = x + mlp(x)) to parallel
(x = x + attn(LN(x)) + mlp(LN(x))). Idempotent, anchors on the first 3
lines of Block.forward which are invariant under Patch 11 (smear gate).
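The switch described above, as a framework-free sketch (attn, mlp, ln are stand-in callables; the real change is in the torch Block.forward):

```python
def block_serial(x, attn, mlp):
    # Serial residuals: mlp sees attn's output.
    x = x + attn(x)
    x = x + mlp(x)
    return x

def block_parallel(x, attn, mlp, ln):
    # Parallel residuals: both branches read the SAME pre-norm input.
    h = ln(x)
    return x + attn(h) + mlp(h)
```

With toy branches `attn = lambda v: 2 * v` and `mlp = lambda v: 3 * v` (and identity ln), serial gives 1 → 3 → 12 while parallel gives 1 → 6: the branches no longer compose, they add.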

Also discovered LESSONS.md §29 ("depth recurrence is DEAD under GPTQ")
is contradicted by 5 of the top 10 recent records — they use depth
recurrence + mixed-precision INT5/INT6 instead of pure int6 GPTQ.
Worth re-investigating in a future research fire.

experiments.json — 4 new PR_* configs:
  PR0: parallel residuals alone (no n-gram, isolated effect)
  PR1: parallel + leaky_relu + full n-gram (current best stack + new trick)
  PR2: parallel + smear + leaky + full n-gram (max stack)
  PR3: PR1 with seed=42 for noise check

RESEARCH_LOG.md — full record of the research fire findings + the
queue of techniques to investigate in future fires (n-gram tilt, depth
recurrence, MuonEq-R, PartialRoPE+FA3, SwiGLU, codebooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
Subagent A (BPE-8192 trainer): the exact tokenizer is already on disk
at data/tokenizers/fineweb_8192_bpe.model (370,908 bytes, the literal
file behind LESSONS.md §18c -0.129 BPB Mac win). Just needs scp to pod.

Subagent B (closed/merged PR audit): top 8 merged records analyzed.
Frequency table reveals 5+ convergent techniques we DON'T have:
- SmearGate in 6/8 (75%)
- zstd-22 in 5/8 (62%)
- EMA 0.997 in 4+/8
- Partial RoPE in 2+/8
- XSA in 1/8 (PR openai#1019 = the literal #1 record at 1.11473)
- AR Self-Gen GPTQ in 1/8 (also PR openai#1019)

Subagent C (N-gram Tilt): FOUND the definition. It's a multiplicative
single-token exponential boost from a causal eval-time n-gram cache:
  p_tilt(t) = p_model(t) · exp(β · [t==hint]) / Z
  Z = 1 + p_model(hint) · (exp(β) - 1)
Used by PRs openai#1437, openai#1420, openai#1430. Bespoke to parameter-golf, not in
any published paper. Delta: -0.0029 to -0.0055 BPB.

Subagent D (TTT researcher): full ~80-line Score-First TTT sketch
provided. Pattern: score chunk in inference_mode, train on chunk SGD,
move on. PR openai#461 framework. Cost ~410s on 8xH100. ~-0.0025 BPB.
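The score-first pattern Subagent D describes can be sketched on a toy scalar model (the real recipe wraps the GPT and an SGD optimizer):

```python
def score_first_ttt(chunks, w, lr=0.005):
    """Score each chunk with the current weights FIRST, then SGD-update on it."""
    total = 0.0
    for x, y in chunks:
        loss = (w * x - y) ** 2        # 1) score (no update has seen x yet)
        total += loss
        grad = 2 * (w * x - y) * x     # 2) only afterwards, train on the chunk
        w -= lr * grad
    return total / len(chunks), w
```

Reversing steps 1 and 2 would train on the chunk before scoring it — exactly the eval-time-training-before-scoring pattern ruled out elsewhere in this thread.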

Subagent E (records miner): top 5 records analyzed, EMA + XSA +
Parallel Muon are convergent best practices. We have leaky_relu and
that's all from the comp's stack.

8-action priority list compiled. Highest EV next: scp BPE-8192,
implement EMA, XSA, Partial RoPE, LN Scale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…IP, I overrode to PASS

Subagent found arxiv:2505.15134 (Entropy Minimization at Inference, NeurIPS
2025) and recommended ship. I reversed to PASS after working out the math:
EM-INF is equivalent to temperature sharpening, and cross-entropy for a
calibrated MLE model is minimized at T=1 by definition. Moving T away from
1 in either direction strictly increases in-distribution NLL. Same class of
trap as Patch 14 (entropy-adaptive, already falsified). No push.

Better directions logged for next fire: PR openai#1437 N-gram Tilt (multiplicative
not sharpening), BPE-8192 tables, Coprime-Stride from merged record openai#1099.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…A captured

Subagent extracted the canonical formula from PR openai#1420 (the source for
PR openai#1437 and the entire Legal N-gram Tilt family):

  p_tilt(x_t) = p_model(x_t) * exp(beta * 1[x_t == hint]) / Z
  Z = 1 + p_model(hint) * (exp(beta) - 1)

Verified legal under issue openai#1017 four conditions (causal, normalized,
score-before-update, single-pass). Genuinely different from EM-INF
(last fire's PASS) — multiplicative reweighting using external signal,
not entropy sharpening.

DEFERRED code patch despite high confidence because:
1. Eval-only metric — our loop measures train_loss with SKIP_FINAL_EVAL=1
2. Subagent's "50 LOC sketch" has O(L^2) forward-pass bug, real impl is 150+
3. Modifying eval pipeline risks breaking FINAL int8_zlib_roundtrip path

Marked HIGH PRIORITY for next H100 escalation cycle. Estimated +0.0015-0.0030
BPB at our SP-1024 vocab size — same order as largest single-technique gains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug:
within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch
gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target token
metadata at the position being scored), leaking 1-2 bits about the answer
per scored position. This is an Issue openai#1017 condition 2 violation.

PR openai#1420 has the identical bug. @abaybektursun has acknowledged it in PR
openai#1420's thread and proposed the same fix that's applied here:

  * fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (last
    prefix token) for hint gating. Updates use the actual current tok via
    new tok_is_bnd / tok_is_ws variables so within_update / word_update
    still segment words correctly. Variable naming and structure copied
    verbatim from PR openai#1420's fix.
  * Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0.
    Empirically the within / word experts under prefix-only gating fire
    for the wrong positions (within fires for word-starts, word fires for
    mid-word) and contribute *negative* BPB. Disabling them gives 1.07951
    on s42 vs 1.08108 with the experts active — token_hint is the only
    legitimate contributor.

5-seed verification (all on the patched kernel):

    seed   pre-fix   corrected  delta
    0      1.07751   1.08035    +0.00284
    42     1.07809   1.08097    +0.00288
    1234   1.07813   1.08127    +0.00314
    1337   1.07801   1.08060    +0.00259
    2025   1.07862   1.08135    +0.00273
    mean   1.07807   1.08091    +0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB
headroom). Pre-fix per-seed values preserved in submission.json under
seed_results_pre_fix for the public record.
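The per-seed table above is internally consistent; a quick arithmetic check:

```python
# Values from the pre-fix / corrected table above.
pre  = [1.07751, 1.07809, 1.07813, 1.07801, 1.07862]
post = [1.08035, 1.08097, 1.08127, 1.08060, 1.08135]

deltas = [round(b - a, 5) for a, b in zip(pre, post)]
assert deltas == [0.00284, 0.00288, 0.00314, 0.00259, 0.00273]
assert round(sum(pre) / 5, 5) == 1.07807    # pre-fix mean
assert round(sum(post) / 5, 5) == 1.08091   # corrected mean
```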

Bar comparisons (corrected mean 1.08091):

    PR openai#1394 (1.08563): beats by +0.00472, fails 0.005 nat record bar
    PR openai#1413 ours (1.08279): beats by +0.00188, fails record bar
    PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 also tainted by the
                        same bug; would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record
claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest
legal anchor. The README has been retitled "Diagnostic (causal-corrected)"
and the legality fix is documented in a dedicated section.
@dexhunter dexhunter changed the title Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Legal N-gram Tilt — val_bpb 1.07800 (3-seed mean) Diagnostic (causal-corrected): SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt — val_bpb 1.08091 (5-seed mean) Apr 7, 2026
eamon831 added a commit to eamon831/parameter-golf that referenced this pull request Apr 7, 2026
…text

- Logged 4 experiments: smoke test, JEPA 1xH100, baseline 1xH100, JEPA 8xH100 (interrupted)
- Updated open PRs: SP8192 stack now at 1.078 BPB (PR openai#1437)
- Revised depth recurrence from dead-end to viable (PR openai#1394, openai#1435)
- Updated strategy: Phase 1 = JEPA on PR openai#1019, Phase 2 = rebase on SP8192
- Updated blockers: grant submitted, all pods terminated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 7, 2026
…ctions

- N-gram Tilt bug: PR openai#1420 kernel is non-causal; PR openai#1437 (dexhunter) found/fixed it
  (pre-fix 1.07807 → post-fix 1.08091). Updated primary reference to PR openai#1437 kernel.
- PR openai#1423 flagged illegal (pre-quant TTT, same as openai#1351/openai#1408/openai#1416)
- Added full PR openai#1421–1444 scan results
- Updated best open legal PR: ~1.08091 (PR openai#1437) not 1.08014 (openai#1420)
- Session 8 lessons learned added to CLAUDE.md

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 7, 2026
…two-track strategy

Critical findings from Issue openai#140 full thread analysis:
- Issue openai#140 CLOSED by @notapplica on Apr 6
- @valerio-oai NEVER commented in Issue openai#140; all rulings via PRs + Issue openai#677
- SLOT has never been officially banned: 9 open record PRs use SLOT variants
- PR openai#1333 (aryanbhosale, Causal SLOT-16): 1.0766 BPB — new best open record
- PR openai#1229 (scored-position SLOT): 0.9300 BPB — open, no rejection
- Strategy: Track A (safe: PR openai#1437 stack + TTT → ~1.078) + Track B (Causal SLOT-16 → ~1.076)
- SLOT status in CLAUDE.md updated from BLOCKED to DE FACTO IN USE

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…ai#1430 stalled, 2 new PRs validate deferred specs

Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits
in a row). Strong evidence of true novelty.

PR openai#1430 still OPEN, 0 comments, no comp owner activity since creation.
Increasingly likely to be reverted or outlawed.

NEW PRs validate two of our deferred H100 escalation specs:
  - PR openai#1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates Patch 17 EMA spec
  - PR openai#1446 (1.0960): "int6 GPTQ + lzma" → validates Patch 23 INT6 GPTQ-Lite spec

Combined with PR openai#1437/openai#1420 already validating Patch 23 N-gram Tilt, the
3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple-
confirmed by independent comp PRs.

Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime.

Reminder: depth recurrence is back on the table — 5+ records use it now.
LESSONS.md §29 needs another update from "stale" to "real direction".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
… single-block re-run

From PR openai#1437 (1.0809), PR openai#1445 (1.0889), 8+ merged records total. Reference
papers: Universal Transformers + ALBERT for the weight-sharing depth idea.

Conservative variant: re-run only block 3 of the encoder twice (1 extra
forward pass through one block per training step). Lowest possible OOM risk
on 12GB 3080 Ti. Default env vars: LOOP_START=3, LOOP_END=3, RECUR_CYCLES=2.

Implementation: 3 LOC in the encoder loop + 4 LOC init. Anchored on the
WAVELET-MODIFIED loop (Patch 8 runs before Patch 19), idempotent via
DEPTH_RECUR_MARKER. Each anchor check is independent for graceful partial
application.

This is the FIRST architectural patch in 8 research fires that fits our
train_loss metric. Most architectural attempts failed at our scale, but
depth recurrence has 8+ merged records — much higher port-with-evidence
ratio than gated attention/tab hash/parallel residuals.

4 DR experiments queued:
  DR0_recur_block3_min (single block, 2x), DR1_recur_blocks3_4 (2 blocks),
  DR2_recur_block3_3x (single block, 3x), DR3_recur_seed42 (multi-seed)

OOM risk bounded: runner crash-resilience skips after 3 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…m PR openai#1437/openai#1423)

Subagent gap analysis of top 3 open PRs (openai#1437, openai#1423, openai#1445) found
QK_GAIN_INIT=5.0 is the simplest training-time technique we're missing
that has 2-PR evidence (top open openai#1 and openai#2 both use 5.0 vs upstream
default 1.5).

CRITICAL: QK_GAIN_INIT is already an upstream env var (line 60 of
train_gpt.py). NO code patch needed — just add experiments that override
the env var. Zero patcher risk, zero anchor risk.

Application: q_gain is multiplied element-wise with query tensor before
F.scaled_dot_product_attention, scaling Q-K product by the gain factor.

4 QK experiments queued:
  QK0_qkgain5_alone, QK1_qkgain5_seed42, QK2_qkgain5_L4weights,
  QK3_qkgain5_with_engram

Hypertuning rule check: this is a SINGLE-value port from 2 top open
records, NOT a weight sweep. Satisfies "port from top records" rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…steps)

The full validated stack (Coprime Stride + EngramLite + leaky + ngram +
L4 weights + seed 42) under the new compute regime (seq=1024, batch=65536)
hit train_loss 2.5916 in 910 sec / 1000 steps.

vs old broken-config top-1: -0.682 (3.2734 -> 2.5916)
vs speed-fix CHAMP_L5_seed42: -0.397 (2.9885 -> 2.5916)

Stacking decomposition under proper compute:
  Old broken-config top-1:           3.2595
  + Speed fix (CHAMP_L5, 300 steps): 2.9885 (-0.271)
  + Coprime + EngramLite + 1000 steps: 2.5916 (-0.397 more)

The dominant factor is steps x batch quality (the speed fix unlocked it).
Patches (Coprime, EngramLite) contribute marginally on top.

H100 escalation candidate: SP6 stack, n=3 multi-seed validation in flight.
Projected H100 val_bpb: 1.02-1.05 if train_loss to val_bpb transfer ratio
is preserved. Would BEAT the open frontier (PR openai#1437 = 1.078).

Spend ~$6.50/$36 (18%). Plenty of headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…reshold

The previous "Diagnostic" framing was based on a unit error: I compared
val_bpb deltas as if they were nats-per-token deltas, missing the conversion
factor of ~2.583 (this submission's val_loss / val_bpb ratio, i.e. mean bytes
per token in the sp8192 val set times ln 2).

With the correct units, the causal-corrected 5-seed mean (1.08091 BPB,
2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:

  vs PR openai#1394 (1.08563): +0.01219 nats per token  ✅ 2.4× the bar
  vs PR openai#1019 (1.11473): +0.08736 nats per token  ✅ comfortably
  vs PR openai#1413 (ours):    +0.00486 nats per token  — essentially tied
  vs PR openai#1420 (1.08014): -0.00199 nats — but PR openai#1420 has the same kernel
                          bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The
legality fix section is preserved (the kernel patch is still a real
correctness fix matching @abaybektursun's proposed patch in PR openai#1420).
The leak magnitude in the legality fix section now correctly states
"+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB.

Pre-fix per-seed values are still preserved in submission.json under
seed_results_pre_fix for the public record.
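The corrected conversion can be verified from this PR's own numbers (a quick check, not new data):

```python
import math

val_bpb, val_loss = 1.08091, 2.79210   # corrected 5-seed means from the table
factor = val_loss / val_bpb            # ~2.583 nats/token per unit of BPB
bytes_per_token = factor / math.log(2) # ~3.73 mean bytes per token

leak_bpb = 0.00284                     # mean pre-fix -> post-fix delta
leak_nats = leak_bpb * factor
assert abs(leak_nats - 0.00734) < 1e-5  # matches the figure quoted above
```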
@dexhunter dexhunter changed the title Diagnostic (causal-corrected): SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt — val_bpb 1.08091 (5-seed mean) Record: SP8192 + Parallel Residuals + 3-Layer Recurrence + Token-Only N-gram Tilt — val_bpb 1.08091 (5-seed mean, causal-corrected) Apr 7, 2026

mohosy commented Apr 8, 2026

the causal correction on the ngram tilt is good, a lot of people were probably getting inflated scores without realizing. 5-seed std of 0.00043 is crazy tight too

PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 8, 2026
Base: train_gpt_sota_10.py (clean, 11L XSA-all, parallel L5+, recur 3,4,5)

Additions from top PRs:
- Legal Score-First TTT (PR openai#549 recipe: +~0.0025 BPB)
  chunk=32768, SGD lr=0.002 global cosine decay, 3 epochs, all blocks unfrozen
- N-gram Tilt (PR openai#1437): exp(0.5) boost on bigram-predicted next token
- Eval-Time Hash Embedding (PR openai#1460): zero-init embed[(p*2039+c)%16384]
  adapts via TTT optimizer at 10x model LR

Other tuning vs sota_10:
- warmdown_iters: 4200 -> 5500 (better final convergence)
- gptq_ar_seqs: 32 -> 64 (PR openai#1019: 64 is optimal)
- ttt defaults: lr=0.002, chunk_size=32768 (PR openai#549 ablation)
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>