
Record: 0.6360 BPB - Depth Recurrence + Multi-Order N-gram Backoff#808

Open
Naazimsnh02 wants to merge 3 commits into openai:main from Naazimsnh02:ngram-depth-recurrence-0.6364

Conversation


@Naazimsnh02 Naazimsnh02 commented Mar 26, 2026

Summary

val_bpb: 0.6360 (seed 1337) | ~15.94 MB | 8×H100 SXM | 3 seeds

Adds multi-order n-gram backoff (orders 2-7) with entropy-adaptive alpha to the depth recurrence stack, achieving a new record.

Key contributions

  • Multi-order n-gram backoff (orders 2-7): Hash-table n-gram counting at eval time. Highest-order match first, cascade down on miss. Zero training cost — purely eval-time.
  • Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H − 4.0)) — trusts n-gram more when the neural model is uncertain, model when confident.
  • Multi-GPU n-gram prefill: Each rank pre-populates its hash tables with all tokens scored by earlier ranks, fixing the table fragmentation problem on multi-GPU setups (without this, 8-GPU gets 0.87 BPB instead of 0.64).
  • Depth Recurrence: Repeating layers 4,5 for 13 virtual layers from 11 physical at zero parameter cost (carried over from previous submission).
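The backoff-plus-mixing recipe described above can be sketched in a few lines. This is a minimal reconstruction from the bullet descriptions, not the PR's actual code: the table layout (context tuple → count vector), the helper names, and the assumption that H is measured in bits are all mine.

```python
import numpy as np

def entropy_adaptive_alpha(p_model, eps=1e-12):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), with H in bits (assumption)."""
    H = -np.sum(p_model * np.log2(p_model + eps))
    return 0.05 + 0.55 / (1.0 + np.exp(-2.0 * (H - 4.0)))

def ngram_backoff_predict(context, tables, orders=range(7, 1, -1)):
    """Highest-order match first; cascade down an order on each miss.
    tables[n] maps an (n-1)-token context tuple to a count vector over the
    vocab (a hypothetical layout, not the PR's hash tables)."""
    for n in orders:
        counts = tables.get(n, {}).get(tuple(context[-(n - 1):]))
        if counts is not None and counts.sum() > 0:
            return counts / counts.sum()
    return None  # no n-gram evidence at any order

def mix(p_model, context, tables):
    """Blend neural and n-gram distributions with the entropy-adaptive weight."""
    p_ngram = ngram_backoff_predict(context, tables)
    if p_ngram is None:
        return p_model  # fall back to the neural model alone
    a = entropy_adaptive_alpha(p_model)
    return (1.0 - a) * p_model + a * p_ngram
```

Note how the alpha schedule behaves at the extremes: a confident model (H near 0) keeps alpha near the 0.05 floor, while a maximally uncertain one (H well above 4 bits) pushes it toward the 0.60 ceiling.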

Results

| Seed | BPB |
| ---- | ------ |
| 1337 | 0.6360 |
| 2025 | 0.6381 |
| 42   | 0.6387 |
| Mean | 0.6376 |

Built on PR #549 stack (LeakyReLU(0.5)², BigramHash(2048), XSA4, Partial RoPE, LN Scale, VE128, EMA+SWA, Parameter Banking + Parallel Muon, int6 GPTQ-lite + lzma).

Credits


MatoTeziTanka commented Mar 26, 2026

Interesting approach — the depth recurrence with layers 4,5 repeated for 13 virtual layers at zero parameter cost is creative, and the multi-GPU n-gram prefill fix (0.87 → 0.64 BPB without it) is a good catch.

You're at 2 seeds right now. The leaderboard requires 3-seed validation for record claims — one more run should close it out.


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@Naazimsnh02 Naazimsnh02 changed the title Record: 0.6364 BPB - Depth Recurrence + Multi-Order N-gram Backoff Record: 0.6360 BPB - Depth Recurrence + Multi-Order N-gram Backoff Mar 26, 2026
@MatoTeziTanka

Following up on this one with a new finding, since @valerio-oai ruled on the underlying n-gram mechanism after my first comment.

Compliance flag — same disallowed pattern as PR #779.

@valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." The mechanism is spelled out in comment 4146407380: hashing the target token into the bucket key only reweights the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data — arbitrarily low BPB without real compression.

Looking at records/track_10min_16mb/2026-03-26_DepthRecurrence_NgramBackoff_0.6360/train_gpt.py, the multi-order n-gram backoff uses the same target-in-key hashing pattern in every code path that touches the full_tables:

  • L1110 (multi-GPU prefill update): full_key = ((ctx_hash ^ (tgt * ng_primes[cw % len(ng_primes)])) & ng_mask).astype(np.int64) → np.add.at(full_tables[oi], full_key, 1) at L1112
  • L1170 (eval-time score buildup): full_key = ((ctx_hash ^ (tgt_np * ng_primes[ctx_w % len(ng_primes)])) & ng_mask) → full_counts = full_tables[oi][full_key] at L1179
  • L1411 (rank-local score): same pattern, full_key = ((ctx_hash ^ (tgt_np * ng_primes[...])) & ng_mask) → looked up at L1419
  • L1461 (chunk-end update): same pattern, np.add.at(full_tables[oi], full_key, 1) at L1463

Each of these hashes the ground-truth tgt / tgt_np / tgt_tok into the bucket key, and the full_counts lookup at scoring time then reads the count of that exact target in that bucket. Under @valerio-oai's #779 ruling, that violates condition 1 of Issue #1017 (Rule 1: p_t may depend only on the artifact and x_1...x_{t-1}).
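The mechanism can be reproduced in a toy. Everything below is a reconstruction for illustration, assuming a single order, a made-up prime, and synthetic data; it only demonstrates why a target-in-key lookup always finds its own prefill entry:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 256
ng_mask = (1 << 20) - 1              # ~1M buckets, so cross-pair collisions are rare
PRIME = 1000003                      # stand-in for ng_primes[...]

def full_key(ctx_hash, tgt):
    # the flagged pattern: the ground-truth target goes into the bucket key
    return (ctx_hash ^ (tgt * PRIME)) & ng_mask

table = np.zeros(ng_mask + 1)
stream = rng.integers(0, VOCAB, size=5000)        # synthetic "eval" tokens
ctx_hashes = rng.integers(0, 1 << 30, size=5000)  # synthetic context hashes

# "prefill": insert every (context, target) pair that will later be scored
for h, t in zip(ctx_hashes, stream):
    table[full_key(int(h), int(t))] += 1

# scoring: the true target's bucket is always populated by its own prefill entry...
hits = sum(int(table[full_key(int(h), int(t))] >= 1)
           for h, t in zip(ctx_hashes, stream))
# ...while any other candidate token's bucket is almost always empty
wrong = sum(int(table[full_key(int(h), int((t + 1) % VOCAB))] >= 1)
            for h, t in zip(ctx_hashes, stream))
print(hits, wrong)   # hits == 5000 by construction; wrong stays near zero
```

The count vector is therefore peaked on the true token regardless of what the data looks like, which is exactly the "P(correct) driven toward 1" limit from the #779 ruling.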

The multi-GPU n-gram prefill mechanism the README describes (0.87 BPB without it, 0.64 with it) is also worth thinking about in this light: each rank pre-populates its hash tables with all tokens scored by earlier ranks, which means each rank's tables already contain the target tokens it is about to score. The 0.23 BPB gap between "prefill on" and "prefill off" is the size of the leak, not the size of a compression gain.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #779.

@Naazimsnh02, please let me know if I've misread the code, especially the full_key lookups at L1110/L1170/L1411/L1461, and whether there's a renormalization over the full vocabulary that I'm missing; if so, I'd want to retract. The depth-recurrence stack (layers 4–5 repeated for 13 virtual layers at zero parameter cost) and the multi-GPU prefill plumbing are both genuinely interesting engineering, separate from the n-gram question. If you wanted to resubmit on a legal base (either dropping the cache and reporting the pure-neural sliding-window number, or reworking the cache as a full-vocab reweighting per @valerio-oai's suggested path on #779), the depth-recurrence and multi-GPU work would carry over cleanly. The 3-seed update from the earlier ask is in place and looks consistent.
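For concreteness, here is one possible shape of the full-vocab reweighting path. This is a hypothetical sketch of my own, not @valerio-oai's specification: the names, bucket count, and smoothing are invented, and the point is only that the bucket key depends on the context alone.

```python
import numpy as np

VOCAB = 256
N_BUCKETS = 1 << 12
# counts indexed by (context bucket, token): the target never enters the key
table = np.zeros((N_BUCKETS, VOCAB))

def ctx_bucket(ctx_hash):
    return ctx_hash & (N_BUCKETS - 1)

def update(ctx_hash, observed_tok):
    # called only for tokens that have ALREADY been scored (x_1..x_{t-1})
    table[ctx_bucket(ctx_hash), observed_tok] += 1

def ngram_probs(ctx_hash, smoothing=0.5):
    # a proper distribution over the full vocab, renormalized at scoring time;
    # p_t depends only on the context hash and past updates, never on the
    # token currently being scored
    counts = table[ctx_bucket(ctx_hash)] + smoothing
    return counts / counts.sum()
```

The obvious cost is memory: the table is N_BUCKETS × VOCAB floats rather than a flat bucket array, which is presumably why per-(context, target) hashing was attractive in the first place.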


Reviewed by @MatoTeziTanka (The Agora). Static code review against train_gpt.py at SHA 6c1b833b. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>