
Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)#889

Open
anthony-maio wants to merge 2 commits into openai:main from anthony-maio:submission/ngram-backoff-clean

Conversation

@anthony-maio

Summary

val_bpb = 0.9642 (3-seed mean, std 0.0002) | ~15.95 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-ngram bpb | Post-ngram bpb | ng_helped | Artifact (bytes) |
|------|----------|-------|---------------|----------------|-----------|------------------|
| 1337 | 88.7ms | 6,765 | 1.1225 | 0.9640 | 38.5% | 15,981,848 |
| 42 | 88.6ms | 6,772 | 1.1224 | 0.9641 | 38.6% | 15,904,632 |
| 2025 | 88.6ms | 6,776 | 1.1231 | 0.9644 | 38.6% | 15,974,308 |
| Mean | 88.6ms | 6,771 | 1.1227 | 0.9642 (std 0.0002) | 38.6% | |

All artifacts under 16,000,000 bytes. All 3 train logs attached.

Key Innovation: Multi-Order N-gram Backoff Cache

Backward-looking n-gram cache built causally from already-scored tokens. Zero artifact cost.

Entropy-Adaptive Alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4)). Neural-confident → alpha≈0.05. Neural-uncertain → alpha≈0.60.
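As a sanity check, the stated schedule can be reproduced in a few lines of Python (the function name is mine, not the PR's):

```python
import math

def adaptive_alpha(entropy: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)), where H is the neural
    # model's predictive entropy at the current position.
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy - 4.0)))
```

At H = 4 the mix sits at the midpoint alpha = 0.325; confident predictions (H near 0) keep alpha ≈ 0.05, uncertain ones (H well above 4) push it toward 0.60.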

Multi-Order Backoff (2-7gram): Highest matching order wins. 4M hash buckets per order. min_count=2 gate. Raw count ratios, no smoothing.
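The backoff scheme described above can be sketched as follows. This is an illustrative reimplementation, not the PR's code: the class and method names, the rolling hash, and the dict-backed tables are assumptions (the actual script uses fixed-size bucket arrays).

```python
MIN_COUNT = 2
NUM_BUCKETS = 4_000_000  # 4M hash buckets per order, per the description

class BackoffCache:
    """Multi-order (2-7 gram) backoff built from already-scored tokens."""

    def __init__(self, orders=range(2, 8)):
        self.orders = list(orders)
        self.ctx_counts = {o: {} for o in self.orders}  # context-hash -> count
        self.tgt_counts = {o: {} for o in self.orders}  # (ctx-hash, token) -> count

    def _hash_ctx(self, tokens, pos, ctx_w):
        # Rolling hash over the strict prefix tokens[pos-ctx_w : pos].
        h = 0
        for k in range(ctx_w):
            h = (h * 1000003 + int(tokens[pos - ctx_w + k])) % NUM_BUCKETS
        return h

    def update(self, tokens, start, end):
        # Ingest tokens[start:end) as targets; contexts lie strictly before them.
        for pos in range(start, end):
            for o in self.orders:
                ctx_w = o - 1
                if pos - ctx_w < 0:
                    continue
                h = self._hash_ctx(tokens, pos, ctx_w)
                self.ctx_counts[o][h] = self.ctx_counts[o].get(h, 0) + 1
                key = (h, int(tokens[pos]))
                self.tgt_counts[o][key] = self.tgt_counts[o].get(key, 0) + 1

    def predict(self, tokens, pos, target):
        # Highest matching order wins; raw count ratio, no smoothing.
        for o in reversed(self.orders):
            ctx_w = o - 1
            if pos - ctx_w < 0:
                continue
            h = self._hash_ctx(tokens, pos, ctx_w)
            denom = self.ctx_counts[o].get(h, 0)
            if denom >= MIN_COUNT:
                return self.tgt_counts[o].get((h, int(target)), 0) / denom
        return None  # no order fired; caller falls back to the pure neural prob
```

Because `predict` only looks up the count of the single true target given a strict-prefix context hash, it computes P(target | context) rather than selecting over candidate targets.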

Compliance: Score-first — every token scored before any table update. N-gram tables built from already-scored tokens only. No training data access during eval. No oracle selection.
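The score-first ordering can be illustrated with a self-contained sketch; the function name and the `score`/`update` callables are stand-ins, not the PR's API:

```python
def eval_score_first(tokens, windows, window_len, score, update):
    # Every position in a window is scored before that window's tokens
    # enter the cache, so no token contributes to its own prediction.
    scored_up_to = windows[0] if windows else 0
    losses = []
    for ws in windows:
        for t in range(window_len):
            losses.append(score(tokens, ws + t + 1))  # score first...
        new_end = ws + window_len + 1
        update(tokens, scored_up_to, new_end)         # ...update after
        scored_up_to = new_end
    return losses
```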

Training Architecture

PR #414 base + LeakyReLU² + VRL + lzma:
11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3×, VRL, VE128, BigramHash(2048), XSA4, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, GPTQ-lite int6 + lzma, FA3 Hopper, Muon WD=0.04

Credits

anthony-maio and others added 2 commits March 26, 2026 15:12
Sub-1.0 bpb via multi-order n-gram backoff (2-7gram) with entropy-adaptive
alpha mixing. 3-seed mean 0.9642, std 0.0002. All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 26, 2026 19:13
Contributor

Copilot AI left a comment


Pull request overview

Adds a new record submission for track_10min_16mb showcasing a multi-order (2–7) n-gram backoff cache combined with VRL + LeakyReLU², along with reproducibility artifacts and metadata.

Changes:

  • Added training/eval script implementing n-gram backoff evaluation and model architecture used for the record.
  • Added attached training logs for multiple seeds and a README describing results/compliance/repro steps.
  • Added submission metadata JSON for the record entry.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py | Training + evaluation script including sliding-window eval and n-gram backoff cache. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed42.log | Attached run log for seed 42 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed1337.log | Attached run log for seed 1337 supporting reported metrics and artifact size. |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/submission.json | Record metadata (val_bpb/val_loss/bytes, hardware, etc.). |
| records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/README.md | Human-readable summary of the method, results, compliance, and reproduction steps. |


Comment on lines +974 to +977
all_tokens = val_tokens.cpu().numpy().astype(np.int32)
scored_up_to = my_windows[0] if my_windows else 0
ngram_helped = 0
ngram_total = 0

Copilot AI Mar 26, 2026


In distributed n-gram eval, each rank’s cache starts at scored_up_to = my_windows[0], so ranks whose first window does not start at 0 will not include earlier (globally previous) tokens in their cache. This makes the n-gram backoff results depend on world_size/window partitioning rather than matching a single causal pass over the validation stream. To make the cache behavior consistent with a global score-first causal ordering, either (mandatory): (a) initialize each rank’s cache with the prefix tokens up to the first token position it will score (e.g., update the cache over [0, first_scored_pos) before scoring), or (b) run the n-gram backoff evaluation on a single rank (rank 0) and skip the distributed aggregation for that phase.
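Option (a) might look like the following; the `warm_cache` helper and the `cache.update(tokens, start, end)` signature are assumptions inferred from the quoted code:

```python
def warm_cache(cache, all_tokens, my_windows):
    # Replay the global prefix [0, first_scored_pos) into this rank's cache
    # before scoring, so the cache matches a single causal pass regardless
    # of how windows were partitioned across ranks. Still strictly backward:
    # only tokens before the first scored position are ingested.
    first_scored_pos = my_windows[0] if my_windows else 0
    if first_scored_pos > 0:
        cache.update(all_tokens, 0, first_scored_pos)
    return first_scored_pos  # initial value for scored_up_to
```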

Comment on lines +995 to +996
probs = torch.exp(log_probs)
entropy = -(probs * log_probs).sum(dim=-1)

Copilot AI Mar 26, 2026


This computes and materializes both log_probs and probs for the full [B, T, V] tensor, which is large and increases peak memory/bandwidth. You can compute entropy directly from log_probs without keeping a separate probs tensor (e.g., using log_probs.exp() inline) to reduce memory pressure.

Suggested change:

- probs = torch.exp(log_probs)
- entropy = -(probs * log_probs).sum(dim=-1)
+ entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
usable = ((tokens.numel() - 1) // seq_len) * seq_len
if usable <= 0:
raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")

Copilot AI Mar 26, 2026


The error message hardcodes TRAIN_SEQ_LEN even though the function parameter is seq_len (and the caller may pass a validation/eval seq length). Consider changing the message to refer to seq_len (or EVAL_SEQ_LEN when applicable) to avoid confusion when debugging validation setup.

Suggested change:

- raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+ raise ValueError(f"Validation split is too short for seq_len={seq_len}")

Comment on lines +7 to +12
| Seed | step_avg | steps | Pre-ngram bpb | **Post-ngram bpb** | ng_helped | Artifact |
|------|----------|-------|--------------|-------------------|-----------|----------|
| 1337 | 88.7ms | 6,765 | 1.1225 | **0.9640** | 38.5% | 15,981,848 |
| 42 | 88.6ms | 6,772 | 1.1224 | **0.9641** | 38.6% | 15,904,632 |
| 2025 | 88.6ms | 6,776 | 1.1231 | **0.9644** | 38.6% | 15,974,308 |
| **Mean** | **88.6ms** | **6,771** | **1.1227** | **0.9642 (std 0.0002)** | **38.6%** | |

Copilot AI Mar 26, 2026


The results table rows start with ||, which renders as an extra empty column in standard Markdown table syntax. Use a single leading | per row so the table formats correctly on GitHub.

Comment on lines +9 to +11
"val_bpb": 0.9642,
"val_loss": 1.6279,
"bytes_total": 15953596,

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes_total appears to be an average across seeds (it doesn’t match the per-seed totals shown in the attached logs). If submission.json is meant to describe a specific submitted artifact, it should use the exact bytes_total (and ideally the exact val_loss/val_bpb) for that chosen artifact; otherwise consider adding explicit fields indicating these values are 3-seed means.

Suggested change:

- "val_bpb": 0.9642,
- "val_loss": 1.6279,
- "bytes_total": 15953596,
+ "val_bpb_mean_3seed": 0.9642,
+ "val_loss_mean_3seed": 1.6279,
+ "bytes_total_mean_3seed": 15953596,

@MatoTeziTanka

Community Review — Record-track N-gram Backoff (pre-fused-kernel twin of #915)

BPB: 0.9642 post-ngram (3-seed mean, std 0.0002) / 1.1225 pure-neural stride-64 | Seeds: 3 | Artifact: 15,981,848 B (seed 1337) | Track: record | Compliance: N-gram PASS (same code as #915), eval-budget FLAG (same gate as #915)

TL;DR for the mod team: This is the record-track filing of the same neural+n-gram stack that the author also filed as non-record in PR #915. I reviewed #915 on 2026-04-11 (comment). The n-gram compliance story is literally the same code, so the n-gram verdict is the same; the eval-wallclock question is also the same and gates a record listing the same way it gates the non-record listing.

Relationship to #915 (confirmed by byte-for-byte diff at SHA 50ec6bc):

N-gram compliance (identical analysis to my #915 review, line numbers shifted):

  • _hash_ctx at L919-923 reads tokens[pos - ctx_w + k] for k in [0, ctx_w). With pos = ws + t + 1 (the absolute token index of the target, L1006) and ctx_w >= 1, the hashed indices are strictly [pos - ctx_w, ..., pos - 1] — the prefix before the target. The lookup key therefore depends only on the prefix x_{pos-ctx_w}...x_{pos-1}, satisfying condition 1 of Issue #1017 (A Field Guide to Valid Submissions).
  • predict(tokens, pos, target) at L937-950 uses target only to index full_tables[oi][full_h] for the single true target — it's computing P(target | context), not an oracle argmax over candidate targets. Per @valerio-oai's ruling on PR #779 (comment 4145781641, 2026-03-27), the disallowed pattern is hashing the target into a key used to select the prediction over candidate targets. This code looks up one count for the one true target to compute a probability — standard n-gram P(target|ctx) = count(ctx, target) / count(ctx) — which Issue #1017 condition 1 ("p_t may depend only on the artifact and x_1...x_{t-1}") permits when combined with a backward-only cache.
  • Score-before-update at window granularity (L1005-1030): the scoring loop finishes a whole window before cache.update(all_tokens, scored_up_to, new_end) (L1029) adds that window's tokens. scored_up_to starts at the first-assigned window's left edge. No token ever contributes to its own prediction. Legal under Issues #402 (information leakage during TTT) and #677 (illegal submissions megathread).
  • Mixing (L1012-1016): mixed_p = (1 - alpha) * model_p + alpha * ng_p, floored at 1e-12. Linear in probability space, standard.
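The quoted mix can be checked with a minimal scalar sketch (function name is mine):

```python
def mix_probs(model_p: float, ng_p: float, alpha: float) -> float:
    # Linear mix in probability space, floored at 1e-12 as in the quoted code.
    return max((1.0 - alpha) * model_p + alpha * ng_p, 1e-12)
```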

Why the BPB is 0.9642 and not ~1.08: Same as #915: the pre-ngram stride-64 number is 1.1225 (exactly in the SP1024 11L VRL pack), and the n-gram cache buys ~0.16 BPB on top at large eval-wallclock cost. The 0.9642 is a post-processing number layered over a normal 1.12-ish neural eval.

Main flag — eval-budget compliance (IDENTICAL to #915, and this matters more for a record):

From train_seed1337.log:

final_int6_sliding_window val_loss:1.8953 val_bpb:1.1225 stride:64 eval_time:102169ms
final_ngram               val_loss:1.6277 val_bpb:0.9640 ngram_eval_time:895349ms

Gauntlet (CPU pre-flight on the PR head at SHA 50ec6bc):

[PASS] Import, Hyperparameters (dim=512, layers=11, heads=8, vocab=1024)
[PASS] Model: 26,993,766 params
[PASS] Forward pass: loss=6.9362
[PASS] Artifact: 4,635,892 B (29.0% of 16MB) via int6+lzma on freshly-initialized weights
[INFO] Code size: 67,048 B (matches submission.json bytes_code exactly)
[INFO] Est. 8×H100: 45.9 ms/step, 13,058 steps in 10 min

Gauntlet PASS on all checks. Unlike #915, this version has no fused-kernel compile step, so CPU imports straight through — no fallback path needed.

Seed coverage / artifact sizes (per README table, verified against train_seed1337.log):

  • seed 1337: 15,981,848 B, post-ngram 0.9640
  • seed 42: 15,904,632 B, post-ngram 0.9641
  • seed 2025: 15,974,308 B, post-ngram 0.9644
  • All under 16,000,000 B. Mean 0.9642, std 0.0002. Pre-ngram stride-64 is 1.1225 / 1.1224 / 1.1231. Tight.

Questions / flags:

  1. Eval wallclock (same flag as PR #915). The n-gram stage is pure-Python over a NumPy buffer and takes ~15 min on an 8-GPU image. If the eval budget is 10 min total wallclock (my reading), this doesn't fit; if it's 10 min per-GPU (the author's reading), it does. Record-track submissions need this resolved before listing.
  2. Is #889 (this PR, record-track) or #915 (non-record) the "authoritative" filing? Both point at the same records folder. #889 was created 2026-03-26 19:13 UTC; #915 came later. If this stack is going to land, the mod team should decide which PR to merge and close the other to avoid a duplicate records folder.
  3. Prior-art credit. README credits PR #727 (@Asukabot0) for the n-gram backoff, PR #414 for the neural base, PRs #493/#518 for LeakyReLU², PR #569 for VRL, and PR #287 for XSA. Clean attribution.

Verdict: NEEDS CLARIFICATION — on eval-budget interpretation. The technique is compliant (n-gram is backward-looking, score-first, no oracle; byte-identical to #915's clean implementation) and the engineering is clean, but a record-track listing of 0.9642 requires the eval wallclock question to be resolved.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:


Reviewed by @MatoTeziTanka (The Agora). Gauntlet ran clean on CPU: all checks PASS, artifact budget 29.0%, no fused-kernel fallback needed (this PR doesn't contain one). AI tooling: review drafted with Claude Code (Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 50ec6bce1d6722caa8d20ad6f6f53fbec9abfdae, including a byte-for-byte diff of the n-gram cache against PR #915 at SHA 15a5cb8c.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
