
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)#909

Open
sunnypatneedi wants to merge 26 commits into openai:main from sunnypatneedi:submission/v10-moonshot-0.8609

Conversation

@sunnypatneedi

@sunnypatneedi sunnypatneedi commented Mar 26, 2026

11-gram Eval Cache + Hedge Mixer on PR #549 Base

val_bpb: 0.8609 (3-seed mean, std 0.0008, sliding window stride=64) | ~15.9 MB | 8×H100 SXM

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | Steps | Roundtrip bpb | Sliding+N-gram bpb | N-gram gain | Eval time | Artifact (bytes) |
|---|---|---|---|---|---|---|---|
| 42 | 92ms | ~6,500 | 1.1452 | 0.8600 | -0.2852 | ~188s | 15,341,541 |
| 1337 | 92ms | ~6,500 | 1.1452 | 0.8611 | -0.2841 | ~188s | 15,918,565 |
| 2025 | 92ms | 6,526 | 1.1452 | 0.8616 | -0.2836 | 188s | 15,790,804 |
| Mean | 92ms | ~6,500 | 1.1452 | 0.8609 (std 0.0008) | -0.284 | ~188s | — |

Key Innovation: 11-gram Eval Cache with Entropy-Adaptive Mixing

The n-gram eval cache provides -0.284 bpb — the single largest improvement over the base model. It replaces TTT entirely, freeing the full eval time budget.

  1. Multi-order n-gram cache (orders 2-11): 10 hash tables with 4M buckets each, uint32 count tables
  2. Score-first, update-after protocol: n-gram counts are scored before being updated (legal per @valerio-oai, Issue #140)
  3. Entropy-adaptive alpha: mixing weight between neural and n-gram predictions is a function of model entropy — high-entropy (uncertain) tokens get more n-gram contribution
  4. Order-adaptive gating: higher-order matches get tighter entropy thresholds via order_centers = 3.0 - 0.25 * (matched_order - min_order)
  5. Hedge Mixer: online multiplicative-weights ensemble (beta=2.0) that learns optimal neural vs n-gram weighting across the eval run
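Items 3 and 4 above can be sketched in a few lines. The constants mirror the `NGRAM_ALPHA` / `NGRAM_ENT_BASE` / `NGRAM_ENT_RANGE` / `NGRAM_MIN_ORDER` env vars from the run config, but the exact way they compose with the sigmoid is an assumption, not the submission's code:

```python
import math

# Hedged sketch of entropy-adaptive alpha with order-adaptive gating.
# Constants follow the PR's env vars; the composition is an assumption.
ALPHA_MAX = 0.40   # NGRAM_ALPHA
ENT_BASE = 0.05    # NGRAM_ENT_BASE
ENT_RANGE = 0.55   # NGRAM_ENT_RANGE
MIN_ORDER = 2      # NGRAM_MIN_ORDER

def order_center(matched_order: int) -> float:
    # Item 4: higher-order matches get tighter (lower) entropy thresholds.
    return 3.0 - 0.25 * (matched_order - MIN_ORDER)

def adaptive_alpha(entropy: float, matched_order: int) -> float:
    # Item 3: sigmoid of model entropy around the order-specific center,
    # so high-entropy (uncertain) tokens get more n-gram contribution.
    gate = 1.0 / (1.0 + math.exp(-(entropy - order_center(matched_order))))
    return ALPHA_MAX * (ENT_BASE + ENT_RANGE * gate)
```

Under this form, an order-11 match at entropy 2.0 gets roughly twice the n-gram weight of an order-2 match at the same entropy, since its gate center sits at 0.75 instead of 3.0.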

N-gram Protocol

  1. Initialize 10 hash tables (orders 2-11), each with 4M buckets of uint32 counts
  2. For each evaluation position:
    • Score: look up n-gram match for each order (highest order first), compute n-gram probability
    • Compute model entropy from neural logits
    • Compute entropy-adaptive alpha (sigmoid of entropy vs order-specific threshold)
    • Hedge Mixer blends neural and n-gram-enhanced predictions using learned weights
    • Update: increment n-gram counts for all observed n-grams at this position
  3. Sliding window eval (stride=64) processes validation tokens with the n-gram cache active
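The protocol above can be sketched as a single loop, with a Python dict standing in for the 4M-bucket hash tables and a single n-gram order instead of ten; the log-loss form of the Hedge update is an assumption, not the submission's code:

```python
import numpy as np

BETA = 2.0  # HEDGE_BETA

def hedge_eval(tokens, p_neural, order=3, alpha=0.4):
    """tokens: token ids; p_neural[j]: neural prob of the true token at j."""
    ctx_counts, full_counts = {}, {}
    w = np.ones(2)       # expert weights: [pure neural, n-gram-enhanced]
    total_bits = 0.0
    for j in range(order - 1, len(tokens)):
        ctx = tuple(tokens[j - order + 1 : j])   # (order-1)-token context
        tgt = tokens[j]
        # --- score first: use only counts from earlier positions ---
        c = ctx_counts.get(ctx, 0)
        f = full_counts.get((ctx, tgt), 0)
        p_ng = f / c if c else 0.0
        experts = np.array([p_neural[j],
                            (1 - alpha) * p_neural[j] + alpha * p_ng])
        p_mix = float((w / w.sum()) @ experts)
        total_bits += -np.log2(max(p_mix, 1e-12))
        # --- Hedge: multiplicative-weights update under log loss ---
        w *= np.maximum(experts, 1e-12) ** BETA
        w /= w.sum()
        # --- update after: commit this position's n-gram counts ---
        ctx_counts[ctx] = c + 1
        full_counts[(ctx, tgt)] = f + 1
    return total_bits / (len(tokens) - order + 1)  # mean bits per scored token
```

On repetitive token streams the n-gram expert quickly dominates and the mixed bits drop well below the neural baseline; on streams with no repeated contexts the mixer falls back toward the pure neural expert.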

Run Config

cd /workspace/parameter-golf
SEED=42 BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py

All hyperparameters are baked into the script as defaults. Key environment variables:

# N-gram config
NGRAM_CACHE=1 NGRAM_ORDER=11 NGRAM_MIN_ORDER=2 NGRAM_BUCKETS=4194304
NGRAM_ENTROPY=1 NGRAM_ALPHA=0.40 NGRAM_ENT_BASE=0.05 NGRAM_ENT_RANGE=0.55

# Hedge Mixer
HEDGE_ENABLED=1 HEDGE_BETA=2.0

# Model (no BigramHash, VE_DIM=64 to fit 16MB across all seeds)
BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0

# TTT disabled (n-gram replaces it)
TTT_ENABLED=0

Timing Budget

| Phase | Time |
|---|---|
| Training | 600s (≤10 min) |
| Int6 roundtrip eval (diagnostic) | ~49s |
| Sliding window + n-gram + Hedge eval (stride=64) | ~188s |
| Total eval | ~237s (< 10 min budget) |

Training Architecture (from PR #549 SOTA)

| Component | Setting |
|---|---|
| Layers | 11 (512d, 8H, 4KV GQA) |
| MLP | 3× expansion, LeakyReLU(0.5)² |
| XSA | All 11 layers |
| Gated Attention | Enabled |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE64 | Layers 7-10 |
| Weight avg | EMA(0.997) + SWA(every 50) |
| Quantization | Uniform Int6 + zstd-22 |
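The uniform Int6 setting can be illustrated with a minimal per-row symmetric quantizer (a sketch only; the submission's scale selection and zstd-22 packing are not reproduced here):

```python
import numpy as np

# Sketch of uniform symmetric per-row int6 quantization, with
# levels clipped to [-31, 31]. Illustrative, not the PR's code.

def quantize_int6(w: np.ndarray):
    # Per-row scale maps the row's max magnitude to the int6 clip range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize_int6(q, s) - w).max()  # bounded by half a step
```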

Ablation

| Config | val_bpb | Delta |
|---|---|---|
| Roundtrip (no n-gram, no sliding window) | 1.1452 | — (baseline) |
| + Sliding window (stride=64) + 11-gram + Hedge | 0.8609 | -0.284 |

Credits

sunnypatneedi and others added 24 commits March 24, 2026 10:48
Two-phase TTT pipeline (novel combination):
- Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026)
- Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens)

Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) +
LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gptq_calibrate(): collect Hessian H=X^TX via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6() — falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1: Function adapted MLP weights but never scored documents.
  All compute was wasted — no loss/bpb accumulation.
  Fix: Rewrote as inplace_ttt_eval() with apply-then-update loop:
  score chunk first (accumulate bpb), then gradient-update MLP proj.

Bug 2: Model left in last document's adapted state after function.
  This corrupted subsequent LoRA TTT evaluation.
  Fix: Reset MLP weights to original after all documents.

Also: Made In-Place TTT and LoRA TTT alternatives (config switch)
rather than sequential phases, since both produce val_bpb scores.
Use INPLACE_TTT_ENABLED=1 for In-Place, =0 for LoRA TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 results:
- Artifact 16.35MB (352KB over 16MB limit) — caused by GradQuant int7
- LoRA TTT took 1572s (2.6x over 600s budget) — 20 epochs too many
- Pre-quant val_bpb: 1.1757 (46 shards, not full 80)
- Post-quant sliding window: 1.3569

Fixes:
- GradQuant: top-10% sensitivity stays int6 (not int7)
- TTT epochs: 20 → 5 (should complete in ~400s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced
WORSE quantization than standard per-row int6. PR openai#486 base doesn't
use GPTQ at all. Possible issues: bad Hessian calibration, numerical
instability in Cholesky decomposition, or name mismatch between
hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path.
GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct.
The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#548 UNMODIFIED (1.0865 proven). Reproduce baseline.
Run 1: PR openai#548 + LeakyReLU(0.5)^2 (1 line change). Measure delta.

Following retro lesson: baseline first, one change at a time.
No GPTQ, no In-Place TTT, no XSA, no surprise gating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… PR openai#548

Run 0: PR openai#414 UNMODIFIED (merged SOTA 1.1228, verified 3-seed)
Run 1: PR openai#414 + LeakyReLU(0.5)^2 (1 line change)

Baseline against verified numbers, not claimed scores from open PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After main eval, sweeps T={0.95,0.96,0.97,0.98,0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on Run 2. Changes from PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31→15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb).
8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Template includes:
- README.md with placeholder results table
- submission.json with schema matching existing PRs
- submit.sh helper to collect logs and extract metrics

Fill in after successful runs, rename folder, PR to upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful.
GradQuant int5/int6 mix: 34KB over 16MB even without int7.
PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233).

Added lessons 13-16 to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#414 hardcodes `from flash_attn_interface import ...` (FA3/Hopper only).
This pod has FA2 but not FA3. Added try/except + SDPA fallback in attention.
Applied to all 4 runs (0-3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod has flash_attn 2.8.3 (from flash_attn import flash_attn_func)
but NOT flash_attn_interface (FA3/Hopper). Added cascading import.

Also keeping SDPA fallback for environments with no flash_attn at all.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#549 UNMODIFIED (merged SOTA 1.1194, verified 3-seed)
Run 1: PR openai#549 + TTT_ENABLED=1 + TTT_LR=0.0005 (2 lines changed)

Both have FA3→FA2→SDPA fallback for non-Hopper GPUs.
Following retro: one change per run, baseline first.

Expected: Run 1 should achieve ~1.094-1.104 (beats 1.1144 target).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrades TTT from PR openai#549's weak 3ep SGD (-0.0025 bpb) to PR openai#481's
proven AdamW 30ep cosine + per-layer LR recipe (expected -0.01 to -0.025).

Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post TTT metrics from logs
- README.md + submission.json: Updated for TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents "tensor does not have a device" error when torch.compile
tries to recompile after TTT modified model weights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD
TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x).
3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_bytes

PR openai#771 was listed as "0 seeds" in the competition tracker because
submission.json was missing the required `seeds` and `track` fields,
and used `bytes_total` instead of the expected `artifact_bytes` field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hanced n-gram

- train_gpt_v10_safe.py: v9a + Hedge Mixer (multiplicative weights) + add-delta n-gram smoothing, dim=512
- train_gpt_v10_moonshot.py: model_dim=640 (42M params) + adaptive quant (ternary MLP / int4 attn / int6 embed) + Hedge Mixer
- auto_experiment.py: local CPU random search over 20 configs, logs to experiments.jsonl
- submit.sh: packaging and staging script for H100 runs
- PLAN.md: strategy doc with size estimates and run order

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- validate_configs.py: CPU-only artifact size estimator for moonshot configs (no GPU/data needed)
- experiments.jsonl: 20 initial random search results from auto_experiment.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10 moonshot: ternary MLP quant + scaled model + hedge mixer + enhanced n-gram
3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616).
All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive
alpha and Hedge Mixer on PR openai#549 base architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi and others added 2 commits March 27, 2026 08:47
3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616).
All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive
alpha and Hedge Mixer on PR openai#549 base architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive experiment tracking and moonshot submissions
@MatoTeziTanka

Community Review — 11-gram Eval Cache + Hedge Mixer

BPB: 0.8609 (3-seed, std 0.0008) | Seeds: 3 (42, 1337, 2025) | Artifact: ~15.3–15.9 MB | Compliance: FLAG (same n-gram family bug pattern)

What this does: The base roundtrip neural model scores val_bpb 1.1452. The eval-time pipeline adds a 10-order hashed n-gram cache (orders 2–11, 4M buckets/order, uint32 counts) that blends n-gram probabilities into the neural distribution under an entropy-adaptive alpha, then wraps the neural and n-gram-enhanced experts in an online multiplicative-weights Hedge Mixer (beta=2.0). The ablation in the PR body attributes the full -0.2843 bpb improvement to "sliding window + 11-gram + Hedge"; the training model by itself sits at 1.1452.

What I found in the code (records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py at head 8834070):

  • L1000–1007 — target token hashed into the lookup key. For each order ctx_w, the code computes

    ctx_hash = np.zeros(len(jv), dtype=np.uint64)
    for k in range(ctx_w):
        tok = val_np[jv - (ctx_w - k)].astype(np.uint64)
        ctx_hash ^= tok * ng_primes[k % len(ng_primes)]
    ctx_key = (ctx_hash & ng_mask).astype(np.int64)
    tgt_np = val_np[jv].astype(np.uint64)
    full_key = ((ctx_hash ^ (tgt_np * ng_primes[ctx_w % len(ng_primes)])) & ng_mask).astype(np.int64)

    tgt_np = val_np[jv] is the target token at each scored position (jv = global_j is the list of scored indices inside the sliding window). It is XOR-mixed into full_key and then used to index full_tables[oi] at L1018 — the probability becomes p = min(full_counts, ctx_counts) / max(ctx_counts, 1) at L1023. This is the exact pattern disallowed on PR #779.

  • L1049–1065 — Hedge Mixer over (pure neural, n-gram-enhanced). Expert 1 is p_neural_pure (legal). Expert 2 is p_ngram_enhanced = (1 - alpha) * p_neural + alpha * best_p_ng, where best_p_ng is the value read out of full_tables via the target-hashed full_key above. The hedge wrapper itself is mathematically fine — the issue is that one of its two experts is not legal as constructed.

  • L1069–1075 — "score-first, update-after" is implemented correctly in the n-gram update path, and the hedge weight update at L1061–1062 also uses the scored segment before committing. Update order is not the issue on this PR; the issue is the lookup construction.

  • Training code is clean under the CPU gauntlet. On CPU (cpu_test.py): import OK, Hyperparameters dim=512, layers=11, heads=8, vocab=1024, 26,629,301 params, forward loss 6.9456, artifact 4,591,102 bytes (28.7% of 16 MB under int6+lzma), est. ~13k steps on 8×H100 in 10 min. Nothing wrong with the base model — all of the BPB delta is coming out of the eval cache.


Verdict: COMPLIANCE FLAG — same n-gram family pattern as PR #779 / #770 / #798 / #808 / #825.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:
CLOSE under the same ruling as PR #779 / #770 / #798 / #808 / #825 / #786 / #797. The Hedge Mixer wrapper and the base training code are not the issue and would carry over cleanly to a resubmission with a context-only n-gram cache (lookup key built from ctx_hash alone, with per-candidate probabilities derived from a single context-conditional row over the full vocabulary, as @valerio-oai suggested in PR #779 comment 4146407380).
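For reference, a context-only cache of the kind recommended here could look roughly like this (table sizes, primes, and helper names are illustrative, not a drop-in fix for the PR's code):

```python
import numpy as np

# Sketch of a context-only n-gram lookup: the key is built from the
# context hash ALONE, and per-token probabilities come from a single
# context-conditional count row over the full vocabulary, so the
# target token never enters the key.

VOCAB = 1024
primes = np.array([1000003, 999983, 1000033], dtype=np.uint64)

# One count row per context bucket (small demo table, 4096 buckets).
counts = np.zeros((1 << 12, VOCAB), dtype=np.uint32)

def ctx_key(context) -> int:
    h = np.uint64(0)
    for k, tok in enumerate(context):
        h ^= np.uint64(tok) * primes[k % len(primes)]
    return int(h) & (counts.shape[0] - 1)

def ngram_probs(context) -> np.ndarray:
    # Legal: one row conditions on context only; all candidates share it.
    row = counts[ctx_key(context)]
    total = row.sum()
    return row / total if total else np.full(VOCAB, 1.0 / VOCAB)

def update(context, target: int) -> None:
    # Score-first, update-after still applies: call after scoring.
    counts[ctx_key(context), target] += 1
```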


Reviewed by @MatoTeziTanka (The Agora). CPU gauntlet PASS at head 8834070: 26.6M params, int6+lzma artifact 4.59 MB (28.7% of 16 MB), forward loss 6.9456, est. ~13k 8×H100 steps in 10 min — base training code is clean; the compliance flag is entirely on the eval-time n-gram lookup. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 8834070de5a6e7ad1df76f660072eb7d54201ed6.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>