
N-gram logit boost + HedgeMixer + score-first TTT #1014

Open
haimianbaobao007 wants to merge 1 commit into openai:main from haimianbaobao007:ngram-logit-boost-hedgemixer

Conversation

@haimianbaobao007

Summary

  • N-gram logit boost: properly normalized via softmax, fixing the hash collision normalization bug that affects most hash-based n-gram PRs. Uses a log-count boost instead of a raw probability ratio.
  • HedgeMixer: online multiplicative weights mixing between neural and neural+n-gram experts (inspired by PR #700).
  • TTT with SGD momentum=0.95, per-layer LR (output proj 3x, FC 0.5x), and Polyak averaging (inspired by PR #995).
  • Online bias correction: per-document logit bias (Nacrith 2026).
  • Numba JIT acceleration for n-gram eval (~20x speedup), EMA skip for short runs (<1000 steps), and an FA3→FA2→SDPA fallback chain for non-Hopper GPUs.
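The log-count boost can be sketched in a few lines. This is a minimal illustration, not the PR's actual code: the table layout (`ngram_counts` keyed by context tuple) and the scale `alpha` are assumptions; the point is that boosting *logits* by `alpha * log(1 + count)` and then re-applying softmax keeps the distribution properly normalized even when hash collisions corrupt individual counts.

```python
import math

def ngram_boosted_probs(logits, context, ngram_counts, alpha=0.5):
    """Hypothetical sketch of the n-gram logit boost.

    `ngram_counts[context][token]` holds raw (possibly collision-noisy)
    hash-table counts. Each logit is boosted by alpha * log(1 + count),
    then softmax renormalizes, so the output is always a valid
    probability distribution.
    """
    counts = ngram_counts.get(context, {})
    boosted = [z + alpha * math.log1p(counts.get(t, 0))
               for t, z in enumerate(logits)]
    m = max(boosted)                       # subtract max for a stable softmax
    exps = [math.exp(z - m) for z in boosted]
    s = sum(exps)
    return [e / s for e in exps]
```

Because the boost is additive in logit space, a raw-probability-ratio bug cannot push the mixture out of the simplex; normalization is handled once, at the end.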

Results (RTX PRO 6000, 1 GPU, 535 steps)

  • FP32 base: 1.62 BPB
  • Int6 + sliding window + n-gram + HedgeMixer: 2.87 BPB (-8% vs int6 alone)
  • 535 steps is insufficient for QAT convergence; results on 8xH100 are expected to be much better.

Compliance

  • Score-first: every token scored before any update that uses it.
  • N-gram tables updated after scoring. HedgeMixer weights updated after scoring.
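The score-then-update ordering for the HedgeMixer can be sketched as follows. This is an illustrative skeleton under assumed names (`expert_probs`, `eta`), not the PR's implementation: each token's mixed probability is recorded before the multiplicative-weights update consumes that token's loss.

```python
import math

def hedge_mix_score_first(tokens, expert_probs, eta=0.1):
    """Score-first Hedge mixing over two experts (neural, neural+ngram).

    Hypothetical sketch: expert_probs(i) returns each expert's probability
    for the observed token at position i, computed BEFORE any state is
    updated on that token. Weights change only after scoring.
    """
    w = [1.0, 1.0]                        # one weight per expert
    total_nll = 0.0
    for i, _tok in enumerate(tokens):
        probs = expert_probs(i)           # per-expert prob of observed token
        z = sum(w)
        mixed = sum(wi / z * pi for wi, pi in zip(w, probs))
        total_nll += -math.log(mixed)     # score first...
        # ...then the Hedge (multiplicative weights) update on log loss
        w = [wi * math.exp(-eta * (-math.log(pi)))
             for wi, pi in zip(w, probs)]
    return total_nll / len(tokens)
```

Because the update multiplies each weight by `exp(-eta * loss)`, a persistently better expert dominates the mixture over time, which is what makes Hedge well suited to the non-stationary eval stream.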

Based on PR #549 by @abaybektursun.
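The TTT optimizer described in the summary (SGD with momentum 0.95, a per-layer LR multiplier, and a Polyak average used for scoring) has a simple shape. The sketch below uses scalar parameters and invented names (`lr_scale`, `ttt_sgd_step`) purely for illustration; the real update runs over tensors.

```python
def ttt_sgd_step(params, grads, velocity, avg, lr_scale,
                 base_lr=0.01, momentum=0.95, polyak=0.999):
    """One hypothetical TTT update step.

    - SGD with heavy-ball momentum (0.95 in the PR).
    - Per-layer LR via a multiplier table (e.g. out_proj: 3.0, fc: 0.5).
    - A Polyak (exponential moving) average of the weights, which is
      the copy typically used for scoring.
    All dicts map parameter name -> scalar and are mutated in place.
    """
    for name in params:
        lr = base_lr * lr_scale.get(name, 1.0)
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        params[name] += velocity[name]
        avg[name] = polyak * avg[name] + (1 - polyak) * params[name]
```

Polyak averaging smooths the noisy per-document SGD trajectory, which matters when adaptation runs for only a handful of steps per document.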

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@immartian

The HedgeMixer (multiplicative weights for neural vs neural+ngram experts) is a clean approach to the interpolation problem. We're working on a similar idea in PR #541 where binding energy acts as the per-token confidence signal for mixing, but your online multiplicative weights adaptation is more principled for the non-stationary eval setting.

The log-count boost for n-gram normalization is a good fix — raw probability ratios from hash tables with collisions are noisy. Have you tried scaling the boost by an IDF-like term (inverse document frequency of the n-gram context)? In our experiments, rare contexts carry much more predictive signal than common ones, and weighting by context specificity improved pattern selection significantly.
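The IDF-style reweighting suggested here could look like the sketch below. Everything in it is an assumption for illustration (`doc_freq`, `num_docs`, the clamp at zero), not code from either PR: the idea is simply that a context seen in few documents gets a larger boost multiplier than a ubiquitous one.

```python
import math

def idf_weight(context, doc_freq, num_docs):
    """Hypothetical IDF-style weight for an n-gram context.

    doc_freq maps context -> number of documents the context appears in.
    Rare contexts get a large positive weight; contexts seen in most
    documents are clamped toward zero so they contribute little boost.
    """
    df = doc_freq.get(context, 0)
    return max(0.0, math.log(num_docs / (1 + df)))
```

The per-token boost then becomes `alpha * idf_weight(ctx, ...) * log1p(count)` instead of a flat `alpha * log1p(count)`, concentrating the n-gram signal on specific contexts.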

535 steps on a single GPU is tough — the n-gram + HedgeMixer results should improve dramatically with more training. Good luck on the 8xH100 run.

@MatoTeziTanka

Community Review — N-gram logit boost + HedgeMixer + score-first TTT

BPB: 2.8700 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA a6857b36bd6e, file train_gpt.py):

The TTT path at line 1340 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
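The score-first-per-chunk ordering described above reduces to a small loop. This sketch abstracts the model behind `score` and `adapt` callables (names invented here) to show only the compliance-relevant structure: chunk `ci` is scored under weights adapted on chunks `0..ci-1`, and the final chunk is scored but never adapted on.

```python
def score_first_per_chunk(chunks, score, adapt):
    """Structural sketch of the score-first-per-chunk TTT loop.

    score(chunk) evaluates the chunk under the CURRENT weights (in the
    real code, under torch.no_grad()/inference_mode()); adapt(chunk)
    runs the SGD adaptation pass on that same chunk afterwards. The
    is_last_chunk guard means the final chunk gets no adaptation pass.
    """
    losses = []
    for ci, chunk in enumerate(chunks):
        losses.append(score(chunk))       # scored under weights from 0..ci-1
        if ci < len(chunks) - 1:          # is_last_chunk guard
            adapt(chunk)                  # update only AFTER scoring
    return losses
```

Any reordering of those two calls (adapt before score on the same chunk) is what would break the score-first rule from Issues #402 and #677.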

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=108102 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
