
Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958 (3-seed mean) #885

Open
lolrazh wants to merge 1 commit into openai:main from lolrazh:submission/ngram-ttt-quant

Conversation


@lolrazh lolrazh commented Mar 26, 2026

Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958

val_bpb = 0.9958 (3-seed mean, std 0.0017) | ~14.0 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-TTT bpb | Post-TTT+ngram bpb | TTT+ngram time | Artifact (bytes) |
|------|----------|-------|-------------|--------------------|----------------|------------------|
| 1337 | 104.6 ms | 5,735 | 1.1516 | 0.9977 | 552 s | 13,834,050 |
| 42   | 88.3 ms  | 6,799 | 1.1485 | 0.9947 | 564 s | 13,933,238 |
| 2025 | 93.1 ms  | 6,446 | 1.1448 | 0.9949 | 560 s | 14,007,046 |
| Mean | ~95 ms   | ~6,327 | 1.1483 | 0.9958 (std 0.0017) | ~559 s | |
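As a sanity check, the headline mean and spread follow from the per-seed Post-TTT+ngram column above (assuming sample standard deviation, i.e. the n-1 denominator):

```python
# Reproduce the reported 3-seed statistics from the table values.
from statistics import mean, stdev

post_ttt_bpb = [0.9977, 0.9947, 0.9949]  # seeds 1337, 42, 2025

print(round(mean(post_ttt_bpb), 4))   # 0.9958
print(round(stdev(post_ttt_bpb), 4))  # 0.0017 (sample std, n-1 denominator)
```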

What's New

  1. Backward-looking 7-gram eval cache (alpha=0.2, score-first, ~98% hit rate) — exploits FineWeb's repetitive n-gram structure. Cache starts empty, builds from scored val tokens only. No oracle, no training data access during eval.

  2. Entropy-regularized QAT — penalty term pushes weights toward quantization grid during warmdown. Halves quant gap (0.009 vs 0.017 BPB).

  3. Mixed int5/int6 quantization (front3_back1_6_middle5) — int6 for sensitive layers (first 3 + last 1), int5 for middle. Combined with per-row GPTQ-lite clip search.

  4. LeakyReLU(0.9)² — slope 0.9 beats 0.5 by 0.013 BPB (controlled sweep, issue #140).

  5. Score-first TTT (recipe from PR #549, the 1.1194 "LeakyReLU² + Legal Score-First TTT + Parallel Muon" record) — SGD(lr=0.002, mom=0.9), 3 epochs per 32K chunk, all blocks unfrozen.
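A backward-looking cache of the kind item 1 describes can be sketched as follows. This is an illustrative reconstruction, not the PR's code: `NGramCache`, `score_bpb`, and the `model_prob` interface are hypothetical names, while n=7, alpha=0.2, and the score-then-insert ordering come from the description above.

```python
# Illustrative sketch of a backward-looking n-gram eval cache. Keys are the
# previous n-1 tokens; values count the tokens that followed. Each token is
# scored FIRST (score-first), then inserted, so the cache only ever contains
# already-scored history: no oracle, no lookahead.
from collections import defaultdict
import math

class NGramCache:
    def __init__(self, n=7, alpha=0.2):
        self.n = n            # 7-gram: 6-token context -> next token
        self.alpha = alpha    # interpolation weight for the cache (PR uses 0.2)
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context, token):
        key = tuple(context[-(self.n - 1):])
        bucket = self.counts.get(key)
        if not bucket:
            return None       # cache miss: fall back to the model alone
        return bucket[token] / sum(bucket.values())

    def update(self, context, token):
        key = tuple(context[-(self.n - 1):])
        self.counts[key][token] += 1

def score_bpb(tokens, model_prob, cache):
    """model_prob(context, token) -> model probability (assumed interface)."""
    nats = 0.0
    for i in range(1, len(tokens)):
        ctx, tok = tokens[:i], tokens[i]
        p_model = model_prob(ctx, tok)
        p_cache = cache.prob(ctx, tok)
        p = p_model if p_cache is None else \
            (1 - cache.alpha) * p_model + cache.alpha * p_cache
        nats += -math.log(p)
        cache.update(ctx, tok)   # insert only AFTER scoring this token
    return nats / (len(tokens) - 1) / math.log(2)  # nats -> bits per token
```

On repetitive text the cache hits quickly (the PR reports ~98% hit rate on FineWeb's n-gram structure), and the interpolated probability beats the model alone on repeated spans while never touching unscored tokens.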

Timing Note

The logs show a standalone sliding-window eval (~75-98 s) that ran before TTT. It is redundant: TTT includes its own sliding-window scoring, and the standalone eval's BPB is not the reported score. Without it, eval time is 576-581 s (within the 600 s budget). Full explanation in the README.
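Returning to item 2 (entropy-regularized QAT): the PR's exact penalty is not reproduced here, but a term that "pushes weights toward the quantization grid" can be sketched with a simple distance-to-nearest-grid-point penalty as one plausible instantiation. `grid_penalty` and the `lam` schedule below are hypothetical names, not the PR's API.

```python
# Hypothetical sketch of a grid-attraction penalty for QAT. During warmdown,
# adding lam * grid_penalty(w) to the loss pulls each weight toward its
# nearest representable int-grid point, shrinking the post-quantization gap.
import torch

def grid_penalty(weight: torch.Tensor, bits: int = 5) -> torch.Tensor:
    """Mean squared distance of each weight to its nearest int-grid point."""
    levels = 2 ** bits
    # Symmetric per-tensor scale (per-row scaling, as in the PR's GPTQ-lite
    # clip search, would be a straightforward extension).
    scale = weight.abs().max().clamp(min=1e-8) / (levels // 2 - 1)
    q = torch.round(weight / scale).clamp(-(levels // 2), levels // 2 - 1)
    return ((weight - q * scale) ** 2).mean()

# Training-loop usage (lam ramped up during warmdown in the PR's setting):
# loss = task_loss + lam * sum(grid_penalty(p) for p in model.parameters())
```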

Credits

…9958 (3-seed mean)

3-seed mean: 0.9958 BPB (std 0.0017). Seeds 1337/42/2025: 0.9977/0.9947/0.9949.

Built on PR #549 stack + four additions:
- Backward-looking 7-gram eval cache (alpha=0.2, score-first, ~98% hit rate)
- Entropy-regularized QAT (halves quant gap: 0.009 vs 0.017)
- Mixed int5/int6 quantization (front3_back1_6_middle5) + per-row GPTQ-lite
- LeakyReLU(0.9)² (0.013 BPB better than slope 0.5)

All artifacts under 16MB (~14.0 MB). All eval under 10 min (~552s TTT+ngram).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: LeakyReLU(0.9)² + N-gram Cache + Entropy-Reg QAT — val_bpb 0.9958 (3-seed mean)

BPB: 0.9958 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA 7ccb4a43d287, file records/track_10min_16mb/2026-03-26_LeakyReLU09_NgramCache_EntropyQAT/train_gpt.py):

The TTT path at line 1225 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
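The score-first-per-chunk shape described above can be sketched as below. Names and the model/eval interfaces are illustrative, not the PR's code; the SGD hyperparameters and epoch count follow the recipe quoted in the PR description.

```python
# Schematic of score-first-per-chunk TTT: chunk ci is scored under weights
# adapted only on chunks 0..ci-1, then the model adapts on that same
# (already-scored) chunk. The last chunk gets no adaptation pass.
import torch

def ttt_eval(model, chunks, ttt_epochs=3, lr=0.002, momentum=0.9):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_loss, total_tokens = 0.0, 0
    for ci, chunk in enumerate(chunks):
        # 1) Score chunk ci before any update has seen it.
        model.eval()
        with torch.no_grad():
            loss = model(chunk)            # assumed: returns mean token loss
        total_loss += loss.item() * chunk.numel()
        total_tokens += chunk.numel()
        # 2) Then adapt on the scored chunk (is_last_chunk guard).
        if ci < len(chunks) - 1:
            model.train()
            for _ in range(ttt_epochs):
                opt.zero_grad()
                model(chunk).backward()
                opt.step()
    return total_loss / total_tokens       # convert to bpb downstream
```

The legality argument is visible in the control flow: the `torch.no_grad()` scoring of chunk `ci` strictly precedes the optimizer steps on chunk `ci`, so no token's score ever reflects weights that were trained on it.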

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.21s, dim=512, layers=11, vocab=1024, code=101270 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
