Skip to content

Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed)#778

Open
raahilshah wants to merge 3 commits intoopenai:mainfrom
raahilshah:submission/2026-03-25_11L_FullGPTQ_NgramBackoff
Open

Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed)#778
raahilshah wants to merge 3 commits intoopenai:mainfrom
raahilshah:submission/2026-03-25_11L_FullGPTQ_NgramBackoff

Conversation

@raahilshah
Copy link
Copy Markdown

@raahilshah raahilshah commented Mar 25, 2026

Summary

Fixed alpha (safest): 3-seed mean val_bpb = 0.9757 (std=0.0002)
Entropy-adaptive alpha: 3-seed mean val_bpb = 0.9605 (std=0.0003)

15.92 MB | 8xH100 SXM | Training 596s/600s

3-Seed Results

Seed Neural-only Fixed alpha (a=0.40) Entropy-adaptive Artifact GPTQ Budget
1337 1.11719 0.97558 0.96027 15,921,027 B 596s/600s
42 1.11715 0.97562 0.96029 15,929,323 B 596s/600s
7 1.11787 0.97602 0.96082 15,922,059 B 596s/600s
Mean 1.11740 0.97574 0.96046
Std 0.00041 0.00024 0.00031

Two Variants

Variant 1: Fixed alpha (safest legal)

  • NGRAM_ENTROPY=0, constant alpha=0.40
  • Blend: p = 0.60 * model + 0.40 * ngram — same weight for every token
  • 3-seed mean: 0.9757

Variant 2: Entropy-adaptive alpha

  • NGRAM_ENTROPY=1, alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
  • Alpha depends on model output entropy H only (never on the true token)
  • 3-seed mean: 0.9605

Key Techniques

Technique Description
Full Hessian GPTQ 64-batch calibration within 600s training budget (reserved 14s)
Multi-order n-gram backoff Orders 2-7, highest available order used first
Fixed/entropy-adaptive blend Model + n-gram probability interpolation
Backward-looking cache Counts updated AFTER scoring each window

Compliance

  • Training: 586s training + 10s GPTQ = 596s (within 600s). No training data accessed during eval.
  • GPTQ calibration: Uses training data within reserved training budget, NOT during eval
  • Eval: ~86s sliding + ~130s cached = ~216s (within 600s)
  • N-gram cache: Backward-looking only, no oracle selection, no true-token peeking
  • Fixed alpha: No data-dependent weighting whatsoever
  • Entropy-adaptive: Depends on model output distribution only, not ground truth
  • Artifacts: All seeds under 16,000,000 bytes

Architecture

11L, 512d, GQA 8H/4KV, LeakyReLU(0.5)^2 MLP 3x, XSA-all, VE128, BigramHash(2048), Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997), Parallel Muon, Full Hessian GPTQ int6 + LZMA.

Ablation (seed 1337)

Configuration val_bpb Delta
Neural-only (no cache) 1.1172 baseline
Fixed 7-gram only, alpha=0.40 1.0258 -0.0914
Multi-order backoff (2-7), fixed alpha 0.9756 -0.1416
Multi-order backoff (2-7), entropy-adaptive 0.9603 -0.1569

Credits

raahilshah and others added 2 commits March 25, 2026 22:17
…757, 3-seed mean)

3-seed mean val_bpb = 0.9757 (std=0.0002) on 8xH100 SXM.
Training 586s + GPTQ 10s = 596s within 600s budget.
Multi-order backward-looking n-gram cache (orders 2-7, fixed alpha=0.40).
All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0.9757)

Both variants included with full 3-seed results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raahilshah raahilshah changed the title Record: 11L Full GPTQ + Multi-Order N-gram Backoff Cache (val_bpb=0.9757, 3-seed mean) Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed) Mar 25, 2026
Neural std: 0.00041 (was 0.00033)
Fixed std: 0.00024 (was 0.00020)
Entropy std: 0.00031 (was 0.00025)

All means and individual seed values were already correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed)

BPB: 0.9757 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA 5dc9345dd8d2, file records/track_10min_16mb/2026-03-25_11L_FullGPTQ_NgramBackoff_0.9757/train_gpt.py):

The n-gram lookup key at line 1036 is constructed by XOR-ing the target token into the hash:

line 1036: full_key = <hash> ^ (tgt_np * ng_primes[...]) & mask

This matches the full_key = ((ctx_hash ^ (target * primes[k])) & mask) construction that @valerio-oai ruled disallowed on PR #779 (comment 4145781641, 2026-03-27). Per the mechanism explanation, hashing the target token into the lookup key only reweights the correct token — in the hash-collision limit this drives P(correct) → 1 regardless of the data, which inflates the reported BPB without producing real compression.

Per Issue #1017 condition 1, p_t may depend only on the artifact and x_1...x_{t-1}. Because the lookup key at line 1036 is a function of the target token, the count read at scoring position t depends on x_t itself — which is the core violation the #779 ruling targets.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=95919 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.


Reviewed by @MatoTeziTankaThe Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=95919 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants