
Record: Split-LR + N-gram Agreement + Full GPTQ — val_bpb 1.1079 (3-seed mean)#1302

Open
vlivashkin wants to merge 2 commits into openai:main from vlivashkin:submission/splitlr-ngram-gptq

Conversation


@vlivashkin commented Apr 3, 2026

Summary

  • val_bpb: 1.1078 (3-seed mean, std 0.0009)
  • val_loss: 1.8752 nats (3-seed mean)
  • Artifact: ~15.86 MB (max 15,857,705 bytes)
  • Built on PR #1179 by @dexhunter (training) and PR #1145 by @AnirudhRahul (n-gram agreement eval)

SOTA (PR #1019, 3-seed mean): 1.8822 nats. This run: 1.8752 nats. Delta: -0.00697 nats. Clears the 0.005-nat threshold.

What's New vs PR #1019

Training (from PR #1179): Split-LR (early=0.025, late=0.030), BigramHash(2816×160), Sigmoid-gated U-Net, Soft-round QAT (alpha 1→16), Brotli-11 + byte-shuffle, Coprime-stride loader
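The soft-round QAT item in the list above (alpha 1→16) is not shown in this thread. Below is a minimal sketch of one common soft-rounding surrogate (the Agustsson & Theis formulation); the function name and the test values are illustrative assumptions, not the PR's code. Small alpha is close to the identity, while large alpha approaches hard rounding, which is the direction the alpha 1→16 anneal implies.

```python
import numpy as np

def soft_round(x, alpha):
    # Differentiable surrogate for round(x). For small alpha this is
    # nearly the identity; as alpha grows it approaches hard rounding,
    # so annealing alpha 1 -> 16 eases weights toward the quantized grid.
    m = np.floor(x) + 0.5          # half-integer cell centers
    r = x - m                      # offset within the cell, in [-0.5, 0.5)
    return m + 0.5 * np.tanh(alpha * r) / np.tanh(alpha / 2.0)

w = np.array([-1.8, -0.6, 0.3, 1.9])
soft_round(w, 0.1)   # approximately w itself (near-identity)
soft_round(w, 16.0)  # approximately np.round(w) (near-hard rounding)
```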

Evaluation: Online n-gram agreement — 3 causal experts (token 16-gram, within-word, word-start) with agreement boosting. Adjusts LLM probabilities via properly normalized exponential tilting. Contributes −0.0028 BPB.
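The tilting step itself is not inlined in this thread. As a minimal sketch, "properly normalized exponential tilting" means reweighting the LLM's next-token distribution by exp(lam * score) and renormalizing so the result still sums to 1 (the Z = 1.0 check in the compliance list below). The `lam` value and the score vector here are illustrative assumptions:

```python
import math

def tilt(probs, scores, lam=1.0):
    # Exponential tilting: q(x) = p(x) * exp(lam * s(x)) / Z,
    # with Z chosen so the tilted distribution sums to exactly 1.
    w = [p * math.exp(lam * s) for p, s in zip(probs, scores)]
    z = sum(w)
    return [wi / z for wi in w]

p = [0.5, 0.3, 0.2]          # base LLM next-token probabilities
s = [0.0, 1.0, 0.0]          # n-gram experts agree on token 1
q = tilt(p, s)               # token 1 boosted; q still sums to 1
```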

Results (8×H100 SXM, no TTT)

| Seed | Steps | Sliding BPB | Sliding val_loss (nats) | N-gram BPB | Artifact (bytes) |
|------|-------|-------------|-------------------------|------------|------------------|
| 1337 | ~6780 | 1.1110      | 1.8760                  | 1.1083     | 15,853,466       |
| 42   | ~6780 | 1.1095      | 1.8734                  | 1.1068     | 15,857,705       |
| 2025 | ~6780 | 1.1112      | 1.8763                  | 1.1085     | 15,846,914       |
| Mean |       | 1.1106      | 1.8752                  | 1.1078     |                  |

Compliance

  • 3-seed verification (std 0.0009)
  • Delta vs SOTA: -0.00697 nats (val_loss), exceeds 0.005-nat threshold
  • No TTT, no SLOT, no eval-time weight adaptation
  • N-gram agreement: causal (predict-then-update), score-first (inference_mode), properly normalized (exponential tilting, Z=1.0)
  • Artifact < 16,000,000 bytes (all seeds, max: 15,857,705)
  • Training ≤ 600s (~591s), eval ≤ 600s (~536s including n-gram)
  • GPTQ calibration within training budget (~7s)
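To illustrate the "causal (predict-then-update)" requirement, the toy expert below always scores a token from counts accumulated on strictly earlier tokens before updating on it. This is a sketch of the ordering constraint only, not the PR's `online_ngram_state.c` implementation; all names are illustrative.

```python
from collections import defaultdict

class CausalNgramExpert:
    """Token n-gram expert that scores each token BEFORE its counts
    are updated on that token (predict-then-update)."""
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))
        self.context = []

    def score(self, token):
        # Probability estimate from tokens seen strictly before this one.
        ctx = tuple(self.context[-(self.n - 1):])
        dist = self.counts[ctx]
        total = sum(dist.values())
        return dist[token] / total if total else 0.0

    def update(self, token):
        # Only now does the current token enter the statistics.
        ctx = tuple(self.context[-(self.n - 1):])
        self.counts[ctx][token] += 1
        self.context.append(token)

expert = CausalNgramExpert(n=2)
scores = []
for t in [1, 2, 1, 2, 1, 2]:
    scores.append(expert.score(t))  # score first...
    expert.update(t)                # ...then update
```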

Reproduction

pip install brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

# Training (3 seeds)
for SEED in 1337 42 2025; do
  BIGRAM_DIM=160 SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
  cp final_model.int6.ptz final_model_seed${SEED}.int6.ptz
done

# N-gram agreement eval
gcc -O3 -march=native -shared -fPIC -o libonline_ngram_state.so online_ngram_state.c
for SEED in 1337 42 2025; do
  BIGRAM_DIM=160 CHECKPOINT=final_model_seed${SEED}.int6.ptz \
  torchrun --standalone --nproc_per_node=8 eval_ngram_on_checkpoint.py
done

See README.md for full details.

Credits

@vlivashkin force-pushed the submission/splitlr-ngram-gptq branch from 76168cd to 3273d7f on Apr 3, 2026, 13:34
@MatoTeziTanka

Community Review — Record: Split-LR + N-gram Agreement + Full GPTQ — val_bpb 1.1079 (3-seed mean)

BPB: 1.1079 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 8d1eb2335a59, file records/track_10min_16mb/2026-04-03_SplitLR_NgramAgreement_FullGPTQ/train_gpt.py):

The TTT path at line 410 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
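The chunk ordering described above can be sketched framework-free. The function and callback names here are illustrative, not taken from train_gpt.py:

```python
def score_first_per_chunk(chunks, score, adapt, params):
    # Chunk i is scored under params adapted only on chunks 0..i-1,
    # and the final chunk gets no adaptation pass (is_last_chunk guard).
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score(params, chunk)     # score BEFORE adapting
        if i < len(chunks) - 1:           # skip adaptation on last chunk
            params = adapt(params, chunk)
    return total

log = []
def score_fn(params, chunk):
    log.append(("score", chunk, params))
    return 0.0
def adapt_fn(params, chunk):
    log.append(("adapt", chunk))
    return params + 1

score_first_per_chunk([10, 20, 30], score_fn, adapt_fn, params=0)
# log records each chunk being scored before any update on it,
# with no adaptation pass after the final chunk
```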

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=71339 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

