
Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean)#1586

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:dexhunter/adaptive-clip-emb7-mlr026

Conversation

@dexhunter
Contributor

Summary

val_bpb = 1.07493 (3-seed mean, std 0.00078) | 2.77666 nats | ~15.93 MB | 8xH100 SXM, 600s

| Seed | Pre-TTT BPB | Post-TTT BPB | TTT Gain | Artifact (bytes) |
|------|-------------|--------------|----------|------------------|
| 42   | 1.08275 | 1.07437 | -0.00838 | 15,934,100 |
| 0    | 1.08270 | 1.07460 | -0.00810 | 15,937,217 |
| 1337 | 1.08449 | 1.07582 | -0.00867 | 15,928,721 |
| Mean | 1.08331 | 1.07493 | -0.00838 | 15,933,346 |

Merged SOTA (PR #1493): 2.78932 nats. Delta: -0.01266 nats (clears the 0.005-nat bar by ~2.5x).

Key Innovation: Per-Layer Adaptive GPTQ Clip

Different GPTQ clip_sigmas for MLP vs attention weights — a novel quantization approach not used by any other submission:

  • MLP layers (blocks.*.mlp.*): clip_sigmas = 12.0 — tighter clipping for higher quantization precision on the largest parameter group
  • Attention layers (blocks.*.attn.*): clip_sigmas = 13.0 — looser clipping for better compressibility
  • Embeddings (tok_emb): EMBED_BITS = 7 (int7) with clip_sigmas = 15.0 — int7 saves ~530 KB vs int8 while preserving quality

This per-layer approach captures most of the BPB gain of uniformly tighter clipping (whose artifact exceeds 16 MB) while keeping the artifact under budget.
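As a rough illustration, the per-layer rule above amounts to a small name-based dispatch over parameter names. The helper below is a hypothetical sketch: the constant names mirror the PR's MLP_CLIP_SIGMAS / ATTN_CLIP_SIGMAS / EMBED_CLIP_SIGMAS knobs, but the function and the parameter-name patterns are illustrative, not the submission's actual code.

```python
# Hypothetical sketch of per-layer GPTQ clip selection.
MLP_CLIP_SIGMAS = 12.0
ATTN_CLIP_SIGMAS = 13.0
EMBED_CLIP_SIGMAS = 15.0
DEFAULT_CLIP_SIGMAS = 13.0

def clip_sigmas_for(param_name: str) -> float:
    """Pick the GPTQ clip threshold (in std-devs) for one parameter tensor."""
    if ".mlp." in param_name:
        return MLP_CLIP_SIGMAS    # tighter clip -> finer quantization grid
    if ".attn." in param_name:
        return ATTN_CLIP_SIGMAS   # looser clip -> better compressibility
    if param_name.startswith("tok_emb"):
        return EMBED_CLIP_SIGMAS  # widest clip, paired with int7 storage
    return DEFAULT_CLIP_SIGMAS
```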

Additional tuning

  • MATRIX_LR = 0.026 (vs. the 0.022 default) — a sharp optimum found via a systematic 6-point sweep
  • WARMDOWN_FRAC = 0.75 and TTT_CHUNK_SIZE = 48
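To make the clip knobs above concrete, here is a minimal sketch of sigma-clipped symmetric uniform quantization, the mechanism the various clip_sigmas values control: a tighter clip shrinks the dynamic range, giving a finer grid for the surviving weights at the cost of saturating outliers. This is an assumption-laden illustration, not the submission's GPTQ implementation.

```python
import numpy as np

def quantize_clipped(w: np.ndarray, bits: int, clip_sigmas: float):
    """Symmetric uniform quantization after a sigma-based clip (sketch)."""
    cap = clip_sigmas * w.std()          # clip range in units of std-dev
    wc = np.clip(w, -cap, cap)           # saturate outliers beyond the clip
    qmax = 2 ** (bits - 1) - 1           # 63 for int7, 127 for int8
    step = cap / qmax                    # grid spacing inside the clip range
    q = np.round(wc / step).astype(np.int8)
    return q, step

# Dequantize with q * step; max error inside the clip range is step / 2.
```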

Rule Compliance (Issue #1017)

  • Condition 1 (Causality): VarLen attention with per-document cu_seqlens, strict causal masking
  • Condition 2 (Normalized): Standard softmax over full vocabulary
  • Condition 3 (Score before update): TTT chunks scored under torch.no_grad() BEFORE LoRA gradient update
  • Condition 4 (Single pass): Each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram cache
  • All artifacts < 16 MB, train < 600s, eval < 220s
  • Compile warmup uses random tokens (not val data)
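Condition 3's ordering can be sketched as a loop that scores each chunk with gradients disabled before any adaptation touches the weights. Everything below (the model, optimizer, and chunk shapes) is illustrative, not the submission's actual TTT code.

```python
import math
import torch
import torch.nn.functional as F

def ttt_score_then_adapt(model, opt, chunks):
    """Score each chunk under no_grad BEFORE the update (Condition 3),
    so every token is evaluated exactly once by pre-update weights."""
    nats, tokens = 0.0, 0
    for chunk in chunks:                   # chunk: 1-D LongTensor of token ids
        inp, tgt = chunk[:-1], chunk[1:]
        with torch.no_grad():              # 1) score first, gradients off
            nats += F.cross_entropy(model(inp), tgt, reduction="sum").item()
            tokens += tgt.numel()
        opt.zero_grad()                    # 2) only then adapt on the chunk
        F.cross_entropy(model(inp), tgt).backward()
        opt.step()
    return nats / tokens / math.log(2)     # bits per token
```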

Test Plan

  • 3-seed verification (seeds 42, 0, 1337)
  • All artifacts under 16,000,000 bytes
  • Train under 600s on all seeds (~587s)
  • Eval under 220s on all seeds

Credits

…b 1.07493 (3-seed mean)

Novel per-layer GPTQ quantization: MLP_CLIP_SIGMAS=12.0 + ATTN_CLIP_SIGMAS=13.0
+ EMBED_BITS=7 + EMBED_CLIP_SIGMAS=15.0 + MATRIX_LR=0.026.
3-seed mean: 1.07493 (std 0.00078), 2.77666 nats.
Delta vs merged SOTA (openai#1493): -0.01266 nats (clears 0.005 bar by 2.5x).
All artifacts < 16 MB (~15.93 MB), train < 600s, eval < 220s.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
The span-wide loop-embedding family produced a cleaner and more consistent three-seed story, but its mean still trails the older trusted W2 candidate. At this point the remaining loss still looks dominated by quantization rather than training or TTT adaptation. This variant ports the lightest promising part of the new upstream openai#1586 result: per-layer adaptive GPTQ clip sigmas (MLP tighter, attention looser) while explicitly avoiding the larger int7-embedding change. The goal is to test whether clip allocation alone improves the W6 quantization landing without introducing a new byte-risk axis.

Constraint: We need a new source of gain that attacks the quantization gap directly while preserving the current W6 artifact budget and runtime profile
Rejected: Continue 3-seed validating W14 | The mean is already behind the trusted baseline, so more seed spend there is not justified
Rejected: Import int7 embeddings at the same time | That would confound the first read on whether adaptive clip itself carries signal
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: If adaptive clip also fails to improve the best W6 seed, stop treating GPTQ microstructure as the likely missing lever for round 22
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for W15 adaptive clip
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
The first adaptive-clip run reached one of the best single-seed scores of the round, but blew the artifact limit by about 238 KB. That size failure is exactly the hole the upstream adaptive-clip submission closes with int7 embeddings and a tighter embedding clip. This variant keeps the per-layer adaptive GPTQ clip that already showed score signal here and adds only the embedding-side quantization change, so the next run answers the narrowest remaining question: can the adaptive-clip lane become compliant without giving back too much BPB.

Constraint: The adaptive-clip signal is already strong enough to justify a byte-recovery follow-up, but the recovery path should add as few new variables as possible
Rejected: Start by changing more matrix-side quantization knobs too | Would make the byte recovery result harder to attribute
Rejected: Abandon the adaptive-clip lane immediately after the byte failure | The score was too strong to drop without trying the obvious byte fix
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If int7 embeddings do not pull this lane back under 16 MB or if they erase most of the score gain, stop chasing the full openai#1586 quantization stack on this base
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for adaptive clip + int7 embeddings
