Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean)#1099

Closed
Bortlesboat wants to merge 2 commits into openai:main from Bortlesboat:submission/v16-coprime-gptq-1.1136

Conversation


@Bortlesboat Bortlesboat commented Mar 29, 2026

Summary

  • val_bpb: 1.1133 (3-seed mean, std 0.0001)
  • Artifact: ~15.89 MB (all seeds under 16,000,000 bytes)
  • Eval time: ~85s (no TTT, sliding window stride=64)
  • Built on PR #549 by @abaybektursun and PR #1060 by @dexhunter

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
|---|---|---|
| 1337 | 1.1133 | 15,899,687 |
| 42 | 1.1132 | 15,881,359 |
| 999 | 1.1133 | 15,892,371 |
| Mean ± Std | 1.1133 ± 0.0001 | |

What's New

  1. GPTQ Reserve Optimization: Reduced calibration reserve from 14s to 9s (actual calibration takes ~8.4s), recovering ~55 extra training steps
  2. FA3/FA2 Graceful Fallback: try/except import for flash_attn_interface with fallback to flash_attn
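
The fallback in item 2 can be sketched as follows. The two module names (`flash_attn_interface` for FA3, `flash_attn` for FA2) are the ones the PR names; the final SDPA fallback and the `attention` wrapper are additions here so the sketch runs without either package installed:

```python
import torch
import torch.nn.functional as F

# FA3 -> FA2 -> SDPA fallback. Only the two flash_attn module names come
# from the PR; the SDPA branch is added so the sketch runs anywhere.
try:
    from flash_attn_interface import flash_attn_func  # FlashAttention-3
    ATTN_BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func  # FlashAttention-2
        ATTN_BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        ATTN_BACKEND = "sdpa"

def attention(q, k, v, causal=True):
    """q, k, v: (batch, seqlen, nheads, headdim)."""
    if flash_attn_func is not None:
        # NOTE: some FA3 versions return (out, lse); adjust unpacking to
        # the installed version.
        return flash_attn_func(q, k, v, causal=causal)
    # PyTorch SDPA expects (batch, nheads, seqlen, headdim).
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal,
    )
    return out.transpose(1, 2)
```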

Stack

Compliance

  • Standard F.cross_entropy scoring (softmax, sum=1)
  • No TTT, no mixer, no eval-built adaptation
  • Artifact < 16,000,000 bytes (all 3 seeds)
  • Training < 600s, eval < 600s
  • Causal sliding-window evaluation (stride=64)

See README.md for full details.
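
The causal sliding-window evaluation can be illustrated with a span planner. The PR only states stride=64; the window size of 2048 used here is an assumption:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Plan causal sliding-window scoring: the first window scores all of
    its tokens; each later window slides forward by `stride` and scores
    only the `stride` new tokens, so every token is predicted exactly once
    with at least `window - stride` tokens of prior context (except at the
    start of the sequence)."""
    end = min(window, n_tokens)
    spans = [(0, end, end)]              # (ctx_start, ctx_end, n_scored)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return spans
```

Smaller strides give each scored token more context at the cost of more forward passes; stride=64 keeps the full eval around the quoted ~85s.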

…(3-seed mean)

Initial 3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003) with a 10s GPTQ reserve.
Built on PR openai#549 + PR openai#1060 with an optimized GPTQ reserve (10s vs 14s upstream).
Improved from 1.1136 to 1.1133 by further reducing the GPTQ reserve from 10s to 9s.
Final seeds: 1.1133/1.1132/1.1133 (mean 1.1133, std 0.0001).
All artifacts under 16 MB.
@Bortlesboat Bortlesboat changed the title from "Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1136 (3-seed mean)" to "Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean)" on Mar 29, 2026
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Single innovation: coprime-stride shard traversal. Instead of
reading shards 0,1,2,...,79, reads 0,7,14,...,77,4,11,... where
stride=7 is coprime to 80 shards. Prevents repeated token sequences
across epochs. PR openai#1099 gets 1.1136 with this (vs 1.1217 baseline).

12 lines added. Zero HP changes. Zero architecture changes.
Same quantization path. Artifact unchanged.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
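
A minimal sketch of the traversal order the commit describes (stride 7 over 80 shards, as stated above):

```python
from math import gcd

def coprime_shard_order(num_shards: int, stride: int) -> list[int]:
    """Shard visit order 0, s, 2s, ... (mod N). Covers every shard exactly
    once per pass iff gcd(stride, num_shards) == 1."""
    assert gcd(stride, num_shards) == 1, "stride must be coprime to shard count"
    return [(i * stride) % num_shards for i in range(num_shards)]

# The commit's example: 80 shards, stride 7 -> 0, 7, 14, ..., 77, 4, 11, ...
order = coprime_shard_order(80, 7)
```

Because the stride is coprime to the shard count, consecutive passes never replay the same adjacency pattern that a 0,1,2,…,79 sweep would.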
@Bortlesboat
Author

Superseded by #1169 (better score). Closing.

@Bortlesboat Bortlesboat closed this Apr 6, 2026
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…IP, I overrode to PASS

Subagent found arxiv:2505.15134 (Entropy Minimization at Inference, NeurIPS
2025) and recommended ship. I reversed to PASS after working out the math:
EM-INF is equivalent to temperature sharpening, and cross-entropy for a
calibrated MLE model is minimized at T=1 by definition. Moving T away from
1 in either direction strictly increases in-distribution NLL. Same class of
trap as Patch 14 (entropy-adaptive, already falsified). No push.

Better directions logged for next fire: PR openai#1437 N-gram Tilt (multiplicative
not sharpening), BPE-8192 tables, Coprime-Stride from merged record openai#1099.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
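
The T=1 argument above can be checked numerically: for a model whose softmax matches the data distribution, the expected NLL is the cross-entropy H(p, q_T) = H(p) + KL(p‖q_T), which is minimized exactly at T=1. A toy verification (the logits are illustrative, not from the PR):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

logits = [2.0, 0.5, -1.0]   # illustrative 3-class logits
p = softmax(logits)         # "calibrated": data distribution == model at T=1

def expected_nll(T):
    """Cross-entropy H(p, q_T) of the temperature-scaled model against p."""
    q = softmax([z / T for z in logits])
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# H(p, q_T) = H(p) + KL(p || q_T) >= H(p), with equality only at T = 1,
# so both sharpening (T < 1) and flattening (T > 1) increase NLL.
```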
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed
the audit fire openai#1 verdict that EngramLite was falsified. Adding 4 new
EL multi-seed experiments to confirm:
  - EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7)
  - EL6 with L5 weights (0.15/0.20/0.15) — new combination

Removed 15 dead/falsified configs that wasted cycle 2 compute:
EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0.

Also captured EMA(0.997) canonical spec from 6 merged records
(openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED actual Patch 17 ship
because EMA only affects final val_bpb (not loop train_loss) and
training-loop anchoring is risky without reading train_gpt.py.

Queue now cycles in ~100 min (vs 185 min) leaving more compute
for the EL family expansion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…PEC captured

Subagent extracted percentile-based int6 quantization pattern from
PR openai#1099, openai#1019, openai#1444 (3+ merged records). No Hessian needed,
~130 LOC, lzma-22 instead of zlib for ~0.5MB size headroom.

Direct BPB gain is only -0.0003 (within noise) — the real value is
freed size budget that could fund extra model capacity.

DEFERRED actual Patch 23 ship: same metric problem as Tilt + EMA
(loop train_loss unaffected by serialization), plus serialization
code is the highest-risk path to break before submission. Captured
spec is drop-in ready for next H100 escalation cycle.

Three specs now queued for combined H100 escalation:
  - USE_NGRAM_TILT_EVAL (task openai#53)
  - USE_EMA (task openai#45)
  - USE_INT6_GPTQ (new)
Combined estimated gain: +0.003 to +0.008 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
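
A minimal sketch of the percentile-clipped int6 pattern described above. The 99.9th-percentile clip and the symmetric [-31, 31] range are assumptions here; the captured spec is ~130 LOC and also covers the lzma-22 serialization, which is omitted:

```python
import numpy as np

def quantize_int6_percentile(w, pct=99.9):
    """Percentile-clipped symmetric int6 quantization (no Hessian needed).
    Clipping at a high percentile instead of the absolute max keeps the
    quantization step small despite outlier weights."""
    clip = np.percentile(np.abs(w), pct)
    scale = clip / 31.0                              # symmetric int6: [-31, 31]
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
q, s = quantize_int6_percentile(w)
# Reconstruction is within half a quantization step of the clipped weights.
err = np.abs(dequantize(q, s) - np.clip(w, -s * 31, s * 31)).max()
```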
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…eferred (upstream stateless)

Two-subagent investigation of coprime-stride loader from PR openai#1099/openai#1060.
First subagent confirmed 26 PRs use it, top merged record uses it, ~0.01 BPB
estimated gain. Second subagent extracted exact upstream DistributedTokenLoader
code: it's COMPLETELY STATELESS (~10 lines, just slices TokenStream).

PR openai#1099's implementation is NOT a small patch — it's a fundamental rewrite
adding stateful per-shard cursor management. Real implementation is 60-100 LOC,
needs to interact with TokenStream class I haven't read yet.

DEFERRED because data loader is on the critical path — buggy patch could
silently corrupt training data. Better to validate existing MS3/EL/MR cycle 2+3
results first. Spec captured for next focused research fire.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135 which use TOKEN-level coprime stride. Token-level
needs 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL
variant: modify _advance_file() to use a coprime stride instead of +1, so nearby
training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in TokenStream class (none of the existing
24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1,
falls back to stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards
before repeating. Max spacing diversity = better gradient noise reduction.

Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY
at near-zero risk vs. 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram.

This is the FIRST data-side patch in our 24-patch stack. Tests a completely new
vector after the "neutrality plateau" of architectural/optimizer/training-time
patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ntified as top missing technique

Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested
in 150+ open + 20 closed PRs (7 consecutive audits for the original
3, first confirmation for Patch 20 just shipped 3h ago).

CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED
records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO
attention-mask variants. Most-validated missing technique. ~200 LOC
moderate port — too big for a single research fire but worth a focused
30-45 min investigation if we can find a minimal variant.

SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's
eval-time, joins the H100 escalation bundle category.

H100 escalation candidate updated:
  NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ)
  OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ)

Need CS2 cycle 2+3 for n=3 mean confirmation before escalating.

PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+.

Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…port from 100+ PRs

From arxiv:2603.09078 + PR openai#1099 (latest merged) + 4+ other merged records.
~12 LOC inline insert in CausalSelfAttention.forward after GATED_ATTENTION
block. 0 new params. Removes self-value projection from attention output.

4 XSA experiments queued: alone, seed42, +coprime, full stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
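
One plausible reading of "removes self-value projection" is subtracting each token's own value contribution from the attention output, sketched below with NumPy. The exact XSA formulation is not shown in this thread, so treat this as illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_no_self_value(q, k, v):
    """Causal attention whose output drops each token's own value term
    (one reading of "removes self-value projection"; illustrative only).
    q, k, v: (seqlen, headdim)."""
    T = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    p = softmax(scores, axis=-1)
    out = p @ v
    out -= np.diag(p)[:, None] * v   # subtract the self-value contribution
    return out
```

Note one consequence: the first token attends only to itself, so its output becomes exactly zero under this formulation.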