Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean)#1099

Closed
Bortlesboat wants to merge 2 commits into openai:main from Bortlesboat:submission/v16-coprime-gptq-1.1136

Conversation


@Bortlesboat Bortlesboat commented Mar 29, 2026

Summary

  • val_bpb: 1.1133 (3-seed mean, std 0.0001)
  • Artifact: ~15.89 MB (all seeds under 16,000,000 bytes)
  • Eval time: ~85s (no TTT, sliding window stride=64)
  • Built on PR #549 by @abaybektursun and PR #1060 by @dexhunter

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
|---|---|---|
| 1337 | 1.1133 | 15,899,687 |
| 42 | 1.1132 | 15,881,359 |
| 999 | 1.1133 | 15,892,371 |
| Mean ± Std | 1.1133 ± 0.0001 | |

What's New

  1. GPTQ Reserve Optimization: Reduced calibration reserve from 14s to 9s (actual calibration takes ~8.4s), recovering ~55 extra training steps
  2. FA3/FA2 Graceful Fallback: try/except import for flash_attn_interface with fallback to flash_attn
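
The fallback in item 2 can be sketched as follows. The two module names (`flash_attn_interface` for FA3, `flash_attn` for FA2) are the ones the PR names; the final SDPA fallback and the `attention` wrapper are additions here so the sketch runs without either package installed:

```python
import torch
import torch.nn.functional as F

# FA3 -> FA2 -> SDPA fallback. Only the two flash_attn module names come
# from the PR; the SDPA branch is added so the sketch runs anywhere.
try:
    from flash_attn_interface import flash_attn_func  # FlashAttention-3
    ATTN_BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func  # FlashAttention-2
        ATTN_BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        ATTN_BACKEND = "sdpa"

def attention(q, k, v, causal=True):
    """q, k, v: (batch, seqlen, nheads, headdim)."""
    if flash_attn_func is not None:
        # NOTE: some FA3 versions return (out, lse); adjust unpacking to
        # the installed version.
        return flash_attn_func(q, k, v, causal=causal)
    # PyTorch SDPA expects (batch, nheads, seqlen, headdim).
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal,
    )
    return out.transpose(1, 2)
```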

Stack

Compliance

  • Standard F.cross_entropy scoring (softmax, sum=1)
  • No TTT, no mixer, no eval-built adaptation
  • Artifact < 16,000,000 bytes (all 3 seeds)
  • Training < 600s, eval < 600s
  • Causal sliding-window evaluation (stride=64)

See README.md for full details.
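
The causal sliding-window evaluation can be illustrated with a span planner. The PR only states stride=64; the window size of 2048 used here is an assumption:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Plan causal sliding-window scoring: the first window scores all of
    its tokens; each later window slides forward by `stride` and scores
    only the `stride` new tokens, so every token is predicted exactly once
    with at least `window - stride` tokens of prior context (except at the
    start of the sequence)."""
    end = min(window, n_tokens)
    spans = [(0, end, end)]              # (ctx_start, ctx_end, n_scored)
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return spans
```

Smaller strides give each scored token more context at the cost of more forward passes; stride=64 keeps the full eval around the quoted ~85s.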

…(3-seed mean)

Initial 3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003) with a 10s GPTQ reserve.
Built on PR openai#549 + PR openai#1060 with an optimized GPTQ reserve (10s vs 14s upstream).
Improved from 1.1136 to 1.1133 by further reducing the GPTQ reserve from 10s to 9s.
Final seeds: 1.1133/1.1132/1.1133 (mean 1.1133, std 0.0001).
All artifacts under 16 MB.
@Bortlesboat Bortlesboat changed the title from "Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1136 (3-seed mean)" to "Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean)" on Mar 29, 2026
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Single innovation: coprime-stride shard traversal. Instead of
reading shards 0,1,2,...,79, reads 0,7,14,...,77,4,11,... where
stride=7 is coprime to 80 shards. Prevents repeated token sequences
across epochs. PR openai#1099 gets 1.1136 with this (vs 1.1217 baseline).

12 lines added. Zero HP changes. Zero architecture changes.
Same quantization path. Artifact unchanged.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
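
A minimal sketch of the traversal order the commit describes (stride 7 over 80 shards, as stated above):

```python
from math import gcd

def coprime_shard_order(num_shards: int, stride: int) -> list[int]:
    """Shard visit order 0, s, 2s, ... (mod N). Covers every shard exactly
    once per pass iff gcd(stride, num_shards) == 1."""
    assert gcd(stride, num_shards) == 1, "stride must be coprime to shard count"
    return [(i * stride) % num_shards for i in range(num_shards)]

# The commit's example: 80 shards, stride 7 -> 0, 7, 14, ..., 77, 4, 11, ...
order = coprime_shard_order(80, 7)
```

Because the stride is coprime to the shard count, consecutive passes never replay the same adjacency pattern that a 0,1,2,…,79 sweep would.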
@Bortlesboat
Author

Superseded by #1169 (better score). Closing.

@Bortlesboat Bortlesboat closed this Apr 6, 2026
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…IP, I overrode to PASS

Subagent found arxiv:2505.15134 (Entropy Minimization at Inference, NeurIPS
2025) and recommended ship. I reversed to PASS after working out the math:
EM-INF is equivalent to temperature sharpening, and cross-entropy for a
calibrated MLE model is minimized at T=1 by definition. Moving T away from
1 in either direction strictly increases in-distribution NLL. Same class of
trap as Patch 14 (entropy-adaptive, already falsified). No push.

Better directions logged for next fire: PR openai#1437 N-gram Tilt (multiplicative
not sharpening), BPE-8192 tables, Coprime-Stride from merged record openai#1099.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
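
The T=1 argument above can be checked numerically: for a model whose softmax matches the data distribution, the expected NLL is the cross-entropy H(p, q_T) = H(p) + KL(p‖q_T), which is minimized exactly at T=1. A toy verification (the logits are illustrative, not from the PR):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

logits = [2.0, 0.5, -1.0]   # illustrative 3-class logits
p = softmax(logits)         # "calibrated": data distribution == model at T=1

def expected_nll(T):
    """Cross-entropy H(p, q_T) of the temperature-scaled model against p."""
    q = softmax([z / T for z in logits])
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# H(p, q_T) = H(p) + KL(p || q_T) >= H(p), with equality only at T = 1,
# so both sharpening (T < 1) and flattening (T > 1) increase NLL.
```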
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed
the audit fire openai#1 verdict that EngramLite was falsified. Adding 4 new
EL multi-seed experiments to confirm:
  - EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7)
  - EL6 with L5 weights (0.15/0.20/0.15) — new combination

Removed 15 dead/falsified configs that wasted cycle 2 compute:
EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0.

Also captured EMA(0.997) canonical spec from 6 merged records
(openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED actual Patch 17 ship
because EMA only affects final val_bpb (not loop train_loss) and
training-loop anchoring is risky without reading train_gpt.py.

Queue now cycles in ~100 min (vs 185 min) leaving more compute
for the EL family expansion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…PEC captured

Subagent extracted percentile-based int6 quantization pattern from
PR openai#1099, openai#1019, openai#1444 (3+ merged records). No Hessian needed,
~130 LOC, lzma-22 instead of zlib for ~0.5MB size headroom.

Direct BPB gain is only -0.0003 (within noise) — the real value is
freed size budget that could fund extra model capacity.

DEFERRED actual Patch 23 ship: same metric problem as Tilt + EMA
(loop train_loss unaffected by serialization), plus serialization
code is the highest-risk path to break before submission. Captured
spec is drop-in ready for next H100 escalation cycle.

Three specs now queued for combined H100 escalation:
  - USE_NGRAM_TILT_EVAL (task openai#53)
  - USE_EMA (task openai#45)
  - USE_INT6_GPTQ (new)
Combined estimated gain: +0.003 to +0.008 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
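
A minimal sketch of the percentile-clipped int6 pattern described above. The 99.9th-percentile clip and the symmetric [-31, 31] range are assumptions here; the captured spec is ~130 LOC and also covers the lzma-22 serialization, which is omitted:

```python
import numpy as np

def quantize_int6_percentile(w, pct=99.9):
    """Percentile-clipped symmetric int6 quantization (no Hessian needed).
    Clipping at a high percentile instead of the absolute max keeps the
    quantization step small despite outlier weights."""
    clip = np.percentile(np.abs(w), pct)
    scale = clip / 31.0                              # symmetric int6: [-31, 31]
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
q, s = quantize_int6_percentile(w)
# Reconstruction is within half a quantization step of the clipped weights.
err = np.abs(dequantize(q, s) - np.clip(w, -s * 31, s * 31)).max()
```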
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…eferred (upstream stateless)

Two-subagent investigation of coprime-stride loader from PR openai#1099/openai#1060.
First subagent confirmed 26 PRs use it, top merged record uses it, ~0.01 BPB
estimated gain. Second subagent extracted exact upstream DistributedTokenLoader
code: it's COMPLETELY STATELESS (~10 lines, just slices TokenStream).

PR openai#1099's implementation is NOT a small patch — it's a fundamental rewrite
adding stateful per-shard cursor management. Real implementation is 60-100 LOC,
needs to interact with TokenStream class I haven't read yet.

DEFERRED because data loader is on the critical path — buggy patch could
silently corrupt training data. Better to validate existing MS3/EL/MR cycle 2+3
results first. Spec captured for next focused research fire.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135 which use TOKEN-level coprime stride. Token-level
needs 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL
variant: modify _advance_file() to use a coprime stride instead of +1, so nearby
training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in TokenStream class (none of the existing
24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1,
falls back to stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards
before repeating. Max spacing diversity = better gradient noise reduction.

Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY
at near-zero risk vs. 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram.

This is the FIRST data-side patch in our 24-patch stack. Tests a completely new
vector after the "neutrality plateau" of architectural/optimizer/training-time
patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ntified as top missing technique

Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested
in 150+ open + 20 closed PRs (7 consecutive audits for the original
3, first confirmation for Patch 20 just shipped 3h ago).

CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED
records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO
attention-mask variants. Most-validated missing technique. ~200 LOC
moderate port — too big for a single research fire but worth a focused
30-45 min investigation if we can find a minimal variant.

SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's
eval-time, joins the H100 escalation bundle category.

H100 escalation candidate updated:
  NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ)
  OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ)

Need CS2 cycle 2+3 for n=3 mean confirmation before escalating.

PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+.

Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…port from 100+ PRs

From arxiv:2603.09078 + PR openai#1099 (latest merged) + 4+ other merged records.
~12 LOC inline insert in CausalSelfAttention.forward after GATED_ATTENTION
block. 0 new params. Removes self-value projection from attention output.

4 XSA experiments queued: alone, seed42, +coprime, full stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
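
One plausible reading of "removes self-value projection" is subtracting each token's own value contribution from the attention output, sketched below with NumPy. The exact XSA formulation is not shown in this thread, so treat this as illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_no_self_value(q, k, v):
    """Causal attention whose output drops each token's own value term
    (one reading of "removes self-value projection"; illustrative only).
    q, k, v: (seqlen, headdim)."""
    T = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    p = softmax(scores, axis=-1)
    out = p @ v
    out -= np.diag(p)[:, None] * v   # subtract the self-value contribution
    return out
```

Note one consequence: the first token attends only to itself, so its output becomes exactly zero under this formulation.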