
Record: 11L XSA-all + 7-gram cache (mean val_bpb=1.0465)#758

Open
hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-25_11L_XSA_ngram

Conversation

@hypery11

Results

| Seed | val_bpb |
|------|---------|
| 42   | 1.0467  |
| 1337 | 1.0470  |
| 2024 | 1.0457  |
| Mean | 1.0465  |
| Std  | 0.0007  |
  • Artifact: 13.99 MB
  • Train: 600s on 8xH100 SXM
  • Eval: ~116s

Method

11-layer transformer with XSA-all (Exclusive Self-Attention on all layers), LeakyReLU(0.5)^2, Value Residual, Gated Attention, BigramHash(10240), SmearGate. GPTQ-lite int6 + zstd-22. EMA(0.997) + Tight SWA + Late QAT.
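As a hedged illustration, "LeakyReLU(0.5)^2" most plausibly reads as a LeakyReLU with negative slope 0.5 followed by squaring, in the squared-ReLU family used by speedrun models. The function name and exact form below are assumptions; the PR body does not show the activation code:

```python
import numpy as np

def squared_leaky_relu(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    """LeakyReLU(0.5)^2: leaky rectification, then squaring.

    One plausible reading of the shorthand in the PR body; the actual
    train_gpt.py activation may differ (e.g. a sign-preserving square).
    """
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Under this reading, `squared_leaky_relu(np.array([2.0, -2.0]))` gives `[4.0, 1.0]`: the negative branch is scaled by 0.5 before squaring.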

7-gram backward-looking eval cache (alpha=0.40, 4M buckets). Score-first, deterministic, no TTT.
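As described, the cache reads before it writes at every scoring position. A minimal sketch of that score-first loop (illustrative only: the hash, bucket count, and smoothing are assumptions, and the per-candidate keys here are derived from the context plus the candidate index, never the target):

```python
import numpy as np

def score_first_ngram_mix(tokens, model_probs, n=7, alpha=0.40,
                          num_buckets=1 << 16, vocab=4):
    """Blend model probabilities with a backward-looking n-gram cache.

    Score-first: at position t the cache is read, mixed into p_t, and only
    then updated with the now-graded token, so p_t depends only on
    x_1..x_{t-1}. Sketch only -- not the PR's implementation.
    """
    counts = np.zeros(num_buckets, dtype=np.float64)
    out = np.empty(len(tokens))
    for t in range(len(tokens)):
        ctx = tokens[max(0, t - (n - 1)):t]
        h = 0
        for tok in ctx:
            h = (h * 1000003 + int(tok)) & 0xFFFFFFFF
        # keys depend on the context and candidate index only, never the target
        keys = (h * 31 + np.arange(vocab)) % num_buckets
        row = counts[keys]
        cache_p = (row + 0.1) / (row.sum() + 0.1 * vocab)  # smoothed counts
        p = (1.0 - alpha) * model_probs[t] + alpha * cache_p
        out[t] = p[tokens[t]]
        counts[keys[tokens[t]]] += 1.0  # update AFTER scoring
    return out
```

On a repetitive token stream with a uniform base model, the first visit to a context scores at the base probability and later visits score higher, which is the backward-looking behavior the PR body claims.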

Architecture builds on community techniques from PRs #609, #549.

  • 8xH100 SXM, train ≤600s
  • Eval ≤600s (116s)
  • Artifact ≤16MB (13.99MB)
  • 3-seed validation (std 0.0007)

Seeds: 1.0467 / 1.0470 / 1.0457 (std 0.0007).
11L with XSA-all, LeakyReLU^2, VR, GA, GPTQ-lite int6.
13.99MB artifact. Train 600s, eval 116s.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 28, 2026
…ivot

- Log PR openai#771 CLOSED (TTT rules violation: adapt-then-score same tokens)
- Update competition strategy: pivot from AdamW TTT to n-gram eval cache
- Document legal TTT definition (backward-looking only, already-graded chunks)
- Track new open PRs: openai#933 (0.0804), openai#758 (1.0465), openai#1028 (0.9984 unstable)
- Add Session 4 lessons learned (lessons 17-20)
- Update abandoned approaches and key reference PRs in CLAUDE.md

https://claude.ai/code/session_0173mhLdyzis2j7NKyvDQ8ST
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 4, 2026
… Parallel Residuals path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score TTT)
- N-gram PRs openai#727/openai#741 CLOSED (illegal); openai#758/openai#731 open but same risk
- Merged SOTA unchanged at 1.1147
- New high-EV targets: PR openai#1351 (Discriminative TTT, 1.0807) and PR openai#1334
  (SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R, 1.0897)
- SLOT still unruled in Issue openai#140 — blocked until @valerio-oai rules
- CLAUDE.md updated to v8.0 with corrected strategy and Session 5 lessons

https://claude.ai/code/session_01X5rVjJpYyqm8DuWTNy2gkt
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 11, 2026
…RA TTT doc-independent legal; BPB bug alert

- PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash embed flag pending
- PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
- PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
- PR openai#1545 BPB double-counting bug: real score ~1.028 claim is ~1.18 actual
- PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
- Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
- No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
@MatoTeziTanka

Community Review — Record: 11L XSA-all + 7-gram cache (mean val_bpb=1.0465)

BPB: 1.0465 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA 5ed06ab2129f, file records/track_10min_16mb/2026-03-25_11L_XSA_7gram/train_gpt.py):

The n-gram lookup key at line 1143 is constructed by XOR-ing the target token into the hash:

```
full_key = <hash> ^ (tgt_np * ng_primes[...]) & mask   # train_gpt.py line 1143
```

This matches the full_key = ((ctx_hash ^ (target * primes[k])) & mask) construction that @valerio-oai ruled disallowed on PR #779 (comment 4145781641, 2026-03-27). Per the mechanism explanation, hashing the target token into the lookup key only reweights the correct token — in the hash-collision limit this drives P(correct) → 1 regardless of the data, which inflates the reported BPB without producing real compression.
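To make the mechanism concrete, here is a toy reproduction of the disallowed structural pattern (not the PR's code; the prime, mask, and boost rule are made up): because each candidate is looked up under a key that contains that candidate, and only the key containing the true target was ever written, the boost lands on the correct token even when the model is completely uninformative.

```python
import numpy as np

# Toy reproduction of the target-in-key pattern ruled illegal on PR #779.
# Constants and the reweighting rule are illustrative, not the PR's code.
PRIME, MASK = 1000003, (1 << 20) - 1
counts = {}

def key(ctx_hash, tok):
    # The token being scored is folded into the lookup key; this is
    # exactly what makes the read at position t a function of x_t.
    return (ctx_hash ^ (tok * PRIME)) & MASK

def update(ctx_hash, target):
    k = key(ctx_hash, target)
    counts[k] = counts.get(k, 0) + 1

def reweight(p, ctx_hash, beta=5.0):
    # Each candidate c is looked up under key(ctx, c); only the key that
    # contains the true target was ever incremented, so the boost singles
    # out the correct token regardless of what the model predicted.
    hit = np.array([counts.get(key(ctx_hash, c), 0) > 0 for c in range(len(p))])
    q = p * np.exp(beta * hit)
    return q / q.sum()

vocab = 8
p = np.full(vocab, 1.0 / vocab)   # a model that knows nothing
ctx_hash, target = 12345, 3
update(ctx_hash, target)          # the cache has seen this (ctx, target) pair
q = reweight(p, ctx_hash)
# q[target] rises to ~0.95 while every other token falls to ~0.006.
```

A context-only cache would also sharpen on repeated contexts, but it can only redistribute mass according to observed next-token frequencies; the target-in-key construction instead marks the correct answer directly, which is the inflation the #779 ruling describes.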

Per Issue #1017 condition 1, p_t may depend only on the artifact and x_1...x_{t-1}. Because the lookup key at line 1143 is a function of the target token, the count read at scoring position t depends on x_t itself — which is the core violation the #779 ruling targets.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.10s, dim=512, layers=11, vocab=1024, code=85725 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.
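For reference, the context-only legal path suggested there could be sketched roughly as follows (an illustration of the described pattern, not a vetted or PR-specific implementation; the row count, hash, and smoothing are assumptions): one Laplace-smoothed count row per hashed context, covering the full vocabulary, read before update.

```python
import numpy as np

class ContextOnlyNgramCache:
    """One count row per hashed context, spanning the full vocabulary.

    The row index is a function of the context alone, so the distribution
    read at position t depends only on x_1..x_{t-1} (Issue #1017 cond. 1).
    Sketch of the legal pattern suggested on PR #779; sizes are made up.
    """
    def __init__(self, num_rows: int = 1 << 12, vocab: int = 1024):
        self.rows = np.zeros((num_rows, vocab), dtype=np.float32)
        self.num_rows = num_rows

    def _row_index(self, ctx) -> int:
        h = 0
        for tok in ctx:                      # context tokens only, no target
            h = (h * 1000003 + int(tok)) & 0xFFFFFFFF
        return h % self.num_rows

    def read(self, ctx) -> np.ndarray:
        """Laplace-smoothed next-token distribution; call before update."""
        row = self.rows[self._row_index(ctx)]
        return (row + 1.0) / (row.sum() + row.size)

    def update(self, ctx, target: int) -> None:
        """Record the now-graded target under its context row."""
        self.rows[self._row_index(ctx), target] += 1.0
```

During eval the order is read, score, then update, which mirrors the score-first discipline the PR body claims while keeping the lookup key free of the target.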


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 12, 2026
…1.01710

Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09).
Six PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493).
New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9).
Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss