
Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)#1289

Open
MatoTeziTanka wants to merge 3 commits into openai:main from MatoTeziTanka:proteus-v16-scylla

Conversation

@MatoTeziTanka

@MatoTeziTanka commented Apr 3, 2026

Result

val_bpb: 1.0819 (3-seed mean, std: 0.00088) | Scylla tokenizer (998 tokens) | 8×H100 SXM

| Seed | Sliding Window BPB | Roundtrip BPB | Steps | Train Time |
|---|---|---|---|---|
| 42 | 1.08075 | 1.10284 | 5,884 | 600.1s |
| 1337 | 1.08289 | 1.10489 | 5,905 | 600.0s |
| 2024 | 1.08213 | 1.10421 | 5,894 | 600.0s |
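For reference, val_bpb here is presumably the standard bits-per-byte metric: total cross-entropy over the val set in nats, converted to bits and divided by the raw UTF-8 byte count. A minimal sketch (not taken from this PR's code):

```python
import math

def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Bits-per-byte: cross-entropy in nats -> bits, per raw UTF-8 byte."""
    return total_nats / (math.log(2) * total_bytes)
```

This is also why byte accounting matters so much in this thread: overcounting `total_bytes` by ~4% deflates the reported BPB by roughly the same factor.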

What We Built

This is the 8th submission in the PROTEUS series — an iterative engineering effort across 7 prior PRs (#95, #368, #512, #568, #633, #769, #1274), and documented negative results (INT4 catastrophic, depth recurrence overhead, SWA alone, output-level bigram tables). Each failure informed the next attempt.

Original Engineering in This Submission

  1. Sensitivity-driven mixed INT5/INT6 quantization — Dynamic per-layer quantization selection via N_INT6_LAYERS control. INT5 for middle MLP layers, INT6 for attention + first/last MLP layers. Our mixed_quantize_int6() extends the community function with int4_cats parameter and sensitivity-aware layer routing not present in prior implementations (code: lines 2021-2061, 2533-2573).
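A minimal sketch of what sensitivity-driven bit-width routing can look like. The function names and the exact routing rule here are illustrative, not the PR's `mixed_quantize_int6()`:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest fake quantization to `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax or 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def bits_for_layer(idx: int, n_layers: int, n_int6: int) -> int:
    """Illustrative routing: keep INT6 on the first/last (most sensitive)
    layers, drop the middle MLP layers to INT5."""
    half = n_int6 // 2
    return 6 if idx < half or idx >= n_layers - half else 5
```

The interesting engineering is in choosing `n_int6` so the compressed artifact lands under the size budget while keeping the quantization MSE on sensitive layers low.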

  2. Learnable lane merge + resid_mix_mlp — We added two learnable parameters on top of the parallel residuals architecture: a scalar lane_merge (line 1105) that blends attention and MLP residual streams, and a per-dimension resid_mix_mlp (line 946) that routes MLP vs attention inputs. Neither parameter exists in the source parallel residuals PR. These are our additions that let the model learn its own mixing strategy rather than using fixed routing.
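A sketch of how the two learnable parameters could act. Only the names `lane_merge` and `resid_mix_mlp` come from the PR; the shapes and the sigmoid squashing are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def merge_lanes(x_attn, x_mlp, lane_merge, resid_mix_mlp):
    """lane_merge: learnable scalar blending the two residual streams.
    resid_mix_mlp: learnable per-dimension gate routing MLP vs attn input."""
    gate = sigmoid(resid_mix_mlp)                  # per-dimension in (0, 1)
    mlp_in = gate * x_mlp + (1 - gate) * x_attn    # what the MLP lane sees
    a = sigmoid(lane_merge)                        # scalar in (0, 1)
    return a * x_attn + (1 - a) * mlp_in
```

At initialization (both parameters zero) this reduces to an even 50/50 blend, so the model starts from neutral routing and learns its own mixing strategy from there.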

  3. Scylla retokenization pipeline — Complete pipeline to convert SP1024 FineWeb shards to the Scylla (TokenMonster) vocabulary. Chunked decode/re-encode with validation. PR Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553) #1143 introduced the Scylla tokenizer but did not include a retokenization tool — we wrote one from scratch.
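The pipeline's shape, as we understand it from the description. The tokenizer callables below are placeholders for the SP1024 decoder and the Scylla encoder/decoder, and a real pipeline must cut chunks on source-token boundaries so each chunk decodes losslessly:

```python
def retokenize(src_ids, src_decode, dst_encode, dst_decode, chunk=1 << 16):
    """Chunked decode/re-encode with per-chunk round-trip validation:
    dst_decode(dst_encode(text)) == text is asserted for every chunk."""
    out = []
    for start in range(0, len(src_ids), chunk):
        text = src_decode(src_ids[start:start + chunk])
        ids = dst_encode(text)
        assert dst_decode(ids) == text, f"lossy re-encode at offset {start}"
        out.extend(ids)
    return out
```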

  4. Integration engineering — Getting parallel residuals (PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204), depth recurrence, legal TTT (PR Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461), Scylla tokenizer (PR Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553) #1143), and the base architecture (PR Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549/Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019) to work together in a single training run required solving compatibility issues across 5 independent codebases. This is non-trivial systems work.

  5. CPU e2e test suite — 10 automated test cases: import validation, hyperparameter extraction, model creation, forward pass, code size, quantization+artifact size, step time projection, quantization MSE analysis, scale timing benchmark, and weight distribution analysis. Full pre-flight before any GPU spend.

  6. Controlled slope experiment — 7-point LeakyReLU negative slope sweep (0.1–0.9) under identical conditions showing monotonic improvement, posted on issue #140. Follow-up A/B test on this architecture showed slope=0.9 is 0.0054 BPB worse with parallel residuals — the parallel lanes prefer more aggressive gating at 0.5. Published the negative result too.
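For concreteness, the activation under sweep. We read LeakyReLU(slope)² as the sign-preserving square of the leaky output; the PR's exact definition may differ:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Sign-preserving squared LeakyReLU (our reading; see caveat above)."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * y * y
```

The 7-point sweep then just varies `slope` from 0.1 to 0.9 with everything else held fixed.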

Builds On (full attribution)

This submission integrates techniques from the community. Every component is credited:

| Component | What We Used | PR | Original Author |
|---|---|---|---|
| Base architecture | LeakyReLU² + Parallel Muon | #549 | @abaybektursun |
| GPTQ + XSA + BigramHash | AR Self-Gen GPTQ calibration | #1019 | @abaybektursun |
| Scylla tokenizer | TokenMonster-derived 998-token vocab | #1143 | @simon-marcus |
| Parallel residuals + depth recurrence | Separate attn/MLP lanes from layer 7 | #1204 | @msisovic |
| Legal TTT framework | Score-first SGD with frozen early blocks | #461 | @Christopher-Lee-McClendon |
| LeakyReLU² activation | Squared LeakyReLU | #493 | @parinzee |
| XSA | Exclusive Self-Attention | #265 | @unnir |
| BigramHash + SmearGate | Hash bigram embeddings + gate | #102 | @unnir |
| Value Embeddings | VE128 on deep layers | #414 | @signalrush |

Note on Scylla: PR #1143 was closed by the author after byte-accounting errors (~4-6% BPB inflation). Our implementation is verified immune — base_bytes[i] = len(token_i.encode('utf-8')) for all 998 tokens, has_leading_space and is_boundary_token both all-False, five zero-byte tokens correctly handled. All 5 eval functions use identical byte-counting logic.
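The verification described above reduces to roughly this (a sketch; `vocab` mapping token id to its decoded string stands in for the real artifact):

```python
def build_base_bytes(vocab):
    """base_bytes[i] = UTF-8 length of token i's decoded string.
    Note: this is precisely the check that later in the thread proves
    insufficient for byte-fallback tokens, whose decoded form (U+FFFD,
    3 bytes) differs from their 1-byte contribution to the source text."""
    return [len(tok.encode("utf-8")) for tok in vocab]
```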


Architecture

11L/512d/8H/4KV, MLP 3× LeakyReLU(0.5)², XSA last 4, Partial RoPE 16d, LN Scale, BigramHash, SmearGate, VE 128d (layers 9-10), EMA 0.997, QAT, Mixed INT5/INT6+LZMA, Muon optimizer, Parallel Residuals (from layer 7), Mini Depth Recurrence (layers 4-5, from step 3000), Legal Score-First TTT. Scylla tokenizer (998 tokens, TokenMonster-derived).
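Of these components, Mini Depth Recurrence is the least standard. One plausible reading (the block indices come from the description; the control flow is our guess) is weight-tied re-application of two mid-stack blocks:

```python
def forward_with_recurrence(x, blocks, recur=(4, 5), loops=2):
    """Apply each block once, except the recurrent blocks, which are
    applied `loops` times with shared weights (hypothetical sketch)."""
    for i, block in enumerate(blocks):
        for _ in range(loops if i in recur else 1):
            x = block(x)
    return x
```

Enabling it only from step 3000 (as the architecture line says) would then just toggle `loops` from 1 to 2 partway through training, adding depth without adding parameters.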

Compliance (per #677)

  • 8×H100 SXM training
  • 10-minute wallclock (600s)
  • Artifact ≤ 16 MB — Verified. Raw serialized model is byte-identical across runs (112,437,059 bytes). With mixed_quant+brotli compression: 15,006,141 bytes (seed 42), 15,012,088 (seed 1337), 15,019,048 (seed 2024). All under budget. Architecture config confirmed identical between measured runs and this submission (30,126,957 params, same quantization scheme).
  • No n-gram cache at eval (NGRAM_ENABLED defaults to 0)
  • No two-pass rescoring
  • Score-first TTT (tokens scored before weight update)
  • Autoregressive eval (causal)
  • 3-seed validation (42: 1.0808, 1337: 1.0829, 2024: 1.0821)
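The artifact-size gate in the checklist above can be reproduced roughly as follows (stdlib `lzma` as a stand-in; the PR reports brotli numbers):

```python
import lzma

BUDGET = 16 * 2**20  # 16 MB artifact limit

def artifact_size(model_bytes: bytes) -> int:
    """Compressed size of the serialized quantized model."""
    return len(lzma.compress(model_bytes, preset=9))

def under_budget(model_bytes: bytes) -> bool:
    return artifact_size(model_bytes) <= BUDGET
```

With ~112 MB raw and ~15 MB compressed, the reported headroom under the 16 MB budget is about 1 MB, which is why the mixed INT5/INT6 split matters.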

Note on PR #1274

Prior submission was self-closed due to incorrect attribution that misrepresented community techniques as our own. This resubmission corrects that with explicit provenance for every integrated component.

Platform

RunPod 8×H100 80GB SXM, PyTorch 2.11.0+cu128.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

Mato and others added 3 commits March 29, 2026 10:23
Fork landing page now introduces The Agora with links to the live site,
issue templates, and discussions. Original OpenAI README preserved below.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub only discovers workflows from the default branch (main).
The workflow checks out gh-pages and runs the pipeline there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + Legal TTT — val_bpb 1.0819

3-seed mean: 1.0819 BPB (std: 0.00088)
Seeds: 42=1.0808, 1337=1.0829, 2024=1.0821

Integration of community techniques with full attribution:
- Base: PR openai#549, openai#1019 by @abaybektursun
- Scylla tokenizer: PR openai#1143 by @simon-marcus
- Parallel residuals + depth recurrence: PR openai#1204 by @msisovic
- Legal TTT: PR openai#461 by @Christopher-Lee-McClendon

Our engineering: mixed INT5/INT6 quantization, learnable lane merge,
Scylla retokenization pipeline, integration work, CPU e2e test suite.
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 5, 2026
Bug 1: compliance_keywords from fetch_prs.py are raw keyword mentions,
not proof of technique usage. The classifier was using them as AT_RISK
signals, causing 43 false positives (including PR openai#1289 at 1.0819 BPB).
Removed compliance_keywords from _is_at_risk() — body pattern matching
already handles contextual detection correctly.

Bug 2: PRs with multiple technique signals (e.g., TTT + Cache) showed
a single "Hybrid" badge. Added _classify_technique_tags() returning a
list of all detected types, and updated leaderboard row rendering to
show multiple badges per PR.

Changelog updated to v0.7.2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026
- BigramHash 3072x112 (from openai#1019 pattern, ~-0.001 BPP)
- SmearGate for temporal smoothing
- PROTEUS 4-way routing: resid_mix_mlp + route params for parallel blocks
- Skip connections apply to both lanes in parallel mode (per openai#1289)
- Untied MLP weights for recurrence layers (separate from block MLPs)
- Fix: skip connections in parallel mode now correctly update both lanes
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026
Parallel residuals: layers 7-10 run attn/MLP on separate streams,
merged via learnable lane_merge + per-block route params (PR openai#1289 pattern).
Coprime-stride loader: replaces random permutation with coprime-stride
traversal for better data coverage (-0.003 BPP proven in R3).
@dexhunter

Hi @MatoTeziTanka, great work on the integration engineering in PROTEUS v1.6 — getting parallel residuals, depth recurrence, legal TTT, and Scylla working together in one training run is genuinely non-trivial, and the slope ablation on issue #140 is a useful community contribution.

I want to raise a concern about the byte accounting for the Scylla tokenizer, because I flagged the same issue on PR #1143 and I think it still applies here unchanged. cc @NoesisGenesis since you originally flagged it too.

The claim in the README

"Our implementation is verified immune — base_bytes[i] = len(token_i.encode('utf-8')) for all 998 tokens, has_leading_space and is_boundary_token both all-False, five zero-byte tokens correctly handled."

This is the exact meta.npz pattern that PR #1143 used. On PR #1143 I verified against the full FineWeb val set that this configuration overcounts actual decoded bytes by ~4.13%:

```
Ground truth (decode all SP1024 val → count UTF-8):   151,080,633 bytes
With all-zeros has_leading_space/is_boundary (yours): 157,319,779 bytes  (+4.13%)
With corrected meta (38 capcode + 27 byte fixes):     151,040,811 bytes  (−0.026%)
```

Why the overcount happens

TokenMonster (the upstream for Scylla) has two classes of tokens that need special handling which len(piece.encode("utf-8")) alone cannot capture:

  1. ~38 capcode modifier tokens ending in D / DC / DW which delete the leading space from the following token during decode. Example: if token 151 (capitalize + delete-space) is followed by token 305 (" he", stored as 3 bytes), the metadata sums to 3 but the actual decoded output is "He" = 2 bytes. The formula token_bytes[tgt] + (has_leading_space[tgt] & ~is_boundary[prev]) suppresses the +1 space byte only when is_boundary_token[prev] = True. Setting is_boundary_token = True for every capcode-D/DC/DW token fixes this class of overcount.

  2. ~27 UTF-8 byte fallback tokens (IDs 75–101 in the standard Scylla vocab). Each individually decodes to U+FFFD (3 UTF-8 bytes), but actually represents 1 raw byte in the source text. len("\ufffd".encode("utf-8")) = 3, so base_bytes is set to 3 for these, but the true contribution is 1 byte. Setting base_bytes = 1 for this class fixes it.
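The two fixes combine into the following accounting (a sketch implementing the formula quoted above; the array names follow the comment, the rest is ours):

```python
import numpy as np

def decoded_byte_count(prev_ids, tgt_ids, base_bytes,
                       has_leading_space, is_boundary_token):
    """token_bytes[tgt] + (has_leading_space[tgt] & ~is_boundary[prev]):
    each target contributes its base bytes plus one space byte, unless
    the previous token is a boundary (capcode-D/DC/DW) token that
    deletes that space during decode."""
    prev = np.asarray(prev_ids)
    tgt = np.asarray(tgt_ids)
    space = has_leading_space[tgt] & ~is_boundary_token[prev]
    return int(base_bytes[tgt].sum() + space.sum())
```

On top of this, the corrected meta would set `base_bytes = 1` for the ~27 byte-fallback tokens and `is_boundary_token = True` for the ~38 capcode modifier tokens.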

NoesisGenesis independently verified essentially the same mechanism on PR #1143 (the capcode modifier overcount hits ~6% on representative text in their measurement).

Corrected score estimate for PROTEUS v1.6

Applying the same 4.13% correction factor to your 3-seed mean:

| | Reported | Corrected (est.) |
|---|---|---|
| PROTEUS v1.6 mean (val_bpb) | 1.08192 | ~1.126 |

At ~1.126 BPB, the submission would not beat PR #1394's 1.08563 clean-lane result, which matches the pattern NoesisGenesis and I found on PR #1143 (reported ~1.08 → corrected ~1.128 BPB).

README compliance

Per the repo README ("If changes are made to the tokenizer or dataset, prove with certainty that the val_bpb is correctly calculated. Submissions that edit the tokenizer will be examined much more carefully, since bugs may unjustly improve your score."), I think this warrants either:

  1. a round-trip byte-equality check across the FineWeb val set that shows sum(base_bytes[tgt_ids]) + leading_space_adjustments == actual_raw_bytes at token-count granularity, or
  2. rebuilding the candidate.meta.npz with corrected is_boundary_token (True for capcode-D/DC/DW tokens) and corrected base_bytes (1 for byte-fallback tokens) and re-running.

I have the detection / correction script if it's useful — happy to share. (And I'd genuinely love to see a clean Scylla submission land, since the tokenizer direction is valuable.)

@dexhunter

Quick follow-up @MatoTeziTanka — just noticed PR #1314 by @simon-marcus, which is the original Scylla author's own corrected reference.

The PR #1314 README is explicit:

"in this folder, Scylla means the corrected official revision. The original 998-token path from PR #1143 is superseded by the artifact set here."

simon-marcus' corrected bundle uses a completely different regime from the 998-vocab used here: capcode=0, charset=none, normalization=none, explicit 0x00..0xFF byte fallback, latin-1 decoding, synthetic zero-byte BOS, and a new 1254-token vocabulary. Their FULL_VAL_AUDIT.json shows source_bytes == meta_bytes == decoded_bytes == 151,080,891 with zero drift on the fixed FineWeb val slice.

Since PROTEUS v1.6 uses the 998-vocab Scylla from #1143 (same metadata pattern I flagged above), and the Scylla author has now explicitly marked that artifact as superseded, I think this PR would need a re-eval against the corrected 1254-token bundle (or at minimum an audit showing the 998-vocab you're using is byte-exact on FineWeb val, which PR #1314 appears to contradict). Happy to share the local byte-exact verification script I just wrote against simon-marcus' corrected bundle if it's useful.

@MatoTeziTanka
Author

Thank you @dexhunter — you're right, and we've independently verified it.

We ran a full byte-count audit on the Scylla val set:

```
Meta base_bytes sum:        157,319,833 bytes
Actual decoded byte count:  151,073,604 bytes
Difference:                  +6,246,229 bytes (+4.13%)
```

The root cause is exactly what you described: 27 byte-fallback tokens have base_bytes=3 (matching len('\ufffd'.encode('utf-8'))), but their actual contribution to the source text is 1 byte each. Those 27 token types appear ~3.1M times in the val set, accounting for the entire discrepancy.
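The arithmetic checks out: each byte-fallback hit overcounts by 3 - 1 = 2 bytes, so the reported gap implies about 3.1M occurrences, matching the count above.

```python
# Reconcile the audit numbers reported in this comment.
meta_sum = 157_319_833
actual = 151_073_604
overcount = meta_sum - actual       # 6,246,229 bytes
hits = overcount / (3 - 1)          # 2 excess bytes per fallback-token hit
pct = 100 * overcount / actual      # relative inflation of the byte count
print(f"{hits:,.0f} fallback hits, +{pct:.2f}%")
```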

We inherited candidate.meta.npz directly from PR #1143 without auditing the byte-fallback semantics. Our README claimed "verified immune" based on checking that base_bytes[i] = len(token_i.encode('utf-8')) — which is true but asks the wrong question. The encode length of U+FFFD is 3, but the token represents 1 source byte. That's on us.

Corrected BPB: ~1.1266 (does not beat merged SOTA).

Next steps: we'll either rebuild with the corrected 1254-token vocabulary from @simon-marcus's PR #1314, or fix the 998-token metadata and re-eval. Either way, this PR needs a new GPU run before it can stand.

Appreciate the thorough review — this is exactly how the competition should work. Happy to take your detection script if the offer still stands.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
