
Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)#400

Open
chanwoo-park-official wants to merge 4 commits into openai:main from chanwoo-park-official:pr/canon-deltagate-11l

Conversation


@chanwoo-park-official chanwoo-park-official commented Mar 22, 2026

Summary

This run builds on the current leaderboard-aligned stack (official + pending-validated direction) and focuses on a scoped CANON placement with a CANON delta gate.

Best observed result in this sweep:

  • final_int6_sliding_window_exact val_bpb: 1.12961770

Compared to my previous PR #312:

  • 1.16682362 -> 1.12961770 (large improvement)

Quick Comparison (vs #312)

| Run | Setup | Steps before wallclock stop | Final sliding-window val_bpb | Submission size (int6+zstd) |
| --- | --- | --- | --- | --- |
| Previous #312 | ACD (all) + SWA | 7210 (default batch size) | 1.16682362 | 13,267,347 bytes |
| This work (seed 1337) | AC(last5)+delta+tightSWA | 6278 | 1.12961770 | 15,581,348 bytes |
| This work (seed 1336) | AC(last5)+delta+tightSWA | 6243 | 1.1303 | 15,505,544 bytes |
| This work (seed 1335) | AC(last5)+delta+tightSWA | 6252 | 1.12970337 | 15,579,865 bytes |

What Was Reused From Current Leaderboard (not unofficial-only additions)

This run intentionally reuses patterns already common in official/pending leaderboard entries, in order to isolate the contribution of the CANON layers:

  • 11L / 512-dim / GQA (8 heads, 4 KV heads), MLP 3x
  • BigramHash + SmearGate
  • XSA on last 4 layers (XSA_LAST_N=4)
  • Partial RoPE (ROPE_DIMS=16) + LN Scale
  • Late QAT
  • WD 0.04, Tight SWA schedule
  • Sliding-window eval (stride=64)
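As a hedged illustration of the sliding-window eval in the list above (stride=64): each window advances by the stride and only the not-yet-scored tail of each window is scored, so every token is evaluated once with the longest available left context. This is a sketch, not the repo's actual `eval_val_sliding`; the function name and `bytes_per_token` parameter are illustrative.

```python
import math
import torch

def sliding_window_bpb(model, tokens, window=2048, stride=64, bytes_per_token=1.0):
    """Score a 1-D token stream with overlapping windows (illustrative sketch).

    Windows advance by `stride`; the first window is scored in full, later
    windows score only tokens not yet covered, so each token is scored
    exactly once with maximal left context.
    """
    model.eval()
    total_nats, n_scored = 0.0, 0
    next_to_score = 1  # token 0 has no left context, so scoring starts at token 1
    with torch.inference_mode():
        for start in range(0, len(tokens), stride):
            wlen = min(window, len(tokens) - start)
            if wlen < 2:
                break
            x = tokens[start:start + wlen].unsqueeze(0)
            logits = model(x[:, :-1])                      # (1, wlen-1, vocab)
            logp = torch.log_softmax(logits.float(), dim=-1)
            nll = -logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)
            s = max(next_to_score - start - 1, 0)          # first nll index not yet scored
            total_nats += nll[0, s:].sum().item()
            n_scored += nll.shape[1] - s
            next_to_score = start + wlen
            if next_to_score >= len(tokens):
                break
    # nats/token -> bits/token (divide by ln 2) -> bits/byte
    return total_nats / n_scored / math.log(2) / bytes_per_token
```

With a uniform model over a vocabulary of size V, this returns log2(V) bits per byte (at one byte per token), which is a quick sanity check for the scorer.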

Main Configuration (this report)

  • CANON_SET=AC
  • CANON_LAST_N=5
  • CANON_DELTA_GATE=1
  • SWA_ENABLED=1, TIGHT_SWA=1, TIGHT_SWA_EVERY=50, TIGHT_SWA_START_LRMUL=0.2, TIGHT_SWA_MAX_CHECKPOINTS=12
  • TRAIN_BATCH_TOKENS=786432, wallclock-capped run (MAX_WALLCLOCK_SECONDS=600)

Definitions (for this report)

Delta (in AC(last5)+delta) means CANON delta gate (CANON_DELTA_GATE=1).

delta = CanonConv(X)
g = sigmoid(alpha)
Y = X + g * delta      # residual mode
Y = g * delta          # non-residual mode

In this work, CANON_DELTA_GATE_INIT=-4.0 sets the initial gate logit alpha, which makes the block near-identity at initialization (sigmoid(-4.0) ≈ 0.018) and lets the model learn how strongly to use the CANON path during training.
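A minimal PyTorch sketch of the gated delta path defined above. The class name is illustrative; the depthwise causal conv is an assumption based on CANON_KERNEL=3, CANON_BIAS=0, CANON_ACTIVATION=0, and is not the repo's actual CanonConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonDeltaGate(nn.Module):
    """Illustrative gated CANON path: Y = X + sigmoid(alpha) * CanonConv(X).

    With gate_init=-4.0, sigmoid(-4.0) ~= 0.018, so the block starts
    near-identity and training learns how strongly to use the conv path.
    """
    def __init__(self, dim, kernel=3, gate_init=-4.0, residual=True):
        super().__init__()
        # Depthwise conv over the sequence dimension (assumption: the PR
        # only specifies CANON_KERNEL=3 and CANON_BIAS=0).
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, bias=False)
        self.alpha = nn.Parameter(torch.tensor(gate_init))
        self.kernel = kernel
        self.residual = residual

    def forward(self, x):                      # x: (batch, seq, dim)
        h = x.transpose(1, 2)                  # (batch, dim, seq)
        h = F.pad(h, (self.kernel - 1, 0))     # left-pad => causal conv
        delta = self.conv(h).transpose(1, 2)   # back to (batch, seq, dim)
        g = torch.sigmoid(self.alpha)
        return x + g * delta if self.residual else g * delta
```

At initialization the output is close to the input, which matches the near-identity behavior described above.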

  • Last 4 means XSA is enabled only on the last 4 transformer blocks:
    • XSA_LAST_N=4
  • XSA learnable gate means an extra learnable scalar that mixes normal attention output and XSA output:
    • y <- y + sigmoid(g) * (y_xsa - y)
    • controlled by XSA_LEARNABLE_GATE and XSA_GATE_INIT
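The XSA gate update above is algebraically a learnable convex mix, which a one-liner makes explicit (function name is illustrative):

```python
import torch

def mix_xsa(y, y_xsa, gate_logit):
    """Learnable mix of standard attention output y and XSA output y_xsa.

    y + sigmoid(g) * (y_xsa - y) equals
    (1 - sigmoid(g)) * y + sigmoid(g) * y_xsa,
    so XSA_GATE_INIT controls how much XSA contributes at the start.
    """
    s = torch.sigmoid(gate_logit)
    return y + s * (y_xsa - y)
```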

Final Run Command (renamed RUN_ID)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
env \
  RUN_ID=frontier_canon_ac_k3_8gpu_final_report_seed1336 \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 SEED=1336 \
  NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3.0 \
  TRAIN_SEQ_LEN=2048 \
  EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
  ITERATIONS=7000 WARMUP_STEPS=20 WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 \
  MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
  MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
  MUON_WEIGHT_DECAY=0.04 ADAM_WEIGHT_DECAY=0.04 \
  EMA_ENABLED=0 \
  SWA_ENABLED=1 TIGHT_SWA=1 TIGHT_SWA_EVERY=50 TIGHT_SWA_START_LRMUL=0.2 TIGHT_SWA_MAX_CHECKPOINTS=12 \
  XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \
  LATE_QAT=1 QAT_THRESHOLD=0.1 \
  INT6_CATEGORIES=mlp,attn TRAIN_BATCH_TOKENS=786432 GRAD_CLIP_NORM=0.3 \
  CANON_SET=AC CANON_KERNEL=3 CANON_RESIDUAL=1 CANON_ACTIVATION=0 CANON_BIAS=0 \
  CANON_FIRST_N=0 CANON_LAST_N=5 CANON_DELTA_GATE=1 CANON_DELTA_GATE_INIT=-4.0 \
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Results

Seed-level excerpts

  • Seed 1337:
    • step:6278/7000 val_loss:1.9339 val_bpb:1.1454
    • final_int6_sliding_window_exact val_loss:1.90730712 val_bpb:1.12961770
    • Total submission size int6+zstd: 15581348 bytes
  • Seed 1335:
    • step:6252/7000 val_loss:1.9349 val_bpb:1.1460
    • final_int6_sliding_window_exact val_loss:1.90745178 val_bpb:1.12970337
    • Total submission size int6+zstd: 15579865 bytes
  • Seed 1336:
    • step:6243/7000 val_loss:1.9365 val_bpb:1.1469
    • final_int6_sliding_window_exact val_bpb: 1.1303
    • Total submission size int6+zstd: 15505544 bytes

Wallclock / speed notes

  • AC(last5)+delta runs stopped around ~6250-6280 steps due to 600s wallclock cap.
  • No-canon run reached 6930 steps under the same cap (faster, but lower quality).
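The wallclock-capped stopping behavior described above (runs halting mid-schedule at ~6250-6280 of 7000 steps under MAX_WALLCLOCK_SECONDS=600) can be sketched as a simple guard in the training loop; `step_fn` is an illustrative stand-in for one optimizer step:

```python
import time

def train_with_wallclock_cap(step_fn, max_steps=7000, max_wallclock_seconds=600):
    """Run training steps until the step budget or the wallclock cap is hit.

    Slower configurations (e.g. CANON enabled) complete fewer steps under
    the same cap, which is the trade-off reported in the notes above.
    """
    t0 = time.monotonic()
    step = 0
    while step < max_steps:
        if time.monotonic() - t0 >= max_wallclock_seconds:
            break  # wallclock cap reached; stop even mid-schedule
        step_fn(step)
        step += 1
    return step
```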

Ablations (sliding-window val_bpb)

  • Full CANON ACD: 1.14083538
  • CANON AC (broad): 1.13218808
  • CANON AC (first 4 layers): 1.1314
  • No CANON: 1.13587538 (faster under the wallclock cap, but worse bpb)
  • CANON AC(last5)+delta: best observed 1.1296
  • XSA learnable gate (XSA_LEARNABLE_GATE=1): not helpful here (~1.131)

Comparison vs Previous PR

Previous: #312

  • final_int6_sliding_window_exact val_bpb: 1.16682362

Current best in this report:

  • final_int6_sliding_window_exact val_bpb: 1.12961770

Approx improvement:

  • Δ bpb = -0.03720592
  • Δ nats ≈ 0.0258 (using bpb * ln(2) conversion)
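The bpb-to-nats conversion used above is just multiplication by ln(2), since one bit equals ln(2) nats:

```python
import math

def bpb_delta_to_nats(bpb_a, bpb_b):
    """Convert a bits-per-byte difference to nats-per-byte: nats = bits * ln 2."""
    return (bpb_a - bpb_b) * math.log(2)

# Reproduces the figures quoted above:
delta_nats = bpb_delta_to_nats(1.16682362, 1.12961770)  # ~0.0258
```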

Significance Note

Against the official SOTA context (1.1428 BPB), this run clears the >=0.005 nat improvement threshold by a comfortable amount in the point estimate.
For formal p < 0.01 reporting, include the completed 3-seed list (1335/1336/1337) and test output in PR comments.

Humble Notes

  • This is an incremental engineering result built on existing leaderboard-proven ideas plus scoped CANON placement and gating.
  • The strongest gain seems to come from the interaction of AC(last5), CANON delta gate, and tight SWA under the same compute budget.

haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
…nai#400 openai#369 openai#398)

KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free)
GPTQ should be per-ROW not per-matrix (-0.0006 BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-11] — This IMPORT_FAIL was a false positive. Root cause: runner fetched a path marked deleted in the PR diff. Your code is not broken. See correction below: #400 (comment)




Community Review — Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1)

A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1). Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

@chanwoo-park-official (Author) commented:

@MatoTeziTanka Sorry, I checked and it works perfectly well.

@MatoTeziTanka commented:

Correction — Community Review for PR #400

@chanwoo-park-official you're right, my apologies. I re-ran the audit against the actual head file (train_gpt.py at f4952b1, 86,197 bytes) and the earlier IMPORT_FAIL was a false positive on my side — the bulk-smoke runner picked up a path from the records/* deletions in this PR's diff instead of the root train_gpt.py you're editing, and the blob it grabbed wasn't a real Python source file, which is where the byte 0x9e at position 0 decode error came from. That's a bug in my runner, not your code.

Corrected review (audited head SHA f4952b1b0cf9009f05ecf227cef98848b04db34c):

BPB: 1.1296 (sliding-window val_bpb, seed 1337) | Compliance: LOOKS CLEAN

What I checked manually:

  • Parse check: root train_gpt.py compiles fine under Python 3.10 (py_compile.compile passes)
  • N-gram cache: BigramHashEmbedding.bigram_hash at line 1083 uses torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod. This is conditioned only on the previous and current context tokens, with no target-in-key pattern, and is unrelated to the #779 ("Record: BackoffNgramMixer + Drift-Free TTT", 3-seed mean val_bpb=0.6683) family bug.
  • Eval path: eval_val_sliding at line 383 is the standard sliding-window scorer — base_model.eval() + torch.inference_mode(), scores the tail [score_start:wlen] slice of each window, first window gets full scoring. No grad updates during eval, no SLOT optimization, no TTT.
  • Quantization ordering: int6 quantize → load_state_dict → eval_val → eval_val_sliding. Quantization happens before eval, not pre-quant TTT on val tokens. Clean.
  • No val-token fine-tuning: val_tokens only appears in the eval functions and the single load site at line 1530. No AdamW/SGD step with val_tokens as input.
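The hash expression quoted in the n-gram check above can be reproduced standalone; the wrapper name and the `mod` value (hash-table size) are illustrative, while the multipliers and xor come verbatim from the review:

```python
import torch

def bigram_hash(t, mod=65536):
    """Position-wise bigram hash as quoted in the audit above.

    Combines each token with its predecessor via multiply-and-xor, so the
    key depends only on context tokens (no target leakage).  `mod` is an
    illustrative table size, not the repo's actual value.
    """
    return torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

Note the output has one fewer position than the input, since each hash pairs a token with the one before it.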

Verdict: LOOKS CLEAN. BigramHash + CANON(last5) + DeltaGate + tight SWA + int6 mixed quant is a legal stack against the current leaderboard pattern. The 1.1296 → 1.1297 → 1.1303 across 3 seeds is a nice tight spread.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation — already reported, under-16MB artifact cap — 15.58MB per the PR body so right under the wire, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the re-audit.

Once again sorry for the noise — the IMPORT_FAIL false positive on this PR has been retracted and I'm adding a guard to the bulk-smoke runner to prevent it on other PRs that delete/rename record folders.


Reviewed by @MatoTeziTanka (The Agora). Manual re-audit after author pushback. Original false-positive IMPORT_FAIL retracted.

@MatoTeziTanka commented:

Also — thanks for engaging with this, seriously. The whole reason I started running these community reviews in the first place is that 900-ish PRs have been sitting open with zero maintainer response, and a lot of good work (yours included) gets buried under the pile. One-sided drive-bys from me aren't worth much; the back-and-forth is what actually catches mistakes like the runner bug that fired on your PR. I'd much rather take a public L on a false positive and get corrected in five minutes than have nobody look at any of these at all.

If more authors push back when I'm wrong I'll end up with a much better classifier, and the leaderboard gets a more trustworthy signal on what's legal vs. what isn't. So — appreciated. Hope the mods chime in on the 1.1296 record soon, because on re-read this looks like a real one.

@chanwoo-park-official (Author) commented:

@MatoTeziTanka No problem at all! Everything you are doing is good for the community, so thanks for your dedication!

@MatoTeziTanka commented:

Retraction — this IMPORT_FAIL was a deleted-file artifact in my smoke runner

Sorry @chanwoo-park-official, this one's on me. I re-audited the SyntaxError (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0 I reported above and it was a false positive — the fault is in my smoke runner, not in your code.

What happened:

Your PR deletes 5 old records/*/train_gpt.py path(s) while editing a different file, and my bulk smoke runner iterated the diff's file list and fetched one of the paths that's already marked for deletion. The raw GitHub content endpoint returned either a binary stub or a non-UTF8 response, and my runner tried to import it as Python source, producing the byte 0x9e at position 0 error. That error was about the deleted/non-existent file, not the train_gpt.py you're actually submitting.

Verified at head f4952b1:

The real train_gpt.py you're editing parses cleanly under Python 3.10:

py_compile.compile('train_gpt.py') → PARSES OK
86197 bytes

Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit and post findings separately.

Again — sorry for the noise. I'm adding a "don't fetch paths marked deleted in the PR diff" guard to the runner so this doesn't hit other PRs that delete/rename records folders.

@MatoTeziTanka commented:

Small correction to my own re-audit above — I was sloppy with the "looks like a real one" framing and the MERGE recommendation.

The current leaderboard floor for the 10min/16MB track is 1.0810 (bigbag, PR #1493, SP8192 + 3-layer recurrence + parallel residuals + legal TTT), and there are 16 entries between 1.0810 and your 1.1296. Your result would slot in between the 1.1271 and 1.1307 entries from 2026-03-20, not near the top.

So: the compliance verdict stands (code is clean, legal pure-neural stack, no flags) but "MERGE as record" is the wrong recommendation — it's a legal but non-winning result, not a new record. The title of your PR already says "Humble Record Attempt" so you clearly knew this; the error was mine in the correction comment. Sorry for muddying it.

Still a solid clean attempt on a known-legal stack; just wanted to correct the record-vs-not-record framing before it sits here unchallenged.

@chanwoo-park-official (Author) commented:

@MatoTeziTanka Yes, you are correct. This was top 2 when I posted it, but it is already April 11, so many new attempts have landed since.
